mozilla :: #moc

14 Mar 2017
01:53jlazjustdave++
03:56atollircops oncall
03:56atolloh. oh my.
03:57atolli retract this ping.
03:57atolli have just suffered a very lousy UX experience, that presented an irc attack from a different network as one on ours.
03:57justdaveheh
03:57atollchannel name overlap. my apologies to all i summoned by accident.
04:04jlaznp
08:34jgrahamPlanned tree closure is about to start, tracked in bug 1347074
08:34firebothttps://bugzil.la/1347074 NEW, nobody@mozilla.org Tree closure for database migration
08:41jlazthanks for the heads up!
09:00nagios-scl3New Sysadmin OnDuty is pir
09:48ewongany ircops around?
09:49ewongjust wanna ask a policy question
09:52pirmight be able to answer
09:52ewongpir at what point can I become an op when the founder is mia?
09:53pirewong: for a specific channel, I assume?
09:53ewongpir yes.
09:53pirusually decided on a case by case basis
09:53pirnot aware of a set policy
09:53ewongof course, when the founder returns, I can de-op myself
09:55ewongpir can I take this to pm?
09:55pirI don't have anything to add. I can't do it anyway... I'd suggest a bug
09:55ewongpir ah ok. thanks .. what component?
09:57pirinfrastructure & operations > infrastructure: irc
09:57ewongah thanks
09:58ewongcompletely missed that entry when I went through that I&O list
10:02jgrahamTrees are about to reopen
10:06ewongjgraham: thanks!
15:35rcarrollGood morning MOC. We are confirming an outage with Airmozilla Video on Demand services. We are reaching out to the service provider now.
15:35pirrcarroll: thanks. When contacting us can you mention the current oncall person? Harder to miss in scroll when highlighted
16:01pythOncPaulbhello MOC, so I have developed a check to parse through the DBs and determine which database/table is nearing a max integer threshold. How do I get it to you and get them installed on the db servers?
16:01pirpythOncPaulb: open a bug with details, please
16:01pythOncPaulbIt is a single nagios script, located currently on itutils1.
16:01pythOncPaulbI think there is already a bug opened
16:01pirpythOncPaulb: or if there is already a bug, put it there
16:02pythOncPaulbit has a hardcoded password. would rather not
16:02pirwell, there's your first problem: nagios checks should never have hardcoded passwords
16:03aselagea|builddutydigi: hi! I think your recent changes are causing nagios1 and admin1a servers to fail running puppet
16:03pirpassword needs to be provided as, at a minimum, an option so they can be pulled in from puppet secure storage
16:03pythOncPaulbunfortunately, i don't have access to nagios itself to do any of the testing. It does have the ability to pass a password on the command line as well; let me double check it and validate that it will work.
16:03pythOncPaulbsince i don't have access to do that, I had to create a temporary line for the HC pass.
16:03pirpythOncPaulb: then remove any hardcoded passwords and put it in the bug
16:04digiaselagea|buildduty: I fixed them and they're recovering - next run should be successful
16:04digiaselagea|buildduty: if you would like to provide the hostnames of the machines I can run puppet now
16:05digiand hello!
16:05digiI've been watching #sysadmins but didnt see any hosts alert, yet
16:07aselagea|builddutyhttps://irccloud.mozilla.com/pastebin/qlnGOhFH/
16:08aselagea|builddutydigi: these are the hosts for which I've noticed alerts in #buildduty so far
16:08digithank you
16:23pythOncPaulbpir: the script is in https://bugzilla.mozilla.org/show_bug.cgi?id=1276804 --- it accepts user as the first argument (so you can run it as nagios) and password as the second argument
16:23firebotBug 1276804 is not accessible
16:25pirpythOncPaulb: that's not actually a nagios check script, it doesn't take the right options or return nagios exit codes, etc: https://nagios-plugins.org/doc/guidelines.html
16:27pirpythOncPaulb: any thresholds need to be available as options
16:27pythOncPaulb1) too many thresholds for options
16:27pirah, it is returning the exit codes though, my mistake
16:28pythOncPaulb... glad you caught that :p
16:28pirwarning and critical can be options, no?
16:28pythOncPaulb12 different thresholds
16:29pythOncPaulbthe check validates against all integer types
16:29pythOncPaulbso that would be 24
16:29pythOncPaulblet me see what i can noodle
16:30pythOncPaulbok hang on
16:30pirpythOncPaulb: you say "perl max_integer_check.pl <user> <password>" but then there's GetOptions?
16:30pythOncPaulbi can make the warning multiplier and critical multiplier on the commandline. (thought you were referring to the values for the max integers)
16:30pirpythOncPaulb: can we not have dbuser and dbpass as options rather than just $ARGV[0], etc?
16:31pythOncPaulbsorry, my perl is not the strongest; let me check.
16:32pirGetOptions should let you do it in a more versatile form and use the other things listed
16:34gcox*skim* It allows you to specify parameters like dbuser and dbpass in --configfile. Hacking it onto the command-line was weird.
16:34* pir nods
16:43Tony_6unixfairy: I'm on the phone with vidly at the moment. I'll provide an update as soon as possible.
16:52unixfairythank you Tony_6
17:00nagios-scl3New Sysadmin OnDuty is ashlee
17:10pythOncPaulbCan use --configfile, but again I don't have access to the vault, etc. What I can do is provide a config file and remove the password. Otherwise the script works well, i just don't have access to everything or the methods of access to be able to configure it 100%
17:11pirpythOncPaulb: we cannot have passwords hardcoded or in config files. Opssec get cranky. Needs to be options.
17:12pirif you can make all the options work and get rid of the use of ARGV (should never be used in scripts like that) then it should be fine
17:17Tony_6unixfairy: airmo is back up
17:18pirashlee: ^^
17:18ashlee:D
17:20unixfairyTony_6: thank you
17:20unixfairywhat was root cause Tony_6
17:21Tony_6unixfairy - I'll email you
17:22unixfairyok or you can just update the bug
17:22unixfairyup to you
17:22unixfairybut I need something before I can send out the resolution comm :)
17:49unixfairyTony_6: any updates so I can send out resolution notice please
17:54gozerryanc: https://github.com/mozilla-it/prometheus-aggregator-moc/pull/6/files
18:13pythOncPaulbgetOptions works (--configfile <file name>) Or --dbpass=XXXXXXXX, etc. the config file is a key = value type setup so if we can get that value for dbpass from a secret, then it should work.
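
[Editor's sketch, not part of the channel log] The requirements pir lists above (credentials and thresholds passed as options rather than positional arguments, and Nagios exit codes) could look roughly like the following. This is a minimal Python illustration, not the Perl script attached to the bug. The option names `--warning` and `--critical`, the default multipliers, the MySQL-style integer limits, and the `columns` input format are all assumptions made for the example; the real check queries the database, which is left out here.

```python
import argparse

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumed limits: MySQL-style integer types as (signed max, unsigned max).
MAX_INT = {
    "tinyint": (127, 255),
    "smallint": (32767, 65535),
    "mediumint": (8388607, 16777215),
    "int": (2147483647, 4294967295),
    "bigint": (9223372036854775807, 18446744073709551615),
}


def parse_args(argv):
    """Take credentials and threshold multipliers as named options."""
    parser = argparse.ArgumentParser(description="max integer check (sketch)")
    parser.add_argument("--dbuser", required=True)
    parser.add_argument("--dbpass", required=True)
    # Multipliers are fractions of the type's maximum value.
    parser.add_argument("-w", "--warning", type=float, default=0.8)
    parser.add_argument("-c", "--critical", type=float, default=0.9)
    return parser.parse_args(argv)


def classify(current, column_type, unsigned, warn_mult, crit_mult):
    """Return a Nagios status for one column's current maximum value."""
    limit = MAX_INT[column_type][1 if unsigned else 0]
    if current >= limit * crit_mult:
        return CRITICAL
    if current >= limit * warn_mult:
        return WARNING
    return OK


def worst_status(columns, warn_mult, crit_mult):
    """columns: iterable of (current_value, column_type, unsigned) tuples.

    In a real check these would come from the database; here they are
    supplied by the caller (hypothetical input format).
    """
    statuses = [classify(v, t, u, warn_mult, crit_mult) for v, t, u in columns]
    return max(statuses, default=OK)
```

A wrapper would call `parse_args(sys.argv[1:])`, gather the column data with the supplied credentials, and pass the result of `worst_status` to `sys.exit` so Nagios sees the status as the exit code. Two multipliers cover every integer type, which avoids needing a separate pair of thresholds per type.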
18:14jawshi, is there a bug on file for all of the 500 Internal Server Errors when pushing to mozreview?
18:15mcoteyup
18:15mcoteit's been tricky to figure out
18:15mcoteone sec
18:15jawsanswered in #developers, bug 1338530
18:15firebothttps://bugzil.la/1338530 NEW, glob@mozilla.com Push is failing on "Error 500: Internal Server Error", however the review request appears to have wo
18:15mcoteyup
18:44guigsHeads up, I am updating Flash on plugin.mozilla.org https://bugzilla.mozilla.org/show_bug.cgi?id=1347239
18:44firebotBug 1347239 NEW, nobody@mozilla.org Plugincheck Database - Review and correct Adobe Flash Player 24.0.0.221 to 25.0.0.127
18:57joeykdid we happen to change hostnames for nagios? Im trying to run our ansible script and I keep getting this error:
18:57joeyk"stderr": "ssh: connect to host nagios2.private.scl3.mozilla.com port 22: Operation timed out",
19:00ashleejoeyk: i'll pm you
19:00joeykthanks!
20:25nthomasericz: ping
20:30ericznthomas: pong
20:30nthomasericz: hi, do you have a little time to discuss apache losing the plot on the releng web cluster ?
20:31nthomasseems to happen every few weeks, and is doing it right now
20:31ericznthomas: Sure
20:32nthomascool, thanks. is here good or should we head off somewhere else ?
20:35nthomasericz: heres the symptoms - make a request like https://secure.pub.build.mozilla.org/buildapi/running with network part of the Firefox web console open. Any parts of the page which are served by web1.releng.webapp.scl3 get a 503. web[23] are fine. I know that restarting apache will resolve this but Im curious if we can fix the root cause, and figure
20:35nthomasout why Zeus isnt pulling that node
20:36* ericz trying
20:38nthomas[Tue Mar 14 13:38:10 2017] [error] [client 54.221.177.194] (11)Resource temporarily unavailable: mod_wsgi (pid=14497): Unable to connect to WSGI daemon process 'buildapi' on '/var/run/wsgi.5115.2.1.sock' after multiple attempts as listener backlog limit was exceeded.
20:39nthomasthis may be normal'
20:40ericznthomas: So the wsgi proc is overloaded on 1 of those systems then? And we don't know why, and we don't know why an apache restart would fix it?
20:40ericzI had 3 requests to web1 when I hit that link, 1 of which succeeded and 2 x 503s.
20:41nthomasinteresting that it doesnt always fail
20:41ericzYes, seems more overloaded than completely broken
20:42nthomasfrom looking in the error logs, it started recently - https://pastebin.mozilla.org/8982041
20:44nthomasnot happening at all on web2, which smells more like something broken on web1, or maybe unfair sharing out of requests by Zeus
20:45ericzZeus should round-robin them and indeed it seems pretty well spread around in my request.
20:45ericzSo I agree web1 smells here.
20:45ericzIf we could strace the wsgi proc when it times out to see what the heck it's doing, that'd help.
20:46nthomashmm
20:47ericzGoogling reveals "restart apache" heh
20:48nthomashalf a dozen buildapi processes, yay
20:50ericzAs expected
20:50nthomasthey all strace to 'restart_syscall(<... resuming interrupted call >'
20:50ericzYeah
20:50ericzAnd if you add -f, they're doing stuff but not much
20:51ericzLooks pretty much the same on web1 and web2 though
20:51ericzThe wsgi config is WSGIDaemonProcess buildapi processes=6 threads=1 maximum-requests=5000
20:51ericzSo 6 procs is right
20:52nthomasand should be recycling itself with that max requests, right ?
20:52ericzSo I guess the "listener backlog limit was exceeded" might be referring to the maximum-requests value of 5000
20:52ericzOh is that a lifetime thing? Maybe
20:52* nthomas goes to confirm that
20:53ericzYes you're right nthomas
20:53ericzhttp://modwsgi.readthedocs.io/en/develop/configuration-directives/WSGIDaemonProcess.html
20:54nthomasyeah
20:54ericzlisten-backlog appears to control what I was thinking, default of 100
20:57nthomasalmost all the buildapi procs on web2 are much fresher than web1
20:57nthomasbut not on web3, probably a red herring
21:02ericzOk
21:03* nthomas is out of ideas
21:10ericzmoving to pm
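
[Editor's sketch, not part of the channel log] One way to see when the "listener backlog limit was exceeded" errors started on a node, as nthomas did by reading the error logs, is to count them per hour. The Python below is a rough sketch that assumes the Apache error log line format quoted above; the function name `backlog_errors_per_hour` is made up for this example.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the leading "[Tue Mar 14 13:38:10 2017] [error]" part of a line.
LINE_RE = re.compile(r"^\[([^\]]+)\] \[error\]")
MARKER = "listener backlog limit was exceeded"


def backlog_errors_per_hour(lines):
    """Count mod_wsgi backlog errors, keyed by 'YYYY-MM-DD HH:00'."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        match = LINE_RE.match(line)
        if not match:
            continue
        try:
            stamp = datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")
        except ValueError:
            # Timestamp in an unexpected format; skip rather than guess.
            continue
        counts[stamp.strftime("%Y-%m-%d %H:00")] += 1
    return counts
```

Feeding it the lines of the error log from each of web1, web2, and web3 and comparing the resulting counts would show whether the errors are confined to one node and roughly when they began.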
21:42SeburoHi. Hope you can help. I have just found that I cannot post to #firefox, a bit awkward given that I am a contributor with SUMO who tries to help users. Any clues as to what the problem might be?
21:46ashleeSeburo: one sec
21:46SeburoNo need to worry, I think we have just figured out the issue.
21:46SeburoA user with a full caps name, any response to which will trigger a block of some kind.
21:52ashleeSeburo: rgr
15 Mar 2017
No messages
   