mozilla :: #moc

10 Aug 2017
01:00 nagios-scl3: New Sysadmin OnDuty is ryanc
01:16 ryanc: Hi
02:36 phrozyn: ryanc good evening!
02:36 ryanc: Hello
02:37 phrozyn: ryanc I will be performing a rolling restart of the mozdefes cluster as I upgrade elasticsearch tonight - I'll downtime them as I work on each one
02:37 ryanc: Perfect
02:38 phrozyn: which channel should that be performed in?
02:38 phrozyn: sysadmins-test?
02:38 ryanc: What channel the bot is in that has a password
02:38 ryanc: Whatever*
02:39 phrozyn: think it's that one
02:39 ryanc: One of them
07:50 nli: ryanc: bug 1386387, I'm going to reboot it, if you see alerts please ignore. Thank you.
07:50 firebot: https://bugzil.la/1386387 NEW, nli@mozilla.com replace admin2a.private.tpe1.mozilla.com
07:50 ryanc: Great
07:50 ryanc: How long is it going to be down nli ?
07:51 nli: depends how long it's going to take to boot up. :P take it 20 mins. :)
07:51 nli: ^ ryanc
07:51 ryanc: Alrighty : ))
08:10 nli: ryanc: done. :)
08:10 ryanc: Excellent
08:11 ryanc: Thanks
08:12 nli: Thank you. :)
09:00 nagios-scl3: New Sysadmin OnDuty is pir
13:23 Caspy7: I'm being attacked
13:23 Caspy7: ...stopped for no
13:23 Caspy7: *now
13:24 Caspy7: hrm, not sure how best to handle this in general, got gobs of PMs and notifications
13:26 Caspy7: all the nicks that attacked me seem to be in #maroc
13:26 Caspy7: pir: ^
13:26 Caspy7: #firefox shut its doors temporarily because there was an influx of entries suddenly
13:26 pir: Caspy7: not much I can do on that amount of information
13:27 Caspy7: and user aaa was in #firefox and just said in the channel /join #maroc
13:27 pir: Caspy7: can you put names and information in a bug in infrastructure & operations > Infrastructure: IRC, please
13:32 Caspy7: pir: are you in #firefox now?
13:32 pir: Caspy7: no, I don't use that channel. I'm in the middle of fixing some service problems
13:33 Caspy7: well, I'm talking with someone...who may be taking their bots out for a test drive
13:33 Caspy7: something about "in 1 minute"
13:33 Caspy7: they are also in #maroc
13:34 Caspy7: I suspect all the users in that channel are
13:41 Caspy7: well, they're all gone now
13:58 tanner: hi, is there anyone here who would be able to check if there's anything from ovh stuck in communityhosting@mozilla.com's spam filter? it's a google group
14:00 pir: tanner: for google apps you need to talk to the group who handle that, I'm afraid, we don't have the access
14:01 pir: tanner: the service desk may be able to help directly or jen in application services
14:02 tanner: i'll head that way, thanks!
16:09 safwan: pir: Pontoon was down about 3 hours ago?
16:11 pir: safwan: I have no knowledge of "Pontoon"?
16:11 safwan: pir: i meant https://pontoon.mozilla.org/
16:12 pir: safwan: sorry, first I'm hearing about it
16:13 safwan: Oh!
16:13 safwan: pir: MOC does not monitor it?
16:14 pir: safwan: nothing in pingdom for it, nothing in nagios for it so I'd say no
16:14 pir: and given that I've never heard of it
16:14 safwan: Oh!
16:14 safwan: pir: It's a mozilla site for localization
16:14 safwan: should it not be monitored by MOC?
16:14 pir: we can only monitor things people tell us about
16:15 safwan: Weird
16:15 pir: since I know nothing about it I can't say if it should be monitored or not
16:15 pir: if it's on a critical path for release or such, then it should be
16:15 safwan: I thought MOC is responsible for all mozilla site reliability! anything that starts with *.mozilla.org
16:16 pir: Not everything in a mozilla domain is considered critical and needs monitoring, for one thing
16:16 safwan: pir: To whom should I escalate to get this site monitored?
16:16 unixfairy: safwan who manages the site
16:16 pir: not everything is monitored by us, for another. Some teams prefer to do their own monitoring.
16:16 unixfairy: or developed it
16:17 pir: the person who is responsible for the site
16:17 safwan: mathjazz maybe
16:17 pir: the team or people who develop or maintain a site are the ones who would need to request MOC support
16:17 safwan: Ok
16:18 safwan: any bugzilla bug or template?
16:20 pir: they should refer to https://mana.mozilla.org/wiki/display/MOC/How+to+request+MOC+support+for+a+new+production+system+or+service
16:22 safwan: Ok
16:22 safwan: BTW, I think no other team has people awake 24/7
16:25 gcox: I am shocked, SHOCKED, that shadow IT isn't fully staffed for all contingencies.
16:27 pir: the MOC is the only team that has people working 24/7, yes
17:00 nagios-scl3: New Sysadmin OnDuty is fauweh
19:54 khanh: Hello MOC: AVOPS just received a status update from Teem, our EventBoard service provider, that there may be a calendar synchronizing issue which could potentially cause a delay on calendar sync updates from our EventBoard (iPad).
19:57 khanh: No issue is observed or reported at the moment, but can we please send out an internal comm just so everyone can be alerted of a potential delay on the eventboard updates while Teem is investigating the issue from their end? Thanks!
19:59 fauweh: khanh: we can do that
20:00 khanh: fauweh: Much appreciated
20:04 fauweh: khanh: do we have a bug yet?
20:04 khanh: fauweh: not yet
20:04 fauweh: do you want me to open one?
20:04 fauweh: (is helpful to include in comms)
20:05 khanh: please do, thanks fauweh. I can provide a screenshot of the incident report as needed.
20:06 fauweh: yeah, if you could attach once I open that'd be great
20:06 khanh: I sure can
20:08 fauweh: khanh: https://bugzilla.mozilla.org/show_bug.cgi?id=1389238
20:08 firebot: Bug 1389238 NEW, kferrando@mozilla.com Eventboard calendar sync issues reported by vendor
20:08 fauweh: (bug is unrestricted, can be locked down if needed)
20:10 khanh: It's fine, thanks again fauweh
20:14 fauweh: khanh: for sure, comms sent
21:14 fauweh: khanh: any update on the eventboard issue or any reports from end-users?
21:14 khanh: fauweh: no updates yet
21:15 khanh: and no end-user reports of issues
21:17 fauweh: ok thanks for the info
21:25 arr: ryanc: hey! Saw your comment about the nagios configs, and wanted to make sure we were on the same page there. I deleted a bunch of stuff we don't need and I added one missing host, but the checks you merged still need to be split back out.
21:25 ryanc: OK
21:25 ryanc: I'll add those back
21:26 arr: I think the moc is going to be pretty peeved if they start getting paged for releng hosts they can't do anything about :D
21:26 arr: I didn't have enough time to go through all the checks and compare to the old ones, though
21:27 ryanc: There's quite a bit
21:27 arr: ryanc: the other thing I noticed was missing parents (switches) and PDUs. Those and the nagios host itself are the things that IT will need to be notified for directly
21:28 arr: anything that's strictly releng shouldn't be alerting IT
21:28 ryanc: Then I'm going to have to revert everything because I don't know everything that you guys need and don't
21:28 arr: the only thing in there right now that should go to IT is the nagios host
21:29 arr: but we're missing all the switches, the firewall, the PDUs, etc
21:29 arr: those all need to be added and alerting IT
21:29 arr: and the machines need to have parents added
21:29 ryanc: We have nothing to monitor for network infra in mdc1
21:30 ryanc: And if we do, we're not aware of it
21:30 arr: uh... how do we know when it goes down?
21:30 ryanc: No hostnames
21:30 arr: that's... problematic
21:30 ryanc: I know
21:30 arr: okay, so when that information is forthcoming, it needs to be added
21:30 arr: van can tell you what the PDU info is
22:18 rmfd: We're discussing hostnames right now
22:18 rmfd: Will be back with you shortly
22:22 unixfairy: arr ryanc^^
22:23 ryanc: rmfd: Great
11 Aug 2017
No messages