User Details
- User Since: Jan 18 2024, 5:33 PM (70 w, 2 d)
- Availability: Available
- LDAP User: Scott French
- MediaWiki User: SFrench-WMF
Thu, May 22
Alright, I think https://gerrit.wikimedia.org/r/1149505 should do what's needed, in the three "sufficiently broad" scrape configs where this issue is likely to be relevant.
Since a bit before 17:30 UTC today, we are no longer building PHP 7.4 ("publish" flavour) MediaWiki images during scap deployments. No further work is tracked here, so I'm marking this resolved.
Just came across https://github.com/prometheus/prometheus/issues/10755, which indeed recommends using __meta_kubernetes_pod_phase to skip terminated pods returned by service discovery.
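For reference, the generic shape of that approach looks roughly like the following (the job name and discovery role here are placeholders, not the actual contents of the scrape configs touched in the change above):
```
scrape_configs:
  - job_name: k8s-pods              # placeholder job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Drop targets whose pod has already terminated (Succeeded or Failed),
      # so completed pods returned by service discovery no longer produce
      # failing scrapes.
      - source_labels: [__meta_kubernetes_pod_phase]
        regex: Succeeded|Failed
        action: drop
```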
Wed, May 21
Thank you very much, @Vgutierrez - this is great, and thank you for your offer to assist with validation.
After spending some time reading through T292552 and the code in uppercaseTitlesForUnicodeTransition.php [0], I think I better understand the process here.
The last step here is to merge https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/172 and run scap sync-world to integrate it. I'll aim to do that either today or (more likely) tomorrow (May 22nd).
Remaining update-special-pages jobs have been migrated, following a successful manual test of the s6 shard earlier today. First scheduled run for all will happen at 5:00 UTC on the 22nd (i.e., soon).
Mon, May 19
Sun, May 18
Fri, May 16
Going by the git history on UcfirstOverrides.php, T292552: Rename articles and users to update our case mapping to PHP 7.4 and Unicode 11 appears to contain the most recent prior art on this process.
Given that we're well on our way to completing the periodic jobs migration, have not run into showstopper 8.1-compatibility issues, and have removed the 7.4 fallback functionality from mwscript-k8s, I think it makes sense to move forward with this.
Wed, May 14
The first post-migration hourly runs of centralauth-backfilllocalaccounts.php-loginwiki and centralauth-backfilllocalaccounts.php-metawiki have completed successfully. The logs contain roughly the same content as those from the last pre-migration timer runs (i.e., reporting autoCreateUser failures due to IP blocks).
The first post-migration run of purge-expired-userrights succeeded earlier today. I also triggered a manual run of purge-expired-global-rights, which also succeeded.
The first run at 00:08 today (May 14th) completed without issue, with a total elapsed time very similar to the bare-metal case (a bit more than 6m). I've also spot-checked Special:ValidationStatistics on a handful of wikis, and the new stats values look reasonable relative to the prior ones (generated by the update-special-pages runs earlier today).
Tue, May 13
Updates:
- The first run of purge-temporary-accounts appears to have completed successfully earlier today.
- The purge-expired-userrights and purge-expired-global-rights jobs have now been migrated as well.
- Their next scheduled executions are on the 14th and 17th, respectively.
- I'll validate the former tomorrow once it has run. For the latter, we may want to trigger a manual run so as to avoid having the first run fall on a weekend.
- Next up: the hourly loginwiki and metawiki account backfill jobs in profile::mediawiki::maintenance::backfill_localaccounts.
The update-flaggedrev-stats job has been migrated. I'll hold onto this until I'm able to confirm the first run is successful later today.
The remaining shards of the (renamed) job have been migrated. Given what we saw with the pilot on s6, I'm optimistic that these will all work without issue. However, I'll still plan to keep an eye on things as we enter June (and will set some calendar reminders for this).
Thank you both! In short, confd uses go.etcd.io/etcd just like Liberica, and thus will pick up the WMF root PKI CA cert from /etc/ssl/certs without intervention - and, importantly for all these cases, we're going to be presenting the certificate bundle that contains the PKI intermediate. Apologies for not writing this all down already and saving you all the digging.
Mon, May 12
From a quick read of [0] and (more importantly) [1], this looks relatively low-risk to migrate.
Given the progress made so far in the periodic maintenance jobs migration to k8s (and thus 8.1), it probably does not make sense to support this on 7.4, unless of course there is some urgent need of which I'm not aware (my read of T373752#10756865 is that there is not, but I might be misunderstanding).
After the s6 shard was migrated to k8s, I started a manual pilot run of the job using this procedure. Thanks to @Krinkle for the discussion on safety concerns around this.
purge_temporary_accounts is now migrated, which I'll verify tomorrow after the first scheduled k8s-based run (14:27).
Fri, May 9
Finally had a chance to polish my MR a bit and post it today (draft). I'll take one more look on Monday before sending it for review for real-real, but folks are welcome to take a look before that if so inclined.
Alright, after a quick review, the main thing that's changed since November is the migration of all (high-traffic) LVS hosts outside core DCs to Liberica. This has a couple of implications:
- We no longer need to deal with pybal in those sites. This also means that conf1009 is no longer an unusually risky host to update (other than being the etcd-mirror source host).
- We need to figure out how best to validate that the Liberica control-plane continues to operate as expected after the update.
- We need to be aware that Liberica, like confd, considers all etcd nodes in the associated core DC (profile::liberica::etcd_config > conftool_domain), rather than a single node - see the sketch after this list.
- As a corollary, if we think it's still valuable for only one of ulsfo and eqsin to be exposed to the initial update, the only way to achieve that would be to temporarily shift one to eqiad.
- IMO, precisely because Liberica considers all nodes, I don't think that's worth it anymore (i.e., errors communicating with one node no longer mean "cannot talk to etcd at all", as they did with pybal).
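For anyone skimming later, a rough sketch of the hiera shape referenced above (the domain value is purely illustrative, not the production setting):
```
# Illustrative only - the real values live in puppet hieradata.
profile::liberica::etcd_config:
  conftool_domain: eqiad.wmnet   # Liberica uses all etcd nodes discovered under this
                                 # domain, rather than being pinned to a single node.
```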
Now that the PHP 8.1 migration is winding down, this is near the top of my list of items to pick back up. If you have the cycles for it, I'd definitely be interested in having a second pair of eyes / hands on this.
Thank you both for the follow-up! If acks were the solution before and they're now back in place, then by all means let's resolve this :)
Thu, May 8
Great, thanks for confirming, @Tgr - I'll get started migrating these first thing next week, and I'll keep that in mind as an option.
I've extended the downtime to 7 days (from now), as it's unlikely this host will be returned to service before the original downtime would have expired tomorrow.
@fgiunchedi - I'm having a hard time sorting out from T373369 and / or T326657 what the outcome of these probe failures was. Was there a long-term silence that might have recently expired?
I've re-added a 1w downtime, as the earlier one was removed as a side-effect of the reimage. If we expect the host to be powered on for ongoing work, and also expect that work to extend past 1w from now, it may be a better option to set profile::monitoring::notifications_enabled: false in hiera.
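For reference, that option would amount to something like the following in the host's hieradata (the file path here is illustrative; adjust to wherever this host's overrides live):
```
# hieradata/hosts/<hostname>.yaml (path illustrative)
profile::monitoring::notifications_enabled: false
```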
I've posted a handful of patches to migrate the periodic jobs tracked here.
Wed, May 7
I'll be driving the migration of these jobs.
This was completed by ~ 17:30 UTC today. No issues encountered on either host during the update, and PHP appears to work via extremely basic tests:
FYI, the downtime I've applied is only 2 days, on the assumption that the host is fine (i.e., it just needs a clean bill of health before being returned to service). This may need to be extended if that's not the case.
Tue, May 6
The changes to notification settings are now live.
FYI, I've silenced notifications from this host for the next week, to avoid repeated pages while work is ongoing. These will need to be cleared if the host is returned to service earlier than that.
Mon, May 5
Both jobs have now had a successful first run:
After a bit of thought and some back-testing over the last 2 months of data, https://gerrit.wikimedia.org/r/1141959 sketches out what T390630#10787491 could look like.
Per discussion in #talk-to-moderator-tools, the desired phabricator tag for notifications is Moderator-Tools-Team. The pending patches will update those settings accordingly.
The LoginNotify and PageAssessments jobs have both been migrated. I'll follow up later today to confirm their first scheduled runs succeed (23:00 and 20:42 UTC respectively) before closing this out.
Fri, May 2
Following up on how this alerting might evolve, there was some discussion in T392989 about how to make the alert insensitive to transient excursions during large compactions.
Possible alternative: rather than using an aggregation over hosts, we could instead use the minimum utilization over a rolling window, calibrated to the typical duration of a large compaction. IMO, that's an easier-to-reason-about mechanism for "ignoring" the compactions than aggregating over hosts (not least because we frequently see correlated behavior across hosts). As @Eevans notes, we would likely want to combine that with dropping the alert threshold.
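As a rough sketch of that alternative (the metric name, window, and threshold below are placeholders, and the window would need calibrating against the typical duration of a large compaction):
```
groups:
  - name: placeholder
    rules:
      # Placeholders throughout. min_over_time only exceeds the threshold if
      # utilization stays high for the entire window, so short-lived spikes
      # from large compactions are ignored.
      - alert: DiskUtilizationHigh
        expr: min_over_time(instance_disk_utilization_percent[4h]) > 80
        for: 30m
```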
I chatted with @MusikAnimal from Community-Tech earlier today, who confirmed there are no concerns with migrating the PageAssessments and LoginNotify jobs. However, it sounds like the already-migrated PageTriage jobs are actually owned by Moderator-Tools-Team. I'll fork those off to a separate task and follow up to update the alert routing.
Following up, I did get a chance to sketch this out, and indeed (1) it's not all that complicated in practice but (2) it does run head-first into the naming consistency question I mentioned.
Thu, May 1
For the record, https://gerrit.wikimedia.org/r/1139923 was reverted due to an issue with the rendered yaml, rather than an issue with the job itself. The yaml rendering issue should be fixed by https://gerrit.wikimedia.org/r/1140548.
Tue, Apr 29
Thanks, @Reedy - It's really not all that much effort to make this happen if it would help unblock you all.
@brouberol - This would require changes to scap, specifically the ability to override the set of environments relevant to a particular deployment (rather than using the "defaults" provided by the k8s_clusters config key).
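To make that a bit more concrete, a hypothetical sketch of the missing override (the per-deployment key and the cluster names below are invented for illustration; only the k8s_clusters defaults key exists today):
```
# Hypothetical - the per-deployment override is exactly the scap change
# that would be needed; it does not exist yet.
k8s_clusters: [cluster-a, cluster-b]       # current global defaults
k8s_clusters_override:
  some-deployment: [cluster-a]             # restrict this deployment to a subset
```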
Apr 19 2025
My apologies, @Reedy - I once again lost track of this one.
Since the logging issue appears to be resolved (e.g., the ongoing RT testing run is producing logs in logstash as expected), and nothing else of note appears to have arisen during yesterday's run (the one that surfaced the logging issue), I'm going to optimistically mark this as resolved.
Remaining patches to draft:
- Companion patch to https://gerrit.wikimedia.org/r/1137500 that updates mediawiki-deployments.yaml in train-dev.
- Final patch that removes the 'publish' flavour definition from build-images.py.
Apr 18 2025
Connecting some dots here that I forgot to add yesterday:
We decided to update parsoidtest1001 to PHP 8.1 "in place" for T380485: Transition parsoidtest1001 to PHP 8.1, so this work no longer transitively blocks 8.1 migration. This will give us more time to develop a solution for this use case on k8s that maximizes overlap / reuse with T276994: Provide an mwdebug functionality on kubernetes (mw-experimental).
Alright, the first run appears to have succeeded: