User Details
- User Since
- Jan 18 2024, 5:33 PM (75 w, 6 d)
- Availability
- Available
- LDAP User
- Scott French
- MediaWiki User
- SFrench-WMF
Today
FYI, I will be out next week and intend to pick this back up when I return. Assuming all goes smoothly with the miscweb migration, the next step is to rebase the "plain" httpd image stack on bookworm via https://gerrit.wikimedia.org/r/1162030 and deprecate the -bookworm track.
Status update before I disappear for a bit:
It has been over 1h since https://gerrit.wikimedia.org/r/1166016 was merged, and subsequent puppet runs on the prometheus hosts should now have picked up the change. Closing this out as resolved by @RKemper.
Yesterday
My sincere thanks for all of your help here @tstarling.
Tue, Jul 1
Thank you very much, @tstarling. That final rename is now complete.
Alright, the remaining steps here are:
Once I'm able to pull the PCRE2 backport builds into component/php83, I should be able to start on the build process.
Alright, one straggler I missed before on thwiki:
The renames listed in the task description have now been completed, and a MediaWiki configuration change has been applied that ensures titles starting with previously overridden characters now canonicalize to their correct title-case equivalents (ensuring that, e.g., the former redirects to the latter). Remaining cleanup will be tracked in the parent task (T394556).
Alright, all renames should be complete and the title-case mappings have been reverted to just the static override for Eszett. During the deployment, I spot-checked that a number of previously overridden characters now canonicalize to their proper title-case equivalents.
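For illustration only, the idea of a first-letter override table can be sketched as below. This is not MediaWiki's implementation: `OVERRIDES` and `canonicalize_title` are hypothetical names, and the assumption that the Eszett override keeps "ß" unchanged is mine, since the actual mapping is not shown here. The sketch relies only on the fact that Python's built-in `str.upper()` expands "ß" to "SS".

```python
# Hypothetical static override table: first characters whose default
# upper-casing is unwanted. Assumption: "ß" is kept as-is.
OVERRIDES = {"ß": "ß"}


def canonicalize_title(title: str, overrides: dict[str, str] = OVERRIDES) -> str:
    """Upper-case the first character of a title, honoring static overrides."""
    if not title:
        return title
    first, rest = title[0], title[1:]
    # Use the override when one exists; otherwise fall back to str.upper().
    return overrides.get(first, first.upper()) + rest


with_override = canonicalize_title("ßeta")
without_override = canonicalize_title("ßeta", overrides={})
plain = canonicalize_title("élan")
```

With the override in place the title keeps its leading "ß"; with an empty table, the default upper-casing expands it to "SS", which changes the title's length.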
comms: I've reached out to the user via Special:EmailUser with a heads-up about the upcoming rename.
Excimer 1.2.5-1 packages are now available in /var/cache/pbuilder/result/bullseye-amd64/ on build2001, but have not yet been included in the apt repository.
Mon, Jun 30
@Ladsgroup and @Zabe - Thank you both. It sounds like I do indeed need to pick up the change to support file renames. While I can do an initial test run with a local copy of uppercaseTitlesForUnicodeTransition.php, I can't use that strategy with mwscript-k8s more generally, so I may need to backport https://gerrit.wikimedia.org/r/1164665 depending on timing.
My understanding is that your RequestTimeout change is a performance mitigation for the now-more-expensive manipulation of Excimer timers. Is there anything else that should be monitored as a result of that? (i.e., in addition to segfaults and exceptions more generally)
Ah, thanks for highlighting that, @Zabe! It looks like that was merged yesterday, so if it's critical, we'll need to make sure it gets picked up before we re-run the script. Given the current status of the schema migration, do you know whether we expect content to exist in that table that requires renames?
I was doing the final prep for actually running the renames this morning, and it seems a user created on idwiktionary 3 days ago would now be renamed. I need to sort out how / whether to communicate with this user before proceeding.
I was chatting with @Jelto earlier today about migrating miscweb, and it sounds like it should be doable / preferable to migrate in two steps, similar to what we're doing with shellbox and mediawiki - i.e., switch to httpd-bookworm and deploy / verify, then switch back to httpd once the latter has been rebased on bookworm.
Thu, Jun 26
Alright, I think https://gerrit.wikimedia.org/r/1164264 is the simplest option to achieve the specific behavior we want - i.e., reload rather than restart, and do so when any of the relevant resources change.
As of 17:20 UTC, all mediawiki releases have now migrated to the bookworm-based webserver image.
@Clement_Goubert - Ah, thanks for the additional details!
Thanks, @MoritzMuehlenhoff - that's an interesting idea!
Wed, Jun 25
I was chatting with @dancy earlier today about what might have caused this, and it's kind of a puzzling one.
This is done now. Thanks for the reviews, all!
Let's start with the good news: Everything that could be evaluated after migrating a single host (conf2006) seems to work as expected. We were able to confirm that conftool, etcdctl, MediaWiki, confd, and Liberica experience no connectivity issues as a result of the new certificates. Further, since conftool works, there is no reason to expect other python-etcd-based clients (e.g., navtiming, spicerack) will not.
Tue, Jun 24
As of ~ 17:30 UTC today, both mw-api-ext and mw-web are serving ~ 5% of traffic via the migration releases, which are in turn using the bookworm webserver image.
@MBH - Thanks for the report. Let me see if I can find the right folks to investigate this further.
Mon, Jun 23
@MatthewVernon - Ah, that's great! Yes, let's keep those pointed to failoid, then. I'll post a patch shortly to do the "manual equivalent" for the swift-ro services.
The webserver-bookworm image flavour is now live in mw-debug/next, passing httpbb checks and manual kicking-of-tires by me. No errors / issues surfaced in httpd container logs. None of this is surprising, given that apache 2.4.62 has been live on the mwdebug hosts for some time without issue.
Sat, Jun 21
Revisiting this today, here's a revised plan for the nginx TLS proxy portion of the migration.
Wed, Jun 18
After no issues were uncovered for shellbox-syntaxhighlight with ~ 24h on the new images, the remaining (5) shellbox instances have now been updated as well (staggered by datacenter by ~ 20m). Validating using the same graphs and logs as in T378128#10925040, no issues have been uncovered so far, though again I'll check in periodically throughout the day.
Alright, I'll let you take this from here, @Ladsgroup :)
Thank you both for resolving that! Indeed, prior to [0] landing in 5.3.0, RO state on external sections was ignored by dbctl when generating the committed dbconfig - thus Amir's observation in T395696#10927592.
Tue, Jun 17
A couple of updates:
@Ladsgroup - I was optimistically walking through some final checks ahead of releasing conftool, and I noticed that x3 is marked read-only:
After about an hour of soak with 1 replica per DC on the new httpd images and no issues observed, I've now moved all of syntaxhighlight forward. I've been keeping an eye on general service health in grafana (eqiad, codfw), httpd container logs (manual tailing with kubectl), mediawiki exec-channel errors (logstash), and ShellboxError exceptions (logstash), and will be checking in periodically throughout the day.
Mon, Jun 16
After the switch to a CronJob, I was able to successfully apply a lingering image diff from today's UTC-afternoon backport window using scap. Thanks for driving that @brouberol!
With the alert routing and severity changes now merged, I believe that wraps up the remaining work here. Thanks for the discussion, folks.
Fri, Jun 13
@tstarling - Thanks for reviewing the proposed renames and for confirming the mutating mode of the script still works as expected. I realize it has been a few years since it was last used, so that was definitely a concern.
Now that the new alert has been live for a couple of days without issuing false positives, I believe it makes sense to switch to task-severity.
Thu, Jun 12
Alright, we now have the ability to override the httpd image name easily. I'd propose we start with a pilot on a single shellbox service in two steps (fraction of traffic -> all traffic), then expand to the remaining services, similar to what we did for the PHP 8.1 migration (although we can and should go much faster here).
The conftool 5.3.0 packages are now live on apt-staging, but have not yet been included in apt.wikimedia.org.
Inverting the order and piloting on shellbox early on sounds good. The only downside to that is the necessary change to the chart, but that's really quite easy.
Wed, Jun 11
Now that we can reuse some of the tools we created for the PHP 8.1 migration to pilot this easily, it makes sense to pick this back up and get it done.
The remaining shellbox instances have been updated everywhere as of 18:52 UTC today. Looking at general service health and some of the use-case-specific logstash queries from T377038, all looks well.
@elukey - Ah, I wonder if Google might have changed something. The 16384 number was based entirely on bisection with a small number of test events. It seemed to consistently be the "too large" threshold at the time, but something might have changed in the interim.
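For illustration, a bisection of this kind can be sketched as follows. The `is_too_large` predicate is hypothetical: it stands in for whatever check was actually used (for example, submitting a test event of a given size and seeing whether it is rejected). The sketch assumes the predicate is monotonic, and the 16384 in the usage is only simulated locally.

```python
from typing import Callable


def smallest_rejected_size(
    is_too_large: Callable[[int], bool], lo: int, hi: int
) -> int:
    """Return the smallest size in (lo, hi] that is_too_large rejects.

    Assumes size lo is accepted, size hi is rejected, and that once a size
    is rejected every larger size is rejected as well.
    """
    if is_too_large(lo) or not is_too_large(hi):
        raise ValueError("lo must be accepted and hi must be rejected")
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_too_large(mid):
            hi = mid  # threshold is at or below mid
        else:
            lo = mid  # threshold is above mid
    return hi


# Simulated stand-in for the real check (assumption: rejects at >= 16384).
def simulated_check(size: int) -> bool:
    return size >= 16384


threshold = smallest_rejected_size(simulated_check, 1, 1 << 20)
```

Because each probe halves the search range, only about 20 test events are needed to cover sizes up to 2^20, which fits the "small number of test events" constraint. If the real limit changed later, re-running the same search would surface the new threshold.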
Mon, Jun 9
The SessionStoreDiskSpaceRunwayTooLow alert is now live, although in warning severity, to limit the scope of potential noise if something odd happens early on with the maths (e.g., while the history of the recording rule builds up).
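As a rough sketch of the kind of arithmetic a "runway" alert implies (not the actual alert expression, which is not shown here): assume runway is free space divided by the recent growth rate, and that the alert fires when the runway drops below a minimum. The function names and the 30-day default are hypothetical.

```python
import math


def disk_runway_days(free_bytes: float, growth_bytes_per_day: float) -> float:
    """Estimate days until the disk fills at the current growth rate."""
    if growth_bytes_per_day <= 0:
        # Usage is flat or shrinking: no projected exhaustion.
        return math.inf
    return free_bytes / growth_bytes_per_day


def runway_too_low(
    free_bytes: float, growth_bytes_per_day: float, min_days: float = 30.0
) -> bool:
    """Return True when the projected runway is below the minimum."""
    return disk_runway_days(free_bytes, growth_bytes_per_day) < min_days


GIB = 1024**3
runway = disk_runway_days(500 * GIB, 10 * GIB)
fires_at_30 = runway_too_low(500 * GIB, 10 * GIB)
fires_at_60 = runway_too_low(500 * GIB, 10 * GIB, min_days=60.0)
```

A growth rate estimated from very little history can swing widely, which is one way the maths could misbehave early on and a reason to start at warning severity.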
The 2025-06-05-215815 image is live in shellbox-video as of ~ 17:10 UTC. No issues observed so far. I'll wait for https://gerrit.wikimedia.org/r/1154132 to be deployed (and soak for a bit) before moving ahead with the remaining shellbox instances.
Fri, Jun 6
Scanning through the changes merged since the summary in the task description was collected, the only notable one I see was the bump to wikimedia/wikipeg 5.0.0. Production appears to be running 4.0.0 as of the 2025-01-07-141744 image. As long as that doesn't carry any notable risk, beyond what could sneak through the PEG parser tests for ShellParser, that seems fine?
@bd808 - Thanks for flagging! Indeed, this fell by the wayside while dealing with other aspects of the PHP migration, and for lack of any urgent changes that needed to be deployed.
Wed, Jun 4
Thanks for the follow-up, all!
Tue, Jun 3
Two items came to mind while reviewing https://gerrit.wikimedia.org/r/1152853, which will also need to be done to make use of this (in order):
- The patternProperties for readOnlyBySection will need to be updated in the dbconfig json-schema [0].
- The section flavor check [1] in DbConfig.compute_config will need to be updated to consider external-flavored sections when populating readOnlyBySection.
@Zabe - There will be an additional announcement soon, but similar to the guidance around other not-yet-supported use cases like sql.php in this wikitech-l thread, the interim solution is likely to involve moving your mwscript usage to the active deployment host (i.e., deployment.eqiad.wmnet) instead of the soon-to-be-decommissioned mwmaint* hosts.
Alright, I was able to run uppercaseTitlesForUnicodeTransition.php across all wikis in the default dry-run mode today.
Although changes to mediawiki-dumps-legacy will be needed before this feature can actually be put to use there (details in T389786#10881115), we were still able to "successfully" test this functionality today, and indeed it appears to work as expected.
Alas, as foretold in T389499#10671841, you cannot mutate the spec.template of a k8s Job object, regardless of whether it's suspended or not:
With https://gerrit.wikimedia.org/r/1152854 merged, I believe this problem should be fixed.