
Split core en.json to several files
Open, Medium, Public

Description

Some time in 2010 or so I raised the idea of splitting the core translation file (then MessagesEn.php) into several files to make it easier for translators. The basic idea is that it's easier to approach the translations as several smaller groups rather than one large group.

Back then it had about 2700 messages. @siebrand and @Nikerabbit were not enthusiastic about it, and said that it's not worth the effort. (We discussed it in person at the 2011 Berlin Hackathon, and possibly in writing on some mailing lists or Bugzilla tasks, but I cannot find it now.)

A few things changed since then:

  • The number of messages went up from 2700 to 3800. In fact, it's over 4000 if you count the optional and ignored messages.
  • We transitioned from PHP to JSON.
  • In practice we already have several separate en.json files: the core itself is split to Core, API, and Installer, and there are also separate repos for skins.
  • translatewiki.net configuration files are not that hard. (I don't quite know how they looked in 2010, to be honest, but I do know them now, and they aren't terrible.)

As far as I know, splitting a group is a matter of:

  • Finding a group of closely related messages, making sure that no information is lost compared to the current subgroups of messages en.json contains (T162172#3280030).
  • In the core repository (example):
    • Moving the relevant messages to a new en.json and qqq.json while keeping all the message keys identical. Unless there's a reason to do it differently, the new files should be under languages/i18n/new-group-name/en.json.
    • Adding an entry for the new file to function getMessagesDirs() in includes/cache/localisation/LocalisationCache.php.
    • Adding an entry for the new file to the banana section in Gruntfile.js.
  • In the translatewiki repository:
    • Adding a new group in groups/MediaWiki/MediaWiki.yaml and moving the ignored and optional messages into it (example).
    • Adding the new group to the appropriate aggregate "used by Wikimedia" group, such as groups/MediaWiki/WikimediaMainAgg.yaml or WikimediaTechnicalAgg.yaml. (example).
    • Adding the new group to the mediawiki:/group: section in repoconfig.yaml (example).
  • Doing a new export so that the translations are moved as well.
  • (Did I miss anything? Does anything need to be updated also in the scripts for synching translatewiki with Gerrit?)
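
The "moving the relevant messages" step above could be sketched roughly as follows. This is only an illustration, not the script used for the actual patches: the function name `split_messages`, the choice of selecting messages by key prefix, and the sample keys and values are all assumptions. It also assumes each file carries the usual `@metadata` block, which is simply copied into both outputs here, and that the files use tab indentation.

```python
import json


def split_messages(messages, prefixes):
    """Split one i18n dict into (remaining, moved) by key prefix.

    Keys are kept identical and in their original order; only the file
    they live in changes. "@metadata" is copied to both results
    (an assumption; adjust it by hand if the new file needs its own).
    """
    prefixes = tuple(prefixes)
    remaining, moved = {}, {}
    for key, value in messages.items():
        if key == "@metadata":
            remaining[key] = value
            moved[key] = value
        elif key.startswith(prefixes):
            moved[key] = value
        else:
            remaining[key] = value
    return remaining, moved


def write_i18n(path, messages):
    """Write a message dict as JSON with tab indentation, keeping non-ASCII text readable."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(messages, f, ensure_ascii=False, indent="\t")
        f.write("\n")


if __name__ == "__main__":
    # Illustrative sample data, not real message texts.
    sample = {
        "@metadata": {"authors": []},
        "mainpage": "Main Page",
        "exif-imagewidth": "Width",
        "exif-imagelength": "Height",
    }
    remaining, moved = split_messages(sample, ["exif-"])
    print(sorted(k for k in moved if k != "@metadata"))
```

The same function would need to be run on qqq.json with the same prefixes so that the documentation moves together with the messages.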

I'm not talking about splitting it to 50 groups, but some initial groups I can think of are:

  • exif/ - definitely the exif tags (about 380 messages)
  • datetime/ - maybe calendars (not only Gregorian, but also Hebrew, Persian, days of week, etc.)
  • maybe log messages
  • language converter
  • namespace messages (nstab, etc.)
  • skin messages (user menu, sidebar, tool box, etc.)
  • special pages
    • preferences/ - Special:Preferences
  • user emails (enotif-*, etc.)
  • user groups (group-*, grouppage-*, *.css, etc.)
  • user rights
  • I haven't given this much thought yet, but perhaps the ignored messages could be moved to a separate file. That file would simply not be loaded into translatewiki, and then we could remove the long "ignored" list from the translatewiki configuration (269 items at the moment). But that's really a separate issue to discuss.
  • Possibly some more.
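
To get a rough feel for how large such groups would be, one could count message keys by prefix. The sketch below is only an assumption about how to do that: the function name `count_by_prefix` and the group names are hypothetical, and the prefixes are just the ones mentioned in this task (exif-, enotif-, group-, grouppage-, nstab-). Many groups, such as special pages, cannot be identified by prefix alone.

```python
from collections import Counter

# Hypothetical group names mapped to the key prefixes mentioned in this task.
CANDIDATE_GROUPS = {
    "exif": ("exif-",),
    "user emails": ("enotif-",),
    "user groups": ("group-", "grouppage-"),
    "namespace": ("nstab-",),
}


def count_by_prefix(keys, groups=CANDIDATE_GROUPS):
    """Count how many keys fall into each candidate group.

    A key is assigned to the first group with a matching prefix;
    everything else is counted under "(unmatched)".
    """
    counts = Counter()
    for key in keys:
        if key == "@metadata":
            continue
        for name, prefixes in groups.items():
            if key.startswith(prefixes):
                counts[name] += 1
                break
        else:
            counts["(unmatched)"] += 1
    return counts


if __name__ == "__main__":
    # In practice the keys would come from json.load() on languages/i18n/en.json.
    sample_keys = ["@metadata", "exif-imagewidth", "nstab-main", "group-bot", "mainpage"]
    for name, total in count_by_prefix(sample_keys).most_common():
        print(name, total)
```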

I can do it myself some time as a pet project. This task is a kind of an RFC: Are there caveats that I am missing? Is it harder than I imagine? Is anybody opposed to it for any reason?

Event Timeline


Some suggestions for possible groupings:

  • Reader messages – messages that are seen by casual readers who don't touch the edit buttons or special pages at all
  • Editor messages – messages that are seen by people who edit, but are not necessarily logged in
  • User messages – messages that are seen by normal registered users (like preferences, etc)
  • Privileged messages – messages that are seen by people with special rights (patrollers, admins, etc)
    • (Maybe even split this one into different rights – for example, the vast majority of checkuser and abusefilter messages (and there are many of them!) are only for a very few select users, so translating those should probably be lowest priority, even much lower than messages for admins)
  • API messages – messages that are never seen by anyone 😜

Change 458165 had a related patch set uploaded (by Amire80; owner: Amire80):
[mediawiki/core@master] WIP Move exif messages to a separate i18n file

https://gerrit.wikimedia.org/r/458165

One thing to consider here is that the XMP parser, which is now a separate library, very softly depends on these messages (along with stuff in core which depends on these). It'd be cool if we could split this out in some way so that you still get these messages if you use the XMP library independently.

Thanks for the comment! If I understand correctly, this sounds sensible, but I'm really not familiar with this. Who is developing it? (You?)

It was my GSoC project in 2010, but I'm not really maintaining it anymore, so I think the answer is nobody... (https://github.com/wikimedia/XMPReader for reference). There are some complicating factors though, in that MW still needs to have those messages (for the non-XMP exif support) and have them integrated into the MediaWiki namespace and friends. Not all of those messages are related to XMP (but most are, and that probably doesn't matter). I guess the way to do that would be to have a separate library containing the messages that both XMPReader and MediaWiki depend on, and have some magic to make messages from this library show up in the MediaWiki namespace. So actually doing this might derail this task, which I wouldn't want. In any case, splitting the exif messages into a separate JSON file is definitely the first step towards doing something like that.

If you can keep the format of the i18n files, all that is needed to use them in MediaWiki is to have them registered in $wgMessagesDirs. They can live in the XMPReader repo, which is then brought into MediaWiki in some manner (composer?).

Change 481489 had a related patch set uploaded (by Amire80; owner: Amire80):
[translatewiki@master] Split exif messages from MediaWiki core

https://gerrit.wikimedia.org/r/481489

Change 458165 merged by jenkins-bot:
[mediawiki/core@master] Move exif messages to a separate i18n file

https://gerrit.wikimedia.org/r/458165

Change 481489 merged by jenkins-bot:
[translatewiki@master] Split exif messages from MediaWiki core

https://gerrit.wikimedia.org/r/481489

It looks like the TWN bot has now removed the EXIF messages from the non-English .json files in the main languages/i18n directory, but hasn't yet created any non-English .json files in the languages/i18n/exif directory?

Indeed. @Raymond , @Nikerabbit , do you have any idea about this? Did I do anything incorrectly? It's a bit concerning to see this just a few days before the next train.

If you can keep the format of the i18n files, all that is needed to use them in MediaWiki is to have them registered in $wgMessagesDirs. They can live in the XMPReader repo, which is then brought into MediaWiki in some manner (composer?).

Yeah, this seems pretty doable. My main concern is that the library is not updated on a regular basis as the PHP code is fairly stable, so new i18n messages wouldn't be pulled in unless someone does a release. What is the acceptable delay from message being translated to being deployed? We could probably automate the release process for i18n changes, but I don't think we want to be tagging a new release for every day's updates...

Change 482360 had a related patch set uploaded (by Nikerabbit; owner: Nikerabbit):
[translatewiki@master] Export mediawiki-exif with core

https://gerrit.wikimedia.org/r/482360

Change 482360 merged by jenkins-bot:
[translatewiki@master] Export mediawiki-exif with core

https://gerrit.wikimedia.org/r/482360

It looks like the TWN bot has now removed the EXIF messages from the non-English .json files in the main languages/i18n directory, but hasn't yet created any non-English .json files in the languages/i18n/exif directory?

Indeed. @Raymond , @Nikerabbit , do you have any idea about this? Did I do anything incorrectly? It's a bit concerning to see this just a few days before the next train.

Fixed with https://gerrit.wikimedia.org/r/482360 and https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/482409/ . I also documented the fix in this task's description.

Change 542007 had a related patch set uploaded (by Amire80; owner: Amire80):
[mediawiki/core@master] Split rest messages from the main en.json

https://gerrit.wikimedia.org/r/542007

Pppery subscribed.

All patches appear to have been merged. Can this ticket be closed as resolved?

No, I plan to do more. Thanks for reminding me of the task number, I was just wondering about it recently :)

Change #1022510 had a related patch set uploaded (by Winston Sung; author: Winston Sung):

[mediawiki/core@master] i18n: Move preferences messages to a separate i18n file

https://gerrit.wikimedia.org/r/1022510

Change #1022509 had a related patch set uploaded (by Winston Sung; author: Winston Sung):

[translatewiki@master] Split preferences messages from MediaWiki core

https://gerrit.wikimedia.org/r/1022509

Change #1022510 merged by jenkins-bot:

[mediawiki/core@master] i18n: Move preferences messages to a separate i18n file

https://gerrit.wikimedia.org/r/1022510

Change #1022509 merged by jenkins-bot:

[translatewiki@master] Split preferences messages from MediaWiki core

https://gerrit.wikimedia.org/r/1022509

Change #1112685 had a related patch set uploaded (by Winston Sung; author: Winston Sung):

[mediawiki/core@master] i18n: Move system message "prefs-misc" into preferences message group

https://gerrit.wikimedia.org/r/1112685

Change #1112685 merged by jenkins-bot:

[mediawiki/core@master] i18n: Move system message "prefs-misc" into preferences message group

https://gerrit.wikimedia.org/r/1112685

Change #937590 had a related patch set uploaded (by Winston Sung; author: Winston Sung):

[mediawiki/core@master] i18n: Move datetime messages to a separate i18n file

https://gerrit.wikimedia.org/r/937590

Change #1113452 had a related patch set uploaded (by Winston Sung; author: Winston Sung):

[translatewiki@master] Split datetime messages from MediaWiki core

https://gerrit.wikimedia.org/r/1113452

No, there are more parts to be split.

Only one Gerrit change in core will be uploaded for review at a time, to prevent massive merge conflicts.

Change #937590 merged by jenkins-bot:

[mediawiki/core@master] i18n: Move datetime messages to a separate i18n file

https://gerrit.wikimedia.org/r/937590

Change #1118716 had a related patch set uploaded (by Raimond Spekking; author: Raimond Spekking):

[translatewiki@master] [Core] Separate DateTime

https://gerrit.wikimedia.org/r/1118716

Change #1118716 merged by jenkins-bot:

[translatewiki@master] [Core] Separate DateTime

https://gerrit.wikimedia.org/r/1118716

Change #1113452 merged by jenkins-bot:

[translatewiki@master] Split datetime messages from MediaWiki core

https://gerrit.wikimedia.org/r/1113452

So what are the exit criteria for this? When will it be done? How can anyone contribute?

So what are the exit criteria for this?

When the main en.json is split down to a reasonable number of messages, at least not something like 4,500 messages.

When will it be done?

I would expect before the end of 2025.

How can anyone contribute?

Anyone can submit Gerrit changes that reduce it to a reasonable number of messages.

Only one Gerrit change in core will be uploaded for review at a time, to prevent massive merge conflicts.

This is only about my current workflow, and it won't block anyone from contributing while there are unreviewed changes authored by me.

You've given no criteria for completion, for what "reasonable" is, the basis on which they might be split, or what the competing concerns are. Consequently you've made this something only you can take part in. Please write out your thoughts so we can get this one in one step.

the basis on which they might be split

I've updated the task description to reflect the current planned message groups to be split.

MaryMunyoki raised the priority of this task from Low to Medium. Apr 15 2025, 5:42 PM

Immediate ideas about groups to divide:

  • log messages
  • language converter
  • namespace messages (nstab, etc.)
  • user rights (action-, right-)
  • bot passwords
  • ignored messages

There will definitely be changes in the coming weeks.

I think there should be some sort of criteria for evaluating what is a good split. Neither extreme is good: with one file, it gets too large and translators cannot focus, but on the other hand too many files make it harder for developers to add or update messages, as they need to be aware of all of the available groups.

I am not a fan of functional splits (e.g. log messages, user rights) and would prefer to keep similar messages together based on component/domain (e.g. language converter, all the skins and extensions).