Adding Uppercase and lowercase collation for Kazakh language
Closed, Resolved / Public / BUG REPORT

Authored By

	MuratKaribay
	Jan 21 2025, 10:05 PM

Description

Originally requested and discussed here: Әліпбиді жөндеу ("Fixing the alphabet").

Because there is no collation for the Kazakh alphabet, the Kazakh-specific letters are sorted by Unicode code point and end up at the end of the list.

Kazakh Uppercase: А Ә Б В Г Ғ Д Е Ё Ж З И Й К Қ Л М Н Ң О Ө П Р С Т У Ұ Ү Ф Х Һ Ц Ч Ш Щ Ъ Ы І Ь Э Ю Я
Kazakh Lowercase: а ә б в г ғ д е ё ж з и й к қ л м н ң о ө п р с т у ұ ү ф х һ ц ч ш щ ъ ы і ь э ю я
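
To make the code point problem concrete, here is a minimal standalone sketch in plain PHP (mbstring extension, PHP 7.4 or later). It is not MediaWiki code: the constant KAZAKH_ALPHABET and the function kazakhSortKey() are hypothetical names introduced for illustration, and the sketch assumes only the lowercase alphabet listed above.

```php
<?php
// Illustrative sketch only; KAZAKH_ALPHABET and kazakhSortKey() are
// hypothetical names, not part of MediaWiki.
const KAZAKH_ALPHABET = 'аәбвгғдеёжзийкқлмнңоөпрстуұүфхһцчшщъыіьэюя';

/**
 * Build a sort key: each letter becomes its two-digit position in the
 * alphabet, so comparing keys as strings compares words alphabetically.
 * Characters outside the alphabet get 99 and therefore sort last.
 */
function kazakhSortKey( string $word ): string {
	static $rank = null;
	if ( $rank === null ) {
		$rank = array_flip( mb_str_split( KAZAKH_ALPHABET, 1, 'UTF-8' ) );
	}
	$key = '';
	foreach ( mb_str_split( mb_strtolower( $word, 'UTF-8' ), 1, 'UTF-8' ) as $char ) {
		$key .= sprintf( '%02d', $rank[$char] ?? 99 );
	}
	return $key;
}

$words = [ 'ірі', 'ұлан', 'әбігер', 'майлы', 'сабыр' ];

// Plain string sort: compares bytes, which for UTF-8 is code point order.
$byCodePoint = $words;
sort( $byCodePoint, SORT_STRING );

// Alphabet-aware sort: compares the positional keys instead.
$byAlphabet = $words;
usort(
	$byAlphabet,
	static fn ( string $a, string $b ): int => strcmp( kazakhSortKey( $a ), kazakhSortKey( $b ) )
);

echo implode( ' ', $byCodePoint ), "\n";
echo implode( ' ', $byAlphabet ), "\n";
```

With the plain sort, the words starting with "і", "ұ" and "ә" land after "майлы" and "сабыр"; with the positional key, "әбігер" comes first and "ірі" last, which is the order the alphabet above prescribes.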

Sample code:

class KazakhUppercaseCollation extends CustomUppercaseCollation {

	public function __construct() {
		parent::__construct( [
			'А',
			'Ә',
			'Б',
			'В',
			'Г',
			'Ғ',
			'Д',
			'Е',
			'Ё',
			'Ж',
			'З',
			'И',
			'Й',
			'К',
			'Қ',
			'Л',
			'М',
			'Н',
			'Ң',
			'О',
			'Ө',
			'П',
			'Р',
			'С',
			'Т',
			'У',
			'Ұ',
			'Ү',
			'Ф',
			'Х',
			'Һ',
			'Ц',
			'Ч',
			'Ш',
			'Щ',
			'Ъ',
			'Ы',
			'І',
			'Ь',
			'Э',
			'Ю',
			'Я',
		], Language::factory( 'kk' ) );
	}
}

Details

	Subject	Repo	Branch	Lines +/-
	Add uca collation for Kazakh	operations/mediawiki-config	master	+1 -0
	Add "Ё" to first letter list for Kazakh	mediawiki/core	master	+1 -1

Event Timeline

MuratKaribay created this task. Jan 21 2025, 10:05 PM

MuratKaribay updated the task description. Jan 21 2025, 10:07 PM

jhsoby subscribed. Jan 22 2025, 7:04 AM

I can help with this, but I think it might actually make a good first task. Before I add that tag, though, I would like to check what's already in CLDR for Kazakh. I will update later; feel free to ping me if I haven't.

@jhsoby can you add collation from CLDR (Kazakh)?

@MuratKaribay: It looks to me like the CLDR data might be outdated, as it doesn't have the letter "Ә". So until CLDR is updated, the easiest solution is to make our own custom collation (like for Abkhaz and Bashkir).

Did you write the sample code in the task description? I'd like to add you as author of the file I'm adding.

@jhsoby: Yes, you're right, CLDR is rarely updated. Let's add a custom collation for now, since the admins of Kazakh Wikipedia have long wanted to fix the alphabet sorting problem.

Change #1114350 had a related patch set uploaded (by Jon Harald Søby; author: Jon Harald Søby):

[mediawiki/core@master] Add custom collation for Kazakh language

https://gerrit.wikimedia.org/r/1114350

gerritbot added a project: Patch-For-Review. Jan 27 2025, 11:17 AM

@MuratKaribay Yup. Patch uploaded – did you make the sample code from the description, or someone else? I should credit you (or whoever else made it) in the file. 😊

@jhsoby Thanks. I used the Bashkir collation as a reference, and I think that when Kazakh switches to the Latin script, I will specify the collation once again.
I'd also like to know approximately how many days it will take for the repositories to be updated, as the admins are waiting for the new sorting.
One more question: if there are no programmers or IT specialists among the admins of the Kazakh wiki projects, is it possible to create a position for those who don't want to be admins but want to solve problems with code?

@jhsoby are you here?

In T384395#10502589, @MuratKaribay wrote:

if there are no programmers or IT specialists among the admins of the Kazakh wiki projects, is it possible to create a position for those who don't want to be admins but want to solve problems with code?

Please bring up general questions on wiki group configurations in community forums instead - thanks!

srishakatux moved this task from Backlog to Current on the LPL Onboarding and Development board. Feb 10 2025, 10:24 PM

srishakatux edited projects, added LPL Onboarding and Development (Current); removed LPL Onboarding and Development.

srishakatux moved this task from Incoming Requests to Needs Review on the LPL Onboarding and Development (Current) board.

Hi,

The CLDR data looks correct to me (but of course I don't speak this language). I'd prefer we use the CLDR collation instead of our own as long as it is correct. You can test the CLDR collation at https://icu4c-demos.unicode.org/icu-bin/collation.html, making sure the drop-down menu on the top left is set to kk (type=standard): Kazakh (Standard Sort Order).

I guess all those are using a different version than Wikipedia. I believe Wikipedia uses CLDR 37.

According to https://www.unicode.org/cldr/charts/37/ (Warning, huge download), CLDR 37 has the following order for Kazakh:

- ‐ ‑ – — , ; : ! ? . … ' ‘ ’ " “ ” « » ( ) [ ] { } § @ * / & # % ‰ + 0 1 2 3 4 5 6 7 8 9 а А ә Ә б Б в ᲀ В г Г ғ Ғ д ᲁ Д е Е ё Ё ж Ж з З и И й Й к К қ Қ л Л м М н Н ң Ң о ᲂ О ө Ө п П р Р с ᲃ С т ᲄ ᲅ Т у У ұ Ұ ү Ү ф Ф х Х һ Һ ц Ц ч Ч ш Ш щ Щ ъ ᲆ Ъ ы Ы і І ї Ї ь Ь э Э ю Ю я Я
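
For checking the CLDR order locally rather than in the web demo, a minimal sketch using PHP's intl extension might look like this. The helper name sortWithLocale() is hypothetical, and the result depends on the ICU/CLDR version bundled with the local PHP build, so it may differ from both the demo and Wikimedia production.

```php
<?php
// Sketch only; requires the intl extension. sortWithLocale() is a
// hypothetical helper name, not an existing API.

/**
 * Return a copy of $words sorted with the ICU collation for $locale.
 */
function sortWithLocale( array $words, string $locale ): array {
	$collator = new Collator( $locale );
	// Collator::sort() sorts the array in place and reindexes it.
	$collator->sort( $words );
	return $words;
}

$words = [ 'сәлем', 'қырық', 'майлы', 'ірі', 'әбігер', 'ұлан', 'үсен', 'сабыр' ];

// Print one word per line in the order the local ICU data gives for Kazakh.
echo implode( "\n", sortWithLocale( $words, 'kk' ) ), "\n";
```

Comparing this output with a plain sort() of the same list shows whether the local ICU data already places the Kazakh-specific letters inside the alphabet.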

@Bawolff Yes, you're right. I tried icu4c-demos (everything is sorted correctly) and I think we should use CLDR.

@Bawolff I double-checked and one letter "ү" isn't sorted, why?
Example: the word "сүр" at the end isn't sorted.

Input:
сәлем
қырық
қү
сығу
майлы
ірі
әбігер
ұлан
үсен
сабыр
сёр
сер
сір
cүр
сұр
күр
мүр
мяр
түр

Output:
<1 әбігер
<1 күр
<1 қү
<1 қырық
<1 майлы
<1 мүр
<1 мяр
<1 сабыр
<1 сәлем
<1 сер
<1 сёр
<1 сұр
<1 сығу
<1 сір
<1 түр
<1 ұлан
<1 үсен
<1 ірі
<1 cүр

In T384395#10551913, @MuratKaribay wrote:

@Bawolff I double-checked and one letter "ү" isn't sorted, why?
Example: the word "сүр" at the end isn't sorted.

That's because you have a Latin "c" instead of a Cyrillic "с" for the input of that word. You can verify by copy-pasting it into this tool: https://www.fontspace.com/unicode/analyzer#e=0YHRltGAIGPSr9GA
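
The same check can be done without an external tool. Below is a minimal PHP sketch (mbstring extension, PHP 7.4 or later) that lists every letter in a string that is not in the Cyrillic script, together with its code point. The function name findNonCyrillicLetters() is hypothetical, and the sample strings are written with explicit code point escapes so that the look-alike letters cannot be confused in the source.

```php
<?php
// Sketch only; findNonCyrillicLetters() is a hypothetical helper name.

/**
 * Return the letters in $text that do not belong to the Cyrillic script,
 * each formatted as "<char> (U+XXXX)".
 */
function findNonCyrillicLetters( string $text ): array {
	$found = [];
	foreach ( mb_str_split( $text, 1, 'UTF-8' ) as $char ) {
		// \p{L} matches any letter; \p{Cyrillic} matches Cyrillic-script characters.
		if ( preg_match( '/^\p{L}$/u', $char ) && !preg_match( '/^\p{Cyrillic}$/u', $char ) ) {
			$found[] = sprintf( '%s (U+%04X)', $char, mb_ord( $char, 'UTF-8' ) );
		}
	}
	return $found;
}

// Latin "c" (U+0063) followed by Cyrillic U+04AF and U+0440.
$mixed = "c\u{04AF}\u{0440}";
// The same word starting with Cyrillic "с" (U+0441).
$clean = "\u{0441}\u{04AF}\u{0440}";

print_r( findNonCyrillicLetters( $mixed ) );
print_r( findNonCyrillicLetters( $clean ) );
```

For the mixed string the function reports the Latin "c" at U+0063; for the all-Cyrillic spelling it returns an empty array.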

In T384395#10551924, @jhsoby wrote:

In T384395#10551913, @MuratKaribay wrote:

@Bawolff I double-checked and one letter "ү" isn't sorted, why?
Example: the word "сүр" at the end isn't sorted.

That's because you have a Latin "c" instead of a Cyrillic "с" for the input of that word. You can verify by copy-pasting it into this tool: https://www.fontspace.com/unicode/analyzer#e=0YHRltGAIGPSr9GA

Oh, my bad. :)

Change #1114350 merged by jenkins-bot:

[mediawiki/core@master] Add "Ё" to first letter list for Kazakh

https://gerrit.wikimedia.org/r/1114350

Maintenance_bot removed a project: Patch-For-Review. Feb 14 2025, 6:30 PM

ReleaseTaggerBot added a project: MW-1.44-notes (1.44.0-wmf.17; 2025-02-18). Feb 14 2025, 7:00 PM

srishakatux moved this task from Needs Review to Waiting for Deployment on the LPL Onboarding and Development (Current) board. Feb 19 2025, 4:04 PM

MaryMunyoki triaged this task as High priority. Feb 25 2025, 2:18 PM

MaryMunyoki lowered the priority of this task from High to Medium.

MaryMunyoki changed the task status from Open to In Progress. Feb 25 2025, 2:38 PM

The patch is probably deployed, but I think that I still see the categories sorted with Ә towards the end and not in the beginning. Perhaps somebody needs to run a maintenance script to update the category pages? Just a guess.

In addition to that patch, $wgCategoryCollation has to be changed for that wiki prior to running the maintenance script.
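
As a rough illustration of the setting mentioned above, a per-wiki override in LocalSettings.php could look like the following fragment. The value 'uca-kk' is an assumption, inferred from the patch title "Add uca collation for Kazakh" and from the uca-<language code> naming of MediaWiki's ICU collations; the actual Wikimedia configuration lives in operations/mediawiki-config and is structured differently.

```php
<?php
// LocalSettings.php fragment (sketch). The value 'uca-kk' is an assumption;
// check the collations supported by the installed MediaWiki version.
$wgCategoryCollation = 'uca-kk';

// After changing this setting, existing category sort keys still reflect the
// old collation until the updateCollation.php maintenance script has run.
```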

In T384395#10607538, @Bawolff wrote:

In addition to that patch, $wgCategoryCollation has to be changed for that wiki prior to running the maintenance script.

Thanks for the clarification! This makes me wonder, however: Why do we have it in local settings? For most languages, this should probably be the default in core.

In T384395#10612910, @Amire80 wrote:

In T384395#10607538, @Bawolff wrote:

In addition to that patch, $wgCategoryCollation has to be changed for that wiki prior to running the maintenance script.

Thanks for the clarification! This makes me wonder, however: Why do we have it in local settings? For most languages, this should probably be the default in core.

At the time there were concerns about ICU collations not being stable between versions of PHP. I think it was decided that this would be fine as long as we store the collation version in cl_collation, so that articles would be fixed over time after a purge and update.php could detect whether the collation had changed. However, there was a disagreement on Gerrit over the best way to do this, so we ended up doing none of the options, and here we are.

Change #1126483 had a related patch set uploaded (by Jon Harald Søby; author: Jon Harald Søby):

[operations/mediawiki-config@master] Add uca collation for Kazakh

https://gerrit.wikimedia.org/r/1126483

gerritbot added a project: Patch-For-Review. Mar 11 2025, 7:32 AM

Change #1126483 merged by jenkins-bot:

[operations/mediawiki-config@master] Add uca collation for Kazakh

https://gerrit.wikimedia.org/r/1126483

Mentioned in SAL (#wikimedia-operations) [2025-03-11T08:35:37Z] <kartik@deploy2002> Started scap sync-world: Backport for [[gerrit:1126483|Add uca collation for Kazakh (T384395)]]

Mentioned in SAL (#wikimedia-operations) [2025-03-11T08:38:37Z] <kartik@deploy2002> kartik, jhsoby: Backport for [[gerrit:1126483|Add uca collation for Kazakh (T384395)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-03-11T08:47:51Z] <kartik@deploy2002> Finished scap sync-world: Backport for [[gerrit:1126483|Add uca collation for Kazakh (T384395)]] (duration: 12m 13s)

Just as a reminder (because I didn't see it in the SAL): after changing the setting you must run updateCollation.php, otherwise pre-existing categories will be sorted wrongly and behave weirdly.

Maintenance_bot removed a project: Patch-For-Review. Mar 11 2025, 9:32 AM

Mentioned in SAL (#wikimedia-operations) [2025-03-11T09:41:15Z] <kart_> Script run: mwscript updateCollation.php --wiki=kkwiki --previous-collation=uppercase (T384395)

In T384395#10622543, @Bawolff wrote:

Just as a reminder (because I didn't see it in the SAL): after changing the setting you must run updateCollation.php, otherwise pre-existing categories will be sorted wrongly and behave weirdly.

The script took a really long time (almost an hour?), but finished just now, and all looks good as far as I can tell. Thanks @KartikMistry for deploying!

MaryMunyoki moved this task from Waiting for Deployment to Done (Q3 2024-25) on the LPL Onboarding and Development (Current) board. Mar 11 2025, 10:44 AM
