
Adding Uppercase and lowercase collation for Kazakh language
Closed, Resolved | Public | Bug Report

Description

Originally requested and discussed here: Әліпбиді жөндеу ("Fixing the alphabet").

Because the Kazakh alphabet has no collation of its own, entries are sorted by Unicode code point, so the Kazakh-specific letters end up at the end of the list.

Kazakh Uppercase: А Ә Б В Г Ғ Д Е Ё Ж З И Й К Қ Л М Н Ң О Ө П Р С Т У Ұ Ү Ф Х Һ Ц Ч Ш Щ Ъ Ы І Ь Э Ю Я
Kazakh Lowercase: а ә б в г ғ д е ё ж з и й к қ л м н ң о ө п р с т у ұ ү ф х һ ц ч ш щ ъ ы і ь э ю я
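
To illustrate the problem, here is a minimal Python sketch (not MediaWiki code) that sorts the uppercase alphabet above by code point, which is what happens when no language-specific collation applies. The variable names are made up for this example.

```python
# Kazakh Cyrillic alphabet in its conventional order, as listed above.
KAZAKH_UPPER = list("АӘБВГҒДЕЁЖЗИЙКҚЛМНҢОӨПРСТУҰҮФХҺЦЧШЩЪЫІЬЭЮЯ")

# Python compares strings by Unicode code point, so this mimics a
# collation-less sort.
by_code_point = sorted(KAZAKH_UPPER)

# The basic Cyrillic block runs from А (U+0410) to Я (U+042F).
# Letters encoded outside that range get pushed to either end.
before_a = [ch for ch in by_code_point if ch < "А"]
after_ya = [ch for ch in by_code_point if ch > "Я"]

print("sorted before А:", " ".join(before_a))  # Ё and І
print("sorted after Я: ", " ".join(after_ya))  # Ғ Қ Ң Ү Ұ Һ Ә Ө
```

Ten of the 42 letters land outside their alphabetical position, including all eight letters that Kazakh adds to the basic Cyrillic alphabet.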

Sample code:

class KazakhUppercaseCollation extends CustomUppercaseCollation {

	public function __construct() {
		parent::__construct( [
			'А',
			'Ә',
			'Б',
			'В',
			'Г',
			'Ғ',
			'Д',
			'Е',
			'Ё',
			'Ж',
			'З',
			'И',
			'Й',
			'К',
			'Қ',
			'Л',
			'М',
			'Н',
			'Ң',
			'О',
			'Ө',
			'П',
			'Р',
			'С',
			'Т',
			'У',
			'Ұ',
			'Ү',
			'Ф',
			'Х',
			'Һ',
			'Ц',
			'Ч',
			'Ш',
			'Щ',
			'Ъ',
			'Ы',
			'І',
			'Ь',
			'Э',
			'Ю',
			'Я',
		], Language::factory( 'kk' ) );
	}
}
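
The general idea behind an alphabet-based collation like the one above can be sketched in a few lines of Python. This is only an illustration of the technique under simplifying assumptions, not how CustomUppercaseCollation is actually implemented in MediaWiki; `kazakh_sort_key` and `RANK` are hypothetical names.

```python
# Kazakh alphabet in the desired order (same list as in the PHP sample).
KAZAKH_UPPER = "АӘБВГҒДЕЁЖЗИЙКҚЛМНҢОӨПРСТУҰҮФХҺЦЧШЩЪЫІЬЭЮЯ"

# Map each letter to its position in the alphabet.
RANK = {ch: i for i, ch in enumerate(KAZAKH_UPPER)}


def kazakh_sort_key(text):
    """Uppercase the text and rank each character by alphabet position.

    Characters outside the alphabet sort after every Kazakh letter,
    ordered among themselves by code point (an assumption of this sketch).
    """
    return [RANK.get(ch, len(RANK) + ord(ch)) for ch in text.upper()]


words = ["ірі", "әбігер", "сабыр", "сәлем", "ұлан", "үсен"]
sorted_words = sorted(words, key=kazakh_sort_key)
print(sorted_words)
```

With this key, "әбігер" sorts first and "ірі" last, whereas a plain `sorted(words)` would put "әбігер" at the very end.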

Event Timeline

I can help with this, but I think it might actually be a good first task; but before I add that tag, I would like to check what's already in CLDR for Kazakh. Will update later, feel free to ping me if I haven't.

@MuratKaribay: It looks to me like the CLDR data might be outdated, as it doesn't have the letter "Ә". So until CLDR is updated, the easiest solution is to make our own custom collation (like for Abkhaz and Bashkir).

Did you write the sample code in the task description? I'd like to add you as author of the file I'm adding.

@jhsoby: Yes, you're right, CLDR is rarely updated. Let's add the custom collation for now, since the admins of Kazakh Wikipedia have wanted to fix the alphabet sorting for a long time.

Change #1114350 had a related patch set uploaded (by Jon Harald Søby; author: Jon Harald Søby):

[mediawiki/core@master] Add custom collation for Kazakh language

https://gerrit.wikimedia.org/r/1114350

@MuratKaribay Yup. Patch uploaded – did you make the sample code from the description, or someone else? I should credit you (or whoever else made it) in the file. 😊

@jhsoby Thanks. I used the Bashkir collation as a reference, and I think that when Kazakh switches to the Latin script, I will specify the collation again.
I'd also like to know roughly how many days it will take for the repositories to be updated, as the admins are waiting for the new sorting.
One more question: if there are no programmers or IT specialists among the admins of the Kazakh wiki projects, is it possible to create a position for people who don't want to be admins but want to solve problems with code?

if in Kazakh wiki projects among admins there are no programmers and IT specialists, is it possible to create one position for those who don't want to be admins, but want to solve problems with code?

Please bring up general questions on wiki group configurations in community forums instead - thanks!

Hi,

The CLDR looks correct to me (But of course I don't speak this language). I'd prefer we use the CLDR collation instead of our own as long as it is correct. You can test the CLDR collation at https://icu4c-demos.unicode.org/icu-bin/collation.html making sure the drop down menu on the top left is set to kk (type=standard): Kazakh (Standard Sort Order)

I guess all of those are using a different version than Wikipedia. I believe Wikipedia uses CLDR 37.

According to https://www.unicode.org/cldr/charts/37/ (Warning, huge download), CLDR 37 has the following order for Kazakh:

- ‐ ‑ – — , ; : ! ? . … ' ‘ ’ " “ ” « » ( ) [ ] { } § @ * / & # % ‰ + 0 1 2 3 4 5 6 7 8 9 а А ә Ә б Б в ᲀ В г Г ғ Ғ д ᲁ Д е Е ё Ё ж Ж з З и И й Й к К қ Қ л Л м М н Н ң Ң о ᲂ О ө Ө п П р Р с ᲃ С т ᲄ ᲅ Т у У ұ Ұ ү Ү ф Ф х Х һ Һ ц Ц ч Ч ш Ш щ Щ ъ ᲆ Ъ ы Ы і І ї Ї ь Ь э Э ю Ю я Я

@Bawolff Yes, you're right. I tried icu4c-demos (everything is sorted correctly), and I think we should use CLDR.

@Bawolff I double-checked and one letter "ү" isn't sorted, why?
Example: the word "сүр" at the end isn't sorted.

Input: сәлем
қырық
қү
сығу
майлы
ірі
әбігер
ұлан
үсен
сабыр
сёр
сер
сір
cүр
сұр
күр
мүр
мяр
түр

Output:
<1 әбігер
<1 күр
<1 қү
<1 қырық
<1 майлы
<1 мүр
<1 мяр
<1 сабыр
<1 сәлем
<1 сер
<1 сёр
<1 сұр
<1 сығу
<1 сір
<1 түр
<1 ұлан
<1 үсен
<1 ірі
<1 cүр

@Bawolff I double-checked and one letter "ү" isn't sorted, why?
Example: the word "сүр" at the end isn't sorted.

That's because you have a Latin "c" instead of a Cyrillic "с" for the input of that word. You can verify by copy-pasting it into this tool: https://www.fontspace.com/unicode/analyzer#e=0YHRltGAIGPSr9GA
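
A look-alike character like this can also be found locally with Python's standard `unicodedata` module. The sketch below is only an illustration; the helper name `non_cyrillic_letters` is made up, and the two strings are written with escapes so that the difference is visible in the source.

```python
import unicodedata


def non_cyrillic_letters(text):
    """Return the letters in text whose Unicode name is not CYRILLIC ..."""
    return [
        ch for ch in text
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("CYRILLIC")
    ]


suspect = "\u0063\u04af\u0440"  # Latin "c" + Cyrillic "ү" + Cyrillic "р"
correct = "\u0441\u04af\u0440"  # Cyrillic "с" + Cyrillic "ү" + Cyrillic "р"

for word in (suspect, correct):
    # Print each character with its code point and official Unicode name.
    for ch in word:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    print("non-Cyrillic letters:", non_cyrillic_letters(word))
```

The two words look identical on screen, but the first one starts with U+0063 LATIN SMALL LETTER C instead of U+0441 CYRILLIC SMALL LETTER ES.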

Oh, my bad. :)

Change #1114350 merged by jenkins-bot:

[mediawiki/core@master] Add "Ё" to first letter list for Kazakh

https://gerrit.wikimedia.org/r/1114350

MaryMunyoki lowered the priority of this task from High to Medium.
MaryMunyoki changed the task status from Open to In Progress. Feb 25 2025, 2:38 PM

The patch is probably deployed, but I think that I still see the categories sorted with Ә towards the end and not in the beginning. Perhaps somebody needs to run a maintenance script to update the category pages? Just a guess.

In addition to that patch, $wgCategoryCollation has to be changed for that wiki prior to running the maintenance script.

Thanks for the clarification! This makes me wonder, however: Why do we have it in local settings? For most languages, this should probably be the default in core.

At the time, there were concerns that ICU collations are not stable between versions of PHP. I think it was decided that this would be fine as long as we stored the collation version in cl_collation, so that articles would be fixed over time after a purge and update.php could detect whether the collation had changed. However, there was a disagreement on Gerrit over the best way to do this, so we ended up doing none of the options, and here we are.

Change #1126483 had a related patch set uploaded (by Jon Harald Søby; author: Jon Harald Søby):

[operations/mediawiki-config@master] Add uca collation for Kazakh

https://gerrit.wikimedia.org/r/1126483

Change #1126483 merged by jenkins-bot:

[operations/mediawiki-config@master] Add uca collation for Kazakh

https://gerrit.wikimedia.org/r/1126483

Mentioned in SAL (#wikimedia-operations) [2025-03-11T08:35:37Z] <kartik@deploy2002> Started scap sync-world: Backport for [[gerrit:1126483|Add uca collation for Kazakh (T384395)]]

Mentioned in SAL (#wikimedia-operations) [2025-03-11T08:38:37Z] <kartik@deploy2002> kartik, jhsoby: Backport for [[gerrit:1126483|Add uca collation for Kazakh (T384395)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2025-03-11T08:47:51Z] <kartik@deploy2002> Finished scap sync-world: Backport for [[gerrit:1126483|Add uca collation for Kazakh (T384395)]] (duration: 12m 13s)

Just as a reminder (because I didn't see it in the SAL): after changing the setting, you must run updateCollation.php; otherwise, pre-existing categories will be sorted wrongly and behave weirdly.

Mentioned in SAL (#wikimedia-operations) [2025-03-11T09:41:15Z] <kart_> Script run: mwscript updateCollation.php --wiki=kkwiki --previous-collation=uppercase (T384395)

jhsoby added a subscriber: KartikMistry.

The script took a really long time (almost an hour?), but finished just now, and all looks good as far as I can tell. Thanks @KartikMistry for deploying!