Tibetan Corpus Linguistics: progress and future prospects

2015

Abstract

A set of four presentations discussing a Tibetan corpus linguistics project based at SOAS, University of London, from 2012 to 2015.

Tibetan Corpus Linguistics: our progress so far

Nathan W. Hill and Edward Garrett (SOAS, University of London)

Introducing Corpus Linguistics

Maslow's hierarchy of Corpus Linguistic needs

1. Script is in Unicode
2. Some digital texts are available
3. Segmenter: Words are divided (orthographically or with software)
4. Tagger: Part of speech of each word is identifiable
   E.g. 'sit on a chair' [noun] versus 'chair a meeting' [verb]
5. Lemmatizer: Different forms of a word are associated with each other
   E.g. 'sing', 'sang', 'sung', 'singing', 'sings' all associated with [sing]
6. Parser: Higher order syntactic analysis
   E.g. noun phrase detection, verbal rection, etc.

E-resources for English

Maslow's hierarchy of Corpus Linguistic needs, levels 1 to 6: We have all of them!

I want 'chair' as a verb. Part of speech tagging solves the problem. They are all 'chair' as a verb!

But it gets even fancier...

• Sketch: typical subjects, objects, modifiers, etc.
• Thesaurus: words with similar meanings

and fancier still…

• Comparing the profile of different words
• Suggestions of example sentences
• Comparing usage across region, time period, genre
• Automatic translation
• Speech recognition
• … and on and on ...

E-resources for Tibetan

Maslow's hierarchy of Corpus Linguistic needs

1. Script is in Unicode ✓
2. Some digital texts are available ✓
3. Segmenter: Words are divided
4. Tagger: Part of speech of each word is identifiable
5. Lemmatizer: Different forms of a word are associated with each other
6. Parser: Higher order syntactic analysis

Our focus: levels 3 and 4 (segmenter and tagger).

Tibetan e-resources: Old Tibetan Documents Online (OTDO)

མྱི་ myi 'person'
མྱི་ myi 'not'
A second try with མི་ mi 'person'

Tibetan in Digital Communication

Goals

1. A part-of-speech tagged corpus of Tibetan texts
2. An automatic word breaker
3. An automatic part-of-speech tagger

Our Corpora

Classical

• Mdzaṅs-blun: 9th century canonical narrative translated from Chinese (55,059+ words)
• Bu ston chos ḥbyuṅ: 13th century history, mostly quotes from earlier sources (89,129 words)
• Mi-la ras-paḥi rnam thar: 15th century biography (41,864+ words)
• Mar-paḥi rnam thar: 15th century biography (39,969+ words)
• Pavel: 39,011 words of various texts
• Balk: 85,143 words, catalog of Berlin Tibetica

POS tag set

The POS tag set will not be discussed much today.

Garrett, Edward and Hill, Nathan W. and Kilgarriff, Adam and Vadlapudi, Ravikiran and Zadoks, Abel (2015). "The contribution of corpus linguistics to lexicography and the future of Tibetan dictionaries." Revue d'Etudes Tibétaines 32: 51-86.
Word breaking

Our weakness: 0.95621 accurate (16 Aug, 2015)

Workflow: man and machine

Workflow: (1) Look-up of possible analyses

Word  Transliteration  Part-of-speech tag
རྒྱལ་པོ་  rgyal-po  n.count
དེ་  de  d.dem ~ cv.sem
ལ་  la  case.all ~ n.count
བཙུན་མོ་  btsun-mo  n.count
ལྔ་  lṅa  num.card
བརྒྱ་  brgya  num.card
ཡོད་  yod  v.invar
ཀྱང་  kyaṅ  cl.focus
།    punc

Workflow: (2) Pre-tagging

Same sentence after pre-tagging. Only these two rows differ from the look-up table or remain ambiguous:

Word  Transliteration  Part-of-speech tag
དེ་  de  d.dem
ལ་  la  case.all ~ n.count

Workflow: (3) Hand-tagging

Same sentence after hand-tagging. Every word now has a single tag:

Word  Transliteration  Part-of-speech tag
དེ་  de  d.dem
ལ་  la  case.all

Workflow: (4) Rule suggestions

[Screenshot of rule suggestions, 9 November 2013]
[Screenshot of the rule suggestion [neg] ← [n.count], 9 November 2013]

Workflow: (5) Checking consistency

Using a programme provided by Pablo Faria of UNICAMP.

Does de nas mean 'from him' or 'then'?
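The look-up step of the workflow can be pictured as a dictionary look-up that returns every candidate tag for each word. What follows is only a minimal sketch in Python, not the project's own code: the names `LEXICON`, `look_up` and `ambiguous` are hypothetical, the toy lexicon holds only the words of the example sentence, and transliteration stands in for Tibetan script.

```python
# Toy lexicon taken from the example sentence on the slides.
# Each word maps to all part-of-speech tags it could carry.
LEXICON = {
    "rgyal-po": ["n.count"],
    "de": ["d.dem", "cv.sem"],
    "la": ["case.all", "n.count"],
    "btsun-mo": ["n.count"],
    "lṅa": ["num.card"],
    "brgya": ["num.card"],
    "yod": ["v.invar"],
    "kyaṅ": ["cl.focus"],
    "།": ["punc"],
}


def look_up(words, lexicon=LEXICON):
    """Return (word, candidate_tags) pairs; unknown words get an empty list."""
    return [(word, list(lexicon.get(word, []))) for word in words]


def ambiguous(analysis):
    """Return the words that still carry more than one candidate tag."""
    return [word for word, tags in analysis if len(tags) > 1]


sentence = ["rgyal-po", "de", "la", "btsun-mo", "lṅa", "brgya", "yod", "kyaṅ", "།"]
analysis = look_up(sentence)

# Print the analysis in the slide's notation, with "~" between alternatives.
for word, tags in analysis:
    print(word, " ~ ".join(tags))

# Words left for pre-tagging rules or a human annotator to resolve.
print("ambiguous:", ambiguous(analysis))
```

In this toy sentence only de and la come out ambiguous, which matches the look-up table; pre-tagging and hand-tagging then narrow those two words to a single tag each.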
Disambiguating mi as negation or a noun

Isolating mi [n.count] after the genitive

rmoṅ-pa ḥi mi ḥgro ḥo 'an ignorant person goes'.
bskal-pa graṅs med-pa ḥi mi dge-ba ḥi las 'non-virtuous deeds of countless eons'.
rab tu ḥbyuṅ-ba ḥi mi rigs 'it is not proper to take ordination'.

RULE: If mi could be [n.count], follows a probable genitive, does not precede rigs, and does not precede a [n.v.xxx], and the word before the probable genitive is not an unambiguous [v.xxx] tag, then mark mi as a [n.count].

PATTERN: (\S+\|(?:\[(?!v\.)[^\]]*\])+\s+(?:འི་|ཀྱི་|གི་|གྱི་)\|\S+\s+(?:མི་| མ་))\|\S*\[n\.count\]\S*(?!\s+(?:རིགས་\||\S+\[n\.v\.))
REPLACE: $1|[n.count]

The rule based tagger

For more about the rule-based tagger, see: Garrett, Edward and Hill, Nathan W. and Zadoks, Abel (2014) 'A Rule-based Part-of-speech Tagger for Classical Tibetan.' Himalayan Linguistics, 13 (1). pp. 9-57.

Search

[Screenshots of the search interface]

Shingles

• Looking for [cl.focus] after [cv.loc]
• Looking for double case marking
• Common collocations

Discovering new things about Tibetan grammar

Conclusions on infinitive constructions

1. Past tense verbs do not occur as the subordinate verbs of indirect infinitives.
2. The matrix verbs gsol, med, grags, yod, ruṅ select the future tense.
3. It is possible that one group of verbs selects the present tense whereas others are equally happy to select the present and the future, but the overall rarity of future stems in the corpus makes the line between these two categories difficult to draw.

Garrett, Edward and Hill, Nathan W. and Zadoks, Abel (2013) 'Disambiguating Tibetan verb stems with matrix verbs in the indirect infinitive construction.' Bulletin of Tibetology, 49 (2). pp. 35-44.

How well does it work?

Accuracy and Ambiguity

Date            Corpus                     Measure    LexTagger  RuleTagger  Difference
14 Nov 2014     Classical (159,144 words)  Accuracy   1.00000    0.99906
                                           Ambiguity  2.50755    1.37665     1.13090
16 April 2015   Classical (226,021 words)  Accuracy   1.00000    0.99901
                                           Ambiguity  2.64819    1.40909     1.23910
16 August 2015  Classical (226,021 words)  Accuracy   1.00000    0.99898
                                           Ambiguity  2.72774    1.40353     1.32421

Thank you
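As a closing illustration, the rule for mi and the ambiguity measure can be tied together in a short Python sketch. It is not the project's implementation. It follows the prose statement of the rule rather than the regular expression, works on transliterated token lists, handles only mi, and assumes that "ambiguity" means the average number of candidate tags per word. The function names, the tag `case.gen`, and the candidate tags given to each example word are invented for illustration; `n.v.xxx` and `v.xxx` are the slides' own placeholders.

```python
# Genitive allomorphs in transliteration (the four forms listed in the PATTERN).
GENITIVES = {"ḥi", "kyi", "gi", "gyi"}


def mi_rule(tokens):
    """Mark mi as [n.count] where the prose rule applies.

    tokens is a list of (word, candidate_tags) pairs; a new list is returned.
    """
    out = [(word, list(tags)) for word, tags in tokens]
    for i, (word, tags) in enumerate(out):
        # mi must be a possible noun and have two words to its left.
        if word != "mi" or "n.count" not in tags or i < 2:
            continue
        # mi must follow a probable genitive.
        if out[i - 1][0] not in GENITIVES:
            continue
        # The word before the genitive must not be unambiguously verbal.
        before = out[i - 2][1]
        if before and all(tag.startswith("v.") for tag in before):
            continue
        # mi must not precede rigs or a possible verbal noun.
        if i + 1 < len(out):
            next_word, next_tags = out[i + 1]
            if next_word == "rigs" or any(t.startswith("n.v.") for t in next_tags):
                continue
        out[i] = (word, ["n.count"])
    return out


def ambiguity(tokens):
    """Average number of candidate tags per word (assumed definition)."""
    return sum(len(tags) for _, tags in tokens) / len(tokens)


MI = ("mi", ["neg", "n.count"])
GEN = ("ḥi", ["case.gen"])

# The three slide examples; the second and third are cut to the relevant window.
ex1 = [("rmoṅ-pa", ["n.v.xxx"]), GEN, MI, ("ḥgro", ["v.xxx"]), ("ḥo", ["cv.fin"])]
ex2 = [("med-pa", ["n.v.xxx"]), GEN, MI, ("dge-ba", ["n.v.xxx"]), GEN, ("las", ["n.count"])]
ex3 = [("ḥbyuṅ-ba", ["n.v.xxx"]), GEN, MI, ("rigs", ["v.xxx", "n.count"])]

for example in (ex1, ex2, ex3):
    tagged = mi_rule(example)
    print(tagged[2], ambiguity(example), "->", ambiguity(tagged))
```

Under these assumptions the rule resolves mi only in the first example. The other two are left ambiguous because mi precedes a verbal noun or rigs, so a later rule or a human annotator has to decide them.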