PLoS One. 2022 Jun 8;17(6):e0269648.
doi: 10.1371/journal.pone.0269648. eCollection 2022.

Best practices for spatial language data harmonization, sharing and map creation-A case study of Uralic

Timo Rantanen et al. PLoS One. 2022.

Abstract

Despite remarkable progress in digital linguistics, extensive databases of geographical language distributions are missing. This hampers both studies on language spatiality and public outreach on language diversity. We present best practices for creating and sharing digital spatial language data by collecting and harmonizing Uralic language distributions as a case study. Language distribution studies have utilized various methodologies, and the results are often available only as printed maps or written descriptions. In order to analyze language spatiality, the information must be digitized into geospatial data, which contain location, time and other parameters. When compiled and harmonized, these data can be used to study changes in languages' distributions and can be combined with, for example, population and environmental data. We also utilized the knowledge of language experts to integrate previous and new information on language distributions into state-of-the-art maps. The extensive database, including the distribution datasets and detailed map visualizations of the Uralic languages, is introduced alongside this article and is freely available.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Geographical overlap of different source materials concerning the distribution of the Khanty language(s) at the beginning of the 20th century.
The original sources Zsirai [34], Haarmann [41], Lytkin et al. [35], Grünthal & Salminen [33] and Abondolo [42] are visualized using the boundary of each polygon. The solid green area was created by merging the distributions from all Khanty sources, and it indicates the area where Khanty could have been spoken. Basemap datasets from Natural Earth [43], Digital Chart of the World [44] and ESRI [45].
Fig 2. Workflow for best practices in handling of language family data includes three separate phases: I processing and harmonization of spatial data collection: A path from analog and digital source data to a consistent geospatial database, II visualization combined with queries from experts in the case of lesser-studied languages, and creation of improved new maps based on updated information, III data sharing.
The outcomes of the best practices increase research opportunities and general understanding of language distributions. Original data and output are shown as rectangles, processing as ovals and overall benefits as hexagons. Details of the workflow are described in Section ‘Methods’.
Fig 3. Geographical distribution of the Uralic languages at the beginning of the 20th century.
The uncontroversial branches of the family are presented without overlapping areas. A list of original sources is available in S2 Appendix. Basemap datasets from Natural Earth [43], Digital Chart of the World [44] and ESRI [45].
Fig 4. Samoyedic languages at the beginning of the 20th century.
Languages are presented without overlapping areas. Original sources: Soviet Census of 1926 [54], Popov [55], Dolgikh [38], Dolgikh & Fajnberg [56], Dolgikh [57], Verbov [58], Grünthal & Salminen [33], Helimski [59], Tuchkova et al. [60], Siegl [61], Brykina & Gusev [62]. Basemap datasets from Natural Earth [43] and Digital Chart of the World [44].
Fig 5. Traditional (a) and current (b) distribution of Selkup.
A comparison of the maps demonstrates the changes in language and dialectal distribution over time. Original sources for traditional distribution are Grünthal & Salminen [33], Tuchkova et al. [60] and for current distribution Tuchkova et al. [60], Kazakevich [63]. Basemap datasets from Natural Earth [43] and Digital Chart of the World [44].
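
The merging step described in the Fig 1 caption (combining several source distributions into one area where a language could have been spoken) can be illustrated with a minimal sketch. This is not the authors' GIS workflow: it assumes, purely for illustration, that each source map has already been rasterized into a set of (row, col) grid cells, and the source names, cell values and the function `merge_sources` are hypothetical.

```python
from collections import Counter


def merge_sources(sources):
    """Merge rasterized source maps.

    `sources` maps a (hypothetical) source name to the set of grid cells
    that the source marks as part of the language area.
    Returns the union of all cells and a per-cell count of agreeing sources.
    """
    agreement = Counter()
    for cells in sources.values():
        # Each source contributes at most one vote per cell.
        agreement.update(cells)
    union = set(agreement)
    return union, agreement


# Toy data: three invented sources on a small grid (not real Khanty data).
sources = {
    "source_a": {(0, 0), (0, 1), (1, 1)},
    "source_b": {(0, 1), (1, 1), (1, 2)},
    "source_c": {(1, 1), (2, 2)},
}

union, agreement = merge_sources(sources)

# Cells on which every source agrees.
core = {cell for cell, n in agreement.items() if n == len(sources)}
```

Here `union` plays the role of the merged area in Fig 1, while `agreement` shows how many sources support each cell, which is one simple way to make the overlap between sources explicit. With real polygon data, the same idea would normally be carried out with a geometric union in GIS software rather than on grid cells.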

References

    1. Bellwood PS. First migrants: ancient migration in global perspective. 1st ed. Chichester, West Sussex, UK; Malden, MA: Wiley-Blackwell; 2013.
    2. Heggarty P. Prehistory through language and archaeology. In: Bowern C, Evans B, editors. Routledge Handbook of Historical Linguistics. London: Taylor and Francis; 2014.
    3. Reich D. Who We Are and How We Got Here: Ancient DNA and the new science of the human past. 1st ed. Oxford, New York: Oxford University Press; 2018. 368 p.
    4. Spriggs M, Reich D. An ancient DNA Pacific journey: a case study of collaboration between archaeologists and geneticists. World Archaeol. 2019; 51(4): 620–639.
    5. Creanza N, Ruhlen M, Pemberton TJ, Rosenberg NA, Feldman MW, Ramachandran S. A comparison of worldwide phonemic and genetic variation in human populations. Proc Natl Acad Sci. 2015; 112(5): 1265–1272. doi: 10.1073/pnas.1424033112
