On April 16–17, 2026, the international scientific conference entitled “Linguistic Variation in the Contemporary Sociocultural Context” took place in Vilnius. The event served as a platform for the exchange of ideas among linguists and sociologists; however, from the perspective of contemporary technological challenges, two presentations by our researchers gained particular significance. These were the only papers during the entire event that directly addressed the issue of low-resource languages.
Technological challenges: Protecting against “linguistic homogenization”
A research team consisting of Prof. IS PAN Roman Roszko (ISS PAS), Dr hab. Danuta Roszko (UV), and Dr Piotr Szatkowski (ISS PAS) presented the results of their work on constructing specialized corpora for the Masurian ethnolect and the Lithuanian Puńsk dialect in Poland.
In the era of the rapid expansion of Large Language Models (LLMs), the researchers highlighted the phenomenon of “linguistic homogenisation”. The dominance of high-resource languages in AI training sets causes the specific structures of smaller varieties to be displaced by calques and simplifications.
Key aspects of the project include:
- Resource normalisation challenges and the creation of proper processing pipelines Due to the lack of standardized orthography in dialectal texts, it was necessary to develop advanced processing pipelines. These include tasks such as cleaning “orthographic noise” and performing full substantive correction.
- CLARIN-PL and CLARIN-PL-BIZ-Bis infrastructure The work is being carried out within the extended CLARIN-PL infrastructure, which allows for data preparation in interoperable standards (TMX, TSV, JSON), ready for integration with systems such as “KonText”.
- Benchmarking The project aims to create closed test sets that will allow for an objective assessment of how contemporary AI models perform in understanding and generating texts in these specific linguistic varieties.
A Sociolinguistic Perspective: Can School Save a Language?
Complementing the technological view of multilingualism was an analysis by MA Andrzej Żak (ISS PAS) regarding the status of the Kashubian language. The researcher employed the term “collateral language” – a variety whose linguistic status has been historically contested and which, despite legal recognition, currently struggles with revitalization challenges.
The main findings of the study are:
- The Educational Paradox Despite 30 years of teaching Kashubian in schools and its status as the only regional language in Poland, statistics indicate a decline in the number of active users.
- Extra-systemic Barriers An analysis of sociolinguistic interviews revealed that key obstacles are psychological and ideological factors, such as low social prestige of the language and a deeply rooted sense of shame among older generations.
- Future Strategy The study demonstrates that institutionalization alone (schools, government offices) is insufficient. For a language to survive, a change in social attitudes and the construction of a new, positive linguistic identity are essential.
Andrzej Żak’s participation was funded by the National Science Centre (NCN) SONATA BIS grant awarded to Prof. Nicole Dołowy: “ Linguistic diversity in Poland: collateral languages, language-oriented activities and conceptualization of collective identity” (2020/38/E/HS2/00006).

Summary: The role of CLARIN-PL and CLARIN-PL-BIZ-Bis projects in heritage protection
These presentations clearly demonstrated that protecting smaller linguistic varieties in the 21st century must follow a dual track. On one hand, advanced linguistic engineering – implemented through projects such as CLARIN-PL-BIZ-Bis – is essential to bring these languages into the digital sphere. On the other hand, sociolinguistic reflection is necessary to understand the human context of their use.
The fact that the topic of low-resource languages in Vilnius was raised almost exclusively by our representatives underscores the leading role of the Institute of Slavic Studies of the Polish Academy of Sciences (IS PAN) and the CLARIN-PL and CLARIN-PL-BIZ-Bis consortia in defining the directions of modern Digital Humanities. Without the active creation of data resources, smaller ethnolects are at risk of digital exclusion and fading into non-existence in a world governed by algorithms.
The project “CLARIN – Common Language Resources and Technology Infrastructure” is funded under the Second Priority of the European Funds for a Modern Economy 2021–2027 (FENG) program. Consortium members: Wrocław University of Science and Technology (leader), Institute of Computer Science of the Polish Academy of Sciences, Institute of Slavic Studies of the Polish Academy of Sciences, University of Lodz, University of Wrocław.





























