Low-resource languages in the digital reality – Report from the International Conference in Vilnius

On April 16–17, 2026, the international scientific conference entitled “Linguistic Variation in the Contemporary Sociocultural Context” took place in Vilnius. The event served as a platform for the exchange of ideas among linguists and sociologists; however, from the perspective of contemporary technological challenges, two presentations by our researchers gained particular significance. These were the only papers during the entire event that directly addressed the issue of low-resource languages.

Technological challenges: Protecting against “linguistic homogenization”

A research team consisting of Prof. Roman Roszko (ISS PAS), Dr hab. Danuta Roszko (University of Warsaw), and Dr Piotr Szatkowski (ISS PAS) presented the results of their work on constructing specialized corpora for the Masurian ethnolect and the Lithuanian dialect of Puńsk in Poland.

In the era of the rapid expansion of Large Language Models (LLMs), the researchers highlighted the phenomenon of “linguistic homogenisation”. The dominance of high-resource languages in AI training sets causes the specific structures of smaller varieties to be displaced by calques and simplifications.

Key aspects of the project include:

  • Resource normalisation challenges and the creation of proper processing pipelines: Because dialectal texts lack a standardized orthography, it was necessary to develop advanced processing pipelines. These cover tasks such as cleaning “orthographic noise” and performing full substantive correction.
  • CLARIN-PL and CLARIN-PL-BIZ-Bis infrastructure: The work is being carried out within the extended CLARIN-PL infrastructure, which allows data to be prepared in interoperable standards (TMX, TSV, JSON), ready for integration with systems such as “KonText”.
  • Benchmarking: The project aims to create closed test sets that will allow an objective assessment of how well contemporary AI models understand and generate texts in these specific linguistic varieties.
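The normalisation and export steps listed above can be illustrated with a minimal Python sketch. It is not the project's actual pipeline, which the report does not describe in detail: the function names, the record fields (id, variety, text), and the sample record are all hypothetical, and only the standard library is used. "Orthographic noise" is assumed here to mean decomposed diacritics, invisible formatting characters, and irregular whitespace; substantive correction by a linguist is outside the scope of the sketch.

```python
import csv
import io
import json
import re
import unicodedata


def clean_orthographic_noise(text: str) -> str:
    """Normalise a raw dialect string (illustrative rules only)."""
    # Compose base letters with combining marks, e.g. "n" + U+0301 -> "ń".
    text = unicodedata.normalize("NFC", text)
    cleaned = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Cf":
            # Invisible format characters (zero-width space, soft hyphen): drop.
            continue
        # Control characters (tabs, newlines) become plain spaces.
        cleaned.append(" " if category == "Cc" else ch)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", "".join(cleaned)).strip()


def to_tsv(records: list[dict]) -> str:
    """Serialise records as TSV with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["id", "variety", "text"])
    for record in records:
        writer.writerow([record["id"], record["variety"], record["text"]])
    return buffer.getvalue()


def to_json(records: list[dict]) -> str:
    """Serialise records as UTF-8 friendly JSON (diacritics kept readable)."""
    return json.dumps(records, ensure_ascii=False, indent=2)


# Hypothetical record: the identifier and variety label are placeholders.
records = [
    {
        "id": "doc-001",
        "variety": "punsk-lithuanian",
        "text": clean_orthographic_noise("Pun\u0301sk\u200b  dialect\n"),
    }
]
tsv_output = to_tsv(records)
json_output = to_json(records)
```

Keeping the cleaning step separate from the serialisers means the same cleaned records can be written to several interchange formats without repeating the normalisation.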

A Sociolinguistic Perspective: Can School Save a Language?

Complementing the technological view of multilingualism was an analysis by Andrzej Żak, MA (ISS PAS), of the status of the Kashubian language. The researcher employed the term “collateral language” – a variety whose linguistic status has been historically contested and which, despite legal recognition, currently struggles with revitalization challenges.

The main findings of the study are:

  • The Educational Paradox: Despite 30 years of teaching Kashubian in schools and its status as the only regional language in Poland, statistics indicate a decline in the number of active users.
  • Extra-systemic Barriers: An analysis of sociolinguistic interviews revealed that the key obstacles are psychological and ideological factors, such as the low social prestige of the language and a deeply rooted sense of shame among older generations.
  • Future Strategy: The study demonstrates that institutionalization alone (schools, government offices) is insufficient. For a language to survive, a change in social attitudes and the construction of a new, positive linguistic identity are essential.

Andrzej Żak’s participation was funded by the National Science Centre (NCN) SONATA BIS grant awarded to Prof. Nicole Dołowy-Rybińska: “Linguistic diversity in Poland: collateral languages, language-oriented activities and conceptualization of collective identity” (2020/38/E/HS2/00006).

Summary: The role of CLARIN-PL and CLARIN-PL-BIZ-Bis projects in heritage protection

These presentations clearly demonstrated that protecting smaller linguistic varieties in the 21st century must follow a dual track. On one hand, advanced linguistic engineering – implemented through projects such as CLARIN-PL-BIZ-Bis – is essential to bring these languages into the digital sphere. On the other hand, sociolinguistic reflection is necessary to understand the human context of their use.

The fact that the topic of low-resource languages in Vilnius was raised almost exclusively by our representatives underscores the leading role of the Institute of Slavic Studies of the Polish Academy of Sciences (IS PAN) and the CLARIN-PL and CLARIN-PL-BIZ-Bis consortia in defining the directions of modern Digital Humanities. Without the active creation of data resources, smaller ethnolects are at risk of digital exclusion and fading into non-existence in a world governed by algorithms.

The project “CLARIN – Common Language Resources and Technology Infrastructure” is funded under the Second Priority of the European Funds for a Modern Economy 2021–2027 (FENG) program. Consortium members: Wrocław University of Science and Technology (leader), Institute of Computer Science of the Polish Academy of Sciences, Institute of Slavic Studies of the Polish Academy of Sciences, University of Lodz, University of Wrocław. 

Prof. Roman Roszko delivering his presentation. Photo: private archive.
Andrzej Żak delivering his presentation. Photo: private archive.

Launch of the New PLLuM Language Models

We are pleased to announce the release of 11 new PLLuM models. Their primary advantage lies in their exceptional proficiency in the Polish language – including official/administrative styles – as well as a deep understanding of native cultural, historical, and legal contexts. These models are designed to support public administration, businesses, and individual users. Crucially, they have been released under open licenses that are fully compliant with the requirements of the EU AI Act.

The Specificity of the New PLLuM Models

The new PLLuM variants can significantly enhance the efficiency of public administration. They are capable of generating texts for over 20 types of official documents, supporting office and operational tasks, interpreting the context of administrative procedures, simplifying complex legal language, and working with standardized legal document templates.

Based on an analysis of real user interactions with PLLuM Chat, we have also developed mechanisms that enable the generation of safer and more precise responses.

Four Model Sizes

The new model family includes four sizes: refreshed versions of 8B, 12B, and 70B, along with a brand-new 4B category:

  • 4B – The smallest and fastest models with low computational requirements, ideal for task-specific fine-tuning.
  • 8B and 12B – Providing an excellent balance between speed and quality; recommended for production deployments, such as serving as the engine for RAG (Retrieval-Augmented Generation) systems.
  • 70B – The largest and most advanced model, designed to handle complex tasks effectively without the need for additional fine-tuning.

All versions are available under open licenses with full documentation compliant with the AI Act, including detailed descriptions of the models, data sources, training methods, and quality evaluation metrics.

Model Training

The models were developed in 2025 on commission from the Ministry of Digital Affairs as part of the HIVE AI project. The project was implemented by a consortium consisting of: NASK PIB (leader), ACK Cyfronet AGH, Centre for Information Technology (COI), Institute of Computer Science PAS, Institute of Slavic Studies PAS, OPI PIB, Wrocław University of Science and Technology, and the University of Łódź.

The training process was based on a new, rich, and diverse corpus of text materials. The data was collected legally through licensing agreements, public domain sources, and Creative Commons resources.

On behalf of the Institute of Slavic Studies PAS, the work was coordinated by Dr hab. Roman Roszko, Prof. ISS PAS, with a team including Tomasz Bernaś, MA, and Valéry Trân Thiên, MA, bridging the fields of computer science and linguistics.

Information about the premiere of new models is also available on the website of the Ministry of Digital Affairs.

   

The SILK Team at the Masters & Robots Event

 

On October 21, 2025, Valéry Trân Thiên, MA, of the Semantics and Computational Linguistics Team at ISS PAS represented the HIVE AI consortium at Masters and Robots Warsaw 2025. The event attracted experts, entrepreneurs, innovators, technologists, and leaders from renowned global academic institutions, associations, and technology blogs, including Google, NYU Stern School of Business, Executive Coaching and Consulting Institute, Imperial Business School in London, Eindhoven AI Systems Institute, Copenhagen Business School, Queensland University of Technology, Human Future, and What’s Next?

At the Ministry of Digital Affairs of Poland stand, Valéry Trân Thiên addressed visitor inquiries about the progress of artificial intelligence development within Poland. He discussed both commonalities and distinctions between prominent Polish generative models, Bielik and PLLuM. The majority of questions centered on the PLLuM model and its deployments in the municipal offices of Gdynia and Łódź.

Since 2024, the Semantics and Computational Linguistics Team of the Institute of Slavic Studies, PAS, has been actively involved in advancing artificial intelligence through participation in two major projects funded by the Ministry of Digital Affairs: PLLuM (“Responsible development of an open large language model PLLuM [Polish Large Language Universal Model] to support breakthrough technologies in the public and economic sectors, including an open-source Polish-language intelligent citizen assistant”) and HIVE AI (“HIVE AI: Development and pilot deployment of large language models in Polish public administration”). The team’s participation in prominent national and international events is a key component of wider efforts to promote the PLLuM family of Polish generative models.

Valéry Trân Thiên (ISS PAS) pictured at the Ministry of Digital Affairs stand. Photo: private archive.

The Semantics and Computational Linguistics Team in Katowice

   

On October 6–7, 2025, at the International Congress Centre in Katowice, during the XXIII Local Government Capital and Finance Forum, Dr hab. Roman Roszko, Prof. ISS PAS, head of the Semantics and Computational Linguistics Team at ISS PAS, promoted effective ways to apply artificial intelligence in the daily work of local governments in Poland.

The Forum organizers invited Prof. R. Roszko to speak in the debate “Artificial Intelligence in Local Governments – Real Applications or Just a Trend?”. Owing to high interest, the debate was extended into the lunch break. Local government officials were interested in specific application scenarios for generative models created within the HIVE AI project, as well as the tools and resources of CLARIN-PL-BIZ-Bis. Their primary concerns revolved around the security and quality of solutions utilizing artificial intelligence. The anxieties of local government representatives, who mostly associated artificial intelligence with readily available online chatbots, were alleviated during a presentation of RAG instances operating on PLLuM models. Solutions of this type are currently being implemented by the HIVE AI consortium at, among other places, Gdynia City Hall.

Prof. Roman Roszko also spoke with representatives of companies that provide digital services to government offices at various administrative levels. A promising collaboration emerged between ISS PAS and the Warsaw-based company ABC PRO sp. z o.o. (https://abcpro.pl/) regarding legal electronic monitoring – solutions that the company offers and plans to extend with additional functionalities related to artificial intelligence.

Roman Roszko posing against an “event backdrop”. Photo: private archive.
In the photo, from left: Piotr Jegorow (ABC PRO, Managing Director), Roman Roszko (ISS PAS), Ryszard Adam Grytner (ABC PRO, President of the Board). Photo: ABC PRO.

The Semantics and Computational Linguistics Team at “Spodek” Arena in Katowice

 

On September 30, 2025, Valéry Trân Thiên, MA, from the Semantics and Computational Linguistics Team at ISS PAS represented the HIVE AI consortium during the Synerise Cup 2025 – the jubilee national chess championships for schools and kindergartens. At the Ministry of Digital Affairs of Poland booth, he answered visitors’ questions about artificial intelligence and Polish Large Language Models (PLLuM). Every visitor to the booth could converse with the PLLuM generative model. Testers evaluated the knowledge of the Polish generative model, including its understanding of chess.

In 2024, the Semantics and Computational Linguistics Team of the Institute of Slavic Studies, Polish Academy of Sciences, participated in the PLLuM project (“Responsible development of an open large language model PLLuM [Polish Large Language Universal Model] to support breakthrough technologies in the public and economic sectors, including an open, Polish-language intelligent citizen assistant”), funded by the Ministry of Digital Affairs of Poland. The result of this work is the PLLuM family of generative models in base, chat, instruct, and non-commercial versions.

In 2025, the PLLuM models are being further developed and implemented in mObywatel (the Polish digital ID app) and selected offices of state administration within a new project called HIVE AI (“HIVE AI: Development and pilot implementation of large language models in Polish public administration”), funded by the Ministry of Digital Affairs of Poland. Stands operated by the Ministry of Digital Affairs at important national and international events are part of a larger promotion of the PLLuM family of Polish generative models.

Event logo. Photo: private archive.
Valéry Trân Thiên (ISS PAS) pictured at the Ministry of Digital Affairs stand. Photo: private archive.

The Semantics and Computational Linguistics Team in Vilnius

On September 25–27, 2025, the Semantics and Computational Linguistics Team (SILK Team) of the Institute of Slavic Studies, Polish Academy of Sciences participated in the 7th International Conference on Applied Linguistics “Languages and People” (LiTaKA) at Vilnius University.

Representing the SILK Team, Prof. Andrius Utka and Prof. Jurgita Vaičenonienė presented the team’s achievements in three projects carried out at the Institute of Slavic Studies, Polish Academy of Sciences, related to broadly defined digital humanities.

Authors of the presentation: Roman Roszko (Institute of Slavic Studies PAS), Tomasz Bernaś (Institute of Slavic Studies PAS), Danuta Roszko (University of Warsaw), Andrius Utka (Vytautas Magnus University), Jurgita Vaičenonienė (Vytautas Magnus University).

The title of the presentation: (en) Lithuanian Text Data and Organic Instruction Corpora in the DARIAH-HUB, HIVE AI, and PLLuM Projects; (lt) Lietuviškų tekstinių duomenų ir instrukcijų tekstynai DARIAH-HUB, HIVE AI ir PLLuM projektuose; (pl) Litewskojęzyczne korpusy danych tekstowych i instrukcji organicznych w projektach DARIAH-HUB, HIVE AI i PLLuM.

Presentation languages: Lithuanian and English.

In 2024, Prof. Jurgita Vaičenonienė and Prof. Andrius Utka of VDU were members of the SILK Team within the PLLuM project (consortium leader: Wrocław University of Science and Technology; members: Institute of Computer Science of the Polish Academy of Sciences, Institute of Slavic Studies of the Polish Academy of Sciences, National Information Processing Institute – OPI PIB, Research and Academic Computer Network – NASK PIB, University of Łódź), funded by the Ministry of Digital Affairs of Poland. Both were responsible for acquiring a portion of the Baltic text data, describing it with metadata, deduplicating and cleaning it, and preparing the organic instructions (localization, identity-based, single- and multi-turn generative, etc.) needed during the pre- and post-training phases of the PLLuM model family. The presentation sparked a lively discussion about large language models and their applications. Attendees also asked why the SILK Team at ISS PAS conducts research extending far beyond the Polish language, particularly since such work requires not only specialized knowledge but, above all, significant effort and substantial funding (energy costs).
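The deduplication step mentioned above can be sketched in a few lines of Python. This is only an illustration of exact-duplicate removal by content hashing, under the assumption that documents are dictionaries with a "text" field; the report does not say which deduplication method was actually used in PLLuM, and the function names and sample documents are hypothetical.

```python
import hashlib
import re
import unicodedata


def fingerprint(text: str) -> str:
    """Hash a document after light normalisation, so trivial variants collide."""
    normalized = unicodedata.normalize("NFC", text).casefold()
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(documents: list[dict]) -> list[dict]:
    """Keep the first document for each fingerprint, preserving input order."""
    seen = set()
    unique = []
    for document in documents:
        key = fingerprint(document["text"])
        if key not in seen:
            seen.add(key)
            unique.append(document)
    return unique


# Hypothetical sample: the second entry differs only in case and spacing.
documents = [
    {"id": 1, "text": "Labas rytas"},
    {"id": 2, "text": "labas   rytas "},
    {"id": 3, "text": "Laba diena"},
]
unique_documents = deduplicate(documents)
```

Hashing catches only copies that are identical after normalisation; near-duplicates with small wording differences would need a similarity-based method, which is not shown here.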

The SILK Team and CLARIN-PL have been collaborating closely with the Lithuanian CLARIN-LT structures for ten years. Professor Jurgita Vaičenonienė from Vytautas Magnus University is the national coordinator of the CLARIN-LT infrastructure. Professor Sigita Rackevičienė (pictured below) from Mykolas Romeris University in Vilnius is one of the partners of CLARIN-LT.

In the photo, from left: Assoc. Prof. Jurgita Vaičenonienė (ISS PAS and Vytautas Magnus University in Kaunas), Professor Sigita Rackevičienė (Mykolas Romeris University in Vilnius), Assoc. Prof. Andrius Utka (ISS PAS and Vytautas Magnus University in Kaunas). Photo: CLARIN-LT archive.
Assoc. Prof. Jurgita Vaičenonienė (ISS PAS and Vytautas Magnus University in Kaunas). Photo: Hanna Holub.
Assoc. Prof. Andrius Utka (ISS PAS and Vytautas Magnus University in Kaunas). Photo: Hanna Holub.
Title slide.

 

The Semantics and Computational Linguistics Team in Warsaw

On September 24–25, 2025, the Semantics and Computational Linguistics Team of the Institute of Slavic Studies of the Polish Academy of Sciences took part in a combined working, workshop, and promotional meeting of the DARIAH-HUB project. The main goal of the event was to present the results achieved so far within the project and to outline the further development and maintenance of the structures being created. The participants were Dr hab. Roman Roszko, Prof. ISS PAS, the project coordinator representing the Institute of Slavic Studies PAS, and researchers from the Semantics and Computational Linguistics Team: Daniel Dziułka, MA (ISS PAS), Karol Kościelniak, PhD (Adam Mickiewicz University), and Valéry Trân Thiên, MA (ISS PAS). During the promotional meeting, consortium representatives presented, among other things, the Interdisciplinary Research Platform (IRP) and the Archaeological Module. In informal discussions, Prof. Roman Roszko raised the issue of extracting keywords and key phrases from textual resources using models from the PLLuM family as a way to enrich research objects at the second level of IRP integration.

The “Digital Research Infrastructure for the Arts and Humanities – DARIAH-PL” project, operating as DARIAH.HUB, is funded through investment A2.4.1 (“Investments in expanding research potential”) within the development plan of the National Recovery and Resilience Plan.

The project’s implementation contributes to the long-term plan for strengthening collaboration in building digital infrastructure for the humanities and arts in Poland.

The DARIAH-HUB consortium comprises the Institute of Informatics PAS (leader), Institute of Literary Research PAS, Institute of Bioorganic Chemistry PAS – Poznań Supercomputing and Networking Centre (PCSS), The Tadeusz Manteuffel Institute of History PAS, Institute of Polish Language PAS, Institute of Slavic Studies PAS, Institute of Art PAS, Poznan University of Technology, Wrocław University of Science and Technology, Maria Curie-Skłodowska University in Lublin, Adam Mickiewicz University in Poznań, University of Warsaw, and University of Wrocław.

DARIAH-HUB event participants. From left: Karol Kościelniak (Adam Mickiewicz University in Poznań), Roman Roszko (Institute of Slavic Studies PAS), Valéry Trân Thiên (Institute of Slavic Studies PAS), Daniel Dziułka (Institute of Slavic Studies PAS). Photo: Maciej Piasecki.
Breakfast combined with troubleshooting project issues related to license acquisition and with planning future tasks. In the photo, from left: Roman Roszko (Institute of Slavic Studies of the Polish Academy of Sciences) and Karol Kościelniak (Adam Mickiewicz University in Poznań). Photo: private archive.
During the DARIAH-HUB meeting, Prof. Roman Roszko (Institute of Slavic Studies PAS) and Prof. Maciej Piasecki (Wrocław University of Science and Technology) summarized the CLARIN-PL Workshops (the 15th edition, titled “CLARIN in Research Practice”), which took place in Szczecin on September 22–23, 2025. They also discussed key issues related to the CLARIN-PL-BIZ-Bis project. Photo: private archive.

The Semantics and Computational Linguistics Team in Krakow

 

On September 22–24, 2025, the Semantics and Computational Linguistics Team (SILK) participated in the “82nd Congress of the Polish Linguistic Society: Continuations and Innovations – Celebrating the Centenary of the Polish Linguistic Society”, held at Jagiellonian University in Krakow. Representing SILK, Tomasz Bernaś presented the team’s achievements in two projects underway at the Institute of Slavic Studies of the Polish Academy of Sciences (ISS PAS), both focused on the development and implementation of the large PLLuM family of generative models.

Authors of the presentation: Roman Roszko (Institute of Slavic Studies PAS), Tomasz Bernaś (Institute of Slavic Studies PAS), Danuta Roszko (University of Warsaw).

The title of the presentation: (en) The Polish Large Language Model (PLLuM) is Being Built with Contributions from the Institute of Slavic Studies, Polish Academy of Sciences, Including Corpora of Polish, Slavic and Baltic Text Data and Instructions; (pl) Korpusy polskich, słowiańskich i bałtyckich danych tekstowych oraz instrukcji wkładem Instytutu Slawistyki PAN w budowę polskiego dużego modelu generatywnego PLLuM.

Presentation languages: Polish and English.

During the discussion, Tomasz Bernaś, MA, answered questions regarding the ethics of PLLuM models, methods for verifying effective TDM (Text and Data Mining) exceptions, the EU Directive on Copyright in the Digital Single Market (DSM), and its Polish amendment from September 2024 in the context of access to the text data market.

In 2024, as part of the PLLuM consortium (Wrocław University of Science and Technology – leader, Institute of Computer Science of the Polish Academy of Sciences, Institute of Slavic Studies of the Polish Academy of Sciences, National Information Processing Institute – OPI PIB, Research and Academic Computer Network – NASK PIB, University of Łódź), the SILK Team carried out a task commissioned by the Ministry of Digital Affairs: creating the Polish large language models PLLuM. Work on developing the PLLuM model family now continues in the HIVE AI project (Research and Academic Computer Network [NASK PIB] – leader, Central Information Technology Centre – COI, Cyfronet at the AGH University of Krakow, Institute of Computer Science of the Polish Academy of Sciences, Institute of Slavic Studies of the Polish Academy of Sciences, National Information Processing Institute – OPI PIB, University of Łódź). At the same time, advanced work is underway to implement PLLuM models in mObywatel (the Polish digital ID app), and pilot implementations are being carried out in state administration and local government offices, for example at Gdynia City Hall.

Title slide.
Tomasz Bernaś represented the Institute of Slavic Studies of the Polish Academy of Sciences at the “82nd Congress of the Polish Linguistic Society” in Krakow. Photo: event organizer.

Visit of Dr Lara Sorgo to ISS PAS

From 2 to 13 June 2025, the Institute of Slavic Studies of the Polish Academy of Sciences (Instytut Slawistyki PAN) welcomed Dr Lara Sorgo of the Institute of Ethnic Studies in Ljubljana as part of the PLURILINGMEDIA COST Action project. During her Short-Term Scientific Mission, Dr Sorgo had an excellent opportunity to learn about media in minority languages and their revitalisation, and to exchange ideas with fellow researchers.

Dr Lara Sorgo during a lecture for PhD students. Photo: private archive.

During her visit, Dr Sorgo gave a lecture to students of the Anthropos Doctoral School. In her lecture, titled “Between Policy and Practice: The Slovenian Model of Minority Protection”, she provided an overview of Slovenia’s minority protection model and presented empirical findings from research projects focusing on education, public administration and the media. Particular attention was paid to how radio, television, print, and digital media meet the needs of minority communities, with structured measures and practical challenges being highlighted.

ISS PAS also organised a roundtable on minority language media (MLM). This scientific exchange provided a valuable opportunity to share ideas and obtain constructive feedback on MLM from experts in different research fields. Particular emphasis was placed on the role of minority language media in further developing and upgrading the theoretical framework of Dr Sorgo’s research.

Dr Lara Sorgo during a seminar at the Institute of Slavic Studies, Polish Academy of Sciences. Photo: Wiktoria Nylec.
In the photo from left to right: Dr Lara Sorgo, Prof. Karolina Bielenin-Lenczowska, and Prof. Nicole Dołowy-Rybińska. Photo: Wiktoria Nylec.
Participants of the seminar on minority language media. Photo: Wiktoria Nylec.
Prof. Nicole Dołowy-Rybińska during a seminar at the Institute of Slavic Studies, Polish Academy of Sciences. Photo: Wiktoria Nylec.

 

Ukrainian Winter is Behind Us

Ukrainian Winter, a series of lectures on interdisciplinary Ukrainian studies that began on January 29, 2025, ended on March 1. The series was organised by the Ivan Franko National University in Lviv in cooperation with the Institute of Slavic Studies, Polish Academy of Sciences, Vision Ukraine Netzwerk: Bildung, Sprache und Migration and the UCL Ukrainian Society.

The Ukrainian Winter series gathered over 300 registered participants and consisted of 14 lectures by Ukrainian and international scholars on a wide range of topics related to Ukrainian culture, history and society. The inaugural lecture on decolonization processes in Ukraine was given by Myroslav Shkandrij, Professor Emeritus at the University of Manitoba.

On February 21, 2025, Dr Olha Tkachenko of the Institute of Slavic Studies of the Polish Academy of Sciences delivered a lecture entitled “Decolonial Content on Ukrainian YouTube: Revealing «kakaya raznitsa» and Blurring Cultural Boundaries with Russia”.

Lecture by Dr. Olha Tkachenko. Photo: private archive.
Institute of Slavic Studies, Polish Academy of Sciences
