CLARIN-PL

http://clarin-pl.eu/

The Institute of Slavic Studies of the Polish Academy of Sciences – along with the Institute of Computer Science of the Polish Academy of Sciences, the Polish-Japanese Academy of Information Technology, the University of Lodz, the University of Wrocław and the Wrocław University of Technology – is a member of CLARIN-PL, a Polish research consortium which is part of the pan-European research infrastructure CLARIN (Common Language Resources and Technology Infrastructure). Poland is one of the seven founding members of CLARIN ERIC.

The CLARIN-PL council: dr inż. Maciej Piasecki (Wrocław University of Technology; coordinator), prof. dr hab. Krzysztof Marasek (Polish-Japanese Academy of Information Technology), prof. dr hab. Adam Pawłowski (University of Wrocław), dr Piotr Pęzik (University of Lodz), dr Maciej Ogrodniczuk (Institute of Computer Science, Polish Academy of Sciences), dr hab. Roman Roszko, prof. IS PAN (Institute of Slavic Studies, Polish Academy of Sciences).

What is CLARIN ERIC?

CLARIN ERIC (Common Language Resources and Technology Infrastructure, European Research Infrastructure Consortium) is a pan-European research infrastructure which aims to offer language resources and tools for processing all European languages to researchers in the humanities and social sciences.

What are language resources?

Language resources are databases providing a formal description of various aspects of natural language, for example monolingual, bilingual and multilingual corpora (searchable collections of texts described using linguistic metadata and available online), dictionaries, translation memories, glossaries, grammars, stochastic language models, and so on.

What are language tools?

Language tools are software solutions for automatic analysis of text and speech at different levels: formal (morphological, syntactic), semantic and pragmatic. Such tools also include special software for particular tasks related to textual data processing, for example programmes for recognition of proper names and their semantic classification, or for mapping language data.

The significance of language resources and tools

Different types of language resources and tools are the basic components for building language processing systems. For languages in which they have not been sufficiently developed, the application of natural language engineering is greatly limited.

The structure of CLARIN ERIC

CLARIN ERIC is a distributed research infrastructure consisting of several dozen technology centres located in nineteen member states, and one international organisation (as of January 2018); two countries (France and the United Kingdom) cooperate as observers. The list of CLARIN certified centres also includes those in the United States and Spain, which are not member states. The number of member states is steadily growing, which makes CLARIN ERIC one of the most dynamically developing ESFRI infrastructures.

The structure of CLARIN is described as a networked federation. The central body consists of only a few people. The vast majority of activities are undertaken directly at the level of members and are financed from their budgets. The central budget of CLARIN ERIC comes from membership fees and is considered low in relation to the scope of the tasks undertaken. The functioning of the entire infrastructure is ensured by in-kind contributions from individual members. The principal role of the central systems is to integrate services provided by local centres maintained by the members.

The CLARIN ERIC infrastructure is based on common standards and a limited but well-defined set of central functionalities. In addition, each year it focuses on realistically identified, common thematic areas and selected functionalities. Thanks to this, the diverse contribution of individual members is well integrated and harmonised within the framework of a dynamically developing system of pan-European research infrastructure.

CLARIN ERIC was established as one of the first ERIC consortia in the humanities and social sciences (the first one in which Poland participates). It is currently one of the ESFRI infrastructures most highly rated by the European Commission and the research community (as confirmed, for example, by CLARIN ERIC obtaining the status of Landmark under ESFRI).

The aims and objectives of CLARIN ERIC

The strategic aim of the CLARIN ERIC infrastructure is to consolidate language resources and tools for all natural languages used in Europe in one network system. The system is based on common standards for the description, access and sharing of collected (and/or created) language resources and tools for researchers in the humanities and social sciences, who are the main users of CLARIN.

The CLARIN ERIC infrastructure not only consolidates language resources and tools, but also provides ready-to-use network services that enable researchers to use them. Based on the needs of particular tasks, CLARIN designs, develops and provides access to research applications facilitating work with collections of texts. This activity of CLARIN can be characterised as practical action for the development of new methods of digital humanities and digital social sciences in the pan-European, multilingual and multicultural dimension.

The CLARIN ERIC infrastructure consists of CLARIN centres connected via the Internet. This infrastructure provides a unified system of login and authorisation, in which all users use their own accounts to work from their home centres.

CLARIN promotes open access and open licences and, accordingly, creates only openly licensed resources and tools. However, not all resources and tools deposited by users (e.g. in the DSpace repository https://clarin-pl.eu/dspace/ on the CLARIN-PL website http://clarin-pl.eu/en/home-page/) can be made available in open access, as it is the creators themselves who decide on access options.

CLARIN requires that all resources and tools should be described using a single common standard of metadata called CMDI (Component MetaData Infrastructure). Key functionalities related to searching the resources are provided at the central level, while all research services and applications are developed and offered by individual national consortia within one connected infrastructure.

CLARIN infrastructure

The CLARIN infrastructure is a network of centres of different types:

  • A-Centres, constructing backbone technology and services necessary for network operation;
  • B-Centres, providing users with tools and resources related to natural language processing (the basic element of the network);
  • C-Centres, sharing resources description (metadata);
  • K-Centres, supporting users and providing access to knowledge and experts.

The Institute of Slavic Studies CLARIN-PL team

As of 2018, the Institute of Slavic Studies CLARIN-PL team is composed of: dr hab. Roman Roszko, prof. IS PAN (head), dr Maksim Duškin (since 2016), dr hab. Danuta Roszko (University of Warsaw), dr Wojciech Sosnowski and dr Roman Tymoshuk (since 2016).

Until 2016 the team was headed by prof. dr hab. Violetta Koseska, and the list of members also included (in alphabetical order): dr Anna Kisiel, dr Natalia Kotsyba and dr hab. Joanna Satoła-Staśkowiak.

The key task of the team is to develop databases of translation memories, glossaries and corpora of Slavic and Baltic languages.

In mid-2016 the team completed their work on the development of a multilingual database of translation memories – Polish, Bulgarian, Lithuanian and Russian – containing almost seventeen and a half million wordforms. At the same time, the team also developed a framework for the semantic annotation of corpus data; for more information, see:

  • Koseska, Violetta, and Roman Roszko (2015). On semantic annotation in Clarin-PL parallel corpora. Cognitive Studies | Études cognitives, 2015(15), 211–236. DOI: https://doi.org/10.11649/cs.2015.016
  • Koseska, Violetta, and Roman Roszko (2016). Języki słowiańskie i litewski w korpusach równoległych Clarin-PL [Slavic languages and the Lithuanian language in the Clarin-PL parallel corpora]. Studia z Filologii Polskiej i Słowiańskiej, 51, 191–217. DOI: https://doi.org/10.11649/sfps.2016.011
  • Roszko, Danuta, and Roman Roszko (2016). Polsko-litewskie korpusy równoległe. Elementy anotacji semantycznej z zakresu modalności możliwościowej i kwantyfikacji zakresowej [Polish-Lithuanian parallel corpora: Elements of the semantic annotation related to hypothetical and imperceptive modalities and scope quantification]. In Ewa Gruszczyńska, Agnieszka Leńko-Szymańska (eds.), Polskojęzyczne korpusy równoległe | Polish language Parallel Corpora, pp. 119–132. Warszawa: Instytut Lingwistyki Stosowanej, Wydział Lingwistyki Stosowanej Uniwersytetu Warszawskiego. http://rownolegle.blog.ils.uw.edu.pl/files/2016/03/0000_Korpusy.pdf

Since 2016, the Institute CLARIN team has been expanding the database of translation memories and developing a series of annotated parallel bilingual corpora (Polish-Lithuanian, Polish-Bulgarian, Polish-Russian, Polish-Ukrainian) with Polish as the language binding them together. Their planned total size is over twenty-two million wordforms. The progress of the work can be viewed on the CLARIN-PL website using KonText, a tool for searching language resources. To gain full access to the corpora developed by the Institute team, those interested need to register as users on the CLARIN-PL website and log on. Users who are not logged on have limited access to resources and may not be able to locate all of them.
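The idea of bilingual corpora bound together by Polish can be pictured with a small sketch. It is only an illustration under stated assumptions: the sentence pairs are invented toy data rather than material from the CLARIN-PL corpora, and the function name and language codes are hypothetical. It shows how several Polish-X corpora can be linked into multilingual entries through the shared Polish side.

```python
from collections import defaultdict

# Invented toy data (NOT taken from the CLARIN-PL corpora).
# Each bilingual corpus is a list of (Polish sentence, translation) pairs.
POLISH_LITHUANIAN = [("Dzień dobry.", "Laba diena."), ("Dziękuję.", "Ačiū.")]
POLISH_BULGARIAN = [("Dzień dobry.", "Добър ден."), ("Dziękuję.", "Благодаря.")]
POLISH_UKRAINIAN = [("Dziękuję.", "Дякую.")]


def link_through_polish(corpora):
    """Link bilingual corpora via their shared Polish sentences.

    corpora: dict mapping a language code to a list of
             (polish_sentence, translation) pairs.
    Returns: dict mapping each Polish sentence to {language code: translation}.
    """
    linked = defaultdict(dict)
    for lang, pairs in corpora.items():
        for polish, translation in pairs:
            linked[polish][lang] = translation
    return dict(linked)


linked = link_through_polish(
    {"lt": POLISH_LITHUANIAN, "bg": POLISH_BULGARIAN, "uk": POLISH_UKRAINIAN}
)
for polish, translations in linked.items():
    print(polish, translations)
```

A real corpus would more plausibly link aligned segments by identifier than by sentence text, since the same Polish sentence may occur many times with different translations; keying on the text is a simplification made for this sketch.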

The corpora currently being developed by the team use pre-2016 resources only to a limited extent. Their design has also been modified following requests and suggestions from users of the language resources made available so far (mainly humanities scholars) and from new users of the corpora now in preparation (mainly translators, doctoral students, university lecturers, publishers and representatives of commercial enterprises with branches in Lithuania, Ukraine, Russia and Bulgaria). As a result, the new multilingual resources will include more texts reflecting current technological progress and present-day realities (e.g. legal texts, texts concerning litigation proceedings, medical texts, contracts and agreements, technical documentation, tender procedure documentation, lists of products, occupations, medicines, etc.).

One entirely new feature of the corpora developed by the Institute team – which also comes in response to users’ suggestions – is that they include texts closely resembling colloquial, everyday speech. The seemingly impossible task of incorporating spoken material into multilingual parallel corpora (there is no such thing as an utterance made in two languages at the same time) has been, to some extent, accomplished by including film dialogues.

Selected works published in 2017, whose authors used Clarin-PL multilingual corpora:

  • Jaskot, Maciej, Yuriĭ Ganoshenko, Wojciech Sosnowski, and Roman Tymoshuk (2017). Leksykon aktywnej frazeologii polskiej i ukraińskiej. Warszawa: KJV Digital, 312 pp. ISBN 978-83-946640-2-2
  • Jaskot, Maciej, and Wojciech Sosnowski (2017). O fałszywych przyjaciołach tłumacza na przykładzie Leksykonu aktywnej frazeologii polskiej i ukraińskiej. In Barbara Borkowska-Kępska, Grzegorz Gwóźdź (eds.), LSP Perspectives 2. Języki specjalistyczne – nowe perspektywy 2. Wyższa Szkoła Biznesu w Dąbrowie Górniczej, pp. 55–62. ISBN 978-83-65621-30-6
  • Łukasik, Marek Wojciech (2017). Contrastive terminography. Cognitive Studies | Études cognitives, 2017(17). https://doi.org/10.11649/cs.1378
  • Satoła-Staśkowiak, Joanna (2017). Badania nad najmłodszą leksyką słowiańską w oparciu o korpusy językowe. In Diana Blagoeva (ed.), Bŭlgarsko-polski studii. Sofiia: Bŭlgarska akademiia na naukite, Institut za bŭlgarski ezik „Prof. Liubomir Andreĭchin”, pp. 32–45. ISBN 978-619-160-903-1
  • Sosnowski, Wojciech, and Roman Tymoshuk (2017). Konfrontacja językowa polskich i ukraińskich jednostek frazeologicznych na przykładzie materiału z Leksykonu aktywnej frazeologii polskiej i ukraińskiej. In Diana Blagoeva (ed.), Bŭlgarsko-polski studii. Sofiia: Bŭlgarska akademiia na naukite, Institut za bŭlgarski ezik „Prof. Liubomir Andreĭchin”, pp. 91–108. ISBN 978-619-160-903-1
  • Sosnowski, Wojciech Paweł, and Roman Tymoshuk (2017). On The dictionary of active Polish and Ukrainian phraseology (Leksykon aktywnej frazeologii polskiej i ukraińskiej): Contrastive linguistics and culture. Cognitive Studies | Études cognitives, 2017(17). https://doi.org/10.11649/cs.1317
  • Tymoshuk, Roman, and Wojciech Sosnowski (2017). Novi pidkhody do stvorennia suchasnykh frazeolohichnykh slovnykiv (na materiali «Leksykona polʹsʹkoї tа ukraїnsʹkoї аktyvnoї frazeolohiї»). Movoznavstvo, 2, 69–77.
  • Tymoshuk, Roman, and Wojciech Sosnowski (2017). O rabote nad „Leksikonom polʹskoĭ i ukrainskoĭ aktivnoĭ frazeologii”. In Ladislav Janovec (ed.), Svet v obrazech a ve frazeologii | World in pictures and in phraseology. Praha: Univerzita Karlova, Pedagogická fakulta, pp. 269–276. ISBN 978-80-7290-964-3

CLARIN-PL selected resources

Polish-Bulgarian-Russian Parallel Corpus

A trilingual parallel corpus of texts aligned at the sentence level; citation: Anna Kisiel, Violetta Koseska-Toszewa, Natalia Kotsyba, Joanna Satoła-Staśkowiak, and Wojciech Sosnowski (2016). Polish-Bulgarian-Russian Parallel Corpus, CLARIN-PL digital repository, http://hdl.handle.net/11321/308

BIBTEX:
@misc{11321/308,
title = {Polish-Bulgarian-Russian Parallel Corpus},
author = {Kisiel, Anna and Koseska-Toszewa, Violetta and Kotsyba, Natalia and Sato{\l}a-Sta{\'s}kowiak, Joanna and Sosnowski, Wojciech},
url = {http://hdl.handle.net/11321/308},
note = {{CLARIN}-{PL} digital repository},
copyright = {{IS} {PAS} corpora license},
year = {2016}
}

Polish-Lithuanian Parallel Corpus

A bilingual parallel corpus of texts aligned at the sentence level; citation: Danuta Roszko, and Roman Roszko (2016). Polish-Lithuanian Parallel Corpus, CLARIN-PL digital repository, http://hdl.handle.net/11321/309

BIBTEX:
@misc{11321/309,
title = {Polish-Lithuanian Parallel Corpus},
author = {Roszko, Danuta and Roszko, Roman},
url = {http://hdl.handle.net/11321/309},
note = {{CLARIN}-{PL} digital repository},
copyright = {{IS} {PAS} corpora license},
year = {2016}
}
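For readers who want to reuse such records programmatically, the following is a minimal sketch of reading one of them into a dictionary. It assumes the simple layout used on this page (one `name = {value}` field per line) and is not a general BibTeX parser; the function name `parse_record` is hypothetical. Stripping braces is adequate for this record but would damage LaTeX accent commands.

```python
import re

# The Polish-Lithuanian record, copied from this page.
RECORD = """@misc{11321/309,
title = {Polish-Lithuanian Parallel Corpus},
author = {Roszko, Danuta and Roszko, Roman},
url = {http://hdl.handle.net/11321/309},
note = {{CLARIN}-{PL} digital repository},
copyright = {{IS} {PAS} corpora license},
year = {2016}
}"""


def parse_record(text):
    """Parse a one-field-per-line BibTeX record into a dict of strings."""
    header = re.match(r"@(\w+)\{([^,]+),", text)
    if header is None:
        raise ValueError("not a BibTeX record")
    fields = {"type": header.group(1), "key": header.group(2)}
    # Each field occupies one line: name = {value} with an optional comma.
    pattern = r"^(\w+)\s*=\s*\{(.*)\},?\s*$"
    for name, value in re.findall(pattern, text, flags=re.MULTILINE):
        # Drop the protective braces BibTeX uses to preserve capitalisation.
        fields[name] = value.replace("{", "").replace("}", "")
    return fields


record = parse_record(RECORD)
authors = record["author"].split(" and ")
print(record["title"], record["year"], authors)
```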

Other resources

ChronoPress – corpus of press articles
Paralela – corpus search engine for a large collection of annotated Polish-English parallel texts
Słowa dnia [Words of the day] – words most frequently used in media discourse
plWordNet – a large network of words (191,000) and a lexico-semantic database (285,000 wordsenses, over 600,000 relations) of the Polish language with the functionality of a Polish-English and English-Polish dictionary (239,000 entries)
Spokes – search engine for conversational data; 232,756 utterances, over two million wordforms
Walenty – The Polish Valence Dictionary
KonText – mono- and multilingual corpora, including those developed by the Institute of Slavic Studies team, e.g. the Polish-Bulgarian corpus

CLARIN-PL selected tools and applications

Chunker – programme for shallow syntactic analysis
Websty – tool for text similarity analysis
Nowy Morfeusz – morphological analyser
Liner2 – tool for recognition of named entities and temporal expressions
Inforex – system for editing annotated corpora
WiKNN – Wikipedia K-Nearest Neighbours classifier for Polish and English texts
Kuźnia – tool for (co-)creation of domain-specific inflected dictionaries
WNLoom-Viewer – application for plWordNet browsing
Mapa Literacka [Literary map] – tool for recognition of references to geographical names and names of locations
MeWeX – application extracting collocation dictionaries from corpora and creating lexical unit dictionaries
Speech – tools and services for spoken data processing
Phonetic transcription – tool for conversion of text into phonetic transcription
Morpho – tool for context-free morphological analysis
Tagger WCRFT2 – tool for tokenisation and morpho-syntactic tagging
Serel – tool for recognising the relationship between annotations in the text
Spatial – tool for recognising spatial relations in the text
WSD – tool for word sense disambiguation
NER – tool for search and classification of named entities
Parser – dependency parser for Polish
Spejd/Spade – syntax parser
POLFIE – LFG parser for Polish
POLFIE-OT – LFG parser for Polish (with Optimality Theory module for automatic disambiguation)
WoSeDon – tool extracting sense frequency lists from texts
NoSketch – simple application for corpora search
Summarize – tool for summarising texts
ReSpa – tool for extracting key phrases from text
Inkluz – interface for detecting foreign language inclusions in Polish text
TermoPL – tool for extracting terms from text

Selected tools for English and German

Tagger – English/German
Tagger NLTK – English
NER – English/German
NER NLTK – English
Parser – English/German

For more CLARIN-PL resources, go to https://clarin-pl.eu/dspace/