LangGener Corpus of Polish-German bilingualism

“Language across Generations: Contact Induced Change in Morphosyntax in German-Polish Bilingual Speech (LANGGENER)” – a grant project financed by the National Science Centre and the German Research Foundation (Deutsche Forschungsgemeinschaft).
Polish team:

  • Principal Investigator: Prof. Anna Zielińska, PhD, Dr. habil. (Institute of Slavic Studies, Polish Academy of Sciences);
  • Co-Investigators: Irena Prawdzic, PhD (Institute of Slavic Studies PAS), Barbara Alicja Jańczak, PhD (Adam Mickiewicz University in Poznań), Anna Jorroch, PhD (University of Warsaw), Felicja Księżyk, Dr. habil. (University of Opole).

German team:

  • Principal Investigator: Prof. Björn Hansen (University of Regensburg);
  • Co-Investigators: Aneta Bučková, MA (University of Regensburg), Carolin Centner, MA (University of Regensburg), Iga Kościołek, MA (University of Regensburg), Prof. Marek Nekula (University of Regensburg), Michał Woźniak, PhD (Institute of Polish Language PAS, University of Regensburg); consultants: Prof. Sandra Birzer (University of Bamberg), Prof. Bernhard Brehmer (University of Greifswald); Roman Fisun, MA (University of Regensburg).

The aim of the project is an integrated description of German-Polish bilingualism in Poland and Germany that takes both linguistic and sociolinguistic aspects into account. Its main objective is to investigate whether contact-induced morphosyntactic changes in the speech of bilingual speakers differ between generations. Our interest lies, on the one hand, in the bilingual speaker and, on the other, in the processes by which morphosyntactic patterns are replicated in both languages, Polish and German.

An important outcome of the grant project is LangGener, a multimodal corpus of Polish-German bilingualism, which we are making available to researchers. The corpus comprises approximately 78 hours of recordings in Polish and German with 58 speakers from two generations. The first generation consists of people born before 1945 in Germany, in the territories incorporated into Poland in 1945. We call this group Generation Poland/Generation Polen, abbreviated GP, after the country of residence, because its members remained in the lands of their birth after the borders were moved. The second generation consists of people who were also born in these territories, but after 1945, that is, already in Poland. Members of this generation emigrated to Germany and live there. We call this group Generation Germany/Generation Deutschland, abbreviated GD. The corpus is annotated both grammatically and sociolinguistically.

We provide corpus users with recordings and their transcriptions, which were made using a semi-orthographic method so that they are accessible to all users. One of the objectives of the project was to document the vanishing German dialects in north-western Poland. Therefore, the texts recorded in German in Poland are transcribed with dialectal features preserved. In addition, the corpus has been supplemented with rich metadata comprising: speaker’s code name; year of birth; generation (GD; GP); gender; year of immigration (for GD); level of education (incomplete primary, complete primary, vocational, secondary, higher); occupation; region of birth; place of birth; region of residence; and place of residence. Moreover, the metadata contains descriptions of idiolects, including detailed descriptions of the features of German dialects in Poland.
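The metadata fields listed above can be pictured as one record per speaker. The following is only a minimal sketch in Python, not the corpus's actual data format: the class and field names are hypothetical, and the only rule it enforces is the one stated above, namely that the year of immigration is recorded for GD speakers.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Generation(Enum):
    GP = "Generation Poland"
    GD = "Generation Germany"


class Education(Enum):
    INCOMPLETE_PRIMARY = "incomplete primary"
    COMPLETE_PRIMARY = "complete primary"
    VOCATIONAL = "vocational"
    SECONDARY = "secondary"
    HIGHER = "higher"


@dataclass(frozen=True)
class SpeakerMetadata:
    """Hypothetical record mirroring the metadata fields named in the text."""
    code_name: str
    year_of_birth: int
    generation: Generation
    gender: str
    education: Education
    occupation: str
    region_of_birth: str
    place_of_birth: str
    region_of_residence: str
    place_of_residence: str
    year_of_immigration: Optional[int] = None  # recorded for GD only

    def __post_init__(self) -> None:
        # Assumption: a GP record never carries a year of immigration.
        if self.generation is Generation.GP and self.year_of_immigration is not None:
            raise ValueError("year_of_immigration applies to GD speakers only")
```

Modelling generation and education as enumerations keeps the values restricted to the closed lists given in the description, while free-text fields such as occupation or place of birth stay plain strings.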

The language contact phenomena are annotated in grammatical terms, indicating the syntactic phrase within which a given phenomenon occurred. The phrase types were identified on the basis of the index word, which determines the grammatical properties of a given group of words. The following types of phrases were distinguished: nominal, verbal, prepositional, adjectival, adverbial and sentence phrases. The annotated language contact phenomena are:

  • direct replication of morphemes and phonological forms from the source language;
  • replication of patterns, i.e. of the distribution, meaning or form-function relationship modelled on the source language (the form itself is not replicated in this process);
  • other differences from the supra-regional variant of a given language that cannot be explained by the use of patterns from the source language;
  • code-switching, i.e. switching between languages within a single utterance;
  • deviations from the syntactic order;
  • self-correction, i.e. expressions and phrases that speakers themselves identify as inadequate or incorrect and replace with other expressions and phrases.
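To illustrate how this two-part grammatical annotation (phrase type plus contact phenomenon) might be handled in an analysis script, here is a minimal Python sketch. The enumeration members simply restate the categories listed above; the names `ContactAnnotation` and `tally` are hypothetical and do not reflect the corpus's own tag set or tooling.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class PhraseType(Enum):
    NOMINAL = "nominal"
    VERBAL = "verbal"
    PREPOSITIONAL = "prepositional"
    ADJECTIVAL = "adjectival"
    ADVERBIAL = "adverbial"
    SENTENCE = "sentence"


class ContactPhenomenon(Enum):
    DIRECT_REPLICATION = "direct replication of morphemes and phonological forms"
    PATTERN_REPLICATION = "replication of patterns"
    OTHER_DIFFERENCE = "other difference from the supra-regional variant"
    CODE_SWITCHING = "code-switching"
    WORD_ORDER_DEVIATION = "deviation from the syntactic order"
    SELF_CORRECTION = "self-correction"


@dataclass(frozen=True)
class ContactAnnotation:
    """One annotated phenomenon and the phrase in which it occurred."""
    phrase_type: PhraseType
    phenomenon: ContactPhenomenon


def tally(annotations: Iterable[ContactAnnotation]) -> Counter:
    """Count annotations per (phenomenon, phrase type) pair."""
    return Counter((a.phenomenon, a.phrase_type) for a in annotations)
```

Running `tally` separately over the annotations of GP and GD speakers would yield the kind of per-generation frequency comparison that the project's research question calls for.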

The novel sociolinguistic annotation is intended to facilitate analysis of the interview content that relates to speakers’ language biographies. A language biography is the history of a person’s language acquisition, use, change and loss throughout their life, seen in its political and social contexts. The annotation is based on three categories: stages of life (early childhood, childhood, school age, adolescence, adulthood and old age), spheres of language use/domains (family/home, neighbourhood/place of residence, religion/denomination, friends/acquaintances, education, work, administration, media/press, television, Internet, national minority associations, travel, egodocuments/memoirs, letters) and conceptualisations of bilingualism (language experience, language ideologies, language management).
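As a rough illustration of how the three sociolinguistic categories could be used to retrieve interview passages, the Python sketch below validates a tag against the value lists given above and filters tags by life stage and domain. The labels are copied from the description; the assumption that every tagged passage carries one value from each category, and the names `BiographyTag` and `select`, are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

LIFE_STAGES = frozenset({
    "early childhood", "childhood", "school age",
    "adolescence", "adulthood", "old age",
})
DOMAINS = frozenset({
    "family/home", "neighbourhood/place of residence",
    "religion/denomination", "friends/acquaintances", "education",
    "work", "administration", "media/press", "television", "Internet",
    "national minority associations", "travel",
    "egodocuments/memoirs", "letters",
})
CONCEPTUALISATIONS = frozenset({
    "language experience", "language ideologies", "language management",
})


@dataclass(frozen=True)
class BiographyTag:
    """Hypothetical tag: one value from each of the three categories."""
    life_stage: str
    domain: str
    conceptualisation: str

    def __post_init__(self) -> None:
        if self.life_stage not in LIFE_STAGES:
            raise ValueError(f"unknown life stage: {self.life_stage!r}")
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain!r}")
        if self.conceptualisation not in CONCEPTUALISATIONS:
            raise ValueError(f"unknown conceptualisation: {self.conceptualisation!r}")


def select(
    tags: Iterable[BiographyTag],
    *,
    life_stage: Optional[str] = None,
    domain: Optional[str] = None,
) -> List[BiographyTag]:
    """Return the tags matching the given life stage and/or domain."""
    return [
        t for t in tags
        if (life_stage is None or t.life_stage == life_stage)
        and (domain is None or t.domain == domain)
    ]
```

For example, `select(tags, life_stage="school age", domain="education")` would gather the passages in which speakers describe language use at school.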

The team of Germanists and Slavists from Poland and Germany hopes that the LangGener corpus will attract interest and be used by linguists, cultural studies scholars, sociologists, anthropologists and researchers from other disciplines in their own work.

Link to project description on the University of Regensburg website:

Corpus of Czech-German bilingualism

The joint project aimed to study language contact phenomena and their typology in two generations of German-Polish bilinguals. The German-Czech subproject addressed two groups of bilingual persons from one generation: repatriates and migrants. The corpus can be used to study language learning and language erosion in the context of migration, in particular the acquisition of German as a second language and the erosion of Czech as a first language in Germany (Bavaria). It is also useful for answering sociolinguistic and dialectological research questions and for studying spoken language.

The corpus contains transcriptions of 16 interviews, with a total length of 27.5 hours, with 20 respondents of Sudeten-German or other origin who were born around 1955 and left Czechoslovakia after turning 12, which is regarded as a critical age. In the interviews, recorded in German and Czech in 2018, 2019 and 2020, the respondents talk about their language biographies in the Czech Republic and in Germany.
The phenomena of contact and isolation are identified and annotated in the transcriptions of the interviews, and then assessed and interpreted quantitatively. In addition to material and structural replications, which are classified according to their syntactic assignment, the tagging also records code-switching, word order anomalies, self-corrections and other deviations from the relevant baseline.


Another result of the project is an open-access monograph:

Soziolinguistik trifft Korpuslinguistik. Deutsch-polnische und deutsch-tschechische Zweisprachigkeit. Scientific editors: Björn Hansen, Anna Zielińska. Universitätsverlag Winter GmbH.  doi:

Authors: Aneta Bučková, Carolin Centner, Björn Hansen, Barbara Alicja Jańczak, Anna Jorroch, Iga Kościołek, Felicja Księżyk, Marek Nekula, Irena Prawdzic, Michał Woźniak, Anna Zielińska

An international team of Slavists and Germanists presents the scientific problems and challenges of creating a spoken corpus of Polish-German bilingualism, LangGener, and a subcorpus of Czech-German bilingualism: from developing the concepts and working through scientific disputes, through devising methods for conducting fieldwork and sociolinguistic interviews, to detailed technical solutions. The main aim of the monograph is to show the tensions and difficulties involved in combining sociolinguistic issues with corpus linguistics. The volume serves as a compendium on building a sociolinguistically oriented linguistic corpus. The team hopes that the monograph will be of interest and will inspire linguists to create sociolinguistic corpora.

