APPLICATION OF MULTILINGUAL CORPUS IN CONTRASTIVE STUDIES (ON THE EXAMPLE OF THE BULGARIAN-POLISH-LITHUANIAN PARALLEL CORPUS)

In this paper we present applications of a trilingual corpus in language research. Comparative and contrastive studies of Polish and Bulgarian, as well as of Polish and Lithuanian, have already been conducted, but to the best of our knowledge no such studies exist for Bulgarian and Lithuanian. On the one hand, it is interesting to note that two Slavic languages are compared to a Baltic language (Lithuanian). On the other hand, the three languages are only marginally present in the EU because of the later accession of the three countries to the EU. The paper briefly describes the first electronic Bulgarian–Polish–Lithuanian experimental corpus, currently under development for research purposes only. We also focus on the morphosyntactic annotation of the parallel trilingual corpus according to the Corpus Encoding Standard: we review the Part-of-Speech (POS) classification of the participle in the three languages (Bulgarian, Polish, and Lithuanian) in comparison to another POS, the adjective. We briefly discuss, with some examples, tagsets for corpus annotation from the point of view of possible future unification.


Introduction
One of the main problems in human communication is the huge variety of written and spoken languages in the world. Finding ways to support the connection of people from different ethnic parts of the world is becoming more and more important. Due to the recent development of information and communication technologies and the increased mobility of people around the globe, the number of bilingual electronic dictionaries in which one of the languages is English has increased extraordinarily. One cannot expect, however, that all people know English well enough to communicate with each other, especially if their native languages (for example, Bulgarian and Polish) belong to the same language family. An Internet search shows that no electronic dictionaries exist at all for pairs of languages such as Bulgarian-Polish or Bulgarian-Lithuanian. Traditional printed dictionaries are either an antiquarian rarity (the most recent Bulgarian-Polish and Polish-Bulgarian dictionaries were published more than 20 years ago) or have never been published at all (Bulgarian-Lithuanian). Creating bilingual electronic or online dictionaries for Bulgarian, Polish and Lithuanian requires an electronic corpus that provides the material for the lexical database supporting the dictionaries and their subsequent expansion and updating. Furthermore, it is interesting to note that two Slavic languages are compared to a Baltic language (Lithuanian). Thus we face a new and interesting scientific problem and hope that our studies will find a wider application.

Multilingual Corpora - Brief Overview
In recent decades many multilingual corpora have been created in the field of corpus linguistics, such as the MULTEXT corpus; the MULTEXT-East corpus (annotated parallel and comparable corpora), an extension of the MULTEXT corpus; the ECI/MCI corpus; the Oslo Multilingual Corpus; ParaSol, a parallel and aligned corpus of Slavic and other languages (the so-called Regensburg Parallel Corpus) [23]; the Italian-German parallel corpus, a collection of legal and administrative documents written in Italian and German, motivated by the equal status of both languages in South Tyrol [10]; the Hong Kong bilingual parallel English-Chinese corpus of legal and documentary texts [7], etc.

MULTEXT corpus
The MULTEXT project (Multilingual Tools and Corpora) [8] is one of the largest EU projects in the domain of language engineering. Its goals are to develop standards and specifications for the encoding and processing of linguistic corpora, and to develop tools, corpora and linguistic resources embodying these standards. MULTEXT develops tools, corpora, and linguistic resources for a wide variety of languages, initially for seven West European languages (Dutch, English, French, German, Italian, Spanish and Swedish), with more in later editions, including Bambara, Catalan, Kikongo, Occitan and Swahili. All MULTEXT results are made freely and publicly available for non-commercial, non-military purposes.

European Corpus Initiative Multilingual Corpus I
The first release of the European Corpus Initiative, the Multilingual Corpus 1 (ECI/MCI) [6], has 46 subcorpora in 27 (mainly European) languages. Their total size is circa 92 million (lexical) words. The corpus has been available on CD-ROM in digital form for scientific research, at as low a cost as possible, since 1994, and is distributed by ELSNET. Contents: German newspaper texts (approximately 34 million words) from the Frankfurter Rundschau from July 1992 to March 1993; French newspaper texts (approximately 4.1 million words) from Le Monde, consisting of material from September 1989, October 1989, and January 1990; extracts from the Leiden Corpus of Dutch, consisting of newspapers, transcribed speech, etc. (approximately 5.5 million words); parallel texts in English, French and Spanish from the International Labor Organisation (ILO) "Official Bulletin, B Series" (approximately 5 million words); texts in Lithuanian (approximately 20 thousand words); scientific papers from the Bulgarian journal "Science" (about 5 thousand words); etc.

MULTEXT-East annotated parallel, comparable, and speech corpora
MULTEXT-East is a freely available standardised multilingual dataset for language engineering research and development, first developed within the EU MULTEXT-East project [2], an extension of the MULTEXT project. MULTEXT-East covers a large number of mainly Central and Eastern European languages, three of which (Bulgarian, Czech and Slovene) belong to the Slavic group. It includes the morphosyntactic specifications (EAGLES-based), defining the features that describe word-level syntactic annotations; medium-scale morphosyntactic lexicons; and annotated parallel, comparable, and small speech corpora. The most important component of this dataset is the linguistically annotated parallel corpus consisting of Orwell's novel "1984" in the English original and its translations.

Oslo Multilingual Corpus
The Oslo Multilingual Corpus [21] is an extension of the English-Norwegian Parallel Corpus (ENPC). The ENPC consists of text excerpts of approximately 10,000 to 15,000 words from fictional and non-fictional Norwegian and English original texts and their translations, amounting to a total of 200 texts, or 2.6 million words. German, Dutch and Portuguese translations were added for some of the texts. The texts are SGML-encoded and aligned at sentence level. The corpus is being extended for German and French, to ensure equal representation of texts in Dutch, English, French, German, Norwegian and Portuguese. Due to copyright restrictions, the corpus is only available to researchers and graduate students at the universities in Oslo and Bergen.

Bulgarian-Polish corpus
The first Bulgarian-Polish corpus [3] contains approximately 5 million words. It is currently under development, for research purposes only, in the framework of the joint research project "Semantics and Contrastive linguistics with a focus on a bilingual electronic dictionary" between the Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, and the Institute of Slavic Studies, Polish Academy of Sciences, coordinated by L. Dimitrova and V. Koseska. It consists of two parts: a parallel and a comparable corpus. This bilingual corpus supports the lexical database (LDB) of the first experimental online Bulgarian-Polish dictionary [4]. Some texts in the ongoing version of the parallel corpus are annotated at paragraph level. Some texts of the Bulgarian comparable corpus are annotated at paragraph and sentence levels, according to the Corpus Encoding Standard (CES) [9].

Trilingual Bulgarian-Polish-Lithuanian corpus
The first Bulgarian-Polish-Lithuanian (for short, BG-PL-LT) corpus (currently under development only for research) contains more than 3 million words so far.
All texts collected in the corpus have been published in and distributed over the Internet. The trilingual corpus comprises two corpora: parallel and comparable.

Bulgarian-Polish-Lithuanian parallel corpus
The BG-PL-LT parallel corpus contains more than 1 million words so far. A part of the parallel corpus comprises original texts in one of the three languages with translations in the two others, and texts of brochures of the European Commission and official documents of the European Union and the European Parliament, available through the Internet. The main part of the parallel corpus comprises texts (fiction, novels, short stories) in other languages translated into Bulgarian, Polish, and Lithuanian. Whenever we have obtained the electronic text of the original literary work or of its translation, we include it in the corpus as well.
The development of methods allowing the construction of a multilingual parallel electronic corpus is a continuous process. We must stress that a parallel corpus of any three languages cannot be a sum of the individual corpora. It is obligatory to meet the condition of simultaneous accumulation of equivalent texts for all chosen languages. In other words, we cannot use ready-made monolingual corpora, because the language material in them is accumulated to show the diversity and the different levels (synchronic and diachronic) of a language system's development. Our aim should be to collect equivalent (even if translated) language material, i.e. material that is stylistically unambiguous and contemporary. The diachronic level in the development of a language should not be taken into account. This level requires a different approach to the annotation of the material and is useless for the creation of multilingual dictionaries or electronic translation.
Another problem is the proportion of translated texts in the languages. It turned out that it is extremely difficult to find electronic texts of translations from Bulgarian to Lithuanian or vice versa, since the two languages are spoken by small nations in comparison to other languages of the EU and are spoken relatively far from one another. It can be assumed (provisionally, of course) that the Polish language 'builds a bridge' between them: for the language pairs Bulgarian-Polish and Polish-Lithuanian one can find freely available translations on the Internet. For example, Polish literature is more frequently translated into Bulgarian or Lithuanian than Bulgarian or Lithuanian literature into Polish. However, the translated texts in the three languages must be of comparable size.
We plan to annotate the BG-PL-LT parallel corpus according to the standards for morphosyntactic annotation of digital language resources. Due to typological differences (Bulgarian is analytic, Polish and Lithuanian are synthetic), the annotation of the parallel corpus will be difficult. Therefore, a condition that must necessarily be met is a strict differentiation between form and content in the natural-language sentence.

Bulgarian-Polish-Lithuanian comparable corpus
The comparable BG-PL-LT corpus includes: (1) texts in Bulgarian, Polish and Lithuanian, mainly fiction, with text sizes comparable across the three languages, and (2) excerpts from electronic newspapers, distributed via the Internet and with the same thematic content.
The main goal in collecting the trilingual corpus is the design and development of a BG-LT digital dictionary based on the BG-PL digital online dictionary. The corpus will provide a sample of the vocabulary to be included in an initial experimental version of the BG-LT digital dictionary.
The structure of the parallel corpus groups texts according to content. Every group contains three parts (or four, if the original language differs from the languages in the corpus). A detailed description of the corpus is provided for clarification to the user.
An excerpt of the description of the trilingual parallel corpus follows:
Some of the texts have been annotated at paragraph level. This allows texts in all three languages and in pairs (BG-PL, PL-LT, BG-LT, and vice versa) to be aligned at paragraph level in order to produce aligned trilingual and bilingual corpora. "Alignment" means the process of relating pairs of words, phrases, sentences or paragraphs that are translation equivalents in texts in different languages. One may say that "alignment" is a type of annotation performed over parallel corpora.
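The paragraph-level pairing described above can be sketched roughly as follows. This is only a minimal illustration, assuming that paragraphs in each language carry a shared identifier; the identifiers, the placeholder texts, and the function names `align_paragraphs` and `align_pairs` are hypothetical and not taken from the corpus. It pairs paragraphs by identifier and does not attempt to detect translation equivalence from the content itself.

```python
from itertools import combinations

# Hypothetical input: for each language, paragraphs keyed by a shared identifier.
# The texts are placeholders, not corpus material.
corpus = {
    "BG": {"p1": "<BG paragraph 1>", "p2": "<BG paragraph 2>", "p3": "<BG paragraph 3>"},
    "PL": {"p1": "<PL paragraph 1>", "p2": "<PL paragraph 2>", "p3": "<PL paragraph 3>"},
    "LT": {"p1": "<LT paragraph 1>", "p3": "<LT paragraph 3>"},  # p2 has no LT version
}


def align_paragraphs(texts):
    """Return one record per paragraph identifier present in every language."""
    shared_ids = set.intersection(*(set(paragraphs) for paragraphs in texts.values()))
    aligned = []
    for paragraph_id in sorted(shared_ids):
        record = {"id": paragraph_id}
        for language, paragraphs in texts.items():
            record[language] = paragraphs[paragraph_id]
        aligned.append(record)
    return aligned


def align_pairs(texts):
    """Build a bilingual alignment for every pair of languages."""
    return {
        (first, second): align_paragraphs({first: texts[first], second: texts[second]})
        for first, second in combinations(sorted(texts), 2)
    }


trilingual = align_paragraphs(corpus)   # paragraphs available in all three languages
bilingual = align_pairs(corpus)         # keys: ("BG", "LT"), ("BG", "PL"), ("LT", "PL")
```

In this sketch a paragraph missing in one language drops out of the trilingual alignment but is still kept in the bilingual alignment of the other two languages.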
Excerpts of texts of the trilingual parallel corpus, marked at paragraph level, follow:
Corpus annotation is the process of adding linguistic information in electronic form to a text corpus [9], [11]. We would like to mention the two most common types of corpus annotation: morphosyntactic annotation (also called grammatical tagging or part-of-speech (POS) tagging) and lemma annotation (where each word in the text is associated with the corresponding lemma). Lemma annotation is closely related to morphosyntactic annotation. Morphosyntactic annotation (POS tagging, where each word in the text is associated with its grammatical classification) is the task of labeling each word in a sequence of words with its appropriate part of speech. Words are often ambiguous with respect to their POS. For example, in Bulgarian the neuter singular forms of most adjectives serve double duty as adverbs:
The set of POS tags is called a tagset. The size and choice of tagsets vary across languages. The classical POS tagging system is based on a set of parts of speech including noun, adjective, numeral, pronoun, verb, adverb, preposition, conjunction, interjection, particle, and often (depending on the language) article, etc. Of course, morphologically rich languages need more detailed tagsets that reflect their various inflectional categories. The POS classification varies across languages, and often there is more than one possible POS classification for a given language.
The applications of the morphosyntactic annotation include lexicography, parsing, language models in speech recognition, disambiguation clues for ambiguous words (machine translation), information retrieval, spelling correction, etc.
Here we would like to show that the morphosyntactic annotation of a multilingual corpus cannot be used directly in a purely formal way. An in-depth contrastive study of specific phenomena in the respective languages is necessary. Next we attempt to compare the morphosyntactic characteristics of the words of parallel texts across the three languages from the point of view of a possible future unification.
We will briefly review the POS classification of the participle (one of the important verbal forms) in the three languages, in comparison to another POS, the adjective.
The syntactic functions of the participle and the adjective lead to their being confused as POS. In a sentence both the participle and the adjective have attributive and predicative functions. One overlooks the fact that their meanings are quite different: a good illustration is the comparison of Polish and Bulgarian adjectives and participles taken from the electronic Bulgarian-Polish dictionary under development.
These examples illustrate well the difference in meaning between Bulgarian adjectives and participles and prove that syntactic criteria are not sufficient to classify POS. Of great importance is the semantic perspective, which differentiates the meanings of participle and adjective in both languages although forms I and II in Bulgarian are identical. The comparison of the Bulgarian participle and adjective with their Polish correspondences underlines the role of language confrontation in solving theoretical problems in a natural language. A description concerning only Bulgarian or only Polish would not be able to decisively settle the question of differentiating the chosen POS. The comparison of Bulgarian and Polish shows that the lack of differentiation of the two POS types is a sign of incompetence.

Functions of the participle
The classification of a participle, not only as a verb form, is an important problem: the role of the participle varies significantly across languages, because its use, distribution, number of forms, properties and functions differ. In contrast to English, for instance, where participles are invariable, in the Slavic languages the forms of the participles are inflected (adjectival participles only). Participles are associated with a verbal stem and contain information about the aspect, tense and valency of the finite forms of the respective verb. As is well known, information about aspect is important for the Slavic languages, but does not exist in English. Bulgarian, Polish and Lithuanian distinguish between the following functions of the participle form: predicative function, attributive function and semi-predicative (or adverbial) function, which are illustrated by the following examples:
A short explanation of the last example: the participles used in the sentences are related to the past tense forms to express simultaneity of the two states of the same agent.

Description:
The agent is speaking. State

Participle and verb
It is important to emphasize that participles preserve some properties of the finite form of the verb, such as voice, tense and aspect. In Bulgarian, Polish and Lithuanian there are active and passive participles:
We stress that in Lithuanian a variant using the past perfect tense is also possible:
LT: Jis buvo paruošęs (past perfect) pamokas, kaip pradėjo skaityti knygą.
Polish has a more modest stock of verbal forms with temporal meaning than Bulgarian or Lithuanian. In any case, when the means that modify temporal meanings are taken into account (participles, verbal nouns, adverbs, and other lexical means), it is clear that Polish can also express the same temporal meanings. In Lithuanian the number of finite verbal forms and participles is large. Lithuanian participles are distinguished by their ability to replace subordinate clauses in Polish and Bulgarian, for example (in A and B):
A. Case of expressing simultaneity of two states (or states and events), referring to two separate agents, for instance:
Lithuanian uses the participle atsitūpus, which is part of a dative absolute (dativus cum participio) construction: musei (dat. sg.) + participle [atsitūpus]. In Bulgarian and Polish, such constructions correspond to subordinate clauses in which the relation "sequentiality-causality" is based on context and on the knowledge (of speaker and listener) about reality. Furthermore, the content of the statement is complicated by the repetition expressed by the past iterative form pūsdavo. Bulgarian and Polish use other means, for instance PL: ile razy; BG: a conjunction or the imperfect tense.
C. Case of expressing "sequentiality-causality" of two states (or states and events), referring to one and the same agent, for instance:

Features of the adjective
Adjectives in Polish and Lithuanian can be declined for gender, number and case (in Bulgarian only for gender and number), but, unlike the participle, do not express a temporal or aspect relation on their own. These arguments show that participles deserve a separate treatment from adjectives. The main grammatical meaning of the adjective is the attributive meaning. Unlike the participle, which is closely related to a verbal action (state or event in the past, present and future), the adjective denotes a constant property or quality of the object, such as:
PL: Dobrze (adverb) się mieszka na wsi.
LT: Gera (adjective, neuter) gyventi kaime.
(EN: Living in the village is good.)
Our observations show that participles have to be considered apart from adjectives, since adjectives do not carry the verbal characteristics of voice, tense, aspect and valence. Mixing adjectives and participles is a sign of insufficient knowledge of the grammatical structure of Slavic languages. Unification of adjectives and participles might be allowed for languages without aspect and/or whose descriptive system of aspect and tense of the verbal form is simpler than that of the Slavic or Baltic languages. That is the main reason why participles have to be classified as a separate POS and not re-qualified as adjectives. The close relationship between participles and adjectives holds only on a formal level. On a semantic level there are differences; see the lists in 4.1 and 4.2.

Towards development of annotated trilingual electronic resources
Morphosyntactic descriptions for Bulgarian have been developed in several projects, the first of which served corpus processing at the morpholexical level in the MTE project of the EC. The MTE consortium developed morphosyntactic specifications and word-form lexical lists (so-called lexicons) covering at least the words appearing in the MTE corpus. For each of the six MTE languages, a lexical list containing at least 15,000 lemmata was developed for use with the morphological analyzer. Each lexicon entry includes information about the inflected form, lemma, POS, and morphosyntactic specifications. A mapping from the morphosyntactic information contained in the lexicon to a set of corpus tags (used by the POS disambiguator) was also provided, according to the MULTEXT tagging model. The structure of the lexicon entry is the following:
word-form lemma MSD comments
where word-form represents an inflected form of the lemma, characterised by a combination of feature values encoded by the MSD code (MSD: MorphoSyntactic Description); the fourth (optional) column, comments, is currently ignored and may contain either comments or information processable by other tools.
Here is an excerpt from the Bulgarian lexicon:
The MSDs are provided as strings using a linear encoding, an efficient and compact way of representing flat attribute-value matrices. In this notation, the position in a string of characters corresponds to an attribute, and specific characters in each position indicate the value of the corresponding attribute. That is, the positions in a string of characters are numbered 0, 1, 2, etc., and are used in the following way: the character at position 0 encodes the part of speech; each character at position 1, 2, ..., n encodes the value of one attribute (person, gender, number, etc.), using a one-character code; if an attribute does not apply, the corresponding position in the string contains the special marker "-" (hyphen). By convention, trailing hyphens are not included in the MSDs. Such specifications provide a simple and compact encoding, and are similar to the feature-structure encoding used in unification-based grammar formalisms. When the word form is identical to the lemma, an equals sign ("=") is written in the lemma field of the entry.
For Bulgarian the morphosyntactic descriptions were designed on the basis of the POS classification of traditional Bulgarian grammar [1]. Each word form is assigned a label encoding the major category (POS), the type where applicable (e.g., proper versus common noun) and inflectional features. Punctuation is also included, as are abbreviations, numbers written in digits, and unidentified objects (residuals). The morphosyntactic descriptions of Bulgarian participles are discussed in detail in [5].
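As a rough illustration of the positional encoding just described, the following sketch decodes and encodes MSD strings and reads a lexicon entry. It is a minimal sketch under stated assumptions: the attribute tables in `ATTRIBUTES`, the one-character codes, the tab-separated column layout, and the placeholder entries are hypothetical and are not the actual MTE specifications for Bulgarian.

```python
# Hypothetical attribute tables: position 0 holds the POS, positions 1..n the attributes.
# These tables are illustrative assumptions, not the MTE specifications.
ATTRIBUTES = {
    "N": ["Type", "Gender", "Number"],
    "A": ["Type", "Gender", "Number", "Definiteness"],
}


def decode_msd(msd, attributes=ATTRIBUTES):
    """Turn a positional MSD string into an attribute-value dictionary."""
    pos = msd[0]
    features = {"POS": pos}
    for index, name in enumerate(attributes.get(pos, []), start=1):
        # Trailing hyphens are omitted by convention, so a missing position means "-".
        code = msd[index] if index < len(msd) else "-"
        if code != "-":  # "-" marks an attribute that does not apply
            features[name] = code
    return features


def encode_msd(pos, values, attributes=ATTRIBUTES):
    """Build the MSD string for a POS and its attribute values, dropping trailing hyphens."""
    codes = [values.get(name, "-") for name in attributes[pos]]
    return (pos + "".join(codes)).rstrip("-")


def parse_entry(line):
    """Split an entry into word-form, lemma, MSD and optional comments (tab-separated here)."""
    columns = line.rstrip("\n").split("\t")
    word_form, lemma, msd = columns[0], columns[1], columns[2]
    comments = columns[3] if len(columns) > 3 else ""
    if lemma == "=":  # "=" means the word form is itself the lemma
        lemma = word_form
    return {"word_form": word_form, "lemma": lemma, "msd": msd, "comments": comments}


# Placeholder entries, not taken from the Bulgarian lexicon.
entry = parse_entry("form\t=\tNcfs")
features = decode_msd(entry["msd"])
```

With these assumed tables, an attribute that does not apply simply does not appear in the decoded dictionary, and encoding followed by decoding returns the same attribute values.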
The morphosyntactic descriptions for Polish: the description of Polish by Saloni [16] serves as a basis for the morphosyntactic descriptions for Polish and has been adapted to a large degree to the MTE MSD format in [15].
The system of morphosyntactic tags developed for Polish at the Institute of Computer Science, Polish Academy of Sciences (IPI PAN), is based on a sound methodological foundation comprising linguistic work by authors such as J. S. Bień, Z. Saloni, and M. Świdziński. It is thanks to this foundation that the IPI PAN tagset goes beyond the fossilised traditional framework dating back to Aristotle. On the other hand, the MTE tagset, which serves as a point of reference here, is based on the traditional subdivision into parts of speech (this is why, among others, pronouns have been singled out as a part of speech).
Consequently, the aim of our work is neither to revise the good and highly refined IPI PAN tagset nor to replace it with a new tagset for Polish. The issue in question is what kind of compromise should be sought when developing a joint tagset to be used for the simultaneous description of the three languages in the BG-PL-LT parallel corpus. For several reasons the MTE tagset (developed previously for many languages) has been selected as the leading one for this corpus. Therefore, the aim of our work is to provide a theoretical study of various categories of Polish (and Lithuanian), to set priorities (e.g. morphological, semantic, syntactic) in identifying various meanings, and to provide a classification of morphosyntactic phenomena which does not contradict the MTE standard and does not deviate too strongly from the IPI PAN tagset.
It cannot be excluded that, due to the obvious difficulties in achieving consistency across the tagsets, the BG-PL-LT corpus will use the IPI PAN tagset for Polish and a modification of it for Lithuanian. This solution would certainly necessitate a list of more or less close equivalents for the two tagsets: a tagset for Bulgarian on the one hand, and the IPI PAN tagset on the other (for Polish, and an extended version for Lithuanian).
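A list of equivalents between two tagsets could be kept as a simple lookup table, as in the sketch below. This is only an illustration: the tag names are written in the style of the IPI PAN and MTE tagsets, but the pairs in `IPIPAN_TO_MTE` are assumptions made for the example and not an agreed mapping, and the function names are hypothetical.

```python
# Illustrative equivalents only: the pairs below are assumptions for the example,
# not an agreed mapping between the two tagsets.
IPIPAN_TO_MTE = {
    "subst": "N",  # noun
    "adj": "A",    # adjective
    "adv": "R",    # adverb
    "fin": "V",    # finite verb form
    "praet": "V",  # past form
    "pact": "V",   # active adjectival participle, kept with the verb here
    "ppas": "V",   # passive adjectival participle, kept with the verb here
}


def to_mte(tag, table=IPIPAN_TO_MTE):
    """Return the assumed equivalent category, or None if no equivalent is listed."""
    return table.get(tag)


def collapsed_classes(table=IPIPAN_TO_MTE):
    """List target categories that absorb several source tags (no 1-1 equivalence)."""
    groups = {}
    for source, target in table.items():
        groups.setdefault(target, []).append(source)
    return {target: sorted(sources) for target, sources in groups.items() if len(sources) > 1}


many_to_one = collapsed_classes()
```

Listing the collapsed classes makes visible where the equivalents are only "more or less close": several fine-grained classes of one tagset fall under a single category of the other, so further attributes would be needed to keep them apart.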
It is important to emphasise that only a coherent tagset for a parallel multilingual corpus: 1. allows complete linguistic confrontation, 2. enables identification of linguistic facts, 3. enables a search based on pre-defined unambiguous morphosyntactic characteristics.
The morphosyntactic descriptions for Lithuanian: the Academic grammar of the Lithuanian language [12] and the Functional grammar of Lithuanian [17] serve as a basis for the morphosyntactic descriptions of Lithuanian. A tool for the morphosyntactic annotation of Lithuanian, MorfoLema, has been created by Vytautas Zinkevičius at the Centre of Computational Linguistics of Vytautas Magnus University (Lithuania) [19]. The program MorfoLema can perform a morphosyntactic analysis and generate forms of Lithuanian words based on morphosyntactic characteristics given by the user. The next step in the development of a system for morphological annotation (Morfologinis anotatorius [20]) has been realised by Vidas Daudaravičius and Erika Rimkutė. Vidas Daudaravičius has created disambiguation tools for the Morfologinis anotatorius. More information about the Morfologinis anotatorius and the set of tags it uses can be found at [20] in Lithuanian (the names of tags are in Lithuanian, because the authors of the Morfologinis anotatorius did not use English terms). It is possible to perform a morphosyntactic analysis online through the web page [21]. The results are displayed on the screen, and it is possible to receive the result as a file.
The authors of the Lithuanian Morfologinis anotatorius (see [20]) use the traditional Lithuanian description of POS. They add two new POS: acronym (like LR for Lietuvos Respublika 'Republic of Lithuania') and abbreviation (like gen. for generalinis 'main, leading (chief)'). In practice these are not POS, but a means to denote some phenomena specific to the written language. Subcategories such as gender, number, case, present, past, passive, active, etc., are described as separate categories and are not related to POS. This division corresponds to many of the subcategories in the Lithuanian academic grammar.
A comparison between experimental annotations of the following sentence of the parallel corpus, "I felt no fear.", was performed:
The tagsets for Polish (based on [13], [14], [18]) and Lithuanian ([20], [21]), used in the corresponding examples in the Appendix, follow:
The annotation of the Bulgarian text is done with MTE MSDs. For the manual annotation of the Polish and Lithuanian texts the above-mentioned descriptors are used, because these languages lack developed MTE language specifications. Establishing a one-to-one correspondence between the tags used and the MTE tagset does not present an insurmountable difficulty. The result can be seen in the Appendix.

Applications of the trilingual corpus
A parallel corpus of two Slavic languages and one Baltic language is of great interest from the viewpoint of describing the similarities and differences of the formal means of these three languages. Bulgarian belongs to the South subgroup and Polish to the West subgroup of the Slavic languages. Lithuanian belongs to the Eastern Baltic group. All three languages preserve the special features of their respective groups. Each of the three languages, however, has specific traits which make it unique within its language group.
We studied some characteristics in the previous parts. Here we will consider some significant differences between the languages which can be illustrated by examples of texts from the trilingual corpus.
A significant feature is the analytic character of Bulgarian, and the synthetic character of Lithuanian (with some analytic character, like word order in absolute constructions) and Polish.Bulgarian exhibits several linguistic innovations in comparison to the other Slavic languages (a rich system of verbal forms, a definite article), and has a grammatical structure closer to English, Modern Greek, or the Neo-Latin languages than Polish.
The definite article in Bulgarian is postpositive, whereas in Lithuanian a similar function is served by qualitative adjectives and adjectival participial forms, both with pronominal declension.Bulgarian preserves some vestiges of case forms in the pronoun system.Polish and Lithuanian exhibit all features of synthetic languages (a very rich case paradigm for nouns).Although Lithuanian has lost the neuter gender of nouns, its case system is richer than the Polish one.Bulgarian and Lithuanian have a high number of verbal forms, but Polish has reduced most of the forms for past tense.Both Polish and Bulgarian have a strongly developed category of verbal aspect.In Lithuanian the verb can have more than one aspect depending on the usage of a base stem for present, past and future tense.
Furthermore, a trilingual corpus can find applications in the design and development of the LDB of future bilingual dictionaries, for example an LDB supporting a BG-LT dictionary, based on an LDB that supports a BG-PL online dictionary. The advantage of processing a trilingual parallel corpus is that it yields context-specific information about syntactic and semantic structures and about the usage of words in a given language or languages.
The list of POS used for Lithuanian in Morfologinis anotatorius follows: