EXTRACTION AND PRESENTATION OF BILINGUAL CORRESPONDENCES FROM SLOVAK-BULGARIAN PARALLEL CORPUS

This paper describes the results of the automatic extraction and presentation of bilingual correspondences from a Slovak-Bulgarian parallel corpus. Equivalent phrases are extracted from the corpus, which is automatically aligned at the sentence and word levels, and are then filtered, indexed and presented in a dictionary-like interface. The bilingual dictionary database contains 80 thousand phrase pairs consisting of approximately 350 thousand words in each language. Counting unique word forms, the Slovak part of the dictionary contains 31 thousand and the Bulgarian part 26 thousand.


Introduction
In this article the authors describe the results of an experimental study on the Slovak-Bulgarian/Bulgarian-Slovak parallel corpus, prepared within the joint research project "Electronic Corpora - Contrastive Study with Focus on Design of Bulgarian-Slovak Digital Language Resources" between the Institute of Mathematics and Informatics of the Bulgarian Academy of Sciences and the Ľudovít Štúr Institute of Linguistics of the Slovak Academy of Sciences. The first joint study under this project analysed the differences between Bulgarian and Slovak in the MULTEXT-East morphology tagset for corpus annotation (Garabík, Majchráková, & Dimitrova, 2009). The second study is a corpus-based experiment on the automatic extraction and visualization of translation equivalents from Slovak-Bulgarian/Bulgarian-Slovak parallel texts, with the ultimate goal of obtaining useful information about the Slovak translation equivalents of Bulgarian (definite) articles and demonstrative pronouns (Dimitrova & Garabík, 2014).

Slovak-Bulgarian/Bulgarian-Slovak Parallel Corpus
The parallel sentence-aligned Slovak-Bulgarian/Bulgarian-Slovak corpus is currently under development as a bilingual resource for different kinds of language analysis, for research and development of machine and human translation systems, for automatic term extraction, etc. A recent version of the corpus is available via a NoSketch Engine web interface at http://korpus.sk/skbg.html.
The corpus consists of two parts: fiction texts of about 650 thousand words, and texts from EU & EC journals and documents comprising over 82 million words in Bulgarian and 85 million words in Slovak (Dimitrova & Garabík, 2011, 2012).

Corpus Structure
The corpus currently contains translations of fiction in both languages, either from Slovak into Bulgarian or from Bulgarian into Slovak. The main part of the parallel corpus contains texts in other languages translated into both Bulgarian and Slovak. The corpus consists of two subcorpora: direct and translated.
The direct Bulgarian-Slovak parallel sentence-aligned subcorpus consists of original texts in Bulgarian, such as novels and short stories by Bulgarian writers, with their translations into Slovak, and original texts in Slovak, such as literary works by Slovak writers, with their translations into Bulgarian. The set of aligned texts includes two Bulgarian novels, Dimitȗr Dimov's Осъдени души (Doomed Souls) and Pavel Vezhinov's Бариерата (The Barrier), with their Slovak translations, and the novel Brat mlčanlivého vlka (The silent wolf's brother) by the Slovak writer Klára Jarunková with its Bulgarian translation.
The translated Bulgarian-Slovak parallel subcorpus consists of Bulgarian and Slovak translations of works written in a third language, namely the Slovak and Bulgarian translations of Jaroslav Hašek's Osudy dobrého vojáka Švejka za světové války (The Good Soldier Švejk) and a set of texts from EU & EC journals and documents.
Recently, Pavel Vezhinov's novel Нощем с белите коне (In the night riding the white horses) and Ȋordan Ȋovkov's short stories Песента на колелетата (The Song of Wheels), Вечери в Антимовския хан (Inn at Antimovo), Ако можеха да говорят (If they could talk) and Женско сърце (A Woman's Heart), together with their Slovak translations, were also added to the direct Bulgarian-Slovak subcorpus. The volume of the literary parallel texts is about 650 thousand words per language.

Morphological Annotation
As the first step of our study, we prepared morphologically annotated, sentence-aligned parallel texts. The Slovak texts are annotated automatically by the Morče tagger, which has been trained and tuned on the tagset developed by the Slovak National Corpus (Garabík & Šimková, 2012).

Alignment
Bilingual sentence-aligned corpora are valuable resources for many NLP applications, such as machine translation research and the search for and extraction of language data. They can also serve as a translation database and as training material for translators, both human translators and software tools. Bilingual aligned corpora published on the web are available to both human and machine users. Such corpora, and special lists derived from them such as frequency lists and concordances, are useful for language teaching. Concordances also have many applications in contrastive studies: they are used to compare different uses of the same word in different contexts, to locate and analyse phrases and idioms in a given text, and to find translations of essential elements of a text, such as terms, in multilingual texts.
To align the texts at the sentence level, we use the hunalign software (Varga et al., 2005). The software uses a bilingual Slovak-Bulgarian dictionary to improve the accuracy of the alignment; we used a small bootstrapped dictionary that was generated automatically and then manually proofread to remove incorrect word pairs. Alignment at the word level was performed with the GIZA++ software (Och & Ney, 2000), using (for simplicity) only sentence pairs whose alignment was 1:1. In general, word alignment is M:N (any number of Bulgarian words can map to any number of Slovak ones), although only 1:1 and, realistically, at most 1:2 (and 2:1) alignments appear in our corpus texts.
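The restriction to 1:1 sentence pairs can be illustrated with a minimal sketch. It assumes the sentence alignment has already been loaded into memory as a list of (source sentences, target sentences) tuples; this layout, the function name `keep_one_to_one` and the example sentences are hypothetical and do not reproduce the output format of hunalign.

```python
from typing import List, Tuple

# One alignment unit: (source sentences, target sentences).
# Assumed in-memory layout for illustration only, not hunalign's file format.
Bead = Tuple[List[str], List[str]]


def keep_one_to_one(beads: List[Bead]) -> List[Tuple[str, str]]:
    """Return only the pairs where one sentence maps to exactly one sentence."""
    pairs = []
    for src_sents, tgt_sents in beads:
        if len(src_sents) == 1 and len(tgt_sents) == 1:
            pairs.append((src_sents[0], tgt_sents[0]))
    return pairs


# Invented example data, not taken from the corpus.
beads = [
    (["Dobrý deň."], ["Добър ден."]),                 # 1:1, kept
    (["Prišiel.", "Sadol si."], ["Дойде и седна."]),   # 2:1, dropped
    ([], ["Край."]),                                   # 0:1, dropped
]
print(keep_one_to_one(beads))
```

Dropping the 2:1 and 0:1 units loses some data, but it keeps the input to word alignment simple, which matches the "for simplicity" motivation given above.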

Phrase Extraction
We use MOSES (Koehn et al., 2007), a statistical machine translation toolkit, to process the corpus. The toolkit uses GIZA++ to obtain an initial word alignment, which is subsequently improved by the "grow-diag-final" method.
Throughout this article, we use the term 'phrase' in the MOSES sense: a phrase is a short sequence of one or several words that has been selected from the text corpus and aligned with a corresponding text chunk (phrase) from the other language part of the corpus; it has no connection with 'phrase' as a term in general linguistics. Although MOSES could be used to build a machine translation system based on our corpus, this was not our goal, and we used only the training process, which produces aligned and scored bilingual phrase tables.
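As an illustration of what a scored phrase table looks like to downstream code, the following sketch reads one line of a plain-text phrase table. It assumes the commonly used layout in which fields are separated by `|||` and the third field holds space-separated scores; the helper `parse_phrase_table_line`, the `PhrasePair` type and the example line are hypothetical, and the exact fields written by a particular MOSES version should be checked against its documentation.

```python
from typing import List, NamedTuple


class PhrasePair(NamedTuple):
    source: str
    target: str
    scores: List[float]


def parse_phrase_table_line(line: str) -> PhrasePair:
    """Split one 'src ||| tgt ||| scores ||| ...' line into its first three fields."""
    fields = [field.strip() for field in line.split("|||")]
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(fields)}")
    scores = [float(value) for value in fields[2].split()]
    return PhrasePair(fields[0], fields[1], scores)


# Invented example line, not taken from the corpus.
line = "dobrý deň ||| добър ден ||| 0.5 0.25 0.8 0.4 ||| 0-0 1-1 ||| 4 5 4"
pair = parse_phrase_table_line(line)
print(pair.source, "->", pair.target, pair.scores)
```

Any further fields after the scores (word alignment points, counts) are ignored here, since only the phrase pair and its scores are needed for ranking.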
MOSES training produces four different phrase translation scores:
• inverse phrase translation probability $\phi(f|e)$
• inverse lexical weighting $\mathrm{lex}(f|e)$
• direct phrase translation probability $\phi(e|f)$
• direct lexical weighting $\mathrm{lex}(e|f)$

Ideally, we would like to compute a single score out of these four numbers, reflecting the level of "suitability" of the phrase pair. Since we designed our interface to be language-direction agnostic (i.e. conceptually neither the Bulgarian → Slovak nor the Slovak → Bulgarian correspondence should be favoured), and since we want to take into account not just phrase correspondence but also the correspondence of individual words, our score must be symmetrical with regard to $\phi(f|e)$ and $\phi(e|f)$, as well as to $\mathrm{lex}(f|e)$ and $\mathrm{lex}(e|f)$, and should reflect the likelihood-like nature of these scores. The simplest function that fulfils these criteria is a simple product:

$$g = \phi(f|e)\,\mathrm{lex}(f|e)\,\phi(e|f)\,\mathrm{lex}(e|f)$$

In order to quantify the correctness of the extracted phrases, we split the phrases into sets according to the logarithm of the score $g$, in intervals two orders of magnitude wide, i.e.

$$g \in (10^{-30}, 10^{-28}] \cup \dots \cup (10^{-6}, 10^{-4}] \cup (10^{-4}, 0.01] \cup (0.01, 1]$$

In each interval, we randomly selected 10 sentences (the population sample) and manually annotated their correspondence, choosing between three options: good, bad and not sure. In the interval $[10^{-14}, 10^{-6}]$ we increased the number of sentences to 30 to get better estimates. Since the sampling of sentences from each interval is without replacement, the probability distribution is hypergeometric; however, the number of sentences in each interval (the population) is on the order of millions, and we can therefore approximate the distribution by a binomial one (this is relevant for confidence interval estimation). For each of the intervals, we calculate the ratio

$$r = \frac{N_{\mathrm{good}}}{N_{\mathrm{good}} + N_{\mathrm{bad}}}$$

i.e. we remove the "not sure" sentence pairs from the sample and calculate the ratio of good ones.

To obtain a function describing the relation of the ratio $r$ to the score $g$, we start with several basic assumptions. First, phrases with $g = 1$ should be perfect equivalents, $r(g = 1) = 1$. Phrase pairs with a very low score should be completely bogus:

$$\lim_{g \to 0} r(g) = 0$$

We are therefore looking for a sigmoid function whose value starts at zero at $g = 0$ and saturates as $g$ approaches 1. Since we are operating on intervals defined by orders of magnitude, we apply the sigmoid function to $\ln(x)$. The generic logistic function is defined by

$$f(x) = \frac{1}{1 + e^{-a(x - x_0)}}$$

where $x_0$ is the centre of the function (horizontal shift) and $a$ reflects the slope ('steepness'). Using the logistic function of the variable $\ln(x)$ in place of $x$ and simplifying, we get

$$r(x) = \frac{1}{1 + \dfrac{e^{a x_0}}{x^{a}}} \tag{2}$$

We then fit our data points with the function (2) to obtain the parameters $a$ and $x_0$, which gives us

$$a = 0.142 \pm 0.021 \tag{3}$$

and subsequently $e^{x_0} = 6.3 \cdot 10^{-16}$, i.e. the ratio of incorrectly aligned phrase pairs will reach 0.5 around $g = 6.3 \cdot 10^{-16}$. The relation between the score $g$ and the ratio $r$ of our population samples is depicted in Fig. 2, together with the function (2). We can use the function (2) to obtain the value of $g$ at which the ratio of correctly aligned phrases drops below a certain value. We decided to keep 95% accuracy, so solving the equation

$$r_0 = 0.95 = \frac{1}{1 + \dfrac{e^{a x_0}}{x^{a}}}$$

for $x$ gives us a threshold $g_0 = 6.4 \cdot 10^{-7}$ for the desired 95% accuracy.

Figure 2: Relation between phrase alignment score g (horizontal) and the ratio of good pairs r (vertical). Vertical error bars display Jeffreys intervals at 95% confidence level (Brown, Cai and DasGupta, 2001).

After applying the above-mentioned threshold, we examined the phrases we obtained, sorted by the score $g$. At the beginning we have phrase pairs with $g = 1$. This implies that all the factors have to be equal to 1, i.e. the phrases correspond perfectly: they always occur in the same form, and all the words of the phrases are always translated in the same way. This happens most often when there is an (often unique) citation in a foreign language (i.e. neither Bulgarian nor Slovak), such as the name of a company or product (most striking are those in a foreign script, e.g. Greek occurs relatively often). Since "normal" sentences do not appear here at all, we included the condition $g < 1$ in our filter (such "foreign script" phrases would be excluded by the following filters anyway). Additional heuristic filtering consists of excluding phrases that:
• do not start with a letter
• contain punctuation (apart from a comma)
• are not exactly 4 words long
• contain more than 3 words starting with a capital letter
• end with a preposition
• contain characters out of the appropriate alphabet (Slovak or Bulgarian)
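The scoring, the fitted logistic model and the heuristic filters described in this section can be put together in a short sketch. It is an illustration under stated assumptions rather than the code used in the study: the function names, the alphabet sets, the treatment of commas as non-words and the preposition lists passed in by the caller are all hypothetical. The model is written in the equivalent form r(g) = 1 / (1 + (g / e^{x_0})^{-a}), using the reported values a = 0.142 and e^{x_0} = 6.3 · 10^{-16}.

```python
import unicodedata

A = 0.142          # fitted slope a reported in the text
G_HALF = 6.3e-16   # e^{x_0}: score at which the modelled ratio of good pairs is 0.5

# Assumed alphabets for the character check (lower case only).
SLOVAK = set("aáäbcčdďeéfghiíjklĺľmnňoóôpqrŕsštťuúvwxyýzž")
BULGARIAN = set("абвгдежзийклмнопрстуфхцчшщъьюя")


def score(phi_fe, lex_fe, phi_ef, lex_ef):
    """Symmetric score g: plain product of the four phrase-table scores."""
    return phi_fe * lex_fe * phi_ef * lex_ef


def good_ratio(g, a=A, g_half=G_HALF):
    """Logistic model in ln(g), written as r(g) = 1 / (1 + (g / g_half)^(-a))."""
    return 1.0 / (1.0 + (g / g_half) ** (-a))


def threshold(r0, a=A, g_half=G_HALF):
    """Invert the model: score g at which the modelled ratio equals r0."""
    return g_half * (r0 / (1.0 - r0)) ** (1.0 / a)


def passes_heuristics(phrase, alphabet, prepositions):
    """Apply the listed exclusion rules to one side of a phrase pair."""
    words = phrase.replace(",", " ").split()   # assumption: commas are not words
    if not phrase or not phrase[0].isalpha():
        return False                           # does not start with a letter
    if any(unicodedata.category(ch).startswith("P") and ch != "," for ch in phrase):
        return False                           # punctuation other than a comma
    if len(words) != 4:
        return False                           # not exactly 4 words long
    if sum(word[0].isupper() for word in words) > 3:
        return False                           # more than 3 capitalised words
    if words[-1].lower() in prepositions:
        return False                           # ends with a preposition
    if any(ch.isalpha() and ch.lower() not in alphabet for ch in phrase):
        return False                           # letter outside the alphabet
    return True


G0 = threshold(0.95)
print(f"threshold for 95% accuracy: g0 = {G0:.2e}")
print(passes_heuristics("dobrý deň , milí priatelia", SLOVAK, {"na", "v"}))
print(passes_heuristics("ние отиваме заедно на", BULGARIAN, {"на"}))
```

With the reported parameters, `threshold(0.95)` comes out close to the value of 6.4 · 10^{-7} quoted above. A phrase pair would be kept when its score lies between this threshold and 1 (exclusive) and both sides pass the heuristics.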

Search Interface
After filtering, we obtained 80 thousand phrase pairs, which we indexed by words and lemmas for our dictionary query system. The keys (headwords) for each phrase consist of the union of lemmas and word forms from both the Bulgarian and the Slovak phrase, together with their equivalents without diacritics (for Slovak) and their transliteration into Latin script (for Bulgarian), to facilitate queries by users who cannot easily enter Cyrillic or Slovak diacritics. The database contains 350 thousand words in each language. Counting unique word forms, the Slovak part of the dictionary contains 31 thousand and the Bulgarian part 26 thousand.
For dictionary access, we use a dict (RFC 2229) server as a backend, with a CGI frontend that formats the results in an intuitive and graphically clear way (see Fig. 3). The interface is accessible at http://slovniky.korpus.sk/?d=pskbg.
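The construction of index keys can be sketched as follows. The sketch assumes that keys are lower-cased, that Slovak diacritics are removed by Unicode decomposition, and that Bulgarian is transliterated with a simple letter-by-letter table; the transliteration scheme actually used is not stated in the text, so the table `BG_TO_LATIN` and the function names are hypothetical.

```python
import unicodedata

# Assumed Cyrillic-to-Latin table; the scheme used for the dictionary is not stated.
BG_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ж": "zh",
    "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m", "н": "n",
    "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u", "ф": "f",
    "х": "h", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "sht", "ъ": "a",
    "ь": "y", "ю": "yu", "я": "ya",
}


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (deň -> den)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def transliterate_bg(text):
    """Map each Bulgarian letter to Latin; other characters pass through."""
    return "".join(BG_TO_LATIN.get(ch, ch) for ch in text.lower())


def index_keys(sk_forms, sk_lemmas, bg_forms, bg_lemmas):
    """Union of forms and lemmas from both sides, plus ASCII-friendly variants."""
    keys = set()
    for word in list(sk_forms) + list(sk_lemmas):
        keys.add(word.lower())
        keys.add(strip_diacritics(word.lower()))
    for word in list(bg_forms) + list(bg_lemmas):
        keys.add(word.lower())
        keys.add(transliterate_bg(word))
    return keys


# Invented example: one word form and one lemma per language.
print(sorted(index_keys(["dňa"], ["deň"], ["деня"], ["ден"])))
```

Storing the simplified variants as additional keys, rather than normalising queries at lookup time, lets an unmodified dictionary server answer queries typed without diacritics or in Latin script.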
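To make the backend protocol more concrete, the sketch below builds a DEFINE command and extracts definition bodies from a server reply, following the command and status-code layout of RFC 2229 (status 151 introduces a definition, and a line containing a single dot ends it). It works on strings only and opens no network connection. The database name `pskbg` is borrowed from the URL above as an assumption about the server configuration, and the sample reply is invented.

```python
from typing import List


def define_command(database: str, word: str) -> bytes:
    """Build a DEFINE command; the word is quoted so it may contain spaces."""
    escaped = word.replace("\\", "\\\\").replace('"', '\\"')
    return f'DEFINE {database} "{escaped}"\r\n'.encode("utf-8")


def parse_definitions(response: str) -> List[str]:
    """Collect the text bodies that follow each 151 status line."""
    definitions: List[str] = []
    current = None
    for line in response.split("\r\n"):
        if current is None:
            if line.startswith("151 "):
                current = []          # a definition body starts after this line
        elif line == ".":
            definitions.append("\n".join(current))
            current = None            # a lone dot terminates the body
        else:
            # A leading dot in the body is doubled on the wire; undo that.
            current.append(line[1:] if line.startswith("..") else line)
    return definitions


# Invented server reply for illustration; "pskbg" is an assumed database name.
sample = (
    '150 1 definitions retrieved\r\n'
    '151 "den" pskbg "hypothetical database"\r\n'
    'dobrý deň\r\n'
    '  добър ден\r\n'
    '.\r\n'
    '250 ok\r\n'
)
print(define_command("pskbg", "dobrý deň"))
print(parse_definitions(sample))
```

A CGI frontend along these lines would send the command over a TCP connection to the server, pass the reply to a parser such as `parse_definitions`, and then format each body as HTML.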

Conclusion
This paper presents the results of an experimental study, namely the automatic extraction and presentation of bilingual correspondences from a parallel, aligned Slovak-Bulgarian/Bulgarian-Slovak corpus. The parallel Slovak-Bulgarian corpus, currently under development, is a valuable bilingual resource for language analysis, automatic term extraction, the research and development of machine and human translation systems, and the training of supervised and unsupervised NLP tools.