Extraction and Presentation of Bilingual Correspondences from Slovak-Bulgarian Parallel Corpus

Radovan Garabík, Ludmila Dimitrova



This paper describes the results of the automatic extraction and presentation of bilingual correspondences from the Slovak-Bulgarian Parallel Corpus. Equivalent phrases are extracted from a corpus that has been automatically aligned at the sentence and word level, then filtered, indexed and presented in a dictionary-like interface. The bilingual dictionary database contains 80 thousand phrase pairs consisting of approximately 350 thousand words in each language. Counting unique word forms, the Slovak part of the dictionary contains 31 thousand and the Bulgarian part 26 thousand.
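To illustrate the indexing step and how size figures of this kind can be counted, here is a minimal Python sketch over a toy list of phrase pairs. The function names (`build_index`, `dictionary_stats`), the toy data and the whitespace tokenization are assumptions made for illustration; they are not the authors' implementation.

```python
from collections import defaultdict

def build_index(phrase_pairs):
    """Map each lowercased source word form to the phrase pairs containing it.

    Hypothetical helper: a dictionary-like interface could use such an
    index to look up all phrase pairs in which a queried word occurs.
    """
    index = defaultdict(list)
    for src, tgt in phrase_pairs:
        # set() avoids listing a pair twice when a word repeats in a phrase
        for word in set(src.lower().split()):
            index[word].append((src, tgt))
    return index

def dictionary_stats(phrase_pairs):
    """Count phrase pairs, words and unique word forms for each language.

    Assumption: words are separated by whitespace, which is a
    simplification of real tokenization.
    """
    src_words = [w for src, _ in phrase_pairs for w in src.split()]
    tgt_words = [w for _, tgt in phrase_pairs for w in tgt.split()]
    return {
        "pairs": len(phrase_pairs),
        "src_words": len(src_words),
        "tgt_words": len(tgt_words),
        "src_unique": len(set(src_words)),
        "tgt_unique": len(set(tgt_words)),
    }

# Toy Slovak-Bulgarian phrase pairs (illustrative only, not corpus data)
pairs = [
    ("dobrý deň", "добър ден"),
    ("dobrý večer", "добър вечер"),
]

index = build_index(pairs)
stats = dictionary_stats(pairs)
print(stats)
print(index["dobrý"])
```

Looking up `index["dobrý"]` returns both toy pairs, which mirrors how a dictionary-like interface might list every phrase pair containing a queried word form.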


translation equivalents; GIZA++; parallel corpora; aligned text; Slovak; Bulgarian




Copyright (c) 2015 Radovan Garabík, Ludmila Dimitrova
