DOI: https://doi.org/10.11649/cs.2015.024

Digital Corpora and their Applications in Semantic Studies and Lexicography

Ludmila Dimitrova, Violetta Koseska-Toszewa

Abstract


The paper describes the first Bulgarian-Polish digital resources, namely parallel and comparable corpora, and their applications in semantic studies and lexicography, in particular in the creation of a Bulgarian-Polish digital dictionary, which forms a significant part of these bilingual resources. Examples illustrate the value of the links between the aligned bilingual corpus and the digital dictionary. These resources are the main result of the collaboration between the Institute of Mathematics and Informatics of the Bulgarian Academy of Sciences (IMI-BAS) and the Institute of Slavic Studies of the Polish Academy of Sciences (ISS-PAS), which was first established in 2006.


Keywords


digital corpora; aligned corpus; bilingual corpus; digital dictionary; digital lexicographic resource; semantic study

Copyright (c) 2015 Ludmila Dimitrova, Violetta Koseska-Toszewa

License URL: http://creativecommons.org/licenses/by/3.0/pl/