DOI: https://doi.org/10.11649/cs.1468

Testing word embeddings for Polish

Agnieszka Mykowiecka, Małgorzata Marciniak, Piotr Rychlik

Abstract


Testing word embeddings for Polish

Distributional semantics postulates that word meaning can be represented by numeric vectors that encode the contexts in which words occur in large text data. This paper addresses the problem of constructing such models for the Polish language. It compares the effectiveness of lemma-based and form-based models created with the Continuous Bag of Words (CBOW) and skip-gram approaches on different Polish corpora. The comparison rests on the results of two typical tasks solved with the help of distributional semantics: synonymy recognition and analogy recognition. The results show that no single universal approach to vector creation suits all tasks. The most important factor is the quality and size of the data, but different strategy choices can also lead to significantly different results.

 

Testing distributional vector representations of Polish words

Distributional semantics rests on the assumption that the meaning of a word is expressed by vectors that represent, directly or indirectly, the contexts in which the word is used in a large collection of texts. This article concerns the evaluation of many such models constructed for the Polish language. The paper compares the effectiveness of models based on lemmas and on word forms, created using neural networks on data from two different corpora of Polish. The evaluation was based on the results of two typical tasks solved using distributional semantics methods, namely recognizing synonymy and analogy between specific pairs of words. The results obtained show that no single universal approach to building distributional models can be identified, since their effectiveness varies with the application. The most important factor affecting model quality is the quality and size of the data, but the choice of network training strategy can also lead to significantly different results.
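
The two evaluation tasks named in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code. The three-dimensional vectors and the word choices below are invented purely for illustration, the helper names `pick_synonym` and `solve_analogy` are hypothetical, and the vector-offset rule used for analogies is one commonly used method that the abstract does not confirm as the one applied in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors invented for this sketch; real embeddings would be
# learned from a corpus and have hundreds of dimensions.
TOY_VECTORS = {
    "król": [0.9, 0.8, 0.1],        # king
    "królowa": [0.9, 0.1, 0.8],     # queen
    "mężczyzna": [0.1, 0.9, 0.1],   # man
    "kobieta": [0.1, 0.1, 0.9],     # woman
    "auto": [0.2, 0.5, 0.4],        # car
    "samochód": [0.25, 0.45, 0.4],  # car (synonym of "auto")
}


def pick_synonym(vectors, target, candidates):
    """Synonymy task: choose the candidate closest to the target word."""
    return max(candidates, key=lambda w: cosine(vectors[target], vectors[w]))


def solve_analogy(vectors, a, b, c):
    """Analogy task: 'a is to b as c is to ?' via the offset b - a + c."""
    query = [vb - va + vc
             for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    excluded = {a, b, c}  # the question words themselves are not valid answers
    return max((w for w in vectors if w not in excluded),
               key=lambda w: cosine(query, vectors[w]))


if __name__ == "__main__":
    print(pick_synonym(TOY_VECTORS, "auto", ["samochód", "król", "kobieta"]))
    print(solve_analogy(TOY_VECTORS, "mężczyzna", "król", "kobieta"))
```

In an evaluation of this kind, a model's score on each task would be the fraction of test items for which the selected word matches the expected answer.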


Keywords


distributional semantics; word embeddings; model evaluation; synonymy; analogy



Copyright (c) 2017 Agnieszka Mykowiecka, Małgorzata Marciniak, Piotr Rychlik

License URL: http://creativecommons.org/licenses/by/3.0/pl/