TESTING WORD EMBEDDINGS FOR POLISH

Distributional semantics postulates that word meaning can be represented by numeric vectors derived from the contexts in which words occur in large text data. This paper addresses the problem of constructing such models for the Polish language. It compares the effectiveness of models based on lemmas and on word forms, created with the Continuous Bag of Words (CBOW) and skip-gram approaches from different Polish corpora. For the purposes of this comparison, the results of two typical tasks solved with the help of distributional semantics, i.e. synonymy and analogy recognition, are compared. The results show that it is not possible to identify one universal approach to vector creation applicable to various tasks. The most important feature is the quality and size of the data, but different strategy choices can also lead to significantly different results.


Introduction
Distributional Semantics (DS) is currently widely used in many tasks in the domain of Natural Language Processing (NLP). Its main assumption is that the meaning of a word can be inferred (to some extent) from its usage. Therefore, in DS models words are represented as vectors whose positions directly or indirectly represent information about the frequency of the particular word occurring in its context. The underlying concept of this approach is not new. The suggestion that "the meaning of words lies in their use" was formulated by Wittgenstein in 1953 and, in 1957, Firth stated "You shall know a word by the company it keeps!". At around the same time Harris, in the paper "Distributional structure" (Harris, 1954), formulated an idea which can be directly implemented in computer programs: "The distribution of an element will be understood as the sum of all its environments. An environment of an element A is an existing array of its co-occurrents". However, in spite of this theoretical support, the idea of distributional semantics was to remain rather marginal for quite some time. This situation changed in the late 1990s, with enhancements in both language corpora availability and technical capabilities, and when distributional methods had proven themselves effective in both modelling cognitive phenomena and in practical applications. They have been used, for example, for word sense disambiguation problems (Schutze, 1998); to model human similarity judgements (McDonald, 2000); to enhance n-gram language models with long range semantic information (Bellegarda, 2000; Coccaro & Jurafsky, 1998); to identify synonyms (Landauer & Dumais, 1997); and to model semantic priming (Lund & Burgess, 1996). Positive and easily achievable results, and the creation of publicly available tools for building distributional models, have increased the popularity of this approach even further.
Lying at the core of the distributional approach is the vector representation of words. A vector model is built on the basis of an appropriate corpus: a set of texts, either plain or annotated with some morphosyntactic features. As the distribution data is more reliable when a source text is large, it is common practice to use an existing large corpus of general texts or even to combine several corpora. As usual with NLP technology, texts should cover the appropriate domain and genre. Building more complex models requires the addition of different types of annotation, which can be done with the help of linguistic tools or manually, e.g. using the crowdsourcing approach. For every word from a given corpus, one can count the contexts in which it occurs. Collected contexts create a huge matrix, which is then transformed (e.g. using linear algebra) into a matrix approximating meaning. Each row in this matrix is a vector representation of one entity (usually a word). The similarity of vectors can be measured with standard mathematical functions, for example the cosine of the angle between them. Similar vectors are considered to represent related words. The relatedness of words is general and cannot be precisely defined. In this paper, as in Budanitsky and Hirst (2006), it consists of well-established relations such as: synonymy (amazing - wonderful) and antonymy (good - bad); hyperonymy and hyponymy (bird - crow; cutlery - spoon); co-hyponymy (coffee - tea, dog - cat); meronymy (flat - room); and other functional associations (coffee - cup, state - legislation).
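The cosine measure mentioned above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `cosine_similarity` and the three toy vectors are assumptions made for the example, and the numbers are invented rather than taken from any of the models discussed in this paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy vectors with invented values, for illustration only.
coffee = [4.0, 3.0, 0.0]
tea = [3.0, 4.0, 0.0]
legislation = [0.0, 0.0, 5.0]

print(cosine_similarity(coffee, tea))          # close to 1: related words
print(cosine_similarity(coffee, legislation))  # 0: no shared contexts
```

Because the cosine depends only on the angle between vectors, frequent and rare words can be compared without their raw frequency dominating the score.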
Transforming corpus data into vector representations can be done in several ways. The most direct, count-based strategy consists of collecting all context data from all word occurrences and then transforming the resulting matrix using some kind of weighting function. Weighting is aimed at strengthening surprising events and weakening highly expected events, because it is more informative if something rare occurs than if something quite common takes place. In DS models, this means that having a rare context in common, e.g. 'roar', should make words more similar than having more typical common contexts, e.g. 'run'. The most commonly used method of formalizing the idea of rare and frequent words for term-document matrices is the tf-idf (term frequency × inverse document frequency) function (Spärck Jones, 1972). In information theory, a surprising event has a higher information content than an expected event (Shannon, 1948). A frequently used alternative to tf-idf is PMI (Pointwise Mutual Information; Church & Hanks, 1989; Turney, 2001). The final, optional, step in building a DS model is dimensionality reduction, which aims to establish the most informative dimensions, usually from hundreds of thousands of different contexts. Dimensionality reduction can be performed by feature selection, but it is typically done by SVD (Singular Value Decomposition), which is the core of the Latent Semantic Analysis/Latent Semantic Indexing (LSA/LSI) method (Landauer, Foltz, & Laham, 1998). It constructs a low-rank approximation to the word-context matrix.
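As a rough illustration of the weighting idea, the sketch below computes PMI from a table of word-context co-occurrence counts, using relative frequencies as probability estimates. The function name `pmi_table` and the counts are assumptions made for this example; the counts are invented and only echo the 'roar' versus 'run' contrast described above.

```python
import math
from collections import Counter

def pmi_table(pair_counts):
    """Return {(word, context): PMI} from co-occurrence counts.

    PMI(w, c) = log2( P(w, c) / (P(w) * P(c)) ), with probabilities
    estimated as relative frequencies over all observed pairs.
    """
    total = sum(pair_counts.values())
    word_totals = Counter()
    context_totals = Counter()
    for (word, context), n in pair_counts.items():
        word_totals[word] += n
        context_totals[context] += n
    return {
        (word, context): math.log2(
            n * total / (word_totals[word] * context_totals[context])
        )
        for (word, context), n in pair_counts.items()
        if n > 0
    }

# Invented counts: 'roar' is a rare context, 'run' a common one.
counts = {
    ("lion", "roar"): 2,
    ("lion", "run"): 8,
    ("dog", "run"): 10,
}
scores = pmi_table(counts)
for pair, value in sorted(scores.items()):
    print(pair, round(value, 3))
```

In this toy table the rare context receives a higher score for 'lion' than the common one, although its raw count is lower, which is the effect the weighting step is meant to achieve.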
The second way to transform context counts into vectors representing word meanings (word embeddings) is called GloVe (Global Vectors; Pennington, Socher, & Manning, 2014). The main concept behind GloVe is the observation that ratios of word-word co-occurrence probabilities can encode some sense of meaning. The training objective of GloVe is to transform original frequency-based word vectors so that their dot product equals the logarithm of the words' probability of co-occurrence.
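For reference, the relation described above is usually written as follows in Pennington et al. (2014). The notation is that of the GloVe paper, not of this article: X_ij is the number of times word j occurs in the context of word i, w_i and \tilde{w}_j are the word and context vectors, b_i and \tilde{b}_j are bias terms (which absorb the normalising term of the log-probability), V is the vocabulary size and f is a weighting function that limits the influence of very frequent pairs.

```latex
w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j = \log X_{ij},
\qquad
J = \sum_{i,j=1}^{V} f(X_{ij})\,
    \bigl(w_i^{\top}\tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\bigr)^{2}
```

The first relation is the target the model aims at; the second is the weighted least-squares cost that is minimised because the target cannot be met exactly for all pairs.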
The third method of building distributional models, and the one which has probably gained the most spectacular popularity, is to train a neural network (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013) to predict a word given a context (CBOW approach), or a context given a word (skip-gram approach), on the basis of a corpus in which every word occurrence represents one learning example. In this approach, word sense is represented as a vector of the neural network layer. This method was implemented by the authors as the word2vec algorithm, which uses a very efficient learning strategy, allowing neural network models to be built much faster than with the methods previously used.
Although distributional semantics has become very popular, there are only a few published papers concerning the vector representation of Polish words. There is a tool, Supermatrix, which builds a distributional model, taking into account a predefined set of features and SVD dimensionality reduction, and computes word similarity (Broda & Piasecki, 2008, 2013). Kędzia, Czachor, Piasecki, and Kocoń (2016) published a skip-gram model of Polish created by word2vec with 100-dimensional feature vectors. The presentation word2vec dla Polskiego Internetu "word2vec for Polish Internet" (Stokowiec, 2015) is available on the Internet. The problem of synonyms and lexical variants is described in the paper (Tatjewski, Bańko, Kucińska, & Rączaszek-Leonardi, 2017). Rogalski and Szczepaniak (2016) published a paper concerning the creation of word embeddings and embeddings themselves. They re-implemented the Mikolov, Sutskever, et al. (2013) algorithm and created vectors from Polish Wikipedia.
This paper verifies the distributional semantic models (DSM) for Polish created by word2vec from the gensim package (Řehůřek & Sojka, 2010), https://radimrehurek.com/gensim/, and compares them with previously published resources. The functionalities available in the word2vec tool are tested to discover which parameter values are the best for processing Polish, a highly inflectional language. Models based on lemmas and forms for corpora consisting of Polish Wikipedia (WikiPL) and the National Corpus of Polish (NKJP; Przepiórkowski, Bańko, Górski, & Lewandowska-Tomaszczyk, 2012) are created. The results obtained by the CBOW and skip-gram architectures using 100- and 300-dimensional vectors are compared. Moreover, the paper examines how the removal of infrequent forms from the data influences the results.
The evaluation of DSM has been the subject of many studies. Two ways of performing this evaluation are possible: intrinsic evaluation (testing a system in itself), e.g. Tsvetkov, Faruqui, Ling, Lample, and Dyer (2015), and extrinsic evaluation, measuring its performance in a task or application, e.g. Cheung and Penn (2012), which reports on testing syntactically invariant inference. The problem with performing an extrinsic evaluation is that task-oriented benchmarks adopted in distributional semantics tasks, such as the TOEFL synonym detection task, have not been specifically designed to evaluate DSMs. Thus, the results obtained reveal more about the particular solution of the task than about a specific element of the processing flow, i.e. in this case a DS model. To gain a real insight into the abilities of DSM, Baroni and Lenci (2011) postulate that existing benchmarks must be complemented with a more intrinsically oriented approach. Although aware of the many problems also identified for intrinsic DSM evaluation, described for example in Faruqui, Tsvetkov, and Rastogi (2016) and Jastrzebski, Leśniak, and Czarnecki (2017), it was decided to perform such an evaluation using already available data, in order to gain some knowledge about the differences in the quality of various word models for Polish. As there are still no sets designed to test the specific aspects of lexical knowledge for Polish, it was decided to use two existing lexicons of synonyms. In order to make the comparison more robust, a set of analogy pairs covering many types of relations apart from synonymy was defined.

Corpora description
The experiments with DS models employ the NKJP and WikiPL corpora, as well as the combined set of these two corpora. A small, openly-accessible subset of NKJP is downloadable from http://clip.ipipan.waw.pl/NationalCorpusOfPolish, but for this paper the full data of the NKJP project consortium was used, by permission of the project leader. NKJP and Wikipedia, dump of late 2016 (https://dumps.wikimedia.org/plwiki), were annotated using Concraft-pl (Waszczuk, 2012), a morphosyntactic tagger for Polish based on constrained conditional random fields. Several input sets for building DS models were prepared. They can be divided into two main groups: one containing orthographic forms and one containing lemmas generated by a tagger. All sentences were scanned to remove tokens that are punctuation marks, or which contain characters other than a letter or digit. All words were converted to lower case, unless capital letters had been found in their lemmas.
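The two cleaning rules described above can be sketched as follows. This is not the script used for the experiments: the function name `clean_sentence`, the (form, lemma) input format and the use of `str.isalnum()` as the test for "letters and digits only" are assumptions made for this illustration, and the lemmas in the example are written by hand rather than produced by Concraft-pl.

```python
def clean_sentence(tagged_tokens):
    """Filter and normalise one sentence given as (form, lemma) pairs.

    Tokens whose form contains anything other than letters or digits
    (punctuation included) are dropped. Forms are lower-cased unless
    the lemma itself contains a capital letter.
    """
    cleaned = []
    for form, lemma in tagged_tokens:
        if not form.isalnum():           # punctuation, hyphens, symbols
            continue
        if lemma == lemma.lower():       # no capital letter in the lemma
            form = form.lower()
        cleaned.append((form, lemma))
    return cleaned

# Hand-written example; the lemmas are illustrative, not tagger output.
sentence = [
    ("Szkoła", "szkoła"),
    ("w", "w"),
    ("Warszawie", "Warszawa"),
    ("była", "być"),
    ("zamknięta", "zamknąć"),
    (".", "."),
]
print(clean_sentence(sentence))
```

The sentence-initial "Szkoła" is lower-cased because its lemma has no capital letter, whereas "Warszawie" keeps its capital because the lemma "Warszawa" is capitalised.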
Unfortunately, using data annotated by Concraft-pl has a potential drawback that may influence the results of experiments. Some words, mainly verb forms, are divided into several tokens. For example, the word chciałbym 'I would like' will be split into three tokens: chciał (past tense), by (qublik), and m (agglutinate). Similarly, the word biało-czerwony 'white and red' will be split into biało, punctuation mark '-', and czerwony. To test the influence of some potentially not very informative tokens like '-', 'by' or 'm', restricted data sets were prepared, which only included tokens classified as nouns, adjectives, adverbs, verb forms, and abbreviations, which constitute 19 parts of speech (POS) out of the 34 foreseen in NKJP. All other words are treated as if they do not exist in the data. The sizes of the corpora used are shown in Table 1.

Models
There are many assumptions that may influence the performance of a particular vector model in a particular task. These assumptions can be divided into four categories, concerning:
• model elements, i.e. if a model is built for word forms or lemmas or for more complicated structures (such as a word being a noun, or a particular word being the object of a particular verb);
• context definition, i.e. whether all or only selected words will be taken into account as context values, and which features to include, e.g. only word forms, their POS, grammatical relations, etc.;
• the method used for transforming raw data into a final model;
• the values of parameters specific for a chosen method.
All the models used in this paper were built using gensim word2vec. In the description below, a naming convention for the models is given in brackets.
Both the CBOW (c) and skip-gram (s) approaches were tested. The models were built on NKJP data (N), Wikipedia (W), and the two corpora joined together (NW) consisting of either:
• word forms (fa);
• lemmas (la);
• word forms restricted to 19 out of 34 POS (fr);
• lemmas restricted to 19 out of 34 POS (lr);
• lemmas combined with a part of speech name (-pos).
As learning strategies, the experiment used either hierarchical softmax (h) or negative sampling (n) in the standard configuration of 5 positive examples and 1 negative. The number of features, i.e. different types of contexts represented for one word, was either 100 (1) or 300 (3). The context size is equal to 5, the minimal number of occurrences is 5 and there are 10 learning steps. To test the influence of rare words (and some non-words, spelling errors, etc.), selected models were built, limited to words occurring no fewer than 50 times for NKJP data or no fewer than 30 times for Wikipedia data. These models are: NWfa-3-c-h50 (NKJP plus WikiPL, all word forms, CBOW, 300 features, hierarchical softmax, word form occurrences no fewer than 50), NWfa-3-c-n50 (the same as before, but with negative sampling), Wfa-3-c-h30 and Wfa-3-c-n30. For selected models, it was also tested whether increasing the number of training steps to 100 influences the results (-it).
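The naming convention can be made concrete with a small parser that turns a model name into a set of training parameters. This is only a sketch under stated assumptions: the function `parse_model_name` is hypothetical, the position of the "-it" suffix at the end of the name is a guess, the "-pos" variant is not handled, and `negative=5` is our reading of the "standard configuration" of negative sampling. The keys `vector_size`, `window`, `min_count`, `sg`, `hs`, `negative` and `epochs` follow the keyword names of `gensim.models.Word2Vec` in gensim 4.x; earlier gensim versions use different names for some of them.

```python
import re

# corpus + units, then features, architecture, learning strategy,
# optional frequency threshold and optional "-it" suffix (assumed position).
_NAME = re.compile(
    r"^(?P<corpus>NW|N|W)(?P<units>fa|la|fr|lr)"
    r"-(?P<features>[13])-(?P<arch>[cs])-(?P<learn>[hn])"
    r"(?P<min_count>\d+)?(?P<more_iter>-it)?$"
)

_CORPORA = {"N": "NKJP", "W": "WikiPL", "NW": "NKJP+WikiPL"}
_UNITS = {
    "fa": "all word forms",
    "la": "all lemmas",
    "fr": "word forms, restricted POS",
    "lr": "lemmas, restricted POS",
}

def parse_model_name(name):
    """Map a model name such as 'NWfa-3-s-n50' to training settings."""
    m = _NAME.match(name)
    if m is None:
        raise ValueError(f"unrecognised model name: {name!r}")
    use_hs = m["learn"] == "h"
    return {
        "corpus": _CORPORA[m["corpus"]],
        "units": _UNITS[m["units"]],
        "vector_size": 100 if m["features"] == "1" else 300,
        "window": 5,                              # context size
        "sg": 1 if m["arch"] == "s" else 0,       # 1 = skip-gram, 0 = CBOW
        "hs": 1 if use_hs else 0,                 # hierarchical softmax
        "negative": 0 if use_hs else 5,           # assumed default setting
        "min_count": int(m["min_count"]) if m["min_count"] else 5,
        "epochs": 100 if m["more_iter"] else 10,  # learning steps
    }

print(parse_model_name("NWfa-3-s-n50"))
print(parse_model_name("Wla-3-c-h"))
```

Apart from the descriptive `corpus` and `units` entries, the resulting dictionary could be passed as keyword arguments to a word2vec trainer, which keeps the model names and the actual settings in one place.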
As well as these models, publicly available models (named pl-emb-c and pl-emb-s) from the paper (Rogalski & Szczepaniak, 2016) were also used. These are CBOW and skip-gram models with negative sampling trained on pre-processed data from Wikipedia. All Wikipedia text was changed to lower case, numbers were divided into separate digits and converted to words, and some non-text elements were deleted. The skip-gram model published by Kędzia et al. (2016) was not used due to technical problems with processing it. Moreover, there are no details about the corpus or the exact technique that was used to obtain the data.

Tasks description
The main problem when comparing many alternative models is to set up a relatively large-scale, repeatable experiment which uses open and high-quality data. Satisfying all these requirements is very difficult, and in many cases even unfeasible, as preparing test data is highly labour-intensive. It is well-documented that the similarity of word embeddings reflects many different relations between the given words, e.g. synonymy, antonymy, hypernymy or hyponymy (Scheible, Schulte im Walde, & Springorum, 2013; Weeds, Clark, Reffin, Weir, & Bill, 2014). For example, in one of the models used in this paper the most similar words to przyjazd 'arrival, using some ground vehicle' are: przylot 'arrival by plane', wyjazd 'departure' and przybycie 'arrival, no means specified'; to cichy 'quiet, noiseless' the most similar words are: wesoły 'cheerful', delikatny 'delicate', cichutki 'quiet, barely audible', miły 'nice'. To allow for massive and maximally objective tests, two specific problems were chosen to be solved using all of the models: synonym identification and analogy testing.

Synonymy
From the many possible relatedness relations only one, synonymy, was selected, as it facilitated the preparation of the test data and the interpretation of the results. It was assumed that if word embeddings correctly represent word senses, then they should also correctly indicate synonyms as words whose embeddings are very close. The larger the number of synonyms at the top of the ranked list of similar words, the better the model represents word senses. It should be stressed that the goal was not to elaborate the most efficient way of finding synonyms, but rather to ascertain which model represents word synonymy in the best way. For this reason, we did not add any additional methods for filtering non-synonyms from the ranked similar word lists, but instead evaluated the original lists obtained using different model settings.

For the evaluation of the different embedding models, two publicly available collections of synonyms were selected. The first is a free online resource, created and edited by volunteers. The lexicon contains more than 600,000 synonyms of almost 150,000 words and it is available at www.synonim.net. The second was created and is maintained by Wojciech Broniarek and its original version was published as a synonym lexicon entitled Gdy Ci słowa zabraknie "When You Are Lost For Words". It is still edited and extended by the author, and is also available on-line at www.synonimy.pl. This set is smaller, as the adopted rules for assigning synonyms to lexical entries are more restrictive. For example, for the word słowo 'word', the first set contains 74 synonyms, while the second one only 15. Most of the synonyms are single words, but some multiword phrases are also included. Both lexicons differentiate word meanings, but as we do not build representations for senses but for words, we merge synonyms for all word senses together. Tests were limited to words from the three main syntactic categories: nouns, verbs and adjectives. Each category is represented by 50 frequent words selected from the list of NKJP forms. Moreover, an attempt was made to select forms/words which are not ambiguous in terms of parts of speech. When selecting nouns, those which can also be gerunds were avoided. In Table 2 the least and the most frequent words, together with the number of their occurrences in the combined NKJP and WikiPL corpus, are given. It was decided to perform the tests separately for words of different categories (the selection concerns only the sets of tested words; words of other categories were not filtered out of the lists of the most similar words) to check if the difference in syntactic structures in which these words occur might influence the results. We also wanted to test how inflection influences the quality of embeddings. In Polish, all nouns have gender and have to agree in gender with the modifying adjectives. Gender agreement is also visible between a noun which is the subject of a verb in the third person in the past tense and the verb itself, e.g. "Szkoła_fem była_fem zamknięta_fem." 'The school was closed.', "Dworzec_masc był_masc zamknięty_masc." 'The station was closed.'. Third person constructions occur very frequently in texts: in Wikipedia, there are 60 times more 3rd person than 1st person constructions and in NKJP the proportion is roughly 3.5 to 1. This may lead to different lists of the most similar suggestions for synonyms of different genders. For example, the list of related words for osoba_fem 'person' contains mainly feminine nouns like kobieta 'woman', dziewczyna 'girl', osóbka 'wench', duszyczka 'soul', prostytutka 'prostitute', while the list of related words for człowiek_masc 'man' contains masculine nouns like mężczyzna 'man', facet 'guy', chłopak 'boy', osobnik 'individual', chłop 'peasant', and on 'he'. To see whether gender inflection influences the similarity results, models were built on word lemmas and directly on word forms. To evaluate this second set of models, we generated all the forms of synonyms taken from both lexicons and all the forms of the selected lexicon entries. While models based on lemmas could possibly overcome the disagreement of inflectional features, we wanted to ascertain whether or not they are influenced by the relatively poor quality of Polish lemmatizers (when a form can belong to more than one word of the same syntactic category, Polish taggers do not assign lemmas with great precision, e.g. they quite frequently assign to the word form mają 'they have' the lemma maić 'to adorn with verdure' rather than 'to have'). Table 3 shows the names and the number of elements of the two synonym sets defined for the three syntactic categories.

Analogy
The second task concerns the identification of analogy (Baroni, Dinu, & Kruszewski, 2014; Mikolov, Yih, & Zweig, 2013). This type of relation is open and is defined by a pair of words that are in this relation, e.g. jesień-deszcz 'autumn-rain' or Polska-Warszawa 'Poland-Warsaw'. The algorithm has to identify the word that is in the same relation with a new word given as an input. Thus, in the first case, for the word zima 'winter', we would expect śnieg 'snow' (for data in Polish at least), and in the second case, for Francja 'France', we would expect Paryż 'Paris'. In this task, the selection of both the initial pairs and the test words is crucial. The relations between two words can be hard to recognize, as in filiżanka-kot 'cup-cat', if we have in mind 'something that can be broken by a cat'. This task has its source in the college admission test in the United States, the SAT (Shaw, 2015), which includes questions of this type. For each pair, we tested the 10 first nearest vectors and checked if they were consistent with the word given in the pair. The tested relations, divided into groups of similar ones (together with the number of examples), are given in Table 4. Moreover, we prepared 20 additional analogies representing grammatical relations, e.g. kot-kotom 'cat-cat_pl,dat', pisał-pisała 'wrote-wrote_fem' and mały-mniejszy 'small-smaller'. These were only tested on form-based models. The test contained: noun-noun in plural (3); noun in the nominative case-noun in a different case (3); noun-noun in various cases and numbers (7); adjective-adjective in the higher degree (2); adjective-adjective in different number and gender (1); verbs in the present-past tenses (3); and verb in the singular-plural (1).

Results

Synonymy
Figures 1 and 2 show the performance of all the models trained on the three corpora for form- and lemma-based models respectively. (Models trained on restricted POS data are not shown here in the interests of greater readability.) The data shows how many words from the synonym sets from Table 3 are found within the first 50 most similar words for each of the 50 elements checked (so the maximum value could be 2500 if every word had 50 synonyms and all of them were placed in the top 50 positions of the similarity lists). For the N2 and A2 sets, this could possibly be 100% of the appropriate set (as, in the second set, the number of synonyms is usually much lower than 50). The differences between results for various parts of speech are not clear, but different models are the most efficient for a particular word category. Due to the different sizes of test sets, the figures illustrate the changes in a model's efficiency for every test set separately, but they cannot be used directly to compare the performance for different test sets.
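The counting behind these figures can be sketched as follows, assuming that each test word comes with a ranked list of its most similar words and a merged set of gold synonyms from a lexicon. The function names `count_synonym_hits` and `precision_at_k`, the data layout and the gold sets in the example are assumptions made for this illustration; in particular, the gold synonyms below are invented and are not taken from either lexicon.

```python
def count_synonym_hits(similar_lists, gold_synonyms, top_k=50):
    """Count gold synonyms found among the top_k most similar words.

    similar_lists: {test_word: [most similar words, best first]}
    gold_synonyms: {test_word: set of synonyms merged over all senses}
    """
    hits = 0
    for word, ranked in similar_lists.items():
        gold = gold_synonyms.get(word, set())
        hits += sum(1 for candidate in ranked[:top_k] if candidate in gold)
    return hits

def precision_at_k(hits, n_test_words, top_k=50):
    """Share of list positions holding a synonym (50 x 50 = 2500 here)."""
    return hits / (n_test_words * top_k)

# Ranked lists follow the examples quoted earlier; gold sets are invented.
similar_lists = {
    "cichy": ["wesoły", "delikatny", "cichutki", "miły"],
    "przyjazd": ["przylot", "wyjazd", "przybycie"],
}
gold_synonyms = {
    "cichy": {"cichutki", "bezgłośny"},
    "przyjazd": {"przybycie", "przylot"},
}
print(count_synonym_hits(similar_lists, gold_synonyms))           # all lists
print(count_synonym_hits(similar_lists, gold_synonyms, top_k=2))  # top 2 only
print(precision_at_k(500, n_test_words=50))
```

For the smaller test sets, the same number of hits could instead be divided by the total size of the gold set, which is the alternative denominator used for N2 and A2.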
Table 5: Precision of selected models for all test sets counted either for 2500 elements or for the test set size (for N2 and A2)

Model          N1    N2    A1    A2    V1
NWla-3-c-n50   0.23  0.11  0.23  0.11  0.21
NWfa-3-s-n50   0.25  0.32  0.30  0.33  0.19
NWla-3-c-n     0.22  0.20  0.22  0.19  0.21
NWfa-3-s-n     0.16  0.18  0.18  0.14  0.18
NWla-1-c-n     0.21  0.23  0.19  0.18  0.17
NWfa-1-c-n     0.16  0.13  0.18  0.14  0.18
NWla-3-s-n     0.07  0.08  0.09  0.11  0.14
pl-emb-s       0.23  0.27  0.17  0.18  0.12
Wla-3-c-h      0.25  0.20  0.16  0.12  0.15
Wfa-3-c-n30    0.12  0.31  0.17  0.14  0.13

Table 5 contains the overall precision for the selected models. We have not reported recall, as the results are highly influenced by the fact that the N1, A1 and V1 test sets are much larger than the given threshold (50 top most similar words). The best achieved precision was 0.33 for the NWfa-3-s-n50 model on the A2 test set. The results are low, partially due to the fact that the lists contain words of different syntactic categories. Moreover, in cases when one word meaning is predominant, the top of the list of similar words reflects this one sense only, while the test sets contain synonyms for all, even rare, senses.

Figure 1: The number of retrieved synonyms in 2500 list consisting of the 50 first most similar words for all 50 test words; form-based models
The results obtained for different models and each test set were checked using Student's t-test for means. For the lemma-based models, eliminating low frequency words did not improve the results. Model NWla-3-c-n50 was equally as good as NWla-3-c-n for N1, A1 and V1 but worse for N2 and A2. The skip-gram approach produces significantly worse results than CBOW: the NWla-3-s-n model is significantly worse than NWla-3-c-n. The smaller number of features (NWla-1-c-n) worsens the results for A1 and V1 only. The Wla-3-c-h model is equally good for N1 as the best models which are based on much more data.
For form-based models, the NWfa-3-s-n50 model was statistically significantly better than other form-based models for the N2, A1 and A2 sets. For N1, the results of the pl-emb-s model are statistically equally as good as for NWfa-3-s-n50. For V1, both NWfa-3-s-n and NWfa-1-c-n models give similar results. For verbs, the pl-emb-s and Wfa-3-c-n30 models are worse than others (a statistically significant difference). The Wfa-3-c-n30 model is equally as good for N2 as the best models, but for the other sets it produces significantly worse results.
The results obtained from the word form-based models are more uniform across various model parameters than those for the lemma-based models. In the latter case, CBOW has a clear advantage over the skip-gram approach. The results for CBOW models are twice as good.
Figure 3 shows the results for nouns, adjectives and verbs for all models separately. The left column of the diagram shows word form models, the right column lemma-based ones. For lemma-based models, the advantage of the CBOW approach is visible for all test sets. Using skip-gram with negative sampling is the worst strategy here, and there is only a slight improvement for models based on 300 features for nouns. For adjectives and verbs, there is slightly more improvement. For form-based models, the choice of learning strategy is not very important. For all but verbs, the skip-gram model based on a large corpus, with low frequency words eliminated, is the best choice and sometimes its results are far superior to the others. Pl-emb models are good for smaller sets of noun and adjective synonyms (N2 and A2), while they are only slightly better than word2vec models for larger sets and perform at the same level for verbs. Eliminating words with a count lower than 50 did not change the results significantly for lemma-based CBOW with negative sampling models, but was very important for most form-based models of this type (green lines in Figure 3). Using only selected POS categories to train a model (models with 'r') worsens the results for lemmas but improves them (very slightly) for forms.

Analogy
The same models were used to search for analogies. Figure 4 illustrates the overall performance of all the models in this task. It shows the number of correctly recognised analogies (the whole set described in Table 4) and compares the models based on lemmas (blue) and forms (red). It shows the higher efficiency of lemma-based models, which are systematically better than the respective form-based models. The best results were obtained from the models based on NKJP and Wiki, but the models based solely on NKJP are similarly effective for some parameters. The Nlr-3-c-n model gives the best results on the tested set of analogies, but differences in the results of several other good models are within the margin of error. Generally, models based solely on Wikipedia are less effective than others, but both pl-emb models are even better than some models based on the large amount of NKJP data.
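To make the analogy test concrete, the sketch below shows the common vector-offset formulation (Mikolov, Yih, & Zweig, 2013): the candidate answers for a pair a-b and a new word c are the words closest to vec(b) - vec(a) + vec(c), and an item counts as solved if the expected word is among the 10 nearest. Whether the experiments used exactly this procedure is an assumption; the function `analogy_candidates` and the three-dimensional toy vectors are invented for this illustration.

```python
import math

def _cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def analogy_candidates(vectors, a, b, c, top_n=10):
    """Rank words by similarity to vec(b) - vec(a) + vec(c).

    The three input words are excluded from the candidates.
    """
    target = [y - x + z
              for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    scored = [(_cosine(target, vec), word)
              for word, vec in vectors.items()
              if word not in (a, b, c)]
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_n]]

# Invented toy vectors; dimensions loosely mean
# (Poland-related, capital city, France-related).
toy = {
    "Polska": [1.0, 0.0, 0.0],
    "Warszawa": [1.0, 1.0, 0.0],
    "Francja": [0.0, 0.0, 1.0],
    "Paryż": [0.0, 1.0, 1.0],
    "deszcz": [1.0, 0.0, 1.0],
}
answers = analogy_candidates(toy, "Polska", "Warszawa", "Francja")
print(answers)
print("Paryż" in answers)  # the item is solved if the expected word is listed
```

Excluding the three input words matters in practice, because the nearest neighbour of the offset vector is very often one of the input words themselves.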
Figure 5 illustrates a more detailed comparison of models based on word forms from Wikipedia, highlighting the differences between pl-emb vectors and those created by us with the word2vec tool. Pl-emb vectors are better than most of our models with the same (100) feature vector length, and even those with 300 features. It shows that either there were some unreported interesting changes in the learning algorithm, or that data pre-processing can significantly influence the results. This would mean that it is essential to clean data before training a model. However, restricting data to certain part of speech categories (see Sec. 2) did not prove itself valuable as it did not improve the results. The effects are usually comparable to those obtained from all data. Interestingly, for form-based models, limiting word types gave slightly better results, while for lemma-based models it produced slightly worse results. The difference is most significant for verbs, which is probably because some verb forms were treated by the tagger as consisting of more than one token. Our best model was trained on the subset of Wikipedia tokens which occur more than 30 times using the skip-gram approach with negative sampling and 300 features and obtained results similar to the pl-emb models. This shows that the filtering of low occurrences can play a similar role to careful pre-processing. The number of learning iterations does not have a uniform impact on the results. In the case of CBOW with hierarchical softmax, increasing the number of iterations to 100 for a 100-feature model resulted in a model which is better than an analogous model with 300 features and 10 iterations. This improvement did not occur in the skip-gram approach with negative sampling, so there is no clear answer as to whether increasing the number of iterations is justified.
Figure 6 shows a comparison of the lemma-based models grouped according to the learning technique used. Within each group, the models differ in the training data set. There are three corpora: Wikipedia; NKJP; and Wikipedia+NKJP, all in two variants: as a full set (a) or restricted (r) to selected POS. Models with a greater number of features (300) perform better than those with 100 features and the best learning combination for lemmas is CBOW with negative sampling. This model for Wikipedia is equally as good as a model for the much larger NKJP corpus. However, in this particular case data restriction worsens the results substantially. Another interesting observation is that adding Wikipedia data to the NKJP corpus does not significantly change the results. This would suggest that Wikipedia data does not contribute much to NKJP in the case of this task, i.e. all the information we look for is encoded in NKJP already. However, it is not clear why the results for the larger set are sometimes worse.
The conclusion formulated in the paper Mikolov, Sutskever, et al. (2013) that negative sampling outperforms hierarchical softmax for this task was confirmed by our results. The difference between CBOW and skip-gram is much less clear. For lemmas, models using CBOW typically give better results, while for forms, better results are obtained with the skip-gram approach.
Figure 7 shows the results for 20 pairs representing grammatical relations. The set is small as we decided that recognition of grammatical relations is not essential to the DS models. The overall results are good, with the best results (19 good answers for 20 questions) obtained for 300 feature vectors trained on NKJP only, or NKJP together with the Wikipedia data. The conclusions are similar to those formulated above for the general relations.
Finally, Figure 8 shows how well analogies are recognised for various groups of relations (given in Table 4). We have shown results obtained from the best model, i.e. Nlr-3-c-n. It is difficult to draw reliable conclusions from this data, as the groups are rather small, but it is clear that some types of relations are easier to recognize than others. Good results are obtained for geographic relations (apart from river-town pairs) and gender (family, animal and profession) relations. It is interesting that there are no results for analogies representing the substance from which an item is made, or a cultural event and its type. This problem needs further investigation before we can formulate the conditions under which analogies are correctly recognized.

Conclusions
The aim of this paper was to test models of Polish words created with the word2vec tool with various parameters for two specific tasks: synonymy and analogy identification. As Polish is inflectional, we tested models of both lemmas and forms. The results show that word embeddings can be used to identify similarity and certain kinds of analogies for Polish words, and that the efficiency of the method is highly dependent on the chosen corpus and the model parameter values. The distributional models based on lemmas are better for analogy, while for synonymy, word forms produce better results. Moreover, it is not possible to identify one reliable, universal approach to vector creation. The CBOW approach gives better results for analogy, while skip-grams are better for synonymy. An increasing number of features, or even corpora size, do not always yield better

Figure 2: The number of retrieved synonyms in 2500 list consisting of the 50 first most similar words for all 50 test words; lemma-based models

Figure 3: Synonymy results for lemma (left) and form (right) based models. The order of models reflects the increasing size of a corpus. First data are given for 100 features, then for 300 features. Four combinations of CBOW and skip-gram approaches with hierarchical softmax and negative sampling were tested. The last models use either data annotation or elimination

Figure 4: Overall performance for all models, both built on lemmas and word forms, for the analogy recognition task. Number of correct answers for 200 examples

Figure 5: Overall performance of models based on Wikipedia word forms. Number of correct answers for 200 pairs

Figure 7: Performance of all word-form models on the set of grammatically related inflected forms

Table 2: The least and the most frequent words together with the number of their occurrences in the combined NKJP and WikiPL corpus (first number) and WikiPL only (second number)

Table 3: Cardinality of test sets for synonyms