EXPANDING WORDNET WITH GLOSS AND POLYSEMY LINKS FOR EVOCATION STRENGTH RECOGNITION

Evocation — a phenomenon of sense associations going beyond standard (lexico-)semantic relations — is difficult for natural language processing systems to recognise. Machine learning models give predictions which are only moderately correlated with evocation strength. It is believed that ordinary graph measures are not as good at this task as methods based on vector representations. This paper proposes a new method of enriching the WordNet structure with weighted polysemy and gloss links, and shows that, applied to such expanded structures, Dijkstra's algorithm performs as well as other, more sophisticated measures.


Introduction
Evocation is a psycho-linguistic phenomenon of associations arising between specific word senses that go beyond standard (lexico-)semantic relations. For example, ankle-n-1, in the sense of 'a gliding joint between the distal ends of the tibia and fibula and the proximal end of the talus', evokes swell-v-3, in the sense of 'expand abnormally' (Boyd-Graber et al., 2006; Nikolova et al., 2009). As such, evocations resemble simple free word associations that immediately come to mind when a native speaker is presented with a stimulus word, e.g. girl and boy, or harm and bad. Both evocations and word associations are asymmetric and weighted relations, yet word associations do not specify which particular word senses are involved (Ma, 2013).
From a theoretical perspective, the phenomenon of evocation should also be distinguished from word/sense similarity and semantic relatedness. Relatedness is usually recognised as more general than similarity (Agirre et al., 2009; Ballatore et al., 2014; Faruqui et al., 2016). Following Cramer (2008), we treat the three concepts, i.e. similarity - relatedness - evocation, as forming a chain of subsumed senses, with similarity as the narrowest and evocation as the broadest one. Semantic similarity signifies close semantic resemblance (such as synonymy, near-synonymy or hyponymy). Relatedness covers both close semantic relations and more distant relationships.

Related work

The task of evocation strength recognition is a challenge for NLP. Sense associations spread in various directions and are asymmetrical (Cramer, 2008). There are no language resources designed to aid evocation recognition. WordNets mainly focus on paradigmatic relations, while various thesauri and ontologies capture special vocabulary and taxonomic dependencies. Valence lexicons cover predicate-argument relations. Therefore, the Princeton evocation data set was intended to complement the taxonomy of WordNet.

1 We assume the following set of definitions: (Def. 1) Orthographic word or token is a string of letters (and other symbols) of the English alphabet, delimited in writing usually by spaces (Lyons, 1977, pp. 49-50; Saeed, 2003, pp. 55-56). (Def. 2) Word-form is either an orthographic word (in the case of one-word lexical items) or a sequence of orthographic words (in the case of multi-word lexical items, treated as 'words with spaces'), as seen from the perspective of inflection, see Saeed (2003, p. 56), Lyons (1977, p. 50), Sag et al. (2002). (Def. 3) Lemma is the canonical word-form chosen to represent other inflectional forms in a dictionary as an entry term (Lyons, 1977, p. 50; Saeed, 2003, p. 56; Svensen, 2009, p. 93). (Def. 4) Word is the class of all word-forms equivalent - according to English inflectional patterns - to the same lemma and representing one sense or several related senses. (Ex.) For instance, we treat the word [go] as the class of all semantically related word-forms, including go, goes, going, went, gone, equivalent to each other, since they all might be equated with the lemma go according to inflectional rules (like adding the affix -(e)s to a stem and the irregular verb alternations codified in grammars of English). (Remark 1) We may operationalise the notions of lemma, word and equivalence relation by taking the output of existing English lemmatizers, which ascribe to a given word-form its lemma. (Remark 2) For WordNet words the definition of the term lemma could be narrowed to basic word-forms of nouns, adjectives, verbs, and adverbs. Hence, we regarded nominal, adjectival, verbal or adverbial word-forms as representing the same word if only they shared the same lemma in WordNet and were semantically related. (Remark 3) Because of the isomorphism between a lemma and its word/equivalence class, we talk about lemmas as if we were describing the words themselves. (Def. 5) Sense or meaning is the triple <lemma, POS, sense number>, where parts of speech and sense numbers (called variants) are taken from Princeton WordNet.
2 https://wordnet.cs.princeton.edu/downloads.html
3 1,000 synsets - sets of synonymous lexical units - classified as denoting the so-called core concepts were involved in the experiment.
Ordinary semantic similarity measures have proven to be completely inefficient in capturing evocation strength. When Boyd-Graber et al. (2006) implemented them, they achieved a Spearman's ρ of only 0.131. 4 Hayashi (2016) confirmed this finding: his individual WordNet-based measures achieved a Pearson's correlation of r = 0.15 at most. His results were much better for complex vector-space-based measures, with max(r) = 0.30 (cf. Figure 8 at the end of this paper). His final model, performing at r = 0.4391, was a neural network combining a dozen individual measures, with no feature playing a central role. According to Hayashi, further advancement in the field of evocation recognition should proceed in two complementary directions: (i) applying more sophisticated machine learning frameworks and (ii) gathering and merging new and better features. He suggested making use of high-quality word/sense vector representations and relational features, as well as more adequate semantic networks (in which distance measures could be applied). Most of Hayashi's best individual measures rely on calculating distances in different semantic spaces. Two out of four of his best measures reaching r values higher than 0.2 are cosine functions (for word2vec and AutoExtend vectors), and one is the AutoExtend difference of two vectors.
Cattle and Ma (2017) focused on predicting word association strength in the Princeton evocation data set. Although the task was different (instead of concept evocations, they were seeking word associations), the results were strikingly similar to those obtained by Hayashi. Again, cosine vector similarities (such as w2v, GloVe and w2g embeddings) proved to be the best. More recently, Kacmajor and Kelleher (2019) tested several individual measures of word similarity on the evocation set.
The authors divided their measures into four broad groups: (i) knowledge-based distance measures, (ii) measures utilizing vector space models constructed out of existing lexical resources, (iii) distributional vector space measures based on large corpora, and (iv) hybrid approaches mixing knowledge-based and distributional approaches. The main claim of their paper is that measures based on WordNet and other lexical resources are inadequate in the evocation task, because most WordNet/lexical resource relation instances are taxonomic in nature. On the other hand, distributional and hybrid models perform well with intuitive evocation associations. It seems that the WordNet structure itself is unfit for the task of evocation recognition.
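Since cosine vector similarity recurs as the strongest individual predictor across these studies, the measure itself is worth stating explicitly. The sketch below computes it in plain Python; the three-dimensional "embeddings" are invented toy values, not vectors from word2vec, GloVe or AutoExtend:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Invented 3-dimensional toy "embeddings" (illustrative values only).
vec = {
    "ankle": [0.9, 0.1, 0.3],
    "swell": [0.8, 0.2, 0.4],
    "piano": [0.0, 1.0, 0.0],
}
related = cosine(vec["ankle"], vec["swell"])    # high for associated senses
unrelated = cosine(vec["ankle"], vec["piano"])  # low for unrelated senses
```

In the surveyed systems, the inputs would be pretrained word or sense vectors and the resulting cosine would be correlated with the gold evocation strengths.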

Experiments
This paper will show that it is possible to construct a WordNet-based distance measure which performs better than other knowledge-based features, and no worse than vector space-based measures. We made use of the implementation of Dijkstra's algorithm in the igraph library in R (Csardi & Nepusz, 2006).
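The distance computation at the core of the method is plain single-source Dijkstra. The experiments used the igraph implementation in R; the following self-contained Python sketch, with a hypothetical three-node mini-net and made-up edge costs, illustrates what is being computed:

```python
import heapq

def dijkstra(graph, source):
    """Single-source shortest-path costs on a weighted directed graph.
    graph maps node -> list of (neighbour, edge_cost) pairs."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, already relaxed via a shorter path
        for v, cost in graph.get(u, []):
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Hypothetical mini-net: a WordNet edge with cost 1 and a gloss edge with a
# tuned cost of 1.4 (node names and costs are illustrative only).
net = {
    "ankle-n-1": [("joint-n-1", 1.0)],
    "joint-n-1": [("swell-v-3", 1.4)],
}
dist = dijkstra(net, "ankle-n-1")  # cost to swell-v-3: 1.0 + 1.4
```

In the experiments, the nodes are WordNet senses/synsets and the edge costs are the tuned weights of the relation types described below.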
The experiment design was as follows:
• Firstly, four different versions of the WordNet graph were constructed and the most successful one (achieving the best Pearson's r correlations) was selected. The graphs consisted of three different types of semantic relations: (i) pure WordNet links, (ii) gloss links, and (iii) relations between different senses of the same polysemous lemma (Sec. 3.1). Next, Dijkstra's distance-measuring algorithm was applied to the resulting structures in order to obtain the best predictions of evocation strength.
• Secondly, local optima were identified in parameter spaces (the axes represented the costs of edges in the algorithm, Sec. 3.2).

• Having found the minimum points, in the third step several similarity measures in the form of functions of Dijkstra's distance, Sim = f(Dist), were evaluated. Both graph structures and similarity functions were compared. Based on several quantitative-qualitative criteria, one measure was chosen (Sec. 3.3).
• Finally, the efficiency of the measure in the evocation recognition task was tested on the validation data set. Different graph topologies were compared together with evocation measures from the literature (Sec. 3.4).
Since the proposed association strength function was strikingly simple (cf. Sec. 3.3), the main emphasis was placed on the optimization of the WordNet structure. The idea was to give it a shape that would facilitate evocation recognition. The optimisation procedure and the final evaluation were run on three independent subsets of the evocation data set: 5
• S1 - 2,000 evocation pairs used for checking the efficiency of the WordNet graph and its extensions, and for tuning the weights of edges,
• S2 - 10,000 sense pairs used to determine the choice of the best similarity measure,
• S3 - the final testing data set of 108,000 evocation pairs for choosing the best graph topology.
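The three subsets were drawn randomly from the original data set (cf. footnote 5). A minimal sketch of such a split, with stand-in pair identifiers and an assumed helper name split_evocation_pairs, might look as follows:

```python
import random

def split_evocation_pairs(pairs, n1=2000, n2=10000, seed=0):
    """Randomly partition evocation pairs into a tuning set (S1), a
    measure-selection set (S2) and a final validation set (S3)."""
    rng = random.Random(seed)
    shuffled = pairs[:]  # leave the caller's list untouched
    rng.shuffle(shuffled)
    return shuffled[:n1], shuffled[n1:n1 + n2], shuffled[n1 + n2:]

# Stand-in pair identifiers; the real set holds sense pairs with strengths.
pairs = [f"pair-{i}" for i in range(120000)]
s1, s2, s3 = split_evocation_pairs(pairs)
print(len(s1), len(s2), len(s3))  # → 2000 10000 108000
```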
The experiment proceeded in the following manner: (i) four differently structured WordNet graphs were constructed, (ii) various combinations of relation weights and fitted response surfaces were tested on two testing samples (10% of the total number of instances, samples S1 and S2).
During the preparatory phase, the efficiency of Dijkstra's algorithm was tested on differently structured WordNet graphs. Evocation strength recognition was performed on the smaller set of 2,000 evocation pairs (S1). Networks were unweighted; technically, this was obtained by applying a weight (cost) of 1 to all edge types (cf. Table 1).

Table 1: Edge type combinations tested on unweighted graphs and on the testing set S1. Symbols: wn - WordNet relations, g - gloss relations, polyWN - the set of all pairs of polysemous lemma senses taken from WordNet, polySC - the set of all pairs of polysemous lemma senses co-occurring in SemCor together with polysemy patterns and top-level noun and verb synsets, N - number of relation instances, d - directed edges, u - undirected links. Please note that the calculated correlations do not contain the cases of disconnected graph edges (NA and Inf values were excluded).
(Table 1 columns: graph configuration, directionality of links, vector of costs, r for Dist.) The four graph structures will be inspected in the forthcoming sections.
5 All subsets were randomly chosen from the original data set.

Relation types
The graphs were constructed out of the following types of edges:
• directed WordNet edges (380,000 links in total, symbol wn);
• directed gloss relation instances (820,000 links, marked with g);
• bidirectional polysemy links between different WordNet senses (400,000 links in total; n(n-1)/2 links for each n-sense lemma, symbol polyWN);
• a heterogeneous set of edges made out of SemCor 6 polysemy links and upper synsets of the nominal and verbal WordNet hierarchies (marked jointly as polySC), including: directed polysemy links collected from the SemCor corpus (a link was established every time a sense pair appeared in a very similar context); polysemy patterns extracted from the previous set via a generalization from a given polysemy pair to a pair of corresponding semantic domains (the 'lexicographer files' of the considered synsets); and top-level noun and verb WordNet synsets linked to their semantic domains via undirected edges, to facilitate linking synsets with the top level of polysemy patterns.
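For the polyWN set, generating n(n-1)/2 undirected links per n-sense lemma is a simple combinatorial enumeration. The sketch below illustrates it on a hypothetical two-lemma sense inventory:

```python
from itertools import combinations

def polywn_links(lemma_senses):
    """Enumerate the polyWN edge set: every unordered pair of senses of the
    same lemma, i.e. n(n-1)/2 undirected links for each n-sense lemma."""
    links = []
    for senses in lemma_senses.values():
        links.extend(combinations(senses, 2))  # undirected: store each once
    return links

# Hypothetical sense inventory for two polysemous lemmas.
inventory = {
    "bank": ["bank-n-1", "bank-n-2", "bank-v-1"],  # 3 senses -> 3 links
    "run":  ["run-v-1", "run-n-1"],                # 2 senses -> 1 link
}
links = polywn_links(inventory)
print(len(links))  # → 4
```

Applied to the full WordNet lemma inventory, this enumeration yields the polyWN set described above.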
It is important to distinguish between two different types of WordNet relations: paradigmatic relations (hyponymy, meronymy, antonymy etc.) and gloss relations, emerging from the WordNet gloss annotation process (cf. Suderman & Ide, 2006). 7 WordNet was first tested without glosses (symbol: wn), and then with the addition of glosses (symbol: wn+g), see Table 2 below. Next, the WordNet graphs were extended further by adding polysemy links. Polysemy is a phenomenon in which the same word-form constitutes a sign for different related meanings (Cruse, 2006, pp. 133-134). 8 Polysemous lemmas are to be understood as those WordNet lemmas which possess two or more semantically related senses. 9 Polysemy should be carefully discerned from homonymy, that is, from accidental relationships between word meanings (Allen, 2014, p. 150n). 10 In contrast to real polysemy cases, homonymous sense pairs are usually not related at all (Lyons, 1995, p. 59). Establishing links between semantically related WordNet senses is important, because the lexical net does not possess explicit information on this sense relationship.
In our experiments we tested the result of adding combinations of all lemma senses. 11 The polyWN set counted 200,000 undirected links. We also took SemCor and inspected the corpus characteristics of polysemy patterns. It was assumed that those lemma senses and their semantic domains which co-occurred in the very same text were semantically related. 12 The relation was kept directed and forced to go from the preceding sense/semantic domain to the succeeding one. This net reflects the hidden relationships between different semantic categories. We call them polysemy patterns and use them to pin up the upper levels of the nominal and verbal WordNet hierarchies, i.e. those synsets that do not possess any hypernym. 13 We made the latter links 14 undirected in order to enable free movement up and down - into the polysemy patterns net and back again. The net consisted of 3,400 directed links. In comparison to WordNet paradigmatic relations, gloss links and the huge polysemy set derived from WordNet, the size of the polySC set is rather modest.

8 [...] problematic, since in usage it is very difficult to discern all the different senses or shades of meaning, and the choice of a proper dictionary is very important (Agirre & Edmonds, 2007). Some dictionaries have very coarse-grained sense distinctions, while others possess very fine-grained lists of meanings. WordNet, because of its vast computational applications, is often used as a source of word senses; however, at the same time, it is criticised for its overly detailed sense distinctions (Edmonds, 2004). Clustering WordNet senses seems a reasonable solution to the problem (Agirre & Lopez de Lacalle, 2003). Yet another solution is to seek the so-called polysemy patterns, that is, regular polysemy types (Vicente & Falkum, 2017).
9 Please note that we consider meanings belonging to distinct parts of speech as cases of polysemy of the same lemma, not the result of word formation. Hence, we treat the unmarked change of word category as semantic derivation. In theoretical linguistics the status of conversion (zero-derivation) sense pairs is not entirely clear (Schmid, 2007). Although many researchers treat zero-derivation as a purely word-formational process, some portray it as mere syntactic change, cf. the description of theoretical positions in Schönefeld (2005, pp. 135-138). For instance, Baker (2003) ascribes a syntactic category (POS) not to lexemes but to syntactic phrases (pp. 266n). Lexicographers often put related meanings characterised by different parts of speech into the same lexical entry, differentiating it from homonymy cases (Saeed, 2003, p. 80). They are treated as polysemous senses representing the same lemma (Svensen, 2009, pp. 95, 97).
10 These accidental relationships can be due either to the converging evolution of native vocabulary, or to loans from other languages (Jackson, 2002, pp. 2-3).
11 Hence also with homonymy.
12 In such a manner we overcame the challenging homonymy problem.

Setting weights
Each of the four relation types described above was given a weight (the cost in Dijkstra's algorithm). WordNet links were assigned a weight of 1; all other link types were tested for optimal values in the range [0, 12]. Again, the tests were conducted on the same testing sample of 2,000 evocation pairs (S1). For the original WordNet graph (wn), the correlation values of Pearson's r and Spearman's ρ were calculated. Then, we merged the base graph with glosses (wn+g) and checked correlations in a systematic one-factor design (Fig. 1). The response curve r(g) is a polynomial of the fifth degree (adjusted R^2 = 0.8956, p-value of the F-test = 1.219 × 10^-9). This suggests that finding an adequate optimization solution is a demanding task. The long right tail of the response curve, monotonically approaching correlations close to r ≈ -0.10, proves the importance of using the gloss relation set. We estimated the global minimum location at g = 1.4. The two remaining polysemy graphs were extensions of the wn+g graph, hence we expected response surfaces of a higher degree. For the two-factor (X1-X2) problem we conducted a quadruple third-order design (a CCD augmented by an I-optimal design; cf. Yang, 2008, Tab. 20), because ordinary second-order designs were inefficient. Response surfaces of the fourth and fifth order were fitted to the experimental points with acceptable adjusted R-squared values. Figures 2 and 3 present the response surface for the wn+g+polyWN graph. Generally speaking, the plot reproduces the shape of the one-factor r(g) response curve, with a long tail stretching out to high g weights and a valley of minima extending meridionally over a wide range of polyWN values. Comparing the wn+g minimum and the ravine of the wn+g+polyWN graph, one may notice that the optimal area is shifted to higher g values. From the two deepest minima we chose the one located at the point (g = 2.5, polyWN = 3.2).
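The one-factor design for the gloss weight g amounts to sweeping candidate costs over [0, 12] and locating the minimum of r(g) (the most negative correlation, since larger distances should accompany weaker evocations). A schematic sketch, with a made-up quadratic stand-in for the real evaluation (the actual response curve was a fifth-degree polynomial fitted to experimental points):

```python
def pearson_r(xs, ys):
    """Pearson's correlation coefficient (plain Python)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def best_gloss_weight(evaluate, grid):
    """One-factor sweep: evaluate(g) should return Pearson's r between
    Dijkstra distances (computed with gloss cost g) and gold evocation
    strengths; we seek the most negative r, i.e. the minimum."""
    return min(grid, key=evaluate)

# Made-up stand-in for the real r(g) response curve, with its minimum at 1.4.
# In the real setup, evaluate(g) would rebuild the weighted graph, recompute
# distances on S1 and correlate them (via pearson_r) with evocation strength.
toy_r = lambda g: (g - 1.4) ** 2 - 0.10
grid = [i / 10 for i in range(0, 121)]  # candidate costs g in [0, 12]
print(best_gloss_weight(toy_r, grid))  # → 1.4
```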
The shape of the SemCor polysemy response surface ( Fig. 4 and 5) is a twin to the WordNet polysemy graph, but the localisation of the optimum area is closer to the original wn+g value.
This is probably caused by the different cardinalities of the two polysemy sets. The polyWN set is very large, consisting of 200,000 bi-directional edges. Its mass of edges linking different parts of the graph, especially different parts of speech, greatly affects the length of paths within the graph. The polySC set, though 200 times smaller, is much more accurate; it does not contain many homonymy pairs or overly distant relationships. The fitted models suggest the better performance of the polySC network.
We evaluated how well all these WordNet topologies perform on another sample set, S2.

Similarities
We treated weights as the cost of every step and aimed to establish the optimal set of these parameters. We identified the local optima of the discussed network models using the S1 set. We will now turn our attention to evaluating the efficacy of all these structures. The second testing set, consisting of 10,000 evocation pairs (S2), was used here. Table 3 collates the data, while Figure 6 presents differences in the correlations between Dijkstra's distance and evocation strength. All three expanded WordNet graphs proved to be better at predicting evocation strength based on the bare Dijkstra's Dist measure.

Following Ge & Qiu (2008, p. 382), for this task we employed three measures of semantic similarity expressed as functions of Dist, where Dist is calculated with Dijkstra's algorithm (Cormen et al., 2001, Sec. 24.3). If no path could be established, we ascribed to a sense pair the distance of the maximum shortest path length in the graph plus one. To this set of similarity functions we also added a natural, although theoretically not so plausible, measure - the inverse of Dijkstra's distance: Sim0 = 1/Dist.

Table 3: Local optima for different expansions of the WordNet graph and different weights (the cost of Dijkstra's algorithm) tested on the set of 10,000 evocation pairs randomly sampled from the whole set (the set S2). The correlation is given for pairs of Dijkstra's distances and evocation strength values.

After the transformation of the Dist measure into the similarity measures Sim0-Sim4, we obtained higher correlation ratios. Table 4 presents the values of Pearson's correlation for all Sim functions and the S2 data set. For all measures and graphs, we also prepared bootstrap samples to assess their distributional properties (the number of iterations B = 1000). We checked standard deviations and calculated mean deviations as well as ranges. 15 Sim2 gives the highest correlations, while Sim1 the lowest.
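The inverse-distance transformation and the treatment of disconnected pairs can be sketched as follows; the maximum shortest-path length passed in below is a made-up value:

```python
def sim0(dist, max_shortest_path):
    """Inverse-distance similarity: Sim0 = 1 / Dist. Pairs with no
    connecting path get the maximum shortest-path length in the graph
    plus one as their distance (edge costs are positive, so Dist > 0)."""
    if dist is None or dist == float("inf"):
        dist = max_shortest_path + 1
    return 1.0 / dist

connected = sim0(2.4, 20)              # a reachable pair: 1 / 2.4
disconnected = sim0(float("inf"), 20)  # a disconnected pair: 1 / 21
```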
The Sim2 and Sim3 measures are characterised by relatively large variances, while Sim0 and especially Sim1 preserve smaller variances. The Cullen and Frey graph proves that all measures gave values close to the normal distribution, with Sim2 and Sim0 being the closest to the point (0, 3).

Table 4: Local optima for different expansions of the WordNet graph and different weights (the cost of Dijkstra's algorithm) tested on the set of 10,000 evocation pairs randomly sampled from the whole set (the set S2). The correlation is given for pairs of similarity and evocation scores. The symbol "sd" stands for standard deviation.

For further experiments we chose the Sim0 measure, taking into account the qualitative-quantitative criteria mentioned above. This measure gave high correlation values and was characterised by small variance. It also ensured a relatively large range, and its distribution was very close to the normal distribution. Obviously, this decision is not fully objective and is a matter of the researcher's choice.
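The bootstrap assessment of each measure's distributional properties amounts to resampling (similarity, strength) pairs with replacement and recomputing the correlation. A sketch with synthetic toy data (the real experiments used B = 1000 on the S2 pairs):

```python
import random

def pearson_r(xs, ys):
    """Pearson's correlation coefficient (plain Python)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def bootstrap_r(sims, strengths, B=1000, seed=0):
    """Resample (similarity, strength) pairs with replacement B times and
    return the bootstrap correlations, from which sd and range follow."""
    rng = random.Random(seed)
    n = len(sims)
    rs = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        rs.append(pearson_r([sims[i] for i in idx],
                            [strengths[i] for i in idx]))
    return rs

# Synthetic data: similarity noisily tracking evocation strength.
rng = random.Random(1)
sims = [rng.random() for _ in range(200)]
strengths = [s + 0.3 * rng.random() for s in sims]
rs = bootstrap_r(sims, strengths, B=200)
mean = sum(rs) / len(rs)
sd = (sum((r - mean) ** 2 for r in rs) / (len(rs) - 1)) ** 0.5
```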

Similarity measure efficiency
The optimizing procedure conducted on the S1 set (2,000 evocation pairs) led to the choice of the most promising parameter values for gloss edges (g) and polysemy links (polyWN and polySC). Correlations between Dijkstra's distances and evocation strength improved after the optimisation of weights on the gloss-enlarged graph and remained high after adding new polysemy links. The S2 set (10,000 pairs) was used to support the choice of the best similarity measure. The final evaluation was performed on a large set of the remaining 90% of evocation pairs (S3) with the use of the compromise measure Sim0. Despite a sharp disproportion in size between the testing sets and the validation set (a ratio of about 1:10), final experiments conducted on 107,000 evocation examples confirmed the following findings:
• The optimizing procedure gave Sim0 the highest r values within the group of Hayashi's knowledge-based individual measures (Fig. 8, the group "K").
We performed a paired bootstrap test on six pairs of graphs at the 95% confidence level, using the Bonferroni correction (the significance level of each paired test was adjusted to 5/6%). Due to the large size of the S3 data set, most tests led to significant differences. It transpired that the wn+g+polySC network model is the most suitable for evocation recognition (the mean correlation of the Sim0 measure is 0.251). Sole WordNet relations (the graph wn) fared much worse than any of the extended structures (for details see Table 5).

Table 5: One-sided paired bootstrap percentile test for the difference between r values for Sim0 at the 95% confidence level. Symbols: "=" marks insignificant differences, while ">" and "<" designate significant differences, wherein the former symbol is read as 'a row is greater than a column', and the latter has the opposite meaning. The direction of the test was established on the basis of the previous experiment (on S1).
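The paired bootstrap percentile test can be sketched as follows; the data are synthetic, and alpha defaults to the Bonferroni-adjusted 0.05/6 used for the six graph comparisons:

```python
import random

def pearson_r(xs, ys):
    """Pearson's correlation coefficient (plain Python)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def paired_bootstrap_greater(sim_a, sim_b, gold, B=1000, alpha=0.05 / 6, seed=0):
    """One-sided paired bootstrap percentile test at level alpha: does
    measure A's correlation with the gold evocation strengths significantly
    exceed measure B's? Both measures are resampled on the same indices."""
    rng = random.Random(seed)
    n = len(gold)
    deltas = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        a = [sim_a[i] for i in idx]
        b = [sim_b[i] for i in idx]
        deltas.append(pearson_r(a, g) - pearson_r(b, g))
    deltas.sort()
    lower = deltas[int(alpha * B)]  # lower percentile bound of the difference
    return lower > 0                # significant: A's r exceeds B's

# Synthetic data: sim_a tracks the gold scores, sim_b is pure noise.
rng = random.Random(1)
gold = [rng.random() for _ in range(150)]
sim_a = [s + 0.2 * rng.random() for s in gold]
sim_b = [rng.random() for _ in range(150)]
significant = paired_bootstrap_greater(sim_a, sim_b, gold, B=300)  # True here
```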

Conclusions
The main contribution of this paper is the presentation of a novel method of optimizing the structure of a lexical net for the needs of evocation recognition. We started with a bare WordNet graph and expanded it with glosses and polysemy links. In order to achieve better agreement with gold-standard evocation strength values, the graph relations were differentiated through different weights (the cost of each edge) and optimized on a very small subset of the original data set. The optimization process led to new network structures whose distances were better correlated with evocation strength. The WordNet structure best suited to the task was the one extended with gloss relations and a small set of polysemy patterns and instances derived from the SemCor corpus. 16 This fact may be evidence for the importance of polysemy links in the lexico-semantic system. We also proposed an evocation measure in the form of the inverse of Dijkstra's distance, which performed well not only when compared with other graph measures, but also in comparison with sophisticated vector space models, some of which were high-dimensional (100-300D vectors). Contrary to the well-established opinion of the superiority of vector representations, it is probable that the common denominator of all the successful individual measures is an adequate distance-measuring methodology. Our simple knowledge-based measure has recently been reused - together with some other credible individual measures - in a neural-network framework, with an overall NN efficiency of r = 0.4415, and it turned out to be the best feature of all those used (Janz & Maziarz, in press).