Synonymy and search synonymy in an IR system (on the basis of linguistic terminology and the iSybislaw system)

This is an Open Access article distributed under the terms of the Creative Commons Attribution 3.0 PL License (creativecommons.org/licenses/by/3.0/pl/), which permits redistribution, commercial and non-commercial, provided that the article is properly cited. © The Author(s) 2014. Publisher: Institute of Slavic Studies, PAS & The Slavic Foundation [Wydawca: Instytut Slawistyki PAN & Fundacja Slawistyczna] DOI: 10.11649/sfps.2014.017

The distinction between synchronic and diachronic phenomena, currently considered standard in linguistic studies, is difficult to apply in the case of vast data banks in which older works coexist with new ones. Practical application of consistent and current terminology in the description of all the indexed information seems almost impossible because of the diversity of research methods and methodological trends. Standardizing the terminological system and eliminating contradictions and ambiguities would greatly help in creating an IR system. It should be noted, however, that it would also considerably oversimplify the image of the scientific field that emerges from the database. A significant problem thus lies in the ambiguity of linguistic signs as such. The relationship between a linguistic exponent and the concept (i.e. the semantic component of a linguistic unit) is rarely unambiguous. One concept may be expressed by multiple strings of phonemes/graphemes (synonymy), and one string of phonemes/graphemes may express different concepts (ambiguity). These phenomena, irrelevant from the perspective of everyday communication (because of context etc.), turn out to be crucial in optimizing information retrieval in both closed and open collections.
Two distinct levels are considered in this paper. The first is primarily metalinguistic and results from the character of linguistics itself, which is the subject presented in iSybislaw; the second is metainformative and results from the character of iSybislaw as an IR system.
Before proceeding to the impact that synonymy and similar phenomena have on IR, I must note that the elimination of ambiguity is a necessary preliminary condition for such an analysis. Because the study operates on two levels, we should first establish the notions of synonymy in natural language (including metalanguage) and synonymy in IR languages, such as the keyword language implemented in iSybislaw. Whilst synonymy in natural language is not a problem per se (though it may be a subject of study), synonymy in IR systems is not only an interesting phenomenon but mainly a practical problem, since it limits the effectiveness of a search in terms of its completeness. On the basis of Encyklopedia językoznawstwa ogólnego we can give the following definition of synonymy: expressing the same content using two or more different linguistic forms (cf. Polański, 1999). Owing to language economy, which is also typical of specialized languages, synonymous terms may nevertheless differentiate their meanings over time. In the case of IR tools it is necessary either to combine synonymous expressions or to remove those that the creator of the system considers (for various reasons) redundant or non-preferable. The second of these solutions, however, requires the user of an IR system to be thoroughly acquainted with the conceptual apparatus used in indexing, and thus makes information retrieval problematic. Of course, the creators of iSybislaw are aware of the complexity of the phenomena and changes characteristic of the terminological subsystem and to some extent take them into account in the database. In any synonymous string a single word is highlighted as a key descriptor on the basis of its usage, frequency, linguistic correctness and clarity; see the entry termin preferowany (Eng. preferred term) in Słownik encyklopedyczny informacji, języków i systemów informacyjno-wyszukiwawczych (Bojar, 2002).
A linguistic sign is considered to consist of its form (phonemic or graphemic), connotation and denotation. The denotation of a sign is widely believed to depend on its connotation. The matter is more complicated in IR languages (even paranatural ones) because of the metainformative function of IR in general, which results in keywords having both direct and indirect connotation and denotation (cf. Bojar, 2002). The relation of search synonymy therefore requires two or more expressions in an IR language to have identical direct and indirect denotation and connotation (cf. Bojar, 2002). The indirect connotation and denotation of keywords can be derived from the paranatural character of the keyword language. The direct denotation of a keyword (the set of documents on the subject) must be created during indexing by assigning the keyword to bibliographic records (or other [meta]data, depending on the system in question). We may therefore conclude that keywords may be indirectly synonymous (i.e. have identical indirect connotation and denotation) as a result of the paranatural character of the IR language used. Their direct synonymy can only be achieved through the optimization of that IR language, and only then can we speak of search synonymy. In iSybislaw this can be accomplished simply by linking synonymous keywords. This is a great advantage over some popular software packages used for creating open source repositories, such as DSpace, in which synonymous keywords cannot technically be linked to one another. Having to add every synonymous keyword separately to every record in DSpace makes search synonymy virtually impossible and may lead to information overload.
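The effect of linking keywords on their direct denotation can be illustrated with a minimal Python sketch. It is not a description of how iSybislaw is implemented: the class KeywordIndex, its methods and the record numbers are hypothetical and serve only to show that, once keywords are linked, each of them retrieves the same set of records.

```python
from collections import defaultdict


class KeywordIndex:
    """Toy keyword index (hypothetical, not the iSybislaw data model)."""

    def __init__(self):
        # keyword -> ids of the records it was assigned to (direct denotation)
        self._records = defaultdict(set)
        # keyword -> the shared set of keywords it has been linked with
        self._group = {}

    def assign(self, record_id, keyword):
        """Indexing step: assign a keyword to a bibliographic record."""
        self._records[keyword].add(record_id)

    def link(self, a, b):
        """Declare two keywords synonymous and merge their groups."""
        merged = self._group.get(a, {a}) | self._group.get(b, {b})
        for keyword in merged:
            self._group[keyword] = merged

    def search(self, keyword):
        """Return the records of the keyword and of all keywords linked to it."""
        hits = set()
        for kw in self._group.get(keyword, {keyword}):
            hits |= self._records.get(kw, set())
        return hits


index = KeywordIndex()
index.assign(1, "tryb nieświadka")       # record ids are invented
index.assign(2, "tryb imperceptywny")
index.assign(3, "imperceptivus")

before = index.search("tryb imperceptywny")   # only the record indexed with this keyword

index.link("tryb nieświadka", "tryb imperceptywny")
index.link("tryb imperceptywny", "imperceptivus")

after = index.search("imperceptivus")         # records of all three linked keywords
```

Before linking, the three keywords are at most indirectly synonymous: each retrieves only its own record. After linking, any of them retrieves the whole set without a synonym having to be added to every record separately.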
The core of linguistic terms functions metalinguistically. Used in an information retrieval system, their equivalents provide data on the metalinguistic and metascientific information contained in the described works. In the field of scientific information it is often stressed that keywords should be chosen so that they are as informative as possible and thus have their scope defined as unambiguously as possible. There is no doubt that strict definitions are an extremely important component of sound scholarly practice. Obviously, even within a single language the same denotation can be assigned to different names, defined and understood in slightly different ways. In general language this phenomenon is described as so-called profiling. In the case of terminology, however, the problem is often not limited to random semantic features (different associations of a given expression) but concerns qualities essential to the definition (i.e. its differentia specifica). Terms used in IR refer indirectly to themselves (the concepts they name) and directly to documentary reality (the set of documents on the subject). The users' information needs seem a good starting point for further deliberation. Since iSybislaw is mainly used by linguists, we can assume that they primarily seek metalinguistic content (information on the phenomena of linguistic reality). Therefore, denoting the same set of linguistic elements seems more relevant than the means by which they are defined. The division between purely metalinguistic and metascientific terms was mentioned above; in reality there is a large group of mixed terms (see Table 1). Such terms present an additional difficulty in the process of indexing. Adding a methodologically more neutral keyword is one possible solution. For the mixed terms presented in Table 1, adding Pol. określoność/nieokreśloność (Eng. definiteness/indefiniteness) seems a plausible solution.
For example, there is no doubt that in all Polish works in the field of Slavic studies the following terms for the imperceptive mood: tryb nieświadka, narrativus/narratyw, imperceptivus and tryb imperceptywny all refer to the same set of verb forms in Bulgarian and/or Macedonian, but they do so in different ways. The diversity of meanings of linguistic terms with this denotation in Polish, Russian and Bulgarian is presented in the table below. The confusion is such that it even results in domestic terminology being abandoned. For instance, M. Ledzion-Jelen chooses to use the Macedonian term прекажаност (Eng. re-narrativeness) (cf. Ledzion-Jelen, 2009, p. 130). All terms in the above table can be defined in such a way that their scope is strict and the only loss of information is due to some connotational differences. Such terms are combined into sets of synonyms within one language and sets of equivalents on the multilingual level, which enables cross-lingual IR in iSybislaw.
In the database we consistently distinguish between two levels of linguistic reality: the formal plane and the content plane. This results in the separate treatment of semantic units, such as Polish imperceptywność (Eng. imperceptivity), and the means (both grammatical and lexical) of expressing a given notion, semantic category etc., such as Polish tryb imperceptywny (Eng. imperceptive mood). This division is sometimes troublesome because such an approach is not yet prevalent in all linguistic frameworks. It is worth noting that the picture emerging in this regard from particular languages is largely due to usage and tradition. In Polish, Bulgarian and Russian alike, the term for predicate serves as the name of both a semantic and a syntactic (i.e. formal) component. To maintain consistency we found it necessary to add a subscript to the secondary (formal) meaning of the term. The table below presents synonymous strings for the term in Polish, Russian, and Bulgarian.

Polish: predykat składniowy; Bulgarian: синтактичен предикат; Russian: синтаксический предикат
There is no doubt that the interchangeable use of all of the specified terms in one scientific work (or, even more broadly, in one terminological idiolect) would lead to inconsistencies. It turns out that authors' preferences in this area vary and have different motivations. For instance, Z. Topolińska uses the term Pol. wyrażenie predykatywne (Eng. lit. predicative expression) very consistently (cf. Topolińska, 1999). As we can see in the table above, the presented terms can even be grouped in such a way that they correspond not only in meaning but also in form. This is not always the case, as can be seen in Table 4, which presents the Polish equivalents of the Russian term предикатив (Eng. non-inflectional verb), whose meaning in Russian has probably stabilized (cf. Ахманова, 1966; Немченко, 2008), and its synonyms. The use of Polish terms such as przysłówek predykatywny (Eng. predicative adverb) is very rare and may be viewed as a result of Russian influence. The question thus arises (and it is relevant in translation) which of the non-corresponding terms should be viewed as the strictest equivalents. For example, the distinction between verbs and adverbs seems well documented in linguistics, and yet Polish and Russian differ slightly in the way they treat non-inflectional verbs (cf. the use of Rus. наречие [Eng. adverb] in two-word terms in Russian as opposed to the use of czasownik [Eng. verb] in Polish). One can therefore conclude that in Russian terminology the phenomenon is viewed as a certain kind of adverb. In Polish terminology, however, the view that it is a special kind of verb seems prevalent. Of course, these are only preliminary observations, and more in-depth research should take into account the text frequency of the terms considered. It should be noted that the classification of parts of speech is rarely strict enough to create separate sets of units without any ambiguity. For example, in Polish terminology it is possible to use the name predykatyw (Eng. lit. predicative) in a broad sense, synonymous with czasownik (Eng. verb) in its wide understanding (and therefore predykatyw 2 [i.e. predykatyw in the above sense] would determine a set of linguistic units of which predykatyw in its primary meaning would be a part) (cf. Kubiszyn-Mędrala, 2000). One should also note that in both Polish and Russian the respective terms are also used as a case name (cf. Topolińska, 1999; Жеребило, 2010).
The complex semantic relations between terms and the varying terminological conventions do not alter the fact that the lexical subsystem is characterized by the pursuit of systematic organization. Terms that become ambiguous sometimes "wear out" and gradually become obsolete; see e.g. the abandonment of the Polish term agens (Eng. agent) in the works of M. Korytkowska (cf. Korytkowska, 1992). Potential units often remain only potential in the absence of a clear nominative need. The observation of this state of affairs leads to the trivial conclusion that linguists are expected to be competent in the field of linguistic terminology. Languages may differ greatly, and conclusions based on monolingual material are often not representative for multilingual purposes. Even closely related languages are characterized by lexical asymmetry. The traditional approach of source and target language may not result in a complete picture of the target language. An important novelty in the work on iSybislaw is the rejection of such an approach (i.e. projecting one language onto another). This results in parallel research on the confronted languages. The following table shows the relation of synonymy for three different languages. In these sequences one should also distinguish certain pairs of terms that are combinatorial variants. The table also includes potential units (crossed-out expressions). In IR the distinction between synonymy and variantivity seems irrelevant. In both cases different language forms express the same content, and optimizing the search engine requires combining them in one equivalence class. There are various views on variantivity on the level of morphemes and word formation, which forces us to ask about the nature of the relationship between complex terms in which one of the elements is interchangeable with a functionally identical element (see above). The systematic character of such phenomena makes it possible to predict the so-called potential units. Synonymy (being a lexical phenomenon) is more irregular. A separate problem is the possibility that variants of the same term in different languages differ in their nature (e.g. phonetic vs. inflectional), cf. Russian алломорф/алломорфа (Eng. allomorph) and Polish allomorf/alomorf. True variantivity is, however, a rarity in the terminological subsystem.
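The idea of combining synonyms, variants and cross-lingual equivalents in one equivalence class can be sketched with a standard disjoint-set (union-find) structure. The sketch below is only an illustration under assumptions of my own: the class EquivalenceClasses, the representation of a term as a (language, form) pair and the unrelated control term morfem are hypothetical and are not taken from iSybislaw.

```python
class EquivalenceClasses:
    """Disjoint-set structure over terms given as (language, form) pairs."""

    def __init__(self):
        self._parent = {}

    def _find(self, term):
        """Return the representative of the class that contains the term."""
        self._parent.setdefault(term, term)
        while self._parent[term] != term:
            # path halving keeps later lookups short
            self._parent[term] = self._parent[self._parent[term]]
            term = self._parent[term]
        return term

    def merge(self, a, b):
        """Put two terms (synonyms, variants or equivalents) in one class."""
        self._parent[self._find(a)] = self._find(b)

    def members(self, term):
        """Return every known term in the same class as the given term."""
        root = self._find(term)
        return {t for t in list(self._parent) if self._find(t) == root}


classes = EquivalenceClasses()
# variants within one language
classes.merge(("pl", "allomorf"), ("pl", "alomorf"))
classes.merge(("ru", "алломорф"), ("ru", "алломорфа"))
# cross-lingual equivalence joins the two monolingual sets
classes.merge(("pl", "allomorf"), ("ru", "алломорф"))

result = classes.members(("ru", "алломорфа"))
unrelated = classes.members(("pl", "morfem"))   # a term that was never merged
```

The structure does not record whether two forms are synonyms or variants, which mirrors the observation that this distinction is irrelevant for retrieval: a query for any member of the class can be expanded to all of them.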
A separate problem is a kind of ambiguity of terms resulting from their different definitions and the application of various research methods. Such terms as określoność (Eng. definiteness) have a different meaning and scope in S. Karolak's works (cf. Karolak, 2001) than in the works of V. Koseska (cf. Koseska-Toszewa, Korytkowska, & Roszko, 2007). In the case of S. Karolak it can be considered synonymous with Pol. intensjonalna zupełność (Eng. intensional completeness), in other words with uniqueness and generality.
V. Koseska does not use intensional completeness as a term, but not because of the idiolectal preferences discussed above. The absence of the term is motivated by a different research method implemented in her works, in which Pol. określoność (Eng. definiteness) is understood more narrowly and does not cover Pol. ogólność (Eng. generality); generality is considered indefinite in works based on the quantificational model (sic!). Distinguishing two meanings for each of the terms Pol. określoność (Eng. definiteness) and Pol. nieokreśloność (Eng. indefiniteness) seems somewhat far-fetched, however, in the case of an IR system such as iSybislaw.
It seems that true synonymy in terminology is problematic because definitions vary between works (even those of the same author) and establishing it requires in-depth research. In IR, when creating synonymy/equivalence classes (multilingual and/or including variants), the depth of analysis should be restricted to a more moderate level. It is preferable for the user to receive a complete set of information even at the cost of obtaining some data that are redundant from the user's point of view. The optimization of IR requires some compromises, but (unfortunately) there are no shortcuts and every case should be analyzed separately.
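The compromise between completeness and redundancy can be expressed with the two standard IR measures of recall and precision; reading "completeness" as recall is my interpretation, not a statement of the paper. In the sketch below the function precision_recall is a hypothetical helper and the record numbers are invented for illustration only.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved record ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Invented data: records 1-4 are relevant to the user's need.
relevant = {1, 2, 3, 4}

# Narrow equivalence class: only records indexed with one strict sense.
narrow = {1, 2}
# Broad equivalence class: all related terms merged, one redundant record.
broad = {1, 2, 3, 4, 5}

narrow_scores = precision_recall(narrow, relevant)
broad_scores = precision_recall(broad, relevant)
```

With these invented figures the narrow class returns nothing redundant but only half of the relevant records, whereas the broad class returns all of them together with one redundant record, which is the outcome the paragraph above treats as preferable.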

Table 1. Types of linguistic terms