Multi-level Annotation of the Specialized Corpus of Dialogs of Disabled Polish Speakers

While the Polish language is relatively well represented in general-purpose corpora such as the National Polish Language Corpus, some groups of speakers remain underrepresented in reference corpora. One such group is the community of people with disabilities. At the same time, there is a growing need to understand how disability influences social and cognitive abilities, language in particular. In this paper, we present a specialized Corpus of Dialogs of Disabled Speakers. We describe the process of compiling, transcribing and annotating the corpus with pragmatic, semantic and morphosyntactic features, and discuss its applications.


Introduction
The Corpus of Dialogs of Disabled Speakers (in Polish: Korpus Mowy Osób Niepełnosprawnych) was designed and compiled in the course of the project "Evaluation of the Situation, Needs and Competence of Polish Disabled People on a Sample of 10000 Individuals with Impairments" (in Polish: "Ogólnopolskie badanie sytuacji, potrzeb i możliwości osób niepełnosprawnych na próbie 10000 ON"), supported by the National Fund for Rehabilitation of Disabled People (PFRON) in cooperation with the University of Social Sciences and Humanities in Warsaw and the European Social Fund. The project was supervised by Anna Brzezińska, Adam Mickiewicz University in Poznań. The Corpus of Dialogs of Disabled Speakers (CDDS) was compiled and annotated within the module "The Corpus Analysis of Disabled Speakers Utterances", conducted by a team led by Joanna Trzebińska.
The literature on the mutual impact of language and disability is scarce and devoted mostly to people with mental disabilities (Happé, 1993; Langdon et al., 2002; Woźniak, 2000). However, the language of people with physical impairments has been studied from different perspectives and with various methodologies. Some experimental research has demonstrated no differences in specific metaphor use between blind and healthy individuals of various ages (Antović et al., 2013; Minervino et al., 2009). On the other hand, neural activity patterns of adult patients suffering from traumatic brain injury have been shown to differ significantly from those of the control group during metaphor processing (Yang et al., 2010). The social impact of the notion of disability has also been studied, both by surveying the person-first language preferences of the groups concerned (Bickford, 2004) and by discussing particular metaphors associated with disability (Vidali, 2010). Moreover, there has been some effort to design disability-specific tools for evaluating language development in children with motor and visual disabilities (Hennesey, 2011) and for providing tailored Cognitive-Behavioral Therapy for patients suffering from medically unexplained symptoms (Sumathipala, 2013).
So far, there have been no broad-scale corpus studies of the language of people with physical disabilities. As social awareness of disability has been rising over the last two decades, and the disability community itself has become interested in research on its language and its socio-cognitive impact, there is a growing need for studies of this kind. In this study, people with psychiatric or intellectual disabilities constituted a minor fraction of the sample, so it was possible to concentrate on the language of the physically disabled, providing better insight into their language and cognition.

Structure of the CDDS
Data characteristics and format
Twenty group interviews featuring 113 subjects have been transcribed and annotated. The corpus consists of 402,146 units, including 225,299 words of raw text, with nearly 100 tags providing metadata on various features of both verbal and non-verbal communication. Detailed data characteristics are given in Table 1. The corpus data consist of natural speech samples transcribed and stored in several formats, including xml. Extensible markup and stylesheets allow easy tagging, filtering, transformation and retrieval of the raw text. An additional set of scripts provides tools for automatic format conversion and for preparing correct concordancer input.

Annotation method
Corpora differ in both data types and tagging methods (Boguraev & Pustejovsky, 1996; Garside, Leech, McEnery, 1997; Lewandowska-Tomaszczyk, 2005; McEnery & Hardie, 2011). The Corpus of Dialogs of Disabled Speakers is likewise annotated with its own system (Szwabe, 2009b). The CDDS has been tagged with semantic, pragmatic and extra-linguistic metadata. Structural metadata have also been tagged: the partition of the corpus into dialogs and of dialogs into utterances, pauses and events, as well as group type (healthy, disabled, mixed), recording date and place, interview ID and number of interlocutors.
Every utterance is marked with the individual attributes of the speaker (WHO) and the speaker's type (SPEAKER_BIO). The former identifies the specific interview the speaker participated in, her or his role in the dialog (moderator: PR, healthy: BK, disabled: BN) and a personal ID. The latter consists of data on the speaker's sex, place of residence (city, town, village), education, age group (18-30, 31-45, 46-73), disability type and onset. Disability acquisition time has been tagged using a set of intervals. This allows tracking of utterances spoken by a specified type of speaker, as well as analysis of the effects that demographic factors and case history may have on language acquisition and use. Subjects remained anonymous.
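The speaker-based filtering described above can be illustrated with a minimal sketch. The element name `u`, the attribute name `who` and the "interview-role-id" layout of its value are assumptions made for illustration only; the actual CDDS schema may differ. Only the role codes PR, BK and BN come from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical transcript fragment. Element names, attribute names and the
# "interview-role-id" layout of the who value are illustrative assumptions.
SAMPLE = """
<dialog id="07" group="mixed">
  <u who="07-PR-01">Jak pan sobie radzi?</u>
  <u who="07-BN-12">Dobrze, dziękuję.</u>
  <u who="07-BK-05">Ja też.</u>
  <u who="07-BN-12">Czasem jest trudno.</u>
</dialog>
"""

# Role codes as listed in the text.
ROLES = {"PR": "moderator", "BK": "healthy", "BN": "disabled"}


def utterances_by_role(xml_text, role_code):
    """Return the text of every utterance whose speaker has the given role."""
    if role_code not in ROLES:
        raise ValueError(f"unknown role code: {role_code}")
    root = ET.fromstring(xml_text.strip())
    selected = []
    for u in root.iter("u"):
        # Split the assumed "interview-role-id" identifier into its parts.
        _interview, role, _speaker_id = u.get("who").split("-")
        if role == role_code:
            selected.append((u.text or "").strip())
    return selected


disabled_utterances = utterances_by_role(SAMPLE, "BN")
```

The same pattern would extend to SPEAKER_BIO attributes such as age group or disability type, provided they are exposed as attributes or child elements of the utterance.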
The semantic tagset consists of 40 tags coding semantic fields. Tags may be nested where semantic fields intersect, which gives an accurate description of the topic of each utterance and allows semantic field co-occurrence to be tracked.
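As a rough sketch of how nested semantic tags could support co-occurrence tracking, the example below counts pairs of semantic fields that appear within the same utterance. The `sem` element, its `field` attribute and the field labels are hypothetical and do not reproduce the CDDS tagset.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from itertools import combinations

# Hypothetical markup: <sem field="..."> elements, possibly nested,
# stand in for the semantic field tags. Field labels are invented.
SAMPLE = """
<dialog>
  <u><sem field="WORK">szukam pracy <sem field="HEALTH">po operacji</sem></sem></u>
  <u><sem field="FAMILY">mama pomaga</sem> <sem field="HEALTH">w rehabilitacji</sem></u>
  <u><sem field="WORK">w biurze</sem></u>
</dialog>
"""


def field_cooccurrence(xml_text):
    """Count unordered pairs of semantic fields found in the same utterance."""
    root = ET.fromstring(xml_text.strip())
    pairs = Counter()
    for u in root.iter("u"):
        # iter() walks all descendants, so nested tags are included.
        fields = sorted({s.get("field") for s in u.iter("sem")})
        pairs.update(combinations(fields, 2))
    return pairs


cooccurrence = field_cooccurrence(SAMPLE)
```

Sorting the field names before pairing keeps each pair in a canonical order, so (HEALTH, WORK) and (WORK, HEALTH) are counted as the same co-occurrence.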

Figure 4 Semantic annotation sample
The pragmatic tagset consists of 9 tags marking pragmatic features such as implicatures, metaphors, analogies, humor, irony and indirect speech acts. Two perspectives have been applied in parallel to metaphor coding: conceptual metaphor theory (Lakoff & Johnson, 1980) and the post-Gricean inferential model of communication (Grice, 1989). Four types of implicature have been coded, corresponding to Grice's Maxims of Quality, Quantity, Manner and Relevance.

Figure 5 Pragmatic annotation sample
The extra-linguistic tagset consists of 27 tags coding paralinguistic features (laughter, sighs etc.), utterance overlaps, pauses in the dialog, moments of silence after response-demanding speech acts, syllabication and spelling, external events influencing the course of the dialog, communicative gestures and facial expressions, singing, foreign words and unclear speech fragments, as well as structural and biographical metadata.
In addition, original forms such as typical errors, untypical errors, neologisms and region-specific forms have been marked and supplemented with their standard counterparts.
Furthermore, the CDDS has been tagged morphosyntactically using the Morfeusz automatic analyzer and the TaKiPi tagger (Woliński, 2004). However, in numerous cases the resulting annotation included mutually exclusive tags, which hindered further study requiring the use of a concordancer. A short script choosing the most probable tag has been used to prepare the text for statistical analysis.
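The paper does not say how the most probable tag was determined. The sketch below shows one possible heuristic, offered purely as an assumption: tag frequencies are estimated from tokens that received exactly one tag, and each ambiguous token is assigned its most frequent candidate. The tokens and tag labels are illustrative and are not taken from the corpus.

```python
from collections import Counter


def pick_most_probable(tokens):
    """Resolve ambiguous tokens to a single tag.

    tokens: list of (word, [candidate_tags]) pairs.
    Returns a list of (word, tag) pairs.
    Assumption: "most probable" means most frequent among the tokens
    that were tagged unambiguously.
    """
    # Estimate tag frequencies from tokens with exactly one candidate tag.
    freq = Counter(tags[0] for _, tags in tokens if len(tags) == 1)
    resolved = []
    for word, tags in tokens:
        # max() keeps the first candidate on ties, so the tagger's own
        # ordering of candidates acts as a tie-breaker.
        best = max(tags, key=lambda t: freq[t])
        resolved.append((word, best))
    return resolved


# Illustrative input: two tokens carry mutually exclusive candidate tags.
tokens = [
    ("jest", ["fin"]),
    ("idzie", ["fin"]),
    ("dom", ["subst"]),
    ("mam", ["fin", "subst"]),
    ("piec", ["subst", "inf"]),
]
resolved = pick_most_probable(tokens)
```

A context-sensitive model would likely perform better; this frequency-based choice only guarantees that each token ends up with exactly one tag, which is what a concordancer needs.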

Other tools
Transcription has been performed using text editors supporting xml tagging and a previously prepared CDDS xsl template, which allows a preview of the raw corpus text. Additionally, an xml schema that helps detect transcription errors and a file template that ensures a consistent transcript format have been created. As word frequency lists and concordances are the elementary tools of corpus analysis, a script transforming the corpus text into a form accepted by most text analysis software (e.g. concordancers, frequency list generators) has been written.
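A conversion step of this kind might look like the sketch below, which strips the markup from each utterance and builds a word frequency list from the resulting raw text. The element names `u` and `event` are hypothetical, and the tokenization rule is a simplification rather than the procedure used for the CDDS.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical annotated fragment; element names are assumptions.
SAMPLE = (
    '<dialog>'
    '<u who="07-BN-12">No <event type="laugh"/> dobrze, dobrze.</u>'
    '<u who="07-PR-01">Dobrze?</u>'
    '</dialog>'
)


def raw_lines(xml_text):
    """Return one plain-text line per utterance, with all markup removed."""
    root = ET.fromstring(xml_text)
    lines = []
    for u in root.iter("u"):
        text = "".join(u.itertext())
        # Collapse whitespace left behind by removed empty elements.
        lines.append(" ".join(text.split()))
    return lines


def word_frequencies(lines):
    """Build a case-insensitive word frequency list from plain-text lines."""
    counts = Counter()
    for line in lines:
        # \w is Unicode-aware for str patterns, so Polish letters are kept.
        counts.update(re.findall(r"\w+", line.lower()))
    return counts


lines = raw_lines(SAMPLE)
frequencies = word_frequencies(lines)
```

Writing `lines` to a plain UTF-8 text file, one utterance per line, would give input that most concordancers accept.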

Speakers characteristics
Utterances of 116 speakers, including 34 healthy speakers, 79 speakers with disabilities and 3 moderators, are present in the corpus. Subjects were aged 18-73, had various levels of education (mostly secondary and higher education among the speakers with disabilities) and came from various parts of Poland, mainly medium-sized cities. The sex ratio was close to 1, with a slight overall predominance of men and a predominance of women within the healthy group.
Speakers with disabilities have been divided into sub-groups according to their disability type: motor, visual, speech-auditory-vocal, psychiatric and intellectual. The groups are not disjoint, as some subjects suffering from multiple conditions fall into more than one group. The motor disability group was the largest, followed by the visual disability group. Moreover, motion and visual perception are the most important experiential bases used in figurative language by healthy language users. Thus, there is both good reason and good opportunity to study figurative language use in those groups. Table 2 shows the speakers' characteristics.
The video version of the corpus has been compiled from group interviews recorded in 2009. Subject recruitment rules are described by Iwański in a research report (Iwański, 2009; Iwański & Owczarek, 2010), which contains more specific data on the subjects, and will not be discussed here. However, it should be mentioned that individuals whose disability "could prevent them from manifesting elementary communicative competence - like listening, talking, maintaining eye contact with an interlocutor" have been excluded from the study (Iwański, 2009).

Future development
It should be noted that pragmatic annotation is still rare in corpus linguistics. The corpus includes pragmatic annotation that enables the study of language use from the perspective of cognitive pragmatics (Szwabe, 2009a).
As the CDDS is semantically annotated by human taggers, it may be used as a test bed or a training set for machine learning algorithms that automate semantic tagging of natural language corpora. The Corpus has already been used in an application of this kind: it has been shown that Latent Semantic Analysis based on Randomized Singular Value Decomposition may reduce the human effort necessary to semantically annotate a speech corpus in a per-sentence tagging scenario. As a result of the study, an automatic semantic tagger named Semancor has been designed, implemented and tested on samples of transcribed Polish speech derived from the CDDS itself (Prus-Zajączkowski et al., in prep.). Semancor has been successfully used for the annotation of the Polish Child Speech Corpus.
The primary use of the Corpus was a series of analyses of the speech of disabled people, conducted by Joanna Trzebińska in the course of the "Evaluation of the Situation, Needs and Competence of Polish Disabled People on a Sample of 10000 Individuals with Impairments" project. As the general results of the study show that the differences between the disabled speakers and controls lie in communicative style rather than in linguistic competence, the corpus may be viewed as a supplementary reference corpus of Polish (Szwabe, 2009a).

Figure 1 Text sample in the xml tree form

Table 1
Data characteristics

Table 2
Speakers characteristics