Towards an event annotated corpus of Polish

Authors

  • Michał Marcińczuk, Politechnika Wrocławska [Wrocław University of Technology], Wrocław
  • Marcin Oleksy, Politechnika Wrocławska [Wrocław University of Technology], Wrocław
  • Tomasz Bernaś, Politechnika Wrocławska [Wrocław University of Technology], Wrocław
  • Jan Kocoń, Politechnika Wrocławska [Wrocław University of Technology], Wrocław
  • Michał Wolski, Politechnika Wrocławska [Wrocław University of Technology], Wrocław

DOI:

https://doi.org/10.11649/cs.2015.018

Keywords:

information extraction, event recognition, corpus annotation

Abstract

The paper presents a typology of events based on the TimeML specification and adapted to the Polish language. Some changes were introduced to the definitions of the event categories, and a motivation for the event categorization was formulated. The event annotation task is presented on two levels: the ontology level (language-independent) and the level of text mentions (language-dependent). The various types of event mentions in Polish text are discussed. A procedure for annotating event mentions in Polish texts is presented and evaluated. In the evaluation, a randomly selected set of documents from the Corpus of Wrocław University of Technology (KPWr) was annotated by two linguists, and the inter-annotator agreement was calculated. The evaluation was carried out in two iterations. After the first iteration, we revised and improved the annotation procedure. The second iteration showed a significant improvement in the agreement between annotators. The current work focused on the annotation and categorisation of event mentions in text. Future work will focus on describing events with a set of attributes, arguments and relations.
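The abstract does not say which agreement measure was used. As an illustration only, the sketch below computes positive specific agreement, a measure often used for span annotation and numerically equal to the F-measure between two annotators. The mention tuples, the category labels and the function names are hypothetical and are not taken from the paper or from KPWr.

```python
from typing import AbstractSet, Hashable, Set, Tuple

# Hypothetical mention format: (document id, start offset, end offset, category).
Mention = Tuple[str, int, int, str]


def positive_specific_agreement(a: AbstractSet[Hashable],
                                b: AbstractSet[Hashable]) -> float:
    """Return 2 * |A & B| / (|A| + |B|) for two sets of annotations."""
    if not a and not b:
        # Both annotators marked nothing: treat as full agreement.
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def spans_only(mentions: Set[Mention]) -> Set[Tuple[str, int, int]]:
    """Drop the category so that only mention boundaries are compared."""
    return {(doc, start, end) for doc, start, end, _ in mentions}


# Invented toy data; the labels are placeholders, not the paper's categories.
annotator_a: Set[Mention] = {
    ("doc1", 0, 7, "occurrence"),
    ("doc1", 15, 22, "state"),
    ("doc1", 30, 41, "occurrence"),
}
annotator_b: Set[Mention] = {
    ("doc1", 0, 7, "occurrence"),
    ("doc1", 15, 22, "occurrence"),  # same span, different category
    ("doc1", 50, 58, "state"),
}

# Agreement on mention boundaries only.
span_agreement = positive_specific_agreement(
    spans_only(annotator_a), spans_only(annotator_b)
)
# Agreement on boundaries and category together.
full_agreement = positive_specific_agreement(annotator_a, annotator_b)

print(f"spans only: {span_agreement:.3f}")
print(f"spans + category: {full_agreement:.3f}")
```

Reporting the two figures separately shows whether disagreements come from mention boundaries or from category assignment, which is the kind of distinction a revised annotation procedure could target.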

Published

2015-12-31

Section

Semantics, Corpus Linguistics and Computer Linguistics