DOI: https://doi.org/10.11649/cs.1316

A preliminary study in zero anaphora coreference resolution for Polish

Adam Jan Kaczmarek, Michał Marcińczuk

Abstract



Zero anaphora is a part of the coreference resolution task that has not yet been addressed directly for Polish; most studies have set it aside as the most challenging aspect, to be investigated later. This article presents an initial study of the problem. We discuss the preparation of a machine learning approach together with the engineering of features based on a linguistic study of the KPWr corpus. The study uses existing tools for Polish coreference resolution as sources of partial coreferential clusters containing pronoun, noun and named entity mentions. The same tools also serve as baseline zero coreference resolution systems against which our system is compared. The evaluation covers not only clustering correctness, measured with the standard CoNLL-2012 metrics without regard to mention type, but also the informativeness of the resulting relations. Under the coreference annotation approach used in the KPWr corpus, only named entities are treated as mentions informative enough to constitute a link to real-world objects. Consequently, we provide an evaluation of informativeness based on the links found between zero anaphoras and named entities. For the same reason, we restrict coreference resolution in this study to mention clusters built around named entities.

 

A preliminary study of resolving zero anaphora coreference in Polish

Zero coreference in Polish is one of the problems within coreference resolution. So far it has not been a direct subject of research: because of its complexity, it has been left out and deferred to later stages of investigation. This article presents a preliminary study of the problem of zero coreference resolution. We present an approach that uses machine learning techniques and describe the process of designing features based on a linguistic analysis of the KPWr corpus. In this work we use existing coreference resolution tools for the remaining mention types (i.e. proper names, noun phrases and pronouns) as a source of partial sets of mentions referring to the same object, and also as a reference point for our results. The evaluation focuses not only on the correctness of the resulting mention sets regardless of mention type, as reflected by the results reported for the standard CoNLL-2012 metrics, but also on the value of the information gained by resolving coreference. In line with the annotation assumptions of the KPWr corpus, only proper names are treated as mentions that carry sufficiently detailed information to be linked to real-world objects. Consequently, we also provide an evaluation based on information value for zero subjects linked by a coreference relation to proper names. For the same reason, we consider only sets of coreferent mentions built around proper names.


Keywords


coreference; zero subject; zero anaphora coreference in Polish


Copyright (c) 2017 Adam Jan Kaczmarek, Michał Marcińczuk

License URL: http://creativecommons.org/licenses/by/3.0/pl/