DOI: https://doi.org/10.11649/cs.1430

An open stylometric system based on multilevel text analysis

Maciej Eder, Maciej Piasecki, Tomasz Walkowiak

Abstract



Stylometric techniques are usually applied to a limited number of typical tasks, such as authorship attribution, genre analysis, or gender studies. However, they could be applied to several tasks beyond this canonical set, if only stylometric tools were more accessible to users from different areas of the humanities and social sciences. This paper presents a general idea, followed by a fully functional prototype, of an open stylometric system that facilitates wide use through two kinds of flexibility: technical and research-related. The system relies on a server installation combined with a web-based user interface. This frees the user from the necessity of installing any additional software. At the same time, the system offers a variety of ways in which the input texts can be analysed: they include not only the usual lexical level, but also deep-level linguistic features. This enables a range of possible applications, from typical stylometric tasks to the semantic analysis of text documents. The internal architecture of the system relies on several well-known software packages: a collection of language tools (for text pre-processing), Stylo (for stylometric analysis) and Cluto (for text clustering). The paper presents: (1) the idea behind the system from the user’s perspective; (2) the architecture of the system, with a focus on data processing; (3) features for text description; (4) the use of analytical systems such as Stylo and Cluto. The presentation is illustrated with example applications.

 

An open stylometric system using multilevel language analysis

Applications of stylometric methods are generally limited to a few typical research problems, such as authorship attribution, the style of literary genres, or studies of stylistic differences between women and men. They could certainly also be applied successfully to many other text classification problems, if only these methods and the appropriate tools were more accessible to scholars representing various disciplines of the humanities and social sciences. This article discusses the theoretical assumptions and a fully functional prototype of an open stylometric system whose wide application is made possible by two of its properties: technical flexibility and adaptability to different research questions. The system is based on a server installation coupled with a web-based user interface. This frees the user from the need to install any additional software. At the same time, the system offers many ways of analysing texts, not only at the lexical level but also through low-level linguistic features. This makes it possible to use the system in many different ways, from typical stylometric tests to the semantic analysis of documents. The internal architecture of the system consists of many components known for their functionality, including the Stylo package for stylometric analysis and the Cluto package for advanced cluster analysis. The article discusses: (1) the concept of the whole system as seen from the user’s point of view; (2) the architecture of the system and its components responsible for text processing; (3) the linguistic features used to describe documents; (4) the use of data analysis modules such as Stylo and Cluto. Example applications of the system are also presented.


Keywords


stylometry; Polish; CLARIN-PL; research infrastructure; language technology






Copyright (c) 2017 Maciej Eder, Maciej Piasecki, Tomasz Walkowiak

License URL: http://creativecommons.org/licenses/by/3.0/pl/