DOI: https://doi.org/10.11649/cs.2014.008

The IMPACT project Polish Ground-Truth texts as a DjVu corpus

Janusz S. Bień

Abstract


The paper has two aims. The first is to describe DjVu corpora, an idea that has already been implemented: corpora that consist of both scanned images and a transcription of the texts, with the words linked to their occurrences in the scans. The second is to present a case study of a corpus of almost 5 000 pages of Polish historical texts dating from 1570 to 1756 (practically the very first corpus of historical Polish). The tools described are general-purpose and freely available under the GNU GPL, so they can also be used for other purposes.


Keywords


Polish language; corpora; DjVu; OCR; PAGE; Page Analysis and Ground-Truth Elements; GNU GPL

Copyright (c) 2014 Janusz S. Bień

License URL: http://creativecommons.org/licenses/by/3.0/pl/