The IMPACT project Polish Ground-Truth texts as a DjVu corpus

Janusz S. Bień


This paper has two aims. The first is to describe the already implemented idea of DjVu corpora, i.e. corpora that consist of both scanned images and a transcription of the texts, with each word linked to its occurrences in the scans. The second is to present a case study of a corpus of almost 5,000 pages of Polish historical texts dating from 1570 to 1756, which is practically the first corpus of historical Polish. The tools described are general-purpose and freely available under the GNU GPL, so they can also be used for other purposes.
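The core idea of linking each transcribed word to its place in a scan can be illustrated with a minimal sketch. This is not the format or the tooling used in the project; the class name `WordOccurrence`, the function `find_occurrences`, the coordinate convention and the sample data are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordOccurrence:
    """One transcribed word tied to its location in a scanned page."""
    text: str    # the word as transcribed
    page: int    # 1-based page number (hypothetical convention)
    bbox: tuple  # (x_min, y_min, x_max, y_max) in pixels of the page image


def find_occurrences(corpus, query):
    """Return every occurrence whose transcription equals the query word."""
    return [occ for occ in corpus if occ.text == query]


# Hypothetical sample data; words and coordinates are invented.
corpus = [
    WordOccurrence("Bog", 1, (120, 340, 210, 380)),
    WordOccurrence("y", 1, (220, 340, 240, 380)),
    WordOccurrence("Bog", 2, (95, 610, 185, 650)),
]

hits = find_occurrences(corpus, "Bog")
for occ in hits:
    # Each hit carries enough information to highlight the word in the scan.
    print(occ.page, occ.bbox)
```

Because every hit carries a page number and a bounding box, a viewer could in principle show the matching region of the scanned image next to the transcription, which is the property that distinguishes such a corpus from a text-only one.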


Polish language; corpora; DjVu; OCR; PAGE; Page Analysis and Ground-Truth Elements; GNU GPL



Copyright (c) 2014 Janusz S. Bień
