Extracting information from PDF documents for use in automatic indexing of e-books

Authors

I. Gil-Leiva, M. S. L. Fujita, F. M. Redigolo, J. F. Saran

Keywords:

Software evaluation, PDFMiner.six, PDFAct, PDF-extract, PDFExtract, Grobid, Automatic indexing

Abstract

The number of electronic books entering libraries in PDF format grows every day. This makes some processes traditionally carried out manually by librarians, such as subject assignment, difficult and almost unfeasible. In this context, it is necessary to design and develop applications that assist librarians. With this in mind, this work presents an evaluation of tools for extracting information from books in PDF format that could later serve as raw material for an automatic indexing system. We first evaluated five software tools (PDFMiner.six, PDFAct, PDF-extract, PDFExtract, and Grobid). Because PDFAct achieved the best performance, we then carried out a second evaluation of its ability to identify and extract information from books, such as titles, indexes, sections, titles of tables and graphs, and bibliographic references, all of which are relevant to any indexing system. We conclude that none of the evaluated tools adequately extracts the different parts of PDF books, although PDFAct performed better than the rest.
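As an illustration only, the kind of evaluation described above could be sketched as a comparison between hand-labelled book parts and the text a tool returns for them. The abstract does not state which metric was used, so the similarity measure here (Python's standard `difflib.SequenceMatcher`) is an assumption, and the function names, part labels, and sample strings are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(expected: str, extracted: str) -> float:
    """Return a 0..1 similarity ratio after normalising whitespace and case."""
    def normalise(text: str) -> str:
        return " ".join(text.split()).lower()
    return SequenceMatcher(None, normalise(expected), normalise(extracted)).ratio()


def score_tool(ground_truth: dict, tool_output: dict) -> dict:
    """Score each labelled book part; a part the tool missed counts as empty text."""
    return {
        part: similarity(text, tool_output.get(part, ""))
        for part, text in ground_truth.items()
    }


# Hypothetical hand-labelled parts of one book and one tool's output for them.
ground_truth = {"title": "Manual of Indexing", "section": "1. Introduction"}
tool_output = {"title": "Manual  of indexing", "section": ""}

scores = score_tool(ground_truth, tool_output)
for part, value in scores.items():
    print(f"{part}: {value:.2f}")
```

Scoring each part separately, rather than the whole document at once, makes it visible when a tool recovers titles well but misses sections or references.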

Published

2022-09-23

How to Cite

Gil-Leiva, I., Fujita, M. S. L., Redigolo, F. M., & Saran, J. F. (2022). Extracting information from PDF documents for use in automatic indexing of e-books. Transinformação, 34, 1–11. Retrieved from https://periodicos.puc-campinas.edu.br/transinfo/article/view/6870

Section

Original