Extracting information from PDF documents for use in the automatic indexing of e-books
Keywords:
Software evaluation, Grobid, Automatic indexing, PDFMiner.six, PDFAct, PDF-extract, PDFExtract
Abstract
The number of e-books entering libraries in PDF format grows every day, which complicates, and makes almost unfeasible,
some processes that librarians have traditionally carried out by hand, such as subject assignment. In this context,
applications that assist librarians need to be designed and developed. With this in mind, this paper presents an
evaluation of tools for extracting information from books in PDF, information that could later serve as raw material
for an automatic indexing system. To that end, we carried out a first evaluation of five software tools (PDFMiner.six,
PDFAct, PDF-extract, PDFExtract and Grobid). Because PDFAct achieved the best performance, we then carried out a second
evaluation to determine its ability to identify and extract elements of books such as titles, tables of contents,
sections, table and figure captions, and bibliographic references, all of which are relevant to any indexing system.
We conclude that none of the tools evaluated adequately extracts the different parts of books in PDF, although PDFAct
performed better than the rest.