A Comprehensive Review of Methods, Frameworks, and Domains for Metadata Extraction from Scientific Texts
DOI:
https://doi.org/10.32628/CSEIT261218
Keywords:
Metadata Extraction, Scientific Text Mining, Information Extraction, Scholarly Documents, Machine Learning
Abstract
Metadata extraction from scientific texts plays a crucial role in enabling efficient organization, retrieval, and analysis of scholarly knowledge. With the exponential growth of scientific publications across disciplines, manual metadata annotation has become infeasible, motivating the development of automated and semi-automated extraction techniques. This review paper presents a comprehensive analysis of recent advances in metadata and structured information extraction from scientific documents. It explores traditional rule-based methods, machine learning approaches, deep learning architectures, and emerging large language model-based frameworks. The paper also examines domain-specific applications, including systematic reviews, digital libraries, scientific repositories, and open journal systems. By synthesizing findings from recent literature, this study highlights key research trends, strengths, and limitations of existing methods. Furthermore, it identifies major challenges such as document heterogeneity, semantic ambiguity, evaluation complexity, and human–machine collaboration. The review aims to provide researchers with a structured understanding of current methodologies and open research directions, thereby supporting the development of robust, scalable, and high-precision metadata extraction systems for scientific knowledge management.
License
Copyright (c) 2026 International Journal of Scientific Research in Computer Science, Engineering and Information Technology

This work is licensed under a Creative Commons Attribution 4.0 International License.