Semantic-Aware Data Lake Architecture with Automated Schema Evolution and Intelligent Data Discovery
DOI:
https://doi.org/10.32628/CSEIT24113394
Keywords:
Data lake, semantic architecture, schema evolution, knowledge graphs, intelligent data discovery, metadata management, ontology learning
Abstract
This paper presents a semantic-aware data lake architecture that incorporates automated schema evolution and intelligent data discovery to address the challenges of managing diverse, rapidly changing datasets in modern big data environments. The proposed Semantic Data Lake System (SDLS) combines knowledge graphs, machine learning, and advanced metadata management to create self-organizing data repositories that automatically understand and categorize data and optimize its storage and access patterns. Our methodology employs natural language processing and ontology learning techniques to extract semantic meaning from ingested data, producing rich metadata descriptions that enable intelligent data discovery and lineage tracking. The system includes a novel automated schema evolution mechanism that detects schema changes in streaming data and updates data catalogs and downstream processing pipelines without service interruption. We introduce an intelligent data discovery engine that uses semantic similarity and usage pattern analysis to help users find relevant datasets and understand their relationships. The architecture also incorporates data quality monitoring and automatic anomaly detection to maintain data integrity across the entire lake. Our implementation shows a 78% improvement in relevant dataset identification and a 65% reduction in data preparation time on data discovery tasks. Experimental validation using multi-domain enterprise datasets shows effective handling of schema evolution events with minimal performance impact, along with enhanced data governance capabilities.
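The abstract does not specify how SDLS detects schema changes, so the following is only a minimal sketch of the general idea, under stated assumptions: a schema is modeled as a flat mapping from field name to type name, and a change is detected by diffing the schema inferred from an incoming record against the cataloged one. The names `SchemaDiff`, `infer_schema`, and `diff_schemas` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class SchemaDiff:
    """Fields added, removed, or retyped between two schema versions."""
    added: dict = field(default_factory=dict)
    removed: dict = field(default_factory=dict)
    retyped: dict = field(default_factory=dict)

    def is_empty(self) -> bool:
        return not (self.added or self.removed or self.retyped)


def infer_schema(record: dict) -> dict:
    """Map each field name to the Python type name of its value (assumed flat record)."""
    return {name: type(value).__name__ for name, value in record.items()}


def diff_schemas(old: dict, new: dict) -> SchemaDiff:
    """Compare a cataloged schema with a newly observed one."""
    diff = SchemaDiff()
    for name, type_name in new.items():
        if name not in old:
            diff.added[name] = type_name
        elif old[name] != type_name:
            diff.retyped[name] = (old[name], type_name)
    for name, type_name in old.items():
        if name not in new:
            diff.removed[name] = type_name
    return diff


# Hypothetical catalog entry and one incoming streaming record.
catalog_schema = {"id": "int", "name": "str"}
incoming = {"id": 7, "name": "sensor-a", "reading": 0.5}

change = diff_schemas(catalog_schema, infer_schema(incoming))
if not change.is_empty():
    # Additive changes are merged into the catalog; removals and retypes
    # would need a policy decision that this sketch does not model.
    catalog_schema = {**catalog_schema, **change.added}
```

In this sketch only additive changes are applied automatically, since new fields are the usual backward-compatible case; how removals or type changes propagate to downstream pipelines is left open here.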
License
Copyright (c) 2024 International Journal of Scientific Research in Computer Science, Engineering and Information Technology

This work is licensed under a Creative Commons Attribution 4.0 International License.