Overview on Data Ingestion and Schema Matching
DOI:
https://doi.org/10.56294/dm2024219
Keywords:
Data Management, Schema Matching, Data Ingestion, Heterogeneous Schema, Dynamic Environment
Abstract
This overview traced the evolution of data management from traditional ETL processes to the contemporary challenges of Big Data, with a particular emphasis on data ingestion and schema matching. It explored the classification of data ingestion into batch, real-time, and hybrid processing, underscoring the challenges associated with data quality and heterogeneity. Central to the discussion was the role of schema mapping in data alignment, which proved indispensable for linking diverse data sources. Recent advancements, notably the adoption of machine learning techniques, were significantly reshaping the field. The paper also addressed current challenges, including the integration of new technologies and the need for effective schema matching solutions, highlighting the continuously evolving nature of schema matching in the context of Big Data.
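To make the idea of schema matching between heterogeneous sources concrete, here is a minimal sketch of a name-based matcher. It is an illustration only, not the method of this paper or of any system it surveys: the column names, the `match_schemas` helper, and the 0.6 threshold are all hypothetical, and plain string similarity stands in for the machine learning techniques mentioned above.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lower-case a column name and drop separators, e.g. 'Cust_ID' -> 'custid'."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def name_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] between two normalized column names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_schemas(source, target, threshold=0.6):
    """Greedy one-to-one matching of source columns to target columns.

    Returns a dict mapping each matched source column to a target column.
    The threshold is an assumed cut-off, not a recommended value.
    """
    # Score every candidate pair and visit the best-scoring pairs first.
    scored = sorted(
        ((name_similarity(s, t), s, t) for s in source for t in target),
        key=lambda item: item[0],
        reverse=True,
    )
    mapping, used_targets = {}, set()
    for score, s, t in scored:
        if score < threshold:
            break  # every remaining pair scores lower still
        if s in mapping or t in used_targets:
            continue  # keep the correspondence one-to-one
        mapping[s] = t
        used_targets.add(t)
    return mapping


# Hypothetical schemas from two sources that describe the same entity.
crm_columns = ["Cust_ID", "Full Name", "E-mail", "signup_date"]
erp_columns = ["customerId", "name", "email", "created_at"]
mapping = match_schemas(crm_columns, erp_columns)
print(mapping)
```

In this sketch `signup_date` and `created_at` remain unmatched because their names share little text even though they plausibly hold the same information. Cases like this are where instance-based or embedding-based matchers are typically brought in.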
License
Copyright (c) 2024 Oumaima El Haddadi, Max Chevalier, Bernard Dousset, Ahmad El Allaoui, Anass El Haddadi, Olivier Teste (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.
The article is distributed under the Creative Commons Attribution 4.0 License. Unless otherwise stated, associated published material is distributed under the same licence.
