Data Lake Management System based on Topic Modeling

Authors

  • Amine El Haddadi, Data Science and Competitive Intelligence Team (DSCI), ENSAH, Abdelmalek Essaâdi University (UAE), Tetouan, Morocco
  • Oumaima El Haddadi, Data Science and Competitive Intelligence Team (DSCI), ENSAH, Abdelmalek Essaâdi University (UAE), Tetouan, Morocco
  • Mohamed Cherradi, Data Science and Competitive Intelligence Team (DSCI), ENSAH, Abdelmalek Essaâdi University (UAE), Tetouan, Morocco
  • Fadwa Bouhafer, Data Science and Competitive Intelligence Team (DSCI), ENSAH, Abdelmalek Essaâdi University (UAE), Tetouan, Morocco
  • Anass El Haddadi, Data Science and Competitive Intelligence Team (DSCI), ENSAH, Abdelmalek Essaâdi University (UAE), Tetouan, Morocco
  • Ahmed El Allaoui, Data Science and Competitive Intelligence Team (DSCI), ENSAH, Abdelmalek Essaâdi University (UAE), Tetouan, Morocco

DOI:

https://doi.org/10.56294/dm2023183

Keywords:

Data Lake, Big Data, Business Intelligence, LDA, Topic Modeling

Abstract

In a highly competitive environment, data is a valuable asset for any company looking to grow, and a real economic and strategic lever. Leading companies are concerned not only with collecting data from heterogeneous sources, but also with analyzing these datasets and turning them into better decisions. In this context, the data lake remains a powerful solution for storing large amounts of data and providing analytics for decision support. In this paper, we examine an intelligent data lake management system that addresses the drawbacks of traditional business intelligence, which can no longer keep up with data-driven demands. Data lakes are well suited to analyzing data from a variety of sources, particularly when data cleaning is time-consuming. However, ingesting heterogeneous data sources without any schema is a major issue, and a data lake can easily turn into a data swamp. In this study, we implement the Latent Dirichlet Allocation (LDA) topic model for managing the storage, processing, analysis, and visualization of big data. To assess the usefulness of our proposal, we evaluated its performance using the topic coherence metric. The experimental results showed our approach to be more accurate on the tested datasets.
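
The abstract names topic coherence as the evaluation metric but does not say which variant was used. As an illustration only, the sketch below assumes the UMass variant, which scores a topic by how often its top words appear together in the same documents. The function name `umass_coherence` and the toy corpus are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass coherence of one topic: the sum over word pairs of
    log((D(w_m, w_l) + 1) / D(w_l)), where w_l is ranked above w_m
    and D counts the documents that contain the given words."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    # combinations() yields (higher-ranked, lower-ranked) pairs in list order.
    for w_l, w_m in combinations(top_words, 2):
        d_l = doc_freq(w_l)
        if d_l == 0:
            continue  # assumption: skip words that never occur in the corpus
        score += math.log((doc_freq(w_m, w_l) + 1) / d_l)
    return score


# Hypothetical toy corpus of tokenized documents, for illustration only.
docs = [
    ["data", "lake", "storage"],
    ["data", "lake", "metadata"],
    ["topic", "model", "lda"],
    ["topic", "model", "coherence"],
]

coherent = umass_coherence(["data", "lake"], docs)   # words that co-occur
mixed = umass_coherence(["data", "model"], docs)     # words that never co-occur
print(coherent > mixed)
```

A topic whose top words share documents scores higher than one that mixes unrelated words, which is the property a coherence-based evaluation relies on. In practice the top words would come from a fitted LDA model rather than being written by hand.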

Published

2023-12-28

Section

Original

How to Cite

El Haddadi A, El Haddadi O, Cherradi M, Bouhafer F, El Haddadi A, El Allaoui A. Data Lake Management System based on Topic Modeling. Data and Metadata [Internet]. 2023 Dec. 28 [cited 2024 Dec. 21];2:183. Available from: https://dm.ageditor.ar/index.php/dm/article/view/113