Embedding and Topic Modeling Techniques for Short Text Analysis on Social Media: A Systematic Literature Review

Authors

Warsito B, Endro Suseno J, Arifudin A

DOI:

https://doi.org/10.56294/dm20251168

Keywords:

Systematic Literature Review, Topic Modeling, Word Embedding, Short-Text Analysis, Social Media, Natural Language Processing

Abstract

Introduction: Short texts from social media are a valuable source of insight, but their sparsity and noise make them difficult to analyze. Integrating embedding techniques with topic modeling has emerged as a key way to address both problems.

Methods: A Systematic Literature Review (SLR) was conducted following PRISMA guidelines. The IEEE, ScienceDirect, and Scopus databases were searched systematically, and the retrieved studies were screened and selected against predefined inclusion and exclusion criteria.

Results: The analysis of 22 included studies revealed a clear methodological trend toward hybrid models that integrate transformer-based embeddings, such as BERT, with topic modeling frameworks. These integrated approaches consistently demonstrated superior performance in generating coherent topics and improving downstream task accuracy compared to standalone or traditional methods. However, limitations related to model generalizability, computational cost, and domain adaptation were identified.
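The hybrid pattern described in the Results can be illustrated with a minimal sketch: documents are embedded, the embeddings are clustered, and each cluster is summarized by its most distinctive words. Everything below is an assumption made for illustration and is not taken from any reviewed study: the four toy documents, the hand-written 2-D vectors standing in for transformer embeddings such as BERT, the simple k-means step, and the class-based TF-IDF weighting (in the style used by tools such as BERTopic). The function names `kmeans` and `topic_words` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def kmeans(vectors, k, iters=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from its nearest existing centroid.
        centroids.append(
            max(vectors, key=lambda v: min(math.dist(v, c) for c in centroids))
        )
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assign each vector to its closest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(v, centroids[j]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def topic_words(docs, labels, top_n=3):
    """Rank words per cluster with a class-based TF-IDF style weight."""
    per_cluster = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        per_cluster[label].update(doc.lower().split())
    global_freq = Counter()
    for counts in per_cluster.values():
        global_freq.update(counts)
    # Average number of word tokens per cluster.
    avg_words = sum(global_freq.values()) / len(per_cluster)
    topics = {}
    for label, counts in per_cluster.items():
        total = sum(counts.values())
        scores = {
            term: (n / total) * math.log(1 + avg_words / global_freq[term])
            for term, n in counts.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        topics[label] = sorted(scores, key=lambda t: (-scores[t], t))[:top_n]
    return topics

# Toy corpus (assumption) and stand-in embeddings (assumption: a real
# pipeline would obtain these vectors from a pretrained language model).
docs = [
    "vaccine side effects fever",
    "vaccine booster appointment",
    "stock market crash today",
    "market rally stock gains",
]
embeddings = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]

labels = kmeans(embeddings, k=2)
topics = topic_words(docs, labels, top_n=2)
print(labels, topics)
```

Separating the clustering step from the word-ranking step mirrors the modular design of many embedding-based topic models: the embedding model or the clustering algorithm can be swapped without changing how topics are described.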

Conclusions: Across the reviewed studies, integrating contextual embeddings with topic models was the most effective approach to short-text analysis on social media. Future research should focus on developing more adaptive and efficient models, including fine-tuning language models on domain-specific corpora and integrating Large Language Models (LLMs) to enhance automation and accuracy.

Published

2025-09-06

Section

Systematic reviews or meta-analyses

How to Cite

Warsito B, Endro Suseno J, Arifudin A. Embedding and Topic Modeling Techniques for Short Text Analysis on Social Media: A Systematic Literature Review. Data and Metadata [Internet]. 2025 Sep. 6 [cited 2025 Sep. 13];4:1168. Available from: https://dm.ageditor.ar/index.php/dm/article/view/1168