TextRefine: A Novel approach to improve the accuracy of LLM Models

Authors

Dalal E, Singh P

DOI:

https://doi.org/10.56294/dm2024331

Keywords:

TextRefine, Natural Language Processing, LLM Models

Abstract

Natural Language Processing (NLP) is an interdisciplinary field that studies human language with the goal of building computational models and algorithms that can comprehend, produce, and analyze natural language in a way similar to humans. Despite their strong performance on NLP tasks, large language models (LLMs) still struggle with noisy and unpolished input text. To address this problem, TextRefine offers a thorough preprocessing pipeline that cleans and refines text data before it is passed to an LLM. The pipeline includes several steps: removing social tags, normalizing whitespace, converting all lowercase letters to uppercase, removing stopwords, fixing Unicode issues, unpacking contractions, removing punctuation and accents, and general text cleanup. Together, these steps strengthen the integrity and quality of the input data, which in turn improves the efficiency and precision of LLMs. Extensive testing and comparisons with standard techniques show TextRefine's effectiveness, with a reported accuracy of 99 %.
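
The preprocessing steps listed in the abstract could be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function name `refine`, the tiny `CONTRACTIONS` and `STOPWORDS` tables, the regular expression for social tags, and the order of the steps are all assumptions, since the abstract does not specify them. Only the Python standard library is used.

```python
import re
import string
import unicodedata

# Hypothetical, deliberately tiny lookup tables (assumptions for illustration);
# a real pipeline would use much fuller lists.
CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is", "don't": "do not"}
STOPWORDS = {"a", "an", "the", "is", "it", "to", "of", "and"}


def refine(text: str) -> str:
    """Apply the cleanup steps named in the abstract, in an assumed order."""
    # Fix Unicode issues: NFKC folds compatibility forms such as ligatures.
    text = unicodedata.normalize("NFKC", text)
    # Map the curly apostrophe to a plain one so the contraction lookup works.
    text = text.replace("\u2019", "'")
    # Remove social tags, taken here to mean @mentions and #hashtags.
    text = re.sub(r"[@#]\w+", " ", text)
    # Unpack contractions with a case-insensitive, whole-token lookup.
    text = " ".join(CONTRACTIONS.get(w.lower(), w) for w in text.split())
    # Remove accents: decompose characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Remove stopwords; splitting and re-joining also normalizes whitespace.
    words = [w for w in text.split() if w.lower() not in STOPWORDS]
    # Convert to uppercase, as the abstract describes.
    return " ".join(words).upper()


print(refine("@user I can't believe the café is   closed! #sad"))
# I CANNOT BELIEVE CAFE CLOSED
```

The contraction lookup in this sketch only matches whole whitespace-separated tokens, so a contraction with punctuation attached (for example `can't!`) would pass through unexpanded; a fuller implementation would tokenize more carefully.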

Published

2024-01-01

Section

Original

How to Cite

1. Dalal E, Singh P. TextRefine: A Novel approach to improve the accuracy of LLM Models. Data and Metadata [Internet]. 2024 Jan. 1 [cited 2024 Sep. 20];3:331. Available from: https://dm.ageditor.ar/index.php/dm/article/view/292