Phishing Website Detection: A Dataset-Centric Approach for Enhanced Security

Sultan Ahmad; Md Alimul Haque; Hikmat A. M. Abdeljaber; M. U. Bokhari; Jabeen Nazeer; B. K. Mishra

doi:10.56294/dm2024.223

Authors

Sultan Ahmad, Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, P.O. Box 151, Alkharj 11942, Saudi Arabia. https://orcid.org/0000-0002-3198-7974
Md Alimul Haque, Department of Computer Science, Veer Kunwar Singh University, Ara - 802301, India. https://orcid.org/0000-0002-0744-0784
Hikmat A. M. Abdeljaber, Department of Computer Science, Faculty of Information Technology, Applied Science Private University, Amman, Jordan. https://orcid.org/0000-0001-9557-3933
M. U. Bokhari, Department of Computer Science, Aligarh Muslim University, Aligarh - 202002, India.
Jabeen Nazeer, Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, P.O. Box 151, Alkharj 11942, Saudi Arabia.
B. K. Mishra, P.G. Department of Physics, Veer Kunwar Singh University, Ara - 802301, India.

Keywords:

Machine learning, Phishing attacks, Cybersecurity

Abstract

Introduction: Phishing involves cybercriminals creating fake websites that imitate legitimate ones in order to obtain personal information. As phishing websites grow more sophisticated, machine learning offers a useful way to detect and counter such attacks.
Objective: In this study, we apply machine learning algorithms to the Phishing_Legitimate_full.csv dataset, which consists of labeled phishing and legitimate websites.
Method: This paper aims to identify the most effective feature selection method for predicting phishing websites.
Result: The findings highlight the potential of machine learning to enhance cybersecurity by automating threat detection and intelligence. Phishing attacks rely on social engineering to present deceptive links as trustworthy sources, tricking individuals into sharing confidential data.
Conclusion: This study explores the use of curated datasets and machine learning algorithms to develop adaptive and efficient phishing detection mechanisms, providing a robust defense against such malicious activities.
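
The abstract does not say which feature selection methods were compared, so the following is only a minimal sketch of one common family, a filter method that ranks features by the absolute Pearson correlation with the class label. The data are synthetic and the feature names (`url_length`, `num_dots`, `noise`) are hypothetical placeholders, not columns confirmed from Phishing_Legitimate_full.csv.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no ranking signal
    return cov / sqrt(var_x * var_y)


def rank_features(rows, labels):
    """Return (feature, score) pairs sorted by |correlation with label|."""
    scores = {
        name: abs(pearson([row[name] for row in rows], labels))
        for name in rows[0]
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def make_synthetic(n=400, seed=0):
    """Synthetic stand-in data: label 1 = phishing, 0 = legitimate (assumed)."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        rows.append({
            "url_length": rng.gauss(40 + 30 * y, 10),  # strongly label-dependent
            "num_dots": rng.gauss(2 + y, 1.5),         # weakly label-dependent
            "noise": rng.gauss(0, 1),                  # unrelated to the label
        })
        labels.append(y)
    return rows, labels


rows, labels = make_synthetic()
ranking = rank_features(rows, labels)
top_k = [name for name, _ in ranking[:2]]  # keep the two highest-scoring features

for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A filter of this kind scores each feature independently of any classifier, which makes it cheap but blind to feature interactions. Wrapper or embedded methods would instead evaluate subsets of features against a trained model.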
