Logistic Regression and TF-IDF Models for Early Detection of Indonesian-Language Fake News
Abstract
The rapid flow of digital content, particularly on social media platforms, has made the spread of false information a significant challenge. Most existing fake news detection systems are designed for English and are less effective when applied to Indonesian text. This study proposes a web-based hoax detection system that combines Term Frequency-Inverse Document Frequency (TF-IDF) feature extraction with a Logistic Regression classifier. The dataset, consisting of real and fake news articles, was obtained from Kaggle and preprocessed through normalization, stopword removal, and stemming. The resulting TF-IDF vectors were then used to train a binary classification model. The system accepts user input as either raw text or a URL and returns classification results in real time. Evaluation indicates a high level of accuracy and strong potential for improving public digital literacy. These findings demonstrate a lightweight yet effective approach to mitigating the spread of Indonesian-language hoaxes.
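The pipeline summarized in the abstract (preprocessing, TF-IDF vectorization, binary logistic regression) can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation. The stopword list, the four training sentences, their labels, and all function names are hypothetical and exist only for illustration. Stemming is omitted, and the smoothed IDF formula and per-sample gradient descent are assumptions rather than details taken from the study.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stopword list (illustration only; a real system would use a full list).
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu"}

def tokenize(text):
    """Normalize to lowercase, keep alphabetic tokens, drop stopwords (no stemming)."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def fit_tfidf(docs):
    """Learn a vocabulary and smoothed IDF weights from tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vocab = {term: i for i, term in enumerate(sorted(df))}
    idf = {term: math.log((1 + n) / (1 + df[term])) + 1.0 for term in vocab}
    return vocab, idf

def transform(doc, vocab, idf):
    """Map a tokenized document to an L2-normalized TF-IDF vector (dense list)."""
    counts = Counter(t for t in doc if t in vocab)
    vec = [0.0] * len(vocab)
    for term, count in counts.items():
        vec[vocab[term]] = count * idf[term]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid division by zero
    return [v / norm for v in vec]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by per-sample gradient descent on the log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - label  # gradient of the log loss with respect to the logit
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict_proba(text, vocab, idf, w, b):
    """Return the estimated probability that the text is a hoax (label 1)."""
    x = transform(tokenize(text), vocab, idf)
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Invented toy data, not taken from the study's Kaggle dataset: 1 = hoax, 0 = real.
train_texts = [
    "vaksin menyebabkan chip rahasia tertanam di tubuh",
    "minum air garam menyembuhkan semua penyakit",
    "pemerintah mengumumkan jadwal vaksinasi nasional",
    "bank sentral mempertahankan suku bunga acuan",
]
labels = [1, 1, 0, 0]

tokenized = [tokenize(t) for t in train_texts]
vocab, idf = fit_tfidf(tokenized)
X = [transform(doc, vocab, idf) for doc in tokenized]
w, b = train_logreg(X, labels)

for text in train_texts:
    print(round(predict_proba(text, vocab, idf, w, b), 3), text)
```

With only four sentences, the model simply memorizes its training data, so the printed probabilities say nothing about real-world accuracy. The sketch is meant only to show how text flows from tokens to TF-IDF weights to a hoax probability.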