Logistic Regression and TF-IDF Models for Early Detection of Indonesian-Language Fake News

Abdul Hamid Arribathi; Sunan Reihan Jungjunan; Muhammad Adrian Dwiharyanto

doi:10.33050/sensi.v11i2.4067

Published: Aug 31, 2025

Keywords:

Hoax detection, Indonesian language, TF-IDF, Logistic Regression, digital literacy

Abdul Hamid Arribathi

University of Raharja Tangerang

Sunan Reihan Jungjunan

University of Raharja Tangerang

Muhammad Adrian Dwiharyanto

University of Raharja Tangerang

Abstract

The spread of false information, particularly on social media platforms, has become a significant challenge because digital content circulates so rapidly. Most existing fake news detection systems are designed for English, which makes them less effective in Indonesian contexts. This study proposes a web-based hoax detection system that combines Term Frequency-Inverse Document Frequency (TF-IDF) vectorization with a Logistic Regression classifier. The dataset, consisting of both real and fake news articles, was obtained from Kaggle and preprocessed through several stages, including normalization, stopword removal, and stemming. The resulting TF-IDF vectors were then used to train a binary classification model. The system accepts user input as either raw text or a URL and returns classification results in real time. Evaluation of the system indicates a high level of accuracy and strong potential for improving public digital literacy. These findings demonstrate a lightweight yet effective approach to mitigating the spread of Indonesian-language hoaxes.
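
The pipeline described in the abstract (preprocessing, TF-IDF vectorization, binary Logistic Regression) can be illustrated with a minimal standard-library sketch. It is not the authors' implementation. The stopword list, the toy sentences, the learning rate, and all function names are assumptions made for illustration. Stemming is omitted, and the smoothed IDF formula is one common variant, chosen here only because the abstract does not specify one.

```python
import math
import re
from collections import Counter

# Hypothetical mini stopword list; the paper's actual list is not given.
STOPWORDS = {"yang", "dan", "di", "ini", "itu", "untuk", "dengan"}

def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stopwords (stemming omitted)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def fit_idf(docs):
    """Smoothed IDF per term: ln((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf(doc, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen terms are ignored."""
    counts = Counter(t for t in doc if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(vec, weights, bias):
    """Logistic model: probability of the positive (hoax) class."""
    return sigmoid(bias + sum(weights.get(t, 0.0) * v for t, v in vec.items()))

def train(vectors, labels, epochs=300, lr=0.5):
    """Plain stochastic gradient descent on the log loss, no regularisation."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            error = predict_proba(vec, weights, bias) - y
            bias -= lr * error
            for t, v in vec.items():
                weights[t] = weights.get(t, 0.0) - lr * error * v
    return weights, bias

# Invented toy examples (1 = hoax, 0 = real); not taken from the Kaggle dataset.
texts = [
    "Sebarkan segera! Minum air garam dapat menyembuhkan semua penyakit",
    "Viral! Sebarkan pesan ini untuk mendapatkan hadiah pulsa gratis",
    "Pemerintah mengumumkan jadwal resmi pemilihan umum",
    "Bank Indonesia mengumumkan suku bunga acuan bulan ini",
]
labels = [1, 1, 0, 0]

docs = [preprocess(t) for t in texts]
idf = fit_idf(docs)
weights, bias = train([tfidf(d, idf) for d in docs], labels)

def classify(text):
    """Return the estimated probability that `text` is a hoax."""
    return predict_proba(tfidf(preprocess(text), idf), weights, bias)
```

Calling `classify("Sebarkan segera pesan ini hadiah gratis")` returns a probability above 0.5 on this toy data because every surviving token appears only in the hoax examples. A realistic system would train on the full labelled corpus, add an Indonesian stemmer, and evaluate on held-out data.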

How to Cite

Arribathi, A. H., Jungjunan, S. R., & Dwiharyanto, M. A. (2025). Logistic Regression and TF-IDF Models for Early Detection of Indonesian-Language Fake News: Model Regresi Logistik dan TF-IDF untuk Deteksi Dini Berita Palsu Berbahasa Indonesia. Journal Sensi: Strategic of Education in Information System, 11(2), 152-164. https://doi.org/10.33050/sensi.v11i2.4067

Issue

Vol. 11 No. 2 (2025): Journal SENSI

Section

Articles
