Logistic Regression and TF-IDF Models for Early Detection of Indonesian-Language Fake News

Model Regresi Logistik dan TF-IDF untuk Deteksi Dini Berita Palsu Berbahasa Indonesia

Authors

  • Abdul Hamid Arribathi, University of Raharja, Tangerang
  • Sunan Reihan Jungjunan, University of Raharja, Tangerang
  • Muhammad Adrian Dwiharyanto, University of Raharja, Tangerang

DOI:

https://doi.org/10.33050/sensi.v11i2.4067

Keywords:

Hoax detection, Indonesian language, TF-IDF, Logistic Regression, digital literacy

Abstract

The rapid flow of digital content has made the spread of false information, particularly on social media platforms, a significant challenge. Most existing fake news detection systems are designed for English, which makes them less effective when applied to Indonesian text. This study proposes a web-based hoax detection system that combines Term Frequency-Inverse Document Frequency (TF-IDF) features with a Logistic Regression classifier. The dataset, consisting of both real and fake news articles, was obtained from Kaggle and preprocessed in several stages, including normalization, stopword removal, and stemming. The resulting TF-IDF vectors were then used to train a binary classification model. The system accepts either raw text or a URL as input and returns the classification result in real time. Evaluation of the system indicates a high level of accuracy and strong potential for improving public digital literacy. These findings demonstrate a lightweight yet effective approach to mitigating the spread of hoaxes in the Indonesian language.
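
The pipeline described in the abstract (preprocessing, TF-IDF vectorization, binary Logistic Regression) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The stopword list, the four example sentences, the smoothed IDF variant, and the training settings are all assumptions made for illustration. Stemming and URL handling are omitted.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list; a real system would use a full Indonesian list.
STOPWORDS = {"yang", "dan", "di", "ini", "itu", "ke", "dari"}

def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stopwords (stemming omitted)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def fit_idf(docs):
    """One common smoothed IDF variant: ln((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}

def tfidf(doc, idf):
    """L2-normalised TF-IDF vector as a sparse dict; unseen terms are ignored."""
    tf = Counter(t for t in doc if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def predict_proba(vec, weights, bias):
    """Logistic model: probability that the document is a hoax."""
    z = bias + sum(weights.get(t, 0.0) * v for t, v in vec.items())
    return 1.0 / (1.0 + math.exp(-z))

def train(vectors, labels, epochs=200, lr=0.5):
    """Plain stochastic gradient descent on the log-loss, no regularisation."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            err = predict_proba(vec, weights, bias) - y
            bias -= lr * err
            for t, v in vec.items():
                weights[t] = weights.get(t, 0.0) - lr * err * v
    return weights, bias

# Invented toy examples (not from the Kaggle dataset); 1 = hoax, 0 = real.
texts = [
    "Sebarkan segera! Minum air garam menyembuhkan semua penyakit",
    "Viral: vaksin mengandung chip pelacak, sebarkan sebelum dihapus",
    "Pemerintah mengumumkan jadwal vaksinasi di puskesmas",
    "Bank Indonesia mempertahankan suku bunga acuan bulan ini",
]
labels = [1, 1, 0, 0]

docs = [preprocess(t) for t in texts]
idf = fit_idf(docs)
weights, bias = train([tfidf(d, idf) for d in docs], labels)

def classify(text):
    """Return a (label, hoax probability) pair for raw input text."""
    p = predict_proba(tfidf(preprocess(text), idf), weights, bias)
    return ("hoax" if p >= 0.5 else "real"), p

print(classify("Sebarkan segera sebelum dihapus"))
```

With only four training sentences the printed probability carries no meaning; the sketch shows how the stages connect, not how well the approach performs. A production system would more likely rely on an established library for vectorization and training and on a dedicated Indonesian stemmer.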

Published

2025-08-31
