ONLINE NEWS CLASSIFICATION USING MACHINE LEARNING TECHNIQUES
DOI:
https://doi.org/10.31436/iiumej.v22i2.1662
Keywords:
Text Classification, Naive Bayes, News Classification, Support Vector Machine, News Articles
Abstract
The massive rise in web-based online content is pushing businesses to adopt new approaches and resources that support better navigation, processing, and handling of high-dimensional data. On the Internet, 90% of the data is unstructured, and several approaches can translate this data into useful, structured data; classification is one such approach. Classifying knowledge into a well-defined set of groups is therefore important and necessary. As the number of machine-readable documents proliferates, automatic text classification is badly needed to organize these documents. Text labeling, a supervised learning technique, assigns unlabeled documents to predefined classes learned from labeled documents. This paper reviews existing approaches for classifying online news articles and discusses a framework for the automatic classification of online news articles. To achieve high accuracy, different classifiers were tried. Our experimental method achieved 93% accuracy using a Bayesian classifier, and the results are presented in terms of a confusion matrix.
ABSTRAK: The sharp rise in web-based online information today has led businesses to adopt new methods, and supporting resources for the navigation, processing, and management of high-dimensional data are needed. 90% of the data on the Internet is unstructured, and there are various methods by which this data can be translated into useful, more structured data, namely through classification. Classifying knowledge into a good collection of groups is important and necessary. As machine-readable documents grow rapidly, automatic text classification is also badly needed to classify these documents. Unlabeled documents are categorized into predefined classes of labeled documents through text labeling, a supervised learning technique. This study reviews existing approaches for online news articles and discusses a framework for the automatic classification of online news articles. To achieve high accuracy, we used various classifiers. This experimental method achieved 93% accuracy using a Bayesian classifier, and the data are presented in terms of a confusion matrix.
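The abstract describes supervised classification of news articles with a Bayesian classifier, evaluated through a confusion matrix. A minimal sketch of that general idea follows, assuming a multinomial Naive Bayes model with add-one smoothing and a whitespace tokenizer. The class name `NaiveBayes`, the helper functions, and the toy headlines are all hypothetical illustrations; this is not the paper's implementation, dataset, or feature pipeline, and it does not reproduce the reported 93% figure.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word frequencies per class
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = tokenize(doc)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            # Start from the log prior of the class.
            score = math.log(self.class_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokenize(doc):
                if tok not in self.vocab:
                    continue  # words never seen in training are ignored
                count = self.word_counts[label][tok]
                # Smoothed log likelihood of the word given the class.
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def confusion_matrix(true_labels, predicted_labels, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix


# Toy data, invented for illustration only.
train_docs = [
    "team wins the football match",
    "striker scores late goal in cup final",
    "stocks fall as markets react to inflation",
    "bank raises interest rates again",
]
train_labels = ["sport", "sport", "business", "business"]
test_docs = ["goal decides the match", "markets fear higher interest rates"]
test_labels = ["sport", "business"]

model = NaiveBayes().fit(train_docs, train_labels)
predictions = [model.predict(d) for d in test_docs]
classes = ["business", "sport"]
matrix = confusion_matrix(test_labels, predictions, classes)
# Accuracy is the sum of the diagonal divided by the number of test documents.
accuracy = sum(matrix[i][i] for i in range(len(classes))) / len(test_docs)
print(predictions, matrix, accuracy)
```

The diagonal of the matrix counts correctly classified articles per class, so accuracy follows directly from it, while the off-diagonal cells show which categories are confused with each other. A realistic setup would likely add stop-word removal, stemming, and a held-out test split, none of which are shown here.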
License
Copyright (c) 2021 IIUM Press

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.