ONLINE NEWS CLASSIFICATION USING MACHINE LEARNING TECHNIQUES

Authors

  • Jeelani Ahmed, Maulana Azad National Urdu University, Hyderabad
  • Muqeem Ahmed, School of Technology, Maulana Azad National Urdu University, Hyderabad

DOI:

https://doi.org/10.31436/iiumej.v22i2.1662

Keywords:

Text Classification, Naive Bayes, News Classification, Support Vector Machine, News Articles

Abstract

A massive rise in web-based content is pushing businesses to adopt new approaches and resources that support better navigation, processing, and handling of high-dimensional data. About 90% of the data on the Internet is unstructured, and there are several approaches for turning it into useful, structured data; classification is one of them. Classifying information into a well-defined set of groups is significant and necessary. As the number of machine-readable documents proliferates, automatic text classification is badly needed to organize them. In text classification, a supervised learning technique, unlabeled documents are assigned to predefined classes learned from labeled documents. This paper reviews some existing approaches for classifying online news articles and discusses a framework for their automatic classification. Different classifiers were tried to achieve high accuracy. Our experimental method achieved 93% accuracy using a Bayesian classifier, and the results are presented in terms of a confusion matrix.

ABSTRAK (translated from Malay): The sharp rise in web-based online information today has led businesses to adopt new methods, and supporting resources for navigating, processing, and managing high-dimensional data are needed. 90% of the data on the Internet is unstructured, and there are various methods by which this data can be translated into useful, more structured data, namely through classification. Classifying knowledge into a good collection of groups is important and necessary. As machine-readable documents grow rapidly, automatic text classification is also badly needed to classify these documents. Unlabeled documents are categorized into predefined classes of labeled documents through text labeling, a supervised learning technique. This study reviews existing approaches for online news articles and discusses a framework for the automatic classification of online news articles. To achieve high accuracy, we used various classifiers. The experimental method achieved 93% accuracy using a Bayesian classifier, and the data are presented in terms of a confusion matrix.
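
The workflow the abstract outlines (train a Bayesian classifier on labeled articles, assign unlabeled ones to predefined classes, and report a confusion matrix) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a multinomial Naive Bayes model with Laplace smoothing and a whitespace-style tokenizer, and the four training headlines, the two test headlines, and the names `tokenize`, `NaiveBayes`, and `confusion_matrix` are all hypothetical, invented here for illustration. It makes no attempt to reproduce the reported 93% accuracy.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on non-letters (a deliberately simple tokenizer)."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokenize(doc):
                if tok in self.vocab:              # skip unseen words
                    score += math.log((counts[tok] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


def confusion_matrix(true, pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true, pred):
        matrix[index[t]][index[p]] += 1
    return matrix


# Hypothetical toy data, invented for this sketch only.
train_docs = [
    "the team won the football match",
    "striker scores twice in cup final",
    "stock markets fall as banks report losses",
    "shares rise after strong company earnings",
]
train_labels = ["sport", "sport", "business", "business"]
test_docs = ["football team scores in final", "banks report strong earnings"]
test_labels = ["sport", "business"]

model = NaiveBayes().fit(train_docs, train_labels)
preds = [model.predict(d) for d in test_docs]
classes = ["business", "sport"]
matrix = confusion_matrix(test_labels, preds, classes)
accuracy = sum(matrix[i][i] for i in range(len(classes))) / len(test_labels)
print(preds, matrix, accuracy)
```

Accuracy here is the sum of the matrix diagonal divided by the number of test documents, which is how a confusion matrix is usually reduced to a single score; per-class precision and recall can be read from its columns and rows in the same way.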


Published

2021-07-04

How to Cite

Ahmed, J., & Ahmed, M. (2021). ONLINE NEWS CLASSIFICATION USING MACHINE LEARNING TECHNIQUES. IIUM Engineering Journal, 22(2), 210–225. https://doi.org/10.31436/iiumej.v22i2.1662

Issue

Section

Electrical, Computer and Communications Engineering