DEVELOPING A PARALLEL CLASSIFIER FOR MINING IN BIG DATA SETS
DOI: https://doi.org/10.31436/iiumej.v22i2.1541

Keywords: Data mining, Big data, Decision tree, Parallel classifier, SPRINT classifier

Abstract
Data mining is the extraction of useful information and patterns from vast amounts of data, and it has become one of the most important research topics today. Massive amounts of data are generated and stored every day, and these data carry useful information in many fields, attracting the attention of programmers and engineers. The decision tree is one of the primary classification algorithms in data mining. Decision tree techniques have several advantages but also some drawbacks; a major one is the requirement that the training data reside in main memory. SPRINT is a decision tree builder classifier that was proposed to remove this restriction. In this paper, we develop a new parallel decision tree classifier that builds on SPRINT. Our experimental results show considerable improvements in runtime and memory requirements compared with the SPRINT classifier. The proposed algorithm can be implemented in both serial and parallel environments and can handle big data.
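As background on the split-evaluation step that SPRINT-style classifiers perform, the sketch below shows a single-pass Gini-index scan over one pre-sorted attribute list (the data structure SPRINT introduced so that the full training set need not reside in main memory). This is an illustrative sketch under our own assumptions, not the paper's implementation; the function names and sample records are hypothetical.

```python
# Illustrative sketch: SPRINT-style best-split search over one
# pre-sorted numeric attribute list, scored by the Gini index.

def gini(counts):
    """Gini impurity of a class-count dictionary."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(attribute_list):
    """attribute_list: [(value, class_label), ...] sorted by value.
    Scans once, maintaining below/above class counts, and returns
    (weighted Gini, threshold) for the best binary split."""
    n = len(attribute_list)
    above = {}
    for _, label in attribute_list:
        above[label] = above.get(label, 0) + 1
    below = {}
    best = (float("inf"), None)
    for i in range(n - 1):
        value, label = attribute_list[i]
        below[label] = below.get(label, 0) + 1
        above[label] -= 1
        # Candidate threshold: midpoint between consecutive distinct values.
        next_value = attribute_list[i + 1][0]
        if next_value == value:
            continue
        n_below, n_above = i + 1, n - i - 1
        score = (n_below * gini(below) + n_above * gini(above)) / n
        if score < best[0]:
            best = (score, (value + next_value) / 2)
    return best

records = [(23, "no"), (30, "no"), (40, "yes"), (55, "yes")]
print(best_split(records))  # → (0.0, 35.0)
```

In SPRINT this scan runs independently over each attribute list, which is what makes the algorithm amenable to disk-resident data and to parallelization across attributes or data partitions.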
References
Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth (1996). From data mining to knowledge discovery in databases. AI magazine. 17(3):37-37. https://doi.org/10.1609/aimag.v17i3.1230
Berry, M.J. and G.S. Linoff (2004). Data mining techniques: for marketing, sales, and customer relationship management. John Wiley & Sons.
Tan, P.-N., M. Steinbach, and V. Kumar (2016). Introduction to data mining. Pearson Education India.
Aggarwal, C.C. (2014). An Introduction to Data Classification. Data Classification: Algorithms and Applications. Chapman and Hall/CRC. DOI: https://doi.org/10.1201/b17320
Zaki, M.J. and W. Meira (2014). Data mining and analysis: fundamental concepts and algorithms. Cambridge University Press. DOI: https://doi.org/10.1017/CBO9780511810114
Quinlan, J.R. (1986). Induction of decision trees. Machine Learning. 1(1):81-106. https://doi.org/10.1007/BF00116251
Rennie, J.D., et al (2003). Tackling the poor assumptions of naive Bayes text classifiers. In Proceedings of the Twentieth International Conference on Machine Learning: August 21-24, 2003; Washington, DC. https://dl.acm.org/doi/10.5555/3041838.3041916
Hagan, M., et al (2014). Neural Network Design. 2nd Edition. Oklahoma: Martin Hagan.
Jain, A.K. and R.C. Dubes (1988). Algorithms for clustering data. Prentice-Hall, Inc.
Zhang, C. and S. Zhang (2003). Association rule mining: models and algorithms. Springer.
Aggarwal, C (2014). Data Classification: Algorithms and Applications, ser. Frontiers in physics. Chapman and Hall/CRC.
Esposito, F., et al (1997). A comparative analysis of methods for pruning decision trees. IEEE Transactions on Pattern Analysis and Machine Intelligence. 19(5):476-491. https://doi.org/10.1109/34.589207
Mingers, J. (1989). An empirical comparison of pruning methods for decision tree induction. Machine Learning. 4(2):227-243. https://doi.org/10.1007/BF00116837
Rokach, L. and O.Z. Maimon (2008). Data mining with decision trees: theory and applications. World scientific. DOI: https://doi.org/10.1142/6604
Kass, G.V. (1980). An exploratory technique for investigating large quantities of categorical data. Journal of the Royal Statistical Society: Series C (Applied Statistics). 29(2):119-127. https://doi.org/10.2307/2986296
Breiman, L., et al (1984). Classification and regression trees. CRC press.
Hunt, E.B , J. Marin, and P.J. Stone (1966). Experiments in induction. Academic Press.
Friedman, J.H. (1991). Multivariate adaptive regression splines. The Annals of Statistics. 19(1):1-67. https://doi.org/10.1214/aos/1176347963
Mehta, M., R. Agrawal, and J. Rissanen (1996). SLIQ: A fast scalable classifier for data mining. In Proceedings of the International Conference on Extending Database Technology: 25-29 March 1996; Avignon, France; pp 18-32. https://doi.org/10.1007/BFb0014141
Shafer, J., R. Agrawal, and M. Mehta (1996). SPRINT: A scalable parallel classifier for data mining. In Proceedings of the 22nd VLDB Conference. 3-6 September 1996; Mumbai(Bombay), India; pp 544-555.
Joshi, M.V., G. Karypis, and V. Kumar (1998). ScalParC: A new scalable and efficient parallel classification algorithm for mining large datasets. In Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing. 30 March-3 April 1998; Orlando, FL, USA. https://doi.org/10.1109/IPPS.1998.669983
Bowyer, K.W., et al (2000). A parallel decision tree builder for mining very large visualization datasets. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics. 8-11 October 2000; Nashville, TN, USA. https://doi.org/10.1109/ICSMC.2000.886388
Ranka, S. and V. Singh (1998). CLOUDS: A decision tree classifier for large datasets. In Proceedings of the 4th Knowledge Discovery and Data Mining Conference. 27 – 31 August 1998; New York, USA. https://dl.acm.org/doi/10.5555/3000292.3000294.
Liaw, A. and M. Wiener (2002). Classification and regression by randomForest. R News. 2(3):18-22.
Rastogi, R. and K. Shim (2000). PUBLIC: A decision tree classifier that integrates building and pruning. Data Mining and Knowledge Discovery. 4(4):315-344. DOI: https://doi.org/10.1023/A:1009887311454
Bahaghighat, M., et al (2020). Estimation of wind turbine angular velocity remotely found on video mining and convolutional neural network. Applied Sciences. 10(10):3544. https://doi.org/10.3390/app10103544
Bahaghighat, M., et al (2020). ConvLSTMConv network: A deep learning approach for sentiment analysis in cloud computing. Journal of Cloud Computing: Advances, Systems and Applications. 9(16). https://doi.org/10.1186/s13677-020-00162-1
Abedini, F., et al (2019). Wind turbine tower detection using feature descriptors and deep learning. Facta Universitatis, Series: Electronics and Energetics. 33(1):133-153. https://doi.org/10.2298/FUEE2001133A
Bahaghighat, M., et al (2019). Vision inspection of bottle caps in drink factories using convolutional neural networks. In Proceedings of the IEEE 15th International Conference on Intelligent Computer Communication and Processing (ICCP). 5-7 September 2019; Cluj-Napoca, Romania. https://doi.org/10.1109/ICCP48234.2019.8959737
Bahaghighat, M., et al (2019). A machine learning based approach for counting blister cards within drug packages. IEEE Access. 7:83785-83796. https://doi.org/10.1109/ACCESS.2019.2924445
License
Copyright (c) 2020 IIUM Press
This work is licensed under a Creative Commons Attribution 4.0 International License.