DEVELOPING A PARALLEL CLASSIFIER FOR MINING IN BIG DATA SETS
DOI: https://doi.org/10.31436/iiumej.v22i2.1541

Keywords: Data mining, Big data, Decision tree, Parallel classifier, SPRINT classifier

Abstract
Data mining is the extraction of useful information and patterns from vast amounts of data, and it has become one of the most important research topics today. Massive amounts of data are generated and stored every day, and these data carry useful information in many fields, attracting the attention of programmers and engineers. The decision tree is one of the primary classification algorithms in data mining. Decision tree techniques have several advantages but also some drawbacks; a major one is the requirement that the training data reside in main memory. SPRINT is a decision tree builder classifier that was proposed to remove this restriction. In this paper, we develop a new parallel decision tree classifier that builds on SPRINT. Our experimental results show considerable improvements in runtime and memory requirements compared with the SPRINT classifier. The proposed algorithm can be implemented in both serial and parallel environments and can handle big data.
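As background on the split-evaluation step that SPRINT-style classifiers perform, the sketch below shows a single-pass Gini-index scan over one pre-sorted attribute list (the data structure SPRINT introduced so that the full training set need not reside in main memory). This is an illustrative sketch under our own assumptions, not the paper's implementation; the function names and sample records are hypothetical.

```python
# Illustrative sketch: SPRINT-style best-split search over one
# pre-sorted numeric attribute list, scored by the Gini index.

def gini(counts):
    """Gini impurity of a class-count dictionary."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(attribute_list):
    """attribute_list: [(value, class_label), ...] sorted by value.
    Scans once, maintaining below/above class counts, and returns
    (weighted Gini, threshold) for the best binary split."""
    n = len(attribute_list)
    above = {}
    for _, label in attribute_list:
        above[label] = above.get(label, 0) + 1
    below = {}
    best = (float("inf"), None)
    for i in range(n - 1):
        value, label = attribute_list[i]
        below[label] = below.get(label, 0) + 1
        above[label] -= 1
        # Candidate threshold: midpoint between consecutive distinct values.
        next_value = attribute_list[i + 1][0]
        if next_value == value:
            continue
        n_below, n_above = i + 1, n - i - 1
        score = (n_below * gini(below) + n_above * gini(above)) / n
        if score < best[0]:
            best = (score, (value + next_value) / 2)
    return best

records = [(23, "no"), (30, "no"), (40, "yes"), (55, "yes")]
print(best_split(records))  # → (0.0, 35.0)
```

In SPRINT this scan runs independently over each attribute list, which is what makes the algorithm amenable to disk-resident data and to parallelization across attributes or data partitions.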
References
Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth (1996). From data mining to knowledge discovery in databases. AI magazine. 17(3):37-37. https://doi.org/10.1609/aimag.v17i3.1230
Berry, M.J. and G.S. Linoff (2004). Data mining techniques: for marketing, sales, and customer relationship management. John Wiley & Sons.
Tan, P.-N., M. Steinbach, and V. Kumar (2016). Introduction to data mining. Pearson Education India.
Aggarwal, C.C. (2014). An Introduction to Data Classification. Data Classification: Algorithms and Applications. Chapman and Hall/CRC. DOI: https://doi.org/10.1201/b17320
Zaki, M.J. and W. Meira (2014). Data mining and analysis: fundamental concepts and algorithms. Cambridge University Press. DOI: https://doi.org/10.1017/CBO9780511810114
Quinlan, J.R. (1986). Induction of decision trees. Machine Learning. 1(1):81-106. https://doi.org/10.1007/BF00116251
Rennie, J.D., et al (2003). Tackling the poor assumptions of naive Bayes text classifiers. In Proceedings of the Twentieth International Conference on Machine Learning: August 21-24, 2003; Washington, DC. https://dl.acm.org/doi/10.5555/3041838.3041916
Hagan, M., et al (2014). Neural Network Design. 2nd Edition. Oklahoma: Martin Hagan.
Jain, A.K. and R.C. Dubes (1988). Algorithms for clustering data. Prentice-Hall, Inc.
Zhang, C. and S. Zhang (2003). Association rule mining: models and algorithms. Springer.
Aggarwal, C (2014). Data Classification: Algorithms and Applications, ser. Frontiers in physics. Chapman and Hall/CRC.
Esposito, F., et al (1997). A comparative analysis of methods for pruning decision trees. IEEE Transactions on Pattern Analysis and Machine Intelligence. 19(5):476-491. https://doi.org/10.1109/34.589207
Mingers, J. (1989). An empirical comparison of pruning methods for decision tree induction. Machine Learning. 4(2):227-243. https://doi.org/10.1007/BF00116837
Rokach, L. and O.Z. Maimon (2008). Data mining with decision trees: theory and applications. World scientific. DOI: https://doi.org/10.1142/6604
Kass, G.V. (1980). An exploratory technique for investigating large quantities of categorical data. Journal of the Royal Statistical Society: Series C (Applied Statistics). 29(2):119-127. https://doi.org/10.2307/2986296
Breiman, L., et al (1984). Classification and regression trees. CRC press.
Hunt, E.B , J. Marin, and P.J. Stone (1966). Experiments in induction. Academic Press.
Friedman, J.H. (1991). Multivariate adaptive regression splines. The Annals of Statistics. 19(1):1-67. https://doi.org/10.1214/aos/1176347963
Mehta, M., R. Agrawal, and J. Rissanen (1996). SLIQ: A fast scalable classifier for data mining. In Proceedings of the International Conference on Extending Database Technology: 25-29 March 1996; Avignon, France; pp 18-32. https://doi.org/10.1007/BFb0014141
Shafer, J., R. Agrawal, and M. Mehta (1996). SPRINT: A scalable parallel classifier for data mining. In Proceedings of the 22nd VLDB Conference. 3-6 September 1996; Mumbai(Bombay), India; pp 544-555.
Joshi, M.V., G. Karypis, and V. Kumar (1998). ScalParC: A new scalable and efficient parallel classification algorithm for mining large datasets. In Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing. 30 March-3 April 1998; Orlando, FL, USA. https://doi.org/10.1109/IPPS.1998.669983
Bowyer, K.W., et al (2000). A parallel decision tree builder for mining very large visualization datasets. In Proceedings of the IEEE International Conference on Systems, Man and Cybernetics. 8-11 October 2000; Nashville, TN, USA. https://doi.org/10.1109/ICSMC.2000.886388
Ranka, S. and V. Singh (1998). CLOUDS: A decision tree classifier for large datasets. In Proceedings of the 4th Knowledge Discovery and Data Mining Conference. 27 – 31 August 1998; New York, USA. https://dl.acm.org/doi/10.5555/3000292.3000294.
Liaw, A. and M. Wiener (2002). Classification and regression by randomForest. R News. 2(3):18-22.
Rastogi, R. and K. Shim (2000). PUBLIC: A decision tree classifier that integrates building and pruning. Data Mining and Knowledge Discovery. 4(4):315-344. DOI: https://doi.org/10.1023/A:1009887311454
Bahaghighat, M., et al (2020). Estimation of wind turbine angular velocity remotely found on video mining and convolutional neural network. Applied Sciences. 10(10):3544. https://doi.org/10.3390/app10103544
Bahaghighat, M., et al (2020). ConvLSTMConv network: A deep learning approach for sentiment analysis in cloud computing. Journal of Cloud Computing: Advances, Systems and Applications. 9(16). https://doi.org/10.1186/s13677-020-00162-1
Abedini, F., et al (2019). Wind turbine tower detection using feature descriptors and deep learning. Facta Universitatis, Series: Electronics and Energetics. 33(1):133-153. https://doi.org/10.2298/FUEE2001133A
Bahaghighat, M., et al (2019). Vision inspection of bottle caps in drink factories using convolutional neural networks. In Proceedings of the IEEE 15th International Conference on Intelligent Computer Communication and Processing (ICCP). 5-7 September 2019; Cluj-Napoca, Romania. https://doi.org/10.1109/ICCP48234.2019.8959737
Bahaghighat, M., et al (2019). A machine learning based approach for counting blister cards within drug packages. IEEE Access. 7:83785-83796. https://doi.org/10.1109/ACCESS.2019.2924445
License
Copyright (c) 2020 IIUM Press
This work is licensed under a Creative Commons Attribution 4.0 International License.