A NOVEL DIMENSIONALITY REDUCTION APPROACH TO IMPROVE MICROARRAY DATA CLASSIFICATION

Mohammed Hamim; Ismail El Mouden; Mounir Ouzir; Hicham Moutachaouik; Mustapha Hain

doi:10.31436/iiumej.v22i1.1447

Authors

Mohammed Hamim I2SI2E Laboratory, ENSAM-Casablanca https://orcid.org/0000-0001-7666-5760
Ismail El Mouden EVMS-Sentara Healthcare Analytics and Delivery Science Institute, Eastern Virginia Medical School, Norfolk, VA, USA https://orcid.org/0000-0001-7702-2564
Mounir Ouzir Group of Research in Physiology and Physiopathology, Department of Biology, Faculty of Science, University Mohammed V, Rabat, Morocco https://orcid.org/0000-0001-6835-9755
Hicham Moutachaouik I2SI2E Laboratory, ENSAM-Casablanca, University Hassan II, Casablanca, Morocco https://orcid.org/0000-0003-1566-104X
Mustapha Hain I2SI2E Laboratory, ENSAM-Casablanca, University Hassan II, Casablanca, Morocco

DOI:

https://doi.org/10.31436/iiumej.v22i1.1447

Keywords:

Gene Selection, Metaheuristic-Ant Colony Optimization, Feature Extraction, Pattern Recognition, Microarray Data Analysis

Abstract

Predicting and diagnosing cancer tumors at an early stage has become a necessity in cancer research because it increases the chances of successful treatment. Recently, DNA microarray technology has become a powerful tool for cancer identification, as it can analyze the expression levels of a very large number of different genes simultaneously. In microarray data, the large number of genes relative to the small number of records may degrade prediction performance. To handle this "curse of dimensionality" constraint of microarray datasets while improving cancer identification performance, a dimensionality reduction phase is necessary. In this paper, we propose a framework that combines dimensionality reduction methods and machine learning algorithms in order to achieve the best cancer prediction performance on different microarray datasets. In the dimensionality reduction phase, a combination of feature selection and feature extraction techniques is proposed. Pearson and Ant Colony Optimization were used to select the most important genes. Principal Component Analysis and Kernel Principal Component Analysis were then used to transform the selected genes, linearly and non-linearly respectively, into a new reduced space. In the cancer identification phase, we applied four algorithms: C5.0, Logistic Regression, Artificial Neural Network, and Support Vector Machine. Experimental results demonstrated that the framework performs effectively and competitively compared to state-of-the-art methods.

ABSTRAK: Early-stage cancer tumor prediction and diagnosis has become a necessity in cancer research because it improves the chances of treatment success. Recently, DNA microarray technology has become a powerful tool for identifying cancer, as it can analyze the expression levels of a large and varied number of genes simultaneously. In microarray data, the large number of genes relative to the small number of records can affect prediction performance. A dimensionality reduction phase is needed to handle the "curse of dimensionality" constraint of microarray datasets while also strengthening cancer identification. This study proposes a framework that combines dimensionality reduction methods and machine learning algorithms to achieve the best cancer prediction performance using various microarray datasets. In the dimensionality reduction phase, a combination of feature selection and feature extraction techniques is proposed: Pearson and Ant Colony Optimization to select the most important genes, and Principal Component Analysis and Kernel Principal Component Analysis to transform the selected genes, linearly and non-linearly, into a new reduced space. In the cancer identification phase, this study proposes four algorithms: C5.0, Logistic Regression, Artificial Neural Network, and Support Vector Machine. The findings show that the framework is effective and competitive compared with current methods.
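
As a rough illustration of the filter step in the gene selection phase, the sketch below ranks genes by the absolute value of their correlation with the class label and keeps the top k. It assumes that "Pearson" in the abstract refers to the Pearson correlation coefficient, which the abstract does not spell out. The function names `pearson` and `top_k_genes`, the toy expression matrix, and the choice of k are all hypothetical and are not taken from the paper. The Ant Colony Optimization search, the PCA and Kernel PCA projections, and the classifiers are not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def top_k_genes(samples, labels, k):
    """Return the indices of the k genes with the largest |r| against the labels."""
    n_genes = len(samples[0])
    scores = []
    for g in range(n_genes):
        column = [row[g] for row in samples]
        scores.append((abs(pearson(column, labels)), g))
    # Sort by descending score; break ties by gene index for determinism.
    scores.sort(key=lambda t: (-t[0], t[1]))
    return [g for _, g in scores[:k]]


# Toy data (hypothetical): 6 samples x 4 genes; labels 0 = normal, 1 = tumor.
samples = [
    [1.0, 5.0, 0.2, 3.0],
    [1.2, 4.0, 0.1, 3.0],
    [0.9, 6.0, 0.3, 3.0],
    [3.1, 5.5, 0.9, 3.0],
    [2.9, 4.5, 1.1, 3.0],
    [3.3, 5.0, 1.0, 3.0],
]
labels = [0, 0, 0, 1, 1, 1]

selected = top_k_genes(samples, labels, k=2)
# Keep only the selected gene columns for the later stages of the pipeline.
reduced = [[row[g] for g in selected] for row in samples]
print(selected)
```

In this toy matrix, genes 0 and 2 track the label while gene 1 is uncorrelated with it and gene 3 is constant, so the filter keeps the first pair. A univariate filter of this kind is cheap enough to run over thousands of genes, which is one plausible reason to place it ahead of a more expensive search such as Ant Colony Optimization.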

