A NOVEL DIMENSIONALITY REDUCTION APPROACH TO IMPROVE MICROARRAY DATA CLASSIFICATION

Cancer tumor prediction and diagnosis at an early stage has become a necessity in cancer research, as it increases the chances of treatment success. Recently, DNA microarray technology has become a powerful tool for cancer identification, as it can analyze the expression levels of a huge number of genes simultaneously. In microarray data, the large number of genes relative to the few available records may affect prediction performance. In order to handle this "curse of dimensionality" constraint of microarray datasets while improving cancer identification performance, a dimensional reduction phase is necessary. In this paper, we propose a framework that combines dimensional reduction methods and machine learning algorithms in order to achieve the best cancer prediction performance on different microarray datasets. In the dimensional reduction phase, a combination of feature selection and feature extraction techniques is proposed. Pearson's correlation and Ant Colony Optimization were used to select the most important genes. Principal Component Analysis and Kernel Principal Component Analysis were used to linearly and non-linearly transform the selected genes to a new reduced space. In the cancer identification phase, four algorithms were evaluated: C5.0, Logistic Regression, Artificial Neural Network, and Support Vector Machine. Experimental results demonstrated that the framework performs effectively and competitively compared to state-of-the-art methods.


INTRODUCTION
According to a recent publication by the World Health Organization (WHO) in 2018, cancer is considered the second leading cause of death for human beings. Since early diagnosis is a mandatory and crucial step in cancer treatment, increasing the accuracy of cancer diagnosis may require further measurements combined with other clinical tests. With the development of machine learning techniques and microarray technology, the analysis of DNA microarray data brings a great opportunity for cancer diagnosis. However, the presence of a large number of irrelevant or redundant genes (features) in gene expression data may increase the size of the search space, which makes pattern detection more difficult and complicates capturing the necessary rules for classification [1]. To overcome this "curse of dimensionality", a dimensional reduction process is strongly recommended. Dimensional reduction refers to a process that removes redundant and noisy features from the data, thus maximizing prediction performance. Dimensional reduction can be divided into feature selection (FS) and feature extraction (FE). FE methods create a subset of new features by combining existing features. The new features are low-dimensional features with the same or better performance in terms of prediction accuracy. In the literature, proposed FE methods for cancer classification using gene expression data include Principal Component Analysis (PCA) [2] and kernel PCA [3]. On the other hand, the FS process focuses only on the relevant features in the dataset by removing any redundant, irrelevant, or noisy features, which leads to better learning performance. The frequently used FS methods are divided into filter and wrapper approaches. In the filter approach, features are scored based on statistical criteria such as Pearson's correlation coefficient (P) [4]. In the wrapper approach [5], FS is combined with classification algorithms.
Examples of wrapper algorithms include the Ant Colony Optimization (ACO) algorithm, the Genetic Algorithm (GA), and others. When the number of features becomes very large, filter methods are usually chosen due to their computational efficiency and simplicity [6]. In this paper, in addition to the Pearson correlation-based filter, a hybrid feature selection approach is also proposed that takes advantage of both filter and wrapper methods. The proposed hybrid approach combines correlation-based feature selection with the ACO algorithm.
In this study, our aim is to improve the performance of cancer tumor modeling using a framework that combines FS and FE as dimension reduction methods with machine learning algorithms.

RELATED WORKS
The importance of classifying cancer patients into high- or low-risk groups has led to the study of machine learning methods for this task. Different strategies exist that focus on modifying the data for a better fit to a specific machine learning method; among them are dimensionality reduction, FS, and FE [7]. Several DNA microarray experiments have demonstrated the power of data mining methods over clinical criteria for cancer diagnosis [8,9]. These studies accentuate the improvement of prediction performance based on gene expression data by combining dimensional reduction techniques with machine learning algorithms.
To improve prostate cancer performance modeling and mining, Hicham et al. proposed a new framework combining feature selection using Pearson's correlation and feature extraction using PCA in conjunction with machine learning algorithms. The most important result achieved in this study was obtained by the Pearson-PCA-C5.0 model, with 94.05% classification accuracy and five selected features [10]. Kar et al. proposed a combination of a filter method based on the t-test and a wrapper method based on particle swarm optimization (PSO) to find the most relevant genes in the SRBCT microarray dataset. The study achieved 100% accuracy with 14 selected genes [11].
Atiyeh and Mohammad implemented an innovative feature selection approach based on cooperative game theory and qualitative mutual information (QMT). The classification accuracy on 11 microarray datasets, including Leukemia1, SRBCT, Lung, and Prostate cancer, shows that the proposed approach improves both accuracy and stability compared to other methods [12]. Chandra proposed an efficient feature selection technique that removes the drawbacks of [13] by taking into account the redundancy between features. The study shows that the classification accuracy obtained using the proposed algorithm, Inter Feature Effective Range Overlap (IFERO), is much superior for many cancers compared to other feature selection algorithms. The proposed technique was applied to 8 benchmark cancer datasets [14].
Shun Guo et al. formulated the feature selection problem as an optimization one based on a newly defined linear discriminant analysis criterion. The experiment was applied to 10 publicly available microarray datasets, and the results show that the proposed gene selection is an effective method for improving the accuracy of tumor classification [15].
The present paper aims to improve the classification performance for four benchmark cancer datasets. For this purpose, and in order to handle the curse-of-dimensionality problem of microarray datasets, we propose a framework that combines FS and FE methods in conjunction with machine learning algorithms. Figure 1 summarizes the main steps of our proposed framework, which is based on feature selection using filter and hybrid approaches, FE using linear and non-linear PCA, and cancer identification (classification) using Logistic Regression (LR), the C5.0 decision tree algorithm, the Support Vector Machine (SVM), and an Artificial Neural Network (ANN). The main structure of the proposed framework is described in Algorithm 1.

Feature Selection Methods
Feature selection, or gene selection in the context of microarray data analysis, is a useful technique that can reduce dimensionality by removing any redundant, irrelevant, or noisy genes, which can improve classification performance and reduce the cost of computation [16]. As shown in Fig. 2, the feature selection process can be formulated as follows: given an original set F = {f_1, f_2, ⋯, f_p} of p features, find the subset consisting of the k features (where k ≪ p) that are the most informative.
The proposed framework in the present paper implements two feature selection techniques: the filter method based on statistical tools, and the hybrid method that combines the filter approach with ACO.

Filter Method using Pearson's Correlation Coefficient
Because they act independently of any classification process, filters are considered faster than the wrapper approach, which is why this model is frequently used when working with a large number of features [17]. To measure feature relevance using filter methods, statistical techniques such as Pearson's correlation coefficient, Spearman's rank correlation, Pearson's chi-square, or Cramer's V are applied to each feature.
In the present paper, Pearson's correlation in Eq. (1), denoted by r(X, Y), was applied to recognize features (X) showing a strong linear relationship with the target (Y):

r(X, Y) = Σ_{i=1}^{N} (x_i − X̄)(y_i − Ȳ) / √(Σ_{i=1}^{N} (x_i − X̄)² · Σ_{i=1}^{N} (y_i − Ȳ)²)     (1)

where N is the total number of samples in the training set, X̄ and Ȳ are, respectively, the overall means of X and Y, and x_i and y_i are, respectively, the i-th observations of X and Y. The value of r(X, Y) always lies between ±1, where ±1 indicates a perfect linear relationship between X and Y, and 0 indicates no relationship between them.
Then, the relevance value of each feature X is measured as (1 − p-value) × 100%, where the p-value, based on the t-statistic with df = N − 2 degrees of freedom, is computed using Eq. (2):

p-value = 2 × P(T(df) ≥ |t|)     (2)

with t = r√(df / (1 − r²)), where T(df) is a random variable that follows a Student's t-distribution with df degrees of freedom. In this study, all features (genes) in the training set with relevance greater than 95% were selected.
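As an illustration, the filter step above can be sketched as follows. This is a minimal sketch, not the paper's implementation: the helper name `pearson_filter` is hypothetical, and `scipy.stats.pearsonr` is used to obtain the correlation and its two-sided p-value, keeping genes whose relevance (1 − p-value) × 100% exceeds 95%.

```python
import numpy as np
from scipy.stats import pearsonr

def pearson_filter(X, y, relevance=0.95):
    """Keep the column indices of genes whose Pearson relevance
    (1 - p_value) exceeds the threshold (0.95 == p_value < 0.05)."""
    keep = []
    for j in range(X.shape[1]):
        r, p = pearsonr(X[:, j], y)
        if (1.0 - p) > relevance:
            keep.append(j)
    return keep

# toy data: gene 0 tracks the class label, gene 1 is pure noise
rng = np.random.default_rng(0)
y = np.array([0.0, 1.0] * 20)
X = np.column_stack([y + 0.05 * rng.standard_normal(40),
                     rng.standard_normal(40)])
print(pearson_filter(X, y))
```

The strongly correlated gene (index 0) survives the filter; an uninformative gene is dropped with 95% probability.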

Ant Colony Optimization Hybrid Approach
Hybrid methods attempt to combine the strengths of two feature selection methods. The most frequently used combination is a filter with a wrapper approach [17]. The present framework proposes a new hybrid technique (Fig. 3) that combines Pearson's correlation and Ant Colony Optimization (PACO). The filter step in the framework consists of reducing the number of genes by removing non-informative genes from the original training set; the number of pre-selected genes is then given to the ACO to select the optimal subset from the original training set.
Proposed by Marco Dorigo [18], ACO is a nature-inspired metaheuristic approach. The idea behind ACO is to represent the search space of a problem in the form of a graph; the solution of the problem is then to find the optimal path in this graph using artificial ants. As in real ant colonies, each ant deposits a pheromone trail at the same rate on the components of the graph that it chooses to cross. The path an ant chooses to cross is usually based on the accumulated pheromone trail. Thus, the accumulated pheromone is considered an indicator of the quality of the chosen path, which can attract ants in the next iterations to the corresponding areas in the search space [19].
ACO has been a powerful tool in many optimization problems [20][21][22], and for many reasons it was recently adopted as a powerful tool for gene selection [23][24][25]. In feature selection using ACO, each node in the graph is viewed as a feature (gene), and edges between nodes (features) represent the choice of the next node to be selected. Thus, searching for the optimal subset of features amounts to finding the optimal path in the graph until a stopping criterion is satisfied. The problem of feature selection using PACO can be formulated as follows: given an original training set F = {f_1, f_2, ⋯, f_p} of p features, find the subset consisting of k features (where k ≪ p) once a maximum number of iterations is reached. According to Fig. 3, before starting any iteration, the number k of genes in the optimal subset to select is initialized using the Pearson correlation-based filter, and the amount of pheromone in the search space is initialized to a constant value. Then, at the start of each iteration t, each ant starts at a randomly selected feature. To select (visit) the next feature (node) from the unselected ones, each ant must respect the probabilistic "transition rule" [19] of Eq. (3):

P_j^k(t) = τ_j(t) / Σ_{u ∈ J^k} τ_u(t),  if j ∈ J^k     (3)

where J^k is the set of features that have not yet been visited by ant k, and τ_j(t) is the amount of pheromone on feature j at iteration t. The subset constructed by the k-th ant is then evaluated using an SVM classifier, and the estimated Mean Square Error (MSE) of the classification results decides whether the current subset is the best one. The MSE is computed by applying a stratified 5-fold cross-validation method: the constructed subset is split into five folds, and each time, one of the five folds is used for testing while the remaining folds form the training data. Then the average MSE over the five trials is calculated using Eq. (4):

MSE = (1/5) Σ_{i=1}^{5} MSE_i     (4)
The subset giving the lowest MSE is recorded as the best one, related to the best ant, and denoted by S_best. At the end of each iteration, the amount of pheromone in the search space is updated according to Eq. (5) [19]:

τ_j(t + 1) = (1 − ρ)·τ_j(t) + Σ_{k=1}^{m} Δτ_j^k(t),  with  Δτ_j^k(t) = Q / MSE_k if j ∈ S_k, and 0 otherwise     (5)

where m is the number of ants, S_k represents the subset constructed by ant k, ρ denotes the pheromone evaporation coefficient, Q is a constant multiplier that defines the amount of pheromone each ant should deposit, and MSE_k denotes the Mean Square Error corresponding to the subset constructed by ant k.
The overall pseudocode of the proposed PACO gene selection approach is illustrated in Algorithm 2.
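The loop above can be sketched in a toy form. This is a simplified sketch of the PACO idea only: the function name `paco_select` is hypothetical, and a plain least-squares MSE stands in for the paper's SVM with stratified 5-fold cross-validation, for brevity.

```python
import numpy as np

def paco_select(X, y, k, n_ants=10, n_iter=20, rho=0.1, Q=1.0, seed=0):
    """Toy PACO sketch: ants pick k genes guided by pheromone levels;
    subsets are scored here by least-squares MSE (a stand-in scorer)."""
    rng = np.random.default_rng(seed)
    n_genes = X.shape[1]
    tau = np.ones(n_genes)                      # constant initial pheromone
    best_subset, best_mse = None, np.inf
    for _ in range(n_iter):
        trails = []
        for _ in range(n_ants):
            p = tau / tau.sum()                 # Eq. (3): pheromone-proportional choice
            subset = rng.choice(n_genes, size=k, replace=False, p=p)
            coef, *_ = np.linalg.lstsq(X[:, subset], y, rcond=None)
            mse = np.mean((X[:, subset] @ coef - y) ** 2)
            trails.append((subset, mse))
            if mse < best_mse:
                best_subset, best_mse = subset, mse
        tau *= (1.0 - rho)                      # Eq. (5): evaporation...
        for subset, mse in trails:              # ...then deposit, more for lower MSE
            tau[subset] += Q / (1e-9 + mse)
    return sorted(best_subset)
```

On toy data where one gene carries the label, the informative gene ends up in the returned subset.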

Feature Extraction Methods
Feature extraction (FE) is the process of transforming original data with a large number of features into a reduced representation. As shown in Fig. 4, FE is achieved by transforming the original set F = {f_1, f_2, ⋯, f_p} of p features into a new set of m predictor variables called components (where m ≪ p).
Among linear and nonlinear methods, PCA and Kernel PCA are the most commonly used FE techniques for dimensionality reduction. In this paper, we attempt to use FE methods combined with FS ones in order to handle the curse of dimensionality of cancer datasets.

Principal Components Analysis (PCA)
PCA is a classical dimension-reduction technique used to reduce a large set of variables (features) into a small new one without much loss of information [26]. Mathematically, PCA attempts to transform a number of linearly correlated variables into a smaller number of new ones called components. In other words, PCA aims to find a linear subspace of lower dimensionality than the original variable space, where the new linear subspace has the largest variance (i.e., holds most of the information in the original space). FE using PCA can be formulated as follows: given a p-dimensional training set [x_1, x_2, …, x_N], where p denotes the number of features and N the number of patterns, we want to find Ψ, the matrix of new components, where the number of principal components that should be retained is decided using the percentage of total variance explained. The pseudocode of the PCA method is illustrated in Algorithm 3.
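The variance-explained retention rule can be sketched as follows. This is a minimal sketch, not the paper's Algorithm 3: the helper name `pca_fit_transform` is hypothetical, and the SVD of the centered data is used to obtain the components.

```python
import numpy as np

def pca_fit_transform(X, var_explained=0.80):
    """Project centred data onto the leading principal components that
    together explain at least `var_explained` of the total variance."""
    Xc = X - X.mean(axis=0)                    # centre each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var_ratio = s**2 / (s**2).sum()            # variance explained per component
    m = int(np.searchsorted(np.cumsum(var_ratio), var_explained) + 1)
    return Xc @ Vt[:m].T                       # scores: (n_samples, m)
```

With the 80% threshold used in the experiments, strongly correlated genes collapse onto very few components.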

kernel-PCA
While PCA is a dimension-reduction technique that finds a linear transformation to represent the data in a lower dimension, kernel PCA is used when we deal with data of complex structure, where a linear subspace is not very useful [27]. In this paper, kernel PCA is used as an alternative to PCA when there is no linear correlation between features, which can affect classification accuracy.
Introduced as a nonlinear generalization of standard PCA [27], kernel PCA maps the original input vectors x_1, x_2, …, x_N ∈ ℝ^p into a new feature space Φ(x_1), Φ(x_2), …, Φ(x_N), and then standard PCA is performed in this new feature space. However, computing Φ(x) explicitly before extracting the principal components is extremely costly [28]. The best practice is to directly construct a kernel matrix K from X instead of computing Φ(x) explicitly [29]; thus, the mapping Φ is implicitly specified by the kernel function. The most commonly used kernel function is the Radial Basis Function (RBF) kernel in Eq. (6):

K(x_i, x_j) = exp(−γ ∥x_i − x_j∥²)     (6)

where ∥x_i − x_j∥² denotes the squared Euclidean distance and γ > 0 is a parameter that sets the spread of the kernel.
If the new feature space is not centered, a centering transformation can be applied directly to the kernel matrix using Eq. (7) [30]:

K̃ = K − 1_N K − K 1_N + 1_N K 1_N     (7)

where 1_N is the N × N matrix with all elements equal to 1/N, and N is the number of patterns.
The overall pseudocode of the Kernel Principal Components Analysis method is illustrated in Algorithm 4.
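The RBF kernel of Eq. (6), the centering of Eq. (7), and the eigendecomposition can be sketched together. This is a minimal sketch, not the paper's Algorithm 4; the helper name `kernel_pca` is hypothetical.

```python
import numpy as np

def kernel_pca(X, n_components=2, gamma=1.0):
    """Kernel PCA with an RBF kernel: build K (Eq. 6), double-centre it
    (Eq. 7), then project onto the leading eigenvectors."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # ||x_i - x_j||^2
    K = np.exp(-gamma * sq)
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)                       # the 1_N matrix
    Kc = K - one @ K - K @ one + one @ K @ one           # centred kernel
    w, v = np.linalg.eigh(Kc)                            # ascending eigenvalues
    idx = np.argsort(w)[::-1][:n_components]             # take the largest
    alphas = v[:, idx] / np.sqrt(np.maximum(w[idx], 1e-12))
    return Kc @ alphas                                   # projected samples
```

Because the kernel matrix is centered, the projected samples have zero mean by construction.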

C5.0 Decision Tree
C5.0 is a newer decision tree algorithm developed from C4.5 by [31], which has proven its high detection accuracy in many research fields [32][33][34]. Compared to C4.5, C5.0 can handle different types of data, deal with missing values, and support boosting to improve classifier accuracy [35]. In the C5.0 algorithm, samples are split into sub-samples by a recursive method based on information gain ratios. Each sub-sample obtained from the first split is then split again. The split process is repeated until no further split makes a difference in terms of information gain ratios. At the end of the process, any split that does not make a significant contribution to the model is rejected [36].
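C5.0 itself is a proprietary implementation, but its splitting criterion, the information gain ratio, can be sketched. The helper names below are hypothetical; this shows only how a candidate binary split is scored, not the full tree-building algorithm.

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a label vector, in bits."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gain_ratio(x, y, threshold):
    """Information gain ratio of the binary split x <= threshold,
    the criterion C5.0-style trees use to pick splits."""
    left = x <= threshold
    n, nl = len(y), left.sum()
    if nl in (0, n):                           # degenerate split: no information
        return 0.0
    remainder = (nl / n) * entropy(y[left]) + ((n - nl) / n) * entropy(y[~left])
    gain = entropy(y) - remainder
    split_info = entropy(left.astype(int))     # intrinsic information of the split
    return gain / split_info
```

A split that perfectly separates a balanced binary target attains the maximum gain ratio of 1.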

Support Vector Machine
The Support Vector Machine (SVM) is a binary classifier algorithm that has been successfully applied in many pattern recognition areas. In linear classification, SVM constructs a classification hyperplane that separates the data into two sets by maximizing the margins and minimizing the classification error. The hyperplane is constructed in the middle of the maximum margin. Thus, samples above the hyperplane are classified as positives; otherwise, they are classified as negatives (Fig. 5). The classification function is given by Eq. (8) [37]:

y = sign(w · x + b)     (8)

where y denotes the class label, w and b are the parameters of the hyperplane, and sign denotes the sign function.
However, in real classification problems, datasets are often linearly non-separable, so Eq. (8) will leave some of the samples on the wrong side of the hyperplane. To overcome this problem of non-linearity, a nonlinear transformation of the input vectors into a new feature space is performed, and a linear separation is then carried out in this new feature space [37]. To perform a nonlinear SVM, the inner product ⟨x_i, x_j⟩ is replaced by a kernel function (Eq. (9)). In this paper, a Gaussian kernel (Eq. (6)) was used to deal with the problem of non-linearity.
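The kernel trick on a non-separable problem can be illustrated in a few lines, assuming scikit-learn is available (the paper's exact SVM settings are not stated; the `gamma` value here is an illustrative choice).

```python
import numpy as np
from sklearn.svm import SVC

# labels defined by a circle: not linearly separable in the input space
rng = np.random.default_rng(0)
X = rng.standard_normal((80, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)

# Gaussian (RBF) kernel as in Eq. (6) lets a linear separator in the
# implicit feature space capture the circular boundary
clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)
print(clf.score(X, y))
```

A linear SVM would hover near chance on this data; the RBF kernel fits the circular boundary closely.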

Artificial Neural Network
Introduced by [38], the artificial neural network (ANN) is a form of distributed computation inspired by networks of biological neurons. As shown in Fig. 6(a), an ANN consists of a set of interconnected artificial neurons organized in a minimum of three layers: the input layer, the hidden layer, and the output layer. All nodes (neurons) in each layer of the network are connected to the nodes of the next layer with no backward connections, and all the connections are defined by weight values denoted by w. In the input layer, all nodes get information from the outside and pass it to the nodes of the next layer, weighted by w. If we look at one of the hidden or output neurons (Fig. 6(b)), we find that each node computes the weighted sum of all N neurons of the previous layer and passes it through an activation function [39]. Equation (10) represents the output of a given neuron:

y = f(Σ_{i=1}^{N} w_i x_i + b)     (10)
A common choice for the activation function f is a non-linear function such as the logistic sigmoid given by Eq. (11):

σ(z) = 1 / (1 + e^{−z})     (11)
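A single unit implementing Eqs. (10) and (11) can be sketched directly (the helper names are hypothetical, and only the forward pass is shown, not training):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid activation, Eq. (11)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(inputs, weights, bias):
    """One hidden/output unit: the weighted sum of the previous layer's
    outputs passed through the activation, Eq. (10)."""
    return sigmoid(np.dot(weights, inputs) + bias)
```

With zero weights and bias, the unit sits exactly at the sigmoid's midpoint of 0.5, and a positive bias pushes its output above 0.5.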

Logistic Regression
As an extension of the linear regression algorithm to classification problems, logistic regression aims to find the best-fitting model, which squeezes the output of a linear equation between 0 and 1 using the logistic function (Eq. (11)). In linear regression, the relationship between the output and the features is modeled using a linear equation (Eq. (12)):

y = β_0 + β_1 x_1 + ⋯ + β_p x_p     (12)

However, in a classification problem, it is strongly recommended to output probabilities, which forces the outcome to lie between 0 and 1 (Eq. (13)):

P(y = 1 | x) = 1 / (1 + e^{−(β_0 + β_1 x_1 + ⋯ + β_p x_p)})     (13)
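The squeezing of the linear score through the sigmoid can be sketched with a plain gradient-descent fit. This is a minimal sketch, not the solver used in the paper; the function name `fit_logreg` and the learning-rate settings are illustrative.

```python
import numpy as np

def fit_logreg(X, y, lr=0.5, n_iter=500):
    """Gradient-descent logistic regression sketch: the sigmoid squashes
    the linear score of Eq. (12) into a probability as in Eq. (13)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)         # gradient of the log-loss
        b -= lr * (p - y).mean()
    return w, b
```

On a one-dimensional separable toy problem, thresholding the fitted probability at 0.5 recovers every label.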

Performance Evaluation
Data mining offers several ways to check the performance of a classification model. The quality of any classification model is assessed from the confusion matrix (Table 1), which summarizes the comparison between predicted and observed classes for all observations.
Another common evaluation metric used in machine learning is the receiver operating characteristic (ROC) curve, which is created by plotting the True Positive Rate (TPR = TP/(TP + FN)) against the False Positive Rate (FPR = FP/(FP + TN)). The Area Under the ROC Curve (AUC) provides a good idea of model performance: a model that gives 100% correct predictions has an AUC of 1, while a model that gives 100% wrong predictions has an AUC of 0.
In the present paper, both accuracy and ROC curve were used to evaluate the performance of each generated model.
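The AUC can be computed without tracing the curve, via its rank-statistic interpretation: the probability that a randomly chosen positive outscores a randomly chosen negative. A minimal sketch (the helper name `auc` is hypothetical):

```python
import numpy as np

def auc(scores, labels):
    """AUC via the Mann-Whitney statistic: the fraction of
    (positive, negative) pairs where the positive scores higher,
    counting ties as half."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

A perfect ranking yields 1, a perfectly inverted ranking yields 0, matching the extremes described above.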

Dataset Description
In order to evaluate the ability of our framework to adapt to different situations, experiments were carried out on several public high-dimensional microarray datasets with different properties (number of genes, number of patterns, and number of classes). A description of the datasets used in the present work is provided in Table 2.

Partitioning
In order to avoid overestimating prediction performance, a stratified 5-fold cross-validation technique was employed. Using this technique, samples are split into five equal folds (subsets). One of the five folds is used for testing, and the remaining four folds are put together to form the training data. This process is repeated five times. Stratification ensures that each fold preserves the same percentage of samples from each class.
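The stratified assignment of samples to folds can be sketched as follows (a minimal sketch with a hypothetical helper name, not the paper's implementation):

```python
import numpy as np

def stratified_kfold(y, k=5, seed=0):
    """Assign each sample to one of k folds while preserving class
    proportions: each class's members are shuffled and dealt
    round-robin across the folds."""
    rng = np.random.default_rng(seed)
    fold = np.empty(len(y), dtype=int)
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)
        rng.shuffle(idx)
        fold[idx] = np.arange(len(idx)) % k
    return fold
```

With 10 samples per class and k = 5, every fold receives exactly two samples of each class.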

Data Preprocessing
Before supplying the datasets to our analysis system, it was necessary to perform data preprocessing, as it is an important step in the data analysis process. In the present paper, gene expression datasets were preprocessed using the standard procedure, which includes log transformation and standardization.

a) Data Transformation
The main motivation for using the log transformation is the asymmetric distribution of the derived expression levels [44], which can affect the identification of expression patterns and prospective classification in human cancer genomes [45].
In the present work, before transforming our data (training or testing set) using Eq. (15), a Shapiro-Wilk test of normality [46,47] was used to evaluate whether the distribution of the data agrees with a normal distribution. The calculated p-value of the Shapiro-Wilk test for each gene shows strong significance, which indicates a deviation from the normal distribution for most of the gene expressions.
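The test-then-transform step can be sketched per gene. This is a minimal sketch: the helper name `log_if_skewed` is hypothetical, `scipy.stats.shapiro` supplies the normality test, and a base-2 logarithm is assumed for the transform of Eq. (15) (the paper does not state the base).

```python
import numpy as np
from scipy.stats import shapiro

def log_if_skewed(x, alpha=0.05):
    """Log-transform a gene's (positive) expression values when the
    Shapiro-Wilk test rejects normality at level alpha."""
    _, p = shapiro(x)
    return np.log2(x) if p < alpha else x
```

On log-normally distributed values, the test rejects normality and the transform is applied.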

b) Data Standardization
Gene expression levels for each gene were standardized using Eq. (16), so that the expression levels of each feature have a mean of 0 and a variance of 1:

x′ = (x − μ) / σ     (16)

where μ is the overall mean of the feature and σ its standard deviation.
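Applied column-wise to an expression matrix, Eq. (16) is a one-liner (the helper name is hypothetical):

```python
import numpy as np

def standardize(X):
    """Per-gene standardization, Eq. (16): subtract each column's mean
    and divide by its standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```

After the transform, every column has zero mean and unit standard deviation.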

a) System Configuration
Using parallel processing, our proposed framework was implemented in Python 3.7 language. All of the experiments were carried out using an Intel Xeon E5-2637 v2 3.5 GHz PC with 64 GB of RAM.

b) Parameters Settings
The parameters used in our framework are shown in Table 3.


RESULTS AND DISCUSSION
Table 4, which is also presented as graphs in Fig. 7, shows the overall performance of the dimension reduction techniques and their respective classification algorithms on the four public microarray datasets described in Table 2. Thirty-six different models were generated for each microarray dataset. The quality of each model is measured by the number k of selected genes, the dimension m′ of the new subspace generated by the FE process, the running time (which includes both the dimension reduction and classification stages), and the cancer prediction performance, which represents the average accuracy over the training and testing sets. The results for each dataset are as follows:
For the SRBCT dataset, Table 4 and Fig. 7 show that the shrinkage models P-PCA-SVM, P-PCA-ANN, and P-PCA-LR provide an excellent accuracy of 100%. The power of these models resides in the fact that the number of genes was reduced twice. The first reduction was obtained using Pearson correlation-based feature selection (Line 9 in Algorithm 1): the dimension of the dataset passed from p = 2308 (the original number of genes in the dataset, as reported in Table 2) to a new subset of k = 727 by selecting only the most representative genes, with relevance greater than 95%. Then, the new k-dimensional subset was transformed into a linear subspace using PCA, where only the first m′ = 22 predictors (components), explaining approximately 80% of the total variation of the gene subset (cumulative variance equal to 80%), were retained. The same results were achieved by the P-KPCA-SVM, P-KPCA-ANN, and P-KPCA-LR models, with the nonlinear transformation (KPCA) as the feature extraction method. The classification rate was close to 100% for most shrinkage models using the PACO-based feature selection method, at the cost of a more significant increase in response time. Fortunately, there is almost no significant accuracy loss for the rest of the generated models. For the Lung dataset, the PACO-ANN model achieved the highest classification performance of 99% over the entire set of generated models while using only 36% of the genes (the k = 4534 most significant genes) from the original pre-processed dataset, by using PACO-based feature selection (Line 13 in Algorithm 1). As there is no feature extraction process in this model, the PACO-selected subset of genes was used as the input layer of the ANN classifier; thus k = m′ = 4534. In contrast, this model took about 4709 seconds, which is a significant response time compared to the other generated models.
For the Prostate dataset, the best performance was achieved by the PACO-LR model, since the average classification accuracy of LR (Logistic Regression) reached 97.62% when involving only k = 1801 genes out of p = 10509, selected by the PACO feature selection method. The same average accuracy (97.62%) was achieved when applying LR and ANN to the p original genes without any dimensionality reduction.
For the Leukemia1 dataset, for the different values of k and m′, almost all generated models produced a high average classification performance close to 100%.
The power of our framework resides in the fact that it proposes a large number of models combining different dimensionality reduction techniques with the classification process. The major aim behind this combination is to use a bare minimum of dimensions while maximizing classification performance. Regarding Table 5, the most interesting result concerns the Leukemia1 dataset, since the best model, P-PCA-C5.0, achieved an excellent accuracy of 100% with only m′ = 3 selected dimensions and a response time of 2 seconds, which is much better than what was reported in [11][12], [14][15]. The classification accuracy of 100% was obtained by computing the percentage of correct predictions from the confusion matrix (c) shown in Fig. 9. The strongest point of the P-PCA-C5.0 model resides in the fact that the original number of genes, p = 5327, was reduced three times to arrive at m′ = 3 dimensions. The first reduction was obtained by selecting only the k = 1035 most relevant genes using Pearson correlation-based feature selection. Then, using PCA-based FE, the new subset of genes was converted into a new subspace of 26 dimensions (components), which in turn was reduced to m′ = 3 dimensions by the innate feature selection capacity of the C5.0 algorithm [49,50]. The quality of the P-PCA-C5.0 model in terms of classification performance was validated by the area under the ROC curve (AUC). As we can notice from the ROC curves (g) and (h) drawn in Fig. 8, the average AUC of the testing and training sets shows a maximum value of AUC = 1, which confirms the quality of our favorite model. With a significant increase in consumed time, the PACO-C5.0 model achieved exactly the same results (in terms of classification accuracy and degree of dimension reduction), except that m′ = 3 in this model represents a number of genes instead of the dimensions (components) obtained by P-PCA-C5.0.
For the Prostate dataset, the power of the C5.0 algorithm in terms of FS and classification performance was enough to achieve the best result compared to the other generated models. As we can notice from the confusion matrix (d) in Fig. 9, the model achieved a classification accuracy of 94% with only m′ = 2 genes, as reported in Table 5, which is better than what is reported in [10]. The average AUC of 0.935, obtained from the ROC curves (e) and (f), confirms the quality of our model in terms of classification performance.
For the Lung dataset, as we can notice from Table 5, the P-C5.0 model achieved a classification accuracy of 96.74% (calculated from confusion matrix (b) in Fig. 9) by involving only 4 genes. This model benefitted from both Pearson correlation-based feature selection, which reduced the number of genes from p = 12600 to k = 4679, and the innate power of C5.0, which reduced the number of genes a second time from k to m′ = 4. The average AUC of 0.99 validates the choice of this model.
For the SRBCT dataset, our favorite model, C5.0, achieved almost the same accuracy as on the Prostate dataset, using only m′ = 3 genes out of p = 2308.
According to the results reported in Table 4 and Table 5, our framework could improve both the accuracy and degree of dimensionality reduction compared with state-of-the-art methods.

CONCLUSIONS AND FUTURE WORK
Gene expression data analysis challenges conventional prediction techniques, since a limited number of labeled samples versus a large number of genes may significantly affect classification performance. To overcome this issue, a new generic approach combining dimensional reduction techniques with machine learning algorithms was proposed. The main objective behind this approach is to improve prediction performance on microarray datasets while involving a bare minimum number of predictors. The dimensional reduction process used in this paper is a combination of FS and FE techniques. FS using Pearson's correlation or PACO aims at selecting the most relevant genes, while FE using PCA or kernel PCA aims at transforming the original gene space into a new linear or non-linear subspace. The dimensional reduction techniques were combined with four classifiers: SVM, ANN, LR, and C5.0. We conducted experiments on four public microarray gene expression datasets: SRBCT, Leukemia1, Lung, and Prostate cancer. Experimental results show that the number of genes was efficiently reduced to as few as two genes, with a high classification accuracy that reached up to 100% (Table 5), making our framework effectively competitive with the reference approaches. Moreover, our experiments confirm that coupling dimensionality reduction with classification makes our framework powerful in terms of its ability to adapt to different kinds of microarray datasets.
Our future work includes experimenting with our proposed approach on new gene expression datasets, and studying new data mining techniques that can enhance our framework in many different aspects, with the aim of identifying, with high performance, previously unknown cancer-related genes that may guide further cancer research.