FEATURE EXTRACTION AND SUPERVISED LEARNING FOR VOLATILE ORGANIC COMPOUNDS GAS RECOGNITION

The emergence of advanced technologies, particularly in the field of artificial intelligence (AI), has sparked significant interest in exploring their potential benefits for various industries, including healthcare. In the medical sector, the utilization of sensing systems has proven valuable for diagnosing pulmonary diseases by detecting volatile organic compounds (VOCs) in exhaled breath. However, the identification of the most informative and discriminating features from VOC sensor arrays remains an unresolved challenge, essential for achieving robust VOC class recognition. This research project aims to investigate effective feature extraction techniques that can be employed as discriminative features for machine learning algorithms. A preliminary dataset was used to predict VOC classification through the application of five supervised machine learning algorithms: k-Nearest Neighbors (kNN), Random Forest (RF), Support Vector Machines (SVM), Logistic Regression (LR), and Artificial Neural Networks (ANN). Ten feature extraction methods were proposed based on changes in sensor response as inputs to classify three types of gases in the dataset. The performance of each model was evaluated and compared using k-Fold cross-validation (k=10) and metrics derived from the confusion matrix. The results demonstrate that the RF model


INTRODUCTION
Volatile organic compounds (VOCs) have been used as preclinical biomarkers in breath analysis to monitor health and diagnose various pulmonary diseases such as asthma and lung cancer [1] [2] [3] [4]. An array of sensors, known as an electronic nose (e-nose), offers a non-invasive alternative for detecting VOCs. An e-nose is a device inspired by the olfactory system (sense of smell) of humans and other mammals. It combines an array of gas sensors with a pattern recognition system designed to detect and differentiate a wide variety of gas compounds [5].
Combining nanosensor arrays with pattern recognition (pre-processing, feature extraction and machine learning algorithms) yields a powerful tool for detecting and recognising gas samples and estimating their concentration. Feature extraction is an essential technique used to extract significant information from the sensor response signal [6] [7] to optimize the performance of pattern recognition algorithms for gas classification [6] [8].
However, the detection of VOCs using nanosensor technologies still has some constraints. The VOC sensor, as the sensing unit, faces limitations such as a lack of sensitivity and selectivity [9] [10]. Besides, it is still not clear which types of features from VOC sensor arrays are the most descriptive and discriminative and therefore lead to robust recognition of the VOC classes. Data collection from a gas sensor array can also be cumbersome and time-consuming, which is a drawback when employing data-hungry machine learning algorithms. Therefore, this paper proposes employing supervised machine learning algorithms to classify the preliminary data of the individual sensors for VOC recognition. VOC detection was performed on chemiresistive sensors that use variously functionalised reduced Graphene Oxide (rGO) as the sensing layer. The targeted VOC gases are acetone, toluene and isoprene, which have been suggested as pulmonary disease-related biomarkers [11], with concentration levels ranging from 1 to 6 ppm.
Concretely, we explore 10 feature sets extracted from the sensor's original response curve. Then, we analyse the effect of these features on VOC classification with five benchmark machine learning algorithms: k-Nearest Neighbours (kNN), Random Forest (RF), Logistic Regression (LR), Support Vector Machine (SVM) and Artificial Neural Network (ANN). The recognition models were then compared, using k-Fold Cross Validation (k=10) and the confusion matrix, to determine which one classifies the targeted VOC gases most accurately.
https://doi.org/10.31436/iiumej.v24i2.2832

GAS SENSING MECHANISM
The sensing mechanism is studied first to understand how the sensor detects VOCs. The thin film is composed of rGO, which is one of the most promising materials for detecting low VOC concentrations at room temperature [12]. Graphene is a two-dimensional building block made up of a one-atom-thick sheet of carbon atoms.
Graphene can work well at room temperature because it has enormously high carrier mobility [13]. Researchers are interested in modifying graphene into reduced Graphene Oxide as a sensing element because of its excellent electrical properties, high thermal conductivity, and mechanical properties [14] [15]. The functionalisation of rGO with nanoparticles and plasma treatment can improve sensor functionality and selectivity in distinguishing different vapours [11]. Functionalising the sensing elements in different ways is therefore a good technique for improving the gas sensor's sensitivity and characteristics.
In this research, the sensing layer was deposited on a Ti/Pt Interdigitated Electrode (IDE). The electrode was used to supply current from the power source to the device, which improved the sensing material's catalytic properties towards a specific gas [16]. Furthermore, the VOC sensor employed is a resistive type, which produces a signal based on a change in resistance in response to gas exposure. In general, VOC gas detection on a sensor is caused by the adsorption and desorption processes that occur between analytes and the sensor surface [17].
In the presence of air (humidity), oxygen ion species were adsorbed on the sensor surface and captured electrons from the conduction band [18]. The electron density falls, forming an electron depletion layer and a potential barrier on the surface. Electron removal causes the depletion layer to grow. The related equation for chemisorbed oxygen at temperatures below 100°C [19] is as follows:

$$\mathrm{O_2\,(gas)} + e^{-}\,\mathrm{(surface)} \leftrightarrow \mathrm{O_2^{-}\,(adsorption)} \qquad (<100\,^{\circ}\mathrm{C})$$

When VOC gas was introduced into the chamber, the gas molecules started to react with the adsorbed oxygen ions and released electrons back into the conduction band. The predominant carrier in the sensors was modified by the reaction of the VOC gas (oxidizing or reducing agent) with the molecules in the sensing layer, resulting in an increase or decrease in the measured resistance as the output [19].
Reduced Graphene Oxide has been reported to exhibit p-type behaviour [20]. However, the functionalised sensor was shown to be an n-type semiconductor in this VOC test, and the VOC analytes acted as reducing gases [18]. With electrons as the majority carriers, the depletion width and potential barrier decreased. As a result, the sensor resistance decreased in the presence of VOC gas [21].

Feature Extraction
Feature extraction is a technique used to extract significant information from the sensor response graph [6] to ensure better performance of machine learning algorithms in pattern recognition [8]. The information is deemed relevant when the value derived from the measured data is non-redundant, is not correlated with other features and captures the decisive characteristics [22]. Feature selection is also related to dimensionality reduction, the process of transforming high-dimensional data into low-dimensional features [8].
Detection of VOCs using gas sensors commonly relies on real-time analysis and discrimination of "breath prints" to perform the gas classification [2]. In 2012, Vergara and his team extracted 8 features from the sensor time series: the maximal resistance change (ΔR), the normalized resistance change (||ΔR||), and the minimum and maximum of the exponential moving average (ema) for each of the smoothing parameter values 0.001, 0.01 and 0.1 [10].
Many other features can be extracted from raw signals and applied in electronic nose applications. Features commonly extracted from the original gas response curve include the maximum response, the response at a special time, the time of a special response, area, integral, derivative, difference and second derivative [6]. Table 1 lists several feature extraction methods for electronic nose sensor data used for wound detection [23].
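The eight features described above can be sketched in a few lines. This is a minimal illustration, not the implementation used in [10]: it assumes that ΔR is the difference between the maximum and minimum of the resistance trace, that ||ΔR|| divides this by the minimum, and that the ema transform follows the recursion y[k] = (1 - α)·y[k-1] + α·(r[k] - r[k-1]) with y[0] = 0. The function names are hypothetical.

```python
import numpy as np

def ema_transform(r, alpha):
    """Exponential moving average of the first difference of r.

    ASSUMPTION: y[k] = (1 - alpha) * y[k-1] + alpha * (r[k] - r[k-1]), y[0] = 0.
    """
    y = np.zeros(len(r))
    for k in range(1, len(r)):
        y[k] = (1.0 - alpha) * y[k - 1] + alpha * (r[k] - r[k - 1])
    return y

def vergara_style_features(r, alphas=(0.001, 0.01, 0.1)):
    """Return 8 features: delta_R, normalised delta_R, and ema max/min per alpha."""
    r = np.asarray(r, dtype=float)
    # ASSUMPTION: resistance change taken as max minus min of the trace.
    delta_r = r.max() - r.min()
    features = {
        "delta_R": delta_r,
        "norm_delta_R": delta_r / r.min(),  # normalised by the minimum
    }
    for a in alphas:
        y = ema_transform(r, a)
        features[f"ema_max_{a}"] = y.max()  # captures the rising transient
        features[f"ema_min_{a}"] = y.min()  # captures the decaying transient
    return features

# Toy resistance trace (arbitrary units), purely illustrative.
trace = [10.0, 10.0, 12.0, 14.0, 14.0, 11.0, 10.0]
feats = vergara_style_features(trace)
```

Smaller α values respond more slowly, so the three settings summarise the transient at three different time scales.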

Normalization
Pre-processes the sensor data for features from the steady-state response and eliminates the effect of concentration differences on recognition.

Integral and derivative methods
Integrals may represent the accumulative total of the reaction degree change and derivatives may represent the rate at which the sensor reacts to the odour.

The Root Mean Square Error (RMSE) of curve fitting
Depends on the type of model and the number of parameters in the model.

Fourier transform and wavelet transform
Fourier transform decomposes the original response curve into a superposition of the DC component and different harmonic components.
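As a rough illustration of the methods in Table 1, the sketch below computes one feature of each kind from a single response curve. The choices here are assumptions made for the example only: min-max scaling as the normalisation, a trapezoidal rule for the integral, a second-degree polynomial as the fitted curve, and the first three harmonics of the Fourier spectrum. The function name is hypothetical.

```python
import numpy as np

def table1_style_features(t, r):
    """Compute illustrative normalisation, integral, derivative,
    curve-fit RMSE and Fourier features from one response curve."""
    t = np.asarray(t, dtype=float)
    r = np.asarray(r, dtype=float)

    # Normalisation (ASSUMPTION: min-max scaling to [0, 1]).
    span = r.max() - r.min()
    r_norm = (r - r.min()) / span if span > 0 else np.zeros_like(r)

    # Integral: accumulated response, via the trapezoidal rule.
    integral = float(np.sum(0.5 * (r_norm[1:] + r_norm[:-1]) * np.diff(t)))

    # Derivative: rate at which the sensor reacts.
    derivative = np.gradient(r_norm, t)

    # RMSE of curve fitting (ASSUMPTION: quadratic model, 3 parameters).
    coeffs = np.polyfit(t, r_norm, deg=2)
    rmse = float(np.sqrt(np.mean((np.polyval(coeffs, t) - r_norm) ** 2)))

    # Fourier transform: DC component plus the first harmonic magnitudes.
    spectrum = np.abs(np.fft.rfft(r_norm)) / len(r_norm)

    return {
        "integral": integral,
        "max_slope": float(derivative.max()),
        "fit_rmse": rmse,
        "dc_component": float(spectrum[0]),
        "harmonics": spectrum[1:4],
    }

# Illustrative linear ramp; real sensor curves are not this simple.
t_demo = np.linspace(0.0, 1.0, 11)
demo = table1_style_features(t_demo, 2.0 * t_demo + 5.0)
```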

Supervised Machine Learning
A few studies have implemented the detection of different gases using supervised learning models such as k-Nearest Neighbour (kNN), Support Vector Machine (SVM), Artificial Neural Network (ANN), Random Forest (RF) and Logistic Regression (LR). The findings are summarised in Table 2.

k-Nearest Neighbour (kNN)
kNN is widely used in the classification of mixed gases and in gas discrimination systems. The kNN model is advantageous because it is comprehensible, insensitive to noise, cheap to retrain and combines well with other algorithms. However, it is sensitive to the sample distribution, is slow at recognition, and has high spatial complexity, a heavy calculation burden and poor interpretability [24].

Support Vector Machines (SVM)
SVM classifiers can cope well with gas sensor drift and perform better than baseline competing methods on an extensive dataset. However, the SVM model requires a long learning time and is poorly suited to larger datasets. The choice of kernel function is important, as it determines the feature space of the SVM [10], [24].

Artificial Neural Network (ANN)
ANN is a frequently used method for predicting and analysing complex gases (Hashoul & Haick, 2019). It has good learning ability, good parallel processing capability and can detect compatibility errors. However, this model has poor interpretability of its output, takes a long time to learn and overfits easily. Therefore, the weights, the activation function and the number of hidden layers are important when developing an ANN to classify the targeted output.

Random Forest (RF)
The RF model is used on datasets with many features, as it can prevent the overfitting of a single decision tree. In the Random Forest algorithm, the number of trees affects the accuracy of the model, as each tree produces a classification result and the final result is the majority vote of the decision trees [26], [36].

Logistic Regression (LR)
LR is a classification algorithm that computes a linear output and passes this regression output through a statistical (logistic) function. Logistic regression can handle multiclass classification problems by using one-vs-rest or one-vs-one wrapper models. The algorithm can be applied to a non-linear classification problem with proper feature selection. The LR model can produce high accuracy when the signal-to-noise ratio is good [27].

EXPERIMENTAL SETUP
As illustrated in Fig 1, the gas sensing system for this study comprises a gas supply system, a sensor chamber, a temperature and humidity controller module, and a data collection system [28]. There are two gas flows in the system: one for the carrier gas and one for the VOC gas. Clean Dry Air (CDA) was used as the carrier gas, while isoprene, toluene and acetone were the targeted VOC gases. The temperature of the gas and of the temperature chuck in the sensor chamber were controlled using a Cellkraft Humidifier P-10 and a Nextron Temperature controller module (Nextron Microprobe Station, with platier heater and 4 probe needles). An Agilent SMU 34410A was used to drive the voltage and input current. A data acquisition (DAQ) system transfers the measured output signal from the sensor system to the computer through user interface software programmed in LabVIEW and provided by MIMOS. The test measurement parameters include the flow of the CDA, the flow of the VOC gas, the input current, the input voltage, the temperature inside the chamber, the temperature of the sensor's heater, the relative humidity, and the system ramp rate. Then, each sensor was tested individually with the targeted VOC gas and its responses were recorded to study the performance of the individual sensor.

VOC Sensor
The gas sensor used in this study is called a VOC sensor, and was prepared, fabricated and functionalised by the engineering team at MIMOS Bhd. Reduced Graphene Oxide (rGO), as the sensing membrane, was deposited on a platinum-titanium interdigitated electrode (Pt/Ti IDE) on a silicon and silicon dioxide (Si/SiO2) substrate. The rGO was functionalised with nanoparticles such as gold (Au), silver (Ag) and platinum (Pt), and with plasma treatments such as hydrogen (H2) and octafluorocyclobutane (C4F8).
The sensor was fabricated using a standard semiconductor process with Chemical Vapor Deposition (CVD) and standard lithography, and functionalised using different recipes. The rGO was functionalised with nanoparticles at different sputtering durations and radio frequency (RF) power levels, and with plasma treatment at various plasma powers and temperatures. As a result, 21 individual VOC gas sensors were used in this study; their details are given in Table 3.

Data Collection
The sensors were tested individually with each of the selected VOC gases. Each sensor was placed in a chamber at a temperature of 30℃ and 40% relative humidity (RH). The voltage and current inputs were set at 1 V and 1.2 A respectively. CDA was maintained at 1 L/min for 5 minutes to stabilize the baseline reading. Then, the VOC gas was purged into the chamber at gradually increasing concentrations, from 1 to 6 ppm over 12 minutes (2 minutes for each concentration). The sensor responses were analysed from the resistance changes of the individual sensors that underwent the VOC test.

DATA PRE-PROCESSING AND FEATURE EXTRACTION
In this phase, the analytes of the VOC gas reacted with the sensing element, leading to a change in resistance. The sensor response was determined by analysing the measured resistance as the signal output from each sensor. However, the parameter setup was not optimal and the output signal contained unexpected noise from the SMU system. A typical sensor response could not be seen clearly in the graph of resistance versus time.
As a result, the signal was pre-processed by applying filtering and smoothing methods to denoise it and reduce the influence of random variation caused by instrumental conditions and atmospheric effects [29]. The data was filtered using a moving average (MA length = 3) and smoothed with Minitab software using the single exponential method with a smoothing constant of 0.02. The sensor response was determined using the formula in [30] [31], which is defined in terms of the resistance in clean dry air (without VOC gas) and the resistance under exposure to VOC gas.
Next, the pre-processed signal went through feature extraction to obtain pertinent information as input for the supervised machine learning models that classify the gas components into the targeted gas output. The features were extracted from the original gas response, using the measured resistance in the absence and presence of the VOC gas. The 10 selected features are listed in Table 4. The VOC dataset comprises ten feature values per sample and is organised into three classes. In summary, there are 918 samples in total, with 252, 324, and 342 for acetone, toluene, and isoprene gas, respectively.
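The pre-processing chain described above (moving average of length 3, then single exponential smoothing with a constant of 0.02) can be sketched as follows. This is an approximation rather than a reproduction of the Minitab procedure: the smoother here is initialised with the first sample, which may differ from Minitab's default. The response formula itself is not reproduced in this text, so `relative_response` uses a commonly seen definition, the percentage change relative to the resistance in air, purely as an assumption. All function names are hypothetical.

```python
import numpy as np

def moving_average(x, length=3):
    """Simple moving-average filter; output is shorter by length - 1 samples."""
    kernel = np.ones(length) / length
    return np.convolve(np.asarray(x, dtype=float), kernel, mode="valid")

def single_exponential_smoothing(x, alpha=0.02):
    """s[k] = alpha * x[k] + (1 - alpha) * s[k-1].

    ASSUMPTION: initialised with s[0] = x[0].
    """
    x = np.asarray(x, dtype=float)
    s = np.empty(len(x))
    s[0] = x[0]
    for k in range(1, len(x)):
        s[k] = alpha * x[k] + (1.0 - alpha) * s[k - 1]
    return s

def relative_response(r_air, r_gas):
    """ASSUMPTION: percentage resistance change relative to clean dry air.

    The formula cited in [30] [31] may be defined differently.
    """
    return (r_air - r_gas) / r_air * 100.0

# Synthetic noisy resistance trace: baseline 100, dropping to 80 under gas.
rng = np.random.default_rng(0)
raw = np.concatenate([np.full(300, 100.0), np.full(300, 80.0)])
raw = raw + rng.normal(0.0, 1.0, raw.size)
smoothed = single_exponential_smoothing(moving_average(raw, 3), alpha=0.02)
response = relative_response(smoothed[250], smoothed[-1])
```

A small smoothing constant such as 0.02 suppresses noise strongly but also delays the smoothed curve, so steady-state values are better read near the end of each exposure step.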

SUPERVISED LEARNING FOR VOC GAS CLASSIFICATION
Five supervised learning models, namely k-Nearest Neighbour (kNN), Random Forest (RF), Logistic Regression (LR), Support Vector Machines (SVM) and Artificial Neural Networks (ANN), were benchmarked for the VOC gas classification. The models were implemented in Python using the scikit-learn library. Each model's parameter settings are described in Table 5.
To avoid bias in the analysis, the dataset was first rescaled to the range 0 to 1 using the min-max normalisation technique, so that values with different scales become comparable [32]. Following that, the dataset was divided into 70% for the training set and 30% for the testing set. The performance of each model was then evaluated using confusion-matrix-based measures (accuracy and precision) and the k-fold cross-validation technique with k = 10. K-fold cross-validation evaluates the performance of a machine learning model by a resampling procedure: the model is trained on k-1 parts, and the validation or testing error is computed on the remaining part [33].

Table 5: Parameter Settings of the Supervised Machine Learning Models

Model | Parameter

K-nearest Neighbour
The k-value was set to one (k=1), and the distance between two points was calculated using the distance metric formula (2) (given in Chapter 2) with p = 2, which reduces the generalised distance to the Euclidean distance.

Random Forest
The parameters were set by grid search, with the number of estimators (n_estimators) set to 100.

Artificial Neural Network
A shallow neural network was implemented as a standard three-layer feed-forward network. The hidden layer had a size of 50 and used the ReLU activation function, while the output layer used the Softmax activation function. The learning rate was 0.001 and training ran for 50 epochs.

Support Vector Machine (RBF kernel)
The radial basis function (RBF) was selected as the kernel function for this SVM model, as defined in equation (6) (in Chapter 2). The kernel parameter σ and the regularisation parameter C were set by grid search using GridSearchCV from the scikit-learn library.

Logistic Regression
The model used 'l2' for regularisation (penalty) and the 'lbfgs' solver.

Table 6 shows the accuracies from the 10-fold cross-validation of each model.
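The settings in Table 5 map onto scikit-learn roughly as sketched below. This is an illustration under several assumptions, not the study's code: the data is a synthetic stand-in generated with `make_classification` (the VOC dataset is not reproduced here), the grid searches for RF and SVM are omitted so the SVM uses library defaults for C and the kernel width, and scaling is placed inside a pipeline so that it is refitted on each training fold, which differs from rescaling the whole dataset once.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# ASSUMPTION: synthetic 3-class, 10-feature data standing in for the VOC set.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_redundant=2,
    n_classes=3, random_state=0,
)

models = {
    # k = 1 with Minkowski p = 2, i.e. Euclidean distance.
    "kNN": KNeighborsClassifier(n_neighbors=1, p=2),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    # L2 is scikit-learn's default penalty for logistic regression.
    "LR": LogisticRegression(solver="lbfgs", max_iter=1000),
    # RBF kernel; C and gamma left at defaults (grid search omitted).
    "SVM": SVC(kernel="rbf"),
    # One hidden layer of 50 ReLU units; max_iter acts as the epoch count.
    "ANN": MLPClassifier(
        hidden_layer_sizes=(50,), activation="relu",
        learning_rate_init=0.001, max_iter=50, random_state=0,
    ),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    pipeline = make_pipeline(MinMaxScaler(), model)  # scale to [0, 1] per fold
    scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

With only 50 epochs the network may stop before converging, in which case scikit-learn emits a convergence warning; this is expected for this configuration. The accuracies printed for the synthetic data say nothing about the VOC results.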

RESULT
The diagonal values in the confusion matrix denote the accuracy of the gas classification for each targeted output [34]. Figure 3 shows that the kNN and RF models performed well in classifying each of the targeted VOCs. The kNN model correctly predicted all three gases with greater than 80% accuracy, while the Random Forest model predicted them with greater than 70% accuracy. On the other hand, Logistic Regression, Support Vector Machine (kernel = RBF) and Artificial Neural Network classified the three VOC gases poorly. The SVM and ANN models tended to misclassify gases as isoprene, while the LR model misclassified the toluene and isoprene gases.
As Table 6 shows, the RF model had the highest mean accuracy, 0.813 ±0.035, followed by the kNN model with 0.803 ±0.033. RF models are known to benefit from random sampling, which ensures randomness and helps avoid overfitting. Besides, this model is also robust to noise [5] and is good at handling missing data and imbalanced classes [4].
The kNN model, in contrast, has limitations in understanding the relationship between the features and the class (output), and thus easily produces wrong classifications in a multiclass problem [4]. Therefore, the high accuracy achieved by kNN in this study shows that the features were well related to the output class. The SVM model obtained higher accuracy with a Polynomial kernel at degree = 3 than with other kernels such as Linear and RBF. The poor performance of the ANN, LR and SVM (Polynomial kernel) models might be due to their weaknesses: they are very prone to overfitting the training data [33] and they require testing with various kernels and model parameters [4]. The ANN model is also a learning-based algorithm with a more complex architecture. Thus, it has more hyperparameters to tune [33] and it needs enough samples for training.
The ROC curve and AUC value offer another way of visualising the performance computed from the confusion matrix. The Receiver Operating Characteristics (ROC) and Area Under Curve (AUC) were evaluated to analyse the performance of the classifiers. A higher AUC indicates that the model is more likely to assign a larger probability to a random positive example than to a random negative example [35]. The AUC value should be between 0.5 and 1.0. The ROC curves for each classifier are illustrated in Figure 4. The lowest AUC, 0.541, belongs to the SVM classifier, which indicates that it does not predict this dataset well and could not differentiate the classes. The highest AUC is 0.886 for the kNN classifier, followed by 0.885 for the RF model. As a result, in this research, kNN and RF are the two models that can deal with the selected features in the VOC dataset, as they obtained the highest accuracy for the gas classification.
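To make the reading of the diagonal concrete, the sketch below builds a confusion matrix from toy labels and derives the per-class rates from it. The labels are invented for illustration and the function names are hypothetical. Rows are taken as true classes and columns as predicted classes, so the diagonal divided by the row sums gives the fraction of each gas that was recognised correctly, while the diagonal divided by the column sums gives the precision.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def per_class_recall(cm):
    """Fraction of each true class that was predicted correctly."""
    return np.diag(cm) / cm.sum(axis=1)

def per_class_precision(cm):
    """Fraction of each predicted class that was actually that class."""
    return np.diag(cm) / cm.sum(axis=0)

def overall_accuracy(cm):
    """Correct predictions (the diagonal) over all predictions."""
    return np.trace(cm) / cm.sum()

# Toy labels: 0 = acetone, 1 = toluene, 2 = isoprene (illustrative only).
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
recall = per_class_recall(cm)
precision = per_class_precision(cm)
accuracy = overall_accuracy(cm)
```

Off-diagonal entries show where the errors go: a large value in the isoprene column of another gas's row corresponds to that gas being misclassified as isoprene.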
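The probabilistic reading of the AUC given above can be computed directly by comparing every positive score with every negative score. The sketch below does this, counting ties as half a win. How the study averaged the AUC over the three gases is not stated, so the one-vs-rest macro average used here is an assumption, and the function names are hypothetical.

```python
import numpy as np

def auc_pairwise(scores_pos, scores_neg):
    """Probability that a random positive scores above a random negative.

    Ties count as half a win.
    """
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return wins / (pos.size * neg.size)

def one_vs_rest_auc(y_true, proba):
    """ASSUMPTION: macro average of per-class one-vs-rest AUC values.

    proba[i, c] is the predicted probability that sample i belongs to class c.
    """
    y_true = np.asarray(y_true)
    proba = np.asarray(proba, dtype=float)
    aucs = []
    for c in range(proba.shape[1]):
        is_c = y_true == c
        aucs.append(auc_pairwise(proba[is_c, c], proba[~is_c, c]))
    return float(np.mean(aucs))

# A perfect ranking gives 1.0; an uninformative one gives 0.5.
perfect = auc_pairwise([0.9, 0.8], [0.1, 0.2])
chance = auc_pairwise([0.9, 0.3], [0.5])
```

On this scale a value near 0.5, such as the lowest one reported for the classifiers, means the ranking of samples is close to random.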

CONCLUSION
The gas sensor data was collected at a preliminary stage and used for the machine learning part, which involved pre-processing, feature extraction and classification algorithms. Each sensor performed well at low operating temperatures and in the presence of humidity. The sensors' responses to the different targeted VOC gases, from 1 to 6 ppm, were collected.
Then, feature extraction was performed on the resistance-based data. Ten features were proposed as inputs to five supervised learning algorithms to recognise and classify the selected VOC gases based on the labelled output. The confusion matrix and 10-Fold Cross Validation were used to evaluate each model's performance. As a result, the RF and kNN models achieved higher accuracy, 0.813 ± 0.035 and 0.803 ± 0.033, compared with LR, SVM and ANN, with accuracies of 0.447 ± 0.035, 0.403 ± 0.041 and 0.419 ± 0.035 respectively. The two highest accuracies, achieved by the RF and kNN models, demonstrate that they distinguished the gases in the VOC dataset well.
Despite the gas sensor's shortcomings, such as low sensitivity, low selectivity, and noise in the sensor signal output, the findings of this study can be used as a guide for selecting the optimum algorithm for dealing with a gas sensor array. The performance of the kNN and RF models is a proof of concept that these algorithms can perform gas classification tasks using the simplest features selected from the steady-state phase. The feature extraction approach, on the other hand, can be explored further on the raw signal to build a dataset with more significant features and relevant information and so improve the algorithms' performance.