CLASSIFICATION OF CHEST RADIOGRAPHS USING NOVEL ANOMALOUS SALIENCY MAP AND DEEP CONVOLUTIONAL NEURAL NETWORK

The rapid advancement in pattern recognition via deep learning has made it possible to develop autonomous medical image classification systems. Such systems have proven robust and accurate in classifying most pathological features found in medical images, such as airspace opacity, masses, and fractures. Conventionally, these systems take routine medical images with minimal pre-processing as the model input; in this research, we investigate whether saliency maps can serve as an alternative model input. Recent research has shown that applying saliency maps increases deep learning model performance in image classification, object localization, and segmentation. However, conventional bottom-up saliency map algorithms regularly fail to localize salient or pathological anomalies in medical images, because most medical images are homogeneous and lack variation in color and contrast. Therefore, we also introduce the Xenafas algorithm in this paper. The algorithm creates a new kind of anomalous saliency map, the Intensity Probability Map and the Weighted Intensity Probability Map. We tested the proposed saliency maps on five deep learning models based on common convolutional neural network architectures. The results of this experiment show that using the proposed saliency maps instead of regular chest radiograph images increases the sensitivity of most models in identifying images with airspace opacities. Using the Grad-CAM algorithm, we show how the proposed saliency maps shifted model attention to the relevant regions in chest radiograph images. In the qualitative study, the proposed saliency maps regularly highlighted anomalous features, including foreign objects and cardiomegaly; however, they were inconsistent in highlighting masses and nodules.


INTRODUCTION
The convolutional neural network (CNN) has become the de facto choice for image classification and object detection. CNN models have been shown to achieve human-level accuracy, including on medical images. Nevertheless, researchers are still finding ways to improve classification performance with novel ideas. The majority of this research focuses on developing ever more complex and deeper architectures. In this paper, we test the idea of changing the input type rather than the model architecture. Instead of a regular medical image, a saliency map is proposed as the alternative input.

Introduction to Saliency Map
Itti et al. [1] introduced the concept of the saliency map in 1998. A saliency map is a numerical map that localizes an object (or objects) in an image that is deemed interesting (salient). In other words, the map emphasizes relevant features in an image while at the same time suppressing irrelevant features. Saliency maps have been employed in many tasks, including image classification, object detection, and image segmentation [2,3].
Methods for creating a saliency map can be divided into top-down and bottom-up approaches [4]. In the bottom-up approaches, the saliency map is constructed based solely on the image's features. Features such as color, contrast, edges, and object placement are used to localize the image's salient regions. Well-known bottom-up algorithms include Binarized Normed Gradients (BING) for objectness [5], Fine-Grained [6], and Spectral Residual [7]. However, [8] stated that medical images produced by conventional modalities such as chest radiography (CXR), computed tomography (CT), and ultrasound are mostly homogeneous and possess very little color variation. In such situations, most conventional bottom-up algorithms fail to localize any salient object in the image; this is shown in Fig. 4.
In contrast, the top-down approach produces a saliency map based on the task given. The algorithm takes external cues from human or model feedback to construct the final saliency map. This method is fast becoming the mainstream solution, especially for medical images, as it can produce precise salient-region boundaries even in the presence of shades or reflections [9]. However, since these techniques are based on supervised CNN models, they naturally inherit CNN dependencies. First, they require a large number of annotated samples for training. Second, their development and deployment require access to accelerated hardware. These two requirements are an obstacle to the practical deployment of such technology in the medical field, especially in Malaysia. Currently, Malaysia lacks any open medical image dataset, and very few hospitals are equipped with, or have access to, accelerated hardware.

Anomalous Saliency Mapping
In this paper, we introduce a new algorithm called the Xenafas algorithm, which produces two novel anomalous saliency maps called the Intensity Probability Map (IPM) and the Weighted Intensity Probability Map (WPM). Unlike the bottom-up approach, which only takes internal image cues, our approach takes cues from the probability mapping of a pixel's intensity relative to a cluster of similar images. At the same time, it does not require annotated samples or accelerated hardware to create the saliency map, which is practical in the context of Malaysia's clinical settings. Therefore, the algorithm can be considered a middle ground between the bottom-up and top-down approaches. We test the algorithm on a chest radiograph (CXR) dataset to see if it can create a salient region by highlighting pathological features such as airspace opacities, masses, and foreign objects.

LITERATURE REVIEW
For readers who want more information on the saliency map, ref [8] provides an extensive review of the subject matter. This paper's literature review will focus on the application of the saliency map in medical image analysis.
The application of saliency in medical images can be separated into two categories, depending on when it is used. The majority of research only applied it post-training and solely for model interpretations; it is not actively involved in model training. For example, in [10], the saliency map produced by the class activation mappings (CAM) [11] is used to validate the feature selected by the CheXNeXt model for its classifications. Similarly, in [12], a saliency map is created via the guided back-propagation method [13], which is then used to provide interpretability for model classification on breast cancer image classifications. Research done by [14][15][16] also shows similar traits. However, it is vital to mention the finding by [17], in which the author demonstrated that most algorithms used to create this saliency map are inconsistent when repeated. Among all algorithms, the Grad-CAM [18] algorithm shows the most consistency. Thus, the trustworthiness of using a saliency map to validate clinical CNN models is questionable.
The second type of saliency map research actively uses it in model training. For example, in [19], the saliency map in the form of an attention map reduces a model's false-positive rate. Similar to a saliency map, the attention map produced by the Attention Gate (AG) algorithm suppresses irrelevant regions in the image. In [20], localization of pulmonary lesions in CXR images is achieved by extracting a saliency map from a CNN model. Likewise, in [21], a saliency map is generated and used to detect polyps in capsule endoscopy. Various bottom-up saliency algorithms are used for segmenting skin cancer in [22,23].
To the best of the authors' knowledge, no paper has yet examined the effect of using a saliency map as input for chest radiograph classification. Therefore, in this paper, we test whether using the proposed saliency maps, IPM and WPM, enhances the classification performance of CNN models. In addition, we also test whether the proposed saliency maps successfully highlight pathological features in a CXR image. For the interested reader, a review of the classification of CXR by supervised CNN models can be found in [24].

The Xenafas Algorithm
We propose the Xenafas method, an algorithm that indicates the location of anomalous regions on a CXR image based on the likelihood of a pixel's intensity (opacity) at a given location. The method starts with creating a control dataset. Images for this dataset must be cherry-picked to avoid images containing any form of anomaly. Examples of anomalies include, but are not limited to, any pathology, foreign bodies, and extreme variations such as dextrocardia, rotated films, and patients in non-standard body positions.
After the control dataset has been created, the images are clustered into several groups using the K-Means algorithm. This step is needed to address variation in patient body shape, image quality between x-ray machines, and the patient's body orientation when the x-ray image is taken. The number of clusters, K, depends on the homogeneity of the images in the dataset: a highly homogeneous dataset will use a K value of 1-3, while a heterogeneous dataset will use a value between 7 and 10. Next, the 2D pixel intensity distribution, or ProbMat, is created as shown in Algorithm 1. There are several ways to create a non-parametric probability function; one of the most popular is the Kernel Density Estimation (KDE) method. KDE is easy to implement but computationally intensive when scaled to high-resolution images. A dataset of 256 by 256 images would need 65,536 KDE models to create all the necessary ProbMat entries, which would quickly exhaust a computer's memory. Additionally, there is no clear guideline for determining the appropriate bandwidth value of a KDE model. We therefore propose an alternative to KDE. In this method, we first create a histogram of pixel intensity over all CXR images in subcluster K at a specific (x, y) location. From the histogram, a discrete probability function is obtained. A continuous function over all possible intensity values is approximated by combining the discrete probability function with cubic spline interpolation. The Savitzky-Golay filter is then applied to further smooth the probability distribution function. Using this continuous probability function, it is now possible to create a matrix (ProbMat) representing the probability of any intensity value at any given location. The pseudocode shown in Fig. 2 is used to create the anomalous saliency map.
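The histogram-spline-smoothing alternative to KDE can be sketched as follows. This is a minimal illustration only, assuming 8-bit images stacked as a NumPy array and using SciPy's cubic spline and Savitzky-Golay routines; the function names, bin count, and filter window are our choices, not values from the paper.

```python
import numpy as np
from scipy.interpolate import CubicSpline
from scipy.signal import savgol_filter

def intensity_prob_function(pixel_samples, n_bins=32):
    """Estimate a smooth probability function for the intensities observed
    at one (x, y) location across a cluster of control images."""
    counts, edges = np.histogram(pixel_samples, bins=n_bins, range=(0, 256))
    probs = counts / counts.sum()                 # discrete probability function
    centers = (edges[:-1] + edges[1:]) / 2
    spline = CubicSpline(centers, probs)          # continuous approximation
    grid = np.clip(np.arange(256), centers[0], centers[-1])
    dense = spline(grid)
    smooth = savgol_filter(dense, window_length=11, polyorder=3)
    return np.clip(smooth, 0, None)               # keep probabilities >= 0

def build_probmat(control_stack, n_bins=32):
    """control_stack: (n_images, H, W) array -> ProbMat of shape (H, W, 256)."""
    _, h, w = control_stack.shape
    probmat = np.empty((h, w, 256))
    for y in range(h):
        for x in range(w):
            probmat[y, x] = intensity_prob_function(control_stack[:, y, x], n_bins)
    return probmat
```

Note that after spline interpolation and smoothing the 256 values at each location are only approximately a probability function; for the anomaly maps only the relative likelihoods matter.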
In this part, the ProbMat is used to produce the Intensity Probability Map (IPM) and Weighted Intensity Probability Map (WPM) for all CXR images. The WPM function is given by Eq. (1):

    WPM(x, y) = I(x, y) * p_xy(I(x, y))                                  (1)

The weighted pixel intensity at position (x, y) is equal to the product of its intensity, I(x, y), and the intensity likelihood, p_xy(I(x, y)), at the same location. In WPM, the original pixel intensity acts as a weight for the likelihood. Thus, only anomalous regions with high opacity will be shown in WPM; lucent anomalies will be suppressed. In visualizing IPM and WPM images, pixels with lower likelihood have higher intensity (appear brighter) than pixels with high likelihood. Thus, a region that is marked brightly (highlighted) is a region that the algorithm considers to have anomalous features.
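The mapping step can be illustrated with the sketch below. The exact normalization and the point at which the likelihood is inverted for display are not fully specified here, so the inversion shown (low likelihood renders bright) is our reading of the text; the function names are illustrative.

```python
import numpy as np

def likelihood_map(image, probmat):
    """Look up p_xy(I(x, y)) for every pixel of an 8-bit image."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    return probmat[ys, xs, image.astype(int)]

def ipm(image, probmat):
    """Intensity Probability Map: low-likelihood pixels rendered bright."""
    p = likelihood_map(image, probmat)
    p_norm = p / (p.max() + 1e-12)
    return 1.0 - p_norm                 # invert so anomalies appear bright

def wpm(image, probmat):
    """Weighted map: the original intensity weights the map, so only opaque
    (high-intensity) anomalies stay bright; lucent anomalies are suppressed."""
    return (image / 255.0) * ipm(image, probmat)
```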
One fundamental weakness of the IPM and WPM images is that they suppress anatomical landmarks. Without anatomical landmarks, it is difficult to determine the location of an anomalous region relative to an organ. To solve this issue, we add the IPM/WPM heatmap as a layer on top of the corresponding image; however, only regions that exceed the Otsu threshold [25] are incorporated into the image. Examples of IPM, WPM, Infused-IPM (IIPM), and Infused-WPM (IWPM) are shown in Fig. 5. For comparison purposes, Fig. 4 shows the output of the conventional bottom-up saliency mapping algorithms Fine-Grained and Spectral Residual. The implementations of both algorithms are taken from OpenCV.
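The infusion step might look like the following sketch. Otsu's method is re-implemented here in plain NumPy for self-containment (in practice one would call OpenCV's threshold function), and the overlay rule in `infuse` is an illustrative assumption.

```python
import numpy as np

def otsu_threshold(gray):
    """Otsu's method: pick the threshold maximizing between-class variance."""
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    total = gray.size
    sum_all = float(np.dot(np.arange(256), hist))
    best_t, best_var = 0, -1.0
    w_b, sum_b = 0, 0.0
    for t in range(256):
        w_b += hist[t]                  # background pixel count
        if w_b == 0:
            continue
        w_f = total - w_b               # foreground pixel count
        if w_f == 0:
            break
        sum_b += t * hist[t]
        m_b = sum_b / w_b               # background mean
        m_f = (sum_all - sum_b) / w_f   # foreground mean
        var = w_b * w_f * (m_b - m_f) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def infuse(cxr, saliency_map):
    """Overlay only the saliency regions that exceed the Otsu threshold."""
    sal8 = (saliency_map * 255).astype(np.uint8)
    mask = sal8 > otsu_threshold(sal8)
    out = cxr.copy()
    out[mask] = sal8[mask]              # keep anatomy elsewhere untouched
    return out
```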

Classification Method
This paper aims to test whether replacing CXR images with IPM and WPM will improve CNN classification performance. Thus, to ensure any performance change is due to the input type and not the CNN architecture, only well-established deep CNN models are used. Figure 3 shows the network architecture used, with the base models being MobileNet, DenseNet121, ResNet50, VGG19, and Xception [26][27][28][29][30]. The base model implementations are taken from the TensorFlow (v2) library, with pre-trained weights from ImageNet [31]. All models are trained for 100 epochs; however, early stopping is executed if there is no improvement in the loss value after ten epochs. All models were trained and tested on the Google Cloud platform.
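The early-stopping rule described above (halt after ten epochs without loss improvement, capped at 100 epochs) corresponds in practice to Keras's EarlyStopping callback; the plain-Python sketch below makes the rule explicit. Class and function names here are ours.

```python
class EarlyStopper:
    """Stop training once the loss has not improved for `patience` epochs."""
    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.stale = 0

    def should_stop(self, loss):
        """Record this epoch's loss; return True when training should stop."""
        if loss < self.best:
            self.best, self.stale = loss, 0
        else:
            self.stale += 1
        return self.stale >= self.patience

def train(losses_per_epoch, max_epochs=100, patience=10):
    """Simulated loop: consume per-epoch losses, return the epoch at which
    training halts (early stop or the 100-epoch cap)."""
    stopper = EarlyStopper(patience)
    for epoch, loss in enumerate(losses_per_epoch[:max_epochs], start=1):
        if stopper.should_stop(loss):
            return epoch
    return min(len(losses_per_epoch), max_epochs)
```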
We have chosen the standard classification metrics for model validation: precision, sensitivity, the area under the receiver operating characteristic curve (ROC-AUC), and the area under the precision versus sensitivity curve (PR-AUC).
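For concreteness, the two threshold-based metrics can be computed from binary labels and predictions as below; in the experiments these values would come from a metrics library, so this sketch is illustrative only.

```python
def classification_metrics(y_true, y_pred):
    """Precision and sensitivity (recall) for binary classification."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0      # of flagged, how many real
    sensitivity = tp / (tp + fn) if tp + fn else 0.0    # of real, how many found
    return precision, sensitivity
```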

Dataset
The dataset used in this research is the Google-NIH dataset [32]: NIH provides the images, while the labels are provided by Google [33]. It is important to note that only labels for the test and validation datasets are provided by the source. To create the training dataset for this study, we split the original validation dataset into new training and validation datasets with a ratio of 0.3. All datasets are imbalanced.
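A 0.3 split of this kind might be implemented as follows; the actual split code, seed, and any stratification used in the study are not stated, so this is a sketch under those assumptions.

```python
import random

def split_dataset(sample_ids, val_ratio=0.3, seed=42):
    """Shuffle sample IDs and split them into (train, validation) lists."""
    rng = random.Random(seed)          # fixed seed for a reproducible split
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_ratio)
    return shuffled[n_val:], shuffled[:n_val]
```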

Qualitative Validations
To test the clinical relevance of the proposed algorithm, several normal and anomalous CXR images were selected, and the resulting IPM and WPM were examined qualitatively by a certified radiologist. The anomalous CXRs chosen include images of a rotated film, a foreign body, cardiomegaly, and masses. For model interpretation, the Grad-CAM [18] algorithm is used to visualize which region of the CXR is relevant to the model when making the classification. Table 1 shows several classification metrics obtained by the various models and input-data types in classifying the test dataset for airspace opacity, while Table 2 shows similar metrics for the classification of CXR images with masses/nodules. Entries with the highest score for a particular metric are bolded.

Classification Result
The results are not particularly easy to decipher. The highest scores in PR-AUC, accuracy, and precision are obtained by ResNet50+Image, ResNet50+IIPM, and ResNet50+IWPM, respectively. Xception+Image and VGG19+Image do have higher precision scores; however, both results were rejected because their sensitivity scores were less than 0.5, meaning the models falsely labeled the majority of positive samples. The model with the highest sensitivity score is VGG19+WPM, with a score of 0.930. However, its precision is quite low, only 0.672, so the next model, DenseNet121+IWPM, is a better choice, having obtained 0.893 in sensitivity and 0.775 in precision. Meanwhile, DenseNet121+Image obtained the highest ROC-AUC score, 0.877.
Next, we analyze whether using the proposed anomalous saliency mapping as input results in a better classifier for the airspace opacity dataset. We are particularly interested in whether such a change in input can boost the performance of shallower CNN models (MobileNetV2 and DenseNet121) to a level comparable with deeper CNN models (ResNet50, VGG19, and Xception). What is evident from the results is that using the alternative data types as input enhances model sensitivity. For example, VGG19+WPM, which obtained the highest sensitivity, achieved a 108.3% improvement compared to VGG19+Image. This sensitivity improvement is more apparent in the deeper CNN models (ResNet50, VGG19, and Xception) than in the shallower ones (MobileNetV2 and DenseNet121).
As one might expect, any improvement in sensitivity tends to reduce model precision. Nevertheless, in most results, the degree of precision reduction is less than the degree of sensitivity gain. For example, the model Xception+IIPM obtained an increase of 30.4% in sensitivity while only reducing its precision by 7.1% compared to Xception+Image.
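The percentage changes quoted above are simple relative changes against the same model with Image input, as in the sketch below.

```python
def relative_change(new, old):
    """Percentage change of a metric score relative to a baseline score."""
    return (new - old) / old * 100.0
```

For example, a sensitivity that rises from 0.400 to 0.800 is a 100% improvement by this measure.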
The answer to which CNN model and input data type performs best depends on the purpose of the model. For screening purposes, DenseNet121+IWPM is the recommended model, as it obtained the second-best sensitivity score while maintaining a reliable precision score. For precise clinical classification, ResNet50+IWPM is recommended. It is worth noting that, since this dataset is imbalanced, the PR-AUC score is more important than the ROC-AUC score. DenseNet121+IWPM also obtains a PR-AUC score of 0.904, a mere 0.003 less than the highest score, 0.907, obtained by ResNet50+Image. In addition to obtaining a reliable classification score, DenseNet121+IWPM has the advantage of requiring fewer computing resources than ResNet50, VGG19, and Xception, making it more practical to deploy in Malaysian hospitals.
In Table 2, the results show that all models failed to achieve acceptable classification performance on the mass/nodule test dataset. No model obtained a precision score of more than 0.5, meaning the majority of positive classifications were actually false. It is worth pointing out that all sensitivity scores for Image input were lower than 0.5; thus, all such models missed the majority of the mass/nodule samples. Only DenseNet121+WPM, ResNet50+WPM/IPM, and VGG19+IPM managed to achieve a sensitivity score above 0.5.

Figure 4 shows examples of saliency maps produced by the Fine-Grained and Spectral Residual algorithms, both conventional bottom-up algorithms. As shown in the figure, Fine-Grained failed to emphasize or suppress any feature in the image. Conversely, Spectral Residual suppressed almost all features, making it impossible to extract any meaningful information from its saliency map. In line with what was mentioned in [9], conventional bottom-up saliency map algorithms cannot produce meaningful mappings for CXR images. The IPM and WPM produced from a normal CXR image are shown in Fig. 5(b) and 5(c), respectively. There is no apparent region highlighted in the WPM image, implying that the algorithm does not identify any anomaly in the original CXR image. However, the perihilar region is incorrectly highlighted in the IPM image; this may suggest that the IPM is over-sensitive in highlighting anomalies in CXR images.

Qualitative Assessment
Next, we examine how the algorithm processes CXR images taken of an incorrectly positioned patient, or from a rotated x-ray film. For example, in Fig. 5(d), the patient's trachea is not located in the midline, suggesting that the patient may be rotated relative to the film. This orientation gives the appearance that the right lung is more lucent than the left. In the produced IPM image, the lucent region is highlighted, whereas WPM does not highlight this feature, as WPM suppresses lucent anomalies. Whether or not the cause of the right lung lucency is clinically significant, it is still an anomaly from the imaging perspective. Thus, the algorithm should highlight this anomaly, as in the IPM image, and leave its validation to a radiologist.
Another feature indicating that the patient is in an abnormal position is the presence of teeth in the CXR image. This anomaly is highlighted in the IPM and, more evidently, in the WPM image. Teeth are not usually present in a CXR, so they are a form of anomaly that should be highlighted by the algorithm. However, the algorithm incorrectly highlighted the patient's breasts; this error may have been caused by the lack of images containing breasts in the control dataset. In the WPM image of a CXR containing a foreign body, the matter around the foreign body is suppressed, making it appear lucent. With this result, it can be assumed that WPM images may help in identifying foreign objects in CXR.
Next, the algorithm's capability to highlight pathological changes is demonstrated. Figure 7(a) shows a CXR with cardiomegaly, which is highlighted clearly in both the IPM, Fig. 7(b), and WPM, Fig. 7(c), images. On the other hand, Fig. 7(d) shows a CXR with a homogeneous opacity at the right lower lung zone that does not obscure the cardiac border. This feature is not highlighted in either the IPM, Fig. 7(e), or WPM, Fig. 7(f), images. Additionally, the algorithm also frequently failed to highlight opacity due to masses and nodules. The single nodule in Fig. 8(a) was not highlighted in the resulting WPM image, Fig. 8(d). The same can be said for the multiple nodule-like opacities at the bilateral mid and lower lung zones shown in Fig. 8(b): only some of the lung masses are highlighted in the resulting WPM image, Fig. 8(e). An example of a correctly highlighted lung mass is shown in Fig. 8(c) and 8(f).

Grad-CAM Results
To meaningfully deploy a developed model in clinical use, it must show some degree of interpretability, and its classifications must be validated against biological markers. For this reason, we use Grad-CAM [18] to visualize which region of the input data is emphasized. Figure 9(a) shows an example of a CXR with airspace opacification. Figures 9(b)-(f) show the output of the Grad-CAM algorithm for DenseNet121 models trained with the different input data types. Only the models trained using WPM and IIPM as input correctly labeled the sample. An obvious pattern in Fig. 9 is that models that received the original CXR image (Image, IIPM, and IWPM) as input tend to mark the lower-left diaphragm, whereas the models that take IPM and WPM tend to focus more on the lung and shoulder regions. It is not certain why the DenseNet121+IPM model incorrectly labeled the image even though it correctly emphasized the lung region; one possible reason is that the model emphasized the right lung more than the left. Both models that correctly labeled the image, DenseNet121+WPM and DenseNet121+IIPM, emphasize the left lung region. Although DenseNet121+IWPM also marked the left lung region, it emphasized the diaphragm more, hence the wrong label.

Figure 10(a) shows a sample CXR that is positive for mass, located in the left upper and lower lobes. Figures 10(b)-(f) show the Grad-CAM output for ResNet50 models trained with the different input data types. The models trained using Image, IIPM, and IWPM show similarly marked regions, extending from the left clavicle to the right middle lobe of the lung. For IIPM and IWPM, the marked region does not cover any mass; thus, those models falsely labeled the image as negative. ResNet50+IPM correctly marked the left upper lobe and the mass contained in it, while ResNet50+WPM only weakly marked this region. No model correctly marked the mass at the left lower lobe. From the results shown in Fig. 10, it can be concluded that the trained models failed to learn the mass feature, which emphasizes the need for more effective feature extraction if masses in CXR are to be detected more accurately.
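The core of the Grad-CAM computation behind these visualizations can be sketched framework-independently: pool the gradients of the class score over each feature map, use the pooled values as weights, sum, and apply ReLU. The array shapes and function name below are our assumptions for illustration.

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM core. feature_maps, gradients: (H, W, C) arrays from the
    last convolutional layer; returns a (H, W) map normalized to [0, 1]."""
    weights = gradients.mean(axis=(0, 1))                 # global-average-pool grads
    cam = np.tensordot(feature_maps, weights, axes=([2], [0]))
    cam = np.maximum(cam, 0)                              # ReLU: keep positive evidence
    if cam.max() > 0:
        cam /= cam.max()                                  # normalize for display
    return cam
```

In practice the feature maps and gradients would be extracted from the trained TensorFlow model; the map is then upsampled to the input resolution and overlaid as a heatmap.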

CONCLUSION
In this paper, we introduced the Xenafas algorithm, which creates the IPM and WPM anomalous saliency mappings for CXR images. A qualitative study by a certified radiologist showed that the algorithm can highlight most foreign objects and cardiomegaly in the CXR samples tested; however, it is inconsistent in highlighting masses and nodules. It was also shown that using IPM and WPM instead of regular CXR images increases the sensitivity of most of the CNN models tested. Using the Grad-CAM algorithm, it was demonstrated that with IPM and WPM as input, the CNN models shifted their focus to more relevant CXR image regions. The results obtained from the experiments show that IPM and WPM can be an alternative to regular CXR images for future machine learning development.