A PRUNED VGG16 WITH HYBRID PREPROCESSING AND DATA BALANCING FOR ROBUST AND INTERPRETABLE LUNG CANCER CLASSIFICATION
Marwa Salih Ramadhan1*, and Mohammed Ahmed Shakir2
1 Technical Administrative College, Duhok Polytechnic University, Kurdistan Region, Iraq
1, 2 Electrical and Computer Engineering Department, College of Engineering, University of Duhok, Kurdistan Region, Iraq
*Corresponding author email: marwaeng.abdo@gmail.com
Received: 29 May 2025 Accepted: 14 Jul 2025 Published: 8 Oct 2025 https://doi.org/10.25271/sjuoz.2025.13.4.1597
Lung cancer is the most common and deadliest type of cancer globally, creating a critical need for diagnostic tools that are not only accurate but also practical for clinical integration. This study introduces a robust, computationally efficient, and interpretable deep learning framework using Computed Tomography (CT) images to address limitations in existing models, such as high computational costs, poor data quality, and a lack of transparency. Our approach utilizes a VGG16 architecture, streamlined through structured pruning, which reduced the parameter count from 138.3M to 26.6M without compromising performance. We developed a hybrid pipeline with dual filtering and adaptive CLAHE to enhance image quality, while data diversity and imbalance were mitigated using hybrid augmentation and SMOTE. The model was trained with a rigorous strategy, including four-fold cross-validation and dual-phase fine-tuning with a dynamic learning rate, ensuring stable convergence. On a primary single-source dataset, the model achieved a test accuracy of 0.9910 and a Matthews Correlation Coefficient (MCC) of 0.9845. To validate real-world applicability, the framework was tested on a large multi-source dataset, demonstrating strong generalization with a balanced accuracy of 0.9693 and an MCC of 0.9427. Model interpretability was confirmed using Grad-CAM visualizations to highlight clinically relevant regions. This framework provides a highly accurate, computationally efficient, and generalizable solution with significant potential for clinical deployment as a reliable diagnostic aid.
KEYWORDS: Lung cancer; Deep Learning (DL); Computed Tomography (CT); Transfer learning; Hybrid Preprocessing; VGG16; Synthetic Minority Oversampling Technique (SMOTE); Grad-CAM.
1. INTRODUCTION
Over the past decades, lung cancer has increased dramatically and has remained the most prevalent and deadly malignancy worldwide (Bray et al., 2024). According to the International Agency for Research on Cancer, in 2022, lung cancer accounted for 12.4% of all new cancer cases (2.5 million) and 18.7% of cancer deaths (1.8 million). Projections estimate a 77% increase in these statistics by 2050 (Leiter et al., 2023). The diagnosis begins when pulmonary nodules are detected on medical imaging; these nodules indicate abnormal and uncontrolled growth of lung cells and are classified as either benign (slow-growing, non-metastatic) or malignant (rapidly growing and metastatic). This kind of cancer is often detected at advanced stages when treatment options are limited and survival chances are lower (WHO, 2023). In this regard, it would seem that improving early detection methods could reduce the high death rate due to lung cancer.
It is worth noting that several methods exist for detecting lung nodules in the chest. Among these, Computed Tomography (CT) is considered the most effective because it produces high-resolution, cross-sectional images of the lungs using X-rays. For this reason, the resulting images offer better visibility of lung nodules compared to standard chest X-rays (Murad et al., 2023). Furthermore, a special type of CT, known as Low-Dose Computed Tomography (LDCT), uses significantly less radiation while still providing accurate detection. Studies have shown that LDCT can reduce lung cancer mortality by around 20%. Therefore, it was recommended that LDCT be used as a standard screening tool for high-risk individuals (National Lung Screening Trial Research Team, 2011).
However, manually interpreting CT scans remains time-consuming and may lead to inconsistent diagnoses between experts. Consequently, these limitations may cause treatment delays and negatively impact patient outcomes. Therefore, there is a critical need for automated diagnostic systems to support clinical decision-making. Recent advances in deep learning (DL), particularly Convolutional Neural Networks (CNNs), have shown significant promise in medical imaging because they can automatically learn discriminative features directly from raw images. Studies have demonstrated that CNNs can detect and classify lung nodules with accuracy comparable to that of experienced radiologists (Ibrahim & Mahmood, 2023).
Among the many available CNN architectures, VGG16 has become widely adopted for its effectiveness in medical imaging. Its simple and well-organized design includes small 3x3 filters and stacked convolutional layers, allowing it to extract a rich hierarchy of features, from basic edges to complex patterns. Furthermore, its use of ReLU activation adds non-linearity, enhancing the model’s overall learning capacity. Ultimately, these extracted features are passed to fully connected layers for classification, making VGG16 a powerful tool for medical analysis (Ardila et al., 2019).
Problem Statement:
Although deep learning (DL) has shown growing success in lung cancer detection and classification, several limitations hinder its clinical deployment. A primary challenge stems from medical datasets, which are often small and imbalanced. Furthermore, each dataset contains different variations and noise acquired from different CT scanners, which reduces model reliability and generalization. Another key issue is that standard DL architectures are computationally expensive, making them unsuitable for resource-constrained settings. While lightweight models offer an efficient alternative, they frequently fail to capture the fine-grained patterns required in medical imaging.
Furthermore, weak validation strategies, such as simple train-test splits and evaluation on single-source datasets, often lead to poor generalization on data from other sources. Additionally, many studies rely heavily on accuracy while neglecting more robust metrics, such as the F1-score, Matthews Correlation Coefficient (MCC), Cohen's Kappa, and Confidence Intervals (CI), that better reflect true model performance. Moreover, interpretability remains a significant concern: most DL models function as "black boxes," which can limit clinical trust and acceptance. These compounding challenges underscore the need for DL frameworks that are efficient, interpretable, and rigorously validated for real-world clinical use.
Study Objectives:
This study aims to develop a robust, efficient, and interpretable VGG16 model for multi-class lung cancer classification that generalizes across diverse CT data for reliable clinical use.
Specific Objectives:
1. To mitigate challenges related to data quality and imbalance by implementing an advanced preprocessing pipeline and class-balancing techniques. This will enhance the model's reliability and generalization across diverse and noisy datasets.
2. To reduce the computational cost of the VGG16 architecture through strategic parameter pruning and fine-tuning. Consequently, the resulting model will be efficient enough for deployment in resource-constrained clinical environments.
3. To improve model transparency and build clinical trust by integrating explainability (XAI) methods. These tools will visualize the model’s decision-making process, moving beyond a 'black box' approach.
4. To conduct a rigorous and clinically relevant evaluation of the model. Therefore, performance will be assessed using K-Fold cross-validation with comprehensive metrics and validated on multi-source data to ensure its real-world effectiveness.
This study makes several key contributions to automated lung cancer detection by addressing common limitations in data preparation, model efficiency, and validation. Our main contributions are:
1. A Quantitatively Validated Preprocessing Pipeline: We developed a pipeline that combines hybrid filtering and adaptive contrast enhancement. Its effectiveness was confirmed through comparative analysis and quantitative metrics (PSNR, SSIM) to ensure it improves image quality without introducing artifacts.
2. A Hybrid Strategy for Data Diversity and Balancing: We implemented a strategy combining hybrid augmentation with SMOTE oversampling. This creates a more diverse and balanced dataset, which directly improves the model's ability to generalize.
3. A Lightweight and Efficient Deep Learning Architecture: We developed a highly efficient model by pruning the pretrained VGG16 architecture and applying a dual-phase fine-tuning strategy. This reduced the model's parameters and memory size by more than 80% without sacrificing accuracy, making it practical for clinical deployment.
4. A Robust Multi-Layered Validation: We employed a two-tiered strategy combining K-Fold Cross-Validation and external validation on a large multi-source dataset. This comprehensive approach, using metrics like F1-score, MCC, and 95% CI, demonstrates the model's robustness and real-world generalizability.
5. Enhanced Clinical Interpretability: We integrated Grad-CAM to provide visual explanations of the model’s predictions. These heatmaps enhance transparency, build clinical trust, and support expert review.
To design a clinically robust and generalizable lung cancer classification system, we examined recent deep learning research using CT imaging. This section categorizes key studies by methodology and outlines common limitations—particularly in preprocessing, class balancing, validation, and explainability—that our proposed framework aims to overcome.
Summary of Recent Studies in Lung Cancer Classification:
Automated lung cancer diagnosis has progressed from traditional machine learning to advanced deep learning pipelines. Current literature highlights diverse architectural approaches, from pretrained CNNs to custom and hybrid models.
Preprocessing and Data Handling:
Effective preprocessing is a critical first step for any diagnostic model. The literature shows a wide range of approaches. Many studies have employed foundational techniques such as image resizing, normalization, and basic filtering (Anand et al., 2022; Tandon et al., 2022). More advanced studies have incorporated specific denoising filters like Gaussian or median filters (Gupta et al., 2023; Ravindra et al., 2024) and contrast enhancement using methods like CLAHE to improve the visibility of nodules (Gupta et al., 2023; Kamath & Singh, 2024). To address the common challenges of small dataset sizes and class imbalance, data augmentation is a near-universal practice (Anand et al., 2022; Jassim et al., 2024; Tandon et al., 2022). Some studies have further utilized oversampling techniques like the Synthetic Minority Oversampling Technique (SMOTE) to balance class distributions (Kumaran et al., 2024).
Architectural Models Commonly Applied in Lung Cancer Diagnosis:
The choice of model architecture is central to performance. We can categorize the approaches in the literature into three main groups:
1. Transfer Learning with Standard Architectures: The most prevalent approach involves leveraging pre-trained CNNs. Models such as VGG16 (Anand et al., 2022; Benamara et al., 2024; Ghosh et al., 2023; Klangbunrueang et al., 2025), MobileNetV2 (Alheeti et al., 2024; Ghosh et al., 2023), ResNet (Al-Shouka & Alheeti, 2023; Sangeetha et al., 2023), and InceptionV3 (Anand et al., 2022) are commonly fine-tuned for lung cancer classification, often achieving high accuracy scores. These studies have indicated the power of transfer learning in medical imaging.
2. Complex and Ensemble Models: To further boost accuracy, some researchers have developed more complex systems. This includes creating novel hybrid architectures, such as VCNet, which combines VGG16 with a Capsule Network (Tandon et al., 2022), or IRRCNN, which integrates Inception and recurrent layers (Anusha & Reddy, 2023). Another popular strategy is ensembling, where predictions from multiple models (e.g., ResNet, EfficientNet) are combined to produce a more robust final decision (Jassim et al., 2024; Kumaran et al., 2024). These approaches often report state-of-the-art accuracies, frequently exceeding 99% on their respective test sets.
3. Handcrafted CNNs: A smaller subset of studies builds custom, handcrafted CNNs from scratch (Anand et al., 2022; Gupta et al., 2023). While sometimes effective, these methods are slightly outperformed by transfer learning and can be less scalable.
Validation Strategies and Generalization:
A model's true clinical value is determined by its ability to generalize to new, unseen data. However, the validation strategies reported in the literature are often limited. The vast majority of studies evaluate their models using a simple train-test split on a single, homogenous dataset, such as LIDC-IDRI or IQ-OTH/NCCD (Kumaran et al., 2024; Tandon et al., 2022; Ghosh et al., 2023; Gupta et al., 2023; Naseer et al., 2023).
Clinical Interpretability:
For a diagnostic AI tool to be trusted by clinicians, it must be interpretable. Explainable AI (XAI) techniques like Grad-CAM, which generate heatmaps to show where the model is focusing, are crucial for building this trust. However, XAI is still underutilized in the field, with only a few recent studies incorporating it (Kumaran et al., 2024; Klangbunrueang et al., 2025).
Limitations in Existing Literature and Our Proposed Solutions:
Although deep learning has advanced lung cancer classification, a review of the literature reveals several persistent limitations that hinder the development of clinically robust models. The present study directly addresses these challenges related to data preparation, model complexity, and validation.
First, many studies employ inadequate preprocessing methods without quantitatively validating their impact. For instance, some rely on basic filters alone (Ravindra et al., 2024; Sangeetha et al., 2023), while others apply standard contrast enhancement techniques (Alheeti et al., 2024; Gugulothu & Balaji, 2023). Similarly, data augmentation is frequently simplistic (Anand et al., 2022; Tandon et al., 2022), and oversampling methods like SMOTE are often used in isolation, overlooking the need for greater data diversity (Kumaran et al., 2024). In contrast, our framework implements an advanced pipeline combining a hybrid filter with dynamic contrast enhancement to improve image quality. Furthermore, we pair advanced augmentation—designed to simulate real-world variations from different CT scanners and imaging protocols—with SMOTE to ensure the model trains on a diverse and balanced dataset.
Additionally, other research often relies on overly complex architectures to achieve high accuracy. These ensemble or deep models typically have high computational costs and memory requirements (Anusha & Reddy, 2023; Jassim et al., 2024; Tandon et al., 2022). Consequently, their use is impractical in many resource-constrained clinical settings.
To address this, our work streamlines the VGG16 architecture through structured pruning. This method reduces the model's parameters by approximately 80%, creating a lightweight yet powerful model suitable for real-world deployment.
Furthermore, a critical weakness in the literature is the reliance on limited validation strategies. Most studies evaluate their models on a single data source with a basic train-test split (Alheeti et al., 2024; Al-Shouka & Alheeti, 2023; Balaji & Gugulothu, 2023; Jassim et al., 2024; Kumaran et al., 2024; Klangbunrueang et al., 2025; Naseer et al., 2023); consequently, these models often fail to prove they can generalize to different clinical environments. Performance is also often measured with narrow metrics like accuracy (Tandon et al., 2022; Sangeetha et al., 2023), which can be misleading.
Our study directly confronts these issues with a multi-layered validation strategy. First, we apply K-Fold Cross-Validation on the single-source dataset. Second, and most importantly, we retrained and validated our entire framework on a large, multi-source dataset—a step notably absent in much of the literature—to confirm its real-world generalization. Both validation approaches are evaluated with comprehensive metrics, including F1-score, MCC, 95% CI, Cohen’s Kappa, and error-based metrics.
Moreover, explainability is underutilized, with only rare studies implementing it (Kumaran et al., 2024; Klangbunrueang et al., 2025). We address this gap by integrating Grad-CAM, which provides transparent visualizations to help build clinical trust.
3. METHODS AND MATERIALS
This section outlines the experimental methodology developed for our automated lung cancer detection framework. Our end-to-end pipeline integrates a quantitatively validated preprocessing stage, a hybrid data balancing strategy, a pruned and fine-tuned VGG16 model, a multi-layered validation protocol, and Grad-CAM for interpretability. This approach was systematically designed to overcome common limitations in the literature, including poor data quality, model complexity, and limited generalization. The overall workflow is presented in Figure 1, and each step is detailed in the sections below.
Figure 1: Implementation Steps of Proposed Methodology
Dataset Description:
The proposed model was developed and validated using the publicly available IQ-OTH/NCCD dataset (Al-Yasriy et al., 2020). This dataset, professionally annotated by expert radiologists and oncologists, comprises 1,097 grayscale CT images, each with a resolution of 512×512 pixels. The images are categorized into three clinically relevant classes: Benign, Malignant, and Normal. As stated in Table 1, the dataset exhibits two primary challenges: a modest overall sample size and a significant class imbalance, with malignant cases constituting over half of the data. These characteristics necessitate a rigorous data partitioning and validation strategy to ensure a robust and unbiased model evaluation.
Table 1: Class Distribution of the IQ-OTH/NCCD Dataset
| Class | Description | Number of Samples | Percentage (%) |
|---|---|---|---|
| Benign | Non-cancerous lung nodules | 120 | 10.94% |
| Malignant | Cancerous (lung cancer) nodules | 561 | 51.14% |
| Normal | Healthy lung images with no nodules | 416 | 37.91% |
| Total | — | 1,097 | 100% |
Preprocessing Pipeline and Comparative Evaluation:
Effective preprocessing is fundamental to enhancing the quality and diagnostic utility of lung CT images for deep learning models. This study, therefore, systematically investigates various preprocessing techniques to identify an optimal pipeline for lung nodule classification. The goal is to address key artifacts in medical images, specifically noise, contrast variability, and spatial inconsistencies, by evaluating the performance of multiple denoising and contrast enhancement methods. Quantitative evaluation was performed using two widely accepted metrics, the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index Measure (SSIM). PSNR measures pixel-level fidelity, while SSIM provides a more perceptually relevant assessment by comparing luminance, contrast, and structural information (Al Najjar, 2024; Rodrigues et al., 2024). Higher values for both metrics signify superior image quality and structural preservation. A detailed summary of the preprocessing configurations and their parameters is provided in Table 2.
Table 2: Summary of Preprocessing Steps and Parameters Setup
| # | Step | Technique | Parameters | Purpose |
|---|---|---|---|---|
| 1 | Noise Reduction | Hybrid Filtering (Median + Guided) | Median Filter kernel 3×3; Guided Filter (radius = 5, smoothness controller ε = 0.1) | Removes salt-and-pepper noise while preserving critical edge and texture details, outperforming single-filter methods. |
| 2 | Contrast Enhancement | Dynamic CLAHE | Tile Grid: 8×8; Base Clip Limit: 1.5; Scale Factor: 0.3; Entropy Thresholds: E_Low = 5, E_High = 6 | Enhances local contrast for better feature visibility while preventing noise over-amplification in high-contrast regions. |
| 3 | Standardization | Resizing & Normalization | Method: Lanczos Interpolation; Target Size: 224×224; Range: [0, 1] | Ensures uniform input dimensions and pixel value range for stable training. |
Hybrid Filtering for Denoising: Comparative Evaluation:
To suppress imaging noise while preserving anatomical structures, we evaluated several denoising algorithms to identify the most suitable technique. Table 3 summarizes the quantitative results across methods.
Table 3: Comparative Analysis of Denoising Techniques (Higher values indicate better quality)
| Denoising Technique | PSNR (dB) | SSIM | Observations |
|---|---|---|---|
| Bilateral Filtering | 30.75 | 0.7754 | Over-smoothed; loss of texture |
| Gaussian Filtering | 35.79 | 0.8729 | Smooth output; edge blurring observed |
| Median Filtering (3×3) | 39.82 | 0.8845 | Strong denoising; moderate structure preservation |
| Median Filtering (5×5) | 36.28 | 0.8191 | Excessive smoothing; reduced fine detail retention |
| Guided Filtering | 36.54 | 0.8992 | Highest SSIM; preserves edges well, less effective on noise |
| Non-Local Means (NL-Means) | 36.86 | 0.8595 | Good PSNR; slight blurring; time-consuming |
| NL-Means + Guided Filtering | 35.73 | 0.8298 | Redundant smoothing; degraded structure clarity |
| NL-Means + Bilateral Filtering | 29.53 | 0.6934 | Suboptimal; lowest PSNR and SSIM |
| Proposed: Median (3×3) + Guided | 39.88 | 0.8845 | Best PSNR, high SSIM; balanced edge/detail preservation |
As shown in Table 3, the Hybrid Filtering method (combining Median 3×3 and Guided Filter) was selected as the optimal denoising strategy. It outperformed standalone techniques by achieving the highest PSNR (39.88 dB) and a strong SSIM (0.8845), combining the Median filter's spatial smoothing with the Guided Filter's edge-aware properties to balance noise reduction with structural preservation. While Guided Filtering alone yielded a higher SSIM (0.8992), its lower PSNR suggested undesirable over-smoothing. Therefore, the hybrid configuration was adopted as the foundation of our preprocessing pipeline.
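A minimal sketch of this hybrid denoising step is given below, assuming 8-bit grayscale inputs and OpenCV with the ximgproc contrib module; the parameter values follow Table 2, while the implementation itself is illustrative rather than the exact code used in this study.

```python
import cv2
import numpy as np

def hybrid_denoise(img_gray: np.ndarray) -> np.ndarray:
    """Median (3x3) smoothing followed by edge-preserving guided filtering."""
    # Step 1: suppress salt-and-pepper noise with a small median kernel.
    median = cv2.medianBlur(img_gray, 3)
    # Step 2: guided filter (radius = 5, eps = 0.1), using the median output as
    # its own guide, smooths flat regions while keeping edges sharp.
    # Requires the opencv-contrib-python package (cv2.ximgproc).
    return cv2.ximgproc.guidedFilter(median, median, 5, 0.1)

# Optional quality check, as in Table 3 (scikit-image):
# from skimage.metrics import peak_signal_noise_ratio, structural_similarity
# denoised = hybrid_denoise(img)
# print(peak_signal_noise_ratio(img, denoised), structural_similarity(img, denoised))
```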
Dynamic CLAHE for Contrast Enhancement:
Following denoising, local contrast enhancement was performed using Contrast-Limited Adaptive Histogram Equalization (CLAHE) to improve the visibility of fine structures in lung CT images. Unlike standard CLAHE, which uses a fixed clip limit and may result in over-amplification of noise in homogeneous or high-contrast regions, we implemented a dynamic clip limit adjustment based on image entropy. This strategy is designed to prevent over-enhancement in high-contrast images while selectively boosting contrast where it is most needed. The adaptive clip limit is computed using the following Equation (1):
$$
\mathrm{ClipLimit}(E)=
\begin{cases}
\mathrm{CL}_{\mathrm{base}}, & E \ge E_{\mathrm{High}} \\
\mathrm{CL}_{\mathrm{base}} + SF \times \dfrac{E_{\mathrm{High}} - E}{E_{\mathrm{High}} - E_{\mathrm{Low}}}, & E_{\mathrm{Low}} < E < E_{\mathrm{High}} \\
\mathrm{CL}_{\mathrm{base}} + SF, & E \le E_{\mathrm{Low}}
\end{cases}
\qquad (1)
$$

where E is the image histogram entropy, CL_base = 1.5 is the default base clip limit, SF (default: 0.3) is the scaling factor, and E_Low = 5 and E_High = 6 are the entropy thresholds that control the adjustment mechanism.

For images with entropy exceeding the upper threshold of 6, the default base clip limit is applied. This is a deliberate design choice: the entropy values in our dataset ranged from 4.98 to 7.12 (mean = 6.20), so setting the threshold slightly below the average ensures that the majority of images are treated cautiously, minimizing the risk of noise amplification and structural artifacts in these already detailed images. In contrast, for the rare images with entropy below the lower threshold of 5, the clip limit is increased by the scaling factor (SF) to significantly boost contrast. Using entropy to guide the clip limit aligns with the recommendations of Chang et al. (2018), who demonstrated its benefits for medical image enhancement.
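An illustrative implementation of this entropy-adaptive clip limit is sketched below, assuming OpenCV's CLAHE and 8-bit grayscale inputs; the linear interpolation between the two entropy thresholds reflects the reading of Equation (1) given above and should be treated as a sketch.

```python
import cv2
import numpy as np

def shannon_entropy(img_gray: np.ndarray) -> float:
    """Histogram entropy (in bits) of an 8-bit grayscale image."""
    hist = cv2.calcHist([img_gray], [0], None, [256], [0, 256]).ravel()
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def dynamic_clahe(img_gray, cl_base=1.5, sf=0.3, e_low=5.0, e_high=6.0):
    """CLAHE (8x8 tiles) whose clip limit adapts to image entropy, per Eq. (1)."""
    e = shannon_entropy(img_gray)
    # Boost grows from 0 (entropy >= e_high) up to sf (entropy <= e_low).
    boost = sf * float(np.clip((e_high - e) / (e_high - e_low), 0.0, 1.0))
    clahe = cv2.createCLAHE(clipLimit=cl_base + boost, tileGridSize=(8, 8))
    return clahe.apply(img_gray)
```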
The effectiveness of our proposed dynamic CLAHE was evaluated against the standard approach using PSNR and SSIM metrics. As reported in Table 4, the adaptive method achieved higher PSNR and markedly superior SSIM than the fixed‑limit baseline, confirming better structural preservation and perceptual quality.
Table 4: Performance Comparison of Contrast Enhancement Techniques
| Contrast Enhancement Technique | Average PSNR (dB) | Average SSIM |
|---|---|---|
| Dynamic CLAHE (Adaptive Clip Limit, 8×8 Tile Grid) | 31.08 | 0.8584 |
| Standard CLAHE (Fixed Clip Limit: 1.5, 8×8 Tile Grid) | 29.12 | 0.7288 |
The effect of the proposed preprocessing pipeline on CT lung images, along with corresponding histogram visualizations, is illustrated in Figure 2. As observed, the hybrid filtering stage effectively reduces background noise while preserving critical anatomical structures. Following this, the application of dynamic CLAHE significantly enhances local contrast, improving the visibility of subtle pathological features. The accompanying histograms reveal a notable redistribution and widening of pixel intensity values, confirming enhanced feature differentiation and improved perceptual quality.
Figure 2: Effect of Preprocessing on CT Lung Images with Histogram Visualization
The figure illustrates representative CT scan samples at different stages of preprocessing. From left to right: (1) Original grayscale image, (2) Denoised output using Hybrid Filtering (Median + Guided), and (3) Contrast-enhanced image using dynamic CLAHE. Each image is accompanied by its corresponding histogram to highlight intensity distribution changes.
Hybrid Data Augmentation and Balancing Strategy:
To address the limited dataset size and pronounced class imbalance, a comprehensive, multi-stage data augmentation and balancing strategy was employed within each training fold of the cross-validation. This hybrid approach was designed to enhance data diversity and mitigate class underrepresentation, thereby improving model generalization, as detailed below.
Static (Offline) Augmentation:
Before training started, the images in the training set underwent offline augmentation. This involved applying a set of transformations, including affine transformations, elastic deformations, and cut-out augmentation, to the original images. The augmentation rate was class-specific to target underrepresented classes more aggressively. The number of augmented images for each class was determined by Equation 2:
$$
N_{\mathrm{aug}} = \max\!\left(\left\lceil \frac{T}{N_{\mathrm{orig}}} \right\rceil,\; M_{\min}\right) \times N_{\mathrm{orig}} \qquad (2)
$$

where N_aug is the total number of augmented images for the class, N_orig is the count of original images, M_min is the minimum augmentation multiplier, and T = 1,000 is the per-class target. The threshold of 1,000 images per class was established as a robust baseline for balancing, ensuring that even the most underrepresented class (Benign) had a sufficiently large and diverse set of samples for the model to learn meaningful features. This value was determined through preliminary experiments, which indicated that smaller thresholds led to underfitting on the minority classes, while significantly larger thresholds offered diminishing returns on performance at a higher computational cost. This approach provided a balanced trade-off between data diversity and training efficiency. Furthermore, multiprocessing was used to parallelize transformations and improve efficiency.
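For illustration, the sketch below evaluates Equation (2) on the training-fold class counts reported in Figure 3; the minimum multiplier value of 4 is a hypothetical setting chosen only so that the example reproduces the reported range of counts, not a value stated in the text.

```python
import math

def augmented_count(n_orig: int, target: int = 1000, m_min: int = 4) -> int:
    """Total images for a class after offline augmentation, per the Eq. (2) sketch."""
    multiplier = max(math.ceil(target / n_orig), m_min)  # at least the target, at least m_min-fold
    return multiplier * n_orig

# Training-fold class counts from Figure 3:
for cls, n in {"Benign": 72, "Normal": 246, "Malignant": 336}.items():
    print(cls, augmented_count(n))  # Benign -> 1008, Normal -> 1230, Malignant -> 1344
```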
Dynamic (Real-time) Augmentation:
During the training process, on-the-fly transformations were applied to each batch of data using the Keras ImageDataGenerator. These included random rotations (±10°), brightness variations (scale 0.7–1.3), zooming, width/height shifts (10%), and shear distortions. This dynamic approach ensures the model encounters slightly different versions of the images in each epoch, which is highly effective at reducing overfitting and improving model robustness without requiring additional storage.
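A configuration sketch of this real-time augmentation using the Keras ImageDataGenerator is shown below; the zoom and shear magnitudes are assumptions, since only their presence (not their ranges) is stated above.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# On-the-fly transformations applied independently to every training batch.
train_datagen = ImageDataGenerator(
    rotation_range=10,            # random rotations of +/- 10 degrees
    brightness_range=(0.7, 1.3),  # brightness scaling between 0.7 and 1.3
    zoom_range=0.1,               # random zoom (magnitude assumed)
    width_shift_range=0.1,        # horizontal shifts of 10%
    height_shift_range=0.1,       # vertical shifts of 10%
    shear_range=0.1,              # shear distortion (magnitude assumed)
)
# x_train / y_train are the preprocessed training arrays (illustrative names):
# train_flow = train_datagen.flow(x_train, y_train, batch_size=32)
```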
Feature-Space Synthetic Oversampling:
To further address class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) was applied in the feature space rather than on raw pixel data. During training, CT images were passed through the convolutional base of the VGG16 model to extract high-level embeddings, and SMOTE was used to generate synthetic feature vectors for minority classes by interpolating within this learned feature space. This approach preserves semantic consistency and avoids the visual artifacts often introduced by pixel-level oversampling, thereby enhancing minority class representation while maintaining clinical plausibility. Figure 3 shows the effect of the hybrid augmentation and balancing pipeline on the training set’s class distribution.
Figure 3: Class Distribution in the Training Set After Multi-Stage Balancing
The original imbalance (72 benign, 246 normal, 336 malignant) was mitigated through class-specific augmentation, followed by SMOTE, resulting in 1,344 samples per class. This improved diversity and generalization, particularly for the minority class.
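A sketch of the feature-space oversampling step is given below, assuming imbalanced-learn's SMOTE and a frozen VGG16 convolutional base with global average pooling as the feature extractor; the pooling choice and variable names are illustrative rather than the exact configuration used in this study.

```python
from imblearn.over_sampling import SMOTE
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import GlobalAveragePooling2D
from tensorflow.keras.models import Model

# Frozen convolutional base used purely as a feature extractor.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
extractor = Model(base.input, GlobalAveragePooling2D()(base.output))

# x_train: preprocessed CT images (grayscale replicated to 3 channels);
# y_train: integer class labels. Both are illustrative names.
# features = extractor.predict(x_train, batch_size=32)
# Interpolate synthetic minority-class samples in the learned feature space:
# features_bal, y_bal = SMOTE(random_state=42).fit_resample(features, y_train)
```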
Dynamic Class Weighting:
To mitigate residual class imbalance during training, dynamic class weighting was employed. This approach was especially important within the K-fold cross-validation framework, where stratified sampling maintains global class proportions but may still produce imbalances within individual batches. By recalculating class weights dynamically for each batch based on its class distribution, the model gives more importance to mistakes made on minority class samples. This real-time adjustment ensures consistent focus on underrepresented classes, enhancing feature learning and improving overall model fairness. Combined with augmentation and feature-space oversampling, this strategy forms a robust and generalizable learning pipeline.
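One way to realize this per-batch weighting is to recompute inverse-frequency sample weights from each batch's own label distribution, as in the minimal sketch below (function and variable names are illustrative):

```python
import numpy as np

def batch_sample_weights(y_batch: np.ndarray, n_classes: int = 3) -> np.ndarray:
    """Inverse-frequency weights recomputed from the labels of the current batch."""
    counts = np.bincount(y_batch, minlength=n_classes).astype(float)
    counts[counts == 0] = 1.0                      # guard against empty classes
    class_w = counts.sum() / (n_classes * counts)  # rarer classes receive larger weights
    return class_w[y_batch]

print(batch_sample_weights(np.array([1, 1, 1, 2, 2, 0])))  # minority label 0 gets the largest weight
# Usage (illustrative): model.train_on_batch(x_b, y_b_onehot, sample_weight=batch_sample_weights(y_b))
```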
To ensure a comprehensive and robust evaluation, we employed a multi-layered validation strategy. This approach includes both a rigorous internal validation on our primary dataset and a crucial external validation on a large, multi-source dataset to confirm real-world generalization.
Data Partitioning and Internal Validation Strategy on the Primary Dataset:
Our internal validation strategy was designed to ensure robust model training, stable hyperparameter tuning, and an unbiased final evaluation. This was achieved through a multi-stage process involving an initial data split, followed by K-fold cross-validation on the training portion.
First, the entire IQ-OTH/NCCD dataset was partitioned using stratified sampling into three distinct, non-overlapping subsets:
1. Training Set (60%): reserved exclusively for training the model using a cross-validation protocol.
2. Fixed Validation Set (20%): a hold-out set used as a consistent benchmark across all training iterations for early stopping and model checkpointing.
3. Hold-Out Test Set (20%): a completely separate set used only once for the final, unbiased performance evaluation of the best model.
The detailed distribution of samples across these splits is presented in Table 5.
Table 5: Stratified Train-Validation-Test Split of the IQ-OTH/NCCD Dataset
| Set | Benign | Malignant | Normal | Total |
|---|---|---|---|---|
| Training Set (for CV) | 72 | 336 | 249 | 657 |
| Fixed Validation Set | 24 | 112 | 83 | 219 |
| Hold-Out Test Set | 24 | 113 | 84 | 221 |
K-Fold Cross-Validation Protocol:
To train the model robustly, we applied a 4-fold cross-validation (CV) protocol exclusively to the Training Set (the 60% portion). In each of the four training iterations, three folds were used for model training, while the remaining fold served as an internal validation set to provide immediate feedback on generalization within the training data.
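This partitioning and cross-validation protocol can be sketched with scikit-learn's stratified utilities as follows; the placeholder arrays stand in for the preprocessed images and labels.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Illustrative stand-ins; real inputs are the 224x224 preprocessed CT images.
images = np.zeros((1097, 1), dtype="float32")
labels = np.repeat([0, 1, 2], [120, 561, 416])   # Benign, Malignant, Normal counts

# 60/20/20 stratified split: training (for CV), fixed validation, hold-out test.
x_tmp, x_test, y_tmp, y_test = train_test_split(
    images, labels, test_size=0.20, stratify=labels, random_state=42)
x_train, x_val, y_train, y_val = train_test_split(
    x_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42)  # 0.25 of 80% = 20%

# 4-fold cross-validation applied only to the 60% training portion.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
for fold, (tr_idx, in_val_idx) in enumerate(skf.split(x_train, y_train), start=1):
    x_tr, y_tr = x_train[tr_idx], y_train[tr_idx]
    x_in_val, y_in_val = x_train[in_val_idx], y_train[in_val_idx]
    # ... augment/balance x_tr, train, and monitor the fixed (x_val, y_val) set ...
```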
Model Selection and Final Evaluation:
Crucially, during each training iteration, model performance was monitored on the Fixed Validation Set (the 20% hold-out). This set provided a stable, consistent benchmark across all four CV runs. Decisions for early stopping and saving the best model checkpoint were based exclusively on the performance on this Fixed Validation Set, ensuring that the model selection process was stable and not influenced by the variability of the individual validation folds.
Finally, the single best-performing model identified through this entire process was evaluated once on the Hold-Out Test Set to report its final, unbiased performance.
Generalization Assessment on Multi-Source Dataset:
To assess the model's real-world generalization and robustness beyond a single data source, a final external validation was performed. This validation utilized a large, composite multi-source dataset comprising over 29,000 images compiled from five public sources, including our primary IQ-OTH/NCCD dataset.
For this assessment, the entire proposed framework—from the preprocessing pipeline to the model training and fine-tuning strategy—was reapplied to this new, diverse dataset. The model, after training, was then evaluated using a standard train-test split to simulate a realistic deployment scenario. This critical step confirms that the model's high performance is not limited to the characteristics of a single dataset and that it can generalize effectively to unseen data from different clinical environments. The specific composition of this dataset and the detailed outcomes of this evaluation are presented in Section 4.4.
Proposed Model Architecture and Efficiency Improvement:
Figure 4: Structural Comparison Between the Original VGG16
and the Proposed Lightweight VGG16 Model
As illustrated in Figure 4, the resulting lightweight architecture reduces the model's size by approximately 80%, making it over five times smaller than the original VGG16. This promotes efficient deployment of our model while simultaneously improving predictive accuracy through enhanced generalization. Figure 5 presents a comparison of parameters and model size between the original and modified versions.
Figure 5: Comparison of Parameter Count and Model Size between the Original VGG16 and the Modified Version.
Training Strategy:
Transfer Learning and Fine-Tuning:
To effectively utilize transfer learning, we adopted a multi-stage training strategy based on a modified pretrained VGG16 architecture. The model was designed for multi-class classification using the Categorical Cross-Entropy loss function. The convolutional layers served as feature extractors, capturing essential patterns from lung CT images.
In the initial training phase, the model was trained for 30 epochs using the Adam optimizer, while the base VGG16 layers were frozen and only the newly added custom classification head was updated. This step preserved the general-purpose features learned from ImageNet and enabled the model to start learning lung-specific patterns without modifying the core feature extractors. Following this, we implemented a two-phase fine-tuning strategy to gradually adapt the model to the unique characteristics of lung CT images while retaining the general representations learned during pre-training:
· First Fine-Tuning Phase:
The model was trained for 20 additional epochs using the Stochastic Gradient Descent (SGD) optimizer with the last 10 convolutional layers unfrozen. These included layers from block3_conv1 to block5_conv3, which are responsible for extracting mid- and high-level semantic features. Unfreezing these layers allowed the model to fine-tune the deeper representations related to lung tissue and lesion structures.
· Second Fine-Tuning Phase:
To further refine feature learning, we unfroze five additional earlier layers—including block2_conv1, block2_conv2, and earlier layers in block3. The model was retrained for another 10 epochs using the same optimizer with a lower initial learning rate. This allowed more accurate weight updates in both mid- and high-level layers, improving convergence and reducing overfitting.
Eventually, this selective fine-tuning strategy, where layers were gradually unfrozen from deep to shallow, enabled effective adaptation of pretrained features to the medical imaging domain. It also ensured stable training dynamics and strong generalization performance on the multi-source lung CT dataset.
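The mechanics of this staged transfer-learning setup can be sketched as follows. Note that the sketch uses the standard (unpruned) Keras VGG16 base with the classification head from Table 8, since the pruned layer configuration is not reproduced here, and the pooling layer and phase learning rates are assumptions.

```python
from tensorflow.keras import layers, models, optimizers, regularizers
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                                    # Phase 1: frozen feature extractor

x = layers.GlobalAveragePooling2D()(base.output)          # pooling choice assumed
for units in (128, 64):                                   # custom head per Table 8
    x = layers.Dense(units, activation="relu",
                     kernel_regularizer=regularizers.l1_l2(l1=0.001, l2=0.01))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.5)(x)
outputs = layers.Dense(3, activation="softmax")(x)
model = models.Model(base.input, outputs)

model.compile(optimizer=optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(..., epochs=30)                               # initial phase: head only

# First fine-tuning phase: unfreeze block3_conv1 .. block5_conv3 and switch to SGD.
base.trainable = True
set_trainable = False
for layer in base.layers:
    if layer.name == "block3_conv1":
        set_trainable = True
    layer.trainable = set_trainable
model.compile(optimizer=optimizers.SGD(learning_rate=1e-3, momentum=0.9),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(..., epochs=20); the second phase then unfreezes block2 layers the same way.
```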
Adaptive Learning Rate Scheduling:
To ensure stable and efficient model convergence, we implemented a custom adaptive learning rate scheduling strategy. This approach is designed to mitigate the risks of divergence during early training and to refine the model's parameters as it approaches an optimal solution. The schedule consists of two distinct phases: an initial warm-up phase followed by a stepwise decay phase. This two-phase process is detailed in Algorithm 1 and formulated in Equation 14.
Algorithm 1: Adaptive Learning Rate Schedule
Input: current epoch E, initial learning rate lr_initial, minimum learning rate lr_min, number of warm-up epochs E_warmup, decay rate R_decay, epoch drop period E_drop.
1: if E < E_warmup then
2:     lr ← lr_initial × (E + 1) / E_warmup                      ▷ linear warm-up
3: else
4:     lr ← lr_initial × R_decay^⌊(E − E_warmup) / E_drop⌋        ▷ stepwise decay
5: end if
6: return max(lr, lr_min)

The mathematical formulation for the learning rate, lr(E), at a given epoch E is defined as:

$$
lr(E)=
\begin{cases}
lr_{\mathrm{initial}} \times \dfrac{E + 1}{E_{\mathrm{warmup}}}, & E < E_{\mathrm{warmup}} \\
\max\!\left(lr_{\mathrm{min}},\; lr_{\mathrm{initial}} \times R_{\mathrm{decay}}^{\left\lfloor (E - E_{\mathrm{warmup}}) / E_{\mathrm{drop}} \right\rfloor}\right), & E \ge E_{\mathrm{warmup}}
\end{cases}
\qquad (14)
$$
The description and values of parameters used in this schedule are defined in Table 6.
Table 6: Learning Rate Scheduler Parameters
| Parameter | Symbol | Value | Description |
|---|---|---|---|
| Initial Learning Rate | lr_initial | 0.001 | The target learning rate reached after the warm-up phase. |
| Minimum Learning Rate | lr_min | 1e-6 | The lower bound on the learning rate, preventing training from stalling. |
| Warm-up Epochs | E_warmup | 5 | The number of epochs for the linear warm-up phase. |
| Decay Rate | R_decay | 0.75 | The multiplicative factor applied at each decay step. |
| Epoch Drop | E_drop | 5 | The number of epochs between successive learning rate decays. |
Figure 6: Learning Rate Path During Training
Visualization of the adaptive LR schedule. The process begins with a 5-epoch linear warm-up to a rate of 0.001, followed by a stepwise exponential decay. The discrete drops in the learning rate help refine the model's convergence as training progresses.
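Expressed as a Keras LearningRateScheduler callback, the schedule of Algorithm 1 and Equation (14) looks roughly as follows; parameter values are taken from Table 6, and the exact warm-up interpolation is an assumption.

```python
from tensorflow.keras.callbacks import LearningRateScheduler

LR_INITIAL, LR_MIN = 1e-3, 1e-6
E_WARMUP, R_DECAY, E_DROP = 5, 0.75, 5

def lr_schedule(epoch, lr=None):
    """Linear warm-up for E_WARMUP epochs, then decay by R_DECAY every E_DROP epochs."""
    if epoch < E_WARMUP:
        return LR_INITIAL * (epoch + 1) / E_WARMUP
    steps = (epoch - E_WARMUP) // E_DROP
    return max(LR_MIN, LR_INITIAL * (R_DECAY ** steps))

lr_callback = LearningRateScheduler(lr_schedule, verbose=1)
# model.fit(..., callbacks=[lr_callback, ...])
```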
Early Stopping and Model Checkpointing:
Early stopping was employed by monitoring the loss on a fixed clean validation set to prevent overfitting. The model automatically reverted to the weights corresponding to the epoch with the lowest validation loss. During training, checkpoints were saved at the end of each epoch, but only the weights that achieved the minimum loss on the validation set were retained for final evaluation.
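In Keras terms this corresponds to pairing EarlyStopping with ModelCheckpoint, both monitoring the fixed validation loss; the sketch below uses the patience value from Table 8 and an illustrative checkpoint filename.

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # Stop when the fixed validation loss stops improving and restore the best weights.
    EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
    # Keep only the checkpoint with the lowest loss on the fixed validation set.
    ModelCheckpoint("best_fold_model.keras", monitor="val_loss", save_best_only=True),
]
# model.fit(x_tr, y_tr, validation_data=(x_val, y_val), callbacks=callbacks, ...)
```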
Evaluation Metrics:
To assess model performance robustly, a diverse set of metrics was used. Precision minimizes false positives, while recall ensures detection of actual cancer cases. The F1-score balances these two metrics and is particularly useful in the presence of class imbalance. AUC-ROC evaluates class separability across all decision thresholds. Cohen's Kappa and Quadratic Weighted Kappa (QWK) measure agreement beyond chance; they are more reliable than accuracy alone and account for the clinical severity of misclassifications. MCC offers a balanced summary of all confusion matrix elements, and balanced accuracy ensures fair class-wise performance. Error-based metrics (MSE, RMSE, and MAE) quantify prediction deviations, while the 3×3 confusion matrix visualizes per-class performance and guides refinement. Together, these metrics support a comprehensive, reliable, and interpretable evaluation of the proposed model. A full summary of the mathematical definitions of these metrics is presented in Table 7.
Table 7: Comprehensive Evaluation Metrics for Classification Performance
| Metric | Mathematical Formulation of Metrics | Equation No. |
|---|---|---|
| Accuracy | $\dfrac{TP + TN}{TP + TN + FP + FN}$ | (3) |
| Precision | $\dfrac{TP}{TP + FP}$ | (4) |
| Recall | $\dfrac{TP}{TP + FN}$ | (5) |
| F1-score | $2 \times \dfrac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ | (6) |
| Cohen's Kappa (κ) | $\dfrac{p_o - p_e}{1 - p_e}$ | (7) |
| Quadratic Weighted Kappa (QWK) | $1 - \dfrac{\sum_{i,j} w_{ij} O_{ij}}{\sum_{i,j} w_{ij} E_{ij}}$ | (8) |
| Matthews Correlation Coefficient (MCC) | $\dfrac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$ | (9) |
| Balanced Accuracy | $\dfrac{1}{2}\left(\dfrac{TP}{TP+FN} + \dfrac{TN}{TN+FP}\right)$ | (10) |
| Mean Squared Error (MSE) | $\dfrac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ | (11) |
| Root Mean Squared Error (RMSE) | $\sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$ | (12) |
| Mean Absolute Error (MAE) | $\dfrac{1}{n}\sum_{i=1}^{n}\lvert y_i - \hat{y}_i\rvert$ | (13) |
Abbreviations:
TP: True Positive, TN: True Negative, FP: False Positive, FN: False Negative, yi: Actual value, yˆi: Predicted value, n: Number of samples, wij: Weight matrix, Oij: Observed agreement matrix, Eij: Expected agreement matrix, po: Observed agreement, pe: Expected agreement.
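Most of these metrics are available directly from scikit-learn; the sketch below shows one way the evaluation suite can be computed from integer labels and predicted class probabilities (array names are illustrative).

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score, cohen_kappa_score,
                             f1_score, matthews_corrcoef, mean_absolute_error,
                             mean_squared_error, roc_auc_score)

def evaluate(y_true, y_pred, y_prob):
    """y_true/y_pred: integer labels; y_prob: per-class probabilities of shape (n, 3)."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
        "mcc": matthews_corrcoef(y_true, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
        "cohen_kappa": cohen_kappa_score(y_true, y_pred),
        "qwk": cohen_kappa_score(y_true, y_pred, weights="quadratic"),
        "auc_ovr": roc_auc_score(y_true, y_prob, multi_class="ovr"),
        "mse": mean_squared_error(y_true, y_pred),
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "mae": mean_absolute_error(y_true, y_pred),
    }
```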
Statistical Analysis:
The model's performance was assessed using a multi-layered evaluation strategy to ensure robustness and clinical relevance. The initial phase involved a rigorous internal validation on the primary single-source dataset, conducted via 4-fold cross-validation. Subsequently, a final generalization assessment was performed by retraining and validating the entire pipeline on a large, independent multi-source dataset. Throughout both stages, performance was quantified using a comprehensive suite of metrics, including F1-score, Matthews Correlation Coefficient (MCC), AUC, and 95% Confidence Intervals (CI).
Model Interpretability with Grad-CAM:
This study applied the Grad-CAM technique to improve the model's interpretability and to verify that the model focuses on clinically relevant regions within the lungs. Grad-CAM uses the gradients of the target class flowing into the final convolutional layer to weight its feature maps, producing heat maps that highlight the most influential parts of the image (Chattopadhay et al., 2018). This visualization shows which regions drive the prediction: red/yellow areas typically correspond to nodules or abnormal tissue in both malignant and benign cases, whereas blue regions are usually background or normal lung tissue. It provides intuitive insight into the model's decision-making process and could potentially enhance its clinical reliability.
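A compact sketch of gradient-weighted class activation mapping is given below, assuming a Keras functional model whose last convolutional layer keeps the standard VGG16 name block5_conv3; layer name and preprocessing are assumptions rather than the exact implementation used here.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, target_layer="block5_conv3", class_idx=None):
    """Return a heatmap in [0, 1] for one preprocessed image of shape (224, 224, 3)."""
    conv_layer = model.get_layer(target_layer)
    grad_model = tf.keras.models.Model(model.input, [conv_layer.output, model.output])

    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        if class_idx is None:
            class_idx = int(tf.argmax(preds[0]))     # explain the predicted class
        class_score = preds[:, class_idx]

    grads = tape.gradient(class_score, conv_out)          # d(score)/d(feature maps)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))        # global-average-pool the gradients
    cam = tf.reduce_sum(conv_out[0] * weights, axis=-1)    # weighted sum of feature maps
    cam = tf.nn.relu(cam) / (tf.reduce_max(cam) + 1e-8)    # keep positive evidence, normalize
    return cam.numpy()
```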
4. EXPERIMENTAL RESULTS AND EVALUATION
Experimental Setup:
To ensure efficient training and evaluation, the proposed framework was implemented in a high-performance computing environment. The experiments were conducted on Kaggle's cloud-based platform using dual NVIDIA T4 GPUs. The software environment was configured using Python, TensorFlow, and Keras, in addition to different essential libraries for model implementation.
Hyperparameter Setup:
A carefully selected set of hyperparameters was configured to ensure optimal performance and stability of the proposed model. These settings were chosen based on experimental validation and deep learning best practices to enhance convergence efficiency, generalizability, and robustness. The hyperparameter configuration is stated in Table 8.
Table 8: VGG16 Hyperparameters Setup
| Hyperparameter | Value/Type |
|---|---|
| Added Layers | 2 Dense Layers |
| Neurons per Layer | 128, 64 |
| Activation Function | ReLU (both layers) |
| Regularization | λ = 0.001 for L1, λ = 0.01 for L2 |
| Batch Normalization | Applied after each dense layer |
| Dropout Rate | 0.5 (after each dense layer) |
| Output Layer | Softmax with 3 neurons (for 3-class classification) |
| Optimizer (Phase 1) | Adam |
| Optimizer (Fine-Tuning) | SGD (with momentum 0.9) |
| Initial Learning Rate | 0.001 |
| Learning Rate Schedule | Linear warmup (5 epochs) + piecewise decay (×0.75 every 5 epochs) |
| LR Callback | ReduceLROnPlateau (min LR = 1e-6) |
| Batch Size (Phase 1) | 32 |
| Batch Size (Fine-Tuning) | 16 |
| Early Stopping | Patience = 5 (based on validation loss) |
| Loss Function | Categorical cross-entropy |
Results on Primary Single-Source Dataset:
Training Performance:
First, we trained the proposed model on a single-source dataset (IQ-OTH/NCCD) for a 60-epoch training cycle, divided into a 30-epoch initial phase followed by two fine-tuning phases of 20 and 10 epochs, respectively. To visualize the learning process and verify model stability, the training and validation dynamics were monitored throughout the multi-phase training scheme for each fold. Figure 7 illustrates the accuracy and loss paths, providing a clear view of the model's convergence and generalization behavior by tracking performance on the training set, an internal validation set, and a clean fixed validation set.
Figure 7: The Training and Validation Performance Across 4 Folds.
(a)–(d) show accuracy (top) and loss (bottom) for Folds 1–4. Curves represent
the training set (augmented + SMOTE), internal validation (from training), and
validation (clean, fixed subset) across three training phases.
The learning curves presented in Figure 7 provide strong evidence of a stable and effective training process. Across all four folds, there is a consistent and smooth increase in accuracy and a corresponding decrease in loss for all three data splits, indicating successful model convergence.
Validation and Test Performance:
To quantitatively assess the best performance model checkpoints, the maximum validation accuracies were recorded for both the internal (augmented and SMOTE-balanced) and fixed (clean) validation sets within each fold, as illustrated in Figure 8.
The performance on the validation set (fixed and clean) closely tracks the performance on the training and internal validation sets, with only a minimal gap between peak accuracies. This demonstrates that the model generalizes exceptionally well to unseen, clean data and is not overfitting to the augmented training distribution.
The progressive improvement across the initial training and the two-phase fine-tuning stages further validates the efficacy of the gradual unfreezing strategy, allowing the model to effectively adapt its learned features without destabilizing the training process.
Figure 8: Peak Validation Accuracies Across Folds and Training Phases.
This figure compares the highest validation accuracies achieved on the internal (augmented and balanced) and fixed (clean) validation sets for each of the four folds. Each bar represents the peak accuracy reached during the 60-epoch training, and the number annotated on the bar indicates the specific epoch in that training phase where the maximum accuracy was recorded.
To comprehensively evaluate our proposed framework, the best-performing model from each fold was saved in the Keras (.keras) format along with its corresponding weights. Upon completing training across all folds, each model was independently evaluated on the held-out test set using a comprehensive set of performance metrics. The results, including the mean and standard deviation across all folds, are summarized in Table 9, providing insight into the model's effectiveness and stability.
Table 9: Detailed Performance Metrics of the Proposed VGG16 Framework Across 4-Fold Cross-Validation
| Fold | Accuracy | F1 Score | Loss | MCC | Balanced Accuracy | Cohen's Kappa | QWK | AUC | MSE | RMSE | MAE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Fold 1 | 0.9910 | 0.9911 | 0.1605 | 0.9769 | 0.9583 | 0.9765 | 0.9765 | 0.9998 | 0.0543 | 0.2330 | 0.0271 |
| Fold 2 | 0.9819 | 0.9821 | 0.1786 | 0.9465 | 0.9137 | 0.9450 | 0.9450 | 0.9909 | 0.1131 | 0.3363 | 0.0588 |
| Fold 3 | 0.9910 | 0.9911 | 0.1705 | 0.9769 | 0.9693 | 0.9766 | 0.9766 | 0.9900 | 0.0407 | 0.2018 | 0.0226 |
| Fold 4 | 0.9910 | 0.9911 | 0.1582 | 0.9845 | 0.9722 | 0.9844 | 0.9844 | 0.9890 | 0.0362 | 0.1903 | 0.0181 |
| Mean ± SD | 0.9887 ± 0.0045 | 0.988 ± 0.0045 | 0.1670 ± 0.0082 | 0.971 ± 0.0159 | 0.9534 ± 0.0249 | 0.9706 ± 0.0166 | 0.970 ± 0.016 | 0.9927 ± 0.0046 | 0.0611 ± 0.0334 | 0.2404 ± 0.0619 | 0.0317 ± 0.0177 |
The results presented in Table 9 demonstrate the exceptional performance and robustness of the proposed framework. The model has achieved a mean accuracy of 0.9887 and a mean F1-score of 0.9888, indicating superior classification capability. Crucially, the low standard deviation across all metrics further confirms the model's high stability and consistent performance across different data partitions. However, while all folds performed exceptionally well, Fold 4 emerged as the optimal model, achieving the highest Matthews Correlation Coefficient (MCC) (0.9845) and the lowest error rates (MSE of 0.0362) and lowest loss (0.1582). This combination of high predictive accuracy and minimal error makes it the most reliable candidate for deployment, validating the model's potential for clinical application. Furthermore, to provide a more rigorous assessment of the model's stability and the certainty of its performance estimates, the 95% confidence intervals (CIs) were calculated for all key metrics across the four folds. These results are indicated in Table 10.
Table 10: 95% Confidence Intervals for Key Performance Metrics Across Folds
The intervals [Lower Bound – Upper Bound] represent the reasonable range for the true performance metric, providing insight into the model's statistical stability on each data partition (each fold).

| Metric | Fold 1 [95% CI] | Fold 2 [95% CI] | Fold 3 [95% CI] | Fold 4 [95% CI] |
|---|---|---|---|---|
| Accuracy | [0.9683 – 1.0] | [0.9412 – 0.9910] | [0.9683 – 1.0] | [0.9683 – 1.0] |
| F1 Score (Macro) | [0.9336 – 1.0] | [0.8798 – 0.9774] | [0.9472 – 1.0] | [0.9336 – 1.0] |
| Balanced Accuracy | [0.9067 – 1.0] | [0.8472 – 0.9667] | [0.9270 – 1.0] | [0.9067 – 1.0] |
| AUC (OVR) | [0.9976 – 1.0] | [0.9964 – 1.0] | [0.9995 – 1.0] | [0.9978 – 1.0] |
| MCC | [0.9482 – 1.0] | [0.9041 – 0.9836] | [0.9470 – 1.0] | [0.9482 – 1.0] |
| Cohen's Kappa | [0.9465 – 1.0] | [0.8991 – 0.9835] | [0.9457 – 1.0] | [0.9465 – 1.0] |
| Log Loss | [0.0251 – 0.0842] | [0.0334 – 0.1456] | [0.0203 – 0.0632] | [0.0168 – 0.0706] |
| MSE | [0.0 – 0.1267] | [0.0362 – 0.2128] | [0.0 – 0.0995] | [0.0 – 0.1267] |
The confidence intervals in Table 10 provide a precise view of the model's statistical reliability. For most validation folds, the intervals are consistently narrow and have high lower bounds. This suggests the model performs with high reliability. Furthermore, the upper bound often reached 1.0, indicating that the model is capable of near-perfect classification. In contrast, Fold 2 showed wider confidence intervals and lower performance bounds. This highlights that the data partition contained more challenging instances. However, the model’s performance remained excellent even on this difficult fold, reinforcing its overall robustness. Additionally, the confidence intervals for error metrics were tightly clustered near zero, confirming low prediction error with high statistical confidence. In summary, this analysis demonstrates that the model is not only high-performing but also statistically stable, maintaining an excellent performance profile even under challenging conditions.
Confusion Matrix, Classification Report, and AUC-ROC:
The classification report of the best model on the test set, presented in Table 11, demonstrates the model's strong performance. Notably, the model achieved perfect precision, recall, and F1 scores for malignant cases. Performance on normal tissues was also excellent. However, the recall for benign cases was slightly lower, while precision remained perfect.
Furthermore, the confusion matrix in Figure 9 visually confirms these results. It shows that the only misclassifications were two benign cases incorrectly identified as normal. Finally, the Receiver Operating Characteristic (ROC) curve in Figure 9.b indicates outstanding diagnostic ability. Its near-perfect Area Under the Curve (AUC) score signifies high sensitivity and an extremely low false-positive rate. In summary, these metrics collectively validate the model's high accuracy and reliability, especially for identifying critical malignant cases.
Table 11: Classification Metrics of Best Fold Model (Fold 4)
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Benign (0) | 1.00 | 0.92 | 0.96 | 24 |
| Malignant (1) | 1.00 | 1.00 | 1.00 | 113 |
| Normal (2) | 0.98 | 1.00 | 0.99 | 84 |
| Weighted Accuracy | 0.99 | 0.99 | 0.99 | 221 |
Figure 9: (a) Confusion matrix showing the proposed model classification performance on the test set (Fold 4). (b) ROC curves for Fold 4 showing AUC values of three classes reflecting superior discrimination capability across classes.
Furthermore, Figure 10 illustrates the model's predicted probability distribution for each class, visually representing its confidence in the classification decisions. This probability-based visualization enhances interpretability and reinforces the model's clinical reliability in distinguishing between different lung tissue types.
Figure 10: Samples of Probability Distribution Plot Among the Three Classes
Grad-CAM Visualizations Analysis:
The proposed VGG16 model effectively localized the relevant regions associated with abnormalities, focusing on areas consistent with potential tumor presence, as shown in Figure 11. Although not always perfectly aligned with the ground-truth annotations, these heat maps demonstrate that the VGG16 model captures significant discriminative features linked to malignant characteristics and offers potential guidance for clinical assessment.
Figure 11: Grad-CAM Visualization
Generalization to Multi-Source Datasets:
While the proposed model performed exceptionally well during 4-fold cross-validation on the primary IQ-OTH/NCCD dataset, relying on a single data source can limit a model's proven generalizability. To address the critical issue of domain shift (where models often fail on data from new clinical environments), we conducted a rigorous generalization assessment. Therefore, this evaluation aims to verify the model’s capacity to generalize across varied imaging protocols, scanner types, and patient populations, which is a critical requirement for real-world deployment.
Dataset Composition and Label Harmonization:
To assess the model's generalization capability, we constructed a multi-source dataset. As detailed in Table 12, this composite dataset was formed by integrating five public lung CT datasets, resulting in a total of 29,546 images.
Table 12: Summary of Characteristics for the Unified Multi-source Lung CT Dataset
| Source Dataset | Original Labels | Mapped Labels | Format | Image Count | Labeling Notes |
|---|---|---|---|---|---|
| IQ-OTH/NCCD | Normal, Benign, Malignant | Normal, Benign, Malignant | JPEG | 1,097 | Labels already follow the unified three-class scheme |
| Chest CT-Scan (Kaggle) | Normal; Adenocarcinoma, Large cell carcinoma, Squamous cell carcinoma | Normal, Malignant | PNG / JPEG | 1,000 | Malignant histological subtypes were grouped under "Malignant" |
| SPIE-AAPM-NCI L | Nodules (Benign), Nodules (Malignant) | Benign, Malignant | DICOM | 15,931 | Nodules manually labeled by radiologists and pathologically validated |
| Lung-RADS Dataset (Mendeley) | LR2, LR3 (Benign Appearance); LR4A (Suspicious), LR4B (Very Suspicious) | Benign, Malignant | PICKLE | 972 | Labels interpreted per Lung-RADS categorization scheme |
| LIDC-IDRI Subset | No nodules; Nodules with low malignancy rating; Nodules with high rating or confirmed malignancy | Normal, Benign, Malignant | DICOM | 10,546 | Malignancy labels derived from radiologist consensus and pathology reports |
| Final Unified Dataset | — | Normal, Benign, Malignant | JPEG | 29,546 | All original labels harmonized into a unified 3-class format |
Before beginning any experiments on this dataset, a critical initial step involved a comprehensive data harmonization process. This began with mapping all original labels into a unified three-class scheme: Normal, Benign, and Malignant. Next, all images were standardized to 8-bit grayscale format using a fixed lung window (Level: 600 HU, Width: 1500 HU) to normalize intensity values across varying acquisition protocols. This harmonization process led to a clinically realistic class imbalance, with the final distribution as follows: Malignant with 13,656 images (46.2%), Benign with 11,490 images (38.9%), and Normal with 4,400 images (14.9%).
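For the DICOM sources, the intensity harmonization step can be sketched with pydicom as follows; the window level and width default to the values stated above, and the function is illustrative rather than the exact conversion script used in this study.

```python
import numpy as np
import pydicom

def dicom_to_lung_window_8bit(path, level=600.0, width=1500.0):
    """Convert one CT DICOM slice to 8-bit grayscale using a fixed lung window."""
    ds = pydicom.dcmread(path)
    # Convert stored pixel values to Hounsfield Units (HU).
    hu = ds.pixel_array.astype(np.float32) * float(ds.RescaleSlope) + float(ds.RescaleIntercept)
    lo, hi = level - width / 2.0, level + width / 2.0
    windowed = np.clip(hu, lo, hi)                 # clamp intensities to the chosen window
    return ((windowed - lo) / (hi - lo) * 255.0).astype(np.uint8)
```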
Re-training and Evaluation Results:
Following unification, our complete proposed methodology was applied. The dataset was first split into training (70%), validation (20%), and test (10%) subsets using stratified sampling. Our established pipeline, including hybrid preprocessing, class-specific data augmentation, and SMOTE-based balancing, was then applied exclusively to the training set. Figure 12 shows class distribution (Benign, Malignant, Normal) before and after augmentation and SMOTE in the training set.
Figure 12: Class Distribution Before and After Balancing in The Multi-Source Training Set.
Following that, the proposed model was retrained and evaluated on the held-out test set. A single train-test evaluation was selected instead of cross-validation, given the dataset’s large size and heterogeneity, offering a robust measure of the model’s real-world generalizability. The training dynamics plots are illustrated in Figure 13.
Figure 13: Training and Validation Accuracy (Left) and Loss (Right) Curves Across All Training Phases on the Multi-Source Dataset
Performance Metrics and Comparison:
When evaluated on the held-out multi-source test set, the model demonstrated excellent generalization capabilities. The framework maintained high performance, confirming its ability to adapt to data from unseen sources. Key evaluation metrics, including agreement metrics and error-based metrics, are summarized in Table 13.
Table 13: Performance of Proposed Model on the Multi-Source Test Set
| Metric | Performance Score |
|---|---|
| Balanced Accuracy | 0.9693 |
| Loss | 0.1503 |
| AUC | 0.9980 |
| MCC | 0.9427 |
| Cohen's Kappa | 0.9421 |
| MSE | 0.0193 |
| RMSE | 0.1390 |
| MAE | 0.0440 |
| QWK | 0.9497 |
The results in Table 13 show high predictive accuracy, strong agreement scores, and low error values, indicating robust generalization. Additional classification metrics are presented in Table 14. The consistently strong performance of our proposed model, even on a challenging and heterogeneous dataset, suggests that it did not overfit to a single data source. Instead, it effectively learned generalizable radiological features indicative of lung pathology. These findings strongly support the model’s robustness and highlight its potential for reliable deployment in real-world clinical environments with diverse scanners and patient populations.
Table 14: Classification Metrics on Multi-Source Test Set
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Benign cases | 0.93 | 0.98 | 0.96 | 1,149 |
| Malignant cases | 0.99 | 0.94 | 0.97 | 1,366 |
| Normal cases | 0.98 | 0.98 | 0.98 | 437 |
| Accuracy | | | 0.96 | 2,952 |
| Macro avg | 0.97 | 0.97 | 0.97 | 2,952 |
| Weighted avg | 0.97 | 0.96 | 0.96 | 2,952 |
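For reproducibility, the agreement and error metrics reported in Tables 13 and 14 can be computed with standard scikit-learn utilities, as in the hedged sketch below; the placeholder label arrays and the assumption that MSE, RMSE, and MAE are taken over integer class indices are ours.

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, matthews_corrcoef,
                             cohen_kappa_score, mean_squared_error,
                             mean_absolute_error, classification_report)

# Placeholder test-set labels and predictions (integer class indices)
y_true = np.random.randint(0, 3, size=2952)
y_pred = y_true.copy()
y_pred[:50] = (y_pred[:50] + 1) % 3   # inject a few errors for illustration

print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))
print("Cohen's kappa:", cohen_kappa_score(y_true, y_pred))
print("QWK:", cohen_kappa_score(y_true, y_pred, weights="quadratic"))
mse = mean_squared_error(y_true, y_pred)   # error metrics over class indices (assumption)
print("MSE:", mse, "RMSE:", np.sqrt(mse), "MAE:", mean_absolute_error(y_true, y_pred))
print(classification_report(y_true, y_pred, target_names=["Benign", "Malignant", "Normal"]))
```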
Furthermore, Figure 14(b) demonstrates a consistent increase in validation accuracy across the training phases (initial training followed by two-phase fine-tuning), confirming the efficacy of selective unfreezing. This aligns with the strong classification performance shown in the confusion matrix of Figure 14(a): 1,129 benign and 1,289 malignant cases were correctly identified, while 429 normal cases were predicted accurately with minimal misclassifications (3 as benign, 5 as malignant). Critically, zero malignant cases were mislabeled as normal, underscoring the model's reliability. Together, these results highlight robust generalization on diverse data.
Figure 14: (a) Final Confusion Matrix on the Multi-Source Test Set. (b) Phase-Wise Best Validation Accuracy.
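As a rough illustration of the phase-wise strategy reflected in Figure 14(b), the following Keras sketch first trains a new classification head on a frozen VGG16 backbone and then selectively unfreezes the deepest convolutional block at a lower learning rate. The head size, learning rates, and block choice are illustrative assumptions, not the exact configuration used in this study.

```python
import tensorflow as tf
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models, optimizers

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
head = models.Sequential([
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(3, activation="softmax"),
])
model = models.Sequential([base, head])

# Phase 1: train only the new head while the backbone stays frozen
base.trainable = False
model.compile(optimizer=optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, callbacks=[tf.keras.callbacks.ReduceLROnPlateau()])

# Phase 2: selectively unfreeze the deepest convolutional block and fine-tune at a lower rate
base.trainable = True
for layer in base.layers:
    layer.trainable = layer.name.startswith("block5")
model.compile(optimizer=optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_ds, validation_data=val_ds, callbacks=[tf.keras.callbacks.ReduceLROnPlateau()])
```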
This section interprets the experimental results presented in Section 4, connecting them to the methodologies detailed in Section 3. We discuss the significance of our findings and compare our framework's performance to the existing methods, highlighting its key advantages.
Summary and Interpretation of Findings:
The present study has successfully developed a lightweight, accurate, and robust framework for lung cancer classification. The strong performance reported in Section 4, including a test accuracy of 0.9910 on the single-source dataset and a balanced accuracy of 0.9693 on the multi-source dataset, is the direct outcome of our careful methodological choices.
The foundation of this success lies in the rigorous preprocessing pipeline (Section 3.2). The comparative evaluation confirmed that our hybrid filter effectively reduces noise while preserving critical features, validated by excellent PSNR (39.88 dB) and SSIM (0.8845) scores. This ensured the model was trained on high-quality data, which is crucial for achieving reliable performance.
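The image-quality scores cited above can be reproduced with standard scikit-image metrics, as in the minimal sketch below; the placeholder arrays stand in for an original slice and its preprocessed counterpart.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Placeholders: an original 8-bit CT slice and the output of the denoising/CLAHE pipeline
reference = np.random.randint(0, 256, (512, 512), dtype=np.uint8)
processed = np.clip(reference.astype(np.int16) + np.random.randint(-3, 4, reference.shape),
                    0, 255).astype(np.uint8)

psnr = peak_signal_noise_ratio(reference, processed, data_range=255)
ssim = structural_similarity(reference, processed, data_range=255)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.4f}")
```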
Furthermore, the lightweight model architecture (Section 3.5) proved highly effective. By strategically pruning the VGG16 model and employing a dual-phase fine-tuning strategy, we achieved state-of-the-art accuracy while dramatically reducing computational cost. This result demonstrates that architectural efficiency and high performance are not mutually exclusive.
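As an illustration of the structured-pruning principle, the sketch below ranks the filters of each VGG16 convolutional layer by the L1 norm of their kernels, a common importance criterion; the 50% retention ratio is an assumption, and the actual pruning schedule of this study may differ. A slimmer network would then be rebuilt from the retained filters and fine-tuned.

```python
import numpy as np
from tensorflow.keras.applications import VGG16

model = VGG16(weights="imagenet", include_top=False)

for layer in model.layers:
    if "conv" in layer.name:
        kernels = layer.get_weights()[0]            # shape: (kh, kw, in_channels, out_filters)
        l1 = np.abs(kernels).sum(axis=(0, 1, 2))    # one L1 importance score per output filter
        keep = np.argsort(l1)[::-1][: kernels.shape[-1] // 2]   # retain the strongest 50% (assumed ratio)
        print(f"{layer.name}: keep {len(keep)}/{kernels.shape[-1]} filters")
# A smaller architecture is then rebuilt with only the retained filters and the
# surviving weights are copied across before fine-tuning.
```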
Finally, the most critical finding is the model's proven generalization capability, substantiated by our two-layered validation framework (Section 3.4). The model's stability was confirmed internally with K-Fold cross-validation, but more importantly, its real-world applicability was proven by successfully retraining and evaluating it on a large, heterogeneous multi-source dataset (as detailed in Section 4.4). This comprehensive validation, supported by a full suite of metrics (Section 3.7), addresses a critical gap in the literature and confirms that the model is ready for clinical deployment.
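The internal validation layer can be expressed as a stratified four-fold split, as in the brief sketch below; the label array is a placeholder and the seed is arbitrary.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder labels for the single-source dataset; indices stand in for the image tensors
y = np.random.randint(0, 3, size=1097)
X = np.arange(len(y)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y), start=1):
    # Class proportions are preserved in every fold; training/validation happen per fold
    print(f"Fold {fold}: {len(train_idx)} train / {len(val_idx)} validation samples")
```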
Comparison with Existing Methods:
As evidenced by the comparative summary in Table 15, our proposed framework addresses several critical gaps in the existing literature. While numerous studies have achieved high accuracy, our work distinguishes itself through a combination of methodological rigor, proven generalization, and computational efficiency.
First and most critically, our study confronts the "generalization gap" head-on. The table clearly shows that the vast majority of prior works, including those with near-perfect accuracy scores such as Tandon et al. (2022), Ghosh et al. (2023), and Jassim et al. (2024), limit their validation to a single, often small, dataset. In contrast, our framework not only achieves a competitive accuracy of 99.1% on its primary single-source dataset but also demonstrates robust performance on a large, challenging multi-source dataset with a balanced accuracy of 96.9%. This dual-level validation provides a much higher degree of confidence in the model's real-world clinical applicability, a feature that is largely absent in the compared literature.
Second, our approach balances high performance with computational efficiency. Many state-of-the-art results in the table are achieved using computationally expensive ensembles (Jassim et al., 2024; Kumaran et al., 2024) or large custom models (Gupta et al., 2023). Our work takes a more practical approach by using structured pruning to create a lightweight yet powerful model. This makes our solution more feasible for deployment in clinical settings with limited resources, a crucial consideration that is often overlooked in the pursuit of marginal accuracy gains.
Finally, our framework emphasizes methodological transparency. We combine advanced preprocessing with quantitative validation to ensure data quality. Also, we integrate Grad-CAM for interpretability, aligning our work with the best practices seen in the most recent studies by Kumaran et al. (2024) and Klangbunrueang et al. (2025). In summary, while prior works excel in specific areas, our study presents a more holistic and clinically viable solution by delivering a model that is simultaneously accurate, efficient, interpretable, and—most importantly—proven to generalize across diverse data sources.
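A minimal Grad-CAM sketch for a VGG16 backbone is given below to make the interpretability step concrete; the choice of block5_conv3 as the target layer and the random input are illustrative assumptions rather than the study's exact configuration.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import VGG16

model = VGG16(weights="imagenet")
grad_model = tf.keras.Model(model.input,
                            [model.get_layer("block5_conv3").output, model.output])

image = np.random.rand(1, 224, 224, 3).astype(np.float32)   # placeholder CT-derived input
with tf.GradientTape() as tape:
    conv_out, preds = grad_model(image)
    top_class = tf.argmax(preds[0])
    class_score = tf.gather(preds, top_class, axis=1)

grads = tape.gradient(class_score, conv_out)          # d(score)/d(feature maps)
weights = tf.reduce_mean(grads, axis=(1, 2))          # global-average-pool the gradients
cam = tf.nn.relu(tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1))
cam = cam / (tf.reduce_max(cam) + 1e-8)               # normalized heat map in [0, 1]
print(cam.shape)                                      # (1, 14, 14) for a 224x224 input
```

The resulting low-resolution map is upsampled to the input size and overlaid on the CT slice to highlight the regions driving the prediction.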
Table 15: Comparison of Methodologies and Performance in Recent Lung Cancer Classification Literature
This table provides a structured comparison of recent studies. The final row highlights our proposed model to facilitate a direct comparison. Note: Acc: Accuracy, Sens: Sensitivity, Spec: Specificity, Prec.: Precision, Rec: Recall, MCC: Matthews Correlation Coefficient, AUC: Area Under the Curve, XAI: Explainable AI, --: Not specified or Not Applicable.
| Study | CT Lung Dataset & Size | Preprocessing | Augmentation / Data Balancing | Classification Type | XAI | Model & Results |
|---|---|---|---|---|---|---|
| Anand et al. (2022) | IQ-OTH/NCCD (977 images) | Basic (DICOM to JPG, resize, normalize) | Basic (flip, rotate, zoom, brightness) | Binary | -- | VGG16: Acc. 0.96, Sens. 0.94, Spec. 0.96 |
| Tandon et al. (2022) | LIDC-IDRI (7,500 images) | Basic (resize, normalize) | Flip, rotate | Binary | -- | VCNet (hybrid): Acc. 0.99, F1 0.991, AUC 0.991 |
| Naseer et al. (2023) | LUNA16 (888 CT scans) | Resize, normalize, patch extraction, segmentation | Patch-based | Binary | -- | Modified AlexNet-SVM: Acc. 0.979, Sens. 0.988, F1 0.977 |
| Sangeetha et al. (2023) | Kaggle Hist. & CT (750 images) | Denoising, normalize, CLAHE | Basic augmentation | Multi-class | -- | ResNet50: Histopathology Acc. 0.988; CT scan Acc. 0.847 |
| Gupta et al. (2023) | SPIE-AAPM (18,000 images divided into 4 small sub-datasets for training: 2,805, 2,600, 2,431, 3,561) | Resize, normalize; denoising (Gaussian and median filters); CLAHE | Shearing, zooming, horizontal flipping, fill mode | Multi-class | -- | Custom CNN: Acc. 0.952–0.998 across sub-datasets, Avg F1 0.97, Avg Rec. 0.977 |
| Ghosh et al. (2023) | IQ-OTH/NCCD (1,097 images) | Resize, smoothing, threshold, histogram equalization | Flip, rotate | Multi-class | -- | CNN: Acc. 0.867, Prec. 0.939, Rec. 0.704, AUC 0.946, F1 0.74; VGG16: Acc. 0.982, Prec. 0.956, Rec. 1.0, AUC 1.0, F1 0.979 |
| Al-Shouka and Alheeti (2023) | Kaggle Chest CT (1,200 images) | Resize, normalize | Rotation, shift, shear, zoom, flip, fill mode | Binary | -- | ResNet: Acc. 0.90, loss 0.16; MobileNetV2: Acc. 0.93, loss 0.16; Xception: Acc. 0.92, loss 0.19; VGG16: Acc. 0.91, loss 0.18 |
| Gugulothu and Balaji (2023) | LIDC-IDRI (subsets) | Denoising, contrast enhancement | -- | Binary | -- | HDE-NN (hybrid): Acc. 0.963, Sens. 0.952 |
| -- | IQ-OTH/NCCD (1,097 images) | Resizing, brightness, sharpness | -- | Multi-class | -- | DenseNet169 (modified): Acc. 1.0 |
| Kumaran et al. (2024) | IQ-OTH/NCCD (1,097 images) | Resize, RGB convert, normalize | SMOTE, class weighting | Multi-class | Grad-CAM | Ensemble (VGG16 + ResNet50 + InceptionV3): Acc. 0.981, Bal. Acc. 0.969, MCC 0.968, Kappa 0.969, MSE 0.061, RMSE 0.249, MAE 0.033 |
| Alheeti et al. (2024) | Kaggle Chest CT (1,000 images) | Brightness, sharpness conversion, resizing | Flip, rotate | Binary | -- | MobileNetV2: Acc. 0.98, F1 0.98 |
| Jassim et al. (2024) | Kaggle Chest CT (1,000 images) | Resize, RGB convert | Flipping, rotation, scaling | Multi-class | -- | Ensemble (ResNet50/101 + EfficientNetB3): Validation Acc. 0.994 |
| Klangbunrueang et al. (2025) | IQ-OTH/NCCD (1,097 images) | Image resizing, normalization | Rotation, scaling, flipping | Multi-class | Grad-CAM | VGG16: Acc. 0.981; MobileNetV2: Acc. 0.971; ResNet50: Acc. 0.933 |
| This Study | IQ-OTH/NCCD (1,097 images, single-source); Multi-source (29,546 images) | Hybrid filter + CLAHE (quantitatively validated) | SMOTE + hybrid augmentation | Multi-class | Grad-CAM | Modified VGG16: (single-source) Acc. 0.991, Bal. Acc. 0.9722, MCC 0.984, Kappa & QWK 0.984, MSE 0.036, F1 0.9911, AUC 0.989; (multi-source) Acc. 0.964, Bal. Acc. 0.969, MCC 0.943, Kappa 0.942, MSE 0.0193, F1 0.966, QWK 0.949, AUC 0.998 |
Study’s Limitations and Future Directions:
Despite the promising results, this study has several limitations that open avenues for future research. First, our study focused on enhancing the VGG16 architecture. While this demonstrated the power of our methodology, future work should extend this framework to other architectures, such as EfficientNet or vision transformers, to explore potential performance trade-offs.
Second, while our model accurately classifies entire images, it does not perform lesion segmentation. Integrating an automated segmentation module is a critical next step that would enhance the system's clinical utility by providing precise lesion localization and boundaries. Finally, while Grad-CAM improved model interpretability, clinical expert validation is needed. Future work can include deploying the model as an app/API for real-time clinical use.
CONCLUSION
This study successfully developed and validated a lightweight, accurate, and robust deep learning framework for automated lung cancer classification from CT images. By systematically addressing common challenges in data quality, model efficiency, and validation, we have created a solution poised for real-world clinical application. Our approach integrated a quantitatively validated preprocessing pipeline, a streamlined VGG16 architecture improved through pruning and fine-tuning, and a rigorous multi-layered validation strategy. The resulting model demonstrated outstanding and highly generalizable performance on both single-source and challenging multi-source datasets. Furthermore, the inclusion of Grad-CAM for model interpretability enhances its clinical utility by providing transparent, visual evidence for its predictions. Ultimately, this work presents a holistic and methodologically sound framework that sets a high standard for developing clinically viable computer-aided diagnosis (CAD) systems for the early and reliable detection of lung cancer.
Acknowledgments:
Not applicable.
Ethical Statement:
This study is a secondary analysis of publicly available, fully anonymized datasets. All data were originally collected under protocols approved by the respective institutional review boards (IRBs) of the data providers. As the research involved no direct interaction with human participants and exclusively utilized de-identified, pre-existing data, additional ethical approval was not required for this secondary analysis.
Author Contributions:
All authors have read and approved the final manuscript and agree to be accountable for all aspects of the work. M. S. R.: conceptualization, methodology, data acquisition, analysis, software development, validation, visualization, and original draft preparation. M. A. S.: conceptualization, supervision, manuscript review, and critical revision for intellectual content.
Declaration of Competing Interests:
The author(s) declare that they have no competing interests.
Funding:
The author(s) received no specific funding for this work.
REFERENCES
Al Najjar, Y. (2024). Comparative analysis of image quality assessment metrics: MSE, PSNR, SSIM, and FSIM. International Journal of Science and Research, 13(3), 110–114. https://doi.org/10.21275/SR24302013533
Alheeti, K. M. A., Al-Shouka, T. T., Majeed, S. H., & Ahmed, A. A. (2024). Lung cancer detection using machine learning and deep learning models. In 2024 21st International Multi-Conference on Systems, Signals & Devices (SSD) (pp. 63–69). IEEE. https://doi.org/10.1109/SSD61670.2024.1054950.
Al-Shouka, T. T., & Alheeti, K. M. A. (2023). A transfer learning for intelligent prediction of lung cancer detection. In 2023 Al-Sadiq International Conference on Communication and Information Technology (AICCIT) (pp. 54–59). IEEE. https://doi.org/10.1109/AICCIT57614.2023.10217967.
Al-Yasriy, H. F., Al-Husieny, M. S., Mohsen, F. Y., Khalil, E. A., & Hassan, Z. S. (2020). Diagnosis of lung cancer based on CT scans using CNN. IOP Conference Series: Materials Science and Engineering, 928, 032033. https://www.kaggle.com/datasets/hamdallak/the-iqothnccd-lung-cancer-dataset/data.
Anand, R., Rao, N., & Sumukh, D. (2022). Lung cancer detection and prediction using deep learning. International Journal of Engineering Applied Sciences and Technology.
Anusha, M., & Reddy, D. S. (2023). Lung carcinoma diagnosis and classification using deep learning. 2023 4th International Conference for Emerging Technology (INCET), 1–4. IEEE. https://doi.org/10.1109/INCET57972.2023.10170615.
Ardila, D., Kiraly, A. P., Bharadwaj, S., Choi, B., Reicher, L., Peng, L., Tse, D., Etemadi, M., Ye, W., Corrado, G. S., Naidich, D. P., & Shetty, S. (2019). End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nature Medicine, 25(6), 954–961. https://doi.org/10.1038/s41591-019-0447-x.
Benamara, Z., Zehani, S., & Zitouni, A. (2024). The effect of fully connected layers in different pre-trained CNN architectures on the enhancement of lung cancer classification. In 2024 8th International Conference on Image and Signal Processing and their Applications (ISPA) (pp. 1–6). IEEE. https://doi.org/10.1109/ISPA59904.2024.10536830.
Bray, F., Laversanne, M., Sung, H., Ferlay, J., Siegel, R. L., Soerjomataram, I., & Jemal, A. (2024). Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: A Cancer Journal for Clinicians, 74(3), 229–263. https://doi.org/10.3322/caac.21834.
Chang, Y., Jung, C., Ke, P., Song, H., & Hwang, J. (2018). Automatic contrast-limited adaptive histogram equalization with dual gamma correction. IEEE Access, 6, 11782–11792. https://doi.org/10.1109/ACCESS.2018.2797872
Chattopadhay, A., Sarkar, A., Howlader, P., & Balasubramanian, V. N. (2018, March). Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV) (pp. 839–847). IEEE. https://doi.org/10.1109/WACV.2018.00097.
Gupta, D., Dawn, S., & others. (2023). Detection and staging of lung cancer from CT scan images by deep learning. In 2023 International Conference on Disruptive Technologies (ICDT) (pp. 274–278). IEEE. https://doi.org/10.1109/ICDT57929.2023.10151194.
Ghosh, R., Ahamed, A., Sadhukhan, B., & Das, N. (2023). Lung nodule classification using MobileNet transfer learning. In 2023 9th International Conference on Smart Computing and Communications (ICSCC) (pp. 290–295). IEEE. https://doi.org/10.1109/ICSCC59169.2023.10335043.
Gugulothu, V. K., & Balaji, S. (2024). RETRACTED ARTICLE: An early prediction and classification of lung nodule diagnosis on CT images based on hybrid deep learning techniques. Multimedia Tools and Applications, 83, 1041–1061. https://doi.org/10.1007/s11042-023-15802-2
Ibrahim, W. R., & Mahmood, M. R. (2023). Classified COVID-19 by DenseNet121-based deep transfer learning from CT-scan images. Science Journal of University of Zakho, 11(4), 571–. https://doi.org/10.25271/sjuoz.2023.11.4.1166
Jassim, O. A., Abed, M. J., & Saied, Z. H. (2024). Deep learning techniques in the cancer-related medical domain: A transfer deep learning ensemble model for lung cancer prediction. Baghdad Science Journal, 21(3). https://doi.org/10.21123/bsj.2023.8340.
Klangbunrueang, R., Pookduang, P., Chansanam, W., & Lunrasri, T. (2025, February). AI-powered lung cancer detection: Assessing VGG16 and CNN architectures for CT scan image classification. Informatics, 12(1), 18. MDPI. https://doi.org/10.3390/informatics12010018.
Kumaran, Y. S., Jeya, J. J., T. R, M., Khan, S. B., Alzahrani, S., & Alojail, M. (2024). Explainable lung cancer classification with ensemble transfer learning of VGG16, ResNet50 and InceptionV3 using Grad-CAM. BMC Medical Imaging, 24(1), 176. https://doi.org/10.1186/s12880-024-01345-x.
Leiter, A., Veluswamy, R. R., & Wisnivesky, J. P. (2023). The global burden of lung cancer: Current status and future trends. Nature Reviews Clinical Oncology, 20, 624–639. https://doi.org/10.1038/s41571-023-00798-3.
Murad, S. H., Awlla, A. H., & Moahmmed, B. T. (2023). Prediction lung cancer based critical factors using machine learning. Science Journal of University of Zakho, 11(3), 447–452. https://doi.org/10.25271/sjuoz.2023.11.3.1105
National Lung Screening Trial Research Team (NLST). (2011). Reduced lung-cancer mortality with low-dose computed tomographic screening. New England Journal of Medicine, 365(5), 395–409. https://doi.org/10.1056/NEJMoa1102873.
Naseer, I., Akram, S., Masood, T., Rashid, M., & Jaffar, A. (2023). Lung cancer classification using modified U-Net based lobe segmentation and nodule detection. IEEE Access, 11, 60279–60291. https://doi.org/10.1109/ACCESS.2023.3285821.
Park, S., Park, H., Lee, S. M., Kim, H., & Goo, J. M. (2022). Application of computer-aided diagnosis for Lung-RADS categorization in CT screening for lung cancer: Effect on inter-reader agreement. European Radiology, 32(2), 1054–1064. https://doi.org/10.1007/s00330-021-08202-3
Ravindra, C., Nalband, A. H., Kumar, G., Basheer, S., & Ravindra, M. (2024). From pixels to prognosis: A deep dive into lung cancer subtype classification using transfer learning. In 2024 IEEE International Conference on Contemporary Computing and Communications (InC4) (Vol. 1, pp. 1–6). IEEE. https://doi.org/10.1109/InC460750.2024.10649168.
Rodrigues, R., Lévêque, L., Gutiérrez, J., Jebbari, H., Outtas, M., Zhang, L., Chetouani, A., Al-Juboori, S., Martini, M. G., & Pinheiro, A. M. G. (2024). Objective quality assessment of medical images and videos: Review and challenges. Multimedia Tools and Applications, 1–34. https://doi.org/10.1007/s11042-024-20292-x.
Sangeetha, M., Devi, R. M., Gunasekaran, H., Venkatesan, R., Ramalakshmi, K., & Murugesan, P. (2023). Deep residual learning for lung cancer nodules detection and classification. In 2023 7th International Conference on Computing Methodologies and Communication (ICCMC) (pp. 907–912). IEEE. https://doi.org/10.1109/ICCMC56507.2023.10083783.
Singh, A., & Kamath, S. (2024). Preprocessing of CT scans for lung cancer detection. In 2024 4th International Conference on Intelligent Technologies (CONIT) (pp. 1–4). IEEE. https://doi.org/10.1109/CONIT61985.2024.10626594.
Tandon, R., Agrawal, S., Chang, A., & Band, S. S. (2022). VCNet: Hybrid deep learning model for detection and classification of lung carcinoma using chest radiographs. Frontiers in Public Health, 10, 894920. https://doi.org/10.3389/fpubh.2022.894920.
World Health Organization (WHO). (2023). Lung cancer (Report No. WHO/2023/LC_FS). https://www.who.int/news-room/fact-sheets/detail/lung-cancer.
This is an open access under a CC BY-NC-SA 4.0 license (https://creativecommons.org/licenses/by-nc-sa/4.0/)