Cross-Media Fake Content Detection via Independent Deep Learning Classifiers

Authors

  • Iqbal Najihah binti Samsul Kamal, Department of Computer Science, International Islamic University Malaysia, Kuala Lumpur, Malaysia
  • Anna Safiya binti Samsudin, Department of Computer Science, International Islamic University Malaysia, Kuala Lumpur, Malaysia
  • Raini binti Hassan, Department of Computer Science, International Islamic University Malaysia, Kuala Lumpur, Malaysia

DOI:

https://doi.org/10.31436/ijpcc.v12i1.651

Keywords:

Deep Learning, Data Science, Multimedia Forensics, Swin Transformer, Wav2Vec 2.0, Machine Learning

Abstract

The rapid advancement of generative models has enabled the creation of highly realistic fake multimedia content, including altered images, deepfake videos, and synthetic audio. These forgeries undermine information integrity and pose significant societal risks, particularly by enabling misinformation, digital fraud, and impersonation. Because these threats directly affect public trust and institutional transparency, they challenge the goals outlined in SDG 16: Peace, Justice, and Strong Institutions, which focuses on reducing corruption, preserving information integrity, and ensuring accountable, trustworthy systems. To address these issues, this paper proposes a deep learning-based system that classifies multimedia content across three modalities: image, video, and audio. Unlike conventional multimodal fusion approaches that require paired data inputs, this paper introduces a novel routing-based unification architecture. The proposed framework uses a content-adaptive routing mechanism that treats each modality independently. The system automatically determines the type of an input file and routes it to the relevant specialized deep learning classifier: a dual-backbone Swin Transformer and EfficientNet for images, a Video Swin Transformer for video, and Wav2Vec 2.0 for audio. This design yields a versatile, single-entry-point forensic tool that maintains high accuracy by leveraging domain-specific experts without the computational overhead of processing multiple streams concurrently. Experimental results demonstrate strong performance across the individual modalities, with the audio model achieving 96.95% accuracy and the image model showing robust precision despite the challenges posed by high-quality generative forgeries.
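The content-adaptive routing described in the abstract can be sketched as a single-entry-point dispatcher. The sketch below is illustrative only, not the authors' implementation: it infers the modality from the file's MIME type and forwards the input to a per-modality classifier. The `classify_*` functions are placeholders standing in for the paper's specialized models (Swin Transformer + EfficientNet for images, Video Swin Transformer for video, Wav2Vec 2.0 for audio), and extension-based type detection is an assumption of this sketch.

```python
import mimetypes
from typing import Callable, Dict

# Placeholder classifiers; in the paper's system each of these would be a
# specialized deep learning model returning a real/fake verdict.
def classify_image(path: str) -> str:
    return f"image-expert({path})"

def classify_video(path: str) -> str:
    return f"video-expert({path})"

def classify_audio(path: str) -> str:
    return f"audio-expert({path})"

# One expert per modality; only the matching expert runs, so there is no
# overhead from processing multiple streams concurrently.
ROUTES: Dict[str, Callable[[str], str]] = {
    "image": classify_image,
    "video": classify_video,
    "audio": classify_audio,
}

def route(path: str) -> str:
    """Single entry point: infer the modality from the file's MIME type
    and dispatch to the matching domain-specific classifier."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        raise ValueError(f"Unrecognized file type: {path}")
    modality = mime.split("/")[0]  # e.g. "image/jpeg" -> "image"
    if modality not in ROUTES:
        raise ValueError(f"Unsupported modality: {modality}")
    return ROUTES[modality](path)

print(route("sample.jpg"))   # dispatched to the image classifier
print(route("clip.mp4"))     # dispatched to the video classifier
print(route("speech.wav"))   # dispatched to the audio classifier
```

In a real system the router would also need to inspect file contents (not just names), and a video input might additionally have its audio track extracted and sent to the audio expert; those details are beyond this sketch.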

Published

30-01-2026

How to Cite

binti Samsul Kamal, I. N., binti Samsudin, A. S., & binti Hassan, R. (2026). Cross-Media Fake Content Detection via Independent Deep Learning Classifiers. International Journal on Perceptive and Cognitive Computing, 12(1), 65–73. https://doi.org/10.31436/ijpcc.v12i1.651