Comparative Analysis of Vision Transformers and CNN Models for Driver Fatigue Classification
DOI: https://doi.org/10.31436/iiumej.v26i2.3488

Keywords: Deep learning, Convolutional Neural Network, Vision Transformer, Driving behavior, Embedded Systems, Raspberry Pi

Abstract
This study provides a comprehensive evaluation of Convolutional Neural Network (CNN) and Vision Transformer (ViT) models for driver fatigue classification, a critical issue in road safety. Using a custom driving behavior dataset, state-of-the-art CNN and ViT architectures, including VGG16, EfficientNet, MobileNet, Inception, DenseNet, ResNet, ViT, and Swin Transformer, were analyzed to determine the most suitable model for practical driver fatigue monitoring systems. Performance metrics such as accuracy, F1-score, training time, inference time, and frames per second (fps) were assessed across different hardware platforms, including a high-performance workstation, a Raspberry Pi 5, and a desktop with a Graphics Processing Unit (GPU). Results demonstrate that CNN models, particularly VGG16, achieve the best balance between accuracy and efficiency, with an F1-score of 0.97 and 77.00 fps on the desktop. In contrast, Swin V2S outperforms all other models in accuracy, achieving an F1-score of 0.99 and 61.18 fps on the GPU, although it exhibits limited efficiency on embedded systems. The study contributes practical recommendations for selecting models based on performance needs and hardware constraints, highlighting the suitability of ViTs for high-computation environments. The findings support the development of more efficient driver fatigue monitoring systems, with practical implications for enhancing road safety and reducing traffic accidents.
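To illustrate how per-platform inference-time and fps figures of the kind reported above can be measured, the following Python sketch benchmarks single-image inference for a CNN (VGG16) and a ViT (ViT-B/16) on whichever device is available. It is a minimal sketch using torchvision with randomly initialized weights purely for timing, not the authors' evaluation code; the model choices, iteration count, and 224x224 input size are illustrative assumptions.

    import time
    import torch
    from torchvision import models

    def benchmark_fps(model, device, n_iters=50, img_size=224):
        # Measure average single-image inference latency and the resulting fps.
        model = model.to(device).eval()
        x = torch.randn(1, 3, img_size, img_size, device=device)
        with torch.no_grad():
            for _ in range(5):              # warm-up passes before timing
                model(x)
            if device.type == "cuda":
                torch.cuda.synchronize()    # finish queued GPU work
            start = time.perf_counter()
            for _ in range(n_iters):
                model(x)
            if device.type == "cuda":
                torch.cuda.synchronize()
            elapsed = time.perf_counter() - start
        latency = elapsed / n_iters
        return latency, 1.0 / latency

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    for name, net in [("VGG16", models.vgg16(weights=None)),
                      ("ViT-B/16", models.vit_b_16(weights=None))]:
        latency, fps = benchmark_fps(net, device)
        print(f"{name}: {latency * 1000:.1f} ms/frame, {fps:.2f} fps")

Running the same script on a workstation CPU, a Raspberry Pi 5, and a CUDA-equipped desktop gives directly comparable throughput numbers, since only the device (and hence the hardware) changes between runs.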
License
Copyright (c) 2025 IIUM Press

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
Funding data
Ministry of Higher Education, Malaysia (Grant number: FRGS/1/2023/TK07/UITM/02/23)