ShodhKosh: Journal of Visual and Performing Arts, ISSN (Online): 2582-7472
Photo Authenticity Detection Using Machine Learning for Deepfake and AI-Generated Content Verification

Rajesh Raikwar 1

1 Assistant Professor, Department of Electrical Engineering, Vishwakarma Institute of Technology, Pune, Maharashtra-411037, India
2 Dean, PGSR, Deogiri Institute of Engineering and Management Studies, Chhatrapati Sambhajinagar, Maharashtra, India
3 Assistant Professor, School of Still Photography, AAFT University of Media and Arts, Raipur, Chhattisgarh-492001, India
4 Assistant Professor, Department of Robotics and Automation, AISSMS College of Engineering, Kennedy Road, Pune-01, Maharashtra, India
5 Assistant Professor, SJMC, Noida International University, Noida, Uttar Pradesh, India
6 Department of Computer Applications, CT University, Ludhiana, Punjab, India
1. INTRODUCTION
The rapid advance of deepfake technologies and AI-generated imagery has raised serious concerns about the authenticity and trustworthiness of visual material in digital spaces. As synthetic media becomes increasingly difficult to distinguish from genuine content, the threat it poses to public trust, media ethics, and digital security continues to grow. Traditional detection methods, which typically rely on hand-crafted features or simple forensic analyses, are no longer adequate against increasingly sophisticated AI-driven manipulation techniques. This pressing problem calls for verification systems that are more advanced, adaptive, and intelligent, capable of distinguishing genuine photographs from synthetic or edited material.
In response, this study proposes a comprehensive machine learning framework for detecting manipulated photographs. The system combines convolutional neural networks (CNNs) for localised feature extraction with transformer-based architectures that capture global contextual inconsistencies. By analysing subtle flaws, noise patterns, and environmental artefacts that are often invisible to the human eye, the proposed model effectively identifies discrepancies indicative of deepfake or AI-driven manipulation. Advanced feature engineering techniques, including frequency-domain analysis and noise residual modelling, further improve the model's sensitivity and robustness across different types of synthetic content.
This study demonstrates the importance of machine learning, and hybrid deep learning models in particular, in combating the spread of fake media and restoring trust in digital imagery. It highlights the need to combine local feature extraction with global semantic understanding for accurate authenticity verification. The study also discusses open challenges, including vulnerability to adversarial attacks and the ongoing arms race between generation and detection technologies. To address these issues, future work will focus on integrating explainable AI (XAI) methods to improve model transparency and build user trust, alongside continuous learning frameworks that let the system adapt automatically to newly emerging deepfake generation methods.
2. RELATED WORK
Over the past decade, extensive research on detecting deepfakes and AI-generated synthetic media has produced many methods for identifying manipulated images. Early approaches relied mainly on conventional digital image forensics, such as analysing JPEG compression artefacts, sensor noise patterns, and inconsistent lighting or shadows Naitali et al. (2023). Although these techniques worked for simple manipulations, they often failed against sophisticated AI-generated material that closely matched the statistical properties of real photographs. To address this, researchers turned to machine learning models that learn discriminative features directly from data, a significant shift from hand-crafted features to learned ones Kerenalli et al. (2023). One important line of research applies Convolutional Neural Networks (CNNs) to fake-image detection. Models such as MesoNet and XceptionNet have shown they can detect face deepfakes by learning to recognise subtle texture-level changes and synthesis artefacts Bhandarkar et al. (2024), Kang et al. (2022).
CNNs, however, tend to focus on localised patterns and miss broader semantic-level anomalies, which leaves detectors vulnerable when fake content preserves strong local realism but is globally inconsistent. Transformer-based architectures, especially Vision Transformers (ViT), have been applied to forgery detection to address this problem. Transformers excel at capturing long-range dependencies and global context in images, making them a promising way to find deeper inconsistencies in synthetic media Firc et al. (2023), Nah et al. (2023). Despite these advances, current machine learning approaches still have notable limitations. A major concern is how well deepfake detection models generalise to different datasets and previously unseen manipulation methods Malik et al. (2022). Many systems perform well on specific datasets but degrade sharply on out-of-distribution samples, indicating a tendency to overfit. Adversarial attacks have also been shown to fool deepfake detectors through imperceptible perturbations, raising concerns about the stability and reliability of current solutions Masood et al. (2023). Another problem is that most models offer no insight into their decision-making; they operate as "black boxes", which obscures how decisions are reached, a serious drawback in sensitive areas such as media verification and legal forensics Heidari et al. (2024). The complementary strengths of CNNs and transformers point to the need for hybrid models that blend local texture analysis with global semantic reasoning, yet few studies have combined them in a way that exploits both Bird and Lotfi (2024). Little work has explored multi-domain and continuous learning methods that would let detectors adapt to rapidly evolving generation technologies without full retraining Al-Adwan et al. (2024). Additionally, the datasets used in earlier research were mostly limited to face deepfakes and did not cover a wider range of AI-generated material, such as forged documents, landscapes, or retouched historical photographs Mukta et al. (2023). These gaps underline the need for stronger, more flexible, and more interpretable detection systems that can operate reliably across diverse real-world situations. This paper addresses these gaps by proposing a CNN-Transformer architecture enhanced with frequency-domain feature extraction and noise residual modelling, designed to handle a wide range of deepfakes and AI-generated content.
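To illustrate the two feature-engineering steps named above, here is a minimal sketch of frequency-domain analysis and noise residual extraction. The paper does not specify its exact transforms or filters; the log-amplitude FFT spectrum and the 3x3 Laplacian high-pass kernel below are common choices assumed purely for illustration.

```python
# Hedged sketch of the two feature-engineering steps: a log-amplitude
# FFT spectrum and a high-pass noise residual. The Laplacian kernel is
# an assumed stand-in for whatever filter the authors actually used.
import numpy as np

def frequency_features(gray: np.ndarray) -> np.ndarray:
    """Log-amplitude spectrum; GAN upsampling often leaves periodic peaks here."""
    spectrum = np.fft.fftshift(np.fft.fft2(gray))
    return np.log1p(np.abs(spectrum))

def noise_residual(gray: np.ndarray) -> np.ndarray:
    """High-pass residual that suppresses image content and keeps noise traces."""
    k = np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]], dtype=np.float32)
    pad = np.pad(gray, 1, mode="reflect")
    h, w = gray.shape
    res = np.zeros((h, w), dtype=np.float32)
    # Direct 2-D correlation with the 3x3 Laplacian kernel
    for i in range(3):
        for j in range(3):
            res += k[i, j] * pad[i:i + h, j:j + w]
    return res

img = np.random.rand(224, 224).astype(np.float32)  # stand-in grayscale image
f, r = frequency_features(img), noise_residual(img)
```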
3. PROPOSED METHODOLOGY
The proposed system uses machine learning to verify photo authenticity by classifying images as real or fake; the architecture is illustrated in Figure 1. The system takes as input a collection containing both genuine and synthetic (deepfake or AI-generated) images. A Convolutional Neural Network (CNN) first processes these images to extract localised spatial features. The resulting feature maps are then passed to a Transformer block, which captures global relationships and context-dependent inconsistencies.
Figure 1 Architecture of the Proposed System
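To make this pipeline concrete, the following is a minimal PyTorch sketch of such a hybrid detector. The paper does not publish its implementation, so the backbone depth, embedding size, head count, and layer count below are illustrative assumptions rather than the authors' configuration.

```python
# Minimal sketch of a hybrid CNN-Transformer detector: a small CNN backbone
# extracts localised features, whose spatial cells become tokens for a
# transformer encoder that models global context. All sizes are assumptions.
import torch
import torch.nn as nn

class HybridDetector(nn.Module):
    def __init__(self, embed_dim=256, n_heads=8, n_layers=4):
        super().__init__()
        # CNN stage: localised spatial features (texture, noise artefacts)
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, embed_dim, 3, stride=2, padding=1), nn.BatchNorm2d(embed_dim), nn.ReLU(),
        )
        # Transformer stage: global, context-dependent inconsistencies
        # (positional encodings omitted for brevity)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=n_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.head = nn.Linear(embed_dim, 2)  # real vs. fake logits

    def forward(self, x):                         # x: (B, 3, H, W)
        fmap = self.cnn(x)                        # (B, C, H/8, W/8)
        tokens = fmap.flatten(2).transpose(1, 2)  # (B, N, C) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        out = self.transformer(torch.cat([cls, tokens], dim=1))
        return self.head(out[:, 0])               # classify from the CLS token

logits = HybridDetector()(torch.randn(2, 3, 224, 224))  # smoke test: (2, 2)
```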
4. RESULTS AND DISCUSSION
Table 2 Comparative Performance with State-of-the-Art Models

| Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|-------|--------------|---------------|------------|--------------|
| Traditional CNN Model | 89.3 | 88.1 | 87.5 | 87.8 |
| Transformer-Only Model | 91.7 | 90.6 | 89.8 | 90.2 |
| Hybrid CNN-Transformer (Proposed) | 94.8 | 93.6 | 94.2 | 93.9 |
As Table 2 shows, the traditional CNN baseline reaches 89.3% accuracy, while the Transformer-only model improves on it with 91.7%. However, Transformers typically need very large training datasets and may miss the subtle local artefacts that matter for forgery detection. The proposed Hybrid CNN-Transformer model performs best, with an accuracy of 94.8%, a precision of 93.6%, a recall of 94.2%, and an F1-score of 93.9%, as shown in Figure 3. The hybrid design captures both fine local flaws and broader semantic anomalies because it combines the strengths of CNNs and Transformers.
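As a quick arithmetic check, the reported F1-score follows from the reported precision and recall via the standard harmonic-mean formula:

```python
# F1 is the harmonic mean of precision and recall: F1 = 2PR / (P + R).
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(round(f1(93.6, 94.2), 1))  # -> 93.9, matching Table 2
```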
Figure 3 Hybrid CNN-Transformer Model Metric Comparison
This substantial performance gain shows the value of combining localised feature extraction with global contextual reasoning. The improved results indicate that the hybrid model is more robust, generalises better across datasets, and is better suited to real-world settings where deepfake and AI-generated material evolve quickly and grow more complex; a comparison of the traditional and proposed models is shown in Figure 4.
Figure 4 Performance Metrics (Accuracy, Precision, Recall, and F1-Score) Across Different Models
Table 3 reports the proposed Hybrid CNN-Transformer model's cross-dataset performance, which tests how well it generalises to different types of synthetic media. It achieves 94.2% accuracy, 93.1% precision, 93.7% recall, and a 93.4% F1-score on the FaceForensics++ test set, widely regarded as the benchmark for face-manipulation detection. These results show that the model reliably detects faces altered with common manipulation techniques. The model performs even better on the Deepfake Detection Challenge (DFDC) dataset, with 95.1% accuracy, 94.0% precision, 94.8% recall, and a 94.4% F1-score.
Table 3 Cross-Dataset Testing Performance Using Hybrid CNN-Transformer

| Dataset | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---------|--------------|---------------|------------|--------------|
| FaceForensics++ (Testing) | 94.2 | 93.1 | 93.7 | 93.4 |
| Deepfake Detection Challenge | 95.1 | 94.0 | 94.8 | 94.4 |
| StyleGAN-Generated Dataset | 93.6 | 92.3 | 92.9 | 92.6 |
This stronger performance indicates that the model handles the more complex and varied manipulations found in the real world, such as those circulating on social media and video-sharing platforms. The model also performs well on the StyleGAN-generated dataset, with 93.6% accuracy, 92.3% precision, 92.9% recall, and a 92.6% F1-score; performance across datasets is illustrated in Figure 5.
Figure 5 Performance across Different Datasets
This dataset consists of highly photorealistic AI-generated faces. Performance on StyleGAN content remains strong, even though it is slightly lower than on FaceForensics++ and DFDC, suggesting the hybrid design can detect even subtle flaws in fully synthetic images where conventional forensic cues are weak. Overall, the cross-dataset results demonstrate that the hybrid model generalises across differences in manipulation methods, dataset characteristics, and fake-image quality. Its consistently high performance across multiple datasets supports its readiness for real-world use, where new types of synthetic media appear frequently and conditions are uncertain.
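The cross-dataset protocol implied here (train once, then score the frozen model on datasets it never saw) can be sketched as follows. The dataset loaders and the predictor are dummy stand-ins so the sketch runs end to end; they are not the paper's actual code.

```python
# Minimal sketch of a cross-dataset evaluation loop.
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(predict, images, labels):
    """Score a frozen predictor on one held-out dataset (1 = fake, 0 = real)."""
    preds = predict(images)
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision_score(labels, preds),
        "recall": recall_score(labels, preds),
        "f1": f1_score(labels, preds),
    }

def dummy_predict(x):
    # Placeholder for the trained hybrid model's prediction function
    return (x.mean(axis=1) > 0.5).astype(int)

rng = np.random.default_rng(0)
datasets = {  # placeholders for FaceForensics++, DFDC, StyleGAN loaders
    name: (rng.random((200, 16)), rng.integers(0, 2, 200))
    for name in ["FaceForensics++", "DFDC", "StyleGAN"]
}
for name, (x, y) in datasets.items():
    print(name, evaluate(dummy_predict, x, y))
```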
Table 4 Detection Performance on New Manipulation Techniques

| Manipulation Type | Real Image Detection Accuracy (%) | Fake Image Detection Accuracy (%) |
|-------------------|-----------------------------------|-----------------------------------|
| New GAN Variant (Unseen) | 92.5 | 91.7 |
| Novel Face Swap | 93.1 | 92.6 |
| Synthetic Object Insertion | 91.4 | 90.9 |
Table 4 shows how well the proposed Hybrid CNN-Transformer model identifies new, previously unseen manipulation methods, demonstrating its robustness and adaptability. For the New GAN Variant, which was not seen during training, it detects real images with 92.5% accuracy and fake images with 91.7% accuracy, meaning the model can reliably separate real from synthetic pictures even when it encounters new GAN designs. The model maintains this performance on Novel Face Swap manipulation, detecting real images with 93.1% accuracy and fake images with 92.6% accuracy, as shown in Figure 6. This indicates that the model can find the small blending errors and environmental inconsistencies that commonly arise in face swapping.
Figure 6 Detection Accuracy by Manipulation Type
For Synthetic Object Insertion manipulations, the model identifies real images with 91.4% accuracy and fake images with 90.9% accuracy.
5. CONCLUSION
This study presented a robust CNN-Transformer framework for photo authenticity detection, designed specifically to verify deepfakes and AI-generated material. By combining CNN-based local feature extraction with Transformer-based global context modelling, the proposed system detects both small artefacts and the larger inconsistencies common in manipulated images. In testing, the hybrid model outperformed standard CNN and Transformer-only designs, with an accuracy of 94.8%, a precision of 93.6%, a recall of 94.2%, and an F1-score of 93.9%. Cross-dataset validation further confirmed the system's ability to generalise, showing consistent results on FaceForensics++, DFDC, and StyleGAN-generated images. The model was also tested against previously unseen manipulation types, including new GAN variants, novel face swaps, and synthetic object insertions, and continued to detect real and fake images with above 90% accuracy, showing that it can adapt to new threats. Feature engineering methods, namely frequency-domain analysis and noise residual modelling, together improved the model's ability to pick up the subtle forensic cues that stand-alone systems often miss. Overall, the results suggest the hybrid model is suitable for real-world applications such as social media monitoring, media content investigation, and verification of court evidence. Future work will focus on adding explainable AI modules to make the model easier to interpret, and continuous learning processes to keep pace with new fake-media generation technologies. This study takes significant steps towards scalable, durable solutions for safeguarding the authenticity of digital visual material.
CONFLICT OF INTERESTS
None.
ACKNOWLEDGMENTS
None.
REFERENCES
Al-Adwan, A., Alazzam, H., Al-Anbaki, N., and Alduweib, E. (2024). Detection of Deepfake Media Using a Hybrid CNN–RNN Model and Particle Swarm Optimization (PSO) Algorithm. Computers, 13(4), 99. https://doi.org/10.3390/computers13040099
Babaei, R., Cheng, S., Duan, R., and Zhao, S. (2025). Generative Artificial Intelligence and the Evolving Challenge of Deepfake Detection: A Systematic Analysis. Journal of Sensor and Actuator Networks, 14(1), 17. https://doi.org/10.3390/jsan14010017
Barik, B. R., Nayak, A., Biswal, A., and Padhy, N. (2025). Practical Evaluation and Performance Analysis for Deepfake Detection Using Advanced AI Models. Engineering Proceedings, 87, 36. https://doi.org/10.3390/engproc2025087036
Bhandarkar, A., Khobragade, P., Pawar, R., Lokulwar, P., and Saraf, P. (2024). Deep Learning Framework for Robust Deep Fake Image Detection: A Review. In Proceedings of the International Conference on Artificial Intelligence and Quantum Computation-Based Sensor Application (ICAIQSA) (1–7). IEEE. https://doi.org/10.1109/ICAIQSA64000.2024.10882361
Bird, J. J., and Lotfi, A. (2024). Cifake: Image Classification and Explainable Identification of AI-Generated Synthetic Images. IEEE Access, 12, 15642–15650. https://doi.org/10.1109/ACCESS.2024.3356122
Firc, A., Malinka, K., and Hanáček, P. (2023). Deepfakes as a Threat to a Speaker and Facial Recognition: An Overview of Tools and Attack Vectors. Heliyon, 9, e15090. https://doi.org/10.1016/j.heliyon.2023.e15090
Heidari, A., Jafari Navimipour, N., Dag, H., and Unal, M. (2024). Deepfake Detection Using Deep Learning Methods: A Systematic and Comprehensive Review. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 14(1), e1520. https://doi.org/10.1002/widm.1520
Jheelan, J., and Pudaruth, S. (2025). Using Deep Learning to Identify Deepfakes Created Using Generative Adversarial Networks. Computers, 14(2), 60. https://doi.org/10.3390/computers14020060
Kang, J., Ji, S. K., Lee, S., Jang, D., and Hou, J. U. (2022). Detection Enhancement for Various Deepfake Types Based on Residual Noise and Manipulation Traces. IEEE Access, 10, 69031–69040. https://doi.org/10.1109/ACCESS.2022.3185121
Kerenalli, S., Yendapalli, V., and Chinnaiah, M. (2023). Classification of Deepfake Images Using a Novel Explanatory Hybrid Model. CommIT Journal, 17(2), 151–168. https://doi.org/10.21512/commit.v17i2.8761
Malik, A., Kuribayashi, M., Abdullahi, S. M., and Khan, A. N. (2022). Deepfake Detection for Human Face Images and Videos: A Survey. IEEE Access, 10, 18757–18775. https://doi.org/10.1109/ACCESS.2022.3151186
Masood, M., Nawaz, M., Malik, K. M., Javed, A., Irtaza, A., and Malik, H. (2023). Deepfakes Generation and Detection: State-of-the-Art, Open Challenges, Countermeasures, and Way Forward. Applied Intelligence, 53, 3974–4026. https://doi.org/10.1007/s10489-022-03766-z
Mukta, M. S. H., Ahmad, J., Raiaan, M. A. K., Islam, S., Azam, S., Ali, M. E., and Jonkman, M. (2023). An Investigation of the Effectiveness of Deepfake Models and Tools. Journal of Sensor and Actuator Networks, 12(4), 61. https://doi.org/10.3390/jsan12040061
Nah, F.-H., Zheng, R., Cai, J., Siau, K., and Chen, L. (2023). Generative AI and ChatGPT: Applications, Challenges, and AI–Human Collaboration. Journal of Information Technology Case and Application Research, 25(3), 277–304. https://doi.org/10.1080/15228053.2023.2233814
Naitali, A., Ridouani, M., Salahdine, F., and Kaabouch, N. (2023). Deepfake Attacks: Generation, Detection, Datasets, Challenges, and Research Directions. Computers, 12(10), 216. https://doi.org/10.3390/computers12100216
This work is licensed under a: Creative Commons Attribution 4.0 International License
© ShodhKosh 2025. All Rights Reserved.