EXPLORING MULTIMODAL GENERATIVE SYSTEMS: EFFICIENT TRAINING AND EVALUATION FOR VISUAL AND CREATIVE APPLICATIONS
DOI: https://doi.org/10.29121/shodhkosh.v7.i2s.2026.7265

Keywords: Multi-Modal Generative Models, Training Efficiency Optimization, Low-Rank Adaptation, Knowledge Distillation, Sparse Cross-Modal Attention, Diffusion–GAN Hybrid Models, Benchmark-Based Evaluation

Abstract [English]
Multi-modal generative models can now produce high-quality text-to-image and audio-visual synthesis, but their adoption remains limited by high computational cost, large memory requirements, and lengthy training. Current methods typically rely on large-scale architectures that trade efficiency for performance, which restricts scalability and practicality in real-world applications. To address this, the paper proposes an efficient multi-modal generative framework that optimizes training without sacrificing output fidelity or cross-modal coherence. The main aim of the work is to design and evaluate training methods that substantially reduce computational load without compromising the quality of generated content across modalities. The proposed approach combines Low-Rank Adaptation for Multi-Modal Generators (LoRA-MMG) for parameter-efficient fine-tuning of high-capacity models, Knowledge-Distilled Multi-Modal Generators (KD-MMG) to transfer representational knowledge from large teacher models to compact students, and a Sparse Cross-Modal Attention Network (SCMAN) to reduce attention complexity during modality fusion. To preserve quality, a Hybrid Diffusion–GAN Multi-Modal Synthesis model (HDG-MMS) serves as a high-performance reference and distillation source. The framework is evaluated on standard benchmarks and real-world data: COCO Captions for text-to-image generation, VGGSound for audio-visual generation, and Conceptual Captions for large-scale image-text alignment. Experimental findings show substantial reductions in training time, memory consumption, and computational cost, with stable or improved perceptual quality and semantic alignment.
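The low-rank adaptation idea behind LoRA-MMG can be sketched in a few lines: rather than updating a frozen weight matrix W directly, two small trainable factors A and B of rank r are learned, so the number of trainable parameters drops from d_out·d_in to r·(d_in + d_out). The sketch below is illustrative only (NumPy, arbitrary dimensions chosen for the example); it is not the paper's implementation:

```python
import numpy as np

# Minimal LoRA sketch: the pretrained weight W stays frozen; only the
# low-rank factors A (down-projection) and B (up-projection) are trained.
# d_in, d_out, rank and alpha below are illustrative choices.
rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 512, 512, 8, 16

W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection, zero init

def lora_forward(x):
    # Effective weight is W + (alpha / rank) * B @ A, applied lazily so
    # the full matrix is never materialized during training.
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T

x = rng.standard_normal((4, d_in))
y = lora_forward(x)

full_params = d_out * d_in
lora_params = rank * (d_in + d_out)
print(y.shape)                    # (4, 512)
print(lora_params / full_params)  # trainable fraction: 0.03125
```

With B zero-initialized, the adapted model starts out identical to the frozen baseline, which is the standard way LoRA keeps early training stable.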
The results demonstrate that well-structured, efficiency-oriented methods can remain competitive with far larger architectures. The article thus provides a scalable, benchmark-validated path to deploying high-performance multi-modal generative models in resource-constrained systems and practical settings.
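The knowledge-distillation component (KD-MMG) trains a compact student to imitate a large teacher such as the HDG-MMS reference model. A common, response-based form of this matches the teacher's temperature-softened output distribution; the minimal sketch below shows that standard loss, with the temperature and logit shapes being illustrative assumptions rather than values from the paper:

```python
import numpy as np

# Response-based knowledge distillation sketch: the student minimizes
# KL(teacher || student) over temperature-softened distributions.
def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    # The T**2 factor keeps gradient magnitudes comparable across
    # temperatures (as in Hinton et al.'s original formulation).
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    return (T ** 2) * kl.mean()

rng = np.random.default_rng(1)
teacher_logits = rng.standard_normal((8, 10))  # batch of 8, 10 classes
student_logits = rng.standard_normal((8, 10))

print(distillation_loss(teacher_logits, teacher_logits))  # 0.0: perfect match
print(distillation_loss(student_logits, teacher_logits) > 0)
```

In practice this distillation term is usually combined with a task loss on ground-truth targets; the weighting between the two is a tuning choice.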
License
Copyright (c) 2026 Dr. Samir Nasruddin Ajani, Dr. Midhunchakkaravarthy, Dr. Mudassir Khan

This work is licensed under a Creative Commons Attribution 4.0 International License.
Under the CC-BY license, authors retain copyright while allowing anyone to download, reuse, reprint, modify, distribute, and/or copy their contribution, provided the work is properly attributed to its authors. No further permission from the authors or the journal board is required.
This journal provides immediate open access to its content on the principle that making research freely available to the public supports a greater global exchange of knowledge.