EXPLORING MULTIMODAL GENERATIVE SYSTEMS: EFFICIENT TRAINING AND EVALUATION FOR VISUAL AND CREATIVE APPLICATIONS
DOI: https://doi.org/10.29121/shodhkosh.v7.i2s.2026.7265

Keywords: Multi-Modal Generative Models, Training Efficiency Optimization, Low-Rank Adaptation, Knowledge Distillation, Sparse Cross-Modal Attention, Diffusion–GAN Hybrid Models, Benchmark-Based Evaluation

Abstract [English]
Multi-modal generative models can now produce high-quality text-to-image and audio-visual synthesis, but their adoption remains limited by high computational cost, large memory requirements, and lengthy training. Current methods typically rely on large-scale architectures that trade efficiency for performance, which restricts scalability and practicality in real-world applications. To address this, the paper proposes an efficient multi-modal generative framework that optimizes training without sacrificing output fidelity or cross-modal coherence. The main aim of the work is to design and evaluate training methods that substantially reduce computational load without compromising the quality of generated content across modalities. The proposed approach combines Low-Rank Adaptation for Multi-Modal Generators (LoRA-MMG) for parameter-efficient fine-tuning of high-capacity models, Knowledge-Distilled Multi-Modal Generators (KD-MMG) to transfer representational knowledge from large teacher models to compact students, and a Sparse Cross-Modal Attention Network (SCMAN) to reduce attention complexity during modality fusion. To preserve quality, a Hybrid Diffusion–GAN Multi-Modal Synthesis model (HDG-MMS) serves as a high-performance reference and distillation source. The framework is evaluated on standard benchmarks and real-world data: COCO Captions for text-to-image generation, VGGSound for audio-visual generation, and Conceptual Captions for large-scale image-text alignment. Experimental findings show substantial reductions in training time, memory consumption, and computational cost, with stable or improved perceptual quality and semantic alignment.
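The low-rank adaptation idea behind LoRA-MMG can be sketched in a few lines: rather than updating a frozen weight matrix W directly, two small trainable factors A and B of rank r are learned, so the number of trainable parameters drops from d_out·d_in to r·(d_in + d_out). The sketch below is illustrative only (NumPy, arbitrary dimensions chosen for the example); it is not the paper's implementation:

```python
import numpy as np

# Minimal LoRA sketch: the pretrained weight W stays frozen; only the
# low-rank factors A (down-projection) and B (up-projection) are trained.
# d_in, d_out, rank and alpha below are illustrative choices.
rng = np.random.default_rng(0)
d_in, d_out, rank, alpha = 512, 512, 8, 16

W = rng.standard_normal((d_out, d_in))        # frozen pretrained weight
A = rng.standard_normal((rank, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, rank))                   # trainable up-projection, zero init

def lora_forward(x):
    # Effective weight is W + (alpha / rank) * B @ A, applied lazily so
    # the full matrix is never materialized during training.
    return x @ W.T + (alpha / rank) * (x @ A.T) @ B.T

x = rng.standard_normal((4, d_in))
y = lora_forward(x)

full_params = d_out * d_in
lora_params = rank * (d_in + d_out)
print(y.shape)                    # (4, 512)
print(lora_params / full_params)  # trainable fraction: 0.03125
```

With B zero-initialized, the adapted model starts out identical to the frozen baseline, which is the standard way LoRA keeps early training stable.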
The results demonstrate that well-structured, efficiency-oriented methods can remain competitive with far larger architectures. The article thus provides a scalable, benchmark-validated path to deploying high-performance multi-modal generative models in resource-constrained systems and practical settings.
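The knowledge-distillation component (KD-MMG) trains a compact student to imitate a large teacher such as the HDG-MMS reference model. A common, response-based form of this matches the teacher's temperature-softened output distribution; the minimal sketch below shows that standard loss, with the temperature and logit shapes being illustrative assumptions rather than values from the paper:

```python
import numpy as np

# Response-based knowledge distillation sketch: the student minimizes
# KL(teacher || student) over temperature-softened distributions.
def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    # The T**2 factor keeps gradient magnitudes comparable across
    # temperatures (as in Hinton et al.'s original formulation).
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    return (T ** 2) * kl.mean()

rng = np.random.default_rng(1)
teacher_logits = rng.standard_normal((8, 10))  # batch of 8, 10 classes
student_logits = rng.standard_normal((8, 10))

print(distillation_loss(teacher_logits, teacher_logits))  # 0.0: perfect match
print(distillation_loss(student_logits, teacher_logits) > 0)
```

In practice this distillation term is usually combined with a task loss on ground-truth targets; the weighting between the two is a tuning choice.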
License
Copyright (c) 2026 Dr. Samir Nasruddin Ajani, Dr. Midhunchakkaravarthy, Dr. Mudassir Khan

This work is licensed under a Creative Commons Attribution 4.0 International License.
Under the CC-BY license, authors retain copyright while allowing anyone to download, reuse, reprint, modify, distribute, and/or copy their contribution, provided the work is properly attributed to its authors. No further permission from the authors or the journal board is required.
This journal provides immediate open access to its content on the principle that making research freely available to the public supports a greater global exchange of knowledge.