MULTIMODAL EMOTION RECOGNITION USING AUDIO-TEXT FUSION AND TRANSFORMER-BASED CONTEXTUAL REPRESENTATION LEARNING
DOI: https://doi.org/10.29121/shodhkosh.v7.i1s.2026.7045

Keywords: Multimodal Emotion Recognition, Audio–Text Fusion, Transformer Networks, Cross-Modal Attention, Contextual Representation Learning, Affective Computing

Abstract [English]
Emotion recognition is a central capability of affective computing, enabling intelligent machines to detect human emotional states during natural conversation. Traditional methods that rely on a single modality, such as speech or text, often fail to capture the complementary affective and semantic information present in human communication. To overcome this limitation, this work presents a Transformer-based cross-modal emotion recognition system that combines audio–text fusion with contextual representation learning. The proposed model uses modality-specific Transformer encoders to learn contextual acoustic and linguistic representations, followed by a cross-modal attention mechanism that dynamically aligns and fuses modality-specific information. This attention-based fusion models inter-modal dependencies effectively and improves the discriminability of emotion categories. Extensive experiments on well-known multimodal emotion datasets show that the proposed method consistently outperforms both unimodal and multimodal baselines in accuracy and F1-score. The findings confirm that Transformer-based fusion with contextual representation learning substantially improves robustness and generalization in emotion recognition. The framework is scalable and efficient, supporting real-world applications such as conversational agents, human–computer interaction, and affective analysis systems.
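The architecture described above — modality-specific Transformer encoders followed by cross-modal attention fusion and a classifier — can be sketched in PyTorch. This is a minimal illustrative sketch, not the authors' implementation: the layer counts, dimensions, pooling strategy, and the choice of text-as-query over audio keys/values are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Hypothetical sketch: audio/text Transformer encoders + cross-modal attention."""

    def __init__(self, dim=128, heads=4, n_emotions=6):
        super().__init__()
        # Modality-specific Transformer encoders for contextual representations
        self.audio_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=2)
        self.text_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True),
            num_layers=2)
        # Cross-modal attention: one modality attends over the other
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, n_emotions)

    def forward(self, audio_feats, text_feats):
        a = self.audio_enc(audio_feats)   # (B, T_audio, dim)
        t = self.text_enc(text_feats)     # (B, T_text, dim)
        # Text tokens query the audio sequence (an assumed direction)
        fused, _ = self.cross_attn(query=t, key=a, value=a)  # (B, T_text, dim)
        # Mean-pool fused and text streams, concatenate, classify
        pooled = torch.cat([fused.mean(dim=1), t.mean(dim=1)], dim=-1)
        return self.classifier(pooled)    # (B, n_emotions) emotion logits
```

In this sketch the cross-attention lets each text token weight the audio frames most relevant to it, which is one common way to realize the "dynamic alignment" of modalities the abstract refers to.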
License
Copyright (c) 2026 Dr. Priyanka Singh Niranjan, Dr. Rasna Sehrawat, Dr. Ashwini Katkar, Mona Sharma, Pushpa Nagini Sripada, Sulabha Narendra Patil

This work is licensed under a Creative Commons Attribution 4.0 International License.
Under the CC-BY license, authors retain copyright, and anyone may download, reuse, reprint, modify, distribute, and/or copy their contribution, provided the work is properly attributed to its author. No further permission from the author or journal board is required.
This journal provides immediate open access to its content on the principle that making research freely available to the public supports a greater global exchange of knowledge.