MULTIMODAL EMOTION RECOGNITION USING AUDIO-TEXT FUSION AND TRANSFORMER-BASED CONTEXTUAL REPRESENTATION LEARNING

Authors

  • Dr. Priyanka Singh Niranjan Assistant Professor, Amity University, Noida, India
  • Dr. Rasna Sehrawat Assistant Professor, Amity University, Noida, India
  • Dr. Ashwini Katkar Assistant Professor, Vidyavardhini's College of Engineering and Technology, Vasai, West, Maharashtra, India
  • Mona Sharma Assistant Professor, School of Business Management, Noida International University, Greater Noida 203201, India
  • Pushpa Nagini Sripada Professor, Meenakshi College of Arts and Science, Meenakshi Academy of Higher Education and Research, Chennai, Tamil Nadu 600117, India
  • Sulabha Narendra Patil Department of Engineering, Science and Humanities, Vishwakarma Institute of Technology, Pune, Maharashtra, 411037, India

DOI:

https://doi.org/10.29121/shodhkosh.v7.i1s.2026.7045

Keywords:

Multimodal Emotion Recognition, Audio–Text Fusion, Transformer Networks, Cross-Modal Attention, Contextual Representation Learning, Affective Computing

Abstract [English]

Emotion recognition is a core capability of affective computing, enabling intelligent machines to detect human emotional states during natural conversation. Traditional emotion recognition methods based on a single modality, such as speech or text, often fail to capture the complementary affective and semantic information present in human communication. To overcome this shortcoming, this work presents a Transformer-based cross-modal emotion recognition system that combines audio-text fusion with contextual representation learning. The proposed model uses Transformer encoders to acquire contextual acoustic and linguistic representations, followed by a cross-modal attention mechanism that dynamically aligns and combines modality-specific information. This attention-based fusion models inter-modal dependencies effectively and improves the discriminability of emotion classes. Extensive experiments on well-known multimodal emotion datasets show that the proposed method consistently outperforms both unimodal and multimodal baselines in accuracy and F1-score. The findings confirm that Transformer-based fusion with contextual representation learning substantially improves robustness and generalization in emotion recognition. The proposed framework offers a scalable and efficient solution for real-world applications including conversational agents, human-computer interaction, and affective analysis systems.
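As a rough illustration only (not the authors' implementation, whose architecture details are not given here), the cross-modal attention step described in the abstract can be sketched in NumPy: text token embeddings act as queries and audio frame features supply keys and values, so each word attends over the acoustic sequence before the two modalities are fused. All dimensions and the concatenation-based fusion are assumptions for the sketch.

```python
import numpy as np

def cross_modal_attention(text_feats, audio_feats):
    """Scaled dot-product cross-modal attention (illustrative sketch).

    text_feats:  (n_text, d)  contextual word embeddings (queries)
    audio_feats: (n_audio, d) acoustic frame features (keys/values)
    Returns fused features of shape (n_text, 2*d).
    """
    d_k = text_feats.shape[-1]
    # Attention scores between every word and every audio frame: (n_text, n_audio)
    scores = text_feats @ audio_feats.T / np.sqrt(d_k)
    # Numerically stable row-wise softmax over the audio axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Audio context vector aligned to each text token: (n_text, d)
    context = weights @ audio_feats
    # Simple fusion: concatenate each word with its attended audio context
    return np.concatenate([text_feats, context], axis=-1)

rng = np.random.default_rng(0)
text = rng.standard_normal((6, 64))    # 6 word embeddings
audio = rng.standard_normal((40, 64))  # 40 acoustic frames
fused = cross_modal_attention(text, audio)
print(fused.shape)  # (6, 128)
```

In a full Transformer-based system this attention would be multi-headed and learned, and the fused representation would feed a classification head; the sketch only shows the alignment-and-combine idea.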


Published

2026-02-17

How to Cite

Niranjan, P. S., Sehrawat, R., Katkar, A., Sharma, M., Sripada, P. N., & Patil, S. N. (2026). MULTIMODAL EMOTION RECOGNITION USING AUDIO-TEXT FUSION AND TRANSFORMER-BASED CONTEXTUAL REPRESENTATION LEARNING. ShodhKosh: Journal of Visual and Performing Arts, 7(1s), 190–201. https://doi.org/10.29121/shodhkosh.v7.i1s.2026.7045