INTERPRETING MUSICAL MOOD THROUGH MULTIMODAL FEATURE INTEGRATION: A SCALABLE FRAMEWORK FOR INTELLIGENT MUSIC ANALYSIS

Authors

  • Shital Shankar Gujar, Pacific Academy of Higher Education and Research University, Udaipur, Rajasthan, India
  • Ali Yawar Reha, Pacific Academy of Higher Education and Research University, Udaipur, Rajasthan, India

DOI:

https://doi.org/10.29121/shodhkosh.v7.i5s.2026.7539

Keywords:

Music Emotion Recognition, Multimodal Deep Learning, Mel-Spectrogram, BERT Embeddings, Attention Fusion Mechanism, Affective Computing, Context-Aware Classification

Abstract [English]

Music emotion recognition is an important field for intelligent recommendation systems, affective computing, and personalized media analytics. Unimodal methods that rely on audio or textual information alone, however, cannot capture the complex, context-dependent emotional characteristics inherent in music. This paper introduces a scalable multimodal deep neural framework for context-aware music emotion recognition that combines acoustic and semantic modalities to achieve improved performance. The central problem addressed is the poor generalization of traditional models, which stems from incomplete emotional representation and the absence of cross-modal interaction. The goal is to design a unified, scalable architecture that efficiently combines audio-based features (Mel-spectrograms processed by CNN/CRNN models) with lyric-based semantic embeddings (TF-IDF, Word2Vec, and BERT) through an attention-based fusion mechanism. The proposed approach is evaluated on benchmark datasets, DEAM (Database for Emotional Analysis in Music) and the Million Song Dataset (MSD) with its lyric extensions, which ensures robustness across music genres and annotation schemes. Comparisons against audio-only, text-only, and late-fusion baselines show significant improvements: the proposed framework achieves an accuracy of 84.6%, exceeding text-only, audio-only, and late-fusion models by 7.1, 12.2, and 5.5 percentage points respectively, along with a higher F1-score and more stable generalization. These results confirm that multimodal integration enhances context awareness and emotional discrimination. Applications of this work include real-time music recommendation, emotion-aware playlists, and adaptive multimedia systems. In summary, the proposed framework offers a scalable, robust, and high-performance foundation for next-generation music emotion recognition systems.
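
The attention-based fusion step described in the abstract can be illustrated with a minimal PyTorch sketch. The layer sizes, the four-class emotion taxonomy, the use of a precomputed 768-dimensional BERT lyric embedding, and the module name AttentionFusionMER are illustrative assumptions rather than the authors' exact configuration; the sketch only shows how learned modality-level attention weights can gate a CNN audio branch and a lyric-embedding branch before classification.

```python
# A minimal sketch of attention-based multimodal fusion for music emotion
# recognition. All dimensions and names below are illustrative assumptions,
# not the paper's reported architecture.
import torch
import torch.nn as nn

class AttentionFusionMER(nn.Module):
    def __init__(self, n_mels=128, text_dim=768, hidden=256, n_classes=4):
        super().__init__()
        # Audio branch: small CNN over a (1, n_mels, time) Mel-spectrogram.
        self.audio_cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden),
        )
        # Text branch: project a precomputed BERT sentence embedding.
        self.text_proj = nn.Linear(text_dim, hidden)
        # Attention fusion: score each modality, softmax, weighted sum.
        self.attn = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, mel, lyric_emb):
        a = self.audio_cnn(mel)           # (batch, hidden)
        t = self.text_proj(lyric_emb)     # (batch, hidden)
        stacked = torch.stack([a, t], 1)  # (batch, 2, hidden)
        # Softmax over the two modality scores yields per-song weights.
        weights = torch.softmax(self.attn(stacked), dim=1)  # (batch, 2, 1)
        fused = (weights * stacked).sum(dim=1)              # (batch, hidden)
        return self.classifier(fused)

# Smoke test with random inputs: an 8-song batch of 128-bin, 431-frame
# Mel-spectrograms plus 768-dim lyric embeddings.
model = AttentionFusionMER()
logits = model(torch.randn(8, 1, 128, 431), torch.randn(8, 768))
print(logits.shape)  # torch.Size([8, 4])
```

In a scheme of this kind, the softmax over the modality scores lets the model lean on lyrics when the audio is ambiguous and vice versa, which is the behavior the abstract attributes to attention-based fusion.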


Published

2026-04-17

How to Cite

Gujar, S. S., & Reha, A. Y. (2026). INTERPRETING MUSICAL MOOD THROUGH MULTIMODAL FEATURE INTEGRATION: A SCALABLE FRAMEWORK FOR INTELLIGENT MUSIC ANALYSIS. ShodhKosh: Journal of Visual and Performing Arts, 7(5s), 68–81. https://doi.org/10.29121/shodhkosh.v7.i5s.2026.7539