SOUND EMOTION MAPPING USING DEEP LEARNING

Authors

  • Dr. Sachin Vasant Chaudhari, Electronics and Computer Engineering, Sanjivani College of Engineering, Kopargaon, Ahmednagar, India
  • Sadhana Sargam, Assistant Professor, School of Business Management, Noida International University, 203201, India
  • Harsimrat Kandhari, Chitkara Centre for Research and Development, Chitkara University, Himachal Pradesh, Solan, 174103, India
  • Madhur Taneja, Centre of Research Impact and Outcome, Chitkara University, Rajpura-140417, Punjab, India
  • Mr. Sourav Panda, Assistant Professor, Department of Film, Parul Institute of Design, Parul University, Vadodara, Gujarat, India
  • Dr. L. Sujihelen, Associate Professor, Department of Computer Science and Engineering, Sathyabama Institute of Science and Technology, Chennai, Tamil Nadu, India

DOI:

https://doi.org/10.29121/shodhkosh.v6.i1s.2025.6625

Keywords:

Affective Computing, Emotion Recognition, CNN–LSTM, Audio Processing, Deep Learning, Speech Analysis, Temporal Modeling, Log-Mel Features

Abstract [English]

Emotion recognition from sound is an important area of affective computing in which machines use vocal cues as indicators of human emotion to support empathic and adaptive interactions. Traditional methods based on handcrafted acoustic features such as MFCCs and LPC are limited in capturing nonlinear, context-dependent emotional dynamics and often degrade under variations in speaker and recording conditions. To address these issues, this study proposes a deep learning-based sound emotion mapping framework that combines Convolutional Neural Networks (CNNs) for spatial feature extraction with Long Short-Term Memory (LSTM) networks for temporal modeling. The CNN layers detect patterns in log-mel spectrograms and prosodic cues, while the LSTM layers track emotional transitions over time, yielding a robust end-to-end system that requires no manual feature design. On the RAVDESS and Berlin EMO-DB test sets, the proposed CNN-LSTM model achieved accuracies of 93.2% and 91.4%, respectively, outperforming SVM and CNN-only baselines. Attention-weight visualization showed that the model concentrates on the mid-frequency region, consistent with psychoacoustic theories of emotional prosody.
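
As an illustrative sketch of the front end described above (not the authors' released code), the following Python snippet computes log-mel features with librosa. The sample rate, mel-band count, and hop length are assumed defaults for illustration, not the settings reported in the paper.

    # Log-mel feature extraction (illustrative; all parameters are assumptions).
    import librosa
    import numpy as np

    def log_mel(path, sr=16000, n_mels=64, hop_length=256):
        y, sr = librosa.load(path, sr=sr)            # load and resample the clip
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr, n_mels=n_mels, hop_length=hop_length)
        return librosa.power_to_db(mel, ref=np.max)  # log-compress to decibels

    # spec = log_mel("speech.wav")  # shape: (n_mels, n_frames)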
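
The CNN-LSTM pipeline itself can be sketched in a few lines of PyTorch: convolutional layers summarize local time-frequency patterns in the spectrogram, and an LSTM reads the resulting frame sequence to model how emotion evolves over time. Every layer size, the eight-class output, and the pooling scheme below are assumptions for illustration; the sketch does not reproduce the authors' exact architecture.

    # Minimal CNN-LSTM for sound emotion mapping (hypothetical dimensions).
    import torch
    import torch.nn as nn

    class CNNLSTMEmotion(nn.Module):
        def __init__(self, n_mels=64, n_classes=8):
            super().__init__()
            # CNN front end: local time-frequency patterns in the log-mel input
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),   # halve both the mel and time axes
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
            )
            # LSTM back end: temporal modeling of emotional transitions
            self.lstm = nn.LSTM(input_size=32 * (n_mels // 4),
                                hidden_size=128, batch_first=True)
            self.fc = nn.Linear(128, n_classes)

        def forward(self, x):
            # x: (batch, 1, n_mels, time) log-mel spectrogram
            feats = self.cnn(x)            # (batch, 32, n_mels/4, time/4)
            b, c, f, t = feats.shape
            seq = feats.permute(0, 3, 1, 2).reshape(b, t, c * f)  # frame sequence
            out, _ = self.lstm(seq)
            return self.fc(out[:, -1, :])  # classify from the final frame state

    model = CNNLSTMEmotion()
    logits = model(torch.randn(4, 1, 64, 128))  # 4 clips, 64 mels, 128 frames
    print(logits.shape)                         # torch.Size([4, 8])

Passing the dummy batch through the model yields logits of shape (4, 8), one score per clip and emotion class; in training these would feed a cross-entropy loss over the dataset's emotion labels.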

References

Al-Talabani, A. A., and Al-Jubouri, M. A. (2021). Emotion Recognition from Speech Signals Using Machine Learning Techniques: A Review. Biomedical Signal Processing and Control, 69, Article 102936.

Balaji, A., Balanjali, D., Subbaiah, G., Reddy, A. A., and Karthik, D. (2025). Federated Deep Learning for Robust Multi-Modal Biometric Authentication Based on Facial and Eye-Blink Cues. International Journal of Advanced Computer Engineering and Communication Technology, 14(1), 17–24. https://doi.org/10.65521/ijacect.v14i1.167

Chaturvedi, I., Noel, T., and Satapathy, R. (2022). Speech Emotion Recognition Using Audio Matching. Electronics, 11(23), Article 3943. https://doi.org/10.3390/electronics11233943

Chintalapudi, K. S., Patan, I. A. K., Sontineni, H. V., Muvvala, V. S. K., Gangashetty, S. V., and Dubey, A. K. (2023). Speech Emotion Recognition Using Deep Learning. In Proceedings of the 2023 International Conference on Computer Communication and Informatics (ICCCI) (pp. 1–5). IEEE. https://doi.org/10.1109/ICCCI56745.2023.10128612

Cummins, N., Sethu, V., Kundu, S., and McKeown, G. (2013). The PASCAL Affective Audio-Visual Database. In Proceedings of the 21st ACM International Conference on Multimedia (pp. 1025–1028).

De Silva, U., Madanian, S., Templeton, J. M., Poellabauer, C., Schneider, S., and Narayanan, A. (2023). Design Concept of a Mental Health Monitoring Application with Explainable Assessments [Conference paper]. In ACIS 2023 Proceedings (Paper 28). AIS Electronic Library.

Gupta, and Mishra, D. (2023). Sentimental Voice Recognition: An Approach to Analyse the Emotion by Voice. In Proceedings of the 2023 International Conference on Electrical, Electronics, Communication and Computers (ELEXCOM) (pp. 1–6). IEEE. https://doi.org/10.1109/ELEXCOM58812.2023.10370064

Hook, J., Noroozi, F., Toygar, O., and Anbarjafari, G. (2019). Automatic Speech-Based Emotion Recognition Using Paralinguistic Features. Bulletin of the Polish Academy of Sciences: Technical Sciences, 67(3), 1–10. https://doi.org/10.24425/bpasts.2019.129647

Kusal, S., Patil, S., Kotecha, K., Aluvalu, R., and Varadarajan, V. (2021). AI-Based Emotion Detection for Textual Big Data: Techniques and Contribution. Big Data and Cognitive Computing, 5(3), Article 43. https://doi.org/10.3390/bdcc5030043

Li, H., Zhang, X., and Wang, M.-J. (2021). Research on Speech Emotion Recognition Based on Deep Neural Network. In Proceedings of the 2021 IEEE 6th International Conference on Signal and Image Processing (ICSIP) (pp. 795–799). IEEE. https://doi.org/10.1109/ICSIP52628.2021.9689043

Livingstone, S. R., and Russo, F. A. (2018). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Zenodo. https://doi.org/10.5281/zenodo.1188976

Mittal, A., Arora, V., and Kaur, H. (2021). Speech Emotion Recognition Using HuBERT Features and Convolutional Neural Networks. In Proceedings of the 2021 6th International Conference on Computing, Communication and Security (ICCCS) (pp. 1–5). IEEE. https://doi.org/10.1109/ICCCS51487.2021.9776325

Scherer, K. R. (2003). Vocal Communication of Emotion: A Review of Research Paradigms. Speech Communication, 40(1–2), 227–256. https://doi.org/10.1016/S0167-6393(02)00084-5

Trigeorgis, G., et al. (2016). Adieu Features? End-to-End Speech Emotion Recognition Using a Deep Convolutional Recurrent Network. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5200–5204). IEEE. https://doi.org/10.1109/ICASSP.2016.7472669

Zhang, Y., Yang, Y., Li, Y., Li, W., and Zhao, J. (2021). Speech Emotion Recognition Based on HuBERT and Attention Mechanism. In Proceedings of the 2021 6th International Conference on Automation, Control and Robotics Engineering (CACRE) (pp. 277–280). IEEE.

Zhao, J., Mao, X., and Chen, L. (2019). Speech Emotion Recognition Using Deep 1D and 2D CNN LSTM Networks. Biomedical Signal Processing and Control, 47, 312–323. https://doi.org/10.1016/j.bspc.2018.08.035

Published

2025-12-10

How to Cite

Chaudhari, D. S. V., Sargam, S., Kandhari, H., Taneja, M., Panda, S., & Sujihelen, L. (2025). SOUND EMOTION MAPPING USING DEEP LEARNING. ShodhKosh: Journal of Visual and Performing Arts, 6(1s), 438–446. https://doi.org/10.29121/shodhkosh.v6.i1s.2025.6625