SPAMGUARD: AN INTEGRATED KALMAN FILTER AND CNN APPROACH FOR EMAIL SPAM CLASSIFICATION

Umesh; Yuvraj Pawar; Abhay Sharma; Akshat Chauhan; Suman

doi:10.29121/ijetmr.v10.i6.2023.1600

Authors

Umesh Computer Science and Engineering, Echelon Institute of Technology, Faridabad
Yuvraj Pawar Computer Science and Engineering, Echelon Institute of Technology, Faridabad
Abhay Sharma Computer Science and Engineering, Echelon Institute of Technology, Faridabad
Akshat Chauhan Computer Science and Engineering, Echelon Institute of Technology, Faridabad
Suman Computer Science and Engineering, Echelon Institute of Technology, Faridabad

Keywords:

Kalman, CNN, Email, Classification, Spam

Abstract

Email remains a primary mode of communication for both professional and personal use due to its low cost, accessibility, and widespread adoption. However, the open nature of email systems exposes users to spam (unsolicited, irrelevant, or malicious messages), which poses risks such as phishing, fraud, and information overload. Existing spam detection mechanisms struggle to keep pace with the evolving strategies used by spammers and must balance aggressive filtering against the risk of losing legitimate messages. To address these limitations, this study proposes a spam detection framework that combines Kalman Filters and Convolutional Neural Networks (CNNs). Kalman Filters are used to preprocess and denoise the input text data, mitigating irregularities and improving feature consistency. CNNs are then employed to automatically learn hierarchical text representations, enabling robust classification of emails as spam or legitimate. Integrating Kalman-based preprocessing with deep learning is intended to improve both detection accuracy and system reliability. Additionally, the system provides a quick summary view of classified emails to help users rapidly assess message content. Experimental results indicate that the proposed method has the potential to outperform traditional spam detection techniques, offering a scalable and adaptive solution to modern email security challenges.
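
The abstract does not say how the Kalman Filter is applied to text features or which CNN architecture is used, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes each email has already been turned into a one-dimensional sequence of numeric feature values (for example, one score per token), smooths that sequence with a scalar Kalman filter using a random-walk state model, and then applies a single convolution filter with ReLU and global max pooling, as one CNN channel might. The function names `kalman_denoise` and `conv1d_max_pool`, the noise variances, and the kernel are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def kalman_denoise(values: Sequence[float],
                   process_var: float = 1e-3,
                   measurement_var: float = 1e-1) -> List[float]:
    """Smooth a 1-D sequence with a scalar Kalman filter.

    Assumption (not from the paper): the hidden state follows a random
    walk, and each observed value is the state plus Gaussian noise.
    """
    if not values:
        return []
    estimate = values[0]   # initial state estimate
    error_var = 1.0        # initial variance of the estimate
    smoothed = []
    for z in values:
        # Predict step: the state is unchanged, its uncertainty grows.
        error_var += process_var
        # Update step: blend the prediction with the new observation.
        gain = error_var / (error_var + measurement_var)
        estimate = estimate + gain * (z - estimate)
        error_var = (1.0 - gain) * error_var
        smoothed.append(estimate)
    return smoothed


def conv1d_max_pool(signal: Sequence[float],
                    kernel: Sequence[float]) -> float:
    """Apply one convolution filter, then ReLU and global max pooling."""
    width = len(kernel)
    if len(signal) < width:
        raise ValueError("signal is shorter than the kernel")
    activations = [
        max(0.0, sum(k * signal[i + j] for j, k in enumerate(kernel)))
        for i in range(len(signal) - width + 1)
    ]
    return max(activations)


# Hypothetical per-token feature values for one email.
features = [1.0, 3.0, 1.0, 3.0, 1.0, 3.0]
denoised = kalman_denoise(features)
pooled = conv1d_max_pool(denoised, kernel=[0.5, 0.5])
```

In a full classifier, many such filters of different widths would run over learned embeddings rather than a single scalar sequence, and their pooled outputs would feed a final layer that predicts spam or legitimate. The sketch only illustrates the order of operations described in the abstract: denoise first, then convolve.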
