Overview of Machine Learning Algorithms for Solving the Spam Detection Problem
Keywords:
machine learning, spam filtering, spam detection algorithms, classification framework, mathematical formulation, algorithm accuracy, research directions, spam identification, machine learning in spam filtering.
Abstract
This study addresses the problem of spam filtering using machine learning techniques. It reviews existing spam identification algorithms and proposes a classification framework for them. A detailed mathematical formulation of the algorithms is presented, together with empirical results on the accuracy of their current implementations. Directions for future research that could improve spam detection are outlined.