Palak Gupta - Data Analyst

Portfolio Project 9:

Spam Classifier Model

Services:

Machine Learning | Python

Overview

The Spam Classifier Model project focused on developing a machine learning model that can accurately distinguish between spam and legitimate (ham) messages. The objective was to automate spam detection in SMS or email systems using Natural Language Processing (NLP) techniques and classification algorithms. This model enhances user experience by reducing unnecessary or malicious communication.

Research: The research phase involved analyzing common characteristics of spam messages, such as promotional keywords, urgency, links, and unusual formatting. The labeled SMS Spam Collection dataset was used, comprising thousands of real-world messages marked as 'spam' or 'ham'.

Information Architecture: The dataset was cleaned by removing stop words and punctuation, followed by tokenization, stemming, and lemmatization. Text vectorization methods such as TF-IDF (Term Frequency–Inverse Document Frequency) and CountVectorizer were used to convert raw text into numerical features suitable for machine learning models.
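To make the vectorization step concrete, here is a minimal sketch of TF-IDF written in plain Python so the arithmetic is visible. It is not the project's code: the stop-word list, the sample messages, and the function names are assumptions for illustration, and it uses the textbook formula (term frequency times log of inverse document frequency) rather than the smoothed, normalized variant that scikit-learn's TfidfVectorizer applies by default.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "to", "is", "you", "for", "at", "your"}


def tokenize(text):
    """Lowercase the text, strip punctuation, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document (textbook TF-IDF)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero on empty messages
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up sample messages, not taken from the dataset.
messages = [
    "Win a FREE prize now!",
    "Are you free for lunch at noon?",
    "Claim your prize now",
]
vectors = tfidf(messages)
```

In this toy corpus "win" appears in only one message while "free" appears in two, so "win" receives the larger weight in the first message. That is the property that lets rare, spam-typical words stand out.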

Wireframing and Prototyping: A basic user interface prototype was created using Streamlit, where users could enter a message to check whether it is spam. Behind the scenes, Multinomial Naive Bayes, Logistic Regression, and Support Vector Machine (SVM) models were trained and compared using accuracy, precision, recall, and F1-score.
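The four comparison metrics can be computed directly from predicted and true labels. The sketch below is a plain-Python illustration, treating 'spam' as the positive class; the function name and the label lists are made up for the example and are not results from the project.

```python
def classification_metrics(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative labels only: 6 ham and 4 spam messages.
y_true = ["ham"] * 6 + ["spam"] * 4
y_pred = ["ham"] * 5 + ["spam"] + ["spam"] * 3 + ["ham"]

metrics = classification_metrics(y_true, y_pred)
```

Here one ham message is wrongly flagged and one spam message is missed, so accuracy is 0.8 while precision, recall, and F1 are each 0.75. Reporting all four matters because accuracy alone can look strong on a dataset dominated by ham.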

Challenges

Text Preprocessing Complexity:

Challenge: Handling diverse formats, abbreviations, and noisy language in user messages.
Solution: Implemented robust preprocessing using NLTK and spaCy for better text normalization, including slang replacement and context-based lemmatization.

Imbalanced Dataset:

Challenge: 'Ham' messages significantly outnumbered 'spam' messages, which biased models toward the majority class.
Solution: Used stratified sampling, oversampling techniques such as SMOTE, and adjusted decision thresholds to improve sensitivity to spam messages.
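Of the three remedies, threshold adjustment is the easiest to show in a few lines. The sketch below assumes a classifier that outputs a spam probability per message; the probabilities, labels, and cutoff values are invented for illustration and do not come from the project.

```python
def precision_recall(spam_probs, labels, threshold):
    """Precision and recall for the 'spam' class at a probability cutoff."""
    flagged = [p >= threshold for p in spam_probs]
    tp = sum(f and y == "spam" for f, y in zip(flagged, labels))
    fp = sum(f and y == "ham" for f, y in zip(flagged, labels))
    fn = sum((not f) and y == "spam" for f, y in zip(flagged, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical predicted spam probabilities and true labels.
probs = [0.02, 0.10, 0.15, 0.30, 0.55, 0.35, 0.45, 0.90]
labels = ["ham", "ham", "ham", "ham", "ham", "spam", "spam", "spam"]

default = precision_recall(probs, labels, threshold=0.5)
lowered = precision_recall(probs, labels, threshold=0.3)
```

With the default 0.5 cutoff only one of the three spam messages is caught. Lowering the cutoff to 0.3 catches all three at the cost of one extra false alarm. In practice the cutoff would be chosen on a validation set, not on the test data.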

Model Generalization:

Challenge: Ensuring the model performs well on unseen, real-world messages with different spam patterns.
Solution: Applied cross-validation and tested on messages from external datasets to assess generalization.
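As a rough illustration of how cross-validation folds can be built, the sketch below splits message indices into k folds in plain Python. Keeping the spam/ham ratio equal across folds (stratification) is an assumption carried over from the imbalance discussion; the source only states that cross-validation was applied. The function name, label counts, and seed are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Split indices into k folds with a near-equal class ratio in each."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Illustrative labels: 16 ham and 4 spam messages.
labels = ["ham"] * 16 + ["spam"] * 4
folds = stratified_folds(labels, k=4)

splits = []
for held_out, test_idx in enumerate(folds):
    train_idx = [i for f, fold in enumerate(folds) if f != held_out
                 for i in fold]
    # A real run would fit the model on train_idx and score it on test_idx.
    splits.append((train_idx, test_idx))
```

Each of the four folds ends up with four ham messages and one spam message, so every held-out fold contains at least some spam to evaluate on.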

Interpretability:

Challenge: Explaining why a message was classified as spam or not.
Solution: Integrated LIME (Local Interpretable Model-Agnostic Explanations) to provide word-level importance in classification, enhancing transparency.
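The idea of word-level importance can be illustrated without the LIME library. The sketch below is not LIME itself, which fits a local surrogate model on many perturbed samples; it uses a simpler leave-one-word-out perturbation to show the same intuition. The scoring function and its keyword weights are hypothetical stand-ins for a trained classifier.

```python
# Hypothetical keyword weights standing in for a trained classifier.
TOY_WEIGHTS = {"free": 0.4, "prize": 0.3, "now": 0.1}


def toy_spam_score(words):
    """Toy spam score in [0, 1]; a real model's probability would go here."""
    return min(1.0, sum(TOY_WEIGHTS.get(w, 0.0) for w in words))


def word_importance(message, score_fn):
    """Score drop caused by removing each word (leave-one-word-out)."""
    words = message.lower().split()
    base = score_fn(words)
    importance = {}
    for i, word in enumerate(words):
        without = words[:i] + words[i + 1:]
        # Note: a repeated word would overwrite its earlier entry.
        importance[word] = base - score_fn(without)
    return importance


importance = word_importance("claim your free prize now", toy_spam_score)
ranked = sorted(importance, key=importance.get, reverse=True)
```

Words whose removal lowers the spam score the most rank first, which is the kind of per-word attribution a LIME explanation surfaces to the user.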

Results/Conclusion:

The final model, based on Multinomial Naive Bayes with TF-IDF vectorization, achieved ~98% accuracy and an F1-score of 0.97 on the test set. It successfully detected spam messages containing promotional content, phishing links, and urgent calls to action. The real-time Streamlit app provided an intuitive way to classify messages, demonstrating the model’s practicality for integration into email/SMS systems. The project highlighted how simple yet powerful NLP techniques can significantly improve digital communication hygiene and user trust.

Let's 👋 Work Together

Palak Gupta👋