Spam comment detection is a text classification task in Machine Learning. Spam comments on social media platforms are comments posted to redirect users to another social media account, a website, or some other piece of content.
To detect spam comments with Machine Learning, we need labelled data of spam and non-spam comments. We found a dataset of YouTube spam comments on Kaggle that suits this task. You can download the dataset from here.
In the following section, we will explore how to detect spam comments using Machine Learning with the Python programming language.
YouTube Spam Comments Detection using Python
Let’s start by bringing in the necessary Python libraries and loading our dataset.
# Libraries used throughout this tutorial
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

# The CSV is read directly from a shared Google Drive link,
# so mounting Drive in Colab is not required
file_id = "1ODhaVbvrG61isTumYruvVDvsDgoa8qyl"
url = f"https://drive.google.com/uc?id={file_id}"
data = pd.read_csv(url)
print(data.head())
For this task, we only need the content and class columns from the dataset. Let’s extract these two columns and proceed to the next step:
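# .copy() gives an independent DataFrame, so later column edits do not trigger SettingWithCopyWarning
data = data[["CONTENT", "CLASS"]].copy()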
print(data.sample(7))
The class column currently contains the values 0 and 1, where 0 represents ‘Not Spam’ and 1 represents ‘Spam’. To make the dataset more readable, we’ll replace these numeric values with the labels themselves.
data["CLASS"] = data["CLASS"].map({0: "Not Spam",
                                   1: "Spam"})
print(data.sample(7))
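A quick sanity check after the mapping can confirm that no row was left without a label and show how balanced the two classes are. The sketch below is only an illustration: the `toy` DataFrame and its comments are invented, not taken from the Kaggle dataset.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "CONTENT": [
        "subscribe to my channel",
        "great song",
        "win a free phone here",
        "love this video",
        "best chorus ever",
    ],
    "CLASS": [1, 0, 1, 0, 0],
})

toy["CLASS"] = toy["CLASS"].map({0: "Not Spam", 1: "Spam"})

# .map() produces NaN for any value missing from the dictionary,
# so counting NaN values reveals unmapped rows
unmapped = int(toy["CLASS"].isna().sum())
class_counts = toy["CLASS"].value_counts()

print("Unmapped rows:", unmapped)
print(class_counts)
```

Running the same two checks on the real `data` DataFrame would show whether the classes are roughly balanced before training.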
Training a Classification Model
Next, we’ll train a Machine Learning model to classify comments as spam or not spam. We’ll use the Bernoulli Naive Bayes algorithm, which treats each word as either present or absent in a comment and is a common baseline for classifying short texts:
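# Split the raw text first so the vocabulary is learned from the training data only
x = np.array(data["CONTENT"])
y = np.array(data["CLASS"])
xtrain, xtest, ytrain, ytest = train_test_split(x, y,
                                                test_size=0.2,
                                                random_state=42)
# Convert each comment into a vector of word counts
cv = CountVectorizer()
xtrain = cv.fit_transform(xtrain)
xtest = cv.transform(xtest)
# BernoulliNB binarizes the counts by default (binarize=0.0)
model = BernoulliNB()
model.fit(xtrain, ytrain)
# Accuracy on the held-out 20% of the comments
print(model.score(xtest, ytest))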
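A single accuracy number can hide how the model behaves on each class. As a minimal sketch, the vectorizer and the classifier can be wrapped in a scikit-learn pipeline and summarized with a per-class report. The comments and the names `toy_comments`, `toy_labels`, and `toy_pipeline` below are made up for illustration and are not part of the Kaggle dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy comments, invented for illustration only
toy_comments = [
    "check out my channel and subscribe",
    "subscribe to my channel for free gifts",
    "visit my website to win a free phone",
    "click this link to win free money",
    "this song is so beautiful",
    "love the guitar solo in this video",
    "great video thanks for sharing",
    "her voice is amazing in this song",
]
toy_labels = ["Spam"] * 4 + ["Not Spam"] * 4

# binary=True records only word presence or absence,
# which matches what BernoulliNB models
toy_pipeline = make_pipeline(CountVectorizer(binary=True), BernoulliNB())
toy_pipeline.fit(toy_comments, toy_labels)

# The report is computed on the training comments here, so it only
# demonstrates the API and says nothing about generalization
toy_predictions = toy_pipeline.predict(toy_comments)
print(classification_report(toy_labels, toy_predictions))
```

With the real dataset, the report would be computed on the held-out test comments instead, giving precision and recall for both the spam and not spam classes.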
Now, let’s test our model by giving it two sample inputs, one spam comment and one that is not spam:
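# First input: try a spam-like comment
comment = input("Enter a message: ")
features = cv.transform([comment])  # named features so the data DataFrame is not overwritten
print(model.predict(features))
# Second input: try an ordinary comment
comment = input("Enter a message: ")
features = cv.transform([comment])
print(model.predict(features))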
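Typing comments one at a time is fine for a quick check, but a small helper makes it easier to classify several comments at once. The sketch below assumes a fitted vectorizer and classifier like the ones above. The function name `classify_comments` and the demo data are hypothetical, and the tiny demo model is trained on invented comments only so that the block runs on its own.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB


def classify_comments(comments, vectorizer, classifier):
    """Return (comment, predicted_label) pairs for a list of strings."""
    features = vectorizer.transform(comments)
    labels = classifier.predict(features)
    return list(zip(comments, labels))


# Hypothetical demo data, invented for illustration only
demo_comments = [
    "subscribe to my channel for free gifts",
    "click this link to win free money",
    "this song is so beautiful",
    "great video thanks for sharing",
]
demo_labels = ["Spam", "Spam", "Not Spam", "Not Spam"]

demo_cv = CountVectorizer()
demo_model = BernoulliNB()
demo_model.fit(demo_cv.fit_transform(demo_comments), demo_labels)

results = classify_comments(
    ["win free money now", "beautiful song"],
    demo_cv,
    demo_model,
)
for comment, label in results:
    print(f"{label}: {comment}")
```

In the tutorial's own code, the same helper could be called with `cv` and `model` in place of the demo objects.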
And that’s how we train a Machine Learning model in Python to detect spam comments.
Conclusion
Detecting spam comments on YouTube is just one example of how machine learning can be applied to identify and filter unwanted or malicious activity online. By analyzing text features, behavior patterns, and metadata, models can effectively separate genuine interactions from spam.
The same principles extend to many other domains, such as online payment fraud detection, where algorithms analyze transaction data, user behavior, and anomalies to flag potentially fraudulent activities in real time. Whether it’s filtering spam or securing financial transactions, machine learning continues to play a critical role in building safer and more trustworthy digital ecosystems.