
A Simple Machine Learning Approach to Detect AI-Generated Text

With the growing presence of AI-generated text in our daily lives, it's becoming more important
to be able to differentiate between human-generated and AI-generated text. In this post, I will
guide you through the process of developing a software tool that can detect AI-generated text.
To do this, I will use a supervised machine learning approach built with decision tree algorithms
and provide an example of its implementation in Python using a small dataset. Additionally, I
will discuss the challenges involved in building a tool to detect AI-generated text.

First, let’s look at what we need in order to develop a machine learning classifier that
distinguishes AI-generated text from human-generated text. This is a high-level overview of the
process; the specific steps may vary depending on the complexity of the task and the tools we're using.
1. Dataset: A labeled dataset of text data, where each text sample is assigned a class label
(e.g., AI-generated text or human-generated text) to train our classifier. In general, the
larger and more diverse the dataset, the better the classifier will perform.
2. Feature extraction: Convert the text data into numerical representations (features) that
can be used as input to the classifier. The features could be generated using techniques
such as word embeddings, TF-IDF, bag-of-words representations, or other feature
extraction techniques. The classifier would then use the features as input to make
predictions about the class labels of the text samples.
3. Model selection: We'll need to select a machine learning model that is appropriate for
the task. For example, we could use a simple logistic regression classifier or a more
complex classifier based on a deep neural network. The choice of algorithm depends
on the size and complexity of the data, as well as the desired accuracy and performance
of the classifier. For this example, we will use a decision tree, a simple and
widely used algorithm for classification tasks.
4. Training: We'll need to train the selected model (here, a decision tree) on the labeled
dataset using the extracted features. This allows the model to learn the relationships
between the features and the labels so that it can make predictions about new text samples.
5. Evaluation: Once the model is trained, we will then use it to make predictions about the
class labels of new, unseen text samples to evaluate its performance. Common
evaluation metrics for text classification include accuracy, precision, recall, and F1 score.
6. Fine-tuning: Depending on the performance of the model, we may need to fine-tune the
model by adjusting its parameters, changing the features we're using, or selecting a
different model altogether.
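The evaluation metrics named in step 5 can be sketched with scikit-learn as follows. This is only an illustration: the label lists y_true and y_pred are made up for the example and are not results from any real classifier, and treating "AI" as the positive class is an assumption.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels, invented for illustration only (not real results)
y_true = ["AI", "AI", "AI", "AI", "Human", "Human", "Human", "Human"]
y_pred = ["AI", "AI", "AI", "Human", "AI", "AI", "Human", "Human"]

# Assumption: "AI" is treated as the positive class
accuracy = accuracy_score(y_true, y_pred)                    # 5 of 8 correct
precision = precision_score(y_true, y_pred, pos_label="AI")  # 3 of 5 "AI" predictions correct
recall = recall_score(y_true, y_pred, pos_label="AI")        # 3 of 4 true "AI" samples found
f1 = f1_score(y_true, y_pred, pos_label="AI")                # harmonic mean of the two

print(f"Accuracy:  {accuracy:.3f}")   # 0.625
print(f"Precision: {precision:.3f}")  # 0.600
print(f"Recall:    {recall:.3f}")     # 0.750
print(f"F1 score:  {f1:.3f}")         # 0.667
```

Accuracy alone can hide which kind of mistake a detector makes. Here, precision drops when human text is flagged as AI, and recall drops when AI text slips through as human.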
Example of a Decision Tree
Here's a Python code example of how you could build a decision tree classifier to detect
AI-generated text using a small dataset. To run the code for the decision tree classifier, you
will need the following:
1. A programming environment: To execute the code, you'll need a programming
environment such as Jupyter Notebook, PyCharm, or IDLE.
2. Required libraries: You'll need to have the following libraries installed: Pandas for data
manipulation and analysis, Scikit-learn for machine learning algorithms and models, and
numpy for numerical computing. If you don't already have these libraries installed, you
can install them by running the following command in your terminal or command
prompt:
pip install pandas scikit-learn numpy
If you are using Anaconda, you can install the required libraries by replacing “pip” with
“conda”:
conda install pandas scikit-learn numpy

Dataset
The dataset provided in the code example is a small and simple dataset that consists of 10 text
examples, 5 of which are AI-generated text and 5 of which are human-generated text. The text
examples are stored in a list of tuples, where each tuple contains a text string and a label
indicating whether the text is generated by AI or by a human. The list is then converted into a
Pandas DataFrame for easier manipulation and analysis.
The text data is preprocessed using the TfidfVectorizer from the scikit-learn library, which
converts the text into numerical features that can be used as input for the machine learning
model. The dataset is then split into a training set and a test set, and a decision tree classifier is
trained on the training set. The trained model is used to make predictions on the test set, and
the accuracy of the predictions is evaluated. Note that with 10 examples and test_size=0.1, the
test set contains a single example, so the reported accuracy is a sanity check rather than a
meaningful estimate of performance.
This dataset is a simple example that was created to demonstrate the basic steps involved in
building a machine learning model to detect AI-generated text. In real-world applications, the
dataset would typically be much larger and more diverse, and the features used for prediction
would be more sophisticated. However, this simple dataset is a good starting point for
understanding the concepts involved in building a machine learning model.
Code
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Dataset
data = [("The cat is sitting on the mat.", "AI"),
("The sun is shining brightly today.", "AI"),
("The car is red in color.", "AI"),
("The sky is blue and clear.", "AI"),
("The flowers are blooming in the garden.", "AI"),
("I love spending time with my friends and family.", "Human"),
("My favorite food is pizza, what's yours?", "Human"),
("I love to play basketball and watch movies in my free time.", "Human"),
("I have a passion for photography and traveling.", "Human"),
("I enjoy reading books and learning new things.", "Human")]
# Create a Pandas DataFrame from the list of tuples and name the columns
df = pd.DataFrame(data, columns=["Text", "Label"])

# Feature extraction using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df["Text"])
y = df["Label"]

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=0)

# Train a decision tree classifier
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100)) # Accuracy: 100.00%
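Because test_size=0.1 holds out only one of the 10 examples, a single accuracy number says very little. One possible alternative, sketched below under assumptions, is stratified cross-validation with a pipeline so that TF-IDF is re-fit on the training part of each fold only. The pipeline, the fold count, and the fixed random_state values are choices made for this sketch and are not part of the code above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Same 10 examples as in the dataset above
texts = ["The cat is sitting on the mat.",
         "The sun is shining brightly today.",
         "The car is red in color.",
         "The sky is blue and clear.",
         "The flowers are blooming in the garden.",
         "I love spending time with my friends and family.",
         "My favorite food is pizza, what's yours?",
         "I love to play basketball and watch movies in my free time.",
         "I have a passion for photography and traveling.",
         "I enjoy reading books and learning new things."]
labels = ["AI"] * 5 + ["Human"] * 5

# Vectorizer and classifier are chained so each fold fits TF-IDF on its training texts only
model = make_pipeline(TfidfVectorizer(), DecisionTreeClassifier(random_state=0))

# 5 folds over 10 samples: each fold is tested on one AI and one Human example
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")

print("Per-fold accuracy:", scores)
print("Mean accuracy: {:.2f}".format(scores.mean()))
```

With so few examples the per-fold scores can still swing widely, so this mainly shows the mechanics of a fairer evaluation rather than a trustworthy estimate.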

Now, let’s make predictions on two new examples with the trained decision tree classifier:
new_text = ["The flowers are blooming in the garden.",
            "I love to play basketball and watch movies in my free time."]

# Preprocess the new text data
new_text_transformed = vectorizer.transform(new_text)

# Make predictions on the new text
new_predictions = clf.predict(new_text_transformed)

# Print the predictions
print("Predictions:", new_predictions) # Predictions: ['AI' 'Human']
The output Predictions: ['AI' 'Human'] means that the decision tree classifier has made one
prediction for each of the two text examples. The first prediction is 'AI', indicating that the
classifier considers the first text example to be generated by AI. The second prediction is
'Human', indicating that the classifier considers the second text example to be generated by a
human.
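These predictions are based on the training data that the decision tree classifier was previously
fit to and on the features extracted from the new text examples. Note that both example
sentences also appear verbatim in the dataset above, so this step demonstrates how to call the
classifier rather than how well it generalizes. In general, the accuracy of such predictions
depends on how well the classifier was trained and how similar the new text examples are to
the examples in the training data.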

Challenges
Building a software tool to detect AI-generated text is not without its challenges. Some of the
challenges include:
1. Lack of labeled data: One of the biggest challenges in detecting AI-generated text is the
lack of labeled data. It can be difficult to find a large, diverse, and representative dataset
of AI-generated and human-generated text to train the classifier.
2. Difficulty in defining AI-generated text: Another challenge is defining what constitutes
AI-generated text. There are many different forms of AI-generated text, and not all of
them are easily distinguishable from human-generated text.
3. Changes in AI technology: The field of AI is rapidly evolving, and new techniques for
generating text are being developed all the time. This can make it challenging to build a
software tool that can detect AI-generated text, as the characteristics of AI-generated
text can change over time.
Despite these challenges, building a software tool to detect AI-generated text is a valuable and
important task. With the right tools and resources, it is possible to build a tool that can
accurately identify AI-generated text, and help distinguish between human-generated and AI-
generated text. I hope this post has provided you with a good understanding of the concept and
process of building a software tool to detect AI-generated text. If you liked this post, please
share it with others who may be interested. Thank you for reading!
