Name: Muhammad Mohsin Ramzan

Roll no: 20i-2354

FAST NUCES

AI BASED AUTHOR IDENTIFICATION


The objective of this project is to use a pre-trained model to determine the authorship of a
text or paragraph from its writing style, including the use of active or passive voice and other
relevant linguistic characteristics.

Project Description:
The author identification system already exists and was built through the following key steps:

Data Collection: A comprehensive dataset comprising texts from various authors with diverse
writing styles has been collected. This dataset serves as the foundation for training the
pre-existing AI model.

Model Training: This project uses an existing Long Short-Term Memory (LSTM) model that has
already been trained to recognize distinctive patterns and attributes in the writing styles of
different authors. LSTMs are designed to capture long-term dependencies in sequential data,
which makes them well suited to text and other language-related tasks. During training, the
model learned to identify linguistic features such as sentence structure, vocabulary choice,
and syntactic preferences, and it uses these characteristics to predict authorship.

Feature Engineering: To address the limitation of encountering text from an unknown author, a
feature engineering technique has been implemented. When the model encounters text from an
author not present in the training dataset, the system automatically assigns an "Unknown" label
to that text. This ensures that the system gives appropriate feedback when confronted with an
unfamiliar writing style.
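
The report does not say how the "Unknown" label is decided. One common approach, sketched here
as an assumption rather than as the project's actual method, is to turn the model's raw output
scores into probabilities and fall back to "Unknown" when the top probability is below a
cut-off. The function names, the author names, the example scores, and the threshold of 0.6 are
all hypothetical.

```python
import math

def softmax(scores):
    """Convert raw model scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_author(scores, authors, threshold=0.6):
    """Return (label, confidence); the label is "Unknown" when confidence is low.

    `scores` stands in for the LSTM's raw output for one text. `threshold`
    is an assumed cut-off, not a value taken from the project.
    """
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return "Unknown", probs[best]
    return authors[best], probs[best]

authors = ["Author A", "Author B", "Author C"]        # hypothetical known authors
confident = predict_author([4.0, 0.5, 0.2], authors)  # one score clearly dominates
uncertain = predict_author([1.0, 0.9, 1.1], authors)  # scores are nearly tied
```

With these made-up scores, the first call returns "Author A" and the second returns "Unknown".
In practice the threshold would need to be tuned on held-out texts, including some from authors
the model has never seen.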

Limitations:
The project does have certain limitations due to the nature of the dataset and feature engineering:

Generalization to Unknown Authors: The model's ability to accurately predict the author of text
samples from authors not present in the training dataset may be limited. While the "Unknown"
label has been implemented for unknown authors, the system's predictions for such cases may not
be as precise as for known authors.

Dataset Representativeness: The accuracy of the model heavily relies on the diversity and
representativeness of the training dataset. If the dataset is not sufficiently comprehensive and
does not capture a wide range of writing styles, it may impact the model's ability to accurately
identify authors.
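
A simple first check on representativeness is to see how evenly the training samples are spread
across authors. The sketch below flags authors whose share of the samples falls below a chosen
minimum. The function name, the 10% minimum, and the sample counts are illustrative assumptions,
not figures from the project's dataset.

```python
from collections import Counter

def underrepresented_authors(labels, min_share=0.1):
    """Return authors whose share of all samples is below `min_share`."""
    counts = Counter(labels)
    total = len(labels)
    return sorted(a for a, n in counts.items() if n / total < min_share)

# Hypothetical label list: one author label per training text.
labels = ["Author A"] * 50 + ["Author B"] * 45 + ["Author C"] * 5
flagged = underrepresented_authors(labels)
```

Here "Author C" supplies only 5 of 100 samples and is the only author flagged. Sample counts
alone do not measure stylistic diversity, so this is a starting point rather than a full check.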

Ambiguity and Overlapping Writing Styles: Some authors may exhibit similar writing
styles, making it challenging for the model to distinguish between them accurately. In cases
where authors have overlapping linguistic features, the model's predictions may be less reliable.
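
One way to see which authors the model tends to mix up is to count misclassifications per pair
of authors on a labeled evaluation set. The following is a minimal sketch; the function name
and the toy labels are assumptions for illustration, not results from the project.

```python
from collections import Counter

def confused_pairs(true_authors, predicted_authors):
    """Count how often each pair of distinct authors is mistaken for one another."""
    pairs = Counter()
    for truth, guess in zip(true_authors, predicted_authors):
        if truth != guess:
            # Sort so that A->B and B->A errors land in the same bucket.
            pairs[tuple(sorted((truth, guess)))] += 1
    return pairs.most_common()  # most frequently confused pairs first

# Toy evaluation data (hypothetical).
true_authors = ["A", "A", "B", "B", "C", "C"]
predicted = ["A", "B", "A", "B", "C", "C"]
ranking = confused_pairs(true_authors, predicted)
```

In this toy data, authors A and B are confused twice and C is never confused with anyone. A pair
that keeps appearing at the top of the ranking suggests overlapping styles.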
