
Personality Prediction From Social Media

Data Using Language Models

Team Members:
A. Naga Sai Ajay Kumar - 20BQ1A0501
Ch. Pavan Sai Ganesh - 20BQ1A0537
G. Jaswanth - 20BQ1A0555
B. Dheeraj Kumar - 20BQ1A0521

Project Guide:
Mr. P. R. Krishna Prasad, Associate Professor


Introduction:
1. Personality prediction refers to the process of using various data sources and analytical techniques to
make educated guesses or predictions about an individual's personality traits or characteristics.
2. The MBTI test classifies individuals into one of 16 personality types, each characterized by distinct
preferences in four dichotomous dimensions:
i. Extraversion vs. Introversion
ii. Sensing vs. Intuition
iii. Thinking vs. Feeling
iv. Judging vs. Perceiving.
3. Our project focuses on personality classification from text using a pre-trained language model called
BERT (Bidirectional Encoder Representations from Transformers).
4. We then apply the BERT model to real-time tweets collected via the Twitter API, demonstrating the
effectiveness of predicting personality from social media data.

Applications: Psychology and Counselling, Recruitment and HR, Marketing and Advertising, Personalized Services

Existing System:

Proposed System:
• Uses Flask for the backend and model deployment, and CSS for building the homepage UI. The results of
this research indicate the feasibility of predicting Myers-Briggs personality types from social media
user data with good accuracy.
• Performs personality classification on a user's recent tweets using a pre-trained language model called
BERT.

Concept:
• The Myers-Briggs Type Indicator (MBTI) is a widely used personality assessment tool based on Carl Jung’s theories.
It categorizes individuals into one of 16 personality types, providing insights into their preferences in four
dimensions:
1. Extraversion (E) vs. Introversion (I):
• Extraversion (E): People who gain energy from social interactions, enjoy group activities, and tend to be
outgoing.
• Introversion (I): Individuals who recharge by spending time alone, prefer deeper conversations, and may be
more reserved.
2. Sensing (S) vs. Intuition (N):
• Sensing (S): People who focus on concrete details, practical information, and the present moment.
• Intuition (N): Individuals who look beyond the surface, seek patterns, and imagine future possibilities.
3. Thinking (T) vs. Feeling (F):
• Thinking (T): Individuals who make decisions based on logic, objective analysis, and principles.
• Feeling (F): People who consider emotions, values, and empathy when making choices.
4. Judging (J) vs. Perceiving (P):
• Judging (J): People who prefer structure, planning, and organization. They like closure and decision-making.
• Perceiving (P): Individuals who are adaptable, spontaneous, and open-ended. They enjoy flexibility and
exploration.
Bidirectional Encoder Representations From Transformers:
 BERT, short for Bidirectional Encoder Representations from Transformers, is an open-source machine learning
framework designed for the realm of natural language processing (NLP). It was developed by researchers from Google
AI Language in 2018. Let’s delve into the details of BERT:
1. Architecture and Working:
i. BERT leverages a transformer-based neural network to understand and generate human-like language.
ii. Unlike the original Transformer architecture, which has both encoder and decoder modules, BERT employs
an encoder-only architecture. This emphasizes understanding input sequences rather than generating output
sequences.
iii. BERT uses a bidirectional approach, considering both the left and right context of words in a sentence
simultaneously. This allows for a more nuanced understanding of context.
2. Pre-training and Fine-tuning:
BERT undergoes a two-step process:
i. Pre-training on Large Data: BERT is pre-trained on a large amount of unlabeled text data. During pre-
training, it learns contextual embeddings, which represent words considering their surrounding context in a
sentence.
ii. Fine-tuning on Labeled Data: After pre-training, BERT is fine-tuned for specific NLP tasks using labeled data.
This step tailors the model to more targeted applications by adapting its general language understanding to
task-specific nuances.
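To make the pre-trained encoder concrete, the sketch below loads BERT with the Hugging Face transformers package (the same package listed in the implementation section); the checkpoint name "bert-base-uncased" is an assumption, since the slides do not name the exact variant.

from transformers import BertTokenizer, TFBertModel

# Load the pre-trained tokenizer and encoder (checkpoint name is assumed).
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = TFBertModel.from_pretrained("bert-base-uncased")

# Tokenize one sentence into the tensor inputs BERT expects.
inputs = tokenizer("I love quiet evenings with a good book.", return_tensors="tf")

# The encoder produces one contextual embedding per token:
# last_hidden_state has shape (batch_size, sequence_length, 768).
outputs = bert(inputs)
print(outputs.last_hidden_state.shape)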
Working of BERT :
Many models predict the next word in a sequence, which is a directional approach and may limit context learning.
BERT addresses this challenge with two innovative training strategies:
1. Masked Language Model (MLM): In BERT’s pre-training process, a portion of the words in each input
sequence is masked, and the model is trained to predict the original values of these masked words from the
context provided by the surrounding words; in simple terms, the model learns to fill in masked words.

2. Next Sentence Prediction (NSP): BERT predicts if the second sentence is connected to the first. This is done by
transforming the output of the [CLS] token into a 2×1 shaped vector using a classification layer, and then calculating
the probability of whether the second sentence follows the first using SoftMax.
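As a small illustration of the MLM objective, the transformers fill-mask pipeline can be asked to recover a masked word from its two-sided context; the model name is again an assumption.

from transformers import pipeline

# BERT predicts the [MASK] token using both left and right context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("I enjoy spending time [MASK] rather than at parties."):
    print(prediction["token_str"], round(prediction["score"], 3))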

Applications:
1. Question-Answering Systems
2. Text Classification
3. Named Entity Recognition
4. Text Summarization
Design:
[Design flowchart: the dataset is split into training, testing and validation sets.]


Working of BERT:
[Diagram illustrating the working of BERT.]


Flowchart:
[Flowchart figure.]


Implementation:
1. Importing packages:
• Packages used for Data Visualization:
import matplotlib.pyplot as plt
from wordcloud import WordCloud
• Packages used for Data Preprocessing:
import re
import string
from transformers import TFBertModel, BertTokenizer
• Packages used for Model Training:
import tensorflow as tf
import tensorflow_addons as tfa
import tensorflow.keras as keras
import tensorflow.keras.layers as layers
from tensorflow.keras.callbacks import ModelCheckpoint
• Packages used for Model Evaluation:
from sklearn.metrics import auc, roc_curve
2. Loading the Dataset:
The MBTI dataset contains 8,675 rows; each row contains a person’s:
• Type (the person’s 4-letter MBTI code/type)
• A section of each of the last 50 things they have posted (entries separated by "|||")
Dataset link: https://www.kaggle.com/datasets/datasnaek/mbti-type
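A minimal sketch of loading this dataset, assuming pandas (which is not in the import list above) and the default Kaggle file name "mbti_1.csv":

import pandas as pd

df = pd.read_csv("mbti_1.csv")   # columns: "type" and "posts"
print(df.shape)                  # (8675, 2)

# Split each person's concatenated posts back into individual entries.
df["post_list"] = df["posts"].apply(lambda posts: posts.split("|||"))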

3. Data Fetching: Obtaining the 20 most recent tweets of an individual using the Twitter API.
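The slides only mention the Twitter API; as one hedged sketch, the tweepy library's v2 client could fetch the tweets as below. The library choice, the handle and the bearer token are all placeholders/assumptions.

import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Look up the user, then request their 20 most recent tweets.
user = client.get_user(username="some_handle")
tweets = client.get_users_tweets(user.data.id, max_results=20)
recent_texts = [tweet.text for tweet in tweets.data]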



4. Data Pre-processing:
Removing URLs, quotes, @mentions and other special symbols, and lowercasing the posts.

Value of types:
Before preprocessing: ‘INFJ’
After preprocessing: array([0., 0., 1., 0.], dtype=float32)

The text is further preprocessed into the BERT base inputs: input_ids, token_type_ids and attention_mask.
Split dataset into training, testing and validation:
76% of the dataset is used for training, 12% for testing and the remaining 12% for validation.
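A sketch of these preprocessing steps follows; the exact cleaning regexes and the order of the four label dimensions are assumptions, since the slides do not spell them out.

import re
import numpy as np
from transformers import BertTokenizer

def clean_post(text):
    # Remove URLs, "|||" separators and special symbols, then lowercase.
    text = re.sub(r"https?://\S+", " ", text)
    text = text.replace("|||", " ")
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return text.strip()

def encode_type(mbti):
    # One binary label per dichotomy (the dimension order is an assumption).
    return np.array([mbti[0] == "E", mbti[1] == "S",
                     mbti[2] == "T", mbti[3] == "J"], dtype=np.float32)

# Tokenize into the three BERT base inputs, padded/truncated to 128 tokens.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer(clean_post("Loving this! https://t.co/x ||| so true"),
                max_length=128, padding="max_length", truncation=True,
                return_tensors="tf")
print(enc["input_ids"].shape, enc["token_type_ids"].shape,
      enc["attention_mask"].shape)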
5. BERT Model Training:

• The BERT base inputs are passed into the BERT model, and the generated BERT outputs pass through a
densely-connected neural network layer with a sigmoid activation function to produce four outputs, one per
personality dimension.
• The loss function is binary cross-entropy, since this is a multi-label classification problem and it copes
with the imbalance of the dataset.
• The optimizer is Adam from TensorFlow Addons with a specified learning rate; it updates the model's
weights and biases during training. Finally, the model weights are saved to the ‘bert_base_model1.h5’ file.
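A minimal sketch of this architecture follows. The 128-token length matches the later slide; the learning rate is an assumption, and the standard Keras Adam stands in for the TensorFlow Addons optimizer mentioned above.

import tensorflow as tf
from transformers import TFBertModel

MAX_LEN = 128
bert = TFBertModel.from_pretrained("bert-base-uncased")

# The three BERT base inputs produced during preprocessing.
input_ids = tf.keras.layers.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
token_type_ids = tf.keras.layers.Input(shape=(MAX_LEN,), dtype=tf.int32, name="token_type_ids")
attention_mask = tf.keras.layers.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

# Dense sigmoid head over BERT's pooled output: four independent
# probabilities, one per personality dimension (multi-label setup).
pooled = bert(input_ids, token_type_ids=token_type_ids,
              attention_mask=attention_mask).pooler_output
probs = tf.keras.layers.Dense(4, activation="sigmoid")(pooled)

model = tf.keras.Model([input_ids, token_type_ids, attention_mask], probs)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss="binary_crossentropy")

# After training with model.fit(...), the weights are saved:
model.save_weights("bert_base_model1.h5")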

6. Model Evaluation:
• Evaluating the performance of the model using the ROC-AUC method.
• The ROC (Receiver Operating Characteristic) curve is a graphical plot that illustrates the performance of a
binary classifier across various threshold settings, plotting the true positive rate (TPR) against the false
positive rate (FPR).
• The AUC (Area Under the Curve) is the area under the ROC curve, a single scalar value that summarizes the
model's performance across all possible threshold settings.
• A higher ROC-AUC score indicates better discrimination ability, with a value of 1 indicating perfect
performance and a value of 0.5 indicating random guessing.
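A sketch of this evaluation using the sklearn imports listed earlier; the arrays below are random placeholders standing in for the validation split's true labels and predicted probabilities.

import numpy as np
from sklearn.metrics import auc, roc_curve

# Placeholder labels and predictions (samples x 4 personality dimensions).
y_true = np.random.randint(0, 2, size=(100, 4))
y_prob = np.random.rand(100, 4)

# One ROC curve per personality dimension, summarized by its AUC.
for i, name in enumerate(["E/I", "S/N", "T/F", "J/P"]):
    fpr, tpr, _ = roc_curve(y_true[:, i], y_prob[:, i])
    print(f"{name}: AUC = {auc(fpr, tpr):.4f}")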

Testing:
Test Results: auc: 0.7850, accuracy: 0.7681
Result Analysis:
• Since the dataset is imbalanced, the accuracy value can be misleading, so we used the ROC-AUC value to
evaluate model performance.


OUTPUT SCREEN:



The user's tweets and the scores for each personality dimension are displayed in the terminal.



Conclusion:
1. Our project demonstrates the feasibility of predicting personality using the BERT language
model. Through extensive experimentation, we found that the advantages of our approach lie in its
potential applications, such as personalized recommendations, targeted marketing, and psychological
analysis.
2. However, it is important to acknowledge the limitations of our project. One limitation is the reliance
on self-reported data, which may introduce biases and inaccuracies.
Future Enhancement:
1. Future development may use larger training and testing datasets.
2. Fetching data from other social media sources such as YouTube, Reddit, etc.
3. The model is currently trained on 128 tokens; this can be increased for better accuracy.
4. Comparing with other pre-trained models such as ALBERT, XLNet, DistilBERT, etc.



References:
1. Text based personality prediction from multiple social media data sources using pre-trained language
model and model averaging.
- Hans Christian, Derwin Suhartono, Andry Chowanda and Kamal Z. Zamli.
2. Social Media Text - A Source for Personality Prediction Methods.
- P. S. Dandannavar, S. R. Mangalwede and P. M. Kulkarni [March 2018].
3. Personality Prediction System from Facebook Users.
- Tommy Tandera, Hendro, Derwin Suhartono, Rini Wongso and Yen Lina Prasetio [October 2017].
4. Personality Predictions Based on User Behavior on the Facebook Social Media Platform.
- Michael M. Tadesse, Hongfei Lin, Bo Xu and Liang Yang [June 2021].
5. Personality Prediction Based on Twitter Information.
- Veronica Ong, Anneke D. S. Rahmanto, Williem and Derwin Suhartono [October 2017].


