
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/316547065

Automated Personality Classification Using Data Mining Techniques

Technical Report · April 2017


DOI: 10.13140/RG.2.2.35949.59363


5 authors, including:

Prajakata Gogate
Pillai Institute of Information Technology, Engineering, Media Studies and Research

All content following this page was uploaded by Prajakata Gogate on 28 April 2017.



Automated Personality Classification Using
Data Mining Techniques
Manasi Ombhase, Student, PCE, Prajakta Gogate, Student, PCE, Tejas Patil, Student, PCE, Karan
Nair, Student, PCE and Prof. Gayatri Hegde, Faculty, PCE

Abstract: This project addresses areas where large amounts of personal behavioral data are available. This data can help classify persons using automated personality classification (APC). In this project, we propose an advanced APC system. The system uses learning algorithms such as Naive Bayes, SVM, and Decision Tree, along with advanced data mining, to mine user characteristics data and learn from the patterns. This learning can then be used to classify or predict user personality based on past classifications. The system analyses a wide range of user characteristics and behaviors and, based on the patterns observed, stores its own user characteristic patterns in a database. The system then predicts the personality of a new user based on the personality data stored from the classification of previous users. This system is useful to social networks as well as to online ad networks, which can classify user personality and sell more relevant ads. The system is also useful for government agencies to observe user personality and predict new user personality on a large scale.

I. INTRODUCTION
Identifying a person's personality from their nature is an old technique. Earlier, this was done manually, and predicting the nature of a person took a lot of time. Data mining is primarily used today by companies with a strong consumer focus: retail, financial, communication, and marketing organizations. Methods used to analyze the data include surveys, interviews, questionnaires, classroom activities, shopping website data, and social network data about user experiences and the problems users are facing. But these traditional methods are time consuming and very limited in scale. Manual analysis cannot make sense of user learning experiences, which are huge in volume and vary in Internet slang and in the timing of user posts on the web. Sentiment analysis of the collected user data does not cover much of the relevant experience, because determining what user problems a piece of data indicates is a more complicated task, even for a human judge, than determining just its sentiment. Humans are prone to biases and prejudices, which may affect the accuracy of their judgments. Also, certain features of a Facebook profile or other social network text data are difficult for humans to grasp. For example, while the number of Facebook friends is clearly displayed on the profile, it is more difficult for a human to determine features such as the network density.

II. LITERATURE SURVEY

1. Novel approaches to automated personality classification: Ideas and their potentials:
This paper [4] proposes several new research directions regarding the problem of Automated Personality Classification (APC). First, the authors investigate possible improvements of the existing solutions to the APC problem, for which they use different combinations of APC corpora, psychological trait measurements, and learning algorithms. Afterwards, they consider extensions of the APC problem and related tasks, such as dynamical APC and detecting personality inconsistency in a text. The entire research was performed in the context of social networks and the related data mining mechanisms.

2. Educational Game (Detecting personality of players in an educational game):
One of the goals of Educational Data Mining [1] is to develop methods for student modeling based on educational data such as chat conversations and class discussions. Individual behavior and personality also play a major role in Intelligent Tutoring Systems (ITS) and Educational Data Mining (EDM). Thus, when developing a user-adaptable system, the student behaviors that occur during interaction have a huge impact on EDM and ITS. The authors introduce novel data mining techniques and natural language processing approaches for automated detection of students' personality and behaviors in an educational game (Land Science), in which students act as interns in an urban planning firm and discuss their ideas in groups. To apply this framework, input excerpts must be classified into one of six possible personality classes. The authors applied this personality classification method using machine learning algorithms such as Naive Bayes, Support Vector Machine (SVM), and Decision Tree.

3. A System for Personality and Happiness Detection:
This work [3] proposes a platform for estimating personality and happiness. Starting from Eysenck's theory of human personality, the authors seek to provide a platform for collecting text messages from social media (WhatsApp) and classifying them into different personality categories. Although there is no clear link between personality features and happiness, some correlations between them could be found in the future. The authors describe the platform developed and, as a proof of concept, use different sources of messages to see whether common machine learning algorithms can be used for classifying different personality features and happiness.

4. Using Twitter Content to Predict Psychopathy:
An ever-growing number of users share their thoughts and experiences using the Twitter microblogging service. Although sometimes dismissed as containing too little content to convey significant information, these messages can be combined to build a larger picture of the user posting them. One particularly notable personality trait that can be discovered this way is psychopathy: the tendency to disregard others and the rules of society. In this paper [2], the authors explore techniques to apply data mining towards the goal of identifying those who score in the top 1.4% of a well-known psychopathy metric using information available from their Twitter accounts. They apply a newly proposed form of ensemble learning, Select RUSBoost (which adds feature selection to their earlier imbalance-aware ensemble in order to resolve high dimensionality), employ four classification learners, and use four feature selection techniques. The results show that, with the optimal choice of techniques, an AUC value of 0.736 can be achieved. Furthermore, these results were only achieved when using the Select RUSBoost technique, demonstrating the importance of feature selection, data sampling, and ensemble learning. Overall, the authors show that data mining can be a valuable tool for law enforcement and others interested in identifying abnormal psychiatric states from Twitter data.

III. EXISTING SYSTEM
The existing system relies on the fact that people can judge each other's personality based on Facebook profiles or text data, and that some aspects of Facebook profiles or other social network text data are used by people to judge others' personalities. However, the overlap between the Facebook profile features that contain the actual personality cues and the features used by people to form personality judgments does not have to be perfect. It is possible that some of the actual personality cues are ignored or misinterpreted by people, while some non-relevant features are used in the judgment. Humans are prone to biases and prejudices, which may affect the accuracy of their judgments. Also, certain features of a Facebook profile or other social network text data are difficult for humans to grasp. For example, while the number of Facebook friends is clearly displayed on the profile, it is more difficult for a human to determine features such as the network density.

IV. PROPOSED SYSTEM
Personality classification is one of the problems considered by personality psychology, a branch of psychology. The focus of this field is the study of personality and individual differences. According to that study [4], personality can be defined as a dynamic and organized set of characteristics of a person, which have a unique influence on the cognition, motivation, and behavior of that person. In this paper, the problem of automated personality classification is considered based on the following content: textual content that the person wrote, and meta-information about the person, obtained on request through social networks or by other means. There are studies that also include speech, analysis of facial characteristics, gestures, and other aspects of behavior, but they are not the subject of our study. The standard approach to solving the APC problem based on the aforementioned content consists of the following steps: A. gathering the corpus data, B. determining the personality characteristics of the participants, and C. building the model.

The proposed system targets areas where large amounts of personal behavioral data are accessible. This data can help us classify persons using automated personality classification (APC). In this project, we propose an advanced APC system. The project uses learning algorithms along with advanced data mining to mine user characteristics data and learn from the patterns. This learning can then be used to classify or predict user personality based on past classifications. The system analyses a wide range of user characteristics and behaviors and, based on the patterns observed, stores its own user characteristic patterns in a database. The system then predicts the personality of a new user based on the personality data stored from the classification of previous users. This system is useful to social networks as well as to online ad networks, which can classify user personality and sell more relevant ads. The system is also useful for government agencies to observe user personality and predict new user personality on a large scale.

A. Advanced Naive-Bayes Classification Algorithm

Advanced Naive Bayes is one of the most efficient and effective inductive learning algorithms for machine learning and data mining. Its competitive performance in classification is surprising, because the conditional independence assumption on which it is based is rarely true in real-world applications.

Most of the algorithms for sentiment analysis are based on a classifier trained using a collection of annotated text data. Before training, the data is preprocessed so as to extract the main features. Several classification methods have been proposed: Naive Bayes, Support Vector Machines, K-Nearest Neighbors, etc. However, per (Go et al., 2009), it is not clear which of these classification strategies is the most appropriate for sentiment analysis. NB combines efficiency (optimal time performance) with reasonable accuracy. The main theoretical drawback of NB methods is that they assume conditional independence among the linguistic features. If the main features are the tokens extracted from texts, it is evident that they cannot be considered independent, since words co-occurring in a text are linked by different types of syntactic and semantic dependencies. However, even if NB produces an oversimplified model, its classification decisions are surprisingly accurate.

In the system datasets, the result is stored as yes or no for each question. Hence the working of the algorithm depends on the corresponding probabilities of yes and no. The training data set contains the answers together with a target variable for each question, for example 'q1' (question 1). The system then classifies the user's personality type based on the answers to the questions.

Step 1: Convert the data set into a frequency table.

Step 2: Create a likelihood table by finding the probabilities, for example, probability of question 1 = 0.29 and probability of question 2 = 0.64.

Step 3: Use the Naive Bayes equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.
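Steps 1 to 3 can be illustrated with a minimal Python sketch. It is not the system's actual implementation: the training rows, the class labels ("extrovert", "introvert"), and all function names are hypothetical, and add-one (Laplace) smoothing is an added assumption so that an answer never seen for a class does not produce a zero probability.

```python
from collections import Counter, defaultdict

# Hypothetical training data: yes/no answers to two questions plus a label.
TRAINING = [
    ({"q1": "yes", "q2": "no"}, "extrovert"),
    ({"q1": "yes", "q2": "yes"}, "extrovert"),
    ({"q1": "no", "q2": "yes"}, "introvert"),
    ({"q1": "no", "q2": "no"}, "introvert"),
    ({"q1": "yes", "q2": "yes"}, "extrovert"),
    ({"q1": "no", "q2": "yes"}, "introvert"),
]


def train(rows):
    """Step 1: build frequency tables of classes and of answers per class."""
    class_counts = Counter(label for _, label in rows)
    answer_counts = defaultdict(Counter)  # (label, question) -> answer counts
    for answers, label in rows:
        for question, answer in answers.items():
            answer_counts[(label, question)][answer] += 1
    return class_counts, answer_counts


def likelihood(class_counts, answer_counts, label, question, answer):
    """Step 2: P(answer | label), with add-one smoothing over {yes, no}."""
    seen = answer_counts[(label, question)][answer]
    return (seen + 1) / (class_counts[label] + 2)


def posteriors(class_counts, answer_counts, answers):
    """Step 3: normalised P(label | answers) under the naive assumption."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(label)
        for question, answer in answers.items():
            score *= likelihood(class_counts, answer_counts,
                                label, question, answer)
        scores[label] = score
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}


def predict(class_counts, answer_counts, answers):
    """Return the class with the highest posterior probability."""
    post = posteriors(class_counts, answer_counts, answers)
    return max(post, key=post.get)


class_counts, answer_counts = train(TRAINING)
new_user = {"q1": "yes", "q2": "no"}
print(predict(class_counts, answer_counts, new_user))
```

With this toy data, a new user answering yes to q1 and no to q2 is assigned to the "extrovert" class, because that class has the larger product of prior and per-question likelihoods.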

B. Support Vector Machine

Support vector machines focus only on the points that are the most difficult to tell apart, whereas other classifiers pay attention to all of the points. The intuition behind the support vector machine approach is that if a classifier is good at the most challenging comparisons (the points in B and A that are closest to each other in Figure 2), then the classifier will be even better at the easy comparisons (comparing points in B and A that are far away from each other). The best dividing line maximizes the distance between the B points closest to A and the A points closest to B. It is not necessary to look at all of the points to do this. In fact, incorporating feedback from points that are far away can bump the line a little too far, as seen.

Unlike other classifiers, the support vector machine is explicitly told to find the best separating line. How? The support vector machine searches for the closest points (Figure 2), which it calls the "support vectors" (the name "support vector machine" is due to the fact that points are like vectors and that the best line "depends on" or is "supported by" the closest points). Once it has found the closest points, the SVM draws a line connecting them (see the line labeled 'w' in Figure 2). It draws this connecting line by doing vector subtraction (point A - point B). The support vector machine then declares the best separating line to be the line that bisects, and is perpendicular to, the connecting line. The support vector machine is better because when you get a new sample (new points), you will have already made a line that keeps B and A as far away from each other as possible, and so it is less likely that one will spill over across the line into the other's territory.

Support vector machine (SVM) is a nonlinear classifier which is often reported as producing superior classification results compared to other methods. The idea behind the method is to nonlinearly map the input data to some high-dimensional space, where the data can be linearly separated, thus providing great classification (or regression) performance. One of the bottlenecks of the SVM is the large number of support vectors used from the training set to perform classification (regression) tasks.

The SVM works as follows:

1. Get the question answers and create vectors.
2. Calculate the weighted value of the vectors.
3. Take the vector with the higher value and find the value of the personality.
4. Predict the user's personality type.
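The geometric intuition described in this section (find the closest points of the two classes, connect them by vector subtraction, and take the perpendicular bisector as the separating line) can be sketched in a few lines of Python. This is only an illustration of that intuition under simplifying assumptions, not a full SVM: a real SVM solves a margin-maximization problem and may use kernels. The sample points and function names are hypothetical.

```python
from itertools import product


def closest_pair(class_a, class_b):
    """Return the pair (a, b) with the smallest squared Euclidean distance."""
    return min(
        product(class_a, class_b),
        key=lambda pair: sum((x - y) ** 2 for x, y in zip(*pair)),
    )


def bisecting_line(a, b):
    """Build the line w.x + bias = 0 that perpendicularly bisects segment ab.

    w is the connecting vector (point A - point B); the line passes
    through the midpoint of the two closest points.
    """
    w = tuple(x - y for x, y in zip(a, b))
    midpoint = tuple((x + y) / 2 for x, y in zip(a, b))
    bias = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, bias


def classify(w, bias, point):
    """Points on the positive side of the line are labeled A, the rest B."""
    score = sum(wi * xi for wi, xi in zip(w, point)) + bias
    return "A" if score > 0 else "B"


# Hypothetical 2-D feature vectors for two personality classes.
CLASS_A = [(3.0, 3.0), (4.0, 3.5), (5.0, 5.0)]
CLASS_B = [(1.0, 1.0), (0.0, 1.5), (1.5, 0.0)]

a, b = closest_pair(CLASS_A, CLASS_B)
w, bias = bisecting_line(a, b)
print("closest points:", a, b)
print("w =", w, "bias =", bias)
print(classify(w, bias, (4.0, 4.0)), classify(w, bias, (0.5, 0.5)))
```

For these sample points the closest pair is (3, 3) and (1, 1), so the separating line is 2x + 2y - 8 = 0. Points far from the boundary, such as (5, 5), play no role in placing it, which matches the observation that only the hardest-to-separate points matter.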
[Figure: Interface of the system (questions page)]

[Figure: Results of the Advanced Naive Bayes and SVM algorithms, displayed separately]

V. RESULT ANALYSIS
Analysis on Advanced Naïve Bayes and SVM

Naive Bayes belongs to the class of generative models for classification. It models the posterior probability from the class-conditional densities, so the output is a probability of belonging to a class. SVM, on the other hand, is based on a discriminant function given by y = w.x + b. Here the weights w and the bias parameter b are estimated from the training data. SVM tries to find a hyperplane that maximizes the margin, and there is an optimization function for this purpose. Performance-wise, SVMs using the radial basis function kernel are more likely to perform better, as they can handle non-linearities in the data. Naive Bayes performs best when the features are independent of each other, which rarely happens in practice, but its performance is still good in that case as well.

VI. CONCLUSION
Social network analysis has increased tremendously in recent times. Extracting the personality of authors on social networking websites is very useful for many applications in various domains, including job success, attractiveness, marital satisfaction, and happiness. Personality detection from text means extracting the behavioral characteristics of the authors who wrote the text. This paper presents a state-of-the-art review of the emerging field of personality detection from text. It discusses the state-of-the-art methods for personality detection as well as the publicly available datasets. Two types of techniques have been employed for detecting personality from text: a machine learning approach based on social network activities, and an approach based on the linguistic properties present in the text.

Apart from the work done towards this system, future work mainly comprises the following objectives:
● Extend the experiments with the methods proposed in the current research to sentiment analysis, opinion mining, and emotion detection in other domains, and extend the method in this work to Big-Five personality detection.
● Test the proposed approach on a larger dataset of Facebook profile pictures.
● Add a module that provides the user with career guidance matching his or her personality.
  · For example, a user who speaks well and is able to convince others would be well suited to the marketing field.

VII. REFERENCES
[1] Fazel Keshtkar, Candice Burkett, Haiying Li, and Arthur C. Graesser. Using Data Mining Techniques to Detect the Personality of Players in an Educational Game.

[2] R. Wald, T. M. Khoshgoftaar, and A. Napolitano. Using Twitter Content to Predict Psychopathy.

[3] Yago Saez, Carlos Navarro, Asuncion Mochon, and Pedro Isasi. A System for Personality and Happiness Detection.

[4] Aleksandar Kartelj, Vladimir Filipović, and Veljko Milutinović. Novel Approaches to Automated Personality Classification: Ideas and Their Potentials.

[5] Oberlander, J., and Nowson, S. 2006. Whose thumb is it anyway? Classifying author personality from weblog text. In Proc. of the 44th Annual Meeting of the Association for Computational Linguistics (ACL). 627–634.

[6] Golbeck, J., Robles, C., and Turner, K. 2011a. Predicting Personality with Social Media. In Proc. of the 2011 Annual Conference Extended Abstracts on Human Factors in Computing Systems. 253–262.

[7] Golbeck, J., Robles, C., Edmondson, M., and Turner, K. 2011b. Predicting Personality from Twitter. In Proc. of International Conference on Social Computing. 149–156.

[8] Bachrach, Y., Kosinski, M., Graepel, T., Kohli, P., and Stillwell, D. J. 2012. Personality and Patterns of Facebook Usage. In Proc. of Web Science 2012. 36–45.

[9] Matthews, G., Deary, I., and Whiteman, M. 2003. Personality Traits. Cambridge University

[10] Staiano, J., Lepri, B., Aharony, N., Pianesi, F., Sebe, N., and Pentland, A. S. 2012. Friends don't Lie - Inferring Personality Traits from Social Network Structure. In Proceedings of International Conference on Ubiquitous Computing.
