Available online at www.sciencedirect.com
ScienceDirect
Procedia Computer Science 157 (2019) 64–71
www.elsevier.com/locate/procedia

4th International Conference on Computer Science and Computational Intelligence 2019 (ICCSCI), 12–13 September 2019

Gender Demography Classification on Instagram based on User's Comments Section
Reynaldo Na. Goenawanᵃ, William Chanricoᵃ, Derwin Suhartonoᵃ,*, Fredy Purnomoᵃ
ᵃ Computer Science Department, School of Computer Science, Bina Nusantara University, Jakarta, Indonesia 11480

Abstract

Social media has become an integral part of society, creating rising demand for social media marketing. New
breakthroughs in the field of artificial intelligence have enabled brand owners to boost social media marketing by
knowing the demographics of their target audience. This study presents a model that predicts the gender of social
media users from the comments section of their Instagram profiles, using AdaBoost, XGBoost, Support Vector Machine
(SVM), and Naive Bayes classifiers combined with grid search and K-fold cross-validation. The model was trained with
40,000 comments and achieved accuracies of 78.64% with Naive Bayes, 73.41% with XGBoost, 74.56% with AdaBoost, and
76.07% with SVM, which are generally higher than those reported in related studies on this subject. These results show
that Naive Bayes produces the highest accuracy of the four on short text classification and has the potential to
support social media marketing.

© 2019 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the 4th International Conference on Computer Science and
Computational Intelligence 2019.
Keywords: Gender prediction; Short text classification; Instagram

* Corresponding author.
E-mail address: dsuhartono@binus.edu

1877-0509 © 2019 The Authors. Published by Elsevier B.V.


10.1016/j.procs.2019.08.142

1. Introduction

According to a survey conducted by the Pew Research Center (2010), the adoption of the internet has driven the rise
of technology as a means of communication. In Indonesia, social media is one of the most popular services. Social
media usage has generated large volumes of data, in which users submit potentially private information. Service
providers are starting to recognize the complexity of the collected data and are trying to find more efficient
ways to process it. Private user information is, of course, protected by law. According to a study [1] titled “Privacy
on Reddit? Towards Large-Scale User Classification”, advances in technology, especially in the field of
artificial intelligence, have allowed service providers to process such data without violating users’ privacy.
Processing data collected from social media users is important for business growth. With the right method, service
providers can benefit from users’ personal information for purposes such as targeted marketing, advertising, or even
products personalized for specific consumers. Paul Webster, a Brand Developer at Instagram, observed that Indonesia
is among the countries with the largest number of Instagram users. About 89% of Instagram users in Indonesia fall in
the 18-34 age group [2] and access Instagram at least once a day. This large user base has made Instagram a favorite
target for social media marketing by many companies. There are two forms of social media marketing on Instagram:
Instagram Ads and influencers. Instagram Ads is mostly used by large companies that have clear targets and a
sufficient budget. “Influencer” describes an Instagram user who has a large number of followers and is open to
endorsement or “paid promote” services. While influencers are usually cheaper than official Instagram Ads,
advertisers cannot be sure of the demographics of an influencer's followers, unlike with official Instagram Ads,
where the target audience can be specified.

1.1. Related Studies

Several studies have attempted to classify Instagram users based on visual attributes. One of them is the work of
You et al. [3], who tried to predict the gender of Instagram and Pinterest users from visual attributes of the pictures
they uploaded. They used several categories of pictures as features for classifying gender. The pre-processed data was
used to train a Logistic Regression model.
Beyond Instagram, Twitter has also been used for user classification [5]. That study used Gradient Boosted Decision
Trees (GBDT) to classify political affiliation and race based on four groups of features: profile description,
tweeting behavior, following behavior, and tweet content. The same approach can also be used for commercial purposes,
such as determining whether users like Starbucks or not. The study concluded that a few features are dominant, such as
word and character choices and retweet, reply, and following behavior.
This study intends to find the best machine learning algorithm for classifying the gender demography of Instagram
users from publicly available data. With the right machine learning algorithm, this public information can support
targeted social media marketing, since knowing the gender demography of the target audience improves the
effectiveness of such campaigns.

2. Algorithms

2.1. AdaBoost

AdaBoost [6] is an ensemble algorithm that consists of multiple weak learners, usually decision trees. AdaBoost
combines these weak learners into a single strong learner with better accuracy. It corrects its errors by putting
more weight on items that were incorrectly classified in the previous iteration, so that each new weak learner
focuses on the examples that are hardest to classify.
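The reweighting idea can be illustrated with a minimal sketch of a single round of discrete AdaBoost for binary labels. The function name `adaboost_round` and the toy data are illustrative assumptions and are not taken from the implementation used in this study.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """One reweighting round of discrete (binary) AdaBoost.

    weights: current sample weights, assumed to sum to 1
    y_true, y_pred: labels in {-1, +1}
    Returns (alpha, new_weights).
    """
    # Weighted error of the weak learner under the current weights.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    # Clamp the error so the logarithm below stays finite.
    eps = 1e-10
    err = min(max(err, eps), 1 - eps)
    # Vote strength of this weak learner in the final ensemble.
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified samples (t * p == -1) are scaled up, correct ones down.
    updated = [w * math.exp(-alpha * t * p)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Toy example (hypothetical data): four samples, the second is misclassified.
weights = [0.25, 0.25, 0.25, 0.25]
y_true = [1, 1, -1, -1]
y_pred = [1, -1, -1, -1]
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
```

After the update, the misclassified sample carries a larger share of the total weight than each correctly classified one, which is what steers the next weak learner toward it.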

2.2. XGBoost

XGBoost [4] is another boosting algorithm, designed specifically to be scalable and fast. It differs from AdaBoost
in that its approach is based on gradient boosting. On many occasions, XGBoost outperforms other ensemble
classifiers by a significant margin. While the basic concept is the same as gradient boosting, XGBoost achieves
faster training through cache-aware access patterns, data compression, and sharding.

2.3. Naïve Bayes

Naive Bayes [7] is a classifier that assigns the most likely class to a given data point. It applies Bayes' theorem,
combining the prior probability of each class with the likelihood of the observed features under that class to compute
the probability that the data point belongs to the class. The “naive” part comes from the assumption that features are
independent of one another given the class. In this experiment, it means that Naive Bayes ignores any connection
between the words within a comment.
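As an illustration of this word-independence assumption, the following is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing. The whitespace tokenization, the function names, and the toy comments are assumptions made for the example and do not reproduce the model trained in this study.

```python
import math
from collections import Counter, defaultdict

def train_nb(comments, labels):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(comments, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, text):
    """Return the class with the highest log posterior for the comment."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, n_docs in class_counts.items():
        # Log prior of the class.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # words never seen in training are skipped
            # Each word contributes independently (the "naive" assumption);
            # add-one smoothing avoids zero probabilities.
            score += math.log((word_counts[label][token] + 1)
                              / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy comments, for illustration only.
comments = ["nice goal bro", "love your dress", "great match bro", "pretty dress sis"]
labels = ["male", "female", "male", "female"]
model = train_nb(comments, labels)
prediction = predict_nb(model, "bro what a goal")
```

Because every word adds its own term to the score, the order of words and any relation between them play no role in the prediction.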

2.4. Support Vector Machine (SVM)

SVM [8] is a supervised learning method for classification or regression that tries to minimize the empirical
classification error while maximizing the geometric margin. To do so, it uses a maximum margin separator. SVM maps
the input vectors to a higher dimensional space and then separates the classes using a hyperplane. Two parallel
hyperplanes are constructed, one on each side of the separating hyperplane, and the distance between them is the
margin to be maximized. The support vectors (the points closest to the separator) determine where the maximum margin
is constructed.

3. Methodology

Fig. 1. Flow of study

3.1. Data Crawling

The first step was crawling the data. Using the public API from Instagram, we obtained a list of usernames.
Each account was checked to confirm that it was public before being crawled any further. Media and comments were
then collected from each user for use in data labeling. A total of 64,000 comments were collected from
1,369 accounts, consisting of 881 female and 488 male accounts.

3.2. Data Labeling

Data labeling was done with the help of the Microsoft Face API, a service that returns information on what is in an
image. All pictures were filtered so that only pictures containing exactly one face were used to label the
corresponding user. Each user’s gender was determined from a minimum of 10 pictures. The Microsoft Face API itself
has an accuracy of 93.7% [9].
If more than one gender was predicted for a single user, the one with the higher probability was used. After each
picture had been labeled by the Microsoft Face API, the labeled pictures were verified manually to spot any mistakes
in the labeling process. When a picture was suspected of having been mislabeled, the picture and any corresponding
mistakes were removed from the dataset.
The collected comments were then paired with each user’s predicted gender and stored as JSON.
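One possible reading of this labeling rule is sketched below. How the per-picture probabilities were actually combined is not specified, so summing the confidences per gender is an assumption, as are the function name `label_user`, the `(gender, confidence)` input format, and the JSON field names.

```python
import json
from collections import defaultdict

MIN_PICTURES = 10  # minimum number of single-face pictures per user (from the text)

def label_user(face_results):
    """face_results: list of (gender, confidence) pairs, one per single-face picture.

    Returns the gender with the highest summed confidence, or None when
    fewer than MIN_PICTURES usable pictures are available.
    """
    if len(face_results) < MIN_PICTURES:
        return None
    totals = defaultdict(float)
    for gender, confidence in face_results:
        totals[gender] += confidence
    return max(totals, key=totals.get)

# Hypothetical per-picture results for one account.
results = [("female", 0.9)] * 7 + [("male", 0.6)] * 3

# Hypothetical record layout pairing the comments with the predicted gender.
record = {
    "username": "example_user",
    "gender": label_user(results),
    "comments": ["nice pic", "love it"],
}
encoded = json.dumps(record)
```

Accounts that return `None` would simply be left out of the labeled dataset under this reading.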

3.3. Preprocessing

The gathered comments were first cleaned of spam. A simple keyword such as ‘sell’ or ‘sale’ was taken to indicate
that a comment had been sent by a bot, and such comments were removed from the dataset. At the end of this
step, 40,000 labeled comments remained.
The dataset was converted into count vectors (bag-of-words), which record the number of times each word occurs within
a single comment. Bag-of-words was selected because of the nature of Instagram comments: they are very short, and
there is little to no repetition of words within a single comment. For the same reason, preprocessing that trims
words, such as lemmatization or stemming, was not applicable.
Because the data was sparse, the dataset was saved as a coordinate list (COO) to reduce the memory and computing
power needed for training.
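The preprocessing steps can be sketched in plain Python as follows. Only the keywords ‘sell’ and ‘sale’ come from the text; whole-word matching, whitespace tokenization, and the function names are assumptions made for the example.

```python
from collections import Counter

SPAM_KEYWORDS = {"sell", "sale"}  # the two example keywords named in the text

def is_spam(comment):
    """Flag a comment when any whole word matches a spam keyword (assumed rule)."""
    return any(word in SPAM_KEYWORDS for word in comment.lower().split())

def to_coo(comments):
    """Build a bag-of-words count matrix as a coordinate list.

    Returns (vocab, rows, cols, data), where entry i of the three lists
    says: the word with index cols[i] occurs data[i] times in comment rows[i].
    """
    vocab = {}
    rows, cols, data = [], [], []
    for row, comment in enumerate(comments):
        for word, count in Counter(comment.split()).items():
            col = vocab.setdefault(word, len(vocab))
            rows.append(row)
            cols.append(col)
            data.append(count)
    return vocab, rows, cols, data

# Hypothetical comments, for illustration only.
raw = ["big sale today", "so pretty", "so so cute"]
clean = [c for c in raw if not is_spam(c)]
vocab, rows, cols, data = to_coo(clean)
```

Only the non-zero counts are stored, which is why the coordinate form stays small even when the vocabulary is large and each comment uses just a few words.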

3.4. Training & Testing

Training was conducted with 7-fold cross-validation on all four algorithms mentioned above: XGBoost, Naive Bayes,
AdaBoost, and SVM. XGBoost and AdaBoost were chosen because of the high accuracy they have shown in competitions in
recent years, while Naive Bayes and SVM are well known for their high accuracy on very short text classification. The
number 7 was chosen to make this study’s results directly comparable to previous works on similar topics, which used
7-fold cross-validation.
To achieve the best accuracy for each algorithm, grid search was used to find the best combination of parameters,
such as kernel, gamma, and C for SVM. For algorithms based on ensemble classifiers, the number of weak classifiers
in the ensemble was treated as a parameter as well. Grid search tries every possible combination of the specified
parameter values. K-fold cross-validation was used to keep the variance of the accuracy results low and to minimize
any bias from the training dataset.
Training each algorithm produces a model that is later used to classify an account’s gender.
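The combination of grid search and K-fold cross-validation can be sketched as below. The parameter values are those listed for SVM in Table 4; the helper names, the contiguous fold layout, and the scoring callback are assumptions. In particular, `toy_score` is an arbitrary stand-in that returns no real accuracy; an actual callback would fit a classifier on the training indices and evaluate it on the test indices.

```python
from itertools import product

def k_fold_indices(n_samples, k=7):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def grid_search(param_grid, score_fn, n_samples, k=7):
    """Evaluate every parameter combination with k-fold cross-validation."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [score_fn(params, train, test)
                  for train, test in k_fold_indices(n_samples, k)]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Parameter values as listed for SVM in Table 4.
svm_grid = {"kernel": ["rbf", "poly", "sigmoid"], "gamma": [1, 0.1], "C": [10, 1]}

def toy_score(params, train_idx, test_idx):
    # Arbitrary placeholder formula used only to exercise the search loop.
    return 1.0 / (1.0 + params["gamma"] + 1.0 / params["C"])

best_params, best_score = grid_search(svm_grid, toy_score, n_samples=70)
```

With three kernels, two gamma values, and two C values, the search evaluates 12 combinations, each on 7 folds.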

3.5. Tuning

After the initial training, efforts were made to reduce training time and increase accuracy. First, the models were
retrained using a smaller wordlist, obtained by converting all uppercase letters to lowercase. This did not change
accuracy significantly, but it did lower training time and hardware usage. XGBoost training was also run concurrently
to further reduce training time; training multiple AdaBoost models concurrently, however, was more problematic due to
its high hardware requirements.
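The effect of lowercasing on the size of the wordlist can be seen in a small sketch. The helper name `vocabulary` and the sample comments are hypothetical.

```python
def vocabulary(comments, lowercase=False):
    """Collect the set of distinct whitespace-separated words."""
    words = set()
    for comment in comments:
        text = comment.lower() if lowercase else comment
        words.update(text.split())
    return words

# Hypothetical comments, for illustration only.
comments = ["Nice Pic", "nice pic bro", "NICE"]
original_size = len(vocabulary(comments))
lowercase_size = len(vocabulary(comments, lowercase=True))
```

Case variants of the same word collapse into one entry, so the count vectors have fewer columns and training needs less memory.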

4. Results and Analysis

4.1. AdaBoost

Table 1 compares the results of the AdaBoost algorithm executed with different parameters through grid search. The
parameters were the boosting algorithm (SAMME or SAMME.R) and the number of decision trees (N Estimator).
SAMME.R is an improved version of SAMME.

Table 1. AdaBoost accuracy result comparison.

Algorithm   N Estimator   M:F (1:1)   M:F (1:2)   M:F (2:1)   M:F (2:3)   M:F (3:2)
SAMME.R     200           64.79%      73.80%      72.17%      69.91%      69.71%
SAMME       200           58.92%      70.13%      66.76%      64.78%      64.16%
SAMME.R     400           66.80%      74.19%      72.19%      70.64%      71.09%
SAMME       400           59.97%      70.54%      67.75%      65.09%      65.81%
SAMME.R     600           67.37%      74.53%      72.52%      70.86%      71.11%
SAMME       600           60.69%      70.78%      68.19%      65.87%      65.88%
SAMME.R     800           68.16%      74.49%      72.67%      71.45%      71.09%
SAMME       800           61.42%      70.97%      68.50%      66.24%      65.98%
SAMME.R     1000          68.97%      74.66%      72.96%      71.64%      71.38%
SAMME       1000          61.45%      71.35%      68.92%      66.63%      66.56%

4.2. XGBoost

XGBoost tends to perform well in terms of training speed, which is made possible by its use of Newton boosting with
decision trees, but for the given dataset it falls short of AdaBoost in accuracy. The parameters used in the XGBoost
grid search were gamma, learning rate, and the number of decision trees (N Estimator).

Table 2. XGBoost accuracy result comparison.

Gamma   Learning Rate   N Estimator   M:F (1:1)   M:F (1:2)   M:F (2:1)   M:F (2:3)   M:F (3:2)
1       0.1             180           61.29%      70.86%      68.60%      66.25%      66.14%
1       0.5             180           64.71%      73.27%      71.26%      69.24%      69.18%
1       0.1             250           62.16%      71.41%      69.53%      66.63%      66.82%
1       0.5             250           64.99%      73.69%      71.57%      69.90%      69.58%
0.1     0.5             250           64.01%      72.92%      71.69%      69.96%      69.51%
0.1     0.1             180           61.26%      71.06%      68.76%      66.30%      66.04%
0.1     0.1             250           62.27%      71.19%      69.51%      66.84%      66.89%

4.3. Naïve Bayes

Naïve Bayes [7] takes the occurrences of each word into account and disregards any potential dependencies between
words in text classification, hence the high accuracy that can be seen in Table 3.

Table 3. Naïve Bayes accuracy result comparison.


Gender Ratio Accuracy
M:F (1:1) 74.98%
M:F (1:2) 78.64%
M:F (2:1) 74.57%
M:F (2:3) 75.39%
M:F (3:2) 74.81%

4.4. Support Vector Machines

A Support Vector Machine constructs a hyperplane that divides the dataset into its possible classes. The main purpose
is to find the most efficient way to separate the dataset in a higher dimension. Grid search was used to find the best
combination of kernel, gamma, and C.

Table 4. SVM accuracy result comparison.

Kernel    Gamma   C    M:F (1:1)   M:F (1:2)   M:F (2:1)   M:F (2:3)   M:F (3:2)
RBF       1       10   65.00%      69.40%      68.80%      64.60%      66.14%
RBF       1       1    63.98%      69.06%      68.48%      64.17%      66.15%
RBF       0.1     10   71.66%      76.27%      74.13%      72.91%      73.36%
RBF       0.1     1    70.53%      73.72%      71.56%      70.66%      69.56%
Poly      1       10   66.42%      70.21%      69.24%      68.01%      67.80%
Poly      1       1    67.23%      70.07%      68.59%      68.59%      68.61%
Poly      0.1     1    62.59%      69.51%      66.61%      64.67%      63.78%
Sigmoid   -       10   64.94%      69.28%      64.54%      66.58%      64.36%
Sigmoid   -       1    65.40%      69.65%      70.33%      67.56%      68.42%

Table 4 shows that the RBF kernel achieved the highest accuracy, probably because of its ability to classify small
datasets that have many features.

4.5. Accuracy Comparisons

Table 5 shows that the best accuracy was achieved when the model was trained on a dataset comprising one man for
every two women, which is the same ratio reported by Statista [2]. This gender ratio outperforms every other ratio,
regardless of the algorithm and its parameters. Most algorithms take the probability of each class into account, so
the skewed dataset does not affect accuracy negatively.
Table 5 also shows that the algorithms based on decision trees (XGBoost and AdaBoost) fail to match the accuracy of
Naive Bayes and SVM by a considerable margin. One possible explanation is that decision tree-based algorithms are
prone to overfitting, especially on small datasets with a large number of features, as in previous research on this
subject [10]. Naive Bayes scores best in this study, mainly because its naive statistical approach does not require
relations between features to work.

Table 5. Overall accuracy result comparison.

Gender Ratio   Naive Bayes   XGBoost   AdaBoost   Support Vector Machines   Best
M:F (1:1)      75.88%        65.11%    68.97%     71.66%                    75.88%
M:F (1:2)      78.61%        72.92%    74.66%     76.27%                    78.61%
M:F (2:1)      74.57%        71.69%    72.96%     74.13%                    74.57%
M:F (2:3)      75.39%        69.96%    71.64%     72.91%                    75.39%
M:F (3:2)      74.81%        69.51%    71.38%     73.36%                    74.81%

5. Conclusion

5.1. Conclusion and Recommendation

Classifying gender demography on Instagram can be done by processing public comments with machine learning.
Of the algorithms used in this experiment, Naive Bayes outperformed the other three, while XGBoost and AdaBoost
scored relatively low accuracy due to their inability to work well with very short texts that have many features.
Naive Bayes works best for this kind of approach because it is less prone to overfitting, which is particularly
useful because Instagram comments are by nature very short texts. This study showed that gender classification can be
done using comments on Instagram, as opposed to the more common facial recognition approach.
By knowing the gender distribution of an account’s followers, meaningful business decisions can be made, especially
for, but not limited to, targeted social media marketing.

References

1 Fabian B, Baumann A, Keil M. Privacy on Reddit? Towards Large-scale User Classification.
2 Global Instagram user age & gender distribution 2019 | Statistic [Internet]. Statista. 2019 [cited 8 May 2019]. Available from:
https://www.statista.com/statistics/248769/agedistribution-of-worldwide-instagram-users/
3 You Q, Bhatia S, Sun T, Luo J. The eyes of the beholder: Gender prediction using images posted in online social networks. In 2014 IEEE
International Conference on Data Mining Workshop 2014 (pp. 1026-1030). IEEE.
4 Chen T, Guestrin C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining 2016 Aug 13 (pp. 785-794). ACM.
5 Pennacchiotti M, Popescu AM. A machine learning approach to Twitter user classification. In Fifth International AAAI Conference on
Weblogs and Social Media 2011 Jul 5.
6 Freund Y, Schapire R, Abe N. A short introduction to boosting. Journal-Japanese Society for Artificial Intelligence. 1999 Sep 1;14(771-
780):1612.
7 Rish I. An empirical study of the naive Bayes classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence 2001 Aug 4
(Vol. 3, No. 22, pp. 41-46).
8 Durgesh KS, Lekha B. Data classification using support vector machine. Journal of Theoretical and Applied Information Technology. 2010
Feb;12(1):1-7.
9 Buolamwini J, Gebru T. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness,
Accountability and Transparency 2018 Jan 21 (pp. 77-91).
10 Pranckevičius T, Marcinkevičius V. Comparison of naive Bayes, random forest, decision tree, support vector machines, and logistic
regression classifiers for text reviews classification. Baltic Journal of Modern Computing. 2017 Jan 1;5(2):221-32.
