
“FACEBOOK COMMENT VOLUME PREDICTION”

Partial submission (Project Notes – III)

Post Graduate Program in Business Analytics and Business Intelligence

Capstone Project Report

Submitted to

Submitted by:

Yogesh Sharma

Under the guidance of

(Mr. Anirban Dey)

Batch- PGPBABI. Jan’19

Date of submission: 20th Dec 2019


ACKNOWLEDGEMENT

I would like to convey my sincere gratitude to my mentor, Mr. Anirban Dey, for his able guidance and
mentorship. I expect that his deep understanding of the use case and his business acumen will help me
in charting the right approach and deploying the appropriate models for the analytics problem at
hand.

I would also like to thank the Great Lakes management for giving me the opportunity to work on a real-world
case scenario, which will surely help me apply the learning practically.

NOTE

This document is submitted in fulfilment of the requirements for 'Project Notes-III'. It builds on the
analysis done in Project Notes I and Project Notes II, which shall be enriched further based on the
mentor/evaluator comments.

Table of Contents

1. INTRODUCTION

2. PROJECT BACKGROUND

3. PROJECT OBJECTIVE

4. APPROACH AND METHODOLOGY

5. TECHNIQUES, TOOLS, DOMAIN

6. EXPLORATORY DATA ANALYSIS

7. ANALYTICAL MODEL BUILDING

8. MODEL ANALYSIS

9. MODEL RESULTS AND CONCLUSION

1. Introduction:

The trend towards social networking has drawn high public attention over the past two decades.
For both small businesses and large corporations, social media is playing a key role in brand building and
customer communication. Facebook is one of the social networking sites that firms use to make
themselves visible and accessible to customers. Facebook's advertising revenue in 2018 is estimated at
14.89 billion USD in the United States and 18.95 billion USD outside it. Other categories such as news,
communication, commenting, marketing, banking and entertainment also generate huge volumes of social
media content every minute.

As per a Forbes survey in 2018, Facebook has 2 billion active users, making it the largest social media
platform.
Here are some more intriguing Facebook statistics:

- 1.5 billion people are active on Facebook daily


- Europe has more than 307 million people on Facebook
- There are five new Facebook profiles created every second
- More than 300 million photos get uploaded per day
- Every minute there are 510,000 comments posted and 293,000 statuses updated

2. Project Background:

In this project, we use the most active social networking service, Facebook, and in particular Facebook
Pages, for the analysis. Our research is oriented towards estimating the volume of comments that a post is
expected to receive in the next few hours. Before turning to the problem of comment volume prediction,
some domain-specific concepts are discussed below:

- Public Group/Facebook Page: a public profile created specifically for businesses, brands,
celebrities, etc.
- Post/Feed: the individual stories published on a page by the page administrators.
- Comment: an important activity on social sites that gives a post the potential to become a discussion
forum; the extent to which readers are inspired to leave comments on a document/post is one measure of
its popularity and of the interest it generates.

3. Project Objective:

Based on the training dataset provided ('Facebook comment volume prediction'), the goal is to predict how
many comments a user-generated post is expected to receive in a given set of hours. We need to model
the user comment pattern over the set of variables provided and arrive at the right number of
comments for each post with the minimum possible error.

Here, the comment volume prediction is made taking the page category into account, i.e. posts from a
particular category of page will attract a certain volume of comments. In order to predict the comment volume
for each page and to find which page category receives the most comments, I shall use decision tree and
regression techniques to make the prediction effective. I shall also model the user comment pattern with
respect to page likes and popularity, page category and time.

As part of Project Notes III (covering the Notes I and II analysis), we shall focus on the following:

• Initial Data Report


o Visual inspection of data (rows, columns, descriptive details)
o Understanding of attributes (variable info, renaming if required)
• Exploratory Data Analysis
o Univariate analysis (distribution and spread for every continuous attribute, distribution of the
data in categories for categorical ones)
o Bivariate analysis (relationship between different variables, correlations)
o Insightful visualizations
• Data Pre-processing
o Removal of unwanted variables, outlier treatment
o Variable transformation, addition of new variables
• Model Build
o Analytical model building (covering the alternative analytical approaches that may be applied to
the problem)
• Model Analysis
o Validation and interpretation of modelling process
o Compare the models and outcomes
• Final Interpretation
o Choose the best model and interpret results
o Generate business insights

Dataset and attributes information are attached in the Appendix section.

4. Approach and Methodology:

As the number of comments in the Facebook dataset is continuous, we shall perform regression
analysis to determine the relationship between the target value and its predictors. We shall also look at
the distribution and spread of the variables using histograms and box plots.

To achieve the overall objectives, following approach shall be used:

• Use of techniques like Decision Tree, LASSO, K-Nearest Neighbor (KNN), Random Forest and
Linear Regression

• The error will be quantified using the RMSE (Root Mean Square Error) metric

• Conclude whether the K-Nearest Neighbor algorithm performs well and gives the most effective
prediction

In our experimental setup, the dataset is split into training and testing sets before modelling and converted
into vector (numeric) form so that it can be passed to the prediction models; the results are then compared
on the basis of the minimum error obtained. A minimal RMSE helper used for this comparison is sketched below.
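The report quantifies model error with RMSE but does not show the helper used, so the following is only a
minimal sketch (the column name Target.Variable and the objects fit and test_data are assumptions reused
in the later sketches):

## Root Mean Square Error: square root of the mean squared difference
## between actual and predicted comment counts
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

## Hypothetical usage with a fitted model 'fit' and a hold-out set 'test_data'
# pred <- predict(fit, newdata = test_data)
# rmse(test_data$Target.Variable, pred)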

The overall process is carried out phase by phase as per the diagram below:

Figure - 1

5. Techniques, Tools, Domain:

Techniques   EDA, Linear Regression, Decision Tree, LASSO, KNN, Random Forest, RMSE
Tools        R
Domain       Social Media

6. Exploratory Data Analysis:

6.1 First Level Insights (Data Filtering and Imputation)

The dataset used is a 'Facebook comment volume' record captured over a period of time, containing
32,759 rows and 43 variables.

One of the 43 variables is the target value (comment volume) for each post; the remaining features are
categorized below based on their relation to the target variable.

1) Page Features: These describe the popularity/likes of a page, its check-ins and its category. Page
likes capture user interest in the content of a page, such as status updates, wall posts, photos, profile
pictures and shares.

2) Essential Features: The pattern of comments from different users on the post at various time intervals
with respect to a randomly selected base time/date. CC1 to CC5 cover this.

Figure - 2

3) Weekday Features: Features covering the complete week, used to identify the weekday on which a post
was published and the weekday of the selected base time/date.

4) Other basic Features: The remaining features that help predict the comment volume for each
page category, including those documenting the source of the page and the date/time window for the next
H hours.

5) Without specifying parameters: The prediction behaves as expected when the model is fitted without
specifying any additional parameters. This plain regression run gives the expected result and is treated as
the best prediction among the runs with and without specified parameters. The call used is

glm(formula = train_scale$Target.Variable ~ ., data = train_scale)
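A minimal runnable form of that call, assuming train_scale is the scaled training data frame, is sketched
below (glm() with its default Gaussian family is equivalent to ordinary least squares):

## Hedged sketch of the plain, parameter-free fit referred to above
base_fit <- glm(Target.Variable ~ ., data = train_scale)
summary(base_fit)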

6.2 Variable Rationalization (Data bucketing)

In order to study the data better, we performed a preliminary variable reduction at the very beginning.
At this stage, we reduced the variables based on the following criteria:

▪ Redundant Variables
▪ Business relevance

▪ Correlated Variables
▪ Target Variable

Variable name               Type of variable
Page Popularity/likes       Business relevance
Page Check-ins              Business relevance
Page talking about          Business relevance
Page Category               Business relevance
Feature 5 - Feature 29      Redundant Variable
CC1                         Correlated Variable
CC2                         Correlated Variable
CC3                         Correlated Variable
CC4                         Correlated Variable
CC5                         Correlated Variable
Base time                   Business relevance
Post length                 Redundant Variable
Post Share Count            Business relevance
Post Promotion Status       Business relevance
H Local                     Business relevance
Post published weekday      Business relevance
Base Date Time weekday      Business relevance
Comments                    Target Variable
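A minimal sketch of this shortlisting step in R is given below; the column names are assumptions based on
the data dictionary, and intersect() keeps the sketch safe if any name differs in the actual file:

## Keep only the shortlisted variables; drop the redundant derived features
keep_cols <- c("Page.Popularity.likes", "Page.Checkins", "Page.talking.about",
               "Page.Category", "CC1", "CC2", "CC3", "CC4", "CC5",
               "Base.time", "Post.Share.Count", "Post.Promotion.Status",
               "H.local", "Post.published.weekday", "Base.DateTime.weekday",
               "Target.Variable")
Comments_short <- Comments[, intersect(keep_cols, names(Comments))]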

6.3 Data Validation and Analysis (Outlier skip patterns, missing inputs)

Let's start the analysis using R:

## Set Working Directory

setwd("C:/Users/Yogesh Sharma/Desktop/Capstone")

getwd()

## Read Input data

Comments = read.csv("Facebook.csv", header = TRUE)

## View column names

names(Comments)

## View Structure and Summary of Input data

str(Comments)

summary(Comments)

OBSERVATIONS:

1. Dependent variable: Target.Variable

2. All independent variables are numeric or integer except Post.published.weekday and Base.DateTime.weekday

3. Missing values are present in Page.likes, Page.Checkins, Page.talking.about, Page.Category, CC1-CC5 and the derived features

4. The maximum value of some key variables is high compared to the 3rd quartile - possibility of outliers?

5. A similar outlier possibility is found in Page.likes, Page.Checkins, Page.talking.about, CC1-CC5 and Post.Share.Count
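The report notes the missing values but does not show how they were handled, so the following is only a
hedged sketch (median imputation is an assumption, not the method stated in the report):

## Count missing values per column
colSums(is.na(Comments))

## Simple median imputation for the numeric columns (illustrative only)
num_cols <- names(Comments)[sapply(Comments, is.numeric)]
for (col in num_cols) {
  na_idx <- is.na(Comments[[col]])
  Comments[[col]][na_idx] <- median(Comments[[col]], na.rm = TRUE)
}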

## Examine Dependent Variable 'Comments'

attach(Comments)

## Build Histogram for Target.Variable to understand its distribution

hist(Target.Variable)

OBSERVATIONS: Possibly Outlier(s) affecting histogram

boxplot(Target.Variable)

OBSERVATIONS:

Most of the Comments are at the lower end - One outlier very far out

For now, let us examine only low Comments (< 1100)

library(dplyr)

## Keep only the posts with fewer than 1100 comments, without overwriting the original data
low_comments <- filter(Comments, Target.Variable < 1100)

hist(low_comments$Target.Variable)

OBSERVATIONS:

The number of observations reduced from 32,759 to 32,756; therefore, there were 3 extreme outliers.

With these removed, the distribution of Comments more closely resembles a normal distribution.

## Let us now examine the Integer Independent variables using the original dataset

names(Comments)

hist(Page.likes)

hist(Page.talking.about)

hist(Page.Category)

hist(CC1)

hist(CC2)

hist(CC3)

hist(CC4)

hist(CC5)

hist(H.local)

boxplot(Page.Category)

## Now let us examine the Categorical Variables

table(Post.published.weekday)

> table(Post.published.weekday)

Friday Monday Saturday Sunday Thursday Tuesday Wednesday

1 4813 4693 4437 4043 4692 4920 5161

plot(Post.published.weekday)

table(Base.DateTime.weekday)

plot(Base.DateTime.weekday)

6.4 Visual representation

6.5 Data Modelling and Experimental Settings

For further analysis, the Facebook page data with the user pattern for each page is taken for training and
testing. The sorted data is cleaned, and the cleaned corpus is divided into two subsets using a temporal
split: (1) training data (80%, 32,756 rows) and (2) testing data (20%, 8,190 rows). A sketch of this split is
given after the list below.

A. Training dataset - the training data goes through variable selection and calculation and is then
vectorized; this is termed pre-processing.

B. Testing dataset - the testing data is also vectorized, i.e. the 8,190 rows were formed into vectors of
100 observations each, 10 in total, as modelled.
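The report does not show the splitting code, so the block below is only a sketch of an 80/20 temporal split;
it assumes the cleaned data frame Comments has a numeric Base.time column to order on:

## Order by base time and take the first 80% for training, the rest for testing
Comments_sorted <- Comments[order(Comments$Base.time), ]
cut_point  <- floor(0.8 * nrow(Comments_sorted))
train_data <- Comments_sorted[seq_len(cut_point), ]
test_data  <- Comments_sorted[-seq_len(cut_point), ]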

• Looking at the structure of the training data, we can see that the data is now in 'integer' or 'float'
format, which makes the mathematical calculations easier. Next, we go through the dataset to
understand how the data is distributed. To see on which day of the week posts are published,
we plot a graph.

From the graph, we can see that the frequency of posts increases day by day, reaches its maximum on
Wednesday and then declines gradually.

Next, we examine how comments arrive on these posts relative to the base time.

• Next, we examine the characteristics of the post length.

• With the count and mean shown, we can clearly see how the data is distributed.
Similarly, we examine the characteristics of 'Post_share_count'.

• With the count and mean shown, we can clearly see how the data is distributed.
Similarly, we examine the characteristics of CC1, CC2 and CC3.

With the count and mean shown, we can clearly see how the data is distributed.

There are other columns in our dataset (the derived Feature columns) whose content we do not fully
understand. In the data dictionary only generic terms are mentioned; they represent quantities such as the
mean, minimum value, maximum value, average, median and standard deviation. It is therefore better to
understand the correlation between these columns by drawing heat maps; a hedged sketch of such a heat
map is given below.
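The report does not show the plotting code, so the following is an illustrative sketch; the corrplot package
and the selection of all numeric columns are assumptions (in the report the heat map covers only the
derived feature columns):

## Correlation heat map over the numeric columns
library(corrplot)

feature_cols <- Comments[sapply(Comments, is.numeric)]
corr_matrix  <- cor(feature_cols, use = "pairwise.complete.obs")
corrplot(corr_matrix, method = "color", tl.cex = 0.6)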

From the heat map, we can see that 'c21' has the lowest values compared to all other columns. 'c24' also
has very low values, though slightly higher than 'c21'. 'c11' has the highest values of any column, and 'c6'
also contains relatively high values, though not as high as 'c11'. The data in columns 'CC1', 'CC2', 'CC3'
and 'CC4' are evenly distributed, with no column being consistently high or low compared to the others.

7. Analytical model building

Based on the inferences from data processing, the decision tree results shall be compared with those of
the other algorithms: linear regression, random forest, LASSO and KNN. The decision tree results will be
compared against the other algorithms in both cases, with and without parameters. We shall observe that,
for each algorithm, the non-parametric form gives better prediction results than the parametric form. The
random forest RMSE value will be observed to be high; its predictions come close to the expected results,
but it does not give consistent results either.

We shall detail the analysis covering the techniques/models below in this document; illustrative R sketches
for these models follow the list:

1) Linear Regression: A common regression technique that helps in forecasting results. In this model the
dependent variable is continuous, and the independent variables may be discrete or continuous depending
on the values given. The fit is based on the line equation, and the mean squared error (based on the
residuals, the differences between the observations and the fitted values) is calculated alongside it.

2) K-Nearest Neighbors: KNN is another effective algorithm; it is used for the analysis without specifying
model parameters and makes predictions based on the similarity of data points.

3) Decision Tree: A tree-structured model. It selects the splitting nodes by itself from the input given and
forms a tree. The regression tree differs from classification in that it averages the target within each node
and places the most influential variable at the root node.

4) Random Forest: It is well suited to large datasets and picks variables randomly when fitting. If the
response is a factor, randomForest performs classification; if the response is continuous, it performs
regression. (Unsupervised data is generally called unlabelled data.) It randomly selects predictors to build
a group of decision trees that together form the model.

5) Least Absolute Shrinkage and Selection Operator (LASSO): It creates a regression model that is
penalized with the L1-norm, the sum of the absolute coefficients. This has the effect of shrinking the
coefficients.

6) Statistical testing for choosing the best predictors: we shall use hypothesis tests such as the Chi-square
test, ANOVA, the F-test and Pearson correlation. For a large dataset with different groups of parameters
and a mix of continuous and categorical values, we shall go with the ANOVA test.
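The report does not include the fitting code itself, so the block below is only a hedged sketch of how these
models might be fitted in R on the train/test split from Section 6.5; the package choices (rpart, caret,
randomForest, glmnet) and all tuning values are assumptions, not taken from the report.

## Hedged sketch: fit the candidate models on train_data / test_data
library(rpart)          # decision tree
library(caret)          # KNN regression with K tuning
library(randomForest)   # random forest
library(glmnet)         # LASSO

## 1) Linear regression
lm_fit <- lm(Target.Variable ~ ., data = train_data)

## 2) K-Nearest Neighbors (predictors centred/scaled, K tuned by cross-validation)
knn_fit <- train(Target.Variable ~ ., data = train_data, method = "knn",
                 preProcess = c("center", "scale"),
                 tuneGrid   = expand.grid(k = 1:10),
                 trControl  = trainControl(method = "cv", number = 5))

## 3) Decision tree (regression tree)
dtree <- rpart(Target.Variable ~ ., data = train_data, method = "anova")

## 4) Random forest
rf_fit <- randomForest(Target.Variable ~ ., data = train_data,
                       ntree = 500, importance = TRUE, na.action = na.omit)

## 5) LASSO (alpha = 1 gives the L1 penalty)
x_train  <- model.matrix(Target.Variable ~ ., data = train_data)[, -1]
lasso_cv <- cv.glmnet(x_train, train_data$Target.Variable, alpha = 1)

## 6) Example statistical test: ANOVA of comment volume across posting weekday
summary(aov(Target.Variable ~ Post.published.weekday, data = train_data))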

8. Model Analysis

8.1 Models evaluation

> summary(dtree) ------------------------ Decision Tree

Call

n= 32759

CP nsplit rel error xerror xstd

1 0.11856017 0 1.0000000 1.0000339 0.08565948

2 0.04939177 2 0.7628797 0.7929352 0.07076855

3 0.04115737 3 0.7134879 0.7602787 0.06974433

4 0.04013157 5 0.6311731 0.7001560 0.06674591

5 0.02915479 6 0.5910416 0.6562073 0.06152217

6 0.02292093 7 0.5618868 0.6165994 0.05988566

7 0.01847200 8 0.5389658 0.5859780 0.05704875

8 0.01741446 9 0.5204938 0.5783039 0.05650934

9 0.01593535 10 0.5030794 0.5766578 0.05650324

10 0.01190385 11 0.4871440 0.5456269 0.05377161

11 0.01139054 12 0.4752402 0.5418436 0.05381691

12 0.01055486 13 0.4638496 0.5328157 0.05381680

13 0.01000000 15 0.4427399 0.5205916 0.05186179

Variable importance

            Base.time                   CC1    Page.talking.about
                   52                    36                     6
Page.Popularity.likes         Page.Category         Page.Checkins
                    3                     2                     1

OBSERVATIONS:

In a decision tree, the root node error is used together with the values shown in the rel error and xerror
columns (each of which depends on the CP, the complexity parameter in the first column) to compute two
measures of predictive error:

root node error x rel error gives the re-substitution error rate (the error rate computed on the training
sample), and root node error x xerror gives the cross-validated error rate. For row 3 of the CP table above,
the rel error is 0.71348 and the xerror is 0.76028. A short sketch of this computation is given below.
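A sketch of how those error rates can be derived from the fitted tree (dtree is the object summarised above;
the root-node-error formula assumes an unweighted anova tree):

## Re-substitution and cross-validated error for row 3 of the CP table
cp_tab   <- dtree$cptable
root_err <- dtree$frame$dev[1] / dtree$frame$n[1]   # root node error

resub_err <- root_err * cp_tab[3, "rel error"]      # error on the training sample
cv_err    <- root_err * cp_tab[3, "xerror"]         # cross-validated error
c(resubstitution = resub_err, cross_validated = cv_err)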

The result of the linear regression is explained below:

>summary(lm) ------------------------- Linear Regression

Call: glm(formula = train_scale$Target.Variable ~ ., data = train_scale)

Deviance Residuals:

Min 1Q Median 3Q Max

-173.20 -8.82 -2.88 4.33 1280.70

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 7.32289 0.15983 45.818 < 2e-16 ***

Page.Popularity.likes -1.07511 0.20714 -5.190 2.11e-07 ***

Page.Checkins -0.74871 0.16276 -4.600 4.24e-06 ***

Page.talking.about 3.12843 0.22157 14.120 < 2e-16 ***

Page.Category -0.58454 0.16290 -3.588 0.000333 ***

CC1 11.14411 0.17213 64.741 < 2e-16 ***

Base.time -8.42235 0.16029 -52.545 < 2e-16 ***

Post.length 0.08752 0.15994 0.547 0.584269

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Null deviance: 51588873 on 32759 degrees of freedom

Residual deviance: 42825449 on 32757 degrees of freedom

AIC: 400927

Number of Fisher Scoring iterations: 2

OBSERVATIONS:

The Std. Error column is the standard error of each estimated coefficient; it measures how much the
estimate is expected to deviate across samples (it is derived from the residuals, the deviations of the
observations from the regression line). The t value is the coefficient estimate divided by its standard error,
and it measures how far the estimated coefficient is from zero.

Pr(>|t|) is the p-value from the hypothesis test associated with the t value. It is compared against the
significance level alpha, conventionally 0.05: if the p-value is below 0.05 the coefficient is unlikely to be
zero under the null hypothesis and the predictor is considered statistically significant; if it is above 0.05 we
fail to reject the null hypothesis that the coefficient is zero.

Least Absolute Shrinkage and Selection Operator (LASSO) creates a regression model penalized with the
L1-norm, the sum of the absolute coefficients, which has the effect of shrinking the coefficients. The graph
for the LASSO model fitted on the Facebook data resembles a sine curve, but not exactly, because of the
noise the model picks up from the data; instead of settling, the curve keeps increasing. The LASSO model
therefore does not perform well here and shows results similar to the ridge regression model, so it is not a
good choice for this problem.

The RMSE values for the different regression techniques clearly show that KNN gives the least error
compared to the other algorithms; a sketch of this comparison is given below.
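The report tabulates RMSE per model but does not show the computation, so the following sketch reuses
the objects from the earlier hedged code (rmse() from Section 4, the fitted models from Section 7 and the
train/test split from Section 6.5):

## Hold-out RMSE for each model, smallest error first
x_test <- model.matrix(Target.Variable ~ ., data = test_data)[, -1]

rmse_table <- data.frame(
  model = c("Linear", "KNN", "Decision Tree", "Random Forest", "LASSO"),
  rmse  = c(
    rmse(test_data$Target.Variable, predict(lm_fit,  newdata = test_data)),
    rmse(test_data$Target.Variable, predict(knn_fit, newdata = test_data)),
    rmse(test_data$Target.Variable, predict(dtree,   newdata = test_data)),
    rmse(test_data$Target.Variable, predict(rf_fit,  newdata = test_data)),
    rmse(test_data$Target.Variable,
         as.numeric(predict(lasso_cv, newx = x_test, s = "lambda.min")))
  )
)
rmse_table[order(rmse_table$rmse), ]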

The graph shows the KNN RMSE for different K values. In this algorithm we can iterate the model over any
number of K values, which helps us pick the K value that gives the minimum error rate; here that is K = 5,
as shown in the graph below. Since KNN does not require model parameters to be specified, it takes the
whole dataset and works on it directly; the weighting of variables happens within the model, much as
LASSO and ridge handle variable selection internally. The high-influence predictors, together with the
specified K value, give the prediction that best matches the K nearest neighbours. A sketch of inspecting
the RMSE per K value is given below.
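If the KNN model is tuned with caret as in the earlier sketch, the per-K RMSE values used for such a graph
are available directly from the fitted object (an assumption; the report does not say how the K search was
run):

## Cross-validated RMSE for each candidate K and the best K found
plot(knn_fit$results$k, knn_fit$results$RMSE, type = "b",
     xlab = "K", ylab = "Cross-validated RMSE")
knn_fit$bestTune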

The 3D plot represents the comment values with respect to the two main features (base time and page
category), i.e. it shows the comments for each page against time (hours). The colour gradient from red to
black represents the ordering of the data: the first values are drawn in red and the last in black. With
respect to the base time of each post within its page category, the comments are predicted accordingly;
the interaction of these two parameters plays the major role here. A hedged plotting sketch is given below.

By selecting a different base date/time at random for each post, with different variants, we can choose the
settings that give good results. The plot of the possible number of comments for each post category shows
that page category 24 received the maximum number of comments in the next H hours among all
categories. The elapsed time describes the efficiency of the model; it covers the time taken to train the
data and to run the regression process, concluding with the validation on the test cases.
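The 3D view described above could be reproduced along the following lines; the scatterplot3d package is
an assumed choice, as the report does not name the plotting library used:

## 3D scatter of comments against base time and page category
library(scatterplot3d)

scatterplot3d(x = test_data$Base.time,
              y = test_data$Page.Category,
              z = test_data$Target.Variable,
              pch = 16, color = "red",
              xlab = "Base time (hours)", ylab = "Page category",
              zlab = "Comments")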

9. Model Results and Conclusion:

9.1 Models comparison

The regression modelling for the Facebook dataset concludes with the KNN regression technique, a
non-parametric model which gives more accurate prediction results than the other algorithms. We also
compare the RMSE results across the algorithms: linear regression, KNN, random forest (RF), decision
tree (DT), LASSO and ridge. The RMSE for KNN is lower than for the others, which clearly shows that
KNN works better.

9.2 Conclusion

Our investigation has revealed that much of the comment volume of a post is determined by the
features of that post’s Facebook page and is relatively unrelated to intrinsic features of the post. In
particular, the number of posts on that page in the preceding 24 hours and the number of post shares
largely predicts the amount of comments a post will receive. Among features that can be controlled by
the user, the character length of a post and the day of posting are the most predictive, but their relative
importance is small when compared to the page features. With that said, future work could be
performed to examine the effect of promoting Facebook posts to see if such actions lead to greater
comment volume. Such an approach would help determine if Facebook post promotions are effective in
increasing the exposure of a post.

Appendix:

Title                   Artifact/Location                                                                   Remarks
Source of Data          https://olympus.greatlearning.in/courses/4012/files/459750?module_item_id=265652
List of Variables       Data Dictionary.docx                                                                Variable description and rationale behind selection
R Code for Reference    R File for Facebook
