
SCHOOL OF COMPUTER SCIENCE AND

ENGINEERING

SOCIAL AND INFORMATION NETWORKS


J-COMPONENT REPORT

“Analysis and Prediction of IPL Data and Twitter Sentiment Analysis using Turi Create”

Submitted to –

Prof. Punitha K

Submitted by –

ANSHUL RAJ – 17BCE1274 (araj101099@gmail.com)


TANISHKA KUKREJA – 17BCE1266 (tashukuku24@gmail.com)

November 2020
ACKNOWLEDGEMENT

The success and final outcome of this project required a great deal of guidance and assistance from
many people, and we are extremely privileged to have received it throughout the completion of our
project. All that we have done was possible only because of such supervision and assistance, and we
will not forget to thank them. We respect and thank Prof. Punitha K for giving us the opportunity
to do this project work at VIT Chennai and for providing all the support and guidance that enabled
us to complete the project on time. We are extremely thankful to her for such kind support and
guidance despite her busy schedule managing academic affairs.

Yours Sincerely

Tanishka Kukreja(17BCE1266)

Anshul Raj(17BCE1274)
ABSTRACT

Social media is one of the platforms most people use to express their feelings, thoughts,
suggestions and opinions via blog posts, status updates, websites, forums, online discussion
groups, etc. As a result, a large volume of data is generated every day. Especially when sports
such as cricket and football are played, many discussions take place on social media platforms
like Twitter and on dedicated forums, within their restricted character limits. The opinions
expressed by a huge population may use different styles and notations, and may carry different
polarities: positive, negative, or a mix of both, regarding the current trend. Simply reading
each opinion and drawing a conclusion is therefore difficult and time-consuming, because the
tweets collected from Twitter contain a lot of unwanted information that burdens the process.
Hence, we need an intelligent system that retrieves tweets from Twitter, analyses them
systematically, and draws an accurate result based on their positivity and negativity. Data
science and analytics are about finding valuable insights in a given dataset; in short, finding
answers that could help a business. In this report, we show how to get started with data
analysis in Python. The Python packages we use are numpy, pandas, matplotlib, and Turi Create.
Cricket is a sport that generates a lot of statistical data: batting records, bowling records,
individual player records, scorecards of different matches, and so on. This data can be put to
proper use to predict the results of games, which makes the task an interesting problem in
today’s world.
INTRODUCTION
Social network analysis (SNA) is the process of investigating social structures through the
use of networks and graph theory. It characterizes networked structures in terms
of nodes (individual actors, people, or things within the network) and the ties, edges,
or links (relationships or interactions) that connect them. Examples of social
structures commonly visualized through social network analysis include social media
networks, meme spread, information circulation, friendship and acquaintance networks,
business networks, knowledge networks, difficult working relationships, social
networks, collaboration graphs, kinship, disease transmission, and sexual relationships. These
networks are often visualized through sociograms in which nodes are represented as points
and ties are represented as lines. These visualizations provide a means of qualitatively
assessing networks by varying the visual representation of their nodes and edges to reflect
attributes of interest.
Twitter analysis and IPL match prediction using various algorithms form a project that combines
concepts from Big Data, machine learning, and social and information networks.
The major characteristics and challenges of big data are often defined as the ‘V’s’ of big data;
five of them are described below: Volume, Velocity, Variety, Veracity and Value.
Volume: the amount of data generated from different sources. It is growing rapidly: a text file
may hold a few KB of data, an audio file a few MB, and a video file a few GB. According to a
2013 study, around 4.4 zettabytes of data had been generated by then, and this was estimated to
grow to 44 zettabytes by 2020.
Velocity: the rate at which data is generated from different sources. A well-known example from
Twitter is the topic “2014 World Cup, Germany’s victory against Argentina”, during which 618,725
tweets were posted per minute, a record at the time. This illustrates the speed of data
generation and delivery.
Variety: data today exists in many formats, classified mainly into three types: structured,
unstructured, and semi-structured. Structured data has a specified structure and can easily be
stored in a relational database, whereas unstructured and semi-structured data do not follow any
predefined structure and need considerable time and effort to process.
Veracity: the noise, inconsistency, and abnormality of data generated from different sources.
This characteristic is one of the biggest challenges, even compared to velocity and volume.
Value: the final and most important characteristic of big data. Data is useless until we are
able to access it and turn it into valuable information. A large volume of data contains
valuable information that we cannot see directly; it is hidden.
Twitter, one of the most famous micro-blogging sites, allows registered users to share their
feelings, ideas, opinions, and thoughts in a short, restricted number of characters. These
messages are called tweets. Sentiment analysis is a growing research field that allows us to
analyse the sentiment and feeling present in text messages and draw accurate conclusions.

BACKGROUND STUDY

1. COMPREHENSIVE DATA ANALYSIS AND PREDICTION ON IPL USING


MACHINE LEARNING ALGORITHMS
By Amala Kaviya V.S., Amol Suraj Mishra and Valarmathi B

A detailed analysis of the complete IPL dataset and visualization of the various features
necessary for IPL evaluation is performed. Many machine learning algorithms are used to
compare teams and predict the winner between any two of them. A few existing models try to
rank players, either based on simple formulae or on mathematical models.

2. ANALYSIS AND PREDICTION OF SENTIMENTS FOR CRICKET TWEETS


By Bharati S. Kannolli, Prabhu R. Bevinmarad

With ever-increasing technologies and inventions, data is also increasing tremendously; this
is called “Big Data”, a term applied to volumes of data so large that traditional systems
struggle to process them within a specified amount of time.
REQUIREMENTS ANALYSIS

 SOFTWARE REQUIREMENTS

 Jupyter Notebook - The Jupyter Notebook is an incredibly


powerful tool for interactively developing and presenting data science
projects. A notebook integrates code and its output into a single
document that combines visualisations, narrative text, mathematical
equations, and other rich media. The intuitive workflow promotes
iterative and rapid development, making notebooks an increasingly
popular choice at the heart of contemporary data science, analysis, and
increasingly science at large. Best of all, as part of the open source
Project Jupyter, they are completely free.

 NumPy, Pandas - NumPy is the core library for scientific
computing in Python. It provides a high-performance multidimensional
array object and tools for working with these arrays, along with linear
algebra routines. NumPy arrays essentially come in two
flavors: vectors and matrices. Vectors are strictly 1-d arrays,
whereas matrices are 2-d; a matrix is still 2-d even if it has only one
row or column.

A Pandas Series object is created using the pd.Series function. Each row is
given an index, which by default is assigned numerical values starting
from 0. Like NumPy, Pandas also provides basic mathematical
functionality such as addition, subtraction, conditional operations, and
broadcasting. A Pandas DataFrame object represents a spreadsheet with
cell values, column names, and row index labels, and can be visualized as
a dictionary of Series.
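A minimal sketch of the array, Series, and DataFrame behaviour described above (the column names and values here are invented for illustration):

```python
import numpy as np
import pandas as pd

# A 1-d NumPy array (a vector) and a 2-d array (a matrix with one column)
vec = np.array([1, 2, 3])
mat = vec.reshape(3, 1)          # still 2-d, even with a single column

# A Series gets a default integer index starting at 0
runs = pd.Series([45, 12, 78])

# Broadcasting: add 10 to every entry at once
adjusted = runs + 10

# A DataFrame behaves like a dictionary of Series
df = pd.DataFrame({"Team": ["CSK", "MI", "KKR"], "Wins": [5, 7, 4]})
print(df.shape)                  # (3, 2)
```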

 Turi Create - Turi Create is an open source toolset for creating
Core ML models, for tasks such as image classification, object
detection, style transfer, recommendations, and more.

 TextBlob - TextBlob is a Python (2 and 3) library for processing


textual data. It provides a simple API for diving into common natural
language processing (NLP) tasks such as part-of-speech tagging, noun
phrase extraction, sentiment analysis, classification, translation, and
more.
METHODOLOGY

Basic steps involved in the project:

1. Understand the dataset.


2. Clean the data.
3. Analyse the candidate columns to be Features.
4. Process the features as required by the model/algorithm.
5. Train the model/algorithm on training data.
6. Test the model/algorithm on testing data.
7. Tune the model/algorithm for higher accuracy.
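Steps 2 to 6 above can be sketched end to end on a toy frame. This is only an illustration: the column names are invented stand-ins for the real features taken from matches.csv, and the split is done by hand with NumPy rather than with a library helper:

```python
import numpy as np
import pandas as pd

# Steps 1-2: load and clean a toy dataset (drop rows with a missing winner)
df = pd.DataFrame({
    "Toss_Winner":  [1, 2, 1, 2, 1, 2, 1, 2],
    "Venue_Id":     [3, 1, 3, 2, 1, 2, 3, 1],
    "Match_Winner": [1, 2, 1, 2, 1, np.nan, 1, 2],
}).dropna()

# Steps 3-4: choose the feature columns and the label
X = df[["Toss_Winner", "Venue_Id"]].to_numpy()
y = df["Match_Winner"].to_numpy()

# Steps 5-6: split into training and testing sets (75/25)
rng = np.random.default_rng(0)
idx = rng.permutation(len(df))
cut = int(0.75 * len(df))
train_idx, test_idx = idx[:cut], idx[cut:]
X_train, y_train = X[train_idx], y[train_idx]
X_test,  y_test  = X[test_idx],  y[test_idx]
```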
DESIGN

1. DATASET GENERATION - The data was collected from Kaggle
(http://kaggle.com). The website has data about the seasons, from 2008 to 2016,
of the domestic Twenty20 tournament held in India, the Indian Premier League.
In this format each team bats or bowls for a maximum of 20 overs, and the
result of the game is decided at the end of the 40 overs played in total.
The dataset has several parts, such as players.csv, matches.csv,
teams.csv, and season.csv.

Matches.csv (577 rows x 13 columns)

Teams.csv
Season.csv
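The split files described above would typically be loaded and joined along their id columns. A sketch with pandas (file contents are inlined stand-ins here; the real column names in the Kaggle files may differ):

```python
import io
import pandas as pd

# Stand-ins for matches.csv and teams.csv (the real files come from Kaggle)
matches_csv = io.StringIO("Match_Id,Team_1,Team_2,Match_Winner\n1,2,3,2\n2,1,2,1\n")
teams_csv = io.StringIO("Team_Id,Team_Name\n1,KKR\n2,RCB\n3,CSK\n")

matches = pd.read_csv(matches_csv)
teams = pd.read_csv(teams_csv)

# Attach the winning team's name by joining on the team id
merged = matches.merge(teams, left_on="Match_Winner", right_on="Team_Id", how="left")
print(merged[["Match_Id", "Team_Name"]])
```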

2. DATA MODELLING –

Data modelling is the process of creating a data model for the data to be stored
in a database. This model is a conceptual representation of the data objects,
the associations between different data objects, and the rules.

After data modelling, the final dataset we obtained is given below; the process
is described in the implementation section of the report:
3. FEATURE SELECTION – Extracting the important features and attributes and
discarding the unimportant ones comes under feature selection.
Example: matches.csv has the following attributes:
Match_Id int64
Team_1 object
Team_2 object
Match_Date object
Season_Id object
Venue_Id object
Toss_Winner object
Toss_Decide object
Win_Type object
Win_Margin float64
Outcome_type int64
Match_Winner object
Man_of_the_Match float64
dtype: object

Of these attributes, we will discard Outcome_type and Man_of_the_Match.
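Discarding the two columns named above is a one-liner in pandas (the frame here is a toy stand-in with the same column names as matches.csv):

```python
import pandas as pd

# Toy frame with a subset of the matches.csv columns
matches = pd.DataFrame({
    "Match_Id": [1, 2],
    "Toss_Winner": ["KKR", "CSK"],
    "Outcome_type": [1, 1],
    "Match_Winner": ["KKR", "CSK"],
    "Man_of_the_Match": [10.0, 22.0],
})

# Drop the unimportant features; errors="ignore" tolerates absent columns
features = matches.drop(columns=["Outcome_type", "Man_of_the_Match"], errors="ignore")
print(list(features.columns))   # ['Match_Id', 'Toss_Winner', 'Match_Winner']
```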

4. ALGORITHM SELECTION – The algorithms we will be using are:

For IPL Prediction:


 Random Forest –
Random forest is a flexible, easy-to-use machine learning algorithm that produces
a great result most of the time, even without hyper-parameter tuning. It is also
one of the most used algorithms because of its simplicity and versatility: it can
be used for both classification and regression tasks.
 Decision Tree –
A decision tree is a decision support tool that uses a tree-like model of decisions
and their possible consequences, including chance event outcomes, resource
costs, and utility. It is one way to display an algorithm that only contains
conditional control statements.
Decision trees are commonly used in operations research, specifically in decision
analysis, to help identify a strategy most likely to reach a goal, but are also a
popular tool in machine learning.

 Boosted Trees –
The model in supervised learning usually refers to the mathematical structure by
which the prediction yᵢ is made from the input xᵢ. A common example is
a linear model, where the prediction is given as ŷᵢ = Σⱼ θⱼ xᵢⱼ, a linear
combination of weighted input features. The prediction value can have different
interpretations depending on the task, i.e., regression or classification.
Boosted trees build such a model as an ensemble of decision trees added one at a
time, each new tree correcting the errors of the previous ones.
 Logistic Model –
Logistic regression is a statistical model that, in its basic form, uses a logistic
function to model a binary dependent variable, although many more complex
extensions exist. In regression analysis, logistic regression (or logit
regression) estimates the parameters of a logistic model (a form of binary
regression).
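The four algorithms above can be trained and compared side by side. The report's own implementation uses Turi Create; the sketch below shows the analogous comparison with scikit-learn on synthetic data standing in for the encoded match features, so the accuracies it prints are not the report's results:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded match features and winner labels
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Boosted Trees": GradientBoostingClassifier(random_state=0),
    "Logistic": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, model.predict(X_te)))
```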

For Twitter Data:

TextBlob, described in the requirements section, provides the sentiment
analysis: each cleaned tweet is scored for polarity and classified as
positive, neutral, or negative.

5. TRAINING THE DATA

Training data is the data you use to train an algorithm or machine learning model
to predict the outcome you designed the model to predict. If you are using
supervised learning, or a hybrid approach that includes it, your data will be
enriched with labels or annotations.

6. TESTING THE DATA


The data is tested after training with the help of several algorithms, and the
accuracy is noted.

7. ANALYSING THE RESULTS

The results are then analysed by plotting different graphs to visualize the data.
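A sketch of the kind of graph used in this step, assuming matplotlib; the accuracy values are the ones reported in the Result section of this report:

```python
import matplotlib
matplotlib.use("Agg")            # render off-screen, no display needed
import matplotlib.pyplot as plt

# Accuracy per model, taken from the Result and Conclusion section
accuracies = {"Random Forest": 57.9, "Decision Tree": 57.14,
              "Boosted Trees": 61.9, "Logistic": 46.04}

fig, ax = plt.subplots()
ax.bar(list(accuracies.keys()), list(accuracies.values()))
ax.set_ylabel("Accuracy (%)")
ax.set_title("Model accuracy on the IPL test set")
fig.savefig("accuracy.png")
```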

IMPLEMENTATION

IMPLEMENTATION FOR IPL DATASET:

 Data modelling to create a final dataset with selected features:


team_map = {1: 'KKR', 2: 'RCB', 3: 'CSK', 4: 'KXIP', 5: 'RR', 6: 'DD', 7: 'MI',
            8: 'DCH', 9: 'KTK', 10: 'PW', 11: 'SRH', 12: 'RPS', 13: 'GL'}
matches["Team_1"].replace(team_map, inplace=True)
matches["Team_2"].replace(team_map, inplace=True)
matches["Toss_Winner"].replace(team_map, inplace=True)
matches["Match_Winner"].replace({float(k): v for k, v in team_map.items()}, inplace=True)
OUTCOME:
The outcome of this implementation is the final dataset which we get after combining
several datasets and dropping some features.
Matches.csv

 Implementing several algorithms:


OUTCOME:
 Twitter Analysis

import re

import tweepy
from textblob import TextBlob
from tweepy import OAuthHandler


class TwitterClient(object):
    '''
    Generic Twitter class for sentiment analysis.
    '''
    def __init__(self):
        '''
        Class constructor or initialization method.
        '''
        # keys and tokens from the Twitter Dev Console
        consumer_key = 'xJ0VdLiKu0YhEJmtz9pMoGufm'
        consumer_secret = 'i0d2yKjnBsmsry1amQ75myDUDllr3JYWX5PsUoJwgLO4K8i1Qq'
        access_token = '1088038375368445952-nyq820FPiAR4cKmxfw4AgsIO9578SM'
        access_token_secret = 'ZMSozcQ1XsWm1CPLA4Euw0lvVpQVhB7TTm8FtXqgeYcU5'

        # attempt authentication
        try:
            # create OAuthHandler object
            self.auth = OAuthHandler(consumer_key, consumer_secret)
            # set access token and secret
            self.auth.set_access_token(access_token, access_token_secret)
            # create tweepy API object to fetch tweets
            self.api = tweepy.API(self.auth)
        except tweepy.TweepError:
            print("Error: Authentication Failed")

    def clean_tweet(self, tweet):
        '''
        Utility function to clean tweet text by removing links and special
        characters using simple regex statements.
        '''
        return ' '.join(re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)",
                               " ", tweet).split())

    def get_tweet_sentiment(self, tweet):
        '''
        Utility function to classify the sentiment of a passed tweet
        using TextBlob's sentiment method.
        '''
        # create TextBlob object of passed tweet text
        analysis = TextBlob(self.clean_tweet(tweet))
        # set sentiment
        if analysis.sentiment.polarity > 0:
            return 'positive'
        elif analysis.sentiment.polarity == 0:
            return 'neutral'
        else:
            return 'negative'

    def get_tweets(self, query, count=10):
        '''
        Main function to fetch tweets and parse them.
        '''
        # empty list to store parsed tweets
        tweets = []

        try:
            # call Twitter API to fetch tweets
            fetched_tweets = self.api.search(q=query, count=count)
            # parse tweets one by one
            for tweet in fetched_tweets:
                # dictionary to store the required parameters of a tweet
                parsed_tweet = {}
                # save text of tweet
                parsed_tweet['text'] = tweet.text
                # save sentiment of tweet
                parsed_tweet['sentiment'] = self.get_tweet_sentiment(tweet.text)

                # append parsed tweet to tweets list
                if tweet.retweet_count > 0:
                    # if tweet has retweets, ensure that it is appended only once
                    if parsed_tweet not in tweets:
                        tweets.append(parsed_tweet)
                else:
                    tweets.append(parsed_tweet)

            # return parsed tweets
            return tweets

        except tweepy.TweepError as e:
            # print error (if any)
            print("Error : " + str(e))
RISK ANALYSIS

We faced many problems while doing the project. Risk analysis is the process of assessing
the likelihood of an adverse event occurring within the corporate, government, or
environmental sector. Risk can be analysed using several approaches, including those that
fall under the categories of quantitative and qualitative analysis.
We faced problems while implementing the different models, and maintaining accuracy was
difficult. Combining the Twitter sentiment analysis results with the IPL prediction results
was problematic, because only the most recent 1000 tweets, from the 2020 IPL series, were
available, whereas the IPL prediction was trained on data from 2008 to 2016.
RESULT AND CONCLUSION

The analysis and prediction of the cricket data and the Twitter sentiment analysis were
successful, and we implemented several algorithms with different accuracies. The accuracy
we obtained for each model was:

 Random Forest – 57.9%


 Decision Tree – 57.14%
 Boosted Trees – 61.9%
 Logistic model – 46.04%

The highest accuracy was achieved for the Boosted Trees algorithm.
FUTURE WORK

There is some future work that can be done to improve this project.
• The dataset can include external factors such as player injury, player fatigue,
winning streak with a particular team, overall winning streak, and the average runs a team
has scored against a particular opponent in previous matches; on the basis of these data we
can redo the prediction and check whether the accuracy improves.
• The project currently contains no web/mobile application or UI. A web/mobile
application could be built that takes the entire dataset as input and writes the
prediction result for each instance to a PDF or text file.
REFERENCES
 Dataset: http://kaggle.com
 Haghighat, Maral, Hamid Rastegari, and Nasim Nourafza. "A review of data mining
techniques for result prediction in sports." Advances in Computer Science: An International
Journal 2.5 (2013): 7-12.
 http://www.ijirset.com/upload/2019/april/69_Analyzing.pdf
 https://en.wikipedia.org/wiki/Data_analysis
 https://www.irjet.net/archives/V4/i10/IRJET-V4I10175.pdf
 https://en.wikipedia.org/wiki/Social_network_analysis
 https://towardsdatascience.com/predicting-ipl-match-winner-fc9e89f583ce
