ENGINEERING
Submitted to –
Prof. Punitha K
Submitted by –
Nov, 2020
ACKNOWLEDGEMENT
The success and final outcome of this project required a lot of guidance and assistance from
many people, and we are extremely privileged to have received it throughout the completion of
our project. All that we have done is due to such supervision and assistance, and we would
not forget to thank them. We respect and thank Prof. Punitha K for providing us an
opportunity to do the project work at VIT Chennai and for giving us all the support and
guidance that enabled us to complete the project duly. We are extremely thankful to her for
providing such support and guidance despite her busy schedule managing academic
affairs.
Yours Sincerely
Tanishka Kukreja(17BCE1266)
Anshul Raj(17BCE1274)
ABSTRACT
Social media is one of the platforms that most people use to express their feelings, thoughts,
suggestions and opinions via blog posts, status updates, websites, forums, online discussion
groups, etc. Because of these facilities, a large volume of data is generated every day.
Especially when sports like cricket, soccer and football are played, many discussions take
place on social media such as Twitter and on the respective forums, within their restricted
word limits. The opinions expressed by such a huge population may come in different forms and
notations, and they may carry different polarity: positive, negative, or both, regarding the
current trend. Simply looking at each opinion and drawing a conclusion is therefore very
difficult and time consuming, because the opinions/tweets collected from Twitter contain a lot
of unwanted information that burdens the process. Hence, we need an intelligent system to
retrieve tweets from Twitter, analyse them systematically and draw an accurate result based on
their positivity and negativity. Data Science / Analytics is all about finding valuable
insights from the given dataset; in short, finding answers that could help a business. In this
project, we show how to get started with data analysis in Python. The Python packages that we
use in the notebook are numpy, pandas, matplotlib, and Turicreate. Cricket is a sport that
contains a lot of statistical data. There is data about batting records, bowling records,
individual player records, scorecards of different matches played, etc. This data can be put to
proper use to predict the results of games, and so this has become an interesting problem in
today’s world.
INTRODUCTION
Social network analysis (SNA) is the process of investigating social structures through the
use of networks and graph theory. It characterizes networked structures in terms
of nodes (individual actors, people, or things within the network) and the ties, edges,
or links (relationships or interactions) that connect them. Examples of social
structures commonly visualized through social network analysis include social media
networks, memes spread, information circulation, friendship and acquaintance networks,
business networks, knowledge networks, difficult working relationships, social
networks, collaboration graphs, kinship, disease transmission, and sexual relationships. These
networks are often visualized through sociograms in which nodes are represented as points
and ties are represented as lines. These visualizations provide a means of qualitatively
assessing networks by varying the visual representation of their nodes and edges to reflect
attributes of interest.
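The node-and-tie description above can be sketched in a few lines of Python. This is a minimal illustration, not the project's own code: the account names and ties are made up, and the helper names `build_graph` and `degree` are hypothetical.

```python
from collections import defaultdict

# Hypothetical ties: each pair is an interaction between two accounts.
edges = [
    ("fan_a", "team_x"),
    ("fan_b", "team_x"),
    ("fan_c", "team_x"),
    ("fan_c", "team_y"),
]


def build_graph(edge_list):
    """Return an undirected adjacency map: node -> set of neighbouring nodes."""
    graph = defaultdict(set)
    for u, v in edge_list:
        graph[u].add(v)
        graph[v].add(u)
    return dict(graph)


def degree(graph):
    """Count the ties of each node, a simple measure of how connected it is."""
    return {node: len(neighbours) for node, neighbours in graph.items()}


graph = build_graph(edges)
degrees = degree(graph)
# The node with the most ties would be drawn as the hub of a sociogram.
most_connected = max(degrees, key=degrees.get)
print(most_connected, degrees[most_connected])
```

In a sociogram, the degree computed here is one of the attributes that could be mapped to the size or colour of a node.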
This project combines Twitter analysis with the use of various algorithms to predict IPL
matches, and it draws on concepts from Big Data, Machine Learning, and Social and
Information Networks.
The major characteristics and challenges of big data were originally defined as the ‘3 Vs’,
namely Volume, Velocity and Variety of data. Veracity and Value are often added to this list.
Volume: It is the amount of data generated from different sources, and it is growing rapidly.
For example, a text file may hold a few KBs of data, an audio file a few MBs and a video file
a few GBs. According to a study, around 4.4 zettabytes of data had been generated by 2013,
and this was estimated to increase to 44 zettabytes by 2020.
Velocity: It is the rate at which data is generated from different sources. The best example is
Twitter activity on the topic “2014 world cup, Germany’s victory against Argentina”: 618,725
tweets were posted per minute, which was a record at the time. This indicates the speed of
data generation and delivery.
Variety: Data today exists in many formats, which are classified mainly into three types:
structured, unstructured and semi-structured. Structured data has a specified structure and
can easily be stored in a relational database, whereas unstructured and semi-structured data
does not follow a predefined structure and needs more time and energy to process.
Veracity: It refers to the noise, ambiguity and abnormality in data generated from different
sources. This characteristic is one of the biggest challenges when compared to
velocity and volume.
Value: The final and most important characteristic of big data is value. Data remains
useless until we are able to access it and turn it into valuable information. A large volume
of data contains valuable information that cannot be seen directly; it is hidden.
Twitter, one of the best-known microblogging sites, allows registered users to share their
feelings, ideas, opinions and thoughts in a restricted number of characters. These
messages are called tweets. Sentiment analysis is a growing research field
that allows us to analyse the sentiments and feelings present in text messages and
draw accurate conclusions.
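As a rough illustration of how a tweet can be cleaned and given a polarity, the sketch below strips mentions, links and punctuation and then counts words from two tiny word lists. The word lists and the function names `clean_tweet` and `polarity` are assumptions made for this example; a lexicon count is a simple stand-in and is not necessarily the classifier used in the project.

```python
import re

# Tiny illustrative word lists (assumed); a real system needs a full lexicon
# or a trained classifier.
POSITIVE = {"good", "great", "win", "brilliant", "love"}
NEGATIVE = {"bad", "poor", "lose", "terrible", "hate"}


def clean_tweet(text):
    """Remove mentions, links and non-letter characters, then lowercase."""
    text = re.sub(r"@\w+|https?://\S+", " ", text)
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    return " ".join(text.lower().split())


def polarity(text):
    """Label a tweet by counting positive words minus negative words."""
    words = clean_tweet(text).split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for tweet in ["@fan Great win by CSK! https://t.co/x",
              "Terrible bowling, bad day",
              "Match starts at 7pm"]:
    print(polarity(tweet), "|", clean_tweet(tweet))
```

Cleaning first matters because mentions, links and hashtags are exactly the kind of unwanted information that would otherwise be counted as words.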
BACKGROUND STUDY
A detailed analysis of the complete IPL dataset and visualization of various features
necessary for IPL evaluation is performed. Many machine learning algorithms have
been used to compare and predict the winner between any two teams. Few models
exist that try to rank players either based on simple formulae or based on few
mathematical models.
As technology advances, the amount of data generated is also increasing tremendously;
this is referred to as “Big Data”. The term is applied to volumes of data so large that
traditional systems find it difficult to process them within a specified amount of time.
REQUIREMENTS ANALYSIS
SOFTWARE REQUIREMENTS
Teams.csv
Season.csv
2. DATA MODELLING –
After data modelling, the final dataset we obtained is given below; the process for
obtaining it is described in the implementation section of the report:
3. FEATURE SELECTION – Extracting the important features and attributes and
discarding the unimportant ones comes under feature selection.
Example: matches.csv has the following attributes:
Match_Id int64
Team_1 object
Team_2 object
Match_Date object
Season_Id object
Venue_Id object
Toss_Winner object
Toss_Decide object
Win_Type object
Win_Margin float64
Outcome_type int64
Match_Winner object
Man_of_the_Match float64
dtype: object
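Feature selection on such a table can be sketched as keeping a chosen subset of columns. The example below uses only the standard library and two made-up rows; the choice of columns in `KEEP` is an assumption for illustration, not the selection actually used in the project, and `select_features` is a hypothetical helper.

```python
import csv
import io

# Two made-up rows using a subset of the column names listed above.
SAMPLE = """Match_Id,Team_1,Team_2,Toss_Winner,Toss_Decide,Win_Margin,Match_Winner
1,Team A,Team B,Team A,bat,12,Team A
2,Team B,Team C,Team C,field,5,Team B
"""

# Columns assumed (for illustration only) to be kept: predictors plus the target.
KEEP = ["Team_1", "Team_2", "Toss_Winner", "Toss_Decide", "Match_Winner"]


def select_features(csv_text, keep):
    """Read CSV text and return rows containing only the columns in `keep`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{col: row[col] for col in keep} for row in reader]


rows = select_features(SAMPLE, KEEP)
for row in rows:
    print(row)
```

With pandas, the same step is usually a single column selection on the DataFrame; the plain-Python version is shown so that the example runs without extra packages.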
Boosted Trees –
The model in supervised learning usually refers to the mathematical structure
by which the prediction $y_i$ is made from the input $x_i$. A common example is
a linear model, where the prediction is given as $\hat{y}_i = \sum_j \theta_j x_{ij}$, a linear
combination of weighted input features. The prediction value can have different
interpretations, depending on the task, i.e., regression or classification.
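The linear model above can be worked through numerically. In this sketch the weights and feature values are hypothetical; the weighted sum is read directly as the output for regression, and passed through the logistic function to obtain a probability for classification.

```python
import math


def linear_score(theta, x):
    """Weighted sum of the input features: sum over j of theta_j * x_j."""
    return sum(t * xj for t, xj in zip(theta, x))


def sigmoid(z):
    """Logistic function, mapping any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


theta = [0.5, -1.0, 2.0]  # hypothetical weights
x_i = [2.0, 1.0, 0.5]     # hypothetical feature values for one match

score = linear_score(theta, x_i)  # regression: the score is the prediction
prob = sigmoid(score)             # classification: probability of the positive class
print(score, round(prob, 4))
```

Here the score is 0.5·2.0 − 1.0·1.0 + 2.0·0.5 = 1.0, and the logistic function turns it into a probability of about 0.73.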
Logistic Model –
The results are then analysed with the help of plotting different graphs by
visualization of data.
IMPLEMENTATION
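import os

import tweepy
from tweepy import OAuthHandler


class TwitterClient(object):
    '''
    Generic Twitter Class for sentiment analysis.
    '''
    def __init__(self):
        '''
        Class constructor or initialization method.
        '''
        # keys and tokens from the Twitter Dev Console, read from the
        # environment instead of being hard-coded (variable names are assumed)
        consumer_key = os.environ['TWITTER_CONSUMER_KEY']
        consumer_secret = os.environ['TWITTER_CONSUMER_SECRET']
        access_token = os.environ['TWITTER_ACCESS_TOKEN']
        access_token_secret = os.environ['TWITTER_ACCESS_TOKEN_SECRET']
        # attempt authentication
        try:
            # create OAuthHandler object
            self.auth = OAuthHandler(consumer_key, consumer_secret)
            # set access token and secret
            self.auth.set_access_token(access_token, access_token_secret)
            # create tweepy API object to fetch tweets
            self.api = tweepy.API(self.auth)
        except Exception:
            print("Error: Authentication Failed")

    # The method header is missing from the listing; the name get_tweets and
    # the default count are assumed here so that query and count are defined.
    def get_tweets(self, query, count=10):
        tweets = []
        try:
            # call twitter api to fetch tweets
            # (tweepy 3.x; tweepy 4 renames this call to search_tweets)
            fetched_tweets = self.api.search(q=query, count=count)
            # parsing tweets one by one
            for tweet in fetched_tweets:
                # empty dictionary to store required params of a tweet
                parsed_tweet = {}
                # storing the tweet text is an assumed minimal completion
                parsed_tweet['text'] = tweet.text
                tweets.append(parsed_tweet)
        except tweepy.TweepError as e:
            # print error (if any); tweepy 4 uses TweepyException instead
            print("Error : " + str(e))
        return tweets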
RISK ANALYSIS
We faced many problems while doing the project. Risk analysis is the process of assessing
the likelihood of an adverse event occurring within the corporate, government, or
environmental sector. Risk can be analysed using several approaches, including those that
fall under the categories of quantitative and qualitative.
We faced problems while implementing the different models, and maintaining accuracy was
difficult. Combining the Twitter sentiment analysis results with the IPL prediction results
was problematic, because only the most recent 1000 tweets, from the 2020 IPL series, were
available, whereas the IPL prediction used data from 2008 to 2016.
RESULT AND CONCLUSION
The analysis and prediction of cricket data and the Twitter sentiment analysis were successful,
and we implemented several algorithms with different accuracies. The accuracy we obtained
for different models was:
The highest accuracy was achieved for the Boosted Trees algorithm.
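For reference, accuracy here means the fraction of matches whose winner is predicted correctly. The sketch below shows how models could be compared on that basis; the labels and the model names are made up for illustration and are not the project's results.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the actual outcomes."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up winners for five matches and the predictions of two hypothetical models.
actual = ["A", "B", "A", "A", "B"]
predictions = {
    "model_1": ["A", "B", "B", "A", "B"],
    "model_2": ["A", "A", "B", "A", "A"],
}

scores = {name: accuracy(actual, preds) for name, preds in predictions.items()}
best = max(scores, key=scores.get)
print(scores, best)
```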
FUTURE WORK
Some further work can be done to improve this project:
• The data set can include some of the external factors like player injury, player fatigue,
winning streak with a particular team, overall winning streak, average runs scored by a team
against a particular team in previous matches, etc. and on the basis of these data, we can try
to do the prediction and check to see if the accuracy improves.
• The project currently has no web/mobile application or UI. A web/mobile
application could be built that takes the entire data set as input and writes the
prediction result for each instance to a pdf or text file.
REFERENCES
Dataset: http://kaggle.com
Haghighat, Maral, Hamid Rastegari, and Nasim Nourafza. "A review of data mining
techniques for result prediction in sports." Advances in Computer Science: an International
Journal 2.5 (2013): 7-12.
http://www.ijirset.com/upload/2019/april/69_Analyzing.pdf
https://en.wikipedia.org/wiki/Data_analysis
https://www.irjet.net/archives/V4/i10/IRJET-V4I10175.pdf
https://en.wikipedia.org/wiki/Social_network_analysis
https://towardsdatascience.com/predicting-ipl-match-winner-fc9e89f583ce