
IPL Tournament Data Analysis 1

1. INTRODUCTION
1.1 Project Overview
The Indian Premier League (IPL) is one of the most popular cricket tournaments in the world. Its financial value
increases each season, its viewership has grown markedly, and the betting market for the IPL grows
significantly every year. Because cricket is a very dynamic game that changes ball by ball, bettors and bookmakers
are incentivized to bet on match results. This paper investigates machine learning
techniques for predicting cricket match results based on historical IPL match data.
Influential features of the dataset have been identified using feature selection methods including Correlation-based
Feature Selection, Information Gain (IG), Relief and Wrapper. Machine learning techniques
including Naïve Bayes, Random Forest, K-Nearest Neighbor (KNN) and Model Trees (classification via
regression) have then been adopted to generate predictive models from the distinct feature sets derived by these
methods. Two feature subsets were formulated, one based on home team advantage and the other based on the
toss decision. The selected machine learning techniques were applied to both feature sets to determine a predictive
model. Experimental tests show that tree-based models, particularly Random Forest, performed better in terms of
accuracy, precision and recall than the probabilistic and statistical models. However, on the
toss feature subset, none of the considered machine learning algorithms produced accurate
predictive models.
Cricket is a well-known sport, and with its increasing popularity and viewership, changes of format and
innovations in the tournaments played became necessary. To cater for potential future growth, the International
Cricket Council (ICC) commissioned global market research, which revealed that cricket has more than one
billion fans worldwide, with the potential for significant growth. Among all formats of cricket, the popularity of
Twenty20 Internationals (T20) was the highest with 92%, with 87% of the fans

1.2 Data Collection


If you are familiar with the concept of web scraping, you can scrape the data from the ESPNcricinfo website.
If you are not familiar with web scraping, the data is also available as an Excel file.
Here we import two datasets. The first contains match details such as match id, venue, date, the opposing
teams, toss winner and winner. The second contains ball-by-ball data, i.e. runs scored on each ball,
striker, non-striker, bowler, etc.
Before we start performing operations on the datasets, we should import the necessary libraries and then import the
datasets for analysis. Once you have the dataset with you, you will need to load it in Python. You can use the piece of code
below to load the dataset in Python:

RNSIT Department of MCA 2021-2022



Fig 1.1 Dataset Reading
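The loading step can be sketched as follows. This is a minimal sketch, not the original code from the figure: the helper name load_ipl_data, the column names and the placeholder rows are all assumptions made for illustration, and the data is read from in-memory CSV text so that the example runs on its own.

```python
import io

import pandas as pd


def load_ipl_data(matches_src, deliveries_src):
    """Read the match-level and ball-by-ball tables into two DataFrames.

    Hypothetical helper: accepts file paths or file-like objects.
    """
    matches = pd.read_csv(matches_src)
    deliveries = pd.read_csv(deliveries_src)
    return matches, deliveries


# Placeholder rows invented for illustration only (not real IPL data).
matches_csv = io.StringIO(
    "id,season,city,team1,team2,toss_winner,toss_decision,winner\n"
    "1,2008,City A,Team A,Team B,Team A,bat,Team A\n"
    "2,2008,City B,Team B,Team C,Team C,field,Team B\n"
)
deliveries_csv = io.StringIO(
    "match_id,batting_team,batsman,non_striker,bowler,total_runs\n"
    "1,Team A,Player 1,Player 2,Player 9,4\n"
    "1,Team A,Player 1,Player 2,Player 9,1\n"
    "2,Team B,Player 5,Player 6,Player 3,0\n"
)

matches, deliveries = load_ipl_data(matches_csv, deliveries_csv)
print(matches.shape, deliveries.shape)
```

With real files, the same helper would be called with the two file paths instead of the in-memory text. If the data is kept as an Excel workbook, pandas' read_excel function could take the place of read_csv.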

Once the dataset has been read, we should look at the head and tail of the dataset to make sure it is imported
correctly. The head of the dataset should look like this:

Figure 1.2 Data Head
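A quick check of the first and last rows might look like the sketch below. The small DataFrame stands in for the real match dataset; its column names and values are placeholders assumed for illustration.

```python
import pandas as pd

# Placeholder match table (invented rows, assumed column names).
matches = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "season": [2008, 2008, 2009, 2009],
        "winner": ["Team A", "Team B", "Team A", "Team C"],
    }
)

first_rows = matches.head(2)  # first two rows of the table
last_rows = matches.tail(2)   # last two rows of the table

print(first_rows)
print(last_rows)
```

Called without an argument, head() and tail() return five rows each.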


2. LITERATURE SURVEY
2.1 Library/Module Requirements
This project requires the following Python libraries and modules: pandas, NumPy, Matplotlib and
Seaborn.
Pandas: pandas is a Python package providing fast, flexible, and expressive data structures designed to
make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level
building block for doing practical, real-world data analysis in Python. Additionally, it has the broader goal of
becoming the most powerful and flexible open-source data analysis and manipulation tool available in any language.

NumPy: It is a Python library that provides a multidimensional array object, various derived objects (such
as masked arrays and matrices), and an assortment of routines for fast operations on arrays, including
mathematical, logical, shape manipulation, sorting, selecting, I/O, discrete Fourier transforms, basic linear
algebra, basic statistical operations, random simulation and much more. At the core of the NumPy package is the
ndarray object. This encapsulates n-dimensional arrays of homogeneous data types, with many operations being
performed in compiled code for performance. There are several important differences between NumPy arrays and
the standard Python sequences. For example, a NumPy array has a fixed size at creation, and all of its elements
must be of the same data type.
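Two such differences can be seen in a minimal sketch. The run values are made up for illustration.

```python
import numpy as np

runs_list = [4, 1, 0, 6]     # plain Python list (placeholder values)
runs = np.array(runs_list)   # NumPy array built from the same values

doubled = runs * 2           # arithmetic is applied element by element
repeated = runs_list * 2     # on a list, * repeats the sequence instead

mixed = np.array([4, 1.5])   # elements share one dtype: the int is upcast to float

print(doubled, len(repeated), mixed.dtype)
```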

Matplotlib: Matplotlib is one of the most popular Python packages used for data visualization. It is a
cross-platform library for making 2D plots from data in arrays. It provides an object-oriented API that helps in
embedding plots in applications, and it can be used in Python and IPython shells, Jupyter notebooks and web
application servers.

Seaborn: Seaborn is an open-source Python library built on top of matplotlib. It is used for data
visualization and exploratory data analysis. Seaborn works easily with dataframes and the Pandas library. The
graphs created can also be customized easily.

2.2 Hardware and Software requirements


Hardware requirements:
• Processor: i3 or higher
• RAM: 4 GB or more
• Input Devices: Keyboard, mouse
• Hard Disk: 500 GB or more


Software requirements:
• Windows 7 or higher

2.3 Tools/Language/Platform
• Python Language

• Jupyter Platform


3. DATA CLEANING MECHANISMS


Most datasets contain many null values. The data has been taken from a webpage, so it is not very clean. Data
cleansing is important for individuals because, eventually, all this information can become overwhelming. It
can be difficult to find the most recent paperwork, and you may have to wade through dozens of old files before you
find it. Disorganization can lead to stress and even lost documents. Data cleansing ensures you
only have the most recent files and important documents, so when you need to, you can find them with ease. It
also helps ensure that you do not have significant amounts of personal information on your computer, which can
be a security risk.

In cricket we need accurate data, so we do not fill in null values, which would of course lead to wrong results.
Instead, we do not consider the columns that contain null values.
Now find out which fields have null values and which do not.

Fig 3.1: Finding null value Count
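Counting nulls per column and then leaving out the affected columns can be sketched as below. The table is a placeholder with assumed column names, and dropna(axis="columns") is one possible way to implement "do not consider columns with null values", not necessarily the one used in the original notebook.

```python
import numpy as np
import pandas as pd

# Placeholder match table with some missing entries (invented rows).
matches = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "city": ["City A", None, "City B"],
        "winner": ["Team A", "Team B", "Team A"],
        "umpire3": [np.nan, np.nan, np.nan],
    }
)

# Number of null values in each column.
null_counts = matches.isnull().sum()
print(null_counts)

# Keep only the columns that contain no null values at all.
clean = matches.dropna(axis="columns")
print(list(clean.columns))
```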

Before we proceed with our Python data analysis of IPL data, we should know what columns are present
in the dataset, their non-null counts, and their data types. For this, we use the pandas info() function.

Fig 3.2 Dataset Info
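As a minimal sketch on a placeholder table (column names and rows are assumptions), info() prints the summary, while dtypes and count() expose the same details as objects that can be inspected further:

```python
import pandas as pd

# Placeholder match table (invented rows, assumed column names).
matches = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "season": [2008, 2008, 2009],
        "city": ["City A", "City B", None],
    }
)

matches.info()                 # prints columns, non-null counts and dtypes

column_types = matches.dtypes  # data type of each column
non_null = matches.count()     # non-null entries in each column
```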

Now find out the total number of matches played, the number of unique cities where matches were played and
the total number of teams that have taken part in the IPL.

Fig 3.3 Different Cities & Teams
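These three counts can be sketched as follows, assuming the match table has id, city, team1 and team2 columns. The names and rows are placeholders for illustration.

```python
import pandas as pd

# Placeholder match table (invented rows, assumed column names).
matches = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "city": ["City A", "City B", "City A"],
        "team1": ["Team A", "Team B", "Team C"],
        "team2": ["Team B", "Team C", "Team A"],
    }
)

total_matches = matches["id"].nunique()   # one id per match
unique_cities = matches["city"].nunique()

# A team can appear in either column, so both are combined before counting.
total_teams = pd.concat([matches["team1"], matches["team2"]]).nunique()

print(total_matches, unique_cities, total_teams)
```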


4. DATA ANALYSIS AND VISUALIZATION


Historical data from Indian Premier League (IPL-T20) tournaments is captured to perform prediction
analysis. We consider IPL cricket matches over 10 years (2008 to 2017) and store them in a dataset. The dataset
used consists of 17 variables and 637 instances and was downloaded from. MatchSK and MatchID are unique to
each match played. Most matches are played on a home ground (in the home city of team 1), and a few
matches were played on international grounds in South Africa and the United Arab Emirates (UAE). Only two seasons
were played outside India: IPL-2009 in South Africa and IPL-2014 in the UAE.
According to the rules of the IPL, only 8 individual teams participate in each season. However, a few teams
have been created and dissolved during the period covered by the data, so the dataset has 13 individual teams.
The teams that have played the most seasons are Royal Challengers Bangalore, Kings XI Punjab, Delhi Daredevils,
Mumbai Indians and Kolkata Knight Riders. Kochi Tuskers Kerala played only one season, whereas Gujarat Lions have
been part of two seasons. Some teams represented a city but were dissolved and created again under a new name. For
instance, Pune Warriors was dissolved in 2014 and Rising Pune Supergiants came into existence in 2016.
Sunrisers Hyderabad, created in 2013, was formerly known as Deccan Chargers.
The aim of predicting the match result in the first model is to evaluate the impact of home ground
advantage. In this experiment, the variable ‘‘Result” is derived based on the Home Team (Team 1) winning the
match when the match is played on its home ground. For example, it captures how often Chennai Super Kings
win when they play at their home ground in Chennai. The attribute ‘‘Result” will be the target class for
predicting the outcome by classification. The format of the tournament is that each team is designated one city as
its home ground, and each pair of teams plays two matches, once at the first team’s home
ground and once at the other team’s home ground. In the considered dataset, two seasons were played in a foreign
country, and for these matches the venue is changed to the home ground of the respective teams. A few teams, such as
Kings XI Punjab, have different home grounds, but their original home ground is Mohali. For this
experiment, only those features that have an impact on the home ground are considered.
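The derivation of the target attribute can be sketched as below. This is an illustrative sketch only: the column names team1 and winner, the 1/0 encoding of ‘‘Result” and the placeholder rows are assumptions, not taken from the original dataset.

```python
import pandas as pd

# Placeholder matches where team1 is the home team (invented rows).
matches = pd.DataFrame(
    {
        "team1": ["Team A", "Team A", "Team B", "Team B"],
        "team2": ["Team B", "Team C", "Team A", "Team C"],
        "winner": ["Team A", "Team C", "Team B", "Team B"],
    }
)

# Result = 1 when the home team (team1) won the match, otherwise 0.
matches["Result"] = (matches["winner"] == matches["team1"]).astype(int)

# Share of home matches won by each home team.
home_win_rate = matches.groupby("team1")["Result"].mean()
print(home_win_rate)
```

The Result column then serves as the class label for a classifier, and the per-team mean gives a first impression of how strong the home advantage is.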
Here we show the matches played in each season using a count plot. We can find the total matches played by
grouping the match id from the match dataset. We can use the following piece of code for this purpose:

The output is as below:


Figure 4.1: Total matches played in each season
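A sketch of this step is given below, with assumed column names (id, season) and placeholder rows. The counts are computed with pandas and drawn with Matplotlib so the numbers can be checked. Seaborn's countplot on the season column would draw equivalent bars directly.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder match table (invented rows, assumed column names).
matches = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "season": [2008, 2008, 2009, 2009, 2009],
    }
)

# Number of match ids recorded in each season.
matches_per_season = matches.groupby("season")["id"].count()

fig, ax = plt.subplots()
ax.bar(matches_per_season.index.astype(str).tolist(), matches_per_season.tolist())
ax.set_xlabel("Season")
ax.set_ylabel("Matches played")
ax.set_title("Total matches played in each season")
```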

Now we find the total runs scored in each season. This can be achieved by summing the runs from all types of
score, i.e. 1’s, 2’s, 3’s, etc., and then drawing a line plot of the grouped total runs against the seasons.

Figure 4.2: Total runs in each season
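One way to sketch this is to attach the season to every delivery and then sum the runs per season. The column names (id, season, match_id, total_runs) and the rows are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder tables (invented rows, assumed column names).
matches = pd.DataFrame({"id": [1, 2, 3], "season": [2008, 2008, 2009]})
deliveries = pd.DataFrame(
    {
        "match_id": [1, 1, 2, 3, 3],
        "total_runs": [4, 1, 6, 2, 0],
    }
)

# Bring the season of each match onto its deliveries.
merged = deliveries.merge(
    matches[["id", "season"]], left_on="match_id", right_on="id"
)

# Sum of all runs scored in each season.
runs_per_season = merged.groupby("season")["total_runs"].sum()

fig, ax = plt.subplots()
ax.plot(runs_per_season.index.tolist(), runs_per_season.tolist(), marker="o")
ax.set_xlabel("Season")
ax.set_ylabel("Total runs")
```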

We can also look at the toss decision across the teams, showing whether most teams chose to bat or bowl
first over all the seasons. Here a count plot gives a simple overall visualization of the decisions. The code is as
below.

Resultant output:

Figure 4.3: Toss decision across seasons
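The toss decisions per season can be tabulated with a cross-tabulation and drawn as grouped bars, as in the sketch below. The column names season and toss_decision and the rows are placeholders assumed for illustration.

```python
import pandas as pd

# Placeholder match table (invented rows, assumed column names).
matches = pd.DataFrame(
    {
        "season": [2008, 2008, 2008, 2009, 2009],
        "toss_decision": ["bat", "field", "field", "field", "field"],
    }
)

# Rows are seasons, columns are decisions, cells are match counts.
toss_counts = pd.crosstab(matches["season"], matches["toss_decision"])
print(toss_counts)

# One group of bars per season, one bar per decision.
ax = toss_counts.plot(kind="bar")
ax.set_xlabel("Season")
ax.set_ylabel("Matches")
```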

IPL Matches Played by Each Team

We can find the matches played by each team by the same process: grouping by the batting team and
match id columns, counting the rows, and then dropping the first index level, which is the match id.

Fig 4.4 Matches Played by Each Team
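The grouping described above can be sketched as follows, assuming the ball-by-ball table has match_id and batting_team columns. The names and rows are placeholders.

```python
import pandas as pd

# Placeholder ball-by-ball table (invented rows, assumed column names).
deliveries = pd.DataFrame(
    {
        "match_id": [1, 1, 1, 2, 2],
        "batting_team": ["Team A", "Team A", "Team B", "Team A", "Team C"],
    }
)

# One entry per (match, batting team) pair; the value is the ball count.
balls_per_team_match = deliveries.groupby(["match_id", "batting_team"]).size()

# Drop the match id level so that only the team name remains in the index.
team_appearances = balls_per_team_match.droplevel("match_id")

# Each remaining entry is one match, so counting entries per team gives matches played.
matches_per_team = (
    team_appearances.groupby(level="batting_team").size().sort_values(ascending=False)
)
print(matches_per_team)

# Equivalent shortcut: distinct match ids per batting team.
shortcut = deliveries.groupby("batting_team")["match_id"].nunique()
```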

Most IPL Runs by a Batsman


From the visualization below we can see that the Run-Machine, Virat Kohli, is at the top of this list with more than
6,000 runs, followed by Suresh Raina and Shikhar Dhawan.


Fig 4.5 Most runs scored by a batsman
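A ranking of this kind can be sketched by summing runs per batsman and keeping the largest totals. The column names batsman and batsman_runs are assumptions, and the players and values are placeholders rather than real IPL figures.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder ball-by-ball table (invented rows, assumed column names).
deliveries = pd.DataFrame(
    {
        "batsman": ["Player 1", "Player 2", "Player 1", "Player 3", "Player 2", "Player 1"],
        "batsman_runs": [4, 1, 6, 2, 0, 1],
    }
)

# Total runs per batsman, keeping the two highest totals.
top_scorers = deliveries.groupby("batsman")["batsman_runs"].sum().nlargest(2)
print(top_scorers)

fig, ax = plt.subplots()
ax.barh(top_scorers.index.tolist(), top_scorers.tolist())
ax.set_xlabel("Runs")
ax.invert_yaxis()  # highest scorer at the top
```

On the full dataset, a larger value such as nlargest(10) would give a top-ten list.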


5. CONCLUSION
Applying data analysis techniques to cricket by considering historical game data,
player performance, natural parameters, pre-game conditions and other features is beneficial for multiple
stakeholders. In a dynamic format like T20, where the situation in a game changes on every ball, it becomes
challenging to predict the match outcome. To predict the final outcome of a T20 cricket match, we have
investigated whether machine learning can improve the prediction rate of match
results. We have formulated the problem in two scenarios, named for the most influential features: first the
Home Team feature set and second the Toss Winner decision feature set.


