
Photo by Richard Boyle on Unsplash

Scraping and Exploring Sports Betting Data — Is Arbitrage Possible? A Hands-On Analysis with Code


How to download live sports betting time series data, parse it and analyse arbitrage
opportunities with Python

Steven Eulig Sep 24, 2019 · 10 min read

Who wouldn’t like to own a risk-free profit machine? I guess everyone would like to
have one. The practice of taking advantage of price differences (in this case betting rate
differences) between different markets or bookmakers is also known as arbitrage. The
idea is to place bets on every outcome of a sample space and generate a profit in every
case. In soccer, you could bet on which player wins a game, i.e. the sample space would
consist of the three outcomes ‘Player 1 wins’, ‘Tie’ and ‘Player 1 loses’.

Before we dive deeper into the analysis, we should first look at the data collection and
further topics which we will cover along the way:

1. Web scraping live betting data (BeautifulSoup)

2. Storing results in dataframes (pandas)

3. Automation of the scraping process with a function

4. Visualization of the results (matplotlib.pyplot)

5. First arbitrage analysis and profit calculations

I have come up with many other exciting topics on this data, but I will leave them out of
scope for now. (I might follow up on them in future articles.)

Web scraping from multiple websites and data consolidation

Derivation of rules for a betting / trading strategy incl. simulation and evaluation of
the performance of the automated strategy

Driver analysis: What drives the evolution of betting rates (potentially including
text analysis from a live conference ticker)

There are probably many more interesting topics; I am happy to take your input.


Legal disclaimer: I am not a legal expert and do not provide any recommendations for
action here; this is a purely technical explanation. Massive scraping of websites causes
high traffic and can thereby put a burden on them.
I believe that if you are accessing websites, you should always consider their terms of
service and ideally obtain prior permission for projects like scraping. Moreover, I do not
promote gambling, no matter what kind of betting it is.

1. Web scraping live betting data (BeautifulSoup)


To gather the data, we will use the library BeautifulSoup. I have downloaded and
installed it with the command line via:

pip install beautifulsoup4

Once you have installed it, we can start by importing all relevant libraries.

from bs4 import BeautifulSoup
import urllib.request
import re

We can now use them to retrieve the source code of any page and parse it.
I have selected the live bets page from the German betting website tipico.de.

url = "https://www.tipico.de/de/live-wetten/"

try:
    page = urllib.request.urlopen(url)
except:
    print("An error occurred.")

soup = BeautifulSoup(page, 'html.parser')
print(soup)

The above code checks whether it can access the page and then parses the entire HTML
source code of the page. The parsed result is assigned to a variable conventionally called
‘soup’; printed out, it would fill about 200–300 A4 pages, depending on how many bets
are currently online.

As a next step, we want to extract only the relevant information in a well-structured
manner. Let’s look at (a screenshot from) the website and check which information we
care about.

To start with, we could try to extract the seven values marked in green: the time, the
names of the players, the current score and the rates for each possible outcome. Here
Rate 1 corresponds to ‘Player 1 wins’, Rate 2 to ‘Tie’ and Rate 3 to ‘Player 1 loses’. To
do so, it is convenient to inspect the element with your browser (right-click and
‘Inspect’ in Chrome).


We see that the rates are stored in buttons of class “c_but_base c_but”. To extract the
rates, we will use soup.find_all to get all buttons of this class.

regex = re.compile('c_but_base c_but')
content_lis = soup.find_all('button', attrs={'class': regex})
print(content_lis)

To get rid of the HTML code, we use the .getText() function, then store the values of all
buttons in a list and remove line breaks and tabulators.

content = []
for li in content_lis:
    content.append(li.getText().replace("\n", "").replace("\t", ""))
print(content)

The other variables can be queried analogously. You can find the details in my
notebook on GitHub. Don’t hesitate to ask questions if something is unclear.
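
For illustration, such an analogous query could look like the sketch below. The tag and
class names here are hypothetical placeholders, not the actual Tipico markup; the real
selectors (found via ‘Inspect’) are in the notebook.

import re

def extract_texts(soup, tag, class_pattern):
    """Collect the cleaned text of all tags whose class matches a pattern."""
    regex = re.compile(class_pattern)
    elements = soup.find_all(tag, attrs={'class': regex})
    return [el.getText().replace("\n", "").replace("\t", "") for el in elements]

# Hypothetical class names, for illustration only
content_names = extract_texts(soup, 'div', 'c_team_name')
content_minute = extract_texts(soup, 'span', 'c_minute')
content_score = extract_texts(soup, 'span', 'c_score')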

2. Storing results in dataframes (pandas)


Having the raw and parsed data, we now want to structure it in a practical way. We
understand that each row contains the minute of the game, two players, one score and
eleven rates. As for the rates, we only want to store the three rates for “who wins the
game” and the three rates for “who scores the next goal”, because these are available on
most of the betting sites and will later allow us a high degree of comparability.

from datetime import datetime
import pandas as pd

N_games = 10     # number of games observed, say we want the first 10
N_players = 2    # number of players per game
N_outcomes = 11  # number of possible outcomes (win, lose, tie, next goal etc.)

df = []
for i in range(N_games):
    df.append([datetime.now(), content_names[i*N_players],
               content_names[1+i*N_players], content_minute[i],
               content_score[i], content[i*N_outcomes],
               content[1+i*N_outcomes], content[2+i*N_outcomes],
               content[6+i*N_outcomes], content[7+i*N_outcomes],
               content[8+i*N_outcomes]])

pdf = pd.DataFrame(df, columns=['Time', 'Player_1', 'Player_2',
                                'MinuteOfGame', 'Score', 'Win_1', 'Win_X', 'Win_2',
                                'NextGoal_1', 'NextGoal_X', 'NextGoal_2'])

pdf.head()

(Note: In my notebook, there are two more variables, j and k. They are used to handle
cases where there is an additional row per game with ‘results of the first half of the
game’. For the sake of simplicity, I have excluded this from the description.)

3. Automation of the scraping process with a function


To repeat the scraping process many times, it is convenient to write it into a function:


def get_soccer_rates_tipico():
    """
    This function creates a table with the live betting information;
    this includes a timestamp, the players, the score and the rates
    for each party winning and scoring the next goal.

    Arguments:
    None

    Returns:
    pdf -- pandas dataframe with the results of shape (N_games, 11)
    """

    ... FULL CODE ON GITHUB

    return pdf

The shape of the returned table depends on the number of games that are currently
live. I have implemented a function which finds the first tennis game entry to figure out
the total number of live soccer games and get all of them; a sketch of the idea follows below.
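
The idea rests on soccer games being listed before tennis on the live page. A minimal
sketch, assuming the sport labels can be scraped in page order (the actual
implementation is in the notebook):

def count_soccer_games(sport_labels):
    """The index of the first tennis entry equals the number of live
    soccer games, since soccer is listed before tennis on the page."""
    for i, sport in enumerate(sport_labels):
        if sport == 'Tennis':
            return i
    return len(sport_labels)  # no tennis entry: all rows are soccer

# Example with a hypothetical label list scraped from the page
print(count_soccer_games(['Fußball', 'Fußball', 'Fußball', 'Tennis']))  # 3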

Now we can write it into a loop to repeat the scraping function at fixed time intervals.

def repeat_scraping(timedelay, number_of_scrapes,
                    filename='bet_rates_scraping_tipico.csv'):
    """
    This function repeatedly calls the scraping function to create a
    time series of scraping data. The time interval between scrapes and
    the total number of scrapes are taken as arguments. The result is
    saved in a csv file.

    Arguments:
    timedelay -- delay between each scrape request in seconds
                 (min. 15 sec recommended due to processing time)
    number_of_scrapes -- number of scrape requests

    Returns:
    Void
    """
    ... FULL CODE ON GITHUB

    pdf.to_csv(filename, index=False)


    # Check processing time and add sleeping time to fit the timedelay
    time_run = time.time() - start_time
    time.sleep(timedelay - time_run)

Note that loading and parsing the page might take up to 10 seconds. We can use the
time library to track how long our function takes and subtract that from the sleeping
time, so that the time delays stay aligned (assuming that all queries take approximately
the same time).
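
Put together, the scrape-and-sleep loop could look roughly like the sketch below. This
is a condensed illustration of the timing alignment, not my full implementation (which
is on GitHub); it assumes the get_soccer_rates_tipico function from above, and
scrape_repeatedly is an illustrative name.

import time
import pandas as pd

def scrape_repeatedly(timedelay, number_of_scrapes, filename='rates.csv'):
    """Minimal sketch of a scrape loop with aligned time delays."""
    snapshots = []
    for _ in range(number_of_scrapes):
        start_time = time.time()
        snapshots.append(get_soccer_rates_tipico())  # one scrape, one dataframe
        pd.concat(snapshots).to_csv(filename, index=False)  # persist progress
        # Subtract the processing time so scrapes stay on a fixed grid
        time_run = time.time() - start_time
        time.sleep(max(0, timedelay - time_run))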

We can call this function to scrape an entire game, for example with a time delay of 15
seconds and 500 queries, i.e. covering 125 minutes.

repeat_scraping(15, 500, 'scraping_500x15s.csv')

4. Visualization and analysis of the results (matplotlib.pyplot)


As a first step, after importing the data, we need to do some data cleaning. This
includes filling the NaNs with zeros for all rates, replacing the commas by dots and
converting them into float types.

dataframe = pd.read_csv('scraping_500x15s.csv', encoding='unicode_escape')
dataframe = dataframe.fillna(0)
ratecols = ['Win_1', 'Win_X', 'Win_2', 'NextGoal_1', 'NextGoal_X', 'NextGoal_2']
dataframe[ratecols] = dataframe[ratecols].apply(
    lambda x: x.str.replace(',', '.')).astype(float)

Because we want to query multiple times per minute, the ‘minute of the game’ is not
precise enough; therefore, we add the query timestamp.

dataframe['Time_parsed'] = 0

# Check for dates without milliseconds and add .000 to have a
# consistent formatting
no_ms_idx = np.where(dataframe.Time.apply(lambda x: len(x) == 19))
dataframe.Time.iloc[no_ms_idx] = dataframe.Time.iloc[no_ms_idx].apply(
    lambda t: t + ".000")

dataframe.Time_parsed = dataframe.Time.apply(
    lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f').time())
dataframe = dataframe.drop(['Time'], axis=1)

Now we can explore the data of a first game and visualize the results. Let’s look at the
very first player in our table:

df1 = dataframe[dataframe['Player_1'] == dataframe['Player_1'][0]]
df1

We now have data on the Bayern München vs 1. FC Köln game, starting from minute 5,
for every 15 seconds. The following plot visualizes the evolution of the rates for who
scores the next goal. I used a log-scale on the y-axis to account for the rates over 30.
Additionally, I included a horizontal line at a rate of three, signalling a simple form of
arbitrage: if all rates are greater than three, you could generate a riskless profit by
placing uniformly distributed bets (e.g. 100 € each) on all three outcomes.

import numpy as np
import matplotlib.pyplot as plt

# Data for plotting
t = df1.Time_parsed.values
w1 = df1.NextGoal_1.values
w2 = df1.NextGoal_X.values
w3 = df1.NextGoal_2.values

# Plot setup
fig, ax = plt.subplots(figsize=(15, 6))
ax.plot(t, w1, marker='', label='Player 1 scores next', color='green', linewidth=2)
ax.plot(t, w2, marker='', label='No more goals', color='orange', linewidth=2)
ax.plot(t, w3, marker='', label='Player 2 scores next', color='red', linewidth=2)
plt.axhline(y=3., label='Critical line', color='grey', linewidth=2,
            linestyle='--')  # Line for arbitrage detection

ax.set(xlabel='Time', ylabel='Betting rates',
       title=str(np.unique(df1.Player_1)[0]) + ' vs ' +
             str(np.unique(df1.Player_2)[0]) + ': Rates for "Who scores next?"')
ax.grid()
plt.legend()
ax.set_yscale('log')

In my notebook, you will find a function that automatically creates the plots for all
games and saves them as images.
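
As a rough sketch of how such an automation could look (the notebook version differs
in details; plot_all_games and the output directory are illustrative names):

import os
import matplotlib.pyplot as plt

def plot_all_games(dataframe, outdir='plots'):
    """Save one 'next goal' rate plot per game found in the dataframe."""
    os.makedirs(outdir, exist_ok=True)
    for player_1 in dataframe['Player_1'].unique():
        game = dataframe[dataframe['Player_1'] == player_1]
        fig, ax = plt.subplots(figsize=(15, 6))
        ax.plot(game.Time_parsed.values, game.NextGoal_1.values, label='Player 1 scores next')
        ax.plot(game.Time_parsed.values, game.NextGoal_X.values, label='No more goals')
        ax.plot(game.Time_parsed.values, game.NextGoal_2.values, label='Player 2 scores next')
        ax.set_yscale('log')
        ax.legend()
        fig.savefig(os.path.join(outdir, player_1 + '.png'))
        plt.close(fig)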

5. First arbitrage analysis and profit calculations


Coming back to our idea of finding arbitrage deals, it is probably unlikely to find them
on one single website. I believe it is more likely to achieve arbitrage betting by using
information differences between multiple bookmakers. However, we will now look at
the analysis only for Tipico data. This analysis could easily be scaled to multiple
websites from there on.


A brief excursion into mathematics:

Let n be the number of possible outcomes of an event and q_i the rate for outcome i;
then arbitrage is possible if the sum of all 1/q_i is smaller than one.
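
In code, the condition is a one-liner. A minimal, self-contained sketch (not from the
notebook):

def is_arbitrage(rates):
    """True if the sum of inverse rates is below one, i.e. betting on
    every outcome guarantees a profit."""
    return sum(1 / q for q in rates) < 1

print(is_arbitrage([1.70, 4.70, 7.00]))  # True: inverse rates sum to ~0.944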

df2['Check_a_1'] = df2.Win_1.apply(lambda x: 1/x)
df2['Check_a_2'] = df2.Win_X.apply(lambda x: 1/x)
df2['Check_a_3'] = df2.Win_2.apply(lambda x: 1/x)
df2['Check_a_sum'] = 0
df2['Check_b_1'] = df2.NextGoal_1.apply(lambda x: 1/x)
df2['Check_b_2'] = df2.NextGoal_X.apply(lambda x: 1/x)
df2['Check_b_3'] = df2.NextGoal_2.apply(lambda x: 1/x)
df2['Check_b_sum'] = 0
df2['Check_a_sum'] = df2.Check_a_1 + df2.Check_a_2 + df2.Check_a_3
df2['Check_b_sum'] = df2.Check_b_1 + df2.Check_b_2 + df2.Check_b_3
df2['Arbitrage_flag'] = 0
arb_idx = np.unique(np.append(np.where(df2.Check_a_sum <= 1)[0],
                              np.where(df2.Check_b_sum <= 1)[0]))
df2.Arbitrage_flag[arb_idx] = 1

If we want to obtain an outcome-independent profit, we need to distribute the betting
amounts according to the rates. If rate q_i multiplied by amount s_i equals the payout
p_i, then requiring p_i = p_j for all outcomes yields s_j = q_i/q_j * s_i.

# Give the first bet the weight of 1 and adjust the other two bets accordingly
df2['Win_1_betting_fraction'] = 1
df2['NextGoal_1_betting_fraction'] = 1
... FULL CODE ON GITHUB
df2['Win_profit_percentage'] = df2.Win_1 * df2.Win_1_betting_amount * 100 - 100

In the data that I collected, I found the following example:

Rates for scoring the next goal:

Player 1: 1.70
No more goals: 4.70
Player 2: 7.00


This leads to the following betting amount distribution:

Bet 1: 62.32%
Bet 2: 22.54%
Bet 3: 15.14%

This results in a sure profit of 5.9%.
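
As a sanity check, the distribution and the profit can be reproduced in a few lines. This
is a standalone sketch, not the notebook code; it matches the figures above up to
rounding:

def arbitrage_stakes(rates):
    """Split a total stake of 1 so that the payout q_i * s_i is the same
    for every outcome i; returns the stake fractions and the sure profit."""
    inverse_sum = sum(1 / q for q in rates)
    fractions = [(1 / q) / inverse_sum for q in rates]
    profit = 1 / inverse_sum - 1  # guaranteed return on the total stake
    return fractions, profit

fractions, profit = arbitrage_stakes([1.70, 4.70, 7.00])
print([round(100 * f, 2) for f in fractions])  # [62.32, 22.54, 15.14]
print(round(100 * profit, 2))                  # 5.95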

(It seems strange to me that I found an arbitrage chance on a single website. Maybe
there was an error in scraping the numbers; I will investigate this.)

As demonstrated, this procedure enables you to calculate the sure profit of a
combination of bets. Now it is up to you to find positive values!

Wrapping up
In this article, you have learned how to load and parse web pages with Python using
BeautifulSoup, extract only the desired information and load it into structured data
tables. To extract time series, we have looked at how to implement the scraping as
functions and repeat the query at fixed time intervals. Afterwards, we wrote code to
automatically clean the data, followed by visualizations and additional calculations to
make the data interpretable.

We have seen that arbitrage betting is probably possible — if not on one website, then
by combining bets across multiple bookmakers. (Please note that I am not promoting
gambling and/or web scraping, and that even arbitrage bets may be subject to risks,
e.g. delays or cancellations of bets.)

Thank you very much for taking the time to read this article!

I highly appreciate and welcome your feedback and remarks.


You can reach out to me via LinkedIn: https://www.linkedin.com/in/steven-eulig-8b2450110/
…and you can find my GitHub including the Jupyter notebook here:

https://github.com/Phisteven/scraping-bets

Cheers,
Steven
