
Photo by Richard Boyle on Unsplash

Scraping and Exploring Sports Betting Data — Is Arbitrage Possible? A Hands-On Analysis with Code


How to download live sports betting time series data, parse it and analyse arbitrage
opportunities with Python

Steven Eulig Sep 24, 2019 · 10 min read

Who wouldn’t like to own a risk-free profit machine? I guess everyone would like to
have one. The practice of taking advantage of price differences (in this case betting rate
differences) between different markets or bookmakers is also known as arbitrage. The
idea is to place bets on every outcome of a sample space and generate a profit in every
case. In soccer, you could bet on which player wins a game, i.e. the sample space would
consist of the three outcomes ‘Player 1 wins’, ‘Tie’ and ‘Player 1 loses’.

Before we dive deeper into the analysis, we should first look at the data collection and
further topics which we will cover along the way:

1. Web scraping live betting data (BeautifulSoup)

2. Storing results in dataframes (pandas)

3. Automation of the scraping process with a function

4. Visualization of the results (matplotlib.pyplot)

5. First arbitrage analysis and profit calculations

I have come up with many other exciting topics on this data, but I will leave them out of
scope for now. (I might follow up on them in future articles.)

Web scraping from multiple websites and data consolidation

Derivation of rules for a betting / trading strategy incl. simulation and evaluation of
the performance of the automated strategy

Driver analysis: What drives the evolution of betting rates (potentially including
text analysis from a live conference ticker)

There are probably many more interesting topics; I am happy to take your input.


Legal disclaimer: I am not a legal expert and do not provide any recommendations for
action here; this is a purely technical explanation. Massive scraping of websites causes
high traffic and can thereby put a burden on them.
I believe that if you are accessing websites, you should always consider their terms of
service and ideally obtain prior permission for projects like scraping. Moreover, I do not
promote gambling, no matter what kind of betting it is.

1. Web scraping live betting data (BeautifulSoup)


To gather the data, we will use the library BeautifulSoup. I have downloaded and
installed it with the command line via:

pip install beautifulsoup4

Once you have installed it, we can start by importing all relevant libraries.

from bs4 import BeautifulSoup
import urllib.request
import re

We can now use them to retrieve the source code of any page and parse it.
I have selected the live bets page from the German betting website tipico.de.

url = "https://www.tipico.de/de/live-wetten/"

try:
    page = urllib.request.urlopen(url)
except:
    print("An error occurred.")

soup = BeautifulSoup(page, 'html.parser')
print(soup)

The above code checks whether it can access the page and then parses the entire HTML
source code of the page. The parsed result is assigned to a variable conventionally called
‘soup’; printed out, it would fill about 200–300 A4 pages, depending on how many bets
are currently online.

As a next step, we want to extract only the relevant information in a well-structured
manner. Let’s look at (a screenshot from) the website and check which information we
care about.

To start with, we could try to extract the seven values marked in green: the time, the
names of the players, the current score and the rates for each possible outcome. Here
Rate 1 corresponds to ‘Player 1 wins’, Rate 2 to ‘Tie’ and Rate 3 to ‘Player 1 loses’. To
do so, it is convenient to inspect the element with your browser (right-click and
‘Inspect’ in Chrome).


We see that the rates are stored in buttons of class “c_but_base c_but”. To extract the
rates, we will use soup.find_all to get all buttons of this class.

regex = re.compile('c_but_base c_but')
content_lis = soup.find_all('button', attrs={'class': regex})
print(content_lis)

To get rid of the HTML code, we use the .getText() function, then store the values of all
buttons in a list and remove line breaks and tabulators.

content = []
for li in content_lis:
    content.append(li.getText().replace("\n", "").replace("\t", ""))
print(content)

The other variables can be queried analogously. You can find the details in my
notebook on GitHub. Don’t hesitate to ask questions if something is unclear.
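
For illustration, such an analogous query could look like the sketch below. The tag and
class names here are hypothetical placeholders, not the actual Tipico markup; the real
selectors (found via ‘Inspect’) are in the notebook.

import re

def extract_texts(soup, tag, class_pattern):
    """Collect the cleaned text of all tags whose class matches a pattern."""
    regex = re.compile(class_pattern)
    elements = soup.find_all(tag, attrs={'class': regex})
    return [el.getText().replace("\n", "").replace("\t", "") for el in elements]

# Hypothetical class names, for illustration only
content_names = extract_texts(soup, 'div', 'c_team_name')
content_minute = extract_texts(soup, 'span', 'c_minute')
content_score = extract_texts(soup, 'span', 'c_score')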

2. Storing results in dataframes (pandas)


Having the raw and parsed data, we now want to structure it in a practical way. We
understand that each row contains the minute of the game, two players, one score and
eleven rates. As for the rates, we only want to store the three rates for “who wins the
game” and the three rates for “who scores the next goal”, because these are available on
most of the betting sites and will later allow us a high degree of comparability.

from datetime import datetime
import pandas as pd

N_games = 10     # number of games observed, say we want the first 10
N_players = 2    # number of players per game
N_outcomes = 11  # number of possible outcomes (win, lose, tie, next goal etc.)

df = []
for i in range(N_games):
    df.append([datetime.now(), content_names[i*N_players],
               content_names[1+i*N_players], content_minute[i],
               content_score[i], content[i*N_outcomes],
               content[1+i*N_outcomes], content[2+i*N_outcomes],
               content[6+i*N_outcomes], content[7+i*N_outcomes],
               content[8+i*N_outcomes]])

pdf = pd.DataFrame(df, columns=['Time', 'Player_1', 'Player_2',
                                'MinuteOfGame', 'Score', 'Win_1', 'Win_X', 'Win_2',
                                'NextGoal_1', 'NextGoal_X', 'NextGoal_2'])

pdf.head()

(Note: In my notebook, there are two more variables, j and k. They are used to handle
cases where there is an additional row per game with ‘results of the first half of the
game’. For the sake of simplicity, I have excluded this from the description.)

3. Automation of the scraping process with a function


To repeat the scraping process many times, it is convenient to write it into a function:


def get_soccer_rates_tipico():
    """
    This function creates a table with the live betting information;
    this includes a timestamp, the players, the score and the rates
    for each party winning and scoring the next goal.

    Arguments:
    None

    Returns:
    pdf -- pandas dataframe with the results of shape (N_games, 11)
    """

    ... FULL CODE ON GITHUB

    return pdf

The shape of the returned table depends on the number of games that are currently
live. I have implemented a function which finds the first tennis game entry to figure out
the total number of live soccer games and get all of them; a sketch of the idea follows below.
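
The idea rests on soccer games being listed before tennis on the live page. A minimal
sketch, assuming the sport labels can be scraped in page order (the actual
implementation is in the notebook):

def count_soccer_games(sport_labels):
    """The index of the first tennis entry equals the number of live
    soccer games, since soccer is listed before tennis on the page."""
    for i, sport in enumerate(sport_labels):
        if sport == 'Tennis':
            return i
    return len(sport_labels)  # no tennis entry: all rows are soccer

# Example with a hypothetical label list scraped from the page
print(count_soccer_games(['Fußball', 'Fußball', 'Fußball', 'Tennis']))  # 3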

Now we can write it into a loop to repeat the scraping function at fixed time intervals.

def repeat_scraping(timedelay, number_of_scrapes,
                    filename='bet_rates_scraping_tipico.csv'):
    """
    This function repeatedly calls the scraping function to create a
    time series of scraping data. The time interval between scrapes and
    the total number of scrapes are taken as arguments. The result is
    saved in a csv file.

    Arguments:
    timedelay -- delay between each scrape request in seconds
                 (min. 15 sec recommended due to processing time)
    number_of_scrapes -- number of scrape requests

    Returns:
    Void
    """
    ... FULL CODE ON GITHUB

    pdf.to_csv(filename, index=False)


    # Check processing time and add sleeping time to fit the timedelay
    time_run = time.time() - start_time
    time.sleep(timedelay - time_run)

Note that loading and parsing the page might take up to 10 seconds. We can use the
time library to track how long our function takes and subtract that from the sleeping
time, so that the time delays stay aligned (assuming that all queries take approximately
the same time).
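
Put together, the scrape-and-sleep loop could look roughly like the sketch below. This
is a condensed illustration of the timing alignment, not my full implementation (which
is on GitHub); it assumes the get_soccer_rates_tipico function from above, and
scrape_repeatedly is an illustrative name.

import time
import pandas as pd

def scrape_repeatedly(timedelay, number_of_scrapes, filename='rates.csv'):
    """Minimal sketch of a scrape loop with aligned time delays."""
    snapshots = []
    for _ in range(number_of_scrapes):
        start_time = time.time()
        snapshots.append(get_soccer_rates_tipico())  # one scrape, one dataframe
        pd.concat(snapshots).to_csv(filename, index=False)  # persist progress
        # Subtract the processing time so scrapes stay on a fixed grid
        time_run = time.time() - start_time
        time.sleep(max(0, timedelay - time_run))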

We can call this function to scrape an entire game, for example with a time delay of 15
seconds and 500 queries, i.e. covering 125 minutes.

repeat_scraping(15, 500, 'scraping_500x15s.csv')

4. Visualization and analysis of the results (matplotlib.pyplot)


As a first step, after importing the data, we need to do some data cleaning. This
includes filling the NaNs with zeros for all rates, replacing the commas by dots and
converting them into float types.

dataframe = pd.read_csv('scraping_500x15s.csv', encoding='unicode_escape')
dataframe = dataframe.fillna(0)
ratecols = ['Win_1', 'Win_X', 'Win_2', 'NextGoal_1', 'NextGoal_X', 'NextGoal_2']
dataframe[ratecols] = dataframe[ratecols].apply(
    lambda x: x.str.replace(',', '.')).astype(float)

Because we want to query multiple times per minute, the ‘minute of the game’ is not
precise enough; therefore, we add the query timestamp.

dataframe['Time_parsed'] = 0

# Check for dates without milliseconds and add .000 to have a
# consistent formatting
no_ms_idx = np.where(dataframe.Time.apply(lambda x: len(x) == 19))
dataframe.Time.iloc[no_ms_idx] = dataframe.Time.iloc[no_ms_idx].apply(
    lambda t: t + ".000")

dataframe.Time_parsed = dataframe.Time.apply(
    lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S.%f').time())
dataframe = dataframe.drop(['Time'], axis=1)

Now we can explore the data of a first game and visualize the results. Let’s look at the
very first player in our table:

df1 = dataframe[dataframe['Player_1'] == dataframe['Player_1'][0]]
df1

We now have data on the Bayern München vs 1. FC Köln game, starting from minute 5,
for every 15 seconds. The following plot visualizes the evolution of the rates for who
scores the next goal. I used a log-scale on the y-axis to account for the rates over 30.
Additionally, I included a horizontal line at a rate of three, signalling a simple form of
arbitrage: if all rates are greater than three, you could generate a riskless profit by
placing uniformly distributed bets (e.g. 100 € each) on all three outcomes.

import numpy as np
import matplotlib.pyplot as plt

# Data for plotting
t = df1.Time_parsed.values
w1 = df1.NextGoal_1.values
w2 = df1.NextGoal_X.values
w3 = df1.NextGoal_2.values

# Plot setup
fig, ax = plt.subplots(figsize=(15, 6))
ax.plot(t, w1, marker='', label='Player 1 scores next', color='green', linewidth=2)
ax.plot(t, w2, marker='', label='No more goals', color='orange', linewidth=2)
ax.plot(t, w3, marker='', label='Player 2 scores next', color='red', linewidth=2)
plt.axhline(y=3., label='Critical line', color='grey', linewidth=2,
            linestyle='--')  # Line for arbitrage detection

ax.set(xlabel='Time', ylabel='Betting rates',
       title=str(np.unique(df1.Player_1)[0]) + ' vs ' +
             str(np.unique(df1.Player_2)[0]) + ': Rates for "Who scores next?"')
ax.grid()
plt.legend()
ax.set_yscale('log')

In my notebook, you will find a function that automatically creates the plots for all
games and saves them as images.
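
As a rough sketch of how such an automation could look (the notebook version differs
in details; plot_all_games and the output directory are illustrative names):

import os
import matplotlib.pyplot as plt

def plot_all_games(dataframe, outdir='plots'):
    """Save one 'next goal' rate plot per game found in the dataframe."""
    os.makedirs(outdir, exist_ok=True)
    for player_1 in dataframe['Player_1'].unique():
        game = dataframe[dataframe['Player_1'] == player_1]
        fig, ax = plt.subplots(figsize=(15, 6))
        ax.plot(game.Time_parsed.values, game.NextGoal_1.values, label='Player 1 scores next')
        ax.plot(game.Time_parsed.values, game.NextGoal_X.values, label='No more goals')
        ax.plot(game.Time_parsed.values, game.NextGoal_2.values, label='Player 2 scores next')
        ax.set_yscale('log')
        ax.legend()
        fig.savefig(os.path.join(outdir, player_1 + '.png'))
        plt.close(fig)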

5. First arbitrage analysis and profit calculations


Coming back to our idea of finding arbitrage deals, it is probably unlikely to find them
on one single website. I believe it is more likely to achieve arbitrage betting by using
information differences between multiple bookmakers. However, we will now look at
the analysis only for Tipico data. This analysis could easily be scaled to multiple
websites from there on.


A brief excursion into mathematics:

Let n be the number of possible outcomes of an event and q_i the rate for outcome i;
then arbitrage is possible if the sum of all 1/q_i is smaller than one.
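
In code, the condition is a one-liner. A minimal, self-contained sketch (not from the
notebook):

def is_arbitrage(rates):
    """True if the sum of inverse rates is below one, i.e. betting on
    every outcome guarantees a profit."""
    return sum(1 / q for q in rates) < 1

print(is_arbitrage([1.70, 4.70, 7.00]))  # True: inverse rates sum to ~0.944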

df2['Check_a_1'] = df2.Win_1.apply(lambda x: 1/x)
df2['Check_a_2'] = df2.Win_X.apply(lambda x: 1/x)
df2['Check_a_3'] = df2.Win_2.apply(lambda x: 1/x)
df2['Check_a_sum'] = 0
df2['Check_b_1'] = df2.NextGoal_1.apply(lambda x: 1/x)
df2['Check_b_2'] = df2.NextGoal_X.apply(lambda x: 1/x)
df2['Check_b_3'] = df2.NextGoal_2.apply(lambda x: 1/x)
df2['Check_b_sum'] = 0
df2['Check_a_sum'] = df2.Check_a_1 + df2.Check_a_2 + df2.Check_a_3
df2['Check_b_sum'] = df2.Check_b_1 + df2.Check_b_2 + df2.Check_b_3
df2['Arbitrage_flag'] = 0
arb_idx = np.unique(np.append(np.where(df2.Check_a_sum <= 1)[0],
                              np.where(df2.Check_b_sum <= 1)[0]))
df2.Arbitrage_flag[arb_idx] = 1

If we want to obtain an outcome-independent profit, we need to distribute the betting
amounts according to the rates. If rate q_i multiplied by amount s_i equals the payout
p_i, then requiring p_i = p_j for all outcomes yields s_j = q_i/q_j * s_i.

# Give the first bet the weight of 1 and adjust the other two bets accordingly
df2['Win_1_betting_fraction'] = 1
df2['NextGoal_1_betting_fraction'] = 1
... FULL CODE ON GITHUB
df2['Win_profit_percentage'] = df2.Win_1 * df2.Win_1_betting_amount * 100 - 100

In the data that I collected, I found the following example:

Rates for scoring the next goal:

Player 1: 1.70
No more goals: 4.70
Player 2: 7.00


This leads to the following betting amount distribution:

Bet 1: 62.32%
Bet 2: 22.54%
Bet 3: 15.14%

This results in a sure profit of 5.9%.
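
As a sanity check, the distribution and the profit can be reproduced in a few lines. This
is a standalone sketch, not the notebook code; it matches the figures above up to
rounding:

def arbitrage_stakes(rates):
    """Split a total stake of 1 so that the payout q_i * s_i is the same
    for every outcome i; returns the stake fractions and the sure profit."""
    inverse_sum = sum(1 / q for q in rates)
    fractions = [(1 / q) / inverse_sum for q in rates]
    profit = 1 / inverse_sum - 1  # guaranteed return on the total stake
    return fractions, profit

fractions, profit = arbitrage_stakes([1.70, 4.70, 7.00])
print([round(100 * f, 2) for f in fractions])  # [62.32, 22.54, 15.14]
print(round(100 * profit, 2))                  # 5.95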

(It seems strange to me that I found an arbitrage chance on a single website. Maybe
there was an error in scraping the numbers; I will investigate this.)

As demonstrated, this procedure enables you to calculate the sure profit of a
combination of bets. Now it is up to you to find positive values!

Wrapping up
In this article, you have learned how to load and parse web pages with Python using
BeautifulSoup, extract only the desired information and load it into structured data
tables. To extract time series, we have looked at how to implement the scraping as
functions and repeat the query at fixed time intervals. Afterwards, we wrote code to
automatically clean the data, followed by visualizations and additional calculations to
make the data interpretable.

We have seen that arbitrage betting is probably possible — if not on one website, then
by combining bets across multiple bookmakers. (Please note that I am not promoting
gambling and/or web scraping, and that even arbitrage bets may be subject to risks,
e.g. delays or cancellations of bets.)

Thank you very much for taking the time to read this article!

I highly appreciate and welcome your feedback and remarks.


You can reach out to me via LinkedIn: https://www.linkedin.com/in/steven-eulig-8b2450110/
…and you can find my GitHub including the Jupyter notebook here:

https://github.com/Phisteven/scraping-bets

Cheers,
Steven
