Scraping and Exploring Sports Betting Data — Is Arbitra... https://towardsdatascience.com/scraping-and-exploring-...
How to download live sports betting time series data, parse it and analyse arbitrage
opportunities with Python
Who wouldn't like to own a risk-free profit machine? The practice of taking advantage of price differences (in this case, differences in betting rates) between different markets or bookmakers is known as arbitrage. The idea is to place bets on every outcome of a sample space and thereby generate a profit in every case. In soccer, you could bet on which player wins a game, i.e. the sample space would consist of the three outcomes 'Player 1 wins', 'Tie' and 'Player 1 loses'.
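To make the arbitrage condition concrete: a risk-free book exists exactly when the sum of the reciprocal rates across all outcomes is below 1, and staking each outcome in proportion to its reciprocal rate then pays the same amount whichever outcome occurs. A minimal sketch (the function names and the example odds are my own, not from the article):

```python
def arbitrage_margin(odds):
    """Sum of reciprocal odds; a value below 1.0 means a risk-free book exists."""
    return sum(1.0 / o for o in odds)

def stakes_for_equal_payout(odds, budget=100.0):
    """Split a budget across all outcomes so that every outcome pays the same."""
    margin = arbitrage_margin(odds)
    return [budget * (1.0 / o) / margin for o in odds]

# Hypothetical odds for 'Player 1 wins', 'Tie', 'Player 1 loses'
odds = [3.1, 3.4, 3.2]
print(arbitrage_margin(odds))          # below 1.0, so arbitrage exists
print(stakes_for_equal_payout(odds))   # stakes with identical payout per outcome
```

Whatever happens, the payout is `budget / margin`, which exceeds the budget whenever the margin is below 1.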
Before we dive deeper into the analysis, we should first look at the data collection and the further topics we will cover along the way.
I have come up with many other exciting topics for this data, but I will leave them out of scope for now (I might follow up on them in future articles):

- Derivation of rules for a betting/trading strategy, incl. simulation and evaluation of the performance of the automated strategy
- Driver analysis: what drives the evolution of betting rates (potentially including text analysis from a live conference ticker)

There are probably many more interesting topics; I am happy to take your input.
Legal disclaimer: I am not a legal expert and do not provide any recommendations for action here; this is purely a technical explanation. Massive scraping of websites causes high traffic and can put a burden on them. If you access websites, you should always consider their terms of service and ideally obtain prior permission for projects like scraping. Moreover, I do not promote gambling, no matter what kind of betting it is.
Once you have installed BeautifulSoup, we can start by importing all relevant libraries.
We can now use them to retrieve the source code of any page and parse it.
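The import block itself is not reproduced in this extract; based on the calls used throughout the article, it presumably looks something like this (the exact set is an assumption):

```python
# Standard library
import time
import urllib.request

# Third-party libraries used for parsing, tabulation and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
```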
I have selected the live bets page from the German betting website tipico.de.
url = "https://www.tipico.de/de/live-wetten/"
try:
    page = urllib.request.urlopen(url)
except urllib.error.URLError:
    print("An error occurred.")
The above code checks whether it can access the page and then prints out the entire HTML source code of the page. The parsed source is assigned to a variable called 'soup' by convention and would be about 200–300 A4 pages long, depending on how many bets are currently online.
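The parsing step that creates 'soup' is also not shown in this extract. Conventionally it is a single BeautifulSoup call; below, a tiny inline snippet stands in for the live response so the sketch is self-contained:

```python
from bs4 import BeautifulSoup

# In the article, `page` is the live response from urllib.request.urlopen;
# a static snippet stands in here so the example runs without network access.
page = "<html><body><button class='c_but_base c_but'>2.10</button></body></html>"
soup = BeautifulSoup(page, "html.parser")
print(soup.find("button").getText())  # 2.10
```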
To start with, we could try to extract the seven values marked in green: the time, the names of the players, the current score and the rates for each possible outcome. Here Rate 1 corresponds to 'Player 1 wins', Rate 2 to a tie and Rate 3 to 'Player 1 loses'. To extract them, it is convenient to inspect the element in your browser (right-click and 'Inspect' in Chrome).
We see that the rates are stored in buttons of class “c_but_base c_but”. To extract the
rates, we will use soup.find_all to get all buttons of this class.
To get rid of the HTML markup, we use the .getText() method, then store the values of all buttons in a list and remove line breaks and tabs.
content_lis = soup.find_all("button", class_="c_but_base c_but")
content = []
for li in content_lis:
    content.append(li.getText().replace("\n", "").replace("\t", ""))
print(content)
The other variables can be queried analogously. You can find the details in my notebook on GitHub; don't hesitate to ask questions if something is unclear.
The parsed values are then stored in a table with one game per row. As for the rates, we only keep the three rates for "who wins the game" and the three rates for "who scores the next goal", because these are available on most betting sites and will later give us a high degree of comparability.
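A minimal sketch of how the parsed values might be assembled into the dataframe; the column names and the sample row are my assumptions based on the description (timestamp, players, score, three "win" rates, three "next goal" rates), not the notebook's actual layout:

```python
import pandas as pd

# Column layout assumed from the description: 11 columns per game,
# matching the (N_games, 11) shape mentioned later in the article.
columns = ["Timestamp", "Time", "Player1", "Player2", "Score",
           "Win_1", "Win_Tie", "Win_2",
           "NextGoal_1", "NextGoal_None", "NextGoal_2"]
# One illustrative row of scraped values (hypothetical data)
rows = [["2020-10-26 11:11:00.000", "12:35", "Bayern", "Koeln", "0:0",
         "2.10", "3.60", "3.40", "1.95", "7.50", "2.60"]]
pdf = pd.DataFrame(rows, columns=columns)
print(pdf.shape)  # (1, 11)
```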
pdf.head()
(Note: in my notebook there are two more variables, j and k. They are used to work around cases where there is an additional row per game for the results of the first half. For the sake of simplicity, I have excluded this from the description.)
def get_soccer_rates_tipico():
    """
    This function creates a table with the live betting information;
    this includes a timestamp, the players, the score and the rates
    for each party winning and scoring the next goal.

    Arguments:
    None

    Returns:
    pdf -- pandas dataframe with the results of shape (N_games, 11)
    """
    # ... FULL CODE ON GITHUB
    return pdf
The shape of the returned table depends on the number of games that are currently live. I have implemented a function which finds the first tennis entry to figure out the total number of live soccer games and get all of them.
Now we can wrap the scraping function in a loop to repeat it at fixed time intervals.
    Arguments:
    timedelay -- delay between each scrape request in seconds
                 (min. 15 sec recommended due to processing time)
    number_of_scrapes -- number of scrape requests

    Returns:
    Void
    """
    ... FULL CODE ON GITHUB
# Check processing time and add sleeping time to fit the timedelay
time_run = time.time() - start_time
time.sleep(max(0, timedelay - time_run))  # never sleep a negative amount
Note that loading and parsing the page can take up to 10 seconds. We can use the time library to track how long our function takes and subtract that from the sleeping time, keeping the time delays aligned (assuming that all queries take approximately the same time).
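Putting the pieces together, the loop described above might be sketched as follows; `scrape_in_loop` and its signature are my own naming, with the article's scraping function passed in as a callable:

```python
import time

def scrape_in_loop(scrape_once, timedelay=15, number_of_scrapes=500):
    """Call `scrape_once` repeatedly, aligning calls to a fixed time grid.

    `scrape_once` stands in for the article's scraping function; the loop
    mirrors the timing logic described in the text.
    """
    results = []
    for _ in range(number_of_scrapes):
        start_time = time.time()
        results.append(scrape_once())
        # Subtract the processing time so requests stay on the grid;
        # never sleep a negative amount if a scrape ran long.
        time_run = time.time() - start_time
        time.sleep(max(0.0, timedelay - time_run))
    return results

# e.g. scrape_in_loop(get_soccer_rates_tipico, 15, 500)  # covers ~125 minutes
```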
We can call this function to scrape an entire game, for example with a time delay of 15
seconds and 500 queries, i.e. covering 125 minutes.
Because we want to query multiple times per minute, the 'minute of the game' is not precise enough; therefore, we add the query timestamp.
dataframe['Time_parsed'] = 0
# Check for dates without milliseconds and add .000 to have a
# consistent formatting
dataframe.Time.iloc[np.where(dataframe.Time.apply(lambda x: True if
... FULL CODE ON GITHUB
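A hypothetical reconstruction of the truncated cleaning step: timestamps scraped without a milliseconds part get '.000' appended so that a single format string parses every row (the column name and format string are assumptions):

```python
import pandas as pd

# Sample timestamps, one without and one with milliseconds (hypothetical data)
dataframe = pd.DataFrame({"Time": ["2020-10-26 11:11:00",
                                   "2020-10-26 11:11:15.250"]})
# Check for dates without milliseconds and add .000 for consistent formatting
no_ms = ~dataframe["Time"].str.contains(".", regex=False)
dataframe.loc[no_ms, "Time"] = dataframe.loc[no_ms, "Time"] + ".000"
# Now one format parses every row
dataframe["Time_parsed"] = pd.to_datetime(dataframe["Time"],
                                          format="%Y-%m-%d %H:%M:%S.%f")
print(dataframe["Time"].tolist())
```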
Now we can explore the data of a first game and visualize the results. Let's look at the very first player of our table:
We now have data on the Bayern München vs. 1. FC Köln game from minute 5 onwards, at 15-second intervals. The following plot visualizes the evolution of the rates for who scores the next goal. I used a log scale on the y-axis to accommodate rates above 30. Additionally, I included a horizontal line at a rate of three, signalling a simple form of arbitrage: if all quotes are greater than three, you can generate a risk-free profit by placing uniformly distributed bets (e.g. 100 € each) on all three outcomes.
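The critical line at three can be checked with a small worked example: with equal stakes of 100 € on each of the three outcomes (300 € in total), any winning rate above three returns more than the 300 € staked. The function and odds below are illustrative, not from the article:

```python
def uniform_bet_profit(odds, stake_each=100.0):
    """Worst-case profit when betting the same stake on every outcome."""
    total_staked = stake_each * len(odds)
    worst_payout = min(o * stake_each for o in odds)
    return worst_payout - total_staked

# All quotes above three: guaranteed profit of roughly 10 (3.1 * 100 - 300)
print(uniform_bet_profit([3.1, 3.4, 3.2]))
# One quote below three: a loss is possible
print(uniform_bet_profit([2.8, 3.4, 3.2]))
```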
# Plot setup
fig, ax = plt.subplots(figsize=(15, 6))
ax.plot(t, w1, marker='', label='Player 1 scores next', color='green', linewidth=2)
ax.plot(t, w2, marker='', label='No more goals', color='orange', linewidth=2)
ax.plot(t, w3, marker='', label='Player 2 scores next', color='red', linewidth=2)
plt.axhline(y=3., label='Critical line', color='grey', linewidth=2,
            linestyle='--')  # Line for arbitrage detection
ax.set_yscale('log')
In my notebook, you will find a function that automatically creates the plots for all
games and saves them as images.
# Give the first bet the weight of 1 and adjust the other two bets accordingly
df2['Win_1_betting_fraction'] = 1
df2['NextGoal_1_betting_fraction'] = 1
... FULL CODE ON GITHUB
df2['Win_profit_percentage'] = df2.Win_1 * df2.Win_1_betting_amount * 100 - 100
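The elided steps between the shown lines can be sketched as follows: fixing the first fraction at 1 and scaling the other two by the ratio of the rates forces every outcome to pay the same, and normalising the fractions to amounts that sum to 1 makes the profit a percentage of the total stake. The column names beyond those shown above are my assumptions:

```python
import pandas as pd

# Hypothetical rates for one game
df2 = pd.DataFrame({"Win_1": [3.1], "Win_Tie": [3.4], "Win_2": [3.2]})
# Give the first bet the weight of 1 and adjust the other two accordingly,
# so that every outcome pays the same amount
df2["Win_1_betting_fraction"] = 1.0
df2["Win_Tie_betting_fraction"] = df2.Win_1 / df2.Win_Tie
df2["Win_2_betting_fraction"] = df2.Win_1 / df2.Win_2
# Normalise to betting amounts that sum to 1
total = (df2.Win_1_betting_fraction + df2.Win_Tie_betting_fraction
         + df2.Win_2_betting_fraction)
df2["Win_1_betting_amount"] = df2.Win_1_betting_fraction / total
# Positive only if the three rates admit an arbitrage
df2["Win_profit_percentage"] = df2.Win_1 * df2.Win_1_betting_amount * 100 - 100
print(df2.Win_profit_percentage.iloc[0])
```

For these rates the guaranteed return works out to about 7.6 % of the stake, consistent with the reciprocal-rate margin.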
(It seems strange to me that I found an arbitrage opportunity on a single website. Maybe there was an error in scraping the numbers; I will investigate this.)
Wrapping up
In this article, you learned how to load and parse web pages with Python using BeautifulSoup, extract only the desired information and load it into structured data tables. To extract time series, we looked at how to implement the scraping as functions and repeat the query at fixed time intervals. Afterwards, we wrote code to automatically clean the data, followed by visualizations and additional calculations to make the data interpretable.
We have seen that arbitrage betting is probably possible, if not on one website, then by combining bets across multiple bookmakers. (Please note that I am not promoting gambling and/or web scraping, and that even arbitrage bets may be subject to risks, e.g. delays or cancellations of bets.)
Thank you very much for taking the time to read this article!
https://github.com/Phisteven/scraping-bets
Cheers,
Steven