Forecasting Success in the National Hockey League Using Advanced Performance Metrics
Josh Weissbock 4521930 University of Ottawa CSI-5388 April 18, 2013
Abstract
In this project I collected a number of traditional and advanced statistics from the National Hockey League over a period of 10 weeks. I use this data to train a number of different classifiers to forecast success in an NHL game. In the first half I compare classifiers using traditional stats, advanced stats and both. I then build on these results by expanding the advanced stats with variants of the PDO stat and training new classifiers. The best classifier is a neural network that predicts the winner with 84.33% accuracy under 10-fold cross-validation, using the PDO of each team over the last 3 games.
1 Introduction
2 Background
3 Data
3.1 Description
3.2 Preprocessing
4 Classification & Results
5 Conclusion and Future Work
Introduction
Performance metrics in hockey, or “advanced statistics”, are a relatively new family of statistics for measuring the performance of hockey players and teams. Traditional stats have been used by teams for a few decades, but they do a poor job of predicting or measuring team performance. Plus/Minus, for example, sums the goals scored for and against while a player was on the ice, but it does not tell how, or even if, that player was involved. Despite these issues, traditional media still cite these numbers to back arguments, regardless of what they actually measure. Other traditional stats such as hits, blocked shots, giveaways and takeaways correlate less with results than previously thought. Performance metrics have become more popular over the last decade as they have been developed further. Although they are used mainly by bloggers and some hockey analysts, there is evidence that professional NHL teams use performance metrics to analyze their own gameplay. There are many different performance metrics in hockey, such as Fenwick, a measure of possession; Corsi, an expanded version of Fenwick; PDO, a measure of luck; and many variants of these for the different game states.
In this project I try to determine success in NHL games by forecasting the winner of a game between two teams. This can give a competitive advantage to team managers, scouts, hockey analysts and gamblers. I do this forecasting in two parts. In the first, I classify wins and losses using traditional stats, advanced stats and a combination of both. In the second, I vary the number of games over which the advanced statistics, specifically PDO, are calculated.
Background
To the best of my knowledge, there is no other research that uses machine learning methods to predict which of two NHL teams is more likely to win a game. Related research includes Gramacy et al., who used regularized logistic regression to estimate individual player contributions in the NHL. The authors argue that traditional stats such as Plus/Minus do not tell the full story of the effect of individual players, and that the players who are strongest in possession are often the least well known. They use logistic regression with a Bayesian approach and a Laplace prior distribution, and demonstrate that some superstars do not have as large an effect on the ice as their high salaries suggest, while many low-paid players are the ones who drive possession.
Warner tries to predict the margin of victory in NFL games, comparing machine learning methods to the Las Vegas line. Using data from over 2,000 games played over a decade, the author predicts the winners using logistic regression and achieves an accuracy of 64%, 2% higher than the Las Vegas line.
Charron argues that traditional stats (hits, blocks, etc.) do not tell a very descriptive story about the success of teams. Many of these stats are counted by people employed by the home team in their home arena, and it is well demonstrated that these statistic trackers have a bias towards the home team. Charron shows that “Fenwick Close” has the highest r² correlation with points achieved in the standings (0.623 and 0.218 for home and away games). This is much higher than for other stats including Wins, Goals For, Goal Differential, Giveaways, Takeaways, Hits and others.
Murphy compares the r² correlation of the 5-on-5 Goals For/Goals Against ratio (5-5 F/A) and of traditional stats with wins and points. Using data from all seasons from 2005 onwards, he shows that 5-5 F/A is the most correlated with wins (0.605) and points (0.655), ahead of traditional stats such as Goals Against per Game (0.472, 0.510), Power Play % and Penalty Kill % (0.372, 0.390) and nine others.
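The r² figures quoted above are squared correlations between a team-level stat and an outcome such as points. As a minimal sketch of that calculation (not the code used by Charron or Murphy; the function name `r_squared` and the sample numbers are purely illustrative), the squared Pearson correlation can be computed as follows:

```python
def r_squared(xs, ys):
    """Square of the Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares around the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov * cov / (var_x * var_y)


# Illustrative (made-up) values: a possession stat and standings points for four teams.
possession = [46.0, 49.5, 51.0, 54.5]
points = [40, 47, 52, 55]
print(round(r_squared(possession, points), 3))
```

A value near 1 means the stat explains most of the variation in the outcome; a value near 0 means it explains very little.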
Data
Description
I collected a number of traditional and advanced stats daily for each game taking place that evening. After the games were completed, additional stats on the results were collected and compiled. Data was collected every day during the period of this course, for a total of 386 games over approximately 10 weeks. A Python script ran daily to collect the data automatically, mining the stats from a number of different websites.
The traditional data collected before the games were: Goals For (GF), Goals Against (GA), Goal Differential (GlDiff), Power Play Success Rate (PP%), Penalty Kill Success Rate (PK%), Shot Percentage (Sh%), Save Percentage (Sv%), Winning Streak and Conference Standing. This data came from TSN.ca and NHL.com, which are updated daily. The advanced statistics collected were Fenwick Close %, PDO and 5-5 F/A. These stats are not published by the NHL; instead they are maintained by individual online hockey analysts on sites such as BehindTheNet.ca. The next day the post-game information was collected: the winner and loser, each team’s goals for and against, shot percentage, save percentage and PDO (at all states).
      Fenwick Close %   GF   GA  GlDiff   PP%   PK%  Sh%  Sv%   PDO  5-5 F/A  Result
Away            44.92  108  100       8  18.7    85  892  919  1027     1.05  Win
Home            49.85   89   72      17  29.8  89.4  929  939  1010     1.12  Loss
Table 1: Example of data for a game
The traditional stats were selected because they are well known and often cited in the media. The advanced stats were chosen because of their high correlation with wins and points, as shown by Charron, Drance and Murphy.
Fenwick Close is a stat representing possession: it is the number of shots and missed shots a team has taken, as a percentage of all shots and missed shots taken by both teams. “Close” refers to game states where both teams are at even strength (5-on-5, 4-on-4, etc.) and either the score is tied in the third period or the teams are separated by no more than one goal in the first two periods. Teams above 50% are said to be positive puck-possession teams: the more often you have the puck, the more you shoot, and the more you score.
The advanced stats community in hockey often discusses luck. Luck can be described as results that fall outside the normal boundaries and variance of a player’s performance. A player cannot maintain a shot percentage multiple standard deviations above or below the average for long periods of time. A goalie will not stop every shot all season, nor allow a goal on every shot. Luck, then, is when the results of a player’s performance are better (good luck) or worse (bad luck) than the normal average and variance. Over the long term the luck of all teams will regress, but in the short term you can see which teams have been “luckier” than others. PDO is a stat of luck. Luck plays a role in hockey at the NHL level because of the near-even skill level between teams: skill does determine a large percentage of the game, but luck is still involved in 38% of the results and standings.
PDO does not stand for anything; it is the sum of Sh% and Sv%. Teams playing with a PDO over 100% are exceeding expectations and are seen as lucky, while teams with a PDO under 100% are not meeting their level of skill and are seen as unlucky. Over a regular season the PDO of every team regresses towards 100%; within 25 games it will be at 100% ± 2%, as can be seen in Figure 1. This is useful because in the short term we can see who has been lucky and who has been unlucky. Over the long term the stat becomes less relevant, as PDOs regress to the norm.
Figure 1: PDO Boundaries of Chance from 
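The two advanced stats described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Fenwick % is treated as a team's share of all unblocked shot attempts (which is what makes 50% the break-even point), the even-strength condition of “Close” is assumed to be checked by the caller, and all function and parameter names are hypothetical rather than taken from the collection script.

```python
def fenwick_pct(shots_for, missed_for, shots_against, missed_against):
    """Team's share of unblocked shot attempts (shots on goal + misses), in percent."""
    attempts_for = shots_for + missed_for
    attempts_against = shots_against + missed_against
    total = attempts_for + attempts_against
    if total == 0:
        return 50.0  # assumption: no attempts by either team counts as neutral
    return 100.0 * attempts_for / total


def is_close(period, goal_diff):
    """'Close' score state: within one goal in periods 1-2, tied from period 3 on."""
    if period <= 2:
        return abs(goal_diff) <= 1
    return goal_diff == 0


def pdo(goals_for, shots_for, goals_against, shots_against):
    """PDO = Sh% + Sv%, both expressed in percent."""
    sh_pct = 100.0 * goals_for / shots_for
    sv_pct = 100.0 * (1 - goals_against / shots_against)
    return sh_pct + sv_pct


# Illustrative numbers only: 10 goals on 100 shots, 8 goals allowed on 100 shots against.
value = pdo(10, 100, 8, 100)
print(value, "lucky" if value > 100 else "unlucky or neutral")
```

A team scoring on 10% of its shots while its goalies stop 92% of shots against has a PDO of about 102%, which this section would read as running slightly lucky.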
Preprocessing
Preprocessing was carried out throughout the data collection period. Each day the new data was checked to ensure that it was valid, up to date and within the normal variance. Data summarization was used to see how the data was skewed, followed by data cleaning to make sure there were no missing values and no outliers that might cause issues. Data integration and transformation were then completed. There was no data reduction, as there was only data for 386 games. Data discretization was the final step, placing the nominal and ordinal values into the appropriate ARFF format for Weka. Python was then used to convert the data from a .csv file into the corresponding ARFF file. The data was represented as the differential between the statistics of the two teams, with the winning team receiving the label “Win” and the losing team receiving the label “Loss”. An example of how the data from Table 1 would be represented in the ARFF file can be seen in Table 2.
Away  -4.93   19   28  -9  -11.1  -4.4  -37  -20   17  -1   1  -0.07  Win
Home   4.93  -19  -28   9   11.1   4.4   37   20  -17   1  -1   0.07  Loss
Table 2: Differential for Weka of data in Table 1
In the first part of the experiment, three ARFF files were created: one with only the traditional stats, one with only the advanced stats and one with both. In the second part, different variations of the PDO stat were used. As discussed previously, PDO regresses to the norm of 100% for all teams over the long term, so a smaller sample of games can represent whether a team has been lucky or unlucky in the previous n games. For this part I again used Python to create ARFF files containing all stats, as in the first half, but with the PDO of each team calculated over the last 3, 5, 10 and 25 games.
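The differential representation and the ARFF conversion described above can be sketched as follows. This is a minimal illustration, not the original conversion script: it uses only the first six values of each row of Table 1, and the attribute names, relation name and helper functions are hypothetical.

```python
# Hypothetical attribute names for the first six values of each Table 1 row.
FEATURES = ["FenClose", "GF", "GA", "GlDiff", "PP", "PK"]


def differential_rows(away, home, away_won):
    """Return two labelled rows: away minus home, and home minus away."""
    away_row = [round(a - h, 2) for a, h in zip(away, home)]
    home_row = [round(h - a, 2) for a, h in zip(away, home)]
    return [
        away_row + ["Win" if away_won else "Loss"],
        home_row + ["Loss" if away_won else "Win"],
    ]


def to_arff(relation, features, rows):
    """Render rows as ARFF text with numeric attributes and a nominal class."""
    lines = ["@RELATION " + relation, ""]
    lines += ["@ATTRIBUTE %s NUMERIC" % name for name in features]
    lines += ["@ATTRIBUTE result {Win,Loss}", "", "@DATA"]
    lines += [",".join(str(v) for v in row) for row in rows]
    return "\n".join(lines) + "\n"


away = [44.92, 108, 100, 8, 18.7, 85]
home = [49.85, 89, 72, 17, 29.8, 89.4]
rows = differential_rows(away, home, away_won=True)
print(to_arff("nhl_games", FEATURES, rows))
```

Each game contributes two mirrored rows, one per team, so the two class labels are balanced by construction.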
Classification & Results
This becomes a binary classification problem: either Win or Loss. I used 10-fold cross-validation with a number of algorithms: neural networks (Weka’s MultilayerPerceptron), as they are known to work well with noisy data; NaiveBayes, to look at future results based on the past; Support Vector Machines (SVM, using Weka’s SMO algorithm), as they have worked well in my previous experience; and J48, as it produces human-readable output. All algorithms were run in Weka with their default parameter values, and all classifiers were compared to a ZeroR baseline.
In the first part of the experiment I used the classification algorithms to see what accuracy is possible with traditional stats, advanced stats and mixed stats. The PDO used in this part was the PDO of the season so far. The results can be seen in Table 3.
          Traditional  Advanced  Mixed
Baseline  49.48%       49.48%    49.48%
SMO       60.10%       55.05%    60.10%
NB        59.07%       54.53%    57.25%
J48       58.03%       52.85%    58.29%
NN        58.68%       55.83%    58.03%
Table 3: Results of the first experiment.
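To make the evaluation protocol concrete, here is a small pure-Python sketch of k-fold cross-validation applied to a majority-class predictor in the spirit of ZeroR. It is not Weka's implementation: the unstratified shuffle, the seed, and the assumption of one Win row and one Loss row per game (772 rows for 386 games) are illustrative choices of this sketch.

```python
import random
from collections import Counter


def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def zero_r_cv_accuracy(labels, k=10, seed=0):
    """Cross-validated accuracy of a majority-class (ZeroR-style) predictor."""
    correct = 0
    for test_idx in k_fold_indices(len(labels), k, seed):
        held_out = set(test_idx)
        # "Train" on everything outside the fold: remember the most common label.
        train = [lab for i, lab in enumerate(labels) if i not in held_out]
        majority = Counter(train).most_common(1)[0][0]
        # Score that single prediction against every held-out row.
        correct += sum(labels[i] == majority for i in test_idx)
    return correct / len(labels)


labels = ["Win", "Loss"] * 386  # assumed layout: two rows per game
print(zero_r_cv_accuracy(labels))
```

On perfectly balanced labels this kind of baseline cannot exceed 50%: whichever class is the majority in the training folds is, at best, tied in the held-out fold. That is consistent with a baseline that sits just under a coin flip.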
In the second part of the experiment I used both traditional and advanced statistics but calculated PDO over different numbers of games to represent the recent luck of the teams. I used windows of 3, 5, 10 and 25 games and compared them to the PDO based on the season total, as used in Table 3. This allows us to see which teams have been luckier in the short term. The results can be seen in Table 4, with a further breakdown of the best classifiers in Table 5.
          PDO3    PDO5    PDO10   PDO25   PDOall
Baseline  49.48%  49.48%  49.48%  49.48%  49.48%
SMO       81.22%  65.29%  60.10%  60.10%  60.10%
NB        68.13%  60.36%  58.55%  58.29%  57.25%
J48       77.46%  66.32%  66.71%  55.06%  58.29%
NN        84.33%  78.24%  63.47%  60.49%  58.03%
Table 4: Results of the second experiment.
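The windowed PDO variants (PDO3, PDO5, PDO10, PDO25) can be sketched as below. The paper does not say whether the window pools shot and goal totals or averages per-game PDO values; this sketch assumes pooled totals, and the function name, tuple layout and sample games are all hypothetical.

```python
def pdo_last_n(games, n):
    """PDO (in percent) over the most recent n games, using pooled totals.

    games: list of (goals_for, shots_for, goals_against, shots_against)
    tuples ordered oldest to newest; n is assumed to be at least 1.
    """
    recent = games[-n:]
    goals_for = sum(g[0] for g in recent)
    shots_for = sum(g[1] for g in recent)
    goals_against = sum(g[2] for g in recent)
    shots_against = sum(g[3] for g in recent)
    sh_pct = 100.0 * goals_for / shots_for
    sv_pct = 100.0 * (1 - goals_against / shots_against)
    return sh_pct + sv_pct


# Made-up game log for one team, oldest game first.
games = [(3, 30, 2, 30), (1, 25, 4, 35), (5, 30, 1, 25), (2, 28, 2, 28)]
for window in (3, 5, 10, 25):
    print(window, round(pdo_last_n(games, window), 2))
```

When a team has played fewer than n games, the slice simply covers every game available, so the longer windows fall back to the season-to-date value.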
             Precision  Recall  F-Score  ROC Curve
NN & PDO3    0.843      0.843   0.843    0.887
NN & PDO5    0.783      0.782   0.782    0.818
SMO & PDO3   0.812      0.812   0.812    0.812
J48 & PDO3   0.775      0.775   0.775    0.774
Table 5: Breakdown of results for the best classifiers.
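For reference, the precision, recall and F-score columns of Table 5 follow the standard definitions, sketched below for a single class. The counts in the example are invented for illustration and are not taken from the experiment.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F-score from true/false positive and false negative counts."""
    precision = tp / (tp + fp)  # of the predicted Wins, how many were Wins
    recall = tp / (tp + fn)     # of the actual Wins, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1


# Hypothetical counts for the "Win" class.
p, r, f = precision_recall_f1(tp=80, fp=20, fn=20)
print(round(p, 3), round(r, 3), round(f, 3))
```

When precision and recall are nearly equal, as in every row of Table 5, the F-score lands at the same value because it is their harmonic mean.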
Conclusion and Future Work
In the first experiment I looked at predicting success in NHL games by comparing a number of classifiers trained on traditional, advanced and mixed stats. The greatest accuracy, 60.10%, came from SMO, the Weka SVM implementation. This result was the same for both traditional and mixed stats, and beat the baseline of 49.48% (essentially a coin flip). Using only advanced stats, the accuracy decreased. My hypothesis is that the data covered too short a sample of a hockey season (in 2013 teams played only 48 games, compared to 82 in a regular season, due to a lockout). With such a shortened season and small data sample, the best teams have not all risen to the top of the standings, nor have the worst teams dropped to the bottom.
In the second part of the experiment, replacing the season-long PDO with a PDO calculated over a much shorter time frame increased the accuracy by far more than I initially estimated. Using the PDO of the last three games each team has played and a neural network classifier, the accuracy improved to 84.33%. This is much better than the baseline and much better than using the PDO for the entire season. For most classifiers the accuracy increases as the number of games the PDO is calculated on decreases; J48 is the only exception in Table 4. The precision, recall, F-score and area under the ROC curve of the best classifiers in Table 5 confirm that this is still the best classifier.
While I feel fairly comfortable with the success of this classifier, there are many items I would like to try in order to expand upon this work. As this was a shortened season, I would like to repeat the experiment with the 2013-2014 season and use an entire season’s worth of data. There are many other advanced stats and factors that I did not collect in this project that I would like to add to see how they affect the results, such as Fenwick Tied, Fenwick +1, Fenwick +2, Fenwick -1, Fenwick -2, Score-Adjusted Fenwick, Corsi % (similar to Fenwick but including blocked shots), the stats of the team based on the goaltender who is playing that game, injuries, weather at the arena, change in weather for the teams, change in altitude, recent trades, changes in teams after the trade deadline, scoring chances and the odds that casinos are giving each team to win. These were not collected in this experiment because I did not know how important they might be and, at the beginning of the project, was unsure where to find them. In addition, as the PDO used in the second half of the project was the PDO generated at all states of the game, I would like to try PDOn calculated only from shots and goals when the team is at even strength.
There are a few applications I would like to try this classifier on.
The first is gambling, to see whether it is likely to generate profits over the long term. Another is using the classifier to predict playoff winners. In a best-of-7 series you are more likely to see the stats regress to the norm, whereas in a one-off game luck accounts for 38% of the result, so I would hypothesize that this classifier would be more accurate in the playoffs.
The biggest lesson learned from this project was the amount of work required to collect the data. The data had to be collected daily because historical data is not available: for many of these stats it is only possible to see their current values. Each day the data had to be checked to ensure that it had been collected, that the values were appropriate and that nothing was missing. The Python script that automated much of this helped a lot, but daily checks over the 10 weeks were still required to validate that the data was correct.
While this project has been successful, there is still much work to be done to ensure that the result was not a one-off caused by luck in the data.
References
Cam Charron. Breaking news: Puck-possession is important (and nobody told the cbc). http://blogs.thescore.com/nhl/2013/02/25/breaking-news-puck-possession-is-important-and-nobody-told-the-cbc/, 2013. [Online; accessed 12-April-2013].
Cam Charron. Pdo explained. http://blogs.thescore.com/nhl/2013/01/21/pdo-explained/, 2013. [Online; accessed 12-April-2013].
Patrick D. Studying luck other factors in pdo. http://nhlnumbers.com/2013/1/10/studying-luck-other-factors-in-pdo, 2013. [Online; accessed 12-April-2013].
Thomas Drance. Drance numbers: Which canucks’ defender suppresses shots most effectively? http://vansunsportsblogs.com/2012/03/09/drance-numbers-which-canucks-defender-suppresses-shots-most-effectively/, 2012. [Online; accessed 12-April-2013].
Robert B Gramacy, Matthew A Taddy, and Shane T Jensen. Estimating player contribution in hockey with regularized logistic regression. arXiv preprint arXiv:1209.5026, 2012.
Hawerchuck. Luck in the nhl standings. http://www.arcticicehockey.com/2010/11/22/1826590/luck-in-the-nhl-standings, 2010. [Online; accessed 12-April-2013].
Blake Murphy. Exploring marginal save percentage and if the canucks should trade a goalie. http://www.nucksmisconduct.com/2013/2/13/3987546/exploring-marginal-save-percentage-and-if-the-canucks-should-trade-a, 2013. [Online; accessed 12-April-2013].
Jim Warner. Predicting margin of victory in nfl games: Machine learning vs. the las vegas line. 2010.