
NCAA TOURNAMENT PREDICTION

PROJECT

GROUP 10: WEIHAO ZHANG/ SI YANG/ RAN DING


DATE
APRIL 16 2014

COURSE
INFSCI 2160 DATA MINING
Problem
Predict Basketball Tournament Results Before the Tournament Starts

Target: National Collegiate Athletic Association (NCAA) Men's Division I Basketball Championship

Total Number of Teams: 68

Number of Matches: 67, about 1-2% of the whole season's matches (Large Train and Small Test Problem)
Our Project
Data: Collected thousands of records for the 2014 and 2015 seasons

Method: Linear Regression, Logistic Regression (plus K-nearest neighbor)

Number of Predictions: 2278 possible matchups

Evaluation: Based on the results of the 67 actual matches
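The figure of 2278 is the number of unordered pairs among 68 teams. A quick check, where `NUM_TEAMS` and the integer team ids are names made up for this illustration:

```python
from itertools import combinations
from math import comb

NUM_TEAMS = 68  # teams in the tournament field

# Every unordered pair of teams is one possible matchup to predict.
possible_matchups = comb(NUM_TEAMS, 2)

# Enumerating pairs of hypothetical team ids gives the same count.
team_ids = range(1, NUM_TEAMS + 1)
all_pairs = list(combinations(team_ids, 2))

print(possible_matchups)  # 2278
print(len(all_pairs))     # 2278
```

Only 67 of these pairs are actually played, so the evaluation covers a small part of what is predicted.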
Baseline: Random Guess

We assume that in the worst case, even a monkey can randomly pick the result of a game (though a monkey is smarter than that)

The accuracy of random guessing would be around 0.5
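A small simulation illustrates the 0.5 baseline. This is only a sketch: it assumes each guess is an independent coin flip, and the function name, trial count, and seed are chosen here for illustration.

```python
import random

def random_guess_accuracy(num_games=67, trials=10_000, seed=0):
    """Average accuracy of coin-flip picks over many simulated tournaments."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total_correct = 0
    for _ in range(trials):
        # Each pick is right with probability 0.5, whatever the true winner is.
        total_correct += sum(rng.random() < 0.5 for _ in range(num_games))
    return total_correct / (trials * num_games)

mean_accuracy = random_guess_accuracy()
print(round(mean_accuracy, 3))
```

With only 67 games, a single random bracket can land well above or below 0.5, so small accuracy gaps between models should be read with care.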
Performance Comparison
Model
Initial Formula: Consider the difference between Team 1 and Team 2, using the expert pre-tournament rating
$Y = \beta (R_1 - R_2)$

If Y > 0, predict that Team 1 beats Team 2; if Y < 0, predict the reverse

Accuracy:
For 2014 Data: 23 Wrong / 44 Correct, 0.656716
For 2015 Data: 19 Wrong / 48 Correct, 0.716418
(Better Than Random Guess)
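The decision rule and the accuracy figures can be sketched as follows. The ratings in `toy_games` are made up, `beta` is a placeholder positive coefficient, and sending a tie (Y = 0) to Team 2 is an arbitrary choice for this sketch.

```python
def predict_winner(rating_1, rating_2, beta=1.0):
    """Return 1 if Team 1 is predicted to win (Y > 0), otherwise 2."""
    y = beta * (rating_1 - rating_2)
    return 1 if y > 0 else 2

def accuracy(games):
    """games: list of (rating_1, rating_2, actual_winner) tuples."""
    correct = sum(predict_winner(r1, r2) == winner for r1, r2, winner in games)
    return correct / len(games)

# Made-up ratings on the 0-1 scale, for illustration only.
toy_games = [
    (0.95, 0.80, 1),  # higher-rated team wins: predicted correctly
    (0.85, 0.88, 2),  # higher-rated team wins: predicted correctly
    (0.90, 0.86, 2),  # upset: predicted incorrectly
]
print(accuracy(toy_games))

# The reported accuracies follow from the counts of correct picks out of 67.
print(round(44 / 67, 4), round(48 / 67, 4))  # 0.6567 0.7164
```

The third toy game shows the weak spot of this rule: when ratings are close, the sign of the difference carries little information.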
Team Performance
Analysis

[Figure panels: 2014, 2015]
Ratings do not follow a normal distribution
A few teams (5) are the strongest
A few teams (5) are the weakest
Most teams have a rating of 0.8-0.9 (ratings range from 0 to 1)
Game Start: Our Real
Challenge
Linear strength comparison tends to be wrong when the two teams do not differ clearly in their ratings

Most incorrect predictions happened in the first round, 68 -> 32 (10/19, 53% for 2015), and in games among the last 8 teams (5/19, 26% for 2015)

Needed: more accurate ratings

Needed: a method specific to matchups between similarly rated teams


Data For Model 1
Kaggle and KenPom Blog

Collected data on 351 teams with their regular season match data (6000 records)

And each team’s performance indexes (7 types of data) to construct a team efficiency matrix (based on a sports specialist journal)

2014 data for training (regular season) and testing (tourney), 2015 data for prediction
Logistic Regression Team
Efficiency Matrix: Model
$\operatorname{logit} \Pr(Y_i = 1) = \beta_0 + \beta_1 \cdot X_i$

Initially, we used 15 features to train and test the model: rating, offensive, offensive (adj), defensive, defensive (adj), tempo, and tempo (adj) for each of the two teams (7 x 2 = 14), plus a neutral marker (either home or away), for 15 features in total
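A minimal sketch of this setup, under stated assumptions: the feature names are hypothetical labels for the 15 inputs, the fit uses a single made-up difference feature rather than all 15, and plain gradient descent stands in for whatever fitting tool was actually used.

```python
import math

# Hypothetical labels: 7 statistics per team, for 2 teams, plus one marker.
TEAM_STATS = ["rating", "offensive", "offensive_adj", "defensive",
              "defensive_adj", "tempo", "tempo_adj"]
feature_names = [f"{s}_team{t}" for t in (1, 2) for s in TEAM_STATS] + ["neutral"]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit logit Pr(y = 1) = b0 + b1 * x by batch gradient descent."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y  # gradient of the log loss
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1

# Made-up data: x is Team 1 minus Team 2 on one efficiency measure,
# y is 1 if Team 1 won the game.
xs = [-8.0, -5.0, -3.0, -1.0, 1.0, 2.0, 4.0, 7.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)

print(len(feature_names))                # 15
print(round(sigmoid(b0 + b1 * 5.0), 3))  # Pr(Team 1 wins) at x = 5
```

A positive `b1` means a larger advantage for Team 1 raises its predicted win probability, which is how coefficients can be read when judging which features matter.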
Logistic Regression Team
Efficiency Matrix: Result
After testing several times, we confirmed that 4 features covering both teams are the most critical for the prediction result (based on observing their coefficients in the logistic regression)

Final model variables: Offensive (adj) for the two teams and Defensive (adj) for the two teams, plus the neutral marker
Logistic Regression Team
Efficiency Matrix: Evaluation
It’s worse than we expected

This may be due to problems that occurred during team matching

Also, the tournament is a very small portion of the whole season's matches, so a model trained well on the regular season may not properly reflect the tournament
K Nearest Neighbor to Adjust Results for Two
Teams with Similar Performance
We want to use a matchup-specific adjustment to improve the prediction when two teams are very close in performance measures.

So we collected 22 features for each team (351 teams in total) to represent each team as a unique point in a high-dimensional space, and found each team’s 5 closest neighbors

Use the matchup to adjust our performance comparison result
Challenge

Each season, a team plays 30-40 regular season matches, but there are 5000 matches in total.

As a result, a team's 5 closest neighbors may never have played a specific opponent

We relaxed the condition and searched for the closest neighbors only among the competitors that one team has already met.
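The relaxed search can be sketched like this. Everything here is illustrative: the team labels, the 2-feature vectors (in place of the 22 features), the opponent list, and the use of Euclidean distance, which is an assumption about the distance measure.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(target, features, candidates, k=5):
    """Return the k teams in `candidates` closest to `target` in feature space."""
    pool = [t for t in candidates if t != target]
    pool.sort(key=lambda t: euclidean(features[target], features[t]))
    return pool[:k]

# Made-up 2-feature vectors for six hypothetical teams.
features = {
    "A": (0.90, 0.70), "B": (0.50, 0.50), "C": (0.52, 0.48),
    "D": (0.10, 0.90), "E": (0.55, 0.60), "F": (0.45, 0.50),
}
# Hypothetical list of opponents that team A has already played.
opponents_of_A = ["C", "D", "E"]

# Unrestricted search: B's closest teams overall.
print(nearest_neighbors("B", features, features.keys(), k=2))  # ['C', 'F']

# Relaxed search: B's closest teams among those A has met,
# so a game result between A and each neighbor exists.
neighbors = nearest_neighbors("B", features, opponents_of_A, k=2)
print(neighbors)  # ['C', 'E']
```

The trade-off is visible in the output: F is closer to B than E is, but A never played F, so E takes its place.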
Nearest Neighbor Matchup

If team A beats B’s neighbors on average score
Then we believe A will have an advantage when facing B

The model calculates the point differential between the two teams in a pair
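One simple reading of this idea, as a sketch: take A's average scoring margin against B's neighbors. The `margins` table and team labels are made up, and returning 0.0 when no such games exist is a choice made for this illustration.

```python
def matchup_effect(team, neighbors_of_opponent, margins):
    """Average scoring margin of `team` against the opponent's neighbors.

    margins[(x, y)] is x's points minus y's points in a game already played.
    """
    played = [margins[(team, n)] for n in neighbors_of_opponent
              if (team, n) in margins]
    if not played:
        return 0.0  # no games against the neighbors: no adjustment
    return sum(played) / len(played)

# Made-up results: A beat C by 8, lost to E by 2, beat D by 12.
margins = {("A", "C"): 8, ("A", "E"): -2, ("A", "D"): 12}

# Suppose B's neighbors are C and E.
effect = matchup_effect("A", ["C", "E"], margins)
print(effect)  # 3.0
```

A positive value suggests A tends to do well against teams that resemble B.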
Performance Comparison Plus
Nearest Neighbor Matchup: Model
$y_{i,j} \mid \sigma_{i,j} = \alpha_{home} + D(P_i - P_j) + \sigma_{i,j} + \epsilon_{i,j}$

α: Home advantage of a team (not used in the tournament)

D: Difference between the two teams in performance

σ: Matchup result between team i and team j

ϵ: Error

The result is the point spread of the match, which can be converted to a probability by logistic regression
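A sketch of how these terms combine and how a spread could be mapped to a win probability. The error term is left out, and `scale`, `b0`, and `b1` are placeholder values, not fitted coefficients.

```python
import math

def predicted_spread(perf_i, perf_j, matchup_ij, home_adv=0.0, scale=1.0):
    """Point spread for team i against team j.

    home_adv stays 0.0 for tournament games, where home advantage is not used.
    """
    return home_adv + scale * (perf_i - perf_j) + matchup_ij

def win_probability(spread, b0=0.0, b1=0.15):
    """Logistic curve mapping a point spread to Pr(team i wins)."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * spread)))

# Made-up inputs: performance measures 4.0 and 2.5, matchup effect 3.0.
spread = predicted_spread(perf_i=4.0, perf_j=2.5, matchup_ij=3.0)
prob = win_probability(spread)

print(spread)          # 4.5
print(round(prob, 3))
```

With `b0 = 0`, a spread of 0 maps to a probability of 0.5, and swapping the two teams flips the probability to its complement.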
Performance Comparison Plus
Nearest Neighbor Matchup:
Evaluation

It has a relatively good test score of 0.7431 on the 2014 data

However, when predicting the 2015 tournament, the accuracy is lower, at 0.7233
Discussion I
Nearest Neighbor is not perfect, since many teams overlap with each other, and that weakened our matchup improvement

And each year’s team performance and neighbors differ greatly

[Figures: 2015 Team Cluster, 2014 Team Cluster]

Discussion II

Since adding K-nearest neighbor provides a more accurate prediction, it shows that the model with KNN is more complete for prediction than simply using performance comparison

Using team feature data alone may not be better than the experts' conclusions
Future Work &
Improvement
Train and test our model on a broader dataset (with more historical data)

Use a similar model on other basketball competitions

Develop a more efficient method to process and filter data

And hope the Panthers have better luck next year
Reference
Lopez, Michael J., and Gregory Matthews. "Building an NCAA men's basketball predictive model and quantifying its success." arXiv preprint arXiv:1412.0248 (2014).

Hoegh, Andrew, et al. "Nearest-neighbor matchup effects: accounting for team matchups for predicting March Madness." Journal of Quantitative Analysis in Sports 11.1 (2015): 29-37.

Pomeroy, K. 2012. “Ratings Glossary.” (http://kenpom.com/blog/index.php/weblog/entry/ratings_glossary), accessed June 18, 2014.
