Project Presentation 10

NCAA TOURNAMENT PREDICTION
PROJECT
GROUP 10: WEIHAO ZHANG/ SI YANG/ RAN DING

DATE: APRIL 16 2014
COURSE: INFSCI 2160 DATA MINING
Problem
Predict basketball tournament results before the tournament starts
Target: National Collegiate Athletic Association (NCAA) Men's Division I Basketball Championship
Total number of teams: 68
Number of matches: 67, about 1-2% of the whole season's matches (a large-train, small-test problem)
Our Project
Data: thousands of records collected for the 2014 and 2015 seasons
Method: linear regression and logistic regression (plus k-nearest neighbor)
Number of predictions: 2,278 possible matchups (every pair among the 68 teams)
Evaluation: based on the results of the 67 actual matches
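The count of 2,278 is the number of unordered pairs among 68 teams. A minimal sketch of that arithmetic, using placeholder team IDs (an assumption, not the real tournament field):

```python
from itertools import combinations
from math import comb

# Placeholder IDs standing in for the 68 tournament teams (hypothetical).
teams = list(range(1, 69))

# Every unordered pair of teams is one possible matchup to predict.
matchups = list(combinations(teams, 2))

print(len(matchups))  # 2278
print(comb(68, 2))    # 2278
```

Only 67 of these pairs are actually played, which is why the evaluation uses far fewer games than the number of predictions.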
Baseline: Random Guess
We assume that in the worst case, even a monkey can pick the result of a game at random (though a monkey is smarter than that)
The accuracy of random guessing would be around 0.5
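The 0.5 baseline can be checked with a quick simulation. The sketch below assumes each game is an independent coin flip; the function name and the number of simulated tournaments are illustrative choices, not part of the project.

```python
import random

def random_guess_accuracy(n_games=67, n_trials=1000, seed=0):
    """Average accuracy of coin-flip picks over simulated 67-game tournaments."""
    rng = random.Random(seed)
    total = n_games * n_trials
    correct = 0
    for _ in range(total):
        truth = rng.randint(0, 1)  # which side actually won (coded 0/1)
        guess = rng.randint(0, 1)  # the random pick
        correct += (truth == guess)
    return correct / total

accuracy = random_guess_accuracy()
print(round(accuracy, 2))
```

Any model worth keeping should clear this value on the 67 tournament games.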
Performance Comparison: Model
Initial formula: consider the difference between Team 1 and Team 2, using their expert pre-tournament ratings
$Y = \beta (R_1 - R_2)$, where $R_1$ and $R_2$ are the ratings of Team 1 and Team 2
If Y > 0, predict that Team 1 beats Team 2; if Y < 0, predict the reverse
Accuracy:
For 2014 data: 44 correct / 23 wrong, accuracy 0.6567
For 2015 data: 48 correct / 19 wrong, accuracy 0.7164
(Better than random guessing)
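The decision rule and the accuracy figures can be sketched in a few lines. The ratings in `toy_games` are made up for illustration, and breaking ties in favor of Team 2 is an assumption; only the 44/67 and 48/67 counts come from the results above.

```python
def predict_winner(rating_1, rating_2):
    """Return 1 if Team 1 is predicted to win, otherwise 2 (ties go to Team 2 here)."""
    y = rating_1 - rating_2  # only the sign of the rating difference matters
    return 1 if y > 0 else 2

def accuracy(games):
    """games: list of (rating_1, rating_2, actual_winner) tuples."""
    correct = sum(predict_winner(r1, r2) == winner for r1, r2, winner in games)
    return correct / len(games)

# Made-up ratings on the 0-1 scale, for illustration only.
toy_games = [
    (0.95, 0.80, 1),
    (0.82, 0.88, 2),
    (0.85, 0.84, 2),  # an upset: the lower-rated team won
    (0.70, 0.90, 2),
]
print(accuracy(toy_games))                   # 0.75
print(round(44 / 67, 4), round(48 / 67, 4))  # 0.6567 0.7164
```

The upset row shows the weakness of this rule: when two ratings are nearly equal, the sign of the difference carries little information.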
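Team Performance Analysis
[Figure: team ratings, panels 2014 and 2015]
The ratings do not follow a normal distribution
Only a few teams (5) are clearly the strongest
Only a few teams (5) are clearly the weakest
Most teams are rated 0.8-0.9 (ratings range from 0 to 1)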
Game Start: Our Real Challenge
The linear strength comparison tends to be wrong when the two teams' ratings are not clearly different
Most incorrect predictions happened in the first round, 68 -> 32 (10/19, about 53% for 2015), and in the games among the last 8 teams (5/19, about 26% for 2015)
Accurate ratings
An approach specific to matchups between similarly rated teams

Data For Model 1
Sources: Kaggle and the KenPom blog
Collected data for 351 teams, including their regular season match data (6,000 records)
Also collected each team's performance indexes (7 types of data) to construct the team efficiency matrix (based on a sports specialist journal)
2014 data for training (regular season) and testing (tourney); 2015 data for prediction
Logistic Regression Team Efficiency Matrix: Model
$\operatorname{logit}\,\Pr(Y_i = 1) = \beta_0 + \beta_1 \cdot x_i$, where $x_i$ stands for the team-efficiency predictor of game $i$
Initially, we trained and tested the model with 15 features: rating, offensive, offensive (adj), defensive, defensive (adj), tempo and tempo (adj) for each of the two teams (7 x 2 = 14), plus a neutral marker (either home or away)
Efficiency Matrix: Result
After several rounds of testing, we confirmed that 4 features, covering both teams, are most critical to the prediction result (based on their coefficients in the logistic regression)
Final model variables: Offensive (adj) for the two teams and Defensive (adj) for the two teams, plus the neutral marker
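A logistic regression on the five final variables could look roughly like the sketch below. It is not the project's code: it fits the model with plain batch gradient descent in pure Python, and the feature values, learning rate and epoch count are all made-up assumptions (the values are centered, and a lower defensive number is treated as better).

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=1.0, epochs=3000):
    """Fit an intercept and weights by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    b, w = 0.0, [0.0] * d
    for _ in range(epochs):
        grad_b, grad_w = 0.0, [0.0] * d
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        b -= lr * grad_b / n
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
    return b, w

def win_probability(b, w, x):
    """Estimated probability that team 1 beats team 2 for feature row x."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

# Columns: adj. offense 1, adj. defense 1, adj. offense 2, adj. defense 2, neutral marker.
# Hypothetical centered values, not real KenPom data.
X = [
    [0.15, -0.05, 0.00, 0.05, 1],
    [0.00, 0.05, 0.12, -0.04, 1],
    [0.10, 0.00, 0.02, 0.02, 0],
    [-0.02, 0.04, 0.08, -0.02, 0],
    [0.05, -0.03, -0.04, 0.06, 1],
    [-0.05, 0.02, 0.06, -0.06, 1],
]
y = [1, 0, 1, 0, 1, 0]  # 1 = team 1 won

b, w = fit_logistic(X, y)
strong_vs_weak = win_probability(b, w, [0.15, -0.05, -0.05, 0.05, 1])
weak_vs_strong = win_probability(b, w, [-0.05, 0.05, 0.15, -0.05, 1])
print(strong_vs_weak > weak_vs_strong)
```

Inspecting the fitted `w` is the kind of coefficient check described above, although comparing coefficient sizes is only meaningful when the features share a common scale.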
Efficiency Matrix: Evaluation
The result is worse than we expected
This may be due to problems that occurred during team matching
Also, the tournament is a very small portion of the whole season's matches, so a model trained well on the regular season may not properly reflect the tournament
K-Nearest Neighbor to Adjust Results for Two Teams with Similar Performance
We want to use a specific matchup adjustment to improve the prediction when two teams are very close in performance measures.
So we collected 22 features for each team (351 teams in total) to represent each team as a unique point in a high-dimensional space, and found each team's 5 closest neighbors
We use the matchup to adjust our performance comparison result
Challenge
Each season, a team plays 30-40 regular season matches, but there are 5,000 matches in total.
As a result, a team's 5 closest neighbors may never have played a specific opponent
We relaxed our condition and searched for the closest neighbors only among the opponents a team has already met.
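The relaxed neighbor search can be sketched as follows. The Euclidean distance, the two-dimensional feature vectors and the team labels are assumptions for illustration; the project used 22 features per team and k = 5.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_met_neighbors(target, met_opponents, features, k=5):
    """Up to k teams most similar to `target`, chosen only from `met_opponents`."""
    ranked = sorted(
        (team for team in met_opponents if team != target),
        key=lambda team: euclidean(features[target], features[team]),
    )
    return ranked[:k]

# Hypothetical 2-feature vectors for six teams.
features = {
    "A": (0.90, 0.80), "B": (0.50, 0.50), "C": (0.52, 0.48),
    "D": (0.10, 0.90), "E": (0.51, 0.50), "F": (0.80, 0.20),
}
# Teams that A has actually played this season (hypothetical).
opponents_of_a = ["C", "D", "F"]

# B's closest team overall is E, but A never played E, so E cannot be used.
print(nearest_met_neighbors("B", opponents_of_a, features, k=2))  # ['C', 'F']
```

Restricting the candidates guarantees that every neighbor comes with at least one observed game, at the cost of sometimes using less similar teams.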
Nearest Neighbor Matchup
If team A beats B's neighbors on average score, then we believe A will have an advantage when facing B
The point differential model is calculated for each pair of teams
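One way to read the rule above is as an average point differential of A against the teams that resemble B. The sketch below assumes that reading; the `results` layout, the team labels and the fallback of 0.0 when no games exist are illustrative choices.

```python
def matchup_effect(team_a, neighbors_of_b, results):
    """Mean point differential of team_a in its games against B's neighbors.

    results maps (team, opponent) to a list of point differentials seen from
    the team's side. Returns 0.0 if team_a played none of the neighbors.
    """
    diffs = [d for n in neighbors_of_b for d in results.get((team_a, n), [])]
    return sum(diffs) / len(diffs) if diffs else 0.0

# Hypothetical game results: A beat C by 8, lost to C by 2, and beat F by 6.
results = {("A", "C"): [8, -2], ("A", "F"): [6]}

print(matchup_effect("A", ["C", "F"], results))  # 4.0
```

A positive value suggests A tends to do well against teams like B, which is the signal used to adjust the plain performance comparison.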
Performance Comparison Plus
Nearest Neighbor Matchup: Model
$Y_{i,j} \mid X_{i,j} = \alpha_{\text{home}} + D(X_i - X_j) + \sigma_{i,j} + \epsilon_{i,j}$
α: home advantage of a team (not used in the tournament)
D: difference between the two teams in performance
σ: matchup result between team i and team j
ϵ: error
The result is the point spread of the match, which can be converted to a probability by logistic regression
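The pieces of this model can be combined as in the sketch below. It leaves out the error term (so it gives an expected spread), and the `scale` parameter is a hypothetical stand-in for a coefficient that would in practice be fitted by logistic regression of game outcomes on the spread.

```python
import math

def predicted_spread(perf_i, perf_j, matchup_ij, home_adv=0.0):
    """Expected point spread for team i over team j (error term omitted)."""
    return home_adv + (perf_i - perf_j) + matchup_ij

def win_probability(spread, scale=0.15):
    """Map a point spread to a win probability with a logistic curve.

    `scale` is a made-up value used only for illustration.
    """
    return 1.0 / (1.0 + math.exp(-scale * spread))

# Made-up inputs: performance measures of 12.0 and 9.5, matchup effect of 1.5.
# Home advantage stays at 0.0 because tournament games are at neutral sites.
spread = predicted_spread(12.0, 9.5, 1.5)
print(spread)  # 4.0
print(win_probability(spread) > 0.5)
```

A spread of zero maps to a probability of exactly 0.5, and swapping the two teams flips the sign of the spread and gives the complementary probability.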
Performance Comparison Plus
Nearest Neighbor Matchup:
Evaluation
It has a relatively good test score of 0.7431 on 2014 data
However, when predicting the 2015 tournament, the accuracy is lower, at 0.7233
Discussion I
Nearest neighbor is not perfect, since many teams overlap with each other, and that limited our matchup improvement
Also, each year's team performance and neighbors differ greatly
[Figures: 2015 Team Cluster; 2014 Team Cluster]

Discussion II
Since adding k-nearest neighbor provides a more accurate prediction, it suggests that KNN gives a more complete prediction than performance comparison alone
Using team feature data alone may not be better than the experts' conclusions
Future Work &
Improvement
Train and test our model on a broader dataset (with more historical data)
Use a similar model on other basketball competitions
Develop a more efficient method to process and filter data
And hope the Panthers have better luck next year
Reference
Lopez, Michael J., and Gregory Matthews. "Building an NCAA mens basketball predictive model and quantifying its success." arXiv preprint arXiv:1412.0248 (2014).
Hoegh, Andrew, et al. "Nearest-neighbor matchup effects: accounting for team matchups for predicting March Madness." Journal of Quantitative Analysis in Sports 11.1 (2015): 29-37.
Pomeroy, K. "Ratings Glossary." (2012). http://kenpom.com/blog/index.php/weblog/entry/ratings_glossary, accessed June 18, 2014.
