
ACM KDD Cup: A Survey, 1997-2011

Qiang Yang
(partly based on Xinyue Liu's slides @SFU, and Nathan Liu's slides @HKUST)

Hong Kong University of Science and Technology

About KDD Cup (1997-2011)

Competition is a strong mover for Science and Engineering:

ACM Programming Contest

World College level Programming skills


World Robotics Competition


ACM KDD: Premier conference in knowledge discovery and data mining

ACM KDDCUP: Worldwide competition in conjunction with ACM KDD conferences.

It aims at:

showcasing the best methods for discovering higher-level knowledge from data
helping to close the gap between research and industry
stimulating further KDD research and development


Participation in KDD Cup grew steadily

Average person-hours per submission: 204
Max person-hours per submission: 910

Year    Submissions
97      16
98      21
99      24
2000    30
2005    32
2011    1000+


Algorithms (up to 2000)

KDD Cup 97

A classification task to predict direct mail response in the financial services industry

Winners

Charles Elkan, a Prof from UC-San Diego, with his Boosted Naive Bayesian (BNB)
Silicon Graphics, Inc. with their software MineSet
Urban Science Applications, Inc. with their software gain, Direct Marketing Selection System

MineSet (Silicon Graphics Inc.)

A KDD tool that combines data access, transformation, classification, and visualization.

KDD Cup 98: CRM Benchmark

URL: d98/kdd-cup-98.html

A classification task to analyze fund raising mail responses to a non-profit organization

Winners

Urban Science Applications, Inc. with their software GainSmarts
SAS Institute, Inc. with their software SAS Enterprise Miner
Quadstone Limited with their software Decisionhouse

KDDCUP 1998 Results

[Figure: profit chart with axes running from $0 to $70,000 and from 0% to 100%. Reference lines: Maximum Possible Profit Line ($72,776 in profits with 4,873 mailed) and Mail to Everyone Solution ($10,560 in profits with 96,367 mailed). Curves shown: SAS/Enterprise Miner, Quadstone/Decisionhouse.]
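The 1998 results are reported as profit against the number of people mailed. A minimal sketch of how such a profit curve could be computed from model scores is given here. The function names, the per-mail cost of 0.68, and the toy data are assumptions made for illustration; none of them come from the slides.

```python
def profit_curve(scores, donations, cost_per_mail=0.68):
    """Cumulative profit when mailing people in descending score order.

    scores[i]    -- model score for person i (higher means mail earlier)
    donations[i] -- amount person i gives if mailed (0 if no response)
    cost_per_mail is an assumed unit cost, not a figure from the slides.
    Entry k of the result is the profit from mailing the top k+1 people.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    curve, total = [], 0.0
    for i in order:
        total += donations[i] - cost_per_mail
        curve.append(total)
    return curve


def best_cutoff(curve):
    """Return (number_mailed, profit) at the peak of the curve."""
    k = max(range(len(curve)), key=lambda j: curve[j])
    return k + 1, curve[k]


# Toy data, invented for illustration only.
scores = [0.9, 0.1, 0.7, 0.3, 0.5]
donations = [10.0, 0.0, 0.0, 5.0, 0.0]
curve = profit_curve(scores, donations)
mailed, profit = best_cutoff(curve)
mail_everyone_profit = curve[-1]
```

The last entry of the curve plays the role of the "Mail to Everyone Solution", and the peak of the curve is the cutoff a model would choose; a perfect ranking would trace the "Maximum Possible Profit Line".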

ACM KDD Cup 1999

URL: kdresults.html

Problem: To detect network intrusion and protect a computer network from unauthorized users, including perhaps insiders

Data: from DoD

Winners

SAS Institute Inc. with their software Enterprise Miner
Amdocs with their Information Analysis Environment


KDDCUP 2000: Data Set and Goal:

Data collected from a legwear and legcare Web retailer

Pre-processed
Training set: 2 months
Test sets: one month

Data collected includes:

Click streams
Order information

The goal: to design models to support website personalization and to improve the profitability of the site by increasing customer response.

Questions - When given a set of page views,

characterize heavy spenders
characterize killer pages
characterize which product brand a visitor will view in the remainder of the session


KDDCUP 2000: The Winners

Question 1 & 5 Winner: Amdocs
Question 2 & 3 Winner: Salford Systems
Question 4 Winner: esteam


KDD Cup 2001

3 Bioinformatics Tasks

Dataset 1: Prediction of Molecular Bioactivity for Drug Design

half a gigabyte when uncompressed

Dataset 2: Prediction of Gene/Protein Function (task 2) and Localization (task 3)

Dataset 2 is smaller and easier to understand
7 megabytes uncompressed

A total of 136 groups participated to produce a total of 200 submitted predictions over the 3 tasks: 114 for Thrombin, 41 for Function, and 45 for Localization.


2001 Winners

Task 1, Thrombin: Jie Cheng (Canadian Imperial Bank of Commerce).

Bayesian network learner and classifier

Task 2, Function: Mark-A. Krogel (University of Magdeburg).

Inductive Logic programming

Task 3, Localization: Hisashi Hayashi, Jun Sese, and Shinichi Morishita (University of Tokyo).

K nearest neighbor

Dataset 2 (Tasks 2 and 3):

the genes of one particular type of organism
A gene/protein can have more than one function, but only one localization.


KDD Cup 2002

Molecular biology: two tasks

Task 1: Document extraction from biological articles
Task 2: Classification of proteins based on gene deletion experiments

Winners

Task 1: ClearForest and Celera, USA

Yizhar Regev and Michal Finkelstein

Task 2: Telstra Research Laboratories, Australia

Adam Kowalczyk and Bhavani Raskutti



KDD Cup 2003

Information Retrieval/Citation Mining of Scientific research papers

based on a very large archive of research papers
First Task: predict how many citations each paper will receive during the three months leading up to the KDD 2003 conference
Second Task: build a citation graph of a large subset of the archive from only the LaTeX sources
Third Task: estimate each paper's popularity based on partial download logs
Last Task: participants devise their own questions

2003 KDDCUP: Results

Task 1: Claudia Perlich, Foster Provost, Sofus Macskassy (New York University)

Task 2: 1st place: David Vogel (AI Insight Inc.)

Task 3: Janez Brank and Jure Leskovec (Jozef Stefan Institute, Slovenia)

Task 4: Amy McGovern, Lisa Friedland, Michael Hay, Brian Gallagher, Andrew Fast, Jennifer Neville, and David Jensen (University of Massachusetts Amherst, USA)

2004 Tasks and Results

Particle physics; plus protein homology prediction

David S. Vogel, Eric Gottschalk, and Morgan C. Wang
Bernhard Pfahringer
Yan Fu, RuiXiang Sun, Qiang Yang, Simin He, Chunli Wang, Haipeng Wang, Shiguang Shan, Junfa Liu, Wen Gao

Past KDDCUP Overview: 2005-2010

2005
Host: Microsoft
Task: Web query categorization
Technique: Feature Engineering, Ensemble
Winner: HKUST

2006
Host: Siemens
Task: Pulmonary emboli detection
Technique: Multi-instance, Non-IID sample, Cost sensitive, Class Imbalance, Noisy data
Winner: AT&T, Budapest University of Technology & Economics

2007
Task: Consumer recommendation
Technique: Collaborative Filtering, Time series, Ensemble
Winner: IBM Research, Hungarian Academy of Sciences

2008
Task: Breast cancer detection from medical images
Technique: Ensemble, Class imbalance, Score calibration
Winner: IBM Research, National Taiwan University

2009
Task: Customer relationship prediction in telecom
Technique: Feature selection, Ensemble
Winner: IBM Research, University of Melbourne

2010
Host: PSLC Data Shop
Task: Student performance prediction in ELearning
Technique: Feature engineering, Ensemble, Collaborative filtering
Winner: National Taiwan University (CJ Lin, S. Lin, etc.)

KDDCUP11 Dataset

11 years of data
Rated items are Tracks, Albums, Artists, Genres
Items arranged in a taxonomy
Two tasks

           #ratings    #items
Track 1    263M        625K
Track 2    63M         296K




Items in a Taxonomy

Track 1 Details

Track 1 Highlights

Largest publicly available dataset
Large number of items (50 times more than Netflix)
Extreme rating sparsity (20 times more sparse than Netflix)
Taxonomy can help in combating sparsely rated items.
Fine time stamps with both date and time allow sophisticated temporal modeling.

Track 2 Details

Track 2 Highlights

Performance metric focuses on ranking/classification, which differs from traditional collaborative filtering.
No validation data provided; participants need to self-construct binary labeled data from rating data.
Unlike track 1, track 2 removed time stamps to focus more on long-term preference than on short-term behavior.
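Since Track 2 provided no validation data, binary labels had to be derived from the ratings themselves. One way this could be done is sketched here, under stated assumptions: ratings at or above a threshold count as positives, and negatives are drawn from items the user never rated, with probability proportional to how often each item is highly rated. The threshold of 80, the sampling scheme, and the function name are illustrative choices, not details taken from the slides.

```python
import random
from collections import Counter


def build_binary_validation(ratings, threshold=80, seed=0):
    """Turn (user, item, rating) triples into (user, item, label) examples.

    Positives: items the user rated at or above `threshold`.
    Negatives: items the user never rated, sampled (with replacement) in
    proportion to how many high ratings the item received overall.
    Both the threshold and the sampling scheme are illustrative assumptions.
    """
    rng = random.Random(seed)
    rated = {}               # user -> set of every item the user rated
    positives = {}           # user -> list of highly rated items
    popularity = Counter()   # item -> number of high ratings received
    for user, item, rating in ratings:
        rated.setdefault(user, set()).add(item)
        if rating >= threshold:
            positives.setdefault(user, []).append(item)
            popularity[item] += 1

    examples = []
    for user, pos_items in positives.items():
        examples.extend((user, item, 1) for item in pos_items)
        candidates = [i for i in popularity if i not in rated[user]]
        if candidates:
            weights = [popularity[i] for i in candidates]
            sampled = rng.choices(candidates, weights=weights, k=len(pos_items))
            examples.extend((user, item, 0) for item in sampled)
    return examples


# Toy ratings on a 0-100 scale, invented for illustration only.
toy_ratings = [("u1", "a", 90), ("u1", "b", 30), ("u2", "b", 85),
               ("u2", "c", 100), ("u3", "a", 95)]
toy_examples = build_binary_validation(toy_ratings)
```

Each user ends up with as many sampled negatives as positives, so a ranking model can be scored on how well it separates the two groups.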

Submission Stats

Track 1
1st place: National Taiwan University
2nd place: Commendo (Netflix Prize Winner)
3rd place: Hong Kong University of Science and Technology, Shanghai Jiaotong University

Track 2
1st place: National Taiwan University
2nd place: Chinese Academy of Science, Hulu Labs
3rd place: Commendo (Netflix Prize Winner)

Chinese Teams at KDDCUP (NTU, CAS, HKUST)

Key Techniques

Track 1:

Blending of multiple techniques
Matrix factorization models
Nearest neighbor models
Restricted Boltzmann machines
Temporal modeling
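Of the Track 1 techniques listed above, matrix factorization is the easiest to show in a few lines. The following is a minimal sketch of a plain factorization model trained by stochastic gradient descent. It is not any team's actual solution, it leaves out the temporal and taxonomy effects, and all names, hyperparameters, and toy data are assumptions for illustration.

```python
import random


def train_mf(ratings, n_users, n_items, rank=2, lr=0.05, reg=0.02,
             epochs=500, seed=0):
    """Fit rating ~ mean + p_u . q_i by stochastic gradient descent.

    ratings is a list of (user_index, item_index, value) triples.
    All hyperparameter defaults are illustrative, not tuned values.
    """
    rng = random.Random(seed)
    mean = sum(r for _, _, r in ratings) / len(ratings)
    P = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pred = mean + sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            err = r - pred
            for f in range(rank):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return mean, P, Q


def predict(model, u, i):
    """Predicted rating of item i by user u."""
    mean, P, Q = model
    return mean + sum(pu * qi for pu, qi in zip(P[u], Q[i]))


def rmse(model, ratings):
    """Root mean squared error of the model on the given triples."""
    se = sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings)
    return (se / len(ratings)) ** 0.5


# Toy ratings (user, item, value), invented for illustration only.
toy = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
model = train_mf(toy, n_users=3, n_items=3)
train_error = rmse(model, toy)
```

In a blend, predictions from several such models (with different ranks, neighbors, or temporal terms) would be combined by a second-stage learner.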

Track 2:

Importance sampling of negative instances
Taxonomical modeling
Use of pairwise ranking objective functions
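Two of the Track 2 ideas listed above, pairwise ranking objectives and importance sampling of negatives, can be combined in one small sketch. The version here uses a BPR-style logistic pairwise loss over latent factors, which is one common choice and not necessarily what the winning teams used. The function names, hyperparameters, and toy data are assumptions, and the sketch assumes every user has at least one non-liked item with positive sampling weight.

```python
import math
import random


def train_pairwise(positives, n_users, n_items, item_weight, rank=4,
                   lr=0.05, reg=0.01, steps=10000, seed=0):
    """Push score(u, liked item) above score(u, sampled negative item).

    positives   -- dict user_index -> set of liked item indices
    item_weight -- per-item sampling weight for negatives (e.g. popularity)
    """
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_items)]
    users = sorted(u for u in positives if positives[u])
    items = list(range(n_items))
    for _ in range(steps):
        u = rng.choice(users)
        i = rng.choice(sorted(positives[u]))
        # Importance sampling: draw negatives in proportion to item_weight,
        # rejecting items the user actually liked.
        j = rng.choices(items, weights=item_weight, k=1)[0]
        while j in positives[u]:
            j = rng.choices(items, weights=item_weight, k=1)[0]
        x = sum(p * (qi - qj) for p, qi, qj in zip(P[u], Q[i], Q[j]))
        g = 1.0 / (1.0 + math.exp(x))  # derivative of log(sigmoid(x))
        for f in range(rank):
            p, qi, qj = P[u][f], Q[i][f], Q[j][f]
            P[u][f] += lr * (g * (qi - qj) - reg * p)
            Q[i][f] += lr * (g * p - reg * qi)
            Q[j][f] += lr * (-g * p - reg * qj)
    return P, Q


def score(P, Q, u, i):
    """Ranking score of item i for user u."""
    return sum(p * q for p, q in zip(P[u], Q[i]))


# Toy preferences, invented for illustration only.
liked = {0: {0, 1}, 1: {2, 3}}
P, Q = train_pairwise(liked, n_users=2, n_items=4, item_weight=[1, 1, 1, 1])
```

Only the relative order of scores matters for this objective, which matches a metric that asks the model to separate liked from non-liked items and does not ask for the rating value.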


To place on top of KDDCUP requires

Team work
Expertise in domain knowledge as well as mathematical tools
Often done by world famous institutes and companies

Recent trends:

Datasets are increasingly realistic
Participants are increasingly professional
Tasks are increasingly difficult

KDD Cup is an excellent source to learn the state-of-the-art KDD techniques
KDDCUP dataset often becomes the standard benchmark for future research, development and teaching
Top winners are highly regarded and respected

References

Elkan C. (1997). Boosting and Naive Bayesian Learning. Technical Report No. CS97-557, September 1997, UCSD.
Decisionhouse (1998). KDD Cup 98: Quadstone Take Bronze Miner Award. Retrieved March 15, 2001 from ndex.html
Urban Science (1998). Urban Science wins the KDD-98 Cup. Retrieved March 15, 2001 from
Georges, J. & Milley, A. (1999). KDD99 Competition: Knowledge Discovery Contest. Retrieved March 15, 2001 from
Rosset, S. & Inger A. (1999). KDD-Cup 99: Knowledge Discovery In a Charitable Organization's Donor Database. Retrieved March 15, 2001 from
Sebastiani P., Ramoni M. & Crea A. (1999). Profiling your Customers using Bayesian Networks. Retrieved March 15, 2001 from p99.pdf
Inger A., Vatnik N., Rosset S. & Neumann E. (2000). KDD-Cup 2000: Question 1 Winner's Report. Retrieved March 18, 2000 from
Neumann E., Vatnik N., Rosset S., Duenias M., Sasson I. & Inger A. (2000). KDD-Cup 2000: Question 5 Winner's Report. Retrieved March 18, 2000 from
Salford System white papers: Summary talk presented at KDD (2000)