
Assignment

Ha Le Hoai Trung
Topic01: Character recognition (digits) data
• Optical character recognition, and the simpler digit recognition task,
has been the focus of much ML research. We have two datasets on
this topic. The first tackles the more general OCR task, on a small
vocabulary of words: (Note that the first letter of each word was
removed, since these were capital letters that would make the task
harder for you.)

Download dataset.
Topic02: NBA statistics data
• This download contains 2004-2005 NBA and ABA stats for:
-Player regular season stats
-Player regular season career totals
-Player playoff stats
-Player playoff career totals
-Player all-star game stats
-Team regular season stats
-Complete draft history
-coaches_season.txt - nba coaching records by season
-coaches_career.txt - nba career coaching records
Currently all of the regular season
Project ideas:
* Outlier detection on the players: find out who the outstanding players are.
* Predict the game outcome.
Topic03: Precipitation data
• This dataset includes 45 years of daily precipitation data from the
Northwest of the US:

Download Dataset

Project ideas:
* Weather prediction: Learn a probabilistic model to predict rain levels.
* Sensor selection: Where should you place sensors to best predict rain?
Topic04: WebKB
• This dataset contains webpages from 4 universities, labeled with
whether they are professor, student, project, or other pages.
Download Dataset.

Project ideas:
* Learning classifiers to predict the type of webpage from the text.
* Can you improve accuracy by exploiting correlations between pages
that point to each other using graphical models?

Papers:
* http://www-2.cs.cmu.edu/~webkb/.
Topic05: Email Annotation
The datasets provided below are sets of emails. The goal is to identify which
parts of the email refer to a person name. This task is an example of the
general problem area of Information Extraction.
Download Dataset
Project Ideas:
* Model the task as a Sequential Labeling problem, where each email is a
sequence of tokens, and each token can have either a label of "person-
name" or "not-a-person-name".
Papers: http://www.cs.cmu.edu/~einat/email.pdf
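The sequential labeling setup described above can be sketched as follows. This is only a minimal illustration of turning one email into (token, label) pairs; the tokenizer regex, the helper names `label_tokens` and `span_of`, the character-offset annotation format, and the sample sentence are all assumptions, not part of the dataset.

```python
import re

PERSON, OTHER = "person-name", "not-a-person-name"

def label_tokens(text, name_spans):
    """Split text into tokens and label each one.

    name_spans is a list of (start, end) character offsets (end exclusive)
    marking annotated person names. This offset format is an assumption.
    """
    labeled = []
    # Words and single punctuation marks become separate tokens.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        overlaps = any(match.start() < end and match.end() > start
                       for start, end in name_spans)
        labeled.append((match.group(), PERSON if overlaps else OTHER))
    return labeled

def span_of(text, phrase):
    """Hypothetical helper: character span of the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase))

# Made-up example sentence, not taken from the dataset.
email = "Hi Anna, please forward this to Bob Lee."
spans = [span_of(email, "Anna"), span_of(email, "Bob Lee")]
labeled = label_tokens(email, spans)

for token, label in labeled:
    print(f"{token}\t{label}")
```

A sequence model would then be trained on such token/label sequences, one sequence per email.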
Topic06: Netflix Prize Dataset
• The Netflix Prize data set gives 100 million records of the form "user X
rated movie Y a 4.0 on 2/12/05". The data is available here: Netflix Prize.
Project ideas:
• Can you predict the rating a user will give to a movie from the movies that
user has rated in the past, as well as the ratings similar users have given
similar movies?
• Can you discover clusters of similar movies or users?
• Can you predict which users rated which movies in 2006? In other words,
your task is to predict the probability that each pair was rated in 2006.
Note that the actual rating is irrelevant, and we just want whether the
movie was rated by that user sometime in 2006. The date in 2006 when the
rating was given is also irrelevant. The test data can be found at this
website.
Topic07: Enron E-mail Dataset
• The Enron E-mail data set contains about 500,000 e-mails from about
150 users.

The data set is available here

Project ideas:
* Can you classify the text of an e-mail message to decide who sent
it?
Topic 08: Loan – Bad & Good
In this problem, we’ll apply logistic regression to predict whether a person will default on a home equity loan. There are two data files,
hmeq-train.csv and hmeq-test.csv, with 4000 and 1357 records, respectively. These are comma-delimited CSV files with the following columns:
• BAD: Value is 1 if the loan was bad, 0 if it was paid back. This is the quantity you will predict with your logistic regression model.
• LOAN: Amount of the loan.
• MORTDUE: Amount of existing mortgage.
• VALUE: Value of current property.
• REASON: {DebtCon, HomeImp, Unknown}
• JOB: {Mgr, Office, Other, ProfExe, Self, Sales, Unknown}
• YOJ: Years at present job. -1 if unknown.
• DEROG: Number of major derogatory reports. -1 if unknown.
• DELINQ: Number of delinquent credit lines. -1 if unknown.
• CLAGE: Age of oldest credit line in months. -1 if unknown.
• NINQ: Number of recent credit inquiries. -1 if unknown.
• CLNO: Number of credit lines. -1 if unknown.
You will build a logistic regression model to predict the BAD column, using the training data. You will need to load the CSV file (look into the
Python csv module) and turn the values into useful data that you can use as features. In particular, that will probably mean using a one-hot
encoding for the categorical variables. (Optionally, you may also want to replace the unknown -1 values with some other quantity, like the mean
of that column; this is called data imputation.) In your first pass, don’t do anything fancy with basis functions; just use the raw features and
a bias term.
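The loading, one-hot encoding, and mean imputation steps above might look like the sketch below. It uses only the standard csv module. The three sample rows are made up for illustration, the presence of a header row is an assumption, and `one_hot` and `load_features` are hypothetical helper names.

```python
import csv
import io

# Made-up rows in the column layout described above (illustrative values only).
# Assumes the real files start with a header row.
SAMPLE = """BAD,LOAN,MORTDUE,VALUE,REASON,JOB,YOJ,DEROG,DELINQ,CLAGE,NINQ,CLNO
1,1100,25860,39025,HomeImp,Other,10.5,0,0,94.4,1,9
0,1300,70053,68400,HomeImp,Other,7,0,2,121.8,0,14
0,1500,13500,16700,DebtCon,Mgr,-1,-1,-1,-1,-1,-1
"""

REASON = ["DebtCon", "HomeImp", "Unknown"]
JOB = ["Mgr", "Office", "Other", "ProfExe", "Self", "Sales", "Unknown"]
NUMERIC = ["LOAN", "MORTDUE", "VALUE", "YOJ", "DEROG",
           "DELINQ", "CLAGE", "NINQ", "CLNO"]
# Columns documented as using -1 for "unknown".
MAY_BE_UNKNOWN = ["YOJ", "DEROG", "DELINQ", "CLAGE", "NINQ", "CLNO"]

def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of value."""
    return [1.0 if value == c else 0.0 for c in categories]

def load_features(fileobj):
    rows = list(csv.DictReader(fileobj))
    # Mean of each column over the known (non -1) entries, for imputation.
    means = {}
    for col in MAY_BE_UNKNOWN:
        known = [float(r[col]) for r in rows if float(r[col]) != -1]
        means[col] = sum(known) / len(known) if known else 0.0
    X, y = [], []
    for r in rows:
        feats = [1.0]  # bias term
        for col in NUMERIC:
            v = float(r[col])
            if col in means and v == -1:
                v = means[col]  # replace unknown with the column mean
            feats.append(v)
        feats += one_hot(r["REASON"], REASON)
        feats += one_hot(r["JOB"], JOB)
        X.append(feats)
        y.append(int(r["BAD"]))
    return X, y

X, y = load_features(io.StringIO(SAMPLE))
print(len(X[0]), y)
```

For the real files, `io.StringIO(SAMPLE)` would be replaced by `open("hmeq-train.csv", newline="")`. When encoding hmeq-test.csv, it is usually preferable to reuse the column means computed on the training data rather than recompute them on the test set.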
Topic 09: K - Means
Use a library implementation of K-Means clustering from a third-party machine
learning package such as scikit-learn; math libraries like numpy are fine. Pick an
image data set such as:
• CIFAR-10 or CIFAR-100: http://www.cs.toronto.edu/~kriz/cifar.html
• MNIST Handwritten Digits: http://yann.lecun.com/exdb/mnist/
• Small NORB (toys): http://www.cs.nyu.edu/~ylclab/data/norb-v1.0-small/
• Street View Housing Numbers: http://ufldl.stanford.edu/housenumbers/
• STL-10: http://cs.stanford.edu/~acoates/stl10/
• Labeled Faces in the Wild: http://vis-www.cs.umass.edu/lfw/
Figure out how to load it into your environment and turn it into a set of vectors.
Run K-Means on it for a few different K and show some results from the fit. What
do the mean images look like? What are some representative images from each of
the clusters? Are the results wildly different for different restarts and/or different
K? Plot the K-Means objective function (distortion measure) as a function of
iteration and verify that it never increases.
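scikit-learn's KMeans exposes the final objective as `inertia_`. To inspect the distortion at every iteration, one option is a small NumPy loop like the sketch below. It is a plain Lloyd's-algorithm illustration on synthetic 2-D points, not a replacement for the library fit; the function name `kmeans` and the blob data are assumptions.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Minimal Lloyd's algorithm that records the distortion per iteration."""
    X = np.asarray(X, dtype=float)  # avoids overflow with uint8 image data
    rng = np.random.default_rng(seed)
    # Initialise the centers with k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    history = []
    for _ in range(n_iter):
        # Assignment step: squared distance of every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Distortion: sum of squared distances to the assigned centers.
        history.append(d2[np.arange(len(X)), labels].sum())
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = X[labels == j]
            if len(members):  # an empty cluster keeps its old center
                centers[j] = members.mean(axis=0)
    # labels come from the last assignment step, before the final update.
    return centers, labels, history

# Synthetic stand-in for flattened image vectors: three 2-D blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ((0, 0), (5, 5), (0, 5))])
centers, labels, history = kmeans(X, k=3)

# The objective should never increase from one iteration to the next.
monotone = all(b <= a + 1e-6 for a, b in zip(history, history[1:]))
print(monotone)
```

Plotting `history` against the iteration index gives the requested curve. For images, each row of `X` would be one flattened image, and each row of `centers` can be reshaped back to image shape to view the mean images.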
Topic 10: Distance 100 cities
• Download the cities100.csv data set from the course website. This file
contains the latitude and longitude of the 100 most populous cities in
the world. Convert these latitudes and longitudes into pairwise
distances using geodesic or great-circle distance (look at geopy for
this conversion). With this distance matrix in hand, use
scipy.cluster.hierarchy to explore these data with hierarchical
clustering. Produce at least three different dendrograms (with the city
names labeled) that use different configurations of linkage, etc.
Explain any differences you see arising from different choices of
linkage.
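As a rough illustration of the distance step, the sketch below builds a pairwise great-circle distance matrix with the haversine formula, using only the standard library. geopy could be used instead, as the task suggests. The three cities and their approximate coordinates are placeholders standing in for cities100.csv, and the function names are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def great_circle_km(p, q):
    """Haversine distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    # min() guards against rounding pushing a slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))

def distance_matrix(coords):
    """Symmetric matrix of pairwise distances with a zero diagonal."""
    n = len(coords)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            D[i][j] = D[j][i] = great_circle_km(coords[i], coords[j])
    return D

# Approximate coordinates, used here only as placeholder input.
cities = {
    "Tokyo": (35.68, 139.69),
    "Delhi": (28.61, 77.21),
    "Shanghai": (31.23, 121.47),
}
names = list(cities)
D = distance_matrix([cities[name] for name in names])
print(names)
```

The square matrix `D` can be converted to condensed form with `scipy.spatial.distance.squareform` and passed to `scipy.cluster.hierarchy.linkage` with different `method` values (for example "single", "complete", "average"). `scipy.cluster.hierarchy.dendrogram` accepts `labels=names` to label the leaves with city names.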
Topic 11: PCA - Runtime
• CIFAR-10 or CIFAR-100: http://www.cs.toronto.edu/~kriz/cifar.html
• MNIST Handwritten Digits: http://yann.lecun.com/exdb/mnist/
• Small NORB (toys): http://www.cs.nyu.edu/~ylclab/data/norb-v1.0-small/
• Street View Housing Numbers: http://ufldl.stanford.edu/housenumbers/
• STL-10: http://cs.stanford.edu/~acoates/stl10/
• Labeled Faces in the Wild: http://vis-www.cs.umass.edu/lfw/
Take the top 16 eigenvectors and put them back into image space,
probably by rescaling them to be in [0, 1], reshaping, and then using
imshow. Produce a figure with these images as subplots.
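The step above can be sketched as follows, assuming the eigenvectors are those of the data covariance matrix (the usual PCA setup) and that each row of the data matrix is one flattened image. Random data stands in for a real image set, and the function names are hypothetical.

```python
import numpy as np

def top_components_as_images(X, image_shape, n_components=16):
    """Return the leading covariance eigenvectors rescaled to [0, 1] as images."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center the data
    cov = Xc.T @ Xc / (len(X) - 1)               # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    comps = eigvecs[:, order].T                  # shape (n_components, n_features)
    # Rescale each eigenvector independently so its entries span [0, 1].
    lo = comps.min(axis=1, keepdims=True)
    hi = comps.max(axis=1, keepdims=True)
    scaled = (comps - lo) / np.maximum(hi - lo, 1e-12)
    return scaled.reshape((n_components,) + tuple(image_shape))

def show_components(images):
    """Plot 16 component images in a 4x4 grid of subplots."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting
    fig, axes = plt.subplots(4, 4, figsize=(6, 6))
    for ax, img in zip(axes.ravel(), images):
        ax.imshow(img, cmap="gray")
        ax.axis("off")
    plt.show()

# Random stand-in for 200 flattened 8x8 grayscale images.
rng = np.random.default_rng(0)
X = rng.random((200, 64))
images = top_components_as_images(X, image_shape=(8, 8))
print(images.shape)
```

Calling `show_components(images)` produces the figure. For high-dimensional images such as CIFAR, forming the full covariance matrix can be slow, and an SVD of the centered data is a common alternative.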
Requirements
• Produce working, runnable results
• Understand the algorithm
  - Draw a flowchart of the algorithm
  - Walk through an illustrative step-by-step run
• Analyze the results and visualize the data
• Analyze strengths and weaknesses
  - Data context
  - Data type
• Address the weaknesses: propose solutions (search, idea)
