
Datasets for Data Science and Machine Learning

August 22, 2017

These days, we have the opposite problem we had 5-10 years ago…

Back then, it was actually difficult to find datasets for data science and machine learning.

Since then, we’ve been flooded with lists and lists of datasets. Today, the problem is not
finding datasets, but rather sifting through them to keep the relevant ones.

Well, we’ve done that for you right here.

Below, you’ll find a curated list of free datasets for data science and machine learning,
organized by their use case. You’ll find both hand-picked datasets and our favorite
aggregators.


Datasets for Exploratory Analysis

Exploratory analysis is your first step in most data science exercises. The best datasets
for practicing exploratory analysis should be fun, interesting, and non-trivial (i.e. require
you to dig a little to uncover all the insights).

All links open in a new tab.

Our picks:

Game of Thrones – Game of Thrones is a popular TV series based on George R.R.
Martin’s A Song of Ice and Fire book series. With this dataset, you can explore its
political landscape, characters, and battles.
World University Rankings – Ranking universities can be difficult and controversial.
There are hundreds of ranking systems, and they rarely reach a consensus. This
dataset contains three global university rankings.
IMDB 5000 Movie Dataset – This dataset explores the question of whether we can
anticipate a movie’s popularity before it’s even released.

Aggregators:

Kaggle Datasets – Open datasets contributed by the Kaggle community. Here, you’ll
find a grab bag of topics. Plus, you can learn from the short tutorials and scripts that
accompany the datasets.
r/datasets – Open datasets contributed by the Reddit community. This is another
source of interesting and quirky datasets, but the datasets tend to be less refined.

Datasets for General Machine Learning

In this context, “general” machine learning means regression, classification, and
clustering with relational (i.e. table-format) data. These are the most common ML tasks.
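
To make the regression case concrete, here is a minimal sketch that fits a one-feature least-squares line in plain Python. The rows, the column names `feature` and `target`, and the helper `fit_line` are all made up for illustration; they do not come from any dataset listed here.

```python
# Toy table-format data: each row is one observation.
# The values and column names are hypothetical, chosen only for illustration.
rows = [
    {"feature": 9.0, "target": 4.0},
    {"feature": 10.0, "target": 5.0},
    {"feature": 11.0, "target": 6.0},
    {"feature": 12.0, "target": 7.0},
]


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of the x/y cross-products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [row["feature"] for row in rows]
ys = [row["target"] for row in rows]
slope, intercept = fit_line(xs, ys)

# Predict the target for a feature value that was not in the table.
prediction = slope * 10.5 + intercept
```

In practice you would likely reach for a library instead of hand-rolling the math, but the shape of the problem stays the same: columns in, fitted parameters out.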

Our picks:

Wine Quality (Regression) – Properties of red and white vinho verde wine samples
from the north of Portugal. The goal is to model wine quality based on
physicochemical tests. (We also have a tutorial.)
Credit Card Default (Classification) – Predicting credit card default is a valuable and
common use for machine learning. This rich dataset includes demographics,
payment history, credit, and default data.
US Census Data (Clustering) – Clustering based on demographics is a tried and
true way to perform market research and segmentation.

Aggregators:

UCI Machine Learning Repository – The UCI ML repository is an old and popular
aggregator for machine learning datasets. Tip: Most of their datasets have linked
academic papers that you can use for benchmarks.

Datasets for Deep Learning

While not appropriate for general-purpose machine learning, deep learning has been
dominating certain niches, especially those that use image, text, or audio data. From our
experience, the best way to get started with deep learning is to practice on image data
because of the wealth of tutorials available.

Our picks:

MNIST – MNIST contains images for handwritten digit classification. It’s considered
a great entry dataset for deep learning because it’s complex enough to warrant
neural networks, while still being manageable on a single CPU. (We also have a
tutorial.)
CIFAR – The next step up in difficulty is the CIFAR-10 dataset, which contains
60,000 images broken into 10 different classes. For a bigger challenge, you can try
the CIFAR-100 dataset, which has 100 different classes.
ImageNet – ImageNet hosts a computer vision competition every year, and many
consider it to be the benchmark for modern performance. The current image dataset
has 1000 different classes.
YouTube 8M – Ready to tackle videos, but can’t spare terabytes of storage? This
dataset contains millions of YouTube video IDs and billions of audio and visual
features that were pre-extracted using the latest deep learning models.

Aggregators:

An up-to-date list of datasets for benchmarking deep learning algorithms.
An up-to-date list of high-quality datasets for deep learning.


Datasets for Natural Language Processing

Natural Language Processing (NLP) is about text data. And for messy data like text, it's
especially important for the datasets to have real-world applications so that you can
perform easy sanity checks.

Our picks:

Enron Dataset - Email data from the senior management of Enron, organized into
folders. This dataset was originally made public and posted to the web by the
Federal Energy Regulatory Commission during its investigation.
Amazon Reviews - Contains ~35 million reviews from Amazon spanning 18 years.
Data include product and user information, ratings, and the plaintext review.
Newsgroup Classification - Collection of approximately 20,000 newsgroup
documents, partitioned (nearly) evenly across 20 different newsgroups. Great for
practicing text classification and topic modeling.

Aggregators:

nlp-datasets (GitHub) - Alphabetical list of free/public domain datasets with text data
for use in NLP.
Quora Answer - List of annotated corpora for NLP.
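
Before any text classification or topic modeling, text usually has to be turned into counts. The sketch below builds a simple bag-of-words representation with the standard library. The two documents are made up for illustration, and `tokenize` and `bag_of_words` are hypothetical helper names, not part of any dataset's tooling.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text):
    """Count how often each token occurs in one document."""
    return Counter(tokenize(text))


# Hypothetical two-document corpus, made up for illustration.
docs = [
    "The meeting is moved to Monday.",
    "Monday's meeting covers the budget, the budget only.",
]
counts = [bag_of_words(doc) for doc in docs]

# Words that appear in both documents, a quick way to eyeball the overlap.
shared_words = sorted(set(counts[0]) & set(counts[1]))
```

This is also where sanity checks pay off: with this naive tokenizer, "monday" and "monday's" are counted as different words, which is easy to spot on text you can read yourself.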

Datasets for Cloud Machine Learning

Technically, any dataset can be used for cloud-based machine learning if you just upload
it to the cloud. However, if you're just starting out and evaluating a platform, you may wish
to skip all the data piping.

Fortunately, the major cloud computing services all provide public datasets that you can
easily import. Their datasets are all comparable.

Our picks:

AWS Public Datasets
Google Cloud Public Datasets
Microsoft Azure Public Datasets

Datasets for Time Series Analysis

Time series analysis requires observations marked with a timestamp. In other words,
each subject and/or feature is tracked across time.
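
As a rough sketch of what "tracked across time" looks like in code, the example below groups timestamped observations by subject, sorts each group chronologically, and computes the change between consecutive observations. The subjects, dates, and values are made up, and `to_series` and `changes` are hypothetical helper names.

```python
from collections import defaultdict
from datetime import date

# Hypothetical observations: (subject, timestamp, value). Values are made up
# and deliberately out of order, as raw data often is.
observations = [
    ("A", date(2017, 1, 3), 10.0),
    ("B", date(2017, 1, 2), 20.0),
    ("A", date(2017, 1, 2), 8.0),
    ("B", date(2017, 1, 3), 19.0),
    ("A", date(2017, 1, 4), 11.0),
]


def to_series(records):
    """Group records by subject and sort each group by timestamp."""
    series = defaultdict(list)
    for subject, timestamp, value in records:
        series[subject].append((timestamp, value))
    return {subject: sorted(points) for subject, points in series.items()}


def changes(points):
    """Difference between each value and the one before it."""
    return [later[1] - earlier[1] for earlier, later in zip(points, points[1:])]


series = to_series(observations)
a_changes = changes(series["A"])
b_changes = changes(series["B"])
```

Sorting by timestamp before differencing matters: the same values in a different order would produce different changes.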

Our picks:

EOD Stock Prices - End of day stock prices, dividends, and splits for 3,000 US
companies, curated by the Quandl community.
Zillow Real Estate Research - Home prices and rents by size, type, and tier, sliced
by zip code, neighborhood, city, metro area, county and state.
Global Education Statistics - Over 4,000 internationally comparable indicators for
education access, progression, completion, literacy, teachers, population, and
more.

Aggregators:

Quandl - Quandl contains free and premium financial time series datasets.
The World Bank - Contains global macroeconomic time series, searchable by
country or indicator.

Zillow Real Estate Data

Datasets for Recommender Systems

Recommender systems have taken the entertainment and e-commerce industries by
storm. Amazon, Netflix, and Spotify are great examples.

Our picks:

MovieLens - Rating data sets from the MovieLens web site. Perfect for getting
started thanks to the various dataset sizes available.
Jester - Ideal for building a simple collaborative filter. Contains 4.1 million
continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users.
Million Song Dataset - Large, rich dataset for music recommendations. You can
start with a pure collaborative filter and then expand it with other methods such as
content-based models or web scraping.

Aggregators:

entaroadun (GitHub) - Collection of datasets for recommender systems. Tip: Check
the comments section for recent datasets.
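
If you have not built a collaborative filter before, the sketch below shows one common variant: a user-based filter that predicts a missing rating as a similarity-weighted average of other users' ratings. The users, items, and ratings are made up (on a -10 to +10 scale), and `cosine` and `predict` are hypothetical helper names. Treat it as one possible starting point under those assumptions, not a tuned recommender.

```python
from math import sqrt

# Hypothetical ratings on a -10 to +10 scale: user -> {item: rating}.
ratings = {
    "ann": {"joke1": 8.0, "joke2": -4.0, "joke3": 6.0},
    "bob": {"joke1": 7.0, "joke2": -5.0},
    "cat": {"joke1": -6.0, "joke2": 9.0, "joke3": -8.0},
}


def cosine(u, v):
    """Cosine similarity computed over the items both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict(user, item):
    """Similarity-weighted average of other users' ratings for the item.

    Returns None when no other user with nonzero similarity rated the item.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[item]
        weight_total += abs(sim)
    return weighted_sum / weight_total if weight_total else None


# bob has not rated joke3; estimate it from ann (similar) and cat (opposite).
bob_joke3 = predict("bob", "joke3")
```

Because the ratings can be negative, a user with opposite taste still carries information: cat's strong dislike of joke3 pushes bob's prediction up, not down.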

Datasets for Specific Industries

In this compendium, we've organized datasets by their use case. This is helpful if you
need to practice a certain skill, such as deep learning or time series analysis.

However, you may also wish to search by a specific industry, such as datasets for
neuroscience, weather, or manufacturing. Here are a couple of options:


Awesome Public Datasets - High-quality datasets separated by industry.
Curated government data separated by industry.

Datasets for Streaming

Streaming datasets are used for building real-time applications, such as data
visualization, trend tracking, or updatable (i.e. "online") machine learning models.

Our picks:

Twitter API - The Twitter API is a classic source for streaming data. You can track
tweets, hashtags, and more.
StockTwits API - StockTwits is like Twitter for traders and investors. You can
expand this dataset in many interesting ways by joining it to time series datasets
using the timestamp and ticker symbol.
Weather Underground - A reliable weather API with global coverage. Features a
free tier and paid options for scaling up.

Aggregators:

Satori - Satori is a platform that lets you connect to streaming live data at ultra-low
latency (for free). They frequently add new datasets.
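
Joining a message stream to a time series by timestamp and ticker symbol can be sketched as a keyed lookup. In the example below, the tickers, prices, messages, and field names (`ticker`, `created_at`, `body`) are all assumptions made for illustration; they are not the schema of any real API.

```python
from datetime import datetime

# Hypothetical end-of-day prices keyed by (ticker, date). Values are made up.
prices = {
    ("ABC", "2017-08-21"): 10.0,
    ("ABC", "2017-08-22"): 10.5,
    ("XYZ", "2017-08-22"): 42.0,
}

# Hypothetical streamed messages; the field names are assumptions.
messages = [
    {"ticker": "ABC", "created_at": "2017-08-22T14:05:00", "body": "Looking strong"},
    {"ticker": "XYZ", "created_at": "2017-08-22T15:30:00", "body": "Holding"},
    {"ticker": "ABC", "created_at": "2017-08-23T09:00:00", "body": "No price yet"},
]


def join_with_prices(msgs, price_table):
    """Attach the same-day price to each message; price is None when nothing matches."""
    joined = []
    for msg in msgs:
        # Reduce the full timestamp to a calendar date so it matches the price key.
        day = datetime.fromisoformat(msg["created_at"]).date().isoformat()
        price = price_table.get((msg["ticker"], day))
        joined.append({**msg, "date": day, "price": price})
    return joined


joined = join_with_prices(messages, prices)
```

Keeping unmatched messages with a `None` price, instead of dropping them, makes gaps in the time series visible while you explore.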

Datasets for Web Scraping

Web scraping is a common part of data science research, but you must be careful not to
violate websites' terms of service. Fortunately, there's a whole site that's designed to
be freely scraped.

Our picks:

A web scraping sandbox with two subdomains. You can practice
scraping a fictional bookstore or a site that lists quotes from famous people.
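
To get a feel for the parsing side of scraping without touching the network, here is a minimal sketch using Python's built-in `html.parser`. The markup in `PAGE` is written for this example and the class name `book` is an assumption; a real page, including the sandbox above, will have its own structure that you would need to inspect first.

```python
from html.parser import HTMLParser

# Hypothetical markup, written for this example; a real page will differ.
PAGE = """
<ul>
  <li class="book"><a href="/book/1">First Title</a></li>
  <li class="book"><a href="/book/2">Second Title</a></li>
  <li class="ad"><a href="/promo">Ignore me</a></li>
</ul>
"""


class BookLinkParser(HTMLParser):
    """Collect (href, text) pairs for links inside <li class="book">."""

    def __init__(self):
        super().__init__()
        self.in_book = False
        self.current_href = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            # Only list items marked as books are of interest.
            self.in_book = attrs.get("class") == "book"
        elif tag == "a" and self.in_book:
            self.current_href = attrs.get("href")

    def handle_data(self, data):
        # Record link text only while inside a book link.
        if self.current_href is not None and data.strip():
            self.links.append((self.current_href, data.strip()))

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_href = None
        elif tag == "li":
            self.in_book = False


parser = BookLinkParser()
parser.feed(PAGE)
book_links = parser.links
```

Once the parsing logic works on a fixed string like this, swapping in HTML fetched from a site that permits scraping is a small step.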

Fictional Bookstore

Datasets for Current Events

Finding datasets for current events can be tricky. Fortunately, some publications have
started releasing the datasets they use in their articles.


FiveThirtyEight - FiveThirtyEight is a news and sports site with data-driven articles.
They make their datasets openly available on GitHub.
BuzzFeedNews - BuzzFeed became (in)famous for their listicles and superficial
pieces, but they've since expanded into investigative journalism. Their datasets are
available on GitHub.

Next Steps
If you'd like more guidance for using these datasets, check out our other free resources:

How to Learn Python for Data Science - Python is our preferred language for data
science.
Python Seaborn Tutorial - Our favorite library for exploratory analysis.
Python Scikit-Learn Tutorial - Our favorite library for general-purpose machine
learning.
Python Keras Tutorial - Our favorite library for deep learning.
Modern Machine Learning Algorithms - Overview of machine learning algorithms,
including each one's strengths and weaknesses.

