
Contents

Azure Open Datasets documentation


Overview
What are Azure Open Datasets?
Tutorial
Regression with automated machine learning
Enrich an image classification model
How-to guides
Datasets in Azure Machine Learning
Samples
Reference
Python SDK
Jupyter notebooks
Resources
Azure Open Datasets catalog
Microsoft Research Open Data
Azure roadmap
Pricing
Regional availability
Microsoft Q & A
Stack Overflow
What are Azure Open Datasets and how can you use them?
5/8/2020 • 2 minutes to read

Azure Open Datasets are curated public datasets that you can use to add scenario-specific features to machine
learning solutions for more accurate models. Open Datasets are in the cloud on Microsoft Azure and are integrated
into Azure Machine Learning and readily available to Azure Databricks and Machine Learning Studio (classic). You
can also access the datasets through APIs and use them in other products, such as Power BI and Azure Data Factory.
Datasets include public-domain data for weather, census, holidays, public safety, and location that help you train
machine learning models and enrich predictive solutions. You can also share your public datasets on Azure Open
Datasets.

Curated, prepared datasets


Curated open public datasets in Azure Open Datasets are optimized for consumption in machine learning
workflows.
To see all the datasets available, go to the Azure Open Datasets Catalog.
Data scientists often spend the majority of their time cleaning and preparing data for advanced analytics. Open
Datasets are copied to the Azure cloud and preprocessed to save you time. At regular intervals, data is pulled from
the sources, for example over an FTP connection to the National Oceanic and Atmospheric Administration (NOAA). Next,
the data is parsed into a structured format and enriched as appropriate with features such as ZIP Code or the location
of the nearest weather station.
Datasets are cohosted with cloud compute in Azure, which makes access and manipulation easier.
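The enrichment step described above can be illustrated with a minimal sketch. It is not the pipeline Open Datasets actually runs; it only shows one plausible way to attach the nearest weather station to a record, using the haversine great-circle distance. The field names (id, lat, lon), the station list, and the coordinates are hypothetical and chosen for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_station(record, stations):
    """Return a copy of `record` with the ID of the closest station added."""
    closest = min(
        stations,
        key=lambda s: haversine_km(record["lat"], record["lon"], s["lat"], s["lon"]),
    )
    return {**record, "nearest_station": closest["id"]}


# Hypothetical stations and record, with approximate coordinates, for illustration only.
stations = [
    {"id": "SEA", "lat": 47.45, "lon": -122.31},
    {"id": "JFK", "lat": 40.64, "lon": -73.78},
]
enriched = nearest_station({"name": "sensor-1", "lat": 40.71, "lon": -74.01}, stations)
```

A linear scan like this is fine for a handful of stations; a spatial index would be the usual choice for a worldwide station list.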
The following are examples of the available datasets.
Weather data
Dataset | Notebooks | Description

NOAA Integrated Surface Data (ISD) | Azure Notebooks, Azure Databricks | Worldwide hourly weather data from NOAA with the best spatial coverage in North America, Europe, Australia, and parts of Asia. Updated daily.

NOAA Global Forecast System (GFS) | Azure Notebooks, Azure Databricks | 15-day U.S. hourly weather forecast data from NOAA. Updated daily.

Calendar data
Dataset | Notebooks | Description

Public Holidays | Azure Notebooks, Azure Databricks | Worldwide public holiday data, covering 41 countries or regions from 1970 to 2099. Includes country and whether most people have paid time off.

Access to datasets
With an Azure account, you can access open datasets through the Azure service interface as well as in code. The data is
colocated with Azure cloud compute resources for use in your machine learning solution.
Open Datasets are available through the Azure Machine Learning UI and SDK. Open Datasets also provides Azure
Notebooks and Azure Databricks notebooks you can use to connect data to Azure Machine Learning and Azure
Databricks. Datasets can also be accessed through a Python SDK.
An Azure account isn't required, however: you can also access Open Datasets from any Python environment, with or
without Spark.
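As a minimal sketch of access from a plain Python environment, the following helper loads the Public Holidays dataset into a pandas dataframe. It assumes the azureml-opendatasets package is installed and that its PublicHolidays class accepts start_date and end_date and exposes to_pandas_dataframe(); check the SDK reference for the exact signature. The function name and the dataset_cls parameter are hypothetical, added so the helper can be exercised without the SDK.

```python
from datetime import datetime


def load_public_holidays(start, end, dataset_cls=None):
    """Load public holidays between `start` and `end` as a pandas dataframe."""
    if dataset_cls is None:
        # Imported lazily so this function can be defined without the SDK installed.
        from azureml.opendatasets import PublicHolidays
        dataset_cls = PublicHolidays
    holidays = dataset_cls(start_date=start, end_date=end)
    return holidays.to_pandas_dataframe()


# Example call (needs `pip install azureml-opendatasets` and network access):
# df = load_public_holidays(datetime(2020, 1, 1), datetime(2020, 12, 31))
# print(df.head())
```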

Request or contribute datasets


If you can't find the data you want, email us to request a dataset or contribute a dataset.

Next steps
Sample notebook
Tutorial: Regression modeling with NY taxi data
Python SDK for Open Datasets
Create Azure Machine Learning datasets from Azure Open Datasets
9/22/2020 • 4 minutes to read

In this article, you learn how to bring curated enrichment data into your local or remote machine learning
experiments with Azure Machine Learning datasets and Azure Open Datasets.
By creating an Azure Machine Learning dataset, you create a reference to the data source location, along with a
copy of its metadata. Because datasets are lazily evaluated and the data remains in its existing location, you:
Incur no extra storage cost.
Don't risk unintentionally changing your original data sources.
Improve ML workflow performance.
To understand where datasets fit in Azure Machine Learning's overall data access workflow, see the Securely access
data article.
Azure Open Datasets are curated public datasets that you can use to add scenario-specific features to enrich your
predictive solutions and improve their accuracy. See the Open Datasets catalog for public-domain data that can
help you train machine learning models, like:
weather
census
holidays
public safety
location
Open Datasets are in the cloud on Microsoft Azure and are included in both the Azure Machine Learning Python
SDK and the Azure Machine Learning studio.

Prerequisites
For this article, you need:
An Azure subscription. If you don't have one, create a free account before you begin. Try the free or paid
version of Azure Machine Learning.
An Azure Machine Learning workspace.
The Azure Machine Learning SDK for Python installed, which includes the azureml-datasets package. To get it, either:
    Create an Azure Machine Learning compute instance, which is a fully configured and managed
    development environment that includes integrated notebooks and the SDK already installed.
    OR
    Work in your own Python environment and install the SDK yourself with these instructions.
NOTE
Some dataset classes have dependencies on the azureml-dataprep package, which is only compatible with 64-bit Python. For
Linux users, these classes are supported only on the following distributions: Red Hat Enterprise Linux (7, 8), Ubuntu (14.04,
16.04, 18.04), Fedora (27, 28), Debian (8, 9), and CentOS (7).

Create datasets with the SDK


To create Azure Machine Learning datasets via Azure Open Datasets classes in the Python SDK, make sure you've
installed the package with pip install azureml-opendatasets . Each dataset is represented by its own class
in the SDK, and a given class is available as an Azure Machine Learning TabularDataset , a FileDataset , or
both. See the reference documentation for a full list of opendatasets classes.
You can retrieve certain opendatasets classes as either a TabularDataset or a FileDataset , which allows you to
manipulate or download the files directly. Other classes can return a dataset only through one of the
get_tabular_dataset() or get_file_dataset() functions in the Python SDK.

The following code shows that the MNIST opendatasets class can return either a TabularDataset or FileDataset .

from azureml.core import Dataset
from azureml.opendatasets import MNIST

# MNIST class can return either TabularDataset or FileDataset
tabular_dataset = MNIST.get_tabular_dataset()
file_dataset = MNIST.get_file_dataset()

In this example, the Diabetes opendatasets class is only available as a TabularDataset , hence the use of
get_tabular_dataset() .

from azureml.core import Dataset
from azureml.opendatasets import Diabetes

# Diabetes class can return ONLY a TabularDataset, through the static get_tabular_dataset() function
diabetes_tabular = Diabetes.get_tabular_dataset()

Register datasets
Register Azure Machine Learning datasets with your workspace so you can share them with others and reuse
them across experiments in your workspace. When you register an Azure Machine Learning dataset created from
Open Datasets, no data is downloaded immediately. The data is accessed later, when requested (during
training, for example), from a central storage location.
To register your datasets with a workspace, use the register() method.

# Assumes `workspace` is an azureml.core.Workspace and `titanic_ds` is a dataset created earlier.
titanic_ds = titanic_ds.register(workspace=workspace,
                                 name='titanic_ds',
                                 description='titanic training data')
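To tie registration back to the Open Datasets classes shown earlier, here is a minimal sketch that registers the Diabetes dataset and then retrieves it by name. It assumes a config.json for your workspace is available locally (read by Workspace.from_config()) and that the azureml-core and azureml-opendatasets packages are installed. The function names and the dataset name diabetes_ds are hypothetical.

```python
def register_open_dataset(dataset, workspace, name, description):
    """Register `dataset` with `workspace` and return the registered dataset."""
    return dataset.register(workspace=workspace, name=name, description=description)


def register_and_fetch_diabetes():
    """Register the Diabetes open dataset, then look it up again by name."""
    # Lazy imports: these need the azureml-core and azureml-opendatasets packages.
    from azureml.core import Dataset, Workspace
    from azureml.opendatasets import Diabetes

    workspace = Workspace.from_config()  # assumes a local config.json for the workspace
    diabetes_tabular = Diabetes.get_tabular_dataset()
    register_open_dataset(diabetes_tabular, workspace, "diabetes_ds", "diabetes training data")
    # Later, any experiment in the same workspace can retrieve the dataset by name.
    return Dataset.get_by_name(workspace, name="diabetes_ds")
```

Keeping the registration call in its own small function makes it reusable for any dataset object that exposes register(), whichever opendatasets class produced it.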

Create datasets with the studio


You can also create Azure Machine Learning datasets from Azure Open Datasets with the Azure Machine Learning
studio, a consolidated web interface that brings together machine learning tools for data science practitioners
of all skill levels.

NOTE
Datasets created through Azure Machine Learning studio are automatically registered to the workspace.

1. In your workspace, select the Datasets tab under Assets . On the Create dataset drop-down menu, select
From Open Datasets .

2. Select a dataset by selecting its tile. (You have the option to filter by using the search bar.) Select Next .

3. Choose a name under which to register the dataset, and optionally filter the data by using the available
filters. In this case, for the public holidays dataset, you filter the time period to one year and the country
code to only the US. See the Azure Open Datasets Catalog for details such as field descriptions and date
ranges. Select Create .

The dataset is now available in your workspace under Datasets . You can use it in the same way as other
datasets you've created.
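The filter applied in step 3 (one year, US only) can be pictured in code as a simple row filter. This is an illustrative sketch on in-memory records, not what the studio runs. The field names date, countryRegionCode, and holidayName are assumptions about the public holidays schema; confirm them against the catalog entry before relying on them.

```python
from datetime import date


def filter_holidays(rows, year, country_code):
    """Keep rows whose date falls in `year` and whose country code matches."""
    return [
        r for r in rows
        if r["date"].year == year and r["countryRegionCode"] == country_code
    ]


# Hypothetical sample rows, for illustration only.
rows = [
    {"holidayName": "New Year's Day", "countryRegionCode": "US", "date": date(2020, 1, 1)},
    {"holidayName": "Canada Day", "countryRegionCode": "CA", "date": date(2020, 7, 1)},
    {"holidayName": "Independence Day", "countryRegionCode": "US", "date": date(2021, 7, 4)},
]
us_2020 = filter_holidays(rows, 2020, "US")
```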

Access datasets for your experiments


Use your datasets in your machine learning experiments for training ML models. Learn more about how to train
with datasets.

Example notebooks
For examples and demonstrations of Open Datasets functionality, see these sample notebooks.

Next steps
Train your first ML model.
Train with datasets.
Create an Azure machine learning dataset.
Example Jupyter notebooks show how to enrich data with Open Datasets
9/22/2020 • 2 minutes to read

The example Jupyter notebooks for Azure Open Datasets show you how to load open datasets and use them to
enrich demo data. Techniques include use of Apache Spark and Pandas to process data.

IMPORTANT
When working in a non-Spark environment, Open Datasets allows downloading only one month of data at a time with
certain classes in order to avoid MemoryError with large datasets.
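One way to work within the one-month limit is to request the data one calendar month at a time and combine the results yourself. The sketch below only generates the month boundaries; the helper name month_windows is hypothetical, and how you pass each window to a dataset class (for example as start and end dates) depends on the class you use.

```python
import calendar
from datetime import datetime


def month_windows(start_year, start_month, n_months):
    """Yield (first moment, last second) datetime pairs, one per calendar month."""
    year, month = start_year, start_month
    for _ in range(n_months):
        last_day = calendar.monthrange(year, month)[1]  # number of days in the month
        yield datetime(year, month, 1), datetime(year, month, last_day, 23, 59, 59)
        # Advance to the next month, rolling over the year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)


windows = list(month_windows(2019, 12, 3))
```

Each pair can then drive one load call, with the monthly dataframes concatenated at the end, which keeps peak memory close to a single month of data.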

Load NOAA Integrated Surface Database (ISD) data


Notebook | Description

Load one recent month of weather data into a Pandas dataframe | Learn how to load historical weather data into your favorite Pandas dataframe.

Load one recent month of weather data into a Spark dataframe | Learn how to load historical weather data into your favorite Spark dataframe.

Join demo data with NOAA ISD data


Notebook | Description

Join demo data with weather data - Pandas | Join a 1-month demo dataset of sensor locations with weather readings in a Pandas dataframe.

Join demo data with weather data - Spark | Join a demo dataset of sensor locations with weather readings in a Spark dataframe.

Join NYC taxi data with NOAA ISD data


Notebook | Description

Taxi trip data enriched with weather data - Pandas | Load NYC green taxi data (over 1 month) and enrich it with weather data in a Pandas dataframe. This example overrides the method get_pandas_limit and balances data load performance with the amount of data.

Taxi trip data enriched with weather data - Spark | Load NYC green taxi data and enrich it with weather data in a Spark dataframe.
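The notebooks above do the enrichment with Pandas or Spark. As a simplified plain-Python sketch of the same idea, the following joins each trip to the weather reading for its pickup hour. The field names (pickup, datetime, temperature, trip_id) and the sample values are hypothetical and do not reflect the actual taxi or NOAA ISD schemas.

```python
from datetime import datetime


def hour_key(ts):
    """Truncate a datetime to the start of its hour, used as the join key."""
    return ts.replace(minute=0, second=0, microsecond=0)


def enrich_trips(trips, weather):
    """Attach the temperature for each trip's pickup hour, or None if no reading exists."""
    temp_by_hour = {hour_key(w["datetime"]): w["temperature"] for w in weather}
    return [{**t, "temperature": temp_by_hour.get(hour_key(t["pickup"]))} for t in trips]


# Hypothetical sample data, for illustration only.
weather = [
    {"datetime": datetime(2016, 1, 1, 8, 0), "temperature": 3.5},
    {"datetime": datetime(2016, 1, 1, 9, 0), "temperature": 4.1},
]
trips = [
    {"trip_id": 1, "pickup": datetime(2016, 1, 1, 8, 42)},
    {"trip_id": 2, "pickup": datetime(2016, 1, 1, 11, 5)},
]
enriched = enrich_trips(trips, weather)
```

This behaves like a left join: trips with no matching weather hour are kept, with a missing temperature.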

Next steps
Tutorial: Regression modeling with automated machine learning and an open dataset
Python SDK for Open Datasets
Azure Open Datasets catalog
Create Azure Machine Learning dataset from Open Dataset
