
Data set

A data set (or dataset) is a collection of data. In the case of tabular data, a data set corresponds
to one or more database tables, where every column of a table represents a particular variable,
and each row corresponds to a given record of the data set in question. The data set lists values
for each of the variables, such as the height and weight of an object, for each member
of the data set. Data sets can also consist of a collection of documents or files.[2]
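
The column-and-row layout described above can be sketched in a few lines of Python. This is only an illustration: the records, the variable names (`name`, `height_cm`, `weight_kg`), and the helper `column` are hypothetical and do not come from any particular data set.

```python
# Hypothetical tabular data set: each dict is one row (a record),
# and each key is one column (a variable).
rows = [
    {"name": "A", "height_cm": 12.0, "weight_kg": 1.5},
    {"name": "B", "height_cm": 30.5, "weight_kg": 4.2},
    {"name": "C", "height_cm": 18.2, "weight_kg": 2.9},
]


def column(rows, variable):
    """Return the values of one variable across all records."""
    return [row[variable] for row in rows]


variables = list(rows[0].keys())   # the columns of the table
heights = column(rows, "height_cm")  # one value per member of the data set

print(variables)
print(heights)
```

Reading a column gives every value of one variable, while reading a single element of `rows` gives every variable for one member.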

Figure: Various plots of the multivariate Iris flower data set, introduced by Ronald Fisher (1936).[1]

In the open data discipline, a data set is the unit used to measure the information released in a
public open data repository. The European data.europa.eu portal aggregates more than a million data
sets.[3]

Properties
Several characteristics define a data set's structure and properties. These include the number
and types of the attributes or variables, and various statistical measures applicable to them,
such as standard deviation and kurtosis.[4]
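
As a rough illustration of such measures, the sketch below computes the population standard deviation and the (non-excess) kurtosis of one numeric variable using only Python's standard library. The sample values and the helper name `population_kurtosis` are assumptions made for the example; kurtosis is taken here as the fourth central moment divided by the squared variance, which is one common convention among several.

```python
import statistics


def population_kurtosis(values):
    """Fourth central moment divided by the squared population variance."""
    mean = statistics.fmean(values)
    n = len(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # population variance
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2


# Hypothetical values of a single numeric variable.
values = [2, 4, 4, 4, 5, 5, 7, 9]

std_dev = statistics.pstdev(values)      # population standard deviation
kurtosis = population_kurtosis(values)   # a normal distribution would give 3

print(std_dev, kurtosis)
```

Subtracting 3 from the result gives the excess kurtosis that some software reports instead.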

The values may be numbers, such as real numbers or integers, for example representing a
person's height in centimeters, but may also be nominal data (i.e., not consisting of numerical
values), for example representing a person's ethnicity. More generally, values may be of any of
the kinds described as a level of measurement. For each variable, the values are normally all of
the same kind. Missing values may exist, which must be indicated somehow.

In statistics, data sets usually come from actual observations obtained by sampling a statistical
population, and each row corresponds to the observations on one element of that population.
Data sets may further be generated by algorithms for the purpose of testing certain kinds of
software. Some modern statistical analysis software, such as SPSS, still presents its data in the
classical data set fashion. If data is missing or suspicious, an imputation method may be used to
complete a data set.[5]
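
A minimal sketch of one simple imputation method, mean imputation, is shown below. It assumes that missing values are marked with `None`; the function name `impute_mean` and the sample heights are hypothetical. Many other imputation methods exist, and the cited source does not prescribe this one.

```python
import statistics


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical variable with two missing values, marked as None.
heights_cm = [170.0, None, 182.0, 176.0, None]
completed = impute_mean(heights_cm)

print(completed)
```

Mean imputation keeps the variable's mean unchanged but shrinks its variance, which is one reason more elaborate methods are often preferred.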

Classics
Several classic data sets have been used extensively in the statistical literature:

Iris flower data set – Multivariate data set introduced by Ronald Fisher (1936).[1] Provided online by University of California-Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Iris).[6]
MNIST database – Images of handwritten digits commonly used to test classification, clustering, and image processing algorithms
Categorical data analysis – Data sets used in the book, An Introduction to Categorical Data Analysis, provided online (https://stats.oarc.ucla.edu/other/examples/icda/) by UCLA Advanced Research Computing.[7]
Robust statistics – Data sets used in Robust Regression and Outlier Detection (Rousseeuw and Leroy, 1987). Provided online (https://web.archive.org/web/20050207032959/http://www.uni-koeln.de/themen/statistik/data/rousseeuw/) at the University of Cologne.[8]
Time series – Data used in Chatfield's book, The Analysis of Time Series, are provided on-line (https://web.archive.org/web/20110102201323/http://lib.stat.cmu.edu/modules.php?op=modload&name=PostWrap&file=index&page=datasets/) by StatLib.[9]
Extreme values – Data used in the book, An Introduction to the Statistical Modeling of Extreme Values, are a snapshot of the data as it was provided on-line by Stuart Coles (https://web.archive.org/web/20060910161517/http://homes.stat.unipd.it/coles/public_html/ismev/ismev.dat), the book's author.
Bayesian Data Analysis – Data used in the book are provided on-line (http://www.stat.columbia.edu/~gelman/book/data/) (archive link (https://web.archive.org/web/20230122121643/http://www.stat.columbia.edu/~gelman/book/data/)) by Andrew Gelman, one of the book's authors.
The Bupa liver data (https://web.archive.org/web/20171023174701/http://ftp.ics.uci.edu:80/pub/machine-learning-databases/liver-disorders/) – Used in several papers in the machine learning (data mining) literature.
Anscombe's quartet – Small data set illustrating the importance of graphing the data to avoid statistical fallacies
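
The point made by Anscombe's quartet can be checked numerically. The sketch below uses the quartet's values as they are commonly reproduced; they are entered here as an assumption and should be verified against Anscombe's 1973 paper before being relied on. The dictionary names `quartet` and `summary` are hypothetical.

```python
import statistics

# Anscombe's quartet as commonly reproduced (assumed values; verify before use).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared by sets I, II and III
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Summarise each of the four sets with the same simple statistics.
summary = {}
for name, (xs, ys) in quartet.items():
    summary[name] = {
        "mean_x": statistics.fmean(xs),
        "var_x": statistics.variance(xs),   # sample variance
        "mean_y": statistics.fmean(ys),
        "var_y": statistics.variance(ys),
    }
    print(name, summary[name])
```

The four summaries come out nearly identical even though the four scatter plots look very different, which is why summary statistics alone can mislead.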

See also

Data blending
Data (computing)
Data sample
Data store
Interoperability
Data collection system
List of data sets for machine-learning research

References

1. Fisher, R.A. (1936). "The Use of Multiple Measurements in Taxonomic Problems" (https://web.archive.org/web/20110928044802/http://digital.library.adelaide.edu.au/coll/special//fisher/138.pdf) (PDF). Annals of Eugenics. 7 (2): 179–188. doi:10.1111/j.1469-1809.1936.tb02137.x (https://doi.org/10.1111%2Fj.1469-1809.1936.tb02137.x). hdl:2440/15227 (https://hdl.handle.net/2440%2F15227). Archived from the original (http://digital.library.adelaide.edu.au/coll/special//fisher/138.pdf) (PDF) on 2011-09-28. Retrieved 2007-05-22.
2. Snijders, C.; Matzat, U.; Reips, U.-D. (2012). "'Big Data': Big gaps of knowledge in the field of Internet" (https://web.archive.org/web/20191123051001/http://www.ijis.net/ijis7_1/ijis7_1_editorial.html). International Journal of Internet Science. 7: 1–5. Archived from the original (http://www.ijis.net/ijis7_1/ijis7_1_editorial.html) on 2019-11-23. Retrieved 2017-02-10.
3. "European open data portal" (http://www.europeandataportal.eu/data/en/dataset). European open data portal. European Commission. Retrieved 2016-09-23.
4. Jan M. Żytkow, Jan Rauch (2000). Principles of data mining and knowledge discovery (https://books.google.com/books?id=uTzeRZFmaBgC&pg=PA100). Springer. ISBN 978-3-540-66490-1.
5. United Nations Statistical Commission; United Nations Economic Commission for Europe (2007). Statistical Data Editing: Impact on Data Quality: Volume 3 of Statistical Data Editing, Conference of European Statisticians Statistical standards and studies (https://books.google.com/books?id=X0wtLo2XY9gC). United Nations Publications. p. 20. ISBN 978-9211169522. Retrieved 19 July 2015.
6. "UCI Machine Learning Repository: Iris Data Set" (https://archive.ics.uci.edu/ml/datasets/Iris). Archived (https://web.archive.org/web/20230426065109/https://archive.ics.uci.edu/ml/datasets/Iris) from the original on 2023-04-26. Retrieved 2023-05-02.
7. "Textbook Examples An Introduction to Categorical Data Analysis by Alan Agresti" (https://stats.oarc.ucla.edu/other/examples/icda/). Archived (https://web.archive.org/web/20230131013107/https://stats.oarc.ucla.edu/other/examples/icda/) from the original on 2023-01-31. Retrieved 2023-05-02.
8. "The ROUSSEEUW datasets" (https://web.archive.org/web/20050207032959/http://www.uni-koeln.de/themen/statistik/data/rousseeuw/). Archived from the original (http://www.uni-koeln.de/themen/statistik/data/rousseeuw/) on 2005-02-07.
9. "StatLib :: Data, Software and News from the Statistics Community" (https://web.archive.org/web/20110102201323/http://lib.stat.cmu.edu/modules.php?op=modload&name=PostWrap&file=index&page=datasets/). Archived from the original (http://lib.stat.cmu.edu/modules.php?op=modload&name=PostWrap&file=index&page=datasets/) on 2011-01-02.

External links

Data.gov (https://www.data.gov/) – the U.S. Government's open data
GCMD (https://earthdata.nasa.gov/gcmd) – the Global Change Master Directory containing over 34,000 descriptions of Earth science and environmental science data sets and services
Humanitarian Data Exchange (HDX) (https://data.humdata.org/) – The Humanitarian Data Exchange (HDX) is an open humanitarian data sharing platform managed by the United Nations Office for the Coordination of Humanitarian Affairs.
NYC Open Data (https://opendata.cityofnewyork.us/) – free public data published by New York City agencies and other partners.
Relational data set repository (https://relational.fit.cvut.cz/) Archived (https://web.archive.org/web/20180307150058/https://relational.fit.cvut.cz/) 2018-03-07 at the Wayback Machine
Research Pipeline (https://web.archive.org/web/20190214051201/http://www.researchpipeline.com/mediawiki/index.php?title=Main_Page) – a wiki/website with links to data sets on many different topics
StatLib–JASA Data Archive (http://lib.stat.cmu.edu/jasadata/)
UCI (https://archive.ics.uci.edu/) – a machine learning repository
UK Government Public Data (https://data.gov.uk/)
World Bank Open Data (https://data.worldbank.org/) – Free and open access to global development data by World Bank

Retrieved from "https://en.wikipedia.org/w/index.php?title=Data_set&oldid=1214512483"

This page was last edited on 19 March 2024, at 12:05 (UTC). Content is available under CC BY-SA 4.0 unless otherwise noted.