
The history of statistics can be said to start around 1749 although, over time, there have been

changes to the interpretation of the word statistics. In early times, the meaning was restricted to
information about states. This was later extended to include all collections of information of all types,
and later still it was extended to include the analysis and interpretation of such data. In modern
terms, "statistics" means both sets of collected information, as in national accounts and temperature
records, and analytical work which requires statistical inference.

Statistical activities are often associated with models expressed using probabilities, and
require probability theory for them to be put on a firm theoretical basis: see History of probability.

A number of statistical concepts have had an important impact on a wide range of sciences. These
include the design of experiments and approaches to statistical inference such as Bayesian
inference, each of which can be considered to have its own sequence in the development of the
ideas underlying modern statistics.
History of statistics
By the 18th century, the term "statistics" designated the systematic collection of demographic and economic data by states. In the early 19th century, the meaning of "statistics" broadened to include the discipline concerned with the collection, summary, and analysis of data. Today statistics is widely employed in government, business, and all the sciences. Electronic computers have expedited statistical computation and have allowed statisticians to develop "computer-intensive" methods.

The term "mathematical statistics" designates the mathematical theories of probability and statistical inference, which are used in statistical practice. The relation between statistics and probability theory developed rather late, however. In the 19th century, statistics increasingly used probability theory, whose initial results were found in the 17th and 18th centuries, particularly in the analysis of games of chance (gambling). By 1800, astronomy used probability models and statistical theories, particularly the method of least squares, which was invented by Legendre and Gauss. Early probability theory and statistics was systematized and extended by Laplace; following Laplace, probability and statistics have been in continual development. In the 19th century, statistical reasoning and probability models were used by social scientists to advance the new sciences of experimental psychology and sociology, and by physical scientists in thermodynamics and statistical mechanics. The development of statistical reasoning was closely associated with the development of inductive logic and the scientific method.

Statistics can be regarded not as a field of mathematics but as an autonomous mathematical science, like computer science and operations research. Unlike mathematics, statistics had its origins in public administration. It is used in demography and economics. With its emphasis on learning from data and making best predictions, statistics has a considerable overlap with decision science and microeconomics. With its concern with data, statistics has overlap with information science and computer science.

Etymology

The term statistics is ultimately derived from the New Latin statisticum collegium ("council of state") and the Italian word statista ("statesman" or "politician"). The German Statistik, first introduced by Gottfried Achenwall (1749), originally designated the analysis of data about the state, signifying the "science of state" (then called political arithmetic in English). It acquired the meaning of the collection and classification of data generally in the early 19th century. It was introduced into English in 1791 by Sir John Sinclair when he published the first of 21 volumes titled Statistical Account of Scotland.[1]

Thus, the original principal purpose of Statistik was data to be used by governmental and (often centralized) administrative bodies. The collection of data about states and localities continues, largely through national and international statistical services. In particular, censuses provide frequently updated information about the population.

The first book to have statistics in its title was "Contributions to Vital Statistics" by Francis GP Neison, actuary to the Medical Invalid and General Life Office (1st ed., 1845; 2nd ed., 1846; 3rd ed., 1857).

Timeline:

1654 -- Pascal -- mathematics of probability, in correspondence with Fermat
1662 -- William Petty and John Graunt -- first demographic studies
1713 -- Jakob Bernoulli -- Ars Conjectandi
1733 -- De Moivre -- Approximatio; law of error (similar to standard deviation)
1763 -- Rev. Bayes -- An Essay towards solving a Problem in the Doctrine of Chances, foundation for "Bayesian statistics"
1805 -- A-M Legendre -- least squares method
1809 -- C. F. Gauss -- Theoria Motus Corporum Coelestium
1812 -- P. S. Laplace -- Théorie analytique des probabilités
1834 -- Statistical Society of London established
1853 -- Adolphe Quetelet -- organized first international statistics conference; applied statistics to biology; described the bell-shaped curve
1877 -- F. Galton -- regression to the mean
1888 -- F. Galton -- correlation
1889 -- F. Galton -- Natural Inheritance
1900 -- Karl Pearson -- chi square; applied correlation to natural selection
1904 -- Spearman -- rank (non-parametric) correlation coefficient
1908 -- "Student" (W. S. Gosset) -- The probable error of a mean; the t-test
1919 -- R. A. Fisher -- ANOVA; evolutionary biology
1930s -- Jerzy Neyman and Egon Pearson (son of Karl Pearson) -- Type II errors, power of a test, confidence intervals

Glossary of Terms
 Statistics - a set of concepts, rules, and procedures that help us to:

o organize numerical information in the form of tables, graphs, and charts;

o understand statistical techniques underlying decisions that affect our lives and well-being; and

o make informed decisions.

 Data - facts, observations, and information that come from investigations.

o Measurement data, sometimes called quantitative data -- the result of using some instrument to measure something (e.g., test score, weight);

o Categorical data, also referred to as frequency or qualitative data. Things are grouped according to some common property(ies) and the number of members of each group is recorded (e.g., males/females, vehicle type).

 Variable - a property of an object or event that can take on different values. For example, college major is a variable that takes on values like mathematics, computer science, English, psychology, etc.

o Discrete Variable - a variable with a limited number of values (e.g., gender (male/female), college class (freshman/sophomore/junior/senior)).

o Continuous Variable - a variable that can take on many different values; in theory, any value between the lowest and highest points on the measurement scale.

o Independent Variable - a variable that is manipulated, measured, or selected by the researcher as an antecedent condition to an observed behavior. In a hypothesized cause-and-effect relationship, the independent variable is the cause and the dependent variable is the outcome or effect.

o Dependent Variable - a variable that is not under the experimenter's control -- the data. It is the variable that is observed and measured in response to the independent variable.

o Qualitative Variable - a variable based on categorical data.

o Quantitative Variable - a variable based on quantitative data.

 Graphs - visual displays of data used to present frequency distributions so that the shape of the distribution can easily be seen.

o Bar graph - a form of graph that uses bars separated by an arbitrary amount of space to represent how often elements within a category occur. The higher the bar, the higher the frequency of occurrence. The underlying measurement scale is discrete (nominal or ordinal-scale data), not continuous.

o Histogram - a form of bar graph used with interval- or ratio-scaled data. Unlike the bar graph, bars in a histogram touch, with the width of the bars defined by the upper and lower limits of the interval. The measurement scale is continuous, so the lower limit of any one interval is also the upper limit of the previous interval.

o Boxplot - a graphical representation of dispersion and extreme scores. Represented in this graphic are minimum, maximum, and quartile scores in the form of a box with "whiskers." The box includes the range of scores falling into the middle 50% of the distribution (Interquartile Range = 75th percentile - 25th percentile), and the whiskers are lines extended to the minimum and maximum scores in the distribution or to mathematically defined (+/- 1.5*IQR) upper and lower fences.

o Scatterplot - a form of graph that presents information from a bivariate distribution. In a scatterplot, each subject in an experimental study is represented by a single point in two-dimensional space. The underlying scale of measurement for both variables is continuous (measurement data). This is one of the most useful techniques for gaining insight into the relationship between two variables.

 Measures of Center - Plotting data in a frequency distribution shows the general shape of the distribution and gives a general sense of how the numbers are bunched. Several statistics can be used to represent the "center" of the distribution. These statistics are commonly referred to as measures of central tendency.

o Mode - The mode of a distribution is simply defined as the most frequent or common score in the distribution. The mode is the point or value of X that corresponds to the highest point on the distribution. If the highest frequency is shared by more than one value, the distribution is said to be multimodal. It is not uncommon to see distributions that are bimodal, reflecting peaks in scoring at two different points in the distribution.

o Median - The median is the score that divides the distribution into halves; half of the scores are above the median and half are below it when the data are arranged in numerical order. The median is also referred to as the score at the 50th percentile in the distribution. The median location of N numbers can be found by the formula (N + 1) / 2. When N is an odd number, the formula yields an integer that represents the value in a numerically ordered distribution corresponding to the median location. For example, in the distribution of numbers (3 1 5 4 9 9 8) the median location is (7 + 1) / 2 = 4. When applied to the ordered distribution (1 3 4 5 8 9 9), the value 5 is the median; three scores are above 5 and three are below 5. If there were only 6 values (1 3 4 5 8 9), the median location would be (6 + 1) / 2 = 3.5. In this case the median is half-way between the 3rd and 4th scores (4 and 5), or 4.5.

o Mean - The mean is the most common measure of central tendency and the one that can be mathematically manipulated. It is defined as the average of a distribution: the mean equals ΣX / N. Simply, the mean is computed by summing all the scores in the distribution (ΣX) and dividing that sum by the total number of scores (N). The mean is the balance point in a distribution such that if you subtract each value in the distribution from the mean and sum all of these deviation scores, the result will be zero.
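The three measures of center above can be sketched in a few lines of Python. This is a minimal illustration using the median example distribution from the text; the variable names are ours, not from the source:

```python
from collections import Counter

scores = [3, 1, 5, 4, 9, 9, 8]  # the example distribution used above

# Mode: the most frequent score(s); more than one value means multimodal
counts = Counter(scores)
top = max(counts.values())
modes = [value for value, c in counts.items() if c == top]

# Median: the (N + 1) / 2 location rule applied to the ordered distribution
ordered = sorted(scores)
n = len(ordered)
loc = (n + 1) / 2  # 1-based median location
if loc.is_integer():
    median = ordered[int(loc) - 1]
else:  # even N: half-way between the two scores straddling the location
    median = (ordered[int(loc) - 1] + ordered[int(loc)]) / 2

# Mean: sum of all scores divided by N
mean = sum(scores) / n
# Deviations from the mean sum to zero (up to floating-point rounding)
deviation_sum = sum(x - mean for x in scores)

print(modes, median)  # [9] 5
```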

 Measures of Spread - Although the average value in a distribution is informative about how scores are centered in the distribution, the mean, median, and mode lack context for interpreting those statistics. Measures of variability provide information about the degree to which individual scores are clustered about or deviate from the average value in a distribution.

o Range - The simplest measure of variability to compute and understand is the range. The range is the difference between the highest and lowest score in a distribution. Although it is easy to compute, it is not often used as the sole measure of variability due to its instability. Because it is based solely on the most extreme scores in the distribution and does not fully reflect the pattern of variation within a distribution, the range is a very limited measure of variability.
o Interquartile Range (IQR) - Provides a measure of the spread of the middle 50% of the scores. The IQR is defined as the 75th percentile minus the 25th percentile. The interquartile range plays an important role in the graphical method known as the boxplot. The advantages of the IQR are that it is easy to compute and that extreme scores in the distribution have much less impact; this strength is also a weakness, however, in that the IQR suffers as a measure of variability because it discards too much data. Researchers want to study variability while eliminating scores that are likely to be accidents. The boxplot allows for this distinction and is an important tool for exploring data.
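The range, IQR, and the +/- 1.5*IQR boxplot fences described above can likewise be sketched in Python. This is a minimal sketch with made-up example scores; note that the quartiles come from the "exclusive" percentile convention of the standard-library statistics.quantiles function, which is only one of several conventions in common use:

```python
import statistics

scores = [2, 4, 4, 5, 6, 7, 8, 9, 12, 30]  # illustrative data, not from the source

# Range: highest minus lowest score -- easy to compute but unstable
value_range = max(scores) - min(scores)

# IQR: 75th percentile minus 25th percentile (spread of the middle 50%)
q1, _, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

# Boxplot fences: scores beyond +/- 1.5 * IQR are flagged as potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

print(value_range, outliers)  # 28 [30]
```

Note how the single extreme score of 30 inflates the range to 28, while the IQR-based fences simply flag it as an outlier, which is exactly the trade-off the text describes.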

Data and Statistics: Terminology and Examples


Terminology Basics

Below you will find simple definitions of the basic terminology associated with data
and statistics. The examples will link into DataSheets from Data-Planet Statistical
Ready Reference. From the DataSheets, you can link into Data-Planet Statistical
Datasets to explore the millions of time series available in the repository.

Data: Fundamentally, data = information. We typically use the term to refer to numeric files that are created and organized for analysis. There are two types of data: aggregate data and microdata.

 Aggregate data are statistical summaries of data, meaning that the data have been analyzed in some way.
The Data-Planet repository is an excellent resource for obtaining aggregated data.

 Microdata: Individual response data obtained in surveys and censuses - these are data points directly
observed or collected from a specific unit of observation. Also known as raw data. ICPSR is an excellent
resource for obtaining microdata files.

Data point or datum: The singular of data; refers to a single point of data. Example: the amount of aviation gasoline consumed by the transportation sector in the US in 2012.

Quantitative data/variables: Information that can be handled numerically. Example: spending by US consumers on personal care products and services.

Qualitative data/variables: Information that refers to the quality of something. Ethnographic research, participant observation, open-ended interviews, etc., may collect qualitative data. However, often there is some element of the results obtained via qualitative research that can be handled numerically, eg, how many observations, number of interviews conducted, etc. Example: periods when the US was in, vs was not in, a recession, 1850-2014. The quality of being in a recession is assigned a value of .01 and not in a recession .0, which makes it possible to display as a chart.
Indicator: Typically used as a synonym for statistics describing variables that capture something about the socioeconomic environment of a society, eg, per capita income, unemployment rate, median years of education.

Statistic: A number that describes some characteristic, or status, of a variable, eg, a count or a percentage. Example: total nonfarm job starts in August 2014.

Statistics: Numerical summaries of data that have been analyzed in some way. Example: ranking of airlines by percentage of flights arriving on-time into Huntsville International Airport in Alabama in 2013.

Time series data: Any data arranged in chronological order. Example: Gross
Domestic Product of Greece, 2000-2013

Variable: Any finding that can change or vary. Examples include anything that
can be measured, such as the number of logging operations in Alabama.

 Numerical variable: Usually referring to a variable whose possible values are numbers. Example: Bank
Prime Loan Rate

 Categorical variable: A variable that distinguishes among subjects by putting them in categories (eg, gender). Also called discrete or nominal variables. Example: Female Infant Mortality Rate of Belarus (the mortality rate is numerical - the age and gender characteristic is categorical)

Terminology Used with Collections of Data

Data aggregation: A collection of datapoints and datasets. Example: a search on the broad category "energy resources and industries" retrieves results from multiple sources.

Dataset: A collection of related data items, eg, the responses of survey participants. This term is used very loosely – the entire Census 2010 Summary File 1 can be considered a dataset, as can any individual time series included in the Census 2010, eg, Table P20. Households by Presence of People Under 18 Years by Household Type by Age of People Under 18 Years.

Database: A collection of data organized for research and retrieval. Example: OECD Factbook. Example: American Community Survey.

Time series: A set of measures of a single variable recorded over a period of time. Example: Hourly Mean Earnings of Civilian Workers – Mining Management, Professional, and Related Workers.

"Big Data" Terminology

Big data: A popular term used to describe the exponential growth and availability of structured and unstructured data, driven by the increasing sophistication of operational and transactional systems, mobile media, and the Internet. Big data and its analysis have become key components of obtaining business intelligence in particular.
Data analytics: Generally used to refer to the analytical techniques and tools required to analyze massive amounts
of data. Closely related to data mining, which refers to the extraction of information from business systems.
