
Statistics

Statistics (from German: Statistik, orig. "description of a state, a country")[1][2] is the discipline that
concerns the collection, organization, analysis, interpretation, and presentation of data.[3][4][5] In
applying statistics to a scientific, industrial, or social problem, it is conventional to begin with
a statistical population or a statistical model to be studied. Populations can be diverse groups of
people or objects such as "all people living in a country" or "every atom composing a crystal".
Statistics deals with every aspect of data, including the planning of data collection in terms of the
design of surveys and experiments.[6]
When census data cannot be collected, statisticians collect data by developing specific experiment
designs and survey samples. Representative sampling assures that inferences and conclusions can
reasonably extend from the sample to the population as a whole. An experimental study involves
taking measurements of the system under study, manipulating the system, and then taking
additional measurements using the same procedure to determine if the manipulation has modified
the values of the measurements. In contrast, an observational study does not involve experimental
manipulation.
Two main statistical methods are used in data analysis: descriptive statistics, which summarize data
from a sample using indexes such as the mean or standard deviation, and inferential statistics,
which draw conclusions from data that are subject to random variation (e.g., observational errors,
sampling variation).[7] Descriptive statistics are most often concerned with two sets of properties of
a distribution (sample or population): central tendency (or location) seeks to characterize the
distribution's central or typical value, while dispersion (or variability) characterizes the extent to which
members of the distribution depart from its center and from each other. Inferences in mathematical
statistics are made within the framework of probability theory, which deals with the analysis of
random phenomena.
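As a minimal illustration of these two descriptive properties, the following Python sketch computes a mean (central tendency) and a standard deviation (dispersion); the sample values are invented purely for demonstration.

```python
import statistics

# A small invented sample of measurements (illustrative only).
sample = [4.8, 5.1, 5.0, 4.7, 5.3, 4.9, 5.2]

# Central tendency: the sample mean characterizes the typical value.
mean = statistics.mean(sample)

# Dispersion: the sample standard deviation characterizes how far
# members of the sample depart from that center (divides by n - 1).
stdev = statistics.stdev(sample)

print(f"mean = {mean:.3f}, standard deviation = {stdev:.3f}")
```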
A standard statistical procedure involves the collection of data leading to a test of the
relationship between two statistical data sets, or a data set and synthetic data drawn from an
idealized model. A hypothesis is proposed for the statistical relationship between the two data sets,
and this is compared as an alternative to an idealized null hypothesis of no relationship between two
data sets. Rejecting or disproving the null hypothesis is done using statistical tests that quantify the
sense in which the null can be proven false, given the data that are used in the test. Working from a
null hypothesis, two basic forms of error are recognized: Type I errors (null hypothesis is falsely
rejected giving a "false positive") and Type II errors (null hypothesis fails to be rejected and an actual
relationship between populations is missed giving a "false negative").[8] Multiple problems have come
to be associated with this framework, ranging from obtaining a sufficient sample size to specifying an
adequate null hypothesis.[7]
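One common concrete instance of this framework is the two-sample t-test. The sketch below, which assumes SciPy is available and uses invented data, tests a null hypothesis of equal group means; the significance level alpha bounds the probability of a Type I error.

```python
import random

from scipy import stats  # assumed available; any t-test implementation would do

random.seed(0)

# Two invented groups of measurements (illustrative only).
control = [random.gauss(10.0, 2.0) for _ in range(30)]
treatment = [random.gauss(11.5, 2.0) for _ in range(30)]

# Null hypothesis: the two groups share the same mean (no relationship).
t_stat, p_value = stats.ttest_ind(control, treatment)

# Rejecting at level alpha bounds the Type I ("false positive") error rate;
# failing to reject when the means truly differ is a Type II error.
alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null hypothesis")
```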
Measurement processes that generate statistical data are also subject to error. Many of these errors
are classified as random (noise) or systematic (bias), but other types of errors (e.g., blunder, such as
when an analyst reports incorrect units) can also occur. The presence of missing
data or censoring may result in biased estimates, and specific techniques have been developed to
address these problems.

Contents

1 Introduction
  1.1 Mathematical statistics
2 History
3 Statistical data
  3.1 Data collection
  3.2 Types of data
4 Methods
  4.1 Descriptive statistics
  4.2 Inferential statistics
  4.3 Exploratory data analysis
5 Misuse
  5.1 Misinterpretation: correlation
6 Applications
  6.1 Applied statistics, theoretical statistics and mathematical statistics
  6.2 Machine learning and data mining
  6.3 Statistics in academia
  6.4 Statistical computing
  6.5 Business statistics
  6.6 Statistics applied to mathematics or the arts
7 Specialized disciplines
8 See also
9 References
10 Further reading
11 External links

Introduction
Main article: Outline of statistics

Statistics is a mathematical body of science that pertains to the collection, analysis, interpretation or
explanation, and presentation of data;[9] some regard it as a branch of mathematics.[10] Others consider statistics to
be a distinct mathematical science rather than a branch of mathematics. While many scientific
investigations make use of data, statistics is concerned with the use of data in the context of
uncertainty and decision making in the face of uncertainty.[11][12]
In applying statistics to a problem, it is common practice to start with a population or process to be
studied. Populations can be diverse topics such as "all people living in a country" or "every atom
composing a crystal". Ideally, statisticians compile data about the entire population (an operation
called a census). This may be organized by governmental statistical institutes. Descriptive
statistics can be used to summarize the population data. Numerical descriptors
include mean and standard deviation for continuous data (like income), while frequency and
percentage are more useful in terms of describing categorical data (like education).
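The distinction between these descriptors can be sketched in a few lines of Python; the records below are invented for illustration.

```python
from collections import Counter
from statistics import mean, stdev

# Invented population records (illustrative only).
incomes = [32_000, 45_500, 51_200, 28_750, 61_000, 39_900]   # continuous
education = ["primary", "secondary", "secondary",
             "tertiary", "tertiary", "tertiary"]              # categorical

# Continuous data: mean and standard deviation are natural descriptors.
print(f"income: mean = {mean(incomes):.0f}, stdev = {stdev(incomes):.0f}")

# Categorical data: frequencies and percentages are more informative.
for level, count in Counter(education).items():
    print(f"{level}: {count} ({100 * count / len(education):.0f}%)")
```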
When a census is not feasible, a chosen subset of the population called a sample is studied. Once a
sample that is representative of the population is determined, data is collected for the sample
members in an observational or experimental setting. Again, descriptive statistics can be used to
summarize the sample data. However, drawing the sample contains an element of randomness;
hence, the numerical descriptors from the sample are also prone to uncertainty. To draw meaningful
conclusions about the entire population, inferential statistics is needed. It uses patterns in the
sample data to draw inferences about the population represented while accounting for randomness.
These inferences may take the form of answering yes/no questions about the data (hypothesis
testing), estimating numerical characteristics of the data (estimation), describing associations within
the data (correlation), and modeling relationships within the data (for example, using regression
analysis). Inference can extend to forecasting, prediction, and estimation of unobserved values
either in or associated with the population being studied. It can
include extrapolation and interpolation of time series or spatial data, and data mining.
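As a small example of estimation, the following sketch computes a point estimate and an approximate 95% confidence interval for a population mean from an invented sample, using only the Python standard library and a normal approximation (a t interval would be more exact for so few observations).

```python
import math
import statistics

# Invented sample drawn from a larger population (illustrative only).
sample = [12.1, 11.4, 13.0, 12.7, 11.9, 12.3, 12.8, 11.6, 12.5, 12.0]

n = len(sample)
m = statistics.mean(sample)   # point estimate of the population mean
s = statistics.stdev(sample)  # sample standard deviation

# Normal-approximation 95% confidence interval: m +/- z * s / sqrt(n).
z = statistics.NormalDist().inv_cdf(0.975)
half_width = z * s / math.sqrt(n)

print(f"estimated mean: {m:.2f} "
      f"(95% CI: {m - half_width:.2f} to {m + half_width:.2f})")
```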

Mathematical statistics
Main article: Mathematical statistics

Mathematical statistics is the application of mathematics to statistics. Mathematical techniques used
for this include mathematical analysis, linear algebra, stochastic analysis, differential equations,
and measure-theoretic probability theory.[13][14]

History

Gerolamo Cardano, a pioneer of the mathematics of probability.


Main articles: History of statistics and Founders of statistics
The early writings on statistical inference date back to Arab mathematicians and cryptographers,
during the Islamic Golden Age between the 8th and 13th centuries. Al-Khalil (717–786) wrote
the Book of Cryptographic Messages, which contains the first use of permutations and combinations,
to list all possible Arabic words with and without vowels.[15] In his book, Manuscript on Deciphering
Cryptographic Messages, Al-Kindi gave a detailed description of how to use frequency analysis to
decipher encrypted messages. Al-Kindi also made the earliest known use of statistical inference,
while he and later Arab cryptographers developed the early statistical methods
for decoding encrypted messages. Ibn Adlan (1187–1268) later made an important contribution, on
the use of sample size in frequency analysis.[15]
The earliest European writing on statistics dates back to 1663, with the publication of Natural and
Political Observations upon the Bills of Mortality by John Graunt.[16] Early applications of statistical
thinking revolved around the needs of states to base policy on demographic and economic data,
hence its stat- etymology. The scope of the discipline of statistics broadened in the early 19th
century to include the collection and analysis of data in general. Today, statistics is widely employed
in government, business, and natural and social sciences.
The mathematical foundations of modern statistics were laid in the 17th century with the
development of the probability theory by Gerolamo Cardano, Blaise Pascal and Pierre de Fermat.
Mathematical probability theory arose from the study of games of chance, although the concept of
probability was already examined in medieval law and by philosophers such as Juan Caramuel.[17]
The method of least squares was first described by Adrien-Marie Legendre in 1805.

Karl Pearson, a founder of mathematical statistics.

The modern field of statistics emerged in the late 19th and early 20th century in three stages.[18] The
first wave, at the turn of the century, was led by the work of Francis Galton and Karl Pearson, who
transformed statistics into a rigorous mathematical discipline used for analysis, not just in science,
but in industry and politics as well. Galton's contributions included introducing the concepts
of standard deviation, correlation, regression analysis and the application of these methods to the
study of the variety of human characteristics such as height, weight, and eyelash length.[19]
Pearson developed the Pearson product-moment correlation coefficient, defined as a product-
moment,[20] the method of moments for the fitting of distributions to samples and the Pearson
distribution, among many other things.[21] Galton and Pearson founded Biometrika as the first journal
of mathematical statistics and biostatistics (then called biometry), and the latter founded the world's
first university statistics department at University College London.[22]
Ronald Fisher coined the term null hypothesis during the Lady tasting tea experiment, which "is
never proved or established, but is possibly disproved, in the course of experimentation".[23][24]
The second wave of the 1910s and 20s was initiated by William Sealy Gosset, and reached its
culmination in the insights of Ronald Fisher, who wrote the textbooks that were to define the
academic discipline in universities around the world. Fisher's most important publications were his
1918 seminal paper The Correlation between Relatives on the Supposition of Mendelian
Inheritance (which was the first to use the statistical term, variance), his classic 1925 work Statistical
Methods for Research Workers and his 1935 The Design of Experiments,[25][26][27] where he developed
rigorous design of experiments models. He originated the concepts of sufficiency, ancillary
statistics, Fisher's linear discriminator and Fisher information.[28] In his 1930 book The Genetical
Theory of Natural Selection, he applied statistics to various biological concepts such as Fisher's
principle[29] (which A. W. F. Edwards called "probably the most celebrated argument in evolutionary
biology") and Fisherian runaway,[30][31][32][33][34][35] a concept in sexual selection about a positive feedback
runaway effect found in evolution.
The final wave, which mainly saw the refinement and expansion of earlier developments, emerged
from the collaborative work between Egon Pearson and Jerzy Neyman in the 1930s. They
introduced the concepts of "Type II" error, power of a test and confidence intervals. Jerzy Neyman in
1934 showed that stratified random sampling was in general a better method of estimation than
purposive (quota) sampling.[36]
Today, statistical methods are applied in all fields that involve decision making, for making accurate
inferences from a collated body of data and for making decisions in the face of uncertainty based on
statistical methodology. The use of modern computers has expedited large-scale statistical
computations and has also made possible new methods that are impractical to perform manually.
Statistics continues to be an area of active research, for example on the problem of how to
analyze big data.[37]

Statistical data
Main article: Statistical data

Data collection
Sampling
When full census data cannot be collected, statisticians collect sample data by developing
specific experiment designs and survey samples. Statistics itself also provides tools for prediction
and forecasting through statistical models.
To use a sample as a guide to an entire population, it is important that it truly represents the overall
population. Representative sampling assures that inferences and conclusions can safely extend
from the sample to the population as a whole. A major problem lies in determining the extent that the
sample chosen is actually representative. Statistics offers methods to estimate and correct for any
bias within the sample and data collection procedures. There are also methods of experimental
design for experiments that can lessen these issues at the outset of a study, strengthening its
capability to discern truths about the population.
Sampling theory is part of the mathematical discipline of probability theory. Probability is used
in mathematical statistics to study the sampling distributions of sample statistics and, more
generally, the properties of statistical procedures. The use of any statistical method is valid when the
system or population under consideration satisfies the assumptions of the method. The difference in
point of view between classic probability theory and sampling theory is, roughly, that probability
theory starts from the given parameters of a total population to deduce probabilities that pertain to
samples. Statistical inference, however, moves in the opposite direction—inductively inferring from
samples to the parameters of a larger or total population.
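This deductive direction can be made concrete by simulation: given known population parameters, probability theory predicts how sample statistics behave. The sketch below (with invented parameters) draws many samples from a normal population and checks that the sample means scatter around the population mean with roughly the theoretical standard error.

```python
import random
import statistics

random.seed(1)

# Known population parameters: the deductive starting point.
mu, sigma, n = 50.0, 8.0, 25

# Draw many samples of size n and record each sample's mean.
sample_means = [
    statistics.mean(random.gauss(mu, sigma) for _ in range(n))
    for _ in range(10_000)
]

# The sampling distribution of the mean concentrates around mu,
# with spread close to the standard error sigma / sqrt(n).
print(f"mean of sample means: {statistics.mean(sample_means):.2f} (mu = {mu})")
print(f"spread of sample means: {statistics.stdev(sample_means):.2f} "
      f"(theory: {sigma / n ** 0.5:.2f})")
```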
Experimental and observational studies
A common goal for a statistical research project is to investigate causality, and in particular to draw a
conclusion on the effect of changes in the values of predictors or independent variables on
dependent variables. There are two major types of causal statistical studies: experimental
studies and observational studies. In both types of studies, the effect of differences of an
independent variable (or variables) on the behavior of the dependent variable are observed. The
difference between the two types lies in how the study is actually conducted. Each can be very
effective. An experimental study involves taking measurements of the system under study,
manipulating the system, and then taking additional measurements using the same procedure to
determine if the manipulation has modified the values of the measurements. In contrast, an
observational study does not involve experimental manipulation. Instead, data are gathered and
correlations between predictors and response are investigated. While the tools of data analysis work
best on data from randomized studies, they are also applied to other kinds of data, such as natural
experiments and observational studies,[38] for which a statistician would use modified, more
structured estimation methods (e.g., difference-in-differences estimation and instrumental variables,
among many others) that produce consistent estimators.
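To make the difference-in-differences idea concrete, the arithmetic below uses invented group means: the control group's change over time estimates the trend that would have occurred anyway, and subtracting it from the treated group's change isolates the estimated effect.

```python
# Invented group means before and after an intervention (illustrative only).
treated_before, treated_after = 10.0, 14.0
control_before, control_after = 10.5, 11.5

# Difference-in-differences: subtract the control group's change (the
# background trend) from the treated group's change.
effect = (treated_after - treated_before) - (control_after - control_before)

print(f"estimated treatment effect: {effect:.1f}")  # (4.0) - (1.0) = 3.0
```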
