
Univariate Data Distributions


• For almost all geologic & engineering data, the values are distributed over a range
– We want to determine the variability of the data
– Different data sets follow different statistical distributions, each described by its own set of parameters (location, spread, shape, etc.)
• The data can be discrete or non-discrete (continuous).
• Discrete: number of wells per square mile, number of heads
when flipping coins
– You can’t interpolate between categories
• Continuous: geochemical data, measurements of length, weight, etc.
– You can interpolate between values.
– In reality, all measurements are discrete to some extent (we measure to the nearest millimeter, etc.)
– A convenient representation of continuous data is a histogram
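
As a minimal sketch of how a histogram summarizes continuous data, the snippet below bins a small set of values into equal-width classes and prints a text histogram. The measurements and the bin width are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical measurements (e.g. lengths in mm); not taken from the slides.
values = [2.1, 2.4, 2.5, 3.0, 3.2, 3.3, 3.3, 3.8, 4.1, 4.9]
bin_width = 1.0  # assumed bin width


def histogram(data, width):
    """Count how many values fall in each bin [k*width, (k+1)*width)."""
    counts = Counter(int(x // width) for x in data)
    # Map the left edge of each occupied bin to its count.
    return {k * width: counts[k] for k in sorted(counts)}


hist = histogram(values, bin_width)
for left_edge, count in hist.items():
    print(f"{left_edge:4.1f}-{left_edge + bin_width:4.1f} | {'#' * count}")
```

A different bin width can change the apparent shape of the distribution, so it is worth trying more than one.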

• Distributions can be:

– Unimodal (only one peak), as with the two examples above
– Bimodal (two peaks)
– Symmetric about the mode or median: as in the examples to the right
– Skewed (asymmetric, with one tail longer than the other)
– Or random (???)
• Most measurements follow specific distributions that can be described and explained by theoretical considerations
• The major ones are the normal, lognormal, Poisson, binomial, exponential, and negative binomial distributions
• This is useful because it allows us to characterize a population from a sample, predict the likelihood that a measurement comes from a population, and compare our distribution to standard models.
• Indispensable for accurate inferential statistics
• We can use parametric statistics on data that follow a specific distribution; otherwise we need to use nonparametric statistics.
Normal Distribution:
– Also called the bell curve, Gaussian distribution, or standard curve
– Most important distribution in statistics:
• Random errors are additive.
– We will refer to this curve a lot
– It is characterized by its mean (μ) and standard deviation (σ). It is a continuous function given by the equation:
$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
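
The normal density with mean μ and standard deviation σ can be evaluated directly. The sketch below uses the standard Gaussian density and a simple midpoint rule (an assumed choice of integrator) to check that about 68% of the area lies within one standard deviation of the mean.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std. dev. sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f from a to b."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


# Height of the standard normal curve (mu = 0, sigma = 1) at its mean.
peak = normal_pdf(0.0, 0.0, 1.0)

# Area under the curve between mu - sigma and mu + sigma.
within_one_sigma = integrate(lambda x: normal_pdf(x, 0.0, 1.0), -1.0, 1.0)

print(f"peak height = {peak:.4f}")
print(f"area within one sigma = {within_one_sigma:.4f}")
```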
• Examples of normal distributions:
– Means of data sets from other distributions
– Water levels in a well through time
– Miles of streams per unit area of a drainage basin
– Percent of moisture in sediments
– Topographic relief
– Densities of specimens from an intrusion
– Major elements in some rocks
– Lengths of fossils or other measurements
– Percentage of major minerals in rocks
– Pebble sphericity for fixed particle/pebble size
Lognormal distribution: next most common
– A variant of the normal distribution in that the logarithms of the values form a normal distribution
– Lognormal distributions are continuous distributions
– Characterized by a lot of small values and a few very large values
– A highly skewed distribution. Random errors are multiplicative.
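Example:
– The first graph is a plot of ppm As in a sample. The second is a plot of log(As) values.
– The equation for lognormal data is:
$f(x) = \frac{1}{x\,\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln x-\mu)^2}{2\sigma^2}\right), \quad x > 0$
where μ and σ are the mean and standard deviation of ln x.

[Figure: two panels, "Lognormal" and "Logarithmic transform of log-normal data"]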


• A variable might be modeled as log-normal if it can be thought of as the multiplicative
product of many independent factors which are positive and close to 1.

– For example the long-term return rate on a stock investment can be considered to be
the product of the daily return rates.

– In wireless communication, the attenuation caused by shadowing or slow fading from random objects is often assumed to be log-normally distributed.

• Log-normal distributions are also particularly common when mean values are low,
variances large, and values cannot be negative
– E.g. species abundances
– Distribution of mineral resources in the earth’s crust
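
The multiplicative origin of lognormal data can be illustrated with a small simulation. In this sketch each value is the product of 50 independent positive factors close to 1. The uniform range 0.9 to 1.1, the number of factors, the sample size, and the seed are all assumptions made for the demonstration. The raw products come out right-skewed (mean above median), while their logarithms are roughly symmetric.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def product_of_factors(n_factors=50, low=0.9, high=1.1):
    """Multiply together n_factors independent positive factors near 1."""
    value = 1.0
    for _ in range(n_factors):
        value *= random.uniform(low, high)
    return value


products = [product_of_factors() for _ in range(20_000)]
logs = [math.log(v) for v in products]

mean_raw, median_raw = statistics.mean(products), statistics.median(products)
mean_log, median_log = statistics.mean(logs), statistics.median(logs)

print(f"raw values: mean = {mean_raw:.3f}, median = {median_raw:.3f}")
print(f"log values: mean = {mean_log:.3f}, median = {median_log:.3f}")
```

Taking logarithms turns the product into a sum of many small independent terms, which is why the log-transformed values look approximately normal.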
• Examples of lognormal distributions:
– Contents of trace elements in rocks
– Magnitudes of earthquakes
– Production of certain types of mines
– Sediment particle-size distributions
– Heights of floodwaters in a river
– Gold assay values from many mines
– Permeability of some sediments
– Areas of river placer deposits
– Magnitudes of volcanic eruptions
• Warning about lognormal: It is tempting to characterize any highly skewed
distribution as lognormal when it might be simply an overlapping mixture of
several Normal distributions
Poisson distribution
Poisson distribution is a statistical distribution that shows how many times an event is likely to occur
within a specified period of time. It is used for independent events which occur at a constant rate within a
given interval of time.

– A discrete distribution for events that are rare and random
• e.g. radioactive decays per hour, wells per square mile, etc.

– Defined by the equation:
$P(r) = \frac{\lambda^{r} e^{-\lambda}}{r!}$
• where P(r) is the probability of r events occurring if the average number of events per unit of time or area is λ.
• The Poisson distribution expresses the probability of a number of events occurring in a fixed period of time if these events occur with a known average rate and independently of the time since the last event.

• The Poisson distribution can also be used for the number of events in other specified intervals such as distance, area or volume.

• The Poisson distribution can be applied to systems with a large number of possible events, each of which is rare.
– A classic example is the nuclear decay of atoms.

• Properties:
– The expected value of a Poisson-distributed random variable is equal to λ, and so is its variance.

• The Poisson distribution is sometimes called the law of small numbers because it is the probability
distribution of the number of occurrences of an event that happens rarely but has very many
opportunities to happen
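
The Poisson probabilities can be computed with the standard formula, rate to the power r, times exp(-rate), divided by r factorial. The sketch below uses a hypothetical average rate of 2 events per interval (called `lam` in the code) and checks numerically that the probabilities sum to 1 and that the mean and variance both equal the rate.

```python
import math


def poisson_pmf(r, lam):
    """Probability of exactly r events when the average count per interval is lam."""
    return math.exp(-lam) * lam**r / math.factorial(r)


lam = 2.0  # hypothetical average number of events per interval

# Probabilities for r = 0..59; the tail beyond that is negligible for lam = 2.
probs = [poisson_pmf(r, lam) for r in range(60)]

total = sum(probs)
mean = sum(r * p for r, p in enumerate(probs))
variance = sum((r - mean) ** 2 * p for r, p in enumerate(probs))

for r in range(6):
    print(f"P({r}) = {probs[r]:.4f}")
print(f"sum = {total:.6f}, mean = {mean:.6f}, variance = {variance:.6f}")
```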
• Examples of Poisson distribution:
– Number of particles emitted by a radioactive source in a given time
– Number of major earthquakes in a given time interval
– Numbers of meteorite falls (NOT finds) over a given area
– Sizes of invertebrates in a “death” population
– Numbers of grains of a mineral per unit area in an isotropic rock.
– Numbers of grains of accessory mineral in sediment samples

• Note how all of these are discrete and usually rare events.
Statistics and Statistical Parameters
• A sample is a subset of a population.
• We collect a sample and calculate statistics from it to infer parameters of the population.
• Generally we want to know:
– central tendency, dispersion (or spread), symmetry and shape
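• Moments: these four quantities are related through the moments of the distribution (the mean is the first moment; variance, skewness, and kurtosis are based on the second, third, and fourth central moments)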
Measures of Central Tendency
Other types of means:

• Weighted mean: each data point is given a different level of importance or weight. Can weight by variance, standard deviation, precision of measurement, volume, time, etc.

• Example 1: The water level in a well is measured over a time period.
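
The details of this example did not survive in the text, so the following is only a sketch with hypothetical numbers: three water-level readings, each weighted by the number of days the level persisted. The weighted mean is the sum of value times weight, divided by the sum of the weights.

```python
# Hypothetical water levels (m) and how long each level persisted (days).
levels_m = [12.0, 12.5, 11.0]
durations_days = [10, 5, 15]


def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


time_weighted = weighted_mean(levels_m, durations_days)
unweighted = sum(levels_m) / len(levels_m)

print(f"time-weighted mean = {time_weighted:.3f} m")
print(f"unweighted mean    = {unweighted:.3f} m")
```

The weighted value is pulled toward the level that persisted longest, which the plain average ignores.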


Geometric Mean:
Harmonic Mean
– Typically, the harmonic mean is appropriate for situations when the average of rates is
desired (e.g. glacier motion rates)

– As shown on the page before, for all data sets of positive values containing at least one pair of non-equal values, the harmonic mean is always the least of the three means, the arithmetic mean is always the greatest, and the geometric mean is always in between.

– (If all values in a nonempty dataset are equal, the three means are always equal to one
another; e.g. the harmonic, geometric, and arithmetic means of {2, 2, 2} are all 2.)

– Since the harmonic mean of a list of numbers tends strongly toward the least elements
of the list, it tends (compared to the arithmetic mean) to mitigate the impact of large
outliers and aggravate the impact of small ones.
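
The ordering of the three means can be checked with Python's `statistics` module. The data set {1, 4, 16} is a hypothetical choice for illustration; the second data set repeats the {2, 2, 2} case mentioned above.

```python
import statistics

data = [1.0, 4.0, 16.0]  # hypothetical positive values

arithmetic = statistics.mean(data)
geometric = statistics.geometric_mean(data)
harmonic = statistics.harmonic_mean(data)

print(f"arithmetic = {arithmetic:.4f}")
print(f"geometric  = {geometric:.4f}")
print(f"harmonic   = {harmonic:.4f}")

# When all values are equal, the three means coincide.
equal = [2.0, 2.0, 2.0]
equal_means = (
    statistics.mean(equal),
    statistics.geometric_mean(equal),
    statistics.harmonic_mean(equal),
)
print(equal_means)
```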
Skewness
Skewness is a measure of the asymmetry of a distribution:

– Negative skew: The left tail is longer; the mass of the distribution is concentrated on the right of the figure. The distribution is said to be left-skewed.
• The mean is typically lower than the median, which in turn is lower than the mode (i.e., mean < median < mode)
• The skewness coefficient is lower than zero
– Positive skew: The right tail is longer; the mass of the distribution is concentrated on the left of the figure. The distribution is said to be right-skewed.
• The mean is typically greater than the median, which is greater than the mode (i.e., mean > median > mode)
• The skewness coefficient is greater than zero.
– In a skewed (unbalanced, lopsided) distribution, the mean is farther out in the long tail than is the median. If there is no skewness, or the distribution is symmetric like the bell-shaped normal curve, then mean = median = mode.
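
A skewness coefficient can be computed in several ways. The sketch below uses one common definition, the third central moment divided by the second central moment raised to the power 3/2, on a small hypothetical sample with one large value. Mirroring the sample flips the sign of the coefficient.

```python
import statistics

sample = [1, 2, 2, 3, 3, 3, 4, 10]  # hypothetical right-skewed sample


def skewness(data):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2**1.5


right_skew = skewness(sample)
left_skew = skewness([-x for x in sample])  # mirrored sample

print(f"mean = {statistics.mean(sample)}, median = {statistics.median(sample)}")
print(f"skewness = {right_skew:.3f}, mirrored skewness = {left_skew:.3f}")
```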
Distribution Types
Probability
• To understand why data are distributed the way they are and why data fall in specific types, a
review of probability is necessary.
• Basics on Probability:
– For independent sampling: That is, the sample that I choose is not related to the previous sampling
(true for coin flipping, dice rolling, not true for choosing marbles out of a bag)
– If there are N equally likely ways for an event to occur and there are M desired events, then the
probability (p) of the desired events occurring is M/N.
– For a coin, there are 2 (N = 2) possible outcomes. If one of the outcomes is desired (e.g. heads)
then M = 1, and the probability of getting that outcome is M/N = 1⁄2.
– For a die, there are 6 possible outcomes. If the desired outcome is a 5 or 6 (M = 2), then the
probability of getting those numbers is M/N = 2/6 = 1/3.
– The probability of something happening (p) is always between 0 and 1 (0 ≤ p ≤ 1)
Basics of Probability:
– Again for a coin: p_heads + p_tails = 1, or for a die, p_1 + p_2 + p_3 + p_4 + p_5 + p_6 = 1
– If the probability of an event occurring is p, then the probability of that event not
occurring is 1 – p = q
– Let’s look at an unbiased sample of a manufacturing process:
• In this process, for every 5 parts made, one is defective.
• The probability of an acceptable part (A) is 4/5, the probability of a defective part (D) is 1/5.
• If I select a part at random, the probability that I get an A is 4/5, and for D is 1/5
Basics of Probability:
– How do probabilities change with a second draw?
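
One way to explore the second draw is to enumerate every two-part sequence. This sketch assumes the draws are independent (each part is acceptable with probability 4/5 regardless of the previous draw), so the probability of a sequence is the product of the individual probabilities.

```python
from fractions import Fraction
from itertools import product

# Single-draw probabilities from the manufacturing example.
p = {"A": Fraction(4, 5), "D": Fraction(1, 5)}

# Assuming independent draws, multiply the probabilities along each sequence.
two_draws = {a + b: p[a] * p[b] for a, b in product("AD", repeat=2)}

for outcome, prob in two_draws.items():
    print(f"P({outcome}) = {prob}")

# Exactly one defective part can occur in two orders: AD or DA.
one_defective = two_draws["AD"] + two_draws["DA"]
print(f"P(exactly one D) = {one_defective}")
```

The four sequence probabilities sum to 1, and the two orderings with one defective part are combined because the order usually does not matter.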
Probability – Binomial Equation

• What about 10 draws?

• Obviously if the probability was different, the curve would be different.
– If p = 1⁄2, q = 1⁄2 (like flipping a coin)

• Remember: the area under the curve = 1. As N gets very large, the distribution approaches a continuous curve.
In practice, if N > 30 the distribution can be treated as continuous.
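
The slide equation itself is not reproduced in the text. The sketch below uses the standard binomial probability, C(N, k) p^k q^(N-k), to tabulate the distribution for 10 draws with p = 4/5 and with p = 1/2.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent draws."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


N = 10
pmf_parts = [binomial_pmf(k, N, 4 / 5) for k in range(N + 1)]  # acceptable parts
pmf_coin = [binomial_pmf(k, N, 1 / 2) for k in range(N + 1)]   # coin flips

print(" k   p=4/5    p=1/2")
for k in range(N + 1):
    print(f"{k:2d}  {pmf_parts[k]:.4f}   {pmf_coin[k]:.4f}")

# Most likely number of successes in each case.
mode_parts = max(range(N + 1), key=lambda k: pmf_parts[k])
mode_coin = max(range(N + 1), key=lambda k: pmf_coin[k])
print(f"most likely k: {mode_parts} (p=4/5), {mode_coin} (p=1/2)")
```

With p = 1/2 the probabilities are symmetric about k = 5, while with p = 4/5 they pile up near the high end.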
More theory on the binomial distribution:

– The mean of the binomial distribution: μ = N*p
• For p = 4/5, N = 10, μ = ?
• For p = 1/2, N = 10, μ = ?
– The variance of the distribution: σ² = N*p*q
– and the standard deviation: σ = √(N*p*q)
• For p = 4/5: σ = ?
• For p = 1/2: σ = ?
– Skewness is a measure of symmetry = (1 − 2p)/σ
• For p = 4/5: skew = ?
• For p = 1/2: skew = ?
– Kurtosis is a measure of the shape of a distribution = 3 + (1/σ²) − 6/N
• For p = 4/5: kurtosis = ?
• For p = 1/2: kurtosis = ?
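
The formulas above can be evaluated directly to check hand calculations. This is a minimal sketch for N = 10, with q = 1 − p; the function name `binomial_summary` is just an illustrative helper.

```python
import math


def binomial_summary(n, p):
    """Mean, variance, standard deviation, skewness and kurtosis of a binomial."""
    q = 1 - p
    variance = n * p * q
    sigma = math.sqrt(variance)
    return {
        "mean": n * p,
        "variance": variance,
        "sigma": sigma,
        "skewness": (1 - 2 * p) / sigma,
        "kurtosis": 3 + 1 / variance - 6 / n,
    }


summary_parts = binomial_summary(10, 4 / 5)
summary_coin = binomial_summary(10, 1 / 2)

for label, summary in (("p = 4/5", summary_parts), ("p = 1/2", summary_coin)):
    print(label)
    for name, value in summary.items():
        print(f"  {name:9s} = {value:.4f}")
```

For p = 1/2 the skewness is zero because the distribution is symmetric; for p = 4/5 it is negative, so the longer tail is on the left.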
