Histograms
A histogram provides a quick visual insight into how a data set is distributed. The range of possible values
is divided into intervals, or bins. Then a bar chart is created, where the height of each bar corresponds to
how frequently values in that bin appear in the data.
The histogram function creates a histogram with the bins chosen automatically based on the data.
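As a minimal sketch of the call described above, the following uses a synthetic sample (an assumption, since the course data is not included here) and reads back the bins that histogram chose. The variable names are illustrative only.

```matlab
% Synthetic sample standing in for real measurements (assumption).
rng(0)                          % fix the seed so the example is repeatable
x = 170 + 10*randn(1000,1);     % 1000 values centered near 170

h = histogram(x);               % bins are chosen automatically from the data
xlabel("Value")
ylabel("Count")

nBins  = h.NumBins;             % number of bins that were chosen
counts = h.Values;              % bar heights: number of observations per bin
```

With the default normalization, the bar heights are raw counts, so they add up to the number of observations.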
Box Plots
A box plot is another way to visualize the distribution of a data set. The central box represents the middle
50% of observations, with the red line at the median. The "whisker" lines show the extent of ≈99% of the
data. Remaining outliers are shown individually with red crosses.
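A short sketch of a box plot on synthetic data (an assumption) with one artificially extreme value appended. It also computes the quartiles behind the central box and applies the 1.5 × IQR whisker rule, which is the default whisker length in boxplot, to show why the appended value is drawn as an outlier.

```matlab
rng(0)
x = [170 + 10*randn(200,1); 230];   % synthetic data plus one extreme value

boxplot(x)                          % requires Statistics and Machine Learning Toolbox
ylabel("Value")

q = quantile(x,[0.25 0.5 0.75]);    % lower quartile, median, upper quartile
boxWidth   = q(3) - q(1);           % height of the central box
upperLimit = q(3) + 1.5*boxWidth;   % farthest an upper whisker can reach by default
isOut      = x > upperLimit;        % points beyond the upper whisker
```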
Scatter Plots
A scatter plot explores how two variables are related to each other. You can use the scatter function
or plot function to create a scatter plot.
To view how one variable changes in response to two other variables, you can use scatter3 or plot3.
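The functions named above can be tried on synthetic data as follows. The variables x, y, and z are made up for illustration, and the relationships between them are assumptions chosen so that a trend is visible.

```matlab
rng(0)
x = linspace(0,10,100)';
y = 2*x + randn(100,1);       % y depends roughly linearly on x
z = x + y + randn(100,1);     % z depends on both x and y

figure
scatter(x,y)                  % scatter plot of y against x

figure
plot(x,y,"o")                 % same picture using plot with markers and no line

figure
scatter3(x,y,z)               % one variable against two others
```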
2.2 Measures of Centrality and Spread: (1/4) Measures of Centrality
Measures of Centrality
From the histograms of height data shown to the right, it appears that women’s heights are centered on
approximately 160 cm, whereas men’s heights are centered on approximately 175 cm.
Quantifying these statements requires calculating some measure of central tendency. Although people
commonly talk about a "typical" or "average" height, there are several standard measures of "average"
(or centrality).
Mean
The mean (also referred to as the arithmetic mean, or often simply the average) is a common measure of
centrality. The mean is useful for symmetric distributions, but notoriously sensitive to outliers. If your data
set is not distributed symmetrically or has extreme outliers, you will need to consider how these factors
will affect the calculation of the mean.
meanW = mean(heightWomen)
meanW =
160.7405
meanM = mean(heightMen)
meanM =
174.1837
Median
The median gives the midpoint of the sorted data, so half the data is greater than the median and half is
smaller. The median is much more resistant than the mean to changes in a few data values, and is an
especially useful center for nonsymmetric (skewed) distributions, like the distribution of weight data.
medWeight = median(weight)
medWeight =
77.5000
meanWeight = mean(weight)
meanWeight =
80.2421
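To illustrate the difference in sensitivity, here is a small made-up data set (not the course data) before and after adding one extreme value.

```matlab
h    = [158 159 160 161 162];   % made-up heights in cm
hOut = [h 250];                 % same data plus one extreme value

meanBefore = mean(h);           % 160
meanAfter  = mean(hOut);        % 175
medBefore  = median(h);         % 160
medAfter   = median(hOut);      % 160.5
```

A single extreme value shifts the mean by 15 cm but moves the median by only 0.5 cm.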
Additional Measures of Centrality
mode
Most frequent values in a data set.
trimmean
Mean excluding outliers.
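A brief sketch of both functions on made-up vectors. The second input to trimmean is the total percentage of values to discard, split evenly between the lowest and highest values.

```matlab
v = [1 2 2 3 3 3 4];
m = mode(v);                    % 3, the most frequent value

x  = [1 2 3 4 5 6 7 8 9 100];   % made-up data with one extreme value
mu = mean(x);                   % 14.5, pulled upward by the value 100
tm = trimmean(x,20);            % drops the lowest and highest value (10% each): 5.5
```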
2.2 Measures of Centrality and Spread: (3/4) Measures of Spread
Measures of Spread
You can calculate the mean and median of the height data to find that the "average" height is about 167 cm.
But what practical significance is that? If everyone were roughly 167 cm tall, that would be very different
from a situation where people were equally likely to be any height between 135 and 200 cm.
The difference between these extreme scenarios is the degree of spread of the distributions – that is, how
much the data deviates from the center. As with measures of centrality, there are several standard
measures of spread.
Standard Deviation
Like the mean, the standard deviation is best suited to symmetric distributions that follow a "bell curve"
(normal distribution). The standard deviation is the square root of the variance, which is the average of
the squared distances of the data values from the mean. Because those distances are squared, the
standard deviation tends to amplify the effect of outliers.
stdHeight = std(height)
stdHeight =
10.0521
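The definition can be checked by hand on a small made-up vector. Note that std divides by n-1 by default (the sample standard deviation); passing 1 as the second input divides by n instead.

```matlab
x = [2 4 4 4 5 5 7 9];            % made-up data with mean 5
n = numel(x);

sqDist    = (x - mean(x)).^2;     % squared distances from the mean
stdManual = sqrt(sum(sqDist)/(n-1));

stdSample = std(x);               % default: divides by n-1
stdPop    = std(x,1);             % divides by n, giving 2 for this data
```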
Interquartile Range
Like the median (the 50th percentile), the interquartile range is based on percentiles. It gives the
distance between the 25th and 75th percentiles of the data, that is, the width of the region that contains
the middle 50% of the data values. Like the median, the interquartile range is resistant to outliers and
especially useful for nonsymmetric distributions.
The function iqr calculates the interquartile range of a data set. The central box in a box plot spans the
interquartile range.
iqrWeight = iqr(weight)
iqrWeight =
25.6000
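The following sketch, on made-up data, compares how the interquartile range and the standard deviation respond when one value is replaced by an extreme one. It also recomputes the interquartile range from the 25th and 75th percentiles using prctile.

```matlab
x    = [1 2 3 4 5 6 7 8 9];      % made-up data
xOut = [1 2 3 4 5 6 7 8 900];    % largest value replaced by an extreme one

iqrBefore = iqr(x);
iqrAfter  = iqr(xOut);           % unchanged: the quartiles do not involve the largest value
stdBefore = std(x);
stdAfter  = std(xOut);           % much larger than stdBefore

q = prctile(x,[25 75]);          % 25th and 75th percentiles
iqrManual = q(2) - q(1);         % same quantity computed by hand
```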
Additional Measures of Spread
range
Difference between maximum and minimum values.
var
Variance of a data set.
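Both functions are easy to relate to quantities already introduced, as this sketch on a made-up vector shows.

```matlab
x = [2 4 4 4 5 5 7 9];     % made-up data

r = range(x);              % 7, the same as max(x) - min(x)
v = var(x);                % variance, the square of std(x)
```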
histogram(weight)
histogram(weight,"Normalization","pdf")
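With "pdf" normalization, each bar height is the bin count divided by the number of observations times the bin width, so the bars enclose a total area of 1. A minimal check on synthetic data (an assumption standing in for the weight data):

```matlab
rng(0)
w = 80 + 15*randn(500,1);                 % synthetic stand-in for weight data

h = histogram(w,"Normalization","pdf");
barArea = sum(h.Values .* h.BinWidth);    % total area of the bars
```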
2.3 Distributions: (2/6) Normal and Uniform Distributions
The probability density of the standard uniform distribution is 1 on the interval from 0 to 1, and 0
everywhere else.
Normal distribution
A normal distribution is the classic "bell curve" distribution. The most probable values are near the mean,
and values further from the mean are less probable. The normal distribution is defined by two numbers:
the mean μ and the standard deviation σ. The standard normal distribution has a mean of 0 and
standard deviation of 1.
The randn function generates normally distributed random numbers with mean 0 and standard deviation
1. randn(n,1) generates a column vector of n normally distributed random numbers.
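A short sketch of both generators. The shift-and-scale step at the end is a standard way to obtain a normal sample with a different mean and standard deviation; the values of mu and sigma are arbitrary choices for illustration.

```matlab
rng(0)                  % fix the seed so the example is repeatable
n = 10000;

u = rand(n,1);          % standard uniform: values between 0 and 1
z = randn(n,1);         % standard normal: mean 0, standard deviation 1

mu    = 170;            % illustrative target mean (assumption)
sigma = 10;             % illustrative target standard deviation (assumption)
x = mu + sigma*z;       % normal sample with mean mu and standard deviation sigma
```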
rand and randn are useful functions for generating random numbers. You can find out more about them
in the documentation.
rand
Uniformly distributed random numbers
randn
Normally distributed random numbers
histogram
Bar plot of frequencies of data values.
boxplot
Box-and-whisker plot based on median and quartiles.
scatter
Plot relationship between two variables.
Mean and standard deviation are useful for symmetric, normally distributed data.
mean
Arithmetic mean or average of a data set.
std
Standard deviation of a data set.
Median and interquartile range are much more resistant to changes in a few data values, and especially
useful for nonsymmetric distributions.
median
50th percentile of sorted data.
iqr
Difference between the 25th and 75th percentiles.
Data Distributions
Visualize the probability density function (pdf) for discrete data using a normalized histogram.
histogram(x,"Normalization","pdf")
Generate continuous normal and uniform pdfs.
normpdf
Compute a normal pdf.
unifpdf
Compute a uniform pdf.
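As a sketch of how these pieces fit together, the following overlays a normal pdf on a normalized histogram of synthetic data (an assumption), using the sample mean and standard deviation as the pdf parameters. It also evaluates a uniform pdf at a few points.

```matlab
rng(0)
x = 170 + 10*randn(2000,1);                 % synthetic sample

histogram(x,"Normalization","pdf")          % empirical density
hold on
xs = linspace(min(x),max(x),200);
plot(xs,normpdf(xs,mean(x),std(x)),"LineWidth",1.5)   % normal pdf with matching parameters
hold off

pu = unifpdf([-0.5 0.25 0.75 1.5],0,1);     % uniform pdf on [0,1]: [0 1 1 0]
p0 = normpdf(0,0,1);                        % peak of the standard normal pdf, 1/sqrt(2*pi)
```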
The "best fit" line is the line through the data that minimizes the sum of the squared distances between
the actual, observed values of y and the values of y predicted by the equation y = ax + b.
This demonstrates that the linear equation y = 1.5229*x - 2.1911 explains 87% of the variance in the
variable y.
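One way to obtain such a line and the fraction of variance it explains is sketched below with polyfit and polyval. The data here is synthetic, and the true slope, intercept, and noise level are assumptions, so the numbers will not match the ones quoted above.

```matlab
rng(0)
x = linspace(0,10,50)';
y = 1.5*x - 2 + randn(50,1);       % synthetic data scattered around a line

p    = polyfit(x,y,1);             % least-squares fit: p(1) is the slope a, p(2) the intercept b
yFit = polyval(p,x);               % values of y predicted by the fitted line

ssRes = sum((y - yFit).^2);        % variation left unexplained by the line
ssTot = sum((y - mean(y)).^2);     % total variation of y about its mean
R2    = 1 - ssRes/ssTot;           % fraction of the variance in y explained by the line
```

An R2 value of 0.87 would correspond to the statement that the line explains 87% of the variance in y.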