
2.1 Visualizing Data Sets: (1/2) Data Visualization Functions


Data Visualizations
Before trying to perform calculations or draw conclusions from data, it is helpful to get a qualitative feel for
the data. Visualization is often a useful method when beginning to explore data.

Histograms

A histogram provides a quick visual insight into how a data set is distributed. The range of possible values
is divided into intervals, or bins. Then a bar chart is created, where the height of each bar corresponds to
how frequently values in that bin appear in the data.

The histogram function creates a histogram with the bins chosen automatically based on the data.
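For example, assuming heightWomen is the vector of women's heights used later in this section, one call
plots the distribution, and an optional second argument requests a specific number of bins:

histogram(heightWomen)      % bins chosen automatically
histogram(heightWomen,20)   % use 20 bins instead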

Box Plots
A box plot is another way to visualize the distribution of a data set. The central box represents the middle
50% of observations, with the red line at the median. The "whisker" lines show the extent of 99% of the
data. Remaining outliers are shown individually with red crosses.

The boxplot function creates a box plot.
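For example, using the weight vector referenced later in this section (boxplot is part of the Statistics and
Machine Learning Toolbox):

boxplot(weight)   % central box spans the middle 50% of the data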

Scatter Plots
A scatter plot explores how two variables are related to each other. You can use the scatter function
or plot function to create a scatter plot.

To view how one variable changes in response to two other variables, you can use scatter3 or plot3.
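For example, assuming height and weight are same-length vectors like those used elsewhere in this section,
and age is a hypothetical third vector of the same length:

scatter(height,weight)        % each point is one observation
xlabel("Height (cm)")
ylabel("Weight")

scatter3(height,weight,age)   % add a third variable on the z-axis (age is hypothetical)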
2.2 Measures of Centrality and Spread: (1/4) Measures of Centrality
Measures of Centrality
From histograms of the height data, it appears that women’s heights are centered on approximately
160 cm, whereas men’s heights are centered on approximately 175 cm.

Quantifying these statements requires calculating some measure of central tendency. Although people
commonly talk about a "typical" or “average” height, there are several standard measures of “average”
(or centrality).
Mean
The mean (also referred to as the arithmetic mean, or often simply the average) is a common measure of
centrality. The mean is useful for symmetric distributions, but notoriously sensitive to outliers. If your data
set is not distributed symmetrically or has extreme outliers, you will need to consider how these factors
will affect the calculation of the mean.

The function mean calculates the mean of a data set.

meanW = mean(heightWomen)
meanW =
160.7405

meanM = mean(heightMen)
meanM =
174.1837
Median
The median gives the midpoint of the sorted data, so half the data is greater than the median and half is
smaller. The median is much more resistant than the mean to changes in a few data values, and is an
especially useful center for nonsymmetric (skewed) distributions, like the distribution of weight data.

The function median calculates the median of a data set.

medWeight = median(weight)
medWeight =
77.5000

meanWeight = mean(weight)
meanWeight =
80.2421
Additional Measures of Centrality
mode
Most frequent values in a data set.
trimmean
Mean excluding outliers.
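For example, applied to the weight data used above (trimmean is part of the Statistics and Machine Learning
Toolbox; the 10 requests that the most extreme 10% of values, 5% from each end, be excluded):

modeWeight = mode(weight)          % most frequent value in the data
trimWeight = trimmean(weight,10)   % mean after excluding the extreme 10% of values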
2.2 Measures of Centrality and Spread: (3/4) Measures of Spread
Measures of Spread

You can calculate the mean and median of the height data to find that the "average" height is about 167 cm.
But what practical significance does that have? If everyone were roughly 167 cm tall, that would be very
different from a population in which people are equally likely to be any height between 135 and 200 cm.

The difference between these extreme scenarios is the degree of spread of the distributions – that is, how
much the data deviates from the center. As with measures of centrality, there are several standard
measures of spread.
Standard Deviation
Like the mean, the standard deviation is typically used to measure the spread of symmetric distributions
that follow a "bell curve" (normal distribution). Because the standard deviation is the square root of
the variance – the average of the squared distances of data values from the mean – it tends to amplify
the effect of outliers.

The function std calculates the standard deviation of a data set.

stdHeight = std(height)
stdHeight =
10.0521

Interquartile Range
The interquartile range is based on the median (the 50th percentile point). It gives the distance between
the 25th and 75th percentile in the data – that is, the width of the region that contains the middle 50% of
the data values. Like the median, the interquartile range is resistant to outliers and especially useful for
nonsymmetric distributions.

The function iqr calculates the interquartile range of a data set. The central box in a box plot spans the
interquartile range.

iqrWeight = iqr(weight)
iqrWeight =
25.6000
Additional Measures of Spread
range
Difference between maximum and minimum values.
var
Variance of a data set.
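For example, applied to the weight data used above (range, like iqr, is part of the Statistics and Machine
Learning Toolbox):

rangeWeight = range(weight)   % max(weight) - min(weight)
varWeight = var(weight)       % variance; equal to std(weight)^2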

2.3 Distributions: (1/6) Histograms and Data Distributions


Data Distributions
A histogram can give a qualitative feel for the shape of a data set. The exact shape of the distribution is
given by the distribution's probability density function (pdf). The pdf is a mathematical function f(x) such
that the area under the curve between two values of x is the probability that a randomly selected value will
fall between those two values.
For a discrete data set, the pdf can be approximated by a normalized histogram, where the area of a given
bar is the probability that a randomly selected data point falls in that bin.

histogram(weight)

histogram(weight,"Normalization","pdf")
2.3 Distributions: (2/6) Normal and Uniform Distributions

The Uniform and Normal Distributions


Two common distributions are the uniform distribution and the normal distribution.
Uniform distribution
Uniformly distributed data has an equal probability of appearing anywhere in an interval. The pdf is
shaped like a rectangle.

The standard uniform distribution is defined as 1 on the interval 0 to 1, and 0 everywhere else.

The standard uniform distribution.

Normal distribution
A normal distribution is the classic "bell curve" distribution. The most probable values are near the mean,
and values further from the mean are less probable. The normal distribution is defined by two numbers:
the mean μ and the standard deviation σ. The standard normal distribution has a mean of 0 and a
standard deviation of 1.

The standard normal distribution.


The normpdf function calculates the normal probability density function, given three inputs:

normpdf(x,mean,std)

x: values at which to evaluate the pdf
mean: mean of the distribution
std: standard deviation of the distribution
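For example, a minimal sketch that evaluates and plots the standard normal pdf over the interval -4 to 4
(normpdf is part of the Statistics and Machine Learning Toolbox):

x = -4:0.1:4;
y = normpdf(x,0,1);   % mean 0, standard deviation 1
plot(x,y)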

2.3 Distributions: (4/6) Generating Random Numbers

The rand function generates uniformly distributed random numbers. rand(n,1) generates a column
vector of n uniformly distributed random numbers.

The randn function generates normally distributed random numbers with mean 0 and standard deviation
1. randn(n,1) generates a column vector of n normally distributed random numbers.
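For example, a minimal sketch of both generators (the values 170 and 10 below are illustrative, not taken
from the data in this section):

u = rand(1000,1);               % column of 1000 values, uniform on the interval [0,1]
z = randn(1000,1);              % column of 1000 values from the standard normal distribution
h = 170 + 10*randn(1000,1);     % shift and scale to mean 170 and standard deviation 10
histogram(h,"Normalization","pdf")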

rand and randn are useful functions for generating random numbers. You can find out more about them
in the documentation.

rand
Uniformly distributed random numbers
randn
Normally distributed random numbers

Summary: Exploring Data


Visualizing Data

histogram
Bar plot of frequencies of data values.
boxplot
Box-and-whisker plot based on median and quartiles.
scatter
Plot relationship between two variables.

Measures of Centrality and Spread

Mean and standard deviation are useful for symmetric, normally distributed data.
mean
Arithmetic mean or average of a data set.
std
Standard deviation of a data set.
Median and interquartile range are much more resistant to changes in a few data values, and especially
useful for nonsymmetric distributions.
median
50th percentile of sorted data.
iqr
Difference between 25th and 75th percentile.

Data Distributions

Visualize the probability density function (pdf) for discrete data using a normalized histogram.

histogram(x,"Normalization","pdf")
Generate continuous normal and uniform pdfs.
normpdf
Compute a normal pdf.
unifpdf
Compute a uniform pdf.

Generate arrays of normally and uniformly distributed random numbers.


randn
Generate normally distributed random numbers.
rand
Generate uniformly distributed random numbers.
3.1 Linear Regression: (1/5) Linear Regression
Linear Regression
Suppose you suspect there is a relationship between two variables, x and y. The simplest relationship (and
the one you can usually assume as a starting point) is that of a straight line, or y = ax + b. The process of
determining a and b for a set of x and y data is called linear regression.

The "best fit" line is the line through the data that minimizes the distance between the actual, observed
values of y and the values of y predicted by the equation y=ax+b.
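A minimal sketch of such a fit using polyfit and polyval (the same functions used in the worked example
below), where x and y stand for any same-length data vectors:

p = polyfit(x,y,1);        % p(1) is the slope a, p(2) is the intercept b
yfit = polyval(p,x);       % y values predicted by the best-fit line
plot(x,y,'o',x,yfit,'-')   % data points and fitted line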

3.1 Linear Regression: (2/5) The fit Function


Example: Computing R² from Polynomial Fits
You can derive R² from the coefficients of a polynomial regression to determine how much variance in y a
linear model explains, as the following example describes:
1. Create two variables, x and y, from the first two columns of the count variable in the data
file count.dat:

load count.dat
x = count(:,1);
y = count(:,2);

2. Use polyfit to compute a linear regression that predicts y from x:

p = polyfit(x,y,1)
p =
    1.5229   -2.1911

p(1) is the slope and p(2) is the intercept of the linear predictor. You can also obtain regression
coefficients using the Basic Fitting UI.

3. Call polyval to use p to predict y, calling the result yfit:

yfit = polyval(p,x);

Using polyval saves you from typing the fit equation yourself, which in this case looks like:

yfit = p(1) * x + p(2);

4. Compute the residual values as a vector of signed numbers:

yresid = y - yfit;

5. Square the residuals and total them to obtain the residual sum of squares:

SSresid = sum(yresid.^2);

6. Compute the total sum of squares of y by multiplying the variance of y by the number of
observations minus 1:

SStotal = (length(y)-1) * var(y);

7. Compute R² using the formula given in the introduction of this topic:

rsq = 1 - SSresid/SStotal
rsq =
    0.8707

This demonstrates that the linear equation 1.5229 * x - 2.1911 predicts 87% of the variance in the
variable y.
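As a cross-check, for a simple one-predictor linear regression with an intercept, R² equals the square of the
correlation coefficient between x and y, so the following should reproduce the value above:

R = corrcoef(x,y);     % 2-by-2 correlation matrix
rsq_check = R(1,2)^2   % squared correlation coefficient, equal to rsq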

3.2 Evaluating Goodness of Fit: (1/4) Evaluating the Fit


3.3 Nonlinear Regression: (1/3) Nonlinear Regression
3.4 Review - Fitting a Curve to Data: (1/2) Summary
4.1 Linear Interpolation: (1/4) Linear Interpolation
