Professional Documents
Culture Documents
Descriptive
statistics
Types of data
Types of data
Categorical Numerical
Examples:
Discrete Continuous
1. Car brands: Audi, BMW and Mercedes.
Numerical data represents numbers. It is divided into two
2. Answers to yes/no questions: yes and no groups: discrete and continuous. Discrete data can be
usually counted in a finite matter, while continuous is
infinite and impossible to count.
Examples:
Discrete: # children you want to have, SAT score
Continuous: weight, height
Levels of measurement
Levels of measurement
Qualitative Quantitative
There are two qualitative levels: nominal and ordinal. The There are two quantitative levels: interval and ratio. They
nominal level represents categories that cannot be put in both represent “numbers”, however, ratios have a true
any order, while ordinal represents categories that can be zero, while intervals don’t.
ordered.
Examples:
Examples: Interval: degrees Celsius and Fahrenheit
Nominal: four seasons (winter, spring, summer, autumn) Ratio: degrees Kelvin, length
Ordinal: rating your meal (disgusting, unappetizing,
neutral, tasty, and delicious)
Graphs and tables that represent categorical variables
Frequency
Bar charts Pie charts Pareto diagrams
distribution tables
Sales Sales
150 Merced 150 100%
es Audi
Frequency
34% 80%
37%
Frequency
100
100
60%
50 40%
124 98 113 50
20%
0 BMW 124 113 98
Audi BMW Mercedes
29% 0 0%
Audi Mercedes BMW
Frequency distribution tables Bar charts are very common. Pie charts are used when we The Pareto diagram is a special
show the category and its Each bar represents a category. want to see the share of an type of bar chart where the
corresponding absolute On the y-axis we have the item as a part of the total. categories are shown in
frequency. absolute frequency. Market share is almost always descending order of frequency,
represented with a pie chart. and a separate curve shows
the cumulative frequency.
Graphs and tables that represent categorical variables. Excel formulas
Frequency
Bar charts Pie charts Pareto diagrams
distribution tables
Sales Sales
150 Merced 150 100%
es Audi
Frequency
34% 80%
37%
Frequency
100
100
60%
50 40%
124 98 113 50
20%
0 BMW 124 113 98
Audi BMW Mercedes
29% 0 0%
Audi Mercedes BMW
In Excel, we can either hard Bar charts are also called Pie charts are created in the Next slide.
code the frequencies or count clustered column charts in following way:
them with a count function. Excel. Choose your data, Insert Choose your data, Insert ->
This will come up later on. -> Charts -> Clustered column Charts -> Pie chart
Total formula: =SUM() or Bar chart.
Pareto diagrams in Excel
80%
4. Select the plot area of the chart in Excel and Right
100
70% click.
5. Choose Select series.
Frequency
60%
124 113 98
10%
10. Select the plot area of the chart and Right click.
0 0% 11. Choose Change Chart Type.
Audi Mercedes BMW
12. Select Combo.
13. Choose the type of representation from the dropdown
list. Your initial categories should be ‘Clustered
Column’. Change the second series, that you called
‘Line’, to ‘Line’.
14. Done.
Numerical variables. Frequency distribution table and histogram
Note that all formulas could be found in the lesson Excel files and the solutions of the exercises provided with each lesson.
Numerical variables. Frequency distribution table and histogram
0.25
1. Choose your data 0.20
2. Insert -> Charts -> Histogram 0.15
3. To change the number of bins (intervals):
0.10
1. Select the x-axis
2. Click Chart Tools -> Format -> Axis options
3. You can select the bin width (interval width), number of bins,
etc.
Graphs and tables for relationships between variables. Cross tables
Cross tables (or contingency tables) are used to represent categorical variables. One set
of categories is labeling the rows and another is labeling the columns. We then fill in
the table with the applicable data. It is a good idea to calculate the totals. Sometimes,
these tables are constructed with the relative frequencies as shown in the table below.
800
When we want to represent two numerical variables on the same graph, we usually use
700
a scatter plot. Scatter plots are useful especially later on, when we talk about regression
600 analysis, as they help us detect patterns (linearity, homoscedasticity).
500 Scatter plots usually represent lots and lots of data. Typically, we are not interested in
400
single observations, but rather in the structure of the dataset.
300
200
Creating a scatter plot in Excel:
100
1. Choose the two datasets you want to plot.
0 2. Insert -> Charts -> Scatter
0 100 200 300 400 500 600 700 800
26
25.5
25
A scatter plot that looks in the following way (down) represents data that doesn’t have
a pattern. Completely vertical ‘forms’ show no association.
24.5
24
23.5
Conversely, the plot above shows a linear pattern, meaning that the observations move
23 together.
22.5
22
21.5
0 20 40 60 80 100 120 140 160 180
Mean, median, mode
=AVERAGE()
Skewness
Left (negative) skewness means that the outliers are to the left.
Median Mean
Mode
1 𝑛
σ (𝑥 − 𝑥)ҧ 3
Calculating skewness in Excel: Formula to calculate skewness: 𝑛 𝑖=1 𝑖
3
1
=SKEW() σ𝑛 (𝑥 − 𝑥)ҧ 2
𝑛 − 1 𝑖=1 𝑖
Variance and standard deviation
σ𝑛 ҧ 2
𝑖=1(𝑥𝑖 −𝑥)
Sample variance formula: 𝑠2 =
𝑛−1
Point 3 Point 6
σ𝑁
𝑖=1(𝑥𝑖 −𝜇)
2
Population variance formula: σ2 =
𝑁
σ𝑛 ҧ 2
𝑖=1(𝑥𝑖 −𝑥)
Calculating variance in Excel: Sample standard deviation formula: 𝑠=
𝑛−1
Sample variance: =VAR.S()
Population variance: =VAR.P() σ𝑁 2
𝑖=1(𝑥𝑖 −𝜇)
Sample standard deviation: = STDEV.S() Population standard deviation formula: σ=
Population standard deviation: =STDEV.P() 𝑁
Covariance and correlation
Covariance Correlation
Covariance is a measure of the joint variability of two variables. Correlation is a measure of the joint variability of two variables.
Unlike covariance, correlation could be thought of as a
➢ A positive covariance means that the two variables move standardized measure. It takes on values between -1 and 1, thus
together. it is easy for us to interpret the result.
➢ A covariance of 0 means that the two variables are
independent. ➢ A correlation of 1, known as perfect positive correlation,
➢ A negative covariance means that the two variables move in means that one variable is perfectly explained by the other.
opposite directions. ➢ A correlation of 0 means that the variables are independent.
➢ A correlation of -1, known as perfect negative correlation,
Covariance can take on values from -∞ to +∞ . This is a means that one variable is explaining the other one
problem as it is very hard to put such numbers into perspective. perfectly, but they move in opposite directions.
σ𝑛
𝑖=1 𝑥𝑖 −𝑥ҧ ∗(𝑦𝑖 −𝑦)
ത 𝑠𝑥𝑦
Sample covariance formula: 𝑠𝑥𝑦 = Sample correlation formula: r=
𝑛−1 𝑠𝑥 𝑠𝑦
σ𝑁
𝑖=1 𝑥𝑖 −𝜇𝑥 ∗(𝑦𝑖 −𝜇𝑦 ) 𝜎𝑥𝑦
Population covariance formula: 𝜎𝑥𝑦 = 𝑁
Population correlation formula: ρ=
𝜎𝑥 𝜎𝑦
In Excel, the covariance is calculated by: In Excel, correlation is calculated by: