Notes On Statistics

Statistics is about the whole process of using the scientific method to answer questions and make decisions. That
process involves designing studies, collecting good data, describing the data with numbers and graphs, analyzing the data,
and then drawing conclusions.
In statistics, we have a saying: “Garbage in equals garbage out.” If you select your subjects in a way that is biased — that
is, favoring certain individuals or groups of individuals — then your results will also be biased.
Descriptive statistics are numbers that summarize some characteristic about a set of data. They provide you with easy-to-
understand information that helps answer questions. They also help researchers get a rough idea about what’s happening in
their experiments so later they can do more formal and targeted analyses. Descriptive statistics make a point clearly and
concisely.
Data are the individual pieces of factual information recorded, and they are the input to the analysis
process. The two processes of data analysis are interpretation and presentation. Statistics are the result of data analysis.
There are two main types of data: categorical (or qualitative) data and numerical (or quantitative) data.
Categorical data record qualities or characteristics about the individual, such as eye color, gender, political party, or
opinion on some issue (using categories such as agree, disagree, or no opinion). Categorical data place individuals into
groups. For example, male/female, own your home/don’t own, or Democrat/ Republican/Independent/Other.
Numerical data record measurements or counts regarding each individual, which may include weight, age, height, or time
to take an exam; counts may include number of pets, or the number of red lights you hit on your way to work.
The important difference between the two is that with categorical data, any numbers involved do not have real numerical
meaning (for example, using 1 for male and 2 for female), while all numerical data represent actual numbers for which
math operations make sense.
A third type of data, ordinal data, falls in between, where data appear in categories, but the categories have a
meaningful order, such as ratings from 1 to 5, or class ranks of freshman through senior. Ordinal data can be
analyzed like categorical data, and the basic numerical data techniques also apply when categories are represented by
numbers that have meaning.
Qualitative or Categorical Data
Qualitative data, also known as categorical data, are data that fit into categories. Qualitative data are
not numerical. Categorical variables describe features such as a person’s gender, home town etc. Categorical measures
are defined in natural language rather than in numbers.
Sometimes categorical data hold numerical values, but those values have no mathematical meaning. Examples of
categorical data are birthdate, favourite sport, and school postcode. Here, the birthdate and school postcode contain
digits, but arithmetic on them is not meaningful.
Nominal Data
Nominal data are a type of qualitative data that label variables without giving them a numerical value. The nominal
level is also called the nominal scale. Nominal data cannot be ordered or measured. Sometimes nominal labels are
written as numbers, but those numbers act only as names. Examples of nominal data are letters, symbols, words, gender etc.
Nominal data are examined using the grouping method. In this method, the data are grouped into categories, and then
the frequency or the percentage of each category is calculated. These data are often represented visually using pie charts.
Ordinal Data
Ordinal data are data that follow a natural order. The significant feature of ordinal data is that the
difference between the data values is not determined. This type of variable is mostly found in surveys, finance, economics,
questionnaires, and so on.
Ordinal data are commonly represented using a bar chart. These data are investigated and interpreted through many
visualisation tools. The information may also be expressed in tables in which each row shows a distinct
category.
Quantitative or Numerical Data
Quantitative data is also known as numerical data which represents the numerical value (i.e., how much, how often,
how many). Numerical data gives information about the quantities of a specific thing. Some examples of numerical data
are height, length, size, weight, and so on. The quantitative data can be classified into two different types based on the data
sets. The two different classifications of numerical data are discrete data and continuous data.
Discrete Data
Discrete data can take only distinct, separate values. A discrete variable has a finite or countable number of possible
values. Those values cannot be subdivided meaningfully. Here, things can be counted in whole numbers.
Example: Number of students in the class
Continuous Data
Continuous data are data that can be measured. They have an infinite number of possible values within
a given range.
Example: Temperature range
There are two major types of random variables: discrete and continuous. Discrete random variables basically count things
(number of heads on 10 coin flips, number of female Democrats in a sample, and so on). The most well known discrete
random variable is the binomial. A continuous random variable measures things and takes on values within an interval, or
has so many possible values that it might as well be deemed continuous (for example, time to complete a task,
exam scores, and so on).
We say that X has a normal distribution if its values fall into a smooth (continuous) curve with a bell-shaped, symmetric
pattern, meaning it looks the same on each side when cut down the middle. The total area under the curve is 1.
The figure illustrates three different normal distributions with different means and standard deviations.
Note that the inflection points (highlighted by arrows in the figure on either side of the mean) on each graph are
where the graph changes from concave down to concave up. The distance from the mean out to either inflection point
is equal to the standard deviation for the normal distribution. For any normal distribution, almost all its values lie
within three standard deviations of the mean.
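As a small numerical check of the two claims above, the sketch below uses the standard normal curve (mean 0, standard deviation 1). The helper names `normal_pdf`, `second_difference` and `normal_coverage` are illustrative and not part of these notes; the concavity test uses a finite-difference approximation, so it is only a rough check.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal curve with mean mu and standard deviation sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def second_difference(f, x, h=1e-3):
    """Finite-difference estimate of the second derivative of f at x."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / (h * h)


def normal_coverage(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


# Concave down inside one standard deviation, concave up outside it,
# so the curve changes shape at mean +/- one standard deviation.
print(second_difference(normal_pdf, 0.5) < 0)   # True
print(second_difference(normal_pdf, 1.5) > 0)   # True

# Share of values within 1, 2 and 3 standard deviations of the mean.
for k in (1, 2, 3):
    print(k, round(normal_coverage(k), 4))      # 0.6827, 0.9545, 0.9973
```

The last line of output is what "almost all" means here: roughly 99.7% of values fall within three standard deviations.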
THE LEVELS OF MEASUREMENT: DEFINITION AND BRIEF EXPLANATION

The level of measurement is about how each variable is measured (qualitatively or quantitatively) and how precise each
variable is. There are four levels of measurement: nominal, ordinal, interval, and ratio. The last two are often grouped
together as interval/ratio. Nominal is the least precise and informative, and interval/ratio is the most precise and
informative. Given a choice, choose an interval/ratio variable, as it gives you more freedom when it comes to choosing an
appropriate statistical technique.
Why do we need to learn this?
Not all statistical techniques and methods can be used with all variables. The levels of measurement help us determine
what statistical technique is appropriate to use. For example, if the level of measurement of your variable is
nominal (the least precise and informative variable), you can use the mode to summarize your variable, but not the median
or mean. If your variable is an interval/ratio variable, you can use all three (mean, median, and mode) to summarize
your variable.
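The rule above can be sketched with Python's standard `statistics` module. The three small data sets are made up for illustration. Using the median for the ordinal codes is a common convention that I am assuming here; the notes themselves only spell out the nominal and interval/ratio cases.

```python
from statistics import mean, median, mode

# Hypothetical example data, one variable per level of measurement.
eye_colour = ["brown", "blue", "brown", "green", "brown"]   # nominal
satisfaction = [1, 3, 2, 5, 4, 4, 2, 4, 3]                  # ordinal codes 1..5
weight_kg = [60, 70, 58, 82]                                # ratio

# Nominal: only the most frequent category is meaningful.
print(mode(eye_colour))                           # brown

# Ordinal: the order supports a median as well as a mode.
print(mode(satisfaction), median(satisfaction))   # 4 3

# Interval/ratio: mean, median and mode are all available.
print(mean(weight_kg), median(weight_kg))         # 67.5 65.0
```

Nothing stops the code from computing `mean(satisfaction)`; the level of measurement is what tells you the result would be hard to interpret.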
In nominal measurement the numerical values just “name” the attribute uniquely. No ordering of the cases is
implied. For example, jersey numbers in basketball are measured at the nominal level. A player with number 30 is
not more of anything than a player with number 15, and is certainly not twice whatever number 15 is.
In ordinal measurement the attributes can be rank-ordered. Here, distances between attributes do not have any
meaning. For example, on a survey you might code Educational Attainment as 0=less than high school; 1=some
high school; 2=high school degree; 3=some college; 4=college degree; 5=post college. In this measure, higher
numbers mean more education. But is the distance from 0 to 1 the same as from 3 to 4? Of course not. The interval between
values is not interpretable in an ordinal measure.
Individuals competing in a contest may be fortunate to achieve first, second, or third place. First, second, and third
place represent ordinal data. If Roscoe takes first and Wilbur takes second, we do not know if the competition was
close; we only know that Roscoe outperformed Wilbur. Likert-type scales (such as "On a scale of 1 to 10 with one
being no pain and ten being high pain, how much pain are you in today?") also represent ordinal data.
Fundamentally, these scales do not represent a measurable quantity. An individual may respond 8 to this question
and be in less pain than someone else who responded 5.
A scale which represents quantity and has equal units but for which zero represents simply an additional point of
measurement is an interval scale. The Fahrenheit scale is a clear example of the interval scale of measurement.
Thus, 60 degrees Fahrenheit or -10 degrees Fahrenheit are interval data. Measurement of sea level is another
example of an interval scale. With each of these scales there is a direct, measurable quantity with equality of units. In
addition, zero does not represent the absolute lowest value. Rather, it is a point on the scale with numbers both above
and below it (for example, -10 degrees Fahrenheit). In interval measurement the distance between
attributes does have meaning. For example, when we measure temperature (in Fahrenheit), the distance from 30-40
is the same as the distance from 70-80. The interval between values is interpretable. Because of this, it makes sense to
compute an average of an interval variable, where it doesn’t make sense to do so for ordinal scales. But note that in
interval measurement ratios don’t make any sense - 80 degrees is not twice as hot as 40 degrees (although the
attribute value is twice as large).
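One way to see why ratios fail on an interval scale is to re-express the same temperatures in Celsius. This is a minimal sketch; the function name `fahrenheit_to_celsius` is illustrative, and the conversion is the usual one, C = (F - 32) × 5/9.

```python
def fahrenheit_to_celsius(f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9


# Equal differences stay equal after changing units ...
gap_low = fahrenheit_to_celsius(40) - fahrenheit_to_celsius(30)
gap_high = fahrenheit_to_celsius(80) - fahrenheit_to_celsius(70)
print(abs(gap_low - gap_high) < 1e-9)   # True

# ... but the ratio depends on where the scale happens to put zero.
ratio_f = 80 / 40
ratio_c = fahrenheit_to_celsius(80) / fahrenheit_to_celsius(40)
print(ratio_f)                          # 2.0
print(round(ratio_c, 6))                # 6.0
```

The same two temperatures give a ratio of 2 in Fahrenheit and 6 in Celsius, so "twice as hot" has no fixed meaning.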
Finally, in ratio measurement there is always an absolute zero that is meaningful. This means that you can
construct a meaningful fraction (or ratio) with a ratio variable. Weight is a ratio variable. In applied social
research most “count” variables are ratio, for example, the number of clients in past six months. Why? Because
you can have zero clients and because it is meaningful to say that “…we had twice as many clients in the past six
months as we did in the previous six months.”
It’s important to recognize that there is a hierarchy implied in the level of measurement idea. At lower levels of
measurement, assumptions tend to be less restrictive and data analyses tend to be less sensitive. At each level up the
hierarchy, the current level includes all of the qualities of the one below it and adds something new. In general, it is
desirable to have a higher level of measurement (e.g., interval or ratio) rather than a lower one (nominal or
ordinal).
Measures of Variability
Variability is what the field of statistics is all about. Results vary from individual to individual, from group to group, from
city to city, from moment to moment. Variation always exists in a data set, regardless of which characteristic you’re
measuring, because not every individual will have the same exact value for every characteristic you measure. Without a
measure of variability you can’t compare two data sets effectively. What if two sets of data have about the
same average and the same median? Does that mean that the data are all the same? Not at all. For example, the data sets
199, 200, 201, and 0, 200, 400 both have the same average, which is 200, and the same median, which is also 200.
Yet they have very different amounts of variability. The first data set has a very small amount of variability
compared to the second.
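The comparison can be checked directly with Python's `statistics` module. The variable names are illustrative, and `stdev` is the sample standard deviation described in the next paragraph.

```python
from statistics import mean, median, stdev

tight = [199, 200, 201]
spread = [0, 200, 400]

# Same center by both measures ...
print(mean(tight), mean(spread))       # 200 200
print(median(tight), median(spread))   # 200 200

# ... but very different variability.
print(stdev(tight), stdev(spread))     # 1.0 200.0
```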
By far the most commonly used measure of variability is the standard deviation. The standard deviation of a data set,
denoted by s, represents the typical distance from any point in the data set to the center. It’s roughly the average distance
from the center, and in this case, the center is the average. Most often, you don’t hear a standard deviation given just by
itself; if it’s reported (and it’s not reported nearly enough) it’s usually in the fine print, in parentheses, like “(s = 2.68).”
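The formula for the standard deviation of a ‘sample’ data set is

s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}

where x_i are the individual values, \bar{x} is their average, and n is the number of values in the data set.
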
To calculate s, do the following steps:
1. Find the average of the data set, x̄. To find the average, add up all the numbers and divide by the number of numbers in
the data set, n.
2. For each number, subtract the average from it.
3. Square each of the differences.
4. Add up all the results from Step 3.
5. Divide the sum of squares (Step 4) by the number of numbers in the data set, minus one (n – 1). If you do Steps 1
through 5 only, you have found another measure of variability, called the variance.
6. Take the square root of the variance. This is the standard deviation.
Suppose you have four numbers: 1, 3, 5, and 7. The mean is 16 ÷ 4 = 4. Subtracting the mean from each number, you get
(1 – 4) = –3, (3 – 4) = –1, (5 – 4) = +1, and (7 – 4) = +3. Squaring the results you get 9, 1, 1, and 9, which sum to 20.
Divide 20 by 4 – 1 = 3 to get 6.67. The standard deviation is the square root of 6.67, which is 2.58.
Here are some properties that can help you when interpreting a standard deviation:
✓ The standard deviation can never be a negative number.
✓ The smallest possible value for the standard deviation is 0 (when every number in the data set is exactly the same).
✓ Standard deviation is affected by outliers, as it’s based on distance from the mean, which is affected by outliers.
✓ The standard deviation has the same units as the original data, while variance is in square units.
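The six steps listed earlier translate almost line for line into code. The sketch below is one way to write them; the function names `sample_variance` and `sample_std` are illustrative. It reproduces the 1, 3, 5, 7 example and the property that identical values give a standard deviation of 0.

```python
import math


def sample_variance(data):
    """Variance of a sample, following Steps 1 to 5."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two values to divide by n - 1")
    average = sum(data) / n                       # Step 1: the average
    differences = [x - average for x in data]     # Step 2: subtract the average
    squares = [d * d for d in differences]        # Step 3: square each difference
    sum_of_squares = sum(squares)                 # Step 4: add them up
    return sum_of_squares / (n - 1)               # Step 5: divide by n - 1


def sample_std(data):
    """Standard deviation of a sample: the square root of the variance (Step 6)."""
    return math.sqrt(sample_variance(data))


values = [1, 3, 5, 7]
print(round(sample_variance(values), 2))   # 6.67
print(round(sample_std(values), 2))        # 2.58
print(sample_std([5, 5, 5]))               # 0.0
```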
Percentiles: The most common way to report relative standing of a number within a data set is by using percentiles. A
percentile is the percentage of individuals in the data set who are below where your particular number is located. If your
exam score is at the 90th percentile, for example, that means 90% of the people taking the exam with you scored lower
than you did (it also means that 10 percent scored higher than you did.)
Finding a percentile
To calculate the kth percentile (where k is any number between one and one hundred), do the following steps:
1. Order all the numbers in the data set from smallest to largest.
2. Multiply k percent times the total number of numbers, n.
3a. If your result from Step 2 is a whole number, go to Step 4. If the result from Step 2 is not a whole number, round it up
to the nearest whole number and go to Step 3b.
3b. Count the numbers in your data set from left to right (from the smallest to the largest number) until you reach the value
from Step 3a. This corresponding number in your data set is the kth percentile.
4. Count the numbers in your data set from left to right until you reach that whole number. The kth percentile is the
average of that corresponding number in your data set and the next number in your data set.
For example, suppose you have 25 test scores, in order from lowest to highest: 43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71,
72, 77, 78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99. To find the 90th percentile for these (ordered) scores start by
multiplying 90% times the total number of scores, which gives 90% × 25 = 0.90 × 25 = 22.5 (Step 2). This is not a whole
number; Step 3a says round up to the nearest whole number — 23 — then go to step 3b. Counting from left to right (from
the smallest to the largest number in the data set), you go until you find the 23rd number in the data set. That number is 98,
and it’s the 90th percentile for this data set. If you want to find the 20th percentile, take 0.20 × 25 = 5; this is a whole
number so proceed to Step 4, which tells us the 20th percentile is the average of the 5th and 6th numbers in the ordered
data set (62 and 66). The 20th percentile then comes to (62+66)/2 = 64.
The median is the 50th percentile, the point in the data where 50% of the data fall below that point and 50% fall above it.
The median for the test scores example is the 13th number, 77.
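The percentile procedure above can be sketched as a short function. The name `kth_percentile` is illustrative. Exact fractions are used so that the "is it a whole number?" test in Step 3a is not disturbed by floating-point rounding. The handling of k = 100, where Step 4 has no "next number" to average with, is my own assumption: the function simply returns the largest value.

```python
import math
from fractions import Fraction


def kth_percentile(data, k):
    """k-th percentile using the round-up / average rule described in the steps."""
    if not data:
        raise ValueError("data must not be empty")
    if not 0 < k <= 100:
        raise ValueError("k must be greater than 0 and at most 100")
    ordered = sorted(data)                      # Step 1: order the data
    n = len(ordered)
    position = Fraction(k) * n / 100            # Step 2: k percent of n, exactly
    if position.denominator != 1:               # Step 3a: not a whole number
        index = math.ceil(position)             #          so round up
        return ordered[index - 1]               # Step 3b: count to that position
    index = int(position)                       # Step 4: whole number
    if index == n:                              # assumption: no next value exists
        return ordered[-1]
    return (ordered[index - 1] + ordered[index]) / 2


scores = [43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77,
          78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99]
print(kth_percentile(scores, 90))   # 98
print(kth_percentile(scores, 20))   # 64.0
print(kth_percentile(scores, 50))   # 77
```

Statistical software often uses interpolation rules that give slightly different percentiles, so results from a library may not match this hand method exactly.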
A percentile is not a percent; a percentile is a number that is a certain percentage of the way through the data set,
when the data set is ordered. Suppose your score on the GRE was reported to be the 80th percentile. This doesn’t
mean you scored 80% of the questions correctly. It means that 80% of the students’ scores were lower than yours,
and 20% of the students’ scores were higher than yours.
The five-number summary is a set of five descriptive statistics that divide the data set into four equal sections. The five
numbers in a five number summary are:
1. The minimum (smallest) number in the data set.
2. The 25th percentile, aka the first quartile, or Q1.
3. The median (or 50th percentile).
4. The 75th percentile, aka the third quartile, or Q3.
5. The maximum (largest) number in the data set.
For example, we can find the five-number summary of the 25 (ordered) exam scores 43, 54, 56, 61, 62, 66, 68, 69, 69, 70,
71, 72, 77, 78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99. The minimum is 43, the maximum is 99, and the median is the
number directly in the middle, 77.
To find Q1 and Q3, you use the steps shown in the section, “Finding a percentile,” where n = 25. Step 1 is done since the
data are ordered. For Step 2, since Q1 is the 25th percentile, multiply 0.25 * 25 = 6.25. This is not a whole number, so
Step 3a says round it up to 7 and proceed to Step 3b. Count from left to right in the data set until you reach the 7th number,
68; this is Q1. For Q3 (the 75th percentile) multiply 0.75 * 25 = 18.75; round up to 19, and the 19th number on the list is
89, or Q3. Putting it all together, the five-number summary for the test scores data is 43, 68, 77, 89, and 99.
The purpose of the five-number summary is to give descriptive statistics for center, variability, and relative
standing all in one shot. The measure of center in the five-number summary is the median, and the first quartile, median,
and third quartiles are measures of relative standing. To obtain a measure of variability based on the five-number
summary, you can find what’s called the Interquartile Range (or IQR). The IQR equals Q3 – Q1 and reflects the distance
taken up by the innermost 50% of the data. If the IQR is small, you know there is much data close to the median. If the
IQR is large, you know the data are more spread out from the median. The IQR for the test scores data set is 89 – 68 = 21,
which is quite large seeing as how test scores only go from 0 to 100.
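Putting the pieces together, the five-number summary and the IQR for the test scores can be computed as below. This is a sketch that reuses the hand percentile rule from "Finding a percentile"; the helper names `percentile` and `five_number_summary` are illustrative.

```python
import math
from fractions import Fraction


def percentile(ordered, k):
    """k-th percentile of an already sorted list, using the hand rule from the notes."""
    n = len(ordered)
    position = Fraction(k) * n / 100
    if position.denominator != 1:                 # not whole: round up and take that value
        return ordered[math.ceil(position) - 1]
    index = int(position)
    if index == n:                                # assumption: nothing after the last value
        return ordered[-1]
    return (ordered[index - 1] + ordered[index]) / 2


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(data)
    return (ordered[0],
            percentile(ordered, 25),
            percentile(ordered, 50),
            percentile(ordered, 75),
            ordered[-1])


scores = [43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77,
          78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99]
summary = five_number_summary(scores)
iqr = summary[3] - summary[1]     # Q3 minus Q1
print(summary)                    # (43, 68, 77, 89, 99)
print(iqr)                        # 21
```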
Boxplots: Boxplots are a standardized way of displaying the distribution of data based on a five-number summary
(“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). Use box and whisker plots when
you have multiple data sets from independent sources related to each other in some way. Examples include test
scores between schools or classrooms, and data from duplicate machines manufacturing the same products.
Suppose you wanted to compare the performance of three lathes responsible for the rough turning of a motor shaft. The
design specification is 18.85 +/- 0.1 mm. Diameter measurements from a sample of shafts taken from each roughing
lathe are displayed in a box and whisker plot in the figure.
• Lathe 1 appears to be making good parts and is centered in the tolerance.

• Lathe 2 appears to have excess variation and is making shafts below the minimum diameter.
• Lathe 3 performs with relatively less variation than Lathe 2; however, it is centered on the lower side of the
specification and is making shafts below specification.
A boxplot is a one-dimensional graph of numerical data based on the five-number summary, which includes the
minimum value, the 25th percentile (known as Q1), the median, the 75th percentile (Q3), and the maximum value.
In essence, these five descriptive statistics divide the data set into four equal parts.
Box and whisker plots are used to display and analyze data conveniently. They show many important values
needed for further analysis, like the median, the 25th percentile mark, and the outliers in the data. This helps in fields
like machine learning, deep learning, etc., which involve the representation of huge amounts of data. A boxplot can also
represent multiple sets of data in the same graph.
To make a boxplot, follow these steps:

1. Find the five number summary of your data set.
2. Create a horizontal number line whose scale includes the numbers in the five-number summary.
3. Label the number line using appropriate units of equal distance from each other.
4. Mark the location of each number in the five-number summary just above the number line.
5. Draw a box around the marks for the 25th percentile and the 75th percentile.
6. Draw a line in the box where the median is located.
7. Draw lines from the outside edges of the box out to the minimum and maximum values in the data set.
Draw the box-and-whisker plot for the data; 21, 29, 25, 20, 36, 28, 32, 35, 28, 30, 29, 25, 21, 35, 26, 35, 20, 19
We will first find the median and then the lower quartile and the upper quartile using the quartile formula to draw
the box and whisker.
Step 1: Arrange the data in ascending order.
19, 20, 20, 21, 21, 25, 25, 26, 28, 28, 29, 29, 30, 32, 35, 35, 35, 36.
Step 2: Find the median of the data. There are n = 18 values, an even number, so
M = [(n/2)th term + (n/2 + 1)th term]/2
M = [9th term + 10th term]/2
M = (28 + 28)/2 = 28
M = 28
Step 3: Find the minimum and maximum values of the data.
The minimum is the lowest number in the data set which is 19.
The maximum is the highest number in the data set which is 36.
Step 4: Find the first quartile which lies at 25% of the data and the third quartile which lies at 75% of the data.
The first quartile (Q1) is the median of the lower half of data which lies at 25% of the data.
19, 20, 20, 21, 21, 25, 25, 26, 28.
Q1 = 21
The third quartile (Q3) is the median of the upper half of data which lies at 75% of the data.
28, 29, 29, 30, 32, 35, 35, 35, 36.
Q3 = 32
Step 5: Draw a box and whisker using the data below.
Minimum: 19
First quartile: 21
Median: 28
Third quartile: 32
Maximum: 36
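The worked example above finds the quartiles as medians of the lower and upper halves, which can be sketched as follows. The function name `quartiles_by_halves` is illustrative. For a data set with an odd number of values, this sketch leaves the middle value out of both halves; that is an assumption on my part, since conventions differ and the example only covers an even count.

```python
from statistics import median


def quartiles_by_halves(data):
    """Return (Q1, median, Q3), taking Q1 and Q3 as medians of the two halves."""
    ordered = sorted(data)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values")
    half = n // 2
    lower = ordered[:half]
    # Assumption: with an odd count, the middle value belongs to neither half.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return median(lower), median(ordered), median(upper)


data = [21, 29, 25, 20, 36, 28, 32, 35, 28, 30, 29, 25, 21, 35, 26, 35, 20, 19]
q1, med, q3 = quartiles_by_halves(data)
print(min(data), q1, med, q3, max(data))   # 19 21 28.0 32 36
```

For this data set the percentile steps from "Finding a percentile" give the same quartiles (21 and 32), but the two methods do not always agree.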
The Binomial Distribution: A random variable is a characteristic, measurement, or count that changes randomly
according to some set of probabilities; its notation is X, Y, Z, and so on. A list of all possible values of a random variable,
along with their probabilities is called a probability distribution. One of the most well-known probability distributions is
the binomial. Binomial means “two names” and is associated with situations involving two outcomes: success or failure
(hitting a red light or not; developing a side effect or not).
Characteristics of a Binomial: A random variable has a binomial distribution if all of the following conditions are met:
1. There are a fixed number of trials (n).
2. Each trial has two possible outcomes: success or failure.
3. The probability of success (call it p) is the same for each trial.
4. The trials are independent, meaning the outcome of one trial doesn’t influence that of any other.
Let X equal the total number of successes in n trials; if all of the above conditions are met, X has a binomial distribution
with probability of success equal to p.
Checking the Binomial conditions step by step:
You flip a fair coin 10 times and count the number of heads. Does this represent a binomial random variable? You can
check by reviewing your responses to the questions and statements in the list that follows:
1. Are there a fixed number of trials? You’re flipping the coin 10 times, which is a fixed number. Condition 1 is met, and n
= 10.
2. Does each trial have only two possible outcomes — success or failure? The outcome of each flip is either heads or tails,
and you’re interested in counting the number of heads, so flipping a head represents success and flipping a tail is a failure.
Condition 2 is met.
3. Is the probability of success the same for each trial? Because the coin is fair the probability of success (getting a head) is
p = 1/2 for each trial. You also know that 1 – 1/2 = 1/2 is the probability of failure (getting a tail) on each trial. Condition
3 is met.
4. Are the trials independent? We assume the coin is being flipped the same way each time, which means the outcome of
one flip doesn’t affect the outcome of subsequent flips. Condition 4 is met.
Because the coin-flipping example meets the four conditions, the random variable X, which counts the number of
successes (heads) that occur in 10 trials, has a binomial distribution with n = 10 and p = 1/2.
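Once the four conditions hold, the probability of each possible count can be computed. The notes do not state the formula, so the sketch below brings in the standard binomial probability formula, P(X = k) = C(n, k) p^k (1 – p)^(n – k), as an outside fact. The function name `binomial_pmf` is illustrative.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.5   # ten flips of a fair coin

# The probability distribution: every possible value of X with its probability.
distribution = {k: binomial_pmf(k, n, p) for k in range(n + 1)}

print(distribution[5])              # 0.24609375
print(sum(distribution.values()))   # 1.0
```

Exactly five heads is the most likely single outcome, yet it happens less than a quarter of the time.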
Types of data & the scales of measurement

Data is a valuable asset, often described as the world’s most valuable resource. That makes understanding the
different types of data, and the role of a data scientist, more important than ever.
What is data? In short, it’s a collection of measurements or observations, divided into two different types:
qualitative and quantitative.
Data at the highest level: qualitative and quantitative
Qualitative data refers to information about qualities, or information that cannot be measured. It’s usually
descriptive and textual. Examples include someone’s eye colour or the type of car they drive. In surveys, it’s often
used to categorise ‘yes’ or ‘no’ answers.
Quantitative data is numerical. It’s used to define information that can be counted. Some examples of quantitative
data include distance, speed, height, length and weight. It’s easy to remember the difference between qualitative and
quantitative data, as one refers to qualities, and the other refers to quantities.
A bookshelf, for example, may have 100 books on its shelves and be 100 centimetres tall. These are quantitative
data points. The colour of the bookshelf – red – is a qualitative data point.
What is quantitative (numerical) data?
Quantitative, or numerical, data can be broken down into two types: discrete and continuous.
Discrete data
Discrete data consist of whole numbers that can’t be divided or broken into individual parts, fractions or decimals.
Examples of discrete data include the number of pets someone has: one can have two dogs but not two-and-a-half
dogs. The number of wins someone’s favourite team gets is also a form of discrete data because a team can’t have a
half win: a game is either a win, a loss, or a draw.
Continuous data
Continuous data describes values that can be broken down into different parts, units, fractions and decimals.
Continuous data points, such as height and weight, can be measured. Time can also be broken down – by half a
second or half an hour. Temperature is another example of continuous data.
Discrete versus continuous
There’s an easy way to remember the difference between the two types of quantitative data: data is considered
discrete if it can be counted and is continuous if it can be measured. Someone can count students, tickets purchased
and books, while one measures height, distance and temperature.
What is qualitative (categorical) data?
Qualitative data describes the qualities of data points and is non-numerical. It’s used to define the information and
can be further classified with the scales of measurement, typically the nominal and ordinal scales.
Properties and scales of measurement

Scales of measurement are how variables are defined and categorised.
Psychologist Stanley Stevens developed the four common scales of measurement: nominal, ordinal, interval and ratio.
Each scale of measurement has properties that determine how to properly analyse the data. The properties evaluated
are identity, magnitude, equal intervals and a minimum value of zero.
Properties of Measurement
• Identity: Identity refers to each value having a unique meaning.
• Magnitude: Magnitude means that the values have an ordered relationship to one another, so there is a
specific order to the variables.
• Equal intervals: Equal intervals mean that data points along the scale are equally spaced, so the difference between
data points one and two will be the same as the difference between data points five and six.
• A minimum value of zero: A minimum value of zero means the scale has a true zero point. Degrees, for
example, can fall below zero and still have meaning. But if you weigh nothing, you don’t exist.
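As a compact way to keep the four properties straight, the sketch below tabulates which properties each of the four scales (nominal, ordinal, interval, ratio) has, following the descriptions in these notes. The names `SCALE_PROPERTIES` and `has_property` are illustrative.

```python
# Which measurement properties each scale has, from lowest to highest level.
SCALE_PROPERTIES = {
    "nominal": {"identity"},
    "ordinal": {"identity", "magnitude"},
    "interval": {"identity", "magnitude", "equal intervals"},
    "ratio": {"identity", "magnitude", "equal intervals", "true zero"},
}


def has_property(scale, prop):
    """True if the given scale of measurement has the given property."""
    return prop in SCALE_PROPERTIES[scale]


print(has_property("ordinal", "equal intervals"))   # False
print(has_property("interval", "true zero"))        # False
print(has_property("ratio", "true zero"))           # True

# Each level keeps everything from the level below and adds one property.
levels = ["nominal", "ordinal", "interval", "ratio"]
for lower, higher in zip(levels, levels[1:]):
    print(SCALE_PROPERTIES[lower] < SCALE_PROPERTIES[higher])   # True each time
```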
The four scales of measurement

By understanding the scale of the measurement of their data, data scientists can determine the kind of statistical test
to perform.
1. Nominal scale of measurement

The nominal scale of measurement defines the identity property of data. This scale has certain characteristics, but
doesn’t have any form of numerical meaning. The data can be placed into categories but can’t be multiplied,
divided, added or subtracted from one another. It’s also not possible to measure the difference between data
points.
Examples of nominal data include eye colour and country of birth. Nominal data can be broken down again into
three categories:
• Nominal with order: Some nominal data can be sub-categorised in order, such as “cold, warm,
hot and very hot.” (Ordered categories like these are more commonly classified as ordinal data.)
• Nominal without order: Nominal data can also be sub-categorised as nominal without order,
such as male and female.
• Dichotomous: Dichotomous data is defined by having only two categories or levels, such as
‘yes’ and ‘no’.
2. Ordinal scale of measurement

The ordinal scale defines data that is placed in a specific order. While each value is ranked, there’s no information
that specifies what differentiates the categories from each other. These values can’t be added to or subtracted from.
An example of this kind of data would include satisfaction data points in a survey, where ‘one = happy, two =
neutral, and three = unhappy.’ Where someone finished in a race also describes ordinal data. While first place,
second place or third place shows what order the runners finished in, it doesn’t specify how far the first-place
finisher was in front of the second-place finisher.
This scale not only assigns values to the variables but also measures the rank or order of the variables, such as:
• Grades
• Satisfaction
• Happiness
How satisfied are you with our services?
• 1- Very Unsatisfied
• 2- Unsatisfied
• 3- Neutral
• 4- Satisfied
• 5- Very Satisfied
3. Interval scale of measurement

The interval scale contains properties of nominal and ordered data, but the difference between data points can be
quantified. This type of data shows both the order of the variables and the exact differences between the
variables. They can be added to or subtracted from each other, but not meaningfully multiplied or divided. For example, 40
degrees is not twice as hot as 20 degrees.
This scale is also characterised by the fact that zero is just another value on the scale. On a ratio scale,
zero means that none of the quantity is present. On the interval scale, zero still has meaning: for example, if you
measure degrees, zero is a temperature.
Data points on the interval scale have the same difference between them. The difference on the scale between 10 and
20 degrees is the same between 20 and 30 degrees. This scale is used to quantify the difference between variables,
whereas the other two scales are used to describe qualitative values only. Other examples of interval scales include
the year a car was made or the months of the year.
The difference between interval and ratio scales comes from their ability to dip below zero. Interval scales hold no
true zero and can represent values below zero. For example, you can measure temperature below 0 degrees
Celsius, such as -10 degrees. In an interval scale, the data collected can be added and subtracted.
The scale allows computing the degree of difference but not the ratio between them. An interval variable is
one where the difference between two values is meaningful. The difference between a temperature of 100
degrees and 90 degrees is the same difference as between 90 degrees and 80 degrees.
The following questions fall under the Interval Scale category:

 What is your family income?
 What is the temperature in your city?
4. Ratio scale of measurement

Ratio scales of measurement include the properties of the other three scales of measurement. The data is nominal and
defined by an identity, can be classified in order, contains equal intervals and can be broken down into exact values.
Weight, height and distance are all examples of ratio variables. Data in the ratio scale can be added, subtracted,
divided and multiplied.
Ratio scales also differ from interval scales in that the scale has a ‘true zero’. The number zero means that none of
the quantity being measured is present. Height and weight are examples: zero centimetres or zero kilos would mean no
height or weight at all, and nobody can be negative centimetres tall or weigh negative kilos. Examples of the use of
this scale are calculating shares or sales.
Of all types of data on the scales of measurement, data scientists can do the most with ratio data points.
Unlike interval variables, ratio variables such as height and weight never fall below zero. A ratio scale permits not
only addition and subtraction but also multiplication and division. That is, you can calculate the ratio of two
values. A ratio variable has all the properties of an interval variable, but also has a clear definition of 0.0.
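As a rough illustration of why ratios are meaningful on a ratio scale, the Python sketch below compares two weights in kilograms and in pounds. The weights and the approximate conversion factor are assumptions for the example. Because both units share the same true zero, the ratio is the same in either unit.

```python
import math

KG_TO_LB = 2.20462  # approximate conversion factor (assumption for illustration)

# Two illustrative weights in kilograms.
w1_kg, w2_kg = 40.0, 80.0
w1_lb, w2_lb = w1_kg * KG_TO_LB, w2_kg * KG_TO_LB

# The ratio does not depend on the unit because zero means "no weight" in both.
ratio_kg = w2_kg / w1_kg
ratio_lb = w2_lb / w1_lb

print(ratio_kg)                          # 2.0
print(math.isclose(ratio_kg, ratio_lb))  # True
```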
Ratio Scale Examples

The following questions fall under the Ratio Scale category:
What is your daughter’s current height?
 Less than 5 feet
 5 feet – 5 feet 5 inches
 5 feet 6 inches – 6 feet
 More than 6 feet
What is your weight in kilograms?
 Less than 50 kilograms
 50 – 70 kilograms
 71 – 90 kilograms
 91 – 110 kilograms
 More than 110 kilograms
To summarise, nominal scales are used to label or describe values. Ordinal scales are used to provide
information about the specific order of the data points, mostly seen in the use of satisfaction surveys. The
interval scale is used to understand the order of the data points and the differences between them. The ratio scale
gives information about identity, order and difference, plus a breakdown of the numerical detail within each data
point.
Using quantitative and qualitative data in statistics: Once data scientists have a conclusive data set from
their sample, they can start to use the information to draw descriptions and conclusions. To do this, they can use
both descriptive and inferential statistics.
Descriptive statistics
Descriptive statistics help demonstrate, represent, analyse and summarise the findings contained in a sample. They
present data in an easy-to-understand and presentable form, such as a table or graph. Without description, the data
would be in its raw form with no explanation.
Frequency counts
One way data scientists can describe data is using frequency counts, or frequency statistics, which describe the
number of times a value occurs in a data set. For example, the number of people with blue eyes or the number of
people with a driver’s license in the sample can be counted by frequency. Other examples include qualifications of
education, such as high school diploma, a university degree or doctorate, and categories of marital status, such as
single, married or divorced.
Frequency counts suit discrete or categorical data, where the values can’t be broken down further. To summarise
continuous data, such as age, data scientists can use central tendency statistics instead. To do this, they find the
mean or average of the data points. Using the age example, this can tell them the average age of participants in the
sample.
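A short Python sketch of both ideas, using the standard library. The eye colours and ages are made-up illustration data, not values from any real sample.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample (illustration only).
eye_colors = ["blue", "brown", "brown", "green", "blue", "brown"]
ages = [23, 35, 41, 29, 35, 47]

# Frequency count: how many times each category occurs.
eye_color_counts = Counter(eye_colors)

# Central tendency: the mean age of the participants.
average_age = mean(ages)

print(eye_color_counts["brown"], eye_color_counts["blue"], eye_color_counts["green"])  # 3 2 1
print(average_age)  # 35
```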
While data scientists can draw summaries from the use of descriptive statistics and present them in an
understandable form, they can’t necessarily draw conclusions. That’s where inferential statistics come in.
Inferential statistics
Inferential statistics are used to draw conclusions about a population from a sample of data. It is usually
impractical to get data from an entire population, so data scientists can use inferential statistics to extrapolate
their results. Using these statistics, they can make generalisations and predictions about the wider population,
even if they haven’t surveyed everyone.
An example of using inferential statistics is in an election. Even before the entire country has voted, data scientists
can use these kinds of statistics to predict who might win based on a smaller sample.
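As a sketch of the election example, the Python code below estimates a candidate's support from a poll and attaches an approximate 95% confidence interval. The poll numbers are invented, the function name is an assumption, and the interval uses the normal approximation for a proportion, which is only one of several possible methods and assumes a reasonably large random sample.

```python
import math


def proportion_confidence_interval(successes, n, z=1.96):
    """Return (estimate, lower, upper) using the normal approximation.

    z=1.96 corresponds to an approximate 95% confidence level.
    """
    p_hat = successes / n
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * standard_error
    return p_hat, p_hat - margin, p_hat + margin


# Hypothetical poll: 540 of 1000 sampled voters favour candidate A.
p_hat, low, high = proportion_confidence_interval(540, 1000)

print(f"estimate {p_hat:.3f}, interval {low:.3f} to {high:.3f}")
```

If the whole interval lies above 0.5, the sample suggests the candidate is ahead in the wider population, although the conclusion still depends on how the sample was selected.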
Stem and Leaf Plot: A stem and leaf plot is a table where each data value is split into a stem and a leaf. The
first digit or digits are written in the stem and the last digit is written in the leaf.
What is Stem and Leaf Plot?
A stem and leaf plot, also called a stem and leaf diagram, is a way of organizing data into a form that makes it easy to
observe the frequency of different types of values. It is a graph that shows numerical data arranged in order. Each
data value is broken into a stem and a leaf.
A stem and leaf plot is represented in the form of a special table where the first digit or digits of each data value
go in the stem and the last digit goes in the leaf. The " | " symbol is used to separate stem values from leaf values,
and it is called the stem and leaf plot key. For example, 46 is represented as 4 on the stem and 6 on the leaf and
shown using the stem and leaf plot key like this: 4 | 6.
The stem values are listed one below the other in ascending order, and the leaf values are
listed left to right from the stem in ascending order.
For example,
6 | 7 ⇒ 6 on the stem and 7 on the leaf, read as 67.
6 | 8 ⇒ 6 on the stem and 8 on the leaf, read as 68.
Finding the mean, median, and mode is a part of stem and leaf plot statistics. Let's understand with the help of an
example. Consider the following stem and leaf plot, which shows 5 data values.
2 | 0
3 | 2 2 5
4 | 1
The data values are already in ascending order. They are 20, 32, 32, 35 and 41.
Mean of the data = Sum of data values ÷ Total number of values = (20 + 32 + 32 + 35 + 41) ÷ 5 = 160 ÷ 5 = 32.
Mean = 32
Mode is the data value that appears most frequently.
Mode = 32
Median = The middle value of the data.
Median = 32
Thus, calculating the stem and leaf plot statistics is very easy.
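The same three statistics can be checked with Python's standard statistics module. This is a minimal sketch using the five data values from the example above.

```python
from statistics import mean, median, mode

# The five data values read from the stem and leaf plot in the example.
values = [20, 32, 32, 35, 41]

data_mean = mean(values)      # 160 / 5
data_median = median(values)  # middle value of the sorted data
data_mode = mode(values)      # most frequent value

print(data_mean, data_median, data_mode)  # 32 32 32
```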
How to Read Stem and Leaf Plot?

The stem and leaf plot key helps us understand the data values. The stem is on the left and the leaf is on the right. If
you combine the values of the stem and the leaves, you will get the data values. Let us interpret the data in the
following stem and leaf plot.
0 | 2 6 8
1 | 5 7
2 | 5 9
(a) What are the leaf numbers for stem 2?
Observe the stem and leaf plot table. Look at the stem column on the left and locate stem 2. Then find the
corresponding leaves on the right. The leaf numbers are 5 and 9.
(b) What are the data values for stem 2?
Observe the stem and leaf plot table. Identify the data values. The stem 2 stands for 20 and the leaves are 5 and 9.
Combining the values of the stem and the leaves, we get the data values 25 and 29.
(c) What are the data values for stem 0?
The numbers on the leaves against the value 0 on the stem are 2, 6, and 8. Thus, the data values are 2, 6, and 8.
(d) List the data values greater than 15.
From stem value 1, we get the data values 15 and 17. From stem value 2, we get the data values 25 and 29. Hence,
the data values greater than 15 are 17, 25, and 29.
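Reading a plot can also be sketched in Python. The plot is stored as a dictionary from stem to leaves. The leaves used here are the ones implied by answers (a) to (d); in particular, taking the leaves of stem 1 to be exactly 5 and 7 is an assumption drawn from answer (d). The function name is hypothetical.

```python
def stem_leaf_to_values(plot):
    """Combine each stem with its leaves to recover the data values.

    Assumes one-digit leaves, so value = stem * 10 + leaf.
    """
    return [
        stem * 10 + leaf
        for stem, leaves in sorted(plot.items())
        for leaf in leaves
    ]


# Plot implied by the answers above (stem -> leaves).
plot = {0: [2, 6, 8], 1: [5, 7], 2: [5, 9]}

values = stem_leaf_to_values(plot)
greater_than_15 = [value for value in values if value > 15]

print(values)           # [2, 6, 8, 15, 17, 25, 29]
print(greater_than_15)  # [17, 25, 29]
```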
How to Construct Stem and Leaf Plot?
Follow the steps for constructing a stem and leaf plot.
 Step 1: Look at the data and find the number of digits. Classify them as 2 or 3 digit numbers.
 Step 2: Fix the stem and leaf plot key. For example, 3 | 5 = 35, and 15 | 2 = 152.
 Step 3: Identify the first digits as stems and the last digit as leaves.
 Step 4: Determine the range of the data, i.e. the lowest and the highest values among the data.
 Step 5: Draw a vertical line. Place the stem on the left column and the leaf on the right column.
 Step 6: List the stems in the stems column. Arrange them in ascending order, with the lowest at the top.
 Step 7: Plot the leaves in the column against the stem from lowest to the highest horizontally.
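The steps above can be sketched in Python. This is one possible implementation, not a standard library routine: the function names are hypothetical, and it assumes non-negative whole numbers so that the last digit is the leaf and the remaining digits are the stem.

```python
from collections import defaultdict


def build_stem_and_leaf(data):
    """Group sorted values by stem (all but the last digit) and leaf (last digit)."""
    stems = defaultdict(list)
    for value in sorted(data):          # leaves end up in ascending order
        stem, leaf = divmod(value, 10)  # e.g. 35 -> (3, 5), 152 -> (15, 2)
        stems[stem].append(leaf)
    return dict(sorted(stems.items()))  # stems in ascending order


def format_stem_and_leaf(plot):
    """Render the plot as text, one 'stem | leaves' row per stem."""
    return "\n".join(
        f"{stem} | {' '.join(map(str, leaves))}" for stem, leaves in plot.items()
    )


plot = build_stem_and_leaf([32, 20, 41, 35, 32])
print(format_stem_and_leaf(plot))
# 2 | 0
# 3 | 2 2 5
# 4 | 1
```

Stems with no data are simply left out in this sketch. Some conventions list empty stems as well so that gaps in the distribution stay visible.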
Important Notes
Given below are some important notes related to stem and leaf plots. Have a look!
 If we place a stem and leaf plot of one data set alongside a plot of another data set, we can compare the two
frequency distributions and may be able to see a relationship between them.
 The stem and leaf plot key for three-digit numbers is represented as two digits in the stem and one digit in
the leaf. For example, 45 | 6 = 456
 The mean, median, and mode of the given data can easily be calculated using stem and leaf plots.
How to Split a Stem and Leaf Plot?
The split stem and leaf plot separates each stem into two or more stems when a stem has many leaves. We place the
smaller leaves on the first part of the split stem and the larger leaves on the subsequent stems.
For example, let's consider a set of data that includes the scores of 7 students in their math test: 78, 82, 82, 90,
94, 81, 72. The data in ascending order: 72, 78, 81, 82, 82, 90, 94.
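A minimal Python sketch of a split plot for these scores. It assumes each stem is split into two rows, with leaves 0 to 4 on the first row and leaves 5 to 9 on the second; this is a common convention, but the notes do not state which split is intended, and the function name is hypothetical.

```python
def build_split_stem_and_leaf(data):
    """Return {(stem, half): leaves}, where half 0 holds leaves 0-4 and half 1 holds 5-9."""
    rows = {}
    for value in sorted(data):
        stem, leaf = divmod(value, 10)
        half = 0 if leaf <= 4 else 1
        rows.setdefault((stem, half), []).append(leaf)
    return rows


scores = [78, 82, 82, 90, 94, 81, 72]
split_plot = build_split_stem_and_leaf(scores)

for (stem, _half), leaves in split_plot.items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
# 7 | 2
# 7 | 8
# 8 | 1 2 2
# 9 | 0 4
```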
