
CS3352 FOUNDATIONS OF DATA SCIENCE

II YEAR / III SEMESTER B.Tech.- INFORMATION TECHNOLOGY

UNIT II
DESCRIBING DATA

COMPILED BY,

Prof.M.KARTHIKEYAN, M.E., HoD / IT

VERIFIED BY

HOD PRINCIPAL CEO/CORRESPONDENT

DEPARTMENT OF INFORMATION TECHNOLOGY

SENGUNTHAR COLLEGE OF ENGINEERING, TIRUCHENGODE – 637 205.


UNIT II
DESCRIBING DATA

➢ Types of Data
➢ Types of Variables
➢ Describing Data with Tables and Graphs
➢ Describing Data with Averages
➢ Describing Variability
➢ Normal Distributions and Standard (z) Score

LIST OF IMPORTANT QUESTIONS
UNIT - I

INTRODUCTION

PART – A
1. Write the three types of data.
2. What is Random sampling?
3. What do you mean by Descriptive Statistics?
4. Write the difference between Parameter and Statistic.
5. Write the Statistics and its types.
6. What is Descriptive Statistics?
7. What is Inferential Statistics?
8. What do you mean by Mean?
9. What do you mean by Median?
10. What do you mean by Mode?
11. What do you mean by Variance?
12. What do you mean by Standard Deviation?
13. What do you mean by Range?
14. What are Probability distributions?
15. What do you mean by Graphical representations?
PART – B
1. Illustrate the steps in constructing a frequency distribution and Construct
a frequency distribution for grouped data. Specify the real limits for the
lowest class interval in this frequency distribution.

2. The frequency distribution in the table shows the annual incomes in
dollars for a group of college graduates.

a. Construct a histogram.
b. Construct a frequency polygon.
c. Is this distribution balanced or lopsided?

3. Explain the computation process of sum of squares formulas for population.
4. Explain the following;
a. Misleading graphs.
b. The difference between mean and median.
c. Determine the mode for the following retirement ages: 60, 63, 45,
63, 65, 70, 55, 63, 60, 65, 63.
d. The owner of a new car conducts six gas mileage tests and obtains
the following results, expressed in miles per gallon: 26.3, 28.7,
27.4, 26.6, 27.4, and 26.9. Find the mode for these data.
5. Explain how to calculate interquartile range.
6. Explain in detail about Typical Shapes or Explain the
Types of Frequency Distribution.

7. Explain how to construct a frequency polygon from a histogram.

LIST OF IMPORTANT QUESTIONS
UNIT - I

INTRODUCTION

PART – A

1. Write the three types of data.

The three types of data are qualitative, ranked, and quantitative. Generally,
qualitative data consist of words (Yes or No), letters (Y or N), or
numerical codes (0 or 1) that represent a class or category. Ranked data consist of
numbers (1st, 2nd, . . . 40th place) that represent relative standing within a group.
Quantitative data consist of numbers (weights of 238, 170, . . . 185 lbs) that represent
an amount or a count.

2. What is Random sampling?


Random sampling is a procedure designed to ensure that each potential
observation in the population has an equal chance of being selected in a survey. Classic
examples of random samples are a state lottery where each number from 1 to 99 in the
population has an equal chance of being selected as one of the five winning numbers
or a nation-wide opinion survey in which each telephone number has an equal chance
of being selected as a result of a series of random selections, beginning with a three-
digit area code and ending with a specific seven-digit telephone number.

3. What do you mean by Descriptive Statistics?

Statistics exists because of the prevalence of variability in the real world. In its
simplest form, known as descriptive statistics, statistics provides us with tools
(tables, graphs, averages, ranges, correlations) for organizing and summarizing the
inevitable variability in collections of actual observations or scores.
Examples
• A graph showing the annual change in global temperature during the last 30
years
• A report that describes the average difference in grade point average (GPA)
between college students who regularly drink alcoholic beverages and those who don’t

4. Write the difference between Parameter and Statistic.
In everyday practice we constantly refer to the population and the sample, so
it is important to know the terms used to describe each. A parameter is a number
that describes the data from a population, and a statistic is a number that
describes the data from a sample.

5. Write the Statistics and its types.


Statistics:
Statistics is a core part of the foundations of data science. It is the method or
science of collecting and analyzing numerical data in large quantities to get useful
insights. The Wikipedia definition of Statistics states that “it is a discipline that concerns
the collection, organization, analysis, interpretation, and presentation of data.”
It means, as part of statistical analysis, we collect, organize, and draw meaningful
insights from the data either through visualizations or mathematical explanations.
Statistics is broadly categorized into two types:

1. Descriptive Statistics
2. Inferential Statistics

6. What is Descriptive Statistics?

As the name suggests, in descriptive statistics we describe the data using the
mean, standard deviation, charts, or probability distributions. Basically, as part of
descriptive statistics, we measure the following:

1. Frequency: no. of times a data point occurs


2. Central tendency: the centrality of the data – mean, median, and mode
3. Dispersion: the spread of the data – range, variance, and standard deviation
4. The measure of position: percentiles and quantile ranks
7. What is Inferential Statistics?
In inferential statistics, we estimate the population parameters, or we run
hypothesis tests to assess the assumptions made about the population parameters.
In simple terms, we interpret the meaning of the descriptive statistics by generalizing
them from the sample to the population.
For example, suppose we are conducting a survey on two-wheeler ownership in a
city. Assume the city has a total population of 5L (5 lakh, or 500,000) people. We take a
sample of 1000 people because it is impractical to run an analysis on the entire population.
From the survey conducted, it is found that 800 people out of 1000 (80%) own
two-wheelers. So, we can generalize this result to the population and
conclude that about 4L people out of the 5L population own two-wheelers.
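
The scaling in this example is simple proportion arithmetic. A minimal Python sketch follows; the variable names are illustrative and not part of the notes, and it assumes that 5L means 500,000 people.

```python
# Figures from the survey example; variable names are hypothetical.
population_size = 500_000    # "5L" = 5 lakh people
sample_size = 1_000
owners_in_sample = 800

# Sample statistic: proportion of two-wheeler owners in the sample.
sample_proportion = owners_in_sample / sample_size

# Point estimate for the population (integer arithmetic avoids rounding noise).
estimated_owners = owners_in_sample * population_size // sample_size

print(f"{sample_proportion:.0%} of the sample, about {estimated_owners:,} owners")
```

A real study would also report the uncertainty of this estimate, which this sketch does not attempt.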

8. What do you mean by Mean?


It is the sum of all the data points divided by the total number of values in the
data set. Mean cannot always be relied upon because it is influenced by outliers.

9. What do you mean by Median?


It is the middlemost value of a sorted/ordered dataset. If the size of the dataset
is even, then the median is calculated by taking the average of the two middle values.
10. What do you mean by Mode?
It is the most repeated value in the dataset. Data with a single mode is called
unimodal, data with two modes is called bimodal, and data with more than two modes
is called multimodal.

11. What do you mean by Variance?

It is the average squared distance of all the data points from their mean. A drawback
of variance is that its units are the squares of the original units.
12. What do you mean by Standard Deviation?
It is the square root of the variance, which restores the original units.
13. What do you mean by Range?
It is the difference between the maximum and the minimum values of a dataset.
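
The measures defined in questions 8 to 13 can be checked with Python's standard `statistics` module. This is a sketch, not a prescribed method: it reuses the retirement ages and gas mileage figures from the Part B exercises in these notes, and it uses the population forms (`pvariance`, `pstdev`) because the definition above describes an average squared distance.

```python
import statistics

# Retirement ages reused from the mode exercise in these notes.
ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]

mean = statistics.mean(ages)            # sum of values / number of values
median = statistics.median(ages)        # middle value of the sorted data
mode = statistics.mode(ages)            # most frequent value
variance = statistics.pvariance(ages)   # average squared distance from the mean
std_dev = statistics.pstdev(ages)       # square root of the variance
value_range = max(ages) - min(ages)     # maximum minus minimum

# With an even number of values, the median is the mean of the two middle values.
mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]
median_even = statistics.median(mileage)

print(mean, median, mode, variance, std_dev, value_range, median_even)
```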
14. What are Probability distributions?
In statistical terms, a distribution function is a mathematical expression that
describes the probability of different possible outcomes for an experiment.

15. What do you mean by Graphical representations?

Graphical representation refers to the use of charts or graphs to visualize,
analyze and interpret numerical data. For a single variable (univariate analysis), we have
a bar plot, line plot, frequency plot, dot plot, boxplot, and the Normal Q-Q plot. We will
be discussing the Boxplot and the Normal Q-Q plot.

16. Students in a theater arts appreciation class rated the classic film “The Wizard of
Oz” on a 10-point scale, ranging from 1 (poor) to 10 (excellent), as follows:

Since the number of possible values is relatively small (only 10), it’s appropriate to
construct a frequency distribution for ungrouped data. Do this.

17. The IQ scores for a group of 35 high school dropouts are as follows:

Construct a frequency distribution for grouped data.

18. Construct a bar graph for the data shown in the following table

19. What are the measures of variability?


Measures of Variability: Range, Interquartile Range, Variance, and Standard
Deviation.

20. What is in an interquartile range?

The interquartile range (IQR) contains the second and third quartiles, or the
middle half of our data set. Whereas the range gives us the spread of the whole
data set, the interquartile range gives us the range of the middle half of a data set.

21. What is relative frequency distribution?


A relative frequency distribution shows the proportion of the total number of
observations associated with each value or class of values and is related to a
probability distribution, which is extensively used in statistics.
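
As an illustration, relative frequencies are each frequency divided by the total number of observations. The sketch below assumes the retirement ages from the Part B mode exercise as sample data; the variable names are illustrative.

```python
from collections import Counter

# Retirement ages reused from the mode exercise in these notes.
ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]

frequency = Counter(ages)           # value -> number of occurrences
total = sum(frequency.values())     # total number of observations

# Proportion of all observations associated with each value.
relative_frequency = {
    value: count / total for value, count in sorted(frequency.items())
}

for value, proportion in relative_frequency.items():
    print(f"{value}: {proportion:.2f}")
```

The proportions always sum to 1, which is what links a relative frequency distribution to a probability distribution.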
22. How to form cumulative frequency distribution?
The cumulative frequency can be calculated by adding the frequency of the
first class interval to the frequency of the second class interval. After that, the
sum is added to the frequency of the third class interval, and so on.
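
The running-total procedure can be sketched with `itertools.accumulate`. The class intervals and frequencies below are hypothetical, chosen only to illustrate the arithmetic.

```python
from itertools import accumulate

# Hypothetical class intervals and frequencies, for illustration only.
classes = ["130-139", "140-149", "150-159", "160-169", "170-179"]
frequencies = [3, 12, 17, 8, 5]

# Running total: each entry adds the current class frequency to the sum so far.
cumulative = list(accumulate(frequencies))

for label, f, cf in zip(classes, frequencies, cumulative):
    print(f"{label}: f = {f}, cumulative f = {cf}")
```

The last cumulative frequency always equals the total number of observations.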

23. What is a percentile score indicating?


A percentile is a comparison score between a particular score and the scores of
the rest of a group. It shows the percentage of scores that a particular score
surpassed. For example, if we score 75 points on a test, and are ranked in the
85th percentile, it means that the score 75 is higher than 85% of the scores.
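
A minimal sketch of this definition follows. The function name `percentile_rank` and the list of 20 test scores are hypothetical; the scores were chosen so that a score of 75 surpasses 17 of the 20 scores.

```python
def percentile_rank(score, scores):
    """Percentage of scores in the group that fall below the given score."""
    below = sum(1 for s in scores if s < score)
    return 100 * below / len(scores)

# Hypothetical test scores for a group of 20 students.
scores = [40, 45, 48, 50, 52, 55, 58, 60, 61, 63,
          65, 66, 68, 70, 71, 72, 74, 75, 80, 90]

rank = percentile_rank(75, scores)
print(f"A score of 75 is higher than {rank:.0f}% of the scores")
```

Other conventions exist (for example, counting tied scores as half), so results can differ slightly between textbooks and software.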

24. What is a histogram?


A histogram is a chart that plots the distribution of a numeric variable's
values as a series of bars. Each bar typically covers a range of numeric values
called a bin or class; a bar's height indicates the frequency of data points with a
value within the corresponding bin.

25. What is the use of frequency polygon?
A frequency polygon is a type of line graph in which straight line segments join the
points plotted at the midpoints of the class intervals. The shape of the resulting line
shows how the data are distributed. Both a line graph and a frequency polygon are widely
used when data is required to be compared.

26. What is the advantage of stem and leaf display?


The main advantage of a stem and leaf plot is that the data are grouped and all
the original data are shown, too. In Example 3 on battery life in the Frequency
distribution tables section, the table shows that two observations occurred in the
interval from 360 to 369 minutes.
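
To show how a stem and leaf display keeps every original value while grouping them, here is a small sketch. The battery-life figures are hypothetical (the table from the referenced example is not reproduced in these notes), and the function assumes non-negative whole numbers.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group integers by stem (all digits but the last); the last digit is the leaf."""
    stems = defaultdict(list)
    for value in sorted(values):
        stems[value // 10].append(value % 10)
    return dict(stems)

# Hypothetical battery-life observations in minutes.
battery_minutes = [363, 369, 371, 377, 382, 386, 390, 394, 401, 405]

display = stem_and_leaf(battery_minutes)
for stem, leaves in display.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

In this hypothetical data, the row for stem 36 holds the two observations in the interval from 360 to 369, and both original values (363 and 369) can still be read from the display.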
PART – B

1. Illustrate the steps in constructing a frequency distribution and
construct a frequency distribution for grouped data. Specify the real
limits for the lowest class interval in this frequency distribution.

Constructing Frequency Distributions


1. Find the range, that is, the difference between the largest and smallest
observations. The range of weights in Table 1.1 is 245 – 133 = 112.
2. Find the class interval required to span the range by dividing the range
by the desired number of classes (ordinarily 10). In the present example,
the interval is 112/10 = 11.2.
3. Round off to the nearest convenient interval (such as 1, 2, 3, . . . 10,
particularly 5 or 10 or multiples of 5 or 10). In the present example, the
nearest convenient interval is 10.
4. Determine where the lowest class should begin.
(Ordinarily, this number should be a multiple of the class interval.) In the
present example, the smallest score is 133, and therefore the lowest class
should begin at 130, since 130 is a multiple of 10 (the class interval).
5. Determine where the lowest class should end
by adding the class interval to the lower boundary and then subtracting
one unit of measurement. In the present example, add 10 to 130 and then
subtract 1, the unit of measurement, to obtain 139—the number at which
the lowest class should end.
6. Working upward, list as many equivalent classes as are required to
include the largest observation.
In the present example, list 130–139, 140–149, . . . , 240–249, so that the last
class includes 245, the largest score.
7. Indicate with a tally the class in which each observation falls.
For example, the first score in Table 1.1, 160, produces a tally next to 160–
169; the next score, 193, produces a tally next to 190–199; and so on.
8. Replace the tally count for each class with a number, the frequency (f),
and show the total of all frequencies.
(Tally marks are not usually shown in the final frequency distribution.)
9. Supply headings for both columns and a title for the table.
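
The steps above can be sketched in Python. Table 1.1 is not reproduced in these notes, so the weights below are a hypothetical sample that only shares the stated minimum (133), maximum (245), and first two scores (160, 193). The function name and parameters are illustrative, and the code assumes whole-number scores.

```python
def grouped_frequency_distribution(scores, class_width, unit=1):
    """Tally integer scores into equal-width classes (steps 4 to 8)."""
    # Step 4: start the lowest class at a multiple of the class width.
    lowest = min(scores) // class_width * class_width
    # Steps 5 and 6: build classes upward until the largest score is covered.
    classes = []
    lower = lowest
    while lower <= max(scores):
        classes.append((lower, lower + class_width - unit))
        lower += class_width
    # Steps 7 and 8: tally each score, keeping the counts as frequencies.
    frequency = {c: 0 for c in classes}
    for score in scores:
        frequency[classes[(score - lowest) // class_width]] += 1
    return frequency

# Hypothetical weights in pounds (not the actual Table 1.1 data).
weights = [160, 193, 133, 245, 170, 185, 238, 152, 168, 175]

value_range = max(weights) - min(weights)   # Step 1: 245 - 133
raw_width = value_range / 10                # Step 2: range / desired classes
class_width = 10                            # Step 3: nearest convenient interval

table = grouped_frequency_distribution(weights, class_width)
for (low, high), f in table.items():
    print(f"{low}-{high}: {f}")
print("Total:", sum(table.values()))
```

Step 3 is left as a manual choice here because "convenient" is a judgment call rather than a formula.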
The IQ scores for a group of 35 high school dropouts are as follows:

(a) Construct a frequency distribution for grouped data.
(b) Specify the real limits for the lowest class interval in this frequency
distribution.

(b) 64.5–69.5
The real limits are located at the midpoint of the gap between adjacent tabled
boundaries; that is, one-half of one unit of measurement below the lower tabled
boundary and one-half of one unit of measurement above the upper tabled
boundary: 65 - 0.5 = 64.5 and 69 + 0.5 = 69.5.
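
The same rule can be written as a one-line calculation. In this sketch the function name `real_limits` is hypothetical, and the unit of measurement defaults to 1 as in the example.

```python
def real_limits(lower, upper, unit=1):
    """Extend the tabled boundaries by half a unit of measurement on each side."""
    half_unit = unit / 2
    return lower - half_unit, upper + half_unit

print(real_limits(65, 69))     # lowest class interval in the IQ example
print(real_limits(130, 139))   # lowest class interval in the weights example
```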

2. The frequency distribution in the table shows the annual incomes in dollars
for a group of college graduates.

a. Construct a histogram.
b. Construct a frequency polygon.
c. Is this distribution balanced or lopsided?

NOTE: Ordinarily, only either (a) a histogram, or (b) a frequency polygon
would be shown.
When closing the left flank of (b), imagine extending a line to the midpoint of
the first unoccupied class (–10,000 to –1) on the left, but stop the line at the
vertical axis, as shown.
(c) Lopsided.

3. Explain the computation process of sum of squares formulas for population

Understanding the Sum of Squares


The sum of squares is a statistical measure of deviation from the mean. It is also
known as variation. It is calculated by adding together the squared differences between
each data point and the mean. In regression, the same idea is applied to the line of
best fit: square the distance between each data point and the line, then add the
squares together. The line of best fit minimizes this value.
A low sum of squares indicates little variation within the data set, while a higher
one indicates more variation. Variation refers to the difference of each data point from
the mean. We can visualize this in a chart. If the line doesn't pass through all the data
points, then there is some unexplained variability. We go into a little more detail about
this in the next section below.

Calculate the Sum of Squares


We can see why the measurement is called the sum of squared deviations, or the
sum of squares for short. We can use the following steps to calculate the sum of
squares:

1. Gather all the data points.
2. Determine the mean/average.
3. Subtract the mean/average from each individual data point.
4. Square each difference from Step 3.
5. Add up the figures from Step 4.
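
The five steps correspond to the definition formula for a population, SS = Σ(X − μ)². The sketch below also checks it against the standard computation formula, SS = ΣX² − (ΣX)²/N, which is not stated in these notes but is algebraically equivalent. The scores and function names are hypothetical.

```python
def sum_of_squares_definition(scores):
    """SS = sum of (X - mean)^2, following steps 1 to 5."""
    mean = sum(scores) / len(scores)               # Step 2
    return sum((x - mean) ** 2 for x in scores)    # Steps 3 to 5

def sum_of_squares_computation(scores):
    """SS = sum of X^2 minus (sum of X)^2 / N; no deviations are needed."""
    n = len(scores)
    return sum(x * x for x in scores) - sum(scores) ** 2 / n

# Hypothetical population of five scores.
scores = [2, 4, 6, 8, 10]

ss_definition = sum_of_squares_definition(scores)
ss_computation = sum_of_squares_computation(scores)
population_variance = ss_definition / len(scores)   # variance = SS / N

print(ss_definition, ss_computation, population_variance)
```

Dividing the sum of squares by N gives the population variance, and its square root gives the population standard deviation.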

In statistics, the mean is the average of a set of numbers, which is calculated by adding the
values in the data set together and dividing by the number of values. But the mean
alone does not show how spread out the data are. As such, it helps to know
the variation in a set of measurements. How far individual values are from the mean
may provide insight into how well the observations or values fit the regression model
that is created.

4. Explain the following:

a. Misleading graphs.
b. The difference between mean and median.
c. Determine the mode for the following retirement ages: 60, 63, 45,
63, 65, 70, 55, 63, 60, 65, 63.
d. The owner of a new car conducts six gas mileage tests and obtains
the following results, expressed in miles per gallon: 26.3, 28.7,
27.4, 26.6, 27.4, and 26.9. Find the mode for these data.
Misleading graphs
Misleading graphs are sometimes deliberately misleading and sometimes it’s just
a case of people not understanding the data behind the graph they create. The “classic”
types of misleading graphs include cases where:
• The Vertical scale is too big or too small, or skips numbers, or doesn’t start
at zero.
• The graph isn’t labeled properly.
• Data is left out.
But some real life misleading graphs go above and beyond the classic types. Some are
intended to mislead, others are intended to shock. And in some cases, well-meaning
individuals just got it all plain wrong.

The difference between mean and median
The mean (average) of a data set is found by adding all numbers in the data set and
then dividing by the number of values in the set. The median is the middle value when
a data set is ordered from least to greatest. Unlike the mean, the median is not
pulled toward extreme values (outliers). The mode is the number that occurs most
often in a data set.

The mode is the number that appears most often in the given data set.

Determine the mode for the following retirement ages: 60, 63, 45, 63, 65, 70, 55, 63,
60, 65, 63.
Mode=63

The owner of a new car conducts six gas mileage tests and obtains the following
results, expressed in miles per gallon: 26.3, 28.7, 27.4, 26.6, 27.4, and 26.9
Find the mode for these data.
Mode=27.4
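
Both answers can be verified by tallying how often each value occurs. This is a small sketch; the helper name `modes` is hypothetical. It returns a list so that bimodal or multimodal data would yield more than one value.

```python
from collections import Counter

def modes(values):
    """Return every value that shares the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    # If every value occurs once, all values are returned (no distinct mode).
    return [value for value, count in counts.items() if count == highest]

retirement_ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]
gas_mileage = [26.3, 28.7, 27.4, 26.6, 27.4, 26.9]

print(modes(retirement_ages))   # 63 occurs four times
print(modes(gas_mileage))       # 27.4 occurs twice
```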

5. Explain how to calculate interquartile range.


In descriptive statistics, the interquartile range tells us the spread of the
middle half of our distribution.

Quartiles segment any distribution that’s ordered from low to high into four
equal parts. The interquartile range (IQR) contains the second and third
quartiles, or the middle half of our data set.

Whereas the range gives us the spread of the whole data set, the interquartile
range gives us the range of the middle half of a data set.

Calculate the interquartile range

The interquartile range is found by subtracting the Q1 value from the Q3 value:

IQR = Q3 - Q1

• IQR = interquartile range
• Q3 = 3rd quartile or 75th percentile
• Q1 = 1st quartile or 25th percentile

Q1 is the value below which 25 percent of the distribution lies, while Q3 is the value
below which 75 percent of the distribution lies.

We can think of Q1 as the median of the first half and Q3 as the median of the
second half of the distribution.

Methods for finding the interquartile range

Although there’s only one formula, there are various different methods for identifying
the quartiles. We’ll get a different value for the interquartile range depending on the
method we use.

Here, we’ll discuss two of the most commonly used methods. These methods differ
based on how they use the median.

Exclusive method vs inclusive method


The exclusive method excludes the median when identifying Q1 and Q3, while
the inclusive method includes the median in identifying the quartiles.

The procedure for finding the median is different depending on whether our data set is
odd- or even-numbered.

• When we have an odd number of data points, the median is the value in the
middle of our data set. We can choose between the inclusive and exclusive
method.
• With an even number of data points, there are two values in the middle, so the
median is their mean. It’s more common to use the exclusive method in this
case.

While there is little consensus on the best method for finding the interquartile range,
for an odd-numbered data set the exclusive interquartile range is never smaller than
the inclusive interquartile range, and it is usually larger.

The exclusive interquartile range may be more appropriate for large samples, while for
small samples, the inclusive interquartile range may be more representative because
it’s a narrower range.

Steps for the exclusive method

To see how the exclusive method works by hand, we’ll use two examples: one with an
even number of data points, and one with an odd number.

Even-numbered data set


We’ll walk through four steps using a sample data set with 10 values.

Step 1: Order our values from low to high.

Step 2: Locate the median, and then separate the values below it from the
values above it.

With an even-numbered data set, the median is the mean of the two values in the
middle, so we simply divide our data set into two halves.

Step 3: Find Q1 and Q3.

Q1 is the median of the first half and Q3 is the median of the second half. Since each
of these halves has an odd number of values, there is only one value in the middle of
each half.

Step 4: Calculate the interquartile range.

Odd-numbered data set


This time we’ll use a data set with 11 values.

Step 1: Order our values from low to high.

Step 2: Locate the median, and then separate the values below it from the
values above it.

In an odd-numbered data set, the median is the number in the middle of the list. The
median itself is excluded from both halves: one half contains all values below the
median, and the other contains all the values above it.

Step 3: Find Q1 and Q3.


Q1 is the median of the first half and Q3 is the median of the second half. Since each
of these halves has an odd number of values, there is only one value in the middle of
each half.

Step 4: Calculate the interquartile range.

Steps for the inclusive method

Almost all of the steps for the inclusive and exclusive method are identical. The
difference is in how the data set is separated into two halves.

The inclusive method is sometimes preferred for odd-numbered data sets because it
doesn’t ignore the median, a real value in this type of data set.

Step 1: Order our values from low to high.

Step 2: Find the median.

The median is the number in the middle of the data set.

Step 3: Separate the list into two halves, and include the median in both halves.
The median is included as the highest value in the first half and the lowest value in the
second half.

Step 4: Find Q1 and Q3.
Q1 is the median of the first half and Q3 is the median of the second half. Since the
two halves each contain an even number of values, Q1 and Q3 are calculated as the
means of the middle values.

Step 5: Calculate the interquartile range.

We can see from these examples that using the inclusive method gives us a smaller
IQR. With the same data set, the exclusive IQR is 24, and the inclusive IQR is 20.
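
Both methods can be sketched in a few lines of Python. The worked data set is not reproduced in these notes, so the 11 values below are hypothetical, chosen so that the results agree with the IQRs of 24 and 20 quoted above. The function name `iqr` and its `method` parameter are illustrative.

```python
from statistics import median

def iqr(values, method="exclusive"):
    """Return (Q1, Q3, IQR) using the median-of-halves approach."""
    if method not in ("exclusive", "inclusive"):
        raise ValueError("method must be 'exclusive' or 'inclusive'")
    data = sorted(values)                      # Step 1: order low to high
    half = len(data) // 2
    if len(data) % 2 == 0:
        # Even size: the median falls between two values, so just split in two.
        lower, upper = data[:half], data[half:]
    elif method == "exclusive":
        # Odd size: leave the median out of both halves.
        lower, upper = data[:half], data[half + 1:]
    else:
        # Odd size: include the median in both halves.
        lower, upper = data[:half + 1], data[half:]
    q1, q3 = median(lower), median(upper)
    return q1, q3, q3 - q1

# Hypothetical odd-numbered data set (11 values).
data = [48, 52, 57, 61, 64, 72, 76, 77, 81, 85, 88]

print(iqr(data, "exclusive"))   # Q1 = 57, Q3 = 81
print(iqr(data, "inclusive"))   # Q1 = 59, Q3 = 79
```

Statistical software often uses other quartile conventions based on interpolation, so library results may not match these hand methods exactly.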

6. Explain in detail about Typical Shapes or Explain the Types
of Frequency Distribution.

Whether expressed as a histogram, a frequency polygon, or a stem and leaf
display, an important characteristic of a frequency distribution is its shape. Typical
shapes for smoothed frequency polygons are normal, bimodal, positively skewed,
and negatively skewed.

Normal

Any distribution that approximates the normal shape in panel A of the figure can be
analysed with the aid of the well-documented normal curve.

The familiar bell-shaped silhouette of the normal curve can be superimposed on
many frequency distributions, including those for uninterrupted gestation periods of
human foetuses, scores on standardized tests, and even the popping times of
individual kernels in a batch of popcorn.

Bimodal

Any distribution that approximates the bimodal shape may reflect the coexistence
of two different types of observations in the same distribution.

For instance, the distribution of the ages of residents in a
neighbourhood consisting largely of either new parents or their
infants has a bimodal shape.

Positively Skewed
A lopsided distribution caused by a few extreme observations in the positive
direction (to the right of the majority of observations) is a positively skewed
distribution.

An example is a distribution of incomes in which relatively few incomes span a wide
range of values above $200,000.

Negatively Skewed

A lopsided distribution caused by a few extreme observations in the negative
direction (to the left of the majority of observations), as in the figure, is a
negatively skewed distribution.

The distribution of ages at retirement among U.S. job holders has a pronounced
negative skew, with most retirement ages at 60 years or older and relatively few
retirement ages spanning the wide range of ages younger than 60.

7. Explain how to construct a frequency polygon from a histogram.


A frequency polygon is almost identical to a histogram, which is used to compare
sets of data or to display a cumulative frequency distribution. It uses a line graph to
represent quantitative data.

Statistics deals with the collection of data and information for a particular
purpose. The tabulation of each run for each ball in cricket gives the statistics of the
game. Tables, graphs, pie-charts, bar graphs, histograms, polygons etc. are used to
represent statistical data pictorially.

Frequency polygons are a visually effective method of representing
quantitative data and its frequencies. Let us discuss how to draw a frequency
polygon.

Steps to Draw Frequency Polygon

To draw a frequency polygon, first draw the histogram and then follow the steps
below:

• Step 1- Choose the class intervals and mark the values on the horizontal axis.

• Step 2- Mark the mid value of each interval on the horizontal axis.

• Step 3- Mark the frequency of each class on the vertical axis.

• Step 4- Corresponding to the frequency of each class interval, mark a point at
that height above the middle of the class interval.

• Step 5- Connect these points using line segments.

• Step 6- The obtained representation is a frequency polygon.
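
The points to be joined can be computed before any drawing is done. In this sketch the function name is hypothetical, and so are the height classes and frequencies (they sum to 400 but are not the table from the example that follows). It assumes equal-width classes and closes the polygon at the midpoints of the empty classes on either side.

```python
def frequency_polygon_points(class_limits, frequencies):
    """Return (midpoint, frequency) points, closed with zero-frequency end points."""
    width = class_limits[1][0] - class_limits[0][0]             # common class width
    midpoints = [(low + high) / 2 for low, high in class_limits]  # Step 2
    points = list(zip(midpoints, frequencies))                  # Steps 3 and 4
    # Close both flanks at the midpoints of the neighbouring empty classes.
    return [(midpoints[0] - width, 0)] + points + [(midpoints[-1] + width, 0)]

# Hypothetical height classes in cm and student counts.
class_limits = [(140, 150), (150, 160), (160, 170), (170, 180)]
frequencies = [74, 163, 135, 28]

points = frequency_polygon_points(class_limits, frequencies)
for midpoint, frequency in points:
    print(midpoint, frequency)
```

Joining consecutive points with straight line segments (Step 5) in any plotting tool then gives the frequency polygon.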

Let us consider an example to understand this in a better way.

Example
In a batch of 400 students, the height of students is given in the following table.
Represent it through a frequency polygon.

Solution: The following steps are to be followed to construct a histogram from the given
data:
• The heights are represented on the horizontal axis on a suitable scale as
shown.
• The number of students is represented on the vertical axis on a suitable scale
as shown.
• Now rectangular bars are drawn, with widths equal to the class size and lengths
corresponding to the frequency of each class interval.

ABCDEF represents the given data graphically in the form of a frequency polygon.

Frequency polygons can also be drawn independently without drawing histograms.
For this, the midpoints of the class intervals, known as class marks, are used to plot the
points.
