
Discussion Assignment – Unit 3

An important practice is to check the validity of any data set that you analyze. One
goal is to detect typos in the data, and another would be to detect faulty
measurements. Recall that outliers are observations with values outside the
“normal” range of values of the rest of the observations.

Specify a large population that you might want to study and describe the type of
numeric measurement that you will collect (examples: a count of things, the height
of people, a score on a survey, the weight of something). What would you do if you
found a couple of outliers in a sample of size 100? What would you do if you found
two values that were twice as big as the next highest value?

You may use examples from your area of interest, such as monthly sales levels of a
product, file transfer times to different computers on a network, characteristics of
people (height, time to run the 100-meter dash, statistics grades, etc.), trading
volume on a stock exchange, or other such things.

There is no requirement to use sources from the Internet, but if you use an idea or a
quotation from any source, it should be cited (such as putting the author and year at
the end of the sentence and then adding a reference at the end to describe the
source).

Let us assume I am investigating "the average daily income of households" in a
neighborhood deemed to be under the national poverty line.

Measurement tools

I make sure the measurement tools are uniform and are applied to all households over the
same period of time, because household income might vary throughout the year depending on
whether it is the drought season, the rainy season, or harvest time.

Typos

Typographical errors might happen during data collection, and such errors could be related
to data types (for example, characters entered where numbers are expected). Such errors
might be fixed through inspection. If an error is difficult to correct and the number of
defective observations is small, I might remove those observations from the dataset.
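
As a minimal sketch of this kind of inspection, the Python snippet below flags entries that cannot be read as a non-negative number. The function name `find_suspect_entries` and the sample values are hypothetical and made up for illustration; they are not data from an actual survey.

```python
import math


def find_suspect_entries(raw_values):
    """Return (index, value) pairs that cannot be read as a finite, non-negative number."""
    suspects = []
    for i, raw in enumerate(raw_values):
        try:
            number = float(raw)
        except (TypeError, ValueError):
            # Characters where a number is expected, e.g. "l.2" or "2,0".
            suspects.append((i, raw))
            continue
        # An income cannot be negative, infinite, or "not a number".
        if not math.isfinite(number) or number < 0:
            suspects.append((i, raw))
    return suspects


# Hypothetical raw entries as they might be typed during data collection.
raw_income = ["1.5", "1.8", "l.2", "2,0", "1.4", "-1.6"]
suspects = find_suspect_entries(raw_income)
print(suspects)
```

The flagged entries would then be checked by hand against the original records before deciding whether to correct or remove them.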

Outliers:

I identify outliers by drawing a boxplot and flagging the data points that fall outside
its whiskers. The other important indicator is that the daily income in such a neighborhood
is expected to be below the national poverty line. If I observe an income well above that
level, I will treat it as a potential outlier. "Data points with
values that are much too large or much too small in comparison to the vast majority of the
observations will be identified as outliers" (Yakir, 2011, p. 33, para. 2). If this
happens to more than one observation, I inspect each of them and might even re-collect the
data from those particular households. If, after this check, the values are still
significantly higher or lower than the "normal" range of the other observations, I will
discard them as outliers.
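
The boxplot check described above can be sketched in Python as follows. This assumes the common convention that a point is an outlier when it lies more than 1.5 times the interquartile range beyond the first or third quartile. The function name `boxplot_outliers` and the income figures are hypothetical, chosen only to show two values far above the rest.

```python
import statistics


def boxplot_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR boxplot rule."""
    # Quartiles by linear interpolation; other quartile conventions give
    # slightly different fences.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


# Hypothetical daily incomes (made-up numbers, arbitrary currency unit).
daily_income = [1.2, 1.5, 1.1, 1.8, 1.4, 1.6, 1.3, 1.7, 1.5, 9.0, 8.5]
lower, upper, flagged = boxplot_outliers(daily_income)
print(f"fences: {lower:.2f} to {upper:.2f}, flagged: {flagged}")
```

Points flagged this way are candidates for inspection rather than automatic deletion, which matches the verification step described above.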

Reference:

Yakir, B. (2011). Introduction to statistical thinking (with R, without calculus). The Hebrew
University of Jerusalem, Department of Statistics. Retrieved April 23, 2021, from
https://my.uopeople.edu › MATH1280.IntroStat.pdf
