
Chapter-II Data Analysis

Contents:

2.1 Introduction
    2.1.1 Types of Data
2.2 Some Definitions
2.3 Frequency Distribution
    2.3.1 Graphical presentation of Frequency distribution
2.4 Measure of Central tendency
    2.4.1 Arithmetic Mean
    2.4.2 Median
    2.4.3 Mode
2.5 Measure of Dispersion
    2.5.1 Range
    2.5.2 Mean Deviation
    2.5.3 Variance and standard deviation
    2.5.4 The Coefficient of Variation



2.1 Introduction
Statistics is a branch of applied mathematics concerned with the collection and interpretation of quantitative data and the use of probability theory to estimate population parameters. Statistical methods can be used to summarize or describe a collection of data; this is called descriptive statistics.

Data: A collection of values to be used for statistical analysis. A dictionary defines data as facts or figures from which conclusions may be drawn. Data may consist of numbers, words, or images, particularly as measurements or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which information and knowledge are derived. Technically, data is a plural noun; datum is the singular form.

Data can be classified as either numeric or nonnumeric. Specific terms are used as follows:

2.1.1 Types of Data

I. Qualitative data are nonnumeric.

1. {Poor, Fair, Good, Better, Best}, colors (ignoring any physical causes), and types of material {straw, sticks, bricks} are examples of qualitative data.
2. Qualitative data are often termed categorical data.

Some books use the terms individual and variable to reference the objects and characteristics described by a set of data. They also stress the importance of exact definitions of these variables, including what units they are recorded in. The reason the data were collected is also important.

II. Quantitative data are numeric. Quantitative data are further classified as either discrete or continuous.

Discrete data are numeric data that have a finite number of possible values. A classic example of discrete data is a finite subset of the counting numbers, {1, 2, 3, 4, 5}, perhaps corresponding to a scale running from Strongly Disagree to Strongly Agree.

When data represent counts, they are discrete. An example might be how many students were absent on a given day. Counts are usually considered exact and integer.

Continuous data have infinite possibilities: 1.4, 1.41, 1.414, 1.4142, 1.41421... The real numbers are continuous with no gaps or interruptions. Physically measurable quantities of length, volume, time, mass, etc. are generally considered continuous. At the physical level (microscopically), especially for mass, this may not be true, but for normal life situations it is a valid assumption.

Data analysis is a process of gathering, modeling, and transforming data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains.

2.2 Some Definitions


Raw Data: Data collected in original form.

Frequency: The number of times a certain value or class of values occurs.

Frequency Distribution: The organization of raw data in table form with classes and frequencies.

Categorical Frequency Distribution: A frequency distribution in which the data are only nominal or ordinal.

Ungrouped Frequency Distribution: A frequency distribution of numerical data in which the raw data are not grouped.

Grouped Frequency Distribution: A frequency distribution in which several numbers are grouped into one class.

Class Limits: Separate one class in a grouped frequency distribution from another. The limits could actually appear in the data, and there are gaps between the upper limit of one class and the lower limit of the next.

Class Boundaries: Separate one class in a grouped frequency distribution from another. The boundaries have one more decimal place than the raw data and therefore do not appear in the data. There is no gap between the upper boundary of one class and the lower boundary of the next class. The lower class boundary is found by subtracting 0.5 units from the lower class limit, and the upper class boundary is found by adding 0.5 units to the upper class limit.

Class Width: The difference between the upper and lower boundaries of any class. The class width is also the difference between the lower limits of two consecutive classes or the upper limits of two consecutive classes. It is not the difference between the upper and lower limits of the same class.

Class Mark (Midpoint): The number in the middle of the class. It is found by adding the upper and lower limits and dividing by two. It can also be found by adding the upper and lower boundaries and dividing by two.

Cumulative Frequency: The number of values less than the upper class boundary for the current class. This is a running total of the frequencies.

Relative Frequency: The frequency divided by the total frequency. This gives the proportion of values falling in that class (multiply by 100 for a percentage).

Cumulative Relative Frequency (Relative Cumulative Frequency): The running total of the relative frequencies, or equivalently the cumulative frequency divided by the total frequency. It gives the proportion of the values that are less than the upper class boundary.
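The definitions above can be tied together in a short Python sketch. The raw data and the class limits below are hypothetical, chosen only for illustration, and the data are assumed to be whole numbers so that the boundaries lie 0.5 units outside the limits.

```python
# Hypothetical whole-number raw data and class limits (illustration only).
raw_data = [12, 15, 17, 21, 22, 22, 25, 28, 31, 34, 35, 38]
class_limits = [(10, 19), (20, 29), (30, 39)]  # (lower limit, upper limit)

total = len(raw_data)
table = []
cumulative = 0
for lower, upper in class_limits:
    # Frequency: how many raw values fall inside the class limits
    frequency = sum(lower <= x <= upper for x in raw_data)
    cumulative += frequency  # running total of the frequencies
    table.append({
        "limits": (lower, upper),
        "boundaries": (lower - 0.5, upper + 0.5),
        "width": (upper + 0.5) - (lower - 0.5),
        "mark": (lower + upper) / 2,
        "frequency": frequency,
        "cumulative": cumulative,
        "relative": frequency / total,
        "cumulative_relative": cumulative / total,
    })

for row in table:
    print(row)
```

Note that the class width comes out as 10, the difference between consecutive lower limits, and not 9, the difference between the limits of a single class.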

2.3 Frequency Distribution


The distribution of empirical data is called a frequency distribution and consists of a count of the number of occurrences of each value. If the data are continuous, then a grouped frequency distribution is used. Typically, a distribution is portrayed using a frequency polygon or a histogram. Distributions are also often described by mathematical functions. The normal distribution is, perhaps, the best known example. Many empirical distributions are approximated well by mathematical distributions such as the normal distribution.

Grouped Frequency Distribution

A grouped frequency distribution is a frequency distribution in which frequencies are displayed for ranges of data rather than for individual values. For example, the distribution of heights might be calculated by defining one-inch ranges. The frequency of individuals with various heights rounded off to the nearest inch would then be tabulated.

2.3.1 Graphical presentation of Frequency distribution:


Histogram

A histogram is a graphical display of tabulated frequencies. It is the graphical version of a table that shows what proportion of cases fall into each of several or many specified categories.

[Figure: Example of a histogram of 100 values]

Advantages: visually strong; can be compared to a normal curve; the vertical axis is usually a frequency count of items falling into each category.

Disadvantages: exact values cannot be read because the data are grouped into categories; it is more difficult to compare two data sets; use only with continuous data.

Frequency Polygons

Frequency polygons are a graphical device for understanding the shapes of distributions. They serve the same purpose as histograms, but are especially helpful in comparing sets of data. Frequency polygons are also a good choice for displaying cumulative frequency distributions.

To create a frequency polygon, start just as for histograms, by choosing a class interval. Then draw an X-axis representing the values of the scores in your data. Mark the middle of each class interval with a tick mark, and label it with the middle value represented by the class. Draw the Y-axis to indicate the frequency of each class. Place a point in the middle of each class interval at the height corresponding to its frequency. Finally, connect the points. You should include one class interval below the lowest value in your data and one above the highest value. The graph will then touch the X-axis on both sides.
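The steps above can be sketched in Python as a function that computes the points to be connected. The function name `frequency_polygon_points`, its parameters, and the scores are hypothetical; the sketch assumes equal-width classes that start at `start`, and it only produces the coordinates, leaving the drawing to any plotting tool.

```python
def frequency_polygon_points(data, start, width, n_classes):
    """Return (midpoint, frequency) pairs, padded with one empty class on each side."""
    counts = [0] * n_classes
    for x in data:
        index = int((x - start) // width)  # which class interval x falls into
        if 0 <= index < n_classes:
            counts[index] += 1

    # Extra class below the lowest interval, so the polygon touches the X-axis
    points = [(start - width / 2, 0)]
    for i, count in enumerate(counts):
        points.append((start + (i + 0.5) * width, count))  # middle of each class
    # Extra class above the highest interval
    points.append((start + (n_classes + 0.5) * width, 0))
    return points


# Hypothetical scores, grouped into classes 50-60, 60-70, 70-80, 80-90
scores = [52, 55, 61, 64, 67, 68, 71, 73, 75, 78, 82, 88]
points = frequency_polygon_points(scores, start=50, width=10, n_classes=4)
print(points)
```

Connecting the returned points in order gives the polygon; the first and last points have frequency zero, which is what anchors the graph to the X-axis.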

Advantages: visually appealing; can be compared to a normal curve; can be used to compare two data sets.

Disadvantages: anchors at both ends may imply zero as data points; use only with continuous data.

Frequency Curve

A frequency curve is a smooth curve that corresponds to the limiting case of a histogram computed for a frequency distribution of a continuous variable as the number of data points becomes very large.

Advantages: visually appealing.

Disadvantages: anchors at both ends may imply zero as data points; use only with continuous data.

2.4 Measure of Central tendency


Central tendency is the center or middle of a distribution. There are many measures of central tendency. The most common are the mean, median and mode. The center of a distribution could be defined in three ways: 1. the point on which a distribution would balance, 2. the value whose average absolute deviation from all the other values is minimized, and 3. the value whose squared difference from all the other values is minimized. It turns out that the mean is the point on which a distribution would balance, the median is the value that minimizes the sum of absolute deviations, and the mean is the value that minimizes the sum of the squared deviations.
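The two minimization properties can be checked numerically. The following Python sketch is only an illustration on a small assumed data set: it tries candidate centers on a grid of step 0.1 and compares the best candidates with the median and the mean. The helper names `sum_abs_dev` and `sum_sq_dev` are hypothetical.

```python
import statistics

data = [2, 2, 3, 4, 14]  # small data set assumed for illustration


def sum_abs_dev(c):
    """Sum of absolute deviations of the data from a candidate center c."""
    return sum(abs(x - c) for x in data)


def sum_sq_dev(c):
    """Sum of squared deviations of the data from a candidate center c."""
    return sum((x - c) ** 2 for x in data)


# Candidate centers from 0.0 to 15.0 in steps of 0.1
candidates = [i / 10 for i in range(0, 151)]
best_abs = min(candidates, key=sum_abs_dev)
best_sq = min(candidates, key=sum_sq_dev)

print("minimizes absolute deviations:", best_abs, "median:", statistics.median(data))
print("minimizes squared deviations:", best_sq, "mean:", statistics.mean(data))
```

A grid search is a crude check rather than a proof, but it shows the two criteria picking out different centers for a skewed data set.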

2.4.1 Arithmetic Mean


The arithmetic mean is the most common measure of central tendency. For a data set, the mean is the sum of the observations divided by the number of observations. Basically, the mean describes the central location of the data. For a given set of data, where the observations are x_1, x_2, ..., x_n, the arithmetic mean is defined as:

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

The weighted arithmetic mean is used if one wants to combine average values from samples of the same population with different sample sizes. With weights w_1, w_2, ..., w_n it is:

\[ \bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]

Example 1: Find the mean.

Observations (x_i)   Weights (w_i)   x_i w_i
12                   2               24
15                   5               75
20                   7               140
22                   6               132
30                   1               30
Total                21              401

Mean = 401/21 = 19.10

Advantages: can be specified using an equation, and therefore can be manipulated algebraically; is the most sufficient of the three estimators; is the most efficient of the three estimators; is unbiased.

Disadvantages: is very sensitive to extreme scores (i.e., low resistance); its value is unlikely to be one of the actual data points; requires an interval scale.

Is there anything else about the distribution that we would want to convey to someone if we were describing it to them?
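The weighted mean of Example 1 can be reproduced with a few lines of Python. This is a minimal sketch using the observations and weights from the example; the variable names are illustrative.

```python
observations = [12, 15, 20, 22, 30]
weights = [2, 5, 7, 6, 1]

# Sum of each observation multiplied by its weight
weighted_sum = sum(x * w for x, w in zip(observations, weights))
total_weight = sum(weights)
weighted_mean = weighted_sum / total_weight

print(weighted_sum, total_weight, round(weighted_mean, 2))
```

The products sum to 401 and the weights to 21, giving a weighted mean of about 19.10.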

2.4.2 Median
The median of a finite list of numbers can be found by arranging all the observations from lowest value to highest value and picking the middle one. If there is an even number of observations, there is no single middle value, so one usually takes the mean of the two middle values.

For an odd number of observations: Median = the ((n+1)/2)th observation.
For an even number of observations: Median = the average of the (n/2)th and (n/2 + 1)th observations.

Here are some sample test scores: 100, 100, 99, 98, 92, 91, 91, 90, 88, 87, 87, 85, 85, 85, 80, 79, 76, 72, 67, 66, 45

The "middle" score of this group can easily be seen to be 87. Why? Exactly half of the scores lie above 87 and half lie below it. Thus, 87 is in the middle of this set of scores. This score is known as the median. In this example, there are 21 scores. The eleventh score in the ordered set is the median score (87), because ten scores are on either side of it. If there were an even number of scores, say 20, the median would fall halfway between the tenth and eleventh scores in the ordered set. We would find it by adding the two scores (the tenth and eleventh scores) together and dividing by two.

Advantages: is unbiased; is unaffected by extreme scores (i.e., high resistance); doesn't require the use of an interval scale, since as long as you can order the scores along some continuum you can find the median.

Disadvantages: cannot be specified using an equation, so it can't be manipulated algebraically; is the least sufficient of the three estimators; is less efficient than the mean.
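The odd and even rules for the median can be sketched in Python as follows. The function below is a minimal illustration (Python's standard library also provides `statistics.median`), applied to the 21 test scores listed above.

```python
def median(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]  # the ((n+1)/2)th observation, zero-based index n // 2
    # average of the (n/2)th and (n/2 + 1)th observations
    return (ordered[middle - 1] + ordered[middle]) / 2


scores = [100, 100, 99, 98, 92, 91, 91, 90, 88, 87, 87,
          85, 85, 85, 80, 79, 76, 72, 67, 66, 45]

print(median(scores))        # 21 scores: the eleventh ordered score
print(median([1, 2, 3, 4]))  # even count: halfway between 2 and 3
```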

2.4.3 Mode
The mode is the most frequently occurring value, that is, the most common value in a distribution. For example, the mode of 3, 4, 4, 5, 5, 5, 8 is 5. Note that the mode may be very different from the mean and the median.

With continuous data such as response time measured to many decimals, the frequency of each value is one since no two scores will be exactly the same. Therefore the mode of continuous data is normally computed from a grouped frequency distribution. Table 3 shows a grouped frequency distribution for the target response time data. Since the interval with the highest frequency is 600-700, the mode is the middle of that interval (650).

Range        Frequency
500-600      3
600-700      6
700-800      5
800-900      5
900-1000     0
1000-1100    1

Table 3: Grouped frequency distribution

Advantages: represents a number that actually occurred in the data; represents the largest number of scores, so the probability of getting that score is greater than the probability of getting any of the other scores if an observation is chosen at random; is unaffected by extreme scores (i.e., high resistance); is unbiased; doesn't require an interval scale.

Disadvantages: the mode depends on how we group the data; cannot be specified using an equation, so it can't be manipulated algebraically; is less sufficient than the mean; is less efficient than the mean.
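Both ways of finding the mode described in this section can be sketched in Python. The first part uses the list 3, 4, 4, 5, 5, 5, 8; the second uses the grouped frequencies of Table 3, taking the midpoint of the most frequent interval. The variable names are illustrative.

```python
from collections import Counter

# Mode of ungrouped data: the most frequently occurring value
values = [3, 4, 4, 5, 5, 5, 8]
mode_value, mode_count = Counter(values).most_common(1)[0]

# Mode of grouped data: midpoint of the interval with the highest frequency
grouped = {
    (500, 600): 3,
    (600, 700): 6,
    (700, 800): 5,
    (800, 900): 5,
    (900, 1000): 0,
    (1000, 1100): 1,
}
modal_class = max(grouped, key=grouped.get)
grouped_mode = sum(modal_class) / 2

print(mode_value, mode_count)
print(modal_class, grouped_mode)
```

Changing the interval width would change `grouped`, and possibly the modal class, which illustrates why the grouped mode depends on how the data are grouped.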

2.5 Measure of Dispersion


Measures of dispersion provide us with a summary of how much the points in our data set vary, e.g. how spread out they are or how volatile they are. In measuring dispersion, it is necessary to know the amount of variation and the degree of variation. The former is designated as absolute measures of dispersion and is expressed in the units of the original observations, while the latter is designated as relative measures of dispersion.

Absolute measures can be divided into positional measures based on some items of the series, such as (i) range and (ii) quartile deviation or semi-interquartile range, and those which are based on all items in the series, such as (i) mean deviation and (ii) standard deviation. The relative measures in each of the above cases are called the coefficients of the respective measures. For purposes of comparison between two or more series with varying size or number of items, varying central values or units of calculation, only relative measures can be used.

The following are the important methods of studying variation: 1. Range 2. Mean deviation 3. Standard deviation and variance (which is closely related to standard deviation) 4. The coefficient of variation

2.5.1 Range
Range is the simplest of the summary measures of variation. It is also the crudest and most prone to error. It is computed as the difference between the largest value (H) and the smallest value (L) in a data set: Range = H - L

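Absolute range = H - L

Relative range (coefficient of range):

\[ \text{Coefficient of range} = \frac{\text{Absolute range}}{\text{Sum of the two extremes}} = \frac{H - L}{H + L} \]

For example, for the data set {2, 2, 3, 4, 14}:

Range = 14 - 2 = 12

\[ \text{Coefficient of range} = \frac{14 - 2}{14 + 2} = \frac{12}{16} = 0.75 \]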

Example: You are given the following data: 3, 6, 9, 11. Compute the sample range.

Solution: H = 11, L = 3, so range = H - L = 11 - 3 = 8
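The range and the coefficient of range are simple to compute. The following Python sketch, with illustrative function names, reproduces the two examples of this section.

```python
def value_range(data):
    """Absolute range: largest value minus smallest value (H - L)."""
    return max(data) - min(data)


def coefficient_of_range(data):
    """Relative range: (H - L) / (H + L)."""
    high, low = max(data), min(data)
    return (high - low) / (high + low)


print(value_range([2, 2, 3, 4, 14]), coefficient_of_range([2, 2, 3, 4, 14]))
print(value_range([3, 6, 9, 11]))
```

The coefficient is undefined when H + L is zero, a case this sketch does not handle.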

2.5.2 Mean Deviation


Mean deviation can be calculated from any measure of central tendency, viz. mean, median or mode. Accordingly, mean deviation can be of the following types: mean deviation about the mean, mean deviation about the median, and mean deviation about the mode.

\[ \text{Mean Deviation about Mean} = \frac{1}{n} \sum_{i=1}^{n} |x_i - \bar{x}| \]

Properties of Mean Deviation about Mean: The average absolute deviation from the mean is less than or equal to the standard deviation. The mean of the signed deviations of any data set from its mean is always zero, which is why absolute values are taken. The mean absolute deviation is the average absolute deviation from the mean and is a common measure of forecast error in time series analysis.

For example, for the data set {2, 2, 3, 4, 14}:

Measure of central tendency: Mean = 5

\[ \text{Absolute deviation} = \frac{|2 - 5| + |2 - 5| + |3 - 5| + |4 - 5| + |14 - 5|}{5} = \frac{18}{5} = 3.6 \]
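The example can be reproduced with a short Python sketch. The function `mean_deviation` is an illustrative name; it takes the chosen center as an argument, so the same code covers mean deviation about the mean and about the median.

```python
import statistics


def mean_deviation(data, center):
    """Average absolute deviation of the data from the given center."""
    return sum(abs(x - center) for x in data) / len(data)


data = [2, 2, 3, 4, 14]
md_about_mean = mean_deviation(data, statistics.mean(data))
md_about_median = mean_deviation(data, statistics.median(data))

# Without absolute values the deviations from the mean cancel out
mean_signed_deviation = sum(x - statistics.mean(data) for x in data) / len(data)

print(md_about_mean, md_about_median, mean_signed_deviation)
```

For this data set the mean deviation about the mean (3.6) is also smaller than the population standard deviation returned by `statistics.pstdev(data)`, in line with the first property listed above.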

2.5.3 Variance and standard deviation


Variance and standard deviation are the most common of all the measures of variation. Variance is a measure of statistical dispersion, indicating how the values of a data set are spread around the mean. Thus, variance indicates the variability of the values. A smaller value implies a smaller variation from the mean. For n observations it is computed as:

\[ \text{Variance} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

The positive square root of Variance is called the Standard Deviation.

Let us consider an example:

Values (x_i)   x_i - mean   (x_i - mean)^2
4              -1           1
6              1            1
5              0            0
5              0            0
Total = 20                  2

Mean = 20/4 = 5

Variance = 2/4 = 1/2

\[ \text{S.D.} = \sqrt{1/2} \approx 0.707 \]
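The worked example can be checked with the following Python sketch. It assumes, as the example does, that the sum of squared deviations is divided by the number of values n (the population variance); a sample variance would divide by n - 1 instead, which is what `statistics.variance` computes.

```python
import math
import statistics

values = [4, 6, 5, 5]

mean = sum(values) / len(values)
squared_deviations = [(x - mean) ** 2 for x in values]

# Population variance: divide by n, as in the worked example
variance = sum(squared_deviations) / len(values)
std_dev = math.sqrt(variance)  # positive square root of the variance

print(mean, squared_deviations, variance, round(std_dev, 3))
print(statistics.pvariance(values))  # standard-library equivalent (divides by n)
```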

2.5.4 The Coefficient of Variation

The Coefficient of Variation is a measure of variation in which the standard deviation S is expressed as a percentage of the sample mean:

\[ CV = \frac{S}{\bar{x}} \times 100 \]
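A minimal Python sketch of the coefficient of variation follows. It assumes that S is the sample standard deviation (dividing by n - 1, as `statistics.stdev` does); if the population standard deviation is intended, `statistics.pstdev` would be used instead. The function name and the data are illustrative.

```python
import statistics


def coefficient_of_variation(data):
    """CV as a percentage: sample standard deviation over the mean, times 100."""
    return statistics.stdev(data) / statistics.mean(data) * 100


values = [4, 6, 5, 5]  # illustrative data
cv = coefficient_of_variation(values)
print(round(cv, 2))
```

Because the CV is a ratio, it has no units, which is what makes it suitable for comparing series measured in different units or with very different means.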
