1. Ratings such as {Poor, Fair, Good, Better, Best}, colors (ignoring any physical causes), and types of material {straw, sticks, bricks} are examples of qualitative data.
2. Qualitative data are often termed categorical data.

Some books use the terms individual and variable to refer to the objects and characteristics described by a set of data. They also stress the importance of exact definitions of these variables, including the units in which they are recorded. The reason the data were collected is also important.

II. Quantitative data are numeric. Quantitative data are further classified as either discrete or continuous. Discrete data are numeric data that have a finite (or countably infinite) number of possible values. A classic example of discrete data is a finite subset of the counting numbers, {1, 2, 3, 4, 5}, perhaps corresponding to {Strongly Disagree, ..., Strongly Agree}.
When data represent counts, they are discrete. An example might be how many students were absent on a given day. Counts are usually considered exact and integer. Continuous data have infinitely many possible values: 1.4, 1.41, 1.414, 1.4142, 1.41421... The real numbers are continuous, with no gaps or interruptions. Physically measurable quantities such as length, volume, time, and mass are generally considered continuous. At the physical (microscopic) level, especially for mass, this may not be true, but for everyday situations it is a valid assumption.

Data analysis is a process of gathering, modeling, and transforming data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains.
approximated well by mathematical distributions such as the normal distribution.

Grouped Frequency Distribution

A grouped frequency distribution is a frequency distribution in which frequencies are displayed for ranges of data rather than for individual values. For example, the distribution of heights might be calculated by defining one-inch ranges. The frequency of individuals with various heights, rounded off to the nearest inch, would then be tabulated.
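The tabulation described above can be sketched in a few lines. This is a minimal illustration, not the notes' own procedure: the function name `grouped_frequencies`, the sample heights, and the choice of half-open classes [lower, lower + width) rather than rounding to the nearest inch are all assumptions made for the example.

```python
from collections import Counter

def grouped_frequencies(values, width, start):
    """Count how many values fall into each class [lower, lower + width)."""
    counts = Counter()
    for v in values:
        index = int((v - start) // width)   # which class the value lands in
        lower = start + index * width       # lower limit of that class
        counts[lower] += 1
    return dict(sorted(counts.items()))

# Hypothetical heights in inches (illustrative only, not from the text).
heights = [61.2, 61.8, 62.5, 63.1, 63.4, 63.9, 64.0, 65.7]
for lower, freq in grouped_frequencies(heights, width=1, start=61).items():
    print(f"{lower}-{lower + 1}: {freq}")
```

Classes that contain no values are simply absent from the result; a full table would list them with a frequency of zero.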
Advantages
- Visually strong
- Can compare to normal curve
- Usually vertical axis is a frequency count of items falling into each category

Disadvantages
- Cannot read exact values because data is grouped into categories
- More difficult to compare two data sets
- Use only with continuous data
Frequency Polygons

Frequency polygons are a graphical device for understanding the shapes of distributions. They serve the same purpose as histograms, but are especially helpful in comparing sets of data. Frequency polygons are also a good choice for displaying cumulative frequency distributions.
To create a frequency polygon, start just as for histograms, by choosing a class interval. Then draw an X-axis representing the values of the scores in your data. Mark the middle of each class interval with a tick mark, and label it with the middle value represented by the class. Draw the Y-axis to indicate the frequency of each class. Place a point in the middle of each class interval at the height corresponding to its frequency. Finally, connect the points. You should include one class interval below the lowest value in your data and one above the highest value. The graph will then touch the X-axis on both sides.
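The construction steps above amount to computing class midpoints and padding the series with one empty class on each side. The sketch below only produces the points to plot; the function name `frequency_polygon_points` and the sample classes are hypothetical, and equal-width classes are assumed.

```python
def frequency_polygon_points(lower_limits, frequencies, width):
    """Return (midpoint, frequency) pairs, padded with one empty class per side."""
    midpoints = [lower + width / 2 for lower in lower_limits]
    # One extra class below the lowest and above the highest, each with frequency 0,
    # so that the polygon touches the X-axis at both ends.
    xs = [midpoints[0] - width] + midpoints + [midpoints[-1] + width]
    ys = [0] + list(frequencies) + [0]
    return list(zip(xs, ys))

# Hypothetical classes 500-600, 600-700, 700-800 with illustrative frequencies.
points = frequency_polygon_points([500, 600, 700], [3, 6, 5], width=100)
for x, y in points:
    print(x, y)
```

Connecting the returned points in order with straight line segments gives the polygon.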
Advantages
- Visually appealing
- Can compare to normal curve
- Can compare two data sets

Disadvantages
- Anchors at both ends may imply zero as data points
- Use only with continuous data
Frequency Curve

A frequency curve is a smooth curve that corresponds to the limiting case of a histogram computed for a frequency distribution of a continuous distribution as the number of data points becomes very large.

Disadvantages
- Anchors at both ends may imply zero as data points
- Use only with continuous data
The weighted arithmetic mean is used if one wants to combine average values from samples of the same population with different sample sizes:

Weighted mean = (sum of xi*wi) / (sum of wi)

Find the mean.

Observations (xi)   Weights (wi)   xi*wi
12                  2              24
15                  5              75
20                  7              140
22                  6              132
30                  1              30
Total               21             401

Mean = 401/21 = 19.10

Advantages
- can be specified using an equation, and therefore can be manipulated algebraically
- is the most sufficient of the three estimators
- is the most efficient of the three estimators
- is unbiased

Disadvantages
- is very sensitive to extreme scores (i.e., low resistance)
- value is unlikely to be one of the actual data points
- requires an interval scale

anything else about the distribution that we'd want to convey to someone if we were describing it to them?
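As a check on the weighted mean example, the calculation can be reproduced with a short function. The name `weighted_mean` is an assumption for illustration; the observations and weights are the ones given in the worked example.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(x_i * w_i) / sum(w_i)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(x * w for x, w in zip(values, weights)) / total_weight

observations = [12, 15, 20, 22, 30]
weights = [2, 5, 7, 6, 1]

# The products x_i * w_i sum to 401 and the weights sum to 21.
print(round(weighted_mean(observations, weights), 2))
```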
2.4.2 Median
The median of a finite list of numbers can be found by arranging all the observations from lowest value to highest value and picking the middle one. If there is an even number of observations, the median is not unique, so one often takes the mean of the two middle values.

For an odd number of observations: Median = the ((n + 1)/2)th observation.
For an even number of observations: Median = the average of the (n/2)th and (n/2 + 1)th observations.

Here are the sample test scores you have seen so often: 100, 100, 99, 98, 92, 91, 91, 90, 88, 87, 87, 85, 85, 85, 80, 79, 76, 72, 67, 66, 45

The "middle" score of this group can easily be seen to be 87. Why? There are 21 scores, so the eleventh score in the ordered set is the median score (87), because ten scores are on either side of it. If there were an even number of scores, say 20, the median would fall halfway between the tenth and eleventh scores in the ordered set. We would find it by adding the two scores (the tenth and eleventh scores) together and dividing by two.

Advantages
- is unbiased
- is unaffected by extreme scores (i.e., high resistance)
- doesn't require the use of an interval scale; as long as you can order the scores along some continuum, you can find the median

Disadvantages
- cannot be specified using an equation, so can't be manipulated algebraically
- is the least sufficient of the three estimators
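The odd and even rules for the median can be written as one small function. This is a sketch under the usual convention of averaging the two middle values for an even count; the name `median` is chosen for the example (Python's standard library offers `statistics.median` with the same convention).

```python
def median(scores):
    """Middle value of the ordered scores; mean of the two middle values if n is even."""
    ordered = sorted(scores)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty list is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # the ((n + 1)/2)th observation
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of the two middle ones

# The 21 sample test scores from the text.
scores = [100, 100, 99, 98, 92, 91, 91, 90, 88, 87, 87,
          85, 85, 85, 80, 79, 76, 72, 67, 66, 45]
print(median(scores))
```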
2.4.3 Mode
The mode is the most frequently occurring value, that is, the most common value in a distribution. For example, the mode of 3, 4, 4, 5, 5, 5, 8 is 5. Note that the mode may be very different from the mean and the median.

With continuous data such as response time measured to many decimals, the frequency of each value is one, since no two scores will be exactly the same. Therefore the mode of continuous data is normally computed from a grouped frequency distribution. Table 3 shows a grouped frequency distribution for the target response time data. Since the interval with the highest frequency is 600-700, the mode is the middle of that interval (650).
Range        Frequency
500-600      3
600-700      6
700-800      5
800-900      5
900-1000     0
1000-1100    1

Table 3: Grouped frequency distribution

Advantages
- represents a number that actually occurred in the data
- represents the largest number of scores, so the probability of getting that score is greater than the probability of getting any of the other scores if an observation is chosen at random
- is unaffected by extreme scores (i.e., high resistance)
- is unbiased
- doesn't require an interval scale

Disadvantages
- the mode depends on how we group the data
- cannot be specified using an equation, so can't be manipulated algebraically
- is less sufficient than the mean
- is less efficient than the mean
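Both ways of finding the mode can be sketched briefly. The function names are hypothetical. The grouped table below follows Table 3 as far as it can be read from the text; the intermediate class limits are assumed to be consecutive 100-unit intervals.

```python
from collections import Counter

def mode(values):
    """Most frequently occurring value (first one found if there is a tie)."""
    return Counter(values).most_common(1)[0][0]

def grouped_mode(table):
    """Midpoint of the class with the highest frequency.

    `table` maps (lower, upper) class limits to frequencies.
    """
    (lower, upper), _ = max(table.items(), key=lambda item: item[1])
    return (lower + upper) / 2

print(mode([3, 4, 4, 5, 5, 5, 8]))

# Assumed reading of Table 3 (class limits between 600 and 1000 are inferred).
response_times = {
    (500, 600): 3, (600, 700): 6, (700, 800): 5,
    (800, 900): 5, (900, 1000): 0, (1000, 1100): 1,
}
print(grouped_mode(response_times))
```

Changing the class width or starting point can move the modal class, which is the grouping dependence listed among the disadvantages.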
2.5.1 Range
Range is the simplest of the summary measures of variation. It is also the crudest and the most prone to error. It is computed as the difference between the largest value (H) and the smallest value (L) in a data set:

Range = H - L

Coefficient of range (relative range) = Absolute range / Sum of the two extremes = (H - L) / (H + L)

For example, for the data set {2, 2, 3, 4, 14}:

Range = 14 - 2 = 12
Coefficient of range = (14 - 2) / (14 + 2) = 12/16 = 0.75
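The range and the coefficient of range, (H - L) / (H + L), can be checked against the example data set with the sketch below. The function names are illustrative assumptions.

```python
def value_range(data):
    """Absolute range: largest value minus smallest value."""
    return max(data) - min(data)

def coefficient_of_range(data):
    """Relative range: (H - L) / (H + L). Undefined when H + L is zero."""
    high, low = max(data), min(data)
    if high + low == 0:
        raise ValueError("coefficient of range is undefined when H + L = 0")
    return (high - low) / (high + low)

data = [2, 2, 3, 4, 14]
print(value_range(data))           # 14 - 2
print(coefficient_of_range(data))  # 12 / 16
```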
Properties of Mean Deviation about Mean:
- The average absolute deviation from the mean is less than or equal to the standard deviation.
- The mean of the signed deviations of any data set from its mean is always zero.
- The mean absolute deviation is the average absolute deviation from the mean and is a common measure of
For example, for the data set {2, 2, 3, 4, 14}:

Measure of central tendency: Mean = 5
Mean absolute deviation = (|2 - 5| + |2 - 5| + |3 - 5| + |4 - 5| + |14 - 5|) / 5 = 18/5 = 3.6
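The mean absolute deviation for this data set can be reproduced as follows. The sketch also shows that the signed deviations from the mean cancel out, which is why absolute values are taken. The function name is an assumption for illustration.

```python
def mean_absolute_deviation(data):
    """Average of the absolute deviations from the arithmetic mean."""
    mean = sum(data) / len(data)
    return sum(abs(x - mean) for x in data) / len(data)

data = [2, 2, 3, 4, 14]
mean = sum(data) / len(data)

print(sum(x - mean for x in data))    # signed deviations sum to zero
print(mean_absolute_deviation(data))  # (3 + 3 + 2 + 1 + 9) / 5
```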
Let us consider an example:

Values (Xi)   Xi - Mean   (Xi - Mean)^2
4             -1          1
6              1          1
5              0          0
5              0          0
Total = 20                2

Mean = 20/4 = 5
Variance = 2/4 = 1/2
S.D = sqrt(1/2), which is about 0.71
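The variance example can be verified with a short sketch. It assumes the population form of the variance (dividing by n rather than n - 1), since that is what yields the value 1/2 stated in the example; the function name is hypothetical.

```python
import math

def population_variance(data):
    """Mean of the squared deviations from the mean (divides by n)."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / len(data)

values = [4, 6, 5, 5]
variance = population_variance(values)
std_dev = math.sqrt(variance)

print(variance)           # squared deviations 1 + 1 + 0 + 0 = 2, divided by 4
print(round(std_dev, 4))  # square root of the variance
```

The standard library's `statistics.pvariance` and `statistics.pstdev` compute the same population quantities.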
The Coefficient of Variation is a measure of variation expressed as a percentage of the sample mean: CV = (S / Xmean) × 100
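A minimal sketch of the CV calculation follows. It assumes that S denotes the sample standard deviation (n - 1 in the denominator), which is what `statistics.stdev` computes; the function name and the reuse of the values 4, 6, 5, 5 are choices made for illustration.

```python
import statistics

def coefficient_of_variation(data):
    """Sample standard deviation as a percentage of the sample mean."""
    mean = statistics.mean(data)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(data) / mean * 100

values = [4, 6, 5, 5]
print(round(coefficient_of_variation(values), 2))
```

Because the CV is unit-free, it allows the spread of data sets measured on different scales to be compared.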