Types of Variables
Quantitative: measures a value. Answers "How much?" or "How many?"
   o Ratio: measured on a scale with a true zero, so ratios of its values are meaningful. Ex: salary, height, weight, time, and distance (twice as far, 25% faster).
   o Interval: there is no inherently defined zero value, so ratios are not meaningful. Ex: temperature in Fahrenheit or Celsius, because a negative temperature is a valid measurement, and a ratio of temperatures wouldn't really make sense.
Qualitative/categorical: answers fall into a group; there is a set number of predefined choices. Ex: gender, type of something, or yes/no answers.
   o Ordinal: there is a meaningful ordering of the categories. Ex: rating a professor on a scale of 0-4.
   o Nominative: there is NO meaningful ordering. Ex: gender, car color.

Types of Data
Cross-sectional: collected at the same point in time.
   o Ex: a list of people's names and their utility bill amounts this month.
Time series: collected over a period of time.
   o Ex: the population of Dallas over the course of 10 years.
   o The graph is called a time series plot, or a run plot.

Studies
Parts of a study
   o Response variable: the variable of interest in a study.
   o Factors: other variables which may or may not influence the outcome of the study.
Types of studies
   o Experimental: the researcher controls one or more factors.
   o Observational: the researcher DOES NOT MESS WITH the factors.

Collecting data
Census: examine the entire population of interest. This has no sampling error, but the population is usually unreasonably large to examine in full.
Survey: a select group of people are asked a series of questions; we infer about the population from their responses.
   o Survey methods, in order of most to least desirable:
      Simple random sample (SRS): all subjects have an equal probability of being selected.
      Systematic sample: the target population is arranged in a certain order, then every kth subject is selected, starting from a randomly chosen position. Ex: a list of student names is sorted in alphabetical order, each name is assigned a number in that order, a starting number is drawn at random, and every 10th name from there is selected.
      Stratified sample: the population is broken into strata, then random selection occurs within each stratum. Ex: separate by gender, then randomly select with a coin toss within each gender. Three conditions must be met for this to work:
         o Variability within each stratum must be minimized.
         o Variability between strata must be maximized.
         o The variable(s) that determined how you stratified must correlate with the outcome. You'd stratify by gender if you wanted to compare men's v. women's reactions to a new drug.
      Cluster sample: samples are taken from one or more clusters, which are usually defined by geographic location or time. This is where you start to lose the random factor, because of the limit you are imposing upon your sample population.
      Convenience sample: this is awful. An example would be grabbing the first 20 kids out of the front doors of a school and asking them questions. The results won't be applicable to any greater population.
Sample: a subset of the elements of a population.
Statistical inference: you are trying to draw a valid conclusion about a whole population from a smaller sample, which is why we really want randomness.
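The first three survey methods above can be sketched in a few lines of Python. This is only an illustration: the roster, the function names, and the even/odd "strata" are made up, and the systematic sample uses the common "every kth subject after a random start" definition.

```python
import random

def simple_random_sample(population, k, rng):
    """Every subject has the same chance of being selected."""
    return rng.sample(population, k)

def systematic_sample(population, step, rng):
    """Take every `step`-th subject after a random starting position."""
    start = rng.randrange(step)
    return population[start::step]

def stratified_sample(population, stratum_of, k_per_stratum, rng):
    """Split into strata with `stratum_of`, then sample randomly inside each."""
    strata = {}
    for subject in population:
        strata.setdefault(stratum_of(subject), []).append(subject)
    return {name: rng.sample(members, k_per_stratum)
            for name, members in strata.items()}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
students = [f"student_{i:02d}" for i in range(40)]  # hypothetical roster

srs = simple_random_sample(students, 5, rng)
systematic = systematic_sample(students, 10, rng)
# Hypothetical strata: even vs. odd roster number stands in for a real grouping.
by_parity = stratified_sample(students, lambda s: int(s[-2:]) % 2, 3, rng)
```

With 40 students and a step of 10, the systematic sample always has 4 names; which 4 depends only on the random start.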

Graphic representation
Frequency: the number of times an answer occurs.
Frequency distribution: a table summarizing the frequency in non-overlapping classes.
Relative frequency: summarizes the proportion (fraction, percent) of items in each class.
   o How? (frequency of class) / n, where n is the total number of observations.
   o Relative frequency distribution: a table that lists the RF for each class.
Percent frequency: multiply RF by 100.
   o Percent frequency distribution: lists the percent frequency for each class.
Cumulative frequency: the CF for a value x is the total number of observations less than or equal to x.
   o A bar chart showing a CF will look like steps, and the highest/last step to the right should be the tallest.
Cumulative relative frequency: CF / n, where n is the total # of observations.
   o Will show you the percentage of people with, for example, earnings less than or equal to 20,000 a year on the first bar; 40,000/year on the second, etc.
Ogive: graph of a cumulative distribution.
Pareto chart: summarizes the data concerning, in this ex, types of defects. List them in decreasing order by frequency, with the "other" category at the end.
   o Pareto principle: only a few defect types account for most of a product's problems.
Histogram: data is grouped into classes, as in a frequency distribution.
   o 1. Find the number of classes: the smallest whole number K such that 2^K is greater than the number of measurements in the data set. Ex: with 30 measurements, K should be 5, because 2^4 is 16 (not large enough) and 2^5 is 32 (and 32 is larger than 30).
   o 2. Find the class length: (largest measurement - smallest measurement) / number of classes (which you found in step 1). This means you're taking the range of your data and dividing by the number of classes, which will leave you with classes of EQUAL SIZE.
   o 3. Form non-overlapping classes. Define the boundaries so that the lower boundary is a "greater than or equal to" and the upper is ONLY a "less than".
   o 4. Tally and count the number of measurements in each class.
Frequency polygon: page 48. I've never used this.
Dot plot: define your classes as you would for a histogram, then for each class put a dot near the x-axis and one dot on top of that one for each subsequent value that falls within the class. VERY USEFUL, histograms are usually made from these.
Stem and leaf display: the base of the number is the stem. The leaf is the next part.
   o Ex: {23, 23, 25, 26, 29} would look like 2|3,3,5,6,9. Be sure to define your units (leaf unit here would be 1).
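The four histogram steps above, together with relative and cumulative frequency, can be sketched as follows. The function names are made up for illustration, and one choice is an assumption: since every upper boundary is a strict "less than", the largest measurement is placed in the last class so that it is not lost.

```python
def number_of_classes(n):
    """Step 1: smallest whole number K with 2**K greater than n."""
    k = 1
    while 2 ** k <= n:
        k += 1
    return k

def frequency_table(data):
    """One row per class: bounds, frequency, relative and cumulative frequency.

    Assumes the data values are not all identical (class length would be 0).
    """
    n = len(data)
    k = number_of_classes(n)
    low, high = min(data), max(data)
    length = (high - low) / k  # step 2: equal class lengths
    counts = [0] * k
    for x in data:
        # Step 3: lower boundary is >=, upper boundary is <.
        # Assumption: the largest value goes into the last class.
        index = min(int((x - low) / length), k - 1)
        counts[index] += 1  # step 4: tally
    rows, running = [], 0
    for i, freq in enumerate(counts):
        running += freq
        rows.append({
            "lower": low + i * length,
            "upper": low + (i + 1) * length,
            "frequency": freq,
            "relative": freq / n,
            "cumulative": running,
        })
    return rows

data = [9, 10, 12, 13, 15, 16, 17, 17, 21, 22]
rows = frequency_table(data)
for row in rows:
    print(row)
```

For these 10 values, K is 4 (2^3 = 8 is too small, 2^4 = 16 is enough) and the class length is (22 - 9) / 4 = 3.25.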

Interpret the graph
Symmetrical: the graph appears in the shape of a bell curve.
   o The mean and median are very close to each other.
Skewed
   o Right: the majority of data (median) still falls in the middle, but the average is being pulled to the right by a few numbers that are unusually high (outliers).
   o Left: the majority of data (median) still falls in the middle, but the average is being pulled to the left by a few numbers that are unusually low (outliers).
Bimodal: there are two hump shapes (or bells) on the graph.
Population parameter: a descriptive measure of the population.
Parameter of interest: the population parameter you want to study.
Point estimate: a one-number estimate of the population parameter's value; it can be found by using a sample statistic.
Sample statistic: a number that describes your sample, calculated using its values.
Mean: mu is the average of the population; x-bar is the average of a sample. Add all the values up and divide by how many there are (n for a sample). Ignore that weird formula in your book, it's the math way of saying the same thing.

Median: the center-most number in your data set when the data is in order.
   o Ex: 15 30 45 60 75. The median here is 45. However, if you add 90 to the end, the median is now the average of 45 and 60, which gives you 52.5.
Mode: the number that occurs most frequently in the set. If I'd added another 75 to the example above, the mode would be 75, since it's the only number that occurs more than once.
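The mean, median, and mode examples above can be checked with Python's standard `statistics` module. The variable names are just for illustration.

```python
import statistics

values = [15, 30, 45, 60, 75]

mean_value = statistics.mean(values)             # add them up, divide by n
median_odd = statistics.median(values)           # middle value of 5 numbers
median_even = statistics.median(values + [90])   # average of the two middle values
mode_value = statistics.mode(values + [75])      # 75 now appears twice

print(mean_value, median_odd, median_even, mode_value)
```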

Scatter plots

Not as tricky as you might think. If you are given a set of points that are vaguely linear in shape, you will enter them into a table in your calculator. You will then do a linear regression, which will give you the equation for a straight line that best fits your data. Regression lines are, however, very sensitive to outliers, and so is the correlation coefficient (look up r and r^2).
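If you don't have a calculator handy, here is a rough sketch of what the linear regression button does, using the usual least-squares formulas. The function name and the data points are made up; the second data set swaps in one unusually high value to show how much a single outlier moves the line and r.

```python
import math

def linear_regression(xs, ys):
    """Least-squares line y = intercept + slope * x, plus correlation r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

xs = [1, 2, 3, 4, 5]
ys_clean = [3, 5, 7, 9, 11]    # made-up points lying exactly on y = 1 + 2x
ys_outlier = [3, 5, 7, 9, 30]  # same points, last one replaced by an outlier

clean = linear_regression(xs, ys_clean)
skewed = linear_regression(xs, ys_outlier)
print(clean)
print(skewed)
```

One changed point is enough to pull the slope from 2 to about 5.8 and to drop r well below 1.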

Variance
Population
   o Standard deviation (sigma): measures how spread out the population values are around the mean. It is the positive square root of the population variance.
   o Variance (sigma^2): the average of the squared deviations of the individual population measurements from the population mean.
Sample
   o Standard deviation: just s, it's the positive square root of the sample variance. (We called this sigma hat, as you'll see it on the internet.)
   o Variance: s^2. THIS IS NOT COMPLICATED, IT JUST TAKES A LONG TIME. You're basically going to take every value of your sample, subtract x-bar from it, and square that value. Add those all up and divide by n-1.
   o Computational formula: s^2 = [ sum(x^2) - (sum(x))^2 / n ] / (n - 1)
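The sample variance recipe above can be sketched in Python both ways: the long "subtract x-bar and square" way, and the standard shortcut (computational) formula that works from the sum of the values and the sum of their squares. The function names and the data are made up for illustration.

```python
import math

def sample_variance(values):
    """Definition: sum of squared deviations from x-bar, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def sample_variance_shortcut(values):
    """Computational form: [sum(x^2) - (sum(x))^2 / n] / (n - 1)."""
    n = len(values)
    total = sum(values)
    total_of_squares = sum(x * x for x in values)
    return (total_of_squares - total ** 2 / n) / (n - 1)

def population_variance(values):
    """Population version: divide by N instead of n - 1."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values with mean 5
s_squared = sample_variance(data)
s = math.sqrt(s_squared)  # sample standard deviation
sigma = math.sqrt(population_variance(data))
print(s_squared, s, sigma)
```

Here the squared deviations add up to 32, so s^2 is 32/7 if you treat the values as a sample, and sigma^2 is 32/8 = 4 if you treat them as the whole population.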

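Standard Deviation
   o For standard deviation, the empirical rule tells you what percentage of data falls within a given number of standard deviations of the mean, when the data is roughly bell-shaped.
      About 68% of the data falls within 1 standard deviation of the center.
      About 95% of the data falls within 2 standard deviations of the center.
      About 99.7% of the data falls within 3 standard deviations of the center.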
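To turn the empirical rule into actual ranges, go out 1, 2, and 3 standard deviations on each side of the mean. A small sketch, with a made-up function name and made-up numbers (mean 100, standard deviation 15), assuming the data is roughly bell-shaped:

```python
def empirical_rule_intervals(mean, sd):
    """Ranges expected to hold about 68%, 95%, and 99.7% of bell-shaped data."""
    coverage = {1: "68%", 2: "95%", 3: "99.7%"}
    return {label: (mean - k * sd, mean + k * sd)
            for k, label in coverage.items()}

intervals = empirical_rule_intervals(100, 15)  # hypothetical mean and sd
for label, (low, high) in intervals.items():
    print(f"about {label} of the data falls between {low} and {high}")
```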

Outliers & Resistance
The median and IQR are resistant to extreme values. The mean, correlation coefficient, and standard deviation are NOT resistant to outliers.
Finding an outlier: this is done by using the interquartile range, or IQR.
   o 1. Find the quartiles, with the data in order. To find the position of Q1, count how many data values you have and divide by 4. To find the position of Q3, multiply that value by 3.
      If you had {9, 10, 12, 13, 15, 16, 17, 17, 21, 22}, Q1 would be the 2.5th number, so halfway between 10 and 12, which is 11. Q3 would be the 7.5th, halfway between 17 and 17, or 17.
   o 2. Now Q3 - Q1. This is your IQR. It is 6 in the above example.
   o 3. Determine the outliers. Multiply the IQR by 1.5. Then subtract this value from Q1 and add it to Q3. If you have a point less than the result of the subtraction or greater than the result of the addition, it is an outlier. In the above example, 1.5 x 6 = 9, so anything below 11 - 9 = 2 or above 17 + 9 = 26 would be an outlier; this data set has none.

Remind me to talk about sampling bias and errors. I forgot about those but I'm too tired right now to finish. In the meantime, reference this page a lot: http://en.wikipedia.org/wiki/Sampling_(statistics)
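The outlier steps above can be sketched in Python like this. The function names are made up, and the quartile positions follow the n/4 and 3n/4 rule from these notes; calculators and software often use slightly different quartile conventions, so their Q1 and Q3 may not match exactly.

```python
import math

def value_at_position(sorted_values, position):
    """Value at a 1-based position; fractional positions interpolate neighbors."""
    lower = math.floor(position)
    fraction = position - lower
    below = sorted_values[lower - 1]
    above = sorted_values[lower]
    return below + fraction * (above - below)

def iqr_outliers(values):
    """Return Q1, Q3, IQR, the two fences, and any outliers.

    Assumes at least 4 values, so the Q1 position is at least 1.
    """
    ordered = sorted(values)
    n = len(ordered)
    q1 = value_at_position(ordered, n / 4)      # step 1
    q3 = value_at_position(ordered, 3 * n / 4)
    iqr = q3 - q1                               # step 2
    low_fence = q1 - 1.5 * iqr                  # step 3
    high_fence = q3 + 1.5 * iqr
    outliers = [x for x in ordered if x < low_fence or x > high_fence]
    return q1, q3, iqr, low_fence, high_fence, outliers

data = [9, 10, 12, 13, 15, 16, 17, 17, 21, 22]
result = iqr_outliers(data)
print(result)

# Made-up variation: swap the 22 for a 40 to see one value get flagged.
with_extreme = iqr_outliers(data[:-1] + [40])
print(with_extreme[-1])
```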