
# Chapter 1: Exploring Data

I. Data Analysis: Making Sense of Data
  A. Individuals – the objects described by a set of data
    i. They may be people, animals, or things.
      1. Ex. For a high school’s student data, the students would be the individuals.
  B. Variable – any characteristic of an individual
    i. It can take different values for different individuals.
      1. Ex. Age, gender, GPA, grade level, and homeroom are all variables in a high school’s student database.
    ii. Categorical Variable – places an individual into one of several groups or categories
    iii. Quantitative Variable – takes numerical values for which it makes sense to find an average
      1. Not every variable that takes number values is quantitative.
  C. Most data follows this format: each row is an individual, each column is a variable
  D. Categorical variables sometimes have similar counts in each category and sometimes don’t
  E. Distribution – the distribution of a variable tells us what values the variable takes and how often it takes these values
  F. Exploring data
    i. Begin by examining each variable by itself
    ii. Move on to study relationships among the variables
    iii. Start with a graph or graphs
    iv. Add numerical summaries
  G. Inference – drawing conclusions that go beyond the data at hand
  H. Probability – the study of chance behavior
II. Analyzing Categorical Data
  A. The distribution of a categorical variable lists the categories and gives either the count or the percent of the individuals who fall in each category.
    i. Frequency Table – displays the counts
    ii. Relative Frequency Table – displays the percents (which should add up to 100%)
  B. Round-off Error – when adding up percents, they may not add up to exactly 100 because each percent was rounded (this is not a mistake, just the effect of rounding off results)
  C. Pie Chart – displays each category as a share of the whole
  D. Bar Graph – compares counts or percents across categories
    i. Bar graphs are more flexible than pie charts; they can compare any set of quantities that are measured in the same units.
    ii. When making a bar graph, make the bars equally wide to avoid confusion. Do not confuse a bar’s height with its area. The height is the only thing that matters.
    iii. The percents in a bar graph do not necessarily add up to 100.
    iv. A segmented bar graph shows the conditional distribution for each category.
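The frequency table, relative frequency table, and round-off error described above can be sketched in a few lines of Python. This is only an illustration: the homeroom labels are made-up data, and the function names are hypothetical helpers, not part of the notes.

```python
from collections import Counter

def frequency_table(values):
    """Return {category: count} for a categorical variable."""
    return dict(Counter(values))

def relative_frequency_table(values, digits=1):
    """Return {category: percent}, each percent rounded to `digits` places."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: round(100 * n / total, digits) for cat, n in counts.items()}

# Made-up data: the homeroom (a categorical variable) of six students.
homerooms = ["A", "B", "C", "A", "B", "C"]

counts = frequency_table(homerooms)             # counts per category
percents = relative_frequency_table(homerooms)  # rounded percents per category

# Each category is exactly one third of the data, so each rounded percent
# is 33.3 and the total falls just short of 100: round-off error.
percent_total = sum(percents.values())
```

Here the rounded percents sum to about 99.9 rather than 100, which is the round-off effect and not a counting mistake.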

  E. To understand the information in a two-way table, look at the distribution of each variable separately.
    i. The marginal distribution of one of the categorical variables in a two-way table of counts is the distribution of values of that variable among all individuals described by the table (it does not tell you anything about the relationship between the two variables).
    ii. A conditional distribution of a variable describes the values of that variable among individuals who have a specific value of another variable.
      1. There is a separate conditional distribution for each value of the other variable.
      2. Depending on which variable you condition on, divide the count in a specific cell by that variable’s row or column total to find the conditional distribution.
  F. To organize a statistical problem:
    i. State: What’s the question that you’re trying to answer?
    ii. Plan: How will you go about answering the question? What statistical techniques does this problem call for?
    iii. Do: Create graphs and carry out necessary calculations.
    iv. Conclude: Give a practical conclusion in the setting of the real-world problem.
  G. Association – two variables are associated if specific values of one variable tend to occur in common with specific values of the other
    i. Caution: Even a strong association between two categorical variables can be influenced by variables lurking in the background
  H. Simpson’s Paradox – an association between two variables that holds for each individual value of a third variable can be changed or even reversed when the data for all values of the third variable are combined
III. Displaying Quantitative Data with Graphs
  A. Dot Plot – a graph that shows each data value as a dot above its location on a number line
    i. The purpose of the graph is to help us understand the data.
    ii. Always look for the overall pattern in any graph and for striking departures from the pattern.
  B. You can describe the overall pattern of any distribution by its shape, center, and spread.
    i. An outlier is an individual value that falls outside the overall pattern.
  C. Shape – indicate the modes and skewness (skewness is the direction of the long tail)
    1. Symmetric – if both sides of the distribution are approximately mirror images of each other
    2. Skewed Right – if the right side of the distribution is much longer than the left
    3. Skewed Left – if the left side of the distribution is much longer than the right
    4. Unimodal – distribution has a single peak
    5. Bimodal – distribution has two clear peaks
    6. Multimodal – distribution with three or more peaks
IV. Describing Quantitative Data with Numbers
  A. Center – use the mean
  B. Spread – use the range
  C. Many distributions have irregular shapes that are neither symmetric nor skewed
  D. To compare distributions, compare shape, center, spread, and outliers, and use words like greater than, less than, and about equal to.
  E. Stem Plot – a stem-and-leaf plot
    i. To construct a stem plot, separate each value into a stem and a leaf, and “split the stems” when the data bunch up on only a few stems.
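As a rough illustration of marginal and conditional distributions in a two-way table, here is a small Python sketch. The table of counts is invented for the example, and the function names are hypothetical.

```python
# Made-up two-way table of counts:
# rows = grade level, columns = answer to a yes/no survey question.
table = {
    "freshman": {"yes": 30, "no": 20},
    "senior": {"yes": 10, "no": 40},
}

def marginal_distribution(table):
    """Percent of ALL individuals that fall in each row category."""
    grand_total = sum(sum(row.values()) for row in table.values())
    return {name: 100 * sum(row.values()) / grand_total
            for name, row in table.items()}

def conditional_distribution(table, row_name):
    """Percent of each column category among individuals in one row only."""
    row = table[row_name]
    row_total = sum(row.values())  # divide each cell by its row total
    return {col: 100 * count / row_total for col, count in row.items()}

marginal = marginal_distribution(table)
cond_freshman = conditional_distribution(table, "freshman")
cond_senior = conditional_distribution(table, "senior")
```

The marginal distribution of grade level is 50% and 50%, which says nothing about the survey answers. The two conditional distributions differ (60% yes among freshmen, 20% yes among seniors), which is what an association between the variables looks like.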

      1. The smaller values always go closest to the stem.
    ii. A back-to-back stem plot compares two sets of data by placing their leaves on opposite sides of a shared stem.
  F. Histograms
    i. To make a histogram:
      1. Divide the range into classes of equal width
      2. Find the count or percent of individuals in each class
      3. Label and scale your axes and draw the histogram
    ii. Do not confuse histograms with bar graphs
    iii. Do not use counts or percents as data
    iv. Use percents instead of counts on the vertical axis when comparing distributions with different numbers of observations
    v. The mean, $\bar{x}$, is the most common measure of center
      1. $\bar{x} = \dfrac{\sum x_i}{n}$ = (sum of observations)/n
      2. $\sum$ = summation
    vi. Use the median when a distribution is skewed and the mean when it is not.
  G. Box Plots
    i. First Quartile (Q1) – lies one quarter of the way up the ordered list
    ii. Third Quartile (Q3) – lies three quarters of the way up the ordered list
    iii. Interquartile Range (IQR) – measures the range of the middle 50% of the data
      1. IQR = Q3 − Q1
    iv. Outlier rule – an observation is called an outlier if it is more than 1.5 × IQR above the third quartile or below the first quartile
    v. Five-number summary – consists of the smallest observation, first quartile, median, third quartile, and largest observation (written in order from smallest to largest)
    vi. To make a box plot, draw a central box from the first to the third quartile. Then draw a vertical line in the box to mark the median. Draw extension lines from the box to the smallest and largest observations that are not outliers.
    vii. Standard Deviation – measures roughly the average distance of the observations from their mean
      1. Spread is also measured by standard deviation.
      2. Calculated by finding an average of the squared distances (dividing by n − 1 for a sample) and then taking the square root.
      3. Variance is the average of the squared distances.
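The five-number summary, the 1.5 × IQR outlier rule, and the standard deviation can be checked with a short Python sketch. Two assumptions are worth flagging: the quartiles are taken as the medians of the lower and upper halves of the ordered list (leaving out the overall median when n is odd), which is one common classroom convention but not the only one, and the scores are made-up data.

```python
from statistics import mean, median, stdev

def quartiles(data):
    """Return (Q1, median, Q3) using the median-of-halves convention."""
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    # Skip the middle value when n is odd so neither half contains it.
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return median(lower), median(xs), median(upper)

def five_number_summary(data):
    """Smallest value, Q1, median, Q3, largest value (smallest to largest)."""
    q1, med, q3 = quartiles(data)
    return (min(data), q1, med, q3, max(data))

def outliers(data):
    """Values more than 1.5 x IQR below Q1 or above Q3."""
    q1, _, q3 = quartiles(data)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < low or x > high]

scores = [2, 4, 4, 5, 6, 7, 8, 9, 30]   # made-up data
summary = five_number_summary(scores)   # (2, 4.0, 6, 8.5, 30)
flagged = outliers(scores)              # [30]
s = stdev(scores)                       # sample standard deviation (divides by n - 1)
```

With these scores the IQR is 4.5, so anything above 8.5 + 6.75 = 15.25 is flagged, and only 30 qualifies.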

# Chapter 2: Modeling Distributions of Data

I. Describing Location in a Distribution
  A. Percentile – the pth percentile of a distribution is the value with p percent of the observations less than it
  B. Relative Frequency – divide the count in each class by the sample size, then multiply by 100 to convert to a percent
  C. Cumulative Relative Frequency – divide the entries in the cumulative frequency column by the sample size, and multiply by 100 to convert to a percent
  D. Cumulative Relative Frequency Graph – plot a point corresponding to the cumulative relative frequency in each class at the smallest value of the next class.
    i. Used to describe the position of an individual within a distribution or to locate a specified percentile of the distribution.
  E. Standardizing – converting observations from original values to standard deviation units
    i. A standardized value is a z-score.
    ii. If x is an observation from a distribution that has known mean and standard deviation, the standardized value of x is
      1. z = (x − mean)/(standard deviation)
        a. The z-score tells us how many standard deviations away from the mean an observation falls and in what direction
        b. Positive z-scores mean that observations are larger than the mean
        c. Negative z-scores mean that observations are smaller than the mean
    iii. Using standardized values is useful for comparing things such as height at different ages.
      1. Ex. If we want to compare a brother’s and a sister’s heights relative to their ages, we would standardize their height values and then compare them.
  F. Data Transformations
    i. Effect of adding or subtracting a constant
      1. Adding the same number to each observation adds that number to the mean, median, quartiles, and percentiles.
      2. It does not change the shape of the distribution or measures of spread such as the range, IQR, or standard deviation.
    ii. Effect of multiplying or dividing by a constant
      1. Multiplying or dividing each observation by the same number will multiply or divide the mean, median, quartiles, and percentiles by that number.
      2. It will also multiply or divide the measures of spread such as range, IQR, and standard deviation by the absolute value of that number.
      3. It will not change the shape of the distribution.
  G. Exploring quantitative data
    i. Always plot data through a graph (dot plot, stem plot, histogram, etc.).
    ii. Look for overall pattern and striking departures from the pattern (shape, center, spread, and outliers).
    iii. Calculate a numerical summary to briefly describe center and spread.
    iv. When the overall pattern is regular enough, describe it by its curve.
  H. Density Curves
    i. Always on or above the horizontal axis
    ii. Has an area of exactly one underneath it
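A short Python sketch of standardizing and of the transformation rules above. All numbers are invented for illustration, including the age-group means and standard deviations used for the sibling comparison; `z_score` is a hypothetical helper name.

```python
from statistics import mean, stdev

def z_score(x, mu, sigma):
    """Standardized value: how many standard deviations x lies from the mean."""
    return (x - mu) / sigma

# Made-up reference values: mean and standard deviation of height (cm)
# for each sibling's own age group.
z_brother = z_score(150, mu=140, sigma=8)   # 1.25 SDs above his age-group mean
z_sister = z_score(120, mu=110, sigma=5)    # 2.0 SDs above her age-group mean

# Made-up data for checking the transformation rules.
data = [60, 62, 65, 70, 73]
shifted = [x + 10 for x in data]   # add a constant to every observation
scaled = [x * -2 for x in data]    # multiply every observation by -2

# Adding 10 moves the mean by 10 but leaves the standard deviation alone.
shift_moves_mean = abs(mean(shifted) - (mean(data) + 10)) < 1e-9
shift_keeps_spread = abs(stdev(shifted) - stdev(data)) < 1e-9

# Multiplying by -2 multiplies the mean by -2 and the spread by |-2| = 2.
scale_moves_mean = abs(mean(scaled) - (-2 * mean(data))) < 1e-9
scale_stretches_spread = abs(stdev(scaled) - 2 * stdev(data)) < 1e-9
```

Both siblings are 10 cm above their age-group mean, but the sister's z-score is larger, so she is taller relative to her peers.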

    iii. Describes the overall pattern of a distribution
    iv. The area under the curve and above any interval of values on the horizontal axis is the proportion of all observations that fall in that interval
    v. They come in many shapes.
    vi. Often a good description of the overall pattern of a distribution; outliers are not described by the curve.
    vii. The median of the density curve is the equal-areas point: half of the area is to the left and half of the area is to the right of the point.
    viii. The mean of the density curve is the balancing point
    ix. The mean is always farther toward the long tail of a skewed curve than the median; the two coincide when the curve is symmetric.
    x. $\mu$ – the usual notation for the mean of a density curve
    xi. $\sigma$ – the usual notation for the standard deviation of a density curve
II. Normal Distributions
  A. Normal Curves – one of the most important classes of density curves; they describe normal distributions
    i. All have the same overall shape
    ii. These curves are symmetric, single peaked, and bell shaped.
    iii. Any specific normal curve is completely described by its mean and standard deviation.
    iv. The mean is located at the center and is the same as the median.
    v. Changing the mean without changing the standard deviation moves the normal curve along the horizontal axis without changing its spread.
    vi. Standard deviation is the natural measure of spread, and it is the distance from the center to the change-of-curvature points on either side.
    vii. Approximately normal distributions include scores on tests like the SAT and IQ tests, repeated measurements of the same quantity, and characteristics of biological populations.
    viii. They are good approximations of the results of many kinds of chance outcomes
    ix. Many sets of data still do not follow a normal distribution
  B. Empirical Rule
    i. It is also known as the 68-95-99.7 rule.
    ii. Approximately 68% of the observations fall within $\sigma$ of the mean
    iii. Approximately 95% of the observations fall within $2\sigma$ of the mean
    iv. Approximately 99.7% of the observations fall within $3\sigma$ of the mean
  C. Chebyshev’s Inequality – in any distribution, the proportion of observations falling within k standard deviations of the mean is at least $1 - \dfrac{1}{k^2}$
  D. Standard Normal Distribution – the normal distribution with a mean of 0 and standard deviation of 1
    i. Standard Normal Table – a table of areas under the standard normal curve; it shows the area to the left of the z-score
      1. In order to calculate the z-score for this, $z = (x - \mu)/\sigma$
    ii. Pay attention to which side of the z-value the area you need lies on
    iii. Real data is never exactly normal
  E. Normal Probability Plot – provides a good assessment of whether a data set follows a normal distribution
    i. Arrange data from smallest to largest
    ii. Find the z-score for each value using its percentile
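The 68-95-99.7 rule, Chebyshev’s inequality, and a standard normal table lookup can be sketched with `statistics.NormalDist` from the Python standard library (available in Python 3.8 and later). The IQ-style score below is a made-up example, and the helper names are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)  # standard normal distribution

def area_within(k):
    """Area under the standard normal curve within k SDs of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

def chebyshev_lower_bound(k):
    """Minimum proportion within k SDs of the mean for ANY distribution."""
    return 1 - 1 / k ** 2

# Empirical rule: roughly 0.68, 0.95, and 0.997 for k = 1, 2, 3.
within = {k: area_within(k) for k in (1, 2, 3)}

# Table-style lookup for a made-up score: standardize, then take the
# area to the LEFT of z, which is what a standard normal table reports.
x, mu, sigma = 130, 100, 15
z = (x - mu) / sigma               # 2.0
area_left = std_normal.cdf(z)      # about 0.977
area_right = 1 - area_left         # the other side, if that is what is asked
```

For k = 2, Chebyshev only guarantees at least 75% within two standard deviations, while a normal distribution has about 95% there. The normal figure is larger because it uses the specific shape of the curve.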

    iii. Plot each observation x against its z-score
    iv. Look for shapes that show clear departures from normality
    v. If the plot is close to a straight line, then the data are close to normal
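The steps for a normal probability plot can be sketched as follows. The percentile assigned to the i-th smallest of n values is taken here as (i − 0.5)/n, which is an assumed plotting position (software packages use slightly different formulas); the function name and the data are also made up.

```python
from statistics import NormalDist

def normal_probability_points(data):
    """Return (x, z) pairs for a normal probability plot.

    Assumes the i-th smallest of n values sits at percentile (i - 0.5) / n.
    """
    xs = sorted(data)                      # step i: smallest to largest
    n = len(xs)
    std_normal = NormalDist()              # mean 0, standard deviation 1
    points = []
    for i, x in enumerate(xs, start=1):
        percentile = (i - 0.5) / n         # step ii: percentile of each value
        z = std_normal.inv_cdf(percentile) # z-score with that area to its left
        points.append((x, z))              # step iii: plot x against z
    return points

points = normal_probability_points([5, 1, 3])  # made-up data
```

Plotting these pairs and checking how close they fall to a straight line is the visual test for normality described above.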

# Chapter 3: Describing Relationships

I. Scatterplots and Correlation
  A. Response Variable – measures an outcome of a study
  B. Explanatory Variable – helps explain or influence changes in a response variable
  C. The goal is often to show that changes in one or more explanatory variables cause changes in a response variable
  D. Scatter Plot – plots explanatory variable against response variable
    i. Must be quantitative data
    ii. The most useful graph for displaying the relationship between two quantitative variables
    iii. Each individual in the data appears as a point in the graph
    iv. Always plot the explanatory variable on the x axis and the response variable on the y axis
    v. To make a scatter plot:
      1. Decide which variable should go on each axis
      2. Label and scale the axes
      3. Plot individual data values
    vi. If you wanted to find the response for a value of the explanatory variable, simply go to the explanatory variable value, move up until you find a dot, then move over horizontally to get the response value.
    vii. To interpret scatter plots:
      1. Direction – indicate which way the pattern moves
        a. Use words such as negative or positive association
      2. Form – indicate the shape of the scatterplot (linear or curved) and the clusters of data
      3. Strength – how closely the points of the scatterplot follow a clear form (use words like moderately strong)
      4. Indicate any outliers
    viii. You can describe the overall pattern of a scatterplot by its direction, form, and strength of the relationship.
  E. Associations and Relationships
    i. Positive Association – when above-average values of one variable tend to accompany above-average values of the other, and below-average values also tend to occur together
    ii. Negative Association – when above-average values of one tend to accompany below-average values of the other, and vice versa
    iii. Not all relationships have a clear direction that we can describe as a positive or negative association
    iv. Association does not imply causation. We cannot conclude that a lurking variable did not influence the results.
    v. Our eyes are not good judges of how strong a linear relationship is
    vi. Correlation – the correlation, r, measures the strength and direction of a linear relationship between two quantitative variables
      1. It is always a number between -1 and 1.
      2. If r > 0, then the relationship has a positive association.
      3. If r < 0, then the relationship has a negative association.
      4. If r = 1 or -1, then the relationship is exactly linear.
      5. If r is close to 0, then the linear relationship is weak.

      6. Strength of the relationship increases as r moves away from 0 and toward 1 or -1
      7. The formula for r:
        a. $r = \dfrac{1}{n-1} \sum \left(\dfrac{x_i - \bar{x}}{s_x}\right)\left(\dfrac{y_i - \bar{y}}{s_y}\right)$
      8. A value of r close to 1 or -1 does not guarantee a linear relationship; it could be curved
      9. Correlation makes no distinction between explanatory and response variables
      10. r does not change if we change the units of x and/or y
      11. r itself has no unit of measurement; it is just a number
      12. Requires both variables to be quantitative
      13. Correlation only measures the strength of linear relationships; it does not describe curved relationships no matter how strong the curved relationship is
      14. Correlation is not resistant – outliers and influential points can strongly affect it
      15. Correlation is not a complete summary of two-variable data
        a. A complete summary must include the mean and standard deviation of both x and y [separately] along with the correlation.
        b. A scatterplot cannot be replaced by a numerical summary.
  F. Least-Squares Regression
    i. Regression Line – a line that describes how a response variable changes as an explanatory variable changes. It is used to predict y given a value of x.
    ii. Regression requires that you have an explanatory and response variable
    iii. A regression line is a model for the data, similar to density curves.
    iv. Formula for regression line
      1. $\hat{y} = a + bx$
      2. $\hat{y}$ – the predicted value of the response variable y for a given value of the explanatory variable x
      3. $b$ – the slope (amount y is predicted to change when x increases by one unit)
      4. $a$ – the y intercept (the predicted value of y when x = 0)
    v. The size of the slope of a regression line does not show how important a relationship is. A small slope does not mean that there is a weak relationship.
    vi. Extrapolation – the use of a regression line for prediction far outside the interval of values of the explanatory variable x used to obtain the line
      1. It is often inaccurate.
      2. Do not make predictions using values of x that are much larger or much smaller than those that actually appear in the data.
    vii. A good regression line makes the vertical distances from the line to the points as small as possible.
    viii. Residual – the difference between an observed and a predicted value of the response variable: residual = $y - \hat{y}$
    ix. Least-Squares Regression Line (LSRL) – the line that makes the sum of the squared residuals as small as possible
      1. Equation of the least-squares regression line for explanatory variable x, response variable y, and n individuals: $\hat{y} = a + bx$
        a. The slope is $b = r\left(\dfrac{s_y}{s_x}\right)$
        b. The y intercept is $a = \bar{y} - b\bar{x}$
    x. Residual Plot – a scatterplot of the residuals against the explanatory variable
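The correlation and least-squares slope and intercept can be computed directly from their standard textbook formulas, r as the average product of standardized values (dividing by n − 1), slope b = r·s_y/s_x, and intercept a = ȳ − b·x̄. The Python sketch below does this by hand; the five data points are made up, and the function names are hypothetical.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """r = (1 / (n - 1)) * sum of products of standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

def least_squares_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x, with b = r*s_y/s_x and a = y_bar - b*x_bar."""
    b = correlation(xs, ys) * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

xs = [1, 2, 3, 4, 5]   # made-up explanatory values
ys = [2, 4, 5, 4, 5]   # made-up response values

r = correlation(xs, ys)              # about 0.775
a, b = least_squares_line(xs, ys)    # about a = 2.2, b = 0.6

# Changing the units of x (say, inches to centimeters) leaves r unchanged.
r_rescaled = correlation([2.54 * x for x in xs], ys)

# Swapping the roles of x and y also leaves r unchanged.
r_swapped = correlation(ys, xs)
```

The regression line, unlike r, does change if x and y are swapped, which is why regression needs a designated explanatory and response variable.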

      1. A residual plot with no pattern means that a linear model is appropriate
      2. Residuals should be relatively small in size
      3. Examining residuals tells us how well the regression line fits the data
      4. Standard deviation of residuals:
        a. If using a LSRL, then the standard deviation of the residuals is $s = \sqrt{\dfrac{\sum (y_i - \hat{y}_i)^2}{n-2}}$
        b. Gives the approximate size of a typical prediction error (residual)
    xi. The coefficient of determination is $r^2$
      1. The formula is $r^2 = 1 - \dfrac{SSE}{SST}$
        a. $SSE = \sum (y_i - \hat{y}_i)^2$
        b. $SST = \sum (y_i - \bar{y})^2$
      2. If all points fall directly on the LSRL, then SSE = 0 and $r^2 = 1$.
    xii. Correlation and regression lines describe only linear relationships (Anscombe’s data)
    xiii. Outlier – observation that lies outside the overall pattern of the other observations
    xiv. Influential Point – an observation is influential if removing it would markedly change the result of the calculation
    xv. An association between an explanatory variable and a response variable, even if it is very strong, is not by itself good evidence that changes in the explanatory variable actually cause changes in the response variable.
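Residuals, the standard deviation of the residuals, and the coefficient of determination can be sketched in Python using the standard definitions s = sqrt(SSE/(n − 2)) and r² = 1 − SSE/SST. The data are made up, the helper names are hypothetical, and the slope is computed in the equivalent form Σ(x − x̄)(y − ȳ)/Σ(x − x̄)², which gives the same line as b = r·s_y/s_x.

```python
from math import sqrt
from statistics import mean

def fit_lsrl(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

def residuals(xs, ys, a, b):
    """Observed y minus predicted y for every individual."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

def residual_std_dev(res):
    """Typical size of a prediction error: sqrt(SSE / (n - 2))."""
    return sqrt(sum(e ** 2 for e in res) / (len(res) - 2))

def r_squared(ys, res):
    """Coefficient of determination: 1 - SSE/SST."""
    y_bar = mean(ys)
    sse = sum(e ** 2 for e in res)
    sst = sum((y - y_bar) ** 2 for y in ys)
    return 1 - sse / sst

xs = [1, 2, 3, 4, 5]   # made-up explanatory values
ys = [2, 4, 5, 4, 5]   # made-up response values

a, b = fit_lsrl(xs, ys)
res = residuals(xs, ys, a, b)
s = residual_std_dev(res)     # about 0.894
r2 = r_squared(ys, res)       # about 0.6
residual_sum = sum(res)       # residuals from an LSRL sum to (about) zero
```

To make a residual plot, pair each x with its residual and look for leftover patterns such as a curve or a fan shape.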