14 Samples and Surveys
TWO SURPRISING PROPERTIES OF SAMPLING
  Randomization
  Sample Size
  Simple Random Sample
  Defining the Sampling Frame
VARIATION
  Estimating Parameters
  Sampling Variation
ALTERNATIVE METHODS
  Stratified and Cluster Samples
  Census
  Voluntary Response
  Convenience Samples
CHECKLIST FOR SURVEYS
SUMMARY
You may have seen the description “ranked tops in initial quality” in advertisements for new cars. Have you ever wondered who decides which model wins, and how? J. D. Power and Associates annually sends a questionnaire to owners of new cars. Did they find defects? Do they enjoy the way the car performs? Would they buy the same model again? Each year, the Initial Quality Survey contacts thousands of purchasers and lessees (more than 97,000 in 2007). Their responses are used to rank the models and determine which gets the prize.

The Initial Quality Survey sounds big until you consider how many new cars and light trucks get sold or leased in the US every year: more than 16 million. J. D. Power contacts well under 1 percent of these buyers: millions of them never got a questionnaire. Plus, J. D. Power isn’t getting thousands of responses for each model. These customers bought all sorts of vehicles, from Honda Accords to hybrids to Ford trucks. When spread over hundreds of models, a collection of 100,000 buyers begins to look small. What can be learned from a relatively tiny proportion of customers?

Decisions often require knowing a characteristic of a great number of people or things.
• A retailer wants to know the market share for a brand before deciding to stock the item in its stores.
• The foreman of a warehouse will not accept a large shipment of electronic components unless he’s sure that the shipment contains only a few that are defective.
• Managers in the human resources department use hourly wages paid around the country to set a competitive salary for attracting new employees.

Ideally, each situation requires a huge amount of data. For instance, managers in the HR department would like to know the starting salaries offered by employers around the country. That’s not going to happen. Instead, these managers are going to have to rely on what can be learned from a subset of the ideal data set. Such a subset is known as a sample. Is that a good idea?
In fact, sampling is usually the best approach.
4/20/2008 14 Surveys

Two Surprising Properties of Sampling

It’s hard to go a day without hearing about the latest opinion poll when an election is coming. Newspapers regularly report the results of the latest surveys. A survey asks questions of a subset of people who belong to a much larger group called the population. The subset of the population is called a sample. When done correctly, a survey reveals much about the opinions of the entire population without having to contact the whole population.

Terminology
Survey. Asking questions of a subset to learn about the larger group.
Population. The entire collection of interest.
Sample. The subset queried in a survey.
Representative. A sample that reflects the mix in the entire population.
Bias. Systematic error in choosing the sample.

Most likely, you’ve never been selected for a national opinion poll. How can pollsters claim that a sample is representative of the population if they haven’t talked to you? A key message of this chapter is that it’s hard to get a representative sample. A sample that presents a snapshot of the population is said to be representative. Samples that distort the population, such as one that systematically omits an important part of the population, are biased.

When it comes to sampling from a population, you might be tempted to handpick the sample with care and precision. That would usually be a mistake. A simpler method, random selection, typically works better. To avoid bias, the easy way is to pick the sample from the population at random. The best strategy for choosing a sample intentionally introduces random variation. That’s the first surprise for many people about sampling. There’s also a second surprise.
1. Randomly choose which members of the population to include in the sample.
2. Larger populations don’t require larger samples.

n, N
n = sample size
N = population size

Following convention, N stands for the size of the population, if it’s known (not to be confused with the N which denotes a normal random variable), and n identifies the size of the sample. It’s relevant to know if the population has, say, 10 items. In that case, you’d probably contact everyone. Our interest lies in problems when the population is too large to reach every member.

Randomization

The deliberate introduction of randomness when selecting a sample from a population is one of the great insights of statistics. The following analogy suggests the reasoning. Suppose you’ve decided to change the color of the living room in your apartment. So, you head down to the Home Depot for paint. Once you choose a color from the palette of choices, they’ll mix a gallon for you on the spot. After a computer adds pigments to a universal base, the salesperson puts the can in a machine that shakes it for several minutes.
To confirm the color, the salesperson dips a gloved finger into the can and smears a dab of paint on the lid. That smear is just a few drops, but it’s enough to tell whether you got the color you wanted. Because the paint has been thoroughly mixed by the shaker, you trust that the color of the smear represents the color of the entire can.

This test relies on a small sample, the few drops in that smear, to represent the entire population, all of the paint in the can. You can imagine what would happen if the salesperson were to sample the paint before mixing. Custom paints typically use a white base, so the smear would probably give the impression that the paint is pure white. If the salesperson happened to touch the pigment that had been added, you’d get the misleading impression that the paint is a rich, dark color. Shaking the can before sampling mixes the pigments with the base and allows a finger-sized sample to be representative of the whole can.

randomization
Selecting a subset of the population at random.

Shaking works for paint, but how do we “shake” a population of people to get a representative sample? The answer is to select them at random. Randomization, selecting a subset of the population at random, accomplishes the mixing as if shaking paint. A randomly selected sample is representative of the whole population. Randomization ensures that on average a sample mimics the population.

Here’s an example. We drew two samples, each with 8,000 people, at random from a database of 3.5 million customers. The entire database of customers is the population. This table shows how means and percentages match up for two samples and the population.

Table 14-1. Comparison of two random samples from a database of 3.5 million. Columns: Sample 1, Sample 2, Population. Rows: Age (yr), White (%), Female (%), Num of Children, Income (1–7), Homeowner? (% Yes).

We didn’t consider variables such as age or income when we drew these two samples. Even so, randomization produces samples whose averages resemble those in the population. Because randomizing avoids bias, it enables us to infer characteristics of the population from a sample. Such inferences are among the most powerful things we do with statistics. It is all made possible because we deliberately choose samples randomly.

The Literary Digest [boxed]

Sometimes it’s instructive to see a dismal failure. This one is infamous. In the early 20th century, newspapers asked readers to “vote” by returning sample ballots on a variety of topics. Internet surveys are modern versions of the same idea. A magazine called the Literary Digest was
known for its ballots. From 1916 through 1932, its mock voting had correctly forecast the winner in presidential elections. The 1936 presidential campaign pitted Alf Landon versus Franklin Roosevelt. For this election, the Literary Digest mailed 10 million ballots. Respondents returned 2.4 million. The results were clear. Alf Landon would be the next president by a landslide: 57% to 43%. Landon carried only two states. Roosevelt won 62% to 37%. Landon wasn’t the only loser: the Digest itself went bankrupt.

What went wrong? The Digest made a critical blunder: Where do you think the Digest got a list of 10 million names and addresses in 1936? Where would you get such a list? You might think of phone numbers, and that’s one of the lists the Digest used. In 1936 at the height of the Great Depression, telephones were luxuries. A list of phone numbers included far more rich than poor people. The other lists available to the Digest were even less representative: drivers’ registrations, its own list of subscribers, and memberships in organizations such as country clubs. The main campaign issue in 1936 was the economy. Roosevelt’s core supporters, who tended to be poor, were under-represented in the Digest’s sample. The results of the Digest’s survey did not reflect the opinions of the overall population.

Someone else did predict the outcome. Using a sample of 50,000, a young pollster named George Gallup predicted that Roosevelt would get 56% of the vote to Landon’s 44%. Though smaller, his sample was representative of the actual voters. The Gallup Organization went on to become one of the leading polling companies.

Sample Size

How large a sample do we need? Common sense tells you that small samples don’t reveal much about the population. What is surprising is that “small” has nothing to do with the size of the population. It’s intuitive to think that we need to sample a large percentage of the population. That’s not true. For large populations, only the number of cases in the sample matters. In fact, unless the population is small, the size of the population doesn’t affect statistical inference. More than 100,000,000 voters cast ballots in American presidential elections, but the polls you see in the news typically query at most 1,200 people. The survey summarized at the left shows results from asking 922 American adults in
January 2007 for their view of the economy.1 A sample of this size is an almost infinitesimal part of the population. Even so, this survey reveals the attitudes of the whole population to within about ±3%. (We’ll explain that in the next chapter.)

1 “Investors greet new year with ambivalence”, The New York Times, January 2, 2007.

How can it be that we learn so much about the population from a sample? Let’s open that can of paint again. If you’re painting a whole house rather than a single room, you might get one of those large, five-gallon buckets of paint. Do you need the salesperson to make a bigger smear to decide if it’s the right color? No, not as long as it’s well mixed. The same smear is enough to make a decision about the entire batch, no matter how large. The fraction of the population that you’ve sampled doesn’t matter so long as it’s well mixed.

Simple Random Sample

simple random sample (SRS)
A sample of n items chosen by a method that has equal chance of picking any sample of size n.

sampling frame
A list of items from which to select a random sample.

How do you get a random sample? Methods that give everyone in the population an equal chance to be in the sample don’t necessarily produce a representative sample. Consider, for instance, a clothier that has equal numbers of male and female customers. We could sample the customers this way: Flip a coin. If it lands heads, select 100 women at random. If it lands tails, select 100 men. Every customer has an equal chance of being selected, but every sample is of a single sex, which is hardly representative.

A sampling procedure must assign an equal chance to every possible sample of size n. This requirement assigns an equal chance to every combination of n members of the population. A procedure that makes every sample of size n equally likely produces a simple random sample, abbreviated SRS. An SRS is the standard against which we compare other sampling methods. Virtually all of the theory of statistics concerns simple random samples.

In order to obtain a simple random sample, we first need the sampling frame. The sampling frame is a list of items (voters, computer chips, candy bags, shoppers…) from which to draw the sample. Once we have the sampling frame, a spreadsheet makes it easy to obtain a simple random sample.
• Place a list of items in the sampling frame in the first column of a spreadsheet.
• Add a second column of random numbers (such as using Excel’s RAND function).
• Sort the rows of the spreadsheet using the random numbers in the second column.
• The items in the first n rows of the resulting spreadsheet identify a simple random sample.
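The spreadsheet recipe carries over to a few lines of code. What follows is a minimal Python sketch under stated assumptions, not a definitive implementation: the function name simple_random_sample, the customer IDs in the sampling frame, and the seed are all hypothetical and chosen only for illustration. The code attaches a random number to every item in the frame, sorts by that number, and keeps the first n rows.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Mimic the spreadsheet recipe: pair each item with a random number,
    sort by that number, and keep the first n rows."""
    if n > len(frame):
        raise ValueError("sample size n cannot exceed the size of the sampling frame")
    rng = random.Random(seed)                           # seed only makes the run repeatable
    keyed = [(rng.random(), item) for item in frame]    # second column of random numbers
    keyed.sort(key=lambda row: row[0])                  # sort rows by the random column
    return [item for _, item in keyed[:n]]              # first n rows form the sample

# Hypothetical sampling frame of 1,000 customer IDs (illustrative only).
frame = [f"customer-{i:04d}" for i in range(1000)]
sample = simple_random_sample(frame, n=10, seed=2008)
print(sample)
```

Python’s standard library also provides random.sample(frame, n), which draws a sample without replacement in a single call. The longer version is shown only because it mirrors the spreadsheet steps one for one.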
systematic sampling
Selecting items from the sampling frame following a regular pattern.

You can also obtain a simple random sample by picking items from the sampling frame systematically. For example, you might interview every 10th person on an alphabetical list of customers. For this procedure to produce an SRS, you must start the systematic selection at a randomly selected position in the list. When there is no reason to believe that the order of the list could be associated with the responses, systematic sampling gives a representative sample. If you use a systematic sample, you must justify the assumption that the systematic sampling is not associated with any of the measured variables. The ease of sampling using random numbers makes the use of systematic samples unnecessary.

Defining the Sampling Frame

The hard part of getting an SRS is to obtain a sampling frame that lists every member of the population of interest. The list that you have often differs from the list that you want. Election polls again provide an intuitive (and important) example. The population for a poll consists of people who will vote in the coming election. The typical sampling frame, however, lists registered voters from public records. The sampling frame identifies people who can vote, not those who will. Those who actually vote seldom form a random subset of registered voters.

Hypothetical populations also complicate identifying the sampling frame. The population of voters is real; we can imagine a list with the name of everyone who will vote. Other populations are less tangible. Consider a biotech company that has developed a new type of fruit, a disease resistant orange that possesses a higher concentration of vitamin C. Horticulturists grew 300 of these hybrid oranges by grafting buds onto existing trees. Scientists measured their weight and nutritional content. Are these 300 oranges the population or a sample? After all, these are the only 300 oranges of this variety ever grown.

If these 300 oranges are the population, scientists cannot infer anything about oranges grown from this hybrid in the future. Venture capitalists who invested in the company are not going to like this answer. Without some claims about more oranges than these, it’s going to be difficult to market this hybrid. Such claims force us to think of these oranges as a sample, even though there’s no list or sampling frame in the usual sense. If these 300 oranges offer a snapshot of the population of all possible oranges that might be grown, then these 300 are a sample. Of course, the sample is representative only if later growers raise their crops as carefully as the horticulturists who grew these.

Similar logic applies to the quality control process in the chip fab in Chapter 13. There’s no fixed population of chips, unless we limit ourselves to those made during a specific time period. Data that monitor production, sales, orders, and other business activities are often most naturally thought of as sampling a process.
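The systematic selection described earlier in this section (every 10th person, beginning at a random position) can be sketched in Python as follows. This is an illustrative sketch only: the function name systematic_sample, the list of 200 customer IDs, and the seed are assumptions that do not come from the text.

```python
import random

def systematic_sample(frame, step, seed=None):
    """Pick every `step`-th item, starting at a randomly chosen position."""
    if step < 1:
        raise ValueError("step must be at least 1")
    rng = random.Random(seed)
    start = rng.randrange(step)   # random start among the first `step` positions
    return frame[start::step]     # then every step-th item after the start

# Hypothetical alphabetical customer list of 200 entries (illustrative only).
customers = [f"customer-{i:03d}" for i in range(200)]
interviewees = systematic_sample(customers, step=10, seed=7)
print(len(interviewees), interviewees[:3])
```

With a random start, every customer has a 1-in-10 chance of being chosen, but only 10 distinct samples can ever occur. That is why the ordering of the list must be unrelated to the measured variables before the result can be treated like a random sample.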
Are You There?

In each case, answer yes or no and think about how you would explain your answer to someone.
(a) If asked to take blood from a sample of 20 cattle from a large herd at a ranch, do you think a cowhand would be inclined to take a simple random sample?2
(b) When marketers collect opinions from shoppers who are willing to stop and fill out a form in a market, do you think that they get a simple random sample of shoppers?3

2 Not a chance. Most likely, the cowhand would collect blood from the first 20 cows that he could get to rather than look for some that might be more aggressive or harder to find.
3 No. The sample might be representative of customers who are willing to stop and fill out a form, but many shoppers are in a hurry and will not want to take the time to complete a form.

Variation

The results obtained in a survey depend on which members of the population happen to be included. The percentage of consumers who say that they prefer Coke to Pepsi, for instance, varies depending on the composition of the sample. If our results vary from one sample to the next, what can we say about the population from a sample?

Supermarket chains worry about competition from Wal-Mart. Responding to their concerns, the Food Marketing Institute reported that 72% of shoppers say that a “supermarket is their primary food store.” What does that mean? Does this statement mean that Wal-Mart isn’t a threat to traditional grocers? We’re sure that the Food Marketing Institute didn’t take a census. The Food Marketing Institute can’t possibly know the proportion of all shoppers who visit groceries. Reality is too complex. These must be results from a sample, even though that detail is not mentioned in the headline. What does the 72% mean? What does a sample tell you about a population?

Estimating Parameters

parameter (population)
A characteristic, usually unknown, of the population.

estimation
The use of a sample characteristic (a statistic) to guess or approximate a population characteristic (a parameter).

Most surveys report a mean and perhaps a standard deviation. Readers often interpret these as population characteristics, but they are not. To make this distinction clear, different names and symbols distinguish characteristics of the population from those of a sample. Characteristics of the population, such as its mean, are population parameters. Properties of the sample allow us to estimate the corresponding characteristics of the population. For instance, we might use a sample mean to estimate the population mean. Estimates differ from parameters, and the traditional notation distinguishes one from the other. Traditionally, Greek letters denote population parameters, as in the notation for random variables. Random
variables and populations are similar with regard to sampling. A random variable stands for an idealized distribution of possible outcomes, and a population is the collection of items that we might see in a sample.

Often, the letter that stands for a statistic corresponds to the parameter in an obvious way. The standard deviation of the data is s, and the population standard deviation is σ (sigma, Greek for s). The letter r denotes the correlation in data, and the correlation in the population is ρ (rho). For the slope of the line associated with the correlation, b is the statistic whereas β (beta) identifies the parameter of the population. Alas, the pattern is irregular. The mean of a population is µ (because µ is the Greek letter for m). Longstanding convention puts a bar over anything when we average it, so we write x̄ or ȳ for sample means. Proportions are also irregular. In this book, p denotes the population proportion and p̂ the sample proportion. This table summarizes the correspondence between several sample statistics and population parameters.

Name                 Statistic   Parameter
Mean                 ȳ           µ (mu, pronounced “mew,” not “moo”)
Standard deviation   s           σ (sigma)
Correlation          r           ρ (rho, pronounced “row”)
Slope of line        b           β (beta, pronounced “baytah”)
Proportion           p̂           p
Table 14-2. Notation for statistics and parameters.

The Food Marketing Institute claims that p̂ = 0.72. The issue now becomes whether p̂ and p are close to each other.

Sampling Variation

sampling variation
Variation of a statistic from one sample to the next caused by selecting random subsets of the population.

Each sample has its own characteristics. The mean of a second SRS is not likely to match the mean of the first. A statistic, such as the sample proportion, varies from sample to sample. Differences among random samples produce sampling variation. As an illustration, again from politics, this plot returns to the approval polls for President Bush. Each point in the figure is the proportion in a sample who approved of President Bush in a specific survey.
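Sampling variation is easy to see in a small simulation. The Python sketch below is illustrative and rests on loud assumptions: it pretends that the population proportion p is exactly 0.72 (the text gives 0.72 only as a sample proportion), and the sample size of 1,000, the 20 repeated surveys, and the seed are likewise hypothetical. Each simulated survey draws respondents independently, which approximates an SRS from a very large population.

```python
import random

def sample_proportions(p, n, repeats, seed=None):
    """Simulate `repeats` surveys of size n from a large population in which
    a fraction p holds some opinion; return each survey's sample proportion."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(repeats):
        # Each respondent independently holds the opinion with probability p.
        hits = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(hits / n)
    return proportions

# Assumed values for illustration: p = 0.72, n = 1000, 20 repeated surveys.
props = sample_proportions(p=0.72, n=1000, repeats=20, seed=14)
print("smallest:", min(props), "largest:", max(props))
```

Although all of the simulated surveys sample the same population, their proportions differ from one another. That spread is sampling variation, separate from any real change in opinion over time.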
Figure 14-1. [Plot of Poll Rating, on a scale from 40 to 100, against Date, 2001 to 2006.]

The trend and jumps that were evident in the earlier plot of these ratings are visible in this plot, but this plot shows the sampling variation. Polls taken at the same time don’t give the same ratings. There’s visible sampling variation. Different polls give different results.4 Not only does public opinion change over time, but the results also depend upon the survey sample. Sampling variation is the price we pay for working with a sample rather than the population. Fortunately, statistics allows us to quantify the effects of sampling variation. We’ll see how in the following chapters.

4 These approval ratings come from the web site of Steve Ruggles at the University of Minnesota.

Example 14.1 Exit Surveys

Motivation (state the question)

Businesses use a variety of methods to keep up with regular customers, such as registration cards included with electronics purchases or loyalty programs at supermarkets. These tell the business about the customers who buy things, but not about those that don’t. Consider the questions of a clothing retailer located in a busy mall. The owner has data about the customers who frequently make purchases, but she knows nothing about those who leave without making a purchase. Do they leave because they did not find what they were looking for, because the sizes or colors were wrong, or because the prices were too high?

Tip: Every survey should have a clear objective. A precise way to state the objective is to identify the population and the parameter of interest. In this case, an objective is to learn the percentage that left without a purchase for each of the reasons listed above.

Method (describe the data and select an approach)

A survey is necessary. It is not possible to speak with everyone who leaves the store or have all of them answer even a few questions. Based on the amount of business, the store wants to get a sample of about 50 weekend customers.
The hard part of designing a survey is to describe the target population and how to sample it. Even if you don’t have such a list, identify the ideal sampling frame. Then you can compare the actual survey to this ideal. The ideal sampling frame in this example would list every shopper over the weekend who did not make a purchase. That list does not exist, but the store can try to sample this population just the same. Someone will have to try to interview shoppers who did not make a purchase as they leave.

Mechanics (do the analysis)

The store is open from 9 am to 9 pm, with about 25 shoppers walking through per hour (300 per day). Most don’t make a purchase. If staff interview every 10th departing shopper, then the sample size will be n = 60 if the survey runs both Saturday and Sunday. If a shopper refuses to participate, the surveyor will ask the following customer and make a note. To be reliable, a survey needs a record of nonresponses. If there are too many of these, the sample may not be representative.

Message (summarize the results)

Based on the survey, the owner will be able to find out why shoppers are leaving without buying. Perhaps she’s not changing inventory fast enough or has stocked the wrong sizes.

Alternative Methods

Stratified and Cluster Samples

stratified sampling
Random sampling within subsets of similar items, called strata.

cluster sampling
A type of stratified sampling in which the strata are determined geographically.

Simple random samples are easy to use. In later chapters we’ll assume that our data is an SRS. Most commercial polls and surveys, however, are more complex. Surveys that sample large populations are typically more complicated than simple random samples. There’s a reason for the complexity: to save time and money. Though more complex, all sampling designs share the idea that random chance, rather than personal preferences, should determine which members of the sampling frame belong in the sample.

A stratified random sample divides the sampling frame into homogeneous groups, called strata, before the sample is selected. Simple random sampling is used to pick items for the sample within each stratum.

Cluster sampling is a type of stratified sampling that is natural in situations that cover a wide geographic area. For instance, a national, in-person survey to learn the interest of homeowners in remodeling would be prohibitively expensive unless the surveyor can visit several homes in each locale. In cluster sampling, one first selects a simple random sample from a list of geographic units (clusters), such as census tracts that have comparable population. These geographic clusters form the strata. Given the selection of the clusters, one randomly samples within each cluster.

Stratifying helps in other domains as well. Randomizing routinely protects from imbalances, but there may also be cases in which we deliberately over-sample parts of the population. Suppose we operate a
large hotel, and we’d like to survey our customers’ opinions about the quality of service. Most of our customers (90%) are tourists. The other 10% travel for business. We suspect that the two groups of customers have different views on our service. We’d like a survey that told us about both groups. If we select 100 guests at random we might get a sample with 95 tourists and only 5 business travelers. We’re not going to learn much about business travelers from these 5 alone. To learn more about business travelers, we can stratify the population of customers into two strata, tourists and business travelers, and take random samples within each. For example we could sample, say, 75 tourists and 25 business travelers. Now we’ll learn something about both groups. The catch with a stratified sample is that we’ve deliberately overrepresented business travelers in our sample. We’ll have to adjust for that if we want to describe the population of all customers. (Statistics packages adjust for stratifying by introducing sampling weights into the calculations. We’ll leave those details to another course!)

Example 14.2 Estimating the Rise of Prices

Motivation (state the question)

Businesses, consumers, and the government are all concerned about inflation, the inexorable rise in costs for goods and services. Inflation pressures businesses to increase salaries and prices, consumers to cut back on what they purchase, and governments to pay more for entitlement programs such as Social Security. Let’s consider how the Bureau of Labor Statistics (BLS) estimates inflation. What goes into the consumer price index (CPI), the leading indicator of inflation in the US?

Method (describe the data and select an approach)

BLS uses a survey to estimate inflation. No one knows the price paid in every consumer transaction. We’ll focus on the urban CPI that measures inflation in 38 urban areas. The survey is done monthly. The target population consists of the costs of every consumer transaction in these urban areas during a specific month.

Mechanics (do the analysis)

BLS has the list of the urban areas and a list of people who live in each, but it does not have a list of every item sold. To get a handle on the vast scope of the economy, BLS divides the items sold into 211 categories and estimates the change in price for each category in every location (211 × 38 = 8,018 price indices). To compute each index, BLS sends its own force of data collectors into the field where they price a sample of items in selected stores in every location. Because the sample only includes transactions from some stores and not others, this is a clustered sample. Also, the definition of 211 types of transactions adds another type of clustering to the survey. The choice of items and stores change over time to reflect changes in the economy. Personal computers are now included, but were not a major category until recently. BLS uses stratified
samples grouped geographically to estimate housing costs. Once it has collected current prices, these are compared to prices last month category by category. (To learn more about the CPI, visit the BLS on-line and check out Chapter 17 in the Handbook of Methods.)

Message (summarize the results)

The urban consumer price index is an estimate of inflation based on a complex survey procedure in metropolitan areas. If you live outside the covered areas or perhaps spend your money differently from the items covered in the survey, your impression of inflation may be considerably different from the CPI.

Census

census
A comprehensive survey of the entire population.

Every survey must balance accuracy versus cost. Larger surveys reveal more, but you have to weigh those gains against rising costs. The cost goes up with every additional respondent. Ultimately, the choice of the sample size n depends on what you want to learn. Consider a survey of customers at a newly renovated retail store. Management would like to know how customers react to the new design in the various departments. The sample must be large enough for the survey to include customers that visit each portion of the store. If the sample does not include shoppers who visit the hardware department, the survey will not have much to say about reactions to this department, other than indicate that customers perhaps could not find this area!

Wouldn’t it be better to include everyone and “sample” the entire population? A comprehensive survey of the entire population is called a census. Though a census gives a definitive answer, these are rare in practice. Cost is the overriding concern, but other complications arise as well.

First, it can be difficult to list, much less contact, the entire population. Some individuals don’t want to be found! The U.S. Census undercounts the homeless: they’re harder to find, mistrust authority figures, and are reluctant to fill out forms. Because it tries so hard to count everyone, the Census ends up with too many college students. Many are included by their families and then counted a second time at school.

For manufacturing, a census is impractical. The manufacturer of GPS chips in Chapter 13 cannot test every chip. If you were a taste tester for the Hostess Company, you probably wouldn’t want to taste every Twinkie on the production line. Aside from the fact that you couldn’t eat every one, the test is destructive: Hostess wouldn’t have any left to sell.

The second major complication of a census is change. In the time that it takes to finish a census, events may shift opinions. When the population
Convenience Samples Another sampling method that usually fails is convenience sampling. these individuals may not be representative. these samples are not representative. Dealers might choose customers that are sure to give them a high rating. a group of individuals is invited to respond. When a company wants to learn reactions to its products or services. and those who do respond are counted. whom does it survey? The easiest people to sample are its current customers. it will never learn how the rest of the market feels about its products. flawed samples Voluntary response. No matter how it selects a sample from these customers. You see voluntary response samples all the time: call-in polls for the local news. Though easy to contact. The interviewer picks respondents who look easy to convince to participate. The first 200 on the 14-14 Are You There? . “Should women’s and men’s shoe sizes correspond?” Experience suggests that people with negative opinions tend to respond more often than those with equally strong positive opinions. After all. In a voluntary response sample. A farmer asked to pick out a sample of cows to check the health of the herd isn’t likely to choose animals that seem unruly or run away. and Internet polls. Voluntary Response Samples can be as flawed as a poorly done census. 800 numbers. Unless the company reaches beyond this convenient list. it makes more sense to collect a sequence of smaller samples that can detect changes in the population. Surveys conducted at shopping malls suffer from this problem. In spite of the problems. Voluntary response samples are usually biased toward those with strong opinions. the sample remains a convenience sample. Which survey would you respond to—one that asked. One of the least reliable samples is a voluntary response sample.4/20/2008 14 Surveys is changing (and most do). Convenience. the company has a list of them with addresses and phone numbers —at least those who sent in registration cards. 
People volunteer to participate in a survey or join a sample. convenience sampling is widespread. The resulting voluntary response bias invalidates the survey. Interviewers tend to select individuals who look easy to interview. and large dealers will be sending in more names than small dealers. “Should the minimum age to drive a car be raised to 30?” or one that asked. What problems do you foresee if Ford uses the following methods to sample customers who purchased an Expedition (a large SUV) during the last model year?5 5 Each method has a problem. Convenience sampling surveys individuals who are readily available. How often do constituents write their representative in Congress when they’re happy? Even though every individual has the chance to respond.
Those who return supplemental information voluntarily may also have more strongly held opinions. Most businesses don’t conduct their own surveys. but seldom reach 100%. so you might have to ask for more details. Most summaries of surveys omit these issues. If the response rate is low. They rely on data that someone else collects. Rather than sending out a large number of surveys for which the response rate will be low. The problem with nonresponse is that those who don’t respond may differ from those who do. 14-15 . • What is the rate of nonresponse? The design of an SRS includes a list of individuals from the sampling frame that the survey intends to contact. How was the question worded? • list may be eager buyers and different from those who buy later in the model year. c) Randomly choose 200 customers from those who voluntarily mailed in customer registration forms. you often won’t find answers to these questions until you ask. whether you analyze the data yourself or read someone else’s analysis. If you start with a biased sample. it won’t matter how well you do the subsequent analysis.4/20/2008 14 Surveys a) Have each dealer send in a list with the names of 5% of their customers who bought Expedition. what is the sampling design? Though fundamental. b) Start at the top of the list of all purchasers of Ford vehicles and stop after finding 200 who bought an Expedition. you’ve got to wonder if the people who participated resemble those who declined. it is usually better to design a smaller randomized survey for which you have the resources to ensure a high response rate. Here are some other questions that we have not yet covered. say 30%. Response rates of surveys vary. make sure that you can answer a few questions. To get the most out of surveys. your data will be flawed. An SRS with a low response rate looks more like a voluntary response sample than a randomly chosen subset of the population. 
It’s usually impossible to tell what the nonrespondents would have said had they participated. We’ve talked about the first two. Checklist for Surveys Unless surveys are done correctly. • • What was the sampling frame? Does it match the population? Is the sample a simple random sample? If not. Your conclusions are questionable. You can be sure that some are less willing to participate than others.
The wording of questions can have a dramatic effect on the nature of the responses. Asking a question with a leading statement is a good way to bias the response. Many surveys, especially those conducted by special interest groups, present one side of an issue before the question itself. For example, the following poll contacted 1,229 adults and asked about telephone wiretapping.6

Figure 14-2. Summary of opinion poll.

Take a close look at the 3rd and 4th questions. The wording of the 3rd question refers to “President Bush” and mentions “the threat of terrorism.” Responding to this wording, a small majority approves the use of wiretaps without warrants. In the 4th question (the questions were not presented one right after another), the wording refers to “George W. Bush” rather than “President Bush” and omits the link to terrorism. With this wording, fewer than half approve.

Placement of items in a survey affects responses as well. Let’s consider one of the most important surveys around: an election. Studies have found that candidates whose names are at the top of the ballot get about 2% more votes on average than they would had their name been positioned elsewhere on the ballot. It’s a lot like putting items on the grocery shelf at eye level: like shoppers, undecided voters gravitate to the first things that get their attention. Two percent is small, but in a tight election, that could swing the outcome.7

6 The New York Times, Jan 27, 2006.
7 “In the Voting Booth, Bias Starts at the Top,” New York Times, Nov. 11, 2006.
• Did the interviewer affect the results? The interviewer has less effect in automated polls, but many detailed surveys are still done in person. If there’s chemistry (good or bad) between the interviewer and the respondent, the answer may reflect more about their interaction than the actual response that you’re trying to measure. In general, respondents tend to answer questions in a way that they believe will “please” the interviewer, either consciously or unconsciously. The sex, race, attire, or behavior of the interviewer can all influence responses by subtle (or not so subtle) indications that certain answers are more desirable than others.

• Does survivor bias affect the survey? Survivor bias occurs when certain “long-lived” items are more likely to be selected for a sample than others. It’s easiest to explain in an example. Recently, many investors have shifted toward hedge funds. Because they are largely unregulated, hedge funds choose whether to report their performance. You might suspect that some report only when they’ve done well and hide their results otherwise. Those that have collapsed don’t report anything, and you can bet that they didn’t drop out because they were doing too well. To learn about the fees charged by these funds, an analyst developed a list of current hedge funds and sampled 100 for detailed analysis. Even though the analyst uses an accurate sampling frame (he has a list of all of them) and takes a random sample, his results suffer from survivor bias. His results over-represent successful funds, those that survive longer.

The name of this bias comes from medical studies. Physicians noticed that studies of certain therapies looked too good because the patients were never those that had severe cases. Those patients did not survive long enough to enter the study. Survivor bias is also a problem with analyses of the stock market. A random sample of companies that are currently listed on the major exchanges is more likely to contain those that have been successful and remain listed.
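To make the hedge fund discussion of survivor bias concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than data from the text: the number of funds, the normal distribution of returns, and the cutoff below which a fund is treated as collapsed are all hypothetical, and it tracks returns rather than the fees the analyst studied. It only shows the mechanism: a perfectly random sample drawn from the funds that still exist looks better than the full population.

```python
import random
import statistics


def simulate_survivor_bias(n_funds=1000, collapse_below=-0.10,
                           sample_size=100, seed=1):
    """Compare all funds with a random sample drawn only from survivors.

    All parameter values are hypothetical and chosen for illustration.
    """
    rng = random.Random(seed)
    # Hypothetical annual returns for every fund that ever existed.
    all_returns = [rng.gauss(0.05, 0.15) for _ in range(n_funds)]
    # Funds whose return falls below the cutoff collapse and drop off the list.
    survivors = [r for r in all_returns if r >= collapse_below]
    # The analyst takes a simple random sample, but only from current funds.
    sample = rng.sample(survivors, min(sample_size, len(survivors)))
    return {
        "all_mean": statistics.mean(all_returns),
        "survivor_mean": statistics.mean(survivors),
        "sample_mean": statistics.mean(sample),
        "n_survivors": len(survivors),
    }


if __name__ == "__main__":
    for name, value in simulate_survivor_bias().items():
        print(f"{name}: {value:.4f}" if isinstance(value, float)
              else f"{name}: {value}")
```

Because the collapsed funds are exactly those with the worst returns, the average among survivors cannot fall below the average among all funds. Randomizing within the list of survivors does nothing to remove that gap.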
Summary

A sample is a representative subset of a larger population. Samples provide statistics that allow us to estimate population parameters. To avoid bias, randomization is used to select items for the sample from the sampling frame, a list of items in the target population. A simple random sample (SRS) is chosen in such a way that all possible samples of size n are equally likely. Other types of samples, such as cluster samples or stratified samples, use special designs to reduce costs without introducing bias. Sampling variation occurs because of the differences between randomly selected subsets. A census attempts to enumerate every item in the population rather than a subset. Voluntary responses and convenience samples produce possibly unrepresentative samples, and nonresponse and survivor effects can introduce further biases.

Key Terms

bias
biased
census
cluster sampling
convenience sample
estimate
parameter
population
randomization
representative
sample
sampling frame
sampling variation
simple random sample
strata
stratified random sample
survey
survivor bias
systematic sample
voluntary response bias
voluntary response sample

Best Practices

• Randomize. Avoid complicated sampling plans and devote your energy to asking the right types of questions.

• Plan carefully. Spend the time to design the survey. Once bias creeps into a survey, the results become unreliable. Not much can be done to patch up a botched survey. A smaller, careful survey is more informative than a larger, biased analysis.

• Match the sampling frame to the target population. A survey begins with the sampling frame. The sampling frame defines the population for a survey. If it’s a poor match, your sample will not reflect the population that you’d like to study.

• Keep focused. Surveys that are too long are more likely to be refused, reducing the response rate and biasing the results. When designing a survey, remember the purpose. For each question you include, ask yourself, “What would I do if I knew the answer to this question?” If you don’t have a use for the answer, then don’t ask the question.

• Pretest your survey. If at all possible, test the survey in the exact form that you intend to use it with a small sample drawn from the population you intend to sample (not just some friends of yours). Look for misunderstandings, misinterpretation, confusion, or other possible biases. Redesign your survey as needed.

Pitfalls

• Non-response. Non-response converts a simple random sample into a voluntary response or convenience sample. Those who decline to participate or make themselves hard to reach often differ in many ways from those that are cooperative and easy to find. Some telephone polls call a respondent a second time in case the first call arrived at an inconvenient moment. Others offer a slight reward for participating. For example, the retailer in Example 14.1 might offer departing shoppers a beverage or a free sample of perfume in order to get them to answer a question or two.

• Leading the witness. Make sure that the questions on a survey do not influence the response. You want to learn from the sample, not influence the sample. The interaction between the interviewer and respondent can also influence the answers.

• Avoiding the details when you get the results you expect. It’s easy to question the results of a survey when you discover something that you didn’t expect. That’s okay, but remember to question the methods when the survey agrees with your preconceptions as well. A survey done to measure the success of pharmaceutical sales found that promotion had no effect on the habits of doctors. Stunned, managers at the marketing group dug into the survey and found that software had distorted the linkage between promotion and sales. Doctors were not being accurately tracked. Do you think that the managers would have taken this effort if they had found the results they expected?

Software Hints

Excel

The procedure described in the text is simple to use in Excel. Use the Excel function RAND to generate uniformly distributed random values between 0 and 1. Then sort the rows of the spreadsheet in the order of this column.

Minitab

To use the method described in the intro to this section, follow the sequence of menu commands

Calc > Random Data > Uniform…

which opens a dialog that allows you to fill a column of the data table with random numbers. Let’s say you put the random numbers in column C4 and would like to sample values from C1, C2 and C3. Next use the command

SORT C1 C2 C3 C11 C12 C13 BY C4

to sort the data columns, putting the results in columns C11, C12 and C13. If that seems confusing, then use the command

Calc > Random Data > Sample from columns…

In the resulting dialog, indicate the columns you’d like to randomly sample and how many rows you’d like in the sample.

JMP

To use the method described in the intro to this section, use the calculator to insert a column of random numbers. Right-click on the header of an empty column, pick Formula…, and insert one of the choices in the Random collection of functions. Then follow the menu commands

Tables > Sort

to use this column to sort the data table. The sorted data appears in a new data table. It’s easier to use the built-in sampling feature. Follow the menu sequence

Tables > Subset

In the resulting dialog, indicate the columns you’d like to randomly sample and how many rows (cases) you’d like in the sample. You can also indicate a percentage. JMP builds a new data table with a random sample as you’ve requested.
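The sort-by-a-random-column procedure in these software hints can also be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name simple_random_sample, the seed argument, and the customer labels are hypothetical and do not come from the text. The steps mirror the spreadsheet recipe: attach a uniform random value to each row, sort by that value, and keep the first n rows.

```python
import random


def simple_random_sample(rows, n, seed=None):
    """Draw n rows by sorting on a column of uniform random keys."""
    if n > len(rows):
        raise ValueError("sample size exceeds the number of rows")
    rng = random.Random(seed)
    # Step 1: give every row a uniform random value in [0, 1), like RAND in Excel.
    keyed = [(rng.random(), row) for row in rows]
    # Step 2: sort the rows in the order of the random column.
    keyed.sort(key=lambda pair: pair[0])
    # Step 3: the first n rows of the reordered table form the sample.
    return [row for _, row in keyed[:n]]


if __name__ == "__main__":
    # Hypothetical sampling frame of 20 customers.
    customers = [f"customer_{i}" for i in range(1, 21)]
    print(simple_random_sample(customers, 5, seed=42))
```

Just as Minitab and JMP offer built-in sampling commands, Python's standard library has random.sample(rows, n), which draws n rows without replacement in one call. The explicit version above is shown only because it follows the sort-based method step by step.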