CHAPTER 12 Sampling: Final and Initial Sample Size Determination

The study found that people who are likely to participate in a telephone survey (responders) differ from those who are likely to refuse (nonresponders) in the following ways: (1) confidence in survey research, (2) confidence in the research organization, (3) demographic characteristics, and (4) beliefs and attitudes about telephone surveys.

A recent study conducted by CMOR indicated that consumers prefer Internet surveys to telephone surveys. Out of 1,753 U.S. consumers, 78.9 percent of respondents chose the Internet as their first choice of survey method, whereas only 3.2 percent chose the telephone.

Given the differences between responders and nonresponders that this study demonstrated, researchers should attempt to lower refusal rates. This can be done by prior notification, motivating the respondents, incentives, good questionnaire design and administration, follow-up, and other facilitators.

Prior notification. In prior notification, potential respondents are sent a letter notifying them of the imminent mail, telephone, personal, or Internet survey. Prior notification increases response rates for samples of the general public because it reduces surprise and uncertainty and creates a more cooperative atmosphere.

Motivating the respondents. Potential respondents can be motivated to participate in the survey by increasing their interest and involvement. Two of the ways this can be done are the foot-in-the-door and door-in-the-face strategies. Both strategies attempt to obtain participation through the use of sequential requests. As explained briefly in Chapter 6, in the foot-in-the-door strategy, the interviewer starts with a relatively small request, such as "Will you please take five minutes to answer five questions?", with which a large majority of people will comply.
The small request is followed by a larger request, the critical request, that solicits participation in the survey or experiment. The rationale is that compliance with an initial request should increase the chances of compliance with the subsequent request. The door-in-the-face is the reverse strategy. The initial request is relatively large and a majority of people refuse to comply. The large request is followed by a smaller request, the critical request, soliciting participation in the survey. The underlying reasoning is that the concession offered by the subsequent critical request should increase the chances of compliance. Foot-in-the-door is more effective than door-in-the-face.

Incentives. Response rates can be increased by offering monetary as well as nonmonetary incentives to potential respondents. Monetary incentives can be prepaid or promised. The prepaid incentive is included with the survey or questionnaire. The promised incentive is sent to only those respondents who complete the survey. The most commonly used nonmonetary incentives are premiums and rewards, such as pens, pencils, books, and offers of survey results.

Prepaid incentives have been shown to increase response rates to a greater extent than promised incentives. The amount of incentive can vary from 10 cents to $50 or more. The amount of incentive has a positive relationship with response rate, but the cost of large monetary incentives may outweigh the value of the additional information obtained.

Questionnaire design and administration. A well-designed questionnaire can decrease the overall refusal rate as well as refusals to specific questions (see Chapter 10). Likewise, the skill used to administer the questionnaire in telephone and personal interviews can increase the response rate. Trained interviewers are skilled in refusal conversion or persuasion. They do not accept a "no" response without an additional plea.
The additional plea might emphasize the brevity of the questionnaire or the importance of the respondent's opinion. Skilled interviewers can decrease refusals by about 7 percent on average. Interviewing procedures are discussed in more detail in Chapter 13.

PART II Research Design Formulation

Follow-up. Follow-up, or contacting the nonrespondents periodically after the initial contact, is particularly effective in decreasing refusals in mail surveys. The researcher might send a postcard or letter to remind nonrespondents to complete and return the questionnaire. Two or three mailings are needed, in addition to the original one. With proper follow-up, the response rate in mail surveys can be increased to 80 percent or more. Follow-ups can also be done by telephone, e-mail, or personal contacts.

Other facilitators. Personalization, or sending letters addressed to specific individuals, is effective in increasing response rates. The next example illustrates the procedure employed by Arbitron to increase its response rate.

Arbitron's Response to Low Response Rates

Arbitron (www.arbitron.com) is a major marketing research supplier. For the first quarter of 2005, the company reported revenue of $79.2 million, an increase of 3.4 percent over revenue of $76.6 million during the first quarter of 2004. Recently, Arbitron was trying to improve response rates in order to get more meaningful results from its surveys. Arbitron ...

... their purchases, "most women who wear a fragrance ... six bottles of scent or more." A recent study found that names are increasingly important. To reinforce this idea, ... percent of fragrance users admitted that designer and ...

Coding Questions

The respondent code and the record number should appear on each record in the data. However, the record code can be dispensed with if there is only one record for each respondent.
The following additional codes should be included for each respondent: project code, interviewer code, date and time codes, and validation code. Fixed-field codes, which mean that the number of records for each respondent is the same and the same data appear in the same column(s) for all respondents, are highly desirable. If possible, standard codes should be used for missing data. For example, a code ... legitimate responses.

[Figure: Illustrative Computer File: Department Store Project]

fixed-field codes: A code in which the number of records for each respondent is the same, and the same data appear in the same columns for all respondents.

PART III Data Collection, Preparation, Analysis, and Reporting

Figure 14.2 Codebook Showing Information for the First Record: Department Store Project

Column   Variable  Variable Name                  Question  Coding Instructions
Number   Number                                   Number
1-3      1         ...                                      ... as necessary
4        2         Record number                            (same for all respondents)
5-6      3         Project code                             31 (same for all respondents)
7-8      4         Interview code                           As coded on the questionnaire
9-14     5         Date code                                As coded on the questionnaire
15-20    6         Time code                                As coded on the questionnaire
21-22    7         Validation code                          As coded on the questionnaire
23-24              Blank                                    Leave these columns blank
25       8         Who shops                      I         ... Other = 3. Input the number circled. Missing values = 9
26       9         Familiarity with store 1       IIa       For question II parts a through j: Input the number circled. Missing values = 9
27       10        Familiarity with store 2       IIb
28       11        Familiarity with store 3
...
35       18        Familiarity with store 10
36       19        Frequency: Store 1                       For question III parts a through j: Input the number circled. Missing values = 9
...
45       28        Frequency: Store 10
46-48              Blank                                    Leave these columns blank
49       29        Rating of store 1 on quality   IVa1      For questions IV through XI: Input the number circled
...
58       38        Rating of store 10 on quality  IVa10
59       39        Rating of store 1 on variety   IVb1
...
68       48        Rating of store 10 on variety  IVb10
69       49        Rating of store 1 on prices    IVc1
...
78       58        Rating of store 10 on prices   IVc10
79-80              Blank                                    Leave these columns blank

Coding of structured questions is relatively simple, because the response options are predetermined.
The researcher assigns a code for each response to each question and specifies the appropriate record and columns in which the response codes are to appear. For example,

Do you have a currently valid passport?
1. Yes   2. No   (1/54)

For this question, a "Yes" response is coded 1 and a "No" response 2. The numbers in parentheses indicate that the code assigned will appear on the first record for this respondent in column 54. Because only one response is allowed and there are only two possible responses (1 or 2), a single column is sufficient. In general, a single column is sufficient to code a structured question with a single response if there are fewer than nine possible responses.

codebook: A book containing coding instructions and the necessary information about variables in the data set.

CHAPTER 14 Data Preparation

In questions that permit a large number of responses, each possible response option should be assigned a separate column. Such questions include those about brand ownership or usage, magazine readership, and television viewing. For example,

Which accounts do you now have at this bank? ("X" as many as apply)

Regular savings account                      (162)
Regular checking account                     (163)
Mortgage                                     (164)
NOW account                                  (165)
Club account (Christmas, etc.)               (166)
Line of credit                               (167)
Term savings account (time deposits, etc.)   (168)
Savings bank life insurance                  (169)
Home improvement loan                        (170)
Auto loan                                    (171)
Other services                               (172)

In this example, suppose a respondent checked regular savings, regular checking, and term savings accounts. On record #1, a 1 will be entered in the column numbers 162, 163, and 168. All the other columns (164, 165, 166, 167, 169, 170, 171, and 172) will receive a 0. Since there is only one record per respondent, the record number has been omitted.

The coding of unstructured or open-ended questions is more complex. Respondents' verbatim responses are recorded on the questionnaire. Codes are then developed and assigned to these responses.
Sometimes, based on previous projects or theoretical considerations, the researcher can develop the codes before beginning fieldwork. Usually this must wait until the completed questionnaires are received. Then the researcher lists 50 to 100 responses to an unstructured question to identify the categories suitable for coding. Once codes are developed, the coders should be trained to assign the correct codes to the verbatim responses. The following guidelines are suggested for coding unstructured questions and questionnaires in general.

Category codes should be mutually exclusive and collectively exhaustive. Categories are mutually exclusive if each response fits into one and only one category code. Categories should not overlap. Categories are collectively exhaustive if every response fits into one of the assigned category codes. This can be achieved by adding an additional category code of "other" or "none of the above." However, only a few (10 percent or less) of the responses should fall into this category. The vast majority of the responses should be classified into meaningful categories.

Category codes should be assigned for critical issues even if no one has mentioned them. It may be important to know that no one has mentioned a particular response. For example, the management of a major consumer goods company was concerned about the packaging for a new brand of toilet soap. Hence, packaging was included as a separate category in coding responses to the question, "What do you like least about this toilet soap?"

Data should be coded to retain as much detail as possible. For example, if data on the exact number of trips made on commercial airlines by business travelers have been obtained, they should be coded as such, rather than grouped into two category codes of "infrequent fliers" and "frequent fliers." Obtaining information on the exact number of trips allows the researcher to later define categories of business travelers in several different ways.
If the categories were predefined, the subsequent analysis of data would be limited by those categories.

Codebook

A codebook contains coding instructions and the necessary information about variables in the data set. A codebook guides the coders in their work and helps the researcher to properly identify and locate the variables. Even if the questionnaire has been precoded, it is helpful to prepare a formal codebook. A codebook generally contains the following information: (1) column number, (2) record number, (3) variable number, (4) variable name, (5) question number, and (6) instructions for coding. Figure 14.2 is an excerpt from a codebook developed for the department store project. Figure 14.3 is an example of questionnaire coding, showing the coding of demographic data typically obtained in consumer surveys. The questionnaire in the following example was precoded.

Figure 14.3 Example of Questionnaire Coding Showing Coding of Demographic Data

Finally, in this part of the questionnaire we would like to ask you some background information for classification purposes.

PART D

1. This questionnaire was answered by (229)
   1. Primarily the male head of household
   2. Primarily the female head of household
   3. Jointly by the male and female heads of household
2. Marital status (230)
   1. Married
   2. Never married
   3. Divorced/separated/widowed
3. What is the total number of family members living at home? ____ (231-232)
4. Number of children living at home:
   1. Under six years ____ (233)
   2. Over six years ____ (234)
5. Number of children not living at home ____ (235)
6. Number of years of formal education which you (and your spouse, if applicable) have completed (please circle)
              High School            College Undergraduate   Graduate
   1. You     8 or less 9 10 11 12   13 14 15 16             17 18 19 20 21 22 or more   (236-237)
   2. Spouse  8 or less 9 10 11 12   13 14 15 16             17 18 19 20 21 22 or more   (238-239)
7. 1. Your age ____ (240-241)
   2. Age of spouse (if applicable) ____ (242-243)
8. If employed, please indicate your household's occupations by checking the appropriate category. (244) (245)
   1. Professional and technical
   2. Managers and administrators
   3. Sales workers
   4. Clerical and kindred workers
   5. Craftsmen/operatives/laborers
   6. Homemakers
   7. Others (please specify)
   8. Not applicable
9. Is your place of residence presently owned by household? (246)
   1. Owned
   2. Rented
10. How many years have you been residing in the greater Atlanta area? ____ years (247-248)
11. What is the approximate combined annual income of your household before taxes? Please check. (249-250)
   1. Less than $10,000       8. $40,000 to 44,999
   2. $10,000 to 14,999       9. $45,000 to 49,999
   3. $15,000 to 19,999      10. $50,000 to 54,999
   4. $20,000 to 24,999      11. $55,000 to 59,999
   5. $25,000 to 29,999      12. $60,000 to 69,999
   6. $30,000 to 34,999      13. $70,000 to 89,999
   7. $35,000 to 39,999      14. $90,000 and over

Note: Columns 1 through 228 contain the respondent ID, project information, and information pertaining to parts A, B, and C of the questionnaire. There is only one record per respondent.

Visit www.patriots.com and conduct an Internet search using a search engine and your library's online database to obtain information on why people attend professional football games.
As the marketing director for the New England Patriots, what information would you like to have to formulate marketing strategies to increase the attendance at the Patriots home games?
A survey was administered to attendees at a Patriots home game to determine why they were attending. What principles will you follow in checking the questionnaire, editing, and coding?

TRANSCRIBING

Transcribing data involves transferring the coded data from the questionnaires or coding sheets onto disks or magnetic tapes or directly into computers by keypunching. If the data have been collected via CATI
or CAPI, this step is unnecessary because the data are entered directly into the computer as they are collected. Besides keypunching, the data can be transferred by using mark sense forms, optical scanning, or computerized sensory analysis (see Figure 14.4). Mark sense forms require responses to be recorded with a special pencil in a predesignated area coded for that response. The data can then be read by a machine. Optical scanning involves direct machine reading of the codes and simultaneous transcription. A familiar example of optical scanning is the transcription of UPC (universal product code) data at supermarket checkout counters. Technological advances have resulted in computerized sensory analysis systems, which automate the data-collection process. The questions appear on a computerized gridpad, and responses are recorded directly into the computer using a sensing device.

If keypunching is used, errors can occur, and it is necessary to verify the data set, or at least a portion of it, for keypunching errors. A verifier machine and a second operator are utilized for data verification. The second operator repunches the data from the coded questionnaires. The transcribed data from the two operators are compared record by record. Any discrepancy between the two sets of transcribed data is investigated to identify and correct for keypunching errors. Verification of the entire data set will double the time and cost of data transcription. Given the time and cost constraints, as well as the fact that experienced keypunch operators are quite accurate, it is sufficient to verify only 25 to 50 percent of the data.
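
The record-by-record comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the dictionary layout (respondent ID mapped to a fixed-field record string), and the sample records are hypothetical, and the second operator is assumed to have re-keyed only a portion of the respondents.

```python
def find_discrepancies(first_pass, second_pass):
    """Compare two keyed transcriptions record by record.

    Each argument maps a respondent ID to that respondent's fixed-field
    record string. Only respondents present in both passes are compared.
    Returns (respondent_id, column, first_value, second_value) tuples,
    with columns numbered from 1 as in a codebook.
    """
    discrepancies = []
    for respondent_id in sorted(first_pass.keys() & second_pass.keys()):
        a = first_pass[respondent_id]
        b = second_pass[respondent_id]
        width = max(len(a), len(b))
        for col in range(width):
            ch_a = a[col] if col < len(a) else ""
            ch_b = b[col] if col < len(b) else ""
            if ch_a != ch_b:
                discrepancies.append((respondent_id, col + 1, ch_a, ch_b))
    return discrepancies


# Hypothetical records: operator 2 re-keyed two of the three respondents.
operator_1 = {"001": "6544234853", "002": "5564435453", "003": "4655243324"}
operator_2 = {"001": "6544234853", "002": "5564435458"}

for rid, col, x, y in find_discrepancies(operator_1, operator_2):
    print(f"respondent {rid}, column {col}: {x!r} vs {y!r}")
```

Each reported discrepancy would still have to be resolved by going back to the coded questionnaire, since the comparison alone cannot tell which operator made the error.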
[Figure 14.4 Data Transcription: CATI or CAPI, keypunching via CRT terminal, mark sense forms, optical scanning, and computerized sensory analysis; verification (correct keypunching errors); disks and magnetic tapes]

data cleaning: Thorough and extensive checks for consistency and treatment of missing responses.

consistency checks: A part of the data cleaning process that identifies data that are out of range, logically inconsistent, or have extreme values. Data with values not defined by the coding scheme are inadmissible.

When CATI or CAPI are employed, data are verified as they are collected. In the case of inadmissible responses, the computer will prompt the interviewer or respondent. In the case of admissible responses, the interviewer or the respondent can see the recorded response on the screen and verify it before proceeding.

The selection of a data-transcription method is guided by the type of interviewing method used and the availability of equipment. If CATI or CAPI are used, the data are entered directly into the computer. Keypunching via CRT terminal is most frequently used for ordinary telephone, in-home, mall-intercept, and mail interviews. However, the use of computerized sensory analysis systems in personal interviews is increasing with the increasing use of gridpads and hand-held computers. Optical scanning can be used in structured and repetitive surveys, and mark sense forms are used in special cases.⁵

Scanning the Seas

As of 2006, Princess Cruises (www.princess.com), part of Carnival Corporation, annually carried more than a million passengers. Princess wished to know what passengers thought of the cruise experience, but wanted to determine this information in a cost-effective way. A scannable questionnaire was developed that allowed the cruise line to quickly transcribe the data from thousands of surveys, thus expediting data preparation and analysis. This questionnaire is distributed to measure customer satisfaction on all voyages.
In addition to saving time as compared to keypunching, scanning has also increased the accuracy of the survey results. The senior market researcher for Princess Cruises, Jaime Goldfarb, commented, "When we compared the data files from the two methods, we found that although the scanned system occasionally missed marks because they had not been filled in properly, the scanned data file was still more accurate than the keypunched file."

A monthly report by cruise destination and ship is produced. This report identifies any specific problems that have been noticed, and steps are taken to make sure these problems are addressed. Recently these surveys have led to changes in the menu and the various buffets located around the ship.

DATA CLEANING

Data cleaning includes consistency checks and treatment of missing responses. Although preliminary consistency checks have been made during editing, the checks at this stage are more thorough and extensive, because they are made by computer.

Consistency Checks

Consistency checks identify data that are out of range, logically inconsistent, or have extreme values. Out-of-range data values are inadmissible and must be corrected. For example, respondents have been asked to express their degree of agreement with a series of lifestyle statements on a 1-to-5 scale. Assuming that 9 has been designated for missing values, data values of 0, 6, 7, and 8 are out of range. Computer packages like SPSS, SAS, EXCEL, and MINITAB can be programmed to identify out-of-range values for each variable and print out the respondent code, variable code, variable name, record number, column number, and out-of-range value. This makes it easy to check each variable systematically for out-of-range values. The correct responses can be determined by going back to the edited and coded questionnaire.

Responses can be logically inconsistent in various ways.
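
Returning to the out-of-range check on the 1-to-5 scale described above, a minimal Python sketch might look as follows. The variable names (`life_1`, `life_2`), the record layout, and the sample data are assumptions made for illustration; a statistical package would report more fields, such as record and column numbers.

```python
VALID_CODES = {1, 2, 3, 4, 5}   # the 1-to-5 agreement scale
MISSING_CODE = 9                # designated missing-value code


def out_of_range(records, variables):
    """Return (respondent_id, variable, value) for every inadmissible value."""
    problems = []
    for record in records:
        for var in variables:
            value = record[var]
            if value not in VALID_CODES and value != MISSING_CODE:
                problems.append((record["respondent_id"], var, value))
    return problems


# Hypothetical data: three respondents, two lifestyle statements.
records = [
    {"respondent_id": "001", "life_1": 4, "life_2": 9},
    {"respondent_id": "002", "life_1": 6, "life_2": 2},
    {"respondent_id": "003", "life_1": 1, "life_2": 0},
]

for rid, var, value in out_of_range(records, ["life_1", "life_2"]):
    print(f"respondent {rid}: {var} = {value} is out of range")
```

The missing-value code 9 is deliberately not flagged, because it is defined by the coding scheme even though it lies outside the 1-to-5 scale.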
For example, a respondent may indicate that she charges long-distance calls to a calling card, although she does not have one. Or a respondent reports both unfamiliarity with, and frequent usage of, the same product. The necessary information (respondent code, variable code, variable name, record number, column number, and inconsistent values) can be printed to locate these responses and take corrective action.

Finally, extreme values should be closely examined. Not all extreme values result from errors, but they may point to problems with the data. For example, an extremely low evaluation of a brand may be the result of the respondent indiscriminately circling 1s (on a 1-to-7 rating scale) on all attributes of this brand.

missing responses: Values of a variable that are unknown, because these respondents did not provide unambiguous answers to the question.

casewise deletion: A method for handling missing responses in which cases or respondents with any missing responses are discarded from the analysis.

pairwise deletion: A method of handling missing values in which all cases, or respondents, with any missing values are not automatically discarded; rather, for each calculation only the cases or respondents with complete responses are considered.

Treatment of Missing Responses

Missing responses represent values of a variable that are unknown, either because respondents provided ambiguous answers or their answers were not properly recorded. Treatment of missing responses poses problems, particularly if the proportion of missing responses is more than 10 percent. The following options are available for the treatment of missing responses.

1. Substitute a Neutral Value. A neutral value, typically the mean response to the variable, is substituted for the missing responses. Thus, the mean of the variable remains unchanged and other statistics, such as correlations, are not affected much.
Although this approach has some merit, the logic of substituting a mean value (say 4) for respondents who, if they had answered, might have used either high ratings (6 or 7) or low ratings (1 or 2) is questionable.

2. Substitute an Imputed Response. The respondents' pattern of responses to other questions is used to impute or calculate a suitable response to the missing questions. The researcher attempts to infer from the available data the responses the individuals would have given if they had answered the questions. This can be done statistically by determining the relationship of the variable in question to other variables, based on the available data. For example, product usage could be related to household size for respondents who have provided data on both variables. The missing product usage response for a respondent could then be calculated, given that respondent's household size. However, this approach requires considerable effort and can introduce serious bias. Sophisticated statistical procedures have been developed to calculate imputed values for missing responses.

Imputation Increases Integrity

A project was undertaken to assess the willingness of households to implement the recommendations of an energy audit (dependent variable), given the financial implications. The independent variables consisted of five financial factors that were manipulated at known levels, and their values were always known by virtue of the design adopted. However, several values of the dependent variable were missing. These missing values were replaced with imputed values. The imputed values were statistically calculated, given the corresponding values of the independent variables. The treatment of missing responses in this manner greatly increased the simplicity and validity of subsequent analysis.

3. Casewise Deletion. In casewise deletion, cases, or respondents, with any missing responses are discarded from the analysis.
Because many respondents may have some missing responses, this approach could result in a small sample. Throwing away large amounts of data is undesirable, because it is costly and time consuming to collect data. Furthermore, respondents with missing responses could differ from respondents with complete responses in systematic ways. If so, casewise deletion could seriously bias the results.

4. Pairwise Deletion. In pairwise deletion, instead of discarding all cases with any missing values, the researcher uses only the cases or respondents with complete responses for each calculation. As a result, different calculations in an analysis may be based on different sample sizes. This procedure may be appropriate when (1) the sample size is large, (2) there are few missing responses, and (3) the variables are not highly related. Yet this procedure can produce results that are unappealing or even infeasible.

weighting: A statistical adjustment to the data in which each case or respondent in the database is assigned a weight to reflect its importance relative to other cases or respondents.

The different procedures for the treatment of missing responses may yield different results, particularly when the responses are not missing at random and the variables are related. Hence, missing responses should be kept to a minimum. The researcher should carefully consider the implications of the various procedures before selecting a particular method for the treatment of nonresponse.

STATISTICALLY ADJUSTING THE DATA

Procedures for statistically adjusting the data consist of weighting, variable respecification, and scale transformations. These adjustments are not always necessary but can enhance the quality of data analysis.

Weighting

In weighting, each case or respondent in the database is assigned a weight to reflect its importance relative to other cases or respondents. The value 1.0 represents the unweighted case.
The effect of weighting is to increase or decrease the number of cases in the sample that possess certain characteristics. (See Chapter 12, which discussed the use of weighting to adjust for nonresponse.)

Weighting is most widely used to make the sample data more representative of a population on specific characteristics. For example, it may be used to give greater importance to cases or respondents with higher quality data. Yet another use of weighting is to adjust the sample so that greater importance is attached to respondents with certain characteristics. If a study is conducted to determine what modifications should be made to an existing product, the researcher might want to attach greater weight to the opinions of heavy users of the product. This could be accomplished by assigning weights of 3.0 to heavy users, 2.0 to medium users, and 1.0 to light users and nonusers. Weighting should be applied with caution, because it destroys the self-weighting nature of the sample design.

Determining the Weight of Fast-Food Customers

A mail survey was conducted in the Los Angeles-Long Beach area to determine consumer patronage of fast-food restaurants. The resulting sample composition differed in educational level from the area population distribution as compiled from recent census data. Therefore, the sample was weighted to make it representative in terms of educational level. The weights applied were determined by dividing the population percentage by the corresponding sample percentage. The distribution of education for the sample and population, as well as the weights applied, are given in the following table.

Use of Weighting for Representativeness

Years of Education      Sample        Population    Weight
                        Percentage    Percentage
Elementary School
  0 to 7 years            2.49          4.23        1.70
  8 years                 1.26          2.19        1.74
High School
  1 to 3 years            6.39          8.65        1.35
  4 years                25.39         29.24        1.15
College
  1 to 3 years           22.33         29.42        1.32
  4 years                15.02         12.01        0.80
  5 to 6 years           14.94          7.36        0.49
  7 years or more        12.18          6.90        0.57
Totals                  100.00        100.00

variable respecification: The transformation of data to create new variables or the modification of existing variables so that they are more consistent with the objectives of the study.

dummy variables: A respecification procedure using variables that take on only two values, usually 0 or 1.

scale transformation: A manipulation of scale values to ensure comparability with other scales or otherwise make the data suitable for analysis.

Categories underrepresented in the sample received higher weights, whereas overrepresented categories received lower weights. Thus, the data for a respondent with 1 to 3 years of college education should be overweighted by multiplying by (29.42/22.33 =) 1.32, whereas the data for a respondent with 7 or more years of college education should be underweighted by multiplying by (6.90/12.18 =) 0.57.

If used, the weighting procedure should be documented and made a part of the project report.

Variable Respecification

Variable respecification involves the transformation of data to create new variables or modify existing variables. The purpose of respecification is to create variables that are consistent with the objectives of the study. For example, suppose the original variable was product usage, with 10 response categories. These might be collapsed into four categories: heavy, medium, light, and nonuser. Or the researcher may create new variables that are composites of several other variables. For example, the researcher may create an Index of Information Search (IIS), which is the sum of information customers seek from dealers, promotional materials, the Internet, and other independent sources. Likewise, one may take the ratio of variables. If the amount of purchases at department stores (X_1) and the amount of purchases charged (X_2) have been measured, the proportion of purchases charged can be a new variable created by taking the ratio of the two (X_2/X_1).
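
The three respecifications described so far (collapsing categories, a summed index, and a ratio) can be sketched in Python as follows. All names are hypothetical, and the cut points used to collapse the 10 usage categories into four are an assumption for illustration only; the text does not say where the boundaries fall.

```python
def usage_category(code):
    """Collapse a 10-category usage code (1-10) into four groups.

    The cut points below are illustrative assumptions, not taken
    from the text.
    """
    if code == 1:
        return "nonuser"
    if code <= 4:
        return "light"
    if code <= 7:
        return "medium"
    return "heavy"


def information_search_index(dealers, promo, internet, independent):
    """Index of Information Search (IIS): sum of the four source scores."""
    return dealers + promo + internet + independent


def proportion_charged(total_purchases, charged_purchases):
    """X_2 / X_1, or None when there are no purchases to divide by."""
    if total_purchases == 0:
        return None
    return charged_purchases / total_purchases


# One hypothetical respondent.
print(usage_category(8))                      # collapsed usage group
print(information_search_index(3, 2, 5, 1))   # IIS
print(proportion_charged(400.0, 100.0))       # X_2 / X_1
```

Keeping the original variables alongside the respecified ones means the categories or the index can be redefined later without returning to the questionnaires.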
Other respecifications of variables include square root and log transformations, which are often applied to improve the fit of the model being estimated.

An important respecification procedure involves the use of dummy variables for respecifying categorical variables. Dummy variables are also called binary, dichotomous, instrumental, or qualitative variables. They are variables that may take on only two values, such as 0 or 1. The general rule is that to respecify a categorical variable with K categories, K - 1 dummy variables are needed. The reason for having K - 1, rather than K, dummy variables is that only K - 1 categories are independent. Given the sample data, information about the Kth category can be derived from information about the other K - 1 categories. Consider sex, a variable having two categories. Only one dummy variable is needed. Information on the number or percentage of males in the sample can be readily derived from the number or percentage of females.

"Frozen" Consumers Treated as Dummies

In a study of consumer preferences for frozen foods, the respondents were classified as heavy, medium, light, and nonusers and originally assigned codes of 4, 3, 2, and 1, respectively. This coding was not meaningful for several statistical analyses. In order to conduct these analyses, product usage was represented by three dummy variables, X_1, X_2, and X_3, as shown.

Product Usage     Original Variable    Dummy Variable Code
Category          Code                 X_1    X_2    X_3
Nonusers          1                    1      0      0
Light users       2                    0      1      0
Medium users      3                    0      0      1
Heavy users       4                    0      0      0

Note that X_1 = 1 for nonusers and 0 for all others. Likewise, X_2 = 1 for light users and 0 for all others, and X_3 = 1 for medium users and 0 for all others. In analyzing the data, X_1, X_2, and X_3 are used to represent all user/nonuser groups.

Scale Transformation

Scale transformation involves a manipulation of scale values to ensure comparability with other scales or otherwise make the data suitable for analysis. Frequently, different scales are employed for measuring different variables.
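
Returning briefly to the frozen-food example, the dummy coding in its table can be sketched in Python as shown here. The function and constant names are hypothetical; the mapping itself follows the table, with heavy users (code 4) as the category that receives 0 on all three dummies.

```python
# Usage codes from the example: 1 = nonuser, 2 = light, 3 = medium, 4 = heavy.
REFERENCE_CODE = 4        # heavy users: all three dummies are 0
DUMMY_CODES = (1, 2, 3)   # codes that get their own dummy (X_1, X_2, X_3)


def to_dummies(usage_code):
    """Return (X_1, X_2, X_3) for one respondent's original usage code."""
    if usage_code != REFERENCE_CODE and usage_code not in DUMMY_CODES:
        raise ValueError(f"unexpected usage code: {usage_code}")
    return tuple(1 if usage_code == code else 0 for code in DUMMY_CODES)


for code, label in [(1, "Nonusers"), (2, "Light users"),
                    (3, "Medium users"), (4, "Heavy users")]:
    print(label, to_dummies(code))
```

With K = 4 categories only K - 1 = 3 dummies are produced, which matches the general rule stated earlier: the fourth group is identified by all three dummies being 0.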
For example, image variables may be measured on a 7-point semantic differential scale, attitude variables on a continuous rating scale, and lifestyle variables on a 5-point Likert scale. Therefore, it would not be meaningful to make comparisons across the measurement scales for any respondent. To compare attitudinal scores with lifestyle or image scores, it would be necessary to transform the various scales. Even if the same scale is employed for all the variables, different respondents may use the scale differently. For example, some respondents consistently use the upper end of a rating scale, whereas others consistently use the lower end. These differences can be corrected by appropriately transforming the data.

standardization: The process of correcting data to reduce them to the same scale by subtracting the sample mean and dividing by the standard deviation.

PART III Data Collection, Preparation, Analysis, and Reporting

Visit www.lexus.com and conduct an Internet search using a search engine and your library's online database to obtain information on the criteria buyers use in selecting a luxury car brand.
Demographic and psychographic data were obtained in a survey designed to explain the choice of a luxury car brand. What kind of consistency checks, treatment of missing responses, and variable respecification should be conducted?
As the marketing manager for Lexus, what information would you like to have to formulate marketing strategies to increase your market share?

Health Care Services: Transforming Consumers

In a study examining preference segmentation of health care services, respondents were asked to rate the importance of 18 factors affecting preferences for hospitals on a 3-point scale (very, somewhat, or not important). Before analyzing the data, each individual's ratings were transformed. For each individual, preference responses were averaged across all 18 items. Then this mean was subtracted from each item rating and a constant was added to the difference.
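The per-respondent transformation just described can be sketched as follows. This is a minimal illustration under stated assumptions: the numeric coding of the 3-point scale (3 = very, 2 = somewhat, 1 = not important), the constant of 2, and the two example respondents are all hypothetical and are not reported in the study.

```python
from statistics import mean


def center_ratings(ratings, constant):
    """Subtract the respondent's own mean from each rating, then add a constant."""
    respondent_mean = mean(ratings)
    return [r - respondent_mean + constant for r in ratings]


# Hypothetical respondents rating 18 factors (assumed coding: 3 = very,
# 2 = somewhat, 1 = not important).
mostly_very_important = [3] * 16 + [2] * 2   # uses the top of the scale
selective_rater = [3] * 3 + [2] * 5 + [1] * 10  # reserves "very" for a few items

C = 2  # assumed constant, chosen so every transformed value stays positive

for label, ratings in (("mostly very important", mostly_very_important),
                       ("selective", selective_rater)):
    transformed = center_ratings(ratings, C)
    print(label, round(min(transformed), 2), round(max(transformed), 2))
```

After the transformation every respondent has the same mean (the constant), so what remains is each person's relative ordering of the items.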
Thus, the transformed data, $X_t$, were obtained by:

$$X_t = X_i - \bar{X} + C$$

Subtraction of the mean value corrected for uneven use of the importance scale. The constant $C$ was added to make all the transformed values positive, because negative importance ratings are not meaningful conceptually. This transformation was desirable because some respondents, especially those with low incomes, had rated almost all the preference items as very important. Others, high-income respondents in particular, had assigned the very important rating to only a few preference items. Thus, subtraction of the mean value provided a more accurate idea of the relative importance of the factors.12

In this example, the scale transformation is corrected only for the mean response. A more common transformation procedure is standardization. To standardize a scale $X_i$, we first subtract the mean, $\bar{X}$, from each score and then divide by the standard deviation, $s_x$. Thus, the standardized scale will have a mean of zero and a standard deviation of 1. This is essentially the same as the calculation of z scores (see Chapter 12). Standardization allows the researcher to compare variables that have been measured using different types of scales. Mathematically, standardized scores, $z_i$, may be obtained as:

$$z_i = \frac{X_i - \bar{X}}{s_x}$$

SELECTING A DATA ANALYSIS STRATEGY

The process of selecting a data analysis strategy is described in Figure 14.5. The selection of a data analysis strategy should be based on the earlier steps of the marketing research process, known characteristics of the data, properties of statistical techniques, and the background and philosophy of the researcher.

Data analysis is not an end in itself. Its purpose is to produce information that will help address the problem at hand.
The selection of a data analysis strategy must begin with a consideration of the earlier steps in the process: problem definition (Step I), development of an approach (Step II), and research design (Step III). The preliminary plan of data analysis prepared as part of the research design should be used as a springboard. Changes may be necessary in light of additional information generated in subsequent stages of the research process.

Figure 14.5 Selecting a Data Analysis Strategy
(Boxes: Earlier Steps (I, II, and III) of the Marketing Research Process; Properties of Statistical Techniques; Background and Philosophy of the Researcher; Data Analysis Strategy)

univariate techniques: Statistical techniques appropriate for analyzing data when there is a single measurement of each element in the sample or, if there are several measurements on each element, each variable is analyzed in isolation.
multivariate techniques: Statistical techniques suitable for analyzing data when there are two or more measurements on each element and the variables are analyzed simultaneously. Multivariate techniques are concerned with the simultaneous relationships among two or more phenomena.

The next step is to consider the known characteristics of the data. The measurement scales used exert a strong influence on the choice of statistical techniques (see Chapter 8). In addition, the research design may favor certain techniques. For example, analysis of variance (see Chapter 16) is suited for analyzing experimental data from causal designs. The insights into the data obtained during data preparation can be valuable for selecting a strategy for analysis.

It is also important to take into account the properties of the statistical techniques, particularly their purpose and underlying assumptions. Some statistical techniques are appropriate for examining differences in variables, others for assessing the magnitudes of the relationships between variables, and others for making predictions.
The techniques also involve different assumptions, and some techniques can withstand violations of the underlying assumptions better than others. A classification of statistical techniques is presented in the next section.

Finally, the researcher's background and philosophy affect the choice of a data analysis strategy. The experienced, statistically trained researcher will employ a range of techniques, including advanced statistical methods. Researchers differ in their willingness to make assumptions about the variables and their underlying populations. Researchers who are conservative about making assumptions will limit their choice of techniques to distribution-free methods. In general, several techniques may be appropriate for analyzing the data from a given project.

A CLASSIFICATION OF STATISTICAL TECHNIQUES

metric data: Data that are interval or ratio in nature.
nonmetric data: Data derived from a nominal or ordinal scale.
independent: The samples are independent if they are drawn randomly from different populations.

Statistical techniques can be classified as univariate or multivariate. Univariate techniques are appropriate when there is a single measurement of each element in the sample, or there are several measurements of each element but each variable is analyzed in isolation. Multivariate techniques, on the other hand, are suitable for analyzing data when there are two or more measurements of each element and the variables are analyzed simultaneously. Multivariate techniques are concerned with the simultaneous relationships among two or more phenomena. Multivariate techniques differ from univariate techniques in that they shift the focus away from the levels (averages) and distributions (variances) of the phenomena, concentrating instead upon the degree of relationships (correlations or covariances) among these phenomena. The univariate and multivariate techniques are
described in detail in subsequent chapters; here we show how the various techniques relate to each other in an overall scheme of classification.

Univariate techniques can be classified based on whether the data are metric or nonmetric. Metric data are measured on an interval or ratio scale. Nonmetric data are measured on a nominal or ordinal scale (see Chapter 8). These techniques can be

paired: The samples are paired when the data for the two samples relate to the same group of respondents.
dependence techniques: Multivariate techniques appropriate when one or more of the variables can be identified as dependent variables and the remaining as independent variables.

Figure 14.6 A Classification of Univariate Techniques

FUNCTION>ALL>IF. certain statistical techniques in specific settings. It is possible to surf the Net for new statistical techniques that are not yet available in commonly used statistical packages. News groups and special-interest groups are useful sources for a variety of statistical information.

SPSS WINDOWS

Using the Base module of SPSS, out-of-range values can be selected using the SELECT IF command. These cases, with the identifying information (subject ID, record number, variable name, and variable value), can then be printed using the LIST or PRINT commands. The PRINT command will save active cases to an external file. If a formatted list is required, the SUMMARIZE command can be used.

SPSS Data Entry can facilitate data preparation. You can verify that respondents have answered completely by setting rules. These rules can be used on existing datasets to validate and check the data, whether or not the questionnaire used to collect the data was constructed in Data Entry. Data Entry allows you to control and check the entry of data through three types of rules: validation, checking, and skip and fill rules.

Although the missing values can be treated within the context of the Base module,
SPSS Missing Values Analysis can assist in diagnosing missing values and replacing missing values with estimates. TextSmart by SPSS can help in the coding and analysis of open-ended responses.

We illustrate the use of the Base module in creating new variables and recoding existing ones using the data of Table 14.2. This table gives the data from a pretest sample of 20 respondents on preferences for a restaurant. Each respondent was asked to rate his or her preference to eat in a familiar restaurant (1 = Weak Preference, 7 = Strong Preference), and to rate the restaurant in terms of quality of food, quantity of portions, value, and service (1 = Poor, 7 = Excellent). Annual household income was also obtained and coded as: 1 = Less than $20,000, 2 = $20,000-34,999, 3 = $35,000-49,999, 4 = $50,000-74,999, 5 = $75,000-99,999, 6 = $100,000 or more.

We want to create a variable called overall evaluation of the restaurant (Overall) that is the sum of the ratings on quality, quantity, value, and service. Thus,

Overall = Quality + Quantity + Value + Service

The screen captures using SPSS Windows for these steps can be downloaded from the Web site for this book. These steps are as follows.

1. Select TRANSFORM.
2. Click on COMPUTE.
3. Type "overall" in the TARGET VARIABLE box.
4. Click on "quality" and move it to the NUMERIC EXPRESSIONS box.
5. Click on the "+" sign.
6. Click on "quantity" and move it to the NUMERIC EXPRESSIONS box.
7. Click on the "+" sign.
8. Click on "value" and move it to the NUMERIC EXPRESSIONS box.
9. Click on the "+" sign.
10. Click on "service" and move it to the NUMERIC EXPRESSIONS box.
11. Click on TYPE & LABEL under the TARGET VARIABLE box and type "Overall Evaluation." Click on CONTINUE.
12. Click OK.

We also want to illustrate the recoding of variables to create new variables.
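For readers working outside SPSS, the COMPUTE steps above amount to adding four columns. The sketch below shows the same calculation in Python; the function name and the three respondents are hypothetical and are not the Table 14.2 data.

```python
def overall_evaluation(quality: int, quantity: int, value: int, service: int) -> int:
    """Overall = Quality + Quantity + Value + Service, each rated on a 1-to-7 scale."""
    for rating in (quality, quantity, value, service):
        if not 1 <= rating <= 7:
            raise ValueError("each rating must be between 1 and 7")
    return quality + quantity + value + service


# Hypothetical respondents (illustrative values only, not Table 14.2).
respondents = [
    {"id": 1, "quality": 5, "quantity": 4, "value": 6, "service": 3},
    {"id": 2, "quality": 7, "quantity": 7, "value": 6, "service": 7},
    {"id": 3, "quality": 2, "quantity": 3, "value": 1, "service": 2},
]

for r in respondents:
    # Store the new variable alongside the original ratings.
    r["overall"] = overall_evaluation(r["quality"], r["quantity"],
                                      r["value"], r["service"])
    print(r["id"], r["overall"])
```

The range check is an added safeguard, in the spirit of the out-of-range checks discussed earlier, so that a miskeyed rating does not silently distort the sum.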
Income category 1 occurs only once and income category 6 occurs only twice. So we want to combine income categories 1 and 2, and categories 5 and 6, and create a new income variable "rincome" labeled "Recoded Income." Note that rincome has only four categories that are coded as 1 to 4. This can be done in SPSS Windows (download screen captures from the Web site for this book) as follows.

1. Select TRANSFORM.
2. Click on RECODE and select INTO DIFFERENT VARIABLES.
3. Click on income and move it to the NUMERIC VARIABLE -> OUTPUT VARIABLE box.
4. Type "rincome" in the OUTPUT VARIABLE NAME box.
5. Type "Recoded Income" in the OUTPUT VARIABLE LABEL box.
6. Click the OLD AND NEW VALUES box.
7. Under OLD VALUES on the left click RANGE. Type 1 and 2 in the range boxes. Under NEW VALUES on the right click VALUE and type 1 in the value box. Click ADD.
8. Under OLD VALUES on the left click VALUE. Type 3 in the value box. Under NEW VALUES on the right click VALUE and type 2 in the value box. Click ADD.
9. Under OLD VALUES on the left click VALUE. Type 4 in the value box. Under NEW VALUES on the right click VALUE and type 3 in the value box. Click ADD.
10. Under OLD VALUES on the left click RANGE. Type 5 and 6 in the range boxes. Under NEW VALUES on the right click VALUE and type 4 in the value box. Click ADD.
11. Click CONTINUE.
12. Click CHANGE.
13. Click OK.

Data Analysis Strategy

As part of the analysis conducted in the department store project, store choice was modeled in terms of store image characteristics or the factors influencing the choice criteria. The sample was split into halves. The respondents in each half were clustered on the basis of the importance attached to the store image characteristics. Statistical tests for clusters were conducted and four segments were identified. Store preference was modeled in terms of the evaluations of the stores on the image variables. The model was estimated separately for each segment.
Differences between segment preference functions were statistically tested. Finally, model validation and cross-validation were conducted for each segment. The data analysis strategy adopted is depicted in the following diagram.

(Diagram: Subsample 1 and Subsample 2; Statistical Tests for Differences in Segments; Model Validation for each of the four segments)

Project Activities

Download the SPSS data file Sears Data 14 from the Web site for this book. This file contains information on who in the household does most of the shopping in department stores, familiarity ratings with each of the 10 department stores, and demographic data. The measurement of these variables is described in Chapter 1. The remaining variables have not been included so that the number of variables will be less than 50 and you can use the student SPSS software.

1. Determine how many cases of familiarity with Kohl's have missing values.
2. How are missing values coded?
3. Replace the missing values of familiarity with Kohl's with the mean value.
4. Compute an overall familiarity score by summing the familiarity with each of the 10 department stores.
5. The demographic variables are described in Figure 14.3.
Recode the demographic variables as follows:

Marital status: 1 = 1; 2 or 3 = 2
Total number of family members: 1 = 1; 2 = 2; 3 = 3; 4 = 4; and 5 or more =
Children under six years: 0 = 1; 1 or more = 2
Children over six years: 0 = 1; 1 = 2; 2 or more = 3
Children not living at home: 0 = 1; 1 = 2; 2 = 3; 3 or more = 4
Formal education (you and spouse): 12 or less = 1; 13 to 15 = 2; 16 to ...; 19 or more = 4
Age (you and spouse): less than 30 = 1; 30 to 39 = 2; 40 to 49 = 3; 50 to 59 = 4; 60 to 69 = 5; 70 or older = 6
Occupation (Male Head): 1 or 2 = 1; 3, 4, or 5 = 2; 6, 7, or 8 = 3
Occupation (Female Head): 1 or 2 = 1; 3, 4, or 5 = 2; 6, 7, or 8 = 3
Years of residency: 5 or less = 1; 6 to 10 = 2; 11 to 20 = 3; 21 to 30 = 4; 31 to 40 ...; 41 or more = 6
Income: 1, 2, 3, or 4 = 1; 5, 6, or 7 ...; 8 or 9 = 3; 10 or 11 = 4; 12, 13, or ...

EXPERIENTIAL RESEARCH

Download the Dell case and questionnaire from the Web site for this book. This information is also given at the end of the book. Download the Dell SPSS data file.

1. Recode the respondents based on total hours per week spent online into two groups: five hours or less (light users), and six hours or more (heavy users). Calculate a frequency distribution.
2. Recode the respondents based on total hours per week spent online into three groups: five hours or less (light users), six to 10 hours (medium users), and 11 hours or more (heavy users). Calculate a frequency distribution.
3. Form a new variable that denotes the total number of things that people have ever done online based on q2_1 to q2_7. Run a frequency distribution of the new variable and interpret the results. Note the missing values for q2_1 to q2_7 are coded as 0.
4. Recode q4 (overall satisfaction) into two groups: very satisfied (rating of 1), and somewhat satisfied or dissatisfied (ratings of 2, 3, and 4). Calculate a frequency

0.40

If the null hypothesis $H_0$ is rejected, then the alternative hypothesis $H_1$ will be accepted and the new Internet shopping service will be introduced.
On the other hand, if $H_0$ is not rejected, then the new service should not be introduced unless additional evidence is obtained.

This test of the null hypothesis is a one-tailed test, because the alternative hypothesis is expressed directionally: The proportion of Internet users who use the Internet for shopping is greater than 0.40. On the other hand, suppose the researcher wanted to determine whether the proportion of Internet users who shop via the Internet is different from 40 percent. Then a two-tailed test would be required, and the hypotheses would be expressed as:

$$H_0: \pi = 0.40$$
$$H_1: \pi \neq 0.40$$

In commercial marketing research, the one-tailed test is used more often than a two-tailed test. Typically, there is some preferred direction for the conclusion for which evidence is sought. For example, the higher the profits, sales, and product quality, the better. The one-tailed test is more powerful than the two-tailed test. The power of a statistical test is discussed further in step 3.

Step 2: Select an Appropriate Test

To test the null hypothesis, it is necessary to select an appropriate statistical technique. The researcher should take into consideration how the test statistic is computed and the sampling distribution that the sample statistic (e.g., the mean) follows. The test statistic measures how close the sample has come to the null hypothesis. The test statistic often follows a well-known distribution, such as the normal, t, or chi-square distribution. Guidelines for selecting an appropriate test or statistical technique are discussed later in this chapter. In our example, the z statistic, which follows the standard normal distribution, would be appropriate. This statistic would be computed as follows:

$$z = \frac{p - \pi}{\sigma_p}$$

where

$$\sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}}$$

Step 3: Choose Level of Significance, $\alpha$

Whenever we draw inferences about a population, there is a risk that an incorrect conclusion will be reached. Two types of errors can occur.

Type I error: Also known as alpha error, it occurs when the sample results lead to the rejection of a null hypothesis that is in fact true.
level of significance: The probability of making a Type I error.
Type II error: Also known as beta error,
it occurs when the sample results lead to the nonrejection of a null hypothesis that is in fact false.
power of a test: The probability of rejecting the null hypothesis when it is in fact false and should be rejected.

Figure 15.4 Type I Error ($\alpha$) and Type II Error ($\beta$)

Type I Error. Type I error occurs when the sample results lead to the rejection of the null hypothesis when it is in fact true. In our example, a Type I error would occur if we concluded, based on the sample data, that the proportion of customers preferring the new service plan was greater than 0.40, when in fact it was less than or equal to 0.40. The probability of Type I error ($\alpha$) is also called the level of significance. The Type I error is controlled by establishing the tolerable level of risk of rejecting a true null hypothesis. The selection of a particular risk level should depend on the cost of making a Type I error.

Type II Error. Type II error occurs when, based on the sample results, the null hypothesis is not rejected when it is in fact false. In our example, the Type II error would occur if we concluded, based on sample data, that the proportion of customers preferring the new service plan was less than or equal to 0.40 when, in fact, it was greater than 0.40. The probability of Type II error is denoted by $\beta$. Unlike $\alpha$, which is specified by the researcher, the magnitude of $\beta$ depends on the actual value of the population parameter (proportion). The probability of Type I error ($\alpha$) and the probability of Type II error ($\beta$) are shown in Figure 15.4. The complement ($1 - \beta$) of the probability of a Type II error is called the power of a statistical test.

Power of a Test. The power of a test is the probability ($1 - \beta$) of rejecting the null hypothesis when it is false and should be rejected. Although $\beta$ is unknown, it is related to $\alpha$. An extremely low value of $\alpha$ (e.g., 0.001) will result in intolerably high $\beta$ errors. So it is necessary to balance the two types of errors.
As a compromise, $\alpha$ is often set at 0.05; sometimes it is 0.01; other values of $\alpha$ are rare. The level of $\alpha$, along with the sample size, will determine the level of $\beta$ for a particular research design. The risk of both $\alpha$ and $\beta$ can be controlled by increasing the sample size. For a given level of $\alpha$, increasing the sample size will decrease $\beta$, thereby increasing the power of the test.

Step 4: Collect Data and Calculate Test Statistic

Sample size is determined after taking into account the desired $\alpha$ and $\beta$ errors and other qualitative considerations, such as budget constraints. Then the required data are collected and the value of the test statistic computed. In our example, 30 users were surveyed and 17 indicated that they used the Internet for shopping. Thus the value of the sample proportion is $p = 17/30 = 0.567$.

CHAPTER 15 Frequency Distribution, Cross-Tabulation, and Hypothesis Testing

The value of $\sigma_p$ can be determined as follows:

$$\sigma_p = \sqrt{\frac{\pi(1 - \pi)}{n}} = \sqrt{\frac{(0.40)(0.60)}{30}} = 0.089$$

The test statistic z can be calculated as follows:

$$z = \frac{p - \pi}{\sigma_p} = \frac{0.567 - 0.40}{0.089} = 1.88$$

Step 5: Determine the Probability (Critical Value)

Using standard normal tables (Table 2 of the Statistical Appendix), the probability of obtaining a z value of 1.88 can be calculated (see Figure 15.5). The shaded area between $-\infty$ and 1.88 is 0.9699. Therefore, the area to the right of z = 1.88 is 1.0000 - 0.9699 = 0.0301. Alternatively, the critical value of z, which will give an area to the right side of the critical value of 0.05, is between 1.64 and 1.65 and equals 1.645. Note that in determining the critical value of the test statistic, the area to the right of the critical value is either $\alpha$ or $\alpha/2$. It is $\alpha$ for a one-tail test and $\alpha/2$ for a two-tail test.

Steps 6 and 7: Compare the Probability (Critical Value) and Make the Decision

The probability associated with the calculated or observed value of the test statistic is 0.0301. This is the probability of getting a p value of 0.567 when $\pi$ = 0.40.
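The arithmetic in Steps 4 and 5 can be checked with Python's standard library. The sketch below is an illustration, not the book's procedure: the function name and the returned dictionary keys are assumptions. It computes the statistic at full precision and then repeats the text's calculation with its rounded inputs (0.567 and 0.089), which is where the value 1.88 comes from; without that rounding z is roughly 1.86, and the decision is the same.

```python
from math import sqrt
from statistics import NormalDist


def one_tailed_proportion_test(successes: int, n: int, pi0: float,
                               alpha: float = 0.05) -> dict:
    """Upper-tail z test of H0: pi <= pi0 against H1: pi > pi0."""
    p = successes / n
    sigma_p = sqrt(pi0 * (1 - pi0) / n)       # standard error under H0
    z = (p - pi0) / sigma_p
    std_normal = NormalDist()                  # mean 0, standard deviation 1
    tail_area = 1 - std_normal.cdf(z)          # area to the right of z
    z_critical = std_normal.inv_cdf(1 - alpha)
    return {"p": p, "sigma_p": sigma_p, "z": z, "tail_area": tail_area,
            "z_critical": z_critical, "reject_h0": z > z_critical}


# Full-precision calculation for the example: 17 of 30 users, pi0 = 0.40.
result = one_tailed_proportion_test(17, 30, 0.40)
print(result)

# The text's hand calculation, using its rounded intermediate values.
z_text = round((0.567 - 0.40) / 0.089, 2)
tail_text = 1 - NormalDist().cdf(z_text)
print(z_text, round(tail_text, 4))
```

Both routes lead to rejecting the null hypothesis at the 0.05 level, whether the comparison is made on the tail area or on the critical value.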
This is less than the level of significance of 0.05. Hence, the null hypothesis is rejected. Alternatively, the calculated value of the test statistic z = 1.88 lies in the rejection region, beyond the value of 1.645. Again, the same conclusion to reject the null hypothesis is reached. Note that the two ways of testing the null hypothesis are equivalent but mathematically opposite in the direction of comparison. If the probability associated with the calculated or observed value of the test statistic ($TS_{CAL}$) is less than the level of significance ($\alpha$), the null hypothesis is rejected. However, if the calculated value of the test statistic is greater than the critical value of the test statistic ($TS_{CR}$), the null hypothesis is rejected. The reason for this sign shift is that the larger the value of $TS_{CAL}$, the smaller the probability of obtaining a more extreme value of the test statistic under the null hypothesis. This sign shift can be easily seen:

if probability of $TS_{CAL}$ < significance level ($\alpha$), then reject $H_0$,
but if $TS_{CAL}$ > $TS_{CR}$, then reject $H_0$.

(Figure: One-Tailed Test)

Figure 15.6 A Broad Classification of Hypothesis Tests
(Boxes: Tests of Association; Tests of Differences: Distributions, Means, Proportions, Medians/Rankings)

cross-tabulation: A statistical technique that describes two or more variables simultaneously and results in tables that reflect the joint distribution of two or more variables that have a limited number of categories or distinct values.

Step 8: Marketing Research Conclusion

The conclusion reached by hypothesis testing must be expressed in terms of the marketing research problem.
In our example, we conclude that there is evidence that the proportion of Internet users who shop via the Internet is significantly greater than 0.40. Hence, the recommendation to the department store would be to introduce the new Internet shopping service.

As can be seen from Figure 15.6, hypothesis testing can be related to either an examination of associations or an examination of differences. In tests of associations, the null hypothesis is that there is no association between the variables ($H_0$: ... is NOT related to ...). In tests of differences, the null hypothesis is that there is no difference ($H_0$: ... is NOT different from ...). Tests of differences could relate to distributions, means, proportions, medians, or rankings. First, we discuss hypotheses related to associations in the context of cross-tabulations.

CROSS-TABULATIONS

Although answers to questions related to a single variable are interesting, they often raise additional questions about how to link that variable to other variables. To introduce the frequency distribution, we posed several representative marketing research questions. For each of these, a researcher might pose additional questions to relate these variables to other variables. For example:

■ How many brand-loyal users are males?
■ Is product use (measured in terms of heavy users, medium users, light users, and nonusers) related to interest in outdoor activities (high, medium, and low)?
■ Is familiarity with a new product related to age and education levels?
■ Is product ownership related to income (high, medium, and low)?

The answers to such questions can be determined by examining cross-tabulations. Whereas a frequency distribution describes one variable at a time, a cross-tabulation describes two or more variables simultaneously. A cross-tabulation is the merging of the frequency distribution of two or more variables in a single table. It helps us to understand how one variable such as brand loyalty relates to another variable such as sex. Cross-tabulation results in tables that reflect the joint distribution of two or more variables with a limited number of categories or distinct values.
The categories of one variable are cross-classified with the categories of one or more other variables. Thus, the frequency distribution of one variable is subdivided according to the values or categories of the other variables.

contingency table: A cross-tabulation table. It contains a cell for every combination of categories of the two variables.

Table 15.3
                          Sex
Internet Usage      Male    Female    Row Total
Light (1)             5       10         15
Heavy (2)            10        5         15
Column total         15       15         30

Suppose we are interested in determining whether Internet usage is related to sex. For the purpose of cross-tabulation, respondents are classified as light or heavy users. Those reporting five hours or less usage are classified as light users, and the remaining are heavy users. The cross-tabulation is shown in Table 15.3. A cross-tabulation includes a cell for every combination of the categories of the two variables. The number in each cell shows how many respondents gave that combination of responses. In Table 15.3, 10 respondents were females who reported light Internet usage. The marginal totals in this table indicate that of the 30 respondents with valid responses on both the variables, 15 reported light usage and 15 were heavy users. In terms of sex, 15 respondents were females and 15 were males. Note that this information could have been obtained from a separate frequency distribution for each variable. In general, the margins of a cross-tabulation show the same information as the frequency tables for each of the variables. Cross-tabulation tables are also called contingency tables.
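The construction of a cross-tabulation from respondent-level records can be sketched with a counter. The records below are rebuilt from the cell counts of Table 15.3 for illustration (the original respondent file is not reproduced here), and the function name is an assumption.

```python
from collections import Counter

# Cell counts from Table 15.3, keyed by (Internet usage, sex).
table_15_3 = {
    ("light", "male"): 5, ("light", "female"): 10,
    ("heavy", "male"): 10, ("heavy", "female"): 5,
}

# Rebuild one (usage, sex) record per respondent from the cell counts.
records = [cell for cell, n in table_15_3.items() for _ in range(n)]


def cross_tabulate(pairs):
    """Return cell counts plus row and column marginal totals."""
    cells = Counter(pairs)
    row_totals = Counter(row for row, _ in pairs)
    column_totals = Counter(col for _, col in pairs)
    return cells, row_totals, column_totals


cells, row_totals, column_totals = cross_tabulate(records)
print(cells[("light", "female")])   # females reporting light usage
print(dict(row_totals))             # marginal totals for Internet usage
print(dict(column_totals))          # marginal totals for sex
```

The two marginal counters are exactly the one-variable frequency distributions, which is the point made in the text about the margins of a cross-tabulation.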
The data are considered to be qualitative or categorical data, because each variable is assumed to have only a nominal scale. Cross-tabulation is widely used in commercial marketing research, because (1) cross-tabulation analysis and results can be easily interpreted and understood by managers who are not statistically oriented; (2) the clarity of interpretation provides a stronger link between research results and managerial action; (3) a series of cross-tabulations may provide greater insights into a complex phenomenon than a single multivariate analysis; (4) cross-tabulation may alleviate the problem of sparse cells, which could be serious in discrete multivariate analysis; and (5) cross-tabulation analysis is simple to conduct and appealing to less sophisticated researchers.

Two Variables

Cross-tabulation with two variables is also known as bivariate cross-tabulation. Consider again the cross-classification of Internet usage with sex given in Table 15.3. Is usage related to sex? It appears to be from Table 15.3. We see that disproportionately more of the respondents who are males are heavy Internet users as compared to females. Computation of percentages can provide more insights.

Table 15.4
                          Sex
Internet Usage      Male      Female
Light               33.3%     66.7%
Heavy               66.7%     33.3%
Column total       100.0%    100.0%

Table 15.5
Internet Usage by Sex
                  Internet Usage
Sex            Light     Heavy     Total
Male           33.3%     66.7%    100.0%
Female         66.7%     33.3%    100.0%

Figure 15.7 The Introduction of a Third Variable in Cross-Tabulation

Because two variables have been cross-classified, percentages could be computed either columnwise, based on column totals (Table 15.4), or rowwise, based on row totals (Table 15.5). Which of these tables is more useful? The answer depends on which variable will be considered as the independent variable and which as the dependent variable. The general rule is to compute the percentages in the direction of the independent variable, across the dependent variable.
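The general rule can be illustrated with the counts of Table 15.3. The sketch below assumes, as the discussion that follows does, that sex is the independent variable and sits in the columns, so each cell is divided by its column total. The function name is hypothetical.

```python
# Cell counts from Table 15.3, keyed by (Internet usage, sex).
counts = {
    ("light", "male"): 5, ("light", "female"): 10,
    ("heavy", "male"): 10, ("heavy", "female"): 5,
}


def column_percentages(cells):
    """Express each cell as a percentage of its column (independent variable) total."""
    column_totals = {}
    for (_, col), n in cells.items():
        column_totals[col] = column_totals.get(col, 0) + n
    return {(row, col): 100.0 * n / column_totals[col]
            for (row, col), n in cells.items()}


percentages = column_percentages(counts)
for (usage, sex), pct in sorted(percentages.items()):
    print(f"{sex:6s} {usage:5s} {pct:5.1f}%")
```

Dividing by row totals instead would give the rowwise percentages; the counts are the same, only the base of the percentage changes.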
In our analysis, sex may be considered as the independent variable and Internet usage as the dependent variable, and the correct way of calculating percentages is as shown in Table 15.4. Note that whereas 66.7 percent of the males are heavy users, only 33.3 percent of females fall into this category. This seems to indicate that males are more likely to be heavy users of the Internet as compared to females.

Note that computing percentages in the direction of the dependent variable across the independent variable, as shown in Table 15.5, is not meaningful in this case. Table 15.5 implies that heavy Internet usage causes people to be males. This latter finding is implausible. It is possible, however, that the association between Internet usage and sex is mediated by a third variable, such as age or income. This kind of possibility points to the need to examine the effect of a third variable.

Three Variables

Often the introduction of a third variable clarifies the initial association (or lack of it) observed between two variables. As shown in Figure 15.7, the introduction of a third variable can result in four possibilities.

4. It can indicate no change in the initial association.

These cases are explained with examples based on a sample of 1,000 respondents. Although these examples are contrived to illustrate specific cases, such cases are not uncommon in commercial marketing research.

Refine an Initial Relationship. An examination of the relationship between the purchase of fashion clothing and marital status resulted in the data reported in Table 15.6. The respondents were classified into either high or low categories based on their purchase of fashion clothing.
Marital status was also measured in terms of two categories: currently married or unmarried. As can be seen from Table 15.6, 52 percent of unmarried respondents fell in the high-purchase category, as opposed to 31 percent of the married respondents. Before concluding that unmarried respondents purchase more fashion clothing than those who are married, a third variable, the buyer's sex, was introduced into the analysis.

(Figure 15.7 boxes: Some Association between the Two Variables; No Association between the Two Variables; Introduce a Third Variable; Refined Association; No Association; No Change in the Initial Pattern; Some Association)

Table 15.6 Purchase of Fashion Clothing by Marital Status
Purchase of                 Current Marital Status
Fashion Clothing            Married     Unmarried
High                          31%          52%
Low                           69%          48%
Column                       100%         100%
Number of respondents         700          300

Table 15.7 Purchase of Fashion Clothing by Marital Status and Sex
                                          Sex
                          Male Marital Status       Female Marital Status
Purchase of
Fashion Clothing          Married    Unmarried      Married    Unmarried
High                        35%        40%            25%        60%
Low                         65%        60%            75%        40%
Column totals              100%       100%           100%       100%
Number of cases             400        120            300        180

The buyer's sex was selected as the third variable based on past research. The relationship between purchase of fashion clothing and marital status was reexamined in light of the third variable, as shown in Table 15.7. In the case of females, 60 percent of the unmarried fall in the high-purchase category, as compared to 25 percent of those who are married. On the other hand, the percentages are much closer for males, with 40 percent of the unmarried and 35 percent of the married falling in the high-purchase category. Hence, the introduction of sex (third variable) has refined the relationship between marital status and purchase of fashion clothing (original variables). Unmarried respondents are more likely to fall in the high-purchase category than married ones, and this effect is much more pronounced for females than for males.

Initial Relationship Was Spurious.
A researcher working for an advertising agency promoting a line of automobiles costing more than $30,000 was attempting to explain the ownership of expensive automobiles (see Table 15.8). The table shows that 32 percent of those with college degrees own an expensive automobile, as compared to 21 percent of those without college degrees. The researcher was tempted to conclude that education influenced ownership of expensive automobiles. Realizing that income may also be a factor, the researcher decided to reexamine the relationship between education and ownership of expensive automobiles in light of income level. This resulted in Table 15.9. Note that the percentages of those with and without college degrees who own expensive automobiles are the same for each of the income groups. When the data for the high-income and low-income groups are examined separately, the association between education and ownership of expensive automobiles disappears, indicating that the initial relationship observed between these two variables was spurious.

PART III Data Collection, Preparation, Analysis, and Reporting

Table 15.8 Ownership of Expensive Automobiles by Education Level

Own Expensive             Education
Automobile                College Degree     No College Degree
Yes                       32%                21%
No                        68%                79%
Column totals             100%               100%
Number of cases           250                750

Table 15.9 Ownership of Expensive Automobiles by Education Level and Income Level

                          Low Income                     High Income
Own Expensive             College      No College        College      No College
Automobile                Degree       Degree            Degree       Degree
Yes                       20%          20%               40%          40%
No                        80%          80%               60%          60%
Column totals             100%         100%              100%         100%
Number of respondents     100          700               150          50

Reveal Suppressed Association. A researcher suspected that desire to travel abroad may be influenced by age. However, a cross-tabulation of the two variables produced the results in Table 15.10, indicating no association. When sex was introduced as the third variable, Table 15.11 was obtained. Among men, 60 percent of those under 45 indicated a desire to travel abroad, as compared to 40 percent of those 45 or older.
The pattern was reversed for women, where 35 percent of those under 45 indicated a desire to travel abroad, as opposed to 65 percent of those 45 or older. Because the association between desire to travel abroad and age runs in the opposite direction for males and females, the relationship between these two variables is masked when the data are aggregated across sex, as in Table 15.10. But when the effect of sex is controlled, as in Table 15.11, the suppressed association between desire to travel abroad and age is revealed for the separate categories of males and females.

Table 15.10 Desire to Travel Abroad by Age

Desire to                 Age
Travel Abroad             Less Than 45     45 or More
Yes                       50%              50%
No                        50%              50%
Column totals             100%             100%
Number of respondents     500              500

Table 15.11 Desire to Travel Abroad by Age and Sex

                          Male                         Female
Desire to                 Less Than 45   45 or More    Less Than 45   45 or More
Travel Abroad
Yes                       60%            40%           35%            65%
No                        40%            60%           65%            35%
Column totals             100%           100%          100%           100%
Number of cases           300            300           200            200

No Change in Initial Relationship. In some cases, the introduction of the third variable does not change the initial relationship observed, regardless of whether the original variables were associated. This suggests that the third variable does not influence the relationship between the first two.

Table 15.12 Eating Frequently in Fast-Food Restaurants by Family Size

Eat Frequently in         Family Size
Fast-Food Restaurants     Small       Large
Yes                       65%         65%
No                        35%         35%
Column totals             100%        100%
Number of cases           500         500

Table 15.13 Eating Frequently in Fast-Food Restaurants by Family Size and Income

                          Low Income               High Income
Eat Frequently in         Small       Large        Small       Large
Fast-Food Restaurants
Yes                       65%         65%          65%         65%
No                        35%         35%          35%         35%
Column totals             100%        100%         100%        100%
Number of respondents     250         250          250         250

Consider the cross-tabulation of family size and the tendency to eat out frequently in fast-food restaurants, as shown in Table 15.12. The respondents were classified into small and large family size categories based on a median split of the distribution, with 500 respondents in each category. No association is observed.
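The way aggregation can mask an association in the three-variable examples can be checked with simple arithmetic. The following minimal Python sketch is an illustration rather than part of the original analysis: it pools the subgroup percentages and case counts of Table 15.11 to recover the aggregate percentages of Table 15.10. The dictionary layout and the helper name `aggregate_rate` are assumptions made for this example.

```python
# Subgroup data from Table 15.11: (number of cases, share wanting to travel abroad).
# Keys are (sex, age group); this layout is an illustrative assumption.
table_15_11 = {
    ("male", "under 45"): (300, 0.60),
    ("male", "45 or older"): (300, 0.40),
    ("female", "under 45"): (200, 0.35),
    ("female", "45 or older"): (200, 0.65),
}

def aggregate_rate(table, age_group):
    """Pool both sexes and return the overall share of 'yes' for one age group."""
    cases = sum(n for (sex, age), (n, p) in table.items() if age == age_group)
    yes = sum(n * p for (sex, age), (n, p) in table.items() if age == age_group)
    return yes / cases

for age in ("under 45", "45 or older"):
    # Both age groups pool to 50 percent, hiding the opposite patterns by sex.
    print(age, f"{aggregate_rate(table_15_11, age):.0%}")
```

Within each sex the two age groups differ by 20 to 30 percentage points, yet the pooled rates are identical, which is the suppressed association described in the text.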
The respondents were further classified into high- or low-income groups based on a median split. When income was introduced as a third variable in the analysis, Table 15.13 was obtained. Again, no association was observed.

General Comments on Cross-Tabulation

More than three variables can be cross-tabulated, but the interpretation is quite complex. Also, because the number of cells increases multiplicatively, maintaining an adequate number of respondents or cases in each cell can be problematic. As a general rule, there should be at least five expected observations in each cell for the statistics computed to be reliable. Thus, cross-tabulation is an inefficient way of examining relationships when there are several variables. Note that cross-tabulation examines association between variables, not causation. To examine causation, the causal research design framework should be adopted (see Chapter 7).

STATISTICS ASSOCIATED WITH CROSS-TABULATION

We will discuss the statistics commonly used for assessing the statistical significance and strength of association of cross-tabulated variables. The statistical significance of the observed association is commonly measured by the chi-square statistic. The strength of association, or degree of association, is important from a practical or substantive perspective. Generally, the strength of association is of interest only if the association is statistically significant. The strength of the association can be measured by the phi correlation coefficient, the contingency coefficient, Cramer's V, and the lambda coefficient.

chi-square statistic: The statistic used to test the statistical significance of the observed association in a cross-tabulation. It assists us in determining whether a systematic association exists between the two variables.

Chi-Square

The chi-square statistic ($\chi^2$) is used to test the statistical significance of the observed association in a cross-tabulation. It assists us in determining whether a systematic association exists between the two variables.
The null hypothesis, $H_0$, is that there is no association between the variables. The test is conducted by computing the cell frequencies that would be expected if no association were present between the variables, given the existing row and column totals. These expected cell frequencies, denoted $f_e$, are then compared to the actual observed frequencies, $f_o$, found in the cross-tabulation to calculate the chi-square statistic. The greater the discrepancies between the expected and actual frequencies, the larger the value of the statistic. Assume that a cross-tabulation has $r$ rows and $c$ columns and a random sample of $n$ observations. Then the expected frequency for each cell can be calculated as:

$$f_e = \frac{n_r n_c}{n}$$

where
$n_r$ = total number in the row
$n_c$ = total number in the column
$n$ = total sample size

For the data in Table 15.3, the expected frequencies for the cells, going from left to right and from top to bottom, are:

$$\frac{15 \times 15}{30} = 7.50, \quad \frac{15 \times 15}{30} = 7.50, \quad \frac{15 \times 15}{30} = 7.50, \quad \frac{15 \times 15}{30} = 7.50$$

Then the value of $\chi^2$ is calculated as follows:

$$\chi^2 = \sum_{\text{all cells}} \frac{(f_o - f_e)^2}{f_e}$$

For the data in Table 15.3, the value of $\chi^2$ is calculated as:

$$\chi^2 = \frac{(5 - 7.5)^2}{7.5} + \frac{(10 - 7.5)^2}{7.5} + \frac{(10 - 7.5)^2}{7.5} + \frac{(5 - 7.5)^2}{7.5} = 0.833 + 0.833 + 0.833 + 0.833 = 3.333$$

To determine whether a systematic association exists, the probability of obtaining a value of chi-square as large or larger than the one calculated from the cross-tabulation is estimated. An important characteristic of the chi-square statistic is the number of degrees of freedom (df) associated with it. In general, the number of degrees of freedom is equal to the number of observations less the number of constraints needed to calculate a statistical term. In the case of a chi-square statistic associated with a cross-tabulation, the number of degrees of freedom is equal to the product of the number of rows ($r$) less one and the number of columns ($c$) less one. That is, $df = (r - 1) \times (c - 1)$. The null hypothesis ($H_0$) of no association between the two variables will be rejected only when the calculated value of the test statistic is greater than the critical value of the chi-square distribution with the appropriate degrees of freedom, as shown in Figure 15.8.

The chi-square distribution is a skewed distribution whose shape depends solely on the number of degrees of freedom.
As the number of degrees of freedom increases, the chi-square distribution becomes more symmetrical. Table 3 in the Statistical Appendix contains upper-tail areas of the chi-square distribution for different degrees of freedom. In this table, the value at the top of each column indicates the area in the upper portion (the right side, as shown in Figure 15.8) of the chi-square distribution. To illustrate, for 1 degree of freedom, the value for an upper-tail area of 0.05 is 3.841. This indicates that for 1 degree of freedom, the probability of exceeding a chi-square value of 3.841 is 0.05. In other words, at the 0.05 level of significance with 1 degree of freedom, the critical value of the chi-square statistic is 3.841.

[Figure 15.8 Chi-Square Test of Association: the chi-square distribution with the critical value marked in the upper (right) tail.]

phi coefficient: A measure of the strength of association in the special case of a table with two rows and two columns (a 2 × 2 table).

For the cross-tabulation given in Table 15.3, there are (2 − 1) × (2 − 1) = 1 degree of freedom. The calculated chi-square statistic had a value of 3.333. Because this is less than the critical value of 3.841, the null hypothesis of no association cannot be rejected, indicating that the association is not statistically significant at the 0.05 level.
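The chi-square calculation for Table 15.3 can be reproduced in a few lines of Python. This is a minimal sketch, not the procedure used by any particular statistics package. The observed counts are reconstructed from the percentages quoted in the text (10 of 15 males and 5 of 15 females are heavy users), and the function name `chi_square` is an assumption made for this example.

```python
# Observed counts for Table 15.3 (Internet usage by sex), reconstructed from
# the percentages in the text: rows = light, heavy users; columns = male, female.
observed = [[5, 10],
            [10, 5]]

def chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * col_totals[j] / n  # expected count under H0
            stat += (f_o - f_e) ** 2 / f_e
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

stat, df = chi_square(observed)
CRITICAL_05_DF1 = 3.841  # upper-tail 0.05 critical value for 1 df, from the text
print(f"chi-square = {stat:.3f}, df = {df}")
print("reject H0" if stat > CRITICAL_05_DF1 else "do not reject H0")
```

Under these assumptions the sketch prints a statistic of 3.333 with 1 degree of freedom and does not reject the null hypothesis, in line with the hand calculation.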
Note that this lack of significance is mainly due to the small sample size (30). If, instead, the sample size were 300 and each entry of Table 15.3 were multiplied by 10, it can be seen that the value of the chi-square statistic would be multiplied by 10 and would be 33.33, which is significant at the 0.05 level.

The chi-square statistic can also be used in goodness-of-fit tests to determine whether certain models fit the observed data. These tests are conducted by calculating the significance of sample deviations from assumed theoretical (expected) distributions, and can be performed on cross-tabulations as well as on frequencies (one-way tabulations). The calculation of the chi-square statistic and the determination of its significance is the same as illustrated above.

The chi-square statistic should be estimated only on counts of data. When the data are in percentage form, they should first be converted to absolute counts or numbers. In addition, an underlying assumption of the chi-square test is that the observations are drawn independently. As a general rule, chi-square analysis should not be conducted when the expected or theoretical frequency in any of the cells is less than five. If the number of observations in any cell is less than 10, or if the table has two rows and two columns (a 2 × 2 table), a correction factor should be applied. With the correction factor, the value is 2.133, which is not significant at the 0.05 level. In the case of a 2 × 2 table, the chi-square is related to the phi coefficient.

Phi Coefficient

The phi coefficient ($\phi$) is used as a measure of the strength of association in the special case of a table with two rows and two columns (a 2 × 2 table). The phi coefficient is proportional to the square root of the chi-square statistic. For a sample of size $n$,
this statistic is calculated as:

$$\phi = \sqrt{\frac{\chi^2}{n}}$$

It takes the value of 0 when there is no association, which would be indicated by a chi-square value of 0 as well. When the variables are perfectly associated, phi assumes the value of 1 and all the observations fall just on the main or minor diagonal. (In some computer programs, phi assumes a value of −1 rather than 1 when there is perfect negative association.) In our case, because the association was not significant at the 0.05 level, we would not normally compute the phi value. However, for the purpose of illustration, we show how the values of phi and other measures of the strength of association would be computed. The value of phi is:

$$\phi = \sqrt{\frac{3.333}{30}} = 0.333$$

Thus, the association is not very strong. In the more general case involving a table of any size, the strength of association can be assessed by using the contingency coefficient.

contingency coefficient: A measure of the strength of association in a table of any size.
Cramer's V: A measure of the strength of association used in tables larger than 2 × 2.
asymmetric lambda: A measure of the percentage improvement in predicting the value of the dependent variable, given the value of the independent variable in contingency table analysis. Lambda also varies between 0 and 1.

Contingency Coefficient

Whereas the phi coefficient is specific to a 2 × 2 table, the contingency coefficient ($C$) can be used to assess the strength of association in a table of any size. This index is also related to chi-square, as follows:

$$C = \sqrt{\frac{\chi^2}{\chi^2 + n}}$$

The contingency coefficient varies between 0 and 1. The 0 value occurs in the case of no association (i.e., the variables are statistically independent), but the maximum value of 1 is never achieved. Rather, the maximum value of the contingency coefficient depends on the size of the table (number of rows and number of columns). For this reason, it should be used only to compare tables of the same size.
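Both coefficients follow directly from the chi-square value and the sample size, so they are easy to verify. The short Python sketch below is illustrative only and takes the chi-square value of 3.333 and the sample size of 30 from the text; the variable names are assumptions made for this example.

```python
import math

chi_sq = 3.333  # chi-square value for Table 15.3, from the text
n = 30          # sample size

phi = math.sqrt(chi_sq / n)                     # phi coefficient (2 x 2 tables)
contingency = math.sqrt(chi_sq / (chi_sq + n))  # contingency coefficient C

print(f"phi = {phi:.3f}")          # phi = 0.333
print(f"C   = {contingency:.3f}")  # C   = 0.316
```

Because $C$ divides by $\chi^2 + n$ rather than $n$, it is always smaller than phi for the same table, which is one reason its maximum stays below 1.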
The value of the contingency coefficient for Table 15.3 is:

$$C = \sqrt{\frac{3.333}{3.333 + 30}} = 0.316$$

This value of $C$ indicates that the association is not very strong. Another statistic that can be calculated for any table is Cramer's V.

Cramer's V

Cramer's V is a modified version of the phi correlation coefficient, $\phi$, and is used in tables larger than 2 × 2. When phi is calculated for a table larger than 2 × 2, it has no upper limit. Cramer's V is obtained by adjusting phi for either the number of rows or the number of columns in the table, based on which of the two is smaller. The adjustment is such that V will range from 0 to 1. A large value of V merely indicates a high degree of association. It does not indicate how the variables are associated. For a table with $r$ rows and $c$ columns, the relationship between Cramer's V and the phi correlation coefficient is expressed as:

$$V = \sqrt{\frac{\phi^2}{\min(r - 1,\, c - 1)}} = \sqrt{\frac{\chi^2 / n}{\min(r - 1,\, c - 1)}}$$

The value of Cramer's V for Table 15.3 is:

$$V = \sqrt{\frac{3.333 / 30}{1}} = 0.333$$

Thus, the association is not very strong. As can be seen, in this case $V = \phi$. This is always the case for a 2 × 2 table. Another statistic commonly estimated is the lambda coefficient.

Lambda Coefficient

Lambda assumes that the variables are measured on a nominal scale. Asymmetric lambda measures the percentage improvement in predicting the value of the dependent variable, given the value of the independent variable. Lambda also varies between 0 and 1. A value of 0 means no improvement in prediction. A value of 1 indicates that the prediction can be made without error. This happens when each independent variable category is associated with a single category of the dependent variable.

symmetric lambda: The symmetric lambda does not make an assumption about which variable is dependent. It measures the overall improvement when prediction is done in both directions.
tau b: Test statistic that measures the association between two ordinal-level variables.
It makes an adjustment for ties and is most appropriate when the table of variables is square.
tau c: Test statistic that measures the association between two ordinal-level variables. It makes an adjustment for ties and is most appropriate when the table of variables is not square but a rectangle.
gamma: Test statistic that measures the association between two ordinal-level variables. It does not make an adjustment for ties.

Asymmetric lambda is computed for each of the variables (treating it as the dependent variable). In general, the two asymmetric lambdas are likely to be different because the marginal distributions are not usually the same. A symmetric lambda is also computed, which is a kind of average of the two asymmetric values. The symmetric lambda does not make an assumption about which variable is dependent. It measures the overall improvement when prediction is done in both directions. The value of asymmetric lambda in Table 15.3, with usage as the dependent variable, is 0.333. This indicates that knowledge of sex increases our predictive ability by the proportion of 0.333, that is, a 33.3 percent improvement. The symmetric lambda is also 0.333.

Other Statistics

Note that in the calculation of the chi-square statistic, the variables are treated as being measured on only a nominal scale. Other statistics such as tau b, tau c, and gamma are available to measure association between two ordinal-level variables. All these statistics use information about the ordering of categories of variables by considering every possible pair of cases in the table. Each pair is examined to determine if its relative ordering on the first variable is the same as its relative ordering on the second variable (concordant), if the ordering is reversed (discordant), or if the pair is tied. The manner in which the ties are treated is the basic difference between these statistics.
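The pair-counting idea behind these ordinal statistics can be sketched in Python. The example below is a hedged illustration: the 3 × 3 table is hypothetical (it is not data from this chapter), the function name `concordant_discordant` is an assumption, and gamma is computed with the standard Goodman-Kruskal formula (concordant minus discordant pairs, divided by their sum), which the text does not spell out.

```python
# Hypothetical 3 x 3 table of counts for two ordinal variables
# (rows and columns both ordered low -> high); not data from the text.
table = [[20, 10, 5],
         [10, 15, 10],
         [5, 10, 20]]

def concordant_discordant(t):
    """Count concordant and discordant pairs of cases in an ordered table."""
    conc = disc = 0
    rows, cols = len(t), len(t[0])
    for i in range(rows):
        for j in range(cols):
            for k in range(i + 1, rows):      # compare with cells in later rows
                for m in range(cols):
                    if m > j:
                        conc += t[i][j] * t[k][m]  # same ordering on both variables
                    elif m < j:
                        disc += t[i][j] * t[k][m]  # ordering reversed
    return conc, disc

conc, disc = concordant_discordant(table)
gamma = (conc - disc) / (conc + disc)  # tied pairs are ignored entirely
print(conc, disc, round(gamma, 3))
```

Pairs that share a row or a column are tied and are simply skipped here; tau b and tau c differ from gamma in how they bring those ties back into the denominator.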
Both tau b and tau c adjust for ties. Tau b is the most appropriate with square tables, in which the number of rows and the number of columns are equal. Its value varies between +1 and −1. Thus the direction (positive or negative) as well as the strength (how close the value is to 1) of the relationship can be determined. For a rectangular table in which the number of rows is different from the number of columns, tau c should be used. Gamma does not make an adjustment for either ties or table size. Gamma also varies between +1 and −1 and generally has a higher numerical value than tau b or tau c. For the data in Table 15.3, as sex is a nominal variable, it is not appropriate to calculate ordinal statistics. All these statistics can be estimated by using the appropriate computer programs for cross-tabulation. Other statistics for measuring the strength of association, namely product moment correlation and nonmetric correlation, are discussed in Chapter 17.

CROSS-TABULATION IN PRACTICE

When conducting cross-tabulation analysis in practice, it is useful to proceed along the following steps.

1. Test the null hypothesis that there is no association between the variables using the chi-square statistic. If you fail to reject the null hypothesis, then there is no evidence of a relationship.
2. If $H_0$ is rejected, then determine the strength of the association using an appropriate statistic (phi coefficient, contingency coefficient, Cramer's V, lambda coefficient, or other statistics).
3. If $H_0$ is rejected, interpret the pattern of the relationship by computing the percentages in the direction of the independent variable, across the dependent variable.
4. If the variables are treated as ordinal rather than nominal, use tau b, tau c, or gamma as the test statistic. If $H_0$ is rejected, then determine the strength of the association using the magnitude, and the direction of the relationship using the sign of the test statistic.
5.
Translate the results of hypothesis testing, strength of association, and pattern of association into managerial implications and recommendations where meaningful.

Visit www.loreal.com and conduct an Internet search using a search engine and your library's online database to obtain information on the heavy users, light users, and nonusers of cosmetics. How would you analyze the data to determine whether the heavy, light, and nonusers differ in terms of demographic characteristics? As the marketing director for L'Oreal, what marketing strategies would you adopt to reach the heavy users, light users, and nonusers of cosmetics?

parametric tests: Hypothesis-testing procedures that assume that the variables of interest are measured on at least an interval scale.
nonparametric tests: Hypothesis-testing procedures that assume that the variables are measured on a nominal or ordinal scale.

Figure 15.9 Hypothesis Tests Related to Differences

HYPOTHESIS TESTING RELATED TO DIFFERENCES

The previous section considered hypothesis testing related to associations. We now focus on hypothesis testing related to differences. A classification of hypothesis-testing procedures for examining differences is presented in Figure 15.9. Note that Figure 15.9 is consistent with the classification of univariate techniques presented in Figure 14.6. The major difference is that Figure 14.6 also accommodates more than two samples and thus deals with techniques such as one-way ANOVA and K-W ANOVA (Chapter 16), whereas Figure 15.9 is limited to no more than two samples. Also, one-sample techniques such as frequencies, which do not involve statistical testing, are not covered in Figure 15.9.
Hypothesis-testing procedures can be broadly classified as parametric or nonparametric, based on the measurement scale of the variables involved. Parametric tests assume that the variables of interest are measured on at least an interval scale. Nonparametric tests assume that the variables are measured on a nominal or ordinal scale. These tests can be further classified based on whether one, two, or more samples are involved. As explained in Chapter 14, the number of samples is determined based on how the data are treated for the purpose of analysis, not based on how the data were collected. The samples are independent if they are drawn randomly from different populations. For the purpose of analysis, data pertaining to different groups of respondents, for example, males and females, are generally treated as independent samples. On the other hand, the samples are paired when the data for the two samples relate to the same group of respondents.

The most popular parametric test is the t test, conducted for examining hypotheses about means. The t test could be conducted on the mean of one sample or two samples of observations. In the case of two samples, the samples could be independent or paired. The z test can be used for one sample or two independent samples as well. Nonparametric tests based on observations drawn from one sample include the Kolmogorov-Smirnov test, the chi-square test, the runs test, and the binomial test. In the case of two independent samples, the Mann-Whitney U test, the median test, and the Kolmogorov-Smirnov two-sample test are used for examining hypotheses about location. These tests are nonparametric counterparts of the two-group t test. The chi-square test can be used for examining differences in proportions.
For paired samples, nonparametric tests include the Wilcoxon matched-pairs signed-ranks test and the sign test. These tests are the counterparts of the paired t test. Alternatively, the chi-square test can be used for binary variables. Parametric as well as nonparametric tests are also available for evaluating hypotheses relating to more than two samples. These tests are considered in later chapters.

[Figure 15.9 content: Parametric tests (metric data) cover one sample (t test, z test) and two samples, either independent (two-group t test, z test) or paired (paired t test). Nonparametric tests (nonmetric data) cover one sample (chi-square, K-S, runs, binomial) and two samples, either independent (chi-square, Mann-Whitney, median, K-S) or paired (sign, Wilcoxon, McNemar, chi-square).]

t test: A univariate hypothesis test using the t distribution, which is used when the standard deviation is unknown and the sample size is small.
t statistic: A statistic that assumes that the variable has a symmetric bell-shaped distribution, the mean is known (or assumed to be known), and the population variance is estimated from the sample.
t distribution: A symmetric bell-shaped distribution that is useful for small sample (n < 30) testing.

PARAMETRIC TESTS

Parametric tests provide inferences for making statements about the means of parent populations. A t test is commonly used for this purpose. This test is based on the Student's t statistic. The t statistic assumes that the variable is normally distributed and the mean is known (or assumed to be known), and the population variance is estimated from the sample. Assume that the random variable $X$ is normally distributed, with mean $\mu$ and unknown population variance $\sigma^2$, which is estimated by the sample variance $s^2$. Recall that the standard deviation of the sample mean, $\bar{X}$, is estimated as $s_{\bar{X}} = s/\sqrt{n}$. Then $t = (\bar{X} - \mu)/s_{\bar{X}}$ is t distributed with $n - 1$ degrees of freedom.

The t distribution is similar to the normal distribution in appearance. Both distributions are bell shaped and symmetric. However, as compared to the normal distribution, the t distribution has more area in the tails and less in the center. This is because the population variance $\sigma^2$
is unknown and is estimated by the sample variance $s^2$. Given the uncertainty in the value of $s^2$, the observed values of t are more variable than those of z. Thus, we must go a larger number of standard deviations from 0 to encompass a certain percentage of values from the t distribution than is the case with the normal distribution. Yet, as the number of degrees of freedom increases, the t distribution approaches the normal distribution. In fact, for large samples of 120 or more, the t distribution and the normal distribution are virtually indistinguishable. Table 4 in the Statistical Appendix shows selected percentiles of the t distribution. Although normality is assumed, the t test is quite robust to departures from normality.

The procedure for hypothesis testing, for the special case when the t statistic is used, is as follows.

1. Formulate the null ($H_0$) and the alternative ($H_1$) hypotheses.
2. Select the appropriate formula for the t statistic.
3. Select a significance level, $\alpha$, for testing $H_0$. Typically, the 0.05 level is selected.
4. Take one or two samples and compute the mean and standard deviation for each sample.
5. Calculate the t statistic assuming $H_0$ is true.
6. Calculate the degrees of freedom and estimate the probability of getting a more extreme value of the statistic from Table 4. (Alternatively, calculate the critical value of the t statistic.)
7. If the probability computed in step 6 is smaller than the significance level selected in step 3, reject $H_0$. If the probability is larger, do not reject $H_0$. (Alternatively, if the value of the calculated t statistic in step 5 is larger than the critical value determined in step 6, reject $H_0$. If the calculated value is smaller than the critical value, do not reject $H_0$.) Failure to reject $H_0$ does not necessarily imply that $H_0$ is true. It only means that the true state is not significantly different from that assumed by $H_0$.
8. Express the conclusion reached by the t test in terms of the marketing research problem.

One Sample

In marketing research, the researcher is often interested in making statements about a single variable against a known or given standard.
Examples of such statements include: The market share for a new product will exceed 15 percent; at least 65 percent of customers will like a new package design; 80 percent of dealers will prefer the new pricing policy. These statements can be translated to null hypotheses that can be tested using a one-sample test, such as the t test or the z test. In the case of a t test for a single mean, the researcher is interested in testing whether the population mean conforms to a given hypothesis ($H_0$). For the data in Table 15.1, suppose we wanted to test the hypothesis that the mean familiarity rating exceeds 4.0, the neutral value on a 7-point scale. A significance level of $\alpha = 0.05$ is selected. The hypotheses may be formulated as:

$$H_0: \mu \le 4.0$$
$$H_1: \mu > 4.0$$

$$t = \frac{\bar{X} - \mu}{s_{\bar{X}}}$$

$$s_{\bar{X}} = \frac{s}{\sqrt{n}} = \frac{1.579}{\sqrt{29}} = \frac{1.579}{5.385} = 0.293$$

$$t = \frac{4.724 - 4.0}{0.293} = \frac{0.724}{0.293} = 2.471$$

z test: A univariate hypothesis test using the standard normal distribution.
independent samples: Two samples that are not experimentally related. The measurement of one sample has no effect on the values of the second sample.

The degrees of freedom for the t statistic to test the hypothesis about one mean are $n - 1$. In this case, $n - 1 = 29 - 1$ or 28. From Table 4 in the Statistical Appendix, the probability of getting a more extreme value than 2.471 is less than 0.05. (Alternatively, the critical t value for 28 degrees of freedom and a significance level of 0.05 is 1.7011, which is less than the calculated value.) Hence, the null hypothesis is rejected. The familiarity level does exceed 4.0.

Note that if the population standard deviation was assumed to be known as 1.5, rather than estimated from the sample, a z test would be appropriate. In this case, the value of the z statistic would be:

$$z = \frac{\bar{X} - \mu}{\sigma_{\bar{X}}}$$

where

$$\sigma_{\bar{X}} = \frac{1.5}{\sqrt{29}} = \frac{1.5}{5.385} = 0.279$$

and

$$z = \frac{4.724 - 4.0}{0.279} = \frac{0.724}{0.279} = 2.595$$

From Table 2 in the Statistical Appendix, the probability of getting a more extreme value of z than 2.595 is less than 0.05.
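The one-sample t and z calculations for the familiarity rating can be sketched in Python as a check on the arithmetic. This is an illustrative example using only the summary statistics quoted in the text; the variable names are assumptions. Because the code does not round the standard error to three decimals, its results (about 2.469 and 2.599) differ slightly from the printed 2.471 and 2.595.

```python
import math

# Summary statistics quoted in the text for the familiarity rating (Table 15.1).
mean, s, n = 4.724, 1.579, 29
mu_0 = 4.0         # hypothesized mean under H0
sigma_known = 1.5  # population SD assumed known for the z test

se_t = s / math.sqrt(n)            # estimated standard error of the mean
t_stat = (mean - mu_0) / se_t
se_z = sigma_known / math.sqrt(n)  # standard error when sigma is known
z_stat = (mean - mu_0) / se_z

T_CRIT, Z_CRIT = 1.7011, 1.645     # one-tailed 0.05 critical values from the text
print(f"t = {t_stat:.3f} (df = {n - 1}), reject H0: {t_stat > T_CRIT}")
print(f"z = {z_stat:.3f}, reject H0: {z_stat > Z_CRIT}")
```

Both statistics exceed their one-tailed critical values, so the sketch reaches the same decision as the hand calculation.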
(Alternatively, the critical z value for a one-tailed test and a significance level of 0.05 is 1.645, which is less than the calculated value.) Therefore, the null hypothesis is rejected, reaching the same conclusion arrived at earlier by the t test.

The procedure for testing a null hypothesis with respect to a proportion was illustrated earlier in this chapter when we introduced hypothesis testing.

Two Independent Samples

Several hypotheses in marketing relate to parameters from two different populations: for example, the users and nonusers of a brand differ in terms of their perceptions of the brand, the high-income consumers spend more on entertainment than low-income consumers, or the proportion of brand-loyal users in segment I is more than the proportion in segment II. Samples drawn randomly from different populations are termed independent samples. As in the case for one sample, the hypotheses could relate to means or proportions.

Means. In the case of means for two independent samples, the hypotheses take the following form.

$$H_0: \mu_1 = \mu_2$$
$$H_1: \mu_1 \ne \mu_2$$

The two populations are sampled and the means and variances computed based on samples of sizes $n_1$ and $n_2$. If both populations are found to have the same variance, a pooled variance estimate is computed from the two sample variances as follows:

$$s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

F test: A statistical test of the equality of the variances of two populations.
F statistic: The F statistic is computed as the ratio of two sample variances.
F distribution: A frequency distribution that depends upon two sets of degrees of freedom: the degrees of freedom in the numerator and the degrees of freedom in the denominator.

The standard deviation of the test statistic can be estimated as:

$$s_{\bar{X}_1 - \bar{X}_2} = \sqrt{s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}$$

The appropriate value of t can be calculated as:

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_{\bar{X}_1 - \bar{X}_2}}$$

The degrees of freedom in this case are $(n_1 + n_2 - 2)$.

If the two populations have unequal variances, an exact t cannot be computed for the difference in sample means. Instead, an approximation to t is computed.
The number of degrees of freedom in this case is usually not an integer, but a reasonably accurate probability can be obtained by rounding to the nearest integer.

An F test of sample variance may be performed if it is not known whether the two populations have equal variance. In this case the hypotheses are:

$$H_0: \sigma_1^2 = \sigma_2^2$$
$$H_1: \sigma_1^2 \ne \sigma_2^2$$

The F statistic is computed from the sample variances as follows:

$$F_{(n_1 - 1),(n_2 - 1)} = \frac{s_1^2}{s_2^2}$$

where
$n_1$ = size of sample 1
$n_2$ = size of sample 2
$n_1 - 1$ = degrees of freedom for sample 1
$n_2 - 1$ = degrees of freedom for sample 2
$s_1^2$ = sample variance for sample 1
$s_2^2$ = sample variance for sample 2

As can be seen, the critical value of the F distribution depends upon two sets of degrees of freedom, those in the numerator and those in the denominator. The critical values of F for various degrees of freedom for the numerator and denominator are given in Table 5 of the Statistical Appendix. If the probability of F is greater than the significance level $\alpha$, $H_0$ is not rejected, and t based on the pooled variance estimate can be used. On the other hand, if the probability of F is less than or equal to $\alpha$, $H_0$ is rejected and t based on a separate variance estimate is used.

Using the data of Table 15.1, suppose we wanted to determine whether Internet usage was different for males as compared to females. A two-independent-samples t test was conducted. The results are presented in Table 15.14. Note that the F test of sample variances has a probability that is less than 0.05. Accordingly, $H_0$ is rejected, and the t test based on "equal variances not assumed" should be used. The t value is −4.492 and, with 18.014 degrees of freedom, this gives a probability of 0.000, which is less than the significance level of 0.05. Therefore, the null hypothesis of equal means is rejected. Because the mean usage for males (sex = 1) is 9.333 and that for females (sex = 2) is 3.867, males use the Internet to a significantly greater extent than females.
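The choice between the pooled-variance and separate-variance t statistics can be sketched in Python using only the standard library. The data below are hypothetical and are not the Table 15.1 values, and the separate-variance degrees of freedom use the standard Welch-Satterthwaite approximation, which is an assumption about the "approximation to t" that the text mentions without giving a formula. The sketch stops short of p-values, which would normally come from a table or a statistics package.

```python
import math
from statistics import mean, variance

# Hypothetical weekly Internet-usage hours for two independent groups
# (illustrative numbers only, not the Table 15.1 data).
group1 = [14, 2, 6, 15, 13, 9, 11, 4]
group2 = [3, 4, 5, 2, 6, 3, 4, 5]

n1, n2 = len(group1), len(group2)
m1, m2 = mean(group1), mean(group2)
v1, v2 = variance(group1), variance(group2)  # sample variances (n - 1 divisor)

f_ratio = v1 / v2  # F statistic with (n1 - 1, n2 - 1) degrees of freedom

# Pooled-variance t, used when equality of variances is not rejected.
pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t_pooled = (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
df_pooled = n1 + n2 - 2

# Separate-variance t with Welch-Satterthwaite degrees of freedom.
se_sq = v1 / n1 + v2 / n2
t_separate = (m1 - m2) / math.sqrt(se_sq)
df_separate = se_sq ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))

print(f"F = {f_ratio:.2f}")
print(f"pooled:   t = {t_pooled:.3f}, df = {df_pooled}")
print(f"separate: t = {t_separate:.3f}, df = {df_separate:.2f}")
```

With equal group sizes the two t values coincide and only the degrees of freedom differ, which mirrors the nearly identical t values reported for the two versions of the test in the text.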
We also show the t test assuming equal variances because most computer programs automatically conduct the t test both ways. Instead of the small sample of 30, if this were a large and representative sample, there are profound implications for Internet service providers such as AOL, EarthLink, and the various telephone (e.g., Verizon) and cable (e.g., Comcast) companies. In order to target the heavy Internet users, these companies should focus on males. Thus, more advertising dollars should be spent on magazines that cater to male audiences than those that target females.

Table 15.14 Two-Independent-Samples t Test (only partly legible in the source)

Mean usage: males 9.333, females 3.867

                              t Value     Degrees of Freedom     Probability
Equal variances assumed       −4.490      28                     0.000
Equal variances not assumed   −4.492      18.014                 0.000

Stores Seek to Suit Elderly to a "t"

A study based on a national sample of 789 respondents who were age 65 or older attempted to determine the effect that lack of mobility has on patronage behavior. A major research question related to the differences in the physical requirements of dependent and self-reliant elderly persons. That is, did the two groups require different things to get to the store or after they arrived at the store? A more detailed analysis of the physical requirements conducted by two-independent-sample t tests (shown in the accompanying table) indicated that dependent elderly persons are more likely to look for stores that offer ho [...]

[Mann-Whitney U test results, partly legible: Female, mean rank 10.07, 15 cases; Total 30 cases; U = 31.000; W = 151.000; z corrected for ties = −3.406; two-tailed probability = 0.001. Note: U = Mann-Whitney test statistic; W = Wilcoxon W statistic.]

Two other independent-samples nonparametric tests are the median test and Kolmogorov-Smirnov test. The two-sample median test determines whether the two groups are drawn from populations with the same median. It is not as powerful as the Mann-Whitney U test because it merely uses the location of each observation relative to the median, and not the rank of each observation. The Kolmogorov-Smirnov two-sample test examines whether the two distributions are the same.
It takes into account any differences between the two distributions, including the median, dispersion, and skewness, as illustrated by the following example.

Directors Change Direction

How do marketing research directors and users in Fortune 500 manufacturing firms perceive the role of marketing research in initiating changes in marketing strategy formulation? It was found that the marketing research directors were more strongly in favor of initiating changes in strategy and less in favor of holding back than were users of marketing research. The percentages of responses to one of the items, "Initiate change in the marketing strategy of the firm whenever possible," are given in the following table. Using the Kolmogorov-Smirnov (K-S) test, these differences of role definition were statistically significant at the 0.05 level, as shown in the table.

The users of marketing research had become even more reluctant to initiate marketing strategy changes during the uncertain economy of 2005. In today's business climate, however, the reluctance of these marketing research users must be overcome to help gain a better understanding of the buyers' power. Thus, marketing research firms should devote considerable effort to convincing the users (generally marketing managers) of the value of marketing research.

[Table: The Role of Marketing Research in Strategy Formulation. Responses (%) for samples D and U across the categories Absolutely Must, Preferably Should, May or May Not, Preferably Should Not, and Absolutely Must Not, with the K-S significance. D = directors, U = users.]

Wilcoxon matched-pairs signed-ranks test: A nonparametric test that analyzes the differences between the paired observations, taking into account the magnitude of the differences.

Table 15.18 Wilcoxon Matched-Pairs Signed-Rank Test: Internet with Technology

(Technology - Internet)   Cases   Mean Rank
- Ranks                   23      12.72
+ Ranks                   1       7.50
Ties                      6
Total                     30

In this example, the marketing research directors and users comprised two independent samples.
However, the samples are not always independent. In the case of paired samples, a different set of tests should be used.

Paired Samples

An important nonparametric test for examining differences in the location of two populations based on paired observations is the Wilcoxon matched-pairs signed-ranks test. This test analyzes the differences between the paired observations, taking into account the magnitude of the differences. It computes the differences between the pairs of variables and ranks the absolute differences. The next step is to sum the positive and negative ranks. The test statistic, z, is computed from the positive and negative rank sums. Under the null hypothesis of no difference, z is a standard normal variate with mean 0 and variance 1 for large samples. This test corresponds to the paired t test considered earlier.

The example considered for the paired t test, whether the respondents differed in terms of attitude toward the Internet and attitude toward technology, is considered again. Suppose we assume that both these variables are measured on ordinal rather than interval scales. Accordingly, we use the Wilcoxon test. The results are shown in Table 15.18. Again, a significant difference is found in the variables, and the results are in accordance with the conclusion reached by the paired t test. There are 23 negative differences (attitude toward technology is less favorable than attitude toward Internet). The mean rank of these negative differences is 12.72. On the other hand, there is only one positive difference (attitude toward technology is more favorable than attitude toward Internet). The mean rank of this difference is 7.50. There are six ties, or observations with the same value for both variables.
These numbers indicate that the attitude toward the Internet is more favorable than toward technology. Furthermore, the probability associated with the z statistic is less than 0.05, indicating that the difference is indeed significant.

General Mills' Harmony: Helping Women Achieve Nutritional Harmony

The Situation

Stephen W. Sanger, CEO of General Mills, is constantly being faced with the challenge of how to keep up with the changing tastes and preferences of consumers. General Mills recently did thorough focus group research on the most important consumer in grocery stores today: a woman. It is a known fact that 3 out of 4 grocery shoppers in the United States are women, and many of these women are focusing more on their health and the nutritional value of foods. Although there are many cereals on the market with the same amount of valuable vitamins and minerals, such as Total or Kellogg's Smart Start,
