

BUSINESS USING STATISTICAL METHODS: A CASE STUDY OF TALUKA MIR PUR BATHORO, THATTA.


ROLL NO: 2K6/SOC/42 BS (Hons) Part-IV-2009


As the allotted topic is entitled "BUSINESS USING STATISTICAL METHODS, A CASE STUDY OF TALUKA MIR PUR BATHORO, THATTA," this thesis report consists of the basic concepts of statistics and business, followed by a case study report on the footwear preferences of the males and females of Mir Pur Bathoro, carried out for the proprietor of a footwear house: Mr. Atta-ullah Khattri of Aashu Footwear House, Taluka Mir Pur Bathoro. The report highlights the statistical method of collecting information about an area of interest in order to improve business policy, which also benefits the customer. Initially a form was distributed among 50 males and 50 females of the city; the collected data were summarized in tabular form; and then the chi-square test, a probability test widely used for testing a statistical hypothesis (for example, the likelihood that a given statistical distribution of results might be reached in an experiment), was applied and conclusions drawn, which will help the footwear house owner manage his business accordingly. The thesis thus accomplishes its purpose: business using statistical methods, along with a case study. It closes with the conclusion and the future scope of the case study.


CONTENTS

CHAPTER#1 STATISTICS
    Introduction
    Probability
    Estimation
    Hypothesis testing
    Bayesian methods
    Experimental design
    Time series and forecasting
    Nonparametric methods
    Statistical quality control
    Sample survey methods
    Decision analysis

CHAPTER#2 TYPES OF BUSINESS
    Manufacturing firms
    Merchandisers
    Service enterprises

CHAPTER#3 CASE STUDY (AASHU FOOTWEAR HOUSE) USING CHI-SQUARE
    Overview
    Bivariate Tabular Analysis
    Generalizing from samples to populations
    Chi-square requirements
    Collapsing values
    Computing chi-square
    Interpreting chi-square values
    Measures of Association

CHAPTER#4
    Conclusion
    Future Scope



Definition
Statistics is the science of collecting, analyzing, presenting, and interpreting data. It is the branch of mathematics that deals with the relationships among groups of measurements and with the relevance of similarities and differences in those relationships.

Descriptive statistics
Descriptive statistics are tabular, graphical, and numerical summaries of data. The purpose of descriptive statistics is to facilitate the presentation and interpretation of data. Most of the statistical presentations appearing in newspapers and magazines are descriptive in nature.

Descriptive statistics > Tabular methods
The most commonly used tabular summary of data for a single variable is a frequency distribution. A frequency distribution shows the number of data values in each of several non-overlapping classes. Another tabular summary, called a relative frequency distribution, shows the fraction, or percentage, of data values in each class.

Descriptive statistics > Graphical methods
A number of graphical methods are available for describing data. A bar graph is a graphical device for depicting qualitative data that have been summarized in a frequency distribution. Labels for the categories of the qualitative variable are shown on the horizontal axis of the graph.


Descriptive statistics > Numerical measures
A variety of numerical measures are used to summarize data. The proportion, or percentage, of data values in each category is the primary numerical measure for qualitative data. The mean, median, mode, percentiles, range, variance, and standard deviation are the most commonly used numerical measures for quantitative data.

Descriptive statistics > Numerical measures > Outliers
Sometimes data for a variable will include one or more values that appear unusually large or small and out of place when compared with the other data values. These values are known as outliers and often have been erroneously included in the data set.

Descriptive statistics > Numerical measures > Exploratory data analysis
Exploratory data analysis provides a variety of tools for quickly summarizing and gaining insight about a set of data. Two such methods are the five-number summary and the box plot. A five-number summary simply consists of the smallest data value, the first quartile, the median, the third quartile, and the largest data value.

Probability
Probability is a subject that deals with uncertainty. In everyday terminology, probability can be thought of as a numerical measure of the likelihood that a particular event will occur. Probability values are assigned on a scale from 0 to 1, with values near 0 indicating that an event is unlikely to occur and those near 1 indicating that an event is likely to take place.
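A minimal Python sketch of the five-number summary described above. One assumption of this illustration: quartiles are computed by the simple median-split convention, so statistics packages that interpolate differently may give slightly different quartile values.

```python
from statistics import median

def five_number_summary(data):
    """Smallest value, first quartile, median, third quartile, largest value."""
    s = sorted(data)
    mid = len(s) // 2
    # Median-split convention: the lower/upper halves exclude the middle
    # element when the sample size is odd.
    lower = s[:mid]
    upper = s[mid + 1:] if len(s) % 2 else s[mid:]
    return min(s), median(lower), median(s), median(upper), max(s)

print(five_number_summary([7, 1, 3, 9, 5, 11, 2]))  # → (1, 2, 5, 9, 11)
```

The first and last values bound the data, and the three quartiles show how it is spread; these five numbers are exactly what a box plot draws.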

Probability > Events and their probabilities Oftentimes probabilities need to be computed for related events. For instance, advertisements are developed for the purpose of increasing sales of a product. If seeing the advertisement increases the probability of a person buying the product, the events “seeing the advertisement” and “buying the product” are said to be dependent.

Probability > Probability distributions
A random variable is a numerical description of the outcome of a statistical experiment. A random variable that may assume only a finite number or an infinite sequence of values is said to be discrete; one that may assume any value in some interval on the real number line is said to be continuous.

Probability > Special probability distributions > The normal distribution
The most widely used continuous probability distribution in statistics is the normal probability distribution. The graph of every normal distribution is a bell-shaped curve.
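As a quick illustration of the bell-shaped normal distribution, Python's standard library provides `statistics.NormalDist`; the sketch below checks the familiar fact that about 95% of the probability lies within 1.96 standard deviations of the mean:

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1
z = NormalDist(mu=0, sigma=1)

# Probability mass between -1.96 and +1.96 (the classic 95% interval)
print(round(z.cdf(1.96) - z.cdf(-1.96), 3))  # → 0.95
```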

Estimation It is often of interest to learn about the characteristics of a large group of elements such as individuals, households, buildings, products, parts,


customers, and so on. All the elements of interest in a particular study form the population. Because of time, cost, and other considerations, data often cannot be collected from every element of the population.

Estimation > Sampling and sampling distributions
Although sample survey methods will be discussed in more detail below in the section Sample survey methods, it should be noted here that the methods of statistical inference, and estimation in particular, are based on the notion that a probability sample has been taken.

Estimation > Estimation of a population mean
The most fundamental point and interval estimation process involves the estimation of a population mean. Suppose it is of interest to estimate the population mean, μ, for a quantitative variable. Data collected from a simple random sample can be used to compute the sample mean, x̄, where the value of x̄ provides a point estimate of μ.

Estimation > Estimation of other parameters
For qualitative variables, the population proportion is a parameter of interest. A point estimate of the population proportion is given by the sample proportion. With knowledge of the sampling distribution of the sample proportion, an interval estimate of a population proportion is obtained in much the same fashion as for a population mean.
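The point and interval estimation of a population mean described above can be sketched in a few lines of Python. The sample values below are purely hypothetical, and the 1.96 multiplier assumes a large-sample (z) approximation for a 95% confidence level:

```python
from math import sqrt
from statistics import mean, stdev

def estimate_mean(sample, z=1.96):
    """Point estimate (the sample mean) plus an approximate 95%
    interval estimate for the population mean."""
    x_bar = mean(sample)
    margin = z * stdev(sample) / sqrt(len(sample))
    return x_bar, (x_bar - margin, x_bar + margin)

# Hypothetical sample of monthly household expenditures
point, interval = estimate_mean([210, 198, 205, 190, 220, 215, 202, 208])
print(point)  # the point estimate of the population mean
```

The interval widens as the sample standard deviation grows and narrows as the sample size increases, which matches the intuition that more data gives a more precise estimate.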


Estimation > Estimation procedures for two populations
The estimation procedures can be extended to two populations for comparative studies. For example, suppose a study is being conducted to determine differences between the salaries paid to a population of men and a population of women.

Hypothesis testing
Hypothesis testing is a form of statistical inference that uses data from a sample to draw conclusions about a population parameter or a population probability distribution. First, a tentative assumption is made about the parameter or distribution. This assumption is called the null hypothesis and is denoted by H0.

Bayesian methods
The methods of statistical inference previously described are often referred to as classical methods. Bayesian methods (so called after the English mathematician Thomas Bayes) provide alternatives that allow one to combine prior information about a population parameter with information contained in a sample to guide the statistical inference process.

Experimental design
Data for statistical studies are obtained by conducting either experiments or surveys. Experimental design is the branch of statistics that deals with the design and analysis of experiments. The methods of experimental design are widely used in the fields of agriculture, medicine, biology, marketing research, and industrial production.








Experimental design > Significance testing
A computational procedure frequently used to analyze the data from an experimental study employs a statistical procedure known as the analysis of variance. For a single-factor experiment, this procedure uses a hypothesis test concerning equality of treatment means to determine if the factor has a statistically significant effect on the response variable.

Experimental design > Regression and correlation analysis
Regression analysis involves identifying the relationship between a dependent variable and one or more independent variables. A model of the relationship is hypothesized, and estimates of the parameter values are used to develop an estimated regression equation. Various tests are then employed to determine if the model is satisfactory.

Experimental design > Regression and correlation analysis > Regression model
In simple linear regression, the model used to describe the relationship between a single dependent variable y and a single independent variable x is y = β0 + β1x + ε. β0 and β1 are referred to as the model parameters, and ε is a probabilistic error term that accounts for the variability in y that cannot be explained by the linear relationship with x.








Experimental design > Regression and correlation analysis > Least squares method
Either a simple or multiple regression model is initially posed as a hypothesis concerning the relationship among the dependent and independent variables. The least squares method is the most widely used procedure for developing estimates of the model parameters.

Experimental design > Regression and correlation analysis > Analysis of variance and goodness of fit
A commonly used measure of the goodness of fit provided by the estimated regression equation is the coefficient of determination. Computation of this coefficient is based on the analysis of variance procedure that partitions the total variation in the dependent variable.
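The least squares estimates for simple linear regression can be sketched directly from their textbook formulas. The data here are hypothetical, and b0/b1 stand in for the model parameters β0 and β1:

```python
from statistics import mean

def least_squares(x, y):
    """Least squares estimates b0 (intercept) and b1 (slope) for the
    simple linear regression model y = b0 + b1*x."""
    x_bar, y_bar = mean(x), mean(y)
    # Slope: covariation of x and y divided by the variation of x
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    # Intercept: the fitted line passes through the point of means
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data: advertising spend (x) vs. sales (y)
b0, b1 = least_squares([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(round(b0, 2), b1)  # → 2.2 0.6
```

These are the estimates that minimize the sum of squared residuals, i.e. the total squared vertical distance between the observed points and the fitted line.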







Experimental design > Regression and correlation analysis > Significance testing
In a regression study, hypothesis tests are usually conducted to assess the statistical significance of the overall relationship represented by the regression model and to test for the statistical significance of the individual parameters.

Experimental design > Regression and correlation analysis > Residual analysis
The analysis of residuals plays an important role in validating the regression model. If the error term in the regression model satisfies the four assumptions noted earlier, then the model is considered valid.

Experimental design > Regression and correlation analysis > Model building
In regression analysis, model building is the process of developing a probabilistic model that best describes the relationship between the dependent and independent variables. The major issues are finding the proper form (linear or curvilinear) of the relationship and selecting which independent variables to include.

Experimental design > Regression and correlation analysis > Correlation
Correlation and regression analysis are related in the sense that both deal with relationships among variables. The correlation coefficient is a measure of linear association between two variables. Values of the correlation coefficient are always between -1 and +1.

Time series and forecasting
A time series is a set of data collected at successive points in time or over successive periods of time. A sequence of monthly data on new housing starts and a sequence of weekly data on product sales are examples of time series. Usually the data in a time series are collected at equally spaced periods of time, such as an hour, day, week, month, or year.
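The correlation coefficient just described can be computed directly from its definition; the data below are hypothetical, and a perfectly linear relationship illustrates the maximum value of +1:

```python
from math import sqrt
from statistics import mean

def correlation(x, y):
    """Pearson correlation coefficient: a measure of linear association,
    always between -1 and +1."""
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sx = sqrt(sum((xi - x_bar) ** 2 for xi in x))
    sy = sqrt(sum((yi - y_bar) ** 2 for yi in y))
    return cov / (sx * sy)

# A perfectly linear, rising relationship gives the maximum value, +1
print(round(correlation([1, 2, 3, 4], [2, 4, 6, 8]), 6))  # → 1.0
```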


Nonparametric methods
The statistical methods discussed above generally focus on the parameters of populations or probability distributions and are referred to as parametric methods. Nonparametric methods are statistical methods that require fewer assumptions about a population or probability distribution and are applicable in a wider range of situations.

Statistical quality control
Statistical quality control refers to the use of statistical methods in the monitoring and maintaining of the quality of products and services. One method, referred to as acceptance sampling, can be used when a decision must be made to accept or reject a group of parts or items based on the quality found in a sample.

Statistical quality control > Acceptance sampling
Assume that a consumer receives a shipment of parts called a lot from a producer. A sample of parts will be taken and the number of defective items counted. If the number of defective items is low, the entire lot will be accepted. If the number of defective items is high, the entire lot will be rejected.

Statistical quality control > Statistical process control
Statistical process control uses sampling and statistical methods to monitor the quality of an ongoing process such as a production operation. A graphical display referred to as a control chart provides a basis for deciding whether the variation in the output of a process is due to common causes

(randomly occurring variations).

Sample survey methods
As noted above in the section Estimation, statistical inference is the process of using data from a sample to make estimates or test hypotheses about a population. The field of sample survey methods is concerned with effective ways of obtaining sample data. The three most common types of sample surveys are:
• Mail surveys
• Telephone surveys
• Personal interview surveys

Decision analysis
Decision analysis, also called statistical decision theory, involves procedures for choosing optimal decisions in the face of uncertainty. In the simplest situation, a decision maker must choose the best decision from a finite set of alternatives when there are two or more possible future events, called states of nature, that might occur.
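The acceptance-sampling decision described earlier can be sketched with a binomial model. The plan parameters below (inspect 20 parts, accept with at most 1 defective) and the 5% defect rate are hypothetical assumptions for illustration, not figures from this study:

```python
from math import comb

def accept_probability(n, c, p):
    """Probability of accepting the lot under a single sampling plan:
    inspect n parts and accept if at most c defectives are found,
    where p is the true fraction defective in the lot (binomial model)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(c + 1))

# Hypothetical plan: inspect 20 parts, accept with at most 1 defective,
# when 5% of the parts in the lot are actually defective
print(round(accept_probability(20, 1, 0.05), 4))  # → 0.7358
```

Plotting this probability against p gives the plan's operating characteristic curve, which is how sampling plans are usually compared.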


Business, an organized approach to providing customers with the goods and services they want. The word business also refers to an organization that provides these goods and services. Most businesses seek to make a profit; that is, they aim to achieve revenues that exceed the costs of operating the business. Prominent examples of for-profit businesses include Mitsubishi Group, General Motors Corporation, and Royal Dutch/Shell Group. However, some businesses seek only to earn enough to cover their operating costs. Commonly called nonprofits, these organizations are primarily nongovernmental service providers. Examples of nonprofit businesses include such organizations as social service agencies, foundations, advocacy groups, and many hospitals.

Business plays a vital role in the life and culture of countries with industrial and postindustrial (service- and information-based) free market economies such as the United States. In free market systems, prices and wages are primarily determined by competition, not by governments. In the United States, for example, many people buy and sell goods and services as their primary occupations. In 2001 American companies sold in excess of $10 trillion worth of goods and services. Businesses provide just about anything consumers want or need, including basic necessities such as food and housing, luxuries such as whirlpool baths and wide-screen televisions, and even personal services such as caring for children and finding companionship.


TYPES OF BUSINESS
There are many types of businesses in a free market economy. The three most common are:
• Manufacturing firms
• Merchandisers
• Service enterprises

Manufacturing firms
Manufacturing firms produce a wide range of products. Large manufacturers include producers of airplanes, cars, computers, and furniture. Many manufacturing firms construct only parts rather than complete, finished products. These suppliers are usually smaller manufacturing firms, which supply parts and components to larger firms. The larger firms then assemble final products for market to consumers. For example, suppliers provide many of the components in personal computers, automobiles, and home appliances to large firms that create the finished or end products. These larger end-product manufacturers are often also responsible for marketing and distributing the products. The advantage that large businesses have in being able to efficiently and inexpensively control many parts of a production process is known as economies of scale. But small manufacturing firms may work best for producing certain types of finished products. Smaller end-product firms are common in the food industry and among artisan trades such as custom cabinetry.

Merchandisers
Merchandisers are businesses that help move goods through a channel of

distribution; that is, the route goods take in reaching the consumer. Merchandisers may be involved in wholesaling or retailing, or sometimes both. A wholesaler is a merchandiser who purchases goods and then sells them to buyers, typically retailers, for the purpose of resale. A retailer is a merchandiser who sells goods to consumers. A wholesaler often purchases products in large quantities and then sells smaller quantities of each product to retailers who are unable to either buy or stock large amounts of the product. Wholesalers operate somewhat like large, end-product manufacturing firms, benefiting from economies of scale. For example, a wholesaler might purchase 5,000 pairs of work gloves and then sell 100 pairs to 50 different retailers. Some large American discount chains, such as Kmart Corporation and Wal-Mart Stores, Inc., serve as their own wholesalers; these companies go directly to factories and other manufacturing outlets, buy in large amounts, and then warehouse and ship the goods to their stores.

The division between retailing and wholesaling is now being blurred by new technologies that allow retailing to become an economy of scale. Telephone and computer communications allow retailers to serve far greater numbers of customers in a given span of time than is possible in face-to-face interactions between a consumer and a retail salesperson. Computer networks such as the Internet, because they do not require any physical communication between salespeople and customers, allow a nearly unlimited capacity for sales interactions known as 24/7; that is, the Internet site can be open for transactions 24 hours a day, seven days a week, and for as many transactions as the network can handle. For example, a typical transaction to purchase a pair of shoes at a shoe store may take a half-hour from browsing, to fitting,

to the transaction with a cashier. But a customer can purchase a pair of shoes through a computer interface with a retailer in a matter of seconds. Computer technology also provides retailers with another economy of scale through the ability to sell goods without opening any physical stores, often referred to as electronic commerce or e-commerce. Retailers that provide goods entirely through Internet transactions do not incur the expense of building so-called brick-and-mortar stores or the expense of maintaining them.

Service enterprises
Service enterprises include many kinds of businesses. Examples include dry cleaners, shoe repair stores, barbershops, restaurants, ski resorts, hospitals, and hotels. In many cases service enterprises are moderately small because they do not have mechanized services and limit service to only as many individuals as they can accommodate at one time. For example, a waiter may be able to provide good service to four tables at once, but with five or more tables, customer service will suffer.

In recent years the number of service enterprises in wealthier free market economies has grown rapidly, and spending on services now accounts for a significant percentage of all spending. By the late 1990s, private services accounted for more than 21 percent of U.S. spending. Wealthier nations have developed postindustrial economies, where entertainment and recreation businesses have become more important than most raw material extraction, such as the mining of mineral ores, and some manufacturing industries in terms of creating jobs and stimulating economic growth. Many of these industries have moved to developing nations, especially with the rise of large multinational corporations. As postindustrial economies have

accumulated wealth, they have come to support systems of leisure, in which people are willing to pay others to do things for them. In the United States, vast numbers of people work rigid schedules for long hours in indoor offices, stores, and factories. Many employers pay high enough wages so that employees can afford to balance their work schedules with purchased recreation. People in the United States, for example, support thriving travel, theme park, resort, and recreational sport businesses.


Chi-square is a nonparametric test of statistical significance for bivariate tabular analysis (also known as cross-break analysis). Any appropriately performed test of statistical significance lets you know the degree of confidence you can have in accepting or rejecting a hypothesis. Typically, the hypothesis tested with chi-square is whether or not two different samples (of people, texts, etc.) are different enough in some characteristic or aspect of their behavior that we can generalize from our samples that the populations from which our samples are drawn are also different in that behavior or characteristic.

A nonparametric test, like chi-square, is a rough estimate of confidence; it accepts weaker, less accurate data as input than parametric tests (like t-tests and analysis of variance) and therefore has less status in the pantheon of statistical tests. Nonetheless, its limitations are also its strengths; because chi-square is more forgiving in the data it will accept, it can be used in a wide variety of research contexts. Chi-square is used most frequently to test the statistical significance of results reported in bivariate tables, and interpreting bivariate tables is integral to interpreting the results of a chi-square test, so we'll take a look at bivariate tabular (cross-break) analysis.

Bivariate Tabular Analysis
Bivariate tabular (cross-break) analysis is used when you are trying to summarize the intersections of independent and dependent variables and understand the relationship (if any) between those variables. For instance, if we wanted to know if there is any relationship between the biological sex of people of Mir Pur Bathoro and their footwear preferences, we might select

50 males and 50 females as randomly as possible, and ask them, “On average, do you prefer to wear sandals, sneakers, leather shoes, boots, or something else?” using the model form,

Name:
Sex:
Age:
Zodiac:
Occupation:
Footwear Choice: Sandals / Sneakers / Leather shoes / Boots / Others

In this case study, our independent variable is biological sex. (In experimental research the independent variable is actively manipulated by the researcher: for example, whether or not a rat gets a food pellet when it pulls on a striped bar. In most sociological research, the independent variable is not actively manipulated in this way, but controlled by sampling for, e.g., males vs. females.) Put another way, the independent variable is the quality or characteristic that you hypothesize helps to predict or explain some other quality or characteristic (the dependent variable). We control the independent variable (and as much else as possible and natural) and elicit


and measure the dependent variable to test our hypothesis that there is some relationship between them. Bivariate tabular analysis is good for asking questions of the following kinds:

1. Is there a relationship between any two variables IN THE DATA?
2. How strong is the relationship IN THE DATA?
3. What is the direction and shape of the relationship IN THE DATA?
4. Is the relationship due to some intervening variable(s) IN THE DATA?

To see any patterns or systematic relationship between the biological sex of people of Mir Pur Bathoro, Thatta and reported footwear preferences, we could summarize our results in a table like this:

Table 1. Male & Female of Mir Pur Bathoro's Footwear Preferences

          Sandals   Sneakers   Leather Shoes   Boots   Others
Male
Female
Depending upon how our 50 male and 50 female subjects responded, we could make a definitive claim about the (reported) footwear preferences of those 100 people. In constructing bivariate tables, values on the independent variable are typically arrayed on the vertical axis, while values on the dependent variable are arrayed on the horizontal axis. This allows us to read across from


hypothetically causal values on the independent variable to their effects, or values on the dependent variable. How we arrange the values on each axis should be guided by our research question/hypothesis. For example, if values on an independent variable were arranged from lowest to highest value on the variable and values on the dependent variable were arranged left to right from lowest to highest, a positive relationship would show up as a rising left-to-right line. (But remember, association does not equal causation: an observed relationship between two variables is not necessarily causal.)

Each intersection/cell of a value on the independent variable and a value on the dependent variable reports how many times that combination of values was chosen/observed in the sample being analyzed. (So we can see that cross-tabs are structurally most suitable for analyzing relationships between nominal and ordinal variables. Interval and ratio variables will have to be grouped first before they can "fit" into a bivariate table.) Each cell reports, essentially, how many subjects/observations produced that combination of independent and dependent variable values. So, for example, the top left cell of the table above answers the question: "How many males in Mir Pur Bathoro prefer sandals?"

Table 2: Male & Female of Mir Pur Bathoro’s Footwear preferences

          Sandals   Sneakers   Leather Shoes   Boots   Others
Male         6         17           13            9       5
Female      13          5            7           16       9

Reporting and interpreting cross-tabs is most easily done by converting raw frequencies (in each cell) into percentages within the values/categories of the independent variable. For example, in the footwear preferences table above, total each row, then divide each cell by its row total, and multiply that fraction by 100.

Table 3: Male & Female of Mir Pur Bathoro's Footwear Preferences (Percentages)

          Sandals   Sneakers   Leather Shoes   Boots   Others    N
Male        12        34           26           18       10     50
Female      26        10           14           32       18     50
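The conversion from Table 2's raw frequencies into Table 3's row percentages can be sketched as follows (the dictionary below mirrors Table 2's rows and columns):

```python
def row_percentages(table):
    """Convert each row of raw frequencies into percentages of its row total."""
    result = {}
    for group, counts in table.items():
        total = sum(counts.values())
        result[group] = {cat: 100 * n / total for cat, n in counts.items()}
    return result

observed = {  # Table 2: raw footwear-preference frequencies
    "Male":   {"Sandals": 6,  "Sneakers": 17, "Leather Shoes": 13,
               "Boots": 9,  "Others": 5},
    "Female": {"Sandals": 13, "Sneakers": 5,  "Leather Shoes": 7,
               "Boots": 16, "Others": 9},
}
print(row_percentages(observed)["Male"]["Sneakers"])  # → 34.0
```

Because each row total here is 50, every percentage is exactly double the raw count, which is easy to verify against Table 3.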


[Bar chart: Male Preference, percentage (ratio) of males in each footwear category: Sandals, Sneakers, Leather Shoes, Boots, Others]

[Bar chart: Female Preference, percentage (ratio) of females in each footwear category: Sandals, Sneakers, Leather Shoes, Boots, Others]

Percentages basically standardize cell frequencies as if there were 100 subjects/observations in each category of the independent variable. This is useful for comparing across values on the independent variable, but that usefulness comes at the price of a generalization from the actual number of subjects/observations in that column in your data to a hypothetical 100 subjects/observations. If the raw row total was 93, then the generalization (on no statistical basis, i.e., with no knowledge of sample-population representativeness) is drastic. So we should provide the total N at the end of each row/independent variable category (for reliability and to enable the reader to assess our interpretation of the table's meaning).

With this limitation in mind, we can compare the patterns of distribution of subjects/observations along the dependent variable between the values of the independent variable: e.g., compare male and female of Mir Pur Bathoro footwear preferences. (For some data, plotting the results on a line graph can also help you interpret the results: i.e., whether there is a positive (/), negative (\), or curvilinear (∨, ∧) relationship between the variables.) Table 3 shows that within our sample, roughly twice as many females preferred sandals and boots as males; and within our sample, about three times as many men preferred sneakers as women, and twice as many men preferred leather shoes. We might also infer from the "Others" category that females within our sample had a broader range of footwear preferences than did males.

Generalizing from Samples to Populations
Converting raw observed values or frequencies into percentages allows us to see patterns in the data more easily, but that is all we can see: what is in the data. Knowing with great certainty the footwear preferences of a particular group of 100 males and females of Taluka Mir Pur Bathoro is of limited use to us; we usually want to measure a sample in order to know something about the larger populations from which our samples were drawn. On the basis of raw observed frequencies (or percentages) of a sample's behavior or characteristics, we can make claims about the sample itself, but we cannot generalize to make claims about the population from which we drew our


sample, unless we submit our results to a test of statistical significance. A test of statistical significance tells us how confidently we can generalize to a larger (unmeasured) population from a (measured) sample of that population.

How does chi-square do this? Basically, the chi-square test of statistical significance is a series of mathematical formulae which compare the actual observed frequencies of some phenomenon (in our sample) with the frequencies we would expect if there were no relationship at all between the two variables in the larger (sampled) population. That is, chi-square tests our actual results against the null hypothesis and assesses whether the actual results are different enough to overcome a certain probability that they are due to sampling error. In a sense, chi-square is a lot like percentages; it extrapolates a population characteristic (a parameter) from the sampling characteristic (a statistic) similarly to the way a percentage standardizes a frequency to a total column N of 100. But chi-square works within the frequencies provided by the sample and does not inflate (or minimize) the column and row totals.

Chi-Square Requirements
As mentioned before, chi-square is a nonparametric test. It does not require the sample data to be more or less normally distributed (as parametric tests like t-tests do), although it relies on the assumption that the variable is normally distributed in the population from which the sample is drawn. But chi-square, while forgiving, does have some requirements:

1. The sample must be randomly drawn from the population.


2. Data must be reported in raw frequencies (not percentages);
3. The measured variables must be independent;
4. Values/categories on independent and dependent variables must be mutually exclusive and exhaustive;
5. Observed frequencies cannot be too small.

1. As with any test of statistical significance, your data must be from a random sample of the population to which we wish to generalize our claims.
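The comparison chi-square performs, observed frequencies against the frequencies expected under the null hypothesis of no relationship, can be sketched in Python using the observed frequencies from Table 2:

```python
def chi_square(observed):
    """Chi-square statistic for a bivariate table given as a list of rows.
    Expected cell frequency = (row total * column total) / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / grand
            chi2 += (obs - exp) ** 2 / exp
    return chi2

# Table 2 frequencies: rows = Male, Female; columns = Sandals, Sneakers,
# Leather Shoes, Boots, Others
print(round(chi_square([[6, 17, 13, 9, 5], [13, 5, 7, 16, 9]]), 2))  # → 14.03
```

With (2-1) × (5-1) = 4 degrees of freedom, a statistic of about 14.03 exceeds the conventional 0.05 critical value of 9.49, so by this sketch the sample difference between male and female preferences would be judged statistically significant.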


2. We should only use chi-square when our data are in the form of raw frequency counts of things in two or more mutually exclusive and exhaustive categories. As discussed above, converting raw frequencies into percentages standardizes cell frequencies as if there were 100 subjects/observations in each category of the independent variable for comparability. Part of the chi-square mathematical procedure accomplishes this standardizing, so computing the chi-square of percentages would amount to standardizing an already standardized measurement.


3. Any observation must fall into only one category or value on each variable. In our footwear example, our data are counts of males versus females expressing a preference for five different categories of footwear. Each observation/subject is counted only once, as male or female (an exhaustive typology of biological sex) and as preferring sandals, sneakers, leather shoes, boots, or other kinds of footwear. For some variables, no 'other' category may be needed, but often 'other' ensures that the variable has been exhaustively categorized. (For some kinds of analysis, we may need to include an 'un-codable' category.) In any case, we must include the results for the whole sample.

4. Furthermore, we should use chi-square only when observations are independent, i.e. no category or response is dependent upon or influenced by another. (In linguistics, this rule is often fudged a bit. For example, if we have one dependent variable/column for linguistic feature X and another column for number of words spoken or written (where the rows correspond to individual speakers/texts or groups of speakers/texts which are being compared), there is clearly some relation between the frequency of feature X in a text and the number of words in a text, but it is a distant, not an immediate, dependency.)

5. Chi-square is an approximate test of the probability of getting the frequencies we have actually observed if the null hypothesis were true. It is based on the expectation that, within any category, sample frequencies are normally distributed about the expected population value. Since (logically) frequencies cannot be negative, the distribution cannot be normal when expected population values are close to zero, because the sample frequencies cannot be much below the expected frequency while they can be much above it (an asymmetric/non-normal distribution). So, when expected frequencies are large, there is no reason to question the assumption of normal distribution, but the smaller the expected frequencies, the less valid are the results of the chi-square test. We will discuss below how expected frequencies are derived from observed frequencies. Therefore, if we have cells in our bivariate table

which show very low raw observed frequencies (5 or below), our expected frequencies may also be too low for chi-square to be appropriately used. In addition, because some of the mathematical formulas used in chi-square involve division, no cell in the table can have an observed raw frequency of 0. The following minimum frequency thresholds should be obeyed:

• For a 1×2 or 2×2 table, expected frequencies in each cell should be at least 5;
• For a 2×3 table, expected frequencies should be at least 2;
• For a 2×4 or 3×3 or larger table, if all expected frequencies but one are at least 5, and if the one small cell is at least 1, chi-square is still a good approximation.

In general, the greater the degrees of freedom (i.e. the more values/categories on the independent and dependent variables), the more lenient the minimum expected frequencies threshold.
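The rule-of-thumb thresholds above can be collected into a small helper. This is a sketch of my own (the function name and interface are not from the thesis), encoding only the thresholds quoted above:

```python
# Sketch (my own helper, not from the thesis): check whether a table's
# expected frequencies satisfy the rule-of-thumb thresholds listed above.
def expected_frequencies_ok(expected):
    """expected: list of rows, each a list of expected cell frequencies."""
    cells = [v for row in expected for v in row]
    if any(v == 0 for v in cells):
        return False  # chi-square divides by expected frequencies, so none may be 0
    shape = tuple(sorted((len(expected), len(expected[0]))))
    if shape in ((1, 2), (2, 2)):
        return all(v >= 5 for v in cells)   # 1x2 / 2x2: every cell at least 5
    if shape == (2, 3):
        return all(v >= 2 for v in cells)   # 2x3: every cell at least 2
    small = [v for v in cells if v < 5]     # larger tables: at most one cell
    return len(small) <= 1 and all(v >= 1 for v in small)  # below 5, and it is >= 1

# The footwear table (2x5) passes easily: its smallest expected frequency is 7.
print(expected_frequencies_ok([[9.5, 11, 10, 12.5, 7], [9.5, 11, 10, 12.5, 7]]))  # True
```
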

Collapsing Values

A brief word about collapsing values/categories on a variable is necessary. First, although categories on a variable (especially a dependent variable) may be collapsed, they cannot be excluded from a chi-square analysis. That is, we cannot arbitrarily exclude some subset of our data from our analysis. Second, a decision to collapse categories should be carefully motivated, with consideration for preserving the integrity of the data as it was originally collected. (For example, how could we collapse the footwear preference categories in our example and still preserve the integrity of the original question/data? We can't, since there is no way to know whether combining, e.g., boots and leather shoes versus sandals and sneakers is true to our subjects' typology of footwear.) As a rule, we should perform a chi-square on the data in its un-collapsed form; if the chi-square value achieved is significant, then we may collapse categories to test subsequent refinements of the original hypothesis.

Computing Chi-Square

Let's walk through the process by which a chi-square value is computed, using Table 2 above. The first step is to determine our threshold of tolerance for error. That is, what odds are we willing to accept that we are wrong in generalizing from the results in our sample to the population it represents? Are we willing to stake a claim on a 50% chance that we're wrong? Or something stricter? The answer depends largely on our research question and the consequences of being wrong. If people's lives depend on our interpretation of our results, we might want to take only 1 chance in 100,000 (or 1,000,000) that we're wrong. But if the stakes are smaller, for example, whether or not two texts use the same frequencies of some linguistic feature (assuming this is not a forensic issue in a capital murder case!), we might accept a greater probability, 1 in 100 or even 1 in 20, that our data do not represent the population we're generalizing about. The important thing is to explicitly motivate the threshold before performing any test of statistical significance, to minimize any temptation for post hoc compromise of


scientific standards. For our purposes, we'll set a probability-of-error threshold of 1 in 20, or p < .05, for our Footwear Study.

Table 3: Male and Female of Mir Pur Bathoro Footwear Preferences, Observed Frequencies.

          Sandals  Sneakers  Leather shoes  Boots  Other  Total
Male         6        17          13          9      5      50
Female      13         5           7         16      9      50
Total       19        22          20         25     14     100

[Bar chart: Male & Female Choice, observed frequencies by footwear category]

Remember that chi-square operates by comparing the actual, or observed, frequencies in each cell in the table to the frequencies we would expect if there were no relationship at all between the two variables in the population from which the sample is drawn. In other words, chi-square compares what actually happened to what hypothetically would have happened if 'all other things were equal.' If our actual results are sufficiently different from the predicted null hypothesis results, we can reject the null hypothesis and claim that a

statistically significant relationship exists between our variables. Chi-square derives a representation of the null hypothesis, the 'all other things being equal' scenario, in the following way. The expected frequency in each cell is the product of that cell's row total multiplied by that cell's column total, divided by the sum total of all observations. So, to derive the expected frequency of the "Males who prefer Sandals" cell, we multiply the top row total (50) by the first column total (19) and divide that product by the sum total (100): ((50 × 19) / 100) = 9.5. The logic of this is that we are deriving the expected frequency of each cell from the union of the total frequencies of the relevant values on each variable (in this case, Male and Sandals), as a proportion of all observed frequencies (across all values of each variable). This calculation is performed to derive the expected frequency of each cell, as shown in Table 5 below (the computation for each cell is listed below Table 5).


Table 5: Observed and Expected Frequencies.

                   Sandals  Sneakers  Leather shoes  Boots  Other  Total
Male Observed         6        17          13          9      5      50
Male Expected        9.5       11          10         12.5    7      50
Female Observed      13         5           7         16      9      50
Female Expected      9.5       11          10         12.5    7      50
Total                19        22          20         25     14     100

Expected value = (cell's column total) × (cell's row total) / (sum total of all observations)

Male/Sandals: ((19 × 50)/100) = 9.5
Male/Sneakers: ((22 × 50)/100) = 11
Male/Leather Shoes: ((20 × 50)/100) = 10
Male/Boots: ((25 × 50)/100) = 12.5
Male/Other: ((14 × 50)/100) = 7
Female/Sandals: ((19 × 50)/100) = 9.5
Female/Sneakers: ((22 × 50)/100) = 11
Female/Leather Shoes: ((20 × 50)/100) = 10
Female/Boots: ((25 × 50)/100) = 12.5
Female/Other: ((14 × 50)/100) = 7
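The row-total × column-total / grand-total rule can be sketched in a few lines of Python. This is a sketch of mine (the variable names are not from the thesis); the counts are those of Table 3:

```python
# Observed frequencies from Table 3.
observed = {
    "Male":   {"Sandals": 6,  "Sneakers": 17, "Leather shoes": 13, "Boots": 9,  "Other": 5},
    "Female": {"Sandals": 13, "Sneakers": 5,  "Leather shoes": 7,  "Boots": 16, "Other": 9},
}

row_totals = {sex: sum(prefs.values()) for sex, prefs in observed.items()}  # 50 each
col_totals = {}
for prefs in observed.values():
    for shoe, count in prefs.items():
        col_totals[shoe] = col_totals.get(shoe, 0) + count  # 19, 22, 20, 25, 14
grand_total = sum(row_totals.values())                      # 100

# Expected cell frequency = (row total * column total) / grand total
expected = {
    sex: {shoe: row_totals[sex] * col_totals[shoe] / grand_total for shoe in prefs}
    for sex, prefs in observed.items()
}

print(expected["Male"]["Sandals"])  # 9.5
```
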


[Bar charts: Male (Expected) and Female (Expected) footwear preferences; categories 1–5 = Sandals, Sneakers, Leather shoes, Boots, Others]

As we have originally obtained a balanced male/female sample, our male and female expected scores are the same. This usually will not be the case. We now have a comparison of the observed results versus the results we would expect if the null hypothesis were true. We can informally analyze this table, comparing observed and expected frequencies in each cell (Males prefer sandals less than expected), across values on the independent variable (Males prefer sneakers more than expected, Females less than expected), or across values on the dependent variable (Females prefer sandals and boots more than expected, but sneakers and shoes less than expected). But so far, the extra computation doesn't really add much more information than interpretation of the results in percentage form. We need some way to measure how different our observed results are from the null hypothesis. Or, to put it another way, we need some way to determine whether we can reject the null hypothesis, and if we can, with what degree of confidence that we're not making a mistake in generalizing from our sample results to the larger population.

Logically, we need to measure the size of the difference between the pair of observed and expected frequencies in each cell. More specifically, we calculate the difference between the observed and expected frequency in each cell, square that difference, and then divide that squared difference by the expected frequency. The formula can be expressed as

((O - E)^2) / E

where O is the observed frequency, E is the expected frequency, and ^ denotes exponentiation. Squaring the difference ensures a positive number, so that we end up with an absolute value of differences. If we didn't work with absolute values, the positive and negative differences across the entire table would always add up to 0. (You really understand the logic of chi-square if you can figure out why this is true.) Dividing the squared difference by the expected frequency essentially removes the expected frequency from the equation, so that the remaining measures of observed/expected difference are comparable across all cells. So, for example, the difference between observed and expected frequencies for the Male/Sandals preference is calculated as follows:

Observed (6) minus Expected (9.5) = -3.5
Difference (-3.5) squared = 12.25
Difference squared (12.25) divided by Expected (9.5) = 1.289

The sum of the results of this calculation for each cell is the total chi-square value for the table. The computation of chi-square for each cell is listed below Table 6.
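The Male/Sandals computation above can be expressed as a one-cell helper. This is a sketch of mine (the function name is not from the thesis):

```python
# A one-cell sketch of ((O - E)^2) / E, applied to Male/Sandals.
def cell_chi_square(observed: float, expected: float) -> float:
    difference = observed - expected   # e.g. 6 - 9.5 = -3.5
    return difference ** 2 / expected  # 12.25 / 9.5 ≈ 1.289

print(round(cell_chi_square(6, 9.5), 3))  # 1.289
```
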


Table 6. Male and Female of Mir Pur Bathoro, Footwear Preferences: Observed and Expected Frequencies & Chi-Square.

                   Sandals  Sneakers  Leather shoes  Boots  Others  Total
Male Observed         6        17          13          9      5       50
Male Expected        9.5       11          10         12.5    7       50
Female Observed      13         5           7         16      9       50
Female Expected      9.5       11          10         12.5    7       50
Total                19        22          20         25     14      100



Chi-square contribution = ((Observed − Expected)^2) / Expected

Male/Sandals: ((6 − 9.5)^2 / 9.5) = 1.289
Male/Sneakers: ((17 − 11)^2 / 11) = 3.273
Male/Leather Shoes: ((13 − 10)^2 / 10) = 0.900
Male/Boots: ((9 − 12.5)^2 / 12.5) = 0.980
Male/Other: ((5 − 7)^2 / 7) = 0.571
Female/Sandals: ((13 − 9.5)^2 / 9.5) = 1.289
Female/Sneakers: ((5 − 11)^2 / 11) = 3.273
Female/Leather Shoes: ((7 − 10)^2 / 10) = 0.900
Female/Boots: ((16 − 12.5)^2 / 12.5) = 0.980
Female/Other: ((9 − 7)^2 / 7) = 0.571

Total chi-square value (sum of all cell values) = 14.026

The total chi-square value for Table 1 is 14.026.
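The whole-table sum can be sketched in Python (my own sketch, not from the thesis). Note that summing the rounded per-cell values, as done above, gives 14.026; the unrounded sum is approximately 14.027:

```python
# Summing (O - E)^2 / E over all ten cells of Table 6.
observed = [6, 17, 13, 9, 5,   # Male: Sandals, Sneakers, Leather shoes, Boots, Other
            13, 5, 7, 16, 9]   # Female, same order
expected = [9.5, 11, 10, 12.5, 7] * 2  # both rows share the same expected values

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi_square, 2))  # 14.03
```
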

Interpreting the Chi-Square Value

We now need some criterion or yardstick against which to measure the table's chi-square value, to know whether or not it is significant. What we need to know is the probability of getting a chi-square value of a minimum given size even if our variables are not related at all in the larger population from which our sample was drawn. That is, we need to know how much larger than 0 (the absolute chi-square value of the null hypothesis) our table's chi-square value must be before we can confidently reject the null hypothesis. The probability we seek depends in part on the degrees of freedom of the table from which our chi-square value is derived.

Degrees of Freedom

Mechanically, a table's degrees of freedom (df) can be expressed by the following formula:

df = (r − 1)(c − 1)

That is, a table's degrees of freedom equals the number of rows in the table minus one, multiplied by the number of columns in the table minus one. (For 1×2 tables: df = k − 1, where k = number of values/categories on the variable.) Degrees of freedom are an issue because of the way in which expected values in each cell are computed from the row and column totals of each cell. All but one of the expected values in a given row or column are free to vary (within the total observed, and therefore expected, frequency of that row or column); once the free-to-vary expected cells are specified, the last one is fixed by virtue of the fact that the expected frequencies must add up to the observed row and column totals (from which they are derived).


For our 2×5 table:

df = (#rows − 1) × (#columns − 1) = (2 − 1) × (5 − 1) = 1 × 4 = 4

The sampling distribution of chi-square (also known as the critical values of chi-square) is typically listed in an appendix of a statistics book. We read down the column representing our previously chosen probability-of-error threshold (e.g., p < .05) and across the row representing the degrees of freedom in our table. If our chi-square value is larger than the critical value in that cell, our data present a statistically significant relationship between the variables in our table. Table 1's chi-square value of 14.026, with 4 degrees of freedom, handily clears the related critical value of 9.49, so we can reject the null hypothesis and affirm the claim that males and females of Mir Pur Bathoro differ in their (self-reported) footwear preferences. Statistical significance does not help us to interpret the nature or explanation of that relationship; that must be done by other means (including bivariate tabular analysis and qualitative analysis of the data). But a statistically significant chi-square value denotes the degree of confidence with which we may hold that the relationship between variables described in our results is systematic in the larger population and not attributable to random error. Statistical significance also does not ensure substantive significance. A large enough sample may demonstrate a statistically significant relationship between two variables, but that relationship may be a trivially weak one. Statistical significance means only that the pattern of distribution and relationship between variables found in the data from a sample can be confidently generalized to the larger population from which the sample was randomly drawn. By itself, it does not ensure that the relationship is theoretically or practically important, or even very large.
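Instead of reading a printed table of critical values, the tail probability can be computed directly. For even degrees of freedom (as in our df = 4 case), the chi-square tail probability has a simple closed form; the sketch below is my own and is not part of the thesis:

```python
import math

# Closed form for even df: P(X > x) = exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
def chi_square_p_value(x: float, df: int) -> float:
    assert df % 2 == 0, "this closed form holds only for even df"
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

p = chi_square_p_value(14.026, 4)
print(round(p, 4))  # 0.0072, well under our 1-in-20 (p < .05) threshold
print(p < 0.05)     # True
```
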

Measures of Association

While the theoretical or practical importance of a statistically significant result cannot be quantified, the relative magnitude of a statistically significant relationship can be measured. Chi-square allows us to make decisions about whether there is a relationship between two or more variables; if the null hypothesis is rejected, we conclude that there is a statistically significant relationship between the variables. But we frequently want a measure of the strength of that relationship, an index of degree of correlation, a measure of the degree of association between the variables represented in our table (and data). Luckily, several related measures of association can be derived from a table's chi-square value. For tables larger than 2×2 (like our Table 1), a measure called Cramer's phi is derived by the following formula (where N = the total number of observations, and k = the smaller of the number of rows or columns):

Cramer's phi = the square root of (chi-square divided by (N × (k − 1)))

So, for our Table 1 (2×5), we would compute Cramer's phi as follows:

N(k − 1) = 100 × (2 − 1) = 100
Chi-square / 100 = 14.026 / 100 = 0.14
Square root of 0.14 = 0.37

The result is interpreted as a Pearson r (that is, as a correlation coefficient). For 2×2 tables, a measure called phi is derived by dividing the table's chi-square value by N (the total number of observations) and then taking the square root of the result. Phi is also interpreted as a Pearson r.


A complete account of how to interpret correlation coefficients is unnecessary for present purposes. It will suffice to say that r² is a measure called shared variance. Shared variance is the portion of the total behavior (or distribution) of the variables measured in the sample data which is accounted for by the relationship we have already detected with our chi-square. For Table 1, r² = 0.137, so approximately 14% of the total footwear preference story is explained/predicted by biological sex. Computing a measure of association like phi or Cramer's phi is rarely done in quantitative linguistic analyses, but it is an important benchmark of just 'how much' of the phenomenon under investigation has been explained. For example, Table 1's Cramer's phi of 0.37 (r² = 0.137) means that there are one or more variables still undetected which, cumulatively, account for and predict the remaining 86% of footwear preferences. This measure, of course, doesn't begin to address the nature of the relation(s) between these variables, which is a crucial part of any adequate explanation or theory.
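The Cramer's phi and shared-variance arithmetic can be checked with a short sketch (mine, not from the thesis), using the table's values:

```python
import math

# Cramer's phi and shared variance for Table 1's chi-square value.
chi_square = 14.026
N = 100  # total observations
k = 2    # the smaller of the table's number of rows (2) and columns (5)

cramers_phi = math.sqrt(chi_square / (N * (k - 1)))
shared_variance = cramers_phi ** 2  # r^2: fraction of variance accounted for

print(round(cramers_phi, 2))      # 0.37
print(round(shared_variance, 2))  # 0.14
```
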

Conclusion

Business can be well managed and enhanced using sociological statistical methods. The case study of male and female footwear preference helps the Aashu Footwear House owner, Mr. Atta-ullah Khattri of taluka Mir Pur Bathoro, explain up to 14% of his customers' preferences.


By stocking according to the observed male and female footwear preferences, he can act on only that 14%; the remaining 86% of the business management story is hidden in other variables, which can be found, and hence the work can be further extended as described in the future scope.

Future Scope

As the conclusion tells us, 14% of the total footwear preference story is explained/predicted by biological sex, and hence the business can be managed using this statistical approach only up to that 14%. The remaining 86% of the reasons for keeping particular footwear in the Aashu Footwear House is still unknown; i.e., the thesis can be extended further by exploring the undetected variables (unused variables in the model can also help) which account for the rest of the 86% of footwear preference.


