
Fall 2011 - August Drive

MBA SEMESTER 1 MB0040 STATISTICS FOR MANAGEMENT - 4 Credits (Book ID: B1129) Assignment Set 1 (60 Marks). Note: Each question carries 10 Marks. Answer all the questions.

1. (a) Statistics is the backbone of decision-making. Comment.

Owing to advanced communication networks, rapid changes in consumer behaviour, the varied expectations of a wide variety of consumers, and new market openings, modern managers face the difficult task of making quick and appropriate decisions. They therefore need to depend more on quantitative techniques such as mathematical models, statistics, operations research and econometrics.

Decision making is a key part of our day-to-day life. Even when we wish to purchase a television, we like to know the price, quality, durability and maintainability of various brands and models before buying one. In this scenario we are collecting data and making an optimum decision; in other words, we are using Statistics. Again, if a company wishes to introduce a new product, it has to collect data on market potential, consumer preferences, availability of raw materials and the feasibility of producing the product. Hence, data collection is the backbone of any decision-making process.

Many organisations find themselves data-rich but poor at drawing information from the data. It is therefore important to develop the ability to extract meaningful information from raw data so as to make better decisions, and Statistics plays an important role in this respect.

Statistics is broadly divided into two main categories: descriptive statistics and inferential statistics.

Descriptive statistics is used to present a general description of data summarised quantitatively. This is particularly useful in clinical research, when communicating the results of experiments.

Inferential statistics is used to make valid inferences from data, which help managers and professionals make effective decisions. Statistical methods such as estimation, prediction and hypothesis testing belong to inferential statistics. Researchers draw conclusions from collected data samples regarding the characteristics of the large population from which the samples are taken. So we can say that Statistics is the backbone of decision-making.

Statistics is used for various purposes: to simplify mass data, to make comparisons easier, to bring out trends and tendencies in the data, and to reveal hidden relations between variables. All of this makes decision making much easier. Let us look at each function of Statistics in detail.

1. Statistics simplifies mass data: The use of statistical concepts helps in the simplification of complex data, so that managers can make decisions more easily. Statistical methods reduce the complexity of data and consequently aid the understanding of any huge mass of data.

2. Statistics makes comparison easier: Without statistical methods and concepts, the collection and comparison of data cannot be done easily. Statistics helps us compare data collected from different sources. Grand totals, measures of central tendency, measures of dispersion, graphs and diagrams, and coefficients of correlation all provide ample scope for comparison.

3. Statistics brings out trends and tendencies in the data: After data is collected, it is easy to analyse trends and tendencies in the data using the various concepts of Statistics.

4. Statistics brings out the hidden relations between variables: Statistical analysis helps in drawing inferences from data and brings out the hidden relations between variables.

5. Decision making becomes easier: With the proper application of Statistics and statistical software packages to the collected data, managers can take effective decisions that can increase profits in a business.

Seeing all these functions, we can say that Statistics is only as good as its user.

(b) Give the plural meaning of the word Statistics.

Meanings of Statistics

The word statistics has three different meanings (senses), which are discussed below:
(1) Plural sense
(2) Singular sense
(3) Plural of the word statistic

(1) Plural sense: In the plural sense, the word statistics refers to numerical facts and figures collected in a systematic manner with a definite purpose in any field of study. In this sense, statistics are aggregates of facts expressed in numerical form, for example statistics on industrial production, or statistics on the population growth of a country in different years.

(2) Singular sense: In the singular sense, statistics refers to the science comprising the methods used in the collection, analysis, interpretation and presentation of numerical data. These methods are used to draw conclusions about population parameters. For example, suppose we want to study the distribution of the weights of students in a certain college. First we collect information on the weights, either from the records of the college or from the students directly. A large number of weight figures will confuse the mind, so we may arrange the weights in groups such as 50 kg to 60 kg, 60 kg to 70 kg, and so on, and find the number of students falling in each group. This step is called presentation of data. We may go further and compute the averages and other measures which give a complete description of the original data.

(3) Plural of the word statistic: The word statistics is also used as the plural of the word statistic, which refers to a numerical quantity, such as the mean, median or variance, calculated from sample values. For example, if we select 15 students from a class of 80 students, measure their heights and find the average height, this average would be a statistic.
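The grouping-and-summarising step described in (2) above is easy to reproduce in code. The following is a minimal illustrative sketch only: the weight values and the 10 kg class boundaries are hypothetical, not taken from the text.

```python
# Minimal sketch: group student weights into 10 kg classes and summarise.
# The data values and class boundaries are hypothetical examples.

weights = [52, 57, 61, 63, 64, 68, 71, 74, 78, 55, 66, 69]  # kg, made-up sample

# Build a frequency table with classes 50-60, 60-70, 70-80.
classes = [(50, 60), (60, 70), (70, 80)]
frequency = {c: sum(1 for w in weights if c[0] <= w < c[1]) for c in classes}

for (lo, hi), count in frequency.items():
    print(f"{lo} kg to {hi} kg: {count} students")

# Descriptive summary of the raw data.
mean_weight = sum(weights) / len(weights)
print(f"Mean weight: {mean_weight:.1f} kg")
```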

Statistics in Plural Sense

Statistics in the plural sense means the mass of quantitative information called 'data'. For example, we talk of information on the population or demographic features of India available from the Population Census conducted every ten years by the Government of India. Similarly, we can have statistics (quantitative data, or simply data) on the enrolment of students in a particular university, say, over the last ten years. Further, data are collected by almost all ministries of the Government of India relating to their activities. Also referred to as statistical data, statistics in the plural sense is described by Horace Secrist as follows: "By statistics we mean aggregates of facts affected to a marked extent by multiplicity of causes, numerically expressed, enumerated or estimated according to reasonable standards of accuracy, collected in a systematic manner for a pre-determined purpose and placed in relation to each other." This definition of statistics in the plural sense highlights the following features:

a) Statistics are numerical facts: For information obtained from an investigation to be called statistics or data, it must be capable of being represented by numbers. The data may be obtained either by the measurement of characteristics (such as heights and weights) or by counting when the characteristic (such as honesty, smoking habit or beauty) is not measurable.

b) Statistics are aggregates of facts: Single and unrelated figures, even though expressed as quantities, are not statistics. For example, that Mr. Sharma secures 65% marks in a university examination does not by itself constitute statistics or data. However, if we find that, out of 3 lakh university students whose average marks were 55%, Mr. Sharma secured 65% marks, then these figures are statistics. No single figure in any sphere of statistical inquiry, say production, employment, wages or income, constitutes statistics.

c) Statistics are affected to a marked extent by a multiplicity of causes: In the physical sciences it is possible to isolate the effect of various forces on a particular event. But in Statistics the facts and figures, that is, the collected information, are greatly influenced by a number of factors and forces working together. For example, the output of wheat in a year is affected by various factors such as the availability of irrigation, the quality of the soil, the method of cultivation, the type of seed and the amount of fertiliser used. In addition, there may be certain factors which are difficult even to identify.

d) Statistics are numerically expressed: Statistics are statements of facts expressed numerically. Qualitative statements like "the students of school ABC are more intelligent than those of school XYZ" cannot be regarded as statistics. By contrast, the statement that "the average marks in school ABC are 90% compared with 60% in school XYZ, and the former had 80% first divisions compared with only 50% in the latter" is a statistical statement.

e) Statistics are enumerated or estimated with a reasonable degree of accuracy: While enumerating or estimating statistics, a reasonable degree of accuracy must be achieved. The degree of accuracy needed in an investigation depends on the nature and objective of the investigation on the one hand, and on the available time and resources on the other. The degree of accuracy, once decided, must be uniformly maintained throughout the investigation.

f) Statistics are collected in a systematic manner: Before the collection of statistics, it is necessary to define the objective of the investigation. The objective must be specific and well defined. The data are then collected in a systematic manner by proper planning, which involves answering questions such as: whether to use a sample or a census investigation, and how to collect, arrange, present and analyse the data.

g) Statistics should be placed in relation to one another: Only comparable data make sense; unrelated and incomparable data are not data, they are just figures. For example, the heights and weights of the students of a class have no relation to the income and qualifications of their parents. For comparability, the data should be homogeneous; that is, they should belong to the same subject, class or phenomenon. For example, the pocket money of the students of a class is certainly related to the income of their parents, and the prices of onions and potatoes in Delhi can certainly be related to their prices in other cities of India.

b. Enumerate the factors which should be kept in mind for proper planning.
As we know, r = Σxy / (N·σx·σy). Given that Σxy/N = 17.5 (the covariance), σx = √49 = 7 and σy = √9 = 3, we get r = 17.5 / (7 × 3) ≈ 0.83, which means that there is a high positive correlation.
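A quick numerical check of this computation (a minimal sketch; the variable names are ours, the given quantities are from the problem):

```python
import math

# Given quantities: covariance of x and y, and the two variances.
cov_xy = 17.5
var_x = 49.0
var_y = 9.0

# Pearson's correlation coefficient: r = Cov(x, y) / (sigma_x * sigma_y).
r = cov_xy / (math.sqrt(var_x) * math.sqrt(var_y))
print(f"r = {r:.3f}")  # r = 0.833 -> a high positive correlation
```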

4. a. Explain the characteristics of business forecasting.

Characteristics of Business Forecasting

i. Based on past and present conditions: Business forecasting is based on the past and present economic conditions of the business. To forecast the future, various data, information and facts concerning the past and present economic condition of the business are analysed.

ii. Based on mathematical and statistical methods: The process of forecasting includes the use of statistical and mathematical methods. By using these methods, the actual trend which may take place in the future can be forecast.

iii. Period: The forecast can be made for the long term, short term, medium term or any specific term.

iv. Estimation of the future: Business forecasting estimates the probable future economic conditions.

v. Scope: The forecast can be physical as well as financial.

Steps in Forecasting

The forecasting of business fluctuations consists of the following steps:

i. Understanding why changes in the past have occurred: One of the basic principles of statistical forecasting is that the forecaster should use data on past performance. The current rate and changes in the rate constitute the basis of forecasting; once they are known, various mathematical techniques can develop projections from them. If an attempt is made to forecast business fluctuations without understanding why past changes have taken place, the forecast will be purely mechanical, based solely on the application of mathematical formulae, and subject to serious error.

ii. Determining which phases of business activity must be measured: After knowing why business fluctuations have occurred, it is necessary to measure certain phases of business activity in order to predict what changes will probably follow the present level of activity.

iii. Selecting and compiling data to be used as measuring devices: There is an interdependent relationship between the selection of statistical data and the determination of why business fluctuations occur. Statistical data cannot be collected and analysed in an intelligent manner unless there is a sufficient understanding of business fluctuations. It is important that the reasons for business fluctuations be stated in such a manner that it is possible to secure data related to those reasons.

iv. Analysing the data: Lastly, the data are analysed in the light of the understanding of why changes occur. For example, if it is reasoned that a certain combination of forces will result in a given change, the statistical part of the problem is to measure these forces from the data available and to draw conclusions on the future course of action. The methods of drawing such conclusions may be called forecasting techniques.

Methods of Business Forecasting

Almost all businessmen make forecasts about the business conditions related to their business. In recent years, scientific methods of forecasting have been developed, and the basis of scientific forecasting is statistics. To handle the increasing variety of managerial forecasting problems, several forecasting techniques have been developed. Forecasting techniques vary from simple expert guesses to complex analyses of mass data. Each technique has its special use, and care must be taken to select the correct technique for a particular situation. Before applying a method of forecasting, the following questions should be answered:

i. What is the purpose of the forecast, and how is it to be used?
ii. What are the dynamics and components of the system for which the forecast will be made?
iii. How important is the past in estimating the future?

b. Differentiate between prediction, projection and forecasting.

Business forecasting has always been one component of running an enterprise. However, forecasting traditionally was based less on concrete and comprehensive data than on face-to-face meetings and common sense. In recent years, business forecasting has developed into a much more scientific endeavour, with a host of theories, methods and techniques designed for forecasting certain types of data. The development of information technologies and the Internet propelled this development into overdrive, as companies not only adopted such technologies into their business practices, but into their forecasting schemes as well. In the 2000s, projecting the optimal levels of goods to buy or products to produce involved sophisticated software and electronic networks that incorporate mounds of data and advanced mathematical algorithms tailored to a company's particular market conditions and line of business. Business forecasting involves a wide range of tools, including simple electronic spreadsheets; enterprise resource planning (ERP) and electronic data interchange (EDI) networks; advanced supply chain management systems; and other Web-enabled technologies. The practice attempts to pinpoint key factors in business production and to extrapolate from given data sets to produce accurate projections for future costs, revenues and opportunities. This is normally done with an eye toward adjusting current and near-future business practices to take maximum advantage of expectations. In the Internet age, the field of business forecasting was propelled by three interrelated phenomena. First, the Internet provided a new series of tools to aid the science of business forecasting. Second, business forecasting had to take the Internet itself into account in trying to construct viable models and make predictions. Finally, the Internet fostered

vastly accelerated transformations in all areas of business that made the job of business forecasters that much more exacting. By the 2000s, as the Internet and its myriad functions highlighted the central importance of information in economic activity, more and more companies came to recognize the value, and often the necessity, of business forecasting techniques and systems. Business forecasting is indeed big business, with companies investing tremendous resources in systems, time, and employees aimed at bringing useful projections into the planning process. According to a survey by the Hudson, Ohio-based AnswerThink Consulting Group, which specializes in studies of business planning, the average U.S. company spends more than 25,000 person-days on business forecasting and related activities for every billion dollars of revenue. Companies have a vast array of business forecasting systems and software from which to choose, but choosing the correct one for their particular needs requires a good deal of investigation. According to the Journal of Business Forecasting Methods & Systems, any forecasting system

5. What are the components of time series? Bring out the significance of moving average in analysing a time series and point out its limitations.

In statistics, signal processing and many other fields, a time series is a sequence of data points, measured typically at successive times, spaced at (often uniform) time intervals. Time series analysis comprises methods that attempt to understand such series, often either to understand the underlying context of the data points (where did they come from? what generated them?) or to make forecasts (predictions). Time series forecasting is the use of a model to forecast future events based on known past events: to forecast future data points before they are measured. Standard examples in econometrics are:

* GDP growth forecasts
* Share price forecasts
* Product sales trends

The term time series analysis is used to distinguish such problems, firstly, from more ordinary data analysis problems (where there is no natural ordering of the individual observations) and, secondly, from spatial data analysis, where the observations (often) relate to geographical locations. There are additional possibilities in the form of space-time models (often called spatial-temporal analysis). A time series model will generally reflect the fact that

observations close together in time will be more closely related than observations further apart. In addition, time series models will often make use of the natural one-way ordering of time, so that values in a series for a given time are expressed as deriving in some way from past values rather than from future values.

Methods for time series analysis are often divided into two classes: frequency-domain methods and time-domain methods. Frequency-domain methods centre around spectral analysis and, more recently, wavelet analysis, and can be regarded as model-free analyses well suited to exploratory investigations. Time-domain methods have a model-free subset consisting of the examination of autocorrelation and cross-correlation analysis, but it is here that partly and fully specified time series models make their appearance.

Time Series Analyses

There are several types of analysis available for time series, appropriate for different purposes:

1. General exploration
- Graphical examination of the data series
- Autocorrelation analysis to examine serial dependence
- Spectral analysis to examine cyclic behaviour which need not be related to seasonality

2. Description
- Separation into components representing trend, seasonality, slow and fast variation, and cyclical irregularity
- Simple properties of marginal distributions

3. Prediction and forecasting
- Fully formed statistical models for stochastic simulation purposes, so as to generate alternative versions of the time series, representing what might happen over non-specific time periods in the future (prediction)
- Simple or fully formed statistical models to describe the likely outcome of the time series in the immediate future, given knowledge of the most recent outcomes (forecasting)

Time Series Models

When modelling variations in the level of a process, three broad classes of practical importance are:

1. Autoregressive (AR) models
2. Integrated (I) models
3. Moving average (MA) models

These three classes depend linearly on previous data points.

A time series is a sequence of observations which are ordered in time (or space). If observations are made on some phenomenon throughout time, it is most sensible to display the data in the order in which they arose, particularly since successive observations will probably be dependent. Time series are best displayed in a scatter plot, with the series value X plotted on the vertical axis and time t on the horizontal axis. Time is called the independent variable (in this case, however, a variable over which you have little control). There are two kinds of time series data:

Continuous, where we have an observation at every instant of time, e.g. lie detectors, electrocardiograms. We denote this using the observation X at time t, X(t).

Discrete, where we have an observation at (usually regularly) spaced intervals. We denote this as Xt.

Examples: Economics - weekly share prices, monthly profits. Meteorology - daily rainfall, wind speed, temperature. Sociology - crime figures (number of arrests, etc.), employment figures.

We want to increase our understanding of a time series by picking out its main features, and descriptive techniques may then be extended to forecast (predict) future values. The main features are the following components.

Trend Component

Trend is a long-term movement in a time series. It is the underlying direction (an upward or downward tendency) and rate of change in a time series, when allowance has been made for the other components. A simple way of detecting trend in seasonal data is to take averages over a certain period. If these averages change with time, we can say that there is evidence of a trend in the series. There are also more formal tests for detecting trend in time series. It can be helpful to model trend using straight lines, polynomials, etc.

Cyclical Component

In weekly or monthly data, the cyclical component describes any regular fluctuations that are non-seasonal, varying in a recognisable cycle. Example: a recession is a cyclical component in a country's economy, hitting the economy once every 7-10 years.

Seasonal Component

In weekly or monthly data, the seasonal component, often referred to as seasonality, is the component of variation in a time series which is dependent on the time of year. It describes any regular fluctuations with a period of less than one year. For example, the costs of various types of fruits and vegetables, unemployment figures and average daily rainfall all show marked seasonal variation. We are interested in comparing the seasonal effects within the years and from year to year; in removing seasonal effects so that the time series is easier to work with; and in adjusting a series for seasonal effects using various models. Examples: 1. The monsoon season in India is a seasonal component. 2. The sales of soft drinks and ice-cream go up in the summer months.

Irregular Component

The irregular component (or 'noise') is what is left over when the other components of the series (trend, seasonal and cyclical) have been accounted for. Example: tourist influx and movements over the years.

Related techniques include the following:

1. Exponential smoothing: Exponential smoothing is a smoothing technique used to reduce irregularities (random fluctuations) in time series data, thus providing a clearer view of the true underlying behaviour of the series. It also provides an effective means of predicting future values of the time series (forecasting); a short sketch follows this list.

2. Running medians smoothing: Running medians smoothing is a smoothing technique analogous to that used for moving averages. The purpose of the technique is the same: to make a trend clearer by reducing the effects of other fluctuations.

3. Autocorrelation: Autocorrelation is the correlation (relationship) between members of a time series of observations, such as weekly share prices or interest rates, and the same values at a fixed time interval later. More technically, autocorrelation occurs when residual error terms from observations of the same variable at different times are correlated (related).
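As an illustration of the first of these techniques, here is a minimal sketch of simple exponential smoothing. The smoothing constant α = 0.3 and the sample data are hypothetical illustration values, not from the text:

```python
# Minimal sketch of simple exponential smoothing.
# alpha and the sample data are hypothetical illustration values.

def exponential_smoothing(series, alpha=0.3):
    """Return the smoothed series: s_t = alpha * y_t + (1 - alpha) * s_{t-1}."""
    smoothed = [series[0]]  # initialise with the first observation
    for y in series[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed

data = [3, 10, 12, 13, 12, 10, 12]
print(exponential_smoothing(data))
```

A larger α tracks the raw series more closely; a smaller α smooths more aggressively.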

Extrapolation

Extrapolation is the estimation of the value of a variable at times which have not yet been observed. The estimate may be reasonably reliable for short times into the future, but for longer times it is liable to become less accurate. Examples: 1. Based on the last 9 months' sales trend, it is possible to make a reliable forecast for the total 12 months. 2. Based on the last 3 years' sales trend, it is possible to make an estimate for the following 12 months, but it will be less accurate.

Moving Averages and Time Series Forecasting

One of the well-known approaches to forecasting is the use of moving averages. But what is a moving average, and what effect does it have on a time series?

What is a Moving Average?

A moving average (MA) is a weighted sum carried over the time series: each term of the sum bears a weight, so the sum becomes a weighted sum. In mathematical terms, an m-period MA of the time series y with weighting coefficient w_s for lag s is:

z_t = Σ_{s=0}^{m-1} w_s · y_{t-s}

This is the most general equation for an MA, whatever the subtype; it is valid for exponential moving averages, triangular MAs, parabolic MAs, etc. The only thing that varies is how the weights are calculated. If the weighting coefficients are uniform, that is,

w_s = 1/m for every 0 ≤ s < m,

we obtain the classical and much-used simple moving average. This is a very simple yet effective algorithm that has been used for ages in forecasting, and even today, with so much computing power available, these (or variations of these) are the algorithms used everywhere to produce forecasts and predictions in every human field. But what are the statistical effects of applying the moving average algorithm to a time series?

Statistical Effects of the Moving Average and the Convolution Theorem

To explain the statistical effects of the MA we need to introduce a little mathematics that may not be familiar. Consider the discrete Fourier transform (DFT) of the original time series y:

Y(ω) = Σ_t y_t · exp(-iωt)

and suppose the original time series were replaced with an m-period MA over past values as defined before:

z_t = Σ_{s=0}^{m-1} w_s · y_{t-s}

The DFT of the resulting time series is then (using the convolution theorem):

Z(ω) = Σ_t z_t · exp(-iωt) = W(ω) · Y(ω), where W(ω) = Σ_s w_s · exp(-iωs)

If we use the uniform weighting introduced in the previous paragraph (i.e. the simple moving average), then the previous equation becomes (after some algebra):

W(ω) = (1/m) Σ_{s=0}^{m-1} exp(-iωs) = (1/m) · exp(-iω(m-1)/2) · sin(mω/2) / sin(ω/2)

The value of this expression at ω = 2π/m is zero: W(2π/m) = 0. Thus, taking an m-period simple moving average of the time series has completely destroyed the evidence for an m-period periodicity. So if we take a 12-month simple moving average of our time series, this will completely destroy the evidence of a yearly periodicity in the smoothed time series. This is not evident until we write out some basic mathematics as we did here; and if you analyse the results of the convolution theorem further, you will notice other frequencies at which the periodicity is distorted, not only in amplitude as in this case but also in phase. This effect is often ignored, and many practitioners who have not studied the underlying mathematics still use this algorithm (or members of the same family, such as exponential moving averages) for time series prediction. The main thing a simple moving average accomplishes is to make the graph of the time series look better to the human eye, while making the series less usable for quantitative analysis.

The Moving Average (Time Series) function returns the moving average of a field over a given period of time based on linear regression.

Parameters
Data: The data to use in the regression. This is typically a field in a data series or a calculated value.
Period: The number of bars of data to include in the regression, including the current value. For example, a period of 3 includes the current value and the two previous values.

Function Value
The time series moving average is calculated by fitting a linear regression line over the values for the given period, and then taking the current value of that line. A linear regression line is a straight line which is as close to all of the given values as possible.

The time series moving average at the beginning of a data series is not defined until there are enough values to fill the given period. Note that a time series moving average differs greatly from other types of moving averages in that the current value follows the recent trend of the data rather than being an actual average of the data. Because of this, the value of this function can be greater or less than all of the values being used when the trend of the data is generally increasing or decreasing.

Usage

Moving averages are useful for smoothing noisy raw data, such as daily prices. Price data can vary greatly from day to day, obscuring whether the price is going up or down over time. By looking at the moving average of the price, a more general picture of the underlying trends can be seen. Since moving averages can be used to see trends, they can also be used to see whether data is bucking the trend. Entry/exit systems often compare data to a moving average to determine whether it is supporting a trend or starting a new one. See the sample entry/exit systems for an example of using a moving average in an entry/exit system. This function is the same as the Linear Regression Indicator; it is also the same as the Time Series Forecast with an offset of zero.
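Returning to the earlier convolution-theorem result, here is a minimal sketch (Python with NumPy; the series, the period m = 12 and the helper names are ours) showing numerically that a 12-period simple moving average annihilates a 12-period cycle, exactly as W(2π/m) = 0 predicts:

```python
import numpy as np

# Hypothetical monthly series with a yearly (12-period) cycle plus noise.
rng = np.random.default_rng(0)
t = np.arange(132)
y = np.sin(2 * np.pi * t / 12) + 0.1 * rng.standard_normal(t.size)

# 12-period simple moving average (uniform weights 1/12) over past values.
m = 12
z = np.convolve(y, np.ones(m) / m, mode="valid")  # length 121

def amplitude_at(series, period):
    """Amplitude of the component with the given period, via the DFT."""
    n = series.size
    spectrum = np.fft.rfft(series - series.mean())
    k = round(n / period)  # bin index of the target frequency
    return 2 * np.abs(spectrum[k]) / n

# Compare equal-length windows so the frequency bins line up exactly.
print(f"12-period amplitude, raw:      {amplitude_at(y[:120], m):.4f}")  # ~1.0
print(f"12-period amplitude, smoothed: {amplitude_at(z[:120], m):.4f}")  # ~0.0
```

The yearly component survives in the raw series but is essentially zero after smoothing, illustrating the limitation discussed above.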

6. List down the various measures of central tendency and explain the differences between them.

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. As such, measures of central tendency are sometimes called measures of central location; they are also classed as summary statistics. The mean (often called the average) is most likely the measure of central tendency that you are most familiar with, but there are others, such as the median and the mode. The mean, median and mode are all valid measures of central tendency but, under different conditions, some measures become more appropriate to use than others. In the following sections we look at the mean, median and mode, learn how to calculate them, and see under what conditions each is most appropriate.

Mean (Arithmetic)

The mean (or average) is the most popular and well-known measure of central tendency. It can be used with both discrete and continuous data, although it is most often used with continuous data. The mean is equal to the sum of all the values in the data set divided by the number of values in the data set. So, if we have n values in a data set with values x1, x2, ..., xn, then the sample mean, usually denoted by x̄ (pronounced "x bar"), is:

x̄ = (x1 + x2 + ... + xn) / n

This formula is usually written in a slightly different manner using the Greek capital letter Σ, pronounced "sigma", which means "sum of":

x̄ = Σx / n

You may have noticed that the above formula refers to the sample mean. So, why have we called it a sample mean? This is because, in statistics, samples and populations have very different meanings, and these differences are very important, even if, in the case of the mean, they are calculated in the same way. To acknowledge that we are calculating the population mean and not the sample mean, we use the Greek lower-case letter μ ("mu"):

μ = Σx / N

The mean is essentially a model of your data set: a single value chosen to summarise it. You will notice, however, that the mean is often not one of the actual values observed in your data set. One of its important properties, though, is that it minimises error in the prediction of any one value in your data set; that is, it is the value that produces the lowest amount of error compared with all other values in the data set. Another important property of the mean is that it includes every value in your data set as part of the calculation. In addition, the mean is the only measure of central tendency for which the sum of the deviations of each value from the mean is always zero.

When not to use the mean

The mean has one main disadvantage: it is particularly susceptible to the influence of outliers. These are values that are unusual compared to the rest of the data set, being especially small or large in numerical value. For example, consider the wages of staff at a factory below:

Staff:  1    2    3    4    5    6    7    8    9    10
Salary: 15k  18k  16k  14k  15k  15k  12k  17k  90k  95k

The mean salary for these ten staff is $30.7k. However, inspecting the raw data suggests that this mean value might not be the best way to reflect the typical salary of a worker, as most workers have salaries in the $12k to $18k range. The mean is being skewed by the two large salaries. Therefore, in this situation, we would like a better measure of central tendency. As we will see, the median would be a better measure in this situation.

Another time when we usually prefer the median over the mean (or mode) is when our data is skewed (i.e. the frequency distribution of the data is skewed). Consider the normal distribution, as this is the one most frequently assessed in statistics: when the data is perfectly normal, the mean, median and mode are identical, and all represent the most typical value in the data set. However, as the data becomes skewed, the mean loses its ability to provide the best central location for the data, as the skewed data drags it away from the typical value. The median, however, best retains this position and is not as strongly influenced by the skewed values.

This is explained in more detail in the skewed distribution section later in this guide.

Median

The median is the middle score for a set of data that has been arranged in order of magnitude. The median is less affected by outliers and skewed data. To calculate the median, suppose we have the data below:

65 55 89 56 35 14 56 55 87 45 92

We first rearrange the data in order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89 92

Our median mark is the middle mark, in this case 56. It is the middle mark because there are 5 scores before it and 5 scores after it. This works fine when you have an odd number of scores, but what happens when you have an even number of scores? What if you had only 10 scores? Well, you simply take the middle two scores and average the result. So, if we look at the example below:

65 55 89 56 35 14 56 55 87 45

We again rearrange the data in order of magnitude (smallest first):

14 35 45 55 55 56 56 65 87 89

Now we take the 5th and 6th scores in our data set and average them to get a median of 55.5.

Mode

The mode is the most frequent score in our data set. On a bar chart or histogram, it is represented by the highest bar. You can therefore sometimes consider the mode as the most popular option. Normally, the mode is used for categorical data where we wish to know the most common category; in a survey of transport use, for example, the most common form of transport might be the bus. However, one of the problems with the mode is that it is not unique, which leaves us with problems when two or more values share the highest frequency: we are then stuck as to which mode best describes the central tendency of the data. This is particularly problematic with continuous data, as we are likely not to have any one value more frequent than another. For example, consider measuring the weights of 30 people (to the nearest 0.1 kg). How likely is it that we will find two or more people with exactly the same weight, e.g. 67.4 kg? Probably very unlikely: many people might be close, but with such a small sample (30 people) and a large range of possible weights, you are unlikely to find two people with exactly the same weight to the nearest 0.1 kg. This is why the mode is very rarely used with continuous data. Another problem with the mode is that it does not provide a good measure of central tendency when the most common mark is far away from the rest of the data. For instance, if the mode of a data set is 2 while the data is mostly concentrated in the 20 to 30 range, the mode is not representative of the data, and using it to describe the central tendency of that data set would be misleading.

Skewed Distributions and the Mean and Median

We often test whether our data is normally distributed, as this is a common assumption underlying many statistical tests. When you have a normally distributed sample, you can legitimately use either the mean or the median as your measure of central tendency. In fact, in any symmetrical distribution, the mean, median and mode are equal. However, in this situation, the mean is widely preferred as the best measure of central tendency, because it is the measure that includes all the values in the data set in its calculation, and any change in any of the scores will affect the value of the mean. This is not the case with the median or mode. However, when our data is skewed, for example with a right-skewed data set, we find that the mean is dragged in the direction of the skew. In these situations, the median is generally considered the best representative of the central location of the data. The more skewed the distribution, the greater the difference between the median and the mean, and the greater the emphasis that should be placed on using the median rather than the mean. A classic example of a right-skewed distribution is income (salary), where higher earners give a false impression of the typical income if it is expressed as a mean rather than a median. If tests of normality show that the data is non-normal, it is customary to use the median instead of the mean; this is more a rule of thumb than a strict guideline, however. Sometimes researchers wish to report the mean of a skewed distribution if the median and mean are not appreciably different (a subjective assessment) and if it allows easier comparison with previous research.

Summary of when to use the mean, median and mode

The following table summarises the best measure of central tendency with respect to the different types of variable:

Type of variable              Best measure of central tendency
Nominal                       Mode
Ordinal                       Median
Interval/Ratio (not skewed)   Mean
Interval/Ratio (skewed)       Median
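Returning to the factory salary example above, these three measures can be checked directly. A minimal sketch using Python's statistics module, with the figures from the table (in thousands of dollars):

```python
import statistics

# Salaries from the factory example, in thousands of dollars.
salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

print(statistics.mean(salaries))    # 30.7 -> skewed upward by the two outliers
print(statistics.median(salaries))  # 15.5 -> closer to the typical salary
print(statistics.mode(salaries))    # 15   -> the most frequent value
```

The gap between the mean (30.7k) and the median (15.5k) is exactly the skew effect the text describes.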

b. What is a confidence interval, and why is it useful? What is a confidence level?

The confidence interval is a tool of probability used to express the certainty or uncertainty of an estimated number. The lack of absolute certainty stems from the statistical method of using random samples, or limited numbers of subjects drawn from much larger groups, when making statistical determinations and inferences. The goal of this method is for the average or mean of the sample to equal or closely approximate the mean of the total number of subjects from which the sample was obtained (the true mean). The confidence interval is the range of numbers needed to specify, with varying degrees of probability (or confidence), that the sample mean closely approximates the true mean.

For example, political pollsters find it impossible to query every adult in the United States about whether or not they approve of the performance of the president. Such a poll would require asking more than 200 million people whether they approved or disapproved of the president's performance. Instead, pollsters sample only a small number of people, typically 5,000, and draw statistical inferences for the entire population based on the results of that sample. As long as the sample population is chosen at random and the number is significant (more than 30 people), pollsters may be reasonably assured that the opinions expressed by the sample population will be normally distributed and therefore usefully indicative of the opinions of the entire population.

Assume that a telephone poll is conducted in which 5,000 randomly selected people are asked to express approval, disapproval, neutrality, or no opinion about the performance of the president. The sample reveals that 2,000, or 40 percent, approve of the president's performance, while 2,250, or 45 percent, disapprove. Meanwhile, 450, or 9 percent, are neutral, and the remaining 300, or 6 percent, have no opinion about how the president is doing. These figures are 100 percent accurate only for the sample population, because every one of the 5,000 has been asked. But when attempting to draw an inference for the entire population based on this sample data, the pollsters cannot be absolutely sure the proportions will remain accurate. Instead, pollsters try to express the likelihood that these numbers are accurate for the entire population. This likelihood results from the confidence interval, which enables the pollster or statistician to estimate the population mean (the true mean) based on a sample mean. As a result, presidential approval ratings are commonly expressed with degrees of accuracy that reflect the confidence interval: the results always include an indication of the error (plus or minus a percentage) that may exist in the poll.

It is logistically very difficult to measure values for entire populations. Rather than attempting to find the correct value for an entire population, the statistician may attempt to find only the "most correct" value for the population using a sample of the population, and use the confidence interval to express how likely it is that the sample value approximates the true value.

To use a somewhat different example from the presidential opinion poll, assume that an automobile manufacturer has developed a new car and must provide an estimate of the mileage that drivers can expect from this model. A sample of 100 cars is taken from the assembly line and given test runs on a closed track. The worst-performing car among the sample gets 39 miles per gallon, while the best gets 49 miles per gallon. The average for the sample (the total mileage of all the cars divided by 100) is 44 miles per gallon, denoted x̄. The results can be expressed as:

39 < x̄ < 49

which indicates that the sample mean lies between 39 and 49 miles per gallon. In fact, the wider the range of numbers greater or less than the sample mean, the greater the chance that it includes the true mean. For example, there is a greater likelihood that the true mean falls between 35 and 53 than there is that it falls between 41 and 47. Moreover, there is 100 percent certainty that the true mean falls between 0 and infinity. Nevertheless, these considerations do not provide a useful estimate of the likelihood that the true mean for the entire population equals the sample mean x̄, or 44. To be useful, the interval estimate must include a specification of limits or boundary values for the interval, as well as a probability that the interval of values contains the true mean. The interval of values is the "confidence interval", and its boundary values are called the "confidence limits" of the interval.

The confidence interval is a range of numbers above and below the sample mean with a specific likelihood that it contains the true mean. As a measure of probability, it is usually expressed as a percentage and referred to as the "confidence level". The confidence interval, confidence limits, and mean may be diagrammed as in Figure 1 (Confidence Interval). The width of the confidence interval is determined by the degree of confidence: a 95 percent confidence interval will be narrower than a 99 percent confidence interval, indicating that there is a greater probability that the true mean lies within a wider confidence interval.

The difference between the sample mean and the true mean is attributable to an unknown, variable degree of error ε. This relationship may be expressed as:

μ = x̄ ± ε

The error ε is essential to defining a confidence interval for the true mean μ. But while the error for the sample mean is unknown, statisticians can make assumptions about the size of the errors if they know the mean, standard deviation, and shape of the distribution of those errors. For example, just by using the standard deviation, a figure derived from the variations between the sample mean and the range of numbers used to calculate this average, statisticians can determine the confidence interval.

Statisticians often set the confidence level they want in advance; that is, they select the probability they want that the true mean will be included in the interval. Then they determine how wide the interval must be to have the desired probability (e.g., a 95 percent chance) that the true mean will be included. Hence, statisticians can choose between a narrow confidence interval with a confidence level of 85 percent or a wide confidence interval with a confidence level of 99 percent. Once this information is determined, statisticians can compute how large a sample they would need in order to achieve the desired confidence interval and level of confidence.
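As a minimal sketch of how such an interval is computed for the mileage example: the sample standard deviation of 2.5 mpg below is a hypothetical value we supply for illustration (the text does not give one), and 1.96 is the standard normal quantile for 95 percent confidence.

```python
import math

# Mileage example: sample mean of 44 mpg over n = 100 cars.
n = 100
sample_mean = 44.0
sample_sd = 2.5  # hypothetical; not given in the text

# 95% confidence interval: mean +/- z * (sd / sqrt(n)), with z = 1.96.
z = 1.96
margin = z * sample_sd / math.sqrt(n)
lower, upper = sample_mean - margin, sample_mean + margin

print(f"95% CI: ({lower:.2f}, {upper:.2f}) mpg")  # (43.51, 44.49)
```

Raising the confidence level to 99 percent (z ≈ 2.58) widens the interval, exactly as described above.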
