
A lot of data is collected every day, representing myriad different types with disparate uses.

E.g.: rupee cost of bags produced, weights of shipments, shop numbers, etc.

Disparate use: Although 80 kg is twice as much as 40 kg, shop no. 80 is not twice as big as shop no. 40. So the appropriateness of a data analysis depends upon the level of data measurement.

Measurement is the process of assigning numbers to objects in such a way that specific properties of the objects are faithfully represented by specific properties of the numbers. Measurement is used to capture some construct. For example, if research is needed on the construct of depression, it is likely that some systematic measurement tool will be needed to assess depression.

In business research, measurement of variables is an indispensable requirement.
Problem: defining what is to be measured, and how it is to be accurately and reliably measured.
Some things (or concepts) that are inherently abstract in nature (e.g. job satisfaction, employee morale, brand loyalty of consumers) are more difficult to measure than concepts that can be assigned numerical values (e.g. sales volume for employees X, Y and Z).

The values of nominal data are classifications or categories. These classes have no quantitative properties. Therefore, no comparison can be made in terms of one category being higher than the other.
E.g. responses to questions about marital status, coded as: Single = 1, Married = 2, Divorced = 3, Widowed = 4.
Because the numbers are arbitrary, arithmetic operations don't make any sense (e.g. does Widowed ÷ 2 = Married?!).
Nominal data are also called qualitative or categorical.

Ordinal data: this level is higher than nominal. In addition to categorization, we can also rank or order objects.
E.g. college course rating system: poor = 1, fair = 2, good = 3, very good = 4, excellent = 5.
While it's still not meaningful to do arithmetic on this data (e.g. does 2*fair = very good?!), we can say things like: excellent > poor, or fair < very good. That is, order is maintained no matter what numeric values are assigned to each category, as long as the values increase with the category.
The difference between first and second is not necessarily equivalent to the difference between second and third, or between third and fourth.
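The point about order surviving any recoding can be checked with a small sketch. The second coding below (`coding_b`) is a made-up assumption, chosen only because its values increase with the category; it is not part of the course-rating example itself.

```python
# Two codings of the same rating categories. coding_a follows the example
# above; coding_b is a hypothetical alternative with arbitrary increasing codes.
ratings = ["poor", "fair", "good", "very good", "excellent"]
coding_a = {"poor": 1, "fair": 2, "good": 3, "very good": 4, "excellent": 5}
coding_b = {"poor": 10, "fair": 11, "good": 50, "very good": 51, "excellent": 99}


def same_order(labels, first, second):
    """Return True if both codings rank every pair of labels identically."""
    return all(
        (first[x] < first[y]) == (second[x] < second[y])
        for x in labels
        for y in labels
    )


# Order comparisons agree under both codings.
order_preserved = same_order(ratings, coding_a, coding_b)

# Gaps between adjacent categories do not agree, which is why arithmetic
# on ordinal codes is not meaningful.
gaps_a = [coding_a[b] - coding_a[a] for a, b in zip(ratings, ratings[1:])]
gaps_b = [coding_b[b] - coding_b[a] for a, b in zip(ratings, ratings[1:])]

print(order_preserved)  # True
print(gaps_a)           # [1, 1, 1, 1]
print(gaps_b)           # [1, 39, 1, 48]
```

Any statistic that depends only on rank (for instance the median category) gives the same answer under both codings, while sums and means do not.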

Ordinal measurement does not assume that the intervals between numbers are equal. Example: finishing place in a race (first place, second place).

[Figure: finishing places in a race (1st to 4th place) set against a time axis marked from 1 hour to 8 hours]

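Interval data are always numerical, and are also referred to as quantitative or numerical data. Distances between consecutive numbers have meaning. Addition and subtraction can be performed on interval data, so it is meaningful to talk about Price + Rs. 100 or about the difference between two temperatures. Multiplying a value (e.g. 2*Height) is strictly meaningful only when the scale also has a true zero, as ratio data do.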

These distances are also equal, i.e. interval data have equal intervals.
Example: Fahrenheit temperature. The zero point is a matter of convenience and not a natural or fixed zero point. It is just another point on the scale and doesn't mean the absence of the phenomenon. E.g. zero degrees F is not the lowest temperature possible.

Ratio data: the highest level of data measurement. Same properties as interval data. IN ADDITION:

Ratio data have an ABSOLUTE ZERO, not just a fixed point of reference. Zero implies absence of the phenomenon. The ratio of two numbers is meaningful.

E.g. heights, weights, prices, volume, etc. 80 kg is twice as much as 40 kg, so the ratio 80:40 is meaningful.
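A quick way to see the difference between an arbitrary zero and an absolute zero is to change units and check whether a ratio survives. The sketch below is illustrative only: the helper names and the approximate kg-to-lb factor are assumptions introduced here.

```python
def fahrenheit_to_celsius(f):
    """Convert a Fahrenheit temperature to Celsius."""
    return (f - 32) * 5 / 9


def kg_to_lb(kg):
    """Convert kilograms to pounds (approximate conversion factor)."""
    return kg * 2.20462


# Interval data: the ratio depends on the unit, because zero is arbitrary.
ratio_f = 80 / 40                                                 # 2.0
ratio_c = fahrenheit_to_celsius(80) / fahrenheit_to_celsius(40)   # about 6

# Ratio data: the ratio is the same in any unit, because zero is absolute.
ratio_kg = 80 / 40                                                # 2.0
ratio_lb = kg_to_lb(80) / kg_to_lb(40)                            # still 2.0

print(ratio_f, ratio_c)
print(ratio_kg, ratio_lb)
```

80 degrees F looks "twice" 40 degrees F only until the same two temperatures are written in Celsius, whereas 80 kg stays twice 40 kg in pounds as well.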

Multiple Regression: dependent is RATIO/INTERVAL, independent is RATIO/INTERVAL
Discriminant Analysis: dependent is NOMINAL/ORDINAL, independent is RATIO/INTERVAL
ANOVA: dependent is RATIO/INTERVAL, independent is NOMINAL/ORDINAL
Chi-Square: both dependent and independent are non-metric
MANOVA: more than one X and more than one Y (X>1, Y>1)
Canonical Correlation: both X and Y are metric; X>1, Y>1

A scale is a type of composite measure composed of several items that have a logical or empirical structure among them. A scale takes advantage of differences in intensity among the indicators of a variable. For example, when a question has the response choices of "always," "sometimes," "rarely," and "never," this is a scale because the answer choices are rank-ordered and have differences in intensity. Another example would be "strongly agree," "agree," "neither agree nor disagree," "disagree," "strongly disagree."

A scale is basically a continuous spectrum or series of categories. It has been defined as any series of items that are arranged progressively according to value or magnitude, into which an item can be placed according to its quantification.


Rating scales have several response categories and are used to elicit responses with regard to the object, event, or person studied.
Ranking scales make comparisons between or among objects, events, or persons and elicit the preferred choices and ranking among them.

Rating scales:
Dichotomous Scale
Category Scale
Semantic Differential Scale
Numerical Scale
Itemized Rating Scale
Likert/Summated Rating Scale
Fixed or Constant Sum Rating Scale
Stapel Scale
Graphic Rating Scale
Consensus Scale: Thurstone Scale
Magnitude Scaling

1. Dichotomous Scale
1. Used to elicit a Yes or No answer
2. Nominal scale used

2. Category Scale
1. Uses multiple items to elicit a single response
2. Nominal scale used
Where in India do you reside?
o Delhi
o Mumbai
o Kolkata
o Chennai

4. Semantic Differential Scale

Uses a set of scales anchored by their extreme responses, using words of opposite meaning. We use this scale when several attributes are identified at the extremes of the scale. For instance, the scale would employ such terms as:
Hot     ___ ___ ___ ___ ___  Cold
Strong  ___ ___ ___ ___ ___  Weak
Dark    ___ ___ ___ ___ ___  Light
Short   ___ ___ ___ ___ ___  Tall
Evil    ___ ___ ___ ___ ___  Good
Four to seven categories are ideal.

3. Likert Scale
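This is an interval scale, and the differences in responses between any two adjacent points on the scale remain the same. It is designed to examine how strongly subjects agree or disagree with statements on a 5-point scale, with anchors such as: strongly disagree (1), disagree (2), neither agree nor disagree (3), agree (4), strongly agree (5).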



5. Summative Ratings
A number of items collectively measure one construct (e.g. job satisfaction).
A number of items collectively measure a dimension of a construct, and a collection of dimensions will measure the construct (e.g. self-esteem).

6. Summative Likert Scale

Must contain multiple items.
Each individual item must measure something that has an underlying, quantitative measurement continuum.
There can be no right/wrong answers, as opposed to multiple-choice questions.
Items must be statements to which the respondent assigns a rating.
Cannot be used to measure knowledge or ability, but can measure familiarity.
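As a minimal sketch of how a summated score is formed, the ratings a respondent gives to the individual statements are checked against the rating scale and then added up. The item names and the responses below are hypothetical, and the 1 to 5 coding is an assumption.

```python
# Hypothetical responses of one respondent to a five-item job satisfaction
# scale, each coded 1 (strongly disagree) to 5 (strongly agree).
responses = {"item1": 4, "item2": 5, "item3": 3, "item4": 4, "item5": 2}


def summated_score(item_responses, low=1, high=5):
    """Add up item ratings after checking that each lies on the rating scale."""
    for item, value in item_responses.items():
        if not low <= value <= high:
            raise ValueError(f"{item}: rating {value} is outside {low}-{high}")
    return sum(item_responses.values())


total = summated_score(responses)       # 18
mean_rating = total / len(responses)    # average rating per item

print(total, mean_rating)
```

Reporting the mean rating alongside the total keeps the score on the original 1 to 5 range, which makes scales with different numbers of items easier to compare.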


7. Magnitude Scaling
Attempts to measure constructs along a numerical, ratio-level scale.
The respondent is given an item with a pre-assigned numerical value attached to it to establish a norm.
The respondent is asked to rate other items with numerical values as a proportion of the norm.
Very powerful if reliability is established.


8. Consensus Scale: Thurstone Scale

Items are formed, and a panel of experts assigns values from 1 to 11 to each item.
Mean or median scores are calculated for each item.
Example:
Please check the item that best describes your level of willingness to try new tasks.
I seldom feel willing to take on new tasks (1.7)
I will occasionally try new tasks (3.6)
I look forward to new tasks (6.9)
I am excited to try new tasks (9.8)
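The step in which panel judgments are condensed into one value per item can be sketched as follows. The panel size and every rating in `panel_ratings` are invented for illustration; they are not the judgments behind the values shown in the example above.

```python
from statistics import median

# Hypothetical judgments: each list holds the 1-11 values that five experts
# assigned to one statement. These numbers are illustrative only.
panel_ratings = {
    "I seldom feel willing to take on new tasks": [1, 2, 2, 1, 3],
    "I look forward to new tasks": [7, 6, 8, 7, 7],
    "I am excited to try new tasks": [10, 9, 11, 10, 9],
}


def scale_values(ratings_by_item):
    """Median expert rating per item, used as that item's scale value."""
    return {item: median(values) for item, values in ratings_by_item.items()}


values = scale_values(panel_ratings)
for item, value in values.items():
    print(f"{item} ({value})")
```

The median is used here because it is less sensitive to a single dissenting expert; replacing `median` with `statistics.mean` gives the mean-based variant.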


9. Numerical Scale
Is similar to the semantic differential scale, with the difference that numbers on a 5-point or 7-point scale are provided, as illustrated in the following example:
How pleased are you with your new job?
Extremely pleased   5   4   3   2   1   Extremely displeased


10. Fixed or Constant Sum Scale

The respondents are asked to distribute a given number of points across various items.
Example: In choosing a toilet soap, indicate the importance you attach to each of the following five aspects by allotting points for each to total 100 in all.
Fragrance     ----
Color         ----
Shape         ----
Size          ----
              _____
Total points  100
This is more in the nature of an ordinal scale.
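When constant sum responses are processed, the first thing to verify is that the points really add up to the required total. A minimal sketch of that check follows; the function name and the sample allocation are assumptions made for illustration.

```python
def check_constant_sum(allocation, total=100):
    """Validate a constant sum response and return each aspect's share."""
    if any(points < 0 for points in allocation.values()):
        raise ValueError("points cannot be negative")
    allotted = sum(allocation.values())
    if allotted != total:
        raise ValueError(f"points add up to {allotted}, expected {total}")
    return {aspect: points / total for aspect, points in allocation.items()}


# Hypothetical answer from one respondent.
answer = {"Fragrance": 40, "Color": 10, "Shape": 20, "Size": 30}
shares = check_constant_sum(answer)

print(shares)
```

A response that does not total 100 raises an error instead of being rescaled silently, so incomplete questionnaires are caught at data entry.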

11. Itemized Rating Scale

A 5-point or 7-point scale is provided for each item, and the respondent states the appropriate number on the side of each item. This uses an interval scale.
Two types:
Balanced Rating Scale with Neutral Point
Unbalanced Rating Scale

Balanced Rating Scale

Respond to each item using the scale below, and indicate your response number on the line by each item.
1 = Very unlikely   2 = Unlikely   3 = Neither unlikely nor likely   4 = Likely   5 = Very likely
--------------------------------------------------------------------------------
I will be changing my job in the near future.   --------

Unbalanced Rating Scale

Circle the number that is closest to how you feel for the item below:
1 = Not at all interested   2 = Somewhat interested   3 = Moderately interested   4 = Very much interested
--------------------------------------------------------------------------------
How would you rate your interest in changing current organizational policies?   1   2   3   4

12. Stapel Scale

This scale simultaneously measures both the direction and the intensity of the attitude toward the items under study. The characteristic of interest to the study is placed at the center, with a numerical scale ranging, say, from +3 to -3 on either side of the item, as illustrated in the following example:
State how you would rate your supervisor's abilities with respect to each of the characteristics mentioned below, by circling the appropriate number.

       +3                  +3                 +3
       +2                  +2                 +2
       +1                  +1                 +1
Adopting modern         Product          Interpersonal
  Technology           Innovation           Skills
       -1                  -1                 -1
       -2                  -2                 -2
       -3                  -3                 -3

13. Graphic Rating Scale

A graphical representation helps the respondents to indicate on this scale their answers to a particular question by placing a mark at the appropriate point on the line, as in the following example: On a scale of 1 to 10, how would you rate your supervisor?


Ranking scales are used to tap preferences between two or among more objects or items (ordinal in nature). However, such ranking may not give definitive clues to some of the answers sought.
Example: There are 4 product lines, and the manager seeks information that would help decide which product line should get the most attention. Assume:
35% of respondents choose the 1st product.
25% of respondents choose the 2nd product.
20% of respondents choose the 3rd product.
20% of respondents choose the 4th product.
Total: 100%

The manager cannot conclude that the first product is the most preferred. Why? Because 65% of respondents did not choose that product. We have to use alternative methods such as Forced Choice, Paired Comparisons, and the Comparative Scale.


Paired Comparison:
Paired comparison scales ask a respondent to pick one of two objects from a set based upon some stated criteria.
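For a set of n objects there are n(n-1)/2 distinct pairs to present, so four product lines need six comparisons. The sketch below enumerates the pairs and turns one respondent's picks into a rank order by counting wins; the product names and the choices are hypothetical.

```python
from collections import Counter
from itertools import combinations

products = ["Product 1", "Product 2", "Product 3", "Product 4"]

# Every unordered pair is shown once: n(n-1)/2 = 6 pairs for 4 products.
pairs = list(combinations(products, 2))

# Hypothetical choices of one respondent: the preferred object in each pair.
choices = {
    ("Product 1", "Product 2"): "Product 2",
    ("Product 1", "Product 3"): "Product 1",
    ("Product 1", "Product 4"): "Product 1",
    ("Product 2", "Product 3"): "Product 2",
    ("Product 2", "Product 4"): "Product 2",
    ("Product 3", "Product 4"): "Product 3",
}

# Count how often each product was preferred, then rank by number of wins.
wins = Counter(choices[pair] for pair in pairs)
ranking = sorted(products, key=lambda p: wins[p], reverse=True)

print(len(pairs))  # 6
print(ranking)     # ['Product 2', 'Product 1', 'Product 3', 'Product 4']
```

The number of pairs grows quickly with the number of objects (10 objects already need 45 comparisons), which is a practical limit on this method.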

Forced Choice:
Enables respondents to rank objects relative to one another.

Comparative Scale:
Provides a benchmark or a point of reference to assess attitude towards the subject under study.

We need to assess the goodness of the measures developed. That is, we need to be reasonably sure that the instruments we use in our research do indeed measure the variables they are supposed to, and that they measure them accurately.


Reliability deals with the consistency of the instrument. A reliable test is one that yields consistent scores when a person takes two alternate forms of the test, or when an individual takes the same test on two or more different occasions.
The reliability of a measure indicates the extent to which it is without bias and hence ensures consistent measurement across time (stability) and across the various items in the instrument (internal consistency).
Ordinal measures always yield the same order; interval measurements always yield the same order and the same distance between the measured items.

Stability: ability of a measure to remain the same over time, despite uncontrollable testing conditions or the state of the respondents themselves.

Test-Retest Reliability: the reliability coefficient obtained with a repetition of the same measure on a second occasion.
Parallel-Form Reliability: responses on two comparable sets of measures tapping the same construct are highly correlated.


When a questionnaire containing some items that are supposed to measure a concept is administered to a set of respondents now, and again to the same respondents, say, several weeks to 6 months later, the correlation between the two sets of scores is called the test-retest coefficient. The higher the coefficient, the better the test-retest reliability and, consequently, the stability of the measure across time.
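The test-retest coefficient is simply the correlation between the two administrations. A minimal sketch, assuming the Pearson correlation is the coefficient used and with made-up scores for six respondents:

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


# Hypothetical total scores of the same six respondents on two occasions.
time_1 = [12, 15, 9, 20, 17, 11]
time_2 = [13, 14, 10, 19, 18, 10]

test_retest = pearson_r(time_1, time_2)
print(round(test_retest, 2))  # roughly 0.96 for these made-up scores
```

The scores must be paired by respondent: the first entry of `time_2` has to belong to the same person as the first entry of `time_1`.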


When responses on two comparable sets of measures tapping the same construct are highly correlated, we have parallel-form reliability. Both forms have similar items and the same response format, the only changes being the wording and the order or sequence of the questions. What we try to establish with parallel forms is the error variability resulting from the wording and ordering of the questions. If two such comparable forms are highly correlated (say, .8 and above), we may be fairly certain that the measures are reasonably reliable, with minimal error variance caused by wording, ordering, or other factors.

Consistency in the type of result a test yields across:
Time and space
Participants
The results need not be perfectly identical, but should be very close to being similar.
Internal consistency is indicative of the homogeneity of the items in the measure that tap the construct.


Split-Half Reliability: randomly divide the items into 2 subsets and examine the consistency in total scores across the 2 subsets (any drawbacks?). This involves scoring the two halves of a test separately for each subject and calculating the correlation coefficient between the two scores. Split-half reliability reflects the correlation between two halves of an instrument.
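The procedure can be sketched as: shuffle the item positions, score each half per respondent, and correlate the two half-scores. The data and the `seed` parameter are assumptions for illustration; a different seed gives a different split and therefore a somewhat different coefficient, which is one of the drawbacks of the method.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def split_half_r(scores, seed=0):
    """Correlate total scores on two random halves of the items.

    scores holds one list of item ratings per respondent.
    """
    n_items = len(scores[0])
    order = list(range(n_items))
    random.Random(seed).shuffle(order)      # random assignment of items
    half_a = order[: n_items // 2]
    half_b = order[n_items // 2:]
    totals_a = [sum(row[i] for i in half_a) for row in scores]
    totals_b = [sum(row[i] for i in half_b) for row in scores]
    return pearson_r(totals_a, totals_b)


# Hypothetical ratings: 5 respondents (rows) by 6 items (columns).
scores = [
    [4, 5, 4, 5, 4, 4],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 3],
    [5, 5, 5, 4, 5, 5],
    [1, 2, 1, 1, 2, 1],
]

r_half = split_half_r(scores)
print(round(r_half, 2))
```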

INTERNAL CONSISTENCY RELIABILITY
Relevant for measures that consist of more than 1 item (e.g. total scores on scales, or when several behavioral observations are used to obtain a single score).
Internal consistency refers to inter-item reliability, and assesses the degree of consistency among the items in a scale, or among the different observations used to derive a score.
We want to be sure that all the items (or observations) are measuring the same construct.
Cronbach's alpha is the most popular measure.

Cronbach's alpha is an index of reliability associated with the variation accounted for by the true score of the "underlying construct."
Allows a researcher to measure the internal consistency of scale items, based on the average inter-item correlation.
Indicates the extent to which the items in your questionnaire are related to each other.
Indicates whether a scale is one-dimensional or multidimensional.

Cronbach's alpha ranges from 0 to 1.
The more items, generally the higher the internal reliability will be.
The higher the score, the more reliable the generated scale is.
General guide: a score of .70 or greater is generally considered to be acceptable.
.6 = marginal reliability
.7 = good
.8 = very good
.9 = excellent
>.95 = too high; items are too inter-related and therefore some are redundant
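For readers who want to see the coefficient computed, the sketch below uses the common variance form of alpha, k/(k-1) * (1 - sum of item variances / variance of total scores), rather than the average inter-item correlation mentioned above; the two coincide only for standardized items. The ratings are made up.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha from a respondents-by-items table of ratings.

    Variance form: k/(k-1) * (1 - sum(item variances) / variance(totals)).
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings: 5 respondents (rows) by 6 items (columns).
scores = [
    [4, 5, 4, 5, 4, 4],
    [2, 1, 2, 2, 1, 2],
    [3, 3, 4, 3, 3, 3],
    [5, 5, 5, 4, 5, 5],
    [1, 2, 1, 1, 2, 1],
]

alpha = cronbach_alpha(scores)
print(round(alpha, 2))  # roughly 0.98 for these made-up ratings
```

By the guide above, a value this high would suggest that some of the items are redundant, which is expected here because the made-up items are near copies of one another.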

Item-total Statistics

           Scale Mean    Scale Variance   Corrected      Alpha
           if Item       if Item          Item-Total     if Item
           Deleted       Deleted          Correlation    Deleted
MATHS1     25.2749       25.5752          .6614          .8629
MATHS2     25.0333       26.5322          .6235          .8661
MATHS3     25.0192       30.5174          .0996          .9021
MATHS4     24.9786       25.8671          .7255          .8589
MATHS5     25.4664       25.6455          .6707          .8622
MATHS6     25.0813       24.9830          .7114          .8587
MATHS7     25.0909       26.4215          .6208          .8662
MATHS8     25.8699       25.7345          .6513          .8637
MATHS9     25.0340       26.1201          .6762          .8623
MATHS10    25.4642       25.7578          .6495          .8638

Reliability Coefficients
N of Cases = 1353.0    N of Items = 10

Alpha =


Item-total Statistics

           Scale Mean    Scale Variance   Corrected      Alpha
           if Item       if Item          Item-Total     if Item
           Deleted       Deleted          Correlation    Deleted
MATHS1     22.2694       24.0699          .6821          .8907
MATHS2     22.0280       25.2710          .6078          .8961
MATHS4     21.9727       24.4372          .7365          .8871
MATHS5     22.4605       24.2235          .6801          .8909
MATHS6     22.0753       23.5423          .7255          .8873
MATHS7     22.0849       25.0777          .6166          .8955
MATHS8     22.8642       24.3449          .6562          .8927
MATHS9     22.0280       24.5812          .7015          .8895
MATHS10    22.4590       24.3859          .6524          .8930

Reliability Coefficients
N of Cases = 1355.0    N of Items = 9

Alpha =
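The "Alpha if Item Deleted" column in output of this kind can be reproduced by recomputing alpha with each item left out in turn. The sketch below does this on a small made-up data set in which the third item (Q3) is only weakly related to the others; the item names and ratings are hypothetical and are not the MATHS data.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Variance form of Cronbach's alpha for a respondents-by-items table."""
    k = len(scores[0])
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def alpha_if_item_deleted(scores, item_names):
    """Recompute alpha once per item, each time leaving that item out."""
    result = {}
    for drop, name in enumerate(item_names):
        reduced = [
            [value for i, value in enumerate(row) if i != drop]
            for row in scores
        ]
        result[name] = cronbach_alpha(reduced)
    return result


# Hypothetical ratings: 6 respondents (rows) by 4 items (columns).
item_names = ["Q1", "Q2", "Q3", "Q4"]
scores = [
    [4, 5, 4, 4],
    [2, 1, 3, 2],
    [3, 3, 2, 3],
    [5, 5, 3, 5],
    [1, 2, 3, 1],
    [4, 4, 3, 5],
]

full_alpha = cronbach_alpha(scores)
deleted = alpha_if_item_deleted(scores, item_names)
weakest = max(deleted, key=deleted.get)  # item whose removal helps alpha most

print(round(full_alpha, 3))
print({name: round(a, 3) for name, a in deleted.items()})
print(weakest)  # Q3
```

An item whose removal raises alpha above the full-scale value is a candidate for dropping, which is the pattern MATHS3 shows in the first table.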


Quality of items: concise statements, homogeneous words (some sort of uniformity)
Adequate sampling of the content domain: comprehensiveness of items
Longer assessment: less distorted by chance factors
Developing a scoring plan (esp. rubrics for subjective items)
Ensure VALIDITY

Validity has been defined as referring to the appropriateness, correctness, meaningfulness, and usefulness of the specific inferences researchers make based on the data they collect. Validation is the process of collecting and analyzing evidence to support such inferences. It is the most important idea to consider when preparing or selecting an instrument.
Validity tests show how well an instrument that is developed measures the particular concept it is intended to measure. Validity is concerned with whether we measure the right concept, e.g. is absenteeism from work a valid measure of job satisfaction, or are there other influences, like a flu epidemic, keeping employees from work?

Unlike reliability, validity is not absolute.
Validity is the degree to which variability (individual differences) in participants' scores on a particular measure reflects individual differences in the characteristic or construct we want to measure.
Depends on the PURPOSE: specific to a particular purpose! E.g. a ruler may be a valid measuring device for length, but isn't very valid for measuring volume.
Measuring what it is supposed to measure.
Must be inferred from evidence; cannot be directly measured.


Content validity ensures that the measure includes an adequate and representative set of items that tap the concept. The more the scale items represent the domain of the concept being measured, the greater the content validity. In other words, content validity is a function of how well the dimensions and elements of a concept have been delineated.
Major concern for achievement tests (where content is emphasized):
How closely content of questions in the test relates to content of the curriculum? Can you test students on things they have not been taught?


Face validity refers to the extent to which a measure APPEARS to measure what it is supposed to measure.
Not statistical; involves the judgment of the researcher (and the participants).
A measure has face validity if people think it does.
Just because a measure has face validity does not ensure that it is a valid measure (and measures lacking face validity can be valid).
Does it appear to measure what it is supposed to measure?
Example: Let's say you are interested in measuring propensity towards violence and aggression. By simply looking at the following items, state which ones qualify to measure the variable of interest:
Have you been arrested?
Have you been involved in physical fighting?
Do you get angry easily?
Do you sleep with your socks on?
Is it hard to control your anger?
Do you enjoy playing sports?

Criterion-related validity is the degree to which the predictor is adequate in capturing the relevant aspects of the criterion.
Uses correlation analysis.
Criterion-related validity is established when the measure differentiates individuals on a criterion it is expected to predict. This can be done by establishing what is called concurrent validity or predictive validity.
Two types:
Concurrent validity
Predictive validity


Concurrent Criterion Validity: how well does performance on a test estimate current performance on some valued measure (criterion)? E.g. a test of dictionary skills can estimate students' current skills in the actual use of a dictionary (by observation). The measure and the criterion are assessed at the same time.
Predictive Criterion Validity: how well does performance on a test predict future performance on some valued measure (criterion)? E.g. a reading readiness test might be used to predict students' achievement in reading. The elapsed time between the administration of the measure to be validated and the criterion is a relatively long period (e.g. months or years).

Retrospective look at the validity of the measurement

High school seniors who score high on the CBSE Class 12th are better prepared for college than low scorers (concurrent validity).
Probably of greater interest to college admissions administrators: CBSE Class 12th scores predict academic performance three years later (predictive validity).

Construct validity testifies to how well the results obtained from the use of the measure fit the theories around which the test is designed. It measures what accounts for the variance and attempts to identify the underlying constructs. This is assessed through convergent and discriminant validity.
Convergent validity is established when the scores obtained with two different instruments measuring the same concept are highly correlated. Discriminant validity is established when, based on theory, two variables are predicted to be uncorrelated, and the scores obtained by measuring them are indeed empirically found to be so.

To have construct validity, a measure should both:
Correlate with other measures that it should be related to (convergent validity)
And not correlate with measures that it should not correlate with (discriminant validity)
Construct validity: techniques used
Correlation of proposed test with other existing tests
Factor analysis
Multi-trait-multimethod analysis
Convergent validity: calls for high correlation between the different measures of the same construct
Discriminant validity: calls for low correlation between sub-scales within a construct
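The convergent and discriminant checks both reduce to looking at correlations. A minimal sketch follows, assuming Pearson correlations and entirely made-up scores for a new scale, an established scale of the same construct, and a theoretically unrelated measure. What counts as "high" or "low" is a judgment call and is not fixed by the code.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


# Hypothetical scores for the same six respondents on three instruments.
new_scale = [10, 14, 8, 18, 12, 16]          # proposed measure
established_scale = [11, 15, 9, 17, 11, 17]  # existing measure, same construct
unrelated_measure = [4, 3, 5, 5, 3, 4]       # construct predicted to be unrelated

convergent_r = pearson_r(new_scale, established_scale)
discriminant_r = pearson_r(new_scale, unrelated_measure)

print(round(convergent_r, 2))    # high: supports convergent validity
print(round(discriminant_r, 2))  # near zero: supports discriminant validity
```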

Does the test measure the human CHARACTERISTIC(s) it is supposed to? Examples of constructs or human characteristics:
Mathematical reasoning Verbal reasoning Musical ability Spatial ability Mechanical aptitude Motivation

Each construct is broken down into its component parts. E.g. motivation can be broken down into:
Interest
Attention span
Hours spent
Assignments undertaken and submitted, etc.
All of these sub-constructs put together measure motivation.

Unclear directions
Difficult reading vocabulary and sentence structure
Ambiguity in statements
Inappropriate level of difficulty
Poorly constructed test items
Test items inappropriate for the outcomes being measured
Tests that are too short
Improper arrangement of items (complex to easy?)
Identifiable patterns of answers
Administration and scoring
Nature of criterion

Reliability vs Validity

Performance-based assessment forms are high in both validity and reliability (true/false)
A test item is said to be unreliable when most students answered the item wrongly (true/false)
When a test contains items that do not represent the content covered during instruction, it is known as an unreliable test (true/false)
Test items that do not successfully measure the intended learning outcomes (objectives) are invalid items (true/false)
Assessments that do not represent student learning well enough are definitely invalid and unreliable (true/false)
A valid test can sometimes be unreliable (true/false)
If a test is valid, it is reliable! (by-product)