
CHAPTER – 5

Measurement Concept: Variable, Reliability, Validity and Norm
Topics Covered
5.1 Measurement
5.1.1 Levels of Measurement
5.1.2 Scaling of Measurement: Thurstone Scaling; Likert Scaling; and
Guttman Scaling
5.2 Variable
5.2.1 Types of Variable
5.2.2 Mediator and Moderator Variables
5.2.3 Relationship among Variables
5.2.4 Controlling Extraneous Variables
5.3 Reliability
5.3.1 Types of Reliability
5.3.2 Factors Affecting the Reliability of a Research
5.4 Validity
5.4.1 Types of Validity
5.4.2 Factors Affecting the Validity of a Research
5.5 Norm
5.5.1 Types of Norm

5.1 MEASUREMENT
Measurement is a process of assigning numbers to characteristics, variables, or events
according to scientific rules. It is the process of observing and recording the observations that are
collected as part of a research effort. Measurement means the description of data in terms of
numbers, and it rests on three elements – accuracy, objectivity, and communication. The combined
form of these three is the actual measurement.
Accuracy: Measurement is only as accurate as the care of the observer and the precision of the instruments permit.
Objectivity: Objectivity means interpersonal agreement. Where many persons reach agreement as
to observations and conclusions, the descriptions of nature are more likely to be free from biases of
particular individuals.
Communication: Communication is the sharing of research findings from one person to another.

5.1.1 LEVELS OF MEASUREMENT


Level of measurement refers to the relationship among the values that are assigned to the
attributes for a variable. It is important because -
First, knowing the level of measurement helps you decide how to interpret the data from that
variable. When you know that a measure is nominal, then you know that the numerical values are just
short codes for the longer names.
Second, knowing the level of measurement helps you decide what statistical analysis is appropriate
on the values that were assigned. If a measure is nominal, then you know that you would never
average the data values or do a t-test on the data.
S. S. Stevens (1946) clearly delineated four distinct levels of measurement: nominal, ordinal,
interval, and ratio. Stevens’s levels of measurement are important for at least two
reasons. First, they emphasize the generality of the concept of measurement. Although people do
not normally think of categorizing or ranking individuals as measurement, in fact they are as long as
they are done so that they represent some characteristic of the individuals. Second, the levels of
measurement can serve as a rough guide to the statistical procedures that can be used with the
data and the conclusions that can be drawn from them. With nominal-level measurement, for
example, the only available measure of central tendency is the mode. Also, ratio-level measurement
is the only level that allows meaningful statements about ratios of scores. One cannot say that
someone with an IQ of 140 is twice as intelligent as someone with an IQ of 70 because IQ is
measured at the interval level, but one can say that someone with six siblings has twice as many as
someone with three because number of siblings is measured at the ratio level.
Nominal: The nominal scale (also called dummy coding) simply places people, events, perceptions, etc.
into categories based on some common trait. Some data are naturally suited to the nominal scale
such as males vs. females, white vs. black vs. blue, and American vs. Asian. The nominal scale forms
the basis for such analyses as Analysis of Variance (ANOVA) because those analyses require that
some category is compared to at least one other category. The nominal scale is the lowest form of
measurement because it doesn’t capture information about the focal object other than whether the
object belongs or doesn’t belong to a category; either you are a smoker or not a smoker, you
attended university or you didn’t, a subject has some experience with computers, an average amount
of experience with computers, or extensive experience with computers. No data is captured that
can place the measured object on any kind of scale say, for example, on a continuum from one to ten.
Coding of nominal scale data can be accomplished using numbers, letters, labels, or any symbol that
represents a category into which an object can either belong or not belong. In research activities a
Yes/No scale is nominal. It has no order and there is no distance between Yes and No. The statistics
which can be used with nominal scales are in the non-parametric group. The most likely ones would be
- mode; crosstabulation - with chi-square. There are also highly sophisticated modelling techniques
available for nominal data.
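As a brief illustration, here is a minimal Python sketch of two nominal-level analyses mentioned above – the mode and a chi-square test on a crosstabulation. The categories and counts are invented for the example:

    # Nominal-level analysis: the mode, and chi-square on a crosstabulation.
    from collections import Counter
    from scipy.stats import chi2_contingency

    religion = ["Islam", "Hindu", "Islam", "Christian", "Islam", "Buddhist"]
    mode, count = Counter(religion).most_common(1)[0]   # most frequent category
    print(f"Mode: {mode} (n={count})")

    # Hypothetical 2x2 crosstab: gender (rows) by smoker/non-smoker (columns).
    table = [[30, 70],
             [45, 55]]
    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi-square = {chi2:.2f}, p = {p:.3f}")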
Ordinal: An ordinal level of measurement uses symbols to classify observations into categories that
are not only mutually exclusive and exhaustive; in addition, the categories have some explicit
relationship among them. For example, observations may be classified into categories such as taller
and shorter, greater and lesser, faster and slower, harder and easier, and so forth. However, each
observation must still fall into one of the categories (the categories are exhaustive) but no more
than one (the categories are mutually exclusive). Most of the commonly used questions which ask
about job satisfaction use the ordinal level of measurement. For example, asking whether one is
very satisfied, satisfied, neutral, dissatisfied, or very dissatisfied with one’s job is using an ordinal
scale of measurement. The simplest ordinal scale is a ranking. When a market researcher asks you
to rank 5 types of tea from most flavourful to least flavourful, s/he is asking you to create an
ordinal scale of preference. There is no objective distance between any two points on your
subjective scale. For you the top tea may be far superior to the second preferred tea but, to another
respondent with the same top and second tea, the distance may be subjectively small. Ordinal data
would use non-parametric statistics. These would include - median and mode; rank order correlation;
non-parametric analysis of variance. Modelling techniques can also be used with ordinal data.
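A short Python sketch of these ordinal-level statistics; the satisfaction codes and tea rankings are invented for illustration:

    # Median of ordinal satisfaction codes, plus Spearman's rank-order correlation.
    import statistics
    from scipy.stats import spearmanr

    # Codes 1-5: very dissatisfied .. very satisfied (ordered, but not interval)
    satisfaction = [4, 5, 3, 2, 4, 4, 1, 3]
    print("Median:", statistics.median(satisfaction))

    # Two respondents' rankings of the same five teas (1 = most flavourful)
    rank_a = [1, 2, 3, 4, 5]
    rank_b = [2, 1, 3, 5, 4]
    rho, p = spearmanr(rank_a, rank_b)
    print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")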
Interval: An interval level of measurement classifies observations into categories that are not only
mutually exclusive and exhaustive, and have some explicit relationship among them, but the
relationship between the categories is known and exact. This is the first quantitative application of
numbers. In the interval level, a common and constant unit of measurement has been established
between the categories. For example, the commonly used measures of temperature are interval level
scales. We know that a temperature of 75 degrees is one degree warmer than a temperature of 74
degrees, just as a temperature of 42 degrees is one degree warmer than a temperature of 41
degrees. Numbers may be assigned to the observations because the relationship between the
categories is assumed to be the same as the relationship between numbers in the number system.
For example, 74+1= 75 and 41+1= 42. The intervals between categories are equal, but they originate
from some arbitrary origin, that is, there is no meaningful zero point on an interval scale. The
standard survey rating scale is an interval scale. When you are asked to rate your satisfaction with a
piece of software on a 7 point scale, from Dissatisfied to Satisfied, you are using an interval scale.
Interval scale data would use parametric statistical techniques - Mean and standard deviation;
Correlation; Regression; Analysis of variance; Factor analysis; Plus a whole range of advanced
multivariate and modelling techniques. Remember that you can use non-parametric techniques with
interval and ratio data, but non-parametric techniques are less powerful than the parametric ones.
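A compact Python sketch of parametric statistics on interval-level data; the 7-point satisfaction ratings are invented:

    # Mean, standard deviation, t-test and Pearson correlation on interval data.
    import numpy as np
    from scipy.stats import pearsonr, ttest_ind

    ratings_a = np.array([5, 6, 4, 7, 5, 6])   # satisfaction with software A
    ratings_b = np.array([3, 4, 4, 5, 3, 4])   # satisfaction with software B
    print("Mean A:", ratings_a.mean(), " SD A:", ratings_a.std(ddof=1))

    t, p = ttest_ind(ratings_a, ratings_b)      # compare the two groups
    print(f"t = {t:.2f}, p = {p:.3f}")

    r, p = pearsonr(ratings_a, ratings_b)       # correlation of two measures
    print(f"r = {r:.2f}, p = {p:.3f}")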
Ratio: The ratio level of measurement is the same as the interval level, with the addition of a
meaningful zero point. There is a meaningful and non-arbitrary zero point from which the equal
intervals between categories originate. For example, weight, area, speed, and velocity are measured
on a ratio level scale. In public policy and administration, budgets and the number of program
participants are measured on ratio scales. In many cases, interval and ratio scales are treated alike
in terms of the statistical tests that are applied. A ratio scale is the top level of measurement and
is not often available in social research. The factor which clearly defines a ratio scale is that it has
a true zero point. The simplest example of a ratio scale is the measurement of length (disregarding
any philosophical points about defining how we can identify zero length). Ratio scale data would use
the same as for Interval data.
Table 5.1
A Comparison of the Four Levels of Measurement

Level      Arithmetic                                  Feature                 Example
Nominal    Counting                                    1. Categorise           Religion
Ordinal    Counting; Ranking                           1. Categorise           Economic status
                                                       2. Ranks
Interval   Counting; Ranking;                          1. Categorise           IQ score
           Addition; Subtraction                       2. Ranks
                                                       3. Has equal units
Ratio      Counting; Ranking; Addition;                1. Categorise           Family size
           Subtraction; Multiplication; Division       2. Ranks
                                                       3. Has equal units
                                                       4. Has absolute zero
According to Virginia L. Senders the simple way to understand the levels of measurement or to
select a measurement scale is as follows –
 If one object is different from another, then we use a nominal scale.
 If one object is bigger or better or more of anything than another, then we use an ordinal scale.
 If one object is so many units (degrees, inches, etc.) more than another, then we use an interval
scale.
 If one object is a certain number of times as big or bright or tall or heavy as another, then we use
a ratio scale.
The following criteria should be considered in the selection of the measurement scale for variables
in a study. Researcher should consider the scale that will be most suitable for each variable under
study. Important points in the selection of measurement scale for a variable are –
 Scale selected should be appropriate for the variables one wishes to categorise.
 It should be of practical use.
 It should be clearly defined.
 The number of categories created (when necessary) should cover all possible values.
 The number of categories created (when necessary) should not overlap, i.e., it should be mutually
exclusive.
 The scale should be sufficiently powerful.
Variables measured at a higher level can always be converted to a lower level, but not vice versa.
For example, observations of actual age (ratio scale) can be converted to categories of older and
younger (ordinal scale), but age measured as simply older or younger cannot be converted to
measures of actual age.
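As a minimal illustration of this one-way conversion (Python, with invented ages and a hypothetical cut-point):

    # Ratio-level ages can be recoded to an ordinal category, but not the reverse.
    ages = [19, 34, 52, 23, 61, 40]                      # ratio level
    cutoff = 35                                          # hypothetical cut-point
    age_group = ["younger" if a < cutoff else "older"    # ordinal level
                 for a in ages]
    print(age_group)
    # From ["younger", "older", ...] alone, the original ages cannot be recovered.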
The four levels of measurement discussed above have an important impact on how you collect data
and how you analyze them later. Collect at the wrong level, and you will end up having to adjust your
research, your design, and your analyses. Make sure you consider carefully the level at which you
collect your data, especially in light of what statistical procedures you intend to use once you have
the data in hand.


5.1.2 SCALING OF MEASUREMENT


Scaling is the branch of measurement that involves the construction of an instrument that
associates qualitative constructs with quantitative metric units. S. S. Stevens stated the simplest
and most straightforward definition of scaling: “Scaling is the assignment of objects to numbers
according to a rule”. In most scaling, the objects are text statements, usually statements of
attitude or belief. People often confuse the idea of a scale and a response scale. A response scale is
the way you collect responses from people on an instrument. You might use a dichotomous response
scale like Agree/Disagree, True/False, or Yes/No. Or, you might use an interval response scale like a
1-to-5 or 1-to-7 rating. But, if all you are doing is attaching a response scale to an object or
statement, you can’t call that scaling. As you will see, scaling involves procedures that you do
independent of the respondent so that you can come up with a numerical value for the object. In
true scaling research, you use a scaling procedure to develop your instrument (scale) and you also
use a response scale to collect the responses from participants. But just assigning a 1-to-5 response
scale for an item is not scaling. The differences are illustrated in the table below –
Scale                                        Response Scale
Results from a process                       Is used to collect the response for an item
Each item on the scale has a scale value     Item not associated with a scale value
Refers to a set of items                     Used for a single item
Why do we do scaling? Why not just create text statements or questions and use response formats
to collect the answers? First, sometimes we do scaling to test a hypothesis. We might want to know
whether the construct or concept is a single dimensional or multidimensional one. Sometimes, we do
scaling as part of exploratory research. We want to know what dimensions underlie a set of ratings.
For instance, if you create a set of questions, you can use scaling to determine how well they ‘hang
together’ and whether they measure one concept or multiple concepts. But probably the most
common reason for doing scaling is for scoring purposes. When a participant gives their responses to
a set of items, we often would like to assign a single number that represents that person’s overall
attitude or belief.
Scales are generally divided into two broad categories: unidimensional and multidimensional. The
unidimensional scaling methods were developed in the first half of the twentieth century and are
generally named after their inventor. In the late 1950s and early 1960s, measurement theorists
developed more advanced techniques for creating multidimensional scales. Three types of
unidimensional scaling methods are –
 Thurstone or Equal-Appearing Interval Scaling
 Likert or Summative Scaling
 Guttman or Cumulative Scaling.


THURSTONE SCALING
Psychologist Louis Leon Thurstone was one of the first and most productive scaling theorists. He
actually invented three different methods for developing a unidimensional scale: the method of
equal-appearing intervals; the method of successive intervals; and the method of paired
comparisons. The three methods differed in how the scale values for items were constructed, but in
all three cases, the resulting scale was rated the same way by respondents. The other Thurstone
scaling methods are similar to the Method of Equal-Appearing Intervals. All of them begin by
focusing on a concept that is assumed to be unidimensional and involve generating a large set of
potential scale items. All of them result in a scale consisting of relatively few items which the
respondent rates on ‘Agree/Disagree’ basis. The major differences are in how the data from the
judges is collected. For instance, the method of paired comparisons requires each judge to make a
judgment about each pair of statements. With lots of statements, this can become very time
consuming indeed. With 57 statements in the original set, there are 1,596 unique pairs of
statements that would have to be compared. Clearly, the paired comparison method would be too
time consuming when there are lots of statements initially. In an attempt to approximate an interval
level of measurement, Thurstone developed the method of equal-appearing intervals. This technique
for developing an attitude scale compensates for the limitation of the Likert scale in that the
strength of the individual items is taken into account in computing the attitude score. It also can
accommodate neutral statements.
Constructing the Scale
Step - 1. Collect statements on the topic from people holding a wide range of attitudes, from
extremely favorable to extremely unfavorable. For this example, attitude toward the use of Yaba.
Example statements are - It has its place. Its use by an individual could be the beginning of a sad
situation. It is perfectly healthy; it should be legalized.
Step - 2. Duplicates and irrelevant statements are omitted. The rest are typed on 3x5 cards and
given to a group of people who will serve as judges.
Step - 3. Originally, judges were asked to sort the statements into eleven (11) stacks representing
the entire range of attitudes from extremely unfavorable (1) to extremely favorable (11). The
middle stack is for statements which are neither favorable nor unfavorable (6). Only the end points
(extremely favorable and extremely unfavorable) and the midpoint are labeled. The assumption is
the intervening stacks will represent equal steps along the underlying attitude dimension. With a
large number of judges, for example, using a class or some other group to do the preliminary ratings,
it is easier to create a paper-and-pencil version.
Rate each of the following statements indicating the degree to which the statement is unfavorable
or favorable to yaba use. Do not respond in terms of your own agreement or disagreement with the
statements; rather, respond in terms of the judged degree of favorableness or unfavorableness.
Place an X in the interval that best reflects your judgment.

For example:
   Yaba is OK for most people, but a few people may have problems with it.
   Unfavorable |___|___|___|___|___|_X_|___|___|___|___|___| Favorable

1. If yaba is taken safely, its effect can be quite enjoyable.
   Unfavorable |___|___|___|___|___|___|___|___|___|___|___| Favorable
2. I think it is horrible and corrupting.
   Unfavorable |___|___|___|___|___|___|___|___|___|___|___| Favorable
3. It is usually the drug people start on before addiction.
   Unfavorable |___|___|___|___|___|___|___|___|___|___|___| Favorable

(The middle interval on each line represents Neutral.)
Remind the judges to rate favorability with regard to the target (yaba), not to give their opinion as
to whether they agree or disagree with the statement.
Step-4. Each statement will have a numerical rating (1 to 11) from each judge, based on the stack in
which it was placed. The number or weight assigned to the statement is the average of the ratings it
received from the judges.
Statement                                                      Average rating from 20 judges
                                                               (11 = extremely favorable)
If yaba is taken safely, its effect can be quite enjoyable.                  6.9
I think it is horrible and corrupting.                                       1.6
It is usually the drug people start on before addiction.                     4.9
If the judges cannot rate the item on its favorability or show a high degree of variability in their
judgments, the item is eliminated. For example, the statement ‘Yaba use should be taxed heavily’ was
rejected because it was ambiguous. Some judges thought it was pro-yaba as it implied legalization;
others thought it was anti-yaba because it advocated a heavy tax.
Administering the Scale
Here is the final form. The respondents check only the statements with which they agree. The
average ratings by the judges are shown in parentheses. These would not be included on the actual
form given to respondents. Note that the more positive statements have a higher weight.
This is a scale to measure your attitude toward yaba. It does not deal with any other drug, so please
consider that the items pertain to yaba exclusively. We want to know how students feel about this
topic. In order to get honest answers, the questionnaires are to be filled out anonymously. Do not
sign your name.
Please check all those statements with which you agree.
___1. I don’t approve of something that puts you out of a normal state of mind. (3.0)
___2. It has its place. (7.1)
___3. It corrupts the individual (2.2)
___4. Yaba does some people a lot of good. (7.9).
___5. Having never tried yaba, I can’t say what effects it would have. (6.0)
___6. If yaba is taken safely, its effect can be quite enjoyable. (6.9)
___7. I think it is horrible and corrupting. (1.6)
___8. It is usually the drug people start on before addiction. (4.9)
___9. It is perfectly healthy and should be legalized. (10.0)
___10. Its use by an individual could be the beginning of a sad situation. (4.1)


Scoring
The weights (favorability rating) for the checked statements are summed and divided by the
number of statements checked. A respondent who selected #3, #7, and #8 would have an attitude
score of (2.2 + 1.6 + 4.9)/3 = 8.7/3 = 2.9. Dividing by the number of statements checked (3) puts the
score on the 1-to-11 scale. A score of 2.9 indicates an attitude that is definitely unfavorable to yaba.
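This scoring rule is simple enough to sketch in a few lines of Python; the item numbers and weights are taken from the example above:

    # Thurstone scoring: mean of the judged weights of the endorsed items.
    weights = {3: 2.2, 7: 1.6, 8: 4.9}   # item number -> judges' average rating
    checked = [3, 7, 8]                  # items the respondent agreed with

    score = sum(weights[i] for i in checked) / len(checked)
    print(round(score, 1))               # 2.9 -> definitely unfavorable to yaba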
This kind of scale is used to measure people’s attitude towards a fairly clear and unidimensional
concept, using a number of statements that vary in how they express a positive or negative opinion
about the main concept. Briefly, the steps in developing a Thurstone scale are –
1. Determine the focus: what concept are you going to measure (see what people’s attitudes are
toward it)?
2. Ask a group of people (or a person): to write down different statements about this concept,
reflecting different opinions or attitudes about the subject. Make sure you have a large number
of statements with which people can either agree or disagree (no open questions, for instance).
3. Rating the scale items: the next step is to have your group rate each statement on a 1-to-11
scale in terms of how much each statement indicates a favorable attitude towards the concept.
The members of the group must not express their own opinion; they must only indicate how
favorable they feel each statement is. You can use a scale with 1 = extremely unfavorable attitude
towards the subject (focus) and 11 = extremely favorable attitude towards the subject.
4. Compute the median and interquartile range: for each statement. Create a table with these
values and sort by the median.
5. Select the items for the actual scale: you should select statements that are at equal intervals
across the range of medians. Within each value, you should try to select the statement that has
the smallest Inter-quartile Range. This is the statement with the least amount of variability
across judges. You don't want the statistical analysis to be the only deciding factor here. Look
over the candidate statements at each level and select the statement that makes the most
sense. If you find that the best statistical choice is a confusing statement, select the next best
choice.
In question selection -
 Generate a large set of possible statements.
 Get a set of judges to rate each statement in terms of how favorable it is to the concept, from 1
(least favorable) to 11 (most favorable) – judging the statements, not their own agreement.
 For each statement, plot a histogram of the numbers against which the different judges scored
it.
 For each statement, identify the median score, the value below which 25% of the ratings fall (Q1),
and the value below which 75% fall (Q3). The difference between Q3 and Q1 is the inter-quartile range.
 Sort the list by median value (This is the ‘common’ score in terms of agreement).
 Select a set of statements that sit at equal positions across the range of medians. Choose the one
with the lowest inter-quartile range for each position (a computational sketch follows this list).
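A minimal sketch of the median/IQR computation in Python; the judges' ratings below are invented, with the ambiguous tax item included to show why a large IQR leads to rejection:

    # Median and inter-quartile range of judges' ratings per statement.
    import numpy as np

    ratings = {   # statement -> favorability ratings from several judges (1-11)
        "It has its place.":                      [6, 7, 7, 8, 7],
        "I think it is horrible and corrupting.": [1, 2, 1, 2, 2],
        "Yaba use should be taxed heavily.":      [2, 9, 3, 10, 4],
    }
    for stmt, r in ratings.items():
        q1, med, q3 = np.percentile(r, [25, 50, 75])
        print(f"median={med:4.1f}  IQR={q3 - q1:4.1f}  {stmt}")
    # Keep, at each median level, the statement with the smallest IQR;
    # high-IQR items (like the tax statement) reflect judge disagreement.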
One of the biggest problems with Thurstone scaling is to find sufficient judges who have a good
enough understanding of the concept being assessed. With a set of questions with which you can
agree or not, it is useful to have some questions with which the respondent will easily agree, some
with which they will easily disagree and some which they have to think about, and where some people
are more likely to make one choice rather than another. This should then give a realistic and varying
distribution across all questions, rather than bias being caused by questions that are likely to give all
of one type of answer.
LIKERT SCALING
A Likert scale is a psychometric scale commonly used in questionnaires, and is the most widely used
scale in survey research. When responding to a Likert questionnaire item, respondents specify their
level of agreement to a statement. The scale is named after its inventor, psychologist Rensis Likert.
The Likert scale can also be used to measure attitudes of people. When responding to a Likert
questionnaire item, respondents specify their level of agreement or disagreement on a symmetric
agree-disagree scale for a series of statements. Thus, the range captures the intensity of their
feelings for a given item. As with the Thurstone scale, the development of a Likert scale takes some
effort. An important distinction must be made between a Likert scale and a Likert item. The Likert
scale is the sum of responses on several Likert items. Because Likert items are often accompanied
by a visual analog scale (e.g., a horizontal line, on which a subject indicates his or her response by
circling or checking tick-marks), the items are sometimes called scales themselves. This is the
source of much confusion; it is better, therefore, to reserve the term Likert scale to apply to the
summated scale, and Likert item to refer to an individual item.
A Likert item is simply a statement which the respondent is asked to evaluate according to any kind
of subjective or objective criteria; generally the level of agreement or disagreement is measured.
Often five ordered response levels are used, although many psychometricians advocate using seven
or nine levels. A recent empirical study found that a 5- or 7- point scale may produce slightly higher
mean scores relative to the highest possible attainable score, compared to those produced from a
10-point scale. The format of a typical five-level Likert item is - Strongly disagree; Disagree;
Neither agree nor disagree; Agree; and Strongly agree. After the questionnaire is completed, each
item may be analyzed separately or in some cases item responses may be summed to create a score
for a group of items. Hence, Likert scales are often called summative scales.
Example of Likert Scaling

5-point Traditional Likert Scale
Item: I like going to Chinese restaurants.
Response options: Strongly agree | Agree | Neither agree nor disagree | Disagree | Strongly disagree

5-point Likert-type Scale, not All Labeled
Item: When I think about Chinese restaurants I feel …
Response options: Good | (unlabeled) | Neutral | (unlabeled) | Bad

6-point Likert-type Scale
Item: I feel happy when entering a Chinese restaurant.
Response options: Never | Not Infrequently | Infrequently | Sometimes | Frequently | Always


Steps in Developing A Likert Scale


The basic steps in developing a Likert or ‘Summative’ scale are –
 Defining the Focus. As in all scaling methods, the first step is to define what it is you are trying
to measure. Because this is a unidimensional scaling method, it is assumed that the concept you
want to measure is one-dimensional in nature. You might operationalize the definition as an
instruction to the people who are going to create or generate the initial set of candidate items
for your scale.
 Generating the Items. Next, you have to create the set of potential scale items. These should
be items that can be rated on a 1-to-5 or 1-to-7 Disagree-Agree response scale. Sometimes you
can create the items by yourself based on your intimate understanding of the subject matter.
But, more often than not, it’s helpful to engage a number of people in the item creation step. For
instance, you might use some form of brainstorming to create the items. It’s desirable to have
as large a set of potential items as possible at this stage, about 80-100 would be best.
 Rating the Items. The next step is to have a group of judges rate the items. Usually you would
use a 1-to-5 rating scale where: 1 = strongly unfavorable to the concept; 2 = somewhat
unfavorable to the concept; 3 = undecided; 4 = somewhat favorable to the concept; 5 = strongly
favorable to the concept. Notice that, as in other scaling methods, the judges are not telling you
what they believe - they are judging how favorable each item is with respect to the construct of
interest.
 Selecting the Items. The next step is to compute the inter-correlations between all pairs of
items, based on the ratings of the judges. In making judgments about which items to retain for
the final scale there are several analyses you can do. Items may be selected by a mathematical
process, as follows –
 Generate a lot of questions - more than you need.
 Get a group of judges to score the questionnaire.
 Sum the scores for all items.
 Calculate the inter-correlations between all pairs of items.
 Reject questions that have a low correlation with the sum of the scores.
 For each item, calculate the t-value comparing the top quarter and bottom quarter of the judges,
and reject questions with lower t-values (higher t-values mark questions with higher
discrimination). A computational sketch of this selection step follows the list.
 Administering the Scale. You’re now ready to use your Likert scale. Each respondent is asked to
rate each item on some response scale. For instance, they could rate each item on a 1-to-5
response scale where: 1 = strongly disagree; 2 = disagree; 3 = undecided; 4 = agree; 5 = strongly
agree.
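The item-selection analyses above can be sketched briefly in Python; the rating matrix is invented, and with only six judges the "top quarter" is approximated by the top two scorers:

    # Corrected item-total correlations and extreme-groups t-tests per item.
    import numpy as np
    from scipy.stats import ttest_ind

    X = np.array([[5, 4, 2, 5],      # rows = judges, columns = items
                  [4, 4, 1, 5],
                  [2, 1, 4, 2],
                  [1, 2, 5, 1],
                  [3, 3, 3, 3],
                  [5, 5, 1, 4]])
    total = X.sum(axis=1)

    for j in range(X.shape[1]):
        rest = total - X[:, j]                 # total excluding this item
        r = np.corrcoef(X[:, j], rest)[0, 1]   # item-total correlation
        top = X[np.argsort(total)[-2:], j]     # highest-scoring judges
        bottom = X[np.argsort(total)[:2], j]   # lowest-scoring judges
        t, _ = ttest_ind(top, bottom)
        print(f"item {j + 1}: item-total r = {r:+.2f}, t = {t:+.2f}")
    # Items with low r or low t discriminate poorly; item 3 correlates
    # negatively and would need to be reversed or rejected.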
There are a variety of possible response scales (1-to-7, 1-to-9, 0-to-4). All of these odd-numbered
scales have a middle value that is often labeled Neutral or Undecided. It is also possible to use a forced-choice
response scale with an even number of responses and no middle neutral or undecided choice. In this
situation, the respondent is forced to decide whether they lean more towards the agree or disagree
end of the scale for each item. The final score for the respondent on the scale is the sum of their
ratings for all of the items. On some scales, you will have items that are reversed in meaning from
the overall direction of the scale. These are called reversal items. You will need to reverse the
response value for each of these items before summing for the total. That is, if the respondent
gave a 1, you make it a 5; a 2 becomes a 4; a 3 stays a 3; a 4 becomes a 2; and a 5 becomes a 1.
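On a 1-to-5 scale, the reversal rule is simply (scale maximum + 1) minus the raw value; a tiny Python sketch with hypothetical item names:

    # Reverse-score reversal items, then sum for the total scale score.
    def reverse(value, scale_max=5):
        return (scale_max + 1) - value

    responses = {"item1": 4, "item2": 2, "item3": 5}   # hypothetical ratings
    reversal_items = {"item2"}                         # items keyed negatively

    total = sum(reverse(v) if k in reversal_items else v
                for k, v in responses.items())
    print(total)   # 4 + (6 - 2) + 5 = 13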


Likert scaling is a bipolar scaling method, measuring either positive or negative response to a
statement. Sometimes a four-point scale is used; this is a forced choice method since the middle
option of ‘Neither agree nor disagree’ is not available. Likert scales may be subject to distortion
from several causes. Respondents may avoid using extreme response categories (central tendency
bias); agree with statements as presented (acquiescence bias); or try to portray themselves or their
organization in a more favorable light (social desirability bias). Designing a scale with balanced
keying (an equal number of positive and negative statements) can obviate the problem of
acquiescence bias, since acquiescence on positively keyed items will balance acquiescence on
negatively keyed items, but central tendency and social desirability are somewhat more problematic.
Whether individual Likert items can be considered as interval-level data, or whether they should be
considered merely ordered-categorical data is the subject of disagreement. Many regard such items
only as ordinal data, because, especially when using only five levels, one cannot assume that
respondents perceive all pairs of adjacent levels as equidistant. On the other hand, often the
wording of response levels clearly implies symmetry of response levels about a middle category; at
the very least, such an item would fall between ordinal- and interval-level measurements; to treat it
as merely ordinal would lose information. Further, if the item is accompanied by a visual analog scale,
where equal spacing of response levels is clearly indicated, the argument for treating it as interval-
level data is even stronger.
When treated as ordinal data, Likert responses can be collated into bar charts, central tendency
summarized by the median or the mode (but some would say not the mean), dispersion summarized
by the range across quartiles (but some would say not the standard deviation), or analyzed using non-
parametric tests, e.g. chi-square test, Mann–Whitney test, Wilcoxon signed-rank test, or Kruskal–
Wallis test. Parametric analysis of ordinary averages of Likert scale data is also justifiable by the
Central Limit Theorem; although some would disagree that ordinary averages should be used for
Likert scale data. Responses to several Likert questions may be summed, providing that all questions
use the same Likert scale and that the scale is a defendable approximation to an interval scale, in
which case they may be treated as interval data measuring a latent variable. If the summed
responses fulfill these assumptions, parametric statistical tests such as the analysis of variance can
be applied. These can be applied only when more than 5 Likert questions are summed. Data from
Likert scales are sometimes reduced to the nominal level by combining all agree and disagree
responses into two categories of ‘accept’ and ‘reject’. The chi-square, Cochran Q, or McNemar tests
are common statistical procedures used after this transformation.
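As an illustration, here is a hedged Python sketch of the ordinal and nominal treatments just described, using invented 5-point responses from two independent groups:

    # Non-parametric comparison of Likert responses, then nominal reduction.
    from scipy.stats import mannwhitneyu

    group_a = [4, 5, 3, 4, 5, 4, 2]
    group_b = [2, 3, 3, 1, 2, 3, 4]
    u, p = mannwhitneyu(group_a, group_b, alternative="two-sided")
    print(f"U = {u}, p = {p:.3f}")

    # Reduce to the nominal level: 4-5 -> 'accept', 1-2 -> 'reject'.
    accept_a = sum(r >= 4 for r in group_a)
    accept_b = sum(r >= 4 for r in group_b)
    print("accepts:", accept_a, accept_b)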
Example: Employment Self Esteem Scale
Here’s an example of a ten-item Likert scale that attempts to estimate the level of self-esteem a
person has on the job. Notice that this instrument has no center or neutral point - the respondent
has to declare whether s/he is in agreement or disagreement with the item.


Instructions: Please rate how strongly you agree or disagree with each of the following statements
by placing a check mark in the appropriate box.
Response options for each item: Strongly Disagree | Disagree | Somewhat Disagree |
Somewhat Agree | Agree | Strongly Agree

1. I feel good about my work on the job.
2. On the whole, I get along well with others at work.
3. I am proud of my ability to cope with difficulties at work.
4. When I feel uncomfortable at work, I know how to handle it.
5. I can tell that other people at work are glad to have me there.
6. I know I’ll be able to cope with work for as long as I want.
7. I am proud of my relationship with my supervisor at work.
8. I am confident that I can handle my job without constant assistance.
9. I feel like I make a useful contribution at work.
10. I can tell that my coworkers respect me.

GUTTMAN SCALING
The Guttman scale was first described by Louis Guttman in 1944. This scaling is also known as
cumulative scaling or scalogram analysis. The cumulative scale or Guttman scale measures to what
degree a person has a positive or negative attitude to something. It makes use of a series of
statements that are growing or descending in how positive or negative a person is towards the
subject. If for instance on a scale with seven statements the respondent agrees with the fifth
statement, it implies that s/he also agrees with the first four statements, but not with statement
number six and seven.
Developing A Guttman Scale
 Define the Focus. As in all of the scaling methods, we begin by defining the focus for the scale.
Let’s imagine that you wish to develop a cumulative scale that measures US citizen attitudes
towards immigration. You would want to be sure to specify in your definition whether you are
talking about any type of immigration (legal and illegal) from anywhere (Europe, Asia, Latin and
South America, Africa).
 Develop the Items. Next, as in all scaling methods, you would develop a large set of items that
reflect the concept. You might do this yourself or you might engage a knowledgeable group to
help. For item selection -
 Generate a list of possible statements.
 Get a set of judges to score the statements with a Yes or No, depending on whether they
agree or disagree with them.
 Draw up a table with the respondents in rows and the statements in columns, showing whether
each respondent answered Yes or No.
 Sort the columns so the statement with the most Yes’s is on the left.
 Sort the rows so the respondent with the most Yes’s is at the top.
 Select the set of questions that has the fewest ‘holes’ (No’s between Yes’s).
Let’s say for the example you came up with the following statements -
 I would permit a child of mine to marry an immigrant.


 I believe that this country should allow more immigrants in.
 I would be comfortable if a new immigrant moved next door to me.
 I would be comfortable with new immigrants moving into my community.
 It would be fine with me if new immigrants moved onto my block.
 I would be comfortable if my child dated a new immigrant.
Of course, we would want to come up with many more statements (about 80-100 would be
desirable).
 Rate the Items. Next, we would want to have a group of judges rate the statements or items in
terms of how favorable they are to the concept of immigration. They would give a ‘Yes’ if the
item was favorable toward immigration and a ‘No’ if it is not. Notice that we are not asking the
judges whether they personally agree with the statement. Instead, we’re asking them to make a
judgment about how the statement is related to the construct of interest.
 Develop the Cumulative Scale. The key to Guttman scaling is in the analysis. We construct a
matrix or table that shows the responses of all the respondents on all of the items. We then
sort this matrix so that respondents who agree with more statements are listed at the top and
those agreeing with fewer are at the bottom. For respondents with the same number of
agreements, we sort the statements from left to right, from those agreed with most to those
agreed with least. We might get a table something like the figure. Notice that the scale is
very nearly cumulative when you read from left to right across the columns (items). Specifically
if someone agreed with Item 7, they always agreed with Item 2. And, if someone agreed with
Item 5, they always agreed with Items 7 and 2. The matrix shows that the cumulativeness of
the scale is not perfect, however. While in general, a person agreeing with Item 3 tended to also
agree with 5, 7 and 2, there are several exceptions to that rule.
While we can examine the matrix if there are only a few items in it, if there are lots of items,
we need to use a data analysis called scalogram analysis to determine the subsets of items from
our pool that best approximate the cumulative property. Then, we review these items and select
our final scale elements. There are several statistical techniques for examining the table to find
a cumulative scale. Because there is seldom a perfectly cumulative scale we usually have to test
how good it is. These statistics also estimate a scale score value for each item. This scale score
is used in the final calculation of a respondent’s score.
 Administering the Scale. Once you’ve selected the final scale items, it’s relatively simple to
administer the scale. You simply present the items and ask the respondent to check items with
which they agree. For our hypothetical immigration scale, the items might be listed in cumulative
order as -
 I believe that this country should allow more immigrants in.
 I would be comfortable with new immigrants moving into my community.
 It would be fine with me if new immigrants moved onto my block.
 I would be comfortable if a new immigrant moved next door to me.
 I would be comfortable if my child dated a new immigrant.
 I would permit a child of mine to marry an immigrant.
Of course, when we give the items to the respondent, we would probably want to mix up the
order. Our final scale might look like –

Instructions: Place a check next to each statement you agree with.
_____ I would permit a child of mine to marry an immigrant.


_____ I believe that this country should allow more immigrants in.
_____ I would be comfortable if a new immigrant moved next door to me.
_____ I would be comfortable with new immigrants moving into my community.
_____ It would be fine with me if new immigrants moved onto my block.
_____ I would be comfortable if my child dated a new immigrant.
Each scale item has a scale value associated with it (obtained from the scalogram analysis). To
compute a respondent’s scale score we simply sum the scale values of every item they agree with. In
our example, their final value should be an indication of their attitude towards immigration.
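A rough Python sketch of the scalogram idea behind both the analysis and the scoring; the response matrix and scale values are invented, and the error count is a simplified version of what dedicated scalogram software computes:

    # Sort a Yes/No matrix and estimate the coefficient of reproducibility.
    import numpy as np

    X = np.array([[1, 1, 1, 1],    # rows = respondents, columns = items
                  [1, 1, 1, 0],    # 1 = agree, 0 = disagree
                  [1, 1, 0, 0],
                  [1, 0, 1, 0],    # this row contains one 'hole'
                  [1, 0, 0, 0]])

    X = X[:, np.argsort(-X.sum(axis=0))]   # most-agreed item to the left
    X = X[np.argsort(-X.sum(axis=1))]      # most-agreeing respondent on top

    # A perfect cumulative row is all 1s followed by all 0s; count mismatches.
    errors = sum(int((row != np.sort(row)[::-1]).sum()) for row in X)
    print(f"errors = {errors}, reproducibility = {1 - errors / X.size:.2f}")

    scale_values = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical item values
    print("respondent scores:", X @ scale_values)    # sum of endorsed values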
Items in a Guttman scale gradually increase in specificity. The intent of the scale is that the person
will agree with all statements up to a point and then will stop agreeing. The scale may be used to
determine how extreme a view is, with successive statements showing increasingly extremist
positions. If needed, the escalation can be concealed by using intermediate questions.

5.2 VARIABLE
A variable is a measurable characteristic that varies. It may change from group to group, person to
person, or even within one person over time. It is the central idea in research. Simply defined,
variable is a concept that varies. There are two types of concepts - those that refer to a fixed
phenomenon and those that vary in quantity, intensity, or amount. A variable is defined as anything
that varies or changes in value. For example gender is a variable; it can take two values - male or
female. Marital status is a variable; it can take on values of never married, single, married, divorced,
or widowed. Family income is a variable; it can take on values from zero to billions. A person’s
attitude toward women empowerment is a variable; it can range from highly favorable to highly
unfavorable. In this way the variation can be in quantity, intensity, amount, or type; the examples
can be production units, absenteeism, gender, religion, motivation, grade, and age. A variable may be
situation-specific; for example, gender is ordinarily a variable, but in a particular situation, such as
a research methods class containing only female students, gender will not be considered a variable.
A variable is any characteristic, number, or quantity that can be measured or counted. A variable
may also be called a data item. It is called a variable because the value may vary between data units
in a population, and may change in value over time. For example; ‘income’ is a variable that can vary
between data units in a population (i.e. the people or businesses being studied may not have the same
incomes) and can also vary over time for each data unit (i.e. income can go up or down). There are
different ways variables can be described according to the ways they can be studied, measured, and
presented.

5.2.1 TYPES OF VARIABLE


There are various types of variable considered in research methodology.
Quantitative Variables: A quantitative variable is measured numerically. With measurements of
quantitative variables you can do things like add and subtract, and multiply and divide, and get a
meaningful result. Numeric (quantitative) variables have values that describe a measurable quantity
as a number, like ‘how many’ or ‘how much’. Quantitative variables may be further described as
either continuous or discrete.


Continuous Variable: A continuous variable is a numeric variable. Observations can take any value
within a certain range of real numbers. The value given to an observation for a continuous variable
can include values as small as the instrument of measurement allows. If the values of a variable can
be divided into fractions then we call it a continuous variable. Such a variable can take an infinite
number of values. Income, height, time, temperature, age, or a test score are examples of
continuous variables. These variables may take on values within a given range or, in some cases, an
infinite set.
Discrete Variable: Discrete variables are numeric variables that come from a limited set of
numbers. They may result from answering questions such as ‘how many’, ‘how often’, etc. Any
variable that has a limited number of distinct values and which cannot be divided into fractions is a
discontinuous variable. Such a variable is also called a categorical, classificatory, or discrete
variable. Some variables have only two values, reflecting the presence or
absence of a property: employed-unemployed or male-female have two values. These variables are
referred to as dichotomous. There are others that can take additional categories, such as the
demographic variables of race and religion. All such variables that produce data that fit into categories
are said to be discrete/ categorical/ classificatory, since only certain values are possible.
Qualitative/Categorical Variables: These allow for classification based on some characteristic. With
measurements of qualitative/categorical variables you cannot do things like add and subtract, and
multiply and divide, and get a meaningful result. A categorical variable is usually an independent or
predictor variable that contains values indicating membership in one of several possible categories.
Nominal and ordinal variables are categorical. For example, gender (male or female), marital status
(married, single, divorced, widowed) etc. are qualitative/categorical variables. The categories are
often assigned numerical values used as labels, e.g., 0 = male; 1 = female.
Independent Variable (IV): Variable(s) that the experimenter manipulates (i.e. changes) – assumed
to have a direct effect on the dependent variable. It is also called explanatory or predictor or
manipulated variable. Researchers who focus on causal relations usually begin with an effect, and
then search for its causes. The cause variable, or the one that identifies forces or conditions that
act on something else, is the independent variable.
Dependent Variable (DV): Variable(s) that the experimenter measures. The variable that is the
effect or is the result or outcome of another variable is the dependent variable (also referred to as
outcome variable/ response variable/ predicted/ effect variable). The independent variable is
‘independent of’ prior causes that act on it, whereas the dependent variable ‘depends on’ the cause.
It is not always easy to determine whether a variable is independent or dependent. Two questions
help to identify the independent variable. First, does it come before the other variable in time? Second,
if the variables occur at the same time, does the researcher suggest that one variable has an impact
on another variable? Independent variables affect or have an impact on other variables. When
independent variable is present, the dependent variable is also present, and with each unit of
increase in the independent variable, there is an increase or decrease in the dependent variable also.
In other words, the variance in dependent variable is accounted for by the independent variable.
Dependent variable is also referred to as criterion variable.


Control Variable (CV): The variables that are not measured in a particular study must be held
constant, neutralized/balanced, or eliminated, so they will not have a biasing effect on the other
variables. Variables that have been controlled in this way are called control variables.
Intervening Variables: Intervening variables refer to abstract processes that are not directly
observable but that link the independent and dependent variables. It comes between the
independent and dependent variables and shows the link or mechanism between them. Advances in
knowledge depend not only on documenting cause and effect relationship but also on specifying the
mechanisms that account for the causal relation. In a sense, the intervening variable acts as a
dependent variable with respect to independent variable and acts as an independent variable toward
the dependent variable. It is also called by some authors ‘mediating variable’ or ‘intermediary
variable’. For example, a theory of suicide states that married people are less likely to commit
suicide than single people. This theory can be restated as a three-variable relationship: marital
status (independent variable) causes the degree of social integration (intervening variable), which
affects suicide (dependent variable). Specifying the chain of causality makes the linkages in theory
clearer and helps a researcher test complex relationships.
Extraneous Variables (EV): The variables which are not the independent variable, but could affect
the results (DV) of the experiment. EVs should be controlled where possible. Extraneous variables
may damage a study’s validity, making it impossible to know whether the effects were caused by the
independent and moderator variables or some extraneous factor. If they cannot be controlled,
extraneous variables must at least be taken into consideration when interpreting results. There are
four types of extraneous variables-
1. Situational Variables: These are aspects of the environment that might affect the participant’s
behavior, e.g. noise, temperature, lighting conditions, etc. Situational variables should be
controlled so they are the same for all participants. Standardized procedures are used to
ensure that conditions are the same for all participants. This includes the use of standardized
instructions. Situational variables also include order effects that can be controlled using
counterbalancing, such as giving half the participants condition ‘A’ first, while the other half get
condition ‘B’ first. This prevents improvement due to practice, or poorer performance due to
boredom.
2. Participant / Person Variables: This refers to the ways in which each participant varies from the
other, and how this could affect the results e.g. mood, intelligence, anxiety, nerves,
concentration etc. For example, if a participant that has performed a memory test was tired,
dyslexic or had poor eyesight, this could affect their performance and the results of the
experiment. The experimental design chosen can have an effect on participant variables.
Participant variables can be controlled using random allocation to the conditions of the
independent variable.
3. Experimenter/Investigator Effects: The experimenter unconsciously conveys to participants
how they should behave - this is called experimenter bias. The experimenter might do this by
giving unintentional clues to the participants about what the experiment is about and how s/he
expects them to behave. This affects the participants’ behavior. The experimenter is often
totally unaware of the influence which s/he is exerting and the cues may be very subtle indeed
but they have an influence nevertheless. Also, the personal attributes (e.g. age, gender, accent,
manner etc.) of the experimenter can affect the behavior of the participants.
4. Demand Characteristics: These are all the clues in an experiment which convey to the
participant the purpose of the research. Participants will be affected by - (i) their surroundings;
(ii) the researcher’s characteristics; (iii) the researcher’s behavior (e.g. non-verbal
communication), and (iv) their interpretation of what is going on in the situation. Experimenters
should attempt to minimize these factors by keeping the environment as natural as possible,
carefully following standardized procedures.
Extraneous variables must be carefully and systematically controlled so they don’t vary across
any of the experimental conditions or between participants. When designing an experiment,
researchers should consider three main areas where extraneous variables may arise –
 Participant variables - participants’ age, intelligence, personality and so on should be
controlled.
 Situational variables - the experimental setting and surrounding environment must be
controlled. This may even include the temperature or noise effects.
 Experimenter variables - the personality, appearance and conduct of the researcher. Any
change in these across conditions might affect the results.
If extraneous variables are not controlled, then those can be confounding variables.
Confounding Variables: Variable(s) that have affected the results (DV), apart from the IV. A
confounding variable could be an extraneous variable that has not been controlled. So, a confounding
variable is a variable that could strongly influence the study, while extraneous variables are weaker
and typically influence the experiment in a lesser way. If one elementary reading teacher used a
phonics textbook in her class and another instructor used a whole language textbook in his class, and
students in the two classes were given achievement tests to see how well they read, the independent
variables (teacher effectiveness and textbooks) would be confounded. There is no way to determine
if differences in reading between the two classes were caused by either or both of the independent
variables.
Binary Variable: Observations (i.e., dependent variables) that occur in one of two possible states,
often labeled zero and one. E.g., ‘improved/not improved’ and ‘completed task/failed to complete
task’.
Dummy Variables: Created by recoding categorical variables that have more than two categories into
a series of binary variables. E.g., Marital status, if originally labeled 1=married, 2=single, and
3=divorced, widowed, or separated, could be redefined in terms of two variables as follows: var_1:
1=single, 0=otherwise. Var_2: 1=divorced, widowed, or separated, 0=otherwise. For a married person,
both var_1 and var_2 would be zero. In general, a categorical variable with k categories would be
recoded in terms of k-1 dummy variables. Dummy variables are used in regression analysis to avoid
the unreasonable assumption that the original numerical codes for the categories, i.e., the values 1,
2,..., k, correspond to an interval scale.
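The same coding can be sketched in Python with pandas; the data are hypothetical, and ‘married’ is kept as the omitted reference category, as in the example above:

    # Dummy-code a 3-category variable into k-1 = 2 binary variables.
    import pandas as pd

    marital = pd.Series(["married", "single", "divorced", "single"],
                        name="marital_status")
    dummies = pd.get_dummies(marital, prefix="var").astype(int)
    # Drop the reference category so a married person scores 0 on both dummies.
    dummies = dummies[["var_single", "var_divorced"]]
    print(dummies)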
Endogenous Variable: A variable that is an inherent part of the system being studied and that is
determined from within the system. It is caused by other variables in a causal system.
Exogenous Variable: A variable entering from and determined from outside of the system being
studied. A causal system says nothing about its exogenous variables.
Latent Variable: An underlying variable that cannot be observed. It is hypothesized to exist in order
to explain other variables, such as specific behaviors, that can be observed.
Manifest Variable: An observed variable assumed to indicate the presence of a latent variable. It is
also known as an indicator variable. We cannot observe intelligence directly, for it is a latent
variable. We can look at indicators such as vocabulary size, success in one’s occupation, IQ test
score, ability to play complicated games (e.g., bridge) well, writing ability, and so on.

5.2.2 MEDIATOR AND MODERATOR VARIABLES

Mediator Variables: Mediator variables specify how or why a particular effect or relationship
occurs. Mediators describe the psychological process that occurs to create the relationship, and as
such are always dynamic properties of individuals (e.g., emotions, beliefs, behaviors). Baron and
Kenny (1986) suggest that mediators explain how external events take on internal psychological
significance. For example, Cooper et al. (1990) hypothesized that particular work features such as
work pressures and lack of control would increase work distress which, in turn, would increase
drinking. In this example, work distress is a mediator that explains how work features may come to
be associated with drinking. Statistically, after some basic conditions are met, mediation is
indicated when the relationship between the predictor (e.g., work pressure) and criterion (e.g.,
drinking) is non-significant after controlling for the effect of the mediator.
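Statistically, this can be checked by comparing regressions with and without the mediator. The sketch below follows the classic Baron and Kenny (1986) logic using statsmodels on simulated data; the variable names (pressure, distress, drinking) are hypothetical stand-ins for the Cooper et al. example:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
pressure = rng.normal(size=n)                   # predictor X
distress = 0.6 * pressure + rng.normal(size=n)  # mediator M, caused by X
drinking = 0.5 * distress + rng.normal(size=n)  # outcome Y, caused by M
df = pd.DataFrame({"pressure": pressure, "distress": distress,
                   "drinking": drinking})

# Step 1: X predicts Y (the total effect).
print(smf.ols("drinking ~ pressure", data=df).fit().params)
# Step 2: X predicts M.
print(smf.ols("distress ~ pressure", data=df).fit().params)
# Step 3: with M in the model, the direct effect of X should shrink
# toward zero if distress fully mediates the pressure-drinking link.
print(smf.ols("drinking ~ pressure + distress", data=df).fit().params)
```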
Moderator Variables: Variables that influence, or moderate, the relationship between the
independent and dependent variables and thus produce an interaction effect. That is, the presence
of a third variable (the moderating variable) modifies the original relationship between the
independent and the dependent variable. Unlike extraneous variables, moderator variables are
measured and taken into consideration. For example, a strong relationship has been observed
between the quality of library facilities (X) and the performance of the students (Y). Although this
relationship is supposed to be true generally, it is nevertheless contingent on the interest and
inclination of the students. It means that only those students who have the interest and inclination
to use the library will show improved performance in their studies. In this relationship, interest and
inclination is the moderating variable; that is, it moderates the strength of the association between
the X and Y variables.
A moderator variable changes the strength of an effect or relationship between two variables.
Moderators indicate when or under what conditions a particular effect can be expected. A
moderator may increase the strength of a relationship, decrease the strength of a relationship, or
change the direction of a relationship. In the classic case, a relationship between two variables is
significant (i.e., non-zero) under one level of the moderator and zero under the other level of the
moderator. For example, work stress increases drinking problems for people with a highly avoidant
(e.g., denial) coping style, but work stress is not related to drinking problems for people who score
low on avoidant coping (Cooper, Russell, & Frone, 1990). As another example, negative social contacts
(e.g., disagreement with friend) are associated with increased drinking at home for college students
who say that they drink to cope (e.g., to forget about problems), but negative social contacts are
unrelated to drinking at home for students who do not drink to cope (Mohr et al., 2005).
Statistically, a moderator is revealed through a significant interaction.
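As a rough sketch of such an interaction test (simulated data; the names stress, avoidant, and drinking are hypothetical stand-ins for the Cooper, Russell, and Frone example), a moderator enters a regression as a product term:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 400
stress = rng.normal(size=n)
avoidant = rng.integers(0, 2, size=n)  # 0 = low, 1 = high avoidant coping
# Stress raises drinking only in the high-avoidant group.
drinking = 0.7 * stress * avoidant + rng.normal(size=n)
df = pd.DataFrame({"stress": stress, "avoidant": avoidant,
                   "drinking": drinking})

# 'stress * avoidant' expands to both main effects plus their product;
# a significant stress:avoidant coefficient indicates moderation.
model = smf.ols("drinking ~ stress * avoidant", data=df).fit()
print(model.summary().tables[1])
```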
The strength and form of a relation between two variables may depend on the value of a moderating
variable. The examination of moderator effects has a long and important history in a variety of
research areas (Aguinis, 2004; Aiken & West, 1991). Moderator effects are also called interactions
because the third variable interacts with the relation between two other variables. For intervention
research, moderator variables may reflect subgroups of persons for which the intervention is more
or less effective than for other groups. The moderator variable can be a continuous or categorical
variable, although interpretation of a categorical moderator is usually easier than a continuous
moderator. A moderating variable may be a factor in a randomized manipulation, representing
random assignment to levels of the factor. For example, participants may be randomly assigned to
levels of treatment dosage in addition to the type of treatment received, in order to test the
moderator effect of dosage across the two treatments. Moderator variables can be
stable aspects of individuals such as sex, race, age, ethnicity, genetic predispositions, and so on.
Moderator variables may also be variables that may not change during the period of a research
study, such as socioeconomic status, risk-taking tendency, prior health care utilization, impulsivity,
and intelligence. Moderator variables may also be environmental contexts such as type of school and
geographic location. Moderator variables may also be baseline measure of an outcome or mediating
measure such that intervention effects depend on the starting point for each participant. The
values of the moderating variable may be latent such as classes of individuals formed by analysis of
repeated measures from participants. The important aspect is that the relation between two
variables X and Y depends on the value of the moderator variable, although the type of moderator
variable, randomized or not, stable characteristic, or changing characteristic often affects
interpretation of a moderation analysis. Moderator variables may be specified before a study as a
test of theory or they may be investigated after the study in an exploratory search for different
relations across subgroups.
There are several overlapping reasons for including moderating variables in a research study.
 Acknowledgment of the complexity of behavior: The investigation of moderating variables
acknowledges the complexity of behavior, experiences, and relationships. Individuals are not the
same. It would be unusual if there were no differences across individuals. This focus on
individual versus group effects is more commonly known as the tendency for researchers to be
either lumpers or splitters (Simpson, 1945). Lumpers seek to group individuals and focus on how
persons are the same. Splitters, in contrast, look for differences among groups. Of course, both
approaches have problems: splitters may yield smaller and smaller groups until there is one person
in each group, while lumpers may fail to observe real subgroups, including subgroups with
iatrogenic effects, or may miss predictive relations because of opposite effects in subgroups.
 Manipulation check: An additional experimental factor can be chosen so that the predicted
relation should differ across its levels, making the moderation analysis a test of the intervention
theory. For example, if the dose of an intervention is manipulated in addition to the
intervention-versus-control factor, dosage can be treated as a moderator: if the intervention works
as theorized, the size of the effect should differ across levels of dosage.
 Generalizability of results: Moderation analysis provides a way to test whether an intervention
has similar effects across groups. It would be important, for example, to demonstrate that
intervention effects are obtained for males and females if the program would be disseminated
to a whole group containing males and females. Similarly, the consistency of an intervention
effect across subgroups demonstrates important information about the generalizability of an
intervention.
 Specificity of effects: In contrast to generalizability, it is important to identify groups for
which an intervention has its greatest effects or no effects. This information could then be
used to target groups for intervention thereby tailoring of an intervention.
 Identification of iatrogenic effects in subgroups: Moderation analysis can be used to identify
subgroups for which effects are counterproductive. It is possible that there will be subgroups
for which the intervention causes more negative outcomes.
 Investigation of lack of an intervention effect: If there are two groups that are affected by an
intervention in opposite ways, the overall effect will be non-significant even if there is a
statistically significant intervention effect in both groups, albeit opposite. Without investigation
of moderating variables, these types of effects would not be observable. In addition, before
abandoning an intervention or area of research it is useful to investigate subgroups for any
intervention effect. Of course, this type of exploratory search must consider the possibility of
multiplicity, whereby testing more effects leads to finding some effects by chance alone.
 Moderators as a test of theory: There are situations where intervention effects may be
theoretically expected in one group and not another. For example, there may be different social
tobacco intervention effects for boys versus girls because reasons for smoking may differ
across sex. In this way, mediation and moderation may be combined if it is expected that a
theoretical mediating process would be present in one group but not in another group.
 Measurement improvement: Lack of a moderating variable effect may be due to poor
measurement of the moderator variable. Many moderator variables have reasonably good
reliability such as age, sex, and ethnicity but others may have measurement limitations such as
risk-taking propensity or impulsivity.
 Practical implications: If moderator effects are found, then decisions about intervention
delivery may depend on this information. If intervention effects are positive at all levels of the
moderator, then it is reasonable to deliver the whole program. If intervention effects are
observed for one group and not another, it may be useful to deliver the program to the group
where it had success and develop a new intervention for other groups. Of course, there are
cases where the delivery of an intervention as a function of a moderating variable cannot be
realistically or ethically used in the delivery of an intervention. For example, it may be realistic
to deliver different programs to different ages and sexes but less realistic to deliver programs
based on risk taking, impulsivity, or prior drug use, for example, because of labeling of
individuals or practical issues in identifying groups. By grouping persons for intervention, there
may also be iatrogenic effects, for example, grouping adolescent drug users together may have
harmful effects by enhancing a social norm to take drugs in this group.
Moderators such as age, sex, and race are often routinely included in surveys. Demographic
characteristics are also often measured including family income, marital status, number of
siblings, and so on. Other measures of potential moderators have the same measurement and
time demand issues as for mediating variables; that is, additional measures may increase
respondent burden.
Moderation and Mediation in the Same Analysis: Both moderating and mediating variables can be
investigated in the same research but the interpretation of mediation in the presence of moderation
can be complex statistically and conceptually (Edwards & Lambert, 2007; Fairchild & MacKinnon,
2009; Preacher, Rucker, & Hayes, 2007). There are two major types of effects that combine
moderation and mediation (Baron & Kenny, 1986; Fairchild & MacKinnon, 2009) - (a) moderation of a
mediation effect, where the mediated effect is different at different values of a moderator and (b)
mediation of a moderation effect, where the effect of an interaction on a dependent variable is
mediated. An example of moderation of a mediation effect is a case where a mediation process
differs for males and females. For example, a program may affect social norms equally for both
males and females but social norms only significantly reduce subsequent tobacco use for females not
for males. These types of analyses can be used to test homogeneity of action theory across groups
and homogeneity of conceptual theory across groups (MacKinnon, 2008). An example of moderation
of a mediated effect is a case where social norms mediate the effect of an intervention on drug use
but the size of the mediated effect differs as a function of risk-taking propensity. An example of
mediation of a moderator effect would occur if the effect of an intervention depends on baseline
risk-taking propensity such that the interaction is due to a mediating variable of social norms, which
then affects drug use. These types of effects are important because they help specify types of
subgroups for whom mediational processes differ and help quantify more complicated hypotheses
about mediation and moderation relations. Despite the potential for moderation of a mediation
effect and mediation of a moderation effect, few research studies include both mediation and
moderation, at least in part because of the difficulty of specifying and interpreting these models.
General models that include both mediation and moderation have been described; these include the
single mediator model and the single moderator model as special cases (Fairchild & MacKinnon,
2009; MacKinnon, 2008).
Third-Variable Effects: Mediating and moderating variables are examples of third variables. Most
research focuses on the relation between two variables - an independent variable X and an outcome
variable Y. Example statistics for two-variable effects are the correlation coefficient, odds ratio,
and regression coefficient. With two variables, there are a limited number of possible causal
relations between them: X causes Y; Y causes X; or X and Y are reciprocally related. With three
variables, the number of possible relations among the variables increases substantially: X may cause
the third variable Z and Z may cause Y; Y may cause both X and Z, and the relation between X and Y
may differ for each value of Z, along with others. Mediation and moderation are names given to two
types of third-variable effects. If the third variable Z is intermediate in a causal sequence such
that X causes Z and Z causes Y, then Z is a mediating variable; it is in a causal sequence X → Z → Y.
If the relation between X and Y is different at different values of Z, then Z is a moderating
variable. A primary distinction between mediating and moderating variables is that the mediating
variable specifies a causal sequence in that a mediating variable transmits the causal effect of X to
Y but the moderating variable does not specify a causal relation, only that the relation between X
and Y differs across levels of Z.
Another important third variable is the confounding variable that causes both X and Y such that
failure to adjust for the confounding variable will confound or lead to incorrect conclusions about
the relation of X to Y. A confounding variable differs from a mediating variable in that the
confounding variable is not in a causal sequence but the confounding variable is related to both X
and Y. A confounder differs from a moderating variable because the relation of X to Y may not
differ across values of the confounding variable.

5.2.3 RELATIONSHIP AMONG VARIABLES


Once the variables relevant to the topic of research have been identified, the researcher is
interested in the relationships among them. A statement containing variables is called a
proposition; it may contain one or more variables. A proposition with one variable may be called a
univariate proposition, one with two variables a bivariate proposition, and one with three or more
variables a multivariate proposition. Prior to the formulation of a proposition the
researcher has to develop strong logical arguments which could help in establishing the relationship.
For example, age at marriage and education are the two variables that could lead to a proposition:
the higher the education, the higher the age at marriage. What could be the logic to reach this
conclusion? All relationships have to be explained with strong logical arguments. If the relationship
refers to an observable reality, then the proposition can be put to test, and any testable proposition
is a hypothesis.
It is very important to understand the relationships between variables in order to draw the right
conclusions from a statistical analysis.
Without an understanding of this, you can fall into many pitfalls that accompany statistical
analysis and infer wrong results from your data. Relationships between variables need to be studied
and analyzed before drawing conclusions based on it. In natural science and engineering, this is
usually more straightforward as you can keep all parameters except one constant and study how this
one parameter affects the result under study. However, in social sciences, things get much more
complicated because parameters may or may not be directly related. There could be a number of
indirect consequences and deducing cause and effect can be challenging. Only when the change in one
variable actually causes the change in another parameter is there a causal relationship. Otherwise, it
is simply a correlation. Correlation doesn’t imply causation. To actually measure relationships among
variables, you have to know what level of measurement the variable is. The level of measurement
determines what kinds of mathematical operations can meaningfully be performed on the values of a
variable.
Variables                                 | Test for Relationship                          | Example
Both variables are nominal level          | Chi-square test                                | See which divisions have the most female employees
IV is nominal; DV is interval or ratio    | t-test (if the IV has only 2 categories); ANOVA | Test the hypothesis that male employees are more satisfied than female employees
Both variables are interval level         | Correlation; Regression                        | Look at the relationship between job satisfaction and salary level
Many social science variables, such as attitude scales, are really ordinal level measurements. But
there are not many measures of ordinal relationship, and all are beyond the scope of this class. So
what do you do? There are two choices - one, treat them as nominal and use chi-square tests, or two,
treat them as interval and use correlation and regression. People normally do the latter (treat them
as interval).
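A minimal sketch of how the three rows of the table map onto scipy.stats routines, with invented data:

```python
import numpy as np
from scipy import stats

# Nominal x nominal: chi-square test on a division-by-gender table.
table = np.array([[30, 10],    # division A: males, females
                  [20, 25]])   # division B: males, females
chi2, p, dof, _ = stats.chi2_contingency(table)

# Nominal IV (2 groups) x interval DV: independent-samples t-test.
male_sat = np.array([3.2, 3.8, 4.1, 3.5, 3.9])
female_sat = np.array([3.9, 4.2, 4.4, 4.0, 4.3])
t, p_t = stats.ttest_ind(male_sat, female_sat)

# Interval x interval: Pearson correlation of salary and satisfaction.
salary = np.array([40, 55, 62, 48, 70], dtype=float)
satisfaction = np.array([3.1, 3.9, 4.2, 3.4, 4.5])
r, p_r = stats.pearsonr(salary, satisfaction)
print(chi2, t, r)
```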

5.2.4 CONTROLLING EXTRANEOUS VARIABLES


Although researchers try to exercise adequate experimental control, sometimes a crucial,
uncontrolled extraneous variable is discovered only after the data are compiled. Shortcomings in
control are found even in published experiments. The following common techniques illustrate major
principles that can be applied to a wide variety of specific control problems.
1. Elimination
2. Constancy of Conditions
3. Balancing
4. Counterbalancing
5. Randomization.
Elimination: Elimination is a technique to control extraneous variables by removing them from an
experiment. The most desirable way to control extraneous variables is simply to eliminate them from the
experimental situation. For example, eliminate noises and lights in laboratories where necessary.
Some extraneous variables that one would have a hard time eliminating are gender, age, intelligence,
etc.
Constancy of Conditions: Constancy of conditions is a control procedure used to avoid confounding;
keeping all aspects of the treatment conditions identical except for the independent variable that is
being manipulated. Extraneous variables that cannot be eliminated might be held constant
throughout the experiment. The same value of such a variable is thus present for all participants.
For example, the time of day is an important variable in that people perform better on the
dependent variable in the morning than in the afternoon. To hold time of day constant, all
participants might be studied at about the same hour on successive days.
One of the standard applications of the technique of holding conditions constant is to conduct
experiment sessions in the same room. Thus, whatever might be the influence of the particular
characteristics of the room, it would be the same for all participants. Many aspects of the
experimental procedure are held constant, such as instructions to participants. The experimenters
thus read precisely the same set of instructions to all participants (except as modified for
different experimental conditions). Procedurally, all participants should go through the same steps
in the same order. The apparatus for administering the experimental treatment and for recording
the results should be the same for all participants.
Balancing: Balancing is a technique used to control the impact of extraneous variables by
distributing their effects equally across treatment conditions. Balancing may be used in two
situations – (1) where one is unable to identify the extraneous variables, and (2) where they can be
identified and one takes special steps to control them. For example, there are three unknown
extraneous variables operating on the experimental group in addition to the independent variable.
Their effects can be balanced out by allowing them to operate also on the control group. Therefore,
the independent variable is the only one that can differentially influence the two groups.
[Figure 5.1. Control Group as a Technique of Balancing: extraneous variables 1, 2, and 3 operate on
both the experimental group (which receives a positive amount of the independent variable) and the
control group (which receives a zero amount), so any difference in the dependent variable between
the groups can be attributed to the independent variable.]
Counterbalancing: Experiments conducted with a counterbalanced measures design are one of the
best ways to avoid the pitfalls of standard repeated measures designs, where the subjects are
exposed to all of the treatments. In a normal experiment, the order in which treatments are given
can actually affect the behavior of the subjects or elicit a false response, due to fatigue or outside
factors changing the behavior of many of the subjects. To counteract this, researchers often use a
counterbalanced design, which reduces the chances of the order of treatment or other factors
adversely influencing the results.
Counterbalancing is a way to remove confounding factors from an experiment by having slightly
different tasks for different groups of participants. Counterbalancing can be defined as using all of
the possible orders of conditions to control order effects (Cozby, 2009). An order effect occurs
when the order of presenting the treatments affects the dependent variable (Cozby, 2009). In a
repeated measures design it is important that the experimenter counterbalance all of the
possible orders of conditions, so that the extent to which order influences the results can be
determined. Participants are exposed to all treatment conditions in a repeated-measures design,
which leads to concerns about carryover effects and order effects. Counterbalancing refers to
exposing participants to different orders of the treatment conditions to ensure that such carryover
effects and order effects fall equally on the experimental conditions. The simplest type of
counterbalanced measures design is used when there are two possible conditions, A and B. The
subjects are divided into two groups: one group is treated with condition A followed by condition B,
and the other is tested with condition B followed by condition A.

Group 1: Treatment A → Treatment B → Post-Test
Group 2: Treatment B → Treatment A → Post-Test

Counterbalancing can be either complete or incomplete. With complete counterbalancing, every
possible order of treatment is used equally often. Thus, with k treatments, there would be k!
different orders. As seen below, for three treatment conditions (A, B, and C), each treatment would
occur equally often in each position. With three conditions, the process is to divide the subjects
into 6 groups, which receive the orders ABC, ACB, BAC, BCA, CAB, and CBA.
Group 1: A → B → C → Post-Test
Group 2: A → C → B → Post-Test
Group 3: B → A → C → Post-Test
Group 4: B → C → A → Post-Test
Group 5: C → A → B → Post-Test
Group 6: C → B → A → Post-Test

The problem with complete counterbalancing is that for complex experiments, with multiple
conditions, the permutations quickly multiply and the research project becomes extremely unwieldy.
For example, four possible conditions require 24 orders of treatment (4 × 3 × 2 × 1), and the number
of participants must be a multiple of 24, because you need an equal number in each
group. With 5 conditions you need multiples of 120 (5 × 4 × 3 × 2 × 1), and with 7 you need 5,040. Therefore,
for all but the largest research projects with huge budgets, this is impractical and a compromise is
needed.
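The factorial growth is easy to verify; the following sketch (not tied to any particular experiment) enumerates complete counterbalancing orders with itertools:

```python
from itertools import permutations
from math import factorial

conditions = ["A", "B", "C"]
orders = list(permutations(conditions))
print(orders)       # the 6 orders ABC, ACB, BAC, BCA, CAB, CBA
print(len(orders))  # 3! = 6

# The number of required groups explodes with more conditions:
for k in range(2, 8):
    print(k, "conditions ->", factorial(k), "orders")
```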
Randomization: Randomization is a sampling method used in scientific experiments. It is commonly
used in randomized controlled trials in experimental research. In medical research, randomization
and control of trials is used to test the efficacy or effectiveness of healthcare services or health
technologies like medicines, medical devices or surgery. Randomized controlled trials are one of the
most efficient ways of reducing the influence of external variables. Randomization refers to the
practice of using chance methods (random number tables, flipping a coin, etc.) to assign subjects to
treatments. In this way, the potential effects of lurking variables are distributed at chance levels
(hopefully roughly evenly) across treatment conditions. It reduces bias as much as possible. The
fundamental goal of randomization is to ensure that each treatment is equally likely to be assigned
to any given experimental unit (a small sketch of random assignment follows the list below).
Randomization is the process of making something random; in various contexts this involves, for example -
 generating a random permutation of a sequence (such as when shuffling cards);
 selecting a random sample of a population (important in statistical sampling);
 allocating experimental units via random assignment to a treatment or control condition;
 generating random numbers (random number generation); or
 transforming a data stream (such as when using a scrambler in telecommunications).
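As promised above, here is a minimal sketch of random assignment to treatment and control conditions; the participant IDs are invented:

```python
import random

participants = [f"P{i:02d}" for i in range(1, 21)]  # hypothetical IDs
random.seed(42)              # seeded only so the sketch is reproducible
random.shuffle(participants)

# Split the shuffled list in half: chance alone decides group membership,
# so lurking variables should be distributed roughly evenly.
half = len(participants) // 2
treatment, control = participants[:half], participants[half:]
print("Treatment:", treatment)
print("Control:  ", control)
```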
In science, randomized experiments are the experiments that allow the greatest reliability and
validity of statistical estimates of treatment effects. Randomization-based inference is especially
important in experimental design and in survey sampling.
Consider, for example, a study of a healthier regime: the number of extraneous factors and potential confounding variables is enormous.
Age, gender, weight, even what the participants eat at home, and activity level are just some of the
factors that could make a difference. In addition, if the researchers, generally a health-conscious
bunch, are involved in the selection of participants, they might subconsciously pick those who are
most likely to adapt to the healthier regime and show better results. Such a pre-determined bias
destroys the chance of obtaining useful results. By using pure randomized controlled trials and
allowing chance to select participants into one of the two groups, it can be assumed that any
confounding variables are cancelled out, as long as you have a large enough sample group. Whilst
randomized controlled trials are regarded as the most accurate experimental design in the social
sciences, education, medicine and psychology, they can be extremely resource heavy, requiring very
large sample groups, so are rarely used. Instead, researchers sacrifice generalization for
convenience, leaving large scale randomized controlled trials for researchers with bigger budgets
and research departments.

5.3 RELIABILITY
Reliability refers to the consistency or repeatability of an operationalized measure. A reliable
measure will yield the same results over and over again when applied to the same thing. It is the
degree to which a test consistently measures whatever it measures. If you have a survey question
that can be interpreted several different ways, it is going to be unreliable. One person may
interpret it one way and another may interpret it another way. You do not know which interpretation
people are taking. Even answers to questions that are clear may be unreliable, depending on how they
are interpreted.
Reliability refers to the consistency of scores obtained by the same persons when they are re-
examined with the same tests on different occasions, or with different sets of equivalent items, or
under other variable examining conditions. Research requires dependable measurement.
Measurements are reliable to the extent that they are repeatable and that any random influence
which tends to make measurements different from occasion to occasion or circumstance to
circumstance is a source of measurement error. Errors of measurement that affect reliability are
random errors and errors of measurement that affect validity are systematic or constant errors.
Reliability of any research is the degree to which it gives an accurate score across a range of
measurement. It can thus be viewed as being ‘repeatability’ or ‘consistency’. In summary –
 Inter-rater: Different people, same test.
 Test-retest: Same people, different times.
 Parallel-forms: Different people, same time, different test.
 Internal consistency: Different questions, same construct.

Test-retest, equivalent forms and split-half reliability are all determined through correlation.
There are a number of ways of determining the reliability of an instrument. The procedure can be
classified into two groups –
 External Consistency Procedures: These compare findings from two independent processes of data
collection as a means of verifying the reliability of the measure. Examples include test-retest
reliability and parallel forms of the same test.
 Internal Consistency Procedures: The idea behind this procedure is that items measuring the
same phenomenon should produce similar results. For example, split-half technique.

5.3.1 TYPES OF RELIABILITY


All types of reliability are concerned with the degree of consistency or agreement between two
independently derived sets of scores. There are various types of reliability -
Test-Retest Reliability: The most obvious method for finding the reliability of test scores is by
repeating the identical test on a second occasion. Test-retest reliability is a measure of reliability
obtained by administering the same test twice over a period of time to a group of individuals. The
scores from ‘Time 1’ and ‘Time 2’ can then be correlated in order to evaluate the test for stability
over time. The reliability coefficient in this case is simply the correlation between the scores
obtained by the same persons on the two administrations of the test. The error variance
corresponds to the random fluctuations of performance from one test session to the other. These
variations may result in part from-
 Uncontrolled testing conditions, such as extreme changes in weather, sudden noises and other
distractions, or a broken pencil point.
 Test taker error, such as unpleasant nature, emotional strain, worry, fatigue and the like.
Example: A test designed to assess student learning in psychology could be given to a group of
students twice, with the second administration perhaps coming a week after the first. The obtained
correlation coefficient would indicate the stability of the scores.
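Once the two sets of scores are in hand, the coefficient itself is a single correlation. A sketch with made-up scores for ten students tested a week apart:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical scores for ten students, one week apart.
time1 = np.array([72, 85, 90, 64, 78, 88, 70, 95, 60, 82])
time2 = np.array([75, 83, 92, 66, 74, 90, 72, 93, 63, 85])

r, p = pearsonr(time1, time2)
print(f"test-retest reliability r = {r:.2f}")  # stability over time
```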

[Figure 5.2. Test-Retest Reliability: the same measure is administered at Time 1 and again at Time 2,
and the two sets of scores are correlated.]
We know that if we measure the same thing twice that the correlation between the two
observations will depend in part by how much time elapses between the two measurement occasions.
The shorter the time gap, the higher the correlation; the longer the time gap, the lower the
correlation. This is because the two observations are related over time - the closer in time we get
the more similar the factors that contribute to error. In actual practice, a simple distinction can
usually be made and an effort is made to keep the interval short. In testing young children, the
period should be even shorter than for older persons, since at early ages progressive developmental
changes are discernible over a period of a month or even less. For any type of person, the interval
between retests should rarely exceed six months.
Advantage: Simple and straightforward.
Disadvantage: The main disadvantages of this type of reliability are -
 Practice effect: Practice will probably produce varying amounts of improvement in the retest
scores of different individuals.
 Interval effect: If the interval between retest is fairly short, the test takers may recall many
of their former responses.
 Reasoning effect: Once the test taker has grasped the principle involved in the problem or has
worked out a solution, s/he can reproduce the correct response in the future without going
through the intervening steps.
 A number of sensory discrimination and motor tests would fall into this category.
Split-Half Reliability: Split-half reliability is a subtype of internal consistency reliability. In split-
half reliability we randomly divide all items that purport to measure the same construct into two
sets. We administer the entire instrument to a sample of people and calculate the total score for
each randomly divided half. The most commonly used method to split the test into two is using the
odd-even strategy. The split-half reliability estimate, as shown in the figure, is simply the
correlation between these two total scores. In the example it is .88.
[Figure 5.3. Split-Half Reliability: the six items of a measure are randomly split into two halves
(here items 1, 3, 4 and items 2, 5, 6); the correlation between the two half-test total scores,
.88 in this example, is the split-half estimate.]
From a single administration of one form of a test, it is possible to arrive at a measure of reliability
by various split-half procedures. In such a way, two scores are obtained for each person by dividing
the test into equivalent halves. Split-half reliability provides a measure of consistency with regard
to content sampling. The steps that are followed for this reliability -
Step 1: Divide the test into equivalent halves.
Step 2: Compute a ‘Pearson r’ between scores on the two halves of the test.
Step 3: Adjust the half-test reliability using the ‘Spearman-Brown’ formula.
The Spearman-Brown formula is widely used in determining reliability by the split-half method, and
many test manuals report reliability in this form. When applied to split-half reliability, the formula
always involves doubling the length of the test. Under these conditions, it can be simplified as
follows -
Reliability = (2 × r_half-test) / (1 + r_half-test)

For example, if the half-test correlation (for a 40-item test) between the 20 odd-numbered and 20
even-numbered items turned out to be .50, the full-test (40-item) reliability would be .67.
The reliability of the full test (r_tt) can be estimated by means of the Spearman-Brown prophecy
formula as follows –

r_tt = (2 × r_hh) / (1 + r_hh) = (2 × 0.50) / (1 + 0.50) = 1.00 / 1.50 = 0.67
To find split-half reliability, the first problem is how to split the test in order to obtain the most
nearly equivalent halves. Any test can be divided in many different ways. A procedure that is
adequate for most purposes is to find the scores on the odd and even items of the test. If the
items were originally arranged in an approximate order of difficulty, such a division yields very
nearly equivalent half-scores. Once the two half-scores have been obtained for each person, they
may be correlated by the usual method. This correlation actually gives the reliability of only a half-
test.
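The whole odd-even procedure, including the Spearman-Brown correction described earlier, fits in a short sketch; the 0/1 item responses below are invented:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical responses: 8 persons x 6 items scored 0/1.
items = np.array([
    [1, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 0, 1, 1, 0, 1],
    [1, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 1, 0, 1],
    [0, 0, 0, 1, 0, 0],
])

odd_half = items[:, 0::2].sum(axis=1)   # items 1, 3, 5
even_half = items[:, 1::2].sum(axis=1)  # items 2, 4, 6
r_half, _ = pearsonr(odd_half, even_half)

# Spearman-Brown correction gives the full-length reliability.
r_full = (2 * r_half) / (1 + r_half)
print(f"half-test r = {r_half:.2f}, full-test reliability = {r_full:.2f}")
```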
Inter-Rater Reliability: Inter-rater reliability is a measure of reliability used to assess the degree
to which different judges or raters agree in their assessment decisions. Inter-rater reliability is
also known as inter-observer reliability or inter-coder reliability. Inter-rater reliability is useful
because human observers will not necessarily interpret answers the same way; raters may disagree
as to how well certain responses or material demonstrate knowledge of the construct or skill being
assessed. Inter-rater reliability might be employed when different judges are evaluating the degree
to which art portfolios meet certain standards. Inter-rater reliability is especially useful when
judgments can be considered relatively subjective. Thus, the use of this type of reliability would
probably be more likely when evaluating artwork as opposed to math problems.
[Figure 5.4. Inter-Rater (Inter-Observer) Reliability: two observers independently rate the same
object or phenomenon, and the agreement between their ratings is assessed.]


When multiple people are giving assessments of some kind, or are the subjects of some test, the
different assessors should arrive at similar resulting scores. It can be used to calibrate people, for
example those being used as observers in an experiment. Inter-rater reliability thus evaluates
reliability across different people.
Two major ways in which inter-rater reliability is used are (a) testing how similarly people categorize
items, and (b) how similarly people score items. If your measurement consists of categories - the
raters are checking off which category each observation falls in - you can calculate the percent of
agreement between the raters. For instance, let’s say you had 100 observations that were being
rated by two raters. For each observation, the rater could check one of three categories. Imagine
that on 86 of the 100 observations the raters checked the same category. In this case, the percent
of agreement would be 86%. OK, it’s a crude measure, but it does give an idea of how much
agreement exists, and it works no matter how many categories are used for each observation. The
other major way to estimate inter-rater reliability is appropriate when the measure is a continuous
one. There, all you need to do is calculate the correlation between the ratings of the two observers.
For instance, they might be rating the overall level of activity in a classroom on a 1-to-7 scale. You
could have them give their rating at regular time intervals (e.g., every 30 seconds). The correlation
between these ratings would give you an estimate of the reliability or consistency between the
raters. This is the best way of assessing reliability when you are using observation, as observer bias
very easily creeps in. It does, however, assume you have multiple observers, which is not always the
case.
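A sketch of the categorical case, computing percent agreement between two hypothetical raters over 100 observations:

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical: two raters assigning each of 100 observations
# to one of three categories (0, 1, 2).
rater1 = rng.integers(0, 3, size=100)
# Rater 2 mostly copies rater 1, occasionally re-draws at random.
flip = rng.random(100) < 0.15
rater2 = np.where(flip, rng.integers(0, 3, size=100), rater1)

agreement = np.mean(rater1 == rater2)
print(f"percent agreement = {agreement:.0%}")  # crude but informative

# For continuous ratings, the analogue is simply the correlation
# between the two raters' scores, e.g., np.corrcoef(r1, r2).
```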
Parallel-Forms Reliability: Parallel forms reliability is a measure of reliability obtained by
administering different versions of an assessment tool (both versions must contain items that probe
the same construct, skill, knowledge base, etc.) to the same group of individuals. The scores from
the two versions can then be correlated in order to evaluate the consistency of results across
alternate versions. In parallel forms reliability you first have to create two parallel forms. One way
to accomplish this is to create a large set of questions that address the same construct and then
randomly divide the questions into two sets. You administer both instruments to the same sample of
people. The correlation between the two parallel forms is the estimate of reliability. For example, if
you wanted to evaluate the reliability of a critical thinking assessment, you might create a large set
of items that all pertain to critical thinking and then randomly split the questions up into two sets,
which would represent the parallel forms. One major problem with this approach is that you have to
be able to generate lots of items that reflect the same construct. Furthermore, this approach
makes the assumption that the randomly divided halves are parallel or equivalent. Even by chance
this will sometimes not be the case.

[Figure 5.5. Parallel Forms Reliability: equivalent Forms A and B are administered at Time 1 and
Time 2, and the scores on the two forms are correlated.]


The parallel forms approach is very similar to the split-half reliability. The major difference is that
parallel forms are constructed so that the two forms can be used independent of each other and
considered equivalent measures. For instance, we might be concerned about a testing threat to
internal validity. If we use Form A for the pretest and Form B for the posttest, we minimize that
problem. It would even be better if we randomly assign individuals to receive Form A or B on the
pretest and then switch them on the posttest. With split-half reliability we have an instrument that
we wish to use as a single measurement instrument and only develop randomly split halves for
purposes of estimating reliability.
Average Inter-Item Correlation: This is a subtype of internal consistency reliability.
It compares correlations between all pairs of questions that test the same construct by calculating
the mean of all paired correlations.
       I1    I2    I3    I4    I5    I6
I1   1.00
I2    .89  1.00
I3    .91   .92  1.00
I4    .88   .93   .95  1.00
I5    .84   .86   .92   .85  1.00
I6    .88   .91   .95   .87   .85  1.00

Average inter-item correlation = .90
Figure 5.6. Average Inter-Item Correlation.
The average inter-item correlation uses all of the items on our instrument that are designed to
measure the same construct. We first compute the correlation between each pair of items, as
illustrated in the figure. For example, if we have six items we will have 15 different item pairings
(i.e., 15 correlations). The average inter-item correlation is simply the average or mean of all these
correlations. In the example, we find an average inter-item correlation of .90 with the individual
correlations ranging from .84 to .95.
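Reproducing this computation amounts to correlating every pair of items and averaging the upper triangle of the correlation matrix. A sketch with simulated item data:

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data: 50 persons x 6 items driven by one construct.
construct = rng.normal(size=(50, 1))
items = construct + 0.5 * rng.normal(size=(50, 6))

corr = np.corrcoef(items, rowvar=False)        # 6 x 6 correlation matrix
upper = corr[np.triu_indices_from(corr, k=1)]  # the 15 unique pairings
print(f"average inter-item correlation = {upper.mean():.2f}")
```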
Average Item-total Correlation: This approach also uses the inter-item correlations. In addition, we
compute a total score for the six items and use that as a seventh variable in the analysis. The figure
shows the six item-to-total correlations at the bottom of the correlation matrix. They range from
.82 to .88 in this sample analysis, with the average of these at .85.
        I1    I2    I3    I4    I5    I6   Total
I1    1.00
I2     .89  1.00
I3     .91   .92  1.00
I4     .88   .93   .95  1.00
I5     .84   .86   .92   .85  1.00
I6     .88   .91   .95   .87   .85  1.00
Total  .84   .88   .86   .87   .83   .82  1.00

Average item-total correlation = .85
Figure 5.7. Average Item-Total Correlation.
Cronbach’s Alpha (α) Reliability: Cronbach’s alpha (often symbolized by the lower case Greek
letter α) is a measure of internal consistency, that is, how closely related a set of items are as a
group. It is considered to be a measure of scale reliability. A ‘high’ value for alpha does not imply
that the measure is unidimensional. If, in addition to measuring internal consistency, you wish to
provide evidence that the scale in question is unidimensional, additional analyses can be performed.
Exploratory factor analysis is one method of checking dimensionality. Technically speaking,
Cronbach’s alpha is not a statistical test - it is a coefficient of reliability (or consistency).
Cronbach’s alpha is the most common measure of internal consistency (reliability). It is most
commonly used when you have multiple Likert questions in a survey/questionnaire that form a scale
and you wish to determine if the scale is reliable. For example, a researcher has devised a nine-
question questionnaire to measure how safe people feel at work at an industrial complex. Each
question was a 5-point Likert item from ‘strongly disagree’ to ‘strongly agree’. In order to
understand whether the questions in this questionnaire all reliably measure the same latent variable
(feeling of safety) [so a Likert scale could be constructed], a Cronbach’s alpha was run on a sample
size of 15 workers. The alpha coefficient for the items is .839, suggesting that the items have
relatively high internal consistency. Note that a reliability coefficient of .70 or higher is considered
‘acceptable’ in most social science research situations.
Cronbach’s alpha can be written as a function of the number of test items and the average inter-
correlation among the items. Below, for conceptual purposes, we show the formula for the
standardized Cronbach’s alpha -

α = (N × c̄) / (v̄ + (N − 1) × c̄)

Here, N is the number of items, c̄ (c-bar) is the average inter-item covariance among the items,
and v̄ (v-bar) is the average item variance.
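A small sketch of this formula applied to an item-score matrix; the nine-item data are simulated, not taken from the safety-questionnaire example:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Alpha via the average-covariance form: (N*c_bar)/(v_bar+(N-1)*c_bar)."""
    n_items = items.shape[1]
    cov = np.cov(items, rowvar=False)
    v_bar = np.diag(cov).mean()                 # average item variance
    c_bar = cov[~np.eye(n_items, dtype=bool)].mean()  # average covariance
    return (n_items * c_bar) / (v_bar + (n_items - 1) * c_bar)

rng = np.random.default_rng(5)
construct = rng.normal(size=(60, 1))
items = construct + 0.7 * rng.normal(size=(60, 9))  # 9 hypothetical items
print(f"alpha = {cronbach_alpha(items):.3f}")
```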
One can see from this formula that if you increase the number of items, you increase Cronbach’s
alpha. Additionally, if the average inter-item correlation is low, alpha will be low. As the average
inter-item correlation increases, Cronbach’s alpha increases as well (holding the number of items
constant). Cronbach alpha is used to estimate the proportion of variance that is systematic or
consistent in a set of test scores. It can range from 0.00 (if no variance is consistent) to 1.00 (if all
variance is consistent), with all values between 0.00 and 1.00 also being possible. For example, if the
Cronbach alpha for a set of scores turns out to be .90, you can interpret that as meaning that the
test is 90% reliable, and by extension that it is 10% unreliable (100% - 90% = 10%). However, when
interpreting Cronbach alpha, you should keep in mind at least the following five concepts -
 Cronbach alpha provides an estimate of the internal consistency of the test, thus (a) alpha does
not indicate the stability or consistency of the test over time, which would be better estimated
using the test-retest reliability strategy, and (b) alpha does not indicate the stability or
consistency of the test across test forms, which would be better estimated using the equivalent
forms reliability strategy.
 Cronbach alpha is appropriately applied to norm-referenced tests and norm-referenced
decisions (e.g., admissions and placement decisions), but not to criterion-referenced tests and
criterion-referenced decisions (e.g., diagnostic and achievement decisions).
 All other factors held constant, tests that have normally distributed scores are more likely to
have high Cronbach alpha reliability estimates than tests with positively or negatively skewed
distributions, and so alpha must be interpreted in light of the particular distribution involved.
 All other factors held constant, Cronbach alpha will be higher for longer tests than for shorter
tests (Brown 1998 & 2001), and so alpha must be interpreted in light of the particular test
length involved.
 The standard error of measurement (or SEM) is an additional reliability statistic calculated
from the reliability estimate that may prove more useful than the reliability estimate itself
when you are making actual decisions with test scores. The SEM’s usefulness arises from the
fact that it provides an estimate of how much variability in actual test score points you can
expect around a particular cut-point due to unreliable variance.
Alternative-form Reliability: One way of avoiding the difficulties encountered in test-retest
reliability is through the use of alternate forms of the test. The same persons can thus be tested
with one form on the first occasion and with another, equivalent form on the second. The correlation
between the scores obtained on the two forms represents the reliability coefficient of the test. It
will be noted that such a reliability coefficient is a measure of both -
 Temporal stability
 Consistency of response to different item samples.
This coefficient thus combines two types of reliability. Alternate-form reliability provides a useful
measure for evaluating many tests. The concept of item sampling or content sampling underlies not
only alternate-form reliability but also other types of reliability. Like test-retest reliability,
alternate-form reliability should always be accompanied by a statement of the length of the interval
between test administrations. If the two forms are administered in immediate succession, the
resulting correlation shows reliability across forms only, not across occasions. The error variance in
this case represents fluctuations in performance from one set of items to another, but not
fluctuations over time.
In the development of alternate forms, care should be exercised to ensure that they are truly
parallel. Fundamentally, parallel forms of a test should be independently constructed tests designed
to meet the same specifications -
 The tests contain the same number of items;
 The items should be expressed in the same form;
 The items should cover the same type of content;
 The range and level of difficulty of the items should also be equal;
 Instructions, time limits, illustrative examples, format, and all other aspects of the test must
likewise be checked for equivalence.
Advantage
 Alternate-forms are useful in follow-up studies or in investigations of the effects of some
intervening experimental factor on test performance.
 The use of several alternate forms also provides a means of reducing the possibility of coaching
or cheating.
 This approach is more widely applicable than test-retest reliability.

Disadvantage
 In the first place, if the behavior functions under consideration are subject to a large practice
effect, the use of alternate forms will reduce but not eliminate such an effect. If the practice
effect is small, reduction will be negligible.
 Alternate forms are unavailable for many tests, because of the practical difficulties of
constructing truly equivalent forms.
Kuder-Richardson Reliability: This type of reliability, also utilizing a single administration of a single
form, is based on the consistency of responses to all items in the test. This inter-item consistency is
influenced by two sources of error variance –
 Content sampling
 Heterogeneity of the behavior domain sampled.
The more homogeneous the domain, the higher the inter-item consistency. Although homogeneous
tests are to be preferred because their scores permit fairly unambiguous interpretation, a single
homogeneous test is obviously not an adequate predictor of a highly heterogeneous criterion.
Moreover, in the prediction of a heterogeneous criterion, the heterogeneity of the test items would
not necessarily represent error variance. The most common procedure for finding inter-item
consistency is the one developed by Kuder and Richardson (1937). The most widely applicable formula,
commonly known as ‘Kuder-Richardson formula 20’, is the following -
r_tt = (n / (n − 1)) × ((SD_t² − Σpq) / SD_t²)

In this formula, r_tt = the reliability coefficient of the whole test; n = the number of items in the
test; SD_t = the standard deviation of total scores on the test; and pq = the product of the
proportion of persons who pass (p) and the proportion who do not pass (q) each item, summed over
all items (Σpq).
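A minimal sketch of KR-20 for dichotomously scored (0/1) items, using invented data:

```python
import numpy as np

def kr20(items: np.ndarray) -> float:
    """Kuder-Richardson formula 20 for a persons x items 0/1 matrix."""
    n = items.shape[1]
    p = items.mean(axis=0)                     # proportion passing each item
    q = 1.0 - p                                # proportion failing each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (n / (n - 1)) * (1.0 - (p * q).sum() / total_var)

rng = np.random.default_rng(9)
ability = rng.normal(size=(40, 1))
# 10 hypothetical items: higher ability -> higher chance of passing.
items = (ability + rng.normal(size=(40, 10)) > 0).astype(int)
print(f"KR-20 = {kr20(items):.3f}")
```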
Comparison Among Different Types of Reliability
Brown (1997) explains three strategies for estimating reliability - (a) test-retest reliability
(i.e., calculating a reliability estimate by administering a test on two occasions and calculating the
correlation between the two sets of scores), (b) equivalent (or parallel) forms reliability (i.e.,
calculating a reliability estimate by administering two forms of a test and calculating the correlation
between the two sets of scores), and (c) internal consistency reliability (i.e., calculating a reliability
estimate based on a single form of a test administered on a single occasion using one of the many
available internal consistency equations). Clearly, the internal consistency strategy is the easiest
logistically because it does not require administering the test twice or having two forms of the test.
Each of the reliability estimators has certain advantages and disadvantages. Inter-rater reliability
is one of the best ways to estimate reliability when your measure is an observation. However, it
requires multiple raters or observers. As an alternative, you could look at the correlation of ratings
of the same single observer repeated on two different occasions. For example, let’s say you
collected videotapes of child-mother interactions and had a rater code the videos for how often the
mother smiled at the child. To establish inter-rater reliability you could take a sample of videos and
have two raters code them independently. To estimate test-retest reliability you could have a single
rater code the same videos on two different occasions. You might use the inter-rater approach
especially if you were interested in using a team of raters and you wanted to establish that they
yielded consistent results. If you get a suitably high inter-rater reliability you could then justify
allowing them to work independently on coding different videos. You might use the test-retest
approach when you only have a single rater and don’t want to train any others. On the other hand, in
some studies it is reasonable to do both to help establish the reliability of the raters or observers.
The parallel forms estimator is typically only used in situations where you intend to use the two
forms as alternate measures of the same thing. Both the parallel forms and all of the internal
consistency estimators have one major constraint - you have to have multiple items designed to
measure the same construct. This is relatively easy to achieve in certain contexts like achievement
testing (it’s easy, for instance, to construct lots of similar addition problems for a math test), but
for more complex or subjective constructs this can be a real challenge. If you do have lots of items,
Cronbach’s Alpha tends to be the most frequently used estimate of internal consistency.
The test-retest estimator is especially feasible in most experimental and quasi-experimental designs
that use a no-treatment control group. In these designs you always have a control group that is
measured on two occasions (pretest and posttest). The main problem with this approach is that you
don’t have any information about reliability until you collect the posttest and, if the reliability
estimate is low, you’re pretty much sunk.
Each of the reliability estimators will give a different value for reliability. In general, the test-
retest and inter-rater reliability estimates will be lower in value than the parallel forms and internal
consistency ones because they involve measuring at different times or with different raters. Since
reliability estimates are often used in statistical analyses of quasi-experimental designs (e.g., the
analysis of the nonequivalent group design), the fact that different estimates can differ
considerably makes the analysis even more complex.

5.3.2 FACTORS AFFECTING THE RELIABILITY OF A RESEARCH


In the social sciences it is quite impossible to have a research tool which is 100% accurate, not only
because the instrument itself cannot be perfect but also because it is difficult to control all the
factors affecting reliability. Some of these factors are –
 Wording of Questions – a slight ambiguity in the wording of questions or statements can affect
the reliability of a research instrument as respondents may interpret the questions differently
at different times, resulting in different responses.
 Physical Setting - in the case of an instrument being used in an interview, any change in the
physical setting at the time of the repeat interview may affect the responses given by a
respondent, which may affect reliability.
 Respondent’s Mood - a change in a respondent’s mood when responding to questions or writing
answers in a questionnaire can change the responses given and may affect the reliability of that instrument.
 Nature of Interaction - in an interview, the interaction between the interviewer and the
interviewee can affect responses significantly. During the repeat interview the responses given
may be different due to a change in interaction, which could affect reliability.
 Regression Effect of An Instrument – when a research instrument is used to measure attitudes
towards an issue, some respondents, after having expressed their opinion, may feel that they
have been either too negative or too positive towards the issue. The second time they may
express their opinion differently, thereby affecting reliability.

5.4 VALIDITY
Validity refers to whether the measure actually measures what it is supposed to measure. If a
measure is unreliable, it is also invalid. That is, if you do not know what it is measuring, it certainly
cannot be said to be measuring what it is supposed to be measuring. On the other hand, you can have a measure that is consistent (reliable) yet still invalid. For example, if we measure income level by asking someone how
many years of formal education they have completed, we will get consistent results, but education is
not income (although they are positively related). If the ‘trade dress’ of a product refers to the
total image of a product, then measuring how people perceive the product’s color and shape by
themselves falls far short of measuring the product’s ‘trade dress’. It is an invalid measure. In
general, validity is an indication of how sound your research is. More specifically, validity applies to
both the design and the methods of your research. Validity in data collection means that your
findings truly represent the phenomenon you are claiming to measure. Valid claims are solid claims.

Validity is described as the degree to which a research study measures what it intends to measure.
There are two main types of validity, internal and external. Internal validity refers to the validity of
the measurement and test itself, whereas external validity refers to the ability to generalize the
findings to the target population. Both are very important in analyzing the appropriateness,
meaningfulness and usefulness of a research study. Some factors which affect internal validity are -
Subject variability; Size of subject population; Time given for the data collection or experimental
treatment; History; Attrition; Maturation; Instrument/task sensitivity, etc. The important factors affecting external validity are - Population characteristics (subjects); Interaction of subject selection
and research; Descriptive explicitness of the independent variable; The effect of the research
environment; Researcher or experimenter effects; Data collection methodology; The effect of time
etc.

5.4.1 TYPES OF VALIDITY


Following are the validity types that are typically mentioned in texts and research papers when
talking about the quality of measurement. Each type views validity from a different perspective and
evaluates different relationships between measurements.
Face Validity: Face validity refers to the degree to which a test appears to measure what it
purports to measure. The stakeholders can easily assess face validity. Although this is not a very
‘scientific’ type of validity, it may be an essential component in enlisting motivation of stakeholders.
If the stakeholders do not believe the measure is an accurate assessment of the ability, they may
become disengaged with the task. For example, if a measure of art appreciation is created, all of the
items should be related to the different components and types of art. If the questions are
regarding historical time periods, with no reference to any artistic movement, stakeholders may not
be motivated to give their best effort or invest in this measure because they do not believe it is a
true assessment of art appreciation.
Predictive Validity: Predictive validity refers to whether a new measure of something has the same
predictive relationship with something else that the old measure had. In predictive validity, we
assess the operationalization’s ability to predict something it should theoretically be able to predict.
For instance, we might theorize that a measure of math ability should be able to predict how well a
person will do in an engineering-based profession. We could give our measure to experienced
engineers and see if there is a high correlation between scores on the measure and their salaries as
engineers. A high correlation would provide evidence for predictive validity - it would show that our
measure can correctly predict something that we theoretically think it should be able to predict.
There are obvious limitations to this as behavior cannot be fully predicted to great depths, but this
validity helps predict basic trends to a certain degree.
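A brief, hedged sketch of this check (the math-ability scores and salaries below are invented): correlate scores on the measure with the later criterion and inspect the strength of the relationship.

from scipy.stats import linregress

# Hypothetical math-ability scores and later engineering salaries (in thousands).
math_scores = [55, 62, 70, 48, 80, 66, 74, 59]
salaries = [61, 68, 75, 55, 88, 70, 79, 63]

result = linregress(math_scores, salaries)
print(f"r = {result.rvalue:.2f}")  # a high correlation supports predictive validity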
Criterion-Related Validity: Criterion validity is a test of a measure when the measure has several
different parts or indicators in it - a compound measure. Each part or criterion of the measure should have a relationship with the other parts and with the variable to which the measure as a whole is hypothesized to be related. When you are expecting a future performance based on the
scores obtained currently by the measure, correlate the scores obtained with the performance. The
later performance is called the criterion and the current score is the predictor. It is used to
predict future or current performance - it correlates test results with another criterion of
interest. For example, suppose a physics program designs a measure to assess cumulative student
learning throughout the major. The new measure could be correlated with a standardized measure
of ability in this discipline, such as the GRE subject test. The higher the correlation between the established measure and the new measure, the more faith stakeholders can have in the new assessment
tool.
Content Validity: In content validity, you essentially check the operationalization against the
relevant content domain for the construct. This approach assumes that you have a good detailed
description of the content domain, something that’s not always true. In content validity, the criterion is the construct definition itself - it is a direct comparison. In criterion-related validity, we usually
make a prediction about how the operationalization will perform based on our theory of the
construct. When we want to find out if the entire content of the behavior/ construct/ area is
represented in the test we compare the test task with the content of the behavior. This is a logical
method, not an empirical one. For example, if we want to test knowledge of Bangladesh geography, it is not fair to have most questions limited to the geography of Australia.
Convergent Validity: Convergent validity refers to whether two different measures of presumably
the same thing are consistent with each other - whether they converge to give the same
measurement. In convergent validity, we examine the degree to which the operationalization is
similar to (converges on) other operationalizations that it theoretically should be similar to. For
example, to show the convergent validity of a test of arithmetic skills, we might correlate the
scores on the test with scores on other tests that purport to measure basic math ability, where high
correlations would be evidence of convergent validity. Or, if SAT scores and GRE scores are
convergent, then someone who scores high on one test should also score high on the other.
Different measures of ideology should classify the same people the same way. If they do not, then
they lack convergent validity.
Concurrent Validity: Concurrent validity is the degree to which the scores on a test are related to
the scores on another, already established test administered at the same time or to some other
valid criterion available at the same time. This compares the results from a new measurement
technique to those of a more established technique that claims to measure the same variable to see
if they are related. Often two measurements will behave in the same way, but are not necessarily
measuring the same variable; therefore this kind of validity must be examined thoroughly. In
concurrent validity, we assess the operationalization’s ability to distinguish between groups that it
should theoretically be able to distinguish between. For example, if we come up with a way of
assessing manic-depression, our measure should be able to distinguish between people diagnosed with manic-depression and those diagnosed with paranoid schizophrenia. If we want to assess the
concurrent validity of a new measure of empowerment, we might give the measure to both migrant
farm workers and to the farm owners, theorizing that our measure should show that the farm
owners are higher in empowerment. As in any discriminating test, the results are more powerful if
you are able to show that you can discriminate between two groups that are very similar.
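As an illustrative sketch of this known-groups comparison (the empowerment scores below are invented), an independent-samples t-test is one simple way to check whether the measure separates the two groups:

from scipy.stats import ttest_ind

# Hypothetical empowerment scores for the two groups.
farm_workers = [31, 28, 35, 30, 27, 33, 29, 32]
farm_owners = [41, 38, 44, 39, 45, 40, 43, 42]

t_stat, p_value = ttest_ind(farm_owners, farm_workers)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # a clear difference supports the measure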
Construct Validity: Construct validity is used to ensure that the measure actually measures what
it is intended to measure (i.e. the construct), and not other variables. Using a panel of ‘experts’
familiar with the construct is a way in which this type of validity can be assessed. The experts can
examine the items and decide what that specific item is intended to measure. This is whether the
measurements of a variable in a study behave in exactly the same way as the variable itself. This
involves examining past research regarding different aspects of the same variable. It is also the
degree to which a test measures an intended hypothetical construct. For example, suppose we want to validate a measure of anxiety. If we hypothesize that anxiety increases when subjects are under the threat of an electric shock, then the threat of an electric shock should increase scores on our measure.

Formative Validity: When applied to outcomes assessment, formative validity refers to how well a measure is able to provide information to help improve the program under study. For example, when designing a rubric for history, one could assess students’ knowledge across the discipline. If the measure can
provide information that students are lacking knowledge in a certain area, for instance the Civil
Rights Movement, then that assessment tool is providing meaningful information that can be used to
improve the course or program requirements.
Sampling Validity: Sampling validity (similar to content validity) ensures that the measure covers
the broad range of areas within the concept under study. Not everything can be covered, so items
need to be sampled from all of the domains. This may need to be completed using a panel of ‘experts’
to ensure that the content area is adequately sampled. Additionally, a panel can help limit ‘expert’
bias (i.e. a test reflecting what an individual personally feels are the most important or relevant
areas). For example - when designing an assessment of learning in the theatre department, it would
not be sufficient to only cover issues related to acting. Other areas of theatre such as lighting,
sound, functions of stage managers should all be included. The assessment should reflect the
content area in its entirety.
Discriminant Validity: In discriminant validity, we examine the degree to which the
operationalization is not similar to (diverges from) other operationalizations that it theoretically
should not be similar to. For instance, to show the discriminant validity of a Head Start program,
we might gather evidence that shows that the program is not similar to other early childhood
programs that don’t label themselves as Head Start programs. Or, to show the discriminant validity
of a test of arithmetic skills, we might correlate the scores on the test with scores on tests of verbal ability, where low correlations would be evidence of discriminant validity.
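A combined, hedged sketch of the convergent and discriminant checks (all scores below are invented): the new arithmetic test should correlate highly with another math test and only weakly with a verbal test.

import numpy as np

# Hypothetical scores for eight students on three tests.
new_arith = [10, 14, 12, 18, 9, 16, 13, 15]
other_math = [11, 13, 12, 17, 8, 17, 12, 14]  # should converge with new_arith
verbal = [12, 15, 9, 13, 14, 10, 16, 11]      # should diverge from new_arith

corr = np.corrcoef([new_arith, other_math, verbal])
print(f"Convergent r   = {corr[0, 1]:.2f}")  # expect high (about 0.95 here)
print(f"Discriminant r = {corr[0, 2]:.2f}")  # expect near zero (about -0.13 here)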

5.4.2 FACTORS AFFECTING THE VALIDITY OF A RESEARCH


Factors that can affect validity come in many forms, and it is important that these are
controlled for as much as possible during research to reduce their impact on validity. The term
history refers to effects that are not related to the treatment that may result in a change of
performance over time. This could refer to events in the participant’s life that have led to a change
in their mood etc. Instrumental bias refers to a change in the measuring instrument over time which
may change the results. This is often evident in behavioral observations where the practice and
experience of the experimenter influences their ability to notice certain things and changes their
standards. A main threat to internal validity is testing effects. Often participants can become tired
or bored during an experiment, and previous tests may influence their performance. This is often
counterbalanced in experimental studies so that participants receive the tasks in a different order
to reduce their impact on validity. If the results of a study are not deemed valid, they are meaningless. If a study does not measure what we want it to measure, then the results cannot be used to answer the research question, which is the main aim of the study; such results cannot be used to generalize any findings, and the study becomes a waste of time and effort. It is
important to remember that just because a study is valid in one instance it does not mean that it is
valid for measuring something else. So, validity is very important in a research study to ensure that
our results can be used effectively, and variables that may threaten validity should be controlled as
much as possible.

5.5 NORMS


Norms refer to conditions for social relations between groups and individuals, for the structure of
society and the difference between societies, and for human behavior in general. Norms are shared
rules, customs, and guidelines that govern society and define how people should behave in the
company of others. Norms may be applicable to all members of society or only to certain subsets of
the population, such as students, teachers, clergy, police officers, or soldiers in warfare. Norms
guide smooth and peaceful interactions by prescribing predictable behavior in different situations.
For instance, in the United States, handshaking is a traditional greeting; in other countries, the
expected protocol upon meeting someone might be to kiss both cheeks, bow, place palms together,
or curtsy. Norms tend to be institutionalized and internalized. Most social control of individuals
through norms is internal and guided by the pressures and restraints of cultural indoctrination.
Individual cultures sanction their norms. Sanctions may be rewards for conformity to norms or
punishment for nonconformity. Positive sanctions include rewards, praise, smiles, and gestures.
Negative sanctions include the infliction of guilt, condemnation, citations, fines, and imprisonment.
There is a definite distinction between values and norms. Values are individual or, in
some instances, commonly shared conceptions of desirable states of being. In contrast, norms are
generally accepted prescriptions for or prohibitions against behavior, belief, or feeling. While values
can be held by an individual, norms cannot; they must be upheld by a group. Norms always include sanctions; values never do. Norms tend to be based on and influenced by common values, and they
tend to persist even after the reasons for certain behaviors are forgotten. For instance, the habit
of shaking hands when meeting another person has its origin in the practice of revealing that the
right hand did not conceal a weapon (Morris, 1956).
Norms are the specific cultural expectations for how to behave in a given situation. They are the
agreed-upon expectations and rules by which the members of a culture behave. Norms vary from
culture to culture, so some things that are considered norms in one culture may not be in another
culture. For example, in America it is a norm to maintain direct eye contact when talking with others
and it is often considered rude if you do not look at the person you are speaking with. In many Asian cultures, on
the other hand, averting your eyes when conversing with others is a sign of politeness and respect
while direct eye contact is considered rude. Every society has expectations about how its members
should and should not behave. A norm is a guideline or an expectation for behavior. Each society
makes up its own rules for behavior and decides when those rules have been violated and what to do
about it. Norms change constantly. Norms differ widely among societies, and they can even differ
from group to group within the same society.
 Different settings: Wherever we go, expectations are placed on our behavior. Even within the
same society, these norms change from setting to setting. For example - the way we are
expected to behave in a mosque differs from the way we are expected to behave at a party, which
also differs from the way we should behave in a classroom.
 Different countries: Norms are place-specific, and what is considered appropriate in one
country may be considered highly inappropriate in another. For example - in some African
countries, it’s acceptable for people in movie theaters to yell frequently and make loud
comments about the film. In the United States, people are expected to sit quietly during a
movie, and shouting would be unacceptable.
 Different time periods: Appropriate and inappropriate behavior often changes dramatically from
one generation to the next. Norms can and do shift over time. For example - in the United
States in the 1950s, a woman almost never asked a man out on a date, nor did she pay for the
date. While some traditional norms for dating prevail, most women today feel comfortable asking
men out on dates and paying for some or even all of the expenses.


5.5.1 TYPES OF NORMS


Sociologists divide norms into four types: folkways, mores, taboos, and laws. These four types of
norms are ranked from least restrictive to most compulsory.
Folkways: Folkways refer to norms that reflect common conventions. They are often referred to as
‘customs’. They are standards of behavior that are socially approved but not morally significant.
They are norms for everyday behavior that people follow for the sake of tradition or convenience.
Breaking a folkway does not usually have serious consequences. Cultural forms of dress or food
habits are examples of folkways. In America, if someone belched loudly while eating at the dinner
table with other people, s/he would be breaking a folkway. It is culturally appropriate not to belch at the dinner table; however, if this folkway is broken, there are no moral or legal consequences.
Mores: Mores are strict norms that control moral and ethical behavior. Mores are norms based on
definitions of right and wrong. Unlike folkways, mores are morally significant. People feel strongly
about them and violating them typically results in disapproval. Religious doctrines are an example of
mores. For instance, if someone were to attend church in the nude, s/he would offend most people
of that culture and would be morally shunned. Also, parents who believe in the more that only
married people should live together will disapprove of their daughter living with her boyfriend. They
may consider the daughter’s actions a violation of their moral guidelines.
Taboos: Taboos refer to the strongest types of mores. A taboo is a norm that society holds so
strongly that violating it results in extreme disgust. Oftentimes the violator of the taboo is
considered unfit to live in that society. For instance, in some Muslim cultures, eating pork is taboo
because the pig is considered unclean. At the more extreme end, incest and cannibalism are taboos
in most countries.
Laws: A law is a norm that is written down and enforced by an official law enforcement agency.
Prohibitions against drunk driving, theft, murder, and trespassing are all examples of laws. If a person violates the law, s/he could be cited, fined, or jailed.
Ultimately, social norms are important, in part, because they enable individuals to agree on a shared
interpretation of the social situation and prevent harmful social interactions. In all types of
research, researchers must follow the relevant codes of conduct and should interpret and explain their participants’ behavior in light of the existing norms.
