
Unit-III

Syllabus

Measurement and Scaling,


Classification and Presentation of Data Through Charts, Frequency Distribution and Graphs,
Correlation and Regression.
Measurement scale
In statistics, variables and numbers are defined and categorized using different scales of measurement. Each level of measurement has specific properties that determine which statistical analyses can be applied. There are four types of scales: nominal, ordinal, interval and ratio. Each of the four scales provides a different type of information. Measurement refers to the assignment of numbers in a meaningful way, and understanding measurement scales is important for interpreting the numbers assigned to people, objects, and events.
A scale is a device or an object used to measure or quantify an event or another object.
Levels of Measurements
There are four different scales of measurement, and any data set can be described in terms of one of them. The four types of scales are:
• Nominal Scale
• Ordinal Scale
• Interval Scale
• Ratio Scale
Measurement and Scaling
The term ‘measurement’ means assigning numbers or some other symbols to the characteristics of certain objects. When numbers are used, the researcher must have a rule for assigning a number to an observation in a way that provides an accurate description.
The rule for assigning numbers to characteristics must be isomorphic, i.e. there must be a one-to-one correspondence between the numbers and the characteristics being measured. We do not measure the object itself but some characteristic of it. There are two reasons for which numbers are usually assigned:

1. Numbers permit statistical analysis of the resulting data, and

2. They facilitate the communication of measurement results.

Scaling is an extension of measurement. Scaling involves creating a continuum on which measurements of objects are located.
Example: satisfaction towards an airline measured on a scale of 1 to 5, where 1 = strongly dissatisfied and 5 = strongly satisfied.
Scaling Techniques
Definition: A scaling technique is a method of placing respondents along a continuum of pre-assigned values, symbols or numbers, based on the features of a particular object and as per defined rules. All scaling techniques rest on four pillars: order, description, distance and origin.

Types of Scaling Techniques


1. Primary Scaling Techniques
1. Nominal Scale
2. Ordinal Scale
3. Interval Scale
4. Ratio Scale
2. Other Scaling Techniques
1. Comparative Scales
2. Non-Comparative Scales

Types of Scaling Techniques
Researchers have identified many scaling techniques; some of the most common scales are used by business organizations, researchers, economists, experts, etc. These techniques can be classified as primary scaling techniques and other scaling techniques.
• Primary Scaling Techniques
• The four major scales used in statistics for market research are the nominal, ordinal, interval and ratio scales, described below.
Nominal Scales
In nominal scales, numbers, such as driver’s license numbers and product serial numbers, are used
to name or identify people, objects, or events.
A nominal scale is the 1st level of measurement scale in which the numbers serve as “tags” or
“labels” to classify or identify the objects. A nominal scale usually deals with the non-numeric
variables or the numbers that do not have any value.
Gender is an example of a nominal measurement in which a number (e.g., 1) is used to label one
gender, such as males, and a different number (e.g., 2) is used for the other gender, females.
Numbers do not mean that one gender is better or worse than the other; they simply are used to
classify persons.
In fact, any other numbers could be used, because they do not represent an amount or a quality. It is impossible to use word names with certain statistical techniques, but numerals can be used in coding systems. For example, a department may wish to examine the relationship between gender (where male = 1, female = 2) and performance on physical-ability tests (with numerical scores indicating ability).
Characteristics of Nominal Scale
• A nominal scale variable is classified into two or more categories. In this measurement
mechanism, the answer should fall into either of the classes.
• It is qualitative. The numbers are used here to identify the objects.
• The numbers don’t define the object characteristics. The only permissible aspect of numbers in
the nominal scale is “counting.”
Example:
An example of a nominal scale measurement is given below:
What is your gender? - M- Male F- Female
Here, the variables are used as tags, and the answer to this question should be either M or F.
Department in an organization: 1. Marketing 2. HR 3. Finance 4. Operations 5. IT
• Nominal Scale
• Nominal scales are adopted for non-quantitative (containing no
numerical implication) labelling variables which are unique and
different from one another.
• Types of Nominal Scales:
1. Dichotomous: A nominal scale that has only two labels is called
‘dichotomous’; for example, Yes/No.
2. Nominal with Order: The labels on a nominal scale arranged in
an ascending or descending order is termed as ‘nominal with
order’; for example, Excellent, Good, Average, Poor, Worst.
3. Nominal without Order: Such nominal scale which has no
sequence, is called ‘nominal without order’; for example, Black,
White.
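Since counting is the only permissible operation on nominal data, a frequency count and the modal category are the appropriate summaries. A minimal sketch in Python, using hypothetical department labels:

```python
from collections import Counter

# Hypothetical survey responses coded on a nominal scale:
# department of each respondent (the labels carry no order or magnitude).
departments = ["Marketing", "HR", "Finance", "Marketing", "IT",
               "Finance", "Marketing", "Operations", "HR", "Marketing"]

counts = Counter(departments)        # counting is the only permissible operation
mode = counts.most_common(1)[0][0]   # the mode is the valid "average" for nominal data

print(counts)
print("Modal category:", mode)
```

Any recoding of the labels (e.g. swapping the numbers assigned to the categories) leaves these counts unchanged, which is exactly what makes the scale nominal.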

Ordinal Scale
In ordinal scales, numbers represent rank order and indicate the order of quality or
quantity, but they do not provide an amount of quantity or degree of quality.
The ordinal scale is the 2nd level of measurement that reports the ordering and ranking
of data without establishing the degree of variation between them. Ordinal represents
the “order.” Ordinal data is known as qualitative data or categorical data. It can be
grouped, named and also ranked.
Usually, the number 1 means that the person (or object or event) is better than the
person labeled 2; person 2 is better than person 3, and so forth—for example, to rank
order persons in terms of potential for promotion, with the person assigned the 1 rating
having more potential than the person assigned a rating of 2. Such ordinal scaling does
not, however, indicate how much more potential the leader has over the person
assigned a rating of 2, and there may be very little difference between 1 and 2 here.
When ordinal measurement is used (rather than interval measurement), the differences between adjacent ranks cannot be interpreted as equal amounts of the underlying quality.
Characteristics of the Ordinal Scale
• The ordinal scale shows the relative ranking of the variables
• It identifies and describes the magnitude of a variable
• Along with the information provided by the nominal scale, ordinal scales give the
rankings of those variables
• The interval properties are not known
• The surveyors can quickly analyze the degree of agreement concerning the
identified order of variables
• The ordinal scale functions on the concept of the
relative position of the objects or labels based on the
individual’s choice or preference.
• For example, At Amazon.in, every product has a
customer review section where the buyers rate the listed
product according to their buying experience, product
features, quality, usage, etc.
• The ratings so provided are as follows:
5 Star – Excellent
4 Star – Good
3 Star – Average
2 Star – Poor
1 Star – Worst
• For example, five objects with their scores and the corresponding ranks:

Object   Score   Rank
A        34      1
B        32      2
C        29      3
D        26      4
E        19      5
Example:
• Ranking of school students – 1st, 2nd, 3rd, etc.
• Ranking the factors in choosing a restaurant (1 = most important … 5 = least important)

a. Food Quality   5
b. Prices         3
c. Menu Variety   4
d. Ambience       1
e. Service        2
• Evaluating the frequency of occurrences
• Very often
• Often
• Not often
• Not at all
• Assessing the degree of agreement
• Totally agree
• Agree
• Neutral
• Disagree
• Totally disagree
In the ordinal scale, the assigned ranks cannot be added, multiplied, subtracted or divided. One can compute the median, percentiles and quartiles of the distribution.
The other major statistical analysis which can be carried out is the rank-order correlation coefficient.
As ordinal scale measurement is higher than nominal scale measurement, all the statistical techniques applicable to nominal scale measurement can also be used for ordinal scale measurement. However, the reverse is not true.
This is because ordinal data can be converted into nominal scale data, but not the other way round.
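A short Python sketch of the statistics that are permissible for ordinal data; the ratings and rankings below are hypothetical, and the rank-order correlation uses the standard Spearman formula:

```python
import statistics

# Hypothetical ordinal data: agreement ratings coded 1 (totally disagree)
# to 5 (totally agree) for ten respondents.
ratings = [4, 5, 3, 4, 2, 5, 4, 3, 4, 1]

# Median and quantiles are permissible for ordinal data; the mean is not,
# because distances between adjacent ranks are unknown.
print("Median:", statistics.median(ratings))
print("Quartiles:", statistics.quantiles(ratings, n=4))

# Spearman rank-order correlation between two judges' rankings of five objects:
# rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), d = difference in ranks.
def spearman(rank_x, rank_y):
    n = len(rank_x)
    d2 = sum((x - y) ** 2 for x, y in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 4, 3, 5]
print("Spearman rho:", spearman(judge_a, judge_b))
```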
Interval Scale
The interval scale is the 3rd level of measurement scale. It is defined as a quantitative measurement scale in which the difference between two values is meaningful and the intervals between adjacent values are equal; the zero point, however, is arbitrary.

In interval scales, numbers form a continuum and provide information about the amount of
difference, but the scale lacks a true zero. The differences between adjacent numbers are equal
or known. If zero is used, it simply serves as a reference point on the scale but does not
indicate the complete absence of the characteristic being measured.
The Fahrenheit and Celsius temperature scales are examples of interval measurement. In
those scales, 0 °F and 0 °C do not indicate an absence of temperature.

Characteristics of Interval Scale:


• The interval scale is quantitative as it can quantify the difference between the values
• It allows calculating the mean and median of the variables
• To understand the difference between the variables, you can subtract the values between
the variables
• The interval scale is widely used in statistics, as it allows numerical values to be assigned to subjective assessments such as feelings, attitudes, calendar dates, etc.
Example:
• Likert Scale
• Net Promoter Score (NPS)
• Bipolar Matrix Table
An interval scale is also called a cardinal scale which is
the numerical labelling with the same difference among
the consecutive measurement units. With the help of this
scaling technique, researchers can obtain a better
comparison between the objects.

For example, a survey conducted by an automobile company to find out the number of vehicles owned by people living in a particular area, who may be its prospective customers in future, adopted the interval scaling technique and provided the units 1, 2, 3, 4, 5, 6 to select from.

In the scale mentioned above, every unit has the same difference, i.e., 1, whether it is between 2 and 3 or between 4 and 5.
Other interval examples are class intervals such as 15 to 25 (exclusive) and 25 to 35. Quality may likewise be rated on an interval scale:

High quality   5   4   3   2   1   Low quality

or, equivalently,

High quality   2   1   0   -1   -2   Low quality

Both numberings carry the same information, which shows that the zero point of an interval scale is arbitrary.
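The equal-interval and arbitrary-zero properties can be checked numerically. The sketch below uses the Celsius-to-Fahrenheit conversion: differences are preserved up to a constant scale factor, but ratios are not.

```python
# An interval scale is unique up to a positive linear transformation
# y = a + b*x with b > 0 (a may be non-zero, so the zero point is arbitrary).
# Celsius -> Fahrenheit is such a transformation (a = 32, b = 9/5).
def c_to_f(c):
    return c * 9 / 5 + 32

c1, c2 = 10.0, 20.0

# Differences are preserved up to the scale factor b = 9/5 ...
diff_c = c2 - c1
diff_f = c_to_f(c2) - c_to_f(c1)
print(diff_f / diff_c)          # always 1.8

# ... but ratios are not: 20 degrees C is NOT "twice as hot" as 10 degrees C.
print(c2 / c1)                  # 2.0
print(c_to_f(c2) / c_to_f(c1))  # 68/50 = 1.36, a different ratio
```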
Ratio Scale
Ratio scales have all of the characteristics of interval
scales as well as a true zero, which refers to complete
absence of the characteristic being measured. Physical
characteristics of persons and objects can be measured
with ratio scales, and, thus, height and weight are
examples of ratio measurement. A score of 0 means there
is complete absence of height or weight. A person who is
1.2 metres (4 feet) tall is two-thirds as tall as a 1.8-metre-
(6-foot-) tall person. Similarly, a person weighing 45.4 kg
(100 pounds) is two-thirds as heavy as a person who
weighs 68 kg (150 pounds).
The ratio scale is the 4th level of measurement scale,
which is quantitative. It is a type of variable measurement
scale. It allows researchers to compare the differences or
intervals. The ratio scale has a unique feature. It possesses
the character of the origin or zero points.
Characteristics of Ratio Scale:
• Ratio scale has a feature of absolute zero
• It doesn’t have negative numbers, because of its zero-point feature
• It affords unique opportunities for statistical analysis. The variables can
be orderly added, subtracted, multiplied, divided. Mean, median, and
mode can be calculated using the ratio scale.
• Ratio scale has unique and useful properties. One such feature is that it
allows unit conversions like kilogram – calories, gram – calories, etc.
• Mathematically, data on an interval scale admit transformations of the form Y = a + bX with a ≠ 0, whereas a ratio scale admits only Y = bX. The Celsius–Fahrenheit conversion illustrates the interval case:

°C = (5/9)(°F − 32) = −160/9 + (5/9) °F

Since the constant term −160/9 is non-zero, temperature measured in °C or °F lies on an interval scale, not a ratio scale.
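By contrast, on a ratio scale a change of units is a transformation of the form Y = bX, so ratios are preserved. A small Python check using a pounds-to-kilograms conversion:

```python
# On a ratio scale the only admissible transformation is y = b*x (b > 0):
# the zero point is fixed, so ratios survive any change of units.
KG_PER_POUND = 0.45359237

weights_lb = [100.0, 150.0]
weights_kg = [w * KG_PER_POUND for w in weights_lb]

# The ratio of two weights is the same in pounds and in kilograms:
# a 100-pound person is two-thirds as heavy as a 150-pound person
# no matter which unit we measure in.
print(weights_lb[0] / weights_lb[1])
print(weights_kg[0] / weights_kg[1])
```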

• One of the most refined measurement techniques is the ratio scale. Similar to an interval scale, a ratio scale is an abstract number system. It allows measurement with proper intervals, order, categorization and distance, with the added property of originating from a fixed zero point. Here, comparisons can be made in terms of ratios.
• For example, a health product manufacturing company conducted a survey to identify the level of obesity in a particular locality. It released the following survey questionnaire:
Select the category to which your weight belongs:
• Less than 40 kilograms
• 40-59 Kilograms
• 60-79 Kilograms
• 80-99 Kilograms
• 100-119 Kilograms
• 120 Kilograms and more
Other Scaling Techniques
(not required in detail)
• Scaling of objects can be used for a comparative study between two or more objects (products, services, brands, events, etc.), or it can be carried out individually to understand the consumer's behaviour and response towards a particular object.
• The other scaling techniques fall under the following two categories, based on their comparability:
1) Comparative Scales
2) Non-Comparative Scales
Comparative Scales
For comparing two or more variables, a comparative scale is used by the respondents. Following are
the different types of comparative scaling techniques:
Paired Comparison: In a paired comparison, the respondent is presented with two objects and must select one. This technique is mainly used in product testing, to give consumers a comparative analysis of two major products in the market.
To compare more than two objects say comparing P, Q and R, one can first compare P with Q and then
the superior one (i.e., one with a higher percentage) with R.
For example, A market survey was conducted to find out consumer’s preference for the network
service provider brands, A and B. The outcome of the survey was as follows:
Brand ‘A’ = 57%
Brand ‘B’ = 43%
Thus, it is visible that the consumers prefer brand ‘A’, over brand ‘B’.
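The tallying behind such a paired-comparison result can be sketched in a few lines; the individual responses below are hypothetical:

```python
from collections import Counter

# Hypothetical paired-comparison responses: each of 20 respondents
# picks the preferred brand from the pair (A, B).
choices = ["A", "B", "A", "A", "B", "A", "B", "A", "A", "B",
           "A", "A", "B", "A", "B", "A", "A", "B", "A", "A"]

tally = Counter(choices)
for brand in ("A", "B"):
    share = 100 * tally[brand] / len(choices)   # percentage preferring this brand
    print(f"Brand '{brand}' = {share:.0f}%")
```

The brand with the higher percentage is the one carried forward when more than two objects are compared pairwise.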
Rank Order : In rank order scaling the respondent needs to rank or arrange the given objects according
to his or her preference.
For example, A soap manufacturing company conducted a rank order scaling to find out the orderly
preference of the consumers. It asked the respondents to rank the following brands in the sequence of
their choice:
SOAP BRANDS RANK
Brand V 4
Brand X 2
Brand Y 1
Brand Z 3
The above scaling shows that soap ‘Y’ is the most preferred brand, followed by soap ‘X’, then soap ‘Z’
and the least preferred one is the soap ‘V’.
Non-Comparative Scales
• A non-comparative scale is used to analyse the performance of an individual
product or object on different parameters. Following are some of its most common
types:
• Continuous Rating Scales
• It is a graphical rating scale where the respondents are free to place the object at a
position of their choice. It is done by selecting and marking a point along the
vertical or horizontal line which ranges between two extreme criteria.
• For example, a mattress manufacturing company used a continuous rating scale to find out the level of customer satisfaction with its new comfy bedding. The response can be taken in several different ways (stated as versions).
• Such a scale gives a non-comparative analysis of one particular product, i.e. the comfy bedding, making it clear how satisfied the customers are with the product and its features.
• Itemized Rating Scale
• Itemized scale is another essential technique under the non-comparative scales. It
emphasizes on choosing a particular category among the various given categories
by the respondents. Each class is briefly defined by the researchers to facilitate
such selection.
• The three most commonly used itemized rating scales are as follows:
• Likert Scale: In the Likert scale, the researcher provides some statements and asks the respondents to mark their level of agreement or disagreement with these statements by selecting one of the five given alternatives.
For example, A shoes manufacturing company adopted the Likert scale technique
for its new sports shoe range named Z sports shoes. The purpose is to know the
agreement or disagreement of the respondents.
For this, the researcher asked the respondents to circle a number representing the
most suitable answer according to them, in the following representation:
• 1 – Strongly Disagree
• 2 – Disagree
• 3 – Neither Agree Nor Disagree
• 4 – Agree
• 5 – Strongly Agree

Statement                                  Strongly   Disagree   Neither Agree   Agree   Strongly
                                           Disagree              Nor Disagree            Agree
Z sports shoes are very light weight           1          2            3            4        5
Z sports shoes are extremely comfortable       1          2            3            4        5
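Summarizing the collected Likert responses might look as follows; the responses are hypothetical, and the median is reported alongside the mean because treating Likert items as interval data is a common but debated practice:

```python
import statistics

# Hypothetical Likert responses (1 = Strongly Disagree ... 5 = Strongly Agree)
# for the statement "Z sports shoes are very light weight".
responses = [4, 5, 3, 4, 4, 5, 2, 4, 5, 4]

# The median is the safer ordinal summary; the mean assumes interval data.
print("Median:", statistics.median(responses))
print("Mean:", statistics.mean(responses))
```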
Semantic Differential Scale:
The semantic differential scale is a bi-polar seven-point non-comparative rating scale on which the respondent marks one of the seven points for each given attribute of the object, as per personal choice, thus depicting the respondent's attitude or perception towards the object.

For example, a well-known watch brand carried out semantic differential scaling to understand customers' attitudes towards its product. From such a scale one might find that customers consider the product to be of superior quality, but that the brand needs to focus more on the styling of its watches.
Scale Reliability and Validity
There are difficulties with measuring constructs in social science research. For instance, how do we know whether we are measuring "compassion" and not "empathy", since the two constructs are somewhat similar in meaning? Or is compassion the same thing as empathy? What makes it more complex is that these constructs are often abstract (i.e., they cannot be observed directly) and multi-dimensional (in which case we have the added problem of identifying their constituent dimensions). Hence, it is not adequate to measure social science constructs using just any scale that we prefer.

We also must test these scales to ensure that:

(1) these scales indeed measure the unobservable construct that we wanted to
measure (i.e., the scales are “valid”), and

(2) they measure the intended construct consistently and precisely (i.e., the
scales are “reliable”).

Reliability and validity, jointly called the "psychometric properties" of measurement scales, are the yardsticks against which the adequacy and accuracy of our measurement procedures are evaluated in scientific research.
A measure can be reliable but not valid, if it is measuring something very
consistently but is consistently measuring the wrong construct. Likewise, a
measure can be valid but not reliable if it is measuring the right construct,
but not doing so in a consistent manner.
Using the analogy of a shooting target, as shown in Figure 7.1, a multiple-
item measure of a construct that is both reliable and valid consists of shots
that clustered within a narrow range near the center of the target. A
measure that is valid but not reliable will consist of shots centered on the
target but not clustered within a narrow range, but rather scattered around
the target. Finally, a measure that is reliable but not valid will consist of
shots clustered within a narrow range but off from the target. Hence,
reliability and validity are both needed to assure adequate measurement of
the constructs of interest.

Figure 7.1. Comparison of reliability and validity


Reliability
Reliability is the degree to which the measure of a construct is consistent or dependable. In other words, if we
use this scale to measure the same construct multiple times, do we get pretty much the same result every time,
assuming the underlying phenomenon is not changing? An example of an unreliable measurement is people
guessing your weight. Quite likely, people will guess differently, the different measures will be inconsistent, and
therefore, the “guessing” technique of measurement is unreliable. A more reliable measurement may be to use a
weight scale, where you are likely to get the same value every time you step on the scale, unless your weight has
actually changed between measurements.

Note that reliability implies consistency but not accuracy. In the previous example of the weight scale, if the
weight scale is calibrated incorrectly (say, to shave off ten pounds from your true weight, just to make you feel
better!), it will not measure your true weight and is therefore not a valid measure. Nevertheless, the
miscalibrated weight scale will still give you the same weight every time (which is ten pounds less than your
true weight), and hence the scale is reliable.

What are the sources of unreliable observations in social science measurements? One of the primary sources is
the observer’s (or researcher’s) subjectivity. If employee morale in a firm is measured by watching whether the
employees smile at each other, whether they make jokes, and so forth, then different observers may infer
different measures of morale if they are watching the employees on a very busy day (when they have no time to
joke or chat) or a light day (when they are more jovial or chatty). Two observers may also infer different levels
of morale on the same day, depending on what they view as a joke and what is not. “Observation” is a qualitative
measurement technique. Sometimes, reliability may be improved by using quantitative measures, for instance,
by counting the number of grievances filed over one month as a measure of (the inverse of) morale. Of course,
grievances may or may not be a valid measure of morale, but it is less subject to human subjectivity, and
therefore more reliable. A second source of unreliable observation is asking imprecise or ambiguous questions.
For instance, if you ask people what their salary is, different respondents may interpret this question differently
as monthly salary, annual salary, or per hour wage, and hence, the resulting observations will likely be highly
divergent and unreliable. A third source of unreliability is asking questions about issues that respondents are not very familiar with or do not care about, such as asking an American college graduate whether he/she is satisfied with Canada's relationship with Slovenia, or asking a Chief Executive Officer to rate the effectiveness of his company's technology strategy – something that he has likely delegated to a technology executive.
So how can you create reliable measures? If your measurement involves soliciting information from others, as is the case with much of social science research, then you can start by replacing data collection techniques that depend more on researcher subjectivity (such as observations) with those that are less dependent on subjectivity (such as questionnaires), by asking only those questions that respondents are likely to know the answers to or issues that they care about, by avoiding ambiguous items in your measures (e.g., by clearly stating whether you are looking for annual salary), and by simplifying the wording in your indicators so that they are not misinterpreted by some respondents (e.g., by avoiding difficult words whose meanings they may not know). These strategies can improve the reliability of our measures, even though they will not necessarily make the measurements completely reliable. Measurement instruments must still be tested for reliability.
There are many ways of estimating reliability; some of the tests are mentioned here:

➢ Inter-rater reliability
➢ Test-retest reliability
➢ Split-half reliability
➢ Internal consistency reliability
➢ Cronbach’s alpha
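Cronbach's alpha, the most widely reported internal-consistency estimate, can be computed directly from its definition: alpha = k/(k − 1) × (1 − Σ item variances / variance of total scores). A minimal sketch with hypothetical item scores:

```python
import statistics

def cronbach_alpha(items):
    """items: list of item-score lists, one inner list per item,
    with scores in the same respondent order across items."""
    k = len(items)
    item_vars = [statistics.pvariance(scores) for scores in items]
    totals = [sum(t) for t in zip(*items)]   # each respondent's total score
    total_var = statistics.pvariance(totals)
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical 3-item scale answered by 5 respondents (scores 1-5).
item1 = [4, 5, 3, 4, 2]
item2 = [4, 4, 3, 5, 2]
item3 = [5, 4, 3, 4, 1]
print("Cronbach's alpha:", round(cronbach_alpha([item1, item2, item3]), 3))
```

A high alpha (conventionally above 0.7) indicates that the items vary together and so measure the construct consistently; it says nothing about whether the construct measured is the right one, i.e. reliability, not validity.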
Validity
• Validity , often called construct validity, refers to the extent to which a measure adequately represents
the underlying construct that it is supposed to measure. For instance, is a measure of compassion
really measuring compassion, and not measuring a different construct such as empathy? Validity can
be assessed using theoretical or empirical approaches, and should ideally be measured using both
approaches. Theoretical assessment of validity focuses on how well the idea of a theoretical construct
is translated into or represented in an operational measure. This type of validity is called translational
validity (or representational validity), and consists of two subtypes: face and content validity.
Translational validity is typically assessed using a panel of expert judges, who rate each item (indicator) on how well it fits the conceptual definition of the construct, or by a qualitative technique called Q-sort.

• Empirical assessment of validity examines how well a given measure relates to one or more external criteria, based on empirical observations. This type of validity is called criterion-related validity, which includes four sub-types: convergent, discriminant, concurrent, and predictive validity. While translational validity examines whether a measure is a good reflection of its underlying construct, criterion-related validity examines whether a given measure behaves the way it should, given the theory of that construct. This assessment is based on quantitative analysis of observed data using statistical techniques such as correlational analysis, factor analysis, and so forth. The distinction between theoretical and empirical assessment of validity is illustrated in Figure 7.2. However, both approaches are needed to adequately ensure the validity of measures in social science research.

• Note that the different types of validity discussed here refer to the validity of the measurement
procedures , which is distinct from the validity of hypotheses testing procedures , such as internal
validity (causality), external validity (generalizability), or statistical conclusion validity.
Classification of Data
Meaning of Classification of Data
• It is the process of arranging data into homogeneous (similar) groups according to their common characteristics.
• Raw data cannot be easily understood, and it is not fit for further analysis and interpretation. Arrangement of data
helps users in comparison and analysis.
• For example, the population of a town can be grouped according to sex, age, marital status, etc.

Classification of data
The method of arranging data into homogeneous classes according to the common features present in the data is
known as classification.

A planned data analysis system makes the fundamental data easy to find and recover. This can be of particular interest
for legal discovery, risk management, and compliance. Written methods and sets of guidelines for data classification
should determine what levels and measures the company will use to organise data and define the roles of employees
within the business regarding input stewardship.

Once a data-classification scheme has been designed, the security standards that stipulate proper access practices for each division, and the storage criteria that determine the data's lifecycle demands, should be discussed.

Objectives of Data Classification


The primary objectives of data classification are:
• To consolidate the volume of data in such a way that similarities and differences can be quickly understood. Figures
can consequently be ordered in sections with common traits.
• To aid comparison.
• To point out the important characteristics of the data at a glance.
• To give prominence to the important data collected while separating the optional elements.
• To enable statistical treatment of the material gathered.
Types of classification
Bases of Classification : There are four important bases of classification:

Geographical classification

Chronological classification

Qualitative classification

Quantitative classification

(i) Geographical classification

When data are classified on the basis of location or areas, it is called geographical classification.

In geographical classification, data are classified on the basis of location, region, etc. For example, if we present the data
regarding production of sugarcane or wheat or rice, in view of the four main regions in India, this would be known as
geographical classification as given below. Geographical classification is usually listed in alphabetical order for easy reference.

Items may also be listed by size to emphasize the magnitude of the areas under consideration, such as ranking the states based on population. Normally, in reference tables, the first approach (i.e. listing in alphabetical order) is followed.
Example: Classification of production of food grains in different states in India.

State             Production of food grains (in '000 tons)
Tamil Nadu        4500
Karnataka         4200
Andhra Pradesh    3600

Data on area under crop in India can be classified as shown below :


Region            Area (in hectares)
Central India     -
West              -
North             -
East              -
South             -

(ii) Chronological classification


Chronological classification means classification on the basis of time, like months, years etc.
Classification of data observed over a period of time is known as chronological classification.

Example: Profits of a company from 2001 to 2005.

Year   Profits (in '000 Rupees)
2001   72
2002   85
2003   92
2004   96
2005   95
Data on Production of food grains in India can be classified as shown below
Year      Tonnes
1990-91   -
1991-92   -
1992-93   -
1993-94   -
1994-95   -

(iii) Qualitative classification

In qualitative classification, data are classified on the basis of some attribute or quality such as sex, colour of hair, literacy or religion. In this type of classification, the attribute under study cannot be measured quantitatively; one can only count it according to its presence or absence among the individuals of the population under study.

The number of farmers based on their land holdings can be given as follows:

Type of farmers   Number of farmers
Marginal          907
Medium            1041
Large             1948
Total             3896
Qualitative classification can be of two types as follows
o Simple classification
o Manifold classification

(i) Simple Classification

This is based on only one quality.

E.g., classifying a population into literate and illiterate persons.

(ii) Manifold Classification

This is based on more than one quality.

E.g., classifying a population first into literate and illiterate persons, and then sub-dividing each group into male and female.

(iv) Quantitative classification


Quantitative classification refers to the classification of data according to some characteristics that can be measured numerically
such as height, weight, income, age, sales, etc.
Example: The students of a school may be classified according to weight as follows:

Weight (in kg)   No. of Students
40-50            50
50-60            200
60-70            300
70-80            100
80-90            30
90-100           20
Total            700

Consider the data on land holdings by farmers in a block. Here the quantitative classification is based on land holding, which is the variable.

Land holding (hectares)   Number of Farmers
<1                        442
1-2                       908
2-5                       471
>5                        124
Total                     1945

There are two types of quantitative classification of data. They are

i)Discrete frequency distribution

ii)Continuous frequency distribution


In this type of classification there are two elements (i) variable (ii) frequency

Variable

Variable refers to the characteristic that varies in magnitude or quantity. E.g. weight of the students. A variable may be discrete or
continuous.

Discrete variable

A discrete variable can take only certain specific values that are whole numbers (integers). E.g. Number of children in a family or
Number of class rooms in a school.

Continuous variable

A continuous variable can take any numerical value within a specific interval.

Example: the weight of a student in a particular class may take any value between 60 and 80 kg.

Frequency

Frequency refers to the number of times each variable gets repeated.

For example, if there are 50 students having a weight of 60 kg, then 50 is the frequency.

Frequency distribution

Frequency distribution refers to data classified on the basis of some variable that can be measured such as prices, weight, height,
wages etc.

The following are the two examples of discrete and continuous frequency distribution
The following technical terms are important when a continuous frequency distribution is formed

Class limits: Class limits are the lowest and highest values that can be included in a class. For example take the class 40-50. The
lowest value of the class is 40 and the highest value is 50. In this class there can be no value lesser than 40 or more than 50. 40 is
the lower class limit and 50 is the upper class limit.

Class interval: The difference between the upper and lower limit of a class is known as the class interval of that class. For example, in the class 40-50 the class interval is 10 (i.e., 50 minus 40).

Class frequency: The number of observations corresponding to a particular class is known as the frequency of that class.

Example:
Income (Rs)    No. of persons
1000 - 2000    50

In the above example, 50 is the class frequency. This means that 50 persons earn an income between Rs.1,000 and Rs.2,000.

(iv) Class mid-point: The mid-point of a class is worked out as follows: Class mid-point = (lower class limit + upper class limit) / 2. For example, the mid-point of the class 40-50 is (40 + 50) / 2 = 45.
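The classification terms above (class limits, class interval, mid-point, and frequency) can be illustrated with a short Python sketch; the weight values below are hypothetical, while the classes mirror the weight table used earlier.

```python
# Sketch: build a continuous frequency distribution from raw data.
# The class limits mirror the weight table above; the weight values
# themselves are made up for illustration.

def frequency_distribution(values, classes):
    """For each class [lower, upper), report interval, mid-point, frequency."""
    table = []
    for lower, upper in classes:
        freq = sum(1 for v in values if lower <= v < upper)  # class frequency
        interval = upper - lower                             # class interval
        midpoint = (lower + upper) / 2                       # class mid-point
        table.append((lower, upper, interval, midpoint, freq))
    return table

weights = [42, 47, 55, 58, 61, 66, 68, 73, 85, 95]   # hypothetical sample
classes = [(40, 50), (50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]

for lo, hi, width, mid, f in frequency_distribution(weights, classes):
    print(f"{lo}-{hi}: interval={width}, mid-point={mid}, frequency={f}")
```

Each row of the output corresponds to one class of the continuous frequency distribution.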


Tabulation of Data

A table is a systematic arrangement of statistical data in columns and rows. Rows are horizontal arrangements whereas the
columns are vertical ones.
Presentation of Data
Statistics is all about data. Presenting data effectively and efficiently is an art. You may have uncovered many truths that
are complex and need long explanations while writing. This is where the importance of the presentation of data comes
in. You have to present your findings in such a way that the readers can go through them quickly and understand each
and every point that you wanted to showcase. As time progressed and new and complex research started happening,
people realized the importance of the presentation of data to make sense of the findings.

Define Data Presentation


Data presentation is defined as the process of using various graphical formats to visually represent the relationship
between two or more data sets so that an informed decision can be made based on them.

Types of Data Presentation


Broadly speaking, there are three methods of data presentation:

• Textual

• Tabular

• Diagrammatic

Textual Ways of Presenting Data

Out of the different methods of data presentation, this is the simplest one. You just write your findings in a coherent manner and your job is done. The demerit of this method is that the reader has to go through the whole text to get a clear picture, although the introduction, summary, and conclusion can help condense the information.
Tabular Ways of Data Presentation and Analysis
To avoid the complexities involved in the textual way of data presentation, people use tables and charts to present data.
In this method, data is presented in rows and columns - just like you see in a cricket match scorecard showing who made how many runs. Each row and column has an attribute (name, year, sex, age, and other things like these). It is against these attributes that data is written within a cell.

Diagrammatic Presentation: Graphical Presentation of Data in Statistics


Diagrammatic Presentation has been divided into further categories:

• Geometric Diagram

When a Diagrammatic presentation involves shapes like a bar or circle, we call that a Geometric Diagram.
Examples of Geometric Diagram

• Bar Diagram

i)Simple Bar Diagram


Simple Bar Diagram is composed of rectangular bars. All of these bars have the same width and are placed at an
equal distance from each other. The bars are placed on the X-axis. The height or length of the bars is used as the
means of measurement. So, on the Y-axis, you have the measurement relevant to the data.
Suppose, you want to present the run scored by each batsman in a game in the form of a bar chart. Mark the runs
on the Y-axis - in ascending order from the bottom. So, the lowest scorer will be represented in the form of the
smallest bar and the highest scorer in the form of the longest bar.
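As a rough sketch of this idea (a real chart would normally be drawn with a plotting library), the bars can be rendered in text, with bar length proportional to runs scored; the batsmen and their scores below are hypothetical.

```python
# Sketch of a simple bar diagram rendered as text: one bar per batsman,
# bar length proportional to runs. Names and scores are hypothetical.

runs = {"Batsman A": 12, "Batsman B": 45, "Batsman C": 30}

def text_bar_chart(data, unit=3):
    """Draw one '#' per `unit` runs so all bars share the same scale."""
    lines = []
    # Ascending order: the lowest scorer gets the shortest bar.
    for name, value in sorted(data.items(), key=lambda kv: kv[1]):
        lines.append(f"{name:<10} {'#' * (value // unit)} {value}")
    return "\n".join(lines)

print(text_bar_chart(runs))
```

The bars share one scale (the `unit` parameter), just as all bars in a simple bar diagram share one axis.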
ii)Multiple Bar Diagram
In many states of India, electricity bills have bar diagrams showing the consumption in the last 5 months. Along with these bars, they also have bars that show the consumption in the same months of the previous year. This kind of bar diagram is called a Multiple Bar Diagram.

Methods of Data Presentation in Statistics

Pie Chart
A pie chart is a chart where you divide a pie (a circle) into different parts based on the data. Each data value is first transformed into a percentage of the total, and then that percentage figure is multiplied by 3.6 to obtain the angle, in degrees, of the corresponding sector. So if, for example, you get 30 degrees as the result, you draw that angle from the centre of the pie chart.
Pie charts provide a very descriptive, 2D depiction of data, suited to comparing the shares that different categories contribute to a whole.
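The percentage-to-angle conversion described above can be sketched in a few lines; the expense categories and amounts below are hypothetical.

```python
# Sketch: converting data values into pie-chart angles, as described above.
# Each value becomes a percentage of the total, and each percentage point
# corresponds to 3.6 degrees (100% x 3.6 = 360 degrees).

def pie_angles(data):
    total = sum(data.values())
    return {k: round(v / total * 100 * 3.6, 1) for k, v in data.items()}

# Hypothetical monthly expenses
expenses = {"Rent": 500, "Food": 300, "Travel": 200}
angles = pie_angles(expenses)
print(angles)   # the sector angles always sum to 360 degrees
```

Rent is 50% of the total, so its sector is 50 × 3.6 = 180 degrees, and so on for the other categories.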

Bar charts

A bar chart shows the data with rectangular bars whose lengths are directly proportional to the values they represent. The bars can be placed either vertically or horizontally depending on the data being represented.
Frequency Diagram
Suppose you want to present data that shows how many students have 1 to 2 pens, how many have 3 to 5 pens, how
many have 6 to 10 pens (grouped frequency) you do that with the help of a Frequency Diagram. A Frequency Diagram
can be of many kinds:

Column chart

It is a simplified version of the pictorial presentation that can handle a larger amount of data during presentations while keeping the insights of the data clear.
Histograms

It is well suited to presenting the spread of numerical data. The main feature that separates bar charts from histograms is the gaps: the bars of a histogram touch one another because the underlying data are continuous.
Pictorial Presentation

It is the simplest form of data Presentation often used in schools or universities to provide a clearer picture to students, who
are better able to capture the concepts effectively through a pictorial Presentation of simple data.
Box plots

A box plot (or box-and-whisker plot) is a way of representing groups of numerical data through their quartiles. Data presentation is easier with this style of graph because even minute differences between groups become visible.
Maps

Map data graphs help you present data over an area to display the regions of concern. Map graphs are useful for making an exact depiction of data over a wide area.

All these visual presentations share a common goal of creating meaningful insights and a platform to understand and manage
the data in relation to the growth and expansion of one’s in-depth understanding of data & details to plan or execute future
decisions or actions.
Importance of Data Presentation
Data presentation can be either a deal maker or a deal breaker, depending on how the content is delivered visually.

Data presentation tools are powerful communication tools that simplify the data by making it easily understandable and readable, while attracting and keeping the interest of readers and effectively showcasing large amounts of complex data in a simplified manner.

If the user can create an insightful presentation of the data in hand with the same sets of facts and figures, then the results promise
to be impressive.

There have been situations where the user has had a great amount of data and vision for expansion but the presentation drowned
his/her vision.

To impress the higher management and top brass of a firm, effective presentation of data is needed.

Data Presentation helps the clients or the audience to not spend time grasping the concept and the future alternatives of the
business and to convince them to invest in the company & turn it profitable both for the investors & the company.

Although data presentation has a lot to offer, the following are some of the major reasons behind the essence of an effective presentation:

• Many consumers or higher authorities are interested in the interpretation of data, not the raw data itself. Therefore, after the analysis of the data, users should represent the data with a visual aspect for better understanding and knowledge.
• The user should not overwhelm the audience with a large number of text-heavy slides; instead, favour pictures that speak for themselves.
• Data presentation often happens in a nutshell with each department showcasing their achievements towards company
growth through a graph or a histogram.
• Providing a brief description helps the user attain attention in a small amount of time while informing the audience about the context of the presentation.
• The inclusion of pictures, charts, graphs and tables in the presentation help for better understanding the potential outcomes.
• An effective presentation would allow the organization to determine the difference with the fellow organization and
acknowledge its flaws. Comparison of data would assist them in decision making.

Correlation Coefficient Formula: Definition


Correlation coefficient formulas are used to find how strong a relationship is between data. The formulas return a value between -1 and 1, where:

• 1 indicates a perfect positive relationship.

• -1 indicates a perfect negative relationship.
• A result of zero indicates no relationship at all.

Graphs showing a correlation of -1, 0 and +1

Meaning
• A correlation coefficient of 1 means that for every positive increase in one variable, there is a positive increase of a fixed proportion in the other. For example, shoe sizes go up in (almost) perfect correlation with foot length.
• A correlation coefficient of -1 means that for every positive increase in one variable, there is a decrease of a fixed proportion in the other. For example, the amount of gas in a tank decreases in (almost) perfect correlation with distance driven.
• Zero means that an increase in one variable is associated with neither a positive nor a negative change in the other. The two just aren’t related.

The absolute value of the correlation coefficient gives us the relationship strength. The larger the number, the stronger the relationship. For example, |-.75| = .75, which indicates a stronger relationship than .65.

Types of correlation coefficient formulas.


There are several types of correlation coefficient formulas.

One of the most commonly used formulas is Pearson’s correlation coefficient formula. If you’re taking a basic stats class, this is the one you’ll
probably use:

Pearson correlation coefficient

Two other formulas are commonly used: the sample correlation coefficient and the population correlation coefficient.

Sample correlation coefficient


The sample correlation coefficient is r = sxy / (sx × sy), where sx and sy are the sample standard deviations and sxy is the sample covariance.

Population correlation coefficient

The population correlation coefficient is ρ = σxy / (σx × σy), where σx and σy are the population standard deviations and σxy is the population covariance.
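A minimal sketch of the sample version of this formula, r = sxy / (sx × sy), computed from first principles with n − 1 in the denominators; the data are hypothetical.

```python
# Sketch: the sample correlation coefficient r = sxy / (sx * sy),
# built from the sample covariance and sample standard deviations
# (n - 1 denominators). The data values are hypothetical.

import math

def sample_correlation(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)  # covariance
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))         # std dev of x
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))         # std dev of y
    return sxy / (sx * sy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(sample_correlation(x, y), 4))
```

The same value is produced by Pearson's raw-sum formula; the covariance form simply factors the computation through the deviations from the means.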

What is Pearson Correlation?


Correlation between sets of data is a measure of how well they are related. The most common measure of correlation in statistics is the Pearson correlation. Its full name is the Pearson Product Moment Correlation (PPMC). It shows the linear relationship between two sets of data. In simple terms, it answers the question, “Can I draw a line graph to represent the data?” Two letters are used to represent the Pearson correlation: the Greek letter rho (ρ) for a population and the letter “r” for a sample.

Potential problems with Pearson correlation.


The PPMC is not able to tell the difference between dependent variables and independent variables. For example, if you are trying
to find the correlation between a high calorie diet and diabetes, you might find a high correlation of .8. However, you could also
get the same result with the variables switched around. In other words, you could say that diabetes causes a high calorie diet. That
obviously makes no sense. Therefore, as a researcher you have to be aware of the data you are plugging in. In addition, the PPMC
will not give you any information about the slope of the line; it only tells you whether there is a relationship.

Real Life Example

Pearson correlation is used in thousands of real life situations. For example, scientists in China wanted to know whether there was a relationship in how weedy rice populations differ genetically. The goal was to find out the evolutionary potential of the rice. Pearson’s correlation between the groups was analyzed. It showed a positive Pearson product-moment correlation of between 0.783 and 0.895 for weedy rice populations. This figure is quite high, which suggested a fairly strong relationship.

Example question: Find the value of the correlation coefficient from the following table:

SUBJECT AGE X GLUCOSE LEVEL Y


1 43 99
2 21 65
3 25 79
4 42 75
5 57 87
6 59 81
Step 1: Make a chart. Use the given data, and add three more columns: xy, x², and y².

SUBJECT AGE X GLUCOSE LEVEL Y XY X² Y²


1 43 99
2 21 65
3 25 79
4 42 75
5 57 87
6 59 81
Step 2: Multiply x and y together to fill the xy column. For example, row 1 would be 43 × 99 = 4,257.

SUBJECT AGE X GLUCOSE LEVEL Y XY X² Y²


1 43 99 4257
2 21 65 1365
3 25 79 1975
4 42 75 3150
5 57 87 4959
6 59 81 4779
Step 3: Take the square of the numbers in the x column, and put the result in the x² column.

SUBJECT AGE X GLUCOSE LEVEL Y XY X² Y²


1 43 99 4257 1849
2 21 65 1365 441
3 25 79 1975 625
4 42 75 3150 1764
5 57 87 4959 3249
6 59 81 4779 3481

Step 4: Take the square of the numbers in the y column, and put the result in the y² column.

SUBJECT AGE X GLUCOSE LEVEL Y XY X² Y²


1 43 99 4257 1849 9801
2 21 65 1365 441 4225
3 25 79 1975 625 6241
4 42 75 3150 1764 5625
5 57 87 4959 3249 7569
6 59 81 4779 3481 6561
Step 5: Add up all of the numbers in the columns and put the result at the bottom of the column. The Greek letter sigma (Σ) is a
short way of saying “sum of” or summation.

SUBJECT AGE X GLUCOSE LEVEL Y XY X² Y²


1 43 99 4257 1849 9801
2 21 65 1365 441 4225
3 25 79 1975 625 6241
4 42 75 3150 1764 5625
5 57 87 4959 3249 7569
6 59 81 4779 3481 6561
Σ 247 486 20485 11409 40022
Step 6: Use the following correlation coefficient formula:

r = [n(Σxy) – (Σx)(Σy)] / √{[n(Σx²) – (Σx)²] × [n(Σy²) – (Σy)²]}

The answer is: 2868 / 5413.27 = 0.529809


From our table:
• Σx = 247
• Σy = 486
• Σxy = 20,485
• Σx² = 11,409
• Σy² = 40,022
• n is the sample size, in our case = 6
The correlation coefficient =
• [6(20,485) – (247 × 486)] / √{[6(11,409) – 247²] × [6(40,022) – 486²]}
= 0.5298
The range of the correlation coefficient is from -1 to 1. Our result is 0.5298 or 52.98%, which means the variables have a
moderate positive correlation.
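The six steps above can be reproduced in a few lines of Python using the same age and glucose data from the table:

```python
# Sketch: reproducing the worked example above with the raw-sums form of
# Pearson's formula, using the age and glucose data from the table.

import math

age     = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]

n   = len(age)
sx  = sum(age)                                   # Σx  = 247
sy  = sum(glucose)                               # Σy  = 486
sxy = sum(x * y for x, y in zip(age, glucose))   # Σxy = 20485
sx2 = sum(x * x for x in age)                    # Σx² = 11409
sy2 = sum(y * y for y in glucose)                # Σy² = 40022

r = (n * sxy - sx * sy) / math.sqrt((n * sx2 - sx ** 2) * (n * sy2 - sy ** 2))
print(round(r, 4))   # 0.5298
```

The intermediate sums match the Σ row of the table, and the result matches the hand calculation above.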

What Is Regression?
Regression is a statistical method used in finance, investing, and other disciplines that attempts to determine the strength and
character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as
independent variables).

Regression helps investment and financial managers to value assets and understand the relationships between variables, such
as commodity prices and the stocks of businesses dealing in those commodities.

Regression Explained
The two basic types of regression are simple linear regression and multiple linear regression, although there are non-linear
regression methods for more complicated data and analysis. Simple linear regression uses one independent variable to explain or
predict the outcome of the dependent variable Y, while multiple linear regression uses two or more independent variables to
predict the outcome.

Regression can help finance and investment professionals as well as professionals in other businesses. Regression can also help predict sales for a company based on weather, previous sales, GDP growth, or other types of conditions. The capital asset pricing model (CAPM) is an often-used regression model in finance for pricing assets and discovering costs of capital.
The general form of each type of regression is:

• Simple linear regression: Y = a + bX + u


• Multiple linear regression: Y = a + b1X1 + b2X2 + b3X3 + ... + btXt + u

Where:

• Y = the variable that you are trying to predict (dependent variable).


• X = the variable that you are using to predict Y (independent variable).
• a = the intercept.
• b = the slope.
• u = the regression residual.

Regression takes a group of random variables, thought to be predicting Y, and tries to find a mathematical relationship between
them. This relationship is typically in the form of a straight line (linear regression) that best approximates all the individual data
points. In multiple regression, the separate variables are differentiated by using subscripts.

SIMPLE LINEAR REGRESSION

Regression analysis is mainly used to find equations that will fit the data. Linear analysis is one type of regression analysis. The equation for a line is Y = a + bX. Y is the dependent variable, whose future value one tries to predict when X, the independent variable, changes by a certain amount. “a” in the formula is the intercept, the value Y takes regardless of changes in the independent variable, and the term “b” is the slope, which signifies how much the dependent variable changes for a unit change in the independent variable.

KEY TAKEAWAYS

Regression helps investment and financial managers to value assets and understand the relationships between variables
Regression can help finance and investment professionals as well as professionals in other businesses.

A Real World Example of How Regression Analysis Is Used

Regression is often used to determine how many specific factors such as the price of a commodity, interest rates, particular
industries, or sectors influence the price movement of an asset. The aforementioned CAPM is based on regression, and it is
utilized to project the expected returns for stocks and to generate costs of capital. A stock's returns are regressed against the
returns of a broader index, such as the S&P 500, to generate a beta for the particular stock.

Beta is the stock's risk in relation to the market or index and is reflected as the slope in the CAPM model. The return for the stock in question would be the dependent variable Y, while the independent variable X would be the market risk premium.

Additional variables such as the market capitalization of a stock, valuation ratios, and recent returns can be added to the CAPM
model to get better estimates for returns. These additional factors are known as the Fama-French factors, named after the
professors who developed the multiple linear regression model to better explain asset returns.

Regression analysis is a widely used statistical method for estimating the relationships between one or more independent variables and a dependent variable. Regression is a powerful tool: it is used to assess the strength of the relationship between two or more variables, and it can then be used to model the relationship between those variables in the future.
Y = a + bX + ε

Where:

• Y – is the dependent variable


• X – is the independent (explanatory) variable
• a – is the intercept
• b – is the slope
• ε – is the residual (error)
The formula for intercept “a” and the slope “b” can be calculated per below.

a = [(Σy)(Σx²) − (Σx)(Σxy)] / [n(Σx²) − (Σx)²]

b = [n(Σxy) − (Σx)(Σy)] / [n(Σx²) − (Σx)²]
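These two formulas can be sketched directly in Python; the data below are hypothetical values chosen to lie exactly on the line Y = 2X, so the expected intercept is 0 and the slope is 2.

```python
# Sketch: least-squares intercept "a" and slope "b" from the raw-sum
# formulas given above. The sample data are hypothetical.

def linear_regression(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)                    # Σx, Σy
    sxy = sum(a * b for a, b in zip(x, y))     # Σxy
    sx2 = sum(a * a for a in x)                # Σx²
    denom = n * sx2 - sx ** 2                  # n(Σx²) − (Σx)²
    a = (sy * sx2 - sx * sxy) / denom          # intercept
    b = (n * sxy - sx * sy) / denom            # slope
    return a, b

# Hypothetical data lying exactly on Y = 2X: intercept 0, slope 2.
a, b = linear_regression([1, 2, 3], [2, 4, 6])
print(a, b)   # 0.0 2.0
```

The same function reproduces the intercepts and slopes of the worked examples that follow, given their raw data.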

Examples
Example #1
Consider the following two variables, x and y; you are required to do the calculation of the regression.

Solution:

Using the above formula, we can do the calculation of linear regression in Excel as follows.
We have all the values in the above table with n = 5.

Now, first, calculate the intercept and slope for the regression.

Calculation of Intercept is as follows,

a = [(628.33 × 88,017.46) − (519.89 × 106,206.14)] / [5 × 88,017.46 − (519.89)²]

a = 0.52
Calculation of Slope is as follows,

b = [(5 × 106,206.14) − (519.89 × 628.33)] / [5 × 88,017.46 − (519.89)²]

b = 1.20

Let’s now input the values in the regression formula to get regression.

Hence the regression line Y = 0.52 + 1.20 * X

When X = 1, Y = 0.52 + 1.20 × 1 = 1.72.

Example #2
The State Bank of India recently established a new policy of linking its savings account interest rate to the repo rate, and the auditor of the State Bank of India wants to conduct an independent analysis of the decisions taken by the bank regarding interest rate changes, to see whether the rate changed whenever there were changes in the repo rate. A summary of the repo rate and the bank's savings account interest rate that prevailed in those months is given below.

The auditor of the State Bank has approached you to conduct an analysis and present it at the next meeting. Use the regression formula to determine whether the bank's rate changed as and when the repo rate was changed.

Solution:

Using the formula discussed above, we can do the calculation of linear regression in Excel, treating the repo rate as the independent variable, i.e., X, and the bank's rate as the dependent variable, i.e., Y.
We have all the values in the above table with n = 6.

Now, first, calculate the intercept and slope for the regression.

Calculation of Intercept is as follows,

a = [(24.17 × 237.69) − (37.75 × 152.06)] / [6 × 237.69 − (37.75)²]

a = 4.28

Calculation of Slope is as follows,

b = [(6 × 152.06) − (37.75 × 24.17)] / [6 × 237.69 − (37.75)²]

b = −0.04

Let’s now input the values in the formula to arrive at the figure.

Hence the regression line Y = 4.28 – 0.04 * X

Analysis: It appears that the State Bank of India is indeed following the rule of linking its savings rate to the repo rate, as the non-zero slope signals a relationship between the repo rate and the bank's savings account rate.
Example #3
ABC laboratory is conducting research on height and weight and wanted to know if there is any relationship
like as the height increases, the weight will also increase. They have gathered a sample of 1000 people for
each of the categories and came up with an average height in that group.

Below are the details that they have gathered.

You are required to do the calculation of regression and come up with the conclusion that any such
relationship exists.

Solution:

Using the formula discussed above, we can do the calculation of linear regression in Excel, treating Height as the independent variable, i.e., X, and Weight as the dependent variable, i.e., Y.
We have all the values in the above table with n = 6

Now, first, calculate the intercept and slope for the regression.

Calculation of Intercept is as follows,


a = [(350 × 120,834) − (850 × 49,553)] / [6 × 120,834 − (850)²]

a = 68.63

Calculation of Slope is as follows,


b = [(6 × 49,553) − (850 × 350)] / [6 × 120,834 − (850)²]

b = −0.07

Let’s now input the values in the formula to arrive at the figure.

Hence the regression line Y = 68.63 – 0.07 * X

Analysis: It appears that there is only a very weak relationship between height and weight, as the slope is very small.
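Once fitted, the line from Example #3 can be used for prediction; the input height below is a hypothetical value chosen for illustration.

```python
# Sketch: using the fitted line from Example #3, Y = 68.63 - 0.07 * X,
# to predict weight (Y) from a hypothetical height (X) of 170 cm.

a, b = 68.63, -0.07   # intercept and slope from Example #3
height = 170          # hypothetical input height in cm
weight = a + b * height
print(round(weight, 2))   # 56.73
```

Because the slope is so small, the predicted weight barely changes as the height varies, which is exactly what the analysis above observes.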

Relevance and Uses of Regression Formula


When the correlation coefficient indicates that the data can predict future outcomes, and a scatter plot of the same dataset appears to form a straight line, one can use simple linear regression to find a predictive value or predictive function using the line of best fit. Regression analysis has many applications in the field of finance: it is used in the capital asset pricing model (CAPM), and it can be used to forecast the revenue and expenses of a firm.
