PART 8
Statistical Techniques for Processing &
Analysis of Data
M S Sridhar
Head, Library & Documentation
ISRO Satellite Centre
Bangalore 560017
1. Aggregates of facts
2. Affected by multiple causes
3. Numerically expressed
4. Collected in a systematic manner
5. Collected for a predetermined purpose
6. Enumerated or estimated according to reasonable
standard of accuracy
7. Statistics must be placed in relation to each other
(context)
[Figure: An even class interval and its mid-point. Lower limit 10, mid-point 15, upper limit 20, with 5 units on either side of the mid-point.]
Two methods for class limits: Exclusive & inclusive type class
intervals for determination of frequency of each class (see tally
sheet example given later)
(i) Exclusive method: The upper class limit of one class equals the
lower class limit of the next class. Suitable for data of a
continuous variable; the upper class limit is excluded from the
interval but the lower class limit is included.
(ii) Inclusive method: Both class limits are part of the class interval.
An adjustment to the class intervals is made if there is a 'gap' or
discontinuity between the upper limit of a class and the lower limit
of the next class.
Divide the difference between the upper limit of the first class and the
lower limit of the second class by 2; subtract it from all lower limits
and add it to all upper limits.
Adjusted class mark = (Adjusted upper limit + Adjusted lower limit) / 2
This adjustment restores continuity of data in the frequency
distribution
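The adjustment described above can be sketched in a few lines of Python. The class limits used here are hypothetical and the function name is only illustrative.

```python
def adjust_class_limits(classes):
    """Turn inclusive class limits into continuous class boundaries."""
    # Half the gap between the upper limit of the first class
    # and the lower limit of the second class.
    half_gap = (classes[1][0] - classes[0][1]) / 2
    return [(lower - half_gap, upper + half_gap) for lower, upper in classes]

inclusive = [(10, 19), (20, 29), (30, 39)]  # hypothetical inclusive classes
adjusted = adjust_class_limits(inclusive)
class_marks = [(lower + upper) / 2 for lower, upper in adjusted]

print(adjusted)
print(class_marks)
```

Here the gap is 1, so 0.5 is subtracted from every lower limit and added to every upper limit.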
Research Methodology 8 M S Sridhar, ISRO 21
Preparation of Frequency Distribution contd.
4. Find the frequency of each class (i.e., how many observations in
the raw data fall in that class) by tally marking. The frequency of an
observation is the number of times that observation occurs.
A frequency table gives the class intervals and the frequencies
associated with them
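A minimal sketch of the tallying step, assuming exclusive class intervals of equal width. The raw data and the function name are hypothetical; the less-than and more-than cumulative frequencies are computed alongside the plain frequencies.

```python
from itertools import accumulate

def frequency_table(data, start, width, n_classes):
    """Count observations per exclusive class [lower, upper)."""
    freqs = [0] * n_classes
    for x in data:
        k = int((x - start) // width)  # lower limit included, upper limit excluded
        if 0 <= k < n_classes:
            freqs[k] += 1
    less_than = list(accumulate(freqs))              # running total from the first class
    more_than = list(accumulate(freqs[::-1]))[::-1]  # running total from the last class
    classes = [(start + k * width, start + (k + 1) * width) for k in range(n_classes)]
    return classes, freqs, less_than, more_than

raw = [3, 7, 10, 12, 15, 15, 18, 20, 24, 29]  # hypothetical raw data
classes, freqs, less_than, more_than = frequency_table(raw, start=0, width=10, n_classes=3)
for row in zip(classes, freqs, less_than, more_than):
    print(row)
```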
Loss of information: Frequency distribution summarises raw data to
make it concise and comprehensible, but does not show the details
that are found in raw data.
Bivariate frequency distribution: A frequency distribution of two
variables (e.g., No. of books in stock and budget of 10 libraries)
Frequency distribution with unequal classes: When some classes are
densely populated and others sparsely populated, the observations in
some classes deviate more from their class marks than those in
other classes. In such cases unequal class intervals are more
appropriate. They are formed so that each class mark coincides, as
far as possible, with the value around which the observations in
that class tend to concentrate.
Frequency array: For a discrete variable, the classification of its data
is known as a frequency array (e.g., No. of books in 10 libraries)
Analysis of Data
Computation of certain indices or measures, searching for patterns
of relationships, estimating values of unknown parameters, & testing
of hypothesis for inferences
1. Descriptive analysis: Largely the study of distributions of one
variable (uni-dimension)
Univariate analysis → One variable
Bivariate analysis → Two variables
Multivariate analysis → More than two variables
Home work
Work out a frequency table with less than cumulative
and more than cumulative frequencies for the raw
data of number of words per line in a book given
below :
12 10 12 09 11 10 13 13
07 11 10 10 09 10 12 11
01 10 13 10 15 13 11 12
08 13 11 10 08 12 13 11
09 11 14 12 07 12 11 10
[Figure: Bar chart of No. of users (vertical axis, 0 to 10) by category of user.]
[Figure: Bar chart of No. of books (vertical axis) against No. of times borrowed (horizontal axis: 0 to 10 and >10).]
[Figure: Chart of No. of citations (vertical axis) against No. of reports (horizontal axis, 0 to 10).]
[Figure: Frequency polygon of No. of papers (vertical axis) against No. of authors (horizontal axis, 1 to 10).]
[Figure: Second chart of No. of papers against No. of authors (1 to 10).]
[Figure: Frequency distribution of No. of words per line of a book (Home work), plotted on a scale of 0 to 100.]
[Figure: Chart with a horizontal axis of years from 1980 to 2002.]
Line graph of less than or equal cumulative frequency
of self-citations in technical reports (Table 8.12)
[Figure: Line graph of No. of reports (vertical axis) against No. of self-citations (horizontal axis, 1 to 9), followed by a second graph with the same axes.]
[Figure: Pie chart. Journals 16%, Books 80%.]
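1. MEAN
$$\bar{X} = \frac{\sum X_i}{n} = \frac{X_1 + X_2 + \dots + X_n}{n}$$
EX: 4 6 7 8 9 10 11 11 11 12 13
$$\bar{X} = \frac{102}{11} = 9.27$$
$$\bar{X} = \frac{\sum f_i X_i}{n} = \frac{f_1 X_1 + f_2 X_2 + \dots + f_n X_n}{f_1 + f_2 + \dots + f_n}, \quad \text{where } n = \sum f_i$$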
g = 46 ; ∑ƒd = –56 ; n = 50 ; i = 10
X̄ = g + [∑ƒd / n] (i) = 46 + [–56 / 50] (10) = 46 – 11.2 = 34.8
Note: Compare answer with mean calculated as discrete data
in Table 8.12
Assumed average (shortcut) method & step deviation
method
Table: Calculation of the mean (X̄) from a frequency distribution. Data represent weights of
265 male freshman students at the University of Washington
Class interval (weight)     ƒ      d      ƒd
 90 -  99                   1     -5      -5
100 - 109                   1     -4      -4
110 - 119                   9     -3     -27
120 - 129                  30     -2     -60
130 - 139                  42     -1     -42
140 - 149                  66      0       0
150 - 159                  47      1      47
160 - 169                  39      2      78
170 - 179                  15      3      45
180 - 189                  11      4      44
190 - 199                   1      5       5
200 - 209                   3      6      18
N = 265     ∑ƒd = 237 - 138 = 99
$$\bar{X} = g + \frac{\sum fd}{N}(i) = 145 + \frac{99}{265}(10) = 145 + (0.3736)(10) = 145 + 3.74 = 148.74$$
$$\bar{X} = A + \frac{\sum f_i (A_i - A)}{\sum f_i}$$
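The step deviation calculation can be checked with a short Python sketch. The frequencies, deviations, assumed average g = 145 and class width i = 10 are taken from the table; the function name is only illustrative.

```python
def step_deviation_mean(freqs, deviations, g, i):
    """Mean from a frequency distribution: g + (sum of f*d / N) * i."""
    n = sum(freqs)
    sum_fd = sum(f * d for f, d in zip(freqs, deviations))
    return g + (sum_fd / n) * i

freqs = [1, 1, 9, 30, 42, 66, 47, 39, 15, 11, 1, 3]  # column f of the table
deviations = list(range(-5, 7))                      # column d: -5, -4, ..., 6
mean_weight = step_deviation_mean(freqs, deviations, g=145, i=10)
print(round(mean_weight, 2))  # 148.74
```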
Univariate Measures: A. Central Tendency contd..
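WEIGHTED MEAN
$$\bar{X}_w = \frac{\sum W_i X_i}{\sum W_i}$$
$$Z = \frac{N\bar{X} + M\bar{Y}}{N + M}$$
MOVING AVERAGE
$$\bar{X} = A + \frac{\sum (X_i - A)}{n} \qquad \bar{X} = A + \frac{\sum f_i (X_i - A)}{\sum f_i}$$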
NOTE: Step deviation method takes common factor out to enable
simple working
EX: 11 7 13 4 11 9 6 11 10 12 8
Arranged in order: 4 6 7 8 9 10 11 11 11 12 13
Rank:              1 2 3 4 5 6  7  8  9  10 11
L = 21 ; Cf = 10 ; i = 10 ; F = 15
W = [n/2] – Cf = [50/2] – 10 = 15
M = L + (W/F)(i) = 21 + (15/15)(10) = 31
Note: Compare answer with median calculated as discrete data in
Table 8.12
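A short Python sketch of both median calculations: the raw-data example uses the standard library, and the grouped formula M = L + (W/F)(i) is written out with the values given above. The function name is only illustrative.

```python
import statistics

# Median of the raw (ungrouped) example: the middle value after sorting.
raw = [11, 7, 13, 4, 11, 9, 6, 11, 10, 12, 8]
raw_median = statistics.median(raw)

def grouped_median(L, cf, f, i, n):
    """M = L + (W / F) * i, where W = n/2 - Cf."""
    w = n / 2 - cf
    return L + (w / f) * i

median = grouped_median(L=21, cf=10, f=15, i=10, n=50)
print(raw_median, median)  # 10 31.0
```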
Median for grouped or interval data
TABLE: Calculation of the median (x). Data represent weights of 265 male
freshman students at the University of Washington
Example: 4 6 7 8 9 10 11 11 11 12 13
$$\delta_x = \frac{|4 - 9.27| + |6 - 9.27| + \dots + |13 - 9.27|}{11} = \frac{24.73}{11} = 2.25$$
Coefficient of mean deviation: Mean deviation divided by the average. It is a
relative measure of dispersion and is comparable to the same measure for other series,
i.e., Coeff. of MD = δx / x̄ (Ex: 2.25/9.27 = 0.24). M.D. & its coefficient are used to
judge variability and are better measures than the range
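The mean deviation example can be reproduced with a few lines of Python; variable names are only illustrative.

```python
data = [4, 6, 7, 8, 9, 10, 11, 11, 11, 12, 13]

mean = sum(data) / len(data)
# Mean deviation: average of the absolute deviations from the mean.
mean_deviation = sum(abs(x - mean) for x in data) / len(data)
# Coefficient of mean deviation: mean deviation relative to the mean.
coeff_md = mean_deviation / mean

print(round(mean, 2), round(mean_deviation, 2), round(coeff_md, 2))  # 9.27 2.25 0.24
```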
Univariate Measures: B. Dispersion
3. Standard Deviation: The positive square root of the mean of the
squared deviations from the mean
$$\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}} \qquad \text{For grouped data: } \sigma = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i}}$$
Example: 4 6 7 8 9 10 11 11 11 12 13
$$\sigma = \sqrt{\frac{(4 - 9.27)^2 + (6 - 9.27)^2 + \dots + (13 - 9.27)^2}{11}} = 2.63$$
Coefficient of S D is S D divided by mean.
Example: 2.63 / 9.27 = 0.28
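A minimal Python sketch of the same calculation, using the population form of the standard deviation (divisor n) as in the formula above. `statistics.pstdev` is the standard-library function for that form.

```python
import statistics

data = [4, 6, 7, 8, 9, 10, 11, 11, 11, 12, 13]

mean = statistics.mean(data)
sd = statistics.pstdev(data)  # population SD: divides by n, not n - 1
coeff_sd = sd / mean          # coefficient of standard deviation

print(round(sd, 2), round(coeff_sd, 2))
```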
CLASSIFICATION X σ CV
The standard table of Z scores gives the areas under the curve between
the standardised mean zero and the points to the right of the mean for all
points that are at a distance from the mean in multiples of 0.01σ. It
should be noted that only the areas are to be subtracted or added. Do
not add or subtract the Z scores and then find the area for the resulting
value.
Z-score or standardised normal deviation …contd.
The Standardised normal often used is obtained by assuming mean as
zero (µ = 0) and SD as one (σ = 1). Then,
x scale µ-3σ µ -2σ µ-σ µ µ+σ µ+2σ µ +3σ
z scale -3 -2 -1 0 +1 +2 +3
z = (xi - µ) / σ
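A small Python sketch of the z-score and of the area a Z table would give between the mean and z. The values of µ, σ and x are hypothetical; `NormalDist` is from the Python standard library (3.8 or later).

```python
from statistics import NormalDist

mu, sigma = 50, 10  # hypothetical mean and standard deviation
x = 65              # hypothetical observation

z = (x - mu) / sigma
# Area under the standard normal curve between the mean (z = 0) and z,
# i.e. the quantity tabulated in a standard Z table.
area_mean_to_z = NormalDist().cdf(z) - 0.5

print(z, round(area_mean_to_z, 4))  # 1.5 0.4332
```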
4 X 4 Contingency Table
                     Attribute A
                A1       A2       A3       A4       Total
Attribute B
B1              (A1B1)   (A2B1)   (A3B1)   (A4B1)   (B1)
B2              (A1B2)   (A2B2)   (A3B2)   (A4B2)   (B2)
B3              (A1B3)   (A2B3)   (A3B3)   (A4B3)   (B3)
Reduced to 2x2 Table
                     Attribute A
                A        a        Total
Attribute B
B               (AB)     (aB)     (B)
b               (Ab)     (ab)     (b)
Homework: Given below are the mean scores on a 5-point scale for the
nature & type of information required by a group of physicists and
another group of mechanical engineers. Find the correlation of their
rankings. (Carry out a t-test at the 5% significance level.)
Physicists Mech. Engrs.
A. State of the art 2.60 1.17
B. Theoretical background 2.98 2.71
C. Experimental results 2.67 2.34
D. Methods, processes & procedures 2.62 2.07
E. Product, material & equipment information 2.45 2.23
F. Computer programs & model building info. 2.00 0.85
G. Standard & patent spec. 0.93 2.15
H. Physical, technical & design data 3.05 2.65
I. S & T news 2.29 2.53
J. General information 1.21 0.92
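$$r = \frac{\sum X_i Y_i - n\bar{X}\bar{Y}}{\sqrt{\sum X_i^2 - n\bar{X}^2}\,\sqrt{\sum Y_i^2 - n\bar{Y}^2}}$$
Using deviations from assumed averages Ax and Ay:
$$r = \frac{\dfrac{\sum d_{xi} d_{yi}}{n} - \dfrac{\sum d_{xi}}{n}\cdot\dfrac{\sum d_{yi}}{n}}{\sqrt{\dfrac{\sum d_{xi}^2}{n} - \left(\dfrac{\sum d_{xi}}{n}\right)^2}\,\sqrt{\dfrac{\sum d_{yi}^2}{n} - \left(\dfrac{\sum d_{yi}}{n}\right)^2}}$$
where ∑ dxi = ∑ (Xi – Ax), ∑ dyi = ∑ (Yi – Ay), ∑ dxi² = ∑ (Xi – Ax)², ∑ dyi² = ∑ (Yi – Ay)²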
X       Y       XY      X²      Y²
1       2       2       1       4
2       4       8       4       16
4       5       20      16      25
5       7       35      25      49
5       8       40      25      64
Total   17      26      105     71      158
$$r = \frac{105 - 5 \times 3.4 \times 5.2}{\sqrt{71 - 5(3.4)^2}\,\sqrt{158 - 5(5.2)^2}} = \frac{105 - 88.4}{3.63 \times 4.77} = \frac{16.6}{17.315} = +0.96$$
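The worked example can be checked with a short Python sketch that follows the same sums-of-products formula; the function name is only illustrative.

```python
from math import sqrt

def pearson_r(xs, ys):
    """r = (sum XY - n*mean_x*mean_y) / (sqrt(sum X^2 - n*mean_x^2) * sqrt(sum Y^2 - n*mean_y^2))."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = sum_xy - n * mean_x * mean_y
    denominator = sqrt(sum_x2 - n * mean_x ** 2) * sqrt(sum_y2 - n * mean_y ** 2)
    return numerator / denominator

X = [1, 2, 4, 5, 5]
Y = [2, 4, 5, 7, 8]
r = pearson_r(X, Y)
print(round(r, 2))  # 0.96
```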
[Figure: Diagram relating aggression in the 3rd grade to aggression in the 13th grade, with coefficients 0.01, 0.21, -0.05, 0.31 and 0.38.]
[Figure: Three scatter diagrams of Y against X: NO RELATIONSHIP, ACTUAL VALUES, BEST FIT.]
THE LEAST SQUARES METHOD PROVIDES TWO NORMAL EQUATIONS TO DETERMINE CONSTANTS a AND b
$$\sum Y = na + b\sum X$$
$$\sum XY = a\sum X + b\sum X^2$$
THE BASIC EQUATION IS $Y = a + bX + E$
$$TSS = \sum (Y - \bar{Y})^2 \qquad RSS = \sum (\hat{Y} - \bar{Y})^2 \qquad ESS = \sum (Y - \hat{Y})^2$$
$$TSS = RSS + ESS$$
$$F = \frac{RSS / 1}{ESS / (n - 2)}$$
EXAMPLE: Given below is the estimated use of a library (Y) for the corresponding expenditure on promotion and user
orientation (X). Fit the best regression line. Estimate (i.e., predict) the use for an expenditure of Rs. 8000. If the library would
like to reach a level of use of 70,000, what should be the expenditure on promotion and user orientation? …contd.
3. Cause & Effect Relationships : i. Regression Analysis contd..
X: expenditure (in thousands of Rs.); Y: estimated use (in ten thousands)
X       Y (estimate)   XY      X²      Ŷ (expected)   Error
5       4              20      25      4.02           -0.02
6       3              18      36      4.32           -1.32
1       2              2       1       2.82           -0.82
4       6              24      16      3.72           +2.28
2       3              6       4       3.12           -0.12
Total   18      18     70      82      18             0
        (∑X)    (∑Y)   (∑XY)   (∑X²)
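A minimal Python sketch that solves the two normal equations for the data in the table and then answers the two questions of the example. It works with unrounded constants, so its fitted values may differ slightly from the expected values in the table. Function and variable names are only illustrative.

```python
def fit_line(xs, ys):
    """Solve the normal equations  sum(Y) = n*a + b*sum(X)  and
    sum(XY) = a*sum(X) + b*sum(X^2)  for the constants a and b."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b

X = [5, 6, 1, 4, 2]  # expenditure, in thousands of Rs.
Y = [4, 3, 2, 6, 3]  # estimated use, in ten thousands
a, b = fit_line(X, Y)

use_at_8 = a + b * 8         # predicted use (ten thousands) for Rs. 8000
spend_for_7 = (7 - a) / b    # expenditure (thousands of Rs.) for a use of 70,000
print(round(a, 2), round(b, 2), round(use_at_8, 2), round(spend_for_7, 2))
```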
PROBLEM OF MULTICOLLINEARITY
REGRESSION COEFFICIENTS b1 AND b2 BECOME LESS RELIABLE IF
THERE IS A HIGH DEGREE OF CORRELATION BETWEEN IND. VAR. X1
AND X2. THE COLLECTIVE EFFECT OF IND. VAR. X1 AND X2 IS GIVEN BY
THE COEFFICIENT OF MULTIPLE CORRELATION
$$R_{y.x_1x_2} = \sqrt{\frac{b_1\left(\sum Y_i X_{1i} - n\bar{Y}\bar{X}_1\right) + b_2\left(\sum Y_i X_{2i} - n\bar{Y}\bar{X}_2\right)}{\sum Y_i^2 - n\bar{Y}^2}}$$
OR
$$R_{y.x_1x_2} = \sqrt{\frac{b_1\sum x_{1i} y_i + b_2\sum x_{2i} y_i}{\sum y_i^2}}$$
where $x_{1i} = (X_{1i} - \bar{X}_1)$, $x_{2i} = (X_{2i} - \bar{X}_2)$, $y_i = (Y_i - \bar{Y})$
iii. Partial Correlation
(iii) PARTIAL CORRELATION measures, separately, the relationship between
two variables (i.e., the dependent and a particular independent variable) by holding all other
variables constant
$$r^2_{yx_1.x_2} = \frac{R^2_{y.x_1x_2} - r^2_{yx_2}}{1 - r^2_{yx_2}}$$
• An index number is a special type of average used to measure the level of a given
phenomenon as compared to the level of the same phenomenon at some standard date.
In other words, the figures are reduced to a common base (e.g., converting the series into a series of
index numbers) to study the changes in the effect of factors which are incapable of
being measured directly
• They are approximate indicators & give only a fair idea of changes.
• Index numbers prepared for one purpose cannot be used for other purposes, or for the same
purpose at other places. Chances of error also remain in them.
Examples:
1. Library use index = 1/100 of the no. of pages of xerox copies of reading material taken during a
year + 2 times no. of documents borrowed through ILL + 5 times no. of visits to library
during 3-month sample seat occupancy study + mean no. of documents borrowed during
the year (both circulation sample and collection sample)
2. Library interaction index = No. of documents suggested + no. of documents indented + no.
of documents reserved + 2 times no. of literature search services availed + no. of short
range ref. queries placed
4. Other Measures B. Time series analysis contd.
B. Time series analysis
Time series: Series of successive observations of a phenomenon over a
period of time
– When the independent variable is time in a cause and effect relationship of
the regression analysis type, it is time series analysis
– It helps to estimate/ predict the future
Components of time series
1. Secular or long term trend (T)
2. Short term oscillations : (i) Cyclical variations(C) (usually more than a
year) (ii) Seasonal variations(S) (usually within a year)
3. Irregular or erratic variations (I) Random fluctuations & completely
unpredictable like riots, natural calamities, etc.
Solution:
Trend : Upward, i.e., Increasing daily issues
Cyclic variation: Difference between the moving average (expected) and
corresponding actual figure of issues are markedly high on Saturday and
very low on Wednesday
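The moving average used as the "expected" figure can be sketched as follows. The daily issue figures are hypothetical and the 7-day window is an assumption chosen to match a weekly pattern; the function name is only illustrative.

```python
def moving_average(values, window):
    """Simple moving average: mean of each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

daily_issues = [10, 12, 14, 16, 18, 20, 22, 24]  # hypothetical daily issues
expected = moving_average(daily_issues, window=7)
print(expected)  # [16.0, 18.0]
```

Comparing each actual figure with the moving average centred on it shows which days run above or below the trend.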
Home work
Daily visitors to a
public library
Day Week 1 Week 2
Sun 900 800
Mon 400 500
Tue 500 300
Wed 600 300
Thu 300 400
Fri 700 600
Sat 1100 900
If the first quarter of 4th year records 6000 user visits estimate the
average quarterly visits for that year
2. FIND MEAN OF THE SAMPLE MEANS, I.E.,
$$\bar{\bar{X}} = \frac{\bar{X}_1 + \bar{X}_2 + \bar{X}_3}{k}, \quad k = 3$$
3. FIND SUM OF SQUARES FOR VARIANCE BETWEEN THE SAMPLES, I.E.,
$$SS_{BETWEEN} = n_1(\bar{X}_1 - \bar{\bar{X}})^2 + n_2(\bar{X}_2 - \bar{\bar{X}})^2 + n_3(\bar{X}_3 - \bar{\bar{X}})^2$$
8. F RATIO = MS BETWEEN / MS WITHIN
Note: Compare with the table value of F. If it is equal to or more
than the table value, the difference is significant and hence
1. the samples could not have come from the same
universe, or
2. the independent variable has a significant effect on the
dependent variable.
The larger the F ratio, the more definite the
conclusions
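A minimal Python sketch of the F ratio calculation. The intermediate steps not shown above are filled in with the usual definitions, which are assumptions here: SS within is the sum of squared deviations of each observation from its own sample mean, MS between = SS between / (k - 1), and MS within = SS within / (n - k). The samples are hypothetical and of equal size, so the mean of the sample means equals the overall mean.

```python
def one_way_anova_f(samples):
    """F ratio = MS between / MS within for k samples."""
    k = len(samples)
    n_total = sum(len(s) for s in samples)
    means = [sum(s) / len(s) for s in samples]
    grand_mean = sum(means) / k  # mean of the sample means (equal sample sizes assumed)
    ss_between = sum(len(s) * (m - grand_mean) ** 2 for s, m in zip(samples, means))
    ss_within = sum((x - m) ** 2 for s, m in zip(samples, means) for x in s)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

samples = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # hypothetical samples
f_ratio = one_way_anova_f(samples)
print(round(f_ratio, 2))  # 13.0
```

The computed value is then compared with the table value of F for (k - 1, n - k) degrees of freedom.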