Statistics For Architecture: 20BAR Research Methodology

Research Methodology Notes, B.Arch 2021
20BAR Research Methodology
November 2021 Paul Varghese
Version 21.00.1 +91.9567263770
pvarghese.ijk@gmail.com
Statistics for Architecture

Some Terms:
Population: the entire collection of events or subjects which is being studied.
Sample: observations from a population, of which the characteristics are to be studied.
Parameter: a numerical property of a population.
Statistic: a numerical characteristic of the sample -- estimates the corresponding population parameter.
Variable: property of an object or event that can take on different values.
Random sample: a sample chosen in a manner that ensures each and every element of the population has an equal chance of being selected.
Representative sample: a subset of the population that has the same characteristics as the population.
Dependent variable: a variable that is measured as the outcome; it is not under the experimenter’s control (and is part of the data).
Sample space: is the set of all possible results or outcomes of an experiment.
Significance: the measure of whether the results of research were due to chance -- the more statistical significance
assigned to an observation, the less likely the observation occurred by chance.
Validity (internal/external): the soundness of the subjects/groups chosen, which is fundamental to both the internal and the external integrity of the experiment.
Hypothesis: an assumption (theory, supposition) which has to be proved or disproved.
Statistics for Research

Statistical Concepts
Probability
Descriptive Statistics (Refer K&G: Chapter 08)

When the need is to describe a set of data, or to understand what the data is about, one employs descriptive
statistics. The basic measures and analyses used, which together are called descriptive statistics, give an idea of
the distribution of the observations in the data-set. This might, however, only be a superficial reading of the data
available.
Inferential Statistics:
It aims to obtain conclusions that affect future decision-making.
Measures of Central Tendency & Dispersion

Central tendency: a value (real or calculated) around which all the observations tend to cluster.
Normal distribution (assumptions of):

• many of the dependent variables are assumed to be normally distributed in the population, i.e., that the
sample distribution would closely resemble the population,
• if normally distributed, the techniques used allow one to make inferences about the population,
• the hypothetical set of samples would be approximately normal under a variety of circumstances,
• the tests all assume that the population sampled is normally distributed.
Faculty of Architecture & Planning, K A H E Page 1

Mode: the most frequently occurring value; is not affected by extreme values, not always
amenable to algebraic treatment. (A dataset might not have a mode, or might even have
multiple modes.)
Median: is the positional average; the middle value when arranged in order, used in the context of
qualitative phenomena. Not often used in sampling statistics.
Mean: the arithmetic average (the sum of the values divided by their number); the most commonly used measure of central tendency.
Weighted mean: obtained by multiplying each quantitative outcome by its weightage, summing the products, and dividing by the sum of the weights.
Geometric mean: of n numbers {x1, x2, x3, …, xn} is the nth root of their product.
Harmonic mean: calculated by dividing the number of terms by the sum of their reciprocals.
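The measures of central tendency above can be tried out on a small data-set. The sketch below uses Python's standard `statistics` module; the floor areas, scores and weights are hypothetical values chosen only for illustration.

```python
import statistics

# Hypothetical data: floor areas (in square metres) of seven studio units.
areas = [42, 45, 45, 48, 50, 55, 120]

mode_value = statistics.mode(areas)      # most frequently occurring value
median_value = statistics.median(areas)  # middle value of the ordered data
mean_value = statistics.mean(areas)      # sum of values / number of values

# Weighted mean: sum of (weight * value) divided by the sum of the weights.
scores = [60, 70, 80]   # hypothetical outcomes
weights = [1, 2, 1]     # hypothetical weightages
weighted_mean = sum(w * x for w, x in zip(weights, scores)) / sum(weights)

# Geometric mean: nth root of the product of n positive values.
geometric = statistics.geometric_mean([2, 8])

# Harmonic mean: number of terms divided by the sum of their reciprocals.
harmonic = statistics.harmonic_mean([40, 60])

print(mode_value, median_value, round(mean_value, 2))
print(weighted_mean, round(geometric, 4), round(harmonic, 4))
```

Note how the single extreme value (120) pulls the mean well above the median and the mode, which is why the mode and median are described as less affected by extreme values.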
Measures of Dispersion / Variability: (Refer K&G: Chapter 08)

Range: the difference between the extreme values of the data-set.
The Normal Distribution: graphically the normal distribution is the "bell-shaped" curve; the curve is
symmetrical about the middle peak; the ‘tails’ of the distribution approach, but never touch, the horizontal
axis; the mean is the position of the peak of the curve.
Outliers: Outliers are extreme, atypical values. They may be genuine values or the result of some mistake. If it is a
mistake this must be corrected or, if this is impossible, the data ignored.
Deviation (s):
Mean Deviation: the average of the absolute differences between the values of the items and some average of the series (such as the mean or the median).
Standard deviation (σ): a measure based on how far values are from the mean; defined as the square-root of the
average of the squares of deviations in a series obtained from the arithmetic average.
Variance (s²): the square of the standard deviation.
Also,
Coefficient of standard deviation = σ / x̄ (the standard deviation divided by the mean)
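As a worked illustration of the measures of dispersion defined above, the following sketch computes each of them for a small, hypothetical data-set. It uses the population forms (`pvariance`, `pstdev`), which match the definition of the standard deviation as the square root of the average squared deviation; the sample forms `statistics.variance` and `statistics.stdev` divide by n - 1 instead.

```python
import statistics

values = [4, 8, 6, 5, 7]  # hypothetical observations

# Range: difference between the extreme values.
data_range = max(values) - min(values)

mean_value = statistics.mean(values)

# Mean deviation: average absolute difference from the mean.
mean_deviation = sum(abs(x - mean_value) for x in values) / len(values)

# Variance and standard deviation (population forms, dividing by n).
variance = statistics.pvariance(values)
std_dev = statistics.pstdev(values)

# Coefficient of standard deviation: standard deviation relative to the mean.
coefficient = std_dev / mean_value

print(data_range, mean_value, mean_deviation, variance)
print(round(std_dev, 4), round(coefficient, 4))
```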

One standard deviation from the mean in either direction on the horizontal axis (the two shaded areas
closest to the centre axis on the graph, in red) accounts for about 68% of the
population/sample. Two standard deviations from the mean (the four areas closest to the centre, i.e.
all red and green areas) account for roughly 95% of the population/sample. And three standard deviations
(all shaded areas) account for about 99.7% of the population/sample.
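The three percentages quoted above can be checked numerically. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; the helper name `share_within` is an illustrative choice, not part of any library.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k):
    """Proportion of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

Because every normal distribution can be rescaled to the standard one (mean 0, standard deviation 1), the same proportions hold whatever the mean and standard deviation of the data.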
Degrees of freedom: the number of values in the final calculation of a statistic that are free to vary.
Skewness: degree of asymmetry of the distribution.
Kurtosis: a measure of how ‘peaked’ or ‘flat’ the data distribution is.
Confidence interval: a range of values, computed from the sample, that is expected to contain the population
parameter with a stated level of confidence (e.g. 95%); it quantifies the uncertainty in the measurement.
Correlation: an analytical technique used to show the relationship between pairs -- to know how strongly (or
weakly) the pairs are related to one another.
Correlation coefficient (r): A decimal number between –1.00 and +1.00 that indicates the degree to which two
quantitative variables are related; values near ±1.00 indicate a strong relationship, values near 0.00 a weak one.
Regression: it fits a line to a plot in such a way as to minimize the sum of the squares of the residuals.
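To make correlation and regression concrete, the sketch below computes the correlation coefficient r and the least-squares line by hand for a few paired observations. The data (room area against daily energy use) are hypothetical, and the variable names are illustrative.

```python
import math

# Hypothetical paired observations: room area (m^2) and daily energy use (kWh).
x = [10, 20, 30, 40, 50]
y = [12, 19, 33, 38, 53]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sums of products and squares of the deviations from the means.
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
syy = sum((b - mean_y) ** 2 for b in y)

# Correlation coefficient: always lies between -1 and +1.
r = sxy / math.sqrt(sxx * syy)

# Least-squares line y = intercept + slope * x, which minimises
# the sum of the squares of the residuals.
slope = sxy / sxx
intercept = mean_y - slope * mean_x
residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]

print(f"r = {r:.3f}, line: y = {intercept:.2f} + {slope:.2f} x")
```

A useful check on any least-squares fit is that the residuals sum to zero (up to rounding), which holds here.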
The Hypothesis & Testing it (Refer K&G: Chapter 10)

In statistics:
Research Hypothesis: the basic formal question / hypothesis that has to be tested.
Null Hypothesis (H0): the hypothesis that the difference between the two population means is zero, or ‘null’.
[μ1 – μ2 = 0]
The null hypothesis was formulated by R. A. Fisher as a method of contradiction: one can never
prove a hypothesis to be always true, but one can sometimes prove it false.
In Neyman and Pearson’s view, one either rejects or accepts the null hypothesis.
Type I Error: incorrectly rejecting a true null hypothesis; its probability is denoted α.
Type II Error: not rejecting a null hypothesis that is in fact false; its probability is denoted β.
Alternate Hypothesis (H1): the hypothesis that is contradictory to the null hypothesis (H0) [μ1 ≠ μ2]
H1: μ1 ≠ μ2 i.e., H1: μ1 > μ2 or H1: μ1 < μ2
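One way to see these ideas at work is a two-sided test of H0: μ1 – μ2 = 0 against H1: μ1 ≠ μ2. The sketch below is a simple z-test that assumes normal populations with known standard deviations; the function name, the data and the value σ = 2 are all hypothetical, and other tests (for example a t-test) would be used when the standard deviations are not known.

```python
import math
from statistics import NormalDist, mean

def two_sample_z_test(sample_1, sample_2, sigma_1, sigma_2, alpha=0.05):
    """Two-sided z-test of H0: mu1 - mu2 = 0, assuming known population SDs."""
    standard_error = math.sqrt(
        sigma_1 ** 2 / len(sample_1) + sigma_2 ** 2 / len(sample_2)
    )
    z = (mean(sample_1) - mean(sample_2)) / standard_error
    # Probability of a |z| at least this large if H0 were true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # alpha is the accepted probability of a Type I error.
    return z, p_value, p_value < alpha

# Hypothetical indoor temperatures (deg C) in two groups of rooms.
group_1 = [22, 25, 24, 27, 26, 23, 25, 28]
group_2 = [20, 21, 23, 19, 22, 21, 20, 22]

z, p_value, reject_h0 = two_sample_z_test(group_1, group_2, sigma_1=2, sigma_2=2)
print(f"z = {z:.2f}, p = {p_value:.5f}, reject H0: {reject_h0}")
```

Choosing a smaller α lowers the risk of a Type I error but, for the same data, raises the risk of a Type II error.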

Sampling (Refer K&G: Chapter/sections 04.5, 09.2)
• can save time & money: less expensive than a census, with comparable results at faster speed,
• good accuracy when conducted by experts,
• useful when the population contains infinitely many members,
• allows sampling errors to be estimated, giving information concerning the characteristics of the population.
Sampling error = frame error + chance error + response error

Total error = measurement error + sampling error
Non-sampling errors are difficult to estimate.
The typical Sampling Procedure, with the potential errors at each step, is:
1 Define the population of interest
Potential errors:
- inappropriate population for problem
- undefined or vaguely defined population, causing inconsistencies in selecting the sample and imprecise conclusions
2 If possible, obtain a list of all members of the population (called a sampling frame)
Potential errors:
- sampling frame may not match the population by listing members who do not belong (over-representation) or missing some members (under-representation)
3 Choose a sampling method
Potential errors:
- getting a biased, unrepresentative sample
- using non-random sampling method but analysing the data as if the sample was random
4 Determine sample size
Potential errors:
- too small a sample for the required accuracy
5 Obtain the sample and collect the data
Potential errors:
- non-response error - when population members not obtainable or not responding are not representative of the population as a whole
- asking ambiguous, biased or other poor questions
- mistakes - mishearing an answer, miskeying data into the computer, dishonest researchers etc.
Sampling Methods (Refer K&G: Chapter/section 04.5, 09.2)
Random sampling methods (each member of the target population has an equal chance of being selected):
• simple random sampling - choose at random from the frame.
o e.g. from a complete list of all 2000 employees, numbered 1-2000, use a random number generator to
generate 50 random numbers in the 1-2000 range and question those 50 employees
• stratified sampling - split population into homogeneous segments, or strata, and choose at random
proportionate numbers from each stratum.
o e.g. if the company is known to have 60% female employees and 40% male, a sample of 50 employees would
select 30 women at random and 20 men at random.
• cluster, or area, sampling - split population into heterogeneous groups, or clusters, choose a cluster at
random and then sample within it
o e.g. select one of the company’s sales regions East/West/North/South.
• systematic sampling - select every nth member from the frame.
o e.g. select employees numbered 40, 80, 120, etc.
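The four random sampling methods above can be sketched with Python's standard `random` module, following the 2000-employee example. The way the frame is split into women/men and into four equal sales regions is a hypothetical layout chosen only to keep the sketch short.

```python
import random

rng = random.Random(2021)  # fixed seed so the sketch is repeatable

# Hypothetical frame: employees numbered 1-2000; the first 1200 (60%) are
# taken to be women and the remaining 800 (40%) men.
frame = list(range(1, 2001))
women = frame[:1200]
men = frame[1200:]

# Simple random sampling: 50 employees chosen at random from the whole frame.
simple_sample = rng.sample(frame, 50)

# Stratified sampling: proportionate random draws from each stratum.
stratified_sample = rng.sample(women, 30) + rng.sample(men, 20)

# Cluster sampling: choose one sales region at random, then sample within it.
regions = {
    "East": frame[0:500],
    "West": frame[500:1000],
    "North": frame[1000:1500],
    "South": frame[1500:2000],
}
chosen_region = rng.choice(list(regions))
cluster_sample = rng.sample(regions[chosen_region], 50)

# Systematic sampling: every 40th employee (40, 80, 120, ...).
systematic_sample = frame[39::40]

print(len(simple_sample), len(stratified_sample), chosen_region)
print(systematic_sample[:3])
```

All four samples contain 50 employees, but only the stratified sample is guaranteed to reproduce the 60/40 split of the population.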
Non-random sampling methods:
• convenience sampling - select on basis of convenience/cost
• judgement sampling - researcher chooses what he/she judges to be a representative sample
• quota sampling - as judgement, but researcher must fill certain quotas e.g. 25 men aged 30 to 40
• purposive - sample chosen for a specific purpose e.g. “key informants”, thought to be most interested and/or
knowledgeable.
Non-random sampling can be used by experienced researchers to obtain good results more cheaply than random
sampling. However, inferential statistical theory is based on probability theory and is only valid when the data has
been collected with every member of the population having an equal chance of being selected i.e. by random
sampling. If you do not use random sampling any inferences you make from your data do not have any
mathematical basis.
Central Limit Theorem (Refer K&G: Chapter/section 09.6)
When n is small, the shape of the sampling distribution of the mean will depend largely on the shape of the parent
population, but as n gets large (n > 30), the shape of the sampling distribution will become more and more like a
normal distribution, irrespective of the shape of the parent population. The theorem that explains this relationship
between the shape of the population distribution and the sampling distribution of the mean is called the Central
Limit Theorem.
Or,
Given a population with mean μ and variance σ², the sampling distribution of the mean will have a mean equal to
μ (μx̄ = μ), a variance (σx̄²) equal to σ²/N, and a standard deviation (σx̄) equal to σ/√N.
The distribution will approach the normal distribution as N, the sample size, increases. Refer fig. below.
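The theorem can be illustrated by simulation. The sketch below draws many samples of size N = 36 from a uniform population (which is flat, not bell-shaped) and compares the mean and standard deviation of the sample means with μ and σ/√N. The population, the seed and the number of repetitions are arbitrary choices made for illustration.

```python
import math
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

N = 36               # size of each sample (n > 30)
num_samples = 5000   # number of repeated samples

# Parent population: uniform on [0, 1], with known mean and standard deviation.
mu = 0.5
sigma = math.sqrt(1 / 12)

# Draw many samples and record the mean of each one.
sample_means = [
    statistics.fmean([rng.random() for _ in range(N)])
    for _ in range(num_samples)
]

mean_of_means = statistics.fmean(sample_means)
sd_of_means = statistics.pstdev(sample_means)
predicted_sd = sigma / math.sqrt(N)

print(f"mean of sample means: {mean_of_means:.4f} (population mean {mu})")
print(f"sd of sample means:   {sd_of_means:.4f} (sigma / sqrt(N) = {predicted_sd:.4f})")
```

Plotting a histogram of `sample_means` would show the bell shape emerging even though the parent population is flat.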
Inferential Statistics
Going beyond descriptive statistics, one can begin to tell more things about the data:
• one samples from the total population, about which one needs to know more,
• one needs to infer something about the characteristics of the population from what one knows
of the characteristics of the sample.
Parametric Methods
In parametric methods/tests, one assumes that the distribution is normal.
Assumptions include that the population is normal, the samples are independent, and the standard deviation is known.
Non-parametric Methods (Refer K&G: Chapter 13.7)

When one can apply a test without a model, it is called a distribution-free or a non-parametric test. Hence, in
non-parametric methods/tests, there are no assumptions of normality in the distribution. One does not
assume that a particular distribution is applicable, or that a value is attached to a parameter of the
population.
The central point of a data-set is usually the arithmetic mean, when weightage is given to the magnitude
of the observation, while the location-wise central point is given by the median, which is what is used in
sign-tests and other non-parametric tests.
Non-parametric methods:
• Do not suppose any particular distribution,
• Quick and easy to use, no laborious computations since observations are placed in rank order, or
sometimes just signs (+/-),
• Not so efficient -- do not use all the information available, but use groupings or rankings, with
resulting loss in efficiency,
• Can be satisfactorily used with not-so-accurate data,
• Non-parametric tests can be used for ordinal or nominal scale data, but parametric tests cannot.
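The sign test mentioned above is a simple example of a non-parametric method, since it uses only the signs (+/-) of paired differences. The following is a minimal sketch of a two-sided sign test; the function name and the before/after comfort ratings are hypothetical.

```python
import math

def sign_test(before, after):
    """Two-sided sign test of H0: the median difference is zero."""
    # Keep only the non-zero differences; ties carry no sign and are dropped.
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    plus = sum(1 for d in diffs if d > 0)
    k = min(plus, n - plus)
    # Under H0 the number of plus signs follows a Binomial(n, 0.5) distribution.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    p_value = min(1.0, 2 * tail)
    return plus, n, p_value

# Hypothetical comfort ratings (1-5 scale) for ten rooms, before and after a retrofit.
before = [3, 4, 2, 5, 3, 4, 2, 3, 4, 3]
after = [4, 5, 4, 5, 4, 5, 3, 4, 3, 4]

plus, n, p_value = sign_test(before, after)
print(f"{plus} plus signs out of {n} non-tied pairs, p = {p_value:.4f}")
```

The test ignores how large each change is, which is the loss of efficiency noted in the list above, but it needs only ordinal data and no assumption of normality.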
