
School of Architecture, Science and Technology,

Yashwantrao Chavan Maharashtra Open University

Programme V136: M.Sc. in Environmental Science {2021 Pattern}, Semester 04

EVS042: Statistical Approaches and Modelling in Environmental Sciences

Email: director.ast@ycmou.ac.in

Website: www.ycmou.ac.in

Phone: +91-253-2231473

AST, YCMOU, Nashik - 422 222, MH, India

Yashwantrao Chavan Maharashtra Open University
EVS042: Statistical Approaches and Modelling in Environmental Sciences

Brief Contents
Vice Chancellor's Message ...................................................... 5
Foreword by the Director ....................................................... 6
Credit 01 ...................................................................... 7
Credit 01 - Unit 01-01: Basics of Statistics ................................... 8
Credit 01 - Unit 01-02: Statistical Methods ................................... 38
Credit 01 - Unit 01-03: Dispersion ............................................ 44
Credit 01 - Unit 01-04: Probability ........................................... 61
Credit 02 ..................................................................... 99
Credit 02 - Unit 02-01: Correlation .......................................... 100
Credit 02 - Unit 02-02: Regression ........................................... 112
Credit 02 - Unit 02-03: Testing of Hypothesis ................................ 126
Credit 02 - Unit 02-04: Bioassay ............................................. 133
Credit 03 .................................................................... 144
Credit 03 - Unit 03-01: Introduction - Environmental Modelling ............... 145
Credit 03 - Unit 03-02: Models in Environmental Science Emphasizing - I ...... 156
Credit 03 - Unit 03-03: Models in Environmental Science Emphasizing - II ..... 161
Credit 03 - Unit 03-04: Models in Environmental Science Emphasizing - III .... 170
Credit 04 .................................................................... 178
Credit 04 - Unit 04-01: Elementary Concepts .................................. 179
Credit 04 - Unit 04-02: The Building Blocks .................................. 186
Credit 04 - Unit 04-03: Environmental Modelling .............................. 197
Credit 04 - Unit 04-04: Air Quality Modelling ................................ 209
Feedback Sheet for the Student ............................................... 231



EVS042 - Statistical Approaches and Modelling in Environmental Sciences
Yashwantrao Chavan Maharashtra Open University
Vice-Chancellor: Prof. Dr. P. G. Patil
School of Architecture, Science and Technology
Director of the School: Dr. Sunanda More
Programme Advisory Committee
Dr. Sunanda More, Director & Associate Professor, School of Architecture, Science & Technology, YCMOU, Nashik

Dr. Manoj Killedar, Associate Professor, School of Architecture, Science & Technology, YCMOU, Nashik

Dr. Chetana Kamlaskar, Associate Professor, School of Architecture, Science & Technology, YCMOU, Nashik

Prof. Jaydeep Nikam, Director & Professor, School of Health Science, YCMOU, Nashik

Dr. Pondhe G. M., Associate Professor and Head, Department of Environmental Science, Padmshree Vikhe Patil College of Arts, Science and Commerce, Pravaranagar, A/P Loni - 413713

Prof. Dr. Sharad Ratan Khandelwal, Professor & Principal, H.A.L. College of Science & Commerce, Ozar Township, Tal. Niphad, Dist. Nashik - 422207

Dr. Pravin Nalawade, Assistant Professor and Head, Department of Environmental Science, KTHM College, Nashik - 2

Dr. Anita V. Handore, Incharge, Research and Development Dept., Sigma Wineries Pvt. Ltd., Nashik

Development Team
Instructional Technology Editor: Dr. Sunanda More, Director (I/c) & Associate Professor, School of Architecture, Science & Technology, YCMOU, Nashik

Course Coordinator: Mr. Manish S. Shingare, Academic Coordinator, School of Architecture, Science & Technology, YCMOU, Nashik

Book Writer: Mr. Kailas Ahire, Assistant Professor, Dept. of Environmental Science, K.R.T. Arts, B.H. Commerce and A.M. Science (KTHM) College, Nashik

Book Editor: Dr. Yogeshwar R. Baste, Assistant Professor, Karmaveer Shantarambapu Wavare Arts, Science and Commerce College, Uttamnagar CIDCO, Nashik

This work by YCMOU is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

• Book Publication: 23-March-2023; Publication No:
• Publisher: Mr. B. P. Patil, I/C Registrar, YCMOU, Nashik - 422 222, MS
• ISBN:



VICE CHANCELLOR'S MESSAGE

Dear Students,
Greetings!!!

I offer a cordial welcome to all of you to the Master's degree programme of Yashwantrao
Chavan Maharashtra Open University.

As a postgraduate student, you need the autonomy to learn and to gather information and
knowledge about the different dimensions of Environmental Science, and at the same time you
need to develop intellectually so that you can apply that knowledge wisely. The process of
learning includes careful thinking, understanding the important points, describing them on the
basis of experience and observation, and explaining them to others in speech or writing. The
science of education today accepts the principle that excellence and knowledge can be achieved
in this way.

The syllabus of this course has been structured in this book in such a way as to give you the
autonomy to study easily without stirring from home. During the counselling sessions scheduled
at your respective study centre, all your doubts about the course will be clarified and you will
get guidance from qualified and experienced counsellors/professors. This guidance will not only
be based on lectures, but will also include various techniques such as question-and-answer
sessions and doubt clarification. We expect your active participation in the contact sessions at
the study centre. Our emphasis is on 'self-study'. A student who learns how to study becomes
independent in learning throughout life. This course book has been written with the objective of
helping in self-study and giving you the autonomy to learn at your convenience.

During this academic year, you have to submit assignments and complete laboratory activities,
field visits and project work wherever required. You have to opt for a specialization as per the
programme structure. You will gain experience and enjoyment in personally carrying out these
activities, which will enable you to assess your own progress and thereby achieve a larger
educational objective.

We wish that you will enjoy the courses of Yashwantrao Chavan Maharashtra Open
University, emerge successful, and very soon become a knowledgeable and honorable Master's
degree holder of this university.

I congratulate the Development Team for the development of this excellent, high-quality
Self-Learning Material (SLM) for the students. I hope and believe that this SLM will be
immensely useful for all students of this program.
Best Wishes!

- Prof. Dr. P. G. Patil


Vice-Chancellor, YCMOU



FOREWORD BY THE DIRECTOR

Dear Students,
Greetings!!!
This book aims at acquainting students with the conceptual and applied fundamentals
of Environmental Science required at the degree level.
The book has been specially designed for science students. It provides comprehensive
coverage of environmental concepts and their application in practical life. The book
contains numerous examples to build understanding and skills.
The book is written in a self-instructional format. Each chapter has a clearly
articulated structure to make the contents not only easy to understand but also
interesting.
Each chapter begins with learning objectives, which are stated using action verbs as
per Bloom's Taxonomy. Each unit starts with an introduction to arouse the learner's
curiosity about the content or topic. Thereafter the unit explains the concepts,
supported by tables, figures, exhibits and solved illustrations wherever necessary for
better effectiveness and understanding.
This book is written in simple and lucid language, using a conversational style and short
sentences. Topics in each unit progress from simple to complex in a logical
sequence. The book is accessible even to students who find the subject difficult, and it
covers the syllabus of the course.
Exercises given in each chapter include MCQs, conceptual questions and practical
questions, so as to create a ladder in the minds of students to grasp each and every
aspect of a particular concept.
I thank the students, who have been a constant motivation for us. I am grateful to the
writers, editors and the School faculty associated with the development of this SLM
for the Programme.
Best Wishes to all of you!!!

- Dr. Sunanda More


Director,
School of Architecture, Science and Technology,
YCMOU



CREDIT 01



UNIT 01-01: BASICS OF STATISTICS: SAMPLING, TYPES OF DATA, METHODS OF COLLECTION AND RECORDING

LEARNING OBJECTIVES
After successful completion of this unit, you will be able to understand:

• Basics of statistics

• Basics of statistical models, dispersion and probability

Statistics is the discipline that concerns the collection, organization, analysis,


interpretation, and presentation of data. In applying statistics to a scientific,
industrial, or social problem, it is conventional to begin with a statistical
population or a statistical model to be studied. Populations can be diverse groups
of people or objects such as "all people living in a country" or "every atom
composing a crystal". Statistics deals with every aspect of data, including the
planning of data collection in terms of the design of surveys and experiments.

When census data cannot be collected, statisticians collect data by developing


specific experiment designs and survey samples. Representative sampling assures
that inferences and conclusions can reasonably extend from the sample to the
population as a whole. An experimental study involves taking measurements of
the system under study, manipulating the system, and then taking additional
measurements using the same procedure to determine if the manipulation has
modified the values of the measurements. In contrast, an observational study
does not involve experimental manipulation.

Two main statistical methods are used in data analysis: descriptive statistics,
which summarize data from a sample using indexes such as the mean or standard
deviation, and inferential statistics, which draw conclusions from data that are
subject to random variation (e.g., observational errors, sampling variation).
Descriptive statistics are most often concerned with two sets of properties of a
distribution (sample or population): central tendency (or location) seeks to
characterize the distribution's central or typical value, while dispersion (or
variability) characterizes the extent to which members of the distribution depart
from its center and each other. Inferences on mathematical statistics are made
under the framework of probability theory, which deals with the analysis of
random phenomena.
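To make the distinction concrete, the short Python sketch below computes a mean (a measure of central tendency) and a standard deviation (a measure of dispersion) for a small, hypothetical sample of river pH readings; the data values and variable names are invented purely for illustration.

import statistics

# Hypothetical sample of river pH measurements (illustrative values only)
ph_sample = [6.8, 7.1, 7.4, 6.9, 7.2, 7.0, 7.3]

mean_ph = statistics.mean(ph_sample)   # central tendency (location)
sd_ph = statistics.stdev(ph_sample)    # dispersion (variability)

print(f"mean = {mean_ph:.2f}, standard deviation = {sd_ph:.2f}")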
A standard statistical procedure involves the collection of data leading to test of
the relationship between two statistical data sets, or a data set and synthetic data
drawn from an idealized model. A hypothesis is proposed for the statistical
relationship between the two data sets, and this is compared as an alternative to
an idealized null hypothesis of no relationship between two data sets. Rejecting
or disproving the null hypothesis is done using statistical tests that quantify the
sense in which the null can be proven false, given the data that are used in the
test. Working from a null hypothesis, two basic forms of error are recognized:
Type I errors (null hypothesis is falsely rejected giving a "false positive") and
Type II errors (null hypothesis fails to be rejected and an actual relationship
between populations is missed giving a "false negative"). Multiple problems
have come to be associated with this framework, ranging from obtaining a
sufficient sample size to specifying an adequate null hypothesis.
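As a simple illustration of this framework, the sketch below (an assumed example, not taken from this book) uses the SciPy library to test the null hypothesis of "no difference in mean pollutant concentration between two sites"; the data, the 0.05 significance level and the variable names are invented for the example.

from scipy import stats

# Hypothetical pollutant concentrations measured at two sites
site_a = [4.1, 3.9, 4.4, 4.0, 4.2]
site_b = [4.8, 4.6, 5.0, 4.7, 4.9]

t_stat, p_value = stats.ttest_ind(site_a, site_b)  # two-sample t-test

alpha = 0.05  # chosen Type I error rate
if p_value < alpha:
    print("Reject the null hypothesis (risk of a Type I error if it is actually true).")
else:
    print("Fail to reject the null hypothesis (risk of a Type II error if it is actually false).")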

Measurement processes that generate statistical data are also subject to error.
Many of these errors are classified as random (noise) or systematic (bias), but
other types of errors (e.g., blunder, such as when an analyst reports incorrect
units) can also occur. The presence of missing data or censoring may result in
biased estimates and specific techniques have been developed to address these
problems.Sampling (statistics)

In statistics, quality assurance, and survey methodology, sampling is the


selection of a subset (a statistical sample) of individuals from within a statistical
population to estimate characteristics of the whole population. Statisticians
attempt to collect samples that are representative of the population in question.
Sampling has lower costs and faster data collection than measuring the entire
population and can provide insights in cases where it is infeasible to measure an
entire population.

Each observation measures one or more properties (such as weight, location,


colour) of independent objects or individuals. In survey sampling, weights can be
applied to the data to adjust for the sample design, particularly in stratified
sampling. Results from probability theory and statistical theory are employed to
guide the practice. In business and medical research, sampling is widely used for
gathering information about a population. Acceptance sampling is used to
determine if a production lot of material meets the governing specifications.



Sampling methods

Within any type of sampling frame, a variety of sampling


methods can be employed, individually or in combination. Factors commonly
influencing the choice between these designs include:

• Nature and quality of the frame

• Availability of auxiliary information about units on the frame

• Accuracy requirements, and the need to measure accuracy

• Whether detailed analysis of the sample is expected

• Cost/operational concerns

Simple Random Sampling

In a simple random sample (SRS) of a given size, all subsets of a sampling frame
have an equal probability of being selected. Each element of the frame thus has
an equal probability of selection: the frame is not subdivided or partitioned.
Furthermore, any given pair of elements has the same chance of selection as any
other such pair (and similarly for triples, and so on). This minimizes bias and
simplifies analysis of results. In particular, the variance between individual
results within the sample is a good indicator of variance in the overall
population, which makes it relatively easy to estimate the accuracy of results.
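A simple random sample can be drawn with one call in Python; the sketch below is illustrative only, using a hypothetical frame of 1,000 numbered units and a sample size of 50.

import random

frame = list(range(1, 1001))   # hypothetical sampling frame of 1,000 unit labels
n = 50                         # desired sample size

random.seed(42)                # fixed seed so the illustration is reproducible
srs = random.sample(frame, n)  # sampling without replacement; every subset of size n is equally likely

print(sorted(srs)[:10])        # first few selected unit labels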

Simple random sampling can be vulnerable to sampling error because the


randomness of the selection may result in a sample that doesn't reflect the
makeup of the population. For instance, a simple random sample of ten people
from a given country will on average produce five men and five women, but any
given trial is likely to overrepresent one sex and underrepresent the other.
Systematic and stratified techniques attempt to overcome this problem by "using
information about the population" to choose a more "representative" sample.

Also, simple random sampling can be cumbersome and tedious when sampling
from a large target population. In some cases, investigators are interested in
research questions specific to subgroups of the population. For example,
researchers might be interested in examining whether cognitive ability as a
predictor of job performance is equally applicable across racial groups. Simple
random sampling cannot accommodate the needs of researchers in this situation,



because it does not provide subsamples of the population, and other sampling
strategies, such as stratified sampling, can be used instead.

Systematic sampling

Systematic sampling (also known as interval sampling) relies on arranging the


study population according to some ordering scheme and then selecting elements
at regular intervals through that ordered list. Systematic sampling involves a
random start and then proceeds with the selection of every kth element from then
onwards. In this case, k = (population size / sample size). It is important that the
starting point is not automatically the first in the list, but is instead randomly
chosen from within the first to the kth element in the list. A simple example
would be to select every 10th name from the telephone directory (an 'every 10th'
sample, also referred to as 'sampling with a skip of 10').
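A minimal sketch of systematic sampling is shown below (an assumed example): a random start is chosen within the first k positions of a hypothetical ordered frame, and every kth element is then taken.

import random

population = list(range(1, 1001))    # hypothetical ordered frame of 1,000 units
sample_size = 100
k = len(population) // sample_size   # skip interval, k = population size / sample size

random.seed(7)
start = random.randint(0, k - 1)     # random start within the first k elements
systematic_sample = population[start::k]

print(len(systematic_sample), systematic_sample[:5])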

As long as the starting point is randomized, systematic sampling is a type of


probability sampling. It is easy to implement and the stratification induced can
make it efficient, if the variable by which the list is ordered is correlated with the
variable of interest. 'Every 10th' sampling is especially useful for efficient
sampling from databases.

For example, suppose we wish to sample people from a long street that starts in a
poor area (house No. 1) and ends in an expensive district (house No. 1000). A
simple random selection of addresses from this street could easily end up with
too many from the high end and too few from the low end (or vice versa),
leading to an unrepresentative sample. Selecting (e.g.) every 10th street number
along the street ensures that the sample is spread evenly along the length of the
street, representing all of these districts. (Note that if we always start at house #1
and end at #991, the sample is slightly biased towards the low end; by randomly
selecting the start between #1 and #10, this bias is eliminated.)

However, systematic sampling is especially vulnerable to periodicities in the list.


If periodicity is present and the period is a multiple or factor of the interval used,
the sample is especially likely to be unrepresentative of the overall population,
making the scheme less accurate than simple random sampling.

For example, consider a street where the odd-numbered houses are all on the
north (expensive) side of the road, and the even-numbered houses are all on the



south (cheap) side. Under the sampling scheme given above, it is impossible to
get a representative sample; either the houses sampled will all be from the odd-
numbered, expensive side, or they will all be from the even-numbered, cheap
side, unless the researcher has previous knowledge of this bias and avoids it by
using a skip which ensures jumping between the two sides (any odd-numbered
skip).

Another drawback of systematic sampling is that even in scenarios where it is


more accurate than SRS, its theoretical properties make it difficult to quantify
that accuracy. (In the two examples of systematic sampling that are given above,
much of the potential sampling error is due to variation between neighbouring
houses – but because this method never selects two neighbouring houses, the
sample will not give us any information on that variation.)

As described above, systematic sampling is an EPS (equal probability of selection) method, because all elements


have the same probability of selection (in the example given, one in ten). It is not
'simple random sampling' because different subsets of the same size have
different selection probabilities – e.g. the set {4,14,24,...,994} has a one-in-ten
probability of selection, but the set {4,13,24,34,...} has zero probability of
selection.

Systematic sampling can also be adapted to a non-EPS approach; for an example,
see discussion of PPS samples below.

Stratified sampling

When the population embraces a number of distinct categories, the frame can be
organized by these categories into separate "strata." Each stratum is then sampled
as an independent sub-population, out of which individual elements can be
randomly selected. The ratio of the size of this random selection (or sample) to
the size of the population is called a sampling fraction. There are several
potential benefits to stratified sampling.

First, dividing the population into distinct, independent strata can enable
researchers to draw inferences about specific subgroups that may be lost in a
more generalized random sample.

Second, utilizing a stratified sampling method can lead to more efficient


statistical estimates (provided that strata are selected based upon relevance to the
criterion in question, instead of availability of the samples). Even if a stratified
sampling approach does not lead to increased statistical efficiency, such a tactic
will not result in less efficiency than would simple random sampling, provided
that each stratum is proportional to the group's size in the population.

Third, it is sometimes the case that data are more readily available for individual,
pre-existing strata within a population than for the overall population; in such
cases, using a stratified sampling approach may be more convenient than
aggregating data across groups (though this may potentially be at odds with the
previously noted importance of utilizing criterion-relevant strata).

Finally, since each stratum is treated as an independent population, different


sampling approaches can be applied to different strata, potentially enabling
researchers to use the approach best suited (or most cost-effective) for each
identified subgroup within the population.

There are, however, some potential drawbacks to using stratified sampling. First,
identifying strata and implementing such an approach can increase the cost and
complexity of sample selection, as well as leading to increased complexity of
population estimates. Second, when examining multiple criteria, stratifying
variables may be related to some, but not to others, further complicating the
design, and potentially reducing the utility of the strata. Finally, in some cases
(such as designs with a large number of strata, or those with a specified
minimum sample size per group), stratified sampling can potentially require a
larger sample than would other methods (although in most cases, the required
sample size would be no larger than would be required for simple random
sampling).

A stratified sampling approach is most effective when three conditions are met:

1. Variability within strata is minimized

2. Variability between strata is maximized

3. The variables upon which the population is stratified are strongly


correlated with the desired dependent variable.
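The sketch below (an assumed example) illustrates proportional stratified sampling as described above: each stratum is sampled independently, with a sample size proportional to its share of the population; the stratum names and sizes are hypothetical.

import random

# Hypothetical frame: unit labels grouped by land-use stratum
strata = {
    "urban": list(range(0, 500)),
    "rural": list(range(500, 900)),
    "forest": list(range(900, 1000)),
}
total_units = sum(len(units) for units in strata.values())
sample_size = 100

random.seed(1)
stratified_sample = {}
for name, units in strata.items():
    n_h = round(sample_size * len(units) / total_units)  # proportional allocation
    stratified_sample[name] = random.sample(units, n_h)  # simple random sample within each stratum

for name, picks in stratified_sample.items():
    print(name, len(picks))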



Advantages over other sampling methods

1. Focuses on important subpopulations and ignores irrelevant ones.

2. Allows use of different sampling techniques for different


subpopulations.

3. Improves the accuracy/efficiency of estimation.

4. Permits greater balancing of statistical power of tests of differences


between strata by sampling equal numbers from strata varying widely in
size.

Disadvantages

1. Requires selection of relevant stratification variables which can be


difficult.

2. Is not useful when there are no homogeneous subgroups.

3. Can be expensive to implement.

Poststratification

Stratification is sometimes introduced after the sampling phase in a process


called "poststratification". This approach is typically implemented due to a lack
of prior knowledge of an appropriate stratifying variable or when the
experimenter lacks the necessary information to create a stratifying variable
during the sampling phase. Although the method is susceptible to the pitfalls of
post hoc approaches, it can provide several benefits in the right situation.
Implementation usually follows a simple random sample. In addition to allowing
for stratification on an ancillary variable, poststratification can be used to
implement weighting, which can improve the precision of a sample's estimates.

Oversampling

Choice-based sampling is one of the stratified sampling strategies. In choice-
based sampling, the data are stratified on the target and a sample is taken from
each stratum so that the rare target class will be more represented in the sample.
The model is then built on this biased sample. The effects of the input variables
on the target are often estimated with more precision with the choice-based



sample even when a smaller overall sample size is taken, compared to a random
sample. The results usually must be adjusted to correct for the oversampling.

Probability-proportional-to-size sampling

In some cases the sample designer has access to an "auxiliary variable" or "size
measure", believed to be correlated to the variable of interest, for each element
in the population. These data can be used to improve accuracy in sample design.
One option is to use the auxiliary variable as a basis for stratification.

Another option is probability proportional to size ('PPS') sampling, in which the


selection probability for each element is set to be proportional to its size
measure, up to a maximum of 1. In a simple PPS design, these selection
probabilities can then be used as the basis for Poisson sampling. However, this
has the drawback of variable sample size, and different portions of the
population may still be over- or under-represented due to chance variation in
selections.

Systematic sampling theory can be used to create a probability proportionate to


size sample. This is done by treating each count within the size variable as a
single sampling unit. Samples are then identified by selecting at even intervals
among these counts within the size variable. This method is sometimes called
PPS-sequential or monetary unit sampling in the case of audits or forensic
sampling.

The PPS approach can improve accuracy for a given sample size by
concentrating sample on large elements that have the greatest impact on
population estimates. PPS sampling is commonly used for surveys of businesses,
where element size varies greatly and auxiliary information is often available –
for instance, a survey attempting to measure the number of guest-nights spent in
hotels might use each hotel's number of rooms as an auxiliary variable. In some
cases, an older measurement of the variable of interest can be used as an
auxiliary variable when attempting to produce more current estimates.
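Following the hotel example above, the sketch below (an assumed example) implements PPS selection via Poisson sampling: each hotel's inclusion probability is proportional to its number of rooms, which also illustrates the variable-sample-size drawback mentioned earlier; all names and figures are hypothetical.

import random

rooms = {"hotel_A": 20, "hotel_B": 80, "hotel_C": 150, "hotel_D": 50}  # hypothetical size measures
expected_sample_size = 2
total_rooms = sum(rooms.values())

random.seed(3)
selected = []
for hotel, size in rooms.items():
    p = min(1.0, expected_sample_size * size / total_rooms)  # inclusion probability proportional to size
    if random.random() < p:                                  # independent Bernoulli draw (Poisson sampling)
        selected.append(hotel)

print(selected)  # the realized sample size varies from run to run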

Cluster sampling

Sometimes it is more cost-effective to select respondents in groups ('clusters').


Sampling is often clustered by geography, or by time periods. (Nearly all
samples are in some sense 'clustered' in time – although this is rarely taken into
account in the analysis.) For instance, if surveying households within a city, we
might choose to select 100 city blocks and then interview every household within
the selected blocks.

Clustering can reduce travel and administrative costs. In the example above, an
interviewer can make a single trip to visit several households in one block, rather
than having to drive to a different block for each household.

It also means that one does not need a sampling frame listing all elements in the
target population. Instead, clusters can be chosen from a cluster-level frame, with
an element-level frame created only for the selected clusters. In the example
above, the sample only requires a block-level city map for initial selections, and
then a household-level map of the 100 selected blocks, rather than a household-
level map of the whole city.

Cluster sampling (also known as clustered sampling) generally increases the


variability of sample estimates above that of simple random sampling, depending
on how the clusters differ between one another as compared to the within-cluster
variation. For this reason, cluster sampling requires a larger sample than SRS to
achieve the same level of accuracy – but cost savings from clustering might still
make this a cheaper option.

Cluster sampling is commonly implemented as multistage sampling. This is a


complex form of cluster sampling in which two or more levels of units are
embedded one in the other. The first stage consists of constructing the clusters
that will be used to sample from. In the second stage, a sample of primary units
is randomly selected from each cluster (rather than using all units contained in
all selected clusters). In following stages, in each of those selected clusters,
additional samples of units are selected, and so on. All ultimate units
(individuals, for instance) selected at the last step of this procedure are then
surveyed. This technique, thus, is essentially the process of taking random
subsamples of preceding random samples.
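A minimal sketch of two-stage cluster sampling, under assumed toy data, is given below: city blocks are selected at random in the first stage, and households are sampled within each selected block in the second stage.

import random

random.seed(11)

# Hypothetical frame: 20 city blocks, each containing 30 households
blocks = {f"block_{b}": [f"block_{b}_house_{h}" for h in range(30)] for b in range(20)}

selected_blocks = random.sample(list(blocks), 5)                 # stage 1: sample the clusters (blocks)
household_sample = []
for block in selected_blocks:
    household_sample.extend(random.sample(blocks[block], 10))    # stage 2: sample units within each cluster

print(len(household_sample), household_sample[:3])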

Multistage sampling can substantially reduce sampling costs, where the complete
population list would need to be constructed (before other sampling methods
could be applied). By eliminating the work involved in describing clusters that



are not selected, multistage sampling can reduce the large costs associated with
traditional cluster sampling. However, each sample may not be a full
representative of the whole population.

Quota sampling

In quota sampling, the population is first segmented into mutually exclusive sub-
groups, just as in stratified sampling. Then judgement is used to select the
subjects or units from each segment based on a specified proportion. For
example, an interviewer may be told to sample 200 females and 300 males
between the age of 45 and 60.

It is this second step which makes the technique one of non-probability
sampling. In quota sampling the selection of the sample is non-random. For
example, interviewers might be tempted to interview those who look most
helpful. The problem is that these samples may be biased because not everyone
gets a chance of selection. This non-random element is its greatest weakness, and
quota versus probability sampling has been a matter of controversy for several years.

Minimax sampling

In imbalanced datasets, where the sampling ratio does not follow the population
statistics, one can resample the dataset in a conservative manner called minimax
sampling. The minimax sampling has its origin in Anderson minimax ratio whose
value is proved to be 0.5: in a binary classification, the class-sample sizes should
be chosen equally. This ratio can be proved to be minimax ratio only under the
assumption of LDA classifier with Gaussian distributions. The notion of minimax
sampling is recently developed for a general class of classification rules, called
class-wise smart classifiers. In this case, the sampling ratio of classes is selected
so that the worst case classifier error over all the possible population statistics
for class prior probabilities, would be the best.

Accidental sampling

Accidental sampling (sometimes known as grab, convenience or opportunity sampling) is a type of nonprobability
sampling which involves the sample being drawn from that part of the population
which is close to hand. That is, a population is selected because it is readily
available and convenient. It may be through meeting the person or including a
person in the sample when one meets them or chosen by finding them through
technological means such as the internet or through phone. The researcher using
such a sample cannot scientifically make generalizations about the total
population from this sample because it would not be representative enough. For
example, if the interviewer were to conduct such a survey at a shopping center
early in the morning on a given day, the people that they could interview would
be limited to those given there at that given time, which would not represent the
views of other members of society in such an area, if the survey were to be
conducted at different times of day and several times per week. This type of
sampling is most useful for pilot testing. Several important considerations for
researchers using convenience samples include:

1. Are there controls within the research design or experiment which can
serve to lessen the impact of a non-random convenience sample, thereby
ensuring the results will be more representative of the population?

2. Is there good reason to believe that a particular convenience sample


would or should respond or behave differently than a random sample from
the same population?

3. Is the question being asked by the research one that can adequately be
answered using a convenience sample?

In social science research, snowball sampling is a similar technique, where


existing study subjects are used to recruit more subjects into the sample. Some
variants of snowball sampling, such as respondent driven sampling, allow
calculation of selection probabilities and are probability sampling methods under
certain conditions.

Voluntary Sampling

The voluntary sampling method is a type of non-probability sampling. Volunteers


choose to complete a survey.

Volunteers may be invited through advertisements in social media. The target


population for advertisements can be selected by characteristics like location,
age, sex, income, occupation, education or interests using tools provided by the
social medium. The advertisement may include a message about the research and
link to a survey. After following the link and completing the survey the volunteer
submits the data to be included in the sample population. This method can reach
a global population but is limited by the campaign budget. Volunteers outside the
invited population may also be included in the sample.

It is difficult to make generalizations from this sample because it may not


represent the total population. Often, volunteers have a strong interest in the
main topic of the survey.

Line-intercept sampling

Line-intercept sampling is a method of sampling elements in a region whereby an


element is sampled if a chosen line segment, called a "transect", intersects the
element.

Panel sampling

Panel sampling is the method of first selecting a group of participants through a


random sampling method and then asking that group for (potentially the same)
information several times over a period of time. Therefore, each participant is
interviewed at two or more time points; each period of data collection is called a
"wave". The method was developed by sociologist Paul Lazarsfeld in 1938 as a
means of studying political campaigns. This longitudinal sampling-method
allows estimates of changes in the population, for example with regard to chronic
illness, job stress, or weekly food expenditures. Panel sampling can also be
used to inform researchers about within-person health changes due to age or to
help explain changes in continuous dependent variables such as spousal
interaction. There have been several proposed methods of analyzing panel data,
including MANOVA, growth curves, and structural equation modeling with
lagged effects.

Snowball sampling

Snowball sampling involves finding a small group of initial respondents and


using them to recruit more respondents. It is particularly useful in cases where
the population is hidden or difficult to enumerate.

Theoretical sampling

Theoretical sampling occurs when samples are selected on the basis of the results
of the data collected so far with a goal of developing a deeper understanding of



the area or of developing theories. Extreme or very specific cases might be selected in
order to maximize the likelihood a phenomenon will actually be observable.

Data Types in Statistics

Before we define what data collection is, it's essential to ask the question, "What
is data?" The abridged answer is that data is various kinds of information formatted
in a particular way. There are two classes of data in statistics: quantitative data
and qualitative data. This highest level of classification comes from the fact that
data can either be measured or can be an observed feature of interest.

Qualitative data are also referred to as categorical data. They are observed
phenomena and cannot be measured with numbers. Examples: race, age group,
gender, origin, and so on. Even if they are coded with numerical values, those
numbers hold no quantitative meaning (e.g., 1 for male and 0 for female).

Quantitative data, on the other hand, tell us about the quantities of things, that is,
the things we can measure, and so they are expressed in terms of numbers. They are
also known as numerical data and lend themselves to statistical analysis. Examples:
height, water, distance, and so on.

We can further subdivide quantitative data and qualitative data into 4 subtypes as
follows: nominal data, ordinal data, interval data, and ratio data.

Qualitative (Categorical) data types

Qualitative data can be subdivided into nominal and ordinal data types. While
both these types of data can be classified, ordinal data can be ordered as well.

Nominal Data

Nominal data is a type of data that represents discrete units which is why it
cannot be ordered and measured. They are used to label variables without
providing any quantitative value. Also, they have no meaningful zero.

Some examples of nominal data include

• Gender (male, female)

• Hair color (black, brown, gray, etc.)

• Nationality (Indian, American, Chinese, etc.)



Data scientists use one-hot encoding to transform nominal data into a numeric
feature.

The only logical operation that you can apply to them is equality or inequality
which you can also use to group them. The descriptive statistics you can do with
nominal data include frequencies, proportions, percentages, and central points.
And, to visualize nominal data, you can use a pie chart or a bar chart.

Ordinal Data

Ordinal values represent discrete as well as ordered units. Unlike nominal, here
the ordering matters. However, there is no consistency in the relative distance
between the adjacent categories. And, similar to nominal data, ordinal data also
don't have a meaningful zero.

Examples of ordinal data

• Opinion (agree, mostly agree, neutral, mostly disagree, disagree)

• Socioeconomic status (low income, middle income, high income)

Data scientists use label encoding to transform ordinal data into a numeric feature.

The descriptive statistics that you can do with ordinal data include frequencies,
proportions, percentages, central points, percentiles, median, mode, and the
interquartile range. The visualization methods that can be used here are the same as
for nominal data.
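As an illustration of the two encodings just mentioned, the sketch below (an assumed example using the pandas library) one-hot encodes a nominal variable and label encodes an ordinal variable; the column names and category ordering are invented for the example.

import pandas as pd

df = pd.DataFrame({
    "hair_color": ["black", "brown", "gray", "brown"],            # nominal
    "opinion": ["agree", "neutral", "disagree", "mostly agree"],  # ordinal
})

# One-hot encoding: each nominal category becomes its own 0/1 indicator column
one_hot = pd.get_dummies(df["hair_color"], prefix="hair")

# Label encoding: map ordinal categories to integers that respect their order
order = {"disagree": 0, "mostly disagree": 1, "neutral": 2, "mostly agree": 3, "agree": 4}
df["opinion_code"] = df["opinion"].map(order)

print(pd.concat([df, one_hot], axis=1))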

Quantitative (Numerical) Data Types

Two types of quantitative data are discrete data and continuous data. Discrete
data have distinct and separate values. Therefore, they are data with fixed points
and can't take any measures in between. So all counted data are discrete data.
Some examples of discrete data include shoe sizes, number of students in class,
number of languages an individual speaks, etc. Continuous data, on the other
hand, represent an endless range of possible values within a specified range. It
can be divided into finer parts to be measured but not counted. Continuous data
examples include temperature range, height, weight, etc.

Continuous data can be visualized by histogram or box plot while bar graphs or
stem plots can be used for discrete data.
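The sketch below (an assumed example using matplotlib) draws the plots just described: a histogram for a hypothetical continuous variable and a bar chart for a hypothetical discrete variable.

import matplotlib.pyplot as plt

temperatures = [21.3, 22.1, 20.8, 23.5, 22.9, 21.7, 24.0, 22.4]  # continuous data (hypothetical)
shoe_size_counts = {6: 3, 7: 8, 8: 12, 9: 7, 10: 2}              # discrete data (hypothetical counts)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

ax1.hist(temperatures, bins=4)            # histogram for continuous data
ax1.set_title("Temperature (continuous)")

ax2.bar(list(shoe_size_counts.keys()), list(shoe_size_counts.values()))  # bar chart for discrete data
ax2.set_title("Shoe size (discrete)")

plt.tight_layout()
plt.show()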

Here are two types of continuous data



Interval Data

It represents ordered data that is measured along a numerical scale with equal
distances between the adjacent units. These equal distances are also referred to
as intervals. So a variable contains interval data if it has ordered numeric values
with the exact differences known between them.

Interval data can be continuous or discrete.

Examples of Interval data

IQ test's intelligence scale

Time if measured using a 12-hour clock

You can compare interval data and add/subtract the values, but you cannot
multiply or divide them, as interval data do not have a meaningful zero. The descriptive
statistics you can apply to interval data include central point, range, and spread.

Ratio Data

Like interval data, ratio data are also ordered, with the same difference between
the individual units. However, they also have a meaningful zero so they cannot
take negative values.

Examples of ratio data

Temperature on a Kelvin scale (0 K represents a total absence of thermal
energy)

Height (zero is the starting point)

Now with real zero points, we can also multiply and divide the numbers.
Besides, you can sort the values as well. The descriptive statistics you can do
with ratio data are the same as interval data and include central point, range, and
spread.

Overall, ratio data and interval data are the same with equal spacing between
adjoining values but the former also has a meaningful zero. Besides addition and
subtraction, you can also multiply and divide the data, which is impossible with
interval data as it does not have an absolute zero. However, interval data can take
negative values with no absolute zero while ratio data cannot.

(Ref. https://www.turing.com/kb/statistical-data-types)
What is Data Collection: A Definition

Data collection is the process of gathering, measuring, and analyzing accurate


data from a variety of relevant sources to find answers to research problems,
answer questions, evaluate outcomes, and forecast trends and probabilities.

Our society is highly dependent on data, which underscores the importance of


collecting it. Accurate data collection is necessary to make informed business
decisions, ensure quality assurance, and keep research integrity.

During data collection, the researchers must identify the data types, the sources
of data, and what methods are being used. We will soon see that there are many
different data collection methods. There is heavy reliance on data collection in
research, commercial, and government fields.

Before an analyst begins collecting data, they must answer three questions first:

• What's the goal or purpose of this research?

• What kinds of data are they planning on gathering?

• What methods and procedures will be used to collect, store, and process
the information?

Additionally, we can break up data into qualitative and quantitative types.


Qualitative data covers descriptions such as color, size, quality, and appearance.
Quantitative data, unsurprisingly, deals with numbers, such as statistics, poll
numbers, percentages, etc.

Why Do We Need Data Collection?

Before a judge makes a ruling in a court case or a general creates a plan of


attack, they must have as many relevant facts as possible. The best courses of
action come from informed decisions, and information and data are synonymous.

The concept of data collection isn't a new one, as we'll see later, but the world
has changed. There is far more data available today, and it exists in forms that
were unheard of a century ago. The data collection process has had to change and
grow with the times, keeping pace with technology.



Whether you're in the world of academia, trying to conduct research, or part of
the commercial sector, thinking of how to promote a new product, you need data
collection to help you make better choices.

Now that you know what data collection is and why we need it, let's take a look
at the different methods of data collection. While the phrase "data collection"
may sound all high-tech and digital, it doesn't necessarily entail things like
computers, big data, and the internet. Data collection could mean a telephone
survey, a mail-in comment card, or even some guy with a clipboard asking
passersby some questions. But let's see if we can sort the different data
collection methods into a semblance of organized categories.

What Are the Different Methods of Data Collection?

The following are seven primary methods of collecting data in business


analytics.

• Surveys

• Transactional Tracking

• Interviews and Focus Groups

• Observation

• Online Tracking

• Forms

• Social Media Monitoring

Data collection breaks down into two methods. As a side note, many terms, such
as techniques, methods, and types, are interchangeable, depending on who
uses them. One source may call data collection techniques "methods," for
instance. But whatever labels we use, the general concepts and breakdowns apply
across the board, whether we're talking about marketing analysis or a scientific
research project.

The two methods are:

• Primary

As the name implies, this is original, first-hand data collected by the data
researchers. This process is the initial information gathering step, performed
before anyone carries out any further or related research. Primary data results are
highly accurate provided the researcher collects the information. However,
there's a downside, as first-hand research is potentially time-consuming and
expensive.

• Secondary

Secondary data is second-hand data collected by other parties and already having
undergone statistical analysis. This data is either information that the researcher
has tasked other people to collect or information the researcher has looked up.
Simply put, it's second-hand information. Although it's easier and cheaper to
obtain than primary information, secondary information raises concerns
regarding accuracy and authenticity. Quantitative data makes up a majority of
secondary data.

Specific Data Collection Techniques

Let's get into specifics. Using the primary/secondary methods mentioned above,
here is a breakdown of specific techniques.

Primary Data Collection

• Interviews

The researcher asks questions of a large sampling of people, either by direct


interviews or means of mass communication such as by phone or mail. This
method is by far the most common means of data gathering.

• Projective Data Gathering

Projective data gathering is an indirect interview, used when potential


respondents know why they're being asked questions and hesitate to answer. For
instance, someone may be reluctant to answer questions about their phone
service if a cell phone carrier representative poses the questions. With projective
data gathering, the interviewees get an incomplete question, and they must fill in
the rest, using their opinions, feelings, and attitudes.

• Delphi Technique

The Oracle at Delphi, according to Greek mythology, was the high priestess of
Apollo's temple, who gave advice, prophecies, and counsel. In the realm of data
collection, researchers use the Delphi technique by gathering information from a
panel of experts. Each expert answers questions in their field of specialty, and
the replies are consolidated into a single opinion.

• Focus Groups

Focus groups, like interviews, are a commonly used technique. The group
consists of anywhere from a half-dozen to a dozen people, led by a moderator,
brought together to discuss the issue.

• Questionnaires

Questionnaires are a simple, straightforward data collection method.


Respondents get a series of questions, either open- or closed-ended, related to the
matter at hand.

Secondary Data Collection

Unlike primary data collection, there are no specific collection methods. Instead,
since the information has already been collected, the researcher consults various
data sources, such as:

• Financial Statements

• Sales Reports

• Retailer/Distributor/Deal Feedback

• Customer Personal Information (e.g., name, address, age, contact info)

• Business Journals

• Government Records (e.g., census, tax records, Social Security info)

• Trade/Business Magazines

• The internet

Data Collection Tools

Now that we've explained the various techniques, let's narrow our focus even
further by looking at some specific tools. For example, we mentioned interviews
as a technique, but we can further break that down into different interview types
(or "tools").



• Word Association

The researcher gives the respondent a set of words and asks them what comes to
mind when they hear each word.

• Sentence Completion

Researchers use sentence completion to understand what kind of ideas the


respondent has. This tool involves giving an incomplete sentence and seeing how
the interviewee finishes it.

• Role-Playing

Respondents are presented with an imaginary situation and asked how they
would act or react if it was real.

• In-Person Surveys

The researcher asks questions in person.

• Online/Web Surveys

These surveys are easy to accomplish, but some users may be unwilling to
answer truthfully, if at all.

• Mobile Surveys

These surveys take advantage of the increasing proliferation of mobile


technology. Mobile collection surveys rely on mobile devices like tablets or
smartphones to conduct surveys via SMS or mobile apps.

• Phone Surveys

No researcher can call thousands of people at once, so they need a third party to
handle the chore. However, many people have call screening and won't answer.

• Observation

Sometimes, the simplest method is the best. Researchers who make direct
observations collect data quickly and easily, with little intrusion or third-party
bias. Naturally, it's only effective in small-scale situations.

The Importance of Ensuring Accurate and Appropriate Data Collection

Accurate data collecting is crucial to preserving the integrity of research,


regardless of the subject of study or preferred method for defining data



(quantitative, qualitative). Errors are less likely to occur when the right data
gathering tools are used (whether they are brand-new ones, updated versions of
them, or already available).

The effects of data collection done incorrectly include the following:

• Erroneous conclusions that squander resources

• Decisions that compromise public policy

• Incapacity to correctly respond to research inquiries

• Bringing harm to participants who are humans or animals

• Deceiving other researchers into pursuing futile research avenues

• The study's inability to be replicated and validated

When these study findings are used to support recommendations for public
policy, there is the potential for disproportionate harm, even though the degree
of influence from flawed data collection may vary by discipline and the type of
investigation.

Let us now look at the various issues that we might face while maintaining the
integrity of data collection.

Issues Related to Maintaining the Integrity of Data Collection

Maintaining data integrity is justified mainly because it assists the detection of
errors in the data gathering process, whether they were made purposefully
(deliberate falsifications) or not (systematic or random errors).

Quality assurance and quality control are two strategies that help protect data
integrity and guarantee the scientific validity of study results.

Each strategy is used at various stages of the research timeline:

• Quality control - tasks that are performed both after and during data
collecting

• Quality assurance - events that happen before data gathering starts



Let us explore each of them in more detail now.

Quality Assurance

As quality assurance precedes data collection, its primary goal is


"prevention" (i.e., forestalling problems with data collection). The best way to
protect the accuracy of data collection is through prevention. The uniformity of
protocol created in the thorough and exhaustive procedures manual for data
collecting serves as the best example of this proactive step.

The likelihood of failing to spot issues and mistakes early in the research attempt
increases when guides are written poorly. These shortcomings can show up in
several ways:

• Failure to determine the precise subjects and methods for retraining or


training staff employees in data collecting

• An incomplete list of the items to be collected

• There isn't a system in place to track modifications to processes that may
occur as the investigation continues.

• Instead of detailed, step-by-step instructions on how to deliver tests,


there is a vague description of the data gathering tools that will be
employed.

• Uncertainty regarding the date, procedure, and identity of the person or


people in charge of examining the data

• Incomprehensible guidelines for using, adjusting, and calibrating the


data collection equipment.

Now, let us look at how to ensure Quality Control.

Despite the fact that quality control actions (detection/monitoring and


intervention) take place both after and during data collection, the specifics
should be meticulously detailed in the procedures manual. Establishing
monitoring systems requires a specific communication structure, which is a
prerequisite. Following the discovery of data collection problems, there should
be no ambiguity regarding the information flow between the primary
investigators and staff personnel. A poorly designed communication system
promotes slack oversight and reduces opportunities for error detection.
Direct staff observation during site visits, conference calls, and frequent or routine
assessments of data reports to spot discrepancies, excessive numbers, or invalid
codes can all be used as forms of detection or monitoring. Site visits might not
be appropriate for all disciplines. Still, without routine auditing of records,
whether qualitative or quantitative, it will be challenging for investigators to
confirm that data gathering is taking place in accordance with the manual's
defined methods.

Additionally, quality control determines the appropriate solutions, or "actions,"


to fix flawed data gathering procedures and reduce recurrences.

Problems with data collection, for instance, that call for immediate action
include:

• Fraud or misbehavior

• Systematic mistakes, procedure violations

• Individual data items with errors

• Issues with certain staff members or a site's performance

Researchers are trained to include one or more secondary measures that can be
used to verify the quality of information being obtained from the human subject
in the social and behavioral sciences where primary data collection entails using
human subjects.

For instance, a researcher conducting a survey would be interested in learning


more about the prevalence of risky behaviors among young adults as well as the
social factors that influence these risky behaviors' propensity for and frequency.

Let us now explore the common challenges with regard to data collection.

What are Common Challenges in Data Collection?

There are some prevalent challenges faced while collecting data; let us explore a
few of them to understand them better and learn how to avoid them.

Data Quality Issues

The main threat to the broad and successful application of machine learning is
poor data quality. Data quality must be your top priority if you want to make



technologies like machine learning work for you. Let's look at some of the
most prevalent data quality problems and how to fix them.

Inconsistent Data

When working with various data sources, it's conceivable that the same
information will have discrepancies between sources. The differences could be in
formats, units, or occasionally spellings. The introduction of inconsistent data
might also occur during firm mergers or relocations. Inconsistencies in data have
a tendency to accumulate and reduce the value of data if they are not continually
resolved. Organizations that have heavily focused on data consistency do so
because they only want reliable data to support their analytics.

Data Downtime

Data is the driving force behind the decisions and operations of data-driven
businesses. However, there may be brief periods when their data is unreliable or
not ready. This data downtime can have a significant impact on businesses, from
customer complaints to subpar analytical outcomes. A data engineer spends about
80% of their time updating, maintaining, and guaranteeing the integrity of the data
pipeline. The lengthy operational lead time from data capture to insight also means
there is a high marginal cost to asking the next business question.

Schema modifications and migration problems are just two examples of the
causes of data downtime. Data pipelines can be difficult due to their size and
complexity. Data downtime must be continuously monitored, and it must be
reduced through automation.

Ambiguous Data

Even with thorough oversight, some errors can still occur in massive databases
or data lakes, and the issue becomes even more overwhelming when data streams in
at high speed. Spelling mistakes can go unnoticed, formatting difficulties can
occur, and column headings can be misleading. Such ambiguous data can cause a
number of problems for reporting and analytics.



Duplicate Data

Streaming data, local databases, and cloud data lakes are just a few of the
sources of data that modern enterprises must contend with. They might also have
application and system silos. These sources are likely to duplicate and overlap
each other quite a bit. For instance, duplicate contact information has a
substantial impact on customer experience. If certain prospects are ignored while
others are engaged repeatedly, marketing campaigns suffer. The likelihood of
biased analytical outcomes increases when duplicate data are present. It can also
result in ML models with biased training data.

Too Much Data

While we emphasize data-driven analytics and its advantages, a data quality


problem with excessive data exists. There is a risk of getting lost in an
abundance of data when searching for information pertinent to your analytical
efforts. Data scientists, data analysts, and business users devote 80% of their
work to finding and organizing the appropriate data. With an increase in data
volume, other problems with data quality become more serious, particularly
when dealing with streaming data and big files or databases.

Inaccurate Data

For highly regulated businesses like healthcare, data accuracy is crucial. Given
the current experience, it is more important than ever to increase the data quality
for COVID-19 and later pandemics. Inaccurate information does not provide you
with a true picture of the situation and cannot be used to plan the best course of
action. Personalized customer experiences and marketing strategies
underperform if your customer data is inaccurate.

Data inaccuracies can be attributed to a number of things, including data
degradation, human error, and data drift. Worldwide data decay occurs at a
rate of about 3% per month, which is quite concerning. Data integrity can be
compromised while being transferred between different systems, and data quality
might deteriorate with time.

Hidden Data

The majority of businesses only utilize a portion of their data, with the remainder
sometimes lost in data silos or discarded in data graveyards. For instance,
the customer service team might not receive client data from sales, missing an
opportunity to build more precise and comprehensive customer profiles. Hidden data
also means missed opportunities to develop novel products, enhance services, and
streamline procedures.

Finding Relevant Data

Finding relevant data is not so easy. There are several factors that we need to
consider while trying to find relevant data, including:

• Relevant domain

• Relevant demographics

• Relevant time period, among many other factors

Data that is not relevant to our study on any of these factors becomes unusable,
and we cannot effectively proceed with its analysis. This could lead to incomplete
research or analysis, re-collecting data again and again, or shutting down the
study.

Deciding the Data to Collect

Determining what data to collect is one of the most important decisions in the
data collection process and should be made at the outset. We
must choose the subjects the data will cover, the sources we will use to
gather it, and the quantity of information we will require. Our responses to these
questions will depend on our aims, or what we expect to achieve using the
data. As an illustration, we may choose to gather information on the categories of
articles that website visitors between the ages of 20 and 50 most frequently
access. We can also decide to compile data on the typical age of all the clients
who made a purchase from our business over the previous month.

Not addressing this could lead to duplicated work, collection of irrelevant data,
or ruining the study as a whole.

Dealing With Big Data

Big data refers to exceedingly massive data sets with more intricate and
diversified structures. These traits typically result in increased challenges when
storing, analyzing, and applying additional methods of extracting results. Big data
refers especially to data sets that are so large or so complex that
conventional data processing tools are insufficient, and to the overwhelming amount
of data, both unstructured and structured, that a business faces on a daily basis.

The amount of data produced by healthcare applications, the internet, social
networking sites, sensor networks, and many other businesses is rapidly
growing as a result of recent technological advancements. Big data refers to the
vast volume of data created from numerous sources in a variety of formats at
extremely fast rates. Dealing with this kind of data is one of the many challenges
of data collection and is a crucial step toward collecting effective data.

Low Response and Other Research Issues

Poor design and low response rates were shown to be two issues with data
collecting, particularly in health surveys that used questionnaires. This might
lead to an insufficient or inadequate supply of data for the study. Creating an
incentivized data collection program might be beneficial in this case to get more
responses.

Now, let us look at the key steps in the data collection process.

What are the Key Steps in the Data Collection Process?

In the Data Collection Process, there are 5 key steps. They are explained briefly
below -

1. Decide What Data You Want to Gather

The first thing that we need to do is decide what information we want to gather.
We must choose the subjects the data will cover, the sources we will use to
gather it, and the quantity of information that we would require. For instance, we
may choose to gather information on the categories of products that an average
e-commerce website visitor between the ages of 30 and 45 most frequently
searches for.

2. Establish a Deadline for Data Collection

The process of creating a strategy for data collection can now begin. We should
set a deadline for our data collection at the outset of our planning phase. Some
forms of data we might want to continuously collect. We might want to build up
a technique for tracking transactional data and website visitor statistics over the



long term, for instance. However, we will track the data throughout a certain
time frame if we are tracking it for a particular campaign. In these situations, we
will have a schedule for when we will begin and finish gathering data.

3. Select a Data Collection Approach

We will select the data collection technique that will serve as the foundation of
our data gathering plan at this stage. We must take into account the type of
information that we wish to gather, the time period during which we will receive
it, and the other factors we decide on to choose the best gathering strategy.

4. Gather Information

Once our plan is complete, we can put our data collection plan into action and
begin gathering data. In our DMP (data management platform), we can store and
organize our data. We need to
be careful to follow our plan and keep an eye on how it's doing. Especially if we
are collecting data regularly, setting up a timetable for when we will be checking
in on how our data gathering is going may be helpful. As circumstances alter and
we learn new details, we might need to amend our plan.

5. Examine the Information and Apply Your Findings

It's time to examine our data and arrange our findings after we have gathered all
of our information. The analysis stage is essential because it transforms
unprocessed data into insightful knowledge that can be applied to better our
marketing plans, goods, and business judgments. The analytics tools included in
our DMP can be used to assist with this phase. We can put the discoveries to use
to enhance our business once we have discovered the patterns and insights in our
data.

Let us now look at some data collection considerations and best practices that
one might follow.

Data Collection Considerations and Best Practices

We must carefully plan before spending time and money traveling to the field to
gather data. Effective data collection strategies can help us collect richer and
more accurate data while saving time and resources.

Below, we will be discussing some of the best practices that we can follow for
the best results -



1. Take Into Account the Price of Each Extra Data Point

Once we have decided on the data we want to gather, we need to make sure to
take the expense of doing so into account. Our surveyors and respondents will
incur additional costs for each additional data point or survey question.

2. Plan How to Gather Each Data Piece

There is a dearth of freely accessible data. Sometimes the data is there, but we
may not have access to it. For instance, unless we have a compelling cause, we
cannot openly view another person's medical information. It could be challenging
to measure several types of information.

Consider how time-consuming and difficult it will be to gather each piece of


information while deciding what data to acquire.

3. Think About Your Choices for Data Collecting Using Mobile Devices

Mobile-based data collecting can be divided into three categories -

• IVRS (interactive voice response technology) - Will call the respondents


and ask them questions that have already been recorded.

• SMS data collection - Will send a text message to the respondent, who
can then respond to questions by text on their phone.

• Field surveyors - Can directly enter data into an interactive


questionnaire while speaking to each respondent, thanks to smartphone
apps.

We need to make sure to select the appropriate tool for our survey and
responders because each one has its own disadvantages and advantages.

4. Carefully Consider the Data You Need to Gather

It's all too easy to get information about anything and everything, but it's crucial
to only gather the information that we require.

It is helpful to consider these 3 questions:

• What details will be helpful?

• What details are available?

• What specific details do you require?



5. Remember to Consider Identifiers

Identifiers, or details describing the context and source of a survey response, are
just as crucial as the information about the subject or program that we are
actually researching.

In general, adding more identifiers will enable us to pinpoint our program's


successes and failures with greater accuracy, but moderation is the key.

6. Data Collecting Through Mobile Devices is the Way to Go

Although collecting data on paper is still common, modern technology relies


heavily on mobile devices. They enable us to gather many various types of data
at relatively lower prices and are accurate as well as quick. There aren't many
reasons not to pick mobile-based data collecting with the boom of low-cost
Android devices that are available nowadays.

(Source: https://www.simplilearn.com/what-is-data-collection-article).

SELF-TEST
1) Which of the following values is used as a summary measure for a sample,
such as a sample mean?

a) Population parameter

b) Sample parameter

c) Sample statistic

d) Population mean

2) Which of the following is a branch of statistics?

a) Descriptive statistics

b) Inferential statistics

c) Industry statistics

d) Both A and B

SHORT ANSWER QUESTIONS


1) Write a note on statistical data

2) Write a note on sampling in statistics



UNIT 01-02: STATISTICAL METHODS: MEASURES OF CENTRAL
TENDENCIES - MEAN, MEDIAN, MODE

LEARNING OBJECTIVES
After successful completion of this unit, you will be able to
 Learn about Mean
 Understand Median
 Understand Mode

As described in Byju's introduction to mean, median and mode: often in statistics,
we tend to represent a set of data by a representative value which would
approximately define the entire collection. This representative value is called a
measure of central tendency, and the name suggests that it is a value around which
the data is centred. These central tendencies are the mean, the median and the mode.

Fig. 1.1 Mean, Median and Mode.

(Credit: https://byjus.com)

We are all interested in cricket, but have you ever wondered, during a match,
why the run rate of a particular over is projected, and what the run rate
means? Or, when you get your examination result card, you mention the aggregate
percentage. Again, what is the meaning of aggregate? All these quantities in real
life make it easy to represent a collection of data in terms of a single value.
This is what statistics is about.

Statistics deals with the collection of data and information for a particular
purpose. The tabulation of each run for each ball in cricket gives the statistics of
the game. The representation of any such data collection can be done in multiple
ways, like through tables, graphs, pie-charts, bar graphs, pictorial representation
etc.

Now consider a 50-over ODI match being played between India and Australia. India
scored 370 runs by the end of the first innings. How do you decide whether India
posted a good score or not? It's pretty simple: you find the overall run rate,
which is good for such a score. This is where the concepts of mean, median
and mode come into the picture. Let us learn about each of the central tendencies in detail.

Measures of central tendency

The measures of central tendencies are given by various parameters but the most
commonly used ones are mean, median and mode. These parameters are
discussed below.

What is Mean?

Mean is the most commonly used measure of central tendency. It actually


represents the average of the given collection of data. It is applicable for both
continuous and discrete data.

It is equal to the sum of all the values in the collection of data divided by the
total number of values.

Suppose we have n values in a set of data, namely x1, x2, x3, …, xn; then the
mean of the data is given by:

Mean (x̄) = (x1 + x2 + x3 + … + xn) / n

It can also be denoted as:

x̄ = (1/n) Σ xi, where the sum runs over i = 1 to n.


Table 1.1: Calculation of the mean using three different formula methods

What is Median?

Generally median represents the mid-value of the given set of data when
arranged in a particular order.

Median: Given that the data collection is arranged in ascending or descending


order, the following method is applied:

If the number of values or observations in the given data is odd, then the median is
given by the [(n+1)/2]th observation.

If the number of values or observations in the given data set is even, then the
median is given by the average of the (n/2)th and [(n/2)+1]th observations.

The median for grouped data can be calculated using the formula,
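Median = l + ((n/2 − F) / f) × h

(using the usual notation: l is the lower boundary of the median class, n the total
frequency, F the cumulative frequency of the classes before the median class, f the
frequency of the median class, and h the class width).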

What is Mode?

The most frequent number occurring in the data set is known as the mode.

Consider the following data set which represents the marks obtained by different
students in a subject.



The maximum frequency observation is 73 (as three students scored 73 marks),
so the mode of the given data collection is 73.

We can calculate the mode for grouped data using the formula below:
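Mode = l + ((f1 − f0) / (2f1 − f0 − f2)) × h

(using the usual notation: l is the lower boundary of the modal class, f1 its
frequency, f0 and f2 the frequencies of the classes immediately before and after it,
and h the class width).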

Example of Mean, Median and Mode

Let us see the difference between the mean median and mode through an
example.

Example: The given table shows the scores obtained by different players in a
match. What are the mean, median and mode of the given data?

Table: 1.2 Example of Mean, Median and Mode

Solution:

i) The mean is given by:

The mean of the given data is 43.

ii) To find out the median let us first arrange the given data in ascending order

As the number of items in the data is odd, the median is the [(n+1)/2]th
observation.

⇒ Median = [(7+1)/2]th observation = 52

iii) Mode is the most frequent data, which is 52.

Relation of Mean Median Mode

The relation between the mean, median and mode, that is, between the three measures
of central tendency, for a moderately skewed distribution is given by the formula:

Mode = 3 Median – 2 Mean

This relation is also called the empirical relationship. It is used to find one of
the measures when the other two measures are known for certain data. The
relationship can be rewritten in different forms by rearranging its terms, as shown below.
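Rearranging Mode = 3 Median – 2 Mean gives the other two measures in terms of the
remaining pair:

Mean = (3 Median – Mode) / 2

Median = (Mode + 2 Mean) / 3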

Range

In statistics, the range is the difference between the highest and lowest data values
in the set. The formula is:

Range = Highest value – Lowest value

(Credit: https://byjus.com/maths/mean-median-mode/).
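As a quick illustration, these measures can also be computed with Python's standard
statistics module; the data values below are made up purely for demonstration.

import statistics

scores = [4, 7, 7, 9, 13]                  # hypothetical ungrouped data

mean_value = statistics.mean(scores)       # sum of values / number of values -> 8
median_value = statistics.median(scores)   # middle value after sorting -> 7
mode_value = statistics.mode(scores)       # most frequent value -> 7
data_range = max(scores) - min(scores)     # highest value - lowest value -> 9

print(mean_value, median_value, mode_value, data_range)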

SELF-TEST
1) Mean, Median and Mode are

a) Measures of deviation

b) Ways of sampling

c) Measures of central tendency

d) None of the above

2) Which of the following variables cannot be expressed in quantitative terms

a) Socio-economic Status

b) Marital Status

c) Numerical Aptitude
d) Professional Attitude

SHORT ANSWER QUESTIONS


1) Write a note on the mean and the median

2) Write a note on the mode.



UNIT 01-03: DISPERSION: MEASURES OF DISPERSION - RANGE, QUARTILE
DEVIATION, MEAN DEVIATION AND STANDARD DEVIATION, ABSOLUTE AND
RELATIVE MEASURES OF DISPERSION, SKEWNESS

LEARNING OBJECTIVES
After successful completion of this unit, you will be able to learn

 Basics of Dispersion

 Types of Dispersion

 Absolute and Relative Measures of Dispersion, Skewness

Statistical dispersion

In statistics, dispersion (also called variability, scatter, or spread) is the extent to


which a distribution is stretched or squeezed. Common examples of measures of
statistical dispersion are the variance, standard deviation, and interquartile range.
For instance, when the variance of data in a set is large, the data is widely
scattered. On the other hand, when the variance is small, the data in the set is
clustered.

Dispersion is contrasted with location or central tendency, and together they are
the most used properties of distributions.

Measures

A measure of statistical dispersion is a nonnegative real number that is zero if all


the data are the same and increases as the data become more diverse.

Most measures of dispersion have the same units as the quantity being measured.
In other words, if the measurements are in metres or seconds, so is the measure
of dispersion. Examples of dispersion measures include:

 Standard deviation
 Interquartile range (IQR)
 Range
 Mean absolute difference (also known as Gini mean absolute
difference)
 Median absolute deviation (MAD)



 Average absolute deviation (or simply called average deviation)
 Distance standard deviation

These are frequently used (together with scale factors) as estimators of scale
parameters, in which capacity they are called estimates of scale. Robust
measures of scale are those unaffected by a small number of outliers, and include
the IQR and MAD.

All the above measures of statistical dispersion have the useful property that they
are location-invariant and linear in scale. This means that if a random
variable X has a dispersion of S_X, then a linear
transformation Y = aX + b for real a and b should have dispersion S_Y = |a|S_X,
where |a| is the absolute value of a, that is, any preceding negative sign is ignored.

Other measures of dispersion are dimensionless. In other words, they have no


units even if the variable itself has units. These include:

 Coefficient of variation
 Quartile coefficient of dispersion
 Relative mean difference, equal to twice the Gini coefficient
 Entropy: While the entropy of a discrete variable is location-invariant
and scale-independent, and therefore not a measure of dispersion in the
above sense, the entropy of a continuous variable is location-invariant
and additive in scale: if H_z is the entropy of a continuous variable z
and z = ax + b, then H_z = H_x + log(a).

There are other measures of dispersion:

 Variance (the square of the standard deviation) – location-invariant but


not linear in scale.
 Variance-to-mean ratio – mostly used for count data when the
term coefficient of dispersion is used and when this ratio
is dimensionless, as count data are themselves dimensionless, not
otherwise.

Some measures of dispersion have specialized purposes. The Allan variance can
be used for applications where the noise disrupts convergence. The Hadamard
variance can be used to counteract linear frequency drift sensitivity.



For categorical variables, it is less common to measure dispersion by a single
number; see qualitative variation. One measure that does so is the
discrete entropy.

Sources

In the physical sciences, such variability may result from random measurement
errors: instrument measurements are often not perfectly precise, i.e.,
reproducible, and there is additional inter-rater variability in interpreting and
reporting the measured results. One may assume that the quantity being
measured is stable, and that the variation between measurements is due
to observational error. A system of a large number of particles is characterized by
the mean values of a relatively few numbers of macroscopic quantities such as
temperature, energy, and density. The standard deviation is an important measure
in fluctuation theory, which explains many physical phenomena, including why
the sky is blue.

In the biological sciences, the quantity being measured is seldom unchanging and
stable, and the variation observed might additionally be intrinsic to the
phenomenon: It may be due to inter-individual variability, that is, distinct
members of a population differing from each other. Also, it may be due to intra-
individual variability, that is, one and the same subject differing in tests taken at
different times or in other differing conditions. Such types of variability are also
seen in the arena of manufactured products; even there, the meticulous scientist
finds variation.

In economics, finance, and other disciplines, regression analysis attempts to


explain the dispersion of a dependent variable, generally measured by its
variance, using one or more independent variables each of which itself has
positive dispersion. The fraction of variance explained is called the coefficient of
determination.

A partial ordering of dispersion

A mean-preserving spread (MPS) is a change from one probability distribution A


to another probability distribution B, where B is formed by spreading out one or
more portions of A's probability density function while leaving the mean (the
expected value) unchanged. The concept of a mean-preserving spread provides



a partial ordering of probability distributions according to their dispersions: of
two probability distributions, one may be ranked as having more dispersion than
the other, or alternatively neither may be ranked as having more dispersion.

Quartile coefficient of dispersion

In statistics, the quartile coefficient of dispersion is a descriptive statistic which


measures dispersion and which is used to make comparisons within and between
data sets. Since it is based on quantile information, it is less sensitive to outliers
than measures such as the Coefficient of variation. As such, it is one of
several Robust measures of scale.

The statistic is easily computed using the first (Q1) and third (Q3) quartiles for
each data set. The quartile coefficient of dispersion is:

(Q3 − Q1) / (Q3 + Q1)

Example

Consider the following two data sets:

A = {2, 4, 6, 8, 10, 12, 14}

n = 7, range = 12, mean = 8, median = 8, Q1 = 4, Q3 = 12, quartile
coefficient of dispersion = 0.5

B = {1.8, 2, 2.1, 2.4, 2.6, 2.9, 3}

n = 7, range = 1.2, mean = 2.4, median = 2.4, Q1 = 2, Q3 = 2.9, quartile
coefficient of dispersion = 0.18

The quartile coefficient of dispersion of data set A is 2.7 times


as great (0.5 / 0.18) as that of data set B.
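This example can be checked in Python; note that quartile conventions differ between
tools, and the default "exclusive" method of statistics.quantiles happens to reproduce
the Q1 and Q3 values quoted above.

import statistics

def quartile_coefficient_of_dispersion(data):
    # First and third quartiles via statistics.quantiles (default exclusive method)
    q1, _, q3 = statistics.quantiles(data, n=4)
    return (q3 - q1) / (q3 + q1)

A = [2, 4, 6, 8, 10, 12, 14]
B = [1.8, 2, 2.1, 2.4, 2.6, 2.9, 3]

print(round(quartile_coefficient_of_dispersion(A), 2))   # 0.5
print(round(quartile_coefficient_of_dispersion(B), 2))   # 0.18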

Deviation (statistics)

In mathematics and statistics, deviation is a measure of difference between the


observed value of a variable and some other value, often that variable's mean.
The sign of the deviation reports the direction of that difference (the deviation is
positive when the observed value exceeds the reference value). The magnitude of
the value indicates the size of the difference.

Types



A deviation that is a difference between an observed value and the true value of a
quantity of interest (where true value denotes the Expected Value, such as the
population mean) is an error.

A deviation that is the difference between the observed value and an estimate of the
true value (e.g. the sample mean; the Expected Value of a sample can be used as an
estimate of the Expected Value of the population) is a residual. These concepts are
applicable for data at the interval and ratio levels of measurement.

Unsigned or absolute deviation

In statistics, the absolute deviation of an element of a data set is the absolute


difference between that element and a given point.

Measures

Mean signed deviation

For an unbiased estimator, the average of the signed deviations across the entire set of
all observations from the unobserved population parameter value averages zero over
an arbitrarily large number of samples. However, by construction the average of
signed deviations of values from the sample mean value is always zero, though the
average signed deviation from another measure of central tendency, such as the
sample median, need not be zero.

Dispersion

Statistics of the distribution of deviations are used as measures of statistical


dispersion.

 Standard deviation is the frequently used measure of dispersion: it


uses squared deviations, and has desirable properties, but is not robust.
 Average absolute deviation, is the sum of absolute values of the
deviations divided by the number of observations.
 Median absolute deviation is a robust statistic, which uses the median,
not the mean, of absolute deviations.
 Maximum absolute deviation is a highly non-robust measure, which
uses the maximum absolute deviation.

Normalization

Deviations have units of the measurement scale (for instance, meters if measuring
lengths). One can nondimensionalize in two ways.

One way is by dividing by a measure of scale (statistical dispersion), most often either
the population standard deviation, in standardizing, or the sample standard deviation,
in studentizing (e.g., Studentized residual).

One can scale instead by location, not dispersion: the formula for a percent
deviation is the observed value minus accepted value divided by the accepted value
multiplied by 100%.

Standard deviation

In statistics, the standard deviation is a measure of the amount of variation


or dispersion of a set of values. A low standard deviation indicates that the values tend
to be close to the mean (also called the expected value) of the set, while a high
standard deviation indicates that the values are spread out over a wider range.

Standard deviation may be abbreviated SD, and is most commonly represented in


mathematical texts and equations by the lower case Greek letter σ (sigma), for the
population standard deviation, or the Latin letter s, for the sample standard deviation.

The standard deviation of a random variable, sample, statistical population, data set,
or probability distribution is the square root of its variance. It is algebraically simpler,
though in practice less robust, than the average absolute deviation.[2][3] A useful
property of the standard deviation is that, unlike the variance, it is expressed in the
same unit as the data.

The standard deviation of a population or sample and the standard error of a statistic
(e.g., of the sample mean) are quite different, but related. The sample mean's standard
error is the standard deviation of the set of means that would be found by drawing an
infinite number of repeated samples from the population and computing a mean for
each sample. The mean's standard error turns out to equal the population standard
deviation divided by the square root of the sample size, and is estimated by using the
sample standard deviation divided by the square root of the sample size. For example,
a poll's standard error (what is reported as the margin of error of the poll), is the
expected standard deviation of the estimated mean if the same poll were to be
conducted multiple times. Thus, the standard error estimates the standard deviation of



an estimate, which itself measures how much the estimate depends on the particular
sample that was taken from the population.
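As a small numerical illustration (the figures are assumed purely for the example): if a
population has a standard deviation of σ = 3 inches and a sample of n = 100
observations is drawn, the standard error of the sample mean is

standard error = σ / √n = 3 / √100 = 0.3 inches,

so the sample mean is a far more precise estimate than any single observation.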

In science, it is common to report both the standard deviation of the data (as a
summary statistic) and the standard error of the estimate (as a measure of potential
error in the findings). By convention, only effects more than two standard errors away
from a null expectation are considered "statistically significant", a safeguard against
spurious conclusion that is really due to random sampling error.

When only a sample of data from a population is available, the term standard
deviation of the sample or sample standard deviation can refer to either the above-
mentioned quantity as applied to those data, or to a modified quantity that is an
unbiased estimate of the population standard deviation (the standard deviation of the
entire population).

Basic examples

Population standard deviation of grades of eight students

Suppose that the entire population of interest is eight students in a particular class.
For a finite set of numbers, the population standard deviation is found by taking
the square root of the average of the squared deviations of the values subtracted from
their average value.

Standard deviation of average height for adult men

If the population of interest is approximately normally distributed, the standard


deviation provides information on the proportion of observations above or below
certain values. For example, the average height for adult men in the United States is
about 70 inches, with a standard deviation of around 3 inches. This means that most
men (about 68%, assuming a normal distribution) have a height within 3 inches of the
mean (67–73 inches) – one standard deviation – and almost all men (about 95%) have
a height within 6 inches of the mean (64–76 inches) – two standard deviations. If the
standard deviation were zero, then all men would be exactly 70 inches tall. If the
standard deviation were 20 inches, then men would have much more variable heights,
with a typical range of about 50–90 inches. Three standard deviations account for
99.7% of the sample population being studied, assuming the distribution is normal or
bell-shaped (see the 68–95–99.7 rule, or the empirical rule, for more information).

Estimation
One can find the standard deviation of an entire population in cases (such
as standardized testing) where every member of a population is sampled. In cases
where that cannot be done, the standard deviation σ is estimated by examining a
random sample taken from the population and computing a statistic of the sample,
which is used as an estimate of the population standard deviation. Such a statistic is
called an estimator, and the estimator (or the value of the estimator, namely the
estimate) is called a sample standard deviation, and is denoted by s (possibly with
modifiers).

Unlike in the case of estimating the population mean, for which the sample mean is a
simple estimator with many desirable properties (unbiased, efficient, maximum
likelihood), there is no single estimator for the standard deviation with all these
properties, and unbiased estimation of standard deviation is a very technically
involved problem. Most often, the standard deviation is estimated using the corrected
sample standard deviation (using N − 1), defined below, and this is often referred to
as the "sample standard deviation", without qualifiers. However, other estimators are
better in other respects: the uncorrected estimator (using N) yields lower mean
squared error, while using N − 1.5 (for the normal distribution) almost completely
eliminates bias.

Uncorrected sample standard deviation

The formula for the population standard deviation (of a finite population) can be
applied to the sample, using the size of the sample as the size of the population
(though the actual population size from which the sample is drawn may be much
larger). This estimator, denoted by sN, is known as the uncorrected sample standard
deviation, or sometimes the standard deviation of the sample (considered as the entire
population), and is defined as follows:

sN = sqrt( (1/N) × Σ (xi − x̄)² ), summing over i = 1 to N,

where xi are the observed values of the sample items and x̄ is the mean value of these
observations, while the denominator N stands for the size of the sample: this is the
square root of the sample variance, which is the average of the squared
deviations about the sample mean.

This is a consistent estimator (it converges in probability to the population value as


the number of samples goes to infinity), and is the maximum-likelihood
estimate when the population is normally distributed. However, this is a biased
estimator, as the estimates are generally too low. The bias decreases as the sample
size grows, dropping off as 1/N, and thus is most significant for small or moderate
sample sizes; for large samples the bias is below 1%. Thus, for very large sample
sizes, the uncorrected sample standard deviation is generally acceptable. This
estimator also has a uniformly smaller mean squared error than the corrected sample
standard deviation.

Corrected sample standard deviation

If the biased sample variance (the second central moment of the sample, which is a
downward-biased estimate of the population variance) is used to compute an estimate
of the population's standard deviation, the result is the uncorrected sample
standard deviation sN defined above.

Here, taking the square root introduces further downward bias, by Jensen's inequality,
due to the square root's being a concave function. The bias in the variance is easily
corrected, but the bias from the square root is more difficult to correct, and depends
on the distribution in question.

An unbiased estimator for the variance is given by applying Bessel's correction,
using N − 1 instead of N, to yield the unbiased sample variance, denoted s²:

s² = (1/(N − 1)) × Σ (xi − x̄)², summing over i = 1 to N.

This estimator is unbiased if the variance exists and the sample values are drawn
independently with replacement. N − 1 corresponds to the number of degrees of
freedom in the vector of deviations from the mean. Taking the square root, however,
reintroduces bias (because the square root is a nonlinear function, which does not
commute with the expectation), yielding the corrected sample standard deviation,
denoted by s:

s = sqrt( (1/(N − 1)) × Σ (xi − x̄)² ).

As explained above, while s² is an unbiased estimator for the population variance, s is
still a biased estimator for the population standard deviation, though markedly less
biased than the uncorrected sample standard deviation. This estimator is commonly
used and generally known simply as the "sample standard deviation". The bias may
still be large for small samples (N less than 10). As the sample size increases, the
amount of bias decreases: we obtain more information, and the difference between the
corrected and uncorrected estimates becomes smaller.
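A short Python sketch contrasting the uncorrected estimate (divide by N) with the
corrected one (divide by N − 1, Bessel's correction); the sample values are
illustrative only.

import statistics

sample = [2, 4, 4, 4, 5, 5, 7, 9]      # illustrative sample values

s_n = statistics.pstdev(sample)        # uncorrected: divides by N, treats data as the whole population
s = statistics.stdev(sample)           # corrected: divides by N - 1 (Bessel's correction)

print(s_n)   # 2.0
print(s)     # about 2.14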

Unbiased sample standard deviation

For unbiased estimation of standard deviation, there is no formula that works across
all distributions, unlike for mean and variance. Instead, s is used as a basis, and is
scaled by a correction factor to produce an unbiased estimate. For the normal

distribution, an unbiased estimator is given by s/c4, where the correction factor c4
(which depends on N) is given in terms of the Gamma function and equals

c4(N) = sqrt(2/(N − 1)) × Γ(N/2) / Γ((N − 1)/2).

This arises because the sampling distribution of the sample standard deviation follows
a (scaled) chi distribution, and the correction factor is the mean of the chi distribution.

An approximation can be given by replacing N − 1 with N − 1.5, yielding

σ̂ ≈ sqrt( (1/(N − 1.5)) × Σ (xi − x̄)² ).

The error in this approximation decays quadratically (as 1/N²), and it is suited for all
but the smallest samples or highest precision: for N = 3 the bias is equal to 1.3%, and
for N = 9 the bias is already less than 0.1%.

Confidence interval of a sampled standard deviation

The standard deviation we obtain by sampling a distribution is itself not absolutely


accurate, both for mathematical reasons (explained here by the confidence interval)
and for practical reasons of measurement (measurement error). The mathematical
effect can be described by the confidence interval or CI.

To show how a larger sample will make the confidence interval narrower, consider the
following examples: A small population of N = 2 has only 1 degree of freedom for
estimating the standard deviation. The result is that a 95% CI of the SD runs from
0.45 × SD to 31.9 × SD.

A larger population of N = 10 has 9 degrees of freedom for estimating the standard


deviation. The same computations as above give us in this case a 95% CI running
from 0.69 × SD to 1.83 × SD. So even with a sample population of 10, the actual SD
can still be almost a factor 2 higher than the sampled SD. For a sample population
N=100, this is down to 0.88 × SD to 1.16 × SD. To be more certain that the sampled
SD is close to the actual SD we need to sample a large number of points.

These same formulae can be used to obtain confidence intervals on the variance of
residuals from a least squares fit under standard normal theory, where k is now the
number of degrees of freedom for error.

Bounds on standard deviation

For a set of N > 4 data spanning a range of values R, an upper bound on the standard
deviation s is given by s = 0.6R. An estimate of the standard deviation for N > 100



data taken to be approximately normal follows from the heuristic that 95% of the area
under the normal curve lies roughly two standard deviations to either side of the
mean, so that, with 95% probability the total range of values R represents four
standard deviations so that s ≈ R/4. This so-called range rule is useful in sample
size estimation, as the range of possible values is easier to estimate than the standard
deviation. Other divisors K(N) of the range such that s ≈ R/K(N) are available for
other values of N and for non-normal distributions.
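Applying this range rule to the adult-height example quoted earlier (treating the
two-standard-deviation band of roughly 64–76 inches as an approximate range, so
R ≈ 12 inches) gives

s ≈ R / 4 = 12 / 4 = 3 inches,

which agrees with the standard deviation of about 3 inches quoted for that example.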

Identities and mathematical properties

The standard deviation is invariant under changes in location, and scales directly with
the scale of the random variable. Thus, for a constant c and random variables X and Y:

σ(c) = 0,  σ(X + c) = σ(X),  σ(cX) = |c| σ(X).

The standard deviation of the sum of two random variables can be related to their
individual standard deviations and the covariance between them:

σ(X + Y) = sqrt( var(X) + var(Y) + 2 cov(X, Y) ).

Interpretation and application

A large standard deviation indicates that the data points can spread far from the mean
and a small standard deviation indicates that they are clustered closely around the
mean.

For example, each of the three populations {0, 0, 14, 14}, {0, 6, 8, 14} and {6, 6, 8,
8} has a mean of 7. Their standard deviations are 7, 5, and 1, respectively. The third
population has a much smaller standard deviation than the other two because its
values are all close to 7. These standard deviations have the same units as the data
points themselves. If, for instance, the data set {0, 6, 8, 14} represents the ages of a
population of four siblings in years, the standard deviation is 5 years. As another
example, the population {1000, 1006, 1008, 1014} may represent the distances
traveled by four athletes, measured in meters. It has a mean of 1007 meters, and a
standard deviation of 5 meters.
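These figures can be verified directly in Python (population, not sample, standard
deviation):

import statistics

populations = {
    "A": [0, 0, 14, 14],
    "B": [0, 6, 8, 14],
    "C": [6, 6, 8, 8],
}

for name, values in populations.items():
    # pstdev divides by N, i.e. treats each list as a complete population
    print(name, statistics.mean(values), statistics.pstdev(values))

# Each population has mean 7; the standard deviations printed are 7.0, 5.0 and 1.0.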

Standard deviation may serve as a measure of uncertainty. In physical science, for


example, the reported standard deviation of a group of repeated measurements gives
the precision of those measurements. When deciding whether measurements agree
with a theoretical prediction, the standard deviation of those measurements is of
crucial importance: if the mean of the measurements is too far away from the
prediction (with the distance measured in standard deviations), then the theory being
tested probably needs to be revised. This makes sense since they fall outside the range
of values that could reasonably be expected to occur, if the prediction were correct
and the standard deviation appropriately quantified. See prediction interval.

While the standard deviation does measure how far typical values tend to be from the
mean, other measures are available. An example is the mean absolute deviation,
which might be considered a more direct measure of average distance, compared to
the root mean square distance inherent in the standard deviation.

Application examples

The practical value of understanding the standard deviation of a set of values is in


appreciating how much variation there is from the average (mean).

Experiment, industrial and hypothesis testing

Standard deviation is often used to compare real-world data against a model to test
the model. For example, in industrial applications the weight of products coming off a
production line may need to comply with a legally required value. By weighing some
fraction of the products an average weight can be found, which will always be slightly
different from the long-term average. By using standard deviations, a minimum and
maximum value can be calculated that the averaged weight will be within some very
high percentage of the time (99.9% or more). If it falls outside the range then the
production process may need to be corrected. Statistical tests such as these are
particularly important when the testing is relatively expensive, for example if the
product needs to be opened and drained and weighed, or if the product is otherwise
used up by the test.

In experimental science, a theoretical model of reality is used. Particle


physics conventionally uses a standard of "5 sigma" for the declaration of a discovery.
A five-sigma level translates to one chance in 3.5 million that a random fluctuation
would yield the result. This level of certainty was required in order to assert that a
particle consistent with the Higgs boson had been discovered in two independent
experiments at CERN, and a similar standard was applied in the declaration of the
first observation of gravitational waves and in the confirmation of global warming.

Weather

As a simple example, consider the average daily maximum temperatures for two
cities, one inland and one on the coast. It is helpful to understand that the range of
daily maximum temperatures for cities near the coast is smaller than for cities inland.
Thus, while these two cities may each have the same average maximum temperature,
the standard deviation of the daily maximum temperature for the coastal city will be
less than that of the inland city as, on any particular day, the actual maximum
temperature is more likely to be farther from the average maximum temperature for
the inland city than for the coastal one.

Finance

In finance, standard deviation is often used as a measure of the risk associated with
price-fluctuations of a given asset (stocks, bonds, property, etc.), or the risk of a
portfolio of assets (actively managed mutual funds, index mutual funds, or ETFs).
Risk is an important factor in determining how to efficiently manage a portfolio of
investments because it determines the variation in returns on the asset and/or portfolio
and gives investors a mathematical basis for investment decisions (known as mean-
variance optimization). The fundamental concept of risk is that as it increases, the
expected return on an investment should increase as well, an increase known as the
risk premium. In other words, investors should expect a higher return on an
investment when that investment carries a higher level of risk or uncertainty. When
evaluating investments, investors should estimate both the expected return and the
uncertainty of future returns. Standard deviation provides a quantified estimate of the
uncertainty of future returns.

For example, assume an investor had to choose between two stocks. Stock A over the
past 20 years had an average return of 10 percent, with a standard deviation of
20 percentage points (pp) and Stock B, over the same period, had average returns of
12 percent but a higher standard deviation of 30 pp. On the basis of risk and return, an
investor may decide that Stock A is the safer choice, because Stock B's additional two
percentage points of return is not worth the additional 10 pp standard deviation
(greater risk or uncertainty of the expected return). Stock B is likely to fall short of
the initial investment (but also to exceed the initial investment) more often than Stock
A under the same circumstances, and is estimated to return only two percent more on
average. In this example, Stock A is expected to earn about 10 percent, plus or minus
20 pp (a range of 30 percent to −10 percent), about two-thirds of the future year
returns. When considering more extreme possible returns or outcomes in future, an
investor should expect results of as much as 10 percent plus or minus 60 pp, or a



range from 70 percent to −50 percent, which includes outcomes for three standard
deviations from the average return (about 99.7 percent of probable returns).

Calculating the average (or arithmetic mean) of the return of a security over a given
period will generate the expected return of the asset. For each period, subtracting the
expected return from the actual return results in the difference from the mean.
Squaring the difference in each period and taking the average gives the overall
variance of the return of the asset. The larger the variance, the greater risk the security
carries. Finding the square root of this variance will give the standard deviation of the
investment tool in question.

Financial time series are known to be non-stationary series, whereas the statistical
calculations above, such as standard deviation, apply only to stationary series. To
apply the above statistical tools to non-stationary series, the series first must be
transformed to a stationary series, enabling use of statistical tools that now have a
valid basis from which to work.

Geometric interpretation

To gain some geometric insight and clarification, we will start with a population of
three values, x1, x2, x3. This defines a point P = (x1, x2, x3) in R³. Consider the line
L = {(r, r, r) : r ∈ R}. This is the "main diagonal" going through the origin. If our three
given values were all equal, then the standard deviation would be zero and P would
lie on L. So it is not unreasonable to assume that the standard deviation is related to
the distance of P to L.

Chebyshev's inequality

An observation is rarely more than a few standard deviations away from the mean.
Chebyshev's inequality ensures that, for all distributions for which the standard
deviation is defined, the amount of data within a number of standard deviations of the
mean is at least as much as given in the following table.
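For k standard deviations the guaranteed minimum proportion is 1 − 1/k², which gives,
for example:

At least 75% of the data within 2 standard deviations of the mean (1 − 1/4)

At least about 88.9% within 3 standard deviations (1 − 1/9)

At least 93.75% within 4 standard deviations (1 − 1/16)

At least 96% within 5 standard deviations (1 − 1/25)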

Rules for normally distributed data

The central limit theorem states that the distribution of an average of many
independent, identically distributed random variables tends toward the famous bell-
shaped normal distribution with a probability density function of

f(x) = (1 / (σ√(2π))) × exp( −(x − μ)² / (2σ²) ),

where μ is the expected value of the random variables, σ equals their distribution's
standard deviation divided by √n, and n is the number of random variables. The
standard deviation therefore is simply a scaling variable that adjusts how broad the
curve will be, though it also appears in the normalizing constant.

If a data distribution is approximately normal then about 68 percent of the data values
are within one standard deviation of the mean (mathematically, μ ± σ, where μ is the
arithmetic mean), about 95 percent are within two standard deviations (μ ± 2σ), and
about 99.7 percent lie within three standard deviations (μ ± 3σ). This is known as
the 68–95–99.7 rule, or the empirical rule.

For various values of z, the percentages of values expected to lie inside and outside
the symmetric interval CI = (−zσ, zσ) can be tabulated; for example, about 68.27% of
values lie within z = 1, about 95.45% within z = 2, and about 99.73% within z = 3.

Relationship between standard deviation and mean

The mean and the standard deviation of a set of data are descriptive statistics usually
reported together. In a certain sense, the standard deviation is a "natural" measure
of statistical dispersion if the center of the data is measured about the mean. This is
because the standard deviation from the mean is smaller than from any other point.
The precise statement is the following: suppose x1, ..., xn are real numbers and define
the function:

σ(r) = sqrt( (1/n) × Σ (xi − r)² ), summing over i = 1 to n.

Using calculus, or by completing the square, it is possible to show that σ(r) has a
unique minimum at the mean r = x̄.

Variability can also be measured by the coefficient of variation, which is the


ratio of the standard deviation to the mean. It is a dimensionless number.

Standard deviation of the mean

Often, we want some information about the precision of the mean we obtained. We
can obtain this by determining the standard deviation of the sampled mean. Assuming
statistical independence of the values in the sample, the standard deviation of the
mean is related to the standard deviation of the distribution by

σ(mean) = σ / √N,

where N is the number of observations in the sample.

Rapid calculation methods

The following two formulas can represent a running (repeatedly updated) standard
deviation. A set of two power sums s1 and s2 are computed over a set of N values of x,
denoted as x1, ..., xN:

s1 = Σ xi and s2 = Σ xi² (summing over i = 1 to N).

Given the results of these running summations, the values N, s1, s2 can be used at any
time to compute the current value of the running standard deviation:

σ = sqrt( N·s2 − s1² ) / N,

where N, as mentioned above, is the size of the set of values (or can also be regarded
as s0). Similarly, for the sample standard deviation,

s = sqrt( (N·s2 − s1²) / (N(N − 1)) ).

In a computer implementation, as the two sj sums become large, we need to


consider round-off error, arithmetic overflow, and arithmetic underflow. The method
below calculates the running sums method with reduced rounding errors. This is a
"one pass" algorithm for calculating variance of n samples without the need to store
prior data during the calculation. Applying this method to a time series will result in
successive values of standard deviation corresponding to n data points as n grows
larger with each new sample, rather than a constant-width sliding window calculation.
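As a sketch of the idea (this is one common one-pass formulation, Welford's method,
rather than the specific implementation referred to above), the running standard
deviation can be updated sample by sample without storing earlier data:

import math

def running_std(stream):
    """Yield the population standard deviation of the values seen so far,
    updated after each new sample (Welford's one-pass method)."""
    n = 0
    mean = 0.0
    m2 = 0.0                      # running sum of squared deviations from the current mean
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        yield math.sqrt(m2 / n)   # use m2 / (n - 1) instead for the sample standard deviation

print(list(running_std([2, 4, 4, 4, 5, 5, 7, 9]))[-1])   # 2.0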

Weighted calculation

When the values xi are weighted with unequal weights wi, the power sums s0, s1, s2 are
each computed as:

s0 = Σ wi, s1 = Σ wi·xi, s2 = Σ wi·xi².

And the standard deviation equations remain unchanged. s0 is now the sum of the
weights and not the number of samples N.

The incremental method with reduced rounding errors can also be applied, with some
additional complexity.

A running sum of weights must be computed for each k from 1 to n:

Wk = w1 + w2 + … + wk,

and places where 1/n is used above must be replaced by wi/Wn.

SELF-TEST 01
1) Find the variance of the observation values taken in the lab.

4.2 4.3 4 4.1

a) 0.27

b) 0.28

c) 0.3

d) 0.31



2) If the standard deviation of a data set is 0.012, find the variance.

a) 0.144

b) 0.00144

c) 0.000144

d) 0.0000144

SHORT ANSWER QUESTIONS


1) Explain the importance of the standard deviation.

2) Write a note on the standard deviation.


UNIT 01-04: PROBABILITY: SAMPLE SPACE, EVENTS, TYPES OF EVENTS,
ALGEBRA OF EVENTS, PROBABILITY OF AN EVENT, ADDITION AND
MULTIPLICATION LAWS, CONDITIONAL PROBABILITY, RANDOM
VARIABLE, PROBABILITY DISTRIBUTION OF R.V., MEAN AND
VARIANCE OF R.V.

LEARNING OBJECTIVES
After successful completion of this unit, you will be able to learn
 Probability
 Conditional probability
 Probability Distribution

Probability

Probability is the branch of mathematics concerning numerical descriptions of


how likely an event is to occur, or how likely it is that a proposition is true. The
probability of an event is a number between 0 and 1, where, roughly speaking, 0
indicates impossibility of the event and 1 indicates certainty. The higher the
probability of an event, the more likely it is that the event will occur. A simple
example is the tossing of a fair (unbiased) coin. Since the coin is fair, the two
outcomes ("heads" and "tails") are both equally probable; the probability of
"heads" equals the probability of "tails"; and since no other outcomes are
possible, the probability of either "heads" or "tails" is 1/2 (which could also be
written as 0.5 or 50%).

These concepts have been given an axiomatic mathematical formalization


in probability theory, which is used widely in areas of study such
as statistics, mathematics, science, finance, gambling, artificial
intelligence, machine learning, computer science, game theory,
and philosophy to, for example, draw inferences about the expected frequency of
events. Probability theory is also used to describe the underlying mechanics and
regularities of complex systems.

Sample space

In probability theory, the sample space (also called the sample description space)
or possibility space of an experiment or random trial is the set of all
possible outcomes or results of that experiment. A sample space is usually denoted
using set notation, and the possible ordered outcomes, or sample points, are listed



as elements in the set. It is common to refer to a sample space by the labels S, Ω,
or U (for "universal set"). The elements of a sample space may be numbers, words,
letters, or symbols. They can also be finite, countably infinite, or uncountably infinite.

Multiple sample spaces

For many experiments, there may be more than one plausible sample space available,
depending on what result is of interest to the experimenter. For example, when
drawing a card from a standard deck of fifty-two playing cards, one possibility for the
sample space could be the various ranks (Ace through King), while another could be
the suits (clubs, diamonds, hearts, or spades). A more complete description of
outcomes, however, could specify both the denomination and the suit, and a sample
space describing each individual card can be constructed as the Cartesian product of
the two sample spaces noted above (this space would contain fifty-two equally likely
outcomes). Still other sample spaces are possible, such as right-side up or upside
down, if some cards have been flipped when shuffling.

Equally likely outcomes

Equally likely outcomes are, like the name suggests, events with an equal chance of
happening. Many events have equally likely outcomes, like tossing a coin (50%
probability of heads; 50% probability of tails) or a die (1/6 probability of getting any
number on the die).

In real life, though, it's highly unusual to get equally likely outcomes for events. For
example, the probability of finding a golden ticket in a chocolate bar might be 5%,
but this doesn't contradict the idea of equally likely outcomes. Let's say there are 100
chocolate bars and five of them have a golden ticket, which gives us our 5%
probability. Each of those golden tickets represents one chance to win, and there are
five chances to win, each of which is an equally likely outcome. Other examples:

Flip a fair coin 10 times to see how many heads or tails you get. Each event (getting
heads or getting tails) is equally likely.

Roll a die 3 times and note the sequence of numbers. Each sequence of numbers
(123,234,456,…) is equally likely.

How to Find the Probability of Equally Likely Outcomes



Formally, equally likely outcomes are defined as follows:

For any sample space with N equally likely outcomes, we assign the probability
1/N to each outcome.

To find the probability of equally likely outcomes:


1. Define the sample space for an event of chance. The sample space
is all distinct outcomes. For example, if 100 lottery tickets are sold
numbered 1 through 100, the sample space is a list of all possible winning
tickets (1, 2, 3, …, 100).
2. Count the number of ways event A can occur. For this example,
let's say that event A is "picking the number 33". There is only one
way to choose the number 33 from the list of numbers 1 through
100.
3. Divide your answer from (2) by your answer from (1), giving:
1/100 or 1%.
A slightly more complicated example: let's say you were interested in calculating the
probability of choosing any ticket whose number ends in three.
1. The sample space is still a list of all possible winning tickets (1, 2, 3, …,
100).
2. Event A, "picking a ticket whose number ends in 3", has ten possibilities:
3, 13, 23, 33, 43, 53, 63, 73, 83, 93.
3. Divide your answer from (2) by your answer from (1), giving:
10/100 = 10%.

(Ref. https://www.statisticshowto.com/equally-likely-outcomes/)
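As a quick check of the worked example above, the following Python fragment (illustrative only) counts the favourable outcomes and divides by the size of the sample space.

sample_space = range(1, 101)                         # lottery tickets numbered 1..100
event_a = [n for n in sample_space if n % 10 == 3]   # tickets whose number ends in 3
print(event_a)                                       # [3, 13, 23, ..., 93]
print(len(event_a) / len(sample_space))              # 0.1, i.e. 10%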

Simple Random Sample

In statistics, inferences are made about characteristics of a population by studying


a sample of that population's individuals. In order to arrive at a sample that presents
an unbiased estimate of the true characteristics of the population, statisticians often
seek to study a simple random sample—that is, a sample in which every individual in
the population is equally likely to be included.   The result of this is that every
possible combination of individuals who could be chosen for the sample has an equal



chance to be the sample that is selected (that is, the space of simple random samples
of a given size from a given population is composed of equally likely outcomes).
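For illustration, Python's standard library can draw a simple random sample in which every individual is equally likely to be included; the population below is hypothetical.

import random

population = list(range(1, 101))           # 100 individuals, labelled 1 to 100
sample = random.sample(population, k=10)   # every individual equally likely to be chosen
print(sample)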

Infinitely large sample spaces

In an elementary approach to probability, any subset of the sample space is usually


called an event. However, this gives rise to problems when the sample space is
continuous, so that a more precise definition of an event is necessary. Under this
definition only measurable subsets of the sample space, constituting a σ-algebra over
the sample space itself, are considered events.

An example of an infinitely large sample space is measuring the lifetime of a light


bulb. The corresponding sample space would be (0, ∞).

Events in Probability

Events in probability can be defined as a set of outcomes of a random experiment.


The sample space indicates all possible outcomes of an experiment. Thus, events in
probability can also be described as subsets of the sample space.

There are many different types of events in probability. Each type of event has its own
individual properties. This classification of events in probability helps to simplify
mathematical calculations. In this article, we will learn more about events in
probability, their types and see certain associated examples.

What are Events in Probability?

Events in probability are outcomes of random experiments. Any subset of the sample
space will form events in probability. The likelihood of occurrence of events in
probability can be calculated by dividing the number of favorable outcomes by the
total number of outcomes of that experiment.

Definition of Events in Probability

Events in probability can be defined as certain likely outcomes of an experiment that


form a subset of a finite sample space. The probability of occurrence of any event will
always lie between 0 and 1. There could be many events associated with one sample
space.

Events in Probability Example


Suppose a fair die is rolled. The total number of possible outcomes will form the
sample space and are given by {1, 2, 3, 4, 5, 6}. Let an event, E, be defined as getting



an even number on the die. Then E = {2, 4, 6}. Thus, it can be seen that E is a subset
of the sample space and is an event associated with the rolling of the die.

Types of Events in Probability

There are several different types of events in probability. There can only be one
sample space for a random experiment; however, there can be many different types of
events. Some of the important events in probability are listed below.

Independent and Dependent Events


Independent events in probability are those events whose outcome does not depend on
some previous outcome. No matter how many times an experiment has been
conducted, the probability of occurrence of independent events will be the same. For
example, tossing a coin is an independent event in probability.
Dependent events in probability are events whose outcome depends on a previous
outcome. This implies that the probability of occurrence of a dependent event will be
affected by some previous outcome. For example, drawing two balls one after another
from a bag without replacement.

Impossible and Sure Events


An event that can never happen is known as an impossible event. As impossible
events in probability will never take place thus, the chance that they will occur is
always 0. For example, the sun revolving around the earth is an impossible event.

A sure event is one that will always happen. The probability of occurrence of a sure
event will always be 1. For example, the earth revolving around the sun is a sure
event.

Simple and Compound Events

If an event consists of a single point or a single result from the sample space, it is
termed a simple event. The event of getting less than 2 on rolling a fair die, denoted
as E = {1}, is an example of a simple event.

If an event consists of more than a single result from the sample space, it is called a
compound event. An example of a compound event in probability is rolling a fair die
and getting an odd number. E = {1, 3, 5}.

Complementary Events

When there are two events such that one event can occur if and only if the other does
not take place, then such events are known as complementary events in probability.
The sum of the probability of complementary events will always be equal to 1. For
example, on tossing a coin let E be defined as getting a head. Then the complement of
E is E' which will be the event of getting a tail. Thus, E and E' together make up
complementary events. Such events are mutually exclusive and exhaustive.

Mutually Exclusive Events


Events that cannot occur at the same time are known as mutually exclusive events.
Thus, mutually exclusive events in probability do not have any common outcomes.
For example, S = {10, 9, 8, 7, 6, 5, 4}, A = {4, 6, 7} and B = {10, 9, 8}. As there is
nothing common between sets A and B thus, they are mutually exclusive events.

Exhaustive Events
Exhaustive events in probability are those events that, when taken together, form the
sample space of a random experiment. In other words, a set of events out of which at
least one is sure to occur when the experiment is performed is exhaustive. For
example, the outcome of an exam is either passing or failing.

Equally Likely Events

Equally likely events in probability are those events in which the outcomes are
equally possible. For example, on tossing a coin, getting a head or getting a tail, are
equally likely events.

Source- https://www.cuemath.com/data/events-in-probability/
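The event types above can be checked with ordinary Python sets; the following sketch (illustrative only) reuses the mutually exclusive example and the die-rolling example from this section.

S = {4, 5, 6, 7, 8, 9, 10}       # sample space from the mutually exclusive example
A = {4, 6, 7}
B = {8, 9, 10}
print(A & B == set())            # True: A and B share no outcomes, so they are mutually exclusive
print(A | B == S)                # False: A and B alone are not exhaustive (5 is missing)

S_die = {1, 2, 3, 4, 5, 6}       # fair die
E = {2, 4, 6}                    # compound event "an even number"
print(S_die - E)                 # {1, 3, 5}: the complementary event E'
print(len(E) / len(S_die))       # 0.5: probability of E with equally likely outcomes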

Algebra

Algebra (from Arabic: الجبر, romanized: al-jabr, 'reunion of broken parts, bonesetting') is one of
the broad areas of mathematics. Roughly speaking, algebra is the study
of mathematical symbols and the rules for manipulating these symbols in formulas; it
is a unifying thread of almost all of mathematics.

Elementary algebra deals with the manipulation of variables as if they were numbers
and is therefore essential in all applications of mathematics. Abstract
algebra is the name given in education to the study of algebraic structures such
as groups, rings, and fields. Linear algebra, which deals with linear
equations and linear mappings, is used for modern presentations of geometry, and has
many practical applications (in weather forecasting, for example). There are many
areas of mathematics that belong to algebra, some having "algebra" in their name,
such as commutative algebra and some not, such as Galois theory.
The word algebra is not only used for naming an area of mathematics and some
subareas; it is also used for naming some sorts of algebraic structures, such as
an algebra over a field, commonly called an algebra. Sometimes, the same phrase is
used for a subarea and its main algebraic structures; for example, Boolean algebra and
a Boolean algebra. A mathematician specialized in algebra is called an algebraist.

The word algebra comes from the Arabic: ‫الجبر‬, romanized: al-jabr, lit. 'reunion of
broken parts, bonesetting', from the title of the early 9th-century book ʿIlm al-jabr wa
l-muqābala ("The Science of Restoring and Balancing") by the Persian mathematician
and astronomer al-Khwarizmi. In his work, the term al-jabr referred to the operation
of moving a term from one side of an equation to the other, while al-
muqābala ("balancing") referred to adding equal terms to both sides. Shortened to
just algeber or algebra in Latin, the word eventually entered the English language
during the 15th century, from either Spanish, Italian, or Medieval Latin. It originally
referred to the surgical procedure of setting broken or dislocated bones. The
mathematical meaning was first recorded (in English) in the 16th century.

Different meanings of "algebra"

The word "algebra" has several related meanings in mathematics, as a single word or
with qualifiers.

 As a single word without an article, "algebra" names a broad part of mathematics.
 As a single word with an article or in the plural, "an algebra" or "algebras"
denotes a specific mathematical structure, whose precise definition depends
on the context. Usually, the structure has an addition, multiplication, and
scalar multiplication (see Algebra over a field). When some authors use the
term "algebra", they make a subset of the following additional
assumptions: associative, commutative, unital, and/or finite-dimensional.
In universal algebra, the word "algebra" refers to a generalization of the
above concept, which allows for n-ary operations.
 With a qualifier, there is the same distinction:
 Without an article, it means a part of algebra, such as linear
algebra, elementary algebra (the symbol-manipulation rules taught in
elementary courses of mathematics as part of primary and secondary



education), or abstract algebra (the study of the algebraic structures for
themselves).
 With an article, it means an instance of some algebraic structure, like a Lie
algebra, an associative algebra, or a vertex operator algebra.
 Sometimes both meanings exist for the same qualifier, as in the
sentence: Commutative algebra is the study of commutative rings, which
are commutative algebras over the integers.

Algebra as a branch of mathematics

Historically, and in current teaching, the study of algebra starts with the solving of
equations, such as the quadratic equation. Then more general questions, such as
"does an equation have a solution?", "how many solutions does an equation have?",
"what can be said about the nature of the solutions?" are considered. These questions
led to extending algebra to non-numerical objects, such
as permutations, vectors, matrices, and polynomials. The structural properties of these
non-numerical objects were then formalized into algebraic structures such
as groups, rings, and fields.

Before the 16th century, mathematics was divided into only two
subfields, arithmetic and geometry. Even though some methods, which had been
developed much earlier, may be considered nowadays as algebra, the emergence of
algebra and, soon thereafter, of infinitesimal calculus as subfields of mathematics
only dates from the 16th or 17th century. From the second half of the 19th century on,
many new fields of mathematics appeared, most of which made use of both arithmetic
and geometry, and almost all of which used algebra.

Today, algebra has grown considerably and includes many branches of mathematics,
as can be seen in the Mathematics Subject Classification where none of the first level
areas (two digit entries) are called algebra. Today algebra includes section 08-General
algebraic systems, 12-Field theory and polynomials, 13-Commutative algebra, 15-
Linear and multilinear algebra; matrix theory, 16-Associative rings and algebras, 17-
Nonassociative rings and algebras, 18-Category theory; homological algebra, 19-K-
theory and 20-Group theory. Algebra is also used extensively in 11-Number
theory and 14-Algebraic geometry.

History



The roots of algebra can be traced to the ancient Babylonians, who developed an
advanced arithmetical system with which they were able to do calculations in
an algorithmic fashion. The Babylonians developed formulas to calculate solutions
for problems typically solved today by using linear equations, quadratic equations,
and indeterminate linear equations. By contrast, most Egyptians of this era, as well
as Greek and Chinese mathematics in the 1st millennium BC, usually solved such
equations by geometric methods, such as those described in the Rhind Mathematical
Papyrus, Euclid's Elements, and The Nine Chapters on the Mathematical Art. The
geometric work of the Greeks, typified in the Elements, provided the framework for
generalizing formulae beyond the solution of particular problems into more general
systems of stating and solving equations, although this would not be realized
until mathematics developed in medieval Islam. By the time of Plato, Greek
mathematics had undergone a drastic change. The Greeks created a geometric
algebra where terms were represented by sides of geometric objects, usually lines,
that had letters associated with them. Diophantus (3rd century AD) was
an Alexandrian Greek mathematician and the author of a series of books
called Arithmetica. These texts deal with solving algebraic equations, and have led,
in number theory, to the modern notion of Diophantine equation. Earlier traditions
discussed above had a direct influence on the Persian mathematician Muḥammad
ibn Mūsā al-Khwārizmī (c. 780–850). He later wrote The Compendious Book on
Calculation by Completion and Balancing, which established algebra as a
mathematical discipline that is independent of geometry and arithmetic.

The Hellenistic mathematicians Hero of Alexandria and Diophantus as well


as Indian mathematicians such as Brahmagupta, continued the traditions of Egypt
and Babylon, though Diophantus' Arithmetica and
Brahmagupta's Brāhmasphuṭasiddhānta are on a higher level. For example, the first
complete arithmetic solution written in words instead of symbols, including zero and
negative solutions, to quadratic equations was described by Brahmagupta in his
book Brahmasphutasiddhanta, published in 628 AD. Later, Persian
and Arab mathematicians developed algebraic methods to a much higher degree of
sophistication. Although Diophantus and the Babylonians used mostly special ad
hoc methods to solve equations, Al-Khwarizmi's contribution was fundamental. He
solved linear and quadratic equations without algebraic symbolism, negative
numbers or zero, thus he had to distinguish several types of equations. In the context
where algebra is identified with the theory of equations, the Greek mathematician
Diophantus has traditionally been known as the "father of algebra" and in the context
where it is identified with rules for manipulating and solving equations, Persian
mathematician al-Khwarizmi is regarded as "the father of algebra". It is open to
debate whether Diophantus or al-Khwarizmi is more entitled to be known, in the
general sense, as "the father of algebra". Those who support Diophantus point to the
fact that the algebra found in Al-Jabr is slightly more elementary than the algebra
found in Arithmetica and that Arithmetica is syncopated while Al-Jabr is fully
rhetorical. Those who support Al-Khwarizmi point to the fact that he introduced the
methods of "reduction" and "balancing" (the transposition of subtracted terms to the
other side of an equation, that is, the cancellation of like terms on opposite sides of
the equation) which the term al-jabr originally referred to, and that he gave an
exhaustive explanation of solving quadratic equations, supported by geometric proofs
while treating algebra as an independent discipline in its own right. His algebra was
also no longer concerned "with a series of problems to be resolved, but
an exposition which starts with primitive terms in which the combinations must give
all possible prototypes for equations, which henceforward explicitly constitute the
true object of study". He also studied an equation for its own sake and "in a generic
manner, insofar as it does not simply emerge in the course of solving a problem, but is
specifically called on to define an infinite class of problems".

Another Persian mathematician Omar Khayyam is credited with identifying the


foundations of algebraic geometry and found the general geometric solution of
the cubic equation. His book Treatise on Demonstrations of Problems of
Algebra (1070), which laid down the principles of algebra, is part of the body of
Persian mathematics that was eventually transmitted to Europe. Yet another Persian
mathematician, Sharaf al-Dīn al-Tūsī, found algebraic and numerical solutions to
various cases of cubic equations. He also developed the concept of a function. The
Indian mathematicians Mahavira and Bhaskara II, the Persian mathematician Al-
Karaji, and the Chinese mathematician Zhu Shijie, solved various cases of
cubic, quartic, quintic and higher-order polynomial equations using numerical
methods. In the 13th century, the solution of a cubic equation by Fibonacci is
representative of the beginning of a revival in European algebra. Abū al-Ḥasan ibn
ʿAlī al-Qalaṣādī (1412–1486) took "the first steps toward the introduction of
algebraic symbolism". He also computed Σn2, Σn3 and used the method of successive
approximation to determine square roots. François Viète's work on new algebra at
the close of the 16th century was an important step towards modern algebra. In
1637, René Descartes published La Géométrie, inventing analytic geometry and
introducing modern algebraic notation. Another key event in the further development
of algebra was the general algebraic solution of the cubic and quartic equations,
developed in the mid-16th century. The idea of a determinant was developed
by Japanese mathematician Seki Kōwa in the 17th century, followed independently
by Gottfried Leibniz ten years later, for the purpose of solving systems of
simultaneous linear equations using matrices. Gabriel Cramer also did some work
on matrices and determinants in the 18th century. Permutations were studied
by Joseph-Louis Lagrange in his 1770 paper "Réflexions sur la résolution
algébrique des équations" devoted to solutions of algebraic equations, in which he
introduced Lagrange resolvents. Paolo Ruffini was the first person to develop the
theory of permutation groups, and like his predecessors, also in the context of
solving algebraic equations.

Abstract algebra was developed in the 19th century, deriving from the interest in
solving equations, initially focusing on what is now called Galois theory, and
on constructibility issues. George Peacock was the founder of axiomatic thinking in
arithmetic and algebra. Augustus De Morgan discovered relation algebra in
his Syllabus of a Proposed System of Logic. Josiah Willard Gibbs developed an
algebra of vectors in three-dimensional space, and Arthur Cayley developed an
algebra of matrices (this is a noncommutative algebra).


Addition

Addition (usually signified by the plus symbol +) is one of the four
basic operations of arithmetic, the other three
being subtraction, multiplication and division. The addition of two whole
numbers results in the total amount or sum of those values combined. For example,
combining three apples and two apples makes a total of five apples. This observation
is equivalent to the mathematical expression "3
+ 2 = 5" (that is, "3 plus 2 is equal to 5"). Besides counting items, addition can also
be defined and executed without referring to concrete objects, using abstractions
called numbers instead, such as integers, real numbers and complex numbers.
Addition belongs to arithmetic, a branch of mathematics. In algebra, another area of
mathematics, addition can also be performed on abstract objects such
as vectors, matrices, subspaces and subgroups.

Addition has several important properties. It is commutative, meaning that order does
not matter, and it is associative, meaning that when one adds more than two numbers,
the order in which addition is performed does not matter (see Summation). Repeated
addition of 1 is the same as counting. Addition of 0 does not change a number.
Addition also obeys predictable rules concerning related operations such as
subtraction and multiplication. Performing addition is one of the simplest numerical
tasks. Addition of very small numbers is accessible to toddlers; the most basic task, 1
+ 1, can be performed by infants as young as five months, and even some members of
other animal species. In primary education, students are taught to add numbers in
the decimal system, starting with single digits and progressively tackling more
difficult problems. Mechanical aids range from the ancient abacus to the
modern computer, where research on the most efficient implementations of addition
continues to this day.

PROPERTIES

Commutativity
Addition is commutative, meaning that one can change the order of the terms in a
sum, but still get the same result. Symbolically, if a and b are any two numbers, then

a + b = b + a.

The fact that addition is commutative is known as the "commutative law of addition"
or "commutative property of addition". Some other binary operations are
commutative, such as multiplication, but many others are not, such as subtraction and
division.

Associativity

Addition is associative, which means that when three or more numbers are added
together, the order of operations does not change the result.

As an example, should the expression a + b + c be defined to mean (a + b) + c or a +


(b + c)? Given that addition is associative, the choice of definition is irrelevant. For
any three numbers a, b, and c, it is true that (a + b) + c = a + (b + c). For example, (1
+ 2) + 3 = 3 + 3 = 6 = 1 + 5 = 1 + (2 + 3).

When addition is used together with other operations, the order of


operations becomes important. In the standard order of operations, addition is a
lower priority than exponentiation, nth roots, multiplication and division, but is
given equal priority to subtraction.

Identity element

Adding zero to any number does not change the number; this means that zero is
the identity element for addition, and is also known as the additive identity. In
symbols, for every a, one has

a + 0 = 0 + a = a.

This law was first identified in Brahmagupta's Brahmasphutasiddhanta in 628 AD,


although he wrote it as three separate laws, depending on whether a is negative,
positive, or zero itself, and he used words rather than algebraic symbols. Later Indian
mathematicians refined the concept; around the year 830, Mahavira wrote, "zero
becomes the same as what is added to it", corresponding to the unary statement 0
+ a = a. In the 12th century, Bhaskara wrote, "In the addition of cipher, or subtraction
of it, the quantity, positive or negative, remains the same", corresponding to the unary
statement a + 0 = a.



Successor

Within the context of integers, addition of one also plays a special role: for any
integer a, the integer (a + 1) is the least integer greater than a, also known as
the successor of a. For instance, 3 is the successor of 2 and 7 is the successor of 6.
Because of this succession, the value of a + b can also be seen as the bth successor
of a, making addition iterated succession. For example, 6 + 2 is 8, because 8 is the
successor of 7, which is the successor of 6, making 8 the 2nd successor of 6.

Units

To numerically add physical quantities with units, they must be expressed with
common units. For example, adding 50 milliliters to 150 milliliters gives
200 milliliters. However, if a measure of 5 feet is extended by 2 inches, the sum is
62 inches, since 60 inches is synonymous with 5 feet. On the other hand, it is usually
meaningless to try to add 3 meters and 4 square meters, since those units are
incomparable; this sort of consideration is fundamental in dimensional analysis.

PERFORMING ADDITION

Innate ability

Studies on mathematical development starting around the 1980s have exploited the
phenomenon of habituation: infants look longer at situations that are unexpected. A
seminal experiment by Karen Wynn in 1992 involving Mickey Mouse dolls
manipulated behind a screen demonstrated that five-month-old infants expect 1 + 1 to
be 2, and they are comparatively surprised when a physical situation seems to imply
that 1 + 1 is either 1 or 3. This finding has since been affirmed by a variety of
laboratories using different methodologies. Another 1992 experiment with
older toddlers, between 18 and 35 months, exploited their development of motor
control by allowing them to retrieve ping-pong balls from a box; the youngest
responded well for small numbers, while older subjects were able to compute sums up
to 5.

Even some nonhuman animals show a limited ability to add, particularly primates. In
a 1995 experiment imitating Wynn's 1992 result (but using eggplants instead of
dolls), rhesus macaque and cottontop tamarin monkeys performed similarly to
human infants. More dramatically, after being taught the meanings of the Arabic
numerals 0 through 4, one chimpanzee was able to compute the sum of two



numerals without further training. More recently, Asian elephants have demonstrated
an ability to perform basic arithmetic.

Childhood learning

Typically, children first master counting. When given a problem that requires that
two items and three items be combined, young children model the situation with
physical objects, often fingers or a drawing, and then count the total. As they gain
experience, they learn or discover the strategy of "counting-on": asked to find two
plus three, children count three past two, saying "three, four, five" (usually ticking off
fingers), and arriving at five. This strategy seems almost universal; children can easily
pick it up from peers or teachers. Most discover it independently. With additional
experience, children learn to add more quickly by exploiting the commutativity of
addition by counting up from the larger number, in this case, starting with three and
counting "four, five." Eventually children begin to recall certain addition facts
("number bonds"), either through experience or rote memorization. Once some facts
are committed to memory, children begin to derive unknown facts from known ones.
For example, a child asked to add six and seven may know that 6 + 6 = 12 and then
reason that 6 + 7 is one more, or 13. Such derived facts can be found very quickly and
most elementary school students eventually rely on a mixture of memorized and
derived facts to add fluently.

Different nations introduce whole numbers and arithmetic at different ages, with
many countries teaching addition in pre-school. However, throughout the world,
addition is taught by the end of the first year of elementary school.

Decimal system

The prerequisite to addition in the decimal system is the fluent recall or derivation of
the 100 single-digit "addition facts". One could memorize all the facts by rote, but
pattern-based strategies are more enlightening and, for most people, more efficient:

Commutative property: Mentioned above, using the pattern a + b = b + a reduces the


number of "addition facts" from 100 to 55.

One or two more: Adding 1 or 2 is a basic task, and it can be accomplished through
counting on or, ultimately, intuition.

Zero: Since zero is the additive identity, adding zero is trivial. Nonetheless, in the
teaching of arithmetic, some students are introduced to addition as a process that
always increases the addends; word problems may help rationalize the "exception" of
zero.

Doubles: Adding a number to itself is related to counting by two and


to multiplication. Doubles facts form a backbone for many related facts, and students
find them relatively easy to grasp.

Near-doubles: Sums such as 6 + 7 = 13 can be quickly derived from the doubles fact 6
+ 6 = 12 by adding one more, or from 7 + 7 = 14 but subtracting one.

Five and ten: Sums of the form 5 + x and 10 + x are usually memorized early and can
be used for deriving other facts. For example, 6 + 7 = 13 can be derived from 5 + 7 =
12 by adding one more.

Making ten: An advanced strategy uses 10 as an intermediate for sums involving 8 or


9; for example, 8 + 6 = 8 + 2 + 4 = 10 + 4 = 14.

As students grow older, they commit more facts to memory, and learn to derive other
facts rapidly and fluently. Many students never commit all the facts to memory, but
can still find any basic fact quickly.

Carry

The standard algorithm for adding multidigit numbers is to align the addends
vertically and add the columns, starting from the ones column on the right. If a
column exceeds nine, the extra digit is "carried" into the next column.

7 + 9 = 16, and the digit 1 is the carry. An alternate strategy starts adding from the
most significant digit on the left; this route makes carrying a little clumsier, but it is
faster at getting a rough estimate of the sum. There are many alternative methods.
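A minimal Python sketch of the column-by-column algorithm described above (the digit-list representation is chosen only for this illustration):

from itertools import zip_longest

def add_with_carry(a_digits, b_digits):
    """Add two numbers given as lists of decimal digits, most significant first."""
    result, carry = [], 0
    # Walk the columns from the ones place (right) towards the left.
    for x, y in zip_longest(reversed(a_digits), reversed(b_digits), fillvalue=0):
        total = x + y + carry
        result.append(total % 10)   # digit written in this column
        carry = total // 10         # digit carried into the next column
    if carry:
        result.append(carry)
    return result[::-1]

print(add_with_carry([9, 5, 7], [4, 8, 9]))   # 957 + 489 = 1446 -> [1, 4, 4, 6]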

Since the end of the 20th century, some US programs, including TERC, decided to
remove the traditional carrying (transfer) method from their curriculum. This decision
was criticized, which is why some states and counties did not support this experiment.

Decimal fractions

Decimal fractions can be added by a simple modification of the above process. One
aligns two decimal fractions above each other, with the decimal point in the same
location. If necessary, one can add trailing zeros to a shorter decimal to make it the
same length as the longer decimal. Finally, one performs the same addition process as



above, except the decimal point is placed in the answer, exactly where it was placed
in the summands.

Non-decimal

Binary addition

Addition in other bases is very similar to decimal addition. As an example, one can
consider addition in binary. Adding two single-digit binary numbers is relatively
simple, using a form of carrying:

0+0→0

0+1→1

1+0→1

1 + 1 → 0, carry 1 (since 1 + 1 = 2 = 0 + (1 × 2¹))

Adding two "1" digits produces a digit "0", while 1 must be added to the next column.
This is similar to what happens in decimal when certain single-digit numbers are
added together; if the result equals or exceeds the value of the radix (10), the digit to
the left is incremented:

5 + 5 → 0, carry 1 (since 5 + 5 = 10 = 0 + (1 × 10¹))

7 + 9 → 6, carry 1 (since 7 + 9 = 16 = 6 + (1 × 10¹))

This is known as carrying. When the result of an addition exceeds the value of a
digit, the procedure is to "carry" the excess amount divided by the radix (that is,
10/10) to the left, adding it to the next positional value. This is correct since the next
position has a weight that is higher by a factor equal to the radix.
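The same carrying rule can be tried out directly in Python, which handles the radix-2 case natively (the numbers below are arbitrary):

a, b = 0b1011, 0b0110              # 11 and 6 written in binary
print(bin(a + b))                  # '0b10001', i.e. 17: the carries ripple to the left
print((1 + 1) % 2, (1 + 1) // 2)   # digit 0, carry 1 -- the rule "1 + 1 -> 0, carry 1"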

Computers

Analog computers work directly with physical quantities, so their addition


mechanisms depend on the form of the addends. A mechanical adder might represent
two addends as the positions of sliding blocks, in which case they can be added with
an averaging lever. If the addends are the rotation speeds of two shafts, they can be
added with a differential. A hydraulic adder can add the pressures in two chambers
by exploiting Newton's second law to balance forces on an assembly of pistons. The
most common situation for a general-purpose analog computer is to add



two voltages (referenced to ground); this can be accomplished roughly with
a resistor network, but a better design exploits an operational amplifier.

Addition is also fundamental to the operation of digital computers, where the


efficiency of addition, in particular the carry mechanism, is an important limitation to
overall performance.

The abacus, also called a counting frame, is a calculating tool that was in use
centuries before the adoption of the written modern numeral system and is still widely
used by merchants, traders and clerks in Asia, Africa, and elsewhere; it dates back to
at least 2700–2300 BC, when it was used in Sumer.

Blaise Pascal invented the mechanical calculator in 1642; it was the first
operational adding machine. It made use of a gravity-assisted carry mechanism. It
was the only operational mechanical calculator in the 17th century and the earliest
automatic, digital computer. Pascal's calculator was limited by its carry mechanism,
which forced its wheels to only turn one way so it could add. To subtract, the operator
had to use the Pascal's calculator's complement, which required as many steps as an
addition. Giovanni Poleni followed Pascal, building the second functional
mechanical calculator in 1709, a calculating clock made of wood that, once set up,
could multiply two numbers automatically.

Adders execute integer addition in electronic digital computers, usually using binary
arithmetic. The simplest architecture is the ripple carry adder, which follows the
standard multi-digit algorithm. One slight improvement is the carry skip design,
again following human intuition; one does not perform all the carries in
computing 999 + 1, but one bypasses the group of 9s and skips to the answer.

In practice, computational addition may be achieved via XOR and AND bitwise logical
operations in conjunction with bitshift operations, as shown in the sketch at the end of
this paragraph. Both XOR and AND gates are straightforward to realize in digital logic, allowing the
realization of full adder circuits which in turn may be combined into more complex
logical operations. In modern digital computers, integer addition is typically the
fastest arithmetic instruction, yet it has the largest impact on performance, since it
underlies all floating-point operations as well as such basic tasks
as address generation during memory access and
fetching instructions during branching. To increase speed, modern designs calculate



digits in parallel; these schemes go by such names as carry select, carry lookahead,
and the Ling pseudocarry. Many implementations are, in fact, hybrids of these last
three designs. Unlike addition on paper, addition on a computer often changes the
addends. On the ancient abacus and adding board, both addends are destroyed,
leaving only the sum. The influence of the abacus on mathematical thinking was
strong enough that early Latin texts often claimed that in the process of adding "a
number to a number", both numbers vanish. In modern times, the ADD instruction of
a microprocessor often replaces the augend with the sum but preserves the
addend. In a high-level programming language, evaluating a + b does not change
either a or b; if the goal is to replace a with the sum this must be explicitly requested,
typically with the statement a = a + b. Some languages such as C or C++ allow this to
be abbreviated as a += b.
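The pseudocode referred to above did not survive in this text; the following Python sketch of the XOR-and-carry idea is offered instead (restricted, as an assumption, to non-negative 32-bit values):

def add_bitwise(a, b):
    """Add two non-negative integers using only XOR, AND and shifts."""
    MASK = 0xFFFFFFFF              # keep the sketch within 32 bits
    while b != 0:
        carry = (a & b) << 1       # columns where both bits are 1 generate a carry
        a = (a ^ b) & MASK         # XOR adds each column while ignoring the carries
        b = carry & MASK           # feed the carries back in until none remain
    return a

print(add_bitwise(999, 1))         # 1000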

On a computer, if the result of an addition is too large to store, an arithmetic


overflow occurs, resulting in an incorrect answer. Unanticipated arithmetic overflow
is a fairly common cause of program errors. Such overflow bugs may be hard to
discover and diagnose because they may manifest themselves only for very large
input data sets, which are less likely to be used in validation tests. The Year 2000
problem was a series of bugs where overflow errors occurred due to use of a 2-digit
format for years.
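Overflow can be imitated in Python by masking a result to a fixed width, roughly what a 32-bit register does (a sketch only; real hardware and languages differ in how they report this):

def to_int32(x):
    """Interpret x as a signed 32-bit two's-complement value."""
    x &= 0xFFFFFFFF                          # keep only the lowest 32 bits
    return x - 0x100000000 if x >= 0x80000000 else x

largest = 2_147_483_647                      # largest signed 32-bit integer
print(to_int32(largest + 1))                 # -2147483648: the sum has wrapped around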

Addition of numbers

To prove the usual properties of addition, one must first define addition for the
context in question. Addition is first defined on the natural numbers. In set theory,
addition is then extended to progressively larger sets that include the natural numbers:
the integers, the rational numbers, and the real numbers. (In mathematics
education, positive fractions are added before negative numbers are even considered;
this is also the historical route.)

Natural numbers

There are two popular ways to define the sum of two natural numbers a and b. If one
defines natural numbers to be the cardinalities of finite sets, (the cardinality of a set
is the number of elements in the set), then it is appropriate to define their sum as
follows:



 Let N(S) be the cardinality of a set S. Take two disjoint sets A and B,

with N(A) = a and N(B) = b. Then a + b is defined as N(A ∪ B).

Here, A ∪ B is the union of A and B. An alternate version of this definition


allows A and B to possibly overlap and then takes their disjoint union, a mechanism
that allows common elements to be separated out and therefore counted twice.

The other popular definition is recursive:

 Let n⁺ be the successor of n, that is, the number following n in the natural
numbers, so 0⁺ = 1 and 1⁺ = 2. Define a + 0 = a. Define the general sum recursively
by a + b⁺ = (a + b)⁺. Hence 1 + 1 = 1 + 0⁺ = (1 + 0)⁺ = 1⁺ = 2.

Again, there are minor variations upon this definition in the literature. Taken literally,
the above definition is an application of the recursion theorem on the partially
ordered set N2. On the other hand, some sources prefer to use a restricted recursion
theorem that applies only to the set of natural numbers. One then considers a to be
temporarily "fixed", applies recursion on b to define a function "a +", and pastes these
unary operations for all a together to form the full binary operation.
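The recursive definition can be written out directly; the helper names below are hypothetical and only mirror the successor notation used above.

def successor(n):
    return n + 1

def add(a, b):
    """Recursive definition: a + 0 = a, and a + successor(b) = successor(a + b)."""
    if b == 0:
        return a
    return successor(add(a, b - 1))   # every natural number b > 0 is a successor

print(add(1, 1))                      # 2, matching 1 + 1 = 1 + 0+ = (1 + 0)+ = 1+ = 2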

This recursive formulation of addition was developed by Dedekind as early as 1854,


and he would expand upon it in the following decades. He proved the associative and
commutative properties, among others, through mathematical induction.

Integers

The simplest conception of an integer is that it consists of an absolute value (which is


a natural number) and a sign (generally either positive or negative). The integer zero
is a special third case, being neither positive nor negative. The corresponding
definition of addition must proceed by cases:

For an integer n, let |n| be its absolute value. Let a and b be integers. If
either a or b is zero, treat it as an identity. If a and b are both positive,
define a + b = |a| + |b|. If a and b are both negative, define a + b = −(|a| + |b|).
If a and b have different signs, define a + b to be the difference between |a| and
|b|, with the sign of the term whose absolute value is larger. As an example, −6 +
4 = −2; because −6 and 4 have different signs, their absolute values are



subtracted, and since the absolute value of the negative term is larger, the answer
is negative.

Although this definition can be useful for concrete problems, the number of cases to
consider complicates proofs unnecessarily. So the following method is commonly
used for defining integers. It is based on the remark that every integer is the
difference of two natural numbers and that two such differences, a – b and c – d, are
equal if and only if a + d = b + c. So, one can formally define the integers as
the equivalence classes of ordered pairs of natural numbers under the equivalence
relation

(a, b) ~ (c, d) if and only if a + d = b + c.

The equivalence class of (a, b) contains either (a – b, 0) if a ≥ b, or (0, b –


a) otherwise. If n is a natural number, one can denote +n the equivalence class of (n,
0), and by –n the equivalence class of (0, n). This allows identifying the natural
number n with the equivalence class +n.

A straightforward computation shows that the equivalence class of the result depends
only on the equivalence classes of the summands, and thus that this defines an
addition of equivalence classes, that is integers. Another straightforward computation
shows that this addition is the same as the above case definition.

This way of defining integers as equivalence classes of pairs of natural numbers, can
be used to embed into a group any commutative semigroup with cancellation
property. Here, the semigroup is formed by the natural numbers and the group is the
additive group of integers. The rational numbers are constructed similarly, by taking
as semigroup the nonzero integers with multiplication.

This construction has been also generalized under the name of Grothendieck
group to the case of any commutative semigroup. Without the cancellation property
the semigroup homomorphism from the semigroup into the group may be non-
injective. Originally, the Grothendieck group was, more specifically, the result of this
construction applied to the equivalence classes under isomorphisms of the objects of
an abelian category, with the direct sum as semigroup operation.

Rational numbers (fractions)



Addition of rational numbers can be computed using the least common
denominator, but a conceptually simpler definition involves only integer addition
and multiplication: a/b + c/d = (ad + bc)/(bd).

Complex numbers

Addition of two complex numbers can be done geometrically by constructing a


parallelogram.

Complex numbers are added by adding the real and imaginary parts of the
summands.

Using the visualization of complex numbers in the complex plane, the addition
has the following geometric interpretation: the sum of two complex
numbers A and B, interpreted as points of the complex plane, is the
point X obtained by building a parallelogram three of whose vertices
are O, A and B. Equivalently, X is the point such that the triangles with
vertices O, A, B, and X, B, A, are congruent.
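Python's built-in complex type adds real and imaginary parts component-wise, which also lets one check the parallelogram picture described above (the numbers are arbitrary):

A, B = 3 + 2j, 1 + 4j
X = A + B
print(X)                           # (4+6j): real parts 3 + 1, imaginary parts 2 + 4
print(X - A == B and X - B == A)   # True: O, A, X, B form a parallelogram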

Generalizations

There are many binary operations that can be viewed as generalizations of the
addition operation on the real numbers. The field of abstract algebra is centrally
concerned with such generalized operations, and they also appear in set
theory and category theory.

Abstract algebra

In linear algebra, a vector space is an algebraic structure that allows for adding any
two vectors and for scaling vectors. A familiar vector space is the set of all ordered
pairs of real numbers; the ordered pair (a,b) is interpreted as a vector from the origin
in the Euclidean plane to the point (a,b) in the plane. The sum of two vectors is
obtained by adding their individual coordinates: (a, b) + (c, d) = (a + c, b + d).

For example, (1, 2) + (3, 4) = (1 + 3, 2 + 4) = (4, 6).

Modular arithmetic

In modular arithmetic, the set of available numbers is restricted to a finite subset of


the integers, and addition "wraps around" when reaching a certain value, called the
modulus. For example, the set of integers modulo 12 has twelve elements; it inherits
an addition operation from the integers that is central to musical set theory. The set
of integers modulo 2 has just two elements; the addition operation it inherits is known
in Boolean logic as the "exclusive or" function. A similar "wrap around" operation
arises in geometry, where the sum of two angle measures is often taken to be their
sum as real numbers modulo 2π. This amounts to an addition operation on the circle,
which in turn generalizes to addition operations on many-dimensional tori.
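A small illustration of the "wrap around" behaviour in Python, covering the clock example and the mod-2 case mentioned above:

print((9 + 5) % 12)    # 2: adding 5 hours to 9 o'clock wraps around the modulus 12
print((1 + 1) % 2)     # 0: addition modulo 2 ...
print(1 ^ 1)           # 0: ... behaves like the "exclusive or" operation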

General theory

The general theory of abstract algebra allows an "addition" operation to be


any associative and commutative operation on a set. Basic algebraic structures with
such an addition operation include commutative monoids and abelian groups.

SET THEORY AND CATEGORY THEORY

A far-reaching generalization of addition of natural numbers is the addition of ordinal


numbers and cardinal numbers in set theory. These give two different
generalizations of addition of natural numbers to the transfinite. Unlike most
addition operations, addition of ordinal numbers is not commutative. Addition of
cardinal numbers, however, is a commutative operation closely related to the disjoint
union operation.

In category theory, disjoint union is seen as a particular case of


the coproduct operation, and general coproducts are perhaps the most abstract of all
the generalizations of addition. Some coproducts, such as direct sum and wedge
sum, are named to evoke their connection with addition.

Related operations

Addition, along with subtraction, multiplication and division, is considered one of the
basic operations and is used in elementary arithmetic.

Arithmetic

Subtraction can be thought of as a kind of addition—that is, the addition of


an additive inverse. Subtraction is itself a sort of inverse to addition, in that
adding x and subtracting x are inverse functions.

Given a set with an addition operation, one cannot always define a corresponding
subtraction operation on that set; the set of natural numbers is a simple example. On
the other hand, a subtraction operation uniquely determines an addition operation, an



additive inverse operation, and an additive identity; for this reason, an additive group
can be described as a set that is closed under subtraction.

Multiplication can be thought of as repeated addition. If a single term x appears in a


sum n times, then the sum is the product of n and x. If n is not a natural number, the
product may still make sense; for example, multiplication by −1 yields the additive
inverse of a number.

[Figure: a circular slide rule]

In the real and complex numbers, addition and multiplication can be interchanged by
the exponential function: e^a · e^b = e^(a + b).

This identity allows multiplication to be carried out by consulting


a table of logarithms and computing addition by hand; it also enables multiplication
on a slide rule. The formula is still a good first-order approximation in the broad
context of Lie groups, where it relates multiplication of infinitesimal group elements
with addition of vectors in the associated Lie algebra.

There are even more generalizations of multiplication than addition. In general,


multiplication operations always distribute over addition; this requirement is
formalized in the definition of a ring. In some contexts, such as the integers,
distributivity over addition and the existence of a multiplicative identity is enough to
uniquely determine the multiplication operation. The distributive property also
provides information about addition; by expanding the product (1 + 1) (a + b) in both
ways, one concludes that addition is forced to be commutative. For this reason, ring
addition is commutative in general.

Division is an arithmetic operation remotely related to addition. Since a/b = a(b−1),


division is right distributive over addition: (a + b) / c = a/c + b/c. However, division is
not left distributive over addition; 1 / (2 + 2) is not the same as 1/2 + 1/2.

Ordering

[Figure: log-log plot of x + 1 and max(x, 1) from x = 0.001 to 1000]

The maximum operation "max (a, b)" is a binary operation similar to addition. In fact,
if two nonnegative numbers a and b are of different orders of magnitude, then their
sum is approximately equal to their maximum. This approximation is extremely
useful in the applications of mathematics, for example in truncating Taylor series.



However, it presents a perpetual difficulty in numerical analysis, essentially since
"max" is not invertible. If b is much greater than a, then a straightforward calculation
of (a + b) − b can accumulate an unacceptable round-off error, perhaps even
returning zero. See also Loss of significance.
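The round-off problem mentioned above is easy to reproduce with 64-bit floating-point numbers (illustrative values only):

a, b = 1.0, 1e16
print(a + b == b)            # True: a is lost entirely next to the much larger b
print((a + b) - b)           # 0.0 rather than 1.0 -- the round-off error described above
print(max(a, b) == a + b)    # True here: the sum is effectively the maximum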

The approximation becomes exact in a kind of infinite limit; if either a or b is an


infinite cardinal number, their cardinal sum is exactly equal to the greater of the
two. Accordingly, there is no subtraction operation for infinite cardinals.

Maximization is commutative and associative, like addition. Furthermore, since


addition preserves the ordering of real numbers, addition distributes over "max" in the
same way that multiplication distributes over addition: a + max(b, c) = max(a + b, a + c).

For these reasons, in tropical geometry one replaces multiplication with


addition and addition with maximization. In this context, addition is called
"tropical multiplication", maximization is called "tropical addition", and the
tropical "additive identity" is negative infinity. Some authors prefer to replace
addition with minimization; then the additive identity is positive infinity.

Tying these observations together, tropical addition is approximately related to regular
addition through the logarithm: log(a + b) ≈ max(log a, log b),
which becomes more accurate as the base of the logarithm increases. The
approximation can be made exact by extracting a constant h, named by analogy
with Planck's constant from quantum mechanics, and taking the "classical
limit" as h tends to zero: max(a, b) = lim (h→0) h · log(e^(a/h) + e^(b/h)).

In this sense, the maximum operation is a dequantized version of addition.
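A numerical check of the "classical limit" above: as h shrinks, the addition-based expression approaches max(a, b). (The values of a, b and h are arbitrary; h must not be made so small that exp(b/h) overflows.)

import math

a, b = 2.0, 5.0
for h in (1.0, 0.5, 0.1):
    smoothed = h * math.log(math.exp(a / h) + math.exp(b / h))
    print(h, smoothed)       # roughly 5.049, 5.0012, 5.0000000000
print(max(a, b))             # 5.0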

Other ways to add

Incrementation, also known as the successor operation, is the addition of 1 to a


number.

Summation describes the addition of arbitrarily many numbers, usually more than
just two. It includes the idea of the sum of a single number, which is itself, and
the empty sum, which is zero. An infinite summation is a delicate procedure known
as a series.

Counting a finite set is equivalent to summing 1 over the set.



Integration is a kind of "summation" over a continuum, or more precisely and
generally, over a differentiable manifold. Integration over a zero-dimensional
manifold reduces to summation.

Linear combinations combine multiplication and summation; they are sums in which
each term has a multiplier, usually a real or complex number. Linear combinations
are especially useful in contexts where straightforward addition would violate some
normalization rule, such as mixing of strategies in game
theory or superposition of states in quantum mechanics.

Convolution is used to add two independent random variables defined


by distribution functions. Its usual definition combines integration, subtraction, and
multiplication. In general, convolution is useful as a kind of domain-side addition; by
contrast, vector addition is a kind of range-side addition.

Multiplication

Multiplication (often denoted by the cross symbol × , by the mid-line dot

operator ⋅ , by juxtaposition, or, on computers, by an asterisk * ) is one of the

four elementary mathematical operations of arithmetic, with the other ones


being addition, subtraction, and division. The result of a multiplication operation
is called a product.

The multiplication of whole numbers may be thought of as repeated addition;


that is, the multiplication of two numbers is equivalent to adding as many copies
of one of them, the multiplicand, as the quantity of the other one, the multiplier.
Both numbers can be referred to as factors.
Systematic generalizations of this basic definition define the multiplication of
integers (including negative numbers), rational numbers (fractions), and real
numbers.

Multiplication can also be visualized as counting objects arranged in


a rectangle (for whole numbers) or as finding the area of a rectangle whose sides
have some given lengths. The area of a rectangle does not depend on which side
is measured first—a consequence of the commutative property.



The product of two measurements is a new type of measurement. For example,
multiplying the lengths of the two sides of a rectangle gives its area . Such a
product is the subject of dimensional analysis.

The inverse operation of multiplication is division. For example, since 4


multiplied by 3 equals 12, 12 divided by 3 equals 4. Indeed, multiplication by 3,
followed by division by 3, yields the original number. The division of a number
other than 0 by itself equals 1.

Multiplication is also defined for other types of numbers, such as complex


numbers, and for more abstract constructs, like matrices. For some of these more
abstract constructs, the order in which the operands are multiplied together
matters. A listing of the many different kinds of products used in mathematics is
given in Product (mathematics).

Conditional probability
In probability theory, conditional probability is a measure of the probability of
an event occurring, given that another event (by assumption, presumption,
assertion or evidence) has already occurred. This particular method relies on
event B occurring with some sort of relationship with another event A. In this
event, the event B can be analyzed by a conditionally probability with respect to
A. If the event of interest is A and the event B is known or assumed to have
occurred, "the conditional probability of A given B", or "the probability
of A under the condition B", is usually written as P(A|B) or occasionally P B (A).
This can also be understood as the fraction of probability B that intersects with

A: .

For example, the probability that any given person has a cough on any given day may
be only 5%. But if we know or assume that the person is sick, then they are much
more likely to be coughing. For example, the conditional probability that someone
unwell (sick) is coughing might be 75%, in which case we would have
that P(Cough) = 5% and P(Cough|Sick) = 75%. Although there is a relationship
between A and B in this example, such a relationship or dependence
between A and B is not necessary, nor do they have to occur simultaneously.
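A small Python illustration of the definition P(A|B) = P(A ∩ B) / P(B); the values assumed below for P(Sick) and P(Cough and Sick) are hypothetical, chosen only so that the 75% figure from the text is reproduced:

    p_sick = 0.04               # assumed P(Sick)
    p_cough_and_sick = 0.03     # assumed P(Cough and Sick)

    p_cough_given_sick = p_cough_and_sick / p_sick   # P(A|B) = P(A and B) / P(B)
    print(p_cough_given_sick)                        # 0.75, i.e. the 75% in the example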



P(A|B) may or may not be equal to P(A) (the unconditional probability of A).
If P(A|B) = P(A), then events A and B are said to be independent: in such a case,
knowledge about either event does not alter the likelihood of each
other. P(A|B) (the conditional probability of A given B) typically differs
from P(B|A). For example, if a person has dengue fever, the person might have a
90% chance of being tested as positive for the disease. In this case, what is being
measured is that if event B (having dengue) has occurred, the probability
of A (tested as positive) given that B occurred is 90%, simply writing P(A|B) =
90%. Alternatively, if a person is tested as positive for dengue fever, they may
have only a 15% chance of actually having this rare disease due to high false
positive rates. In this case, the probability of the event B (having dengue) given
that the event A (testing positive) has occurred is 15% or P(B|A) = 15%. It should
be apparent now that falsely equating the two probabilities can lead to various
errors of reasoning, which is commonly seen through base rate fallacies.
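The two conditional probabilities can be connected through Bayes' theorem. In the sketch below the prevalence and the false-positive rate are assumed values (they are not given in the text), picked so that the result lands near the 15% quoted above:

    p_dengue = 0.01              # assumed P(B): prevalence of dengue
    p_pos_given_dengue = 0.90    # P(A|B), as stated in the text
    p_pos_given_healthy = 0.05   # assumed false-positive rate

    # Total probability of testing positive, then Bayes' theorem for P(B|A).
    p_pos = p_pos_given_dengue * p_dengue + p_pos_given_healthy * (1 - p_dengue)
    p_dengue_given_pos = p_pos_given_dengue * p_dengue / p_pos
    print(round(p_dengue_given_pos, 3))   # about 0.154, i.e. roughly 15%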

SELF-TEST 02
1) An event in probability that will never happen is called

a) Unsure event

b) Sure event

c) Possible event

d) Impossible event

2) What will be the probability of getting an odd number if a die is thrown?

a) 1/2

b) 2

c) 4/2

d) 5/2

SHORT ANSWER QUESTIONS


1) Write a note on algebra.

2) Why do we study probability in statistics?



SUMMARY
Statistics is the discipline that focuses on the collection, organization, analysis,
interpretation, and presentation of data. It begins with a statistical population or
a statistical model to be studied. When census data cannot be collected,
statisticians collect data by developing specific experiment designs and survey
samples. Two main statistical methods are used in data analysis: descriptive
statistics, which summarize data from a sample using indexes such as the mean
or standard deviation, and inferential statistics, which draw conclusions from
data that are subject to random variation. Inferences on mathematical statistics
are made under the framework of probability theory, which deals with the
analysis of random phenomena.

Sampling is the selection of a subset (a statistical sample) of individuals from


within a statistical population to estimate characteristics of the whole population.
It has lower costs and faster data collection than measuring the entire population
and can provide insights in cases where it is infeasible to measure an entire
population. Type I errors and Type II errors are recognized as two basic forms of
error. Acceptance sampling is used to determine if a production lot of material
meets the governing specifications. Sampling methods can be employed
individually or in combination.

Statistics is the collection of data and information for a particular purpose, and
the measure of central tendency is a value around which the data is centred. It is
used to represent a set of data by a representative value which would
approximately define the entire collection. The measures of central tendency are
given by various parameters, but the most commonly used ones are mean, median
and mode. Mean is the most commonly used measure of central tendency, and is
equal to the sum of all the values in the collection of data divided by the total
number of values. Median is the mid-value of the given set of data when
arranged in a particular order, and mode is the most frequent number occurring in
the data set.

The mean, median and mode of a data set are connected by the empirical relation
Mode ≈ 3 Median – 2 Mean; for example, if Mean = 43 and Median = 52, then Mode ≈
3(52) – 2(43) = 70. This relation is used to find one of the measures when the other
two measures are known. Range is the difference
between the highest and lowest data value in the set.



Statistical dispersion is the extent to which a distribution is stretched or
squeezed. It is a nonnegative real number that is zero if all the data are the same
and increases as the data become more diverse. Common measures of statistical
dispersion include variance, standard deviation, and interquartile range. All of
these measures are location-invariant and linear in scale, meaning that if a
random variable X has a dispersion of SX then a linear transformation Y = aX +
b for real a and b should have dispersion SY = |a|SX, where |a| is the absolute
value of a, that is, ignores a preceding negative sign. Other measures of
dispersion include coefficient of variation, quartile coefficient of dispersion,
relative mean difference, and entropy.

Variance-to-mean ratio, Allan variance, Hadamard variance, discrete entropy,


mean-preserving spread, and partial ordering of dispersion are measures of
dispersion used in the physical sciences, biological sciences, economics, finance,
and other disciplines. They can be used to explain the dispersion of a dependent
variable, generally measured by its variance, using one or more independent
variables each of which itself has positive dispersion. The fraction of variance
explained is called the coefficient of determination. The quartile coefficient of
dispersion is a descriptive statistic that measures dispersion and is used to make
comparisons between data sets. It is less sensitive to outliers than measures such
as the coefficient of variation and is easily computed as (Q3 – Q1)/(Q3 + Q1) from
the first (Q1) and third (Q3) quartiles of each data set.

Deviation is a measure of difference between the observed value and some other
value, often that variable's mean. The average of the signed deviations across the
entire set of all observations from the unobserved population parameter value
averages zero over an arbitrarily large number of samples. Statistics of the
distribution of deviations are used as measures of statistical dispersion, and the
standard deviation is a measure of the amount of variation or dispersion of a set
of values. It is abbreviated SD, and is most commonly represented in
mathematical texts and equations by the lower case Greek letter σ (sigma), for
the population standard deviation, or the Latin letter s, for the sample standard
deviation. The standard deviation of a random variable, sample, statistical
population, data set, or probability distribution is the square root of its variance.



The standard deviation of a population or sample and the standard error of a
statistic are related. The sample mean's standard error is the standard deviation
of the set of means that would be found by drawing an infinite number of
repeated samples from the population and computing a mean for each sample.
The standard deviation of an estimate is how much the estimate depends on the
particular sample that was taken from the population. In science, only effects
more than two standard errors away from a null expectation are considered
"statistically significant". The population standard deviation is found by taking
the square root of the average of the squared deviations of the values subtracted
from their average value.

If the population of interest is approximately normally distributed, the standard


deviation provides information on the proportion of observations above or below
certain values. The standard deviation of an entire population is estimated by
examining a random sample taken from the population and computing a statistic
of the sample, which is called a sample standard deviation. Most often, the
standard deviation is estimated using the corrected sample standard deviation
(using N − 1), defined below, but other estimators are better in other respects.
The formula for the population standard deviation (of a finite population) can be
applied to the sample, using the sample as the size of the population (though the
actual population size from which the sample is drawn may be much larger). The
uncorrected sample standard deviation, denoted by sN, is the standard deviation
of the sample (considered as the entire population).

It is a consistent estimator that converges in probability to the population value


as the number of samples goes to infinity, and is the maximum -likelihood
estimate when the population is normally distributed. The bias decreases as
sample size grows, dropping off as 1/N, and is most significant for small or
moderate sample sizes. An unbiased estimator for the variance is given by
applying Bessel's correction, using N − 1 instead of N to yield the unbiased
sample variance, denoted s2. The "sample standard deviation" is a commonly
used estimator for unbiased estimation of standard deviation. It is given by s/c4,
where the correction factor is the mean of the chi distribution.
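A short Python comparison of the uncorrected (divide by N) and Bessel-corrected (divide by N − 1) standard deviations, using the standard library's statistics module on an arbitrary sample:

    import statistics

    sample = [2, 4, 4, 4, 5, 5, 7, 9]

    sigma_n = statistics.pstdev(sample)   # uncorrected: treats the sample as the whole population
    s = statistics.stdev(sample)          # corrected sample SD, with N - 1 in the denominator
    print(sigma_n, s)                     # 2.0 versus about 2.138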

An approximation can be given by replacing N − 1 with N − 1.5. The confidence


interval of a sampled standard deviation is not absolutely accurate, both for



mathematical reasons (explained here by the confidence interval) and for
practical reasons of measurement (measurement error). A larger sample will
make the confidence interval narrower, but the actual SD can still be almost a
factor 2 higher than the sampled SD. The standard deviation is an upper bound
on the variance of residuals from a least squares fit under standard normal
theory, where k is now the number of degrees of freedom for error. It is invariant
under changes in location, scales directly with the scale of the random variable,
and is related to the individual standard deviations and covariance between them.

It is useful in sample size estimation, as the range of possible values is easier to


estimate than the standard deviation. Standard deviation is a measure of
uncertainty used in physical science, industrial and hypothesis testing, and is
used to compare real-world data against a model to test the model. It is also used
to measure how far typical values tend to be from the mean, such as the mean
absolute deviation, which is a more direct measure of average distance. The
practical value of understanding the standard deviation of a set of values is in
appreciating how much variation there is from the average. Standard deviation is
a measure of the risk associated with price-fluctuations of a given asset (stocks,
bonds, property, etc.), or the risk of a portfolio of assets (actively managed
mutual funds, index mutual funds, or ETFs).

It is an important factor in determining how to efficiently manage a portfolio of


investments because it determines the variation in returns on the asset and/or
portfolio and gives investors a mathematical basis for investment decisions.
When evaluating investments, investors should estimate both the expected return
and the uncertainty of future returns. Stock A over the past 20 years had an
average return of 10 percent, with a standard deviation of 20 percentage points
(pp) and Stock B had average returns of 12 percent but a higher standard
deviation of 30 pp. On this basis, Stock A is the
safer choice, as Stock B's additional two percentage points of return is not worth
the additional 10 pp standard deviation (greater risk or uncertainty of the
expected return). Stock B is likely to fall short of the initial investment more
often than Stock A under the same circumstances, and is estimated to return only
two percent more on average.



Calculating the average (or arithmetic mean) of the return of a security over a
given period will generate the expected return of the asset. To gain geometric
insights and clarification, we will start with a population of three values, x1, x2,
x3, and calculate the standard deviation of the investment tool in question.
Chebyshev's inequality ensures that, for all distributions for which the standard
deviation is defined, the amount of data within a number of standard deviations
of the mean is at least as much as given in the following table. The central limit
theorem states that the distribution of an average of many independent,
identically distributed random variables tends toward the bell-shaped normal
distribution centred at the mean μ. Variability can also be
measured by the coefficient of variation, which is the ratio of the standard
deviation to the mean.

The standard deviation of a sampled mean is related to the standard deviation of


the distribution. Two formulas can represent a running (repeatedly updated)
standard deviation: a set of two power sums s1 and s2 are computed over a set of
N values of x, denoted as x1, ..., xN. The running sums method with reduced
rounding errors is a "one pass" algorithm for calculating variance of n samples
without the need to store prior data. Weighted calculation is the sum of the
weights and not the number of samples N. The incremental method is an
additional complexity with a running sum of weights for each k from 1 to n.
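A minimal sketch of the basic power-sums version of this one-pass idea (without the extra safeguards against rounding error that the text alludes to); s0, s1 and s2 are the running count, sum and sum of squares:

    import math

    def running_std(values):
        """One-pass sample standard deviation from the power sums s0, s1, s2."""
        s0 = s1 = s2 = 0.0
        for x in values:
            s0 += 1        # count of values seen so far
            s1 += x        # running sum of x
            s2 += x * x    # running sum of x squared
        variance = (s0 * s2 - s1 * s1) / (s0 * (s0 - 1))   # sample variance (N - 1 denominator)
        return math.sqrt(variance)

    print(running_std([2, 4, 4, 4, 5, 5, 7, 9]))   # about 2.138, with no data stored along the way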

Probability is a branch of mathematics that uses numerical descriptions of how


likely an event is to occur, or how likely it is that a proposition is true. It is used
in areas such as statistics, mathematics, science, finance, gambling, artificial
intelligence, machine learning, computer science, game theory, and philosophy to
draw inferences about the expected frequency of events. Sample space is the set
of all possible outcomes or results of an experiment or random trial, denoted
using set notation. It can be finite, countably infinite, or uncountably infinite.
Equally likely outcomes are events with an equal chance of happening, such as
tossing a coin (50% probability of heads) or rolling a die (1/6 probability of
getting any number on the die).

In real life, the probability of finding a golden ticket in a chocolate bar might be
5%, but this doesn't contradict the idea of equally likely outcomes. To find the
Probability of Equally Likely Outcomes, we assign the probability 1/N to each



outcome. To find the probability of equally likely outcomes, statisticians must
define the sample space for an event of chance, count the number of ways event
A can occur, and divide the answer from (2) by (1). A simple random sample is a
sample in which every individual in the population is equally likely to be
included, while infinitely large sample spaces have a more precise definition of
an event. Events in probability are outcomes of random experiments that form a
subset of a finite sample space.
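For instance, the "odd number on one roll of a fair die" event can be computed directly from the favourable-over-total rule; a small Python sketch:

    from fractions import Fraction

    sample_space = [1, 2, 3, 4, 5, 6]                    # outcomes of one roll of a fair die
    event_a = [n for n in sample_space if n % 2 == 1]    # event A: an odd number appears

    p_a = Fraction(len(event_a), len(sample_space))      # each outcome has probability 1/N
    print(p_a)                                           # 1/2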

Each type of event has its own individual properties and can be calculated by
dividing the number of favorable outcomes by the total number of outcomes of
that experiment. Examples of events in probability include getting an even
number on the die, tossing a coin, drawing two balls one after another from a bag
without replacement, and impossible and sure events. The probability of
occurrence of certain events in probability is determined by the chance that they
will occur, such as impossible events, sure events, simple and compound events,
complementary events, mutually exclusive events, and exhaustive events.
Mutually exclusive events are events that cannot occur at the same time; that is,
they do not have any common outcomes. Complementary
events are those events such that one event can occur if and only if the other
does not take place, while exhaustive events are events which, when taken
together, make up the sample space of a random experiment.

Equally likely events in probability are those events in which the outcomes are
equally possible, such as on tossing a coin, getting a head or getting a tail.
Algebra (from Arabic ‫الجبر‬‎‎ (al-jabr) 'reunion of broken parts, bonesetting') is one
of the broad areas of mathematics. It is the study of mathematical symbols and
the rules for manipulating these symbols in formulas. Elementary algebra deals
with the manipulation of variables as if they were numbers, Abstract algebra is
the name given in education to the study of algebraic structures such as groups,
rings, and fields, and Linear algebra is used for modern presentations of
geometry. There are many areas of mathematics that belong to algebra, some
having "algebra" in their name, such as commutative algebra and some not, such
as Galois theory.

Algebra is not only used for naming an area of mathematics and some subareas,
but also for naming some algebraic structures, such as an algebra over a field. A



mathematician specialized in algebra is called an algebraist. Algebra is a branch
of mathematics that starts with the solving of equations and extends to non-
numerical objects such as permutations, vectors, matrices, and polynomials.
Before the 16th century, mathematics was divided into two subfields, arithmetic
and geometry. From the second half of the 19th century on, many new fields of
mathematics appeared, most of which made use of both arithmetic and geometry,
and almost all of which used algebra.

Today, algebra has grown considerably and includes many branches of


mathematics. Algebra is a mathematical discipline that is used extensively in
number theory and algebraic geometry. Its roots can be traced to the ancient
Babylonians, who developed an advanced arithmetical system with which they
were able to do calculations in an algorithmic fashion. The Greeks created a
geometric algebra where terms were represented by sides of geometric objects,
usually lines, that had letters associated with them. Diophantus (3rd century AD)
was an Alexandrian Greek mathematician and the author of a series of books
called Arithmetica, which led to the modern notion of Diophantine equation.

Muḥammad ibn Mūsā al-Khwārizmī (c. 780-850) later wrote The Compendious
Book on Calculation by Completion and Balancing, which established algebra as
a mathematical discipline independent of geometry and arithmetic. Indian
mathematicians such as Brahmagupta continued the traditions of Egypt and
Babylon, with the first complete arithmetic solution written in words instead of
symbols.

The Greek mathematician Diophantus has traditionally been known as the "father
of algebra" and the Persian mathematician al-Khwarizmi is regarded as "the
father of algebra". However, there is debate as to who is more entitled to be
known as the father of algebra due to the fact that Al -Jabr is slightly more
elementary than the algebra found in Arithmetica and that Al -Khawarizmi
introduced the methods of "reduction" and "balancing" and gave an exhaustive
explanation of solving quadratic equations. Omar Khayyam is credited with
identifying the foundations of algebraic geometry and found the general
geometric solution of the cubic equation. Sharaf al -Dīn al-Tūsī found algebraic
and numerical solutions to various cases of cubic equations. The Indian
mathematicians Mahavira and Bhaskara II, the Persian mathematician Al -Karaji,



and the Chinese mathematician Zhu Shijie also developed the concept of a
function.

In the 13th century, the solution of a cubic equation by Fibonacci marked the
beginning of a revival in European algebra. Abū al -Ḥasan ibn ʿAlī al-Qalaṣādī
took the first steps toward the introduction of algebraic symbolism, and François
Viète's work on new algebra at the close of the 16th century was an important
step towards modern algebra. René Descartes published La Géométrie in 1637,
and the general algebraic solution of the cubic and quartic equations was
developed in the mid-16th century. The idea of a determinant was developed by
Seki Kōwa in the 17th century, followed by Gottfried Leibniz ten years later.
Permutations were studied by Joseph-Louis Lagrange in his 1770 paper
"Réflexions sur la résolution algébrique des équations" and Paolo Ruffini was the
first person to develop the theory of permutation groups.

In the 19th century, abstract algebra was developed from the interest in solving
equations, and George Peacock was the founder of axiomatic thinking in
arithmetic and algebra. Probability is the branch of mathematics concerning
numerical descriptions of how likely an event is to occur, or how likely it is that
a proposition is true.

Addition is one of the four basic operations of arithmetic, the other three being
subtraction, multiplication and division. It is commutative and associative,
meaning that when one adds more than two numbers, the order in
which addition is performed does not matter. It obeys predictable rules
concerning related operations such as subtraction and multiplication, and is
accessible even to infants as young as five months. In primary education, students
are taught to add numbers in the decimal form. Addition is commutative,
meaning that one can change the order of the terms in a sum, but still get the
same result.

It is also associative, which means that when three or more numbers are added
together, the order of operations does not change the result. In the standard order
of operations, addition is a lower priority than exponentiation, nth roots,
multiplication and division, but is given equal priority to subtraction. Zero is the
identity element for addition, and was first identified in Brahmagupta's
Brahmasphutasiddhanta in 628 AD. Mahavira wrote in 830 that zero becomes the
same as what is added to it. Addition is the process of numerically adding
physical quantities with units, such as adding 50 milliliters to 150 milliliters
gives 200 milliliters, but it is usually meaningless to add 3 meters and 4 square
meters, as those units are incomparable.

Studies on mathematical development have exploited the phenomenon of


habituation, with five-month-old infants expecting 1 + 1 to be 2 and being
surprised when a physical situation implies 1 + 1 is either 1 or 3. Nonhuman
animals show a limited ability to add, particularly primates, with rhesus macaque
and cottontop tamarin monkeys performing similarly to human infants and Asian
elephants being able to compute the sum of two numerals without further
training. Children quickly learn to exploit the commutativity of
addition by counting up from the larger number; asked to add two and three, for
instance, they start with three and count "four, five". Eventually, they begin to recall certain addition facts ("number
bonds") and derive unknown facts from known ones. Different nations introduce
whole numbers and arithmetic at different ages, but throughout the world,
addition is taught by the end of the first year of elementary school.

The prerequisite to addition in the decimal system is the fluent recall or


derivation of the 100 single-digit "addition facts". Pattern-based strategies are
more enlightening and efficient than rote memorization. One or two more is a
basic task, and adding 1 or 2 is related to counting by two and doubles. Zero is
trivial, but word problems may help rationalize the "exception" of zero. Near-
doubles, five and ten, and binary addition can be used to derive facts quickly.

The standard algorithm for adding multidigit numbers is to align the addends
vertically and add the columns, starting from the ones column on the right. An
alternate strategy starts adding from the most significant digit on the left.
Decimal fractions can be added by aligning the two decimal fractions above each
other, adding trailing zeros to the shorter decimal, and placing the decimal point in
the answer exactly where it was placed in the summands. In binary addition, adding two "1" digits
produces a digit "0", while 1 must be added to the next column. This is known as
carrying, and when the result of an addition exceeds the value of a digit, the
procedure is to "carry" the excess amount divided by the radix (that is, 10/10) to
the next positional value.
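A small Python sketch of this column-by-column procedure for two whole numbers with the same number of digits; the helper name add_decimal_digits is ours, not a standard routine:

    def add_decimal_digits(a_digits, b_digits, base=10):
        """Add two equal-length digit lists (most significant digit first),
        carrying whenever a column's sum reaches the base."""
        result, carry = [], 0
        for da, db in zip(reversed(a_digits), reversed(b_digits)):
            column = da + db + carry
            result.append(column % base)   # digit kept in this column
            carry = column // base         # excess carried to the next column
        if carry:
            result.append(carry)
        return list(reversed(result))

    print(add_decimal_digits([7, 8, 5], [3, 7, 6]))   # [1, 1, 6, 1], i.e. 785 + 376 = 1161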



Analog computers work directly with physical quantities, so their addition
mechanisms depend on the form of the addends. The most common situation for
a general-purpose analog computer is to add two voltages (referenced to ground),
but a better design exploits an operational amplifier.

REFERENCES
 https://en.wikipedia.org/wiki/Statistics
 https://en.wikipedia.org/wiki/Sampling_(statistics)#Sampling_methods
 https://en.wikipedia.org/wiki/Data_type#Classes_of_data_types
 https://www.simplilearn.com/what-is-data-collection-article
 https://en.wikipedia.org/wiki/Central_tendency
 https://en.wikipedia.org/wiki/Mean
 https://en.wikipedia.org/wiki/Mode_(statistics)
 https://en.wikipedia.org/wiki/Median
 https://en.wikipedia.org/wiki/Statistical_dispersion
 https://en.wikipedia.org/wiki/Quartile_coefficient_of_dispersion
 https://en.wikipedia.org/wiki/Deviation_(statistics)
 https://www.statisticshowto.com/equally-likely-outcomes/



CREDIT 02



UNIT 02-01: CORRELATION: TYPES OF CORRELATION, METHODS OF
STUDYING CORRELATION, SCATTER DIAGRAM, KARL PEARSON'S
COEFFICIENT OF CORRELATION, SPEARMAN'S RANK CORRELATION,
MULTIPLE CORRELATION.

LEARNING OBJECTIVES
After successful completion of this unit, you will be able to learn

 Fundamentals of Correlation

 Types of correlation

Correlation

Correlation or dependence is any statistical relationship, whether causal or not,


between two random variables or bivariate data. Although in the broadest sense,
"correlation" may indicate any type of association, in statistics it normally refers to
the degree to which a pair of variables are linearly related. Familiar examples of
dependent phenomena include the correlation between the height of parents and their
offspring, and the correlation between the price of a good and the quantity the
consumers are willing to purchase, as it is depicted in the so-called demand curve.
Correlations are useful because they can indicate a predictive relationship that can be
exploited in practice. For example, an electrical utility may produce less power on a
mild day based on the correlation between electricity demand and weather. In this
example, there is a causal relationship, because extreme weather causes people to use
more electricity for heating or cooling. However, in general, the presence of a
correlation is not sufficient to infer the presence of a causal relationship
(i.e., correlation does not imply causation).
Formally, random variables are dependent if they do not satisfy a mathematical
property of probabilistic independence. In informal parlance, correlation is
synonymous with dependence. However, when used in a technical sense, correlation
refers to any of several specific types of mathematical operations between the tested
variables and their respective expected values. Essentially, correlation is the measure
of how two or more variables are related to one another.

Types



There are several different measures for the degree of correlation in data,
depending on the kind of data: principally whether the data is a measurement,
ordinal, or categorical.

Pearson

The Pearson product-moment correlation coefficient, also known as r, R,


or Pearson's r, is a measure of the strength and direction of the linear relationship
between two variables that is defined as the covariance of the variables divided
by the product of their standard deviations. This is the best-known and most
commonly used type of correlation coefficient. When the term "correlation
coefficient" is used without further qualification, it usually refers to the Pearson
product-moment correlation coefficient.

Intra-class

Intraclass correlation (ICC) is a descriptive statistic that can be used, when


quantitative measurements are made on units that are organized into groups; it
describes how strongly units in the same group resemble each other.

Rank

Rank correlation is a measure of the relationship between the rankings of two


variables, or two rankings of the same variable:
 Spearman's rank correlation coefficient is a measure of how well the
relationship between two variables can be described by a monotonic
function.
 The Kendall tau rank correlation coefficient is a measure of the
portion of ranks that match between two data sets.
 Goodman and Kruskal's gamma is a measure of the strength of
association of the cross tabulated data when both variables are
measured at the ordinal level.

Tetrachoric and polychoric

The polychoric correlation coefficient measures association between two


ordered-categorical variables. It's technically defined as the estimate of the
Pearson correlation coefficient one would obtain if:
1. The two variables were measured on a continuous scale, instead of
as ordered-category variables.
2. The two continuous variables followed a bivariate normal
distribution.

When both variables are dichotomous instead of ordered-categorical,


the polychoric correlation coefficient is called the tetrachoric correlation
coefficient.

Methods of studying correlation:

In everyday language, the term correlation refers to some kind of association.


However, in statistical terms, correlation denotes the relationship between two
quantitative variables. Correlation, in other words, is the strength of the linear
relationship between two or more continuous variables.

If a change in one variable causes a corresponding change in the other variable,


the two variables are correlated. Scatter diagrams, Karl Pearson‘s coefficient of
correlation, and Spearman‘s rank correlation are three important tools for
studying correlation. There are three types of correlation: based on the direction
of change, based on the number of variables and based on the constancy of the
ratio of change.

We can classify the correlations based on the direction of change of variables,


the number of variables studied and the constancy of the ratio change between
the variables.



Fig. 2.1 Types of Correlation

(Credit: www.embibe.com)

Type I: Based on the direction of change of variables

Correlation is classified into two types based on the direction of change of the
variables: positive correlation and negative correlation.

 Positive Correlation:
When the variables change in the same direction, the correlation is said to be
positive. The sign of a positive correlation is +1.
Example: When income rises, so does consumption and when income falls,
consumption does too.
 Negative Correlation:
When the variables move in opposite directions, the correlation is negative.
The sign of a negative correlation is –1.
 Example: Height above sea level and temperature are an example of a
negative association. It gets colder as you climb the mountain: as elevation
increases, temperature decreases.

Fig. 2.2 Types of Correlation Based on the direction of change of variables

Type II: Based upon the number of variables studied

There are three types of correlation, based on the number of variables.

 Simple Correlation:



The study of only two variables is referred to as simple correlation. The usage
of fertilisers and paddy yield is an example of a simple connection, as paddy
yield is dependent on fertiliser use.
 Multiple Correlation:
Multiple correlation is defined as the study of three or more variables at the
same time. Crimes in a city, for example, maybe influenced by illiteracy,
growing population, and unemployment, among other factors.
 Partial Correlation:
If there are three or more variables, but only two are considered while
keeping the other variables constant, the correlation is said to be partial. For
example, while controlling for weight and exercise, you would wish to
investigate if there is a link between the amount of food consumed and blood
pressure.

Type III: Based upon the constancy of the ratio of change between the variables

Correlation is classified into two types based on the consistency of the ratio of
change between the variables: linear correlation and non-linear correlation.

 Linear Correlation:
The correlation is said to be linear when the change in one variable bears a
constant ratio to the change in the other.
 Example: Y=a+bx
 Non-Linear Correlation:
If the change in one variable does not have a constant ratio to the change in
the other variables, the correlation is non-linear.
Example: Y = a + bx²

(Credit: https://www.embibe.com/exams/methods-of-studying-correlation/)

What is Karl Pearson‟s Coefficient of Correlation?


Coefficient of Correlation
According to byjus.com, a coefficient of correlation is generally applied in statistics
to calculate a relationship between two variables. The correlation shows a specific



value of the degree of a linear relationship between two variables, say X and
Y. There are various types of correlation coefficients. However, Pearson's correlation
(also known as Pearson‘s R) is the correlation coefficient that is frequently used in
linear regression.

Pearson‟s Coefficient Correlation


Karl Pearson's coefficient of correlation is an extensively used mathematical method
in which the numerical representation is applied to measure the level of relation
between linearly related variables. The coefficient of correlation is expressed by "r".

Karl Pearson Correlation Coefficient Formula

r = Σ(x – x̄)(y – ȳ) / √[Σ(x – x̄)² Σ(y – ȳ)²]

Alternative Formula (covariance formula)

r = cov(X, Y) / (σX σY)

Pearson correlation example


1. When a correlation coefficient is (+1), that means for every increase in one variable,
there is an increase of a fixed proportion in the other. For example, shoe sizes
change according to the length of the feet and are a perfect (almost) correlation.

2. When a correlation coefficient is (−1), that means for every increase in one
variable, there is a decrease of a fixed proportion in the other. For example, the
decrease in the quantity of gas in a gas tank shows a perfect (almost) inverse
correlation with speed.

3. When a correlation coefficient is (0), that means an increase in one variable is
associated with no systematic increase or decrease in the other, and the two variables are not related.

(Credit: https://byjus.com)
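A brief Python illustration of the covariance form of Pearson's r on a small made-up data set (any strongly linear pairing of values would behave similarly):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # e.g. fertiliser dose
    y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])   # e.g. paddy yield

    # Covariance of the variables divided by the product of their standard deviations.
    r = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
    print(round(r, 4))                          # about 0.999: a strong positive linear relation
    print(round(np.corrcoef(x, y)[0, 1], 4))    # the same value from numpy's built-in helper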



SPEARMAN'S RANK CORRELATION COEFFICIENT
In statistics, Spearman's rank correlation coefficient or Spearman's ρ, named
after Charles Spearman and often denoted by the Greek letter ρ (rho) or as rs, is
a nonparametric measure of rank correlation (statistical dependence between
the rankings of two variables). It assesses how well the relationship between two
variables can be described using a monotonic function.

The Spearman correlation between two variables is equal to the Pearson


correlation between the rank values of those two variables; while Pearson's
correlation assesses linear relationships, Spearman's correlation assesses monotonic
relationships (whether linear or not). If there are no repeated data values, a perfect
Spearman correlation of +1 or −1 occurs when each of the variables is a perfect
monotone function of the other.

Intuitively, the Spearman correlation between two variables will be high when
observations have a similar (or identical for a correlation of 1) rank (i.e. relative
position label of the observations within the variable: 1st, 2nd, 3rd, etc.) between the
two variables, and low when observations have a dissimilar (or fully opposed for a
correlation of −1) rank between the two variables.

Spearman's coefficient is appropriate for both continuous and discrete ordinal


variables. Both Spearman's ρ and Kendall's τ can be formulated as special cases of a
more general correlation coefficient.

DEFINITION AND CALCULATION

The Spearman correlation coefficient is defined as the Pearson correlation


coefficient between the rank variables.

For a sample of size n, the n raw scores Xi, Yi are converted to ranks R(Xi), R(Yi), and rs is
computed as

rs = ρ(R(X), R(Y)) = cov(R(X), R(Y)) / (σR(X) σR(Y))

where ρ denotes the usual Pearson correlation coefficient, but applied
to the rank variables, cov(R(X), R(Y)) is the covariance of the rank variables,

and σR(X) and σR(Y) are the standard deviations of the rank variables.

Only if all n ranks are distinct integers can it be computed using the popular
formula

rs = 1 − 6 Σ di² / (n(n² − 1))

where di = R(Xi) − R(Yi) is the difference between the two ranks of each observation, and
n is the number of observations.
Identical values are usually each assigned fractional ranks equal to the average
of their positions in the ascending order of the values, which is equivalent to
averaging over all possible permutations.

If ties are present in the data set, the simplified formula above yields incorrect

results: only if in both variables all ranks are distinct does
σR(X) σR(Y) = Var(R(X)) = Var(R(Y)) = (n² − 1)/12 hold (calculated
according to the biased variance). The first equation — normalizing by the standard
deviation — may be used even when ranks are normalized to [0, 1] ("relative
ranks") because it is insensitive both to translation and linear scaling.

The simplified method should also not be used in cases where the data set is
truncated; that is, when the Spearman's correlation coefficient is desired for the
top X records (whether by pre-change rank or post-change rank, or both), the
user should use the Pearson correlation coefficient formula given above. [5]

RELATED QUANTITIES

There are several other numerical measures that quantify the extent of statistical
dependence between pairs of observations. The most common of these is
the Pearson product-moment correlation coefficient, which is a similar
correlation method to Spearman's rank, that measures the "linear" relationships
between the raw numbers rather than between their ranks.

An alternative name for the Spearman rank correlation is the "grade


correlation"; in this, the "rank" of an observation is replaced by the "grade". In
continuous distributions, the grade of an observation is, by convention, always
one half less than the rank, and hence the grade and rank correlations are the
same in this case. More generally, the "grade" of an observation is proportional
to an estimate of the fraction of a population less than a given value, with the
half-observation adjustment at observed values. Thus, this corresponds to one
possible treatment of tied ranks. While unusual, the term "grade correlation" is
still in use.



INTERPRETATION
POSITIVE AND NEGATIVE SPEARMAN RANK CORRELATIONS

The sign of the Spearman correlation indicates the direction of association


between X (the independent variable) and Y (the dependent variable). If Y tends
to increase when X increases, the Spearman correlation coefficient is positive.
If Y tends to decrease when X increases, the Spearman correlation coefficient is
negative. A Spearman correlation of zero indicates that there is no tendency
for Y to either increase or decrease when X increases. The Spearman correlation
increases in magnitude as X and Y become closer to being perfectly monotone
functions of each other. When X and Y are perfectly monotonically related, the
Spearman correlation coefficient becomes 1. A perfectly monotone increasing
relationship implies that for any two pairs of data values (Xi, Yi) and (Xj, Yj),
Xi − Xj and Yi − Yj always have the same sign. A perfectly monotone
decreasing relationship implies that these differences always have opposite signs.

The Spearman correlation coefficient is often described as being


"nonparametric". This can have two meanings. First, a perfect Spearman
correlation results when X and Y are related by any monotonic function. Contrast
this with the Pearson correlation, which only gives a perfect value
when X and Y are related by a linear function. The other sense in which the
Spearman correlation is nonparametric is that its exact sampling distribution can
be obtained without requiring knowledge (i.e., knowing the parameters) of
the joint probability distribution of X and Y.

Example:

In this example, the raw data in the table below is used to calculate the
correlation between the IQ of a person with the number of hours spent in front
of TV per week.

Firstly, evaluate di², the squared difference between the two ranks of each observation. To do so use the following steps, reflected in the table below.

1. Sort the data by the first column. Create a new column and assign it the
ranked values 1, 2, 3, ..., n.
2. Next, sort the data by the second column. Create a fourth column and
similarly assign it the ranked values 1, 2, 3, ..., n.
3. Create a fifth column to hold the differences between the two rank columns.
4. Create one final column to hold the squared value of the difference column, di².

With the di² values found, add them to find Σ di² = 194. The value of n is 10. These values can now be
substituted back into the equation ρ = 1 − 6 Σ di² / (n(n² − 1)), which evaluates to ρ = −29/165 =
−0.175757575... with a p-value = 0.627188 (using the t-distribution).

That the value is close to zero shows that the correlation between IQ and hours
spent watching TV is very low, although the negative value suggests that the
longer the time spent watching television the lower the IQ. In the case of ties in
the original values, this formula should not be used; instead, the Pearson
correlation coefficient should be calculated on the ranks.
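A quick way to reproduce a result of this size is scipy.stats.spearmanr. The ten (IQ, TV-hours) pairs below are illustrative values chosen so that they yield the ρ ≈ −0.1758 and p ≈ 0.627 quoted above, since the original table is not reproduced here:

    from scipy import stats

    iq = [106, 100, 86, 101, 99, 103, 97, 113, 112, 110]   # hypothetical IQ values
    tv_hours = [7, 27, 2, 50, 28, 29, 20, 12, 6, 17]       # hypothetical weekly TV hours

    rho, p_value = stats.spearmanr(iq, tv_hours)
    print(round(rho, 4), round(p_value, 4))   # about -0.1758 and 0.6272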

Determining Significance

One approach to test whether an observed value of ρ is significantly different


from zero (r will always maintain −1 ≤ r ≤ 1) is to calculate the probability that
it would be greater than or equal to the observed r, given the null hypothesis, by
using a permutation test. An advantage of this approach is that it automatically
takes into account the number of tied data values in the sample and the way they
are treated in computing the rank correlation.

Another approach parallels the use of the Fisher transformation in the case of the
Pearson product-moment correlation coefficient. That is, confidence
intervals and hypothesis tests relating to the population value ρ can be carried
out using the Fisher transformation

F(r) = ½ ln((1 + r) / (1 − r)) = artanh(r).

If F(r) is the Fisher transformation of r, the sample Spearman rank correlation
coefficient, and n is the sample size, then

z = √((n − 3) / 1.06) F(r)

is a z-score for r, which approximately
follows a standard normal distribution under the null hypothesis of statistical
independence (ρ = 0).

One can also test for significance using

t = r √((n − 2) / (1 − r²))

which is distributed approximately
as Student's t-distribution with n − 2 degrees of freedom under the null
hypothesis. A justification for this result relies on a permutation argument.
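A small Python helper (the function name is ours) that evaluates both test statistics for the ρ = −29/165, n = 10 example above; the 1.06 variance factor follows the reconstruction given earlier:

    import math

    def spearman_significance(r_s, n):
        """z- and t-statistics for a sample Spearman correlation r_s of size n."""
        f = 0.5 * math.log((1 + r_s) / (1 - r_s))        # Fisher transformation F(r)
        z = math.sqrt((n - 3) / 1.06) * f                # approx. standard normal under rho = 0
        t = r_s * math.sqrt((n - 2) / (1 - r_s ** 2))    # approx. Student's t with n - 2 d.f.
        return z, t

    print(spearman_significance(-29 / 165, 10))   # roughly (-0.456, -0.505): nowhere near significant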

A generalization of the Spearman coefficient is useful in the situation where


there are three or more conditions, a number of subjects are all observed in each
of them, and it is predicted that the observations will have a particular order. For
example, a number of subjects might each be given three trials at the same task,
and it is predicted that performance will improve from trial to trial.
Correspondence analysis based on Spearman's ρ

Classic correspondence analysis is a statistical method that gives a score to every


value of two nominal variables. In this way the Pearson correlation
coefficient between them is maximized.

There exists an equivalent of this method, called grade correspondence analysis,


which maximizes Spearman's ρ or Kendall's τ.

Approximating Spearman's ρ from a stream

There are two existing approaches to approximating the Spearman's rank


correlation coefficient from streaming data. The first approach involves
coarsening the joint distribution of (X, Y). For continuous values, cutpoints are
selected for X and Y respectively, discretizing these random variables. Default
cutpoints are added at −∞ and +∞. A count matrix, denoted M, is then constructed,
where each entry stores the number of observations that fall into the two-dimensional cell
indexed by its row and column. For streaming data, when a new observation arrives, the
appropriate element of M is incremented. The Spearman's rank correlation can then be
computed, based on the count matrix M, using linear algebra operations (Algorithm
2). Note that for discrete random variables, no discretization procedure is
necessary. This method is applicable to stationary streaming data as well as large
data sets. For non-stationary streaming data, where the Spearman's rank
correlation coefficient may change over time, the same procedure can be applied,
but to a moving window of observations. When using a moving window, memory
requirements grow linearly with chosen window size.

The second approach to approximating the Spearman's rank correlation


coefficient from streaming data involves the use of Hermite series-based
estimators. These estimators, based on Hermite polynomials, allow sequential
estimation of the probability density function and cumulative distribution
function in univariate and bivariate cases. Bivariate Hermite series density
estimators and univariate Hermite series based cumulative distribution function
estimators are plugged into a large sample version of the Spearman's rank
correlation coefficient estimator, to give a sequential Spearman's correlation
estimator. This estimator is phrased in terms of linear algebra operations for
computational efficiency (equation (8) and algorithm 1 and 2). These algorithms



are only applicable to continuous random variable data, but have certain
advantages over the count matrix approach in this setting. The first advantage is
improved accuracy when applied to large numbers of observations. The second
advantage is that the Spearman's rank correlation coefficient can be computed on
non-stationary streams without relying on a moving window. Instead, the
Hermite series-based estimator uses an exponential weighting scheme to track
time-varying Spearman's rank correlation from streaming data, which has
constant memory requirements with respect to "effective" moving window size.

SELF-TEST 01
1) Which of the following are types of correlation?

a) Positive and Negative

b) Simple, Partial and Multiple

c) Linear and Nonlinear

d) All of the above

2) Which of the following statements is true for correlation analysis?

a) It is a bivariate analysis

b) It is a multivariate analysis

c) It is a univariate analysis

d) Both a and c

SHORT ANSWER QUESTIONS 01


1) What are the types of correlation?

2) Write a note on Spearman's rank correlation



UNIT 02-02: REGRESSION: USES OF REGRESSION AND PROPERTIES

LEARNING OBJECTIVES
After successful completion of this unit, you will be able to learn
 Regression
 Properties of Regression

Regression: uses of regression and properties:

Regression is a statistical method used in finance, investing, and other disciplines that
attempts to determine the strength and character of the relationship between one
dependent variable (usually denoted by Y) and a series of other variables (known as
independent variables).

Regression helps investment and financial managers to value assets and understand the
relationships between variables, such as commodity prices and the stocks of businesses
dealing in those commodities.

Regression
Regression Explained

The two basic types of regression are simple linear regression and multiple linear
regression, although there are non-linear regression methods for more complicated data
and analysis. Simple linear regression uses one independent variable to explain or
predict the outcome of the dependent variable Y, while multiple linear regression uses
two or more independent variables to predict the outcome.

Regression can help finance and investment professionals as well as professionals in


other businesses. Regression can also help predict sales for a company based on weather,
previous sales, GDP growth, or other types of conditions. The capital asset pricing
model (CAPM) is an often-used regression model in finance for pricing assets and
discovering costs of capital.

The general form of each type of regression is:

 Simple linear regression: Y = a + bX + u


 Multiple linear regression: Y = a + b1X1 + b2X2 + b3X3 + ... + btXt + u

Where:



 Y = the variable that you are trying to predict (dependent variable).
 X = the variable that you are using to predict Y (independent variable).
 a = the intercept.
 b = the slope.
 u = the regression residual.
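A minimal Python sketch of fitting the simple linear form Y = a + bX + u by ordinary least squares, using numpy.polyfit on made-up data:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # independent variable, e.g. rainfall
    y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])   # dependent variable, e.g. crop yield

    b, a = np.polyfit(x, y, deg=1)   # slope b and intercept a of Y = a + bX
    u = y - (a + b * x)              # the regression residuals
    print(round(a, 3), round(b, 3))  # intercept and slope of the fitted line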

(Source: https://www.investopedia.com/terms/r/regression.asp)

Regression analysis is a statistical method that helps us to analyze and understand the
relationship between two or more variables of interest. The process that is adapted to
perform regression analysis helps to understand which factors are important, which
factors can be ignored, and how they are influencing each other.
For regression analysis to be a successful method, we must understand the following
terms:

 Dependent Variable: This is the variable that we are trying to understand


or forecast.
 Independent Variable: These are factors that influence the analysis or
target variable and provide us with information regarding the relationship
of the variables with the target variable.
Regression Meaning
Let's understand the concept of regression with this example.
You are conducting a case study on a set of college students to understand if students
with high CGPA also get a high GRE score.
Your first task would be to collect the details of all the students.
We go ahead and collect the GRE scores and CGPAs of the students of this college.
All the GRE scores are listed in one column and the CGPAs are listed in another
column.
Now, if we are supposed to understand the relationship between these two variables,
we can draw a scatter plot.
Here, we see that there's a linear relationship between CGPA and GRE score which
means that as the CGPA increases, the GRE score also increases. This would also
mean that a student who has a high CGPA, would also have a higher probability of
getting a high GRE score.
But what if I ask, "The CGPA of the student is 8.32, what will be the GRE score of
the student?"
This is where Regression comes in. If we are supposed to find the relationship
between two variables, we can apply regression analysis.
Regression Definition –
Why is it called regression?
In regression, we normally have one dependent variable and one or more independent
variables. Here we try to "regress" the value of the dependent variable "Y" with the help
of the independent variables. In other words, we are trying to understand how the
value of 'Y' changes with respect to a change in 'X'.
What is Regression Analysis?
General Uses of Regression Analysis
Regression analysis is used for prediction and forecasting. This has a substantial
overlap to the field of machine learning. This statistical method is used across
different industries such as,

 Financial Industry- Understand the trend in the stock prices, forecast the
prices, evaluate risks in the insurance domain
 Marketing- Understand the effectiveness of market campaigns, forecast
pricing and sales of the product.
 Manufacturing- Evaluate the relationships among the variables that determine
engine design, in order to deliver better performance
 Medicine- Forecast the different combinations of medicines needed to prepare
generic medicines for diseases.
Terminologies used in Regression Analysis
Outliers
Suppose there is an observation in the dataset that has a very high or very low value
as compared to the other observations in the data, i.e. it does not seem to belong to the
population; such an observation is called an outlier. In simple words, it is an extreme
value. An outlier is a problem because many times it hampers the results we get.
Multicollinearity
When the independent variables are highly correlated to each other, then the variables
are said to be multicollinear. Many types of regression techniques assume
multicollinearity should not be present in the dataset. It is because it causes problems



in ranking variables based on their importance, or it makes the job difficult in selecting
the most important independent variable.
Heteroscedasticity
When the variation between the target variable and the independent variable is not
constant, it is called heteroscedasticity. Example: As one's income increases, the
variability of food consumption will increase. A poorer person will spend a rather
constant amount by always eating inexpensive food; a wealthier person may
occasionally buy inexpensive food and at other times, eat expensive meals. Those
with higher incomes display a greater variability of food consumption.
Underfit and Overfit
When we use unnecessary explanatory variables, it might lead to overfitting.
Overfitting means that our algorithm works well on the training set but is unable to
perform better on the test sets. It is also known as a problem of high variance.
When our algorithm works so poorly that it is unable to fit even a training set well,
then it is said to underfit the data. It is also known as a problem of high bias.

TYPES OF REGRESSION
For different types of Regression analysis, there are assumptions that need to be
considered along with understanding the nature of variables and its distribution.
Linear Regression
The simplest of all regression types is Linear Regression, which tries to establish
relationships between Independent and Dependent variables. The Dependent variable
considered here is always a continuous variable.
What is Linear Regression?
Linear Regression is a predictive model used for finding the linear relationship
between a dependent variable and one or more independent variables.



Fig. 2.3 Linear Regression
(Credit: www.mygreatlearning.com)

Here, 'Y' is our dependent variable, which is a continuous numerical variable, and we are
trying to understand how 'Y' changes with 'X'.
So, if we are supposed to answer the above question, "What will be the GRE score
of the student if his CGPA is 8.32?", our go-to option should be linear regression.

Examples of Independent & Dependent Variables:


• x is Rainfall and y is Crop Yield
• x is Advertising Expense and y is Sales
• x is sales of goods and y is GDP
If the dependent variable is explained by a single independent variable, then it is known as Simple Linear Regression
Simple Linear Regression
X —–> Y
If multiple independent variables are used to explain the dependent variable, then it is called Multiple Linear Regression
Multiple Linear Regression

Fig. 2.4 Multiple Linear Regression
(Credit: www.mygreatlearning.com)

Simple Linear Regression Model


As the model is used to predict the dependent variable, the relationship between the
variables can be written in the below format.
Yi = β0 + β1Xi + εi
Where,
Yi – Dependent variable
β0 – Intercept
β1 – Slope coefficient
Xi – Independent variable
εi – Random error term
The main factor that is considered as part of Regression analysis is understanding the
variance between the variables. For understanding the variance, we need to
understand the measures of variation.

 SST = total sum of squares (Total Variation)


o Measures the variation of the Yi values around their mean Ȳ
 SSR = regression sum of squares (Explained Variation)
o Variation attributable to the relationship between X and Y
 SSE = error sum of squares (Unexplained Variation)
o Variation in Y attributable to factors other than X
With all these factors taken into consideration, before we start assessing if the model
is doing good, we need to consider the assumptions of Linear Regression.
Assumptions:
Since Linear Regression assesses whether one or more predictor variables explain the dependent variable, it rests on 5 assumptions:
1. Linear Relationship
2. Normality
3. No or Little Multicollinearity
4. No Autocorrelation in errors
5. Homoscedasticity
With these assumptions considered while building the model, we can build the model
and do our predictions for the dependent variable. For any type of machine learning
model, we need to understand if the variables considered for the model are correct
and have been analysed by a metric. In the case of Regression analysis, the statistical
measure that evaluates the model is called the coefficient of determination which is
represented as r².
The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variable. The higher the value of r², the better the model fits with the independent variables being considered.
r² = SSR / SST
Note: The value of r² lies in the range 0 ≤ r² ≤ 1
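As a rough illustration, the minimal Python sketch below computes the slope, intercept and the measures of variation just described. It assumes NumPy is available, and the x and y values (a CGPA-like score and a GRE-like score) are invented purely for illustration.

import numpy as np

# Illustrative data only: x could be a CGPA-like score, y a GRE-like score
x = np.array([6.5, 7.0, 7.8, 8.2, 8.9, 9.4])
y = np.array([290.0, 300.0, 305.0, 315.0, 325.0, 332.0])

# Least-squares estimates of the slope (beta1) and intercept (beta0)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x

# Measures of variation
sst = np.sum((y - y.mean()) ** 2)      # total variation
ssr = np.sum((y_hat - y.mean()) ** 2)  # explained variation
sse = np.sum((y - y_hat) ** 2)         # unexplained variation
r2 = ssr / sst                         # coefficient of determination

print("Y =", round(beta0, 2), "+", round(beta1, 2), "X, r2 =", round(r2, 3))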
Polynomial Regression
This type of regression technique is used to model nonlinear equations by taking
polynomial functions of independent variables.
In the figure given below, you can see the red curve fits the data better than the green
curve. Hence in the situations where the relationship between the dependent and
independent variable seems to be non-linear, we can deploy Polynomial Regression
Models.

Thus a polynomial of degree k in one variable is written as: y = a₀ + a₁x + a₂x² + … + aₖxᵏ
Here we can create new features such as x₁ = x, x₂ = x², …, xₖ = xᵏ and can fit linear regression on them in a similar manner.


In the case of multiple variables, say X1 and X2, we can create a third new feature (say X3) which is the product of X1 and X2, i.e. X3 = X1 · X2.
The main drawback of this type of regression model is that creating unnecessary extra features or fitting polynomials of a higher degree may lead to overfitting of the model.
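A minimal sketch of this idea follows, assuming NumPy is available; the data are synthetic and generated only to show that a quadratic fit captures a curved relationship better than a straight line.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 2.0 + 1.5 * x - 0.3 * x ** 2 + rng.normal(0.0, 1.0, size=x.size)  # synthetic non-linear data

# Fit a straight line (degree 1) and a quadratic (degree 2) polynomial
coeffs_lin = np.polyfit(x, y, deg=1)
coeffs_quad = np.polyfit(x, y, deg=2)

# Compare the fits through their residual sums of squares
rss_lin = np.sum((y - np.polyval(coeffs_lin, x)) ** 2)
rss_quad = np.sum((y - np.polyval(coeffs_quad, x)) ** 2)
print("RSS degree 1:", round(rss_lin, 2), " RSS degree 2:", round(rss_quad, 2))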
Logistic Regression
Logistic Regression, also known as the Logit or Maximum-Entropy classifier, is a supervised learning method for classification. It establishes a relation between dependent class variables and independent variables using regression.
The dependent variable is categorical i.e. it can take only integral values representing
different classes. The probabilities describing the possible outcomes of a query point
are modelled using a logistic function. This model belongs to a family of
discriminative classifiers. They rely on attributes which discriminate the classes well.
This model is used when we have 2 classes of dependent variables. When there are
more than 2 classes, then we have another regression method which helps us to
predict the target variable better.
There are two broad categories of Logistic Regression algorithms
1. Binary Logistic Regression, when the dependent variable is strictly binary
2. Multinomial Logistic Regression, when the dependent variable has multiple categories.
There are two types of Multinomial Logistic Regression

1. Ordered Multinomial Logistic Regression (dependent variable has ordered values)
2. Nominal Multinomial Logistic Regression (dependent variable has unordered categories)
Process Methodology:
Logistic regression takes into consideration the different classes of dependent
variables and assigns probabilities to the event happening for each row of
information. These probabilities are found by assigning different weights to each
independent variable by understanding the relationship between the variables. If the
correlation between the variables is high, then positive weights are assigned and in
the case of an inverse relationship, negative weight is assigned.
As the model is mainly used to classify the target variable as either 0 or 1, the Sigmoid function is applied to the weighted combination of the independent variables to obtain these probabilities.
The Sigmoid function:
P(y = 1) = Sigmoid(z) = 1 / (1 + e^(-z))
P(y = 0) = 1 − P(y = 1) = e^(-z) / (1 + e^(-z))
y = 1 if P(y = 1 | X) > 0.5, else y = 0
where the default probability cut off is taken as 0.5.

This transformation is also called the log-odds (logit) ratio.
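The decision rule above can be illustrated with a minimal Python sketch (assuming NumPy is available); the linear scores z are made-up values standing in for weighted combinations of independent variables.

import numpy as np

def sigmoid(z):
    # Logistic (sigmoid) function: maps a linear score z to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, -0.3, 0.0, 0.8, 2.5])   # illustrative linear scores (weighted sums)
p_class1 = sigmoid(z)                        # P(y = 1)
y_pred = (p_class1 > 0.5).astype(int)        # default probability cut-off of 0.5

print(p_class1.round(3))   # probabilities of class 1
print(y_pred)              # predicted classes (0 or 1)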


Assumptions:
1. The dependent variable is categorical. Dichotomous for binary logistic regression
and multi-label for multi-class classification
2. The log odds, i.e. log(p / (1 − p)), should be linearly related to the independent variables
3. Attributes are independent of each other (low or no multicollinearity)
4. In binary logistic regression, the class of interest is coded with 1 and the other class with 0
5. In multi-class classification using Multinomial Logistic Regression or the OVR scheme, the class of interest is coded 1 and the rest 0 (this is done by the algorithm)
Note: the assumptions of Linear Regression such as homoscedasticity, normal
distribution of error terms, a linear relationship between the dependent and
independent variables are not required here.
Some examples where this model can be used for predictions.
1. Predicting the weather: you can only have a few definite weather types. Stormy,
sunny, cloudy, rainy and a few more.
2. Medical diagnosis: given the symptoms predicted the disease patient is suffering
from.
3. Credit Default: whether a loan should be given to a particular candidate, depending on his identity check, account summary, any properties he holds, any previous loans, etc.
4. HR Analytics: IT firms recruit a large number of people, but one of the problems
they encounter is after accepting the job offer many candidates do not join. So, this
results in cost overruns because they have to repeat the entire process again. Now
when you get an application, can you actually predict whether that applicant is likely
to join the organization (Binary Outcome – Join / Not Join).
5. Elections: Suppose that we are interested in the factors that influence whether a
political candidate wins an election. The outcome (response) variable is binary (0/1);
win or lose. The predictor variables of interest are the amount of money spent on the
campaign and the amount of time spent campaigning negatively.
Linear Discriminant Analysis (LDA)
Discriminant Analysis is used for classifying observations to a class or category based
on predictor (independent) variables of the data.
Discriminant Analysis creates a model to predict future observations where the
classes are known.
LDA comes to our rescue in situations when logistic regression is unstable, i.e. when
1. Classes are well separated
2. The dataset is small
3. We have more than 2 classes
Working Process of LDA Model
The LDA model uses Bayes‘ Theorem to estimate probabilities. They make
predictions upon the probability that a new input dataset belongs to each class. The
class which has the highest probability is considered as the output class and then the
LDA makes a prediction.
The prediction is made simply by the use of Bayes‘ theorem which estimates the
probability of the output class given the input. They also make use of the probability
of each class and also the data belonging to that class:
P(Y = k | X = x) = (πk · fk(x)) / Σl (πl · fl(x))
Where
k = output class
πk = Nk / n, the base probability of each class observed in the training data. It is also called the prior probability in Bayes‘ theorem.
fk(x) = estimated probability (density) of x belonging to class k.
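A minimal sketch of this prediction step, assuming scikit-learn and NumPy are available, is given below; the two-class data are synthetic and chosen only so that the classes have different means with a shared spread.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Two synthetic classes with different means and a shared spread (the LDA setting)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(2.5, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

x_new = np.array([[1.2, 1.0]])
print(lda.predict_proba(x_new))   # posterior probability of each class for the new point
print(lda.predict(x_new))         # class with the highest posterior probability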
Regularized Linear Models
This method is used to address overfitting, which shows up as the model performing poorly on test data. It does so by adding a penalty term to the objective function, which shrinks the coefficients and reduces the variance of the model.
Regularization is generally useful in the following situations:

 A large number of variables
 A low ratio of the number of observations to the number of variables
 High multicollinearity
L1 Loss function or L1 Regularization
In L1 regularization we try to minimize the objective function by adding a penalty
term to the sum of the absolute values of coefficients. This is also known as the least
absolute deviations method. Lasso Regression (Least Absolute Shrinkage and Selection Operator) makes use of L1 regularization: it penalizes the sum of the absolute values of the coefficients.
The cost function for lasso regression is
Min(||Y − Xθ||² + λ||θ||₁)
λ is the hyperparameter; its value corresponds to the alpha parameter in the Lasso function.
Lasso is generally used when we have a larger number of features, because it automatically performs feature selection.
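A minimal sketch of the feature-selection effect, assuming scikit-learn and NumPy are available, is shown below; the data are synthetic, with only the first two features carrying any signal.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
# Only the first two features actually matter in this synthetic example
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

lasso = Lasso(alpha=0.1)   # alpha plays the role of lambda in the cost function above
lasso.fit(X, y)
print(lasso.coef_)         # coefficients of the irrelevant features are shrunk towards zero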
L2 Loss function or L2 Regularization
In L2 regularization we try to minimize the objective function by adding a penalty
term to the sum of the squares of coefficients. Ridge Regression or shrinkage
regression makes use of L2 regularization: this model penalizes the squares of the coefficients.
The cost function for ridge regression is
Min(||Y − Xθ||² + λ||θ||²)
λ is the penalty term; it is denoted by the alpha parameter in the ridge function. So by changing the value of alpha, we are basically controlling
the penalty term. The higher the value of alpha, the bigger the penalty and therefore the more the magnitude of the coefficients is reduced.
It shrinks the parameters, and is therefore mostly used to prevent multicollinearity.
It reduces the model complexity by coefficient shrinkage.
The value of alpha is a hyperparameter of Ridge, which means that it is not learned automatically by the model and has to be set manually.
A combination of both the Lasso and Ridge regression methods gives rise to a method called Elastic Net Regression, where the cost function is:
Min(||Y − Xθ||² + λ₁||θ||₁ + λ₂||θ||²)
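The sketch below compares Ridge and Elastic Net on synthetic data, assuming scikit-learn and NumPy are available; the alpha and l1_ratio values are arbitrary illustrative choices, not recommendations.

import numpy as np
from sklearn.linear_model import Ridge, ElasticNet

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
y = 1.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=100)

ridge = Ridge(alpha=1.0)                    # alpha controls the L2 penalty term
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)  # mixes the L1 and L2 penalties
ridge.fit(X, y)
enet.fit(X, y)

print("Ridge coefficients:      ", ridge.coef_.round(3))   # shrunk, but rarely exactly zero
print("Elastic Net coefficients:", enet.coef_.round(3))    # some may be driven exactly to zero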
What mistakes do people make when working with regression analysis?
When working with regression analysis, it is important to understand the problem
statement properly. If the problem statement talks about forecasting, we should
probably use linear regression. If the problem statement talks about binary
classification, we should use logistic regression. Similarly, depending on the problem
statement we need to evaluate all our regression models.
Uses of Regression Analysis
Regression analysis is a branch of statistical theory that is widely used in almost
all the scientific disciplines. In economics it is the basic technique for measuring
or estimating the relationship among economic variables that constitute the
essence of economic theory and economic life. The uses of regression are not
confined to economics and business fields only. Its applications are extended to
almost all the natural, physical, and social sciences. The regression analysis
attempts to accomplish the following:
1. Regression analysis provides estimates of values of the dependent variable
from values of the independent variable. The device used to accomplish this
estimation procedure is the regression line. The regression line describes the average relationship existing between the X and Y variables, i.e., it displays mean values of Y for given values of X. The equation of this line, known as the
regression equation, provides estimates of the dependent variable when values of
the independent variable are inserted into the equation.

2. A second goal of regression analysis is to obtain a measure of the error involved in using the regression line as a basis for estimation. For this purpose the standard error of estimate is calculated; it measures the scatter of the observed values around the corresponding values estimated from the regression line. If the line fits the data closely, that is, if there is little scatter of the observations around the regression line, good estimates can be made of the Y variable. On the other hand, if there is a great deal of scatter of the observations around the fitted regression line, the line will not produce accurate estimates of the dependent variable.
3. With the help of regression coefficients, we can calculate the correlation coefficient. The square of the correlation coefficient (r), called the coefficient of determination, measures the degree of association or correlation that exists between the two variables. It gives the proportion of variance in the dependent variable that has been accounted for by the regression equation. In general, the greater the value of r², the better is the fit and the more useful the regression equation as a predictive device.

Properties of Regression Coefficient


Statistics refers to the study of the analysis, interpretation, collection, presentation,
and organization of data. Statistics find applications in different fields such as
psychology, geology, sociology, weather forecasting, probability, and much more.
Regression coefficients are an important topic in statistics. They are a statistical
measure that is used to measure the average functional relationship between variables.
In regression analysis, one variable is dependent and the other is independent. It also
measures the degree of dependence of one variable on the other variables. In this
article, we come across the important properties of regression coefficient.

Regression coefficients determine the slope of the line, which is the change in the dependent variable for a unit change in the independent variable. So they are also known as slope coefficients. They are classified in three ways: simple, partial and multiple; positive and negative; and linear and non-linear.

In the linear regression line, the equation is given by Y = b₀ + b₁X. Here b₀ is a constant and b₁ is the regression coefficient.

The formula for the regression coefficient is given below.

b₁ = ∑[(xᵢ − x̄)(yᵢ − ȳ)] / ∑[(xᵢ − x̄)²]

The observed data are given by xᵢ and yᵢ; x̄ and ȳ are their mean values.
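As a rough check of this formula, the sketch below computes the coefficients of y on x and of x on y with NumPy (the data values are invented for illustration) and verifies that their geometric mean equals the absolute value of the correlation coefficient, a property listed below.

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])   # illustrative data
y = np.array([3.1, 5.0, 6.8, 9.2, 10.9])

sxy = np.sum((x - x.mean()) * (y - y.mean()))
byx = sxy / np.sum((x - x.mean()) ** 2)    # regression coefficient of y on x
bxy = sxy / np.sum((y - y.mean()) ** 2)    # regression coefficient of x on y
r = np.corrcoef(x, y)[0, 1]                # Pearson correlation coefficient

print(round(byx, 4), round(bxy, 4))
print(round(np.sqrt(byx * bxy), 4), round(abs(r), 4))   # the two printed values agree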
IMPORTANT PROPERTIES OF REGRESSION COEFFICIENT

1. The regression coefficient is denoted by b.

2. We express it in the form of an original unit of data.

3. The regression coefficient of y on x is denoted by byx. The regression coefficient of x on y is denoted by bxy.

4. If one regression coefficient is greater than 1, then the other will be less than 1.

5. They are not independent of change of scale: the regression coefficients change if x and y are multiplied by any constant.

6. The arithmetic mean (AM) of the two regression coefficients is greater than or equal to the coefficient of correlation.

7. The geometric mean (GM) of the two regression coefficients is equal to the correlation coefficient.

8. If bxy is positive, then byx is also positive, and vice versa.

SELF-TEST
1) Which of the following statements is true about the regression line?

a) A regression line is also known as the line of the average relationship

b) A regression line is also known as the estimating equation

c) A regression line is also known as the prediction equation

d) All of the above

2) The original hypothesis is known as ______

a) Alternate hypothesis

b) Null hypothesis

c) Both a and b are incorrect

d) Both a and b are correct

SHORT ANSWER QUESTIONS 02


1) What is Regression? Explain it in brief.
2) Write a note on the properties of the regression coefficient.
UNIT 02-03: TESTING OF HYPOTHESIS: MEANING, TYPES OF
HYPOTHESIS, LEVEL OF SIGNIFICANCE, LARGE SAMPLE TESTS OF MEANS,
PROPORTIONS, EQUALITY OF MEANS

LEARNING OBJECTIVES
After successful completion of this unit, you will be able to
 Explain what a hypothesis is
 Describe testing of hypothesis

What Is Hypothesis Testing?

Hypothesis testing is an act in statistics whereby an analyst tests an assumption


regarding a population parameter. The methodology employed by the analyst depends on
the nature of the data used and the reason for the analysis.

Hypothesis testing is used to assess the plausibility of a hypothesis by using sample data.
Such data may come from a larger population, or from a data-generating process. The
word "population" will be used for both of these cases in the following descriptions.

Key Takeaways
 Hypothesis testing is used to assess the plausibility of a hypothesis by using
sample data.
 The test provides evidence concerning the plausibility of the hypothesis, given
the data.
 Statistical analysts test a hypothesis by measuring and examining a random
sample of the population being analyzed.

How Hypothesis Testing Works

In hypothesis testing, an analyst tests a statistical sample, with the goal of providing
evidence on the plausibility of the null hypothesis.

Statistical analysts test a hypothesis by measuring and examining a random sample of


the population being analyzed. All analysts use a random population sample to test two
different hypotheses: the null hypothesis and the alternative hypothesis.

The null hypothesis is usually a hypothesis of equality between population parameters;


e.g., a null hypothesis may state that the population mean return is equal to zero. The
alternative hypothesis is effectively the opposite of a null hypothesis (e.g., the
population mean return is not equal to zero). Thus, they are mutually exclusive, and only
one can be true. However, one of the two hypotheses will always be true.

4 Steps of Hypothesis Testing

All hypotheses are tested using a four-step process:


1. The first step is for the analyst to state the two hypotheses so that only one can be
right.
2. The next step is to formulate an analysis plan, which outlines how the data will
be evaluated.
3. The third step is to carry out the plan and physically analyze the sample data.
4. The fourth and final step is to analyze the results and either reject the null
hypothesis, or state that the null hypothesis is plausible, given the data.
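A minimal sketch of these four steps, assuming SciPy is available, is given below; the sample values, the hypothesized mean of 50 and the 5% significance level are made-up choices for illustration.

from scipy import stats

# Step 1: state the hypotheses  H0: population mean = 50  vs  Ha: population mean != 50
sample = [48.2, 51.1, 49.5, 47.8, 50.3, 46.9, 49.0, 48.6]   # illustrative sample data
mu0 = 50.0

# Step 2: analysis plan -- a two-sided one-sample t-test at significance level alpha = 0.05
alpha = 0.05

# Step 3: carry out the plan and analyse the sample data
t_stat, p_value = stats.ttest_1samp(sample, popmean=mu0)

# Step 4: analyse the result and decide
if p_value < alpha:
    print("p =", round(p_value, 3), "-> reject the null hypothesis")
else:
    print("p =", round(p_value, 3), "-> the null hypothesis is plausible, given the data")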

Types of Statistical Hypothesis


A statistical hypothesis is of two types – null hypothesis and alternative hypothesis.
Though both hypotheses provide a statement about the population value of the test
statistic, they endorse contrasting perspectives. 1,2 Simply put, while testing a research
hypothesis, if we say neither a difference, nor a relationship exists between the two
groups studied, it will result in a null hypothesis. If we state that a difference exists,
then it results in an alternative hypothesis.

The null hypothesis: Denoted as H0, it is the hypothesis that researchers generally hope to refute. It is a statement that no difference, effect or association exists between the sample or population means or proportions. In other words, the difference equals 0, or null. The null hypothesis is expressed in this way because a hypothesis cannot be proved directly; it can only be disproved. To investigate a null hypothesis, researchers select samples, compute outcomes and make a decision as to whether the sample data give compelling evidence to disprove or reject the hypothesis or not. 1,2
The alternative hypothesis: Ha denotes this. If we have sufficiently strong evidence to
reject the null hypothesis, it is inferred that some specific alternative hypothesis is
most likely to be true.

Hypothesis Definition
As per the description of byjus.com, in Statistics, hypothesis testing is used to determine whether the variation between groups of data is due to true variation or to chance. The sample data are taken from the population, and assumptions are made about the population parameter. The hypothesis can be classified into various types. In this section, let us discuss the hypothesis definition, the various types of hypotheses and the significance of hypothesis testing, which are explained in detail.
Hypothesis Definition in Statistics
In Statistics, a hypothesis is defined as a formal statement, which gives the
explanation about the relationship between the two or more variables of the specified
population. It helps the researcher to translate the given problem to a clear
explanation for the outcome of the study. It clearly explains and predicts the expected
outcome. It indicates the types of experimental design and directs the study of the
research process.
Types of Hypothesis
The hypothesis can be broadly classified into different types. They are:
Simple Hypothesis
A simple hypothesis is a hypothesis that there exists a relationship between two
variables. One is called a dependent variable, and the other is called an independent
variable.
Complex Hypothesis
A complex hypothesis is used when there is a relationship among the variables and the dependent and independent variables together number more than two.
Null Hypothesis
Under the null hypothesis, there is no significant difference between the populations specified in the experiments; any observed difference is attributed to experimental or sampling error. The null hypothesis is denoted by H0.
Alternative Hypothesis
Under an alternative hypothesis, the observations are influenced by some non-random cause. It is denoted by Ha or H1.
Empirical Hypothesis
An empirical hypothesis is formed by the experiments and based on the evidence.
Statistical Hypothesis
In a statistical hypothesis, the statement (whether plausible or not) is verified statistically.
Apart from these types of hypothesis, some other hypotheses are the directional and non-directional hypothesis, the associated hypothesis, and the causal hypothesis.
Characteristics of Hypothesis
The important characteristics of the hypothesis are:
 The hypothesis should be short and precise
 It should be specific
 A hypothesis must be related to the existing body of knowledge
 It should be capable of verification

(Credit: https://byjus.com)

Level of Significance
Statistics is a branch of Mathematics. It deals with gathering, presenting, analyzing,
organizing and interpreting the data, which is usually numerical. It is applied to many
industrial, scientific, social and economic areas. While a researcher performs
research, a hypothesis has to be set, which is known as the null hypothesis. This
hypothesis is required to be tested via pre-defined statistical examinations. This
process is termed as statistical hypothesis testing. The level of significance or
Statistical significance is an important terminology that is quite commonly used in
Statistics. In this article, we are going to discuss the level of significance in detail.

What is Statistical Significance?


In Statistics, ―significance‖ means ―not by chance‖ or ―probably true‖. When a statistician declares that some result is ―highly significant‖, he indicates that it is very probably true. It does not mean that the result is highly important, but it suggests that it is highly probable.

Level of Significance Definition


The level of significance is defined as the fixed probability of wrongly rejecting the null hypothesis when it is in fact true. The level of significance is the probability of a type I error and is preset by the researcher in view of the consequences of such an error. The level of significance is the measure of statistical significance: it defines the threshold at which the null hypothesis is accepted or rejected, i.e. whether the result is statistically significant enough for the null hypothesis to be rejected.
Level of Significance Symbol
The level of significance is denoted by the Greek symbol α (alpha). Therefore, the
level of significance is defined as follows:

Significance Level = p (type I error) = α

The values or observations are less likely the farther they are from the mean.
The results are written as ―significant at x%‖.

Example: The value significant at 5% refers to p-value is less than 0.05 or p < 0.05.
Similarly, significant at the 1% means that the p-value is less than 0.01.

The level of significance is usually taken as 0.05 or 5%. When the p-value is low, it means that the observed values are significantly different from the population value that was hypothesised in the beginning. The lower the p-value, the more significant the result; a very low p-value indicates a highly significant result. Most generally, p-values smaller than 0.05 are reported as significant, since obtaining a p-value below 0.05 is relatively unlikely when the null hypothesis is true.

How to Find the Level of Significance?


To measure the level of statistical significance of the result, the investigator first needs to calculate the p-value. It is the probability of observing an effect at least as large as the one found, given that the null hypothesis is true. When the p-value is less than the level of significance (α), the null hypothesis is rejected. If the observed p-value is not less than the significance level α, then in theory the null hypothesis is accepted (not rejected); in practice, we often increase the sample size and check whether the significance level is reached. The general interpretation of the p-value, based upon a level of significance of 10%, is:

 If p > 0.1, there is no real evidence against the null hypothesis.

 If p > 0.05 and p ≤ 0.1, there is weak evidence against the null hypothesis.

 If p > 0.01 and p ≤ 0.05, there is strong evidence against the null hypothesis.

 If p ≤ 0.01, there is very strong evidence against the null hypothesis.

Level of Significance Example
If we obtain a p-value equal to 0.03, it indicates that there is just a 3% chance of getting a difference larger than that in our research, given that the null hypothesis is true. Now, we need to determine whether this result is statistically significant enough.

We know that if the chance is 5% or less, we tend to reject the null hypothesis and accept the alternative hypothesis. Here, in this case, the chance is 0.03, i.e. 3% (less than 5%), which means that we reject our null hypothesis and accept the alternative hypothesis.
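The decision in this example can be written as a tiny Python sketch; the function name decide and the numbers used are illustrative assumptions only.

def decide(p_value, alpha=0.05):
    # Compare the p-value with the chosen level of significance (alpha)
    if p_value <= alpha:
        return "reject the null hypothesis (statistically significant)"
    return "do not reject the null hypothesis"

print(decide(0.03))         # 3% < 5%  -> reject H0
print(decide(0.03, 0.01))   # at the stricter 1% level -> do not reject H0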

(Credit: https://byjus.com)

Equality of means/medians hypothesis test (independent samples)


An equality hypothesis test formally tests if two or more population means/medians
are different.

The hypotheses to test depends on the number of samples:


 For two samples, the null hypothesis states that the difference between the
mean/medians of the populations is equal to a hypothesized value (0
indicating no difference), against the alternative hypothesis that it is not
equal to (or less than, or greater than) the hypothesized value.
 For more than two samples, the null hypothesis states that the
means/medians of the populations are equal, against the alternative
hypothesis that at least one population mean/median is different.

When the test p-value is small, you can reject the null hypothesis and conclude that
the populations differ in means/medians.
Tests for more than two samples are omnibus tests and do not tell you which groups
differ from each other. You should use multiple comparisons to make these inferences.
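A minimal sketch of such tests, assuming SciPy and NumPy are available, is given below; the three groups of measurements are synthetic and generated only for illustration.

import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(10.0, 2.0, size=30)   # synthetic measurements for group A
b = rng.normal(11.5, 2.0, size=30)   # group B
c = rng.normal(10.2, 2.0, size=30)   # group C

# Two samples: equality of means (t-test) and a rank-based analogue (Mann-Whitney U)
print(stats.ttest_ind(a, b))       # H0: the difference between the population means is 0
print(stats.mannwhitneyu(a, b))    # non-parametric comparison based on ranks

# More than two samples: omnibus tests
print(stats.f_oneway(a, b, c))     # one-way ANOVA, H0: all population means are equal
print(stats.kruskal(a, b, c))      # Kruskal-Wallis, H0: all population medians are equal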

(Source: https://analyse-it.com/docs/user-guide/compare-groups/equality-mean-
median-hypothesis-test#:~:text=Compare%20groups-
,Equality%20of%20means%2Fmedians%20hypothesis%20test%20(independent%20s
amples),population%20means%2Fmedians%20are%20different.)

SELF-TEST 01
1) A statement made about a population for testing purpose is called?
a) Statistic

b) Hypothesis

c) Level of Significance

d) Test-Statistic

2) The hypothesis that is tested for possible rejection under the assumption that it is true is called?

a) Null Hypothesis

b) Statistical Hypothesis

c) Simple Hypothesis

d) Composite Hypothesis

SHORT ANSWER QUESTIONS


1) What is null Hypothesis and alternative Hypothesis?

2) What is Type I error and Type II error?
UNIT 02-04: BIOASSAY- PRINCIPLE, HISTORY, CLASSIFICATION,

EXAMPLES

LEARNING OBJECTIVES
After successful completion of this unit, you will be able to
 Define bioassay
 Describe the principle, history, classification and examples of bioassay

A bioassay is an analytical method to determine the concentration or potency of a


substance by its effect on living animals or plants (in vivo), or on living cells or
tissues (in vitro). A bioassay can be either quantal or quantitative, direct or indirect. If
the measured response is binary, the assay is quantal; if not, it is quantitative.

Fig. 2.5 Bioassay setup


(A biological test system is exposed to various experimental conditions to which
it reacts)

A bioassay may be used to detect biological hazards or to give an assessment of the


quality of a mixture. A bioassay is often used to monitor water quality as well as wastewater discharges and their impact on the surroundings. It is also used to assess the environmental impact and safety of new technologies and facilities.
Principle
A bioassay is a biochemical test to estimate the potency of a sample compound.
Usually, this potency can only be measured relative to a standard compound. A typical
bioassay involves a stimulus (ex. drugs) applied to a subject (ex. animals, tissues,
plants). The corresponding response (ex. death) of the subject is thereby triggered and
measured.
History
The first use of a bioassay dates back to as early as the late 19th century, when the
foundation of bioassays was laid down by German physician Paul Ehrlich. He
introduced the concept of standardization by the reactions of living matter. His
bioassay on diphtheria antitoxin was the first bioassay to receive recognition. His use
of bioassay was able to discover that administration of gradually increasing dose of
diphtheria in animals stimulated production of antiserum.
One well known example of a bioassay is the "canary in the coal mine"
experiment. To provide advance warning of dangerous levels of methane in the air,
miners would take methane-sensitive canaries into coal mines. If the canary died due
to a build-up of methane, the miners would leave the area as quickly as possible.
Many early examples of bioassays used animals to test the carcinogenicity of
chemicals. In 1915, Yamagiwa Katsusaburo and Koichi Ichikawa tested the
carcinogenicity of coal tar using the inner surface of rabbit's ears.
From the 1940s to the 1960s, animal bioassays were primarily used to test the toxicity
and safety of drugs, food additives, and pesticides.
Beginning in the late 1960s and 1970s, reliance on bioassays increased as public
concern for occupational and environmental hazards increased.
Classifications
Direct assay
In a direct assay, the stimulus applied to the subject is specific and directly
measurable, and the response to that stimulus is recorded. The variable of interest is
the specific stimulus required to produce a response of interest (ex. death of the
subject).
Indirect assay
In an indirect assay, the stimulus is fixed in advance and the response is measured in
the subjects. The variable of interest in the experiment is the response to a fixed
stimulus of interest.
 Quantitative response: The measurement of the response to the
stimulus is on a continuous scale (ex. blood sugar content).
 Quantal response: The response is binary; it is a determination of
whether or not an event occurs (ex. death of the subject).
Examples
ELISA (Enzyme-linked immunosorbent assay)

Fig. 2.6 ELISA (Enzyme-linked immunosorbent assay)

ELISA plate with various cortisol levels


ELISA is a quantitative analytical method that measures absorbance of color change
from antigen-antibody reaction (ex. Direct, indirect, sandwich, competitive). ELISA is
used to measure a variety of substances in the human body, from cortisol levels for
stress to glucose level for diabetes.
Home pregnancy test
Home pregnancy tests use ELISA to detect the increase of human chorionic
gonadotropin (hCG) during pregnancy.
HIV test
HIV tests also use indirect ELISA to detect HIV antibodies caused by infection.
Environmental bioassays:
Environmental bioassays are generally a broad-range survey of toxicity. A toxicity
identification evaluation is conducted to determine what the relevant toxicants are.
Although bioassays are beneficial in determining the biological activity within an
organism, they can often be time-consuming and laborious. Organism-specific factors
may result in data that are not applicable to others in that species. For these reasons,
other biological techniques are often employed, including radioimmunoassays.
Water pollution control requirements in the United States require some industrial
dischargers and municipal sewage treatment plants to conduct bioassays. These
procedures, called whole effluent toxicity tests, include acute toxicity tests as well as
chronic test methods. The methods involve exposing living aquatic organisms to
samples of wastewater for a specific length of time. Another example is the bioassay
ECOTOX, which uses the microalgae Euglena gracilis to test the toxicity of water
samples.
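Quantal toxicity data of the kind produced by such acute tests are often summarized by fitting a dose–response curve and estimating the LC50. A minimal sketch follows, assuming SciPy and NumPy are available; the concentrations, response fractions and units are invented purely for illustration.

import numpy as np
from scipy.optimize import curve_fit

conc = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])            # exposure concentrations (assumed mg/L)
frac_dead = np.array([0.05, 0.10, 0.30, 0.55, 0.80, 0.95])   # observed fraction of organisms responding

def log_logistic(c, lc50, slope):
    # Two-parameter log-logistic dose-response curve
    return 1.0 / (1.0 + np.exp(-slope * (np.log(c) - np.log(lc50))))

params, _ = curve_fit(log_logistic, conc, frac_dead, p0=[8.0, 1.0])
lc50, slope = params
print("Estimated LC50 about", round(lc50, 1), "mg/L (slope", round(slope, 2), ")")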

SELF-TEST
1) A bioassay makes use of

a) Air

b) Water

c) Biological agent

d) All of above

2) Bioassay is an

a) Microbiological Technique

b) Analytical Technique

c) Both of above

d) None of the above

SHORT ANSWER QUESTIONS


1) What is the main purpose of bioassay?

2) What are the limitations of bioassay?

SUMMARY
Correlation or dependence is any statistical relationship between two random
variables or bivariate data. It is the measure of how two or more variables are
related to one another, and is useful because it can indicate a predictive
relationship that can be exploited in practice. However, correlation does not
imply causation, and random variables are dependent if they do not satisfy a
mathematical property of probabilistic independence.

Correlation is a measure of the strength and direction of the linear relationship


between two variables that is defined as the covariance of the variables divided
by the product of their standard deviations. The Pearson product -moment
correlation coefficient is the best-known and most commonly used type of
correlation coefficient, while intra-class correlation (ICC) is a descriptive
statistic that can be used to describe how strongly units in the same group
resemble each other. Rank correlation is the relationship between the rankings of
two variables, or two rankings of the same variable, and Spearman's rank
correlation coefficient and Kendall tau rank correlation coefficient are measures
of the portion of ranks that match between two data sets. Tetrachoric and
polychoric correlation coefficients measure association between two ordered-
categorical variables, and are calculated as the estimate of the Pearson
correlation coefficient one would obtain if: 1. the two variables were measured
on a continuous scale, instead of as ordered-category variables, and 2.

the two continuous variables followed a bivariate normal distribution. In


everyday language, correlation refers to some kind of association, but in
statistical terms, correlation denotes the relationship between two quantitative
variables.

Correlation is the relationship between two variables, such as income and


consumption. There are three types of correlation: based on the direction of
change, based on the number of variables and based on the constancy of the ratio
of change. Scatter diagrams, Karl Pearson's coefficient of correlation, and
Spearman's rank correlation are three important tools for studying correlation.
Positive correlation is when the variables change in the same direction, while
negative correlation is when they move in opposite directions. Simple correlation
is the study of only two variables, while multiple correlations are the study of
three or more variables at the same time.
Partial correlation is when only two variables are considered while keeping the
other variables constant. Correlation is classified into two types based on the
consistency of the ratio of change between the variables: linear correlation and
non-linear correlation. Karl Pearson's Coefficient of Correlation (also known as
Pearson's R) is the correlation coefficient that is frequently used in linear
regression. It is an extensively used mathematical method in which the numerical
representation is applied to measure the level of relation between linearly related
variables. For example, a correlation coefficient of +1 means that for every increase in one variable, there is a positive increase in the other in a fixed proportion.

In statistics, Spearman's rank correlation coefficient or Spearman's ρ is a


nonparametric measure of rank correlation (statistical dependence between the
rankings of two variables). It assesses how well the relationship between two
variables can be described using a monotonic function. It is equal to the Pearson
correlation between the rank values of those two variables, and is appropriate for
both continuous and discrete ordinal variables. A perfect Spearman correlation of
+1 or −1 occurs when each of the variables is a perfect monotone function of the
other. The Spearman correlation coefficient is the Pearson correlation coefficient
between the rank variables for a sample of size n.

It is computed using the popular formula ρ = 1 − 6Σdᵢ² / (n(n² − 1)), where dᵢ is the difference between the two ranks of each observation and n is the number of observations; identical values are assigned fractional ranks equal to the average of their positions in the ascending order of the values. If ties are present in the data set, this simplified formula yields incorrect results. The simplified method should also not be used in cases where the data set is truncated.
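A minimal sketch of this computation, assuming SciPy is available, is shown below; the paired observations are illustrative values only.

from scipy import stats

x = [35, 23, 47, 17, 10, 43, 9, 6, 28]   # illustrative paired observations
y = [30, 33, 45, 23, 8, 49, 12, 4, 31]

rho, p_value = stats.spearmanr(x, y)     # ties are handled via average (fractional) ranks
print(round(rho, 3), round(p_value, 4))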

The Spearman rank correlation is a numerical measure that measures the extent
of statistical dependence between pairs of observations. It is similar to the
Pearson product-moment correlation coefficient, which measures the "linear"
relationships between the raw numbers rather than between their ranks. The sign
of the Spearman correlation indicates the direction of association between X (the
independent variable) and Y (the dependent variable). A Spearman correlation of
zero indicates that there is no tendency for Y to either increase or decrease when
X increases. The Spearman correlation coefficient increases in magnitude as X
and Y become closer to being perfectly monotone functions of each other, when
X and Y are perfectly monotonically related. It is often described as being
"nonparametric".

Regression is a statistical method used in finance, investing, and other


disciplines that attempts to determine the strength and character of the
relationship between one dependent variable (usually denoted by Y) and a series
of other variables (known as independent variables). The two basic types of
regression are simple linear regression and multiple linear regression. Simple
linear regression uses one independent variable to explain or predict the outcome
of the dependent variable Y, while multiple linear regression uses two or more
independent variables to predict the outcome. Regression can help finance and
investment professionals as well as professionals in other businesses predict
sales for a company based on weather, previous sales, GDP growth, or other types of conditions. The capital asset pricing model (CAPM) is an often-used
regression model in finance for pricing assets and discovering costs of capital.

Regression analysis is a statistical method used to understand the relationship


between two variables, such as the CGPA and GRE score. It is used for
prediction and forecasting across different industries, such as financial industry,
marketing, manufacturing, and medicine. Regression is used to "regress" the
value of dependent variable "Y" with the help of independent variables, and to
understand how the value of 'Y' change w.r.t change in 'X'. Outliers are observed
in the dataset that have a very high or very low value as compared to the other
observations in the data. An outlier is a problem because it hampers the results.

Linear Regression is a predictive model used to find the linear relationship


between a dependent variable and one or more independent variables. It is the
simplest of all regression types and is used to answer questions such as "What will be the GRE score of the student, if his CGPA is 8.32?" Examples of
Independent & Dependent Variables include Rainfall, Crop Yield, Advertising
Expense, Sales, and GDP.

The Simple Linear Regression Model is a machine learning model used to predict
the dependent variable. It has five assumptions: Linear Relationship, Normality,
No or Little Multicollinearity, No Autocorrelation in errors, and
Homoscedasticity. To assess if the model is doing good, the statistical measure
that evaluates the model is called the coefficient of determination, which is the
portion of the total variation in the dependent variable that is explained by
variation in the independent variable. Polynomial Regression is a type of
regression technique used to model nonlinear equations.

Logistic Regression is a supervised learning method for classification that


establishes a relation between dependent class variables and independent
variables using regression. It is used when the dependent variable is categorical
and can fit linear regression in a similar manner. There are two broad categories
of Logistic Regression algorithms: Binary and Multinomial. Binary is used for
binary dependent variables, while Multinomial is used for multiple categories.
Process Methodology: Logistic regression takes into consideration the different
classes of dependent variables and assigns probabilities to the event happening
for each row of information. The Sigmoid function is obtained by implementing
the log-normal function on these probabilities that are calculated on these
independent variables.

Hypothesis testing is an act in statistics whereby an analyst tests an assumption


regarding a population parameter. It is used to assess the plausibility of a
hypothesis by using sample data. All hypotheses are tested using a four-step
process: the first step is to state the two hypotheses, the next step is to formulate
an analysis plan, the third step is to carry out the plan and physically analyze the
sample data, and the fourth step is to analyze the results and either reject the null
hypothesis or state that the null hypothesis is plausible. Types of Statistical
Hypothesis include null hypothesis and alternative hypothesis, which provide a
statement about the population value of the test statistic. The null hypothesis is a
statement that no difference, no effect or association exists between sample or
population mean or proportion.

To investigate a null hypothesis, researchers select samples and compute


outcomes and make a decision as to whether the sample data gives compelling
evidence either to disprove or reject the hypothesis or not. The hypothesis can be
classified into various types, such as simple and complex, and the significance of
hypothesis testing is explained in detail. Statistics is a branch of Mathematics
that deals with gathering, presenting, analyzing, organizing and interpreting data.
It is applied to many industrial, scientific, social and economic areas. The level
of significance or Statistical significance is an important terminology that is used
in Statistics.

It is defined as the fixed probability of wrongly rejecting the null hypothesis when it is in fact true. It is denoted by the Greek symbol α (alpha). The level of
significance is taken at 0.05 or 5%. When the p-value is low, it means that the
recognised values are significantly different from the population value that was
hypothesised in the beginning. To measure the level of statistical significance of
the result, the investigator first needs to calculate the p-value, which defines the
probability of identifying an effect which provides that the null hypothesis is
true.

If p > 0.05 and p ≤ 0.1, there is only weak evidence against the null hypothesis. The equality hypothesis test (independent samples) tests whether two or more population
means/medians are different. The null hypothesis states that the difference
between the mean/medians of the populations is equal to a hypothesized value,
while the alternative hypothesis states that it is not equal to (or less than, or
greater than) the hypothesized value. If the test p-value is small, the null
hypothesis is rejected and the alternative hypothesis is accepted. Tests for more
than two samples are omnibus tests and do not tell you which groups differ from
each other.

A bioassay is an analytical method used to determine the concentration or


potency of a substance by its effect on living animals or plants. It can be quantal
or quantitative, direct or indirect, and can be used to detect biological hazards or
assess the quality of a mixture. The first use of a bioassay dates back to the late
19th century, when German physician Paul Ehrlich introduced the concept of
standardization by the reactions of living matter. Bioassays are used to test the
toxicity and safety of drugs, food additives, and pesticides. Examples include the
"canary in the coal mine" experiment, Yamaigiwa Katsusaburo and Koichi
Ichikawa's carcinogenicity of coal tar, and ELISA (Enzyme -linked
immunosorbent assay).

In a direct assay, the stimulus applied to the subject is specific and dir ectly
measurable, and the response to that stimulus is recorded. An indirect assay is
fixed in advance and the response is measured in the subjects. Quantitative
response is on a continuous scale, while binary response is a determination of
whether or not an event occurs. ELISA is a quantitative analytical method that
measures absorbance of color change from antigen-antibody reaction (ex. direct,
indirect, sandwich, competitive). It is used to measure a variety of substances in
the human body, from cortisol levels for stress to glucose level for diabetes.

Home pregnancy tests use ELISA to detect the increase of human chorionic
gonadotropin (hCG) during pregnancy. HIV tests also use indirect ELISA.
Environmental bioassays are generally a broad-range survey of toxicity, but can
be time-consuming and laborious. Water pollution control requirements require
some industrial dischargers and municipal sewage treatment plants to conduct
bioassays. ECOTOX is a bioassay used to test the toxicity of water samples.

REFERENCES
 https://en.wikipedia.org/wiki/Correlation
 https://en.wikipedia.org/wiki/Correlation_coefficient
 https://www.embibe.com/exams/methods-of-studying-correlation/
 https://byjus.com/commerce/karl-pearson-coefficient-of-correlation/
 https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient
 https://www.investopedia.com/terms/r/regression.asp#:~:text=Regression
%20is%20a%20statistical%20method,(known%20as%20independent%20v
ariables).
 https://www.mygreatlearning.com/blog/what-is-regression/
 https://www.tutorhelpdesk.com/homeworkhelp/Statistics-/Uses-Of-
Regression-Analysis-Assignment-Help.html
 https://byjus.com/jee/properties-of-regression-coefficient/
 https://www.investopedia.com/terms/h/hypothesistesting.asp#:~:text=Key
%20Takeaways,of%20the%20population%20being%20analyzed.
 https://pubrica.com/academy/concepts-definitions/types-of-statistical-
hypothesis/#:~:text=A%20statistical%20hypothesis%20is%20of,null%20h
ypothesis%20and%20alternative%20hypothesis.
 https://byjus.com/maths/hypothesis-definition/
 https://byjus.com/maths/level-of-
significance/#:~:text=Level%20of%20Significance%20Definition&text=T
he%20level%20of%20significance%20is%20the%20measurement%20of%
20the%20statistical,to%20be%20false%20or%20rejected.
 https://analyse-it.com/docs/user-guide/compare-groups/equality-mean-
median-hypothesis-test#:~:text=Compare%20groups-
,Equality%20of%20means%2Fmedians%20hypothesis%20test%20(indepe
ndent%20samples),population%20means%2Fmedians%20are%20different .
CREDIT 03
UNIT 03-01: INTRODUCTION-ENVIRONMENTAL MODELLING: SCOPE AND

PROBLEM DEFINITION , GOALS AND OBJECTIVES , DEFINITION;

MODELLING APPROACHES – DETERMINISTIC , STOCHASTIC AND THE

PHYSICAL APPROACH ; APPLICATIONS OF ENVIRONMENTAL MODELS ; THE

MODEL BUILDING PROCESS

LEARNING OBJECTIVES
After successful completion of this unit, you will be able to

 Environmental Modelling
 Application of Environmental Models

Environmental Modelling

Model and Modelling

As per the study compiled by Bikash Sherchan, a model is an abstraction of a real system: it is a simplification in which only those components which are seen to be significant to the problem at hand are represented in the model. In this, a model takes influence from aspects of the real system and aspects of the modeller‘s perception of the system and its importance to the problem at hand. Modelling supports the conceptualization and exploration of the behaviour of objects or processes and their interaction
as a means of better understanding these and generating hypotheses
concerning them. Modelling also supports the development of (numerical)
experiments in which hypotheses can be tested and outcomes predicted. In
science, understanding is the goal and models serve as tools towards that
end (Baker, 1998) (Mulligan and Wainwright, 2004: 8).

According to the Environmental Protection Agency (EPA), a model is defined as:


―A simplification of reality that is constructed to gain insights into selected
attributes of a physical, biological, economic, or social system. A formal
representation of the behaviour of system processes, often in mathematical or
statistical terms. The basis can also be physical or conceptual‖ (EPA, 2009).
Cross and Moscardini (1985: 22) describe modelling as ‗an art with a rational
basis which requires the use of common sense at least as much as
mathematical expertise‘. Modelling is described as an art because it involves
experience and intuition as well as the development of a set of (mathematical)
skills. Cross and Moscardini argue that intuition and the resulting insight are
the factors which distinguish good modellers from mediocre ones. Intuition (or
imagination) cannot be taught and comes from the experience of designing,
building and using models.

Tackling some of the modelling problems presented on the website which


complements this book will help in this (Mulligan and Wainwright, 2004: 8).

Modelling is the process of producing a model. A model is similar to but simpler


than the system it represents. One purpose of a model is to enable the analyst to
predict the effect of changes to the system. On the one hand, a model should be a
close approximation to the real system and incorporate most of its salient
features. On the other hand, it should not be so complex that it is impossible to
understand and experiment with it. A good model is a judicious trade -off between
realism and simplicity (Maria, 1997). A simulation of a system is the operation
of a model of the system. The model can be reconfigured and experimented with;
usually, this is impossible, too expensive or impractical to do in the system it
represents. The operation of the model can be studied, and hence, properties
concerning the behaviour of the actual system or its subsystem can be inferred.
In its broadest sense, simulation is a tool to evaluate the performance of a
system, existing or proposed, under different configurations of interest and
over long periods of real time (Maria, 1997)

Environmental Modelling

Environmental modelling is a science that uses mathematics and computers to


simulate physical and chemical phenomena in the environment (e.g.,
environmental pollution). This science was initially based on pen-and-paper
calculations using simple equations. In the last 50 years, with the development of
digital computers, environmental models have become more and more complex,
requiring often numerical solutions for systems of partial differential
equations (Holzbecher, 2007). Environmental models simulate the functioning of
environmental processes. The motivation behind developing an environmental
model is often to explain complex behaviour in environmental systems, or
improve understanding of a system. Environmental models may also be
extrapolated through time in order to predict future environmental conditions, or
to compare predicted behaviour to observed processes or phenomena. However, a
model should not be used for both prediction and explanation tasks
simultaneously (Skidmore, 2002).

Why Model the Environment? ―Fundamentally, the reason for modelling is a lack of full access, either in time or space, to the phenomena of interest. In areas where public policy and public safety are at stake, the burden is on the modeller to demonstrate the degree of correspondence between the model and the material world it seeks to represent and to delineate the limits of that correspondence.‖ – Oreskes et al. 1994 (EPA, 2009). The context for much environmental modelling at present is the concern relating to human-induced
climate change. Similarly, work is frequently carried out to evaluate the impacts
of land degradation due to human impact. Such application-driven investigations
provide an important means by which scientists can interact with and influence
policy at local, regional, national and international levels. Models can be a
means of ensuring environmental protection, as long as we are careful about how
the results are used (Oreskes et al., 1994; Rayner and Malone, 1998; Sarewitz
and Pielke, 1999; Bair, 2001).

On the other hand, we may use models to develop our understanding of the
processes that form the environment around us. As noted by Richards (1990),
processes are not observable features, but their effects and outcomes are. In
geomorphology, this is essentially the debate that attempts to link process to
form (Richards et al., 1997). Models can thus be used to evaluate whether the
effects and outcomes are reproducible from the current knowledge of the
processes. This approach is not straightforward, as it is often difficult to evaluate
whether process or parameter estimates are incorrect, but it does at least
provide a basis for investigation. Of course, understanding-driven and
applications-driven approaches are not mutually exclusive. It is not possible (at
least consistently) to be successful in the latter without being successful in the
former. We follow up these themes in much more detail in Chapter 1 (Mulligan
and Wainwright, 2004). Modelling is thus the canvas of scientists on which they
can develop and test ideas, put a number of ideas together and view the outcome,
integrate and communicate those ideas to others. Models can play one or more of

many roles, but they are usually developed with one or two roles specifically in
mind. The type of model built will, in some way, restrict the uses to which the
model may be put.

The following seven headings outline the purposes to which models are usually
put (Mulligan and Wainwright, 2004: 10):

1. As an aid to research: Models are like virtual laboratories. They allow us to


infer information about unmeasurable or expensively measured properties
through modelling them from more readily measured variables that are related in
some way to the variable of interest.

2. As a tool for understanding: Models are a tool for understanding the scientific
concepts being developed. Model building involves understanding the system
under investigation while application involves learning the system in depth.

3. As a tool for simulation and prediction: The real value of models becomes


apparent when they are used extensively for system simulation and/or
prediction. Simulation allows us to integrate the effects of simple processes over
complex spaces or complex processes over simple spaces and to cumulate the
effects of those processes and their variation over time. This integration and
cumulation can lead to prediction of system behaviour outside the time or space
domain for which data are available. Integration and cumulation are valuable in
converting a knowledge or hypothesis of a process into an understanding of this
process over time and space which is very difficult to pursue objectively without
modelling. Models are therefore extensively employed in extrapolation beyond
measured times and spaces whether it is prediction (forecasting), or
postdiction (hindcasting) or near-term casting (nowcasting) as is common in
meteorology and hydrology.

4. As a virtual laboratory: Models can also be rather inexpensive, low-hazard and
space-saving laboratories in which a good understanding of processes can
support model experiments. This approach can be particularly important where
the building of hardware laboratories (or hardware models) would be too
expensive, too hazardous or not possible. However, the outcome of any
model experiment is only as good as the understanding summarized within
the model and therefore care should be taken when using models as laboratories.

The more physically based the model (i.e. the more based on proven physical
principles), the better in this regard and indeed the most common applications
of models as laboratories are in intensively studied fields in which the
physics are fairly well understood such as computational fluid dynamics (CFD)
or in areas where a laboratory could never be built to do the job (climate-system
system modelling, global vegetation modelling).

5. As an integrator within and between disciplines: Models also have the


ability to integrate the work of many research groups into a single
working product which can summarize the understanding gained.
Understanding environmental processes at the level of detail required to
contribute to the management of changing environments requires intensive
specialization by individual scientists at the same time as the need to approach
environmental research in an increasingly multidisciplinary way. Because of
these two requirements, and because of financial issues, scientific research is
increasingly a collaborative process whereby large grants fund integrated
analysis of a particular environmental issue by tens or hundreds of
researchers from different scientific fields, departments, institutions,
countries and continents working together and having to produce useful
and consensus outcomes, sometimes after only several years. Research groups
can define a very clear picture of the system under study and its subcomponents,
each contributing group can be held responsible for the production of algorithms
and data for a number of subcomponents and a team of scientific integrators is
charged with the task of making sure all the subcomponents or modules
work together at the end of the day. The knowledge and data gained by
each group are tightly integrated where the worth of the sum becomes much
more than the worth of its individual parts.

6. As a research product: Models are a valid research product, particularly when


they can be used by others and thus either provide the basis for further
research or act as a tool in practical environmental problem solving or
consultancy. Equally, models can carry forward entrenched ideas and can set the
agenda for future research, even if the models themselves have not been
demonstrated to be sound. Models can also be very expensive ‗inventions‘,
marketed to specialist markets in government, consultancy and academia,

sometimes paying all the research costs required to produce them, often paying
part of the costs.

7. As a means of communicating science and the results of science: Models can


be much more effective communicators of science because, unlike the
written word, they can be interactive and their representation of results is very
often graphical or moving graphical. If a picture saves a thousand words, then a
movie may save a million and in this way very complex science can be hidden
and its outcomes communicated easily.

Characteristics of Models

As models are simplified structures of reality that present supposedly significant


features or relationships in a generalised form, they do not include all the
associated observations or measurements of the systems they model. Thus, the
most fundamental feature of models is that their construction has involved a
highly selective attitude to information. With this selective attitude, not only the
noise but also the less important signals of the system have often been
eliminated, enabling the fundamental, relevant, or interesting aspects of the real
system to appear in some generalised form (Haggett and Chorley 1967).
Therefore, models can be thought of as selective pictures of the real -world
system, and ―only by being unfaithful in some aspects can a model represent its
original‖ (Black 1962: 220) (Liu, 2009: 3). The selective feature of models also
implies that models resemble the real-world system in some aspect; they are
structured approximations of reality. A good model represents the real world in a
simplified yet valid and adequate way. The model must be simple enough in
order for one to easily understand and make decisions using it. It must be
adequate to contain all the important elements of the real -world system, and it
must be valid because all the elements modelled must be correctly interrelated
according to their connections or structures (Liu, 2009: 3).

Modelling Principles

Mathematical modelling is built on principles and systematic methods for


applying those principles to a system of interest. Although the mathematics is
objective, the mathematics must be used within a framework that ensures what is
obtained is a useful simulation model. Environmental modelling goes beyond

the mathematics in seeking to characterize and understand the physical,
chemical, and biological behaviour of a system. This modelling rarely admits
exact solutions. Once we formulate and solve a problem, we need to make sure
that the solution has some applicability. The temptation to make use of an ill-
conceived or overly simplified model because of resource constraints, such as
insufficient data, funding, computer power, or time, in order to get an answer
can be great. Thus, it is important to make use of a conceptual model of a system
and also to work with this model systematically. Both the boxes and the arrows
that connect boxes in a conceptual model must be handled properly (Gray and
Gray, ...: 64). Although a conceptual model provides a useful depiction of the
modelling process, it is also useful to present modelling as a series of steps.
Many examples can be found where modelling is presented as a series of
statements or tasks that must be performed. This kind of list is useful in much the
same way that a checklist is employed to ensure that all appropriate steps have
been taken before taking off in an airplane. Although we may become
accustomed to modelling and take some steps for granted, it is useful to have a
list of items to refer back to ensure that nothing has been inadvertently
overlooked. This is important for satisfactory models and also for models of
limited success as referral to these steps can help identify the main areas of
understanding or misunderstanding of the system of interest (Gray and Gray, ...:
64). Modellers are recommended to develop their own protocol list based on
experience and their individual biases that might interfere with a successful
modelling project. The following list is a good starting point for
customization. A successful modelling activity, however, will likely require
movement back and forth through the items (Gray and Gray, ...: 64).

1. Identify
the question to be answered. This item serves as a balance point for ensuring that
an investigator does not get carried away with a model that provides elegant
graphics, is overly simplistic, or is irrelevant. As a model develops and,
typically, becomes more complex as more data or analysis is brought to bear on
the exercise, it is important to make use only of elements and activities that
advance the cause of addressing the questions of importance.

2. Determine what data is available to support the model. When a modelling


project has been identified, it is tempting to turn to the computer codes and
mathematics needed to solve the problem and to outline an experimental
program that will assist in model development. However, it is a good idea to
pause for a moment and find out what information is already known about the
system and what experiments have been performed. For example, government
agencies charged with characterizing an aquifer may have already performed
pump tests and identified formations and their characteristics. Obtaining
information from online sources and libraries can be a relatively cheap way to
obtain data that can support a mathematical modelling exercise. After
determining what is available, some preliminary experiments or data
collection activities may be performed to generate information that will
support the modelling.

3. Identify the processes to be modelled. Depending on the question to be


answered, it may be necessary only to perform a mass balance. If the velocity
field is known from data, this can be used to eliminate the need to model
momentum transport. If density gradients cause vertical flow that is
important, such as in fall turnover in a lake, this requires information about heat
exchange and the mechanisms of heat exchange between the lake and the
surroundings. Additionally, it is important to determine if time dependence
must be considered and the number of spatial dimensions over which variability
must be accounted for. All of these considerations are presented with the caveat
that it is best to start with a simple model and build in extra effects
subsequently. This allows the modeller to gain an appreciation for and insight
into the behaviour of the system and the effects of the different terms in the
model. Starting with a fully complex model can create difficulties in both
getting a simulation that makes sense and in determining what processes are
actually important.

4. Make a conceptual model of the mathematical model. A conceptual model at


this point can be useful in sorting out which outputs from some equations are
inputs to other equations, which equations must be solved simultaneously, and
perhaps a strategy for decoupling the need for simultaneous solution.
Conceptual models are very useful for highlighting which aspects of physical
processes interact and how those interactions lead to interdependence of
solutions to the governing equations.

5. Select the equations to be used and solved. This step is tightly tied in to step 3
in that the processes to be simulated dictate the equations that are needed. In
addition to conservation equations, closure relations are needed to account for
the dissipative processes that involve spreading of mass, momentum, and energy
within the system. The closure relations depend on the type of materials
being modelled, velocity of the flow, the steepness of the gradients, and the
exchanges between system elements. At this point, simplified versions of
specific equations can be used; but the forms should be employed such that
additional features may be added without significant reformulation of the terms
in the first version.

6. Determine whether appropriate models are available. When using an available


model, such as might have been produced by a government agency or consulting
agency, it is tempting to merely get the code to run without understanding all
the features of the model. Thus, at this step, it is important to become aware of
available models that might be appropriate for the problem of interest.
Typically, documentation of the principles in the model and the numerical
procedures employed are available. These model features should be compared
with the processes and equations deemed to be important. Additionally, in some
instances, specific models have been designated as standards that must be used
in an environmental assessment. In such cases, one must still be aware of the
limitations and shortcomings of the model for study of the system of interest
and in answering important questions. Determination of availability of a model
also depends on whether available models are supported for the computer
environment and power available to the modeller. Issues such as user interfaces
for input and output are also important considerations in determining the
suitability of a model or the need to develop additional software or obtain
access to advanced computing environments.

7. Exercise the model. After a model has been selected or written, it should be
exercised to ensure that it is working properly. Problems with known solutions
can be posed and simulated to confirm that a model is functioning properly. For
example, systems in which no change should be occurring can be simulated
to see that the model can manage at least this simplest of all cases. By exercising
elements and features of a model, the modeller can gain experience

concerning the limits of the simulation, the importance of spatial and temporal
refinement, and the most difficult regions for obtaining accurate simulations.
Also, the sensitivity of the model simulation to the selection of parameter values
can be evaluated.

8. Apply the model. Employ the model to simulate the system of interest. Examine
the behaviour of the model, the important processes, and the ability to answer
the question that has motivated the modelling study. Compile simulation results
and make comparisons between similar scenarios. If historical data is available,
use the model to simulate the transition from an older state to a newer one.

9. Be critical. Consider whether the model answers the question of interest well
and in what cases. Decide which of the preceding elements of model development
should be revisited to improve the efficacy of the model. Resist the temptation
to "declare victory" in the modelling process and look for weaknesses. Go over
the results with knowledgeable colleagues, looking for ways to improve the
simulations or deciding if additional scenarios need to be considered.

10. Present results. Provide a realistic assessment of the results of the
simulation. Assess the importance of errors in the model. Provide a clear
indication of what can be done to improve the model accuracy or the need to
continue to update the model results as the system develops in nature. Identify
other processes in the system, not considered in the model, that may be
important for an insightful analysis of the system.

Modelling Approach

Model is an everyday word. Look it up in the dictionary and you are likely to
find upwards of eight definitions as a noun and four or more as a verb. Meanings
range from making things out of clay to using mathematics to someone who
struts down the catwalk. Leaving aside this last category, most of the
definitions have one thing in common: simplification or reduction in
complexity. At its easiest, a model is a simplification of reality in terms that we
can easily understand.

(Source & Credit: Compiled by Bikash Sherchan, 2019 at


https://www.studocu.com/)

Fig. 3.1 Pictorial description of environmental modelling (Ogola 2007)

SELF-TEST
1) Environmental models are useful for all of the following purposes except

a) Making predictions about future ecological changes

b) Testing predictions about future ecological changes

c) Evaluating proposed solutions to environmental problems

d) Accounting for all the variables that exist in a real environment

2) Environmental Modelling is ___________

a) The creation and use of mathematical models of the environment.

b) Use of Social Science

c) Both of above

d) None of above

SHORT ANSWER QUESTIONS


1) What is the purpose of environmental Modelling?

2) What are some of the things you can do in the modeling environment?

UNIT 03-02: MODELS IN ENVIRONMENTAL SCIENCE EMPHASIZING - I

LEARNING OBJECTIVES
After successful completion of this unit, you will be able to

 Models in Environmental Science

 Linear Models

The scientific method is frequently used as a guided approach to learning. Linear


statistical methods are widely used as part of this learning process. In the
biological, physical, and social sciences, as well as in business and engineering,
linear models are useful in both the planning stages of research and analysis of
the resulting data. In Sections 1.1–1.3, we give a brief introduction to simple and
multiple linear regression models, and analysis-of-variance (ANOVA) models.

Simple Linear Regression Model

In simple linear regression, we attempt to model the relationship between two


variables, for example, income and number of years of education, height and
weight of people, length and width of envelopes, temperature and output of an
industrial process, altitude and boiling point of water, or dose of a drug and
response. For a linear relationship, we can use a model of the form

y = b0 + b1x + ε      (1.1)

where y is the dependent or response variable and x is the independent or


predictor variable. The random variable ε is the error term in the model. In this
context, error does not mean mistake but is a statistical term representing random
fluctuations, measurement errors, or the effect of factors outside of our control.
We typically add other assumptions about the distribution of the error terms,
independence of the observed values of y, and so on. Using observed values of x
and y, we estimate b0 and b1 and make inferences such as confidence intervals
and tests of hypotheses for b0 and b1. We may also use the estimated model to
forecast or predict the value of y for a particular value of x, in which case a
measure of predictive accuracy may also be of interest.
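As a hedged illustration of how b0 and b1 might be estimated in practice, the short Python sketch below fits model (1.1) by ordinary least squares and then predicts y for a new x. The rainfall and runoff figures are invented purely for demonstration and are not taken from the text.

```python
import numpy as np

# Illustrative data (hypothetical): x = daily rainfall (mm), y = runoff (mm)
x = np.array([5.0, 12.0, 20.0, 33.0, 41.0, 55.0, 60.0, 72.0])
y = np.array([1.2, 3.5, 6.1, 11.0, 13.8, 19.5, 21.0, 26.2])

# Ordinary least-squares estimates of b1 (slope) and b0 (intercept)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Predict y for a new value of x
x_new = 45.0
y_pred = b0 + b1 * x_new

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, predicted y at x = {x_new}: {y_pred:.3f}")
```

The same estimates could equally be obtained with np.polyfit(x, y, 1), which returns the slope and intercept of the least-squares line.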

Multiple Linear Regression Model

The response y is often influenced by more than one predictor variable. For
example, the yield of a crop may depend on the amount of nitrogen, potash, and
phosphate fertilizers used. These variables are controlled by the experimenter,
but the yield may also depend on uncontrollable variables such as those
associated with weather. A linear model relating the response y to several
predictors has the form

y = b0 + b1x1 + b2x2 + … + bkxk + ε      (1.2)

The parameters b0, b1, …, bk are called regression coefficients. As in (1.1), ε
provides for random variation in y not explained by the x variables. This random
variation may be due partly to other variables that affect y but are not known or
not observed.

The model in (1.2) is linear in the b parameters; it is not necessarily linear in the
x variables. Thus, models such as y = b0 + b1x1 + b2x1² + ε (linear in the b's but
nonlinear in x1) are included in the designation linear model. A model provides a theoretical


framework for better understanding of a phenomenon of interest. Thus a model is
a mathematical construct that we believe may represent the mechanism that
generated the observations at hand. The postulated model may be an idealized
oversimplification of the complex real-world situation, but in many such cases,
empirical models provide useful approximations of the relationships among
variables. These relationships may be either associative or causative.
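A minimal sketch of fitting model (1.2) is given below. It uses invented crop-yield data (nitrogen and phosphate applications as predictors; none of the numbers come from the cited source) and estimates the regression coefficients with numpy's least-squares routine.

```python
import numpy as np

# Hypothetical data: crop yield (y) vs. nitrogen (x1) and phosphate (x2) applied
X_raw = np.array([[20, 10], [30, 15], [40, 10],
                  [50, 20], [60, 25], [70, 20]], dtype=float)
y = np.array([2.1, 2.9, 3.2, 4.1, 4.8, 5.0])

# Design matrix with a column of ones for the intercept b0
X = np.column_stack([np.ones(len(y)), X_raw])

# Least-squares estimates of b0, b1, b2
b, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)

print("estimated coefficients (b0, b1, b2):", np.round(b, 4))
print("fitted values:", np.round(X @ b, 3))
```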

Regression models such as (1.2) are used for various purposes, including the
following:

1. Prediction. Estimates of the individual parameters are of less


importance for prediction than the overall influence of the x variables on y.
However, good estimates are needed to achieve good prediction performance.

2. Data Description or Explanation. The scientist or engineer uses the estimated


model to summarize or describe the observed data.

3. Parameter Estimation. The values of the estimated parameters may have


theoretical implications for a postulated model.
4. Variable Selection or Screening. The emphasis is on determining the
importance of each predictor variable in modeling the variation in y. The
predictors that are associated with an important amount of variation in y are
retained; those that contribute little are deleted.

5. Control of Output. A cause-and-effect relationship between y and the x


variables is assumed. The estimated model might then be used to control the
output of a process by varying the inputs. By systematic experimentation, it may
be possible to achieve the optimal output.

There is a fundamental difference between purposes 1 and 5. For prediction, we


need only assume that the same correlations that prevailed when the data were
collected also continue in place when the predictions are to be made. Showing
that there is a significant relationship between y and the x variables in (1.2) does
not necessarily prove that the relationship is causal. To establish causality in
order to control output, the researcher must choose the values of the x variables
in the model and use randomization to avoid the effects of other possible
variables unaccounted for. In other words, to ascertain the effect of the x
variables on y when the x variables are changed, it is necessary to change them.

Analysis-of-Variance Models

In analysis-of-variance (ANOVA) models, we are interested in comparing several


populations or several conditions in a study. Analysis-of-variance models can be
expressed as linear models with restrictions on the x values. Typically, the x‘s are
0s or 1s. For example, suppose that a researcher wishes to compare the mean
yield for four types of catalyst in an industrial process. If n observations are to
be obtained for each catalyst, one model for the 4n observations can be expressed as

yij = μ + αi + εij,   i = 1, 2, 3, 4;   j = 1, 2, …, n,

where μ is an overall mean, αi is the effect of the ith catalyst, and εij is a random
error term.
In the examples leading to models (1.3)–(1.5), the researcher chooses the type of
catalyst or level of temperature and thus applies different treatments to the
objects or experimental units under study. In other settings, we compare the
means of variables measured on natural groupings of units, for example, males
and females or various geographic areas.

(Credit & Source: Alvin C. Rencher and G. Bruce Schaalje, 2008, Linear Models in
Statistics Department of Statistics, Brigham Young University, Provo, Utah).
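The catalyst comparison described above can be illustrated with a short one-way ANOVA sketch. The yield values below are hypothetical, chosen only to show how equality of the four catalyst means might be tested; this is an illustrative example, not material from the cited text.

```python
import numpy as np
from scipy import stats

# Hypothetical yields (n = 5 observations per catalyst)
catalyst_1 = np.array([81.2, 79.5, 80.1, 82.0, 78.9])
catalyst_2 = np.array([84.3, 85.1, 83.7, 86.0, 84.9])
catalyst_3 = np.array([79.8, 80.4, 78.6, 81.1, 79.0])
catalyst_4 = np.array([82.5, 83.0, 81.9, 84.2, 82.8])

# F-test of the null hypothesis that all four mean yields are equal
f_stat, p_value = stats.f_oneway(catalyst_1, catalyst_2, catalyst_3, catalyst_4)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```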

SELF-TEST
1) If a linear regression model fits the data perfectly, i.e., the train error is zero, then
____________________

a) Test error is also always zero

b) Test error is non zero

c) Couldn't comment on Test error

d) Test error is equal to Train error

2) How many coefficients do you need to estimate in a simple linear regression
model (One independent variable)?

a) 1

b) 2

c) 3

d) 4

SHORT ANSWER QUESTIONS


3) What is an example of a linear regression question?

4) What is the research question for linear regression analysis?

UNIT 03-03: MODELS IN ENVIRONMENTAL SCIENCE EMPHASIZING - II

LEARNING OBJECTIVES
After successful completion of this unit, you will be able to

 Chemical transport models

 Principles of chemical transport models

Chemical Transport Model:

A chemical transport model (CTM) is a type of computer numerical


model which typically simulates atmospheric chemistry and may be used for air
pollution forecasting.

Chemical transport models and general circulation models

While related general circulation models (GCMs) focus on simulating overall


atmospheric dynamics (e.g. fluid and heat flows), a CTM instead focuses on the
stocks and flows of one or more chemical species. Likewise, while a CTM must solve
only the continuity equation for its species of interest, a GCM must solve all
the primitive equations for the atmosphere; but a CTM will be expected to
accurately represent the entire cycle for the species of interest,
including fluxes (e.g. advection), chemical production/loss, and deposition. That
being said, the tendency, especially as the cost of computing declines over time,
is for GCMs to incorporate CTMs for species of special interest to climate
dynamics, especially shorter-lived species such as nitrogen oxides and volatile
organic compounds; this allows feedbacks from the CTM to the GCM's radiation
calculations, and also allows the meteorological fields forcing the CTM to be
updated at higher time resolution than may be practical in studies with offline
CTMs.

Types of chemical transport models

CTMs may be classified according to their methodology and their species of


interest, as well as more generic characteristics (e.g. dimensionality, degree of
resolution).

Methodologies

Jacob (1999) classifies CTMs as Eulerian/"box" or Lagrangian/"puff" models,


depending on whether the CTM in question focuses on

 (Eulerian) "boxes" through which fluxes, and in which chemical


production/loss and deposition occur over time
 (Lagrangian) the production and motion of parcels of air ("puffs") over
time

An Eulerian CTM solves its continuity equations using a global/fixed frame of


reference, while a Lagrangian CTM uses a local/moving frame of reference.
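The difference between the two frames of reference can be sketched with a toy one-dimensional example (entirely illustrative; the wind speed, grid spacing, time step and initial puff are assumed values): the Eulerian view updates concentrations on a fixed grid, while the Lagrangian view simply follows the position of the moving air parcel.

```python
import numpy as np

u = 5.0          # wind speed (m/s), assumed constant
dx = 1000.0      # Eulerian grid spacing (m)
dt = 60.0        # time step (s)
nsteps = 30

# Eulerian view: concentration on a fixed grid, advected with a first-order upwind scheme
c = np.zeros(50)
c[5] = 100.0                                        # initial puff of tracer in cell 5
for _ in range(nsteps):
    c[1:] = c[1:] - u * dt / dx * (c[1:] - c[:-1])  # upwind advection update

# Lagrangian view: simply track the position of the moving parcel
parcel_x = 5 * dx + u * dt * nsteps

print("Eulerian: cell with maximum concentration:", int(np.argmax(c)))
print("Lagrangian: parcel position (m):", parcel_x)
```

Both views place the tracer at roughly the same downwind location; the Eulerian result also shows the numerical spreading that comes from solving the continuity equation on a fixed grid.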


Examples of Eulerian CTMs

 CCATT-BRAMS
 WRF-Chem
 CMAQ, CMAQ Website
 CAMx
 GEOS-Chem
 LOTOS-EUROS
 MATCH

MOZART (Model for OZone And Related chemical Tracers) is developed


jointly by the (US) National Center for Atmospheric Research (NCAR),
the Geophysical Fluid Dynamics Laboratory (GFDL), and the Max Planck
Institute for Meteorology (MPI-Met) to simulate changes
in ozone concentrations in the Earth's atmosphere. MOZART was designed to
simulate tropospheric chemical and transport processes, but has been extended
(MOZART3) into the stratosphere and mesosphere. It can be driven by
standard meteorological fields from, for example, the National Centers for
Environmental Prediction (NCEP), the European Centre for Medium-Range
Weather Forecasts (ECMWF) and the Global Modeling and Assimilation Office

(GMAO), or by fields generated from general circulation models. MOZART4
improves MOZART2's chemical mechanisms, photolysis scheme, dry
deposition mechanism, biogenic emissions and handling of tropospheric aerosols.
 TOMCAT/SLIMCAT
 CHIMERE
 POLYPHEMUS

TCAM (Transport Chemical Aerosol Model): a mathematical modelling


method (computer simulation) designed to model certain aspects of the Earth's
atmosphere. TCAM is one of several chemical transport models, all of which are
concerned with the movement of chemicals in the atmosphere, and are thus used
in the study of air pollution.

TCAM is a multiphase three-dimensional Eulerian grid model (as opposed to


lagrangian or other modeling methods). It is designed for modelling dispersion of
pollutants (in particular photochemical and aerosol) at mesoscales (medium
scale, generally concerned with systems a few hundred kilometers in size).

TCAM was developed at the University of Brescia in Italy.

Examples of Lagrangian CTMs


 CLaMS
 FLEXPART

Examples of Semi-Lagrangian CTMs


 MOCAGE
 GEM-MACH

Examples of ozone CTMs


 CLaMS
 MOZART

(Source: https://en.wikipedia.org/wiki/Chemical_transport_model)

Model types.

Mathematical models provide the necessary framework for integration of our


understanding of individual atmospheric processes and study of their
interactions. Note that the atmosphere is a complex reactive system in which
numerous physical and chemical processes occur simultaneously.

Atmospheric chemical transport models are defined according to their spatial
scale:

Fig. 3.2 Chemical transport models spatial scale

(Credit: http://irina.eas.gatech.edu/ATOC3500_Fall1998/Lecture29.pdf)

Fig. 3.3 Components of a chemical transport model (Seinfeld and Pandis, 1998).

(Credit: http://irina.eas.gatech.edu/ATOC3500_Fall1998/Lecture29.pdf)

The domain of the atmospheric model is the area that is simulated. The computational
domain consists of an array of computational cells, each having uniform
chemical composition. The size of the cells determines the spatial resolution of the
model.

Atmospheric chemical transport models are also characterized by their
dimensionality: zero-dimensional (box) models; one-dimensional (column) models;
two-dimensional models; and three-dimensional models.

Fig. 3.4 zero-dimensional (box) model; one-dimensional (column) model; two-
dimensional model; and three-dimensional model.

(Credit: http://irina.eas.gatech.edu/ATOC3500_Fall1998/Lecture29.pdf )

The model time scale depends on the specific application, varying from hours (e.g., air
quality models) to hundreds of years (e.g., climate models).

Two principal approaches to simulate changes in the chemical composition of a


given air parcel:

1) Lagrangian approach: the air parcel moves with the local wind so that no mass
exchange is allowed between the air parcel and its surroundings (except for
species emissions). The air parcel moves continuously, so the model simulates
species concentrations at different locations at different times;

2) Eulerian approach: the model simulates the species concentrations in an array of


fixed computational cells.

Box models (or 'zero-dimensional', or 0-D models) are the simplest models,
where the atmospheric domain is represented by only one box.

In a box model concentrations are the same everywhere and therefore are
functions of time only, ni(t).

Eulerian box model:

Chemical species enter a box in two ways:

1) source emissions;

2) transport: advection (the transport of a species by the mean horizontal motion
of the air parcel) and entrainment (the vertical movement of air parcels as a
consequence of turbulent mixing).

Chemical species are removed from a box in three ways:

1) transport: advection out of the box and detrainment due to upwards motion;

2) chemical transformations;

3) removal processes: dry deposition on surface.

• In the Lagrangian box model: advection terms are eliminated, but source terms
vary as the parcel moves over different source regions.

Some features of the box models:

• The dimensions and placement of the box are dictated by the particular problem
of interest. For instance, to study the influence of urban emissions on the
chemical composition of air, the box may be designed to cover the urban area.

• Box models can be time dependent. In that case, any variations in time of the
processes considered need to be accurately supplied to the model. For instance, if
a box model is used to compute air quality for a 24-hr day in an urban area, the
model must include daily variations in traffic and other sources; diurnal wind
speed patterns; variations in the height of the mixed layer; variation of solar
radiation during the day; etc.

Limitations of a box model:

i) assumes rapid vertical and horizontal mixing;

ii) assumes uniformity of surface sources;

The chemical loss or production rate Ci depends on the number of chemical
reactions included in the model, and it is given by

Ci = Pi – Ri

where Pi is the production term of species i and Ri is the loss term.

(Credit: Graedel T. and P.Crutzen. ―Atmospheric change: an earth system


perspective‖. Chapter 15.‖Bulding environmental chemical models‖, 1992. at
http://irina.eas.gatech.edu/ATOC3500_Fall1998/Lecture29.pdf ).
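A minimal sketch of a single Eulerian box is given below, assuming illustrative values for the mixing height, box length, wind speed, emission rate, chemical loss rate and deposition velocity (none of these numbers come from the cited lecture notes). It shows how emission, advection in and out, chemical loss and dry deposition combine in the mass balance for the box-averaged concentration.

```python
# Assumed box and process parameters (illustrative only)
H = 1000.0        # mixing height (m)
L = 20000.0       # box length in the wind direction (m)
u = 3.0           # wind speed (m/s)
E = 2.0e-6        # area emission rate (g m-2 s-1)
k_chem = 1.0e-5   # first-order chemical loss rate (s-1)
v_dep = 0.005     # dry deposition velocity (m s-1)
c_in = 0.0        # concentration advected into the box (g m-3)

dt = 60.0         # time step (s)
nsteps = 24 * 60  # simulate 24 hours

c = 0.0           # initial concentration in the box (g m-3)
for _ in range(nsteps):
    # mass balance: emission + advection in - advection out - chemistry - deposition
    dcdt = E / H + (u / L) * (c_in - c) - k_chem * c - (v_dep / H) * c
    c += dcdt * dt

print(f"concentration after 24 h: {c:.3e} g m-3")
```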

SELF-TEST
1) Limitations of a box model

a) Assumes rapid vertical and horizontal mixing;


b) Assumes slow vertical and horizontal mixing;

c) Both of above

d) None of above

2) CTM stands for

a) Chemical transport method

b) Chemical transport models

c) Check transport models

d) None of above

SHORT ANSWER QUESTIONS 01


1) Why is it important to study chemical transport models?

2) Write a note on chemical transport models.

UNIT 03-04: MODELS IN ENVIRONMENTAL SCIENCE EMPHASIZING - III

LEARNING OBJECTIVES
After successful completion of this unit, you will be able to learn

 Basics of Inverse modeling

 Detail about inverse modeling

Inverse modeling

Inverse modeling is a formal approach for using observations of a physical


system to better quantify the variables driving that system. This is generally done
by statistically optimizing the estimates of the variables given all the
observational and other information at hand. We call the variables that we wish
to optimize the state variables and assemble them into a state vector x. We
similarly assemble the observations into an observation vector y. Our
understanding of the relationship between x and y is described by a model F of
the physical system called the forward model:

y = F(x, p) + εO      (11.1)

Here, p is a parameter vector including all model variables that we do not seek to
optimize as part of the inversion, and εΟ is an observational error vector
including contributions from errors in the measurements, in the forward model,
and in the model parameters. The forward model predicts the effect (y) as a
function of the cause (x), usually through equations describing the physics of the
system. By inversion of the model we can quantify the cause (x) from
observations of the effect (y). In the presence of error (εO ≠ 0), the solution is
a best estimate of x with some statistical error. This solution for x is called the
optimal estimate, the posterior estimate, or the retrieval. The choice of state
vector (that is, which model variables to include in x versus in p) is totally up to
us. It depends on which variables we wish to optimize, what information is
contained in the observations, and what computational costs are associated with
the inversion. Because of the uncertainty in deriving x from y, we have to
consider other constraints on the value of x that may help to reduce the error on
the optimal estimate. These constraints are called the prior information. A

standard constraint is the prior estimate xA, representing our best estimate of x
before the observations are made. It has some error εA. The optimal estimate
must then weigh the relative information from the observations y and the prior
estimate xA, and this is done by considering the error statistics of εO and εA.
Inverse modeling allows a formal analysis of the relative importance of the
observations versus the prior information in determining the optimal estimate. As
such, it informs us whether an observing system is effective for constraining x.

Inverse modeling has three main applications in atmospheric chemistry,


summarized in fig. 3.5.

Fig. 3.5 Application of Inverse modeling

(Credit: www.cambridge.org)

1. Retrievals of atmospheric composition by remote sensing. Here the measured
radiances at different wavelengths represent the observation vector y,


and the concentrations on a vertical grid represent the state vector x. The forward
model F is a radiative transfer model (Chapter 5) that calculates y as a function
of x and of additional parameters p that may include surface emissivity,
temperatures, spectroscopic data, etc. The prior estimate xA is provided by
previous observations of the same or similar scenes, by knowledge of
climatological mean concentrations, or by a chemical transport model.

2. Top-down constraints on surface fluxes. Here we use measured atmospheric
concentrations (observation vector y) to constrain surface fluxes (state vector x).
The forward model F is a chemical transport model (CTM) that solves the
chemical continuity equations to calculate y as a function of x. The parameter
vector p includes meteorological variables, chemical variables such as rate
coefficients, and any characteristics of the surface flux such as diurnal variability
that are simulated in the CTM but not optimized as part of the state vector. The

information on x from the observations is called a top-down constraint on the
surface fluxes. The prior estimate xA is an inventory based on our knowledge of
the processes determining the surface fluxes (such as fuel combustion statistics,
land cover data bases, etc.) and is called a bottom-up constraint. See Section 9.2
for discussion of bottom-up and top-down constraints on surface fluxes.

3. Chemical data assimilation. Here we construct a gridded 3-D field of
concentrations x, usually time-dependent, on the basis of measurements y of
these concentrations or related quantities at various locations and times. Such a
construction may be useful to initialize chemical forecasts, to assess the
consistency of measurements from different platforms, or to map the
concentrations of non-measured species on the basis of measurements of related
species. We refer to this class of inverse modeling as data assimilation. The
corresponding state vectors are usually very large. In the time-dependent
problem, the prior estimate is an atmospheric forecast model that evolves x(t)
from a previously optimized state at time t0 to a forecast state at the next
assimilation time step t0 + h. The forecast model is usually a weather prediction
model including simulation of the chemical variables to be assimilated. The
forward model F can be a simple mapping operator of observations at time t0 + h
to the model grid, a chemical model relating the observed variables to the state
variables, or the forecasting model itself.

Proper consideration of errors is crucial in inverse modeling. Let us examine


what happens if we ignore errors. We linearize the forward model y = F(x, p)
around the prior estimate xA taken as best guess:

y = F(xA, p) + K(x – xA) + O((x – xA)²)      (11.2)

where K is the Jacobian matrix of the forward model with


elements kij = ∂yi/∂xj evaluated at x = xA. The notation O((x – xA)²) groups
higher-order terms (quadratic and above) taken to be negligibly small. Let n and
m represent the dimensions of x and y, respectively. Assume that the
observations are independent such that m = n observations constrain x uniquely.
The Jacobian matrix is then an n × n matrix of full rank and hence invertible. We
obtain for x:

x = xA + K–1(y – F(xA, p))      (11.3)

If F is nonlinear, the solution (11.3) must be iterated with recalculation of the


Jacobian at successive guesses for x until satisfactory convergence is achieved.
Now what happens if we make additional observations, such that m > n? In the
absence of error these observations must necessarily be redundant. But we know
from experience that strong constraints on an atmospheric system typically
require a very large number of measurements, m > n. This is due to errors in the
measurements and in the forward model, described by the observational error
vector εO in (11.1). Thus (11.3) is not applicable in practice; successful inversion
requires adequate characterization of the observational error εΟ and
consideration of prior information.

(Source & Credit: https://www.cambridge.org/)
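The error-free inversion of equation (11.3) can be sketched numerically. The toy example below assumes a linear forward model y = Kx with an invented 2 × 2 Jacobian, a made-up true state, and a prior estimate xA; it simply verifies that, in the absence of error, the retrieval recovers the true state.

```python
import numpy as np

# Toy linear forward model y = F(x) = K x with m = n = 2 (no error term)
K = np.array([[2.0, 0.5],
              [1.0, 3.0]])          # Jacobian of the forward model
x_true = np.array([4.0, 1.0])        # "unknown" state we want to recover
x_a = np.array([3.0, 2.0])           # prior estimate xA

y_obs = K @ x_true                    # error-free observations

# Equation (11.3): x = xA + K^-1 (y - F(xA))
x_hat = x_a + np.linalg.solve(K, y_obs - K @ x_a)

print("retrieved state:", x_hat)      # recovers x_true exactly when there is no error
```

In the realistic case described in the text, with observational error and m > n, the inversion would instead weight the observations and the prior estimate xA according to their error statistics rather than invert K directly.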

SELF-TEST
3) Inverse modeling is

a) Formal approach for using observations of a physical system

b) Return approach

c) Both of above

d) None of above

4) Inverse modeling has application in

a) Remote sensing

b) Top-down constraints

c) Data assimilation

d) All of above

SHORT ANSWER QUESTIONS 01


1) What are the applications of Inverse modeling?

2) Write a note on Inverse modeling.

SUMMARY
A model is an abstraction of a real system that takes influence from aspects of
the real system and the modeller's perception of the system and its importance to
the problem at hand. It supports the conceptualization and exploration of the
behaviour of objects or processes and their interaction as a means of better
understanding these and generating hypotheses concerning them. It is defined as
a simplification of reality that is constructed to gain insights into selected
attributes of a physical, biological, economic, or social system. Cross and
Moscardini (1985) describe modelling as an art with a rational basis which
requires the use of common sense at least as much as mathematical expertise.
Intuition and the resulting insight are the factors which distinguish good
modellers from mediocre ones. Modelling is the process of producing a model
that is similar to but simpler than the system it represents. It is a judicious
trade-off between realism and simplicity. A simulation of a system is the operation of a
model of the system, which can be reconfigured and experimented with.
Environmental modelling is a science that uses mathematics and computers to
simulate physical and chemical phenomena in the environment. The motivation
behind developing an environmental model is often to explain complex
behaviour in environmental systems, or improve understanding of a system. In
areas where public policy and public safety are at stake, the burden is on the
modeller to demonstrate the degree of correspondence between the model and the real world it represents. Linear statistical
methods are widely used as part of the scientific method and are useful in both
the planning stages of research and analysis of the resulting data. Simple and
multiple linear regression models, and analysis-of-variance (ANOVA) models are
used to model the relationship between two variables, such as income and
number of years of education, height and weight of people, length and width of
envelopes, temperature and output of an industrial process, altitude and boiling
point of water, or dose of a drug and response. Error is a statistical term
representing random fluctuations, measurement errors, or the effect of factors
outside of our control. The response y is often influenced by more than one
predictor variable, such as the amount of nitrogen, potash, and phosphate
fertilizers used. Regression models such as (1.2) are used for various purposes,
such as prediction, data description, parameter estimation, variable selection, and
control of output. For prediction, the model is linear in the b parameters, but not

necessarily linear in the x variables. For parameter estimation, the values of the
estimated parameters may have theoretical implications for a postulated model.
For variable selection, the emphasis is on determining the importance of each
predictor variable in modeling the variation in y. For control of output, a
cause-and-effect relationship between y and the x variables is assumed, and systematic
experimentation may be possible to achieve the optimal output. Analysis-of-
variance (ANOVA) models are linear models with restrictions on the x values. To
establish causality, the researcher must choose the values of the x variables in the
model and use randomization to avoid the effects of other possible variables.
Examples include comparing the mean yield for four types of catalyst in an
industrial process, choosing the type of catalyst or temperature, and comparing
the means of variables measured on natural groupings of units. A chemical
transport model (CTM) is a type of computer numerical model which simulates
atmospheric chemistry and may give air pollution forecasting. It is classified
according to its methodology and species of interest, as well as more generic
characteristics (e.g. dimensionality, degree of resolution). Jacob (1999) classifies
CTMs as Eulerian/"box" or Lagrangian/"puff" models, depending on whether the
CTM focuses on fluxes, production/loss, and deposition. An Eulerian CTM
solves its continuity equations using a global/fixed frame of reference, while a
Lagrangian CTM uses a local/moving frame of reference. MOZART (Model for
OZone And Related Chemical Tracers) is a chemical transport model
developed by NCAR, GFDL, and MPI-Met to simulate changes in ozone
concentrations in the Earth's atmosphere. TCAM (Transport Chemical Aerosol
Model) is a multiphase three-dimensional eulerian grid model used to model
dispersion of pollutants at mesoscales. MOZART4 improves MOZART2's
chemical mechanisms, photolysis scheme, dry deposition mechanism, biogenic
emissions and handling of tropospheric aerosols. A chemical transport model is a
model that simulates changes in the chemical composition of an air parcel. It
consists of an array of computational cells, each with uniform chemical
composition, and is characterized by its dimensionality: zero-dimensional (box)
models, one-dimensional (column) models, two-dimensional models, and three-
dimensional models. The model time scale varies from hours to hundreds of
years, and there are two principal approaches to simulate changes: Lagrangian
and Eulerian. Box models are the simplest, where concentrations are the

same everywhere and therefore are functions of time only, ni(t). Chemical
species enter a box in two ways: source emissions and transport. They are
removed in three ways: transport out of the box and detrainment due to upwards
motion, chemical transformations, and dry deposition on surface. Box models
can be time dependent and must include daily variations in traffic and other
sources, diurnal wind speed patterns, variations in the height of the mixed layer,
and variation of solar radiation during the day. Limitations of a box model
include rapid vertical and horizontal mixing and uniformity of surface sources.
Inverse modeling is a formal approach for using observations of a physical
system to better quantify the variables driving that system. It involves
statistically optimizing the estimates of the variables given all the observational
and other information. The forward model predicts the effect (y) as a function of
the cause (x). By inversion of the model, the solution is a best estimate of x with
some statistical error. The choice of state vector (that is, which model variables
to include in x versus in p) depends on which variables we wish to optimize,
what information is contained in the observations, and what computational costs
are associated with the inversion. The prior estimate xA is used to reduce the
error on the optimal estimate. Inverse modeling is a formal analysis of the
relative importance of observations versus the prior information in determining
the optimal estimate. It has three main applications in atmospheric chemistry:
retrieval of atmospheric composition by remote sensing, top-down constraints on
surface fluxes, and chemical data assimilation. Top-down constraints use measured atmospheric
concentrations (observation vector y) to constrain surface fluxes (state vector x).
Chemical data assimilation constructs a gridded 3-D field of concentrations x,
usually time-dependent, on the basis of measurements y of these concentrations
or related quantities at various locations and times.

In inverse modeling, the prior estimate is an atmospheric forecast model that


evolves x(t) from a previously optimized state at time t0 to a forecast state at the
next assimilation time step. The forward model F can be a simple mapping
operator of observations at time t0 + h to the model grid, a chemical model
relating the observed variables to the state variables, or the forecasting model
itself. We linearize the forward model y = F(x, p) around the prior estimate xA
taken as best guess. O((x – xA)²) groups higher-order terms (quadratic and

above) taken to be negligibly small. Let n and m represent the dimensions of x
and y, and assume that m = n observations constrain x uniquely.

The Jacobian matrix is an n × n matrix of full rank and hence invertible. If F is


nonlinear, the solution must be iterated with recalculation of the Jacobian at
successive guesses until satisfactory convergence is achieved. However,
successful inversion requires adequate characterization of the observational error
εO and consideration of prior information.

REFERENCES
 https://www.cambridge.org/
 Graedel T. and P.Crutzen. ―Atmospheric change: an earth system
perspective‖. Chapter 15, ‖Building environmental chemical
models‖, 1992, at
http://irina.eas.gatech.edu/ATOC3500_Fall1998/Lecture29.pdf
 https://en.wikipedia.org/wiki/Chemical_transport_model
 Alvin C. Rencher and G. Bruce Schaalje, 2008, Linear Models in
Statistics Department of Statistics, Brigham Young University, Provo,
Utah
 Compiled by Bikash Sherchan, 2019 at https://www.studocu.com/

CREDIT 04

UNIT 04-01 ELEMENTARY CONCEPTS, LAWS, THEORIES AND PROCESSES

LEARNING OBJECTIVES
After successful completion of this unit, you will be able to learn-

 Basics of Elementary Concept

 Elementary laws, theories and processes

Elementary Concept

Knowledge of basic concepts is a prerequisite for an insight into the sample


survey designs. Assuming some exposure to elementary probability theory on the
part of the reader, we present in this chapter, a rapid review of some of these
concepts. To begin with, preliminary statistical concepts including those of
expectation, variance, and covariance, for random variables and linear functions
of random variables will be defined. The concepts of measure of error, interval
estimation, and sample size determination, which are related to the sampling
distribution, will also be discussed.

Statistics: Elementary Probability Theory

A probability gives the likelihood that a defined event will occur. It is quantified as a
positive number between 0 (the event is impossible) and 1 (the event is certain). Thus,
the higher the probability of a given event, the more likely it is to occur. If A is a
defined event, then the probability of A occurring is expressed as P(A). Probability
can be expressed in a number of ways. A frequentist approach is to observe a number
of particular events out of a total number of events. Thus, we might say the
probability of a boy is 0.52, because out of a large number of singleton births we
observe 52% are boys. A model-based approach is where a model, or mechanism
determines the event; thus, the probability of a ‗1‘ from an unbiased die is 1/6 since
there are 6 possibilities, each equally likely and all adding to one. An opinion-based
approach is where we use our past experience to predict a future event, so we might
give the probability of our favourite football team winning the next match, or whether
it will rain tomorrow.
Given two events A and B, we often want to determine the probability of either event,
or both events, occurring.

Addition Rule
The addition rule is used to determine the probability of at least one of two (or more)
events occurring. In general, the probability of either event A or B is given by:
P(A or B) = P(A) + P(B) – P(A and B)
If A and B are mutually exclusive, this means they cannot occur together, i.e. P(A and
B)=0. Therefore, for mutually exclusive events the probability of either A or B occurring
is given by:
P (A or B) = P(A) + P(B)
Example: If event A is that a person is blood group O and event B is that they are blood
group B, then these events are mutually exclusive since a person may only be either one
or the other. Hence, the probability that a given person is either group O or B is
P(A)+P(B).
Multiplication Rule
The multiplication rule gives the probability that two (or more) events happen together.
In general, the probability of both events A and B occurring is given by:
P(A and B) = P(A) x P(B|A) = P(B) x P(A|B)
The notation P(B|A) is the probability that event B occurs given that event A has
occurred, where the symbol '|' is read as 'given'. This is an example of a conditional
probability, the condition being that event A has happened. For example, the probability
of drawing the ace of hearts from a well-shuffled pack is 1/52. The probability of the ace
of hearts given that the card is red is 1/26.
Example: If event A is a person getting neuropathy and event B that they are diabetic,
then P(A|B) is the probability of getting neuropathy given that they are diabetic.
If A and B are independent events, then the probability of event B is unaffected by the
probability of event A (and vice versa). In other words, P(B|A) = P(B). Therefore, for
independent events, the probability of both events A and B occurring is given by:
P(A and B) = P(A) x P(B)
Example: If event A is that a person is blood group O and event B that they are diabetic,
then the probability of someone having blood group O and being diabetic is P(A)xP(B),
assuming that getting diabetes is unrelated to a person‘s blood group.
Note that if A and B are mutually exclusive, then P(A|B) = 0.
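A minimal sketch of the multiplication rule, using the card example above together with assumed values for the independent-events case; this code is illustrative only.

```python
# Multiplication rule: P(A and B) = P(B) * P(A|B), shown with the card example.
p_red = 26 / 52                    # P(card is red)
p_ace_hearts_given_red = 1 / 26    # P(ace of hearts | card is red)
print(f"P(ace of hearts) = {p_red * p_ace_hearts_given_red:.4f}")   # 1/52 = 0.0192

# Independent events: P(A and B) = P(A) * P(B).
# Assumed illustrative values: P(blood group O) = 0.45, P(diabetic) = 0.05.
p_O, p_diabetic = 0.45, 0.05
print(f"P(group O and diabetic) = {p_O * p_diabetic:.4f}")          # 0.0225
```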

Bayes‟ Theorem
From the multiplication rule above, we see that:



P(A) x P(B|A) = P(B) x P(A|B)
This leads to what is known as Bayes' theorem:

P(B|A)=P(A|B)P(B)/P(A)

Thus, the probability of B given A is the probability of A given B, times the probability
of B divided by the probability of A.
This formula is not appropriate if P(A)=0, that is if A is an event which cannot happen.
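The relationship can be verified with a short sketch; the three probabilities are assumed values chosen only for illustration.

```python
# Bayes' theorem: P(B|A) = P(A|B) * P(B) / P(A), with assumed illustrative numbers.
p_A_given_B = 0.9   # assumed
p_B = 0.2           # assumed
p_A = 0.3           # assumed (must be non-zero)

p_B_given_A = p_A_given_B * p_B / p_A
print(f"P(B|A) = {p_B_given_A:.2f}")   # 0.60

# Consistency with the multiplication rule: P(A)*P(B|A) equals P(B)*P(A|B).
assert abs(p_A * p_B_given_A - p_B * p_A_given_B) < 1e-12
```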
Sensitivity and Specificity
Many diagnostic test results are given in the form of a continuous variable (that is one
that can take any value within a given range), such as diastolic blood pressure or
haemoglobin level. However, for ease of discussion we will first assume that these have
been divided into positive or negative results. For example, a positive diagnostic result of
'hypertension' is a diastolic blood pressure greater than 90 mmHg; whereas for 'anaemia',
a haemoglobin level less than 12 g/dl is required.
For every diagnostic procedure (which may involve a laboratory test of a sample taken)
there is a set of fundamental questions that should be asked. Firstly, if the disease is
present, what is the probability that the test result will be positive? This leads to the
notion of the sensitivity of the test. Secondly, if the disease is absent, what is the
probability that the test result will be negative? This question refers to the specificity of
the test. These questions can be answered only if it is known what the 'true' diagnosis is.
In the case of organic disease this can be determined by biopsy or, for example, an
expensive and risky procedure such as angiography for heart disease. In other situations,
it may be by 'expert' opinion. Such tests provide the so-called 'gold standard'.
Example
Consider the results of an assay of N-terminal pro-brain natriuretic peptide (NT-proBNP)
for the diagnosis of heart failure in a general population survey in those over 45 years of
age, and in patients with an existing diagnosis of heart failure, obtained by Hobbs, Davis,
Roalfe, et al (BMJ 2002) and summarised in Fig. 4.1. Heart failure was identified when
NT-proBNP > 36 pmol/l.



Fig. 4.1: Results of NT-proBNP assay in the general population over 45 and those with a
previous diagnosis of heart failure (after Hobbs, Davis, Roalfe et al, BMJ 2002)

We denote a positive test result by T+, and a positive diagnosis of heart failure (the
disease) by D+. The prevalence of heart failure in these subjects is 103/410=0.251, or
approximately 25%. Thus, the probability of a subject chosen at random from the
combined group having the disease is estimated to be 0.251. We can write this as
P(D+)=0.251.
The sensitivity of a test is the proportion of those with the disease who also have a
positive test result. Thus the sensitivity is given by e/(e+f)=35/103=0.340 or 34%. Now
sensitivity is the probability of a positive test result (event T+) given that the disease is
present (event D+) and can be written as P(T+|D+), where the '|' is read as 'given'.
The specificity of the test is the proportion of those without disease who give a negative
test result. Thus the specificity is h/(g+h)=300/307=0.977 or 98%. Now specificity is the
probability of a negative test result (event T-) given that the disease is absent (event D-)
and can be written as P(T-|D-).
Since sensitivity is conditional on the disease being present, and specificity on the disease
being absent, in theory, they are unaffected by disease prevalence. For example, if we
doubled the number of subjects with true heart failure from 103 to 206 in Fig. 4.1, so that
the prevalence was now 206/(307+206)=40%, then we could expect twice as many
diseased subjects to give a positive test result. Thus 2x35=70 would have a positive result. In this
case the sensitivity would be 70/206=0.34, which is unchanged from the previous value.
A similar result is obtained for specificity.
Sensitivity and specificity are useful statistics because they will yield consistent results
for the diagnostic test in a variety of patient groups with different disease prevalences.
This is an important point; sensitivity and specificity are characteristics of the test, not the
population to which the test is applied. In practice, however, if the disease is very rare,
the accuracy with which one can estimate the sensitivity may be limited. This is because
the numbers of subjects with the disease may be small, and in this case the proportion
correctly diagnosed will have considerable uncertainty attached to it.
Two other terms in common use are: the false negative rate (or probability of a false
negative), which is given by f/(e+f)=1-sensitivity, and the false positive rate (or
probability of a false positive), which is given by g/(g+h)=1-specificity.
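The quantities above can be reproduced from the 2x2 counts implied by the worked example (e = 35 true positives, f = 68 false negatives, g = 7 false positives, h = 300 true negatives); the short sketch below is illustrative and not part of the source text.

```python
# Sensitivity, specificity and error rates from the NT-proBNP 2x2 counts:
# e = true positives, f = false negatives, g = false positives, h = true negatives.
e, f, g, h = 35, 68, 7, 300        # counts implied by the worked example above

sensitivity = e / (e + f)          # P(T+ | D+)
specificity = h / (g + h)          # P(T- | D-)
false_negative_rate = f / (e + f)  # 1 - sensitivity
false_positive_rate = g / (g + h)  # 1 - specificity
prevalence = (e + f) / (e + f + g + h)

print(f"sensitivity = {sensitivity:.3f}")            # 0.340
print(f"specificity = {specificity:.3f}")            # 0.977
print(f"false negative rate = {false_negative_rate:.3f}")
print(f"false positive rate = {false_positive_rate:.3f}")
print(f"prevalence = {prevalence:.3f}")              # 0.251
```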
These concepts are summarised in Fig. 4.2.

Fig. 4.2: Summary of definitions of sensitivity and specificity


(Source & Credit: www.healthknowledge.org.uk)

It is important for consistency always to put true diagnosis on the top, and test
result down the side. Since sensitivity=1–P(false negative) and specificity=1–
P(false positive), a possibly useful mnemonic to recall this is that 'sensitivity'
and 'negative' have 'n's in them and 'specificity' and 'positive' have 'p's in them.

Predictive Value of a Test


Suppose a doctor is confronted by a patient with chest pain suggestive of angina,
and that the results of the study described in Fig. 4.3 are available.



Fig. 4.3: Results of exercise tolerance test in patients with suspected coronary
artery disease

The prevalence of coronary artery disease in these patients is 1023/1465=0.70. The
doctor therefore believes that the patient has coronary artery disease with probability
0.70. In terms of betting, one would be willing to lay odds of about 7:3 that the patient
does have coronary artery disease. The patient now takes the exercise test and the result is
positive. How does this modify the odds? It is first necessary to calculate the probability
of the patient having the disease, given a positive test result. From Fig. 4.3, there are 930
men with a positive test, of whom 815 have coronary artery disease. Thus, the estimate of
0.70 for the patient is adjusted upwards to the probability of disease, with a positive test
result, of 815/930=0.88.
This gives the predictive value of a positive test (positive predictive value):
P(D+|T+) = 0.88.
The predictive value of a negative test (negative predictive value) is:
P(D-|T-) = 327/535 = 0.61.
These values are affected by the prevalence of the disease. For example, if those with the
disease doubled in Fig. 4.3, then the predictive value of a positive test would become
1630/(1630+115)=0.93 and the predictive value of a negative test 327/(327+416)=0.44.
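The predictive values and the effect of a change in prevalence can be reproduced from the counts of the exercise-test example (815 true positives, 115 false positives, 208 false negatives, 327 true negatives); the snippet is an illustrative sketch only.

```python
# Positive and negative predictive values, and the effect of doubling prevalence.
tp, fp, fn, tn = 815, 115, 208, 327     # counts from the exercise-test example

ppv = tp / (tp + fp)    # P(D+ | T+)
npv = tn / (tn + fn)    # P(D- | T-)
print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")     # 0.88, 0.61

# Doubling the diseased group doubles tp and fn but leaves fp and tn unchanged.
tp2, fn2 = 2 * tp, 2 * fn
print(f"PPV = {tp2 / (tp2 + fp):.2f}, NPV = {tn / (tn + fn2):.2f}")   # 0.93, 0.44
```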

The Role of Bayes' Theorem


Suppose event A occurs when the exercise test is positive and event B occurs when
angiography is positive. The probability of having both a positive exercise test and
coronary artery disease is thus P(T+ and D+). From Fig. 4.3, the probability of picking
out one man with both a positive exercise test and coronary heart disease from the group
of 1465 men is 815/1465=0.56.
However, from the multiplication rule:
P(T+ and D+) = P(T+|D+)P(D+)
P(T+|D+)=0.80 is the sensitivity of the test and P(D+)=0.70 is the prevalence of coronary
disease and so P(T+ and D+)=0.80x0.70=0.56, as before.
Bayes' theorem enables the predictive value of a positive test to be related to the
sensitivity of the test, and the predictive value of a negative test to be related to the
specificity of the test. Bayes' theorem enables prior assessments about the chances of a



diagnosis to be combined with the eventual test results to obtain a so-called ―posterior‖
assessment about the diagnosis. It reflects the procedure of making a clinical judgement.
In terms of Bayes' theorem, the diagnostic process is summarised by:
P(D+|T+)=P(T+|D+)P(D+)/P(T+)
The probability P(D+) is the a priori probability and P(D+|T+) is the a
posteriori probability.
Bayes' theorem is usefully summarised when we express it in terms of the odds of an
event, rather than the probability. Formally, if the probability of an event is p, then the
odds are defined as p/(1-p). The probability that an individual has coronary heart disease,
before testing, from Fig. 4.3 is 0.70, and so the odds are 0.70/(1-0.70)=2.33 (which can
also be written as 2.33:1).
(Source & Credit: https://www.healthknowledge.org.uk/public-health-textbook/research-
methods/1b-statistical-methods/elementary-probability-theory)
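As an illustrative sketch (not part of the source quoted above), the posterior probability of 0.88 can also be reached by multiplying the prior odds by the likelihood ratio of a positive test, sensitivity/(1 - specificity), using the counts of the exercise-test example.

```python
# Bayes' theorem in odds form: posterior odds = prior odds * likelihood ratio.
tp, fp, fn, tn = 815, 115, 208, 327          # counts from the exercise-test example

sensitivity = tp / (tp + fn)                 # P(T+ | D+)
specificity = tn / (tn + fp)                 # P(T- | D-)
prevalence = (tp + fn) / (tp + fp + fn + tn)

prior_odds = prevalence / (1 - prevalence)       # about 2.33
lr_positive = sensitivity / (1 - specificity)    # likelihood ratio of a positive test
posterior_odds = prior_odds * lr_positive
posterior_prob = posterior_odds / (1 + posterior_odds)

print(f"prior odds = {prior_odds:.2f}")
print(f"posterior P(D+|T+) = {posterior_prob:.2f}")   # about 0.88
```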

SELF-TEST
1) Elementary concepts of probability are

a) Used in mathematics

b) Useful in statistics

c) Both A & B

d) None of the above

2) Probability is

a) Considered in the analysis of data

b) Not applied in statistics

c) Both A & B

d) None of the above

SHORT ANSWER QUESTIONS


3) What are the elementary concepts of probability?

4) Write a note on Bayes' Theorem



UNIT 04-02: THE BUILDING BLOCKS: EXTENSIVE AND INTENSIVE
PROPERTIES, PROPERTIES RELEVANT TO ENVIRONMENTAL SYSTEMS,
THE MATERIAL BALANCE APPROACH; THE TRANSPORT PROCESSES –
ADVECTION, DIFFUSION, DISPERSION, GRAVITATIONAL SETTLING,
TRANSPORT IN POROUS MEDIA; THE TRANSFORMATION PROCESSES – THE
NON-REACTIVE PROCESSES, THE REACTIVE PROCESSES; SIMULATION OF
TRANSPORT AND TRANSFORMATION PROCESSES – INTRODUCTION, THE
COMPLETELY STIRRED TANK REACTOR, PLUG FLOW REACTOR, MIXED
FLOW REACTOR MODELS; THE GENERAL MATERIAL BALANCE MODELS

LEARNING OBJECTIVES
After successful completion of this unit, you will be able to learn

 Extensive and intensive properties, properties relevant to environmental


systems

 Various Processes in environmental systems

Environmental System:

Achim Sydow (National Research Center for Information Technology
(GMD.FIRST), Berlin, Germany) explained the environmental system as follows.
According to him, the conception of systems is very general. In system science it
is used to analyze complexity, to bring a greater amount of transparency into the
interaction of parts. It maps the flow of information or energy or material etc.
through the complex system. It is based on the decomposition of the complex
system into subsystems. The chosen subsystems should be simple to handle.
Mostly, they are object-orientated. The topology of the real object determines the
structure of the mapped system. Systems are characterized by inputs and outputs.
They can be controlled via the inputs and observed via the outputs. The difficulty
of analyzing and especially forecasting the environment consists in the fact that
man as an actor is himself part of the complex environmental system/complex
ecosystem – the biosphere.



Fig. 4.4 Human environment
(Credit: www.eolss.net)
The concept of systems builds a bridge between the world of real objects and
mathematics. Typical terms of systems methodology are linear and nonlinear
systems, continuous and discrete systems, lumped parameter and distributed
parameters, automata, events, hierarchical systems etc. Terms of modeling are
systems identification and parameter estimation, input and output analysis,
sensitivity analysis, uncertainty, fuzzy sets, control, decision making, etc.
Systems analysis requires the design of a conceptual model consisting of submodels.
The conceptual model is a plan for mathematical modeling. Mathematical
modeling is based on measurement before or during the control process.
Modeling is not a purpose in itself. Its focus is to solve problems and, according
to the problems, selected models are developed. Some models are developed only
for one problem. Generally, the systems approach provides models for control,
decision and planning processes.
Complexity of Environmental Systems:
The complexity of environmental systems is known to all who need to make
decisions in the management of plants, in



environmental politics or in the study of global change, etc. The complexity is
inherent in the nonlinearity of mathematical models, the dynamic and stochastic
nature of natural resource problems, the multipurpose, multiobjective attributes
of decision problems. The complexity is also caused by the natural coupling and
interaction of parts of the biosphere. The complexity depends also on problems
of measuring, transmitting, processing and analyzing data and the
decision-making process under environmental, technical, institutional, economic
and political aspects.

Fig. 4.5 Modeling and decision making


(Credit: www.eolss.net)
Systems methodology provides theoretical and computational tools for modeling,
analyzing, understanding and controlling, optimizing or planning complex,
environmental systems. The systems approach brings transparency into the
interactions of the system‘s parts. Simulation tools support numerical insights
into the system behavior. The idea of sustainable development is the overall goal
in treating the biosphere and its parts, which takes into consideration their high
complexity.
Systems Analysis:
Systems analysis consists of various steps. Basically, these are described in the
following outline:
1. Analyzing the decision problem (goals, decision or control variables).
2. Formulating a model which is adequate in quality and accuracy for the
complex problem (structure, parameter, interconnections).
3. Testing the model (usually by computer simulation) (validity, sensitivity).
4. Solving the decision problems by scenario analysis, optimization (control,
decision strategies, planning).
Computer simulation and optimization tools provide essential aids as regards
most of the steps. Typical examples of environmental problems should explain
the use of the systems approach in different model areas (regional, global).
Environmental systems are very often considered in an economic and social context.
Global natural resource management problems, for example, can only be
considered together with population growth and the world economy. It is similar for
other problems. That means that ecological systems should be compatible with
other necessary systems. Hierarchical systems methodology was developed with
systems theory and basic knowledge of cybernetics (information, control and
loop function, signal processing) and related engineering applications. Systems
engineering provides a tool box with approved methods in engineering sciences.
On the level of systems or ―symbolic systems‖ mainly control, decision and
planning problems will be analyzed. Modeling on the systems level is based on
the input-output analyses and needs mathematics, natural, life and economic
sciences as a background. System identification and parameter estimation are the
main steps of modeling.
Example: water quality. As an example of the systems approach a control
problem of water resources is considered.

Fig. 4.6 The water quality of a river


(Credit: www.eolss.net)



• What variable should be measured and controlled?
• How to control and by which variables?
The pollution is discharged directly into the river or via a wastewater treatment plant. The
control problem for water quality consists in the fulfillment of certain conditions
(biological oxygen demand, dissolved oxygen) and optimization of the overall
costs, including the costs of wastewater treatment. This is a distributed
parameter system based on a partial differential equation. The control problem
would be considered a multicriteria optimization problem and solved by
hierarchical optimization.
Measurements – Data Capture, Validation, Interpretation:
Environmental object classification leads to a taxonomy distinguishing between
atmosphere (all objects above the surface of the Earth), hydrosphere
(water-related objects), lithosphere (relating to soil and rocks), biosphere (all
living matter) and technosphere (human-made objects).
A basic method to solve this problem is called maximum likelihood. For this
method, one needs to know a finite number of classes for the allocation of new
observations. The probability distribution for each class describes the probability
that the observation belongs to the respective class. Mostly, the probability
distribution is unknown. Practically, very often a normal distribution is used.
Suitable data sets (training sets) are used to identify the parameters of the
probability distributions. Environmental system models based on measurements
need aggregation, validation and interpretation of the initial collection of
environmental data. The measurements are associated, depending on the applied
methods and techniques, with uncertainties. Therefore, validation procedures are
developed and applied (such as temporal, geographic, space-time and
inter-parameter validation; see Günther). Measurements in most cases are validated in the
context of interpretation. Also, knowledge-based systems (expert systems) play a
role for initial evaluation of environmental raw data. For data processing,
statistical classification, data management and artificial intelligence provide
standard methods. Within the process of data processing, especially in the case of
data fusion (by combining), methods of uncertainty management are applied.
When circumstances of measurements are known like weather, date, time of day,



etc., methods based on Bayesian probability theory are used. The Bayesian model
requires events which are independent of each other. This is mostly unrealistic.
Therefore, recently developed methods like fuzzy sets have to be applied. Neural
nets are used to handle uncertainty. Such validated data are the basis for
information systems and monitoring of the state of environment. Furthermore,
they are needed for modeling (systems identification and parameter estimation).
After systems analysis the conceptual model (model design) requires input and
output data of the systems or subsystems. The theory determines which and how
much data are needed (―It is the theory which decides what can be observed‖, A.
Einstein). Physics, chemistry, biology, ecology, especially engineering sciences,
etc., provide measurement equipment and techniques. In general, measurement
and validation are the bottleneck and a great challenge for the further
development and realization of environmental models.
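The maximum-likelihood classification idea mentioned above, assigning an observation to the class whose fitted normal distribution gives it the highest probability density, can be sketched as follows; the training data and the observation are invented purely for illustration.

```python
# Minimal sketch of maximum-likelihood classification with normal class models.
import math

def fit_normal(samples):
    """Estimate the mean and standard deviation of a class from a training set."""
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    return mean, math.sqrt(var)

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

# Two hypothetical classes, e.g. 'clean' and 'polluted' water samples (turbidity values).
training = {
    "clean": [1.0, 1.2, 0.8, 1.1, 0.9],
    "polluted": [3.5, 4.0, 3.8, 4.2, 3.6],
}
params = {label: fit_normal(samples) for label, samples in training.items()}

observation = 3.3
likelihoods = {label: normal_pdf(observation, m, s) for label, (m, s) in params.items()}
best = max(likelihoods, key=likelihoods.get)
print(f"observation {observation} assigned to class '{best}'")
```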
Continuous Stirred Tank Reactor
A Continuous Stirred Tank Reactor (CSTR) is a reaction vessel in which
reagents, reactants and often solvents flow into the reactor while the product(s)
of the reaction concurrently exit(s) the vessel. In this manner, the tank reactor is
considered to be a valuable tool for continuous chemical processing.
CSTR reactors have effective mixing and perform under steady-state with
uniform properties. Ideally, the output composition is identical to composition of
the material inside the reactor, which is a function of residence time and reaction
rate.
In situations where a reaction is too slow or when two immiscible or viscous
liquids are present requiring a high agitation rate, several Continuous Stirred
Tank Reactors (CSTRs) may be connected together forming a cascade.
A Continuous Stirred Tank Reactor (CSTR) assumes ideal mixing and thus, is the
complete opposite of a Plug Flow Reactor (PFR).
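As a hedged sketch of the material-balance idea behind an ideal CSTR (not taken from the text above), the steady-state outlet concentration for a first-order reaction is C_out = C_in / (1 + k*tau), where tau = V/Q is the hydraulic residence time; all numerical values below are assumed.

```python
# Steady-state CSTR balance for a first-order reaction A -> products:
#   Q*C_in - Q*C_out - k*C_out*V = 0   =>   C_out = C_in / (1 + k*tau), tau = V/Q.
Q = 10.0      # volumetric flow rate, m3/h (assumed)
V = 50.0      # reactor volume, m3 (assumed)
k = 0.3       # first-order rate constant, 1/h (assumed)
C_in = 100.0  # inlet concentration, mg/L (assumed)

tau = V / Q                    # hydraulic residence time, h
C_out = C_in / (1 + k * tau)   # ideal CSTR outlet concentration
print(f"tau = {tau:.1f} h, C_out = {C_out:.1f} mg/L, removal = {1 - C_out / C_in:.0%}")
```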

Plug Flow Reactor Model


A plug flow reactor (PFR) model is used to model chemical reactions taking
place within a tube. It is an idealized model that can be used during the design
process of reactors. The model described here is assumed to be adiabatic and
operating at constant pressure. A gas phase decomposition reaction is assumed to



be the only reaction taking place, proceeding according to the equation A → 2B +
C. A diagram of a typical PFR is presented below.

Fig. 4.7 Typical PFR


(Credit: https://commons.wikimedia.org/wiki/File:Pipe-PFR.svg)



Essentially, the model gives the concentration of the chemical species, the temperature,
the conversion, the rate constant, and the rate of reaction at each section of volume
along the reactor.
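For comparison, a minimal illustrative sketch of an ideal PFR with an assumed first-order loss of A (the A → 2B + C kinetics are simplified to a single first-order decay, and all values are assumed): the concentration falls exponentially with residence time, C(tau) = C_in * exp(-k*tau).

```python
# Ideal plug flow reactor with an assumed first-order loss of species A:
#   dC/dtau = -k*C   =>   C(tau) = C_in * exp(-k*tau)
# The ideal CSTR result for the same residence time is printed for comparison.
import math

k, C_in, tau = 0.3, 100.0, 5.0      # assumed rate constant (1/h), inlet (mg/L), residence time (h)

C_pfr = C_in * math.exp(-k * tau)   # ideal PFR outlet concentration
C_cstr = C_in / (1 + k * tau)       # ideal CSTR outlet concentration
print(f"PFR outlet  = {C_pfr:.1f} mg/L")    # about 22.3 mg/L
print(f"CSTR outlet = {C_cstr:.1f} mg/L")   # 40.0 mg/L (less removal for the same tau)
```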

Mixed flow reactor (MFR):


The continuous stirred-tank reactor (CSTR), also known as vat- or backmix
reactor, mixed flow reactor (MFR), or a continuous-flow stirred-tank
reactor (CFSTR), is a common model for a chemical reactor in chemical
engineering and environmental engineering. A CSTR often refers to a model used
to estimate the key unit operation variables when using a continuous agitated-
tank reactor to reach a specified output. The mathematical model works for all
fluids: liquids, gases, and slurries.
The behavior of a CSTR is often approximated or modeled by that of an ideal
CSTR, which assumes perfect mixing. In a perfectly mixed reactor, reagent is
instantaneously and uniformly mixed throughout the reactor upon entry.
Consequently, the output composition is identical to composition of the material
inside the reactor, which is a function of residence time and reaction rate. The
CSTR is the ideal limit of complete mixing in reactor design, which is the
complete opposite of a plug flow reactor (PFR). In practice, no reactors behave
ideally but instead fall somewhere in between the mixing limits of an ideal CSTR
and PFR.
(Credit & Source: https://medium.com/)
Material Balance Model (with reference to Environmental Economics):
The relationship between the environment and the economy can be depicted by
means of the ―Material Balance Model‖. In the figure, the production activities of
the economic system are represented by the circle called ‗Production Sector‘.
This circle represents all productive activities in the economic system, such as
agriculture, factories, warehouses, mines, transportation and other public utilities
that are engaged in the extraction of materials from the environment; their
processing, refinement and rearrangement into marketable goods and services;
and their distribution throughout the economy to the point of ultimate use.
Another circle in the Material Balance Model is labeled the ‗Household Sector‘, which
depicts the individual consumers in the economy. What is produced in the
production sector of this model economy goes to individuals acting as
consumers. This is a simplified two-sector model of the economic system to
show the relationship between the environment and the economy.

Fig. 4.8: Relationship between the Environment and the Economy (Material Balance Model)



We have dispensed with the conventional portrayal of the economic system
showing circular flow of money and opposite flow of goods and services between
production sector and consumption sector. In the conventional analysis, the
household sector provides ‗factor inputs‘ for money-income; and production
sector provides goods and services in return for money payments, making the
circular flow complete in two opposite directions.
This conventional ‗circular flow‘ of money and goods fails to explain the
material flows and the basic laws of physics governing them, and assumes that goods
and services are produced out of ‗something‘. Where that ‗something‘ comes
from and where it goes are not explained in the conventional
portrayal of the circular flow of goods and services in a two-sector model.
The Material Balance Model depicted in Fig. 4.8 illustrates the ‗Environment‘ as a
large shell surrounding the economic system, just as the womb surrounds the
embryo. The relationship between the Environment and the Economic
System is that of a mother to an unborn baby in the uterus, providing sustenance
to the growing embryo and carrying away the wastes. These ‗inputs‘ and ‗wastes‘
are portrayed in the model. Raw material inputs that flow from the environment
are processed in the ‗Production Sector‘, which supplies the consumable
final products to the household sector. In this process, the production sector
creates some waste products which are sent back to the environment.
The household sector does not consume all the products supplied. It generates
‗wastes‘, called ‗residues‘, which are unwanted by-products of the consumption
activities of the households. These unwanted wastes or residues are returned
to the environment. Thus, there is a constant flow of residues from both the
production and consumption sectors back to the environment.
These material flows must obey the basic law of physics governing the
‗conservation of matter‘. Assuming this to be a very simplified economy where
there are no exports and imports, and where there is no net accumulation of stocks of
plant, equipment, inventories, consumer durables, etc., the mass of residuals
returned to the natural environment must be equal to the mass of basic fuels,
food, minerals and other raw materials entering the processing and production
system, including the gases taken from the atmosphere.
This is the principle behind material balance. This holds good for each sector of
the economic system separately and also for the economic system as a whole.



Thus, in the absence of inventory accumulation, the flow of consumer goods
from the production sector to the household sector must be equal to the mass
flow of residuals from the household sector back to the environment.
(Credit and Source: www.knowledgiate.com)
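A minimal numerical sketch of the material-balance principle described above; all mass flows are assumed values, chosen only to show that, with no imports, exports or accumulation, the residuals returned to the environment equal the raw material inputs taken from it.

```python
# Material balance check for a simplified two-sector economy (assumed flows, tonnes/year).
raw_material_inputs = 1000.0       # mass extracted from the environment (assumed)
production_residuals = 300.0       # waste returned by the production sector (assumed)

goods_to_households = raw_material_inputs - production_residuals   # no accumulation in production
household_residuals = goods_to_households                          # no accumulation in households

total_residuals = production_residuals + household_residuals
# Conservation of matter: total residuals returned must equal total inputs.
assert abs(total_residuals - raw_material_inputs) < 1e-9
print(f"inputs = {raw_material_inputs} t/y, residuals returned = {total_residuals} t/y")
```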

SELF-TEST
3) MFR stands for

a) Mixed flow reactor

b) Mixed flow reaction

c) Medium flow reactor

d) None of the above

4) CSTR stands for

a) Continuous Stirred Tank Reaction

b) Continuous Stirred Tank Reactor

c) Current Stirred Tank Reactor

d) None of the above

SHORT ANSWER QUESTIONS


3) Write a note on environmental systems.

4) Write a note on the importance of modelling in environmental science.



UNIT 04-03: ENVIRONMENTAL MODELLING - APPLICATIONS: WATER
QUALITY MODELLING: SURFACE WATER QUALITY MODELLING – LAKES
AND IMPOUNDMENTS, RIVERS AND STREAMS, ESTUARIES; GROUND WATER
POLLUTION MODELLING

LEARNING OBJECTIVES
After successful completion of this unit, you will be able to learn

 Applications of Environmental Modelling

 Water Quality Modelling

Environmental Modelling Application:

The most important tasks in the field of environmental modeling and simulation
today are still the determination and analysis of the behaviour of environmental
systems, that is, the computation of the time-dependent dynamic behaviour of
state variables. Some applications of modeling and simulation are given below.
On the one hand, undesired secondary effects of human activities (e.g. emissions
of harmful substances from production processes) can appear. In such a situation,
the secondary effects on the environment are to be minimized. On the other hand,
actions of environmental protection are to be optimized. These are investigations of
the questions: how does a process work, and which effects are released in the
environmental system by a particular input? In the environmental domain,
simulation models are especially used in four fields: emission computation (air,
water and ground pollution), process control, groundwater (economic and flow)
investigations, and ecosystem research.
Further applications are becoming more and more important: e.g., models of the
use and balance of resources (e.g. water, ground, materials, energy, food);
models of the carrying capacity and loading limits of ecological systems subject to
inputs caused by human activities; quality models; product-life-cycle
models and ecobalances; socio-economic models; and combined
tasks such as the use of simulation and optimization methods or experiments consisting
of coupled economic and ecological models.
(Source & Credit: R. Grützner. (1996). Environmental modeling and simulation -
applications and future requirements, University of Rostock.)
Environmental models can be used to study many things, such as:
 Climate
 Coastal changes
 Hydro-ecological systems
 Ocean circulation
 Surface and groundwater
 Terrestrial carbon
 The behaviour of enclosed spaces
 The behaviour of spaces around buildings

(Source: designingbuildings.co.uk)

Water Quality Modelling:

Water quality modeling involves the analysis of water quality data using mathematical
simulation techniques. Water quality modeling helps people understand the significance
of water quality issues, and models provide evidence for policy makers to make
decisions in order to properly mitigate water pollution. Water quality modeling also helps
determine correlations between constituent sources and water quality, along with identifying
information gaps. Due to the increase in freshwater usage, water
quality modeling is especially relevant at both the local and the global level. In order
to understand and predict changes over time in water scarcity, climate change, and
the economics of water resources, water quality models need sufficient
data, including data on water bodies at both local and global levels.
A typical water quality model consists of a collection of formulations representing
physical mechanisms that determine position and momentum of pollutants in a water
body. Models are available for individual components of the hydrological system such
as surface runoff; there also exist basin wide models addressing hydrologic
transport and for ocean and estuarine applications. Often finite difference methods are
used to analyze these phenomena, and, almost always, large complex computer
models are required.

Building A Model



Water quality models contain different information but generally have the same
purpose, which is to provide evidentiary support on water quality issues. Models can be either
deterministic or statistical, depending on whether the base model covers a local,
regional, or global scale. Another aspect to
consider for a model is what needs to be understood or predicted about the research
area, along with setting up any parameters to define the research. A further aspect of
building a water quality model is knowing the audience and the exact purpose for
presenting the data, for example to support water quality management and to inform
policy makers so that the best possible outcomes can be achieved.

Formulations and associated Constants

Water quality is modelled by one or more of the following formulations:

 Advective Transport formulation


 Dispersive Transport formulation
 Surface Heat Budget formulation
 Dissolved Oxygen Saturation formulation
 Reaeration formulation
 Carbonaceous Deoxygenation formulation
 Nitrogenous Biochemical Oxygen Demand formulation
 Sediment oxygen demand formulation (SOD)
 Photosynthesis and Respiration formulation
 pH and Alkalinity formulation
 Nutrients formulation (fertilizers)
 Algae formulation
 Zooplankton formulation
 Coliform bacteria formulation (e.g. Escherichia coli)

As per the Pollution Prevention and Abatement Handbook (World Bank Group,
effective July 1998), mathematical models can be used to predict
changes in ambient water quality due to changes in discharges of wastewater. In
Bank work, the models are typically used to establish priorities for reduction of
existing wastewater discharges or to predict the impacts of a proposed new
discharge. Although a range of parameters may be of interest, a modeling
exercise typically focuses on a few, such as dissolved oxygen, coliform bacteria,
or nutrients. Predicting the water quality impacts of a single discharge can often
be done quickly and sufficiently accurately with a simple model. Regional water
quality planning usually requires a model with a broader geographic scale, more
data, and a more complex model structure.

Model Classification:

Water quality models are usually classified according to model complexity, type
of receiving water, and the water quality parameters (dissolved oxygen,
nutrients, etc.) that the model can predict. The more complex the model is, the
more difficult and expensive will be its application to a given situation. In order
to determine the impacts of a particular discharge on ambient water quality, it is
usually necessary to model the diffusion and dispersion of the discharge in the
relevant water body. The approach applies both to new discharges and to the
upgrading of existing sources. The Handbook provides guidance on models that
may be applicable in the context of typical Bank projects. Model complexity is a
function of four factors:
• The number and type of water quality indicators. In general, the more
indicators that are included, the more complex the model will be. In addition,
some indicators are more complicated to predict than others (see Fig. 4.9).
• The level of spatial detail. As the number of pollution sources and water
quality monitoring points increases, so do the data required and the size of the
model.
• The level of temporal detail. It is much easier to predict long-term static
averages than short-term dynamic changes in water quality. Point estimates of
water quality parameters are usually simpler than stochastic predictions of the
probability distributions of those parameters.
• The complexity of the water body under analysis. Small lakes that ―mix‖
completely are less complex than moderate-size rivers, which are less complex
than large rivers, which are less complex than large lakes, estuaries, and coastal
zones.
The level of detail required can vary tremendously across different
management applications. At one extreme, managers may be interested in the
long-term impact of a small industrial plant on dissolved oxygen in a small,
well-mixed lake. This type of problem can be addressed with a simple
spreadsheet and solved by a single analyst in a month or less. At the other
extreme, if managers want to know the rate of change in heavy metal
concentrations in the Black Sea that can be expected from industrial
modernization in the lower Danube River,


the task will probably require many person-years of effort with extremely
complex models and may cost millions of dollars.

For indicators of aerobic status, such as biochemical oxygen demand (BOD),


dissolved oxygen, and temperature, simple, well-established models can be used
to predict long-term average changes in rivers, streams, and moderate-size lakes.
The behavior of these models is well understood and has been studied more
intensively than have other parameters. Basic nutrient indicators such as
ammonia, nitrate, and phosphate concentrations can also be predicted reasonably
accurately, at least for simpler water bodies such as rivers and moderate-size
lakes. Predicting algae concentrations accurately is somewhat more difficult but
is commonly done in the United States and Europe, where eutrophication has
become a concern in the past two decades. Toxic organic compounds and heavy
metals are much more problematic. Although some of the models reviewed in the
Handbook do include these materials, their behavior in the environment is still an area of
active research.
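As an illustration of the kind of simple, well-established dissolved-oxygen model referred to above, the sketch below uses the classical Streeter-Phelps oxygen-sag equation; this particular model is not named in the text, and all parameter values are assumed.

```python
# Illustrative Streeter-Phelps oxygen-sag sketch downstream of a BOD discharge:
#   D(t) = kd*L0/(ka-kd) * (exp(-kd*t) - exp(-ka*t)) + D0*exp(-ka*t),  DO(t) = DO_sat - D(t)
import math

kd = 0.35      # deoxygenation rate constant, 1/day (assumed)
ka = 0.70      # reaeration rate constant, 1/day (assumed)
L0 = 20.0      # ultimate BOD just below the discharge, mg/L (assumed)
D0 = 1.0       # initial oxygen deficit, mg/L (assumed)
DO_sat = 9.0   # saturation dissolved oxygen, mg/L (assumed)

for t in range(0, 11, 2):   # travel time in days
    deficit = kd * L0 / (ka - kd) * (math.exp(-kd * t) - math.exp(-ka * t)) + D0 * math.exp(-ka * t)
    print(f"t = {t:2d} d   DO = {DO_sat - deficit:5.2f} mg/L")
```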



Fig. 4.9 Criteria for Classification of Water Quality Models
(Source & Credit: https://web.worldbank.org/)

Data Requirements:

Fig. 4.10 Data Requirements for Water Quality Models


(Source & Credit: https://web.worldbank.org/)



Ground Water Pollution Modelling:

According to Jacob Bear, Milovan S. Beljin and Randall R. Ross (1992), in
the book Fundamentals of Ground-Water Modeling, ground-water flow and
contaminant transport modeling has been used at many hazardous waste sites
with varying degrees of success. Models may be used throughout all phases of
the site investigation and remediation processes. The ability to reliably predict
the rate and direction of groundwater flow and contaminant transport is critical
in planning and implementing ground-water remediations. This paper presents an
overview of the essential components of ground-water flow and contaminant
transport modeling in saturated porous media. While fractured rocks and
fractured porous rocks may behave like porous media with respect to many flow
and contaminant transport phenomena, they require a separate discussion and are
not included in this paper. Similarly, the special features of flow and contaminant
transport in the unsaturated zone are also not included. This paper was prepared
for an audience with some technical background and a basic working knowledge
of ground-water flow and contaminant transport processes. A suggested format
for ground-water modeling reports and a selected bibliography are included as
appendices A and B, respectively.

Modeling as a Management Tool:

The management of any system means making decisions aimed at achieving the
system‘s goals, without violating specified technical and nontechnical constraints
imposed on it. In a ground-water system, management decisions may be related
to rates and location of pumping and artificial recharge, changes in water quality,
location and rates of pumping in pump-and-treat operations, etc. Management‘s
objective function should be to evaluate the time and cost necessary to achieve
remediation goals. Management decisions are aimed at minimizing this cost
while maximizing the benefits to be derived from operating the system. The
value of management‘s objective function (e.g., minimize cost and maximize
effectiveness of remediation) usually depends on both the values of the decision
variables (e.g., areal and temporal distributions of pumpage) and on the response
of the aquifer system to the implementation of these decisions. Constraints are
expressed in terms of future values of state variables of the considered ground-
water system, such as water table elevations and concentrations of specific



contaminants in the water. Typical constraints may be that the concentration of a
certain contaminant should not exceed a specified value, or that the water level
at a certain location should not drop below specified levels. Only by comparing
predicted values with specified constraints can decision makers conclude
whether or not a specific constraint has been violated. An essential part of a good
decision-making process is that the response of a system to the implementation
of contemplated decisions must be known before they are implemented. In the
management of a ground-water system in which decisions must be made with
respect to both water quality and water quantity, a tool is needed to provide the
decision maker with information about the future response of the system to the
effects of management decisions. Depending on the nature of the management
problem, decision variables, objective functions, and constraints, the response
may take the form of future spatial distributions of contaminant concentrations,
water levels, etc. This tool is the model. Examples of potential model
applications include:

• Design and/or evaluation of pump-and-treat systems

• Design and/or evaluation of hydraulic containment systems

• Evaluation of physical containment systems (e.g., slurry walls)

• Analysis of "no action" alternatives

• Evaluation of past migration patterns of contaminants

• Assessment of attenuation/transformation processes

• Evaluation of the impact of nonaqueous phase liquids (NAPL) on remediation


activities.

What Is a Ground-Water Model?

A model may be defined as a simplified version of a real-world system (here, a


ground-water system) that approximately simulates the relevant excitation-
response relations of the real-world system. Since real-world systems are very
complex, there is a need for simplification in making planning and management
decisions. The simplification is introduced as a set of assumptions which
expresses the nature of the system and those features of its behavior that are
relevant to the problem under investigation. These assumptions will relate,



among other factors, to the geometry of the investigated domain, the way various
heterogeneities will be smoothed out, the nature of the porous medium (e.g., its
homogeneity, isotropy), the properties of the fluid (or fluids) involved, and the
type of flow regime under investigation. Because a model is a simplified version
of a real-world system, no model is unique to a given ground-water system.
Different sets of simplifying assumptions will result in different models, each
approximating the investigated ground-water system in a different way. The first
step in the modeling process is the construction of a conceptual model consisting
of a set of assumptions that verbally describe the system‘s composition, the
transport processes that take place in it, the mechanisms that govern them, and
the relevant medium properties. This is envisioned or approximated by the
modeler for the purpose of constructing a model intended to provide information
for a specific problem.

Content of a Conceptual Model:
The assumptions that constitute a conceptual model should relate to such items as:

• the geometry of the boundaries of the investigated aquifer domain;

• the kind of solid matrix comprising the aquifer (with reference to its
homogeneity, isotropy, etc.);

• the mode of flow in the aquifer (e.g., one-dimensional, two-dimensional
horizontal, or three-dimensional);

• the flow regime (laminar or nonlaminar);

• the properties of the water (with reference to its homogeneity, compressibility,


effect of dissolved solids and/or temperature on density and viscosity, etc.);

• the presence of assumed sharp fluid-fluid boundaries, such as a phreatic


surface;

• the relevant state variables and the area, or volume, over which the averages of
such variables are taken;

• sources and sinks of water and of relevant contaminants, within the domain and
on its boundaries (with reference to their approximation as point sinks and
sources, or distributed sources);

• initial conditions within the considered domain; and



• the conditions on the boundaries of the considered domain that express the
interactions with its surrounding environment.

Selecting the appropriate conceptual model for a given problem is one of the
most important steps in the modeling process. Oversimplification may lead to a
model that lacks the required information, while undersimplification may result
in a costly model, or in the lack of data required for model calibration and
parameter estimation, or both. It is, therefore, important that all features relevant
to a considered problem be included in the conceptual model and that irrelevant
ones be excluded. The selection of an appropriate conceptual model and the
degree of simplification in any particular case depends on:

• the objectives of the management problem;

• the available resources;

• the available field data;

• the legal and regulatory framework applying to the situation.

The objectives dictate which features of the investigated problem should be


represented in the model, and to what degree of accuracy. In some cases
averaged water levels taken over large areas may be satisfactory, while in others
water levels at specified points may be necessary. Natural recharge may be
introduced as monthly, seasonal or annual averages. Pumping may be assumed to
be uniformly distributed over large areas, or it may be represented as point sinks.
Obviously, a more detailed model is more costly and requires more skilled
manpower, more sophisticated codes and larger computers. It is important to
select the appropriate degree of simplification in each case. Selection of the
appropriate conceptual model for a given problem is not necessarily a conclusive
activity at the initial stage of the investigations. Instead, modeling should be
considered as a continuous activity in which assumptions are reexamined, added,
deleted and modified as the investigations continue. It is important to emphasize
that the availability of field data required for model calibration and parameter
estimation dictates the type of conceptual model to be selected and the degree of
approximation involved. The next step in the modeling process is to express the
(verbal) conceptual model in the form of a mathematical model. The solution of
the mathematical model yields the required predictions of the real -world



system‘s behavior in response to various sources and/or sinks. Most models
express nothing but a balance of the considered extensive quantity (e.g., mass of
water or mass of solute). In the continuum approach, the balance equations are
written ―at a point within the domain,‖ and should be interpreted to mean ―per
unit area, or volume, as the case may be, in the vicinity of the point.‖ Under such
conditions, the balance takes the form of a partial differential equation. Each
term in that equation expresses a quantity added per unit area or per unit volume,
and per unit time. Often, a number of extensive quantities of interest are
transported simultaneously; for example, mass of a number of fluid phases with
each phase containing more than one relevant species. The mathematical model
will then contain a balance equation for each extensive quantity.
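To make such a balance equation concrete, the sketch below integrates a one-dimensional advection-dispersion equation for a dissolved contaminant with an explicit finite-difference scheme; the equation form is standard, but the grid, parameter values and boundary conditions are assumed for illustration.

```python
# Explicit finite-difference sketch of the 1-D advection-dispersion balance equation
#   dC/dt = D * d2C/dx2 - v * dC/dx
# for a dissolved contaminant. Grid, parameters and boundaries are assumed.
D = 0.5    # dispersion coefficient, m2/day (assumed)
v = 0.2    # average pore velocity, m/day (assumed)
dx = 1.0   # grid spacing, m
dt = 0.5   # time step, days (chosen so D*dt/dx**2 <= 0.5 and v*dt/dx <= 1)
nx, nt = 50, 200

C = [0.0] * nx
C[0] = 1.0   # constant relative concentration at the left boundary (assumed source)

for _ in range(nt):
    new = C[:]
    for i in range(1, nx - 1):
        dispersion = D * (C[i + 1] - 2 * C[i] + C[i - 1]) / dx**2
        advection = -v * (C[i] - C[i - 1]) / dx       # upwind difference for flow in +x
        new[i] = C[i] + dt * (dispersion + advection)
    new[0] = 1.0         # hold the source concentration
    new[-1] = new[-2]    # zero-gradient outflow boundary
    C = new

print("relative concentration at 0, 10, 20 and 30 m after 100 days:")
print([round(C[i], 3) for i in (0, 10, 20, 30)])
```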

Content of a Mathematical Model

The complete statement of a mathematical model consists of the following


items:

• a definition of the geometry of the considered domain and its boundaries;

• an equation (or equations) that expresses the balance of the considered


extensive quantity (or quantities);

• flux equations that relate the flux(es) of the considered extensive quantity(ies)
to the relevant state variables of the problem;

• constitutive equations that define the behavior of the fluids and solids involved;

• an equation (or equations) that expresses initial conditions that describe the
known state of the considered system at some initial time; and

• an equation (or equations) that defines boundary conditions that describe the
interaction of the considered domain with its environment.

All the equations must be expressed in terms of the dependent variables selected
for the problem. The selection of the appropriate variables to be used in a
particular case depends on the available data. The number of equations included
in the model must be equal to the number of dependent variables. The boundary
conditions should be such that they enable a unique, stable solution. The most
general boundary condition for any extensive quantity states that the difference
in the normal component of the total flux of that quantity, on both sides of the



boundary, is equal to the strength of the source of that quantity. If a source does
not exist, the statement reduces to an equality of the normal component of the
total flux on both sides of the boundary. In such equalities, the information
related to the external side must be known (Bear and Verruijt, 1987). It is
obtained from field measurements or on the basis of past experience. The
mathematical model contains the same information as the conceptual one, but
expressed as a set of equations which are amenable to analytical and numerical
solutions. Many mathematical models have been proposed and published by
researchers and practitioners (see Appendix B). They cover most cases of flow
and contaminant transport in aquifers encountered by hydrologists and water
resource managers. Nevertheless, it is important to understand the procedure of
model development. The following section introduces three fundamental
assumptions, or items, in conceptual models that are always made when
modeling ground-water flow and contaminant transport and fate.

(Credit and Source: Jacob Bear, Milovan S. Beljin and Randall R. Ross (1992),
Fundamentals of Ground-Water Modeling)

SELF-TEST
3) Water Quality Modelling is a part of

a) Ecology

b) Environment

c) Environmental Modelling

d) None of the above

4) Water Quality Modelling requires

a) Baseline data

b) Statistics

c) Simulation

d) All of the above

SHORT ANSWER QUESTIONS


3) What are the various applications of Environmental Modelling?

4) Why is Water Quality Modelling important?



UNIT 04-04: AIR QUALITY MODELLING: THE BOX MODEL, THE GAUSSIAN
PLUME MODEL – POINT SOURCES, LINE SOURCES, AREA SOURCES; SPECIAL
TOPICS; GAUSSIAN PUFF MODEL

LEARNING OBJECTIVES
After successful completion of this unit, you will be able to learn

 Basics of Air Quality Models

 Various types of Air Quality Models

Air Quality Modelling:

According to Gary Haq and Dieter Schwela (2008), in the edited book Foundation
Course on Air Quality Management in Asia (Modelling chapter), an air quality
simulation model system (AQSM) is a numerical technique or methodology for
estimating air pollutant concentrations in space and time. It is a function of the
distribution of emissions and the existing meteorological and geophysical
conditions. An alternative name is dispersion model. Air pollution concentrations
are mostly used as an indicator for human exposure, which is a main component
of risk assessment of the effects of air pollutants on human health. An AQSM addresses the
problem of how to allocate available resources to produce a cost-effective
control plan. In this context, an AQSM can respond to the following types of
questions:
• What are the relative contributions to concentrations of air pollutants from mobile and stationary sources?
• What emission reductions are needed for outdoor concentrations to meet air quality standards?
• Where should a planned source of emissions be sited?
• What will be the change in ozone (O3) concentrations if the emissions of precursor air pollutants (e.g. nitrogen oxides (NOx) or hydrocarbons (HC)) are reduced by a certain percentage?
• What will be the future state of air quality under certain emission reduction scenarios?
Fig. 4.11 shows the main components of an AQSM. An alternative and
complementary model is source apportionment (SA). SA starts from observed
concentrations and their chemical composition and estimates the relative
contribution of various source types by comparing the composition of sources
with the observed composition at the receptors. For this reason, SA is also called



receptor modelling. This module provides an understanding of the basic
components of an air simulation model which is used to estimate air pollutant
concentrations in space and time. It discusses the application of dispersion
models and the key data requirements. It examines the use of meteorological data
in air quality modelling and the different types of dispersion models available. It
presents the different approaches to source apportionment, which are used in
determining the contribution of different air pollution sources.

Fig. 4.11: Basic components of air quality modelling

(Credit & Source: Gary Haq and Dieter Schwela, 2008)

Emission Estimates:

Emission estimates are calculated in a submodel of the AQSM. In such a model,


emission estimates must be developed from basic information such as traffic
modal distribution, vehicle fleet age, vehicle types, industrial processes,
resource use, power plant loads, and fuel types. As described in Module 2
Emissions, empirically derived emission factors are applied to these basic data to
estimate emission loads. Pollutant concentrations calculated by AQSM can never
be more accurate than the input emissions estimates, unless the model is adapted
to monitored pollutant concentrations.

Applications of Dispersion Models:


Starting from reliable emission estimates and using monitored or modelled
meteorological data, dispersion models can be applied to simulate air pollutant
concentrations at receptor sites at costs much lower than those for air pollutant
monitoring.

In addition, dispersion models are the only means to estimate concentrations in


the following situations. Among many capabilities, dispersion models can:

• estimate spatial distributions of air pollutant concentrations;

• quantify source contributions at receptor locations (e.g. at various points in a


residential area);

• provide concentrations of a compound (―exotic‖ pollutant) for which measuring


methodologies do not exist or are too expensive;

• provide estimates on the impacts of the realisation of a planned manufacturing


facility or of process changes in an existing plant;

• estimate the impacts at receptor sites in the vicinity of a planned road or those
of envisaged changes in traffic flow;

• support the selection of appropriate monitoring sites (e.g. for ―hot spots‖) when
no knowledge of potential concentrations is available;

• forecast air pollution concentrations; and

• help estimate exposures by simulating concentrations and duration of


meteorological episodes. Thus, dispersion modelling is a complementary tool to
air pollution monitoring.

Parameters of Air Pollution Meteorology:

Once pollutants are released into the atmosphere they are transported by air
motions which lower the air pollutant concentrations in space over a period of
time. Meteorological parameters averaged over one-hour time intervals are
usually used to describe this phenomenon. Meteorological parameters include
wind speed, wind direction, turbulence, mixing height, atmospheric stability,
temperature and inversion.
Wind:
Wind is a velocity vector having direction and speed. Usually only the horizontal
components of this vector are considered
since the vertical component is relatively small. The wind direction is the
direction from which the wind comes. Through wind speed, continuous pollutant
releases are diluted at the point of release. Concentrations in the plume are



inversely proportional to the wind speed. Wind speed is expressed in the unit
metre/second (m/s). At the ground, the wind speed must be zero. Therefore the wind
speed is lower close to the ground than at higher elevations. Figure 3.2 shows
typical relationships between wind speed and height during day- and night-time.
The wind speed u(z) at a vertical height z above the ground is a function of z and
proportional to the power law z^p, where p is an exponent which depends
primarily on atmospheric stability. The exponent p varies and is approximately
0.07 for unstable conditions and 0.55 for stable conditions (Turner, 1994). Figure
3.2 also shows examples of the height dependence of temperature during day-
and nighttime. During the day, while the ground is heated up by solar radiation,
temperature decreases linearly with height. During the night, temperature first
increases with height and then starts to decrease similarly to that observed during
the day. Objects on the surface over which the wind is flowing will exert friction
on the wind near to the surface. Height and spacing of the objects influence the
magnitude of friction and the wind speed gradient as a function of height. The
effect is described by the roughness length which ranges for urban areas be tween
1 and 3 m, for suburban areas approximately 0.5 to 1 m and for level areas
between 0.001 m and 0.3 m. Turbulence Turbulence is the wind fluctuations over
time scales smaller than the averaging time used to estimate the mean wind
speed. Turbulence consists of eddies (circular movements of air) which may be
oriented horizontally, vertically or at all orientations in between. There are two
kinds of turbulence: mechanical and buoyant. Mechanical turbulence is caused
by objects on the surface and by wind shear, a slower moving air stream next to a
faster moving current. Mechanical turbulence increases with wind speed and the
roughness length. The turbulence created by wind shear is due to the increase of
wind speed with height. Buoyant turbulence is caused by heating or cooling of
air near the earth‘s surface. During a sunny day with clear skies and light wind,
the heating of the earth‘s surface creates an upward heat flux which heats the
lower layers of the air. The heated air goes upward and thus creates a n upward-
rising thermal stream – positive buoyant turbulence. At night with light winds,
the outgoing infrared radiation cools the ground and the lower layers of the air
above while temperature at the higher air layers is unaffected. The cooling near
the ground results in a net downward heat flux and leads to a temperature
inversion (see Figure 3.2) which is a temperature structure inverted from the

EVS042 Statistical Approaches and Modelling in Environmental Sciences Page 211


usual decrease of temperature with height. The temperature inversion causes the
atmosphere to become stable and inhibit vertical motion – negative buoyant
turbulence. The generation of mechanical turbulence is always positive but
smaller than the positive buoyant turbulence. The negative buoyant turbulence
during night-time tends to reduce mechanical turbulence.
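
The following short Python sketch (not part of the original text) evaluates the power-law wind profile u(z) = u_ref · (z/z_ref)^p described above; the reference wind speed, reference height and heights used are illustrative assumptions, while the exponents 0.07 and 0.55 are the unstable and stable values quoted above from Turner (1994):

def wind_speed(z, u_ref=5.0, z_ref=10.0, p=0.07):
    # Wind speed [m/s] at height z [m], given a reference speed u_ref measured
    # at height z_ref [m] and a stability-dependent exponent p (about 0.07 for
    # unstable and 0.55 for stable conditions).
    return u_ref * (z / z_ref) ** p

for z in (10, 50, 100, 200):
    print(f"u({z} m): unstable {wind_speed(z, p=0.07):.2f} m/s, "
          f"stable {wind_speed(z, p=0.55):.2f} m/s")
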

Mixing height:

During an hourly period on a sunny day, upward-rising thermal streams, which characterise unstable atmospheric conditions, will move in the wind direction. A series of upward and compensating downward motions will result in substantial vertical dispersion of the pollutants. Since the eddy structures point in all possible directions, there is also substantial horizontal dispersion. In contrast, at night with clear skies and light wind, a minimum of buoyant turbulence is extant, characterising stable thermal conditions, which also damps out mechanical turbulence. If the net heat flux at the ground is nearly zero, the condition is characterised as neutral. The vertical thermal structure is then a slight decrease of temperature with height, approximately 10 °C with each 1000 m of height increase (the dry adiabatic lapse rate). Atmospherically neutral conditions can be caused by:

• cloudy conditions;

• windy conditions; and

• transitional conditions near sunrise and sunset.

These conditions induce an intermediate level of dispersion. The mixing height is defined as the upper limit to the dispersion of atmospheric pollutants. Under unstable conditions there is vigorous vertical mixing from the ground to approximately 1 km and negligible vertical mixing above that height. For stable conditions the mixing height is much lower.
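
As a small worked illustration of the dry adiabatic lapse rate of roughly 10 °C per 1000 m quoted above, the Python lines below compute the temperature at a given height under neutral conditions; the surface temperature of 20 °C is an assumption, not a value from the text:

DRY_ADIABATIC_LAPSE = 10.0 / 1000.0  # temperature drop in °C per metre of ascent

def neutral_temperature(z, t_surface=20.0):
    # Temperature [°C] at height z [m] under neutral (dry adiabatic) conditions.
    return t_surface - DRY_ADIABATIC_LAPSE * z

print(neutral_temperature(0))     # 20.0 °C at the surface
print(neutral_temperature(1000))  # about 10.0 °C at 1000 m
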

Stability and plume types:

Under unstable conditions a visible, continuously emitted plume will appear as large loops. This plume structure is called "looping". Looping is due to the upward motions of the heat flux and the compensating downward motions occurring as the plume is transported downwind. Considerable vertical and horizontal dispersion of the effluent is taking place during every one-hour period. Under neutral conditions turbulence is mostly mechanical. As turbulent eddies have many different orientations, the resulting vertical and horizontal dispersion is relatively symmetrical. The resulting plume looks like a cone, and dispersion under neutral conditions is described as "coning". Vertical motion of the plume is
inhibited under stable conditions with a temperature inversion and a low mixing height. Horizontal motion is not influenced by temperature, and the horizontal extension of the plume may take many appearances. If the horizontal spreading is considerable, the plume is said to be "fanning".

Inversion, fumigation, stagnation:

By definition, an inversion exists when warmer air overlies cooler air.

There are four processes that produce an inversion:

• cooling of a layer of air from below (surface inversion);

• heating of a layer of air from above (elevated inversion);

• flow of a layer of warm air over a layer of cold air (surface inversion); or

• flow of a layer of cold air under a layer of warm air (elevated inversion).

All of these occur, although the first process (cooling from below) is the most common.

Fig. 4.12: Vertical dispersion under various conditions for low and high elevations of the source [T: Temperature, θ: Adiabatic lapse]. Source: Adapted from Liptak (1974)

Inversion due to cooling of air at the surface:

Shortly before sunset, the ground surface begins to cool by radiation and cools the air layer nearest to it. By sunset there will be a strong but shallow inversion close to the ground, and all night this inversion will grow in strength and height until dawn of the next day. As the temperature of the air adjacent to the ground is below that of the air at some height above, air pollutants released in this layer are trapped, because the air, being colder than the layer above it, will not rise. This is called a "stagnant inversion".

Inversion due to heating from above:

Heating an air layer from above can occur simply when a cloud layer absorbs incoming solar radiation. However, it often occurs when there is a high-pressure region (common in summer between storms) in which there is a slow net downward flow of air and light winds. The sinking air mass increases in temperature at the adiabatic lapse rate (the change in temperature of a mass of air as it changes height) and often becomes warmer than the air below it. The result is an elevated inversion, also called a subsidence inversion or inversion aloft. These normally form from 500 to 5000 m above the ground, and they inhibit atmospheric mixing. This type of inversion is common in sunny, low-wind situations such as Los Angeles in summer.

Flow of a layer of warm air over a layer of cold air:

If a flow of warm air meets a layer of cold air, the warm air mass could simply sweep away the colder air. Under certain circumstances, such as when the cold air is trapped by mountains, a mass of warm air is caused to flow over the cold dome. Even without mountains, atmospheric forces alone can cause a mass of warm air to flow over a cold mass. This is called an "overrunning" effect: warm air overrides cold air at the surface.

Inversion due to flow of cool air under warm air:

A horizontal flow of cold air into a region lowers the surface temperature. As the cold air is denser and flows laterally, it displaces warmer and less dense air upwards. This phenomenon
is called an 'advection inversion'. It is common where cool maritime air blows into a coastal area and can occur at any time of the year. This type of inversion is generally short-lived and shallow. When cold air flows down a slope from a higher elevation into a valley and displaces warmer air upwards, this often leads to inversions at the bottom of the valley. The cold air traps all pollutants released in this layer. The phenomenon is called a 'cold-air drainage inversion'. In effect, the valley collects all the ground-cooled air from the higher elevations above it. If condensation results and fog is formed, the sun cannot reach the ground during the day and the inversion can persist for days.

Fumigation:

In air pollution, fumigation is defined as the appearance at ground level of pollutants that were previously:

• in a poorly dispersed smoke plume; or

• trapped in a temperature inversion; or

• trapped between two inversion layers as a result of turbulence (WHO, 1980).

An example is the turbulence arising from early-morning heating of the earth's surface by the sun. The subsequent gradual warming of the atmosphere from the ground upwards "burns off" the inversion that builds up during night-time. As the strong convective mixing reaches a fanned plume, it immediately mixes the large pollutant concentrations of the plume towards ground level, "fumigating" that area. This process leads to high but short-term ground-level concentrations. The effect is particularly strong if the plume from a shoreline source is carried inland by a stable onshore breeze. As the breeze passes inland, it encounters warmed air, increasing the convectional flow. A strong breeze will prevent the plume from mixing upward, and the convectional flow will drive air pollutants to the ground further inland.

Influence of Topography on Wind Speed and Direction:

Topographical characteristics such as mountains, valleys, and urban areas all


influence the diffusion of stack plumes and of releases from low-level sources. For a plume, the centreline may become distorted and take directions completely different from the main wind direction above the topographical influences. In mountainous regions and valleys, wind speed and direction may change substantially from one location to another (see Fig. 4.13). In urban areas, wind speed and direction may be quite complicated within a street canyon. In addition to the topographical characteristics of the area, atmospheric and
surface thermal characteristics influence air motions particularly at low wind
velocities. Local wind velocities may be greater or lesser than would otherwise
occur in the absence of heat emissions from buildings. It can easily be inferred
that the airflow around urban structures will be completely different from the
airflow in rural areas. Correspondingly, the dispersion of air pollutants is a
complex phenomenon and concentrations within a street canyon vary for
different wind directions above the urban structure and different shapes of the
buildings (Kim et al., 2005).

Fig. 4.13: Wind flow in a valley and around a hill

(Credit & Source: Gary Haq and Dieter Schwela, 2008)

Meteorological Measurements:

In situ meteorological measurements are normally required to provide an input to air quality modelling. The primary directly measured variables include wind speed (three-dimensional), wind direction, temperature at two elevations, relative humidity, precipitation, pressure and solar radiation. Turbulence and mixing height may also be primary variables, although they are often estimated by indirect methods of in situ or remote temperature sensing. Remote sensing of wind speed, wind direction and humidity also enhances the meteorological data input to modelling. Ambient air temperature and relative humidity are the most common measurements; ambient air temperature is measured using a thermometer, and humidity with a hygrometer based on the absorption of humidity. As solar radiation easily disturbs such measurements, they are performed in a shielded but at the same time ventilated enclosure (e.g. special housings with louvered screens).
The Automatic Weather Station:

Continuous measurement of meteorology should include sensors for the most important parameters, such as:

• wind speed;

• wind direction;

• temperature and/or vertical temperature gradient;

• net radiation;

• wind fluctuations or turbulence;

• relative humidity;

• precipitation; and

• atmospheric pressure.

Some of these data overlap each other. The final selection of parameters will depend on the instruments available and the type of specific data needed by the user. The Automatic Weather Station (AWS) will collect high-quality, real-time data that are normally used in a variety of weather observation activities, ranging from air quality data assessment and industrial accidental-release forecasting to long-term modelling for planning purposes. A weather station designed for air quality studies will have to provide surface data and meteorological information in the surface boundary layer and in the troposphere as a whole. For the purpose of explaining air pollution transport and dispersion, most of the sensors may be located along a 10 m high mast. The basic suite of sensors will measure wind speed and wind direction, temperature, relative humidity, air pressure and precipitation. The expanded suite of sensors may offer measurement of solar radiation, net radiation, wind fluctuations (turbulence), vertical temperature gradients and visibility. To obtain electric power and a data retrieval system with modems and computers, the AWS is often co-located with one or several of the air quality monitoring stations.

Types of Models:

All models assume a material balance equation which applies to a specified set of boundaries:

(Accumulation rate) = (Flow-in rate) – (Flow-out rate) + (Emission rate) – (Destruction rate)

The units of all terms in this equation are mass per unit time, e.g. g/s.
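
As a minimal sketch, the material balance above can be written directly as a function; all terms are in mass per unit time (e.g. g/s), and the numerical values in the example call are placeholders, not data from the text:

def accumulation_rate(flow_in, flow_out, emission, destruction):
    # (Accumulation rate) = (Flow-in) - (Flow-out) + (Emission) - (Destruction), all in g/s.
    return flow_in - flow_out + emission - destruction

print(accumulation_rate(40.0, 55.0, 20.0, 5.0))  # 0.0 g/s, i.e. a steady state
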
Box models

Box models (see Figure 3.10) illustrate the simplest kind of material balance. To
estimate the concentration in a city, air pollution in a rectangular box is
considered under the following major simplifying assumptions:

• The city is a rectangle with dimensions W (downwind) and L (crosswind), both in units of [m]. The box is defined by W·L·H [m³], where H [m] is called the mixing height;

• The air pollution emission rate of the city is Q [g/s], which is independent of space and time (continuous emission). Q is related to the emission rate per unit area, q [g/(s·m²)], by:

Q = q · (W · L) [g/s]

Fig. 4.14 The Box Model


(Credit & Source: Gary Haq and Dieter Schwela, 2008)

• The mass of pollutant emitted from the source remains in the atmosphere. No pollutant leaves or enters through the top of the box, nor through the sides that
are parallel to the wind direction.

• No deposition, including gravitational settling or turbulent impaction, occurs.

• No material is removed through chemical transformation (the destruction rate equals zero).

• Atmospheric turbulence produces complete and spatially uniform mixing of pollutants within the box.

• The turbulence is strong enough in the upwind direction that the pollutant concentrations [mass/volume] from releases of the sources within the box, and those due to the pollutant masses entering the box from the upwind side, are spatially uniform in the box.

• The wind blows with an average (constant) velocity u [m/s].

These assumptions lead to a steady state situation and the accumulation rate is
zero. All the terms can then be easily quantified and calculated.

If χ(t) [g/m³] denotes the pollutant concentration in the box as a function of time t, and χin the (constant) concentration in the incoming air mass, then:

Flow-in rate = L · H · u · χin

Flow-out rate = L · H · u · χ(t)

Emission rate = Q

Destruction rate = 0

Accumulation rate = W · L · H · dχ(t)/dt

The differential equation then emerges:

W · L · H · dχ(t)/dt = Q + L · H · u · χin − L · H · u · χ(t)

which, with the initial condition χ(0) = χin, has the solution:

χ(t) = χin + Q/(L · H · u) · (1 − exp(−u · t/W))

For longer times t the concentration approaches the steady state χ = χin + Q/(L · H · u), which corresponds to a zero accumulation rate.
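
The box-model solution derived above can be evaluated with a few lines of Python; this is a sketch only, and the city dimensions, wind speed, emission rate and (clean) incoming concentration used in the example are illustrative assumptions, not values from the text:

import math

def box_model_concentration(t, Q, W, L, H, u, chi_in=0.0):
    # chi(t) = chi_in + Q/(L*H*u) * (1 - exp(-u*t/W)), with Q in g/s,
    # W, L, H in m, u in m/s and concentrations in g/m^3.
    return chi_in + Q / (L * H * u) * (1.0 - math.exp(-u * t / W))

Q, W, L, H, u = 500.0, 10_000.0, 10_000.0, 1_000.0, 3.0
for t in (0, 600, 3600, 24 * 3600):
    print(f"t = {t:>6} s: chi = {box_model_concentration(t, Q, W, L, H, u):.2e} g/m^3")
# For large t the result approaches the steady state Q/(L*H*u), about 1.7e-5 g/m^3 here.
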

There are several drawbacks of box models. Firstly, some of the assumptions are
unrealistic (e.g. wind speed independent of height or uniformity of air pollutant
concentrations throughout the box). Secondly, the model does not distinguish a
source configuration of a large number of small sources emitting pollutants at
low elevation (cars, houses, small industry, and open burning) from that of a
small number of large sources emitting larger amounts per source at higher
elevation (power plants, smelters, and cement plants). Both types of sources are
simply added to estimate a value for the emission rate per unit area (q). Of two
sources with the same emission rate, the higher elevated one leads to lower
ground level concentrations in reality. As there is no way to deal with this
drawback, box models are unlikely to give reliable estimates, except perhaps
under very special circumstances.

Gaussian dispersion models

The Gaussian model is based on the following assumptions:

• continuous emissions [mass per unit time, usually g/s];

• conservation of mass;

• steady-state meteorological conditions for the travel time of the pollutant from source to receptor; and

• concentration profiles in the crosswind direction and in the vertical direction (both perpendicular to the path of transport) are represented by Gaussian or normal distributions.

A Gaussian model is the solution of the basic equations for transport and diffusion in the atmosphere assuming stationarity in time and complete homogeneity in space. A Gaussian dispersion model is normally used for considering a point source such as a factory smoke stack. It attempts to compute the downwind concentration resulting from the point source. The origin of the coordinate system is placed at the base of the stack, with the x axis aligned in the downwind direction. The contaminated gas stream, which is normally called a "plume", is shown rising from the smokestack and then levelling off to travel in the x direction and spreading in the y and z directions as it travels.



Fig. 4.15 The Gaussian plume model

(Credit & Source: Gary Haq and Dieter Schwela, 2008)

The plume normally rises to a considerable height above the stack because it is emitted at a temperature higher than that of the ambient air and with a vertical velocity component. For Gaussian plume calculations, the plume is assumed to start from a point with coordinates (0, 0, H), where H is called the effective stack height and is the sum of the physical stack height (h) and the plume rise (dh). It should be kept in mind that the Gaussian plume approach tries to calculate only average values, without making any statement about instantaneous values. The results obtained by Gaussian plume calculations should be considered only as averages over periods of at least 10 minutes, and preferably one-half to one hour. The Gaussian plume model described so far allows one to estimate the concentration at a receptor point due to a single emission source for a specific meteorology. In this form, it is frequently used to estimate the maximum concentrations to be expected from single isolated sources, using the standard plume equation (with reflection at the ground):

χ(x, y, z) = Q/(2π · u · σy · σz) · exp(−y²/(2σy²)) · [exp(−(z − H)²/(2σz²)) + exp(−(z + H)²/(2σz²))]

In this equation, σy = a·x^p is the standard deviation of the concentration distribution in the crosswind direction, in [m], at the downwind distance x; and σz = b·x^q is the standard deviation of the concentration distribution in the vertical direction, in [m], at the downwind distance x. Here a, b, p and q are constants depending on the stability of the atmosphere (Turner, 1994). Fig. 4.16 presents a typical result of a concentration simulation with the Gaussian model. Gaussian plume models are also applied to estimate multi-source urban concentrations. The procedure is to estimate the concentration at various locations for each of the point, area and line sources in the city for each meteorological condition and
then sum up over all sources, all wind directions, all wind speeds, and all
stability classes, weighted by the frequency of their occurrence.
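
The sketch below implements the Gaussian plume equation given above, using the power-law dispersion coefficients σy = a·x^p and σz = b·x^q just described; the coefficient values and source parameters in the example are illustrative assumptions only (stability-dependent coefficients are tabulated, e.g., in Turner, 1994):

import math

def gaussian_plume(x, y, z, Q, u, H, a, p, b, q):
    # Concentration [g/m^3] at receptor (x, y, z) for a point source of strength
    # Q [g/s], wind speed u [m/s] and effective stack height H [m]:
    #   chi = Q/(2*pi*u*sy*sz) * exp(-y^2/(2*sy^2))
    #         * (exp(-(z-H)^2/(2*sz^2)) + exp(-(z+H)^2/(2*sz^2)))
    sy = a * x ** p
    sz = b * x ** q
    crosswind = math.exp(-y ** 2 / (2.0 * sy ** 2))
    vertical = (math.exp(-(z - H) ** 2 / (2.0 * sz ** 2))
                + math.exp(-(z + H) ** 2 / (2.0 * sz ** 2)))
    return Q / (2.0 * math.pi * u * sy * sz) * crosswind * vertical

# Ground-level centreline concentrations for an assumed 100 g/s source,
# 50 m effective stack height, 4 m/s wind and rough neutral-type coefficients.
for x in (500.0, 1000.0, 2000.0, 5000.0):
    chi = gaussian_plume(x, 0.0, 0.0, Q=100.0, u=4.0, H=50.0, a=0.08, p=0.90, b=0.06, q=0.85)
    print(f"x = {x:>6.0f} m: chi = {chi:.2e} g/m^3")
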

A recent application of the Gaussian plume model is recommended by USEPA


under the name AERMOD Modelling System. It is a steady-state plume model
that incorporates air dispersion based on planetary boundary layer turbulence
structure and scaling concepts, including treatment of both surface and elevated
sources, and both simple and complex terrain (USEPA, 2005). More
sophisticated dispersion models are based on two different viewpoints regarding
the movement of polluted air parcels: The Lagrangian viewpoint is the viewpoint
of a person riding along with the air. From this viewpoint, the ground appears to
be passing below, much as the ground appears to be passing below a person in an
airplane. In the case of a pollution plume or puff, the observer begins riding an air parcel upwind of the stack from which the pollutant is emitted. As he passes directly over the stack, he passes through a region of high concentration. The high concentration is localized in a thin thread of contaminated air that passed directly over the stack.



Fig. 4.16 Simulated concentrations of a point source at ground level
(effective stack height H in metres)
(Credit & Source: Gary Haq and Dieter Schwela, 2008)

After that, the thread of contaminated air expands by turbulent mixing. All the time, the reference x is in the middle of the moving cloud. In a Lagrangian model the pollution distribution is described by a set of discrete "particles" (small air volumes) or puffs, which are labelled by their changing location (i.e. their trajectories are followed). Lagrangian particle models represent pollutant releases as a stream of "particles". Since the model "particles" have no physical dimensions, source types may be specified to have any shape and size, and the emitted "particles" may be distributed over an arbitrary line, area or volume (Ministry for the Environment, 2004).
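
As a toy illustration of the Lagrangian particle idea (not a sketch of any particular model such as TAPM or CALPUFF), the Python snippet below releases a stream of 'particles' at the origin and displaces each one every time step by the mean wind plus a random turbulent component; the wind speed, turbulence intensity, time step and particle count are arbitrary assumptions:

import random

def release_particles(n_particles=1000, n_steps=100, dt=1.0, u=5.0, sigma_turb=0.5, seed=42):
    # Return the final (x, y) positions of particles released at the origin.
    # Each step a particle is advected by the mean wind u [m/s] along x and
    # perturbed by Gaussian turbulent velocities of std sigma_turb [m/s] in x and y.
    rng = random.Random(seed)
    positions = []
    for _ in range(n_particles):
        x = y = 0.0
        for _ in range(n_steps):
            x += (u + rng.gauss(0.0, sigma_turb)) * dt
            y += rng.gauss(0.0, sigma_turb) * dt
        positions.append((x, y))
    return positions

pts = release_particles()
mean_x = sum(p[0] for p in pts) / len(pts)
spread_y = (sum(p[1] ** 2 for p in pts) / len(pts)) ** 0.5
print(f"mean downwind travel ~ {mean_x:.0f} m, crosswind spread ~ {spread_y:.1f} m")
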

The Eulerian viewpoint is the viewpoint of a person standing on the ground at


the base of the emission source. In this case, x represents some fixed distance
downwind from the emission point. Mostly, the distances are measured from the
base of the stack, not from the centre of a moving cloud. In an Eulerian
dispersion model the pollution distribution is described by changing
concentrations at discrete points on a fixed grid. In some models, such as The Air
Pollution Model (TAPM), a point source can be represented by either the
Eulerian Grid Module (EGM), or by a hybrid Lagrangian Particle Module (LPM)
for near-source dispersion, converting to EGM mode far from the source
(Ministry for the Environment, 2004; CSIRO, 2005).

The Gaussian puff model uses the Gaussian equation in a three-dimensional, Lagrangian framework. An example of a Gaussian puff model which is used as a non-steady-state air quality model is CALPUFF (see Fig. 4.17). The model was originally developed by Sigma Research Corporation under funding provided by the California Air Resources Board (CARB) (ASG, 2007).

The specifications for the CALPUFF modelling system include:

• capability to treat time-varying point and area sources;

• suitability for modelling domains from tens of metres to hundreds of kilometres from a source;

• predictions for averaging times ranging from one hour to one year;

• applicability to inert pollutants and those subject to linear removal and chemical conversion mechanisms; and

• applicability for rough or complex terrain situations.

The modelling system (Scire et al., 1990a, 1990b) developed to meet these
objectives consists of three components:

(1) a meteorological modelling package with both diagnostic and prognostic wind field generators;

(2) a Gaussian puff dispersion model with chemical removal, wet and dry deposition, complex terrain algorithms, building downwash, plume fumigation, and other effects; and

(3) post-processing programs for the output fields of meteorological data, concentrations and deposition fluxes.

In April 2003, the USEPA proposed the CALPUFF modelling system as a guideline (Appendix A) model for regulatory applications involving long-range transport (FR, 2005; USEPA, 2005).

In addition, it considered the use of CALPUFF on a case-by-case basis for near-field applications involving non-steady-state effects (situations where factors such as spatial variability in the meteorological fields, calm winds, fumigation, recirculation or stagnation, and terrain or coastal effects are important).



Fig. 4.17 Puff modelling
(Credit & Source: Gary Haq and Dieter Schwela, 2008)

(Credit for the above whole chapter: Gary Haq and Dieter Schwela, 2008, in the edited book Foundation Course on Air Quality Management in Asia, Modelling Chapter.)

SELF-TEST
1) Air Quality Modelling may require

a) Financial Data

b) Population Data

c) Meteorological Data

d) None of the above

2) Which of the following subjects may apply in air quality modelling?

a) Economy

b) Statistics

c) Laws

d) None of the above

SHORT ANSWER QUESTIONS 01


5) What are the applications of air quality modelling?

6) Write a note on the Gaussian plume model.

SUMMARY
Elementary probability theory is an important part of sample survey design. It provides preliminary statistical concepts such as expectation, variance and covariance, as well as measures of error, interval estimation and sample size determination. Probability is quantified as a number between 0 and 1 and can be expressed in a number of ways. The addition rule is used to determine the probability of at least one of two events occurring, and the probability of
either event A or B is given by P(A) + P(B) – P(A and B). If A and B are mutually exclusive, meaning they cannot occur together, then

the probability of either A or B occurring simplifies to P(A) + P(B).

The multiplication rule gives the probability that two (or more) events happen together. P(B|A) is the probability that event B occurs given that event A has occurred. If A and B are independent events, the probability of event B is unaffected by whether event A has occurred (and vice versa), so P(A and B) = P(A) × P(B). Bayes' Theorem states that P(A|B) = P(B|A) × P(A) / P(B).

This formula is not appropriate if P(B) = 0, that is, if B is an event which cannot happen. Diagnostic test results may be dichotomous (positive or negative) or continuous, such as diastolic blood pressure or haemoglobin level. There are two fundamental questions that should be asked about a test: its sensitivity and its specificity. An example is an assay of N-terminal pro-brain natriuretic peptide (NT-proBNP) for the diagnosis of heart failure in a general population survey of people over 45 and those with a previous diagnosis of heart failure; the prevalence of heart failure in these subjects was 103/410 = 0.251, or approximately 25%.

Sensitivity is the proportion of those with the disease who also have a positive test result, while specificity is the probability of a negative test result given that the disease is absent. Sensitivity and specificity are useful statistics for diagnostic tests, but if the disease is rare, accuracy may be limited. Two related terms in common use are the false negative rate, 1 – sensitivity, and the false positive rate, 1 – specificity. These ideas, together with the predictive value of a test and the prevalence of the disease, are illustrated by the example of an exercise test in patients with suspected coronary artery disease.

A useful mnemonic is that sensitivity has an 'n' in it (it relates to the false negative rate) while specificity has a 'p' in it (it relates to the false positive rate). In the coronary artery disease example, the probability of the patient having the disease is adjusted upwards from the prevalence (approximately 0.70) to the probability of disease given a positive test result, 815/930 = 0.88. These predictive values are affected by the prevalence of the disease: if the number of those with the disease in Table 3 were doubled, the predictive value of a positive test would become 1630/(1630+115) = 0.93 and that of a negative test 327/(327+416) = 0.44. Using the notation of Bayes' theorem, the probability of having both a positive exercise test and coronary artery disease is P(T+ and D+). From Table 3,
the probability of picking out one man with both is 815/1465=0.56. Bayes'
theorem enables prior assessments about the chances of a diagnosis to be
combined with the eventual test results to obtain a so-called "posterior"
assessment about the diagnosis.

The diagnostic process is summarised by P(D+|T+) = P(T+|D+)P(D+)/P(T+), where P(D+) is the a priori probability and P(D+|T+) is the a posteriori probability. Formally, if the probability of an event is p, then the odds are defined as p/(1 – p).
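
The diagnostic-test arithmetic above can be checked with a few lines of Python. Only figures quoted in the summary are used (815 men with both a positive test and coronary artery disease, 930 positive tests and 1465 men in total); the total number with disease, 1023, is an assumption chosen to be consistent with the quoted prevalence of approximately 0.70:

both_pos = 815      # positive test AND disease
test_pos = 930      # positive tests in total
n_total = 1465      # men in total
disease = 1023      # assumed total with disease (prevalence ~ 1023/1465 = 0.70)

p_t_and_d = both_pos / n_total       # P(T+ and D+) = 0.56
p_t = test_pos / n_total             # P(T+)
p_d = disease / n_total              # P(D+), the a priori probability
p_t_given_d = both_pos / disease     # P(T+|D+), the sensitivity

# Bayes' theorem: P(D+|T+) = P(T+|D+) * P(D+) / P(T+), the a posteriori probability.
p_d_given_t = p_t_given_d * p_d / p_t
print(f"P(T+ and D+) = {p_t_and_d:.2f}")
print(f"P(D+|T+)     = {p_d_given_t:.2f}")   # equals 815/930 = 0.88
print(f"odds of disease given a positive test = {p_d_given_t / (1 - p_d_given_t):.2f}")
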

Achim Sydow (National Research Center for Information Technology


(GMD.FIRST), Berlin, Germany) explained that systems are used to analyze
complexity, to bring a greater amount of transparency into the interaction of
parts, and to decompose complex systems into subsystems. Systems methodology
includes linear and nonlinear systems, continuous and discrete systems, lumped
parameter and distributed parameters, automata, events, hierarchical systems etc.
Systems analysis requires the design of a conceptual model consisting of submodels, and mathematical modeling is based on measurements taken before or during the control
process. Complexity of environmental systems is inherent in the nonlinearity of
mathematical models, the dynamic and stochastic nature of natural resource
problems, the multipurpose, multiobjective attributes of decision problems, and
the natural coupling and interaction of parts of the biosphere. Systems
methodology provides theoretical and computational tools for modeling,
analyzing, understanding and controlling, optimizing or planning complex,
environmental systems.

Simulation tools support numerical insights into the system behavior. Systems
analysis consists of various steps, such as analyzing the decision problem,
forming a model, testing the model, and solving the decision problems by
scenario analysis or optimization. Typical examples of environmental problems
include water quality, population growth, and world economy. Hierarchical
systems methodology was developed with systems theory and basic knowledge
of cybernetics. Systems engineering provides a tool box with approved methods
in engineering sciences.

The control problem for water quality consists in the fulfillment of certain
conditions and optimization of the overall costs. This is a distributed parameter
system based on a partial differential equation and solved by hierarchical
optimization. Environmental object classification leads to a taxonomy distinguishing between the atmosphere (all objects above the surface of the Earth), the hydrosphere (water-related objects), the lithosphere (relating to soil and rocks), the biosphere (all living matter) and the technosphere (human-made objects). Maximum likelihood methods are used to solve such classification problems. Environmental system models based on measurements need aggregation, validation and interpretation of the initial collection of environmental data.

Validation procedures are developed and applied (such as temporal, geographic, space-time and inter-parameter validation; see Günther). Knowledge-based systems (expert
systems) play a role for initial evaluation of environmental raw data. For data
processing, statistical classification, data management and artificial intelligence
provide standard methods. When circumstances of measurements are known like
weather, date, time of day, etc., methods based on Bayesian probability theory
are used. Neural nets are used to handle uncertainty. Such validated data are the
basis for information systems and monitoring of the state of the environment.

The most important tasks in environmental modeling and simulation today are the determination and analysis of the behaviour of environmental systems. Simulation models are used in four fields: emission computation, process control, groundwater investigations (economic and flow studies), and ecosystem research. Environmental models can be used to study many things, such as climate, coastal changes, hydro-ecological systems, ocean circulation, surface and groundwater, terrestrial carbon, enclosed spaces, and spaces around buildings. Water quality modeling is the use of mathematical simulation techniques to analyse water quality data. It helps people understand the significance of water quality issues and provides evidence for policy makers to make decisions in order to properly mitigate water pollution.

A typical water quality model consists of a collection of formulations representing the physical mechanisms that determine the position and momentum of pollutants in a water body. Models can be either deterministic or statistical, depending on the modelling approach taken. Formulations and associated constants include Advective Transport, Dispersive Transport, Surface Heat Budget, Dissolved Oxygen Saturation, Reaeration, Carbonaceous Deoxygenation,
Nitrogenous Biochemical Oxygen Demand, Sediment oxygen demand,
Photosynthesis and Respiration, and pH.

An air quality simulation model system (AQSM) is a numerical technique or


methodology for estimating air pollutant concentrations in space and time. It is a
function of the distribution of emissions and the existing meteorological and
geophysical conditions. An AQSM can respond to the following types of
questions: relative contributions to concentrations of air pollutants from mobile
and stationary sources, emission reductions needed for outdoor concentrations to
meet air quality standards, where should a planned source of emissions be sited,
and what will be the change in ozone (O3) concentrations if the emissions of
precursor air pollutants (e.g. nitrogen oxides (NOx) or hyd rocarbons (HC)) are
reduced by a certain percentage. An alternative and complementary model is
source apportionment (SA), which starts from observed concentrations and their
chemical composition and estimates the relative contribution of various source
types by comparing the composition of sources with the observed composition at
the receptors. This module provides an understanding of the basic components of
an air simulation model and the key data requirements.

Emission estimates are calculated in a submodel of the AQSM using basic


information such as traffic modal distribution, vehicle fleet age, vehicles types,
industrial processes, resource use, power plant loads, and fuel types. Dispersion
models are used to simulate air pollutant concentrations at receptor sites at costs
much lower than those for air pollutant monitoring. They can estimate spatial
distributions, quantify source contributions, provide concentrations of a
compound, provide estimates on the impacts of a planned manufacturing facility
or of process changes in an existing plant, estimate the impacts at receptor sites
in the vicinity of a planned road or those of envisaged changes in traffic flow,
support the selection of appropriate monitoring sites, forecast air pollution
concentrations, and help estimate exposures by simulating concentrations and
duration of meteorological episodes. Parameters of Air Pollution Meteorology
include wind speed, wind direction, turbulence, mixing height, atmospheric
stability, temperature and inversion.



REFERENCES
 Gary Haq and Dieter Schwela, 2008. In: Foundation Course on Air Quality Management in Asia (edited book), Modelling Chapter.

 https://web.worldbank.org

 designingbuildings.co.uk

 https://medium.com

 https://commons.wikimedia.org/wiki/File:Pipe-PFR.sv

 www.eolss.net

 https://www.healthknowledge.org.uk/public-health-textbook/research-
methods/1b-statistical-methods/elementary-probability-theory

 www.healthknowledge.org.uk

 Jacob Bear, Milovan S. Beljin and Randall R. Ross, 1992. Fundamentals of Ground-Water Modeling.



School of Architecture Science and Technology
Yashwantrao Chavan Maharashtra Open University, Nashik – 422222

F EEDBACK S HEET FOR THE S TUDENT

Dear Student,
Now that you have gone through this book, it is time for you to do some thinking for us.
Please answer the following questions sincerely. Your response will help us to
analyse our performance and make the future editions of this book more useful.
Your response will be completely confidential and will in no way affect your
examination results. Your suggestions will receive prompt attention from us.
Please submit your feedback online via this QR code or at the following link:

https://forms.gle/rpDib9sy5b8JEisQ9

or email at: director.ast@ycmou.ac.in


or send this filled “Feedback Sheet” by post to the above address.

(Please tick the appropriate box)

Please write your Program Code and Course Code & Name

Style
01. Do you feel that this book enables you to learn the subject independently
without any help from others?

Yes No Not Sure

02. Do you feel the following sections in this book serve their purpose? Write the
appropriate code in the boxes.
Code 1 for "Serve the purpose fully"
Code 2 for "Serve the purpose partially"
Code 3 for "Do not serve any purpose"
Code 4 for "Purpose is not clear"

Warming up Check Point Answer to CheckPoints



To Begin with Summary References

Objectives Key Words

03. Do you feel the following sections or features, if included, will enhance self-learning and reduce help from others?

Yes No Not Sure

Index
Glossary
List of "Important Terms Introduced"
Two Color Printing

Content
04. How will you rate your understanding of the contents of this Book?

Very Bad Bad Average Good Excellent

05. How will you rate the language used in this Book?

Very Simple Simple Average Complicated Extremely Complicated

06. Do the syllabus and the content of the book complement each other?

Yes No Not Sure

07. Which topics did you find easiest to understand in this book?
Sr.No. Topic Name Page No.



08. Which topics did you find most difficult to understand in this book?
Sr.No. Topic Name Page No.

09. List the difficult topics you encountered in this book. Also try to suggest how they can be improved.
Use the following codes:
Code 1 for "Simplify Text"
Code 2 for "Add Illustrative Figures"
Code 3 for "Provide Audio-Vision (Audio Cassettes with companion Book)"
Code 4 for "Special emphasis on this topic in counseling"

Sr.No. Topic Name Page No. Required Action Code

10. List the errors which you might have encountered in this book.

Sr.No. Page No. Line No. Errors Possible Corrections

11. Based on your experience, how would you rank the components of distance learning by their effectiveness?



Use the following code.
Code 1 for "Most Effective" Code 2 for "Effective" Code 3 for "Average"
Code 4 for "Less Effective" Code 5 for "Least Effective"

Printed Book Counseling Lab Journal

Audio Lectures Home Assignment YouTube Videos

Video Lectures Lab-Experiment Online Counseling

12. Give your overall rating for this book:

1. 2. 3. 4. 5.

13. Any additional suggestions:

Thank you for your co-operation!
