Yashwantrao Chavan Maharashtra Open University
Programme: V136: M.Sc. in Environmental Science, Semester 04
Course: EVS042: Statistical Approaches and Modelling in Environmental Sciences
Email: director.ast@ycmou.ac.in | Website: www.ycmou.ac.in | Phone: +91-253-2231473
School of Architecture, Science and Technology (AST), YCMOU, Nashik – 422 222, Maharashtra, India
Brief Contents
Vice Chancellor's Message
Foreword by the Director
Credit 01
Credit 01 - Unit 01-01: Basics of Statistics
Credit 01 - Unit 01-02: Statistical Methods
Credit 01 - Unit 01-03: Dispersion
Credit 01 - Unit 01-04: Probability
Credit 02
Credit 02 - Unit 02-01: Correlation
Credit 02 - Unit 02-02: Regression
Credit 02 - Unit 02-03: Testing of Hypothesis
Credit 02 - Unit 02-04: Bioassay
Development Team
Instructional Technology Editor: Dr. Sunanda More, Director (I/c) and Associate Professor, School of Architecture, Science & Technology, YCMOU, Nashik
Course Coordinator: Mr. Manish S. Shingare, Academic Coordinator, School of Architecture, Science & Technology, YCMOU, Nashik
Book Writer: Mr. Kailas Ahire, Assistant Professor, Dept. of Environmental Science, K.R.T. Arts, B.H. Commerce and A.M. Science (KTHM) College, Nashik
Book Editor: Dr. Yogeshwar R. Baste, Assistant Professor, Karmaveer Shantarambapu Wavare Arts, Science and Commerce College, Uttamnagar CIDCO, Nashik
Dear Students,
Greetings!!!
I offer a cordial welcome to all of you to the Master's degree programme of Yashwantrao Chavan Maharashtra Open University.
As a postgraduate student, you should have the autonomy to learn, acquire information and knowledge about the different dimensions of Environmental Science, and, at the same time, develop the intellectual maturity needed to apply that knowledge wisely. The process of learning includes thinking clearly, grasping the important points, describing them on the basis of experience and observation, and explaining them to others by speaking or writing about them. The science of education today accepts the principle that excellence and knowledge of this kind are achievable.
The syllabus of this course has been structured in this book in such a way as to give you the autonomy to study easily without stirring from home. During the counselling sessions scheduled at your respective study centre, your doubts about the course will be clarified and you will receive guidance from qualified and experienced counsellors/professors. This guidance will not be based only on lectures; it will also include techniques such as question-and-answer sessions and doubt clarification. We expect your active participation in the contact sessions at the study centre. Our emphasis is on 'self-study': a student who learns how to study becomes independent in learning throughout life. This course book has been written with the objective of supporting self-study and giving you the autonomy to learn at your convenience.
During this academic year, you have to submit assignments and complete laboratory activities, field visits and project work wherever required. You have to opt for a specialization as per the programme structure. You will gain experience and enjoyment in personally carrying out these activities, which will enable you to assess your own progress and thereby achieve the larger educational objective.
We wish that you will enjoy the courses of Yashwantrao Chavan Maharashtra Open University, emerge successful and very soon become a knowledgeable and honourable Master's degree holder of this university.
I congratulate the Development Team for developing this excellent, high-quality Self-Learning Material (SLM) for the students. I hope and believe that this SLM will be immensely useful for all students of this programme.
Best Wishes!
Dear Students,
Greetings!!!
This book aims at acquainting students with the conceptual and applied fundamentals of Environmental Science required at the degree level.
The book has been specially designed for science students. It gives comprehensive coverage of environmental concepts and their application in practical life, and it contains numerous examples to build understanding and skills.
The book is written in a self-instructional format. Each chapter has a well-articulated structure that makes the contents not only easy to understand but also interesting.
Each chapter begins with learning objectives, stated using action verbs as per Bloom's taxonomy. Each unit opens with an introduction intended to arouse the learner's curiosity about the topic. Thereafter, the unit explains the concepts, supported by tables, figures, exhibits and solved illustrations wherever necessary for better understanding.
This book is written in simple and lucid language, using a spoken style and short sentences. The topics of each unit progress from simple to complex in a logical sequence. The book is accessible even to students who find the subject difficult, and it covers the full syllabus of the course.
Exercises in each chapter include MCQs, conceptual questions and practical questions, building a ladder that helps students grasp every aspect of a concept.
I thank the students, who have been a constant motivation for us. I am grateful to the writers, editors and the School faculty associated with the development of this SLM for the programme.
Best Wishes to all of you!!!
LEARNING OBJECTIVES
After successful completion of this unit, you will be able to understand:
Basics of statistics
Two main statistical methods are used in data analysis: descriptive statistics,
which summarize data from a sample using indexes such as the mean or standard
deviation, and inferential statistics, which draw conclusions from data that are
subject to random variation (e.g., observational errors, sampling variation).
Descriptive statistics are most often concerned with two sets of properties of a
distribution (sample or population): central tendency (or location) seeks to
characterize the distribution's central or typical value, while dispersion (or
variability) characterizes the extent to which members of the distribution depart
from its center and each other. Inferences on mathematical statistics are made
under the framework of probability theory, which deals with the analysis of
random phenomena.
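As a quick illustration of descriptive statistics, here is a minimal Python sketch (the data values are hypothetical, invented only for this example) summarizing a small sample with a measure of central tendency and a measure of dispersion:

    import statistics

    # Hypothetical dissolved-oxygen readings (mg/L) from a river sampling campaign
    readings = [7.8, 8.1, 6.9, 7.4, 8.0, 7.2, 7.6]

    mean = statistics.mean(readings)      # central tendency (location)
    median = statistics.median(readings)  # another measure of location
    sd = statistics.stdev(readings)       # dispersion (sample standard deviation)

    print(f"mean = {mean:.2f}, median = {median:.2f}, sample SD = {sd:.2f}")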
A standard statistical procedure involves the collection of data leading to test of
the relationship between two statistical data sets, or a data set and synthetic data
drawn from an idealized model. A hypothesis is proposed for the statistical
relationship between the two data sets, and this is compared as an alternative to
an idealized null hypothesis of no relationship between two data sets. Rejecting
or disproving the null hypothesis is done using statistical tests that quantify the
sense in which the null can be proven false, given the data that are used in the
test. Working from a null hypothesis, two basic forms of error are recognized:
Type I errors (null hypothesis is falsely rejected giving a "false positive") and
Type II errors (null hypothesis fails to be rejected and an actual relationship
between populations is missed giving a "false negative"). Multiple problems
have come to be associated with this framework, ranging from obtaining a
sufficient sample size to specifying an adequate null hypothesis.
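To make this testing framework concrete, here is a minimal Python sketch of a two-sample t-test. It assumes SciPy is available, and the two sets of readings are hypothetical values invented only for illustration:

    from scipy import stats  # assumption: SciPy is installed in the environment

    # Hypothetical nitrate concentrations (mg/L) upstream and downstream of an outfall
    upstream = [1.2, 1.5, 1.1, 1.3, 1.4, 1.2, 1.6]
    downstream = [1.9, 2.1, 1.8, 2.4, 2.0, 2.2, 1.7]

    # Null hypothesis: both sites have the same mean concentration
    t_stat, p_value = stats.ttest_ind(upstream, downstream)

    alpha = 0.05  # chosen Type I error rate
    if p_value < alpha:
        print(f"p = {p_value:.4f} < {alpha}: reject the null hypothesis")
    else:
        print(f"p = {p_value:.4f} >= {alpha}: fail to reject the null hypothesis")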
Measurement processes that generate statistical data are also subject to error.
Many of these errors are classified as random (noise) or systematic (bias), but
other types of errors (e.g., blunder, such as when an analyst reports incorrect
units) can also occur. The presence of missing data or censoring may result in
biased estimates and specific techniques have been developed to address these
problems.
Sampling (statistics)
• Cost/operational concerns
In a simple random sample (SRS) of a given size, all subsets of a sampling frame
have an equal probability of being selected. Each element of the frame thus has
an equal probability of selection: the frame is not subdivided or partitioned.
Furthermore, any given pair of elements has the same chance of selection as any
other such pair (and similarly for triples, and so on). This minimizes bias and
simplifies analysis of results. In particular, the variance between individual
results within the sample is a good indicator of variance in the overall
population, which makes it relatively easy to estimate the accuracy of results.
Also, simple random sampling can be cumbersome and tedious when sampling
from a large target population. In some cases, investigators are interested in
research questions specific to subgroups of the population. For example,
researchers might be interested in examining whether cognitive ability as a
predictor of job performance is equally applicable across racial groups. Simple random sampling cannot accommodate the needs of researchers in this situation.
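A minimal Python sketch of drawing a simple random sample from a frame (the frame of 500 numbered monitoring wells is hypothetical):

    import random

    # Hypothetical sampling frame: 500 monitoring wells, labelled 1..500
    frame = list(range(1, 501))

    # Draw 25 wells without replacement; every 25-element subset is equally likely
    srs = random.sample(frame, k=25)
    print(sorted(srs))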
Systematic sampling
For example, suppose we wish to sample people from a long street that starts in a
poor area (house No. 1) and ends in an expensive district (house No. 1000). A
simple random selection of addresses from this street could easily end up with
too many from the high end and too few from the low end (or vice versa),
leading to an unrepresentative sample. Selecting (e.g.) every 10th street number
along the street ensures that the sample is spread evenly along the length of the
street, representing all of these districts. (Note that if we always start at house #1 and end at #991, the sample is slightly biased towards the low end; by randomly selecting the start between #1 and #10, this bias is eliminated.)
For example, consider a street where the odd-numbered houses are all on the north (expensive) side of the road, and the even-numbered houses are all on the south (cheap) side. Under the scheme above, every selected house number has the same parity, so the whole sample falls on one side of the street and is unrepresentative.
Systematic sampling can also be adapted to a non-EPS approach; for an example, see the discussion of PPS samples below.
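A minimal Python sketch of systematic (every k-th) selection with a random start, using the street example above; the helper name systematic_sample is our own:

    import random

    def systematic_sample(frame, n):
        """Select n elements by taking every k-th element after a random start."""
        k = len(frame) // n              # sampling interval
        start = random.randrange(k)      # random start removes the bias noted above
        return [frame[start + i * k] for i in range(n)]

    houses = list(range(1, 1001))            # house numbers 1..1000 along the street
    print(systematic_sample(houses, 100))    # roughly every 10th house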
Stratified sampling
When the population embraces a number of distinct categories, the frame can be
organized by these categories into separate "strata." Each stratum is then sampled
as an independent sub-population, out of which individual elements can be
randomly selected. The ratio of the size of this random selection (or sample) to
the size of the population is called a sampling fraction. There are several
potential benefits to stratified sampling.
First, dividing the population into distinct, independent strata can enable
researchers to draw inferences about specific subgroups that may be lost in a
more generalized random sample.
Second, utilizing a stratified sampling method can lead to more efficient statistical estimates, provided that the strata are selected on the basis of relevance to the criterion in question rather than availability of the samples.
Third, it is sometimes the case that data are more readily available for individual, pre-existing strata within a population than for the overall population; in such cases, using a stratified sampling approach may be more convenient than aggregating data across groups (though this may potentially be at odds with the previously noted importance of utilizing criterion-relevant strata).
There are, however, some potential drawbacks to using stratified sampling. First,
identifying strata and implementing such an approach can increase the cost and
complexity of sample selection, as well as leading to increased complexity of
population estimates. Second, when examining multiple criteria, stratifying
variables may be related to some, but not to others, further complicating the
design, and potentially reducing the utility of the strata. Finally, in some cases
(such as designs with a large number of strata, or those with a specified
minimum sample size per group), stratified sampling can potentially require a
larger sample than would other methods (although in most cases, the required
sample size would be no larger than would be required for simple random
sampling).
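A minimal Python sketch of stratified sampling with proportional allocation; the frame, the land-use strata and the helper stratified_sample are hypothetical illustrations, not a prescribed method:

    import random
    from collections import defaultdict

    # Hypothetical frame of (site_id, land_use) pairs; land use defines the strata
    frame = [(i, random.choice(["urban", "agricultural", "forest"])) for i in range(1, 301)]

    def stratified_sample(frame, total_n):
        strata = defaultdict(list)
        for unit, stratum in frame:
            strata[stratum].append(unit)
        sample = []
        for units in strata.values():
            # Proportional allocation: stratum share of the frame times total sample size
            n_h = round(total_n * len(units) / len(frame))
            sample.extend(random.sample(units, min(n_h, len(units))))
        return sample

    print(stratified_sample(frame, 30))  # realized size may differ slightly due to rounding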
Disadvantages
Poststratification
Oversampling
Probability-proportional-to-size sampling
In some cases the sample designer has access to an "auxiliary variable" or "size
measure", believed to be correlated to the variable of interest, for each element
in the population. These data can be used to improve accuracy in sample design.
One option is to use the auxiliary variable as a basis for stratification.
The PPS approach can improve accuracy for a given sample size by
concentrating sample on large elements that have the greatest impact on
population estimates. PPS sampling is commonly used for surveys of businesses,
where element size varies greatly and auxiliary information is often available –
for instance, a survey attempting to measure the number of guest-nights spent in
hotels might use each hotel's number of rooms as an auxiliary variable. In some
cases, an older measurement of the variable of interest can be used as an
auxiliary variable when attempting to produce more current estimates.
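A minimal Python sketch of systematic probability-proportional-to-size selection, using the hotel/rooms example above; the hotel names, room counts and the helper pps_systematic are invented for illustration:

    import random
    from itertools import accumulate

    # Hypothetical hotels with their number of rooms (the auxiliary size measure)
    hotels = {"A": 20, "B": 180, "C": 55, "D": 320, "E": 90, "F": 35}

    def pps_systematic(sizes, n):
        """Systematic PPS selection: selection probability is proportional to size."""
        units = list(sizes)
        cum = list(accumulate(sizes[u] for u in units))  # cumulative sizes
        step = cum[-1] / n                               # sampling interval on the size scale
        points = [random.uniform(0, step) + i * step for i in range(n)]
        chosen = []
        for p in points:
            for unit, c in zip(units, cum):
                if p <= c:               # a very large unit may be hit more than once
                    chosen.append(unit)
                    break
        return chosen

    print(pps_systematic(hotels, 3))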
Cluster sampling
Clustering can reduce travel and administrative costs. In the example above, an
interviewer can make a single trip to visit several households in one block, rather
than having to drive to a different block for each household.
It also means that one does not need a sampling frame listing all elements in the
target population. Instead, clusters can be chosen from a cluster-level frame, with
an element-level frame created only for the selected clusters. In the example
above, the sample only requires a block-level city map for initial selections, and
then a household-level map of the 100 selected blocks, rather than a household -
level map of the whole city.
Multistage sampling can substantially reduce sampling costs where the complete population list would otherwise need to be constructed before other sampling methods could be applied. By eliminating the work involved in describing clusters that are not selected, multistage sampling can reduce the large costs associated with a traditional cluster design.
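A minimal Python sketch of the two-stage (block, then household) selection described above; the block counts and household IDs are hypothetical:

    import random

    # Hypothetical frame: 40 city blocks, each holding a list of household IDs
    blocks = {b: [f"block{b}-hh{h}" for h in range(1, random.randint(20, 60))]
              for b in range(1, 41)}

    # Stage 1: choose 5 blocks at random (only a block-level frame is needed here)
    chosen_blocks = random.sample(list(blocks), k=5)

    # Stage 2: list households only for the chosen blocks and sample 10 from each
    sample = []
    for b in chosen_blocks:
        sample.extend(random.sample(blocks[b], k=min(10, len(blocks[b]))))

    print(len(sample), "households selected from blocks", chosen_blocks)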
Quota sampling
In quota sampling, the population is first segmented into mutually exclusive sub-groups, just as in stratified sampling. Then judgement is used to select the subjects or units from each segment based on a specified proportion. For example, an interviewer may be told to sample 200 females and 300 males between the ages of 45 and 60.
It is this second step which makes the technique one of non-probability sampling. In quota sampling the selection of the sample is non-random. For example, interviewers might be tempted to interview those who look most helpful. The problem is that these samples may be biased because not everyone gets a chance of selection. This non-random element is its greatest weakness, and quota versus probability sampling has been a matter of controversy for several years.
Minimax sampling
In imbalanced datasets, where the sampling ratio does not follow the population
statistics, one can resample the dataset in a conservative manner called minimax
sampling. Minimax sampling has its origin in the Anderson minimax ratio, whose value is proved to be 0.5: in a binary classification, the class-sample sizes should be chosen equally. This ratio can be proved to be the minimax ratio only under the assumption of an LDA classifier with Gaussian distributions. The notion of minimax sampling has recently been developed for a general class of classification rules, called class-wise smart classifiers. In this case, the sampling ratio of classes is selected so that the worst-case classifier error over all possible population statistics for class prior probabilities is minimized.
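A minimal Python sketch of the equal class-sample-size (0.5 ratio) idea, implemented here simply as downsampling the majority class; the dataset is synthetic and the helper name is our own:

    import random

    # Synthetic imbalanced binary dataset of (features, label) pairs: 900 vs 100 records
    data = ([([random.random()], 0) for _ in range(900)] +
            [([random.random()], 1) for _ in range(100)])

    def balanced_resample(dataset):
        """Downsample the majority class so both classes are equally represented."""
        pos = [d for d in dataset if d[1] == 1]
        neg = [d for d in dataset if d[1] == 0]
        n = min(len(pos), len(neg))
        return random.sample(pos, n) + random.sample(neg, n)

    balanced = balanced_resample(data)
    print(len(balanced), "records,", sum(1 for _, y in balanced if y == 1), "positives")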
Accidental sampling
1. Are there controls within the research design or experiment which can
serve to lessen the impact of a non-random convenience sample, thereby
ensuring the results will be more representative of the population?
3. Is the question being asked by the research one that can adequately be
answered using a convenience sample?
Voluntary Sampling
Line-intercept sampling
Panel sampling
Snowball sampling
Theoretical sampling
Theoretical sampling occurs when samples are selected on the basis of the results of the data collected so far, with the goal of developing a deeper understanding of the area being studied or of the theory emerging from it.
Before we define what data collection is, it's essential to ask the question, "What is data?" The abridged answer is: data is various kinds of information formatted in a particular way. There are only two classes of data in statistics: quantitative data and qualitative data. This highest level of classification comes from the fact that data can either be measured or be an observed feature of interest.
Qualitative data are also referred to as categorical data. They are an observed
phenomenon and cannot be measured with numbers. Examples: a race, age group,
gender, origin, and so on. Even if they are coded with a numerical value (e.g., 1 for male and 0 for female), the number itself holds no quantitative meaning.
Quantitative data, on the other hand, tell us about the quantities of things, or the things we can measure, and so they are expressed in numbers. They are also known as numerical data and lend themselves to statistical analysis. Examples: height, amount of water, distance, and so on.
We can further subdivide quantitative data and qualitative data into 4 subtypes as
follows: nominal data, ordinal data, interval data, and ratio data.
Qualitative data can be subdivided into nominal and ordinal data types. While
both these types of data can be classified, ordinal data can be ordered as well.
Nominal Data
Nominal data is a type of data that represents discrete units which is why it
cannot be ordered and measured. They are used to label variables without
providing any quantitative value. Also, they have no meaningful zero.
The only logical operation that you can apply to them is equality or inequality
which you can also use to group them. The descriptive statistics you can do with
nominal data include frequencies, proportions, percentages, and central points.
And, to visualize nominal data, you can use a pie chart or a bar chart.
Ordinal Data
Ordinal values represent discrete as well as ordered units. Unlike nominal, here
the ordering matters. However, there is no consistency in the relative distance
between the adjacent categories. And, similar to nominal data, ordinal data also
don't have a meaningful zero.
The descriptive statistics that you can do with ordinal data include frequencies,
proportions, percentages, central points, percentiles, median, mode, and the
interquartile range. The visualization methods that can be used here are the same as for nominal data.
Two types of quantitative data are discrete data and continuous data. Discrete
data have distinct and separate values. Therefore, they are data with fixed points
and can't take any measures in between. So all counted data are discrete data.
Some examples of discrete data include shoe sizes, number of students in class,
number of languages an individual speaks, etc. Continuous data, on the other
hand, represent an endless range of possible values within a specified range. It
can be divided into finer parts to be measured but not counted. Continuous data
examples include temperature range, height, weight, etc.
Continuous data can be visualized by histogram or box plot while bar graphs or
stem plots can be used for discrete data.
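A minimal Python sketch of this convention, assuming matplotlib is available; the temperature and language-count data are randomly generated placeholders:

    import random
    import matplotlib.pyplot as plt  # assumption: matplotlib is installed

    # Hypothetical continuous data: daily maximum temperatures (°C)
    temperatures = [random.gauss(28, 3) for _ in range(200)]

    # Hypothetical discrete data: number of languages spoken per respondent
    languages = [random.choice([1, 1, 1, 2, 2, 3, 4]) for _ in range(200)]
    counts = {k: languages.count(k) for k in sorted(set(languages))}

    fig, (ax1, ax2) = plt.subplots(1, 2)
    ax1.hist(temperatures, bins=15)                 # histogram for continuous data
    ax1.set_title("Continuous: histogram")
    ax2.bar(list(counts), list(counts.values()))    # bar chart for discrete data
    ax2.set_title("Discrete: bar chart")
    plt.show()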
Interval Data
Interval data represent ordered data measured along a numerical scale with equal distances between adjacent units. These equal distances are also referred to as intervals. So a variable contains interval data if it has ordered numeric values with the exact differences between them known.
You can compare and order interval data and add or subtract the values, but you cannot multiply or divide them, because interval data has no meaningful zero. The descriptive statistics you can apply to interval data include the central point, range, and spread.
Ratio Data
Like Interval data, ratio data are also ordered with the same differe nce between
the individual units. However, they also have a meaningful zero so they cannot
take negative values.
Now with real zero points, we can also multiply and divide the numbers.
Besides, you can sort the values as well. The descriptive statistics you can do
with ratio data are the same as interval data and include central point, range, and
spread.
Overall, ratio data and interval data are the same with equal spacing between
adjoining values but the former also has a meaningful zero. Besides addition and
subtraction, you can also multiply and divide the data, which is impossible with
interval data as it does not have an absolute zero. However, interval data can take
negative values with no absolute zero while ratio data cannot.
(Ref. https://www.turing.com/kb/statistical-data-types)
What is Data Collection: A Definition
During data collection, the researchers must identify the data types, the sources
of data, and what methods are being used. We will soon see that there are many
different data collection methods. There is heavy reliance on data collection in
research, commercial, and government fields.
Before an analyst begins collecting data, they must first answer questions such as:
• What methods and procedures will be used to collect, store, and process
the information?
The concept of data collection isn't a new one, as we'll see later, but the world
has changed. There is far more data available today, and it exists in forms that
were unheard of a century ago. The data collection process has had to change and
grow with the times, keeping pace with technology.
Now that you know what data collection is and why we need it, let's take a look at the different methods of data collection. While the phrase "data collection" may sound high-tech and digital, it doesn't necessarily involve computers, big data and the internet. Data collection could mean a telephone survey, a mail-in comment card, or even someone with a clipboard asking passersby some questions. But let's see if we can sort the different data collection methods into a semblance of organized categories.
• Surveys
• Transactional Tracking
• Observation
• Online Tracking
• Forms
Data collection breaks down into two methods. As a side note, many terms, such as techniques, methods, and types, are used interchangeably, depending on who uses them. One source may call data collection techniques "methods," for instance. But whatever labels we use, the general concepts and breakdowns apply across the board, whether we're talking about marketing analysis or a scientific research project.
• Primary
As the name implies, this is original, first-hand data collected by the data
researchers. This process is the initial information gathering step, performed
before anyone carries out any further or related research. Primary data results are highly accurate provided the researcher collects the information personally. However, there is a downside, as first-hand research is potentially time-consuming and expensive.
• Secondary
Secondary data is second-hand data collected by other parties and already having
undergone statistical analysis. This data is either information that the researcher
has tasked other people to collect or information the researcher has looked up.
Simply put, it's second-hand information. Although it's easier and cheaper to obtain than primary information, secondary information raises concerns
regarding accuracy and authenticity. Quantitative data makes up a majority of
secondary data.
Let's get into specifics. Using the primary/secondary methods mentioned above, here is a breakdown of specific techniques.
• Interviews
• Delphi Technique
The Oracle at Delphi, according to Greek mythology, was the high priestess of Apollo's temple, who gave advice, prophecies, and counsel. In the realm of data collection, researchers use the Delphi technique by gathering information from a
panel of experts. Each expert answers questions in their field of specialty, and
the replies are consolidated into a single opinion.
• Focus Groups
Focus groups, like interviews, are a commonly used technique. The group
consists of anywhere from a half-dozen to a dozen people, led by a moderator,
brought together to discuss the issue.
• Questionnaires
For secondary data, unlike primary data collection, there are no specific collection methods. Instead, since the information has already been collected, the researcher consults various data sources, such as:
• Financial Statements
• Sales Reports
• Retailer/Distributor/Deal Feedback
• Business Journals
• Trade/Business Magazines
• The internet
Now that we've explained the various techniques, let's narrow our focus even further by looking at some specific tools. For example, we mentioned interviews as a technique, but we can further break that down into different interview types (or "tools").
• Word Association
The researcher gives the respondent a set of words and asks them what comes to mind when they hear each word.
• Sentence Completion
• Role-Playing
Respondents are presented with an imaginary situation and asked how they
would act or react if it was real.
• In-Person Surveys
• Online/Web Surveys
These surveys are easy to accomplish, but some users may be unwilling to
answer truthfully, if at all.
• Mobile Surveys
• Phone Surveys
No researcher can call thousands of people at once, so they need a third party to handle the chore. However, many people have call screening and won't answer.
• Observation
Sometimes, the simplest method is the best. Researchers who make direct observations collect data quickly and easily, with little intrusion or third-party bias. Naturally, it's only effective in small-scale situations.
Incorrectly collected data can have serious consequences. When flawed study findings are used to support recommendations for public policy, there is the potential for disproportionate harm, even though the degree of influence of flawed data collection may vary by discipline and type of investigation.
Let us now look at the various issues that we might face while maintaining the
integrity of data collection.
Maintaining data integrity is the main justification for these safeguards: it assists the detection of errors in the data-gathering process, whether they were introduced deliberately (falsification) or not (systematic or random errors). Quality assurance and quality control are two strategies that help protect data integrity and guarantee the scientific validity of study results.
• Quality assurance - tasks that are performed before data collection begins
• Quality control - tasks that are performed during and after data collection
Quality Assurance
The likelihood of failing to spot issues and mistakes early in the research effort increases when procedural guides are written poorly. There are several ways in which these shortcomings show up:
• There isn't a system in place to track modifications to processes that may occur as the investigation continues.
Problems with data collection, for instance, that call for immediate action
include:
• Fraud or misbehavior
In the social and behavioral sciences, where primary data collection involves human subjects, researchers are trained to include one or more secondary measures that can be used to verify the quality of the information obtained from the subject.
Let us now explore the common challenges with regard to data collection.
There are some prevalent challenges faced while collecting data, let us explore a
few of them to understand them better and avoid them.
The main threat to the broad and successful application of machine learning is poor data quality. Data quality must be your top priority if you want to make the most of your data.
Inconsistent Data
When working with various data sources, it's conceivable that the same
information will have discrepancies between sources. The differences could be in
formats, units, or occasionally spellings. The introduction of inconsistent data
might also occur during firm mergers or relocations. Inconsistencies in data have
a tendency to accumulate and reduce the value of data if they are not continually
resolved. Organizations that have heavily focused on data consistency do so
because they only want reliable data to support their analytics.
Data Downtime
Data is the driving force behind the decisions and operations of data-driven businesses. However, there may be brief periods when their data is unreliable or not ready for use. Customer complaints and subpar analytical outcomes are only two
ways that this data unavailability can have a significant impact on businesses. A
data engineer spends about 80% of their time updating, maintaining, and
guaranteeing the integrity of the data pipeline. In order to ask the next business
question, there is a high marginal cost due to the lengthy operational lead time
from data capture to insight.
Schema modifications and migration problems are just two examples of the
causes of data downtime. Data pipelines can be difficult due to their size and
complexity. Data downtime must be continuously monitored, and it must be
reduced through automation.
Ambiguous Data
Even with thorough oversight, some errors can still occur in massive databases
or data lakes. For data streaming at a fast speed, the issue becomes more
overwhelming. Spelling mistakes can go unnoticed, formatting difficulties can
occur, and column heads might be deceptive. This unclear data might cause a
number of problems for reporting and analytics.
Duplicate Data
Streaming data, local databases, and cloud data lakes are just a few of the sources of data that modern enterprises must contend with. They might also have application and system silos. These sources are likely to duplicate and overlap
each other quite a bit. For instance, duplicate contact information has a
substantial impact on customer experience. If certain prospects are ignored while
others are engaged repeatedly, marketing campaigns suffer. The likelihood of
biased analytical outcomes increases when duplicate data are present. It can also
result in ML models with biased training data.
Inaccurate Data
For highly regulated businesses like healthcare, data accuracy is crucial. Given
the current experience, it is more important than ever to increase the data quality
for COVID-19 and later pandemics. Inaccurate information does not provide you
with a true picture of the situation and cannot be used to plan the best course of
action. Personalized customer experiences and marketing strategies
underperform if your customer data is inaccurate.
Hidden Data
The majority of businesses only utilize a portion of their data, with the remainder
sometimes being lost in data silos or discarded in data graveyards. For instance,
the customer service team might not receive client data from sales, missing an
opportunity to build more precise and comprehensive customer profiles. Missing
out on possibilities to develop novel products, enhance services, and streamline
procedures is caused by hidden data.
Finding relevant data is not so easy. There are several factors that we need to consider while trying to find relevant data, including:
• Relevant domain
• Relevant demographics
• Relevant time period
and many more. Data that is not relevant to our study on any of these factors is rendered unusable, and we cannot effectively proceed with its analysis. This could lead to incomplete research or analysis, collecting data again and again, or shutting down the study.
Determining what data to collect is one of the most important decisions in the data collection process and should be settled at the outset. We must choose the subjects the data will cover, the sources we will use to gather it, and the quantity of information we will require. Our answers to these questions will depend on our aims, or what we expect to achieve using our data. As an illustration, we may choose to gather information on the categories of articles that website visitors between the ages of 20 and 50 most frequently access. We can also decide to compile data on the typical age of all the clients who made a purchase from our business over the previous month.
Not addressing this could lead to duplicated work, the collection of irrelevant data, or the ruin of the study as a whole.
Big data refers to exceedingly massive data sets with more intricate and diversified structures. These traits typically result in increased challenges in storing and analyzing the data and in applying additional methods of extracting results. Big data refers especially to data sets that are so enormous or intricate that conventional data processing tools are insufficient for the overwhelming amount of data, both unstructured and structured, that a business faces on a daily basis.
Poor design and low response rates were shown to be two issues with data
collecting, particularly in health surveys that used questionnaires. This might
lead to an insufficient or inadequate supply of data for the study. Creating an
incentivized data collection program might be beneficial in this case to get more
responses.
Now, let us look at the key steps in the data collection process.
In the Data Collection Process, there are 5 key steps. They are explained briefly
below -
1. Decide What Information You Want to Gather
The first thing that we need to do is decide what information we want to gather.
We must choose the subjects the data will cover, the sources we will use to
gather it, and the quantity of information that we would require. For instance, we
may choose to gather information on the categories of products that an average
e-commerce website visitor between the ages of 30 and 45 most frequently
searches for.
2. Establish a Deadline and a Plan for Data Collection
The process of creating a strategy for data collection can now begin. We should set a deadline for our data collection at the outset of our planning phase. Some forms of data we might want to collect continuously; for example, we might want to build up a technique for tracking transactional data and website visitor statistics over the long term.
3. Select a Data Collection Method
We will select the data collection technique that will serve as the foundation of our data gathering plan at this stage. We must take into account the type of information that we wish to gather, the time period over which we will obtain it, and the other factors we have decided on, in order to choose the best gathering strategy.
4. Gather Information
Once our plan is complete, we can put our data collection plan into action and
begin gathering data. We can store and arrange our data in our data management platform (DMP). We need to
be careful to follow our plan and keep an eye on how it's doing. Especially if we
are collecting data regularly, setting up a timetable for when we will be checking
in on how our data gathering is going may be helpful. As circumstances alter and
we learn new details, we might need to amend our plan.
5. Analyze the Data and Implement Findings
It's time to examine our data and arrange our findings after we have gathered all
of our information. The analysis stage is essential because it transforms
unprocessed data into insightful knowledge that can be applied to better our
marketing plans, goods, and business judgments. The analytics tools included in
our DMP can be used to assist with this phase. We can put the discoveries to use
to enhance our business once we have discovered the patterns and insights in our
data.
Let us now look at some data collection considerations and best practices that
one might follow.
We must plan carefully before spending time and money traveling to the field to gather data. Effective data collection strategies can help us collect richer and more accurate data while saving time and resources.
Below, we will be discussing some of the best practices that we can follow for
the best results -
1. Take the Cost of Each Additional Data Point into Account
Once we have decided on the data we want to gather, we need to make sure to
take the expense of doing so into account. Our surveyors and respondents will
incur additional costs for each additional data point or survey question.
2. Remember That Not All Data Is Easy to Obtain
There is a dearth of freely accessible data. Sometimes the data exist, but we may not have access to them. For instance, unless we have a compelling cause, we cannot openly view another person's medical information. Several types of information can also be difficult to measure.
3. Think About Your Choices for Data Collecting Using Mobile Devices
• SMS data collection - sends a text message to the respondent, who can then answer questions by text on their phone.
We need to make sure to select the appropriate tool for our survey and
responders because each one has its own disadvantages and advantages.
4. Only Gather the Information That You Require
It's all too easy to get information about anything and everything, but it's crucial to only gather the information that we require.
5. Collect Identifiers Along with the Data
Identifiers, or details describing the context and source of a survey response, are just as crucial as the information about the subject or program that we are actually researching.
(Source: https://www.simplilearn.com/what-is-data-collection-article).
SELF-TEST
1) Which of the following values is used as a summary measure for a sample,
such as a sample mean?
a) Population parameter
b) Sample parameter
c) Sample statistic
d) Population mean
a) Descriptive statistics
b) Inferential statistics
c) Industry statistics
d) Both A and B
LEARNING OBJECTIVES
After successful completion of this unit, you will be able to
Learn about Mean
Understand Median
Understand Mode
As described in Byju's introduction to mean, median and mode: often in statistics, we tend to represent a set of data by a representative value that approximately defines the entire collection. This representative value is called a measure of central tendency, and, as the name suggests, it is a value around which the data are centred. These central tendencies are the mean, median and mode.
(Credit: https://byjus.com)
We are all interested in cricket, but have you ever wondered why the run rate of a particular over is projected during the match, and what the run rate means? Or, when you get your examination result card, you mention the aggregate percentage. Again, what is the meaning of 'aggregate'? All these real-life quantities are single representative values that summarize a larger collection of data.
Statistics deals with the collection of data and information for a particular purpose. The tabulation of each run for each ball in cricket gives the statistics of the game. The representation of any such data collection can be done in multiple ways, like through tables, graphs, pie charts, bar graphs, pictorial representation, etc.
Now consider a 50 over ODI match going between India and Australia. India
scored 370 runs by the end of the first innings. How do you decide whether India put up a good score or not? It's pretty simple: you find the overall run rate, which is good for such a score. Thus, here the concepts of mean, median and mode come into the picture. Let us learn about each of these central tendencies in detail.
The measures of central tendencies are given by various parameters but the most
commonly used ones are mean, median and mode. These parameters are
discussed below.
What is Mean?
It is equal to the sum of all the values in the collection of data divided by the
total number of values.
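In symbols (the formula itself is not reproduced in this copy), for n observations x1, x2, …, xn:

    Mean (x̄) = (x1 + x2 + … + xn) / n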
What is Median?
Generally median represents the mid-value of the given set of data when
arranged in a particular order.
If the number of values or observations in the given data set is odd, then the median is the [(n + 1)/2]th observation.
If the number of values or observations in the given data set is even, then the median is the average of the (n/2)th and [(n/2) + 1]th observations.
The median for grouped data can be calculated using the formula,
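The grouped-data formula referred to here is not reproduced in this copy; its standard form is:

    Median = l + ((N/2 − cf) / f) × h

where l is the lower boundary of the median class, N the total frequency, cf the cumulative frequency of the class preceding the median class, f the frequency of the median class, and h the class width.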
What is Mode?
The most frequent number occurring in the data set is known as the mode.
Consider the following data set which represents the marks obtained by different
students in a subject.
We can calculate the mode for grouped data using the below formula:
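The formula is not reproduced in this copy; its standard form is:

    Mode = l + ((f1 − f0) / (2f1 − f0 − f2)) × h

where l is the lower boundary of the modal class, f1 its frequency, f0 and f2 the frequencies of the classes just before and after it, and h the class width.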
Let us see the difference between the mean median and mode through an
example.
Example: The given table shows the scores obtained by different players in a
match. What is mean, median and mode of the given data?
Solution:
ii) To find out the median let us first arrange the given data in ascending order
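Since the original table and worked solution are not reproduced in this copy, here is a minimal Python sketch with illustrative (hypothetical) scores showing how the three measures are obtained:

    import statistics

    # Hypothetical player scores (the original table is not reproduced here)
    scores = [36, 42, 15, 42, 58, 42, 27, 36, 50]

    print("Mean  :", statistics.mean(scores))    # 348 / 9 ≈ 38.67
    print("Median:", statistics.median(scores))  # middle value of the sorted scores = 42
    print("Mode  :", statistics.mode(scores))    # most frequent score = 42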
The relation between mean, median and mode, i.e., between the three measures of central tendency, for a moderately skewed distribution is given by the formula:
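The formula, not reproduced in this copy, is the standard empirical relationship:

    Mode ≈ 3 × Median − 2 × Mean   (equivalently, Mean − Mode ≈ 3 × (Mean − Median))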
This relation is also called an empirical relationship. This is used to find one of
the measures when the other two measures are known to us for certain data. This
relationship is rewritten in different forms by interchanging the LHS and RHS.
Range
In statistics, the range is the difference between the highest and lowest data value
in the set. The formula is:
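In symbols:

    Range = Highest value − Lowest value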
(Credit: https://byjus.com/maths/mean-median-mode/).
SELF-TEST
1) Mean, Median and Mode are
a) Measures of deviation
b) Ways of sampling
a) Socio-economic Status
b) Marital Status
c) Numerical Aptitude
d) Professional Attitude
LEARNING OBJECTIVES
After successful completion of this unit, you will be able to learn
Basics of Dispersion
Types of Dispersion
Statistical dispersion
In statistics, dispersion (also called variability, scatter, or spread) is the extent to which a distribution is stretched or squeezed. Dispersion is contrasted with location or central tendency, and together they are the most used properties of distributions.
Measures
Most measures of dispersion have the same units as the quantity being measured.
In other words, if the measurements are in metres or seconds, so is the measure
of dispersion. Examples of dispersion measures include:
Standard deviation
Interquartile range (IQR)
Range
Mean absolute difference (also known as Gini mean absolute
difference)
Median absolute deviation (MAD)
These are frequently used (together with scale factors) as estimators of scale
parameters, in which capacity they are called estimates of scale. Robust
measures of scale are those unaffected by a small number of outliers, and include
the IQR and MAD.
All the above measures of statistical dispersion have the useful property that they are location-invariant and linear in scale. This means that if a random variable X has a dispersion of S_X, then the linear transformation Y = aX + b (for real a and b) has dispersion S_Y = |a| · S_X, where |a| is the absolute value of a (i.e., any preceding negative sign is ignored).
Coefficient of variation
Quartile coefficient of dispersion
Relative mean difference, equal to twice the Gini coefficient
Entropy: while the entropy of a discrete variable is location-invariant and scale-independent, and therefore not a measure of dispersion in the above sense, the entropy of a continuous variable is location-invariant and additive in scale: if H_z is the entropy of the continuous variable z and z = ax + b, then H_z = H_x + log(a).
Some measures of dispersion have specialized purposes. The Allan variance can
be used for applications where the noise disrupts convergence. The Hadamard
variance can be used to counteract linear frequency drift sensitivity.
Sources
In the physical sciences, such variability may result from random measurement
errors: instrument measurements are often not perfectly precise, i.e.,
reproducible, and there is additional inter-rater variability in interpreting and
reporting the measured results. One may assume that the quantity being
measured is stable, and that the variation between measurements is due
to observational error. A system of a large number of particles is characterized by
the mean values of a relatively few numbers of macroscopic quantities such as
temperature, energy, and density. The standard deviation is an important measure
in fluctuation theory, which explains many physical phenomena, including why
the sky is blue.
In the biological sciences, the quantity being measured is seldom unchanging and
stable, and the variation observed might additionally be intrinsic to the
phenomenon: It may be due to inter-individual variability, that is, distinct
members of a population differing from each other. Also, it may be due to intra-
individual variability, that is, one and the same subject differing in tests taken at
different times or in other differing conditions. Such types of variability are also
seen in the arena of manufactured products; even there, the meticulous scientist
finds variation.
Quartile coefficient of dispersion
The statistic is easily computed using the first (Q1) and third (Q3) quartiles for each data set. The quartile coefficient of dispersion is:
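The formula, not reproduced in this copy, is in its standard form:

    Quartile coefficient of dispersion = (Q3 − Q1) / (Q3 + Q1)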
Example
Deviation (statistics)
Types
A deviation that is the difference between the observed value and an estimate of the
true value (e.g. the sample mean; the Expected Value of a sample can be used as an
estimate of the Expected Value of the population) is a residual. These concepts are
applicable for data at the interval and ratio levels of measurement.
Measures
For an unbiased estimator, the average of the signed deviations across the entire set of
all observations from the unobserved population parameter value averages zero over
an arbitrarily large number of samples. However, by construction the average of
signed deviations of values from the sample mean value is always zero, though the
average signed deviation from another measure of central tendency, such as the
sample median, need not be zero.
Dispersion
Normalization
One way is by dividing by a measure of scale (statistical dispersion), most often either
the population standard deviation, in standardizing, or the sample standard deviation,
in studentizing (e.g., Studentized residual).
One can scale instead by location, not dispersion: the formula for a percent
deviation is the observed value minus accepted value divided by the accepted value
multiplied by 100%.
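Written as a formula, the definition just given reads:

    Percent deviation = ((observed value − accepted value) / accepted value) × 100%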
Standard deviation
The standard deviation of a random variable, sample, statistical population, data set,
or probability distribution is the square root of its variance. It is algebraically simpler,
though in practice less robust, than the average absolute deviation.[2][3] A useful
property of the standard deviation is that, unlike the variance, it is expressed in the
same unit as the data.
The standard deviation of a population or sample and the standard error of a statistic
(e.g., of the sample mean) are quite different, but related. The sample mean's standard
error is the standard deviation of the set of means that would be found by drawing an
infinite number of repeated samples from the population and computing a mean for
each sample. The mean's standard error turns out to equal the population standard
deviation divided by the square root of the sample size, and is estimated by using the
sample standard deviation divided by the square root of the sample size. For example,
a poll's standard error (what is reported as the margin of error of the poll) is the expected standard deviation of the estimated mean if the same poll were to be conducted multiple times. Thus, the standard error estimates the standard deviation of an estimate, which itself measures how much the estimate is likely to vary from sample to sample.
In science, it is common to report both the standard deviation of the data (as a
summary statistic) and the standard error of the estimate (as a measure of potential
error in the findings). By convention, only effects more than two standard errors away from a null expectation are considered "statistically significant", a safeguard against a spurious conclusion that is really due to random sampling error.
When only a sample of data from a population is available, the term standard
deviation of the sample or sample standard deviation can refer to either the above-
mentioned quantity as applied to those data, or to a modified quantity that is an
unbiased estimate of the population standard deviation (the standard deviation of the
entire population).
Basic examples
Suppose that the entire population of interest is eight students in a particular class.
For a finite set of numbers, the population standard deviation is found by taking
the square root of the average of the squared deviations of the values subtracted from
their average value.
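The worked numbers of the original example are not reproduced in this copy; the following Python sketch uses eight illustrative marks to show the computation just described:

    import statistics

    # Illustrative marks for the eight students (hypothetical values)
    marks = [2, 4, 4, 4, 5, 5, 7, 9]

    mu = statistics.mean(marks)                 # population mean = 5
    sq_devs = [(x - mu) ** 2 for x in marks]    # squared deviations from the mean
    sigma = (sum(sq_devs) / len(marks)) ** 0.5  # square root of their average = 2.0

    print(sigma, statistics.pstdev(marks))      # pstdev gives the same population SD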
Estimation
One can find the standard deviation of an entire population in cases (such
as standardized testing) where every member of a population is sampled. In cases
where that cannot be done, the standard deviation σ is estimated by examining a
random sample taken from the population and computing a statistic of the sample,
which is used as an estimate of the population standard deviation. Such a statistic is
called an estimator, and the estimator (or the value of the estimator, namely the
estimate) is called a sample standard deviation, and is denoted by s (possibly with
modifiers).
Unlike in the case of estimating the population mean, for which the sample mean is a
simple estimator with many desirable properties (unbiased, efficient, maximum
likelihood), there is no single estimator for the standard deviation with all these
properties, and unbiased estimation of standard deviation is a very technically
involved problem. Most often, the standard deviation is estimated using the corrected
sample standard deviation (using N − 1), defined below, and this is often referred to
as the "sample standard deviation", without qualifiers. However, other estimators are
better in other respects: the uncorrected estimator (using N) yields lower mean
squared error, while using N − 1.5 (for the normal distribution) almost completely
eliminates bias.
The formula for the population standard deviation (of a finite population) can be
applied to the sample, using the size of the sample as the size of the population
(though the actual population size from which the sample is drawn may be much
larger). This estimator, denoted by sN, is known as the uncorrected sample standard
deviation, or sometimes the standard deviation of the sample (considered as the entire
population), and is defined as follows:
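The defining formula is not reproduced in this copy; in standard notation it is:

    s_N = sqrt( (1/N) × Σ (x_i − x̄)² ),  the sum running over i = 1, …, N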
where x_1, x_2, …, x_N are the observed values of the sample items and x̄ is the mean value of these observations, while the denominator N stands for the size of the sample: this is the square root of the sample variance, which is the average of the squared deviations about the sample mean.
If the biased sample variance (the second central moment of the sample, which is a downward-biased estimate of the population variance) is used to compute an estimate of the population's standard deviation, the result is the uncorrected s_N given above.
Here taking the square root introduces further downward bias, by Jensen's inequality,
due to the square root's being a concave function. The bias in the variance is easily
corrected, but the bias from the square root is more difficult to correct, and depends
on the distribution in question.
The variance estimator obtained by applying Bessel's correction (using N − 1 instead of N) is unbiased if the variance exists and the sample values are drawn independently with replacement; N − 1 corresponds to the number of degrees of freedom in the vector of deviations from the mean. Taking the square root reintroduces bias (because the square root is a nonlinear function, which does not commute with the expectation), yielding the corrected sample standard deviation, denoted by s:
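The formula for s, not reproduced in this copy, is in standard notation:

    s = sqrt( (1/(N − 1)) × Σ (x_i − x̄)² ),  the sum running over i = 1, …, N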
For unbiased estimation of standard deviation, there is no formula that works across all distributions, unlike for the mean and variance. Instead, s is used as a basis and is scaled by a correction factor to produce an unbiased estimate. For the normal distribution, this correction factor is usually denoted c4(N), and an unbiased estimator is s/c4. This arises because the sampling distribution of the sample standard deviation follows a (scaled) chi distribution, and the correction factor is the mean of the chi distribution. An approximation can be obtained by replacing N − 1 with N − 1.5; the error in this approximation decays quadratically (as 1/N²), and it is suited for all but the smallest samples or highest precision: for N = 3 the bias is equal to 1.3%, and for N = 9 the bias is already less than 0.1%.
To show how a larger sample will make the confidence interval narrower, consider the
following examples: A small population of N = 2 has only 1 degree of freedom for
estimating the standard deviation. The result is that a 95% CI of the SD runs from
0.45 × SD to 31.9 × SD.
These same formulae can be used to obtain confidence intervals on the variance of
residuals from a least squares fit under standard normal theory, where k is now the
number of degrees of freedom for error.
For a set of N > 4 data spanning a range of values R, an upper bound on the standard deviation s is given by s = 0.6R. An estimate of the standard deviation for N > 100 data taken to be approximately normal follows from the heuristic that about 95% of such data lie within two standard deviations of the mean on each side, so that s ≈ R/4.
The standard deviation is invariant under changes in location, and scales directly with
the scale of the random variable. Thus, for a constant c and random variables X and Y:
The standard deviation of the sum of two random variables can be related to their
individual standard deviations and the covariance between them.
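The standard relations referred to here (not reproduced in this copy) are:

    sd(X + c) = sd(X),    sd(cX) = |c| × sd(X)
    sd(X + Y) = sqrt( var(X) + var(Y) + 2 cov(X, Y) )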
A large standard deviation indicates that the data points can spread far from the mean
and a small standard deviation indicates that they are clustered closely around the
mean.
For example, each of the three populations {0, 0, 14, 14}, {0, 6, 8, 14} and {6, 6, 8,
8} has a mean of 7. Their standard deviations are 7, 5, and 1, respectively. The third
population has a much smaller standard deviation than the other two because its
values are all close to 7. These standard deviations have the same units as the data
points themselves. If, for instance, the data set {0, 6, 8, 14} represents the ages of a
population of four siblings in years, the standard deviation is 5 years. As another
example, the population {1000, 1006, 1008, 1014} may represent the distances
traveled by four athletes, measured in meters. It has a mean of 1007 meters, and a
standard deviation of 5 meters.
While the standard deviation does measure how far typical values tend to be from the
mean, other measures are available. An example is the mean absolute deviation,
which might be considered a more direct measure of average distance, compared to
the root mean square distance inherent in the standard deviation.
Application examples
Standard deviation is often used to compare real-world data against a model to test
the model. For example, in industrial applications the weight of products coming off a
production line may need to comply with a legally required value. By weighing some
fraction of the products an average weight can be found, which will always be slightly
different from the long-term average. By using standard deviations, a minimum and
maximum value can be calculated that the averaged weight will be within some very
high percentage of the time (99.9% or more). If it falls outside the range then the
production process may need to be corrected. Statistical tests such as these are particularly important when the testing is relatively expensive, for example if the product needs to be opened and drained and weighed, or if the product was otherwise used up by the test.
Weather
As a simple example, consider the average daily maximum temperatures for two
cities, one inland and one on the coast. It is helpful to understand that the range of
daily maximum temperatures for cities near the coast is smaller than for cities inland.
Thus, while these two cities may each have the same average maximum temperature,
the standard deviation of the daily maximum temperature for the coastal city will be
less than that of the inland city as, on any particular day, the actual maximum
temperature is more likely to be farther from the average maximum temperature for
the inland city than for the coastal one.
Finance
In finance, standard deviation is often used as a measure of the risk associated with
price-fluctuations of a given asset (stocks, bonds, property, etc.), or the risk of a
portfolio of assets (actively managed mutual funds, index mutual funds, or ETFs).
Risk is an important factor in determining how to efficiently manage a portfolio of
investments because it determines the variation in returns on the asset and/or portfolio
and gives investors a mathematical basis for investment decisions (known as mean-
variance optimization). The fundamental concept of risk is that as it increases, the
expected return on an investment should increase as well, an increase known as the
risk premium. In other words, investors should expect a higher return on an
investment when that investment carries a higher level of risk or uncertainty. When
evaluating investments, investors should estimate both the expected return and the
uncertainty of future returns. Standard deviation provides a quantified estimate of the
uncertainty of future returns.
For example, assume an investor had to choose between two stocks. Stock A over the
past 20 years had an average return of 10 percent, with a standard deviation of
20 percentage points (pp) and Stock B, over the same period, had average returns of
12 percent but a higher standard deviation of 30 pp. On the basis of risk and return, an
investor may decide that Stock A is the safer choice, because Stock B's additional two
percentage points of return is not worth the additional 10 pp standard deviation
(greater risk or uncertainty of the expected return). Stock B is likely to fall short of
the initial investment (but also to exceed the initial investment) more often than Stock
A under the same circumstances, and is estimated to return only two percent more on
average. In this example, Stock A is expected to earn about 10 percent, plus or minus
20 pp (a range of 30 percent to −10 percent), in about two-thirds of the future year
returns. When considering more extreme possible returns or outcomes in the future, an
investor should expect results of as much as 10 percent plus or minus 60 pp (a range
of 70 percent to −50 percent), which covers outcomes within three standard deviations
of the average return and so includes about 99.7 percent of probable returns.
Calculating the average (or arithmetic mean) of the return of a security over a given
period will generate the expected return of the asset. For each period, subtracting the
expected return from the actual return results in the difference from the mean.
Squaring the difference in each period and taking the average gives the overall
variance of the return of the asset. The larger the variance, the greater risk the security
carries. Finding the square root of this variance will give the standard deviation of the
investment tool in question.
Financial time series are known to be non-stationary series, whereas the statistical
calculations above, such as standard deviation, apply only to stationary series. To
apply the above statistical tools to non-stationary series, the series first must be
transformed to a stationary series, enabling use of statistical tools that now have a
valid basis from which to work.
Geometric interpretation
To gain some geometric insights and clarification, we will start with a population of
three values, x1, x2, x3. This defines a point P = (x1, x2, x3) in R3. Consider the line L =
{(r, r, r) : r ∈ R}. This is the "main diagonal" going through the origin. If our three
given values were all equal, then the standard deviation would be zero and P would
lie on L. So it is not unreasonable to assume that the standard deviation is related to
the distance of P to L.
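A quick numerical check of this idea in Python (the three values are arbitrary illustration values): the distance from P to its nearest point on L equals √3 times the population standard deviation of the three values.

import math

x = [2.0, 4.0, 9.0]                      # an arbitrary population of three values
n = len(x)
mean = sum(x) / n

# the nearest point to P on the main diagonal L is (mean, mean, mean)
distance = math.sqrt(sum((xi - mean) ** 2 for xi in x))

# population standard deviation of the three values
sigma = math.sqrt(sum((xi - mean) ** 2 for xi in x) / n)

print(distance, sigma * math.sqrt(n))    # the two numbers agree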
Chebyshev's inequality
An observation is rarely more than a few standard deviations away from the mean.
Chebyshev's inequality ensures that, for all distributions for which the standard
deviation is defined, the proportion of the data within k standard deviations of the
mean is at least 1 − 1/k² (for any k > 1).
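A short Python sketch tabulating the Chebyshev bound:

# Chebyshev's inequality: at least 1 - 1/k^2 of any distribution lies
# within k standard deviations of the mean (for k > 1).
for k in [2, 3, 4, 5, 10]:
    bound = 1 - 1 / k ** 2
    print(f"within {k} standard deviations: at least {bound:.1%}")
# within 2 standard deviations: at least 75.0%
# within 3 standard deviations: at least 88.9%, and so on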
The central limit theorem states that the distribution of an average of many
independent, identically distributed random variables tends toward the famous
bell-shaped normal distribution, whose probability density function is
f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²)).
If a data distribution is approximately normal then about 68 percent of the data values
are within one standard deviation of the mean (mathematically, μ ± σ, where μ is the
arithmetic mean), about 95 percent are within two standard deviations (μ ± 2σ), and
about 99.7 percent lie within three standard deviations (μ ± 3σ). This is known as
the 68–95–99.7 rule, or the empirical rule.
For various values of z, the percentage of values expected to lie inside and outside
the symmetric interval (μ − zσ, μ + zσ) can be computed from the normal distribution.
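These percentages follow from the cumulative distribution function; a minimal sketch using the error function from Python's math module:

import math

def fraction_within(z):
    # fraction of a normal distribution within z standard deviations of the mean
    return math.erf(z / math.sqrt(2))

for z in [1, 2, 3]:
    inside = fraction_within(z)
    print(f"mu +/- {z} sigma: {inside:.2%} inside, {1 - inside:.2%} outside")
# about 68.27%, 95.45% and 99.73% inside, matching the empirical rule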
The mean and the standard deviation of a set of data are descriptive statistics usually
reported together. In a certain sense, the standard deviation is a "natural" measure
of statistical dispersion if the center of the data is measured about the mean. This is
because the standard deviation from the mean is smaller than from any other point.
The precise statement is the following: suppose x1, ..., xn are real numbers and define
the function σ(r) = sqrt((1/n) Σ (xi − r)²). Using calculus or by completing the
square, it is possible to show that σ(r) has a unique minimum at the mean,
r = x̄ = (x1 + ... + xn)/n.
Often, we want some information about the precision of the mean we obtained. We
can obtain this by determining the standard deviation of the sampled mean. Assuming
statistical independence of the values in the sample, the standard deviation of the
mean is related to the standard deviation of the distribution by σ_mean = σ / √N,
where N is the number of observations in the sample.
The following formulas can represent a running (repeatedly updated) standard
deviation. Two power sums s1 and s2 are computed over a set of N values of x,
denoted as x1, ..., xN: sj = Σ xk^j (summed over k = 1, ..., N, for j = 1, 2). Given
these power sums, the population standard deviation at any point is
σ = sqrt(N·s2 − s1²) / N, where N, as mentioned above, is the size of the set of
values (and can also be regarded as s0).
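A minimal running-standard-deviation sketch in Python based on these power sums:

import math

def make_running_sd():
    # keep the power sums s0 (count), s1 (sum) and s2 (sum of squares)
    s0 = s1 = s2 = 0.0
    def update(x):
        nonlocal s0, s1, s2
        s0 += 1
        s1 += x
        s2 += x * x
        return math.sqrt(s0 * s2 - s1 * s1) / s0   # current population SD
    return update

update = make_running_sd()
for value in [0, 6, 8, 14]:
    sd = update(value)
print(sd)   # 5.0, matching the sibling example above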
Weighted calculation
When the values xi are weighted with unequal weights wi, the power sums s0, s1, s2 are
each computed as sj = Σ wi·xi^j, and the standard deviation equations remain
unchanged. s0 is now the sum of the weights rather than the number of samples N.
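A short weighted version following the same power-sum formulas (equal weights reproduce the unweighted result):

import math

def weighted_sd(values, weights):
    # population-style weighted standard deviation via power sums
    s0 = sum(weights)                                    # sum of the weights
    s1 = sum(w * x for x, w in zip(values, weights))
    s2 = sum(w * x * x for x, w in zip(values, weights))
    return math.sqrt(s0 * s2 - s1 * s1) / s0

print(weighted_sd([0, 6, 8, 14], [1, 1, 1, 1]))          # 5.0, as before
print(weighted_sd([0, 6, 8, 14], [1, 2, 2, 1]))          # heavier weight on the middle values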
The incremental method with reduced rounding errors can also be applied, with some
additional complexity.
A running sum of weights must be computed for each k from 1 to n,
Wk = w1 + w2 + ... + wk, and places where 1/n is used above must be replaced by wi/Wn.
SELF-TEST 01
1) Find the variance of the observation values taken in the lab.
a) 0.27
b) 0.28
c) 0.3
d) 0.31
a) 0.144
b) 0.00144
c) 0.000144
d) 0.0000144
LEARNING OBJECTIVES
After successful completion of this unit, you will be able to learn
Probability
Conditional probability
Probability Distribution
Probability
Sample space
In probability theory, the sample space (also called the sample description space or
possibility space) of an experiment or random trial is the set of all possible
outcomes or results of that experiment. A sample space is usually denoted using set
notation, and the possible ordered outcomes, or sample points, are listed as elements
in the set.
For many experiments, there may be more than one plausible sample space available,
depending on what result is of interest to the experimenter. For example, when
drawing a card from a standard deck of fifty-two playing cards, one possibility for the
sample space could be the various ranks (Ace through King), while another could be
the suits (clubs, diamonds, hearts, or spades). A more complete description of
outcomes, however, could specify both the denomination and the suit, and a sample
space describing each individual card can be constructed as the Cartesian product of
the two sample spaces noted above (this space would contain fifty-two equally likely
outcomes). Still other sample spaces are possible, such as right-side up or upside
down, if some cards have been flipped when shuffling.
Equally likely outcomes are, like the name suggests, events with an equal chance of
happening. Many events have equally likely outcomes, like tossing a coin (50%
probability of heads; 50% probability of tails) or a die (1/6 probability of getting any
number on the die).
In real life though, it‘s highly unusual to get equally likely outcomes for events. For
example, the probability of finding a golden ticket in a chocolate bar might be 5%,
but this doesn‘t contradict the idea of equally likely outcomes. Let‘s say there are 100
chocolate bars and five of them have a golden ticket, which gives us our 5%
probability. Each of those golden tickets represents one chance to win, and there are
five chances to win, each of which are equally likely outcomes. Other examples:
Flip a fair coin 10 times to see how many heads or tails you get. Each event (getting
heads or getting tails) is equally likely.
Roll a die 3 times and note the sequence of numbers. Each sequence of numbers
(123,234,456,…) is equally likely.
For any sample space with N equally likely outcomes, we assign the probability
1/N to each outcome.
(Ref. https://www.statisticshowto.com/equally-likely-outcomes/)
Events in Probability
There are many different types of events in probability. Each type of event has its own
individual properties. This classification of events in probability helps to simplify
mathematical calculations. In this article, we will learn more about events in
probability, their types and see certain associated examples.
Events in probability are outcomes of random experiments. Any subset of the sample
space will form events in probability. The likelihood of occurrence of events in
probability can be calculated by dividing the number of favorable outcomes by the
total number of outcomes of that experiment.
There are several different types of events in probability. There can be only one
sample space for a random experiment; however, there can be many different types of
events. Some of the important events in probability are listed below.
A sure event is one that will always happen. The probability of occurrence of a sure
event will always be 1. For example, the earth revolving around the sun is a sure
event.
If an event consists of a single point or a single result from the sample space, it is
termed a simple event. The event of getting less than 2 on rolling a fair die, denoted
as E = {1}, is an example of a simple event.
If an event consists of more than a single result from the sample space, it is called a
compound event. An example of a compound event in probability is rolling a fair die
and getting an odd number. E = {1, 3, 5}.
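Using the counting rule above (favourable outcomes divided by total outcomes), the probability of this compound event can be computed directly; a tiny Python sketch:

from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}     # outcomes of rolling a fair die
event = {1, 3, 5}                     # compound event: an odd number

probability = Fraction(len(event), len(sample_space))
print(probability)                    # 1/2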
Complementary Events
When there are two events such that one event can occur if and only if the other does
not take place, then such events are known as complementary events in probability.
The sum of the probability of complementary events will always be equal to 1. For
example, on tossing a coin let E be defined as getting a head. Then the complement of
E is E' which will be the event of getting a tail. Thus, E and E' together make up
complementary events. Such events are mutually exclusive and exhaustive.
Exhaustive Events
Exhaustive events in probability are those events which, when taken together, form the
sample space of a random experiment. In other words, a set of events out of which at
least one is sure to occur when the experiment is performed is a set of exhaustive events. For
example, the outcome of an exam is either passing or failing.
Equally likely events in probability are those events in which the outcomes are
equally possible. For example, on tossing a coin, getting a head or getting a tail, are
equally likely events.
Source- https://www.cuemath.com/data/events-in-probability/
Algebra
Algebra (from Arabic al-jabr (الجبر), 'reunion of broken parts, bonesetting') is one of
the broad areas of mathematics. Roughly speaking, algebra is the study
of mathematical symbols and the rules for manipulating these symbols in formulas; it
is a unifying thread of almost all of mathematics.
Elementary algebra deals with the manipulation of variables as if they were numbers
(see the image), and is therefore essential in all applications of mathematics. Abstract
algebra is the name given in education to the study of algebraic structures such
as groups, rings, and fields. Linear algebra, which deals with linear
equations and linear mappings, is used for modern presentations of geometry, and has
many practical applications (in weather forecasting, for example). There are many
areas of mathematics that belong to algebra, some having "algebra" in their name,
such as commutative algebra and some not, such as Galois theory.
The word algebra is not only used for naming an area of mathematics and some
subareas; it is also used for naming some sorts of algebraic structures, such as
an algebra over a field, commonly called an algebra. Sometimes, the same phrase is
used for a subarea and its main algebraic structures; for example, Boolean algebra and
a Boolean algebra. A mathematician specialized in algebra is called an algebraist.
The word algebra comes from the Arabic الجبر (romanized: al-jabr, lit. 'reunion of
broken parts, bonesetting'), from the title of the early 9th-century book ʿIlm al-jabr wa
l-muqābala, "The Science of Restoring and Balancing", by the Persian mathematician
and astronomer al-Khwarizmi. In his work, the term al-jabr referred to the operation
of moving a term from one side of an equation to the other, while al-muqābala
(المقابلة) "balancing" referred to adding equal terms to both sides. Shortened to
just algeber or algebra in Latin, the word eventually entered the English language
during the 15th century, from either Spanish, Italian, or Medieval Latin. It originally
referred to the surgical procedure of setting broken or dislocated bones. The
mathematical meaning was first recorded (in English) in the 16th century.
The word "algebra" has several related meanings in mathematics, as a single word or
with qualifiers.
Historically, and in current teaching, the study of algebra starts with the solving of
equations, such as the quadratic equation above. Then more general questions, such as
"does an equation have a solution?", "how many solutions does an equation have?",
"what can be said about the nature of the solutions?" are considered. These questions
led to extending algebra to non-numerical objects, such
as permutations, vectors, matrices, and polynomials. The structural properties of these
non-numerical objects were then formalized into algebraic structures such
as groups, rings, and fields.
Before the 16th century, mathematics was divided into only two
subfields, arithmetic and geometry. Even though some methods, which had been
developed much earlier, may be considered nowadays as algebra, the emergence of
algebra and, soon thereafter, of infinitesimal calculus as subfields of mathematics
only dates from the 16th or 17th century. From the second half of the 19th century on,
many new fields of mathematics appeared, most of which made use of both arithmetic
and geometry, and almost all of which used algebra.
Today, algebra has grown considerably and includes many branches of mathematics,
as can be seen in the Mathematics Subject Classification where none of the first level
areas (two digit entries) are called algebra. Today algebra includes section 08-General
algebraic systems, 12-Field theory and polynomials, 13-Commutative algebra, 15-
Linear and multilinear algebra; matrix theory, 16-Associative rings and algebras, 17-
Nonassociative rings and algebras, 18-Category theory; homological algebra, 19-K-
theory and 20-Group theory. Algebra is also used extensively in 11-Number
theory and 14-Algebraic geometry.
History
Abstract algebra was developed in the 19th century, deriving from the interest in
solving equations, initially focusing on what is now called Galois theory, and
on constructibility issues. George Peacock was the founder of axiomatic thinking in
arithmetic and algebra. Augustus De Morgan discovered relation algebra in
his Syllabus of a Proposed System of Logic. Josiah Willard Gibbs developed an
algebra of vectors in three-dimensional space, and Arthur Cayley developed an
algebra of matrices (this is a noncommutative algebra).
Probability
Addition
Addition has several important properties. It is commutative, meaning that order does
not matter, and it is associative, meaning that when one adds more than two numbers,
the order in which addition is performed does not matter (see Summation). Repeated
addition of 1 is the same as counting. Addition of 0 does not change a number.
Addition also obeys predictable rules concerning related operations such as
subtraction and multiplication. Performing addition is one of the simplest numerical
tasks. Addition of very small numbers is accessible to toddlers; the most basic task, 1
+ 1, can be performed by infants as young as five months, and even some members of
other animal species. In primary education, students are taught to add numbers in
the decimal system, starting with single digits and progressively tackling more
difficult problems. Mechanical aids range from the ancient abacus to the
modern computer, where research on the most efficient implementations of addition
continues to this day.
PROPERTIES
Commutativity
Addition is commutative, meaning that one can change the order of the terms in a
sum, but still get the same result. Symbolically, if a and b are any two numbers, then
a + b = b + a.
The fact that addition is commutative is known as the "commutative law of addition"
or "commutative property of addition". Some other binary operations are
commutative, such as multiplication, but many others are not, such as subtraction and
division.
Associativity
Addition is associative, which means that when three or more numbers are added
together, the order of operations does not change the result.
Identity element
Adding zero to any number, does not change the number; this means that zero is
the identity element for addition, and is also known as the additive identity. In
symbols, for every a, one has
a + 0 = 0 + a = a.
Within the context of integers, addition of one also plays a special role: for any
integer a, the integer (a + 1) is the least integer greater than a, also known as
the successor of a. For instance, 3 is the successor of 2 and 7 is the successor of 6.
Because of this succession, the value of a + b can also be seen as the bth successor
of a, making addition iterated succession. For example, 6 + 2 is 8, because 8 is the
successor of 7, which is the successor of 6, making 8 the 2nd successor of 6.
Units
To numerically add physical quantities with units, they must be expressed with
common units. For example, adding 50 milliliters to 150 milliliters gives
200 milliliters. However, if a measure of 5 feet is extended by 2 inches, the sum is
62 inches, since 60 inches is synonymous with 5 feet. On the other hand, it is usually
meaningless to try to add 3 meters and 4 square meters, since those units are
incomparable; this sort of consideration is fundamental in dimensional analysis.
PERFORMING ADDITION
Innate ability
Studies on mathematical development starting around the 1980s have exploited the
phenomenon of habituation: infants look longer at situations that are unexpected. A
seminal experiment by Karen Wynn in 1992 involving Mickey Mouse dolls
manipulated behind a screen demonstrated that five-month-old infants expect 1 + 1 to
be 2, and they are comparatively surprised when a physical situation seems to imply
that 1 + 1 is either 1 or 3. This finding has since been affirmed by a variety of
laboratories using different methodologies. Another 1992 experiment with
older toddlers, between 18 and 35 months, exploited their development of motor
control by allowing them to retrieve ping-pong balls from a box; the youngest
responded well for small numbers, while older subjects were able to compute sums up
to 5.
Even some nonhuman animals show a limited ability to add, particularly primates. In
a 1995 experiment imitating Wynn's 1992 result (but using eggplants instead of
dolls), rhesus macaque and cottontop tamarin monkeys performed similarly to
human infants. More dramatically, after being taught the meanings of the Arabic
numerals 0 through 4, one chimpanzee was able to compute the sum of two numerals
without further training.
Childhood learning
Typically, children first master counting. When given a problem that requires that
two items and three items be combined, young children model the situation with
physical objects, often fingers or a drawing, and then count the total. As they gain
experience, they learn or discover the strategy of "counting-on": asked to find two
plus three, children count three past two, saying "three, four, five" (usually ticking off
fingers), and arriving at five. This strategy seems almost universal; children can easily
pick it up from peers or teachers. Most discover it independently. With additional
experience, children learn to add more quickly by exploiting the commutativity of
addition by counting up from the larger number, in this case, starting with three and
counting "four, five." Eventually children begin to recall certain addition facts
("number bonds"), either through experience or rote memorization. Once some facts
are committed to memory, children begin to derive unknown facts from known ones.
For example, a child asked to add six and seven may know that 6 + 6 = 12 and then
reason that 6 + 7 is one more, or 13. Such derived facts can be found very quickly and
most elementary school students eventually rely on a mixture of memorized and
derived facts to add fluently.
Different nations introduce whole numbers and arithmetic at different ages, with
many countries teaching addition in pre-school. However, throughout the world,
addition is taught by the end of the first year of elementary school.
Decimal system
The prerequisite to addition in the decimal system is the fluent recall or derivation of
the 100 single-digit "addition facts". One could memorize all the facts by rote, but
pattern-based strategies are more enlightening and, for most people, more efficient:
One or two more: Adding 1 or 2 is a basic task, and it can be accomplished through
counting on or, ultimately, intuition.
Zero: Since zero is the additive identity, adding zero is trivial. Nonetheless, in the
teaching of arithmetic, some students are introduced to addition as a process that
always increases the addends; word problems may help rationalize the "exception" of
zero.
Near-doubles: Sums such as 6 + 7 = 13 can be quickly derived from the doubles fact 6
+ 6 = 12 by adding one more, or from 7 + 7 = 14 but subtracting one.
Five and ten: Sums of the form 5 + x and 10 + x are usually memorized early and can
be used for deriving other facts. For example, 6 + 7 = 13 can be derived from 5 + 7 =
12 by adding one more.
As students grow older, they commit more facts to memory, and learn to derive other
facts rapidly and fluently. Many students never commit all the facts to memory, but
can still find any basic fact quickly.
Carry
The standard algorithm for adding multidigit numbers is to align the addends
vertically and add the columns, starting from the ones column on the right. If a
column exceeds nine, the extra digit is "carried" into the next column.
For example, 7 + 9 = 16, and the digit 1 is the carry. An alternate strategy starts adding from the
most significant digit on the left; this route makes carrying a little clumsier, but it is
faster at getting a rough estimate of the sum. There are many alternative methods.
Since the end of the 20th century, some US programs, including TERC, decided to
remove the traditional carrying method from their curriculum. This decision was
criticized, which is why some states and counties did not support the experiment.
Decimal fractions
Decimal fractions can be added by a simple modification of the above process. One
aligns two decimal fractions above each other, with the decimal point in the same
location. If necessary, one can add trailing zeros to a shorter decimal to make it the
same length as the longer decimal. Finally, one performs the same addition process as
above, and the decimal point is placed in the answer exactly where it was placed in
the summands.
Non-decimal
Binary addition
Addition in other bases is very similar to decimal addition. As an example, one can
consider addition in binary. Adding two single-digit binary numbers is relatively
simple, using a form of carrying:
0+0→0
0+1→1
1+0→1
1 + 1 → 0, carry 1 (since 1 + 1 = 2 = 0 + (1 × 2^1))
Adding two "1" digits produces a digit "0", while 1 must be added to the next column.
This is similar to what happens in decimal when certain single-digit numbers are
added together; if the result equals or exceeds the value of the radix (10), the digit to
the left is incremented:
5 + 5 → 0, carry 1 (since 5 + 5 = 10 = 0 + (1 × 10^1))
7 + 9 → 6, carry 1 (since 7 + 9 = 16 = 6 + (1 × 10^1))
This is known as carrying. When the result of an addition exceeds the value of a
digit, the procedure is to "carry" the excess amount divided by the radix (that is,
10/10) to the left, adding it to the next positional value. This is correct since the next
position has a weight that is higher by a factor equal to the radix.
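A minimal Python sketch of column-by-column addition with carrying, written for an arbitrary radix:

def add_digits(a, b, base=10):
    # a and b are digit lists, most significant digit first
    a, b = a[::-1], b[::-1]                      # work from the ones column
    result, carry = [], 0
    for i in range(max(len(a), len(b))):
        total = carry + (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)
        result.append(total % base)              # digit that stays in this column
        carry = total // base                    # excess carried to the next column
    if carry:
        result.append(carry)
    return result[::-1]

print(add_digits([5, 9], [2, 7]))                # [8, 6]   (59 + 27 = 86)
print(add_digits([1, 1], [1], base=2))           # [1, 0, 0] (binary 11 + 1 = 100)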
Computers
The abacus, also called a counting frame, is a calculating tool that was in use
centuries before the adoption of the written modern numeral system and is still widely
used by merchants, traders and clerks in Asia, Africa, and elsewhere; it dates back to
at least 2700–2300 BC, when it was used in Sumer.
Blaise Pascal invented the mechanical calculator in 1642; it was the first
operational adding machine. It made use of a gravity-assisted carry mechanism. It
was the only operational mechanical calculator in the 17th century and the earliest
automatic, digital computer. Pascal's calculator was limited by its carry mechanism,
which forced its wheels to only turn one way so it could add. To subtract, the operator
had to use the Pascal's calculator's complement, which required as many steps as an
addition. Giovanni Poleni followed Pascal, building the second functional
mechanical calculator in 1709, a calculating clock made of wood that, once setup,
could multiply two numbers automatically.
Adders execute integer addition in electronic digital computers, usually using binary
arithmetic. The simplest architecture is the ripple carry adder, which follows the
standard multi-digit algorithm. One slight improvement is the carry skip design,
again following human intuition; one does not perform all the carries in
computing 999 + 1, but one bypasses the group of 9s and skips to the answer.
In practice, computational addition may be achieved via XOR and AND bitwise logical
operations in conjunction with bitshift operations, as shown in the sketch below.
Both XOR and AND gates are straightforward to realize in digital logic, allowing the
realization of full adder circuits, which in turn may be combined into more complex
logical operations. In modern digital computers, integer addition is typically the
fastest arithmetic instruction, yet it has the largest impact on performance, since it
underlies all floating-point operations as well as such basic tasks
as address generation during memory access and fetching instructions during
branching. To increase speed, modern designs calculate digits in parallel;
carry-lookahead schemes are a common example.
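A minimal Python sketch of the XOR/AND/shift approach for non-negative integers:

def add(a, b):
    # addition using only XOR, AND and bit shifts
    while b != 0:
        carry = (a & b) << 1   # AND locates the carry bits; the shift moves them left
        a = a ^ b              # XOR adds each column while ignoring carries
        b = carry              # repeat until no carries remain
    return a

print(add(59, 27))             # 86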
Addition of numbers
To prove the usual properties of addition, one must first define addition for the
context in question. Addition is first defined on the natural numbers. In set theory,
addition is then extended to progressively larger sets that include the natural numbers:
the integers, the rational numbers, and the real numbers. (In mathematics
education, positive fractions are added before negative numbers are even considered;
this is also the historical route.)
Natural numbers
There are two popular ways to define the sum of two natural numbers a and b. If one
defines natural numbers to be the cardinalities of finite sets (the cardinality of a set
is the number of elements in the set), then it is appropriate to define their sum as
follows: choose disjoint sets A and B with a elements and b elements respectively;
then a + b is the number of elements in the union A ∪ B. The other popular definition
is recursive: a + 0 = a, and a + S(b) = S(a + b), where S(b) denotes the successor of
b, the next natural number after b.
Again, there are minor variations upon this definition in the literature. Taken literally,
the above definition is an application of the recursion theorem on the partially
ordered set N2. On the other hand, some sources prefer to use a restricted recursion
theorem that applies only to the set of natural numbers. One then considers a to be
temporarily "fixed", applies recursion on b to define a function "a +", and pastes these
unary operations for all a together to form the full binary operation.
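A short Python sketch of the recursive definition, using ordinary integers to stand in for the natural numbers:

def successor(n):
    return n + 1

def add(a, b):
    # a + 0 = a, and a + S(b) = S(a + b)
    if b == 0:
        return a
    return successor(add(a, b - 1))   # b - 1 plays the role of the predecessor of b

print(add(3, 4))   # 7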
Integers
For an integer n, let |n| be its absolute value. Let a and b be integers. If
either a or b is zero, treat it as an identity. If a and b are both positive,
define a + b = |a| + |b|. If a and b are both negative, define a + b = −(|a| + |b|).
If a and b have different signs, define a + b to be the difference between |a| and
|b|, with the sign of the term whose absolute value is larger. As an example, −6 +
4 = −2; because −6 and 4 have different signs, their absolute values are 6 and 4,
their difference is 2, and the larger absolute value belongs to the negative term −6,
so the result is negative.
Although this definition can be useful for concrete problems, the number of cases to
consider complicates proofs unnecessarily. So the following method is commonly
used for defining integers. It is based on the remark that every integer is the
difference of two natural numbers and that two such differences, a − b and c − d, are
equal if and only if a + d = b + c. So, one can define formally the integers as
the equivalence classes of ordered pairs of natural numbers under the equivalence
relation (a, b) ~ (c, d) if and only if a + d = b + c, with addition defined
componentwise: (a, b) + (c, d) = (a + c, b + d).
A straightforward computation shows that the equivalence class of the result depends
only on the equivalence classes of the summands, and thus that this defines an
addition of equivalence classes, that is integers. Another straightforward computation
shows that this addition is the same as the above case definition.
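A small Python sketch of this construction, representing each integer by a pair of natural numbers (a, b) standing for a − b:

def equivalent(p, q):
    # (a, b) and (c, d) represent the same integer iff a + d = b + c
    a, b = p
    c, d = q
    return a + d == b + c

def add_pairs(p, q):
    # addition of equivalence classes is componentwise addition of pairs
    a, b = p
    c, d = q
    return (a + c, b + d)

# (1, 7) represents -6 and (4, 0) represents 4; their sum represents -2
total = add_pairs((1, 7), (4, 0))
print(total, equivalent(total, (0, 2)))   # (5, 7) True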
This way of defining integers as equivalence classes of pairs of natural numbers, can
be used to embed into a group any commutative semigroup with cancellation
property. Here, the semigroup is formed by the natural numbers and the group is the
additive group of integers. The rational numbers are constructed similarly, by taking
as semigroup the nonzero integers with multiplication.
This construction has been also generalized under the name of Grothendieck
group to the case of any commutative semigroup. Without the cancellation property
the semigroup homomorphism from the semigroup into the group may be non-
injective. Originally, the Grothendieck group was, more specifically, the result of this
construction applied to the equivalence classes under isomorphisms of the objects of
an abelian category, with the direct sum as semigroup operation.
Complex numbers
Complex numbers are added by adding the real and imaginary parts of the
summands.
Using the visualization of complex numbers in the complex plane, the addition
has the following geometric interpretation: the sum of two complex
numbers A and B, interpreted as points of the complex plane, is the
point X obtained by building a parallelogram three of whose vertices
are O, A and B. Equivalently, X is the point such that the triangles with
vertices O, A, B, and X, B, A, are congruent.
Generalizations
There are many binary operations that can be viewed as generalizations of the
addition operation on the real numbers. The field of abstract algebra is centrally
concerned with such generalized operations, and they also appear in set
theory and category theory.
Abstract algebra
In linear algebra, a vector space is an algebraic structure that allows for adding any
two vectors and for scaling vectors. A familiar vector space is the set of all ordered
pairs of real numbers; the ordered pair (a,b) is interpreted as a vector from the origin
in the Euclidean plane to the point (a,b) in the plane. The sum of two vectors is
obtained by adding their individual coordinates: (a, b) + (c, d) = (a + c, b + d).
For example, (3, 5) + (1, 2) = (4, 7).
Modular arithmetic
General theory
Related operations
Addition, along with subtraction, multiplication and division, is considered one of the
basic operations and is used in elementary arithmetic.
Arithmetic
Given a set with an addition operation, one cannot always define a corresponding
subtraction operation on that set; the set of natural numbers is a simple example. On
the other hand, a subtraction operation uniquely determines an addition operation, an
additive inverse operation, and an additive identity; for this reason, an additive
group can be described as a set that is closed under subtraction.
In the real and complex numbers, addition and multiplication can be interchanged by
the exponential function: e^(a + b) = e^a · e^b.
Ordering
The maximum operation "max (a, b)" is a binary operation similar to addition. In fact,
if two nonnegative numbers a and b are of different orders of magnitude, then their
sum is approximately equal to their maximum. This approximation is extremely
useful in the applications of mathematics, for example in truncating Taylor series.
Conversely, addition can be recovered from the maximum through logarithms:
log_k(k^a + k^b) is approximately max(a, b), an approximation which becomes more
accurate as the base k of the logarithm increases. The approximation can be made
exact by extracting a constant h, named by analogy with Planck's constant from
quantum mechanics, and taking the "classical limit" as h tends to zero:
max(a, b) = lim (h → 0) of h·ln(e^(a/h) + e^(b/h)).
Summation describes the addition of arbitrarily many numbers, usually more than
just two. It includes the idea of the sum of a single number, which is itself, and
the empty sum, which is zero. An infinite summation is a delicate procedure known
as a series.
Linear combinations combine multiplication and summation; they are sums in which
each term has a multiplier, usually a real or complex number. Linear combinations
are especially useful in contexts where straightforward addition would violate some
normalization rule, such as mixing of strategies in game
theory or superposition of states in quantum mechanics.
Multiplication
Conditional probability
In probability theory, conditional probability is a measure of the probability of
an event occurring, given that another event (by assumption, presumption,
assertion or evidence) has already occurred. This particular method relies on
event B occurring with some sort of relationship with another event A. In this
situation, the event B can be analyzed by a conditional probability with respect to
A. If the event of interest is A and the event B is known or assumed to have
occurred, "the conditional probability of A given B", or "the probability
of A under the condition B", is usually written as P(A|B) or occasionally P_B(A).
This can also be understood as the fraction of the probability of B that intersects
with A: P(A|B) = P(A ∩ B) / P(B).
For example, the probability that any given person has a cough on any given day may
be only 5%. But if we know or assume that the person is sick, then they are much
more likely to be coughing. For example, the conditional probability that someone
unwell (sick) is coughing might be 75%, in which case we would have
that P(Cough) = 5% and P(Cough|Sick) = 75%. Although there is a relationship
between A and B in this example, such a relationship or dependence
between A and B is not necessary, nor do they have to occur simultaneously.
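A minimal sketch of the defining formula P(A|B) = P(A ∩ B) / P(B); the joint and marginal probabilities below are hypothetical illustration values chosen to reproduce the 75% figure:

p_sick = 0.04                   # P(B): assumed probability of being sick
p_cough_and_sick = 0.03         # P(A and B): assumed joint probability

p_cough_given_sick = p_cough_and_sick / p_sick
print(p_cough_given_sick)       # 0.75, i.e. P(Cough | Sick) = 75%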
SELF-TEST 02
1) An event in the probability that will never be happened is called as
a) Unsure event
b) Sure event
c) Possible event
d) Impossible event
a) 1/2
b) 2
c) 4/2
d) 5/2
Statistics is the collection of data and information for a particular purpose, and
the measure of central tendency is a value around which the data is centred. It is
used to represent a set of data by a representative value which would
approximately define the entire collection. The measures of central tendency are
given by various parameters, but the most commonly used ones are mean, median
and mode. Mean is the most commonly used measure of central tendency, and is
equal to the sum of all the values in the collection of data divided by the total
number of values. Median is the mid-value of the given set of data when
arranged in a particular order, and mode is the most frequent number occurring in
the data set.
The mean, median and mode of a data set are connected by the empirical relation
Mode ≈ 3 Median − 2 Mean; this relation is used to estimate one of the measures when
the other two are known. For instance, if the mean is 43 and the median is 52, the
mode is approximately 3(52) − 2(43) = 70. Range is the difference between the highest
and lowest data value in the set.
Deviation is a measure of difference between the observed value and some other
value, often that variable's mean. Averaged over an arbitrarily large number of
samples, the signed deviations of the observations from the (unobserved) population
parameter value average to zero. Statistics of the
distribution of deviations are used as measures of statistical dispersion, and the
standard deviation is a measure of the amount of variation or dispersion of a set
of values. It is abbreviated SD, and is most commonly represented in
mathematical texts and equations by the lower case Greek letter σ (sigma), for
the population standard deviation, or the Latin letter s, for the sample standard
deviation. The standard deviation of a random variable, sample, statistical
population, data set, or probability distribution is the square root of its variance.
In real life, the probability of finding a golden ticket in a chocolate bar might be
5%, but this doesn't contradict the idea of equally likely outcomes. To find the
probability of equally likely outcomes, we assign the probability 1/N to each of the
N possible outcomes.
Each type of event has its own individual properties, and its probability can be
calculated by dividing the number of favorable outcomes by the total number of outcomes of
that experiment. Examples of events in probability include getting an even
number on the die, tossing a coin, drawing two balls one after another from a bag
without replacement, and impossible and sure events. The important types of events
in probability include impossible events, sure events, simple and compound events,
complementary events, mutually exclusive events, and exhaustive events.
Mutually exclusive events are events that cannot occur at the same time; they do not
have any common outcomes. Complementary events are those events such that one event
can occur if and only if the other does not take place, while exhaustive events are
events which, when taken together, form the sample space of a random experiment.
Equally likely events in probability are those events in which the outcomes are
equally possible, such as on tossing a coin, getting a head or getting a tail.
Algebra (from Arabic al-jabr (الجبر), 'reunion of broken parts, bonesetting') is one
of the broad areas of mathematics. It is the study of mathematical symbols and
the rules for manipulating these symbols in formulas. Elementary algebra deals
with the manipulation of variables as if they were numbers, Abstract algebra is
the name given in education to the study of algebraic structures such as groups,
rings, and fields, and Linear algebra is used for modern presentations of
geometry. There are many areas of mathematics that belong to algebra, some
having "algebra" in their name, such as commutative algebra and some not, such
as Galois theory.
Algebra is not only used for naming an area of mathematics and some subareas,
but also for naming some algebraic structures, such as an algebra over a field.
Muḥammad ibn Mūsā al-Khwārizmī (c. 780-850) later wrote The Compendious
Book on Calculation by Completion and Balancing, which established algebra as
a mathematical discipline independent of geometry and arithmetic. Indian
mathematicians such as Brahmagupta continued the traditions of Egypt and
Babylon, giving the first complete arithmetic solution to quadratic equations,
written in words instead of symbols.
The Greek mathematician Diophantus has traditionally been known as the "father
of algebra" and the Persian mathematician al-Khwarizmi is regarded as "the
father of algebra". However, there is debate as to who is more entitled to be
known as the father of algebra due to the fact that Al-Jabr is slightly more
elementary than the algebra found in Arithmetica and that Al-Khwarizmi
introduced the methods of "reduction" and "balancing" and gave an exhaustive
explanation of solving quadratic equations. Omar Khayyam is credited with
identifying the foundations of algebraic geometry and found the general
geometric solution of the cubic equation. Sharaf al-Dīn al-Tūsī found algebraic
and numerical solutions to various cases of cubic equations. The Indian
mathematicians Mahavira and Bhaskara II and the Persian mathematician Al-Karaji
also solved various cases of polynomial equations using numerical methods.
In the 13th century, the solution of a cubic equation by Fibonacci marked the
beginning of a revival in European algebra. Abū al-Ḥasan ibn ʿAlī al-Qalaṣādī
took the first steps toward the introduction of algebraic symbolism, and François
Viète's work on new algebra at the close of the 16th century was an important
step towards modern algebra. René Descartes published La Géométrie in 1637,
and the general algebraic solution of the cubic and quartic equations was
developed in the mid-16th century. The idea of a determinant was developed by
Seki Kōwa in the 17th century, followed by Gottfried Leibniz ten years later.
Permutations were studied by Joseph-Louis Lagrange in his 1770 paper
"Réflexions sur la résolution algébrique des équations" and Paolo Ruffini was the
first person to develop the theory of permutation groups.
In the 19th century, abstract algebra was developed from the interest in solving
equations, and George Peacock was the founder of axiomatic thinking in
arithmetic and algebra. Probability is the branch of mathematics concerning
numerical descriptions of how likely an event is to occur, or how likely it is that
a proposition is true.
Addition is one of the four basic operations of arithmetic, the other three being
subtraction, multiplication and division. It is commutative and associative,
meaning that when one adds more than two numbers, the order in
which addition is performed does not matter. It obeys predictable rules
concerning related operations such as subtraction and multiplication, and is
accessible to toddlers as young as five months. In primary education, students
are taught to add numbers in the decimal form. Addition is commutative,
meaning that one can change the order of the terms in a sum, but still get the
same result.
It is also associative, which means that when three or more numbers are added
together, the order of operations does not change the result. In the standard order
of operations, addition is a lower priority than exponentiation, nth roots,
multiplication and division, but is given equal priority to subtra ction. Zero is the
identity element for addition, and was first identified in Brahmagupta's
Brahmasphutasiddhanta in 628 AD. Mahavira wrote in 830 that zero becomes the
same as what is added to it. Addition is the process of numerically adding
physical quantities with units, such as adding 50 milliliters to 150 milliliters
gives 200 milliliters, but it is usually meaningless to add 3 meters and 4 square
meters, as those units are incomparable.
The standard algorithm for adding multidigit numbers is to align the addends
vertically and add the columns, starting from the ones column on the right. An
alternate strategy starts adding from the most significant digit on the left. Decimal
fractions can be added by aligning two decimal fractions above each other, adding
trailing zeros to the shorter decimal, and placing the decimal point in the answer
exactly where it was placed in the summands. Adding two "1" digits
produces a digit "0", while 1 must be added to the next column. This is known as
carrying, and when the result of an addition exceeds the value of a digit, the
procedure is to "carry" the excess amount divided by the radix (that is, 10/10) to
the next positional value.
SUMMARY
https://en.wikipedia.org/wiki/Statistics
https://en.wikipedia.org/wiki/Sampling_(statistics)#Sampling_methods
https://en.wikipedia.org/wiki/Data_type#Classes_of_data_types
https://www.simplilearn.com/what-is-data-collection-article
https://en.wikipedia.org/wiki/Central_tendency
https://en.wikipedia.org/wiki/Mean
https://en.wikipedia.org/wiki/Mode_(statistics)
https://en.wikipedia.org/wiki/Median
https://en.wikipedia.org/wiki/Statistical_dispersion
https://en.wikipedia.org/wiki/Quartile_coefficient_of_dispersion
https://en.wikipedia.org/wiki/Deviation_(statistics)
https://www.statisticshowto.com/equally-likely-outcomes/
LEARNING OBJECTIVES
After successful completion of this unit, you will be able to learn
Fundamentals of Correlation
Types of correlation
Correlation
Types
Pearson
Intra-class
Rank
(Credit: www.embibe.com)
Correlation is classified into two types based on the direction of change of the
variables: positive correlation and negative correlation.
Positive Correlation:
When the variables change in the same direction, the correlation is said to be
positive. The sign of a positive correlation is +1.
Example: When income rises, so does consumption and when income falls,
consumption does too.
Negative Correlation:
When the variables move in opposite directions, the correlation is negative.
The sign of a negative correlation is –1.
Example: Height above sea level and temperature are an example of a
negative association. It gets colder as you climb the mountain (ascend in
elevation) (decrease in temperature).
Simple Correlation:
Type III: Based upon the constancy of the ratio of change between the variables
Correlation is classified into two types based on the consistency of the ratio of
change between the variables: linear correlation and non-linear correlation.
Linear Correlation:
The correlation is said to be linear when the change in one variable bears a
constant ratio to the change in the other.
Example: Y=a+bx
Non-Linear Correlation:
If the change in one variable does not have a constant ratio to the change in
the other variables, the correlation is non-linear.
Example: Y = a + bx²
(Credit: https://www.embibe.com/exams/methods-of-studying-correlation/)
2. When a correlation coefficient is −1, every positive increase in one variable is
accompanied by a decrease of a fixed proportion in the other. For example, the
decrease in the quantity of gas in a gas tank shows an (almost) perfect inverse
correlation with speed.
3. When a correlation coefficient is 0, an increase in one variable is associated with
no systematic increase or decrease in the other; the two variables are not related.
(Credit: https://byjus.com)
Intuitively, the Spearman correlation between two variables will be high when
observations have a similar (or identical for a correlation of 1) rank (i.e. relative
position label of the observations within the variable: 1st, 2nd, 3rd, etc.) between the
two variables, and low when observations have a dissimilar (or fully opposed for a
correlation of −1) rank between the two variables.
For a sample of size n, the n raw scores xi, yi are converted to ranks R(xi), R(yi),
and the Spearman coefficient rs is computed as
rs = ρ(R(x), R(y)) = cov(R(x), R(y)) / (σ(R(x)) · σ(R(y))),
where ρ denotes the usual Pearson correlation coefficient, but applied to the rank
variables, cov(R(x), R(y)) is the covariance of the rank variables, and σ(R(x)) and
σ(R(y)) are their standard deviations.
Only if all n ranks are distinct integers can it be computed using the popular
formula rs = 1 − 6Σdi² / (n(n² − 1)), where di = R(xi) − R(yi) is the difference
between the two ranks of each observation and n is the number of observations.
If ties are present in the data set, the simplified formula above yields incorrect
results: only if in both variables all ranks are distinct does
σ(R(x))·σ(R(y)) = Var(R(x)) = Var(R(y)) = (n² − 1)/12 hold (calculated according to
the biased variance). The first equation (normalizing by the standard deviations) may
be used even when ranks are normalized to [0, 1] ("relative ranks"), because it is
insensitive both to translation and to linear scaling.
The simplified method should also not be used in cases where the data set is
truncated; that is, when the Spearman's correlation coefficient is desired for the
top X records (whether by pre-change rank or post-change rank, or both), the
user should use the Pearson correlation coefficient formula given above.
RELATED QUANTITIES
There are several other numerical measures that quantify the extent of statistical
dependence between pairs of observations. The most common of these is
the Pearson product-moment correlation coefficient, which is a similar
correlation method to Spearman's rank, that measures the "linear" relationship
between the raw numbers rather than between their ranks.
Example:
In this example, the raw data in the table below is used to calculate the
correlation between the IQ of a person with the number of hours spent in front
of TV per week.
Firstly, evaluate di, the difference in ranks for each observation. To do so, use the
following steps.
1. Sort the data by the first column. Create a new column and assign it the
ranked values 1, 2, 3, ..., n.
2. Next, sort the data by the second column. Create a fourth column and
similarly assign it the ranked values 1, 2, 3, ..., n.
3. Create a fifth column to hold the differences between the two rank columns.
4. Create one final column to hold the value of the difference column squared (di²).
With the di² values found, add them to obtain Σdi² = 194. The value of n is 10. These
values can now be substituted back into the equation rs = 1 − 6Σdi² / (n(n² − 1)),
which evaluates to ρ = −29/165 = −0.175757575... with a p-value = 0.627188 (using
the t-distribution).
That the value is close to zero shows that the correlation between IQ and hours
spent watching TV is very low, although the negative value suggests that the
longer the time spent watching television the lower the IQ. In the case of ties in
the original values, this formula should not be used; instead, the Pearson
correlation coefficient should be calculated on the ranks .
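The calculation can be reproduced in Python; the data values below are illustrative stand-ins chosen to give the same result as the worked example, and scipy is assumed to be available:

from scipy.stats import spearmanr, rankdata

iq    = [106, 100, 86, 101, 99, 103, 97, 113, 112, 110]
hours = [7, 27, 2, 50, 28, 29, 20, 12, 6, 17]       # hours of TV per week

rho, p_value = spearmanr(iq, hours)
print(rho, p_value)                                 # about -0.1758 and 0.627

# the simplified formula, valid only because all ranks here are distinct
n = len(iq)
d = rankdata(iq) - rankdata(hours)
print(1 - 6 * (d ** 2).sum() / (n * (n ** 2 - 1)))  # agrees with spearmanr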
Determining Significance
Another approach parallels the use of the Fisher transformation in the case of the
Pearson product-moment correlation coefficient. That is, confidence
intervals and hypothesis tests relating to the population value ρ can be carried
out using the Fisher transformation F(r) = (1/2)·ln((1 + r)/(1 − r)) = artanh(r).
One can also test for significance using t = r·sqrt((n − 2)/(1 − r²)), which is
distributed approximately as Student's t-distribution with n − 2 degrees of freedom
under the null hypothesis. A justification for this result relies on a permutation
argument.
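A minimal sketch applying the t-statistic above to the value ρ = −29/165 from the worked example (scipy assumed available):

import math
from scipy.stats import t as t_dist

def spearman_t_test(r, n):
    # approximate two-sided p-value for Spearman's rho via Student's t
    t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))
    return t_stat, 2 * t_dist.sf(abs(t_stat), df=n - 2)

print(spearman_t_test(-29 / 165, 10))   # p-value close to the 0.63 quoted above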
SELF-TEST 01
1) Which of the following are types of correlation?
a) It is a bivariate analysis
b) It is a multivariate analysis
c) It is a univariate analysis
d) Both a and c
LEARNING OBJECTIVES
After successful completion of this unit, you will be able to learn
Regression
Properties of Regression
Regression is a statistical method used in finance, investing, and other disciplines that
attempts to determine the strength and character of the relationship between one
dependent variable (usually denoted by Y) and a series of other variables (known as
independent variables).
Regression helps investment and financial managers to value assets and understand the
relationships between variables, such as commodity prices and the stocks of businesses
dealing in those commodities.
Regression
Regression Explained
The two basic types of regression are simple linear regression and multiple linear
regression, although there are non-linear regression methods for more complicated data
and analysis. Simple linear regression uses one independent variable to explain or
predict the outcome of the dependent variable Y, while multiple linear regression uses
two or more independent variables to predict the outcome.
In simple linear regression the model takes the form Y = a + bX + u, where Y is the
dependent variable, X is the independent (explanatory) variable, a is the intercept,
b is the slope, and u is the regression residual (error term).
(Source: https://www.investopedia.com/terms/r/regression.asp)
Regression analysis is a statistical method that helps us to analyze and understand the
relationship between two or more variables of interest. The process that is adapted to
perform regression analysis helps to understand which factors are important, which
factors can be ignored, and how they are influencing each other.
For regression analysis to be a successful method, we need to understand how it is
applied across different domains:
Financial Industry- Understand the trend in the stock prices, forecast the
prices, evaluate risks in the insurance domain
Marketing- Understand the effectiveness of market campaigns, forecast
pricing and sales of the product.
Manufacturing- Evaluate the relationships among the variables that determine
engine design, in order to provide better performance
Medicine- Forecast how different combinations of medicines will perform, to help
prepare generic medicines for diseases.
Terminologies used in Regression Analysis
Outliers
Suppose there is an observation in the dataset that has a very high or very low value
as compared to the other observations in the data, i.e. it does not belong to the
population, such an observation is called an outlier. In simple words, it is an extreme
value. An outlier is a problem because many times it hampers the results we get.
Multicollinearity
When the independent variables are highly correlated with each other, the variables
are said to be multicollinear. Many types of regression techniques assume that
multicollinearity is not present in the dataset, because it makes it difficult to
estimate the separate effect of each independent variable.
TYPES OF REGRESSION
For different types of regression analysis, there are assumptions that need to be
considered, along with an understanding of the nature of the variables and their
distributions.
Linear Regression
The simplest of all regression types is Linear Regression, which tries to establish a
relationship between the independent and dependent variables. The dependent variable
considered here is always a continuous variable.
What is Linear Regression?
Linear Regression is a predictive model used for finding the linear relationship
between a dependent variable and one or more independent variables.
Here, 'Y' is our dependent variable, which is a continuous numerical variable, and we
are trying to understand how 'Y' changes with 'X'.
So, if we are supposed to answer the question "What will be the GRE score of the
student if his CGPA is 8.32?", our go-to option should be linear regression.
The main drawback of this type of regression model is that creating unnecessary extra
features or fitting polynomials of higher degree may lead to overfitting of the
model.
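For the GRE question above, a minimal least-squares sketch in Python (the training data are hypothetical and numpy is assumed to be available):

import numpy as np

cgpa = np.array([6.5, 7.2, 7.8, 8.0, 8.6, 9.1])    # hypothetical CGPA values
gre  = np.array([290, 300, 308, 312, 320, 328])    # hypothetical GRE scores

# fit Y = b0 + b1 * X by least squares
b1, b0 = np.polyfit(cgpa, gre, deg=1)

print(b0 + b1 * 8.32)                              # predicted GRE score for CGPA 8.32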
Logistic Regression
Logistic Regression, also known as the logit or maximum-entropy classifier, is a
supervised learning method for classification. It establishes a relation between
dependent class variables and independent variables using regression.
The dependent variable is categorical i.e. it can take only integral values representing
different classes. The probabilities describing the possible outcomes of a query point
are modelled using a logistic function. This model belongs to a family of
discriminative classifiers. They rely on attributes which discriminate the classes well.
This model is used when we have 2 classes of dependent variables. When there are
more than 2 classes, then we have another regression method which helps us to
predict the target variable better.
There are two broad categories of Logistic Regression algorithms: binary logistic
regression, for two classes, and multinomial logistic regression, for more than two
classes.
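A minimal binary logistic regression sketch using scikit-learn (the study-hours data are hypothetical, and scikit-learn is an assumed dependency):

import numpy as np
from sklearn.linear_model import LogisticRegression

hours  = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])        # two classes: fail (0), pass (1)

model = LogisticRegression().fit(hours, passed)
print(model.predict([[2.2]]))                      # predicted class
print(model.predict_proba([[2.2]]))                # modelled class probabilities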
Regression coefficients determine the slope of the line, that is, the change in the
dependent variable for a unit change in the independent variable, so they are also
known as slope coefficients. They are classified in three ways: simple, partial and
multiple; positive and negative; and linear and non-linear.
b1 = Σ[(xi − x̄)(yi − ȳ)] / Σ[(xi − x̄)²]
The observed data are given by xi and yi; x̄ and ȳ are their mean values (a short
numerical sketch of this formula follows the properties below).
4. If one regression coefficient is greater than 1, then the other will be less than 1.
5. They are not independent of the change of scale. There will be change in the
regression coefficient if x and y are multiplied by any constant.
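The slope formula b1 given above can be evaluated directly; a short sketch with hypothetical data (numpy assumed available):

import numpy as np

def slope_b1(x, y):
    # b1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2)
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(slope_b1(x, y))   # 0.6, the same slope np.polyfit(x, y, 1)[0] returns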
SELF-TEST
1) Which of the following statements is true about the regression line?
a) Alternate hypothesis
b) Null hypothesis
LEARNING OBJECTIVES
After successful completion of this unit, you will be able to
Hypothesis
Testing of Hypothesis
Hypothesis testing is used to assess the plausibility of a hypothesis by using sample data.
Such data may come from a larger population, or from a data-generating process. The
word "population" will be used for both of these cases in the following descriptions.
Key Takeaways
Hypothesis testing is used to assess the plausibility of a hypothesis by using
sample data.
The test provides evidence concerning the plausibility of the hypothesis, given
the data.
Statistical analysts test a hypothesis by measuring and examining a random
sample of the population being analyzed.
In hypothesis testing, an analyst tests a statistical sample, with the goal of providing
evidence on the plausibility of the null hypothesis.
Hypothesis Definition
(Credit: https://byjus.com)
Level of Significance
Statistics is a branch of Mathematics. It deals with gathering, presenting, analyzing,
organizing and interpreting the data, which is usually numerical. It is applied to many
industrial, scientific, social and economic areas. While performing research, a
hypothesis has to be set, which is known as the null hypothesis. This hypothesis has to be
tested through pre-defined statistical examinations; this process is termed statistical
hypothesis testing. The level of significance, or statistical significance, is an important
term that is commonly used in statistics. In this section, the level of significance is
discussed in detail.
Values or observations are less likely the farther they lie from the mean.
The results are written as "significant at x%".
Example: significant at 5% means that the p-value is less than 0.05, i.e., p < 0.05.
Similarly, significant at 1% means that the p-value is less than 0.01.
The level of significance is usually taken as 0.05, i.e., 5%. A low p-value means
that the observed values differ significantly from the population value that
was hypothesised at the outset: the lower the p-value, the stronger the evidence,
and a very small p-value indicates a highly significant result. Most commonly,
p-values smaller than 0.05 are called significant, since obtaining such a value is
unlikely if the null hypothesis is true.
If p > 0.1, there is no real evidence against the null hypothesis.
If 0.05 < p ≤ 0.1, there is only weak evidence against the null hypothesis.
If 0.01 < p ≤ 0.05, there is strong evidence against the null hypothesis.
If the probability of obtaining the observed result under the null hypothesis is 5% or less,
we reject the null hypothesis and accept the alternative hypothesis.
Here, in this case, the chance is 0.03, i.e., 3% (less than 5%), which means that we
reject the null hypothesis and accept the alternative hypothesis.
(Credit: https://byjus.com)
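To illustrate the decision rule described above, the following Python sketch (using an invented sample and a hypothesised population mean of 50) carries out a one-sample t-test with scipy and compares the p-value with the 5% level of significance:

# Hypothetical example: test whether a sample mean differs from a claimed
# population mean of 50, at the 5% level of significance.
import numpy as np
from scipy import stats

sample = np.array([52.1, 49.8, 53.4, 51.2, 50.9, 54.0, 52.7, 51.5])
t_stat, p_value = stats.ttest_1samp(sample, popmean=50.0)

alpha = 0.05
print(f"p-value = {p_value:.3f}")
if p_value <= alpha:
    print("Reject the null hypothesis at the 5% level.")
else:
    print("Fail to reject the null hypothesis at the 5% level.")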
When the test p-value is small, you can reject the null hypothesis and conclude that
the populations differ in means/medians.
Tests for more than two samples are omnibus tests and do not tell you which groups
differ from each other. You should use multiple comparisons to make these inferences.
(Source: https://analyse-it.com/docs/user-guide/compare-groups/equality-mean-median-hypothesis-test)
SELF-TEST 01
1) A statement made about a population for testing purpose is called?
b) Hypothesis
c) Level of Significance
d) Test-Statistic
a) Null Hypothesis
b) Statistical Hypothesis
c) Simple Hypothesis
d) Composite Hypothesis
EXAMPLES
LEARNING OBJECTIVES
After successful completion of this unit, you will be able to
Bioassay
Principle, History, Classification & Example of Bioassay
SELF-TEST
1) Bioassay makes use of
a) Air
b) Water
c) Biological agent
d) All of above
2) Bioassay is an
a) Microbiological Technique
b) Analytical Technique
c) Both of above
d) None of above
SUMMARY
Correlation or dependence is any statistical relationship between two random
variables or bivariate data. It is the measure of how closely two or more variables are
related to each other.
The Spearman rank correlation is computed using the popular formula
ρ = 1 − 6Σdi² / [n(n² − 1)], where di is the difference between the two
ranks of each observation and n is the number of observations; identical
values are assigned fractional ranks equal to the average of their positions in the
ascending order of the values. If ties are present in the data set, this simplified
formula yields incorrect results. The simplified formula should also not be
used in cases where the data set is truncated.
The Spearman rank correlation is a numerical measure of the extent
of statistical dependence between pairs of observations. It is similar to the
Pearson product-moment correlation coefficient, which measures the "linear"
relationship between the raw values rather than between their ranks. The sign
of the Spearman correlation indicates the direction of association between X (the
independent variable) and Y (the dependent variable). A Spearman correlation of
zero indicates that there is no tendency for Y to either increase or decrease when
X increases. The Spearman correlation increases in magnitude as X
and Y become closer to being perfectly monotone functions of each other; when
X and Y are perfectly monotonically related, its value becomes +1 or −1.
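For illustration (with invented paired observations), the Pearson and Spearman coefficients discussed above can be computed with scipy; because the example data increase monotonically but nonlinearly, the Spearman coefficient is 1 while the Pearson coefficient is smaller:

# Compare Pearson (linear) and Spearman (rank/monotone) correlation
# on small, made-up paired data.
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8])
y = np.array([1.2, 1.9, 3.5, 3.9, 9.0, 15.0, 30.0, 65.0])  # monotone but nonlinear

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)
print(pearson_r, spearman_rho)   # Spearman is 1.0 because y always increases with x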
The Simple Linear Regression Model is a statistical (machine learning) model used to predict
the dependent variable. It rests on five assumptions: linear relationship, normality,
no or little multicollinearity, no autocorrelation in the errors, and
homoscedasticity. To assess whether the model is performing well, the statistical measure
used is the coefficient of determination, which is the
portion of the total variation in the dependent variable that is explained by
variation in the independent variable. Polynomial Regression is a type of
regression technique used to model nonlinear relationships.
If 0.05 < p ≤ 0.1, there is only weak evidence against the null hypothesis. The
equality hypothesis test (independent samples) tests if two or more population
means/medians are different. The null hypothesis states that the difference
between the mean/medians of the populations is equal to a hypothesized value,
while the alternative hypothesis states that it is not equal to (or less than, or
greater than) the hypothesized value. If the test p-value is small, the null
hypothesis is rejected and the alternative hypothesis is accepted. Tests for more
than two samples are omnibus tests and do not tell you which groups differ from
each other.
In a direct assay, the stimulus applied to the subject is specific and directly
measurable, and the response to that stimulus is recorded. In an indirect assay, the
relationship between dose and response is established first, and the dose needed to
produce a given response is then estimated from it.
Home pregnancy tests use ELISA to detect the increase of human chorionic
gonadotropin (hCG) during pregnancy. HIV tests also use indirect ELISA.
Environmental bioassays are generally a broad-range survey of toxicity, but can
be time-consuming and laborious. Water pollution control requirements require
some industrial dischargers and municipal sewage treatment plants to conduct
bioassays. ECOTOX is a bioassay used to test the toxicity of water samples.
REFERENCES
https://en.wikipedia.org/wiki/Correlation
https://en.wikipedia.org/wiki/Correlation_coefficient
https://www.embibe.com/exams/methods-of-studying-correlation/
https://byjus.com/commerce/karl-pearson-coefficient-of-correlation/
https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient
https://www.investopedia.com/terms/r/regression.asp
https://www.mygreatlearning.com/blog/what-is-regression/
https://www.tutorhelpdesk.com/homeworkhelp/Statistics-/Uses-Of-Regression-Analysis-Assignment-Help.html
https://byjus.com/jee/properties-of-regression-coefficient/
https://www.investopedia.com/terms/h/hypothesistesting.asp
https://pubrica.com/academy/concepts-definitions/types-of-statistical-hypothesis/
https://byjus.com/maths/hypothesis-definition/
LEARNING OBJECTIVES
After successful completion of this unit, you will be able to
Environmental Modelling
Application of Environmental Models
Environmental Modelling
On the other hand, we may use models to develop our understanding of the
processes that form the environment around us. As noted by Richards (1990),
processes are not observable features, but their effects and outcomes are. In
geomorphology, this is essentially the debate that attempts to link process to
form (Richards et al., 1997). Models can thus be used to evaluate whether the
effects and outcomes are reproducible from the current knowledge of the
processes. This approach is not straightforward, as it is often difficult to evaluate
whether process or parameter estimates are incorrect, but it does at least
provide a basis for investigation. Of course, understanding-driven and
applications-driven approaches are not mutually exclusive. It is not possible (at
least consistently) to be successful in the latter without being successful in the
former. We follow up these themes in much more detail in Chapter 1 (Mulligan
and Wainwright, 2004). Modelling is thus the canvas of scientists on which they
can develop and test ideas, put a number of ideas together and view the outcome,
and integrate and communicate those ideas to others. Models can play one or more of
the roles outlined below. The following seven headings summarize the purposes to which
models are usually put (Mulligan and Wainwright, 2004: 10):
2. As a tool for understanding: Models are a tool for understanding the scientific
concepts being developed. Model building involves understanding the system
under investigation, while application involves learning about the system in depth.
4. As a virtual laboratory: Models can also serve as rather inexpensive, low-hazard and
space-saving laboratories in which a good understanding of processes can
support model experiments. This approach can be particularly important where
the building of hardware laboratories (or hardware models) would be too
expensive, too hazardous or not possible. However, the outcome of any
model experiment is only as good as the understanding summarized within
the model, and therefore care should be taken when using models as laboratories.
Characteristics of Models
Modelling Principles
7. Exercise the model. After a model has been selected or written, it should be
exercised to ensure that it is working properly. Problems with known solutions
can be posed and simulated to confirm that the model is functioning properly. For
example, systems in which no change should occur can be simulated
to check that the model can handle at least this simplest of all cases. By exercising the
elements and features of a model, the modeller gains experience with, and confidence
in, its behaviour.
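A minimal sketch of the "known solution" check described above, using an assumed first-order decay model: with the decay rate set to zero, the state variable should not change, and the code verifies that the model reproduces this simplest case.

# Exercise a simple model against a case with a known answer:
# with the decay rate k = 0, the state variable should not change at all.
def simulate_decay(c0, k, dt, n_steps):
    """Explicit Euler simulation of dC/dt = -k * C."""
    c = c0
    history = [c]
    for _ in range(n_steps):
        c = c + dt * (-k * c)
        history.append(c)
    return history

result = simulate_decay(c0=10.0, k=0.0, dt=0.1, n_steps=100)
assert all(abs(c - 10.0) < 1e-12 for c in result), "model fails the no-change test"
print("No-change test passed:", result[-1])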
SELF-TEST
1) Environmental models are useful for all of the following purposes except
c) Both of above
d) None of above
2) What are some of the things you can do in the modeling environment?
LEARNING OBJECTIVES
After successful completion of this unit, you will be able to
Linear Models
y = b0 + b1x + ε (1.1)
y = b0 + b1x1 + b2x2 + … + bkxk + ε (1.2)
The model in (1.2) is linear in the b parameters; it is not necessarily linear in the
x variables. Thus, models such as y = b0 + b1x + b2x² + ε, which contain powers or
other functions of the x variables, are still linear models.
Regression models such as (1.2) are used for various purposes, including the
following:
Analysis-of-Variance Models
(Credit & Source: Alvin C. Rencher and G. Bruce Schaalje, 2008, Linear Models in
Statistics Department of Statistics, Brigham Young University, Provo, Utah).
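As a small illustration (the data and coefficient values below are invented), a multiple linear regression model of the form shown in (1.2) can be fitted by ordinary least squares with numpy:

# Fit y = b0 + b1*x1 + b2*x2 + error by ordinary least squares (illustrative data).
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 5, n)
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(0, 0.5, n)

X = np.column_stack([np.ones(n), x1, x2])      # design matrix with intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # least-squares estimates of b0, b1, b2
print(beta)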
SELF-TEST
1) If a linear regression model fits the training data perfectly, i.e., the training error is zero, then
____________________
a) 1
b) 2
c) 3
d) 4
LEARNING OBJECTIVES
After successful completion of this unit, you will be able to
CCATT-BRAMS
WRF-Chem
CMAQ
CAMx
GEOS-Chem
LOTOS-EUROS
MATCH
(Source: https://en.wikipedia.org/wiki/Chemical_transport_model)
Model types.
(Credit: http://irina.eas.gatech.edu/ATOC3500_Fall1998/Lecture29.pdf)
The domain of the atmospheric model is the area that is simulated. The computational
domain consists of an array of computational cells, each having uniform
chemical composition. The size of the cells determines the spatial resolution of the
model. Atmospheric chemical transport models are also characterized by their
dimensionality: zero-dimensional (box) model; one-dimensional (column) model;
two-dimensional model; and three-dimensional model.
(Credit: http://irina.eas.gatech.edu/ATOC3500_Fall1998/Lecture29.pdf )
The model time scale depends on the specific application, varying from hours (e.g., air
quality models) to hundreds of years (e.g., climate models).
1) Lagrangian approach: the air parcel moves with the local wind, so no mass exchange
is allowed between the air parcel and its surroundings (except for species emissions).
The air parcel moves continuously, so the model simulates species concentrations at
different locations at different times.
In a box model, concentrations are the same everywhere and therefore are
functions of time only, ni(t). The concentration in the box changes due to:
1) source emissions;
2) transport: advection out of the box and detrainment due to upward motion;
3) chemical transformations;
• In the Lagrangian box model: advection terms are eliminated, but source terms
vary as the parcel moves over different source regions.
• The dimensions and placement of the box is dictated by the particular problem
of interest. For instance, to study the influence of urban emissions on the
chemical composition of air, the box may be designed to cover the urban area.
• Box models can be time dependent. In that case, any variations in time of the
processes considered need to be accurately supplied to the model. For instance, if
a box model is used to compute air quality for a 24-hr day in an urban area, the
model must include daily variations in traffic and other sources; diurnal wind
speed patterns; variations in the height of the mixed layer; variation of solar
radiation during the day; etc.
SELF-TEST
1) Limitations of a box model
c) Both of above
d) None of above
d) None of above
LEARNING OBJECTIVES
After successful completion of this unit, you will be able to learn
Inverse modeling
y = F(x, p) + εO (11.1)
Here, p is a parameter vector including all model variables that we do not seek to
optimize as part of the inversion, and εΟ is an observational error vector
including contributions from errors in the measurements, in the forward model,
and in the model parameters. The forward model predicts the effect (y) as a
function of the cause (x), usually through equations describing the physics of the
system. By inversion of the model we can quantify the cause (x) from
observations of the effect (y). In the presence of error (εO ≠ 0), the solution is
a best estimate of x with some statistical error. This solution for x is called the
optimal estimate, the posterior estimate, or the retrieval. The choice of state
vector (that is, which model variables to include in x versus in p) is totally up to
us. It depends on which variables we wish to optimize, what information is
contained in the observations, and what computational costs are associated with
the inversion. Because of the uncertainty in deriving x from y, we have to
consider other constraints on the value of x that may help to reduce the error on
the optimal estimate. These constraints are called the prior information.
(Credit: www.cambridge.org)
(11.2)
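The following toy Python sketch (all matrices and values are assumed for illustration, and the forward model is taken to be linear) shows the idea of combining observations y with a prior estimate of x to obtain an optimal (posterior) estimate; it is only a simplified least-squares version of the inversion described above.

# Toy linear inverse problem: y = K x + error, with a prior estimate xa.
# The optimal estimate minimises a weighted misfit to both observations and prior.
import numpy as np

K = np.array([[1.0, 0.5],
              [0.3, 1.2],
              [0.8, 0.1]])          # assumed linear forward model
x_true = np.array([2.0, 1.0])
y = K @ x_true + np.array([0.05, -0.02, 0.03])   # observations with small errors

xa = np.array([1.5, 1.5])           # prior estimate of the state vector
So_inv = np.eye(3) / 0.01           # inverse observational error covariance (assumed)
Sa_inv = np.eye(2) / 1.0            # inverse prior error covariance (assumed)

# Optimal (posterior) estimate for a linear-Gaussian problem:
# x_hat = xa + (K^T So^-1 K + Sa^-1)^-1 K^T So^-1 (y - K xa)
A = K.T @ So_inv @ K + Sa_inv
x_hat = xa + np.linalg.solve(A, K.T @ So_inv @ (y - K @ xa))
print(x_hat)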
SELF-TEST
3) Inverse modeling is
b) Return approach
c) Both of above
d) None of above
a) Remote sensing
b) Top-down constraints
c) Data assimilation
d) All of above
REFERENCES
https://www.cambridge.org/
Graedel T. and P. Crutzen (1992). "Atmospheric Change: An Earth System Perspective", Chapter 15, "Building Environmental Chemical Models", at http://irina.eas.gatech.edu/ATOC3500_Fall1998/Lecture29.pdf
https://en.wikipedia.org/wiki/Chemical_transport_model
Alvin C. Rencher and G. Bruce Schaalje, 2008, Linear Models in
Statistics Department of Statistics, Brigham Young University, Provo,
Utah
Compiled by Bikash Sherchan, 2019 at https://www.studocu.com/
LEARNING OBJECTIVES
After successful completion of this unit, you will be able to learn-
Elementary Concept
A probability gives the likelihood that a defined event will occur. It is quantified as a
number between 0 (the event is impossible) and 1 (the event is certain). Thus,
the higher the probability of a given event, the more likely it is to occur. If A is a
defined event, then the probability of A occurring is expressed as P(A). Probability
can be expressed in a number of ways. A frequentist approach is to observe a number
of particular events out of a total number of events. Thus, we might say the
probability of a boy is 0.52, because out of a large number of singleton births we
observe 52% are boys. A model-based approach is where a model, or mechanism
determines the event; thus, the probability of a ‗1‘ from an unbiased die is 1/6 since
there are 6 possibilities, each equally likely and all adding to one. An opinion-based
approach is where we use our past experience to predict a future event, so we might
give the probability of our favourite football team winning the next match, or whether
it will rain tomorrow.
Given two events A and B, we often want to determine the probability of either event,
or both events, occurring.
Bayes‟ Theorem
From the multiplication rule above, we see that:
P(B|A)=P(A|B)P(B)/P(A)
Thus, the probability of B given A is the probability of A given B, times the probability
of B divided by the probability of A.
This formula is not appropriate if P(A)=0, that is if A is an event which cannot happen.
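A short numerical illustration of the rearrangement above, with invented probabilities:

# Bayes' theorem: P(B|A) = P(A|B) * P(B) / P(A), using invented values.
p_A_given_B = 0.9   # probability of A when B holds
p_B = 0.2           # prior probability of B
p_A = 0.3           # overall probability of A (must be non-zero)

p_B_given_A = p_A_given_B * p_B / p_A
print(p_B_given_A)   # 0.6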
Sensitivity and Specificity
Many diagnostic test results are given in the form of a continuous variable (that is one
that can take any value within a given range), such as diastolic blood pressure or
haemoglobin level. However, for ease of discussion we will first assume that these have
been divided into positive or negative results. For example, a positive diagnostic result of
'hypertension' is a diastolic blood pressure greater than 90 mmHg; whereas for 'anaemia',
a haemoglobin level less than 12 g/dl is required.
For every diagnostic procedure (which may involve a laboratory test of a sample taken)
there is a set of fundamental questions that should be asked. Firstly, if the disease is
present, what is the probability that the test result will be positive? This leads to the
notion of the sensitivity of the test. Secondly, if the disease is absent, what is the
probability that the test result will be negative? This question refers to the specificity of
the test. These questions can be answered only if it is known what the 'true' diagnosis is.
In the case of organic disease this can be determined by biopsy or, for example, an
expensive and risky procedure such as angiography for heart disease. In other situations,
it may be by 'expert' opinion. Such tests provide the so-called 'gold standard'.
Example
Consider the results of an assay of N-terminal pro-brain natriuretic peptide (NT-proBNP)
for the diagnosis of heart failure in a general population survey in those over 45 years of
age, and in patients with an existing diagnosis of heart failure, obtained by Hobbs, Davis,
Roalfe, et al (BMJ 2002) and summarised in table 1. Heart failure was identified when
NT-proBNP > 36 pmol/l.
Fig. 3.5: Results of NT-proBNP assay in the general population over 45 and those with a
previous diagnosis of heart failure (after Hobbs, Davis, Roalfe et al, BMJ 2002)
We denote a positive test result by T+, and a positive diagnosis of heart failure (the
disease) by D+. The prevalence of heart failure in these subjects is 103/410=0.251, or
approximately 25%. Thus, the probability of a subject chosen at random from the
combined group having the disease is estimated to be 0.251. We can write this as
P(D+)=0.251.
The sensitivity of a test is the proportion of those with the disease who also have a
positive test result. Thus the sensitivity is given by e/(e+f)=35/103=0.340 or 34%. Now
sensitivity is the probability of a positive test result (event T+) given that the disease is
present (event D+) and can be written as P(T+|D+), where the '|' is read as 'given'.
The specificity of the test is the proportion of those without disease who give a negative
test result. Thus the specificity is h/(g+h)=300/307=0.977 or 98%. Now specificity is the
probability of a negative test result (event T-) given that the disease is absent (event D-)
and can be written as P(T-|D-).
Since sensitivity is conditional on the disease being present, and specificity on the disease
being absent, in theory, they are unaffected by disease prevalence. For example, if we
doubled the number of subjects with true heart failure from 103 to 206 in Table 1, so that
the prevalence was now 206/(410+103)=40%, then we could expect twice as many
subjects to give a positive test result. Thus 2x35=70 would have a positive result. In this
case the sensitivity would be 70/206=0.34, which is unchanged from the previous value.
A similar result is obtained for specificity.
Sensitivity and specificity are useful statistics because they will yield consistent results
for the diagnostic test in a variety of patient groups with different disease prevalences.
This is an important point; sensitivity and specificity are characteristics of the test, not the
population to which the test is applied. In practice, however, if the disease is very rare,
the accuracy with which one can estimate the sensitivity may be limited. This is because
the numbers of subjects with the disease may be small, and in this case the proportion
correctly diagnosed will have considerable uncertainty attached to it.
Two other terms in common use are: the false negative rate (or probability of a false
negative) which is given by f/(e+f)=1-sensitivity, and the false positive rate (or
probability of a false positive) or g/(g+h)=1-specificity.
These concepts are summarised in Fig. 3.7.
It is important for consistency always to put true diagnosis on the top, and test
result down the side. Since sensitivity=1–P(false negative) and specificity=1–
P(false positive), a possibly useful mnemonic to recall this is that 'sensitivity'
and 'negative' have 'n's in them and 'specificity' and 'positive' have 'p's in them.
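Using the counts quoted in the example above (e = 35 and e + f = 103 subjects with heart failure; h = 300 and g + h = 307 subjects without it), the four quantities can be computed directly:

# Sensitivity, specificity and false positive/negative rates for the
# NT-proBNP example quoted above (e = 35, f = 68, g = 7, h = 300).
e, f, g, h = 35, 68, 7, 300

sensitivity = e / (e + f)          # P(T+ | D+) = 35/103 ≈ 0.340
specificity = h / (g + h)          # P(T- | D-) = 300/307 ≈ 0.977
false_negative_rate = f / (e + f)  # 1 - sensitivity
false_positive_rate = g / (g + h)  # 1 - specificity

print(sensitivity, specificity, false_negative_rate, false_positive_rate)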
SELF-TEST
1) Elementary concepts are
a) Used in Mathematics
b) Useful in Statistics
c) Both A & B
d) None of above
3) Probability is
c) Both A & B
d) None of above
LEARNING OBJECTIVES
After successful completion of this unit, you will be able to learn
Environmental System:
SELF-TEST
3) MFR stands for
d) None of above
d) None of above
POLLUTION MODELLING
LEARNING OBJECTIVES
After successful completion of this unit, you will be able to learn
Water Modelling
The most important tasks in the field of environmental modeling and simulation
today are still the determination and analysis of the behaviour of environmental
systems, that is, the computation of the time-dependent dynamic behaviour of
state variables. Some applications of modeling and simulation are given below.
On the one hand, undesired secondary effects of human activities (e.g., emissions
of harmful substances from production processes) can appear. In such a situation,
the secondary effects on the environment are to be minimized. On the other hand,
actions of environmental protection are to be optimized. These are investigations of
the questions: how does a process work? or which effects are released in the
environmental system by a particular input? In the environmental domain,
simulation models are used especially in four fields: emission computation (air,
water, ground pollution), process control, groundwater (economical and flow)
investigations, and ecosystem research.
But further applications are becoming more and more important: e.g., models on the
use and balance of resources (e.g. water, ground, materials, energy, food);
models of the carrying capacity and of loading limits of ecological systems with
an input, which has been caused by human activities; quality models; product-
life-time-cycle models and ecobalances; socio-economic models; combined
tasks: the use of simulation and optimization methods or experiments consisting
of coupled economical and ecological models.
(Source & Credit: R. Griitzner. (1996). Environmental modeling and simulation -
applications and future requirements, University of Rostock.)
Environmental models can be used to study many things, such as:
Climate
Coastal changes.
Hydro-ecological systems
Ocean circulation
Surface and groundwater
Terrestrial carbon
The behaviour of enclosed spaces
The behaviour of spaces around buildings
(Source: designingbuildings.co.uk)
Building A Model
Model Classification:
Water quality models are usually classified according to model complexity, type
of receiving water, and the water quality parameters (dissolved oxygen,
nutrients, etc.) that the model can predict. The more complex the model is, the
more difficult and expensive its application to a given situation will be. In order to
determine the impacts of a particular discharge on ambient water quality, it is usually
necessary to model the diffusion and dispersion of the discharge in the relevant water
body. The approach applies both to new discharges and to the upgrading of existing
sources. This chapter provides guidance on models that may be applicable in the
context of typical Bank projects. Model complexity is a function of four factors:
• The number and type of water quality indicators. In general, the more indicators
that are included, the more complex the model will be. In addition, some indicators
are more complicated to predict than others (see Table 1).
• The level of spatial detail. As the number of pollution sources and water quality
monitoring points increases, so do the data required and the size of the model.
• The level of temporal detail. It is much easier to predict long-term static averages
than short-term dynamic changes in water quality. Point estimates of water quality
parameters are usually simpler than stochastic predictions of the probability
distributions of those parameters.
• The complexity of the water body under analysis. Small lakes that "mix" completely
are less complex than moderate-size rivers, which are less complex than large rivers,
which are less complex than large lakes, estuaries, and coastal zones.
The level of detail required can vary tremendously across different management
applications. At one extreme, managers may be interested in the long-term impact of
a small industrial plant on dissolved oxygen in a small, well-mixed lake. This type of
problem can be addressed with a simple spreadsheet and solved by a single analyst in
a month or less. At the other extreme, if managers want to know the rate of change in
heavy metal concentrations in the Black Sea that can be expected from industrial
modernization in the lower Danube River, the modelling effort required will be far
larger and more complex.
Data Requirements:
According to Jacob Bear, Milovan S. Beljin and Randall R. Ross (1992), in
Fundamentals of Ground-Water Modeling, ground-water flow and
contaminant transport modeling has been used at many hazardous waste sites
with varying degrees of success. Models may be used throughout all phases of
the site investigation and remediation processes. The ability to reliably predict
the rate and direction of groundwater flow and contaminant transport is critical
in planning and implementing ground-water remediations. This paper presents an
overview of the essential components of ground-water flow and contaminant
transport modeling in saturated porous media. While fractured rocks and
fractured porous rocks may behave like porous media with respect to many flow
and contaminant transport phenomena, they require a separate discussion and are
not included in this paper. Similarly, the special features of flow and contaminant
transport in the unsaturated zone are also not included. This paper was prepared
for an audience with some technical background and a basic working knowledge
of ground-water flow and contaminant transport processes. A suggested format
for ground-water modeling reports and a selected bibliography are included as
appendices A and B, respectively.
The management of any system means making decisions aimed at achieving the
system‘s goals, without violating specified technical and nontechnical constraints
imposed on it. In a ground-water system, management decisions may be related
to rates and location of pumping and artificial recharge, changes in water quality,
location and rates of pumping in pump-and-treat operations, etc. Management‘s
objective function should be to evaluate the time and cost necessary to achieve
remediation goals. Management decisions are aimed at minimizing this cost
while maximizing the benefits to be derived from operating the system. The
value of management‘s objective function (e.g., minimize cost and maximize
effectiveness of remediation) usually depends on both the values o f the decision
variables (e.g., areal and temporal distributions of pumpage) and on the response
of the aquifer system to the implementation of these decisions. Constraints are
expressed in terms of future values of state variables of the considered ground-water
system, such as water table elevations and concentrations of specific contaminants.
• the kind of solid matrix comprising the aquifer (with reference to its
homogeneity, isotropy, etc.);
• the relevant state variables and the area, or volume, over which the averages of
such variables are taken;
• sources and sinks of water and of relevant contaminants, within the domain and
on its boundaries (with reference to their approximation as point sinks and
sources, or distributed sources);
Selecting the appropriate conceptual model for a given problem is one of the
most important steps in the modeling process. Oversimplification may lead to a
model that lacks the required information, while undersimplification may result
in a costly model, or in the lack of data required for model calibration and
parameter estimation, or both. It is, therefore, important that all features relevant
to a considered problem be included in the conceptual model and that irrelevant
ones be excluded. The selection of an appropriate conceptual model and the
degree of simplification in any particular case depends on:
• flux equations that relate the flux(es) of the considered extensive quantity(ies)
to the relevant state variables of the problem;
• constitutive equations that define the behavior of the fluids and solids involved;
• an equation (or equations) that expresses initial conditions that describe the
known state of the considered system at some initial time; and
• an equation (or equations) that defines boundary conditions that describe the
interaction of the considered domain with its environment.
All the equations must be expressed in terms of the dependent variables selected
for the problem. The selection of the appropriate variables to be used in a
particular case depends on the available data. The number of equations included
in the model must be equal to the number of dependent variables. The boundary
conditions should be such that they enable a unique, stable solution. The most
general boundary condition for any extensive quantity states that the difference
in the normal component of the total flux of that quantity, on both sides of the
(Credit and Source: Jacob Bear, Milovan S. Beljin and Randall R. Ross, 1992,
Fundamentals of Ground-Water Modeling)
SELF-TEST
3) Water Quality Modelling is a part of
a) Ecology
b) Environment
c) Environmental Modelling
a) Baseline Data
b) Statistics
c) Simulation
d) All of above
LEARNING OBJECTIVES
After successful completion of this unit, you will be able to learn
According to Gary Haq and Dieter Schwela (2008), in the edited book Foundation
Course on Air Quality Management in Asia (Modelling chapter), an air quality
simulation model system (AQSM) is a numerical technique or methodology for
estimating air pollutant concentrations in space and time. It is a function of the
distribution of emissions and the existing meteorological and geophysical
conditions. An alternative name is dispersion model. Air pollution concentrations
are mostly used as an indicator for human exposure, which is a main component
for risk assessment of air pollutants on human health. An AQSM addresses the
problem of how to allocate available resources to produce a cost -effective
control plan. In this context, an AQSM can respond to the following types of
questions: • What are the relative contributions to concentrations of air pollutants
from mobile and stationary sources? • What emission reductions are needed for
outdoor concentrations to meet air quality standards? • Where should a planned
source of emissions be sited? • What will be the change in ozone (O3)
concentrations if the emissions of precursor air pollutants (e.g. nitrogen oxides
(NOx) or hydrocarbons (HC)) are reduced by a certain percentage? • What will
be the future state of air quality under certain emission reduction scenarios?
Figure 3.1 shows the main components of an AQSM. An alternative and
complementary model is source apportionment (SA). SA starts from observed
concentrations and their chemical composition and estimates the relative
contribution of various source types by comparing the composition of sources
with the observed composition at the receptors. For this reason, SA is also called
receptor modelling.
Emission Estimates:
• estimate the impacts at receptor sites in the vicinity of a planned road or those
of envisaged changes in traffic flow; • support the selection of appropriate
monitoring sites (e.g. for ―hot spots‖) when no knowledge on potential
concentrations is available; • forecast air pollution concentrations; and
Once pollutants are released into the atmosphere they are transported by air
motions which lower the air pollutant concentrations in space over a period of
time. Meteorological parameters averaged over one-hour time intervals are
usually used to describe this phenomenon. Meteorological parameters include
wind speed, wind direction, turbulence, mixing height, atmospheric stability,
temperature and inversion.
Wind: Wind is a velocity vector having direction and speed. Usually only the
horizontal components of this vector are considered, since the vertical component
is relatively small. The wind direction is the direction from which the wind comes.
Through wind speed, continuous pollutant releases are diluted at the point of release;
concentrations in the plume are inversely proportional to the wind speed.
Mixing height:
Fig. 4.12 Vertical dispersion under various conditions for low and high
elevations of the source [T: Temperature, θ: Adiabatic lapse] Source: Adapted
from Liptak (1974)
• flow of a layer of cold air under a layer of warm air (elevated inversion).
All of these occur, although the first process (cooling from below) is the most
common.
Inversion due to cooling of air at the surface: Shortly before sunset, the ground
surface begins to cool by radiation and cools the air layer nearest to it. By
sunset, there will be a strong but shallow inversion close to the ground. All night
this inversion will grow in strength and height, until by dawn of the next day the
temperature profile will be practically the same as that shown for dawn. As the
temperature of the air adjacent to the ground is below that of the air at some
height above, air pollutants released in this layer are trapped, because the air,
being colder than the layer above it, will not rise. This is called a "stagnant
inversion".
Inversion due to heating from above: Heating an air layer from above
can simply occur when a cloud layer absorbs incoming solar radiation. However,
it often occurs when there is a high-pressure region (common in summer between
storms) in which there is a slow net downward flow of air and light winds. The
sinking air mass will increase in temperature at the adiabatic lapse rate (the
change in temperature of a mass of air as it changes height). It often becomes
warmer than the air below it. The result is an elevated inversion, also called a
subsidence inversion or inversion aloft. These normally form from 500 to 5000 m
above the ground, and they inhibit atmospheric mixing.
Flow of a layer of warm air over a layer of cold air: If a flow of warm air meets a
layer of cold air, the warm air mass could simply sweep away the colder air.
Under certain circumstances, such as trapping of the cold air by mountains, a mass
of warm air is caused to flow over a cold dome. Even without mountains,
atmospheric forces alone can cause a mass of warm air to flow over a cold mass.
This is called an "overrunning" effect, when warm air overrides cold air at the
surface.
Inversion due to flowing of cool air under warm air: A horizontal flow of
cold air into a region lowers the surface temperature. As cold air is denser and
flows laterally, it displaces warmer and less dense air upwards. This phenomenon
also produces an inversion near the surface.
Meteorological Measurements:
• wind speed;
• wind direction;
• net radiation;
• relative humidity;
• precipitation; and
• atmospheric pressure.
Some of these data overlap each other. The final selection of parameters will
depend on the instruments available and the type of specific data needed for the
user. The Automatic Weather Station (AWS) will collect high-quality, real-time
data that is normally used in a variety of weather observation activities ranging
from air quality data assessment and industrial accidental release forecasting to
long-term modelling for planning purposes. The weather station designed for air
quality studies will have to provide surface data and meteorological information
in the surface boundary layer and in the troposphere as a whole. For the purpose
of explaining air pollution transport and dispersion most of the sensors may be
located along a 10 m high mast. The basic suite of sensors will measure wind
velocity and wind direction, temperature, relative humidity, air pressure, and
precipitation. The expanded suite of sensors may offer measurement of solar
radiation, net radiation, wind fluctuations (turbulence), vertical temperature
gradients and visibility. To obtain electric power and a data retrieval system with
modems and computers, the AWS is often located with one or several of the air
quality monitoring stations.
Types of Models:
All models assume a material balance equation which applies over a specified set
of boundaries:
(Accumulation rate) = (Flow-in rate) − (Flow-out rate) + (Emission rate) − (Destruction rate)
The units of all terms in this equation are mass per unit time, e.g., g/s.
Box models
Box models (see Figure 3.10) illustrate the simplest kind of material balance. To
estimate the concentration in a city, air pollution in a rectangular box is
considered under the following major simplifying assumptions:
• The air pollution emission rate of the city is Q [g/s], which is independent of
space and time (continuous emission). Q is related to emission rate per unit area,
q [g/s·m2] by:
Q = q · (W · L) [g/s]
• The mass of pollutant emitted from the sources remains in the atmosphere. No
pollutant leaves or enters through the top of the box, nor through the sides that
are parallel to the wind direction.
• The turbulence is strong enough in the upwind direction that the pollutant
concentrations [mass/volume] from releases of the sources within the box and
those due to the pollutant masses entering the box from the upwind side are spatially
uniform in the box.
These assumptions lead to a steady state situation and the accumulation rate is
zero. All the terms can then be easily quantified and calculated.
If χ(t) [g/m3] denotes the pollutant concentration in the box as a function of time
t, and χin the (constant) concentration in the incoming air mass, then the
concentration rises from its initial value and approaches a constant value.
For longer times t the concentration approaches a steady state (χ(t) = Q/(L · H ·
u)), which corresponds to a zero accumulation rate.
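A minimal numerical sketch of the box-model balance described above; all parameter values are invented for illustration, and the incoming concentration is set to zero so that the steady state reduces to Q/(L · H · u):

# Simple urban box model: V * dχ/dt = u*L*H*(χ_in - χ) + Q
# Illustrative (assumed) parameter values; χ_in = 0 for simplicity.
W, L, H = 20_000.0, 10_000.0, 1_000.0   # box dimensions [m]
u = 3.0                                  # wind speed [m/s]
Q = 500.0                                # city-wide emission rate [g/s]
chi_in = 0.0                             # incoming concentration [g/m3]

V = W * L * H                            # box volume [m3]
dt, t_end = 60.0, 48 * 3600.0            # time step and duration [s]

chi = 0.0
for _ in range(int(t_end / dt)):
    d_chi = (u * L * H * (chi_in - chi) + Q) / V
    chi += d_chi * dt

print("simulated concentration:", chi)
print("steady-state value Q/(L*H*u):", Q / (L * H * u))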
There are several drawbacks of box models. Firstly, some of the assumptions are
unrealistic (e.g. wind speed independent of height, or uniformity of air pollutant
concentrations throughout the box). Secondly, the model does not distinguish a
source configuration of a large number of small sources emitting pollutants at
low elevation (cars, houses, small industry, and open burning) from that of a
small number of large sources emitting larger amounts per source at higher
elevation (power plants, smelters, and cement plants). Both types of sources are
simply added to estimate a value for the emission rate per unit area (q). Of two
sources with the same emission rate, the higher elevated one leads to lower
ground-level concentrations in reality. As there is no way to deal with this
drawback, box models are unlikely to give reliable estimates, except perhaps
under very special circumstances.
A Gaussian model is the solution of the basic equations for transport and
diffusion in the atmosphere (based on conservation of mass), assuming stationarity
in time and complete homogeneity in space. A Gaussian dispersion model is normally
used for considering a point source such as a factory smoke stack. It attempts to compute
the downwind concentration resulting from the point source. The origin of the
coordinate system is placed at the base of the stack, with the x axis aligned in the
downwind direction. The contaminated gas stream, which is normally called the
"plume", is shown rising from the smokestack and then levelling off to travel in
the x direction and spreading in the y and z directions as it travels.
The plume normally rises to a considerable height above the stack because it is
emitted at a temperature higher than that of ambient air and with a vertical
velocity component. For Gaussian plume calculation, the plume is assumed to
start from a point with coordinates (0,0,H), where H is called the effective stack
height and is the sum of physical stack height (h) and the plume rise (dh). It
should be kept in mind that the Gaussian plume approach tries to calculate only
the average values without making any statement about instantaneous values.
The results obtained by Gaussian plume calculations should be considered only
as averages over periods of at least 10 minutes, and preferably one-half to one
hour. The Gaussian plume model so far allows one to estimate the concentration
at a receptor point due to a single emission source for a specific meteorology. In
this form, it is frequently used to estimate the maximum concentrations to be
expected from single isolated sources.
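For illustration, a minimal implementation of the standard Gaussian plume expression for a continuous point source (with ground reflection); the emission rate, wind speed, effective stack height and dispersion parameters σy and σz below are assumed values, not taken from the text:

# Gaussian plume concentration at a downwind receptor (illustrative values).
# C(x,y,z) = Q/(2π u σy σz) * exp(-y²/2σy²) * [exp(-(z-H)²/2σz²) + exp(-(z+H)²/2σz²)]
import numpy as np

def gaussian_plume(Q, u, H, y, z, sigma_y, sigma_z):
    """Steady-state concentration [g/m3] from a continuous point source."""
    crosswind = np.exp(-y**2 / (2 * sigma_y**2))
    vertical = (np.exp(-(z - H)**2 / (2 * sigma_z**2)) +
                np.exp(-(z + H)**2 / (2 * sigma_z**2)))   # ground reflection term
    return Q / (2 * np.pi * u * sigma_y * sigma_z) * crosswind * vertical

# Assumed example: Q = 100 g/s, wind 4 m/s, effective stack height 50 m,
# receptor on the plume centreline at ground level, σy = 80 m, σz = 40 m.
c = gaussian_plume(Q=100.0, u=4.0, H=50.0, y=0.0, z=0.0, sigma_y=80.0, sigma_z=40.0)
print(c)   # concentration in g/m3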
After that, the thread of contaminated air expands by turbulent mixing. All the
time, the reference point x is in the middle of the moving cloud. In a Lagrangian model
the pollution distribution is described by a set of discrete "particles" (small air
volumes) or puffs, which are labelled by their changing location (i.e. their
trajectories are followed). Lagrangian particle models represent pollutant
releases as a stream of "particles". Since the model "particles" have no physical
dimensions, source types may be specified to have any shape and size, and the
emitted "particles" may be distributed over an arbitrary line, area or volume
(Ministry for the Environment, 2004).
The Gaussian puff model uses the Gaussian equation in three-dimensional and
Lagrangian viewpoints. An example of a Gaussian puff model which is used as a
non-steady-state air quality model is CALPUFF (see Figure 3.13). The model
was developed originally by Sigma Research Corporation under funding
provided by the Californian Air Resource Board (CARB) (ASG, 2007).
The modelling system (Scire et al., 1990a, 1990b) developed to meet these
objectives consists of three components:
(2) a Gaussian puff dispersion model with chemical removal, wet and dry
deposition, complex terrain algorithms, building downwash, plume fumigation,
and other effects; and
(Credit for the above chapter: Gary Haq and Dieter Schwela, 2008, in the edited
book Foundation Course on Air Quality Management in Asia, Modelling chapter)
SELF-TEST
1) Air Quality Modelling may require
a) Financial Data
b) Population Data
c) Meteorological Data
d) None of above
a) Economy
b) Statistics
c) Laws
d) None of above
SUMMARY
Elementary probability theory is an important part of sample survey design. It
provides preliminary statistical concepts such as expectation, variance, and
covariance, as well as measures of error, interval estimation, and sample size
determination. Probability is quantified as a number between 0 and 1,
and can be expressed in a number of ways. The addition rule is used to determine
the probability of at least one of two events occurring: the probability of
either event A or B is given by P(A) + P(B) − P(A and B). If A and B are mutually
exclusive, meaning they cannot occur together, then P(A and B) = 0 and the rule
simplifies to P(A) + P(B).
Sensitivity is the proportion of those with the disease who also have a positive
test result, while specificity is the probability of a negative test result given that
the disease is absent. Sensitivity and specificity are useful statistics for
diagnostic tests, but if the disease is rare, accuracy may be limited. Two terms in
common use are f/(e+f)=1-sensitivity and g/(g+h)=1-specificity, which are
summarised in Fig. 3.7. The most important details in this text are the definitions
of sensitivity and specificity, the predictive value of a test, and the prevalence of
coronary artery disease in patients with suspected coronary artery disease.
It is helpful to remember that 'sensitivity' and 'negative' both contain an 'n', while
'specificity' and 'positive' both contain a 'p'. With a positive test result, the
probability of the patient having coronary artery disease is adjusted upwards to the
probability of disease, 815/930=0.70. These values are affected by the prevalence of the
disease, such as if those with the disease doubled in Table 3, then the positive
test would become 1630/(1630+115)=0.93 and the negative test
327/(327+416)=0.44. Bayes' Theorem states that the probability of having both a
positive exercise test and coronary artery disease is P(T+ and D+). From Table 3,
the probability of picking out one man with both is 815/1465=0.56. Bayes'
theorem enables prior assessments about the chances of a diagnosis to be
combined with the eventual test results to obtain a so-called "posterior"
assessment about the diagnosis.
Simulation tools support numerical insights into the system behavior. Systems
analysis consists of various steps, such as analyzing the decision problem,
forming a model, testing the model, and solving the decision problem by
scenario analysis or optimization. Typical examples of environmental problems
include water quality, population growth, and world economy. Hierarchical
systems methodology was developed with systems theory and basic knowledge
of cybernetics. Systems engineering provides a tool box with approved methods
in engineering sciences.
The control problem for water quality consists in the fulfillment of certain
conditions and optimization of the overall costs. This is a distributed parameter
system based on a partial differential equation and solved by hierarchical
optimization. Environmental object classification leads to the taxonomy
distinguishing between atmosphere (all objects above the surface of the Earth),
hydrosphere (water-related objects), lithosphere (relating to soil and rocks),
biosphere (all living matters) and technosphere (human-made objects). Maximum
likelihood is used to solve this problem. Environmental system models based on
measurements need aggregation, validation and interpretation of the initial
collection of environmental data.
Validation procedures are developed and applied (such as temporal, geographic,
space-time and interparameter validation; see Günther). Knowledge-based systems (expert
systems) play a role for initial evaluation of environmental raw data. For data
processing, statistical classification, data management and artificial intelligence
provide standard methods. When circumstances of measurements are known like
weather, date, time of day, etc., methods based on Bayesian probability theory
are used. Neural nets are used to handle uncertainty. Such validated data are the
basis for information systems and monitoring of the state of the environment.
The most important tasks in environmental modeling and simulation today are
the determination and analysis of the behaviour of environmental systems.
Simulation models are used in four fields: emission computation, process
control, groundwater (economical and flow) investigations, and ecosystem
research. Environmental models can be used to study many things, such as
climate, coastal changes, hydro-ecological systems, ocean circulation, surface
and groundwater, terrestrial carbon, enclosed spaces, and spaces around
buildings. Water quality modeling is the use of mathematical simulation
techniques to analyze water quality data. It helps people understand the extent
of water quality issues and provides evidence for policy makers to make decisions
in order to properly mitigate water pollution.
https://web.worldbank.org
designingbuildings.co.uk
https://medium.com
https://commons.wikimedia.org/wiki/File:Pipe-PFR.sv
www.eolss.net
https://www.healthknowledge.org.uk/public-health-textbook/research-
methods/1b-statistical-methods/elementary-probability-theory
www.healthknowledge.org.uk
Jacob Bear, Milovan S. Beljin and Randall R. Ross (1992). Fundamentals of
Ground-Water Modeling.
Dear Student,
Now that you have gone through this book, it is time for you to do some thinking for us.
Please answer the following questions sincerely. Your response will help us to
analyse our performance and make the future editions of this book more useful.
Your response will be completely confidential and will in no way affect your
examination results. Your suggestions will receive prompt attention from us.
Please submit your feedback online at this QR Code or at following link
https://forms.gle/rpDib9sy5b8JEisQ9
Style
01. Do you feel that this book enables you to learn the subject independently
without any help from others?
02. Do you feel the following sections in this book serve their purpose? Write the
appropriate code in the boxes.
Code 1 for ―Serve the purpose fully‖
Code 2 for ―Serve the purpose partially‖
Code 3 for ―Do not serve any purpose‖
Code 4 for ―Purpose is not clear‖
03. Do you feel the following sections or features, if included, will enhance self -
learning and reduce help from others?
Index
Glossary
List of ―Important Terms Introduced‖
Two Color Printing
Content
04. How will you rate your understanding of the contents of this Book?
05. How will you rate the language used in this Book?
06. Do the syllabus and the content of the book complement each other?
07. Which topics did you find easiest to understand in this book?
Sr.No. Topic Name Page No.
09. List the difficult topics you encountered in this Book. Also try to suggest
how they can be improved.
Use the following codes:
Code 1 for ―Simplify Text‖
Code 2 for ―Add Illustrative Figures‖
Code 3 for ―Provide Audio-Vision (Audio Cassettes with companion Book)‖
Code 4 for ―Special emphasis on this topic in counseling‖
10. List the errors which you might have encountered in this book.
11. Based on your experience, how would you rank the components of distance
learning for their effectiveness?
1. 2. 3. 4. 5.