Ba Lab File
(i) Local files: Weka can read data from various file formats such as CSV, ARFF, and Excel
files stored on the local computer.
(ii) Remote databases: Weka can connect to various types of remote databases such as
MySQL, PostgreSQL, and Microsoft SQL Server.
(iii) Web sources: Weka can access data from web sources such as APIs and web services.
(iv) Data generators: Weka provides several built-in data generators that can generate
synthetic data sets for experimentation and testing.
(v) Pre-processed data: Weka can accept pre-processed data sets from other software or data
analysis tools.
(vi) Distributed data: Weka can work with distributed data stored on Hadoop Distributed
File System (HDFS) or other distributed file systems.
(vii) Streaming data: Weka can handle streaming data in real-time using various data stream
mining algorithms.
Here, the Iris Plants Database, created by R.A. Fisher, is used as the data source.
1.3 Variables
A variable is a characteristic that can be measured and that can assume different values.
Height, age, income, province or country of birth, grades obtained at school and type of
housing are all examples of variables.
1.3.1 Variables Identification
Variables identified in Weka refer to the different features or attributes that are used in the
machine learning analysis. For example, in a dataset about customer demographics, variables
might include age, income, and location.
A categorical variable (also called a qualitative variable) refers to a characteristic that cannot
be quantified. Categorical variables can be either nominal or ordinal.
(B) Ordinal variables
An ordinal variable is a variable whose values are defined by an order relation between the
different categories. In the table, the variable “behaviour” is ordinal because the category
“Excellent” is better than the category “Very good,” which is better than the category
“Good,” etc. There is some natural ordering, but it is limited since we do not know by how
much “Excellent” behaviour is better than “Very good” behaviour.
An example of a discrete variable is the number of people in a household, for a household of
size 20 or less. The number of possible values is 20, because it's not possible for a household
to include a number of people that would be a fraction of an integer, like 2.27 for instance.
1.4 Classifier
A classifier is an algorithm that takes input data and assigns it to one of several predefined
categories or classes. The goal of a classifier is to learn a function that can accurately predict
the class of new, unseen data.
There are various types of classifiers. To apply the Naive Bayes classifier in Weka:
1. Load your dataset into the Weka Explorer.
2. Select the Naive Bayes algorithm from the list of available classifiers by clicking on
the "Classify" tab and then selecting "Naive Bayes" from the list.
3. Click on the "Start" button to begin the classification process. Weka will split your
data into training and testing sets, and will use the training set to build the Naive
Bayes model.
4. Once the model has been built, Weka will use it to classify the instances in the testing
set, and will display the results in the "Classify" panel.
It is important to note that the Naive Bayes algorithm makes the "naive" assumption that all
of the features are independent of each other, given the class label. This assumption may not
hold in all cases, but the algorithm is still widely used because it is computationally efficient
and often performs well in practice.
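Outside Weka, the effect of the independence assumption can be sketched in a few lines of Python. The priors and per-feature likelihood tables below are invented for illustration (they are not estimated from the Iris data); the posterior is simply prior × product of per-feature likelihoods, renormalized:

```python
# Toy Naive Bayes classifier over two discrete features, using the
# "naive" assumption that features are independent given the class.

priors = {"setosa": 0.5, "versicolor": 0.5}          # assumed class priors
likelihoods = {                                       # assumed P(feature | class)
    "setosa":     {"petal": {"short": 0.9, "long": 0.1},
                   "sepal": {"narrow": 0.2, "wide": 0.8}},
    "versicolor": {"petal": {"short": 0.2, "long": 0.8},
                   "sepal": {"narrow": 0.7, "wide": 0.3}},
}

def classify(petal, sepal):
    # Posterior is proportional to prior * P(petal | class) * P(sepal | class)
    scores = {c: priors[c]
                 * likelihoods[c]["petal"][petal]
                 * likelihoods[c]["sepal"][sepal]
              for c in priors}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

posterior = classify("short", "wide")
# setosa: 0.5*0.9*0.8 = 0.36;  versicolor: 0.5*0.2*0.3 = 0.03
```

Multiplying the per-feature likelihoods is exactly where the independence assumption enters; if the two features were strongly correlated, these products would over- or under-state the true joint likelihoods.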
In addition to the basic Naive Bayes algorithm, Weka also provides several variations,
including Multinomial Naive Bayes and Complement Naive Bayes, which may be more
appropriate for certain types of datasets.
Fig 1.4 Use of Naïve Bayes Classifier
Fig 1.5 Output of Naïve Bayes Classifier
To apply the k-NN classifier in Weka:
1. Load your dataset into the Weka Explorer.
2. Select the k-NN algorithm from the list of available classifiers by clicking on the
"Classify" tab and then selecting "IBk" from the list.
3. In the "IBk options" window, enter the value of k you want to use. You can also
specify the distance metric to use (e.g., Euclidean distance, Manhattan distance).
4. Click on the "Start" button to begin the classification process. Weka will split your
data into training and testing sets, and will use the training set to build the k-NN
model.
5. Once the model has been built, Weka will use it to classify the instances in the testing
set, and will display the results in the "Classify" panel.
It is important to note that the performance of the k-NN algorithm can be highly sensitive to
the value of k, the distance metric used, and the choice of features. It is often a good idea to
experiment with different values of k and different distance metrics to find the best results.
In addition to the basic k-NN algorithm, Weka also provides several variations, including
weighted k-NN and instance-based learning with the option to perform feature selection,
which may be more appropriate for certain types of datasets.
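The sensitivity to k mentioned above can be demonstrated with a minimal pure-Python sketch (the training points and query point are made up): with k = 1 the lone nearest neighbour decides, while with k = 3 the majority vote flips the prediction.

```python
import math
from collections import Counter

# Hypothetical 2-D training set: (point, class label)
train = [((1.0, 1.0), "A"), ((1.5, 1.8), "A"), ((1.2, 0.8), "A"),
         ((4.0, 4.2), "B"), ((4.5, 4.0), "B"), ((0.5, 4.5), "B"),
         ((2.4, 2.4), "B")]

def knn_predict(query, k):
    # Sort training points by Euclidean distance to the query,
    # then take a majority vote among the k nearest.
    nearest = sorted(train, key=lambda p: math.dist(query, p[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict((2.2, 2.2), 1))   # nearest single point is class "B"
print(knn_predict((2.2, 2.2), 3))   # two of the three nearest are "A"
```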
Fig 1.6 Output of k-NN classifier
2. Assumptions: k-NN assumes that instances that are close in feature space are likely to
belong to the same class, while Naive Bayes assumes that the features are independent
given the class label.
3. Training time: k-NN requires essentially no training time, as it simply memorizes
(stores) the training instances, while Naive Bayes requires training time to estimate
the probability distributions of each feature given each class label.
4. Performance: k-NN can perform well on datasets with complex decision boundaries
or noisy data, but may suffer from the curse of dimensionality as the number of
features increases. Naive Bayes can be sensitive to correlated features and may
perform poorly if the independence assumption is violated, but can be very efficient
and effective on high-dimensional datasets.
5. Interpretability: Naive Bayes is more interpretable than k-NN, as the probability
estimates for each class can be easily understood and used for decision-making.
Overall, the choice between k-NN and Naive Bayes in Weka depends on the characteristics
of the dataset, the desired level of interpretability, and the trade-off between accuracy and
efficiency. It is often a good idea to experiment with both algorithms and compare their
performance using evaluation metrics such as accuracy, precision, recall, F1-score, and
kappa.
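The evaluation metrics named above all derive from confusion-matrix counts. A small Python sketch with assumed binary counts (not actual Weka output) shows how they relate:

```python
# Assumed confusion-matrix counts for a binary problem
tp, fp, fn, tn = 40, 10, 5, 45

accuracy  = (tp + tn) / (tp + fp + fn + tn)   # fraction of correct predictions
precision = tp / (tp + fp)                    # of predicted positives, how many were right
recall    = tp / (tp + fn)                    # of actual positives, how many were found
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, recall, f1)
```

Weka reports these per class and as weighted averages; the kappa statistic additionally corrects accuracy for the agreement expected by chance.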
Correctly Classified Instances: The number of instances that were correctly classified
by the k-NN algorithm.
Incorrectly Classified Instances: The number of instances that were misclassified by
the k-NN algorithm.
Kappa statistic: A measure of the agreement between the predicted and actual class
labels, taking into account the possibility of agreement by chance.
Mean absolute error: The average absolute difference between the predicted and
actual class labels.
Root mean squared error: The square root of the average squared difference between
the predicted and actual class labels.
Naive Bayes
Correctly Classified Instances: The number of instances that were correctly classified
by the Naive Bayes algorithm.
Incorrectly Classified Instances: The number of instances that were misclassified by
the Naive Bayes algorithm.
Kappa statistic: A measure of the agreement between the predicted and actual class
labels, taking into account the possibility of agreement by chance.
Log-likelihood: The log-likelihood of the Naive Bayes model, which is a measure of
the goodness of fit of the model to the training data.
Prior probabilities: The probabilities of each class label in the training data.
Conditional probabilities: The estimated probabilities of each feature value given each
class label, which are used to calculate the posterior probabilities and make
predictions.
The second section of the output provides a detailed breakdown of the model's performance
for each class in the dataset:
1. Iris-setosa:
• TP rate: 1.000
• FP rate: 0.000
• Precision: 1.000
• Recall: 1.000
• F-measure: 1.000
• ROC Area: 1.000
Interpretation: The k-NN model correctly classified all 50 instances of Iris-setosa, with no
false positives or false negatives. This suggests that the model is very accurate in identifying
Iris-setosa.
2. Iris-virginica:
• TP rate: 1.000
• FP rate: 0.000
• Precision: 1.000
• Recall: 1.000
• F-measure: 1.000
• ROC Area: 1.000
Interpretation: The k-NN model correctly classified all 50 instances of Iris-virginica, with
no false positives or false negatives. This suggests that the model is very accurate in
identifying Iris-virginica.
3. Iris-versicolor:
• TP rate: 1.000
• FP rate: 0.000
• Precision: 1.000
• Recall: 1.000
• F-measure: 1.000
• ROC Area: 1.000
Interpretation: The k-NN model correctly classified all 50 instances of Iris-versicolor, with
no false positives or false negatives. This suggests that the model is very accurate in
identifying Iris-versicolor. The model had a weighted average TP rate of 1 and a weighted
average precision of 1, indicating that the model performed very well overall in classifying
the instances in the iris dataset.
The first section of the output provides a summary of the Naïve Bayes model's performance
on the iris dataset:
Correctly Classified Instances: This indicates that out of the 150 instances in the
dataset, 144 are classified correctly by the Naïve Bayes model. This model is giving 96%
correctly classified instances.
Incorrectly Classified Instances: This indicates that out of the 150 instances in the
dataset, 6 are classified incorrectly by the Naïve Bayes model. This model is giving 4%
incorrectly classified instances.
Kappa statistic: This is a measure of agreement between the predicted and actual class
labels, taking into account the possibility of agreement occurring by chance. A value of 1
indicates perfect agreement, while a value of 0 indicates agreement no better than
chance. In this case, the kappa statistic is 0.94, indicating near perfect agreement
between the predicted and actual class labels.
Mean absolute error: This is the average absolute difference between the predicted and
actual class labels. In this case, the mean absolute error is 0.0324.
Root mean squared error: This is the square root of the average squared difference
between the predicted and actual class labels. In this case, the root mean squared error is
0.1495.
Relative absolute error: This is the mean absolute error expressed as a percentage of the
error of a simple baseline predictor (one that always predicts the average). In this case,
the relative absolute error is 7.2883%.
Root relative squared error: This is the root mean squared error expressed as a
percentage of the error of the same baseline predictor. In this case, the root relative
squared error is 31.7089%.
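These summary figures are mutually consistent, as a short arithmetic check shows. The chance-agreement value pe = 1/3 below assumes the three iris classes are balanced (50 instances each) and the predictions are spread roughly evenly, which holds for this dataset:

```python
n, correct = 150, 144
accuracy = correct / n            # 0.96, matching the 96% reported above
error_rate = 1 - accuracy         # 0.04

# Cohen's kappa = (observed agreement - chance agreement) / (1 - chance agreement)
pe = 1 / 3                        # chance agreement for 3 balanced classes
kappa = (accuracy - pe) / (1 - pe)
print(round(kappa, 2))            # 0.94, matching the reported kappa statistic
```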
The second section of the output provides a detailed breakdown of the model's performance
for each class in the dataset:
1. Iris-setosa:
• TP rate: 1.000
• FP rate: 0.000
• Precision: 1.000
• Recall: 1.000
• F-measure: 1.000
• ROC Area: 1.000
Interpretation: The Naïve Bayes model correctly classified all 50 instances of Iris-setosa,
with no false positives or false negatives. This suggests that the model is very accurate in
identifying Iris-setosa.
2. Iris-virginica:
• TP rate: 0.920
• FP rate: 0.020
• Precision: 0.958
• Recall: 0.920
• F-measure: 0.939
• ROC Area: 0.993
Interpretation: The Naïve Bayes model correctly classified 46 out of 50 instances of Iris-
virginica, with 4 misclassified instances. The TP rate and precision were both high,
indicating that the model is generally accurate in identifying Iris-virginica, although the
precision was slightly higher than the TP rate.
3. Iris-versicolor:
• TP rate: 0.960
• FP rate: 0.040
• Precision: 0.923
• Recall: 0.960
• F-measure: 0.941
• ROC Area: 0.993
Interpretation: The Naïve Bayes model correctly classified 48 out of 50 instances of Iris-
versicolor, with 2 misclassified instances. The TP rate is slightly higher than the precision,
indicating that the model is generally accurate in identifying Iris-versicolor.
The model had a weighted average TP rate of 0.960 and a weighted average precision of
0.960, indicating that the model performed well overall in classifying the instances in the iris
dataset.
MS – EXCEL
2.1 Correlation
Correlation is a statistical measure that expresses the extent to which two variables are
linearly related (meaning they change together at a constant rate). It's a common tool for
describing simple relationships without making a statement about cause and effect.
Spearman's rank correlation coefficient
It is appropriate when one or both variables are skewed or ordinal and is robust when
extreme values are present. For a correlation between variables x and y, with no tied ranks,
the sample Spearman's correlation coefficient is given by
rs = 1 − 6 Σ di² / (n(n² − 1)),
where di is the difference between the ranks of the i-th pair of values and n is the number of
pairs.
Correlation coefficients are used to measure the strength of the linear relationship
between two variables.
A correlation coefficient greater than zero indicates a positive relationship while a
value less than zero signifies a negative relationship.
A value of zero indicates no relationship between the two variables being compared.
A negative correlation, or inverse correlation, is a key concept in the creation of
diversified portfolios that can better withstand portfolio volatility.
Calculating the correlation coefficient is time-consuming, so data are often plugged
into a calculator, computer, or statistics program to find the coefficient.
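Both coefficients can be computed directly; this pure-Python sketch (with a made-up five-point data set containing no tied ranks) implements the Pearson formula and the Spearman rank-difference formula:

```python
def pearson(x, y):
    # r = covariance(x, y) / (sd(x) * sd(y))
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    # rs = 1 - 6 * sum(d_i^2) / (n(n^2 - 1)), valid when there are no ties
    rank = lambda v: {val: i + 1 for i, val in enumerate(sorted(v))}
    rx, ry = rank(x), rank(y)
    n = len(x)
    d2 = sum((rx[a] - ry[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n * n - 1))

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(pearson(x, y), spearman(x, y))   # both 0.8 for this data
```

In Excel, the Pearson value for two ranges comes from =CORREL(A2:A6, B2:B6); a Spearman value can be obtained by applying CORREL to ranks produced with RANK.AVG.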
2.1.2 Calculation of Correlation in Excel
Fig 2.2 Calculation of Correlation
2.2 Standard Deviation
A standard deviation (or σ) is a measure of how dispersed the data are in relation to the
mean. A low standard deviation means the data are clustered around the mean, and a high
standard deviation indicates the data are more spread out. A standard deviation is used to
determine how the observations in a group (i.e., a data set) are spread out from the mean
(average or expected value).
The population standard deviation is σ = √( Σ (xi − µ)² / N ), where σ is the standard
deviation, xi is each data point in the set, µ is the mean, and N is the total number of data
points.
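A short worked example (with an assumed data set) makes the formula concrete; the mean here is 5 and the squared deviations sum to 32, so σ = √(32/8) = 2:

```python
data = [2, 4, 4, 4, 5, 5, 7, 9]                  # assumed data set
n = len(data)
mu = sum(data) / n                               # mean = 5.0
variance = sum((x - mu) ** 2 for x in data) / n  # population variance = 4.0
sigma = variance ** 0.5                          # population standard deviation = 2.0
print(mu, sigma)
```

Excel's =STDEV.P(range) computes this population version; =STDEV.S(range) divides by N − 1 instead and is used when the data are a sample.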
3. Confidence interval
Standard deviation is used to calculate the confidence interval, which is the range of
values that we can be confident the true value lies in. It is used to estimate the
population parameter based on a sample.
4. Risk management
Standard deviation is used in finance and investment to measure the risk associated
with a particular investment. It helps investors to understand the volatility of returns
and the likelihood of a loss.
5. Quality control
Standard deviation is used in manufacturing to control the quality of products. It is
used to measure the variation in product characteristics and to identify defects or
inconsistencies.
6. Scientific research
Standard deviation is used in scientific research to analyse and interpret data. It is
used to compare different groups, assess the significance of results, and identify
trends and patterns in data.
Standard deviation is a valuable statistical measure that helps to understand the
variation and distribution of data.
2.3 Coefficient of Variation
In probability theory and statistics, the coefficient of variation (CV), also known as the
relative standard deviation, is a dimensionless measure of dispersion that gives the extent of
variability in data: CV = σ / µ, often expressed as a percentage. It is very useful for
comparing two data sets with differing units.
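A minimal sketch of the calculation, using assumed height data; because the result is a ratio of like units, it is dimensionless and can be compared across data sets measured in different units:

```python
heights_cm = [160, 170, 180]                      # assumed data
mu = sum(heights_cm) / len(heights_cm)            # mean = 170
sigma = (sum((x - mu) ** 2 for x in heights_cm)
         / len(heights_cm)) ** 0.5                # population SD ~ 8.165
cv_percent = sigma / mu * 100                     # ~ 4.80 %
print(round(cv_percent, 2))
```

In Excel the same value is =STDEV.P(range)/AVERAGE(range), formatted as a percentage.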
2.3.2 Calculation of Coefficient of Variation in Excel
2.4 Methods for Graphical Representation
2.4.1 Bar Chart
A bar chart displays data using rectangular bars that are proportional to the values being
represented. The bars can be vertical or horizontal. In a vertical bar chart, the x-axis
represents categories or labels, and the y-axis represents the values being measured. In a
horizontal bar chart, the y-axis represents categories or labels, and the x-axis represents the
values being measured. Bar charts are useful for comparing data across categories or for
showing changes in data over time.
Here are some of the common uses of bar charts:
Comparison: Bar charts are great for comparing values across different categories.
The height or length of each bar corresponds to the value being represented, making it
easy to compare the values visually.
Distribution: Bar charts can also be used to show the distribution of data within a
category. For example, you can create a frequency distribution chart to show the
number of times a particular value occurs within a dataset.
Ranking: Bar charts can also be used to rank categories based on their values. This is
useful for showing the relative importance of different categories or for identifying
the top performers in a dataset.
Time Series: Bar charts can also be used to represent changes in data over time. This
is done by plotting the values on the y-axis and the time periods on the x-axis.
2.4.2 Pie Chart
A pie chart displays data as a circle divided into segments, with each segment representing a
proportion of the whole. Each segment is labelled with the corresponding value or category.
Pie charts are useful for showing the relative proportions of different categories or values in a
dataset. However, they can be more difficult to read than other chart types if there are too
many segments or if the segments are too small.
Here are some common uses of pie charts:
Proportions: Pie charts are useful for showing the proportion of different categories
or values in a dataset. The size of each segment corresponds to the proportion of the
whole that it represents.
Comparisons: Pie charts can be used to compare the relative sizes of different
categories or values. This is done by comparing the sizes of the different segments.
Composition: Pie charts can also be used to show the composition of a whole. For
example, you can create a pie chart to show the different components that make up a
total value.
Percentages: Pie charts are also useful for showing percentages. Each segment can be
labelled with its corresponding percentage, making it easy to understand the relative
sizes of each category or value.
Limitations: Pie charts are not suitable for showing large numbers of categories or
values, as the segments can become difficult to distinguish. Additionally, it can be
difficult to compare the sizes of different segments that are similar in size.
2.4.3 Line Chart
A line chart displays data as a series of points connected by lines. The x-axis represents time
or categories, and the y-axis represents the values being measured. Line charts are useful for
showing trends in data over time or for comparing multiple datasets. They can also be used to
show changes in a single dataset over time.
Here are some common uses of line diagrams:
1. Trends: Line diagrams are commonly used to show trends in data over time. This is
done by plotting the values on the y-axis and the time periods on the x-axis. By
connecting the points with lines, it is easy to see how the values change over time.
2. Comparison: Line diagrams can be used to compare multiple datasets. By plotting
the data for each dataset on the same chart, it is easy to compare the trends and
identify any differences or similarities.
3. Relationships: Line diagrams can also be used to show the relationship between two
variables. This is done by plotting one variable on the x-axis and the other variable on
the y-axis. By connecting the points with lines, it is easy to see how the two variables
are related.
4. Forecasting: Line diagrams can also be used for forecasting. By projecting the trend
line into the future, it is possible to make predictions about future values.
5. Limitations: Line diagrams are not suitable for showing categorical data, as the x-
axis is typically used for continuous numerical values such as time or measurements.
Additionally, line diagrams may not be effective for showing data with large
fluctuations or outliers, as these can distort the trend line.
Book Review
Book Name - The Art of Statistics: How to Learn from Data
Author - David Spiegelhalter
Chapter 1
Getting Things in Proportion: Categorical Data and Percentages
In this chapter, Spiegelhalter provides an overview of what statistics is and its various
applications. He discusses the different types of data and introduces the concept of
probability.
Categorical Data
Categorical data is a collection of information that is divided into groups. For example, if an
organisation or agency collects biodata on its employees, the resulting data is referred to as
categorical.
1. Nominal Data
Nominal data is a type of data that is used to label the variables without providing any
numerical value. It is also known as the nominal scale. Nominal data cannot be ordered or
measured, although it can sometimes take both qualitative and quantitative forms (for
example, numeric codes). Common examples of nominal data are letters, words, symbols,
gender etc. These data are analysed with the help of the grouping method: the variables are
grouped together into
categories and the percentage or frequency can be calculated. It can be presented visually
using the pie chart.
2. Ordinal Data
Ordinal data is a type of data that follows a natural order. The notable features of ordinal data
are that the difference between data values cannot be determined. It is commonly encountered
in surveys, questionnaires, finance and economics. The data can be analysed using
visualisation tools. It is commonly represented using a bar chart. Sometimes the data may be
represented using tables in which each row in the table indicates the distinct category.
CHAPTER 2
Summarizing and Communicating Numbers. Lots of Numbers
This chapter covers the basics of data summarisation, including measures of central tendency
and variability. Spiegelhalter explains how to use graphs and charts to summarise data
effectively.
There are three ways of presenting the pattern of the values. These patterns can be variously
termed the data distribution, sample distribution or empirical distribution.
(a) The strip-chart/dot-diagram
It shows each data-point as a dot, but each one is given a random jitter to prevent multiple
guesses of the same number lying on top of each other and obscuring the overall pattern.
These types of charts are used to graphically depict certain data trends or groupings.
(c) Histogram
It counts how many data-points lie in each of a set of intervals – it gives a very rough idea of
the shape of the distribution.
Numerical variables come in two broad types:
(a) Count variables: where measurements are restricted to the integers 0, 1, 2 … For
example, the number of homicides each year, or guesses at the number of jelly beans in a jar.
(b) Continuous variables: measurements that can be made, at least in principle, to arbitrary
precision. For example, height and weight, each of which might vary both between people
and from time to time. These may, of course, be rounded to whole numbers of centimetres or
kilograms.
There are three basic interpretations of the term 'average', sometimes jokingly referred to by
the single term 'mean-median-mode'. These are also known as measures of the location of
the data distribution.
Mean: the sum of the numbers divided by the number of cases.
Median: the middle value when the numbers are put in order. This is how Galton
summarized the votes of his crowd.
Mode: the most common value.
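Python's standard statistics module computes all three directly (the data values below are invented):

```python
import statistics

values = [3, 5, 5, 8, 9, 5, 7]           # assumed data
mean_v = statistics.mean(values)         # sum 42 / 7 values = 6
median_v = statistics.median(values)     # middle of sorted 3,5,5,5,7,8,9 -> 5
mode_v = statistics.mode(values)         # most common value -> 5
print(mean_v, median_v, mode_v)
```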
CHAPTER 3
Why Are We Looking at Data Anyway? Populations and Measurement
Here, Spiegelhalter introduces the normal distribution and explains its importance in
statistics. He shows how to use the normal distribution to make predictions and construct
confidence intervals.
Inductive Inference
Inductive inference is based on a generalization from a finite set of past observations,
extending the observed pattern or relation to other future instances or instances occurring
elsewhere. Inductive inference starts from propositions about data and ends in propositions
that extend beyond the data. An example of an inductive inference is that, from the
proposition that up until now all observed pears were green, we conclude that the next few
pears will be green as well.
Population distribution
Population distribution denotes the spatial pattern due to dispersal of population, formation
of agglomeration, linear spread etc. Population density is the ratio of people to physical
space. It shows the relationship between a population and the size of the area in which it
lives.
There are four types of population. They are:
Finite Population
The finite population is also known as a countable population in which the population can be
counted. In other words, it is defined as the population of all the individuals or objects that
are finite. For statistical analysis, the finite population is more advantageous than the infinite
population. Examples of finite populations are the employees of a company and the potential
consumers in a market.
Infinite Population
The infinite population is also known as an uncountable population in which the counting of
units in the population is not possible. An example of an infinite population is the number of
germs in a patient's body, which is uncountable.
Existent Population
The existing population is defined as the population of concrete individuals. In other words,
the population whose unit is available in solid form is known as existent population.
Examples are books, students etc.
Hypothetical Population
The population whose units are not available in concrete form is known as the hypothetical
population. A population consists of sets of observations, objects, etc., that all have
something in common. In some situations, populations are only hypothetical. Examples are
the outcomes of rolling a die or tossing a coin.
Bell shaped curve
A bell curve is a common type of distribution for a variable, also known as the normal
distribution. The term "bell curve" originates from the fact that the graph used to depict a
normal distribution consists of a symmetrical bell-shaped curve.
The highest point on the curve, or the top of the bell, represents the most probable event in a
series of data (its mean, mode, and median in this case), while all other possible occurrences
are symmetrically distributed around the mean, creating a downward-sloping curve on each
side of the peak. The width of the bell curve is described by its standard deviation.
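The width claim can be made precise with the familiar 68–95–99.7 rule: for a normal distribution, the probability of lying within k standard deviations of the mean is erf(k/√2), which a few lines of Python confirm:

```python
import math

def within_k_sd(k):
    # P(|X - mean| < k * sd) for a normally distributed X
    return math.erf(k / math.sqrt(2))

p1, p2, p3 = within_k_sd(1), within_k_sd(2), within_k_sd(3)
print(round(p1, 4), round(p2, 4), round(p3, 4))  # ~ 0.6827, 0.9545, 0.9973
```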
CHAPTER 4
What Causes What?
In this chapter, Spiegelhalter discusses the relationship between data description and
inference. He explains how to use hypothesis testing to make inferences about population
parameters.
Causation in Statistics
Statistics describes a relationship between two events or two variables. Causation is present
when the value of one variable or event increases or decreases as a direct result of the
presence or lack of another variable or event. Causation is difficult to pin down or be certain
about because circumstances and events can arise out of a complex interaction between
multiple variables.
Simpson’s Paradox
Simpson's Paradox is a statistical phenomenon where an association between two variables in
a population emerges, disappears or reverses when the population is divided into
subpopulations. For instance, two variables may be positively associated in a population, but
be independent or even negatively associated in all subpopulations.
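The paradox is easiest to see with concrete numbers. The counts below follow the classic kidney-stone textbook example, given as (successes, trials): treatment A has the higher success rate in both subgroups, yet the lower rate once the subgroups are pooled:

```python
# (successes, trials) for two treatments across two subgroups
A_small, A_large = (81, 87), (192, 263)
B_small, B_large = (234, 270), (55, 80)

rate = lambda s_t: s_t[0] / s_t[1]

# Within each subgroup, A beats B...
print(rate(A_small) > rate(B_small))   # True  (0.93 vs 0.87)
print(rate(A_large) > rate(B_large))   # True  (0.73 vs 0.69)

# ...but pooled, B beats A, because A was applied mostly to the harder cases
A_all = (A_small[0] + A_large[0], A_small[1] + A_large[1])   # (273, 350)
B_all = (B_small[0] + B_large[0], B_small[1] + B_large[1])   # (289, 350)
print(rate(A_all) < rate(B_all))       # True  (0.78 vs 0.83)
```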
Reverse causation
Reverse causation is a phenomenon that describes the association of two variables differently
than you would expect: instead of X causing Y, as is the case for traditional causation, Y
causes X. Some people refer to reverse causality as the "cart-before-the-horse bias" to
emphasize the unexpected nature of the correlation.
Lurking factors
A lurking variable is a variable that is unknown and not controlled for; it has an important,
significant effect on the variables of interest. Lurking variables are extraneous variables, but
they may make the relationship between dependent and independent variables seem other
than it actually is.
For example, the kinds of people who are willing to use more illegal or more dangerous
drugs are simply the same kinds of people who would also be okay with using both
marijuana and alcohol.
CHAPTER 5
Modelling Relationships Using Regression
This chapter covers regression analysis, including simple and multiple regression models.
Spiegelhalter explains how to use regression analysis to make predictions and understand the
relationship between variables.
Regression
“Regression is the measure of average relationship between two or more variables in terms
of the original units of the data.” Regression modelling is a process of determining a
relationship between one or more independent variables and one dependent or output
variable.
Example-
Predicting the height of the person given the age of the person.
Predicting the price of a car given the car model, year of manufacturing, mileage,
engine capacity etc.
1. Simple Linear Regression
Assume that there is a single independent variable x. If the relationship between the
independent variable x and the dependent or output variable y is modeled by the relation
y = a + bx
then the regression model is called a Simple Linear Regression.
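The coefficients a and b can be estimated by ordinary least squares. A minimal Python sketch, with made-up data that lie exactly on y = 1 + 2x, recovers a = 1 and b = 2:

```python
xs = [1, 2, 3, 4, 5]            # assumed independent variable
ys = [3, 5, 7, 9, 11]           # exactly y = 1 + 2x

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# Least-squares slope: covariance(x, y) / variance(x)
b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
     / sum((x - mx) ** 2 for x in xs))
a = my - b * mx                  # intercept from the means
predict = lambda x: a + b * x
print(a, b, predict(10))         # 1.0 2.0 21.0
```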
2. Multiple Regression
Assume that there are multiple independent variables, say x1, x2, x3, …, xn. If the
relationship between the independent variables and the dependent or output variable y is
modeled by the relation
y = a0 + a1*x1 + a2*x2 + … + an*xn
then the regression model is called a Multiple Regression.
3. Polynomial Regression
Assume that there is only one independent variable x. If the relationship between the
independent variable x and the dependent or output variable y is modeled by the relation
y = a0 + a1*x + a2*x^2 + … + an*x^n
for some positive integer n > 1, then we have a polynomial regression.
4. Logistic Regression
Logistic regression is used when the dependent variable is binary (0/1, True/False, Yes/No)
in nature.
Regression Coefficient
A regression coefficient is an estimated parameter in a statistical model that expresses the
strength of the relationship between an explanatory variable and an outcome in multiple
regression analysis. The
coefficient will have a different interpretation depending on whether the outcome variable is
a continuous variable (multiple linear regression), a proportion (logistic regression), a count
(Poisson regression) or a survival time (Cox regression).
Residual error
The generic term for the component of the data that cannot be explained by a statistical
model, and so is said to be due to chance variation.
CHAPTER 6
Algorithms, Analytics and Prediction
Algorithm
Algorithm is a step-by-step procedure, which defines a set of instructions to be executed in a
certain order to get the desired output. Algorithms are generally created independent of
underlying languages, i.e. an algorithm can be implemented in more than one programming
language.
There are two broad tasks for such an algorithm:
Classification: for example, determining the likes and dislikes of an online customer, or
whether that object in a robot's vision is a child or a dog.
Prediction: for example, what the weather will be next week, what a stock price might do
tomorrow, what products that customer might buy, or whether that child is going to run out
in front of our self-driving car.
Types of Algorithms
1. Brute Force Algorithm
This algorithm uses the general logic structure of the problem and tries every possible
candidate. It is also called an exhaustive search algorithm because it exhausts all possibilities
to provide the required solution.
There are two types of such algorithms:
Optimizing: Finding all possible solutions to a problem and then selecting the best
one.
Sacrificing: It will stop as soon as an acceptable (good-enough) solution is found.
3. Greedy Algorithm
This is an algorithm paradigm/pattern that makes the best choice possible on each iteration in
the hopes of choosing the best solution.
It is simple to set up and has a shorter execution time.
4. Branch and Bound Algorithm
Only integer programming problems can be solved using the branch and bound algorithm.
This method divides all feasible solution sets into smaller subsets. These subsets are then
evaluated further to find the best solution.
5. Randomized Algorithm
As with a standard algorithm, a randomized algorithm has predefined input and required
output, but it uses random numbers at some point in its logic. This contrasts with
deterministic algorithms, which have a defined set of information and required results and
follow the same described steps for a given input every time.
5. Backtracking
Backtracking is an algorithmic procedure that recursively builds candidate solutions and
discards a candidate as soon as it cannot satisfy the constraints of the problem. With this
understanding of what an algorithm is and its main approaches, we can now look at analytics.
Analytics
Analytics is the scientific process of discovering and communicating the meaningful patterns
found in data. Common application areas include:
Web analytics
Fraud analysis
Risk analysis
Advertisement and marketing
Enterprise decision management
CHAPTER 7
How Sure Can We Be About What Is Going On? Estimates and Intervals
This chapter covers the concepts of estimates and intervals, including confidence intervals
and standard errors. Spiegelhalter shows how to use these concepts to quantify uncertainty
and make informed decisions.
Margin of error
The margin of error in statistics is the degree of error in results received from random
sampling surveys. A higher margin of error indicates that the results of a survey or poll are
less reliable, i.e. there is lower confidence that the results represent the population.
Bootstrapping
Bootstrapping is a statistical procedure that resamples a single dataset to create many
simulated samples. This process allows you to calculate standard errors, construct confidence
intervals, and perform hypothesis testing for numerous types of sample statistics. Bootstrap
methods are alternative approaches to traditional hypothesis testing and are notable for being
easier to understand and valid for more conditions.
Bootstrapping a sample consists of creating new data sets of the same size by
resampling the original data, with replacement
Sample statistics calculated from bootstrap resamples tend towards a normal
distribution for larger data sets, regardless of the shape of the original data
distribution.
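The resampling procedure just described can be sketched in plain Python; the dataset and resample count below are illustrative, not from the text:

```python
import random
import statistics

def bootstrap_means(data, n_resamples=5000, seed=42):
    """Resample `data` with replacement and collect the mean of each resample."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.mean(rng.choices(data, k=n)) for _ in range(n_resamples)]

data = [23, 19, 31, 25, 28, 22, 35, 27, 24, 30]
means = bootstrap_means(data)

# Bootstrap standard error: the standard deviation of the resampled means.
se = statistics.stdev(means)

# 95% percentile confidence interval for the mean.
means.sort()
lower = means[int(0.025 * len(means))]
upper = means[int(0.975 * len(means))]
print(f"SE = {se:.2f}, 95% CI = ({lower:.1f}, {upper:.1f})")
```

Note that each resample is the same size as the original data and is drawn with replacement, exactly as the text describes.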
Sampling distribution
A sampling distribution is a probability distribution of a statistic that is obtained through
repeated sampling of a specific population. It describes a range of possible outcomes for a
statistic, such as the mean or mode of some variable, of a population.
Central limit theorem
The central limit theorem (CLT) states that the distribution of sample means approximates a
normal distribution as the sample size gets larger, regardless of the population's distribution.
Sample sizes equal to or greater than 30 are often considered sufficient for the CLT to hold.
CHAPTER 8
Probability – the Language of Uncertainty and Variability
What is Probability?
Probability denotes the possibility of the outcome of a random event; it expresses the extent
to which an event is likely to happen. For example, when we flip a coin in the air, what is
the possibility of getting a head? The answer depends on the number of possible outcomes.
Here the possible outcomes are head and tail, so the probability of getting a head is 1/2.
Probability is the measure of the likelihood of an event happening; it measures the certainty
of the event. The formula for probability is given by:
P(E) = Number of favourable outcomes / Number of total outcomes
P(E) = n(E)/n(S)
where n(E) = number of outcomes favourable to event E and n(S) = total number of outcomes.
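The formula P(E) = n(E)/n(S) can be applied directly; as a minimal sketch, the probability of rolling an even number with a fair die:

```python
from fractions import Fraction

# Sample space S for one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

# Event E: rolling an even number.
event = {x for x in sample_space if x % 2 == 0}

# P(E) = n(E) / n(S)
p = Fraction(len(event), len(sample_space))
print(p)  # 1/2
```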
There are 3 types of probabilities:
1. Theoretical Probability
2. Experimental Probability
3. Axiomatic Probability
In axiomatic probability, a set of rules or axioms is laid down that applies to all types of
probability. These axioms were formulated by Kolmogorov and are known as Kolmogorov's three
axioms. With the axiomatic approach, the chances of occurrence or non-occurrence of events
can be quantified.
Probability Tree
A tree diagram helps to organize and visualize the different possible outcomes. A tree has two
main parts: branches and ends. The probability of each branch is written on the branch, while
each end contains a final outcome. Tree diagrams make it easy to see when probabilities should
be multiplied (along a branch) and when they should be added (across branches).
Conditional probability
Conditional probability is the likelihood of an event or outcome occurring given that a
previous event or outcome has occurred. It is calculated by dividing the joint probability of
both events by the probability of the conditioning event:
P(B|A) = P(A∩B) / P(A)
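A minimal sketch of this formula, using two fair dice as an assumed example (A: the first die shows a 6; B: the total is at least 10):

```python
from fractions import Fraction
from itertools import product

# Sample space: all ordered outcomes of rolling two fair dice.
space = list(product(range(1, 7), repeat=2))

# A: the first die shows a 6.  A∩B: first die is 6 AND the total is at least 10.
A = [o for o in space if o[0] == 6]
A_and_B = [o for o in space if o[0] == 6 and sum(o) >= 10]

# P(B|A) = P(A∩B) / P(A); the common denominator |space| cancels.
p_b_given_a = Fraction(len(A_and_B), len(A))
print(p_b_given_a)  # 1/2
```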
Application of probability
Probability theory is applied in everyday life in risk assessment and modeling. The insurance
industry and markets use actuarial science to determine pricing and make trading decisions.
Governments apply probabilistic methods in environmental regulation, entitlement analysis,
and financial regulation.
CHAPTER 9
Putting Probability and Statistics Together
In this chapter, Spiegelhalter discusses the importance of randomized trials in determining the
effectiveness of medical treatments and interventions. He explains how randomized trials are
designed and conducted, and how they differ from observational studies.
Binomial distribution
The binomial distribution is a statistical probability distribution that gives the likelihood
of obtaining a given number of successes in a fixed number of independent trials, where each
trial has the same probability of success.
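As a sketch of the idea, the binomial probability of exactly k successes in n trials can be computed from first principles (the coin-flip example is illustrative):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k): probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 3 heads in 5 fair coin flips.
print(round(binom_pmf(3, 5, 0.5), 4))  # 0.3125
```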
Standard error
The standard error is a statistical term that measures the accuracy with which a sample
distribution represents a population by using standard deviation. In statistics, a sample mean
deviates from the actual mean of a population; this deviation is the standard error of the
mean.
The standard error (SE) is the approximate standard deviation of a statistical sample
population.
The standard error describes how far the calculated sample mean is likely to deviate
from the true population mean.
The more data points involved in the calculation of the mean, the smaller the
standard error tends to be.
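The relationship between sample size and standard error can be illustrated with a short sketch (the data are made up):

```python
import statistics

def standard_error(sample):
    """SE of the mean: sample standard deviation divided by the square root of n."""
    return statistics.stdev(sample) / len(sample) ** 0.5

small = [12, 15, 11, 14, 13]
large = small * 20  # same spread of values, but 100 data points

print(standard_error(small))
print(standard_error(large))  # smaller: more data points shrink the SE
```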
Funnel Plot
A funnel plot is a graphical tool used in statistics to assess whether there is publication bias or
small study effects in a meta-analysis or systematic review. A funnel plot is a scatter plot of
the effect sizes (usually standardized) of the included studies against a measure of their
precision, such as the standard error or the sample size.
The Central Limit Theorem
The Central Limit Theorem (CLT) is a fundamental concept in statistics that states that if a
random sample is taken from any population, the distribution of the sample means will
approach a normal distribution as the sample size increases, regardless of the shape of the
population distribution. This theorem is an essential tool for making inferences about
population parameters, such as means and proportions, based on sample data.
The Central Limit Theorem implies that sample means and other summary statistics can
be assumed to have a normal distribution for large samples.
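A small simulation can illustrate the theorem; the exponential population and the sample sizes below are assumptions chosen for illustration:

```python
import random
import statistics

rng = random.Random(0)

def sample_means(sample_size, n_samples=2000):
    """Collect the means of many samples drawn from a skewed population.

    The population is exponential (mean 1.0), i.e. very far from normal.
    """
    return [statistics.mean(rng.expovariate(1.0) for _ in range(sample_size))
            for _ in range(n_samples)]

# As the sample size grows, the distribution of the sample means narrows
# around the population mean (1.0), even though the population is skewed.
for n in (2, 30):
    means = sample_means(n)
    print(n, round(statistics.mean(means), 2), round(statistics.stdev(means), 2))
```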
Bernoulli distribution
The Bernoulli distribution is a discrete probability distribution that describes the outcomes of
a single binary experiment or trial, where the outcome can be either success (with probability
p) or failure (with probability 1-p).
The probability mass function (PMF) of the Bernoulli distribution is given by:
P(X = 1) = p
P(X = 0) = 1 - p
where X is a random variable that takes on the values 1 (success) or 0 (failure), and p is the
probability of success.
Uncertainty
Uncertainty is an inherent part of statistics and data analysis. It refers to the degree of doubt
or lack of confidence we have in our estimates or conclusions based on data.
There are several sources of uncertainty in statistics, including:
Sampling variability
Measurement error
Model uncertainty
Epistemic uncertainty
There are two main types of uncertainty in statistics and data analysis: aleatory
uncertainty, which arises from inherent chance variation, and epistemic uncertainty, which
arises from lack of knowledge.
CHAPTER 10
Answering Questions and Claiming Discoveries
This chapter focuses on how to effectively communicate statistical findings and answer research
questions in a clear and transparent manner. The chapter emphasizes the importance of
providing context and acknowledging uncertainty in statistical results. One way to do this is
by using confidence intervals and p-values, which provide a range of plausible values for a
parameter and a measure of the strength of evidence against a null hypothesis, respectively.
Hypothesis
A hypothesis can be defined as a proposed explanation for a phenomenon. It is not the
absolute truth, but a provisional, working assumption, perhaps best thought of as a potential
suspect in a criminal case. A hypothesis can be either a null hypothesis or an alternative
hypothesis.
Permutation test
A permutation test is a non-parametric statistical test used to test the null hypothesis that
there is no difference between two groups. It does not rely on any assumptions about the
distribution of the data and can be used in situations where the assumptions of traditional
parametric tests, such as the t-test or ANOVA, are violated.
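A minimal sketch of a permutation test for a difference in group means (the two groups are invented data):

```python
import random
import statistics

def permutation_test(group_a, group_b, n_perms=10000, seed=1):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = group_a + group_b
    n_a = len(group_a)
    count = 0
    for _ in range(n_perms):
        rng.shuffle(pooled)  # relabel the observations under the null hypothesis
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            count += 1
    return count / n_perms  # p-value: how often random labels look as extreme

treatment = [24.1, 25.8, 26.4, 27.0, 28.2]
control = [21.0, 21.9, 22.5, 23.1, 23.8]
print(permutation_test(treatment, control))
```

Because the test only shuffles labels, it makes no assumption about the shape of the data's distribution, which is the point made above.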
Hypergeometric distribution
The hypergeometric distribution is a probability distribution that describes the probability of
obtaining a specified number of successes in a sample of a fixed size drawn without
replacement from a finite population of known size, where the population contains a fixed
number of elements that are classified into two categories (successes and failures).
The probability mass function of the hypergeometric distribution is given by:
P(X = k) = [(M choose k) * (N-M choose n-k)] / (N choose n)
where N is the size of the population, M is the number of successes in the population, n is
the sample size, and k is the number of observed successes.
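The probability mass function above can be computed directly with binomial coefficients; the deck-of-cards example is illustrative:

```python
from math import comb

def hypergeom_pmf(k, N, M, n):
    """P(X = k) for draws without replacement.

    N: population size, M: successes in the population, n: sample size.
    """
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)

# Probability of exactly 2 red cards in a 5-card hand from a 52-card deck.
print(round(hypergeom_pmf(2, 52, 26, 5), 4))  # 0.3251
```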
False-positive result
In statistics, a false-positive result occurs when a hypothesis test or a diagnostic test indicates
the presence of a condition or effect when it is actually absent. It is a type of error that occurs
when a test incorrectly identifies something as positive, even though it is negative. For
example, in medical testing, a false-positive result may occur when a diagnostic test indicates
that a patient has a disease or condition when they do not actually have it.
Bonferroni correction
The Bonferroni correction is a statistical method used to adjust the p-values in multiple
hypothesis testing to reduce the risk of false positives. It is named after the Italian
mathematician Carlo Emilio Bonferroni. For example, if 10 tests are being performed and the
desired overall alpha level is 0.05, then the corrected alpha level for each individual test
would be 0.05/10 = 0.005. This means that each test must have a p-value less than or equal to
0.005 to be considered significant.
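The correction described above can be sketched in a few lines (the p-values are invented):

```python
def bonferroni(p_values, alpha=0.05):
    """Return the corrected alpha and a significance decision for each p-value."""
    corrected_alpha = alpha / len(p_values)
    return corrected_alpha, [(p, p <= corrected_alpha) for p in p_values]

p_values = [0.001, 0.008, 0.020, 0.049]
corrected, decisions = bonferroni(p_values)
print(corrected)  # 0.0125, i.e. 0.05 / 4
for p, sig in decisions:
    print(p, "significant" if sig else "not significant")
```

Note how 0.020 and 0.049 would pass an uncorrected 0.05 threshold but fail the corrected one, which is exactly how the correction reduces false positives.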
In the method of Lagrange multipliers, the problem at hand is of the form: maximize f(x)
subject to g(x) ≤ c. The Lagrangian is
M(x, λ) = f(x) − λg(x)
If xo(λ) maximizes M(x, λ) for a given λ ≥ 0, then xo(λ) maximizes f(x) over all x such that
g(x) ≤ g(xo(λ)).
CHAPTER 11
Learning from Experience the Bayesian Way
In this chapter, Spiegelhalter explores the Bayesian approach to statistical inference, which
involves updating our beliefs based on both prior knowledge and new data. He explains the
basic concepts of Bayesian inference, such as prior distributions, likelihood functions, and
posterior distributions, in a clear and accessible manner.
Bayesian Approach
The Bayesian approach to statistics is a framework for making decisions based on probability
theory and subjective beliefs or prior knowledge. A key strength of the Bayesian approach is
its ability to incorporate prior knowledge or beliefs into the analysis, which can lead to
more accurate and nuanced results. The Bayesian approach begins by specifying a prior
distribution over the parameters that must be estimated. The prior reflects the information
known to the researcher without reference to the dataset on which the model is estimated.
Bayes’ theorem
Bayes' theorem is a fundamental principle of probability theory that describes the relationship
between conditional probabilities. Bayes' theorem states that the probability of an event A
occurring, given that event B has occurred, is equal to the probability of event B occurring,
given that event A has occurred, multiplied by the probability of event A occurring, divided
by the probability of event B occurring:
P(A|B) = P(B|A) x P(A) / P(B)
where:
P(A|B) is the conditional probability of event A occurring given that event B has occurred.
P(B|A) is the conditional probability of event B occurring given that event A has occurred.
P(A) is the prior probability of event A occurring.
P(B) is the prior probability of event B occurring.
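A minimal sketch of Bayes' theorem in code; the diagnostic-test numbers (prevalence, sensitivity, false-positive rate) are assumptions chosen for illustration:

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Illustrative numbers: 1% prevalence, 90% sensitivity, 5% false-positive rate.
p_disease = 0.01
p_pos_given_disease = 0.90
# P(positive) by the law of total probability.
p_pos = p_pos_given_disease * p_disease + 0.05 * (1 - p_disease)

# Probability of actually having the disease given a positive test.
print(round(bayes(p_pos_given_disease, p_disease, p_pos), 3))
```

Even with a fairly accurate test, the low prior probability keeps the posterior well below 50%, which is the kind of belief-updating this chapter describes.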
Likelihood ratio
The likelihood ratio is a measure of the strength of evidence provided by the data in favor
of one statistical hypothesis over another. Specifically, it is the ratio of the likelihoods
of the data under the two hypotheses being compared.
Suppose we have two statistical hypotheses: H1 and H2. The likelihood ratio is given by:
LR = L(D|H1) / L(D|H2)
where L(D|H1) is the likelihood of the data D under hypothesis H1, and L(D|H2) is the
likelihood of the data under hypothesis H2.
Odds and Likelihood Ratios
Odds and likelihood ratios are both measures of the strength of evidence provided by data in
favor of one hypothesis over another.
Odds are a way of expressing the likelihood of an event occurring, relative to the likelihood
of the event not occurring. The odds can range from 0 to infinity, with values greater than 1
indicating that the event is more likely to occur than not, and values less than 1 indicating that
the event is less likely to occur than not. The odds of an event A occurring are defined as the
ratio of the probability of A to the probability of not A:
odds(A) = P(A) / P(not A)
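The probability-to-odds conversion (and its inverse) can be sketched directly:

```python
def odds(p):
    """Convert a probability into odds in favour of the event."""
    return p / (1 - p)

def probability(o):
    """Convert odds back into a probability."""
    return o / (1 + o)

print(odds(0.75))        # 3.0 (the event is three times as likely as not)
print(probability(3.0))  # 0.75
```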
CHAPTER 12
How Things Go Wrong
In this chapter David Spiegelhalter discusses the potential sources of error and bias in
statistical analysis and how they can lead to incorrect conclusions. The author highlights the
importance of understanding the limitations of statistical analysis and the potential for errors
to arise due to factors such as data quality, measurement error, model assumptions, and
human biases.
Reproducibility crisis
The reproducibility crisis in statistics refers to the growing concern that many published
scientific findings may not be replicable by other researchers or may be subject to biases and
errors that affect the validity of the results. This crisis has been particularly prominent in the
field of psychology, where a number of high-profile studies have failed to replicate in
subsequent research.
Deliberate Fraud
Deliberate fraud in statistics refers to instances where researchers intentionally manipulate
data or analysis methods to produce misleading or false results. Deliberate fraud in statistics
is a serious issue that can have significant consequences, including the potential to harm
individuals or society as a whole if the fraudulent results are used to make important
decisions.
One well-known example of deliberate fraud in statistics is the case of Andrew Wakefield, a
former medical researcher who claimed to have found a link between the MMR vaccine and
autism. Wakefield's study was found to be fraudulent, with manipulated data and conflicts of
interest that were not disclosed. The study led to a decrease in vaccination rates and an
increase in the number of cases of measles, mumps, and rubella, highlighting the potential
harm caused by deliberate fraud in statistics.
Communication Breakdown
Communication breakdown in statistics refers to instances where there is a failure to
effectively communicate statistical information or concepts to non-expert audiences. This can
lead to misunderstandings, misinterpretations, and a lack of trust in the validity of statistical
information.
CHAPTER 13
How We Can Do Statistics Better
This chapter focuses on ways to improve the practice of statistics, particularly in light of the
challenges and issues discussed in previous chapters. The chapter begins by highlighting the
importance of transparency and reproducibility in statistical research. The author emphasizes
the need for researchers to share their data and code, as well as to use pre-registration and
open access journals. These practices can help to increase the transparency and replicability
of statistical research, which is crucial for building trust in the validity of statistical results.
If we want the use of statistics to improve, three groups need to act:
Trustworthy statistical claims should be, among other qualities:
Assessable: if they wish to, audiences should be able to check the reliability of the
claims.
Useable: audiences should be able to exploit the information for their needs.
Data Ethics
Data ethics refers to the ethical considerations surrounding the collection, use, and sharing of
data in statistical research. As statistical analysis often involves sensitive personal or
confidential information, it is important to consider the ethical implications of data use and
ensure that research is conducted in an ethical and responsible manner.
The analysis uses a repertoire of techniques:
(i) Data to Sample: Since these are exit polls and the respondents are saying what they have
done and not what they intend to do, experience suggests the responses should be reasonably
accurate measures of what people actually voted at this and previous elections.
(ii) Sample to Study Population: A representative sample is taken of those who actually
voted in each polling station, so the results from the sample can be used to roughly estimate
the change in vote, or 'swing', in that small area.
(iii) Study Population to Target Population: Using knowledge of the demographics of each
polling station, a regression model is built that attempts to explain how the proportion of
people who change their vote between elections depends on the characteristics of the voters
in that polling area. In this way the swing does not have to be assumed to be the same
throughout the country, but is allowed to vary from area to area allowing, say, for whether
there is a rural or urban population. Then using the estimated regression model, knowledge of
the demographics of each of the 600 or so constituencies, and the votes cast at the previous
election, a prediction of the votes cast in this election can be made for each individual
constituency, even though most of the constituencies did not actually have any voters
interviewed in the exit poll.