
Classification of Data by Weka through Naïve Bayes and K-Nearest Neighbour Methods

1.1 Introduction on Weka


Weka is a popular open-source machine learning software suite that provides a collection of
data pre-processing, classification, regression, clustering, association rules, and visualization
tools. Developed in Java, Weka is widely used in both academia and industry for research,
education, and real-world applications.
Weka stands for "Waikato Environment for Knowledge Analysis" and was developed by the
Machine Learning Group at the University of Waikato in New Zealand. It provides a
graphical user interface (GUI) for users to interact with the various machine learning
algorithms and tools, making it an accessible tool for individuals with varying levels of
experience in machine learning. Weka itself is named after a flightless New Zealand bird.
Some of the key features of Weka include its ability to handle a wide variety of data formats,
support for a large number of machine learning algorithms, and the ability to easily compare
and evaluate different machine learning models. It also has an active community of
developers and users who contribute to its ongoing development and support.
Overall, Weka is a powerful and flexible tool that enables users to explore, analyse, and
model data in a user-friendly and accessible way.

Fig 1.1 Logo of WEKA

1.2 Sources of Data


The data used in Weka can come from a variety of sources, including research studies,
businesses, and government organizations. The specific organization that collected the data
will depend on the specific dataset being used.
Some of the sources of data that can be used with Weka include:

(i) Local files: Weka can read data from various file formats such as CSV, ARFF, and Excel
files stored on the local computer.
(ii) Remote databases: Weka can connect to various types of remote databases such as
MySQL, PostgreSQL, and Microsoft SQL Server.
(iii) Web sources: Weka can access data from web sources such as APIs and web services.

(iv) Data generators: Weka provides several built-in data generators that can generate
synthetic data sets for experimentation and testing.
(v) Pre-processed data: Weka can accept pre-processed data sets from other software or data
analysis tools.
(vi) Distributed data: Weka can work with distributed data stored on Hadoop Distributed
File System (HDFS) or other distributed file systems.
(vii) Streaming data: Weka can handle streaming data in real-time using various data stream
mining algorithms.
Here, the Iris Plants Database, created by R.A. Fisher, is used as the data source.

Fig 1.2 Home page of WEKA after uploading file

1.3 Variables
A variable is a characteristic that can be measured and that can assume different values.
Height, age, income, province or country of birth, grades obtained at school and type of
housing are all examples of variables.

1.3.1 Variables Identification
Variables identified in Weka refer to the different features or attributes that are used in the
machine learning analysis. For example, in a dataset about customer demographics, variables
might include age, income, and location.

1.3.2 Types of variables


Variables may be classified into two main categories: categorical and numeric. Each category is then divided into two subcategories: nominal or ordinal for categorical variables, and discrete or continuous for numeric variables. These types are briefly outlined in this section.
Categorical variables represent discrete categories, such as gender or occupation.

Fig 1.3 Types of Variables

(I) Categorical variables

A categorical variable (also called a qualitative variable) refers to a characteristic that cannot be quantified. Categorical variables can be either nominal or ordinal.

(A) Nominal variables


A nominal variable is one that describes a name, label or category without a natural order. Sex and type of dwelling are examples of nominal variables. In Table 1.1, the variable “mode of transportation for travel to work” is also nominal. Nominal variables represent values that do not have a specific order, such as country or customer ID, even when those values are stored as numbers.

Mode of transportation for travel to work Number of people


Car, truck, van as driver 9,929,470
Car, truck, van as passenger 923,975
Public transit 1,406,585
Walked 881,085
Bicycle 162,910
Other methods 146,835
Table 1.1 Example of Nominal variables

(B) Ordinal variables
An ordinal variable is a variable whose values are defined by an order relation between the
different categories. In Table 1.2, the variable “behaviour” is ordinal because the category
“Excellent” is better than the category “Very good,” which is better than the category
“Good,” etc. There is some natural ordering, but it is limited since we do not know by how
much “Excellent” behaviour is better than “Very good” behaviour.

Behaviour Number of students


Excellent 5
Very good 12
Good 10
Bad 2
Very bad 1
Table 1.2 Example of Ordinal variables
It is important to note that even if categorical variables are not quantifiable, they can appear
as numbers in a data set. Correspondence between these numbers and the categories is
established during data coding. To be able to identify the type of variable, it is important to
have access to the metadata (the data about the data) that should include the code set used for
each categorical variable. For instance, categories used in Table 1.2 could appear as a
number from 1 to 5: 1 for “very bad,” 2 for “bad,” 3 for “good,” 4 for “very good” and 5 for
“excellent.”

(II) Numeric variables


A numeric variable (also called a quantitative variable) is a quantifiable characteristic whose values are numbers (excluding numbers that are merely codes standing for categories). Numeric variables may be either continuous or discrete.

(A) Continuous variables


A variable is said to be continuous if it can assume an infinite number of real values within a given interval. For instance, consider the height of a student. The height cannot take just any value: it cannot be negative and it cannot be higher than three metres. But between 0 and 3, the number of possible values is theoretically infinite. A student may be 1.6321748755 … metres tall. In practice, the methods used and the accuracy of the measurement instrument will restrict the precision of the variable. The reported height would be rounded to the nearest centimetre, so it would be 1.63 metres. Age is another example of a continuous variable that is typically rounded down.

(B) Discrete variables


As opposed to a continuous variable, a discrete variable can assume only a finite number of
real values within a given interval. An example of a discrete variable would be the score
given by a judge to a gymnast in competition: the range is 0 to 10 and the score is always
given to one decimal (e.g. a score of 8.5). You can enumerate all possible values (0, 0.1, 0.2, …) and see that the number of possible values is finite: it is 101. Another example of a discrete variable is the number of people in a household, for a household of size 20 or less. The number of possible values is 20, because it is not possible for a household to include a number of people that is a fraction of an integer, such as 2.27.

1.4 Classifier
A classifier is an algorithm that takes input data and assigns it to one of several predefined categories or classes. The goal of a classifier is to learn a function that can accurately predict the class of new, unseen data.
There are various types of classifiers, some of which are listed below:

1.4.1 Naïve Bayes


Naive Bayes is a popular classification algorithm in Weka. It is a probabilistic algorithm that
is based on Bayes' theorem, which states that the probability of a hypothesis (such as a class
label) given some observed evidence (such as a set of features) is proportional to the
probability of the evidence given the hypothesis, multiplied by the prior probability of the
hypothesis.
To use the Naive Bayes algorithm in Weka, follow these steps:
1. Load your dataset into Weka using the "Open File" button on the main Weka screen.

2. Select the Naive Bayes algorithm from the list of available classifiers by clicking on
the "Classify" tab and then selecting "Naive Bayes" from the list.
3. Click on the "Start" button to begin the classification process. Weka will split your
data into training and testing sets, and will use the training set to build the Naive
Bayes model.
4. Once the model has been built, Weka will use it to classify the instances in the testing
set, and will display the results in the "Classify" panel.

It is important to note that the Naive Bayes algorithm makes the "naive" assumption that all
of the features are independent of each other, given the class label. This assumption may not
hold in all cases, but the algorithm is still widely used because it is computationally efficient
and often performs well in practice.
In addition to the basic Naive Bayes algorithm, Weka also provides several variations,
including Multinomial Naive Bayes and Complement Naive Bayes, which may be more
appropriate for certain types of datasets.
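The same workflow can also be scripted. The following is a minimal, hedged sketch using Weka's Java API (assuming the Iris data is stored in a file named iris.arff and that weka.jar is on the classpath); it loads the data, builds a Naive Bayes model and evaluates it with 10-fold cross-validation rather than the GUI's default split:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesIris {
    public static void main(String[] args) throws Exception {
        // Load the dataset and mark the last attribute as the class label
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Build and evaluate a Naive Bayes model with 10-fold cross-validation
        NaiveBayes nb = new NaiveBayes();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(nb, data, 10, new Random(1));

        System.out.println(eval.toSummaryString());      // accuracy, kappa, MAE, RMSE
        System.out.println(eval.toClassDetailsString()); // per-class TP rate, precision, recall
    }
}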

Fig 1.4 Use of Naïve Bayes Classifier

Fig 1.5 Output of Naïve Bayes Classifier

1.4.2 K-Nearest Neighbours


The k-Nearest Neighbours (k-NN) algorithm is a popular non-parametric classification and
regression algorithm in Weka. It is a simple algorithm that classifies a new instance based on
the class of its k nearest neighbours in the training set.
To use the k-NN algorithm in Weka, follow these steps:
1. Load your dataset into Weka using the "Open File" button on the main Weka screen.

2. Select the k-NN algorithm from the list of available classifiers by clicking on the
"Classify" tab and then selecting "IBk" from the list.
3. In the "IBk options" window, enter the value of k you want to use. You can also
specify the distance metric to use (e.g., Euclidean distance, Manhattan distance).
4. Click on the "Start" button to begin the classification process. Weka will split your
data into training and testing sets, and will use the training set to build the k-NN
model.
5. Once the model has been built, Weka will use it to classify the instances in the testing
set, and will display the results in the "Classify" panel.
It is important to note that the performance of the k-NN algorithm can be highly sensitive to
the value of k, the distance metric used, and the choice of features. It is often a good idea to
experiment with different values of k and different distance metrics to find the best results.
In addition to the basic k-NN algorithm, Weka also provides several variations, including
weighted k-NN and instance-based learning with the option to perform feature selection,
which may be more appropriate for certain types of datasets.
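As with Naive Bayes, the GUI steps above can be reproduced in code. The sketch below (again assuming an iris.arff file and weka.jar on the classpath) trains IBk with k = 3 on two thirds of the data and evaluates it on the remaining third; the value of k here is only an illustrative choice:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KnnIris {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));

        // Hold out one third of the data for testing (fold 0 of a 3-fold split)
        Instances train = data.trainCV(3, 0);
        Instances test  = data.testCV(3, 0);

        IBk knn = new IBk();
        knn.setKNN(3);                 // k = 3 nearest neighbours (illustrative choice)
        knn.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(knn, test); // classify the held-out instances
        System.out.println(eval.toSummaryString());
    }
}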

Fig 1.6 Output of k-NN classifier

1.4.3 Difference Between K-NN and Naïve Bayes


The main differences between k-NN and Naive Bayes in Weka are:

1. Approach: k-NN is a non-parametric method that makes predictions based on the k


nearest training instances to a new instance, while Naive Bayes is a probabilistic
method that estimates the likelihood of each class label given the observed feature
values.

2. Assumptions: k-NN assumes that instances that are close in feature space are likely to
belong to the same class, while Naive Bayes assumes that the features are independent
given the class label.
3. Training time: k-NN requires essentially no training time, as it simply stores the training instances, while Naive Bayes requires training time to estimate the probability distributions of each feature given each class label.
4. Performance: k-NN can perform well on datasets with complex decision boundaries
or noisy data, but may suffer from the curse of dimensionality as the number of
features increases. Naive Bayes can be sensitive to correlated features and may
perform poorly if the independence assumption is violated, but can be very efficient
and effective on high-dimensional datasets.
5. Interpretability: Naive Bayes is more interpretable than k-NN, as the probability
estimates for each class can be easily understood and used for decision-making.
Overall, the choice between k-NN and Naive Bayes in Weka depends on the characteristics
of the dataset, the desired level of interpretability, and the trade-off between accuracy and
efficiency. It is often a good idea to experiment with both algorithms and compare their
performance using evaluation metrics such as accuracy, precision, recall, F1-score, and
kappa.
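To make such a comparison concrete, here is a hedged sketch (same assumptions as before: an iris.arff file and Weka's Java API) that runs both classifiers through an identical 10-fold cross-validation and prints accuracy, kappa, MAE and RMSE side by side:

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareNbKnn {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] models = { new NaiveBayes(), new IBk(3) };
        for (Classifier model : models) {
            // Same folds and random seed for both models, so the comparison is fair
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(model, data, 10, new Random(1));
            System.out.printf("%-10s accuracy=%.2f%%  kappa=%.3f  MAE=%.4f  RMSE=%.4f%n",
                    model.getClass().getSimpleName(),
                    eval.pctCorrect(), eval.kappa(),
                    eval.meanAbsoluteError(), eval.rootMeanSquaredError());
        }
    }
}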

 Some common terms are:


k-NN

 Correctly Classified Instances: The number of instances that were correctly classified
by the k-NN algorithm.
 Incorrectly Classified Instances: The number of instances that were misclassified by
the k-NN algorithm.
 Kappa statistic: A measure of the agreement between the predicted and actual class
labels, taking into account the possibility of agreement by chance.
 Mean absolute error: The average absolute difference between the predicted and
actual class labels.
 Root mean squared error: The square root of the average squared difference between
the predicted and actual class labels.

Naive Bayes
 Correctly Classified Instances: The number of instances that were correctly classified
by the Naive Bayes algorithm.
 Incorrectly Classified Instances: The number of instances that were misclassified by
the Naive Bayes algorithm.

 Kappa statistic: A measure of the agreement between the predicted and actual class
labels, taking into account the possibility of agreement by chance.
 Log-likelihood: The log-likelihood of the Naive Bayes model, which is a measure of
the goodness of fit of the model to the training data.
 Prior probabilities: The probabilities of each class label in the training data.

 Conditional probabilities: The estimated probabilities of each feature value given each
class label, which are used to calculate the posterior probabilities and make
predictions.

1.5 Interpretation of the above data


The first section of the output provides a summary of the k-NN model's performance on the
iris dataset:
 Correctly Classified Instances: This indicates that all 150 instances in the dataset are classified correctly by the k-NN model.
 Incorrectly Classified Instances: This indicates that none of the 150 instances in the dataset are classified incorrectly by the k-NN model.
 Kappa statistic: This is a measure of agreement between the predicted and actual class
labels, taking into account the possibility of agreement occurring by chance. A value of 1
indicates perfect agreement, while a value of 0 indicates agreement no better than
chance. In this case, the kappa statistic is 1, indicating strong agreement between the
predicted and actual class labels.
 Mean absolute error: This is the average absolute difference between the predicted and
actual class labels. In this case, the mean absolute error is 0.0085.
 Root mean squared error: This is the square root of the average squared difference
between the predicted and actual class labels. In this case, the root mean squared error is
0.0091.
 Relative absolute error: This is the mean absolute error expressed as a percentage of the error made by a simple baseline predictor (one that always predicts the class prior). In this case, the relative absolute error is 1.9219%.
 Root relative squared error: This is the root mean squared error expressed as a percentage of the error made by the same baseline predictor. In this case, the root relative squared error is 1.9335%.

The second section of the output provides a detailed breakdown of the model's performance
for each class in the dataset:
1. Iris-setosa:
• TP rate: 1.000
• FP rate: 0.000
• Precision: 1.000

• Recall: 1.000
• F-measure: 1.000
• ROC Area: 1.000
Interpretation: The k-NN model correctly classified all 50 instances of Iris-setosa, with no
false positives or false negatives. This suggests that the model is very accurate in identifying
Iris-setosa.

2. Iris-virginica:
• TP rate: 1.000
• FP rate: 0.000
• Precision: 1.000
• Recall: 1.000
• F-measure: 1.000
• ROC Area: 1.000
Interpretation: The k-NN model correctly classified all 50 instances of Iris-virginica, with
no false positives or false negatives. This suggests that the model is very accurate in
identifying Iris-virginica.

3. Iris-versicolor:
• TP rate: 1.000
• FP rate: 0.000
• Precision: 1.000
• Recall: 1.000
• F-measure: 1.000
• ROC Area: 1.000
Interpretation: The k-NN model correctly classified all 50 instances of Iris-versicolor, with
no false positives or false negatives. This suggests that the model is very accurate in
identifying Iris-versicolor. The model had a weighted average TP rate of 1 and a weighted
average precision of 1, indicating that model performed very well overall in classifying the
instances in the iris dataset.

The first section of the output provides a summary of the Naïve Bayes model's performance
on the iris dataset:
 Correctly Classified Instances: This indicates that out of the 150 instances in the dataset, 144 are classified correctly by the Naïve Bayes model, i.e. 96% of instances are correctly classified.
 Incorrectly Classified Instances: This indicates that out of the 150 instances in the dataset, 6 are classified incorrectly by the Naïve Bayes model, i.e. 4% of instances are incorrectly classified.
 Kappa statistic: This is a measure of agreement between the predicted and actual class
labels, taking into account the possibility of agreement occurring by chance. A value of 1
indicates perfect agreement, while a value of 0 indicates agreement no better than
chance. In this case, the kappa statistic is 0.94, indicating near perfect agreement
between the predicted and actual class labels.
 Mean absolute error: This is the average absolute difference between the predicted and
actual class labels. In this case, the mean absolute error is 0.0324.
 Root mean squared error: This is the square root of the average squared difference
between the predicted and actual class labels. In this case, the root mean squared error is
0.1495.
 Relative absolute error: This is the mean absolute error expressed as a percentage of the error made by a simple baseline predictor (one that always predicts the class prior). In this case, the relative absolute error is 7.2883%.
 Root relative squared error: This is the root mean squared error expressed as a percentage of the error made by the same baseline predictor. In this case, the root relative squared error is 31.7089%.
The second section of the output provides a detailed breakdown of the model's performance
for each class in the dataset:

1. Iris-setosa:
• TP rate: 1.000
• FP rate: 0.000
• Precision: 1.000
• Recall: 1.000
• F-measure: 1.000
• ROC Area: 1.000
Interpretation: The Naïve Bayes model correctly classified all 50 instances of Iris-setosa,
with no false positives or false negatives. This suggests that the model is very accurate in
identifying Iris-setosa.

2. Iris-virginica:
• TP rate: 0.920
• FP rate: 0.020
• Precision: 0.958
• Recall: 0.920
• F-measure: 0.939
• ROC Area: 0.993
Interpretation: The Naïve Bayes model correctly classified 46 out of 50 instances of Iris-virginica, i.e. 4 instances were misclassified. The TP rate and precision were both high, indicating that the model is generally accurate in identifying Iris-virginica, with the precision slightly higher than the TP rate.

3. Iris-versicolor:
• TP rate: 0.960
• FP rate: 0.040
• Precision: 0.923
• Recall: 0.960
• F-measure: 0.941
• ROC Area: 0.993
Interpretation: The Naïve Bayes model correctly classified 48 out of 50 instances of Iris-versicolor, with 2 instances misclassified. The TP rate is slightly higher than the precision, indicating that the model is generally accurate in identifying Iris-versicolor.
The model had a weighted average TP rate of 0.960 and a weighted average precision of
0.960, indicating that model performed well overall in classifying the instances in the iris
dataset.

MS – EXCEL
2.1 Correlation
Correlation is a statistical measure that expresses the extent to which two variables are
linearly related (meaning they change together at a constant rate). It is a common tool for describing simple relationships without making a statement about cause and effect.

 The correlation coefficient is denoted by 'r'.


 There are two main types of correlation coefficients: Pearson's product moment
correlation coefficient and Spearman's rank correlation coefficient.

Chart 2.1 Showing Positive, Negative and No correlation

 Pearson's product moment correlation coefficient


It is used when both variables being studied are normally distributed. This coefficient is affected by extreme values, which may exaggerate or dampen the strength of the relationship, and is therefore inappropriate when either or both variables are not normally distributed. For a correlation between variables x and y, the sample Pearson's correlation coefficient is given by

r = Σ(xi − x̄)(yi − ȳ) / √( Σ(xi − x̄)² · Σ(yi − ȳ)² )

where x̄ and ȳ are the sample means of x and y.

 Spearman's rank correlation coefficient

It is appropriate when one or both variables are skewed or ordinal, and it is robust when extreme values are present. For a correlation between variables x and y, the sample Spearman's correlation coefficient is given by

ρ = 1 − 6 Σ di² / ( n (n² − 1) )

where di is the difference in ranks for x and y, and n is the number of paired observations.
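As a small illustration outside Excel, the sample Pearson coefficient can be computed directly from the formula above; the following Java sketch uses made-up data and should give the same value as Excel's CORREL function:

public class PearsonCorrelation {
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n;
        meanY /= n;

        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX, dy = y[i] - meanY;
            cov  += dx * dy;   // numerator: sum of cross-deviations
            varX += dx * dx;   // denominator parts: sums of squared deviations
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};   // hypothetical data
        double[] y = {2, 4, 5, 4, 5};
        System.out.println("r = " + pearson(x, y));
    }
}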

2.1.1 Uses of correlation

 Correlation coefficients are used to measure the strength of the linear relationship
between two variables.
 A correlation coefficient greater than zero indicates a positive relationship while a
value less than zero signifies a negative relationship.
 A value of zero indicates no relationship between the two variables being compared.
 A negative correlation, or inverse correlation, is a key concept in the creation of
diversified portfolios that can better withstand portfolio volatility.
 Calculating the correlation coefficient is time-consuming, so data are often plugged
into a calculator, computer, or statistics program to find the coefficient.
2.1.2 Calculation of correlation in excel

Fig 2.1 Use of Correlation formula

Fig 2.2 Calculation of Correlation

2.1.3 Applications of correlation


1. Finance: Correlation is used in finance to analyse the relationship between different
assets, such as stocks or bonds. It helps investors to diversify their portfolios and
manage risk.
2. Economics: Correlation is used in economics to analyse the relationship between
different economic variables, such as inflation and unemployment. It helps
economists to understand the behaviour of the economy and make predictions about
future trends.
3. Social sciences: Correlation is used in social sciences, such as psychology and
sociology, to analyse the relationship between different variables, such as income and
happiness or education level and job satisfaction. It helps researchers to understand
the factors that influence human behaviour.
4. Medicine: Correlation is used in medicine to analyse the relationship between
different variables, such as diet and health outcomes or smoking and lung cancer. It
helps doctors and researchers to understand the risk factors for different diseases and
develop effective treatments.
Overall, correlation helps to identify patterns, make predictions, and understand the behaviour of complex systems.

2.2 Standard deviation

A standard deviation (or σ) is a measure of how dispersed the data is in relation to the mean. A low standard deviation means the data are clustered around the mean, while a high standard deviation indicates the data are more spread out. The standard deviation is used to determine how far the estimations for a group of observations (i.e., a data set) are spread out from the mean (average or expected value).

Chart 2.2 High and low standard deviation curves

To calculate the (population) standard deviation, use the following formula:

σ = √( Σ (xi − µ)² / N )

where σ is the standard deviation, xi is each data point in the set, µ is the mean, and N is the total number of data points.
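The formula can be applied directly in code as well. The short Java sketch below uses hypothetical values and computes both the population standard deviation (Excel's STDEV.P) and the sample standard deviation (Excel's STDEV.S):

public class StandardDeviation {
    public static void main(String[] args) {
        double[] data = {4, 8, 6, 5, 3, 2, 8, 9, 2, 5};   // hypothetical values

        double mean = 0;
        for (double v : data) mean += v;
        mean /= data.length;

        double sumSq = 0;
        for (double v : data) sumSq += (v - mean) * (v - mean);

        double sigma = Math.sqrt(sumSq / data.length);        // population SD (divide by N)
        double s     = Math.sqrt(sumSq / (data.length - 1));  // sample SD (divide by N - 1)
        System.out.println("population sd = " + sigma + ", sample sd = " + s);
    }
}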

2.2.1 Uses of Standard deviation


Standard deviation is widely used by portfolio managers to measure and track risk. One of the most important ratios in portfolio management, the Sharpe Ratio (for which William Sharpe received a Nobel Prize), uses standard deviation to measure risk-adjusted return.

2.2.2 Calculation of Standard Deviation in Excel

Fig. 2.3 Use of Standard deviation formula


Fig 2.4 Calculation of Standard Deviation

2.2.3 Applications of Standard deviation


1. Measure of variability
Standard deviation is a measure of variability in a set of data. It shows how much the
individual data points differ from the mean or average value of the data set. It helps to
understand the spread of data and identify outliers or extreme values.
2. Assessing the quality of data
Standard deviation can be used to assess the quality of data by checking if the data
points are closely clustered around the mean or widely dispersed. Data with a low
standard deviation is considered to be of better quality than data with a high standard
deviation.

3. Confidence interval
Standard deviation is used to calculate the confidence interval, which is the range of
values that we can be confident the true value lies in. It is used to estimate the
population parameter based on a sample.

4. Risk management
Standard deviation is used in finance and investment to measure the risk associated
with a particular investment. It helps investors to understand the volatility of returns
and the likelihood of a loss.

5. Quality control
Standard deviation is used in manufacturing to control the quality of products. It is
used to measure the variation in product characteristics and to identify defects or
inconsistencies.

6. Scientific research
Standard deviation is used in scientific research to analyse and interpret data. It is
used to compare different groups, assess the significance of results, and identify
trends and patterns in data.
Overall, standard deviation is a valuable statistical measure that helps to understand the variation and distribution of data.

2.3 Co- Efficient of Variation

In probability theory and statistics, the coefficient of variation (CV), also known as the relative standard deviation, is a dimensionless measure of dispersion that gives the extent of variability in data relative to the mean. It is very useful for comparing two data sets with differing units.
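Since the CV is simply the standard deviation divided by the mean (often quoted as a percentage), it is straightforward to compute; the Java sketch below uses hypothetical return figures:

public class CoefficientOfVariation {
    public static void main(String[] args) {
        double[] returns = {0.05, 0.02, -0.01, 0.04, 0.03};   // hypothetical data

        double mean = 0;
        for (double v : returns) mean += v;
        mean /= returns.length;

        double sumSq = 0;
        for (double v : returns) sumSq += (v - mean) * (v - mean);
        double sd = Math.sqrt(sumSq / (returns.length - 1));   // sample standard deviation

        double cv = sd / mean * 100;   // CV expressed as a percentage of the mean
        System.out.println("CV = " + cv + "%");
    }
}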

2.3.1 Uses of the Coefficient of Variation

 The co-efficient of variation (CV) is a statistical measure of the relative dispersion of


data points in a data series around the mean.
 It represents the ratio of the standard deviation to the mean.
 The CV is useful for comparing the degree of variation from one data series to
another, even if the means are drastically different from one another.
 In finance, the co-efficient of variation allows investors to determine how much
volatility, or risk, is assumed in comparison to the amount of return expected from
investments.
 The lower the ratio of the standard deviation to the mean return, the better the risk-return trade-off.
 For example, biologists and researchers often use it in their observations to calculate
repeatability within their data results. Educators may also apply the CV to compare
teaching methodologies, discovering what leads to higher grade point averages.

2.3.2 Calculation of Coefficient of Variation in Excel

Fig 2.5 Calculation of Coefficient of Variation

2.3.3 Applications of Coefficient of Variation


Some of the most common uses of the CV include:

1. Comparing the variability of datasets: The CV is a measure of relative variability


and can be used to compare the variability of different datasets, regardless of their
units of measurement. This is especially useful when comparing datasets with
different means, such as the heights of two different populations.
2. Evaluating the quality of data: A low CV indicates that the data is more precise and
consistent, while a high CV indicates that the data is more variable and less
consistent. By calculating the CV, you can assess the quality of the data and identify
any potential issues or outliers.
3. Analysing the risk and return of financial investments: The CV is commonly used
in finance to analyse the risk and return of different investments. A higher CV
indicates a higher risk and potential for greater returns, while a lower CV indicates
lower risk and potential for lower returns.
4. Assessing the reliability of laboratory measurements: In scientific research, the
CV is often used to assess the reliability of laboratory measurements. A low CV
indicates that the measurements are precise and consistent, while a high CV indicates
that the measurements are more variable and less reliable.
5. Monitoring the quality of manufacturing processes: In industrial settings, the CV
can be used to monitor the quality of manufacturing processes.

2.4 Methods for Graphical Representation
2.4.1 Bar Chart
A bar chart displays data using rectangular bars that are proportional to the values being
represented. The bars can be vertical or horizontal. In a vertical bar chart, the x-axis
represents categories or labels, and the y-axis represents the values being measured. In a
horizontal bar chart, the y-axis represents categories or labels, and the x-axis represents the
values being measured. Bar charts are useful for comparing data across categories or for
showing changes in data over time.
Here are some of the common uses of bar charts:

 Comparison: Bar charts are great for comparing values across different categories.
The height or length of each bar corresponds to the value being represented, making it
easy to compare the values visually.
 Distribution: Bar charts can also be used to show the distribution of data within a
category. For example, you can create a frequency distribution chart to show the
number of times a particular value occurs within a dataset.
 Ranking: Bar charts can also be used to rank categories based on their values. This is
useful for showing the relative importance of different categories or for identifying
the top performers in a dataset.
 Time Series: Bar charts can also be used to represent changes in data over time. This
is done by plotting the values on the y-axis and the time periods on the x-axis.

Chart 2.3 Example of Bar chart

2.4.2 Pie Chart
A pie chart displays data as a circle divided into segments, with each segment representing a
proportion of the whole. Each segment is labelled with the corresponding value or category.
Pie charts are useful for showing the relative proportions of different categories or values in a
dataset. However, they can be more difficult to read than other chart types if there are too
many segments or if the segments are too small.
Here are some common uses of pie charts:
 Proportions: Pie charts are useful for showing the proportion of different categories
or values in a dataset. The size of each segment corresponds to the proportion of the
whole that it represents.
 Comparisons: Pie charts can be used to compare the relative sizes of different
categories or values. This is done by comparing the sizes of the different segments.
 Composition: Pie charts can also be used to show the composition of a whole. For
example, you can create a pie chart to show the different components that make up a
total value.
 Percentages: Pie charts are also useful for showing percentages. Each segment can be
labelled with its corresponding percentage, making it easy to understand the relative
sizes of each category or value.
 Limitations: Pie charts are not suitable for showing large numbers of categories or
values, as the segments can become difficult to distinguish. Additionally, it can be
difficult to compare the sizes of different segments that are similar in size.

Chart 2.4 Example of Pie chart

2.4.3 Line Chart
A line chart displays data as a series of points connected by lines. The x-axis represents time
or categories, and the y-axis represents the values being measured. Line charts are useful for
showing trends in data over time or for comparing multiple datasets. They can also be used to
show changes in a single dataset over time.
Here are some common uses of line diagrams:
1. Trends: Line diagrams are commonly used to show trends in data over time. This is
done by plotting the values on the y-axis and the time periods on the x-axis. By
connecting the points with lines, it is easy to see how the values change over time.
2. Comparison: Line diagrams can be used to compare multiple datasets. By plotting
the data for each dataset on the same chart, it is easy to compare the trends and
identify any differences or similarities.
3. Relationships: Line diagrams can also be used to show the relationship between two
variables. This is done by plotting one variable on the x-axis and the other variable on
the y-axis. By connecting the points with lines, it is easy to see how the two variables
are related.
4. Forecasting: Line diagrams can also be used for forecasting. By projecting the trend
line into the future, it is possible to make predictions about future values.
5. Limitations: Line diagrams are not suitable for showing categorical data, as the x-
axis is typically used for continuous numerical values such as time or measurements.
Additionally, line diagrams may not be effective for showing data with large
fluctuations or outliers, as these can distort the trend line.

Chart 2.5 Example of Line chart

Book Review
Book Name - The Art of Statistics: How to Learn from Data
Author - David Spiegelhalter

Introduction of the Art of Statistics


The Art of Statistics (2019) is a non-technical introduction to the basic concepts of statistical
science. Side-lining abstract mathematical analyses in favour of a more human-oriented
approach, it explains how statistical science is helping us to answer questions and tell more
informative stories. Stepping beyond the numbers, it also considers the role that the media
and psychological bias play in the distortion of statistical claims. The book gives readers the tools and knowledge needed to understand and evaluate such claims.
The book covers a wide range of statistical topics, including probability theory, hypothesis
testing, regression analysis, and Bayesian statistics. Spiegelhalter uses real-world examples
and case studies throughout the book to illustrate the concepts and to show how statistics can
be used to make sense of data in a variety of fields, from sports and politics to health and
economics.
One of the strengths of the book is its focus on the practical aspects of statistics, rather than
just the theory. Spiegelhalter emphasizes the importance of understanding the limitations of
data and the potential for bias and error in statistical analysis, and provides practical tips and
guidelines for making good statistical decisions.

About the Author


David Spiegelhalter is a British statistician and statistics communicator. One of the most
cited and influential researchers in his field, he serves as the Winton Professor for the Public
Understanding of Risk in the Statistical Laboratory at the University of Cambridge. He was
president of the Royal Statistical Society for 2017 and 2018.
Spiegelhalter has made significant contributions to the field of statistics, particularly in the
areas of Bayesian statistics, statistical methodology, and the communication of statistical
information. He has published numerous papers in academic journals and is the author of
several books, including "The Art of Statistics: How to Learn from Data" (2019), which has
been widely acclaimed for its accessible and engaging approach to teaching statistics to a
general audience.

Chapter 1
Getting Things in Proportion: Categorical Data and Percentages
In this chapter, Spiegelhalter provides an overview of what statistics is and its various
applications. He discusses the different types of data and introduces the concept of
probability.

Categorical Data
Categorical data is a collection of information that is divided into groups. For example, if an organisation or agency collects the biodata of its employees, the resulting grouped data is referred to as categorical.

Types of Categorical Data


In general, categorical data has values and observations which can be sorted into categories
or groups. The best way to represent these data is bar graphs and pie charts. Categorical data
are further classified into two types namely,

1. Nominal Data
Nominal data is a type of data that is used to label variables without providing any numerical value. It is also known as the nominal scale. Nominal data cannot be ordered or measured. Some common examples of nominal data are letters, words, symbols and gender. These data are
analysed with the help of the grouping method. The variables are grouped together into
categories and the percentage or frequency can be calculated. It can be presented visually
using the pie chart.
2. Ordinal Data
Ordinal data is a type of data that follows a natural order. A notable feature of ordinal data is that the differences between data values cannot be meaningfully measured. It is commonly encountered in surveys, questionnaires, finance and economics. The data can be analysed using visualisation tools and is commonly represented using a bar chart. Sometimes the data may be represented using tables in which each row indicates a distinct category.

Categorical data - Bar Graphs


A bar graph plots numeric values for levels of a categorical feature as bars. Levels are plotted
on one chart axis, and values are plotted on the other axis. Each categorical value claims one
bar, and the length of each bar corresponds to the bar's value.

Categorical Data - Pie Charts


Pie charts make sense to show a parts-to-whole relationship for categorical or nominal data.
The slices in the pie typically represent percentages of the total. With categorical data, the
sample is often divided into groups and the responses have a defined order.

CHAPTER 2
Summarizing and Communicating Numbers. Lots of Numbers

This chapter covers the basics of data summarisation, including measures of central tendency
and variability. Spiegelhalter explains how to use graphs and charts to summarise data
effectively.
There are three ways of presenting the pattern of the values. These patterns can be variously
termed the data distribution, sample distribution or empirical distribution.
(a) The strip-chart/dot-diagram
It shows each data-point as a dot, but each one is given a random jitter to prevent points with the same value lying on top of each other and obscuring the overall pattern. These types of charts are used to graphically depict data trends or groupings.

(b) The box-and-whisker plot


Box and whisker plots are very effective and easy to read, as they can summarize data from
multiple sources and display the results in a single graph. It summarizes some essential
features of the data distribution.

(c) Histogram
It counts how many data-points lie in each of a set of intervals – it gives a very rough idea of
the shape of the distribution.

Skewed data distribution


A skewed distribution occurs when one tail is longer than the other. Skewness defines the
asymmetry of a distribution. Unlike the familiar normal distribution with its bell-shaped
curve, these distributions are asymmetric. The two halves of the distribution are not mirror
images because the data are not distributed equally on both sides of the distribution's peak.

Variables which are recorded as numbers come in different varieties

(a) Count variables: where measurements are restricted to the integers 0, 1, 2 … For
example, the number of homicides each year, or guesses at the number of jelly beans in a jar.
(b) Continuous variables: measurements that can be made, at least in principle, to arbitrary
precision. For example, height and weight, each of which might vary both between people
and from time to time. These may, of course, be rounded to whole numbers of centimetres or
kilograms.

There are three basic interpretations of the term 'average', sometimes jokingly referred to by the single term 'mean-median-mode'. These are also known as measures of the location of the data distribution:
Mean: the sum of the numbers divided by the number of cases.

Median: the middle value when the numbers are put in order. This is how Galton
summarized the votes of his crowd.
Mode: the most common value.

 Range: the difference between the highest and lowest values
 Interquartile range: the range of the middle half of the distribution
 Standard deviation: a measure of the typical distance of data points from the mean
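A small sketch, using made-up values, showing how the three measures of location are computed in code:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class Averages {
    public static void main(String[] args) {
        double[] data = {2, 3, 3, 5, 7, 10};   // hypothetical values

        // Mean: the sum of the numbers divided by the number of cases
        double mean = Arrays.stream(data).average().orElse(Double.NaN);

        // Median: the middle value of the sorted data (average of the two middle values if n is even)
        double[] sorted = data.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        double median = (n % 2 == 1) ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;

        // Mode: the most common value
        Map<Double, Integer> counts = new HashMap<>();
        for (double v : data) counts.merge(v, 1, Integer::sum);
        double mode = data[0];
        for (Map.Entry<Double, Integer> e : counts.entrySet())
            if (e.getValue() > counts.get(mode)) mode = e.getKey();

        System.out.printf("mean=%.2f median=%.2f mode=%.1f%n", mean, median, mode);
    }
}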

Pearson correlation coefficient


The Pearson correlation coefficient (r) is the most common way of measuring a linear correlation. It is a number between –1 and 1 that measures the strength and direction of the relationship between two variables: a positive r means the two variables tend to change in the same direction, while a negative r means they tend to change in opposite directions.

Spearman’s rank correlation


The Spearman rank-order correlation is the nonparametric version of the Pearson product-moment correlation. Spearman's correlation coefficient (ρ, also denoted rs) measures the strength and direction of association between two ranked variables. It basically gives the measure of the monotonicity of the relation between two variables, i.e. how well the relationship between the two variables could be represented using a monotonic function.
Communication
The first rule of communication is to shut up and listen, so that you can get to know about the
audience for your communication, whether it might be politicians, professionals or the
general public.
The second rule of communication is to know what you want to achieve. Hopefully the aim is
to encourage open debate, and informed decision making.
Storytelling with Statistics
Infographics highlight interesting features and can guide the viewer through a story, but
should be used with awareness of their purpose and their impact. Sophisticated infographics
regularly appear in the media.

CHAPTER 3
Why Are We Looking at Data Anyway? Populations and Measurement
Here, Spiegelhalter introduces the normal distribution and explains its importance in
statistics. He shows how to use the normal distribution to make predictions and construct
confidence intervals.

Inductive Inference
Inductive inference is based on a generalization from a finite set of past observations, extending the observed pattern or relation to future instances or instances occurring elsewhere. Inductive inference starts from propositions about data and ends in propositions that extend beyond the data. An example of an inductive inference is that, from the proposition that up until now all observed pears were green, we conclude that the next few pears will be green as well.

Population distribution
Population distribution denotes the spatial pattern due to dispersal of population, formation
of agglomeration, linear spread etc. Population density is the ratio of people to physical
space. It shows the relationship between a population and the size of the area in which it
lives.
There are four types of population. They are:

Finite Population
The finite population is also known as a countable population in which the population can be
counted. In other words, it is defined as the population of all the individuals or objects that
are finite. For statistical analysis, a finite population is more advantageous than an infinite one. Examples of finite populations are the employees of a company or the potential consumers in a market.
Infinite Population
The infinite population is also known as an uncountable population, in which counting the units in the population is not possible. An example of an infinite population is the number of germs in a patient's body, which is uncountable.

Existent Population
The existent population is defined as the population of concrete individuals; in other words, a population whose units exist in physical form is known as an existent population. Examples are books, students etc.

Hypothetical Population
The population whose units do not exist in physical form is known as the hypothetical population. A population consists of sets of observations, objects etc. that all have something in common. In some situations, the population is only hypothetical. Examples are the outcomes of rolling a die or tossing a coin.

Bell shaped curve
A bell curve is a common type of distribution for a variable, also known as the normal
distribution. The term "bell curve" originates from the fact that the graph used to depict a
normal distribution consists of a symmetrical bell-shaped curve.
The highest point on the curve, or the top of the bell, represents the most probable event in a
series of data (its mean, mode, and median in this case), while all other possible occurrences
are symmetrically distributed around the mean, creating a downward-sloping curve on each
side of the peak. The width of the bell curve is described by its standard deviation.

CHAPTER 4
What Causes What?
In this chapter, Spiegelhalter discusses the relationship between data description and
inference. He explains how to use hypothesis testing to make inferences about population
parameters.

Causation in Statistics
In statistics, an association describes a relationship between two events or two variables. Causation is present when the value of one variable or event increases or decreases as a direct result of the presence or lack of another variable or event. Causation is difficult to pin down or be certain about because circumstances and events can arise out of a complex interaction between multiple variables.

Simpson’s Paradox
Simpson's Paradox is a statistical phenomenon where an association between two variables in
a population emerges, disappears or reverses when the population is divided into
subpopulations. For instance, two variables may be positively associated in a population, but
be independent or even negatively associated in all subpopulations.

Reverse causation
Reverse causation is a phenomenon that describes the association of two variables differently
than you would expect. Instead of X causing Y, as is the case for traditional causation, Y
causes X. Some people refer to reverse causality as the "cart-before-the-horse bias" to
emphasize the unexpected nature of the correlation.

Lurking factors
A lurking variable is a variable that is unknown and not controlled for, yet has an important, significant effect on the variables of interest. Lurking variables are extraneous variables, but they may make the relationship between the dependent and independent variables seem other than it actually is. For example, in studies suggesting that marijuana use leads to harder drug use, a lurking factor may be that the people willing to try more illegal or more dangerous drugs are simply the same kinds of people who would also be okay with using both marijuana and alcohol.

CHAPTER 5
Modelling Relationships Using Regression
This chapter covers regression analysis, including simple and multiple regression models.
Spiegelhalter explains how to use regression analysis to make predictions and understand the
relationship between variables.

Regression
“Regression is the measure of the average relationship between two or more variables in terms of the original units of the data.” Regression modelling is the process of determining a relationship between one or more independent variables and one dependent (output) variable.
Example-

 Predicting the height of a person given the age of the person.
 Predicting the price of a car given the car model, year of manufacture, mileage, engine capacity etc.

Types of Regression Model


1. Simple Linear Regression
Assume that there is only one independent variable x. If the relationship between x (the independent variable) and y (the dependent or output variable) is modelled by the relation

y = a + bx

then the model is called a simple linear regression model (a worked least-squares sketch appears after this list of types).

2. Multiple Regression
Assume that there are multiple independent variables x1, x2, x3, …, xn. If the relationship between the independent variables and the dependent (output) variable y is modelled by the relation

y = a0 + a1*x1 + a2*x2 + … + an*xn

then the model is called a multiple regression model.

3. Polynomial Regression
Assume that there is only one independent variable x. If the relationship between the independent variable x and the dependent (output) variable y is modelled by the relation

y = a0 + a1*x + a2*x^2 + … + an*x^n

for some positive integer n > 1, then we have a polynomial regression.

4. Logistic Regression

Logistic regression is used when the dependent variable is binary (0/1, True/False, Yes/No) in nature.
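As referenced under Simple Linear Regression above, here is a minimal least-squares sketch (with hypothetical x/y values) that estimates the intercept a and slope b of y = a + bx and uses them for a prediction:

public class SimpleLinearRegression {
    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};        // hypothetical independent variable
        double[] y = {52, 58, 61, 67, 71};   // hypothetical dependent variable

        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= n;
        meanY /= n;

        double num = 0, den = 0;
        for (int i = 0; i < n; i++) {
            num += (x[i] - meanX) * (y[i] - meanY);
            den += (x[i] - meanX) * (x[i] - meanX);
        }
        double b = num / den;          // slope (regression coefficient)
        double a = meanY - b * meanX;  // intercept

        System.out.println("y = " + a + " + " + b + " * x");
        System.out.println("prediction at x = 6: " + (a + b * 6));
    }
}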

Regression Coefficient
A regression coefficient is an estimated parameter in a statistical model that expresses the strength of the relationship between an explanatory variable and an outcome in multiple regression analysis. The
coefficient will have a different interpretation depending on whether the outcome variable is
a continuous variable (multiple linear regression), a proportion (logistic regression), a count
(Poisson regression) or a survival time (Cox regression).

Residual error
Residual error is the generic term for the component of the data that cannot be explained by a statistical model, and so is said to be due to chance variation.

CHAPTER 6
Algorithms, Analytics and Prediction
Algorithm
Algorithm is a step-by-step procedure, which defines a set of instructions to be executed in a
certain order to get the desired output. Algorithms are generally created independent of
underlying languages, i.e. an algorithm can be implemented in more than one programming
language.
There are two broad tasks for such an algorithm:

1. Classification (also known as discrimination or supervised learning): To say what kind


of situation we're facing.

For example, the likes and dislikes of an online customer, or whether that object in a robot‟s
vision is a child or a dog.

2. Prediction: To tell us what is going to happen.

For example, what the weather will be next week, what a stock price might do tomorrow,
what products that customer might buy, or whether that child is going to run out in front of
our self-driving car etc.

Types of Algorithms
1. Brute Force Algorithm
This algorithm uses the general logic structure to design an algorithm. It is also called an
exhaustive search algorithm because it exhausts all possibilities to provide the required
solution.
There are two types of such algorithms:

 Optimizing: Finding all possible solutions to a problem and then selecting the best
one.
 Sacrificing: It stops as soon as a good-enough solution is found, sacrificing optimality for speed.

2. Divide and Conquer


 It breaks a problem down into smaller subproblems of the same kind.
 Each subproblem is solved (often recursively) and the partial results are combined to produce the solution to the original problem, generating valid output for valid input.

3. Greedy Algorithm
This is an algorithm paradigm/pattern that makes the best choice possible on each iteration in
the hopes of choosing the best solution.
It is simple to set up and has a shorter execution time.

4. Branch and Bound Algorithm

The branch and bound algorithm is typically used for integer programming and other combinatorial optimisation problems. This method divides the set of all feasible solutions into smaller subsets, which are then evaluated further to find the best solution.

5. Randomized Algorithm

A randomized algorithm uses random numbers at some step of its logic, so the same input may follow different execution paths. This is in contrast to a deterministic algorithm, which has a defined set of information and required results and follows a fixed, described sequence of steps for a given input. Randomized algorithms are often simpler or faster in practice than their deterministic counterparts.

6. Backtracking
It is an algorithmic procedure that builds a solution incrementally and recursively, discarding a partial solution (backtracking) as soon as it fails to satisfy the constraints of the problem. Following this understanding of what an algorithm is and its approaches, the chapter turns to how algorithms are analysed.

Analytics
Analytics is the scientific process of discovering and communicating the meaningful patterns
which can be found in data.

 Turning raw data into insight for making better decisions.
Analytics relies on the application of statistics, computer programming, and operations research to gain insight into the meaning of data. It is especially useful in areas which record a lot of data or information, such as:

 Web analytics
 Fraud analysis
 Risk analysis
 Advertisement and marketing
 Enterprise decision management

CHAPTER 7
How Sure Can We Be About What Is Going On? Estimates and Intervals

This chapter covers the concepts of estimates and intervals, including confidence intervals
and standard errors. Spiegelhalter shows how to use these concepts to quantify uncertainty
and make informed decisions.

Estimates and Intervals


In statistics, an estimate is a value that is used to represent an unknown parameter or quantity
of interest based on sample data. For example, if we want to estimate the mean height of all
students in a particular school, we might take a random sample of students and use the
sample mean height as our estimate of the population mean height.
A confidence interval is a range of values that is likely to contain the true value of the
parameter we are interested in, with a certain level of confidence. For example, if we
construct a 95% confidence interval for the population mean height, we can say that we are
95% confident that the true population mean height lies within the interval.

Margin of error
The margin of error in statistics is the degree of error in results received from random sampling surveys. A higher margin of error indicates less likelihood of relying on the results of a survey or poll, i.e. lower confidence that the results represent the population.
Bootstrapping
Bootstrapping is a statistical procedure that resamples a single dataset to create many
simulated samples. This process allows you to calculate standard errors, construct confidence
intervals, and perform hypothesis testing for numerous types of sample statistics. Bootstrap
methods are alternative approaches to traditional hypothesis testing and are notable for being
easier to understand and valid for more conditions.

 Bootstrapping a sample consists of creating new data sets of the same size by
resampling the original data, with replacement
 Sample statistics calculated from bootstrap resamples tend towards a normal
distribution for larger data sets, regardless of the shape of the original data
distribution.
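A hedged illustration of the procedure, using a small made-up sample: each bootstrap resample is drawn with replacement from the original data, the mean of each resample is recorded, and the middle 95% of those means gives a percentile bootstrap confidence interval:

import java.util.Arrays;
import java.util.Random;

public class BootstrapExample {
    public static void main(String[] args) {
        double[] sample = {2.1, 3.4, 2.8, 5.0, 4.2, 3.9, 2.5, 4.8};   // hypothetical data
        int resamples = 10000;
        Random rng = new Random(42);
        double[] bootMeans = new double[resamples];

        for (int r = 0; r < resamples; r++) {
            // Resample with replacement to the same size as the original data
            double sum = 0;
            for (int i = 0; i < sample.length; i++)
                sum += sample[rng.nextInt(sample.length)];
            bootMeans[r] = sum / sample.length;
        }

        // Percentile bootstrap 95% confidence interval for the mean
        Arrays.sort(bootMeans);
        double lower = bootMeans[(int) (0.025 * resamples)];
        double upper = bootMeans[(int) (0.975 * resamples)];
        System.out.println("95% bootstrap CI for the mean: [" + lower + ", " + upper + "]");
    }
}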

Sampling distribution
A sampling distribution is a probability distribution of a statistic that is obtained through
repeated sampling of a specific population. It describes a range of possible outcomes for a
statistic, such as the mean or mode of some variable, of a population.

Central limit theorem
The central limit theorem (CLT) states that the distribution of sample means approximates a
normal distribution as the sample size gets larger, regardless of the population's distribution.
Sample sizes equal to or greater than 30 are often considered sufficient for the CLT to hold.

CHAPTER 8
Probability – the Language of Uncertainty and Variability
What is Probability?
Probability denotes the possibility of the outcome of any random event. The meaning of this
term is to check the extent to which any event is likely to happen. For example, when we flip
a coin in the air, what is the possibility of getting a head? The answer to this question is
based on the number of possible outcomes. Here the outcome will be either a head or a tail, so the probability of getting a head is 1/2.
Probability is the measure of the likelihood that an event will happen; it measures the certainty of the event. The formula for probability is given by
P(E) = Number of Favourable Outcomes / Total Number of Outcomes
P(E) = n(E)/n(S)
where n(E) = number of outcomes favourable to event E and n(S) = total number of outcomes.
There are 3 types of probabilities:

1. Theoretical Probability

It is based on the possible chances of something happening. Theoretical probability is mainly based on reasoning about the outcomes. For example, if a coin is tossed, the theoretical probability of getting a head is 1/2.

2. Experimental Probability

It is based on the observations of an experiment. The experimental probability of an outcome is calculated as the number of times that outcome occurs divided by the total number of trials. For example, if a coin is tossed 10 times and a head is recorded 6 times, the experimental probability of heads is 6/10, or 3/5.

3. Axiomatic Probability:-

In axiomatic probability, a set of rules or axioms are set which applies to all types. These
axioms are set by Kolmogorov and are known as Kolmogorov‟s three axioms. With the
axiomatic approach to probability, the chances of occurrence or non-occurrence of the events
can be quantified. The axiomatic probability lesson covers this concept in detail with
Kolmogorov‟s three rules (axioms) along with various examples.
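
A minimal simulation sketch (with illustrative numbers) comparing experimental and theoretical probability for a fair coin:

import random

random.seed(0)
trials = 10_000
heads = sum(1 for _ in range(trials) if random.random() < 0.5)  # simulate fair coin tosses

experimental = heads / trials   # experimental probability of heads
theoretical = 0.5               # theoretical probability of heads
print(f"experimental = {experimental:.3f}, theoretical = {theoretical}")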

Probability Tree

A tree diagram helps to organize and visualize the different possible outcomes. Branches and
ends are the two main parts of the tree: the probability of each branch is written on the
branch, while the ends contain the final outcomes. Tree diagrams are used to work out when to
multiply probabilities and when to add them.
Conditional probability
Conditional probability is defined as the likelihood of an event or outcome occurring given
that a previous event or outcome has occurred. Equivalently, the joint probability of two
events equals the probability of the preceding event multiplied by the conditional probability
of the succeeding event given the first. In symbols,
P(B|A) = P(A∩B) / P(A)

where P denotes probability, A is event A and B is event B.
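
A minimal worked sketch (not from the book): with two fair dice, let A be the event that the first die is even and B the event that the total is 8; then P(B|A) = P(A∩B)/P(A) = (3/36)/(18/36) = 1/6.

from itertools import product

outcomes = list(product(range(1, 7), repeat=2))   # all 36 equally likely rolls of two dice
A = [o for o in outcomes if o[0] % 2 == 0]        # event A: first die is even
A_and_B = [o for o in A if sum(o) == 8]           # event A ∩ B: first die even and total is 8

p_A = len(A) / len(outcomes)
p_A_and_B = len(A_and_B) / len(outcomes)
print("P(B|A) =", p_A_and_B / p_A)                # (3/36) / (18/36) = 1/6 ≈ 0.167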

Application of probability

Probability theory is applied in everyday life in risk assessment and modeling. The insurance
industry and markets use actuarial science to determine pricing and make trading decisions.
Governments apply probabilistic methods in environmental regulation, entitlement analysis,
and financial regulation.

CHAPTER 9
Putting Probability and Statistics Together
In this chapter, Spiegelhalter discusses the importance of randomized trials in determining the
effectiveness of medical treatments and interventions. He explains how randomized trials are
designed and conducted, and how they differ from observational studies.

Binomial distribution
The binomial distribution is a statistical probability distribution that gives the likelihood
of obtaining a given number of successes in a fixed number of independent trials, where each
trial has only two possible outcomes and the same probability of success.
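
For a binomial random variable with n independent trials and success probability p, the probability of exactly k successes is P(X = k) = C(n, k) p^k (1 − p)^(n−k). A minimal sketch (the coin-toss numbers are illustrative):

from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 6 heads in 10 tosses of a fair coin
print(binomial_pmf(6, 10, 0.5))  # ≈ 0.205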

Standard error
The standard error is a statistical term that measures how accurately a sample distribution
represents a population, using the standard deviation. In statistics, the mean of any
particular sample deviates from the actual mean of the population; the typical size of this
deviation is the standard error of the mean.

• The standard error (SE) is the approximate standard deviation of a statistic across repeated samples from the population.
• The standard error of the mean describes how far the calculated sample mean is likely to be from the true population mean.
• The more data points involved in the calculation of the mean, the smaller the standard error tends to be.
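
A minimal sketch using the usual formula SE = s / √n, where s is the sample standard deviation and n the sample size (the data are hypothetical):

import math
import statistics

sample = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3, 12.2, 11.7]  # hypothetical measurements
se = statistics.stdev(sample) / math.sqrt(len(sample))     # standard error of the mean
print(f"standard error of the mean = {se:.3f}")
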
Funnel Plot
A funnel plot is a graphical tool used in statistics to assess whether there is publication bias or
small study effects in a meta-analysis or systematic review. A funnel plot is a scatter plot of
the effect sizes (usually standardized) of the included studies against a measure of their
precision, such as the standard error or the sample size.
The Central Limit Theorem
The Central Limit Theorem (CLT) is a fundamental concept in statistics that states that if a
random sample is taken from any population, the distribution of the sample means will
approach a normal distribution as the sample size increases, regardless of the shape of the
population distribution. This theorem is an essential tool for making inferences about
population parameters, such as means and proportions, based on sample data.
The Central Limit Theorem implies that sample means and other summary statistics can be
assumed to have a normal distribution for large samples.
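
A minimal simulation sketch of the CLT: means of samples drawn from a strongly skewed (exponential) population still cluster symmetrically around the population mean of 1.

import random
import statistics

random.seed(3)

def sample_mean(n: int) -> float:
    # draw n values from a skewed (exponential, mean 1) population and return their mean
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

means = [sample_mean(30) for _ in range(2000)]
print("average of the sample means:", round(statistics.mean(means), 3))   # close to 1.0
print("spread of the sample means:", round(statistics.stdev(means), 3))   # close to 1/sqrt(30) ≈ 0.18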

Bernoulli distribution
The Bernoulli distribution is a discrete probability distribution that describes the outcomes of
a single binary experiment or trial, where the outcome can be either success (with probability
p) or failure (with probability 1-p).
The probability mass function (PMF) of the Bernoulli distribution is given by:
P(X = 1) = p
P(X = 0) = 1 - p

where X is a random variable that takes on the values 1 (success) or 0 (failure), and p is the
probability of success.
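
A minimal sketch of the Bernoulli PMF and a single simulated trial, with p chosen as an arbitrary illustrative value:

import random

def bernoulli_pmf(x: int, p: float) -> float:
    # P(X = 1) = p, P(X = 0) = 1 - p
    return p if x == 1 else 1 - p

p = 0.3                                          # hypothetical probability of success
print(bernoulli_pmf(1, p), bernoulli_pmf(0, p))  # 0.3 0.7
trial = 1 if random.random() < p else 0          # one simulated Bernoulli trial
print("simulated outcome:", trial)
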
Uncertainty
Uncertainty is an inherent part of statistics and data analysis. It refers to the degree of doubt
or lack of confidence we have in our estimates or conclusions based on data.
There are several sources of uncertainty in statistics, including:

• Sampling variability
• Measurement error
• Model uncertainty
• Epistemic uncertainty
There are two main types of uncertainty in statistics and data analysis:

(a) Aleatory uncertainty


Aleatory uncertainty refers to the inherent variability and randomness in a system or process.
It is also called "irreducible uncertainty" because it cannot be reduced or eliminated by
collecting more data or improving our understanding of the system. Examples of aleatory
uncertainty include natural disasters, weather patterns, and human behavior.

(b) Epistemic uncertainty


Epistemic uncertainty, on the other hand, refers to the uncertainty that arises from our lack of
knowledge or understanding about a system or process. It is also called "reducible
uncertainty" because it can be reduced or eliminated by collecting more data or improving
our understanding of the system. Examples of epistemic uncertainty include measurement
error, modeling assumptions, and missing data.

CHAPTER 10
Answering Questions and Claiming Discoveries

This chapter focuses on how to communicate statistical findings effectively and answer research
questions in a clear and transparent manner. The chapter emphasizes the importance of
providing context and acknowledging uncertainty in statistical results. One way to do this is
by using confidence intervals and p-values, which provide a range of plausible values for a
parameter and a measure of the strength of evidence against a null hypothesis, respectively.

Hypothesis
A hypothesis can be defined as a proposed explanation for a phenomenon. It is not the
absolute truth, but a provisional, working assumption, perhaps best thought of as a potential
suspect in a criminal case. A hypothesis can be either a null hypothesis or an alternative
hypothesis.

(A) Null hypothesis


In statistics, the null hypothesis (H0) is a statement that there is no significant difference or
relationship between variables or no effect of a treatment or intervention. It is often used as a
starting point for hypothesis testing and is typically the hypothesis that the researcher wants
to reject.

(B) Alternative hypothesis


The alternative hypothesis (Ha) is a statement that there is a significant difference or
relationship between variables or an effect of a treatment or intervention. It is the hypothesis
that the researcher wants to support or confirm if the null hypothesis is rejected.

Permutation test
Permutation test is a non-parametric statistical test that is used to test the null hypothesis that
there is no difference between two groups. It is a method that does not rely on any
assumptions about the distribution of the data and can be used in situations where the
assumptions of traditional parametric tests, such as t-test or ANOVA, are violated.
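
A minimal sketch of a two-sample permutation test for a difference in means, using small hypothetical groups:

import random
import statistics

random.seed(4)
group_a = [5.1, 4.8, 6.2, 5.5, 5.9]   # hypothetical measurements, group A
group_b = [4.2, 4.5, 4.9, 4.0, 4.6]   # hypothetical measurements, group B

observed = statistics.mean(group_a) - statistics.mean(group_b)
pooled = group_a + group_b
n_a = len(group_a)

n_perm = 10_000
count = 0
for _ in range(n_perm):
    random.shuffle(pooled)                       # randomly re-label the observations
    diff = statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:])
    if abs(diff) >= abs(observed):               # difference at least as extreme as observed
        count += 1

print("permutation p-value:", count / n_perm)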

Hypergeometric distribution
The hypergeometric distribution is a probability distribution that describes the probability of
obtaining a specified number of successes in a sample of a fixed size drawn without
replacement from a finite population of known size, where the population contains a fixed
number of elements that are classified into two categories (successes and failures).
The probability mass function of the hypergeometric distribution is given by:
P(X = k) = [(M choose k) * (N-M choose n-k)] / (N choose n)

where:

• X is the number of successes in the sample
• M is the number of successes in the population
• N is the population size
• n is the sample size
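
A minimal sketch of this PMF using the formula above (the urn numbers are illustrative):

from math import comb

def hypergeom_pmf(k: int, N: int, M: int, n: int) -> float:
    """P(X = k): probability of k successes in a sample of size n drawn without
    replacement from a population of size N containing M successes."""
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)

# e.g. probability of exactly 2 red balls when drawing 5 balls
# from an urn of 20 balls of which 7 are red
print(hypergeom_pmf(2, 20, 7, 5))  # ≈ 0.387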

False-positive result
In statistics, a false-positive result occurs when a hypothesis test or a diagnostic test indicates
the presence of a condition or effect when it is actually absent. It is a type of error that occurs
when a test incorrectly identifies something as positive, even though it is negative. For
example, in medical testing, a false-positive result may occur when a diagnostic test indicates
that a patient has a disease or condition when they do not actually have it.

Bonferroni correction
The Bonferroni correction is a statistical method used to adjust the p-values in multiple
hypothesis testing to reduce the risk of false positives. It is named after the Italian
mathematician Carlo Emilio Bonferroni. For example, if 10 tests are being performed and the
desired overall alpha level is 0.05, then the corrected alpha level for each individual test
would be 0.05/10 = 0.005. This means that each test must have a p-value less than or equal to
0.005 to be considered significant.
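
A minimal sketch of the Bonferroni adjustment described above (the p-values are made up):

def bonferroni_significant(p_values, alpha=0.05):
    # each test is compared against alpha divided by the number of tests
    corrected_alpha = alpha / len(p_values)
    return [p <= corrected_alpha for p in p_values]

p_values = [0.001, 0.02, 0.004, 0.30, 0.049]   # hypothetical p-values from 5 tests
print(bonferroni_significant(p_values))        # corrected alpha = 0.01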

False discovery rate


The false discovery rate (FDR) is a statistical method used to control the proportion of false
positives among a set of significant results in multiple hypothesis testing. It was introduced
by Benjamini and Hochberg in 1995 as an alternative to the Bonferroni correction, which is a
more conservative method that controls the familywise error rate.
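
A minimal sketch of the Benjamini–Hochberg step-up procedure that controls the FDR at level q (the p-values are hypothetical): the largest rank k with p(k) ≤ (k/m)·q is found, and the k smallest p-values are declared significant.

def benjamini_hochberg(p_values, q=0.05):
    # returns the indices of hypotheses rejected while controlling the FDR at level q
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices sorted by p-value
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:                   # BH threshold for this rank
            k_max = rank
    return sorted(order[:k_max])                          # reject the k_max smallest p-values

p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.74]  # hypothetical p-values
print(benjamini_hochberg(p_values))                          # rejects the two smallest here
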
Neyman–Pearson Theory
The Neyman–Pearson lemma can be stated as a constrained optimization problem, and hence one way
to prove it is via Lagrange multipliers. In the Neyman–Pearson framework, two hypotheses are
formulated: the null hypothesis (H0) and the alternative hypothesis (H1). The null hypothesis
is the default hypothesis that assumes no effect or difference between two groups or variables,
while the alternative hypothesis states that there is a specific effect or difference.

In the method of Lagrange multipliers, the problem at hand is of the form: maximize f(x)
subject to g(x) ≤ c. Define the Lagrangian
M(x, λ) = f(x) − λ g(x)
If x₀(λ) maximizes M(x, λ) for a given λ ≥ 0, then x₀(λ) maximizes f(x) over all x such that
g(x) ≤ g(x₀(λ)).

CHAPTER 11
Learning from Experience the Bayesian Way
In this chapter, Spiegelhalter explores the Bayesian approach to statistical inference, which
involves updating our beliefs based on both prior knowledge and new data. He explains the
basic concepts of Bayesian inference, such as prior distributions, likelihood functions, and
posterior distributions, in a clear and accessible manner.

Bayesian Approach
The Bayesian approach to statistics is a framework for making decisions based on probability
theory and subjective beliefs or prior knowledge. A key strength of the Bayesian approach is
its ability to incorporate prior knowledge or beliefs into the analysis, which can lead to more
accurate and nuanced results. The Bayesian approach begins by specifying a prior distribution
over the parameters to be estimated; the prior reflects the information known to the researcher
without reference to the dataset on which the model is estimated.

Bayes’ theorem
Bayes' theorem is a fundamental principle of probability theory that describes the relationship
between conditional probabilities. Bayes' theorem states that the probability of an event A
occurring, given that event B has occurred, is equal to the probability of event B occurring,
given that event A has occurred, multiplied by the probability of event A occurring, divided
by the probability of event B occurring:
P(A|B) = P(B|A) x P(A) / P(B)
where:
P(A|B) is the conditional probability of event A occurring given that event B has occurred.
P(B|A) is the conditional probability of event B occurring given that event A has occurred.
P(A) is the prior probability of event A occurring.
P(B) is the prior probability of event B occurring.
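
A minimal worked sketch of Bayes' theorem for a hypothetical diagnostic test (the prevalence, sensitivity and false-positive rate below are illustrative, not from the book):

# Hypothetical numbers: prevalence 1%, sensitivity 90%, false-positive rate 5%
p_disease = 0.01
p_pos_given_disease = 0.90
p_pos_given_healthy = 0.05

# P(positive) by the law of total probability
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' theorem: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive test) = {p_disease_given_pos:.3f}")   # ≈ 0.154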

Likelihood ratio
The likelihood ratio is a measure of the strength of evidence provided by the data in favor of one
statistical hypothesis over another. Specifically, it is the ratio of the likelihoods of the data
under the two hypotheses being compared.
Suppose we have two statistical hypotheses: H1 and H2. The likelihood ratio is given by:
LR = L(D|H1) / L(D|H2)

where L(D|H1) is the likelihood of the data D under hypothesis H1, and L(D|H2) is the
likelihood of the data under hypothesis H2.

Odds and Likelihood Ratios
Odds and likelihood ratios are both measures of the strength of evidence provided by data in
favor of one hypothesis over another.
Odds are a way of expressing the likelihood of an event occurring, relative to the likelihood
of the event not occurring. The odds can range from 0 to infinity, with values greater than 1
indicating that the event is more likely to occur than not, and values less than 1 indicating that
the event is less likely to occur than not. The odds of an event A occurring are defined as the
ratio of the probability of A to the probability of not A:
odds(A) = P(A) / P(not A)
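
Bayes' theorem can be written in odds form as posterior odds = prior odds × likelihood ratio (when the two hypotheses are complementary). A minimal sketch with illustrative numbers:

# Hypothetical prior probability of hypothesis H1 and likelihood ratio of the evidence
p_h1 = 0.2
likelihood_ratio = 8.0            # evidence is 8 times more likely under H1 than under H2

prior_odds = p_h1 / (1 - p_h1)                  # odds(H1) = P(H1) / P(not H1)
posterior_odds = prior_odds * likelihood_ratio  # odds form of Bayes' theorem
posterior_prob = posterior_odds / (1 + posterior_odds)
print(f"prior odds = {prior_odds:.2f}, posterior odds = {posterior_odds:.2f}, "
      f"posterior probability = {posterior_prob:.2f}")   # 0.25, 2.00, 0.67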

Likelihood Ratios and Forensic Science


Likelihood ratios are a useful tool in forensic science for evaluating the strength of evidence
provided by various types of forensic analyses. Forensic scientists often use likelihood ratios
to assess the probative value of evidence, particularly when evaluating complex mixtures of
DNA or other biological evidence. Likelihood ratios can also be used in other types of
forensic analyses, such as fingerprint analysis, firearms examination, and toolmark analysis.

Bayesian Statistical Inference


Bayesian statistical inference is an approach to statistical inference that uses Bayes' theorem
to update the probability of a hypothesis based on new data or evidence. In Bayesian
inference, probabilities are assigned to both the hypothesis and the data, and the probability
of the hypothesis is updated based on the probability of the data given the hypothesis.

Multilevel regression and post-stratification (MRP)


Multilevel regression and post-stratification (MRP) is a statistical technique that combines
two methods: multilevel regression, which is a method for modeling hierarchical or nested
data, and post-stratification, which is a method for adjusting sample weights to match
population distributions.
Multilevel regression (also known as hierarchical linear modeling or mixed-effects modeling)
is a method for modeling data that has a hierarchical or nested structure, such as individuals
nested within groups or repeated measures nested within individuals. Post-stratification is a
method for adjusting survey weights to match known population distributions on certain
demographic variables, such as age, gender, or race/ethnicity.
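
A minimal sketch of the post-stratification step alone (the cell estimates and population shares are invented for illustration): model-based estimates for each demographic cell are weighted by that cell's known share of the target population.

# Hypothetical cell-level estimates (e.g. support for a policy) and known population shares
cell_estimates = {"18-34": 0.62, "35-54": 0.48, "55+": 0.35}
population_share = {"18-34": 0.30, "35-54": 0.40, "55+": 0.30}

# Post-stratified estimate: weight each cell estimate by its population share
estimate = sum(cell_estimates[c] * population_share[c] for c in cell_estimates)
print(f"post-stratified estimate = {estimate:.3f}")   # 0.62*0.3 + 0.48*0.4 + 0.35*0.3 = 0.483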

CHAPTER 12
How Things Go Wrong
In this chapter David Spiegelhalter discusses the potential sources of error and bias in
statistical analysis and how they can lead to incorrect conclusions. The author highlights the
importance of understanding the limitations of statistical analysis and the potential for errors
to arise due to factors such as data quality, measurement error, model assumptions, and
human biases.

Reproducibility crisis
The reproducibility crisis in statistics refers to the growing concern that many published
scientific findings may not be replicable by other researchers or may be subject to biases and
errors that affect the validity of the results. This crisis has been particularly prominent in the
field of psychology, where a number of high-profile studies have failed to replicate in
subsequent research.

Deliberate Fraud
Deliberate fraud in statistics refers to instances where researchers intentionally manipulate
data or analysis methods to produce misleading or false results. Deliberate fraud in statistics
is a serious issue that can have significant consequences, including the potential to harm
individuals or society as a whole if the fraudulent results are used to make important
decisions.
One well-known example of deliberate fraud in statistics is the case of Andrew Wakefield, a
former medical researcher who claimed to have found a link between the MMR vaccine and
autism. Wakefield's study was found to be fraudulent, with manipulated data and conflicts of
interest that were not disclosed. The study led to a decrease in vaccination rates and an
increase in the number of cases of measles, mumps, and rubella, highlighting the potential
harm caused by deliberate fraud in statistics.

Questionable Research Practices


Questionable research practices in statistics refer to practices that are ethically questionable
and may lead to biased or misleading research results. These practices may not necessarily be
considered fraud, but they can still undermine the integrity of scientific research and
contribute to the reproducibility crisis in statistics.

Communication Breakdown
Communication breakdown in statistics refers to instances where there is a failure to
effectively communicate statistical information or concepts to non-expert audiences. This can
lead to misunderstandings, misinterpretations, and a lack of trust in the validity of statistical
information.

CHAPTER 13
How We Can Do Statistics Better
This chapter focuses on ways to improve the practice of statistics, particularly in light of the
challenges and issues discussed in previous chapters. The chapter begins by highlighting the
importance of transparency and reproducibility in statistical research. The author emphasizes
the need for researchers to share their data and code, as well as to use pre-registration and
open access journals. These practices can help to increase the transparency and replicability
of statistical research, which is crucial for building trust in the validity of statistical results.
If we want the use of statistics to improve, three groups need to act; producers, communicators
and audiences all have a role in improving the way that statistical science is used in society:

• Producers of statistics: such as scientists, statisticians, survey companies and industry. They can do statistics better. Producers need to ensure that science is reproducible. To demonstrate trustworthiness, information should be accessible, intelligible, assessable and useable.
• Communicators: such as scientific journals, charities, government departments, press officers, journalists and editors. They can communicate statistics better. Communicators need to be wary of trying to fit statistical stories into standard narratives.
• Audiences: such as the public, policy-makers and professionals. They can check statistics better. Audiences need to call out poor practice by asking questions about the trustworthiness of their numbers, their source and their interpretation.
Publication Bias
Publication bias in statistics refers to the tendency of researchers and journals to publish only
statistically significant results, while neglecting non-significant or negative results. This can
lead to an overrepresentation of positive results in the published literature, and can distort our
understanding of the true relationship between variables or the effectiveness of interventions.
Publication bias can occur for several reasons. One reason is that researchers may be more
likely to submit and publishers may be more likely to accept manuscripts with positive
results. This can be due to the perceived higher importance or novelty of positive results, or
the belief that negative results are less interesting or valuable.
Assessing a Statistical Claim or Story
Whether we are journalists, fact-checkers, academics, professionals in government or
business or NGOs, or simply members of the public, we are regularly told claims that are
based on statistical evidence. Assessing the trustworthiness of statistical claims appears a
vital skill for the modern world. Claims based on data need to be:

• Accessible: audiences should be able to get at the information.
• Intelligible: audiences should be able to understand the information.
• Assessable: if they wish to, audiences should be able to check the reliability of the claims.
• Useable: audiences should be able to exploit the information for their needs.

Data Ethics
Data ethics refers to the ethical considerations surrounding the collection, use, and sharing of
data in statistical research. As statistical analysis often involves sensitive personal or
confidential information, it is important to consider the ethical implications of data use and
ensure that research is conducted in an ethical and responsible manner.
The exit-poll analysis discussed by Spiegelhalter uses a repertoire of techniques:

(i) Data to Sample: Since these are exit polls and the respondents are saying what they have
done and not what they intend to do, experience suggests the responses should be reasonably
accurate measures of what people actually voted at this and previous elections.
(ii) Sample to Study Population: A representative sample is taken of those who actually
voted in each polling station, so the results from the sample can be used to roughly estimate
the change in vote, or 'swing', in that small area.
(iii) Study Population to Target Population: Using knowledge of the demographics of each
polling station, a regression model is built that attempts to explain how the proportion of
people who change their vote between elections depends on the characteristics of the voters
in that polling area. In this way the swing does not have to be assumed to be the same
throughout the country, but is allowed to vary from area to area allowing, say, for whether
there is a rural or urban population. Then using the estimated regression model, knowledge of
the demographics of each of the 600 or so constituencies, and the votes cast at the previous
election, a prediction of the votes cast in this election can be made for each individual
constituency, even though most of the constituencies did not actually have any voters
interviewed in the exit poll.
