
What are the essential elements of data analysis?

The elements of a data analysis are the fundamental components from which the analyst builds the analysis: code, code comments, data visualization, non-data visualization, narrative text, summary statistics, tables, and statistical models or computational algorithms.

Variable

A variable is simply something that can vary with time and whose variation we can measure. In other words, a variable is a characteristic or phenomenon that can be measured and that changes its value over time. Variables are classified into two types:

1] Discrete

A discrete variable changes its value only in whole numbers, i.e., it increases in jumps. The phenomenon or characteristic that a discrete variable represents must therefore be one whose value cannot be a fraction, only a whole number. For example, the number of children in a family can be 2, 3, 4, etc., but not 2.5 or 3.5.

2] Continuous

A continuous variable can assume fractional values; its value does not increase in jumps. Examples include the heights and weights of students.

Four Different Levels of Measurement

In ascending order of precision, the four different levels of measurement are:

Nominal–Latin for name only (Republican, Democrat, Green, Libertarian)

Ordinal–Think ordered levels or ranks (small–8oz, medium–12oz, large–32oz)

Interval–Equal intervals among levels (1 dollar to 2 dollars is the same interval as 88 dollars to 89
dollars)

Ratio–Let the “o” in ratio remind you of a zero in the scale (Day 0, day 1, day 2, day 3, …)

The first level of measurement is the nominal level of measurement. At this level, the numbers or symbols in the variable are used only to classify the data; words, letters, and alphanumeric symbols can all be used. Suppose there are data about people belonging to three different gender categories. A person belonging to the female gender could be coded F, a person belonging to the male gender could be coded M, and a transgender person could be coded T. This type of classification is the nominal level of measurement.
The second level of measurement is the ordinal level of measurement. This level depicts an ordered relationship among a variable's observations. Suppose a student scores the highest grade in the class, 100; he would be assigned the first rank. Another classmate scores the second-highest grade, a 92; she would be assigned the second rank. A third student scores an 81 and would be assigned the third rank, and so on. The ordinal level of measurement indicates an ordering of the measurements.

The third level of measurement is the interval level of measurement. The interval level not only classifies and orders the measurements, but also specifies that the distances between adjacent points on the scale are equal all along the scale, from the low end to the high end. For example, on an interval-level measure of anxiety, the interval between scores of 10 and 11 is the same as the interval between scores of 40 and 41. A popular example of this level of measurement is temperature in centigrade, where, for example, the distance between 94°C and 96°C is the same as the distance between 100°C and 102°C.

The fourth level of measurement is the ratio level of measurement. At this level, the observations, in addition to having equal intervals, can take a meaningful value of zero. This true zero is what distinguishes ratio measurement from the other types, although its other properties are the same as those of the interval level. In the ratio level of measurement, the divisions between the points on the scale have an equivalent distance between them.

The researcher should note that among these levels of measurement, the nominal level is simply
used to classify data, whereas the levels of measurement described by the interval level and the
ratio level are much more exact.

What Is Data Management?

Data Management, Defined

Data management is the practice of collecting, keeping, and using data securely, efficiently, and
cost-effectively. The goal of data management is to help people, organizations, and connected things
optimize the use of data within the bounds of policy and regulation so that they can make decisions
and take actions that maximize the benefit to the organization. A robust data management
strategy is becoming more important than ever as organizations increasingly rely on intangible
assets to create value.
Managing digital data in an organization involves a broad range of tasks, policies, procedures, and
practices. The work of data management has a wide scope, covering factors such as how to:
 Create, access, and update data across a diverse data tier
 Store data across multiple clouds and on premises
 Provide high availability and disaster recovery
 Use data in a growing variety of apps, analytics, and algorithms
 Ensure data privacy and security
 Archive and destroy data in accordance with retention schedules and compliance requirements
A formal data management strategy addresses the activity of users and administrators, the
capabilities of data management technologies, the demands of regulatory requirements, and the
needs of the organization to obtain value from its data.
Data Capital Is Business Capital
In today’s digital economy, data is a kind of capital, an economic factor of production in digital goods
and services. Just as an automaker can’t manufacture a new model if it lacks the necessary financial
capital, it can’t make its cars autonomous if it lacks the data to feed the onboard algorithms. This
new role for data has implications for competitive strategy as well as for the future of computing.

Given this central and mission-critical role of data, strong management practices and a robust
management system are essential for every organization, regardless of size or type.

Data Management Systems Today

Today’s organizations need a data management solution that provides an efficient way to manage
data across a diverse but unified data tier. Data management systems are built on data management
platforms and can include databases, data lakes and data warehouses, big data management
systems, data analytics, and more.
All these components work together as a “data utility” to deliver the data management capabilities
an organization needs for its apps, and the analytics and algorithms that use the data originated by
those apps. Although current tools help database administrators (DBAs) automate many of the
traditional management tasks, manual intervention is still often required because of the size and
complexity of most database deployments. Whenever manual intervention is required, the chance
for errors increases. Reducing the need for manual data management is a key objective of a new
data management technology, the autonomous database.
Data Management Platforms
The most critical step for continuous delivery of software is continuous integration (CI). CI is a development practice in which developers commit their code changes (usually small and incremental) to a centralized source repository, which kicks off a set of automated builds and tests. This allows developers to catch bugs early and automatically, before they are passed on to production. A continuous integration pipeline usually involves a series of steps, starting from a code commit, then performing basic automated linting/static analysis, capturing dependencies, and finally building the software and running some basic unit tests before creating a build artifact. Source code management systems such as GitHub and GitLab offer webhook integrations to which CI tools like Jenkins can subscribe to start running automated builds and tests after each code check-in.

A data management platform is the foundational system for collecting and analyzing large volumes
of data across an organization. Commercial data platforms typically include software tools for
management, developed by the database vendor or by third-party vendors. These data management
solutions help IT teams and DBAs perform typical tasks such as:
 Identifying, alerting, diagnosing, and resolving faults in the database system or
underlying infrastructure
 Allocating database memory and storage resources
 Making changes in the database design
 Optimizing responses to database queries for faster application performance
The increasingly popular cloud data platforms allow businesses to scale up or down quickly and cost-
effectively. Some are available as a service, allowing organizations to save even more.

What Is an Autonomous Database?


Based in the cloud, an autonomous database uses artificial intelligence (AI) and machine learning to
automate many data management tasks performed by DBAs, including managing database backups,
security, and performance tuning.
Also called a self-driving database, an autonomous database offers significant benefits for data
management, including:
 Reduced complexity
 Decreased potential for human error
 Higher database reliability and security
 Improved operational efficiency
 Lower costs

Big Data Management Systems

In some ways, big data is just what it sounds like—lots and lots of data. But big data also comes in a
wider variety of forms than traditional data, and it’s collected at a high rate of speed. Think of all the
data that comes in every day, or every minute, from a social media source such as Facebook. The
amount, variety, and speed of that data are what make it so valuable to businesses, but they also
make it very complex to manage.
As more and more data is collected from sources as disparate as video cameras, social media, audio
recordings, and Internet of Things (IoT) devices, big data management systems have emerged. These
systems specialize in three general areas.

 Big data integration brings in different types of data, from batch to streaming, and transforms it so that it can be consumed.
 Big data management stores and processes data in a data lake or data warehouse efficiently, securely, and reliably, often by using object storage.
 Big data analysis uncovers new insights with analytics, including graph analytics, and uses machine learning and AI visualization to build models.
Companies are using big data to improve and accelerate product development, predictive
maintenance, the customer experience, security, operational efficiency, and much more. As big data
gets bigger, so will the opportunities.

Data Management Challenges


Most of the challenges in data management today stem from the faster pace of business and the
increasing proliferation of data. The ever-expanding variety, velocity, and volume of data available to
organizations is pushing them to seek more-effective management tools to keep up. Some of the top
challenges organizations face include the following:

 Lack of data insight: Data from an increasing number and variety of sources such as sensors, smart devices, social media, and video cameras is being collected and stored. But none of that data is useful if the organization doesn't know what data it has, where it is, and how to use it. Data management solutions need scale and performance to deliver meaningful insights in a timely manner.

 Difficulty maintaining data-management performance levels: Organizations are capturing, storing, and using more data all the time. To maintain peak response times across this expanding tier, organizations need to continuously monitor the type of questions the database is answering and change the indexes as the queries change, without affecting performance.

 Challenges complying with changing data requirements: Compliance regulations are complex and multijurisdictional, and they change constantly. Organizations need to be able to easily review their data and identify anything that falls under new or modified requirements. In particular, personally identifiable information (PII) must be detected, tracked, and monitored for compliance with increasingly strict global privacy regulations.

 Need to easily process and convert data: Collecting and identifying the data itself doesn't provide any value; the organization needs to process it. If it takes a lot of time and effort to convert the data into what is needed for analysis, that analysis won't happen, and the potential value of the data is lost.

 Constant need to store data effectively: In the new world of data management, organizations store data in multiple systems, including data warehouses and unstructured data lakes that store any data in any format in a single repository. An organization's data scientists need a way to quickly and easily transform data from its original format into the shape, format, or model they need for a wide array of analyses.

 Demand to continually optimize IT agility and costs: With the availability of cloud data management systems, organizations can now choose whether to keep and analyze data in on-premises environments, in the cloud, or in a hybrid mixture of the two. IT organizations need to evaluate the level of parity between on-premises and cloud environments in order to maintain maximum IT agility and lower costs.

Data Management Principles and Data Privacy


The General Data Protection Regulation (GDPR) enacted by the European Union and implemented in May 2018 includes seven key principles for the management and processing of personal data: lawfulness, fairness, and transparency; purpose limitation; data minimization; accuracy; storage limitation; integrity and confidentiality; and accountability.

The GDPR and other laws that follow in its footsteps, such as the California Consumer Privacy Act (CCPA), are changing the face of data management. These requirements provide standardized data protection laws that give individuals control over their personal data and how it is used. In effect, they turn consumers into data stakeholders with real legal recourse when organizations fail to obtain informed consent at data capture, exercise poor control over data use or locality, or fail to comply with data erasure or portability requirements.

Data Management Best Practices


Addressing data management challenges requires a comprehensive, well-thought-out set of best
practices. Although specific best practices vary depending on the type of data involved and the
industry, the following best practices address the major data management challenges organizations
face today:

 Create a discovery layer to identify your data: A discovery layer on top of your organization's data tier allows analysts and data scientists to search and browse for datasets and make your data usable.

 Develop a data science environment to efficiently repurpose your data: A data science environment automates as much of the data transformation work as possible, streamlining the creation and evaluation of data models. A set of tools that eliminates the need for manual transformation of data can expedite the hypothesizing and testing of new models.

 Use autonomous technology to maintain performance levels across your expanding data tier: Autonomous data capabilities use AI and machine learning to continuously monitor database queries and optimize indexes as the queries change. This allows the database to maintain rapid response times and frees DBAs and data scientists from time-consuming manual tasks.

 Use discovery to stay on top of compliance requirements: New tools use data discovery to review data and identify the chains of connection that need to be detected, tracked, and monitored for multijurisdictional compliance. As compliance demands increase globally, this capability is going to be increasingly important to risk and security officers.

 Ensure you're using a converged database: A converged database is a database that has native support for all modern data types and the latest development models built into one product. The best converged databases can run many kinds of workloads, including graph, IoT, blockchain, and machine learning.

 Ensure your database platform has the performance, scale, and availability to support your business: The goal of bringing data together is to be able to analyze it to make better, more timely decisions. A scalable, high-performance database platform allows enterprises to rapidly analyze data from multiple sources using advanced analytics and machine learning so they can make better business decisions.

 Use a common query layer to manage multiple and diverse forms of data storage: New technologies are enabling data management repositories to work together, making the differences between them disappear. A common query layer that spans the many kinds of data storage enables data scientists, analysts, and applications to access data without needing to know where it is stored and without needing to manually transform it into a usable format.

The Value of a Data Science Environment


Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and
systems to extract value from data. Data scientists combine a range of skills—including statistics,
computer science, and business knowledge—to analyze data collected from the web, smartphones,
customers, sensors, and other sources.

Data Management Evolves

With data’s new role as business capital, organizations are discovering what digital startups and
disruptors already know: Data is a valuable asset for identifying trends, making decisions, and taking
action before competitors. The new position of data in the value chain is leading organizations to
actively seek better ways to derive value from this new capital.

Indexing
Introduction

Data engineers and data scientists often have to deal with an enormous amount of data. Dealing with such data is not a straightforward task. To process this data as efficiently as possible, we need to have a clear understanding of how the data is organized. So before moving on to the main topic, let us first cover some basics.

Memory in a computer system is organized in a hierarchy.

Capacity and access time increase, and cost per byte decreases, as we move from the top to the bottom of the hierarchy. Tapes and disks play an important role in a database system. Since the amount of data is immense, storing all the data in main memory would be very expensive.

Therefore, we store the data on tapes/disks and build a database system that brings data from the
lower level of memory into the main memory for processing as and when needed.

The database is stored as a collection of files (tables). Each file is a collection of pages (blocks). Each
block is a collection of records. A record is a sequence of fields.

A block here is simply a disk block: data is transferred from disk to main memory block by block.

I/O cost: the number of disk blocks that must be read to access a particular record is called the I/O cost of accessing that record.

Database systems are carefully optimized to minimize this cost.

Organization of records in a file

In a file, records can be stored in two ways:

Ordered organization: When the records across the pages/blocks of a file are physically ordered
based on the values of one of its fields. For example, consider an employee table that is sorted based
on the employee ID field.

Unordered Organization: When the records are stored in a file in random order across the
pages/blocks of a file.
Access cost from database file without indexing


Access cost based on an unordered field:

Suppose a database file/table Employees has 1000 blocks and we want to search for the record whose phone number is X.

Select *
From employees
Where phone_no = X;

The number of block accesses will be 1000 (linear search), so the I/O cost will be 1000.

Access cost based on an ordered field:

Similarly, suppose we want to search the Employees table/file for the record whose employee ID is Y.

Select *
From employees
Where employeeID = Y;

The number of block accesses will be log2(1000), which is approximately 10 (binary search), so the I/O cost will be 10.

Can this I/O cost be improved any further? The answer is yes. This is where the concept of indexing
comes into the picture.

Compare this to reading a book (sorted by chapters) that has no index page. You want to find a chapter, say Photosynthesis, so you open a random page (the middle page in the case of binary search) and turn the pages left or right based on the chapter number you are looking for. This will certainly take some time. What if the book contained an index page? You could have navigated to that chapter just by looking at the index page.


Similarly, we can apply indexing to database files.

Indexing

Indexing is simply a way to optimize the performance of a database by minimizing the number of disk block accesses required while processing a query. As in the book analogy, where the book had an additional page containing the index, a database file has an index file that is stored in separate disk blocks/pages.

Each entry of the index file consists of two fields: <search key, pointer>.
 The first field, the search key, contains a copy of a primary key, candidate key, or non-key attribute of the table.
 The second field, the pointer, contains the address of the disk block where that particular search key value can be found.

The index file is also divided into blocks.

Suppose a database file has N blocks. We create an index file for the database file, and the index file is itself divided into M blocks.


Number of Index Blocks (M) << Number of database file blocks (N)

The number of block accesses, or the I/O cost, with indexing is log2(M) + 1, which is much less than in the previous case.

Points to remember:

1. In the original database file, records can be sorted based on one field only.
2. The index file is stored on disk blocks in sorted order.
3. Binary search is used to search through the entries of the index file.
4. To access a record using the index entries, the average number of block accesses required is log2(M) + 1, where M is the number of blocks in the index file (see the sketch below).
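As a rough sanity check on these numbers, here is a minimal Python sketch of the three access costs discussed above. The block count N = 1000 comes from the example; the index size M = 50 is assumed purely for illustration.

import math

N = 1000  # number of blocks in the database file (from the example above)
M = 50    # number of blocks in the index file (assumed; M << N)

linear_scan_cost = N                          # unordered field: every block may be read
binary_search_cost = math.ceil(math.log2(N))  # ordered field: about log2(N) block reads
indexed_cost = math.ceil(math.log2(M)) + 1    # binary search on the index, plus one data block

print(linear_scan_cost, binary_search_cost, indexed_cost)  # 1000 10 7

With these assumed sizes, the index lookup needs roughly 7 block reads instead of 1000.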

Categories of Index:

Now, based on the order of your database file and the number of entries/records you are going to
maintain in the index file, indices can be broadly classified as:

Dense index: It has an index entry for every search key value (and hence every record) in the database file. A dense index can be built on ordered as well as unordered fields of the database file.

Sparse Index: It has index entries for only some of the search key values/records in the database file.
The sparse index can be built only on the ordered field of the database file. The first record of the
block is called the anchor record.

Type of Index:

In this blog, we will cover single-level indexing.

Primary Indexing

In primary Indexing, the index is created on the ordered primary key field of the database file.


1. It can be dense or sparse, but a sparse index is preferred.

2. The first record of each block is called the block anchor.

3. The number of index entries = the number of blocks in the original database file.

4. For any database file, at most one primary index is possible, because it is built over the physically ordered field.

5. Binary search is used to search through the entries of the index file.

6. The I/O cost to access a record using the primary index is log2(M) + 1, where M is the number of blocks in the index file.

Clustered Indexing

In clustered Indexing, the index is created on the ordered nonkey field of the database file.


1. A clustered index is usually sparse (a dense index is also possible).

2. The I/O cost to access a record using the clustered index is >= log2(M) + 1, where M is the number of blocks in the index file.

3. For any database file, at most one clustered index is possible, because it is built over the ordered field.

NOTE: A file has at most one physical ordering field, so it can have at most one primary index or one clustered index, but not both.

Secondary Indexing Over Primary Key Field

In secondary indexing over the key field, the index is created on an unordered key field of the database file. It is always a dense index.

Indexing in MySQL

Let's create a table named Employees consisting of the following columns:

1. Employee_ID
2. Name
3. Age
4. Gender

Create the Employees table with Employee_ID as the primary key.

CREATE TABLE Employees (
    Employee_ID int PRIMARY KEY,
    Name varchar(25) NOT NULL,
    Age int NOT NULL,
    Gender varchar(6) NOT NULL
);

The command for creating an index is as follows:

CREATE INDEX index_name


ON table_name (column_1, column_2, ...);
Let’s create an index on the Employee_ID field (primary indexing).

CREATE INDEX index_id
ON Employees (Employee_ID);

The command for dropping an index is as follows:

ALTER TABLE table_name


DROP INDEX index_name;

Points to remember:

 Create indices based on the attributes that are used in your WHERE clause.
 If a field appears in the WHERE clause of multiple queries, consider creating an index on that attribute.
 The list of attributes used in the SELECT clause does not influence which attributes you should create indices on; look at the WHERE clause instead.


Descriptive Statistics

Descriptive statistics are used to describe the basic features of the data in a study. They provide
simple summaries about the sample and the measures. Together with simple graphics analysis, they
form the basis of virtually every quantitative analysis of data.

Descriptive statistics are typically distinguished from inferential statistics. With descriptive statistics
you are simply describing what is or what the data shows. With inferential statistics, you are trying
to reach conclusions that extend beyond the immediate data alone. For instance, we use inferential
statistics to try to infer from the sample data what the population might think. Or, we use inferential
statistics to make judgments of the probability that an observed difference between groups is a
dependable one or one that might have happened by chance in this study. Thus, we use inferential
statistics to make inferences from our data to more general conditions; we use descriptive statistics
simply to describe what’s going on in our data.

Descriptive statistics are used to present quantitative descriptions in a manageable form. In a research study we may have lots of measures. Or we may measure a large number of people on any
measure. Descriptive statistics help us to simplify large amounts of data in a sensible way. Each
descriptive statistic reduces lots of data into a simpler summary. For instance, consider a simple
number used to summarize how well a batter is performing in baseball, the batting average. This
single number is simply the number of hits divided by the number of times at bat (reported to three
significant digits). A batter who is hitting .333 is getting a hit one time in every three at bats. One
batting .250 is hitting one time in four. The single number describes a large number of discrete
events. Or, consider the scourge of many students, the Grade Point Average (GPA). This single
number describes the general performance of a student across a potentially wide range of course
experiences.

Every time you try to describe a large set of observations with a single indicator you run the risk of
distorting the original data or losing important detail. The batting average doesn’t tell you whether
the batter is hitting home runs or singles. It doesn’t tell whether she’s been in a slump or on a streak.
The GPA doesn’t tell you whether the student was in difficult courses or easy ones, or whether they
were courses in their major field or in other disciplines. Even given these limitations, descriptive
statistics provide a powerful summary that may enable comparisons across people or other units.

Univariate Analysis

Univariate analysis involves the examination across cases of one variable at a time. There are three
major characteristics of a single variable that we tend to look at:

 the distribution
 the central tendency
 the dispersion

In most situations, we would describe all three of these characteristics for each of the variables in
our study.

The Distribution

The distribution is a summary of the frequency of individual values or ranges of values for a variable.
The simplest distribution would list every value of a variable and the number of persons who had
each value. For instance, a typical way to describe the distribution of college students is by year in
college, listing the number or percent of students at each of the four years. Or, we describe gender
by listing the number or percent of males and females. In these cases, the variable has few enough
values that we can list each one and summarize how many sample cases had the value. But what do
we do for a variable like income or GPA? With these variables there can be a large number of
possible values, with relatively few people having each one. In this case, we group the raw scores
into categories according to ranges of values. For instance, we might look at GPA according to the
letter grade ranges. Or, we might group income into four or five ranges of income values.

Category Percent
Under 35 years old 9%
36–45 21%
46–55 45%
56–65 19%
66+ 6%
One of the most common ways to describe a single variable is with a frequency distribution. Depending on the particular variable, all of the data values may be represented, or you may group the values into categories first (e.g., with age, price, or temperature variables, it would usually not be sensible to determine the frequencies for each value; rather, the values are grouped into ranges and the frequencies determined). Frequency distributions can be depicted in two ways, as a table or as a graph. The table above shows an age frequency distribution with five categories of age ranges defined. The same frequency distribution can be depicted in a graph as shown in Figure 1. This type of graph is often referred to as a histogram or bar chart.
Figure 1. Frequency distribution bar chart.
Distributions may also be displayed using percentages. For example, you could use percentages to
describe the:

 percentage of people in different income levels
 percentage of people in different age ranges
 percentage of people in different ranges of standardized test scores
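As an illustration, a frequency distribution and its percentages can be tallied with a few lines of Python. The ages below are made up for the example and are not the data behind the table above.

from collections import Counter

ages = [32, 38, 41, 47, 50, 52, 53, 58, 61, 67, 44, 49]  # hypothetical sample

def age_group(age):
    # Group raw ages into the ranges used in the table above
    if age <= 35:
        return "Under 35 years old"
    if age <= 45:
        return "36-45"
    if age <= 55:
        return "46-55"
    if age <= 65:
        return "56-65"
    return "66+"

counts = Counter(age_group(a) for a in ages)
for group, count in counts.items():
    print(group, count, round(100 * count / len(ages)), "%")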

Central Tendency

The central tendency of a distribution is an estimate of the “center” of a distribution of values. There
are three major types of estimates of central tendency:

 Mean
 Median
 Mode

The mean or average is probably the most commonly used method of describing central tendency. To compute the mean, you add up all the values and divide by the number of values. For example, the mean or average quiz score is determined by summing all the scores and dividing by the number of students taking the quiz. Consider the test score values:

15, 20, 21, 20, 36, 15, 25, 15


The sum of these 8 values is 167, so the mean is 167/8 = 20.875.

The Median is the score found at the exact middle of the set of values. One way to compute the
median is to list all scores in numerical order, and then locate the score in the center of the sample.
For example, if there are 500 scores in the list, score #250 would be the median. If we order the 8
scores shown above, we would get:

15, 15, 15, 20, 20, 21, 25, 36


There are 8 scores, and scores #4 and #5 represent the halfway point. Since both of these scores are 20, the median is 20. If the two middle scores had different values, you would have to interpolate (average them) to determine the median.

The mode is the most frequently occurring value in the set of scores. To determine the mode, you might again order the scores as shown above, and then count each one. The most frequently occurring value is the mode. In our example, the value 15 occurs three times and is the mode. In some distributions there is more than one modal value. For instance, in a bimodal distribution there are two values that occur most frequently.

Notice that for the same set of 8 scores we got three different values (20.875, 20, and 15) for the
mean, median and mode respectively. If the distribution is truly normal (i.e., bell-shaped), the mean,
median and mode are all equal to each other.
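For reference, these three estimates can be reproduced with Python's standard statistics module; a minimal sketch using the eight scores above:

import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]

print(statistics.mean(scores))    # 20.875
print(statistics.median(scores))  # 20.0
print(statistics.mode(scores))    # 15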

Dispersion

Dispersion refers to the spread of the values around the central tendency. There are two common
measures of dispersion, the range and the standard deviation. The range is simply the highest value
minus the lowest value. In our example distribution, the high value is 36 and the low is 15, so the
range is 36 - 15 = 21.

The standard deviation is a more accurate and detailed estimate of dispersion, because an outlier can greatly exaggerate the range (as is true in this example, where the single outlier value of 36 stands apart from the rest of the values). The standard deviation shows the relation that the set of scores has to the mean of the sample. Again, let's take the set of scores:

15, 20, 21, 20, 36, 15, 25, 15


To compute the standard deviation, we first find the distance between each value and the mean. We know from above that the mean is 20.875. So, the differences from the mean are:

15 - 20.875 = -5.875
20 - 20.875 = -0.875
21 - 20.875 = +0.125
20 - 20.875 = -0.875
36 - 20.875 = 15.125
15 - 20.875 = -5.875
25 - 20.875 = +4.125
15 - 20.875 = -5.875
Notice that values that are below the mean have negative discrepancies and values above it have
positive ones. Next, we square each discrepancy:

-5.875 * -5.875 = 34.515625


-0.875 * -0.875 = 0.765625
+0.125 * +0.125 = 0.015625
-0.875 * -0.875 = 0.765625
15.125 * 15.125 = 228.765625
-5.875 * -5.875 = 34.515625
+4.125 * +4.125 = 17.015625
-5.875 * -5.875 = 34.515625
Now, we take these “squares” and sum them to get the Sum of Squares (SS) value. Here, the sum
is 350.875. Next, we divide this sum by the number of scores minus 1. Here, the result is 350.875 / 7
= 50.125. This value is known as the variance. To get the standard deviation, we take the square root
of the variance (remember that we squared the deviations earlier). This would be SQRT(50.125) =
7.079901129253.

Although this computation may seem convoluted, it’s actually quite simple. To see this, consider the
formula for the standard deviation:
\sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}

where:

 X is each score,
 X̄ is the mean (or average),
 n is the number of values,
 Σ means we sum across the values.

In the top part of the ratio, the numerator, we see that each score has the mean subtracted from it,
the difference is squared, and the squares are summed. In the bottom part, we take the number of
scores minus 1. The ratio is the variance and the square root is the standard deviation. In English, we
can describe the standard deviation as:

the square root of the sum of the squared deviations from the mean divided by the number of
scores minus one.

Although we can calculate these univariate statistics by hand, it gets quite tedious when you have
more than a few values and variables. Every statistics program is capable of calculating them easily
for you. For instance, I put the eight scores into SPSS and got the following table as a result:

Metric Value
N 8
Mean 20.8750
Median 20.0000
Mode 15.00
Standard Deviation 7.0799
Variance 50.1250
Range 21.00
which confirms the calculations I did by hand above.
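The same figures can be reproduced with a short Python sketch, which also mirrors the hand calculation of the sum of squares above:

import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]
mean = statistics.mean(scores)                         # 20.875

sum_of_squares = sum((x - mean) ** 2 for x in scores)  # 350.875
variance = sum_of_squares / (len(scores) - 1)          # 350.875 / 7 = 50.125
std_dev = variance ** 0.5                              # about 7.0799
value_range = max(scores) - min(scores)                # 36 - 15 = 21

# Cross-check the hand-rolled result against the library's sample standard deviation
assert abs(std_dev - statistics.stdev(scores)) < 1e-9
print(variance, std_dev, value_range)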

The standard deviation allows us to reach some conclusions about specific scores in our distribution.
Assuming that the distribution of scores is normal or bell-shaped (or close to it!), the following
conclusions can be reached:

 approximately 68% of the scores in the sample fall within one standard deviation of the mean
 approximately 95% of the scores in the sample fall within two standard deviations of the mean
 approximately 99% of the scores in the sample fall within three standard deviations of the
mean

For instance, since the mean in our example is 20.875 and the standard deviation is 7.0799, we can
from the above statement estimate that approximately 95% of the scores will fall in the range
of 20.875-(2*7.0799) to 20.875+(2*7.0799) or between 6.7152 and 35.0348. This kind of information
is a critical stepping stone to enabling us to compare the performance of an individual on one
variable with their performance on another, even when the variables are measured on entirely
different scales.
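That interval is easy to check in code; a minimal sketch using the figures above:

mean = 20.875
std_dev = 7.0799

lower = mean - 2 * std_dev  # 6.7152
upper = mean + 2 * std_dev  # 35.0348
print(lower, upper)         # roughly 95% of scores should fall in this range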
What is Hypothesis Generation?

Hypothesis generation is an educated "guess" about the various factors that may be impacting the business problem to be solved using machine learning. When framing a hypothesis, the data scientist must not know its outcome; the hypotheses are generated before looking at any evidence.

“A hypothesis may be simply defined as a guess. A scientific hypothesis is an intelligent guess.” –


Isaac Asimov

Hypothesis generation is a crucial step in any data science project. If you skip this or skim through
this, the likelihood of the project failing increases exponentially.

Hypothesis Generation vs. Hypothesis Testing

Confusing these two is a very common mistake data science beginners make.

Hypothesis generation is the process of forming an educated guess, whereas hypothesis testing is the process of concluding whether that educated guess is true or false, i.e., whether the relationship between the variables is statistically significant or not.

The latter is used for further research and requires statistical proof: a hypothesis is accepted or rejected based on the significance level and the test statistic of the test used for testing it.


How Does Hypothesis Generation Help?

Here are 5 key reasons why hypothesis generation is so important in data science:

 Hypothesis generation helps in comprehending the business problem, since we dive deep into inferring the various factors affecting our target variable
 You get a much better idea of the major factors that are responsible for solving the problem
 It identifies the data that needs to be collected from various sources, which is key in converting the business problem into a data science problem
 It improves your domain knowledge if you are new to the domain, as you spend time understanding the problem
 It helps you approach the problem in a structured manner

When Should you Perform Hypothesis Generation?

The million-dollar question: when in the world should you perform hypothesis generation?
 Hypothesis generation should be done before looking at the dataset or collecting the data
 You will notice that if you have done your hypothesis generation adequately, you will have included all the variables present in the dataset in your hypotheses
 You might also have included variables that are not present in the dataset

Case Study: Hypothesis Generation on “New York City Taxi Trip Duration Prediction”

Let us now look at the “NEW YORK CITY TAXI TRIP DURATION PREDICTION” problem statement and
generate a few hypotheses that would affect our taxi trip duration to understand hypothesis
generation.

Here’s the problem statement:

To predict the duration of a trip so that the company can assign the cabs that are free for the next
trip. This will help in reducing the wait time for customers and will also help in earning customer
trust.

Let’s begin!

Hypothesis Generation Based On Various Factors

1. Distance/Speed based Features


Let us try to come up with a formula that would have a relation with trip duration and would help us
in generating various hypotheses for the problem:

TIME=DISTANCE/SPEED

Distance and speed play an important role in predicting the trip duration.

We can notice that the trip duration is directly proportional to the distance traveled and inversely
proportional to the speed of the taxi. Using this we can come up with a hypothesis based on distance
and speed.
 Distance: The greater the distance traveled by the taxi, the longer the trip duration
 Interior drop point: Drop points in congested or interior lanes could result in an increase in trip duration
 Speed: The higher the speed, the lower the trip duration

2. Features based on Car


Cars come in various types, sizes, and brands, and these features could matter not only for passenger safety but also for trip duration. Let us now generate a few hypotheses based on the features of the car.

 Condition of the car: Cars in good condition are unlikely to have breakdown issues and could have a lower trip duration
 Car size: Small cars (hatchbacks) may have a lower trip duration and larger cars (e.g., SUVs) may have a higher trip duration, depending on the size of the car and congestion in the city

3. Type of the Trip


Trip types can differ based on the trip vendor: it could be an outstation trip, a single ride, or a pool ride. Let us now define a hypothesis based on the type of trip.

 Pool Car: Trips with pooling can lead to higher trip duration as the car reaches multiple
places before reaching your assigned destination

4. Features based on Driver Details


A driver is an important person when it comes to commute time. Various factors about the driver can help in understanding the reasons behind trip duration; here are a few hypotheses about this.

 Age of driver: Older drivers could be more careful and could contribute to a higher trip duration
 Gender: Female drivers may drive more slowly and could contribute to a higher trip duration
 Driver experience: Drivers with very little driving experience could cause a higher trip duration
 Medical condition: Drivers with a medical condition could contribute to a higher trip duration

5. Passenger details
Passengers can influence the trip duration knowingly or unknowingly. We often come across passengers requesting drivers to speed up because they are running late, and there are other factors we can hypothesize about.

 Age of passengers: Senior citizens as passengers may contribute to higher trip duration as
drivers tend to go slow in trips involving senior citizens
 Medical conditions or pregnancy: Passengers with medical conditions contribute to a longer
trip duration
 Emergency: Passengers with an emergency could contribute to a shorter trip duration
 Passenger count: Higher passenger count leads to shorter duration trips due to congestion
in seating

6. Date-Time Features
The day of the week and the time of day are important, as New York is a busy city and can be highly congested during office hours and on weekdays. Let us now generate a few hypotheses on date- and time-based features.

Pickup Day:

 Weekends could see more outstation trips and could have a higher trip duration
 Weekdays tend to have a higher trip duration due to heavy traffic
 If the pickup day falls on a holiday, the trip duration may be shorter
 If the pickup day falls during a festive week, the trip duration could be lower due to less traffic

Time:

 Early morning trips have a shorter trip duration due to less traffic
 Evening trips have a higher trip duration due to peak hours

7. Road-based Features
Roads are of different types and the condition of the road or obstructions in the road are factors
that can’t be ignored. Let’s form some hypotheses based on these factors.

 Condition of the road: The duration of the trip is longer if the condition of the road is bad
 Road type: Trips on concrete roads tend to have a lower trip duration
 Strike on the road: Strikes carried out on roads along the trip's route cause the trip duration to increase

8. Weather Based Features


Weather can change at any time and could possibly impact the commute if the weather turns bad.
Hence, this is an important feature to consider in our hypothesis.

 Weather at the start of the trip: Rainy weather condition contributes to a higher trip
duration

Chi-Square Statistic: How to Calculate It / Distribution


What is a Chi Square Test?
There are two types of chi-square tests. Both use the chi-square statistic and distribution for
different purposes:

 A chi-square goodness of fit test determines if sample data matches a population. For
more details on this type, see: Goodness of Fit Test.
 A chi-square test for independence compares two variables in a contingency table to see if they are related. In a more general sense, it tests whether distributions of categorical variables differ from one another.

What is a Chi-Square Statistic?


The formula for the chi-square statistic used in the chi-square test is:

\chi^2_c = \sum \frac{(O_i - E_i)^2}{E_i}

The subscript "c" is the degrees of freedom. "O" is your observed value and E is your expected value. It's very rare that you'll want to actually use this formula to find a critical chi-square value by hand. The summation symbol means that you'll have to perform a calculation for every single data item in your data set. As you can probably imagine, the calculations can get very, very lengthy and tedious. Instead, you'll probably want to use technology:
Instead, you’ll probably want to use technology:
 Chi Square Test in SPSS.
 Chi Square P-Value in Excel.
A chi-square statistic is one way to show a relationship between two categorical variables. In
statistics, there are two types of variables: numerical (countable) variables and non-numerical
(categorical) variables. The chi-squared statistic is a single number that tells you how much
difference exists between your observed counts and the counts you would expect if there were no
relationship at all in the population.
There are a few variations on the chi-square statistic. Which one you use depends upon how you
collected the data and which hypothesis is being tested. However, all of the variations use the same
idea, which is that you are comparing your expected values with the values you actually collect. One
of the most common forms can be used for contingency tables:

\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}

where O is the observed value, E is the expected value, and "i" is the "ith" position in the contingency table.

A low value for chi-square means there is a high correlation between your two sets of data. In
theory, if your observed and expected values were equal (“no difference”) then chi-square would be
zero — an event that is unlikely to happen in real life. Deciding whether a chi-square test statistic is
large enough to indicate a statistically significant difference isn't as easy as it seems. It would be nice if we could say a chi-square test statistic > 10 means a difference, but unfortunately that isn't the case.
You could take your calculated chi-square value and compare it to a critical value from a chi-square
table. If the chi-square value is more than the critical value, then there is a significant difference.
You could also use a p-value. First state the null hypothesis and the alternate hypothesis. Then
generate a chi-square curve for your results along with a p-value (See: Calculate a chi-square p-value
Excel). Small p-values (under 5%) usually indicate that a difference is significant (or “small enough”).
Tip: The chi-square statistic can only be used on counts (actual numbers). It can't be used for percentages, proportions, means, or similar statistical values. For example, if you have 10 percent of 200 people, you would need to convert that to a count (20) before you can run a test statistic.

Chi Square P-Values.


A chi square test will give you a p-value. The p-value will tell you if your test results are significant or
not. In order to perform a chi square test and get the p-value, you need two pieces of information:
1. Degrees of freedom. That’s just the number of categories minus 1.
2. The alpha level(α). This is chosen by you, or the researcher. The usual alpha level is 0.05
(5%), but you could also have other levels like 0.01 or 0.10.

In elementary statistics or AP statistics, both the degrees of freedom(df) and the alpha level are
usually given to you in a question. You don’t normally have to figure out what they are.
You may have to figure out the df yourself, but it’s pretty simple: count the categories and subtract
1.
Degrees of freedom are placed as a subscript after the chi-square (Χ²) symbol. For example, Χ²₆ denotes a chi-square with 6 df, and Χ²₄ denotes one with 4 df.

The Chi-Square Distribution



The chi-square distribution (also called the chi-squared distribution) is a special case of the gamma
distribution; A chi square distribution with n degrees of freedom is equal to a gamma distribution
with a = n / 2 and b = 0.5 (or β = 2).
Let's say you have a random sample taken from a standard normal distribution. The chi-square distribution is the distribution of the sum of the squares of these random samples. The degrees of freedom (k) are equal
to the number of samples being summed. For example, if you have taken 10 samples from the
normal distribution, then df = 10. The degrees of freedom in a chi square distribution is also
its mean. In this example, the mean of this particular distribution will be 10. Chi square distributions
are always right skewed. However, the greater the degrees of freedom, the more the chi square
distribution looks like a normal distribution.
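A quick numerical sketch of these properties, assuming SciPy is available, confirms both the gamma relationship and the fact that the mean equals the degrees of freedom:

from scipy.stats import chi2, gamma

df = 10

# The mean of a chi-square distribution equals its degrees of freedom
print(chi2.mean(df))  # 10.0

# Chi-square with df degrees of freedom is a gamma distribution with shape df/2 and scale 2
x = 7.3
print(chi2.pdf(x, df))                # same density as the gamma line below
print(gamma.pdf(x, df / 2, scale=2))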
Uses
The chi-squared distribution has many uses in statistics, including:

 Confidence interval estimation for a population standard deviation of a normal distribution from a sample standard deviation.
 Independence of two criteria of classification of qualitative variables.
 Relationships between categorical variables (contingency tables).
 Sample variance study when the underlying distribution is normal.
 Tests of deviations of differences between expected and observed frequencies (one-
way tables).
 The chi-square test (a goodness of fit test).
Chi Distribution
A similar distribution is the chi distribution, which describes the square root of a variable distributed according to a chi-square distribution. With df = n > 0 degrees of freedom, it has the probability density function

f(x) = \frac{2^{1 - n/2} \, x^{n-1} \, e^{-x^2/2}}{\Gamma(n/2)}

for positive values of x. The cdf for this function does not have a closed form, but it can be approximated with a series of integrals, using calculus.
How to Calculate a Chi Square Statistic
A chi-square statistic is used for testing hypotheses.
The chi-square formula is a difficult formula to deal with. That’s mostly because you’re expected to
add a large amount of numbers. The easiest way to solve the formula is by making a table.

Example question: 256 visual artists were surveyed to find out their zodiac sign. The results were:
Aries (29), Taurus (24), Gemini (22), Cancer (19), Leo (21), Virgo (18), Libra (19), Scorpio (20),
Sagittarius (23), Capricorn (18), Aquarius (20), Pisces (23). Test the hypothesis that zodiac signs are
evenly distributed across visual artists.
Step 1: Make a table with columns for "Categories," "Observed," "Expected," "Residual (Obs-Exp)," "(Obs-Exp)²," and "Component (Obs-Exp)² / Exp." Don't worry what these mean right now; we'll cover that in the following steps.

Step 2: Fill in your categories. Categories should be given to you in the question. There are 12 zodiac
signs, so:
Step 3: Write your counts. Counts are the number of items in each category; write them in column 2. You're given the counts in the question:

Step 4: Calculate your expected value for column 3. In this question, we would expect the 12 zodiac
signs to be evenly distributed for all 256 people, so 256/12=21.333. Write this in column 3.
Step 5: Subtract the expected value (Step 4) from the Observed value (Step 3) and place the result
in the “Residual” column. For example, the first row is Aries: 29-21.333=7.667.

Step 6: Square your results from Step 5 and place the amounts in the (Obs-Exp)2 column.
Step 7: Divide the amounts in Step 6 by the expected value (Step 4) and place those results in the
final column.

Step 8: Add up (sum) all the values in the last column.

This is the chi-square statistic: 5.094.
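As a quick check on the arithmetic, the following Python sketch reproduces the table's last column and the final statistic:

observed = [29, 24, 22, 19, 21, 18, 19, 20, 23, 18, 20, 23]
expected = sum(observed) / len(observed)  # 256 / 12 = 21.333...

components = [(o - expected) ** 2 / expected for o in observed]
chi_square = sum(components)
print(round(chi_square, 3))  # about 5.094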


Example problem: Run a chi square test in SPSS.
Note: In order to run a chi-square test in SPSS you should already have written a hypothesis
statement. See: How to state the null hypothesis.

Step 1: Click “Analyze,” then click “Descriptive Statistics,” then click “Crosstabs.”

Chi square in SPSS is found in the Crosstabs command.

Step 2: Click the “Statistics” button. The statistics button is to the right of the Crosstabs window. A
new pop up window will appear.

Step 3: Click “Chi Square” to place a check in the box and then click “Continue” to return to the
Crosstabs window.
Step 4: Select the variables you want to run (in other words, choose two variables that you want to
compare using the chi square test). Click one variable in the left window and then click the arrow at
the top to move the variable into “Row(s).” Repeat to add a second variable to the “Column(s)”
window.
Step 5: Click “cells” and then check “Rows” and “Columns”. Click “Continue.”
Step 6: Click “OK” to run the Chi Square Test. The Chi Square tests will be returned at the bottom of
the output sheet in the “Chi Square Tests” box.
Step 7: Compare the p-value returned in the chi-square area (listed in the Asymp Sig column) to your
chosen alpha level.

A chi-square test for independence shows how categorical variables are related. There are a few
variations on the statistic; which one you use depends upon how you collected the data. It also
depends on how your hypothesis is worded. All of the variations use the same idea; you are
comparing the values you expect to get (expected values) with the values you actually collect
(observed values). One of the most common forms can be used in a contingency table.
The chi square hypothesis test is appropriate if you have:

 Discrete outcomes (categorical).
 Dichotomous variables.
 Ordinal variables.
For example, you could have a clinical trial with blood sugar outcomes of hypoglycemic,
normoglycemic, or hyperglycemic.

Test a Chi Square Hypothesis: Steps


Sample question: Test the chi-square hypothesis with the following characteristics:
1. 11 Degrees of Freedom
2. Chi square test statistic of 5.094
Note: Degrees of freedom equals the number of categories minus 1.
Step 1: Take the chi-square statistic. Find the p-value in the chi-square table. If you are unfamiliar
with chi-square tables, the chi square table link also includes a short video on how to read the table.
The closest value for df=11 and 5.094 is between .900 and .950.
Note: The chi square table doesn’t offer exact values for every single possibility. If you use a
calculator, you can get an exact value. The exact p value is 0.9265.
Step 2: Use the p-value you found in Step 1. Decide whether to support or reject the null hypothesis.
In general, small p-values (1% to 5%) would cause you to reject the null hypothesis. This very large p-
value (92.65%) means that the null hypothesis should not be rejected.
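If SciPy is available, the exact p-value can also be computed directly instead of being read from a table; a minimal sketch for the example above:

from scipy.stats import chi2

statistic = 5.094
df = 11

p_value = chi2.sf(statistic, df)  # P(chi-square >= 5.094) with 11 df
print(round(p_value, 4))          # about 0.9265

alpha = 0.05
print("reject H0" if p_value < alpha else "fail to reject H0")  # fail to reject H0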

What is a T test?

Figure: Student's t distributions for various degrees of freedom.

The t test tells you how significant the differences between groups are; in other words, it lets you know if those differences (measured in means) could have happened by chance.

A very simple example: Let's say you have a cold and you try a naturopathic remedy. Your cold lasts a couple of days. The next time you have a cold, you buy an over-the-counter pharmaceutical and the cold lasts a week. You survey your friends and they all tell you that their colds were of a shorter duration (an average of 3 days) when they took the naturopathic remedy. What you really want to know is, are these results repeatable? A t test can tell you by comparing the means of the two groups and letting you know the probability of those results happening by chance.
Another example: Student’s T-tests can be used in real life to compare averages. For example, a
drug company may want to test a new cancer drug to find out if it improves life expectancy. In an
experiment, there’s always a control group (a group who are given a placebo, or “sugar pill”). The
control group may show an average life expectancy of +5 years, while the group taking the new drug
might have a life expectancy of +6 years. It would seem that the drug might work. But it could be
due to a fluke. To test this, researchers would use a Student’s t-test to find out if the results are
repeatable for an entire population.

The t score is a ratio between the difference between two groups and the difference within the
groups. The larger the t score, the more difference there is between groups. The smaller the t score,
the more similarity there is between groups. A t score of 3 means that the groups are three times as
different from each other as they are within each other. When you run a t test, the bigger the t-
value, the more likely it is that the results are repeatable.
 A large t-score tells you that the groups are different.
 A small t-score tells you that the groups are similar.
T-Values and P-values

How big is "big enough"? Every t-value has a p-value to go with it. A p-value is the probability that the results from your sample data occurred by chance. P-values range from 0% to 100% and are usually written as a decimal; for example, a p-value of 5% is 0.05. Low p-values are good; they indicate your data did not occur by chance. For example, a p-value of .01 means there is only a 1% probability that the results from an experiment happened by chance. In most cases, a p-value below 0.05 (5%) is taken to mean the results are statistically significant.
Calculating the Statistic / Test Types

There are three main types of t-test:


 An Independent Samples t-test compares the means for two groups.
 A Paired sample t-test compares means from the same group at different times (say, one year apart).
 A One sample t-test tests the mean of a single group against a known mean.
You probably don't want to calculate the test by hand (the math can get very messy), but if you insist, you can find the steps for an independent samples t test here.
Use the following tools to calculate the t test:
How to do a T test in Excel.
T test in SPSS.
T distribution on the TI 89.
T distribution on the TI 83.
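Alongside those tools, here is a minimal sketch of an independent samples t-test in Python using scipy.stats.ttest_ind; the two groups of scores are hypothetical and only illustrate the mechanics.

from scipy import stats

# Hypothetical scores for two independent groups
group_a = [14, 15, 15, 16, 13, 8, 14, 17, 16, 14]
group_b = [15, 17, 14, 17, 14, 8, 12, 19, 19, 14]

t_stat, p_value = stats.ttest_ind(group_a, group_b)   # assumes equal variances
print(t_stat, p_value)
# Use stats.ttest_ind(group_a, group_b, equal_var=False) for Welch's t-test
# when the two groups have unequal variances.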

What is a Paired T Test (Paired Samples T Test / Dependent Samples T Test)?


A paired t test (also called a correlated pairs t-test, a paired samples t test or dependent samples t
test) is where you run a t test on dependent samples. Dependent samples are essentially connected
— they are tests on the same person or thing. For example:
 Knee MRI costs at two different hospitals,
 Two tests on the same person before and after training,
 Two blood pressure measurements on the same person using different equipment.
When to Choose a Paired T Test / Paired Samples T Test / Dependent Samples T Test
Choose the paired t-test if you have two measurements on the same item, person or thing. You should also choose this test if you have two different items that are being measured under the same condition. For example, you might be measuring car safety performance in vehicle research and testing and subject the cars to a series of crash tests. Although the manufacturers are different, you might be subjecting them to the same conditions.

With a "regular" two sample t test, you're comparing the means for two different samples. For example, you might test two different groups of customer service associates on a business-related test, or test students from two universities on their English skills. If you take a random sample from each group separately and they have different conditions, your samples are independent and you should run an independent samples t test (also called a between-samples or unpaired-samples test).
The null hypothesis for the independent samples t-test is μ1 = μ2; in other words, it assumes the means are equal. With the paired t test, the null hypothesis is that the mean of the pairwise differences between the two tests is zero (H0: µd = 0). The difference between the two tests is very subtle; which one you choose is based on your data collection method.
Paired Samples T Test By hand
Example question: Calculate a paired t test by hand for the following data:

Step 1: Subtract each Y score from each X score.

Step 2: Add up all of the values from Step 1.


Set this number aside for a moment.
Step 3: Square the differences from Step 1.

Step 4: Add up all of the squared differences from Step 3.

Step 5: Use the following formula to calculate the t-score:

t = ΣD / √[ (n·ΣD² − (ΣD)²) / (n − 1) ]

where:
 ΣD: Sum of the differences (sum of X − Y from Step 2)
 ΣD²: Sum of the squared differences (from Step 4)
 (ΣD)²: Sum of the differences (from Step 2), squared
 n: Number of pairs (sample size)
If you’re unfamiliar with Σ you may want to read about summation notation first.
Step 6: Subtract 1 from the sample size to get the degrees of freedom. We have 11 items, so 11-1 =
10.
Step 7: Find the p-value in the t-table, using the degrees of freedom in Step 6. If you don’t have a
specified alpha level, use 0.05 (5%). For this example problem, with df = 10, the t-value is 2.228.

Step 8: Compare your t-table value from Step 7 (2.228) to your calculated t-value (-2.74). The
calculated t-value is greater than the table value at an alpha level of .05. The p-value is less than the
alpha level: p <.05. We can reject the null hypothesis that there is no difference between means.
Note: You can ignore the minus sign when comparing the two t-values, as ± indicates the direction;
the p-value remains the same for both directions.
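The same Steps 1 through 5 can be reproduced in a few lines of Python. The before/after scores below are hypothetical (the worked example's data table is not reproduced here), and scipy.stats.ttest_rel is used only as a cross-check on the hand formula.

import math
from scipy import stats

# Hypothetical paired scores (e.g. before and after training)
x = [12, 15, 11, 14, 16, 13, 18, 17, 15, 14, 12]
y = [14, 18, 13, 15, 20, 15, 19, 20, 16, 17, 15]

d = [xi - yi for xi, yi in zip(x, y)]      # Step 1: X - Y for each pair
sum_d = sum(d)                             # Step 2: sum of the differences
sum_d2 = sum(di ** 2 for di in d)          # Steps 3-4: sum of squared differences
n = len(d)                                 # 11 pairs, so df = 10

t = sum_d / math.sqrt((n * sum_d2 - sum_d ** 2) / (n - 1))   # Step 5 formula
print(t)

# Cross-check with SciPy's paired-samples t-test
t_check, p_value = stats.ttest_rel(x, y)
print(t_check, p_value)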

What is Analysis of Variance (ANOVA)?

Analysis of variance (ANOVA) is an analysis tool used in statistics that splits an observed aggregate variability
found inside a data set into two parts: systematic factors and random factors. The systematic factors have a
statistical influence on the given data set, while the random factors do not. Analysts use the ANOVA test to
determine the influence that independent variables have on the dependent variable in a regression study.

The t- and z-test methods developed in the 20th century were used for statistical analysis until 1918, when Ronald Fisher created the analysis of variance method. ANOVA is also called the Fisher analysis of variance, and it is the extension of the t- and z-tests. The term became well-known in 1925, after appearing in Fisher's book, "Statistical Methods for Research Workers." It was employed in experimental psychology and later expanded to subjects that were more complex.

KEY TAKEAWAYS

 Analysis of variance, or ANOVA, is a statistical method that separates observed variance data into different components to use for additional tests.
 A one-way ANOVA is used for three or more groups of data, to gain information about the relationship between the dependent and independent variables.
 If no true variance exists between the groups, the ANOVA's F-ratio should be close to 1.

The Formula for ANOVA is:

F = MST / MSE

where:
F = ANOVA coefficient
MST = Mean sum of squares due to treatment
MSE = Mean sum of squares due to error
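As a quick illustration of the formula, the sketch below computes MST and MSE by hand for three hypothetical groups of measurements and confirms that their ratio matches the F statistic returned by scipy.stats.f_oneway (a one-way ANOVA).

import numpy as np
from scipy import stats

# Hypothetical measurements for three independent groups
group1 = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
group2 = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
group3 = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]
groups = [group1, group2, group3]

k = len(groups)
n_total = sum(len(g) for g in groups)
grand_mean = np.mean(np.concatenate(groups))

ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)  # between-group SS
ss_within = sum(((np.array(g) - np.mean(g)) ** 2).sum() for g in groups)   # within-group SS
mst = ss_between / (k - 1)        # mean sum of squares due to treatment
mse = ss_within / (n_total - k)   # mean sum of squares due to error
print(mst / mse)                  # F = MST / MSE

f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f_stat, p_value)            # F close to 1 would suggest no real group differences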

What Does the Analysis of Variance Reveal?

The ANOVA test is the initial step in analyzing factors that affect a given data set. Once the test is finished, an analyst performs additional testing on the systematic factors that measurably contribute to the data set's variability. The analyst then uses the ANOVA test results in an F-test to generate additional data that aligns with the proposed regression models.

The ANOVA test allows a comparison of more than two groups at the same time to determine
whether a relationship exists between them. The result of the ANOVA formula, the F statistic (also
called the F-ratio), allows for the analysis of multiple groups of data to determine the variability
between samples and within samples.

If no real difference exists between the tested groups, which is called the null hypothesis, the result
of the ANOVA's F-ratio statistic will be close to 1. The distribution of all possible values of the F
statistic is the F-distribution. This is actually a group of distribution functions, with two
characteristic numbers, called the numerator degrees of freedom and the denominator degrees of
freedom.

Example of How to Use ANOVA

A researcher might, for example, test students from multiple colleges to see if students from one of
the colleges consistently outperform students from the other colleges. In a business application, an
R&D researcher might test two different processes of creating a product to see if one process is
better than the other in terms of cost efficiency.

The type of ANOVA test used depends on a number of factors. It is applied when the data are experimental. Analysis of variance can also be computed by hand when statistical software is not available; it is simple to use and best suited for small samples. With many experimental designs, the sample sizes have to be the same for the various factor level combinations.

ANOVA is helpful for testing three or more groups or variables. It is similar to running multiple two-sample t-tests, but it results in fewer type I errors and is appropriate for a wider range of issues. ANOVA assesses group differences by comparing the means of each group, and it partitions the overall variance into its different sources. It is employed with subjects, test groups, between groups, and within groups.

One-Way ANOVA Versus Two-Way ANOVA

There are two main types of ANOVA: one-way (or unidirectional) and two-way. There are also variations of ANOVA. For example, MANOVA (multivariate ANOVA) differs from ANOVA in that the former tests for multiple dependent variables simultaneously, while the latter assesses only one dependent variable at a time. One-way or two-way refers to the number of independent variables
in your analysis of variance test. A one-way ANOVA evaluates the impact of a sole factor on a sole
response variable. It determines whether all the samples are the same. The one-way ANOVA is used
to determine whether there are any statistically significant differences between the means of three
or more independent (unrelated) groups.
A two-way ANOVA is an extension of the one-way ANOVA. With a one-way, you have one independent variable affecting a dependent variable. With a two-way ANOVA, there are two independent variables. For example, a two-way ANOVA allows a company to compare worker productivity
based on two independent variables, such as salary and skill set. It is utilized to observe the
interaction between the two factors and tests the effect of two factors at the same time.

What is correlation analysis?

Correlation analysis in research is a statistical method used to measure the strength of the linear
relationship between two variables and compute their association. Simply put - correlation analysis
calculates the level of change in one variable due to the change in the other. A high correlation
points to a strong relationship between the two variables, while a low correlation means that the
variables are weakly related.
When it comes to market research, researchers use correlation analysis to analyze quantitative data collected through research methods like surveys and live polls. They try to identify relationships, patterns, significant connections, and trends between two variables or datasets. There is a positive correlation between two variables when an increase in one variable leads to an increase in the other. On the other hand, a negative correlation means that when one variable increases, the other decreases, and vice versa.

Example of correlation analysis

Correlation between two variables can be either a positive correlation, a negative correlation, or no
correlation. Let's look at examples of each of these three types:

 Positive correlation: A positive correlation between two variables means both the variables
move in the same direction. An increase in one variable leads to an increase in the other
variable and vice versa.
For example, spending more time on a treadmill burns more calories.
 Negative correlation: A negative correlation between two variables means that the variables
move in opposite directions. An increase in one variable leads to a decrease in the other
variable and vice versa.
For example, increasing the speed of a vehicle decreases the time you take to reach your
destination.
 Weak/Zero correlation: No correlation exists when one variable does not affect the other.
For example, there is no correlation between the number of years of school a person has
attended and the letters in his/her name.
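To make the idea concrete, here is a small sketch using scipy.stats.pearsonr on hypothetical treadmill data. The correlation coefficient r runs from -1.00 to +1.00: values near +1 indicate a strong positive correlation, values near -1 a strong negative correlation, and values near 0 little or no linear relationship.

from scipy import stats

# Hypothetical data: minutes spent on a treadmill vs. calories burned
minutes  = [10, 15, 20, 25, 30, 35, 40]
calories = [80, 120, 155, 200, 235, 270, 310]

r, p_value = stats.pearsonr(minutes, calories)
print(r, p_value)   # r close to +1 here, i.e. a strong positive correlation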
Uses of correlation analysis

Correlation analysis is used to study practical cases. Here, the researcher can't manipulate individual
variables. For example, correlation analysis is used to measure the correlation between the patient's
blood pressure and the medication used. Marketers use it to measure the effectiveness of
advertising. Researchers measure the increase/decrease in sales due to a specific marketing
campaign.

Advantages of correlation analysis

The advantages of correlation analysis are:

 Observe relationships: A correlation helps to identify the absence or presence of a relationship between two variables. It tends to be more relevant to everyday life.
 A good starting point for research: It proves to be a good starting point when a researcher starts investigating relationships for the first time.
 Uses for further studies: Researchers can identify the direction and strength of the relationship between two variables and later narrow the findings down in later studies.
 Simple metrics: Research findings are simple to classify. The findings can range from -1.00 to 1.00. There can be only three potential broad outcomes of the analysis.

What is a Likelihood-Ratio Test?


The Likelihood-Ratio test (sometimes called the likelihood-ratio chi-squared test) is a hypothesis
test that helps you choose the “best” model between two nested models. “Nested models” means
that one is a special case of the other. For example, you might want to find out which of the
following models is the best fit:
 Model One has four predictor variables (height, weight, age, sex),
 Model Two has two predictor variables (age,sex). It is “nested” within model one
because it has just two of the predictor variables (age, sex).
This theory can also be applied to matrices. For example, a scaled identity matrix is nested within a more complex compound symmetry matrix.
The best model is the one that makes the data most likely, or maximizes the likelihood function, fn(X1, …, Xn | Θ).
Although the concept is relatively easy to grasp (i.e. the likelihood function is highest nearer the true value for Θ), the calculations to find the inputs for the procedure are not.

Likelihood-ratio tests use log-likelihood functions, which are difficult and lengthy to calculate by hand. Most statistical software packages have built-in functions to handle them. On the other hand, log-likelihood functions pose other serious challenges, like the difficulty of calculating global maximums. These often involve hefty computations with complicated, multi-dimensional integrals.
Running the Test
Basically, the test compares the fit of two models. The null hypothesis is that the smaller model is the "best" model; it is rejected when the test statistic is large. In other words, if the null hypothesis is rejected, then the larger model is a significant improvement over the smaller one.
If you know the log-likelihood functions for the two models, the test statistic is relatively easy to calculate from the likelihood of the simpler model (Ls) and the likelihood of the model with more parameters (Lg):

LR = −2 ln( Ls / Lg ) = 2 ( ln Lg − ln Ls )

You might also see this equation with "s" written as the likelihood for the null model and "g" written as the likelihood for the alternative model.
The test statistic approximates a chi-squared random variable. Degrees of freedom for the test equal
the difference in the number of parameters for the two models.
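As a minimal sketch (not tied to any particular modeling library), suppose you have already fitted the two nested models described above and extracted their maximized log-likelihoods; the values below are hypothetical. The statistic is then compared against a chi-square distribution whose degrees of freedom equal the difference in the number of parameters.

from scipy.stats import chi2

# Hypothetical maximized log-likelihoods from two fitted, nested models
loglik_small = -134.8   # Model Two: 2 predictors (age, sex)
loglik_large = -130.2   # Model One: 4 predictors (height, weight, age, sex)

lr_stat = 2 * (loglik_large - loglik_small)   # likelihood-ratio test statistic
df = 4 - 2                                    # difference in number of parameters
p_value = chi2.sf(lr_stat, df)
print(lr_stat, p_value)   # a small p-value favors the larger model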

Data analysis techniques

Now that we're familiar with some of the different types of data, let's focus on the topic at hand: different methods for analyzing data.

a. Regression analysis

Regression analysis is used to estimate the relationship between a set of variables. When conducting
any type of regression analysis, you’re looking to see if there’s a correlation between a dependent
variable (that’s the variable or outcome you want to measure or predict) and any number of
independent variables (factors which may have an impact on the dependent variable). The aim of
regression analysis is to estimate how one or more variables might impact the dependent variable,
in order to identify trends and patterns. This is especially useful for making predictions and
forecasting future trends.

Let’s imagine you work for an ecommerce company and you want to examine the relationship
between: (a) how much money is spent on social media marketing, and (b) sales revenue. In this
case, sales revenue is your dependent variable—it’s the factor you’re most interested in predicting
and boosting. Social media spend is your independent variable; you want to determine whether or
not it has an impact on sales and, ultimately, whether it’s worth increasing, decreasing, or keeping
the same. Using regression analysis, you’d be able to see if there’s a relationship between the two
variables. A positive correlation would imply that the more you spend on social media marketing,
the more sales revenue you make. No correlation at all might suggest that social media marketing
has no bearing on your sales. Understanding the relationship between these two variables would
help you to make informed decisions about the social media budget going forward. However, it's important to note that, on their own, regressions can only be used to determine whether or not there is a relationship between a set of variables—they don't tell you anything about cause and effect. So, while a positive correlation between social media spend and sales revenue may suggest that one impacts the other, it's impossible to draw definitive conclusions based on this analysis alone.

There are many different types of regression analysis, and the model you use depends on the type of
data you have for the dependent variable. For example, your dependent variable might be
continuous (i.e. something that can be measured on a continuous scale, such as sales revenue in
USD), in which case you’d use a different type of regression analysis than if your dependent variable
was categorical in nature (i.e. comprising values that can be categorised into a number of distinct
groups based on a certain characteristic, such as customer location by continent). You can learn
more about different types of dependent variables and how to choose the right regression analysis
in this guide.
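To ground the ecommerce example, here is a minimal sketch of a simple linear regression in Python using scipy.stats.linregress; the monthly spend and revenue figures are invented purely for illustration.

from scipy import stats

# Hypothetical monthly figures: social media spend (USD) vs. sales revenue (USD)
spend   = [1000, 1500, 2000, 2500, 3000, 3500, 4000]
revenue = [20000, 24000, 26500, 31000, 33500, 38000, 40500]

result = stats.linregress(spend, revenue)
print(result.slope, result.intercept)   # revenue is modeled as intercept + slope * spend
print(result.rvalue, result.pvalue)     # rvalue near +1 suggests a strong positive relationship
# Remember: a strong correlation alone does not establish cause and effect.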
Regression analysis in action: Investigating the relationship between clothing brand Benetton’s
advertising expenditure and sales

b. Monte Carlo simulation

When making decisions or taking certain actions, there are a range of different possible outcomes. If
you take the bus, you might get stuck in traffic. If you walk, you might get caught in the rain or bump
into your chatty neighbor, potentially delaying your journey. In everyday life, we tend to briefly
weigh up the pros and cons before deciding which action to take; however, when the stakes are
high, it’s essential to calculate, as thoroughly and accurately as possible, all the potential risks and
rewards.

Monte Carlo simulation, otherwise known as the Monte Carlo method, is a computerized technique
used to generate models of possible outcomes and their probability distributions. It essentially
considers a range of possible outcomes and then calculates how likely it is that each particular
outcome will be realized. The Monte Carlo method is used by data analysts to conduct advanced risk
analysis, allowing them to better forecast what might happen in the future and make decisions
accordingly.

So how does Monte Carlo simulation work, and what can it tell us? To run a Monte Carlo simulation,
you’ll start with a mathematical model of your data—such as a spreadsheet. Within your
spreadsheet, you’ll have one or several outputs that you’re interested in; profit, for example, or
number of sales. You’ll also have a number of inputs; these are variables that may impact your
output variable. If you’re looking at profit, relevant inputs might include the number of sales, total
marketing spend, and employee salaries. If you knew the exact, definitive values of all your input
variables, you’d quite easily be able to calculate what profit you’d be left with at the end. However,
when these values are uncertain, a Monte Carlo simulation enables you to calculate all the possible
options and their probabilities. What will your profit be if you make 100,000 sales and hire five new
employees on a salary of $50,000 each? What is the likelihood of this outcome? What will your
profit be if you only make 12,000 sales and hire five new employees? And so on. It does this by
replacing all uncertain values with functions which generate random samples from distributions
determined by you, and then running a series of calculations and recalculations to produce models
of all the possible outcomes and their probability distributions. The Monte Carlo method is one of
the most popular techniques for calculating the effect of unpredictable variables on a specific output
variable, making it ideal for risk analysis.
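Here is a minimal sketch of that idea, assuming a toy profit model with three uncertain inputs: each uncertain value is replaced by a probability distribution, the model is recalculated many thousands of times with NumPy, and the resulting spread of profits approximates the probability distribution of the outcome. All distributions and figures are invented for illustration.

import numpy as np

rng = np.random.default_rng(42)
n_trials = 100_000

# Hypothetical uncertain inputs, each replaced by a probability distribution
units_sold  = rng.normal(loc=50_000, scale=8_000, size=n_trials)
unit_margin = rng.uniform(low=4.0, high=6.0, size=n_trials)         # profit per unit (USD)
fixed_costs = rng.normal(loc=120_000, scale=10_000, size=n_trials)  # salaries, marketing, etc.

profit = units_sold * unit_margin - fixed_costs   # recalculated for every simulated scenario

print(profit.mean())                       # expected profit across all scenarios
print((profit < 0).mean())                 # estimated probability of making a loss
print(np.percentile(profit, [5, 50, 95]))  # range of likely outcomes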

Monte Carlo simulation in action: A case study using Monte Carlo simulation for risk analysis

c. Factor analysis

Factor analysis is a technique used to reduce a large number of variables to a smaller number of
factors. It works on the basis that multiple separate, observable variables correlate with each other
because they are all associated with an underlying construct. This is useful not only because it
condenses large datasets into smaller, more manageable samples, but also because it helps to
uncover hidden patterns. This allows you to explore concepts that cannot be easily measured or
observed—such as wealth, happiness, fitness, or, for a more business-relevant example, customer
loyalty and satisfaction.

Let’s imagine you want to get to know your customers better, so you send out a rather long survey
comprising one hundred questions. Some of the questions relate to how they feel about your
company and product; for example, “Would you recommend us to a friend?” and “How would you
rate the overall customer experience?” Other questions ask things like “What is your yearly
household income?” and “How much are you willing to spend on skincare each month?”

Once your survey has been sent out and completed by lots of customers, you end up with a large
dataset that essentially tells you one hundred different things about each customer (assuming each
customer gives one hundred responses). Instead of looking at each of these responses (or variables)
individually, you can use factor analysis to group them into factors that belong together—in other
words, to relate them to a single underlying construct. In this example, factor analysis works by finding survey items that are strongly correlated (in technical terms, items that covary; this shared variation is their covariance). So, if there's a strong positive correlation between household income and how much they're willing to spend on skincare each month (i.e. as one increases, so does the other), these items may be grouped together.
Together with other variables (survey responses), you may find that they can be reduced to a single
factor such as “consumer purchasing power”. Likewise, if a customer experience rating of 10/10
correlates strongly with “yes” responses regarding how likely they are to recommend your product
to a friend, these items may be reduced to a single factor such as “customer satisfaction”.

In the end, you have a smaller number of factors rather than hundreds of individual variables. These
factors are then taken forward for further analysis, allowing you to learn more about your customers
(or any other area you’re interested in exploring).
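For a rough sense of how this looks in practice, the sketch below generates fake survey responses driven by two hidden constructs ("purchasing power" and "satisfaction") and then recovers two factors with scikit-learn's FactorAnalysis. Everything here, from the question labels to the choice of two factors, is an assumption made for illustration only.

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_customers = 200

# Two hidden constructs drive the observable survey answers
purchasing_power = rng.normal(size=n_customers)
satisfaction = rng.normal(size=n_customers)

responses = np.column_stack([
    purchasing_power + rng.normal(scale=0.3, size=n_customers),  # household income
    purchasing_power + rng.normal(scale=0.3, size=n_customers),  # monthly skincare spend
    satisfaction + rng.normal(scale=0.3, size=n_customers),      # customer experience rating
    satisfaction + rng.normal(scale=0.3, size=n_customers),      # likelihood to recommend
])

fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(responses)   # each customer now described by 2 factors, not 4 answers
print(fa.components_)                  # loadings: how strongly each question maps onto each factor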

Factor analysis in action: Using factor analysis to explore customer behavior patterns in Tehran

d. Cohort analysis

Cohort analysis is defined on Wikipedia as follows: "Cohort analysis is a subset of behavioral analytics that takes the data from a given dataset and rather than looking at all users as one unit, it breaks them into related groups for analysis. These related groups, or cohorts, usually share common characteristics or experiences within a defined time-span."

So what does this mean and why is it useful? Let’s break down the above definition further. A cohort
is a group of people who share a common characteristic (or action) during a given time period.
Students who enrolled at university in 2020 may be referred to as the 2020 cohort. Customers who
purchased something from your online store via the app in the month of December may also be
considered a cohort.

With cohort analysis, you’re dividing your customers or users into groups and looking at how these
groups behave over time. So, rather than looking at a single, isolated snapshot of all your customers
at a given moment in time (with each customer at a different point in their journey), you’re
examining your customers’ behavior in the context of the customer lifecycle. As a result, you can
start to identify patterns of behavior at various points in the customer journey—say, from their first
ever visit to your website, through to email newsletter sign-up, to their first purchase, and so on. As
such, cohort analysis is dynamic, allowing you to uncover valuable insights about the customer
lifecycle.

This is useful because it allows companies to tailor their service to specific customer segments (or
cohorts). Let’s imagine you run a 50% discount campaign in order to attract potential new customers
to your website. Once you’ve attracted a group of new customers (a cohort), you’ll want to track
whether they actually buy anything and, if they do, whether or not (and how frequently) they make
a repeat purchase. With these insights, you’ll start to gain a much better understanding of when this
particular cohort might benefit from another discount offer or retargeting ads on social media, for
example. Ultimately, cohort analysis allows companies to optimize their service offerings (and
marketing) to provide a more targeted, personalized experience. You can learn more about how to
run cohort analysis using Google Analytics here.
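A minimal pandas sketch of the mechanics: assign each customer to a cohort based on the month of their first purchase, then count how many customers from each cohort are still buying in each later month. The order log is hypothetical and far smaller than anything you would analyze in practice.

import pandas as pd

# Hypothetical order log: one row per purchase
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3, 3, 4],
    "order_date": pd.to_datetime([
        "2023-01-05", "2023-02-10", "2023-01-20",
        "2023-03-02", "2023-02-14", "2023-03-18", "2023-03-25",
    ]),
})

orders["order_month"] = orders["order_date"].dt.to_period("M")
# Each customer's cohort = the month of their first ever purchase
orders["cohort"] = orders.groupby("customer_id")["order_month"].transform("min")

# Rows: cohorts; columns: calendar months; values: active customers from that cohort
cohort_counts = (
    orders.groupby(["cohort", "order_month"])["customer_id"]
    .nunique()
    .unstack(fill_value=0)
)
print(cohort_counts)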

Cohort analysis in action: How Ticketmaster used cohort analysis to boost revenue

e. Cluster analysis

Cluster analysis is an exploratory technique that seeks to identify structures within a dataset. The
goal of cluster analysis is to sort different data points into groups (or clusters) that are internally
homogeneous and externally heterogeneous. This means that data points within a cluster are similar
to each other, and dissimilar to data points in another cluster. Clustering is used to gain insight into
how data is distributed in a given dataset, or as a preprocessing step for other algorithms.

There are many real-world applications of cluster analysis. In marketing, cluster analysis is commonly
used to group a large customer base into distinct segments, allowing for a more targeted approach
to advertising and communication. Insurance firms might use cluster analysis to investigate why
certain locations are associated with a high number of insurance claims. Another common
application is in geology, where experts will use cluster analysis to evaluate which cities are at
greatest risk of earthquakes (and thus try to mitigate the risk with protective measures).

It’s important to note that, while cluster analysis may reveal structures within your data, it won’t
explain why those structures exist. With that in mind, cluster analysis is a useful starting point for
understanding your data and informing further analysis. Clustering algorithms are also used in
machine learning—you can learn more about clustering in machine learning here.
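As a small illustration, the sketch below segments a handful of hypothetical customers with k-means, one of the most common clustering algorithms available in scikit-learn; the features, customer values, and choice of three clusters are all assumptions made for the example.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [annual spend (USD), orders per year]
customers = np.array([
    [200, 2], [250, 3], [300, 2],        # low-spend, infrequent buyers
    [1200, 10], [1300, 12], [1100, 9],   # mid-spend, regular buyers
    [5000, 40], [5200, 45], [4800, 38],  # high-spend, frequent buyers
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)            # cluster assignment for each customer
print(kmeans.cluster_centers_)   # the "typical" customer in each segment
# In practice you would usually scale the features first, since k-means is
# sensitive to variables measured on very different scales.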

Cluster analysis in action: Using cluster analysis for customer segmentation—a telecoms case study
example
f. Time series analysis

Time series analysis is a statistical technique used to identify trends and cycles over time. Time series
data is a sequence of data points which measure the same variable at different points in time (for
example, weekly sales figures or monthly email sign-ups). By looking at time-related trends, analysts
are able to forecast how the variable of interest may fluctuate in the future.

When conducting time series analysis, the main patterns you’ll be looking out for in your data are:

 Trends: Stable, linear increases or decreases over an extended time period.
 Seasonality: Predictable fluctuations in the data due to seasonal factors over a short period of time. For example, you might see a peak in swimwear sales in summer around the same time every year.
 Cyclic patterns: Unpredictable cycles where the data fluctuates. Cyclical trends are not due to seasonality, but rather, may occur as a result of economic or industry-related conditions.

As you can imagine, the ability to make informed predictions about the future has immense value
for business. Time series analysis and forecasting is used across a variety of industries, most
commonly for stock market analysis, economic forecasting, and sales forecasting. There are different
types of time series models depending on the data you’re using and the outcomes you want to
predict. These models are typically classified into three broad types: the autoregressive (AR) models,
the integrated (I) models, and the moving average (MA) models. For an in-depth look at time series
analysis, refer to this introductory study on time series modeling and forecasting.
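To show how the autoregressive, integrated, and moving-average pieces come together, here is a minimal sketch using the ARIMA class from statsmodels on an invented series of monthly sales; the (1, 1, 1) order is an arbitrary choice for illustration, not a recommendation.

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical monthly sales figures over two years
sales = pd.Series(
    [112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118,
     115, 126, 141, 135, 125, 149, 170, 170, 158, 133, 114, 140],
    index=pd.date_range("2022-01-01", periods=24, freq="MS"),
)

# ARIMA(p, d, q): p autoregressive terms, d differences, q moving-average terms
model = ARIMA(sales, order=(1, 1, 1)).fit()
print(model.forecast(steps=3))   # forecast the next three months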

Time series analysis in action: Developing a time series model to predict jute yarn demand in
Bangladesh

g. Sentiment analysis

When you think of data, your mind probably automatically goes to numbers and spreadsheets. Many
companies overlook the value of qualitative data, but in reality, there are untold insights to be
gained from what people (especially customers) write and say about you. So how do you go about
analyzing textual data?

One highly useful qualitative technique is sentiment analysis, a technique which belongs to the
broader category of text analysis—the (usually automated) process of sorting and understanding
textual data. With sentiment analysis, the goal is to interpret and classify the emotions conveyed
within textual data. From a business perspective, this allows you to ascertain how your customers
feel about various aspects of your brand, product, or service. There are several different types of
sentiment analysis models, each with a slightly different focus. The three main types include:

 Fine-grained sentiment analysis: If you want to focus on opinion polarity (i.e. positive,
neutral, or negative) in depth, fine-grained sentiment analysis will allow you to do so. For
example, if you wanted to interpret star ratings given by customers, you might use fine-
grained sentiment analysis to categorize the various ratings along a scale ranging from very
positive to very negative.
 Emotion detection: This model often uses complex machine learning algorithms to pick out
various emotions from your textual data. You might use an emotion detection model to
identify words associated with happiness, anger, frustration, and excitement, giving you
insight into how your customers feel when writing about you or your product on, say, a
product review site.
 Aspect-based sentiment analysis: This type of analysis allows you to identify what specific
aspects the emotions or opinions relate to, such as a certain product feature or a new ad
campaign. If a customer writes that they “find the new Instagram advert so annoying”, your
model should detect not only a negative sentiment, but also the object towards which it’s
directed.

In a nutshell, sentiment analysis uses various Natural Language Processing (NLP) systems and
algorithms which are trained to associate certain inputs (for example, certain words) with certain
outputs. For example, the input “annoying” would be recognized and tagged as “negative”.
Sentiment analysis is crucial to understanding how your customers feel about you and your
products, for identifying areas for improvement, and even for averting PR disasters in real-time!
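One accessible way to experiment with this is NLTK's VADER, a lexicon- and rule-based sentiment scorer; the sketch below is only one possible approach, and the example reviews are invented.

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")   # one-off download of the sentiment lexicon
sia = SentimentIntensityAnalyzer()

reviews = [
    "Love the new moisturizer, my skin has never felt better!",
    "I find the new Instagram advert so annoying.",
    "Delivery was on time.",
]
for review in reviews:
    print(review, sia.polarity_scores(review))
# The 'compound' score runs from -1 (very negative) to +1 (very positive).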

Sentiment analysis in action: 5 Real-world sentiment analysis case studies

4. The data analysis process

In order to gain meaningful insights from data, data analysts will perform a rigorous step-by-step
process. We go over this in detail in our step by step guide to the data analysis process—but, to
briefly summarize, the data analysis process generally consists of the following phases:

Association rules analysis is a technique to uncover how items are associated to each other. There
are three common ways to measure association.

Measure 1: Support. This says how popular an itemset is, as measured by the proportion of
transactions in which an itemset appears. In Table 1 below, the support of {apple} is 4 out of 8, or
50%. Itemsets can also contain multiple items. For instance, the support of {apple, beer, rice} is 2 out
of 8, or 25%.
Table 1. Example Transactions
If you discover that sales of items beyond a certain proportion tend to have a significant impact on
your profits, you might consider using that proportion as your support threshold. You may then
identify itemsets with support values above this threshold as significant itemsets.
Measure 2: Confidence. This says how likely item Y is purchased when item X is purchased,
expressed as {X -> Y}. This is measured by the proportion of transactions with item X, in which item Y
also appears. In Table 1, the confidence of {apple -> beer} is 3 out of 4, or 75%.

One drawback of the confidence measure is that it might misrepresent the importance of an
association. This is because it only accounts for how popular apples are, but not beers. If beers are
also very popular in general, there will be a higher chance that a transaction containing apples will
also contain beers, thus inflating the confidence measure. To account for the base popularity of both
constituent items, we use a third measure called lift.

Measure 3: Lift. This says how likely item Y is purchased when item X is purchased, while controlling for how popular item Y is. In Table 1, the lift of {apple -> beer} is 1, which implies no association between the items. A lift value greater than 1 means that item Y is likely to be bought if item X is bought, while a value less than 1 means that item Y is unlikely to be bought if item X is bought.
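Since Table 1 itself is not reproduced here, the sketch below uses a toy set of eight transactions chosen to be consistent with the figures quoted above (support of {apple} = 50%, support of {apple, beer, rice} = 25%, confidence of {apple -> beer} = 75%, lift of {apple -> beer} = 1) and computes all three measures in plain Python.

# Toy transactions standing in for Table 1 (one set of items per basket)
transactions = [
    {"apple", "beer", "rice", "chicken"},
    {"apple", "beer", "rice"},
    {"apple", "beer"},
    {"apple", "pear"},
    {"milk", "beer", "rice", "chicken"},
    {"milk", "beer", "rice"},
    {"milk", "beer"},
    {"milk", "pear"},
]
n = len(transactions)

def support(itemset):
    """Proportion of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

def confidence(x, y):
    """Confidence of the rule {x -> y}."""
    return support(x | y) / support(x)

def lift(x, y):
    """Confidence of {x -> y}, controlling for how popular y is."""
    return confidence(x, y) / support(y)

print(support({"apple"}))                  # 0.5
print(support({"apple", "beer", "rice"}))  # 0.25
print(confidence({"apple"}, {"beer"}))     # 0.75
print(lift({"apple"}, {"beer"}))           # 1.0, i.e. no association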

An Illustration
We use a dataset on grocery transactions from the arules R library. It contains actual transactions at
a grocery outlet over 30 days. The network graph below shows associations between selected items.
Larger circles imply higher support, while red circles imply higher lift:
Associations between selected items. Visualized using the arulesViz R library.

Several purchase patterns can be observed. For example:

 The most popular transaction was of pip and tropical fruits
 Another popular transaction was of onions and other vegetables
 If someone buys meat spreads, he is likely to have bought yogurt as well
 Relatively many people buy sausage along with sliced cheese
 If someone buys tea, he is likely to have bought fruit as well, possibly inspiring the production of fruit-flavored tea
Recall that one drawback of the confidence measure is that it tends to misrepresent the importance
of an association. To demonstrate this, we go back to the main dataset to pick 3 association rules
containing beer:

Table 2. Association measures for beer-related rules

The {beer -> soda} rule has the highest confidence at 20%. However, both beer and soda appear
frequently across all transactions (see Table 3), so their association could simply be a fluke. This is
confirmed by the lift value of {beer -> soda}, which is 1, implying no association between beer and
soda.
Table 3. Support of individual items
On the other hand, the {beer -> male cosmetics} rule has a low confidence, due to few purchases of
male cosmetics in general. However, whenever someone does buy male cosmetics, he is very likely
to buy beer as well, as inferred from a high lift value of 2.6. The converse is true for {beer -> berries}.
With a lift value below 1, we may conclude that if someone buys berries, he would likely be averse
to beer.

It is easy to calculate the popularity of a single itemset, like {beer, soda}. However, a business owner
would not typically ask about individual itemsets. Rather, the owner would be more interested in
having a complete list of popular itemsets. To get this list, one needs to calculate the support values
for every possible configuration of items, and then shortlist the itemsets that meet the minimum
support threshold.
In a store with just 10 items, the total number of possible configurations to examine would be a whopping 1,023 (that is, 2^10 − 1 non-empty itemsets). This number increases exponentially in a store with hundreds of items.

Is there a way to reduce the number of item configurations to consider?

Apriori Algorithm
The apriori principle can reduce the number of itemsets we need to examine. Put simply, the apriori
principle states that
if an itemset is infrequent, then all its supersets must also be infrequent
This means that if {beer} was found to be infrequent, we can expect {beer, pizza} to be equally or
even more infrequent. So in consolidating the list of popular itemsets, we need not consider {beer,
pizza}, nor any other itemset configuration that contains beer.

Finding itemsets with high support


Using the apriori principle, the number of itemsets that have to be examined can be pruned, and the
list of popular itemsets can be obtained in these steps:

Step 0. Start with itemsets containing just a single item, such as {apple} and {pear}.
Step 1. Determine the support for itemsets. Keep the itemsets that meet your minimum support
threshold, and remove itemsets that do not.
Step 2. Using the itemsets you have kept from Step 1, generate all the possible itemset
configurations.
Step 3. Repeat Steps 1 & 2 until there are no more new itemsets.
Note that the support threshold that you pick in Step 1 could be based on formal analysis or past experience. If
you discover that sales of items beyond a certain proportion tend to have a significant impact on your profits,
you might consider using that proportion as your support threshold.
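The sketch below is a bare-bones, illustrative implementation of this procedure in plain Python (real analyses would normally rely on a dedicated library such as the arules package mentioned above). The transactions are the same toy baskets used earlier, and the min_support value of 0.25 is an arbitrary threshold chosen for the example.

transactions = [
    {"apple", "beer", "rice", "chicken"},
    {"apple", "beer", "rice"},
    {"apple", "beer"},
    {"apple", "pear"},
    {"milk", "beer", "rice", "chicken"},
    {"milk", "beer", "rice"},
    {"milk", "beer"},
    {"milk", "pear"},
]

def frequent_itemsets(transactions, min_support=0.25):
    n = len(transactions)

    def support(items):
        return sum(items <= t for t in transactions) / n

    items = {item for t in transactions for item in t}
    # Steps 0-1: start with single-item itemsets and keep those meeting the threshold
    current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = {s: support(s) for s in current}

    k = 2
    while current:
        # Step 2: build k-item candidates only from the surviving (k-1)-item sets;
        # this is where the apriori principle prunes the search space
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        current = [c for c in candidates if support(c) >= min_support]
        frequent.update({c: support(c) for c in current})
        k += 1   # Step 3: repeat until no new itemsets survive
    return frequent

for itemset, sup in sorted(frequent_itemsets(transactions).items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), round(sup, 2))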

Finding item rules with high confidence or lift


We have seen how the apriori algorithm can be used to identify itemsets with high support. The same principle
can also be used to identify item associations with high confidence or lift. Finding rules with high confidence or
lift is less computationally taxing once high-support itemsets have been identified, because confidence and lift
values are calculated using support values.

Take for example the task of finding high-confidence rules. If the rule
{beer, chips -> apple}

has low confidence, all other rules with the same constituent items and with apple on the right hand
side would have low confidence too. Specifically, the rules

{beer -> apple, chips}


{chips -> apple, beer}

would have low confidence as well. As before, lower level candidate item rules can be pruned using
the apriori algorithm, so that fewer candidate rules need to be examined.

Limitations
 Computationally Expensive. Even though the apriori algorithm reduces the number of candidate itemsets to consider, this number could still be huge when store inventories are large or when the support threshold is low. One way to mitigate this is to reduce the number of comparisons by using advanced data structures, such as hash tables, to sort candidate itemsets more efficiently.
 Spurious Associations. Analysis of large inventories would involve more itemset
configurations, and the support threshold might have to be lowered to detect certain
associations. However, lowering the support threshold might also increase the number of
spurious associations detected. To ensure that identified associations are generalizable,
they could first be distilled from a training dataset, before having their support and
confidence assessed in a separate test dataset.
