

DATA MINING BY DEEYA NIKAM 171


WHAT IS DATA MINING?

Data mining is an automatic or semi-automatic technical process that analyses large amounts of scattered information to make sense of it and turn it into knowledge. It looks for anomalies, patterns or correlations among millions of records in order to predict results, as indicated by the SAS Institute, a world leader in business analytics.

In the meantime, information continues to grow and grow. A 2017 study on big data reveals that 90% of the world's data was created after 2014 and that its volume doubles every 1.2 years. In this context, data mining is a strategic practice considered important by almost 80% of organisations that apply business intelligence, according to Forbes.

Thanks to the joint action of analytics and data mining, which combines statistics, Artificial Intelligence and machine learning, companies can create models to discover connections between millions of records. Some of the possibilities of data mining include:

• Cleaning data of noise and repetitions.
• Extracting the relevant information and using it to evaluate possible results.
• Making better and faster business decisions.
DESCRIPTIVE PATTERNS
• Class/concept description: Data entries are associated with labels or classes. For instance, in a library, the classes of borrowed items include books and research journals, and the concepts of customers include registered members and non-registered members. These types of descriptions are called class or concept descriptions.

• Frequent patterns: These are data points that occur frequently in the dataset. There are several kinds of frequent patterns, such as frequent itemsets, frequent subsequences, and frequent sub-structures.

• Associations: Association analysis shows the relationships between data items in the form of association rules. For instance, a shopkeeper may observe that 70% of the time, when a football is sold, a kit is bought alongside; the two items can then be combined into an association rule (a small numeric sketch of this idea follows the list below).

• Correlations: Correlation analysis is performed to find the statistical relationship between two data attributes, that is, whether they have a positive, a negative, or no effect on each other.

• Clusters: Clustering is the formation of groups of similar data points. Each point in a cluster is similar to the other points in the same cluster but very different from the members of other clusters.
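The "70% of the time" figure in the associations example above is the confidence of the rule football → kit. Below is a minimal Python sketch, using made-up transactions that are not from the source, showing how the support and confidence of such a rule are computed:

```python
# Hypothetical transactions; each set is the items bought in one purchase.
transactions = [
    {"football", "kit"},
    {"football", "kit"},
    {"football"},
    {"kit", "socks"},
    {"football", "kit", "socks"},
]

n_total = len(transactions)
n_football = sum("football" in t for t in transactions)       # 4 transactions
n_both = sum({"football", "kit"} <= t for t in transactions)  # 3 transactions

support = n_both / n_total         # fraction of all transactions containing both items
confidence = n_both / n_football   # of the transactions with a football, how many also have a kit

print(f"support(football & kit)     = {support:.2f}")     # 0.60
print(f"confidence(football -> kit) = {confidence:.2f}")   # 0.75
```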
PREDICTIVE PATTERNS
• Classification: Classification helps predict the label of unknown data points with the help of known data points. For instance, if we have a dataset of X-rays of cancer patients, then the possible labels would be cancer patient and non-cancer patient. These classes can be obtained by data characterization or by data discrimination.
• Regression: Unlike classification, regression is used to find missing numeric values in the dataset, and it can also be used to predict future numeric values. For instance, we can estimate next year's sales from the past twenty years' sales by finding the relationship in the data (a small sketch follows this list).
• Outlier analysis: Not all data points in a dataset follow the same behaviour. Data points that deviate from the usual behaviour are called outliers, and their study is called outlier analysis. Outliers are usually set aside when working on the rest of the data.
• Evolution analysis: As the name suggests, evolution analysis studies data points whose behaviour and trends change over time.
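As a small illustration of the regression bullet above, here is a hedged sketch that fits a straight line to twenty years of made-up sales figures and extrapolates one year ahead; the data, the linear model, and the use of NumPy's polyfit are illustrative choices, not part of the source:

```python
import numpy as np

# Hypothetical yearly sales for the past twenty years (arbitrary units).
years = np.arange(2005, 2025)
sales = 100 + 5 * (years - 2005) + np.random.default_rng(0).normal(0, 3, size=years.size)

# Fit a straight line sales ~ a*year + b and extrapolate one year ahead.
a, b = np.polyfit(years, sales, deg=1)
print(f"predicted sales for 2025: {a * 2025 + b:.1f}")
```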
APRIORI ALGORITHM
Apriori is a common algorithm in data mining. It is used to identify the most frequently occurring itemsets and meaningful associations in a dataset. For example, the products bought by consumers in a shop can be used as inputs to the algorithm.
An effective Market Basket Analysis is valuable because it lets consumers purchase their products more conveniently, which in turn increases market sales. The technique has also been applied in healthcare to aid in the identification of harmful medication responses, by finding which combinations of drugs and patient factors are associated with adverse drug reactions.

In 1994, R. Agrawal and R. Srikant developed the Apriori method for identifying the most frequent itemsets in a dataset using Boolean association rules. The method is called Apriori because it makes use of prior knowledge about frequent itemset properties. It proceeds with an iterative, level-wise approach in which frequent k-itemsets are used to find frequent (k+1)-itemsets.

An essential feature known as the Apriori property is used to boost the effectiveness of this level-wise generation: every non-empty subset of a frequent itemset must itself be frequent. This property prunes the search space and thereby speeds up the level-wise discovery of frequent patterns.
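The Apriori property translates directly into a pruning check: a candidate (k+1)-itemset can be skipped if any of its k-item subsets is not frequent. A minimal sketch of that check (the item names are hypothetical):

```python
from itertools import combinations

def survives_apriori_pruning(candidate, frequent_k):
    """A (k+1)-candidate can be frequent only if every k-item subset of it is frequent."""
    k = len(candidate) - 1
    return all(frozenset(sub) in frequent_k for sub in combinations(candidate, k))

# {A, B} and {A, C} are frequent but {B, C} is not, so the candidate {A, B, C} is pruned.
frequent_2 = {frozenset({"A", "B"}), frozenset({"A", "C"})}
print(survives_apriori_pruning(frozenset({"A", "B", "C"}), frequent_2))  # False
```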
HOW DOES APRIORI WORK?
The Apriori algorithm operates on a straightforward premise: when the support of an itemset exceeds a certain threshold, it is considered a frequent itemset. Consider the following steps. To begin, set the support criterion, meaning that only itemsets whose support exceeds the criterion are considered relevant.

Step 1: Create a list of all the elements that appear in any transaction and build a frequency table.
Step 2: Set the minimum level of support. Only elements whose support exceeds or equals the threshold are significant.
Step 3: Form all possible pairs of the significant elements, bearing in mind that AB and BA are the same pair.
Step 4: Count the number of times each pair appears across the transactions.
Step 5: Keep only those pairs that meet the support criterion.
Step 6: Now suppose you want to find sets of three items that may be bought together. A rule known as self-join is used to build three-item sets: among the frequent pairs OP, OB, PB, and PM, two pairs that share the same first item are combined.
OP and OB give OPB.
PB and PM give PBM.
Step 7: Apply the threshold criterion again to obtain the significant three-item sets.
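The sketch below follows the seven steps above on a small hypothetical transaction list; the item letters O, P, B, and M echo the pairs mentioned in Step 6, and the minimum-support value is an illustrative choice:

```python
# Hypothetical transactions; the letters O, P, B, M stand for products.
transactions = [
    {"O", "P", "B", "M"},
    {"O", "P", "B"},
    {"O", "P", "B"},
    {"P", "M"},
    {"P", "B", "M"},
    {"O", "M"},
]
min_support = 3  # Step 2: an itemset must appear in at least 3 transactions

def support(itemset):
    """Steps 1 and 4: count the transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions)

# Steps 1-2: frequent 1-itemsets from the frequency table.
items = {i for t in transactions for i in t}
levels = [{frozenset([i]) for i in items if support(frozenset([i])) >= min_support}]

# Steps 3-7: join frequent k-itemsets into (k+1)-item candidates (self-join),
# then keep only the candidates that still clear the support threshold.
while levels[-1]:
    prev = levels[-1]
    candidates = {a | b for a in prev for b in prev if len(a | b) == len(a) + 1}
    levels.append({c for c in candidates if support(c) >= min_support})

for k, itemsets in enumerate(levels[:-1], start=1):
    print(f"frequent {k}-itemsets:", sorted(sorted(s) for s in itemsets))
```

With this data the frequent pairs come out as OP, OB, PB, and PM, and only OPB survives the third pass, mirroring the walk-through above.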
STEPS, ADVANTAGES & DISADVANTAGES OF APRIORI ALGORITHM
The Apriori algorithm has the following steps:

Step 1: Compute the support of itemsets in the transactional database and set the minimum support and confidence.
Step 2: Take all itemsets in the transactions whose support is higher than the chosen minimum support value.
Step 3: Find all rules from these subsets that have higher confidence than the minimum confidence.
Step 4: Sort the rules in decreasing order of strength.
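A minimal sketch of Steps 1–4 above: starting from hypothetical itemset supports, it derives rules, keeps those above a minimum confidence, and sorts them by strength (measured here by lift); all numbers are illustrative, not from the source:

```python
from itertools import combinations

# Hypothetical supports (fractions of transactions) produced by a previous Apriori pass.
support = {
    frozenset({"O"}): 0.6, frozenset({"P"}): 0.8, frozenset({"B"}): 0.7,
    frozenset({"O", "P"}): 0.5, frozenset({"O", "B"}): 0.45, frozenset({"P", "B"}): 0.55,
}
min_confidence = 0.7  # Step 1: chosen minimum confidence

rules = []
for itemset, s in support.items():
    if len(itemset) < 2:
        continue
    # Step 3: split each frequent itemset into antecedent -> consequent and
    # keep the rule only if its confidence clears the threshold.
    for r in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, r)):
            consequent = itemset - antecedent
            confidence = s / support[antecedent]
            lift = confidence / support[consequent]  # one common measure of rule strength
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, confidence, lift))

# Step 4: strongest rules first.
for a, c, conf, lift in sorted(rules, key=lambda r: r[3], reverse=True):
    print(f"{set(a)} -> {set(c)}  confidence={conf:.2f}  lift={lift:.2f}")
```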

Advantages of Apriori
The algorithm is simple to understand.
The join and prune steps are easy to apply on large itemsets in large databases.

Disadvantages of Apriori
It requires a large number of computations if the itemsets are very large and the minimum support is kept very low.
The whole database has to be scanned in full on every pass.
FP GROWTH ALGORITHM
What is FP Growth Algorithm?
The FP-Growth algorithm is an alternative way to find frequent itemsets without candidate generation, which improves performance. To achieve this, it uses a divide-and-conquer strategy. The core of the method is a special data structure called the frequent-pattern tree (FP-tree), which retains the itemset association information.

This algorithm works as follows:

First, it compresses the input database into an FP-tree instance that represents the frequent items.
After this first step, it divides the compressed database into a set of conditional databases, each associated with one frequent pattern.
Finally, each such database is mined separately.
Using this strategy, FP-Growth reduces the search cost by recursively looking for short patterns and then concatenating them into longer frequent patterns.

In large databases, holding the whole FP-tree in main memory is impossible. One strategy to cope with this problem is to partition the database into a set of smaller databases (called projected databases) and then construct an FP-tree from each of these smaller databases.
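Below is a hedged sketch of only the first FP-Growth step described above, compressing transactions into an FP-tree; the conditional-database mining step is omitted, and the transactions and support threshold are hypothetical:

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    """First FP-Growth step: compress the database into a prefix tree of frequent items."""
    counts = Counter(i for t in transactions for i in t)    # scan 1: item frequencies
    frequent = {i for i, c in counts.items() if c >= min_support}

    root = Node(None, None)
    for t in transactions:                                  # scan 2: insert transactions
        # Keep only frequent items, ordered by descending global frequency.
        ordered = sorted((i for i in t if i in frequent), key=lambda i: (-counts[i], i))
        node = root
        for item in ordered:
            node = node.children.setdefault(item, Node(item, node))
            node.count += 1
    return root

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

# Hypothetical transactions (same letters as in the Apriori example).
transactions = [{"O", "P", "B"}, {"O", "P"}, {"P", "B", "M"}, {"P", "B"}, {"O", "P", "B"}]
show(build_fp_tree(transactions, min_support=2))
```

Shared prefixes such as P → B are stored once with a count, which is exactly the compression the paragraph above describes.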
ADVANTAGES AND DISADVANTAGES OF FP GROWTH
Advantages of FP Growth Algorithm
The FP-Growth algorithm has the following advantages:

This algorithm needs to scan the database only twice, whereas Apriori scans the transactions on each iteration.
The pairing of items is not done in this algorithm, which makes it faster.
The database is stored in a compact form in memory.
It is efficient and scalable for mining both long and short frequent patterns.

Disadvantages of FP-Growth Algorithm
The algorithm also has some disadvantages, such as:

The FP-tree is more cumbersome and difficult to build than Apriori's candidate sets.
It may be expensive.
The algorithm may not fit in main memory when the database is large.
DIFFERENCE BETWEEN APRIORI AND FP GROWTH ALGORITHM
• Apriori generates candidate itemsets and scans the database on every iteration; FP-Growth avoids candidate generation and scans the database only twice.
• Apriori works level-wise on itemsets, while FP-Growth compresses the database into an FP-tree and mines conditional databases with a divide-and-conquer strategy.
• Apriori is simpler to understand and implement; the FP-tree is more cumbersome to build and may not fit in main memory when the database is very large.
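As a hedged usage sketch (not part of the source), the third-party mlxtend library exposes both algorithms behind the same interface, which makes the comparison easy to try; the function and parameter names below reflect mlxtend's documented API but should be checked against the installed version:

```python
# pip install mlxtend pandas
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth

transactions = [["O", "P", "B"], ["O", "P"], ["P", "B", "M"], ["P", "B"], ["O", "P", "B"]]

# One-hot encode the transactions into a boolean DataFrame, as both functions expect.
te = TransactionEncoder()
df = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Both calls return the same frequent itemsets; the difference lies in how they are
# computed internally (candidate generation plus repeated scans vs. an FP-tree).
print(apriori(df, min_support=0.4, use_colnames=True))
print(fpgrowth(df, min_support=0.4, use_colnames=True))
```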
Thank you!
DEEYA NIKAM 171
