Car Evaluation Dataset Python PDF
Abstract - Today's information systems store large quantities of transactional data which may contain valuable but hidden information. Association rules generation is a method developed for analysis of this type of data. This method is, however, very resource-intensive. Both the execution time and the final model depend heavily on threshold parameters set by the analysts. In this paper we analyze the impact of the minimal support parameter on the number of closed frequent itemsets discovered for various datasets. The analysis is conducted on referential datasets as well as real-life datasets which classify transactional elements in categories organized in a hierarchical manner. Datasets that are not originally transactional can be transformed into a transactional form. They exhibit somewhat different relations between the minimal support parameter and the number of discovered closed frequent itemsets. Findings presented in this paper can serve as guidance in setting up support as the most important parameter affecting the final execution time. In order to analyze data characteristics from another perspective, we also present how varying confidence, lift and support can affect the number of formed association rules containing two elements.

I. INTRODUCTION

Current information systems store large quantities of transactional data. This data is often summarized and further analyzed in an aggregated form. However, analysis of raw transactional data at the lowest level can offer different insights into the data. These insights mainly refer to relationships between specific transactional elements. The data mining method developed to explore this specific type of data is association rules generation. This method is very resource-intensive and thus often accompanied by long execution times, especially if the threshold parameters are set too low. In this case a very large number of rules is generated, many of them redundant or uninteresting. Alternatively, if the threshold parameters are set too high, the generation of rules will finish in a short time, but only a small number of rules will be produced, if any, and those will most probably express facts already known to the business analysts.

The most resource-intensive step in association rules generation is finding closed frequent itemsets (CFIs, definition given in Section III). The number of CFIs determines the total execution time of the final model generation. Knowledge about the relationship between the number of frequent itemsets and the threshold parameters can be useful guidance for the analysts to ensure acceptable execution times and satisfactory results, especially if coupled with domain knowledge and the specificities of similar datasets. This relationship depends on various factors such as: data domain and inherent relationships between the elements, number of transactions, number of available items, average number of items per transaction etc. Some other measurable statistics of the observed dataset could also be of importance (such as mode, median, etc.).

In order to investigate the problem from different perspectives, we used various natively transactional datasets: a small referential dataset and a collection of datasets extracted from a large real-life database concerning sales, courtesy of a major Croatian retailer. The extracted datasets present elements across different hierarchical levels of categorization, so that the analyst can observe the behavior of threshold parameter settings in different scenarios. The conclusions drawn from our experiments can serve as guidance for choosing the most promising hierarchical level for exploration as well as for setting up threshold parameters.

We also considered possible application of the developed concepts to other types of data. Non-transactional data can be adapted into a transactional form before conducting the analysis, which may enable new insights into the dependencies between the observed objects or the variables connected to them.

There are also additional parameters that analysts can set and that affect which association rules are finally presented. The most commonly used parameters are confidence and lift. When considering relationships between pairs of elements, it is interesting to see how many of them are filtered out by combinations of support and lift or support and confidence. The resulting graphs presenting the number of unfiltered association rules are useful for drawing certain generic conclusions that can be applied across various datasets.

Our analysis was conducted using the Apriori algorithm [10] implemented through a Python program module. It enabled us to control all steps in the rule generation process and to measure the execution time and the total number of CFIs resulting from different support threshold settings.
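The filtering of two-element rules by support, confidence and lift mentioned in the introduction can be illustrated with a minimal sketch. This is not the authors' module: the function name `pair_rules`, its parameters and the toy transactions are assumptions made purely for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy transactions (hypothetical data, not taken from the paper).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def pair_rules(transactions, min_support, min_confidence=0.0, min_lift=0.0):
    """Return rules lhs -> rhs over item pairs that pass all three thresholds.

    Each rule is a tuple (lhs, rhs, support, confidence, lift), where
    support, confidence and lift are relative values computed over the
    given transactions.
    """
    n = len(transactions)
    item_count = Counter(item for t in transactions for item in t)
    pair_count = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue  # the pair itself is not frequent enough
        # Each frequent pair yields two candidate rules, one per direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_count[lhs]
            lift = confidence / (item_count[rhs] / n)
            if confidence >= min_confidence and lift >= min_lift:
                rules.append((lhs, rhs, support, confidence, lift))
    return rules


# Count the rules that survive a few threshold combinations.
for min_lift in (0.0, 1.0):
    kept = pair_rules(transactions, min_support=0.5, min_lift=min_lift)
    print(f"min_lift={min_lift}: {len(kept)} rules kept")
```

Sweeping one threshold while holding the others fixed, as in the loop above, yields the kind of count of unfiltered rules that the text describes plotting.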
Object representations of the input data were named Transaction and Collection. Transaction is basically a binary vector (or list) corresponding to one of the rows in the input data. Collection is a subclass of Transaction, and it represents one of the ultimately mined frequent datasets. It is expanded with some additional numeric parameters (such as support). Transactions are bundled inside a TransactionSet, which, aside from being a container for transactions, also stores attribute names, the original filename etc. Collections, on the other hand, are placed inside a LayeredCollectionSet, a structure that houses but also organizes mined frequent collections based on their support, number of attributes etc.

CollectionSetFactory is the main module for converting input data into TransactionSets and, subsequently, LayeredCollectionSets. It implements the Apriori algorithm but also a number of extra functionalities which calculate additional numerical parameters for the set. It allows caching of already processed datasets to avoid unnecessary repeated analyses and even enables internal linkage of collections with

The interdependency between the number of CFIs and the support threshold parameter is a characteristic of every dataset. It depends on relationships between transactional elements, data domain and semantics. Taking market basket analysis as an example, combinations of transaction elements can be a consequence of customer habits, season, weather and various customer characteristics, but also of specific store offerings, product characteristics and so on. However, examination of various datasets revealed that, regardless of dataset specifics and origin, lowering the support threshold commonly results in an explosion of the total number of CFIs. An immediate consequence of this is a drastic increase in execution time and resource requirements.

In order to show the interdependency between support, execution time and the number of CFIs, a real-life dataset describing the appearance of certain category members in transactions was chosen (a detailed description of this dataset is given in subsection C). The graph presented in Figure 2 shows how lowering the absolute support value
[Diagram of the program structure. Components: .TAB file, OLTP database, CollectionSetFactory, TransactionSet, Transaction, LayeredCollectionSet, Collection, Threshold Analysis Module, Results. Edge labels: extraction, input for, produces, contains, subclasses.]
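The object model described in the text (Transaction, Collection, TransactionSet, LayeredCollectionSet) might be sketched roughly as follows. Only the class names come from the paper. The field names, the method names `support` and `add`, the layering by number of attributes, the reading of a Collection as a mined itemset, and the filename `toy.tab` are assumptions for illustration, not the authors' actual code.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Transaction:
    """One input row as a binary vector: items[i] is 1 if attribute i is present."""
    items: List[int]


@dataclass
class Collection(Transaction):
    """A mined frequent set: a Transaction extended with numeric parameters."""
    support: float = 0.0


@dataclass
class TransactionSet:
    """Container for transactions that also keeps attribute names and the filename."""
    attribute_names: List[str]
    filename: str = ""
    transactions: List[Transaction] = field(default_factory=list)

    def support(self, mask: List[int]) -> float:
        """Relative support: share of transactions containing every item set in mask."""
        if not self.transactions:
            return 0.0
        hits = sum(
            all(t_bit >= m_bit for t_bit, m_bit in zip(t.items, mask))
            for t in self.transactions
        )
        return hits / len(self.transactions)


@dataclass
class LayeredCollectionSet:
    """Organizes mined collections in layers keyed by their number of attributes."""
    layers: Dict[int, List[Collection]] = field(default_factory=dict)

    def add(self, collection: Collection) -> None:
        size = sum(collection.items)  # number of attributes present in the collection
        self.layers.setdefault(size, []).append(collection)


# Hypothetical usage with three attributes and four rows.
ts = TransactionSet(
    ["bread", "milk", "butter"],
    "toy.tab",
    [Transaction([1, 1, 0]), Transaction([1, 0, 1]),
     Transaction([1, 1, 1]), Transaction([0, 1, 0])],
)
mask = [1, 1, 0]  # the set {bread, milk}
lcs = LayeredCollectionSet()
lcs.add(Collection(mask, ts.support(mask)))
```

Making Collection a subclass of Transaction mirrors the relationship stated in the text; the factory module and its caching are deliberately left out of this sketch.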
Figure 2. Relationship between number of CFIs and minimal absolute support threshold (real-life retail dataset)
[Plot: x-axis "Minimal Absolute Support", ticks from 0.63% to 9.38%; y-axis ticks from 0 to 200; one series per itemset size, 1-att through 6-att.]
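The kind of measurement summarized in Figure 2, the number of CFIs per itemset size at a given minimal absolute support, can be reproduced on toy data with a small Apriori sketch. This is a minimal illustration under stated assumptions: the function names, the closedness filter applied after mining, and the toy transactions are not taken from the paper, and the code makes no attempt at the efficiency a real dataset would need.

```python
from collections import Counter
from itertools import combinations

# Toy transactions (hypothetical data, not taken from the paper).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def apriori(transactions, min_count):
    """Return {itemset: absolute support} for all itemsets with support >= min_count."""
    transactions = [frozenset(t) for t in transactions]
    candidates = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join step: unite frequent (k-1)-itemsets into k-item candidates.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def closed_itemsets(frequent):
    """Keep only itemsets that have no proper superset with the same support."""
    return {
        s: n for s, n in frequent.items()
        if not any(s < other and n == m for other, m in frequent.items())
    }


def cfi_counts_by_size(transactions, min_count):
    """Map itemset size to the number of closed frequent itemsets of that size."""
    closed = closed_itemsets(apriori(transactions, min_count))
    return dict(Counter(len(s) for s in closed))


# Sweep the minimal absolute support from high to low.
for min_count in (4, 3, 2, 1):
    print(min_count, cfi_counts_by_size(transactions, min_count))
```

Checking closedness only against other frequent itemsets is sufficient here, because any superset with equal support meets the same threshold and is therefore also in the mined set.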