Statistical data mining techniques are designed for the effective handling of large amounts of data that
are typically multidimensional and possibly of various complex types.
There are several well-established statistical methods for data analysis, especially for numeric data.
These methods have been applied extensively to scientific data (e.g., data from experiments in
physics, engineering, manufacturing, psychology, and medicine), as well as to data from economics and
the social sciences.
The main methodologies of statistical data mining are as follows −
Regression − In general, these techniques are used to predict the value of a response (dependent)
variable from one or more predictor (independent) variables, where the variables are numeric. There are
several forms of regression, including linear, multiple, weighted, polynomial, nonparametric, and robust
(robust methods are useful when the errors fail to satisfy normality conditions or when the data contain
significant outliers).
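As a minimal illustration of the simplest case, a simple linear regression fits a line by ordinary least squares. The sketch below uses invented data and computes the slope and intercept directly from the closed-form formulas; real analyses would use a statistics library.

```python
# Illustrative sketch: simple linear regression y = a + b*x by
# ordinary least squares (toy data invented for this example).

def fit_line(xs, ys):
    """Return intercept a and slope b of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)                     # spread of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))  # co-variation
    b = sxy / sxx                  # slope
    a = mean_y - b * mean_x        # intercept
    return a, b

a, b = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
print(round(a, 2), round(b, 2))
```

With the toy data above, the fitted line is approximately y = 0.15 + 1.94x.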
Generalized linear models − These models, and their generalization (generalized additive models),
allow a categorical (nominal) response variable (or some transformation of it) to be related to a set
of predictor variables in a manner similar to the modeling of a numeric response variable using
linear regression. Generalized linear models include logistic regression and Poisson regression.
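As a hedged sketch of one such model, the toy code below fits a logistic regression by plain gradient descent. The data, learning rate, and step count are arbitrary choices for illustration; a real analysis would use a statistical package.

```python
import math

# Illustrative sketch: logistic regression (a generalized linear model)
# fit by gradient descent on a tiny invented data set.

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by averaged gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))   # predicted probability
            gw += (p - y) * x                           # gradient wrt w
            gb += (p - y)                               # gradient wrt b
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

# In this toy data, larger x values tend to have label 1.
w, b = fit_logistic([0, 1, 2, 3, 4, 5], [0, 0, 0, 1, 1, 1])
p_low = 1.0 / (1.0 + math.exp(-(w * 0 + b)))
p_high = 1.0 / (1.0 + math.exp(-(w * 5 + b)))
print(p_low < 0.5 < p_high)  # the fitted model separates the two groups
```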
Analysis of variance − These methods analyze experimental data for two or more populations
described by a numeric response variable and one or more categorical variables (factors). In general, an
ANOVA (single-factor analysis of variance) problem involves a comparison of k population or treatment
means to determine if at least two of the means are different.
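The single-factor comparison can be made concrete by computing the ANOVA F statistic, the ratio of between-group to within-group mean squares; the sketch below uses invented group data.

```python
# Illustrative sketch: one-way ANOVA F statistic for k groups (toy data).

def anova_f(groups):
    all_vals = [v for g in groups for v in g]
    n = len(all_vals)            # total observations
    k = len(groups)              # number of groups
    grand = sum(all_vals) / n    # grand mean
    # Between-group sum of squares: how far group means sit from the grand mean.
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares: spread of values around their own group mean.
    ssw = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    msb = ssb / (k - 1)
    msw = ssw / (n - k)
    return msb / msw

f = anova_f([[5, 6, 7], [8, 9, 10], [11, 12, 13]])
print(round(f, 2))  # large F suggests at least two means differ
```

For these toy groups, F = 27.0, which (compared against an F distribution with k−1 and n−k degrees of freedom) would indicate that at least two of the means differ.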
Mixed-effect models − These models are for analyzing grouped data—data that can be classified
according to one or more grouping variables. They generally describe relationships between a response
variable and some covariates in data grouped according to one or more factors. Common areas of
application include multilevel data, repeated measures data, block designs, and longitudinal data.
Factor analysis − This method is used to determine which variables combine to produce a given factor.
For example, for many psychiatric data sets, it is not possible to measure a certain factor of interest
directly (e.g., intelligence); however, it is often possible to measure other quantities that reflect the factor
of interest. Here, none of the variables is designated as dependent.
Discriminant analysis − This technique is used to predict a categorical response variable. Unlike generalized
linear models, it assumes that the independent variables follow a multivariate normal distribution. The
procedure attempts to determine several discriminant functions (linear combinations of the independent
variables) that discriminate among the groups defined by the response variable. Discriminant analysis is
commonly used in the social sciences.
Survival analysis − Several well-established statistical methods exist for survival analysis. These
techniques were originally designed to predict the probability that a patient undergoing a medical
treatment would survive at least to time t.
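A standard tool here is the Kaplan-Meier estimator, which multiplies the survival fractions observed at each death time. The sketch below uses made-up (time, event) pairs and handles censored observations (event = 0) as well.

```python
# Illustrative sketch: Kaplan-Meier estimate of the survival function S(t)
# from (time, event) pairs, where event=1 marks a death and 0 a censoring.

def kaplan_meier(times, events):
    data = sorted(zip(times, events))
    n_at_risk = len(data)       # subjects still under observation
    s = 1.0                     # running survival probability
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = at_t = 0
        while i < len(data) and data[i][0] == t:   # gather all events at time t
            at_t += 1
            deaths += data[i][1]
            i += 1
        if deaths:
            s *= 1.0 - deaths / n_at_risk          # survive this death time
            curve.append((t, s))
        n_at_risk -= at_t                          # deaths and censorings leave
    return curve

# Deaths at t=2 and t=4; censored observations at t=3 and t=5.
print(kaplan_meier([2, 3, 4, 5], [1, 0, 1, 0]))
```

For this toy data the estimate drops to 0.75 after the death at t = 2 and, because the censored subject at t = 3 has left the risk set, to 0.375 after t = 4.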
Quality control − Various statistics can be used to prepare charts for quality control, such as
Shewhart charts and CUSUM charts. These statistics include the mean, standard deviation, range, count,
moving average, moving standard deviation, and moving range.
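As an illustration of the CUSUM idea, a two-sided CUSUM accumulates deviations from a target beyond a slack value k, resetting at zero; an alarm is raised when a path exceeds a decision interval. The target, slack, and measurements below are arbitrary assumptions for this toy example.

```python
# Illustrative sketch: one-sided CUSUM statistics for monitoring a process
# mean (target and slack k are assumptions chosen for this toy example).

def cusum(values, target, k):
    hi = lo = 0.0
    hi_path, lo_path = [], []
    for x in values:
        hi = max(0.0, hi + x - target - k)   # accumulates upward drift
        lo = max(0.0, lo + target - x - k)   # accumulates downward drift
        hi_path.append(round(hi, 2))
        lo_path.append(round(lo, 2))
    return hi_path, lo_path

hi, lo = cusum([10.1, 10.2, 10.6, 10.8, 10.9], target=10.0, k=0.1)
print(hi)  # the upper path grows as the process drifts above target
```

In practice the paths would be compared against a decision interval h (often a multiple of the process standard deviation) to decide when the process is out of control.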
Design and construction of data warehouses for multidimensional data analysis and data mining:
Like many other applications, data warehouses need to be constructed for banking and financial data.
Multidimensional data analysis methods should be used to analyze the general properties of such data.
For example, a company’s financial officer may want to view the debt and revenue changes by month,
region, and sector, and other factors, along with maximum, minimum, total, average, trend, deviation, and
other statistical information. Data warehouses, data cubes (including advanced data cube concepts such as
multi feature, discovery-driven, regression, and prediction data cubes), characterization and class
comparisons, clustering, and outlier analysis will all play important roles in financial data analysis and
mining.
Loan payment prediction and customer credit policy analysis: Loan payment prediction and customer
credit analysis are critical to the business of a bank. Many factors can strongly or weakly influence loan
payment performance and customer credit rating. Data mining methods, such as attribute selection and
attribute relevance ranking, may help identify important factors and eliminate irrelevant ones.
For example, factors related to the risk of loan payments include loan-to-value ratio, term of the loan,
debt ratio (total amount of monthly debt versus total monthly income), payment-to-income ratio,
customer income level, education level, residence region, and credit history. Analysis of the customer
payment history may find that, say, payment-to-income ratio is a dominant factor, while education level
and debt ratio are not. The bank may then decide to adjust its loan-granting policy so as to grant loans to
those customers whose applications were previously denied but whose profiles show relatively low risks
according to the critical factor analysis.
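One simple way to sketch the attribute relevance ranking described above is to rank candidate factors by their absolute Pearson correlation with a 0/1 default label. All factor names and numbers below are invented for illustration; real relevance analysis would use richer measures such as information gain.

```python
# Illustrative sketch: ranking loan factors by absolute correlation with
# a default label (all data here is made up for the example).

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

defaulted = [0, 0, 0, 1, 1, 1]                      # hypothetical default labels
factors = {
    "payment_to_income": [0.1, 0.2, 0.25, 0.5, 0.6, 0.7],
    "education_years":   [16, 12, 14, 15, 12, 16],
}
# Rank factors by how strongly they track the default label.
ranked = sorted(factors, key=lambda f: -abs(pearson(factors[f], defaulted)))
print(ranked[0])
```

With this invented data, payment_to_income dominates the ranking while education_years shows almost no association, mirroring the scenario described above.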
Classification and clustering of customers for targeted marketing: Classification and clustering
methods can be used for customer group identification and targeted marketing. For example, we can use
classification to identify the most crucial factors that may influence a customer’s decision regarding
banking. Customers with similar behaviors regarding loan payments may be identified by
multidimensional clustering techniques. These can help identify customer groups, associate a new
customer with an appropriate customer group, and facilitate targeted marketing.
Detection of money laundering and other financial crimes: To detect money laundering and other
financial crimes, it is important to integrate information from multiple, heterogeneous databases (e.g.,
bank transaction databases and federal or state crime history databases), as long as they are potentially
related to the study.
Multiple data analysis tools can then be used to detect unusual patterns, such as large amounts of cash
flow at certain periods, by certain groups of customers. Useful tools include data visualization tools (to
display transaction activities using graphs by time and by groups of customers), linkage and information
network analysis tools (to identify links among different customers and activities), classification tools (to
filter unrelated attributes and rank the highly related ones), clustering tools (to group different cases),
outlier analysis tools (to detect unusual amounts of fund transfers or other activities), and sequential
pattern analysis tools (to characterize unusual access sequences). These tools may identify important
relationships and patterns of activities and help investigators focus on suspicious cases for further detailed
examination.
Multidimensional analysis of sales, customers, products, time, and region: The retail industry
requires timely information regarding customer needs, product sales, trends, and fashions, as well as the
quality, cost, profit, and service of commodities. It is therefore important to provide powerful
multidimensional analysis and visualization tools, including the construction of sophisticated data cubes
according to the needs of data analysis.
Analysis of the effectiveness of sales campaigns: The retail industry conducts sales campaigns using
advertisements, coupons, and various kinds of discounts and bonuses to promote products and attract
customers. Careful analysis of the effectiveness of sales campaigns can help improve company profits.
Multidimensional analysis can be used for this purpose by comparing the amount of sales and the number
of transactions containing the sales items during the sales period versus those containing the same items
before or after the sales campaign. Moreover, association analysis may disclose which items are likely to
be purchased together with the items on sale, especially in comparison with the sales before or after the
campaign.
Customer retention—analysis of customer loyalty: We can use customer loyalty card information to
register sequences of purchases of particular customers. Customer loyalty and purchase trends can be
analyzed systematically. Goods purchased at different periods by the same customers can be grouped into
sequences. Sequential pattern mining can then be used to investigate changes in customer consumption or
loyalty and suggest adjustments on the pricing and variety of goods to help retain customers and attract
new ones.
Product recommendation and cross-referencing of items: By mining associations from sales records,
we may discover that a customer who buys a digital camera is likely to buy another set of items. Such
information can be used to form product recommendations. Collaborative recommender systems use data
mining techniques to make personalized product recommendations during live customer transactions,
based on the opinions of other customers. Product recommendations can also be advertised on sales
receipts, in weekly flyers, or on the Web to help improve customer service, aid customers in selecting
items, and increase sales. Similarly, information, such as “hot items this week” or attractive deals, can be
displayed together with the associative information to promote sales.
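As a hedged sketch of how such recommendations can be derived from sales records, the toy code below counts items that co-occur with a purchased item in past transactions and recommends the most frequent companions; the transactions and item names are invented for illustration.

```python
from collections import Counter

# Illustrative sketch: co-purchase counts as a basis for product
# recommendation (toy transaction data, made up for this example).

transactions = [
    {"camera", "memory_card", "tripod"},
    {"camera", "memory_card", "battery"},
    {"camera", "battery"},
    {"phone", "charger"},
]

def recommend(item, transactions, top_n=2):
    co = Counter()
    for t in transactions:
        if item in t:
            co.update(t - {item})       # count the item's companions
    return [other for other, _ in co.most_common(top_n)]

print(recommend("camera", transactions))
```

For this toy data, buyers of a camera are most often also buyers of a memory card and a battery, so those two items would be recommended.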
Fraudulent analysis and the identification of unusual patterns: Fraudulent activity costs the retail
industry millions of dollars per year. It is important to (1) identify potentially fraudulent users and their
atypical usage patterns; (2) detect attempts to gain fraudulent entry or unauthorized access to individual
and organizational accounts; and (3) discover unusual patterns that may need special attention. Many of
these patterns can be discovered by multidimensional analysis, cluster analysis, and outlier analysis.
As another industry that handles huge amounts of data, the telecommunication industry has quickly
evolved from offering local and long-distance telephone services to providing many other comprehensive
communication services. These include cellular phone, smart phone, Internet access, email, text
messages, images, computer and web data transmissions, and other data traffic. The integration of
telecommunication, computer network, Internet, and numerous other means of communication and
computing has been under way, changing the face of telecommunications and computing. This has
created a great demand for data mining to help understand business dynamics, identify
telecommunication patterns, catch fraudulent activities, make better use of resources,
and improve service quality.
Data Mining in Science and Engineering
In the past, many scientific data analysis tasks tended to handle relatively small and homogeneous data
sets. Such data were typically analyzed using a “formulate hypothesis, build model, and evaluate results”
paradigm. Massive data collection and storage technologies have recently changed the landscape of
scientific data analysis.
Today, scientific data can be amassed at much higher speeds and lower costs. This has resulted in the
accumulation of huge volumes of high-dimensional data, stream data, and heterogeneous data, containing
rich spatial and temporal information. Consequently, scientific applications are shifting from the
“hypothesize-and-test” paradigm toward a “collect and store data, mine for new hypotheses, confirm with
data or experimentation” process. This shift brings about new challenges for data mining. Vast amounts
of data have been collected from scientific domains (including geosciences, astronomy, meteorology,
geology, and biological sciences) using sophisticated telescopes, multispectral high-resolution remote
satellite sensors, global positioning systems, and new generations of biological data collection and
analysis technologies. Large data sets are also being generated due to fast numeric simulations in various
fields such as climate and ecosystem modeling, chemical engineering, fluid dynamics, and structural
mechanics. Here we look at some of the challenges brought about by emerging scientific applications of
data mining.
Data warehouses and data preprocessing: Data preprocessing and data warehouses are critical for
information exchange and data mining. Creating a warehouse often requires finding means for resolving
inconsistent or incompatible data collected in multiple environments and at different time periods. This
requires reconciling semantics, referencing systems, geometry, measurements, accuracy, and precision.
Methods are needed for integrating data from heterogeneous sources and for identifying events.
For instance, consider climate and ecosystem data, which are spatial and temporal and require cross-
referencing geospatial data. A major problem in analyzing such data is that there are too many events in
the spatial domain but too few in the temporal domain. For example, El Niño events occur only every
four to seven years, and previous data on them might not have been collected as systematically as they are
today. Methods are also needed for the efficient computation of sophisticated spatial aggregates and the
handling of spatial-related data streams.
Mining complex data types: Scientific data sets are heterogeneous in nature. They typically involve
semi-structured and unstructured data, such as multimedia data and georeferenced stream data, as well as
data with sophisticated, deeply hidden semantics (e.g., genomic and proteomic data). Robust and
dedicated analysis methods are needed for handling spatiotemporal data, biological data, related concept
hierarchies, and complex semantic relationships. For example, in bioinformatics, a research problem is to
identify regulatory influences on genes. Gene regulation refers to how genes in a cell are switched on (or
off) to determine the cell’s functions.
Different biological processes involve different sets of genes acting together in precisely regulated
patterns. Thus, to understand a biological process we need to identify the participating genes and their
regulators. This requires the development of sophisticated data mining methods to analyze large
biological data sets for clues about regulatory influences on specific genes, by finding DNA segments
(“regulatory sequences”) mediating such influence.
Graph-based and network-based mining: It is often difficult or impossible to model several physical
phenomena and processes due to limitations of existing modeling approaches. Alternatively, labeled
graphs and networks may be used to capture many of the spatial, topological, geometric, biological, and
other relational characteristics present in scientific data sets. In graph or network modeling, each object to
be mined is represented by a vertex in a graph, and edges between vertices represent relationships
between objects. For example, graphs can be used to model chemical structures, biological pathways, and
data generated by numeric simulations such as fluid-flow simulations. The success of graph or network
modeling, however, depends on improvements in the scalability and efficiency of many graph-based data
mining tasks such as classification, frequent pattern mining, and clustering.
Visualization tools and domain-specific knowledge: High-level graphical user interfaces and
visualization tools are required for scientific data mining systems. These should be integrated with
existing domain-specific data and information systems to guide researchers and general users in searching
for patterns, interpreting and visualizing discovered patterns, and using discovered knowledge in their
decision making.
Data mining in engineering shares many similarities with data mining in science. Both practices often
collect massive amounts of data, and require data preprocessing, data warehousing, and scalable mining
of complex types of data. Both typically use visualization and make good use of graphs and networks.
Moreover, many engineering processes need real-time responses, and so mining data streams in real time
often becomes a critical component. Massive amounts of human communication data pour into our daily
life. Such communication exists in many forms, including news, blogs, articles, web pages, online
discussions, product reviews, tweets, messages, advertisements, and communications, both on the Web
and in various kinds of social networks. Hence, data mining in social science and social studies has
become increasingly popular. Moreover, user or reader feedback regarding products, speeches, and
articles can be analyzed to deduce general opinions and sentiments on the views of those in society. The
analysis results can be used to predict trends, improve work, and help in decision making.
Computer science generates unique kinds of data. For example, computer programs can be long, and their
execution often generates huge-size traces. Computer networks can have complex structures and the
network flows can be dynamic and massive. Sensor networks may generate large amounts of data with
varied reliability. Computer systems and databases can suffer from various kinds of attacks, and their
system/data accessing may raise security and privacy concerns. These unique kinds of data provide fertile
land for data mining.
RAPID MINER
Rapid Miner is one of the most popular predictive analytics systems, created by the company of the same
name. It is written in the Java programming language. It offers an integrated environment for text mining,
deep learning, machine learning, and predictive analytics.
The tool can be used for a wide range of applications, including business applications, commercial
applications, research, education, training, application development, and machine learning.
Rapid Miner provides a server on-site as well as in public or private cloud infrastructure. It is based on a
client/server model. Rapid Miner comes with template-based frameworks that enable fast delivery with
few errors (which are commonly expected in the manual coding process). Rapid Miner is a data mining
tool used to implement various classification and clustering algorithms. An important feature of Rapid
Miner is its ability to display results visually. It is more powerful than Weka because of its language
independence.
Rapid Miner also provides an integrated environment for machine learning, data mining, text mining,
predictive analytics and business analytics. It is used for business and industrial applications as well as for
research, education, training, rapid prototyping, and application development and supports all steps of the data
mining process.
General Features
Rapid Miner is an environment for machine learning and data mining processes.
Rapid Miner uses XML to describe the operator trees that model the knowledge discovery process.
It has flexible operators for data input and output in various file formats.
It contains more than 100 learning schemes for regression, classification, and clustering analysis.
Rapid Miner produces a selection of charts and visualizations automatically, choosing the most
appropriate settings based on data properties.
If you set up an invalid workflow, Rapid Miner suggests Quick Fixes to make it valid.
ORANGE
Orange supports a flexible environment for developers, analysts, and data mining specialists. It is based
on Python, a new-generation scripting language and programming environment, in which data mining
scripts can be simple yet powerful. Orange employs a component-based approach for fast prototyping:
we can implement an analysis technique much like assembling LEGO bricks, or simply use an existing
algorithm. Orange components are used for scripting, and Orange widgets for visual programming.
Widgets use a specially designed communication mechanism for passing objects such as classifiers,
regressors, attribute lists, and data sets, which makes it easy to build rather complex data mining schemes
that use modern approaches and techniques.
Orange core objects and Python modules cover numerous data mining tasks, ranging from data
preprocessing to evaluation and modeling. The operating principle of Orange is to cover the main
techniques and perspectives in data mining and machine learning. For example, Orange's top-down
induction of decision trees is a technique built from numerous components, any of which can be
prototyped in Python and used in place of the original one. Orange widgets are not simply graphical
objects that provide a graphical interface to a particular method in Orange; they also include an adaptable
signaling mechanism for communication and exchange of objects such as data sets, classification models,
learners, and objects that store evaluation results. Together, these ideas distinguish Orange from other
data mining frameworks.
Orange Widgets:
Orange widgets provide a graphical user interface to Orange's data mining and machine learning
techniques. They include widgets for data entry and preprocessing; for classification, regression,
association rules, and clustering; a set of widgets for model evaluation and visualization of evaluation
results; and widgets for exporting models into PMML.
Widgets communicate through tokens that are passed from a sender widget to a receiver widget. For
example, a file widget outputs data objects that can be received by a classification tree learner widget.
The classification tree learner builds a classification model and sends it to a widget that graphically
displays the tree. An evaluation widget may receive a data set from the file widget and learned model
objects from other widgets.
WEKA
Weka's original version was primarily designed as a tool for analyzing data from agricultural domains.
Still, the more recent fully Java-based version (Weka 3), developed in 1997, is now used in many different
application areas, particularly for educational purposes and research. Weka has the following advantages:
Weka supports several standard data mining tasks, specifically data preprocessing, clustering,
classification, regression, visualization, and feature selection. Input to Weka is expected to be formatted
according to the Attribute-Relation File Format, in files with the .arff extension.
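For illustration, a small ARFF file and a minimal hand-rolled reader are sketched below. The relation, attributes, and values are made up for the example, and real applications would load the file with Weka itself or an established library.

```python
# Illustrative sketch: the shape of Weka's ARFF format, parsed with a
# tiny hand-rolled reader (toy relation invented for this example).

arff = """\
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute humidity numeric
@attribute play {yes, no}
@data
sunny,85,no
overcast,65,yes
"""

def parse_arff(text):
    attrs, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        low = line.lower()
        if low.startswith("@attribute"):
            attrs.append(line.split()[1])       # attribute name
        elif low.startswith("@data"):
            in_data = True                      # rows follow from here
        elif in_data:
            rows.append(dict(zip(attrs, line.split(","))))
    return attrs, rows

attrs, rows = parse_arff(arff)
print(attrs)
print(rows[0]["play"])
```

Each data row is a comma-separated instance whose values line up with the declared attributes, matching Weka's flat-file assumption described below.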
All of Weka's techniques are predicated on the assumption that the data is available as one flat file or
relation, where a fixed number of attributes describes each data point (normally numeric or nominal
attributes, though some other attribute types are also supported). Weka provides access to SQL databases
using Java Database Connectivity and can process the result returned by a database query. Weka also
provides access to deep learning with Deeplearning4j.
It is not capable of multi-relational data mining. Still, there is separate software for converting a
collection of linked database tables into a single table suitable for processing using Weka. Another
important area currently not covered by the algorithms included in the Weka distribution is sequence
modelling.
Features of Weka
1. Preprocess
The preprocessing of data is a crucial task in data mining. Because most data is raw, it may contain
empty or duplicate values, garbage values, outliers, extra columns, or inconsistent naming conventions.
All these issues degrade the results.
To make the data cleaner, better, and more consistent, WEKA comes up with a comprehensive set of
options under the filter category.
2. Classify
Classification is one of the essential functions in machine learning, where we assign classes or categories
to items. The classic examples of classification are: declaring a brain tumour as "malignant" or
"benign" or assigning an email to a "spam" or "not_spam" class.
After selecting the desired classifier, we select test options for the training set. Some of the options are:
o Use training set: the classifier will be tested on the same training set.
o A supplied test set: evaluates the classifier based on a separate test set.
o Cross-validation Folds: assessment of the classifier based on cross-validation using the number
of provided folds.
o Percentage split: the classifier will be judged on a specific percentage of data.
Other than these, we can also use more test options such as Preserve order for % split, Output source
code, etc.
3. Cluster
In clustering, a dataset is arranged into different groups/clusters based on some similarities. In this case,
the items within the same cluster are similar to each other but different from the items in other clusters.
Examples of clustering include identifying customers with similar behaviours and organizing regions
according to homogenous land use.
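As a minimal sketch of this idea, the toy code below runs k-means on one-dimensional values with fixed initial centers; real work would use a library implementation with proper initialization and multi-dimensional distances.

```python
# Illustrative sketch: k-means clustering on 1-D values with fixed
# initial centers (toy data invented for this example).

def kmeans_1d(values, centers, iters=10):
    clusters = [[] for _ in centers]
    for _ in range(iters):
        # Assignment step: each value joins its nearest center's cluster.
        clusters = [[] for _ in centers]
        for v in values:
            j = min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            clusters[j].append(v)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 10.0])
print(sorted(round(c, 2) for c in centers))
```

The two centers settle near 1.0 and 9.0, separating the low-valued and high-valued groups, which mirrors how customers with similar behaviors end up in the same cluster.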
4. Associate
Association rules highlight all the associations and correlations between items of a dataset. In short, it is
an if-then statement that depicts the probability of relationships between data items. A classic example of
association refers to a connection between the sale of milk and bread.
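The if-then reading of such a rule can be made concrete with its two standard measures: support (how often the items appear together) and confidence (how often the consequent follows the antecedent). The transactions below are invented for illustration.

```python
# Illustrative sketch: support and confidence of the rule "milk => bread"
# over toy transactions (numbers invented for this example).

transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "jam"},
]

def rule_stats(lhs, rhs, transactions):
    n = len(transactions)
    both = sum(1 for t in transactions if lhs in t and rhs in t)
    lhs_count = sum(1 for t in transactions if lhs in t)
    support = both / n            # fraction of all baskets with both items
    confidence = both / lhs_count # fraction of lhs baskets that also have rhs
    return support, confidence

s, c = rule_stats("milk", "bread", transactions)
print(s, round(c, 2))
```

Here the rule "milk => bread" has support 0.5 (half of all baskets contain both) and confidence about 0.67 (two of the three milk baskets also contain bread).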
5. Select Attributes
Every dataset contains a lot of attributes, but several of them may not be significantly valuable. Therefore,
removing the unnecessary and keeping the relevant details are very important for building a good model.
6. Visualize
In the visualize tab, different plot matrices and graphs are available to show the trends and errors
identified by the model.
KNIME
KNIME is a powerful, free, open source data mining tool that enables data scientists to create independent
applications and services through a drag-and-drop interface. It serves well as a resource for business
intelligence and data analytics. The software is available as a free download on the KNIME website.
For the purposes of direct marketing, KNIME will allow converting multiple data sources, spreadsheets,
flat files, databases, and more into a standard format. This data can be normalized, analyzed, and
configured to generate visual representations. In other words, it can shape data into information. This data
aggregation makes it possible to create easy-to-understand visualizations.
Direct marketers can use KNIME as a key component of their marketing technology stack to gain a
better understanding of the large amounts of data involved in a direct marketing operation.
Many business intelligence features are built in. There are numerous data visualization tools which can be
used for creating larger applications, and with some configuration, it can create an extremely powerful
dashboard for analyzing direct marketing data.
Getting started with KNIME takes some configuration; it is not an out-of-the-box solution. There are
numerous templates that can be configured for a multitude of purposes; however, there are none specific
to direct marketing. That said, the functionality can certainly be configured to meet business intelligence
needs for use in direct marketing operations.
The modular nature of KNIME makes it possible to create brand-new workflows which can be well-
adapted to a BI dashboard. There are many useful features and modules that do not need to be built from
scratch; in many cases it merely requires configuring the data itself to use pre-existing structures.
Once configured, KNIME enables marketers to create various types of reports and can theoretically help
them gain a much better understanding of users and target markets.
SISENSE
Sisense is an extremely useful BI tool, best suited for reporting purposes within an organization. It is
developed by the company of the same name, Sisense. It has a brilliant capability to handle and process
data for small-scale and large-scale organizations.
It allows combining data from various sources to build a common repository, and further refines the data
to generate rich reports that are shared across departments for reporting.
Sisense was awarded best BI software in 2016 and still holds a good position.
Sisense generates highly visual reports. It is specially designed for non-technical users, and allows a
drag-and-drop facility as well as widgets.
ORACLE DATA MINER
Oracle Data Miner provides an Application Programming Interface (API) that enables programmers to
build and use models.
Oracle Data Miner workflows capture and document the analytical methodology of the user. It can be
saved and shared with others to automate advanced analytical methodologies.
The Oracle Data Miner GUI is an extension to Oracle SQL Developer 3.0 or later that enables data
analysts to work through the data mining process graphically.
RATTLE
Rattle is used widely by data scientists across industry and by many independent consultants. It is also
used for teaching the concepts of machine learning and data mining, and as a pathway into the full
power of R for the data scientist. An important feature of Rattle is that all functionality accessed via the
graphical user interface is captured as a structured R script that can be run independently of Rattle to
repeat every step performed in the GUI. In addition to being a useful tool for learning R, it transparently
supports repeatability of all activity in scripts that can be extended or automatically run at a later time.
Rattle provides considerable data mining functionality by exposing the power of R through a graphical
user interface. Rattle is also used as a teaching facility for learning R. There is an option called the Log
Code tab, which replicates the R code for any activity undertaken in the GUI and can be copied and
pasted. Rattle can be used for statistical analysis or model generation, and it allows the dataset to be
partitioned into training, validation, and testing sets. The dataset can be viewed and edited.
Terminology
There are various definitions of user interface types, so here’s how I’ll be using these terms:
GUI = Graphical User Interface using menus and dialog boxes to avoid having to type programming code. I
do not include any assistance for programming in this definition. So, GUI users are people who prefer using a
GUI to perform their analyses. They don’t have the time or inclination to become good programmers.
IDE = Integrated Development Environment, which helps programmers write code. I do not include point-and-
click style menus and dialog boxes when using this term. IDE users are people who prefer to write R code to
perform their analyses.