Statistical data mining techniques are designed for the effective handling of large amounts of data that
are typically multidimensional and possibly of various complex types.
There are several well-established statistical methods for data analysis, especially for numeric data.
These methods have been applied extensively to scientific data (e.g., data from experiments in
physics, engineering, manufacturing, psychology, and medicine), as well as to data from economics and
the social sciences.
The main methodologies of statistical data mining are as follows −
Regression − In general, these techniques are used to predict the value of a response (dependent)
variable from one or more predictor (independent) variables, where the variables are numeric. There are
several forms of regression, including linear, multiple, weighted, polynomial, nonparametric, and robust
(robust methods are useful when the errors fail to satisfy normality conditions or when the data contain
significant outliers).
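As a minimal illustration of the simplest case, a simple linear regression fits a line by ordinary least squares. The sketch below uses invented data and computes the slope and intercept directly from the closed-form formulas; real analyses would use a statistics library.

```python
# Illustrative sketch: simple linear regression y = a + b*x by
# ordinary least squares (toy data invented for this example).

def fit_line(xs, ys):
    """Return intercept a and slope b of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)                     # spread of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))  # co-variation
    b = sxy / sxx                  # slope
    a = mean_y - b * mean_x        # intercept
    return a, b

a, b = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
print(round(a, 2), round(b, 2))
```

With the toy data above, the fitted line is approximately y = 0.15 + 1.94x.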
Generalized linear models − These models, and their generalization (generalized additive models),
allow a categorical (nominal) response variable (or some transformation of it) to be related to a set
of predictor variables in a manner similar to the modeling of a numeric response variable using
linear regression. Generalized linear models include logistic regression and Poisson regression.
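As a hedged sketch of one such model, the toy code below fits a logistic regression by plain gradient descent. The data, learning rate, and step count are arbitrary choices for illustration; a real analysis would use a statistical package.

```python
import math

# Illustrative sketch: logistic regression (a generalized linear model)
# fit by gradient descent on a tiny invented data set.

def fit_logistic(xs, ys, lr=0.1, steps=2000):
    """Fit p(y=1|x) = sigmoid(w*x + b) by averaged gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))   # predicted probability
            gw += (p - y) * x                           # gradient wrt w
            gb += (p - y)                               # gradient wrt b
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

# In this toy data, larger x values tend to have label 1.
w, b = fit_logistic([0, 1, 2, 3, 4, 5], [0, 0, 0, 1, 1, 1])
p_low = 1.0 / (1.0 + math.exp(-(w * 0 + b)))
p_high = 1.0 / (1.0 + math.exp(-(w * 5 + b)))
print(p_low < 0.5 < p_high)  # the fitted model separates the two groups
```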
Analysis of variance − These methods analyze experimental data for two or more populations
described by a numeric response variable and one or more categorical variables (factors). In general, an
ANOVA (single-factor analysis of variance) problem involves a comparison of k population or treatment
means to determine if at least two of the means are different.
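The single-factor comparison can be made concrete by computing the ANOVA F statistic, the ratio of between-group to within-group mean squares; the sketch below uses invented group data.

```python
# Illustrative sketch: one-way ANOVA F statistic for k groups (toy data).

def anova_f(groups):
    all_vals = [v for g in groups for v in g]
    n = len(all_vals)            # total observations
    k = len(groups)              # number of groups
    grand = sum(all_vals) / n    # grand mean
    # Between-group sum of squares: how far group means sit from the grand mean.
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares: spread of values around their own group mean.
    ssw = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    msb = ssb / (k - 1)
    msw = ssw / (n - k)
    return msb / msw

f = anova_f([[5, 6, 7], [8, 9, 10], [11, 12, 13]])
print(round(f, 2))  # large F suggests at least two means differ
```

For these toy groups, F = 27.0, which (compared against an F distribution with k−1 and n−k degrees of freedom) would indicate that at least two of the means differ.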
Mixed-effect models − These models are for analyzing grouped data—data that can be classified
according to one or more grouping variables. They generally describe relationships between a response
variable and some covariates in data grouped according to one or more factors. Common areas of
application include multilevel data, repeated measures data, block designs, and longitudinal data.
Factor analysis − This method is used to determine which variables combine to produce a given factor.
For example, for many psychiatric data sets, it is not possible to measure a certain factor of interest
directly (e.g., intelligence); however, it is often possible to measure other quantities that reflect the factor
of interest. Here, none of the variables is designated as dependent.
Discriminant analysis − This technique is used to predict a categorical response variable. Unlike generalized
linear models, it assumes that the independent variables follow a multivariate normal distribution. The
procedure attempts to determine several discriminant functions (linear combinations of the independent
variables) that discriminate among the groups defined by the response variable. Discriminant analysis is
commonly used in the social sciences.
Survival analysis − Several well-established statistical methods exist for survival analysis. These
techniques were originally designed to predict the probability that a patient undergoing a medical
treatment would survive at least to time t.
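A standard tool here is the Kaplan-Meier estimator, which multiplies the survival fractions observed at each death time. The sketch below uses made-up (time, event) pairs and handles censored observations (event = 0) as well.

```python
# Illustrative sketch: Kaplan-Meier estimate of the survival function S(t)
# from (time, event) pairs, where event=1 marks a death and 0 a censoring.

def kaplan_meier(times, events):
    data = sorted(zip(times, events))
    n_at_risk = len(data)       # subjects still under observation
    s = 1.0                     # running survival probability
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = at_t = 0
        while i < len(data) and data[i][0] == t:   # gather all events at time t
            at_t += 1
            deaths += data[i][1]
            i += 1
        if deaths:
            s *= 1.0 - deaths / n_at_risk          # survive this death time
            curve.append((t, s))
        n_at_risk -= at_t                          # deaths and censorings leave
    return curve

# Deaths at t=2 and t=4; censored observations at t=3 and t=5.
print(kaplan_meier([2, 3, 4, 5], [1, 0, 1, 0]))
```

For this toy data the estimate drops to 0.75 after the death at t = 2 and, because the censored subject at t = 3 has left the risk set, to 0.375 after t = 4.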
Quality control − Various statistics can be used to prepare charts for quality control, such as
Shewhart charts and CUSUM charts. These statistics include the mean, standard deviation, range, count,
moving average, moving standard deviation, and moving range.
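As an illustration of the CUSUM idea, a two-sided CUSUM accumulates deviations from a target beyond a slack value k, resetting at zero; an alarm is raised when a path exceeds a decision interval. The target, slack, and measurements below are arbitrary assumptions for this toy example.

```python
# Illustrative sketch: one-sided CUSUM statistics for monitoring a process
# mean (target and slack k are assumptions chosen for this toy example).

def cusum(values, target, k):
    hi = lo = 0.0
    hi_path, lo_path = [], []
    for x in values:
        hi = max(0.0, hi + x - target - k)   # accumulates upward drift
        lo = max(0.0, lo + target - x - k)   # accumulates downward drift
        hi_path.append(round(hi, 2))
        lo_path.append(round(lo, 2))
    return hi_path, lo_path

hi, lo = cusum([10.1, 10.2, 10.6, 10.8, 10.9], target=10.0, k=0.1)
print(hi)  # the upper path grows as the process drifts above target
```

In practice the paths would be compared against a decision interval h (often a multiple of the process standard deviation) to decide when the process is out of control.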
Design and construction of data warehouses for multidimensional data analysis and data mining:
Like many other applications, data warehouses need to be constructed for banking and financial data.
Multidimensional data analysis methods should be used to analyze the general properties of such data.
For example, a company’s financial officer may want to view the debt and revenue changes by month,
region, and sector, and other factors, along with maximum, minimum, total, average, trend, deviation, and
other statistical information. Data warehouses, data cubes (including advanced data cube concepts such as
multi feature, discovery-driven, regression, and prediction data cubes), characterization and class
comparisons, clustering, and outlier analysis will all play important roles in financial data analysis and
mining.
Loan payment prediction and customer credit policy analysis: Loan payment prediction and customer
credit analysis are critical to the business of a bank. Many factors can strongly or weakly influence loan
payment performance and customer credit rating. Data mining methods, such as attribute selection and
attribute relevance ranking, may help identify important factors and eliminate irrelevant ones.
For example, factors related to the risk of loan payments include loan-to-value ratio, term of the loan,
debt ratio (total amount of monthly debt versus total monthly income), payment-to-income ratio,
customer income level, education level, residence region, and credit history. Analysis of the customer
payment history may find that, say, payment-to-income ratio is a dominant factor, while education level
and debt ratio are not. The bank may then decide to adjust its loan-granting policy so as to grant loans to
those customers whose applications were previously denied but whose profiles show relatively low risks
according to the critical factor analysis.
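One simple way to sketch the attribute relevance ranking described above is to rank candidate factors by their absolute Pearson correlation with a 0/1 default label. All factor names and numbers below are invented for illustration; real relevance analysis would use richer measures such as information gain.

```python
# Illustrative sketch: ranking loan factors by absolute correlation with
# a default label (all data here is made up for the example).

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

defaulted = [0, 0, 0, 1, 1, 1]                      # hypothetical default labels
factors = {
    "payment_to_income": [0.1, 0.2, 0.25, 0.5, 0.6, 0.7],
    "education_years":   [16, 12, 14, 15, 12, 16],
}
# Rank factors by how strongly they track the default label.
ranked = sorted(factors, key=lambda f: -abs(pearson(factors[f], defaulted)))
print(ranked[0])
```

With this invented data, payment_to_income dominates the ranking while education_years shows almost no association, mirroring the scenario described above.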
Classification and clustering of customers for targeted marketing: Classification and clustering
methods can be used for customer group identification and targeted marketing. For example, we can use
classification to identify the most crucial factors that may influence a customer’s decision regarding
banking. Customers with similar behaviors regarding loan payments may be identified by
multidimensional clustering techniques. These can help identify customer groups, associate a new
customer with an appropriate customer group, and facilitate targeted marketing.
Detection of money laundering and other financial crimes: To detect money laundering and other
financial crimes, it is important to integrate information from multiple, heterogeneous databases (e.g.,
bank transaction databases and federal or state crime history databases), as long as they are potentially
related to the study.
Multiple data analysis tools can then be used to detect unusual patterns, such as large amounts of cash
flow at certain periods, by certain groups of customers. Useful tools include data visualization tools (to
display transaction activities using graphs by time and by groups of customers), linkage and information
network analysis tools (to identify links among different customers and activities), classification tools (to
filter unrelated attributes and rank the highly related ones), clustering tools (to group different cases),
outlier analysis tools (to detect unusual amounts of fund transfers or other activities), and sequential
pattern analysis tools (to characterize unusual access sequences). These tools may identify important
relationships and patterns of activities and help investigators focus on suspicious cases for further detailed
examination.
Multidimensional analysis of sales, customers, products, time, and region: The retail industry
requires timely information regarding customer needs, product sales, trends, and fashions, as well as the
quality, cost, profit, and service of commodities. It is therefore important to provide powerful
multidimensional analysis and visualization tools, including the construction of sophisticated data cubes
according to the needs of data analysis.
Analysis of the effectiveness of sales campaigns: The retail industry conducts sales campaigns using
advertisements, coupons, and various kinds of discounts and bonuses to promote products and attract
customers. Careful analysis of the effectiveness of sales campaigns can help improve company profits.
Multidimensional analysis can be used for this purpose by comparing the amount of sales and the number
of transactions containing the sales items during the sales period versus those containing the same items
before or after the sales campaign. Moreover, association analysis may disclose which items are likely to
be purchased together with the items on sale, especially in comparison with the sales before or after the
campaign.
Customer retention—analysis of customer loyalty: We can use customer loyalty card information to
register sequences of purchases of particular customers. Customer loyalty and purchase trends can be
analyzed systematically. Goods purchased at different periods by the same customers can be grouped into
sequences. Sequential pattern mining can then be used to investigate changes in customer consumption or
loyalty and suggest adjustments on the pricing and variety of goods to help retain customers and attract
new ones.
Product recommendation and cross-referencing of items: By mining associations from sales records,
we may discover that a customer who buys a digital camera is likely to buy another set of items. Such
information can be used to form product recommendations. Collaborative recommender systems use data
mining techniques to make personalized product recommendations during live customer transactions,
based on the opinions of other customers. Product recommendations can also be advertised on sales
receipts, in weekly flyers, or on the Web to help improve customer service, aid customers in selecting
items, and increase sales. Similarly, information, such as “hot items this week” or attractive deals, can be
displayed together with the associative information to promote sales.
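As a hedged sketch of how such recommendations can be derived from sales records, the toy code below counts items that co-occur with a purchased item in past transactions and recommends the most frequent companions; the transactions and item names are invented for illustration.

```python
from collections import Counter

# Illustrative sketch: co-purchase counts as a basis for product
# recommendation (toy transaction data, made up for this example).

transactions = [
    {"camera", "memory_card", "tripod"},
    {"camera", "memory_card", "battery"},
    {"camera", "battery"},
    {"phone", "charger"},
]

def recommend(item, transactions, top_n=2):
    co = Counter()
    for t in transactions:
        if item in t:
            co.update(t - {item})       # count the item's companions
    return [other for other, _ in co.most_common(top_n)]

print(recommend("camera", transactions))
```

For this toy data, buyers of a camera are most often also buyers of a memory card and a battery, so those two items would be recommended.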
Fraudulent analysis and the identification of unusual patterns: Fraudulent activity costs the retail
industry millions of dollars per year. It is important to (1) identify potentially fraudulent users and their
atypical usage patterns; (2) detect attempts to gain fraudulent entry or unauthorized access to individual
and organizational accounts; and (3) discover unusual patterns that may need special attention. Many of
these patterns can be discovered by multidimensional analysis, cluster analysis, and outlier analysis.
As another industry that handles huge amounts of data, the telecommunication industry has quickly
evolved from offering local and long-distance telephone services to providing many other comprehensive
communication services. These include cellular phone, smart phone, Internet access, email, text
messages, images, computer and web data transmissions, and other data traffic. The integration of
telecommunication, computer network, Internet, and numerous other means of communication and
computing has been under way, changing the face of telecommunications and computing. This has
created a great demand for data mining to help understand business dynamics, identify
telecommunication patterns, catch fraudulent activities, make better use of resources,
and improve service quality.
Data Mining in Science and Engineering
In the past, many scientific data analysis tasks tended to handle relatively small and homogeneous data
sets. Such data were typically analyzed using a “formulate hypothesis, build model, and evaluate results”
paradigm. Massive data collection and storage technologies have recently changed the landscape of
scientific data analysis.
Today, scientific data can be amassed at much higher speeds and lower costs. This has resulted in the
accumulation of huge volumes of high-dimensional data, stream data, and heterogeneous data, containing
rich spatial and temporal information. Consequently, scientific applications are shifting from the
“hypothesize-and-test” paradigm toward a “collect and store data, mine for new hypotheses, confirm with
data or experimentation” process. This shift brings about new challenges for data mining. Vast amounts
of data have been collected from scientific domains (including geosciences, astronomy, meteorology,
geology, and biological sciences) using sophisticated telescopes, multispectral high-resolution remote
satellite sensors, global positioning systems, and new generations of biological data collection and
analysis technologies. Large data sets are also being generated due to fast numeric simulations in various
fields such as climate and ecosystem modeling, chemical engineering, fluid dynamics, and structural
mechanics. Here we look at some of the challenges brought about by emerging scientific applications of
data mining.
Data warehouses and data preprocessing: Data preprocessing and data warehouses are critical for
information exchange and data mining. Creating a warehouse often requires finding means for resolving
inconsistent or incompatible data collected in multiple environments and at different time periods. This
requires reconciling semantics, referencing systems, geometry, measurements, accuracy, and precision.
Methods are needed for integrating data from heterogeneous sources and for identifying events.
For instance, consider climate and ecosystem data, which are spatial and temporal and require cross-
referencing geospatial data. A major problem in analyzing such data is that there are too many events in
the spatial domain but too few in the temporal domain. For example, El Niño events occur only every
four to seven years, and previous data on them might not have been collected as systematically as they are
today. Methods are also needed for the efficient computation of sophisticated spatial aggregates and the
handling of spatial-related data streams.
Mining complex data types: Scientific data sets are heterogeneous in nature. They typically involve
semi-structured and unstructured data, such as multimedia data and georeferenced stream data, as well as
data with sophisticated, deeply hidden semantics (e.g., genomic and proteomic data). Robust and
dedicated analysis methods are needed for handling spatiotemporal data, biological data, related concept
hierarchies, and complex semantic relationships. For example, in bioinformatics, a research problem is to
identify regulatory influences on genes. Gene regulation refers to how genes in a cell are switched on (or
off) to determine the cell’s functions.
Different biological processes involve different sets of genes acting together in precisely regulated
patterns. Thus, to understand a biological process we need to identify the participating genes and their
regulators. This requires the development of sophisticated data mining methods to analyze large
biological data sets for clues about regulatory influences on specific genes, by finding DNA segments
(“regulatory sequences”) mediating such influence.
Graph-based and network-based mining: It is often difficult or impossible to model several physical
phenomena and processes due to limitations of existing modeling approaches. Alternatively, labeled
graphs and networks may be used to capture many of the spatial, topological, geometric, biological, and
other relational characteristics present in scientific data sets. In graph or network modeling, each object to
be mined is represented by a vertex in a graph, and edges between vertices represent relationships
between objects. For example, graphs can be used to model chemical structures, biological pathways, and
data generated by numeric simulations such as fluid-flow simulations. The success of graph or network
modeling, however, depends on improvements in the scalability and efficiency of many graph-based data
mining tasks such as classification, frequent pattern mining, and clustering.
Visualization tools and domain-specific knowledge: High-level graphical user interfaces and
visualization tools are required for scientific data mining systems. These should be integrated with
existing domain-specific data and information systems to guide researchers and general users in searching
for patterns, interpreting and visualizing discovered patterns, and using discovered knowledge in their
decision making.
Data mining in engineering shares many similarities with data mining in science. Both practices often
collect massive amounts of data, and require data preprocessing, data warehousing, and scalable mining
of complex types of data. Both typically use visualization and make good use of graphs and networks.
Moreover, many engineering processes need real-time responses, and so mining data streams in real time
often becomes a critical component. Massive amounts of human communication data pour into our daily
life. Such communication exists in many forms, including news, blogs, articles, web pages, online
discussions, product reviews, tweets, messages, advertisements, and communications, both on the Web
and in various kinds of social networks. Hence, data mining in social science and social studies has
become increasingly popular. Moreover, user or reader feedback regarding products, speeches, and
articles can be analyzed to deduce general opinions and sentiments on the views of those in society. The
analysis results can be used to predict trends, improve work, and help in decision making.
Computer science generates unique kinds of data. For example, computer programs can be long, and their
execution often generates huge-size traces. Computer networks can have complex structures and the
network flows can be dynamic and massive. Sensor networks may generate large amounts of data with
varied reliability. Computer systems and databases can suffer from various kinds of attacks, and their
system/data accessing may raise security and privacy concerns. These unique kinds of data provide fertile
land for data mining.
RAPID MINER
Rapid Miner is one of the most popular predictive analytics systems, created by the company of the same
name. It is written in the Java programming language. It offers an integrated environment for text mining,
deep learning, machine learning, and predictive analytics.
The tool can be used for a wide range of applications, including business applications, commercial
applications, research, education, training, application development, and machine learning.
Rapid Miner provides a server on-site as well as in public or private cloud infrastructure. It is based on a
client/server model. Rapid Miner comes with template-based frameworks that enable fast delivery with
few errors (which are commonly expected in the manual coding process). Rapid Miner is a data mining
tool used to implement various classification and clustering algorithms. An important feature of Rapid
Miner is its ability to display results visually. It is more powerful than Weka because of its language
independence.
Rapid Miner also provides an integrated environment for machine learning, data mining, text mining,
predictive analytics and business analytics. It is used for business and industrial applications as well as for
research, education, training, rapid prototyping, and application development and supports all steps of the data
mining process.
General Features
Rapid Miner is an environment for machine learning and data mining processes.
Rapid Miner uses XML to describe the operator trees that model the knowledge discovery process.
It has flexible operators for data input and output in various file formats.
It contains more than 100 learning schemes for regression, classification, and clustering analysis.
Rapid Miner produces a selection of charts and visualizations automatically, choosing the most
appropriate settings based on data properties.
If you set up an invalid workflow, Rapid Miner suggests Quick Fixes to make it valid.
ORANGE
Orange supports a flexible environment for developers, analysts, and data mining specialists. It is based
on Python, a new-generation scripting language and programming environment, in which data mining
scripts can be simple yet powerful. Orange employs a component-based approach for fast prototyping:
we can implement an analysis technique much like assembling LEGO bricks, or simply use an existing
algorithm. Orange components are used for scripting, and Orange widgets for visual programming.
Widgets use a specially designed communication mechanism for passing objects such as classifiers,
regressors, attribute lists, and data sets, which makes it easy to build rather complex data mining schemes
that use modern approaches and techniques.
Orange core objects and Python modules cover numerous data mining tasks, ranging from data
preprocessing to evaluation and modeling. The operating principle of Orange is to cover the main
techniques and perspectives in data mining and machine learning. For example, Orange's top-down
induction of decision trees is a technique built from numerous components, any of which can be
prototyped in Python and used in place of the original one. Orange widgets are not simply graphical
objects that provide a graphical interface to a particular method in Orange; they also include an adaptable
signaling mechanism for communication and exchange of objects such as data sets, classification models,
learners, and objects that store evaluation results. Together, these ideas distinguish Orange from other
data mining frameworks.
Orange Widgets:
Orange widgets provide a graphical user interface to Orange's data mining and machine learning
techniques. They include widgets for data entry and preprocessing; for classification, regression,
association rules, and clustering; a set of widgets for model evaluation and visualization of evaluation
results; and widgets for exporting models into PMML.
Widgets communicate through tokens that are passed from a sender widget to a receiver widget. For
example, a file widget outputs data objects that can be received by a classification tree learner widget.
The classification tree learner builds a classification model and sends it to a widget that graphically
displays the tree. An evaluation widget may receive a data set from the file widget and learned model
objects from other widgets.
WEKA
Weka's original version was primarily designed as a tool for analyzing data from agricultural domains.
Still, the more recent fully Java-based version (Weka 3), developed in 1997, is now used in many different
application areas, particularly for educational purposes and research. Weka has the following advantages:
Weka supports several standard data mining tasks, specifically data preprocessing, clustering,
classification, regression, visualization, and feature selection. Input to Weka is expected to be formatted
according to the Attribute-Relation File Format, in files with the .arff extension.
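For illustration, a small ARFF file and a minimal hand-rolled reader are sketched below. The relation, attributes, and values are made up for the example, and real applications would load the file with Weka itself or an established library.

```python
# Illustrative sketch: the shape of Weka's ARFF format, parsed with a
# tiny hand-rolled reader (toy relation invented for this example).

arff = """\
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute humidity numeric
@attribute play {yes, no}
@data
sunny,85,no
overcast,65,yes
"""

def parse_arff(text):
    attrs, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        low = line.lower()
        if low.startswith("@attribute"):
            attrs.append(line.split()[1])       # attribute name
        elif low.startswith("@data"):
            in_data = True                      # rows follow from here
        elif in_data:
            rows.append(dict(zip(attrs, line.split(","))))
    return attrs, rows

attrs, rows = parse_arff(arff)
print(attrs)
print(rows[0]["play"])
```

Each data row is a comma-separated instance whose values line up with the declared attributes, matching Weka's flat-file assumption described below.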
All of Weka's techniques are predicated on the assumption that the data is available as one flat file or
relation, where a fixed number of attributes describes each data point (normally numeric or nominal
attributes, though some other attribute types are also supported). Weka provides access to SQL databases
using Java Database Connectivity and can process the result returned by a database query. Weka also
provides access to deep learning with Deeplearning4j.
It is not capable of multi-relational data mining. Still, there is separate software for converting a
collection of linked database tables into a single table suitable for processing using Weka. Another
important area currently not covered by the algorithms included in the Weka distribution is sequence
modelling.
Features of Weka
1. Preprocess
The preprocessing of data is a crucial task in data mining. Because most data is raw, it may contain
empty or duplicate values, garbage values, outliers, extra columns, or inconsistent naming conventions.
All these issues degrade the results.
To make the data cleaner, better, and more consistent, WEKA comes up with a comprehensive set of
options under the filter category.
2. Classify
Classification is one of the essential functions in machine learning, where we assign classes or categories
to items. The classic examples of classification are: declaring a brain tumour as "malignant" or
"benign" or assigning an email to a "spam" or "not_spam" class.
After selecting the desired classifier, we select test options for the training set. Some of the options are:
o Use training set: the classifier will be tested on the same training set.
o A supplied test set: evaluates the classifier based on a separate test set.
o Cross-validation Folds: assessment of the classifier based on cross-validation using the number
of provided folds.
o Percentage split: the classifier will be judged on a specific percentage of data.
Other than these, we can also use more test options such as Preserve order for % split, Output source
code, etc.
3. Cluster
In clustering, a dataset is arranged into different groups/clusters based on some similarities. In this case,
the items within the same cluster are similar to each other but different from the items in other clusters.
Examples of clustering include identifying customers with similar behaviours and organizing regions
according to homogenous land use.
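As a minimal sketch of this idea, the toy code below runs k-means on one-dimensional values with fixed initial centers; real work would use a library implementation with proper initialization and multi-dimensional distances.

```python
# Illustrative sketch: k-means clustering on 1-D values with fixed
# initial centers (toy data invented for this example).

def kmeans_1d(values, centers, iters=10):
    clusters = [[] for _ in centers]
    for _ in range(iters):
        # Assignment step: each value joins its nearest center's cluster.
        clusters = [[] for _ in centers]
        for v in values:
            j = min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            clusters[j].append(v)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers, clusters

centers, clusters = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.5, 8.5], [0.0, 10.0])
print(sorted(round(c, 2) for c in centers))
```

The two centers settle near 1.0 and 9.0, separating the low-valued and high-valued groups, which mirrors how customers with similar behaviors end up in the same cluster.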
4. Associate
Association rules highlight all the associations and correlations between items of a dataset. In short, it is
an if-then statement that depicts the probability of relationships between data items. A classic example of
association refers to a connection between the sale of milk and bread.
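The if-then reading of such a rule can be made concrete with its two standard measures: support (how often the items appear together) and confidence (how often the consequent follows the antecedent). The transactions below are invented for illustration.

```python
# Illustrative sketch: support and confidence of the rule "milk => bread"
# over toy transactions (numbers invented for this example).

transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "jam"},
]

def rule_stats(lhs, rhs, transactions):
    n = len(transactions)
    both = sum(1 for t in transactions if lhs in t and rhs in t)
    lhs_count = sum(1 for t in transactions if lhs in t)
    support = both / n            # fraction of all baskets with both items
    confidence = both / lhs_count # fraction of lhs baskets that also have rhs
    return support, confidence

s, c = rule_stats("milk", "bread", transactions)
print(s, round(c, 2))
```

Here the rule "milk => bread" has support 0.5 (half of all baskets contain both) and confidence about 0.67 (two of the three milk baskets also contain bread).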
5. Select Attributes
Every dataset contains a lot of attributes, but several of them may not be significantly valuable. Therefore,
removing the unnecessary and keeping the relevant details are very important for building a good model.
6. Visualize
In the visualize tab, different plot matrices and graphs are available to show the trends and errors
identified by the model.
KNIME
KNIME is a powerful, free, open source data mining tool that enables data scientists to create independent
applications and services through a drag-and-drop interface. It serves well as a resource for business
intelligence and data analytics. The software is available as a free download on the KNIME website.
For the purposes of direct marketing, KNIME will allow converting multiple data sources, spreadsheets,
flat files, databases, and more into a standard format. This data can be normalized, analyzed, and
configured to generate visual representations. In other words, it can shape data into information. This data
aggregation makes it possible to create easy-to-understand visualizations.
Direct marketers can use KNIME as a key component of their marketing technology stack to gain a
better understanding of the large amounts of data involved in a direct marketing operation.
Many business intelligence features are built in. There are numerous data visualization tools which can be
used for creating larger applications, and with some configuration, it can create an extremely powerful
dashboard for analyzing direct marketing data.
Getting started with KNIME takes some configuration; it is not an out-of-the-box solution. There are
numerous templates that can be configured for a multitude of purposes; however, there are none specific
to direct marketing. That said, the functionality can certainly be configured to meet business intelligence
needs for use in direct marketing operations.
The modular nature of KNIME makes it possible to create brand-new workflows which can be well-
adapted to a BI dashboard. There are many useful features and modules that do not need to be built from
scratch; in many cases it merely requires configuring the data itself to use pre-existing structures.
Once configured, KNIME enables marketers to create various types of reports and can theoretically help
them gain a much better understanding of users and target markets.
SISENSE
Sisense is an extremely useful BI tool, best suited for reporting purposes within an organization. It is
developed by the company of the same name, Sisense. It has a brilliant capability to handle and process
data for small-scale and large-scale organizations.
It allows combining data from various sources to build a common repository, and further refines the data
to generate rich reports that are shared across departments for reporting.
Sisense was awarded best BI software in 2016 and still holds a good position.
Sisense generates highly visual reports. It is specially designed for non-technical users, and allows a
drag-and-drop facility as well as widgets.
ORACLE DATA MINER
Oracle Data Miner provides an Application Programming Interface (API) that enables programmers to
build and use models.
Oracle Data Miner workflows capture and document the analytical methodology of the user. It can be
saved and shared with others to automate advanced analytical methodologies.
The Oracle Data Miner GUI is an extension to Oracle SQL Developer 3.0 or later that enables data
analysts to work through the data mining process graphically.
RATTLE
Rattle is used widely by data scientists across industry and by many independent consultants. It is also
used for teaching the concepts of machine learning and data mining, and as a pathway into the full
power of R for the data scientist. An important feature of Rattle is that all functionality accessed via the
graphical user interface is captured as a structured R script that can be run independently of Rattle to
repeat every step performed in the GUI. In addition to being a useful tool for learning R, it transparently
supports repeatability of all activity in scripts that can be extended or automatically run at a later time.
Rattle provides considerable data mining functionality by exposing the power of R through a graphical
user interface. Rattle is also used as a teaching facility for learning R. There is an option called the Log
Code tab, which replicates the R code for any activity undertaken in the GUI and can be copied and
pasted. Rattle can be used for statistical analysis or model generation, and it allows the dataset to be
partitioned into training, validation, and testing sets. The dataset can be viewed and edited.
Terminology
There are various definitions of user interface types, so here’s how I’ll be using these terms:
GUI = Graphical User Interface using menus and dialog boxes to avoid having to type programming code. I
do not include any assistance for programming in this definition. So, GUI users are people who prefer using a
GUI to perform their analyses. They don’t have the time or inclination to become good programmers.
IDE = Integrated Development Environment, which helps programmers write code. I do not include point-and-
click style menus and dialog boxes when using this term. IDE users are people who prefer to write R code to
perform their analyses.