
Chapter 3: Big Data Analytics and Big Data Analytics Techniques

Big Data and its Importance

The importance of big data does not revolve around how much data a company has but how a
company utilises the collected data. Every company uses data in its own way; the more efficiently a
company uses its data, the more potential it has to grow. The company can take data from any source
and analyse it to find answers which will enable:

Cost Savings: Big Data tools like Hadoop and cloud-based analytics can bring cost advantages to a business when large amounts of data need to be stored, and these tools also help in identifying more efficient ways of doing business.

Time Reductions: The high speed of tools like Hadoop and in-memory analytics can easily identify new sources of data, which helps businesses analyze data immediately and make quick decisions based on what they learn.

Understand the market conditions: By analyzing big data you can get a better understanding of current market conditions. For example, by analyzing customers' purchasing behaviour, a company can find out which products sell the most and align production with this trend, getting ahead of its competitors.

Control online reputation: Big data tools can perform sentiment analysis, so you can get feedback about who is saying what about your company. If you want to monitor and improve the online presence of your business, big data tools can help with this.

Using Big Data Analytics to Boost Customer Acquisition and Retention

The customer is the most important asset any business depends on. No business can claim success without first establishing a solid customer base. However, even with a customer base, a business cannot afford to disregard the intense competition it faces. If a business is slow to learn what customers are looking for, it can easily end up offering poor-quality products. The result is a loss of clientele, which has an adverse effect on overall business success. The use of big data allows businesses to observe various customer-related patterns and trends, and observing customer behaviour is essential for building loyalty.

Using Big Data Analytics to Solve Advertisers' Problems and Offer Marketing Insights

Big data analytics can help reshape all business operations. This includes the ability to match customer expectations, change the company's product line and, of course, ensure that marketing campaigns are powerful.

Big Data Analytics As a Driver of Innovations and Product Development


Another huge advantage of big data is the ability to help companies innovate and redevelop their
products.

Drivers for Big data

Big Data emerged in the last decade from a combination of business needs and technology
innovations. A number of companies that have Big Data at the core of their strategy have become
very successful at the beginning of the 21st century. Famous examples include Apple, Amazon,
Facebook and Netflix.

A number of business drivers are at the core of this success and explain why Big Data has quickly
risen to become one of the most coveted topics in the industry. Six main business drivers can be
identified:

1. The digitization of society;


2. The plummeting of technology costs;
3. Connectivity through cloud computing;
4. Increased knowledge about data science;
5. Social media applications;
6. The upcoming Internet-of-Things (IoT).

This section gives a high-level overview of each of these business drivers. Each of them adds to the competitive advantage of enterprises by creating new revenue streams or by reducing operational costs.

1. The digitization of society


Big Data is largely consumer driven and consumer oriented. Most of the data in the world is generated by consumers, who are nowadays 'always on'. Most people now spend four to six hours per day consuming and generating data through a variety of devices and (social) applications. With every click, swipe or message, new data is created in a database somewhere in the world. Because almost everyone now carries a smartphone, data creation adds up to incomprehensible amounts. Some studies estimate that 60% of all data was generated within the last two years, which is a good indication of the rate at which society has digitized.

2. The plummeting of technology costs


Technology for collecting and processing massive quantities of diverse (high-variety) data has become increasingly affordable. The costs of data storage and processing keep declining, making it possible for small businesses and individuals to get involved with Big Data. For storage capacity, the often-cited Moore's Law still roughly holds: storage density (and therefore capacity) doubles about every two years.

Besides plummeting storage costs, a second key contributor to the affordability of Big Data has been the development of open-source Big Data software frameworks. The most popular framework (nowadays considered the standard for Big Data) is Apache Hadoop, for distributed storage and processing. Because these frameworks are freely available as open source, it has become increasingly inexpensive to start Big Data projects in organizations.

3. Connectivity through cloud computing


Cloud computing environments (where data is remotely stored in distributed storage systems) have made it possible to quickly scale IT infrastructure up or down and facilitate a pay-as-you-go model. This means that organizations that want to process massive quantities of data (and thus have large storage and processing requirements) do not have to invest in large quantities of IT infrastructure. Instead, they can license the storage and processing capacity they need and pay only for what they actually use. As a result, most Big Data solutions leverage the possibilities of cloud computing to deliver their solutions to enterprises.

4. Increased knowledge about data science


In the last decade, the terms data science and data scientist have become tremendously popular. In October 2012, Harvard Business Review called data scientist "the sexiest job of the 21st century", and many other publications have featured this new job role in recent years. Demand for data scientists (and similar job titles) has increased tremendously, and many people have become actively engaged in the domain of data science.

As a result, knowledge of and education in data science have greatly professionalized, and more information becomes available every day. While statistics and data analysis previously remained a mostly academic field, they are quickly becoming popular subjects among students and the working population.

5. Social media applications


Everyone understands the impact that social media has on daily life. In the study of Big Data, however, social media plays a role of paramount importance, not only because of the sheer volume of data produced every day through platforms such as Twitter, Facebook, LinkedIn and Instagram, but also because social media provides nearly real-time data about human behavior.

Social media data provides insights into the behaviors, preferences and opinions of ‘the public’ on a
scale that has never been known before. Due to this, it is immensely valuable to anyone who is able
to derive meaning from these large quantities of data. Social media data can be used to identify
customer preferences for product development, target new customers for future purchases, or even
target potential voters in elections. Social media data might even be considered one of the most
important business drivers of Big Data.

6. The upcoming internet of things (IoT)


The Internet of things (IoT) is the network of physical devices, vehicles, home appliances and other
items embedded with electronics, software, sensors, actuators, and network connectivity which
enables these objects to connect and exchange data. It is increasingly gaining popularity as consumer
goods providers start including ‘smart’ sensors in household appliances. Whereas the average
household in 2010 had around 10 devices that connected to the internet, this number is expected to
rise to 50 per household by 2020. Examples of these devices include thermostats, smoke detectors,
televisions, audio systems and even smart refrigerators.
Optimization techniques

1. Remove Latency in Processing


Latency in processing occurs in traditional storage models that are slow at retrieving data. Organizations can decrease processing time by moving away from slow hard disks and relational databases toward in-memory computing. Apache Spark is one popular example of an in-memory computing framework.
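As an illustration of the in-memory idea, here is a minimal PySpark sketch (assuming a local Spark installation; the file name events.csv and the event_type column are hypothetical placeholders):

```python
# Minimal PySpark sketch of in-memory computing: cache a dataset so repeated
# queries reuse memory instead of re-reading from disk.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("in-memory-demo").getOrCreate()

# "events.csv" and the "event_type" column are placeholders for illustration.
df = spark.read.csv("events.csv", header=True, inferSchema=True)
df.cache()  # mark the DataFrame to be kept in memory after its first use

print(df.count())                         # first action materializes the cache
df.groupBy("event_type").count().show()   # subsequent queries hit the cache

spark.stop()
```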

2. Exploit Data in Real Time


The goal of real-time data is to decrease the time between an event and the actionable insight that
could come from it. In order to make informed decisions, organizations should strive to make the
time between insight and benefit as short as possible. Apache Spark Streaming helps organizations
perform real-time data analysis.
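As a minimal sketch of real-time processing, the example below uses Spark's Structured Streaming API to count words arriving on a socket; the socket source on localhost:9999 is just a stand-in for a real event stream such as Kafka:

```python
# Minimal Spark Structured Streaming sketch: count words arriving on a socket
# in near real time and print running totals to the console.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# The socket source is illustrative; production jobs would read from Kafka, etc.
lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream
               .outputMode("complete")
               .format("console")
               .start())
query.awaitTermination()
```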

3. Analyze Data Prior to Acting


It’s better to analyze data before acting on it, and this can be done through a combination of batch
and real-time data processing. While historical data has been used to analyze trends for years, the
availability of current data — both in batch form and streaming — now enables organizations to spot
changes in those trends as they occur. A full range of up-to-date data gives companies a broader and
more accurate perspective.

4. Turn Data into Decisions


Through machine learning, new methods of data prediction are constantly emerging. The vast amount of big data that each organization has to manage would be impossible to handle without big data software and service platforms. Machine learning turns massive amounts of data into trends, which can be analyzed and used for high-quality decision making. Organizations should use this technology to its fullest in order to fully optimize big data.
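As a toy illustration of turning data into a decision rule, here is a minimal scikit-learn sketch on synthetic data (the features and the "churn" label are invented purely for the example):

```python
# Minimal scikit-learn sketch: learn a simple decision rule from synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                  # three synthetic "behaviour" features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # synthetic "will churn" label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```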

5. Leverage the Latest Technology


Big data technology is constantly evolving. To continue optimizing its data to the fullest, an organization must keep up with the changing technology.
The key to being agile enough to jump from platform to platform is to minimize the friction involved. Doing so makes data more flexible and more adaptable to the next technology. One way to minimize that friction is to use a data integration platform such as Talend's Data Fabric, which helps organizations bring software and service platforms, and more, together in one place.
Dimensionality Reduction techniques

Two popular dimensionality reduction techniques, Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA), are widely used to deal with Big Data.

What is Principal Component Analysis?

Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.

Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity: smaller data sets are easier to explore and visualize, and machine learning algorithms can analyze the data much faster without extraneous variables to process.

So to sum up, the idea of PCA is simple — reduce the number of variables of a data set, while
preserving as much information as possible.

Step By Step Explanation of PCA

STEP 1: Standardization

The aim of this step is to standardize the range of the continuous initial variables so that each one of
them contributes equally to the analysis.

More specifically, the reason why it is critical to perform standardization prior to PCA, is that the
latter is quite sensitive regarding the variances of the initial variables. That is, if there are large
differences between the ranges of initial variables, those variables with larger ranges will dominate
over those with small ranges (For example, a variable that ranges between 0 and 100 will dominate
over a variable that ranges between 0 and 1), which will lead to biased results. So, transforming the
data to comparable scales can prevent this problem.

Mathematically, this can be done by subtracting the mean and dividing by the standard deviation for each value of each variable: z = (value − mean) / standard deviation.

Once the standardization is done, all the variables will be transformed to the same scale.
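A minimal NumPy sketch of this standardization (z-scoring) step; X is assumed to be a samples-by-variables matrix with toy values:

```python
# Standardization (z-scoring): subtract each variable's mean and divide by its
# standard deviation so every variable contributes on the same scale.
import numpy as np

X = np.array([[170.0, 65.0],     # toy data: two variables on very different scales
              [180.0, 85.0],
              [160.0, 55.0],
              [175.0, 75.0]])

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0))   # approximately 0 for every variable
print(X_std.std(axis=0))    # 1 for every variable
```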

Step 2: Covariance Matrix Computation

The aim of this step is to understand how the variables of the input data set are varying from the mean
with respect to each other, or in other words, to see if there is any relationship between them. Because
sometimes, variables are highly correlated in such a way that they contain redundant information. So,
in order to identify these correlations, we compute the covariance matrix.

The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions) that has as entries the covariances associated with all possible pairs of the initial variables. For example, for a 3-dimensional data set with 3 variables x, y and z, the covariance matrix is a 3×3 matrix of this form:
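    Cov(x,x)  Cov(x,y)  Cov(x,z)
    Cov(y,x)  Cov(y,y)  Cov(y,z)
    Cov(z,x)  Cov(z,y)  Cov(z,z)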

Since the covariance of a variable with itself is its variance (Cov(a,a)=Var(a)), in the main diagonal
(Top left to bottom right) we actually have the variances of each initial variable. And since the
covariance is commutative (Cov(a,b)=Cov(b,a)), the entries of the covariance matrix are symmetric
with respect to the main diagonal, which means that the upper and the lower triangular portions are
equal.

What do the covariances that we have as entries of the matrix tell us about the correlations
between the variables?

It’s actually the sign of the covariance that matters :

if positive then : the two variables increase or decrease together (correlated)

if negative then : One increases when the other decreases (Inversely correlated)

Now that we know that the covariance matrix is no more than a table that summarizes the correlations between all the possible pairs of variables, let's move to the next step.

Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix to identify the
principal components

Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from the covariance matrix in order to determine the principal components of the data. Before getting to the explanation of these concepts, let's first understand what we mean by principal components.

Principal components are new variables that are constructed as linear combinations or mixtures of
the initial variables. These combinations are done in such a way that the new variables (i.e., principal
components) are uncorrelated and most of the information within the initial variables is squeezed or
compressed into the first components. So, the idea is that 10-dimensional data gives you 10 principal components, but PCA tries to put the maximum possible information in the first component, then the maximum remaining information in the second, and so on, until most of the information is concentrated in the first few components (this concentration is what a scree plot visualizes).
Organizing information in principal components this way allows you to reduce dimensionality without losing much information, by discarding the components with low information and treating the remaining components as your new variables.

An important thing to realize here is that the principal components are less interpretable and don't have any real meaning, since they are constructed as linear combinations of the initial variables.

Geometrically speaking, principal components represent the directions of the data that explain a maximal amount of variance, that is to say, the lines that capture most of the information in the data. The relationship between variance and information here is that the larger the variance carried by a line, the larger the dispersion of the data points along it, and the larger the dispersion along a line, the more information it carries. To put all this simply, just think of principal components as new axes that provide the best angle from which to see and evaluate the data, so that the differences between the observations are more visible.
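Putting the three steps together, here is a minimal NumPy sketch of PCA on a toy matrix X (rows are observations, columns are variables): it standardizes the data, computes the covariance matrix, extracts eigenvectors and eigenvalues, and projects onto the leading components.

```python
# Minimal NumPy PCA sketch following the steps above: standardize, compute the
# covariance matrix, eigendecompose it, and project onto the leading components.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # toy data: 200 observations, 5 variables

# Step 1: standardization
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix (variables along columns, hence rowvar=False)
cov = np.cov(X_std, rowvar=False)

# Step 3: eigenvectors/eigenvalues, sorted by decreasing eigenvalue
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals / eigvals.sum()
print("explained variance ratio:", np.round(explained, 3))

# Keep the first two principal components and project the data onto them
W = eigvecs[:, :2]
X_pca = X_std @ W
print("reduced shape:", X_pca.shape)      # (200, 2)
```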

Linear Discriminant Analysis

LDA is another popular dimensionality reduction approach used as a pre-processing step in data mining and machine learning applications [39]. The main aim of LDA is to project a dataset with a high number of features onto a lower-dimensional space with good class separability, which reduces computational costs.

The approach followed by LDA is very much analogous to that of PCA. However, in addition to maximizing the variance of the data (as PCA does), LDA also maximizes the separation between multiple classes. The goal of Linear Discriminant Analysis is to project an x-dimensional feature space onto a smaller subspace of dimension i (where i ≤ x − 1) without disturbing the class information.
The five steps for performing an LDA are listed below.

1. For every class in the dataset, compute a d-dimensional mean vector.
2. Compute the scatter matrices (the within-class and between-class scatter matrices).
3. Compute the eigenvectors (e1, e2, e3, ..., ed) and their corresponding eigenvalues (ψ1, ψ2, ψ3, ..., ψd) of the scatter matrices.
4. Sort the eigenvectors in descending order of their eigenvalues and select the i eigenvectors with the largest eigenvalues to form a d × i matrix W.
5. Use this d × i eigenvector matrix W to transform the input samples into the new subspace, i.e. Y = X · W.

PCA vs. LDA: Both LDA and PCA are linear transformation techniques that can be used to reduce the dimensionality (number of features) of a dataset. PCA is an "unsupervised" algorithm, whereas LDA is "supervised".
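A minimal scikit-learn sketch contrasting the two on the bundled Iris data: PCA ignores the class labels, while LDA uses them (and can produce at most classes − 1 components).

```python
# PCA (unsupervised) vs. LDA (supervised) on the Iris data set.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# PCA looks only at the variance of X; the labels y are never used.
X_pca = PCA(n_components=2).fit_transform(X_std)

# LDA maximizes class separability, so it needs y; with 3 classes it can
# produce at most 2 discriminant components.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_std, y)

print(X_pca.shape, X_lda.shape)   # (150, 2) (150, 2)
```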

Time Series Analysis

A time series is a sequence of data points recorded in time order, often taken at successive, equally spaced points in time.

Time series data can be taken yearly, monthly, weekly, hourly or even by the minute.

Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. It is different from time series forecasting, which is the use of a model to predict future values based on previously observed values. While time series analysis is mostly statistics, machine learning enters the picture with time series forecasting. Time series analysis is a preparatory step for time series forecasting.

Examples of time series data

 Stock prices, Sales demand, website traffic, daily temperatures, quarterly sales

Time series data is different from data used in regression analysis because of its time-dependent nature.
1. Autocorrelation: Regression analysis requires that there is little or no autocorrelation in the data. Autocorrelation occurs when the observations are not independent of each other; for example, in stock prices, the current price is not independent of the previous price (see the sketch after this list).
2. Seasonality, a characteristic which we will discuss below.
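A minimal pandas sketch of the autocorrelation point above, using a synthetic random-walk "price" series standing in for real stock data:

```python
# Autocorrelation in a time-dependent series: a random walk standing in for a
# stock price, where each value depends on the previous one.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
price = pd.Series(100 + rng.normal(0, 1, 500).cumsum(),
                  index=pd.date_range("2020-01-01", periods=500, freq="D"))

print(price.autocorr(lag=1))          # close to 1: today's price depends on yesterday's
print(price.diff().autocorr(lag=1))   # daily changes are far less autocorrelated
```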

Time series analysis and forecasting are based on the assumption that past patterns in the variable to
be forecast will continue unchanged into the future.

Why is Time series analysis important?

Because time series forecasting is important! Business forecasting, understanding past behaviour and planning for the future, especially for policymakers, rely heavily on time series analysis.

Time series data Vs Non-Time series data

If time is what uniquely identifies one observation from another, then it is highly likely that it is a
time series dataset. Not every data collected with respect to time represents a time series. The
observations have to be dependent on time.

Components of a Time Series

Trend: a general direction in which something is developing or changing. A trend can be upward (uptrend) or downward (downtrend). The increase or decrease does not have to be consistently in the same direction throughout a given period.
Seasonality: Predictable pattern that recurs or repeats over regular intervals. Seasonality is often
observed within a year or less.

An example is the seasonality of energy demand: energy demand is higher during winter and lower during summer, coinciding with the climatic seasons. This pattern repeats every year, indicating seasonality in the time series.

Another example is in retail sales where stores experience high sales during the last quarter of the
year.

Cycles: occur when a series follows an up-and-down pattern that is not seasonal. Cyclical variations are periodic in nature and repeat themselves like the business cycle, which has four phases: (i) peak, (ii) recession, (iii) trough/depression and (iv) expansion.

Seasonality is different from cycles: seasonal patterns are observed within a calendar year, while cyclical effects can span durations shorter or longer than a calendar year.

Irregular fluctuations: variations that occur due to sudden causes and are unpredictable, for example a rise in food prices due to war, floods, earthquakes or farmers' strikes.
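These components can be separated programmatically. Below is a minimal sketch using statsmodels' seasonal_decompose on a synthetic monthly series with a trend and yearly seasonality; real data such as energy demand would be loaded instead.

```python
# Decompose a synthetic monthly series into trend, seasonal and residual parts.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-01", periods=96, freq="MS")     # 8 years, monthly
trend = np.linspace(100, 160, len(idx))                      # upward trend
seasonal = 10 * np.sin(2 * np.pi * idx.month / 12)           # yearly seasonality
series = pd.Series(trend + seasonal + rng.normal(0, 2, len(idx)), index=idx)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head(12))    # the repeating within-year pattern
```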
When not to use time series analysis

1. When the values are constant: the values do not depend on time, so the data is not time series data, and analysis is pointless because the values never change.
2. When the values follow a known function (for example sin x or cos x): it is again pointless to use time series analysis, since the values can be calculated directly from the function.

Social Media Mining and Social Network Analysis and its Applications

What is social media data mining?

Data mining commonly refers to "knowledge discovery within databases." This may be the case for regular data mining, but social media data mining is done on a much larger scale.

What is social media data mining used for?

Social media data mining is used to uncover hidden patterns and trends from social media
platforms like Twitter, LinkedIn, Facebook, and others. This is typically done through
machine learning, mathematics, and statistical techniques.

While data mining occurs within a company’s internal databases and systems, social media data
mining is far less limited as to what and where it explores.

After social data is mined, results are passed on to social media analytics software to explain and
visualize the insights.

How does social media data mining work?

Social data first needs to be collected and processed. This is data that is publicly available, which
may include age, sex, race, geographic location, job profession, schools you’ve attended, languages
you speak, friends and connections, networks you belong to, and more.
Then there’s the unstructured content of what you post on social media – like tweets, comments,
status updates – which is mainly what businesses, firms, and agencies are looking to mine. So, if your
profiles are completely public, just understand this is generally fair game for social media data
mining.

Then a variety of data mining techniques are applied. Some techniques may utilize machine learning,
some may not. This is all dependent on how deep the “miners” are looking to explore.

Finally, all of this insight needs to be visualized in a way so that it can be interpreted. While there are
a variety of data visualization tools to use, social media analytics often provides its own visualization
options.

That’s how social media data mining works in a nutshell, so, what are some of its use-cases?

What are some uses of social media data mining?

Why would a business, research firm, or government agency look to mine social data? Well, there
are a number of reasons. Here are a few of the more prominent ones:

1. Trend analysis

Trend analysis can be a very important metric for businesses who utilize social listening. For example,
businesses may analyze which topics, mentions, and keywords on social media are currently trending,
and apply mining techniques to understand why.

This insight can be extremely telling; let me provide you with one of the most prominent examples
of what I mean.

An analysis by SimplyMeasured concluded that mining sentiment on social media platforms like Twitter and Facebook in the run-up to the 2016 U.S. presidential election was more accurate at predicting the election's result than many of the traditional polls that year, which forecast Hillary Clinton as the winner. In that analysis, then-candidate Donald Trump had more positive sentiment on social media than his opponent, while negative sentiment was just about neck-and-neck.

Trend analysis lets us see a different picture and understand hidden truths.
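Here is a minimal sketch of the kind of sentiment scoring used in such trend analyses, using NLTK's VADER analyzer; the example posts are invented, and a real pipeline would pull them from a platform API.

```python
# Score the sentiment of a few (invented) social media posts with VADER.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # one-time lexicon download
sia = SentimentIntensityAnalyzer()

posts = [
    "Loving the new release, great job!",
    "This update is terrible, nothing works anymore.",
    "Meh, it's okay I guess.",
]
for post in posts:
    scores = sia.polarity_scores(post)       # neg / neu / pos / compound scores
    print(f"{scores['compound']:+.2f}  {post}")
```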
2. Event detection (social heat mapping)

Event detection – sometimes referred to as social heat mapping – can be an important metric for
researchers and agencies who utilize social media monitoring. The example below shows why.

In early 2016, scientists at ORNL mined social data from Twitter to examine power outages across
the U.S. By looking at textual and image data, paired with information on where this data was coming
from (geospatial), they could see in real-time where major outages were occurring.

Just think of the many possibilities and use-cases from a model like this. One I can think of is during
natural disasters.

3. Social spam detection

Even the social media platforms we use daily are benefiting from the use of data mining. One example
of this is through social spam detection.
You may see it on platforms where spammers and bots are very prominent – I’m looking at you
Twitter and Instagram.

Bots are always finding loopholes on these platforms to spam users with annoying, repetitive, and
useless content. Because of how powerful automation has become, detecting these bots and squashing
them can take some time. With social media data mining, platforms are steadily getting better at spam
detection.

So, what could trigger spam detection? Signals include an excessive number of followers gained over an extremely short period of time, as well as excessive tweeting, commenting, tagging and post updates.
To be more proactive about social spamming, Twitter recently pushed an update to limit the number
of accounts a user can follow in one day from 1,000 to 400.

Discovering the unknown

Whether it’s social media data mining or data mining in general, the whole purpose is to dive in and
discover what isn’t visible at surface-level.

With advances in technologies such as machine learning and artificial neural networks, social media
data mining will only continue to get more creative and in-depth. In the meantime, be sure to visualize
your results in a way larger audiences can understand.

Social media analytics often provide their own visualizations, but for more advanced users, it may be
worth seeing which data visualization options are out there.

Big Data analysis using Hadoop and other technologies

With rapid innovation, frequently evolving technologies and a rapidly growing internet population, systems and enterprises are generating huge amounts of data, to the tune of terabytes and even petabytes of information. Since data is being generated in very large volumes, at great velocity, and in many multi-structured formats (images, videos, weblogs, sensor data, etc.) from many different sources, there is a huge demand to efficiently store, process and analyze this large amount of data to make it usable.

Hadoop is undoubtedly the preferred choice for such a requirement due to its key characteristics of being reliable, flexible, economical and scalable. While Hadoop provides the ability to store this large-scale data on HDFS (Hadoop Distributed File System), there are multiple solutions available for analyzing it, such as MapReduce, Pig and Hive. With the advancement of these different data analysis technologies, there are many different schools of thought about which Hadoop data analysis technology should be used when, and which is most efficient.

A well-executed big data analysis provides the possibility to uncover hidden markets, discover
unfulfilled customer demands and cost reduction opportunities and drive game-changing, significant
improvements in everything from telecommunication efficiencies and surgical or medical treatments,
to social media campaigns and related digital marketing promotions.

What is Big Data Analysis?

Big data is mostly generated from social media websites, sensors, devices, video/audio, networks, log
files and web, and much of it is generated in real time and on a very large scale. Big data analytics is
the process of examining this large amount of different data types, or big data, in an effort to uncover
hidden patterns, unknown correlations and other useful information.

Hadoop Data Analysis Technologies

Let’s have a look at the existing open source Hadoop data analysis technologies to analyze the huge
stock data being generated very frequently.
Feature | MapReduce | Pig | Hive
Language | Map and Reduce functions (can be implemented in C, Python, Java) | Pig Latin (scripting language) | SQL-like (Hive-QL)
Schemas/Types | No | Yes (implicit) | Yes (explicit)
Partitions | No | No | Yes
Server | No | No | Optional (Thrift)
Lines of code | More lines of code | Fewer (around 10 lines of Pig ≈ 200 lines of Java) | Fewer than MapReduce and Pig due to its SQL-like nature
Development time | More development effort | Rapid development | Rapid development
Abstraction | Lower level of abstraction (rigid procedural structure) | Higher level of abstraction (scripts) | Higher level of abstraction (SQL-like)
Joins | Hard to achieve join functionality | Joins can be easily written | Easy joins
Structured vs semi-structured vs unstructured data | Can handle all these kinds of data | Works on all these kinds of data | Deals mostly with structured and semi-structured data
Complex business logic | More control for writing complex business logic | Less control for writing complex business logic | Less control for writing complex business logic
Performance | A fully tuned MapReduce program would be faster than Pig/Hive | Slower than a fully tuned MapReduce program, but faster than badly written MapReduce code | Slower than a fully tuned MapReduce program, but faster than badly written MapReduce code

Which Data Analysis Technologies should be used?

The available sample dataset has the following properties:

 The data has a structured format
 It would require joins to calculate stock covariance
 It could be organized into a schema
 In a real environment, the data size would be very large

Based on these criteria and the comparison of features above, we can conclude the following:
If we use MapReduce, then complex business logic needs to be written to handle the joins. We would have to think from a map-and-reduce perspective about which particular code snippet will go into the map side and which into the reduce side. A lot of development effort goes into deciding how map-side and reduce-side joins will take place. We would not be able to map the data into a schema, and all of that would need to be handled programmatically.

If we use Pig, then we would not be able to partition the data, which could otherwise be used for sample processing on a subset of the data, for example by a particular stock symbol or a particular date or month. In addition, Pig is more of a scripting language, more suitable for prototyping and rapidly developing MapReduce-based jobs. It also does not provide the facility to map our data into an explicit schema format, which seems more suitable for this case study.

Hive not only provides a familiar programming model for people who know SQL, it also eliminates lots of boilerplate and sometimes tricky coding that we would have to do in MapReduce programming. If we apply Hive to analyze the stock data, we can leverage the SQL capabilities of Hive-QL, and the data can be managed in a particular schema. It also reduces development time and can manage joins between stock data using Hive-QL, which is of course quite difficult in MapReduce. Hive also has its Thrift server, through which we can submit Hive queries from anywhere to the Hive server, which in turn executes them. Hive SQL queries are converted into MapReduce jobs by the Hive compiler, relieving programmers of complex low-level programming and letting them focus on the business problem.

So based on the above discussion, Hive seems the perfect choice for the aforementioned case study.

Problem Solution with Hive

Apache Hive is a data warehousing package built on top of Hadoop for providing data summarization,
query and analysis. The query language being used by Hive is called Hive-QL and is very similar to
SQL.
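As a hedged sketch of what the Hive-based solution could look like, the Hive-QL below creates a stock table and runs an aggregate query. It is shown through PySpark's Hive support so it stays in Python; the table name, columns and file path are illustrative only.

```python
# Illustrative sketch: run Hive-QL for the stock case study through PySpark's
# Hive support. Table name, columns and the CSV path are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-stock-demo")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    CREATE TABLE IF NOT EXISTS stocks (
        symbol      STRING,
        trade_date  STRING,
        close_price DOUBLE
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
""")

spark.sql("LOAD DATA LOCAL INPATH 'stocks.csv' INTO TABLE stocks")

# Average closing price per symbol -- the kind of join/aggregate work that is
# tedious in hand-written MapReduce but a single statement in Hive-QL.
spark.sql("""
    SELECT symbol, AVG(close_price) AS avg_close
    FROM stocks
    GROUP BY symbol
""").show()

spark.stop()
```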
Discriminant Analysis

During a study, questions often strike the researcher that must be answered. These include questions like 'are the groups different?', 'on which variables are the groups most different?' and 'can one predict which group a person belongs to using such variables?'. In answering such questions, discriminant analysis is quite helpful.
Discriminant analysis is a technique that is used by the researcher to analyze the research data when
the criterion or the dependent variable is categorical and the predictor or the independent variable is
interval in nature. The term categorical variable means that the dependent variable is divided into a
number of categories. For example, three brands of computers, Computer A, Computer B and
Computer C can be the categorical dependent variable.
The objective of discriminant analysis is to develop discriminant functions that are nothing but the
linear combination of independent variables that will discriminate between the categories of the
dependent variable in a perfect manner. It enables the researcher to examine whether significant
differences exist among the groups, in terms of the predictor variables. It also evaluates the accuracy
of the classification.
Discriminant analysis is described by the number of categories possessed by the dependent variable. When the dependent variable has two categories, the type used is two-group discriminant analysis. If the dependent variable has three or more categories, the type used is multiple discriminant analysis. The major distinction between the types is that for two groups it is possible to derive only one discriminant function, whereas in multiple discriminant analysis more than one discriminant function can be computed.
There are many examples that can explain when discriminant analysis fits. It can be used to know
whether heavy, medium and light users of soft drinks are different in terms of their consumption of
frozen foods. In the field of psychology, it can be used to differentiate between the price sensitive
and non price sensitive buyers of groceries in terms of their psychological attributes or characteristics.
In the field of business, it can be used to understand the characteristics or the attributes of a customer
possessing store loyalty and a customer who does not have store loyalty.
Cluster Analysis

Cluster analysis is a class of techniques that are used to classify objects or cases into relative groups
called clusters. Cluster analysis is also called classification analysis or numerical taxonomy. In
cluster analysis, there is no prior information about the group or cluster membership for any of the
objects.

Cluster analysis has been used in marketing for various purposes. It can be used to segment consumers on the basis of the benefits they seek from purchasing a product, and to identify homogeneous groups of buyers.

Cluster analysis involves formulating a problem, selecting a distance measure, selecting a clustering
procedure, deciding the number of clusters, interpreting the profile clusters and finally, assessing the
validity of clustering.

The variables on which the cluster analysis is to be done should be selected by keeping past research
in mind. It should also be selected by theory, the hypotheses being tested, and the judgment of the
researcher. An appropriate measure of distance or similarity should be selected; the most commonly
used measure is the Euclidean distance or its square.

Clustering procedures in cluster analysis may be hierarchical, non-hierarchical, or two-step. A hierarchical procedure is characterized by the development of a tree-like structure and can be agglomerative or divisive. Agglomerative methods consist of linkage methods, variance methods and centroid methods; linkage methods comprise single linkage, complete linkage and average linkage.

The non-hierarchical methods in cluster analysis are frequently referred to as k-means clustering. The two-step procedure can automatically determine the optimal number of clusters by comparing the values of model-choice criteria across different clustering solutions. The choice of clustering procedure and the choice of distance measure are interrelated. The relative sizes of clusters should be meaningful, and the clusters should be interpreted in terms of their centroids.
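Here is a minimal scikit-learn sketch of the non-hierarchical (k-means) procedure on synthetic data; the number of clusters and the toy features are chosen purely for illustration.

```python
# Minimal k-means clustering sketch on synthetic 2-D data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three synthetic groups of "customers" described by two toy features.
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),
    rng.normal(loc=(4, 4), scale=0.5, size=(100, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(100, 2)),
])
X_std = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_std)
print("cluster sizes:", np.bincount(kmeans.labels_))
print("cluster centroids:\n", kmeans.cluster_centers_)
```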
Applications of Cluster Analysis

 Clustering analysis is broadly used in many applications such as market research, pattern
recognition, data analysis, and image processing.
 Clustering can also help marketers discover distinct groups in their customer base. And they
can characterize their customer groups based on the purchasing patterns.
 In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes
with similar functionalities and gain insight into structures inherent to populations.
 Clustering also helps in identification of areas of similar land use in an earth observation
database. It also helps in the identification of groups of houses in a city according to house
type, value, and geographic location.
 Clustering also helps in classifying documents on the web for information discovery.
 Clustering is also used in outlier detection applications such as detection of credit card fraud.
 As a data mining function, cluster analysis serves as a tool to gain insight into the distribution
of data to observe characteristics of each cluster.
