
UNIT - 3

BIG DATA ANALYTICS

V Phanishwara Hara Gopal


Assistant Professor
Department of Information Technology
VNR Vignana Jyothi Institute of Engg Tech
BDA - Introduction
• Big data analytics is the process of examining large data sets
containing a variety of data types (i.e., big data) to uncover
hidden patterns, unknown correlations, market trends, customer
preferences and other useful business information.

• The analytical findings can lead to
• more effective marketing,
• new revenue opportunities,
• better customer service,
• improved operational efficiency,
• competitive advantages over rival organizations and other business
benefits.
BDA - Goal
• The primary goal of big data analytics is
• to help companies make more informed business decisions
• by enabling data scientists, predictive modelers and other analytics
professionals to analyze large volumes of transaction data,
• as well as other forms of data that may be untapped by
conventional business intelligence (BI) programs.
• Data can be collected from
• web server logs and Internet clickstream data,
• social media content and social network activity reports,
• text from customer emails and survey responses,
• mobile-phone call detail records
• and machine data captured by sensors connected to the Internet of
Things.
BDA - Tools
• Big data can be analyzed with the software tools
commonly used as part of advanced analytics disciplines
such as
• predictive analytics,
• data mining,
• text analytics
• and statistical analysis.
• Mainstream BI software and data visualization tools can
also play a role in the analysis process.
• But semi-structured and unstructured data may not fit
well in traditional data warehouses based on relational
databases.
BDA - Tools
• Many organizations looking to
• collect, process and analyze big data have turned to a newer class
of technologies
• that includes Hadoop and related tools such as YARN,
MapReduce, Spark, Hive and Pig as well as NoSQL databases.
• Those technologies form the core of an open source
software framework that supports the processing of large
and diverse data sets across clustered systems.
Relational DB vs Big Data
• Weaknesses of Relational DB:
• Scaling of Processing
• Scaling of Data
• Redundancy
• Velocity
• Variety and Complexity
• As transactional databases, they must respect the
ACID constraints, i.e.
• the Atomicity of updates,
• the Consistency of the database,
• the Isolation of concurrent transactions and
• the Durability of committed updates.
• These constraints are perfectly applicable in a centralized
architecture, but much more complex to ensure in a
distributed architecture.
BDA - Distributed Architecture
• Three major criteria (the CAP theorem) can be considered as a
triptych in the implementation of a distributed architecture:
• Consistency: all the nodes of the system have to see exactly
the same data at the same time
• Availability: the system must stay up and running even if one
of its nodes fails
• Partition Tolerance: the system keeps operating even when the
network splits into autonomous sub-networks
Categories of NoSQL
• Major players in the online-services market had to implement
specific solutions in terms of
• data storage,
• proprietary at first,
• and later handed over to open-source communities, which
consolidated these heterogeneous solutions into four major
categories of NoSQL databases:
• Key-Value store
• Column-oriented Database
• Document store
• Graph Database
Outliers
• An outlier is a point or an observation that deviates
significantly from the other observations.
• Outliers may arise from experimental errors or “special
circumstances”.
• Outlier detection tests check for outliers.
• Outlier treatment –
• Retention
• Exclusion
• Other treatment methods
• We have the “outliers” package in R to detect and treat
outliers in data.
• Normally we use box plots and scatter plots to find outliers
from a graphical representation, as in the sketch below.
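• A minimal base-R sketch of this (not the package itself, and
with a hypothetical toy vector) uses boxplot.stats(), which
applies the box-plot fence rule programmatically:
x <- c(2, 3, 4, 5, 4, 3, 2, 40)  # toy data; 40 is a deliberate outlier
boxplot.stats(x)$out
# returns 40, the value the box-plot rule flags
boxplot(x)             # box plot: the outlier appears as a separate point
plot(seq_along(x), x)  # scatter plot: the outlier stands apart visually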
Outliers
• Outliers in data can distort predictions and affect
accuracy if you don’t detect and handle them
appropriately, especially in regression models.
• An outlier is an observation that lies an abnormal distance
from other values in a random sample from a population.
• In a sense, this definition leaves it up to the analyst (or a
consensus process) to decide what will be considered
abnormal.
• Before abnormal observations can be singled out, it is
necessary to characterize normal observations.

Sources: r-bloggers.com (Selva Prabhakaran); www.itl.nist.gov
Outliers
• Ways to Describe Data:
• Two activities are essential for characterizing a set of data:
• Examination of the overall shape of the graphed data for important
features, including symmetry and departures from assumptions.
• Examination of the data for unusual observations that are far removed
from the mass of data. These points are often referred to as outliers.
Two graphical techniques for identifying outliers are scatter plots and
box plots; Grubbs' Test is an analytic procedure for detecting outliers
when the distribution is normal.
• Box plot construction:
• The box plot is a useful graphical display for describing the behaviour
of the data in the middle as well as at the ends of the distributions.
• The box plot uses the median and the lower and upper quartiles
(defined as the 25th and 75th percentiles).
• If the lower quartile is Q1 and the upper quartile is Q3, then the
difference (Q3 - Q1) is called the interquartile range or IQ.
Outliers
• Boxplot with fences:
• A box plot is constructed by drawing a box between the upper and
lower quartiles with a solid line drawn across the box to locate the
median.
• The following quantities (called fences) are needed for identifying
extreme values in the tails of the distribution:
• lower inner fence: Q1 - 1.5*IQ
• upper inner fence: Q3 + 1.5*IQ
• lower outer fence: Q1 - 3*IQ
• upper outer fence: Q3 + 3*IQ
• Outlier detection criteria
• A point beyond an inner fence on either side is considered a mild
outlier.
• A point beyond an outer fence is considered an extreme outlier.
Outliers
• Example
• The data set of N = 90 ordered observations as shown below is
examined for outliers:
• 30, 171, 184, 201, 212, 250, 265, 270, 272, 289, 305, 306, 322, 322,
336, 346, 351, 370, 390, 404, 409, 411, 436, 437, 439, 441, 444, 448,
451, 453, 470, 480, 482, 487, 494, 495, 499, 503, 514, 521, 522, 527,
548, 550, 559, 560, 570, 572, 574, 578, 585, 592, 592, 607, 616, 618,
621, 629, 637, 638, 640, 656, 668, 707, 709, 719, 737, 739, 752, 758,
766, 792, 792, 794, 802, 818, 830, 832, 843, 858, 860, 869, 918, 925,
953, 991, 1000, 1005, 1068, 1441
Outliers
• Example
• The computations are as follows:
• Median = (N+1)/2 largest data point = the average of the 45th and 46th
ordered points = (559 + 560)/2 = 559.5
• Lower quartile = .25(N+1)th ordered point = 22.75th ordered point = 411 +
.75(436-411) = 429.75
• Upper quartile = .75(N+1)th ordered point = 68.25th ordered point = 739
+.25(752-739) = 742.25
• Interquartile range = 742.25 - 429.75 = 312.5
• Lower inner fence = 429.75 - 1.5 (312.5) = -39.0
• Upper inner fence = 742.25 + 1.5 (312.5) = 1211.0
• Lower outer fence = 429.75 - 3.0 (312.5) = -507.75
• Upper outer fence = 742.25 + 3.0 (312.5) = 1679.75

From an examination of the fence points and the data, one point (1441)
exceeds the upper inner fence and stands out as a mild outlier; there are no
extreme outliers.
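• The fence arithmetic above can be reproduced in R. This is a
sketch; quantile() with type = 6 is chosen because it matches the
(N+1) interpolation used in the example:
x <- c(30, 171, 184, 201, 212, 250, 265, 270, 272, 289, 305, 306,
       322, 322, 336, 346, 351, 370, 390, 404, 409, 411, 436, 437,
       439, 441, 444, 448, 451, 453, 470, 480, 482, 487, 494, 495,
       499, 503, 514, 521, 522, 527, 548, 550, 559, 560, 570, 572,
       574, 578, 585, 592, 592, 607, 616, 618, 621, 629, 637, 638,
       640, 656, 668, 707, 709, 719, 737, 739, 752, 758, 766, 792,
       792, 794, 802, 818, 830, 832, 843, 858, 860, 869, 918, 925,
       953, 991, 1000, 1005, 1068, 1441)
q <- quantile(x, c(.25, .75), type = 6)   # 429.75 and 742.25
iq <- unname(q[2] - q[1])                 # 312.5
x[x < q[1] - 1.5*iq | x > q[2] + 1.5*iq]  # mild or worse: 1441
x[x < q[1] - 3*iq | x > q[2] + 3*iq]      # extreme: none (numeric(0))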
Outliers
• Example
• Histogram with boxplot
• A histogram with an overlaid box plot is shown below:
[Figure: histogram of the 90 observations with an overlaid box plot]
• The outlier is identified as the largest value in the data set, 1441, and
appears as the circle to the right of the box plot.
Outliers
• Outliers should be investigated carefully.
• Often they contain valuable information about the process
under investigation or the data gathering and recording
process.
• Before considering the possible elimination of these
points from the data, one should try to understand why
they appeared and whether it is likely similar values will
continue to appear. Outliers are often bad data points.
Outliers and Missing Data treatment
• Imputing missing data using standard methods and
algorithmic approaches (the mice package in R):
• In R, missing values are represented by the symbol NA (not available).
• Impossible values (e.g., dividing by zero) are represented by the
symbol NaN (not a number).
• Unlike SAS, R uses the same symbol for character and numeric data.
• To test if there is any missing in the dataset we use is.na ()
function.
• For Example,
• We define “y” and then check whether any value is missing. T
or TRUE means that the value is missing.
y <- c(1,2,3,NA)
is.na(y)
# returns a vector (F F F T)
Outliers and Missing Data treatment
• Arithmetic functions on missing values yield missing
values.
• For Example,
x <- c(1,2,NA,3)
mean(x)
# returns NA
• To remove missing values from our dataset we use
na.omit() function.
• For Example,
We can create a new dataset without missing data as below:
newdata <- na.omit(mydata)
Or, we can pass “na.rm=TRUE” as an argument to the function. Applying
it to the example above gives the desired result.
x <- c(1,2,NA,3)
mean(x, na.rm=TRUE)
# returns 2
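• Another standard method is simple mean imputation: replacing
each NA with the mean of the observed values. A minimal sketch:
x <- c(1,2,NA,3)
x[is.na(x)] <- mean(x, na.rm=TRUE)  # fill the NA with the mean (2)
x
# returns 1 2 2 3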
Outliers and Missing Data treatment
• MICE Package -> Multiple Imputation by Chained
Equations
• MICE uses PMM to impute missing values in a dataset.
• PMM-> Predictive Mean Matching (PMM)
• PMM is a semi-parametric imputation approach.
• It is similar to the regression method, except that for each missing
value it fills in a value drawn randomly from among the observed
donor values whose regression-predicted values are closest to the
regression-predicted value for the missing entry under the simulated
regression model.
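• A minimal usage sketch of the package is shown below; “mydata”
is a hypothetical data frame containing NAs, and m = 5 requests
five imputed datasets:
library(mice)
imp <- mice(mydata, m=5, method="pmm", seed=123)  # run PMM imputation
completed <- complete(imp, 1)  # extract the first completed dataset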
Descriptive Statistics
• Called the “simplest class of analytics”, descriptive
analytics allows you to condense big data into smaller,
more useful bits of information: a summary of what
happened.
• It has been estimated that more than 80% of business
analytics (e.g. social analytics) are descriptive.
• Some social data could include the number of posts, fans,
followers, page views, check-ins, pins, etc.
• It would appear to be an endless list if we tried to list them
all.
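• As a minimal illustration, base-R functions such as summary()
and table() condense a variable into a five-number summary and
category counts; the built-in mtcars dataset stands in here for
social or business data:
summary(mtcars$mpg)  # min, quartiles, mean and max of one variable
table(mtcars$cyl)    # counts per category, e.g. posts per channel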
Data Preprocessing for the analysis
• Data pre-processing is an important step in the data
mining process.
• The phrase "garbage in, garbage out" is particularly
applicable to data mining and machine learning projects.
• Data-gathering methods are often loosely controlled,
resulting in out-of-range values (e.g., Income: −100),
impossible data combinations (e.g., Sex: Male, Pregnant:
Yes), missing values, etc.
• Analyzing data that has not been carefully screened for
such problems can produce misleading results.
• Thus, the representation and quality of the data come first
and foremost, before running an analysis.
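• A sketch of screening for the problems named above; the data
frame “df” and its columns are hypothetical:
df <- data.frame(income   = c(52000, -100, 61000),
                 sex      = c("Male", "Female", "Male"),
                 pregnant = c("Yes", "No", "No"))
which(df$income < 0)                           # out-of-range values
which(df$sex == "Male" & df$pregnant == "Yes") # impossible combinations
colSums(is.na(df))                             # missing values per column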
Data Preprocessing for the analysis
• If there is much irrelevant and redundant information
present or noisy and unreliable data, then knowledge
discovery during the training phase is more difficult.
• Data preparation and filtering steps can take a considerable
amount of processing time.
• Data pre-processing includes cleaning, normalization,
transformation, feature extraction and selection, etc.
• The product of data pre-processing is the final training
set.
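• As a sketch of two of these steps on a hypothetical numeric
data frame, scale() performs normalization (zero mean, unit
variance) and a small helper applies a min-max transformation:
num <- data.frame(a = c(1, 2, 3), b = c(10, 40, 100))
scaled <- scale(num)  # normalization: zero mean, unit standard deviation
minmax <- function(v) (v - min(v)) / (max(v) - min(v))
rescaled <- as.data.frame(lapply(num, minmax))  # values mapped to [0, 1]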
