
Module-1

Introduction to Big Data Analytics


Need of Big Data

 Advances in technology have led to the production and storage of voluminous amounts of data.
 The large volume and variety of this data, in many forms and formats, pose challenges to
conventional systems for storage, processing and analysis.
 The increasing complexity calls for a means of quickly processing, analyzing and using the data.
Big Data
 Definition: Big Data is a high-volume, high-velocity and high-variety information asset that requires new
forms of processing for enhanced decision making, insight discovery and process optimization.

 “A collection of data sets so large or complex that traditional data processing applications are inadequate” - Wikipedia

 “Data of very large size, typically to the extent that its manipulation and management present significant
logistical challenges” - Oxford

 “Big data refers to data sets whose size is beyond the ability of typical database software tools to capture,
store, manage and analyze” - McKinsey Global Institute

 Data: information, usually in the form of facts or statistics, that can be analyzed or used.

 Web Data: data present on web servers in the form of text, images, videos, audio and multimedia files for
web users.

 Examples: Wikipedia, Google Maps, McGraw-Hill Connect, etc.


Classification of Data

Structured data
Possible to
• Insert, delete, update, append
• Indexing
• Scalability
• Transaction processing
• Encryption and decryption

Semi-structured data
• Example: XML, JSON document
• Contains tags/markers to separate elements and records
• Data is not associated with any formal data model

Unstructured data
• Does not conform to or associate with any data model
• File types: .TXT, .CSV
• Data may have internal structure, but does not reveal any relationship
Multi-structured data: consists of multiple formats of data, e.g., structured, semi-structured and/or unstructured.
Example: streaming data on customer interactions, data from multiple sensors, data at web/enterprise servers, etc.
Big Data Characteristics
Big Data Types
 Social networks and web data: Facebook, Twitter, e-mails, blogs, etc.

 Transaction data and Business Processes (BPs) data: credit card transactions, flight bookings, etc.

 Customer master data: data for facial recognition and for the name, date of birth, location
and income category, etc.

 Machine-generated data: machine-to-machine data, IoT data, web logs, computer system logs, etc.

 Human-generated data: biometrics data, human-machine interaction data, e-mail records
with a mail server, MySQL database of student grades
TIME SERIES DATA
Outliers
 An outlier is a value or an entire observation (row) that lies well outside of the
norm.
 Some outliers are simply errors; this is often the case with data entry errors. Suppose a
data set includes a Height variable, a person’s height measured in inches, and you see a
value of 720. This is certainly an outlier, and it is certainly an error. Once you spot it,
you can go back and check this observation to see what the person’s height should be. Maybe
an extra 0 was accidentally appended and the true value is 72. In any case, this type of
outlier is usually easy to discover and fix.
 An outlier can also be an unusual combination of values. For example, it would be strange
to find a person with Age equal to 10 and Height equal to 72. Neither of these values is
unusual by itself, but the combination is certainly unusual.
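One way to screen a single column for values such as the 720 in the Height example is an interquartile-range (IQR) rule. The sketch below is only an illustration: the 1.5 × IQR fence is a common convention rather than something these notes prescribe, and the height values and the function name `iqr_outliers` are made up.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)   # first, second and third quartile
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical Height column in inches, with one data entry error
heights = [64, 66, 67, 68, 69, 70, 71, 72, 74, 720]
suspects = iqr_outliers(heights)
print(suspects)
```

A rule like this only flags values that are unusual on their own; an unusual combination such as Age 10 with Height 72 needs a check that looks at both columns together.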
Missing Values
 For an Excel data set, you might expect missing data to be obvious from blank cells. This
is certainly one possibility, but there are others. Missing data are coded in a variety of
strange ways. One common method is to code missing values with an unusual number such as
−9999 or 9999. Another method is to code missing values with a symbol such as − or *.
 If you know the code (and it is often supplied in a footnote), then it is usually a good
idea, at least in Excel, to perform a global search and replace, replacing all of the
missing value codes with blanks.
 The more important issue is what to do about missing values. One option is to ignore them.
 If you use Excel’s AVERAGE function on a column of data with missing values, it reacts
the way you would hope and expect: it adds all the nonmissing values and divides by
the number of nonmissing values.
 Similarly for rows and columns.
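The same two steps, replacing missing-value codes with blanks and then averaging only the nonmissing values, can be sketched outside Excel. This is a minimal illustration in Python; the set of codes, the sample column and the function names are assumptions chosen to mirror the examples above, and in practice the codes should come from the data set's own documentation.

```python
from statistics import mean

# Assumed missing-value codes; check the data set's footnote for the real ones
MISSING_CODES = {"", "-9999", "9999", "-", "*"}

def clean(column):
    """Convert raw cell strings to floats, mapping missing-value codes to None."""
    return [None if cell.strip() in MISSING_CODES else float(cell) for cell in column]

def mean_ignoring_missing(values):
    """Average the nonmissing values only, dividing by their count."""
    present = [v for v in values if v is not None]
    return mean(present)

raw = ["72", "-9999", "65", "", "70", "*"]   # hypothetical column
cleaned = clean(raw)
avg = mean_ignoring_missing(cleaned)
print(cleaned, avg)
```

Note that a code such as 9999 is only safe to treat as missing when it cannot be a legitimate value of the variable.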
Finding Relationships among Variables
 The primary interest is usually in relationships between variables. For example, it is
natural to ask what drives baseball salaries. Does it depend on qualitative factors, such as
the player’s team or position? Does it depend on quantitative factors, such as the number of
hits the player gets or the number of strikeouts?
RELATIONSHIPS AMONG CATEGORICAL VARIABLES
 Consider a data set with at least two categorical variables, Smoking and Drinking. Each
person is categorized into one of three smoking categories: nonsmoker (NS), occasional
smoker (OS), and heavy smoker (HS). Similarly, each person is categorized into one of three
drinking categories: nondrinker (ND), occasional drinker (OD), and heavy drinker (HD).
 Do the data indicate that smoking and drinking habits are related? For example, do
nondrinkers tend to be nonsmokers? Do heavy smokers tend to be heavy drinkers?
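A common way to explore such a question is to count how many people fall into each combination of categories. The sketch below builds that table of counts with the standard library; the eight people listed are invented purely for illustration.

```python
from collections import Counter

# Hypothetical (smoking, drinking) pairs, one per person
people = [("NS", "ND"), ("NS", "ND"), ("NS", "OD"), ("OS", "OD"),
          ("OS", "OD"), ("HS", "HD"), ("HS", "HD"), ("HS", "OD")]

counts = Counter(people)
smoking_levels = ["NS", "OS", "HS"]
drinking_levels = ["ND", "OD", "HD"]

# Rows are smoking categories, columns are drinking categories
table = [[counts[(s, d)] for d in drinking_levels] for s in smoking_levels]

print("    " + "  ".join(drinking_levels))
for s, row in zip(smoking_levels, table):
    print(s, "  ", "   ".join(str(c) for c in row))
```

If the large counts cluster along the diagonal (NS with ND, HS with HD), that suggests the two habits are related; with real data the counts are usually also expressed as row or column percentages.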
RELATIONSHIPS AMONG CATEGORICAL VARIABLES AND A NUMERICAL VARIABLE
 This situation arises whenever you want to compare a numerical measure across two or more
subpopulations. Here are some examples:
■ The subpopulations are males and females, and the numerical measure is salary.
■ The subpopulations are different regions of the country, and the numerical measure is
the cost of living.
■ The subpopulations are different days of the week, and the numerical measure is the
number of customers going to a particular fast-food chain.
■ The subpopulations are different machines in a manufacturing plant, and the
numerical measure is the number of defective parts produced per day.
■ The subpopulations are patients who have taken a new drug and those who have taken a
placebo, and the numerical measure is the recovery rate from a particular disease.
■ The subpopulations are undergraduates with various majors (business, English, history,
and so on), and the numerical measure is the starting salary after graduating.
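Comparing a numerical measure across subpopulations usually starts with a summary statistic per group. The following sketch computes the mean starting salary by major; the records are hypothetical and the mean is only one of several summaries that could be compared.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (major, starting salary) records
records = [("business", 52000), ("English", 41000), ("business", 58000),
           ("history", 43000), ("English", 39000)]

# Collect the numerical measure separately for each subpopulation
groups = defaultdict(list)
for major, salary in records:
    groups[major].append(salary)

mean_by_group = {major: mean(values) for major, values in groups.items()}
print(mean_by_group)
```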
RELATIONSHIPS AMONG NUMERICAL VARIABLES
 This section discusses methods for finding relationships among numerical variables. For example,
we might want to examine the relationship between heights and weights of people, or between
salary and years of experience of employees.
 To study such relationships, we introduce two new summary measures, correlation and covariance,
and a new type of chart called a scatterplot.
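The notes name covariance and correlation without giving formulas, so the sketch below uses the usual sample definitions as an assumption: covariance averages the products of deviations from the means (dividing by n − 1), and correlation rescales covariance by the two standard deviations so that it lies between −1 and +1. The data values are made up.

```python
from math import sqrt
from statistics import mean

def covariance(xs, ys):
    """Sample covariance of two equal-length lists (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

years = [1, 2, 3, 4, 5]          # hypothetical years of experience
salary = [40, 44, 47, 53, 56]    # hypothetical salary in $1000s

cov = covariance(years, salary)
corr = correlation(years, salary)
print(cov, round(corr, 3))
```

Covariance depends on the units of measurement, whereas correlation is unit-free, which is why correlation is usually the easier of the two to interpret.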
Probability and Probability Distributions
 A key aspect of solving real business problems is dealing appropriately with uncertainty. This
involves recognizing explicitly that uncertainty exists and using quantitative methods to model
uncertainty. If you want to develop realistic business models, you cannot simply act as if
uncertainty doesn’t exist.
 For example, if you don’t know next month’s demand, you shouldn’t build a model that assumes
next month’s demand is a sure 1500 units. This is only wishful thinking. You should instead
incorporate demand uncertainty explicitly into your model. To do this, you need to know how to
deal quantitatively with uncertainty. This involves probability and probability distributions.
 There are many sources of uncertainty. Demands for products are uncertain, times between
arrivals to a supermarket are uncertain, stock price returns are uncertain, changes in
interest rates are uncertain, and so on.
 In many situations, the uncertain quantity (demand, time between arrivals, stock price
return, change in interest rate) is a numerical quantity. In the language of probability, it
is called a random variable. More formally, a random variable associates a numerical value
with each possible random outcome.
PROBABILITY ESSENTIALS
 When a weather forecaster states that the chance of rain is 70%, he or she is making a
probability statement. When a sports commentator states that the odds against the Miami Heat
winning the NBA Championship are 3 to 1, he or she is also making a probability statement.

 A probability is a number between 0 and 1 that measures the likelihood that some event
will occur. An event with probability 0 cannot occur, whereas an event with probability 1 is
certain to occur. An event with probability greater than 0 and less than 1 involves
uncertainty. The closer its probability is to 1, the more likely it is to occur.
Rule of Complements
 The simplest probability rule involves the complement of an event. If A is any event, then the
complement of A, denoted by $\overline{A}$ (or in some books by $A^c$), is the event that A does not occur.
 For example, if A is the event that the Dow finishes the year at or above the
14,000 mark, then the complement of A is that the Dow will finish the year below 14,000.
 If the probability of A is P(A), then the probability of its complement, $P(\overline{A})$, is given
by the equation
 $P(\overline{A}) = 1 - P(A)$
 Equivalently, the probability of an event and the probability of its complement sum to 1.
For example, if you believe that the probability of the Dow finishing at or above 14,000 is
0.25, then the probability that it will finish the year below 14,000 is 1 − 0.25 = 0.75.
Addition Rule
 Events are mutually exclusive if at most one of them can occur. That is, if one of them occurs, then
none of the others can occur.
 Events are exhaustive if they exhaust all possibilities, so that one of the events must occur.
 Let A1 through An be any n events. Then the addition rule of probability involves the probability that
at least one of these events will occur. When the events are mutually exclusive, this probability
is the sum of the individual probabilities.
 In addition, if the events A1 through An are exhaustive, then the probability is one because one of the
events is certain to occur.
 Addition Rule for Mutually Exclusive Events
 P(at least one of A1 through An) = P(A1) + P(A2) + … + P(An) (4.2)
 For example, consider the following three events involving a company’s annual revenue for
the coming year:
(1) revenue is less than $1 million,
(2) revenue is at least $1 million but less than $2 million, and
(3) revenue is at least $2 million.
 Call these events A1, A2 and A3. They are mutually exclusive and exhaustive. Therefore, their
probabilities must sum to 1. Suppose these probabilities are P(A1) = 0.5,
P(A2) = 0.3, and P(A3) = 0.2.
 The addition rule gives the probability of any combination of these events. For example, the
event that revenue is at least $1 million is the event that either A2 or A3 occurs. From the
addition rule, its probability is
P(revenue is at least $1 million) = P(A2) + P(A3) = 0.5
Similarly,
P(revenue is less than $2 million) = P(A1) + P(A2) = 0.8
P(revenue is less than $1 million or at least $2 million) = P(A1) + P(A3) = 0.7
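The revenue calculations can be checked with a few lines of code. The sketch below stores the three probabilities from the example and applies the addition rule for mutually exclusive events; the dictionary keys and the helper name `p_at_least_one` are illustrative choices only.

```python
# Probabilities of the three mutually exclusive, exhaustive revenue events
p = {"A1": 0.5, "A2": 0.3, "A3": 0.2}

def p_at_least_one(*events):
    """Addition rule for mutually exclusive events: sum the probabilities."""
    return sum(p[e] for e in events)

at_least_1m = p_at_least_one("A2", "A3")
less_than_2m = p_at_least_one("A1", "A2")
total = p_at_least_one("A1", "A2", "A3")   # exhaustive, so this should be 1

print(round(at_least_1m, 2), round(less_than_2m, 2), round(total, 2))
```

The first result also agrees with the rule of complements: "revenue is at least $1 million" is the complement of A1, so its probability is 1 − 0.5.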
Conditional Probability and the Multiplication Rule
 Let A and B be any events with probabilities P(A) and P(B). Typically, the probability
P(A) is assessed without knowledge of whether B occurs. However, if you are told that
B has occurred, then the probability of A might change. The new probability of A is
called the conditional probability of A given B, and it is denoted by P(A∣B).

 Conditional Probability
 $P(A \mid B) = \dfrac{P(A \text{ and } B)}{P(B)}$ (4.3)
 The numerator in this formula is the probability that both A and B occur. This probability must be
known to find P(A∣B). However, in some applications P(A∣B) and P(B) are known. Then you can
multiply both sides of Equation (4.3) by P(B) to obtain the following multiplication rule for P(A
and B).
 Multiplication Rule: P(A and B) = P(A∣B) P(B)
 Let A be the event that Bender meets its end-of-July deadline, and let B be the event that Bender
receives the materials from its supplier by the middle of July. The probabilities Bender is best
able to assess on July 1 are probably P(B) and P(A∣B).
 At the beginning of July, Bender might estimate that the chances of getting the materials on time
from its supplier are 2 out of 3, so that P(B) = 2/3. Also, thinking ahead, Bender estimates that
if it receives the required materials on time, the chances of meeting the end-of-July deadline are
3 out of 4. This is a conditional probability statement, namely, that P(A∣B) = 3/4.
 Then the multiplication rule implies that
 P(A and B) = P(A∣B) P(B) = (3/4)(2/3) = 0.5
 That is, there is a fifty-fifty chance that Bender will get its materials on time and meet its end-of-July
deadline.
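The Bender calculation can be reproduced exactly with fractions. The sketch below applies the multiplication rule and then runs the conditional probability formula in reverse as a consistency check; the variable names are illustrative.

```python
from fractions import Fraction

p_b = Fraction(2, 3)           # P(B): materials arrive by the middle of July
p_a_given_b = Fraction(3, 4)   # P(A|B): deadline met, given on-time materials

# Multiplication rule: P(A and B) = P(A|B) * P(B)
p_a_and_b = p_a_given_b * p_b

# Conditional probability formula recovers P(A|B) from P(A and B) and P(B)
recovered = p_a_and_b / p_b

print(p_a_and_b, recovered)
```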
Equally Likely Events
 Much of what you know about probability is probably based on situations where outcomes are
equally likely. These include flipping coins, throwing dice, drawing balls from urns, and other random
mechanisms.
 For example, suppose an urn contains 20 red marbles and 10 blue marbles. You plan to randomly
select five marbles from the urn, and you are interested, say, in the probability of selecting at least
three red marbles.
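When every sample is equally likely, a probability is the number of favorable samples divided by the total number of samples. The sketch below works the urn example by counting combinations, under the assumption (not stated in the notes) that the five marbles are drawn without replacement; the function name is illustrative.

```python
from math import comb

def p_exact_red(k, red=20, blue=10, draws=5):
    """P(exactly k red marbles) when drawing without replacement."""
    favorable = comb(red, k) * comb(blue, draws - k)   # ways to pick k red and the rest blue
    return favorable / comb(red + blue, draws)         # all equally likely samples

# At least three red means exactly 3, 4 or 5 red; these cases are mutually exclusive
p_at_least_three = sum(p_exact_red(k) for k in range(3, 6))
print(round(p_at_least_three, 4))
```

The final sum is itself an application of the addition rule, since the three cases cannot occur together.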
PROBABILITY DISTRIBUTION OF A SINGLE RANDOM VARIABLE

 In this section we examine the probability distribution of a single random variable.

 There are really two types of random variables: discrete and continuous. A discrete random
variable has only a finite (or countably infinite) number of possible values, whereas a
continuous random variable has a continuum of possible values.
 For example, the number of children in a family is clearly discrete, whereas the amount of rain this
year in San Francisco is clearly continuous.
 The essential properties of a discrete random variable and its associated probability distribution are
quite simple.

 Let X be a random variable. To specify the probability distribution of X, we need to
specify its possible values and their probabilities. We assume that there are k possible values, denoted
v1, v2, . . . , vk. The probability of a typical value vi is denoted in one of two ways, either P(X = vi)
or p(vi).
 The first is a reminder that this is a probability involving the random variable X, whereas the second
is a shorthand notation. Probability distributions must satisfy two criteria: (1) the probabilities must
be nonnegative, and (2) they must sum to 1. In symbols, we must have
 $\sum_{i=1}^{k} p(v_i) = 1, \quad p(v_i) \ge 0$
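The two criteria are easy to check mechanically. The sketch below represents a discrete distribution as a mapping from each value vi to p(vi) and tests both conditions; the two example distributions and the function name are hypothetical.

```python
def is_valid_distribution(probs, tol=1e-9):
    """Check the two criteria: nonnegative probabilities that sum to 1."""
    nonnegative = all(p >= 0 for p in probs.values())
    sums_to_one = abs(sum(probs.values()) - 1) < tol
    return nonnegative and sums_to_one

# Hypothetical distribution of X = number of children in a family
children = {0: 0.2, 1: 0.3, 2: 0.3, 3: 0.15, 4: 0.05}

# Sums to 1 but contains a negative entry, so it is not a distribution
bad = {0: 0.6, 1: 0.6, 2: -0.2}

print(is_valid_distribution(children), is_valid_distribution(bad))
```

The small tolerance allows for floating-point rounding when the probabilities are added.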
Normal, Binomial, Poisson and Exponential distribution

 The normal distribution, also known as the Gaussian distribution, is a probability distribution that is
symmetric about the mean, showing that data near the mean are more frequent in occurrence than
data far from the mean.
 In graphical form, the normal distribution appears as a "bell curve".
 What Is Binomial Distribution?
 The binomial distribution is a statistical distribution that summarizes the probability that a value will take
one of two independent values under a given set of parameters or assumptions.
 The binomial distribution is a common discrete distribution used in statistics, as opposed to a continuous
distribution, such as the normal distribution. This is because the binomial distribution only counts two states,
typically represented as 1 (for a success) or 0 (for a failure), given a number of trials in the data.

 A Poisson distribution is a discrete probability distribution. It gives the probability of an event
happening a certain number of times (k) within a given interval of time or space.
 In general, Poisson distributions are often appropriate for count data. Count data is composed of
observations that are non-negative integers.
 What is Exponential Distribution?
 In probability theory and statistics, the exponential distribution is a continuous
probability distribution that often concerns the amount of time until some specific event happens.
 For example, the amount of time (beginning now) until an earthquake occurs has an exponential
distribution. Other examples include the length, in minutes, of long distance business telephone calls,
and the amount of time, in months, a car battery lasts.
 As another example, the amount of money customers spend in one trip to the supermarket follows an
exponential distribution. There are more people who spend small amounts of money and fewer
people who spend large amounts of money.
NORMAL DISTRIBUTION

 Any particular normal distribution is specified by its mean and standard deviation. By changing the
mean, the normal curve shifts to the right or left. By changing the standard deviation, the curve
becomes more or less spread out.
 Continuous Distributions and Density Functions
 For continuous distributions such as the normal distribution, instead of a list of
possible values, there is a continuum of possible values, such as all values between 0 and 100 or all
values greater than 0. Instead of assigning probabilities to each individual value in the continuum, the
total probability of 1 is spread over this continuum.
 The key to this spreading is called a density function, which acts like a histogram. The higher the value
of the density function, the more likely this region of the continuum is.

 A density function, usually denoted by f(x), specifies the probability distribution of
a continuous random variable X. The higher f(x) is, the more likely x is. Also, the
total area between the graph of f(x) and the horizontal axis, which represents the total
probability, is equal to 1. Finally, f(x) is nonnegative for all possible values of X.
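These properties can be seen numerically for a normal distribution using Python's `statistics.NormalDist`. In the sketch below the mean of 100 and standard deviation of 15 are arbitrary assumptions; `pdf` evaluates the density f(x) and `cdf` gives the area under f(x) to the left of a point.

```python
from statistics import NormalDist

X = NormalDist(mu=100, sigma=15)   # hypothetical mean and standard deviation

density_at_mean = X.pdf(100)       # f(x) is highest at the mean
density_in_tail = X.pdf(145)       # much lower three standard deviations away

# Probability is area under f(x): here, the area between 85 and 115
within_one_sigma = X.cdf(115) - X.cdf(85)

print(round(density_at_mean, 4), round(density_in_tail, 6), round(within_one_sigma, 4))
```

Note that a density value is not itself a probability; only areas under the curve are probabilities, which is why the last line subtracts two cumulative areas.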

APPLICATIONS OF THE NORMAL DISTRIBUTION


(Refer Text book)
BINOMIAL DISTRIBUTION

 The binomial distribution is a discrete distribution that can occur in two situations: (1) when
sampling from a population with only two types of members (males and females, for example), and
(2) when performing a sequence of identical experiments, each of which has only two possible outcomes.
 Consider a situation where there are n independent, identical trials, where the
probability of a success on each trial is p and the probability of a failure is 1 − p.
Define X to be the random number of successes in the n trials. Then X has a binomial
distribution with parameters n and p.
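The notes do not write out the binomial probabilities, so the sketch below uses the standard formula P(X = k) = C(n, k) p^k (1 − p)^(n − k) as a stated assumption. The parameter values n = 10 and p = 0.3 are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k): k successes in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3   # hypothetical parameters
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                                      # should be 1
expected = sum(k * pk for k, pk in enumerate(pmf))    # mean number of successes

print(round(total, 6), round(expected, 6))
```

The computed mean equals n × p, which matches the intuition that 10 trials with a 30% success rate yield 3 successes on average.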

Application (Refer Text Book)


Poisson Distribution

 The Poisson distribution is a discrete distribution. It usually applies to the number of
events occurring within a specified period of time or space. Its possible values are all of the nonnegative
integers: 0, 1, 2, and so on; there is no upper limit. Even though there is an infinite number of possible
values, this causes no real problems.
Typical Examples of the Poisson Distribution
 A bank manager is studying the arrival pattern to the bank. The events are customer arrivals, the
number of arrivals in an hour is Poisson distributed, and λ represents the expected number of arrivals
per hour.
 An engineer is interested in the lifetime of a type of battery. A device that uses this
type of battery is operated continuously. When the first battery fails, it is replaced by
a second; when the second fails, it is replaced by a third, and so on. The events are
battery failures, the number of failures that occur in a month is Poisson distributed,
and λ represents the expected number of failures per month.

 A retailer is interested in the number of customers who order a particular product in
a week. Then the events are customer orders for the product, the number of customer
orders in a week is Poisson distributed, and λ is the expected number of orders per week.
 In a quality control setting, the Poisson distribution is often relevant for describing
the number of defects in some unit of space. For example, when paint is applied to
the body of a new car, any minor blemish is considered a defect. Then the number of
defects on the hood, say, might be Poisson distributed. In this case, λ is the expected
number of defects per hood.
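In each of these examples the single parameter λ determines every probability. The sketch below uses the standard Poisson formula P(X = k) = λ^k e^(−λ) / k!, which the notes do not state explicitly, with a hypothetical rate of 4 arrivals per hour.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson count whose expected value is lam."""
    return lam**k * exp(-lam) / factorial(k)

lam = 4   # hypothetical: 4 expected arrivals per hour

p_zero = poisson_pmf(0, lam)                                  # no arrivals in the hour
p_at_most_10 = sum(poisson_pmf(k, lam) for k in range(11))    # 0 through 10 arrivals

print(round(p_zero, 4), round(p_at_most_10, 4))
```

Although the possible values have no upper limit, almost all of the probability sits on the first few counts, which is why the infinite range causes no practical difficulty.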
Exponential Distribution

 Suppose that a bank manager is studying the pattern of customer arrivals at her branch location. As
indicated previously in this section, the number of arrivals in an hour at a facility such as a bank is
often well described by a Poisson distribution with parameter λ, where λ represents the expected
number of arrivals per hour. An alternative way to view the uncertainty in the arrival process is to
consider the times between customer arrivals. The most common probability distribution used to
model these times, often called interarrival times, is the exponential distribution.
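The link between the two views can be made concrete. If arrivals follow a Poisson distribution with rate λ per hour, the interarrival time is exponential with mean 1/λ hours, and the probability of waiting longer than t hours is e^(−λt). The sketch below uses these standard facts with a hypothetical rate of 12 arrivals per hour; the function name is illustrative.

```python
from math import exp

lam = 12   # hypothetical: 12 expected arrivals per hour

def p_wait_longer_than(t, lam):
    """P(interarrival time > t hours) for an exponential distribution with rate lam."""
    return exp(-lam * t)

mean_gap_minutes = 60 / lam                             # expected time between arrivals
p_gap_over_10_min = p_wait_longer_than(10 / 60, lam)    # gap longer than 10 minutes

print(mean_gap_minutes, round(p_gap_over_10_min, 4))
```

A gap longer than t hours is the same event as zero Poisson arrivals in an interval of length t, so both views give the same probability.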
