Business Analytics Notes PDF
Big data: Big data is commonly defined as the amount of data that will not practically fit into a
standard (relational) database for analysis and processing, a consequence of the huge volumes of information being created by
human- and machine-generated processes.
Structured, unstructured, semi-structured data: All data has structure of some sort. Delineating between structured
and unstructured data comes down to whether the data has a pre-defined data model and whether it’s organized in a
pre-defined way.
Time-stamped data: Time-stamped data is a dataset with a notion of time ordering that defines the sequence in which
each data point was either captured (event time) or collected (processing time).
Machine data: Simply put, machine data is the digital exhaust created by the systems, technologies and infrastructure
powering modern businesses.
Open data: Open data is data that is freely available for anyone to use (including applying analytics to it)
and to republish, without restrictions from copyright, patents or other mechanisms of control. The Open Data
Institute states that open data is only useful if it’s shared in ways that people can actually understand. It needs to be
shared in a standardized format and easily traced back to where it came from.
Real time data: One of the most explosive trends in analytics is the ability to stream and act on real-time data.
Some people argue that the term itself is something of a misnomer: data can only travel as fast as the speed of
communications, which isn’t faster than time itself, so, logically, even real-time data is slightly behind the actual
passage of time in the real world. However, we can still use the term to refer to instantaneous computing that happens
about as fast as a human can perceive.
Data Modeling
Business analysts solve tricky, icky, sticky project challenges using data modeling techniques. There are 4 data modeling
techniques you should get to know as a business analyst, so they can become part of your BA toolbox.
Entity Relationship Diagram – A handy tool that helps visualize relationships between key business concepts to
encourage business-focused database designs.
Data Dictionary – A spreadsheet format that enables you to communicate to business stakeholders clearly and in an
organized way, eliminating long lists of fields inside use cases or other requirements documents (a small sketch of this
format and the data mapping format follows this list).
Data Mapping – An essential template for a data migration or data integration project that will ensure any data-related
issues are discovered and resolved ahead of the last-minute data crunching that often derails big projects.
Glossary – Along with encouraging more effective communication among stakeholders, clarifying your requirements
documents, and helping you learn about a new business domain, a glossary will make the rest of the data modeling
techniques easier, as you’ll be working from a clear and unambiguous collection of terms.
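As a concrete illustration of the data dictionary and data mapping formats, here is a minimal sketch expressed as plain Python structures; the field names and transformation rules are hypothetical, not drawn from any particular project.

```python
# Hypothetical data-dictionary entries: one record per field, giving its name,
# type, whether it is required, and a business definition.
data_dictionary = [
    {"field": "customer_id", "type": "integer", "required": True,
     "definition": "Unique identifier assigned to each customer."},
    {"field": "signup_date", "type": "date (YYYY-MM-DD)", "required": True,
     "definition": "Date the customer account was created."},
]

# Hypothetical data mapping for a migration: source field -> (target field, transformation rule).
data_mapping = {
    "CUST_NO":  ("customer_id", "cast text to integer"),
    "REG_DATE": ("signup_date", "reformat DD/MM/YYYY to YYYY-MM-DD"),
}

for source, (target, rule) in data_mapping.items():
    print(f"{source} -> {target}: {rule}")
```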
Data Cleansing
Data cleansing is the process of altering data in a given storage resource to make sure that it is accurate and correct.
There are many ways to pursue data cleansing in various software and data storage architectures; most of them center
on the careful review of data sets and the protocols associated with any particular data storage technology.
Data cleansing is also known as data cleaning or data scrubbing.
Data cleansing is sometimes compared to data purging, where old or useless data will be deleted from a data set.
Although data cleansing can involve deleting old, incomplete or duplicated data, data cleansing is different from data
purging in that data purging usually focuses on clearing space for new data, whereas data cleansing focuses on
maximizing the accuracy of data in a system. A data cleansing method may use parsing or other methods to get rid of
syntax errors, typographical errors or fragments of records. Careful analysis of a data set can show how merging
multiple sets led to duplication, in which case data cleansing may be used to fix the problem.
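As a minimal sketch of that last point, the snippet below uses pandas (an assumption, not something the notes mandate) to remove the duplicate records that merging two hypothetical customer extracts introduces; the column names are made up.

```python
import pandas as pd

# Two hypothetical extracts that overlap once they are combined.
crm = pd.DataFrame({"email": ["a@x.com", "b@x.com"], "city": ["Pune", "Delhi"]})
billing = pd.DataFrame({"email": ["b@x.com", "c@x.com"], "city": ["Delhi", "Mumbai"]})

merged = pd.concat([crm, billing], ignore_index=True)

# Data cleansing step: drop the duplicate rows introduced by the merge.
cleaned = merged.drop_duplicates(subset="email", keep="first")
print(cleaned)
```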
A six-step approach to data cleansing and data preparation process:
Discovering helps the user understand what’s in the data and how it can be used effectively for analysis
Structuring makes working with data of all shapes and sizes easy by formatting the data to be used in traditional
applications
Cleaning involves removing data that may distort your analysis or standardizing your data into a single format
Enriching allows the user to augment the data with internal or third-party data to enhance the data for better
analysis
Validating brings data quality and inconsistency issues to the surface so the appropriate transformations can be
applied
Publishing allows the user to deliver the prepared data and load it into downstream systems for analysis
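A hedged pandas sketch of these six steps on a hypothetical sales extract follows; the file names, column names and the lookup table are all assumptions made for illustration.

```python
import pandas as pd

# Discovering: load the raw file and inspect what is in it.
raw = pd.read_csv("sales_raw.csv")  # hypothetical input file
print(raw.head(), raw.dtypes)

# Structuring: parse dates and keep only the columns the analysis needs.
df = raw.assign(order_date=pd.to_datetime(raw["order_date"], errors="coerce"))
df = df[["order_date", "region", "amount"]]

# Cleaning: standardise text values and drop rows that would distort the analysis.
df["region"] = df["region"].str.strip().str.title()
df = df.dropna(subset=["order_date", "amount"])

# Enriching: augment with an internal lookup table (region -> sales target).
targets = pd.DataFrame({"region": ["North", "South"], "target": [100000, 80000]})
df = df.merge(targets, on="region", how="left")

# Validating: surface quality issues, e.g. negative amounts, before they propagate.
assert (df["amount"] >= 0).all(), "negative amounts found - investigate before publishing"

# Publishing: deliver the output for downstream analysis.
df.to_csv("sales_prepared.csv", index=False)
```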
One of the major challenges of the Big Data era is that it has made a great amount and variety of massive datasets
available for analysis by non-corporate data analysts, such as research scientists, data journalists, policy makers, SMEs and
individuals. A major characteristic of these datasets is that they are accessible in raw formats that are not loaded or indexed
in a database (e.g., plain text, JSON, RDF), and that they are dynamic, dirty and heterogeneous in nature. The difficulty of
turning a data-curious user into someone who can access and analyze that data is even greater now for the large number of
users with little or no support and expertise on the data processing side. The purpose of visual data exploration is to facilitate
information perception and manipulation, knowledge extraction and inference by non-expert users. The visualization techniques
used in a variety of modern systems provide users with intuitive means to interactively explore the content of the data, identify
interesting patterns, infer correlations and causalities, and support sense-making activities that are not always possible with
traditional data analysis techniques.
In the Big Data era, several challenges arise in the field of data visualization and analytics. First, modern exploration and
visualization systems should offer scalable data management techniques in order to efficiently handle billion-object datasets,
limiting system response times to a few milliseconds. Besides, today's systems must address the challenge of on-the-fly scalable
visualizations over large and dynamic sets of volatile raw data, offering efficient interactive exploration techniques, as well as
mechanisms for information abstraction, sampling and summarization to address problems related to visual information
overplotting. Further, they must support user comprehension by offering customization capabilities for different user-defined
exploration scenarios and preferences according to the analysis needs. Overall, the challenge is to enable users to gain value and
insights out of the data as rapidly as possible, minimizing the role of the IT expert in the loop.
BASICS
Simple Random Sampling
To begin the process of drawing samples from a larger population, an analyst must craft a sampling plan, which
indicates exactly how the sample will be selected. With a large population, different samples will yield different results,
and the idea is to create a consistent and unbiased approach. Simple random sampling is the most basic approach to
the problem. It draws a representative sample with the principle that every member of the population must have an
equal chance of being selected. The key to simple random sampling is assuring randomness when drawing the sample.
This requirement is achieved in a number of ways, most rigorously by first coding every member of the population with a
number, and then using a random number generator to choose a subset. Sometimes it is impractical or impossible to
label every single member of an entire population, in which case systematic sampling methods are used.
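A minimal sketch of the code-and-random-number-generator approach, assuming the population has already been coded with the numbers 1 to 10,000:

```python
import random

# Hypothetical population: every member coded with a number from 1 to 10,000.
population = list(range(1, 10_001))

# Simple random sample: every member has an equal chance of being selected.
random.seed(42)                            # fixed seed only for a reproducible illustration
sample = random.sample(population, k=200)  # draw 200 members without replacement
print(sample[:10])
```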
Stratified Random Sampling
In a stratified random approach, a population is first divided into subpopulations or strata, based upon one or more
classification criteria. Within each stratum, a simple random sample is taken from those members (the members of the
subpopulation). The number to be sampled from each stratum depends on its size relative to the population - that is, if
a classification system results in three subgroups or strata, and Group A has 50% of the population, and Group B and
Group C have 25% each, the sample we draw must conform to the same relative sizes (half of the sample from A, a
quarter each from B and C). The samples taken from each stratum are then pooled together to form the overall sample.
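A pandas sketch of this proportional approach, assuming a hypothetical population of 1,000 members split 50% / 25% / 25% across strata A, B and C, and an overall sample size of 100:

```python
import pandas as pd

# Hypothetical population of 1,000 members split 50% / 25% / 25% across strata.
population = pd.DataFrame({
    "member_id": range(1000),
    "stratum": ["A"] * 500 + ["B"] * 250 + ["C"] * 250,
})

sample_size = 100

# Within each stratum, draw a simple random sample proportional to that stratum's
# share of the population, then pool the per-stratum samples into the overall sample.
stratified_sample = population.groupby("stratum").sample(
    frac=sample_size / len(population), random_state=42
)
print(stratified_sample["stratum"].value_counts())  # 50 from A, 25 each from B and C
```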
Time-Series Data
Time-series data refers to one variable taken over discrete, equally spaced periods of time. The distinguishing feature
of a time series is that it draws back on history to show how one variable has changed. Common examples include
historical quarterly returns on a stock or mutual fund for the last five years, earnings per share on a stock each quarter
for the last ten years or fluctuations in the market-to-book ratio on a stock over a 20-year period. In every case, past
time periods are examined.
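A small sketch of what such a dataset looks like in pandas, with made-up quarterly returns:

```python
import pandas as pd

# Hypothetical quarterly returns on a stock, equally spaced in time.
quarters = pd.period_range("2019Q1", periods=8, freq="Q")
returns = pd.Series(
    [0.021, -0.013, 0.034, 0.008, 0.015, -0.022, 0.041, 0.012],
    index=quarters, name="quarterly_return",
)
print(returns)
```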
The Central Limit Theorem
The central limit theorem states that, for a population distribution with mean μ and finite variance σ2, the
sampling distribution of the sample mean will take on three important characteristics as the sample size becomes large:
1. The sample mean will be approximately normally distributed.
2. The mean of the sampling distribution (the expected value of the sample mean) will be equal to the population mean (μ).
3. The variance of the sampling distribution of the sample mean will be equal to the population variance (σ2) divided by the sample size (n).
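A quick simulation illustrating these three properties, assuming a deliberately skewed (exponential) population with μ = 2 and σ2 = 4:

```python
import numpy as np

rng = np.random.default_rng(0)

# Skewed population: exponential with mean 2, so mu = 2 and sigma^2 = 4.
mu, sigma2, n, trials = 2.0, 4.0, 100, 20_000

# Draw many samples of size n and record each sample mean.
sample_means = rng.exponential(scale=mu, size=(trials, n)).mean(axis=1)

print(sample_means.mean())  # close to mu = 2               (property 2)
print(sample_means.var())   # close to sigma^2 / n = 0.04   (property 3)
# Property 1: a histogram of sample_means is approximately bell-shaped (normal).
```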
Multiple Regression
For example, the yield of rice per acre depends upon the quality of seed, fertility of the soil, fertilizer used, temperature and rainfall.
If one is interested in studying the joint effect of all these variables on rice yield, one can use this technique.
An additional advantage of this technique is that it also enables us to study the individual influence of each of these variables on
yield.
Dependent and Independent Variables
By multiple regression, we mean models with just one dependent variable and two or more independent (explanatory) variables.
The variable whose value is to be predicted is known as the dependent variable, and the ones whose known values are
used for prediction are known as independent (explanatory) variables.
The Multiple Regression Model
In general, the multiple regression equation of Y on X1, X2, …, Xk is given by:
Y = b0 + b1X1 + b2X2 + … + bkXk
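A hedged sketch of fitting such a model by ordinary least squares with NumPy, using two hypothetical explanatory variables from the rice-yield example (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical data: rice yield (Y) against fertilizer used (X1) and rainfall (X2).
X1 = np.array([20, 25, 30, 35, 40, 45], dtype=float)       # kg of fertilizer per acre
X2 = np.array([80, 95, 110, 100, 120, 130], dtype=float)   # mm of rainfall
Y = np.array([2.1, 2.6, 3.0, 3.1, 3.6, 3.9])               # tonnes of rice per acre

# Design matrix with an intercept column, then solve for b0, b1, b2 by least squares.
X = np.column_stack([np.ones_like(X1), X1, X2])
b, *_ = np.linalg.lstsq(X, Y, rcond=None)
print("b0, b1, b2 =", b)

# Predicted yield for a new observation with X1 = 38 and X2 = 115.
print("prediction:", b @ np.array([1.0, 38.0, 115.0]))
```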
In statistics, multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems,
i.e. with more than two possible discrete outcomes.[1] That is, it is a model that is used to predict the probabilities of the different
possible outcomes of a categorically distributed dependent variable, given a set of independent variables (which may be real-valued,
binary-valued, categorical-valued, etc.).
There are multiple equivalent ways to describe the mathematical model underlying multinomial logistic regression. This can make
it difficult to compare different treatments of the subject in different texts. The article on logistic regression presents a number of
equivalent formulations of simple logistic regression, and many of these have analogues in the multinomial logit model.
The idea behind all of them, as in many other statistical classification techniques, is to construct a linear predictor function that
constructs a score from a set of weights that are linearly combined with the explanatory variables (features) of a given observation
using a dot product:
score(Xi, k) = βk · Xi,
where Xi is the vector of explanatory variables describing observation i, βk is a vector of weights (or regression coefficients)
corresponding to outcome k, and score(Xi, k) is the score associated with assigning observation i to category k. In discrete
choice theory, where observations represent people and outcomes represent choices, the score is considered
the utility associated with person i choosing outcome k. The predicted outcome is the one with the highest score.
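A small NumPy sketch of this linear predictor function: each score is the dot product of a per-outcome weight vector with the feature vector, and the predicted outcome is the one with the highest score. The weights and features below are made up, and the softmax step that turns scores into probabilities is included only for completeness.

```python
import numpy as np

# Hypothetical weight vectors beta_k, one row per outcome k (3 classes, 4 features).
beta = np.array([[ 0.2, -1.1,  0.5,  0.0],
                 [ 1.3,  0.4, -0.2,  0.7],
                 [-0.5,  0.9,  0.1, -0.3]])

x_i = np.array([1.0, 0.5, -2.0, 3.0])    # explanatory variables for observation i

scores = beta @ x_i                       # score(Xi, k) = beta_k . Xi for each outcome k
predicted_class = int(np.argmax(scores))  # the predicted outcome is the highest-scoring one

# Softmax turns the scores into predicted class probabilities.
probs = np.exp(scores - scores.max())
probs /= probs.sum()
print(predicted_class, probs)
```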