
Different Kinds of Data

Big data: Big data is commonly defined as roughly the amount of data that will not practically fit into a standard (relational) database for analysis and processing, a consequence of the huge volumes of information being created by human and machine-generated processes.
Structured, unstructured, semi-structured data: All data has structure of some sort. Delineating between structured
and unstructured data comes down to whether the data has a pre-defined data model and whether it’s organized in a
pre-defined way.
Time-stamped data: Time-stamped data is a dataset with a concept of time ordering that defines the sequence in which each data point was either captured (event time) or collected (processing time).
Machine data: Simply put, machine data is the digital exhaust created by the systems, technologies and infrastructure
powering modern businesses.
Open data: Open data is data that is freely available to anyone in terms of its use (the chance to apply analytics to it)
and rights to republish without restrictions from copyright, patents or other mechanisms of control. The Open Data
Institute states that open data is only useful if it’s shared in ways that people can actually understand. It needs to be
shared in a standardized format and easily traced back to where it came from.
Real-time data: One of the most explosive trends in analytics is the ability to stream and act on real-time data. Some people argue that the term itself is something of a misnomer: data can only travel as fast as the speed of communications, which is not faster than time itself, so even real-time data lags slightly behind the actual passage of time in the real world. However, we can still use the term to refer to near-instantaneous computing that happens about as fast as a human can perceive.

Data Modeling
Business analysts solve tricky, icky, sticky project challenges using data modeling techniques. There are 4 data modeling
techniques you should get to know as a business analyst, so they can become part of your BA toolbox.
Entity Relationship Diagram – A handy tool that helps visualize relationships between key business concepts to
encourage business-focused database designs.
Data Dictionary – A spreadsheet format that enables you to communicate with business stakeholders clearly and in an organized way, eliminating long lists of fields inside use cases or other requirements documents.
Data Mapping – An essential template for a data migration or data integration project that will ensure any data-related
issues are discovered and resolved ahead of the last-minute data crunching that often derails big projects.
Glossary – Along with encouraging more effective communication among stakeholders, clarifying your requirements
documents, and helping you learn about a new business domain, a glossary will make the rest of the data modeling
techniques easier, as you’ll be working from a clear and unambiguous collection of terms.
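To make the Data Dictionary idea concrete, here is a minimal sketch of how such a dictionary might be kept in tabular form with Python and pandas; the entity, field names, types and descriptions are hypothetical examples rather than part of any real project.

import pandas as pd

# A minimal, hypothetical data dictionary for a "Customer" entity.
# Field names, types and descriptions are illustrative placeholders only.
data_dictionary = pd.DataFrame([
    {"field": "customer_id", "type": "integer", "required": True,
     "description": "Unique identifier assigned at account creation"},
    {"field": "email", "type": "string", "required": True,
     "description": "Primary contact address; expected to be unique"},
    {"field": "signup_date", "type": "date", "required": False,
     "description": "Date the customer first registered"},
])

print(data_dictionary.to_string(index=False))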

Data Cleansing
Data cleansing is the process of altering data in a given storage resource to make sure that it is accurate and correct.
There are many ways to pursue data cleansing in various software and data storage architectures; most of them center
on the careful review of data sets and the protocols associated with any particular data storage technology.
Data cleansing is also known as data cleaning or data scrubbing.
Data cleansing is sometimes compared to data purging, where old or useless data will be deleted from a data set.
Although data cleansing can involve deleting old, incomplete or duplicated data, data cleansing is different from data
purging in that data purging usually focuses on clearing space for new data, whereas data cleansing focuses on
maximizing the accuracy of data in a system. A data cleansing method may use parsing or other methods to get rid of
syntax errors, typographical errors or fragments of records. Careful analysis of a data set can show how merging
multiple sets led to duplication, in which case data cleansing may be used to fix the problem.
A six-step approach to the data cleansing and preparation process (a pandas sketch of a few of these steps follows the list):
• Discovering helps the user understand what's in the data and how it can be used effectively for analysis.
• Structuring makes working with data of all shapes and sizes easy by formatting the data for use in traditional applications.
• Cleaning involves removing data that may distort your analysis or standardizing your data into a single format.
• Enriching allows the user to augment the data with internal or third-party data to enhance it for better analysis.
• Validating brings data quality and inconsistency issues to the surface so the appropriate transformations can be applied.
• Publishing allows the user to deliver the output and load it into downstream systems for analysis.
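The following is a minimal sketch of how the structuring, cleaning, validating and publishing steps might look in pandas; the records, column names and output file are invented purely for illustration.

import pandas as pd

# Hypothetical raw records, used only for illustration.
raw = pd.DataFrame({
    "Email ": ["a@x.com", "a@x.com", None, "b@y.com"],
    "Signup Date": ["2021-01-05", "2021-01-05", "2021-02-11", "not available"],
    "Country": [" united kingdom", " united kingdom", "INDIA", "india "],
})

# Structuring: enforce consistent column names and types.
raw.columns = [c.strip().lower().replace(" ", "_") for c in raw.columns]
raw["signup_date"] = pd.to_datetime(raw["signup_date"], errors="coerce")

# Cleaning: drop exact duplicates and standardize a text field into one format.
clean = raw.drop_duplicates().copy()
clean["country"] = clean["country"].str.strip().str.title()

# Validating: surface quality and inconsistency issues before publishing.
issues = clean[clean["email"].isna() | clean["signup_date"].isna()]
print(f"{len(issues)} records need review")

# Publishing: deliver the prepared output for downstream analysis.
clean.to_csv("customers_clean.csv", index=False)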

Data Visualization and Exploration Issues


Database exploration is a discovery process where relevant information or knowledge is identified and extracted from data. It is
related to the field of Knowledge Discovery in Databases (KDD), and emphasizes the process of knowledge discovery: the
development of hypotheses about the data, and the validation of those hypotheses. Discovery is not only possible from analytic
tools, but also from graphical, textual, numeric, and tabular presentations of data. Flexibility in data processing and output presentation is a fundamental requirement of any data exploration environment. To communicate information clearly and
efficiently, data visualization uses statistical graphics, plots, information graphics and other tools. Numerical data may be encoded
using dots, lines, or bars, to visually communicate a quantitative message. Effective visualization helps users analyze and reason
about data and evidence. It makes complex data more accessible, understandable and usable. Users may have particular analytical
tasks, such as making comparisons or understanding causality, and the design principle of the graphic (i.e., showing comparisons
or showing causality) follows the task. Tables are generally used where users will look up a specific measurement, while charts of
various types are used to show patterns or relationships in the data for one or more variables.
Data visualization is both an art and a science. It is viewed as a branch of descriptive statistics by some, but also as a grounded
theory development tool by others. Increased amounts of data created by Internet activity and an expanding number of sensors
in the environment are referred to as "big data" or Internet of things. Processing, analyzing and communicating this data present
ethical and analytical challenges for data visualization.[4] The field of data science and its practitioners, called data scientists, help address this challenge.
To build useful data exploration and analysis systems, we must bring the disciplines of database management, data analysis and
data visualization together and examine the importance of interaction in this context. This will pose new problems and conditions
resulting from the additional capabilities of the integrated system.
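As a hedged illustration of encoding numerical data with dots, lines, or bars, as mentioned above, the following matplotlib sketch plots the same small series three ways; the quarterly values are invented purely for demonstration.

import matplotlib.pyplot as plt

# Invented quarterly values, used only to demonstrate the three encodings.
quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [120, 135, 128, 150]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Dots emphasize individual measurements.
axes[0].scatter(quarters, revenue)
axes[0].set_title("Dots")

# Lines emphasize the trend over time.
axes[1].plot(quarters, revenue, marker="o")
axes[1].set_title("Lines")

# Bars emphasize comparison between categories.
axes[2].bar(quarters, revenue)
axes[2].set_title("Bars")

plt.tight_layout()
plt.show()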

One of the major challenges of the Big Data era is that it has made a great amount and variety of massive datasets available for analysis by non-corporate data analysts, such as research scientists, data journalists, policy makers, SMEs and individuals. A major characteristic of these datasets is that they are accessible in raw formats that are not loaded or indexed in a database (e.g., plain text, JSON, RDF), and are dynamic, dirty and heterogeneous in nature. Transforming a data-curious user into someone who can access and analyze that data is now even more difficult for the great number of users with little or no support or expertise on the data processing side. The purpose of visual data exploration is to facilitate information perception and manipulation, knowledge extraction and inference by non-expert users. The visualization techniques used in a variety of modern systems provide users with intuitive means to interactively explore the content of the data, identify interesting patterns, infer correlations and causalities, and support sense-making activities that are not always possible with traditional data analysis techniques.
In the Big Data era, several challenges arise in the field of data visualization and analytics. First, modern exploration and visualization systems should offer scalable data management techniques in order to efficiently handle billion-object datasets, limiting system response times to a few milliseconds. Moreover, today's systems must address the challenge of on-the-fly scalable visualizations over large and dynamic sets of volatile raw data, offering efficient interactive exploration techniques as well as mechanisms for information abstraction, sampling and summarization to address problems related to visual information overplotting. Further, they must support user comprehension by offering customization capabilities for different user-defined exploration scenarios and preferences according to the analysis needs. Overall, the challenge is to enable users to gain value and insights out of the data as rapidly as possible, minimizing the role of the IT expert in the loop.
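As one hedged illustration of the sampling and summarization idea mentioned above, the sketch below downsamples a large synthetic dataset before plotting to reduce overplotting; the data, sample size and seed are arbitrary assumptions.

import numpy as np
import matplotlib.pyplot as plt

# Synthetic "large" dataset; in practice this could be millions of raw records.
rng = np.random.default_rng(42)
x = rng.normal(size=1_000_000)
y = 0.5 * x + rng.normal(scale=0.8, size=x.size)

# Summarization by random sampling: plot a small, representative subset
# instead of every point, keeping the scatter plot readable and fast.
idx = rng.choice(x.size, size=5_000, replace=False)
plt.scatter(x[idx], y[idx], s=2, alpha=0.3)
plt.title("Random 5,000-point sample of 1,000,000 records")
plt.show()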

Sampling and Estimation


As analysts, we are accustomed to using sample information to assess how various markets from around the world are performing. Any statistics that we compute with sample information, however, are only estimates of the underlying population parameters. A sample, then, is a subset of the population: a subset studied to infer conclusions about the population itself.
A data sample, or subset of a larger population, is used to help understand the behavior and characteristics of the entire
population. In the investing world, for example, all of the familiar stock market averages are samples designed to
represent the broader stock market and indicate its performance.
It's important to understand the mechanics of sampling and estimating, particularly as they apply to financial variables,
and have the insight to critique the quality of research derived from sampling efforts.

BASICS
Simple Random Sampling
To begin the process of drawing samples from a larger population, an analyst must craft a sampling plan, which
indicates exactly how the sample will be selected. With a large population, different samples will yield different results,
and the idea is to create a consistent and unbiased approach. Simple random sampling is the most basic approach to
the problem. It draws a representative sample with the principle that every member of the population must have an
equal chance of being selected. The key to simple random sampling is assuring randomness when drawing the sample.
This requirement is achieved in a number of ways, most rigorously by first coding every member of the population with a
number, and then using a random number generator to choose a subset. Sometimes it is impractical or impossible to
label every single member of an entire population, in which case systematic sampling methods are used.
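A minimal sketch of simple random sampling, assuming the population has already been coded with numbers as described above; the population size, sample size and seed are arbitrary choices made for illustration.

import random

# Hypothetical population: every member coded with a number, 1..1000.
population = list(range(1, 1001))

# Use a (pseudo-)random number generator to draw an unbiased subset in which
# every member had an equal chance of being selected.
random.seed(7)                      # fixed seed only to make the sketch repeatable
sample = random.sample(population, k=50)

print(sorted(sample)[:10])          # first few sampled member codes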
Stratified Random Sampling
In a stratified random approach, a population is first divided into subpopulations or strata, based upon one or more
classification criteria. Within each stratum, a simple random sample is taken from those members (the members of the
subpopulation). The number to be sampled from each stratum depends on its size relative to the population - that is, if
a classification system results in three subgroups or strata, and Group A has 50% of the population, and Group B and
Group C have 25% each, the sample we draw must conform to the same relative sizes (half of the sample from A, a
quarter each from B and C). The samples taken from each stratum are then pooled together to form the overall sample.
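Below is a hedged sketch of proportional stratified sampling for the 50%/25%/25% example above, using pandas; the group labels, population size and 10% sampling fraction are illustrative assumptions.

import pandas as pd

# Hypothetical population with the 50% / 25% / 25% strata described above.
population = pd.DataFrame({
    "member_id": range(1, 401),
    "group": ["A"] * 200 + ["B"] * 100 + ["C"] * 100,
})

# Draw 10% from every stratum so the sample keeps the same relative sizes,
# then pool the per-stratum samples into the overall sample.
sample = population.groupby("group", group_keys=False).sample(frac=0.10, random_state=1)

print(sample["group"].value_counts())   # expect roughly A: 20, B: 10, C: 10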
Time-Series Data
Time series data refers to one variable observed over discrete, equally spaced periods of time. The distinguishing feature
of a time series is that it draws back on history to show how one variable has changed. Common examples include
historical quarterly returns on a stock or mutual fund for the last five years, earnings per share on a stock each quarter
for the last ten years or fluctuations in the market-to-book ratio on a stock over a 20-year period. In every case, past
time periods are examined.
The Central Limit Theorem
The central limit theorem states that, for a population distribution with mean μ and a finite variance σ², the sampling distribution of the sample mean will take on three important characteristics as the sample size becomes large:
1. The sample mean will be approximately normally distributed.
2. The mean of the sampling distribution (the expected value of the sample mean) will be equal to the population mean (μ).
3. The variance of the sampling distribution (the variance of the sample mean) will be equal to the population variance (σ²) divided by the size of the sample (n).
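A hedged simulation sketch of the theorem: repeatedly drawing samples from a deliberately non-normal population and checking that the sample means behave as the three points above describe. The exponential population, sample size and number of trials are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(0)

# A skewed (exponential) population with mean 2 and variance 4.
n, trials = 100, 10_000

# Draw many samples of size n and record each sample mean.
sample_means = rng.exponential(scale=2.0, size=(trials, n)).mean(axis=1)

print(sample_means.mean())   # close to the population mean, 2.0
print(sample_means.var())    # close to sigma^2 / n = 4 / 100 = 0.04
# A histogram of sample_means would look approximately normal.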

Multiple Regression Analysis


Multiple linear regression is the most common form of linear regression analysis. As a predictive analysis, the multiple
linear regression is used to explain the relationship between one continuous dependent variable and two or more
independent variables. The independent variables can be continuous or categorical.
Multiple regression analysis is a powerful technique used for predicting the unknown value of a variable from the known values of two or more variables, also called the predictors.
More precisely, multiple regression analysis helps us to predict the value of Y for given values of X1, X2, …, Xk.

For example, the yield of rice per acre depends upon the quality of seed, fertility of soil, fertilizer used, temperature and rainfall. If one is interested in studying the joint effect of all these variables on rice yield, one can use this technique.
An additional advantage of this technique is that it also enables us to study the individual influence of these variables on yield.
Dependent and Independent Variables
By multiple regression, we mean models with just one dependent variable and two or more independent (explanatory) variables. The variable whose value is to be predicted is known as the dependent variable, and the ones whose known values are used for prediction are known as the independent (explanatory) variables.
The Multiple Regression Model
In general, the multiple regression equation of Y on X1, X2, …, Xk is given by:

Y = b0 + b1X1 + b2X2 + … + bkXk
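A minimal, hedged sketch of fitting such a model with scikit-learn; the predictors loosely mirror the rice-yield example above, but all the numbers are invented purely for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: each row is one plot; columns are seed quality score,
# fertilizer used (kg) and rainfall (mm).
X = np.array([
    [7, 40, 110],
    [6, 35, 95],
    [8, 50, 130],
    [5, 30, 90],
    [9, 55, 140],
])
y = np.array([3.1, 2.7, 3.8, 2.4, 4.0])   # rice yield (tonnes per acre), invented

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)       # b0 and (b1, b2, b3)

# Predict Y for a new plot's values of X1, X2, X3.
print(model.predict([[7, 45, 120]]))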

Logistic and Multinomial Regression

In statistics, multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems, i.e. those with more than two possible discrete outcomes.[1] That is, it is a model that is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables (which may be real-valued, binary-valued, categorical-valued, and so on).
There are multiple equivalent ways to describe the mathematical model underlying multinomial logistic regression. This can make
it difficult to compare different treatments of the subject in different texts. The article on logistic regression presents a number of
equivalent formulations of simple logistic regression, and many of these have analogues in the multinomial logit model.
The idea behind all of them, as in many other statistical classification techniques, is to construct a linear predictor function that
constructs a score from a set of weights that are linearly combined with the explanatory variables (features) of a given observation
using a dot product:

score(Xi, k) = βk · Xi
where Xi is the vector of explanatory variables describing observation i, βk is a vector of weights (or regression coefficients)
corresponding to outcome k, and score(Xi, k) is the score associated with assigning observation i to category k. In discrete
choice theory, where observations represent people and outcomes represent choices, the score is considered
the utility associated with person i choosing outcome k. The predicted outcome is the one with the highest score.
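As a hedged illustration, the following sketch fits a multinomial logistic model on a small synthetic three-class problem with scikit-learn; the dataset, feature counts and parameters are assumptions made purely for demonstration.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic three-class problem with four explanatory variables (features).
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

# With the default lbfgs solver and more than two classes, scikit-learn fits a
# single multinomial (softmax) model: one weight vector beta_k per outcome k.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Linear scores beta_k . x_i for one observation; the predicted outcome is the
# class with the highest score (equivalently, the highest probability).
print(clf.decision_function(X[:1]))
print(clf.predict(X[:1]))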

Predictive modeling with decision tree


The decision tree is an important algorithm for predictive modelling and can be used to visually and explicitly represent decisions. It is a graphical representation that uses a branching methodology to exemplify all possible outcomes based on certain conditions. In a decision tree, an internal node represents a test on an attribute, a branch depicts the outcome of the test, and a leaf represents the decision made after computing the attributes.
Decision trees can be classified into two types. Classification trees are used to separate a dataset into different classes on a particular basis and are generally used when the response variable is categorical in nature. The other type, regression trees, are used when the response variable is continuous or numerical.
A decision tree helps in making decisions under particular circumstances and improves communication. It helps data scientists capture how different decisions can lead to different operational outcomes, and it supports taking an optimal decision. The algorithm is well suited to problems where instances are represented by attribute-value pairs and where the training data may contain errors. It is also applicable when the target function has discrete output values.
It has various advantages: it implicitly performs variable screening, requires relatively little effort from the user for data preparation, is not affected by non-linear relationships, and is easy to understand. The decision tree is useful in data exploration and does not make any assumptions about the linearity of the data, but as the tree grows larger its accuracy can decline. Another major drawback is that the outcome may be based on expectations, which can lead to bad decision making.
Decision trees have great use in finance for option pricing and are used by banks to classify loan applicants by their probability of default. They are also widely available in data science libraries in Python, R and SAS.
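A minimal, hedged sketch of a classification tree in scikit-learn, loosely in the spirit of the loan-classification use mentioned above; the applicant features, labels and tree depth are entirely made up for illustration.

from sklearn.tree import DecisionTreeClassifier, export_text

# Invented applicant data: [income (thousands), debt ratio, years employed].
X = [
    [30, 0.60, 1], [85, 0.20, 7], [45, 0.45, 3],
    [120, 0.10, 10], [25, 0.70, 0], [60, 0.30, 5],
]
y = [1, 0, 1, 0, 1, 0]   # 1 = likely default, 0 = likely repay (hypothetical labels)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each internal node tests an attribute; each leaf is a decision.
print(export_text(tree, feature_names=["income", "debt_ratio", "years_employed"]))
print(tree.predict([[50, 0.50, 2]]))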

Predictive modeling with Neural Networks


A complex algorithm used for predictive analysis, the neural network, is biologically inspired by the structure of the human brain.
A neural network provides a very simple model in comparison to the human brain, but it works well enough for our purposes.
Widely used for data classification, neural networks process past and current data to estimate future values — discovering any
complex correlations hidden in the data — in a way analogous to that employed by the human brain.
Neural networks can be used to make predictions on time series data such as weather data. A neural network can be designed to
detect patterns in input data and produce an output free of noise.
The structure of a neural-network algorithm has three layers:
• The input layer feeds past data values into the next (hidden) layer. In a typical diagram, the nodes of the network are drawn as circles.
• The hidden layer encapsulates several complex functions that create predictors; often those functions are hidden from the user. The set of nodes at the hidden layer represents mathematical functions that modify the input data; these functions are called neurons.
• The output layer collects the predictions made in the hidden layer and produces the final result: the model's prediction.
Here’s a closer look at how a neural network can produce a predicted output from input data. The hidden layer is the key
component of a neural network because of the neurons it contains; they work together to do the major calculations and produce
the output.
Each neuron takes a set of input values; each input is associated with a weight (more about that in a moment), and the neuron carries a numerical value known as the bias. The output of each neuron is a function of the weighted sum of its inputs plus the bias.
Most neural networks use mathematical functions to activate the neurons. A function in math is a relation between a set of inputs
and a set of outputs, with the rule that each input corresponds to an output.
For instance, consider the negative function where a whole number can be an input and the output is its negative equivalent. In
essence, a function in math works like a black box that takes an input and produces an output.
Neurons in a neural network can use sigmoid functions to match inputs to outputs. When used that way, a sigmoid function is
called a logistic function and its formula looks like this:
f(input) = 1 / (1 + e^(-input))
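A hedged sketch of a single sigmoid neuron computing its output from the weighted sum of its inputs plus the bias, as described above; the weights, bias and input values are arbitrary numbers chosen for illustration.

import math

def sigmoid(x: float) -> float:
    # Logistic function: squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights, bias):
    # Output of one neuron: sigmoid of the weighted sum of inputs plus bias.
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(weighted_sum + bias)

# Arbitrary example values for a neuron with three inputs.
print(neuron_output(inputs=[0.5, -1.2, 3.0], weights=[0.8, 0.1, -0.4], bias=0.2))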

Pattern Discovery (Cluster Analysis)


A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.
Clustering is the process of partitioning a set of abstract objects into classes of similar objects (a brief k-means sketch follows the points below).
Points to Remember
• A cluster of data objects can be treated as one group.
• While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign labels to the groups.
• The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.
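A minimal, hedged sketch of the partition-then-label idea using k-means from scikit-learn; the two-dimensional points, their centers and the choice of three clusters are illustrative assumptions, not part of the original material.

import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D objects drawn around three invented centers.
rng = np.random.default_rng(3)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# Partition the data into groups by similarity, then read off the group labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
print(kmeans.labels_[:10])        # cluster label assigned to the first few objects
print(kmeans.cluster_centers_)    # one center per discovered group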
Requirements of Clustering in Data Mining
The following points throw light on why clustering is required in data mining −
• Scalability − We need highly scalable clustering algorithms to deal with large databases.
• Ability to deal with different kinds of attributes − Algorithms should be capable of being applied to any kind of data, such as interval-based (numerical), categorical, and binary data.
• Discovery of clusters with arbitrary shape − The clustering algorithm should be capable of detecting clusters of arbitrary shape. It should not be bound to distance measures that tend to find only spherical clusters of small size.
• High dimensionality − The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional spaces.
• Ability to deal with noisy data − Databases contain noisy, missing or erroneous data. Some algorithms are sensitive to such data and may lead to poor quality clusters.
• Interpretability − The clustering results should be interpretable, comprehensible, and usable.

Clustering Methods (Types): Use Class Notes
