
Big Data Overview

Big Data
Data is created constantly and at an ever-increasing rate.
Big Data cannot be analyzed efficiently using
only traditional databases or methods. New tools
and technologies are needed to turn it into
business benefit.

Big Data Definition:

“Big Data is data whose scale, distribution,
diversity, and/or timeliness require the use of
new technical architectures and analytics to
enable insights that unlock new sources of
business value.”
-McKinsey Global Report, 2011

Big Data Characteristics:

• Huge Volume of Data: Can be billions
of rows and millions of columns.
• Complexity of Data Types and
Structure: Reflects the variety of new
data sources, formats, and structures,
including digital traces.
• Speed of New Data Creation and
Growth: Can describe high-velocity
data, with rapid ingestion and near-real-time
analysis.

Data Structures
• Big data can come in multiple forms,
including structured and non-structured
data such as financial data, text files,
multimedia files, and genetic mappings.
• There are four types of data
structures, with 80–90% of future data
growth coming from non-structured
data types:
• Structured data: Data containing a
defined data type, format, and structure
(for example, transaction data, online
analytical processing data cubes,
traditional RDBMS, CSV files, and even
simple spreadsheets).
• Semi-structured data: Textual data
files with a discernible pattern that
enables parsing (such as Extensible
Markup Language [XML] data files that
are self-describing and defined by an
XML schema).
• Quasi-structured data: Textual data
with erratic data formats that can be
formatted with effort, tools, and time
(for instance, web clickstream data that
may contain inconsistencies in data
values and formats).
• Unstructured data: Data that has no
inherent structure, which may include
text documents, PDFs, images, and video.

Current Analytical Architecture
• The typical data architectures just
described are designed for storing
and processing mission-critical data,
supporting enterprise applications,
and enabling corporate reporting
activities.
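
To make the difference between the data structure types more concrete, here is a minimal Python sketch that parses one tiny sample of structured (CSV), semi-structured (XML), and quasi-structured (clickstream-like) data. All sample data, column names, and line layouts are made up for illustration; real clickstream data would need many more patterns, which is the "effort, tools, and time" that quasi-structured data demands.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Structured: a CSV with a fixed set of typed columns (made-up sample rows).
CSV_TEXT = "order_id,amount\n1001,25.50\n1002,14.00\n"

# Semi-structured: self-describing XML (made-up sample document).
XML_TEXT = "<orders><order id='1001'><amount>25.50</amount></order></orders>"

# Quasi-structured: clickstream-like lines with inconsistent formats (made up).
CLICK_LINES = [
    "2024-01-05 10:01:22 GET /products?id=7",
    "05/01/2024 10:01 /cart",
    "garbled entry",
]


def parse_structured(text):
    """Read CSV rows into dicts and convert each column to its known type."""
    rows = csv.DictReader(io.StringIO(text))
    return [{"order_id": int(r["order_id"]), "amount": float(r["amount"])} for r in rows]


def parse_semi_structured(text):
    """Walk the XML tree; the tags and attributes describe the data themselves."""
    root = ET.fromstring(text)
    return [
        {"order_id": int(o.get("id")), "amount": float(o.findtext("amount"))}
        for o in root.iter("order")
    ]


# Two hypothetical line layouts; anything else is rejected.
_PATTERNS = [
    re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \w+ (?P<path>/\S*)$"),
    re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2} (?P<path>/\S*)$"),
]


def parse_quasi_structured(lines):
    """Return (paths, rejected): best-effort extraction of the URL path per line."""
    paths, rejected = [], []
    for line in lines:
        for pattern in _PATTERNS:
            match = pattern.match(line)
            if match:
                paths.append(match.group("path"))
                break
        else:
            # No known layout matched this line.
            rejected.append(line)
    return paths, rejected
```

Unstructured data (text documents, PDFs, images, video) is left out of the sketch on purpose: it has no inherent structure for a simple parser to rely on.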
Data Repositories
• Spreadsheets and data marts – Used for
recordkeeping. Analysts depend on data
extracts.
• Data Warehouse – Centralized data
containers in a purpose-built space.
Supports BI (Business Intelligence) and
reporting.
• Analytic Sandbox – Data assets
gathered from multiple sources and
technologies.
State of Practice in Analytics

Business Intelligence – BI tends to provide
reports, dashboards, and queries on business
questions for the current period or in the past. BI
systems make it easy to answer such questions.
Review of Descriptive and Inferential Statistics
Data Processing and Visualization with R
Statistics Refresher
Statistics – Descriptive (collection,
organization, and presentation of data). Inferential
(drawing conclusions about a larger group from data,
determining relationships, and making predictions).
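
The descriptive versus inferential distinction can be sketched in a few lines of Python. The sample values below are hypothetical, and the confidence interval uses a normal approximation as a simplifying assumption; with a sample this small, a t-based interval would be somewhat wider.

```python
import math
import statistics

# Hypothetical sample: daily order counts observed on 10 days.
sample = [12, 15, 11, 14, 13, 16, 12, 15, 14, 18]

# Descriptive: summarize the data that was actually collected.
mean = statistics.mean(sample)
median = statistics.median(sample)
stdev = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)

# Inferential: estimate a plausible range for the mean of the larger population.
# Assumption: normal approximation to the sampling distribution of the mean.
z = statistics.NormalDist().inv_cdf(0.975)  # two-sided 95% critical value
margin = z * stdev / math.sqrt(len(sample))
ci_low, ci_high = mean - margin, mean + margin

print(f"mean={mean}, median={median}, stdev={stdev:.2f}")
print(f"approx. 95% CI for the population mean: ({ci_low:.2f}, {ci_high:.2f})")
```

The first three values only describe the ten observed days; the interval is a statement about days that were not observed, which is what makes it inferential.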
Regression Analysis – Frequently used
to analyze the relationship between two or more
variables.
- At least two variables need to be
continuous.
Response Variable – Y must be a continuous
variable.
Predictor Variable – X1, X2,…,Xp can be
continuous, discrete or categorical variables.

Simple Linear Regression


- Describes the relationship between
TWO VARIABLES.
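
As an illustration, a simple linear regression with one predictor X and a continuous response Y can be fitted by ordinary least squares. The sketch below is one common way to do this in plain Python, not necessarily the method used in the course; the function name and the data are hypothetical, and the data points were chosen to lie exactly on the line y = 1 + 2x so the result is easy to check by hand.

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical data: advertising spend (x) and sales (y), lying on y = 1 + 2x.
spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_simple_linear_regression(spend, sales)

# Use the fitted line to predict the response for a new predictor value.
predicted = b0 + b1 * 5.0
```

With real data the points will not fall exactly on a line, and the fitted slope then describes the average change in Y for a one-unit change in X.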

Logistic Regression
