
UNIT 1

INTRODUCTION: DATA SCIENCE AND BIG DATA
Syllabus
Introduction to Data science and Big Data, Defining Data science and Big Data,
Big Data examples, Data Explosion: Data Volume, Data Variety, Data Velocity and
Veracity. Big data infrastructure and challenges, Big Data Processing
Architectures: Data Warehouse, Re-Engineering the Data Warehouse, shared
everything and shared nothing architecture, Big data learning approaches. Data
Science – The Big Picture: Relation between AI, Statistical Learning, Machine
Learning, Data Mining and Big Data Analytics.
Data Science
Data Science is the science which uses computer science, statistics, machine
learning, visualization and human-computer interaction to collect, clean,
integrate, analyze, visualize and interact with data to create data products.

Goal of Data Science


Turn data into data products
Data Science Tasks
• Data analysis
  • What percentage of users come back to our site?
  • Which products are usually bought together?
• Modeling/statistics
  • How many cars are we going to sell next year?
  • Which city is better for opening a new office?
• Engineering/prototyping
  • Building a product that uses a prediction model
  • Visualization of analytics
Data Science can answer 5 Questions

• Is this A or B? (Classification / Multi-class Classification)
• Is this weird? (Anomaly Detection algorithms)
• How much or how many? (Regression algorithms, which answer questions that ask for a number)
  Example: What will the temperature be on Tuesday?
• How is this organised? (Clustering algorithms, which reveal the structure of a data set)
• What should I do next? (Reinforcement Learning algorithms)
  Example: Decision making without human guidance
Three of these question types are illustrated in the sketch below.
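To make these question types concrete, here is a minimal Python sketch using scikit-learn on invented toy data (the arrays, thresholds and random seed are assumptions for illustration, not part of the course material). It covers classification, regression and clustering:

# Toy illustration of three of the five question types,
# using scikit-learn on made-up data.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                 # 100 samples, 2 features

# "Is this A or B?": classification learns a label from examples
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = LogisticRegression().fit(X, y_class)
print("class of [1, 1]:", clf.predict([[1.0, 1.0]]))

# "How much or how many?": regression predicts a number
y_reg = 3 * X[:, 0] + rng.normal(scale=0.1, size=100)
reg = LinearRegression().fit(X, y_reg)
print("predicted value:", reg.predict([[1.0, 1.0]]))

# "How is this organised?": clustering groups data without labels
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster of [1, 1]:", km.predict([[1.0, 1.0]]))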
Lifecycle of Data Science
Data Science Application

• Transaction Databases 🡪 Recommender systems (Netflix), Fraud Detection (Security and Privacy)
• Wireless Sensor Data 🡪 Smart Home, Real-time Monitoring, Internet of
Things
• Text Data, Social Media Data 🡪 Product Review and Consumer Satisfaction
(Facebook, Twitter, LinkedIn), E-discovery
• Software Log Data 🡪 Automatic Troubleshooting (Splunk)
• Genotype and Phenotype Data 🡪 Epic, 23andme, Patient-Centered Care,
Personalized Medicine
Introduction to Big Data
Big Data Definition
• “Big Data is the frontier of a firm's ability to store, process, and access
(SPA) all the data it needs to operate effectively, make decisions,
reduce risks, and serve customers.”
-- Forrester
• "Big Data in general is defined as high volume, velocity and variety
information assets that demand cost-effective, innovative forms of
information processing for enhanced insight and decision making."
--Gartner
• “Big data is data that exceeds the processing capacity of conventional
database systems. The data is too big, moves too fast, or doesn't fit
the structures of your database architectures. To gain value from this
data, you must choose an alternative way to process it.”
-- O’Reilly
• “Big data is the data characterized by 3 attributes: volume, variety
and velocity.”
--IBM
• “Big data is the data characterized by 4 key attributes: volume,
variety, velocity and value.”
--Oracle
Big data means a huge amount of data: a collection of large datasets
that cannot be processed using traditional computing techniques.
Sources of Big Data
1. Stock exchange
2. Social media data
3. Video sharing portals
4. Search engine data
5. Transport data
6. Banking data
Categories of data
1. Structured data
Data is stored in relations (tables) in a relational database.

2. Semi-structured data
This type of data does not have any standard format or data model.

3. Unstructured data
This data does not have any predefined data model.
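To illustrate the three categories, here is a small Python sketch (the records themselves are invented examples):

import json

# Structured: fixed schema, like one row of a relational table
structured_row = {"id": 1, "name": "Asha", "balance": 2500.00}

# Semi-structured: self-describing keys but no fixed schema;
# two records may carry different fields (typical of JSON or XML)
semi_structured = json.loads('{"user": "asha", "tags": ["ml", "bigdata"]}')

# Unstructured: no predefined data model at all (free text, images, audio)
unstructured = "Great product, fast delivery, would buy again!"

print(structured_row["name"], semi_structured["tags"], len(unstructured))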
Issues regarding data in traditional file systems
1. Volume
2. Velocity
3. Variety
4. Variability
5. Complexity
Examples of Big data Applications
1. Fraud Detection
2. IT log analytics
3. Call center analytics
4. Social Media Analysis
Data Explosion
The data explosion is the rapid growth of data.
One reason for this explosive growth is innovation.

Trends in data explosion


1. Business model transformation
2. Globalization
3. Personalization of services
Innovation leads to data explosion: innovation has transformed the
way we engage in business, provide services, and the associated
measurement of value and profitability. The three fundamental trends that
shaped the data world in the last few years are business model
transformation, globalization, and personalization of services.
Data volume
Data volume in big data can be defined as the amount of data that is
generated in a continuous manner.
There are different data types available with different sizes.
Different sources of data
1. Machine data
2. Application log
3. Clickstream logs
4. External or third-party data
5. Emails
6. Contracts
7. Geographic information system and Geo-spatial data
Data Velocity
Data velocity in big data can be defined as the flow of data from various
sources such as networks, human resources, social media, etc.
The data can be huge and flow in a continuous manner.
With the advent of big data, it has become very important to understand
the velocity of data.
Examples of data velocity-
1. Amazon, Facebook, Yahoo, and Google
2. Sensor data
3. Mobile Networks
4. Social Media
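A toy Python sketch of what velocity means in practice: events arrive continuously and must be processed on the fly rather than loaded in one batch (the sensor stream here is simulated, an assumption for illustration):

import itertools
import random

def sensor_stream():
    # Simulated endless source of sensor readings
    while True:
        yield {"sensor": random.choice("ABC"), "value": random.random()}

counts = {}
# Process only the first 1000 events of the conceptually endless stream
for event in itertools.islice(sensor_stream(), 1000):
    counts[event["sensor"]] = counts.get(event["sensor"], 0) + 1

print("events per sensor so far:", counts)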
Data Variety
• It refers to Structured, Semi-structured and Unstructured data due to
different sources of data generated either by humans or by machines.

Data Veracity
It refers to the assurance of quality/integrity/credibility/accuracy of the
data. Since the data is collected from multiple sources, we need to check the
data for accuracy before using it for business insights.
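A minimal veracity check in Python, assuming pandas and an invented toy table; a real pipeline would apply many more rules:

import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "age": [34, -5, 29, None],   # -5 and None are quality problems
})

print("missing values per column:\n", df.isna().sum())
print("duplicate ids:", df["customer_id"].duplicated().sum())
print("out-of-range ages:", int((df["age"] < 0).sum()))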

Data Value
• Just because we collect lots of data does not mean it is valuable; it is
of no value unless we garner some insights from it. Value refers to how
useful the data is in decision making.
5 Vs of Big Data
• Raw Data: Volume
• Change over time: Velocity
• Data types: Variety
• Data Quality: Veracity
• Information for Decision Making: Value
Big data Infrastructure and Challenges
• Storage
• Transportation
• Processing
  - CPU
  - Memory
  - Software
• Speed or Throughput
Big Data Processing Architectures
1) Lambda Architecture
The Lambda architecture is mainly designed to manage huge amounts of data
by combining batch and stream processing methods.
It balances latency, throughput, and fault tolerance.
Examples: Twitter, Spotify, LivePerson
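A toy Python sketch of the Lambda idea (the event names are invented): a batch view computed over historical data plus a speed view over recent events, merged at query time by the serving layer:

from collections import Counter

historical = ["login", "buy", "login", "login"]   # reprocessed in batches
recent = ["buy", "login"]                         # arriving right now

batch_view = Counter(historical)   # high-latency, complete, recomputable
speed_view = Counter(recent)       # low-latency, incremental

def query(event_type):
    # Serving layer: merge both views for an up-to-date answer
    return batch_view[event_type] + speed_view[event_type]

print("logins so far:", query("login"))   # 3 from batch + 1 from speed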
2) Kappa Architecture
This is similar to the Lambda architecture, but the batch layer is removed;
all processing happens on the stream.
Example:
3) Zeta Architecture
It describes a scalable approach for speeding up the integration of data into
the business.
7 pluggable components of Zeta Architecture
1. Distributed file system
2. Real-time data storage
3. Pluggable compute model/execution engine
4. Deployment/container management system
5. Solution architecture
6. Enterprise application
7. Dynamic & global resource management
Benefits of Zeta Architecture
1. It reduces time and cost.
2. It contains fewer moving parts.
3. It reduces data duplication.
4. It simplifies testing and troubleshooting.
5. It gives better resource utilization.
Difference between lambda and kappa architecture
Data Warehouse
• A Data Warehouse (DW) is a relational database that is designed for query
and analysis rather than transaction processing. It includes historical data
derived from transaction data from single and multiple sources.
• It is a single, complete and consistent store of data obtained from a variety
of different sources made available to end users.
• Data warehousing is a process of transforming raw data into systematic
information and making it available to users as per requirement in a
timely manner.
• A data warehouse is a heterogeneous collection of different data sources
organized under a unified schema. There are two approaches for constructing
a data warehouse: the top-down approach and the bottom-up approach,
explained below. A minimal ETL sketch follows.
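As a rough sketch of this transformation process, here is a minimal extract-transform-load (ETL) flow in Python with sqlite3; the table and field names are invented for illustration:

import sqlite3

raw_sales = [("2024-01-05", "  Widget ", 3, 9.99),
             ("2024-01-06", "Gadget", 1, 24.50)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (day TEXT, product TEXT, revenue REAL)")

for day, product, qty, price in raw_sales:                        # extract
    row = (day, product.strip(), qty * price)                     # transform
    conn.execute("INSERT INTO fact_sales VALUES (?, ?, ?)", row)  # load

query = "SELECT product, SUM(revenue) FROM fact_sales GROUP BY product"
for row in conn.execute(query):
    print(row)   # systematic information, ready for query and analysis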
Architecture of data Warehouse
There are two approaches
Top-down approach and Bottom-up approach
1) Top-down approach
2) Bottom-up approach
Characteristics of Data Warehouse
Goals of Data Warehousing
• To help reporting as well as analysis
• Maintain the organization's historical information
• Be the foundation for decision making.

Need for Data Warehouse


Benefits of Data Warehouse
1. A data warehouse delivers enhanced business intelligence.
2. More cost-effective decision making.
3. Better enterprise intelligence.
4. Enhanced customer service.
5. A data warehouse saves time.
6. A data warehouse enhances data quality and consistency.
7. A data warehouse provides historical intelligence.
8. Potentially high returns on investment.
Limitations of data warehousing
1. Underestimation of the resources required for data loading
2. Hidden problems with source systems
3. Required data not captured.
4. Increased end user demands
5. Data homogenization
6. High demand for resources
7. Data ownership
8. High maintenance
9. Long-duration projects
10. Complexity of integration
Re-Engineering the Data Warehouse
Components of data distribution in data warehouse
1. Transactional systems
2. Operational data store
3. Staging area
4. Data warehouse
5. Datamarts
6. Analytical databases
Shared everything and shared nothing architecture
1) Shared everything architecture
It is a kind of system architecture in which all resources are shared,
including storage, memory and processors.
There are two variations of this architecture
a. Distributed shared memory(DSM)
b. Symmetric multiprocessing(SMP)
2) Shared nothing architecture
It is a distributed computing architecture in which each node is independent,
i.e. none of the nodes share memory or disk storage.
Different nodes are interconnected by a network.
Every node is made of a processor, main memory and disk.
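A toy Python sketch of the shared-nothing idea (the node count and keys are invented): each "node" owns its own partition, a key is routed to a node by hashing, and no node ever reads another node's storage:

nodes = [dict() for _ in range(3)]        # three independent stores

def node_for(key):
    return hash(key) % len(nodes)         # routing, not sharing

def put(key, value):
    nodes[node_for(key)][key] = value

def get(key):
    return nodes[node_for(key)].get(key)

put("user:42", {"name": "Asha"})
print(get("user:42"), "| keys per node:", [len(n) for n in nodes])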
Advantages of shared nothing architecture
1. This architecture scales with the number of processors.
2. As nodes are added, transmission capacity increases.
3. It suits read-only databases and decision-support applications.
4. In this architecture, failure is local: a failed node does not affect the others.

Disadvantages
1. Cost is high.
2. Sending data between nodes involves software interaction.
3. This technique requires more coordination between nodes.
Difference between shared everything and shared nothing architecture
Big data learning approaches
A. Machine Learning
It is a process that gives computers the ability to learn without being explicitly
programmed.
Machine learning is a method of data analysis that automates analytical model
building.
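A minimal sketch of this idea using scikit-learn's bundled Iris dataset: the model's decision rules are fitted from data rather than hand-coded:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Nobody writes the if/else rules; they are learned from the training data
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))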
B. Machine Learning System Model
