IMPLEMENTATION
Big Data
• Big Data is a digital-era phenomenon: the unprecedented generation of diverse data from
internal and external sources, encompassing structured, semi-structured, and unstructured data.
• 5Vs
• Volume refers to the ever-growing large magnitude of data.
• Velocity refers to the continuous and high speed of data generation.
• Variety refers to the diverse data formats, from structured data to unstructured data.
• Veracity refers to data quality and integrity, which are affected by biases, noise, and abnormalities.
• Value is the economic and social value that can be derived from Big Data.
Big Data Solutions
Big Data Cloud-based Solutions
Big Data Basics
• Database Management Systems (DBMS) are software used to:
• Manage additions, updates, and deletions of data as transactions occur
• Support data queries and reporting
• Foreign key: the primary key of one table that appears in another table; it captures the
logical relationship between the two tables.
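As a minimal sketch of the primary key / foreign key relationship, the example below uses Python's built-in sqlite3 module. The `student` and `enrolment` tables and the `course` column are hypothetical names chosen for illustration; they do not come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical tables: student_id is the primary key of `student`
# and appears again in `enrolment` as a foreign key.
conn.execute("CREATE TABLE student (student_id TEXT PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE enrolment ("
    " enrolment_id INTEGER PRIMARY KEY,"
    " student_id TEXT NOT NULL REFERENCES student(student_id),"
    " course TEXT NOT NULL)"
)

conn.execute("INSERT INTO student VALUES ('ID001', 'Sachin')")
conn.execute("INSERT INTO enrolment (student_id, course) VALUES ('ID001', 'Big Data')")

# The foreign key lets a query follow the logical relationship between tables.
rows = conn.execute(
    "SELECT s.name, e.course FROM student s "
    "JOIN enrolment e ON s.student_id = e.student_id"
).fetchall()
print(rows)

# An enrolment that points at a non-existent student is rejected.
try:
    conn.execute("INSERT INTO enrolment (student_id, course) VALUES ('ID999', 'Big Data')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print("orphan row rejected:", orphan_rejected)
```

The rejected insert is the DBMS maintaining consistency on the application's behalf, which is also part of what makes relational systems harder to scale.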
Big Data Basics
Data Marts
A subset of data warehouse information
having focused information particular to
the needs of a given business unit.
Big Data Basics
Student ID   Name      Location    Gender   Age
ID001        Sachin    Mumbai      M        32
ID002        Sourav    Kolkata     M        31
ID003        Mithali   Delhi       F        28
ID004        Smriti    Hyderabad   F        34
New Entry:
{
"Student ID": "ID008",
"Name": "Ravi",
"Hobby": "Football",
"Gender": "M",
"Age": "41"
}
• A major disadvantage of a relational database is that every new entry requires data for all
fields. If any field is missing, a dummy value is entered, which wastes space.
• A NoSQL database can accommodate these anomalies through its schemaless architecture.
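The contrast can be sketched with the new entry for "Ravi", which lacks a Location and carries an extra Hobby field. This is an illustration only: the `"N/A"` dummy value is an assumption, and a plain Python dict holding JSON strings stands in for a real document store.

```python
import json
import sqlite3

FIELDS = ["Student ID", "Name", "Location", "Gender", "Age"]
new_entry = {"Student ID": "ID008", "Name": "Ravi", "Hobby": "Football",
             "Gender": "M", "Age": "41"}

# Relational side: the row must fit the fixed schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE student ("Student ID" TEXT PRIMARY KEY, "Name" TEXT,'
    ' "Location" TEXT, "Gender" TEXT, "Age" TEXT)'
)
# The missing field is padded with a dummy value (assumed here to be "N/A").
row = tuple(new_entry.get(name, "N/A") for name in FIELDS)
conn.execute("INSERT INTO student VALUES (?, ?, ?, ?, ?)", row)
# Fields that are not in the schema have nowhere to go.
dropped = sorted(set(new_entry) - set(FIELDS))
print(row, "dropped:", dropped)

# Schemaless side: the document is stored as-is under its key.
documents = {}  # stand-in for a document store, not a real database
documents[new_entry["Student ID"]] = json.dumps(new_entry)
restored = json.loads(documents["ID008"])
print(restored)
```

The relational row needs a placeholder for Location and loses Hobby unless the schema is altered, whereas the document round-trips unchanged.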
Big Data Basics
• RDBMS
• Schema based
• Allows vertical scaling
• Disadvantages:
• Requires maintaining data consistency, which makes it hard to scale and
resource intensive
• Cannot scale horizontally due to its structured and schema-based nature
NoSQL
• Schema-less
• Allows vertical scaling as well as horizontal scaling
• In a key-value store, each item has two fields: (i) a unique key and (ii) a value
• To keep keys consistent, a hash function is used that maps each key into a fixed
range.
• Largest known NoSQL deployment: Apple, with 75,000+ servers.
• When the application and its data structure are constantly evolving, a NoSQL
database is preferred
Hash Function
https://emn178.github.io/online-tools/sha256.html
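A short sketch of the same idea in code, using SHA-256 from Python's standard hashlib module: the key is hashed and the digest is reduced to a fixed range. The function name `bucket_for` and the choice of 16 buckets are illustrative assumptions, not part of the slides.

```python
import hashlib


def bucket_for(key: str, num_buckets: int = 16) -> int:
    """Map a key of any length to a bucket number in the range [0, num_buckets)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()  # always 32 bytes
    return int.from_bytes(digest, "big") % num_buckets


for key in ["ID001", "ID002", "ID008"]:
    # The same key always lands in the same bucket.
    print(key, hashlib.sha256(key.encode("utf-8")).hexdigest()[:12], bucket_for(key))
```

Because the hash is deterministic, any server can compute where a key lives without consulting a central index, which is one reason key-value stores scale horizontally.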
Big Data – Hadoop ecosystem
• Multiple copies of data across systems (fault tolerant)
• Parallel processing
Big Data – Hadoop ecosystem
[Figure: word count flow. INPUT → SPLIT → MAPPER PHASE → SHUFFLE & SORT → REDUCE PHASE.
Input sentences: "Relational database is schema based.", "noSQL database is schema less.",
"noSQL database is horizontally scalable."
Mapper output for the first split: (Relational, 1), (database, 1), (is, 1), (schema, 1), (based, 1).
Reducer output shown: (database, 3), (is, 3).]
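The word count flow can be sketched in plain Python to show what each phase contributes. This is a single-process illustration of the mapper, shuffle & sort, and reduce steps, not Hadoop's actual API; the function names are chosen for this example.

```python
from collections import defaultdict

# Each sentence plays the role of one input split.
documents = [
    "Relational database is schema based.",
    "noSQL database is schema less.",
    "noSQL database is horizontally scalable.",
]


def mapper(line):
    """Mapper phase: emit a (word, 1) pair for every word in the split."""
    for word in line.replace(".", "").split():
        yield (word, 1)


def shuffle_and_sort(pairs):
    """Shuffle & sort: group all values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Reduce phase: sum the counts collected for one word."""
    return (key, sum(values))


mapped = [pair for line in documents for pair in mapper(line)]
counts = dict(reducer(key, values) for key, values in shuffle_and_sort(mapped))
print(counts)
```

In a real cluster the mapper calls run in parallel on different nodes, and the shuffle moves pairs across the network so that each reducer sees every value for its keys.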
Big Data – Hadoop ecosystem
• YARN helps in efficient management of resources (RAM, network
bandwidth, CPU).
• YARN processes job requests and manages cluster resources
• Comprises four roles:
• Resource Manager: assigns resources across the cluster
• Node Manager: handles a node and monitors the resources used on
that node
• Application Master: requests containers from the Resource Manager
whenever a task arises and works with the Node Managers to run them.
• Containers: hold a collection of physical resources on a node.
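How the four roles interact can be illustrated with a toy model. This is not the YARN API: the class names mirror the roles, but every method, field, and number below is a hypothetical simplification. In particular, the resource manager here launches containers directly with a first-fit policy, whereas real YARN schedulers are more elaborate.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """A bundle of resources reserved on one node."""
    node: str
    memory_mb: int
    vcores: int


@dataclass
class NodeManager:
    """Tracks the free resources and running containers of one node."""
    name: str
    free_memory_mb: int
    free_vcores: int
    containers: list = field(default_factory=list)

    def launch(self, memory_mb, vcores):
        container = Container(self.name, memory_mb, vcores)
        self.free_memory_mb -= memory_mb
        self.free_vcores -= vcores
        self.containers.append(container)
        return container


class ResourceManager:
    """Assigns resources: picks the first node that can fit the request."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb, vcores):
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
                return node.launch(memory_mb, vcores)
        return None  # no node has enough free resources


class ApplicationMaster:
    """Requests one container per task for a single application."""

    def __init__(self, resource_manager):
        self.resource_manager = resource_manager

    def run_tasks(self, num_tasks, memory_mb=1024, vcores=1):
        return [self.resource_manager.allocate(memory_mb, vcores)
                for _ in range(num_tasks)]


nodes = [NodeManager("node-1", 2048, 2), NodeManager("node-2", 1024, 1)]
app_master = ApplicationMaster(ResourceManager(nodes))
granted = app_master.run_tasks(4)
placement = [c.node if c else None for c in granted]
print(placement)
```

With these numbers the cluster can host three 1024 MB containers, so the fourth request is left unallocated until resources are freed.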
Big Data – Hadoop ecosystem
Big Data Solutions
[Figure: CAP triangle with vertices Consistency, Availability, and Partition Tolerance;
the pairwise combinations are labelled CA, CP, and AP.]
BDA Strategic Value
BDA Strategic Roles
BDA – Value Creation
Thank you.