note:
covered content in Seminar folder
Database System Concepts book (Concept book): covered Preface, Chapter 10
Fundamentals Of Database Systems 7th Edition book (Fundamental book): covered
Preface, Chapter 24 (subsection 1 2 3)
1. NOSQL
content ref:
Fundamentals Of Database Systems 7th Edition book > Chapter 24 NOSQL Databases and Big Data Storage
Systems (subsection 1 2 3)
1.1. Compared to SQL
SQL systems offer too many services (powerful query language, concurrency control, etc.),
which some applications may not need.
A structured data model such as the traditional relational model may be too restrictive.
1.2. Characteristics of NOSQL Systems
Distributed databases and distributed systems:
Horizontal scalability, availability, replication and eventual consistency
Sharding (partitioning) of Files
High-Performance Data Access: hashing or range partitioning on object keys
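The two key-based placement strategies named above can be sketched as follows. This is a minimal illustration, not how any particular NOSQL system implements sharding; the shard count, the split points and the function names `hash_shard` and `range_shard` are all assumptions made for the example.

```python
import bisect
import hashlib

NUM_SHARDS = 4  # assumed shard count for the illustration


def hash_shard(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Hash partitioning: a stable hash of the key picks the shard."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Range partitioning: shard i holds the keys below RANGE_BOUNDS[i];
# the last shard holds everything from the final bound upward.
RANGE_BOUNDS = ["g", "n", "t"]  # hypothetical split points


def range_shard(key: str) -> int:
    """Range partitioning: binary-search the sorted split points."""
    return bisect.bisect_right(RANGE_BOUNDS, key.lower())


range_placement = [range_shard(k) for k in ["alice", "george", "nina", "zoe"]]
```

Hashing spreads keys evenly but scatters neighbouring keys across shards; range partitioning keeps neighbouring keys together, which helps range queries but can create hot shards.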
1.2.1. Categories of NOSQL systems and common examples
Document-based: Documents are accessible via their document id, but can also be accessed
rapidly using other indexes
EX: JSON as a document
EX: MongoDB, CouchDB
Key-value: access is by the key to the value associated with that key; the value can be a record, an
object, a document, or an even more complex data structure.
EX: Amazon DynamoDB (key-value data stores, sometimes called key-tuple or key-object data
stores)
Column-based or wide column NOSQL systems: each column family is stored in its own files
EX: Google Bigtable, Apache HBase
Graph-based NOSQL systems: nodes can be found by traversing the edges using path
expressions
EX: Neo4j and GraphBase
Other: hybrid NOSQL systems, object databases, XML databases
EX: Apache Cassandra, originally developed at Facebook (both a key-value store and a
column-based system), OrientDB
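To make the categories above more concrete, here is a rough sketch of how the same customer might be laid out in each model, using plain Python structures. All names and sample values are hypothetical, and real systems add storage, indexing and query layers that are not shown.

```python
# Document store: document id -> JSON-like document; other fields can be indexed.
documents = {"c1": {"name": "Ann", "city": "Hue", "orders": [101, 102]}}
city_index = {"Hue": ["c1"]}  # secondary index on "city"

# Key-value store: the value is opaque to the store; access is by key only.
key_value = {"customer:c1": b'{"name": "Ann", "city": "Hue"}'}

# Wide-column store: row key -> column family -> column -> value.
wide_column = {
    "c1": {
        "profile": {"name": "Ann", "city": "Hue"},
        "orders": {"101": "paid", "102": "new"},
    }
}

# Graph store: nodes plus labeled edges; lookups traverse the edges.
nodes = {"c1": "Customer", "o101": "Order", "o102": "Order"}
edges = [("c1", "PLACED", "o101"), ("c1", "PLACED", "o102")]


def neighbors(node_id, edge_label):
    """Follow outgoing edges with the given label from one node."""
    return [dst for src, label, dst in edges if src == node_id and label == edge_label]


placed = neighbors("c1", "PLACED")
```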
1.3. The CAP Theorem
- The three letters in CAP refer to three desirable properties of distributed systems with replicated data:
Consistency: the nodes have the same copies of a replicated data item visible to the various
transactions
Availability: each read or write request for a data item will either be processed successfully or will
receive a message that the operation cannot be completed
EX:
Partition tolerance: the system can continue operating if the network connecting the nodes has a fault
that results in two or more partitions, where the nodes in each partition can only communicate
among each other
1.3.1.1.1 Q: What exactly is P? It seems very similar to A
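The CAP theorem states that it is not possible to guarantee all three of the desirable properties at the same
time in a distributed system with replicated data, so the designer has to choose 2 out of 3 to guarantee.
EX: eventual consistency is often adopted in NOSQL systems: a weaker consistency level is accepted in
exchange for availability and partition tolerance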
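The trade-off can be illustrated with a toy model of one data item replicated on two nodes. This is only a sketch under simplifying assumptions: the class names, the `mode` flag and the "highest version wins" reconciliation are invented for the example and do not describe any real system's protocol.

```python
class Replica:
    def __init__(self):
        self.value = None
        self.version = 0


class TinyCluster:
    """Two replicas of one data item; `partitioned` models a network fault."""

    def __init__(self, mode):
        assert mode in ("CP", "AP")
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, replica, value):
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                # Keep the copies identical by refusing the write.
                return "rejected"
            # AP: accept locally; the replicas now disagree.
            replica.version += 1
            replica.value = value
            return "accepted (not yet replicated)"
        version = max(replica.version, other.version) + 1
        for r in (replica, other):
            r.value, r.version = value, version
        return "accepted"

    def heal(self):
        """End the partition and reconcile: highest version wins (ties go to a)."""
        self.partitioned = False
        newest = max((self.a, self.b), key=lambda r: r.version)
        for r in (self.a, self.b):
            r.value, r.version = newest.value, newest.version


cp = TinyCluster("CP")
cp.write(cp.a, "x")
cp.partitioned = True
cp_result = cp.write(cp.a, "y")  # refused: both copies still hold "x"

ap = TinyCluster("AP")
ap.write(ap.a, "x")
ap.partitioned = True
ap_result = ap.write(ap.a, "y")  # accepted: a and b now differ
diverged = ap.a.value != ap.b.value
ap.heal()                        # the replicas converge again
```

In this model P is the situation the system must survive (the `partitioned` flag), while A and C are the two ways of responding to it: keep answering writes and let copies diverge for a while, or refuse writes so that copies never diverge.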
2. OLTP vs OLAP
Online Transaction Processing (OLTP) systems: DB design best suited to transactional
operations
Online Analytical Processing (OLAP) systems: DB design best suited to analytical operations
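The difference in workload can be sketched with Python's built-in `sqlite3` module. The `orders` table and its rows are hypothetical, and a single SQLite database is used only to contrast the two query shapes; real OLTP and OLAP systems usually run on separate engines tuned for each workload.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount, status) VALUES (?, ?, ?)",
    [("alice", 30.0, "paid"), ("bob", 20.0, "new"), ("alice", 50.0, "paid")],
)

# OLTP-style: a short transaction that touches one row, looked up by key.
with conn:
    conn.execute("UPDATE orders SET status = 'paid' WHERE id = ?", (2,))

# OLAP-style: a read-only aggregate that scans many rows.
revenue = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "WHERE status = 'paid' GROUP BY customer ORDER BY customer"
).fetchall()
```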
3. Advantages of ELT compared to ETL
Shortens the cycle between extraction and delivery to the data lake/warehouse
Allows you to ingest volumes of raw data as soon as the data becomes available
Saves the original version of the data
Suited to working with big data and analytics
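The ELT ordering can be sketched as follows, again with `sqlite3` standing in for the warehouse. The source rows, the table names `raw_sales` and `sales`, and the cleaning rule are assumptions for illustration only.

```python
import sqlite3

# Hypothetical extracted rows, kept as text exactly as they arrived.
raw_events = [("1", " 19.90 ", "vn"), ("2", "5.00", "VN"), ("3", "n/a", "us")]

wh = sqlite3.connect(":memory:")

# E + L: land the extracted rows unchanged.
wh.execute("CREATE TABLE raw_sales (id TEXT, amount TEXT, country TEXT)")
wh.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", raw_events)

# T: transform later, inside the warehouse, without touching raw_sales.
wh.execute(
    """
    CREATE TABLE sales AS
    SELECT CAST(id AS INTEGER)        AS id,
           CAST(TRIM(amount) AS REAL) AS amount,
           UPPER(country)             AS country
    FROM raw_sales
    WHERE TRIM(amount) GLOB '[0-9]*'
    """
)

raw_count = wh.execute("SELECT COUNT(*) FROM raw_sales").fetchone()[0]
clean_rows = wh.execute("SELECT id, amount, country FROM sales ORDER BY id").fetchall()
```

Because the raw table is kept, the transformation can be rewritten and re-run later without extracting from the source again. In ETL the cleaning would happen before loading, and the rejected row would never reach the warehouse.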
4. Kinds of scalability in distributed systems
horizontal: adding more nodes
vertical: expanding the capacity of existing nodes
5. Data layers in a data pipeline
Bronze/landing/raw layer: stores raw or historical data; no modifications or
data quality checks are applied at this layer. It is used for landing data in the data pipeline
before ingesting it into the data platform, or for archiving it
Silver/staging/processed layer: filtered, cleaned, standardized and data-quality-checked data, not yet
aggregated or calculated in detail
Gold/production/data-mart layer: cleanest stage, usually used for KPI reports or feature
engineering (business-level aggregates)
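A minimal sketch of the three layers in plain Python is shown below. The sample records and the functions `to_silver` and `to_gold` are hypothetical; real pipelines would typically keep each layer as tables or files rather than in-memory lists.

```python
from collections import defaultdict

# Bronze: raw records exactly as landed (hypothetical sample data).
bronze = [
    {"order_id": "1", "country": "vn ", "amount": "10.5"},
    {"order_id": "2", "country": "VN", "amount": "4.5"},
    {"order_id": "2", "country": "VN", "amount": "4.5"},   # duplicate
    {"order_id": "3", "country": "us", "amount": "oops"},  # bad amount
]


def to_silver(rows):
    """Filter, clean, standardize and de-duplicate; no aggregation."""
    seen, out = set(), []
    for r in rows:
        try:
            amount = float(r["amount"])
        except ValueError:
            continue  # fails the data quality check
        order_id = int(r["order_id"])
        if order_id in seen:
            continue  # drop duplicates
        seen.add(order_id)
        out.append(
            {"order_id": order_id, "country": r["country"].strip().upper(), "amount": amount}
        )
    return out


def to_gold(rows):
    """Business-level aggregate: revenue per country."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["country"]] += r["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```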
6. Slowly changing dimension (SCD)
SCD: dimension tables change over time; we need to find the most suitable approach to tracking
these changes
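SCD0: [data at time1] + [all rows of time1 as updated at time2, aka "data at time2"] + ... (a full snapshot is appended at each load). Note: in Kimball's standard definition, Type 0 means "retain original": the attribute value is never updated.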
SCD1: overwrite the old record with the new value (the id stays the same, only some columns are modified)
SCD2: each observation usually carries `start date`, `end date` and `is active` columns to track the version;
the data then looks like [old data at time1] + [updated data at time2] (the same observation has 2
or more rows with different active periods and ids, and some of the data columns are changed)
SCD3: EX:
Before the change
Customer ID   Customer name   Current Type   Previous Type
1             Customer 1      Corporate      Partner
After the change
Customer ID   Customer name   Current Type   Previous Type
1             Customer 1      Retail         Corporate
note: tracking many previous values increases the number of columns
SCD4: combination of SCD1 and SCD2 (keeps 2 tables at the same time: a current table updated as in
SCD1 and a history table kept as in SCD2)
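The update rules for SCD1, SCD2 and SCD3 can be sketched as below. This is an illustration under assumptions: the dimension is a list of dicts, and the column names (`row_id`, `start_date`, `end_date`, `is_active`, `current_type`, `previous_type`) and function names are chosen for the example rather than taken from any tool.

```python
from datetime import date


def scd1_update(dim, customer_id, new_type):
    """SCD1: overwrite in place; the old value is lost."""
    for row in dim:
        if row["customer_id"] == customer_id:
            row["type"] = new_type


def scd2_update(dim, customer_id, new_type, change_date):
    """SCD2: close the active row, then append a new version."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_active"]:
            row["end_date"] = change_date
            row["is_active"] = False
    dim.append({
        "row_id": len(dim) + 1,  # surrogate key, new for every version
        "customer_id": customer_id,
        "type": new_type,
        "start_date": change_date,
        "end_date": None,
        "is_active": True,
    })


def scd3_update(row, new_type):
    """SCD3: shift the current value into the 'previous' column."""
    row["previous_type"] = row["current_type"]
    row["current_type"] = new_type


dim1 = [{"customer_id": 1, "type": "Corporate"}]
scd1_update(dim1, 1, "Retail")

dim2 = [{"row_id": 1, "customer_id": 1, "type": "Corporate",
         "start_date": date(2023, 1, 1), "end_date": None, "is_active": True}]
scd2_update(dim2, 1, "Retail", date(2024, 1, 1))

row3 = {"customer_id": 1, "current_type": "Corporate", "previous_type": "Partner"}
scd3_update(row3, "Retail")
```

After these calls `dim1` still has one row, `dim2` has two rows for the same customer with only the newer one active, and `row3` matches the "after the change" table in the SCD3 example.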