
S - DE concept

note:
 covered content in Seminar folder
 Database System Concepts book (Concept book): covered Preface, Chapter 10
 Fundamentals Of Database Systems 7th Edition book (Fundamental book): covered
Preface, Chapter 24 (subsection 1 2 3)

1. NOSQL
content ref:
 Fundamentals of Database Systems book > Chapter 24 NOSQL Databases and Big Data Storage
Systems (subsection 1 2 3)
1.1. Compared to SQL
 SQL systems offer too many services (powerful query language, concurrency control, etc.),
which some applications may not need.
 A structured data model such as the traditional relational model may be too restrictive.
1.2. Characteristics of NOSQL Systems
 Distributed databases and distributed systems:
 Horizontal scalability, Availability, Replication and Eventual Consistency
 Sharding (partitioning) of Files
 High-Performance Data Access: hashing or range partitioning on object keys
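A minimal sketch of the two key-based partitioning schemes mentioned above (hashing and range partitioning on object keys). The function names, the shard count and the split points in `BOUNDS` are hypothetical, chosen only for illustration; real systems use their own hash functions and rebalancing logic.

```python
import hashlib
from bisect import bisect_right


def hash_shard(key: str, num_shards: int) -> int:
    """Hash partitioning: a stable hash of the key picks the shard."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def range_shard(key: str, upper_bounds: list[str]) -> int:
    """Range partitioning: shard i holds keys below upper_bounds[i];
    the last shard holds every key at or above the final bound."""
    return bisect_right(upper_bounds, key)


# Hypothetical split points: shard 0 = keys < "h",
# shard 1 = "h" <= keys < "p", shard 2 = keys >= "p".
BOUNDS = ["h", "p"]

placement = {name: range_shard(name, BOUNDS) for name in ("alice", "mike", "zoe")}
```

Hashing spreads keys evenly but scatters neighbouring keys across shards; range partitioning keeps neighbouring keys together, which helps range queries but can create hot shards.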
1.2.1. Categories of NOSQL Systems and common examples
 Document-based: Documents are accessible via their document id, but can also be accessed
rapidly using other indexes
EX: JSON as a document
EX: MongoDB, CouchDB

 Key-value: access by key to the value associated with that key; the value can be a record, an
object, a document, or an even more complex data structure.
EX: Amazon DynamoDB (key-value data stores, sometimes called key-tuple or key-object data
stores)
 Column-based or wide column NOSQL systems: each column family is stored in its own files
EX: Google BigTable, Apache HBase
 Graph-based NOSQL systems: nodes can be found by traversing the edges using path
expressions
EX: Neo4j and GraphBase
 Other: hybrid NOSQL systems, object databases, XML databases
EX: Facebook: Apache Cassandra (both a key-value store and a column-based system),
OrientDB
1.3. The CAP Theorem
- The three letters in CAP refer to three desirable properties of distributed systems with replicated data:
 Consistency: the nodes will have the same copies of a replicated data item visible for various
transactions
 Availability: each read or write request for a data item will either be processed successfully or will
receive a message that the operation cannot be completed
EX:
 Partition tolerance: the system can continue operating if the network connecting the nodes has a fault
that results in two or more partitions, where the nodes in each partition can only communicate
among each other
1.3.1.1.1 Q: What exactly is P? It seems very similar to A.
The CAP theorem states that it is not possible to guarantee all three of the desirable properties at the same
time in a distributed system with data replication; we have to choose 2 out of 3 to guarantee.
EX: eventual consistency is often adopted in NOSQL systems
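A toy illustration of eventual consistency, which may also help with the P-versus-A question: the network fault is the partition (P), while availability (A) is about whether each node still answers requests during that fault. The `Replica` class, the `sync` function and the version numbers are hypothetical; this is a sketch of the idea, not how any particular NOSQL system implements replication.

```python
class Replica:
    """One node holding a copy of the data as key -> (version, value)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, version):
        current = self.data.get(key)
        # Keep only the newest version of each item.
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(source, target):
    """Replication step: push newer versions from source to target."""
    for key, (version, value) in source.data.items():
        target.write(key, value, version)


a, b = Replica("a"), Replica("b")
a.write("cart", ["book"], version=1)
sync(a, b)

# Network partition: a and b cannot talk, but both keep serving requests.
a.write("cart", ["book", "pen"], version=2)
stale = b.read("cart")   # b is available, but its answer is out of date

# The partition heals and the replicas converge.
sync(a, b)
fresh = b.read("cart")
```

During the partition the system stays available and partition tolerant, and pays for it with a temporarily inconsistent read on `b`.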
2. OLTP vs OLAP
 Online Transaction Processing (OLTP) systems: DB design best suited to transactional
operations
 Online Analytical Processing (OLAP) systems: DB design best suited to analytical operations
3. Advantages of ELT compared to ETL
 Shortens the cycle between extraction and delivery to the data lake/warehouse
 Allows you to ingest volumes of raw data as soon as the data becomes available
 Preserves the original version of the data
 Well suited to big data and analytics workloads
4. Kinds of scalability in distributed systems
 horizontal: adding more nodes
 vertical: increasing the power of an existing node
5. Data layers in a data pipeline
 Bronze/landing/raw layer: stores raw or historical data; no modifications or
data quality checks are applied at this layer. It is used for landing data in the data pipeline
before ingesting it into the data platform, or for archiving it
 Silver/staging/processed layer: data is filtered, cleaned, standardized and quality checked, but not
yet aggregated or calculated in detail
 Gold/production/data-mart layer: the cleanest stage, usually used for KPI reports or feature
engineering (business-level aggregates)
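The three layers can be sketched as a small in-memory pipeline. The sample records, the field names and the `to_silver` / `to_gold` helpers are hypothetical; a real platform would store each layer in its own tables or files.

```python
from collections import defaultdict

# Bronze: raw records exactly as they landed (hypothetical sample data).
bronze = [
    {"order_id": "1", "country": " vn ", "amount": "10.5"},
    {"order_id": "2", "country": "VN", "amount": "n/a"},   # bad amount
    {"order_id": "3", "country": "us", "amount": "7"},
    {"order_id": "3", "country": "us", "amount": "7"},     # duplicate
]


def to_silver(rows):
    """Filter, clean and standardize; no aggregation yet."""
    seen, silver = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # fails the data quality check
        if row["order_id"] in seen:
            continue  # drop duplicates
        seen.add(row["order_id"])
        silver.append({
            "order_id": int(row["order_id"]),
            "country": row["country"].strip().upper(),
            "amount": amount,
        })
    return silver


def to_gold(rows):
    """Business-level aggregate: revenue per country."""
    revenue = defaultdict(float)
    for row in rows:
        revenue[row["country"]] += row["amount"]
    return dict(revenue)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Note that `bronze` is never modified, so the raw data stays available for reprocessing or archiving.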
6. Slowly changing dimension (SCD)
 SCD: dimension tables change over time, so we need to find the most suitable approach for tracking
these changes
 SCD0: retain the original values; the dimension is not updated after it is first loaded, so later changes in the source are ignored
 SCD1: overwrite the old record with the new value (the id stays the same, only some columns are modified)
 SCD2: each record usually carries `start date`, `end date` and `is active` columns to track the version;
the data then contains [old data at time1] + [updated data at time2] (the same observation has 2
or more rows with different active periods and ids, and the changed columns hold the new values)
 SCD3: keep the previous value in an extra column. EX:
 Before the change
Customer ID | Customer name | Current Type | Previous Type
1           | Customer 1    | Corporate    | Partner
 After the change
Customer ID | Customer name | Current Type | Previous Type
1           | Customer 1    | Retail       | Corporate
note: tracking many previous values increases the number of columns
 SCD4: combination of SCD1 and SCD2 (two tables are kept at the same time: one with current values, one with history)

 SCD6 (rarely used): combination of SCD1, SCD2 and SCD3
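A minimal sketch contrasting SCD1 (overwrite) with SCD2 (versioned rows), using plain lists of dicts as the dimension table. The column names (`row_id`, `start_date`, `end_date`, `is_active`), the dates and the helper functions are assumptions made for this example.

```python
from datetime import date


def apply_scd1(dim, customer_id, new_type):
    """SCD1: overwrite in place; the old value is lost."""
    for row in dim:
        if row["customer_id"] == customer_id:
            row["type"] = new_type
    return dim


def apply_scd2(dim, customer_id, new_type, change_date):
    """SCD2: close the active row and append a new version."""
    for row in dim:
        if row["customer_id"] == customer_id and row["is_active"]:
            row["end_date"] = change_date
            row["is_active"] = False
    dim.append({
        "row_id": len(dim) + 1,        # new surrogate id for the new version
        "customer_id": customer_id,
        "type": new_type,
        "start_date": change_date,
        "end_date": None,
        "is_active": True,
    })
    return dim


scd1_dim = [{"customer_id": 1, "type": "Corporate"}]
apply_scd1(scd1_dim, 1, "Retail")

scd2_dim = [{
    "row_id": 1, "customer_id": 1, "type": "Corporate",
    "start_date": date(2023, 1, 1), "end_date": None, "is_active": True,
}]
apply_scd2(scd2_dim, 1, "Retail", date(2024, 6, 1))
```

After the change, `scd1_dim` still has one row with no trace of "Corporate", while `scd2_dim` has two rows for the same customer and only the newest one is active.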

7. Data cleaning process


p_remind: this duplicates notes from earlier courses (EX: ML intro)
 read the first few rows
 drop or merge columns
 replace missing or erroneous data with a suitable value (EX: null, -1, ...)
 rename
 reformat data types (EX: date types)
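The cleaning steps above can be sketched with the standard library only. The CSV content, the column names and the choice of -1 as the sentinel for a missing age are all made up for illustration.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw input; in practice this would be read from a file.
RAW = """id,first,last,Signup Date,age,notes
1,An,Nguyen,25/12/2023,31,x
2,Binh,Tran,01/02/2024,,y
"""

rows = list(csv.DictReader(io.StringIO(RAW)))
preview = rows[:5]  # step 1: look at the first few rows

cleaned = []
for row in rows:
    cleaned.append({
        "id": int(row["id"]),
        # merge two columns into one; the unused "notes" column is dropped
        "full_name": f'{row["first"]} {row["last"]}',
        # replace a missing value with a sentinel (-1 here)
        "age": int(row["age"]) if row["age"] else -1,
        # rename "Signup Date" and reformat it as an ISO date string
        "signup_date": datetime.strptime(row["Signup Date"], "%d/%m/%Y").date().isoformat(),
    })
```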
8. SQL tasks
 indexing or creating partitions to boost performance
 encrypt sensitive information
 triggers to check for errors and avoid dangling data
 check storage
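A small sketch of two of these tasks (an index, and a trigger that rejects bad rows) using Python's built-in `sqlite3` module. The schema, the index and trigger names, and the rule "amount must be positive" are hypothetical. The foreign key is an extra assumption, added here as one common way to avoid rows that point at a missing parent; encryption and storage checks are not covered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    amount REAL NOT NULL
);
-- Index to speed up lookups by customer
CREATE INDEX idx_orders_customer ON orders(customer_id);
-- Trigger that rejects invalid rows before they are stored
CREATE TRIGGER trg_orders_amount BEFORE INSERT ON orders
WHEN NEW.amount <= 0
BEGIN
    SELECT RAISE(ABORT, 'amount must be positive');
END;
""")

conn.execute("INSERT INTO customers VALUES (1, 'Customer 1')")
conn.execute("INSERT INTO orders VALUES (1, 1, 10.0)")


def try_insert(sql):
    """Run an insert and return 'ok' or the database error message."""
    try:
        conn.execute(sql)
        return "ok"
    except sqlite3.DatabaseError as exc:
        return str(exc)


bad_amount = try_insert("INSERT INTO orders VALUES (2, 1, -5)")    # blocked by the trigger
dangling = try_insert("INSERT INTO orders VALUES (3, 99, 5.0)")    # blocked by the foreign key
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 1"
).fetchall()
```

The last column of each `plan` row describes how SQLite intends to run the query, which is a quick way to check whether the index is actually used.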

9. Kinds of cloud service


 On-premises: you buy the hardware and install everything yourself; no cloud provider is involved at all
 Infrastructure as a Service (IaaS): the hardware is provided by the service supplier
 Platform as a Service (PaaS): you only manage the application and data
 Software as a Service (SaaS): EX: Word Online, Facebook, TikTok
 Serverless: runs in response to events; each run does not take much time, but it needs a large volume of requests
10. Cloud tools a DE should know
 Databricks: leans toward DS (also supports DA)
 PaaS
 Snowflake: leans toward DA
 SaaS

11. Archived - merge on read and copy on write


Inote: there are too many versions of the definition; the following is my own version:
Inote: this term seems to be used in many contexts and understood in many ways; letting it go for now,
it is too muddled
 copy on write:
 interpretation 1: allows multiple processes to work on the same data; each process gets its own
copy to work with, and the actual change to the original data waits until the
transaction really commits (?? after that, is only a single version kept?). This strategy
lets multiple sources work on the same data without conflicts
>< this interpretation is somewhat wrong, because it fits `write on copy` better
 interpretation 2: there are multiple copies, but the data is only actually taken from a copy when the
user reads
>< self-made and fairly meaningless
 merge on read: when the data changes, only the change information is stored; the changes are merged
only when the user reads the data, which speeds up writes
>< self-made and fairly meaningless
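One way to pin the two terms down is the lakehouse table-format sense (the one used for row-level changes in Apache Iceberg): copy on write rewrites the affected data file at write time, merge on read stores small change records and merges them at read time. The two classes below are a toy model of that trade-off under these assumptions; the class and attribute names are hypothetical and no real table format works on Python dicts.

```python
class CopyOnWriteTable:
    """Each update rewrites a full new copy of the data file."""

    def __init__(self, rows):
        self.data_file = dict(rows)
        self.files_rewritten = 0

    def update(self, key, value):
        new_file = dict(self.data_file)  # copy the whole file...
        new_file[key] = value            # ...apply the change...
        self.data_file = new_file        # ...and swap it in
        self.files_rewritten += 1

    def read(self):
        return dict(self.data_file)      # nothing to merge at read time


class MergeOnReadTable:
    """Updates go to a small delta log; reads merge base file + deltas."""

    def __init__(self, rows):
        self.base_file = dict(rows)
        self.delta_log = []

    def update(self, key, value):
        self.delta_log.append((key, value))  # cheap write

    def read(self):
        merged = dict(self.base_file)
        for key, value in self.delta_log:    # merge cost is paid by the reader
            merged[key] = value
        return merged

    def compact(self):
        """Fold the deltas into a new base file."""
        self.base_file = self.read()
        self.delta_log = []


rows = {1: "Corporate", 2: "Partner"}
cow, mor = CopyOnWriteTable(rows), MergeOnReadTable(rows)
for table in (cow, mor):
    table.update(1, "Retail")
```

Both tables return the same data on read; the difference is where the work happens: slow writes and fast reads for copy on write, fast writes and slower reads (until compaction) for merge on read.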
some sources for further reading:
 https://www.google.com/search?q=merge+on+read+and+copy+on+write&oq=merge+on+read+a&gs_lcrp=EgZjaHJvbWUqBwgBEAAYgAQyBggAEEUYOTIHCAEQABiABDIICAIQABgWGB4yCAgDEAAYFhgeMggIBBAAGBYYHjINCAUQABiGAxiABBiKBTINCAYQABiGAxiABBiKBdIBCDUzMjNqMGo5qAIAsAIA&sourceid=chrome&ie=UTF-8
 https://www.dremio.com/blog/row-level-changes-on-the-lakehouse-copy-on-write-vs-merge-on-read-in-apache-iceberg/
