Compression and Encoding - Use compression formats such as Gzip or Snappy to reduce data size.
Data Cleaning and Preprocessing - Handle missing values, outliers, and inconsistencies.
Error Handling and Logging - Implement robust error handling and logging.
Monitoring and Alerts - Set up monitoring and alerting to detect issues early.
Testing and Validation - Rigorously test and validate the data loading process (a minimal sketch of these practices follows this list).
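To make these practices concrete, here is a minimal Python sketch of a loading step that combines several of them: it reads Gzip-compressed JSON-lines input, validates each record against a small schema, and logs every rejected row. The file format, the `REQUIRED_FIELDS` set, and the field names are assumptions for illustration, not from the original.

```python
import gzip
import json
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("data_loader")

# Hypothetical schema: fields every record must carry.
REQUIRED_FIELDS = {"id", "timestamp", "value"}

def load_records(path):
    """Read Gzip-compressed JSON-lines records, keeping only valid ones."""
    records, rejected = [], 0
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                logger.warning("Malformed JSON on line %d, skipping", line_no)
                rejected += 1
                continue
            # Cleaning/validation: require a dict with all fields present.
            if not isinstance(record, dict) or REQUIRED_FIELDS - record.keys():
                logger.warning("Incomplete record on line %d, skipping", line_no)
                rejected += 1
                continue
            records.append(record)
    logger.info("Loaded %d records, rejected %d", len(records), rejected)
    return records
```

Monitoring and alerting would hook into the same counters, for example by raising an alert when the rejection rate crosses a threshold; both the happy path and the failure path are logged so tests can assert on them.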
| Approach | Description | Suitable Data Types | Data Volumes | Supported Tools |
| --- | --- | --- | --- | --- |
| ETL (Extract, Transform, Load) | Data is extracted from source systems, transformed (cleaned, enriched) during the process, and the transformed data is loaded into the target system. | Structured data | Medium to Large | Informatica, Talend, Apache NiFi, Apache Spark, Microsoft SSIS, etc. |
| ELT (Extract, Load, Transform) | Data is extracted from source systems and loaded into a storage area without transformation. Transformation then happens inside the target system. | Structured and unstructured | Medium to Large | Apache NiFi, AWS Glue, Talend, Informatica PowerCenter, etc. |
| Real-Time Data Integration | Data is continuously extracted in real time or near real time. Transformed data is loaded directly into the target system. | Real-time and streaming data | Low to High | Apache Kafka, Apache Flink, AWS Kinesis, Apache Pulsar, etc. |
| Change Data Capture (CDC) | Captures and replicates changes in source data. Replicated changes are loaded into the target system. | Any data type | Low to High | Debezium, Apache Kafka Connect, GoldenGate, AWS DMS, etc. |
| Data Federation | Virtual integration of data from multiple sources. Data remains in the source systems; no physical movement. | Any data type | Low to High | Denodo, SAP HANA Smart Data Access, Cisco Data Virtualization, etc. |
| Data Warehousing | Data is copied to a centralized data warehouse, where it is transformed and stored for analytics and reporting. | Structured data | Medium to Large | Amazon Redshift, Snowflake, Google BigQuery, Teradata, etc. |
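As a rough illustration of the ETL row above, the sketch below extracts rows from a CSV file, transforms them in flight (dropping incomplete rows and normalizing an email column), and loads the result into a SQLite target. The file layout, column names, and SQLite target are assumptions for illustration; production pipelines would use tools like those listed in the table.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a source CSV file (hypothetical layout)."""
    with open(path, newline="", encoding="utf-8") as fh:
        yield from csv.DictReader(fh)

def transform(rows):
    """Transform: clean and enrich rows before they reach the target."""
    for row in rows:
        if not row.get("email"):                     # drop incomplete rows
            continue
        row["email"] = row["email"].strip().lower()  # normalize
        yield (row["id"], row["email"])

def load(rows, db_path="target.db"):
    """Load: write the transformed rows into the target system."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS users (id TEXT PRIMARY KEY, email TEXT)")
    con.executemany("INSERT OR REPLACE INTO users VALUES (?, ?)", rows)
    con.commit()
    con.close()

# ETL: the transform sits between extract and load. An ELT pipeline would
# instead load the raw rows first and transform them inside the target
# (for example, with SQL run in the warehouse).
load(transform(extract("users.csv")))
```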
| Use Case | Description |
| --- | --- |
| Initial Data Ingestion | Batch Data Ingestion: migrating large data batches into the data lake. |
| Data Archiving and Backup | Archiving and Backup: storing historical data and providing data backup and disaster recovery capabilities. |
| Data Migration Across Clouds or Environments | Cross-Cloud Data Migration: moving data between different cloud providers or between on-premises and cloud environments. |
| Data Migration for Data Warehouses | Offloading Data Warehouses: migrating historical data from data warehouses to the data lake for cost-effective storage. |
| Hybrid Data Lake Strategies | Hybrid Data Lakes: implementing data storage and processing in both on-premises and cloud-based data lakes for consistency. |
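Batch ingestion into a cloud data lake often amounts to copying files into object storage. The sketch below assumes AWS S3 via boto3, credentials already configured in the environment, and a hypothetical bucket and prefix layout; it uploads one directory of exported files as a single batch.

```python
import os
import boto3  # assumes AWS credentials are configured in the environment

s3 = boto3.client("s3")
BUCKET = "example-data-lake"  # hypothetical bucket name

def ingest_batch(local_dir, prefix):
    """Upload every file in local_dir into the lake under the given prefix."""
    for name in os.listdir(local_dir):
        path = os.path.join(local_dir, name)
        if os.path.isfile(path):
            s3.upload_file(path, BUCKET, f"{prefix}/{name}")

# e.g. land one day's export under the raw zone of the lake
ingest_batch("exports/2024-01-01", "raw/sales/2024-01-01")
```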
| Step | Description |
| --- | --- |
| Data Validation | Check unstructured data for errors and completeness to make sure it's reliable and good to use. |
| Data Transformation | Convert and organize data, like extracting text from images or videos, so it's ready for storage. |
| Data Deduplication | Find and remove any duplicate data to keep things clean and save on storage space. |
| Data Compression | Shrink data to save space while keeping it accessible, like making files smaller but still usable. |
| Data Backup and Recovery | Set up a way to make sure your data is safe, so you can get it back if something goes wrong. |
| Data Indexing | Create quick lookup lists to find data easily, like making a table of contents for a book. |
| Data Quality Management | Keep an eye on data quality, so it's accurate and doesn't have mistakes or conflicts. |
| Data Security and Privacy | Protect sensitive data with locks and guards, so only the right people can access it. |
| Data Governance | Make rules and plans for how data is handled, like deciding who's in charge and what's allowed. |
| Metadata Enrichment | Add extra info to your data, like labels and descriptions, to make it easier to find and understand. |
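Several of these steps can share a single pass over incoming files. The following sketch (directory layout and naming are assumptions for illustration) deduplicates files by content hash, compresses each unique file with Gzip, and returns a hash-to-path index that works like the "table of contents" described above.

```python
import gzip
import hashlib
import shutil
from pathlib import Path

def content_hash(path):
    """Hash file contents so byte-identical files can be detected."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def dedupe_and_compress(src_dir, dst_dir):
    """Store each unique file once, Gzip-compressed, and return an index
    mapping content hash -> stored path (a simple lookup list)."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    index = {}
    for path in Path(src_dir).iterdir():
        if not path.is_file():
            continue
        digest = content_hash(path)
        if digest in index:  # duplicate content: keep only the first copy
            continue
        target = out / (digest + ".gz")
        with path.open("rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        index[digest] = str(target)
    return index
```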
| | Structured Data | Semi-Structured Data | Unstructured Data |
| --- | --- | --- | --- |
| Data Schema | Data with clear rules and structure. | Data with some rules but also some flexibility. | Data without strict rules, often messy. |
| Query and Analysis | Data is easy to ask questions about and analyze. | Data needs tools that can deal with some flexibility. | Data needs special tools for understanding. |
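The practical difference shows up in how each kind of data gets queried. A small Python sketch with hypothetical file names: structured CSV rows can be filtered directly, semi-structured JSON needs tolerant access to optional fields, and unstructured text needs extra processing (here, a crude keyword scan) before it yields answers.

```python
import csv
import json

# Structured: fixed schema, so a question is a simple filter on a column.
with open("orders.csv", newline="", encoding="utf-8") as fh:
    big_orders = [row for row in csv.DictReader(fh) if float(row["total"]) > 100]

# Semi-structured: some rules, some flexibility; fields may be missing.
with open("events.json", encoding="utf-8") as fh:
    events = json.load(fh)
clicks = [e for e in events if e.get("type") == "click"]  # 'type' is optional

# Unstructured: free text must be processed before it can be "queried".
with open("review.txt", encoding="utf-8") as fh:
    text = fh.read()
refund_mentions = text.lower().count("refund")  # crude keyword scan
```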