AWS DATA ENGINEERING VIRTUAL INTERNSHIP
Internship report submitted in partial fulfilment of requirements for
the award of degree of
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
by
SHAIK MEERA FAREEDH (20131A05M9)
Under the esteemed guidance of
Course Coordinator: Dr. Ch. Sita Kumari, Associate Professor, CSE
Internship Mentor: Dr. P. Prapoorna Roja, Professor, CSE
Department of Computer Science and Engineering
GAYATRI VIDYA PARISHAD COLLEGE OF ENGINEERING
(AUTONOMOUS)
(Affiliated to JNTU-K, Kakinada)
VISAKHAPATNAM
2023– 2024
Gayatri Vidya Parishad College of Engineering (Autonomous)
Visakhapatnam
CERTIFICATE
This report on
“AWS DATA ENGINEERING VIRTUAL INTERNSHIP”
is a bonafide record of the Internship work submitted
by
SHAIK MEERA FAREEDH (20131A05M9)
in the VIII semester, in partial fulfilment of the requirements for the award
of the degree of
Bachelor of Technology in Computer Science and Engineering
During the academic year 2023-2024
Course Coordinator: Dr. Ch. Sita Kumari, Associate Professor, CSE
Head of the Department: Dr. D. Uma Devi, Associate Professor & HOD, CSE and CSE(AI-ML)
Internship Mentor: Dr. P. Prapoorna Roja, Professor, CSE
ACKNOWLEDGEMENT
We would like to express our deep sense of gratitude to our esteemed
institute Gayatri Vidya Parishad College of Engineering (Autonomous), which has
provided us an opportunity to fulfil our cherished desire.
We thank our Course Coordinator, Dr. CH. SITA KUMARI, Associate Professor,
Department of Computer Science and Engineering, for the kind suggestions and
guidance for the successful completion of our internship.
We thank our Internship Mentor, Dr. P. Prapoorna Roja, Professor,
Department of Computer Science and Engineering, for the kind suggestions and
guidance for the successful completion of our internship.
We are highly indebted to Dr. D. UMA DEVI, Associate Professor and Head of the
Department of Computer Science and Engineering and Computer Science and
Engineering (AI-ML), Gayatri Vidya Parishad College of Engineering
(Autonomous), for giving us an opportunity to do the internship in college.
We express our sincere thanks to our Principal Dr. A. B. KOTESWARA RAO,
Gayatri Vidya Parishad College of Engineering (Autonomous) for his
encouragement during this internship and for giving us a chance to explore and
learn new technologies.
Finally, we are indebted to the teaching and non-teaching staff of the Computer
Science and Engineering Department for all their support in the completion of this
internship.
SHAIK MEERA FAREEDH (20131A05M9)
ABSTRACT
Data engineering is the science of analysing raw data to draw conclusions from that
information. It relies on a variety of software tools, ranging from spreadsheets, data
visualization and reporting tools, and data mining programs to open-source languages
for advanced data manipulation. Data analysis mainly deals with data collection, data
storage, data preprocessing, and data visualisation.
In Course 1 we learnt about cloud computing. Cloud computing is the on-demand
delivery of compute power, database, storage, applications, and other IT resources via
the internet with pay-as-you-go pricing. These resources run on server computers that
are located in large data centers in different locations around the world. When you use
a cloud service provider like AWS, that service provider owns the computers that you
are using. This course covers the following main concepts: compute services, storage
services, management services, database services, compliance services, and AWS cost
management services.
As part of Course 2 we learnt data engineering, which deals with turning raw data into
solutions. In this course we learnt about big data, which is the foremost input for data
analysis and is therefore central to data engineering. Because big data raises the problem
of storing data at scale, we also learnt about the different tools used for data storage and
about how to analyse and preprocess big data. The main concepts include storage and
analytics with AWS services such as Amazon S3, Amazon Athena, Amazon Redshift,
AWS Glue, Amazon SageMaker, and AWS IoT Analytics.
Index
Sl. No  Topic name                                              Page number

COURSE-1: CLOUD FOUNDATIONS
1   Introduction to Cloud Computing                             1-3
    1.1 What is cloud computing
    1.2 Traditional computing vs cloud computing
    1.3 Introduction to AWS
    1.4 AWS CAF
2   Cloud Economics and Billing                                 4-5
    2.1 What is AWS
    2.2 Paying for resources in AWS
    2.3 AWS TCO
    2.4 AWS Organizations
3   AWS Global Infrastructure                                   6
    3.1 AWS infrastructure
    3.2 AWS foundational services
4   AWS Cloud Security                                          7-8
    4.1 AWS cloud security
    4.2 AWS shared responsibility model
    4.3 IAM
    4.4 Securing accounts
5   Networking and Content Delivery                             9-10
    5.1 Networking basics
    5.2 Amazon VPC
    5.3 Amazon Route 53
    5.4 Amazon CloudFront
6   Compute                                                     11-12
    6.1 Compute services
    6.2 Amazon EC2
    6.3 Container services
7   Storage                                                     13
    7.1 Amazon EBS
    7.2 Amazon S3
    7.3 Amazon EFS
    7.4 Amazon S3 Glacier
8   Databases                                                   14-15
    8.1 Relational database services
    8.2 Cloud architecture
    8.3 AWS Trusted Advisor
    8.4 Automatic scaling and monitoring
9   Cloud Architecture                                          16
    9.1 Cloud architects
    9.2 Reliability and availability
10  Automatic Scaling and Monitoring                            17-18
    10.1 Elastic Load Balancing
    10.2 Amazon CloudWatch
    10.3 Amazon EC2 Auto Scaling

COURSE-2: DATA ENGINEERING
11  Introduction to Data Engineering                            19-20
12  Data-Driven Organizations                                   21-22
13  The Elements of Data                                        23-24
14  Design Principles and Patterns for Data Pipelines           25-26
15  Securing and Scaling the Data Pipeline                      27-28
16  Ingesting and Preparing Data                                29-30
17  Ingesting by Batch or by Stream                             31-32
18  Storing and Organizing Data                                 33-34
19  Processing Big Data                                         35-36
20  Processing Data for ML                                      37-38
21  Analyzing and Visualizing Data                              39-40
22  Automating the Pipeline                                     41-42
23  Labs                                                        43-51
24  Case Study                                                  52-53
25  Conclusion                                                  54
26  Reference Links                                             55
CLOUD FOUNDATIONS
1. INTRODUCTION TO CLOUD COMPUTING
1.1 What is cloud computing?
Cloud computing is the on-demand delivery of compute power, database, storage,
applications, and other IT resources via the internet with pay-as-you-go pricing. These
resources run on server computers that are located in large data centers in different
locations around the world. When you use a cloud service provider like AWS, that
service provider owns the computers that you are using. These resources can be used
together like building blocks to build solutions that help meet business goals and
satisfy technology requirements.
The service models provided by cloud computing are:
IaaS (Infrastructure as a Service)
PaaS (Platform as a Service)
SaaS (Software as a Service)
1.2 Differences between traditional computing and cloud computing
Traditional computing model
•Infrastructure as hardware
•Hardware solutions:
  i. Require space, staff, physical security, planning, and capital expenditure
  ii. Have a long hardware procurement cycle
  iii. Require you to provision capacity by guessing theoretical maximum peaks
Cloud computing model
•Infrastructure as software
•Software solutions:
  1. Are flexible
  2. Can change more quickly, easily, and cost-effectively than hardware solutions
  3. Eliminate the undifferentiated heavy-lifting tasks
Fig-1.2.1 Cloud Service models
Advantages of cloud computing
• Trade capital expense for variable expense
• Benefit from massive economies of scale
• Stop guessing capacity
• Increase speed and agility
• Go global in minutes
• Stop spending money on running and maintaining data centers
1.3 Introduction to AWS (Amazon Web Services)
Amazon Web Services (AWS) is a secure cloud platform that offers a broad set of
global cloud-based products. Because these products are delivered over the internet,
you have on-demand access to the compute, storage, network, database, and other IT
resources that you might need for your projects— and the tools to manage them. AWS
offers flexibility. Your AWS environment can be reconfigured and updated on
demand, scaled up or down automatically to meet usage patterns and optimize
spending, or shut down temporarily or permanently.
Fig-1.3.1 Services covered in the course
1.4 AWS cloud adoption framework (AWS CAF)
AWS CAF provides guidance and best practices to help organizations build a
comprehensive approach to cloud computing across the organization and throughout
the IT lifecycle to accelerate successful cloud adoption.
1. AWS CAF is organized into six perspectives.
2. Perspectives consist of sets of capabilities.
2. CLOUD ECONOMICS AND BILLING
2.1 What is AWS?
AWS is designed to allow application providers, ISVs, and vendors to quickly and
securely host your applications – whether an existing application or a new SaaS-based
application. You can use the AWS Management Console or well-documented web
services APIs to access AWS's application hosting platform.
2.2 How do we pay for the resources used in AWS?
AWS realizes that every customer has different needs. If none of the AWS pricing
models work for your project, custom pricing is available for high-volume projects
with unique requirements. There are some general rules associated with how you pay
for AWS resources:
•There is no charge (with some exceptions) for inbound data transfer or for data
transfer between services within the same AWS Region.
•Pay for what you use.
•Start and stop anytime.
•No long-term contracts are required.
•Some services are free, but the other AWS services that they provision might not be
free.
2.3 What is TCO?
Total Cost of Ownership (TCO) is the financial estimate to help identify direct and
indirect costs of a system.
Why use TCO?
•To compare the costs of running an entire infrastructure environment or specific
workload on- premises versus on AWS
•To budget and build the business case for moving to the cloud.
Use the AWS Pricing Calculator to:
•Estimate monthly costs
•Identify opportunities to reduce monthly costs
•Model your solutions before building them
•Explore price points and calculations behind your estimate
•Find the available instance types and contract terms that meet your needs
•Name your estimate and create and name groups of services
2.4 AWS organizations:
AWS Organizations is a free account management service that enables you to
consolidate multiple AWS accounts into an organization that you create and centrally
manage. AWS Organizations includes consolidated billing and account management
capabilities that help you to better meet the budgetary, security, and compliance needs
of your business.
3. AWS GLOBAL INFRASTRUCTURE
The AWS Global Infrastructure is designed and built to deliver a flexible, reliable,
scalable, and secure cloud computing environment with high-quality global network
performance. The AWS Cloud infrastructure is built around regions.
3.1 AWS infrastructure features:
Elasticity and scalability
•Elastic infrastructure; dynamic adaptation of capacity
•Scalable infrastructure; adapts to accommodate growth
Fault-tolerance
•Continues operating properly in the presence of a failure
•Built-in redundancy of components
High availability
•High level of operational performance
•Minimized downtime
•No human intervention
Fig-3.1.1 Foundational services
4. AWS CLOUD SECURITY
4.1 AWS cloud security:
Cloud security is a collection of security measures designed to protect cloud-based
infrastructure, applications, and data. AWS provides security tools and services that
help you protect your data, accounts, and workloads from unauthorized access.
4.2 AWS shared responsibility model:
This shared model can help relieve the customer's operational burden, because AWS
operates, manages, and controls the components from the host operating system and
virtualization layer down to the physical security of the facilities in which the services
operate. AWS's responsibility under this model is protecting the infrastructure that
runs all of the services offered in the AWS Cloud. This infrastructure is composed of
the hardware, software, networking, and facilities that run AWS Cloud services.
4.3 IAM:
With AWS Identity and Access Management (IAM) you can specify who or what can
access services and resources in AWS. IAM is a web service that helps you securely
control access to AWS resources. We use IAM to control who is authenticated and
authorized to use resources.
There are two types of IAM policies:
1. AWS managed policies
2. Customer managed policies
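For illustration, a customer managed policy can be created and attached programmatically.
The following is a minimal boto3 sketch, assuming suitable credentials are configured; the
policy name, user name, and bucket ARN are hypothetical placeholders and are not part of
the course material.

# Minimal sketch: create a customer managed IAM policy and attach it to a user.
# The policy name, user name, and bucket ARN below are hypothetical placeholders.
import json
import boto3

iam = boto3.client("iam")

# Allow read-only access to a single (hypothetical) S3 bucket.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": ["arn:aws:s3:::example-bucket", "arn:aws:s3:::example-bucket/*"],
    }],
}

response = iam.create_policy(
    PolicyName="ExampleS3ReadOnly",            # hypothetical policy name
    PolicyDocument=json.dumps(policy_document),
)

iam.attach_user_policy(
    UserName="example-user",                   # hypothetical user
    PolicyArn=response["Policy"]["Arn"],
)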
Securing a new AWS account:
1. Safeguard your password and access keys
2. Activate multi-factor authentication (MFA) on the AWS account root user and any
users with interactive access to IAM
3. Limit AWS account root user access to your resources
4.4 Securing accounts:
We need to secure accounts because if a hacker cracks your passwords, they could gain
access to social media accounts, bank accounts, emails, and all other accounts that hold
personal data. If someone obtains access to that information, you could become the
victim of identity theft. Make your account more secure by:
•Updating account recovery options
•Removing risky access to your data
5. NETWORKING AND CONTENT DELIVERY
5.1 Networking basics
A computer network is two or more client machines that are connected together to
share resources. A network can be logically partitioned into subnets. Networking
requires a networking device (such as a router or switch) to connect all the clients
together and enable communication between them. Each client machine in a network
has a unique Internet Protocol (IP) address that identifies it. A 32-bit IP address is
called an IPv4 address. A 128-bit IP address is called an IPv6 address.
Fig-5.1.1 OSI model
5.2 Amazon VPC:
Amazon Virtual Private Cloud (Amazon VPC) is a service that lets you
provision a logically isolated section of the AWS Cloud (called a virtual private cloud,
or VPC) where you can launch your AWS resources.
5.3 Amazon route 53:
Amazon Route 53 is a highly available and scalable cloud Domain Name System
(DNS) web service. It is designed to give developers and businesses a reliable and
cost-effective way to route users to internet applications by translating names (like
www.example.com) into the numeric IP addresses (like 192.0.2.1) that computers use
to connect to each other.
Amazon Route 53 supports several types of routing policies, which determine how
Amazon Route 53 responds to queries:
•Simple routing (round robin)
•Weighted round robin routing
•Latency routing (LBR)
•Geolocation routing
5.4 Amazon CloudFront
Amazon CloudFront is a fast CDN service that securely delivers data, videos,
applications, and application programming interfaces (APIs) to customers globally
with low latency and high transfer speeds. Amazon CloudFront is a self-service
offering with pay-as-you-go pricing.
6. COMPUTE
6.1 Compute services overview
Amazon Web Services (AWS) offers many compute services. Here is a brief
summary of what each compute service offers:
•Amazon Elastic Compute Cloud (Amazon EC2) provides resizable virtual machines.
•Amazon Elastic Container Registry (Amazon ECR) is used to store and retrieve
Docker images.
•VMware Cloud on AWS enables you to provision a hybrid cloud without custom
hardware.
•AWS Elastic Beanstalk provides a simple way to run and manage web applications.
•AWS Lambda is a serverless compute solution. You pay only for the compute time
that you use.
•Amazon Elastic Kubernetes Service (Amazon EKS) enables you to run managed
Kubernetes on AWS.
•Amazon Lightsail provides a simple-to-use service for building an application or
website.
•AWS Batch provides a tool for running batch jobs at any scale.
•AWS Outposts provides a way to run select AWS services in your on-premises data
center.
Fig-6.1.1 Compute services overview
6.2 Amazon EC2
Amazon Elastic Compute Cloud (Amazon EC2):
• Provides virtual machines—referred to as EC2 instances—in the cloud.
• Gives you full control over the guest operating system (Windows or Linux)
on each instance.
• You can launch instances of any size into an Availability Zone anywhere in
the world.
• Launch instances from Amazon Machine Images (AMIs).
• Launch instances with a few clicks or a line of code, and they are ready in
minutes.
• You can control traffic to and from instances.
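As a small illustration of launching an instance with a line of code, the following is a
minimal boto3 sketch; the AMI ID, key pair, and security group are hypothetical
placeholders, since real values depend on your Region and account.

# Minimal boto3 sketch: launch and then terminate an EC2 instance.
# The AMI ID, key pair, and security group are hypothetical placeholders.
import boto3

ec2 = boto3.client("ec2")

run = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",            # placeholder AMI ID
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1,
    KeyName="example-keypair",                  # placeholder key pair
    SecurityGroupIds=["sg-0123456789abcdef0"],  # placeholder security group
)
instance_id = run["Instances"][0]["InstanceId"]
print("Launched:", instance_id)

# Wait until the instance is running, then terminate it to avoid charges.
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
ec2.terminate_instances(InstanceIds=[instance_id])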
6.3 Container services:
Containers are a method of operating system virtualization. Benefits are:
• Repeatable.
• Self-contained environments.
• Software runs the same in different environments.
• Developer's laptop, test, production.
• Faster to launch and stop or terminate than virtual machines.
7. STORAGE
7.1 Amazon EBS:
Amazon EBS provides persistent block storage volumes for use with Amazon
EC2 instances. Persistent storage is any data storage device that retains data
after power to that device is shut off. It is also sometimes called non-volatile
storage. Each Amazon EBS volume is automatically replicated within its
Availability Zone to protect you from component failure. It is designed for
high availability and durability. Amazon EBS volumes provide the consistent
and low- latency performance that is needed to run your workloads.
7.2 Amazon S3:
Amazon S3 is object storage that is built to store and retrieve any amount of
data from anywhere: websites and mobile apps, corporate applications, and
data from Internet of Things (IoT) sensors or devices. Amazon S3 is object-level
storage, which means that if you want to change a part of a file, you must
make the change and then re-upload the entire modified file. Amazon S3
stores data as objects within resources that are called buckets. The data that
you store in Amazon S3 is not associated with any particular server, and you
do not need to manage any infrastructure yourself. You can put as many objects
into Amazon S3 as you want. Amazon S3 holds trillions of objects and regularly
peaks at millions of requests per second.
7.3 Amazon EFS:
Amazon EFS implements storage for EC2 instances that multiple virtual
machines can access at the same time. It is implemented as a shared file system
that uses the Network File System (NFS) protocol. Amazon Elastic File
System (Amazon EFS) provides simple, scalable, elastic file storage for use
with AWS services and on-premises resources. It offers a simple interface
that enables you to create and configure file systems quickly and easily.
Amazon EFS is built to dynamically scale on demand without disrupting
applications— it will grow and shrink automatically as you add and remove
files. It is designed so that your applications have the storage they need, when
they need it.
8. DATABASES
8.1 Relational database service:
Amazon Relational Database Service (Amazon RDS) is a collection of
managed services that make it simple to set up, operate, and scale
databases in the cloud. To address the challenges of running an unmanaged,
standalone relational database, AWS provides a service that sets up,
operates, and scales the relational database without any ongoing
administration.
Amazon RDS provides cost-efficient and resizable capacity, while
automating time-consuming administrative tasks. Amazon RDS enables you
to focus on your application, so you can give applications the performance,
high availability, security, and compatibility that they need. With Amazon
RDS, your primary focus is your data and optimizing your application.
8.2 Cloud architecture:
•Engage with decision makers to identify the business goal and the
capabilities that need improvement
•Ensure alignment between technology deliverables of a solution and the
business goals.
•Work with delivery teams that are implementing the solution to ensure that
the technology features are appropriate
•A guide for designing infrastructures that are:
✓ Secure
✓ High performing
✓ Resilient
✓ Efficient
•A consistent approach to evaluating and implementing cloud architectures
•A way to provide best practices that were developed through lessons learned
by reviewing customer architectures
8.3 AWS Trusted Advisor:
AWS Trusted Advisor is an online tool that provides real-time guidance to
help you provision your resources following AWS best practices.
1.Cost Optimization: AWS Trusted Advisor looks at your resource use and
makes recommendations to help you optimize cost by eliminating unused
and idle resources, or by making commitments to reserved capacity.
2.Performance: Improve the performance of your service by checking your
service limits, ensuring you take advantage of provisioned throughput, and
monitoring for overutilized instances.
3.Security: Improve the security of your application by closing gaps,
enabling various AWS security features, and examining your permissions.
4.Fault Tolerance: Increase the availability and redundancy of your AWS
application
8.4 Automatic scaling and monitoring:
Elastic Load Balancing:
Distributes incoming application or network traffic across multiple targets in
a single Availability Zone or across multiple Availability Zones.
Scales your load balancer as traffic to your application changes over time.
Types of load balancers:
1. Application load balancer
2. Classic load balancer
3. Network load balancer
9. CLOUD ARCHITECTURE
9.1 Cloud architects:
•Engage with decision makers to identify the business goal and the capabilities
that need improvement.
•Ensure alignment between technology deliverables of a solution and the
business goals.
•Work with delivery teams that are implementing the solution to ensure that
the technology features are appropriate.
•A guide for designing infrastructures that are:
✓ Secure
✓ High performing
✓ Resilient
9.2 Reliability and Availability:
Fig-9.2.1 Reliability and Availability
10. AUTOMATIC SCALING AND MONITORING
10.1 Elastic load balancing:
Distributes incoming application or network traffic across multiple targets in
a single Availability Zone or across multiple Availability Zones.
Types of load balancers:
1. Application load balancer
2. Classic load balancer
3. Network load balancer
Elastic Load Balancing use cases:
1. Highly available and fault-tolerant applications
2. Containerized applications
3. Elasticity and scalability
Fig-10.1.1 Elastic load balance
Fig-10.1.2 Load balancers
10.2 Amazon CloudWatch:
Amazon CloudWatch helps you monitor your AWS resources and the
applications that you run on AWS in real time.
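For illustration, metrics can also be published and read back programmatically. The
following is a minimal boto3 sketch, assuming suitable credentials; the namespace and
metric name are hypothetical placeholders.

# Minimal boto3 sketch: publish a custom CloudWatch metric and read its statistics.
# The namespace and metric name are hypothetical placeholders.
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

# Publish one data point for a custom application metric.
cloudwatch.put_metric_data(
    Namespace="ExampleApp",                       # hypothetical namespace
    MetricData=[{
        "MetricName": "ProcessedRecords",
        "Value": 120,
        "Unit": "Count",
    }],
)

# Retrieve the average of that metric over the last hour.
stats = cloudwatch.get_metric_statistics(
    Namespace="ExampleApp",
    MetricName="ProcessedRecords",
    StartTime=datetime.datetime.utcnow() - datetime.timedelta(hours=1),
    EndTime=datetime.datetime.utcnow(),
    Period=300,
    Statistics=["Average"],
)
print(stats["Datapoints"])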
10.3 Amazon EC2 Auto Scaling:
Scaling is the ability to increase or decrease the compute capacity of your
application. Amazon EC2 Auto Scaling automatically adds or removes EC2
instances according to conditions that you define.
DATA ENGINEERING
11. Introduction to data Engineering
Data engineering is the process of designing and building systems that let
people collect and analyze raw data from multiple sources and formats. These
systems empower people to find practical applications of the data, which
businesses can use to thrive.
Data engineering is a skill that is in increasing demand. Data engineers are
the people who design the system that unifies data and can help you navigate
it.
Data engineers perform many different tasks including:
Acquisition: Finding all the different data sets around the business
Cleansing: Finding and cleaning any errors in the data
Conversion: Giving all the data a common format
Disambiguation: Interpreting data that could be interpreted in
multiple ways
Deduplication: Removing duplicate copies of data
Once this is done, data may be stored in a central repository such as a data
lake or data lakehouse. Data engineers may also copy and move subsets of data
into a data warehouse.
12. Data-Driven Organizations
The Genesis of Data-Driven Organizations:
The genesis of data-driven organizations can be traced back to the dawn
of the information age, where the advent of technology ushered in an era of
unprecedented data proliferation. Initially viewed as mere byproducts of digital
transactions, data soon evolved into a strategic asset, offering insights into
consumer behavior, market trends, and operational efficiency. Organizations began
to recognize the intrinsic value of data, laying the foundation for a paradigm
shift in decision-making – from intuition-driven to data-driven.
The Pillars of Data-Driven Culture:
At the heart of every data-driven organization lies a culture that embraces
data as a strategic imperative. This culture is built upon three foundational
pillars: data literacy, data transparency, and data-driven decision-making. Data
literacy entails equipping employees with the skills and knowledge needed to
understand, interpret, and analyze data effectively. Meanwhile, data
transparency fosters a culture of openness and accountability, where data is
accessible, reliable, and trustworthy. Finally, data-driven decision-making
empowers stakeholders to leverage data insights in every aspect of their
decision-making process, driving innovation and agility across the organization.
The Role of Technology in Data-Driven Transformation:
Technology serves as the backbone of data-driven transformation,
providing the infrastructure, tools, and platforms needed to harness the power of
data effectively. From advanced analytics and artificial intelligence to cloud
computing and big data technologies, organizations leverage a myriad of
technological innovations to unlock the full potential of their data assets.
Moreover, the rise of data management systems, such as data lakes and data
warehouses, enables organizations to consolidate, integrate, and analyze vast
volumes of data in real time, facilitating informed decision-making at scale.
Challenges and Opportunities in the Data-Driven Journey:
Despite the promises of data-driven transformation, organizations face a
myriad of challenges on their journey towards becoming truly data-driven.
These challenges range from data silos and legacy systems to data privacy and
security
concerns. Moreover, cultural resistance and organizational inertia often pose
significant barriers to adoption, hindering the realization of the full potential of
data-driven initiatives. However, amidst these challenges lie boundless
opportunities for innovation, differentiation, and competitive advantage.
Organizations that embrace the data-driven mindset, foster a culture of
experimentation, and invest in the right technologies stand poised to thrive in
the digital age.
The Future of Data-Driven Organizations:
As we gaze into the horizon of the future, the trajectory of data-driven
organizations appears both promising and uncertain. Rapid advancements in
technology, coupled with evolving regulatory landscapes, continue to shape the
contours of the data-driven journey. Moreover, the democratization of data and
the rise of citizen data scientists herald a new era of empowerment, where
insights are no longer confined to the realm of data experts but are accessible to
all. Yet, amidst this uncertainty, one thing remains clear – data will continue to
be the driving force behind organizational innovation, disruption, and
transformation.
13. The Elements of Data
The Essence of Data:
At its essence, data embodies information – raw and unrefined, waiting to
be unlocked, interpreted, and harnessed. It comprises a myriad of elements, each
contributing to its richness, complexity, and utility. From structured data,
characterized by its organized format and predefined schema, to unstructured
data, defined by its fluidity and lack of formal structure, the diversity of data
elements mirrors the diversity of human experiences, thoughts, and interactions.
Structured Data:
Structured data represents the foundation of traditional databases,
characterized by its organized format, predictable schema, and tabular structure.
This form of data lends itself well to relational databases, where information is
stored in rows and columns, enabling efficient storage, retrieval, and analysis.
Examples of structured data include transaction records, customer profiles, and
inventory lists, each meticulously organized to facilitate easy access and
manipulation.
Unstructured Data:
In contrast to structured data, unstructured data defies conventional
categorization, encompassing a wide array of formats, including text, images,
videos, and audio recordings. This form of data lacks a predefined schema,
making it inherently more challenging to analyze and interpret. However, within
the realm of unstructured data lies a treasure trove of insights, waiting to be
unearthed through advanced analytics, natural language processing, and
machine learning algorithms.
Semi-Structured Data:
Semi-structured data occupies a unique space between structured and
unstructured data, combining elements of both. While it may possess some
semblance of organization, such as tags or metadata, it lacks the rigid structure
of traditional databases. Examples of semi-structured data include XML files,
JSON
documents, and log files, each offering a flexible framework for storing and
exchanging information across disparate systems.
The Lifecycle of Data:
Beyond its structural composition, data traverses a lifecycle – from its
inception to its eventual obsolescence. This lifecycle encompasses five distinct
stages: capture, storage, processing, analysis, and dissemination. At each stage,
data undergoes transformation, refinement, and enrichment, evolving from mere
bytes to actionable insights that drive informed decision-making and strategic
planning.
14. Design Principles and Patterns for Data Pipelines
This exploration delves into the intricacies of data pipeline design,
unveiling the principles and patterns that enable organizations to orchestrate
data flows efficiently, reliably, and at scale.
The Foundation of Data Pipeline Design:
At the heart of effective data pipeline design lies a foundation built upon
three fundamental principles: scalability, reliability, and maintainability.
Scalability ensures that data pipelines can handle increasing volumes of data
without compromising performance or efficiency. Reliability guarantees that
data is processed accurately and consistently, even in the face of failures or
disruptions. Maintainability encompasses the ease with which data pipelines can
be modified, extended, and debugged over time, ensuring their longevity and
adaptability in a rapidly evolving landscape.
Design Patterns for Data Pipelines:
To achieve these principles, data engineers rely on a myriad of design
patterns that encapsulate best practices, strategies, and techniques for building
robust and efficient data pipelines. Among these patterns are:
1. Extract, Transform, Load (ETL): This classic pattern involves extracting
data from various sources, transforming it into a structured format, and loading
it into a destination for analysis or storage. ETL pipelines are well-suited for
batch processing scenarios, where data is collected and processed in discrete
chunks (a minimal sketch of this pattern follows this list).
2. Event-Driven Architecture: In contrast to batch processing, event-driven
architecture enables real-time data processing and analysis by reacting to events
as they occur. This pattern is ideal for scenarios where immediate insights or
actions are required, such as fraud detection, monitoring, or recommendation
systems.
3. Lambda Architecture: Combining the strengths of both batch and stream
processing, the Lambda architecture provides a framework for building robust,
fault-tolerant data pipelines that can handle both historical and real-time data.
By leveraging batch and speed layers in parallel, organizations can achieve
comprehensive insights with low latency.
4. Microservices Architecture: In the realm of distributed systems,
microservices architecture offers a modular approach to building data pipelines,
where individual components or services are decoupled and independently
deployable. This pattern enables greater agility, scalability, and fault isolation,
albeit at the cost of increased complexity.
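As a simple illustration of the ETL pattern described in item 1, the following is a
minimal pandas sketch; the file names and column names are hypothetical placeholders
and are not tied to any specific tool covered in the course.

# Minimal ETL sketch: extract from a CSV file, transform it, load it to Parquet.
# The file names and column names are hypothetical. Requires pandas and pyarrow.
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    # Extract: read raw data from the source.
    return pd.read_csv(path)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Transform: clean types, drop bad rows, and derive a new column.
    df = df.dropna(subset=["order_id", "amount"])
    df["order_date"] = pd.to_datetime(df["order_date"])
    df["amount_with_tax"] = df["amount"] * 1.18
    return df

def load(df: pd.DataFrame, path: str) -> None:
    # Load: write the curated data to a columnar format for analysis.
    df.to_parquet(path, index=False)

if __name__ == "__main__":
    load(transform(extract("orders.csv")), "orders_curated.parquet")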
Challenges and Considerations:
Despite the benefits of these design patterns, data pipeline design is not without
its challenges. Organizations must grapple with issues such as data quality,
latency, scalability bottlenecks, and integration complexity. Moreover, as data
pipelines grow in complexity and scale, managing dependencies, orchestrating
workflows, and ensuring end-to-end visibility become increasingly challenging
tasks.
15. Securing and Scaling the data pipeline
The Imperative of Data Pipeline Security:
Data pipeline security encompasses a multifaceted approach to
safeguarding data assets throughout their lifecycle – from ingestion to analysis
to storage. At its core, data pipeline security revolves around three key pillars:
confidentiality, integrity, and availability. Confidentiality ensures that data is
accessible only to authorized users, protecting it from unauthorized access or
disclosure. Integrity guarantees that data remains accurate and trustworthy,
preventing unauthorized modifications or tampering. Availability ensures that
data is accessible and usable when needed, minimizing downtime and
disruptions.
Securing the Data Pipeline:
To secure the data pipeline effectively, organizations must adopt a
layered approach that encompasses both preventive and detective controls. This
includes implementing encryption mechanisms to protect data in transit and at
rest, enforcing access controls to limit user privileges and permissions, and
deploying monitoring and auditing tools to detect and respond to suspicious
activities in real time. Additionally, organizations must adhere to industry
standards and regulatory requirements, such as GDPR, HIPAA, and PCI DSS, to
ensure compliance and mitigate legal and reputational risks.
Scaling the Data Pipeline:
Scaling the data pipeline involves expanding its capacity and capabilities
to accommodate growing volumes of data, users, and workloads. This requires a
strategic approach that encompasses both vertical and horizontal scaling
techniques. Vertical scaling involves adding more resources, such as CPU,
memory, or storage, to existing infrastructure to handle increased demand.
Horizontal scaling, on the other hand, involves distributing workloads across
multiple nodes or instances, enabling parallel processing and improved
performance.
Scalability Considerations:
While scalability unlocks new opportunities for growth and innovation, it
also introduces a host of challenges and considerations. Organizations must
carefully evaluate factors such as data volume, velocity, variety, and veracity, as
well as the underlying infrastructure, architecture, and technology stack.
Moreover, as data pipelines scale in complexity and size, managing
dependencies, optimizing performance, and ensuring fault tolerance become
increasingly critical tasks.
Emerging Technologies and Best Practices:
To address these challenges, organizations are turning to emerging
technologies and best practices that offer scalable, secure, and efficient solutions
for data pipeline management. This includes the adoption of cloud-native
architectures, containerization technologies such as Docker and Kubernetes,
serverless computing platforms like AWS Lambda and Google Cloud
Functions, and distributed processing frameworks such as Apache Spark and
Apache Flink. Additionally, organizations are leveraging DevOps practices,
automation tools, and infrastructure-as-code principles to streamline deployment,
monitoring, and management of data pipelines.
16. Ingesting and preparing data
The Foundation of Data Ingestion:
Data ingestion serves as the gateway through which raw data enters the
organizational ecosystem, encompassing the processes and technologies
involved in extracting, transporting, and loading data from various sources into a
centralized repository. At its core, effective data ingestion revolves around three
key objectives: speed, scalability, and reliability. Speed ensures that data is
ingested in a timely manner, enabling real-time or near-real-time analytics.
Scalability guarantees that data pipelines can handle increasing volumes of data
without sacrificing performance or efficiency. Reliability ensures that data is
ingested accurately and consistently, minimizing data loss or corruption.
Strategies for Data Ingestion:
To achieve these objectives, organizations employ a variety of strategies and
technologies for data ingestion, each tailored to the unique requirements and
characteristics of their data ecosystem. This includes:
1. Batch Processing: Batch processing involves ingesting data in discrete
chunks or batches at scheduled intervals. This approach is well-suited for
scenarios where data latency is acceptable, such as historical analysis or batch
reporting.
2. Stream Processing: Stream processing enables the ingestion of data in real-
time as it is generated or produced. This approach is ideal for scenarios where
immediate insights or actions are required, such as monitoring, anomaly
detection, or fraud detection.
3. Change Data Capture (CDC): CDC techniques capture and replicate
incremental changes to data sources, enabling organizations to ingest only the
modified or updated data, rather than the entire dataset. This minimizes
processing overhead and reduces latency, making it well-suited for scenarios
where data freshness is critical.
Data Preparation:
Once data is ingested into the organizational ecosystem, it must undergo a
process of preparation to make it suitable for analysis, modeling, or
visualization. Data preparation encompasses a range of activities, including
cleaning, transforming, enriching, and aggregating data to ensure its quality,
consistency, and relevance. This process is often iterative and involves
collaboration between data engineers, data scientists, and domain experts to
identify, understand, and address data quality issues and inconsistencies.
Technologies for Data Preparation:
To facilitate data preparation, organizations leverage a variety of technologies
and tools that automate and streamline the process. This includes:
1. Data Integration Platforms: Data integration platforms provide a unified
environment for orchestrating data ingestion, transformation, and loading tasks
across disparate sources and destinations. These platforms offer features such as
data profiling, data cleansing, and data enrichment to ensure data quality and
consistency.
2. Data Wrangling Tools: Data wrangling tools empower users to visually
explore, clean, and transform data without writing code. These tools offer
intuitive interfaces and built-in algorithms for tasks such as missing value
imputation, outlier detection, and feature engineering, enabling users to prepare
data more efficiently and effectively.
3. Data Preparation Libraries: Data preparation libraries, such as Pandas in
Python or Apache Spark's DataFrame API, provide programmable interfaces for
manipulating and transforming data at scale. These libraries offer a rich set of
functions and transformations for tasks such as filtering, grouping, and joining
data, enabling users to perform complex data preparation tasks with ease.
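As a small illustration of such library-based preparation, the following is a minimal
pandas sketch over a hypothetical customer dataset; the file and column names are
placeholders used only for illustration.

# Minimal data preparation sketch with pandas; file and column names are hypothetical.
import pandas as pd

df = pd.read_csv("customers_raw.csv")          # hypothetical input file

# Cleaning: drop exact duplicates and normalize column names.
df = df.drop_duplicates()
df.columns = [c.strip().lower() for c in df.columns]

# Missing value imputation: fill numeric gaps with the median.
df["age"] = df["age"].fillna(df["age"].median())

# Type conversion and simple validation.
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df = df[df["age"].between(0, 120)]

# Enrichment: derive a simple feature for downstream analysis.
df["tenure_days"] = (pd.Timestamp.today() - df["signup_date"]).dt.days

df.to_csv("customers_clean.csv", index=False)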
17. Ingesting by Batch or by Stream
Understanding Batch Data Ingestion:
Batch data ingestion involves collecting and processing data in discrete
chunks or batches at scheduled intervals. This approach is well-suited for
scenarios where data latency is acceptable, such as historical analysis, batch
reporting, or periodic updates. Batch data ingestion offers several advantages,
including simplicity, scalability, and fault tolerance. By processing data in
predefined batches, organizations can optimize resource utilization, minimize
processing overhead, and ensure consistent performance even in the face of
failures or disruptions.
Strategies for Batch Data Ingestion:
To implement batch data ingestion effectively, organizations employ a
variety of strategies and technologies tailored to their specific requirements and
use cases. This includes:
1. Extract, Transform, Load (ETL): ETL processes involve extracting data
from various sources, transforming it into a structured format, and loading it into
a destination for analysis or storage. This approach is well-suited for scenarios
where data needs to be cleansed, aggregated, or enriched before further
processing.
2.Batch Processing Frameworks: Batch processing frameworks, such as
Apache Hadoop or Apache Spark, provide distributed computing capabilities for
processing large volumes of data in parallel. These frameworks offer features
such as fault tolerance, data locality optimization, and job scheduling, making
them well-suited for batch data ingestion tasks.
Exploring Stream Data Ingestion:
In contrast to batch data ingestion, stream data ingestion involves
processing data in real-time as it is generated or produced. This approach is ideal
for scenarios where immediate insights or actions are required, such as
monitoring, anomaly detection, or fraud detection. Stream data ingestion offers
several advantages, including low latency, continuous processing, and real-time
responsiveness. By ingesting and processing data in real-time, organizations can
react to events as they occur, enabling faster decision-making and proactive
intervention.
Strategies for Stream Data Ingestion:
To implement stream data ingestion effectively, organizations leverage a
variety of strategies and technologies that enable real-time data processing and
analysis. This includes:
1. Event-Driven Architectures: Event-driven architectures enable organizations
to ingest and process data in real-time in response to events or triggers. This
approach is well-suited for scenarios where immediate action is required, such
as IoT applications, real-time monitoring, or financial transactions processing.
2. Stream Processing Frameworks: Stream processing frameworks, such as
Apache Kafka or Apache Flink, provide distributed computing capabilities for
processing continuous streams of data in real-time. These frameworks offer
features such as fault tolerance, event time processing, and windowing
semantics, making them well-suited for stream data ingestion tasks.
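To make stream ingestion concrete in an AWS context, the following is a minimal boto3
sketch that writes events to an Amazon Kinesis data stream, used here as an AWS-managed
alternative to the Kafka/Flink frameworks named above; the stream name and event fields
are hypothetical and the stream is assumed to already exist.

# Minimal stream-ingestion sketch: send JSON events to a Kinesis data stream.
# Assumes a stream named "example-events" already exists; names are hypothetical.
import json
import time
import boto3

kinesis = boto3.client("kinesis")

for i in range(10):
    event = {"device_id": f"sensor-{i % 3}", "reading": 20 + i, "ts": time.time()}
    kinesis.put_record(
        StreamName="example-events",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["device_id"],   # keeps a device's events on the same shard
    )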
Choosing the Optimal Approach:
The choice between batch and stream data ingestion depends on several
factors, including data latency requirements, processing complexity, resource
constraints, and use case objectives. While batch data ingestion offers
simplicity, scalability, and fault tolerance, stream data ingestion offers low
latency, continuous processing, and real-time responsiveness. Organizations
must carefully evaluate these factors and choose the optimal approach that
aligns with their unique requirements and objectives.
18. Storing and Organizing Data
The Foundation of Data Storage:
Data storage serves as the bedrock upon which the data ecosystem is
built, encompassing the processes and technologies involved in persisting and
retrieving data in a reliable and efficient manner. At its core, effective data
storage revolves around three key objectives: scalability, durability, and
accessibility. Scalability ensures that data storage solutions can accommodate
growing volumes of data without sacrificing performance or reliability.
Durability guarantees that data is protected against loss or corruption, even in
the face of hardware failures or disasters. Accessibility ensures that data is
readily available and accessible to authorized users, regardless of time, location,
or device.
1. Relational Databases: Relational databases provide a structured and
organized approach to storing data, using tables, rows, and columns to represent
data entities and relationships. This approach is well-suited for scenarios where
data integrity, consistency, and relational querying capabilities are paramount.
2. NoSQL Databases: NoSQL databases offer a flexible and scalable approach
to storing and querying unstructured or semi-structured data, using document,
key-value, column-family, or graph-based data models. This approach is well-
suited for scenarios where data volumes are large, data schemas are dynamic, or
horizontal scalability is required.
Organizing Data for Efficiency:
In addition to storage considerations, organizing data effectively is critical to
ensuring its usability, discoverability, and maintainability. Data organization
encompasses the processes and methodologies involved in structuring,
categorizing, and indexing data to facilitate efficient retrieval and analysis.
This includes:
1. Data Modeling: Data modeling involves defining the structure, relationships,
and constraints of data entities and attributes, typically using entity-relationship
diagrams, schema definitions, or object-oriented models. This approach helps
ensure data consistency, integrity, and interoperability across the organization.
2. Data Partitioning: Data partitioning involves dividing large datasets into
smaller, more manageable partitions based on certain criteria, such as time,
geography, or key ranges. This approach helps distribute data processing and
storage resources more evenly, improving performance, scalability, and
availability.
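As an illustration of partitioning, the sketch below writes a small, hypothetical dataset
to Parquet partitioned by year and month using pandas with pyarrow; the data and column
names are placeholders.

# Minimal data-partitioning sketch: write Parquet files partitioned by year and month
# so that queries can skip irrelevant partitions. Requires pandas and pyarrow.
import pandas as pd

df = pd.DataFrame({
    "event_date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03"]),
    "value": [10, 20, 30],
})
df["year"] = df["event_date"].dt.year
df["month"] = df["event_date"].dt.month

# Creates a directory tree like events/year=2024/month=1/part-*.parquet
df.to_parquet("events", partition_cols=["year", "month"], index=False)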
Technologies for Data Storage and Organization:
To facilitate data storage and organization effectively, organizations leverage a
variety of technologies and tools that offer scalability, reliability, and flexibility.
This includes:
1. Cloud Storage Services: Cloud storage services, such as Amazon S3,
Google Cloud Storage, or Microsoft Azure Blob Storage, provide scalable and
durable storage solutions for storing and managing data in the cloud. These
services offer features such as encryption, versioning, and lifecycle
management, making them well-suited for a wide range of use cases.
2. Data Lakes: Data lakes provide a centralized repository for storing and
managing large volumes of structured, semi-structured, and unstructured data in
its native format. This approach enables organizations to ingest, store, and
analyze diverse datasets without the need for predefined schemas or data
transformations.
19. Processing Big Data
To overcome the challenges that big data presents, organizations employ a variety of strategies and
technologies for big data processing, each tailored to the unique requirements
and characteristics of their data ecosystem. This includes:
1. Distributed Computing: Distributed computing frameworks, such as
Apache Hadoop, Apache Spark, and Apache Flink, provide scalable and fault-
tolerant platforms for processing big data in parallel across distributed clusters
of commodity hardware. These frameworks offer features such as distributed
storage, data locality optimization, and fault tolerance, making them well-suited
for batch and stream processing of big data (a PySpark sketch follows this list).
2. In-Memory Processing: In-memory processing technologies, such as
Apache Ignite, Apache Arrow, and Redis, leverage the power of RAM to
accelerate data processing and analysis by keeping data in memory rather than
accessing it from disk. This approach enables faster query execution, iterative
processing, and real-time analytics, making it well-suited for interactive and
exploratory data analysis.
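The following is a minimal PySpark sketch of the distributed batch processing described
in item 1; the input path and column names are hypothetical, and the sketch illustrates
only the programming model, not a tuned production job.

# Minimal PySpark sketch: read a CSV in parallel and aggregate it.
# The input path and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("big-data-sketch").getOrCreate()

sales = spark.read.csv("s3://example-bucket/sales.csv", header=True, inferSchema=True)

# Aggregate revenue per region; Spark distributes the work across executors.
revenue = (
    sales.groupBy("region")
         .agg(F.sum("amount").alias("total_revenue"))
         .orderBy(F.desc("total_revenue"))
)
revenue.show()

spark.stop()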
Architectures for Big Data Processing:
In addition to processing strategies, organizations must design architectures that
enable efficient and scalable big data processing workflows. This includes:
1. Lambda Architecture: The Lambda architecture provides a framework for
building robust and fault-tolerant big data processing pipelines that can handle
both batch and stream processing of data. By combining batch and speed layers
in parallel, organizations can achieve comprehensive insights with low latency,
enabling real-time and near-real-time analytics.
2. Kappa Architecture: The Kappa architecture offers a simplified alternative
to the Lambda architecture by eliminating the batch layer and relying solely on
stream processing for data ingestion and analysis. This approach simplifies the
architecture, reduces complexity, and enables faster time-to-insight, making it
well-suited for real-time analytics and event-driven applications.
Best Practices for Big Data Processing:
To optimize big data processing workflows, organizations should adhere to
several best practices, including:
1. Data Partitioning and Sharding: Partitioning large datasets into smaller,
more manageable chunks enables parallel processing and improves scalability
and performance. By dividing data based on certain criteria, such as time,
geography, or key ranges, organizations can distribute processing and storage
resources more evenly, minimizing bottlenecks and contention.
2. Data Compression and Serialization: Compressing and serializing data
before processing reduces storage and bandwidth requirements, improves data
transfer speeds, and accelerates query execution. By using efficient compression
algorithms and serialization formats, such as Apache Avro or Protocol Buffers,
organizations can minimize data footprint and optimize resource utilization.
20. Processing Data for ML
The Role of Data Processing in Machine Learning:
Data processing for machine learning encompasses the techniques and
technologies involved in preparing, cleaning, and transforming raw data into a
format suitable for training machine learning models. At its core, data
processing serves several critical functions, including feature extraction,
normalization, and dimensionality reduction. These preprocessing steps are
essential for improving model performance, reducing overfitting, and ensuring
generalization to unseen data.
Strategies for Data Processing in Machine Learning:
To achieve these objectives, organizations employ a variety of strategies and
techniques for data processing in machine learning, each tailored to the unique
requirements and characteristics of their data ecosystem. This includes:
1. Feature Engineering: Feature engineering involves selecting, extracting,
and transforming relevant features from raw data to facilitate model learning
and prediction. This may include numerical features, categorical features, text
features, or image features, depending on the nature of the data and the specific
machine learning task.
2. Data Normalization: Data normalization techniques, such as min-max
scaling or standardization, ensure that input features are on a similar scale,
preventing certain features from dominating the learning process and improving
model convergence and stability.
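As an example of normalization, the following scikit-learn sketch applies min-max scaling
and standardization to a small, hypothetical feature matrix; the values are illustrative only.

# Minimal normalization sketch using scikit-learn; the feature values are hypothetical
# and only contrast min-max scaling with standardization.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[25.0, 50000.0],
              [32.0, 64000.0],
              [47.0, 120000.0]])   # e.g. age and income on very different scales

print(MinMaxScaler().fit_transform(X))    # each feature rescaled to [0, 1]
print(StandardScaler().fit_transform(X))  # each feature to zero mean, unit variance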
Architectures for Data Processing in Machine Learning:
In addition to processing strategies, organizations must design architectures that
enable efficient and scalable data processing workflows for machine learning.
This includes:
1. Data Pipelines: Data pipelines provide a structured framework for
orchestrating data processing tasks, from ingestion to preparation to training. By
automating and streamlining the data processing workflow, organizations can
ensure consistency, reproducibility, and scalability in machine learning model
development.
2. Model Serving Infrastructure: Model serving infrastructure enables
organizations to deploy trained machine learning models into production
environments, where they can serve real-time predictions or batch inference
requests. By decoupling model inference from model training, organizations can
achieve greater flexibility, scalability, and reliability in deploying machine
learning solutions.
21. Analyzing and Visualizing data
The Role of Data Analysis and Visualization:
Data analysis and visualization encompass the techniques and technologies
involved in exploring, summarizing, and communicating insights from raw data
in a visual and intuitive manner. At its core, data analysis serves several critical
functions, including descriptive analysis, diagnostic analysis, predictive
analysis, and prescriptive analysis. These analyses enable organizations to
uncover patterns, trends, anomalies, and relationships within their data,
providing a foundation for informed decision-making and strategic planning.
Strategies for Data Analysis:
1. Descriptive Analysis: Descriptive analysis involves summarizing and
aggregating data to provide a high-level overview of key metrics, trends, and
distributions. This may include summary statistics, frequency distributions,
histograms, or heatmaps, depending on the nature of the data and the specific
analysis objectives.
2. Diagnostic Analysis: Diagnostic analysis focuses on understanding the root
causes of observed patterns or anomalies within the data. This may involve
hypothesis testing, correlation analysis, regression analysis, or causal inference
techniques to identify relationships and dependencies between variables.
Strategies for Data Visualization:
1. Charts and Graphs: Charts and graphs are powerful tools for visualizing
patterns, trends, and relationships within the data. This may include bar charts,
line charts, scatter plots, pie charts, or box plots, each offering unique
advantages for representing different types of data and analysis objectives.
2. Dashboards: Dashboards provide a centralized and interactive interface for
visualizing and exploring data in real-time. This may include interactive charts,
tables, maps, or widgets, enabling users to drill down into specific data subsets,
filter data based on criteria, and gain deeper insights into key metrics and KPIs.
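For illustration, the following is a minimal matplotlib sketch of a bar chart and a line
chart over hypothetical monthly figures; the numbers are sample data, not results from the
internship labs.

# Minimal visualization sketch with matplotlib; the monthly figures are hypothetical.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [120, 150, 140, 180]
orders = [300, 340, 320, 400]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.bar(months, revenue, color="steelblue")   # distribution across categories
ax1.set_title("Monthly revenue")
ax1.set_ylabel("Revenue (thousands)")

ax2.plot(months, orders, marker="o")          # trend over time
ax2.set_title("Monthly orders")
ax2.set_ylabel("Orders")

fig.tight_layout()
fig.savefig("dashboard.png")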
Best Practices for Data Analysis and Visualization:
To optimize data analysis and visualization workflows, organizations should
adhere to several best practices, including:
1. Audience-Centric Design: Designing visualizations with the end-user in
mind ensures that insights are communicated effectively and resonate with the
intended audience. Organizations should consider factors such as audience
demographics, preferences, and prior knowledge when designing visualizations.
2. Iterative Exploration: Data analysis and visualization are iterative processes
that require continuous exploration and refinement. Organizations should
encourage a culture of experimentation and iteration, where insights are refined
based on feedback, new data, and evolving analysis objectives.
22. Automating the pipeline
Pipeline automation encompasses the techniques and technologies involved in
orchestrating, scheduling, and monitoring data processing tasks across the data
pipeline lifecycle. At its core, pipeline automation serves several critical
functions, including:
1. Workflow Orchestration: Automating the sequencing and dependencies of
data processing tasks ensures that they are executed in the correct order and at
the appropriate times, minimizing delays, errors, and resource contention.
2. Resource Management: Automating the allocation and deallocation of
computing resources, such as CPU, memory, and storage, ensures that data
processing tasks have access to the necessary resources to execute efficiently
and reliably.
3. Monitoring and Alerting: Automating the monitoring of data pipeline
performance and health enables organizations to detect anomalies, errors, and
failures in real-time and take corrective actions proactively, minimizing
downtime and disruptions.
Strategies for Pipeline Automation:
1. Workflow Management Systems: Workflow management systems, such as
Apache Airflow, Apache NiFi, or Luigi, provide a centralized platform for
defining, scheduling, and executing data processing workflows. These systems
offer features such as task dependencies, scheduling, retry mechanisms, and
monitoring capabilities, making them well-suited for orchestrating complex data
pipelines (a minimal DAG sketch follows this list).
2. Infrastructure as Code: Infrastructure as code (IaC) frameworks, such as
Terraform or AWS CloudFormation, enable organizations to automate the
provisioning and configuration of computing resources, such as virtual
machines, containers, or serverless functions, needed to execute data processing
tasks. By defining infrastructure as code, organizations can ensure consistency,
reproducibility, and scalability in their data pipeline deployments.
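To make workflow orchestration concrete, the following is a minimal Apache Airflow sketch
of a daily three-task pipeline (ingest, transform, load); the task bodies are hypothetical
placeholders for real pipeline code.

# Minimal Airflow DAG sketch: three dependent tasks scheduled daily.
# The task functions are placeholders for real ingestion/transform/load logic.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest():
    print("ingest raw data")

def transform():
    print("transform and validate data")

def load():
    print("load curated data into the warehouse")

with DAG(
    dag_id="example_daily_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="ingest", python_callable=ingest)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)

    t1 >> t2 >> t3   # ingest must finish before transform, then load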
Best Practices for Pipeline Automation:
1. Modular Design: Breaking down data processing tasks into smaller, modular
components enables organizations to build reusable, composable workflows that
can be easily scaled, extended, and maintained over time. This promotes
flexibility, agility, and maintainability in data pipeline development.
2. Continuous Integration and Deployment: Adopting continuous integration
and deployment (CI/CD) practices enables organizations to automate the testing,
validation, and deployment of data pipeline changes in a rapid and reliable
manner. By automating the deployment pipeline, organizations can reduce
manual errors, accelerate time-to-market, and improve overall pipeline
reliability and stability.
Challenges and Considerations:
Despite the benefits of pipeline automation, organizations must contend with
several challenges and considerations, including:
1. Complexity: Automating complex data pipelines with heterogeneous data
sources, dependencies, and processing requirements can be challenging and
require careful planning, design, and implementation.
2. Security and Compliance: Ensuring the security and compliance of
automated data pipelines is paramount, particularly when dealing with sensitive
or regulated data. Organizations must implement robust access controls,
encryption mechanisms, and auditing capabilities to protect data privacy and
mitigate regulatory risks.
LABS
LAB 1
AMAZON S3
Amazon Simple Storage Service (Amazon S3) is an object storage service that
offers industry-leading scalability, data availability, security, and performance.
Customers of all sizes and industries can use Amazon S3 to store and protect
any amount of data for a range of use cases, such as data lakes, websites, mobile
applications, backup and restore, archive, enterprise applications, IoT devices,
and big data analytics. Amazon S3 provides management features so that you
can optimize, organize, and configure access to your data to meet your specific
business, organizational, and compliance requirements. [2]
Amazon S3 is an object storage service that stores data as objects within
buckets. An object is a file and any metadata that describes the file. A bucket is
a container for objects. To store your data in Amazon S3, you first create a
bucket and specify a bucket name and AWS Region. Then, you upload your
data to that bucket as objects in Amazon S3. Each object has a key (or key
name), which is the unique identifier for the object within the bucket.
S3 provides features that you can configure to support your specific use case.
For example, you can use S3 Versioning to keep multiple versions of an object
in the same bucket, which allows you to restore objects that are accidentally
deleted or overwritten. Buckets and the objects in them are private and can be
accessed only if you explicitly grant access permissions. You can use bucket
policies, AWS Identity and Access Management (IAM) policies, access
control lists (ACLs), and S3 Access Points to manage access.
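
The same workflow can be scripted. The following is a minimal sketch using boto3 (the AWS SDK for Python); the bucket name, Region, and file names are placeholders, and a real bucket name must be globally unique.

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Create a bucket (bucket names must be globally unique).
s3.create_bucket(Bucket="my-example-data-lake-bucket")

# Upload a local file as an object; the key is its unique identifier in the bucket.
s3.upload_file(
    Filename="sales_2024.csv",
    Bucket="my-example-data-lake-bucket",
    Key="raw/sales_2024.csv",
)

# Enable versioning so accidentally deleted or overwritten objects can be restored.
s3.put_bucket_versioning(
    Bucket="my-example-data-lake-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)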
LAB 2
AMAZON ATHENA:
Amazon Athena is an interactive query service that makes it easy to analyze
data in Amazon S3 using standard SQL. Athena is serverless, so there is no
infrastructure to manage, and you pay only for the queries that you run. Athena
is easy to use. Simply point to your data in Amazon S3, define the schema, and
start querying using standard SQL. Most results are delivered within seconds.
With Athena, there’s no need for complex ETL jobs to prepare your data for
analysis. This makes it easy for anyone with SQL skills to quickly analyze
large-scale datasets. [3]
Athena is out-of-the-box integrated with AWS Glue Data Catalog, allowing
you to create a unified metadata repository across various services, crawl data
sources to discover schemas and populate your Catalog with new and modified
table and partition definitions, and maintain schema versioning.
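
As a rough illustration of this workflow, the sketch below submits a standard SQL query from Python with boto3 and polls for the result; the database, table, and results bucket are assumed placeholders.

import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT * FROM sales LIMIT 10;",
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
query_id = response["QueryExecutionId"]

# Wait for the query to finish, then fetch the first page of results.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(rows[:3])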
Fig-2.1 Amazon Athena
LAB-3
AMAZON GLUE:
AWS Glue is a serverless data integration service that makes it easier to
discover, prepare, move, and integrate data from multiple sources for analytics,
machine learning (ML), and application development.[4]
Preparing your data to obtain quality results is the first step in an analytics or
ML project. AWS Glue is a serverless data integration service that makes data
preparation simpler, faster, and cheaper. You can discover and connect to over
70 diverse data sources, manage your data in a centralized data catalog, and
visually create, run, and monitor ETL pipelines to load data into your data lakes.
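
A minimal sketch of cataloguing S3 data with a Glue crawler through boto3 is shown below; the crawler name, IAM role ARN, database, and S3 path are illustrative assumptions.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="sales-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
    DatabaseName="analytics_db",
    Targets={"S3Targets": [{"Path": "s3://my-example-data-lake-bucket/raw/"}]},
)

# Run the crawler; discovered table schemas are written to the Glue Data Catalog.
glue.start_crawler(Name="sales-data-crawler")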
Fig-3.1 Amazon Glue
LAB-4
AMAZON REDSHIFT
Amazon Redshift uses SQL to analyze structured and semi-structured data
across data warehouses, operational databases, and data lakes, using AWS-
designed hardware and machine learning to deliver the best price performance
at any scale. [5]
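
The sketch below shows one way to run such a SQL query from Python with the Redshift Data API via boto3; the cluster identifier, database, user, and query are placeholders.

import time
import boto3

rsd = boto3.client("redshift-data", region_name="us-east-1")

resp = rsd.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="SELECT region, SUM(amount) AS total FROM sales GROUP BY region;",
)

# Wait for the statement to finish, then read the result set.
while rsd.describe_statement(Id=resp["Id"])["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)

result = rsd.get_statement_result(Id=resp["Id"])
print(result["Records"][:5])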
Fig-4.1 Amazon Redshift
LAB-5
ANALYSE DATA WITH AMAZON SAGEMAKER:
Amazon SageMaker is a fully managed machine learning service. With
SageMaker, data scientists and developers can quickly and easily build and train
machine learning models, and then directly deploy them into a production-ready
hosted environment. It provides an integrated Jupyter authoring notebook
instance for easy access to your data sources for exploration and analysis, so
you don't have to manage servers. It also provides common machine learning
algorithms that are optimized to run efficiently against extremely large data in a
distributed environment. With native support for bring-your-own-algorithms
and frameworks, SageMaker offers flexible distributed training options that
adjust to your specific workflows. Deploy a model into a secure and scalable
environment by launching it with a few clicks from SageMaker Studio or the
SageMaker console. Training and hosting are billed by minutes of usage, with
no minimum fees and no upfront commitments. [6]
Creating a Jupyter notebook with Amazon SageMaker
1. Open the notebook instance as follows:
a. Sign in to the SageMaker console at
https://console.aws.amazon.com/sagemaker/.
b. On the Notebook instances page, open your notebook instance by choosing
either Open JupyterLab for the JupyterLab interface or Open Jupyter for the
classic Jupyter view.
2. Create a notebook as follows:
a. If you opened the notebook in the JupyterLab view, on the File menu, choose
New, and then choose Notebook. For Select Kernel, choose conda_python3.
This preinstalled environment includes the default Anaconda installation and
Python 3.
b. If you opened the notebook in the classic Jupyter view, on the Files tab,
choose New, and then choose conda_python3. This preinstalled environment
includes the default Anaconda installation and Python 3.
3. Save the notebook as follows:
a. In the JupyterLab view, choose File, choose Save Notebook As..., and then rename the notebook.
b. In the Jupyter classic view, choose File, choose Save as..., and then rename the notebook.
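
Once the conda_python3 notebook is open, a typical first cell loads data for exploration. The following is a minimal sketch, assuming a CSV object in S3 and that the s3fs dependency is available in the kernel environment; the bucket and key are placeholders.

import pandas as pd

# pandas can read directly from S3 when s3fs is installed in the kernel environment.
df = pd.read_csv("s3://my-example-data-lake-bucket/raw/sales_2024.csv")

print(df.shape)       # rows and columns
print(df.describe())  # quick summary statistics for exploration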
LAB-6
LOAD DATA USING PIPELINE:
AWS Data Pipeline is a web service that you can use to automate the movement and transformation of data. With AWS Data Pipeline, you can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks. You define the parameters of your data transformations and AWS Data Pipeline enforces the logic that you've set up.
The following components of AWS Data Pipeline work together to manage your data:
• A pipeline definition specifies the business logic of your data management. For more information, see Pipeline Definition File Syntax.
• A pipeline schedules and runs tasks by creating Amazon EC2 instances to perform the defined work activities. You upload your pipeline definition to the pipeline, and then activate the pipeline. You can edit the pipeline definition for a running pipeline and activate the pipeline again for it to take effect. You can deactivate the pipeline, modify a data source, and then activate the pipeline again. When you are finished with your pipeline, you can delete it.
• Task Runner polls for tasks and then performs those tasks. For example, Task Runner could copy log files to Amazon S3 and launch Amazon EMR clusters. Task Runner is installed and runs automatically on resources created by your pipeline definitions. You can write a custom task runner application, or you can use the Task Runner application that is provided by AWS Data Pipeline. For more information, see Task Runners.
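
The sketch below outlines how a pipeline could be created, given a definition, and activated with boto3. It is intentionally skeletal: the pipeline name, IAM roles, and definition fields are illustrative assumptions, and a real definition would also declare data nodes and activities.

import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

# Create an empty pipeline and capture its identifier.
pipeline_id = dp.create_pipeline(name="copy-logs-pipeline", uniqueId="copy-logs-001")["pipelineId"]

# Upload a (skeletal) pipeline definition.
dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {
            "id": "Default",
            "name": "Default",
            "fields": [
                {"key": "scheduleType", "stringValue": "ondemand"},
                {"key": "role", "stringValue": "DataPipelineDefaultRole"},
                {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
            ],
        },
    ],
)

# Activate the pipeline so Task Runner can start polling for work.
dp.activate_pipeline(pipelineId=pipeline_id)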
Fig-6.1 Loading Data
LAB-7
ANALYSE STREAMING DATA:
Amazon Kinesis Data Firehose is an extract, transform, and load (ETL) service that reliably captures, transforms, and delivers streaming data to data lakes, data stores, and analytics services.
Amazon Kinesis is a suite of services for processing streaming data. With Amazon Kinesis, you can ingest real-time data such as video, audio, website clickstreams, or application logs. You can process and analyze the data as it arrives, instead of capturing it all to storage before you begin analysis.
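
As a minimal illustration of the ingestion side, the sketch below sends one JSON record to an existing Firehose delivery stream with boto3; the stream name and payload are placeholders.

import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

event = {"page": "/checkout", "user_id": 42, "timestamp": "2024-01-01T12:00:00Z"}

# Firehose buffers records and delivers them to the configured destination (for example, S3).
firehose.put_record(
    DeliveryStreamName="clickstream-delivery",  # hypothetical stream
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)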
Fig-7.1 Analysing data
LAB-8
ANALYSE IOT DATA WITH AWS IOT ANALYTICS:
AWS IoT Analytics automates the steps required for analyzing IoT data. You can filter, transform, and enrich the data before storing it in a time-series data store. AWS IoT Core provides connectivity between IoT devices and AWS services. IoT Core is fully integrated with IoT Analytics.
IoT data is highly unstructured, which makes it difficult to analyze with traditional analytics and business intelligence tools that are designed to process structured data. IoT data comes from devices that often record fairly noisy processes (such as temperature, motion, or sound). The data from these devices can frequently have significant gaps, corrupted messages, and false readings that must be cleaned up before analysis can occur. Also, IoT data is often only meaningful in the context of additional, third-party data inputs. For example, to help farmers determine when to water their crops, vineyard irrigation systems often enrich moisture sensor data with rainfall data from the vineyard, allowing for more efficient water usage while maximizing harvest yield. [7]
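
A heavily simplified sketch of wiring a channel, data store, and pipeline together with boto3 is shown below; all resource names are placeholders, and the filtering and enrichment activities described above are omitted for brevity.

import boto3

iota = boto3.client("iotanalytics", region_name="us-east-1")

# A channel ingests raw device messages; a data store holds the processed output.
iota.create_channel(channelName="vineyard_sensor_channel")
iota.create_datastore(datastoreName="vineyard_sensor_datastore")

# A pipeline connects the channel to the data store (transform steps would go in between).
iota.create_pipeline(
    pipelineName="vineyard_sensor_pipeline",
    pipelineActivities=[
        {"channel": {"name": "ingest", "channelName": "vineyard_sensor_channel", "next": "store"}},
        {"datastore": {"name": "store", "datastoreName": "vineyard_sensor_datastore"}},
    ],
)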
Fig-8.1 IOT Data
CASE STUDY
Predictive Maintenance in Manufacturing
Abstract: This case study focuses on implementing predictive maintenance
strategies in a manufacturing plant to minimize downtime, reduce maintenance
costs, and enhance operational efficiency. By leveraging data analytics and
machine learning, the manufacturing company aims to predict equipment
failures before they occur, allowing for timely maintenance and optimization of
production processes.
Problem: The manufacturing plant experiences unexpected equipment failures,
leading to unplanned downtime, production delays, and increased maintenance
expenses. The lack of predictive maintenance capabilities results in reactive
maintenance practices, negatively impacting productivity and profitability.
Solution:
Data Collection: Utilize IoT sensors and monitoring systems installed on machinery to collect real-time data on equipment performance, operating conditions, and sensor readings. Data can include temperature, vibration, pressure, fluid levels, and other relevant parameters.
Data Preprocessing: Cleanse and preprocess raw sensor data to remove noise,
handle missing values, and standardize formats. Use tools like Pandas and
NumPy in Python for data manipulation and preprocessing tasks.
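
A minimal sketch of this preprocessing step is shown below, assuming a hypothetical sensor_readings.csv file with timestamp, temperature, and vibration columns.

import pandas as pd

df = pd.read_csv("sensor_readings.csv", parse_dates=["timestamp"])

# Standardize formats and handle missing values.
df = df.sort_values("timestamp").set_index("timestamp")
df = df.interpolate(method="time")  # fill short gaps between readings
df = df.dropna()                    # drop rows that could not be filled

# Remove obvious noise by clipping readings to physically plausible ranges.
df["temperature"] = df["temperature"].clip(lower=-40, upper=150)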
Exploratory Data Analysis (EDA): Conduct exploratory data analysis to
identify patterns, trends, and anomalies in the sensor data. Visualize sensor
readings over time, perform statistical analyses, and detect correlations between
variables using libraries such as Matplotlib and Seaborn.
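
Continuing from the preprocessed DataFrame above, a minimal EDA sketch might look like the following; the column names remain illustrative.

import matplotlib.pyplot as plt
import seaborn as sns

# Visualize a sensor reading over time.
df["vibration"].plot(title="Vibration over time")
plt.show()

# Detect correlations between variables.
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")
plt.show()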
Feature Engineering: Extract meaningful features from sensor data to capture
equipment health indicators, degradation trends, and failure patterns. Engineer
features such as rolling averages, rolling variance, and spectral analysis to enhance the predictive power of the model.
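
A minimal sketch of such rolling-window features, again using the illustrative DataFrame from the preceding steps (window sizes would be tuned to the actual sampling rate):

# Rolling statistics over the last hour of readings as equipment-health indicators.
df["vib_roll_mean_1h"] = df["vibration"].rolling("1h").mean()
df["vib_roll_var_1h"] = df["vibration"].rolling("1h").var()
df["temp_roll_mean_1h"] = df["temperature"].rolling("1h").mean()

features = df.dropna()  # drop the initial rows where the window is not yet full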
Model Selection: Choose appropriate machine learning algorithms for
predictive maintenance, such as regression, classification, or time-series
forecasting models. Consider techniques like logistic regression, random forests,
support vector machines (SVM), or recurrent neural networks (RNNs) for
modeling equipment failures.
Model Evaluation: Evaluate model performance using metrics like accuracy,
precision, recall, and F1-score. Utilize techniques such as confusion matrices,
receiver operating characteristic (ROC) curves, and precision-recall curves to
assess the tradeoffs between true positives and false positives.
Predictive Modeling: Train the selected model on historical equipment failure
data, adjusting hyperparameters and model architectures as needed. Implement
techniques like cross-validation and ensemble learning to improve model
robustness and generalization performance.
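
A minimal sketch tying the model selection, training, and evaluation steps together is shown below. It assumes a hypothetical binary label column, failure_within_24h, has been added to the engineered features, and it uses a random forest purely as an example rather than as the recommended model.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report

X = features.drop(columns=["failure_within_24h"])
y = features["failure_within_24h"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=200, random_state=42)

# Cross-validation on the training set gives a first estimate of robustness.
print("CV F1 scores:", cross_val_score(model, X_train, y_train, cv=5, scoring="f1"))

# Fit on the training split and report precision, recall, and F1 on held-out data.
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))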
Prediction and Action: Deploy the trained model to predict equipment failures
in real-time or near-real-time. Generate alerts or notifications when equipment
health deteriorates beyond a certain threshold, enabling proactive maintenance
actions.
Implementation: Integrate predictive maintenance insights into the
manufacturing plant's existing maintenance management systems. Schedule
maintenance activities based on predictive insights, prioritize critical assets, and
optimize spare parts inventory management.
Result: By implementing predictive maintenance strategies, the manufacturing
plant experiences a significant reduction in unplanned downtime, maintenance
costs, and production disruptions. Equipment reliability and uptime improve,
leading to higher operational efficiency and enhanced product quality.
Additionally, the proactive approach to maintenance fosters a culture of
continuous improvement and innovation within the organization.
CONCLUSION
Organizations face challenges in the rapidly evolving big data ecosystem, with new tools emerging and becoming outdated quickly. Data engineering principles offer a solution by leveraging scalable, flexible, and high-performing tools for data management and analysis. The whitepaper provides guidance for navigating the complexities of the big data ecosystem, emphasizing the use of AWS managed services for efficient data processing.
AWS's suite of tools streamlines development, deployment, and scaling of big data applications, allowing focus on business problem-solving. AWS enables building flexible, scalable architectures that meet business needs through multiple tools for cost optimization, performance, and resilience.
Embracing data engineering and AWS managed services empowers organizations to build high-performing big data architectures, simplifying development and adapting to evolving data challenges.
REFERENCE LINKS
[1] Data Engineering on AWS: https://awsacademy.instructure.com/courses/68245/modules#module_824165
[2] Amazon S3: https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html
[3] Amazon Athena: https://aws.amazon.com/athena/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc
[4] AWS Glue: https://aws.amazon.com/glue/
[5] Amazon Redshift: https://aws.amazon.com/redshift/
[6] Amazon SageMaker: https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html
[7] AWS IoT Analytics: https://aws.amazon.com/iot-analytics/
[8] Case Study (Big Data Analytics for Predictive Maintenance Strategies): https://www.researchgate.net/publication/312004126_Big_Data_Analytics_for_Predictive_Maintenance_Strategies