
Big Data Analytics
'Implementing Big Data': Case Study

Submitted to: Brigadier Chandra Sekhar

TEAM
● Karishma Schoeman (22MBMB40)
● Nikhil Khandelwal (22MBMB22)
● Nitish K Sharma (22MBMB13)
● Shivank Bhardwaj (22MBMB35)
● Vinayak Sharma (22MBMB33)
Table of Contents
01 Paytm & Big Data
02 Company Profile
03 Legacy System
04 Implementation
05 AWS EMR
06 EMR Services
07 Challenges & Future
08 Conclusion
01 Paytm & Big Data

Paytm's Big Data integration: enhancing services, transactions, and user experiences.

Trends in big data
● Mobile and real-time data dominate: by 2025, over a quarter of all data will be real-time in nature, and IoT real-time data will account for more than 95% of it.
● New AI technologies transform the norms: insights are generated via technologies like machine learning and natural language processing.
● You don't have to be big to use big data: it helps advance business intelligence, boost productivity, minimize risk and fraud, and build stronger customer relationships.
● Security remains significant: with increasing amounts of data being produced, protection and security of sensitive and private information is crucial.
02 Company Profile

● Paytm, founded in 2010, is an Indian fintech and e-commerce giant.
● Employs cutting-edge technologies like AI, blockchain, cloud, NFC, and machine learning.
● 400+ million registered users as of 2022.
● 1.2 billion transactions monthly as of 2022, showcasing remarkable scalability.
● Implements robust cybersecurity measures, safeguarding user data and financial transactions from potential threats.
● Paytm is expanding its services globally, with users in multiple countries benefiting from its diverse fintech solutions and services.
03 Legacy System

Legacy Data Pipeline
● The Paytm Central Data Platform team is responsible for turning disparate data from multiple business units into insights and actions for their executive management and merchants.
● The legacy data pipeline was set up on premises using a proprietary solution and didn't utilize open-source Hadoop stack components such as Spark or Hive.

Challenges of the Legacy Data Pipeline
● The legacy data pipeline was not scalable, reliable, or performant. This resulted in a number of problems, including:
  ● Long processing times
  ● Outages
  ● SLA breaches
04 Implementation

Migration Strategies

Legacy pipeline
• Generated approximately 250K reports per day, which are consumed by Paytm executives and merchants.
• Analytical jobs took approximately 8–10 hours to complete, which often led to Service Level Agreement (SLA) breaches.

Need for implementation of migration
• Optimization of hardware usage.
• Reduced data analytical processing time.
• Configured Spark to analyse newly updated/inserted records.
• Implemented incremental processing to help reduce scanning time and storage capacity (see the sketch below).

Partnerships
By partnering with AWS, the Paytm Central Data Platform team created a modern data pipeline in a short amount of time. It provides reduced data analytical times with extraordinary scaling capabilities, generating high-quality reports for the executive management and merchants on a daily basis.
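
A minimal sketch of the incremental-processing idea in Spark with Scala (the stack named later in this deck): filter the source down to records changed since the last run instead of rescanning everything. The table path, watermark value, and column names are illustrative assumptions, not Paytm's actual pipeline.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.col

  val spark = SparkSession.builder().appName("IncrementalReports").getOrCreate()

  // Hypothetical: the last successful run's watermark would normally be
  // read from a checkpoint/metadata store rather than hard-coded.
  val lastWatermark = "2022-01-01 00:00:00"

  // Read only the records inserted or updated since the previous run,
  // instead of rescanning the full historical dataset.
  val incremental = spark.read
    .parquet("s3://example-bucket/transactions/") // illustrative path
    .filter(col("updated_at") > lastWatermark)    // assumes an updated_at column

  // Downstream jobs now aggregate only the delta, which is what cuts
  // scanning time and storage churn.
  incremental.groupBy("merchant_id").count().show()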

05 AWS EMR

06 EMR Services used by Paytm

● Apache Spark: for real-time and batch processing
● MapReduce: for distributed processing
● HDFS: to store massive datasets securely and efficiently
● Apache Hive: for SQL-like queries and transformations
● Apache Mahout, TensorFlow, or other ML libraries: for fraud detection, risk assessment, and personalized user experiences
● Apache Flink or Spark Streaming: to monitor real-time transactions and detect anomalies (sketched below)
● Apache Zeppelin, Tableau: to represent key performance indicators, customer trends, and business insights
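
As a rough illustration of the transaction-monitoring use case, here is a minimal Spark Structured Streaming sketch in Scala. The Kafka broker, topic, schema, and anomaly threshold are invented for the example; they are not Paytm's actual configuration.

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions._
  import org.apache.spark.sql.types._

  // Requires the spark-sql-kafka connector on the classpath.
  val spark = SparkSession.builder().appName("TxnMonitor").getOrCreate()
  import spark.implicits._

  // Illustrative transaction schema.
  val schema = new StructType()
    .add("merchant_id", StringType)
    .add("amount", DoubleType)
    .add("ts", TimestampType)

  // Read the transaction stream (hypothetical broker and topic).
  val txns = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "transactions")
    .load()
    .select(from_json($"value".cast("string"), schema).as("t"))
    .select("t.*")

  // Flag unusually large per-merchant totals over a sliding window.
  val anomalies = txns
    .withWatermark("ts", "10 minutes")
    .groupBy(window($"ts", "5 minutes"), $"merchant_id")
    .agg(sum($"amount").as("total"))
    .filter($"total" > 100000) // invented threshold

  anomalies.writeStream.format("console").outputMode("update").start().awaitTermination()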

07 Challenges to Benefits

Cost Savings
❖ 80:20 Spot to On-Demand Ratio
❖ Higher Cost Efficiency

Enhanced Cluster Management
❖ Automatic Scaling Based on EC2 Instances
❖ Elimination of Multiple Scaling Policies
❖ Focus on Compute Requirements

Efficient Scaling
❖ Amazon EMR 6.3
❖ Reduced Scale Time
❖ Increased EC2 Spot Instance Usage

Automated Scaling
❖ Amazon EMR Managed Scaling (see the sketch below)
❖ YARN Memory-Based Scaling
❖ No Manual Configuration Needed

Benefits of AWS for Paytm
● Reduced infrastructure management and processing incidents by 70 percent
● Streamlined data processing time for majority workloads by 98 percent
● Improved data availability by 30 percent
● Reduced data infrastructure cost by 30 percent
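
As a rough sketch of how the 80:20 spot ratio and EMR Managed Scaling could be expressed programmatically, here is Scala against the AWS SDK for Java v1 EMR client. The cluster ID, capacities, and scaling limits are illustrative assumptions; the deck does not state Paytm's exact settings.

  import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder
  import com.amazonaws.services.elasticmapreduce.model._

  val emr = AmazonElasticMapReduceClientBuilder.defaultClient()

  // Core instance fleet targeting roughly an 80:20 spot to on-demand split
  // (8 spot units vs 2 on-demand units; numbers are illustrative). This
  // fleet would be attached in the RunJobFlowRequest at cluster creation.
  val coreFleet = new InstanceFleetConfig()
    .withInstanceFleetType(InstanceFleetType.CORE)
    .withTargetSpotCapacity(8)
    .withTargetOnDemandCapacity(2)
    .withInstanceTypeConfigs(new InstanceTypeConfig().withInstanceType("m5.xlarge"))

  // EMR Managed Scaling: one policy instead of multiple custom scaling rules;
  // EMR resizes between the limits based on workload (e.g. YARN memory pressure).
  val scalingPolicy = new ManagedScalingPolicy()
    .withComputeLimits(new ComputeLimits()
      .withUnitType(ComputeLimitsUnitType.Instances)
      .withMinimumCapacityUnits(3)
      .withMaximumCapacityUnits(20))

  emr.putManagedScalingPolicy(new PutManagedScalingPolicyRequest()
    .withClusterId("j-XXXXXXXXXXXXX") // hypothetical cluster ID
    .withManagedScalingPolicy(scalingPolicy))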
08 Conclusion

Key Achievements
❖ 400 TB of legacy data migrated to Amazon S3.
❖ 40 data flows revamped for enhanced efficiency.
❖ Spark efficiency gains: 95% reduction in runtime, CPU, I/O, and overall computation time.
❖ Optimized data flows: Scala on Apache Spark, Azkaban for job scheduling.

● Cost Optimization and Improved Performance
● Strategic Data Processing Enhancement
● Democratizing Data for a Cultural Shift
● Operational Efficiency and Reliability Boost

AWS EMR Walk-through
Let's go to the AWS Console.

STEPS
1. Log in to the AWS console
2. Navigate to the EMR service
3. Click on "Create Cluster"
4. Configure the cluster:
   a. Select the appropriate release label (EMR version)
   b. Choose the applications
   c. Choose instance types (master & core nodes)
   d. Configure cluster permissions
   e. Configure bootstrap actions or scripts
   f. Configure storage
5. Configure network and security
6. Submit & access the cluster (an equivalent programmatic sketch follows)
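
For reference, here is a minimal Scala sketch of the same cluster creation through the AWS SDK for Java v1. The cluster name, instance types and count, log bucket, and default roles are assumptions chosen to mirror the console steps above, not values taken from the deck.

  import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder
  import com.amazonaws.services.elasticmapreduce.model._

  val emr = AmazonElasticMapReduceClientBuilder.defaultClient()

  val request = new RunJobFlowRequest()
    .withName("emr-walkthrough-demo")                     // hypothetical name
    .withReleaseLabel("emr-6.3.0")                        // step 4a: release label
    .withApplications(new Application().withName("Spark"),
                      new Application().withName("Hive")) // step 4b: applications
    .withInstances(new JobFlowInstancesConfig()           // step 4c: master & core nodes
      .withMasterInstanceType("m5.xlarge")
      .withSlaveInstanceType("m5.xlarge")
      .withInstanceCount(3)
      .withKeepJobFlowAliveWhenNoSteps(true))
    .withServiceRole("EMR_DefaultRole")                   // step 4d: cluster permissions
    .withJobFlowRole("EMR_EC2_DefaultRole")
    // step 4e: bootstrap actions could be added via .withBootstrapActions(...)
    .withLogUri("s3://example-bucket/emr-logs/")          // step 4f: storage for logs

  // Step 6: submit; the returned ID is used to track and access the cluster.
  val clusterId = emr.runJobFlow(request).getJobFlowId
  println(s"Cluster starting: $clusterId")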

Thanks!
Do you have any questions?

References
● https://aws.amazon.com/blogs/big-data/how-paytm-modernized-their-data-pipeline-using-amazon-emr/
● https://medium.com/@parth09/democratising-the-data-computation-in-complex-organisations-dc360243e36a
● https://aws.amazon.com/solutions/case-studies/paytm/
● https://paytm.com/blog/engineering/building-cloud-native-solutions-with-aws-codedeploy/
● https://www.linkedin.com/pulse/amazon-emr-your-solution-handle-big-data-musa-emin-ozdem
● https://aws.amazon.com/emr/getting-started/
● https://www.sas.com/content/dam/SAS/documents/infographics/2019/en-big-data-110869.pdf
● https://youtu.be/QuwaBOESGiU?si=OI1E4swkA28Dhfrq
● https://us-east-2.console.aws.amazon.com/emr/home?region=us-east-2#/clusters

