
Yasaswi
Sr. Data Engineer
Manijob201@gmail.com | +1 (628) 290-8561

Professional Summary
Around 9 years of IT experience as a professional Data Engineer in big data analytics, the Hadoop ecosystem, designing and implementing data pipelines, cloud-based applications, and data visualization tools.
Experience working with big data, Spark, Python, Teradata, SQL, and Tableau.

Industry experience in data manipulation using Hadoop ecosystem tools such as MapReduce, HDFS, YARN/MRv2, Pig, Hive, HBase, Spark, Kafka, Flume, Sqoop, and Oozie.


Hands-on experience in developing Spark applications using PySpark, DataFrames, RDDs, and Spark SQL.

Expert in working with Cloud Pub/Sub to replicate real-time data from source systems to GCP BigQuery.

Designed and implemented end-to-end data pipelines to extract, cleanse, process, and analyze huge amounts of behavioral and log data.


Expertise in working with Amazon Web Services such as EC2, Redshift, S3, Athena, and Glue for big data development.
Experienced in developing production-ready Spark applications using the Spark RDD API, DataFrames, Spark SQL, and the Spark Streaming API.


Extensively worked on system analysis, design, development, testing, and implementation of projects (SDLC); capable of handling responsibilities independently as well as with proactive team members.


Worked extensively on fine-tuning Spark applications to improve performance and on troubleshooting failures in Spark applications.
Hands-on experience working with various NoSQL databases such as Cassandra, ScyllaDB, and HBase.

Proficient in importing and exporting data between RDBMS and HDFS using Sqoop.

Experience working with GCP services including Cloud Storage, Dataproc, Dataflow, BigQuery, Cloud Composer, and Cloud Pub/Sub.
Expertise in extracting, transforming, and loading data from various sources such as flat files, Excel, Oracle, MS SQL Server, and Teradata.


Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
Used Hive extensively to perform various data analytics required by business teams.

Solid experience working with various data formats such as Parquet, ORC, JSON, and Avro.

Experience automating end-to-end data pipelines with strong resilience and recoverability.

Worked on Spark Streaming and Structured Streaming with Kafka for real-time data processing.

Experience in creating Impala views on Hive tables for fast access to data.

Experienced in using Waterfall, Agile, and Scrum software development process frameworks.

Strong experience in working with operating systems such as Linux and Unix.

Experience executing batch jobs alongside data streams using Spark Streaming.

Good knowledge of Oracle PL/SQL and shell scripting.


Experienced, process-oriented data analyst with excellent analytical, quantitative, and problem-solving skills using SQL, Tableau, advanced Excel, Python, and R.


Database/ETL performance tuning: broad experience in database development, including effective use of database objects, SQL Trace, Explain Plan, different types of optimizers, hints, indexes, table partitions, sub-partitions, materialized views, global temporary tables, autonomous transactions, bulk binds, and Oracle built-in functions. Performance tuning of Informatica mappings and workflows.
Sound knowledge of distributed systems architecture and parallel processing frameworks.

Experienced working with continuous integration/continuous delivery and build tools such as Jenkins, using Git and SVN for version control.


Professional Experience
2022/05 – present Sr. Data Engineer
Foster City, CA Exabeam
Involved in requirements gathering, sprint grooming, design, development, unit testing, and bug fixing.


Created tables in Hive and integrated data between Hive and Spark.

Developed Spark jobs to collect data from various source systems and store it on HDFS to run analytics.


Created Hive partitioned and bucketed tables to improve performance.

Designed and implemented the various layers of the data lake and designed star schemas for BigQuery.
Used Google Cloud Functions with Python to load data into BigQuery on arrival of CSV files in GCS buckets.
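For illustration, a minimal sketch of such a GCS-triggered Cloud Function, assuming the google-cloud-bigquery client and a hypothetical destination table (analytics.landing_events):

```python
# Illustrative sketch; the destination table name is hypothetical.
from google.cloud import bigquery

def load_csv_to_bq(event, context):
    """Background Cloud Function triggered when a CSV file lands in a GCS bucket."""
    client = bigquery.Client()
    uri = f"gs://{event['bucket']}/{event['name']}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    # Start the load job and block until it finishes so failures surface in the function logs.
    client.load_table_from_uri(uri, "analytics.landing_events", job_config=job_config).result()
```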
Loaded data into Snowflake tables from internal stages using SnowSQL.

Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python.
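A minimal Apache Beam sketch of this Pub/Sub-to-BigQuery pattern; the project, topic, table, and schema names are hypothetical:

```python
# Illustrative Beam pipeline; all resource names below are hypothetical.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    options = PipelineOptions(
        streaming=True, project="my-project", region="us-central1", runner="DataflowRunner"
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
            | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-project:analytics.events",
                schema="event_id:STRING,user_id:STRING,ts:TIMESTAMP",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )

if __name__ == "__main__":
    run()
```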


Created Hive tables using user-defined functions.

Loaded and transformed large datasets of structured, semi-structured, and unstructured data in various formats such as TXT, ZIP, XML, and JSON.


Designed pipelines with Apache Beam, Kubeflow, and Dataflow, and orchestrated jobs on GCP.
Hands-on experience migrating on-premises workloads to Google Cloud Platform using Google Cloud Storage, BigQuery, Cloud Composer, and Cloud Dataproc.
Exposure to IAM roles in GCP.

Designed and developed ETL pipelines using Python APIs.


Worked on Spark SQL: created DataFrames by loading data from Hive tables, prepared the data, and stored it in Cloud Storage.
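A minimal PySpark sketch of this flow, assuming Hive support is enabled and the GCS connector is available; the table, bucket, and column names are hypothetical:

```python
# Illustrative sketch; table, bucket, and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("prep-data").enableHiveSupport().getOrCreate()

# Load a Hive table into a DataFrame, derive prep columns, and write to Cloud Storage.
orders = spark.table("warehouse.orders")
prep = (
    orders
    .filter(F.col("order_status") == "COMPLETE")
    .withColumn("order_date", F.to_date("order_ts"))
    .groupBy("customer_id", "order_date")
    .agg(F.sum("amount").alias("daily_spend"))
)
prep.write.mode("overwrite").parquet("gs://example-bucket/prep/orders_daily/")
```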


Experience building multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinating tasks across the team.
Developed Airflow DAGs in Python by importing the Airflow libraries.
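A minimal Airflow DAG sketch of this kind; the DAG id, schedule, and task callables are hypothetical:

```python
# Illustrative DAG; dag_id, schedule, and the callables are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # pull data from the source system

def load():
    ...  # load the extracted data into the warehouse

with DAG(
    dag_id="daily_ingest",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # run extract before load
```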

Used Spark Streaming to receive real-time data from Kafka and stored the streamed data to HDFS using Python and to NoSQL databases such as HBase and Cassandra.
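A minimal Structured Streaming sketch of this Kafka-to-HDFS flow, assuming the Spark Kafka connector is on the classpath; the broker, topic, and paths are hypothetical:

```python
# Illustrative sketch; broker, topic, and HDFS paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Read the Kafka topic as a streaming DataFrame and keep the key/value as strings.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "clickstream")
    .option("startingOffsets", "latest")
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
)

# Continuously append the stream to HDFS as Parquet, with checkpointing for recovery.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/raw/clickstream/")
    .option("checkpointLocation", "hdfs:///checkpoints/clickstream/")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```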
Designed and implemented data ingestion techniques for data coming from various source systems.



Used Cloud Scheduler for independent jobs.


Designed and developed Spark code using Python, PySpark, and Spark SQL for high-speed data computation.


Leveraged GitHub for source code version control and Jenkins for scheduling data pipeline job execution.


Environment: Spark, Python, Hive, GCP, Big Data, Hadoop, Cloud SQL, BigQuery, Cloud Dataproc, GCS, Cloud Composer, Cloud Dataprep, Google Cloud Dataflow, Teradata, SAS, Java, SQL Server, Hortonworks 2.5, HDFS 2.7.3.

2020/08 – 2022/04 Data Engineer


Englewood, CO Travelport
Worked closely with multiple teams to gather requirements and maintained relationships with teams that are heavy users of data for analytics.


Ingested data daily from external servers such as FTP servers and S3 buckets using custom input adapters.


Used AWS Redshift, Redshift Spectrum, and Athena to query large amounts of data stored on S3 and create a virtual data lake without going through the ETL process.
Developed various Spark applications using Scala to perform enrichments of user behavioral data (clickstream data) merged with user profile data.
Involved in data cleansing, event enrichment, data aggregation, de-normalization, and data preparation needed for downstream model training and reporting.


Utilized the Spark Scala API to implement batch processing jobs.

Fine-tuned Spark applications and jobs to improve efficiency and overall processing time for the pipelines.


Troubleshot Spark applications to improve error tolerance.

Converted Hive/SQL queries into Spark transformations using Spark RDDs in Python and Scala.
Developed Spark scripts using Python on AWS EMR for data aggregation, validation, and ad hoc querying.


Used broadcast variables in Spark, effective and efficient joins, transformations, and other capabilities for data processing; utilized Spark's in-memory capabilities to handle large datasets.
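A minimal sketch of a broadcast join of this kind in PySpark; the paths, DataFrames, and join key are hypothetical:

```python
# Illustrative sketch; bucket paths, DataFrames, and the join key are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

clicks = spark.read.parquet("s3://example-bucket/clickstream/")   # large fact data
profiles = spark.read.parquet("s3://example-bucket/profiles/")    # small dimension data

# Broadcasting the small profile table avoids shuffling the large clickstream data.
enriched = clicks.join(broadcast(profiles), on="user_id", how="left")
enriched.write.mode("overwrite").parquet("s3://example-bucket/enriched/")
```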
Experienced working with EMR clusters and S3 in the AWS cloud; developed APIs using AWS Lambda to manage servers and run code in AWS.
Created Hive tables, loading and analyzing data using Hive scripts; implemented partitioning, dynamic partitions, and buckets in Hive.
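A minimal sketch of Hive partitioning with a dynamic-partition insert, issued through Spark SQL; the table and column names are hypothetical (bucketing would add a CLUSTERED BY ... INTO n BUCKETS clause to the DDL):

```python
# Illustrative sketch; warehouse/staging table names and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-ddl").enableHiveSupport().getOrCreate()

# Partitioned Hive table, one partition per event_date.
spark.sql("""
    CREATE TABLE IF NOT EXISTS warehouse.events (
        event_id STRING,
        user_id  STRING,
        amount   DOUBLE
    )
    PARTITIONED BY (event_date STRING)
    STORED AS ORC
""")

# Dynamic partition insert: partitions are derived from the event_date values.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE warehouse.events PARTITION (event_date)
    SELECT event_id, user_id, amount, event_date FROM staging.events_raw
""")
```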


Performed demographic analysis on customer data by age, sex, region, and other attributes in Excel.
Performed hypothesis testing in Excel to determine whether sample sizes were statistically significant with respect to the population for customer segments.


Environment: AWS, Hadoop, AWS EC2, Amazon S3, HDFS, Pig, Hive, Spark, Python, Cloudera CDH 4.6, MapReduce, Sqoop, Oozie, Cassandra, Kafka, Tableau, Excel.

2018/09 – 2020/07 Spark Developer


Dallas, TX CBRE
Developed ETL data pipelines using Sqoop, Spark, Spark SQL, Scala, and Oozie.

Used Spark for interactive queries and processing of streaming data, and integrated it with NoSQL databases.
Experience with AWS IAM, Data Pipeline, EMR, S3, and EC2.

Developed batch scripts to fetch data from AWS S3 storage and perform the required transformations.
Developed Spark code using Scala and Spark SQL for faster processing of data.

Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
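A minimal PySpark sketch of regex-based parsing of this kind; the input path, columns, and patterns are hypothetical:

```python
# Illustrative sketch; the log path, columns, and regex patterns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, col

spark = SparkSession.builder.appName("regex-parse").getOrCreate()

logs = spark.read.text("hdfs:///data/raw/app_logs/")

# Pull the timestamp, log level, and message out of each raw log line.
parsed = logs.select(
    regexp_extract(col("value"), r"^(\S+ \S+)", 1).alias("ts"),
    regexp_extract(col("value"), r"\b(INFO|WARN|ERROR)\b", 1).alias("level"),
    regexp_extract(col("value"), r"\b(?:INFO|WARN|ERROR)\b\s+(.*)$", 1).alias("message"),
)
parsed.filter(col("level") == "ERROR").show(truncate=False)
```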


Created Oozie workflows to run multiple Spark jobs.

Developed file-cleaning utilities using Python libraries.


Explored Spark for improving the performance and optimization of existing algorithms in Hadoop using Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
Experience with Terraform scripts that automate step execution in EMR to load data into ScyllaDB.


De-normalized data coming from Netezza as part of the transformations and loaded it into NoSQL databases and MySQL.


Environment: HDFS, Spark, Scala, Netezza, EMR, Oracle, NoSQL, Sqoop, AWS,
Terraform, Scylla DB, Cassandra, MySQL, Oozie.

2016/03 – 2018/06 Hadoop/ETL Developer


Hyderabad, India Couth InfoTech Pvt Ltd
Developed simple and complex Spark jobs in Python for data analysis on different data formats.
Developed upgrade and downgrade scripts in SQL that filter out bad and unnecessary records and find unique records based on different criteria.


Implemented custom data types, InputFormat, RecordReader, OutputFormat, and RecordWriter for Spark job computations to handle custom business requirements.
Experience installing, configuring, supporting, and managing Hadoop clusters using Apache and Cloudera (CDH 5.x) distributions and on Amazon Web Services (AWS).
Worked with Sqoop to extract data from relational databases into Hadoop.

Experience building scalable distributed data processing solutions with Hadoop using tools such as Hive, HBase (NoSQL), and Sqoop.


Experience working with different Hadoop distributions such as Cloudera, Hortonworks, MapR, and Apache.


Involved in developing Spark scripts for data analysis in both Python and Scala.

Developed multiple MapReduce jobs in Java for data cleaning and preprocessing.

Experience writing SQL queries across different databases such as Teradata and Hive.

Implemented partitioning, dynamic partitions, and buckets in Hive for efficient data access.
Developed complex MapReduce jobs using Pig and Hive to handle multiple file formats such as JSON, XML, CSV, and sequence files.


Extensively used ETL to extract data from various source systems, transform the data, and load it into data warehouses.


Migrated complex MapReduce programs into Spark RDD transformations and actions.
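A minimal sketch of re-expressing a MapReduce-style word count as Spark RDD transformations and actions; the input and output paths are hypothetical:

```python
# Illustrative sketch; the HDFS input and output paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-rdd").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.textFile("hdfs:///data/raw/documents/")        # read input splits, as in the map phase
    .flatMap(lambda line: line.split())                # map: emit one record per word
    .map(lambda word: (word, 1))                       # map: key each word with a count of 1
    .reduceByKey(lambda a, b: a + b)                   # shuffle + reduce: sum counts per word
)
counts.saveAsTextFile("hdfs:///data/out/word_counts")  # action that triggers the job
```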

Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.


Experience working with BI reporting tools such as Tableau, Power BI, and Infogram.

Experienced in migrating HiveQL queries to Impala to minimize query response time.


Involved in story-driven agile development methodology and daily scrum meetings.


2014/10 – 2016/02 ETL Developer


Maisa Solutions Private Limited
Involved in data extraction, transformation, and loading (ETL) between homogeneous and heterogeneous systems using SQL tools (SSIS, Bulk Insert).
Performed ETL operations to support incremental, historical, and nightly data loads and transformations using SSIS.


Created reports using SSRS from OLTP and OLAP data sources and deployed them on the report server.
Performed T-SQL tuning and optimization of queries for reports with long execution times using MS SQL Profiler, Index Tuning Wizard, and SQL query plans; generated reports using SSRS which were sent to different users.
Maintained table performance through normalization and index creation, and collected statistics using query optimization, query execution plans, SQL Server Profiler, and Database Engine Tuning Advisor.
Created complex SSRS reports using multiple data providers, global variables, expressions, user-defined objects, aggregate-aware objects, charts, and synchronized queries.
Created Linked Servers for data retrieval using OLE DB data sources and providers.

Developed a database for integration by writing SQL Queries and Stored Procedures.

Wrote unit test cases and performed unit testing for the same.

Stored all source code in TortoiseSVN and updated development status in SVN in a timely manner.
Responsible for developing, supporting, and maintaining ETL (Extract, Transform, and Load) processes using Informatica PowerCenter.


Environment: C#.NET, SQL Server 2008 R2/2012, Oracle, DB2, SQL Server Integration Services (SSIS), RDL, SQL Server Reporting Services (SSRS), SQL Server Analysis Services (SSAS), Business Intelligence Studio, Alteryx, Tortoise SVN, OLE DB, MS SQL, OMS
