
Yasaswi
Sr. Data Engineer
Manijob201@gmail.com | +1 (628) 290-8561

Professional Summary
Around 9 years of IT experience as a professional Data Engineer in big data analytics, the Hadoop ecosystem, designing and implementing data pipelines, cloud-based applications, and data visualization tools.
Experience working with big data, Spark, Python, Teradata, SQL, and Tableau.

Industry experience in data manipulation using Hadoop ecosystem tools such as MapReduce, HDFS, YARN/MRv2, Pig, Hive, HBase, Spark, Kafka, Flume, Sqoop, and Oozie.


Hands-on experience in developing Spark applications using PySpark, DataFrames, RDDs, and Spark SQL.

Expert in working with Cloud Pub/Sub to replicate real-time data from source systems to GCP BigQuery.

Designed and implemented end-to-end data pipelines to extract, cleanse, process, and analyze huge amounts of behavioral and log data.


Expertise in working with Amazon Web Services such as EC2, Redshift, S3, Athena, and Glue for big data development.
Experienced in developing production-ready Spark applications using the Spark RDD API, DataFrames, Spark SQL, and the Spark Streaming API.


Extensively worked on system analysis, design, development, testing, and implementation of projects (SDLC); capable of handling responsibilities independently as well as with proactive team members.


Worked extensively on fine-tuning Spark applications to improve performance and on troubleshooting failures in Spark applications.
Hands-on experience working with various NoSQL databases such as Cassandra, ScyllaDB, and HBase.

Proficient in importing and exporting data between RDBMS and HDFS using Sqoop.

Experience working with GCP services including Cloud Storage, Dataproc, Dataflow, BigQuery, Cloud Composer, and Cloud Pub/Sub.
Expertise in extracting, transforming, and loading data from various sources such as flat files, Excel, Oracle, MS SQL Server, and Teradata.


Experience in developing Spark applications using Spark SQL in Databricks for data extraction, transformation, and aggregation from multiple file formats, analyzing and transforming the data to uncover insights into customer usage patterns.
Used Hive extensively to perform various data analytics required by business teams.

Solid experience working with various data formats such as Parquet, ORC, JSON, and Avro.

Experience automating end-to-end data pipelines with strong resilience and recoverability.

Worked on Spark Streaming and Structured Streaming with Kafka for real-time data processing.

Experience in creating Impala views on Hive tables for fast access to data.

Experienced in using Waterfall, Agile, and Scrum software development process frameworks.

Strong experience in working with operating systems such as Linux and Unix.

Experience executing batch jobs alongside data streams using Spark Streaming.

Good knowledge of Oracle PL/SQL and shell scripting.


Experienced, process-oriented data analyst with excellent analytical, quantitative, and problem-solving skills using SQL, Tableau, advanced Excel, Python, and R.


Database/ETL performance tuning: broad experience in database development, including effective use of database objects, SQL Trace, Explain Plan, different types of optimizers, hints, indexes, table partitions, sub-partitions, materialized views, global temporary tables, autonomous transactions, bulk binds, and Oracle built-in functions. Performance tuning of Informatica mappings and workflows.
Sound knowledge of distributed systems architecture and parallel processing frameworks.

Experienced working with continuous integration/continuous delivery and build tools such as Jenkins, using Git and SVN for version control.


Professional Experience
2022/05 – present Sr. Data Engineer
Foster City, CA Exabeam
Involved in requirements gathering, sprint grooming, design, development, unit testing, and bug fixing.


Created tables in Hive and integrated data between Hive and Spark.

Developed Spark jobs to collect data from various source systems and store it on HDFS to run analytics.


Created Hive partitioned and bucketed tables to improve performance.

Designed and implemented the various layers of the data lake and designed star schemas for BigQuery.
Used Google Cloud Functions with Python to load data into BigQuery on arrival of CSV files in GCS buckets.
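For illustration, a minimal sketch of such a GCS-triggered Cloud Function, assuming the google-cloud-bigquery client and a hypothetical destination table (analytics.landing_events):

```python
# Illustrative sketch; the destination table name is hypothetical.
from google.cloud import bigquery

def load_csv_to_bq(event, context):
    """Background Cloud Function triggered when a CSV file lands in a GCS bucket."""
    client = bigquery.Client()
    uri = f"gs://{event['bucket']}/{event['name']}"
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    # Start the load job and block until it finishes so failures surface in the function logs.
    client.load_table_from_uri(uri, "analytics.landing_events", job_config=job_config).result()
```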
Loaded data into Snowflake tables from internal stages using SnowSQL.

Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow with Python.
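A minimal Apache Beam sketch of this Pub/Sub-to-BigQuery pattern; the project, topic, table, and schema names are hypothetical:

```python
# Illustrative Beam pipeline; all resource names below are hypothetical.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    options = PipelineOptions(
        streaming=True, project="my-project", region="us-central1", runner="DataflowRunner"
    )
    with beam.Pipeline(options=options) as p:
        (
            p
            | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
            | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                "my-project:analytics.events",
                schema="event_id:STRING,user_id:STRING,ts:TIMESTAMP",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )

if __name__ == "__main__":
    run()
```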


Created Hive tables using user-defined functions.

Loaded and transformed large datasets of structured, semi-structured, and unstructured data in various formats such as TXT, ZIP, XML, and JSON.


Designed pipelines with Apache Beam, Kubeflow, and Dataflow, and orchestrated jobs on GCP.
Hands-on experience migrating on-premises workloads to Google Cloud Platform using Google Cloud Storage, BigQuery, Cloud Composer, and Cloud Dataproc.
Exposure to IAM roles in GCP.

Designed and developed ETL pipelines using Python APIs.


Worked on Spark SQL: created DataFrames by loading data from Hive tables, prepared the data, and stored it in Cloud Storage.
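A minimal PySpark sketch of this flow, assuming Hive support is enabled and the GCS connector is available; the table, bucket, and column names are hypothetical:

```python
# Illustrative sketch; table, bucket, and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("prep-data").enableHiveSupport().getOrCreate()

# Load a Hive table into a DataFrame, derive prep columns, and write to Cloud Storage.
orders = spark.table("warehouse.orders")
prep = (
    orders
    .filter(F.col("order_status") == "COMPLETE")
    .withColumn("order_date", F.to_date("order_ts"))
    .groupBy("customer_id", "order_date")
    .agg(F.sum("amount").alias("daily_spend"))
)
prep.write.mode("overwrite").parquet("gs://example-bucket/prep/orders_daily/")
```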


Experience building multiple data pipelines and end-to-end ETL and ELT processes for data ingestion and transformation in GCP, and coordinating tasks across the team.
Developed Airflow DAGs in Python by importing the Airflow libraries.
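A minimal Airflow DAG sketch of this kind; the DAG id, schedule, and task callables are hypothetical:

```python
# Illustrative DAG; dag_id, schedule, and the callables are hypothetical placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    ...  # pull data from the source system

def load():
    ...  # load the extracted data into the warehouse

with DAG(
    dag_id="daily_ingest",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> load_task  # run extract before load
```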

Used Spark Streaming to receive real-time data from Kafka and stored the streamed data to HDFS using Python and to NoSQL databases such as HBase and Cassandra.
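A minimal Structured Streaming sketch of this Kafka-to-HDFS flow, assuming the Spark Kafka connector is on the classpath; the broker, topic, and paths are hypothetical:

```python
# Illustrative sketch; broker, topic, and HDFS paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# Read the Kafka topic as a streaming DataFrame and keep the key/value as strings.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "clickstream")
    .option("startingOffsets", "latest")
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp")
)

# Continuously append the stream to HDFS as Parquet, with checkpointing for recovery.
query = (
    events.writeStream
    .format("parquet")
    .option("path", "hdfs:///data/raw/clickstream/")
    .option("checkpointLocation", "hdfs:///checkpoints/clickstream/")
    .outputMode("append")
    .start()
)
query.awaitTermination()
```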
Designed and implemented data ingestion techniques for data coming from various source systems.



Used Cloud Scheduler for independent jobs.


Designed and developed Spark code using Python, PySpark, and Spark SQL for high-speed data computation.


Leveraged GitHub for source code version control and Jenkins for scheduling data pipeline job execution.


Environment: Spark, Python, Hive, GCP, Big Data, Hadoop, Cloud SQL, BigQuery, Cloud Dataproc, GCS, Cloud Composer, Cloud Dataprep, Google Cloud Dataflow, Teradata, SAS, Java, SQL Server, Hortonworks 2.5, HDFS 2.7.3.

2020/08 – 2022/04 Data Engineer


Englewood, CO Travelport
Worked closely with multiple teams to gather requirements and maintained relationships with teams that are heavy users of data for analytics.


Ingested data daily from external servers such as FTP servers and S3 buckets using custom input adapters.


Used AWS Redshift, Redshift Spectrum, and Athena to query large amounts of data stored on S3 and create a virtual data lake without going through the ETL process.
Developed various Spark applications using Scala to perform enrichments of user behavioral data (clickstream data) merged with user profile data.
Involved in data cleansing, event enrichment, data aggregation, de-normalization, and data preparation needed for downstream model training and reporting.


Utilized the Spark Scala API to implement batch processing jobs.

Fine-tuned Spark applications and jobs to improve efficiency and overall processing time for the pipelines.


Troubleshot Spark applications to improve error tolerance.

Converted Hive/SQL queries into Spark transformations using Spark RDDs in Python and Scala.
Developed Spark scripts using Python on AWS EMR for data aggregation, validation, and ad hoc querying.


Used broadcast variables in Spark, effective and efficient joins, transformations, and other capabilities for data processing; utilized Spark's in-memory capabilities to handle large datasets.
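A minimal sketch of a broadcast join of this kind in PySpark; the paths, DataFrames, and join key are hypothetical:

```python
# Illustrative sketch; bucket paths, DataFrames, and the join key are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

clicks = spark.read.parquet("s3://example-bucket/clickstream/")   # large fact data
profiles = spark.read.parquet("s3://example-bucket/profiles/")    # small dimension data

# Broadcasting the small profile table avoids shuffling the large clickstream data.
enriched = clicks.join(broadcast(profiles), on="user_id", how="left")
enriched.write.mode("overwrite").parquet("s3://example-bucket/enriched/")
```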
Experienced working with EMR clusters and S3 in the AWS cloud; developed APIs using AWS Lambda to manage servers and run code in AWS.
Created Hive tables, loading and analyzing data using Hive scripts; implemented partitioning, dynamic partitions, and buckets in Hive.
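A minimal sketch of Hive partitioning with a dynamic-partition insert, issued through Spark SQL; the table and column names are hypothetical (bucketing would add a CLUSTERED BY ... INTO n BUCKETS clause to the DDL):

```python
# Illustrative sketch; warehouse/staging table names and columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hive-ddl").enableHiveSupport().getOrCreate()

# Partitioned Hive table, one partition per event_date.
spark.sql("""
    CREATE TABLE IF NOT EXISTS warehouse.events (
        event_id STRING,
        user_id  STRING,
        amount   DOUBLE
    )
    PARTITIONED BY (event_date STRING)
    STORED AS ORC
""")

# Dynamic partition insert: partitions are derived from the event_date values.
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE warehouse.events PARTITION (event_date)
    SELECT event_id, user_id, amount, event_date FROM staging.events_raw
""")
```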


Performed demographic analysis on customer data by age, sex, region, and other attributes in Excel.
Performed hypothesis testing in Excel to determine whether sample sizes were statistically significant with respect to the population for customer segments.


Environment: AWS, Hadoop, AWS EC2, Amazon S3, HDFS, Pig, Hive, Spark, Python, Cloudera CDH 4.6, MapReduce, Sqoop, Oozie, Cassandra, Kafka, Tableau, Excel.

2018/09 – 2020/07 Spark Developer


Dallas, TX CBRE
Developed ETL data pipelines using Sqoop, Spark, Spark SQL, Scala, and Oozie.

Used Spark for interactive queries and processing of streaming data, and integrated it with NoSQL databases.
Experience with AWS IAM, Data Pipeline, EMR, S3, and EC2.

Developed batch scripts to fetch data from AWS S3 storage and perform the required transformations.
Developed Spark code using Scala and Spark SQL for faster processing of data.

Developed Spark/Scala and Python code for a regular expression (regex) project in the Hadoop/Hive environment on Linux/Windows for big data resources.
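A minimal PySpark sketch of regex-based parsing of this kind; the input path, columns, and patterns are hypothetical:

```python
# Illustrative sketch; the log path, columns, and regex patterns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_extract, col

spark = SparkSession.builder.appName("regex-parse").getOrCreate()

logs = spark.read.text("hdfs:///data/raw/app_logs/")

# Pull the timestamp, log level, and message out of each raw log line.
parsed = logs.select(
    regexp_extract(col("value"), r"^(\S+ \S+)", 1).alias("ts"),
    regexp_extract(col("value"), r"\b(INFO|WARN|ERROR)\b", 1).alias("level"),
    regexp_extract(col("value"), r"\b(?:INFO|WARN|ERROR)\b\s+(.*)$", 1).alias("message"),
)
parsed.filter(col("level") == "ERROR").show(truncate=False)
```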


Created Oozie workflows to run multiple Spark jobs.

Developed file-cleaning utilities using Python libraries.


Explored Spark for improving the performance and optimization of existing algorithms in Hadoop using Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
Experience with Terraform scripts that automate step execution in EMR to load data into ScyllaDB.


De-normalized data coming from Netezza as part of the transformations and loaded it into NoSQL databases and MySQL.


Environment: HDFS, Spark, Scala, Netezza, EMR, Oracle, NoSQL, Sqoop, AWS,
Terraform, Scylla DB, Cassandra, MySQL, Oozie.

2016/03 – 2018/06 Hadoop/ETL Developer


Hyderabad, India Couth InfoTech Pvt Ltd
Developed simple and complex Spark jobs in Python for data analysis on different data formats.
Developed upgrade and downgrade scripts in SQL that filter out bad and unnecessary records and find unique records based on different criteria.


Implemented custom data types, InputFormat, RecordReader, OutputFormat, and RecordWriter for Spark job computations to handle custom business requirements.
Experience installing, configuring, supporting, and managing Hadoop clusters using Apache and Cloudera (CDH 5.x) distributions and on Amazon Web Services (AWS).
Worked with Sqoop to extract data from relational databases into Hadoop.

Experience building scalable distributed data processing solutions with Hadoop using tools such as Hive, HBase (NoSQL), and Sqoop.


Experience working with different Hadoop distributions such as Cloudera, Hortonworks, MapR, and Apache.


Involved in developing Spark scripts for data analysis in both Python and Scala.

Developed multiple MapReduce jobs in Java for data cleaning and preprocessing.

Experience writing SQL queries across different databases such as Teradata and Hive.

Implemented partitioning, dynamic partitions, and buckets in Hive for efficient data access.
Developed complex MapReduce jobs using Pig and Hive to handle multiple file formats such as JSON, XML, CSV, and sequence files.


Extensively used ETL to extract data from various source systems, transform the data, and load it into data warehouses.


Migrated complex MapReduce programs into Spark RDD transformations and actions.
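A minimal sketch of re-expressing a MapReduce-style word count as Spark RDD transformations and actions; the input and output paths are hypothetical:

```python
# Illustrative sketch; the HDFS input and output paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-rdd").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.textFile("hdfs:///data/raw/documents/")        # read input splits, as in the map phase
    .flatMap(lambda line: line.split())                # map: emit one record per word
    .map(lambda word: (word, 1))                       # map: key each word with a count of 1
    .reduceByKey(lambda a, b: a + b)                   # shuffle + reduce: sum counts per word
)
counts.saveAsTextFile("hdfs:///data/out/word_counts")  # action that triggers the job
```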

Involved in creating Hive tables, loading them with data, and writing Hive queries that run internally as MapReduce jobs.


Experience working with BI reporting tools such as Tableau, Power BI, and Infogram.

Experienced in migrating HiveQL queries to Impala to minimize query response time.


Involved in story-driven agile development methodology and daily scrum meetings.


2014/10 – 2016/02 ETL Developer


Maisa Solutions Private Limited
Involved in data extraction, transformation, and loading (ETL) between homogeneous and heterogeneous systems using SQL tools (SSIS, Bulk Insert).
Performed ETL operations to support incremental, historical, and nightly data loads and transformations using SSIS.


Created reports using SSRS from OLTP and OLAP data sources and deployed them on the report server.
Performed T-SQL tuning and optimization of queries for reports with long execution times using MS SQL Profiler, Index Tuning Wizard, and SQL query plans; generated reports using SSRS which were sent to different users.
Maintained table performance through normalization and index creation, and collected statistics using query optimization, query execution plans, SQL Server Profiler, and Database Engine Tuning Advisor.
Created complex SSRS reports using multiple data providers, global variables, expressions, user-defined objects, aggregate-aware objects, charts, and synchronized queries.
Created Linked Servers for data retrieval using OLE DB data sources and providers.

Developed a database for integration by writing SQL Queries and Stored Procedures.

Wrote unit test cases and performed unit testing for the same.

Stored all source code in TortoiseSVN and updated development status in SVN in a timely manner.
Responsible for developing, supporting, and maintaining ETL (Extract, Transform, and Load) processes using Informatica PowerCenter.


Environment: C#.NET, SQL Server 2008 R2/2012, Oracle, DB2, SQL Server Integration Services (SSIS), RDL, SQL Server Reporting Services (SSRS), SQL Server Analysis Services (SSAS), Business Intelligence Studio, Alteryx, Tortoise SVN, OLE DB, MS SQL, OMS
