
Ravali A

Email: aravali2697@gmail.com
Contact: +1 (940) 279 8359

Professional Summary:

● 8+ years of experience in Big Data related technologies. Consistently recognized for outstanding performance and contributions to numerous Big Data projects. Extensively used Hadoop technologies such as HDFS, Spark with Scala, Hive, HBase, Sqoop, Oozie, Kafka, SQL, Core Java, and Python. Strong reputation for resolving issues, increasing customer satisfaction, and driving overall business objectives.
● Good experience in Amazon Web Services like S3, IAM, EC2, EMR, Kinesis, VPC, DynamoDB, Redshift, Amazon RDS, Lambda, Athena, Glue, DMS, QuickSight, Amazon Elastic Load Balancing, Auto Scaling, CloudWatch, SNS, SQS, and other services of the AWS family.
● Experienced Data Analyst with a solid understanding of Data Modeling, Client/Server Applications, and
Visualization tools like Tableau, Power BI, and Data Warehouses.
● Result-oriented developer with strong problem-solving skills, able to work efficiently while balancing resources and time constraints.

Professional Abridgment:

● Experience in working with large data sets of Structured and Unstructured data using Big Data/Hadoop and
Spark, Data Acquisition, Data Validation, Predictive modeling, Statistical modeling, Data modeling, Data
Visualization.
● Experience in installing, configuring, and using Apache, Hadoop ecosystem components like Hadoop Distributed
File System (HDFS), MapReduce, Yarn, Spark, Nifi, Pig, Hive, Flume, Hbase, Oozie, Zookeeper, Sqoop, and Scala.
● Hands-on experience working with Spark SQL queries and DataFrames: importing data from data sources, performing transformations and read/write operations, and saving the results to output directories in HDFS.
● Expertise in using Spark-SQL with various data sources like JSON, Parquet, and Hive (a brief illustrative sketch follows this list).
● Designed robust ETL pipelines in GCP Data Factory and utilized Databricks while working with key stakeholders and data scientists to translate business needs into actionable reports.
● Performed migration operations using GCP from non-scalable, non-highly-available platforms to scalable, highly available architectures (web applications, databases, storage, etc.).
● Performed ETL test scenarios by writing SQL scripts with consideration of Business scenarios.
● Exclusively worked on ETL test case scenarios with business functionality and generated the reports for business
users.
● Migrated applications and data from on-premises systems to GCP environments.
● Experience in migrating SQL databases to a Hadoop data lake, Azure Data Lake Analytics, Azure SQL Database, Databricks, and Azure SQL Data Warehouse; controlling and granting database access; and migrating on-premises databases to Azure Data Lake Store using Azure Data Factory.
● Good experience in creating data ingestion pipelines, data transformations, data management, data governance, and real-time streaming engines at an enterprise level.
● Experience in creating solutions for real-time data streaming solutions using Apache Spark/Spark Streaming,
Spark SQL & Data Frames, and Kafka.
● Experience in integrating Flume and Kafka in moving the data from sources to sinks in real-time.
● Proficient at using Spark APIs to explore, cleanse, aggregate, transform and store machine sensor data.
● Experience in building high-performance batch processing applications on Hortonworks and Cloudera Data
Platforms.
● Experience in NoSQL databases - HBase, MongoDB & Cassandra, database performance tuning & data
modeling.
● Expertise in importing and exporting Terabytes of data between HDFS and Relational Database Systems using
Sqoop.
● Experienced in working with Hadoop data warehousing tools like Hive and Pig, and involved in extracting the data from these tools onto the cluster using Sqoop.
● Experience in creating interactive Dashboards and Creative Visualizations using Visualization tools like Tableau,
Power BI.
● Worked as a technical member of the team building end-to-end pipelines on Google Cloud Platform using services such as BigQuery, Dataproc, Dataflow, Cloud Storage, Airflow, and Composer.
● Keen on learning the newer technology stack that Google Cloud Platform (GCP) adds.
● Proficient in Data Analysis, Cleansing, Transformation, Data Migration, Data Integration, Data Import, and
Data Export through the use of ETL tools such as Informatica and SSIS.
● Worked with different ETL tool environments like SSIS, Informatica, and reporting tool environments like SQL
Server Reporting Services (SSRS), and Business Objects.
● Knowledge and experience working in Agile environments, including the scrum process and used Project
Management tools like Jira and version control tools such as GitHub/Git.
● Strong Knowledge of Software Development lifecycle and methodologies like Waterfall, agile methodology and
Scrum approach.
● Excellent Business Analysis skills and extensive experience documenting Use Case Requirements, Requirements
gathering, Gap Analysis, and full life cycle development.
● Experience in using distributed computing architectures like AWS products (e.g., EC2, Redshift, EMR, and Elasticsearch), migrating raw data into Amazon S3, and performing refined data processing.
● Experienced with setting up databases in AWS using RDS, including MSSQL, MySQL, MongoDB, and DynamoDB; storage using S3 buckets; and configuring instance backups to S3.
● Experienced in using CloudWatch Logs to move application logs to S3 and creating alarms based on exceptions raised by applications.
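
A minimal, illustrative Spark SQL sketch of the multi-source work described above (reading JSON and Parquet and joining with a Hive table); all paths and table names are hypothetical placeholders, not taken from a specific project:

    from pyspark.sql import SparkSession

    # Build a Spark session with Hive support so Hive tables can be read and written
    spark = (
        SparkSession.builder
        .appName("multi-source-spark-sql")
        .enableHiveSupport()
        .getOrCreate()
    )

    # Read raw JSON events and a Parquet dimension file (hypothetical HDFS paths)
    events = spark.read.json("hdfs:///data/raw/events/")
    customers = spark.read.parquet("hdfs:///data/dim/customers/")

    # Join with an existing Hive table (hypothetical name) and aggregate
    orders = spark.table("warehouse.orders")
    summary = (
        events.join(customers, "customer_id")
              .join(orders, "order_id")
              .groupBy("region")
              .count()
    )

    # Persist the result back to HDFS as Parquet
    summary.write.mode("overwrite").parquet("hdfs:///data/curated/region_summary/")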

Technical Acumen:

● Big Data Ecosystem: HDFS, MapReduce, Spark, Yarn, Hive, Sqoop, Kafka, Oozie, Zookeeper
● Hadoop Technologies: Apache Hadoop 1.x, Apache Hadoop 2.x, Cloudera CDH4/CDH5, Hortonworks
● Programming Languages: Java, MATLAB, Python, Scala, Shell Scripting, HiveQL
● Operating Systems: Windows (XP/7/8/10), Linux (Ubuntu, Centos)
● Database: RDBMS, MySQL, Teradata, DB2, Oracle
● BI Tool: Tableau, Power BI
● Cloud: AWS, Azure, GCP
● Web Development: HTML, XML, JavaScript
● IDE Tools: Eclipse, Anaconda, PyCharm, Jupyter, IntelliJ

Professional Work Experience:


Client: Travelport Englewood, CO April 2022 to Present
Sr. Data Engineer

Project description:
The goal of this project is to build dashboards and reports for production control and production summary, creating Airflow DAGs and reports using Tableau.
Responsibilities:

● Gather business requirements from clients, convert them into technical specifications, and generate reports and DAGs.
● Leveraging Big Data infrastructure for batch processing and responsible for building scalable data solutions
using Spark.
● Extract source feeds from the AWS S3 (OneLake) location to read/modify/update the Parquet/CSV/fixed-length data using Spark/Spark SQL and store it in HDFS or OneLake locations.
● Worked on Airflow DAGs for generating emails and fetching reports from Tableau views (a minimal DAG sketch follows this list).
● Worked on Airflow for scheduling tasks and workflows.
● Worked on MySQL for running complex queries, which are part of Airflow DAGs, to fetch the records and load them into the front end.
● As part of the functional process, performed joins, aggregations, filters, and other transformations on the datasets using Spark. Experienced in handling bad, null, zero, and partial records in Spark SQL.
● Worked on back-end API development using the Python Flask framework.
● Created APIs in Python for new features, for example downloading and previewing scrap data, and tested the APIs in Postman by passing the request parameters.

● Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
● Created ETL test cases using the SDLC procedures and reviewed them with the Test Manager.
● Executed all the ETL test cases in the test environment, maintained them, and documented the test queries and results for future reference.
● Experience in GCP Dataproc, GCS, Cloud functions, BigQuery.
● Experience in moving data between GCP and Azure using Azure Data Factory.
● Experience in building Power BI reports on Azure Analysis Services for better performance.
● Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
● Built, configured, and maintained Apache Airflow for workflow management and created workflows in Python.
● Supported the production workflows that deliver the business data on a daily basis and actively resolved production issues in the live environment.
● Analyzed the production defects raised by the clients, identified the root cause, and implemented appropriate
solutions.
● Established a JUnit framework for test-driven development, creating unit test cases to test the quality of code by performing regression testing on the designed frameworks.
● Wrote complex SnowSQL scripts in the Snowflake cloud data warehouse for business analysis and reporting.

● Clearly documented the business solutions including complex technical workflows with an appropriate level of
detail for future reference.
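
A minimal Airflow sketch of the kind of daily report/email DAG described above (assuming Airflow 2.x); the DAG name, task logic, and schedule are illustrative assumptions rather than the actual production DAGs:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def fetch_report(**context):
        # Placeholder: in a real DAG this step would query MySQL and pull a Tableau view
        print("fetching report for", context["ds"])

    def send_summary(**context):
        # Placeholder: in a real DAG this step would format and email the report
        print("sending summary email")

    with DAG(
        dag_id="daily_production_report",   # hypothetical DAG name
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        fetch = PythonOperator(task_id="fetch_report", python_callable=fetch_report)
        send = PythonOperator(task_id="send_summary", python_callable=send_summary)

        # Run the report fetch before the summary email
        fetch >> send
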
Technologies Used: Hadoop, HDFS, Spark (Scala), Python 3, AWS services (EMR, EC2, S3, SNS, Lambda), CI/CD, Jenkins, JUnit, GitHub, Apache Airflow.

Client: Thomson Reuters Eagan, MN January 2020 to March 2022


Sr. Data Engineer

The project involved migrating data from source systems into the EDW (Enterprise Data Warehouse). This process includes extracting data from source systems, applying transformations, and loading the data after query tuning and performance checks. The extraction, transformation, comparison, and loading of the data from DB2 into the data warehouse is done using Teradata client utilities like FastLoad and MultiLoad. Coding of the processes is done using Teradata SQL and BTEQ scripts. These processes are scheduled to run daily, weekly, or monthly. Teradata SQL and the client utilities played a significant role in migrating the data to the data warehouse and achieving the expected gains.
Responsibilities:
● Performed data analysis and gathered column metadata of source systems to understand requirement feasibility.
● Worked on Teradata stored procedures and functions to confirm the data and load it into the tables.
● Developed procedures to populate the customer data warehouse with transaction data, cycle and monthly summary data, and historical data.
● Worked on optimizing and tuning Teradata views and SQL to improve the batch performance and response time of data for users.
● Developed data transition programs from Hadoop 2.x to Hadoop 3.x using Hive and Sqoop.
● Built Enterprise ingestion Sqoop framework to migrate databases from Hadoop 2.6 to 3.1.
● Worked in a mixed DevOps role: Azure architecture/systems engineering, network operations, and data engineering.
● Designed and developed database migrations, and automated the process using shell scripts and Tidal jobs.
● Worked on QA implementation, Encryption, and ingesting databases into HDFS across tenants and shared spaces.
● Involved In Migrating objects from Teradata to Snowflake.
● Explored Spark and improved the performance and optimization of the existing algorithms in Hadoop using SparkContext, Spark SQL, DataFrames, and pair RDDs.
● Used Sqoop to import data from different RDBMS systems like Oracle, DB2, and Netezza and loaded it into HDFS.
● Analyzed the data that needed to be loaded into Hadoop and contacted the respective source teams to get the table information and connection details.
● Heavily involved in testing Snowflake to understand best possible ways to use the cloud resources.
● Created ETL packages to extract data from flat files, reformat the data, insert the reformatted data into the fact tables, and load the fact tables into the data warehouse using SQL Server.
● Performed migration activities utilizing GCP from non-scalable, non-highly-available platforms to scalable, highly available architectures (web applications, databases, storage, etc.).
● Experience in working with product teams to create various store-level metrics and supporting data pipelines written in GCP's big data stack.
● Coordinated with the team and developed a framework to generate daily ad hoc reports and extracts of enterprise data from BigQuery.
● Coordinated with the Data Science team in designing and implementing advanced analytical models in the Hadoop cluster over large datasets.
● Wrote scripts in Hive SQL for creating complex tables with high-performance features like partitioning, clustering, and skewing.
● Worked on downloading BigQuery data into pandas or Spark DataFrames for advanced ETL capabilities (a brief sketch follows this list).
● Worked with Google Data Catalog and other Google Cloud APIs for monitoring, query, and billing-related analysis of BigQuery usage.
● Worked on creating POC for utilizing the ML models and Cloud ML for table Quality Analysis for the batch process.
● Knowledge of Cloud Dataflow and Apache Beam.
● Good knowledge of using Cloud Shell for various tasks and deploying services.
● Created BigQuery authorized views for row level security or exposing the data to other teams.
● Expertise in designing and deploying Hadoop clusters and different Big Data analytic tools including Pig, Hive, Sqoop, and Apache Spark, with the Cloudera distribution.
● Worked with the DevOps team to strategize building and versioning infrastructure efficiently using Terraform.
● Served as an integrator between data architects, data scientists, and other data consumers.
● Built the logical and physical data models for Snowflake as per the required changes.
● Ability to do proofs of concept for managers in Big Data/GCP technologies and work closely with solution architects to achieve both short- and long-term goals.
● Examined the data quality to identify inconsistencies in table relations and missing updated data, and applied Python libraries to handle the missing data.
● Used Agile methodologies with hands-on experience on Jira.
● Configured JSON files to rewrite scripts on Hadoop 3.1 and to ingest data into the respective partitions.
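
A minimal sketch of pulling BigQuery results into a pandas DataFrame for downstream ETL, in the spirit of the BigQuery/pandas work noted above; the project ID, dataset, table, and columns are hypothetical:

    from google.cloud import bigquery

    # Hypothetical project ID; credentials are assumed to be configured in the environment
    client = bigquery.Client(project="my-analytics-project")

    sql = """
        SELECT store_id, SUM(sales_amount) AS total_sales
        FROM `my-analytics-project.retail.daily_sales`   -- hypothetical table
        WHERE sale_date >= '2022-01-01'
        GROUP BY store_id
    """

    # Run the query and load the result set into pandas for further transformation
    df = client.query(sql).to_dataframe()
    print(df.head())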

Client: WellCare, Tampa, FL September 2017 to December 2019


Data Engineer
Responsibilities:

● Developed the code for Data extractions from Oracle Database and loaded it into the AWS platform using AWS Data
Pipeline.
● Design and develop ETL integration patterns using Python on Spark.
● Developed a framework for converting existing PowerCenter mappings to PySpark (Python and Spark) jobs.
● Created PySpark data frames to bring data from DB2 to Amazon S3.
● Translate business requirements into maintainable software components and understand the impact (technical and business).
● Provide guidance to the development team working on PySpark as an ETL platform.
● Make sure that quality standards are defined and met.
● Optimize PySpark jobs to run on a Kubernetes cluster for faster data processing.
● Provide workload estimates to the client.
● Developed framework for Behaviour Driven Development (BDD).
● Migrated on-premises Informatica ETL processes to the AWS cloud and Snowflake.
● Implemented a CI/CD (Continuous Integration and Continuous Deployment) pipeline for code deployment.
● Reviewed components developed by the team members.
● Imported data from AWS storage service S3 into Spark RDD and performed transformations and actions on RDD.
● Worked on AWS EC2, IAM, S3, Lambda, EBS, Elastic Load Balancer (ELB), and Auto Scaling group services. Utilized AWS EMR, S3, and CloudWatch utilities to run and monitor Hadoop and Spark jobs on AWS.
● Implemented AWS cloud computing platform using RDS, S3, Redshift, and Python.
● Created scripts to sync data between local MongoDB and Postgres databases with those on AWS.
● Implemented serverless architecture using AWS Lambda with Amazon S3 and Amazon DynamoDB.
● Worked extensively with importing metadata into Hive using Python and migrated existing tables and applications
to work on the AWS cloud (S3).
● Collecting and aggregating large amounts of log data using Flume and tagging data in HDFS for further analysis.
● Designed and Developed Extract, Transform, and Load (ETL) code using Informatica Mappings to load data from
heterogeneous Source systems flat files, XML’s, MS Access files, Oracle to target system Oracle under Stage, then to
data warehouse and then to Data Mart tables for reporting.
● Developed multiple spark batch jobs in Scala using Spark SQL and performed transformations using many APIs and
updated master data in Cassandra database as per the business requirement.
● Developed Spark streaming application, which helps to extract data from cloud to hive table and used Spark SQL to
process the massive amount of structured data.
● Worked on User Defined Functions in Hive to load data from HDFS to run aggregation functions on multiple rows.
● Experienced in writing Storm topology to accept the events from Kafka producer and emit into Cassandra.
● Transferred data from different data sources into HDFS using Kafka producers, consumers, and Kafka brokers, and used ZooKeeper as the coordinator between the Kafka brokers (a producer sketch follows this list).
● Analyzed HBase data in Hive by creating externally partitioned and bucketed tables.
● Designed appropriate Partitioning/Bucketing schema in HIVE for efficient data access during analysis and designed a
data warehouse using Hive external tables and created Hive queries for analysis.
● Configured the Hive metastore with MySQL to store the metadata for Hive tables and used Hive to analyze data ingested into HBase using Hive-HBase integration.
● Wrote complex Hive queries to extract the data from various sources (data lake) and persist the data into HDFS.
● Worked with Flume for building fault-tolerant data Ingestion pipeline for transporting streaming data into HDFS.
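
A minimal sketch of a Kafka producer publishing JSON events to a topic, along the lines of the Kafka ingestion described above; the broker address, topic name, and payload are placeholders (uses the kafka-python package):

    import json
    from kafka import KafkaProducer

    # Serialize dict payloads to JSON bytes before sending
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",                      # hypothetical broker
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    event = {"sensor_id": "s-101", "reading": 42.7}              # example payload
    producer.send("sensor-events", value=event)                  # hypothetical topic
    producer.flush()   # block until the message is delivered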

Environment: Hadoop 2.x, HDFS, MapReduce, Apache Spark, Spark SQL, Spark Streaming, Scala, Pig, Hive, Oozie, Sqoop,
Kafka, Flume, Nifi, Zookeeper, Informatica, Cassandra, HBase, Postgres, MongoDB, AWS, Python, Linux, Snowflake,
Tableau.

Client: Limerock, India April 2015 to June 2017


Data Engineer
Responsibilities:
● Loaded and transformed huge sets of structured, semi-structured, and unstructured data using Hadoop/Big Data.
● Wrote numerous MapReduce jobs in Scala for data cleansing and used Impala, a parallel-processing query engine, to analyze the data stored in the Hadoop clusters.
● Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators.
● Experience in GCP Dataproc, GCS, Cloud functions, BigQuery.
● Transferred applications and data from in-house systems to GCP environments.
● Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery.
● Used Python libraries, PCA, feature engineering techniques, and label-encoding pre-processing techniques to reduce the high-dimensional data.
● Implemented complex ETL pipelines to integrate the data coming from different sources into the single data
warehouse repository.
● Developed SQL scripts for creating tables, Sequences, Triggers, views and materialized views.
● Addressed overfitting and underfitting by tuning the hyperparameters of the machine learning algorithms using Lasso and Ridge regularization, and used Git to coordinate team development.
● Wrote Hadoop Jobs to analyze data using MapReduce, Apache Crunch, Hive, Splunk, and Pig.
● Worked on creating Hive tables and written Hive queries for data analysis to meet business requirements.
● Worked with Apache Spark which provides a fast engine for big data processing integrated with Scala.
● Develop quality check modules in PySpark and SQL to validate data in a data lake, automated the process to
trigger the modules before the data gets ingested.
● Enhanced and optimized product Spark code to aggregate, group, and run data mining tasks using the Spark
framework.
● Processed different data set files like JSON, CSV, XML, Log files from HDFS into Spark using Scala and Spark-SQL
for faster testing and processing of data.
● Experience in collecting real-time log data from diverse sources like RESTful APIs and social media using web crawlers in Python, filtering it, and loading the data into HDFS (ingested through Kafka) for further analysis.
● Experienced in using Sqoop to import and export the data between relational database servers (Oracle &
MySQL) and Hadoop.
● Experience in working with job workflow scheduling and monitoring tools like Oozie and Zookeeper.
● Expertise in handling significant data processing in the streaming process using Spark, along with Scala.
● Used Apache Nifi to automate the flow of data in the Software development life cycle.
● Created a Kafka producer to connect to different external sources and bring the data to a Kafka broker.
● Used Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Scala, as well as NoSQL databases such as HBase and Cassandra (a minimal streaming sketch follows this list).
● Implemented workflows using Apache Oozie framework to automate tasks.
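
A minimal PySpark Structured Streaming sketch of consuming a Kafka topic and landing the stream in HDFS, similar in spirit to the Spark streaming work above; the broker, topic, and paths are hypothetical, and the spark-sql-kafka connector package is assumed to be available:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    # Subscribe to a Kafka topic (hypothetical broker and topic)
    stream = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "events")
        .load()
    )

    # Kafka delivers binary key/value columns; cast the value to string for downstream parsing
    messages = stream.select(col("value").cast("string").alias("raw_event"))

    # Continuously land the stream as Parquet files in HDFS (hypothetical paths)
    query = (
        messages.writeStream
        .format("parquet")
        .option("path", "hdfs:///data/streaming/events/")
        .option("checkpointLocation", "hdfs:///checkpoints/events/")
        .start()
    )
    query.awaitTermination()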

Client: Edvensoft Solutions India Pvt. Ltd, India June 2013 to March 2015
SQL Developer

Responsibilities:
This project involves automating the migration of financial risk data from an Oracle database to HDFS to enable proper and fast analytics, by which the client can improve their service. The business logic, written in thousands of Oracle stored procedures, was migrated to Spark SQL in an optimized way. Existing data from Oracle was dumped into Hive tables using Sqoop and the different types of data were computed using Spark SQL; Kafka was automated with Oozie to continuously stream future data into HDFS/Hive, where the business logic is computed using Spark SQL.

● Involved in gathering requirements, performing source system analysis, and developing ETL design specification
documents to load data from the operational data store to the data warehouse.
● Analyzing the Oracle stored procedures based on the business documents.
● Implemented several scheduled Spark, Hive, and MapReduce jobs on the Hadoop MapReduce distribution.
● Mapping the functional logic for all the stored procedures defined.
● Converting the Oracle stored procedure logic into Spark SQL using Java features.
● Implemented the generic tool for file level validation according to the business logic.
● Unit testing using JUnit.
● Involved in Sonar and Emma code coverage for JUnit.
● Code quality checks using Jenkins and peer reviews in an agile methodology.
● Optimization of Spark SQL code.
● Worked with HBase tables to store the continuous streaming data and created Hive tables on top of the HBase tables.
● Worked on data that was a combination of unstructured and structured data extracted from multiple sources and developed Python scripts to automate the cleansing (a short cleansing sketch follows this list).
● Responsible for interpretation of raw data, statistical results, or compiled information
● Work with architects and senior developers to design and enforce established standards for building reporting
solutions.
● Interacted with commercial enterprise stakeholders to understand and support the requirements and
developed strategic plans to implement solutions and manage client expectations.
● Utilized Tableau Dashboard development for transforming the data obtained from various resources into
actionable insights and to find outliers in data.
● Experience in using SQL server Manager for writing complex SQL queries for retrieving relevant information for
insights and dashboards.
● Prepared high-level analysis reports using Excel and Tableau. Provided feedback on the quality of Data including
identification of billing patterns and outliers.
● Utilized E/R Studio to create comprehensive mapping documents and data lineage.
● Perform initial analysis to assess the quality of the data and conduct further analysis to determine the insights
from the data.
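
A short, illustrative Python cleansing sketch of the kind of automation described above; the file path, column names, and cleansing rules are hypothetical:

    import pandas as pd

    def clean_transactions(path: str) -> pd.DataFrame:
        """Load a raw extract and apply basic cleansing rules."""
        df = pd.read_csv(path)

        # Normalize column names and trim stray whitespace in string fields
        df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
        for col in df.select_dtypes(include="object"):
            df[col] = df[col].str.strip()

        # Drop exact duplicates and rows missing a key identifier (hypothetical key column)
        df = df.drop_duplicates()
        df = df.dropna(subset=["transaction_id"])

        # Coerce amounts to numeric, flagging bad values as NaN for review
        df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
        return df

    # Usage example (path is a placeholder):
    # cleaned = clean_transactions("raw_extract.csv")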
