
Akhil Reddy

Email: akhilreddy0760@gmail.com
PH: 4153471504
Sr Big Data Engineer / GCP / Hadoop Cloud Developer
As a data engineer, I was responsible for assessing and documenting data and client-specific
requirements to develop user-friendly BI solutions: reports, dashboards, and decision aids. Worked on data
warehousing, data engineering, feature engineering, big data, ETL/ELT, and business intelligence; specialized in
AWS frameworks, the GCP / Hadoop ecosystem, Excel, Snowflake, relational databases, and tools such as Tableau, Power BI,
Python, and Data DevOps frameworks / Azure DevOps pipelines.

PROFESSIONAL SUMMARY

 8+ years of professional experience in information technology with expertise in the areas of
BIG DATA, HADOOP, SPARK, HIVE, IMPALA, SQOOP, FLUME, KAFKA, SQL tuning, ETL
development, report development, database development, and data modeling, with strong knowledge
of Oracle database architecture.
 Hands-on experience in GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub,
Cloud Shell, gsutil, the bq command-line utility, Dataproc, and Stackdriver.
 Solid knowledge of and experience in the Cloudera ecosystem (HDFS, YARN, Hive, Sqoop, Flume,
HBase, Oozie, Kafka, Pig), data pipelines, and data analysis and processing with Hive SQL, Impala,
Spark, and Spark SQL.
 Architected and implemented MLOps continuous-delivery and automation pipelines for cloud-native
machine learning (AWS Step Functions, CodeBuild, Lambda, Azure DevOps, Azure ML, GCP
ML with Kubeflow Pipelines, TensorFlow, Dataflow, Cloud Storage, AI Hub, Cloud Build)
 Used Flume, Kafka, and Spark Streaming to ingest real-time or near-real-time data into HDFS.
 Analyzed data and provided insights with R Programming and Python Pandas
 Hands-on experience architecting ETL transformation layers and writing Spark jobs to do
the processing.
 Good programming experience with Python and Scala.
 Hands-on experience with NoSQL databases like HBase and Cassandra.
 Analyzed approaches for migrating Oracle databases to Redshift.
 Experience with scripting languages like PowerShell, Perl, Shell, etc.
 Extensive experience in writing MS SQL and T-SQL procedures and Oracle functions and queries using TOAD
 Effective team member, collaborative and comfortable working independently
 Proficient in achieving Oracle SQL plan stability, maintaining SQL plan baselines, and using ASH, AWR,
ADDM, and SQL Advisor for proactive follow-up and SQL rewrites.
 Experience on Shell scripting to automate various activities.
 Application development with Oracle Forms and Reports, OBIEE, Discoverer, and Report Builder,
along with ETL development.
 Created shell scripts to fine-tune the ETL flow of Informatica workflows.
 Experience using Python machine learning libraries such as pandas, NumPy, Matplotlib, scikit-learn, and
SciPy to load, summarize, and visualize datasets, evaluate algorithms, and make predictions
(see the sketch at the end of this summary).
 Expertise in Amazon Web Services (AWS) Cloud Platform services including EC2, S3, VPC, ELB, IAM,
DynamoDB, CloudFront, CloudWatch, Route 53, Elastic Beanstalk, EBS, Auto Scaling, Security Groups,
EC2 Container Service (ECS), CodeCommit, CodePipeline, CodeBuild, CodeDeploy, Redshift,
CloudFormation, CloudTrail, OpsWorks, Kinesis, SQS, SNS, and SES.
 Experience in developing RESTful API's using Django REST Framework.
 Strong knowledge of data preparation, data modeling, and data visualization using Power BI, and
experience developing reports and dashboards with various visualizations in
Tableau.
 Examine and evaluate reporting requirements for various business units.
 Able to work across both GCP and Azure clouds in parallel.
 Performance optimization: Experience in tuning and optimizing Apache Druid for performance,
including optimizing data processing workflows, data storage configurations, and query
performance to achieve optimal system performance.
 Strong expertise in Apache Druid: Experience and proficiency in working with Apache Druid,
including data ingestion, data storage, indexing, querying, and analysis using Druid's query APIs
or query languages like Druid SQL.
 Imported data from Hive into GCS buckets and later ingested it into Druid.
 DevOps role converting existing AWS infrastructure to a serverless architecture (AWS Lambda,
Kinesis) deployed via Terraform templates.
 Strong background in DevOps practices and tools, utilizing technologies like Docker, Kubernetes,
Jenkins, or GitLab CI/CD for building and deploying applications and data pipelines in a cloud
environment.
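
A minimal sketch of the load / summarize / visualize / evaluate / predict workflow referenced above with pandas, Matplotlib, and scikit-learn; the CSV path and column names are illustrative placeholders, not taken from a specific project.

```python
# Minimal sketch: load, summarize, visualize, evaluate, and predict with
# pandas / Matplotlib / scikit-learn. "dataset.csv" and "target" are placeholders.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("dataset.csv")           # load the dataset
print(df.describe())                      # summarize the dataset
df.hist(figsize=(10, 8))                  # visualize feature distributions
plt.savefig("feature_distributions.png")

X, y = df.drop(columns=["target"]), df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
print(cross_val_score(model, X_train, y_train, cv=5).mean())  # evaluate the algorithm
model.fit(X_train, y_train)
predictions = model.predict(X_test)       # make predictions on held-out data
```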

TECHNICAL SKILLS:

Databases: MySQL, MS SQL Server, T-SQL, Oracle, PL/SQL.


Google Cloud Platform: GCP Cloud Storage, Big Query, Composer, Cloud Dataproc, Cloud SQL, Cloud Functions,
Cloud Pub/Sub.
Big Data: Spark, Azure Storage, Azure Database, Azure Data Factory, Azure Analysis Services.
ETL/Reporting: Power BI, Data Studio, Tableau
Python: Pandas, Numpy, SciPy, Matplotlib.
Programming: Shell/Bash, C#, R, Go.
MS: Visio, Power Point.

PROFESSIONAL EXPERIENCE:
Client: Expedia, San Francisco, CA. Sep 2021 – Present

GCP Data Engineer / Big Data Engineer


Project Description: Expedia Group, Inc. is an American online travel shopping company for consumer and
small business travel. Its websites, which are primarily travel fare aggregators and travel metasearch engines,
include Expedia.com, Vrbo, Hotels.com, Hotwire.com, Orbitz, Travelocity, trivago, and CarRentals.com. This
project was about migrating old servers to replace them with scalable, updated ones, migrating data from
various databases and file systems, and providing a centralized warehouse for ETL.
Deliverables:
o Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud
Dataflow with Python (see the Apache Beam sketch after this list).
o Experience working with Google Cloud big data technologies such as Dataproc, Dataflow, BigQuery,
and Cloud Storage, along with knowledge of Pub/Sub.
o Devised simple and complex SQL scripts to check and validate Dataflow in various applications.
o Performed Data Analysis, Data Migration, Data Cleansing, Transformation, Integration, Data Import, and
Data Export through Python.
o Developed and deployed data pipeline in cloud such as AWS and GCP
o Performed data engineering functions: data extract, transformation, loading, and integration in support
of enterprise data infrastructures - data warehouse, operational data stores and master data
management
o Responsible for data services and data movement infrastructure; good experience with ETL concepts,
building ETL solutions, and data modeling
o Architected several DAGs (Directed Acyclic Graphs) for automating ETL pipelines
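
A minimal Apache Beam sketch of the streaming Pub/Sub-to-BigQuery Dataflow job described in the first deliverable above; the project, topic, table, and schema names are illustrative placeholders.

```python
# Minimal Apache Beam sketch: stream Pub/Sub messages into BigQuery via Dataflow.
# Project, region, bucket, topic, table, and schema are illustrative placeholders.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

options = PipelineOptions(
    runner="DataflowRunner", project="my-project", region="us-central1",
    temp_location="gs://my-bucket/tmp",
)
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | "ParseJson" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.events",
            schema="event_id:STRING,event_ts:TIMESTAMP,payload:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```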
Responsibilities
 Experience in building and architecting multiple Data pipelines, end to end ETL and ELT process for Data
ingestion and transformation in GCP
 Implemented a Continuous Delivery pipeline with Docker and GitHub
 Used Cloud Functions with Python to load data into BigQuery for on-arrival CSV files in a GCS
bucket (see the sketch after this list).
 Wrote a program to download a SQL dump from their equipment maintenance site and load it into a
GCS bucket, then loaded the dump from GCS into MySQL (hosted in Google Cloud SQL) and moved the
data from MySQL to BigQuery using Python, Scala, Spark, and Dataproc.
 Experience in GCP Dataproc, GCS, Cloud functions, BigQuery.
 Hands-on experience architecting the ETL transformation layers and writing Spark jobs to do the
processing.
 Proficiency in Java programming language and experience with Spring Boot framework, enabling the
development of robust and scalable applications for data engineering and ML infrastructure.
 Gathered and processed raw data at scale (including writing scripts, web scraping, calling APIs, writing
SQL queries, and writing applications)
 Experience in fact dimensional modeling (Star schema, Snowflake schema), transactional modeling and
SCD (Slowly changing dimension)
 Devised PL/SQL Stored Procedures, Functions, Triggers, Views and packages. Made use of Indexing,
Aggregation and Materialized views to optimize query performance.
 Worked extensively on data flow and control flow jobs while converting SSIS ETL jobs to DataStage.
 Involved in Performance tuning of DataStage jobs and Oracle queries.
 Developed logistic regression models (Python) to predict subscription response rates based on customer
variables such as past transactions, response to prior mailings, promotions, demographics, interests, and
hobbies.
 Developed near-real-time data pipelines using Spark
 Processed and loaded bounded and unbounded data from Google Pub/Sub topics to BigQuery using Cloud Dataflow
with Python
 Worked on GCP: BigQuery, GCS buckets, Cloud Functions, Cloud Dataflow, Pub/Sub, Cloud Shell, gsutil,
bq command-line utilities, Dataproc, and Stackdriver
 Implemented Apache Airflow for authoring, scheduling, and monitoring Data Pipelines
 Integrated Druid with Hive for high availability and to provide real-time data for SLA reporting.

 Connected Apache Druid to Kafka for performing analytics.

 Worked with Confluence and Jira; skilled in data visualization with Matplotlib and the Seaborn library

 Used big data tools like Hadoop, Spark, Hive

 Implemented machine learning back-end pipeline with Pandas, Numpy
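
A minimal sketch of the GCS-triggered Cloud Function pattern referenced above for loading on-arrival CSV files into BigQuery (1st-gen background function signature); the dataset and table names are illustrative placeholders.

```python
# Minimal sketch of a GCS-triggered Cloud Function (1st gen, Python runtime)
# that loads newly arrived CSV files into BigQuery. Dataset/table are placeholders.
from google.cloud import bigquery

def load_csv_to_bq(event, context):
    """Background function fired on google.storage.object.finalize."""
    uri = f"gs://{event['bucket']}/{event['name']}"
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    # Kick off the load job and wait for completion so errors surface in the logs.
    client.load_table_from_uri(uri, "my_dataset.landing_table", job_config=job_config).result()
```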


Environment: GCP, BigQuery, GCS bucket, Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil,
Docker, Kubernetes, AWS, Apache Airflow, Python, pandas, Matplotlib, Seaborn, text mining, NumPy,
scikit-learn, heat maps, bar charts, line charts, ETL workflows, DataStage, linear regression, multivariate
regression, Scala, Druid

Client: Credit Karma, San Francisco, CA Nov 2019 to Aug 2021

Sr. GCP Data Engineer


Project Description: Credit Karma is an American multinational personal finance company. It is best known as a
free credit and financial management platform, but its features also include monitoring of unclaimed
property databases and a tool to identify and dispute credit report errors. Data quality management is a
discipline that includes the methods to measure, improve and certify the quality and integrity of an
organization’s data. This framework provides an objective approach to applying consistent data-flow
processes that focus on data quality priorities, assessing the data quality of a data asset and producing standard
data documentation with the goal of continuous improvement in data quality for the Data Fabric.

Responsibilities:

 Migrated an entire Oracle database to BigQuery and used Power BI for reporting.
 Built data pipelines in Airflow on GCP for ETL-related jobs using different Airflow operators (see the
DAG sketch after this list).
 Experience in GCP Dataproc, GCS, Cloud functions, BigQuery.
 Experience in moving data between GCP and Azure using Azure Data Factory.
 Experience in building Power BI reports on Azure Analysis Services for better performance.
 Used the Cloud Shell SDK in GCP to configure services such as Dataproc, Cloud Storage, and BigQuery
 Coordinated with the team and developed a framework to generate daily ad hoc reports and extracts from
enterprise data in BigQuery.
 Designed and coordinated with the Data Science team in implementing advanced analytical models in a
Hadoop cluster over large datasets.
 Wrote scripts in Hive SQL for creating complex tables with high-performance features like partitioning,
clustering, and skewing
 Work related to downloading BigQuery data into pandas or Spark data frames for advanced ETL
capabilities.
 Skilled in data modeling, ETL (Extract, Transform, Load) processes, and data integration techniques,
ensuring efficient data ingestion, transformation, and integration across various data sources and
platforms.
 Worked with Google Data Catalog and other Google Cloud APIs for monitoring, query, and billing-related
analysis of BigQuery usage.
 Worked on creating a POC for utilizing ML models and Cloud ML for table quality analysis for the
batch process.
 Involved in designing and developing enhancements of CSG using AWS APIs.
 Implemented large Lambda architectures using Azure Data platform capabilities like Azure Data Lake,
Azure Data Factory, HDInsight, Azure SQL Server, Azure ML and Power BI.
 Designed end to end scalable architecture to solve business problems using various Azure Components
like HDInsight, Data Factory, Data Lake, Storage and Machine Learning Studio.
 Experience implementing Cloud based Linux OS in AWS to Develop Scalable Applications with Python.
 Developed JSON Scripts for deploying the Pipeline in Azure Data Factory (ADF) that process the data
using the SQL Activity.
 Develop and deploy the outcome using spark code in Hadoop cluster running on GCP.
 Working on creating Various big data pipelines as part of the migration from on-prem servers into AWS.
 Knowledge about cloud dataflow and Apache beam.
 Good knowledge in using cloud shell for various tasks and deploying services.
 Created BigQuery authorized views for row-level security and for exposing data to other teams.
 Expertise in designing and deployment of Hadoop cluster and different Big Data analytic tools including
Pig, Hive, SQOOP, Apache Spark, with Cloudera Distribution.
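
A minimal Airflow DAG sketch of the GCS-to-BigQuery ETL pattern referenced above; bucket, project, dataset, and table names are illustrative placeholders, and the operator import paths assume the apache-airflow-providers-google package (paths and arguments differ on older Airflow versions).

```python
# Minimal Airflow DAG sketch: daily GCS -> BigQuery load followed by a reporting query.
# All resource names are illustrative placeholders.
from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="daily_gcs_to_bq_etl",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_raw = GCSToBigQueryOperator(
        task_id="load_raw_csv",
        bucket="my-landing-bucket",
        source_objects=["exports/{{ ds }}/*.csv"],
        destination_project_dataset_table="my_project.staging.daily_export",
        source_format="CSV",
        skip_leading_rows=1,
        write_disposition="WRITE_TRUNCATE",
    )
    build_report = BigQueryInsertJobOperator(
        task_id="build_daily_report",
        configuration={
            "query": {
                "query": "SELECT report_date, COUNT(*) AS row_count "
                         "FROM `my_project.staging.daily_export` GROUP BY report_date",
                "destinationTable": {
                    "projectId": "my_project", "datasetId": "reporting", "tableId": "daily_counts",
                },
                "writeDisposition": "WRITE_TRUNCATE",
                "useLegacySql": False,
            }
        },
    )
    load_raw >> build_report
```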

Environment: GCP, BigQuery, GCS bucket, Cloud Functions, Apache Beam, Cloud Dataflow, Cloud Shell, gsutil,
Dataproc, VM instances, Cloud SQL, MySQL, Postgres, SQL Server, Python, Scala, Spark, Hive, Spark SQL.

Client: Ford, Dearborn, MI Jun 2018 to Oct 2019


Role: Data Engineer
Project Description: Ford Motor Company designs, manufactures, and services cars and trucks. The Company
also provides vehicle-related financing, leasing, and insurance through its subsidiary. This project focused
mainly on expanding and optimizing the data and data pipeline architecture, as well as building and maintaining
the data workflow and designing the optimal data pipelines and infrastructure required for extraction,
transformation, and loading of data from a wide variety of data sources.
Deliverables:
o Developed Python scripts to automate data sampling process. Ensured the data integrity by checking for
completeness, duplication, accuracy, and consistency
o Defined facts, dimensions and designed the data marts using the Ralph Kimball's Dimensional Data Mart
modeling methodology using Erwin
o Implementing and Managing ETL solutions and automating operational processes.
o Optimizing and tuning the Redshift environment, enabling queries to perform up to 100x faster for Tableau and
SAS Visual Analytics.
o Integrated Kafka with Spark Streaming for real-time data processing (see the sketch below)
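
A minimal PySpark sketch of the Kafka-to-Spark streaming integration noted above, written here with Structured Streaming; the broker, topic, schema, and paths are illustrative placeholders, and the job assumes the spark-sql-kafka connector package is available.

```python
# Minimal PySpark Structured Streaming sketch: consume Kafka events and land
# them as Parquet. Broker, topic, schema, and HDFS paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("kafka_stream_ingest").getOrCreate()

schema = (StructType()
          .add("vehicle_id", StringType())
          .add("event_type", StringType())
          .add("event_ts", TimestampType()))

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "vehicle-events")
          .option("startingOffsets", "latest")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/raw/vehicle_events")
         .option("checkpointLocation", "hdfs:///checkpoints/vehicle_events")
         .outputMode("append")
         .start())
query.awaitTermination()
```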

Responsibilities:
 Worked on developing ETL processes (DataStage Open Studio) to load data from multiple data sources
to HDFS using Flume and Sqoop, and performed structural modifications using MapReduce and Hive.
 Worked collaboratively to manage build outs of large data clusters and real time streaming with Spark.
 Developed ETL data pipelines using Spark, Spark streaming and Scala.
 Responsible for loading Data pipelines from web servers using Sqoop, Kafka and Spark Streaming API.
 Have experience of working on Snowflake data warehouse.
 Creating Databricks notebooks using SQL, Python and automated notebooks using jobs.
 Used Spark for interactive queries, processing of streaming data and integration with popular NoSQL
database for huge volume of data.
 Hands-on experience with Feast, the Feature Management API, for feature store implementation and
management. Proficient in designing and maintaining feature stores to support data-driven applications
and machine learning models.
 Developed and designed an API (RESTful Web Service) for the company’s website.
 Developed and designed e-mail marketing campaigns using HTML and CSS.
 Extensive experience in Amazon Web Services (AWS) Cloud services such as EC2, VPC, S3, IAM, EBS,
RDS, ELB, Route 53, OpsWorks, DynamoDB, Auto Scaling, CloudFront, CloudTrail, CloudWatch,
CloudFormation, Elastic Beanstalk, AWS SNS, AWS SQS, AWS SES, AWS SWF, and AWS Direct Connect.
 Implemented Spark using Scala and SparkSQL for faster testing and processing of data.
 Developed various UDFs in Map-Reduce and Python for Pig and Hive.
 Defined job flows and developed simple to complex Map Reduce jobs as per the requirement.
 Developed PIG UDFs for manipulating the data according to Business Requirements and worked on
developing custom PIG Loaders.
 Installed Oozie workflow engine to run multiple Hive and Pig Jobs.
 Designing and Developing Apache NiFi jobs to get the files from transaction systems into data lake raw
zone.
 Developed PIG Latin scripts for the analysis of semi structured data
 Experienced in Databricks platform where it follows best practices for securing network access to cloud
applications.
 Implemented large Lambda architectures using Azure Data platform capabilities like Azure Data Lake,
Azure Data Factory, Azure Data CatLog, HDInsight, Azure SQL Server, Azure ML and Power BI.
 Using Azure Databricks, created Spark clusters and configured high concurrency clusters to speed up
the preparation of high-quality data.
 Used Azure Databricks for fast, easy, and collaborative spark-based platform on Azure.
 Used Hive and created Hive tables and involved in data loading and writing Hive UDFs.
 Installed and configured Hive, Pig, Sqoop, Flume and Oozie on the Hadoop cluster.
 Analyzed the SQL scripts and redesigned them using PySpark SQL for faster performance (see the sketch after this list).
 Used Azure Data Factory, SQL API and MongoDB API and integrated data from MongoDB, MS SQL, and
cloud (Blob, Azure SQL DB, cosmos DB)
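
A minimal sketch of the SQL-to-PySpark SQL rewrite pattern referenced above; the table and column names are illustrative placeholders.

```python
# Minimal sketch of rewriting a SQL script with the PySpark DataFrame/SQL API.
# Table and column names are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sql_to_pyspark").enableHiveSupport().getOrCreate()

orders = spark.table("warehouse.orders")
customers = spark.table("warehouse.customers")

# Equivalent of: SELECT c.region, SUM(o.amount) AS total_amount
#                FROM orders o JOIN customers c ON o.customer_id = c.customer_id
#                WHERE o.order_date >= '2019-01-01'
#                GROUP BY c.region
summary = (orders.filter(F.col("order_date") >= "2019-01-01")
           .join(customers, "customer_id")
           .groupBy("region")
           .agg(F.sum("amount").alias("total_amount")))

summary.write.mode("overwrite").saveAsTable("warehouse.region_order_summary")
```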
Environment: Spark, Spark Streaming, Apache Kafka, Apache NiFi, Hive, Tez, Azure, Azure Databricks, Azure
data grid, Azure Synapse Analytics, Azure Data Catalog, ETL, Pig, PySpark, UNIX, Linux, Tableau, Teradata,
Snowflake, Sqoop, Hue, Oozie, Java, Scala, Python, Git, GitHub

Client: Four Soft Private Limited, Hyderabad, India Sep 2015 to Dec 2017
Data Modeler/Cloud Engineer
This project is an Ecommerce web-based application which allowed the customer to get a view of all the
products in the store and buy them online. The application mainly dealt with the online payment and billings. It
provides a search feature that uses regular expressions or pattern matching with a user interface called a
search box, through which consumers can access and view the latest and top-selling products and offers. It uses a
secure payment gateway solution for customers to make payments.
I worked closely with data/application architects and other stakeholders to design and maintain
advanced data pipelines and was responsible for developing and supporting advanced reports and data pipelines
that provide accurate and timely data for internal and external clients.

Deliverables:
o Responsible for building an Enterprise Data Lake to bring ML ecosystem capabilities to production and
make it readily consumable for data scientists and business users.
o Processed and transformed the data using AWS Glue to assist the Data Science team as per business
requirements.
o Implemented RESTful services using Spring MVC
o Used Spring JDBC for interacting with the database.
o Involved in writing JUnit test cases.
o Performing code reviews using SonarQube.
o Deploy Spring Boot Applications into Google Cloud - App Engine.
o Developed Spark applications for cleaning and validating the data ingested into the AWS cloud (see the sketch after this list).
o Working on fine-tuning Spark applications to improve the overall processing time for the pipelines.
o Implement simple to complex transformation on Streaming Data and Datasets.
o Worked on analyzing the Hadoop cluster and different big data analytic tools including Hive, Spark, Python,
Sqoop, Flume, and Oozie.
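
A minimal PySpark sketch of the cleaning and validation step described above for data ingested to S3; the bucket, paths, and validation rules are illustrative placeholders.

```python
# Minimal PySpark sketch: de-duplicate, validate, and split ingested records
# into clean and rejected outputs. Bucket, paths, and rules are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("ingest_cleaning").getOrCreate()

raw = spark.read.json("s3a://ingest-bucket/raw/orders/")

cleaned = (raw
           .dropDuplicates(["order_id"])                          # de-duplicate on key
           .filter(F.col("order_id").isNotNull())                 # required key present
           .withColumn("amount", F.col("amount").cast("double"))  # enforce numeric type
           .filter(F.col("amount") >= 0))                         # simple business rule

rejected = raw.join(cleaned, "order_id", "left_anti")             # rows that failed checks

cleaned.write.mode("overwrite").parquet("s3a://ingest-bucket/clean/orders/")
rejected.write.mode("overwrite").parquet("s3a://ingest-bucket/rejects/orders/")
```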
Responsibilities:
 Used Spark Streaming to stream data from external sources via Kafka; responsible for
migrating the code base from the Cloudera platform to Amazon EMR and evaluated Amazon ecosystem
components like Redshift and DynamoDB.
 Perform configuration, deployment, and support of cloud services in Amazon Web Services (AWS).
 Designing and building multi-terabyte, full end-to-end Data Warehouse infrastructure from the ground
up on Amazon Redshift.
 Designed, developed, and tested ETL processes in AWS Glue to migrate campaign data from external sources
like S3 (ORC/Parquet/text files) into AWS Redshift.
 Created stacks using CloudFormation to create datasets in the AWS Glue Data Catalog.
 Designed and developed an ETL process in AWS Glue to migrate flight aware usage data from the S3 data
source to Redshift.

 Coded ingestion pipelines using Step Functions, Lambdas, SQS queues, SNS notifications, Glue ETL,
crawlers, and Athena.

 Assisted the team in developing various Step Functions, Lambdas, CFTs, SNS topics, SQS queues, and Glue
crawlers for various types of data flows, including streaming and non-streaming sources in CSV and JSON formats.

 Migrate an existing on-premises application to AWS.


 Build and configure a virtual data centre in the Amazon Web Services cloud to support Enterprise Data
Warehouse hosting including Virtual Private Cloud, Security Groups, Elastic Load Balancer.
 Implement data ingestion and handling clusters in real time processing using Kafka.
 Develop Spark Programs using Scala and Java API's and performed transformations and actions
on RDD's.
 Develop Spark application for filtering Json source data in AWS S3 and store it into HDFS with
partitions and used spark to extract schema of Json files.
 Develop Terraform scripts to create the AWS resources such as EC2, Auto Scaling Groups, ELB, S3, SNS
and Cloud Watch Alarms.
 Developed various kinds of mappings with collection of sources, targets and transformations using
Informatica Designer.
 Develop Spark programs with PySpark and applied principles of functional programming to process the
complex unstructured and structured data sets. Processed the data with Spark from Hadoop
Distributed File System (HDFS).
 Implemented a serverless architecture using AWS Lambda with Amazon S3 and Amazon DynamoDB (see the sketch below).
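
A minimal sketch of the serverless pattern above: a Python Lambda handler, triggered by S3 object uploads, that records file metadata in DynamoDB; the table name and attribute keys are illustrative placeholders.

```python
# Minimal Lambda sketch: on each S3 upload event, write one metadata item to
# DynamoDB so downstream jobs can track arrivals. Table/attributes are placeholders.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("ingested_files")

def handler(event, context):
    records = event.get("Records", [])
    for record in records:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"].get("size", 0)
        table.put_item(Item={"object_key": key, "bucket": bucket, "size_bytes": size})
    return {"processed": len(records)}
```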
Environment: Apache Spark, Scala, Java, PySpark, Hive, HDFS, Hortonworks, Apache HBase, AWS EMR, EC2,
AWS S3, AWS Redshift, Redshift Spectrum, RDS, Lambda, Informatica Center, Maven, Oozie, Apache NiFi, CI/CD
Jenkins, Tableau, IntelliJ, JIRA, Python and UNIX Shell Scripting

Client: Sonata Software, India. June 2014 to Aug 2015


Python Developer / Data Analyst / Junior Data Scientist
Deliverables:
I was Involved in design and development of Data transformation framework components to support ETL
process, which gets the Single Complete Actionable View of a customer.
Created dashboards and generated insights using Tableau and Python, which helped in identifying
members who were not engaged and people's spending patterns. This enabled the credit union to target
members based on their needs and age, which helped increase member engagement by 2%.
Responsibilities:
 Worked on developing data ingestion pipelines using the Talend ETL tool and Bash scripting with big data
technologies including, but not limited to, Hive, Impala, Spark, and Kafka.
 Experience in developing scalable & secure data pipelines for large datasets.
 Gathered requirements for ingestion of new data sources including life cycle, data quality check,
transformations, and metadata enrichment.
 Work closely with data modelers and data analysts doing design work to document metadata
requirements and facilitate the transition from requirements to design of metadata structure in metadata
repository.
 Work with technical resources and key users on the enhancement and maintenance of the metadata
repository and its contents, and to identify project issues as they arise
 Ensure testing results correspond to the metadata steward business and functional requirements.
 Importing data from MS SQL server and Teradata into HDFS using Sqoop.
 Supported data quality management by implementing proper data quality checks in data pipelines (see the sketch after this list).
 Enhancing Data Ingestion Framework by creating more robust and secure data pipelines.
 Implemented data streaming capability using Kafka and Talend for multiple data sources.
 Responsible for maintaining and handling data inbound and outbound requests through big data
platform.
 Working knowledge of cluster security components like Kerberos, Sentry, SSL/TLS etc.
 Worked with multiple storage formats (Avro, Parquet) and databases (Hive, Impala, Kudu).
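
A minimal PySpark sketch of the kind of data quality checks referenced above (null-key and duplicate checks on an ingested table); the table, columns, and thresholds are illustrative placeholders.

```python
# Minimal PySpark data quality check sketch: fail the pipeline run when null
# keys or excessive duplicates are found. Table, columns, and thresholds are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dq_checks").enableHiveSupport().getOrCreate()

df = spark.table("staging.member_transactions")
total = df.count()

null_keys = df.filter(F.col("member_id").isNull()).count()
dupes = total - df.dropDuplicates(["transaction_id"]).count()

assert null_keys == 0, f"{null_keys} rows with null member_id"
assert dupes / max(total, 1) < 0.01, f"duplicate ratio too high: {dupes}/{total}"
```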
Environment: Spark, HDFS, Hive, Pig, Sqoop, Scala, Kafka, Shell scripting, Linux, Jenkins, Eclipse, Git, Oozie,
Talend, Agile Methodology, Metadata, Teradata
