
PAVAN SRI HARSHA LAGHUVARAPU

Phone: +1 8143778335
Email: pavansriharsha4290@gmail.com

PROFESSIONAL SUMMARY:
● 7+ years of professional experience in data engineering, involved in developing, implementing, and configuring
Hadoop ecosystem components in Linux environments, developing and maintaining applications using Python, and
devising strategies for deploying Big Data technologies to efficiently solve Big Data processing requirements
● Good knowledge of the Python programming language
● Good experience with Azure Databricks, Azure Data Lake, and Azure Data Factory
● Experience with Hadoop ecosystem components such as HDFS, MapReduce, YARN, Pig, Hive, and Sqoop
● Excellent programming skills at a higher level of abstraction using Python and Spark
● Good understanding of real-time data processing using Spark
● Hands-on experience importing and exporting data between databases such as MySQL, Oracle, and Teradata
and HDFS using Sqoop
● Extensively experienced Big Data/Hadoop developer with varying levels of expertise across Hadoop ecosystem
projects, including HDFS, MapReduce, Hive, and Sqoop
● Strong experience writing scripts using the Python, PySpark, and Spark APIs to analyze data
● Maintained BigQuery, PySpark, and Hive code by fixing bugs and delivering the enhancements required by
business users
● Hands-on use of the Spark and Python APIs to compare the performance of Spark with Hive and SQL, and of
Spark SQL to manipulate DataFrames in Python
● Expertise in Python, including user-defined functions (UDFs) for Hive and Pig written in Python
● Experience developing MapReduce programs on Apache Hadoop to analyze big data as per requirements
● Proficient in SQL across several dialects, including MySQL, PostgreSQL, SQL Server, and Oracle
● Experience in managing and reviewing Hadoop Log files
● Experienced using Sqoop to import data into HDFS from RDBMS and vice-versa
● Experience with and solid understanding of Spark
● Hands-on experience extracting data from log files and copying it into HDFS using Flume
● Experienced in agile approaches, including Extreme Programming (XP), Test-Driven Development (TDD), and
Scrum
● Designed various ingestion and processing patterns in Delta Lake based on use cases
● Experience managing and storing confidential credentials in Azure Key Vault
● Built complex data ingestion/processing frameworks using Azure Databricks, Python, and PySpark
● Orchestrated end-to-end data integration pipelines using Azure Data Factory
● Strong working experience with ingestion, storage, querying, processing and analysis of Big Data
● Hands-on experience in Python programming and Spark components such as Spark Core and Spark SQL
● Created RDDs and DataFrames for the required input data and performed data transformations using Spark
Core (a short sketch follows this summary)
● Hands-on experience with Apache Hadoop components such as HDFS, MapReduce, and Hive
● Hands-on experience with CI/CD configurations on the production side
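As a concrete illustration of the RDD/DataFrame work noted above, a minimal PySpark sketch might look like the
following; the file paths and column names (events.log, orders.csv, customer_id, amount) are assumed placeholders
for the example, not values from any specific project.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("summary-example").getOrCreate()

    # Build an RDD from raw text lines and a DataFrame from a CSV file (paths are placeholders).
    lines_rdd = spark.sparkContext.textFile("hdfs:///data/raw/events.log")
    line_count = lines_rdd.count()
    orders_df = spark.read.option("header", "true").csv("hdfs:///data/orders.csv")

    # A simple Spark SQL / DataFrame transformation: total order amount per customer.
    totals = (orders_df
              .withColumn("amount", F.col("amount").cast("double"))
              .groupBy("customer_id")
              .agg(F.sum("amount").alias("total_amount")))

    totals.show()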
TECHNICAL SKILLS:
Big Data/Hadoop Technologies: MapReduce, Spark, Spark SQL, Azure Data Factory, Databricks, Kafka, PySpark, Hive,
YARN, Oozie
Languages: Python, Shell Scripting, SQL
Web Design Tools: HTML, XML
Development Tools: Microsoft SQL Studio, IntelliJ, Azure Databricks, Eclipse
Public Cloud: Microsoft Azure
Development Methodologies: Agile/Scrum, Waterfall
Build Tools: Control-M, Oozie, Jenkins, Toad, SQL Loader, PostgreSQL, Talend, Maven, Ant, RTC, RSA, Hue, SOAP UI
Reporting Tools: MS Office (Word/Excel/PowerPoint/Visio/Outlook), Crystal Reports XI, SSRS, Cognos
Databases: Microsoft SQL Server 2008/2010/2012, MySQL 4.x/5.x, Oracle 11g/12c, DB2, Teradata, Netezza
Operating Systems: Windows (all versions), UNIX, Linux

PROJECTS PROFILE:

Client - General Motors, Dearborn, MI 11/2020 - Till Date


Role - Data Engineer
Description:
This project focuses on migrating the on-premises Enterprise Data Warehouse to the cloud using Microsoft Azure
technologies. It includes rewriting the whole solution using Azure technologies, migrating all the data from
various disparate legacy source systems, interacting with the business to gather and finalize requirements,
and finally integrating and testing the whole solution.

Responsibilities:
● Primarily involved in data migration using SQL, SQL Azure, Azure Data Lake, and Azure Data Factory
● Built analytic tools that utilize the data pipeline to provide actionable insights into customer acquisition,
operational efficiency, and other key business performance metrics
● Worked with the source team to extract data and load it into ADLS
● Created linked services for source and target connectivity based on requirements
● Triggered pipelines and datasets, once created, based on load (history/delta) operations
● Processed loaded files (large or small, depending on the source) in Azure Databricks by applying Spark SQL
operations, deployed through Azure Data Factory pipelines
● Involved in deploying the solutions to QA, DEV and PROD
● Involved in setting up the environments for QA, DEV and PROD using VSTS
● Proficient in creating a data warehouse, designing extraction and loading functions, testing designs, data
modeling, and ensuring the smooth running of applications
● Responsible for extracting data from OLTP and OLAP systems to the Data Lake using Azure Data Factory and
Databricks
● Used Azure Databricks notebooks to extract data from the Data Lake and load it into Azure and on-prem SQL
databases (see the sketch at the end of this project section)
● Worked with large data sets on a high-capacity big data processing platform, as well as on SQL and data
warehouse projects
● Developed pipelines that extract data from various sources and merge them into single-source datasets in the
Data Lake using Databricks
● Encrypted business-sensitive data loaded into the Data Lake using a cipher algorithm
● Decrypted sensitive data using keys to produce refined datasets for analytics, providing end users access
● Created connections from different on-prem and cloud sources into a single source for Power BI reports
● Delivered complex, high-volume, high-velocity projects end to end with good exposure to Big Data
architectures; framework-building experience on Hadoop, a very good understanding of the Big Data ecosystem, and
experience sizing and estimating large-scale big data projects
● Coordinated with external teams to reduce data flow issues and unblock team members
● Actively participated in the four Scrum ceremonies: sprint planning, daily scrum, sprint review, and sprint
retrospective
● Passion for product quality, customer satisfaction and a proven track record for delivering quality

Environment: Hadoop, PySpark, Python, Azure Databricks, and Azure Data Factory
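A minimal sketch of the Databricks notebook step referenced above (reading curated data from the Data Lake and
loading it into a SQL database over JDBC); the ADLS path, JDBC URL, table name, user, and Key Vault-backed secret
scope are hypothetical placeholders, not the project's actual values.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lake-to-sql").getOrCreate()

    # Read a curated dataset from ADLS (container/account/path are placeholders).
    df = spark.read.parquet("abfss://curated@examplelake.dfs.core.windows.net/sales/")

    # Credentials come from Azure Key Vault via a Databricks secret scope (dbutils is provided inside
    # Databricks notebooks; the scope and key names here are assumptions).
    jdbc_url = "jdbc:sqlserver://example-server.database.windows.net:1433;database=exampledb"
    password = dbutils.secrets.get(scope="example-kv-scope", key="sql-password")

    # Write the dataset to a SQL Server / Azure SQL table over JDBC.
    (df.write
       .format("jdbc")
       .option("url", jdbc_url)
       .option("dbtable", "dbo.sales_curated")
       .option("user", "etl_user")
       .option("password", password)
       .mode("overwrite")
       .save())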

Client - Marriott Hotels USA 06/2018 - 11/2020


Role - Azure Data Engineer
Responsibilities:

● Created the pipelines and datasets deployed in ADF (non-restricted)
● Created indexes and indexed views observing business rules, and created effective functions and appropriate
triggers to support efficient data manipulation and data consistency
● Performed data extraction, transformation, and loading (ETL) using tools such as SQL Server Integration
Services (SSIS) and Data Transformation Services (DTS)
● Created dynamic packages for incremental loads and data cleaning in the data warehouse using SSIS
● Imported/exported data between different sources such as Oracle, Access, and Excel using the SSIS/DTS
utilities
● Extracted, transformed, and loaded data from various heterogeneous data sources and
destinations like Access, Excel, CSV, Oracle, and flat files using connectors, tasks, and transformations
provided by SSIS.
● Involved in creating jobs, SQL Mail Agent, alerts, and SSIS schedules
● High availability and disaster recovery planning
● Worked on Azure and Azure Data Factory pipeline CI/CD with GitHub integration, covering linked services with
datasets, triggers (including window triggers with dependencies and event triggers), integration runtimes,
parameterized datasets, linked services, unions, and pipelines (see the sketch after this list)
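To make the ADF CI/CD bullet above more concrete, here is a hedged sketch of triggering a parameterized pipeline
run with the azure-mgmt-datafactory Python SDK; the subscription ID, resource group, factory, pipeline name, and
loadType parameter are placeholders assumed for illustration, not the project's real values.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    # Authenticate with whatever credential is available (Azure CLI login, managed identity, etc.).
    credential = DefaultAzureCredential()
    adf_client = DataFactoryManagementClient(credential, "<subscription-id>")  # placeholder subscription

    # Kick off a parameterized pipeline run (names and the loadType parameter are illustrative).
    run = adf_client.pipelines.create_run(
        resource_group_name="rg-data-platform",
        factory_name="adf-example",
        pipeline_name="pl_incremental_load",
        parameters={"loadType": "Delta"},
    )
    print("Started pipeline run:", run.run_id)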

Client - ADT, Dallas, TX 07/2015 - 06/2018


Role - Hadoop Developer (Big Data)
Responsibilities:
● Designing the business requirement collection approach based on the project scope and SDLC methodology.
● Design and implement multiple ETL solutions with various data sources through extensive SQL scripting, ETL
tools, Python, shell scripting, and scheduling tools; data profiling and data wrangling of XML, web feeds, and
file handling using Python, Unix, and SQL
● Loading data from different sources into a data warehouse to perform data aggregations for business
intelligence using Python
● Design and build data structures on Azure Data Lake and data processing using PySpark and Hive to provide
efficient reporting and analytics capability
● Lead the design for database and data pipeline using existing and emerging technologies to help improve
decision making
● Manage project timelines and provide timely updates on significant issues or developments
● Bring structure to large quantities of data to make analysis possible, extract meaning from data
● Designed and implemented Sqoop incremental jobs to read data from DB2 and load it into Hive tables, and
connected Tableau to generate interactive reports using HiveServer2
● Used Sqoop to channel data between HDFS and different RDBMS sources
● Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and
aggregation from multiple file formats
● Used SSIS to build automated multi-dimensional cubes
● Used Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS using Python
(see the sketch after this project's Environment line)
● Collected data in near-real-time from an Azure storage account using Spark Streaming and performed the
necessary transformations and aggregations on the fly to build the common learner data model, persisting the data
in HDFS
● Installing, configuring, and maintaining data pipelines
● Transforming business problems into Big Data solutions and defining Big Data strategy and roadmap
● Extracted files from Hadoop and dropped them into Azure Gen2 storage on a daily/hourly basis
● Authoring Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations,
stacking, data labeling, and all cleaning and conforming tasks
● Writing Pig scripts to generate MapReduce jobs and performing ETL procedures on the data in HDFS
● Develop solutions to leverage ETL tools and identify opportunities for process improvements using Python
● Validated the test data in DB2 tables and on Teradata using SQL queries
● Worked on Dimensional and Relational Data Modeling using Star and Snowflake Schemas, OLTP/OLAP
system, Conceptual, Logical and Physical data modeling
● Automated the data processing with Oozie to automate data loading into the Hadoop Distributed File System
● Developed automation regression scripts in Python to validate the ETL process between multiple databases
such as Oracle and SQL Server

Environment: Cloudera Manager (CDH5), Hadoop, HDFS, Pig, Hive, Kafka, Scrum, Git, Sqoop, Oozie, PySpark,
Informatica, Tableau, OLTP, OLAP, HBase, Cassandra, SQL Server, Python, Shell Scripting, XML, Unix
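The Spark Streaming bullet above (receiving real-time data from Kafka and storing it in HDFS) could look roughly
like this minimal PySpark sketch; it uses Spark Structured Streaming, and the broker address, topic name, and HDFS
paths (kafka-broker:9092, events, /data/raw/events, /checkpoints/events) are assumed placeholders, not values from
the project.

    from pyspark.sql import SparkSession

    # Requires the spark-sql-kafka connector package on the classpath.
    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    # Read a stream of records from a Kafka topic (broker and topic are placeholders).
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "kafka-broker:9092")
              .option("subscribe", "events")
              .load()
              .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "timestamp"))

    # Persist the stream to HDFS as Parquet, checkpointing progress for fault tolerance.
    query = (events.writeStream
              .format("parquet")
              .option("path", "hdfs:///data/raw/events")
              .option("checkpointLocation", "hdfs:///checkpoints/events")
              .start())

    query.awaitTermination()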

Education :

Master's in CS - Gannon University, 2015

Bachelor's in CS - RAMCO, Rajapalayam, TN, India - 2013
