
Lakshmi Prasanna Reddy Valusani

Email: lakshmiprasannavalusani0428@gmail.com PH: 469-322-9629


Sr Data Engineer

PROFESSIONAL SUMMARY

● Experienced Data Engineer with over 9 years of professional experience in data analytics and engineering
roles, specializing in Big Data technologies, cloud platforms, and data pipeline development. Highlights of my
experience include:
● Developed end-to-end data pipelines using Python, PySpark, SQL, Shell Scripting, Apache Kafka, and Airflow,
leveraging various AWS services for data ingestion, transformation, and storage.
● Implemented automation strategies for data pipelines, reducing manual effort and ensuring consistent data
processing using tools such as AWS SNS, Lambda, Airflow and Jenkins.
● Experience in designing end-to-end scalable architectures to solve business problems using various Azure
services such as HDInsight, Data Factory, Data Lake, Databricks, and Machine Learning Studio.
● Experience in configuring Azure Linked Services and Integration Runtimes to set up pipelines in Azure Data
Factory (ADF) and automating them with Azure Scheduler.
● Experience in working with AWS services such as AWS Glue, Amazon Managed Streaming for Apache Kafka
(MSK), Athena, AWS CloudFormation templates, ECS, Network Load Balancer, API Gateway, and IAM roles and policies.
● Experience in working with GCP services such as Dataflow jobs, Apache Beam, Pub/Sub, Cloud Composer,
Apache Airflow DAGs for scheduling jobs, BigQuery, and Google Cloud Storage.
● Expertise in working with Snowflake, optimizing Snowflake architecture, and efficiently utilizing Snowflake
SQL, Snowpipe, and Snowflake streams.
● Extensive experience in designing and implementing scalable and high-performance Big Data infrastructure
using Spark, Hadoop, and related technologies.
● Proficient in leveraging Spark Core, Spark SQL, and Spark Streaming to build interactive analysis, batch
processing, and stream processing applications.
● Demonstrated expertise in integrating diverse data sources with varying structures while maintaining data
integrity, utilizing Python, Spark SQL, and Shell Scripting.
● Implemented the migration of ETL tasks from Teradata to Snowflake, reducing processing time by 30%.
● Designed and managed Hadoop clusters, utilizing Hive partitioning, and employing Hue, Zeppelin, and
Apache Pig for analyzing extensive transactional data.
● Developed data quality audit systems using SNS, Lambda, S3, and Python to monitor the health status of
tables, ensuring data integrity and timely issue resolution (a minimal sketch of this pattern follows this summary).
● Experienced in GCP and its services such as Google Cloud Storage, Google Cloud SQL, BigQuery, Cloud
Dataflow, and Cloud Security Command Center, covering a wide range of cloud computing requirements.
● Streamlined weekly jobs running on dynamically spun EMR clusters by fine-tuning Spark configurations,
reducing runtime by 30%.
● Worked with various database systems, including Postgres, Oracle, DB2, SQL Server, and MySQL, for data
storage and retrieval.
● Experienced in working with cloud platforms such as AWS, Azure, and GCP, leveraging services like EMR, S3,
Glue, Redshift, and Azure Data Factory for data processing and storage.
● Successfully designed and implemented data processing solutions on Azure using services like Azure Data
Factory, Azure Data Lake, and Azure Databricks, enabling efficient data ingestion, transformation, and
analytics workflows.
● Leveraged NoSQL databases, including Azure Cosmos DB and Azure Table Storage, as well as HBase,
DynamoDB, and MongoDB, to store and retrieve large volumes of unstructured and semi-structured data.
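
Below is a minimal, hedged sketch of the S3 + Lambda + SNS data-quality audit pattern referenced above. The bucket name, key layout, SNS topic ARN, and 24-hour staleness SLA are illustrative assumptions, not the actual project configuration.

```python
# Hypothetical AWS Lambda handler illustrating the S3 + SNS data-quality audit pattern.
# Bucket, key layout, and topic ARN are placeholders, not real project values.
import json
import os
from datetime import datetime, timezone, timedelta

import boto3

s3 = boto3.client("s3")
sns = boto3.client("sns")

BUCKET = os.environ.get("AUDIT_BUCKET", "example-data-lake")        # placeholder bucket
TOPIC_ARN = os.environ.get("ALERT_TOPIC_ARN", "arn:aws:sns:placeholder")  # placeholder ARN
MAX_STALENESS = timedelta(hours=24)                                 # assumed freshness SLA


def lambda_handler(event, context):
    """Check per-table health files in S3 and alert via SNS when a table looks stale."""
    stale_tables = []
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix="health/")
    for obj in resp.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        # Expected shape: {"table": "orders", "last_loaded": "2024-01-01T02:00:00+00:00"}
        health = json.loads(body)
        last_loaded = datetime.fromisoformat(health["last_loaded"])
        if datetime.now(timezone.utc) - last_loaded > MAX_STALENESS:
            stale_tables.append(health["table"])

    if stale_tables:
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject="Data quality alert: stale tables",
            Message=json.dumps({"stale_tables": stale_tables}),
        )
    return {"stale_tables": stale_tables}
```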

TECHNICAL SKILLS

● Languages: Python, SQL, PySpark, Scala, Shell Scripting.


● Relational Databases & Data warehouses: Teradata, Oracle, MySQL, Snowflake, SQL Server, Redshift, DB2,
BigQuery.
● Big Data Technologies: Spark, Hadoop, Apache Airflow, NiFi, Snowflake, Hive, Pig, Hue, Sqoop, Flume, Oozie,
Impala.
● Cloud Services: AWS (EC2, S3, EBS, RDS, EMR, IAM), Azure (ADLS, Data Factory, Databricks, HDInsight), GCP
(Google Cloud Storage, Google Cloud SQL, BigQuery, Cloud Dataflow).
● Data Visualization Tools: Power BI, Tableau, Qlik.
● Methodologies: Agile, Waterfall, TDD (Test-Driven Development).
● NoSQL Databases: MongoDB, DynamoDB, HBase.
● Version Control Tools: Azure DevOps, GitHub, GitLab, Bitbucket.
● Agile Tools: SNOW, JIRA.
● Cloud Environments: AWS, Azure, GCP.
● CI/CD Tools: Azure DevOps, Jenkins.

PROFESSIONAL EXPERIENCE
Client: Protective Life Insurance, AL June 2022 to Present
Role: Sr. Data Engineer
Responsibilities:

● Demonstrated expertise in Google Cloud Platform (GCP), including working with key services such as Google
BigQuery, Google Cloud Dataflow, and Google Cloud Dataproc.
● Designed and implemented scalable data processing and analytics solutions using Google Cloud Dataflow,
leveraging its distributed processing capabilities for real-time and batch data pipelines.
● Developed and optimized data ingestion strategies using Google Cloud Pub/Sub, Google Cloud Storage, and
Google Cloud Dataflow, ensuring efficient and reliable data integration from various sources (a minimal Beam
sketch follows this list).
● Implemented and managed data storage solutions using Google BigQuery, including schema design,
partitioning strategies, and optimizing performance through query tuning and table partitioning.
● Integrated GCP services with third-party data processing frameworks like Apache Spark and Apache Hadoop,
leveraging Google Cloud Dataproc for large-scale data processing and distributed computing.
● Built robust data pipelines using GCP services, including data extraction, transformation, and loading (ETL)
processes, ensuring high data quality, reliability, and timeliness.
● Collaborated with development and DevOps teams to build Continuous Integration and Continuous
Deployment (CI/CD) pipelines for automated application deployments.
● Deployed data engineering solutions and services using Infrastructure as Code (IaC), ensuring smooth and
error-free deployments.
● Troubleshot issues that arose during the deployment process to keep systems running smoothly.
● Built and managed Kubernetes clusters on GKE to host containerized applications and data engineering
workloads, implementing horizontal and vertical autoscaling to optimize resource utilization and
performance.
● Utilized Apache NiFi to design and automate ETL workflows, seamlessly integrating data from various sources
into Teradata on GCP.
● Leveraged Talend's graphical interface to create and optimize complex ETL pipelines, ensuring efficient data
transformations and loading processes.
● Implemented Informatica's data integration capabilities to connect and synchronize data between Teradata
and other databases within the GCP ecosystem.
● Employed Apache Spark (PySpark) for data processing and transformation tasks, enhancing ETL performance
and data preparation efficiency.
● Utilized Google Cloud Dataflow to perform real-time data processing and stream data from Teradata to other
GCP services like BigQuery.
● Ensured compliance with best practices, security guidelines, and process improvement initiatives, and
implemented security and access controls in GCP, utilizing IAM roles, policies, and encryption mechanisms to
ensure data confidentiality, integrity, and compliance with regulatory requirements.
● Developed data governance frameworks and data cataloging solutions using GCP services like Google Cloud
Data Catalog, enabling effective data discovery and metadata management.
● Utilized Google Cloud Monitoring, Logging, and Error Reporting to monitor system performance, identify
bottlenecks, and troubleshoot issues, ensuring optimal performance and reliability.
● Collaborated with cross-functional teams, including data scientists, analysts, and stakeholders, to understand
business requirements and deliver end-to-end data solutions on GCP, driving data-driven decision-making
and insights.
● Created data extraction pipelines in Azure Data Factory (ADF) to collect data from diverse sources such as
databases, file systems, APIs, and streaming platforms.
● Implemented data ingestion processes using Azure services like Azure Data Lake Store (ADLS), Azure SQL,
Blob storage, and Azure SQL Data Warehouse.
● Designed and developed ETL (Extract, Transform, Load) processes using Azure Data Factory to cleanse,
validate, and transform the collected data into a standardized format.
● Leveraged Azure HDInsight and Azure Data Lake Store to handle and process structured and unstructured
data from incoming web feed data and server logs.
● Provisioned Spark clusters on Azure Databricks, utilizing cross-account roles in Azure to access Azure Data
Lake Store (ADLS) for efficient data processing.
● Programmed in Hive, Spark SQL, and Python within Azure Databricks to streamline data processing and build
data pipelines for generating useful insights.
● Orchestrated data pipelines using Azure Data Factory to manage the flow of data and schedule regular data
processing tasks.
● Imported data from various sources into HDFS using Sqoop and created internal and external tables in Hive
for data organization and analysis.
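
The following is a minimal sketch of the kind of Apache Beam (Python SDK) streaming pipeline run on Cloud Dataflow to land Pub/Sub events in BigQuery, as referenced above. The project, topic, bucket, table, and schema names are hypothetical placeholders, not the actual production resources.

```python
# Minimal Apache Beam (Python SDK) streaming sketch: Pub/Sub -> Dataflow -> BigQuery.
# All resource names below are illustrative placeholders.
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    streaming=True,
    project="example-gcp-project",           # placeholder project
    region="us-central1",
    runner="DataflowRunner",                 # use DirectRunner for local testing
    temp_location="gs://example-bucket/tmp",  # placeholder bucket
)


def parse_message(msg_bytes):
    """Decode a Pub/Sub message into a BigQuery-ready row."""
    record = json.loads(msg_bytes.decode("utf-8"))
    return {"event_id": record["event_id"], "payload": json.dumps(record)}


with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(
            topic="projects/example-gcp-project/topics/events")
        | "Parse" >> beam.Map(parse_message)
        | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
            "example-gcp-project:analytics.events",
            schema="event_id:STRING,payload:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```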

Environment: Google Compute Engine (GCE), Google Kubernetes Engine (GKE), Apache NiFi, Talend, Google Cloud
Storage, Google Cloud SQL, Cloud Memorystore, BigQuery, Apache Spark, Cloud Dataflow, Virtual Private Cloud
(VPC), Cloud Monitoring, Cloud Scheduler, Azure Data Factory, Azure Data Lake Store (ADLS), Azure SQL, Azure SQL
Data Warehouse, Azure Databricks, Hive, Sqoop, Python, Spark SQL, Spark Streaming, Kafka, Shell Scripting.

Client: DirecTV, TX January 2021 to June 2022


Role: Sr. Data Engineer
Responsibilities:

● Developed and deployed scalable applications on GCP using Compute Engine, App Engine, and Kubernetes
Engine.
● Implemented data processing pipelines using Spark and Scala, performing data transformation, aggregation,
and analysis on large datasets.
● Utilized Spark Streaming to process real-time data and implemented machine learning algorithms using
Spark MLlib.
● Designed and optimized data warehouses in GCP's BigQuery and AWS Redshift for efficient data storage and
analysis.
● Worked with AWS Cloud services, including EC2, S3, and Lambda, to deploy and manage applications in a
cloud-based environment.
● Collaborated with cross-functional teams to design and implement end-to-end data solutions, ensuring data
quality, security, and performance.
● Troubleshot and resolved issues related to data pipelines, performance bottlenecks, and infrastructure
configurations in GCP and AWS.
● Led the end-to-end data migration project from Teradata to Snowflake, ensuring the seamless transfer of
large volumes of legacy data.
● Designed and implemented data extraction processes from Teradata using tools like AWS Glue or Apache
NiFi, ensuring data consistency and integrity during the migration.
● Developed data transformation pipelines using Apache Spark or AWS EMR to convert and optimize the data
for Snowflake's schema and query performance.
● Leveraged Snowpipe or AWS Data Pipeline to ingest real-time and streaming data into Snowflake, enabling
near real-time analytics capabilities.
● Successfully managed data migration projects, moving data from on-premises systems to Teradata on GCP,
ensuring data integrity and optimizing storage.
● Collaborated with cross-functional teams to understand data requirements and deliver reliable, well-
prepared datasets for analysis and reporting.
● Assured data quality and adherence to data governance policies, applying data validation and cleansing
techniques within the ETL processes.
● Developed custom data connectors and optimized ETL workflows to handle large-scale data volumes,
improving ETL performance by 25%.
● Ensured secure data transfer and access control by implementing encryption and data masking in ETL
processes for compliance and data privacy.
● Conducted thorough data validation and quality checks to ensure the accuracy and completeness of the
migrated data in Snowflake.
● Collaborated with cross-functional teams, including Data Architects and Business Analysts, to align the
migration strategy with business requirements and data governance policies.
● Optimized the migration process by leveraging AWS services such as Amazon S3 for staging, AWS Glue for
data cataloging, and Amazon Redshift for temporary storage.
● Designed and implemented an audit system using Apache Kafka to capture and record data lineage, data
changes, and metadata information for compliance and tracking purposes.
● Integrated the audit system with data pipelines and processing frameworks, ensuring data traceability, and
enabling easy debugging and troubleshooting.
● Developed custom data monitoring and alerting mechanisms using AWS CloudWatch or Apache Kafka
Connect to proactively identify and address any data quality or pipeline issues.
● Automated data ingestion and transformation processes using AWS Lambda functions, reducing manual
effort, and increasing efficiency.
● Implemented job orchestration and scheduling using Apache Airflow or AWS Step Functions, enabling
automated and reliable execution of data pipelines (a minimal Airflow DAG sketch follows this list).
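
A hedged sketch of the kind of Airflow DAG used to orchestrate the extract, transform, and load steps of the Teradata-to-Snowflake pipeline described above; the DAG name and task bodies are placeholders rather than the actual project code.

```python
# Illustrative Airflow DAG: extract from Teradata -> transform with Spark -> load into Snowflake.
# Task callables are placeholders standing in for the real extraction, Spark, and Snowpipe logic.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_from_teradata(**context):
    """Placeholder: pull the day's delta from Teradata (e.g., via AWS Glue or NiFi)."""
    ...


def transform_with_spark(**context):
    """Placeholder: submit the Spark/EMR job that reshapes data for Snowflake's schema."""
    ...


def load_into_snowflake(**context):
    """Placeholder: trigger Snowpipe / COPY INTO to land the curated files in Snowflake."""
    ...


with DAG(
    dag_id="teradata_to_snowflake_daily",   # illustrative name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_from_teradata)
    transform = PythonOperator(task_id="transform", python_callable=transform_with_spark)
    load = PythonOperator(task_id="load", python_callable=load_into_snowflake)

    extract >> transform >> load
```
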
Environment: Google Kubernetes Engine (GKE), Google Cloud Storage, Google Cloud SQL, Cloud Memorystore,
BigQuery, AWS services (Glue, EMR, S3, Redshift, Lambda, CloudWatch, CloudFormation), Teradata, Snowflake,
Apache Kafka, Apache NiFi, Apache Spark, Apache Airflow, Terraform, DataBrew.

Client: SS&C Intralinks, MA Oct 2017 to Dec 2020

Role: Big Data Engineer


Responsibilities:

● Optimized Spark configuration parameters, such as executor memory, executor cores, and shuffle partitions,
to enhance the performance and scalability of data processing jobs (illustrated in the sketch after this list).
● Implemented data partitioning and bucketing techniques in Spark to distribute and organize data across
clusters, enabling parallel processing and reducing latency.
● Fine-tuned Spark SQL queries by optimizing joins, aggregations, and data filtering operations to improve
query performance and minimize resource consumption.
● Leveraged Spark's caching mechanisms, such as RDD or DataFrame caching, to persist intermediate results in
memory and avoid redundant computations, resulting in faster data processing.
● Developed a centralized data engineering framework using technologies like Apache Airflow or Luigi,
providing a standardized approach for different teams to define, schedule, and monitor their ETL workflows.
● Designed and implemented reusable data transformation modules or libraries to promote code reusability
and streamline the development process across multiple ETL projects.
● Integrated version control systems like Git into the data engineering workflow to ensure proper code
management, collaboration, and versioning across different teams and projects.
● Collaborated with Data Architects to define data modeling best practices and guidelines, ensuring
consistency and scalability in data structures across various ETL processes.
● Explored and implemented cloud-based big data technologies such as AWS EMR, Google Cloud Dataproc, or
Azure Databricks to leverage the scalability and elasticity of distributed computing for large-scale data
processing.
● Utilized containerization technologies like Docker or Kubernetes to create portable and scalable data
processing environments, enabling efficient resource utilization and easy deployment across different
clusters.
● Implemented data quality monitoring and validation mechanisms to ensure the integrity and accuracy of
processed data, utilizing tools like Apache Griffin or custom validation scripts.
● Conducted performance testing and benchmarking of data processing jobs using tools like Apache JMeter or
TPC-H, identifying bottlenecks and areas for optimization to achieve better scalability.
● Stayed up to date with emerging big data technologies and frameworks, continuously exploring and
evaluating new tools and techniques to improve the scalability and performance of data pipelines.
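
A minimal PySpark sketch of the tuning levers mentioned above (executor sizing, shuffle partitions, key-based repartitioning, and caching). All paths, column names, and configuration values are illustrative assumptions, not the project's actual settings.

```python
# Illustrative PySpark batch job showing common tuning levers; values are examples only.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("tuned-batch-job")
    .config("spark.executor.memory", "8g")          # example executor sizing
    .config("spark.executor.cores", "4")
    .config("spark.sql.shuffle.partitions", "400")  # sized to the expected shuffle volume
    .getOrCreate()
)

orders = spark.read.parquet("s3://example-bucket/orders/")        # placeholder path
customers = spark.read.parquet("s3://example-bucket/customers/")  # placeholder path

# Repartition on the join key so matching rows co-locate and the shuffle stays balanced.
orders = orders.repartition(400, "customer_id")

# Cache the joined result because several aggregations reuse it downstream.
enriched = orders.join(customers, "customer_id", "left").cache()

daily_totals = enriched.groupBy("order_date").sum("amount")
daily_totals.write.mode("overwrite").parquet("s3://example-bucket/output/daily_totals/")
```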

Environment: Apache Spark, Apache Kafka, AWS Glue, AWS EMR, Amazon S3, Amazon Redshift, AWS Lambda, AWS
CloudWatch, AWS CloudFormation, Apache Airflow, Terraform, Apache NiFi, Teradata, Snowflake.

Client: Value Momentum, India April 2014 to May 2017

Role: Hadoop Developer


Responsibilities:

● Developed and maintained Hadoop-based data processing applications, leveraging technologies such as
Hadoop MapReduce, HDFS, and Hive.
● Designed and implemented data ingestion pipelines to load large volumes of data from diverse sources into
Hadoop clusters using tools like Sqoop or Flume.
● Implemented custom MapReduce jobs in Java to process and analyze structured and unstructured data,
extracting meaningful insights and patterns.
● Utilized Hive for data querying and analysis, optimizing queries and leveraging Hive partitions and bucketing
for improved performance (see the sketch after this list).
● Integrated and utilized other data processing frameworks and libraries such as Apache Spark, Pig, or
Cascading to enhance data processing capabilities.
● Worked with diverse data formats such as Avro, Parquet, or ORC, optimizing data storage and retrieval
efficiency in Hadoop clusters.
● Collaborated with Data Architects and Data Scientists to understand data requirements and design data
models and schemas for efficient data processing.
● Developed and maintained Oozie workflows to orchestrate and schedule Hadoop jobs, ensuring reliable and
timely execution of data processing tasks.
● Implemented data security measures in Hadoop clusters, including authentication, authorization, and data
encryption, to ensure data privacy and compliance.
● Optimized Hadoop cluster performance by tuning various parameters such as heap size, block size, and
replication factor, based on workload characteristics.
● Implemented data lineage and metadata management solutions using tools like Apache Atlas or custom-built
systems, enabling data traceability and governance.
● Developed and maintained monitoring and alerting mechanisms using tools like Nagios, Ganglia, or Ambari,
ensuring the health and performance of Hadoop clusters.
● Integrated Hadoop clusters with other data storage and processing systems, such as relational databases or
cloud storage, for seamless data integration and analysis.
● Implemented data backup and disaster recovery solutions for Hadoop clusters, ensuring data availability and
business continuity in case of system failures.
● Kept up to date with emerging technologies and trends in the big data ecosystem, continuously exploring
new tools and frameworks to enhance data processing capabilities.
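
A hedged illustration of the Hive partitioning pattern referenced above, expressed through Spark SQL with Hive support for brevity; the table, column names, and date values are hypothetical examples rather than the actual project schema.

```python
# Illustrative partitioned Hive table managed through Spark SQL; names and values are examples.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-partitioning-example")
    .enableHiveSupport()   # lets Spark create and query tables in the Hive metastore
    .getOrCreate()
)

# Partitioned table: one directory per load_date keeps per-day scans cheap.
# (Bucketing would add: CLUSTERED BY (customer_id) INTO 32 BUCKETS before STORED AS.)
spark.sql("""
    CREATE TABLE IF NOT EXISTS transactions (
        txn_id      STRING,
        customer_id STRING,
        amount      DOUBLE
    )
    PARTITIONED BY (load_date STRING)
    STORED AS ORC
""")

# Stand-in for staged source data; in practice this arrived via Sqoop/Flume loads into HDFS.
staging = spark.createDataFrame(
    [("t1", "c100", 25.0), ("t2", "c200", 40.5)],
    ["txn_id", "customer_id", "amount"],
)
staging.createOrReplaceTempView("staging_transactions")

# Load one day's data into the matching partition.
spark.sql("""
    INSERT OVERWRITE TABLE transactions PARTITION (load_date = '2016-05-01')
    SELECT txn_id, customer_id, amount FROM staging_transactions
""")

# Partition pruning: only the 2016-05-01 partition directory is read.
spark.sql("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM transactions
    WHERE load_date = '2016-05-01'
    GROUP BY customer_id
""").show()
```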

Environment: Hadoop, MapReduce, HDFS, Hive, Sqoop, Flume, Apache Spark, Pig, Cascading, Avro, Parquet, ORC,
Java, Oozie, Apache Atlas, Nagios, Ganglia, Ambari

EDUCATIONAL DETAILS:

Master of Science in Computer and Information Sciences - UNIVERSITY OF NORTH TEXAS, Denton, TX

Bachelor of Engineering – SRI INDU COLLEGE OF ENGINEERING AND TECHNOLOGY, Hyderabad, Telangana, India
