aravindreddy0703@gmail.com
913-735-0692
PROFESSIONAL SUMMARY
Senior data engineer with 9+ years of diversified IT experience solving business use cases for several clients. Expert in
database development, ETL development, data modeling, big data technologies, and software development with Agile
methodology, including business analysis and modeling, user interaction, planning and testing, migration, and
documentation.
Working experience with the Hadoop framework and its ecosystem, including Hadoop Distributed File System (HDFS),
MapReduce, ZooKeeper, Pig, Impala, Hive, HBase, Sqoop, and Oozie, along with GitHub for version control.
Experience with configuration and development on multiple Hadoop distribution platforms, such as Cloudera and Hortonworks
(on-premises).
Good understanding of Hadoop architecture and its components, such as HDFS, JobTracker, TaskTracker, NameNode,
DataNode, and MapReduce concepts.
Good exposure to Apache Hadoop MapReduce programming, Pig scripting, distributed applications, and HDFS. Good
knowledge of Hadoop cluster architecture and cluster monitoring.
Experience using partitioning and bucketing optimization techniques in Hive, and designed both managed and external tables
in Hive to improve performance, as sketched below.
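Illustration: a minimal PySpark sketch of this pattern, assuming Hive support is enabled; all table, column, and path names are hypothetical placeholders, and the same DDL can be run directly in Hive.

    # Minimal sketch: partitioned and bucketed Hive tables via spark.sql.
    # Table, column, and path names are hypothetical placeholders.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder.appName("hive-ddl")
             .enableHiveSupport().getOrCreate())

    # External table: Hive tracks only metadata; data stays at the given location.
    spark.sql("""
        CREATE EXTERNAL TABLE IF NOT EXISTS claims_ext (
            claim_id STRING, member_id STRING, amount DOUBLE)
        PARTITIONED BY (claim_date STRING)
        STORED AS PARQUET
        LOCATION '/data/raw/claims'
    """)

    # Managed table, partitioned and bucketed to speed up joins and pruning.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS claims_curated (
            claim_id STRING, member_id STRING, amount DOUBLE)
        PARTITIONED BY (claim_date STRING)
        CLUSTERED BY (member_id) INTO 16 BUCKETS
        STORED AS ORC
    """)

    # Dynamic-partition insert from the external staging table.
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    spark.sql("""
        INSERT OVERWRITE TABLE claims_curated PARTITION (claim_date)
        SELECT claim_id, member_id, amount, claim_date FROM claims_ext
    """)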
Procedural knowledge of cleansing and analyzing data using Hive and the Hadoop platform, and on relational databases such as
Oracle, SQL Server, and Teradata.
Experience designing and building data lake solutions based on an organization's needs and capabilities.
Good understanding of installing and maintaining Linux servers.
Worked with HL7 functions to facilitate interoperability between various health information systems, covering data
management, data masking, FHIR, data governance, and data quality.
Experience in developing data warehouse applications using Hadoop, Oracle, Teradata, and MS SQL Server on UNIX and
Windows platforms, with experience in creating complex mappings using various transformations and developing ETL
strategies using SSIS.
Experience with cloud-based services such as AWS EMR, EC2, S3, Athena, EKS, RDS, SNS, SQS, CloudWatch, AWS
Lambda, DynamoDB, and Redshift to work in distributed data models that provide fast and efficient processing of big data.
Experience in configuring and maintaining long-running Amazon EMR clusters, both manually and through CloudFormation
scripts on AWS.
Worked with various transformations such as Normalizer, Expression, Rank, Filter, Group, Aggregator, Lookup, Joiner,
Sequence Generator, Sorter, SQLT, Stored Procedure, Update Strategy, Source Qualifier, Transaction Control, Union, and CDC.
Proficient in the Python programming language, with a strong focus on data manipulation, analysis, and machine learning
modeling using libraries such as NumPy, Pandas, and scikit-learn; a minimal sketch follows.
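Illustration: a minimal sketch of that workflow; the file name, feature columns, and target are hypothetical placeholders.

    # Minimal sketch: Pandas for data prep, scikit-learn for a simple model.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("claims.csv").dropna()        # basic cleansing (hypothetical file)
    X = df[["age", "num_visits", "total_cost"]]    # hypothetical feature columns
    y = df["readmitted"]                           # hypothetical binary target

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("accuracy:", accuracy_score(y_test, model.predict(X_test)))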
Extensive experience in PySpark, a powerful open-source framework for big data processing and analytics. Developed and
optimized PySpark applications to process large-scale datasets efficiently, leveraging distributed computing capabilities.
Proficient in the Hive query language (HiveQL) and experienced in Hive performance optimization using static and dynamic
partitioning, bucketing, and parallel execution.
Implemented simple, generic custom Hive UDFs and developed multiple Hive views for accessing underlying table data.
Proficient in importing and exporting data using Sqoop from HDFS to RDBMS and vice versa.
Integrated Kafka with Spark streaming for real-time data processing.
Knowledge of big data workflow orchestration tools like Apache Oozie and Airflow.
Experience in extraction, transformation, and loading (ETL) of data from various sources into data warehouses using AWS
Glue, as well as collecting and moving data from various sources using Apache Kafka; a Glue job sketch follows.
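Illustration: a minimal sketch of such a Glue job; it only runs inside the AWS Glue job runtime, and the catalog database, table, mappings, and bucket name are hypothetical placeholders.

    # Minimal AWS Glue ETL job sketch (runs in the Glue job runtime).
    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the source table registered in the Glue Data Catalog (hypothetical names).
    src = glue_context.create_dynamic_frame.from_catalog(
        database="raw", table_name="events")

    # Rename and cast fields into the unified data model.
    mapped = ApplyMapping.apply(frame=src, mappings=[
        ("event_id", "string", "event_id", "string"),
        ("ts", "string", "event_time", "timestamp"),
    ])

    # Write the conformed data to S3 as Parquet (hypothetical bucket).
    glue_context.write_dynamic_frame.from_options(
        frame=mapped, connection_type="s3",
        connection_options={"path": "s3://curated-bucket/events/"},
        format="parquet")
    job.commit()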
Used SSIS and SSRS for data extraction, transformation, loading, and reporting in MS SQL Server.
Experience in dimensional data modeling techniques such as star schema, snowflake schema, fact tables, dimension tables,
transactional modeling, and slowly changing dimensions (SCD); an SCD Type 2 sketch follows.
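Illustration: a minimal sketch of an SCD Type 2 load expressed in Spark SQL; it assumes a Delta Lake (or similar MERGE-capable) dimension table, and all table and column names are hypothetical placeholders.

    # Minimal SCD Type 2 sketch: close out changed rows, then insert new versions.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("scd2").getOrCreate()

    # Step 1: expire the current row for members whose tracked attribute changed.
    # MERGE requires a Delta Lake (or similar) target table.
    spark.sql("""
        MERGE INTO dim_member d
        USING stg_member s
        ON d.member_id = s.member_id AND d.is_current = true
        WHEN MATCHED AND d.address <> s.address THEN
          UPDATE SET is_current = false, end_date = current_date()
    """)

    # Step 2: insert a fresh current row for new and just-expired members.
    spark.sql("""
        INSERT INTO dim_member
        SELECT s.member_id, s.address, current_date() AS start_date,
               CAST(NULL AS DATE) AS end_date, true AS is_current
        FROM stg_member s
        LEFT JOIN dim_member d
          ON d.member_id = s.member_id AND d.is_current = true
        WHERE d.member_id IS NULL
    """)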
Hands-on experience with the Snowflake cloud data warehouse for integrating data from multiple source systems, including
loading nested JSON-formatted data into Snowflake tables, as sketched below.
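Illustration: a minimal sketch of that load path using the Snowflake Python connector; the account, credentials, stage, and table names are hypothetical placeholders.

    # Minimal sketch: nested JSON into a Snowflake VARIANT column.
    import snowflake.connector

    conn = snowflake.connector.connect(
        user="ETL_USER", password="...", account="my_account",   # hypothetical
        warehouse="LOAD_WH", database="RAW", schema="PUBLIC")
    cur = conn.cursor()

    # A VARIANT column stores the nested JSON document as-is.
    cur.execute("CREATE TABLE IF NOT EXISTS events_raw (payload VARIANT)")
    cur.execute("""
        COPY INTO events_raw
        FROM @json_stage/events/
        FILE_FORMAT = (TYPE = 'JSON')
    """)

    # Nested fields are reachable with Snowflake's path syntax.
    cur.execute("""
        SELECT payload:member.id::STRING, payload:visits[0].visit_date::DATE
        FROM events_raw LIMIT 10
    """)
    print(cur.fetchall())
    conn.close()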
Handling NoSQL databases, including MongoDB, Cassandra, and HBase.
Experience using columnar and row-based big data file formats such as RCFile, Avro, ORC, and Parquet.
Working experience with programming and query languages such as SQL, R, Scala, and Python.
Skilled data engineer with experience in designing, developing, and managing data solutions in Azure.
Worked on Microsoft Azure services such as HDInsight clusters, Blob Storage, ADLS, Data Factory, and Logic Apps, and
completed a POC on Azure Databricks.
Expertise in Azure services such as Azure Data Factory, Azure Databricks, Azure Data Lake Storage, Azure Blob Storage,
Azure Synapse Analytics, Azure Cosmos DB, Azure HDInsight, Azure Event Hubs, Azure Virtual Machines, Azure Resource
Manager, Azure Functions, Azure SQL Database, Azure Queue Storage, Azure Stream Analytics, and Azure Analysis
Services.
Expertise in creating, debugging, scheduling, and monitoring jobs using Airflow; a minimal DAG sketch follows.
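Illustration: a minimal Airflow 2.x-style DAG sketch; the DAG ID, schedule, and task callables are hypothetical placeholders.

    # Minimal Airflow DAG sketch: two dependent tasks on a daily schedule.
    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract():
        print("extracting...")   # placeholder task body

    def load():
        print("loading...")      # placeholder task body

    with DAG(
        dag_id="daily_etl",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
        catchup=False,
    ) as dag:
        t_extract = PythonOperator(task_id="extract", python_callable=extract)
        t_load = PythonOperator(task_id="load", python_callable=load)
        t_extract >> t_load      # load runs only after extract succeeds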
Experience writing Spark streaming and Spark batch jobs, using Spark MLlib for analytics.
Experienced in normalization (1NF, 2NF, 3NF, and BCNF) and denormalization techniques for effective and optimal
performance in OLTP and OLAP environments.
Capable of processing large sets of structured, semi-structured, and unstructured data and supporting system application
architecture.
Worked with different databases (Oracle, SQL Server, Teradata, and Cassandra) and SQL programming.
Proficient in UNIX/Linux commands, shell scripting, and application deployment.
Strong background in mathematics with excellent analytical skills.
TECHNICAL SKILLS
Hadoop/Big Data Tools: Hadoop, Hive, HDFS, MapReduce, YARN, Pig, Flume, Spark, Kafka, Sqoop, Oozie, Zookeeper, NiFi, Airflow
Programming Languages: Python, SQL, Scala, PySpark, Shell Scripting
DWH Schemas: Dimensional Modeling, Star Schema, Snowflake Schema
Databases: Oracle, MySQL, SQL Server 2005, HBase, Cassandra, Netezza
Cloud Environments: AWS (EC2, EMR, S3, Kinesis, DynamoDB, Redshift, AWS Lambda, RDS, SQS, QuickSight, SNS, EKS); Azure (Azure Data Factory, Databricks, Data Lakes, Azure Synapse, Azure Blob Storage, Azure Functions, Azure Queue Storage, Azure Event Hubs, Azure HDInsight, Azure Virtual Machines, Azure Resource Manager, Azure SQL DB)
Data File Formats: JSON, CSV, Parquet, Avro, TextFile, ORC
Methodologies: Agile/Scrum
Operating Systems: macOS, Windows, Unix
PROFESSIONAL EXPERIENCE
Constructed robust data warehouse solutions to store and organize extensive healthcare data for analysis and reporting.
Created intricate ETL jobs with AWS Glue, converting data from diverse sources into a unified data model using Python and
SQL.
Formulated ETL processes to ingest, transform, and load substantial data volumes from various sources into Delta Lake.
Designed and executed data solutions on AWS, leveraging services like AWS Lambda and Amazon Redshift.
Transitioned an on-premises application to AWS, employing services like EC2 and S3 for data processing and storage, and
maintained a Hadoop cluster on AWS EMR.
Proficient in data wrangling techniques, encompassing data cleansing, transformation, and aggregation, using PySpark's
DataFrame API for efficient data processing and analysis.
Devised Spark streaming jobs to consume data from Kafka topics of different source systems and push the data into HDFS, as sketched below.
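Illustration: a minimal Structured Streaming sketch of that job; the broker, topic, and HDFS paths are hypothetical placeholders, and the spark-sql-kafka connector package must be on the classpath.

    # Minimal sketch: stream a Kafka topic into HDFS as Parquet.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

    events = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
              .option("subscribe", "source_events")               # hypothetical topic
              .load()
              .select(col("value").cast("string").alias("payload")))

    query = (events.writeStream.format("parquet")
             .option("path", "hdfs:///data/landing/source_events")
             .option("checkpointLocation", "hdfs:///checkpoints/source_events")
             .trigger(processingTime="1 minute")
             .start())
    query.awaitTermination()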
Expertise in creating DataStage parallel and sequence jobs, following established standards.
Performed MapReduce jobs to enhance data quality and accuracy.
Formulated Kafka producer clients using Confluent libraries and generated events into Kafka topics.
Extracted data from Cassandra using Sqoop, placed it in HDFS, and processed it with Hive.
Composed SQL scripts for data migration and handled data discrepancies, including migration from Teradata SQL to
Snowflake.
Converted Hive/SQL queries into Spark transformations using Spark RDDs and PySpark.
Utilized Spark SQL for loading Parquet data, creating SchemaRDDs (DataFrames), and handling structured data; see the sketch below.
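Illustration: a minimal sketch of that pattern; the path, view, and column names are hypothetical placeholders.

    # Minimal sketch: load Parquet, expose it as a view, query it with Spark SQL.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("parquet-sql").getOrCreate()

    df = spark.read.parquet("hdfs:///data/curated/claims")  # schema comes from Parquet metadata
    df.printSchema()

    df.createOrReplaceTempView("claims")
    spark.sql("""
        SELECT member_id, SUM(amount) AS total_amount
        FROM claims
        GROUP BY member_id
        ORDER BY total_amount DESC
        LIMIT 10
    """).show()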
Worked within the Hadoop ecosystem, implemented Spark with Scala, and utilized DataFrames and Spark SQL for faster data
processing.
Devised and maintained data orchestration workflows with AWS Step Functions for ETL tasks, and monitored data pipeline
performance using AWS CloudWatch, as sketched below.
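Illustration: a minimal boto3 sketch of triggering and checking such a workflow; the region, state-machine ARN, and input payload are hypothetical placeholders.

    # Minimal sketch: start and poll a Step Functions execution with boto3.
    import json
    import boto3

    sfn = boto3.client("stepfunctions", region_name="us-east-1")

    resp = sfn.start_execution(
        stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:nightly-etl",
        input=json.dumps({"run_date": "2024-01-01"}))

    # CloudWatch alarms would normally drive alerting; this just polls once.
    status = sfn.describe_execution(executionArn=resp["executionArn"])["status"]
    print("execution status:", status)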
Subscribed to Kafka topics with Kafka consumer clients, processing real-time events using Spark.
Formulated Scala scripts and Hive UDFs, and implemented RDDs in Spark for data aggregation and storage in S3.
Demonstrated expertise in creating, debugging, scheduling, and monitoring jobs using Airflow.
Proficiency in working with distributed computing frameworks like Hadoop and Apache Spark, enabling scalable and
parallelized data processing for large-scale machine learning projects.
Implemented CI/CD pipelines using Jenkins and Kubernetes for automated deployment and updates of data pipelines.
Implemented security measures to protect sensitive clinical data sets in accordance with HIPAA regulations.
Orchestrated data ingestion processes from various sources into Kubernetes-managed containers, ensuring seamless data flow
and maintaining high data integrity.
Migrated data from AWS S3 buckets to Snowflake by writing custom read/write Snowflake utility functions, as in the sketch below.
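Illustration: a minimal sketch of such a utility; the connection details, external stage, file format, and names are hypothetical placeholders.

    # Minimal sketch: COPY files under an S3-backed external stage into Snowflake.
    import snowflake.connector

    def copy_s3_to_snowflake(table: str, s3_prefix: str) -> None:
        conn = snowflake.connector.connect(
            user="ETL_USER", password="...", account="my_account",   # hypothetical
            warehouse="LOAD_WH", database="ANALYTICS", schema="PUBLIC")
        try:
            # @s3_stage is assumed to be an external stage over the S3 bucket.
            conn.cursor().execute(f"""
                COPY INTO {table}
                FROM @s3_stage/{s3_prefix}
                FILE_FORMAT = (TYPE = 'PARQUET')
                MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
            """)
        finally:
            conn.close()

    copy_s3_to_snowflake("claims", "curated/claims/2024-01-01/")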
Ensured compliance with healthcare data standards, including FHIR and HL7 v2, in data processing and storage.
Utilized Apache Spark with Python for big data analytics and ML applications, including Spark ML and MLlib.
Expertise in writing complex SQL queries and performing data transformations using SQL functions, joins, subqueries, and
aggregations within Snowflake.
Implemented security measures and ensured compliance with data protection standards in both SNS and EKS configurations.
Strong understanding of data modeling concepts and database design principles, including star schema and snowflake
schema, to optimize data structures for efficient data loading and querying in Snowflake.
Proficient in performance tuning and optimization techniques in Snowflake, including query optimization, partitioning,
clustering, and using Snowflake's query and resource optimization features to improve data processing efficiency.
Delivered data warehouse and integration solutions with a strong focus on healthcare data compliance, including HIPAA
regulations.
Built data warehouse structures, creating fact, dimension, and aggregate tables through dimensional modeling with star and
snowflake schemas.
Implemented performance optimization techniques in PySpark jobs to enhance data processing speed and reduce resource
consumption, resulting in improved pipeline efficiency; representative techniques are sketched below.
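Illustration: a minimal sketch of representative tuning levers; the paths, join key, and settings are hypothetical placeholders.

    # Minimal sketch: AQE, right-sized shuffles, broadcast join, selective caching.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = (SparkSession.builder.appName("tuning")
             .config("spark.sql.adaptive.enabled", "true")    # adaptive query execution
             .config("spark.sql.shuffle.partitions", "200")   # right-size shuffle width
             .getOrCreate())

    facts = spark.read.parquet("hdfs:///data/facts")
    dims = spark.read.parquet("hdfs:///data/dims")

    # Broadcast the small dimension table to avoid a shuffle join.
    joined = facts.join(broadcast(dims), "dim_id")

    joined.cache()   # cache only results that are reused downstream
    (joined.repartition("dim_id")
           .write.mode("overwrite")
           .partitionBy("dim_id")
           .parquet("hdfs:///data/output"))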
Ensured adherence to FHIR/HL7 v2 standards in data integration processes, facilitating seamless interoperability across
healthcare systems.
Created data models and schema designs to support the storing and retrieving of clinical information.
Provided technical guidance and training to junior data engineers on Snowflake best practices, SQL optimization techniques,
and data modeling principles.
Managed schema evolution in PySpark applications to accommodate changes in data structure and ensure seamless
integration with evolving business needs.
Successfully handled large-scale data ingestion into Snowflake, optimizing file sizes and formats for efficient loading.
Implemented ELT processes in Snowflake, including loading raw data and transforming it using Snowflake's SQL capabilities
and virtual warehouses.
Adhered to healthcare regulations such as HIPAA by configuring EKS with appropriate security controls and auditing
capabilities and setting up SNS notifications for monitoring and alerting on data quality issues.
Conducted data analysis and provided insights to support business decision-making, using tools like AWS Quick Sight,
Tableau, or custom data visualization solutions.
Environment: Hadoop, Hive, MapReduce, Sqoop, Apache Kafka, Spark ML and MLlib, FHIR/HL7 v2 standards, AWS Glue,
AWS Step Functions, AWS CloudWatch, AWS Lambda, Amazon Redshift, EMR, EC2, S3, SNS, EKS, AWS QuickSight,
DataStage, Jenkins, Kubernetes, Snowflake, PySpark, Spark RDDs, Spark SQL, Scala 2.11, Python, SQL, Delta Lake, Airflow,
Tableau.
Environment: Hadoop, Hive, HDFS, Hive DDLs, MapReduce, Sqoop, Kafka, Apache NiFi, Terraform, Scala, Azure Data
Factory, Azure Databricks, Azure Data Lake Storage, Azure Blob Storage, Azure Functions, Azure SQL Database, Azure Cloud
Data Warehouse, Azure Queue Storage, Azure HDInsight, Azure Event Hubs, Azure Synapse Analytics, Azure Virtual Machines,
Azure Resource Manager, Power BI.
EDUCATION
Bachelor's in Computer Science, JNTU Hyderabad, 2014.