Data Engineer
Manijob201@gmail.com | +1 (628) 290-8561
Professional Summary
• Around 9 years of IT experience as a professional Data Engineer in Big Data analytics and the Hadoop ecosystem, designing and implementing data pipelines, cloud-based applications, and data visualization tools.
• Experience working with Big Data, Spark, Python, Teradata, SQL, and Tableau.
• Industrial experience in data manipulation using Hadoop ecosystem tools such as MapReduce and HDFS.
• Expert in working with Cloud Pub/Sub to replicate real-time data from source systems to GCP BigQuery.
• Designed and implemented end-to-end data pipelines to extract, cleanse, process, and analyze huge amounts of data.
• Experienced in developing production-ready Spark applications using Spark RDD APIs, DataFrames, and Spark SQL.
• Hands-on experience working with various NoSQL databases such as Cassandra, ScyllaDB, and HBase.
• Proficient in importing and exporting data between RDBMS and HDFS using Sqoop.
• Working with GCP services including Cloud Storage, Dataproc, Dataflow, BigQuery, Cloud Composer, and Cloud Pub/Sub.
• Expertise in extracting, transforming, and loading data from various sources such as flat files, Excel, Oracle, and MS SQL, performing transformation and aggregation across multiple file formats to uncover insights into customer usage patterns.
• Used Hive extensively to perform various data analytics required by business teams.
• Solid experience working with various data formats such as Parquet, ORC, JSON, and Avro.
• Experience automating end-to-end data pipelines with strong resilience and recoverability.
• Worked on Spark Streaming and Structured Streaming with Kafka for real-time data processing.
• Experience creating Impala views on Hive tables for fast access to data.
• Experienced in using Waterfall, Agile, and Scrum software development process frameworks.
• Strong experience working with Linux and Unix operating systems.
• Experience executing batch jobs and streaming data through Spark Streaming.
• Experienced, process-oriented data analyst with excellent analytical, quantitative, and problem-solving skills.
• Worked with Oracle database objects, SQL Trace, Explain Plan, different types of optimizers, hints, indexes, table partitions and sub-partitions, materialized views, global temporary tables, autonomous transactions, bulk binds, and Oracle built-in functions; performance tuning of Informatica mappings and workflows.
• Sound knowledge of distributed systems architecture and parallel processing frameworks.
• Experienced with continuous integration/continuous delivery and build tools such as Jenkins, Git, and SVN.
• Developed Spark jobs to collect and store data from various source systems.
• Designed and implemented various layers of the Data Lake, designing star schemas for BigQuery.
• Used Google Cloud Functions with Python to load data into BigQuery on arrival of CSV files in a GCS bucket.
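A minimal sketch of the parsing step such an on-arrival Cloud Function might perform, assuming a function and column names that are purely illustrative; the GCS trigger and the BigQuery client call themselves are omitted:

```python
import csv
import io

def csv_rows_for_load(csv_text):
    """Parse an arriving CSV payload into row dicts.

    In a real Cloud Function the text would be read from the GCS object
    that fired the event, and the dicts handed to BigQuery's insert/load
    API. The column names below are hypothetical.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]

sample = "user_id,event,amount\n42,purchase,19.99\n"
rows = csv_rows_for_load(sample)
```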
• Loading data into Snowflake tables from an internal stage using SnowSQL.
• Process and load bounded and unbounded data from Google Pub/Sub topics to BigQuery on GCP.
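As a sketch of the decode step in such a Pub/Sub-to-BigQuery flow: Pub/Sub delivers payloads base64-encoded under the message's `data` field, so a handler typically decodes and deserializes before writing rows. The JSON schema here is a hypothetical example:

```python
import base64
import json

def decode_pubsub_envelope(envelope):
    """Decode a Pub/Sub push envelope into a BigQuery-ready row dict.

    Pub/Sub base64-encodes the payload under message.data; the fields in
    the payload itself are illustrative, not from any specific pipeline.
    """
    data = envelope["message"]["data"]
    return json.loads(base64.b64decode(data).decode("utf-8"))

envelope = {"message": {"data": base64.b64encode(b'{"id": 1, "status": "ok"}').decode("ascii")}}
row = decode_pubsub_envelope(envelope)
```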
• Hands-on experience migrating on-premises workloads to Google Cloud Platform using Google Cloud Storage, BigQuery, Cloud Composer, and Cloud Dataproc.
• Exposure to IAM roles in GCP.
• Worked on Spark SQL; created DataFrames by loading data from Hive tables.
• Performed data ingestion and transformation in GCP and coordinated tasks among the team.
• Developed Airflow DAGs in Python by importing the Airflow libraries.
• Used Spark Streaming to receive real-time data from Kafka and store the stream data in HDFS using Python, and in NoSQL databases such as HBase and Cassandra.
• Designed and implemented data ingestion techniques for data coming from various source systems.
• Designed and developed Spark code using Python, PySpark, and Spark SQL for high-performance processing of data stored on S3, creating a virtual Data Lake without having to go through the ETL process.
• Developed various Spark applications using Scala to perform enrichments of user behavioral data (clickstream data) merged with user profile data.
• Involved in data cleansing, event enrichment, data aggregation, and de-normalization.
• Converted Hive/SQL queries into Spark transformations using Spark RDDs in Python and Scala.
• Developed Spark scripts using Python on AWS EMR for data aggregation, validation, and other data processing tasks; utilized Spark's in-memory capabilities to handle large datasets.
• Experienced working with EMR clusters and S3 in the AWS cloud; developed APIs using AWS Lambda to manage servers and run code in AWS.
• Created Hive tables, loading and analyzing data using Hive scripts.
• Performed hypothesis testing to determine whether a sample size is statistically significant.
• Used Spark for interactive queries and processing of streaming data, integrated with NoSQL databases.
• Experience with AWS IAM, Data Pipeline, EMR, S3, and EC2.
• Developed batch scripts to fetch data from AWS S3 storage and perform the required transformations.
• Developed Spark code using Scala and Spark SQL for faster processing of data.
• Explored Spark for improving the performance and optimization of existing algorithms in Hadoop using Spark SQL, DataFrames, pair RDDs, and Spark on YARN.
• Experience with Terraform scripts that automate step execution in EMR to load data in multiple formats.
• Developed upgrade and downgrade scripts in SQL that filter bad and unnecessary records; implemented a custom RecordWriter for Spark job computations to handle custom business requirements.
• Experience installing, configuring, supporting, and managing Hadoop clusters using Apache and Cloudera (CDH 5.x) distributions on Amazon Web Services (AWS).
• Worked with Sqoop to extract data from relational databases into Hadoop.
• Developed multiple MapReduce jobs in Java for data cleaning and preprocessing.
• Experience writing SQL queries across different databases such as Teradata and Hive.
• Implemented partitioning, dynamic partitions, and buckets in Hive for efficient data access.
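The shape of the Hive DDL behind partitioned and bucketed tables can be sketched as follows; the table and column names are placeholders, not taken from any actual project:

```python
def partitioned_table_ddl(table, columns, partition_cols, bucket_col=None, num_buckets=0):
    """Build a HiveQL CREATE TABLE statement with partitions and optional buckets.

    Mirrors the PARTITIONED BY / CLUSTERED BY clauses Hive uses for
    efficient data access. All identifiers here are hypothetical.
    """
    cols = ", ".join(f"{name} {typ}" for name, typ in columns)
    parts = ", ".join(f"{name} {typ}" for name, typ in partition_cols)
    ddl = f"CREATE TABLE {table} ({cols}) PARTITIONED BY ({parts})"
    if bucket_col:
        ddl += f" CLUSTERED BY ({bucket_col}) INTO {num_buckets} BUCKETS"
    return ddl + " STORED AS ORC"

ddl = partitioned_table_ddl(
    "events",
    [("user_id", "BIGINT"), ("action", "STRING")],
    [("dt", "STRING")],
    bucket_col="user_id",
    num_buckets=16,
)
```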
• Developed complex MapReduce jobs using Pig and Hive to handle multiple file formats.
• Involved in creating Hive tables, loading them with data, and writing Hive queries.
• Transferred data between homogeneous and heterogeneous systems using SQL tools (SSIS, Bulk Insert).
• Performed ETL operations to support incremental and historical data loads and nightly data refreshes to the report server.
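The core of an incremental load like the one above is a high-watermark filter: keep only records newer than the last loaded timestamp, then advance the watermark. A minimal sketch, with illustrative names and integer timestamps standing in for real ones:

```python
def incremental_batch(records, high_watermark):
    """Select records newer than the last-loaded watermark.

    'records' are (timestamp, payload) pairs; in a real nightly job the
    watermark would be persisted between runs. Names are hypothetical.
    """
    new = [r for r in records if r[0] > high_watermark]
    next_wm = max((r[0] for r in new), default=high_watermark)
    return new, next_wm
```

A historical (full) load is simply the same call with the watermark set below the earliest record.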
• Performed T-SQL tuning and optimization of queries for reports with long execution times using MS SQL Profiler, Index Tuning Wizard, and SQL Query Analyzer; generated reports using SSRS which were sent to different users.
• Maintained table performance through normalization and index creation; collected statistics using query optimization, query execution plans, SQL Server Profiler, and Database Engine Tuning Advisor.
• Created complex SSRS reports using multiple data providers, global variables, expressions, user-defined objects, aggregate-aware objects, charts, and synchronized queries.
• Created linked servers for data retrieval using OLE DB data sources and providers.
• Developed a database for integration by writing SQL queries and stored procedures.
• Wrote unit test cases and performed unit testing.
• Stored all source code in TortoiseSVN and updated development status in SVN in a timely manner.
• Responsible for developing, supporting, and maintaining the ETL (Extract, Transform, Load) processes.