DATA ENGINEER
Sanjayservice2@gmail.com
Sanjay@oktanix.com
8472648569
PROFESSIONAL SUMMARY:
• Over 8 years of experience in the IT industry, spanning Big Data and Hadoop environments both on-premises and in the cloud, hosting cloud-based data warehouses and databases using Redshift, Cassandra, and RDBMS sources.
• Experience in Big Data ecosystems using Hadoop, Pig, Hive, HDFS, MapReduce, Sqoop, Storm,
Spark, Airflow, Snowflake, Teradata, Flume, Kafka, Yarn, Oozie, and Zookeeper.
• Extensively used Python libraries including PySpark, Pytest, PyMongo, cx_Oracle, PyExcel, Boto3, Psycopg, embedPy, NumPy, and Beautiful Soup.
• Experienced in designing and delivering ETL solutions across a wide range of business domains.
• Implemented Integration solutions for cloud platforms with Informatica Cloud.
• Strong exposure to Big Data technologies and the Hadoop ecosystem, with an in-depth understanding of MapReduce and Hadoop infrastructure.
• Expertise in writing end to end Data processing Jobs to analyze data using MapReduce, Spark
and Hive.
• Experience with the Apache Spark ecosystem using Spark Core, Spark SQL, DataFrames, and RDDs, with knowledge of Spark MLlib.
• Involved in building Data Models and Dimensional Modeling with 3NF, Star and
Snowflake schemas for OLAP and Operational data store (ODS) applications.
• Extensive knowledge of developing Spark Streaming jobs using RDDs (Resilient Distributed Datasets) with Scala, PySpark, and spark-shell.
• Experience with the Alteryx platform, including data preparation, data blending, and the creation of data models and data sets.
• Experienced in data manipulation using Python for loading and extraction as well as with
Python libraries such as NumPy, SciPy and Pandas for data analysis and numerical
computations.
• Experience in using Power BI to analyze data from multiple sources and create reports with interactive dashboards. Extensive knowledge of designing reports, scorecards, and dashboards in Power BI.
• Experienced in using Pig scripts to do transformations, event joins, filters and pre-aggregations
before storing the data into HDFS.
• Strong knowledge of Hive analytical functions, extending Hive functionality by writing custom
UDFs.
• Expertise in writing MapReduce jobs in Python for processing large structured, semi-structured, and unstructured data sets and storing them in HDFS.
• Good understanding of data modeling (Dimensional & Relational) concepts like Star-Schema
Modeling, Snowflake Schema Modeling, Fact and Dimension tables.
• Used Amazon Web Services Elastic Compute Cloud (AWS EC2) to launch cloud instances.
• Hands-on experience working with Amazon Web Services (AWS), using Elastic MapReduce (EMR), Redshift, and EC2 for data processing.
• Hands-on experience with SQL and NoSQL databases, including Snowflake, HBase, Cassandra, and MongoDB.
• Hands-on experience setting up workflows using the Apache Airflow and Oozie workflow engines for managing and scheduling Hadoop jobs.
• Strong experience in working with UNIX/LINUX environments, writing shell scripts.
• Worked with various formats of files like delimited text files, clickstream log files, Apache log
files, Avro files, JSON files, XML Files.
• Experienced in working in SDLC, Agile and Waterfall Methodologies.
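Several bullets above mention writing MapReduce jobs in Python (typically run via Hadoop Streaming). A minimal, hypothetical word-count sketch is shown below; in a real streaming job the mapper and reducer would read from stdin, but here they are plain functions so the map/shuffle/reduce logic is visible and testable:

```python
# Hadoop-Streaming-style word count in pure Python.
# Illustrative only: in a real job, map_line/reduce_group would be wired
# to sys.stdin/sys.stdout; Hadoop performs the shuffle/sort between them.
import itertools

def map_line(line):
    """Mapper: emit a (word, 1) pair for each word in a line."""
    for word in line.strip().split():
        yield word.lower(), 1

def reduce_group(word, counts):
    """Reducer: sum the counts for a single word."""
    return word, sum(counts)

def run_local(lines):
    """Simulate the shuffle/sort phase locally so the logic can be tested."""
    pairs = sorted(kv for line in lines for kv in map_line(line))
    return [reduce_group(w, (c for _, c in grp))
            for w, grp in itertools.groupby(pairs, key=lambda kv: kv[0])]
```

The same mapper and reducer bodies would be split into two scripts and passed to Hadoop Streaming with `-mapper` and `-reducer`.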
TECHNICAL SKILLS:
Cloud Technologies      Amazon Web Services (IAM, S3, EC2, VPC, ELB, Route 53, RDS, Auto Scaling, CloudFront), Jenkins, Git, Chef, Consul, Docker, Rackspace, GCP, Azure.
DevOps Tools            UrbanCode Deploy, Jenkins (CI), Puppet, Chef, Ansible, AWS.
Languages               C, SQL, Shell, and Python scripting.
Databases               MySQL, MongoDB, Cassandra, SQL Server.
Web/App Servers         Apache, IIS, HIS, Tomcat, WebSphere Application Server, JBoss.
CI Tools                Hudson, Jenkins, Bamboo, CruiseControl.
Other DevOps Tools      Jenkins, Perforce, Docker, AWS, Chef, Puppet, Ant, Atlassian Jira, Ansible, OpenStack, SaltStack, Splunk.
WORK EXPERIENCE:
BCBS | Jacksonville, FL                                                          Jan 2021 to Present
Data Engineer
Responsibilities:
• Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back scenarios.
• Extracted, transformed, and loaded data from source systems to Azure Data Storage services using a combination of Azure Data Factory, Spark SQL, and U-SQL Azure Data Lake Analytics.
• Created complex Informatica mappings using Source Qualifier, Expression, Router, Aggregator, Lookup, Normalizer, and other transformations, and debugged mappings using the Informatica Debugger.
• Strong data warehousing experience in application development and quality assurance testing using Informatica PowerCenter 9.1/8.6 (Designer, Workflow Manager, Workflow Monitor), PowerExchange, OLAP, and OLTP.
• Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks. Worked on Azure IaaS and PaaS services and on storage such as Blob (page and block) and SQL Azure.
• Implemented OLAP multi-dimensional functionality using Azure SQL Data Warehouse; retrieved data with Azure SQL and used Azure ML to build, test, and run predictive models.
• Worked on Cloud databases such as Azure SQL Database, SQL managed instance, SQL Elastic pool
on Azure, and SQL server.
• Architected and implemented medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
• Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
• Designed and developed Azure Data Factory pipelines to extract, load, and transform data from different source systems (Mainframe, SQL Server, IBM DB2, shared drives, etc.) to Azure Data Storage services using a combination of Azure Data Factory, Azure Databricks (PySpark, Spark SQL), Azure Stream Analytics, and U-SQL Azure Data Lake Analytics. Ingested data into various Azure storage services such as Azure Data Lake, Azure Blob Storage, and Azure Synapse Analytics (formerly Azure SQL Data Warehouse).
• Configured and deployed Azure Automation Scripts for a multitude of applications utilizing the
Azure stack (including Compute, Web & Mobile, Blobs, ADF, Resource Groups, Azure Data Lake,
HDInsight Clusters, Azure Data Factory, Azure SQL, Cloud Services, and ARM), Services and Utilities
focusing on Automation.
• Involved in migrating objects from Teradata to Snowflake and created Snowpipe for continuous data loading.
• Increased adoption of solutions including Azure SQL Database and Azure Cosmos DB.
• Created continuous integration and continuous delivery (CI/CD) pipeline on Azure that helps to
automate steps in the software delivery process.
• Deployed and managed applications in the datacenter, in virtual environments, and on the Azure platform.
• Involved in converting Hive/SQL queries into Spark transformations using Spark RDD's and PySpark.
• Processed and analyzed log data stored in HBase and imported it into the Hive warehouse, enabling business analysts to write HQL queries.
• Handled importing of data from various data sources, performed transformations using Hive, and
loaded data into HDFS.
• Designed, developed, and implemented performant ETL pipelines using PySpark and Azure Data Factory.
Environment: Azure Data Factory (V2), Azure Databricks (PySpark, Spark SQL), Azure Data Lake, Azure Blob Storage, Azure ML, Azure SQL, Hive, Git, GitHub, JIRA, HQL, Snowflake, Teradata.
Verizon | Richmond, VA                                                           Jan 2017 to Sep 2018
Data Engineer
Responsibilities:
• Extensively used Agile methodology as the organization standard to implement data models.
• Created several types of data visualizations using Python and Tableau.
• Extracted large volumes of data from AWS using SQL queries to create reports.
• Performed reverse engineering using Erwin to redefine entities, attributes, and relationships in the existing database.
• Analyzed functional and non-functional business requirements, translated them into technical data requirements, and created or updated logical and physical data models.
• Developed a data pipeline using Kafka to store data into HDFS.
• Designed and developed architecture for data services ecosystem spanning Relational, NoSQL,
and Big Data technologies.
• Performed Regression testing for Golden Test Cases from State (end to end test cases) and
automated the process using python scripts.
• Developed Spark jobs using Scala for faster real-time analytics and used Spark SQL for querying.
• Used SQL Server Integration Services (SSIS) to extract, transform, and load data into the target system from multiple sources.
• Primarily responsible for converting a manual reporting system into a fully automated CI/CD data pipeline that ingests data from different marketing platforms into an AWS S3 data lake.
• Utilized AWS services with focus on big data analytics, enterprise data warehouse and business
intelligence solutions to ensure optimal architecture, scalability, flexibility.
• Designed AWS architecture and cloud migration using AWS EMR, DynamoDB, and Redshift, with event processing using Lambda functions.
• Gathered data from Google AdWords, Apple search ad, Facebook ad, Bing ad, Snapchat ad,
Omniture data and CSG using their API.
• Imported existing datasets from Oracle into the Hadoop system using Sqoop.
• Developed ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, writing SQL queries against Snowflake.
• Hands-on experience importing and exporting data from Snowflake, Oracle, and DB2 into HDFS and Hive using Sqoop for analysis, visualization, and report generation.
• Created Sqoop jobs with incremental load to populate Hive External tables.
• Wrote Spark Core programs to process and cleanse data, then loaded the data into Hive or HBase for further processing.
• Implemented usage of Amazon EMR for processing Big Data across a Hadoop Cluster of virtual
servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
• Used AWS system manager to automate operational tasks across AWS resources.
• Wrote Lambda function code and set CloudWatch Event as trigger with Cron job Expression.
• Connected Redshift to Tableau for creating dynamic dashboard for analytics team.
• Set up a connection between S3 and AWS SageMaker (machine learning platform) for predictive analytics, and uploaded inference results to Redshift.
• Good experience implementing and orchestrating data pipelines using Oozie and Airflow; worked with Cloudera and Hortonworks distributions.
• Wrote UNIX shell scripts to automate jobs and scheduled them as cron jobs using crontab.
• Worked with big data technologies such as Spark, Scala, Hive, and Hadoop clusters (Cloudera platform).
• Deployed the project on Amazon EMR with S3 connectivity for setting backup storage.
• Conducted ETL Data Integration, Cleansing, and Transformations using AWS glue Spark script.
• Wrote Python modules to extract data from the MySQL source database.
• Worked on Cloudera distribution and deployed on AWS EC2 Instances.
• Migrated highly available web servers and databases to AWS EC2 and RDS with minimal or no downtime.
• Worked with AWS IAM to generate new accounts, assign roles and groups.
• Created Jenkins jobs for CI/CD using Git, Maven and Bash scripting.
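One bullet above describes Lambda function code triggered by a CloudWatch Events cron expression. A stripped-down, hypothetical handler is sketched below; the event shape follows the standard scheduled-event payload, and the actual work (an S3 write via boto3) is shown only in comments so the sketch stays self-contained:

```python
import json

def lambda_handler(event, context):
    """Entry point invoked by a CloudWatch Events / EventBridge rule on a
    schedule, e.g. cron(0 6 * * ? *) for 06:00 UTC daily."""
    # A scheduled event carries the rule's fire time in the "time" field.
    fire_time = event.get("time", "unknown")

    # In a real function the work would happen here, for example:
    #   s3 = boto3.client("s3")
    #   s3.put_object(Bucket="example-bucket", Key="report.json", Body=...)
    # (bucket and key are hypothetical)
    result = {"status": "ok", "triggered_at": fire_time}
    return {"statusCode": 200, "body": json.dumps(result)}
```

The cron expression lives on the CloudWatch rule, not in the code; the rule simply invokes the function with the scheduled-event payload.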
Environment: AWS, Redshift, PySpark, Cloudera, Hadoop, Spark, Sqoop, MapReduce, Python, Tableau, EC2, EMR, Glue, S3, Kafka, IAM, Azure, PostgreSQL, MySQL, Jenkins, Maven, AWS CLI, Cucumber, Java, Unix, Shell Scripting, Git.
Kelly Maxson | Hyderabad, IN                                                     May 2013 to Dec 2016
SQL Developer
Responsibilities:
• Worked on new Data warehouse design for ETL and Reporting projects using SSIS and SSRS.
• Created various kinds of reports using Power BI and Tableau based on the client's needs.
• Implemented clustered, non-clustered, and covering indexes as appropriate on data structures to achieve faster data retrieval.
• Worked with Data Governance tools and extract-transform-load (ETL) processing tool for data
mining, data warehousing, and data cleaning using SQL.
• Optimized SQL performance, integrity, and security of the project’s databases/schemas.
• Performed ETL operations to support incremental, historical data loads and transformations using
SSIS.
• Created SSIS packages to extract data from OLTP to OLAP systems and scheduled jobs to call the
packages.
• Implemented Event Handlers and Error Handling in SSIS packages and notified process results to
various user communities.
• Designed SSIS packages to import data from multiple sources and to control the upstream and downstream flow of data into the SQL Azure database.
• Used various advanced SSIS functionalities like complex joins, conditional splitting, column
conversions for better performance during package execution.
• Developed impactful reports using SSRS, MS Excel, Pivot tables and Tableau to solve the business
requirements.
• Maintained the physical database by monitoring performance and integrity, and optimized SQL queries for maximum efficiency using SQL Profiler.
• Worked on formatting SSRS reports using the Global variables and expressions.
• Created Power BI Reports using the Tabular SSAS models in Power BI desktop and published them
to Dashboards using cloud service.
• Created designs and process flows for standardizing Power BI dashboards to meet business requirements, and made regular changes to existing Power BI dashboards per requests from the business.
• Created a data maintenance plan to back up and restore databases, update statistics, and rebuild indexes.
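The indexing bullets above can be illustrated with a small self-contained example. This sketch uses Python's built-in sqlite3 standing in for SQL Server (the syntax and planner differ, but the principle, a covering index letting the engine answer a query without a full table scan, is the same); table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT, total REAL)")
conn.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                 [(i % 100, i * 1.5) for i in range(1000)])

# A covering index: the query below can be answered from the index alone,
# since both the filter column and the selected column are in the index.
conn.execute("CREATE INDEX ix_orders_customer ON orders (customer_id, total)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT total FROM orders WHERE customer_id = 42"
).fetchone()
# The plan's detail column reports something like:
#   SEARCH orders USING COVERING INDEX ix_orders_customer (customer_id=?)
```

In SQL Server the equivalent would be a non-clustered index with `total` as a key or INCLUDE column, verified via the execution plan rather than EXPLAIN QUERY PLAN.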
Environment: SQL Server 2016, T-SQL, SQL Profiler, SSIS, SSRS, SSAS, TFS, MS SQL Server, Oracle 10g, Oracle WebLogic Server, Query Analyzer, Power BI, Power Pivot, Windows, MS Excel.