DATA ENGINEER
Sanjayservice2@gmail.com
Sanjay@oktanix.com
8472648569
PROFESSIONAL SUMMARY:
• Over 8 years of experience in the IT industry, spanning Big Data and Hadoop environments both on-premises and in the cloud, hosting cloud-based data warehouses and databases using Redshift, Cassandra, and RDBMS sources.
• Experience in Big Data ecosystems using Hadoop, Pig, Hive, HDFS, MapReduce, Sqoop, Storm,
Spark, Airflow, Snowflake, Teradata, Flume, Kafka, Yarn, Oozie, and Zookeeper.
• Extensively used Python libraries including PySpark, Pytest, PyMongo, cx_Oracle, PyExcel, Boto3, Psycopg, embedPy, NumPy, and Beautiful Soup.
• Experienced in designing and delivering ETL solutions across a wide range of business domains.
• Implemented Integration solutions for cloud platforms with Informatica Cloud.
• Strong exposure to Big Data technologies and the Hadoop ecosystem, with an in-depth understanding of MapReduce and Hadoop infrastructure.
• Expertise in writing end to end Data processing Jobs to analyze data using MapReduce, Spark
and Hive.
• Experience with the Apache Spark ecosystem using Spark Core, Spark SQL, DataFrames, and RDDs, with knowledge of Spark MLlib.
• Involved in building Data Models and Dimensional Modeling with 3NF, Star and
Snowflake schemas for OLAP and Operational data store (ODS) applications.
• Extensive knowledge of developing Spark Streaming jobs using RDDs (Resilient Distributed Datasets) with Scala, PySpark, and spark-shell.
• Experience with the Alteryx platform, including data preparation, data blending, and the creation of data models and data sets.
• Experienced in data manipulation using Python for loading and extraction as well as with
Python libraries such as NumPy, SciPy and Pandas for data analysis and numerical
computations.
• Experience in using Power BI to analyze data from multiple sources and create reports with interactive dashboards. Extensive knowledge of designing reports, scorecards, and dashboards in Power BI.
• Experienced in using Pig scripts to do transformations, event joins, filters and pre-aggregations
before storing the data into HDFS.
• Strong knowledge of Hive analytical functions, extending Hive functionality by writing custom
UDFs.
• Expertise in writing MapReduce jobs in Python for processing large structured, semi-structured, and unstructured data sets and storing them in HDFS.
• Good understanding of data modeling (Dimensional & Relational) concepts like Star-Schema
Modeling, Snowflake Schema Modeling, Fact and Dimension tables.
• Used Amazon Web Services Elastic Compute Cloud (AWS EC2) to launch cloud instances.
• Hands-on experience working with Amazon Web Services (AWS), using Elastic MapReduce (EMR), Redshift, and EC2 for data processing.
• Hands-on experience with SQL and NoSQL databases, including Snowflake, HBase, Cassandra, and MongoDB.
• Hands-on experience setting up workflows using the Apache Airflow and Oozie workflow engines for managing and scheduling Hadoop jobs.
• Strong experience in working with UNIX/LINUX environments, writing shell scripts.
• Worked with various formats of files like delimited text files, clickstream log files, Apache log
files, Avro files, JSON files, XML Files.
• Experienced in working in SDLC, Agile and Waterfall Methodologies.
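Several bullets above mention writing MapReduce jobs in Python (typically run via Hadoop Streaming). A minimal, hypothetical word-count sketch is shown below; in a real streaming job the mapper and reducer would read from stdin, but here they are plain functions so the map/shuffle/reduce logic is visible and testable:

```python
# Hadoop-Streaming-style word count in pure Python.
# Illustrative only: in a real job, map_line/reduce_group would be wired
# to sys.stdin/sys.stdout; Hadoop performs the shuffle/sort between them.
import itertools

def map_line(line):
    """Mapper: emit a (word, 1) pair for each word in a line."""
    for word in line.strip().split():
        yield word.lower(), 1

def reduce_group(word, counts):
    """Reducer: sum the counts for a single word."""
    return word, sum(counts)

def run_local(lines):
    """Simulate the shuffle/sort phase locally so the logic can be tested."""
    pairs = sorted(kv for line in lines for kv in map_line(line))
    return [reduce_group(w, (c for _, c in grp))
            for w, grp in itertools.groupby(pairs, key=lambda kv: kv[0])]
```

The same mapper and reducer bodies would be split into two scripts and passed to Hadoop Streaming with `-mapper` and `-reducer`.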
TECHNICAL SKILLS:
Cloud Technologies      Amazon Web Services (IAM, S3, EC2, VPC, ELB, Route 53, RDS, Auto Scaling, CloudFront), Jenkins, Git, Chef, Consul, Docker, Rackspace, GCP, Azure.
DevOps Tools            UrbanCode Deploy, Jenkins (CI), Puppet, Chef, Ansible, AWS.
Languages               C, SQL, Shell, and Python scripting.
Databases               MySQL, MongoDB, Cassandra, SQL Server.
Web/App Servers         Apache, IIS, HIS, Tomcat, WebSphere Application Server, JBoss.
CI Tools                Hudson, Jenkins, Bamboo, CruiseControl.
Other DevOps Tools      Jenkins, Perforce, Docker, AWS, Chef, Puppet, Ant, Atlassian Jira, Ansible, OpenStack, SaltStack, Splunk.
WORK EXPERIENCE:
BCBS | Jacksonville, FL                                                          Jan 2021 to Present
Data Engineer
Responsibilities:
• Created pipelines in ADF using Linked Services, Datasets, and Pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob Storage, and Azure SQL Data Warehouse, including write-back scenarios.
• Extracted, transformed, and loaded data from source systems to Azure Data Storage services using a combination of Azure Data Factory, Spark SQL, and U-SQL Azure Data Lake Analytics.
• Created complex Informatica mappings using Source Qualifier, Expression, Router, Aggregator, Lookup, Normalizer, and other transformations, and debugged mappings using the Informatica Debugger.
• Strong data warehousing experience in application development and quality assurance testing using Informatica PowerCenter 9.1/8.6 (Designer, Workflow Manager, Workflow Monitor), PowerExchange, OLAP, and OLTP.
• Ingested data into one or more Azure services (Azure Data Lake, Azure Storage, Azure SQL, Azure DW) and processed it in Azure Databricks. Worked on Azure IaaS and PaaS services and on storage such as Blob (page and block) and SQL Azure.
• Implemented OLAP multi-dimensional functionality using Azure SQL Data Warehouse; retrieved data with Azure SQL and used Azure ML to build, test, and run predictive models.
• Worked on Cloud databases such as Azure SQL Database, SQL managed instance, SQL Elastic pool
on Azure, and SQL server.
• Architected and implemented medium to large scale BI solutions on Azure using Azure Data Platform services (Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW, HDInsight/Databricks, NoSQL DB).
• Responsible for estimating cluster size and for monitoring and troubleshooting the Spark Databricks cluster.
• Designed and developed Azure Data Factory pipelines to extract, load, and transform data from different source systems (Mainframe, SQL Server, IBM DB2, shared drives, etc.) to Azure Data Storage services using a combination of Azure Data Factory, Azure Databricks (PySpark, Spark SQL), Azure Stream Analytics, and U-SQL Azure Data Lake Analytics. Ingested data into various Azure storage services such as Azure Data Lake, Azure Blob Storage, and Azure Synapse Analytics (formerly Azure SQL Data Warehouse).
• Configured and deployed Azure Automation Scripts for a multitude of applications utilizing the
Azure stack (including Compute, Web & Mobile, Blobs, ADF, Resource Groups, Azure Data Lake,
HDInsight Clusters, Azure Data Factory, Azure SQL, Cloud Services, and ARM), Services and Utilities
focusing on Automation.
• Involved in migrating objects from Teradata to Snowflake and created Snowpipe for continuous data loading.
• Increased adoption of solutions including Azure SQL Database and Azure Cosmos DB.
• Created continuous integration and continuous delivery (CI/CD) pipeline on Azure that helps to
automate steps in the software delivery process.
• Deployed and managed applications in the datacenter, in virtual environments, and on the Azure platform.
• Involved in converting Hive/SQL queries into Spark transformations using Spark RDD's and PySpark.
• Processed and analyzed log data stored in HBase and imported it into the Hive warehouse, enabling business analysts to write HQL queries.
• Handled importing of data from various data sources, performed transformations using Hive, and
loaded data into HDFS.
• Designed, developed, and implemented performant ETL pipelines using PySpark and Azure Data Factory.
Environment: Azure Data Factory (V2), Azure Databricks (PySpark, Spark SQL), Azure Data Lake, Azure Blob Storage, Azure ML, Azure SQL, Hive, Git, GitHub, JIRA, HQL, Snowflake, Teradata.
Verizon | Richmond, VA                                                           Jan 2017 to Sep 2018
Data Engineer
Responsibilities:
• Extensively used Agile methodology as the organization standard to implement data models.
• Created several types of data visualizations using Python and Tableau.
• Extracted large volumes of data from AWS using SQL queries to create reports.
• Performed reverse engineering using Erwin to redefine entities, attributes, and relationships in the existing database.
• Analyzed functional and non-functional business requirements, translated them into technical data requirements, and created or updated logical and physical data models.
• Developed a data pipeline using Kafka to store data into HDFS.
• Designed and developed architecture for data services ecosystem spanning Relational, NoSQL,
and Big Data technologies.
• Performed Regression testing for Golden Test Cases from State (end to end test cases) and
automated the process using python scripts.
• Developed Spark jobs using Scala for faster real-time analytics and used Spark SQL for querying.
• Used SQL Server Integration Services (SSIS) to extract, transform, and load data into the target system from multiple sources.
• Primarily responsible for converting a manual reporting system into a fully automated CI/CD data pipeline that ingests data from different marketing platforms into an AWS S3 data lake.
• Utilized AWS services with focus on big data analytics, enterprise data warehouse and business
intelligence solutions to ensure optimal architecture, scalability, flexibility.
• Designed AWS architecture and cloud migration using AWS EMR, DynamoDB, and Redshift, with event processing using Lambda functions.
• Gathered data from Google AdWords, Apple search ad, Facebook ad, Bing ad, Snapchat ad,
Omniture data and CSG using their API.
• Imported existing datasets from Oracle into the Hadoop system using Sqoop.
• Developed ETL pipelines in and out of the data warehouse using a combination of Python and Snowflake's SnowSQL, writing SQL queries against Snowflake.
• Hands-on experience importing and exporting data from Snowflake, Oracle, and DB2 into HDFS and Hive using Sqoop for analysis, visualization, and report generation.
• Created Sqoop jobs with incremental load to populate Hive External tables.
• Wrote Spark Core programs to process and cleanse data, then loaded the data into Hive or HBase for further processing.
• Implemented usage of Amazon EMR for processing Big Data across a Hadoop Cluster of virtual
servers on Amazon Elastic Compute Cloud (EC2) and Amazon Simple Storage Service (S3).
• Used AWS system manager to automate operational tasks across AWS resources.
• Wrote Lambda function code and set CloudWatch Event as trigger with Cron job Expression.
• Connected Redshift to Tableau for creating dynamic dashboard for analytics team.
• Set up a connection between S3 and AWS SageMaker (machine learning platform) for predictive analytics, and uploaded inference results to Redshift.
• Good experience implementing and orchestrating data pipelines using Oozie and Airflow; worked with Cloudera and Hortonworks distributions.
• Wrote UNIX shell scripts to automate jobs and scheduled them as cron jobs using crontab.
• Worked with big data technologies such as Spark, Scala, Hive, and Hadoop clusters (Cloudera platform).
• Deployed the project on Amazon EMR with S3 connectivity for setting backup storage.
• Conducted ETL Data Integration, Cleansing, and Transformations using AWS glue Spark script.
• Wrote Python modules to extract data from the MySQL source database.
• Worked on Cloudera distribution and deployed on AWS EC2 Instances.
• Migrated highly available web servers and databases to AWS EC2 and RDS with minimal or no downtime.
• Worked with AWS IAM to generate new accounts, assign roles and groups.
• Created Jenkins jobs for CI/CD using Git, Maven and Bash scripting.
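One bullet above describes Lambda function code triggered by a CloudWatch Events cron expression. A stripped-down, hypothetical handler is sketched below; the event shape follows the standard scheduled-event payload, and the actual work (an S3 write via boto3) is shown only in comments so the sketch stays self-contained:

```python
import json

def lambda_handler(event, context):
    """Entry point invoked by a CloudWatch Events / EventBridge rule on a
    schedule, e.g. cron(0 6 * * ? *) for 06:00 UTC daily."""
    # A scheduled event carries the rule's fire time in the "time" field.
    fire_time = event.get("time", "unknown")

    # In a real function the work would happen here, for example:
    #   s3 = boto3.client("s3")
    #   s3.put_object(Bucket="example-bucket", Key="report.json", Body=...)
    # (bucket and key are hypothetical)
    result = {"status": "ok", "triggered_at": fire_time}
    return {"statusCode": 200, "body": json.dumps(result)}
```

The cron expression lives on the CloudWatch rule, not in the code; the rule simply invokes the function with the scheduled-event payload.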
Environment: AWS, Redshift, PySpark, Cloudera, Hadoop, Spark, Sqoop, MapReduce, Python, Tableau, EC2, EMR, Glue, S3, Kafka, IAM, Azure, PostgreSQL, MySQL, Jenkins, Maven, AWS CLI, Cucumber, Java, Unix, Shell Scripting, Git.
Kelly Maxson | Hyderabad, IN                                                     May 2013 to Dec 2016
SQL Developer
Responsibilities:
• Worked on new Data warehouse design for ETL and Reporting projects using SSIS and SSRS.
• Created various kinds of reports using Power BI and Tableau based on the client's needs.
• Implemented clustered, non-clustered, and covering indexes as appropriate on data structures to achieve faster data retrieval.
• Worked with Data Governance tools and extract-transform-load (ETL) processing tool for data
mining, data warehousing, and data cleaning using SQL.
• Optimized SQL performance, integrity, and security of the project’s databases/schemas.
• Performed ETL operations to support incremental, historical data loads and transformations using
SSIS.
• Created SSIS packages to extract data from OLTP to OLAP systems and scheduled jobs to call the
packages.
• Implemented Event Handlers and Error Handling in SSIS packages and notified process results to
various user communities.
• Designed SSIS packages to import data from multiple sources and to control the upstream and downstream flow of data into the SQL Azure database.
• Used various advanced SSIS functionalities like complex joins, conditional splitting, column
conversions for better performance during package execution.
• Developed impactful reports using SSRS, MS Excel, Pivot tables and Tableau to solve the business
requirements.
• Maintained the physical database by monitoring performance and integrity, and optimized SQL queries for maximum efficiency using SQL Profiler.
• Worked on formatting SSRS reports using the Global variables and expressions.
• Created Power BI Reports using the Tabular SSAS models in Power BI desktop and published them
to Dashboards using cloud service.
• Created designs and process flows for standardizing Power BI dashboards to meet business requirements, and made regular changes to existing Power BI dashboards per requests from the business.
• Created a data maintenance plan to back up and restore databases, update statistics, and rebuild indexes.
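The indexing bullets above can be illustrated with a small self-contained example. This sketch uses Python's built-in sqlite3 standing in for SQL Server (the syntax and planner differ, but the principle, a covering index letting the engine answer a query without a full table scan, is the same); table and column names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INT, total REAL)")
conn.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                 [(i % 100, i * 1.5) for i in range(1000)])

# A covering index: the query below can be answered from the index alone,
# since both the filter column and the selected column are in the index.
conn.execute("CREATE INDEX ix_orders_customer ON orders (customer_id, total)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT total FROM orders WHERE customer_id = 42"
).fetchone()
# The plan's detail column reports something like:
#   SEARCH orders USING COVERING INDEX ix_orders_customer (customer_id=?)
```

In SQL Server the equivalent would be a non-clustered index with `total` as a key or INCLUDE column, verified via the execution plan rather than EXPLAIN QUERY PLAN.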
Environment: SQL Server 2016, T-SQL, SQL Profiler, SSIS, SSRS, SSAS, TFS, MS SQL Server, Oracle 10g, Oracle WebLogic Server, Query Analyzer, Power BI, Power Pivot, Windows, MS Excel.