
DEEMA Alk

Staten Island, NY 10304 • 347-493-9722 • Deem.alk97@gmail.com

PROFESSIONAL SUMMARY

5 years of IT experience as a Developer, Designer, and QA Test Engineer, with cross-platform integration experience using the Hadoop ecosystem, Java, and software functional testing. Hands-on experience installing, configuring, and using Hadoop ecosystem components: HDFS, MapReduce, Pig, Hive, Oozie, Flume, HBase, Spark, and Sqoop. Strong understanding of Hadoop services and of MapReduce and YARN architecture; responsible for writing MapReduce programs. Experienced in importing and exporting data into HDFS using Sqoop, loading data into Hive partitions, and creating buckets in Hive. Developed MapReduce jobs to automate data transfer from HBase. Expertise in analysis using Pig, Hive, and MapReduce, and in developing UDFs for Hive and Pig using Java. Strong understanding of NoSQL databases such as HBase, MongoDB, and Cassandra. Scheduled Hadoop, Hive, Sqoop, and HBase jobs using Oozie.

Experience setting up clusters on Amazon EC2 and S3, including automating the provisioning and extension of clusters in the AWS cloud. Good understanding of Scrum methodologies, test-driven development, and continuous integration. Major strengths include familiarity with multiple software systems, the ability to learn new technologies quickly, adaptability to new environments, self-motivation, and teamwork, along with excellent interpersonal, technical, and communication skills. Experience defining detailed application software test plans, including organization, participants, schedule, and test and application coverage scope, as well as gathering and defining functional and user interface requirements for software applications.

Experience in real-time analytics with Apache Spark (RDDs, DataFrames, and the Streaming API); used the Spark DataFrames API on the Cloudera platform to perform analytics on Hive data. Experience integrating Hadoop with Kafka, including loading clickstream data from Kafka to HDFS.
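
Illustrative sketch (not project code): a minimal PySpark example of the kind of DataFrame analytics on Hive data described above. The table and column names (web_logs, page, user_id) are hypothetical placeholders.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("hive-clickstream-analytics")
    .enableHiveSupport()          # read Hive tables through the metastore
    .getOrCreate()
)

# Read an existing Hive table into a DataFrame.
clicks = spark.table("web_logs")

# Aggregate clickstream events per page and keep the most visited pages.
top_pages = (
    clicks.groupBy("page")
          .agg(F.countDistinct("user_id").alias("unique_visitors"),
               F.count("*").alias("hits"))
          .orderBy(F.desc("hits"))
          .limit(20)
)

top_pages.show()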

SKILLS

Big Data Technologies: HDFS, Hadoop MapReduce, Hive, Pig, Sqoop, Flume, Oozie, Storm, Spark, HBase
Languages: C, SQL, Python, Shell Scripting, Scala, R, Core Java
Databases: SQL Server 2008, DocumentDB, MySQL, Neo4j, Teradata
API Frameworks: Flask (Python)
Operating Systems: Windows 10/8/7/XP, Linux (Ubuntu 18.0), Unix, macOS
Source and Version Control: Git, GitHub
Methodologies: Agile, Waterfall model
Additional: Apache Spark, Apache Hive, Web Technologies, Cloud Services, Data Storage and Retrieval, Hadoop Coding, Linux Environments, Strong Communication and Interpersonal Skills

WORK HISTORY

DATA ENGINEER
PNC Bank - USA 04/2020 - Current
Built data pipelines using Apache Airflow (an illustrative DAG sketch follows this job's bullet list)
Created a data lake on AWS using AWS S3 and AWS Athena
Used AWS CodeCommit as the version control system
Worked on AWS EMR virtual clusters to develop Spark applications
Developed highly maintainable Hadoop code and followed all coding best practices
Built, tested, and deployed scalable, highly available, and modular software products
Used PySpark to build Spark applications
Used AWS CodeBuild and CodePipeline for continuous integration and deployment
Worked on Python unit testing using the pytest module
Worked on AWS Lake Formation to manage permissions for the data lake
Built data pipelines for marketing teams to ingest data into and export data from the data lake
Worked with Parquet, JSON, and Avro big data file formats
Worked on a Spark application: Nearest Competitor Store
Migrated Apache Airflow from version 1 to version 2
Used Visual Studio as the IDE and the ServiceNow portal for customer requests
Developed Spark applications using Scala, utilizing the DataFrames and Spark SQL APIs for faster data processing (see the PySpark sketch after this job's bullet list)
Involved in Agile methodologies and daily scrum and sprint-planning meetings, and wrote scripts to distribute queries for performance-test jobs in the Amazon data lake
Created Hive tables, loaded transactional data from Teradata using Sqoop, and worked with highly unstructured and semi-structured data of 2 petabytes in size
Developed MapReduce jobs for cleaning, accessing, and validating the data, and created and ran Sqoop jobs with incremental loads to populate Hive external tables
Developed optimal strategies for distributing the weblog data over the cluster, importing and exporting the stored web log data into HDFS and Hive using Sqoop
Responsible for building scalable distributed data solutions using Hadoop Cloudera and designed and
developed automation test scripts using Python
Analyzed the SQL scripts, designed the solution to be implemented using Spark, and implemented Hive generic UDFs to incorporate business logic into Hive queries
Worked on MongoDB using CRUD (Create, Read, Update, and Delete) operations and its indexing, replication, and sharding features
Involved in designing the HBase row key to store text and JSON as key values in the HBase table, and designed the row key so that rows can be retrieved and scanned in sorted order
Integrated Oozie with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the
box (such as Map-Reduce, Pig, Hive, and Sqoop) as well as system specific jobs (such as Java
programs and shell scripts)
Created Hive tables and worked on them using HiveQL, and designed and implemented static and dynamic partitioning and bucketing in Hive
Developed syllabus/curriculum data pipelines from syllabus/curriculum web services to HBase and Hive tables
Worked on Cluster coordination services through Zookeeper and monitored workload, job performance,
and capacity planning using Cloudera Manager
Involved in building applications using Maven and integrating them with CI servers like Jenkins to build jobs
Monitored Hadoop NameNode health status, the number of TaskTrackers running, and the number of DataNodes running, and automated all jobs, from pulling data from different data sources such as MySQL to pushing the result-set data to the Hadoop Distributed File System
Developed storytelling dashboards in Tableau Desktop, published them onto Tableau Server, and used GitHub version control to maintain project versions.
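
Illustrative sketch (not project code): a minimal Apache Airflow 2.x DAG showing the ingest/export pipeline pattern referenced above. The DAG id, task names, schedule, and extract/load callables are hypothetical placeholders.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_from_source(**context):
    # Placeholder: pull data from the source system (e.g. an API or database).
    print("extracting source data")


def load_to_data_lake(**context):
    # Placeholder: write the extracted data to the S3-backed data lake.
    print("loading data to the lake")


with DAG(
    dag_id="marketing_ingest_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_from_source)
    load = PythonOperator(task_id="load", python_callable=load_to_data_lake)

    extract >> load  # run the extract task before the load task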
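
Illustrative sketch (not project code): the Spark DataFrame and Hive partitioning/bucketing bullets above, shown here in PySpark rather than Scala. The source path, database, table, and column names are hypothetical placeholders.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-partitioned-load")
    .enableHiveSupport()
    .getOrCreate()
)

# Source data, e.g. transactional records landed in HDFS as Parquet.
txns = spark.read.parquet("/data/raw/transactions")

# Write a Hive table partitioned by load date and bucketed by customer id.
(
    txns.write
        .mode("overwrite")
        .partitionBy("load_date")
        .bucketBy(8, "customer_id")
        .sortBy("customer_id")
        .saveAsTable("analytics.transactions_partitioned")
)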

DATA ENGINEER
JetBlue - New York, NY 01/2019 - 04/2020
Worked on analyzing the Hadoop cluster and different big data analytic tools, including Pig, Hive, and Sqoop
Created a POC on Hortonworks and suggested best practices for the HDP and HDF platforms
Experience in understanding the security requirements for Hadoop and integrating it with the Kerberos authentication infrastructure, including KDC server setup and management
Management and support of Hadoop Services including HDFS, Hive, Impala, and SPARK
Installing, Upgrading and Managing Hadoop Cluster on Cloudera
Troubleshot many cloud-related issues such as DataNodes going down, network failures, login issues, and missing data blocks
Worked as a Hadoop admin responsible for everything related to clusters totaling 100 nodes, ranging from POC (proof-of-concept) to production clusters on the Cloudera (CDH 5.5.2) distribution
Responsible for Cluster maintenance, Monitoring, commissioning and decommissioning Data nodes,
Troubleshooting, Manage and review data backups, Manage & review log files
Day-to-day responsibilities included solving developer issues, deploying code from one environment to another, providing access to new users, providing quick fixes to reduce impact, documenting the issues, and preventing them from recurring
Collaborated with application teams to install operating system and Hadoop updates, patches, and version upgrades
Monitored workload, job performance and capacity planning using Cloudera Manager
Involved in analyzing system failures, identifying root causes, and recommending courses of action
Interacted with Cloudera support, logged issues in the Cloudera portal, and fixed them per the recommendations
Imported logs from web servers with Flume to ingest the data into HDFS
Used Flume with a spool directory source to load data from the local file system to HDFS
Retrieved data from HDFS into relational databases with Sqoop
Parsed, cleansed, and mined useful, meaningful data in HDFS using MapReduce for further analysis (see the streaming-mapper sketch after this job's bullet list)
Fine-tuned Hive jobs for optimized performance
Scripted Hadoop package installation and configuration to support fully automated deployments
Involved in Chef infrastructure maintenance, including backups and security fixes on the Chef server
Deployed application updates using Jenkins
Installed, configured, and managed Jenkins
Triggered the client's SIT environment build remotely through Jenkins
Deployed and configured Git repositories with branching, forks, tagging, and notifications
Experienced and proficient deploying and administering GitHub
Deploy builds to production and work with the teams to identify and troubleshoot any issues
Worked on MongoDB database concepts such as locking, transactions, indexes, sharding, replication, and schema design
Consulted with the operations team on deploying, migrating data, monitoring, analyzing, and tuning
MongoDB applications
Reviewed selected issues through the SonarQube web interface
Developed a fully functional login page for the company's user facing website with complete UI and
validations
Installed, configured, and utilized AppDynamics (an application performance management tool) across the whole JBoss environment (Prod and Non-Prod)
Reviewed the OpenShift PaaS product architecture and suggested improvement features after conducting research on competitors' products
Migrated data source passwords to encrypted passwords using the Vault tool in all the JBoss application servers
Participated in migrations from JBoss 4 to WebLogic or from JBoss 4 to JBoss 6, and in the respective POCs
Responsible for upgrading SonarQube using the update center
Resolved tickets submitted by users and P1 issues, troubleshooting, documenting, and resolving the errors
Installed and configured Hive in the Hadoop cluster and helped business users/application teams fine-tune their HiveQL for optimized performance and efficient use of cluster resources
Conducted performance tuning of the Hadoop cluster and MapReduce jobs
Also tuned the real-time applications, applying best practices to fix design flaws
Implemented Oozie workflow for ETL Process for critical data feeds across the platform
Configured Ethernet bonding for all Nodes to double the network bandwidth
Implemented the Kerberos security authentication protocol for the existing cluster
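
Illustrative sketch (not project code): a hypothetical Hadoop Streaming mapper in Python for the MapReduce-based log parsing and cleansing mentioned above; it would run with hadoop-streaming alongside a reducer that sums the counts. The tab-separated log layout assumed here is invented.

import sys


def main():
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        # Skip malformed records (the cleansing step).
        if len(fields) < 3:
            continue
        ip, status, url = fields[0], fields[1], fields[2]
        # Keep only successful requests and emit a count per URL.
        if status == "200":
            print(f"{url}\t1")


if __name__ == "__main__":
    main()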

DATA ENGINEER
Fannie Mae - Washington 01/2018 - 12/2018
Developed highly maintainable Hadoop code and followed all best practices regarding coding
Verified new product development effort alignment with supportability goals and proposed Service Level
Agreement (SLA) parameters
Performed data cleaning on unstructured information using various Hadoop tools
Involved in Hive/SQL queries and performed Spark transformations using Spark RDDs and Python (PySpark)
Created a serverless data ingestion pipeline on AWS using Lambda functions (see the Lambda handler sketch after this job's bullet list)
Developed Apache Spark Applications by using Scala, Python, and Implemented Apache Spark data
processing module to handle data from various RDBMS and Streaming sources
Experience in developing and scheduling various Spark Streaming and batch jobs using Python (PySpark) and Scala (see the streaming sketch after this job's bullet list)
Developed Spark code using PySpark, applying various transformations and actions for faster data processing
Achieved high-throughput, scalable, fault-tolerant stream processing of live data streams using Apache
Spark Streaming
Used Spark stream processing with Scala to bring data into memory, created RDDs and DataFrames, and applied transformations and actions
Involved in using various Python libraries with Spark to create DataFrames and store them in Hive
Created Sqoop jobs and Hive queries for data ingestion from relational databases to analyze historical data
Experience in working with Elastic MapReduce (EMR) and setting up environments on Amazon EC2 instances
Knowledge of handling Hive queries using Spark SQL, which integrates with the Spark environment
Executed Hadoop/Spark jobs on AWS EMR using programs stored in S3 buckets
Knowledge of creating user-defined functions (UDFs) in Hive
Worked with different file formats, such as ORC, Avro, and Parquet, for Hive querying and processing based on business logic
Worked on Sequence files, RC files, Map side joins, bucketing, partitioning for Hive performance
enhancement and storage improvement
Implemented Hive UDFs to implement business logic and was responsible for performing extensive data validation using Hive
Involved in loading structured and semi-structured data into Spark clusters using Spark SQL and the DataFrames API
Involved in developing code and generating various DataFrames based on the business requirements, and created temporary tables in Hive
Utilized AWS CloudWatch to monitor environment instances for operational and performance metrics during load testing
Scripted Hadoop package installation and configuration to support fully automated deployments
Involved in Chef infrastructure maintenance, including backups and security fixes on the Chef server
Deployed application updates using Jenkins
Installed, configured, and managed Jenkins
Triggered the client's SIT environment build remotely through Jenkins
Deployed and configured Git repositories with branching, forks, tagging, and notifications
Experienced and proficient deploying and administering GitHub
Deploy builds to production and work with the teams to identify and troubleshoot any issues
Worked on MongoDB database concepts such as locking, transactions, indexes, sharding, replication, and schema design
Consulted with the operations team on deploying, migrating data, monitoring, analyzing, and tuning
MongoDB applications
Reviewed selected issues through the SonarQube web interface
Developed a fully functional login page for the company's user facing website with complete UI and
validations
Installed, configured, and utilized AppDynamics (an application performance management tool) across the whole JBoss environment (Prod and Non-Prod)
Resolved tickets submitted by users and P1 issues, troubleshooting, documenting, and resolving the errors
Installed and configured Hive in the Hadoop cluster and helped business users/application teams fine-tune their HiveQL for optimized performance and efficient use of resources in the cluster
Conducted performance tuning of the Hadoop cluster and MapReduce jobs
Also tuned the real-time applications, applying best practices to fix design flaws
Implemented Oozie workflow for ETL Process for critical data feeds across the platform
Configured Ethernet bonding for all Nodes to double the network bandwidth
Implemented the Kerberos security authentication protocol for the existing cluster
Built high availability for major production cluster and designed automatic failover control using
Zookeeper Failover Controller (ZKFC) and Quorum Journal nodes.
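
Illustrative sketch (not project code): a minimal PySpark Structured Streaming job of the kind of streaming work described above. The Kafka bootstrap server and topic name are hypothetical placeholders, and the Kafka source requires the spark-sql-kafka connector on the classpath.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-events").getOrCreate()

# Read a live stream of events from Kafka.
events = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker1:9092")
         .option("subscribe", "events")
         .load()
)

# Count events per key over the stream and print updates to the console.
counts = (
    events.select(F.col("key").cast("string"))
          .groupBy("key")
          .count()
)

query = (
    counts.writeStream
          .outputMode("complete")
          .format("console")
          .start()
)
query.awaitTermination()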
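
Illustrative sketch (not project code): a hypothetical AWS Lambda handler for the serverless ingestion pipeline mentioned above, triggered by an S3 upload event and copying the new object into a curated data-lake prefix. The bucket and prefix names are placeholders.

import urllib.parse

import boto3

s3 = boto3.client("s3")


def lambda_handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Land the raw object under a curated prefix in the data-lake bucket.
        s3.copy_object(
            Bucket="example-data-lake",
            Key=f"raw/{key}",
            CopySource={"Bucket": bucket, "Key": key},
        )
    return {"status": "ok"}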

EDUCATION

Bachelor of Science
Long Island University - NY
