
Anusha K

anushakalla1507@gmail.com
Phone No: (929)456-3121
SENIOR DATA ENGINEER

SUMMARY:

● 7+ years of IT experience in a variety of industries working on Big Data technologies such as the Cloudera and Hortonworks distributions. Hadoop working environment includes Hadoop, Spark, MapReduce, Kafka, Hive, Ambari, Sqoop, HBase, and Impala.
● Fluent programming experience with Scala, Java, Python, SQL, T-SQL, R.
● Hands-on experience in developing and deploying enterprise-based applications using major Hadoop
ecosystem components like MapReduce, YARN, Hive, HBase, Flume, Sqoop, Spark MLlib, Spark GraphX,
Spark SQL, Kafka.
● Adept at configuring and installing Hadoop/Spark Ecosystem Components.
● Proficient with Spark Core, Spark SQL, Spark MLlib, Spark GraphX and Spark Streaming for processing and
transforming complex data using in-memory computing capabilities written in Scala.
● Worked with Spark to improve the efficiency of existing algorithms using Spark Context, Spark SQL, Spark MLlib, DataFrames, pair RDDs, and Spark on YARN.
● Experience integrating various data sources such as Oracle SE2, SQL Server, flat files, and unstructured files into a data warehouse.
● Able to use Sqoop to migrate data between RDBMS, NoSQL databases and HDFS.
● Experience in Extraction, Transformation and Loading (ETL) data from various sources into Data
Warehouses, as well as data processing like collecting, aggregating and moving data from various sources
using Apache Flume, Kafka, PowerBI and Microsoft SSIS.
● Hands-on experience with Hadoop architecture and various components such as Hadoop File System
HDFS, Job Tracker, Task Tracker, Name Node, Data Node and Hadoop MapReduce programming.
● Comprehensive experience in developing simple to complex MapReduce and Streaming jobs using Scala and Java for data cleansing, filtering, and data aggregation. Also possess detailed knowledge of the MapReduce framework.
● Used IDEs like Eclipse, IntelliJ IDE, PyCharm IDE, Notepad ++, and Visual Studio for development.
● Seasoned practice in Machine Learning algorithms and Predictive Modeling such as Linear Regression,
Logistic Regression, Naïve Bayes, Decision Tree, Random Forest, KNN, Neural Networks, and K-means
Clustering.
● Ample knowledge of data architecture including data ingestion pipeline design, Hadoop/Spark
architecture, data modeling, data mining, machine learning and advanced data processing.
● Experience working with NoSQL databases like Cassandra and HBase and developed real-time read/write
access to very large datasets via HBase.
● Experience in developing customized UDFs in Python to extend Hive and Pig Latin functionality (a brief sketch follows this summary).
● Expertise in designing complex mappings, in performance tuning, and in handling slowly changing dimension tables and fact tables.
● Extensively worked with the Teradata utilities FastExport and MultiLoad to export and load data to/from different source systems, including flat files.
● Experienced in building automated regression scripts for validation of ETL processes between multiple databases such as Oracle, SQL Server, Hive, and MongoDB using Python.
● Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
● Expertise in Python and Scala; developed user-defined functions (UDFs) for Hive and Pig using Python.
● Experience in developing MapReduce programs using Apache Hadoop for analyzing big data as per requirements.
● Hands-on with Spark MLlib utilities such as classification, regression, clustering, collaborative filtering, and dimensionality reduction.
● Experience in working with Flume and NiFi for loading log files into Hadoop.
● Proficient SQL experience in querying, data extraction/transformations and developing queries for a wide
range of applications.
● Capable of processing large sets (Gigabytes) of structured, semi-structured or unstructured data.
● Experience in analyzing data using HiveQL, Pig, HBase and custom MapReduce programs in Java 8.
● Experience working with GitHub/Git 2.12 source and version control systems.
● Working with AWS and GCP clouds, including GCP Cloud Storage, Dataproc, Dataflow, and BigQuery, as well as EMR, S3, Glacier, and EC2 instances with EMR clusters.
● Knowledge of the Cloudera platform and Apache Hadoop 0.20.
● Very good exposure to OLAP and OLTP.
● Experienced in managing on-shore and off-shore teams, including hiring, mentoring, and handling performance appraisals of team members.
● Project-management-level activities and audits such as CMMI, Lean, and project-level configuration audits (IPWC).
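Note: the custom Python UDFs for Hive mentioned above are typically wired in through Hive's TRANSFORM streaming interface. The following is a minimal illustrative sketch only, not actual project code; the script name, table, and column names (normalize_email.py, raw_users, user_id, email) are hypothetical.

#!/usr/bin/env python
# Minimal Hive "UDF" via the TRANSFORM streaming interface: Hive pipes
# tab-separated rows to stdin and reads tab-separated rows back from stdout.
# Hypothetical use case: normalize an email column during data cleansing.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        continue  # skip malformed rows
    user_id, email = fields[0], fields[1]
    print("\t".join([user_id, email.strip().lower()]))

# Example HiveQL invocation (hypothetical table/column names):
#   ADD FILE normalize_email.py;
#   SELECT TRANSFORM (user_id, email)
#     USING 'python normalize_email.py'
#     AS (user_id, email_clean)
#   FROM raw_users;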

TECHNICAL SKILLS:

● Big Data Tools (Hadoop Ecosystem): MapReduce, Spark 2.3, Airflow 1.10.8, NiFi 2, HBase 1.2, Hive 2.3, Pig 0.17, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3, Hadoop 3.0
● Data Modeling Tools: Erwin Data Modeler, ER Studio v17
● Programming Languages: SQL, PL/SQL, and UNIX shell scripting.
● Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile
● Cloud Platform: AWS, Azure, Google Cloud.
● Cloud Management: Amazon Web Services (AWS) - EC2, EMR, S3, Redshift, Lambda, Athena
● Databases: Oracle 12c/11g, Teradata R15/R14.
● OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9
● ETL/Data warehouse Tools: Informatica 9.6/9.1, and Tableau.
● Operating System: Windows, Unix, Sun Solaris

WORK EXPERIENCE:

DXC Technology, Topeka, KS Dec 2019 – Present
Sr. Data Engineer

Responsibilities:

● Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
● Using AWS Redshift, extracted, transformed, and loaded data from various heterogeneous data sources and destinations.
● Created Tables, Stored Procedures, and extracted data using T-SQL for business users whenever required.
● Performed data analysis and design; created and maintained large, complex logical and physical data models and metadata repositories using ERwin and MB MDR.
● Wrote shell scripts to trigger DataStage jobs.
● Assisted service developers in finding relevant content in the existing reference models.
● Worked with sources like Access, Excel, CSV, Oracle, and flat files using connectors, tasks, and transformations provided by AWS Data Pipeline.
● Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
● Developed a PySpark script to hash raw data on client-specified columns using hashing algorithms (see the sketch following this responsibilities list).
● Responsible for Design, Development, and testing of the database and Developed Stored Procedures,
Views, and Triggers
● Developed Python-based API (RESTful Web Service) to track revenue and perform revenue analysis.
● Compiled and validated data from all departments and presented it to the Director of Operations.
● Built a KPI calculator sheet and maintained it within SharePoint.
● Created Tableau reports with complex calculations and worked on Ad-hoc reporting using PowerBI.
● Created a data model that correlates all the metrics and yields valuable output.
● Worked on the tuning of SQL Queries to bring down run time by working on Indexes and Execution Plan.
● Explored Spark to improve the performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
● Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
● Performed data preprocessing and feature engineering for further predictive analytics using Python
Pandas.
● Developed and validated machine learning models including Ridge and Lasso regression for predicting
total amount of trade.
● Boosted the performance of regression models by applying polynomial transformation and feature selection, and used those methods to select stocks.
● Generated reports on predictive analytics using Python and Tableau, including visualizing model performance and prediction results.
● Implemented Copy activity, Custom Azure Data Factory Pipeline Activities
● Primarily involved in Data Migration using SQL, SQL Azure, Azure Storage, and Azure Data Factory, SSIS,
PowerShell.
● Architect & implement medium to large scale BI solutions on Azure using Azure Data Platform services
(Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW,
HDInsight/Databricks, NoSQL DB).
● Migrated on-premises data (Oracle/SQL Server/DB2/MongoDB) to Azure Data Lake Store (ADLS) using Azure Data Factory (ADF V1/V2).
● Developed a detailed project plan and helped manage the data conversion migration from the legacy system to the target Snowflake database.
● Design, develop, and test dimensional data models using Star and Snowflake schema methodologies
under the Kimball method.
● Implement ad-hoc analysis solutions using Azure Data Lake Analytics/Store, HDInsight
● Developed data pipelines using Spark, Hive, Pig, Python, Impala, and HBase to ingest customer data.
● Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
● Ensured deliverables (daily, weekly & monthly MIS reports) were prepared to satisfy the project requirements, cost, and schedule.
● Worked on DirectQuery using Power BI to compare legacy data with current data and generated reports and dashboards.
● Designed SSIS Packages to extract, transfer, load (ETL) existing data into SQL Server from different
environments for the SSAS cubes (OLAP)
● Used SQL Server Reporting Services (SSRS) to create and format Cross-Tab, Conditional, Drill-down, Top N, Summary, Form, OLAP, sub-reports, ad-hoc reports, parameterized reports, interactive reports, and custom reports.
● Created action filters, parameters and calculated sets for preparing dashboards and worksheets using
PowerBI
● Developed visualizations and dashboards using PowerBI
● Stuck to the ANSI SQL language specification wherever possible and provided context about similar functionality in other industry-standard engines (e.g., referencing PostgreSQL function documentation).
● Used ETL to implement Slowly Changing Dimension transformations to maintain historical data in the data warehouse.
● Performed ETL testing activities such as running jobs, extracting data from the database with the necessary queries, transforming it, and uploading it into the data warehouse servers.
● Created dashboards for analyzing POS data using Power BI
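The column-hashing work referenced above can be sketched roughly as follows. This is an illustrative outline only; the column list, the S3 paths, and the choice of SHA-256 are assumptions, not the client's actual specification.

# Minimal sketch: hash client-specified columns with PySpark before persisting.
# Column names and S3 paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hash-sensitive-columns").getOrCreate()

sensitive_cols = ["ssn", "phone_number", "email"]  # supplied by the client

df = spark.read.parquet("s3://raw-bucket/customers/")  # hypothetical source
for col_name in sensitive_cols:
    # sha2() is Spark SQL's built-in hash function; 256 selects SHA-256
    df = df.withColumn(col_name, F.sha2(F.col(col_name).cast("string"), 256))

df.write.mode("overwrite").parquet("s3://curated-bucket/customers_hashed/")

Hashing (one-way) rather than reversible encryption keeps the columns joinable across datasets while removing the clear-text values.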
Environment: MS SQL Server 2016, T-SQL, Oracle, Hive, Advanced Excel (creating formulas, pivot tables, HLOOKUP, VLOOKUP, Macros), Spark, MongoDB, SSAS, SSRS, OLAP, Python, ETL, Power BI, Tableau, Hive/Hadoop, Snowflake, AWS Data Pipeline, Cognos Report Studio 10.1

Equifax, Plano, TX July 2017 – Nov 2019
Sr. Data Engineer

Responsibilities:

● Transformed business problems into Big Data solutions and defined Big Data strategy and roadmap. Installed, configured, and maintained data pipelines.
● Developed the features, scenarios, and step definitions for BDD (Behavior Driven Development) and TDD (Test Driven Development) using Cucumber, Gherkin, and Ruby.
● Designing the business requirement collection approach based on the project scope and SDLC
methodology.
● Created pipelines in ADF using Linked Services/Datasets/Pipelines to extract, transform, and load data from different sources such as Azure SQL, Blob storage, and Azure SQL Data Warehouse, and to write data back.
● Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis. Worked with Data Governance and Data Quality teams to design various models and processes.
● Involved in all the steps and scope of the project reference data approach to MDM, have created a Data
Dictionary and Mapping from Sources to the Target in MDM Data Model.
● Experience managing Azure Data Lake Store (ADLS) and Data Lake Analytics, with an understanding of how to integrate them with other Azure services. Knowledge of U-SQL.
● Responsible for working with various teams on a project to develop analytics-based solution to target
customer subscribers specifically.
● Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java to perform event-driven processing. Created Lambda jobs and configured roles using the AWS/GCP CLI.
● Responsible for wide-ranging data ingestion using Sqoop and HDFS commands. Accumulated partitioned data in various storage formats like text, JSON, Parquet, etc. Involved in loading data from the Linux file system to HDFS.
● Stored data files in Google Cloud Storage buckets on a daily basis. Used Dataproc and BigQuery to develop and maintain GCP cloud-based solutions.
● Started working with AWS/GCP for storage and handling of terabytes of data for customer BI reporting tools.
● Built a 12-node Hadoop cluster.
● Installed and configured Hadoop ecosystem components.
● Decommissioning nodes and adding nodes in the clusters for maintenance
● Monitored cluster health by Setting up alerts using Nagios and Ganglia
● Adding new users and groups of users as per the requests from the client
● Working on tickets opened by users regarding various incidents, requests
● Created a Lambda deployment function and configured it to receive events from S3 buckets (see the sketch following this responsibilities list).
● Writing UNIX shell scripts to automate the jobs and scheduling cron jobs for job automation using
commands with Crontab.
● Developed various Mappings with the collection of all Sources, Targets, and Transformations using
Informatica Designer
● Developed Mappings using Transformations like Expression, Filter, Joiner and Lookups for better data
messaging and to migrate clean and consistent data
● Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, developing and designing POCs using Scala, Spark SQL, and MLlib libraries.
● Data integration ingests, transforms, and integrates structured data and delivers it to a scalable data warehouse platform, using traditional ETL (Extract, Transform, Load) tools and methodologies to collect data from various sources into a single data warehouse.
● Applied various machine learning algorithms and statistical modeling like decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, and clustering to identify volume, using the scikit-learn package in Python, R, and MATLAB. Collaborated with Data Engineers and Software Developers to develop experiments and deploy solutions to production.
● Create and publish multiple dashboards and reports using Tableau server and work on Text Analytics,
Naive Bayes, Sentiment analysis, creating word clouds and retrieving data from Twitter and other social
networking platforms.
● Work on data that was a combination of unstructured and structured data from multiple sources and
automate the cleaning using Python scripts.
● Tackled a highly imbalanced fraud dataset using undersampling with ensemble methods, oversampling, and cost-sensitive algorithms.
● Improve fraud prediction performance by using random forest and gradient boosting for feature selection
with Python Scikit-learn.
● Designed and developed architecture for data services ecosystem spanning Relational, NoSQL, and Big
Data technologies.
● Used SQL Server Integrations Services (SSIS) for extraction, transformation, and loading data into target
system from multiple sources
● Involved in Unit Testing the code and provided the feedback to the developers. Performed Unit Testing of
the application by using NUnit.
● Designed both 3NF data models for OLTP systems and dimensional data models using star and snowflake
Schemas.
● Created and maintained SQL Server scheduled jobs, executing stored procedures for the purpose of
extracting data from Oracle into SQL Server. Extensively used Tableau for customer marketing data
visualization
● Optimized the algorithm with stochastic gradient descent; fine-tuned the algorithm parameters with manual tuning and automated tuning such as Bayesian Optimization.
● Write research reports describing the experiment conducted, results, and findings and make strategic
recommendations to technology, product, and senior management. Worked closely with regulatory
delivery leads to ensure robustness in prop trading control frameworks using Hadoop, Python Jupyter
Notebook, Hive and NoSql.
● Wrote production level Machine Learning classification models and ensemble classification models from
scratch using Python and PySpark to predict binary values for certain attributes in certain time frame.
● Performed all necessary day-to-day GIT support for different projects, Responsible for design and
maintenance of the GIT Repositories, and the access control strategies.
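For the Lambda deployment function receiving S3 events noted above, a handler along the following lines is a reasonable minimal sketch; the copy-to-"processed/"-prefix action and all bucket and prefix names are hypothetical illustrations, not the production logic.

# Minimal sketch of a Python AWS Lambda handler for S3 object-created events.
# The downstream action (copying the object under a "processed/" prefix) is
# purely illustrative; bucket and prefix names are hypothetical.
import urllib.parse

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        s3.copy_object(
            Bucket=bucket,
            CopySource={"Bucket": bucket, "Key": key},
            Key="processed/" + key,
        )
    return {"status": "ok"}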

Environment: Hadoop, Kafka, Spark, Sqoop, Spark SQL, Spark Streaming, Hive, Scala, Pig, NoSQL, Oozie, HBase, Data Lake, Python, AWS/GCP (Glue, Lambda, Step Functions, SQS, CodeBuild, CodePipeline, EventBridge, Athena), Unix, Linux Shell Scripting, Informatica PowerCenter

American Century Investments, Kansas City, MO Nov 2015 – Jun 2017
Big Data Engineer

Responsibilities:

● Migrating data from FS to Snowflake within the organization
● Imported Legacy data from SQL Server and Teradata into Amazon S3.
● Created consumption views on top of metrics to reduce the running time for complex queries.
● Exported Data into Snowflake by creating Staging Tables to load Data of different files from Amazon S3.
● Compared the data in a leaf-level process from various databases when data transformation or data loading took place; analyzed the data quality when these types of loads were done (looking for any data loss or data corruption).
● As a part of data migration, wrote many SQL scripts for mismatches of data and worked on loading the history data from Teradata SQL to Snowflake.
● Developed SQL scripts to upload, retrieve, manipulate, and handle sensitive data (National Provider Identifier data, i.e., name, address, SSN, phone number) in Teradata, SQL Server Management Studio, and Snowflake databases for the project.
● Worked on retrieving data from FS to S3 using Spark commands.
● Built S3 buckets, managed policies for them, and used S3 and Glacier for storage and backup on AWS/GCP.
● Created Metric tables, End user views in Snowflake to feed data for Tableau refresh.
● Generated Custom SQL to verify the dependency for the daily, Weekly, Monthly jobs.
● Using Nebula Metadata, registered Business and Technical Datasets for corresponding SQL scripts
● Experienced in working with the Spark ecosystem using Spark SQL and Scala queries on different formats like text files and CSV files.
● Developed Spark code and Spark SQL/Streaming for faster testing and processing of data.
● Closely involved in scheduling Daily, Monthly jobs with Precondition/Postcondition based on the
requirement.
● Monitor the Daily, Weekly, Monthly jobs and provide support in case of failures/issues.
● Experience in Developing Spark applications using Spark - SQL in Databricks for data extraction,
transformation, and aggregation from multiple file formats for analyzing & transforming the data to
uncover insights into the customer usage patterns.
● Worked on analyzing Hadoop Cluster and different big data analytic tools including Pig, Hive.
● Working experience with data streaming process with Kafka, Apache Spark, Hive.
● Worked with various HDFS file formats like Avro, SequenceFile, NiFi, and JSON, and various compression formats like Snappy and bzip2.
● Used Spark Streaming APIs to perform the necessary transformations and actions on data received from Kafka (see the sketch following this responsibilities list).
● Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
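The Kafka-to-Spark streaming work above can be sketched with Spark Structured Streaming as below. Broker, topic, and sink paths are hypothetical, the spark-sql-kafka connector package is assumed to be on the classpath, and the original jobs may instead have used the older DStream API.

# Minimal sketch: read a Kafka topic with Spark Structured Streaming, apply a
# simple transformation, and append the result to Parquet. All names are
# hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-stream").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")  # hypothetical broker
    .option("subscribe", "trades")                       # hypothetical topic
    .option("startingOffsets", "latest")
    .load()
)

# Kafka delivers key/value as binary; cast the value to string for processing
parsed = events.select(
    F.col("value").cast("string").alias("payload"),
    F.col("timestamp"),
)

query = (
    parsed.writeStream.format("parquet")
    .option("path", "s3://curated/trades/")              # hypothetical sink
    .option("checkpointLocation", "s3://curated/_chk/trades/")
    .outputMode("append")
    .start()
)
query.awaitTermination()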

Environment: Snowflake, AWS/GCP S3, GitHub, Service Now, HP Service Manager, EMR, Nebula, Kafka, Jira, Confluence, Shell/Perl Scripting, Python, AVRO, ZooKeeper, Teradata, SQL Server, Apache Spark, Sqoop.

Global Logical Technologies, Hyderabad, India Aug 2014 – Oct 2015
Data & Reporting Analyst

Responsibilities:

● Involved in converting Hive/SQL queries into Spark transformations using Spark data frames, Scala and
Python.
● Research and recommend suitable technology stack for Hadoop migration considering current enterprise
architecture.
● Responsible for building scalable distributed data solutions using Hadoop.
● Experienced in loading and transforming of large sets of structured, semi-structured and unstructured
data.
● Developed Spark jobs and Hive Jobs to summarize and transform data.
● Experienced in developing Spark scripts for data analysis in both Python and Scala.
● Wrote Scala scripts to make Spark Streaming work with Kafka as part of Spark-Kafka integration efforts.
● Built on-premises data pipelines using Kafka and Spark for real-time data analysis.
● Created reports in TABLEAU for visualization of the data sets created and tested Spark SQL connectors.
● Implemented complex Hive UDFs to execute business logic with Hive queries.
● Developed different kinds of custom filters and handled pre-defined filters on HBase data using the API.
● Implemented Spark using Scala, utilizing DataFrames and the Spark SQL API for faster processing of data.
● Handled importing of data from different data sources into HDFS using Sqoop, performed transformations using Hive, and then loaded the data into HDFS.
● Exporting of a result set from HIVE to MySQL using Sqoop export tool for further processing.
● Collecting and aggregating large amounts of log data and staging data in HDFS for further analysis.
● Experience in managing and reviewing Hadoop Log files.
● Used Sqoop to transfer data between relational databases and Hadoop.
● Worked on HDFS to store and access huge datasets within Hadoop.
● Good hands-on experience with GitHub.

Environment: Cloudera Manager (CDH5), HDFS, Sqoop, Pig, Hive, Tableau, Python, Scala, Oozie, Kafka, Flume, MySQL, Java, Git.

GVK Bio Sciences, Hyderabad, India Oct 2013 – July 2014
Data Analyst

Responsibilities:

● Collaborated with Business Analysts, SMEs across departments to gather business requirements, and
identify workable items for further development.
● Partnered with ETL developers to ensure that data was well cleaned and the data warehouse was up-to-date for reporting purposes, using Pig.
● Selected and generated data into CSV files, stored them in AWS/GCP S3 using AWS EC2, and then structured and stored the data in AWS/GCP Redshift.
● Performed simple statistical analysis for data profiling, such as cancel rate, variance, skew, and kurtosis of trades and runs of each stock every day, grouped by 1-minute, 5-minute, and 15-minute intervals.
● Used PySpark and Pandas to calculate the moving average and RSI score of the stocks and loaded them into the data warehouse (see the sketch following this responsibilities list).
● Explored Spark to improve the performance and optimization of the existing algorithms in Hadoop using Spark Context, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
● Involved in integrating the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
● Developed complex SQL statements to extract the Data and packaging/encrypting Data for delivery to
customers.
● Provided business intelligence analysis to decision-makers using an interactive OLAP tool
● Created T-SQL statements (select, insert, update, delete) and stored procedures.
● Defined Data requirements and elements used in XML transactions.
● Created Informatica mappings using various Transformations like Joiner, Aggregate, Expression, Filter and
Update Strategy.
● Performed Tableau administration using Tableau admin commands.
● Involved in defining the source to target Data mappings, business rules and Data definitions.
● Ensured the compliance of the extracts to the Data Quality Center initiatives
● Produced metrics reporting, data mining, and trend analysis in a helpdesk environment using Access.
● Worked on SQL Server Integration Services (SSIS) to integrate and analyze data from multiple
heterogeneous information sources
● Performed data preprocessing and feature engineering for further predictive analytics using Python
Pandas.
● Developed and validated machine learning models including Ridge and Lasso regression for predicting
total amount of trade.
● Boosted the performance of regression models by applying polynomial transformation and feature
selection and used those methods to select stocks.
● Generated report on predictive analytics using Python and Tableau including visualizing model
performance and prediction results.
● Utilized Agile and Scrum methodology for team and project management.
● Used Git for version control with colleagues.
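The moving-average and RSI calculation mentioned above follows the standard formulas; a minimal Pandas sketch is shown below, assuming a hypothetical DataFrame with a "close" price column and the conventional 14-period window rather than the project's actual data layout.

# Minimal sketch: simple moving average and 14-period RSI with Pandas.
# The "close" column and window size are assumptions, not project specifics.
import pandas as pd

def add_indicators(prices: pd.DataFrame, window: int = 14) -> pd.DataFrame:
    out = prices.copy()
    out["sma"] = out["close"].rolling(window).mean()

    delta = out["close"].diff()
    gain = delta.clip(lower=0).rolling(window).mean()
    loss = (-delta.clip(upper=0)).rolling(window).mean()
    out["rsi"] = 100 - 100 / (1 + gain / loss)
    return out

# Usage with toy data:
# df = pd.DataFrame({"close": [10.0, 10.5, 10.2, 10.8, 11.0]})
# df = add_indicators(df)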

Environment: Spark, AWS/GCP Redshift, Python, Tableau, Informatica, Pandas, Pig, PySpark, SQL Server, T-SQL, XML, Git.
