anushakalla1507@gmail.com
Phone No: (929)456-3121
SENIOR DATA ENGINEER
SUMMARY:
TECHNICAL SKILLS:
● Big Data Tools: Hadoop Ecosystem: MapReduce, Spark 2.3, Airflow 1.10.8, NiFi 2, HBase 1.2, Hive 2.3, Pig 0.17, Sqoop 1.4, Kafka 1.0.1, Oozie 4.3, Hadoop 3.0
● Data Modeling Tools: Erwin Data Modeler, ER Studio v17
● Programming Languages: SQL, PL/SQL, and UNIX.
● Methodologies: RAD, JAD, System Development Life Cycle (SDLC), Agile
● Cloud Platform: AWS, Azure, Google Cloud.
● Cloud Management: Amazon Web Services (AWS) - EC2, EMR, S3, Redshift, Lambda, Athena
● Databases: Oracle 12c/11g, Teradata R15/R14.
● OLAP Tools: Tableau, SSAS, Business Objects, and Crystal Reports 9
● ETL/Data warehouse Tools: Informatica 9.6/9.1 and Tableau.
● Operating System: Windows, Unix, Sun Solaris
WORK EXPERIENCE:
Responsibilities:
● Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
● Extracted, transformed, and loaded data from various heterogeneous data sources and destinations using AWS Redshift.
● Created Tables, Stored Procedures, and extracted data using T-SQL for business users whenever required.
● Performed data analysis and design; created and maintained large, complex logical and physical data models and metadata repositories using ERWIN and MB MDR.
● Wrote shell scripts to trigger DataStage jobs.
● Assisted service developers in finding relevant content in the existing reference models.
● Loaded data from sources such as Access, Excel, CSV, Oracle, and flat files using connectors, tasks, and transformations provided by AWS Data Pipeline.
● Utilized Spark SQL API in PySpark to extract and load data and perform SQL queries.
● Developed PySpark scripts to encrypt raw data by applying hashing algorithms to client-specified columns.
● Responsible for design, development, and testing of the database; developed stored procedures, views, and triggers.
● Developed Python-based API (RESTful Web Service) to track revenue and perform revenue analysis.
● Compiled and validated data from all departments and presented it to the Director of Operations.
● Built a KPI calculator sheet and maintained it within SharePoint.
● Created Tableau reports with complex calculations and worked on Ad-hoc reporting using PowerBI.
● Created a data model that correlates all the metrics and gives valuable output.
● Tuned SQL queries to bring down run time by working on indexes and execution plans.
● Explored Spark to improve performance and optimize existing algorithms in Hadoop using SparkContext, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
● Integrated the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
● Performed data preprocessing and feature engineering for further predictive analytics using Python
Pandas.
● Developed and validated machine learning models including Ridge and Lasso regression for predicting
total amount of trade.
● Boosted the performance of regression models by applying polynomial transformation and feature selection, and used those methods to select stocks.
● Generated reports on predictive analytics using Python and Tableau, including visualizing model performance and prediction results.
● Implemented Copy activity and custom Azure Data Factory pipeline activities.
● Primarily involved in data migration using SQL, SQL Azure, Azure Storage, Azure Data Factory, SSIS, and PowerShell.
● Architected and implemented medium- to large-scale BI solutions on Azure using Azure Data Platform services
(Azure Data Lake, Data Factory, Data Lake Analytics, Stream Analytics, Azure SQL DW,
HDInsight/Databricks, NoSQL DB).
● Migrated on-premise data (Oracle/SQL Server/DB2/MongoDB) to Azure Data Lake Storage (ADLS) using Azure Data Factory (ADF V1/V2).
● Developed a detailed project plan and helped manage the data conversion migration from the legacy system to the target Snowflake database.
● Designed, developed, and tested dimensional data models using Star and Snowflake schema methodologies under the Kimball method.
● Implemented ad-hoc analysis solutions using Azure Data Lake Analytics/Store and HDInsight.
● Developed data pipelines using Spark, Hive, Pig, Python, Impala, and HBase to ingest customer data.
● Involved in converting Hive/SQL queries into Spark transformations using Spark RDDs, Python and Scala.
● Ensured deliverables (Daily, Weekly & Monthly MIS Reports) were prepared to satisfy the project requirements, cost, and schedule.
● Worked on direct query using PowerBI to compare legacy data with current data, and generated reports and dashboards.
● Designed SSIS Packages to extract, transfer, load (ETL) existing data into SQL Server from different
environments for the SSAS cubes (OLAP)
● Used SQL Server Reporting Services (SSRS) to create and format Cross-Tab, Conditional, Drill-down, Top N, Summary, Form, OLAP, sub-reports, ad-hoc reports, parameterized reports, interactive reports, and custom reports.
● Created action filters, parameters and calculated sets for preparing dashboards and worksheets using
PowerBI
● Developed visualizations and dashboards using PowerBI
● Stuck to the ANSI SQL language specification wherever possible, and provided context about similar functionality in other industry-standard engines (e.g., referencing PostgreSQL function documentation).
● Used ETL to implement Slowly Changing Dimension transformations to maintain historical data in the data warehouse.
● Performed ETL testing activities such as running jobs, extracting data from the database with the necessary queries, transforming it, and loading it into the data warehouse servers.
● Created dashboards for analyzing POS data using Power BI
Environment: MS SQL Server 2016, T-SQL, Oracle, Hive, Advanced Excel (creating formulas, pivot tables, HLOOKUP, VLOOKUP, Macros), Spark, MongoDB, SSAS, SSRS, OLAP, Python, ETL, Power BI, Tableau, Hive/Hadoop, Snowflake, AWS Data Pipeline, Cognos Report Studio 10.1
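The column-hashing work described above (PySpark scripts applying hashing algorithms to client-specified columns) can be sketched in plain Python. This is a minimal illustration only, assuming SHA-256 and hypothetical column names; the production job would use pyspark.sql.functions.sha2 on a DataFrame rather than this row-by-row loop:

```python
import hashlib

def hash_columns(rows, sensitive_cols):
    """Replace values in the named columns with SHA-256 hex digests.

    Plain-Python sketch of the PySpark column-hashing job; the column
    names passed in are hypothetical examples, not the client's.
    """
    hashed = []
    for row in rows:
        out = dict(row)
        for col in sensitive_cols:
            if col in out and out[col] is not None:
                # Deterministic one-way hash: same input always maps
                # to the same 64-character hex digest.
                out[col] = hashlib.sha256(str(out[col]).encode("utf-8")).hexdigest()
        hashed.append(out)
    return hashed

rows = [{"customer_name": "alice", "amount": 120.5}]
masked = hash_columns(rows, ["customer_name"])
```

Because the hash is deterministic, joins and group-bys on the masked column still work while the raw value is hidden.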
Responsibilities:
● Transformed business problems into Big Data solutions and defined Big Data strategy and roadmap. Installed, configured, and maintained data pipelines.
● Developed features, scenarios, and step definitions for BDD (Behavior Driven Development) and TDD (Test Driven Development) using Cucumber, Gherkin, and Ruby.
● Designed the business requirement collection approach based on the project scope and SDLC methodology.
● Created pipelines in ADF using Linked Services/Datasets/Pipelines to extract, transform, and load data in both directions between sources such as Azure SQL, Blob Storage, Azure SQL Data Warehouse, and the write-back tool.
● Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis. Worked with Data Governance and Data Quality teams to design various models and processes.
● Involved in all steps and the scope of the project's reference data approach to MDM; created a Data Dictionary and mapping from sources to the target in the MDM Data Model.
● Managed Azure Data Lakes (ADLS) and Data Lake Analytics and integrated them with other Azure services. Knowledge of U-SQL.
● Worked with various teams on a project to develop an analytics-based solution to specifically target customer subscribers.
● Created functions and assigned roles in AWS Lambda to run Python scripts, and used AWS Lambda with Java to perform event-driven processing. Created Lambda jobs and configured roles using the AWS/GCP CLI.
● Responsible for wide-ranging data ingestion using Sqoop and HDFS commands. Accumulated partitioned data in various storage formats such as text, JSON, and Parquet. Loaded data from the Linux file system into HDFS.
● Stored data files in Google Cloud Storage buckets on a daily basis. Used Dataproc and BigQuery to develop and maintain GCP cloud-based solutions.
● Started working with AWS/GCP storage, handling terabytes of data for customer BI reporting tools.
● Built a 12-node Hadoop cluster
● Installed and configured Hadoop ecosystem components
● Decommissioned and added nodes in the clusters for maintenance
● Monitored cluster health by setting up alerts using Nagios and Ganglia
● Added new users and groups of users per client requests
● Worked on tickets opened by users regarding various incidents and requests
● Created a Lambda deployment function and configured it to receive events from S3 buckets
● Wrote UNIX shell scripts to automate jobs and scheduled them as cron jobs using crontab
● Developed various Mappings with the collection of all Sources, Targets, and Transformations using
Informatica Designer
● Developed Mappings using Transformations like Expression, Filter, Joiner, and Lookups for better data massaging and to migrate clean and consistent data
● Used Apache Spark DataFrames, Spark SQL, and Spark MLlib extensively, developing and designing POCs using Scala, Spark SQL, and MLlib libraries.
● Ingested, transformed, and integrated structured data and delivered it to a scalable data warehouse platform, using traditional ETL (Extract, Transform, Load) tools and methodologies to collect data from various sources into a single data warehouse.
● Applied various machine learning algorithms and statistical modeling techniques, such as decision trees, text analytics, natural language processing (NLP), supervised and unsupervised learning, regression models, social network analysis, neural networks, deep learning, SVM, and clustering, to identify volume using the scikit-learn package in Python, R, and MATLAB. Collaborated with Data Engineers and Software Developers to develop experiments and deploy solutions to production.
● Created and published multiple dashboards and reports using Tableau Server, and worked on text analytics, Naive Bayes, sentiment analysis, creating word clouds, and retrieving data from Twitter and other social networking platforms.
● Worked on data that combined unstructured and structured data from multiple sources and automated cleaning using Python scripts.
● Tackled a highly imbalanced fraud dataset using undersampling with ensemble methods, oversampling, and cost-sensitive algorithms.
● Improved fraud prediction performance by using random forest and gradient boosting for feature selection with Python scikit-learn.
● Designed and developed architecture for data services ecosystem spanning Relational, NoSQL, and Big
Data technologies.
● Used SQL Server Integration Services (SSIS) to extract, transform, and load data into the target system from multiple sources
● Unit tested the code and provided feedback to the developers; performed unit testing of the application using NUnit.
● Designed both 3NF data models for OLTP systems and dimensional data models using star and snowflake
Schemas.
● Created and maintained SQL Server scheduled jobs, executing stored procedures for the purpose of
extracting data from Oracle into SQL Server. Extensively used Tableau for customer marketing data
visualization
● Optimized algorithms with stochastic gradient descent; fine-tuned algorithm parameters with manual tuning and automated tuning such as Bayesian optimization.
● Wrote research reports describing the experiments conducted, results, and findings, and made strategic recommendations to technology, product, and senior management. Worked closely with regulatory delivery leads to ensure robustness in prop trading control frameworks using Hadoop, Python Jupyter Notebook, Hive, and NoSQL.
● Wrote production-level machine learning classification models and ensemble classification models from scratch using Python and PySpark to predict binary values for certain attributes within a certain time frame.
● Performed all necessary day-to-day Git support for different projects; responsible for design and maintenance of the Git repositories and the access-control strategies.
Environment: Hadoop, Kafka, Spark, Sqoop, Spark SQL, Spark Streaming, Hive, Scala, Pig, NoSQL, Oozie, HBase, Data Lake, Python, AWS/GCP (Glue, Lambda, Step Functions, SQS, CodeBuild, CodePipeline, EventBridge, Athena), Unix, Linux Shell Scripting, Informatica PowerCenter
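The Lambda-receiving-S3-events pattern above can be sketched as a minimal handler. This is an illustrative stub, not the deployed function: it only pulls (bucket, key) pairs out of the standard S3 event notification payload, and the actual downstream processing is omitted:

```python
def lambda_handler(event, context):
    """Minimal sketch of an AWS Lambda handler triggered by S3 event
    notifications: extract (bucket, key) for each affected object.
    Real processing of the objects is deliberately left out."""
    objects = []
    for record in event.get("Records", []):
        # Standard S3 notification layout: Records[].s3.bucket.name / .object.key
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        key = s3.get("object", {}).get("key")
        if bucket and key:
            objects.append((bucket, key))
    return {"processed": objects}

# Trimmed example of an S3 put-event payload (bucket/key values are made up):
sample_event = {"Records": [{"s3": {"bucket": {"name": "raw-data"},
                                    "object": {"key": "2020/01/01/part-0000.json"}}}]}
result = lambda_handler(sample_event, None)
```

Keeping the handler a pure function of the event payload, as here, makes it easy to unit test without any AWS resources.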
Responsibilities:
Environment: Snowflake, AWS/GCP S3, GitHub, ServiceNow, HP Service Manager, EMR, Nebula, Kafka, Jira, Confluence, Shell/Perl Scripting, Python, AVRO, ZooKeeper, Teradata, SQL Server, Apache Spark, Sqoop.
Responsibilities:
● Involved in converting Hive/SQL queries into Spark transformations using Spark data frames, Scala and
Python.
● Researched and recommended a suitable technology stack for Hadoop migration considering the current enterprise architecture.
● Responsible for building scalable distributed data solutions using Hadoop.
● Experienced in loading and transforming large sets of structured, semi-structured, and unstructured data.
● Developed Spark jobs and Hive Jobs to summarize and transform data.
● Experienced in developing Spark scripts for data analysis in both Python and Scala.
● Wrote Scala scripts to make Spark Streaming work with Kafka as part of Spark-Kafka integration efforts.
● Built on-premise data pipelines using Kafka and Spark for real-time data analysis.
● Created reports in Tableau for visualization of the data sets created, and tested Spark SQL connectors.
● Implemented Hive complex UDF's to execute business logic with Hive Queries.
● Developed different kinds of custom filters and handled pre-defined filters on HBase data using the API.
● Implemented Spark using Scala, utilizing DataFrames and the Spark SQL API for faster data processing.
● Handled importing data from different data sources into HDFS using Sqoop, performed transformations using Hive, and loaded the data into HDFS.
● Exported result sets from Hive to MySQL using the Sqoop export tool for further processing.
● Collected and aggregated large amounts of log data and staged the data in HDFS for further analysis.
● Experienced in managing and reviewing Hadoop log files.
● Used Sqoop to transfer data between relational databases and Hadoop.
● Worked on HDFS to store and access huge datasets within Hadoop.
● Good hands-on experience with GitHub.
Environment: Cloudera Manager (CDH5), HDFS, Sqoop, Pig, Hive, Tableau, Python, Scala, Oozie, Kafka, Flume, MySQL, Java, Git.
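A Hive or Spark "complex UDF" of the kind mentioned above is, at its core, a row-level function that the engine applies per record. As a hedged sketch (the actual UDFs and their business logic are not described in this resume; the log format here is a hypothetical example tied to the log-aggregation work):

```python
def parse_log_line(line):
    """Row-level transform of the sort registered as a Hive/Spark UDF:
    split a space-delimited log line into timestamp, level, and message.
    The three-field format is an assumed example, not a real schema."""
    parts = line.split(" ", 2)
    if len(parts) < 3:
        return None  # malformed line; a real UDF might route these to a reject table
    timestamp, level, message = parts
    return {"ts": timestamp, "level": level.upper(), "msg": message}

parsed = parse_log_line("2021-06-01T10:00:00 error disk quota exceeded")
```

In PySpark this function could then be wrapped with pyspark.sql.functions.udf and applied to a column; keeping the logic in a plain function keeps it testable outside the cluster.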
Responsibilities:
● Collaborated with Business Analysts, SMEs across departments to gather business requirements, and
identify workable items for further development.
● Partnered with ETL developers to ensure that data was well cleaned and the data warehouse was up to date for reporting purposes, using Pig.
● Selected and generated data into CSV files, stored them in AWS/GCP S3 using AWS EC2, and then structured and stored the data in AWS/GCP Redshift.
● Performed simple statistical analysis for data profiling, such as cancel rate, variance, skew, and kurtosis of trades and runs of each stock every day, grouped by 1-, 5-, and 15-minute intervals.
● Used PySpark and Pandas to calculate the moving average and RSI score of the stocks and loaded them into the data warehouse.
● Explored Spark to improve performance and optimize existing algorithms in Hadoop using SparkContext, Spark SQL, PostgreSQL, DataFrames, OpenShift, Talend, and pair RDDs.
● Integrated the Hadoop cluster with the Spark engine to perform batch and GraphX operations.
● Developed complex SQL statements to extract data, and packaged/encrypted data for delivery to customers.
● Provided business intelligence analysis to decision-makers using an interactive OLAP tool
● Created T-SQL statements (SELECT, INSERT, UPDATE, DELETE) and stored procedures.
● Defined Data requirements and elements used in XML transactions.
● Created Informatica mappings using various Transformations like Joiner, Aggregate, Expression, Filter and
Update Strategy.
● Performed Tableau administration using Tableau admin commands.
● Involved in defining the source to target Data mappings, business rules and Data definitions.
● Ensured the compliance of the extracts to the Data Quality Center initiatives
● Performed metrics reporting, data mining, and trend analysis in a helpdesk environment using Access.
● Worked on SQL Server Integration Services (SSIS) to integrate and analyze data from multiple
heterogeneous information sources
● Performed data preprocessing and feature engineering for further predictive analytics using Python
Pandas.
● Developed and validated machine learning models including Ridge and Lasso regression for predicting
total amount of trade.
● Boosted the performance of regression models by applying polynomial transformation and feature
selection and used those methods to select stocks.
● Generated report on predictive analytics using Python and Tableau including visualizing model
performance and prediction results.
● Utilized Agile and Scrum methodology for team and project management.
● Used Git for version control with colleagues.
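The moving-average and RSI calculations above ran in PySpark/Pandas; the underlying arithmetic can be sketched in plain Python. The window lengths and the classic 14-period RSI formula are assumptions for illustration; the resume does not state the exact parameters used:

```python
def moving_average(prices, window):
    """Simple moving average over a fixed trailing window."""
    return [sum(prices[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(prices))]

def rsi(prices, period=14):
    """Relative Strength Index over the first `period` price changes:
    RSI = 100 - 100 / (1 + avg_gain / avg_loss)."""
    deltas = [b - a for a, b in zip(prices, prices[1:])]
    gains = [d for d in deltas[:period] if d > 0]
    losses = [-d for d in deltas[:period] if d < 0]
    avg_gain = sum(gains) / period
    avg_loss = sum(losses) / period
    if avg_loss == 0:
        return 100.0  # no down moves in the window: maximal RSI
    return 100.0 - 100.0 / (1.0 + avg_gain / avg_loss)
```

In PySpark the moving average would typically be a `Window` over an ordered price column; the scalar logic is the same.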