management, monitoring, debugging, and performance tuning, continuous builds and continuous deployments)
Excellent analytical, problem-solving, communication, and interpersonal skills, with the ability to interact with individuals at all levels and to work both as part of a team and independently.
Around 2 years of banking-industry experience with the big data Hadoop framework and related ecosystem components such as HDFS, MapReduce, Hive, Pig, HBase, Sqoop, Flume, ZooKeeper, Oozie, Scala, and Spark.
Implemented Hadoop-based enterprise data warehouses and integrated Hadoop with existing Enterprise Data Warehouse (EDW) systems.
Wrote Pig scripts to reduce job execution time.
Experienced with Sqoop for importing data from an RDBMS into Hadoop and exporting it back.
Created HBase tables and stored and retrieved data from them.
Experienced in writing Hive UDFs and Pig UDFs based on requirements.
Developed complex Pig scripts and Hive queries.
Experience in installing, configuring, and administering Hadoop clusters of major distributions such as Apache Hadoop and Cloudera.
Good knowledge of Hadoop Streaming with Python.
Good knowledge of big data concepts.
Troubleshot MapReduce jobs, Pig Latin scripts, and Hive queries.
Experience in installing, configuring, supporting, and monitoring Hadoop clusters using the Cloudera distribution.
Around 8 years of professional experience in IT, including 3 years of hands-on experience with big data and Hadoop ecosystem components.
In-depth knowledge of Hadoop architecture and Hadoop daemons such as NameNode, Secondary NameNode, DataNode, JobTracker, and TaskTracker.
Experience in writing MapReduce programs with Apache Hadoop for analyzing big data.
Hands-on experience in writing ad hoc queries for moving data from HDFS into Hive and analyzing it using HiveQL.
Experience in importing and exporting data between relational database systems and HDFS using Sqoop.
Experience in writing Hadoop jobs for analyzing data using Pig Latin.
Good knowledge of analyzing data in HBase using Hive and Pig.
Working knowledge of NoSQL databases such as HBase and Cassandra.
Good knowledge of Amazon AWS services such as EMR and EC2, which provide fast and efficient processing of big data.
Experienced in using the Hadoop packages for R known as RHadoop.
Experience in launching EC2 instances for Amazon EMR from the AWS Console.
Extended Hive and Pig core functionality by writing custom UDFs, UDAFs, and UDTFs.
Experience in administrative tasks such as installing Hadoop and ecosystem components like Hive and Pig in distributed mode.
Knowledge of Hadoop security requirements and of integrating clusters with Kerberos authentication and authorization infrastructure.
Experience in using Apache Flume for collecting, aggregating, and moving large volumes of data from application servers.
Passionate about working in big data and analytics environments.
Experience in using ZooKeeper and Oozie operational services for cluster coordination and workflow scheduling.
Working knowledge of configuration and monitoring tools such as Ganglia and Nagios.
Knowledge of reporting tools such as Tableau, used for analytics on data in the cloud.
Extensive experience with SQL, PL/SQL, shell scripting, and database concepts.
Experience in developing applications using Java & J2EE technologies.
Implemented the project using components such as HDFS, Sqoop, Hive, MapReduce, MongoDB, and Oozie.
Installed, configured, and maintained Apache Hadoop clusters for application development, along with Hadoop tools such as Hive, Pig, HBase, ZooKeeper, and Sqoop.
Wrote shell scripts to monitor the health of Hadoop daemon services and respond to warning or failure conditions.
Managed and scheduled jobs on a Hadoop cluster.
Deployed Hadoop clusters in standalone, pseudo-distributed, and fully distributed modes.
Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
Developed Pig UDFs to pre-process the data for analysis.
Developed Hive queries for the analysts.
Developed workflows in Oozie to automate the tasks of loading data into HDFS and pre-processing it with Pig.
Provided cluster coordination services through ZooKeeper.
Collected log data from web servers and integrated it into HDFS using Flume.
Implemented the Fair Scheduler on the JobTracker to share cluster resources among the users' MapReduce jobs.
Managed and reviewed Hadoop log files.
Proactively monitored systems and services; worked on architecture design and implementation of Hadoop deployment, configuration management, backup, and disaster recovery systems and procedures.
Analyzed system failures, identified root causes, and recommended courses of action.
Worked with the systems engineering team to plan and deploy new Hadoop environments and expand existing Hadoop clusters.
Monitored multiple Hadoop cluster environments using Ganglia and Nagios; monitored workload, job performance, and capacity planning using Cloudera Manager.
Installed and configured Flume, Hive, Pig, Sqoop and Oozie on the Hadoop cluster.
Used Flume to collect, aggregate, and store web log data from different sources such as web servers and mobile and network devices, and pushed it to HDFS.
Analyzed the web log data using HiveQL to extract the number of unique visitors per day, page views, visit duration, and the most purchased product on the website (a HiveQL sketch follows this list).
Exported the analyzed data to relational databases using Sqoop, for visualization and report generation for the BI team.
Integrated Oozie with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java MapReduce, Pig, Hive, and Sqoop) as well as system-specific jobs (such as Java programs and shell scripts).
Involved in installing and configuring Kerberos for the authentication of users and Hadoop daemons.
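A minimal HiveQL sketch of the kind of web log analysis described above (unique visitors and page views per day); the weblogs table and its columns are hypothetical placeholders, not taken from the original project.

    -- Hypothetical web log table; column names are illustrative.
    SELECT to_date(visit_ts)          AS visit_day,
           COUNT(DISTINCT visitor_id) AS unique_visitors,
           COUNT(*)                   AS page_views
    FROM weblogs
    GROUP BY to_date(visit_ts);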
Responsibilities
Processed large sets of structured, semi-structured, and unstructured data and supported the systems application architecture.
Developed MapReduce programs to parse the raw data, populate staging tables, and store the refined data in partitioned tables in the EDW.
Implemented requirements using big data ecosystem components from Cloudera CDH 5 (Hadoop, MapReduce, HDFS, Hive, Pig, Sqoop, Oozie, and Flume).
Created Hive queries that helped business users spot emerging trends by comparing fresh data with EDW reference tables and historical metrics.
Used Oozie to automate data loading into HDFS and Pig to pre-process the data (a workflow sketch follows this list).
Assessed business rules, collaborated with stakeholders, and performed source-to-target data mapping, design, and review.
Managed and reviewed Hadoop log files.
Provided assistance in troubleshooting and resolving problems relating to Hadoop jobs.
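A minimal sketch of an Oozie workflow in the spirit of the bullet above: a single Pig pre-processing step over data already landed in HDFS. The workflow name, paths, and the cleanup.pig script are hypothetical.

    <workflow-app name="load-and-preprocess" xmlns="uri:oozie:workflow:0.5">
        <start to="preprocess"/>
        <action name="preprocess">
            <pig>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <script>cleanup.pig</script>
                <param>INPUT=${nameNode}/data/raw</param>
                <param>OUTPUT=${nameNode}/data/clean</param>
            </pig>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Pig step failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
        </kill>
        <end name="end"/>
    </workflow-app>

A coordinator would typically trigger this workflow on a schedule or on data availability.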
Responsibilities
Worked on a live 30-node Hadoop cluster running the Hortonworks Data Platform.
Worked with highly unstructured and semi-structured data of 90 TB in size (270 TB with a replication factor of 3).
Extracted data from various logs using Flume.
Created and ran Sqoop jobs with incremental loads to populate Hive external tables.
Wrote extensive Pig scripts to transform raw data from several data sources into baseline data.
Applied a strong understanding of Hive partitioning and bucketing concepts, and designed both managed and external Hive tables to optimize performance.
Developed UDFs in Java as needed for use in Pig and Hive queries (a sketch follows this list).
Developed Oozie workflows for scheduling and orchestrating the ETL process, and supported users and the application support team whenever required.
Provided assistance in troubleshooting and resolving problems relating to Hadoop jobs.
Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
o Wrote MapReduce jobs using the Java API.
o Developed scripts and batch jobs to schedule various Hadoop programs.
o Installed and maintained Apache Hadoop clusters for application development, as well as Hadoop tools like Hive, Pig, HBase, and Sqoop.
o Installed and configured Pig and wrote Pig Latin scripts.
o Developed Pig UDFs to pre-process the data for analysis.
o Developed Hive queries for the analysts.
o Wrote Hive queries for data analysis to meet the business requirements.
o Developed workflows in Oozie to automate the tasks of loading data into HDFS and pre-processing it with Pig.
Implemented the Fair Scheduler on the JobTracker to share cluster resources among the users' MapReduce jobs.
Took part in monitoring, troubleshooting and managing Hadoop log files.
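A minimal sketch of a Java UDF of the kind mentioned above, written against the classic Hive UDF API; the class name and its trim/lower-case behavior are illustrative assumptions, not the original project's logic.

    // Hypothetical Hive UDF: trims and lower-cases a string column.
    import org.apache.hadoop.hive.ql.exec.UDF;
    import org.apache.hadoop.io.Text;

    public class NormalizeText extends UDF {
        public Text evaluate(Text input) {
            if (input == null) {
                return null;
            }
            return new Text(input.toString().trim().toLowerCase());
        }
    }

In Hive it would be registered with ADD JAR and CREATE TEMPORARY FUNCTION normalize_text AS 'NormalizeText' before use in queries.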
Responsibilities:
Worked with business teams and created Hive queries for ad hoc access.
Continuously monitored and managed the Hadoop cluster through Cloudera Manager.
Involved in reviews of functional and non-functional requirements.
Responsible for managing data coming from different sources.
Installed and configured Hadoop ecosystem components such as HBase, Flume, Pig, and Sqoop.
Loaded daily data from websites into the Hadoop cluster using Flume (a sample agent configuration follows this list).
Involved in loading data from the UNIX file system to HDFS.
Created Hive tables and worked on them using HiveQL.
Created complex Hive tables and executed complex Hive queries on the Hive warehouse.
Wrote MapReduce code to convert unstructured data into semi-structured data.
Developed programs in Spark for faster data processing than standard MapReduce programs.
o Used Pig for extraction, transformation, and loading of semi-structured data.
o Installed and configured Hive and wrote Hive UDFs.
o Developed Hive queries for the analysts.
o Developed workflows in Oozie to automate the tasks of loading data into HDFS and pre-processing it with Pig.
o Provided cluster coordination services through ZooKeeper.
o Collected log data from web servers and integrated it into HDFS using Flume.
o Created Hive tables and worked on them using HiveQL.
Worked on Hive to expose data for further analysis and to transform files from different analytical formats to text files.
o Designed and implemented MapReduce jobs to support distributed data processing.
o Supported MapReduce programs running on the cluster.
o Involved in HDFS maintenance and loading of structured and unstructured data.
o Wrote MapReduce jobs using the Java API.
o Designed NoSQL schemas in HBase.
o Wrote shell scripts to monitor the health of Hadoop daemon services and respond to warning or failure conditions.
o Performed Hadoop cluster tasks such as adding and removing nodes without any effect on running jobs and data.
o Developed Pig UDFs to pre-process the data for analysis.
o Experienced in Agile methodologies and the Scrum process.
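A minimal sketch of a Flume agent configuration for the kind of website-log-to-HDFS flow described above; the agent name, log path, and HDFS path are hypothetical.

    # Hypothetical Flume agent: tails a web server log and writes to HDFS.
    agent.sources  = weblog
    agent.channels = mem
    agent.sinks    = hdfs-sink

    agent.sources.weblog.type = exec
    agent.sources.weblog.command = tail -F /var/log/httpd/access_log
    agent.sources.weblog.channels = mem

    agent.channels.mem.type = memory
    agent.channels.mem.capacity = 10000

    agent.sinks.hdfs-sink.type = hdfs
    agent.sinks.hdfs-sink.channel = mem
    agent.sinks.hdfs-sink.hdfs.path = /data/weblogs/%Y-%m-%d
    agent.sinks.hdfs-sink.hdfs.fileType = DataStream
    agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true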
Responsibilities:
o Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java for data cleansing and preprocessing.
o Evaluated business requirements and prepared detailed specifications, following project guidelines, for the programs to be developed.
o Responsible for building scalable distributed data solutions using Hadoop.
o Analyzed large data sets to determine the optimal way to aggregate and report on them.
o Handled importing of data from various data sources, performed transformations using Hive and MapReduce, and loaded the data into HDFS.
o Imported and exported data into HDFS using Sqoop.
o Wrote MapReduce code to turn unstructured data into semi-structured data and loaded it into Hive tables.
o Responsible for cluster maintenance, adding and removing cluster nodes, cluster monitoring, and troubleshooting.
o Worked extensively on creating MapReduce jobs to power data for search and aggregation.
o Developed programs in Spark for faster data processing than standard MapReduce programs.
o Worked extensively with Sqoop for importing metadata from Oracle.
o Extensively used Pig for data cleansing.
o Created partitioned tables in Hive (a DDL sketch follows this list).
o Managed and reviewed Hadoop log files.
o Created Hive tables, loaded them with data, and wrote Hive queries that run internally as MapReduce jobs.
o Used Hive to analyze partitioned and bucketed data and to compute various metrics for reporting.
o Installed and configured Pig and wrote Pig Latin scripts.
o Developed Pig Latin scripts to extract data from web server output files and load it into HDFS.
o Created HBase tables to store data in various formats coming from different portfolios.
o Developed MapReduce jobs to automate the transfer of data out of HBase.
o Used SVN and TortoiseSVN version control tools for code management (check-ins, check-outs, and synchronization).
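A minimal HiveQL sketch of a partitioned, bucketed table of the kind described above; the table and column names are hypothetical placeholders.

    -- Hypothetical partitioned, bucketed Hive table.
    CREATE TABLE transactions (
        txn_id  BIGINT,
        account STRING,
        amount  DOUBLE
    )
    PARTITIONED BY (txn_date STRING)
    CLUSTERED BY (account) INTO 32 BUCKETS
    STORED AS ORC;

    -- Dynamic-partition load from a staging table.
    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;
    INSERT OVERWRITE TABLE transactions PARTITION (txn_date)
    SELECT txn_id, account, amount, txn_date FROM staging_transactions;

Partitioning prunes scans by date, while bucketing on the account column supports efficient sampling and bucketed joins.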
BIG DATA EXPERIENCE
o Coding experience in Spark using Scala.
o Experience in Hadoop architecture and MapReduce programming.
o Experience in data analysis using Hive and Impala.
o Experience in ETL using Sqoop, Flume, Hive, and HDFS.
o Experience in data retrieval optimization, pre-processing, joins, and filtering patterns.
o Production ETL experience with Oracle, MySQL, DB2, and legacy systems.
o Experience in HiveQL and Pig Latin for creating reports and writing scripts for business use cases.
o Experience in data modeling in NoSQL databases (Cassandra, HBase).
o Experience in preprocessing of weblogs.
o Experience in setting up Cloudera CDH3 and CDH4 Hadoop clusters.
o Proficient in installation and configuration of the Cloudera components HDFS, Sqoop, Flume, Oozie, Hive, and Spark.
Responsibilities:
o Configured a Spark Streaming application to stream syslogs and various application logs from 100+ nodes for monitoring and alerting, as well as to feed dynamic dashboards (a sketch follows this list).
o Migrated traditional MapReduce jobs to Spark jobs.
o Worked on Spark SQL and Spark Streaming.
o Imported and exported files to HDFS, Hive, and Impala.
o Processed results were consumed by Hive, scheduling applications, and various other BI reports through multi-dimensional data warehouse models.
o Ran ad hoc queries through Pig Latin, Hive, or Java MapReduce.
o Wrote Pig scripts and executed them using the Grunt shell.
o Performed big data analysis using Pig, Hive, and user-defined functions (UDFs).
o Performed joins, group-bys, and other operations in MapReduce using Java or Pig Latin.
o Scheduled all Hadoop/Hive/Sqoop/HBase jobs using Oozie.
o Collected log data from the web servers and integrated it into HDFS using Flume.
o Used Java setter and getter methods in the reducer to set and get values to and from the job JAR.
o Processed the output from Pig and Hive and formatted it before sending it to the Hadoop output files.
o Directed a massive cloud migration to Amazon EC2/AWS.
o Designed and coded efficient, reliable, and scalable AWS infrastructure.
o Used Hive table definitions to map output files to tables.
o Involved in managing and reviewing Hadoop log files.
o Developed scripts and batch jobs to schedule various Hadoop programs.
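A minimal Scala sketch of a Spark Streaming log-monitoring job in the spirit of the first bullet above; the host, port, and ERROR-filter logic are illustrative assumptions (a real deployment would more likely use a Flume or Kafka source).

    // Hypothetical Spark Streaming job: counts ERROR lines per 10-second batch.
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object LogMonitor {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("LogMonitor")
        val ssc  = new StreamingContext(conf, Seconds(10))

        // Logs forwarded to a socket on a hypothetical host/port.
        val lines  = ssc.socketTextStream("loghost", 9999)
        val errors = lines.filter(_.contains("ERROR"))
        errors.count().print()

        ssc.start()
        ssc.awaitTermination()
      }
    }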
Responsibilities:
o Worked on multiple projects spanning Hadoop cluster architecture, installation, configuration, and management.
o Installed and managed multiple Hadoop clusters: production, stage, and development.
o Installed and managed a production cluster of 150 nodes with 4+ PB of storage.
o Managed multiple Hadoop clusters spanning up to 250 nodes.
o Analyzed system failures on production and lab clusters, identified root causes, and recommended courses of action.
o Designed cluster tests before and after upgrades to validate cluster status.
o Performed regular maintenance, commissioning and decommissioning nodes as disk failures occurred, using Cloudera Manager.
o Documented and prepared runbooks of system processes and procedures for future reference.
o Installed and configured Flume, Hive, Pig, Sqoop, and Oozie on the Hadoop cluster.
o Integrated Oozie with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as MapReduce, Pig, Hive, and Sqoop) as well as system-specific jobs.
o Developed the entire data transfer model using the Sqoop framework (a sample command follows this list).
o Performed benchmarking and performance tuning on the Hadoop infrastructure.
o Automated data loading between the production and disaster recovery clusters.
o Migrated the Hive schema from the production cluster to the DR cluster.
o Implemented Hadoop NameNode HA services to make the Hadoop services highly available.
o Imported data from RDBMS into Hive and HDFS, and exported from Hive and HDFS back to RDBMS, using Sqoop.
o Worked on migrating applications from relational database systems by carrying out PoCs.
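A minimal sketch of a Sqoop import of the kind referenced above (RDBMS into Hive); the JDBC URL, table name, and credentials are hypothetical placeholders.

    # Hypothetical import from Oracle into a Hive table.
    sqoop import \
      --connect jdbc:oracle:thin:@dbhost:1521/ORCL \
      --username etl_user -P \
      --table TRANSACTIONS \
      --hive-import --hive-table transactions \
      --num-mappers 4

An incremental load would add --incremental append with --check-column and --last-value so that only new rows are transferred on each run.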
Responsibilities:
o Installed Hadoop on clustered Dev/UAT/Prod environments.
o Strong knowledge in administration and development of HDFS, Hive, and Pig, using HiveQL and Pig Latin scripts respectively.
o Installed, upgraded, and managed Datameer, onboarding users and maintaining data links.
o Installed and tested Impala from beta versions in lab environments and implemented the GA release in production.
o Configured cluster properties to achieve high cluster performance, taking the cluster hardware configuration as the key criterion.
o Designed the rack topology for the production Hadoop cluster using Cloudera Manager.
o Managed the day-to-day operations of the cluster for backup and support.
o Created internal and external Hive tables and defined static and dynamic partitions as required for optimized performance.
o Conducted root cause analysis and worked with big data analysts, designers, and scientists in troubleshooting MapReduce job failures and issues with Hive and MapReduce.
o Migrated Oozie workflows to the new version during the upgrade of the Hadoop cluster from CDH3u2 to CDH4u6.
o Developed Sqoop jobs for loading data from Oracle and DB2 into Hadoop for history loads and delta loads.
o Developed shell scripts to report disk usage by users on Hadoop clusters and to automate alerting when a user approaches quota (a sketch follows this list).
o Proactively involved in ongoing maintenance, support, and improvements of the Hadoop cluster.
o Resolved user issues and incidents related to onboarding, job failures, and other technical support questions.
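A minimal shell sketch of a per-user HDFS disk-usage report with a quota alert, assuming user home directories under /user; the 1 TB threshold and the mail recipient are hypothetical.

    #!/usr/bin/env bash
    # Hypothetical per-user HDFS usage report; warns above a 1 TB threshold.
    # Assumes Hadoop 2.8+, where `hdfs dfs -du` prints: size, replicated size, path.
    THRESHOLD=$((1024 ** 4))   # 1 TB in bytes

    hdfs dfs -du /user | while read -r size _ path; do
        if [ "$size" -gt "$THRESHOLD" ]; then
            echo "WARN: $path uses $size bytes of HDFS space" \
                | mail -s "HDFS quota warning" hadoop-admin@example.com
        fi
    done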
MapReduce
o Hands-on experience in developing MapReduce programs using Apache Hadoop for analyzing big data.
o Expertise in reducing network traffic with Combiners, joining datasets with different schemas using joins, and organizing data using Partitioners (a sketch follows this list).
o Experience in writing custom Counters for analyzing data, and in testing with the MRUnit framework.
o Experienced in writing complex MapReduce programs that work with different file formats, such as Text, SequenceFile, XML, and Avro.
o Expertise in composing MapReduce pipelines with many user-defined functions to implement complex algorithms.
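A minimal Java sketch of a word-count-style mapper wired with a custom Counter, in the spirit of the bullets above; the class and counter names are illustrative assumptions.

    // Hypothetical mapper with a custom Counter for data-quality tracking.
    import java.io.IOException;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.*;

    public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        enum Quality { EMPTY_LINES }   // custom counter
        private static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String line = value.toString().trim();
            if (line.isEmpty()) {
                ctx.getCounter(Quality.EMPTY_LINES).increment(1);
                return;
            }
            for (String token : line.split("\\s+")) {
                ctx.write(new Text(token), ONE);
            }
        }
    }

In the driver, job.setCombinerClass(IntSumReducer.class) runs the reducer map-side as a Combiner, cutting shuffle traffic across the network.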
PIG
o Expertise in writing ad hoc MapReduce programs as Pig scripts.
o Used Pig as an ETL tool for transformations, event joins, filters, and some pre-aggregations (a sketch follows this list).
o Implemented business logic by writing Pig Latin UDFs in Java, and used various UDFs from Piggybank and other sources.
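A minimal Pig Latin sketch of the filter/join/pre-aggregate pattern described above; file paths and field names are hypothetical.

    -- Hypothetical ETL: filter bad records, join, and pre-aggregate.
    raw    = LOAD '/data/events' USING PigStorage('\t')
             AS (user_id:chararray, event:chararray, amount:double);
    users  = LOAD '/data/users'  USING PigStorage('\t')
             AS (user_id:chararray, region:chararray);

    clean  = FILTER raw BY amount IS NOT NULL AND amount > 0;
    joined = JOIN clean BY user_id, users BY user_id;
    byReg  = GROUP joined BY users::region;
    totals = FOREACH byReg GENERATE group AS region,
                                    SUM(joined.clean::amount) AS total;
    STORE totals INTO '/data/region_totals';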
HIVE
o Expertise in Hive Query Language (HiveQL), Hive security, and debugging Hive issues.
o Responsible for performing extensive data validation using Hive dynamic partitioning and bucketing.
o Experience in developing custom UDFs for Pig and Hive to incorporate Java methods and functionality into Pig Latin and HiveQL.
o Worked on different kinds of tables, such as external tables and managed tables.
o Experience working with different Hive SerDes that handle file formats such as Avro and XML.
o Analyzed the data by performing Hive queries and used Hive UDFs for complex querying.
Kafka and Storm
o Expert in implementing a unified data platform to gather data from different sources using Kafka producers and consumers written in Java (a sketch follows this list).
o Experienced in designing Kafka brokers, creating custom partitions, and integrating with Apache Storm for transformations.
o Experienced in implementing Kafka SimpleConsumers to get data from specific partitions.
o Experienced in implementing Storm topologies to pre-process data before moving it to target consumers.
o Experienced in designing and developing Storm spouts and bolts to get data from Kafka sources.
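A minimal Java sketch of a Kafka producer in the spirit of the bullets above, written against the modern kafka-clients API; the broker address and topic name are hypothetical.

    // Hypothetical Kafka producer sending log lines to an "events" topic.
    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class EventProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");
            props.put("key.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                      "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Key "host1" keeps records from one host in one partition.
                producer.send(new ProducerRecord<>("events", "host1", "app started"));
            }
        }
    }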
Spark
o Experienced in migrating MapReduce jobs to Spark transformations using in-memory processing (a sketch follows this list).
o Designed and developed Spark transformations and applied Spark actions to implement algorithms.
o Experienced in working with Spark SQL to analyze structured data.
o Knowledge in implementing predictive algorithms using the Spark MLlib library.
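A minimal Scala sketch of the MapReduce-to-Spark pattern described above: a word count expressed as RDD transformations followed by an action. The HDFS paths are hypothetical.

    // Hypothetical Spark job: word count as transformations + an action.
    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WordCount"))

        val counts = sc.textFile("hdfs:///data/input")
          .flatMap(_.split("\\s+"))   // transformation
          .map(word => (word, 1))     // transformation
          .reduceByKey(_ + _)         // transformation (shuffle)

        counts.saveAsTextFile("hdfs:///data/wordcounts")  // action
        sc.stop()
      }
    }

Unlike chained MapReduce jobs, the intermediate results stay in memory until the action triggers execution.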
NoSQL & Others
o Expert database engineer; NoSQL and relational data modeling.
o Responsible for building scalable distributed data solutions using DataStax Cassandra.
o Experienced in designing and developing data models for Cassandra and implementing CRUD operations using the Thrift and REST APIs (a CQL sketch follows this list).
o Experienced in working with Cassandra Query Language (CQL).
o Expertise in HBase cluster setup, configuration, HBase implementation, and the HBase client API.
o Worked on importing data into HBase using the HBase shell and the HBase client API.
o Extensive experience working with MongoDB: creating collections and inserting and finding data.
o Experienced in implementing data ingestion using Apache NiFi data flows.
o Experienced in implementing full-text search and analysis using Apache Solr.
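A minimal CQL sketch of a Cassandra data model with basic CRUD statements of the kind mentioned above; the keyspace, table, and column names are hypothetical.

    -- Hypothetical keyspace and table, modeled for query-by-user access.
    CREATE KEYSPACE IF NOT EXISTS analytics
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

    CREATE TABLE analytics.events_by_user (
        user_id  text,
        event_ts timestamp,
        event    text,
        PRIMARY KEY (user_id, event_ts)
    ) WITH CLUSTERING ORDER BY (event_ts DESC);

    INSERT INTO analytics.events_by_user (user_id, event_ts, event)
        VALUES ('u42', toTimestamp(now()), 'login');

    SELECT * FROM analytics.events_by_user WHERE user_id = 'u42' LIMIT 10;

The partition key (user_id) matches the read pattern, and the clustering order returns the newest events first.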
Environment: Hadoop, YARN, MapReduce, Spark, Hive, HBase, HDFS, Java (JDK 1.6), Linux, Cloudera, Oracle 10g, PL/SQL, SQL*Plus, Toad 9.6, UNIX shell scripting, Eclipse, Scala.
Environment: Hadoop, MapReduce, HDFS, Hive, Pig, Spark, HBase, Java 6, Cloudera, Linux, XML, MySQL, MySQL Workbench, Eclipse, Cassandra.