PH: 601-691-1228 LinkedIn: Karthik Potharaju Sr. Hadoop/Big Data Developer
Over 8 years of overall IT experience in a variety of industries, including around 5 years of hands-on
experience in Big Data technologies (Hadoop 1.0 and 2.0), designing and implementing MapReduce
(MR1 and MR2) architectures.
Well versed in installing, configuring, supporting and managing Hadoop clusters and the underlying
Big Data infrastructure.
Good knowledge of Hadoop development and core components such as HDFS, JobTracker, TaskTracker,
DataNode, NameNode and MapReduce concepts.
Hands-on experience with major components in the Hadoop ecosystem such as Hadoop MapReduce, HDFS,
Hive, Pig, HBase, Zookeeper, Sqoop, Oozie, Cassandra, Flume and Avro.
Experience in installation, configuration, management, support and monitoring of Hadoop clusters
using various distributions such as Apache and Cloudera.
Experience in analyzing data using HiveQL, Pig Latin, HBase and custom MapReduce programs in
Java.
Involved in project planning, setting up standards for implementation and design of Hadoop based
applications.
Wrote MapReduce programs with custom logic based on requirements, and developed custom
UDFs in Pig and Hive per user requirements.
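One lightweight way to add custom logic in Hive without compiling a Java UDF is the TRANSFORM clause, which streams rows through an external script. A minimal Python sketch; the column layout and the normalization rule are hypothetical:

```python
import sys

def normalize(line):
    """Uppercase the second column of a tab-separated record.
    Hypothetical rule standing in for a real business transformation."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) >= 2:
        cols[1] = cols[1].upper()
    return "\t".join(cols)

if __name__ == "__main__":
    # Hive's TRANSFORM streams rows as tab-separated lines on stdin/stdout
    for line in sys.stdin:
        print(normalize(line))
```

It would be wired in with something like `SELECT TRANSFORM(id, name) USING 'python normalize.py' AS (id, name) FROM users;` where the table and columns are placeholders.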
Involved in the pilot of Hadoop cluster hosted on Amazon Web Services (AWS).
Implemented NoSQL databases like HBase, Cassandra and MongoDB for storing and processing
different formats of data.
Implemented Oozie for writing workflows and scheduling jobs. Wrote Hive queries for data
analysis and to prepare data for visualization.
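An Oozie workflow chains actions declared in XML. A minimal sketch of a workflow that runs one Hive script; the workflow name, script path and properties are hypothetical:

```xml
<workflow-app xmlns="uri:oozie:workflow:0.5" name="daily-report-wf">
  <start to="hive-report"/>
  <action name="hive-report">
    <hive xmlns="uri:oozie:hive-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>report.hql</script>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Hive action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
```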
Installed Spark and analyzed HDFS data, caching datasets in memory to perform a wide variety of
complex computations interactively.
Experience in importing and exporting data in different formats between HDFS/HBase and various
RDBMS databases.
Developed applications using Spark for data processing.
Replaced existing MapReduce jobs and Hive scripts with Spark DataFrame transformations and
actions for faster analysis of the data.
Experienced in working with Amazon Web Services (AWS), using EC2 for computing and S3 as a
storage mechanism.
Experience in installing and setting up Hadoop environments in the cloud through Amazon Web
Services (AWS) offerings like EMR and EC2, which provide efficient processing of data.
Very good experience in complete project life cycle (design, development, testing and
implementation) of Client Server and Web applications.
Developed Spark applications using Scala and Java, and implemented an Apache Spark data processing
project to handle data from various RDBMS and streaming sources.
Worked with Spark to improve performance and optimize existing algorithms in Hadoop,
using Spark Context, Spark SQL, Spark MLlib, DataFrames, pair RDDs and Spark on YARN.
Experienced in Apache Spark for implementing advanced procedures like text analytics and
processing in Scala, using Spark's in-memory computing capabilities.
Experience with middleware architectures using Sun Java technologies like J2EE and Servlets, and
application servers like WebSphere and WebLogic.
Used different Spark modules like Spark Core, Spark RDDs, Spark DataFrames and Spark SQL.
Converted various Hive queries into the required Spark transformations and actions.
Experience in working on the open-source Apache Hadoop distribution with technologies like HDFS,
MapReduce, Python, Pig, Hive, Hue, HBase, Sqoop, Oozie, Zookeeper, Spark, Spark Streaming,
Storm, Kafka, Cassandra, Impala, Snappy, Greenplum, MongoDB and Mesos.
Technical Skills
Hadoop Components: HDFS, Hue, MapReduce, Pig, Hive, HCatalog, HBase, Sqoop, Impala,
Zookeeper, Flume, Kafka, YARN, Cloudera Manager, Kerberos
Spark Components: Apache Spark, DataFrames, Spark SQL, Spark on YARN, Pair RDDs
Web Technologies / Other Components: J2EE, XML, Log4j, HTML, CSS, JavaScript
Server-Side Scripting: UNIX Shell Scripting
Databases: Oracle 10g, Microsoft SQL Server, MySQL, DB2, Teradata
Programming Languages: Java, C, C++, Scala, Impala, Python
Web Servers: Apache Tomcat, BEA WebLogic
IDEs: Eclipse, Dreamweaver
OS/Platforms: Windows 2005/2008, Linux (all major distributions), Unix
NoSQL Databases: HBase, MongoDB
Methodologies: Agile (Scrum), Waterfall, UML, Design Patterns, SDLC
Responsibilities:
Developed simple to complex MapReduce streaming jobs using Java language for processing and
validating the data.
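The streaming jobs above were written in Java, but the Hadoop Streaming contract (tab-separated key/value lines, reducer input sorted by key) is language-agnostic. A minimal Python sketch of the same map/reduce shape, with a hypothetical click-log layout:

```python
from itertools import groupby

def map_line(line):
    """Mapper: emit (user_id, 1) for each tab-separated click record;
    drop records with an empty key."""
    cols = line.rstrip("\n").split("\t")
    return (cols[0], 1) if cols[0] else None

def reduce_pairs(pairs):
    """Reducer: sum counts per key. Hadoop delivers reducer input
    sorted by key, so a single pass with groupby is enough."""
    return [(key, sum(v for _, v in grp))
            for key, grp in groupby(pairs, key=lambda kv: kv[0])]
```

In a real streaming job these functions would read stdin and write stdout; the record layout here is an assumption for illustration.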
Developed data pipeline using MapReduce, Flume, Sqoop and Pig to ingest customer behavioral
data into HDFS for analysis.
Developed MapReduce and Spark jobs to discover trends in data usage by users.
Implemented Spark using Python and Spark SQL for faster processing of data.
Developed functional programs in Scala for connecting to the streaming data application,
gathering web data in JSON and XML formats and passing it to Flume.
Used Spark for interactive queries, processing of streaming data and integration with popular
NoSQL databases for large volumes of data.
Used the Spark-Cassandra Connector to load data to and from Cassandra.
Streamed data in real time using Spark with Kafka.
Experienced in working with Amazon Web Services (AWS) EC2 and S3 in Spark RDD workflows.
Handled importing data from different data sources into HDFS using Sqoop, performed
transformations using Hive and MapReduce, and loaded the processed data back into HDFS.
Exported the analyzed data to the relational databases using Sqoop, to further visualize and
generate reports for the BI team.
Configured other ecosystems like Hive, Sqoop, Flume, Pig and Oozie.
Collected and aggregated large amounts of log data using Flume, staging the data in HDFS for
further analysis.
Analyzed the data by performing Hive queries (HiveQL) and running Pig scripts (Pig Latin) to study
customer behavior.
Used Hive to analyze the partitioned and bucketed data and compute various metrics for reporting.
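Partitioning and bucketing in Hive are declared at table-creation time, and queries that filter on the partition column scan only the matching partitions. A hedged HiveQL sketch; the table name, columns, bucket count and date are hypothetical:

```sql
-- Hypothetical layout: partitioned by event date, bucketed by user_id
CREATE TABLE user_events (
  user_id STRING,
  event_type STRING,
  duration_ms BIGINT
)
PARTITIONED BY (event_date STRING)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

-- Metrics query prunes to one partition instead of scanning the full table
SELECT event_type, COUNT(*) AS events, AVG(duration_ms) AS avg_duration
FROM user_events
WHERE event_date = '2016-01-15'
GROUP BY event_type;
```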
Developed Pig Latin scripts to perform MapReduce jobs.
Developed product profiles using Pig and commodity UDFs.
Worked on scalable distributed data system using Hadoop ecosystem.
Developed Hive scripts in HiveQL to De-Normalize and Aggregate the data.
Created HBase tables and column families to store the user event data.
Wrote automated HBase test cases for data quality checks using HBase command-line tools.
Created UDFs to store specialized data structures in HBase and Cassandra.
Created and configured AWS RDS/Redshift to use the Hadoop ecosystem on AWS infrastructure.
Scheduled and executed workflows in Oozie to run Hive and Pig jobs.
Used Impala to read, write and query Hadoop data in HDFS and HBase.
Used Tez framework for building high performance jobs in Pig and Hive.
Configured Kafka to read and write messages from external programs.
Configured Kafka to handle real time data.
Developed end-to-end data processing pipelines that receive data through the distributed
messaging system Kafka and persist it into HBase.
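The shape of such a pipeline can be sketched with in-memory stand-ins: a queue.Queue playing the Kafka topic and a dict playing the HBase table. The message format and every name here are hypothetical:

```python
from queue import Queue, Empty

def run_pipeline(topic, store):
    """Drain "rowkey:value" messages from `topic` (stand-in for a Kafka
    consumer), drop malformed ones, and persist rows into `store`
    (stand-in for an HBase table). Returns the number of rows persisted."""
    persisted = 0
    while True:
        try:
            raw = topic.get_nowait()
        except Empty:
            break  # topic drained
        row_key, sep, value = raw.partition(":")
        if row_key and value:  # skip messages without a key or payload
            store[row_key] = value
            persisted += 1
    return persisted
```

A real implementation would replace the queue with a Kafka consumer loop and the dict with HBase puts; the validation-then-persist structure stays the same.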
Uploaded and processed terabytes of data from various structured and unstructured sources into
HDFS (AWS cloud) using Sqoop and Flume.
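An import of one such structured source via Sqoop might look like the following command sketch; the connection string, credentials, table and target path are placeholders, while the flags are standard Sqoop 1.x options:

```shell
# Hypothetical source database and paths; -P prompts for the password
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table orders \
  --split-by order_id \
  --num-mappers 4 \
  --target-dir /data/raw/orders \
  --as-avrodatafile
```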
Wrote a Storm topology to emit data into Cassandra.
Wrote a Storm topology to accept data from a Kafka producer and process it.
Continuously monitored and managed the Hadoop cluster using Cloudera Manager.
Worked extensively with importing metadata into Hive and migrated existing tables and applications
to work on Hive and AWS cloud.
Developed interactive shell scripts for scheduling various data cleansing and data loading process.
Performed data validation on the data ingested using MapReduce by building a custom model to
filter all the invalid data and cleanse the data.
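A custom validation model of this kind reduces to a predicate applied per record before cleansing. A minimal Python sketch with made-up validity rules (non-empty id, non-negative numeric amount); the field names are assumptions:

```python
def is_valid(record):
    """Hypothetical validity rules: the record must have a non-empty
    'id' and an 'amount' that parses to a non-negative number."""
    try:
        return bool(record.get("id")) and float(record["amount"]) >= 0
    except (KeyError, TypeError, ValueError):
        return False

def cleanse(records):
    """Keep only valid records, normalizing 'amount' to a float."""
    return [dict(r, amount=float(r["amount"])) for r in records if is_valid(r)]
```

In the actual job this predicate would run inside the mapper so invalid rows never reach downstream stages.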
Experience with data wrangling and creating workable datasets.
Developed schemas to handle reporting requirements using Jaspersoft.
Environment: Hadoop, MapReduce, Spark, Pig, Hive, Sqoop, Oozie, HBase, Zookeeper, Kafka, Flume,
Solr, Storm, Tez, Impala, Mahout, Cassandra, Cloudera Manager, MySQL, Jaspersoft, multi-node cluster
with Linux (Ubuntu), Windows, Unix.