This document is a resume for William A Thorndike, who has 5 years of experience as a Big Data Engineer. He has extensive skills and experience working with Hadoop and Spark technologies like HDFS, YARN, Hive, Pig, Impala, Kafka, and Spark Streaming. He has designed and implemented both on-premises and cloud-based big data solutions using these technologies. His experience includes data engineering, ETL processes, and developing Spark applications for batch and stream processing.
I.T. and Hadoop/Big Data

Seasoned Hadoop/Big Data Engineer skilled in the use of Spark, Spark Streaming, and Spark DataFrames. Experienced with Hadoop components, Kafka, Kibana, and PySpark. Designs and implements on-premises and cloud Big Data ecosystems and pipelines using Hadoop and Spark.

PROFESSIONAL PROFILE

• Proficient in extracting data and generating analyses with the business intelligence tool Tableau.
• Effective with HDFS, YARN, Pig, Hive, Impala, Sqoop, HBase, and Cloudera.
• Knowledge of Spark architecture, including Spark Core, Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX.
• Experience processing data using the Spark Streaming API with Scala.
• ETL: data extraction, transformation, and load using Hive, Pig, and HBase.
• Very good knowledge of and hands-on experience with Cassandra, Flume, and YARN.
• Experience implementing user-defined functions for Pig and Hive.
• Extensive knowledge of the development, analysis, and design of ETL methodologies across all phases of the data warehousing life cycle.
• Excellent understanding of Hadoop architecture and its components, such as HDFS, Job Tracker, Task Tracker, NameNode, and DataNode.
• Expertise in Python and Scala, including user-defined functions (UDFs) for Hive and Pig written in Python.
• Hands-on use of the Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
• Expertise in preparing test cases, documenting, and performing unit and integration testing.
• Hands-on experience with major components of the Hadoop ecosystem, such as Spark, HDFS, Hive, Pig, HBase, Zookeeper, Sqoop, Oozie, Flume, and Kafka.
• Work experience with cloud infrastructure such as Amazon Web Services.
• Experience importing and exporting data with Sqoop from Oracle and mainframe DB2 to HDFS and data lakes.
• Experience developing shell scripts, Oozie scripts, and Python scripts.
• Expert in writing complex SQL queries against databases such as DB2, MySQL, and SQL Server.
• Experience importing and exporting data between Hadoop and RDBMSs using Sqoop and SFTP.
• Extensive experience with databases such as MySQL and Oracle 11g.
• Experience using Kafka as a messaging system to implement real-time streaming solutions with Spark Streaming (see the sketch following this list).
• Expertise with the tools in the Hadoop ecosystem, including HDFS, Pig, Hive, Sqoop, Storm, Spark, Kafka, YARN, Oozie, and Zookeeper.
• Knowledge of implementing advanced procedures such as text analytics and processing using Apache Spark with Python.
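As an illustration of the Kafka-plus-Spark-Streaming work described above, here is a minimal sketch in Scala using Spark Structured Streaming's built-in Kafka source. The broker address, topic name, and HDFS paths are hypothetical placeholders, not details taken from this resume.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object KafkaStreamSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-stream-sketch")
      .getOrCreate()

    // Subscribe to a Kafka topic (broker and topic are placeholder values).
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()

    // Kafka delivers the payload as binary; cast it to a string and apply a
    // trivial transformation so the pipeline shape is visible end to end.
    val parsed = raw
      .selectExpr("CAST(value AS STRING) AS line", "timestamp")
      .withColumn("word_count", size(split(col("line"), "\\s+")))

    // Land the stream on HDFS as Parquet (paths are placeholders).
    val query = parsed.writeStream
      .format("parquet")
      .option("path", "hdfs:///data/events")
      .option("checkpointLocation", "hdfs:///checkpoints/events")
      .start()

    query.awaitTermination()
  }
}
```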
TECHNICAL SKILLS PROFILE

APACHE: Apache Ant, Apache Flume, Apache Hadoop, Apache YARN, Apache Hive, Apache Kafka, Apache Maven, Apache Oozie, Apache Pig, Apache Spark, Apache Tez, Apache Zookeeper, Cloudera Impala, HDFS

HADOOP: Hive, Pig, Zookeeper, Sqoop, Oozie, YARN, Maven, Ant, Flume, HDFS, Apache Airflow

VERSIONING: Git, GitHub

DATA VISUALIZATION TOOLS: Pentaho, QlikView, Tableau

PROGRAMMING: Spark, Scala, PySpark, PyTorch, Java

FRAMEWORKS: Spark, Kafka

FILE FORMATS: Parquet, Avro, JSON, ORC

HADOOP ADMINISTRATION: Ambari, YARN, Workflows, Zookeeper, Oozie, Cluster Management, Cluster Security

SCRIPTING: Pig Latin, HiveQL, MapReduce, Shell scripting, SQL, Spark SQL

SOFTWARE DEVELOPMENT: Test-Driven Development, Continuous Integration, Unit Testing, Functional Testing, Scenario Testing, Regression Testing, Object-Oriented Programming, Functional Programming

IDE: Jupyter Notebooks, PyCharm

CONTINUOUS INTEGRATION (CI/CD): Jenkins

DATA MANAGEMENT: HDFS, Data Lake, Data Warehouse, Database

PROJECT METHODOLOGY: Agile Scrum, Sprint Planning, Sprint Retrospective, Sprint Grooming, Backlog, Daily Scrums

BIG DATA DISTRIBUTIONS AND PLATFORMS: AWS Cloud, Hadoop On-Prem, Cloudera (CDH), Hortonworks Data Platform (HDP)

AMAZON AWS CLOUD: AWS Lambda, AWS S3, AWS RDS, AWS EMR, AWS Redshift, AWS Kinesis, AWS ELK, AWS CloudFormation, AWS IAM

DATABASE: Apache Cassandra, AWS Redshift, Amazon RDS, Apache HBase, SQL, NoSQL, Elasticsearch

PROFESSIONAL EXPERIENCE PROFILE

BIG DATA ENGINEER
3M
Maplewood, MN
July 2017 - Present

• Worked with Spark to create structured data from a pool of unstructured data.
• Implemented advanced procedures such as text analytics and processing using the in-memory computing capabilities of Apache Spark, written in Scala.
• Implemented Spark with Scala and Spark SQL for faster testing and processing of data.
• Converted Hive/SQL queries into Spark transformations using Spark RDDs and Scala (see the sketch following this section).
• Documented the requirements, including the existing code to be reimplemented with Spark, Hive, HDFS, and Elasticsearch.
• Maintained the ELK stack (Elasticsearch, Kibana) and wrote Spark scripts using the Scala shell.
• Implemented Spark with Scala, utilizing DataFrames and the Spark SQL API for faster processing of data.
• Fine-tuned resources for long-running Spark applications to achieve better parallelism and more executor memory for caching.
• Used Apache Spark and Scala on large datasets to process real-time data.
• Transferred streaming data from different data sources into HDFS and HBase using Apache Flume.
• Fetched live stream data from DB2 into HBase tables using Spark Streaming and Apache Kafka.
• Involved in the complete Big Data flow of the application, from ingesting upstream data into HDFS to processing it with Spark Streaming.
• Developed ETL pipelines using Spark and Hive to perform various business-specific transformations.
• Automated the Spark pipelines for both bulk loads and incremental loads of various datasets.
• Built input adapters for data dumps from FTP servers using Apache Spark.
• Integrated Kafka with Spark for real-time data processing.
• Performed streaming data ingestion into the Spark distribution environment using Kafka.
• Extracted real-time feeds with Spark Streaming, converted them to RDDs, processed the data into DataFrames, and loaded it into Cassandra; tuned Elasticsearch and Logstash performance and configuration.
• Designed and deployed new ELK clusters (Elasticsearch, Logstash, Kibana, Beats, Kafka, Zookeeper, etc.).
• Built ETL pipelines from source databases into Elasticsearch and Kibana using Bash; involved in data acquisition, pre-processing, and exploration for the project in Scala.
• Developed Spark applications for all batch processing using Scala.
• Developed Spark scripts with Scala shell commands per requirements, and used PySpark for a proof of concept.
• Implemented Hadoop using the Hortonworks Data Platform (HDP).
• Worked on continuous integration with Jenkins, automating jar builds at the end of each day.
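To make the Hive-to-Spark conversion bullet above concrete, here is a minimal sketch in Scala of a HiveQL aggregation rewritten as equivalent DataFrame transformations. The table name, column names, and output table are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object HiveToSparkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hive-to-spark-sketch")
      .enableHiveSupport() // read and write Hive tables via the metastore
      .getOrCreate()

    // Original HiveQL (hypothetical table and columns):
    //   SELECT region, SUM(amount) AS total
    //   FROM sales
    //   WHERE year = 2017
    //   GROUP BY region
    //   ORDER BY total DESC;

    // Equivalent DataFrame transformations:
    val totals = spark.table("sales")
      .filter(col("year") === 2017)
      .groupBy("region")
      .agg(sum("amount").as("total"))
      .orderBy(desc("total"))

    // Persist the result back to Hive for downstream consumption.
    totals.write.mode("overwrite").saveAsTable("sales_totals_by_region")
  }
}
```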
BIG DATA ENGINEER - AMAZON AWS CLOUD
Clif Bar
Emeryville, CA
April 2016 - July 2017

• Implemented Spark RDD transformations and actions to support business analysis.
• Implemented Spark with Scala and Spark SQL for faster testing and processing of data.
• Used AWS Redshift clusters to sync data from Hoot, and used AWS RDS to store the data for retrieval by the dashboard.
• Performed AWS data migrations between database platforms, such as SQL Server to Amazon Aurora, using the RDS tooling.
• Continuously monitored and managed the Elastic MapReduce (EMR) cluster through the AWS console.
• Implemented AWS Lambda functions to run scripts in response to events in an Amazon DynamoDB table or S3 bucket, or to HTTP requests via Amazon API Gateway (see the sketch following this section).
• Worked with AWS Kinesis to process huge volumes of real-time data.
• Automated installation of the ELK agent (Filebeat) with an Ansible playbook; developed a Kafka queue system to collect log data without loss and publish it to various sources.
• Used AWS CloudFormation to ensure successful deployment of database templates; automated cloud deployments using Chef, Python (Boto and Fabric), Ruby scripting, and AWS CloudFormation templates.
• Configured AWS IAM and security groups as required and distributed them as groups across the availability zones of the VPC.
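The Lambda bullet above describes event-driven scripts triggered by S3, DynamoDB, or API Gateway. Below is a minimal sketch of such a handler in Scala for the S3 case, using the aws-lambda-java-core and aws-lambda-java-events libraries; the handler name and the downstream processing are hypothetical.

```scala
import com.amazonaws.services.lambda.runtime.{Context, RequestHandler}
import com.amazonaws.services.lambda.runtime.events.S3Event

// Invoked by AWS Lambda whenever the configured S3 bucket emits an event.
class S3IngestHandler extends RequestHandler[S3Event, String] {
  override def handleRequest(event: S3Event, context: Context): String = {
    event.getRecords.forEach { record =>
      val bucket = record.getS3.getBucket.getName
      val key    = record.getS3.getObject.getKey
      context.getLogger.log(s"New object: s3://$bucket/$key")
      // Placeholder: trigger downstream processing of the new object here.
    }
    "ok"
  }
}
```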
BIG DATA DEVELOPER
Intuitive Research & Technology
Huntsville, AL
December 2014 - April 2016

• Migrated streaming and static data from RDBMSs into the Hadoop cluster using Hive, Pig, Flume, and Sqoop.
• Implemented HDFS access controls and directory and file permissions for user authorization, facilitating stable, secure access for multiple users in a large multi-tenant cluster.
• Developed applications using Hadoop ecosystem components such as Spark, Kafka, HDFS, Hive, Oozie, and Sqoop.
• Worked with Big Data Hadoop ecosystem technologies, including HDFS, MapReduce, YARN, Apache Hive, Apache Spark, HBase, Scala, and Python, for distributed processing of data.
• Automated all jobs that pull data from the FTP server and load it into Hive tables, using Oozie workflows.
• Scheduled the Oozie workflow engine to run multiple HiveQL, Sqoop, and Pig jobs.
• Designed HBase row keys and data models for inserts into HBase tables, using lookup-table and staging-table concepts.
• Created frameworks that ran a large number of Spark and Hadoop applications in series as one cohesive end-to-end Big Data pipeline.
• Used Spark SQL to load Parquet data and created Datasets defined by case classes; handled structured data with Spark SQL and stored it in Hive tables for downstream consumption (see the sketch at the end of this resume).
• Implemented several highly distributed, scalable, and large applications on Cloudera Hadoop.
• Used Cloudera Manager to collect metrics.
• Developed shell scripts, Oozie scripts, and Python scripts.

HADOOP DEVELOPER
Sage Rutty
Rochester, NY
August 2013 - December 2014

• Monitored the Hadoop cluster using tools such as Nagios, Ganglia, and Ambari.
• Managed Hadoop clusters via Cloudera Manager, the command line, and the Hortonworks Ambari agent.
• Installed and configured Tableau Desktop to connect, through the Hortonworks ODBC connector, to the Hortonworks Hive framework (database) containing the bandwidth data from the locomotive, for further analytics.
• Developed Oozie workflows for scheduling and orchestrating the ETL process within the Cloudera Hadoop system.
• Automated workflows with shell scripts to pull data from various databases into Hadoop.
• Involved in cluster-level security: perimeter security (authentication via Cloudera Manager, Active Directory, and Kerberos/Ranger), access (authorization and permissions via Sentry), visibility (audit and lineage via Navigator), and data (encryption at rest).
• Balanced the Hadoop cluster using the balancer utility to spread data equally across the cluster.
• Used the Spark API over Cloudera Hadoop YARN to perform analytics on data in Hive.
• Configured the YARN Capacity Scheduler to support various business SLAs.
• Implemented the Capacity Scheduler on the YARN ResourceManager to share cluster resources among users' MapReduce jobs.

EDUCATION
Master's in Computer Science
Western Illinois University
Macomb, Illinois
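As a closing illustration of the Spark SQL work described under Intuitive Research & Technology (loading Parquet data and defining Datasets with case classes), here is a minimal sketch in Scala. The Trade record type, input path, and output table are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type; in Spark, a case class defines a Dataset's schema.
case class Trade(id: Long, symbol: String, qty: Int, price: Double)

object ParquetDatasetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-dataset-sketch")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._ // supplies the Encoder for the case class

    // Load Parquet data and bind it to the case class as a typed Dataset.
    val trades = spark.read.parquet("hdfs:///data/trades").as[Trade]

    // A typed, compile-checked transformation on the Dataset.
    val bigTrades = trades.filter(t => t.qty * t.price > 10000.0)

    // Store the result in a Hive table for downstream consumption.
    bigTrades.write.mode("overwrite").saveAsTable("big_trades")
  }
}
```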