
DATA SCIENCE PIPELINE AND

HADOOP ECOSYSTEM

3408
SHIVPRAKASH VISHWAKARMA
INTRODUCTION
In simple words, a pipeline in data science is "a set of actions that changes the raw (and messy) data from various sources (surveys, feedback, lists of purchases, votes, etc.) into an understandable format so that we can store it and use it for analysis."


PROCESS OF THE DATA SCIENCE PIPELINE
Fetching/Obtaining Data
Scrubbing/Cleaning the Data
Exploring the Data (EDA)
Modelling the Data
Interpreting the Data
THE OSEMN FRAMEWORK
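These five steps are what the OSEMN acronym stands for: Obtain, Scrub, Explore, Model, iNterpret. Below is a minimal Python sketch of the pipeline using pandas and scikit-learn; the CSV file name, the "target" column, and the assumption that all feature columns are numeric are placeholders for illustration, not part of the original slides.

# A minimal sketch of the five OSEMN stages using pandas and scikit-learn.
# File name, column names, and the target column are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Obtain: fetch the raw data (here, a hypothetical CSV file).
df = pd.read_csv("survey_responses.csv")

# Scrub: clean the data (drop duplicates, fill missing numeric values).
df = df.drop_duplicates()
df = df.fillna(df.median(numeric_only=True))

# Explore: basic EDA to understand the data.
print(df.describe())
print(df["target"].value_counts())

# Model: fit a simple baseline classifier
# (assumes all feature columns are numeric for simplicity).
X = df.drop(columns=["target"])
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# iNterpret: evaluate and report the results.
print(classification_report(y_test, model.predict(X_test)))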
DABL LIBRARY

dabl is a data analysis baseline library that makes supervised machine learning easier and more accessible for beginners or people with no data science background. dabl is inspired by the scikit-learn library, and it tries to democratize machine learning modeling by reducing boilerplate tasks and automating common components.

The dabl library includes various features that make it easier to process, analyze, and model data in a few lines of Python code.
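A minimal sketch of a typical dabl workflow is shown below. The CSV file name and the "target" column are hypothetical; dabl.clean, dabl.plot, and SimpleClassifier are part of the dabl API, though exact behavior may vary between versions.

# A typical dabl workflow: clean, visualize, and fit a baseline model.
# The CSV file and the "target" column name are placeholders.
import pandas as pd
import dabl

df = pd.read_csv("customer_data.csv")

# Clean the data: detect column types and fix common problems automatically.
df_clean = dabl.clean(df)

# Quick EDA: plot the features most relevant to the target column.
dabl.plot(df_clean, target_col="target")

# Fit a quick baseline classifier with sensible defaults.
model = dabl.SimpleClassifier().fit(df_clean, target_col="target")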


HADOOP ECOSYSTEM

• Hadoop Ecosystem is a platform or suite which provides various services to solve big data problems.
• It includes Apache projects and various commercial tools and solutions.
• There are four major elements of Hadoop: HDFS, MapReduce, YARN, and Hadoop Common.
• Most of the other tools or solutions are used to supplement or support these major elements.
• All these tools work collectively to provide services such as ingestion, analysis, storage, and maintenance of data.
• HDFS: HDFS is a distributed file system that handles large data sets running on commodity hardware. It is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes.
• MapReduce: MapReduce is a programming model for writing applications that process big data in parallel across multiple nodes, providing the analytical capability for working with huge volumes of complex data (a word-count sketch in Python appears after this list).
• YARN: YARN is a large-scale, distributed operating system for big data applications. The technology is designed for cluster management and is one of the key features of the second generation of Hadoop, the Apache Software Foundation's open-source distributed processing framework.
• Hadoop Common: Hadoop Common refers to the collection of common utilities and libraries that support the other Hadoop modules. It is an essential module of the Apache Hadoop framework, along with the Hadoop Distributed File System (HDFS), Hadoop YARN, and Hadoop MapReduce.
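As a concrete illustration of the MapReduce model, here is the classic word-count example written for Hadoop Streaming, which lets the mapper and reducer be plain Python scripts that read from stdin and write to stdout. The script names and the input/output paths are illustrative, not taken from the slides.

# mapper.py -- emits "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py -- sums the counts for each word; Hadoop Streaming delivers
# the mapper output sorted by key, so equal words arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")

A run would look roughly like: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper "python3 mapper.py" -reducer "python3 reducer.py" -input /data/books -output /data/wordcount, where the exact location of the streaming jar and the HDFS paths depend on the installation.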
Thank you!
