
DATA SCIENCE PIPELINE AND

HADOOP ECOSYSTEM

3408
SHIVPRAKASH VISHWAKARMA
INTRODUCTION
In simple words, a pipeline in data science is "a set of actions that changes the raw (and messy) data from various sources (surveys, feedback, lists of purchases, votes, etc.) into an understandable format so that we can store it and use it for analysis."


PROCESS OF THE DATA SCIENCE PIPELINE
Fetching/Obtaining Data
Scrubbing/Cleaning the Data
Exploring the Data (EDA)
Modelling the Data
Interpreting the Data
THE OSEMN FRAMEWORK
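These five steps are what the OSEMN acronym stands for: Obtain, Scrub, Explore, Model, iNterpret. Below is a minimal Python sketch of the pipeline using pandas and scikit-learn; the CSV file name, the "target" column, and the assumption that all feature columns are numeric are placeholders for illustration, not part of the original slides.

# A minimal sketch of the five OSEMN stages using pandas and scikit-learn.
# File name, column names, and the target column are hypothetical.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Obtain: fetch the raw data (here, a hypothetical CSV file).
df = pd.read_csv("survey_responses.csv")

# Scrub: clean the data (drop duplicates, fill missing numeric values).
df = df.drop_duplicates()
df = df.fillna(df.median(numeric_only=True))

# Explore: basic EDA to understand the data.
print(df.describe())
print(df["target"].value_counts())

# Model: fit a simple baseline classifier
# (assumes all feature columns are numeric for simplicity).
X = df.drop(columns=["target"])
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# iNterpret: evaluate and report the results.
print(classification_report(y_test, model.predict(X_test)))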
DABL LIBRARY

dabl is a data analysis baseline library that makes supervised machine learning easier and more accessible for beginners or people with no data science background. dabl is inspired by the scikit-learn library, and it tries to democratize machine learning modeling by reducing boilerplate tasks and automating common components.

The dabl library includes various features that make it easier to process, analyze, and model data in a few lines of Python code.
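A minimal sketch of a typical dabl workflow is shown below. The CSV file name and the "target" column are hypothetical; dabl.clean, dabl.plot, and SimpleClassifier are part of the dabl API, though exact behavior may vary between versions.

# A typical dabl workflow: clean, visualize, and fit a baseline model.
# The CSV file and the "target" column name are placeholders.
import pandas as pd
import dabl

df = pd.read_csv("customer_data.csv")

# Clean the data: detect column types and fix common problems automatically.
df_clean = dabl.clean(df)

# Quick EDA: plot the features most relevant to the target column.
dabl.plot(df_clean, target_col="target")

# Fit a quick baseline classifier with sensible defaults.
model = dabl.SimpleClassifier().fit(df_clean, target_col="target")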


HADOOP ECOSYSTEM

• Hadoop Ecosystem is a platform or suite which provides various services to solve big data problems.
• It includes Apache projects and various commercial tools and solutions.
• There are four major elements of Hadoop: HDFS, MapReduce, YARN, and Hadoop Common.
• Most of the other tools or solutions are used to supplement or support these major elements.
• All these tools work collectively to provide services such as ingestion, analysis, storage, and maintenance of data.
• HDFS: HDFS is a distributed file system that handles large data sets running on commodity hardware. It is used to scale a single Apache Hadoop cluster to hundreds (and even thousands) of nodes.
• MapReduce: MapReduce is a programming model for writing applications that process big data in parallel across multiple nodes, providing the analytical capability for working with huge volumes of complex data (a word-count sketch in Python appears after this list).
• YARN: YARN is a large-scale, distributed operating system for big data applications. The technology is designed for cluster management and is one of the key features of the second generation of Hadoop, the Apache Software Foundation's open-source distributed processing framework.
• Hadoop Common: Hadoop Common refers to the collection of common utilities and libraries that support the other Hadoop modules. It is an essential module of the Apache Hadoop framework, along with the Hadoop Distributed File System (HDFS), Hadoop YARN, and Hadoop MapReduce.
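As a concrete illustration of the MapReduce model, here is the classic word-count example written for Hadoop Streaming, which lets the mapper and reducer be plain Python scripts that read from stdin and write to stdout. The script names and the input/output paths are illustrative, not taken from the slides.

# mapper.py -- emits "word<TAB>1" for every word read from stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py -- sums the counts for each word; Hadoop Streaming delivers
# the mapper output sorted by key, so equal words arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")

A run would look roughly like: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper "python3 mapper.py" -reducer "python3 reducer.py" -input /data/books -output /data/wordcount, where the exact location of the streaming jar and the HDFS paths depend on the installation.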
Thank you!
