Big Data (Part2)

Selected 4 – Chapter 6 – Big data & Analytics.
Big Data (Part 2)

.‫ بتقدمهلنا‬tool ‫ اللي كل‬Services ‫ وال‬Hadoop ecosystem ‫هنتكلم عن ال‬
Hadoop ecosystem:
- HDFS → Hadoop distributed file system
- YARN → Yet another resource negotiation
- MapReduce → Data processing using programming
- Spark → In memory data processing
- PIG, HIVE → Data processing services using Query (SQL – like)
- HBase → NoSQL database
- Mahout, Spark MLlib → Machine learning
- Apache Drill → SQL on Hadoop
- Zookeeper → Managing cluster
- Oozie → Job scheduling
- Flume, Sqoop → Data ingesting services
- Solr & Lucence → Searching & indexing
- Ambari → Provision, monitor and maintain cluster
Hadoop ecosystem:
HDFS (Hadoop Distributed File System) Storage
- Stores different types of large data sets (structured, semi-structured and unstructured)
- HDFS creates a level of abstraction over the resources, from where we can see the whole HDFS as a
single unit.
- It Stores data across various resources and maintains the log file about the stored data. (metadata)
- HDFS has 2 core components, i.e., NameNode and DataNode
Nodes connected ‫ وهي مجموعة من ال‬Hadoop clusters ‫ من خالل ال‬Data ‫ بيخزن كل أنواع ال‬.Hadoop ‫ فال‬storage unit ‫ هو ال‬HDFS ‫ال‬
Data ‫ ال‬manage ‫ اللي بي‬manager ‫ هو ال‬Name nodes ‫ وعندنا ال‬data ‫ ال‬hold ‫ اللي بت‬data nodes ‫ ال‬nodes ‫ عندنا نوعين من ال‬.‫مع بعض‬
.‫ واحده‬machine ‫ اللي موجودة عندها كأنها متخزنه على‬Files ‫ هنا بيبقا شايف إن كل ال‬user ‫ ال‬.Hadoop clusters ‫ اللي موجودة فال‬nodes
YARN (Yet Another Resources Negotiator)
- Performs all your processing activities by allocating resources and scheduling tasks.
- Two services: Resources Manager and Node Manager
- Resource Manager: Manages resources and schedule applications running on top of YARN
- Node Manager: Manages containers and monitors resources utilization in each container
Map ‫ بمعني إني لو عندي‬resources to run a particular task on the Hadoop clusters ‫ ال‬Allocate ‫ هو الحاجة اللي مسؤوله إنها ت‬YARN‫ال‬
‫ اللي هتشارك‬nodes ‫ ف مين اللي مسؤول عشان ننفذ التاسك دي؟ ومين ال‬.Hadoop cluster ‫ دا على ال‬Code ‫ وعايزين ننفذ ال‬Reduce task
‫ عشان هو اللي بيحددلنا‬YARN ‫ ال‬Contact ‫ الزم قبل أي حاجه ن‬.YARN ‫ دي كلها هو ال‬Resources ‫فإننا ننفذ التاسك دي؟ اللي بيحدد ال‬
‫ فدوره إن أول ما نبدأ‬Master ‫ ودا ال‬Resources manager ‫ ال‬2 components ‫ بيتكون من‬YARN ‫ وبالتالي ال‬.‫محتاجين ايه عشان ننفذ التاسكات‬
.‫ اللي معانا فالتاسك دي‬resources ‫ ال‬determine ‫ هو المسؤول إني ي‬Resource manager ‫ أي تاسك ال‬submit ‫ن‬
‫ اللي محتاجها فالتاسك دي وابدأ انفذ التاسك فال‬Nodes ‫ فيه ال‬Container ‫ بيبقا عندي‬.node manager ‫ التاني هو ال‬Component‫وال‬
.‫ التاسك‬execute ‫ ألنها بت‬.worker node ‫ أو بنقول عليها‬node manager ‫ دا بتشتغل ك‬Container ‫ وكل نود فال‬.‫ دي‬Container
.Version1 ‫ مكنش موجود ف‬Hadoop V2 ‫ بدأ يظهر في‬YARN ‫ال‬
Map Reduce: Data processing using programming

- Core component in a Hadoop ecosystem for processing
- Helps in writing applications that processes large datasets using distributed and parallel
algorithms
- In a map Reduce program, Map () and Reduce () are 2 functions
- Map function performs actions like filtering, grouping and sorting
- Reduce function aggregates and summarizes the result produced by map function
‫ عال‬Data ‫ ما بنخزن ال‬once ‫ احنا قولنا إن‬.HDFS ‫ اللي متخزنه على ال‬Big data ‫ على ال‬Analytics ‫ اللي بتخليني أقدر اعمل‬tools ‫دي أحد أهم ال‬
Schema ‫ على الداتا ب‬processing ‫ خليتنا نقدر نعمل‬Map Reduce ‫ ال‬.get insights ‫ دي عشان ن‬data ‫ على ال‬Processing ‫ بنبدأ نعمل‬HDFS
‫ عن طريق إنه ي‬large data sets ‫ ل‬process ‫ فهو بيعمل‬.‫ اللي كانت موجودة قبل كده‬Tranditional processing paradigm ‫مختلفة عن ال‬
‫ اللي فيها‬Nodes ‫ ناحية ال‬map logic ‫ ال‬Push ‫ أول ما بن‬.Reduce () ‫ وال‬Map () ‫ ال‬2 functions ‫ من خالل‬.‫ ناحية الداتا‬processing logic ‫ ال‬push
‫ اللي‬Primary language ‫ ال‬.concurrently ‫ اللي عندها بشكل‬data ‫ على ال‬Map logic ‫ ال‬execute ‫ وت‬process ‫ بتبدأ ت‬Nodes ‫ ال‬data ‫ال‬
.‫ برضو‬R‫ أو ال‬Python ‫ وممكن طبعا استخدم ال‬.map Reduce functions ‫ عشان اكتب ال‬JAVA ‫بستخدمها هي لغة ال‬
Pig: Data processing service using Query:

- Pig has 2 parts: Pig Latin, the language and the pig runtime for the execution
environment
- 1 line of pig Latin = approx. 100 lines of map-reduce job.
- The compiler internally converts pig Latin to MapReduce
- It gives you a platform for building data flow for ETL
- PIG first loads the data, then performs various functions like grouping, filtering, joining,
sorting, etc. and finally dumps the data on the screen or store in HDFS.
‫ فبنقدر نعمل‬.Python – JAVA – R ‫ بتاعي بطريقة أسهل من غير ما نبقا عارفين‬Processing logic ‫ خلتني أقدر اكتب ال‬tool ‫ دي‬Pig ‫ال‬
Pig Latin language ‫ بتتكون من جزئين ال‬Pig ‫ ال‬.‫ سهلة وبسيطة جدا‬high level language ‫ هي‬.Pig Latin ‫ مكتوبه بال‬Queries ‫ ب‬Processing
.‫ تاسك‬Map Reduce ‫ يحولها ل‬Logic ‫ بياخد ال‬Compiler ‫ أو ال‬Pig Latin Runtime ‫ بتاعنا ال‬logic ‫ بعد ما بنكتب ال‬.Pig Latin Runtime ‫وال‬
Hive: Data processing service using query:

- A data warehousing component which analyses data sets in a distributed environment
using SQL-like interface
- The query language of Hive is called hive query language (HQL)
- 2 basic components: Hive command line and JDBC/ ODBC driver
- Supports user defined functions (UDF) to accomplish specific needs.
‫ كويس بسبب شغل‬SQL ‫ هنا استغلوا إن الناس كانت عارفه‬.Hadoop ecosystem ‫ مهمة جدا فال‬tool ‫ وهي‬.‫ بظبط‬Pig ‫ تقريبا شبه ال‬Hive ‫ال‬
‫ اللي‬.HDFS ‫ اللي موجودة فال‬data ‫ عال‬Processing ‫ فيبقا سهل إنهم يعملوا‬.‫ أوي‬SQL‫ وعملوا لغة قريبة من ال‬relational databases ‫ال‬
.Map Reduce tasks ‫ ل‬Queries ‫ بتبدأ تترجم ال‬HIVE ‫ دي ال‬Query ‫بيحصل إن بعد ما بنكتب ال‬
Mahout: Machine learning:

- Provides an environment for creating machine learning applications
- It performs collaborative filtering, clustering and classification
- Provides a command line to invoke various algorithms
- It has a predefined set of libraries (written in Java) which already contains different inbuilt
algorithms for different use cases
Clustering, classification, ‫ زي ال‬.ML algorithms ‫ موجود فيها عدد كبير جدا من ال‬Machine learning library ‫ هي‬mahout ‫ال‬
Codes from scratch ‫ دي بدل ما اكتب ال‬Library ‫ فبالتالي لو عايز اعمل أي تاسك من دول هستخدم ال‬.‫ وهكذا‬regression
Spark: In memory data processing

- A framework for real time data analytics in a distributed computing environment.
- Written in Scala and was originally developed at the University of California, Berkeley
- It executes in-memory computations to increase speed of data processing over Map-
Reduce
- 100x Faster than Hadoop for large scale data processing by exploiting in-memory
computations.
.real time processing ‫ بمعني إني أقدر اعمل‬real time data analytics ‫ خلتني أقدر اعمل‬Spark ‫ال‬
‫ بشكل‬Data streams ‫ على داتا بتجيلي بشكل‬process ‫ هنا بت‬Spark ‫ أما ال‬.HDFS ‫ على داتا متخزنه فال‬Process ‫ بت‬Map Reduce ‫ال‬
‫ بيتكتب عشان‬post ‫ لكل‬Process ‫ دلوقتي ومحتاجين نعمل‬Facebook ‫ عال‬Posts ‫ زي مثال ناس بتكتب‬.real time ‫ وبتشتغل عليها‬Continuous
.real time decisions ‫ فباخد‬.‫ دا‬Post ‫ ال‬Block ‫لو في عنف أو تنمر أو أي حاجه سيئة يبدأ ي‬
.map reduce ‫× من ال‬100 ‫ أسرع‬Spark ‫ال‬
HBase: NoSQL database:

- An open source, non-relational distributed database a NoSQL database.
- Supports all types of data and that is why, It’s capable of handling anything and
everything
- It is modelled after Google’s Bigtable
- It gives us a fault tolerant way of sorting sparse data
- It is written in Java and HBase applications can be written in REST, Avro and Thrift APIs
‫ بنستفيد منها إننا‬.Columns ‫ بتخزن الداتا بشكل‬Column oriented database ‫ دا عباره عن‬.NoSQL databases ‫ دا نوع من أنواع ال‬HBase ‫ال‬
.‫ بتاعي‬Application ‫ الداتا اللي متخزنه فيها من خالل ال‬access ‫ عشان نقدر ن‬App ‫ ل‬Backend ‫ وبقدر استخدمها ك‬.‫نخزن الداتا بأشكال مختلفة‬
Apache Drill: SQL on Hadoop:

- An open-source application which works with distributed environment to analyze large datasets
- Follows the ANSI SQL
- Supports different kind NoSQL databases and file systems
- For example: Azure Blob storage, Google cloud storage, HBase, MongoDB, MapR-DB HDFS, MapR-FS
Amazon S3, Swift, NAS and local files
- Combines a variety of data stores just by using a single query.
data ‫ فأكثر من‬contact data ‫ يعني بن‬.‫ واحد‬Query ‫ باستخدام‬data source ‫ ألكثر من‬Query ‫ بيخلينا نقدر نعمل‬Open source ‫ دا‬Apache drill ‫ال‬
‫ ويظبط شكله‬Query ‫ بيمسك ال‬Drill ‫ فال‬.data source 1, 10, 20, 3 ‫ واقوله نفذها على‬Query ‫ يعني ممكن اكتب‬.‫ واشتغل عليهم‬source
‫و‬Apache drill ‫ لل‬result‫ ويبعت ال‬Query ‫ ال‬execute‫ هيبدأ ي‬Data source ‫ اللي هيتعامل معاها وكل‬Data source ‫بحيث إنه يبقا مناسب لل‬
.final result ‫ ال‬combine ‫ بيبدأ ي‬Apache drill ‫ال‬
Oozie: Job scheduler:

- Oozie is a job scheduler in Hadoop ecosystem
- 2 Kinds of Oozie jobs: Oozie workflow and Oozie coordinator
- Oozie workflow: Sequential set of actions to be executed
- Oozie coordinator: Oozie job which are triggered when the data is made available to it or even
triggered base on time.
actions ‫ فببدأ احدد ال‬.‫ متكرر كل ساعة أو كل يوم أو كل أسبوع‬periodically ‫ بشكل‬run job ‫ يعني لو احنا محتاجين ن‬Job scheduler ‫ دا‬Oozie‫ال‬
run automatically ‫ ف كل ساعة مثال لوحده بيبدأ ي‬.set of actions need to be executed periodically ‫ ودا‬Oozie workflow ‫ لل‬to be executed
9 ‫ كل ساعتين أو كل يوم جمعة أو كل يوم‬Workflow ‫ يعني أقوله انه هينفذ ال‬time based ‫ عندنا طريقتين طريقة‬.‫ المطلوبة‬jobs ‫وينفذ ال‬
.‫فالشهر‬
‫ أو مثال اول ما يجيلك داتا من‬Available ‫ تبقا‬Data ‫ يعني بنفذه كل ما يحصل حاجه زي مثال أول ما ال‬based on specific event ‫أو ممكن يبقا‬
.‫ معين ابدأ نفذ التاسك وهكذا‬sensor
Apache Flume: data ingesting services: (IMPORT ONLY)

- Ingests unstructured and semi-structured data into HDFS.
- It helps us in collecting, aggregating and moving large number of datasets.
- It helps us to ingests online streaming data from various sources like network traffic, social media, email
messages, log files etc. in HDFS
HDFS ‫ لل‬Import ‫ يعني اعملهم‬HDFS ‫ عايزه اخدهم لل‬Semi structured – unstructured data ‫ يعني أي‬.data ‫ لل‬Import ‫ بيعمل‬Flume ‫ال‬
data streams ‫ اللي جايالي بشكل‬Data ‫ لل‬import‫ وكمان بيخليني أقدر اعمل‬.Apache flume ‫بعمل كده ب استخدام ال‬
Apache Sqoop: Data ingesting service

- Another data ingesting service
- Sqoop can impost as well as export structured data
from RDBMS
- Flume only ingests unstructured data or semi-
structured data in HDFS
Structured data ‫ بظبط بس بتشتغل على ال‬Flume ‫ زي ال‬Sqoop ‫ال‬
HDFS ‫ للداتا فال‬Export‫ و‬Import ‫وبتخليني أقدر اعمل‬

Apache Solr and Lucene:

- 2 services which are used for searching and indexing in Hadoop ecosystem.
- Apache Lucene is based on Java, which also helps in spell checking.
- Apache Lucene is the engine, Apache Solr is a complete application built around Lucene.
- Solr uses Apache Lucene Java search library for searching and indexing.
‫ على حاجه فيهم بشكل سريع‬search ‫ وعايز اعمل‬documents ‫ يعني لو عندي مجموعة من ال‬search engines ‫اإلثنين دول عاملين زي ال‬
.Library ‫ اللي بيستخدم ال‬App‫ دا ال‬Apache solar ‫ وال‬.Apache Lucene library ‫بستخدم ال‬
Zookeeper: Coordinator:
- An open-source server which enables highly reliable distributed coordination
- Apache Zookeeper coordinates with various Hadoop services in a distributed
environment
- Performs synchronization, configuration maintenance, grouping and naming.
Hadoop ecosystem ‫ اللي فال‬tools ‫ بتاعة هو إنه يتأكد إن ال‬Main purpose ‫ ال‬coordinator ‫ هو‬Zookeeper ‫اخر حاجه بقا ال‬
tools ‫ وال‬broken system ‫ أنا هيبقا عندي‬Zookeeper ‫ ومن غير ال‬.Interruption ‫ سوا من غير أي‬Communicate ‫بتقدر ت‬
.‫دي كلها مش هتبقا قادره تتعامل مع بعض‬

Big Data (Part2)

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Big Data (Part2)

Uploaded by

Copyright:

Available Formats

Selected 4 – Chapter 6 – Big data & Analytics.

Big Data (Part 2)

HDFS (Hadoop Distributed File System) Storage

YARN (Yet Another Resources Negotiator)

.Version1 ‫ مكنش موجود ف‬Hadoop V2 ‫ بدأ يظهر في‬YARN ‫ال‬

Map Reduce: Data processing using programming

Pig: Data processing service using Query:

Hive: Data processing service using query:

Mahout: Machine learning:

Spark: In memory data processing

.map reduce ‫× من ال‬100 ‫ أسرع‬Spark ‫ال‬

HBase: NoSQL database:

Apache Drill: SQL on Hadoop:

.final result ‫ ال‬combine ‫ بيبدأ ي‬Apache drill ‫ال‬

Oozie: Job scheduler:

.‫ معين ابدأ نفذ التاسك وهكذا‬sensor

Apache Flume: data ingesting services: (IMPORT ONLY)

Apache Sqoop: Data ingesting service

HDFS ‫ للداتا فال‬Export‫ و‬Import ‫وبتخليني أقدر اعمل‬

Apache Solr and Lucene:

.‫دي كلها مش هتبقا قادره تتعامل مع بعض‬

You might also like