
How Apache Spark works: architecture and operating principles

Spark consists of the following components (a short usage sketch follows this list):

- Core;
- SQL – a tool for analytical data processing using SQL queries;
- Streaming – an extension for processing streaming data, which we covered in detail here and here;
- MLlib – a set of machine learning libraries;
- GraphX – a module for distributed graph processing.
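
As a quick, hedged illustration of the Core and SQL components, here is a minimal PySpark sketch (the data and names below are made up for the example):

from pyspark.sql import SparkSession

# Core: the SparkSession is the entry point to Spark functionality
spark = SparkSession.builder.appName("components-demo").getOrCreate()

# Spark SQL: register a small DataFrame as a view and query it with SQL
df = spark.createDataFrame([("Berlin", 3600000), ("Paris", 2100000)],
                           ["city", "population"])
df.createOrReplaceTempView("cities")
spark.sql("SELECT city FROM cities WHERE population > 3000000").show()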

The main difference between the two frameworks (Hadoop MapReduce and Spark) is how they access data.


Hadoop writes data to disk at every step of the MapReduce algorithm, while Spark performs all operations in memory. Thanks to this, Spark can be up to 100 times faster and can process data as a stream.
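
As a minimal sketch of what "operations in memory" buys you in practice, caching keeps an intermediate result in RAM so repeated actions do not recompute it (the DataFrame below is synthetic, purely for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Any DataFrame will do; a small synthetic range keeps the example self-contained.
df = spark.range(1000000)

# cache() keeps the data in executor memory, so repeated actions
# reuse it instead of recomputing the lineage from scratch.
df.cache()
df.count()   # first action materializes the cache
df.count()   # served from memory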

How to control the number of partitions when reading / writing files in Spark?
By now you may have wondered how Spark partitions the data when reading text files. Let's read a simple text file and check the number of partitions. We will use the CSV file that was the input dataset in my first post, [pyspark dataframe basic operations].

df2 = spark.read.format('csv') \
    .options(delimiter=',', header=True) \
    .load('/Users/ecom-ahmed.noufel/Downloads/population.csv')
df2.rdd.getNumPartitions()
1

We can see that only 1 partition is created here. Let's break this down: Spark by default creates 1 partition for every 128 MB of the file. So if you are reading a file of size 1 GB, it creates 8 partitions.
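
As a rough back-of-the-envelope check (the sizes below are made up for illustration; Spark's actual split planning also weighs a few other settings):

import math

default_max_partition_bytes = 128 * 1024 * 1024   # 128 MB, the default
file_size_bytes = 1 * 1024 * 1024 * 1024          # a hypothetical 1 GB file

# Roughly one partition per maxPartitionBytes-sized chunk of input
print(math.ceil(file_size_bytes / default_max_partition_bytes))  # 8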
So how do you control the number of bytes a single partition can hold? The file being read here is 3.6 MB, hence only 1 partition is created in this case. The number of partitions is determined by the spark.sql.files.maxPartitionBytes parameter, which is set to 128 MB by default.

spark.conf.get("spark.sql.files.maxPartitionBytes") # -> 128 MB by default
'134217728'

Note: the files being read must be splittable for Spark to create multiple partitions when reading. For files compressed with non-splittable codecs such as gzip, snappy or lzo, a single partition is created irrespective of the size of the file.
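
For instance, reading a gzipped copy of the same CSV should come back as a single partition no matter how large it is (a sketch; the .csv.gz path is hypothetical and the same SparkSession is assumed):

# Gzip is not a splittable codec, so Spark cannot divide the file
# into chunks and the whole file ends up in one partition.
df_gz = spark.read.format('csv') \
    .options(delimiter=',', header=True) \
    .load('/Users/ecom-ahmed.noufel/Downloads/population.csv.gz')
df_gz.rdd.getNumPartitions()  # -> 1, regardless of file size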
Let’s manually set the spark.sql.files.maxPartitionBytes  to 1MB, so that we get 4 partitions
upon reading the file.

spark.conf.set("spark.sql.files.maxPartitionBytes", 1000000)
spark.conf.get("spark.sql.files.maxPartitionBytes")
'1000000'

df3 = spark.read.format('csv') \
    .options(delimiter=',', header=True) \
    .load('/Users/ecom-ahmed.noufel/Downloads/population.csv')
df3.rdd.getNumPartitions()
4

As discussed, 4 partitions were created.


Caution: setting spark.sql.files.maxPartitionBytes to 1 MB above was done only for the purpose of this demo and is not recommended in practice. It is recommended to use the default setting, or to choose a value based on your input size and cluster resources.
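
If you did change it during a session, you can restore the default afterwards (a small sketch using the same SparkSession):

# Reset the property so later reads are planned with the 128 MB default again.
spark.conf.unset("spark.sql.files.maxPartitionBytes")
spark.conf.get("spark.sql.files.maxPartitionBytes")  # back to the default ('134217728')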
