
How Apache Spark works: architecture and operating principles

Spark consists of the following components (a short usage sketch follows this list):

- Core;
- SQL – a tool for analytical data processing using SQL queries;
- Streaming – an extension for processing streaming data, which we covered in detail here and here;
- MLlib – a set of machine learning libraries;
- GraphX – a module for distributed graph processing.
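
As a quick, hedged illustration of the Core and SQL components, here is a minimal PySpark sketch (the data and names below are made up for the example):

from pyspark.sql import SparkSession

# Core: the SparkSession is the entry point to Spark functionality
spark = SparkSession.builder.appName("components-demo").getOrCreate()

# Spark SQL: register a small DataFrame as a view and query it with SQL
df = spark.createDataFrame([("Berlin", 3600000), ("Paris", 2100000)],
                           ["city", "population"])
df.createOrReplaceTempView("cities")
spark.sql("SELECT city FROM cities WHERE population > 3000000").show()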

The main difference between the two frameworks (Hadoop MapReduce and Spark) is how they access data.


Hadoop writes data to disk at every step of the MapReduce algorithm, while Spark performs all operations in memory. Thanks to this, Spark can be up to 100 times faster and can process data as a stream.
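
As a minimal sketch of what "operations in memory" buys you in practice, caching keeps an intermediate result in RAM so repeated actions do not recompute it (the DataFrame below is synthetic, purely for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Any DataFrame will do; a small synthetic range keeps the example self-contained.
df = spark.range(1000000)

# cache() keeps the data in executor memory, so repeated actions
# reuse it instead of recomputing the lineage from scratch.
df.cache()
df.count()   # first action materializes the cache
df.count()   # served from memory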

How to control the number of partitions when reading / writing files in Spark?
By now you may have wondered how Spark partitions the data when reading text files. Let's read a simple text file and check the number of partitions. We will use the CSV file that was the input dataset in my first post, [pyspark dataframe basic operations].

df2 = spark.read.format('csv') \
    .options(delimiter=',', header=True) \
    .load('/Users/ecom-ahmed.noufel/Downloads/population.csv')
df2.rdd.getNumPartitions()
1

We can see that only 1 partition is created here. Let's break this down: Spark by default creates 1 partition for every 128 MB of the file. So if you are reading a file of size 1 GB, it creates 8 partitions.
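
As a rough back-of-the-envelope check (the sizes below are made up for illustration; Spark's actual split planning also weighs a few other settings):

import math

default_max_partition_bytes = 128 * 1024 * 1024   # 128 MB, the default
file_size_bytes = 1 * 1024 * 1024 * 1024          # a hypothetical 1 GB file

# Roughly one partition per maxPartitionBytes-sized chunk of input
print(math.ceil(file_size_bytes / default_max_partition_bytes))  # 8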
So how do you control the number of bytes a single partition can hold? The file being read here is 3.6 MB, hence only 1 partition is created in this case. The number of partitions is determined by the spark.sql.files.maxPartitionBytes parameter, which is set to 128 MB by default.

spark.conf.get("spark.sql.files.maxPartitionBytes") # -> 128 MB by default
'134217728'

Note: the files being read must be splittable for Spark to create multiple partitions when reading. For files compressed with non-splittable codecs such as gzip, snappy or lzo, a single partition is created irrespective of the size of the file.
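
For instance, reading a gzipped copy of the same CSV should come back as a single partition no matter how large it is (a sketch; the .csv.gz path is hypothetical and the same SparkSession is assumed):

# Gzip is not a splittable codec, so Spark cannot divide the file
# into chunks and the whole file ends up in one partition.
df_gz = spark.read.format('csv') \
    .options(delimiter=',', header=True) \
    .load('/Users/ecom-ahmed.noufel/Downloads/population.csv.gz')
df_gz.rdd.getNumPartitions()  # -> 1, regardless of file size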
Let’s manually set the spark.sql.files.maxPartitionBytes  to 1MB, so that we get 4 partitions
upon reading the file.

spark.conf.set("spark.sql.files.maxPartitionBytes", 1000000)
spark.conf.get("spark.sql.files.maxPartitionBytes")
'1000000'

df3 = spark.read.format('csv') \
    .options(delimiter=',', header=True) \
    .load('/Users/ecom-ahmed.noufel/Downloads/population.csv')
df3.rdd.getNumPartitions()
4

As discussed, 4 partitions were created.


Caution: setting spark.sql.files.maxPartitionBytes to 1 MB above was done only for the purpose of this demo and is not recommended in practice. It is recommended to use the default setting, or to choose a value based on your input size and cluster resources.
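
If you did change it during a session, you can restore the default afterwards (a small sketch using the same SparkSession):

# Reset the property so later reads are planned with the 128 MB default again.
spark.conf.unset("spark.sql.files.maxPartitionBytes")
spark.conf.get("spark.sql.files.maxPartitionBytes")  # back to the default ('134217728')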
