

How to transform Spark dataframe to Polars dataframe?


Asked 1 year ago Modified 11 months ago Viewed 5k times

I wonder how I can transform a Spark dataframe to a Polars dataframe.

Let's say I have this code in PySpark:


df = spark.sql('''select * from tmp''')

I can easily transform it to a pandas dataframe using .toPandas(). Is there something similar in Polars, as I need to get a Polars dataframe for further processing?

python pyspark python-polars

asked Aug 2, 2022 at 7:08 by s1nbad

AFAIK from the doc, spark does not have polars support yet. – samkart Aug 2, 2022 at 9:07


4 Answers


Context
PySpark uses Arrow to convert to pandas. Polars is an abstraction over Arrow memory, so we can hijack the API that Spark uses internally to create the Arrow data and use that to create the Polars DataFrame.

TLDR
Given a Spark session, we can write:

import pyarrow as pa
import polars as pl
from pyspark.sql import SQLContext

sql_context = SQLContext(spark)

data = [('James', [1, 2])]
spark_df = sql_context.createDataFrame(data=data, schema=["name", "properties"])

# _collect_as_arrow() is a private PySpark API: it returns the collected
# result as a list of Arrow RecordBatches, which Polars can consume directly.
df = pl.from_arrow(pa.Table.from_batches(spark_df._collect_as_arrow()))

print(df)

shape: (1, 2)
┌───────┬────────────┐
│ name ┆ properties │
│ --- ┆ --- │
│ str ┆ list[i64] │
╞═══════╪════════════╡
│ James ┆ [1, 2] │
└───────┴────────────┘

Serialization steps
This will actually be faster than the toPandas() provided by Spark itself, because it saves an extra copy.

toPandas() will lead to this serialization/copy step:

spark-memory -> arrow-memory -> pandas-memory

With the query provided we have:

spark-memory -> arrow/polars-memory
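
Wrapped up as a small reusable helper (a sketch only: spark_to_polars is a name introduced here, not a Spark or Polars API, and _collect_as_arrow is private, so it may change between Spark versions and still collects everything to the driver):

import pyarrow as pa
import polars as pl

def spark_to_polars(spark_df) -> pl.DataFrame:
    # Collect the result as Arrow RecordBatches, skipping the extra
    # pandas copy that toPandas() would make.
    batches = spark_df._collect_as_arrow()
    if not batches:
        # pa.Table.from_batches cannot infer a schema from an empty list,
        # so fall back to an empty Polars DataFrame.
        return pl.DataFrame()
    return pl.from_arrow(pa.Table.from_batches(batches))

pl_df = spark_to_polars(spark_df)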
edited Aug 8, 2022 at 5:10, answered Aug 2, 2022 at 10:07 by ritchie46

2 of 6 27/08/2023, 07:40
python - How to transform Spark dataframe to Polars dataf... https://stackoverflow.com/questions/73203318/how-to-trans...

Follow ritchie46
10.4k 1 24 43

Join Stack Overflow to find the best answer to your technical question, help others answer theirs.

Sign up with email Sign up with Google Sign up with GitHub Sign up with Facebook

3 of 6 27/08/2023, 07:40
python - How to transform Spark dataframe to Polars dataf... https://stackoverflow.com/questions/73203318/how-to-trans...

Polars is not distributed, while Spark is

Note that Polars is a single-machine, multi-threaded DataFrame library. Spark, in contrast, is a multi-machine, multi-threaded DataFrame library: it distributes the DataFrame across multiple machines.

Transforming a Spark DataFrame with Polars code, at scale

If your dataset requires this, because the DataFrame does not fit onto a single machine, then _collect_as_arrow, to_dict and from_pandas will not work for you.

If you want to transform your Spark DataFrame using some Polars code (Spark -> Polars -> Spark), you can do this in a distributed and scalable way using mapInArrow:

import pyarrow as pa
import polars as pl

from typing import Iterator

# The example data as a Spark DataFrame
data = [(1, 1.0), (2, 2.0)]
spark_df = spark.createDataFrame(data=data, schema=['id', 'value'])
spark_df.show()

# Define your transformation on a Polars DataFrame
# Here we multiply the 'value' column by 2
def polars_transform(df: pl.DataFrame) -> pl.DataFrame:
    return df.select([
        pl.col('id'),
        pl.col('value') * 2
    ])

# Convert a part of the Spark DataFrame into a Polars DataFrame and call
# `polars_transform` on it
def arrow_transform(iter: Iterator[pa.RecordBatch]) -> Iterator[pa.RecordBatch]:
    # Transform a single RecordBatch at a time so the data fits into memory
    # Increase spark.sql.execution.arrow.maxRecordsPerBatch if batches are too small
    for batch in iter:
        polars_df = pl.from_arrow(pa.Table.from_batches([batch]))
        polars_df_2 = polars_transform(polars_df)
        for b in polars_df_2.to_arrow().to_batches():
            yield b

# Map the Spark DataFrame to Arrow, then to Polars, run the `polars_transform`
# on it, and transform everything back to a Spark DataFrame, all distributed
# and scalable
spark_df_2 = spark_df.mapInArrow(arrow_transform, schema='id long, value double')


spark_df_2.show()
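
With the toy data above, the doubled values should come back like this (row order may vary, since the transformation runs distributed):

+---+-----+
| id|value|
+---+-----+
|  1|  2.0|
|  2|  4.0|
+---+-----+

If the Arrow batches arriving in arrow_transform are too small, raise spark.sql.execution.arrow.maxRecordsPerBatch (10,000 rows by default).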

answered Sep 23, 2022 at 9:49 by EnricoM

You can't directly convert from Spark to Polars. But you can go from Spark to pandas, then create a dictionary out of the pandas data, and pass it to Polars like this:

pandas_df = df.toPandas()
data = pandas_df.to_dict('list')
pl_df = pl.DataFrame(data)

As @ritchie46 pointed out, you can use pl.from_pandas() instead of creating a dictionary:

pandas_df = df.toPandas()
pl_df = pl.from_pandas(pandas_df)
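
Incidentally, toPandas() itself is considerably faster when Arrow is enabled (a standard Spark setting, off by default):

# Let toPandas() use Arrow for the Spark -> pandas conversion.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

pandas_df = df.toPandas()
pl_df = pl.from_pandas(pandas_df)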

Also, as mentioned in @DataPsycho's answer, this may cause an out-of-memory exception for large datasets. This is because toPandas() will collect the data to the driver first. In this case, it is better to write to a csv or parquet file and then read it back. But avoid repartition(1), because this will move the data to the driver too.
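
A minimal sketch of that staging step (the path is a placeholder; any location both Spark and Polars can reach works):

import polars as pl

# Stage through Parquet instead of collecting via toPandas().
path = "/tmp/staged_table.parquet"  # placeholder location
df.write.mode("overwrite").parquet(path)

# Spark writes a directory of part files; a glob picks them all up.
pl_df = pl.read_parquet(f"{path}/*.parquet")
# or lazily, to keep memory usage down:
# pl_df = pl.scan_parquet(f"{path}/*.parquet").collect()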

The toPandas() code I have provided is suitable for datasets that will fit in your driver memory. If you have the option to increase the driver memory, you can do so by increasing the value of spark.driver.memory.
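
For instance (the 8g figure is only a placeholder; the driver JVM reads this setting at startup, so it must be set before the session is created, not on a running one):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.driver.memory", "8g")  # placeholder size
    .getOrCreate()
)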

edited Aug 2, 2022 at 10:13, answered Aug 2, 2022 at 9:12 by viggnah

You should never go to Polars via a Python dictionary. Polars has a pl.from_pandas function; that will save you a lot of heap allocations and ensure type correctness. – ritchie46 Aug 2, 2022 at 9:59

Yes, I thought about converting my data into a pandas dataframe first, but I don't think that would work with the amount of data I'm working with :( Hopefully, Spark will add Polars support soon. – s1nbad Aug 2, 2022 at 11:03


It would be good to know your use case. Heavy transformations should be done either with Spark or with Polars; you should not mix both dataframes. Whatever Polars can do, Spark can do as well, so you should do all of your transformations with Spark, then write the result as a csv or parquet file. Then read the transformed file with Polars and everything will run blazing fast. But if you are interested in plotting, read it directly into pandas and use matplotlib. So if you have a Spark dataframe, you can write it as csv:

(transformed_df
    .repartition(1)
    .write
    .option("header", True)
    .option("delimiter", ",")  # by default it is ,
    .csv("<your_path>")
)

Now read it with Polars or pandas using read_csv. If you have a small amount of memory in the driver node of Spark, then transformed_df.toPandas() might fail because of not having enough memory.
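
A minimal sketch of the read-back step (keeping the <your_path> placeholder from the snippet above):

import polars as pl

# Spark writes a directory of part files; with repartition(1) there is
# exactly one, but the glob also covers the multi-file case.
pl_df = pl.read_csv("<your_path>/*.csv")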

answered Aug 2, 2022 at 9:46 by DataPsycho

I mostly work with Spark, but sometimes I have to create a pandas dataframe for some extra analysis/drawing of graphs. So I wanted to know if there is such a possibility while working with Polars :) – s1nbad Aug 2, 2022 at 10:56

You should keep using pandas for plotting unfortunately. – DataPsycho Aug 2, 2022 at 11:07
