pyspark.sql.DataFrame.repartition
- DataFrame.repartition(numPartitions, *cols)
Returns a new DataFrame partitioned by the given partitioning expressions. The resulting DataFrame is hash partitioned.
New in version 1.3.0.
Changed in version 3.4.0: Supports Spark Connect.
- Parameters
- numPartitions : int or Column
Can be an int to specify the target number of partitions, or a Column. If it is a Column, it will be used as the first partitioning column. If not specified, the default number of partitions is used. (A sketch of the Column form follows the examples below.)
- cols : str or Column
Partitioning columns.
Changed in version 1.6.0: Added optional arguments to specify the partitioning columns. Also made numPartitions optional if partitioning columns are specified.
- Returns
DataFrame
Repartitioned DataFrame.
Examples
>>> from pyspark.sql import functions as sf
>>> df = spark.range(0, 64, 1, 9).withColumn(
...     "name", sf.concat(sf.lit("name_"), sf.col("id").cast("string"))
... ).withColumn(
...     "age", sf.col("id") - 32
... )
>>> df.select(
...     sf.spark_partition_id().alias("partition")
... ).distinct().sort("partition").show()
+---------+
|partition|
+---------+
|        0|
|        1|
|        2|
|        3|
|        4|
|        5|
|        6|
|        7|
|        8|
+---------+
Repartition the data into 10 partitions.
>>> df.repartition(10).select(
...     sf.spark_partition_id().alias("partition")
... ).distinct().sort("partition").show()
+---------+
|partition|
+---------+
|        0|
|        1|
|        2|
|        3|
|        4|
|        5|
|        6|
|        7|
|        8|
|        9|
+---------+
Repartition the data into 7 partitions by the ‘age’ column.
>>> df.repartition(7, "age").select(
...     sf.spark_partition_id().alias("partition")
... ).distinct().sort("partition").show()
+---------+
|partition|
+---------+
|        0|
|        1|
|        2|
|        3|
|        4|
|        5|
|        6|
+---------+
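Because the result is hash partitioned, all rows sharing the same values of the partitioning columns land in the same partition. A minimal sketch of that property, using a hypothetical df2 whose key column takes only two values, so exactly two distinct (key, partition) pairs can exist:

>>> df2 = spark.range(0, 8).withColumn("key", sf.col("id") % 2)
>>> df2.repartition(4, "key").select(
...     "key", sf.spark_partition_id().alias("partition")
... ).distinct().count()
2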
Repartition the data into 3 partitions by the ‘name’ and ‘age’ columns.
>>> df.repartition(3, "name", "age").select(
...     sf.spark_partition_id().alias("partition")
... ).distinct().sort("partition").show()
+---------+
|partition|
+---------+
|        0|
|        1|
|        2|
+---------+
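As noted in the parameter description, a Column may be passed in place of numPartitions; it then becomes the first partitioning column and the partition count falls back to the default. A minimal sketch of that form (the call is skipped in doctests because the resulting count depends on the spark.sql.shuffle.partitions setting, and DataFrame.rdd is not available under Spark Connect):

>>> df.repartition(sf.col("age")).rdd.getNumPartitions()  # doctest: +SKIP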