pyspark.sql.DataFrame.repartition
- DataFrame.repartition(numPartitions, *cols)
Returns a new DataFrame partitioned by the given partitioning expressions. The resulting DataFrame is hash partitioned.
New in version 1.3.0.
Changed in version 3.4.0: Supports Spark Connect.
- Parameters
- numPartitions : int or Column
Can be an int to specify the target number of partitions, or a Column. If it is a Column, it will be used as the first partitioning column. If not specified, the default number of partitions is used. (A sketch of the Column form follows the examples below.)
- cols : str or Column
Partitioning columns.
Changed in version 1.6.0: Added optional arguments to specify the partitioning columns. Also made numPartitions optional if partitioning columns are specified.
- Returns
DataFrame
Repartitioned DataFrame.
Examples
>>> from pyspark.sql import functions as sf
>>> df = spark.range(0, 64, 1, 9).withColumn(
...     "name", sf.concat(sf.lit("name_"), sf.col("id").cast("string"))
... ).withColumn(
...     "age", sf.col("id") - 32
... )
>>> df.select(
...     sf.spark_partition_id().alias("partition")
... ).distinct().sort("partition").show()
+---------+
|partition|
+---------+
|        0|
|        1|
|        2|
|        3|
|        4|
|        5|
|        6|
|        7|
|        8|
+---------+
Repartition the data into 10 partitions.
>>> df.repartition(10).select(
...     sf.spark_partition_id().alias("partition")
... ).distinct().sort("partition").show()
+---------+
|partition|
+---------+
|        0|
|        1|
|        2|
|        3|
|        4|
|        5|
|        6|
|        7|
|        8|
|        9|
+---------+
Repartition the data into 7 partitions by the ‘age’ column.
>>> df.repartition(7, "age").select(
...     sf.spark_partition_id().alias("partition")
... ).distinct().sort("partition").show()
+---------+
|partition|
+---------+
|        0|
|        1|
|        2|
|        3|
|        4|
|        5|
|        6|
+---------+
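Because the result is hash partitioned, all rows sharing the same values of the partitioning columns land in the same partition. A minimal sketch of that property, using a hypothetical df2 whose key column takes only two values, so exactly two distinct (key, partition) pairs can exist:

>>> df2 = spark.range(0, 8).withColumn("key", sf.col("id") % 2)
>>> df2.repartition(4, "key").select(
...     "key", sf.spark_partition_id().alias("partition")
... ).distinct().count()
2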
Repartition the data into 3 partitions by the ‘name’ and ‘age’ columns.
>>> df.repartition(3, "name", "age").select(
...     sf.spark_partition_id().alias("partition")
... ).distinct().sort("partition").show()
+---------+
|partition|
+---------+
|        0|
|        1|
|        2|
+---------+
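As noted in the parameter description, a Column may be passed in place of numPartitions; it then becomes the first partitioning column and the partition count falls back to the default. A minimal sketch of that form (the call is skipped in doctests because the resulting count depends on the spark.sql.shuffle.partitions setting, and DataFrame.rdd is not available under Spark Connect):

>>> df.repartition(sf.col("age")).rdd.getNumPartitions()  # doctest: +SKIP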