pyspark.sql.DataFrame.repartition¶
-
DataFrame.
repartition
(numPartitions, *cols)[source]¶ Returns a new
DataFrame
partitioned by the given partitioning expressions. The resultingDataFrame
is hash partitioned.New in version 1.3.0.
- Parameters
- numPartitionsint
can be an int to specify the target number of partitions or a Column. If it is a Column, it will be used as the first partitioning column. If not specified, the default number of partitions is used.
- colsstr or
Column
partitioning columns.
Changed in version 1.6: Added optional arguments to specify the partitioning columns. Also made numPartitions optional if partitioning columns are specified.
Examples
>>> df.repartition(10).rdd.getNumPartitions() 10 >>> data = df.union(df).repartition("age") >>> data.show() +---+-----+ |age| name| +---+-----+ | 2|Alice| | 5| Bob| | 2|Alice| | 5| Bob| +---+-----+ >>> data = data.repartition(7, "age") >>> data.show() +---+-----+ |age| name| +---+-----+ | 2|Alice| | 5| Bob| | 2|Alice| | 5| Bob| +---+-----+ >>> data.rdd.getNumPartitions() 7 >>> data = data.repartition(3, "name", "age") >>> data.show() +---+-----+ |age| name| +---+-----+ | 5| Bob| | 5| Bob| | 2|Alice| | 2|Alice| +---+-----+