pyspark.RDD.repartition

RDD.repartition(numPartitions)

Return a new RDD that has exactly numPartitions partitions.

Can increase or decrease the level of parallelism in this RDD. Internally, this uses a shuffle to redistribute data. If you are decreasing the number of partitions in this RDD, consider using coalesce, which can avoid performing a shuffle.
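The difference between a shuffle-based repartition and a shuffle-free coalesce can be sketched with a toy model in plain Python. This is only an illustration under simplifying assumptions, not Spark's implementation: nested lists stand in for partitions, the helper names `toy_repartition` and `toy_coalesce` are hypothetical, and round-robin placement is an assumed stand-in for the real shuffle.

```python
# Toy model only: plain Python lists stand in for RDD partitions.
# This is NOT Spark's implementation; both function names are hypothetical.

def toy_repartition(partitions, num_partitions):
    """Model a shuffle: any element may move to any output partition."""
    out = [[] for _ in range(num_partitions)]
    i = 0
    for part in partitions:
        for item in part:
            out[i % num_partitions].append(item)  # assumed round-robin placement
            i += 1
    return out


def toy_coalesce(partitions, num_partitions):
    """Model a shuffle-free merge: whole input partitions are concatenated."""
    n = len(partitions)
    if num_partitions >= n:
        # Without a shuffle, the partition count cannot grow.
        return [list(p) for p in partitions]
    out = [[] for _ in range(num_partitions)]
    for idx, part in enumerate(partitions):
        # Map each input partition, intact, onto one output partition.
        out[idx * num_partitions // n].extend(part)
    return out


parts = [[1], [2, 3], [4, 5], [6, 7]]
shuffled = toy_repartition(parts, 2)   # elements are redistributed individually
merged = toy_coalesce(parts, 2)        # input partitions stay together
grown = toy_repartition(parts, 10)     # a shuffle can also add partitions
```

In this model `merged` keeps each input partition intact, while `shuffled` splits them across outputs, which is why decreasing the partition count with coalesce can be cheaper. In PySpark itself the corresponding calls are `rdd.repartition(n)` and `rdd.coalesce(n)`.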

New in version 1.0.0.

Parameters
numPartitions : int

the number of partitions in the new RDD

Returns
RDD

an RDD with exactly numPartitions partitions

Examples

>>> rdd = sc.parallelize([1,2,3,4,5,6,7], 4)
>>> sorted(rdd.glom().collect())
[[1], [2, 3], [4, 5], [6, 7]]
>>> len(rdd.repartition(2).glom().collect())
2
>>> len(rdd.repartition(10).glom().collect())
10