pyspark.RDD.partitionBy
RDD.partitionBy(numPartitions: Optional[int], partitionFunc: Callable[[K], int] = <function portable_hash>) → pyspark.rdd.RDD[Tuple[K, V]]
Return a copy of the RDD partitioned using the specified partitioner.
New in version 0.7.0.
- Parameters
- numPartitions : int, optional
the number of partitions in the new RDD
- partitionFunc : function, optional, default portable_hash
function to compute the partition index
- Returns
- RDD
an RDD partitioned using the specified partitioner
Examples
>>> pairs = sc.parallelize([1, 2, 3, 4, 2, 4, 1]).map(lambda x: (x, x))
>>> sets = pairs.partitionBy(2).glom().collect()
>>> len(set(sets[0]).intersection(set(sets[1])))
0
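
As a rough illustration of how partitionFunc and numPartitions interact, the sketch below mimics the bucketing in plain Python, without Spark. It assumes that the partition index for a key is partitionFunc(key) % numPartitions. The names simulate_partition_by and first_letter are hypothetical helpers for this sketch and are not part of PySpark, and the built-in hash stands in for portable_hash as the default.

```python
from typing import Callable, Hashable, Iterable, List, Tuple


def simulate_partition_by(
    pairs: Iterable[Tuple[Hashable, object]],
    num_partitions: int,
    partition_func: Callable[[Hashable], int] = hash,
) -> List[List[Tuple[Hashable, object]]]:
    """Group (key, value) pairs into buckets (hypothetical helper, not Spark).

    Assumption: the bucket for a key is partition_func(key) % num_partitions.
    """
    buckets: List[List[Tuple[Hashable, object]]] = [
        [] for _ in range(num_partitions)
    ]
    for key, value in pairs:
        # Every pair with the same key lands in the same bucket.
        buckets[partition_func(key) % num_partitions].append((key, value))
    return buckets


def first_letter(key: str) -> int:
    """Hypothetical partition function: route keys by their first character."""
    return ord(key[0])


pairs = [("apple", 1), ("avocado", 2), ("banana", 3), ("blueberry", 4)]
buckets = simulate_partition_by(pairs, 2, first_letter)

for index, bucket in enumerate(buckets):
    print(index, bucket)
```

With a live SparkContext, the corresponding call would presumably be sc.parallelize(pairs).partitionBy(2, first_letter), followed by glom().collect() to inspect the contents of each partition.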