RDD.combineByKey
Generic function to combine the elements for each key using a custom set of aggregation functions.
Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a “combined type” C.
To avoid memory allocation, both mergeValue and mergeCombiners are allowed to modify and return their first argument instead of creating a new C.
In addition, users can control the partitioning of the output RDD.
New in version 0.7.0.
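The interplay of the three functions described above can be sketched in plain Python, without Spark. This is only a local model, not Spark's implementation: the helper `combine_by_key_local` and the representation of partitions as nested lists are hypothetical, chosen for illustration. Values are first combined within each partition, then the per-partition results are merged, and both merge functions modify and return their first argument.

```python
def combine_by_key_local(partitions, create_combiner, merge_value, merge_combiners):
    """Hypothetical local model of combineByKey over a list of partitions."""
    # Phase 1: combine values within each partition independently.
    per_partition = []
    for part in partitions:
        combiners = {}
        for key, value in part:
            if key in combiners:
                # Key already seen in this partition: fold the V into its C.
                combiners[key] = merge_value(combiners[key], value)
            else:
                # First value for this key in this partition: turn the V into a C.
                combiners[key] = create_combiner(value)
        per_partition.append(combiners)

    # Phase 2: merge the per-partition combiners for each key.
    merged = {}
    for combiners in per_partition:
        for key, combiner in combiners.items():
            if key in merged:
                merged[key] = merge_combiners(merged[key], combiner)
            else:
                merged[key] = combiner
    return merged


def to_list(value):
    return [value]


def append(acc, value):
    acc.append(value)  # modifies the first argument in place
    return acc


def extend(acc, other):
    acc.extend(other)  # modifies the first argument in place
    return acc


# Two illustrative partitions holding (key, value) pairs.
partitions = [[("a", 1), ("b", 1)], [("a", 2)]]
result = combine_by_key_local(partitions, to_list, append, extend)
print(sorted(result.items()))  # [('a', [1, 2]), ('b', [1])]
```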
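Parameters
createCombiner : function
    a function to turn a V into a C
mergeValue : function
    a function to merge a V into a C
mergeCombiners : function
    a function to combine two C’s into a single one
numPartitions : int, optional
    the number of partitions in the new RDD
partitionFunc : function, optional
    function to compute the partition index
Returns
RDD
    an RDD containing the keys and the aggregated result for each key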
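How the number of partitions and the partition-index function could work together can be sketched as follows. This is a simplified model under stated assumptions: the helpers `assign_partition` and `first_char_code` are hypothetical, and reducing the function's result modulo the partition count is an assumption about how a valid index is obtained, not a statement of Spark's internals.

```python
def assign_partition(key, num_partitions, partition_func):
    """Hypothetical helper: map a key to a partition index in [0, num_partitions)."""
    # Assumption: the raw value from partition_func is reduced modulo the
    # number of partitions to obtain a valid index.
    return partition_func(key) % num_partitions


def first_char_code(key):
    """Hypothetical deterministic partition function for non-empty string keys."""
    return ord(key[0])


keys = ["a", "b", "c", "a"]
indices = [assign_partition(k, 2, first_char_code) for k in keys]
print(indices)  # [1, 0, 1, 1]
```

Equal keys always receive the same index, so all combiners for a given key end up in the same output partition.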
See also
RDD.reduceByKey()
RDD.aggregateByKey()
RDD.foldByKey()
RDD.groupByKey()
Notes
V and C can be different: for example, one might group an RDD of type (Int, Int) into an RDD of type (Int, List[Int]).
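Another case where the combined type differs from the value type is a per-key mean, where V is a number and C is a (sum, count) pair. The sketch below applies the three functions by hand to the values of a single key split across two partitions. The function names and the sample values are illustrative assumptions.

```python
from functools import reduce


def create_combiner(value):
    # Turn a V (a number) into a C (a running (sum, count) pair).
    return (value, 1)


def merge_value(acc, value):
    # Merge one more V into an existing C.
    total, count = acc
    return (total + value, count + 1)


def merge_combiners(left, right):
    # Combine two C's, for example from different partitions.
    return (left[0] + right[0], left[1] + right[1])


def combine_partition(values):
    """Hypothetical helper: combine one key's values within one partition."""
    first, rest = values[0], values[1:]
    return reduce(merge_value, rest, create_combiner(first))


# Values for one key, assumed to be split across two partitions.
part_one = [3, 5]
part_two = [10]

combined = merge_combiners(combine_partition(part_one), combine_partition(part_two))
mean = combined[0] / combined[1]
print(combined, mean)  # (18, 3) 6.0
```

With an RDD of (key, number) pairs, the same three functions could be passed as `rdd.combineByKey(create_combiner, merge_value, merge_combiners)`, followed by `mapValues` to divide each sum by its count.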
Examples
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
>>> def to_list(a):
...     return [a]
...
>>> def append(a, b):
...     a.append(b)
...     return a
...
>>> def extend(a, b):
...     a.extend(b)
...     return a
...
>>> sorted(rdd.combineByKey(to_list, append, extend).collect())
[('a', [1, 2]), ('b', [1])]