pyspark.RDD.aggregate

RDD.aggregate(zeroValue: U, seqOp: Callable[[U, T], U], combOp: Callable[[U, U], U]) → U

Aggregate the elements of each partition, and then the results for all the partitions, using the given combine functions and a neutral "zero value."
The functions op(t1, t2) are allowed to modify t1 and return it as their result value to avoid object allocation; however, they should not modify t2. The first function (seqOp) can return a different result type, U, than the element type of this RDD, T. Thus, we need one operation for merging a T into a U (seqOp) and one operation for merging two U's (combOp).
New in version 1.1.0.
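To make the distinction between the element type T and the result type U concrete, here is a minimal sketch (assuming an already-initialized SparkContext named sc, as the doctests below also do) that aggregates an RDD of strings (T = str) into a total character count (U = int):

>>> rdd = sc.parallelize(["ab", "cde", "f"])
>>> rdd.aggregate(0, lambda acc, s: acc + len(s), lambda a, b: a + b)
6

Here the seqOp merges one string into the running int within each partition, while the combOp adds the per-partition ints together.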
Parameters
- zeroValue : U
  the initial value for the accumulated result of each partition
- seqOp : function
  a function used to accumulate results within a partition
- combOp : function
  an associative function used to combine results from different partitions

Returns
- U
  the aggregated result
Examples
>>> seqOp = (lambda x, y: (x[0] + y, x[1] + 1))
>>> combOp = (lambda x, y: (x[0] + y[0], x[1] + y[1]))
>>> sc.parallelize([1, 2, 3, 4]).aggregate((0, 0), seqOp, combOp)
(10, 4)
>>> sc.parallelize([]).aggregate((0, 0), seqOp, combOp)
(0, 0)
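The (sum, count) accumulator above is the usual building block for a mean. As a sketched continuation of the same doctest session (reusing the seqOp and combOp defined above, and the same assumed sc), one way to compute an average is:

>>> total, count = sc.parallelize([1.0, 2.0, 3.0, 4.0], 2).aggregate((0.0, 0), seqOp, combOp)
>>> total / count
2.5

Splitting the data across two partitions exercises combOp, which merges the per-partition (sum, count) pairs before the division.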