pyspark.sql.functions.collect_set(col)
Aggregate function: returns a set of objects with duplicate elements eliminated.
New in version 1.6.0.
Changed in version 3.4.0: Supports Spark Connect.
Parameters
col : Column or str
    target column to compute on.
Returns
Column
    list of objects with no duplicates.
Notes
The function is non-deterministic because the order of the collected results depends on the order of the rows, which may be non-deterministic after a shuffle.
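The ordering caveat can be illustrated with a small pure-Python model. This is only a sketch: `collect_set_model` is a hypothetical stand-in that de-duplicates values in arrival order, not Spark's implementation, and the two arrival orders are made up to mimic rows reaching the aggregate differently after a shuffle.

```python
def collect_set_model(values):
    """Hypothetical stand-in: de-duplicate while keeping arrival order."""
    seen = []
    for value in values:
        if value not in seen:
            seen.append(value)
    return seen


# The same three rows, arriving in two different orders.
arrival_a = [5, 2, 5]
arrival_b = [2, 5, 5]

raw_a = collect_set_model(arrival_a)
raw_b = collect_set_model(arrival_b)

# Sorting removes the dependence on arrival order.
sorted_a = sorted(raw_a)
sorted_b = sorted(raw_b)

print(raw_a, raw_b)
print(sorted_a, sorted_b)
```

Both runs hold the same elements but in a different order, so results should be sorted (for example with array_sort) before they are compared.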
Examples
>>> from pyspark.sql.functions import array_sort, collect_set
>>> df2 = spark.createDataFrame([(2,), (5,), (5,)], ('age',))
>>> df2.agg(array_sort(collect_set('age')).alias('c')).collect()
[Row(c=[2, 5])]