pyspark.sql.functions.hll_sketch_agg

pyspark.sql.functions.hll_sketch_agg(col: ColumnOrName, lgConfigK: Union[int, pyspark.sql.column.Column, None] = None) → pyspark.sql.column.Column

Aggregate function: returns the updatable binary representation of the Datasketches HllSketch configured with lgConfigK arg.
New in version 3.5.0.
Parameters
    col : Column or str or int
    lgConfigK : int, optional
        The log-base-2 of K, where K is the number of buckets or slots for the HllSketch.
Returns
    Column
        The binary representation of the HllSketch.
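To make the lgConfigK parameter concrete, the following plain-Python sketch converts a log-base-2 value into the bucket count K it implies. The helper names are hypothetical and not part of PySpark. The 1.04 / sqrt(K) figure is the relative standard error of the classical HyperLogLog estimator and is included only as a rough guide; the Datasketches implementation may have different error bounds.

```python
import math


def hll_bucket_count(lg_config_k: int) -> int:
    """Return K, the number of buckets, for a given log-base-2 of K."""
    return 1 << lg_config_k


def classical_hll_relative_error(lg_config_k: int) -> float:
    """Rough relative standard error of classical HyperLogLog: 1.04 / sqrt(K).

    Assumption: this is the textbook figure, not a bound documented for
    the Datasketches HllSketch that hll_sketch_agg builds.
    """
    return 1.04 / math.sqrt(hll_bucket_count(lg_config_k))


# lgConfigK = 12, the value passed explicitly in the examples on this page
buckets = hll_bucket_count(12)
error = classical_hll_relative_error(12)
print(buckets, error)
```

A larger lgConfigK means more buckets, so a larger sketch and, in general, a tighter estimate.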
Examples
>>> from pyspark.sql.functions import col, hll_sketch_agg, hll_sketch_estimate, lit
>>> df = spark.createDataFrame([1,2,2,3], "INT")
>>> df1 = df.agg(hll_sketch_estimate(hll_sketch_agg("value")).alias("distinct_cnt"))
>>> df1.show()
+------------+
|distinct_cnt|
+------------+
|           3|
+------------+
>>> df2 = df.agg(hll_sketch_estimate(
...     hll_sketch_agg("value", lit(12))
... ).alias("distinct_cnt"))
>>> df2.show()
+------------+
|distinct_cnt|
+------------+
|           3|
+------------+
>>> df3 = df.agg(hll_sketch_estimate(
...     hll_sketch_agg(col("value"), lit(12))).alias("distinct_cnt"))
>>> df3.show()
+------------+
|distinct_cnt|
+------------+
|           3|
+------------+