pyspark.sql.DataFrame.distinct
- DataFrame.distinct()
Returns a new DataFrame containing the distinct rows in this DataFrame.
New in version 1.3.0.
Changed in version 3.4.0: Supports Spark Connect.
- Returns
DataFrame
DataFrame with distinct records.
Examples
Remove duplicate rows from a DataFrame
>>> df = spark.createDataFrame(
...     [(14, "Tom"), (23, "Alice"), (23, "Alice")], ["age", "name"])
>>> df.distinct().show()
+---+-----+
|age| name|
+---+-----+
| 14|  Tom|
| 23|Alice|
+---+-----+
Count the number of distinct rows in a DataFrame
>>> df.distinct().count()
2
Get distinct rows from a DataFrame with multiple columns
>>> df = spark.createDataFrame(
...     [(14, "Tom", "M"), (23, "Alice", "F"), (23, "Alice", "F"), (14, "Tom", "M")],
...     ["age", "name", "gender"])
>>> df.distinct().show()
+---+-----+------+
|age| name|gender|
+---+-----+------+
| 14|  Tom|     M|
| 23|Alice|     F|
+---+-----+------+
Get distinct values from a specific column in a DataFrame
>>> df.select("name").distinct().show()
+-----+
| name|
+-----+
|  Tom|
|Alice|
+-----+
Count the number of distinct values in a specific column
>>> df.select("name").distinct().count()
2
Get distinct combinations of values from multiple columns in a DataFrame
>>> df.select("name", "gender").distinct().show()
+-----+------+
| name|gender|
+-----+------+
|  Tom|     M|
|Alice|     F|
+-----+------+
Get distinct rows from a DataFrame with null values
>>> df = spark.createDataFrame(
...     [(14, "Tom", "M"), (23, "Alice", "F"), (23, "Alice", "F"), (14, "Tom", None)],
...     ["age", "name", "gender"])
>>> df.distinct().show()
+---+-----+------+
|age| name|gender|
+---+-----+------+
| 14|  Tom|     M|
| 23|Alice|     F|
| 14|  Tom|  NULL|
+---+-----+------+
Get distinct non-null values from a DataFrame
>>> df.distinct().filter(df.gender.isNotNull()).show()
+---+-----+------+
|age| name|gender|
+---+-----+------+
| 14|  Tom|     M|
| 23|Alice|     F|
+---+-----+------+
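The null-handling examples above can be modelled in plain Python, which may help when reasoning about results without a Spark session. This is only a sketch of the semantics, not how Spark implements `distinct()`: the helper `distinct_rows` is hypothetical, rows are represented as tuples, and `None` stands in for NULL. Two rows count as duplicates only when every column matches, and NULLs in the same position are treated as equal for this purpose.

```python
from typing import Hashable, Iterable, List, Tuple

Row = Tuple[Hashable, ...]


def distinct_rows(rows: Iterable[Row]) -> List[Row]:
    """Hypothetical plain-Python model of row-level deduplication (not a PySpark API)."""
    seen = set()
    result = []
    for row in rows:
        # A row is a duplicate only if all of its values match an earlier row;
        # None == None here, mirroring how NULLs are grouped together by distinct().
        if row not in seen:
            seen.add(row)
            result.append(row)
    return result


rows = [
    (14, "Tom", "M"),
    (23, "Alice", "F"),
    (23, "Alice", "F"),
    (14, "Tom", None),
    (14, "Tom", None),
]

unique = distinct_rows(rows)

# Equivalent in spirit to filtering on gender.isNotNull() after distinct().
non_null = [row for row in unique if row[2] is not None]

print(unique)
print(non_null)
```

Unlike this model, which keeps rows in first-seen order, Spark does not guarantee the order of rows returned by `distinct()`.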