pyspark.sql.DataFrame.unionByName
DataFrame.unionByName(other, allowMissingColumns=False)

Returns a new DataFrame containing the union of rows in this and another DataFrame.

This is different from both UNION ALL and UNION DISTINCT in SQL: it keeps duplicate rows (unlike UNION DISTINCT) and matches columns by name rather than by position (unlike UNION ALL). To do a SQL-style set union (that does deduplication of elements), use this function followed by distinct().

New in version 2.3.0.
Examples

The difference between this function and union() is that this function resolves columns by name (not by position):

>>> df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
>>> df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col0"])
>>> df1.unionByName(df2).show()
+----+----+----+
|col0|col1|col2|
+----+----+----+
|   1|   2|   3|
|   6|   4|   5|
+----+----+----+
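To make the difference between positional and name-based resolution concrete without a Spark session, the following is a minimal plain-Python model. The helpers union_by_position and union_by_name are hypothetical names introduced for illustration; they are not part of PySpark and only sketch the semantics described above, not Spark's implementation.

```python
# Plain-Python model of the two resolution strategies; this is not Spark code.
# union_by_position and union_by_name are hypothetical helpers.

def union_by_position(cols1, rows1, cols2, rows2):
    """Append rows2 unchanged, ignoring its column names (models union())."""
    if len(cols1) != len(cols2):
        raise ValueError("column counts differ")
    return list(cols1), list(rows1) + list(rows2)


def union_by_name(cols1, rows1, cols2, rows2):
    """Reorder each row of rows2 to the order of cols1 (models unionByName())."""
    if sorted(cols1) != sorted(cols2):
        raise ValueError("column names differ")
    # For every column of the left side, find where it sits on the right side.
    index = [cols2.index(name) for name in cols1]
    reordered = [tuple(row[i] for i in index) for row in rows2]
    return list(cols1), list(rows1) + reordered


cols1, rows1 = ["col0", "col1", "col2"], [(1, 2, 3)]
cols2, rows2 = ["col1", "col2", "col0"], [(4, 5, 6)]

by_position = union_by_position(cols1, rows1, cols2, rows2)
by_name = union_by_name(cols1, rows1, cols2, rows2)

print(by_position[1])  # [(1, 2, 3), (4, 5, 6)]
print(by_name[1])      # [(1, 2, 3), (6, 4, 5)]
```

The name-based result matches the table in the example: the value 6, stored under col0 in the second input, lands in the col0 column of the output.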
When the parameter allowMissingColumns is True, the set of column names in this and the other DataFrame can differ; missing columns will be filled with null. Further, the missing columns of this DataFrame will be added at the end in the schema of the union result:

>>> df1 = spark.createDataFrame([[1, 2, 3]], ["col0", "col1", "col2"])
>>> df2 = spark.createDataFrame([[4, 5, 6]], ["col1", "col2", "col3"])
>>> df1.unionByName(df2, allowMissingColumns=True).show()
+----+----+----+----+
|col0|col1|col2|col3|
+----+----+----+----+
|   1|   2|   3|null|
|null|   4|   5|   6|
+----+----+----+----+
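The null-filling behaviour can be modelled in plain Python as well. In this sketch, union_by_name_allow_missing is a hypothetical helper (not a PySpark function), None stands in for null, and the assumption that extra columns are appended in the order they appear in the other input is mine; the text above only says they are added at the end.

```python
# Plain-Python model of allowMissingColumns=True; this is not Spark code.
# union_by_name_allow_missing is a hypothetical helper.

def union_by_name_allow_missing(cols1, rows1, cols2, rows2):
    # Result schema: the left columns, then columns found only on the right
    # (assumed here to keep the right side's order).
    out_cols = list(cols1) + [c for c in cols2 if c not in cols1]

    def widen(cols, row):
        # Map names to values, then emit one value per output column;
        # columns absent from this input become None (models null).
        values = dict(zip(cols, row))
        return tuple(values.get(c) for c in out_cols)

    out_rows = [widen(cols1, r) for r in rows1]
    out_rows += [widen(cols2, r) for r in rows2]
    return out_cols, out_rows


out_cols, out_rows = union_by_name_allow_missing(
    ["col0", "col1", "col2"], [(1, 2, 3)],
    ["col1", "col2", "col3"], [(4, 5, 6)],
)

print(out_cols)  # ['col0', 'col1', 'col2', 'col3']
print(out_rows)  # [(1, 2, 3, None), (None, 4, 5, 6)]
```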
Changed in version 3.1.0: Added optional argument allowMissingColumns to specify whether to allow missing columns.