pyspark.testing.assertDataFrameEqual#
- pyspark.testing.assertDataFrameEqual(actual, expected, checkRowOrder=False, rtol=1e-05, atol=1e-08, ignoreNullable=True, ignoreColumnOrder=False, ignoreColumnName=False, ignoreColumnType=False, maxErrors=None, showOnlyDiff=False, includeDiffRows=False)[source]#
A utility function to assert equality between actual and expected (DataFrames or lists of Rows), with optional parameters such as checkRowOrder, rtol, and atol.
Supports Spark, Spark Connect, pandas, and pandas-on-Spark DataFrames. For more information about pandas-on-Spark DataFrame equality, see the docs for assertPandasOnSparkEqual.
New in version 3.5.0.
- Parameters
- actualDataFrame (Spark, Spark Connect, pandas, or pandas-on-Spark) or list of Rows
The DataFrame that is being compared or tested.
- expectedDataFrame (Spark, Spark Connect, pandas, or pandas-on-Spark) or list of Rows
The expected result of the operation, for comparison with the actual result.
- checkRowOrderbool, optional
A flag indicating whether the order of rows should be considered in the comparison. If set to False (default), the row order is not taken into account. If set to True, the order of rows is important and will be checked during comparison. (See Notes)
- rtolfloat, optional
The relative tolerance, used in asserting approximate equality for float values in actual and expected. Set to 1e-5 by default. (See Notes)
- atolfloat, optional
The absolute tolerance, used in asserting approximate equality for float values in actual and expected. Set to 1e-8 by default. (See Notes)
- ignoreNullablebool, default True
Specifies whether a column’s nullable property is included when checking for schema equality. When set to True (default), the nullable property of the columns being compared is not taken into account and the columns will be considered equal even if they have different nullable settings. When set to False, columns are considered equal only if they have the same nullable setting.
New in version 4.0.0.
- ignoreColumnOrderbool, default False
Specifies whether to compare columns in the order they appear in the DataFrame or by column name. If set to False (default), columns are compared in the order they appear in the DataFrames. When set to True, a column in the expected DataFrame is compared to the column with the same name in the actual DataFrame.
New in version 4.0.0.
- ignoreColumnNamebool, default False
Specifies whether to fail the initial schema equality check if the column names in the two DataFrames are different. When set to False (default), column names are checked and the function fails if they are different. When set to True, the function will succeed even if column names are different. Column data types are compared for columns in the order they appear in the DataFrames.
New in version 4.0.0.
- ignoreColumnTypebool, default False
Specifies whether to ignore the data type of the columns when comparing. When set to False (default), column data types are checked and the function fails if they are different. When set to True, the schema equality check will succeed even if column data types are different and the function will attempt to compare rows.
New in version 4.0.0.
- maxErrorsint, optional
The maximum number of row comparison failures to encounter before returning. Once this many row comparisons have failed, the function returns regardless of how many rows have been compared. Set to None by default, which means all rows are compared regardless of the number of failures.
New in version 4.0.0.
- showOnlyDiffbool, default False
If set to True, the error message will only include rows that are different. If set to False (default), the error message will include all rows (when there is at least one row that is different).
New in version 4.0.0.
- includeDiffRowsbool, default False
If set to True, the unequal rows are included in PySparkAssertionError for further debugging. If set to False (default), the unequal rows are not returned as a data set.
New in version 4.0.0.
Notes
When assertDataFrameEqual fails, the error message uses the Python difflib library to display a diff log of each row that differs in actual and expected.
For checkRowOrder, note that PySpark DataFrame ordering is non-deterministic, unless explicitly sorted.
Note that schema equality is checked only when expected is a DataFrame (not a list of Rows).
For DataFrames with float/decimal values, assertDataFrameEqual asserts approximate equality. Two float/decimal values a and b are approximately equal if the following equation is True:
absolute(a - b) <= (atol + rtol * absolute(b))
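For example, with the default tolerances, 1000.0 and 1000.005 are approximately equal, since absolute(1000.0 - 1000.005) = 0.005 <= (1e-08 + 1e-05 * 1000.005) ≈ 0.01.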
ignoreColumnOrder and ignoreColumnName cannot both be set to True; the two options are mutually exclusive.
Examples
>>> df1 = spark.createDataFrame(data=[("1", 1000), ("2", 3000)], schema=["id", "amount"]) >>> df2 = spark.createDataFrame(data=[("1", 1000), ("2", 3000)], schema=["id", "amount"]) >>> assertDataFrameEqual(df1, df2) # pass, DataFrames are identical
>>> df1 = spark.createDataFrame(data=[("1", 0.1), ("2", 3.23)], schema=["id", "amount"]) >>> df2 = spark.createDataFrame(data=[("1", 0.109), ("2", 3.23)], schema=["id", "amount"]) >>> assertDataFrameEqual(df1, df2, rtol=1e-1) # pass, DataFrames are approx equal by rtol
>>> df1 = spark.createDataFrame(data=[(1, 1000), (2, 3000)], schema=["id", "amount"])
>>> list_of_rows = [Row(1, 1000), Row(2, 3000)]
>>> assertDataFrameEqual(df1, list_of_rows)  # pass, actual and expected data are equal
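A minimal sketch for checkRowOrder: the same data in a different row order passes by default, while checkRowOrder=True would raise PySparkAssertionError:
>>> df1 = spark.createDataFrame(data=[("1", 1000), ("2", 3000)], schema=["id", "amount"])
>>> df2 = spark.createDataFrame(data=[("2", 3000), ("1", 1000)], schema=["id", "amount"])
>>> assertDataFrameEqual(df1, df2)  # pass, row order is ignored by default
>>> # with checkRowOrder=True, the same call would raise PySparkAssertionError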
>>> import pyspark.pandas as ps
>>> df1 = ps.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
>>> df2 = ps.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
>>> # pass, pandas-on-Spark DataFrames are equal
>>> assertDataFrameEqual(df1, df2)
>>> df1 = spark.createDataFrame( ... data=[("1", 1000.00), ("2", 3000.00), ("3", 2000.00)], schema=["id", "amount"]) >>> df2 = spark.createDataFrame( ... data=[("1", 1001.00), ("2", 3000.00), ("3", 2003.00)], schema=["id", "amount"]) >>> assertDataFrameEqual(df1, df2) Traceback (most recent call last): ... PySparkAssertionError: [DIFFERENT_ROWS] Results do not match: ( 66.66667 % ) *** actual *** ! Row(id='1', amount=1000.0) Row(id='2', amount=3000.0) ! Row(id='3', amount=2000.0) *** expected *** ! Row(id='1', amount=1001.0) Row(id='2', amount=3000.0) ! Row(id='3', amount=2003.0)
Example for ignoreNullable
>>> from pyspark.sql.types import StructType, StructField, StringType, LongType
>>> df1_nullable = spark.createDataFrame(
...     data=[(1000, "1"), (5000, "2")],
...     schema=StructType(
...         [StructField("amount", LongType(), True), StructField("id", StringType(), True)]
...     )
... )
>>> df2_nullable = spark.createDataFrame(
...     data=[(1000, "1"), (5000, "2")],
...     schema=StructType(
...         [StructField("amount", LongType(), True), StructField("id", StringType(), False)]
...     )
... )
>>> assertDataFrameEqual(df1_nullable, df2_nullable, ignoreNullable=True)  # pass
>>> assertDataFrameEqual(
...     df1_nullable, df2_nullable, ignoreNullable=False
... )
Traceback (most recent call last):
...
PySparkAssertionError: [DIFFERENT_SCHEMA] Schemas do not match.
--- actual
+++ expected
- StructType([StructField('amount', LongType(), True), StructField('id', StringType(), True)])
?                                                                                       ^^^
+ StructType([StructField('amount', LongType(), True), StructField('id', StringType(), False)])
?                                                                                       ^^^^
Example for ignoreColumnOrder
>>> df1_col_order = spark.createDataFrame(
...     data=[(1000, "1"), (5000, "2")], schema=["amount", "id"]
... )
>>> df2_col_order = spark.createDataFrame(
...     data=[("1", 1000), ("2", 5000)], schema=["id", "amount"]
... )
>>> assertDataFrameEqual(df1_col_order, df2_col_order, ignoreColumnOrder=True)
Example for ignoreColumnName
>>> df1_col_names = spark.createDataFrame(
...     data=[(1000, "1"), (5000, "2")], schema=["amount", "identity"]
... )
>>> df2_col_names = spark.createDataFrame(
...     data=[(1000, "1"), (5000, "2")], schema=["amount", "id"]
... )
>>> assertDataFrameEqual(df1_col_names, df2_col_names, ignoreColumnName=True)
Example for ignoreColumnType
>>> df1_col_types = spark.createDataFrame(
...     data=[(1000, "1"), (5000, "2")], schema=["amount", "id"]
... )
>>> df2_col_types = spark.createDataFrame(
...     data=[(1000.0, "1"), (5000.0, "2")], schema=["amount", "id"]
... )
>>> assertDataFrameEqual(df1_col_types, df2_col_types, ignoreColumnType=True)
Example for maxErrors (will only report the first mismatching row)
>>> df1 = spark.createDataFrame([(1, "A"), (2, "B"), (3, "C")]) >>> df2 = spark.createDataFrame([(1, "A"), (2, "X"), (3, "Y")]) >>> assertDataFrameEqual(df1, df2, maxErrors=1) Traceback (most recent call last): ... PySparkAssertionError: [DIFFERENT_ROWS] Results do not match: ( 33.33333 % ) *** actual *** Row(_1=1, _2='A') ! Row(_1=2, _2='B') *** expected *** Row(_1=1, _2='A') ! Row(_1=2, _2='X')
Example for showOnlyDiff (will only report the mismatching rows)
>>> df1 = spark.createDataFrame([(1, "A"), (2, "B"), (3, "C")]) >>> df2 = spark.createDataFrame([(1, "A"), (2, "X"), (3, "Y")]) >>> assertDataFrameEqual(df1, df2, showOnlyDiff=True) Traceback (most recent call last): ... PySparkAssertionError: [DIFFERENT_ROWS] Results do not match: ( 66.66667 % ) *** actual *** ! Row(_1=2, _2='B') ! Row(_1=3, _2='C') *** expected *** ! Row(_1=2, _2='X') ! Row(_1=3, _2='Y')
The includeDiffRows parameter can be used to include the rows that did not match in the PySparkAssertionError. This can be useful for debugging or further analysis.
>>> df1 = spark.createDataFrame( ... data=[("1", 1000.00), ("2", 3000.00), ("3", 2000.00)], schema=["id", "amount"]) >>> df2 = spark.createDataFrame( ... data=[("1", 1001.00), ("2", 3000.00), ("3", 2003.00)], schema=["id", "amount"]) >>> try: ... assertDataFrameEqual(df1, df2, includeDiffRows=True) ... except PySparkAssertionError as e: ... spark.createDataFrame(e.data).show() +-----------+-----------+ | _1| _2| +-----------+-----------+ |{1, 1000.0}|{1, 1001.0}| |{3, 2000.0}|{3, 2003.0}| +-----------+-----------+