pyspark.pandas.DataFrame.spark.apply
- spark.apply(func, index_col=None)
Applies a function that takes and returns a Spark DataFrame. This lets you natively apply Spark functions and column APIs to the Spark DataFrame that internally backs the pandas-on-Spark DataFrame.
Note
Set index_col and keep a column of that name in the output Spark DataFrame to avoid the performance penalty of attaching the default index. If you omit index_col, the default index is used, which is potentially expensive in general.
Note
This method loses column labels. It is a synonym of
func(psdf.to_spark(index_col)).pandas_api(index_col)
- Parameters
- func : function
The function to apply; it takes and returns a Spark DataFrame.
- Returns
- DataFrame
- Raises
- ValueError : If the output of the function is not a Spark DataFrame.
Examples
>>> psdf = ps.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, columns=["a", "b"])
>>> psdf
   a  b
0  1  4
1  2  5
2  3  6
>>> psdf.spark.apply(
...     lambda sdf: sdf.selectExpr("a + b as c", "index"), index_col="index")
       c
index
0      5
1      7
2      9
The case below ends up using the default index, which should be avoided if possible.
>>> psdf.spark.apply(lambda sdf: sdf.groupby("a").count().sort("a"))
   a  count
0  1      1
1  2      1
2  3      1