pyspark.pandas.concat

pyspark.pandas.concat(objs: List[Union[pyspark.pandas.frame.DataFrame, pyspark.pandas.series.Series]], axis: Union[int, str] = 0, join: str = 'outer', ignore_index: bool = False, sort: bool = False) → Union[pyspark.pandas.series.Series, pyspark.pandas.frame.DataFrame]

Concatenate pandas-on-Spark objects along a particular axis with optional set logic along the other axes.

Parameters

objs : a sequence of Series or DataFrame
    Any None objects will be dropped silently unless they are all None, in which case a ValueError will be raised.
axis : {0/'index', 1/'columns'}, default 0
    The axis to concatenate along.
join : {'inner', 'outer'}, default 'outer'
    How to handle indexes on other axis (or axes).
ignore_index : bool, default False
    If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, …, n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.
sort : bool, default False
    Sort non-concatenation axis if it is not already aligned.

Returns

object, type of objs: When concatenating all Series along the index (axis=0), a Series is returned. When objs contains at least one DataFrame, a DataFrame is returned. When concatenating along the columns (axis=1), a DataFrame is returned.

See also

Series.append: Concatenate Series.
DataFrame.join: Join DataFrames using indexes.
DataFrame.merge: Merge DataFrames by indexes or columns.

Examples

>>> import pyspark.pandas as ps
>>> from pyspark.pandas.config import set_option, reset_option
>>> set_option("compute.ops_on_diff_frames", True)

Combine two Series.

>>> s1 = ps.Series(['a', 'b'])
>>> s2 = ps.Series(['c', 'd'])
>>> ps.concat([s1, s2])
0    a
1    b
0    c
1    d
dtype: object

Clear the existing index and reset it in the result by setting the ignore_index option to True.

>>> ps.concat([s1, s2], ignore_index=True)
0    a
1    b
2    c
3    d
dtype: object

Combine two DataFrame objects with identical columns.

>>> df1 = ps.DataFrame([['a', 1], ['b', 2]],
...                    columns=['letter', 'number'])
>>> df1
  letter  number
0      a       1
1      b       2
>>> df2 = ps.DataFrame([['c', 3], ['d', 4]],
...                    columns=['letter', 'number'])
>>> df2
  letter  number
0      c       3
1      d       4

>>> ps.concat([df1, df2])
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4

Combine DataFrame and Series objects with different columns.

>>> ps.concat([df2, s1])
  letter  number     0
0      c     3.0  None
1      d     4.0  None
0   None     NaN     a
1   None     NaN     b

Combine DataFrame objects with overlapping columns and return everything. Columns outside the intersection will be filled with None values.

>>> df3 = ps.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
...                    columns=['letter', 'number', 'animal'])
>>> df3
  letter  number animal
0      c       3    cat
1      d       4    dog

>>> ps.concat([df1, df3])
  letter  number animal
0      a       1   None
1      b       2   None
0      c       3    cat
1      d       4    dog

Sort the columns.

>>> ps.concat([df1, df3], sort=True)
  animal letter  number
0   None      a       1
1   None      b       2
0    cat      c       3
1    dog      d       4

Combine DataFrame objects with overlapping columns and return only those that are shared by passing "inner" to the join keyword argument.

>>> ps.concat([df1, df3], join="inner")
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4

>>> df4 = ps.DataFrame([['bird', 'polly'], ['monkey', 'george']],
...                    columns=['animal', 'name'])

Combine along the column axis.

>>> ps.concat([df1, df4], axis=1)
  letter  number  animal    name
0      a       1    bird   polly
1      b       2  monkey  george

>>> reset_option("compute.ops_on_diff_frames")
