DataFrameWriter (Spark 2.3.4 JavaDoc)

Object
- org.apache.spark.sql.DataFrameWriter<T>

```
public final class DataFrameWriter<T>
extends Object
```
Interface used to write a Dataset to external storage systems (e.g. file systems, key-value stores). Use `Dataset.write` to access this.

Since:

1.4.0

Method Summary

| Modifier and Type | Method and Description |
| --- | --- |
| `DataFrameWriter<T>` | `bucketBy(int numBuckets, String colName, scala.collection.Seq<String> colNames)` Buckets the output by the given columns. |
| `DataFrameWriter<T>` | `bucketBy(int numBuckets, String colName, String... colNames)` Buckets the output by the given columns. |
| `void` | `csv(String path)` Saves the content of the `DataFrame` in CSV format at the specified path. |
| `DataFrameWriter<T>` | `format(String source)` Specifies the underlying output data source. |
| `void` | `insertInto(String tableName)` Inserts the content of the `DataFrame` to the specified table. |
| `void` | `jdbc(String url, String table, java.util.Properties connectionProperties)` Saves the content of the `DataFrame` to an external database table via JDBC. |
| `void` | `json(String path)` Saves the content of the `DataFrame` in JSON format (JSON Lines text format or newline-delimited JSON) at the specified path. |
| `DataFrameWriter<T>` | `mode(SaveMode saveMode)` Specifies the behavior when data or table already exists. |
| `DataFrameWriter<T>` | `mode(String saveMode)` Specifies the behavior when data or table already exists. |
| `DataFrameWriter<T>` | `option(String key, boolean value)` Adds an output option for the underlying data source. |
| `DataFrameWriter<T>` | `option(String key, double value)` Adds an output option for the underlying data source. |
| `DataFrameWriter<T>` | `option(String key, long value)` Adds an output option for the underlying data source. |
| `DataFrameWriter<T>` | `option(String key, String value)` Adds an output option for the underlying data source. |
| `DataFrameWriter<T>` | `options(scala.collection.Map<String,String> options)` (Scala-specific) Adds output options for the underlying data source. |
| `DataFrameWriter<T>` | `options(java.util.Map<String,String> options)` Adds output options for the underlying data source. |
| `void` | `orc(String path)` Saves the content of the `DataFrame` in ORC format at the specified path. |
| `void` | `parquet(String path)` Saves the content of the `DataFrame` in Parquet format at the specified path. |
| `DataFrameWriter<T>` | `partitionBy(scala.collection.Seq<String> colNames)` Partitions the output by the given columns on the file system. |
| `DataFrameWriter<T>` | `partitionBy(String... colNames)` Partitions the output by the given columns on the file system. |
| `void` | `save()` Saves the content of the `DataFrame` using the format, options, and save mode configured on this writer. |
| `void` | `save(String path)` Saves the content of the `DataFrame` at the specified path. |
| `void` | `saveAsTable(String tableName)` Saves the content of the `DataFrame` as the specified table. |
| `DataFrameWriter<T>` | `sortBy(String colName, scala.collection.Seq<String> colNames)` Sorts the output in each bucket by the given columns. |
| `DataFrameWriter<T>` | `sortBy(String colName, String... colNames)` Sorts the output in each bucket by the given columns. |
| `void` | `text(String path)` Saves the content of the `DataFrame` in a text file at the specified path. |

Methods inherited from class Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Method Detail
  - partitionBy
```
public DataFrameWriter<T> partitionBy(String... colNames)
```
    Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:
    - year=2016/month=01/
    - year=2016/month=02/
    Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioned columns. In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.
    This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark 2.1.0.
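    The directory naming described above can be sketched in plain Java. This is only an illustration of the Hive-style `column=value` layout, not Spark's implementation: the class and method names are hypothetical, and escaping of special characters in partition values is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Illustrative only: builds Hive-style partition directory names (hypothetical helper). */
final class PartitionLayoutSketch {

    /** Emits one directory level per partition column, in the order the columns were given. */
    static String partitionDir(Map<String, String> partitionValues) {
        StringJoiner path = new StringJoiner("/", "", "/");
        for (Map.Entry<String, String> e : partitionValues.entrySet()) {
            path.add(e.getKey() + "=" + e.getValue());
        }
        return path.toString();
    }

    public static void main(String[] args) {
        // A row with year=2016 and month=01, partitioned by year and then month.
        Map<String, String> row = new LinkedHashMap<>();
        row.put("year", "2016");
        row.put("month", "01");
        System.out.println(partitionDir(row)); // prints year=2016/month=01/
    }
}
```

    A query that filters on year or month only needs to list the matching directories, which is the coarse-grained index the text refers to.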
    
    Parameters:
    
    colNames - (undocumented)
    
    Returns:
    
    (undocumented)
    
    Since:
    
    1.4.0
  - bucketBy
```
public DataFrameWriter<T> bucketBy(int numBuckets,
                                   String colName,
                                   String... colNames)
```
    Buckets the output by the given columns. If specified, the output is laid out on the file system similar to Hive's bucketing scheme.
    This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark 2.1.0.
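    As a rough illustration of the idea behind bucketing, the sketch below assigns a key to one of a fixed number of buckets by hashing it. It is an assumption-laden stand-in: the class name is hypothetical and it uses `String.hashCode`, which is not the hash function Spark or Hive uses, so bucket numbers will not match real output.

```java
/** Illustrative only: hash-based bucket assignment (not Spark's or Hive's hash function). */
final class BucketAssignmentSketch {

    /** Maps a bucketing-column value to a bucket index in the range [0, numBuckets). */
    static int bucketFor(String key, int numBuckets) {
        if (numBuckets <= 0) {
            throw new IllegalArgumentException("numBuckets must be positive");
        }
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(key.hashCode(), numBuckets);
    }

    public static void main(String[] args) {
        // Equal keys always land in the same bucket, which is what makes bucketing useful.
        System.out.println(bucketFor("user-42", 8) == bucketFor("user-42", 8)); // prints true
    }
}
```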
    
    Parameters:
    
    numBuckets - (undocumented)
    
    colName - (undocumented)
    
    colNames - (undocumented)
    
    Returns:
    
    (undocumented)
    
    Since:
    
    2.0
  - sortBy
```
public DataFrameWriter<T> sortBy(String colName,
                                 String... colNames)
```
    Sorts the output in each bucket by the given columns.
    This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark 2.1.0.
    
    Parameters:
    
    colName - (undocumented)
    
    colNames - (undocumented)
    
    Returns:
    
    (undocumented)
    
    Since:
    
    2.0
  - mode
```
public DataFrameWriter<T> mode(SaveMode saveMode)
```
    Specifies the behavior when data or table already exists. Options include:
    - SaveMode.Overwrite: overwrite the existing data.
    - SaveMode.Append: append the data.
    - SaveMode.Ignore: ignore the operation (i.e. no-op).
    - SaveMode.ErrorIfExists: default option, throw an exception at runtime.
    
    Parameters:
    
    saveMode - (undocumented)
    
    Returns:
    
    (undocumented)
    
    Since:
    
    1.4.0
  - mode
```
public DataFrameWriter<T> mode(String saveMode)
```
    Specifies the behavior when data or table already exists. Options include:
    - overwrite: overwrite the existing data.
    - append: append the data.
    - ignore: ignore the operation (i.e. no-op).
    - error or errorifexists: default option, throw an exception at runtime.
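    The four modes can be summarized as a small decision table. The sketch below is a minimal stand-in that compiles without Spark: the `Mode` and `Action` enums and both methods are hypothetical, and treating the mode string case-insensitively is an assumption rather than something stated above.

```java
import java.util.Locale;

/** Illustrative only: a local model of the save modes listed above (not Spark classes). */
final class SaveModeSketch {

    /** Local stand-in for the four documented save modes. */
    enum Mode { OVERWRITE, APPEND, IGNORE, ERROR_IF_EXISTS }

    /** What a writer would do with the incoming data. */
    enum Action { WRITE, REPLACE, APPEND, SKIP, FAIL }

    /** Maps the documented string names to a mode (case-insensitivity is an assumption). */
    static Mode parse(String saveMode) {
        switch (saveMode.toLowerCase(Locale.ROOT)) {
            case "overwrite": return Mode.OVERWRITE;
            case "append": return Mode.APPEND;
            case "ignore": return Mode.IGNORE;
            case "error":
            case "errorifexists": return Mode.ERROR_IF_EXISTS;
            default: throw new IllegalArgumentException("Unknown save mode: " + saveMode);
        }
    }

    /** The mode only matters when the target data or table already exists. */
    static Action decide(Mode mode, boolean targetExists) {
        if (!targetExists) {
            return Action.WRITE;
        }
        switch (mode) {
            case OVERWRITE: return Action.REPLACE;
            case APPEND: return Action.APPEND;
            case IGNORE: return Action.SKIP;
            default: return Action.FAIL; // ERROR_IF_EXISTS, the default mode
        }
    }
}
```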
    
    Parameters:
    
    saveMode - (undocumented)
    
    Returns:
    
    (undocumented)
    
    Since:
    
    1.4.0
  - format
```
public DataFrameWriter<T> format(String source)
```
    Specifies the underlying output data source. Built-in options include "parquet", "json", etc.
    
    Parameters:
    
    source - (undocumented)
    
    Returns:
    
    (undocumented)
    
    Since:
    
    1.4.0
  - option
```
public DataFrameWriter<T> option(String key,
                                 String value)
```
    Adds an output option for the underlying data source.
    You can set the following option(s):
    - timeZone (default session local timezone): sets the string that indicates a timezone to be used to format timestamps in the JSON/CSV datasources or partition values.
    Parameters:
    
    key - (undocumented)
    
    value - (undocumented)
    
    Returns:
    
    (undocumented)
    
    Since:
    
    1.4.0
  - option
```
public DataFrameWriter<T> option(String key,
                                 boolean value)
```
    Adds an output option for the underlying data source.
    
    Parameters:
    
    key - (undocumented)
    
    value - (undocumented)
    
    Returns:
    
    (undocumented)
    
    Since:
    
    2.0.0
  - option
```
public DataFrameWriter<T> option(String key,
                                 long value)
```
    Adds an output option for the underlying data source.
    
    Parameters:
    
    key - (undocumented)
    
    value - (undocumented)
    
    Returns:
    
    (undocumented)
    
    Since:
    
    2.0.0
  - option
```
public DataFrameWriter<T> option(String key,
                                 double value)
```
    Adds an output option for the underlying data source.
    
    Parameters:
    
    key - (undocumented)
    
    value - (undocumented)
    
    Returns:
    
    (undocumented)
    
    Since:
    
    2.0.0
  - options
```
public DataFrameWriter<T> options(scala.collection.Map<String,String> options)
```
    (Scala-specific) Adds output options for the underlying data source.
    You can set the following option(s):
    - timeZone (default session local timezone): sets the string that indicates a timezone to be used to format timestamps in the JSON/CSV datasources or partition values.
    Parameters:
    
    options - (undocumented)
    
    Returns:
    
    (undocumented)
    
    Since:
    
    1.4.0
  - options
```
public DataFrameWriter<T> options(java.util.Map<String,String> options)
```
    Adds output options for the underlying data source.
    You can set the following option(s):
    - timeZone (default session local timezone): sets the string that indicates a timezone to be used to format timestamps in the JSON/CSV datasources or partition values.
    Parameters:
    
    options - (undocumented)
    
    Returns:
    
    (undocumented)
    
    Since:
    
    1.4.0
  - partitionBy
```
public DataFrameWriter<T> partitionBy(scala.collection.Seq<String> colNames)
```
    Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:
    - year=2016/month=01/
    - year=2016/month=02/
    Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioned columns. In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.
    This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark 2.1.0.
    
    Parameters:
    
    colNames - (undocumented)
    
    Returns:
    
    (undocumented)
    
    Since:
    
    1.4.0
  - bucketBy
```
public DataFrameWriter<T> bucketBy(int numBuckets,
                                   String colName,
                                   scala.collection.Seq<String> colNames)
```
    Buckets the output by the given columns. If specified, the output is laid out on the file system similar to Hive's bucketing scheme.
    This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark 2.1.0.
    
    Parameters:
    
    numBuckets - (undocumented)
    
    colName - (undocumented)
    
    colNames - (undocumented)
    
    Returns:
    
    (undocumented)
    
    Since:
    
    2.0
  - sortBy
```
public DataFrameWriter<T> sortBy(String colName,
                                 scala.collection.Seq<String> colNames)
```
    Sorts the output in each bucket by the given columns.
    This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark 2.1.0.
    
    Parameters:
    
    colName - (undocumented)
    
    colNames - (undocumented)
    
    Returns:
    
    (undocumented)
    
    Since:
    
    2.0
  - save
```
public void save(String path)
```
    Saves the content of the DataFrame at the specified path.
    
    Parameters:
    
    path - (undocumented)
    
    Since:
    
    1.4.0
  - save
```
public void save()
```
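    Saves the content of the DataFrame using the format, options, and save mode configured on this writer. No path argument is passed, so a path-based data source must receive its location some other way, for example through the "path" option.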
    
    Since:
    
    1.4.0
  - insertInto
```
public void insertInto(String tableName)
```
    Inserts the content of the DataFrame to the specified table. It requires that the schema of the DataFrame is the same as the schema of the table.
    Parameters:
    
    tableName - (undocumented)
    
    Since:
    
    1.4.0
    
    Note:
    Unlike saveAsTable, insertInto ignores the column names and just uses position-based resolution. For example:
    
```
    scala> Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1")
    scala> Seq((3, 4)).toDF("j", "i").write.insertInto("t1")
    scala> Seq((5, 6)).toDF("a", "b").write.insertInto("t1")
    scala> sql("select * from t1").show
    +---+---+
    |  i|  j|
    +---+---+
    |  5|  6|
    |  3|  4|
    |  1|  2|
    +---+---+
```
    
    Because it inserts data to an existing table, format or options will be ignored.
  - saveAsTable
```
public void saveAsTable(String tableName)
```
    Saves the content of the DataFrame as the specified table.
    If the table already exists, the behavior of this function depends on the save mode, specified by the mode function (defaults to throwing an exception). When mode is Overwrite, the schema of the DataFrame does not need to be the same as that of the existing table.
    When mode is Append, if there is an existing table, we will use the format and options of the existing table. The column order in the schema of the DataFrame doesn't need to be the same as that of the existing table. Unlike insertInto, saveAsTable will use the column names to find the correct column positions. For example:
```
    scala> Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1")
    scala> Seq((3, 4)).toDF("j", "i").write.mode("append").saveAsTable("t1")
    scala> sql("select * from t1").show
    +---+---+
    |  i|  j|
    +---+---+
    |  1|  2|
    |  4|  3|
    +---+---+
 
```
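    The difference between the position-based resolution of insertInto and the name-based resolution of saveAsTable in append mode can be sketched without Spark. The class and method names below are hypothetical; the inputs mirror the `(3, 4)` row with columns `j, i` being written to a table with columns `i, j`.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: two ways of mapping input values onto table columns. */
final class ColumnResolutionSketch {

    /** insertInto style: the k-th input value goes to the k-th table column; input names are ignored. */
    static Map<String, Integer> byPosition(List<String> tableCols, List<Integer> values) {
        Map<String, Integer> row = new LinkedHashMap<>();
        for (int k = 0; k < tableCols.size(); k++) {
            row.put(tableCols.get(k), values.get(k));
        }
        return row;
    }

    /** saveAsTable append style: each table column takes the value of the input column with the same name. */
    static Map<String, Integer> byName(List<String> tableCols, List<String> inputCols, List<Integer> values) {
        Map<String, Integer> row = new LinkedHashMap<>();
        for (String col : tableCols) {
            int idx = inputCols.indexOf(col);
            if (idx < 0) {
                throw new IllegalArgumentException("Input has no column named " + col);
            }
            row.put(col, values.get(idx));
        }
        return row;
    }

    public static void main(String[] args) {
        List<String> table = Arrays.asList("i", "j");
        List<String> input = Arrays.asList("j", "i");
        List<Integer> values = Arrays.asList(3, 4);
        System.out.println(byPosition(table, values));    // prints {i=3, j=4}
        System.out.println(byName(table, input, values)); // prints {i=4, j=3}
    }
}
```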
    In this method, save mode is used to determine the behavior if the data source table exists in Spark catalog. We will always overwrite the underlying data of data source (e.g. a table in JDBC data source) if the table doesn't exist in Spark catalog, and will always append to the underlying data of data source if the table already exists.
    When the DataFrame is created from a non-partitioned HadoopFsRelation with a single input path, and the data source provider can be mapped to an existing Hive builtin SerDe (i.e. ORC and Parquet), the table is persisted in a Hive compatible format, which means other systems like Hive will be able to read this table. Otherwise, the table is persisted in a Spark SQL specific format.
    Parameters:
    
    tableName - (undocumented)
    
    Since:
    
    1.4.0
  - jdbc
```
public void jdbc(String url,
                 String table,
                 java.util.Properties connectionProperties)
```
    Saves the content of the DataFrame to an external database table via JDBC. In the case the table already exists in the external database, behavior of this function depends on the save mode, specified by the mode function (default to throwing an exception).
    Don't create too many partitions in parallel on a large cluster; otherwise Spark might crash your external database systems.
    You can set the following JDBC-specific option(s) for storing JDBC:
    - truncate (default false): use TRUNCATE TABLE instead of DROP TABLE.
    In case of failures, users should turn off the truncate option to use DROP TABLE again. Also, because TRUNCATE TABLE behaves differently across DBMSs, it's not always safe to use this. MySQLDialect, DB2Dialect, MsSqlServerDialect, DerbyDialect, and OracleDialect support this, while PostgresDialect and the default JdbcDialect don't. For an unknown or unsupported JdbcDialect, the user option truncate is ignored.
    Parameters:
    
    url - JDBC database url of the form jdbc:subprotocol:subname
    
    table - Name of the table in the external database.
    
    connectionProperties - JDBC database connection arguments, a list of arbitrary string tag/value. Normally at least a "user" and "password" property should be included. "batchsize" can be used to control the number of rows per insert. "isolationLevel" can be one of "NONE", "READ_COMMITTED", "READ_UNCOMMITTED", "REPEATABLE_READ", or "SERIALIZABLE", corresponding to standard transaction isolation levels defined by JDBC's Connection object, with default of "READ_UNCOMMITTED".
    
    Since:
    
    1.4.0
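    The connectionProperties argument of jdbc is an ordinary java.util.Properties object. The sketch below builds one using the property names documented above; the class name, the credentials, and the chosen values ("1000", "READ_COMMITTED") are placeholders for illustration only.

```java
import java.util.Properties;

/** Illustrative only: assembles connection properties for a JDBC write (hypothetical helper). */
final class JdbcWritePropertiesSketch {

    static Properties build(String user, String password) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        // Number of rows per insert batch; the value here is an arbitrary example.
        props.setProperty("batchsize", "1000");
        // One of NONE, READ_COMMITTED, READ_UNCOMMITTED, REPEATABLE_READ, SERIALIZABLE.
        props.setProperty("isolationLevel", "READ_COMMITTED");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("etl_user", "secret");
        System.out.println(props.getProperty("isolationLevel")); // prints READ_COMMITTED
        // With Spark on the classpath and a Dataset named df (both assumed here),
        // the properties would be passed as the third argument:
        //   df.write().mode("append").jdbc(url, table, props);
    }
}
```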
  - json
```
public void json(String path)
```
    Saves the content of the DataFrame in JSON format (JSON Lines text format or newline-delimited JSON) at the specified path. This is equivalent to:
```
   format("json").save(path)
 
```
    You can set the following JSON-specific option(s) for writing JSON files:
    - compression (default null): compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, bzip2, gzip, lz4, snappy and deflate).
    - dateFormat (default yyyy-MM-dd): sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type.
    - timestampFormat (default yyyy-MM-dd'T'HH:mm:ss.SSSXXX): sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type.
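    Because both patterns follow java.text.SimpleDateFormat, their effect can be previewed with the JDK alone. The sketch below formats one instant with the two default patterns; the class and method names are hypothetical, and passing the time zone explicitly only stands in for the timeZone option described under option.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Illustrative only: previews the default dateFormat and timestampFormat patterns. */
final class DateFormatOptionSketch {

    /** Formats an instant with a SimpleDateFormat pattern in the given time zone. */
    static String format(String pattern, String timeZoneId, Date instant) {
        SimpleDateFormat fmt = new SimpleDateFormat(pattern, Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone(timeZoneId));
        return fmt.format(instant);
    }

    public static void main(String[] args) {
        Date epoch = new Date(0L);
        System.out.println(format("yyyy-MM-dd", "UTC", epoch));
        // prints 1970-01-01
        System.out.println(format("yyyy-MM-dd'T'HH:mm:ss.SSSXXX", "UTC", epoch));
        // prints 1970-01-01T00:00:00.000Z
        System.out.println(format("yyyy-MM-dd'T'HH:mm:ss.SSSXXX", "GMT+01:00", epoch));
        // prints 1970-01-01T01:00:00.000+01:00
    }
}
```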
    Parameters:
    
    path - (undocumented)
    
    Since:
    
    1.4.0
  - parquet
```
public void parquet(String path)
```
    Saves the content of the DataFrame in Parquet format at the specified path. This is equivalent to:
```
   format("parquet").save(path)
 
```
    You can set the following Parquet-specific option(s) for writing Parquet files:
    - compression (default is the value specified in spark.sql.parquet.compression.codec): compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, snappy, gzip, and lzo). This will override spark.sql.parquet.compression.codec.
    Parameters:
    
    path - (undocumented)
    
    Since:
    
    1.4.0
  - orc
```
public void orc(String path)
```
    Saves the content of the DataFrame in ORC format at the specified path. This is equivalent to:
```
   format("orc").save(path)
 
```
    You can set the following ORC-specific option(s) for writing ORC files:
    - compression (default is the value specified in spark.sql.orc.compression.codec): compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, snappy, zlib, and lzo). This will override orc.compress and spark.sql.orc.compression.codec. If orc.compress is given, it overrides spark.sql.orc.compression.codec.
    Parameters:
    
    path - (undocumented)
    
    Since:
    
    1.5.0
    
    Note:
    
    Currently, this method can only be used after enabling Hive support
  - text
```
public void text(String path)
```
    Saves the content of the DataFrame in a text file at the specified path. The DataFrame must have only one column that is of string type. Each row becomes a new line in the output file. For example:
```
   // Scala:
   df.write.text("/path/to/output")

   // Java:
   df.write().text("/path/to/output")
 
```
    You can set the following option(s) for writing text files:
    - compression (default null): compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, bzip2, gzip, lz4, snappy and deflate).
    Parameters:
    
    path - (undocumented)
    
    Since:
    
    1.6.0
  - csv
```
public void csv(String path)
```
    Saves the content of the DataFrame in CSV format at the specified path. This is equivalent to:
```
   format("csv").save(path)
 
```
    You can set the following CSV-specific option(s) for writing CSV files:
    - sep (default ,): sets a single character as a separator for each field and value.
    - quote (default "): sets a single character used for escaping quoted values where the separator can be part of the value. If an empty string is set, it uses \u0000 (null character).
    - escape (default \): sets a single character used for escaping quotes inside an already quoted value.
    - charToEscapeQuoteEscaping (default escape or \0): sets a single character used for escaping the escape for the quote character. The default value is escape character when escape and quote characters are different, \0 otherwise.
    - escapeQuotes (default true): a flag indicating whether values containing quotes should always be enclosed in quotes. Default is to escape all values containing a quote character.
    - quoteAll (default false): a flag indicating whether all values should always be enclosed in quotes. Default is to only escape values containing a quote character.
    - header (default false): writes the names of columns as the first line.
    - nullValue (default empty string): sets the string representation of a null value.
    - compression (default null): compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, bzip2, gzip, lz4, snappy and deflate).
    - dateFormat (default yyyy-MM-dd): sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type.
    - timestampFormat (default yyyy-MM-dd'T'HH:mm:ss.SSSXXX): sets the string that indicates a timestamp format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to timestamp type.
    - ignoreLeadingWhiteSpace (default true): a flag indicating whether or not leading whitespaces from values being written should be skipped.
    - ignoreTrailingWhiteSpace (default true): a flag indicating whether or not trailing whitespaces from values being written should be skipped.
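    How sep, quote, escape, quoteAll and escapeQuotes interact can be approximated with a small sketch. It is a simplified reading of the option descriptions above, not Spark's CSV writer: the class and method names are hypothetical, and newlines, charToEscapeQuoteEscaping, nullValue and whitespace handling are all left out.

```java
/** Illustrative only: a simplified model of CSV field quoting based on the options above. */
final class CsvQuotingSketch {

    /** Renders one field, enclosing and escaping it according to the given options. */
    static String writeField(String value, char sep, char quote, char escape,
                             boolean quoteAll, boolean escapeQuotes) {
        boolean hasSep = value.indexOf(sep) >= 0;
        boolean hasQuote = value.indexOf(quote) >= 0;
        // Enclose when forced, when the separator appears, or when quotes must be escaped.
        boolean enclose = quoteAll || hasSep || (escapeQuotes && hasQuote);
        if (!enclose) {
            return value;
        }
        StringBuilder out = new StringBuilder().append(quote);
        for (char c : value.toCharArray()) {
            if (c == quote) {
                out.append(escape); // escape a quote inside an already quoted value
            }
            out.append(c);
        }
        return out.append(quote).toString();
    }

    public static void main(String[] args) {
        // Defaults from the list above: sep ',', quote '"', escape '\'.
        System.out.println(writeField("a,b", ',', '"', '\\', false, true));   // prints "a,b"
        System.out.println(writeField("plain", ',', '"', '\\', false, true)); // prints plain
        System.out.println(writeField("plain", ',', '"', '\\', true, true));  // prints "plain"
    }
}
```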
    Parameters:
    
    path - (undocumented)
    
    Since:
    
    2.0.0
