pyspark.sql.DataFrameReader.option

DataFrameReader.option(key, value)[source]

Adds an input option for the underlying data source.

You can set the following option(s) for reading files:
  • timeZone: sets the string that indicates a time zone ID, used to parse timestamps in the JSON/CSV data sources or partition values. The following formats of timeZone are supported:

    • Region-based zone ID: it should have the form ‘area/city’, such as ‘America/Los_Angeles’.

    • Zone offset: it should be in the format ‘(+|-)HH:mm’, for example ‘-08:00’ or ‘+01:00’. ‘UTC’ and ‘Z’ are also supported as aliases of ‘+00:00’.

    Other short names such as ‘CST’ are not recommended because they can be ambiguous. If timeZone is not set, the current value of the SQL config spark.sql.session.timeZone is used by default.
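As a sketch of the two supported formats (using only the standard library; the spark.read call is shown as a comment and assumes an existing SparkSession named spark and a file data.json):

```python
from datetime import datetime, timedelta, timezone

# Region-based zone IDs take the 'area/city' form; fixed offsets use '(+|-)HH:mm'.
# 'UTC' and 'Z' are aliases of '+00:00'.
region_id = "America/Los_Angeles"
offset_id = "-08:00"

# A fixed offset like '-08:00' corresponds to a UTC offset of -8 hours:
tz = timezone(timedelta(hours=-8))
ts = datetime(2020, 6, 1, 13, 0, 0, tzinfo=tz)
print(ts.isoformat())  # 2020-06-01T13:00:00-08:00

# Either zone ID string would be passed to the reader, e.g.:
# df = spark.read.option("timeZone", region_id).json("data.json")
```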

  • pathGlobFilter: an optional glob pattern that includes only files whose paths match the pattern. The syntax follows org.apache.hadoop.fs.GlobFilter. It does not change the behavior of partition discovery.

  • modifiedBefore: an optional timestamp that includes only files whose modification times occur before the specified time. The provided timestamp must be in the format YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00).

  • modifiedAfter: an optional timestamp that includes only files whose modification times occur after the specified time. The provided timestamp must be in the format YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00).
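A minimal sketch of building a timestamp in the required YYYY-MM-DDTHH:mm:ss format and chaining these options onto a read (the spark.read call is shown as a comment; it assumes an existing SparkSession named spark and a directory logs/):

```python
from datetime import datetime

# Format a cutoff timestamp the way modifiedBefore/modifiedAfter expect.
cutoff = datetime(2020, 6, 1, 13, 0, 0).strftime("%Y-%m-%dT%H:%M:%S")
print(cutoff)  # 2020-06-01T13:00:00

# Options chain onto the reader; each option() call returns the reader itself:
# df = (spark.read
#           .option("pathGlobFilter", "*.json")
#           .option("modifiedAfter", cutoff)
#           .json("logs/"))
```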

New in version 1.5.