SparkContext.sequenceFile
Read a Hadoop SequenceFile with arbitrary key and value Writable classes from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. The mechanism is as follows:
1. A Java RDD is created from the SequenceFile or other InputFormat, and the key and value Writable classes
2. Serialization is attempted via Pyrolite pickling
3. If this fails, the fallback is to call 'toString' on each key and value
4. CPickleSerializer is used to deserialize pickled objects on the Python side
New in version 1.3.0.
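To make the round trip above concrete, the sketch below writes a small RDD of (int, str) pairs as a SequenceFile and reads it back; it assumes a live SparkContext bound to sc, and the directory name is arbitrary. The keys and values are stored as Writables on write and unpickled back into plain Python int and str objects on read.

>>> import os
>>> import tempfile
>>> with tempfile.TemporaryDirectory() as d:
...     path = os.path.join(d, "ints_and_strings")
...     # ints and strings are converted to Writables on write,
...     # then pickled back to Python objects on read
...     sc.parallelize([(1, "a"), (2, "b")]).saveAsSequenceFile(path)
...     pairs = sorted(sc.sequenceFile(path).collect())
>>> pairs
[(1, 'a'), (2, 'b')]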
Parameters
path : str
    path to the SequenceFile
keyClass : str, optional
    fully qualified classname of key Writable class (e.g. "org.apache.hadoop.io.Text")
valueClass : str, optional
    fully qualified classname of value Writable class (e.g. "org.apache.hadoop.io.LongWritable")
keyConverter : str, optional
    fully qualified name of a function returning key WritableConverter
valueConverter : str, optional
    fully qualified name of a function returning value WritableConverter
minSplits : int, optional
    minimum splits in dataset (default min(2, sc.defaultParallelism))
batchSize : int, optional
    The number of Python objects represented as a single Java object (default 0, choose batchSize automatically)
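For example, a call that pins the key and value classes and requests a minimum number of splits might look like the following sketch; the HDFS path is hypothetical and only illustrates the parameter shapes.

>>> rdd = sc.sequenceFile(
...     "hdfs://namenode:9000/data/events.seq",  # hypothetical path
...     keyClass="org.apache.hadoop.io.Text",
...     valueClass="org.apache.hadoop.io.LongWritable",
...     minSplits=4,
... )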
Returns
RDD
    RDD of tuples of key and corresponding value
See also
RDD.saveAsSequenceFile()
RDD.saveAsNewAPIHadoopFile()
RDD.saveAsHadoopFile()
SparkContext.newAPIHadoopFile()
SparkContext.hadoopFile()
Examples
>>> import os
>>> import tempfile
Set the output format class
>>> output_format_class = "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat"
>>> with tempfile.TemporaryDirectory() as d:
...     path = os.path.join(d, "hadoop_file")
...
...     # Write a temporary Hadoop file
...     rdd = sc.parallelize([(1, {3.0: "bb"}), (2, {1.0: "aa"}), (3, {2.0: "dd"})])
...     rdd.saveAsNewAPIHadoopFile(path, output_format_class)
...
...     collected = sorted(sc.sequenceFile(path).collect())
>>> collected
[(1, {3.0: 'bb'}), (2, {1.0: 'aa'}), (3, {2.0: 'dd'})]