SparkContext.addFile(path, recursive=False)
Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI.
To access the file in Spark jobs, use SparkFiles.get() with the filename to find its download location.
A directory can be given if the recursive option is set to True. Currently directories are only supported for Hadoop-supported filesystems.
New in version 0.7.0.
Parameters
path : str
    can be either a local file, a file in HDFS (or other Hadoop-supported filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs, use SparkFiles.get() to find its download location.
recursive : bool, default False
    whether to recursively add files in the input directory
See also
SparkContext.listFiles()
SparkContext.addPyFile()
Notes
A path can be added only once. Subsequent additions of the same path are ignored.
Examples
>>> import os
>>> import tempfile
>>> from pyspark import SparkFiles

>>> with tempfile.TemporaryDirectory() as d:
...     path1 = os.path.join(d, "test1.txt")
...     with open(path1, "w") as f:
...         _ = f.write("100")
...
...     path2 = os.path.join(d, "test2.txt")
...     with open(path2, "w") as f:
...         _ = f.write("200")
...
...     sc.addFile(path1)
...     file_list1 = sorted(sc.listFiles)
...
...     sc.addFile(path2)
...     file_list2 = sorted(sc.listFiles)
...
...     # add path2 twice, this addition will be ignored
...     sc.addFile(path2)
...     file_list3 = sorted(sc.listFiles)
...
...     def func(iterator):
...         with open(SparkFiles.get("test1.txt")) as f:
...             mul = int(f.readline())
...         return [x * mul for x in iterator]
...
...     collected = sc.parallelize([1, 2, 3, 4]).mapPartitions(func).collect()

>>> file_list1
['file:/.../test1.txt']
>>> file_list2
['file:/.../test1.txt', 'file:/.../test2.txt']
>>> file_list3
['file:/.../test1.txt', 'file:/.../test2.txt']
>>> collected
[100, 200, 300, 400]