JSON Files
Scala

Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset[Row]. This conversion can be done using SparkSession.read.json() on either a Dataset[String] or a JSON file.
Note that a file offered as a JSON file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, see JSON Lines text format, also called newline-delimited JSON.
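For illustration, a minimal JSON Lines input (shaped like the people.json file used in the examples below) carries one self-contained object per line:

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}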
For a regular multi-line JSON file, set the multiLine option to true.
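For example, a minimal sketch of reading such a file (the path people_multiline.json is hypothetical; multiLine is a standard reader option):

// Read a pretty-printed (multi-line) JSON file; the path below is hypothetical
val multiLinePeopleDF = spark.read
  .option("multiLine", "true")
  .json("examples/src/main/resources/people_multiline.json")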
// Primitive types (Int, String, etc) and Product types (case classes) encoders are
// supported by importing this when creating a Dataset.
import spark.implicits._

// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files
val path = "examples/src/main/resources/people.json"
val peopleDF = spark.read.json(path)

// The inferred schema can be visualized using the printSchema() method
peopleDF.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by spark
val teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagerNamesDF.show()
// +------+
// |  name|
// +------+
// |Justin|
// +------+

// Alternatively, a DataFrame can be created for a JSON dataset represented by
// a Dataset[String] storing one JSON object per string
val otherPeopleDataset = spark.createDataset(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val otherPeople = spark.read.json(otherPeopleDataset)
otherPeople.show()
// +---------------+----+
// |        address|name|
// +---------------+----+
// |[Columbus,Ohio]| Yin|
// +---------------+----+
Java

Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset<Row>. This conversion can be done using SparkSession.read().json() on either a Dataset<String> or a JSON file.
Note that a file offered as a JSON file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, see JSON Lines text format, also called newline-delimited JSON.
For a regular multi-line JSON file, set the multiLine option to true.
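For example, a minimal sketch of reading such a file (the path people_multiline.json is hypothetical; multiLine is a standard reader option):

// Read a pretty-printed (multi-line) JSON file; the path below is hypothetical
Dataset<Row> multiLinePeople = spark.read()
  .option("multiLine", true)
  .json("examples/src/main/resources/people_multiline.json");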
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files
Dataset<Row> people = spark.read().json("examples/src/main/resources/people.json");

// The inferred schema can be visualized using the printSchema() method
people.printSchema();
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)

// Creates a temporary view using the DataFrame
people.createOrReplaceTempView("people");

// SQL statements can be run by using the sql methods provided by spark
Dataset<Row> namesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19");
namesDF.show();
// +------+
// |  name|
// +------+
// |Justin|
// +------+

// Alternatively, a DataFrame can be created for a JSON dataset represented by
// a Dataset<String> storing one JSON object per string.
List<String> jsonData = Arrays.asList(
  "{\"name\":\"Yin\",\"address\":{\"city\":\"Columbus\",\"state\":\"Ohio\"}}");
Dataset<String> anotherPeopleDataset = spark.createDataset(jsonData, Encoders.STRING());
Dataset<Row> anotherPeople = spark.read().json(anotherPeopleDataset);
anotherPeople.show();
// +---------------+----+
// |        address|name|
// +---------------+----+
// |[Columbus,Ohio]| Yin|
// +---------------+----+
Python

Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using SparkSession.read.json on a JSON file.
Note that a file offered as a JSON file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, see JSON Lines text format, also called newline-delimited JSON.
For a regular multi-line JSON file, set the multiLine parameter to True.
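For example, a minimal sketch of reading such a file (the path people_multiline.json is hypothetical; multiLine is a standard keyword option of DataFrameReader.json):

# Read a pretty-printed (multi-line) JSON file; the path below is hypothetical
multiLinePeopleDF = spark.read.json(
    "examples/src/main/resources/people_multiline.json", multiLine=True)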
# spark is from the previous example.
sc = spark.sparkContext

# A JSON dataset is pointed to by path.
# The path can be either a single text file or a directory storing text files
path = "examples/src/main/resources/people.json"
peopleDF = spark.read.json(path)

# The inferred schema can be visualized using the printSchema() method
peopleDF.printSchema()
# root
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)

# Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

# SQL statements can be run by using the sql methods provided by spark
teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagerNamesDF.show()
# +------+
# |  name|
# +------+
# |Justin|
# +------+

# Alternatively, a DataFrame can be created for a JSON dataset represented by
# an RDD[String] storing one JSON object per string
jsonStrings = ['{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}']
otherPeopleRDD = sc.parallelize(jsonStrings)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.show()
# +---------------+----+
# |        address|name|
# +---------------+----+
# |[Columbus,Ohio]| Yin|
# +---------------+----+
R

Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame using the read.json() function, which loads data from a directory of JSON files where each line of the files is a JSON object.
Note that a file offered as a JSON file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, see JSON Lines text format, also called newline-delimited JSON.
For a regular multi-line JSON file, set a named parameter multiLine to TRUE.
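For example, a minimal sketch of reading such a file (the path people_multiline.json is hypothetical; additional named arguments to read.json are passed through as data source options):

# Read a pretty-printed (multi-line) JSON file; the path below is hypothetical
multiLinePeople <- read.json("examples/src/main/resources/people_multiline.json", multiLine = TRUE)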
# A JSON dataset is pointed to by path.
# The path can be either a single text file or a directory storing text files.
path <- "examples/src/main/resources/people.json"
# Create a DataFrame from the file(s) pointed to by path
people <- read.json(path)

# The inferred schema can be visualized using the printSchema() method.
printSchema(people)
## root
##  |-- age: long (nullable = true)
##  |-- name: string (nullable = true)

# Register this DataFrame as a table.
createOrReplaceTempView(people, "people")

# SQL statements can be run by using the sql methods.
teenagers <- sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
head(teenagers)
##     name
## 1 Justin
Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.