JSON Files
Scala

Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset[Row]. This conversion can be done using SparkSession.read.json() on either a Dataset[String] or a JSON file.
Note that a file offered as a JSON file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, see JSON Lines text format, also called newline-delimited JSON.
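For illustration, a minimal JSON Lines input (shaped like the people.json file used in the examples below) carries one self-contained object per line:

{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}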
For a regular multi-line JSON file, set the multiLine option to true.
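For example, a minimal sketch of reading such a file (the path people_multiline.json is hypothetical; multiLine is a standard reader option):

// Read a pretty-printed (multi-line) JSON file; the path below is hypothetical
val multiLinePeopleDF = spark.read
  .option("multiLine", "true")
  .json("examples/src/main/resources/people_multiline.json")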
// Primitive types (Int, String, etc) and Product types (case classes) encoders are
// supported by importing this when creating a Dataset.
import spark.implicits._

// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files
val path = "examples/src/main/resources/people.json"
val peopleDF = spark.read.json(path)

// The inferred schema can be visualized using the printSchema() method
peopleDF.printSchema()
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)

// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

// SQL statements can be run by using the sql methods provided by spark
val teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagerNamesDF.show()
// +------+
// |  name|
// +------+
// |Justin|
// +------+

// Alternatively, a DataFrame can be created for a JSON dataset represented by
// a Dataset[String] storing one JSON object per string
val otherPeopleDataset = spark.createDataset(
  """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
val otherPeople = spark.read.json(otherPeopleDataset)
otherPeople.show()
// +---------------+----+
// |        address|name|
// +---------------+----+
// |[Columbus,Ohio]| Yin|
// +---------------+----+
Java

Spark SQL can automatically infer the schema of a JSON dataset and load it as a Dataset<Row>. This conversion can be done using SparkSession.read().json() on either a Dataset<String> or a JSON file.
Note that a file offered as a JSON file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, see JSON Lines text format, also called newline-delimited JSON.
For a regular multi-line JSON file, set the multiLine option to true.
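For example, a minimal sketch of reading such a file (the path people_multiline.json is hypothetical; multiLine is a standard reader option):

// Read a pretty-printed (multi-line) JSON file; the path below is hypothetical
Dataset<Row> multiLinePeople = spark.read()
  .option("multiLine", true)
  .json("examples/src/main/resources/people_multiline.json");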
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// A JSON dataset is pointed to by path.
// The path can be either a single text file or a directory storing text files
Dataset<Row> people = spark.read().json("examples/src/main/resources/people.json");

// The inferred schema can be visualized using the printSchema() method
people.printSchema();
// root
//  |-- age: long (nullable = true)
//  |-- name: string (nullable = true)

// Creates a temporary view using the DataFrame
people.createOrReplaceTempView("people");

// SQL statements can be run by using the sql methods provided by spark
Dataset<Row> namesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19");
namesDF.show();
// +------+
// |  name|
// +------+
// |Justin|
// +------+

// Alternatively, a DataFrame can be created for a JSON dataset represented by
// a Dataset<String> storing one JSON object per string.
List<String> jsonData = Arrays.asList(
  "{\"name\":\"Yin\",\"address\":{\"city\":\"Columbus\",\"state\":\"Ohio\"}}");
Dataset<String> anotherPeopleDataset = spark.createDataset(jsonData, Encoders.STRING());
Dataset<Row> anotherPeople = spark.read().json(anotherPeopleDataset);
anotherPeople.show();
// +---------------+----+
// |        address|name|
// +---------------+----+
// |[Columbus,Ohio]| Yin|
// +---------------+----+
Python

Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using SparkSession.read.json on a JSON file.
Note that a file offered as a JSON file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, see JSON Lines text format, also called newline-delimited JSON.
For a regular multi-line JSON file, set the multiLine parameter to True.
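For example, a minimal sketch of reading such a file (the path people_multiline.json is hypothetical; multiLine is a standard keyword option of DataFrameReader.json):

# Read a pretty-printed (multi-line) JSON file; the path below is hypothetical
multiLinePeopleDF = spark.read.json(
    "examples/src/main/resources/people_multiline.json", multiLine=True)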
# spark is from the previous example.
sc = spark.sparkContext

# A JSON dataset is pointed to by path.
# The path can be either a single text file or a directory storing text files
path = "examples/src/main/resources/people.json"
peopleDF = spark.read.json(path)

# The inferred schema can be visualized using the printSchema() method
peopleDF.printSchema()
# root
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)

# Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

# SQL statements can be run by using the sql methods provided by spark
teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagerNamesDF.show()
# +------+
# |  name|
# +------+
# |Justin|
# +------+

# Alternatively, a DataFrame can be created for a JSON dataset represented by
# an RDD[String] storing one JSON object per string
jsonStrings = ['{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}']
otherPeopleRDD = sc.parallelize(jsonStrings)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.show()
# +---------------+----+
# |        address|name|
# +---------------+----+
# |[Columbus,Ohio]| Yin|
# +---------------+----+
R

Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame using the read.json() function, which loads data from a directory of JSON files where each line of the files is a JSON object.
Note that a file offered as a JSON file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, see JSON Lines text format, also called newline-delimited JSON.
For a regular multi-line JSON file, set a named parameter multiLine to TRUE.
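For example, a minimal sketch of reading such a file (the path people_multiline.json is hypothetical; additional named arguments to read.json are passed through as data source options):

# Read a pretty-printed (multi-line) JSON file; the path below is hypothetical
multiLinePeople <- read.json("examples/src/main/resources/people_multiline.json", multiLine = TRUE)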
# A JSON dataset is pointed to by path.
# The path can be either a single text file or a directory storing text files.
path <- "examples/src/main/resources/people.json"
# Create a DataFrame from the file(s) pointed to by path
people <- read.json(path)

# The inferred schema can be visualized using the printSchema() method.
printSchema(people)
## root
##  |-- age: long (nullable = true)
##  |-- name: string (nullable = true)

# Register this DataFrame as a table.
createOrReplaceTempView(people, "people")

# SQL statements can be run by using the sql methods.
teenagers <- sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
head(teenagers)
##     name
## 1 Justin
Find full example code at "examples/src/main/r/RSparkSQLExample.R" in the Spark repo.