class Dataset[T] extends Serializable
A Dataset is a strongly typed collection of domain-specific objects that can be transformed
in parallel using functional or relational operations. Each Dataset also has an untyped view
called a DataFrame
, which is a Dataset of Row.
Operations available on Datasets are divided into transformations and actions. Transformations
are the ones that produce new Datasets, and actions are the ones that trigger computation and
return results. Example transformations include map, filter, select, and aggregate (groupBy
).
Example actions count, show, or writing data out to file systems.
Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally,
a Dataset represents a logical plan that describes the computation required to produce the data.
When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a
physical plan for efficient execution in a parallel and distributed manner. To explore the
logical plan as well as optimized physical plan, use the explain
function.
To efficiently support domain-specific objects, an Encoder is required. The encoder maps
the domain specific type T
to Spark's internal type system. For example, given a class Person
with two fields, name
(string) and age
(int), an encoder is used to tell Spark to generate
code at runtime to serialize the Person
object into a binary structure. This binary structure
often has much lower memory footprint as well as are optimized for efficiency in data processing
(e.g. in a columnar format). To understand the internal binary representation for data, use the
schema
function.
There are typically two ways to create a Dataset. The most common way is by pointing Spark
to some files on storage systems, using the read
function available on a SparkSession
.
val people = spark.read.parquet("...").as[Person] // Scala Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class)); // Java
Datasets can also be created through transformations available on existing Datasets. For example, the following creates a new Dataset by applying a filter on the existing one:
val names = people.map(_.name) // in Scala; names is a Dataset[String] Dataset<String> names = people.map((Person p) -> p.name, Encoders.STRING));
Dataset operations can also be untyped, through various domain-specific-language (DSL) functions defined in: Dataset (this class), Column, and functions. These operations are very similar to the operations available in the data frame abstraction in R or Python.
To select a column from the Dataset, use apply
method in Scala and col
in Java.
val ageCol = people("age") // in Scala Column ageCol = people.col("age"); // in Java
Note that the Column type can also be manipulated through its various functions.
// The following creates a new column that increases everybody's age by 10. people("age") + 10 // in Scala people.col("age").plus(10); // in Java
A more concrete example in Scala:
// To create Dataset[Row] using SparkSession val people = spark.read.parquet("...") val department = spark.read.parquet("...") people.filter("age > 30") .join(department, people("deptId") === department("id")) .groupBy(department("name"), people("gender")) .agg(avg(people("salary")), max(people("age")))
and in Java:
// To create Dataset<Row> using SparkSession Dataset<Row> people = spark.read().parquet("..."); Dataset<Row> department = spark.read().parquet("..."); people.filter(people.col("age").gt(30)) .join(department, people.col("deptId").equalTo(department.col("id"))) .groupBy(department.col("name"), people.col("gender")) .agg(avg(people.col("salary")), max(people.col("age")));
- Annotations
- @Stable()
- Source
- Dataset.scala
- Since
1.6.0
- Grouped
- Alphabetic
- By Inheritance
- Dataset
- Serializable
- Serializable
- AnyRef
- Any
- Hide All
- Show All
- Public
- All
Instance Constructors
- new Dataset(sqlContext: SQLContext, logicalPlan: LogicalPlan, encoder: Encoder[T])
- new Dataset(sparkSession: SparkSession, logicalPlan: LogicalPlan, encoder: Encoder[T])
Value Members
-
final
def
!=(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
final
def
##(): Int
- Definition Classes
- AnyRef → Any
-
final
def
==(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
agg(expr: Column, exprs: Column*): DataFrame
Aggregates on the entire Dataset without groups.
Aggregates on the entire Dataset without groups.
// ds.agg(...) is a shorthand for ds.groupBy().agg(...) ds.agg(max($"age"), avg($"salary")) ds.groupBy().agg(max($"age"), avg($"salary"))
- Annotations
- @varargs()
- Since
2.0.0
-
def
agg(exprs: Map[String, String]): DataFrame
(Java-specific) Aggregates on the entire Dataset without groups.
(Java-specific) Aggregates on the entire Dataset without groups.
// ds.agg(...) is a shorthand for ds.groupBy().agg(...) ds.agg(Map("age" -> "max", "salary" -> "avg")) ds.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
- Since
2.0.0
-
def
agg(exprs: Map[String, String]): DataFrame
(Scala-specific) Aggregates on the entire Dataset without groups.
(Scala-specific) Aggregates on the entire Dataset without groups.
// ds.agg(...) is a shorthand for ds.groupBy().agg(...) ds.agg(Map("age" -> "max", "salary" -> "avg")) ds.groupBy().agg(Map("age" -> "max", "salary" -> "avg"))
- Since
2.0.0
-
def
agg(aggExpr: (String, String), aggExprs: (String, String)*): DataFrame
(Scala-specific) Aggregates on the entire Dataset without groups.
(Scala-specific) Aggregates on the entire Dataset without groups.
// ds.agg(...) is a shorthand for ds.groupBy().agg(...) ds.agg("age" -> "max", "salary" -> "avg") ds.groupBy().agg("age" -> "max", "salary" -> "avg")
- Since
2.0.0
-
def
alias(alias: Symbol): Dataset[T]
(Scala-specific) Returns a new Dataset with an alias set.
(Scala-specific) Returns a new Dataset with an alias set. Same as
as
.- Since
2.0.0
-
def
alias(alias: String): Dataset[T]
Returns a new Dataset with an alias set.
Returns a new Dataset with an alias set. Same as
as
.- Since
2.0.0
-
def
apply(colName: String): Column
Selects column based on the column name and returns it as a Column.
Selects column based on the column name and returns it as a Column.
- Since
2.0.0
- Note
The column name can also reference to a nested column like
a.b
.
-
def
as(alias: Symbol): Dataset[T]
(Scala-specific) Returns a new Dataset with an alias set.
(Scala-specific) Returns a new Dataset with an alias set.
- Since
2.0.0
-
def
as(alias: String): Dataset[T]
Returns a new Dataset with an alias set.
Returns a new Dataset with an alias set.
- Since
1.6.0
-
def
as[U](implicit arg0: Encoder[U]): Dataset[U]
Returns a new Dataset where each record has been mapped on to the specified type.
Returns a new Dataset where each record has been mapped on to the specified type. The method used to map columns depend on the type of
U
:- When
U
is a class, fields for the class will be mapped to columns of the same name (case sensitivity is determined byspark.sql.caseSensitive
). - When
U
is a tuple, the columns will be mapped by ordinal (i.e. the first column will be assigned to_1
). - When
U
is a primitive type (i.e. String, Int, etc), then the first column of theDataFrame
will be used.
If the schema of the Dataset does not match the desired
U
type, you can useselect
along withalias
oras
to rearrange or rename as required.Note that
as[]
only changes the view of the data that is passed into typed operations, such asmap()
, and does not eagerly project away any columns that are not present in the specified class.- Since
1.6.0
- When
-
final
def
asInstanceOf[T0]: T0
- Definition Classes
- Any
-
def
cache(): Dataset.this.type
Persist this Dataset with the default storage level (
MEMORY_AND_DISK
).Persist this Dataset with the default storage level (
MEMORY_AND_DISK
).- Since
1.6.0
-
def
checkpoint(eager: Boolean): Dataset[T]
Returns a checkpointed version of this Dataset.
Returns a checkpointed version of this Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. It will be saved to files inside the checkpoint directory set with
SparkContext#setCheckpointDir
.- Since
2.1.0
-
def
checkpoint(): Dataset[T]
Eagerly checkpoint a Dataset and return the new Dataset.
Eagerly checkpoint a Dataset and return the new Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. It will be saved to files inside the checkpoint directory set with
SparkContext#setCheckpointDir
.- Since
2.1.0
-
def
clone(): AnyRef
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
def
coalesce(numPartitions: Int): Dataset[T]
Returns a new Dataset that has exactly
numPartitions
partitions, when the fewer partitions are requested.Returns a new Dataset that has exactly
numPartitions
partitions, when the fewer partitions are requested. If a larger number of partitions is requested, it will stay at the current number of partitions. Similar to coalesce defined on anRDD
, this operation results in a narrow dependency, e.g. if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of the 100 new partitions will claim 10 of the current partitions.However, if you're doing a drastic coalesce, e.g. to numPartitions = 1, this may result in your computation taking place on fewer nodes than you like (e.g. one node in the case of numPartitions = 1). To avoid this, you can call repartition. This will add a shuffle step, but means the current upstream partitions will be executed in parallel (per whatever the current partitioning is).
- Since
1.6.0
-
def
col(colName: String): Column
Selects column based on the column name and returns it as a Column.
Selects column based on the column name and returns it as a Column.
- Since
2.0.0
- Note
The column name can also reference to a nested column like
a.b
.
-
def
colRegex(colName: String): Column
Selects column based on the column name specified as a regex and returns it as Column.
Selects column based on the column name specified as a regex and returns it as Column.
- Since
2.3.0
-
def
collect(): Array[T]
Returns an array that contains all rows in this Dataset.
Returns an array that contains all rows in this Dataset.
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
For Java API, use collectAsList.
- Since
1.6.0
-
def
collectAsList(): List[T]
Returns a Java list that contains all rows in this Dataset.
Returns a Java list that contains all rows in this Dataset.
Running collect requires moving all the data into the application's driver process, and doing so on a very large dataset can crash the driver process with OutOfMemoryError.
- Since
1.6.0
-
def
columns: Array[String]
Returns all column names as an array.
Returns all column names as an array.
- Since
1.6.0
-
def
count(): Long
Returns the number of rows in the Dataset.
Returns the number of rows in the Dataset.
- Since
1.6.0
-
def
createGlobalTempView(viewName: String): Unit
Creates a global temporary view using the given name.
Creates a global temporary view using the given name. The lifetime of this temporary view is tied to this Spark application.
Global temporary view is cross-session. Its lifetime is the lifetime of the Spark application, i.e. it will be automatically dropped when the application terminates. It's tied to a system preserved database
global_temp
, and we must use the qualified name to refer a global temp view, e.g.SELECT * FROM global_temp.view1
.- Annotations
- @throws( ... )
- Since
2.1.0
- Exceptions thrown
AnalysisException
if the view name is invalid or already exists
-
def
createOrReplaceGlobalTempView(viewName: String): Unit
Creates or replaces a global temporary view using the given name.
Creates or replaces a global temporary view using the given name. The lifetime of this temporary view is tied to this Spark application.
Global temporary view is cross-session. Its lifetime is the lifetime of the Spark application, i.e. it will be automatically dropped when the application terminates. It's tied to a system preserved database
global_temp
, and we must use the qualified name to refer a global temp view, e.g.SELECT * FROM global_temp.view1
.- Since
2.2.0
-
def
createOrReplaceTempView(viewName: String): Unit
Creates a local temporary view using the given name.
Creates a local temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that was used to create this Dataset.
- Since
2.0.0
-
def
createTempView(viewName: String): Unit
Creates a local temporary view using the given name.
Creates a local temporary view using the given name. The lifetime of this temporary view is tied to the SparkSession that was used to create this Dataset.
Local temporary view is session-scoped. Its lifetime is the lifetime of the session that created it, i.e. it will be automatically dropped when the session terminates. It's not tied to any databases, i.e. we can't use
db1.view1
to reference a local temporary view.- Annotations
- @throws( ... )
- Since
2.0.0
- Exceptions thrown
AnalysisException
if the view name is invalid or already exists
-
def
crossJoin(right: Dataset[_]): DataFrame
Explicit cartesian join with another
DataFrame
.Explicit cartesian join with another
DataFrame
.- right
Right side of the join operation.
- Since
2.1.0
- Note
Cartesian joins are very expensive without an extra filter that can be pushed down.
-
def
cube(col1: String, cols: String*): RelationalGroupedDataset
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them.
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
This is a variant of cube that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns cubed by department and group. ds.cube("department", "group").avg() // Compute the max age and average salary, cubed by department and gender. ds.cube($"department", $"gender").agg(Map( "salary" -> "avg", "age" -> "max" ))
- Annotations
- @varargs()
- Since
2.0.0
-
def
cube(cols: Column*): RelationalGroupedDataset
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them.
Create a multi-dimensional cube for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
// Compute the average for all numeric columns cubed by department and group. ds.cube($"department", $"group").avg() // Compute the max age and average salary, cubed by department and gender. ds.cube($"department", $"gender").agg(Map( "salary" -> "avg", "age" -> "max" ))
- Annotations
- @varargs()
- Since
2.0.0
-
def
describe(cols: String*): DataFrame
Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max.
Computes basic statistics for numeric and string columns, including count, mean, stddev, min, and max. If no columns are given, this function computes statistics for all numerical or string columns.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the
agg
function instead.ds.describe("age", "height").show() // output: // summary age height // count 10.0 10.0 // mean 53.3 178.05 // stddev 11.6 15.7 // min 18.0 163.0 // max 92.0 192.0
Use summary for expanded statistics and control over which statistics to compute.
- cols
Columns to compute statistics on.
- Annotations
- @varargs()
- Since
1.6.0
-
def
distinct(): Dataset[T]
Returns a new Dataset that contains only the unique rows from this Dataset.
Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for
dropDuplicates
.Note that for a streaming Dataset, this method returns distinct rows only once regardless of the output mode, which the behavior may not be same with
DISTINCT
in SQL against streaming Dataset.- Since
2.0.0
- Note
Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom
equals
function defined onT
.
-
def
drop(col: Column): DataFrame
Returns a new Dataset with a column dropped.
Returns a new Dataset with a column dropped. This version of drop accepts a Column rather than a name. This is a no-op if the Dataset doesn't have a column with an equivalent expression.
- Since
2.0.0
-
def
drop(colNames: String*): DataFrame
Returns a new Dataset with columns dropped.
Returns a new Dataset with columns dropped. This is a no-op if schema doesn't contain column name(s).
This method can only be used to drop top level columns. the colName string is treated literally without further interpretation.
- Annotations
- @varargs()
- Since
2.0.0
-
def
drop(colName: String): DataFrame
Returns a new Dataset with a column dropped.
Returns a new Dataset with a column dropped. This is a no-op if schema doesn't contain column name.
This method can only be used to drop top level columns. the colName string is treated literally without further interpretation.
- Since
2.0.0
-
def
dropDuplicates(col1: String, cols: String*): Dataset[T]
Returns a new Dataset with duplicate rows removed, considering only the subset of columns.
Returns a new Dataset with duplicate rows removed, considering only the subset of columns.
For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicates rows. You can use withWatermark to limit how late the duplicate data can be and system will accordingly limit the state. In addition, too late data older than watermark will be dropped to avoid any possibility of duplicates.
- Annotations
- @varargs()
- Since
2.0.0
-
def
dropDuplicates(colNames: Array[String]): Dataset[T]
Returns a new Dataset with duplicate rows removed, considering only the subset of columns.
Returns a new Dataset with duplicate rows removed, considering only the subset of columns.
For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicates rows. You can use withWatermark to limit how late the duplicate data can be and system will accordingly limit the state. In addition, too late data older than watermark will be dropped to avoid any possibility of duplicates.
- Since
2.0.0
-
def
dropDuplicates(colNames: Seq[String]): Dataset[T]
(Scala-specific) Returns a new Dataset with duplicate rows removed, considering only the subset of columns.
(Scala-specific) Returns a new Dataset with duplicate rows removed, considering only the subset of columns.
For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicates rows. You can use withWatermark to limit how late the duplicate data can be and system will accordingly limit the state. In addition, too late data older than watermark will be dropped to avoid any possibility of duplicates.
- Since
2.0.0
-
def
dropDuplicates(): Dataset[T]
Returns a new Dataset that contains only the unique rows from this Dataset.
Returns a new Dataset that contains only the unique rows from this Dataset. This is an alias for
distinct
.For a static batch Dataset, it just drops duplicate rows. For a streaming Dataset, it will keep all data across triggers as intermediate state to drop duplicates rows. You can use withWatermark to limit how late the duplicate data can be and system will accordingly limit the state. In addition, too late data older than watermark will be dropped to avoid any possibility of duplicates.
- Since
2.0.0
-
def
dtypes: Array[(String, String)]
Returns all column names and their data types as an array.
Returns all column names and their data types as an array.
- Since
1.6.0
- val encoder: Encoder[T]
-
final
def
eq(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
def
equals(arg0: Any): Boolean
- Definition Classes
- AnyRef → Any
-
def
except(other: Dataset[T]): Dataset[T]
Returns a new Dataset containing rows in this Dataset but not in another Dataset.
Returns a new Dataset containing rows in this Dataset but not in another Dataset. This is equivalent to
EXCEPT DISTINCT
in SQL.- Since
2.0.0
- Note
Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom
equals
function defined onT
.
-
def
exceptAll(other: Dataset[T]): Dataset[T]
Returns a new Dataset containing rows in this Dataset but not in another Dataset while preserving the duplicates.
Returns a new Dataset containing rows in this Dataset but not in another Dataset while preserving the duplicates. This is equivalent to
EXCEPT ALL
in SQL.- Since
2.4.0
- Note
Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom
equals
function defined onT
. Also as standard in SQL, this function resolves columns by position (not by name).
-
def
explain(): Unit
Prints the physical plan to the console for debugging purposes.
Prints the physical plan to the console for debugging purposes.
- Since
1.6.0
-
def
explain(extended: Boolean): Unit
Prints the plans (logical and physical) to the console for debugging purposes.
Prints the plans (logical and physical) to the console for debugging purposes.
- extended
default
false
. Iffalse
, prints only the physical plan.
- Since
1.6.0
-
def
explain(mode: String): Unit
Prints the plans (logical and physical) with a format specified by a given explain mode.
Prints the plans (logical and physical) with a format specified by a given explain mode.
- mode
specifies the expected output format of plans.
simple
Print only a physical plan.extended
: Print both logical and physical plans.codegen
: Print a physical plan and generated codes if they are available.cost
: Print a logical plan and statistics if they are available.formatted
: Split explain output into two sections: a physical plan outline and node details.
- Since
3.0.0
-
def
filter(func: FilterFunction[T]): Dataset[T]
(Java-specific) Returns a new Dataset that only contains elements where
func
returnstrue
.(Java-specific) Returns a new Dataset that only contains elements where
func
returnstrue
.- Since
1.6.0
-
def
filter(func: (T) ⇒ Boolean): Dataset[T]
(Scala-specific) Returns a new Dataset that only contains elements where
func
returnstrue
.(Scala-specific) Returns a new Dataset that only contains elements where
func
returnstrue
.- Since
1.6.0
-
def
filter(conditionExpr: String): Dataset[T]
Filters rows using the given SQL expression.
Filters rows using the given SQL expression.
peopleDs.filter("age > 15")
- Since
1.6.0
-
def
filter(condition: Column): Dataset[T]
Filters rows using the given condition.
Filters rows using the given condition.
// The following are equivalent: peopleDs.filter($"age" > 15) peopleDs.where($"age" > 15)
- Since
1.6.0
-
def
finalize(): Unit
- Attributes
- protected[lang]
- Definition Classes
- AnyRef
- Annotations
- @throws( classOf[java.lang.Throwable] )
-
def
first(): T
Returns the first row.
Returns the first row. Alias for head().
- Since
1.6.0
-
def
flatMap[U](f: FlatMapFunction[T, U], encoder: Encoder[U]): Dataset[U]
(Java-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
(Java-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
- Since
1.6.0
-
def
flatMap[U](func: (T) ⇒ TraversableOnce[U])(implicit arg0: Encoder[U]): Dataset[U]
(Scala-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
(Scala-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
- Since
1.6.0
-
def
foreach(func: ForeachFunction[T]): Unit
(Java-specific) Runs
func
on each element of this Dataset.(Java-specific) Runs
func
on each element of this Dataset.- Since
1.6.0
-
def
foreach(f: (T) ⇒ Unit): Unit
Applies a function
f
to all rows.Applies a function
f
to all rows.- Since
1.6.0
-
def
foreachPartition(func: ForeachPartitionFunction[T]): Unit
(Java-specific) Runs
func
on each partition of this Dataset.(Java-specific) Runs
func
on each partition of this Dataset.- Since
1.6.0
-
def
foreachPartition(f: (Iterator[T]) ⇒ Unit): Unit
Applies a function
f
to each partition of this Dataset.Applies a function
f
to each partition of this Dataset.- Since
1.6.0
-
final
def
getClass(): Class[_]
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
def
groupBy(col1: String, cols: String*): RelationalGroupedDataset
Groups the Dataset using the specified columns, so that we can run aggregation on them.
Groups the Dataset using the specified columns, so that we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
This is a variant of groupBy that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns grouped by department. ds.groupBy("department").avg() // Compute the max age and average salary, grouped by department and gender. ds.groupBy($"department", $"gender").agg(Map( "salary" -> "avg", "age" -> "max" ))
- Annotations
- @varargs()
- Since
2.0.0
-
def
groupBy(cols: Column*): RelationalGroupedDataset
Groups the Dataset using the specified columns, so we can run aggregation on them.
Groups the Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
// Compute the average for all numeric columns grouped by department. ds.groupBy($"department").avg() // Compute the max age and average salary, grouped by department and gender. ds.groupBy($"department", $"gender").agg(Map( "salary" -> "avg", "age" -> "max" ))
- Annotations
- @varargs()
- Since
2.0.0
-
def
groupByKey[K](func: MapFunction[T, K], encoder: Encoder[K]): KeyValueGroupedDataset[K, T]
(Java-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key
func
.(Java-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key
func
.- Since
2.0.0
-
def
groupByKey[K](func: (T) ⇒ K)(implicit arg0: Encoder[K]): KeyValueGroupedDataset[K, T]
(Scala-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key
func
.(Scala-specific) Returns a KeyValueGroupedDataset where the data is grouped by the given key
func
.- Since
2.0.0
-
def
hashCode(): Int
- Definition Classes
- AnyRef → Any
- Annotations
- @native()
-
def
head(): T
Returns the first row.
Returns the first row.
- Since
1.6.0
-
def
head(n: Int): Array[T]
Returns the first
n
rows.Returns the first
n
rows.- Since
1.6.0
- Note
this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.
-
def
hint(name: String, parameters: Any*): Dataset[T]
Specifies some hint on the current Dataset.
Specifies some hint on the current Dataset. As an example, the following code specifies that one of the plan can be broadcasted:
df1.join(df2.hint("broadcast"))
- Annotations
- @varargs()
- Since
2.2.0
-
def
inputFiles: Array[String]
Returns a best-effort snapshot of the files that compose this Dataset.
Returns a best-effort snapshot of the files that compose this Dataset. This method simply asks each constituent BaseRelation for its respective files and takes the union of all results. Depending on the source relations, this may not find all input files. Duplicates are removed.
- Since
2.0.0
-
def
intersect(other: Dataset[T]): Dataset[T]
Returns a new Dataset containing rows only in both this Dataset and another Dataset.
Returns a new Dataset containing rows only in both this Dataset and another Dataset. This is equivalent to
INTERSECT
in SQL.- Since
1.6.0
- Note
Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom
equals
function defined onT
.
-
def
intersectAll(other: Dataset[T]): Dataset[T]
Returns a new Dataset containing rows only in both this Dataset and another Dataset while preserving the duplicates.
Returns a new Dataset containing rows only in both this Dataset and another Dataset while preserving the duplicates. This is equivalent to
INTERSECT ALL
in SQL.- Since
2.4.0
- Note
Equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom
equals
function defined onT
. Also as standard in SQL, this function resolves columns by position (not by name).
-
def
isEmpty: Boolean
Returns true if the
Dataset
is empty.Returns true if the
Dataset
is empty.- Since
2.4.0
-
final
def
isInstanceOf[T0]: Boolean
- Definition Classes
- Any
-
def
isLocal: Boolean
Returns true if the
collect
andtake
methods can be run locally (without any Spark executors).Returns true if the
collect
andtake
methods can be run locally (without any Spark executors).- Since
1.6.0
-
def
isStreaming: Boolean
Returns true if this Dataset contains one or more sources that continuously return data as it arrives.
Returns true if this Dataset contains one or more sources that continuously return data as it arrives. A Dataset that reads data from a streaming source must be executed as a
StreamingQuery
using thestart()
method inDataStreamWriter
. Methods that return a single answer, e.g.count()
orcollect()
, will throw an AnalysisException when there is a streaming source present.- Since
2.0.0
-
def
javaRDD: JavaRDD[T]
Returns the content of the Dataset as a
JavaRDD
ofT
s.Returns the content of the Dataset as a
JavaRDD
ofT
s.- Since
1.6.0
-
def
join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
Join with another
DataFrame
, using the given join expression.Join with another
DataFrame
, using the given join expression. The following performs a full outer join betweendf1
anddf2
.// Scala: import org.apache.spark.sql.functions._ df1.join(df2, $"df1Key" === $"df2Key", "outer") // Java: import static org.apache.spark.sql.functions.*; df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");
- right
Right side of the join.
- joinExprs
Join expression.
- joinType
Type of join to perform. Default
inner
. Must be one of:inner
,cross
,outer
,full
,fullouter
,full_outer
,left
,leftouter
,left_outer
,right
,rightouter
,right_outer
,semi
,leftsemi
,left_semi
,anti
,leftanti
, left_anti.
- Since
2.0.0
-
def
join(right: Dataset[_], joinExprs: Column): DataFrame
Inner join with another
DataFrame
, using the given join expression.Inner join with another
DataFrame
, using the given join expression.// The following two are equivalent: df1.join(df2, $"df1Key" === $"df2Key") df1.join(df2).where($"df1Key" === $"df2Key")
- Since
2.0.0
-
def
join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
Equi-join with another
DataFrame
using the given columns.Equi-join with another
DataFrame
using the given columns. A cross join with a predicate is specified as an inner join. If you would explicitly like to perform a cross join use thecrossJoin
method.Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's
JOIN USING
syntax.- right
Right side of the join operation.
- usingColumns
Names of the columns to join on. This columns must exist on both sides.
- joinType
Type of join to perform. Default
inner
. Must be one of:inner
,cross
,outer
,full
,fullouter
,full_outer
,left
,leftouter
,left_outer
,right
,rightouter
,right_outer
,semi
,leftsemi
,left_semi
,anti
,leftanti
, left_anti.
- Since
2.0.0
- Note
If you perform a self-join using this function without aliasing the input
DataFrame
s, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
-
def
join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
Inner equi-join with another
DataFrame
using the given columns.Inner equi-join with another
DataFrame
using the given columns.Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's
JOIN USING
syntax.// Joining df1 and df2 using the columns "user_id" and "user_name" df1.join(df2, Seq("user_id", "user_name"))
- right
Right side of the join operation.
- usingColumns
Names of the columns to join on. This columns must exist on both sides.
- Since
2.0.0
- Note
If you perform a self-join using this function without aliasing the input
DataFrame
s, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
-
def
join(right: Dataset[_], usingColumn: String): DataFrame
Inner equi-join with another
DataFrame
using the given column.Inner equi-join with another
DataFrame
using the given column.Different from other join functions, the join column will only appear once in the output, i.e. similar to SQL's
JOIN USING
syntax.// Joining df1 and df2 using the column "user_id" df1.join(df2, "user_id")
- right
Right side of the join operation.
- usingColumn
Name of the column to join on. This column must exist on both sides.
- Since
2.0.0
- Note
If you perform a self-join using this function without aliasing the input
DataFrame
s, you will NOT be able to reference any columns after the join, since there is no way to disambiguate which side of the join you would like to reference.
-
def
join(right: Dataset[_]): DataFrame
Join with another
DataFrame
.Join with another
DataFrame
.Behaves as an INNER JOIN and requires a subsequent join predicate.
- right
Right side of the join operation.
- Since
2.0.0
-
def
joinWith[U](other: Dataset[U], condition: Column): Dataset[(T, U)]
Using inner equi-join to join this Dataset returning a
Tuple2
for each pair wherecondition
evaluates to true.Using inner equi-join to join this Dataset returning a
Tuple2
for each pair wherecondition
evaluates to true.- other
Right side of the join.
- condition
Join expression.
- Since
1.6.0
-
def
joinWith[U](other: Dataset[U], condition: Column, joinType: String): Dataset[(T, U)]
Joins this Dataset returning a
Tuple2
for each pair wherecondition
evaluates to true.Joins this Dataset returning a
Tuple2
for each pair wherecondition
evaluates to true.This is similar to the relation
join
function with one important difference in the result schema. SincejoinWith
preserves objects present on either side of the join, the result schema is similarly nested into a tuple under the column names_1
and_2
.This type of join can be useful both for preserving type-safety with the original object types as well as working with relational data where either side of the join has column names in common.
- other
Right side of the join.
- condition
Join expression.
- joinType
Type of join to perform. Default
inner
. Must be one of:inner
,cross
,outer
,full
,fullouter
,full_outer
,left
,leftouter
,left_outer
,right
,rightouter
,right_outer
.
- Since
1.6.0
-
def
limit(n: Int): Dataset[T]
Returns a new Dataset by taking the first
n
rows.Returns a new Dataset by taking the first
n
rows. The difference between this function andhead
is thathead
is an action and returns an array (by triggering query execution) whilelimit
returns a new Dataset.- Since
2.0.0
-
def
localCheckpoint(eager: Boolean): Dataset[T]
Locally checkpoints a Dataset and return the new Dataset.
Locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. Local checkpoints are written to executor storage and despite potentially faster they are unreliable and may compromise job completion.
- Since
2.3.0
-
def
localCheckpoint(): Dataset[T]
Eagerly locally checkpoints a Dataset and return the new Dataset.
Eagerly locally checkpoints a Dataset and return the new Dataset. Checkpointing can be used to truncate the logical plan of this Dataset, which is especially useful in iterative algorithms where the plan may grow exponentially. Local checkpoints are written to executor storage and despite potentially faster they are unreliable and may compromise job completion.
- Since
2.3.0
-
def
map[U](func: MapFunction[T, U], encoder: Encoder[U]): Dataset[U]
(Java-specific) Returns a new Dataset that contains the result of applying
func
to each element.(Java-specific) Returns a new Dataset that contains the result of applying
func
to each element.- Since
1.6.0
-
def
map[U](func: (T) ⇒ U)(implicit arg0: Encoder[U]): Dataset[U]
(Scala-specific) Returns a new Dataset that contains the result of applying
func
to each element.(Scala-specific) Returns a new Dataset that contains the result of applying
func
to each element.- Since
1.6.0
-
def
mapPartitions[U](f: MapPartitionsFunction[T, U], encoder: Encoder[U]): Dataset[U]
(Java-specific) Returns a new Dataset that contains the result of applying
f
to each partition.(Java-specific) Returns a new Dataset that contains the result of applying
f
to each partition.- Since
1.6.0
-
def
mapPartitions[U](func: (Iterator[T]) ⇒ Iterator[U])(implicit arg0: Encoder[U]): Dataset[U]
(Scala-specific) Returns a new Dataset that contains the result of applying
func
to each partition.(Scala-specific) Returns a new Dataset that contains the result of applying
func
to each partition.- Since
1.6.0
-
def
na: DataFrameNaFunctions
Returns a DataFrameNaFunctions for working with missing data.
Returns a DataFrameNaFunctions for working with missing data.
// Dropping rows containing any null values. ds.na.drop()
- Since
1.6.0
-
final
def
ne(arg0: AnyRef): Boolean
- Definition Classes
- AnyRef
-
final
def
notify(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
final
def
notifyAll(): Unit
- Definition Classes
- AnyRef
- Annotations
- @native()
-
def
observe(observation: Observation, expr: Column, exprs: Column*): Dataset[T]
Observe (named) metrics through an
org.apache.spark.sql.Observation
instance.Observe (named) metrics through an
org.apache.spark.sql.Observation
instance. This is equivalent to callingobserve(String, Column, Column*)
but does not require addingorg.apache.spark.sql.util.QueryExecutionListener
to the spark session. This method does not support streaming datasets.A user can retrieve the metrics by accessing
org.apache.spark.sql.Observation.get
.// Observe row count (rows) and highest id (maxid) in the Dataset while writing it val observation = Observation("my_metrics") val observed_ds = ds.observe(observation, count(lit(1)).as("rows"), max($"id").as("maxid")) observed_ds.write.parquet("ds.parquet") val metrics = observation.get
- Annotations
- @varargs()
- Since
3.3.0
- Exceptions thrown
IllegalArgumentException
If this is a streaming Dataset (this.isStreaming == true)
-
def
observe(name: String, expr: Column, exprs: Column*): Dataset[T]
Define (named) metrics to observe on the Dataset.
Define (named) metrics to observe on the Dataset. This method returns an 'observed' Dataset that returns the same result as the input, with the following guarantees:
- It will compute the defined aggregates (metrics) on all the data that is flowing through the Dataset at that point.
- It will report the value of the defined aggregate columns as soon as we reach a completion point. A completion point is either the end of a query (batch mode) or the end of a streaming epoch. The value of the aggregates only reflects the data processed since the previous completion point.
Please note that continuous execution is currently not supported.
The metrics columns must either contain a literal (e.g. lit(42)), or should contain one or more aggregate functions (e.g. sum(a) or sum(a + b) + avg(c) - lit(1)). Expressions that contain references to the input Dataset's columns must always be wrapped in an aggregate function.
A user can observe these metrics by either adding org.apache.spark.sql.streaming.StreamingQueryListener or a org.apache.spark.sql.util.QueryExecutionListener to the spark session.
// Monitor the metrics using a listener. spark.streams.addListener(new StreamingQueryListener() { override def onQueryStarted(event: QueryStartedEvent): Unit = {} override def onQueryProgress(event: QueryProgressEvent): Unit = { event.progress.observedMetrics.asScala.get("my_event").foreach { row => // Trigger if the number of errors exceeds 5 percent val num_rows = row.getAs[Long]("rc") val num_error_rows = row.getAs[Long]("erc") val ratio = num_error_rows.toDouble / num_rows if (ratio > 0.05) { // Trigger alert } } } override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {} }) // Observe row count (rc) and error row count (erc) in the streaming Dataset val observed_ds = ds.observe("my_event", count(lit(1)).as("rc"), count($"error").as("erc")) observed_ds.writeStream.format("...").start()
- Annotations
- @varargs()
- Since
3.0.0
-
def
orderBy(sortExprs: Column*): Dataset[T]
Returns a new Dataset sorted by the given expressions.
Returns a new Dataset sorted by the given expressions. This is an alias of the
sort
function.- Annotations
- @varargs()
- Since
2.0.0
-
def
orderBy(sortCol: String, sortCols: String*): Dataset[T]
Returns a new Dataset sorted by the given expressions.
Returns a new Dataset sorted by the given expressions. This is an alias of the
sort
function.- Annotations
- @varargs()
- Since
2.0.0
-
def
persist(newLevel: StorageLevel): Dataset.this.type
Persist this Dataset with the given storage level.
Persist this Dataset with the given storage level.
- newLevel
One of:
MEMORY_ONLY
,MEMORY_AND_DISK
,MEMORY_ONLY_SER
,MEMORY_AND_DISK_SER
,DISK_ONLY
,MEMORY_ONLY_2
,MEMORY_AND_DISK_2
, etc.
- Since
1.6.0
-
def
persist(): Dataset.this.type
Persist this Dataset with the default storage level (
MEMORY_AND_DISK
).Persist this Dataset with the default storage level (
MEMORY_AND_DISK
).- Since
1.6.0
-
def
printSchema(level: Int): Unit
Prints the schema up to the given level to the console in a nice tree format.
Prints the schema up to the given level to the console in a nice tree format.
- Since
3.0.0
-
def
printSchema(): Unit
Prints the schema to the console in a nice tree format.
Prints the schema to the console in a nice tree format.
- Since
1.6.0
- val queryExecution: QueryExecution
-
def
randomSplit(weights: Array[Double]): Array[Dataset[T]]
Randomly splits this Dataset with the provided weights.
Randomly splits this Dataset with the provided weights.
- weights
weights for splits, will be normalized if they don't sum to 1.
- Since
2.0.0
-
def
randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]]
Randomly splits this Dataset with the provided weights.
Randomly splits this Dataset with the provided weights.
- weights
weights for splits, will be normalized if they don't sum to 1.
- seed
Seed for sampling. For Java API, use randomSplitAsList.
- Since
2.0.0
-
def
randomSplitAsList(weights: Array[Double], seed: Long): List[Dataset[T]]
Returns a Java list that contains randomly split Dataset with the provided weights.
Returns a Java list that contains randomly split Dataset with the provided weights.
- weights
weights for splits, will be normalized if they don't sum to 1.
- seed
Seed for sampling.
- Since
2.0.0
-
lazy val
rdd: RDD[T]
Represents the content of the Dataset as an
RDD
ofT
.Represents the content of the Dataset as an
RDD
ofT
.- Since
1.6.0
-
def
reduce(func: ReduceFunction[T]): T
(Java-specific) Reduces the elements of this Dataset using the specified binary function.
(Java-specific) Reduces the elements of this Dataset using the specified binary function. The given
func
must be commutative and associative or the result may be non-deterministic.- Since
1.6.0
-
def
reduce(func: (T, T) ⇒ T): T
(Scala-specific) Reduces the elements of this Dataset using the specified binary function.
(Scala-specific) Reduces the elements of this Dataset using the specified binary function. The given
func
must be commutative and associative or the result may be non-deterministic.- Since
1.6.0
-
def
repartition(partitionExprs: Column*): Dataset[T]
Returns a new Dataset partitioned by the given partitioning expressions, using
spark.sql.shuffle.partitions
as number of partitions.Returns a new Dataset partitioned by the given partitioning expressions, using
spark.sql.shuffle.partitions
as number of partitions. The resulting Dataset is hash partitioned.This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
- Annotations
- @varargs()
- Since
2.0.0
-
def
repartition(numPartitions: Int, partitionExprs: Column*): Dataset[T]
Returns a new Dataset partitioned by the given partitioning expressions into
numPartitions
.Returns a new Dataset partitioned by the given partitioning expressions into
numPartitions
. The resulting Dataset is hash partitioned.This is the same operation as "DISTRIBUTE BY" in SQL (Hive QL).
- Annotations
- @varargs()
- Since
2.0.0
-
def
repartition(numPartitions: Int): Dataset[T]
Returns a new Dataset that has exactly
numPartitions
partitions.Returns a new Dataset that has exactly
numPartitions
partitions.- Since
1.6.0
-
def
repartitionByRange(partitionExprs: Column*): Dataset[T]
Returns a new Dataset partitioned by the given partitioning expressions, using
spark.sql.shuffle.partitions
as number of partitions.Returns a new Dataset partitioned by the given partitioning expressions, using
spark.sql.shuffle.partitions
as number of partitions. The resulting Dataset is range partitioned.At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.
Note that due to performance reasons this method uses sampling to estimate the ranges. Hence, the output may not be consistent, since sampling can return different values. The sample size can be controlled by the config
spark.sql.execution.rangeExchange.sampleSizePerPartition
.- Annotations
- @varargs()
- Since
2.3.0
-
def
repartitionByRange(numPartitions: Int, partitionExprs: Column*): Dataset[T]
Returns a new Dataset partitioned by the given partitioning expressions into
numPartitions
.Returns a new Dataset partitioned by the given partitioning expressions into
numPartitions
. The resulting Dataset is range partitioned.At least one partition-by expression must be specified. When no explicit sort order is specified, "ascending nulls first" is assumed. Note, the rows are not sorted in each partition of the resulting Dataset.
Note that due to performance reasons this method uses sampling to estimate the ranges. Hence, the output may not be consistent, since sampling can return different values. The sample size can be controlled by the config
spark.sql.execution.rangeExchange.sampleSizePerPartition
.- Annotations
- @varargs()
- Since
2.3.0
-
def
rollup(col1: String, cols: String*): RelationalGroupedDataset
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them.
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
This is a variant of rollup that can only group by existing columns using column names (i.e. cannot construct expressions).
// Compute the average for all numeric columns rolled up by department and group. ds.rollup("department", "group").avg() // Compute the max age and average salary, rolled up by department and gender. ds.rollup($"department", $"gender").agg(Map( "salary" -> "avg", "age" -> "max" ))
- Annotations
- @varargs()
- Since
2.0.0
-
def
rollup(cols: Column*): RelationalGroupedDataset
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them.
Create a multi-dimensional rollup for the current Dataset using the specified columns, so we can run aggregation on them. See RelationalGroupedDataset for all the available aggregate functions.
// Compute the average for all numeric columns rolled up by department and group. ds.rollup($"department", $"group").avg() // Compute the max age and average salary, rolled up by department and gender. ds.rollup($"department", $"gender").agg(Map( "salary" -> "avg", "age" -> "max" ))
- Annotations
- @varargs()
- Since
2.0.0
-
def
sameSemantics(other: Dataset[T]): Boolean
Returns
true
when the logical query plans inside both Datasets are equal and therefore return same results.Returns
true
when the logical query plans inside both Datasets are equal and therefore return same results.- Annotations
- @DeveloperApi()
- Since
3.1.0
- Note
The equality comparison here is simplified by tolerating the cosmetic differences such as attribute names.
,This API can compare both Datasets very fast but can still return
false
on the Dataset that return the same results, for instance, from different plans. Such false negative semantic can be useful when caching as an example.
-
def
sample(withReplacement: Boolean, fraction: Double): Dataset[T]
Returns a new Dataset by sampling a fraction of rows, using a random seed.
-
def
sample(withReplacement: Boolean, fraction: Double, seed: Long): Dataset[T]
Returns a new Dataset by sampling a fraction of rows, using a user-supplied seed.
Returns a new Dataset by sampling a fraction of rows, using a user-supplied seed.
- withReplacement
Sample with replacement or not.
- fraction
Fraction of rows to generate, range [0.0, 1.0].
- seed
Seed for sampling.
- Since
1.6.0
- Note
This is NOT guaranteed to provide exactly the fraction of the count of the given Dataset.
-
def
sample(fraction: Double): Dataset[T]
Returns a new Dataset by sampling a fraction of rows (without replacement), using a random seed.
-
def
sample(fraction: Double, seed: Long): Dataset[T]
Returns a new Dataset by sampling a fraction of rows (without replacement), using a user-supplied seed.
-
def
schema: StructType
Returns the schema of this Dataset.
Returns the schema of this Dataset.
- Since
1.6.0
-
def
select[U1, U2, U3, U4, U5](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2], c3: TypedColumn[T, U3], c4: TypedColumn[T, U4], c5: TypedColumn[T, U5]): Dataset[(U1, U2, U3, U4, U5)]
Returns a new Dataset by computing the given Column expressions for each element.
Returns a new Dataset by computing the given Column expressions for each element.
- Since
1.6.0
-
def
select[U1, U2, U3, U4](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2], c3: TypedColumn[T, U3], c4: TypedColumn[T, U4]): Dataset[(U1, U2, U3, U4)]
Returns a new Dataset by computing the given Column expressions for each element.
Returns a new Dataset by computing the given Column expressions for each element.
- Since
1.6.0
-
def
select[U1, U2, U3](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2], c3: TypedColumn[T, U3]): Dataset[(U1, U2, U3)]
Returns a new Dataset by computing the given Column expressions for each element.
Returns a new Dataset by computing the given Column expressions for each element.
- Since
1.6.0
-
def
select[U1, U2](c1: TypedColumn[T, U1], c2: TypedColumn[T, U2]): Dataset[(U1, U2)]
Returns a new Dataset by computing the given Column expressions for each element.
Returns a new Dataset by computing the given Column expressions for each element.
- Since
1.6.0
-
def
select[U1](c1: TypedColumn[T, U1]): Dataset[U1]
Returns a new Dataset by computing the given Column expression for each element.
Returns a new Dataset by computing the given Column expression for each element.
val ds = Seq(1, 2, 3).toDS() val newDS = ds.select(expr("value + 1").as[Int])
- Since
1.6.0
-
def
select(col: String, cols: String*): DataFrame
Selects a set of columns.
Selects a set of columns. This is a variant of
select
that can only select existing columns using column names (i.e. cannot construct expressions).// The following two are equivalent: ds.select("colA", "colB") ds.select($"colA", $"colB")
- Annotations
- @varargs()
- Since
2.0.0
-
def
select(cols: Column*): DataFrame
Selects a set of column based expressions.
Selects a set of column based expressions.
ds.select($"colA", $"colB" + 1)
- Annotations
- @varargs()
- Since
2.0.0
-
def
selectExpr(exprs: String*): DataFrame
Selects a set of SQL expressions.
Selects a set of SQL expressions. This is a variant of
select
that accepts SQL expressions.// The following are equivalent: ds.selectExpr("colA", "colB as newName", "abs(colC)") ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
- Annotations
- @varargs()
- Since
2.0.0
-
def
selectUntyped(columns: TypedColumn[_, _]*): Dataset[_]
Internal helper function for building typed selects that return tuples.
Internal helper function for building typed selects that return tuples. For simplicity and code reuse, we do this without the help of the type system and then use helper functions that cast appropriately for the user facing interface.
- Attributes
- protected
-
def
semanticHash(): Int
Returns a
hashCode
of the logical query plan against this Dataset.Returns a
hashCode
of the logical query plan against this Dataset.- Annotations
- @DeveloperApi()
- Since
3.1.0
- Note
Unlike the standard
hashCode
, the hash is calculated against the query plan simplified by tolerating the cosmetic differences such as attribute names.
-
def
show(numRows: Int, truncate: Int, vertical: Boolean): Unit
Displays the Dataset in a tabular form.
Displays the Dataset in a tabular form. For example:
year month AVG('Adj Close) MAX('Adj Close) 1980 12 0.503218 0.595103 1981 01 0.523289 0.570307 1982 02 0.436504 0.475256 1983 03 0.410516 0.442194 1984 04 0.450090 0.483521
If
vertical
enabled, this command prints output rows vertically (one line per column value)?-RECORD 0------------------- year | 1980 month | 12 AVG('Adj Close) | 0.503218 AVG('Adj Close) | 0.595103 -RECORD 1------------------- year | 1981 month | 01 AVG('Adj Close) | 0.523289 AVG('Adj Close) | 0.570307 -RECORD 2------------------- year | 1982 month | 02 AVG('Adj Close) | 0.436504 AVG('Adj Close) | 0.475256 -RECORD 3------------------- year | 1983 month | 03 AVG('Adj Close) | 0.410516 AVG('Adj Close) | 0.442194 -RECORD 4------------------- year | 1984 month | 04 AVG('Adj Close) | 0.450090 AVG('Adj Close) | 0.483521
- numRows
Number of rows to show
- truncate
If set to more than 0, truncates strings to
truncate
characters and all cells will be aligned right.- vertical
If set to true, prints output rows vertically (one line per column value).
- Since
2.3.0
-
def
show(numRows: Int, truncate: Int): Unit
Displays the Dataset in a tabular form.
Displays the Dataset in a tabular form. For example:
year month AVG('Adj Close) MAX('Adj Close) 1980 12 0.503218 0.595103 1981 01 0.523289 0.570307 1982 02 0.436504 0.475256 1983 03 0.410516 0.442194 1984 04 0.450090 0.483521
- numRows
Number of rows to show
- truncate
If set to more than 0, truncates strings to
truncate
characters and all cells will be aligned right.
- Since
1.6.0
-
def
show(numRows: Int, truncate: Boolean): Unit
Displays the Dataset in a tabular form.
Displays the Dataset in a tabular form. For example:
year month AVG('Adj Close) MAX('Adj Close) 1980 12 0.503218 0.595103 1981 01 0.523289 0.570307 1982 02 0.436504 0.475256 1983 03 0.410516 0.442194 1984 04 0.450090 0.483521
- numRows
Number of rows to show
- truncate
Whether truncate long strings. If true, strings more than 20 characters will be truncated and all cells will be aligned right
- Since
1.6.0
-
def
show(truncate: Boolean): Unit
Displays the top 20 rows of Dataset in a tabular form.
Displays the top 20 rows of Dataset in a tabular form.
- truncate
Whether truncate long strings. If true, strings more than 20 characters will be truncated and all cells will be aligned right
- Since
1.6.0
-
def
show(): Unit
Displays the top 20 rows of Dataset in a tabular form.
Displays the top 20 rows of Dataset in a tabular form. Strings more than 20 characters will be truncated, and all cells will be aligned right.
- Since
1.6.0
-
def
show(numRows: Int): Unit
Displays the Dataset in a tabular form.
Displays the Dataset in a tabular form. Strings more than 20 characters will be truncated, and all cells will be aligned right. For example:
year month AVG('Adj Close) MAX('Adj Close) 1980 12 0.503218 0.595103 1981 01 0.523289 0.570307 1982 02 0.436504 0.475256 1983 03 0.410516 0.442194 1984 04 0.450090 0.483521
- numRows
Number of rows to show
- Since
1.6.0
-
def
sort(sortExprs: Column*): Dataset[T]
Returns a new Dataset sorted by the given expressions.
Returns a new Dataset sorted by the given expressions. For example:
ds.sort($"col1", $"col2".desc)
- Annotations
- @varargs()
- Since
2.0.0
-
def
sort(sortCol: String, sortCols: String*): Dataset[T]
Returns a new Dataset sorted by the specified column, all in ascending order.
Returns a new Dataset sorted by the specified column, all in ascending order.
// The following 3 are equivalent ds.sort("sortcol") ds.sort($"sortcol") ds.sort($"sortcol".asc)
- Annotations
- @varargs()
- Since
2.0.0
-
def
sortWithinPartitions(sortExprs: Column*): Dataset[T]
Returns a new Dataset with each partition sorted by the given expressions.
Returns a new Dataset with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).
- Annotations
- @varargs()
- Since
2.0.0
-
def
sortWithinPartitions(sortCol: String, sortCols: String*): Dataset[T]
Returns a new Dataset with each partition sorted by the given expressions.
Returns a new Dataset with each partition sorted by the given expressions.
This is the same operation as "SORT BY" in SQL (Hive QL).
- Annotations
- @varargs()
- Since
2.0.0
-
lazy val
sparkSession: SparkSession
- Annotations
- @transient()
-
lazy val
sqlContext: SQLContext
- Annotations
- @transient()
-
def
stat: DataFrameStatFunctions
Returns a DataFrameStatFunctions for working statistic functions support.
Returns a DataFrameStatFunctions for working statistic functions support.
// Finding frequent items in column with name 'a'. ds.stat.freqItems(Seq("a"))
- Since
1.6.0
-
def
storageLevel: StorageLevel
Get the Dataset's current storage level, or StorageLevel.NONE if not persisted.
Get the Dataset's current storage level, or StorageLevel.NONE if not persisted.
- Since
2.1.0
-
def
summary(statistics: String*): DataFrame
Computes specified statistics for numeric and string columns.
Computes specified statistics for numeric and string columns. Available statistics are:
- count
- mean
- stddev
- min
- max
- arbitrary approximate percentiles specified as a percentage (e.g. 75%)
- count_distinct
- approx_count_distinct
If no statistics are given, this function computes count, mean, stddev, min, approximate quartiles (percentiles at 25%, 50%, and 75%), and max.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting Dataset. If you want to programmatically compute summary statistics, use the
agg
function instead.ds.summary().show() // output: // summary age height // count 10.0 10.0 // mean 53.3 178.05 // stddev 11.6 15.7 // min 18.0 163.0 // 25% 24.0 176.0 // 50% 24.0 176.0 // 75% 32.0 180.0 // max 92.0 192.0
ds.summary("count", "min", "25%", "75%", "max").show() // output: // summary age height // count 10.0 10.0 // min 18.0 163.0 // 25% 24.0 176.0 // 75% 32.0 180.0 // max 92.0 192.0
To do a summary for specific columns first select them:
ds.select("age", "height").summary().show()
Specify statistics to output custom summaries:
ds.summary("count", "count_distinct").show()
The distinct count isn't included by default.
You can also run approximate distinct counts which are faster:
ds.summary("count", "approx_count_distinct").show()
See also describe for basic statistics.
- statistics
Statistics from above list to be computed.
- Annotations
- @varargs()
- Since
2.3.0
-
final
def
synchronized[T0](arg0: ⇒ T0): T0
- Definition Classes
- AnyRef
-
def
tail(n: Int): Array[T]
Returns the last
n
rows in the Dataset.Returns the last
n
rows in the Dataset.Running tail requires moving data into the application's driver process, and doing so with a very large
n
can crash the driver process with OutOfMemoryError.- Since
3.0.0
-
def
take(n: Int): Array[T]
Returns the first
n
rows in the Dataset.Returns the first
n
rows in the Dataset.Running take requires moving data into the application's driver process, and doing so with a very large
n
can crash the driver process with OutOfMemoryError.- Since
1.6.0
-
def
takeAsList(n: Int): List[T]
Returns the first
n
rows in the Dataset as a list.Returns the first
n
rows in the Dataset as a list.Running take requires moving data into the application's driver process, and doing so with a very large
n
can crash the driver process with OutOfMemoryError.- Since
1.6.0
-
def
toDF(colNames: String*): DataFrame
Converts this strongly typed collection of data to generic
DataFrame
with columns renamed.Converts this strongly typed collection of data to generic
DataFrame
with columns renamed. This can be quite convenient in conversion from an RDD of tuples into aDataFrame
with meaningful names. For example:val rdd: RDD[(Int, String)] = ... rdd.toDF() // this implicit conversion creates a DataFrame with column name `_1` and `_2` rdd.toDF("id", "name") // this creates a DataFrame with column name "id" and "name"
- Annotations
- @varargs()
- Since
2.0.0
-
def
toDF(): DataFrame
Converts this strongly typed collection of data to generic Dataframe.
Converts this strongly typed collection of data to generic Dataframe. In contrast to the strongly typed objects that Dataset operations work on, a Dataframe returns generic Row objects that allow fields to be accessed by ordinal or name.
- Since
1.6.0
-
def
toJSON: Dataset[String]
Returns the content of the Dataset as a Dataset of JSON strings.
Returns the content of the Dataset as a Dataset of JSON strings.
- Since
2.0.0
-
def
toJavaRDD: JavaRDD[T]
Returns the content of the Dataset as a
JavaRDD
ofT
s.Returns the content of the Dataset as a
JavaRDD
ofT
s.- Since
1.6.0
-
def
toLocalIterator(): Iterator[T]
Returns an iterator that contains all rows in this Dataset.
Returns an iterator that contains all rows in this Dataset.
The iterator will consume as much memory as the largest partition in this Dataset.
- Since
2.0.0
- Note
this results in multiple Spark jobs, and if the input Dataset is the result of a wide transformation (e.g. join with different partitioners), to avoid recomputing the input Dataset should be cached first.
-
def
toString(): String
- Definition Classes
- Dataset → AnyRef → Any
-
def
transform[U](t: (Dataset[T]) ⇒ Dataset[U]): Dataset[U]
Concise syntax for chaining custom transformations.
Concise syntax for chaining custom transformations.
def featurize(ds: Dataset[T]): Dataset[U] = ... ds .transform(featurize) .transform(...)
- Since
1.6.0
-
def
union(other: Dataset[T]): Dataset[T]
Returns a new Dataset containing union of rows in this Dataset and another Dataset.
Returns a new Dataset containing union of rows in this Dataset and another Dataset.
This is equivalent to
UNION ALL
in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.Also as standard in SQL, this function resolves columns by position (not by name):
val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2") val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0") df1.union(df2).show // output: // +----+----+----+ // |col0|col1|col2| // +----+----+----+ // | 1| 2| 3| // | 4| 5| 6| // +----+----+----+
Notice that the column positions in the schema aren't necessarily matched with the fields in the strongly typed objects in a Dataset. This function resolves columns by their positions in the schema, not the fields in the strongly typed objects. Use unionByName to resolve columns by field name in the typed objects.
- Since
2.0.0
-
def
unionAll(other: Dataset[T]): Dataset[T]
Returns a new Dataset containing union of rows in this Dataset and another Dataset.
Returns a new Dataset containing union of rows in this Dataset and another Dataset. This is an alias for
union
.This is equivalent to
UNION ALL
in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.Also as standard in SQL, this function resolves columns by position (not by name).
- Since
2.0.0
-
def
unionByName(other: Dataset[T], allowMissingColumns: Boolean): Dataset[T]
Returns a new Dataset containing union of rows in this Dataset and another Dataset.
Returns a new Dataset containing union of rows in this Dataset and another Dataset.
The difference between this function and union is that this function resolves columns by name (not by position).
When the parameter
allowMissingColumns
istrue
, the set of column names in this and otherDataset
can differ; missing columns will be filled with null. Further, the missing columns of thisDataset
will be added at the end in the schema of the union result:val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2") val df2 = Seq((4, 5, 6)).toDF("col1", "col0", "col3") df1.unionByName(df2, true).show // output: "col3" is missing at left df1 and added at the end of schema. // +----+----+----+----+ // |col0|col1|col2|col3| // +----+----+----+----+ // | 1| 2| 3|null| // | 5| 4|null| 6| // +----+----+----+----+ df2.unionByName(df1, true).show // output: "col2" is missing at left df2 and added at the end of schema. // +----+----+----+----+ // |col1|col0|col3|col2| // +----+----+----+----+ // | 4| 5| 6|null| // | 2| 1|null| 3| // +----+----+----+----+
Note that this supports nested columns in struct and array types. With
allowMissingColumns
, missing nested columns of struct columns with the same name will also be filled with null values and added to the end of struct. Nested columns in map types are not currently supported.- Since
3.1.0
-
def
unionByName(other: Dataset[T]): Dataset[T]
Returns a new Dataset containing union of rows in this Dataset and another Dataset.
Returns a new Dataset containing union of rows in this Dataset and another Dataset.
This is different from both
UNION ALL
andUNION DISTINCT
in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct.The difference between this function and union is that this function resolves columns by name (not by position):
val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2") val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0") df1.unionByName(df2).show // output: // +----+----+----+ // |col0|col1|col2| // +----+----+----+ // | 1| 2| 3| // | 6| 4| 5| // +----+----+----+
Note that this supports nested columns in struct and array types. Nested columns in map types are not currently supported.
- Since
2.3.0
-
def
unpersist(): Dataset.this.type
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk. This will not un-persist any cached data that is built upon this Dataset.
- Since
1.6.0
-
def
unpersist(blocking: Boolean): Dataset.this.type
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk. This will not un-persist any cached data that is built upon this Dataset.
- blocking
Whether to block until all blocks are deleted.
- Since
1.6.0
-
final
def
wait(): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long, arg1: Int): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... )
-
final
def
wait(arg0: Long): Unit
- Definition Classes
- AnyRef
- Annotations
- @throws( ... ) @native()
-
def
where(conditionExpr: String): Dataset[T]
Filters rows using the given SQL expression.
Filters rows using the given SQL expression.
peopleDs.where("age > 15")
- Since
1.6.0
-
def
where(condition: Column): Dataset[T]
Filters rows using the given condition.
Filters rows using the given condition. This is an alias for
filter
.// The following are equivalent: peopleDs.filter($"age" > 15) peopleDs.where($"age" > 15)
- Since
1.6.0
-
def
withColumn(colName: String, col: Column): DataFrame
Returns a new Dataset by adding a column or replacing the existing column that has the same name.
Returns a new Dataset by adding a column or replacing the existing column that has the same name.
column
's expression must only refer to attributes supplied by this Dataset. It is an error to add a column that refers to some other Dataset.- Since
2.0.0
- Note
this method introduces a projection internally. Therefore, calling it multiple times, for instance, via loops in order to add multiple columns can generate big plans which can cause performance issues and even
StackOverflowException
. To avoid this, useselect
with the multiple columns at once.
-
def
withColumnRenamed(existingName: String, newName: String): DataFrame
Returns a new Dataset with a column renamed.
Returns a new Dataset with a column renamed. This is a no-op if schema doesn't contain existingName.
- Since
2.0.0
-
def
withColumns(colsMap: Map[String, Column]): DataFrame
(Java-specific) Returns a new Dataset by adding columns or replacing the existing columns that has the same names.
(Java-specific) Returns a new Dataset by adding columns or replacing the existing columns that has the same names.
colsMap
is a map of column name and column, the column must only refer to attribute supplied by this Dataset. It is an error to add columns that refers to some other Dataset.- Since
3.3.0
-
def
withColumns(colsMap: Map[String, Column]): DataFrame
(Scala-specific) Returns a new Dataset by adding columns or replacing the existing columns that has the same names.
(Scala-specific) Returns a new Dataset by adding columns or replacing the existing columns that has the same names.
colsMap
is a map of column name and column, the column must only refer to attributes supplied by this Dataset. It is an error to add columns that refers to some other Dataset.- Since
3.3.0
-
def
withMetadata(columnName: String, metadata: Metadata): DataFrame
Returns a new Dataset by updating an existing column with metadata.
Returns a new Dataset by updating an existing column with metadata.
- Since
3.3.0
-
def
withWatermark(eventTime: String, delayThreshold: String): Dataset[T]
Defines an event time watermark for this Dataset.
Defines an event time watermark for this Dataset. A watermark tracks a point in time before which we assume no more late data is going to arrive.
Spark will use this watermark for several purposes:
- To know when a given time window aggregation can be finalized and thus can be emitted when using output modes that do not allow updates.
- To minimize the amount of state that we need to keep for on-going aggregations,
mapGroupsWithState
anddropDuplicates
operators.
The current watermark is computed by looking at the
MAX(eventTime)
seen across all of the partitions in the query minus a user specifieddelayThreshold
. Due to the cost of coordinating this value across partitions, the actual watermark used is only guaranteed to be at leastdelayThreshold
behind the actual event time. In some cases we may still process records that arrive more thandelayThreshold
late.- eventTime
the name of the column that contains the event time of the row.
- delayThreshold
the minimum delay to wait to data to arrive late, relative to the latest record that has been processed in the form of an interval (e.g. "1 minute" or "5 hours"). NOTE: This should not be negative.
- Since
2.1.0
-
def
write: DataFrameWriter[T]
Interface for saving the content of the non-streaming Dataset out into external storage.
Interface for saving the content of the non-streaming Dataset out into external storage.
- Since
1.6.0
-
def
writeStream: DataStreamWriter[T]
Interface for saving the content of the streaming Dataset out into external storage.
Interface for saving the content of the streaming Dataset out into external storage.
- Since
2.0.0
-
def
writeTo(table: String): DataFrameWriterV2[T]
Create a write configuration builder for v2 sources.
Create a write configuration builder for v2 sources.
This builder is used to configure and execute write operations. For example, to append to an existing table, run:
df.writeTo("catalog.db.table").append()
This can also be used to create or replace existing tables:
df.writeTo("catalog.db.table").partitionedBy($"col").createOrReplace()
- Since
3.0.0
Deprecated Value Members
-
def
explode[A, B](inputColumn: String, outputColumn: String)(f: (A) ⇒ TraversableOnce[B])(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[B]): DataFrame
(Scala-specific) Returns a new Dataset where a single column has been expanded to zero or more rows by the provided function.
(Scala-specific) Returns a new Dataset where a single column has been expanded to zero or more rows by the provided function. This is similar to a
LATERAL VIEW
in HiveQL. All columns of the input row are implicitly joined with each value that is output by the function.Given that this is deprecated, as an alternative, you can explode columns either using
functions.explode()
:ds.select(explode(split($"words", " ")).as("word"))
or
flatMap()
:ds.flatMap(_.words.split(" "))
- Annotations
- @deprecated
- Deprecated
(Since version 2.0.0) use flatMap() or select() with functions.explode() instead
- Since
2.0.0
-
def
explode[A <: Product](input: Column*)(f: (Row) ⇒ TraversableOnce[A])(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[A]): DataFrame
(Scala-specific) Returns a new Dataset where each row has been expanded to zero or more rows by the provided function.
(Scala-specific) Returns a new Dataset where each row has been expanded to zero or more rows by the provided function. This is similar to a
LATERAL VIEW
in HiveQL. The columns of the input row are implicitly joined with each row that is output by the function.Given that this is deprecated, as an alternative, you can explode columns either using
functions.explode()
orflatMap()
. The following example uses these alternatives to count the number of books that contain a given word:case class Book(title: String, words: String) val ds: Dataset[Book] val allWords = ds.select($"title", explode(split($"words", " ")).as("word")) val bookCountPerWord = allWords.groupBy("word").agg(count_distinct("title"))
Using
flatMap()
this can similarly be exploded as:ds.flatMap(_.words.split(" "))
- Annotations
- @deprecated
- Deprecated
(Since version 2.0.0) use flatMap() or select() with functions.explode() instead
- Since
2.0.0
-
def
registerTempTable(tableName: String): Unit
Registers this Dataset as a temporary table using the given name.
Registers this Dataset as a temporary table using the given name. The lifetime of this temporary table is tied to the SparkSession that was used to create this Dataset.
- Annotations
- @deprecated
- Deprecated
(Since version 2.0.0) Use createOrReplaceTempView(viewName) instead.
- Since
1.6.0