Applies a logical alias to this Dataset that can be used to disambiguate columns that have the same name after two Datasets have been joined.
1.6.0
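As a minimal sketch of how aliases disambiguate otherwise identical column names (hypothetical data; assumes a Spark 1.6 shell where sqlContext and its implicits are in scope):

```scala
import sqlContext.implicits._

// Both Datasets encode tuples, so both expose columns named _1 and _2.
val left  = sqlContext.createDataset(Seq(("a", 1), ("b", 2))).as("l")
val right = sqlContext.createDataset(Seq(("a", 10))).as("r")

// The aliases let the join condition name each side's _1 unambiguously.
val joined = left.joinWith(right, $"l._1" === $"r._1")
```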
Returns a new Dataset where each record has been mapped onto the specified type. The method used to map columns depends on the type of U:
- When U is a class, fields for the class will be mapped to columns of the same name (case sensitivity is determined by spark.sql.caseSensitive).
- When U is a tuple, the columns will be mapped by ordinal (i.e. the first column will be assigned to _1).
- When U is a primitive type (i.e. String, Int, etc.), the first column of the DataFrame will be used.

If the schema of the DataFrame does not match the desired U type, you can use select along with alias or as to rearrange or rename as required.
1.6.0
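A short sketch of the select/rearrange pattern described above (hypothetical column names; assumes a Spark 1.6 shell where sqlContext and its implicits are in scope):

```scala
import sqlContext.implicits._

// Column order differs from the target tuple, so rearrange with select first.
val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "name")
val ds = df.select($"name", $"id").as[(String, Int)]
```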
Persist this Dataset with the default storage level (MEMORY_AND_DISK).
1.6.0
Returns a new Dataset that has exactly numPartitions partitions.
Similar to coalesce defined on an RDD, this operation results in a narrow dependency, e.g.
if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of
the 100 new partitions will claim 10 of the current partitions.
1.6.0
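A minimal illustration of the narrow dependency (hypothetical data; assumes a Spark 1.6 shell where sqlContext and its implicits are in scope):

```scala
import sqlContext.implicits._

val ds = sqlContext.createDataset((1 to 100).toSeq).repartition(10)

// Narrow dependency: the 2 new partitions each claim 5 of the 10 existing
// partitions, with no shuffle.
val coalesced = ds.coalesce(2)
val numParts = coalesced.rdd.partitions.length
```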
Returns an array that contains all the elements in this Dataset.
Running collect requires moving all the data into the application's driver process, and doing so on a very large Dataset can crash the driver process with OutOfMemoryError.
For Java API, use collectAsList.
1.6.0
Returns an array that contains all the elements in this Dataset.
Running collect requires moving all the data into the application's driver process, and doing so on a very large Dataset can crash the driver process with OutOfMemoryError.
For Java API, use collectAsList.
1.6.0
Returns the number of elements in the Dataset.
1.6.0
Returns a new Dataset that contains only the unique elements of this Dataset.
Prints the physical plan to the console for debugging purposes.
1.6.0
Prints the plans (logical and physical) to the console for debugging purposes.
1.6.0
(Java-specific) Returns a new Dataset that only contains elements where func returns true.
1.6.0
(Scala-specific) Returns a new Dataset that only contains elements where func returns true.
1.6.0
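A minimal sketch of the typed filter (hypothetical data; assumes a Spark 1.6 shell where sqlContext and its implicits are in scope):

```scala
import sqlContext.implicits._

val ds = sqlContext.createDataset(Seq(1, 2, 3, 4, 5))

// Keep only elements for which the predicate returns true.
val evens = ds.filter(_ % 2 == 0)
```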
Returns the first element in this Dataset.
1.6.0
(Java-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
(Scala-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
(Java-specific) Runs func on each element of this Dataset.
1.6.0
(Scala-specific) Runs func on each element of this Dataset.
1.6.0
(Java-specific) Runs func on each partition of this Dataset.
1.6.0
(Scala-specific) Runs func on each partition of this Dataset.
1.6.0
(Java-specific) Returns a GroupedDataset where the data is grouped by the given key func.
1.6.0
Returns a GroupedDataset where the data is grouped by the given Column expressions.
1.6.0
(Scala-specific) Returns a GroupedDataset where the data is grouped by the given key func.
1.6.0
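A sketch of grouping by a key function and then collapsing each group (hypothetical data; assumes a Spark 1.6 shell with sqlContext and its implicits in scope, and the 1.6-era GroupedDataset.mapGroups signature):

```scala
import sqlContext.implicits._

val words = sqlContext.createDataset(Seq("a", "bb", "cc", "ddd"))

// Group by word length, then fold each group into a (length, count) pair.
val counts = words.groupBy(_.length).mapGroups { (len, ws) => (len, ws.size) }
```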
Returns a new Dataset that contains only the elements of this Dataset that are also present in other.
Joins this Dataset with another Dataset using an inner equi-join, returning a Tuple2 for each pair where condition evaluates to true.
Right side of the join.
Join expression.
1.6.0
Joins this Dataset returning a Tuple2 for each pair where condition evaluates to true.
This is similar to the relational join function with one important difference in the result schema. Since joinWith preserves objects present on either side of the join, the result schema is similarly nested into a tuple under the column names _1 and _2.
This type of join can be useful both for preserving type-safety with the original object types as well as working with relational data where either side of the join has column names in common.
Right side of the join.
Join expression.
One of: inner, outer, left_outer, right_outer, leftsemi.
1.6.0
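A minimal sketch of the nested tuple result (hypothetical data; assumes a Spark 1.6 shell where sqlContext and its implicits are in scope):

```scala
import sqlContext.implicits._

val orders = sqlContext.createDataset(Seq((1, "pen"), (2, "ink"))).as("o")
val prices = sqlContext.createDataset(Seq((1, 2.5), (3, 9.99))).as("p")

// Each matching pair is kept whole: the element type is
// ((Int, String), (Int, Double)), under columns _1 and _2.
val joined = orders.joinWith(prices, $"o._1" === $"p._1", "inner")
```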
(Java-specific) Returns a new Dataset that contains the result of applying func to each element.
1.6.0
(Scala-specific) Returns a new Dataset that contains the result of applying func to each element.
1.6.0
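A minimal sketch of the typed map (hypothetical data; assumes a Spark 1.6 shell where sqlContext and its implicits are in scope):

```scala
import sqlContext.implicits._

val ds = sqlContext.createDataset(Seq("spark", "dataset"))

// func is applied to each element; the result is a Dataset[Int].
val lengths = ds.map(_.length)
```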
(Java-specific) Returns a new Dataset that contains the result of applying func to each partition.
1.6.0
(Scala-specific) Returns a new Dataset that contains the result of applying func to each partition.
1.6.0
Persist this Dataset with the given storage level.
One of: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
1.6.0
Persist this Dataset with the default storage level (MEMORY_AND_DISK).
1.6.0
Prints the schema of the underlying Dataset to the console in a nice tree format.
Converts this Dataset to an RDD.
1.6.0
(Java-specific) Reduces the elements of this Dataset using the specified binary function.
(Java-specific) Reduces the elements of this Dataset using the specified binary function. The given func must be commutative and associative or the result may be non-deterministic.
1.6.0
(Scala-specific) Reduces the elements of this Dataset using the specified binary function.
(Scala-specific) Reduces the elements of this Dataset using the specified binary function. The given func must be commutative and associative or the result may be non-deterministic.
1.6.0
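A minimal sketch of reduce (hypothetical data; assumes a Spark 1.6 shell where sqlContext and its implicits are in scope):

```scala
import sqlContext.implicits._

val ds = sqlContext.createDataset(Seq(1, 2, 3, 4))

// Addition is commutative and associative, so the result is deterministic.
val total = ds.reduce(_ + _)
```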
Returns a new Dataset that has exactly numPartitions partitions.
1.6.0
Returns a new Dataset by sampling a fraction of records, using a random seed.
1.6.0
Returns a new Dataset by sampling a fraction of records.
1.6.0
Returns the schema of the encoded form of the objects in this Dataset.
Returns a new Dataset by computing the given Column expressions for each element.
Returns a new Dataset by computing the given Column expressions for each element.
Returns a new Dataset by computing the given Column expressions for each element.
Returns a new Dataset by computing the given Column expressions for each element.
Returns a new Dataset by computing the given Column expression for each element.
Returns a new DataFrame by selecting a set of column based expressions.
df.select($"colA", $"colB" + 1)
1.6.0
Internal helper function for building typed selects that return tuples.
Internal helper function for building typed selects that return tuples. For simplicity and code reuse, we do this without the help of the type system and then use helper functions that cast appropriately for the user-facing interface.
Displays the Dataset in a tabular form.
Displays the Dataset in a tabular form. For example:
year month AVG('Adj Close) MAX('Adj Close)
1980 12    0.503218        0.595103
1981 01    0.523289        0.570307
1982 02    0.436504        0.475256
1983 03    0.410516        0.442194
1984 04    0.450090        0.483521
Number of rows to show
Whether to truncate long strings. If true, strings of more than 20 characters will be truncated and all cells will be aligned right
1.6.0
Displays the top 20 rows of Dataset in a tabular form.
Whether to truncate long strings. If true, strings of more than 20 characters will be truncated and all cells will be aligned right
1.6.0
Displays the top 20 rows of Dataset in a tabular form.
Displays the top 20 rows of Dataset in a tabular form. Strings more than 20 characters will be truncated, and all cells will be aligned right.
1.6.0
Displays the content of this Dataset in a tabular form.
Displays the content of this Dataset in a tabular form. Strings more than 20 characters will be truncated, and all cells will be aligned right. For example:
year month AVG('Adj Close) MAX('Adj Close)
1980 12    0.503218        0.595103
1981 01    0.523289        0.570307
1982 02    0.436504        0.475256
1983 03    0.410516        0.442194
1984 04    0.450090        0.483521
Number of rows to show
1.6.0
Returns a new Dataset where any elements present in other have been removed.
Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
1.6.0
Returns the first num elements of this Dataset as an array.
Running take requires moving data into the application's driver process, and doing so with a very large num can crash the driver process with OutOfMemoryError.
1.6.0
Returns the first num elements of this Dataset as an array.
Running take requires moving data into the application's driver process, and doing so with a very large num can crash the driver process with OutOfMemoryError.
1.6.0
Converts this strongly typed collection of data to a generic DataFrame.
Converts this strongly typed collection of data to a generic DataFrame. In contrast to the strongly typed objects that Dataset operations work on, a DataFrame returns generic Row objects that allow fields to be accessed by ordinal or name.
Returns this Dataset.
1.6.0
Concise syntax for chaining custom transformations.
def featurize(ds: Dataset[T]) = ...
dataset
  .transform(featurize)
  .transform(...)
1.6.0
Returns a new Dataset that contains the elements of both this and the other Dataset combined.
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.
1.6.0
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.
Whether to block until all blocks are deleted.
1.6.0
:: Experimental :: A Dataset is a strongly typed collection of objects that can be transformed in parallel using functional or relational operations.
A Dataset differs from an RDD in the following ways:
A Dataset can be thought of as a specialized DataFrame, where the elements map to a specific JVM object type, instead of to a generic Row container. A DataFrame can be transformed into a specific Dataset by calling df.as[ElementType]. Similarly, you can transform a strongly-typed Dataset to a generic DataFrame by calling ds.toDF().
COMPATIBILITY NOTE: Long term we plan to make DataFrame extend Dataset[Row]. However, making this change to the class hierarchy would break the function signatures for the existing functional operations (map, flatMap, etc.). As such, this class should be considered a preview of the final API. Changes will be made to the interface after Spark 1.6.
1.6.0
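A minimal sketch of the round trip described above (hypothetical column names; assumes a Spark 1.6 shell where sqlContext and its implicits are in scope):

```scala
import sqlContext.implicits._

val df = sqlContext.createDataFrame(Seq((1, "a"))).toDF("id", "tag")

// DataFrame -> typed Dataset, then back to a generic DataFrame.
val ds   = df.as[(Int, String)]
val back = ds.toDF()
```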