Applies a logical alias to this Dataset that can be used to disambiguate columns that have the same name after two Datasets have been joined.
1.6.0
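As a minimal sketch of how aliases disambiguate otherwise identical column names (hypothetical data; assumes a Spark 1.6 shell where sqlContext and its implicits are in scope):

```scala
import sqlContext.implicits._

// Both Datasets encode tuples, so both expose columns named _1 and _2.
val left  = sqlContext.createDataset(Seq(("a", 1), ("b", 2))).as("l")
val right = sqlContext.createDataset(Seq(("a", 10))).as("r")

// The aliases let the join condition name each side's _1 unambiguously.
val joined = left.joinWith(right, $"l._1" === $"r._1")
```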
Returns a new Dataset where each record has been mapped onto the specified type. The method used to map columns depends on the type of U:
- When U is a class, fields for the class will be mapped to columns of the same name (case sensitivity is determined by spark.sql.caseSensitive).
- When U is a tuple, the columns will be mapped by ordinal (i.e. the first column will be assigned to _1).
- When U is a primitive type (i.e. String, Int, etc.), the first column of the DataFrame will be used.

If the schema of the DataFrame does not match the desired U type, you can use select along with alias or as to rearrange or rename as required.
1.6.0
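A short sketch of the select/rearrange pattern described above (hypothetical column names; assumes a Spark 1.6 shell where sqlContext and its implicits are in scope):

```scala
import sqlContext.implicits._

// Column order differs from the target tuple, so rearrange with select first.
val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"))).toDF("id", "name")
val ds = df.select($"name", $"id").as[(String, Int)]
```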
Persist this Dataset with the default storage level (MEMORY_AND_DISK).
1.6.0
Returns a new Dataset that has exactly numPartitions partitions.
Similar to coalesce defined on an RDD, this operation results in a narrow dependency, e.g.
if you go from 1000 partitions to 100 partitions, there will not be a shuffle, instead each of
the 100 new partitions will claim 10 of the current partitions.
1.6.0
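A minimal illustration of the narrow dependency (hypothetical data; assumes a Spark 1.6 shell where sqlContext and its implicits are in scope):

```scala
import sqlContext.implicits._

val ds = sqlContext.createDataset((1 to 100).toSeq).repartition(10)

// Narrow dependency: the 2 new partitions each claim 5 of the 10 existing
// partitions, with no shuffle.
val coalesced = ds.coalesce(2)
val numParts = coalesced.rdd.partitions.length
```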
Returns an array that contains all the elements in this Dataset.
Running collect requires moving all the data into the application's driver process, and doing so on a very large Dataset can crash the driver process with OutOfMemoryError.
For Java API, use collectAsList.
1.6.0
Returns an array that contains all the elements in this Dataset.
Running collect requires moving all the data into the application's driver process, and doing so on a very large Dataset can crash the driver process with OutOfMemoryError.
For Java API, use collectAsList.
1.6.0
Returns the number of elements in the Dataset.
1.6.0
Returns a new Dataset that contains only the unique elements of this Dataset.
Prints the physical plan to the console for debugging purposes.
1.6.0
Prints the plans (logical and physical) to the console for debugging purposes.
1.6.0
(Java-specific) Returns a new Dataset that only contains elements where func returns true.
1.6.0
(Scala-specific) Returns a new Dataset that only contains elements where func returns true.
1.6.0
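A minimal sketch of the typed filter (hypothetical data; assumes a Spark 1.6 shell where sqlContext and its implicits are in scope):

```scala
import sqlContext.implicits._

val ds = sqlContext.createDataset(Seq(1, 2, 3, 4, 5))

// Keep only elements for which the predicate returns true.
val evens = ds.filter(_ % 2 == 0)
```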
Returns the first element in this Dataset.
1.6.0
(Java-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
(Scala-specific) Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
(Java-specific) Runs func on each element of this Dataset.
1.6.0
(Scala-specific) Runs func on each element of this Dataset.
1.6.0
(Java-specific) Runs func on each partition of this Dataset.
1.6.0
(Scala-specific) Runs func on each partition of this Dataset.
1.6.0
(Java-specific) Returns a GroupedDataset where the data is grouped by the given key func.
1.6.0
Returns a GroupedDataset where the data is grouped by the given Column expressions.
1.6.0
(Scala-specific) Returns a GroupedDataset where the data is grouped by the given key func.
1.6.0
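A sketch of grouping by a key function and then collapsing each group (hypothetical data; assumes a Spark 1.6 shell with sqlContext and its implicits in scope, and the 1.6-era GroupedDataset.mapGroups signature):

```scala
import sqlContext.implicits._

val words = sqlContext.createDataset(Seq("a", "bb", "cc", "ddd"))

// Group by word length, then fold each group into a (length, count) pair.
val counts = words.groupBy(_.length).mapGroups { (len, ws) => (len, ws.size) }
```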
Returns a new Dataset that contains only the elements of this Dataset that are also present in other.
Joins this Dataset with another Dataset using an inner equi-join, returning a Tuple2 for each pair where condition evaluates to true.
Right side of the join.
Join expression.
1.6.0
Joins this Dataset returning a Tuple2 for each pair where condition evaluates to true.
This is similar to the relational join function with one important difference in the result schema. Since joinWith preserves objects present on either side of the join, the result schema is similarly nested into a tuple under the column names _1 and _2.
This type of join can be useful both for preserving type-safety with the original object types as well as working with relational data where either side of the join has column names in common.
Right side of the join.
Join expression.
One of: inner, outer, left_outer, right_outer, leftsemi.
1.6.0
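A minimal sketch of the nested tuple result (hypothetical data; assumes a Spark 1.6 shell where sqlContext and its implicits are in scope):

```scala
import sqlContext.implicits._

val orders = sqlContext.createDataset(Seq((1, "pen"), (2, "ink"))).as("o")
val prices = sqlContext.createDataset(Seq((1, 2.5), (3, 9.99))).as("p")

// Each matching pair is kept whole: the element type is
// ((Int, String), (Int, Double)), under columns _1 and _2.
val joined = orders.joinWith(prices, $"o._1" === $"p._1", "inner")
```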
(Java-specific) Returns a new Dataset that contains the result of applying func to each element.
1.6.0
(Scala-specific) Returns a new Dataset that contains the result of applying func to each element.
1.6.0
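A minimal sketch of the typed map (hypothetical data; assumes a Spark 1.6 shell where sqlContext and its implicits are in scope):

```scala
import sqlContext.implicits._

val ds = sqlContext.createDataset(Seq("spark", "dataset"))

// func is applied to each element; the result is a Dataset[Int].
val lengths = ds.map(_.length)
```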
(Java-specific) Returns a new Dataset that contains the result of applying func to each partition.
1.6.0
(Scala-specific) Returns a new Dataset that contains the result of applying func to each partition.
1.6.0
Persist this Dataset with the given storage level.
One of: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.
1.6.0
Persist this Dataset with the default storage level (MEMORY_AND_DISK).
1.6.0
Prints the schema of the underlying Dataset to the console in a nice tree format.
Converts this Dataset to an RDD.
1.6.0
(Java-specific) Reduces the elements of this Dataset using the specified binary function.
(Java-specific) Reduces the elements of this Dataset using the specified binary function. The given func must be commutative and associative or the result may be non-deterministic.
1.6.0
(Scala-specific) Reduces the elements of this Dataset using the specified binary function.
(Scala-specific) Reduces the elements of this Dataset using the specified binary function. The given func must be commutative and associative or the result may be non-deterministic.
1.6.0
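A minimal sketch of reduce (hypothetical data; assumes a Spark 1.6 shell where sqlContext and its implicits are in scope):

```scala
import sqlContext.implicits._

val ds = sqlContext.createDataset(Seq(1, 2, 3, 4))

// Addition is commutative and associative, so the result is deterministic.
val total = ds.reduce(_ + _)
```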
Returns a new Dataset that has exactly numPartitions partitions.
1.6.0
Returns a new Dataset by sampling a fraction of records, using a random seed.
1.6.0
Returns a new Dataset by sampling a fraction of records.
1.6.0
Returns the schema of the encoded form of the objects in this Dataset.
Returns a new Dataset by computing the given Column expressions for each element.
Returns a new Dataset by computing the given Column expressions for each element.
Returns a new Dataset by computing the given Column expressions for each element.
Returns a new Dataset by computing the given Column expressions for each element.
Returns a new Dataset by computing the given Column expression for each element.
Returns a new DataFrame by selecting a set of column based expressions.
df.select($"colA", $"colB" + 1)
1.6.0
Internal helper function for building typed selects that return tuples.
Internal helper function for building typed selects that return tuples. For simplicity and code reuse, we do this without the help of the type system and then use helper functions that cast appropriately for the user-facing interface.
Displays the Dataset in a tabular form.
Displays the Dataset in a tabular form. For example:
year month AVG('Adj Close) MAX('Adj Close)
1980 12    0.503218        0.595103
1981 01    0.523289        0.570307
1982 02    0.436504        0.475256
1983 03    0.410516        0.442194
1984 04    0.450090        0.483521
Number of rows to show
Whether to truncate long strings. If true, strings of more than 20 characters will be truncated and all cells will be aligned right
1.6.0
Displays the top 20 rows of Dataset in a tabular form.
Whether to truncate long strings. If true, strings of more than 20 characters will be truncated and all cells will be aligned right
1.6.0
Displays the top 20 rows of Dataset in a tabular form.
Displays the top 20 rows of Dataset in a tabular form. Strings more than 20 characters will be truncated, and all cells will be aligned right.
1.6.0
Displays the content of this Dataset in a tabular form.
Displays the content of this Dataset in a tabular form. Strings more than 20 characters will be truncated, and all cells will be aligned right. For example:
year month AVG('Adj Close) MAX('Adj Close)
1980 12    0.503218        0.595103
1981 01    0.523289        0.570307
1982 02    0.436504        0.475256
1983 03    0.410516        0.442194
1984 04    0.450090        0.483521
Number of rows to show
1.6.0
Returns a new Dataset where any elements present in other have been removed.
Note that equality checking is performed directly on the encoded representation of the data and thus is not affected by a custom equals function defined on T.
1.6.0
Returns the first num elements of this Dataset as an array.
Running take requires moving data into the application's driver process, and doing so with a very large num can crash the driver process with OutOfMemoryError.
1.6.0
Returns the first num elements of this Dataset as an array.
Running take requires moving data into the application's driver process, and doing so with a very large num can crash the driver process with OutOfMemoryError.
1.6.0
Converts this strongly typed collection of data to a generic DataFrame.
Converts this strongly typed collection of data to a generic DataFrame. In contrast to the strongly typed objects that Dataset operations work on, a DataFrame returns generic Row objects that allow fields to be accessed by ordinal or name.
Returns this Dataset.
1.6.0
Concise syntax for chaining custom transformations.
def featurize(ds: Dataset[T]) = ...
dataset
  .transform(featurize)
  .transform(...)
1.6.0
Returns a new Dataset that contains the elements of both this and the other Dataset combined.
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.
1.6.0
Mark the Dataset as non-persistent, and remove all blocks for it from memory and disk.
Whether to block until all blocks are deleted.
1.6.0
:: Experimental :: A Dataset is a strongly typed collection of objects that can be transformed in parallel using functional or relational operations.
A Dataset differs from an RDD in the following ways:
A Dataset can be thought of as a specialized DataFrame, where the elements map to a specific JVM object type, instead of to a generic Row container. A DataFrame can be transformed into a specific Dataset by calling df.as[ElementType]. Similarly, you can transform a strongly-typed Dataset to a generic DataFrame by calling ds.toDF().
COMPATIBILITY NOTE: Long term we plan to make DataFrame extend Dataset[Row]. However, making this change to the class hierarchy would break the function signatures for the existing functional operations (map, flatMap, etc.). As such, this class should be considered a preview of the final API. Changes will be made to the interface after Spark 1.6.
1.6.0
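A minimal sketch of the round trip described above (hypothetical column names; assumes a Spark 1.6 shell where sqlContext and its implicits are in scope):

```scala
import sqlContext.implicits._

val df = sqlContext.createDataFrame(Seq((1, "a"))).toDF("id", "tag")

// DataFrame -> typed Dataset, then back to a generic DataFrame.
val ds   = df.as[(Int, String)]
val back = ds.toDF()
```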