public class GroupedDataset<K,V>
extends Object
implements scala.Serializable
Dataset
has been logically grouped by a user specified grouping key. Users should not
construct a GroupedDataset
directly, but should instead call groupBy
on an existing
Dataset
.
COMPATIBILITY NOTE: Long term we plan to make GroupedDataset)
extend GroupedData
. However,
making this change to the class hierarchy would break some function signatures. As such, this
class should be considered a preview of the final API. Changes will be made to the interface
after Spark 1.6.
Modifier and Type | Method and Description |
---|---|
<U1> Dataset<scala.Tuple2<K,U1>> |
agg(TypedColumn<V,U1> col1)
Computes the given aggregation, returning a
Dataset of tuples for each unique key
and the result of computing this aggregation over all elements in the group. |
<U1,U2> Dataset<scala.Tuple3<K,U1,U2>> |
agg(TypedColumn<V,U1> col1,
TypedColumn<V,U2> col2)
Computes the given aggregations, returning a
Dataset of tuples for each unique key
and the result of computing these aggregations over all elements in the group. |
<U1,U2,U3> Dataset<scala.Tuple4<K,U1,U2,U3>> |
agg(TypedColumn<V,U1> col1,
TypedColumn<V,U2> col2,
TypedColumn<V,U3> col3)
Computes the given aggregations, returning a
Dataset of tuples for each unique key
and the result of computing these aggregations over all elements in the group. |
<U1,U2,U3,U4> |
agg(TypedColumn<V,U1> col1,
TypedColumn<V,U2> col2,
TypedColumn<V,U3> col3,
TypedColumn<V,U4> col4)
Computes the given aggregations, returning a
Dataset of tuples for each unique key
and the result of computing these aggregations over all elements in the group. |
<U,R> Dataset<R> |
cogroup(GroupedDataset<K,U> other,
CoGroupFunction<K,V,U,R> f,
Encoder<R> encoder)
Applies the given function to each cogrouped data.
|
<U,R> Dataset<R> |
cogroup(GroupedDataset<K,U> other,
scala.Function3<K,scala.collection.Iterator<V>,scala.collection.Iterator<U>,scala.collection.TraversableOnce<R>> f,
Encoder<R> evidence$4)
Applies the given function to each cogrouped data.
|
Dataset<scala.Tuple2<K,Object>> |
count()
Returns a
Dataset that contains a tuple with each key and the number of items present
for that key. |
<U> Dataset<U> |
flatMapGroups(FlatMapGroupsFunction<K,V,U> f,
Encoder<U> encoder)
Applies the given function to each group of data.
|
<U> Dataset<U> |
flatMapGroups(scala.Function2<K,scala.collection.Iterator<V>,scala.collection.TraversableOnce<U>> f,
Encoder<U> evidence$2)
Applies the given function to each group of data.
|
<L> GroupedDataset<L,V> |
keyAs(Encoder<L> evidence$1)
Returns a new
GroupedDataset where the type of the key has been mapped to the specified
type. |
Dataset<K> |
keys()
Returns a
Dataset that contains each unique key. |
<U> Dataset<U> |
mapGroups(scala.Function2<K,scala.collection.Iterator<V>,U> f,
Encoder<U> evidence$3)
Applies the given function to each group of data.
|
<U> Dataset<U> |
mapGroups(MapGroupsFunction<K,V,U> f,
Encoder<U> encoder)
Applies the given function to each group of data.
|
org.apache.spark.sql.execution.QueryExecution |
queryExecution() |
Dataset<scala.Tuple2<K,V>> |
reduce(scala.Function2<V,V,V> f)
Reduces the elements of each group of data using the specified binary function.
|
Dataset<scala.Tuple2<K,V>> |
reduce(ReduceFunction<V> f)
Reduces the elements of each group of data using the specified binary function.
|
public org.apache.spark.sql.execution.QueryExecution queryExecution()
public <L> GroupedDataset<L,V> keyAs(Encoder<L> evidence$1)
GroupedDataset
where the type of the key has been mapped to the specified
type. The mapping of key columns to the type follows the same rules as as
on Dataset
.
evidence$1
- (undocumented)public Dataset<K> keys()
Dataset
that contains each unique key.
public <U> Dataset<U> flatMapGroups(scala.Function2<K,scala.collection.Iterator<V>,scala.collection.TraversableOnce<U>> f, Encoder<U> evidence$2)
Dataset
.
This function does not support partial aggregation, and as a result requires shuffling all
the data in the Dataset
. If an application intends to perform an aggregation over each
key, it is best to use the reduce function or an
Aggregator
.
Internally, the implementation will spill to disk if any given group is too large to fit into
memory. However, users must take care to avoid materializing the whole iterator for a group
(for example, by calling toList
) unless they are sure that this is possible given the memory
constraints of their cluster.
f
- (undocumented)evidence$2
- (undocumented)public <U> Dataset<U> flatMapGroups(FlatMapGroupsFunction<K,V,U> f, Encoder<U> encoder)
Dataset
.
This function does not support partial aggregation, and as a result requires shuffling all
the data in the Dataset
. If an application intends to perform an aggregation over each
key, it is best to use the reduce function or an
Aggregator
.
Internally, the implementation will spill to disk if any given group is too large to fit into
memory. However, users must take care to avoid materializing the whole iterator for a group
(for example, by calling toList
) unless they are sure that this is possible given the memory
constraints of their cluster.
f
- (undocumented)encoder
- (undocumented)public <U> Dataset<U> mapGroups(scala.Function2<K,scala.collection.Iterator<V>,U> f, Encoder<U> evidence$3)
Dataset
.
This function does not support partial aggregation, and as a result requires shuffling all
the data in the Dataset
. If an application intends to perform an aggregation over each
key, it is best to use the reduce function or an
Aggregator
.
Internally, the implementation will spill to disk if any given group is too large to fit into
memory. However, users must take care to avoid materializing the whole iterator for a group
(for example, by calling toList
) unless they are sure that this is possible given the memory
constraints of their cluster.
f
- (undocumented)evidence$3
- (undocumented)public <U> Dataset<U> mapGroups(MapGroupsFunction<K,V,U> f, Encoder<U> encoder)
Dataset
.
This function does not support partial aggregation, and as a result requires shuffling all
the data in the Dataset
. If an application intends to perform an aggregation over each
key, it is best to use the reduce function or an
Aggregator
.
Internally, the implementation will spill to disk if any given group is too large to fit into
memory. However, users must take care to avoid materializing the whole iterator for a group
(for example, by calling toList
) unless they are sure that this is possible given the memory
constraints of their cluster.
f
- (undocumented)encoder
- (undocumented)public Dataset<scala.Tuple2<K,V>> reduce(scala.Function2<V,V,V> f)
f
- (undocumented)public Dataset<scala.Tuple2<K,V>> reduce(ReduceFunction<V> f)
f
- (undocumented)public <U1> Dataset<scala.Tuple2<K,U1>> agg(TypedColumn<V,U1> col1)
Dataset
of tuples for each unique key
and the result of computing this aggregation over all elements in the group.
col1
- (undocumented)public <U1,U2> Dataset<scala.Tuple3<K,U1,U2>> agg(TypedColumn<V,U1> col1, TypedColumn<V,U2> col2)
Dataset
of tuples for each unique key
and the result of computing these aggregations over all elements in the group.
col1
- (undocumented)col2
- (undocumented)public <U1,U2,U3> Dataset<scala.Tuple4<K,U1,U2,U3>> agg(TypedColumn<V,U1> col1, TypedColumn<V,U2> col2, TypedColumn<V,U3> col3)
Dataset
of tuples for each unique key
and the result of computing these aggregations over all elements in the group.
col1
- (undocumented)col2
- (undocumented)col3
- (undocumented)public <U1,U2,U3,U4> Dataset<scala.Tuple5<K,U1,U2,U3,U4>> agg(TypedColumn<V,U1> col1, TypedColumn<V,U2> col2, TypedColumn<V,U3> col3, TypedColumn<V,U4> col4)
Dataset
of tuples for each unique key
and the result of computing these aggregations over all elements in the group.
col1
- (undocumented)col2
- (undocumented)col3
- (undocumented)col4
- (undocumented)public Dataset<scala.Tuple2<K,Object>> count()
Dataset
that contains a tuple with each key and the number of items present
for that key.
public <U,R> Dataset<R> cogroup(GroupedDataset<K,U> other, scala.Function3<K,scala.collection.Iterator<V>,scala.collection.Iterator<U>,scala.collection.TraversableOnce<R>> f, Encoder<R> evidence$4)
Dataset
this
and other
. The function can return an iterator containing elements of an
arbitrary type which will be returned as a new Dataset
.
other
- (undocumented)f
- (undocumented)evidence$4
- (undocumented)public <U,R> Dataset<R> cogroup(GroupedDataset<K,U> other, CoGroupFunction<K,V,U,R> f, Encoder<R> encoder)
Dataset
this
and other
. The function can return an iterator containing elements of an
arbitrary type which will be returned as a new Dataset
.
other
- (undocumented)f
- (undocumented)encoder
- (undocumented)