Statistics (Spark 1.3.1 JavaDoc)

Object
- org.apache.spark.mllib.stat.Statistics

```
public class Statistics
extends Object
```
:: Experimental :: API for statistical functions in MLlib.

Constructor Summary

Constructors
Constructor and Description

Statistics()

Method Summary

Methods
| Modifier and Type | Method and Description |
| --- | --- |
| `static ChiSqTestResult` | `chiSqTest(Matrix observed)` Conduct Pearson's independence test on the input contingency matrix, which cannot contain negative entries or columns or rows that sum up to 0. |
| `static ChiSqTestResult[]` | `chiSqTest(RDD<LabeledPoint> data)` Conduct Pearson's independence test for every feature against the label across the input RDD. |
| `static ChiSqTestResult` | `chiSqTest(Vector observed)` Conduct Pearson's chi-squared goodness of fit test of the observed data against the uniform distribution, with each category having an expected frequency of `1 / observed.size`. |
| `static ChiSqTestResult` | `chiSqTest(Vector observed, Vector expected)` Conduct Pearson's chi-squared goodness of fit test of the observed data against the expected distribution. |
| `static MultivariateStatisticalSummary` | `colStats(RDD<Vector> X)` Computes column-wise summary statistics for the input RDD[Vector]. |
| `static double` | `corr(RDD<Object> x, RDD<Object> y)` Compute the Pearson correlation for the input RDDs. |
| `static double` | `corr(RDD<Object> x, RDD<Object> y, String method)` Compute the correlation for the input RDDs using the specified method. |
| `static Matrix` | `corr(RDD<Vector> X)` Compute the Pearson correlation matrix for the input RDD of Vectors. |
| `static Matrix` | `corr(RDD<Vector> X, String method)` Compute the correlation matrix for the input RDD of Vectors using the specified method. |

Methods inherited from class Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

- Constructor Detail
  - Statistics
```
public Statistics()
```
- Method Detail
  - colStats
```
public static MultivariateStatisticalSummary colStats(RDD<Vector> X)
```
    Computes column-wise summary statistics for the input RDD[Vector].
    
    Parameters:
    X - an RDD[Vector] for which column-wise summary statistics are to be computed.
    
    Returns:
    MultivariateStatisticalSummary object containing column-wise summary statistics.
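
    To make "column-wise summary statistics" concrete, here is a minimal plain-Java sketch that computes per-column means and variances for a small in-memory data set. It is not Spark code and does not reflect how MLlib computes the summary; the class name `ColumnSummarySketch`, the row-major `double[][]` input, and the choice of the sample variance (dividing by n - 1) are assumptions made for illustration.

```java
import java.util.Arrays;

// Plain-Java illustration only; not Spark code.
class ColumnSummarySketch {
    // Mean of each column of a row-major data set.
    static double[] columnMeans(double[][] rows) {
        int d = rows[0].length;
        double[] mean = new double[d];
        for (double[] row : rows) {
            for (int j = 0; j < d; j++) {
                mean[j] += row[j];
            }
        }
        for (int j = 0; j < d; j++) {
            mean[j] /= rows.length;
        }
        return mean;
    }

    // Sample variance of each column (divides by n - 1, an assumption of this sketch).
    // Needs at least two rows to give a finite result.
    static double[] columnVariances(double[][] rows) {
        int d = rows[0].length;
        double[] mean = columnMeans(rows);
        double[] variance = new double[d];
        for (double[] row : rows) {
            for (int j = 0; j < d; j++) {
                double diff = row[j] - mean[j];
                variance[j] += diff * diff;
            }
        }
        for (int j = 0; j < d; j++) {
            variance[j] /= (rows.length - 1);
        }
        return variance;
    }

    public static void main(String[] args) {
        // Each inner array plays the role of one Vector in the RDD.
        double[][] rows = {{1.0, 10.0}, {2.0, 20.0}, {3.0, 30.0}};
        System.out.println("means:     " + Arrays.toString(columnMeans(rows)));
        System.out.println("variances: " + Arrays.toString(columnVariances(rows)));
    }
}
```

    Each row stands in for one Vector of the input RDD, so the output arrays have one entry per column.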
  - corr
```
public static Matrix corr(RDD<Vector> X)
```
    Compute the Pearson correlation matrix for the input RDD of Vectors. Columns with 0 covariance produce NaN entries in the correlation matrix.
    
    Parameters:
    X - an RDD[Vector] for which the correlation matrix is to be computed.
    
    Returns:
    Pearson correlation matrix comparing columns in X.
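
    The NaN behaviour for columns with 0 covariance follows directly from the Pearson formula, as the plain-Java sketch below shows. It is an illustration of the statistic, not Spark code; the class `PearsonMatrixSketch` and the column-major `double[][]` input are hypothetical. The sketch does not special-case the diagonal, so a constant column also yields NaN there; how Spark treats the diagonal is not stated above.

```java
// Plain-Java illustration of the statistic; not Spark code.
class PearsonMatrixSketch {
    // Pearson correlation of two equally sized samples.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        // A constant column makes this 0.0 / 0.0, which is NaN in IEEE 754 arithmetic.
        return sxy / Math.sqrt(sxx * syy);
    }

    // Pairwise correlations between columns; columns[j] holds all values of column j.
    static double[][] correlationMatrix(double[][] columns) {
        int d = columns.length;
        double[][] corr = new double[d][d];
        for (int i = 0; i < d; i++) {
            for (int j = 0; j < d; j++) {
                corr[i][j] = pearson(columns[i], columns[j]);
            }
        }
        return corr;
    }

    public static void main(String[] args) {
        double[][] columns = {
            {1.0, 2.0, 3.0},   // column 0
            {2.0, 4.0, 6.0},   // column 1, a multiple of column 0
            {5.0, 5.0, 5.0}    // column 2, constant
        };
        double[][] corr = correlationMatrix(columns);
        System.out.println("corr(0, 1) = " + corr[0][1]);
        System.out.println("corr(0, 2) = " + corr[0][2]);
    }
}
```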
  - corr
```
public static Matrix corr(RDD<Vector> X,
          String method)
```
    Compute the correlation matrix for the input RDD of Vectors using the specified method. Methods currently supported: pearson (default), spearman.
    Note that Spearman is a rank correlation: each column must be extracted into an RDD[Double] and sorted to obtain its ranks, and the ranked columns are then joined back into an RDD[Vector], which is fairly costly. Cache the input RDD before calling corr with method = "spearman" to avoid recomputing the common lineage.
    
    Parameters:
    X - an RDD[Vector] for which the correlation matrix is to be computed.
    method - String specifying the method to use for computing correlation. Supported: pearson (default), spearman.
    
    Returns:
    Correlation matrix comparing columns in X.
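
    The sort-then-rank step described above can be sketched in plain Java: replace every value by its rank within its column, then compute the Pearson correlation of the ranks. This is only an illustration of what a rank correlation is, not Spark's distributed implementation; the class `SpearmanSketch` is hypothetical, and giving tied values their average rank is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.Comparator;

// Plain-Java illustration of a rank correlation; not Spark code.
class SpearmanSketch {
    // 1-based ranks of the values; tied values share their average rank (assumption).
    static double[] ranks(double[] values) {
        Integer[] order = new Integer[values.length];
        for (int k = 0; k < order.length; k++) {
            order[k] = k;
        }
        // Sort positions by the value they point to.
        Arrays.sort(order, Comparator.comparingDouble((Integer a) -> values[a]));

        double[] rank = new double[values.length];
        int start = 0;
        while (start < order.length) {
            int end = start;
            while (end + 1 < order.length && values[order[end + 1]] == values[order[start]]) {
                end++;
            }
            double average = (start + end) / 2.0 + 1.0;
            for (int k = start; k <= end; k++) {
                rank[order[k]] = average;
            }
            start = end + 1;
        }
        return rank;
    }

    // Pearson correlation of two equally sized samples.
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = Arrays.stream(x).sum() / n;
        double meanY = Arrays.stream(y).sum() / n;
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    // Spearman correlation: Pearson correlation applied to the ranks.
    static double spearman(double[] x, double[] y) {
        return pearson(ranks(x), ranks(y));
    }

    public static void main(String[] args) {
        double[] x = {10.0, 20.0, 30.0, 40.0};
        double[] y = {1.0, 4.0, 9.0, 16.0};  // monotone in x, but not linear
        System.out.println("spearman = " + spearman(x, y));
        System.out.println("pearson  = " + pearson(x, y));
    }
}
```

    Because ranking requires a sort of every column, the cost grows with the number of columns, which is why caching the input is recommended.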
  - corr
```
public static double corr(RDD<Object> x,
          RDD<Object> y)
```
    Compute the Pearson correlation for the input RDDs. Returns NaN if either vector has 0 variance.
    Note: the two input RDDs need to have the same number of partitions and the same number of elements in each partition.
    
    Parameters:
    x - RDD[Double] of the same cardinality as y.
    y - RDD[Double] of the same cardinality as x.
    
    Returns:
    A Double containing the Pearson correlation between the two input RDD[Double]s.
  - corr
```
public static double corr(RDD<Object> x,
          RDD<Object> y,
          String method)
```
    Compute the correlation for the input RDDs using the specified method. Methods currently supported: pearson (default), spearman.
    Note: the two input RDDs need to have the same number of partitions and the same number of elements in each partition.
    
    Parameters:
    x - RDD[Double] of the same cardinality as y.
    y - RDD[Double] of the same cardinality as x.
    method - String specifying the method to use for computing correlation. Supported: pearson (default), spearman.
    
    Returns:
    A Double containing the correlation between the two input RDD[Double]s using the specified method.
  - chiSqTest
```
public static ChiSqTestResult chiSqTest(Vector observed,
                        Vector expected)
```
    Conduct Pearson's chi-squared goodness of fit test of the observed data against the expected distribution.
    Note: the two input Vectors need to have the same size. observed cannot contain negative values. expected cannot contain nonpositive values.
    
    Parameters:
    observed - Vector containing the observed categorical counts/relative frequencies.
    expected - Vector containing the expected categorical counts/relative frequencies. expected is rescaled if the expected sum differs from the observed sum.
    
    Returns:
    ChiSqTestResult object containing the test statistic, degrees of freedom, p-value, the method used, and the null hypothesis.
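
    The rescaling of expected and the input constraints can be illustrated with a plain-Java sketch of the goodness of fit statistic. It is not Spark code; the class `GoodnessOfFitSketch` and its method names are hypothetical, and the p-value is left out because it needs a chi-squared distribution function that the Java standard library does not provide.

```java
// Plain-Java illustration of the statistic; not Spark code.
class GoodnessOfFitSketch {
    // Pearson's chi-squared statistic: the sum of (observed - expected)^2 / expected.
    static double statistic(double[] observed, double[] expected) {
        if (observed.length != expected.length) {
            throw new IllegalArgumentException("observed and expected must have the same size");
        }
        double observedSum = 0.0;
        double expectedSum = 0.0;
        for (int i = 0; i < observed.length; i++) {
            if (observed[i] < 0.0) {
                throw new IllegalArgumentException("observed cannot contain negative values");
            }
            if (expected[i] <= 0.0) {
                throw new IllegalArgumentException("expected cannot contain nonpositive values");
            }
            observedSum += observed[i];
            expectedSum += expected[i];
        }
        // Rescale expected so that both vectors sum to the same total.
        double scale = observedSum / expectedSum;
        double stat = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double e = expected[i] * scale;
            double diff = observed[i] - e;
            stat += diff * diff / e;
        }
        return stat;
    }

    // Degrees of freedom for a goodness of fit test: number of categories minus one.
    static int degreesOfFreedom(double[] observed) {
        return observed.length - 1;
    }

    public static void main(String[] args) {
        double[] observed = {10.0, 20.0, 30.0};
        double[] uniform = {1.0, 1.0, 1.0};  // relative weights, rescaled to 20 each
        System.out.println("statistic = " + statistic(observed, uniform));
        System.out.println("df        = " + degreesOfFreedom(observed));
    }
}
```

    Passing equal weights for expected, as in `main`, corresponds to testing against the uniform distribution.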
  - chiSqTest
```
public static ChiSqTestResult chiSqTest(Vector observed)
```
    Conduct Pearson's chi-squared goodness of fit test of the observed data against the uniform distribution, with each category having an expected frequency of 1 / observed.size.
    Note: observed cannot contain negative values.
    
    Parameters:
    observed - Vector containing the observed categorical counts/relative frequencies.
    
    Returns:
    ChiSqTestResult object containing the test statistic, degrees of freedom, p-value, the method used, and the null hypothesis.
  - chiSqTest
```
public static ChiSqTestResult chiSqTest(Matrix observed)
```
    Conduct Pearson's independence test on the input contingency matrix, which cannot contain negative entries or columns or rows that sum up to 0.
    
    Parameters:
    observed - The contingency matrix (containing either counts or relative frequencies).
    
    Returns:
    ChiSqTestResult object containing the test statistic, degrees of freedom, p-value, the method used, and the null hypothesis.
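
    For the independence test, the expected count of each cell comes from the row and column totals, which is why rows or columns that sum to 0 are not allowed. The plain-Java sketch below shows the textbook computation; it is not Spark code, the class `IndependenceSketch` is hypothetical, and the p-value is again left out.

```java
// Plain-Java illustration of the statistic; not Spark code.
class IndependenceSketch {
    // Pearson's chi-squared statistic for a contingency table given as rows of counts.
    static double statistic(double[][] table) {
        int rows = table.length;
        int cols = table[0].length;
        double[] rowSum = new double[rows];
        double[] colSum = new double[cols];
        double total = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                if (table[i][j] < 0.0) {
                    throw new IllegalArgumentException("negative entry in contingency table");
                }
                rowSum[i] += table[i][j];
                colSum[j] += table[i][j];
                total += table[i][j];
            }
        }
        for (double s : rowSum) {
            if (s == 0.0) throw new IllegalArgumentException("a row sums to 0");
        }
        for (double s : colSum) {
            if (s == 0.0) throw new IllegalArgumentException("a column sums to 0");
        }
        double stat = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                // Expected count under independence of the row and column variables.
                double expected = rowSum[i] * colSum[j] / total;
                double diff = table[i][j] - expected;
                stat += diff * diff / expected;
            }
        }
        return stat;
    }

    // Degrees of freedom: (rows - 1) * (columns - 1).
    static int degreesOfFreedom(double[][] table) {
        return (table.length - 1) * (table[0].length - 1);
    }

    public static void main(String[] args) {
        double[][] table = {{10.0, 20.0}, {20.0, 10.0}};
        System.out.println("statistic = " + statistic(table));
        System.out.println("df        = " + degreesOfFreedom(table));
    }
}
```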
  - chiSqTest
```
public static ChiSqTestResult[] chiSqTest(RDD<LabeledPoint> data)
```
    Conduct Pearson's independence test for every feature against the label across the input RDD. For each feature, the (feature, label) pairs are converted into a contingency matrix for which the chi-squared statistic is computed. All label and feature values must be categorical.
    
    Parameters:
    data - an RDD[LabeledPoint] containing the labeled dataset with categorical features. Real-valued features will be treated as categorical for each distinct value.
    
    Returns:
    An array containing the ChiSqTestResult for every feature against the label. The order of the elements in the returned array reflects the order of input features.
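
    The conversion of (feature, label) pairs into a contingency matrix can be pictured with the plain-Java sketch below, which counts how often each distinct feature value occurs with each distinct label. It is not Spark code; the class `FeatureLabelContingencySketch` is hypothetical, and the orientation (rows for feature values, columns for labels, both in ascending order) is an assumption of this sketch.

```java
import java.util.Arrays;

// Plain-Java illustration only; not Spark code.
class FeatureLabelContingencySketch {
    // Counts co-occurrences of each distinct feature value (rows) and label (columns).
    static double[][] contingency(double[] feature, double[] label) {
        if (feature.length != label.length) {
            throw new IllegalArgumentException("feature and label must have the same length");
        }
        // Every distinct value is treated as its own category.
        double[] featureValues = Arrays.stream(feature).distinct().sorted().toArray();
        double[] labelValues = Arrays.stream(label).distinct().sorted().toArray();

        double[][] counts = new double[featureValues.length][labelValues.length];
        for (int i = 0; i < feature.length; i++) {
            int row = Arrays.binarySearch(featureValues, feature[i]);
            int col = Arrays.binarySearch(labelValues, label[i]);
            counts[row][col] += 1.0;
        }
        return counts;
    }

    public static void main(String[] args) {
        // One feature column and the labels of five data points.
        double[] feature = {0.0, 0.0, 1.0, 1.0, 1.0};
        double[] label = {1.0, 0.0, 1.0, 1.0, 0.0};
        System.out.println(Arrays.deepToString(contingency(feature, label)));
    }
}
```

    A table of this shape is the kind of input that `chiSqTest(Matrix observed)` takes. Because every distinct value becomes its own row, a real-valued feature with many distinct values produces a very large table.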
