pyspark.ml.stat.ChiSquareTest
Conduct Pearson’s independence test for every feature against the label. For each feature, the (feature, label) pairs are converted into a contingency matrix for which the Chi-squared statistic is computed. All label and feature values must be categorical.
The null hypothesis is that the occurrence of the outcomes is statistically independent.
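The computation described above can be sketched in plain Python for a single feature. This is a minimal illustration rather than Spark's implementation: the helper name `chi_square_independence` is hypothetical, and it assumes the feature values and labels arrive as two equal-length, non-empty lists.

```python
from collections import Counter


def chi_square_independence(feature_values, labels):
    """Return (statistic, degrees_of_freedom) for one categorical feature.

    Hypothetical helper for illustration; not part of pyspark.
    """
    n = len(labels)
    # Contingency counts: how often each (feature value, label) pair occurs.
    observed = Counter(zip(feature_values, labels))
    row_totals = Counter(feature_values)
    col_totals = Counter(labels)

    statistic = 0.0
    for f, row_total in row_totals.items():
        for l, col_total in col_totals.items():
            # Count expected in this cell if feature and label were independent.
            expected = row_total * col_total / n
            statistic += (observed[(f, l)] - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# First feature and labels from the dataset in the Examples section.
labels = [0, 0, 1, 1]
feature_0 = [0, 1, 2, 3]
print(chi_square_independence(feature_0, labels))  # (4.0, 3)
```

Each of the four distinct feature values appears once, so every cell has an expected count of 0.5 and contributes 0.5 to the statistic, giving 4.0 with 3 degrees of freedom.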
New in version 2.2.0.
Methods
test(dataset, featuresCol, labelCol[, flatten]): Perform a Pearson’s independence test using dataset.
Methods Documentation
test(dataset, featuresCol, labelCol[, flatten])
Changed in version 3.1.0: Added optional flatten argument.
Parameters
dataset: pyspark.sql.DataFrame
DataFrame of categorical labels and categorical features. Real-valued features will be treated as categorical for each distinct value.
featuresCol: str
Name of the features column in dataset, of type Vector (VectorUDT).
labelCol: str
Name of the label column in dataset, of any numerical type.
flatten: bool, optional
If True, flattens the returned DataFrame.
Returns
DataFrame containing the test result for every feature against the label. If flatten is True, this DataFrame will contain one row per feature with the following fields:
featureIndex: int
pValue: float
degreesOfFreedom: int
statistic: float
If flatten is False, this DataFrame will contain a single Row with the following fields:
pValues: Vector
degreesOfFreedom: Array[int]
statistics: Vector
Each of these fields has one value per feature.
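To make the relationship between the statistic, degreesOfFreedom, and pValue fields concrete, the sketch below computes the upper-tail probability of a chi-squared distribution using only the standard library. It is an illustration under stated assumptions, not Spark's code: the function name `chi_squared_p_value` is hypothetical, and returning 1.0 for zero degrees of freedom (a constant feature) is an assumed convention.

```python
import math


def chi_squared_p_value(statistic, dof):
    """Upper-tail probability P(X >= statistic) for X ~ chi-squared(dof).

    Hypothetical helper for illustration; not part of pyspark.
    """
    if dof == 0:
        # Assumed convention: a constant feature gives no evidence
        # against independence.
        return 1.0

    half_x = statistic / 2.0
    # Closed-form starting points for 1 and 2 degrees of freedom.
    if dof % 2 == 1:
        a = 0.5
        q = math.erfc(math.sqrt(half_x))
    else:
        a = 1.0
        q = math.exp(-half_x)

    # Recurrence for the regularized upper incomplete gamma function:
    # Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1)
    while a < dof / 2.0:
        q += half_x ** a * math.exp(-half_x) / math.gamma(a + 1.0)
        a += 1.0
    return q


# Statistic 4.0 with 3 degrees of freedom, as reported for the first
# feature in the Examples section.
print(round(chi_squared_p_value(4.0, 3), 4))  # 0.2615
```

A p-value this large would not lead to rejecting the null hypothesis of independence at the usual 0.05 level.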
Examples
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.stat import ChiSquareTest
>>> dataset = [[0, Vectors.dense([0, 0, 1])],
...            [0, Vectors.dense([1, 0, 1])],
...            [1, Vectors.dense([2, 1, 1])],
...            [1, Vectors.dense([3, 1, 1])]]
>>> dataset = spark.createDataFrame(dataset, ["label", "features"])
>>> chiSqResult = ChiSquareTest.test(dataset, 'features', 'label')
>>> chiSqResult.select("degreesOfFreedom").collect()[0]
Row(degreesOfFreedom=[3, 1, 0])
>>> chiSqResult = ChiSquareTest.test(dataset, 'features', 'label', True)
>>> row = chiSqResult.orderBy("featureIndex").collect()
>>> row[0].statistic
4.0