col(col)
col
Returns a Column based on the given column name.
Column
column(col)
column
lit(col)
lit
Creates a Column of literal value.
broadcast(df)
broadcast
Marks a DataFrame as small enough for use in broadcast joins.
coalesce(*cols)
coalesce
Returns the first column that is not null.
input_file_name()
input_file_name
Creates a string column for the file name of the current Spark task.
isnan(col)
isnan
An expression that returns true if the column is NaN.
isnull(col)
isnull
An expression that returns true if the column is null.
monotonically_increasing_id()
monotonically_increasing_id
A column that generates monotonically increasing 64-bit integers.
nanvl(col1, col2)
nanvl
Returns col1 if it is not NaN, or col2 if col1 is NaN.
rand([seed])
rand
Generates a random column with independent and identically distributed (i.i.d.) samples uniformly distributed in [0.0, 1.0).
randn([seed])
randn
Generates a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.
spark_partition_id()
spark_partition_id
A column for partition ID.
when(condition, value)
when
Evaluates a list of conditions and returns one of multiple possible result expressions.
bitwise_not(col)
bitwise_not
Computes bitwise not.
bitwiseNOT(col)
bitwiseNOT
expr(str)
expr
Parses the expression string into the column that it represents
greatest(*cols)
greatest
Returns the greatest value of the list of column names, skipping null values.
least(*cols)
least
Returns the least value of the list of column names, skipping null values.
sqrt(col)
sqrt
Computes the square root of the specified float value.
abs(col)
abs
Computes the absolute value.
acos(col)
acos
Computes inverse cosine of the input column.
acosh(col)
acosh
Computes inverse hyperbolic cosine of the input column.
asin(col)
asin
Computes inverse sine of the input column.
asinh(col)
asinh
Computes inverse hyperbolic sine of the input column.
atan(col)
atan
Compute inverse tangent of the input column.
atanh(col)
atanh
Computes inverse hyperbolic tangent of the input column.
atan2(col1, col2)
atan2
New in version 1.4.0.
bin(col)
bin
Returns the string representation of the binary value of the given column.
cbrt(col)
cbrt
Computes the cube-root of the given value.
ceil(col)
ceil
Computes the ceiling of the given value.
conv(col, fromBase, toBase)
conv
Convert a number in a string column from one base to another.
cos(col)
cos
Computes cosine of the input column.
cosh(col)
cosh
Computes hyperbolic cosine of the input column.
cot(col)
cot
Computes cotangent of the input column.
csc(col)
csc
Computes cosecant of the input column.
exp(col)
exp
Computes the exponential of the given value.
expm1(col)
expm1
Computes the exponential of the given value minus one.
factorial(col)
factorial
Computes the factorial of the given value.
floor(col)
floor
Computes the floor of the given value.
hex(col)
hex
Computes hex value of the given column, which could be pyspark.sql.types.StringType, pyspark.sql.types.BinaryType, pyspark.sql.types.IntegerType or pyspark.sql.types.LongType.
pyspark.sql.types.StringType
pyspark.sql.types.BinaryType
pyspark.sql.types.IntegerType
pyspark.sql.types.LongType
unhex(col)
unhex
Inverse of hex.
hypot(col1, col2)
hypot
Computes sqrt(a^2 + b^2) without intermediate overflow or underflow.
sqrt(a^2 + b^2)
log(arg1[, arg2])
log
Returns the first argument-based logarithm of the second argument.
log10(col)
log10
Computes the logarithm of the given value in Base 10.
log1p(col)
log1p
Computes the natural logarithm of the “given value plus one”.
log2(col)
log2
Returns the base-2 logarithm of the argument.
pmod(dividend, divisor)
pmod
Returns the positive value of dividend mod divisor.
pow(col1, col2)
pow
Returns the value of the first argument raised to the power of the second argument.
rint(col)
rint
Returns the double value that is closest in value to the argument and is equal to a mathematical integer.
round(col[, scale])
round
Round the given value to scale decimal places using HALF_UP rounding mode if scale >= 0 or at integral part when scale < 0.
bround(col[, scale])
bround
Round the given value to scale decimal places using HALF_EVEN rounding mode if scale >= 0 or at integral part when scale < 0.
sec(col)
sec
Computes secant of the input column.
shiftleft(col, numBits)
shiftleft
Shift the given value numBits left.
shiftright(col, numBits)
shiftright
(Signed) shift the given value numBits right.
shiftrightunsigned(col, numBits)
shiftrightunsigned
Unsigned shift the given value numBits right.
signum(col)
signum
Computes the signum of the given value.
sin(col)
sin
Computes sine of the input column.
sinh(col)
sinh
Computes hyperbolic sine of the input column.
tan(col)
tan
Computes tangent of the input column.
tanh(col)
tanh
Computes hyperbolic tangent of the input column.
toDegrees(col)
toDegrees
degrees(col)
degrees
Converts an angle measured in radians to an approximately equivalent angle measured in degrees.
toRadians(col)
toRadians
radians(col)
radians
Converts an angle measured in degrees to an approximately equivalent angle measured in radians.
add_months(start, months)
add_months
Returns the date that is months months after start.
current_date()
current_date
Returns the current date at the start of query evaluation as a DateType column.
DateType
current_timestamp()
current_timestamp
Returns the current timestamp at the start of query evaluation as a TimestampType column.
TimestampType
date_add(start, days)
date_add
Returns the date that is days days after start.
date_format(date, format)
date_format
Converts a date/timestamp/string to a value of string in the format specified by the date format given by the second argument.
date_sub(start, days)
date_sub
Returns the date that is days days before start.
date_trunc(format, timestamp)
date_trunc
Returns timestamp truncated to the unit specified by the format.
datediff(end, start)
datediff
Returns the number of days from start to end.
dayofmonth(col)
dayofmonth
Extract the day of the month of a given date/timestamp as integer.
dayofweek(col)
dayofweek
Extract the day of the week of a given date/timestamp as integer.
dayofyear(col)
dayofyear
Extract the day of the year of a given date/timestamp as integer.
second(col)
second
Extract the seconds of a given date as integer.
weekofyear(col)
weekofyear
Extract the week number of a given date as integer.
year(col)
year
Extract the year of a given date/timestamp as integer.
quarter(col)
quarter
Extract the quarter of a given date/timestamp as integer.
month(col)
month
Extract the month of a given date/timestamp as integer.
last_day(date)
last_day
Returns the last day of the month which the given date belongs to.
localtimestamp()
localtimestamp
Returns the current timestamp without time zone at the start of query evaluation as a timestamp without time zone column.
minute(col)
minute
Extract the minutes of a given timestamp as integer.
months_between(date1, date2[, roundOff])
months_between
Returns number of months between dates date1 and date2.
next_day(date, dayOfWeek)
next_day
Returns the first date which is later than the value of the date column based on second week day argument.
hour(col)
hour
Extract the hours of a given timestamp as integer.
make_date(year, month, day)
make_date
Returns a column with a date built from the year, month and day columns.
from_unixtime(timestamp[, format])
from_unixtime
Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.
unix_timestamp([timestamp, format])
unix_timestamp
Convert time string with given pattern (‘yyyy-MM-dd HH:mm:ss’, by default) to Unix time stamp (in seconds), using the default timezone and the default locale, returns null if failed.
to_timestamp(col[, format])
to_timestamp
Converts a Column into pyspark.sql.types.TimestampType using the optionally specified format.
pyspark.sql.types.TimestampType
to_date(col[, format])
to_date
Converts a Column into pyspark.sql.types.DateType using the optionally specified format.
pyspark.sql.types.DateType
trunc(date, format)
trunc
Returns date truncated to the unit specified by the format.
from_utc_timestamp(timestamp, tz)
from_utc_timestamp
This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE.
to_utc_timestamp(timestamp, tz)
to_utc_timestamp
window(timeColumn, windowDuration[, …])
window
Bucketize rows into one or more time windows given a timestamp specifying column.
session_window(timeColumn, gapDuration)
session_window
Generates session window given a timestamp specifying column.
timestamp_seconds(col)
timestamp_seconds
Converts the number of seconds from the Unix epoch (1970-01-01T00:00:00Z) to a timestamp.
window_time(windowColumn)
window_time
Computes the event time from a window column.
array(*cols)
array
Creates a new array column.
array_contains(col, value)
array_contains
Collection function: returns null if the array is null, true if the array contains the given value, and false otherwise.
arrays_overlap(a1, a2)
arrays_overlap
Collection function: returns true if the arrays contain any common non-null element; if not, returns null if both the arrays are non-empty and any of them contains a null element; returns false otherwise.
array_join(col, delimiter[, null_replacement])
array_join
Concatenates the elements of column using the delimiter.
create_map(*cols)
create_map
Creates a new map column.
slice(x, start, length)
slice
Collection function: returns an array containing all the elements in x from index start (array indices start at 1, or from the end if start is negative) with the specified length.
concat(*cols)
concat
Concatenates multiple input columns together into a single column.
array_position(col, value)
array_position
Collection function: Locates the position of the first occurrence of the given value in the given array.
element_at(col, extraction)
element_at
Collection function: Returns element of array at given index in extraction if col is array.
array_append(col, value)
array_append
Collection function: returns an array of the elements in col1 along with the added element in col2 at the last of the array.
array_sort(col[, comparator])
array_sort
Collection function: sorts the input array in ascending order.
array_insert(arr, pos, value)
array_insert
Collection function: adds an item into a given array at a specified array index.
array_remove(col, element)
array_remove
Collection function: Remove all elements that equal to element from the given array.
array_distinct(col)
array_distinct
Collection function: removes duplicate values from the array.
array_intersect(col1, col2)
array_intersect
Collection function: returns an array of the elements in the intersection of col1 and col2, without duplicates.
array_union(col1, col2)
array_union
Collection function: returns an array of the elements in the union of col1 and col2, without duplicates.
array_except(col1, col2)
array_except
Collection function: returns an array of the elements in col1 but not in col2, without duplicates.
array_compact(col)
array_compact
Collection function: removes null values from the array.
transform(col, f)
transform
Returns an array of elements after applying a transformation to each element in the input array.
exists(col, f)
exists
Returns whether a predicate holds for one or more elements in the array.
forall(col, f)
forall
Returns whether a predicate holds for every element in the array.
filter(col, f)
filter
Returns an array of elements for which a predicate holds in a given array.
aggregate(col, initialValue, merge[, finish])
aggregate
Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state.
zip_with(left, right, f)
zip_with
Merge two given arrays, element-wise, into a single array using a function.
transform_keys(col, f)
transform_keys
Applies a function to every key-value pair in a map and returns a map with the results of those applications as the new keys for the pairs.
transform_values(col, f)
transform_values
Applies a function to every key-value pair in a map and returns a map with the results of those applications as the new values for the pairs.
map_filter(col, f)
map_filter
Returns a map whose key-value pairs satisfy a predicate.
map_from_arrays(col1, col2)
map_from_arrays
Creates a new map from two arrays.
map_zip_with(col1, col2, f)
map_zip_with
Merge two given maps, key-wise into a single map using a function.
explode(col)
explode
Returns a new row for each element in the given array or map.
explode_outer(col)
explode_outer
posexplode(col)
posexplode
Returns a new row for each element with position in the given array or map.
posexplode_outer(col)
posexplode_outer
inline(col)
inline
Explodes an array of structs into a table.
inline_outer(col)
inline_outer
get(col, index)
get
Collection function: Returns element of array at given (0-based) index.
get_json_object(col, path)
get_json_object
Extracts json object from a json string based on json path specified, and returns json string of the extracted json object.
json_tuple(col, *fields)
json_tuple
Creates a new row for a json column according to the given field names.
from_json(col, schema[, options])
from_json
Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType with the specified schema.
MapType
StringType
StructType
ArrayType
schema_of_json(json[, options])
schema_of_json
Parses a JSON string and infers its schema in DDL format.
to_json(col[, options])
to_json
Converts a column containing a StructType, ArrayType or a MapType into a JSON string.
size(col)
size
Collection function: returns the length of the array or map stored in the column.
struct(*cols)
struct
Creates a new struct column.
sort_array(col[, asc])
sort_array
Collection function: sorts the input array in ascending or descending order according to the natural ordering of the array elements.
array_max(col)
array_max
Collection function: returns the maximum value of the array.
array_min(col)
array_min
Collection function: returns the minimum value of the array.
shuffle(col)
shuffle
Collection function: Generates a random permutation of the given array.
reverse(col)
reverse
Collection function: returns a reversed string or an array with reverse order of elements.
flatten(col)
flatten
Collection function: creates a single array from an array of arrays.
sequence(start, stop[, step])
sequence
Generate a sequence of integers from start to stop, incrementing by step.
array_repeat(col, count)
array_repeat
Collection function: creates an array containing a column repeated count times.
map_contains_key(col, value)
map_contains_key
Returns true if the map contains the key.
map_keys(col)
map_keys
Collection function: Returns an unordered array containing the keys of the map.
map_values(col)
map_values
Collection function: Returns an unordered array containing the values of the map.
map_entries(col)
map_entries
Collection function: Returns an unordered array of all entries in the given map.
map_from_entries(col)
map_from_entries
Collection function: Converts an array of entries (key value struct types) to a map of values.
arrays_zip(*cols)
arrays_zip
Collection function: Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
map_concat(*cols)
map_concat
Returns the union of all the given maps.
from_csv(col, schema[, options])
from_csv
Parses a column containing a CSV string to a row with the specified schema.
schema_of_csv(csv[, options])
schema_of_csv
Parses a CSV string and infers its schema in DDL format.
to_csv(col[, options])
to_csv
Converts a column containing a StructType into a CSV string.
years(col)
years
Partition transform function: A transform for timestamps and dates to partition data into years.
months(col)
months
Partition transform function: A transform for timestamps and dates to partition data into months.
days(col)
days
Partition transform function: A transform for timestamps and dates to partition data into days.
hours(col)
hours
Partition transform function: A transform for timestamps to partition data into hours.
bucket(numBuckets, col)
bucket
Partition transform function: A transform for any type that partitions by a hash of the input column.
approxCountDistinct(col[, rsd])
approxCountDistinct
New in version 1.3.0.
approx_count_distinct(col[, rsd])
approx_count_distinct
Aggregate function: returns a new Column for approximate distinct count of column col.
avg(col)
avg
Aggregate function: returns the average of the values in a group.
collect_list(col)
collect_list
Aggregate function: returns a list of objects with duplicates.
collect_set(col)
collect_set
Aggregate function: returns a set of objects with duplicate elements eliminated.
corr(col1, col2)
corr
Returns a new Column for the Pearson Correlation Coefficient for col1 and col2.
col1
col2
count(col)
count
Aggregate function: returns the number of items in a group.
count_distinct(col, *cols)
count_distinct
Returns a new Column for distinct count of col or cols.
cols
countDistinct(col, *cols)
countDistinct
covar_pop(col1, col2)
covar_pop
Returns a new Column for the population covariance of col1 and col2.
covar_samp(col1, col2)
covar_samp
Returns a new Column for the sample covariance of col1 and col2.
first(col[, ignorenulls])
first
Aggregate function: returns the first value in a group.
grouping(col)
grouping
Aggregate function: indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set.
grouping_id(*cols)
grouping_id
Aggregate function: returns the level of grouping, equals to
kurtosis(col)
kurtosis
Aggregate function: returns the kurtosis of the values in a group.
last(col[, ignorenulls])
last
Aggregate function: returns the last value in a group.
max(col)
max
Aggregate function: returns the maximum value of the expression in a group.
max_by(col, ord)
max_by
Returns the value associated with the maximum value of ord.
mean(col)
mean
median(col)
median
Returns the median of the values in a group.
min(col)
min
Aggregate function: returns the minimum value of the expression in a group.
min_by(col, ord)
min_by
Returns the value associated with the minimum value of ord.
mode(col)
mode
Returns the most frequent value in a group.
percentile_approx(col, percentage[, accuracy])
percentile_approx
Returns the approximate percentile of the numeric column col which is the smallest value in the ordered col values (sorted from least to greatest) such that no more than percentage of col values is less than the value or equal to that value.
product(col)
product
Aggregate function: returns the product of the values in a group.
skewness(col)
skewness
Aggregate function: returns the skewness of the values in a group.
stddev(col)
stddev
Aggregate function: alias for stddev_samp.
stddev_pop(col)
stddev_pop
Aggregate function: returns population standard deviation of the expression in a group.
stddev_samp(col)
stddev_samp
Aggregate function: returns the unbiased sample standard deviation of the expression in a group.
sum(col)
sum
Aggregate function: returns the sum of all values in the expression.
sum_distinct(col)
sum_distinct
Aggregate function: returns the sum of distinct values in the expression.
sumDistinct(col)
sumDistinct
var_pop(col)
var_pop
Aggregate function: returns the population variance of the values in a group.
var_samp(col)
var_samp
Aggregate function: returns the unbiased sample variance of the values in a group.
variance(col)
variance
Aggregate function: alias for var_samp
cume_dist()
cume_dist
Window function: returns the cumulative distribution of values within a window partition, i.e.
dense_rank()
dense_rank
Window function: returns the rank of rows within a window partition, without any gaps.
lag(col[, offset, default])
lag
Window function: returns the value that is offset rows before the current row, and default if there is less than offset rows before the current row.
lead(col[, offset, default])
lead
Window function: returns the value that is offset rows after the current row, and default if there is less than offset rows after the current row.
nth_value(col, offset[, ignoreNulls])
nth_value
Window function: returns the value that is the offsetth row of the window frame (counting from 1), and null if the size of window frame is less than offset rows.
ntile(n)
ntile
Window function: returns the ntile group id (from 1 to n inclusive) in an ordered window partition.
percent_rank()
percent_rank
Window function: returns the relative rank (i.e.
rank()
rank
Window function: returns the rank of rows within a window partition.
row_number()
row_number
Window function: returns a sequential number starting at 1 within a window partition.
asc(col)
asc
Returns a sort expression based on the ascending order of the given column name.
asc_nulls_first(col)
asc_nulls_first
Returns a sort expression based on the ascending order of the given column name, and null values return before non-null values.
asc_nulls_last(col)
asc_nulls_last
Returns a sort expression based on the ascending order of the given column name, and null values appear after non-null values.
desc(col)
desc
Returns a sort expression based on the descending order of the given column name.
desc_nulls_first(col)
desc_nulls_first
Returns a sort expression based on the descending order of the given column name, and null values appear before non-null values.
desc_nulls_last(col)
desc_nulls_last
Returns a sort expression based on the descending order of the given column name, and null values appear after non-null values.
ascii(col)
ascii
Computes the numeric value of the first character of the string column.
base64(col)
base64
Computes the BASE64 encoding of a binary column and returns it as a string column.
bit_length(col)
bit_length
Calculates the bit length for the specified string column.
concat_ws(sep, *cols)
concat_ws
Concatenates multiple input string columns together into a single string column, using the given separator.
decode(col, charset)
decode
Computes the first argument into a string from a binary using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’).
encode(col, charset)
encode
Computes the first argument into a binary from a string using the provided character set (one of ‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’, ‘UTF-16LE’, ‘UTF-16’).
format_number(col, d)
format_number
Formats the number X to a format like ‘#,–#,–#.–’, rounded to d decimal places with HALF_EVEN round mode, and returns the result as a string.
format_string(format, *cols)
format_string
Formats the arguments in printf-style and returns the result as a string column.
initcap(col)
initcap
Translate the first letter of each word to upper case in the sentence.
instr(str, substr)
instr
Locate the position of the first occurrence of substr column in the given string.
length(col)
length
Computes the character length of string data or number of bytes of binary data.
lower(col)
lower
Converts a string expression to lower case.
levenshtein(left, right)
levenshtein
Computes the Levenshtein distance of the two given strings.
locate(substr, str[, pos])
locate
Locate the position of the first occurrence of substr in a string column, after position pos.
lpad(col, len, pad)
lpad
Left-pad the string column to width len with pad.
ltrim(col)
ltrim
Trim the spaces from left end for the specified string value.
octet_length(col)
octet_length
Calculates the byte length for the specified string column.
regexp_extract(str, pattern, idx)
regexp_extract
Extract a specific group matched by a Java regex, from the specified string column.
regexp_replace(string, pattern, replacement)
regexp_replace
Replace all substrings of the specified string value that match regexp with replacement.
unbase64(col)
unbase64
Decodes a BASE64 encoded string column and returns it as a binary column.
rpad(col, len, pad)
rpad
Right-pad the string column to width len with pad.
repeat(col, n)
repeat
Repeats a string column n times, and returns it as a new string column.
rtrim(col)
rtrim
Trim the spaces from right end for the specified string value.
soundex(col)
soundex
Returns the SoundEx encoding for a string
split(str, pattern[, limit])
split
Splits str around matches of the given pattern.
substring(str, pos, len)
substring
Substring starts at pos and is of length len when str is String type or returns the slice of byte array that starts at pos in byte and is of length len when str is Binary type.
substring_index(str, delim, count)
substring_index
Returns the substring from string str before count occurrences of the delimiter delim.
overlay(src, replace, pos[, len])
overlay
Overlay the specified portion of src with replace, starting from byte position pos of src and proceeding for len bytes.
sentences(string[, language, country])
sentences
Splits a string into arrays of sentences, where each sentence is an array of words.
translate(srcCol, matching, replace)
translate
A function translate any character in the srcCol by a character in matching.
trim(col)
trim
Trim the spaces from both ends for the specified string column.
upper(col)
upper
Converts a string expression to upper case.
call_udf(udfName, *cols)
call_udf
Call an user-defined function.
pandas_udf([f, returnType, functionType])
pandas_udf
Creates a pandas user defined function (a.k.a.
udf([f, returnType])
udf
Creates a user defined function (UDF).
unwrap_udt(col)
unwrap_udt
Unwrap UDT data type column into its underlying type.
md5(col)
md5
Calculates the MD5 digest and returns the value as a 32 character hex string.
sha1(col)
sha1
Returns the hex string result of SHA-1.
sha2(col, numBits)
sha2
Returns the hex string result of SHA-2 family of hash functions (SHA-224, SHA-256, SHA-384, and SHA-512).
crc32(col)
crc32
Calculates the cyclic redundancy check value (CRC32) of a binary column and returns the value as a bigint.
hash(*cols)
hash
Calculates the hash code of given columns, and returns the result as an int column.
xxhash64(*cols)
xxhash64
Calculates the hash code of given columns using the 64-bit variant of the xxHash algorithm, and returns the result as a long column.
assert_true(col[, errMsg])
assert_true
Returns null if the input column is true; throws an exception with the provided error message otherwise.
raise_error(errMsg)
raise_error
Throws an exception with the provided error message.