How to Create a New Column From Another Column Based on Multiple Conditions in PySpark
How can we create a column based on another column in PySpark with multiple conditions?
For instance, suppose we have a PySpark DataFrame df with a time column containing an integer that represents the hour of the day, from 0 to 23.
We want to create a new column day_or_night that follows these criteria:
- If time is between [0, 8], then day_or_night is Night
- If time is between [9, 18], then day_or_night is Day
- If time is between [19, 23], then day_or_night is Night
This can be simplified down to: everything between 9am and 7pm should be considered Day, and anything else should be considered Night.
Create new columns using withColumn()
We can easily create new columns based on other columns using the DataFrame’s withColumn() method. For example, if the column num is of type double, we can create a new column num_div_10 like so:
df = df.withColumn('num_div_10', df['num'] / 10)
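To try these snippets end to end, we can build a small throwaway DataFrame first. This is a minimal sketch; the SparkSession setup and the sample values are just assumptions for illustration.
from pyspark.sql import SparkSession

# Hypothetical setup so the examples in this post can be run as-is
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(2.0,), (35.0,), (170.0,)], ["num"])

df = df.withColumn('num_div_10', df['num'] / 10)
df.show()
# +-----+----------+
# |  num|num_div_10|
# +-----+----------+
# |  2.0|       0.2|
# | 35.0|       3.5|
# |170.0|      17.0|
# +-----+----------+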
But now, we want to set values for our new column based on certain conditions.
Create conditions using when() and otherwise()
We can create a proper if-then-else structure using when() and otherwise() in PySpark.
The when() method functions as our if statement: the first argument is our condition, and the second argument is the value the column takes if that condition is true.
Any when() chained after the first when() is essentially an else if statement.
Naturally, otherwise() is our else statement.
from pyspark.sql.functions import when, col

conditions = (
    when(col("time") < 9, "Night")
    .when(col("time") > 18, "Night")
    .otherwise("Day")
)
df = df.withColumn("day_or_night", conditions)
As we can see, when() allows us to chain multiple if statements together.
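For example, a chain that mirrors the original three criteria one-for-one might look like this (a sketch assuming a df with the time column described above); branches are evaluated top to bottom, and the first match wins:
from pyspark.sql.functions import when, col

conditions = (
    when(col("time") <= 8, "Night")   # if time is in [0, 8]
    .when(col("time") <= 18, "Day")   # else if time is in [9, 18]
    .otherwise("Night")               # else: time is in [19, 23]
)
df = df.withColumn("day_or_night", conditions)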
We can also combine conditions with the boolean operators & (AND) and | (OR) to simplify the logic. Note that each comparison must be wrapped in parentheses, since these operators bind more tightly than comparisons in Python.
from pyspark.sql.functions import when, col

conditions = (
    when((col("time") < 9) | (col("time") > 18), "Night")
    .otherwise("Day")
)
df = df.withColumn("day_or_night", conditions)
We can further simplify using between(); both the lower and upper bounds of between() are inclusive.
from pyspark.sql.functions import when, col

conditions = (
    when(col("time").between(9, 18), "Day")
    .otherwise("Night")
)
df = df.withColumn("day_or_night", conditions)
Read more about between() in the PySpark documentation.