
• Window_2 is simply a window over “Policyholder ID”.

Although both Window_1 and Window_2 provide a view over the “Policyholder ID” field, Window_1
further sorts the claims payments for a particular policyholder by “Paid From Date” in ascending
order. This is important for deriving the Payment Gap using the “lag” Window Function, which is
discussed in Step 3.

## Customise Windows to apply the Window Functions to

Window_1 = Window.partitionBy("Policyholder ID").orderBy("Paid From Date")

Window_2 = Window.partitionBy("Policyholder ID").orderBy("Policyholder ID")
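To make the two window definitions more concrete, here is a minimal plain-Python sketch (no Spark required) of what partitioning by policyholder and ordering by date does to the rows. It only mimics the idea behind partitionBy and orderBy; the sample rows and the helper name partition_and_sort are hypothetical and are not taken from Table 1.

```python
from collections import defaultdict
from datetime import date

# Hypothetical claims payments: (policyholder_id, paid_from_date).
rows = [
    ("B", date(2021, 3, 15)),
    ("A", date(2021, 1, 1)),
    ("B", date(2021, 1, 1)),
    ("A", date(2021, 2, 1)),
]


def partition_and_sort(rows):
    """Group rows by policyholder, then sort each group by paid-from date."""
    partitions = defaultdict(list)
    for policyholder_id, paid_from in rows:
        partitions[policyholder_id].append(paid_from)  # the partitionBy idea
    # Sorting inside each partition plays the role of orderBy in Window_1.
    return {pid: sorted(dates) for pid, dates in partitions.items()}


windows = partition_and_sort(rows)
for policyholder_id, dates in sorted(windows.items()):
    print(policyholder_id, [d.isoformat() for d in dates])
```

Each policyholder ends up with its own chronologically ordered list of payments, which is the shape a function such as “lag” relies on.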

Step 3 — Window Functions for Durations on Claim


“withColumn” is a PySpark DataFrame method that returns a new dataframe with the named column added (or replaced, if it already exists).

The following columns are created, in this order, to derive the Duration on Claim for a particular
policyholder:

• Date of First Payment: the minimum “Paid From Date” for a particular policyholder,
over Window_1 (or indifferently Window_2).
• Date of Last Payment: the maximum “Paid To Date” for a particular policyholder, over
Window_1 (or indifferently Window_2).
• Duration on Claim per Payment: the Duration on Claim per record, calculated at each row as Date of
Last Payment minus Date of First Payment, plus one day.
• Duration on Claim per Policyholder: the sum of the “Duration on Claim per Payment”
column above for a particular policyholder over Window_2, which arrives at a row-agnostic
sum (i.e. the total Duration on Claim).

df_1_spark = df_1.withColumn("Date of First Payment", F.min("Paid From Date").over(Window_1)) \
.withColumn("Date of Last Payment", F.max("Paid To Date").over(Window_1)) \
.withColumn("Duration on Claim - per Payment", F.datediff(F.col("Date of Last Payment"), F.col("Date of First Payment")) + 1) \
.withColumn("Duration on Claim - per Policyholder", F.sum("Duration on Claim - per Payment").over(Window_2)) \
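The “+ 1” in the per-payment duration presumably makes the day count inclusive of both the first and the last day. A minimal plain-Python sketch of the same arithmetic, using hypothetical dates and a hypothetical helper named inclusive_days:

```python
from datetime import date


def inclusive_days(start: date, end: date) -> int:
    """Days from start to end, counting both the first and the last day."""
    # Same arithmetic as F.datediff(end, start) + 1 in the PySpark code.
    return (end - start).days + 1


# Hypothetical payment covering the whole of January 2021.
january = inclusive_days(date(2021, 1, 1), date(2021, 1, 31))
print(january)  # 31
```

Without the extra day, a payment covering 1 January to 31 January would be counted as 30 days rather than 31.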

As mentioned previously, there may be Payment Gaps between a policyholder's claims payments. In
other words, over the pre-defined windows, the “Paid From Date” of a particular payment may not
immediately follow the “Paid To Date” of the previous payment. Table 1 shows that this is the
case for policyholder B.

For the purpose of actuarial analyses, the Payment Gap for a policyholder needs to be identified and
subtracted from the Duration on Claim, which was initially calculated as the difference between the
dates of first and last payments.

The Payment Gap can be derived using the Python code below:

.withColumn("Paid To Date Last Payment", F.lag("Paid To Date", 1).over(Window_1)) \
.withColumn("Paid To Date Last Payment adj", F.when(F.col("Paid To Date Last Payment").isNull(), F.col("Paid From Date")) \
.otherwise(F.date_add(F.col("Paid To Date Last Payment"), 1))) \
.withColumn("Payment Gap", F.datediff(F.col("Paid From Date"), F.col("Paid To Date Last Payment adj"))) \

It may be easier to explain the above steps using visuals. As shown in the table below, the Window
Function “F.lag” is called to return the “Paid To Date Last Payment” column, which, within a policyholder
window, is the “Paid To Date” of the previous row, as indicated by the blue arrows. One day is added to
this date (for a policyholder's first payment, where there is no previous row, the “Paid From Date” itself
is used instead), and the result is compared against the “Paid From Date” of the current row to arrive at
the Payment Gap. As expected, we have a Payment Gap of 14 days for policyholder B.

For the purpose of calculating the Payment Gap, Window_1 is used, as the claims payments need to be in
chronological order for the “F.lag” function to return the desired output.

Table 3: Derive Payment Gap. Table by author
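The lag-based steps can also be traced in plain Python. The sketch below is only an illustration of the logic, not the PySpark API: the payment dates are hypothetical (chosen so that one gap comes out at 14 days) and payment_gaps is a hypothetical helper name.

```python
from datetime import date, timedelta

# Hypothetical payments for one policyholder, already sorted by paid-from date
# (the role Window_1 plays): (paid_from, paid_to).
payments = [
    (date(2021, 1, 1), date(2021, 1, 31)),
    (date(2021, 2, 15), date(2021, 3, 14)),  # starts 14 days later than 1 Feb
    (date(2021, 3, 15), date(2021, 4, 14)),
]


def payment_gaps(payments):
    """Return the gap in days before each payment, mirroring the lag steps."""
    gaps = []
    previous_paid_to = None  # lag has no previous row for the first payment
    for paid_from, paid_to in payments:
        if previous_paid_to is None:
            expected_start = paid_from  # first row: gap is zero by construction
        else:
            expected_start = previous_paid_to + timedelta(days=1)
        gaps.append((paid_from - expected_start).days)
        previous_paid_to = paid_to
    return gaps


gaps = payment_gaps(payments)
print(gaps)  # [0, 14, 0]
```

The loop only works because the payments are visited in chronological order, which is the same reason the PySpark version needs the ordered window.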

Adding the finishing touch below gives the final Duration on Claim, which is now one-to-one against the
Policyholder ID.

.withColumn("Payment Gap - Max", F.max("Payment Gap").over(Window_2)) \
.withColumn("Duration on Claim - Final", F.col("Duration on Claim - per Policyholder") - F.col("Payment Gap - Max")) \
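A window aggregate such as F.max(...).over(Window_2) returns one value per row rather than collapsing the partition, so the maximum gap is repeated on every row of a policyholder before the subtraction. A small plain-Python sketch of that behaviour, with hypothetical numbers throughout:

```python
# Hypothetical per-row values for a single policyholder.
payment_gap = [0, 14, 0]          # "Payment Gap" for each payment
duration_per_policyholder = 104   # "Duration on Claim - per Policyholder"

# The partition maximum is broadcast back to every row of the partition.
payment_gap_max = [max(payment_gap)] * len(payment_gap)

# Row-wise subtraction, as in the "Duration on Claim - Final" column.
duration_final = [duration_per_policyholder - gap for gap in payment_gap_max]
print(duration_final)  # [90, 90, 90]
```

Every row of the policyholder carries the same final value, which is why the result can be read one-to-one against the Policyholder ID.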

The table below shows all the columns created with the Python code above.
Table 4: All columns created in PySpark. Table by author

Payout Ratio
The Payout Ratio is defined as the actual Amount Paid for a policyholder, divided by the Monthly Benefit
for the duration on claim. This measures how much of the Monthly Benefit is paid out for a particular
policyholder.

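Leveraging the Duration on Claim derived previously, the Payout Ratio can be derived using the Python
code below. The duration in days is converted to months by dividing by 30.5, an approximate average
month length.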

.withColumn("Amount Paid Total", F.sum("Amount Paid").over(Window_2)) \
.withColumn("Monthly Benefit Total", F.col("Monthly Benefit") * F.col("Duration on Claim - Final") / 30.5) \
.withColumn("Payout Ratio", F.round(F.col("Amount Paid Total") / F.col("Monthly Benefit Total"), 1))

The outputs are as expected, as shown in the table below. To show the outputs in a PySpark session,
simply add .show() at the end of the code.
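The Payout Ratio arithmetic can be checked by hand with a few lines of plain Python. This is a sketch under assumed inputs: the amounts and duration are hypothetical, and payout_ratio is a hypothetical helper name rather than part of the PySpark code.

```python
DAYS_PER_MONTH = 30.5  # the divisor used in the "Monthly Benefit Total" column


def payout_ratio(amount_paid_total, monthly_benefit, duration_days):
    """Amount paid divided by the benefit payable over the duration on claim."""
    monthly_benefit_total = monthly_benefit * duration_days / DAYS_PER_MONTH
    # Python's round() and Spark's F.round can differ on exact ties
    # (half-even versus half-up); the value below is not a tie.
    return round(amount_paid_total / monthly_benefit_total, 1)


# Hypothetical policyholder: 61 days on claim, 3,050 monthly benefit,
# 4,880 paid in total.
ratio = payout_ratio(amount_paid_total=4880, monthly_benefit=3050, duration_days=61)
print(ratio)  # 0.8
```

Here 61 days correspond to two months of benefit (6,100), of which 4,880 was paid, giving a ratio of 0.8.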
