Skip to content

Feature Engineering with SQL

Introduction

In the previous sections, we defined the important, predictive features in our dataset. In this section, we will implement the feature engineering process using the insights we gained from our analysis. Again, we will be using DuckDB SQL to perform our feature engineering.

Our feature engineering methods will be a mix meaning that we sometimes drop some columns, or sometimes create new features by combining those columns and drop the original columns. We will also be creating some new features by applying some mathematical operations on the existing features.

Feature engineering will be applied on each of the table and then we will combine all the necessary (together with the feature engineered columns) columns from each table to create our final dataset for modeling.

Connect to Data

Let's firstly import the necessary libraries and connect to our data. We will be using DuckDB SQL to perform our feature engineering.

import pandas as pd
import numpy as np
import polars as pl 
import plotly.express as px
import duckdb
import os 


# Set pandas options to display all columns and rows
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None) 

con = duckdb.connect("/home/vasif/Desktop/Home-Credit-Risk-Analysis/home_credit.duckdb") # connect to the duckdb database

Applications Table Feature Engineering

Firstly, create a new table called home_credit_train which will be our final dataset for modeling. Firstly, read the applications.parquet file and import the data into the home_credit_train table. We will be performing feature engineering on this table and then we will be combining it with the other tables to create our final dataset for modeling.

con.execute(r"""CREATE OR REPLACE TABLE home_credit_train AS SELECT * 
FROM '/home/vasif/Desktop/Home-Credit-Risk-Analysis/parquet_files/applications.parquet'""")

Feature Engineering on House Parameters

Instead of 50 columns all representing the housing parameters with different scores - average, mean and median, the better option is to choose the columns with average scores and applying feature engineering to them which will result just two columns to represent the house quality of applicants:

  • housing_score_avg - the average of all non-null values for a client. A high score means the client lives in a high-end building
  • housing_data_count - The number of these columns that are not null. This is a proxy for "data completeness." Wealthier clients often have more documented housing records.
Show Code
con.execute("""
create or replace table home_credit_train as 
select 
    * exclude (
        apartments_avg, basementarea_avg, years_beginexpluatation_avg, 
        years_build_avg, commonarea_avg, elevators_avg, entrances_avg, 
        floorsmax_avg, floorsmin_avg, landarea_avg, livingapartments_avg, 
        livingarea_avg, nonlivingapartments_avg, nonlivingarea_avg
    ),

    -- calculate sum of all 14 columns (treating null as 0) divided by the count of non-nulls 
    (
        coalesce(apartments_avg, 0) + coalesce(basementarea_avg, 0) + 
        coalesce(years_beginexpluatation_avg, 0) + coalesce(years_build_avg, 0) + 
        coalesce(commonarea_avg, 0) + coalesce(elevators_avg, 0) + 
        coalesce(entrances_avg, 0) + coalesce(floorsmax_avg, 0) + 
        coalesce(floorsmin_avg, 0) + coalesce(landarea_avg, 0) + 
        coalesce(livingapartments_avg, 0) + coalesce(livingarea_avg, 0) + 
        coalesce(nonlivingapartments_avg, 0) + coalesce(nonlivingarea_avg, 0)
    ) / nullif(
        (case when apartments_avg is not null then 1 else 0 end) + 
        (case when basementarea_avg is not null then 1 else 0 end) + 
        (case when years_beginexpluatation_avg is not null then 1 else 0 end) + 
        (case when years_build_avg is not null then 1 else 0 end) + 
        (case when commonarea_avg is not null then 1 else 0 end) + 
        (case when elevators_avg is not null then 1 else 0 end) + 
        (case when entrances_avg is not null then 1 else 0 end) + 
        (case when floorsmax_avg is not null then 1 else 0 end) + 
        (case when floorsmin_avg is not null then 1 else 0 end) + 
        (case when landarea_avg is not null then 1 else 0 end) + 
        (case when livingapartments_avg is not null then 1 else 0 end) + 
        (case when livingarea_avg is not null then 1 else 0 end) + 
        (case when nonlivingapartments_avg is not null then 1 else 0 end) + 
        (case when nonlivingarea_avg is not null then 1 else 0 end), 
    0) as housing_score_avg,

    -- keep the count of how many columns provided data 
    (
        (case when apartments_avg is not null then 1 else 0 end) + 
        (case when basementarea_avg is not null then 1 else 0 end) + 
        (case when years_beginexpluatation_avg is not null then 1 else 0 end) + 
        (case when years_build_avg is not null then 1 else 0 end) + 
        (case when commonarea_avg is not null then 1 else 0 end) + 
        (case when elevators_avg is not null then 1 else 0 end) + 
        (case when entrances_avg is not null then 1 else 0 end) + 
        (case when floorsmax_avg is not null then 1 else 0 end) + 
        (case when floorsmin_avg is not null then 1 else 0 end) + 
        (case when landarea_avg is not null then 1 else 0 end) + 
        (case when livingapartments_avg is not null then 1 else 0 end) + 
        (case when livingarea_avg is not null then 1 else 0 end) + 
        (case when nonlivingapartments_avg is not null then 1 else 0 end) + 
        (case when nonlivingarea_avg is not null then 1 else 0 end)
    ) as housing_data_count
from home_credit_train
""")


In this code, we did the following:

  • We excluded the original 14 columns that represent the housing parameters with different scores - average, mean and median - using the exclude keyword in the select statement.
  • We created a new column housing_score_avg which is the average of all non-null values for a client. We used the coalesce function to treat null values as 0 and then divided the sum by the count of non-null values to get the average score.
  • We created another column housing_data_count which is the count of how many of these columns provided data (i.e., are not null). This is a proxy for "data completeness," as wealthier clients often have more documented housing records.
  • Finally, we updated the table home_credit_train with the new columns and the original columns that we kept.

In each step of the aggregation or modification in the tables it is best practice to visualize the top 5 rows of the resulted table to investigate the changes made and understand the current structure.

con.execute("select * from home_credit_train limit 5").fetchdf()
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
100002 1 Cash loans M N Y 0 202500 406598 24700.5 351000 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.018801 -9461 -637 -3648 -2120 nan 1 1 0 1 1 0 Laborers 1 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.083037 0.262949 0.139376 0.0247 0.0369 0.9722 0.6192 0.0143 0 0.069 0.0833 0.125 0.0369 0.0202 0.019 0 0 0.0252 0.0383 0.9722 0.6341 0.0144 0 0.069 0.0833 0.125 0.0377 0.022 0.0198 0 0 0.025 0.0369 0.9722 0.6243 0.0144 0 0.069 0.0833 0.125 0.0375 0.0205 0.0193 0 0 reg oper account block of flats 0.0149 Stone, brick False 2 2 2 2 -1134 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
100003 0 Cash loans F N N 0 270000 1.2935e+06 35698.5 1.1295e+06 Family State servant Higher education Married House / apartment 0.003541 -16765 -1188 -1186 -291 nan 1 1 0 1 1 0 Core staff 2 1 1 MONDAY 11 0 0 0 0 0 0 School 0.311267 0.622246 nan 0.0959 0.0529 0.9851 0.796 0.0605 0.08 0.0345 0.2917 0.3333 0.013 0.0773 0.0549 0.0039 0.0098 0.0924 0.0538 0.9851 0.804 0.0497 0.0806 0.0345 0.2917 0.3333 0.0128 0.079 0.0554 0 0 0.0968 0.0529 0.9851 0.7987 0.0608 0.08 0.0345 0.2917 0.3333 0.0132 0.0787 0.0558 0.0039 0.01 reg oper account block of flats 0.0714 Block False 1 0 1 0 -828 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
100004 0 Revolving loans M Y Y 0 67500 135000 6750 135000 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.010032 -19046 -225 -4260 -2531 26 1 1 1 1 1 0 Laborers 1 2 2 MONDAY 9 0 0 0 0 0 0 Government nan 0.555912 0.729567 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 -815 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
100006 0 Cash loans F N Y 0 135000 312682 29686.5 297000 Unaccompanied Working Secondary / secondary special Civil marriage House / apartment 0.008019 -19005 -3039 -9833 -2437 nan 1 1 0 1 0 0 Laborers 2 2 2 WEDNESDAY 17 0 0 0 0 0 0 Business Entity Type 3 nan 0.650442 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 2 0 2 0 -617 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 nan nan nan nan nan nan
100007 0 Cash loans M N Y 0 121500 513000 21865.5 513000 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.028663 -19932 -3038 -4311 -3458 nan 1 1 0 1 0 0 Core staff 1 2 2 THURSDAY 11 0 0 0 0 1 1 Religion nan 0.322738 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan 0 0 0 0 -1106 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Note

The amount of columns is quite large in the home_credit_train table. Therefore, in each step of the feature engineering process, we will not give the full table with all the columns here. However, you can run the given code above to see the full table in each of the step.

Besides the average score for the house parameters, I also created mode calculation for these parameters. The mode is the most frequently occurring value in a column. This can be useful to capture the most common housing condition for each client, which might be more representative than the average in some cases.

The metrics with the MODE preffix also will be used to create features to represent them:

  • housing_score_mode - represents the mode of housing parameters.
  • housing_data_count_mode - represents the non-null housing parameters. How many applicants provided full housing details.
Show Code
con.execute("""
create or replace table home_credit_train as 
select 
    * exclude (
        apartments_mode, basementarea_mode, years_beginexpluatation_mode, 
        years_build_mode, commonarea_mode, elevators_mode, entrances_mode, 
        floorsmax_mode, floorsmin_mode, landarea_mode, livingapartments_mode, 
        livingarea_mode, nonlivingapartments_mode, nonlivingarea_mode
    ),
    /* calculate average for mode columns */
    (
        coalesce(apartments_mode, 0) + coalesce(basementarea_mode, 0) + 
        coalesce(years_beginexpluatation_mode, 0) + coalesce(years_build_mode, 0) + 
        coalesce(commonarea_mode, 0) + coalesce(elevators_mode, 0) + 
        coalesce(entrances_mode, 0) + coalesce(floorsmax_mode, 0) + 
        coalesce(floorsmin_mode, 0) + coalesce(landarea_mode, 0) + 
        coalesce(livingapartments_mode, 0) + coalesce(livingarea_mode, 0) + 
        coalesce(nonlivingapartments_mode, 0) + coalesce(nonlivingarea_mode, 0)
    ) / nullif(
        (case when apartments_mode is not null then 1 else 0 end) + 
        (case when basementarea_mode is not null then 1 else 0 end) + 
        (case when years_beginexpluatation_mode is not null then 1 else 0 end) + 
        (case when years_build_mode is not null then 1 else 0 end) + 
        (case when commonarea_mode is not null then 1 else 0 end) + 
        (case when elevators_mode is not null then 1 else 0 end) + 
        (case when entrances_mode is not null then 1 else 0 end) + 
        (case when floorsmax_mode is not null then 1 else 0 end) + 
        (case when floorsmin_mode is not null then 1 else 0 end) + 
        (case when landarea_mode is not null then 1 else 0 end) + 
        (case when livingapartments_mode is not null then 1 else 0 end) + 
        (case when livingarea_mode is not null then 1 else 0 end) + 
        (case when nonlivingapartments_mode is not null then 1 else 0 end) + 
        (case when nonlivingarea_mode is not null then 1 else 0 end), 
    0) as housing_score_mode,

    /* count non-null mode columns */
    (
        (case when apartments_mode is not null then 1 else 0 end) + 
        (case when basementarea_mode is not null then 1 else 0 end) + 
        (case when years_beginexpluatation_mode is not null then 1 else 0 end) + 
        (case when years_build_mode is not null then 1 else 0 end) + 
        (case when commonarea_mode is not null then 1 else 0 end) + 
        (case when elevators_mode is not null then 1 else 0 end) + 
        (case when entrances_mode is not null then 1 else 0 end) + 
        (case when floorsmax_mode is not null then 1 else 0 end) + 
        (case when floorsmin_mode is not null then 1 else 0 end) + 
        (case when landarea_mode is not null then 1 else 0 end) + 
        (case when livingapartments_mode is not null then 1 else 0 end) + 
        (case when livingarea_mode is not null then 1 else 0 end) + 
        (case when nonlivingapartments_mode is not null then 1 else 0 end) + 
        (case when nonlivingarea_mode is not null then 1 else 0 end)
    ) as housing_data_count_mode
from home_credit_train
""")

Same procedure is applied to the median columns as well. The median is the middle value in a sorted list of numbers. This can be useful to capture the central tendency of housing conditions for each client, which might be more robust to outliers than the average.

Show Code
con.execute("""
create or replace table home_credit_train as 
select 
    * exclude (
        apartments_medi, basementarea_medi, years_beginexpluatation_medi, 
        years_build_medi, commonarea_medi, elevators_medi, entrances_medi, 
        floorsmax_medi, floorsmin_medi, landarea_medi, livingapartments_medi, 
        livingarea_medi, nonlivingapartments_medi, nonlivingarea_medi
    ),
    /* calculate average for medi columns */
    (
        coalesce(apartments_medi, 0) + coalesce(basementarea_medi, 0) + 
        coalesce(years_beginexpluatation_medi, 0) + coalesce(years_build_medi, 0) + 
        coalesce(commonarea_medi, 0) + coalesce(elevators_medi, 0) + 
        coalesce(entrances_medi, 0) + coalesce(floorsmax_medi, 0) + 
        coalesce(floorsmin_medi, 0) + coalesce(landarea_medi, 0) + 
        coalesce(livingapartments_medi, 0) + coalesce(livingarea_medi, 0) + 
        coalesce(nonlivingapartments_medi, 0) + coalesce(nonlivingarea_medi, 0)
    ) / nullif(
        (case when apartments_medi is not null then 1 else 0 end) + 
        (case when basementarea_medi is not null then 1 else 0 end) + 
        (case when years_beginexpluatation_medi is not null then 1 else 0 end) + 
        (case when years_build_medi is not null then 1 else 0 end) + 
        (case when commonarea_medi is not null then 1 else 0 end) + 
        (case when elevators_medi is not null then 1 else 0 end) + 
        (case when entrances_medi is not null then 1 else 0 end) + 
        (case when floorsmax_medi is not null then 1 else 0 end) + 
        (case when floorsmin_medi is not null then 1 else 0 end) + 
        (case when landarea_medi is not null then 1 else 0 end) + 
        (case when livingapartments_medi is not null then 1 else 0 end) + 
        (case when livingarea_medi is not null then 1 else 0 end) + 
        (case when nonlivingapartments_medi is not null then 1 else 0 end) + 
        (case when nonlivingarea_medi is not null then 1 else 0 end), 
    0) as housing_score_medi,

    /* count non-null medi columns */
    (
        (case when apartments_medi is not null then 1 else 0 end) + 
        (case when basementarea_medi is not null then 1 else 0 end) + 
        (case when years_beginexpluatation_medi is not null then 1 else 0 end) + 
        (case when years_build_medi is not null then 1 else 0 end) + 
        (case when commonarea_medi is not null then 1 else 0 end) + 
        (case when elevators_medi is not null then 1 else 0 end) + 
        (case when entrances_medi is not null then 1 else 0 end) + 
        (case when floorsmax_medi is not null then 1 else 0 end) + 
        (case when floorsmin_medi is not null then 1 else 0 end) + 
        (case when landarea_medi is not null then 1 else 0 end) + 
        (case when livingapartments_medi is not null then 1 else 0 end) + 
        (case when livingarea_medi is not null then 1 else 0 end) + 
        (case when nonlivingapartments_medi is not null then 1 else 0 end) + 
        (case when nonlivingarea_medi is not null then 1 else 0 end)
    ) as housing_data_count_medi
from home_credit_train
""")

Feature Engineering on Documents

In addition to the housing parameters, there are also lots of columns representing the documents provided by the applicants. These columns are named as FLAG_DOCUMENT_2, FLAG_DOCUMENT_3, ..., FLAG_DOCUMENT_21. Each of these columns indicates whether a specific document was provided by the applicant (1) or not (0). We can apply engineering methods to replace these columns with only two columns:

  • total_document_count: representing total count of documents provided by the applicant.
  • flag_document_3_kept: whether or not document 3 is provided or not.

These two columns is sufficient for predictive performance. The more number of documents provided by the applicant, the more likely they are to be wealthier and less risky. However, document 3 is a special document that is often provided by applicants and has a strong correlation with the target variable. Therefore, we will keep it as a separate feature to capture its specific impact on the target variable.

Show Code
con.execute("""
create or replace table home_credit_train as 
select 
    * exclude (
        flag_document_2, flag_document_3, flag_document_4, flag_document_5, 
        flag_document_6, flag_document_7, flag_document_8, flag_document_9, 
        flag_document_10, flag_document_11, flag_document_12, flag_document_13, 
        flag_document_14, flag_document_15, flag_document_16, flag_document_17, 
        flag_document_18, flag_document_19, flag_document_20, flag_document_21
    ),
    /* feature 1: the specific most important document */
    flag_document_3 as flag_document_3_kept,

    /* feature 2: total count of all documents submitted */
    (
        coalesce(flag_document_2, 0) + coalesce(flag_document_3, 0) + 
        coalesce(flag_document_4, 0) + coalesce(flag_document_5, 0) + 
        coalesce(flag_document_6, 0) + coalesce(flag_document_7, 0) + 
        coalesce(flag_document_8, 0) + coalesce(flag_document_9, 0) + 
        coalesce(flag_document_10, 0) + coalesce(flag_document_11, 0) + 
        coalesce(flag_document_12, 0) + coalesce(flag_document_13, 0) + 
        coalesce(flag_document_14, 0) + coalesce(flag_document_15, 0) + 
        coalesce(flag_document_16, 0) + coalesce(flag_document_17, 0) + 
        coalesce(flag_document_18, 0) + coalesce(flag_document_19, 0) + 
        coalesce(flag_document_20, 0) + coalesce(flag_document_21, 0)
    ) as total_document_count
from home_credit_train
""")

After doing just these two feature engineering steps on the housing parameters and the document flags, we have reduced the number of columns in the home_credit_train from 120 to 68.

train = con.execute("select * from home_credit_train").fetchdf()
train.shape
(307511, 68)

Previous Applications Table Feature Engineering

Previous Applications table contains the data of applicants who laid more applications to the Home Credit in the past. Some of the current customers are new and they dont have any information on this data, however by running the code below it can be seen that some applicants has more than even 70 previous applications and moreover, some of them reaches up to 40 approved credit applications.

Show Code
con.execute("""
    select sk_id_curr, 
    count(distinct sk_id_prev),
    sum(
        case when NAME_CONTRACT_STATUS = 'Approved' then 1 else 0 end
    ) as total_approved,
    from previous_applications
    group by 1
    having count(distinct sk_id_prev) > 2
    order by 2 desc, 3 desc
""").fetchdf()
SK_ID_CURR count(DISTINCT sk_id_prev) total_approved
187868 77 10
265681 73 4
173680 72 10
242412 68 15
206783 67 11
156367 66 7
382179 64 4
389950 64 2
198355 63 17
345161 62 8
446486 62 7
280586 61 15
238250 61 6
227585 60 19
206862 60 15
133023 60 2
401563 59 8
242431 59 7
235163 58 7
110899 58 6
379932 55 10
205430 55 6
344403 54 21


Therefore, blind aggregation on this table will lead to false and misguided results which will deteriorate the classification model. The feature engineering ideas for this table is provided below:

Feature Engineering Ideas:

  • Refusal rate: it is a ratio of number of refused previous applications to the number of total applications.
  • Approval rate: it is a ratio of number of approved applications to the number of total applications.
  • Days since last approval: it is the days between last approval date and current application date.
  • Days since last refusal: it represents the time between last refusal date and current application date.
  • Degree of trust: it is a numeric feature representing the ratio of amt_credit to amt_application which shows how much credit did the applicant apply and how much the Home Credit approved. If it is more than 1, then it is sign of high trust meaning that the Home Credit trusted the applicant and gave more money than his request. Otherwise, it means the Home Credit issued less credit to the applicant than his request.
  • Average Goods Price for Refusal: - this shows the average price for the goods for only refused applications.
  • Average Goods Price for Approval: - this shows the average price for the goods for only approved applications.
  • AVG Rate Interest Refusal: - this shows the average rate interest for the previous refused applications.
  • AVG Rate Interest Approved: - this shows the average rate interest for the previous approved applications.
  • Last Rejection Reason: - this feature uses CODE_REJECT_REASON column to show what was the last rejection reason.
  • Cash Loan Rate: - showing the ratio of cash loans to total loans.
  • Walk In Rate: - Percentage of applications that were walk-ins.
  • Avg_Term_Approved: - Average CNT_PAYMENT for approved loans.
  • Avg_Down_Payment_Rate: - Average AMT_DOWN_PAYMENT / AMT_CREDIT. Higher down payments indicate liquid cash on hand (wealth).
  • Max_Prev_Annuity: - The highest monthly payment they have ever successfully applied for using AMT_ANNUITY.
  • Days_Since_Last_Termination: - Current Date - DAYS_TERMINATION. It shows the days passed since the termination of last approval.
  • High_Yield_Group_Rate: - Ratio of applications with NAME_YIELD_GROUP = 'high'. A client consistently in the "high" group is flagged as risky by the bank internally.
  • Most_Freq_Seller_Industry: - Mode of the NAME_SELLER_INDUSTRY
  • Insurance_Uptake_Rate: - People who buy insurance are often risk-averse, meaning they are more careful with money and less likely to default. It is found by NFLAG_INSURED_ON_APPROVAL column.

Now let's create these feature engineering variables and add these to the main train table.

Show Code
con.execute(r"""
create or replace table previous_application_agg as 
select 
    sk_id_curr,

    /* --- 0. VOLUME (The Missing Piece) --- */
    count(*) as application_count_prev,

    /* --- 1. RATES & COUNTS --- */
    sum(case when name_contract_status = 'Refused' then 1 else 0 end) * 1.0 / count(*) as refusal_rate_prev,
    sum(case when name_contract_status = 'Approved' then 1 else 0 end) * 1.0 / count(*) as approval_rate_prev,
    sum(case when name_contract_type = 'Cash loans' then 1 else 0 end) * 1.0 / count(*) as cash_loan_rate_prev,
    sum(case when name_product_type = 'walk-in' then 1 else 0 end) * 1.0 / count(*) as walk_in_rate_prev,
    avg(case when nflag_insured_on_approval = 1 then 1 else 0 end) as insurance_uptake_rate_prev,
    sum(case when name_yield_group = 'high' then 1 else 0 end) * 1.0 / count(*) as high_yield_group_rate_prev,

    /* --- 2. RECENCY (CLEANED) --- */
    max(case when name_contract_status = 'Approved' then nullif(days_decision, 365243) else null end) as days_since_last_approval_prev,
    max(case when name_contract_status = 'Refused' then nullif(days_decision, 365243) else null end) as days_since_last_refusal_prev,
    max(nullif(days_termination, 365243)) as days_since_last_termination_prev,

    /* --- 3. TRUST & WEALTH --- */
    avg(case when name_contract_status = 'Approved' then (amt_credit / nullif(amt_application, 0)) else null end) as degree_of_trust_avg_prev,
    avg(amt_down_payment / nullif(amt_credit, 0)) as avg_down_payment_rate_prev,
    max(amt_annuity) as max_prev_annuity_prev,
    avg(case when name_contract_status = 'Approved' then cnt_payment else null end) as avg_term_approved_prev,

    /* --- 4. TWIN VARIABLES --- */
    avg(case when name_contract_status = 'Refused' then amt_goods_price else null end) as avg_goods_price_refused_prev,
    avg(case when name_contract_status = 'Approved' then amt_goods_price else null end) as avg_goods_price_approved_prev,

    /* --- 5. INTEREST RATES --- */
    avg(case when name_contract_status = 'Refused' then rate_interest_primary else null end) as avg_rate_interest_refused_prev,
    avg(case when name_contract_status = 'Approved' then rate_interest_primary else null end) as avg_rate_interest_approved_prev,

    /* --- 6. CATEGORICAL MODES --- */
    mode(name_seller_industry) as most_freq_seller_industry_prev,
    (array_agg(code_reject_reason order by days_decision desc))[1] as last_rejection_reason_prev

from '/home/vasif/Desktop/Home-Credit-Risk-Analysis/parquet_files/previous_applications.parquet'
group by sk_id_curr
""")

Now join the aggregated table with the main home_credit_train table to add these features to the main table.

Show Code
con.execute("""
create or replace table home_credit_train as 
select 
    t1.*,
    /* We exclude sk_id_curr from t2 so we don't have duplicate ID columns */
    t2.* exclude (sk_id_curr)
from home_credit_train t1
left join previous_application_agg t2 
    on t1.sk_id_curr = t2.sk_id_curr
""")

Credit Card Balance Feature Engineering

Now it is time to do feature engineering on the credit_card_balance table. Before doing aggregations and feature engineering let's re-analyze this table and understand what it represents in the dataset.

This table represents the monthly behavioral history of clients who possess a Home Credit credit card. Unlike a standard loan (which has a fixed schedule), a credit card is dynamic. This table provides a month-by-month snapshot of:

  • Balance: How much they owe right now.

  • Drawings: How they used the money (ATM cash withdrawals vs. Point-of-Sale shopping).

  • Payments: How much they paid back this month vs. the minimum required.

  • Status: Are they active, over limit, or delinquent?

This is arguably the strongest Behavioral Signal in the dataset since it tells behavior of the applicants.

The feature engineering ideas on this table:

  • cc_total_months_history: Total number of months the client has data in the credit card table.
  • cc_active_months_count: Number of months the contract status was 'Active'.
  • cc_completed_months_count: Number of months the contract status was 'Completed' (closed).
  • cc_utilization_rate_avg: Average percentage of the credit limit used (Balance / Limit).
  • cc_balance_spike_ratio: Ratio of the client's maximum balance ever vs. their average balance. High values indicate sudden debt spikes.
  • cc_avg_interest_burden: Average monthly amount owed in interest and fees (Total Receivable - Principal).
  • cc_payment_over_min_ratio: Ratio of Amount Paid / Minimum Required: 1.0 - Paid exactly the minimum., >1.0 - Paid more than minimum (Good), <1.0 - Paid less than minimum (Bad).
  • cc_atm_drawing_rate: Percentage of total spending that was taken as Cash from ATMs.
  • cc_avg_ticket_size: Average amount spent per single transaction.
  • cc_avg_monthly_drawings_count: Average number of times the card is used per month (Total).
  • cc_avg_monthly_atm_count: Average number of ATM withdrawals per month.
  • cc_avg_monthly_pos_count: Average number of shop purchases (Point of Sale) per month.
  • cc_max_dpd: The maximum Days Past Due (DPD) ever recorded.
  • cc_max_dpd_def: The maximum "Tolerant" DPD. This ignores small, negligible amounts and focuses on real missed payments.
  • cc_late_payment_count: Total count of months where the client was late (DPD > 0).
  • cc_limit_growth_value: The difference between the highest credit limit they ever had and the lowest (Max Limit - Min Limit).
Show Code
con.execute(r"""
create or replace table credit_card_agg as 
select 
    sk_id_curr,

    /* --- 1. VOLUME & STATUS --- */
    count(*) as cc_total_months_history,
    sum(case when name_contract_status = 'Active' then 1 else 0 end) as cc_active_months_count,
    sum(case when name_contract_status = 'Completed' then 1 else 0 end) as cc_completed_months_count,

    /* --- 2. UTILIZATION & DEBT LOAD --- */
    avg(amt_balance / nullif(amt_credit_limit_actual, 0)) as cc_utilization_rate_avg,
    max(amt_balance) / nullif(avg(amt_balance), 0) as cc_balance_spike_ratio,

    /* Interest Burden: How much extra are they paying in fees/interest? */
    avg(amt_total_receivable - amt_receivable_principal) as cc_avg_interest_burden,

    /* --- 3. PAYMENT DISCIPLINE --- */
    /* Ratio > 1.0 means paying down debt. Ratio < 1.0 means missing payments. */
    avg(amt_payment_total_current / nullif(amt_inst_min_regularity, 0)) as cc_payment_over_min_ratio,

    /* --- 4. SPENDING BEHAVIOR (Amounts) --- */
    /* Cash Usage Rate (High risk) */
    sum(amt_drawings_atm_current) / nullif(sum(amt_drawings_current), 0) as cc_atm_drawing_rate,
    /* Average Ticket Size */
    sum(amt_drawings_current) / nullif(sum(cnt_drawings_current), 0) as cc_avg_ticket_size,

    /* --- 5. USAGE FREQUENCY (Counts) - NEW --- */
    /* Are they heavy users? */
    avg(cnt_drawings_current) as cc_avg_monthly_drawings_count,
    avg(cnt_drawings_atm_current) as cc_avg_monthly_atm_count,
    avg(cnt_drawings_pos_current) as cc_avg_monthly_pos_count,

    /* --- 6. DELINQUENCY (Strict vs Tolerant) --- */
    max(sk_dpd) as cc_max_dpd,
    /* Tolerance DPD: This is the 'real' bad indicator (ignores small $1 debts) */
    max(sk_dpd_def) as cc_max_dpd_def,
    /* Count of late months */
    sum(case when sk_dpd > 0 then 1 else 0 end) as cc_late_payment_count,

    /* --- 7. LIMIT MANAGEMENT --- */
    /* Growth: Max Limit - Min Limit */
    max(amt_credit_limit_actual) - min(amt_credit_limit_actual) as cc_limit_growth_value

from credit_card_balance
group by sk_id_curr
""")

Now join the aggregated table with the main home_credit_train table to add these features to the main table.

Show Code
con.execute("""
create or replace table home_credit_train as 
select 
    t1.*,
    t2.* exclude (sk_id_curr)
from home_credit_train t1
left join credit_card_agg t2 
    on t1.sk_id_curr = t2.sk_id_curr
""")

POS CASH Balance Table Feature Engineering

Feature Engineering: POS Cash Balance (POS_CASH_balance)

1. What is this table? This table contains monthly snapshots of Point-of-Sale (POS) and Cash Loans. These are typically smaller, fixed-term loans for specific goods (like appliances, electronics, or furniture) or small cash advances. Unlike credit cards, these usually have a fixed end date and installment schedule.

2. Why is it required? This table provides a unique view into "Debt Drag" and Contract Stability.

  • Future Burden: Unlike other tables, this one explicitly tells us how many installments are left (CNT_INSTALMENT_FUTURE). A client with 24 months left on a loan is under more pressure than one with 1 month left.
  • Restructuring Signs: If the term length (CNT_INSTALMENT) keeps changing, it usually means the client couldn't pay and had to renegotiate the loan terms.

3. Feature Explanations Below is the explanation for every alias used in the SQL code.

1. Volume

  • pos_total_months_history: Total number of monthly records available for the client.

2. Future Burden (The "Debt Drag")

  • pos_avg_future_installments: On average, how many months does the client still have to pay? High numbers mean they are locked into long-term debt.
  • pos_remaining_debt_ratio: The ratio of remaining installments to the total original term.
  • ~1.0: Loan just started (High current burden).
  • ~0.0: Loan is almost finished (Low burden).

3. Late Payments

  • pos_late_payment_count: Count of months where SK_DPD (Days Past Due) was greater than 0.
  • pos_max_dpd: The worst (maximum) lateness ever recorded for this client.
  • pos_max_dpd_def: The worst "Tolerant" DPD. This metric ignores small, negligible overdue amounts (often errors) and focuses on significant defaults.

4. Status Rates

  • pos_completed_rate: Percentage of months where the contract status was 'Completed'. High rates suggest a history of successfully finishing loans.
  • pos_active_rate: Percentage of months where the contract was 'Active' (currently being paid).

5. Term Stability (Restructuring)

  • pos_term_variance: The variance in the original term length (CNT_INSTALMENT).
  • Low Variance: The loan terms stayed the same (Stable).
  • High Variance: The term length changed frequently (e.g., extending a 12-month loan to 24 months). This is a strong indicator of financial distress/restructuring.
Show Code
con.execute(r"""
create or replace table pos_cash_agg as 
select 
    sk_id_curr,

    /* --- 1. VOLUME --- */
    count(*) as pos_total_months_history,

    /* --- 2. FUTURE BURDEN (The Killer Features) --- */
    /* Average remaining months: High = Heavy Debt */
    avg(cnt_instalment_future) as pos_avg_future_installments,

    /* Remaining Ratio: (Remaining / Total). 
       1.0 = Just Started (Max Risk). 0.0 = Finished (Min Risk). */
    avg(cnt_instalment_future / nullif(cnt_instalment, 0)) as pos_remaining_debt_ratio,

    /* --- 3. LATE PAYMENTS --- */
    sum(case when sk_dpd > 0 then 1 else 0 end) as pos_late_payment_count,
    max(sk_dpd) as pos_max_dpd,
    /* Tolerance version */
    max(sk_dpd_def) as pos_max_dpd_def,

    /* --- 4. STATUS RATES --- */
    /* Completion Rate: High means they successfully finish contracts */
    sum(case when name_contract_status = 'Completed' then 1 else 0 end) * 1.0 / count(*) as pos_completed_rate,

    /* Active Rate: High means they are currently paying */
    sum(case when name_contract_status = 'Active' then 1 else 0 end) * 1.0 / count(*) as pos_active_rate,

    /* --- 5. TERM STABILITY (Restructuring Check) --- */
    /* If Variance of Installment Count is high, they kept changing the terms (bad sign) */
    variance(cnt_instalment) as pos_term_variance

from POS_CASH_balance
group by sk_id_curr
""")

Add these features to the main home_credit_train table by joining with the aggregated pos_cash_agg table.

con.execute("""
create or replace table home_credit_train as 
select 
    t1.*,
    t2.* exclude (sk_id_curr)
from home_credit_train t1
left join pos_cash_agg t2 
    on t1.sk_id_curr = t2.sk_id_curr
""")

Installments Payments Table Feature Engineering

Feature Engineering: Installments Payments (installments_payments)

1. What is this table? This table represents the "Moment of Truth" for repayment. It compares the contractual plan against the actual reality. For every single payment made by the client on previous loans, it records:

  • The Promise: When it was due (DAYS_INSTALMENT) and how much (AMT_INSTALMENT).
  • The Reality: When it was actually paid (DAYS_ENTRY_PAYMENT) and how much (AMT_PAYMENT).

2. Why is it required? This is often the single strongest predictor of default.

  • Credit Card table shows monthly habits.
  • POS table shows debt burden.
  • Installments table shows Obedience. It answers the critical questions: "Does this person pay on the exact day they promised?" and "Do they pay the full amount?"

3. Feature Explanations Below is the explanation for every alias used in the SQL code.

1. Lateness (DPD - Days Past Due)

  • ins_late_payment_count: Total number of times the client paid after the due date.
  • ins_avg_dpd: On average, how many days late are they? (Calculated only on late payments).
  • ins_max_dpd: The worst lateness event ever. (e.g., if they were once 50 days late, this captures that extreme risk).

2. Diligence (DBD - Days Before Due)

  • ins_avg_dbd: On average, how many days early do they pay?
  • Paying early is a strong "Wealth Signal" indicating cash flow stability.

  • ins_max_dbd: The earliest they have ever paid (e.g., paying off a loan 30 days in advance).

3. Short Payment (Underpayment)

  • ins_underpayment_count: Number of times they paid less than the required installment amount.
  • ins_payment_ratio: Ratio of Amount Paid / Amount Required.
  • 1.0: Perfect payment.
  • < 1.0: Underpayment (Major Red Flag).

  • ins_total_missed_balance: The total sum of money they failed to pay across all historical installments (Required - Paid).

4. Restructuring (Version Changes)

  • ins_version_change_count: Count of how many different "payment calendars" (NUM_INSTALMENT_VERSION) exist for the client.
  • Low (1): The original plan was followed perfectly.
  • High (>1): The payment schedule was changed mid-loan (Restructuring), usually because the client couldn't pay.

5. Recent Behavior (Trend)

  • ins_recent_late_count_1y: A specific count of late payments occurring only in the last 365 days. This helps the model distinguish between "Old Mistakes" (5 years ago) and "Current Risk."
Show Code con.execute(r""" create or replace table installments_agg as select sk_id_curr, /* --- 1. LATENESS (DPD - Days Past Due) --- */ /* Only count positive days (Late). If early, it is 0. */ sum(case when days_entry_payment > days_instalment then 1 else 0 end) as ins_late_payment_count, avg(case when days_entry_payment > days_instalment then (days_entry_payment - days_instalment) else 0 end) as ins_avg_dpd, max(case when days_entry_payment > days_instalment then (days_entry_payment - days_instalment) else 0 end) as ins_max_dpd, /* --- 2. DILIGENCE (DBD - Days Before Due) --- */ /* Only count negative days (Early). Paying early is a "Wealth Signal" */ avg(case when days_entry_payment < days_instalment then (days_instalment - days_entry_payment) else 0 end) as ins_avg_dbd, max(case when days_entry_payment < days_instalment then (days_instalment - days_entry_payment) else 0 end) as ins_max_dbd, /* --- 3. SHORT PAYMENT (Underpayment) --- */ /* Did they pay LESS than required? */ sum(case when amt_payment < amt_instalment then 1 else 0 end) as ins_underpayment_count, /* The Payment Ratio: 1.0 = Perfect. < 1.0 = Short. */ avg(amt_payment / nullif(amt_instalment, 0)) as ins_payment_ratio, /* Total 'Missing' Money */ sum(amt_instalment - amt_payment) as ins_total_missed_balance, /* --- 4. RESTRUCTURING (Version Changes) --- */ /* If they have many versions (1, 2, 3), they changed terms often. */ count(distinct num_instalment_version) as ins_version_change_count, /* --- 5. RECENT BEHAVIOR (Last 1 Year) --- */ /* We look at behavior only in the last 365 days */ sum(case when days_instalment > -365 and days_entry_payment > days_instalment then 1 else 0 end) as ins_recent_late_count_1y from installments_payments group by sk_id_curr """)

Add these features to the main home_credit_train table by joining with the aggregated installments_agg table.

con.execute("""
create or replace table home_credit_train as 
select 
    t1.*,
    t2.* exclude (sk_id_curr)
from home_credit_train t1
left join installments_agg t2 
    on t1.sk_id_curr = t2.sk_id_curr
""")

Bureau and Bureau Balance Table Feature Engineering

Feature Engineering: Bureau Balance (bureau_balance)

1. What is this table? This table contains the monthly delinquency history for loans held at other financial institutions (not Home Credit).

  • While the main bureau table tells us "This client has a car loan at Bank X," this bureau_balance table tells us the month-by-month status of that car loan (e.g., "Paid on time in Jan," "Late in Feb," "Written Off in March").

2. Why is it required? This is the granular "External Reputation" check.

  • A client might tell Home Credit they are safe, but this table reveals if they are currently failing to pay their debts elsewhere.
  • The STATUS column is encoded (C, X, 0, 1, 2, 3, 4, 5). We must convert these codes into a numerical "Risk Score" to measure how severe their financial trouble is.

3. Feature Explanations Below is the explanation for every alias used in the SQL code.

1. Duration

  • bb_months_history: Total number of months of history available for this specific external loan.
  • **bb_first_month / bb_last_month**: The timeline of the loan relative to the current application. (e.g., Did this history happen 5 years ago or last month?).

2. Risk Score (The "Max Badness")

  • bb_max_delinquency_level: We convert the confusing text codes into a generic "Badness Scale":
  • 0: Paid on time / Closed / No Info.
  • 1: Late (1-30 days).
  • 2-4: Significantly Late (30-120 days).
  • 5: Written Off / Sold (The worst possible status).
  • This feature captures the worst status the client ever reached on this loan.

3. Recent Risk (Trend)

  • bb_delinquency_last_12m_flag: A binary flag (1 or 0).
  • 1: The client was late (Status 1-5) at least once in the last 12 months.
  • 0: The client has been clean for the last year.
  • This distinguishes "Current Risk" from "Old Mistakes".

4. Status Counts

  • bb_count_status_1: Count of months where they were slightly late (1-30 days). Frequent small delays indicate disorganization.
  • bb_count_status_5: Count of months where the debt was "Written Off" or "Sold." This is a major red flag indicating default at another bank.
Show Code
con.execute("""
create or replace table bureau_balance_agg as 
select 
    sk_id_bureau,

    /* --- 1. DURATION --- */
    count(*) as bb_months_history,
    min(months_balance) as bb_first_month,
    max(months_balance) as bb_last_month,

    /* --- 2. RISK SCORE (Weighted Status) --- */
    /* We convert char status to numbers to find the "Max Badness" */
    /* 0=OK, 1=Late, 5=Written Off. We ignore C and X for risk calculation */
    max(case 
        when status = '5' then 5 
        when status = '4' then 4 
        when status = '3' then 3 
        when status = '2' then 2 
        when status = '1' then 1 
        else 0 
    end) as bb_max_delinquency_level,

    /* --- 3. RECENT RISK (Last 12 Months) --- */
    /* Is the trouble recent? */
    max(case 
        when months_balance >= -12 and status in ('1','2','3','4','5') then 1 
        else 0 
    end) as bb_delinquency_last_12m_flag,

    /* --- 4. STATUS COUNTS --- */
    sum(case when status = '1' then 1 else 0 end) as bb_count_status_1, /* Late < 30 days */
    sum(case when status = '5' then 1 else 0 end) as bb_count_status_5  /* Written Off */

from bureau_balance
group by sk_id_bureau
""")

Add these features to the main home_credit_train table by joining with the aggregated bureau_balance_agg table through the bureau table.

con.execute("""
create or replace table home_credit_train as 
select 
    t1.*,
    t2.* exclude (sk_id_curr)
from home_credit_train t1
left join bureau_agg t2 
    on t1.sk_id_curr = t2.sk_id_curr
""")

In the below code, we will do the same for the bureau table by creating aggregated features and then joining with the main table.

Show Code
con.execute(r"""
create or replace table home_credit_train as 
select 
    *,

    /* --- 1. INTERNAL AFFORDABILITY (The Basics) --- */
    /* Credit-to-Income: How many years of salary does this loan represent? */
    amt_credit / nullif(amt_income_total, 0) as ratio_credit_to_income,

    /* Annuity-to-Income (PTI): What % of monthly salary goes to THIS loan? */
    amt_annuity / nullif(amt_income_total, 0) as ratio_annuity_to_income,

    /* Annuity-to-Credit: What is the effective interest rate/term intensity? */
    amt_annuity / nullif(amt_credit, 0) as ratio_annuity_to_credit,

    /* --- 2. EXTERNAL AFFORDABILITY (The "True" Debt Load) --- */
    /* Global Debt-to-Income: (Bureau Debt + New Loan) / Income */
    /* This is the most honest measure of how broke they really are. */
    (coalesce(bureau_total_active_debt, 0) + amt_credit) / nullif(amt_income_total, 0) as ratio_total_debt_to_income,

    /* Global Credit Utilization: (Bureau Active Limit + New Loan) / Income */
    (coalesce(bureau_total_active_limit, 0) + amt_credit) / nullif(amt_income_total, 0) as ratio_total_credit_limit_to_income,

    /* --- 3. DISPOSABLE INCOME PROXY --- */
    /* How much money is left after paying the new loan? */
    amt_income_total - amt_annuity as feature_disposable_income,

    /* --- 4. CREDIT MATURITY --- */
    /* Age / Employment Ratios */
    days_employed / nullif(days_birth, 0) as ratio_employed_to_age,
    amt_income_total / nullif(days_birth, 0) as ratio_income_to_age

from home_credit_train
""")

Data Cleaning: Handling Anomalies

After these steps, in some entries we may encounter some clients whose days_employed is greater than days_birth which is an anomaly in the data. We can replace these values with nulls to avoid misleading the model. This is due to the fact that some clients have DAYS_EMPLOYED value of 365243 which is a placeholder for "Not Employed". Since these clients are not employed, it is logical to replace this value with nulls to avoid misleading the model.

con.execute("""
update home_credit_train
set 
    days_employed = null 
where days_employed = 365243;
""")

In the next step, we will re-create the age/employment ratio and income/age ratio features to reflect the changes we made in the days_employed column. For unemployed clients, these ratios will become null which is a safe outcome since they are not employed and we don't want to mislead the model by giving them a false ratio value.

con.execute("""
update home_credit_train
set 
    /* If days_employed is NULL (unemployed), this ratio becomes NULL, which is safe */
    ratio_employed_to_age = abs(days_employed) / nullif(abs(days_birth), 0),

    /* Income to age is always valid */
    ratio_income_to_age = amt_income_total / nullif(abs(days_birth), 0)
""")

Finally, let`s check the final version of the train table after all these feature engineering steps.


SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR housing_score_avg housing_data_count housing_score_mode housing_data_count_mode housing_score_medi housing_data_count_medi flag_document_3_kept total_document_count application_count_prev refusal_rate_prev approval_rate_prev cash_loan_rate_prev walk_in_rate_prev insurance_uptake_rate_prev high_yield_group_rate_prev days_since_last_approval_prev days_since_last_refusal_prev days_since_last_termination_prev degree_of_trust_avg_prev avg_down_payment_rate_prev max_prev_annuity_prev avg_term_approved_prev avg_goods_price_refused_prev avg_goods_price_approved_prev avg_rate_interest_refused_prev avg_rate_interest_approved_prev most_freq_seller_industry_prev last_rejection_reason_prev cc_total_months_history cc_active_months_count cc_completed_months_count cc_utilization_rate_avg cc_balance_spike_ratio cc_avg_interest_burden cc_payment_over_min_ratio cc_atm_drawing_rate cc_avg_ticket_size cc_avg_monthly_drawings_count cc_avg_monthly_atm_count cc_avg_monthly_pos_count cc_max_dpd cc_max_dpd_def cc_late_payment_count cc_limit_growth_value pos_total_months_history pos_avg_future_installments pos_remaining_debt_ratio pos_late_payment_count pos_max_dpd pos_max_dpd_def pos_completed_rate pos_active_rate pos_term_variance ins_late_payment_count ins_avg_dpd ins_max_dpd ins_avg_dbd ins_max_dbd ins_underpayment_count ins_payment_ratio ins_total_missed_balance ins_version_change_count ins_recent_late_count_1y bureau_total_loans_count bureau_active_loans_count bureau_closed_loans_count bureau_credit_card_count bureau_mortgage_count bureau_car_loan_count bureau_total_active_debt bureau_total_active_limit bureau_total_overdue_amount bureau_max_overdue_ever bureau_utilization_rate bureau_days_since_last_loan bureau_days_until_next_end bureau_max_delinquency_level_ever bureau_total_late_months_history ratio_credit_to_income ratio_annuity_to_income ratio_annuity_to_credit ratio_total_debt_to_income ratio_total_credit_limit_to_income feature_disposable_income ratio_employed_to_age ratio_income_to_age
242471 0 Cash loans F Y N 1 90000 652500 21694.5 652500 Unaccompanied State servant Secondary / secondary special Married House / apartment 0.028663 -17798 -9130 -4242 -1340 7 1 1 0 1 0 0 Laborers 3 2 2 FRIDAY 10 0 0 0 0 0 0 Postal nan 0.611574 0.609276 block of flats 0.0469 Panel False 2 0 2 0 -93 0 0 0 1 0 1 0.163111 9 0.163767 9 0.1633 9 1 1 3 0 1 0 0 0 0.333333 -1013 -824 0.954029 0.108272 4367.93 4 nan 21937.5 nan nan Consumer electronics XAP 36 36 0 0.138038 4.01547 293.589 5.68002 0.758997 5465.68 0.888889 0.222222 0.962963 8 8 1 0 14 3 0.5 0 0 0 0.142857 0.857143 0 1 0.277778 15 5.87037 19 0 1 0 2 0 5 1 4 1 0 0 301260 315000 0 0 0.956382 -176 920 1 8 7.25 0.24105 0.0332483 10.5973 10.75 68305.5 0.512979 -5.05675
242472 1 Cash loans F N N 0 90000 647046 19048.5 463500 Unaccompanied Pensioner Secondary / secondary special Single / not married House / apartment 0.020713 -22396 -3994 -4008 nan 1 0 0 1 0 0 1 3 3 MONDAY 8 0 0 0 0 0 0 XNA nan 0.0871064 0.172495 nan 0 0 0 0 -1095 0 0 0 0 0 3 nan 0 nan 0 nan 0 1 1 5 0 1 0.6 0 0.6 0.2 -1095 -547 1.0897 0.0322524 14379.5 10.8 nan 117884 nan nan XNA XAP 36 36 0 0.49078 3.52269 2852.43 2.18108 0.830703 5052.85 1.61111 0.916667 0.694444 0 0 0 270000 59 7.27119 0.475989 0 0 0 0.0677966 0.932203 19.0836 0 0 0 4.125 19 0 1 0 2 0 8 3 5 2 0 0 400622 688500 760.5 nan 0.581876 -328 1040 0 7.1894 0.21165 0.0294392 11.6408 14.8394 70951.5 -16.3084 -4.01857
242476 0 Cash loans F Y Y 1 135000 1.16672e+06 34245 913500 Unaccompanied State servant Secondary / secondary special Married House / apartment 0.020713 -14229 -6722 -7980 -4892 12 1 1 0 1 0 0 Medicine staff 3 3 2 WEDNESDAY 13 0 0 0 0 0 0 Medicine 0.316581 0.0350015 0.392774 reg oper spec account block of flats 0.0525 Panel False 0 0 0 0 -1327 0 0 0 1 2 2 0.23792 10 0.24063 10 0.23878 10 1 1 2 0 0.5 0.5 0 0 0.5 -1327 -1018 1.08251 0 13601.6 10 nan 98010 nan nan XNA XAP 96 96 0 0.339839 8.4301 187.852 12.783 0 3011.76 0.395833 0 0.395833 0 0 0 90000 11 5 0.5 0 0 0 0.0909091 0.909091 0 2 0.0551181 5 2.82677 31 4 0.984252 27207.6 2 0 6 2 4 2 0 0 342014 1.1205e+06 0 8415 0.305233 -707 612 0 8.6424 0.253667 0.0293514 11.1758 16.9424 100755 0.472415 -9.48767
242490 0 Cash loans F N Y 0 247500 50940 5616 45000 Unaccompanied Working Higher education Married House / apartment 0.022625 -10412 -2085 -2029 -2833 nan 1 1 0 1 0 0 Core staff 2 2 2 WEDNESDAY 15 0 0 0 0 0 0 Kindergarten 0.601655 0.331324 0.598926 nan 3 0 3 0 -716 0 0 0 0 0 3 nan 0 nan 0 nan 0 1 1 9 0.222222 0.333333 0.444444 0.111111 0.111111 0.111111 -172 -172 -381 1.05555 0.100499 27902.8 24 nan 294360 nan nan XNA XAP 6 6 0 0 nan 0 nan nan nan 0 nan nan 0 0 0 90000 18 22.6667 0.617424 0 0 0 0.0555556 0.944444 555.546 0 0 0 17.8235 30 0 1.61096 -315000 3 0 3 1 2 1 0 0 207729 225000 0 7303.5 0.92324 -848 501 0 0.205818 0.0226909 0.110247 1.04513 1.11491 241884 0.20025 -23.7706
242504 1 Cash loans F N N 1 90000 509400 37066.5 450000 Unaccompanied Working Secondary / secondary special Married House / apartment 0.0228 -12163 -2402 -5800 -2651 nan 1 1 1 1 0 0 Sales staff 3 2 2 TUESDAY 10 0 0 0 0 0 0 Self-employed 0.599388 0.477412 0.200926 reg oper account block of flats 0.0043 Wooden False 1 0 1 0 -319 0 0 0 1 0 3 0.142907 14 0.14375 14 0.143207 14 1 1 9 0.444444 0.555556 0.333333 0.444444 0.111111 0.222222 -319 -286 -464 1.08631 0.068268 34869.2 12 393750 134557 nan nan XNA HC 10 10 0 0.675318 1.76388 3752.55 0.764237 0.0563712 5442.82 4.4 0.111111 4.55556 0 0 0 180000 48 10.8125 0.58125 0 0 0 0.0833333 0.916667 90.8825 5 0.3 7 8.75 76 4 0.966667 22174 3 3 19 4 14 3 0 0 543760 787500 0 0 0.690489 -53 27305 1 3 5.66 0.41185 0.072765 11.7018 14.41 52933.5 0.197484 -7.39949

The shape of of final home_credit_train table is (307511, 146) which means we have 122 features including the target variable. We have successfully engineered a large number of features from the bureau and bureau_balance tables and joined them with the main table to create a rich dataset for modeling.

Convert to Parquet

After all the feature engineering steps, we can convert the final home_credit_train table to parquet format for faster loading in the modeling phase. Parquet is a columnar storage format that is optimized for analytical queries and can significantly reduce the time it takes to load the data into memory.

con.execute(r"""
COPY home_credit_train 
TO '/home/vasif/Desktop/Vasif_Asadov_Website/docs/data_analysis/home_credit_risk/home_credit_train_final.parquet' 
(FORMAT PARQUET)
""")

This will create a parquet file named home_credit_train_final.parquet in the specified directory which can be used for the modeling phase.

Conclusion

We did extraordinary amount of feature engineering in this step by creating a final aggregated dataset by combining all of the information from multiple tables. We created a total of 146 features from those tables. If we blindly add the columns from the tables and create aggregated data, then the number of features would have been in the thousands which is not ideal for modeling. Instead, we carefully engineered a set of 122 features based on the deep analysis we did in the previous sections. We focused on creating features that are meaningful and have a strong relationship with the target variable. This will help us build a more accurate and robust model in the next step.

In the next step, we will load this final parquet file into a pandas dataframe and perform some final preprocessing steps before feeding it into the model. We will also do some exploratory data analysis to understand the distribution of the features and their relationship with the target variable. We will handle missing values and do some feature scaling if necessary. Finally, we will be ready to train our machine learning model on this rich dataset and make predictions on the test set.