Customer Behavior and Retention Analysis¶
Introduction¶
In this section we will analyze customer behavior and retention patterns using the Olist e-commerce dataset. The goal is to segment the customer base into distinct clusters based on their purchasing behavior, identify trends in customer retention, and understand the factors that influence customer loyalty. We will do the following analysis:
- Segment customers into clusters such as Champions, Loyalists, Hibernating, and At Risk based on their recency, frequency, and monetary value (RFM analysis).
- Understand how the customer base has evolved over time, including trends in new customer acquisition and retention rates.
- Analyze whether pareto principle holds true for the customer base, i.e., if a small percentage of customers contribute to a large percentage of revenue.
- Explore the average time between repeat customers' purchase timestamps to understand the frequency of repeat purchases.
- Identify the customers that exceeded the average purchase cycle time more than 2 standard deviations, and analyze their behavior to understand potential reasons for their inactivity.
Each question will be solved by SQL queries toghether with supportive explanations, tabular results and obtained insights. The insights obtained from the analysis will guide our market strategies and help Olist company to improve customer retention and loyalty, ultimately driving business growth.
Customer Segmentation using RFM Analysis¶
1. Can we segment the customer base into distinct clusters, specifically, Champions, Loyalists, Hibernating and At Risk by scoring each unique user based on the recency of their last order, the frequency of their purchases and their total monetary contribution?
Solution: Before solving the problem, let's briefly explain the RFM analysis and how we will use it to segment customers. RFM stands for Recency, Frequency, and Monetary value. Recency means how recently a customer made a purchase. It is calculated as the number of days since the last purchase. Frequency refers to how often a customer makes a purchase, which can be measured by counting the number of orders. Monetary value represents how much money a customer has spent, which can be calculated by summing the total payment value for each customer.
To segment customers, we will assign scores from 1 to 5 for each RFM metric, where 5 is the best score and 1 is the worst. For recency, customers with lower days since their last purchase will receive higher scores. For frequency and monetary value, customers with higher counts and values will receive higher scores. After assigning scores, we will group customers into segments based on their RFM scores.
To correctly divide the customers into segments, we will use ntile window function to assign scores based on the distribution of RFM values. This method allows us to create segments that are relative to the overall customer base, ensuring that we capture the nuances in customer behavior effectively.
I have structured my SQL query in the folllowing sequence:
- First, I will find the last order date in the dataset to use as a reference point for calculating recency.
- Then, I will calculate the RFM values for each unique customer by joining the relevant tables and aggregating the necessary metrics.
- Next, I will assign RFM scores using the
ntilefunction to create relative scores for each metric. This function firstly, will order the customers based on their RFM values and then divide them into 5 equal groups and then for corresponding customer, it will assign a score from 1 to 5 based on the group they belong to. For example, customer with id = 1 has recency value of 10 days, frequency count of 5 and monetary value of 100. If the customer falls into the top 20% of customers with the lowest recency values, they will receive an R score of 5. If they fall into the top 20% of customers with the highest frequency counts, they will receive an F score of 5. If they fall into the bottom 20% of customers with the lowest monetary values, they will receive an M score of 1. This way, we can effectively segment customers based on their relative standing in each RFM metric. - Finally, I will group the customers into segments based on their RFM scores and calculate the average scores and other relevant metrics for each segment.
Show SQL Code
with dataset_max_date as (
-- find the last order date in the dataset - reference date
select
max(order_purchase_timestamp) as max_date
from
orders
),
customer_rfm_base as (
-- calculate rfm values for each unique customer
select
customers.customer_unique_id,
max(orders.order_purchase_timestamp) as last_order_date,
count(distinct orders.order_id) as frequency_count,
sum(payments.payment_value) as monetary_value,
dmd.max_date
from
orders
inner join
customers on customers.customer_id = orders.customer_id
inner join
payments on orders.order_id = payments.order_id
cross join
dataset_max_date dmd
where
orders.order_status = 'delivered'
group by
customers.customer_unique_id,
dmd.max_date
),
rfm_scores as (
-- assigning scores from 1 to 5 using ntile window function
-- 5 is the best score, 1 is the worst
select
customer_unique_id,
frequency_count,
monetary_value,
DATEDIFF(day, last_order_date, max_date) as recency_days,
-- recency score: lower_days = better score (5)
-- descending order -> larger days will get lower ntile
ntile(5) over (order by DATEDIFF(day, last_order_date, max_date) desc) as r_score,
-- frequency score: higher count = better score
ntile(5) over (order by frequency_count asc) as f_score,
-- monetary score: higher value = better score
ntile(5) over (order by monetary_value asc) as m_score
from
customer_rfm_base
),
rfm_segments as (
-- group the scores into segments
select
customer_unique_id,
recency_days,
frequency_count,
monetary_value,
r_score,
f_score,
m_score,
case
when r_score >= 4 and (f_score + m_score) / 2 >= 4 then 'Champions'
when r_score >= 3 and (f_score + m_score) / 2 >= 3 then 'Loyal Customers'
when r_score <= 2 and (f_score + m_score) / 2 >= 4 then 'At Risk'
when r_score <= 2 and (f_score + m_score) / 2 <= 2 then 'Hibernating'
else 'Potential Loyalist'
end as customer_segment
from
rfm_scores
)
select
customer_segment,
count(customer_unique_id) as customer_count,
cast(avg(r_score * 1.0) as decimal(5,2)) as avg_recency_score,
cast(avg(f_score * 1.0) as decimal(5,2)) as avg_frequency_score,
cast(avg(m_score * 1.0) as decimal(5,2)) as avg_monetary_score,
cast(avg(recency_days * 1.0) as int) as avg_days_since_last_order,
cast(avg(frequency_count * 1.0) as int) as avg_frequency_count,
cast(avg(monetary_value) as decimal(10,2)) as avg_lifetime_spend
from
rfm_segments
group by
customer_segment
order by
customer_count desc;
Results:
| customer_segment | customer_count | avg_recency_score | avg_frequency_score | avg_monetary_score | avg_days_since_last_order | avg_frequency_count | avg_lifetime_spend |
|---|---|---|---|---|---|---|---|
| Hibernating | 25937 | 1.43 | 1.42 | 2.25 | 453 | 1 | 84.59 |
| Loyal Customers | 25509 | 3.86 | 3.83 | 2.92 | 193 | 1 | 141.31 |
| Potential Loyalist | 22558 | 2.59 | 2.52 | 2.89 | 321 | 1 | 190.69 |
| Champions | 18417 | 4.60 | 4.56 | 4.23 | 129 | 1 | 272.96 |
| At Risk | 936 | 1.58 | 4.95 | 4.40 | 424 | 2 | 315.29 |
Insights:
- The largest segment of customers falls into the 'Hibernating' category, which indicates that a significant portion of the customer base has not made recent purchases and has low frequency and monetary value. This suggests that there may be an opportunity to re-engage these customers through targeted marketing campaigns or promotions.
- The 'Loyal Customers' segment has a relatively high average recency score, indicating that these customers have made recent purchases. However, their average frequency and monetary scores are moderate, suggesting that while they are loyal, they may not be the most valuable customers in terms of revenue. Strategies to increase their purchase frequency or average order value could be beneficial.
- The 'Champions' segment, while smaller in size compared to the 'Hibernating' and 'Loyal Customers' segments, has the highest average scores across all RFM metrics, indicating that these customers are highly engaged and valuable. They have made recent purchases, have a high frequency of orders, and contribute significantly to revenue. This segment should be prioritized for retention efforts and could also be targeted for upselling or cross-selling opportunities.
- The 'At Risk' segment, although small in size, has a high average frequency and monetary score, indicating that these customers have been valuable in the past but have not made recent purchases. This suggests that they may be at risk of churning, and targeted retention strategies should be implemented to re-engage these customers and prevent them from becoming inactive.
Customer Retention Trends and Analysis¶
2. How has the customer base evolved over time in terms of new customer acquisition and retention rates? Specifically, can we identify trends in the number of new customers acquired each month and the percentage of customers retained month-over-month?
Solution: This type of questions is called cohort analysis, where we analyze the behavior of groups of customers (cohorts) over time. In this case, we will create monthly cohorts based on the month of the customer's first purchase. We will then track the retention of these cohorts over subsequent months to understand how many customers from each cohort continue to make purchases in the following months.
Show SQL Code
with customer_first_purchase as (
-- step 1: identify the acquisition month (birth month) for every unique customer
select
c.customer_unique_id,
min(datefromparts(year(o.order_purchase_timestamp), month(o.order_purchase_timestamp), 1)) as cohort_month
from
orders o
inner join
customers c on o.customer_id = c.customer_id
where
o.order_status not in ('canceled', 'unavailable')
group by
c.customer_unique_id
),
cohort_activities as (
-- step 2: join all orders back to the birth month to see when they returned
select
fp.customer_unique_id,
fp.cohort_month,
datediff(month, fp.cohort_month, datefromparts(year(o.order_purchase_timestamp), month(o.order_purchase_timestamp), 1)) as month_index
from
orders o
inner join
customers c on o.customer_id = c.customer_id
inner join
customer_first_purchase fp on c.customer_unique_id = fp.customer_unique_id
where
o.order_status not in ('canceled', 'unavailable')
),
cohort_sizes as (
-- step 3: calculate the denominator (total customers acquired in each month)
select
cohort_month,
count(distinct customer_unique_id) as total_customers
from
customer_first_purchase
group by
cohort_month
),
retention_counts as (
-- step 4: count how many unique customers shopped in month 1, 3, and 6
select
cohort_month,
count(distinct case when month_index = 1 then customer_unique_id end) as month_1_returnees,
count(distinct case when month_index = 3 then customer_unique_id end) as month_3_returnees,
count(distinct case when month_index = 6 then customer_unique_id end) as month_6_returnees
from
cohort_activities
group by
cohort_month
)
-- final output: calculate retention percentages
-- we filter for cohorts with > 10 customers to remove statistically insignificant beta testing months (late 2016)
select
r.cohort_month,
s.total_customers as cohort_size,
cast(100.0 * r.month_1_returnees / s.total_customers as decimal(5,2)) as month_1_retention_pct,
cast(100.0 * r.month_3_returnees / s.total_customers as decimal(5,2)) as month_3_retention_pct,
cast(100.0 * r.month_6_returnees / s.total_customers as decimal(5,2)) as month_6_retention_pct
from
retention_counts r
inner join
cohort_sizes s on r.cohort_month = s.cohort_month
where
s.total_customers > 10
order by
r.cohort_month;
Results:
| cohort_month | cohort_size | month_1_retention_pct | month_3_retention_pct | month_6_retention_pct |
|---|---|---|---|---|
| 2016-10-01 | 290 | 0.00 | 0.00 | 0.34 |
| 2017-01-01 | 752 | 0.40 | 0.13 | 0.40 |
| 2017-02-01 | 1690 | 0.24 | 0.12 | 0.24 |
| 2017-03-01 | 2571 | 0.51 | 0.39 | 0.16 |
| 2017-04-01 | 2325 | 0.60 | 0.17 | 0.34 |
| 2017-05-01 | 3541 | 0.48 | 0.40 | 0.42 |
| 2017-06-01 | 3102 | 0.45 | 0.39 | 0.35 |
| 2017-07-01 | 3822 | 0.52 | 0.24 | 0.31 |
| 2017-08-01 | 4130 | 0.68 | 0.27 | 0.29 |
| 2017-09-01 | 4075 | 0.69 | 0.29 | 0.22 |
| 2017-10-01 | 4392 | 0.71 | 0.09 | 0.20 |
| 2017-11-01 | 7190 | 0.56 | 0.17 | 0.11 |
| 2017-12-01 | 5439 | 0.22 | 0.35 | 0.17 |
| 2018-01-01 | 6951 | 0.33 | 0.29 | 0.17 |
| 2018-02-01 | 6357 | 0.36 | 0.30 | 0.20 |
| 2018-03-01 | 6931 | 0.40 | 0.29 | 0.00 |
| 2018-04-01 | 6698 | 0.58 | 0.24 | 0.00 |
| 2018-05-01 | 6586 | 0.52 | 0.20 | 0.00 |
| 2018-06-01 | 5920 | 0.42 | 0.00 | 0.00 |
| 2018-07-01 | 6016 | 0.52 | 0.00 | 0.00 |
| 2018-08-01 | 6209 | 0.02 | 0.00 | 0.00 |
Insights:
- The customer acquisition trend shows a significant increase in the number of new customers acquired each month, particularly from mid-2017 to early 2018. This indicates that Olist experienced rapid growth during this period, likely due to successful marketing campaigns, expansion of product offerings, or increased brand awareness.
- The retention rates show a general decline over time, with the highest retention observed in the first month after acquisition (month 1 retention) and a significant drop in retention by month 3 and month 6. This suggests that while Olist was successful in acquiring new customers, retaining them over the long term has been a challenge. The sharp decline in retention rates after the first month indicates that many customers may not have had a positive experience or found enough value to continue shopping with Olist.
- The retention rates also show some variability across different cohorts, with certain months (e.g., 2017-08-01 and 2017-09-01) exhibiting higher retention rates compared to others. This could be due to various factors such as seasonal promotions, changes in product offerings, or improvements in customer service during those periods. Analyzing the specific actions taken during these months could provide insights into strategies that may have contributed to better retention.
Pareto Principle Analysis for Customer Base¶
3. Does the customer base adhere to the '80/20 Rule' (Pareto Principle), where the top 20% of unique customers contribute to 80% of the total revenue? If so, what defines the profile of these top-tier customers?
Solution: Pareto Principle, also known as the 80/20 rule, suggests that a small percentage of customers (20%) often contribute to a large percentage of revenue (80%). We can check whether this principle holds true for our case by applying the following steps:
- First, we will calculate the total lifetime spend for each unique customer by summing up their payment values across all their orders.
- Next, we will rank the customers based on their total spend and calculate the cumulative revenue contribution for each customer.
- Then, we will use the
percent_rankwindow function to determine which customers fall into the top 20% of spenders. This function will assign a percentile rank to each customer based on their total spend. This process is done by ordering the customers in descending order of their total spend and then dividing them into percentiles. Customers with a percentile rank of 0.20 or less will be classified as the top 20% of spenders. - Besides the ranking, we will also calculate the cumulative revenue contribution for each customer to see how much of the total revenue is contributed by the top 20% of customers. This will allow us to determine if the top 20% of customers indeed contribute to around 80% of the total revenue, as suggested by the Pareto Principle.
- Finally, we will calculate the total revenue contributed by the top 20% of customers and compare it to the total revenue to see if it adheres to the 80/20 rule.
If the pareto principle holds true, it means our revenue comes predominantly from a small group of high-value customers. Understanding the profile of these top-tier customers can help us tailor our marketing strategies, improve customer retention, and focus on providing exceptional service to this valuable segment of our customer base.
Show SQL Code
with customer_revenue as (
-- step 1: calculate total lifetime spend per unique customer
select
c.customer_unique_id,
sum(p.payment_value) as total_spend
from
orders o
inner join
customers c on o.customer_id = c.customer_id
inner join
payments p on o.order_id = p.order_id
where
o.order_status not in ('canceled', 'unavailable')
group by
c.customer_unique_id
),
revenue_ranking as (
-- step 2: rank customers by spend and calculate cumulative totals
select
customer_unique_id,
total_spend,
-- using percent_rank to determine the top 20% percentile
percent_rank() over (order by total_spend desc) as percentile_rank,
-- running total of revenue
sum(total_spend) over (order by total_spend desc) as cumulative_revenue,
-- total revenue across all customers for percentage calculation
sum(total_spend) over () as grand_total_revenue
from
customer_revenue
),
pareto_summary as (
-- step 3: identify the contribution of the top 20%
select
case
when percentile_rank <= 0.20 then 'Top 20% (VIPs)'
else 'Bottom 80% (Long Tail)'
end as customer_tier,
count(customer_unique_id) as customer_count,
sum(total_spend) as tier_revenue,
max(grand_total_revenue) as total_revenue
from
revenue_ranking
group by
case
when percentile_rank <= 0.20 then 'Top 20% (VIPs)'
else 'Bottom 80% (Long Tail)'
end
)
-- final output: calculating the percentage contribution per tier
select
customer_tier,
customer_count,
cast(tier_revenue as decimal(15,2)) as revenue_contribution,
cast(100.0 * tier_revenue / total_revenue as decimal(5,2)) as revenue_pct
from
pareto_summary
order by
revenue_pct desc;
Results:
| customer_tier | customer_count | revenue_contribution | revenue_pct |
|---|---|---|---|
| Top 20% (VIPs) | 19000 | 8432842.07 | 53.58 |
| Bottom 80% (Long Tail) | 75989 | 7306294.94 | 46.42 |
Insights:
- The analysis revealed that the top 20% of customers contribute to approximately 53.58% of the total revenue. In reality, there is not a perfect adherence to the 80/20 rule, as the top 20% of customers do not contribute to 80% of the revenue always. However, considering that more than half of the revenue of our company comes from only 20% of all customers, it is still a pareto principle type of distribution, where a small percentage of customers contribute to a large percentage of revenue.
- The top 20% of customers, often referred to as VIPs, are a critical segment for Olist. They contribute significantly to the company's revenue and likely have higher engagement and loyalty. Understanding the characteristics of these VIP customers, such as their purchasing behavior, preferences, and demographics, can help Olist tailor its marketing strategies and customer service efforts to better meet their needs and enhance their experience.
Purchase Frequency and Inactivity Analysis¶
4. For the segment of customers with multiple purchases, what is the average time interval (in days) between consecutive orders? How does this 'purchase latency' vary across different product categories?
*Solution: To analyze the purchase frequency and inactivity patterns among customers with multiple purchases, we will follow these steps:
- First, we will identify customers who have made multiple purchases by counting the number of orders for each unique customer. We will focus on those customers who have made at least two purchases to analyze the time interval between their consecutive orders.
- Next, we will calculate the time interval (in days) between consecutive orders for each customer. This can be done using the
lagwindow function to access the previous order's purchase timestamp and then calculating the difference in days between the current order and the previous order. - We will also join the relevant tables to include product category information for each order, allowing us to analyze how purchase latency varies across different product categories.
- Finally, we will calculate the average purchase latency for each product category and identify any significant variations in purchase frequency across categories. This analysis will help us understand customer behavior and identify potential opportunities for targeted marketing or promotions based on purchase patterns in different product categories.
Show SQL Code
with purchase_sequences as (
-- step 1: identify orders for repeat customers and find the previous order date
-- we use lag() to get the prior purchase timestamp for each unique user
select
c.customer_unique_id,
o.order_purchase_timestamp,
p.product_category_name,
lag(o.order_purchase_timestamp) over (
partition by c.customer_unique_id
order by o.order_purchase_timestamp
) as previous_order_timestamp
from
orders o
inner join
customers c on o.customer_id = c.customer_id
inner join
order_items oi on o.order_id = oi.order_id
inner join
products p on oi.product_id = p.product_id
where
o.order_status not in ('canceled', 'unavailable')
),
latency_calculations as (
-- step 2: calculate days between orders and filter only for the second+ purchases
select
customer_unique_id,
product_category_name,
datediff(day, previous_order_timestamp, order_purchase_timestamp) as days_to_next_order
from
purchase_sequences
where
previous_order_timestamp is not null -- ensures we only look at repeat purchase events
)
-- final output: average latency per category to identify replenishment cycles
select
lc.product_category_name,
pcnt.product_category_name_english,
count(*) as repeat_purchase_events,
avg(days_to_next_order) as avg_days_between_orders,
min(days_to_next_order) as min_days,
max(days_to_next_order) as max_days
from
latency_calculations lc
left join product_category_name_translation pcnt
on pcnt.product_category_name = lc.product_category_name
group by
lc.product_category_name,
pcnt.product_category_name_english
having
count(*) >= 5 -- filtering for statistical significance in categories
order by
repeat_purchase_events desc,
avg_days_between_orders asc;
Results:
| product_category_name | product_category_name_english | repeat_purchase_events | avg_days_between_orders | min_days | max_days |
|---|---|---|---|---|---|
| moveis_decoracao | furniture_decor | 2266 | 7 | 0 | 573 |
| cama_mesa_banho | bed_bath_table | 2261 | 14 | 0 | 485 |
| informatica_acessorios | computers_accessories | 1372 | 12 | 0 | 455 |
| utilidades_domesticas | housewares | 1321 | 11 | 0 | 469 |
| esporte_lazer | sports_leisure | 1258 | 18 | 0 | 467 |
| beleza_saude | health_beauty | 1140 | 19 | 0 | 583 |
| ferramentas_jardim | garden_tools | 939 | 9 | 0 | 524 |
| relogios_presentes | watches_gifts | 543 | 30 | 0 | 514 |
| telefonia | telephony | 447 | 14 | 0 | 491 |
| moveis_escritorio | office_furniture | 445 | 3 | 0 | 347 |
| automotivo | auto | 444 | 19 | 0 | 449 |
| brinquedos | toys | 358 | 19 | 0 | 581 |
| perfumaria | perfumery | 358 | 20 | 0 | 609 |
| fashion_bolsas_e_acessorios | fashion_bags_accessories | 300 | 29 | 0 | 457 |
| pet_shop | pet_shop | 298 | 23 | 0 | 427 |
| bebes | baby | 294 | 18 | 0 | 337 |
| papelaria | stationery | 272 | 23 | 0 | 522 |
| eletronicos | electronics | 266 | 15 | 0 | 367 |
| cool_stuff | cool_stuff | 252 | 19 | 0 | 442 |
| 230 | 14 | 0 | 431 | ||
| construcao_ferramentas_construcao | construction_tools_construction | 216 | 14 | 0 | 348 |
| casa_construcao | home_construction | 148 | 14 | 0 | 482 |
| moveis_sala | furniture_living_room | 111 | 10 | 0 | 489 |
| bebidas | drinks | 99 | 20 | 0 | 489 |
| consoles_games | consoles_games | 94 | 23 | 0 | 524 |
| malas_acessorios | luggage_accessories | 91 | 22 | 0 | 350 |
| eletrodomesticos | home_appliances | 80 | 25 | 0 | 338 |
| construcao_ferramentas_iluminacao | construction_tools_lights | 79 | 14 | 0 | 509 |
| alimentos | food | 76 | 19 | 0 | 362 |
| casa_conforto | home_confort | 72 | 18 | 0 | 306 |
| instrumentos_musicais | musical_instruments | 67 | 18 | 0 | 297 |
| eletroportateis | small_appliances | 65 | 15 | 0 | 235 |
| sinalizacao_e_seguranca | signaling_and_security | 62 | 1 | 0 | 102 |
| alimentos_bebidas | food_drink | 61 | 8 | 0 | 194 |
| construcao_ferramentas_jardim | costruction_tools_garden | 58 | 15 | 0 | 405 |
| climatizacao | air_conditioning | 55 | 10 | 0 | 281 |
| telefonia_fixa | fixed_telephony | 53 | 9 | 0 | 267 |
| livros_interesse_geral | books_general_interest | 51 | 9 | 0 | 331 |
| market_place | market_place | 43 | 6 | 0 | 270 |
| moveis_cozinha_area_de_servico_jantar_e_jardim | kitchen_dining_laundry_garden_furniture | 41 | 24 | 0 | 340 |
| industria_comercio_e_negocios | industry_commerce_and_business | 39 | 6 | 0 | 180 |
| agro_industria_e_comercio | agro_industry_and_commerce | 34 | 18 | 0 | 518 |
| construcao_ferramentas_seguranca | construction_tools_safety | 34 | 20 | 0 | 248 |
| fashion_calcados | fashion_shoes | 31 | 9 | 0 | 93 |
| artigos_de_natal | christmas_supplies | 29 | 15 | 0 | 370 |
| fashion_roupa_masculina | fashion_male_clothing | 26 | 1 | 0 | 23 |
| pcs | computers | 26 | 11 | 0 | 174 |
| audio | audio | 23 | 33 | 0 | 404 |
| moveis_quarto | furniture_bedroom | 21 | 3 | 0 | 64 |
| artes | art | 14 | 37 | 0 | 444 |
| eletrodomesticos_2 | home_appliances_2 | 14 | 41 | 0 | 434 |
| livros_tecnicos | books_technical | 13 | 3 | 0 | 36 |
| fraldas_higiene | diapers_and_hygiene | 13 | 4 | 0 | 57 |
| fashion_underwear_e_moda_praia | fashion_underwear_beach | 13 | 6 | 0 | 77 |
| fashion_roupa_feminina | fashio_female_clothing | 10 | 6 | 0 | 68 |
| livros_importados | books_imported | 10 | 15 | 0 | 145 |
| cine_foto | cine_photo | 9 | 0 | 0 | 5 |
| casa_conforto_2 | home_comfort_2 | 9 | 3 | 0 | 34 |
| dvds_blu_ray | dvds_blu_ray | 9 | 17 | 0 | 146 |
| tablets_impressao_imagem | tablets_printing_image | 9 | 24 | 0 | 113 |
| construcao_ferramentas_ferramentas | costruction_tools_tools | 9 | 37 | 0 | 274 |
| fashion_esporte | fashion_sport | 6 | 17 | 0 | 106 |
| artigos_de_festas | party_supplies | 6 | 35 | 0 | 215 |
| musica | music | 5 | 0 | 0 | 2 |
Insights:
- Analyzing the avg_days_between_orders is not sufficient to make important conclusions about purchase frequency, since the more important metric is how many times the products are purchased repeatedly. Therefore, we should consider the repeat_purchase_events column as well, which shows us how many times the products in each category were purchased repeatedly. For example, the 'moveis_decoracao' (furniture_decor) category has the highest number of repeat purchase events (2266) and an average of 7 days between orders, indicating that customers frequently repurchase items in this category within a short time frame. This could suggest that customers are buying multiple items for home decoration or that they are replacing items frequently.
- The following categories with high repeat purchase events include 'cama_mesa_banho' (bed_bath_table) and 'informatica_acessorios' (computers_accessories), which also show relatively short average days between orders (14 and 12 days, respectively). This indicates that customers in these categories also have a high purchase frequency, possibly due to the nature of the products, which may require regular replacement or upgrading.
- On the other hand, categories like 'artigos_de_festas' (party_supplies) and 'musica' (music) have fewer repeat purchase events and longer average days between orders, suggesting that customers may not need to repurchase these items as frequently. This could be due to the fact that party supplies are often used for specific events and music may be purchased for long-term enjoyment, leading to less frequent repurchases.
- Overall, the analysis of purchase latency across different product categories provides valuable insights into customer behavior and can help Olist identify which categories have higher purchase frequencies. This information can be used to optimize inventory management, tailor marketing strategies, and improve customer retention by targeting customers with personalized offers based on their purchasing patterns in specific categories.
High Risk of Churn Detection¶
5. Which customers have exceeded the average purchase cycle by more than 2 standard deviations without placing a new order? Can we flag these users as 'High Risk of Churn' based on their deviation from the typical repurchase behavior?
Solution: To identify customers who are at high risk of churn based on their purchase behavior, we will follow these steps:
- First, we will calculate the average purchase cycle (the average time interval between consecutive orders) and the standard deviation of the purchase cycle for customers with multiple purchases. This will help us establish a benchmark for typical repurchase behavior.
- Next, we will calculate the recency of the last purchase for each customer by finding the difference in days between their last order date and the reference date (the maximum order purchase timestamp in the dataset).
- We will then compare the recency of each customer's last purchase to the average purchase cycle plus two standard deviations. Customers whose recency exceeds this threshold will be flagged as 'High Risk of Churn' because they have gone significantly longer than the typical repurchase cycle without making a new purchase.
- Finally, we will count the number of customers who fall into the 'High Risk of Churn' category to understand the potential size of this segment and to inform retention strategies.
Show SQL Code
with dataset_max_date as (
-- finding the last order date in the dataset to act as "today" for recency/churn calculation
select max(order_purchase_timestamp) as max_date from orders
),
purchase_intervals as (
-- step 1: use lag() to calculate the day-gap between consecutive orders for the same unique user
select
c.customer_unique_id,
o.order_purchase_timestamp,
days_since_prior = datediff(day, lag(o.order_purchase_timestamp) over (
partition by c.customer_unique_id
order by o.order_purchase_timestamp
), o.order_purchase_timestamp)
from
orders o
inner join
customers c on o.customer_id = c.customer_id
where
o.order_status not in ('canceled', 'unavailable')
),
cycle_stats as (
-- step 2: calculate global benchmarks for repurchase behavior
-- stdev() helps define what is "statistically normal" vs "anomalous"
select
avg_cycle = avg(days_since_prior * 1.0),
std_dev_cycle = stdev(days_since_prior)
from
purchase_intervals
where
days_since_prior is not null -- only includes repeat purchase events
),
customer_last_purchase as (
-- step 3: find the recency of the absolute last order for every customer
select
c.customer_unique_id,
max(o.order_purchase_timestamp) as last_order_date,
days_since_last_order = datediff(day, max(o.order_purchase_timestamp), d.max_date)
from
orders o
inner join
customers c on o.customer_id = c.customer_id
cross join
dataset_max_date d
where
o.order_status not in ('canceled', 'unavailable')
group by
c.customer_unique_id, d.max_date
)
-- final output: count only the users exceeding the churn threshold
select
count(distinct lp.customer_unique_id) as high_risk_churn_customer_count
from
customer_last_purchase lp
cross join
cycle_stats stats
where
lp.days_since_last_order > (stats.avg_cycle + (2 * stats.std_dev_cycle));
Results:
| high_risk_churn_customer_count |
|---|
| 42051 |
Insights:
- The analysis identified 42,051 customers who have exceeded the average purchase cycle by more than 2 standard deviations without placing a new order. These customers can be flagged as 'High Risk of Churn' based on their significant deviation from typical repurchase behavior.
Conclusion¶
Through the analysis of customer retention, purchase frequency, and churn risk, we have gained valuable insights into the behavior of Olist's customer base. The segmentation of customers using RFM analysis has allowed us to identify key segments such as 'Champions', 'Loyal Customers', 'At Risk', and 'Hibernating', each with distinct characteristics and opportunities for targeted marketing strategies. The cohort analysis revealed trends in customer acquisition and retention rates over time, highlighting the challenges of retaining customers after their initial purchase. The Pareto Principle analysis showed that a small percentage of customers contribute to a significant portion of the revenue, emphasizing the importance of understanding and catering to these high-value customers. Finally, the purchase frequency analysis across product categories provided insights into customer behavior and potential opportunities for optimizing inventory and marketing strategies. The identification of customers at high risk of churn allows Olist to implement targeted retention efforts to re-engage these customers and prevent them from becoming inactive. Overall, these analyses provide a comprehensive understanding of customer behavior and retention patterns, enabling Olist to make informed decisions to enhance customer loyalty and drive revenue growth.