Delivery and Logistics Analysis¶

Delivery Performance Analysis¶

11. What is the overall 'On-Time Delivery Rate' (percentage of orders delivered before the order_estimated_delivery_date)? Furthermore, what is the average variance between the estimated and actual delivery dates across different states?

Solution:

Firstly, we prepare order-level delivery variances (delivery_data CTE): We join the orders and customers tables, filtering strictly for successfully 'delivered' orders with valid dates. We use a CASE statement to flag if an order was delivered on or before the estimate (is_on_time), and use datediff to calculate the exact delivery_variance_days.
Secondly, we aggregate performance by state (state_performance CTE): Grouping by customer_state, we sum the on-time flags to calculate the on_time_rate percentage. We calculate the state's average delivery variance (state_avg_accuracy). Crucially, we also use a window function avg() over () without a partition to calculate the national_avg_accuracy baseline across all states.
Finally, we classify states against the national benchmark (Final SELECT): We format the output metrics and use a CASE statement to evaluate the state's performance. By comparing the state_avg_accuracy against the national_avg_accuracy (+/- a 2-day buffer), we categorize the logistics estimation strategy for that region as 'Extremely Conservative', 'Relatively Aggressive', or 'Standard'.

Show SQL Code

with delivery_data as (
    -- step 1: prepare variance data per order
    select 
        c.customer_state,
        o.order_id,
        o.order_estimated_delivery_date,
        o.order_delivered_customer_date,
        case 
            when o.order_delivered_customer_date <= o.order_estimated_delivery_date then 1 
            else 0 
        end as is_on_time,
        datediff(day, o.order_delivered_customer_date, o.order_estimated_delivery_date) as delivery_variance_days
    from 
        orders o
    inner join 
        customers c on o.customer_id = c.customer_id
    where 
        o.order_status = 'delivered'
        and o.order_delivered_customer_date is not null
        and o.order_estimated_delivery_date is not null
),

state_performance as (
    -- step 2: aggregate by state and calculate national benchmark
    select 
        customer_state,
        count(order_id) as total_delivered_orders,
        cast(100.0 * sum(is_on_time) / count(order_id) as decimal(5,2)) as on_time_rate,
        avg(delivery_variance_days * 1.0) as state_avg_accuracy,
        -- getting the national average across all states to use as a baseline
        avg(avg(delivery_variance_days * 1.0)) over () as national_avg_accuracy
    from 
        delivery_data
    group by 
        customer_state
)

-- final output: categorizing states relative to the national logistics standard
select 
    customer_state,
    total_delivered_orders,
    on_time_rate,
    cast(state_avg_accuracy as decimal(10,2)) as avg_accuracy_days,
    cast(national_avg_accuracy as decimal(10,2)) as national_benchmark,
    case 
        when state_avg_accuracy > national_avg_accuracy + 2 then 'Extremely Conservative (High Buffer)'
        when state_avg_accuracy < national_avg_accuracy - 2 then 'Relatively Aggressive (Lower Buffer)'
        else 'Standard (Aligned with National Avg)'
    end as relative_performance
from 
    state_performance
order by 
    avg_accuracy_days desc;

Freight Cost by Regions¶

12. How does the average freight cost (freight_value) correlate with average delivery time across different states? Can we identify regions that suffer from the 'Logistics Gap' - paying disproportionately high shipping fees for the slowest service?

Solution:

Firstly, we prepare order-level freight and delivery data (order_metrics CTE): We join the orders, customers, and order_items tables, filtering strictly for successfully 'delivered' orders. We calculate the total_order_freight by summing up the freight values of all items within an order (handling multi-item orders properly), and we use datediff to calculate the total delivery_time_days from purchase to customer delivery.
Secondly, we calculate state averages and national benchmarks (state_benchmarks CTE): Grouping by customer_state, we average both the freight costs (avg_state_freight) and delivery times (avg_state_delivery_time) to establish the typical performance for each region. Crucially, we use the avg() over () window function to establish the overall national_avg_freight and national_avg_delivery_time as baselines.
Finally, we classify regional logistics efficiency (Final SELECT): We format the output metrics and use a CASE statement to compare each state's averages against the national benchmarks. By cross-evaluating whether a state is above or below average in both cost and speed, we neatly categorize them into actionable profiles like 'Logistics Gap (High Cost / Slow)', 'Premium Service', 'Budget / Inefficient', or 'Optimized Logistics'.

Show SQL Code

with order_metrics as (
    -- step 1: prepare freight and delivery time per order
    -- we sum freight_value per order first to handle cases with multiple items
    select 
        o.order_id,
        c.customer_state,
        sum(oi.freight_value) as total_order_freight,
        datediff(day, o.order_purchase_timestamp, o.order_delivered_customer_date) as delivery_time_days
    from 
        orders o
    inner join 
        customers c on o.customer_id = c.customer_id
    inner join 
        order_items oi on o.order_id = oi.order_id
    where 
        o.order_status = 'delivered'
        and o.order_delivered_customer_date is not null
    group by 
        o.order_id, c.customer_state, o.order_purchase_timestamp, o.order_delivered_customer_date
),

state_benchmarks as (
    -- step 2: calculate state averages and compare to national benchmarks
    select 
        customer_state,
        avg(total_order_freight) as avg_state_freight,
        avg(delivery_time_days * 1.0) as avg_state_delivery_time,
        -- national benchmarks using window functions
        avg(avg(total_order_freight)) over () as national_avg_freight,
        avg(avg(delivery_time_days * 1.0)) over () as national_avg_delivery_time
    from 
        order_metrics
    group by 
        customer_state
)

-- final output: classifying the logistics efficiency of each state
select 
    customer_state,
    cast(avg_state_freight as decimal(10,2)) as avg_freight,
    cast(avg_state_delivery_time as decimal(10,2)) as avg_delivery_days,
    case 
        when avg_state_freight > national_avg_freight and avg_state_delivery_time > national_avg_delivery_time then 'Logistics Gap (High Cost / Slow)'
        when avg_state_freight > national_avg_freight and avg_state_delivery_time <= national_avg_delivery_time then 'Premium Service (High Cost / Fast)'
        when avg_state_freight <= national_avg_freight and avg_state_delivery_time > national_avg_delivery_time then 'Budget / Inefficient (Low Cost / Slow)'
        else 'Optimized Logistics (Low Cost / Fast)'
    end as logistics_classification
from 
    state_benchmarks
order by 
    avg_state_freight desc, avg_state_delivery_time desc;

Delivery Timeline Decomposition¶

13. Can we decompose the total delivery timeline into three distinct stages: 'Approval Time' (Purchase to Approval), 'Dispatch Time' (Approval to Carrier Handover), and 'Last-Mile Delivery' (Carrier to Customer)? Which of these stages contributes the most to overall delays?

Solution:

Firstly, we extract time intervals for each fulfillment stage (order_lifecycle CTE): We use the datediff function to calculate the time spent in each distinct phase: 'Approval Time' (purchase to approval), 'Dispatch Time' (approval to carrier handover), and 'Last-Mile Delivery' (carrier to customer). We calculate these differences in seconds and divide by 3600.0 to get precise hour measurements, filtering strictly for 'delivered' orders that have all necessary timestamp data.
Secondly, we calculate the mean duration of each stage (stage_averages CTE): We aggregate the data by taking the average of the hours spent in each respective stage (avg_approval_hrs, avg_dispatch_hrs, avg_last_mile_hrs, and avg_total_hrs). Crucially, we include a WHERE clause here to filter out negative durations (>= 0), eliminating potential data entry errors or anomalies from skewing our averages.
Finally, we calculate the percentage contributions (Final SELECT): We convert our aggregated average hour metrics into average days (by dividing by 24.0) for easier human readability. To answer the core of the question—identifying the primary source of delays—we calculate the exact percentage each stage contributes to the total lead time (e.g., 100.0 * avg_last_mile_hrs / avg_total_hrs).

Show SQL Code

with order_lifecycle as (
    -- step 1: extract time intervals for each stage of the fulfillment process
    -- using hours (3600.0 seconds) to maintain precision for fast stages like approval
    select 
        order_id,
        -- stage 1: approval time (internal processing)
        approval_hours = datediff(second, order_purchase_timestamp, order_approved_at) / 3600.0,

        -- stage 2: dispatch time (warehouse / seller preparation)
        dispatch_hours = datediff(second, order_approved_at, order_delivered_carrier_date) / 3600.0,

        -- stage 3: last-mile delivery (carrier performance)
        last_mile_hours = datediff(second, order_delivered_carrier_date, order_delivered_customer_date) / 3600.0,

        -- total lead time
        total_lead_time_hours = datediff(second, order_purchase_timestamp, order_delivered_customer_date) / 3600.0
    from 
        orders
    where 
        order_status = 'delivered'
        and order_approved_at is not null
        and order_delivered_carrier_date is not null
        and order_delivered_customer_date is not null
),

stage_averages as (
    -- step 2: calculate mean durations across all orders
    select 
        avg_approval_hrs = avg(approval_hours),
        avg_dispatch_hrs = avg(dispatch_hours),
        avg_last_mile_hrs = avg(last_mile_hours),
        avg_total_hrs = avg(total_lead_time_hours)
    from 
        order_lifecycle
    where 
        -- filtering out negative values (data entry errors) to ensure clean averages
        approval_hours >= 0 and dispatch_hours >= 0 and last_mile_hours >= 0
)

-- final output: showing the "weight" of each stage in the delivery process
select 
    cast(avg_approval_hrs / 24.0 as decimal(10,2)) as avg_approval_days,
    cast(avg_dispatch_hrs / 24.0 as decimal(10,2)) as avg_dispatch_days,
    cast(avg_last_mile_hrs / 24.0 as decimal(10,2)) as avg_last_mile_days,
    cast(avg_total_hrs / 24.0 as decimal(10,2)) as avg_total_lead_days,

    -- contribution percentages
    cast(100.0 * avg_approval_hrs / avg_total_hrs as decimal(5,2)) as approval_pct_contribution,
    cast(100.0 * avg_dispatch_hrs / avg_total_hrs as decimal(5,2)) as dispatch_pct_contribution,
    cast(100.0 * avg_last_mile_hrs / avg_total_hrs as decimal(5,2)) as last_mile_pct_contribution
from 
    stage_averages;

Local vs Long-Haul Logistics Performance¶

14. How does logistics performance compare between 'Local Orders' (where customer_state matches seller_state) and 'Long-Haul Orders' (Cross-border)? Specifically, what is the 'Speed Premium' - the reduction in delivery days achieved by fulfilling orders locally?

Solution:

Firstly, we classify the fulfillment path (shipment_classification CTE): We join the orders, customers, order_items, and sellers tables to access both the customer's and seller's locations. We use a CASE statement to flag an order as 'Local' if the customer_state matches the seller_state, and 'Long-Haul' otherwise. We crucially use DISTINCT here to ensure that if a customer buys multiple items from the same seller in one order, we only evaluate the order's delivery timeline once.
Secondly, we calculate individual delivery metrics (performance_metrics CTE): For each classified order, we calculate the exact delivery_days (using seconds divided by 86400.0 for precision) and create a binary flag (is_on_time) by comparing the actual delivery date against the estimated delivery date.
Thirdly, we aggregate the performance statistics (summary_stats CTE): Grouping strictly by our new shipment_type, we calculate the total shipment_count, the avg_delivery_days, and the overall on_time_rate_pct.
Finally, we calculate the 'Speed Premium' (Final SELECT): We format the output fields and use the lag() window function to compare the two rows. By subtracting the current row's average delivery days from the previous row's (ordered descendingly so Long-Haul comes first), we isolate the exact speed_premium_days—the amount of time saved by fulfilling an order locally.

Show SQL Code

with shipment_classification as (
    -- step 1: link customers and sellers to classify the fulfillment path
    -- we use distinct to handle multiple items from the same seller in one order
    select distinct
        o.order_id,
        c.customer_state,
        s.seller_state,
        o.order_purchase_timestamp,
        o.order_delivered_customer_date,
        o.order_estimated_delivery_date,
        -- classification logic
        case 
            when c.customer_state = s.seller_state then 'Local' 
            else 'Long-Haul' 
        end as shipment_type
    from 
        orders o
    inner join 
        customers c on o.customer_id = c.customer_id
    inner join 
        order_items oi on o.order_id = oi.order_id
    inner join 
        sellers s on oi.seller_id = s.seller_id
    where 
        o.order_status = 'delivered'
        and o.order_delivered_customer_date is not null
),

performance_metrics as (
    -- step 2: calculate duration and on-time status for each shipment
    select 
        shipment_type,
        datediff(second, order_purchase_timestamp, order_delivered_customer_date) / 86400.0 as delivery_days,
        case when order_delivered_customer_date <= order_estimated_delivery_date then 1 else 0 end as is_on_time
    from 
        shipment_classification
),

summary_stats as (
    -- step 3: aggregate results by shipment type
    select 
        shipment_type,
        count(*) as shipment_count,
        avg(delivery_days) as avg_delivery_days,
        cast(100.0 * sum(is_on_time) / count(*) as decimal(5,2)) as on_time_rate_pct
    from 
        performance_metrics
    group by 
        shipment_type
)

-- final output: comparing the two groups and calculating the speed premium
select 
    shipment_type,
    shipment_count,
    cast(avg_delivery_days as decimal(10,2)) as avg_delivery_days,
    on_time_rate_pct,
    -- speed premium: calculating how much faster local is compared to long-haul
    -- we use a window function to find the difference between the two rows
    cast(
        lag(avg_delivery_days) over (order by avg_delivery_days desc) - avg_delivery_days 
    as decimal(10,2)) as speed_premium_days
from 
    summary_stats;

Freight Ratio Analysis by Product Category¶

15. What is the average 'Freight Ratio' (Freight Value / Product Price) across different product categories? Can we identify categories where shipping costs are disproportionately high (>20% of item value), leading to potential 'Cart Abandonment'?

Solution:

Firstly, we aggregate price and freight metrics per category (category_freight_stats CTE): We join the order_items and products tables to calculate the total revenue (sum(price)) and total freight costs (sum(freight_value)) for each product_category_name. We also calculate the average price and average freight cost per item, making sure to filter out any records with a null category.
Finally, we calculate the 'Freight Ratio' and classify abandonment risk (Final SELECT): We compute the freight_ratio_pct by dividing the total category freight by the total category revenue (using nullif to prevent division by zero). We then use a CASE statement to flag categories where shipping costs exceed 20% of the product value as 'High Abandonment Risk (>20%)' and those between 10% and 20% as 'Moderate Overhead'. This allows us to easily spot product lines where shipping fees might be disproportionately high and deterring customers.

Show SQL Code

with category_freight_stats as (
    -- step 1: aggregate total price and freight per category
    -- we use order_items directly as it contains both price and freight_value
    select 
        p.product_category_name,
        count(oi.order_item_id) as total_items_sold,
        avg(oi.price) as avg_product_price,
        avg(oi.freight_value) as avg_freight_cost,
        sum(oi.price) as total_category_revenue,
        sum(oi.freight_value) as total_category_freight
    from 
        order_items oi
    inner join 
        products p on oi.product_id = p.product_id
    where 
        p.product_category_name is not null
    group by 
        p.product_category_name
)

-- final output: calculating the ratio and risk levels
select 
    product_category_name,
    cast(avg_product_price as decimal(10,2)) as avg_price,
    cast(avg_freight_cost as decimal(10,2)) as avg_freight,
    -- freight ratio: what percentage of the product price is the shipping cost?
    cast(100.0 * total_category_freight / nullif(total_category_revenue, 0) as decimal(5,2)) as freight_ratio_pct,
    -- risk classification
    case 
        when (total_category_freight / nullif(total_category_revenue, 0)) > 0.20 then 'High Abandonment Risk (>20%)'
        when (total_category_freight / nullif(total_category_revenue, 0)) between 0.10 and 0.20 then 'Moderate Overhead'
        else 'Optimized / Low Shipping Impact'
    end as shipping_cost_impact
from 
    category_freight_stats
order by 
    freight_ratio_pct desc;