Data Preparation¶
Data Ingestion¶
The first step in our analysis process is data ingestion, where we will load the sales data into our analysis environment. The sales data for Adventureworks is provided in a .bak file, which is a backup file format used by SQL Server. We will restore this .bak file to our SQL Server instance running through Docker, and then connect to the database using Dbeaver to access the sales data for analysis.
Firstly, as I am using Docker to run SQL Server, I should activate the Docker container of SQL Server using the following commands in the terminal:
Note: I am running Docker on a Linux system - Fedora Linux, so I need to use sudo to run the Docker commands. If you are using Docker on Windows or Mac, you will not need to use sudo.
sudo systemctl start docker # Start the Docker service
sudo docker start sqlserver # Start the SQL Server container
sudo docker ps # Check if the container is running
After starting the Docker container, we will restore the .bak file to the SQL Server instance. After downloading the .bak file from the provided link, we should place it in a directory that is accessible to the Docker container.
docker cp AdventureWorksDW2022.bak 80ac0912ffb9:/var/opt/mssql/data/AdventureWorksDW2022.bak # Copy the .bak file to the Docker container
docker exec -it 80ac0912ffb9 ls -lh /var/opt/mssql/data/AdventureWorksDW2022.bak # Verify that the file has been copied successfully
Now we are ready to restore the .bak file to the SQL Server instance. We can do this using Dbeaver by connecting to the SQL Server instance and running the following SQL command:
This will give us the logical file names of the data and log files contained in the .bak file. We will use these logical file names to restore the database to our SQL Server instance.
RESTORE DATABASE [AdventureWorksDW2022]
FROM DISK = '/var/opt/mssql/data/AdventureWorksDW2022.bak'
WITH
MOVE 'AdventureWorksDW2022' TO '/var/opt/mssql/data/AdventureWorksDW2022.mdf',
MOVE 'AdventureWorksDW2022_log' TO '/var/opt/mssql/data/AdventureWorksDW2022_log.ldf',
RECOVERY,
REPLACE;
Finally, we will see the restored database in Dbeaver and we can start analyzing the sales data using SQL queries. We will connect to the AdventureWorksDW2022 database in Dbeaver and explore the tables and data to understand the structure of the sales data and prepare it for analysis.
Data Cleaning¶
After successfully restoring the sales data to our SQL Server instance, the next step in our analysis process is data cleaning. Data cleaning is a crucial step in any data analysis project, as it ensures that the data we are working with is accurate, consistent, and ready for analysis.
Firstly, we should define the database and schema we will be working with in our SQL queries. In this case, we will be working with the AdventureWorksDW2022 database and the dbo schema, which contains the sales data we need for our analysis.
Restored AdventureWorksDW2022 database contains a lot of tables, representing both Business to Customer (B2C) and Business to Business (B2B) sales data. B2C sales data shows the sales made directly to individual customers, while B2B sales data shows the sales made to other markets-sellers. For our analysis, we will focus on the B2C sales data, which is more relevant for understanding customer behavior and preferences. Therefore, we should identify the tables that contain B2C sales data and focus our cleaning efforts on those tables.
drop table if exists AdventureWorksDWBuildVersion,DatabaseLog,DimAccount,DimOrganization,
DimScenario,FactAdditionalInternationalProductDescription,FactCallCenter,FactCurrencyRate,
FactFinance,FactSalesQuota,FactSurveyResponse,NewFactCurrencyRate,ProspectiveBuyer; FactResellerSales,FactResellerSalesReason,FactResellerSalesSummary
go
After running the above SQL command, we will encounter the error message which indicates that the tables can not be dropped because they are referenced by foreign key constraints. This is because there are relationships between the tables in the database, and dropping one table may affect the integrity of the data in other tables. To resolve this issue, we should disable/remove all the foreign key constraints in the database before dropping the tables. After dropping the unnecessary tables, we can re-enable the foreign key constraints to maintain data integrity.
-- Disable all foreign key constraints in the database
DECLARE @sql NVARCHAR(MAX) = ''; -- Initialize an empty string to hold the SQL commands to disable all foreign key constraints
SELECT @sql +=
'ALTER TABLE [' + OBJECT_SCHEMA_NAME(parent_object_id) + '].['
+ OBJECT_NAME(parent_object_id) + '] DROP CONSTRAINT ['
+ name + '];' + CHAR(13) -- Append the SQL command to disable the foreign key constraint to the @sql variable
FROM sys.foreign_keys; -- Query the sys.foreign_keys system catalog view to get the names of all foreign key constraints in the database and append the SQL command to disable each constraint to the @sql variable
EXEC sp_executesql @sql; -- Execute the SQL commands to disable all foreign key constraints in the database
Now if you run the previous SQL command to drop the unnecessary tables, you will see that the tables are dropped successfully without any errors. After dropping the unnecessary tables, we can now analyze each table individually to identify the necessary columns for our analysis and remove any unnecessary columns that are not relevant to our analysis. This will help us to focus on the relevant data and make our analysis more efficient.
Customer Dimension Table¶
Firstly, analyze the first table, which is the DimCustomer table. This table contains information about the customers, such as their demographics and contact information. We will identify the necessary columns for our analysis and remove any unnecessary columns that are not relevant to our analysis.
Firstly, we will look at the structure of the DimCustomer table to understand the columns it contains and their data types. We can do this by running the following SQL command:
select top 10 * from dimcustomer; -- View the first 10 rows of the DimCustomer table to understand its structure and the columns it contains
The raw tables will not be given here visually, since they are not cleaned yet, and will make the webpage look messy. I will clean the tables in the next step and show the cleaned tables in the next section. You can see the raw - original, uncleaned tables by following the Google Spreadsheet link given in the sources section of this project.
After analyzing the first 10 rows of the DimCustomer table, we can identify the necessary columns for our analysis and remove any unnecessary columns that are not relevant to our analysis. I dropped the following columns from the DimCustomer table:
alter table dimcustomer
drop column title, middlename,namestyle, suffix, emailaddress,
spanisheducation, frencheducation, spanishoccupation, frenchoccupation,
addressline1, addressline2, phone;
DimCustomer table that contains only the necessary columns for our analysis.
Geography Dimension Table¶
Next, we will analyze the DimGeography table, which contains information about the geographical locations of the customers, such as their city, state, and country. We will identify the necessary columns for our analysis and remove any unnecessary columns that are not relevant to our analysis.
Product Dimension Table¶
Next, we will analyze the DimProduct table, which contains information about the products that were sold, such as their names, categories, and prices. We will identify the necessary columns for our analysis and remove any unnecessary columns that are not relevant to our analysis.
alter table dimproduct
drop column
englishdescription, spanishproductname, frenchproductname, color, safetystocklevel,
reorderpoint, largephoto, frenchdescription, chinesedescription, arabicdescription,
hebrewdescription, thaidescription, germandescription, japanesedescription, turkishdescription;
Other Tables¶
We will also take all other tables into consideration and analyze them individually to identify the necessary columns for our analysis and remove any unnecessary columns that are not relevant to our analysis. I am giving the full SQL command to clean all those tables here.
Show SQL Code
-- DATE DIMENSION
alter table dimdate
drop column
spanishdaynameofweek, frenchdaynameofweek, spanishmonthname, frenchmonthname;
-- EMPLOYEES
alter table DimEmployee
drop column
parentemployeenationalidalternatekey,
middlename, namestyle, loginid, emailaddress, phone,
emergencycontactname, emergencycontactphone, employeephoto;
-- PRODUCT CATEGORY
alter table dimproductcategory
drop column
spanishproductcategoryname, frenchproductcategoryname;
-- PRODUCT SUBCATEGORY
alter table DimProductSubcategory
drop column
spanishproductsubcategoryname, frenchproductsubcategoryname;
-- SALES TERRITORY
alter table DimSalesTerritory
drop column
salesterritoryimage;
-- FACT INTERNET SALES
alter table FactInternetSales
drop column
promotionkey, carriertrackingnumber, customerponumber;
After running the above SQL commands, we will have cleaned tables that contain only the necessary columns for our analysis. This will help us to focus on the relevant data and make our analysis more efficient. We can now proceed to the next step of our analysis process, which is data modelling, where we will create foreign key relationships between the tables to create a star schema for our analysis.
Data Modelling¶
Now we should create foreign key relationships between the tables to create a star schema for our analysis. A star schema is a type of database schema that is commonly used in data warehousing and business intelligence applications. It consists of a central fact table that contains the measures of interest, and several dimension tables that contain the attributes related to those measures.
In our database, the FactInternetSales table will serve as our central fact table, which contains the measures of interest such as sales amount, quantity, and discount. The dimension tables that we will use in our star schema include DimCustomer, DimGeography, DimProduct, DimDate, DimEmployee, DimProductCategory, DimProductSubcategory, and DimSalesTerritory.
Creating a foreign key relationship between the fact table and the dimension tables will allow us to join the tables together and perform analysis on the sales data based on the attributes in the dimension tables. It allows us to efficient ly query the data and perform aggregations and calculations based on the measures in the fact table and the attributes in the dimension tables. Creationg a foreign key is simply done by running the following SQL command:
alter table factinternetsales
add constraint FK_InternetSales_Products
foreign key (ProductKey)
references dimproduct(ProductKey);
With the above code, we have created a foreign key called FK_InternetSales_Products that references the ProductKey column in the dimproduct table from the ProductKey column in the factinternetsales table. We will repeat this process to create foreign key relationships between the fact table and all the dimension tables to complete our star schema for analysis.
For simplicity, I am giving the full SQL command to create all the necessary foreign key relationships between the fact table and the dimension tables here, as it is a straightforward process and does not require any complex logic.
Show SQL Code
-- internetsales and calendar (orderdate, duedate, shipdate)
alter table factinternetsales
add constraint FK_InternetSales_Date_Orderdate
foreign key (orderdatekey)
references dimdate(datekey);
go
alter table factinternetsales
add constraint FK_InternetSales_Date_Duedate
foreign key (duedatekey)
references dimdate(datekey);
go
alter table factinternetsales
add constraint FK_InternetSales_Date_Shipdate
foreign key (shipdatekey)
references dimdate(datekey);
go
-- internetsales and customers
alter table factinternetsales
add constraint FK_InternetSales_Customers
foreign key (customerkey)
references dimcustomer(customerkey);
go
-- internetsales and currency
alter table factinternetsales
add constraint FK_InternetSales_Currency
foreign key (currencykey)
references dimcurrency(currencykey);
go
-- internetsales and salesterritory
alter table factinternetsales
add constraint FK_InternetSales_SalesTerritory
foreign key (salesterritorykey)
references dimsalesterritory(salesterritorykey);
go
-- dimcustomer and dimgeography
alter table dimcustomer
add constraint FK_Customer_Geography
foreign key (geographykey)
references dimgeography(geographykey);
go
-- dimemployee and dimsalesterritory
alter table dimemployee
add constraint FK_Employee_SalesTerritory
foreign key (salesterritorykey)
references dimsalesterritory(salesterritorykey);
go
-- dimgeography and dimsalesterritory
alter table dimgeography
add constraint FK_Geography_SalesTerritory
foreign key (salesterritorykey)
references dimsalesterritory(salesterritorykey);
go
-- dimproduct and dimproductsubcategory
alter table dimproduct
add constraint FK_Product_ProductSubcategory
foreign key (productsubcategorykey)
references dimproductsubcategory(productsubcategorykey);
go
-- dimproductsubcategory and dimproductcategory
alter table dimproductsubcategory
add constraint FK_ProductSubcategory_ProductCategory
foreign key (productcategorykey)
references dimproductcategory(productcategorykey);
go
-- factinternetsalesreason and salesreason
alter table factinternetsalesreason
add constraint FK_InternetSalesReason_SalesReason
foreign key (salesreasonkey)
references salesreason(salesreasonkey);
go
-- FACTPRODUCTINVENTORY - another fact table that contains inventory data, which can be used for analysis related to inventory levels and stock management. We will create foreign key relationships between the `factproductinventory` table and the relevant dimension tables to integrate the inventory data into our star schema for analysis.
-- factproductinventory and dimproduct
alter table factproductinventory
add constraint FK_ProductInventory_Products
foreign key (productkey)
references dimproduct(productkey);
go
-- factproductinventory and dimdate
alter table factproductinventory
add constraint FK_ProductInventory_Date
foreign key (datekey)
references DimDate(datekey);
go
With the above SQL commands, we have created all the necessary foreign key relationships between the fact table and the dimension tables to complete our star schema for analysis. This will allow us to efficiently query the data and perform aggregations and calculations based on the measures in the fact table and the attributes in the dimension tables.
For simplicity and readilibity, I have changed the names of the tables and columns to more user-friendly names. For example, I have changed the name of the DimCustomer table to Customers, DimProduct table to Products, and DimDate table to Calendar. I have also changed the names of the columns in the tables to more user-friendly names. This will make it easier for us to write SQL queries and understand the data when we perform our analysis in the next step.
By writing the following SQL commands, we can rename the tables and columns to more user-friendly names:
-- Rename tables
exec sp_rename 'DimCustomer', 'Customers';
exec sp_rename 'DimProduct', 'Products';
exec sp_rename 'DimDate', 'Calendar';
exec sp_rename 'DimEmployee', 'Employees';
exec sp_rename 'DimProductCategory', 'ProductCategory';
exec sp_rename 'DimProductSubcategory', 'ProductSubcategory';
exec sp_rename 'DimSalesTerritory', 'SalesTerritory';
Business Questions¶
To perform our deep analysis, we need to define some business questions that will guide our analysis and help us to focus on the relevant data. Business questions are specific questions that we want to answer through our analysis, and they should be based on the goals and objectives of the business. I used LLM and AI tools to help me generate real world business questions that are relevant to the sales data we have in our database. These business questions will help us to understand customer behavior, identify trends and patterns in the sales data, and make informed business decisions based on our analysis.
I will note all questions here and then in the next pages, I will answer each question one by one using SQL queries and data analysis techniques.
1. What is the overall Gross Profit and Gross Margin for the B2C channel, and how has it trended year over year?
Business Insight: This question helps us to understand the profitability of the B2C channel and how it has changed over time. Gross Profit and Gross Margin are key financial metrics that indicate the profitability of sales, and analyzing their trends can provide insights into the performance of the B2C channel and help identify areas for improvement.
2. What is the Year-over-Year (YoY) revenue growth rate for each month?
Business Insight: This question helps us to understand the growth rate of revenue for each month and identify any seasonal trends or patterns in the sales data. Analyzing YoY revenue growth can provide insights into the performance of the business and help identify opportunities for growth and improvement.
3. What is the monthly seasonality of our sales, and which quarter drives the most revenue historically?
Business Insight: This question helps us to understand the seasonal patterns in our sales data and identify which quarter drives the most revenue historically. Analyzing seasonality can provide insights into customer behavior and help us to optimize our marketing and sales strategies to capitalize on peak sales periods.
4. What are the top 10 products by revenue, and how has their performance changed over time?
Business Insight: This question helps us to identify the top-performing products in terms of revenue and analyze how their performance has changed over time. Understanding which products are driving the most revenue can help us to focus our marketing and sales efforts on those products and identify opportunities for growth and improvement.
5. What is the Year-to-Date (YTD) cumulative revenue for the most recent fully completed year?
Business Insight: This question helps us to understand the cumulative revenue for the most recent fully completed year and compare it to previous years. Analyzing YTD cumulative revenue can provide insights into the overall performance of the business and help us to identify trends and patterns in our sales data.
6. Which demographic segments (based on Income and Education) drive the highest overall revenue?
Business Insight: This question helps us to understand which demographic segments, based on income and education, are driving the highest overall revenue. Analyzing demographic segments can provide insights into customer behavior and preferences, and help us to tailor our marketing and sales strategies to target the most profitable customer segments.
7. Does the customer's commute distance and vehicle ownership impact their lifetime value?
Business Insight: This question helps us to understand if the customer's commute distance and vehicle ownership impact their lifetime value. Analyzing the relationship between commute distance and vehicle ownership with customer lifetime value can provide insights into customer behavior and preferences, and help us to identify opportunities for improving customer retention and increasing customer lifetime value.
8. Who are our "Whales" (Top 100 Most Valuable Customers), and what is their purchasing frequency?
Business Insight: This question helps us to identify our "Whales", which are the top 100 most valuable customers, and analyze their purchasing frequency. Understanding who our most valuable customers are and how frequently they make purchases can provide us with insights which may be helpful to tailor our marketing and sales strategies to retain these customers and encourage repeat purchases, ultimately increasing customer lifetime value and driving revenue growth.
9. What is our generational spend distribution (Age Brackets)?
Business Insight: This question helps us to understand our generational spend distribution based on age brackets. Age can be an important factor in customer behavior and preferences, and analyzing our generational spend distribution can provide insights into which age groups are driving the most revenue. As a result, we can create targeted marketing and sales strategies for each age group to optimize our revenue and improve customer engagement.
10. What is our Customer Retention Rate (One-and-Done vs. Repeat Buyers)?
Customers are one-and-done if they made only one purchase and never returned, while repeat buyers are customers who made multiple purchases over time. Defining one-and-done customers and repeat buyers can help us to understand our customer retention rate and identify opportunities for improving customer retention. This will allow us why some customers are not returning and what we can do to encourage them to make repeat purchases, ultimately increasing customer lifetime value and driving revenue growth.
11. Which product categories and subcategories drive the highest profit margins, and which are drag on profitability?
Business Insight: This question asks us to analyze profit margins which is a key financial metric that indicates the profitability of products. Profit margin is calculated as (Revenue - Cost) / Revenue, and it helps us to understand how much profit we are making on each product. By analyzing which product categories and subcategories drive the highest profit margins, we can identify which products are the most profitable and focus our marketing and sales efforts on those products. Conversely, by identifying which product categories and subcategories are a drag on profitability, we can identify opportunities for improvement, such as reducing costs or increasing prices, or putting promotions for those products to improve our overall profitability.
12. What is the disparity between our "High Volume" products and our "High Value" products?
Business Insight: This question helps us to understand the disparity between our "High Volume" products, which are products that sell in large quantities but may have lower profit margins, and our "High Value" products, which are products that sell in smaller quantities but have higher profit margins. Analyzing the disparity between high volume and high value products can provide insights into our product mix and help us to optimize our inventory and marketing strategies to maximize revenue and profitability.
13. What is our basket size distribution, and how frequently do customers buy multiple items in a single transaction?
Business Insight: Basket size distribution refers to the number of items that customers purchase in a single transaction. Analyzing our basket size distribution can provide insights into customer purchasing behavior and help us to identify opportunities for increasing our average order value. By understanding how frequently customers buy multiple items in a single transaction, we can create targeted marketing and sales strategies to encourage customers to add more items to their baskets, ultimately increasing revenue and profitability.
14. Are there any "Dead Weight" products in our active catalog that have negligible sales in the most recent year?
Business Insight: "Dead Weight" products refer to products that are still active in our catalog but have negligible sales in the most recent year. Analyzing our product catalog to identify any dead weight products can help us to optimize our inventory and focus our marketing and sales efforts on products that are driving revenue. By identifying and potentially discontinuing dead weight products, we can reduce costs associated with inventory management and improve our overall profitability.
15. How does product pricing tier affect consumer purchasing behavior?
Business Insight: Product pricing tiers mean that we have different price points for our products, such as low, medium, and high price tiers. Price tier is the price point at which a product is sold, and it can have a significant impact on consumer purchasing behavior. It shows directly how price-sensitive our customers are and how they respond to different price points. By analyzing how product pricing tier affects consumer purchasing behavior, we can gain insights into our customers' price sensitivity and optimize our pricing strategy to maximize revenue and profitability. For example, if we find that customers are more likely to purchase products in the medium price tier, we can focus our marketing efforts on promoting those products and potentially adjust our pricing strategy to encourage more purchases in that tier.
16. What are the primary psychological drivers (Sales Reasons) behind our B2C purchases?
Business Insight: Sales reasons refer to the primary psychological drivers that influence customers' purchasing decisions. Analyzing the sales reasons behind our B2C purchases can provide insights into customer behavior and preferences, and help us to tailor our marketing and sales strategies to better meet the needs and desires of our customers. By understanding the primary psychological drivers behind our B2C purchases, we can create targeted marketing campaigns that resonate with our customers and encourage them to make purchases, ultimately increasing revenue and profitability.
17. What percentage of our total B2C volume is driven by promotional discounts versus standard pricing?
Business Insight: This question helps us to understand the impact of promotional discounts on our B2C sales volume. By analyzing the percentage of our total B2C volume that is driven by promotional discounts versus standard pricing, we can gain insights into the effectiveness of our promotional strategies and how they influence customer purchasing behavior. This information can help us to optimize our promotional campaigns and pricing strategy to maximize revenue and profitability. For example, if we find that a significant portion of our sales volume is driven by promotional discounts, we may want to consider adjusting our pricing strategy or creating more targeted promotions to encourage customers to purchase at standard pricing, ultimately improving our profit margins.
18. What is our Customer Lifetime Value (CLV) distribution, and how long does it take for a customer to reach their maximum value?
Business Insight: Customer Lifetime Value (CLV) is a key metric that estimates the total revenue a business can expect from a single customer over the course of their relationship. Analyzing our CLV distribution can provide insights into the value of our customers and help us to identify opportunities for improving customer retention and increasing customer lifetime value. By understanding how long it takes for a customer to reach their maximum value, we can create targeted marketing and sales strategies to encourage customers to make repeat purchases and increase their lifetime value, ultimately driving revenue growth and profitability.
19. Do the primary sales drivers change depending on the Product Category?
Business Insight: This question helps us to understand if the primary sales drivers, which are the factors that influence customers' purchasing decisions, change depending on the product category. Analyzing the primary sales drivers for different product categories can provide insights into customer behavior and preferences, and help us to tailor our marketing and sales strategies for each product category to better meet the needs and desires of our customers. By understanding if the primary sales drivers change depending on the product category, we can create more targeted marketing campaigns that resonate with our customers and encourage them to make purchases, ultimately increasing revenue and profitability.
20. Which countries and regions drive the most B2C revenue, and how does the Average Order Value (AOV) differ geographically?*
Business Insight: This question helps us to understand which countries and regions are driving the most B2C revenue, and how the Average Order Value (AOV) differs geographically. Analyzing our sales data by geography can provide insights into customer behavior and preferences in different regions, and help us to tailor our marketing and sales strategies to better meet the needs of customers in those regions. By understanding which countries and regions drive the most B2C revenue, we can focus our marketing efforts on those areas to maximize revenue. Additionally, by analyzing how AOV differs geographically, we can identify opportunities for optimizing our pricing strategy and product offerings in different regions to increase revenue and profitability.
21. Are we meeting our operational Service Level Agreements (SLAs)? What is the average lead time between Order Date and Ship Date?
Business Insight: Service Level Agreements (SLAs) will show us the operational performance of our business in terms of order fulfillment. Analyzing the average lead time between Order Date and Ship Date can provide insights into our operational efficiency and help us to identify areas for improvement in our order fulfillment process. By understanding if we are meeting our SLAs and analyzing the lead time between Order Date and Ship Date, we can optimize our operations to improve customer satisfaction and retention, ultimately driving revenue growth and profitability.
22. How heavily do freight and tax costs impact the final landed cost for the consumer across different countries?
Business Insight: This question helps us to understand the impact of freight and tax costs on the final landed cost for the consumer across different countries. Analyzing the impact of freight and tax costs can provide insights about the tax and logistics landscape in different regions, and help us to optimize our pricing strategy and supply chain management to minimize costs for our customers. By understanding how heavily freight and tax costs impact the final landed cost for the consumer across different countries, we can identify opportunities for reducing costs and improving our competitiveness in those markets, ultimately increasing revenue and profitability.
23. Is there a regional bias for specific Product Categories?
Business Insight: Regional bias for specific product categories refers to the tendency for certain product categories to perform better in specific regions. For example, customers in the rural areas may purchase largely from the "Mountain Bikes" category, while customers in urban areas may purchase more from the "Road Bikes" category. Analyzing regional bias for specific product categories can provide insights into customer behavior and preferences in different regions, and help us to tailor our marketing and sales strategies to better meet the needs of customers in those regions. By understanding if there is a regional bias for specific product categories, we can create more targeted marketing campaigns that resonate with our customers and encourage them to make purchases, ultimately increasing revenue and profitability.
Conclusion¶
In this section, we have prepared our sales data for analysis by restoring the .bak file to our SQL Server instance, cleaning the data by removing unnecessary columns and tables, and creating a star schema by establishing foreign key relationships between the fact table and dimension tables. We have also defined 23 business questions that will guide our analysis and help us to focus on the relevant data to gain insights into customer behavior, identify trends and patterns in the sales data, and make informed business decisions based on our analysis. In the next sections, we will answer each of these business questions one by one using SQL queries and data analysis techniques.