Friday, 23 August 2013

Creating a star schema from a single table - Initial Data Load

Here we would like to create a star schema when the source data is provided as a single file (CSV or Excel) or table.
We will use Oracle DB scripts here and point out the differences for SQL Server.

We assume the data from the file has already been moved to a table, i.e. a staging table. Here we will see how to load data from the staging table into the fact for the first time (Initial Load).

SALES_STG

Id number , -- Unique Id
Sales_date Date,
Store_code number, -- SQL Server: numeric; MySQL: int or decimal
Store_name Varchar2(512), -- SQL Server and MySQL: varchar
city_name Varchar2(512),
state_name Varchar2(512),
region_name Varchar2(512),
country_name Varchar2(512),
item_name varchar2(512), -- no item code in source (to show another kind of key), but item name is unique
Item_category varchar2(512),
Item_group varchar2(512),
Sales_group varchar2(512), -- sales group and type go in one dimension; the combination is the key
Sales_type varchar2(512),
Sales_qty number,
Sales_value number,
sales_price number,
cost number,
order_number number,
order_source varchar2(512)



For the region dimension we have store_code available as a key (a numeric key).
For Item there is no item code, but item name is unique.
For Sales Group and Type, the combination is unique.
We also need dimensions for Order Number and Order Source.


Creating Dimensions

In Oracle you can create a sequence per table for the primary key.
In SQL Server you can use an identity column, and in MySQL an auto-increment column.
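As a sketch of the three key mechanisms (the demo table and column names here are made up for illustration; treat the exact syntax as something to verify for your database version):

```sql
-- Oracle: a sequence supplies the key at insert time
CREATE SEQUENCE demo_dim_seq START WITH 1 INCREMENT BY 1 NOCACHE NOCYCLE;
CREATE TABLE demo_dim (dim_id NUMBER, dim_code NUMBER);
-- insert into demo_dim values (demo_dim_seq.nextval, 100);

-- SQL Server: IDENTITY generates the key automatically
-- CREATE TABLE demo_dim (dim_id INT IDENTITY(1,1), dim_code NUMERIC);

-- MySQL: AUTO_INCREMENT does the same
-- CREATE TABLE demo_dim (dim_id INT AUTO_INCREMENT PRIMARY KEY, dim_code INT);
```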

Dimensions needed are
  1. Region
  2. Item
  3. Sales Group and Type
  4. Order Number
  5. Order Source

Create Scripts

Staging
create table SALES_STG
(
Id number , -- Unique Id
Sales_date Date,
Store_code number, -- SQL Server: numeric; MySQL: int or decimal
Store_name Varchar2(512), -- SQL Server and MySQL: varchar
city_name Varchar2(512),
state_name Varchar2(512),
region_name Varchar2(512),
country_name Varchar2(512),
item_name varchar2(512), -- no item code in source (to show another kind of key), but item name is unique
Item_category varchar2(512),
Item_group varchar2(512),
Sales_group varchar2(512), -- sales group and type go in one dimension; the combination is the key
Sales_type varchar2(512),
Sales_qty number,
Sales_value number,
sales_price number,
cost number,
order_number number,
order_source varchar2(512)
);

Region

CREATE SEQUENCE dim_region_seq
 START WITH     1
 INCREMENT BY   1
 NOCACHE
 NOCYCLE;



create table dim_region
 (
Store_id Number , -- In case of sql server int IDENTITY(1,1),
Store_code number,
Store_name Varchar2(512),
city_name Varchar2(512),
state_name Varchar2(512),
region_name Varchar2(512),
country_name Varchar2(512)

);

Item
 CREATE SEQUENCE dim_item_seq
 START WITH     1
 INCREMENT BY   1
 NOCACHE
 NOCYCLE;

create table dim_item
(
item_id number,
item_name varchar2(512) ,
Item_category varchar2(512) ,
Item_group varchar2(512)
);

Sales Group Type

CREATE SEQUENCE Dim_sales_group_type_seq
 START WITH     1
 INCREMENT BY   1
 NOCACHE
 NOCYCLE;

create table Dim_sales_group_type
(
sales_group_type_id number,
Sales_group varchar2(512),
Sales_type varchar2(512)
);

Order Number

CREATE SEQUENCE dim_order_number_seq
 START WITH     1
 INCREMENT BY   1
 NOCACHE
 NOCYCLE;
create table dim_order_number
(
order_id number,
order_number number
);


Order Source

CREATE SEQUENCE dim_order_source_seq
 START WITH     1
 INCREMENT BY   1
 NOCACHE
 NOCYCLE;

create table dim_order_source
(
order_source_id number,
order_source varchar2(512)
);

We need to insert data into these tables. In Oracle you select sequence_name.nextval for the key. In SQL Server and MySQL you can leave out the identity/auto-increment column during insert, but then you must list the remaining target columns explicitly.

Region Insert

Oracle

insert into dim_region
select
 dim_region_seq.nextval ,
 STORE_CODE,
  STORE_NAME,
  CITY_NAME,
  STATE_NAME,
  REGION_NAME,
  COUNTRY_NAME from
  (
SELECT distinct
  STORE_CODE,
  STORE_NAME,
  CITY_NAME,
  STATE_NAME,
  REGION_NAME,
  COUNTRY_NAME
FROM SALES_STG) A ;

Sqlserver (the identity column is omitted, so the target columns are listed)
insert into dim_region
  (STORE_CODE, STORE_NAME, CITY_NAME, STATE_NAME, REGION_NAME, COUNTRY_NAME)
SELECT distinct
  STORE_CODE,
  STORE_NAME,
  CITY_NAME,
  STATE_NAME,
  REGION_NAME,
  COUNTRY_NAME
FROM SALES_STG ;

Item Insert 

insert into dim_item
select dim_item_seq.nextval item_id,
ITEM_NAME,
  ITEM_CATEGORY,
  ITEM_GROUP from (
SELECT distinct
  ITEM_NAME,
  ITEM_CATEGORY,
  ITEM_GROUP
FROM SALES_STG) A;

Sales Group Type Insert 

insert into Dim_sales_group_type
select Dim_sales_group_type_seq.nextval sales_group_type_id,
 SALES_GROUP,
  SALES_TYPE from (
SELECT distinct
  SALES_GROUP,
  SALES_TYPE
FROM SALES_STG) A ;

Order Number Insert

insert into dim_order_number
select dim_order_number_seq.nextval,
ORDER_NUMBER from (
SELECT distinct
  ORDER_NUMBER
FROM SALES_STG) A ;

Order Source Insert

insert into dim_order_source
select dim_order_source_seq.nextval ,
ORDER_SOURCE from (
SELECT distinct
  ORDER_SOURCE
FROM SALES_STG) A ;

Validation of keys for Region, Item, and Sales Group and Type.

These three queries should not return any rows:
select STORE_CODE, count(*)  from dim_region group by STORE_CODE having count(*) >1;
select ITEM_NAME, count(*)  from dim_item group by ITEM_NAME having count(*) >1;
select SALES_GROUP,SALES_TYPE, count(*)  from Dim_sales_group_type group by SALES_GROUP,SALES_TYPE having count(*) >1;

If they do return rows, there is an issue with the keys at the source.
The same check should be repeated for every dimension's source primary key, to make sure the data was populated correctly in each dimension.

Creating the fact

CREATE SEQUENCE fact_sales_seq
 START WITH     1
 INCREMENT BY   1
 NOCACHE
 NOCYCLE;


   CREATE TABLE FACT_SALES
   ( FACT_ID NUMBER,
STORE_ID NUMBER,
ITEM_ID NUMBER,
SALES_GROUP_TYPE_ID NUMBER,
ORDER_ID NUMBER,
ORDER_SOURCE_ID NUMBER,
SOURCE_ID NUMBER,
STORE_CODE NUMBER,
STORE_NAME VARCHAR2(512),
CITY_NAME VARCHAR2(512),
STATE_NAME VARCHAR2(512),
REGION_NAME VARCHAR2(512),
COUNTRY_NAME VARCHAR2(512),
ITEM_NAME VARCHAR2(512),
ITEM_CATEGORY VARCHAR2(512),
ITEM_GROUP VARCHAR2(512),
SALES_GROUP VARCHAR2(512),
SALES_TYPE VARCHAR2(512),
SALES_QTY NUMBER,
SALES_VALUE NUMBER,
SALES_PRICE NUMBER,
COST NUMBER,
ORDER_NUMBER NUMBER,
ORDER_SOURCE VARCHAR2(512),
ROW_TYPE CHAR(1),
UPDATE_AT DATE,
CREATED_AT DATE
   ) ;
 
insert into  fact_sales
SELECT
fact_sales_seq.nextval fact_id,
region.Store_id,
item.item_id,
type1.sales_group_type_id,
od.order_id,
os.order_source_id,
  stg.Id source_id,
  stg.STORE_CODE,
  stg.STORE_NAME,
  stg.CITY_NAME,
  stg.STATE_NAME,
  stg.REGION_NAME,
  stg.COUNTRY_NAME,
  stg.ITEM_NAME,
  stg.ITEM_CATEGORY,
  stg.ITEM_GROUP,
  stg.SALES_GROUP,
  stg.SALES_TYPE,
  stg.SALES_QTY,
  stg.SALES_VALUE,
  stg.SALES_PRICE,
  stg.COST,
  stg.ORDER_NUMBER,
  stg.ORDER_SOURCE,
  'N' row_type,
  sysdate update_at,
  sysdate created_at
FROM SALES_STG Stg
left join  dim_region region on (Stg.STORE_CODE =region.STORE_CODE)
left join  dim_item item on (Stg.ITEM_NAME =item.ITEM_NAME)
left join  Dim_sales_group_type type1 on (Stg.SALES_GROUP=type1.SALES_GROUP and Stg.SALES_TYPE =type1.SALES_TYPE)
left join  dim_order_number od on (stg.ORDER_NUMBER = od.ORDER_NUMBER)
left join  dim_order_source os on (stg.ORDER_SOURCE = os.ORDER_SOURCE) ;


As there is no summarization of the data, the row counts in the fact and staging tables should be the same:
select count(*) from SALES_STG should equal select count(*) from FACT_SALES.
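The two counts above can also be compared in a single sanity query (Oracle syntax, using dual):

```sql
-- Row counts should match, since the fact insert does no aggregation
select (select count(*) from SALES_STG)  as stg_rows,
       (select count(*) from FACT_SALES) as fact_rows
from dual;
```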

Few Points
  1. Ideally there should not be a left join in the fact insert. Instead, load the dimension from a select like SELECT distinct nvl(SALES_GROUP,'N/A') SALES_GROUP, nvl(SALES_TYPE,'N/A') SALES_TYPE FROM SALES_STG, and apply the same nvl() on the staging side of the join. This will make sure an inner join still gets all the rows.
  2. Instead of sysdate in the query, capture sysdate in a variable at the start of the ETL and use that variable afterwards, so every row of one run gets the same timestamps.
  3. We should always use an explicit column list in the insert. This will make sure the insert does not fail even if you add a new column to the table (a column where null is allowed).
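Point 1 can be sketched for the sales group/type dimension like this (the 'N/A' default member is one option, not the only one):

```sql
-- Load the dimension with nulls replaced by a default member
insert into Dim_sales_group_type
select Dim_sales_group_type_seq.nextval,
       SALES_GROUP,
       SALES_TYPE
from (select distinct
             nvl(SALES_GROUP, 'N/A') SALES_GROUP,
             nvl(SALES_TYPE,  'N/A') SALES_TYPE
      from SALES_STG) a;

-- In the fact insert, apply the same nvl() on the staging side,
-- so an inner join matches every staging row:
--   join Dim_sales_group_type type1
--     on nvl(stg.SALES_GROUP, 'N/A') = type1.SALES_GROUP
--    and nvl(stg.SALES_TYPE,  'N/A') = type1.SALES_TYPE
```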

Saturday, 10 August 2013

Analyzing the Query

When you are using a star schema for your warehouse or reporting system and you face data mismatch issues, how do you debug them? How do you find the root cause of the mismatch? Is the mismatch only at the query level, is the correct data missing from the warehouse, or was it never collected from the source?

We will consider the following warehouse star schema to understand the problem

  1. Fact_sales (row_id, Sales_date, Creation_date, Delivery_date, Sales_value,Del_value, Item_id, region_id , customer_id,order_no, emp_id)
  2. Dim_Customer (Customer_id , Customer_code, Customer_name)
  3. Dim_Item (item_id, item_code , Item_name, Item_cat_desc)
  4. Dim_region (region_id, region_code, city, state, country )
  5. Dim_Emp (emp_id , emp_code, rec_st_date, rec_end_date, emp_name, Dept,manager_name)

We assume that the reporting layer, or the queries behind the reports, use inner joins (to start with).

Source database 

  1. sales ( Sales_date, Creation_date, Delivery_date, Sales_value,Del_value, Item_code, region_code , customer_code,order_code, emp_code)
  2. Customer ( Customer_code, Customer_name)
  3. Item (item_code , Item_name, Item_cat_desc)
  4. Region (region_code, city, state, country )
  5. Emp ( emp_code, rec_st_date, rec_end_date, emp_name, Dept,manager_name)


All the warehouse tables have their own surrogate/new keys.


Case 1:
We get a mismatch when we take data for a month, say March 2013; it does not match the source.
select sum(Sales_value) from Fact_sales where Sales_date between '1-mar-2013' and '31-mar-2013'

We need to check:

  1. Are we using the same date column at both places (sales date vs sales date)?
  2. In the warehouse, is sales date a date or a datetime? Based on that, check queries like:
    1. select sum(Sales_value) from Fact_sales where Sales_date between '1-mar-2013' and '31-mar-2013 23:59:59'
    2. Or maybe it is truncated at the warehouse; check the source like:
    3. select sum(Sales_value) from sales where Sales_date between '1-mar-2013' and '31-mar-2013 23:59:59'
  3. If we still have a mismatch, check the rows collected. Hopefully you used left joins while populating the dimensional keys.
  4. If the data granularity is the same, you can check the row counts and get the diff.
  5. If the granularity is the same and you have a row mismatch, try changing the inner joins in the ETL to left/right joins based on the position of the main table (here, the sales table), one dimension table at a time, and try to find the reason.
  6. If the granularity is not the same, you may have to do this exercise by breaking up the queries.
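The row-count diff in step 4 can be sketched as a day-level comparison (Oracle syntax; assumes both tables are at the same grain):

```sql
-- Days where the source and the fact disagree on row count
select nvl(s.d, f.d) sales_day,
       nvl(s.cnt, 0) src_rows,
       nvl(f.cnt, 0) fact_rows
from (select trunc(Sales_date) d, count(*) cnt
      from sales group by trunc(Sales_date)) s
full outer join
     (select trunc(Sales_date) d, count(*) cnt
      from Fact_sales group by trunc(Sales_date)) f
  on s.d = f.d
where nvl(s.cnt, 0) <> nvl(f.cnt, 0);
```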
Case 2:
Data is the same for the time range, but for one view/group by, say City, the data is less/more.

If the data is less, we need to check:
  1. If the data is less but was correct for the time range, some value may be missing in the dimension.
  2. Take your fact-to-dim join, make it a left join from the fact, and select the keys from both tables:
    1. Select Fact_sales.region_id, Dim_region.region_id from Fact_sales left join Dim_region on (Fact_sales.region_id = Dim_region.region_id) where Dim_region.region_id is null
  3. If the above query returns rows, some keys are missing; use its output to find the missing key values.
  4. If it returns nothing, the fact might not have got keys for this dimension:
  5. Select * from Fact_sales where region_id is null
  6. If this query returns rows, check why the keys were missed during ETL.
  7. Once you find the reason for the mismatch, populate the fact data with the correct keys again.
  8. If the null values are there at the source too, you should consider a left join in the reporting layer.
When data is more:
  1. More data usually means the dimension has extra rows. Since we use a new surrogate key here, check whether the source key is duplicated:
  2. Select region_code, count(*) from dim_region group by region_code having count(*) > 1 order by 2 desc
  3. If this query returns rows, there is duplicate data in the dimension and it needs to be reloaded. The fact may also need reloading to get the correct keys.
  4. If it returns nothing, check whether the fact has duplicate rows. This can only happen if the monthly data query also returns more rows or a data mismatch, so just follow Case 1.
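A duplicate check on the fact itself might look like this (the grain assumed here, one row per order, item, and day, is an assumption; substitute your fact's real grain):

```sql
-- Fact rows appearing more than once at the assumed grain
select order_no, item_id, Sales_date, count(*)
from Fact_sales
group by order_no, item_id, Sales_date
having count(*) > 1;
```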
Case 3:
Data is wrong only for the Emp table.
Here the employee table is an SCD, so we have to check that the join and keys come through correctly.
The join should be something like:

Fact_sales left join Dim_emp on
( Fact_sales.emp_id = Dim_emp.emp_id and
  Fact_sales.sales_date between Dim_emp.rec_st_date and Dim_emp.rec_end_date )
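A common cause of SCD mismatches is overlapping validity ranges for the same employee, which makes one sales row match two versions. A sketch check (columns as in the Dim_Emp definition above):

```sql
-- Employee codes whose versions have overlapping date ranges
select distinct a.emp_code
from Dim_emp a
join Dim_emp b
  on a.emp_code = b.emp_code
 and a.emp_id <> b.emp_id
 and a.rec_st_date <= b.rec_end_date
 and b.rec_st_date <= a.rec_end_date;
```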

Case 4:
Rows are the same in the fact and source tables, but there is still a small mismatch across all dimensions.
Check the data types: is float data going into an int column, or is the number of decimal places different between source and target?

Case 5:
Data is the same for the base measures, but there is a mismatch in some formula.
Sum(Sales_value) is correct, Sum(Del_value) is correct,
but sum(Sales_value + Del_value) is not correct, or Sum(Sales_value) + Sum(Del_value) is not correct for some data view.

In this case null values can be the issue: null combined with any operation results in null, so the whole expression for that row drops out of the sum.
Try Sum( case when Sales_value is null then 0 else Sales_value end + case when Del_value is null then 0 else Del_value end )
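On Oracle the same null guard can be written more compactly with nvl() (coalesce() is the portable ANSI equivalent):

```sql
select sum(nvl(Sales_value, 0) + nvl(Del_value, 0))
from Fact_sales;
```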


Please note that these are general debugging methods and the author does not claim that they will resolve your issues, so please use your own discretion. They are just general guidelines based on experience, and any similarity with examples elsewhere is mere coincidence.