Analyzing the Query
When you are using a star schema for your warehouse or for reporting system and you faces data mismatch issues , then how to debug it; how to find the root cause for the mismatch. Is data mismatch is only at query level or correct data in not there in data warehouse or not collected from source.
We will consider the following warehouse star schema to understand the problem
We are assuming that reporting layer or queries for reports are using inner joins(to start with)
Source database
All the warehouse tables have their own surrogate/new keys.
Case 1:
We get a mismatch when we take data for a month , say march 2013. It does not match with source
select sum(sales) from Fact_sales where Sales_date between '1-mar-2013' and '31-mar-2013'
We need check
When you are using a star schema for your warehouse or for reporting system and you faces data mismatch issues , then how to debug it; how to find the root cause for the mismatch. Is data mismatch is only at query level or correct data in not there in data warehouse or not collected from source.
We will consider the following warehouse star schema to understand the problem
- Fact_sales (row_id, Sales_date, Creation_date, Delivery_date, Sales_value,Del_value, Item_id, region_id , customer_id,order_no, emp_id)
- Dim_Customer (Customer_id , Customer_code, Customer_name)
- Dim_Item (item_id, item_code , Item_name, Item_cat_desc)
- Dim_region (region_id, region_code, city, state, country )
- Dim_Emp (emp_id , emp_code, rec_st_date, rec_end_date, emp_name, Dept,manager_name)
We are assuming that reporting layer or queries for reports are using inner joins(to start with)
Source database
- sales ( Sales_date, Creation_date, Delivery_date, Sales_value,Del_value, Item_code, region_code , customer_code,order_code, emp_code)
- Customer ( Customer_code, Customer_name)
- Item (item_code , Item_name, Item_cat_desc)
- Region (region_code, city, state, country )
- Emp ( emp_code, rec_st_date, rec_end_date, emp_name, Dept,manager_name)
All the warehouse tables have their own surrogate/new keys.
Case 1:
We get a mismatch when we take data for a month , say march 2013. It does not match with source
select sum(sales) from Fact_sales where Sales_date between '1-mar-2013' and '31-mar-2013'
We need check
- Are we using the same date at both places (sales date vs sales date)
- In warehouse is sales date is date or datetime . Based on that check the query like
- select sum(sales) from Fact_sales where Sales_date between '1-mar-2013' and '31-mar-2013 23:59:59'
- Or may it is truncated at warehouse , check source like
- Select sum(sales) from sales where Sales_date between '1-mar-2013' and '31-mar-2013 23:59:59'
- In case we still have the mismatch check for rows collected . Hope you have used left join while populating the dimensional keys.
- If data granularity is same then you can check the row count and get the diff
- If granularity is same and you have row mismatch then try changing the inner joins in ETL to left/right based on main(here sales table) position , one by one for each dimension table and try to find the reason
- In case granularity is not same , you may have to do this exercise by breaking the queries.
Case 2:
Data is same for time range , but if one view/group by; say City data is Less/More
We need check if data is Less
- In case data is less and data was correct for time means that some value may be missing in dimension
- Take you fact and dim join , make it a left join for fact and select keys from both the tables
- Select Fact_sales.region_id, Dim_region.region_id from Fact_sales left join Dim_region on (Fact_sales.region_id=Dim_region.region_id ) where Dim_region.region_id is null
- If the above query return results it means you have some keys missing and you need to get the key values missing. Use output of this query to check missing keys.
- If the above case does not result any result , it may be case that fact might not have got some keys for this dimension
- Select * from Fact_sales where region_id is null
- If the above query gives result then check why keys missed out during ETL.
- If you find the reason for mismatch populate the fact data with correct keys again
- In case, null values are there at the sourcetoo , than you should consider having a left join in reporting layer
When data is more
- When data is more means dimension can have more data . As we have new key here , we need to check is the source key duplicated here.
- Select region_code , count(*) from dim_region group by region_code having count(*) >1 order by 2 desc
- In case 2nd query gives results , means there is duplicate data in dimension and dimension need to be reloaded. It may also mean fact reloading to get the correct keys
- In case query in step 2 does not return results. We need to check does the fact have duplicate rows. This case can only happen if the month data query also gives more rows or data mismatch. So just follow the case 1.
Case 3:
Data wrong only for Emp Table
Here we have made employee table SCD , we have to check that we have join and keys coming in correct manner
The Join should be something like
Fact_sales left join Dim_emp on
( fact_sales.emp_id = dim_emp, emp_id and
fact_sales.sales_date between dim_emp.rec_st_date and rec_end_date)
Case 4
Rows are same in fact and source table, still there is small mismatch across all dimensions
Check for data type. Is float data going into Int or different number of decimal precision at source and target.
Case 5
Data same for base measure but there is mismatch in some formula
Sum (Sales_value) is correct , Sum (Del_value) is correct
But sum(Sales_value+Del_value ) is not correct or Sum (Sales_value) + Sum (Del_value) is not correct for some data view.
In this case null value can be a issue. Usually null with any operation will result null
Try Sum( case when Sales_value is null then 0 else Sales_value end + case when Del_value is null then 0 else Del_value end )
Please note that these are kind of general debugging method and author does not claim that it can resolve your issues. So please use your own discretion. Also there are just general guideline based on experience and it may mere coincidence to he similar with some examples at other places.
why the fact table has a primary key (surrogate key)?
ReplyDeleteSir, It's a good article
ReplyDeleteThe Blog Was Great!!!! By reading this blog thoroughly one question arises in my mind…i.e. How to deal with confusing the components of “Roles and Functions.
ReplyDeleteThe Blog Was Great!!!! By reading this blog thoroughly one question arises in my mind…i.e. How to deal with confusing the components of “Roles and Functions.
ReplyDelete