Creating a star schema from single table.
Here we would like to create a star schema when source data
is provided in one file (csv or excel) or table.
We will use the oracle DB scripts here. And suggest difference for
sqlserever
Assuming the data from file has been moved to table, ie. Staging table. Here we will see how to load data form staging table to fact for first time(Initial Load).
SALES_STG
Id number , --Unique Id
Sales_date Date,
Store_code number, --sqlserver numeric , mysql float can be used
Store_name Varchar2(512), --
Sqlserver and Mysql only Varchar
city_name Varchar2(512),
state_name Varchar2(512),
region_name Varchar2(512),
country_name Varchar2(512),
item_name varchar2(512) ,-- Not
taking code to have another type of example, But Item Name is unique
Item_category varchar2(512) ,
Item_group varchar2(512) ,
Sales_group varchar2(512), --
Both sales group and type need to be in one dim , but combination is key
Sales_type varchar2(512),
Sales_qty number,
Sales_value number ,
sales_price number,
cost number,
order_number number,
order_source varchar2(512)
Data for region dimension we have store_code available as key, Numeric key.
For Item , there is no Item code , but item name is unique
Sales group and Type the combination is unique
Also need dimensions for Order Number and Order Source
Creating Dimensions
In case of oracle you can create a sequence for each table for primary key.
In case of oracle you can create a sequence for each table for primary key.
In case of sqlserver you can have identity column and in case of mysql you can have auto increment.
Dimension needed are
Create Scripts
Staging
We need to insert data into these table. In case of oracle you need to select sequence.nextval . In case of sqlserver and mysql you can leave the column during insert .
Region Insert
Oracle
insert into dim_region
select
dim_region_seq.nextval ,
STORE_CODE,
STORE_NAME,
CITY_NAME,
STATE_NAME,
REGION_NAME,
COUNTRY_NAME from
(
SELECT distinct
STORE_CODE,
STORE_NAME,
CITY_NAME,
STATE_NAME,
REGION_NAME,
COUNTRY_NAME
FROM SALES_STG) A ;
Sqlserver
insert into dim_region
SELECT distinct
STORE_CODE,
STORE_NAME,
CITY_NAME,
STATE_NAME,
REGION_NAME,
COUNTRY_NAME
FROM SALES_STG ;
Item Insert
insert into dim_item
select dim_item_seq.nextval item_id,
ITEM_NAME,
ITEM_CATEGORY,
ITEM_GROUP from (
SELECT distinct
ITEM_NAME,
ITEM_CATEGORY,
ITEM_GROUP
FROM SALES_STG) A;
Sales Group Type Insert
insert into Dim_sales_group_type
select Dim_sales_group_type_seq.nextval sales_group_type_id,
SALES_GROUP,
SALES_TYPE from (
SELECT distinct
SALES_GROUP,
SALES_TYPE
FROM SALES_STG) A ;
Order Number Insert
insert into dim_order_number
select dim_order_number_seq.nextval,
ORDER_NUMBER from (
SELECT distinct
ORDER_NUMBER
FROM SALES_STG) A ;
Order Source Insert
insert into dim_order_source
select dim_order_source_seq.nextval ,
ORDER_SOURCE from (
SELECT distinct
ORDER_SOURCE
FROM SALES_STG) A ;
Validation for keys in case of Region, Item and Sales Group and Type.
These three queries should not return any result
select STORE_CODE, count(*) from dim_region group by STORE_CODE having count(*) >1;
select ITEM_NAME, count(*) from dim_item group by ITEM_NAME having count(*) >1;
select SALES_GROUP,SALES_TYPE, count(*) from Dim_sales_group_type group by SALES_GROUP,SALES_TYPE having count(*) >1;
In case they return result , there is some issue with keys at source.
Same need to be repeated for all the dimensions for source primary key. To make sure that data populated correctly in dimension.
Creating the fact
CREATE SEQUENCE fact_sales_seq
START WITH 1
INCREMENT BY 1
NOCACHE
NOCYCLE;
CREATE TABLE FACT_SALES
( FACT_ID NUMBER,
STORE_ID NUMBER,
ITEM_ID NUMBER,
SALES_GROUP_TYPE_ID NUMBER,
ORDER_ID NUMBER,
ORDER_SOURCE_ID NUMBER,
SOURCE_ID NUMBER,
STORE_CODE NUMBER,
STORE_NAME VARCHAR2(512),
CITY_NAME VARCHAR2(512),
STATE_NAME VARCHAR2(512),
REGION_NAME VARCHAR2(512),
COUNTRY_NAME VARCHAR2(512),
ITEM_NAME VARCHAR2(512),
ITEM_CATEGORY VARCHAR2(512),
ITEM_GROUP VARCHAR2(512),
SALES_GROUP VARCHAR2(512),
SALES_TYPE VARCHAR2(512),
SALES_QTY NUMBER,
SALES_VALUE NUMBER,
SALES_PRICE NUMBER,
COST NUMBER,
ORDER_NUMBER NUMBER,
ORDER_SOURCE VARCHAR2(512),
ROW_TYPE CHAR(1),
UPDATE_AT DATE,
CREATED_AT DATE
) ;
insert into fact_sales
SELECT
fact_sales_seq.nextval fact_id,
region.Store_id,
item.item_id,
type1.sales_group_type_id,
od.order_id,
os.order_source_id,
stg.Id source_id,
stg.STORE_CODE,
stg.STORE_NAME,
stg.CITY_NAME,
stg.STATE_NAME,
stg.REGION_NAME,
stg.COUNTRY_NAME,
stg.ITEM_NAME,
stg.ITEM_CATEGORY,
stg.ITEM_GROUP,
stg.SALES_GROUP,
stg.SALES_TYPE,
stg.SALES_QTY,
stg.SALES_VALUE,
stg.SALES_PRICE,
stg.COST,
stg.ORDER_NUMBER,
stg.ORDER_SOURCE,
'N' row_type,
sysdate update_at,
sysdate created_at
FROM SALES_STG Stg
left join dim_region region on (Stg.STORE_CODE =region.STORE_CODE)
left join dim_item item on (Stg.ITEM_NAME =item.ITEM_NAME)
left join Dim_sales_group_type type1 on (Stg.SALES_GROUP=type1.SALES_GROUP and Stg.SALES_TYPE =type1.SALES_TYPE)
left join dim_order_number od on (stg.ORDER_NUMBER = od.ORDER_NUMBER)
left join dim_order_source os on(stg.ORDER_SOURCE = os.ORDER_SOURCE)
As there is no summarization happening for data , the rows in the fact and stg should be same
Dimension needed are
- Region
- Item
- Sales Group and Type
- Order Number
- Order Source
Create Scripts
Staging
| create table SALES_STG ( Id number , --Unique Id Sales_date Date, Store_code number, --sqlserver numeric , mysql float can be used Store_name Varchar2(512), -- Sqlserver and Mysql only Varchar city_name Varchar2(512), state_name Varchar2(512), region_name Varchar2(512), country_name Varchar2(512), item_name varchar2(512) ,-- Not taking code to have another type of example, But Item Name is unique Item_category varchar2(512) , Item_group varchar2(512) , Sales_group varchar2(512), -- Both sales group and type need in one dim , but combination is key Sales_type varchar2(512), Sales_qty number, Sales_value number , sales_price number, cost number, order_number number, order_source varchar2(512)) ; Region CREATE SEQUENCE dim_region_seq START WITH 1 INCREMENT BY 1 NOCACHE NOCYCLE; create table dim_region ( Store_id Number , -- In case of sql server int IDENTITY(1,1), Store_code number, Store_name Varchar2(512), city_name Varchar2(512), state_name Varchar2(512), region_name Varchar2(512), country_name Varchar2(512) ); Item CREATE SEQUENCE dim_item_seq START WITH 1 INCREMENT BY 1 NOCACHE NOCYCLE; create table dim_item ( item_id number, item_name varchar2(512) , Item_category varchar2(512) , Item_group varchar2(512) ); Sales Group Type CREATE SEQUENCE Dim_sales_group_type_seq START WITH 1 INCREMENT BY 1 NOCACHE NOCYCLE; create table Dim_sales_group_type ( sales_group_type_id number, Sales_group varchar2(512), Sales_type varchar2(512) ); Order Number CREATE SEQUENCE dim_order_number_seq START WITH 1 INCREMENT BY 1 NOCACHE NOCYCLE; create table dim_order_number ( order_id number, order_number number ); CREATE SEQUENCE dim_order_source_seq START WITH 1 INCREMENT BY 1 NOCACHE NOCYCLE; Order Source create table dim_order_source ( order_source_id number, order_source number ); |
We need to insert data into these table. In case of oracle you need to select sequence.nextval . In case of sqlserver and mysql you can leave the column during insert .
Region Insert
Oracle
insert into dim_region
select
dim_region_seq.nextval ,
STORE_CODE,
STORE_NAME,
CITY_NAME,
STATE_NAME,
REGION_NAME,
COUNTRY_NAME from
(
SELECT distinct
STORE_CODE,
STORE_NAME,
CITY_NAME,
STATE_NAME,
REGION_NAME,
COUNTRY_NAME
FROM SALES_STG) A ;
Sqlserver
insert into dim_region
SELECT distinct
STORE_CODE,
STORE_NAME,
CITY_NAME,
STATE_NAME,
REGION_NAME,
COUNTRY_NAME
FROM SALES_STG ;
Item Insert
insert into dim_item
select dim_item_seq.nextval item_id,
ITEM_NAME,
ITEM_CATEGORY,
ITEM_GROUP from (
SELECT distinct
ITEM_NAME,
ITEM_CATEGORY,
ITEM_GROUP
FROM SALES_STG) A;
Sales Group Type Insert
insert into Dim_sales_group_type
select Dim_sales_group_type_seq.nextval sales_group_type_id,
SALES_GROUP,
SALES_TYPE from (
SELECT distinct
SALES_GROUP,
SALES_TYPE
FROM SALES_STG) A ;
Order Number Insert
insert into dim_order_number
select dim_order_number_seq.nextval,
ORDER_NUMBER from (
SELECT distinct
ORDER_NUMBER
FROM SALES_STG) A ;
Order Source Insert
insert into dim_order_source
select dim_order_source_seq.nextval ,
ORDER_SOURCE from (
SELECT distinct
ORDER_SOURCE
FROM SALES_STG) A ;
Validation for keys in case of Region, Item and Sales Group and Type.
These three queries should not return any result
select STORE_CODE, count(*) from dim_region group by STORE_CODE having count(*) >1;
select ITEM_NAME, count(*) from dim_item group by ITEM_NAME having count(*) >1;
select SALES_GROUP,SALES_TYPE, count(*) from Dim_sales_group_type group by SALES_GROUP,SALES_TYPE having count(*) >1;
In case they return result , there is some issue with keys at source.
Same need to be repeated for all the dimensions for source primary key. To make sure that data populated correctly in dimension.
Creating the fact
CREATE SEQUENCE fact_sales_seq
START WITH 1
INCREMENT BY 1
NOCACHE
NOCYCLE;
CREATE TABLE FACT_SALES
( FACT_ID NUMBER,
STORE_ID NUMBER,
ITEM_ID NUMBER,
SALES_GROUP_TYPE_ID NUMBER,
ORDER_ID NUMBER,
ORDER_SOURCE_ID NUMBER,
SOURCE_ID NUMBER,
STORE_CODE NUMBER,
STORE_NAME VARCHAR2(512),
CITY_NAME VARCHAR2(512),
STATE_NAME VARCHAR2(512),
REGION_NAME VARCHAR2(512),
COUNTRY_NAME VARCHAR2(512),
ITEM_NAME VARCHAR2(512),
ITEM_CATEGORY VARCHAR2(512),
ITEM_GROUP VARCHAR2(512),
SALES_GROUP VARCHAR2(512),
SALES_TYPE VARCHAR2(512),
SALES_QTY NUMBER,
SALES_VALUE NUMBER,
SALES_PRICE NUMBER,
COST NUMBER,
ORDER_NUMBER NUMBER,
ORDER_SOURCE VARCHAR2(512),
ROW_TYPE CHAR(1),
UPDATE_AT DATE,
CREATED_AT DATE
) ;
insert into fact_sales
SELECT
fact_sales_seq.nextval fact_id,
region.Store_id,
item.item_id,
type1.sales_group_type_id,
od.order_id,
os.order_source_id,
stg.Id source_id,
stg.STORE_CODE,
stg.STORE_NAME,
stg.CITY_NAME,
stg.STATE_NAME,
stg.REGION_NAME,
stg.COUNTRY_NAME,
stg.ITEM_NAME,
stg.ITEM_CATEGORY,
stg.ITEM_GROUP,
stg.SALES_GROUP,
stg.SALES_TYPE,
stg.SALES_QTY,
stg.SALES_VALUE,
stg.SALES_PRICE,
stg.COST,
stg.ORDER_NUMBER,
stg.ORDER_SOURCE,
'N' row_type,
sysdate update_at,
sysdate created_at
FROM SALES_STG Stg
left join dim_region region on (Stg.STORE_CODE =region.STORE_CODE)
left join dim_item item on (Stg.ITEM_NAME =item.ITEM_NAME)
left join Dim_sales_group_type type1 on (Stg.SALES_GROUP=type1.SALES_GROUP and Stg.SALES_TYPE =type1.SALES_TYPE)
left join dim_order_number od on (stg.ORDER_NUMBER = od.ORDER_NUMBER)
left join dim_order_source os on(stg.ORDER_SOURCE = os.ORDER_SOURCE)
As there is no summarization happening for data , the rows in the fact and stg should be same
Select count(*) from SALES_STG should be same as select count(*) from FACT_SALES.
Few Points
- Ideally there should not be a left join (in fact insert) , we should have select like SELECT distinct nvl(SALES_GROUP,'N/A') SALES_GROUP, nvl(SALES_TYPE,'N/A') SALES_TYPE FROM SALES_STG .This will make sure inner join gets all the rows.
- Instead of sydate in query , sysdate should be put in some variable at the start of ETL and that variable can be used after that.
- We should always use list of the columns in insert list. This will make sure that insert will not fail even if you add new column to DB(column where null is allowed).
No comments:
Post a Comment