Hey guys how are you? Continuing our SQL series (S01EP13), today I'll share a very handy little query that I use very often to check for data duplication. In fact, this would be an upgraded version of ODI's PK check.
An upgraded version because in ODI, if you enable the PK check and it finds duplicates, it eliminates both rows. With the code I'll show you, you can choose whether to keep the most recently created duplicate or the oldest one, and only one of them will be eliminated.
I have a test table with these values:
If I want to check for duplicate PKs, I can just run this query here:
The idea here is that we have 2 queries. The first one will check if the ROWID it has is bigger or smaller (your choice) than the MIN or MAX ROWID (depending on your previous choice) returned by the second subquery, joined by any columns you want to check.
In this case, we wanted to check only if the PK column had duplicated values, but we could check any other column by just replacing it in the join. In fact, we could have any number of columns in the join, and that would check for duplicates across all the columns you inserted there.
Then you can select the first one by using > and MIN, or the last one by using < and MAX, and you can also choose which columns you want to check in the WHERE clause.
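As a reference, here is a minimal sketch of that kind of query, assuming a test table called TEST_TABLE with a primary-key column PK_COL (names are illustrative):

SELECT A.ROWID, A.*
  FROM TEST_TABLE A
 WHERE A.ROWID > (SELECT MIN(B.ROWID)           -- use < and MAX(...) to keep the newest row instead
                    FROM TEST_TABLE B
                   WHERE B.PK_COL = A.PK_COL);  -- join on any columns you want to check for duplicates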
One important thing to mention is that this query is meant to work as a delete, because it'll keep what was not in the select. What I mean is, if you have more than one duplicate, it'll bring back, in this case, all the rows that have a ROWID greater than the one selected in the first query:
Then if I have multiple duplicates, the query will return everything that needs to be deleted, and the only row remaining is the first one inserted (3, Chuck, Giampaoli).
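The delete version is basically the same statement, still assuming the hypothetical TEST_TABLE/PK_COL names from the sketch above:

DELETE FROM TEST_TABLE A
 WHERE A.ROWID > (SELECT MIN(B.ROWID)           -- keeps the oldest duplicate; swap to < and MAX(...) to keep the newest
                    FROM TEST_TABLE B
                   WHERE B.PK_COL = A.PK_COL);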
I hope you enjoy this little trick and see you soon.
Hey guys how are you? Today let's take a look at the opposite of the S01EP12 situation. In fact, we'll use the same example again, to show how we can convert a string into a list of values in an easy and dynamic way, starting with this query here:
I'll transform this query into a WITH clause and I'll use REGEXP to put this back into a list of values. This is very useful when we extract metadata from Essbase, for example, because Essbase exports the UDAs as a list of values. Of course this has many uses other than this one, but let's keep this one in mind.
Now what we need to do is split the string, by comma in this case. The idea is to count how many values we have in each row and split the string into that many pieces.
The idea here is to use REGEXP_COUNT to count how many words we have in between the commas and then use it to multiply the rows with CONNECT BY LEVEL. For example, if we have 3 words, the CONNECT BY will create 3 copies of the same row, one with LEVEL = 1, another with LEVEL = 2 and the last one with LEVEL = 3.
With that, we just need to use REGEXP_SUBSTR to extract the words based on the LEVEL. This way we'll have REGEXP_SUBSTR(STR, '[^,]+', 1, LEVEL), where LEVEL will be 1 for the first row, 2 for the second and 3 for the third one.
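Putting it together, a minimal sketch of the pattern (the WITH clause below just simulates the string of UDAs; in a real case it would be your extract query):

WITH UDA_LIST AS (
  SELECT 'UDA_A,UDA_B,UDA_C' AS STR FROM DUAL
)
SELECT LEVEL AS ITEM_NR,
       TRIM(REGEXP_SUBSTR(STR, '[^,]+', 1, LEVEL)) AS UDA_VALUE  -- one word extracted per generated row
  FROM UDA_LIST
CONNECT BY LEVEL <= REGEXP_COUNT(STR, '[^,]+');                  -- one row per word in the string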
Hey guys how are you keeping? I hope everybody is healthy and stays that way in these difficult times.
And to make our life less complicated, here's another tip. Let's talk about how to concatenate stuff in Oracle.
Imagine a simple case: we want to query the Planning repository to get the list of UDAs a member has. We can easily do that by querying the HSP_OBJECT, HSP_MEMBER_TO_UDA and HSP_UDA tables.
I'm filtering just 3 products to make it easier for us to see. The results show that each product has a different number of UDAs, and we never know how many there will be, so the easiest way to concatenate them is to use the LISTAGG command (or WM_CONCAT if you are on a DB version prior to 11.2).
The command is very simple: LISTAGG(Column, Separator) WITHIN GROUP (ORDER BY column). As we can see, the command allows us to select the separator we want (it can be a comma or any string really) as well as to order the results by another column. Let's take a look at the example above.
As you can see, it easily creates a list split by comma (as specified) for me, and the nice thing about it is that I don't need to do any string treatment if it returns null, or if I have just one string in it, and things like that.
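A hedged sketch of what that query could look like; the join columns on the HSP tables are from memory and may vary by Planning version, so treat it as a starting point:

SELECT O.OBJECT_NAME AS MEMBER_NAME,
       LISTAGG(U.UDA_VALUE, ',') WITHIN GROUP (ORDER BY U.UDA_VALUE) AS UDA_LIST
  FROM HSP_OBJECT O
  JOIN HSP_MEMBER_TO_UDA MU ON MU.MEMBER_ID = O.OBJECT_ID
  JOIN HSP_UDA U            ON U.UDA_ID     = MU.UDA_ID
 WHERE O.OBJECT_NAME IN ('PRODUCT_A', 'PRODUCT_B', 'PRODUCT_C')  -- the 3 products being filtered (illustrative names)
 GROUP BY O.OBJECT_NAME;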
This is an extremely good function and we use it heavily in ODI to generate dynamic code because of its simplicity. For example, we can generate a SQL statement on the fly using the command on source and command on target:
With these results we can easily pass this info to the command on target to generate a dynamic query, where ODI will replace the column list we got from the source as well as the table name, and will also loop for each row we have in the source. This is very handy.
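As a rough illustration of the idea (schema names, the table filter and the substitution names are made up): the command on source could build one row per table with its column list, and the command on target would be executed once per source row with those values substituted.

-- Command on Source (sketch): one row per staging table with its comma-separated column list
SELECT TABLE_NAME AS TABLE_NM,
       LISTAGG(COLUMN_NAME, ',') WITHIN GROUP (ORDER BY COLUMN_ID) AS COL_LIST
  FROM ALL_TAB_COLUMNS
 WHERE OWNER = 'STG_SCHEMA'
   AND TABLE_NAME LIKE 'STG_%'
 GROUP BY TABLE_NAME;

-- Command on Target (sketch): ODI substitutes #TABLE_NM and #COL_LIST for each source row
INSERT INTO DW_SCHEMA.#TABLE_NM (#COL_LIST)
SELECT #COL_LIST
  FROM STG_SCHEMA.#TABLE_NM;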
And for the ones that are not on Oracle 11.2 or later, we can still do that using WM_CONCAT. It's not as powerful as LISTAGG, but it works pretty well. Let's try the first example again:
I cannot show you the results, since WM_CONCAT was decommissioned in 12c (my version), but it works like this: we don't have the option to choose the separator and, to make the string unique and ordered, we need to add DISTINCT inside the command, WM_CONCAT(DISTINCT column).
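Just for reference, a sketch of the same idea with WM_CONCAT (an undocumented function, removed in newer versions; the HSP column names are the same hypothetical ones as before):

SELECT O.OBJECT_NAME AS MEMBER_NAME,
       WM_CONCAT(DISTINCT U.UDA_VALUE) AS UDA_LIST   -- always comma-separated; DISTINCT is the only knob we have here
  FROM HSP_OBJECT O
  JOIN HSP_MEMBER_TO_UDA MU ON MU.MEMBER_ID = O.OBJECT_ID
  JOIN HSP_UDA U            ON U.UDA_ID     = MU.UDA_ID
 GROUP BY O.OBJECT_NAME;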
Today a quick tip that I think is very useful. From time to time the business asks us to validate if a table has data or not before we load it. That's fair, especially if you use a truncate-and-insert approach.
The problem is that, sometimes, the table/view they are asking for has millions of rows, and there's no safer way to validate if a table has data or not than querying it.
I just fixed a case where an interface had a validation that basically counted 3 different tables that together had 40 million rows per period. This validation was taking around 1,000 seconds to happen.
The data load that happens before that took 1,200 seconds. So, basically, the validation process was taking as much time as the load process.
After some changes, the query now validates the 3 tables in 0.3 seconds. Way better than before. Basically I just used 3 things:
The hint /*+ FIRST_ROWS(1) */, which makes Oracle prepare the best plan to query just one row (in my case, since I used 1 as the parameter).
The filter ROWNUM = 1 to make sure Oracle returns just 1 row. If we don't use it, the hint can make everything very slow, because Oracle will be planning for just one row but, without the filter, it'll bring back more (using the best plan possible for 1 row).
And UNION ALL instead of UNION, because there's a huge difference between them. When you use UNION, Oracle matches the sets of data to make sure you have unique rows after that. UNION ALL, on the other hand, just brings back everything each set returns, without any extra processing to validate anything. UNION ALL is always faster than UNION.
In the end I have a query like this:
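A minimal sketch of that kind of validation, assuming three tables TABLE_A, TABLE_B and TABLE_C filtered by a PERIOD_ID column (names and the period value are illustrative):

SELECT /*+ FIRST_ROWS(1) */ 'TABLE_A' AS TABLE_NM FROM TABLE_A WHERE PERIOD_ID = 202001 AND ROWNUM = 1
UNION ALL
SELECT /*+ FIRST_ROWS(1) */ 'TABLE_B' AS TABLE_NM FROM TABLE_B WHERE PERIOD_ID = 202001 AND ROWNUM = 1
UNION ALL
SELECT /*+ FIRST_ROWS(1) */ 'TABLE_C' AS TABLE_NM FROM TABLE_C WHERE PERIOD_ID = 202001 AND ROWNUM = 1;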
As you can see, the query is very simple and, for this example, I just returned the name of the table, so we know the table is not empty for that period. We can take other approaches, like summing them all together and validating that the result equals 3, for example, or any other logic we need can be implemented on top of this query.
I hope this is helpful for you guys and see you in the next post.
Continuing the series Fragmented and Aggregated tables in OBIEE using ODI, today we are talking about how to populate the fragmented tables using ODI.
Just to make it easier for you to navigate in this series, here are the parts of it:
In my previous posts we designed our 18 tables and the partition management process. Now we'll talk about how we are going to load data into our fragmented tables. As you can imagine, for the 6 fragmented tables the code will be the same; the only difference is what goes to each table. So what we need ODI to do for us is to dynamically filter and load the data that comes from the same query into different tables based on a rule.
If you are following this series from the beginning, you probably already guessed what we'll need to do here: the same thing we did for the partition management process. Well, almost the same thing, because we are going to do something a little bit more optimized for the data load, since this is the most time-consuming step.
The main difference from the approach I explained in the previous post is that that approach loops one statement for each row returned by the Source command, in SERIAL, meaning we need to wait for one loop to finish before the next one starts. For the partition management that's OK, because everything there is very fast, but for this load we can have something better.
We don't need and we don't want to wait for a serial execution, because we have one table for each loop we'll perform. (This is not the case here, but even if we had only one table, if we have partitions and/or sub-partitions, we can loop the code by the partition/sub-partition level and load everything in parallel. The only case where we can't load in parallel is if we have a single table without partitions.)
In our case we are going to run all 6 table loads in parallel. To do that, we are going to use the same approach as before, but instead of having the ALTER TABLE in the Command on Target, we are going to have an OdiStartScen, to call a scenario multiple times in PARALLEL. Let's take a look at the Command on Target:
As we can see, the Command on Target contains an ODI tool called OdiStartScen that is used to call a scenario. We have a few parameters there to set how we are going to call the scenario and, in this case, the important one is SYNC_MODE=2, which means it'll execute everything in parallel (1 is serial). Everything else is variables that we are using to pass information from the Command on Source to the called scenario.
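Just to give an idea of the shape of that call, here's a hedged sketch; the scenario name, project code and variable names are purely illustrative, and the safest way to get the exact syntax is to let ODI generate it for you (see the tip right below):

OdiStartScen "-SCEN_NAME=LOAD_FRAGMENTED_TABLE" "-SCEN_VERSION=-1" "-SYNC_MODE=2"
             "-SESSION_NAME=LOAD_#TABLE_NM" "-MY_PROJECT.TABLE_NM=#TABLE_NM"
             "-MY_PROJECT.PERIOD_ID=#PERIOD_ID" "-MY_PROJECT.HIER_NM=#HIER_NM"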
A quick tip here before we continue: if you want to know how in heavens I have all these settings in mind, for your information, I don't, but there's a very easy way to find out. Just drop an OdiStartScen from the ODI Toolbox in any part of your scenario and fill it with the information you need:
In the General tab we set the scenario we want to load, the version of the scenario (-1 is always the latest scenario you have), the Synchronous/Asynchronous mode (serial or parallel) and the Session name, in case you want a different name showing up in your execution (this is always a good idea to have, since you are looping the same code with different parameters; in my case I pass as variables the name of the table, the period and the hierarchy that the scenario will be loading).
In the Additional Variables tab you can set all the variables you need to use inside the scenario (as you can see, in my case I have a lot of variables because of the complexity of my sources; your case will be different).
And after you set everything, you just need to click on the Command tab to get the code that ODI generates based on the settings you did:
This is the code that you need to paste into the Command on Target to call a scenario. After that you can delete the OdiStartScen object from your scenario and continue the development. Another advantage of this is that, if you have the code and want to create an OdiStartScen step in a scenario, you can paste it into the Command tab and the General/Additional Variables tabs will be filled based on the code.
With the Target code ready, we just need a query that is going to pass all the information needed to call the scenario. In my case, I pass a lot of information to the target, as you can see by the amount of variables I have. Also, the variable information doesn't need to come only from the Command on Source. If you have variables in your scenario and they are the same for all the loops, you can just use these variables to pass their values to the scenario the Command on Target is calling. The Command on Source needs to have only the information that will dynamically change in each loop.
In my case I'm passing all kinds of information, like the SESSION_NAME (the name of the session, which we use to create parameters specific to each session while using the same variables for all of them, more information HERE), but the important one for us is the TABLE_NM.
In my case, the other variables will change values depending on the table we need to load each time we loop. To do that, I insert all the information I need to pass to the scenario I'm calling into a parameter table (that I like to call ODI_PARAMETERS). This way I can easily query this table to get the specific information I need for each loop I'm going to perform.
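So the Command on Source ends up being a simple query against that parameter table; a sketch, with made-up column and filter names:

SELECT SESSION_NAME,   -- used to build a distinct session name per loop
       TABLE_NM,       -- which fragmented table this loop will load
       PERIOD_ID,
       HIER_NM
  FROM ODI_PARAMETERS
 WHERE PROCESS_NM = 'FRAGMENTED_LOAD';  -- hypothetical column to isolate this process' rows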
With all this set, we just need to create a child scenario and add “Declare” variables for each variable we want to pass to the internal scenario, like this:
As you can see, this is a very simple scenario that contains just one procedure, which will be used to load the data into our tables.
This approach can be used with interfaces as well, in fact with anything at all. We can have a main scenario that loops a very complex scenario, that loops another scenario, and so on. This is a very nice way to loop something, because we have a lot of control over this kind of loop; we can change the behavior of each loop based on the info we get from the Command on Source.
Also, a very good feature is that, if the source query doesn't return any rows, the Command on Target is not executed at all, so you'll always have a clean execution. For example, if you have a folder that can have 0 to X files and you want to create a scenario to load them, the best approach would be to read the folder's files (an OS command to a file and a load of that file, or Java, or Python, or…) and store the info in our ODI parameter table. Then we create a proc to call a scenario that will contain the interface that loads each file, and finally we just do a SELECT in the Command on Source to get the names of the files in the folder. With that, if we have 0 files, nothing will be executed, but if we have 1,000 files, we'll loop the scenario 1,000 times and each time we'll pass the name of a different file to the internal scenario. Very neat way to do it.
Inside the procedure we just need to create the query needed to load the data and use the variables to filter the right data and load the right table. This is not my case, but if you have a different source for each table you want to load, just insert this information into the parameter table and pass it to the internal scenario. Use these variables to replace the table names in the SQL and you are done.
And that’s it. This is all we need to do to load all our 6 fragmented tables. In my next post I’ll be talking about how we can aggregate our data and load our Aggregated tables.
In my previous post we designed 18 tables and their partitions and sub-partitions. Now what we need to do is make ODI manage these partitions for us. As I said before, the sub-partitions will be static, and if we need to create one more sub-partition, we'll be doing that manually. That's always the best choice, because managing sub-partitions adds a layer of complexity that is not needed.
With that said, if you need to manage the sub-partitions, you can use the exact same approach I'll show here, just adapting it to the needs of the sub-partition creation and the sub-partition template update.
Let’s start analyzing what needs to be done. For each table we need to issue this command to the database:
For List partitions:
ALTER TABLE table_name ADD PARTITION partition_name VALUES (value[, value]…);
For Range Partitions:
ALTER TABLE table_name ADD PARTITION partition_name VALUES LESS THAN (value[, value]…);
It sounds hard, but in fact it's very simple and it'll work with any SQL you have. To do this we will use a procedure with our ALTER TABLE statement in the Command on Target tab and a SELECT in the Command on Source to pass the information we need to create the partitions, like the table names, the partition names and the partition values.
For each row that the SELECT on source returns, the statement in the Command on Target will be executed, and we can pass the values from the SELECT in the Command on Source to change the behavior of the statement in the Command on Target.
This is something very important that we need to understand about ODI: ODI is not an ETL tool but a code-generator tool that orchestrates the database and a lot of other tools, if you will.
What I mean by that is, until ODI sends the code to the database to execute, everything ODI does is generate code. If we use the ODI API, it'll resolve it before sending the code to the database, the same way it replaces the variables with their values, meaning a variable like #SQL on the Command on Target tab could contain an entire dynamically generated INSERT that ODI would replace before sending it to the database.
And this is why ODI is so powerful: we can manipulate strings and make everything dynamic.
Then our query in the source should return something like this:
The query will change depending on the design of your application, so I don't think sharing my query would help. One tip I can give you: always compare the name of the partition with the content of the PARTITION_NAME column of the ALL_TAB_PARTITIONS view to see if the partition you want to create already exists, to prevent an error in ODI. In fact you have 2 choices: either do this, or flag the step as Ignore Error to end the step with a warning. I prefer to compare and end the step without warnings, so I compare.
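That existence check can be as simple as a NOT EXISTS against ALL_TAB_PARTITIONS; a sketch, where PARTITION_LIST stands for whatever query or table builds your list of partitions to create:

SELECT T.TABLE_NM, T.PARTITION_NM, T.MAX_PERIOD
  FROM PARTITION_LIST T                          -- hypothetical: your list of partitions to be created
 WHERE NOT EXISTS (SELECT 1
                     FROM ALL_TAB_PARTITIONS P
                    WHERE P.TABLE_NAME     = T.TABLE_NM
                      AND P.PARTITION_NAME = T.PARTITION_NM);  -- skip partitions that already exist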
Another thing to notice is that, for FY20 Feb, my PARTITION_VALUE (MAX_PERIOD) is 1 day after the last day of the month. It's hard to see here because the calendar used by the client is all different, but if you have the normal calendar and the mask for your PERIOD_ID is YYYYMMDD, for 2020/Feb your partition value needs to be 20200230 (Feb ends 02/29/2020).
That's because this is a RANGE partition, and for a RANGE partition we define the partition value as VALUES LESS THAN, which means that everything under 20200230 will be FEB unless we have a smaller partition (JAN) below it. But, for example, if your smallest partition is 2018/Jan, everything below that period will be inserted into the 2018/Jan partition, so you need to be careful about when you start creating partitions.
Another important thing to mention is that you cannot insert a partition smaller than an existing partition, only bigger. That means, if you start with 2018/Jan, it's impossible to create a 2017 partition; the only way is to drop the table and recreate it starting from 2017. A RANGE partition needs to be sequential and always growing.
Having these values returning from the SELECT on the Command on Source tab, the only thing left is to use them in the statement in the Command on Target tab, like this:
ALTER TABLE #TABLE_NM ADD PARTITION #PARTITION_NM VALUES LESS THAN (#MAX_PERIOD);
Pay attention to the values of the partitions. In this case the values are numeric, which is why I'm not using quotes. If the values were strings, we would need to enclose the variable in quotes.
And that's it, this is all we need to do to manage partitions using ODI. Because the SELECT on the source returns all tables with their partition names and values, and we pass this information to the TARGET statement, for each row in the source the target will be looped, creating the partitions you need for each table.
We are done with the fragmented tables. Now we need to manage the partitions for the aggregated tables. The approach is exactly the same; the only things that change are the content of the query in the Command on Source tab and the fact that the partition now will be a LIST partition instead of a RANGE partition.
We could have done just one query returning all values, including whether it is a RANGE or a LIST partition, but for the sake of simplicity it's always better to split the scenarios and not overcomplicate something simple.
For the month level we need the query in the Source to return:
And the statement in the Command on Target should be like this:
ALTER TABLE #TABLE_NM ADD PARTITION #PARTITION_NM VALUES (#MAX_PERIOD);
That’s all that needs to be done. For the quarter level the results are similar:
And the statement in the Command on Target should be like this:
ALTER TABLE #TABLE_NM ADD PARTITION #PARTITION_NM VALUES (#MAX_PERIOD);
As you can see, it's very simple to manipulate and manage the database with ODI. We just need to create a query with the information we want in the source and replace that information in the target statement; that's all we need to do.
Another important thing that we need in our procedure, in case you want to go with the truncate/insert approach, is to truncate the partitions before the load. This is also managed by ODI and works in exactly the same way as the CREATE PARTITION steps, but a little bit simpler, since we don't need to have the partition values:
And the statement in the Command on Target should be like this:
ALTER TABLE #TABLE_NM TRUNCATE PARTITION #PARTITION_NM;
For month level:
And the statement in the Command on Target should be like this:
ALTER TABLE #TABLE_NM TRUNCATE PARTITION #PARTITION_NM;
And for the Quarter Level:
And the statement in the Command on Target should be like this:
ALTER TABLE #TABLE_NM TRUNCATE PARTITION #PARTITION_NM;
Also, this could be done with just one query but, again, simplicity always works better than complexity.
This is all we need to start loading data. One last thing we could do, if it is your case, is to create a purge process to drop old partitions. If your business requires 5 years of data, it's a good idea to drop older partitions. The approach is exactly the same.
In fact, the results of the Command on Source are exactly the same as the ones we need for the TRUNCATE PARTITION steps; the only difference is that you need to have in place a logic to find the old partitions. Other than that, the only thing that changes is the statement on the target, which should be:
ALTER TABLE #TABLE_NM DROP PARTITION #PARTITION_NM;
And that's it, we just created a procedure to manage the partitions for all our 18 tables, including creating, truncating and dropping old partitions.
In the next part of this series we are going to talk about populating the fragmented tables.
I normally don't talk about OBIEE, despite the fact that I work a lot with it, because there are a lot of people talking about it. But recently I worked on a project where I decided to use fragmented tables and aggregated tables at the same time, and I found that we don't have much specific information regarding this matter and, of course, zero information about using ODI to enable this project.
Then, since this will be kind of big, I decided to split it into 5 parts:
Creating the tables: I'll explain the advantages and how you can create the fragmented tables, and also how to further improve performance using partitioning and sub-partitioning.
To start the series Fragmented and Aggregated tables in OBIEE, today we are talking about how to create the tables for our project.
In this project I had data from 4 different source systems, and each source system had a different amount of data. Dell data had 20 million rows per month, EMC data 30 million, DTC data 5 million and STAT data another 5 million.
The business wanted to have 15 quarters of data stored, so if we do a simple calculation we're going to have: (20,000,000 + 30,000,000 + 5,000,000 + 5,000,000) * 45 months = 2,700,000,000 rows.
That's a lot of data even for a partitioned table, so I decided to split these tables in some way.
Analyzing the data, I saw that these tables contain 3 different account hierarchies and the users will only analyze one at a time: BS, Income and PL2. I decided to split the tables by that. Just with that I now had 3 fragmented tables: JAT_JE_DETAIL_BS, JAT_JE_DETAIL_INCOME and JAT_JE_DETAIL_PL2.
That's nice; just with that I split those 2.7 billion rows in 3. The problem now is that I have 4 different source systems and, even if I partitioned the tables by source system, EMC is as big as the other 3 sources put together, and that would make that source slower than the others to retrieve.
With that I decided to split the tables in 2 again, one for EMC and another for the other sources. That makes the data distribute evenly: 20,000,000 + 5,000,000 + 5,000,000 in the D tables and 30,000,000 in the E tables (D and E are how I called them).
With that we now have 6 fragmented tables, right? 3 for EMC data (one for each account hierarchy) and another 3 tables for Dell, DTC and STAT data.
Now that we have a nice and even data distribution, we need to decide how we are going to create the partitions and sub-partitions for these tables.
For the partitions I always like to use the PERIOD_ID, because it makes sense in most cases (unless we are creating a Forecast app; in that case the users will want to see the data by scenario and not by one period).
Then, for the tables JAT_JE_DETAIL_D_BS_FACT, JAT_JE_DETAIL_D_INCOME_FACT and JAT_JE_DETAIL_D_PL2_FACT, we are going to partition by period. But what kind of partition are we going to use? Well, in this case, these tables contain Journal information at day level and the users want to drill from the month to the Journal rows of that month. What that means is that we'll always work inside a month, so the best way to go is a RANGE partition.
Range partitions, as the name already says, use a range of values to create the partition area. In our case it'll be one month. The range can be built on a DATE column or a NUMBER column, and the second one is our case, since it's easier to manipulate the periods using a number than a date.
(The period for this project is not a regular month; for example, the month of February FY20 starts on Jan/30 and goes until Feb/2.)
For the sub-partitions it's an easy call: we are going to use the source system, since we have 3 different ones in these tables (Dell, DTC and STAT), and the partition type will be LIST, since we have a simple list of values. Also, for sub-partitions it's always easier to have values that don't increase, because of the complexity of adding new sub-partition values to a table.
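To make this concrete, here's a hedged sketch of how one of these tables could be created, with a RANGE partition on PERIOD_ID and a LIST sub-partition template on SOURCE_SYSTEM_ID (the column list, the source system IDs and the first partition are illustrative):

CREATE TABLE JAT_JE_DETAIL_D_BS_FACT (
  PERIOD_ID        NUMBER,
  SOURCE_SYSTEM_ID NUMBER,
  ACCOUNT_ID       NUMBER,
  AMOUNT           NUMBER
)
PARTITION BY RANGE (PERIOD_ID)
SUBPARTITION BY LIST (SOURCE_SYSTEM_ID)
SUBPARTITION TEMPLATE (
  SUBPARTITION SP_DELL VALUES (1),   -- illustrative source system IDs
  SUBPARTITION SP_DTC  VALUES (2),
  SUBPARTITION SP_STAT VALUES (3)
)
(
  PARTITION P_INITIAL VALUES LESS THAN (20190101)  -- first RANGE partition; the others are added over time
);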
What I mean is that to add a new partition you just need to use one ALTER TABLE ADD PARTITION statement, and you can add any number of partitions you need. But for a sub-partition it's different. When you create your sub-partitions, Oracle keeps a sub-partition template for the table. That template is used every time you create a new partition.
In our case, for example, our sub-partition template would be DELL, DTC and STAT. What that means is that when you create another partition, it will be created with these 3 sub-partitions automatically. Then, if you want to add another value to a sub-partition, you need to do 2 things instead of just one (as in the partition case).
You need to change the sub-partition template definition to add the new sub-partition, and you need to add your new sub-partition to every single partition that already exists (it is not updated automatically). After that, all new partitions will contain all your sub-partitions.
If you don’t update the template, every time you create a partition, you’ll not have the new value you want.
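In Oracle terms, the two steps look something like this (table, partition, sub-partition and value names are illustrative):

-- 1) Update the sub-partition template so future partitions get the new value
ALTER TABLE JAT_JE_DETAIL_D_BS_FACT SET SUBPARTITION TEMPLATE (
  SUBPARTITION SP_DELL VALUES (1),
  SUBPARTITION SP_DTC  VALUES (2),
  SUBPARTITION SP_STAT VALUES (3),
  SUBPARTITION SP_NEW  VALUES (4)
);

-- 2) Add the new sub-partition to every partition that already exists
ALTER TABLE JAT_JE_DETAIL_D_BS_FACT MODIFY PARTITION P_INITIAL
  ADD SUBPARTITION P_INITIAL_SP_NEW VALUES (4);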
Enough of theory, let's proceed with the other set of tables: JAT_JE_DETAIL_E_BS_FACT, JAT_JE_DETAIL_E_INCOME_FACT and JAT_JE_DETAIL_E_PL2_FACT. For these tables we'll continue to have the partitions as a RANGE of the PERIOD_ID, but for the sub-partition it doesn't make sense to use the source system, because we have only one source system.
And now is the time to explain another thing about partitions. For Oracle to use a partition, you need to filter on the column that is used to create the partition. For example, if we are creating a partition using the column PERIOD_ID, we must filter the data by PERIOD_ID, and the ideal would be to always have a filter on both the partition and sub-partition columns, because if you don't, it'll do a full table scan anyway. So, from a data-retrieval point of view, the choice of the partitions is always made based on the filter requirements, meaning we need to choose columns that we can force the user to choose in OBIEE. From an ETL point of view, the best would be to have partitions and sub-partitions on columns that we can use to truncate before load.
For example, in our first set of tables I can easily and quickly truncate a period and a source system before I load it back into the tables. That's the beauty of partitions: I can drop, truncate and create them on demand.
That said, you can imagine that sometimes what is good for the report is not good for the DW, but this is not our case today.
With this new piece of information in mind, the users are forced, in the OBIEE prompts, to choose a Gen 2 Account, so this will be our sub-partition column for our second set of tables, also with the partition type as LIST.
Before we proceed, let’s just see how everything is set then:
Table Name                      Partition           Sub Partition
JAT_JE_DETAIL_D_BS_FACT         PERIOD_ID (Range)   SOURCE_SYSTEM_ID (List)
JAT_JE_DETAIL_D_INCOME_FACT     PERIOD_ID (Range)   SOURCE_SYSTEM_ID (List)
JAT_JE_DETAIL_D_PL2_FACT        PERIOD_ID (Range)   SOURCE_SYSTEM_ID (List)
JAT_JE_DETAIL_E_BS_FACT         PERIOD_ID (Range)   ACCOUNT_GEN2_ID (List)
JAT_JE_DETAIL_E_INCOME_FACT     PERIOD_ID (Range)   ACCOUNT_GEN2_ID (List)
JAT_JE_DETAIL_E_PL2_FACT        PERIOD_ID (Range)   ACCOUNT_GEN2_ID (List)
Perfect, now the only thing missing here is the aggregated tables.
Aggregated tables are tables that use a dimension drill to pre-aggregate the values they contain. It's an easy way to improve performance during a drill operation.
The most common aggregated table is for the Period dimension, and this is the one we are going to do here as well. In our case, as mentioned before, the 6 tables we have now contain day-level data. I want to have the data aggregated by month and by quarter.
What that means is that we'll need one more table for each of these 6 tables for monthly-level data and one more table for each of these 6 tables for quarterly-level data. These tables are going to have the same partition and sub-partition definitions, with only one difference: instead of using a RANGE partition, we are going to use a LIST partition, because we don't have a range anymore; we are aggregating that range of days into one month.
Then let's see how the tables need to be set:
Table Name                      Partition           Sub Partition
JAT_JE_DETAIL_D_BS_FACT         PERIOD_ID (Range)   SOURCE_SYSTEM_ID (List)
JAT_JE_DETAIL_D_INCOME_FACT     PERIOD_ID (Range)   SOURCE_SYSTEM_ID (List)
JAT_JE_DETAIL_D_PL2_FACT        PERIOD_ID (Range)   SOURCE_SYSTEM_ID (List)
JAT_JE_DETAIL_D_M_BS_FACT       PERIOD_ID (List)    SOURCE_SYSTEM_ID (List)
JAT_JE_DETAIL_D_M_INCOME_FACT   PERIOD_ID (List)    SOURCE_SYSTEM_ID (List)
JAT_JE_DETAIL_D_M_PL2_FACT      PERIOD_ID (List)    SOURCE_SYSTEM_ID (List)
JAT_JE_DETAIL_D_Q_BS_FACT       PERIOD_ID (List)    SOURCE_SYSTEM_ID (List)
JAT_JE_DETAIL_D_Q_INCOME_FACT   PERIOD_ID (List)    SOURCE_SYSTEM_ID (List)
JAT_JE_DETAIL_D_Q_PL2_FACT      PERIOD_ID (List)    SOURCE_SYSTEM_ID (List)
JAT_JE_DETAIL_E_BS_FACT         PERIOD_ID (Range)   ACCOUNT_GEN2_ID (List)
JAT_JE_DETAIL_E_INCOME_FACT     PERIOD_ID (Range)   ACCOUNT_GEN2_ID (List)
JAT_JE_DETAIL_E_PL2_FACT        PERIOD_ID (Range)   ACCOUNT_GEN2_ID (List)
JAT_JE_DETAIL_E_M_BS_FACT       PERIOD_ID (List)    ACCOUNT_GEN2_ID (List)
JAT_JE_DETAIL_E_M_INCOME_FACT   PERIOD_ID (List)    ACCOUNT_GEN2_ID (List)
JAT_JE_DETAIL_E_M_PL2_FACT      PERIOD_ID (List)    ACCOUNT_GEN2_ID (List)
JAT_JE_DETAIL_E_Q_BS_FACT       PERIOD_ID (List)    ACCOUNT_GEN2_ID (List)
JAT_JE_DETAIL_E_Q_INCOME_FACT   PERIOD_ID (List)    ACCOUNT_GEN2_ID (List)
JAT_JE_DETAIL_E_Q_PL2_FACT      PERIOD_ID (List)    ACCOUNT_GEN2_ID (List)
Great, now we have 18 tables: 6 fragmented tables for daily-level data, 6 fragmented tables for monthly aggregated data (M) and 6 fragmented tables for quarterly aggregated data (Q).
The last thing about the tables is that we need a matching Period dimension (in this case) for the data at each aggregation level. What that means is that, since we have 3 different levels of aggregation, we need 3 Period dimensions.
OBIEE doesn't allow us to join the same table using columns that are not ID columns. So what we need is one PERIOD dimension (DIM_PERIOD) that contains the drill from Year to Day (the Period ID will be a YYYYQMMDD number), another PERIOD Month dimension (DIM_PERIOD_MONTH) that contains the drill from Year to Month (the Period ID will be a YYYYQMM number) and one last PERIOD Quarter table (DIM_PERIOD_QUARTER) that contains the drill from Year to Quarter (the Period ID will be a YYYYQ number).
Of course, the aggregated FACT tables are also going to have a matching ID (and we need that to aggregate the data right).
The last thing to mention is that the detail fact tables (the ones at day level) contain a lot of Journal details like Journal number, Description, Header ID and others. For the detail that's OK but, when we design the aggregated tables, only IDs are allowed there for maximum aggregation. If the table contains descriptions, nothing will be aggregated, right?
Don't worry, OBIEE can handle this kind of situation very well.
That's it for now. I hope you guys enjoy it and find it helpful.
Continuing our Oracle SQL for EPM series, today we'll start to talk about analytic functions and how we can use them for more than “just” analytics.
To start with, let's talk about RANK(). As the name suggests, RANK() is used to rank our data based on something. It's very useful to find out which data is more relevant than the rest. Let's see an example:
Here we have a small table with 2 currencies and a few products. Let's first start with the basic use of RANK() and see which product generated more income:
The basic syntax is RANK() OVER (ORDER BY COLUMN). Basically, what you are telling Oracle is: rank my data based on a column (or multiple columns). Since I just ordered by the data, the RANK() values get duplicated whenever Oracle finds the same value. This is because we have 2 currencies and they are both effectively USD, so the values repeat.
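A sketch of that first query, assuming a table T_INCOME with PRODUCT, CURRENCY and INCOME columns (names are illustrative):

SELECT PRODUCT,
       CURRENCY,
       INCOME,
       RANK() OVER (ORDER BY INCOME) AS INCOME_RANK  -- ascending on purpose here; the DESC fix comes next
  FROM T_INCOME;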
To fix this we can do 2 things: either we include the currency inside the ORDER BY, or we use a more advanced feature of RANK(), which is the OVER (PARTITION BY) clause.
Let’s see how it works:
If I just add another column to the ORDER BY, it'll basically create the rank based on the order of these 2 columns. It's the same as doing an ORDER BY and then following the order of the data that returns. In this case, you can see that the product PR235 got Rank 1 for Functional data and Rank 11 for USD, even though both have the same value. By the way, you can also see that the ranks are ordered in the opposite order to what we would like to have. This was intentional, to show you how the rank is produced. To fix it we just need to put a DESC in the ORDER BY clause, as we would do in a normal ORDER BY.
Ok then let’s see the more advanced way to write this query:
Instead of inserting new columns in the ORDER BY, we can use PARTITION BY instead. The result here is the same, but this can be used in other ways as well, and I would say this is the best way to use it, since it's clearer what you want to do.
The PARTITION BY does exactly what the name says: it partitions the data by the content of one or more columns. In fact, the PARTITION BY clause can be used in most of the analytic functions, like MAX, SUM, MIN, AVG…, so it's very powerful, and the best thing is that, if you use it, you don't need to use a GROUP BY (we'll see that in the future).
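Using the same hypothetical T_INCOME table, the PARTITION BY version would look something like this:

SELECT PRODUCT,
       CURRENCY,
       INCOME,
       RANK() OVER (PARTITION BY CURRENCY ORDER BY INCOME DESC) AS INCOME_RANK  -- one independent ranking per currency
  FROM T_INCOME;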
Now, as I said before, we can have other uses for RANK than just ranking data. Let’s say that you have this table without the CURRENCIES column:
Without the CURRENCIES column we end up with duplicate data in the table, right? In this case we could just do a DISTINCT and use the data as is, but let's say you want to create the CURRENCIES column based on the data we have, and the rule would be: the first row you find for a product is USD and the second (if it exists) is Functional. We can use RANK for that too:
Since here the data is the same for the same product, the only thing that could differentiate the rows was the ROWNUM (or the ROWID, which would be better to make sure which one really was the first one, but it's harder to see in the example). I used it to create a rank that shows which row has the lowest ROWNUM: that one gets Rank 1, the second one gets Rank 2, and with this information I just did a DECODE to turn the 1 into USD and the 2 into Functional (plus an NA in case we have more than 2 duplicated rows).
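A sketch of that trick, still with hypothetical names (T_INCOME_NO_CURR being the same data without the CURRENCIES column):

SELECT PRODUCT,
       INCOME,
       DECODE(RANK() OVER (PARTITION BY PRODUCT ORDER BY ROWNUM),
              1, 'USD',          -- first occurrence of the product
              2, 'Functional',   -- second occurrence, if it exists
              'NA') AS CURRENCIES
  FROM T_INCOME_NO_CURR;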
This can be used in exactly the same way if you have a metadata table without the data storage information and you want to create it. The first member you find (through our friend CONNECT BY PRIOR) will be the prototype (Store, Never Share or Dynamic Calc and Store) and the others will be Shared members.
Of course there are way more ways to use this function, and we'll see more of them together with the other analytic functions that we are going to cover here.
Continuing the Oracle SQL for EPM series, today we'll see another usage of CONNECT BY. I'm talking a lot about CONNECT BY because it is a very useful function and we should use it a lot in the analytic space.
Let's imagine that we are working with a Planning application and we want to figure out, in a query, what the region of each cost center is and to what top level those cost centers belong. In this app we have the Entity dimension with Support Geography as an attribute.
The Support Geography attribute has the region at generation 2 of its hierarchy. How can we do that? Well, CONNECT BY to the rescue.
First of all we need to get the entire physical geography hierarchy. To do that we’ll do a query like this:
This is a simple query that joins the HSP_ATTRIBUTE_MEMBER table (this table contains all the attributes from all attribute dimensions), and that's why we have that sub-query: to select just the Support Geography members (1).
Then we join this table with HSP_OBJECT, which contains the metadata of everything in Planning: forms, dimensions, flows, everything. We do that to get the names of the members. And finally we do a CONNECT BY to rebuild the hierarchy (2).
With the hierarchy rebuilt we can use CONNECT_BY_ROOT to figure out to which top level each member belongs (3).
With this query we know what the leaf members of the Support Geography dimension are (ID) and to what region each member belongs. Now we need to do the same for the Entity dimension.
With this query we are filtering OBJECT_TYPE = 33, which means Entity (1) (if you want to know what the possible object types are, you can query the HSP_OBJECT_TYPE table to check that out), then we do a CONNECT BY to rebuild the hierarchy and then use some CONNECT_BY_ROOT to get the parents, as well as SYS_CONNECT_BY_PATH to get the path of the hierarchy.
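A hedged sketch of that Entity query, using only the HSP_OBJECT columns mentioned here (OBJECT_ID, OBJECT_NAME, PARENT_ID, OBJECT_TYPE); the START WITH condition is an assumption for picking the top-level entities:

SELECT OBJECT_ID,
       OBJECT_NAME,
       CONNECT_BY_ROOT OBJECT_NAME           AS TOP_ENTITY,   -- top-level ancestor of each member
       SYS_CONNECT_BY_PATH(OBJECT_NAME, '|') AS ENTITY_PATH   -- full path of the hierarchy
  FROM (SELECT OBJECT_ID, OBJECT_NAME, PARENT_ID
          FROM HSP_OBJECT
         WHERE OBJECT_TYPE = 33)                              -- 33 = Entity (1)
 START WITH PARENT_ID NOT IN (SELECT OBJECT_ID
                                FROM HSP_OBJECT
                               WHERE OBJECT_TYPE = 33)        -- assumed: top entities are the ones whose parent is not an entity member
CONNECT BY PRIOR OBJECT_ID = PARENT_ID;                       -- rebuild the hierarchy (2)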
Now we just need to join everything in the same query to get all the cost centers, which top level each one belongs to and which region they are part of.
Another thing that I like a lot about the WITH clause is that it is very easy to create nested queries. In this case I created 2 different queries, each one with a WITH and a SELECT based on that WITH. Now all I need to do is put everything under the same WITH:
We can see that now I just took the SELECT that was under each WITH clause and created a nested WITH by adding steps 1 and 2; instead of 2 WITH queries, I now have just 1 WITH with 4 clauses under it, 2 of them being nested ones based on the previous one (the CONNECT BY over the filtered query).
All we need to do is put everything together by joining ENT_HIER and SG_HIER using the HSP_MEMBER_TO_ATTRIBUTE table. This table is basically the map between the Entity dimension and the attribute dimensions, in this case Support Geography.
The result is the cost center, the path of that member in the hierarchy (and with this we can use REGEXP to extract any level of the hierarchy) and, finally, the region that cost center belongs to.
You may wonder why we need to do that. Well, this is a query I used to join with the HSP_GROUP table to get the groups and the members from each region and then create the security dynamically for each user. That means I have a Planning application with entities that have cost centers from different regions, and I'm using the attribute dimension to generate the security.
If a user has access only to AMER data, they will only see the AMER cost centers. This is just one example of what we can achieve using CONNECT BY.
I hope you guys enjoy it. Next time we'll talk about another very, very useful function that I really love.
Continuing the Oracle SQL for EPM series, today we'll extend the “Connect by” post a little bit to see some neat things we can do when we mix regular expressions with CONNECT BY.
First of all, let's start by explaining the basics of this approach. We can use CONNECT BY and LEVEL to generate data in Oracle like this:
This is very useful when you need to generate 2 rows for each row you have in a query: you can do a CONNECT BY LEVEL <= 2 and use it to duplicate everything in the source query. Then you can use a DECODE or CASE to say: case when level = 1 then x, when level = 2 then y…
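A small sketch of both ideas: the row generator itself, and one common way to use it to duplicate the rows of a (hypothetical) source table:

-- the generator: 2 rows out of DUAL
SELECT LEVEL FROM DUAL CONNECT BY LEVEL <= 2;

-- cross joining the generator against a source table duplicates every row,
-- and a CASE/DECODE on the level tells the copies apart (MY_SOURCE is illustrative)
SELECT S.*,
       CASE G.LVL WHEN 1 THEN 'x' WHEN 2 THEN 'y' END AS COPY_TYPE
  FROM MY_SOURCE S
 CROSS JOIN (SELECT LEVEL AS LVL FROM DUAL CONNECT BY LEVEL <= 2) G;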
But for us this is just the basic concept that we need to understand for our trick. Let's get back to our metadata table, the one with a parent-child relationship and with all the metadata from all apps and all dims that we have. Let's say we need to create a hierarchical (generation) table based on a parent-child table. Some people do one UNION for each level they have in the source table (too much work), others will put filters all over the place; I don't, I just use CONNECT BY with regular expressions. Let's take a look.
As we can see, we have 17 levels or generations for the Account dimension. Quite big. First of all, let's do a connect-by path to flatten the entire hierarchy into one row and remove the parent-child relationship.
Now, as we can see, if we just do a CONNECT BY PRIOR with SYS_CONNECT_BY_PATH, the query will generate 1 row for each level of data. We don't want that, unless you want the ability to input data at the upper levels of the hierarchy, which is usually not the case. What we want is to have the entire path for each leaf member, so we need to filter just the leaves in our query. In my case my table already has a column that says whether a member has children or not, but I'll proceed as if I didn't have it.
If we don't have the leaf information, we can generate it by using CONNECT_BY_ISLEAF, which basically returns 0 if the row is a parent or 1 if it is a leaf. Again, we filter this outside the CONNECT BY PRIOR, and now we have just the leaves and the entire hierarchy above them. The next step is to split this back into columns.
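A sketch of this step, assuming a parent-child metadata table METADATA_TBL with MEMBER_NAME and PARENT_NAME columns (names are illustrative):

SELECT *
  FROM (SELECT MEMBER_NAME,
               SYS_CONNECT_BY_PATH(MEMBER_NAME, '|') AS MEMBER_PATH,  -- flattened hierarchy, pipe-delimited
               CONNECT_BY_ISLEAF                     AS IS_LEAF       -- 1 = leaf, 0 = parent
          FROM METADATA_TBL
         START WITH PARENT_NAME IS NULL
        CONNECT BY PRIOR MEMBER_NAME = PARENT_NAME)
 WHERE IS_LEAF = 1;   -- keep only the leaves, each one with its entire path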
REGEXP_SUBSTR is the perfect tool to split the string we have into new columns. As you can see, we can use the following expression: REGEXP_SUBSTR(Column, '[^|]+', 1, 1). Basically the regexp uses what is inside the [] to find what you want and split it the way you want. In our case I'm looking for the first chunk of text that is not a '|' (that is what the 1, 1 defines: start at position 1, first occurrence), which gives me the first word of the path. Then I have another regexp where I change the 1, 1 to 1, 2 to get the second word, and I keep doing this until I get to the 17th generation (as we saw in the beginning).
Basically, if you have 17 generations you need to have 17 columns in the table, so you need 17 REGEXPs, one for each column; you just need to change the last parameter: 1, 1, then 1, 2, then 1, 3… up to 1, 17.
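Continuing the previous sketch, splitting the path into generation columns is just a list of REGEXP_SUBSTR calls where only the occurrence parameter changes:

SELECT REGEXP_SUBSTR(MEMBER_PATH, '[^|]+', 1, 1)  AS GEN1,
       REGEXP_SUBSTR(MEMBER_PATH, '[^|]+', 1, 2)  AS GEN2,
       REGEXP_SUBSTR(MEMBER_PATH, '[^|]+', 1, 3)  AS GEN3,
       -- ... one call per generation ...
       REGEXP_SUBSTR(MEMBER_PATH, '[^|]+', 1, 17) AS GEN17
  FROM LEAF_PATHS;   -- LEAF_PATHS stands for the leaf/path query from the previous sketch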
Here's where ODI comes in handy, because if you use ODI you just need to do a SELECT in the source to figure out how many generations you have and then dynamically generate the 17 REGEXPs and pass them to the target. It's very easy, since the code itself is always the same and the only thing that changes is the occurrence parameter, and it's sequential.
I hope this can be as useful for you as it is for me, and see you soon.