Archive for April, 2022

ODI in the hybrid database world – Files/Stages and SnowSQL

Posted in Uncategorized on April 14, 2022 by radk00

Hi all, as mentioned in the second post of this series, there is a faster way to load data into Snowflake using ODI. It requires the creation of a new KM and it has some peculiarities that some may see as limitations at first, but the speed of the data loading is totally worth the work and preparation needed to make it happen. First, let’s just picture how the current Snowflake JDBC process looks.

It’s simple and very straightforward. You create an ODI mapping that reads your on-premises DB and sends data over to Snowflake using JDBC. If you have small data loads, or if you are happy with the time the jobs are taking using this method, I recommend you stay with it, because it’s pretty simple and it works.

Now, if you need speed due to the volume of data that you need to transfer, you may create the following architecture:

Let’s describe the steps. First, ODI is used (through a mapping) to generate a text file. The format may be anything you like (it needs to match the Snowflake stage definition, as you will see below); I’m using a pipe (|) delimited file for this example. Then ODI calls the SnowSQL client (more on that later), which compresses the file and pushes it to a Snowflake STAGE area, from where it is finally copied over to the final table.

If you stop and think about it for a second, it seems very stupid. You have the data in a database, you extract it to a file, then you call a process to compress it and push it over the internet, stage it and finally copy it. It’s way more work than the first method, right? It is way more work, and it also requires space to store the text file; however, it’s way faster than JDBC.

You see, the main bottleneck when working with the cloud is exactly the transfer over the internet. With the second technique, what we end up doing is zipping a large file and sending it across the network all at once, instead of relying on a Java JDBC connector that buffers some X number of rows and sends them across repeatedly. The amount of work that the JDBC driver does internally is way more and way slower than just creating a file, compressing it and sending it.

Also, cloud platforms are awesome at working with files. Every cloud provider out there makes it very easy and fast to manipulate “raw” data, and Snowflake is no different. It will stage and copy the compressed text file at extreme speed, way faster than batches of rows going over JDBC.

If you are still not sure whether you should follow this route, my answer is to wait until you really need to create a fast process and then give it a go. You may do a quick test by simply extracting the data to a text file and running SnowSQL commands to push the data, something like the sketch below. You will see that it is super fast.
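Just to give a rough idea of such a test (the connection name, file path, stage and table names below are only placeholders for your own environment, and the stage is assumed to already exist with a matching pipe-delimited file format), it would be something like extracting the CLIENTS table to a file and running:

snowsql -c MY_CONNECTION -q 'PUT file:///tmp/CLIENTS.txt @MY_STAGE auto_compress=true'

snowsql -c MY_CONNECTION -q 'copy into MY_DB.MY_SCHEMA.CLIENTS from @MY_STAGE/CLIENTS.txt.gz'

Then just compare the elapsed time against the equivalent JDBC mapping.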

Let’s see how we should implement this. The first thing is that you will need to install the SnowSQL client on your architecture (the ODI agent server). This client is the one that will be called to execute things in Snowflake, including pushing the file and copying it. I won’t go over the details of SnowSQL, but you may read all about it in this documentation from Snowflake.

Another thing that I’ll just assume you know is how STAGES and the PUT and COPY commands work in Snowflake. You may read their documentation here, here and here.
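Just to make the stage piece a bit more concrete, a minimal sketch of the Snowflake side (object names are placeholders, and the file format must match whatever your mapping generates, a pipe-delimited file in my case) could be:

create file format MY_PIPE_FORMAT type = csv field_delimiter = '|';

create stage MY_STAGE file_format = (format_name = 'MY_PIPE_FORMAT');

With the stage defined like this, the COPY command that you will see later does not need to repeat the file format details.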

The second step is to create a copy of the current “IKM SQL to File Append” and give it a new name, in my case “IKM SQL to File Append – Snowflake PUT”. Delete some of its steps and leave only the ones below:

These steps will basically just create a file on a server. This file needs to have the same name as the table that you want to load in Snowflake, plus the “.txt” extension (e.g., if you are loading the CLIENTS table, you need to create a CLIENTS.txt file on the server). The target Datastore definition may be anything you want, but I’m following this pattern:

Now you need to add only two more steps to the KM, as below:

Snowflake PUT

The target command is the following:

OdiOSCommand "-OUT_FILE=<%=odiRef.getTable("TARG_NAME")%>_put.log" "-ERR_FILE=<%=odiRef.getTable("TARG_NAME")%>_put.err"

snowsql -c #P_CONNECTION_NAME -w #P_SNOW_WAREHOUSE -r #P_SNOW_ROLE -d #P_SNOW_DATABASE -s #P_SNOW_SCHEMA -q 'PUT file://<%=odiRef.getTable("TARG_NAME")%>.txt @#P_STAGE_NAME auto_compress=#P_AUTO_COMPRESS parallel=#P_PUT_PARALLEL'

We can see that it is basically one OS command that calls the snowsql client. It passes all the connection information needed to log in and then issues a PUT command to Snowflake. This PUT command sends a text file to a stage area, with auto compression and a defined number of parallel workers. If you are familiar with ODI, you know that all of those # variables need to come from somewhere. You may implement this any way you want, but in my case I put a SQL statement in the command on the source tab that returns all this information from a parameter table located in the on-premises database, like below:
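Just to illustrate what that source command could look like (the parameter table and column names below are purely hypothetical, yours will follow your own design), the important part is that the column aliases match the # variables used in the target command:

SELECT CONNECTION_NAME P_CONNECTION_NAME,
       SNOW_WAREHOUSE  P_SNOW_WAREHOUSE,
       SNOW_ROLE       P_SNOW_ROLE,
       SNOW_DATABASE   P_SNOW_DATABASE,
       SNOW_SCHEMA     P_SNOW_SCHEMA,
       STAGE_NAME      P_STAGE_NAME,
       AUTO_COMPRESS   P_AUTO_COMPRESS,
       PUT_PARALLEL    P_PUT_PARALLEL
FROM   ODI_SNOWFLAKE_PARAMS
WHERE  CONFIG_CODE = '<%=odiRef.getOption("CONFIG_CODE")%>'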

I even added a CONFIG_CODE filter (a KM option that I added to this new KM) for the case of having multiple Snowflake configurations (which is very common). So, if you have multiple configs, you may add this option to your new KM and use it when you are creating a new mapping.

Snowflake Copy

This step is very similar to the one before. In the target tab we will have the following:

OdiOSCommand "-OUT_FILE=<%=odiRef.getTable("TARG_NAME")%>_copy.log" "-ERR_FILE=<%=odiRef.getTable("TARG_NAME")%>_copy.err"

snowsql -c #P_CONNECTION_NAME -w #P_SNOW_WAREHOUSE -r #P_SNOW_ROLE -d #P_SNOW_DATABASE -s #P_SNOW_SCHEMA -q 'copy into #P_SNOW_DATABASE.#P_SNOW_SCHEMA.<%=odiRef.getTargetTable("TABLE_NAME")%> from  @#P_STAGE_NAME/<%=odiRef.getTargetTable("TABLE_NAME")%>.txt.gz'

This one issues a copy from the Snowflake stage into the final table. On the source tab, we have the same SQL shown in the prior step:
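Note that this COPY relies on the stage already knowing the file format (as in the stage sketch earlier in the post). If your stage was created without one, a variant with the format spelled out inline (same placeholder names as before) would be along these lines:

copy into MY_DB.MY_SCHEMA.CLIENTS from @MY_STAGE/CLIENTS.txt.gz file_format = (type = csv field_delimiter = '|')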

And that’s it. You are ready to push data from on-premises to Snowflake in a very fast way. It takes some work upfront, but I can guarantee that it’s worth it.

Thanks, see you soon!

ODI in the hybrid database world – Snowflake JDBC

Posted in Uncategorized on April 13, 2022 by radk00

This second post will talk about Snowflake integration with ODI. Let’s picture a scenario like the one in the last post: your company has a large on-premises ETL/database footprint, but it is slowly starting to move to the cloud, in this case to Snowflake. You want to use the existing ODI architecture for this task, but Snowflake is not a technology that comes out of the box with ODI, so how could you do that? Let’s figure it out in this post.

Luckily a good friend of mine, Michael Rainey, wrote about it in his post here. I won’t go over the details because I don’t want to copy and paste what is already written there, but in a nutshell you need to download the Snowflake JDBC driver, add it to the ODI agent, create/copy a new technology for Snowflake usage and that’s it. It will work just fine.
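For reference, the data server in that new technology ends up using the standard Snowflake JDBC driver class and URL format, roughly like below (the account identifier, warehouse, database, schema and role are placeholders for your own values):

Driver: net.snowflake.client.jdbc.SnowflakeDriver

URL: jdbc:snowflake://<my_account>.snowflakecomputing.com/?warehouse=MY_WH&db=MY_DB&schema=MY_SCHEMA&role=MY_ROLE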

However, after working with it for some time, I found some details that I think are worth sharing with you. First, differently from the first post, where Oracle is already a technology that ODI knows, Snowflake is not, and for that reason you may start to face some small issues here and there regarding SQL statements, for example. If the KMs or procedures that you are using contain standard/universal SQL that both Oracle and Snowflake understand, they will work just fine. If the SQL is somewhat different in Snowflake (meaning a different syntax) or is Oracle exclusive, then you will need to start doing some customizations, as in the small example below. Luckily, most of these customizations should be very simple to adapt to Snowflake.
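Just to give one example of the kind of adjustment I mean (MY_TABLE is just an illustration, not something from the KMs): Oracle’s ROWNUM pseudo-column does not exist in Snowflake, so a step written like the first statement below would need to be rewritten along the lines of the second one:

SELECT * FROM MY_TABLE WHERE ROWNUM <= 10   -- Oracle

SELECT * FROM MY_TABLE LIMIT 10             -- Snowflake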

Another thing is data volume. For small workloads, it works well; for larger ones you will need to do some tweaks. One way to decrease the load times is by playing with the Array Fetch/Batch Update sizes and the Degree of Parallelism for Target, as you can see below.

Array Fetch/Batch Update sizes are very hard to fine-tune to an optimal value, since they depend on a lot of factors like the length and size of the table, the network, and so on. Sometimes you may fine-tune them for smaller tables, but then larger tables suffer, and vice versa. You will need to run some tests and see what the best value for your case is. However, Degree of Parallelism for Target is one that you may increase up to 20 without too much worry, and you will see a huge gain. You cannot increase it further because Snowflake (at least in my account) has a limit of 20 parallel threads working on the same object at a time.

If you want to see what is happening during the push of data to Snowflake, you may check its Load task in Operator and click on Details. It will show you the details and the time that each thread took to execute:

However, even with those tweaks in the Topology, I found myself in situations where the data load was just not fast enough. After doing some research and some testing, I figured out that there is a way to push data to Snowflake way faster than JDBC, and that is by using the SnowSQL client. I’ll cover that one in the next post.

See you soon!

ODI in the hybrid database world – Oracle Autonomous Database

Posted in Uncategorized on April 13, 2022 by radk00

Hi all, today I’ll start a series of four posts related to ODI’s place in a hybrid database world. Everybody has known for quite some time that the cloud is the future. Some companies may delay its adoption, but it will eventually happen one way or another. However, this adoption will probably not be all at once. Companies, especially the ones that have a large investment on-premises, will need to live in a hybrid mode until things get migrated, built and adapted. And this takes time, a lot of time.

Also, people often start thinking about migrating to the cloud by either:

  • Massively migrating the existing database/data to the cloud, which may sound very promising on paper, but generally fails miserably when you try to implement it, simply because cloud and on-premises are not the same thing (even if the marketing guys tell your boss that it’s all the same and the migration is a piece of cake).
  • Starting from scratch, which is great for new projects, but most people have already invested in their on-premises architecture, still need it and don’t want to redo all the existing stuff again.

The truth is that companies will end up building something hybrid: whatever is new will be developed with the cloud in mind already, but whatever already exists will be integrated (not migrated) into the cloud in stages, until at some point the old process either gets converted completely or gets replaced by something new in the cloud.

Those that have ODI as their ETL tool on-premises will find it easy to integrate things with the cloud using whatever they have today. This is because ODI is great at incorporating technologies that do not come out of the box in an easy manner. For this series of four posts, I’ll be talking about the following:

  • Integrating with Oracle Autonomous Database
  • Integrating with Snowflake JDBC
  • Integrating with Snowflake – Files/Stages and SnowSQL
  • Integrating with Google Big Query

For this first post, let’s start with “Integrating with Oracle Autonomous Database”, just because it’s extremely easy to do. Let’s imagine the scenario. You already have a large ETL architecture on-premises and your company has started to use Oracle Autonomous Database as its cloud solution. Instead of migrating all at once, they will do it in stages, leveraging everything that they have already built and pushing only essential data to the cloud. Since it’s a hybrid approach, maybe they even want to get data from the cloud back to the on-premises database, to support some existing application.

The first thing to do in ODI is to create a new Data Server in the Oracle Technology:

Add the user and password that will be used to connect. Now, instead of adding the JDBC details, as we usually would do, click on “Use Credential File”:

You will need to point to the file that has the connection to your cloud DB. To get this, go to your Oracle DB instance in the cloud, click on DB Connection and download the wallet file.

Add a password to it:

Save the zip file and go to ODI. On Credential File, select the zip file that you just downloaded. If the file is correct, you will be able to select the Connection Details below:

And it’s done. If you go to JDBC URL, you will see that ODI automatically populates all the info for you:
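Just so you know roughly what to expect (the service alias and wallet path below are placeholders, and the exact string that ODI builds may differ a bit), the wallet-based URL tends to look something like:

jdbc:oracle:thin:@myadb_high?TNS_ADMIN=/path/to/wallet_folder

where myadb_high is one of the TNS aliases that come inside the wallet zip file.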

Click Test Connection to make sure all is correct, and you are good to go:

From this point on, since it’s Oracle, it’s all the same. You may do whatever you want with this database, because it’s Oracle; the only difference is that it is located somewhere in the cloud and not on-premises. One thing to notice, though, is that since it’s in the cloud, it will have network constraints. Data loads will take time depending on several factors that are beyond this post and depend on each company’s architecture. But the main thing is that you may create ODI mappings and procedures and push/get data to/from the cloud as needed, in a very simple way.

See you soon!