ETL Method – Fastest Way To Get Data from DB2 to Microsoft SQL Server

For a while, I have been working on figuring out a “better” way to get data from DB2 to Microsoft SQL Server. There are many different options and approaches and environments, and this one is mine, your mileage may vary.

Usually, when pulling data from DB2 to any Windows box, the first thing you might think of is ODBC. You can either use the Microsoft DB2 driver (which works, if you are lucky enough to get it configured and working), or the IBM iSeries Client Access ODBC Driver (which works well), or another 3rd party ODBC driver. Using ODBC, you can access DB2 with a ton of different clients. Excel, WinSQL, any 3rd party SQL Tool, a MSSQL linked server, SSIS, etc. ODBC connects just fine, and will work for “querying” needs. Also, with the drivers you might install, you can usually set up an OLE DB connection if your client supports it (SSIS for example) and query the data using OLEDB – this works as well, but there are some caveats, which I will talk about.

In comes SSIS, the go to ETL tool for MSFT BI developers. You want to get data from DB2 to your SQL Server Data Warehouse, or whatever. You try with an OLEDB connection source, but it is clunky, weird, and sometimes doesn’t work at all (PrimeOutput Errors Anyone?). If you do manage to get OLEDB configured and working, you still probably will be missing out on some performance gains compared to the method I am going describe.

Back to SSIS, using ODBC. It works. You have to create an ADO.NET ODBC connection, and use a DataReader source instead of an OLEDB source. Everything works fine, except one thing. It is slow! Further proof?

http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/162e55e5-b64b-423e-94c1-dd764ca1f683

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=96977

http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/cfade7e7-50d5-4447-9821-35c5d5ae1b66

http://www.sqlservercentral.com/Forums/Topic702042-148-1.aspx

http://www.sqlservercentral.com/Forums/Topic666993-148-1.aspx

Ok, enough links. But if you do read those. SQL 2000 DTS is faster than using SQL 2005/2008 SSIS. WTF? The best I can guess is that it is because of the .NET wrapper around ODBC. DTS is using “native” ODBC.

So, now what? Do we want to use DTS 2000? No. What to do though?

Well, after a few days of research, and just exploring around, I think I have found a good answer.. Replace DB2 with SQL Server.. just kidding. Here is what you need to do:

Install the IBM Client Access tools. There is a tool called “Data Transfer From iSeries Server” which the actual exe is "C:Program FilesIBMClient Accesscwbtf.exe"

image

This little tool allows you to set up data transfers from your DB2 system to multiple output choices (Display, Printer, Html, and Text). We want to export to Text file on our filesystem. You have to set up a few options, like the FileName, etc. In “Data Options” you can set up a where statement, aggregates, etc.

If you output to a file, you can go into “Details” and choose a file type, etc. I use ASCII Text, and then in the  “ascii file details” I uncheck all checkboxes. You set up your options and then hit the “Transfer data from iSeries” button and it will extract data to the file you chose in the filename field. Pretty sweet. But this is a GUI, how can I use this tool? I am not going to run this manually. Well, you are in luck.

If you hit the “Save” button, it will save a .dtf file for you. If you open this .dtf file in a text editor, you will see all options are defined in text, in a faux ini style. Awesome, we are getting somewhere.

Now, how do you run this from a cmd prompt? Well, we are in luck again. Dig around in C:Program FilesIBMClient Access and you will find a little exe called “rxferpcb”

image

What this tool allows you to do, is pass in a “request” (aka a DTF file), and a userid/password for your DB2 system, and it will execute the transfer for you. Sweet!

Now what do we do from here?

1) Create an SSIS package

2) Create an execute process task, call rxferpcb and pass in your arguements.

3) Create a BULK Insert task, and load up the file that the execute process task created. (note you have to create a .FMT file for fixed with import. I create a .NET app to load the FDF file (the transfer description) which will auto create a .FMT file for me, and a SQL Create statement as well – saving time and tedious work)

Now take 2 minutes and think how you could make everything generic/expression/variable driven, and you have yourself a sweet little SSIS package to extract any table from DB2 to text and bulk load it.

image

What is so great about the .DTF files is that you can modify them with a text editor, which means you can create/modify them programmatically. Think – setting where statements for incremental loads, etc.

image

 

You can see from the two screenshots above, that is all there is. Everything is expression/variable drive. Full Load, and Incremental Load. Using nothing but .dtf files, rxferpcb, a little .NET app I wrote to automatically create DTF’s for incremental (where statements), truncate, delete, and bulk insert. I can load up any table from DB2 to SQL by just setting 3 variables in a parent package.

After you wrap your head around everything I just went over, then stop to think about this. The whole DTF/Data Transfer/etc is all exposed in a COM API for “Data Transfer Automation Objects’”

http://www-912.ibm.com/s_dir/slkbase.NSF/643d2723f2907f0b8625661300765a2a/0c637d6b03f927ff86256a710076ab22?OpenDocument

With that information at your disposal, you could really do some cool things. Why not just create a SSIS Source Adapter that wraps that COM object and dumps the rows directly to the SSIS Buffer, and then does an OLEDB insert or Bulk Insert using the SQL Server Destination?

I have found in my tests that I can load over 100 million row tables – doing a full complete load, in about 6-7
hours. 30-40 million row tables in 4 hours. 2 to extract, 2 to BULK insert. Again, your mileage may vary depending on the width of your table, network speed, disk I/O, etc. To compare, with ODBC, just pulling and inserting 2 million records was taking over 2 hours, I didn’t wait around for it to finish. Pulling 2 Million records with my method described in this blog takes about 3-5 minutes (or less!)

I know I have skimmed over most of the nitty gritty details in this post, but I hope to convey from a high level that ODBC/OLE DB just aren’t as fast as the method here, I have spent a lot of time over the last few weeks comparing and contrasting performance and manageability. Now, if I could just get that DB2 server upgrade to SQL Server 2008. . . Happy ETL’ing!

The problem isn’t SQL Server. It’s you.

Throughout all my years in different places, I have seen SQL, Oracle, Firebird, MySQL, DB2, Zortec, Access, and probably a few other crazy databases set up and run, and administered. Of course most of them along the way have been Microsoft SQL Server, (6.5, 7, 2000, 2005, 2008). I’ve worked with some knowledgeable DBA’s, and in those cases everything usually turns out ok.

But sometimes, in some department or place or whatever, your buddy down the street wanting to start a new company, your girlfriends place of work that wants to track orders, whatever, they usually try to get SQL Server running, and what sometimes happens next just makes my head spin. Microsoft, bless them, sometime in the past, not so much now, tried to market SQL Server as “self manageable”. Probably sometime between 6.5 and 7, they tweaked some update stats routines and schedules and its all good, right? Set autogrow by default, and you are good to go. Wrong.

What this awesome marketing strategy did, was get people, places, and organizations, mostly ISV’s to use SQL Server and install it, get their app running, and walk away. Of course it runs for a while, runs like a champ even. But then months, even years go by and the system starts running slow. There is no DBA around, they didn’t need one, SQL Server manages itself! Wrong again.

What you might end up with though, are people using the system that might know a little bit, enough to be deadly even, and they start making changes, when in reality you need a full fledged DBA to manage your server, and database, hence the name DBA (database administrator). But before the DBA comes on to the scene to save the day, you will have the people that blame SQL Server. “Oh SQL Server doesn’t work at all, it can’t perform’”… or “Our other databases run 10x as fast, what gives” (not mentioning they have 3 DBA’s for those “other” databases, but not for MS SQL). and the quotes keep coming.

That is why the title of this post is what it is.

The problem isn’t SQL server, it’s you

. If you fail to realize that MS SQL is an Enterprise class database system, and treat it like some out of the box, already configured, plug and play system, you are going to run into issues eventually. You need a DBA. Probably best to have one BEFORE you implement any system, even if it is a consultant to guide your implementation, and assist as time goes on.

I sometimes get tired trying to argue that MS SQL can hold its own against Oracle, DB2, whatever. Trust me, it can. I could probably go find tons of SQL DBA’s that would back me up as well. It is all about how you manage and administer it! SQL Sever does just fine, as long as you know what you are doing. Just like any system. I think sometimes that if we took SSMS away, and just made everything cmd line/scripting, that people “outside” of the MS SQL community would see how MS SQL works in compared to their own systems.

This post isn’t meant to be a beat down rant or anything, but the same things can be said for .NET compared to Java, C++, etc, or whatever. It just seems sometimes that people that live and breathe Microsoft SQL need to know what the other RDMS/BI systems are capable of, but for some reason the same isn’t true for people that use the other systems. They kind of just brush MS SQL off as a play toy, something that shouldn’t be taken seriously, a “hobbyist” SQL system. Something that any enterprise wouldn’t be caught dead running, that is of course, unless you are Microsoft. 🙂

I’m still hedging my bets on MS SQL and .NET, I haven’t seen anything better for the price and ease of use, and the best part about it, the community. The MS SQL and Development community is huge compared to anything else, and to me that just puts the icing on the cake. Just remember the next time someone who needs a MS SQL DBA but doesn’t have one complains about performance of their system, you can tell them it’s not SQL Server’s fault, it’s probably the lack of neglect to SQL Server that caused the problems.