
ETL Method – Fastest Way To Get Data from DB2 to Microsoft SQL Server

For a while, I have been working on figuring out a “better” way to get data from DB2 to Microsoft SQL Server. There are many different options, approaches, and environments; this one is mine, so your mileage may vary.

Usually, when pulling data from DB2 to any Windows box, the first thing you might think of is ODBC. You can use the Microsoft DB2 driver (which works, if you are lucky enough to get it configured and working), the IBM iSeries Client Access ODBC driver (which works well), or another 3rd-party ODBC driver. Using ODBC, you can access DB2 with a ton of different clients: Excel, WinSQL, any 3rd-party SQL tool, a MSSQL linked server, SSIS, etc. ODBC connects just fine, and will work for “querying” needs. Also, with the drivers you might install, you can usually set up an OLE DB connection if your client supports it (SSIS, for example) and query the data using OLE DB. This works as well, but there are some caveats, which I will talk about below.

In comes SSIS, the go-to ETL tool for MSFT BI developers. You want to get data from DB2 to your SQL Server data warehouse, or whatever. You try with an OLE DB connection source, but it is clunky, weird, and sometimes doesn’t work at all (PrimeOutput errors, anyone?). If you do manage to get OLE DB configured and working, you still will probably be missing out on some performance gains compared to the method I am going to describe.

Back to SSIS, using ODBC. It works. You have to create an ADO.NET ODBC connection, and use a DataReader source instead of an OLE DB source. Everything works fine, except for one thing: it is slow! Need further proof?

http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/162e55e5-b64b-423e-94c1-dd764ca1f683

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=96977

http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/cfade7e7-50d5-4447-9821-35c5d5ae1b66

http://www.sqlservercentral.com/Forums/Topic702042-148-1.aspx

http://www.sqlservercentral.com/Forums/Topic666993-148-1.aspx

Ok, enough links. But if you do read those, you’ll see that SQL 2000 DTS is faster than SQL 2005/2008 SSIS. WTF? The best I can guess is that it is because of the .NET wrapper around ODBC; DTS is using “native” ODBC.

So, now what? Do we want to use DTS 2000? No. What to do though?

Well, after a few days of research, and just exploring around, I think I have found a good answer… replace DB2 with SQL Server… just kidding. Here is what you need to do:

Install the IBM Client Access tools. There is a tool called “Data Transfer From iSeries Server”; the actual exe is "C:\Program Files\IBM\Client Access\cwbtf.exe"

[screenshot: Data Transfer From iSeries Server]

This little tool allows you to set up data transfers from your DB2 system to multiple output choices (Display, Printer, HTML, and Text). We want to export to a text file on our filesystem. You have to set up a few options, like the file name, etc. In “Data Options” you can set up a WHERE statement, aggregates, etc.

If you output to a file, you can go into “Details” and choose a file type, etc. I use ASCII text, and then in the “ASCII file details” I uncheck all the checkboxes. You set up your options and then hit the “Transfer data from iSeries” button, and it will extract data to the file you chose in the file name field. Pretty sweet. But this is a GUI, so how can I use this tool? I am not going to run this manually. Well, you are in luck.

If you hit the “Save” button, it will save a .dtf file for you. If you open this .dtf file in a text editor, you will see all the options defined in text, in a faux INI style. Awesome, we are getting somewhere.

Now, how do you run this from a cmd prompt? Well, we are in luck again. Dig around in C:\Program Files\IBM\Client Access and you will find a little exe called “rxferpcb”

[screenshot: rxferpcb.exe in the Client Access folder]

What this tool allows you to do is pass in a “request” (aka a DTF file) and a user ID/password for your DB2 system, and it will execute the transfer for you. Sweet!

Now what do we do from here?

1) Create an SSIS package

2) Create an Execute Process Task, call rxferpcb, and pass in your arguments.

3) Create a Bulk Insert Task, and load up the file that the Execute Process Task created (see the sketch after this list). Note that you have to create a .FMT format file for the fixed-width import. I created a .NET app that loads the FDF file (the transfer description) and auto-creates a .FMT file for me, and a SQL CREATE statement as well, saving time and tedious work.
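To make step 3 concrete, here is a minimal T-SQL sketch of the bulk load. The staging table, file paths, and the rxferpcb arguments in the comment are all made-up examples, so check the argument order against your own Client Access install.

-- Step 2 (Execute Process Task) runs something along the lines of:
--   "C:\Program Files\IBM\Client Access\rxferpcb.exe" mytransfer.dtf MYUSER MYPASSWORD
-- Step 3 then bulk loads the fixed-width text file it produced.
BULK INSERT dbo.stg_MyDb2Table                  -- hypothetical staging table
FROM 'D:\Extracts\MYDB2TABLE.txt'               -- file name set in the .dtf
WITH (
    FORMATFILE = 'D:\Extracts\MYDB2TABLE.fmt',  -- format file generated from the FDF
    TABLOCK                                     -- allow a minimally logged bulk load
);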

Now take 2 minutes and think about how you could make everything generic/expression/variable driven, and you have yourself a sweet little SSIS package to extract any table from DB2 to text and bulk load it.

[screenshot: SSIS package]

What is so great about the .DTF files is that you can modify them with a text editor, which means you can create/modify them programmatically. Think: setting WHERE statements for incremental loads, etc.

[screenshot: SSIS package]

 

You can see from the two screenshots above that this is all there is. Everything is expression/variable driven: full load and incremental load. Using nothing but .dtf files, rxferpcb, and a little .NET app I wrote to automatically create DTFs for incrementals (WHERE statements), truncate, delete, and bulk insert, I can load up any table from DB2 to SQL by just setting 3 variables in a parent package.

After you wrap your head around everything I just went over, stop to think about this: the whole DTF/Data Transfer setup is exposed in a COM API for “Data Transfer Automation Objects”:

http://www-912.ibm.com/s_dir/slkbase.NSF/643d2723f2907f0b8625661300765a2a/0c637d6b03f927ff86256a710076ab22?OpenDocument

With that information at your disposal, you could really do some cool things. Why not just create an SSIS source adapter that wraps that COM object and dumps the rows directly to the SSIS buffer, and then does an OLE DB insert or bulk insert using the SQL Server Destination?

I have found in my tests that I can do a full, complete load of tables with over 100 million rows in about 6-7 hours, and 30-40 million row tables in about 4 hours: 2 to extract, 2 to BULK INSERT. Again, your mileage may vary depending on the width of your table, network speed, disk I/O, etc. To compare, with ODBC, just pulling and inserting 2 million records was taking over 2 hours, and I didn’t wait around for it to finish. Pulling 2 million records with the method described in this blog takes about 3-5 minutes (or less!).

I know I have skimmed over most of the nitty-gritty details in this post, but I hope to convey at a high level that ODBC/OLE DB just aren’t as fast as the method here; I have spent a lot of time over the last few weeks comparing and contrasting performance and manageability. Now, if I could just get that DB2 server upgraded to SQL Server 2008… Happy ETL’ing!


Early Arriving Facts, Late Arriving Dimensions, Inferred Dimensions

Most ETL systems (at least the ones I have seen/studied/worked on) that populate data warehouses run something like this:

 

1) Load Dims

a) populate an “unknown” member record

b) populate dim data

2) Load Facts

a) join/look up to dims, and if there is no match, set the fact to the “unknown” dimension record

3) Process Cube

 

This type of system works in many cases, but there are some flaws that bubble up over time. First, unless you reload your fact table or update the unknown dimension keys on your fact, you could end up with unknowns that will be unknowns forever. The system described above also means you need to run it in that order: dims first, facts second.
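For what it’s worth, that “update the unknown keys” option usually looks something like the T-SQL below. This is just a sketch that assumes the fact table also carries the dimension’s business key, and all of the table and column names are made up.

-- Re-point fact rows that were loaded against the "unknown" member (-1)
-- once the matching dimension row finally shows up.
UPDATE f
SET    f.CustomerKey = d.CustomerKey
FROM   dbo.FactSales   AS f
JOIN   dbo.DimCustomer AS d
       ON d.CustomerBusinessKey = f.CustomerBusinessKey
WHERE  f.CustomerKey = -1;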


Early Arriving Facts/Late Arriving Dimension – If you are an optimist, we have the fact data before we have the dimension data. Or, if you are a pessimist, we don’t have the dimension data when we load the fact. You choose, but in either scenario, we have data missing somewhere.

Like I mentioned earlier, many systems will just treat the early arriving fact as “unknown” and set it to an unknown dimension key (usually -1) in the fact table. Some people might just ignore the fact record completely. You probably don’t want to do that.

But what if we have the “business” key in our fact data select? What can we do with that?

One option is to modify your dimension data select to UNION in all the distinct business keys from your fact data that aren’t in your dimension data (sketched below). This works on a small data set. If your fact table is 500 million rows, you won’t like the performance of this option.
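Something along these lines; the table and column names are made up for illustration.

-- Dimension source rows...
SELECT CustomerBusinessKey, CustomerName
FROM   src.Customer
UNION
-- ...plus any business keys that so far only exist in the fact source.
SELECT DISTINCT f.CustomerBusinessKey, 'Unknown' AS CustomerName
FROM   src.SalesTransactions AS f
WHERE  NOT EXISTS (SELECT 1
                   FROM   src.Customer AS c
                   WHERE  c.CustomerBusinessKey = f.CustomerBusinessKey);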

Another option we can use is the idea of an inferred dimension. As you load your fact table data (preferably through SSIS), you do a lookup to your dimension. If you have a match, cool, take that key and move on. If you don’t have a match, instead of setting the key to -1 (unknown), do this (a T-SQL sketch follows the list):

1) Insert a new dimension record with your business key from your fact table

2) Grab the newly created dimension key from the record you just inserted

3) Merge the key back into your fact data pipeline.
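Here is a minimal sketch of what the no-match path does; DimCustomer, CustomerBusinessKey, and IsInferred are hypothetical names, and the surrogate key is assumed to be an IDENTITY column.

DECLARE @BusinessKey varchar(20);
DECLARE @CustomerKey int;
SET @BusinessKey = 'CUST-0042';  -- business key coming off the fact row

-- 1) Insert a bare-bones inferred member with just the business key.
INSERT INTO dbo.DimCustomer (CustomerBusinessKey, CustomerName, IsInferred)
VALUES (@BusinessKey, 'Unknown (inferred)', 1);

-- 2) Grab the surrogate key that was just generated...
SELECT @CustomerKey = SCOPE_IDENTITY();

-- 3) ...and merge @CustomerKey back into the fact data pipeline.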

Awesome. Now, sometime in the future, your dimension process can come through, and if you are doing Slowly Changing Dims, it should just update your inferred dimension records with data. If your inferred dimension records are some one-offs that might never get updated, you might be able to get someone to manually update them through some interface, or whatever; in any event you aren’t stuck with tons of fact records that are set to -1/unknown.

Of course, the method above works best using SSIS, with a “Get Data -> Lookup -> Insert” pattern.

Happy ETL’ing!


Apple Time Capsule Rocks – Microsoft needs to make one.

Since I make a living and love working on Windows, SQL, Windows Server, Office, Exchange, etc., it is kind of weird that I am a Mac guy. Never thought it would happen. iPhone, Mac, Macbook Pro, Apple TV, Time Capsule, accessories…

Anyways, since Ella was born, I have TONS of pics and videos of her on my laptop. And it auto backs up to my Apple Time Capsule. Well, iMovie auto-imports videos from iPhoto, and then removes them? I am not sure, but long story short, I was missing some videos from when we got home from the hospital.

So what do I do? Freak out? No… I just fire up Time Machine on my Mac, go back to February, and there is my iPhoto library; I restore it and get the videos back.

Now, I just need to offload it offsite somewhere. There is MacMini Colo – Transport – http://www.macminicolo.net/transport/ – which looks promising, just a little too much $$, but maybe my next step.

But really, Windows needs this. We have a few Windows machines here at home, and I just feel like, umm, stuff is volatile. Thank god for Flickr… that is all I have to say. I know there is Windows Home Server, but that is another BOX, and updates, and whatever. I just want a device, a dumb device. I know I can get a NAS and whatever, but I just want simple. Where is the Microsoft Time Capsule?


The problem isn’t SQL Server. It’s you.

Throughout all my years in different places, I have seen SQL Server, Oracle, Firebird, MySQL, DB2, Zortec, Access, and probably a few other crazy databases set up, run, and administered. Of course most of them along the way have been Microsoft SQL Server (6.5, 7, 2000, 2005, 2008). I’ve worked with some knowledgeable DBAs, and in those cases everything usually turns out OK.

But sometimes, in some department or place or whatever, your buddy down the street wanting to start a new company, your girlfriend’s place of work that wants to track orders, whatever, they try to get SQL Server running, and what sometimes happens next just makes my head spin. Microsoft, bless them, sometime in the past (not so much now) tried to market SQL Server as “self-manageable”. Probably sometime between 6.5 and 7, they tweaked some update-stats routines and schedules and it’s all good, right? Set autogrow by default, and you are good to go. Wrong.

What this awesome marketing strategy did was get people, places, and organizations, mostly ISVs, to use SQL Server: install it, get their app running, and walk away. Of course it runs for a while, runs like a champ even. But then months, even years, go by and the system starts running slow. There is no DBA around; they didn’t need one, SQL Server manages itself! Wrong again.

What you might end up with, though, are people using the system that might know a little bit, enough to be deadly even, and they start making changes, when in reality you need a full-fledged DBA to manage your server and database, hence the name DBA (database administrator). But before the DBA comes on to the scene to save the day, you will have the people that blame SQL Server. “Oh, SQL Server doesn’t work at all, it can’t perform”… or “Our other databases run 10x as fast, what gives?” (not mentioning they have 3 DBAs for those “other” databases, but none for MS SQL). And the quotes keep coming.

That is why the title of this post is what it is.

The problem isn’t SQL Server, it’s you.

If you fail to realize that MS SQL is an enterprise-class database system, and treat it like some out-of-the-box, already-configured, plug-and-play system, you are going to run into issues eventually. You need a DBA. It is probably best to have one BEFORE you implement any system, even if it is a consultant to guide your implementation and assist as time goes on.

I sometimes get tired of trying to argue that MS SQL can hold its own against Oracle, DB2, whatever. Trust me, it can. I could probably go find tons of SQL DBAs that would back me up as well. It is all about how you manage and administer it! SQL Server does just fine, as long as you know what you are doing. Just like any system. I sometimes think that if we took SSMS away and just made everything command line/scripting, people “outside” of the MS SQL community would see how MS SQL compares to their own systems.

This post isn’t meant to be a beat-down rant or anything, but the same things can be said for .NET compared to Java, C++, etc., or whatever. It just seems sometimes that people that live and breathe Microsoft SQL need to know what the other RDBMS/BI systems are capable of, but for some reason the same isn’t true for people that use the other systems. They kind of just brush MS SQL off as a play toy, something that shouldn’t be taken seriously, a “hobbyist” SQL system. Something that any enterprise wouldn’t be caught dead running, that is of course, unless you are Microsoft. 🙂

I’m still placing my bets on MS SQL and .NET; I haven’t seen anything better for the price and ease of use, and the best part about it is the community. The MS SQL and development community is huge compared to anything else, and to me that just puts the icing on the cake. Just remember, the next time someone who needs a MS SQL DBA but doesn’t have one complains about the performance of their system, you can tell them it’s not SQL Server’s fault, it’s probably the neglect of SQL Server that caused the problems.


Using Windows Performance Toolkit to find System Issues in Vista/Win2k8/Win7

Windows 7 RC1 just came out. I am a TechNet subscriber, so I wanted to try it out. I have an old (2005) Dell desktop: 2.8 GHz, 2 GB RAM, 160 GB drive. 3.7 rating for Vista (because of the graphics card mostly, would be 4.4 otherwise – not too bad, even for being kind of an old box). It has been sitting in the basement since I moved into my new place in October, doing nothing really. I use a Mac full time at home, so it just sits.

A few times I have tried to get Windows Vista running smoothly on it, as a Media Center, or just a file server, etc. Thing is, it was just flaking out. I knew it was a hardware issue, but figured it might be the CPU fan, or overheating, etc. Vista installed fine, but as I was using it, I would see hang-ups and lockups. Not BSODs, but it would just hang, for 30 seconds, 1 minute, and then come back. WTF?

Nothing in the Reliability Monitor, nothing I could see in the event logs, etc. I rebooted, did the Windows memory test, nothing there. If you go into Computer Management, you will see Performance, with Monitoring Tools, Data Collector Sets, and Reports underneath. You can set it up to run a test on metrics of your system, and it will give you a report:

[screenshot: system diagnostics report]

I did this, and everything was OK. BUT… Avg. Disk Queue Length was > 2 – red flag. Disk issues. But I wanted to know more. So I started digging around, and there is a Windows Performance Toolkit you can download. Here is another good site going into detail about the WPT.

So I fire up a cmd line (as admin! Start -> Run, cmd, Ctrl+Shift+Enter), and run

xperf -providers K

to see what providers are available for the Kernel flags. IOTrace looks like something I want, so I then run

xperf -on IOTrace

and let it run. I go and open/close things, play around, see if I can replicate the issue. Once I feel I have, I want to stop and analyze the trace. You need to stop it and output to a file using this command:

xperf -d iotrace.etl

Side note: the files are named .etl. Coming from a BI background, this makes my world explode, since it has nothing to do with Extract, Transform, and Load.

Now that my trace is done, time to analyze:

xperfview iotrace.etl

And you get some awesome stats like this:
[screenshot: xperfview graphs]

Although I didn’t save my stats from my tests that showed the bad IO, what I saw were just gaps in the graphs, glitches in The Matrix. Time missing. Something is really bad here. So I did the drive error checking in Vista:

[screenshot: Vista drive error checking]

And when that ran, after reboot, it got to 11% and croaked. Bad drive. So I went and bought a new 500 GB SATA drive and loaded it up, and I am running Windows 7 now. Pretty sweet.

After all this fun spelunking into Windows performance, it got me thinking about things like running these detailed traces on SQL Server boxes or other servers on intervals, saving them somehow, and reporting on the data. The IOTrace is just one of hundreds of traces that you can then auto-analyze. I know that there are perfmon tools, but there are some added benefits to xperf that you can utilize, and I am glad I learned more about it and put it to use. Just another tool for the sysadmin tool belt.


Social Experiment: Purging Facebook “Friends”

So last night I decided it was time. Time to put an end to the madness, the quizzes, the nonsense, the “friends on my list that aren’t really my friends” game. What I decided to do was just… delete them all. Everyone. Then see how my “friend” list grows from there.

[screenshot]

One thing as well: when I started on Facebook, they didn’t have many of the features they do now. Friend lists were nonexistent, but they did have this “how do I know this person” feature, which now is subtly hidden or not even there at all. What ensued was close to 500 people with no way to manage them. What if I want a picture album that only XYZ people can see? I need to create a list, but creating a list when you have no one in lists is a pain. Now I can create lists, and when I add new friends I can add them to the lists then.

Also, the news feed. Filled with junk, or updates from people I don’t care about. I was getting sick of hiding people, or updates, or quizzes, or whatever. I got to thinking, why do I have all these people on my friend list if I don’t care what they are updating? So I started to pare it down: remove people I don’t know, then people I haven’t talked to in real life, then people I haven’t talked to in years who added me when they got on FB and I haven’t said anything to, then I said, well, I will remove people I wouldn’t go have a beer with, and then it just became a mess and I said whatever, I will remove everyone and start this experiment. Remove myself from all groups and stop being a fan of Sunshine, Campfires, and Not Having Swine Flu.

After removing a few people, I wrote on my wall that I was removing everyone, and before I was even done getting through the purge process (you have to delete one by one, by the way; Facebook makes it easy to add friends, a pain to remove more than one at a time), someone had added me back. Cool. Then I twittered it, and a couple of friends added me back. Nice. So that’s where I am at, and if I stay there, then cool; if I add friends back, well then at least I will have them categorized as I go.

It’s funny now, with few friends, how bad Facebook is at telling me who I should be friends with. It wants me to be friends with everyone from SCSU because I graduated there in 2002. Not good. Wondering, though, if it doesn’t recommend people I have removed? Hard to say.

Sometimes spring cleaning is fun, starting at zero, new.


SSIS – Two Ways Using Expressions Can Make Your Life Easier – Multi DB Select, Non Standard DB Select

In SQL Server Integration Services (SSIS), pretty much every task or transformation lets you set up “expressions”. Expressions are basically ways to set property values programmatically.

Here are two scenarios where you might use expressions (there are hundreds of uses; these are just two that are kind of related).

  1. Multiple Database Select – You have multiple databases with the same schema; let’s say you have 300 installs of a 3rd-party product and they all need their own database. I know it might sound impossible, but trust me, it can happen. Now, you want to run the same query over all the databases, pull data from a table, and dump it into a data warehouse, for example. You could write 300 queries and keep adding/removing them as databases change, you could create some elaborate dynamic SQL proc using loops, you might have some other way, or you could use SSIS expressions.

    Now, how would you go about doing this? It is pretty easy actually. First step, you need to set up a loop in SSIS. You would want to grab a recordset of database names using an Execute SQL Task, or however you’d like, and store it in an object variable (see the sketch after this list for the kind of query that works). Then you can loop through that list. The only difference in your query would be the database name, so what you would do is have a variable for your SELECT statement. Name it whatever, but what you want to do is click on the variable and open its properties. You will see Expression. Open the expression box and then set it to something like this:

    "SELECT Col1, Col2, Col3 FROM " + @[User::CurrentDatabaseName] + ".dbo.MyTable"

    [screenshot: variable expression editor]

    @[User::CurrentDatabaseName] is another variable that stores the database name you grab as you loop through your list of database names.

    Finally, in your data flow’s OLE DB source, you can change the Data Access Mode to “SQL command from variable”, and then it will let you choose your variable. As your loop works through the database names and updates your SELECT variable, you select data from each database in turn.

    [screenshot: OLE DB source set to SQL command from variable]

  2. Non-Standard Database Select – Not sure how to label this one, but here is what I am talking about. I like to make all my queries stored procedures in SSIS, at least as much as possible. This works great when you are doing SQL Server to SQL Server, but what happens if it’s Oracle to SQL Server, DB2 to SQL Server, etc.? Yes, I know you can create stored procs on those systems, but you might be in a place or position where you just can’t or don’t want to. In that case you would want to use just standard SQL SELECT statements to get data. You can easily put in params if the source is an OLE DB source, but what if it is an ODBC source? You have to use the DataReader source, and you can’t easily set params – like a WHERE statement. You HAVE to use expressions in order to have a query with a dynamic WHERE statement or to pass a variable in as a WHERE statement filter.

    So, throw a data flow on your package, and inside that, throw a DataReader source, set the connection to your ODBC connection (ADO.NET connection), and set the command text. Good to go. But where do you set the expression? Not very intuitive. Go back to your data flow and look at the expressions for it. You will see one for DataReaderSource.CommandText (where DataReaderSource is the name of your DataReader source). You can set the expression up there. Now you can change an Oracle or DB2 (or whatever) SQL statement to something that takes params without the need for a stored proc on that other database server.
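For the first scenario, the Execute SQL Task that feeds the loop can be as simple as the query below. This is just a sketch, and the LIKE filter is a made-up naming convention for those 300 install databases.

SELECT name
FROM   sys.databases
WHERE  name LIKE 'VendorApp_%'   -- hypothetical naming convention for the installs
ORDER BY name;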

So, while there are hundreds if not thousands of uses for expressions in SSIS, these are just a couple that can make your life easier when trying to do more dynamic queries in your data flow. Happy ETL’ing!


SQL Server 2008 – Saving changes is not permitted

Finally getting around to doing some work on SQL 2008, and after about 3 minutes, I run into this error: “Saving changes is not permitted… blah blah blah.” See the screenshot below.

[screenshot: “Saving changes is not permitted” dialog]

This is different from SQL 2005. Microsoft is maybe trying to save us from ourselves? The thing is, I never enabled the option “Prevent saving changes that require the table to be re-created”; it seems to be enabled by default. It would be awesome if this error told me exactly where the setting was.

 

Well, it happens to be in Tools -> Options -> Designers -> Table and Database Designers. Uncheck the box and go about your merry way!

 

[screenshot: Tools -> Options -> Designers -> Table and Database Designers]


OLAP PivotTable Extensions on CodePlex

This weekend, I ran across this on CodePlex – OLAP PivotTable Extensions – which got me thinking back to a post on the Excel blog about adding calculated measures and named sets in VBA (which is another blog post completely).

From CodePlex:

OLAP PivotTable Extensions is an Excel 2007 add-in which extends the functionality of PivotTables on Analysis Services cubes. The Excel 2007 API has certain PivotTable functionality which is not exposed in the UI. OLAP PivotTable Extensions provides an interface for some of this functionality.

What an awesome tool. I have been playing with it for a couple of days, and I have turned some of the “power” users of the OLAP cubes on to it as well. The first thing I thought of when running across this was, “Whoa, OK, when business users request calculated measures that might be more obscure, or just specific to them, they can add them! We don’t have to do a special release, maybe not even a release at all!”

The uses for this tool could be pretty extensive. You can import and export calculation libraries, and you can also see the MDX that Excel is producing, which is another plus (I know there are other ways to get it, but this tool makes it easy!). With the MDX, you can just copy it and run it in SSMS to see the results there. You can see how Excel is doing things behind the scenes with your result set to make it look nice.

Another sweet feature, if you have a cube with tons of attributes, there is a search tab to search for the attributes you want.

I haven’t seen any issues yet. One user had to install the Visual Studio 2005 Tools for Office Second Edition Runtime, which the CodePlex site says is required, so no big deal.

If you have tons of users using OLAP cubes with Excel 2007, take a look at this free, open-source tool on CodePlex; you will probably get some good mileage out of it. I think Microsoft should put these features in the next version of Office!



Using Offline OLAP to Develop Cube Reports Without SSAS

One feature of Excel 2007 that I think is really cool, and also a little hidden, is the “Offline OLAP” feature.

If you insert a pivot table connected to an OLAP cube into Excel 2007, and go to the PivotTable Options ribbon menu, you will see the “OLAP tools” button. Click on that and then “Offline OLAP”.

[screenshot: OLAP tools -> Offline OLAP]

Once you go through the Offline OLAP Wizard, it will create a .cub file for you. What this ends up being is a local, disconnected “cube” you can analyze in Excel if you are on a plane or in some remote area with no internet connection.

Another use for the offline .cub files that I have found is this. Let’s say you want developers to develop web-based reports using .NET, maybe using the Dundas OLAP Services controls. If you don’t want to have to load SSAS or set them up to connect to any server, so they can just develop locally, the .cub file is the way to go. In their .NET code, they can just change the connection string to point to the local .cub file, and then later, when you are ready to go live, you can just change the connection string to the live cube. That way, if you are developing the cube at the same time the reports are being developed, you don’t have to worry about uptime, etc. Just send them an updated .cub file every once in a while.
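The swap really is just the connection string; something like the two lines below (a sketch with made-up names, and the exact syntax depends on the provider/control you are using):

Data Source=C:\Cubes\SalesDev.cub;            (local .cub file while developing)
Data Source=SSASPROD;Initial Catalog=Sales;   (live SSAS cube when you go live)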

I don’t know much about the details for upcoming releases, like Microsoft Project Gemini, but I have a feeling that it might feel a bit like this, using Offline OLAP, or local analysis.

On a final note, if you really want to get geeky, you can actually create the .cub files from .NET, but that is another blog post 🙂