Yamanalysis: Analyzing Yammer and Using PowerPivot on MySQL

I have blogged before about we use Yammer. Some interesting data can be gleaned from the usage of Yammer. One thing though is that the data and usage stats are limited in the Yammer area, but you can get all the data and take a look at things. I ran into Yamanalysis and decided to give it a try.

After getting Ruby, Rails, MySQL, curl/curb, GraphViz, IBM WordCloud and whatever else configured, I finally got it working. (FYI – MySQL 5.0 – you need to run the config wizard as administrator on Windows 7 or it just hangs at the end).

Pretty cool data and analysis from a higher level. Of course after getting everything working, I wanted to hit the data with PowerPivot. This sounds like an easy feat, but yet seemed to be a complicated task.

I first got the ODBC connector 5.1 for MySQL (Since PowerPivot doesn’t natively connect to MySQL,and 5.1 since that is the only one I could find reliably and get to work.), and set up an ODBC source. Tests fine.

In PowerPivot, I would run through the wizard and it would get architecture mismatches, and catastrophic failures, trying to test the connection. Ignoring that and moving forward, running a query would just hang on import forever. I tried different DSN’s, User/System DSNs, etc, to no avail.

What I ended up doing was firing up my local Microsoft SQL instance, and creating a linked server through a system DSN to the MySQL instance, then I could query the data fine from SQL. I opened up PowerPivot, connected to SQL local and then ran the query to MySQL and it work. What a workaround, what a hack, but at least I can hit the data in PowerPivot locally, which was my goal here.

Of course I could take what Yamanalysis is doing and dump to SQL, or do something similar in C# and dump to SQL, that might be a project for another day.

ETL Method – Fastest Way To Get Data from DB2 to Microsoft SQL Server

For a while, I have been working on figuring out a “better” way to get data from DB2 to Microsoft SQL Server. There are many different options and approaches and environments, and this one is mine, your mileage may vary.

Usually, when pulling data from DB2 to any Windows box, the first thing you might think of is ODBC. You can either use the Microsoft DB2 driver (which works, if you are lucky enough to get it configured and working), or the IBM iSeries Client Access ODBC Driver (which works well), or another 3rd party ODBC driver. Using ODBC, you can access DB2 with a ton of different clients. Excel, WinSQL, any 3rd party SQL Tool, a MSSQL linked server, SSIS, etc. ODBC connects just fine, and will work for “querying” needs. Also, with the drivers you might install, you can usually set up an OLE DB connection if your client supports it (SSIS for example) and query the data using OLEDB – this works as well, but there are some caveats, which I will talk about.

In comes SSIS, the go to ETL tool for MSFT BI developers. You want to get data from DB2 to your SQL Server Data Warehouse, or whatever. You try with an OLEDB connection source, but it is clunky, weird, and sometimes doesn’t work at all (PrimeOutput Errors Anyone?). If you do manage to get OLEDB configured and working, you still probably will be missing out on some performance gains compared to the method I am going describe.

Back to SSIS, using ODBC. It works. You have to create an ADO.NET ODBC connection, and use a DataReader source instead of an OLEDB source. Everything works fine, except one thing. It is slow! Further proof?

http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/162e55e5-b64b-423e-94c1-dd764ca1f683

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=96977

http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/cfade7e7-50d5-4447-9821-35c5d5ae1b66

http://www.sqlservercentral.com/Forums/Topic702042-148-1.aspx

http://www.sqlservercentral.com/Forums/Topic666993-148-1.aspx

Ok, enough links. But if you do read those. SQL 2000 DTS is faster than using SQL 2005/2008 SSIS. WTF? The best I can guess is that it is because of the .NET wrapper around ODBC. DTS is using “native” ODBC.

So, now what? Do we want to use DTS 2000? No. What to do though?

Well, after a few days of research, and just exploring around, I think I have found a good answer.. Replace DB2 with SQL Server.. just kidding. Here is what you need to do:

Install the IBM Client Access tools. There is a tool called “Data Transfer From iSeries Server” which the actual exe is "C:Program FilesIBMClient Accesscwbtf.exe"

image

This little tool allows you to set up data transfers from your DB2 system to multiple output choices (Display, Printer, Html, and Text). We want to export to Text file on our filesystem. You have to set up a few options, like the FileName, etc. In “Data Options” you can set up a where statement, aggregates, etc.

If you output to a file, you can go into “Details” and choose a file type, etc. I use ASCII Text, and then in the  “ascii file details” I uncheck all checkboxes. You set up your options and then hit the “Transfer data from iSeries” button and it will extract data to the file you chose in the filename field. Pretty sweet. But this is a GUI, how can I use this tool? I am not going to run this manually. Well, you are in luck.

If you hit the “Save” button, it will save a .dtf file for you. If you open this .dtf file in a text editor, you will see all options are defined in text, in a faux ini style. Awesome, we are getting somewhere.

Now, how do you run this from a cmd prompt? Well, we are in luck again. Dig around in C:Program FilesIBMClient Access and you will find a little exe called “rxferpcb”

image

What this tool allows you to do, is pass in a “request” (aka a DTF file), and a userid/password for your DB2 system, and it will execute the transfer for you. Sweet!

Now what do we do from here?

1) Create an SSIS package

2) Create an execute process task, call rxferpcb and pass in your arguements.

3) Create a BULK Insert task, and load up the file that the execute process task created. (note you have to create a .FMT file for fixed with import. I create a .NET app to load the FDF file (the transfer description) which will auto create a .FMT file for me, and a SQL Create statement as well – saving time and tedious work)

Now take 2 minutes and think how you could make everything generic/expression/variable driven, and you have yourself a sweet little SSIS package to extract any table from DB2 to text and bulk load it.

image

What is so great about the .DTF files is that you can modify them with a text editor, which means you can create/modify them programmatically. Think – setting where statements for incremental loads, etc.

image

 

You can see from the two screenshots above, that is all there is. Everything is expression/variable drive. Full Load, and Incremental Load. Using nothing but .dtf files, rxferpcb, a little .NET app I wrote to automatically create DTF’s for incremental (where statements), truncate, delete, and bulk insert. I can load up any table from DB2 to SQL by just setting 3 variables in a parent package.

After you wrap your head around everything I just went over, then stop to think about this. The whole DTF/Data Transfer/etc is all exposed in a COM API for “Data Transfer Automation Objects’”

http://www-912.ibm.com/s_dir/slkbase.NSF/643d2723f2907f0b8625661300765a2a/0c637d6b03f927ff86256a710076ab22?OpenDocument

With that information at your disposal, you could really do some cool things. Why not just create a SSIS Source Adapter that wraps that COM object and dumps the rows directly to the SSIS Buffer, and then does an OLEDB insert or Bulk Insert using the SQL Server Destination?

I have found in my tests that I can load over 100 million row tables – doing a full complete load, in about 6-7
hours. 30-40 million row tables in 4 hours. 2 to extract, 2 to BULK insert. Again, your mileage may vary depending on the width of your table, network speed, disk I/O, etc. To compare, with ODBC, just pulling and inserting 2 million records was taking over 2 hours, I didn’t wait around for it to finish. Pulling 2 Million records with my method described in this blog takes about 3-5 minutes (or less!)

I know I have skimmed over most of the nitty gritty details in this post, but I hope to convey from a high level that ODBC/OLE DB just aren’t as fast as the method here, I have spent a lot of time over the last few weeks comparing and contrasting performance and manageability. Now, if I could just get that DB2 server upgrade to SQL Server 2008. . . Happy ETL’ing!

SSIS – Two Ways Using Expressions Can Make Your Life Easier – Multi DB Select, Non Standard DB Select

In SQL Server Integration Services (SSIS), pretty much every task or transformation lets you set “expressions” up. Expressions are basically ways to set property values programmatically.

Here are two scenarios where you might use expressions (there are 100’s of uses, these are just two that are kind of related).

  1. Multiple Database Select – You have multiple databases – same schema, let’s say you have 300 installs of a 3rd party product and they all need their own database. I know it might sound impossible, but trust me, it can happen. Now, you want to run the same query over all databases, and pull data from a table, and dump into a data warehouse, for example. You could write 300 queries, and keep adding/removing based on the databases, you could create some elaborate dynamic SQL proc using loops, or you might have some other way, or, you could use SSIS Expressions.

    Now, how would you go about doing this? It is pretty easy actually. First step, you need to set up a loop in SSIS. You would want to grab a recordset of database names using an Execute SQL Task, or however you’d like, and store in an object variable. Then you can loop through that list. Your only difference in your query would be database name, so what you would do is have a variable for your SELECT statement. Name it whatever, but what you want to do is click on the variable, the properties of it. You will see Expression. Open the expression box and then set it to something like this

    ”SELECT Col1,Col2,Col3 FROM “ +  @[User:CurrentDatabaseName] + ".dbo.MyTable"

    image

    @[User:CurrentDatabaseName] is another variable to store the databasename that you would grab as you loop through your list of databasenames.

    Finally, in your dataflow, OLE DB source, you can change the Data Access Mode to “SQL Command From Variable”, and then it will let you choose your variable. As your for loop loops through your database names, and updates your SELECT variable, you can then select data from each database as you loop through them.

    image  

  2. Non-Standard Database Select – Not sure how to label this one, but here is what I am talking about. I like to make all my queries as stored procedures in SSIS, at least as much as possible. This works great when you are doing SQL Server to SQL Server, but what happens if its Oracle to SQL Server, DB2 to SQL Server, etc? Yes I know you can create stored procs on those systems, but you might be in a place or position where you just can’t or don’t want to. In that case you would want to use just standard T-SQL select statements to get data. You can easily put in params if the source is an OLE DB source, but what if it is an ODBC Source? You have to use the DataReader source, and you can’t easily set params – like a WHERE statement. You HAVE to use Expressions in order to have a query with a dynamic WHERE statement or passing in a variable as WHERE statement filter.

    So, throw a DataFlow on your package, and inside that, throw a DataReader source, and then set the connection to your ODBC Connection (ADO.NET Connection) and set the command text. Good to go. But where to set the connection? Not very intuitive. Go back to your DataFlow and look at the expressions for it. You will see one for DataReaderSource.CommandText (where DataReaderSource is the name of your DataReaderSource). You can set the expression up there. Now you can change an Oracle SQL Statement or DB2 or whatever to something that takes params without the need for a stored proc on that other database server.

So, while there are hundreds if not thousands of uses for expressions in SSIS, these are just a couple of uses that can make your life easier when trying to do more dynamic type queries in your DataFlow. Happy ETL’ing!