Redgate SQL Search – Free Download

I have always wanted a good way to search all the db’s on a server, find procs, views, whatever that have something in their DDL so i could do what I want with it. There are ways using DMV’s or other things in SQL Server, but it just becomes a pain.

Then came along SQL Search (http://www.red-gate.com/products/SQL_Search/index.htm) – from Redgate. And it is free! Works like a charm. It eliminates the need to fire up some query and change what you are looking for. It works well. Check it out the next time you are trying to find every object that has “whatever” in its T-SQL DDL.

Microsoft Business Intelligence Development in a Team Environment

Today I received an email asking to some extent best practices on development with SQL Server Integration Studio (SSIS) and Business Intelligence Developer Studio (BIDS) in a team environment. Here is part of the email:

Me and another DBA belong to the same team, we have a SQL server with SSIS running. We use the SSIS transfer data among multiple data sources. In SQL 2000 DTS, both of us can save the package on the server and open/edit it in the enterprise manager. In SQL 2005, I can see the package on server, but can’t open it directly. We came out a solution: create a shared folder on the server called ‘SSIS Projects’, both of us can access to it. We run the ‘SQL Server Business Intelligence Development Studio’ on local PC, to open the project in that shared folder. When done with the change, save the package to the SSIS server. Now, we have more than 50 packages in a project. Problem is: it’s very slow when open a project, ‘Business Intelligence Development Studio’ tends to open/verify every single package inside a project, takes up to 10 mins and getting worse. We really miss the SQL 2000 DTS, but we have to turn to SQL 2005.

  1. Are we doing the right thing? Is there any better solution for SSIS developing in a team environment?

  2. When open a project, does ‘Business Intelligence Development Studio’ has to open/verify every package?

 

This got me thinking, and I figured instead of write an email back, it would be good info for a blog post. So here is what I think and some things I have done that have worked.

First, yes, SQL 2000 DTS allows you to just edit on the server, do more than SSIS, is just way better than SSIS. Wait, what? Well, yeah some people will say that, because it does one thing that might be a little rigmarole in SSIS, but no, SQL 2000 DTS is not better than SSIS, just wanted to clear that up.

So, the is meant to be a starting point, by no means all encompassing, and as always, YMMV.

One thing that I first thought about is this: Yeah, if BI devs and SQL devs have never really worked in a team environment, developing software, how would they know what to do, or best practices? They would just go about “making it work” until everything breaks or who know what.

 

So how to develop Microsoft Business Intelligence Solutions in a team environment?

 

1) Standardize on Versions

 

First, figure out what “versions” you are going to support, and what you are going to use, and get standardized on them. I am guessing majority of BI devs right now are on the 2005 stack. Yeah, there is still probably a bit of 2000 legacy stuff out there, and some people are now getting into the 2008 stuff, but 2005 is pretty much the norm from what I see, at least at this point.

So, 2005. Get all your dev’s on 2005 on their machine – same patch level, etc. Get BIDS up to the same level. Get BIDS helper installed everywhere. Strive to get all your ETL packages in SSIS 2005, get all your cubes to SSAS 2005, etc, etc. Come to a consensus on things like config files for SSIS, naming conventions, within your development and on disk – folder structure is key! With a smaller number of versions of things floating around, it makes it easy for anyone on the team to open up a solution and start hammering away without tons of setup.

2) Get Source Control

 

This is crucial! I have talked about source control in the past, and also about some that aren’t so great. Really it doesn’t matter what you use, I prefer SVN. I install Tortoise SVN, SVN proper (to do scripting etc if I need to using cmd line) and also purchase Visual SVN, an add on to Visual Studio that integrates with SVN. for 50 bucks you have your source control system. Visual Source Safe works but is outdated, honestly I hate it. Team Foundation Server is good, but expensive. Other solutions might be using something like GIT, etc. Whatever you do, just get a source control system going, and learn it well. Learn how to create repos, commit, update, revert, merge, etc. Set up a user for each BI dev and make sure they commit often, and make sure they leave comments in the source control log when they commit, history is your lifeline to go back to something if you need to! Note: exclude .suo, bin, obj directories, .user files, etc. Anything that changes every time you build, open, etc, you want to exclude from source control.

 

3) Development Box

 

You now have your version standardized, and your source control setup. You can get most of your work done on your machine, but you need somewhere to test deployments, run scenarios, etc, etc. Make sure you have a comparable box to your production server. Set it up the same, same software etc. Make sure its backed up. Let all the devs know its a dev box, it can be wiped at any time for any reason if need be. It can be rebooted 5 times a day if need be. Its a dev box! But you can test and develop and tweak and change settings to your hearts content and not have to worry about breaking Mr. Executives reports.

 

4) Developing, Merging, Committing, Collaborating, Communicating.

So now you have your setup, well.. setup. Start creating stuff. SSIS Packages, ETL’s, SSAS Cubes, SSRS Reports, the whole MSFT BI Solution. This is where stuff can start to get tricky in a team environment though. SSIS/SSAS/SSRS isn’t as clean cut as something like C#/VB.NET, etc. Everything is in some form of XML behind the scenes, and with graphical based editing, you can move stuff around and it changes the files. Things like that are going to be your enemy. This is why you need to collaborate and communicate. Usually one person should be working on one project at a time. You can get really good at communicating and then in SSIS at least have multiple people working on different packages. Also in SSAS dimension editing and stuff can be done by multiple people at the same time as long as the dim is already hooked up to the cube. But you want to make sure that you communicate, “Hey, I am checking this in, you might want to do an update”, or “Is anyone working on this or are they going to? I want to modify something, and I will check it in so you all can see it”

You want to make sure you have your folder structure, and solution/project structure set up well. C:Projects  ..  and then maybe a folder for each major project “CompanySales” and under that, “ETL”, “Cube”, “Reports” and have a solution under each with 1 project of each type. You can also have a generic SSRS solution with many projects, which might work well for you. In any case, just come up with a standard and stick to it. Trust me it will make your life easier. The question from the email above, it sounds like they have every package in one solution, one project. Sounds like it needs to be split, multiple solutions, multiple projects.

 

5) Deployment Scenarios and Strategies

Now that you have everything developed, tested, checked in, what do you do?

Personally for SSIS I like xcopy deployments. One folder on the server, not on the C drive, but another drive, lets say “E:SSIS” under that a folder for each project. Put your dtsx and configs in the same folder. 99% of the time you are going to call the dtsx from a SQL Agent Job, and most likely you are going to run into a scenario
where you need uber rights to execute it, so learn how to create a proxy/credential in SQL security so you can run the step as that. Once you have this folder and subfolders setup, you can use something like Beyond Compare to compare the folder on the server to the one you have locally that matches. Remember to copy files from the bin directory of your project after you build it, not the files directly on your project. As far as BIDS validating every package, there are workarounds out there you can do, here is one.

For SSAS, I try to lean towards using the Deployment Wizard that comes with SQL Server. You can use BIDS deployment, but if you start doing anything advanced with roles, partitions, etc, you are going to run into trouble. Take control and use the deployment wizard. I usually like to deploy, and then process manually when developing. And then later use SQL Agent or and SSIS package to actually do processing when it comes to a scheduled processing.

SSRS, I have become used to the auto deployment from Visual Studio. To really do this though, you need a project for every folder in SSRS, which can become a pain. You can always just upload the .RDL file and connection and do it manually, but if you start off right with using the deployment from BIDS, it can make your life easier.

 

So that is just a 10 minute overview of everything to kind of get started. Everything depends on your infrastructure and the way your team is setup, etc. But I think the biggest thing to take from all of the above is to standardize on things. If you standardize on as much as possible, SQL versions, setup of machines, naming conventions, layouts, design patterns, etc, everyone can do things faster and pretty soon it will start running like a well oiled machine!

The problem isn’t SQL Server. It’s you.

Throughout all my years in different places, I have seen SQL, Oracle, Firebird, MySQL, DB2, Zortec, Access, and probably a few other crazy databases set up and run, and administered. Of course most of them along the way have been Microsoft SQL Server, (6.5, 7, 2000, 2005, 2008). I’ve worked with some knowledgeable DBA’s, and in those cases everything usually turns out ok.

But sometimes, in some department or place or whatever, your buddy down the street wanting to start a new company, your girlfriends place of work that wants to track orders, whatever, they usually try to get SQL Server running, and what sometimes happens next just makes my head spin. Microsoft, bless them, sometime in the past, not so much now, tried to market SQL Server as “self manageable”. Probably sometime between 6.5 and 7, they tweaked some update stats routines and schedules and its all good, right? Set autogrow by default, and you are good to go. Wrong.

What this awesome marketing strategy did, was get people, places, and organizations, mostly ISV’s to use SQL Server and install it, get their app running, and walk away. Of course it runs for a while, runs like a champ even. But then months, even years go by and the system starts running slow. There is no DBA around, they didn’t need one, SQL Server manages itself! Wrong again.

What you might end up with though, are people using the system that might know a little bit, enough to be deadly even, and they start making changes, when in reality you need a full fledged DBA to manage your server, and database, hence the name DBA (database administrator). But before the DBA comes on to the scene to save the day, you will have the people that blame SQL Server. “Oh SQL Server doesn’t work at all, it can’t perform’”… or “Our other databases run 10x as fast, what gives” (not mentioning they have 3 DBA’s for those “other” databases, but not for MS SQL). and the quotes keep coming.

That is why the title of this post is what it is.

The problem isn’t SQL server, it’s you

. If you fail to realize that MS SQL is an Enterprise class database system, and treat it like some out of the box, already configured, plug and play system, you are going to run into issues eventually. You need a DBA. Probably best to have one BEFORE you implement any system, even if it is a consultant to guide your implementation, and assist as time goes on.

I sometimes get tired trying to argue that MS SQL can hold its own against Oracle, DB2, whatever. Trust me, it can. I could probably go find tons of SQL DBA’s that would back me up as well. It is all about how you manage and administer it! SQL Sever does just fine, as long as you know what you are doing. Just like any system. I think sometimes that if we took SSMS away, and just made everything cmd line/scripting, that people “outside” of the MS SQL community would see how MS SQL works in compared to their own systems.

This post isn’t meant to be a beat down rant or anything, but the same things can be said for .NET compared to Java, C++, etc, or whatever. It just seems sometimes that people that live and breathe Microsoft SQL need to know what the other RDMS/BI systems are capable of, but for some reason the same isn’t true for people that use the other systems. They kind of just brush MS SQL off as a play toy, something that shouldn’t be taken seriously, a “hobbyist” SQL system. Something that any enterprise wouldn’t be caught dead running, that is of course, unless you are Microsoft. 🙂

I’m still hedging my bets on MS SQL and .NET, I haven’t seen anything better for the price and ease of use, and the best part about it, the community. The MS SQL and Development community is huge compared to anything else, and to me that just puts the icing on the cake. Just remember the next time someone who needs a MS SQL DBA but doesn’t have one complains about performance of their system, you can tell them it’s not SQL Server’s fault, it’s probably the lack of neglect to SQL Server that caused the problems.

Business Intelligence != Reporting

Last week I was in a heated discussion about Business Intelligence, and it came up that “Business Intelligence is just writing reports, what is the difference from just writing some reports?”

I tend to disagree. Writing reports is, well, writing reports. You can write reports off of tons of data, and yeah, probably get some good info, but usually it ends up in disparate information.

I have blogged about this before, probably a few times, but this is usually what happens in organizations:

1) Some app or system is set up where data is created in some kind of data store, usually a database

2) People want to see the data in some type of report

3) Developers start cranking out reports right off the source system, stopping the bleeding for the current need for more reporting.

4) System keeps growing, more data is created/collected.

5) Reports off source system start slowing down, timing out, many different reports end up for the same thing and don’t match, end users complain.

6) Developers get clever, start moving and summarizing data on some kind of interval, all with custom code, custom design. They write some reports off the summary data to once again stop the bleeding. Some reports are still off the source data. Reports don’t match. End users complain

7) More data is collected, more users want reports, and want them with different views, grouped by this, sorted by that, add this column, remove this column, etc.

8) Developers are taking the heat. Their reports aren’t matching, they are running slow, etc. They can’t keep up with all the report requests. They decide to create a custom reporting engine on top of their summary data that allows end users to create their one reports using some quasi query language and drag and drop columns/grouping. This stops the bleeding for a while.

9) Finally someone in the organization realizes that the developers are reinventing the wheel. This is where Business Intelligence comes in. The source data is left untouched, the BI group creates and ETL to have some kind of data warehouse design and structure, using out of the box tools. (SQL Server Management Studio is and out of the box tool – you use that to write queries. SQL Server Integration/Analysis/Reporting services are out of the box tools, you use them to create ETL’s, Cubes, and Reports).

10) End users can now write reports using Excel off the OLAP Cube, or they use the Report Builder with Reporting Services to create reports off a UDM (Universal Data Model), or the BI devs create reports in Reporting Services, all off the same source data. One version of the truth, less custom reporting.

11) Developers are left to develop on the source system turning requirements into code/features, not focus on reports, and the BI group is focused on getting all the data turned into information.

Of course this is a perfect scenario described above. Nothing is ever as simple as that, but we can hope. Also I am biased towards the MSFT toolset, but there are other toolsets out there that would solve the same problem.

All I know is that I have seen everywhere I go pretty much the same thing happen from #1 to #8 above. In some cases you never see #9-#11 and you end up in a world of hurt.

In the end though, Business Intelligence just isn’t “writing reports”, far from it.


SSIS – Slowly Changing Dimensions with Checksum

In the Microsoft Business Intelligence world, most people use SQL Server Integration Services (SSIS) to do their ETL. Usually with your ETL you are building a DataMart or OLAP Database so then you can build a multi-dimensional cube using SQL Server Analysis Services (SSAS) or report on data using SQL Server Reporting Services (SSRS).

When creating the ETL using SSIS, you pull data from one or many sources (usually an OLTP database) and summarize it, put it into a snowflake or star schema in your DataMart. You build you Dimension tables and Fact tables, which you can then build your Dims and Measures in SSAS.

When building and populating your Dimensions in SSIS, you pull data from a source table, and then move it to your dimension table (in the most basic sense).

You can handle situations in different ways (SCD type 1, SCD type 2, etc) – basically, should you update records, or mark them as old and add new records, etc.

The basic problem with dimension loading comes in with grabbing data from the source, checking if you already have it in your destination dimension table, and then either inserting, updating, or ignoring it.

SSIS has a built in transformation, called the “Slowly Changing Dimension” wizard. You give it some source data, you go through some wizard screens, choosing what columns from your source are “business keys” (columns that are the unique key/primary key in your source table) and what other columns you want. You choose if the other columns are changing or historical columns, and then choose what should happen when data is different, update the records directly, or add date columns, etc.

Once you get through the SCD wizard, some things happen behind the scenes, and you see the insert, update transformations are there and things just “work” when you you run your package.

What most people fail to do, is tweak any settings with the automagically created transformations. You should tweak some things depending on your environment, size of source/destination data, etc.

What I have found in my experience though, is that the SCD is good for smaller dimensions (less than 10,000 records). When you get a dimension that grows larger than that, things slow down dramatically. For example, a dimension with 2 million, or 10 million or more records, will run really slow using the SCD wizard – it will take 15-20 minutes to process a 2 million row dimension.

What else can you do besides use the SCD wizard?

I have tried a few other 3rd party SCD “wizard” transformations, and I couldn’t get them to work correctly in my environment, or they didn’t work correctly. What I have found to work the best, the fastest and the most reliable is using a “checksum SCD method”

The checksum method works similarly to the SCD wizard, but you basically do it yourself.

First, you need to get a checksum transformation (you can download here: http://www.sqlis.com/post/Checksum-Transformation.aspx)

The basic layout of your package to populate a dimension will be like this:


What you want to do, is add a column your dimension destination table called “RowChecksum” that is a BIGINT. Allow nulls, you can update them all to 0 by default if you want, etc.

In your “Get Source Data” Source, add the column to the result query. In your “Checksum Transformation”, add the columns that you want your checksum to be created on. What you want to add is all the columns that AREN’T your business keys.

In your “Lookup Transformation”, your query should grab from your destination the checksum column and the business key or keys. Map the business keys as the lookup columns, add the dimension key as a new column on the output, and the existing checksum column as a new column as well.

So you do a lookup, and if the record already exists, its going to come out the valid output (green line) and you should tie that to your conditional transformation. You need to take the error output from the lookup to your insert destination. (Think of it this way, you do a lookup, you couldn’t find a match, so it is an “error condition, the row doesn’t already exist in your destination, so you INSERT it).

On the conditional transformation, you need to do a check if the existing checksum == the new checksum. If they equal, you don’t need to do anything. You can just ignore that output, but I find it useful to use a trash destination transform, so when debugging you can see the row counts.

If the checksums don’t equal, you put that output the OLEDB Command (where you do your update statement).

Make sure in your insert statement, you set the “new checksum” column from the lookup output to the RowChecksum column in your table. In the update, you want to do the update on the record that matches your business keys, and set the RowChecksum to the new checksum value.

One thing you might also run into is this. If your destination table key column isn’t an identity column, you will need to create that new id before your insert transformation. Which probably consists of grabbing the MAX key + 1 in the Control Flow tab, dumping into a variable, and using that in the Data Flow tab. You can use a script component to add 1 each time, or you can get a rownumber transformation as well (Both the trash and rownumber transformations are on the SQLIS website – same as the checksum transformation)>

After getting your new custom slowly changing dimension data flow set up, you should see way better performance, I have seen 2.5 million rows process in around 2 minutes.

One other caveat I have seen, is in the Insert Destination, you probably want to uncheck the “Table Lock” checkbox. I have seen the Insert and Update transformations lock up on each other. Basically what happens is the run into some type of race condition. The cpu on your server will just sit low, but the package will just run forever, never error out, just sit there. It usually happens with dimension tables that are huge.

Like I said earlier, the checksum method is good for large dimension tables. I use the checksum method for all my dimension population, small or large, I find it easier to manage when everything is standardized.

By no means is this post an extensive list of everything that you would run across, but more of a basic overview of how it works with using the checksum method.

In any event, you can get huge performance gains by using the checksum method and not using the SCD Wizard that comes with SSIS, and you will also feel like you have more control of what is going on, because you will. Happy ETL’ing 🙂