Most ETL systems (at least that I have seen/studied/worked on) that populate data warehouses run something like
1) Load Dims
a) populate an unknown
b) populate dim data
2) Load Facts
a) join/lookup to dim’s, and if no match, set as “unknown” dimension record
3) Process Cube
This type of system works in many cases, but there are some flaws that bubble up over time. First, unless you reload your fact table, or update your unknown dimension keys on your fact, you could end up with unknowns, that will be unknowns forever. The system described above also means you need to run it in that order. Dims first, Facts second.
Early Arriving Facts/Late Arriving Dimension – If you are an optimist, we have the fact data before we have the dimension data. Or, if you are a pessimist, we don’t have the dimension data when we load the fact. You choose, but in either scenario, we have data missing somewhere.
Like I mentioned earlier, many systems will just set the early arriving fact as “unknown” and set it to a unknown dimension key (usually –1) in the fact table. Some people might just ignore the fact record completely. You probably don’t want to do that.
But what if we have the “business” key in our fact data select. What can we do with that?
One option is to modify your dimension data select to UNION in all the distinct business keys from your fact data that aren’t in your dimension data. This works in a small data set. If you fact table is 500 million rows, you won’t like the performance of this option.
Another option we can use is the idea of an inferred dimension. As you load your fact table data (preferably through SSIS) you do a lookup to your dimension. If you have a match, cool, take that key and move on. If you don’t have a match, instead of setting the key to –1 (unknown), do this:
1) Insert a new dimension record with your business key from your fact table
2) Grab the newly created dimension key from the record you just inserted
3) Merge the key back into your fact data pipeline.
Awesome. Now, sometime in the future, your Dimension process can come through, and if you are doing Slowly Changing Dim’s, it should just update your inferred dimension records with data. If your inferred dimension records are some one offs that might never get updated, you might be able to get someone to manually update them through some interface, or whatever, in any event you aren’t stuck with tons of fact records that are set to –1/unknown.
Of course, the method above works best using SSIS, with a “Get Data -> Lookup Pattern –> Insert” method.