“What is a Pessimistic Load?”
I’m glad you asked! A pessimistic load is any load pattern that incorporates a strict reconciliation process. Strict reconciliation throws an error when the reconciliation effort fails.
How Does It Work?
First you load some records. Then you check to see if the records you intended to load have been loaded. That’s it.
The first part is the load process; the second part is the reconciliation. The reconciliation process can be broken into chunks – and the chunks can be executed during the load process. For example, one can stage data and then reconcile stage-to-source.
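To make the pattern concrete, here is a minimal sketch in Python of a load step followed by a strict reconciliation step. The names (pessimistic_load, ReconciliationError) are illustrative only, and the sketch assumes a full reload into an empty destination.

```python
class ReconciliationError(Exception):
    """Raised when the strict reconciliation check fails."""


def pessimistic_load(source_rows, destination):
    # Load step: write every source row to the destination
    # (assumes a full reload into an empty destination).
    loaded = 0
    for row in source_rows:
        destination.append(row)
        loaded += 1

    # Reconciliation step: did the destination end up with what we intended to load?
    expected = len(source_rows)
    if loaded != expected or len(destination) != expected:
        raise ReconciliationError(
            f"expected {expected} rows, loaded {loaded}, "
            f"destination holds {len(destination)}"
        )
    return loaded


# Usage: a full reload of three rows succeeds and returns the row count.
print(pessimistic_load([{"id": 1}, {"id": 2}, {"id": 3}], []))  # 3
```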
Reconciliation Method 1: Row Counts
There are a number of ways to accomplish reconciliation. The most common is record counts. Count the number of rows you expected to load from a source – perhaps a flat file or database table – and then count the number of rows you actually loaded. This is easiest when you delete all the records from the destination before the load and then load all the records from the source to the destination. You simply count the rows in the source, count the rows in the destination, and check that the two counts match.
Things get tricksy when you are loading only some of the rows from a source, when you are loading rows to a destination that already contains records, or both.
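For the simple truncate-and-reload case, the count check can be as small as the sketch below. It assumes the source and destination are tables reachable from Python (here via the standard-library sqlite3 module, for a self-contained demo); the table names are placeholders.

```python
import sqlite3


def reconcile_row_counts(conn, source_table, destination_table):
    """Raise when source and destination row counts differ after a full reload."""
    src = conn.execute(f"SELECT COUNT(*) FROM {source_table}").fetchone()[0]
    dst = conn.execute(f"SELECT COUNT(*) FROM {destination_table}").fetchone()[0]
    if src != dst:
        raise RuntimeError(f"Row count mismatch: source={src}, destination={dst}")
    return src


# Usage against an in-memory database with placeholder tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stage_orders (id INTEGER);
    CREATE TABLE dw_orders (id INTEGER);
    INSERT INTO stage_orders VALUES (1), (2), (3);
    INSERT INTO dw_orders SELECT id FROM stage_orders;
""")
print(reconcile_row_counts(conn, "stage_orders", "dw_orders"))  # 3
```

For the trickier incremental or merge cases, one common approach – an assumption here, not a prescription – is to count only the rows that belong to the current load on both sides, for example by filtering on a batch identifier or load date, so the comparison stays apples-to-apples.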
Reconciliation Method 2: Numeric Values
Another way to accomplish reconciliation is to use numeric values. Star schema data warehouses contain dimensions and facts. Dimensions are usually descriptive and facts are usually collections of numeric measures (plus artificial keys to related dimensions). Summing numeric measures in a fact is one way to reconcile values between a fact source and destination, but sometimes the numeric values become very large – they can grow to exceed the maximum value for the data type.
One way to avoid exceeding the maximum value of a numeric data type is to hash the numeric values. There are a couple of techniques for hashing numeric data, but the goal is the same: produce a unique-ish value that can be compared between the source and the destination. One of my favorite tricks is to use a running modulus with a prime number – the larger the prime number, the better.
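Here is a rough sketch of the running-modulus idea in Python. The prime (the Mersenne prime 2^61 - 1) is an arbitrary example; the point of taking the modulus after every addition is to keep the accumulator bounded, which matters when the checksum lives in a fixed-width numeric column rather than Python's arbitrary-precision integers.

```python
# Assumed prime; any sufficiently large prime works - the larger, the better.
PRIME = 2**61 - 1


def running_modulus_checksum(values, prime=PRIME):
    """Fold numeric measures into a bounded checksum by taking the modulus
    after every addition, so the accumulator never overflows a fixed-width
    numeric type."""
    acc = 0
    for v in values:
        acc = (acc + v) % prime
    return acc


# Compare checksums instead of raw sums between source and destination.
source_measures = [1_200, 450, 987_654_321, 42]
destination_measures = [1_200, 450, 987_654_321, 42]
assert running_modulus_checksum(source_measures) == running_modulus_checksum(destination_measures)
```

Like any hash, this gives a unique-ish value rather than a guarantee: a matching checksum is strong evidence the measures reconcile, while a mismatch is proof they do not.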
The Development Lifecycle of a Pessimistic Load
Pessimistic loads flow through a maturity model or lifecycle. Like every load process, pessimistic loads begin by failing a lot. Initial failures occur because the source data is not cleansed (or not properly cleansed), and conditions like NULLs or duplicates raise errors when one tries to load a destination from a source.
As the process matures NULLs are replaced or managed, duplicates are detected and removed, and then quality / consistency checks – soft rules – are added to make sure the data is correct. Soft rules include checks for things like “is the start date of a record earlier than the end date?”
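As an illustration of a soft rule, a date-ordering check might look like the sketch below; the row structure and column names are assumptions for the example, not from any particular schema.

```python
from datetime import date

rows = [
    {"id": 1, "start_date": date(2023, 1, 1), "end_date": date(2023, 6, 30)},
    {"id": 2, "start_date": date(2023, 9, 1), "end_date": date(2023, 3, 31)},
]

# Soft rule: a record's start date must be earlier than its end date.
# Missing (None) dates also fail the rule rather than slipping through.
violations = [
    r["id"] for r in rows
    if r["start_date"] is None
    or r["end_date"] is None
    or r["start_date"] >= r["end_date"]
]

# In this sample, row 2 violates the rule, so the strict check raises.
if violations:
    raise ValueError(f"Soft rule violated by rows: {violations}")
```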
Therefore the first phase of implementing a pessimistic load is filled with data quality errors.
The second phase is interesting because data quality errors begin dropping in the load process (and in the load packages if one is loading using SSIS) but the reconciliation process continues to throw errors – stopping the load process if so designed – because the reconciliation checks do not know the data being loaded is good.
Process Fault Inversion
It is common for the enterprise to witness a kind of “process fault inversion” as the load process cleanses more and more data while the reconciliation process lags behind in recognizing, separating, and counting (or calculating hash values for) data that conforms to the updated cleansing rules. Eventually the load process reaches a point where the only failures are due to the reconciliation process rejecting, missing, or miscalculating essentially good data.
Although this is a frustrating phase of the project, it is a necessary and positive development in the maturity of a pessimistic load because once the reconciliation rules are synchronized with the data cleansing rules, the enterprise achieves a load process that is nigh bulletproof.
Conclusion
Confidence in the integrity of the load increases as errors decrease – optimally to zero – and, when an error does occur, the mature pessimistic load process can be trusted to have detected a truly new and different source data condition. Hence the name: pessimistic load.
It is curious that something as basic as counting the rows you loaded now counts as pessimistic and needs pointing out. Those of us who started our careers in the highly structured environment of mainframes would have been fired if we didn’t provide record counts at every step of processing. It was just IT 101. Now we have to point it out to people as something worthwhile to do. When I interview ETL developers, it is very common for them to give me a quizzical look when I ask about error handling, logging, and validation counts.