Introduction
There’s quite a bit of hoopla about Master Data Management (MDM) lately, mostly due to increased awareness. While “Lacks needed data” was cited as the number one problem in data warehouses (with 21 votes) in the recent IBM Data Warehousing Satisfaction Survey (2007), “Insufficient or inadequate master data” made a decent showing in second place with 17 votes (see Table 3).
The new Microsoft MDM site is a good place to start learning about Master Data Management and Microsoft’s efforts in this area.
Master Data is a pretty common thing for engineers. I learned about it way back in my manufacturing engineering days.
The Issue
Consider this scenario: Conglomerate C (CC) makes widgets and starts acquiring businesses that also make widgets. CC sells widgets by the pound, but Acquisition A (AA) measures them by counting individual widgets, while Acquisition B (AB) sells them by the case (gross, or 144 ea).
CC now wants all this data in a data warehouse so they can compare apples to apples and know, among other things, how many widgets they’re actually making and selling on a given day.
Note: Instrumentation and measurement are scientific disciplines in their own right. There’s a lot more to this, which I hope to cover here and in my new Applied MDM blog.
The Unit of Measure in the existing database, dbCC, is pounds. The Widgets tables from the three companies look like this:
dbCC.dbo.Widgets
ID | Date | Weight |
1 | 1/1/2007 | 2076 |
2 | 1/2/2007 | 2100 |
3 | 1/3/2007 | 1977 |
dbAA.Product.Widgets
ProductID | Date | Count |
F0932E13-218D-458A-BE09-3286AFDE0280 | 1 Jan 2007 | 10,265 |
F68BF7AC-553E-4A32-B1CB-442DD310194C | 2 Jan 2007 | 13,009 |
8C0C7511-1386-4C13-84B8-2351248280E6 | 3 Jan 2007 | 17,121 |
dbAB.dbo.Widgets
ID | Date | Cases |
1 | 20070101 | 84 |
2 | 20070102 | 82 |
3 | 20070103 | 99 |
Standards, Standards, Standards
One facet of MDM is standardizing this data. The keys to doing so are recognizing traits in the data types and judging the quality of any proposed conversion.
For example, the Cases to Count ratio is most likely stable and predictable. Conversion is easily accomplished using multiplication (or division, depending on which way you go in the standardization). Quality will be high.
But the weight to count (individual or case count) conversion is going to be impacted by other factors. Most notably, do all widgets weigh the same? If not, what’s the acceptable tolerance? Quality will be adversely affected by tolerance.
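To make the tolerance question concrete, here is a rough sketch in Python. The 0.2 lb nominal widget weight is just the customer's five-per-pound figure turned upside down, the 5% tolerance is a value I made up for illustration, and `count_bounds` is a hypothetical helper, not part of any MDM product.

```python
def count_bounds(total_weight_lb, nominal_lb_per_widget, tolerance):
    """Return (low, high) widget counts consistent with a measured total weight.

    tolerance is the assumed fractional spread in per-widget weight
    (0.05 means each widget weighs within +/-5% of nominal).
    """
    heaviest = nominal_lb_per_widget * (1 + tolerance)
    lightest = nominal_lb_per_widget * (1 - tolerance)
    # Heavier widgets mean fewer widgets per pound, and vice versa.
    return total_weight_lb / heaviest, total_weight_lb / lightest


# 2076 lb is the first dbCC row; 0.2 lb and 5% are illustrative assumptions.
low, high = count_bounds(2076, 0.2, 0.05)
print(f"between {low:.0f} and {high:.0f} widgets")
```

The wider the tolerance, the wider the band of plausible counts, and the lower the quality of anything converted from weight.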
Dimensional analysis (the multiplication or division you do to convert known quantities) is also a question about measurement granularity. Trust me: You will want to store as fine a grain as possible.
Looking at the sample data, you will want to store WidgetCount somewhere. dbAA is already in this format. Yay. dbAB is easy enough: dbAB.dbo.Widgets.Cases * 144 gives you WidgetCount. Again, the math on widget Weight in the dbCC data becomes fuzzy. But our customer assures us each pound contains five widgets. Five widgets per pound is probably a high-quality number: the conversion is much more trustworthy than it would be at, say, five hundred or five thousand widgets per pound!
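The three conversions can be sketched in a few lines of Python. The function name `to_widget_count` and the unit labels are mine, chosen for illustration. The 144 comes from the definition of a gross, and the 5 is the customer's say-so rather than a measurement.

```python
WIDGETS_PER_CASE = 144   # one gross, per the dbAB description
WIDGETS_PER_POUND = 5    # the customer's figure; an assumption, not a measurement


def to_widget_count(value, unit):
    """Convert a source quantity to an individual widget count."""
    if unit == "count":    # dbAA already stores individual widgets
        return value
    if unit == "cases":    # dbAB: an exact ratio, so quality is high
        return value * WIDGETS_PER_CASE
    if unit == "pounds":   # dbCC: only as good as the widgets-per-pound figure
        return value * WIDGETS_PER_POUND
    raise ValueError(f"unknown unit of measure: {unit!r}")


print(to_widget_count(2076, "pounds"))   # 10380
print(to_widget_count(84, "cases"))      # 12096
print(to_widget_count(10265, "count"))   # 10265
```

Raising an error on an unrecognized unit is a deliberate choice in this sketch: a silent guess at the unit of measure is exactly the kind of thing that erodes data quality.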
“Fuzziness” will impact the integrity of your data. A couple of important measures of data warehouse integrity are data accuracy and signal-to-noise ratio (usually defined by the percentage of “unknowns” in the data).
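As a rough illustration of the “percentage of unknowns” idea, here is a minimal Python sketch. The helper name `unknown_percentage`, the set of markers treated as unknown, and the sample column are all made up for the example; your warehouse will have its own conventions for what counts as unknown.

```python
def unknown_percentage(values, unknown_markers=(None, "", "Unknown")):
    """Return the percentage of values that match an 'unknown' marker."""
    values = list(values)
    if not values:
        return 0.0  # an empty column has no unknowns to report
    unknowns = sum(1 for v in values if v in unknown_markers)
    return 100.0 * unknowns / len(values)


# Hypothetical column: two of the four values are unknown.
sample_counts = [10380, None, 9885, "Unknown"]
print(unknown_percentage(sample_counts))  # 50.0
```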
There are other issues in these examples: date and ID standardization. Dates are fairly straightforward. The IDs can be a little tricky. To standardize the IDs in this example I may consider a LocationID and ProductDateID (ID) composite key on the first pass.
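For the dates, one way to sketch the standardization in Python is a format string per source. I am assuming the dbCC dates are month-first (the sample rows cannot tell us either way) and that the month abbreviations in dbAA are English. `SOURCE_DATE_FORMATS` and `standardize_date` are names I made up for the example.

```python
from datetime import datetime

# One strptime format per source, matching the sample rows above.
SOURCE_DATE_FORMATS = {
    "dbCC": "%m/%d/%Y",   # 1/2/2007 (assumed month-first)
    "dbAA": "%d %b %Y",   # 2 Jan 2007 (%b depends on locale; English assumed)
    "dbAB": "%Y%m%d",     # 20070102
}


def standardize_date(raw, source):
    """Parse a source-specific date string into a datetime.date."""
    return datetime.strptime(raw, SOURCE_DATE_FORMATS[source]).date()


print(standardize_date("1/2/2007", "dbCC"))    # 2007-01-02
print(standardize_date("2 Jan 2007", "dbAA"))  # 2007-01-02
print(standardize_date("20070102", "dbAB"))    # 2007-01-02
```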
I might create a couple of tables in the data warehouse staging database that look like this:
Staging.Products.Widget
LocationID | ID | Date | Count |
1 | 1 | 1/1/2007 | 10,380 |
1 | 2 | 1/2/2007 | 10,500 |
1 | 3 | 1/3/2007 | 9,885 |
2 | 1 | 1/1/2007 | 10,265 |
2 | 2 | 1/2/2007 | 13,009 |
2 | 3 | 1/3/2007 | 17,121 |
3 | 1 | 1/1/2007 | 12,096 |
3 | 2 | 1/2/2007 | 11,808 |
3 | 3 | 1/3/2007 | 14,256 |
Staging.Products.Location
LocationID | LocationDescription |
1 | dbCC |
2 | dbAA |
3 | dbAB |
I’ve assumed (based on customer feedback) I get 5 widgets / pound from dbCC, and I know the math for the rest. Here I’ve prepared Location data for a star schema via a LocationID and Location table. But I haven’t touched dates and IDs; they will need some work before reaching the destination data warehouse.
DW.Product.FactWidget
LocationID | DateID | Count |
1 | 1 | 10,380 |
1 | 2 | 10,500 |
1 | 3 | 9,885 |
2 | 1 | 10,265 |
2 | 2 | 13,009 |
2 | 3 | 17,121 |
3 | 1 | 12,096 |
3 | 2 | 11,808 |
3 | 3 | 14,256 |
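One plausible way to get from the staging rows to fact rows like these is to swap each standardized date for a surrogate DateID. The Python sketch below numbers the distinct dates in order, which happens to reproduce the DateIDs shown above. In a real warehouse the DateID would more likely come from a date dimension table, so treat `build_date_ids` and `to_fact_rows` as hypothetical helpers.

```python
from datetime import date

# (LocationID, ID, Date, Count) rows from Staging.Products.Widget,
# with the dates already standardized.
staging_rows = [
    (1, 1, date(2007, 1, 1), 10380),
    (1, 2, date(2007, 1, 2), 10500),
    (1, 3, date(2007, 1, 3), 9885),
    (2, 1, date(2007, 1, 1), 10265),
    (2, 2, date(2007, 1, 2), 13009),
    (2, 3, date(2007, 1, 3), 17121),
    (3, 1, date(2007, 1, 1), 12096),
    (3, 2, date(2007, 1, 2), 11808),
    (3, 3, date(2007, 1, 3), 14256),
]


def build_date_ids(rows):
    """Assign 1-based surrogate IDs to the distinct dates, in date order."""
    distinct_dates = sorted({row[2] for row in rows})
    return {d: i for i, d in enumerate(distinct_dates, start=1)}


def to_fact_rows(rows):
    """Return (LocationID, DateID, Count) tuples for the fact table."""
    date_ids = build_date_ids(rows)
    return [(loc, date_ids[d], count) for loc, _source_id, d, count in rows]


for fact_row in to_fact_rows(staging_rows):
    print(fact_row)
```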
There’s more to Master Data Management, but this is one type of business problem folks are trying to solve when they talk about MDM and how it ultimately relates to data warehouse quality and customer satisfaction.
:{> Andy