Master Data Management 101: Standardize Units of Measure

Introduction

 

   There’s quite a bit of hoopla about Master Data Management (MDM) lately, mostly due to growing awareness. While “Lacks needed data” was cited as the number-one problem (with 21 votes) in data warehouses in the recent IBM Data Warehousing Satisfaction Survey (2007), “Insufficient or inadequate master data” made a decent showing in second place with 17 votes (see Table 3).

 

   The new Microsoft MDM site is a good place to start learning about Master Data Management and Microsoft’s efforts in this area.

   Master Data is a pretty common thing for engineers. I learned about it way back in my manufacturing engineering days.

 

The Issue

 

   Consider this scenario: Conglomerate C (CC) makes widgets and starts acquiring businesses that also make widgets. CC sells widgets by the pound, but Acquisition A (AA) measures them by counting individual widgets, while Acquisition B (AB) sells them by the case (a gross, or 144 each).

 

   CC now wants all this data in a data warehouse so they can compare apples to apples and know, among other things, how many widgets they’re actually making and selling on a given day.

 

Note: Instrumentation and measurement are scientific disciplines in their own right. There’s a lot more to this, which I hope to cover here and in my new Applied MDM blog.

 

   The Unit of Measure in the existing database, dbCC, is pounds. The Widgets tables from the three companies look like this:

 

dbCC.dbo.Widgets
ID  Date      Weight
1   1/1/2007  2076
2   1/2/2007  2100
3   1/3/2007  1977

dbAA.Product.Widgets
ProductID                             Date        Count
F0932E13-218D-458A-BE09-3286AFDE0280  1 Jan 2007  10,265
F68BF7AC-553E-4A32-B1CB-442DD310194C  2 Jan 2007  13,009
8C0C7511-1386-4C13-84B8-2351248280E6  3 Jan 2007  17,121

dbAB.dbo.Widgets
ID  Date      Cases
1   20070101  84
2   20070102  82
3   20070103  99

 

 

Standards, Standards, Standards

 

   One facet of MDM is standardizing data like this. The keys are recognizing the traits of each measure’s data type and assessing the quality of any proposed conversion.

   For example, the Cases-to-Count ratio is stable and predictable. Conversion is easily accomplished with multiplication (or division, depending on which direction you standardize). Quality will be high.

   But a weight-to-count conversion (whether individual or case counts) is going to be impacted by other factors. Most notably: do all widgets weigh the same? If not, what’s the acceptable tolerance? Quality will be adversely affected as tolerance widens.

 

   Dimensional analysis (the multiplication or division you do to convert between known quantities) also raises the question of measurement granularity. Trust me: you will want to store as fine a grain as possible.

    Looking at the sample data, you will want to store WidgetCount somewhere. dbAA is already in this format. Yay. dbAB is easy enough: dbAB.dbo.Widgets.Cases * 144 gives you WidgetCount. The math on widget Weight in the dbCC data, however, is fuzzy. But our customer assures us each pound contains five widgets. Five widgets per pound is probably a high-quality number: each widget weighs enough that a small error at the scale shifts the count by only a few widgets. At, say, five hundred or five thousand widgets per pound, the same weighing error would swing the count dramatically!
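   The standardization above boils down to two conversion factors. Here’s a minimal Python sketch (the 144-per-case factor is exact by definition of a gross; the 5-per-pound factor is the customer’s estimate, and the names are mine, not from any source system):

```python
WIDGETS_PER_POUND = 5    # customer-supplied estimate for dbCC (fuzzy quality)
WIDGETS_PER_CASE = 144   # a gross; exact for dbAB

def to_widget_count(value, unit):
    """Convert a source measurement to the standardized WidgetCount grain."""
    conversions = {
        "count": lambda v: v,                      # dbAA: already a count
        "case":  lambda v: v * WIDGETS_PER_CASE,   # dbAB: exact conversion
        "pound": lambda v: v * WIDGETS_PER_POUND,  # dbCC: estimated conversion
    }
    return conversions[unit](value)

print(to_widget_count(2076, "pound"))  # 10380
print(to_widget_count(84, "case"))     # 12096
```

Note that only the "pound" branch carries estimation error; the other two are exact, which is exactly the quality distinction discussed above.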

 

   “Fuzziness” will impact the integrity of your data. A couple of important measures of data warehouse integrity are data accuracy and signal-to-noise ratio (usually defined as the percentage of “unknowns” in the data).
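   The percentage-of-unknowns measure is easy to compute. A hypothetical helper (the function name and the “Unknown” sentinel are my assumptions, not a standard):

```python
def percent_unknown(rows, column):
    """Percentage of rows whose value in `column` is missing or 'Unknown' --
    one simple signal-to-noise measure for a warehouse table."""
    unknown = sum(1 for row in rows if row.get(column) in (None, "", "Unknown"))
    return 100.0 * unknown / len(rows)

rows = [{"Count": 10380}, {"Count": None}, {"Count": 9885}, {"Count": "Unknown"}]
print(percent_unknown(rows, "Count"))  # 50.0
```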

   When I have encountered this scenario in the field, I have informed the customer of the dangers and begged them to collect better metrics at the WidgetWeight station for company CC.

 

Enter Stage Left…

 

   There are other issues in these examples: date and ID standardization. Dates are fairly straightforward; the IDs can be a little tricky. To standardize the IDs in this example, I may consider a composite key of LocationID and ProductDateID (ID) on the first pass.

 

   I might create a couple tables in the data warehouse staging database that look like this:

 

Staging.Products.Widget
LocationID  ID  Date      Count
1           1   1/1/2007  10,380
1           2   1/2/2007  10,500
1           3   1/3/2007  9,885
2           1   1/1/2007  10,265
2           2   1/2/2007  13,009
2           3   1/3/2007  17,121
3           1   1/1/2007  12,096
3           2   1/2/2007  11,808
3           3   1/3/2007  14,256

Staging.Products.Location
LocationID  LocationDescription
1           dbCC
2           dbAA
3           dbAB

 

   I’ve assumed (based on customer feedback) that I get 5 widgets per pound from dbCC, and I know the exact math for the rest. Here I’ve prepared Location data for a star schema via a LocationID column and a Location table. But I haven’t touched dates and IDs; they will need some work before reaching the destination data warehouse.
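   The staging load described above can be sketched like this (rows are shaped like the three source tables; the variable names are mine, and date/ID cleanup is deferred, just as in the text):

```python
# Source rows: (ID, Date, Weight), (ProductID, Date, Count), (ID, Date, Cases)
dbcc = [(1, "1/1/2007", 2076), (2, "1/2/2007", 2100), (3, "1/3/2007", 1977)]
dbaa = [("F0932E13-218D-458A-BE09-3286AFDE0280", "1 Jan 2007", 10265),
        ("F68BF7AC-553E-4A32-B1CB-442DD310194C", "2 Jan 2007", 13009),
        ("8C0C7511-1386-4C13-84B8-2351248280E6", "3 Jan 2007", 17121)]
dbab = [(1, "20070101", 84), (2, "20070102", 82), (3, "20070103", 99)]

staging = []  # (LocationID, ID, Date, Count)
for i, (_, date, pounds) in enumerate(dbcc, start=1):
    staging.append((1, i, date, pounds * 5))    # estimated: 5 widgets/pound
for i, (_, date, count) in enumerate(dbaa, start=1):
    staging.append((2, i, date, count))         # already a count
for i, (_, date, cases) in enumerate(dbab, start=1):
    staging.append((3, i, date, cases * 144))   # exact: 144 widgets/case
```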

 

Just the Fact, Ma’am

 

   When I load the destination data warehouse, LocationID maps nicely to the Location dimension surrogate key, while I could simply make the column now labeled ID into a Date dimension surrogate key (removing the Date column from the fact table altogether because it’s now redundant).

   This three-column table then makes a decent Fact:

 

DW.Product.FactWidget
LocationID  DateID  Count
1           1       10,380
1           2       10,500
1           3       9,885
2           1       10,265
2           2       13,009
2           3       17,121
3           1       12,096
3           2       11,808
3           3       14,256
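   A minimal sketch of that fact load, under the same assumptions as before (the date strings here happen to sort correctly, so sorting stands in for a real Date-dimension lookup):

```python
# Staging rows: (LocationID, ID, Date, Count), with dates already uniform.
staging = [
    (1, 1, "1/1/2007", 10380), (1, 2, "1/2/2007", 10500), (1, 3, "1/3/2007", 9885),
    (2, 1, "1/1/2007", 10265), (2, 2, "1/2/2007", 13009), (2, 3, "1/3/2007", 17121),
    (3, 1, "1/1/2007", 12096), (3, 2, "1/2/2007", 11808), (3, 3, "1/3/2007", 14256),
]
# Assign a surrogate DateID per distinct date, then drop the redundant Date column.
date_dim = {d: i for i, d in enumerate(sorted({row[2] for row in staging}), start=1)}
fact = [(loc, date_dim[date], count) for loc, _, date, count in staging]
print(fact[0])  # (1, 1, 10380)
```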

 

   There’s more to Master Data Management, but this is one type of business problem folks are trying to solve when they talk about MDM, and it shows how MDM ultimately relates to data warehouse quality and customer satisfaction.

 

:{> Andy

 

Technorati Tags: Andy Leonard, andyleonard.blog

