This post is part of a series titled Enterprise Data Warehouses: Uncomfortable Truths.
There are cases where maintaining an Enterprise Data Warehouse costs less than developing one. Which cases? When enterprises simply do not maintain their Data Warehouse: EDW maintenance then costs nothing (or nearly nothing).
One side-effect of not maintaining an EDW is that the Enterprise Data Warehouse becomes less and less reliable until it is effectively useless.
One may calculate the quality of the data in a data warehouse by:
- Acquiring the total number of rows in all the tables in the database that hosts the data warehouse (QT);
- Counting the number of rows containing “bad” – in-error due to missing, duplicated, or otherwise inaccurate – data (QB);
- Subtracting the number of rows containing bad data from the total number of rows to ascertain the number of “good” data rows (QG = QT – QB); and then
- Calculating the percentage of good data rows in the data warehouse by dividing the number of good rows by the total number of rows (QDW = QG / QT * 100%).
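The calculation above can be sketched in a few lines; a minimal sketch, assuming the row counts QT and QB have already been gathered (the function name and example numbers are the author's-style illustration, not part of the original post):

```python
def data_quality_pct(total_rows: int, bad_rows: int) -> float:
    """Return QDW, the percentage of 'good' rows: QDW = QG / QT * 100%."""
    if total_rows == 0:
        return 100.0  # an empty warehouse contains no bad rows
    good_rows = total_rows - bad_rows       # QG = QT - QB
    return good_rows / total_rows * 100.0   # QDW = QG / QT * 100%

# Hypothetical example: 1,000,000 total rows, 8,500 flagged as bad
print(data_quality_pct(1_000_000, 8_500))  # 99.15 -- above the 99% threshold
```

Note that even with 8,500 bad rows in a million, QDW sits above 99% — which is exactly the range where, per the discussion below, data can already be "too bad."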
One challenge in calculating Enterprise Data Warehouse data quality is identifying “bad” data rows.
When Does “Bad” Become “Too Bad?”
Crossing the threshold between “bad data” and “data that is too bad” is somewhat dependent on how the data is being used by the enterprise.
Don’t let that simplistic-sounding response trip you up. Please recognize two truths about the sentence above:
- The sentence above has nothing to do with math.
- The sentence above has everything to do with enterprise culture; specifically, enterprise data culture.
If the data is used for predictive analytics – training machine learning models, for example – the impact of bad data ranges from minimal to multiplied, since many predictive applications focus on small (or single) observed predictions, often limited to predicting the “next step” by an individual or entity. Errors in data used for descriptive analytics may multiply the deltas between predicted and actual observations when they are compared later. When supplied to prescriptive processes, “bad” data may drive decisions of increasingly lower quality at the enterprise level.
Experience informs the author that Data Warehouse Data Quality – QDW – is often greater than 99% at the point where “bad” data crosses the threshold and renders the enterprise data warehouse data quality “too bad.”
How to Fix Low Quality Data in an Enterprise Data Warehouse
One way to improve low-quality data in an Enterprise Data Warehouse is to adapt the scientific method into a data warehouse lifecycle management process similar to: Measure > Automate > Test > Measure…
Measure Bad Data
Detecting bad data is easier in the early stages. Simply look for data captured in “error” tables or data that causes processes – especially data integration processes – to fail. As the process matures, however, detecting and measuring bad data becomes more complex and more difficult to achieve accurately. At some point, diminishing returns kick in – which impacts detection and correction automation.
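In the early stages, measuring bad data can be as simple as counting rows in the companion “error” tables the data integration process already writes to. A minimal sketch, assuming a staging database where failed rows land in a hypothetical `orders_error` table (all table names and columns here are invented for illustration):

```python
import sqlite3

# In-memory stand-in for the staging database
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, amount REAL);
    CREATE TABLE orders_error (id INTEGER, amount REAL, error_msg TEXT);
    INSERT INTO orders VALUES (1, 10.0), (2, 25.5), (3, 7.25);
    INSERT INTO orders_error VALUES (4, NULL, 'missing amount');
""")

def count_rows(table: str) -> int:
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

qt = count_rows("orders") + count_rows("orders_error")  # QT: all captured rows
qb = count_rows("orders_error")                         # QB: rows that failed
print(f"QT={qt}, QB={qb}, QDW={(qt - qb) / qt * 100:.2f}%")
```

As the paragraph above notes, this only works while bad data conveniently self-identifies by failing; mature processes need subtler (and costlier) detection.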
Once bad data is isolated, root cause analysis follows. Early in the process, it’s common to find root causes that follow the Pareto principle, with 80% of error rows caused by a similar error in the process. As with measuring bad data, the complexity of automating detection increases until diminishing returns set in. It’s common to reach a point where the cost of the cure outweighs the cost of the malady – a point where manual updates are preferred to overly-complex automation.
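The Pareto-style root cause analysis described above can be sketched by grouping error rows by their messages and accumulating until the cutoff is reached (the error messages below are invented for illustration):

```python
from collections import Counter

# Hypothetical error messages captured alongside failed rows
error_rows = [
    "missing customer_id", "missing customer_id", "missing customer_id",
    "missing customer_id", "duplicate order key", "invalid date format",
]

counts = Counter(error_rows)
total = sum(counts.values())
running = 0
for cause, n in counts.most_common():
    running += n
    print(f"{cause}: {n} rows ({running / total:.0%} cumulative)")
    if running / total >= 0.8:   # Pareto cutoff: automate fixes for these first
        break
```

Causes past the cutoff are candidates for manual correction rather than automation — the “cost of the cure” point made above.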
Test any automation applied to data integration and processing in enterprise data warehouse solutions. It’s possible to unintentionally introduce additional variation in data warehouse quality when applying fixes for more recently discovered issues. The author recommends storing all tests applied to detect earlier root causes as regression tests. Execute regression tests after each automation solution to make sure the latest automation does not re-introduce previously addressed issues.
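The regression-test registry recommended above can be sketched as a list of detection checks, one per previously fixed root cause, re-run after every automation change (the check functions and row shapes here are hypothetical stand-ins for real detection queries):

```python
# Each test re-runs a detection check and passes only if no rows match.
def no_missing_customer_ids(rows) -> bool:
    return all(r.get("customer_id") is not None for r in rows)

def no_duplicate_order_keys(rows) -> bool:
    keys = [r["order_key"] for r in rows]
    return len(keys) == len(set(keys))

# Registry of tests for every root cause fixed so far
REGRESSION_TESTS = [no_missing_customer_ids, no_duplicate_order_keys]

def run_regression_tests(rows) -> list:
    """Return the names of tests that fail for the latest load."""
    return [t.__name__ for t in REGRESSION_TESTS if not t(rows)]

latest_load = [
    {"customer_id": 1, "order_key": "A-1"},
    {"customer_id": 2, "order_key": "A-2"},
]
print(run_regression_tests(latest_load))  # [] -> no re-introduced issues
```

Each new root cause adds one entry to the registry, so the suite grows with the lifecycle instead of being rebuilt for each fix.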
Measure the results and respond accordingly.
Adding Data, Measures, and Analyses
In addition to managing data quality throughout the lifecycle of an enterprise data warehouse, your EDW should be designed with the future in mind. “Future-proofing” an enterprise data warehouse solution may involve:
- Building Extract, Transform, and Load (ETL) solutions – or ELT (Extract, Load, and Transform) solutions – using tried-and-true design patterns.
- Using code-generation automation – such as Business Intelligence Markup Language (or Biml) – for automated SSIS, Azure Data Factory, and T-SQL code-generation where applicable.
- Putting Data Integration Lifecycle Management (DILM) and DevOps for data integration into practice in the enterprise.
The Hidden Costs of Maintaining an Enterprise Data Warehouse
When updating or extending data or functionality in an enterprise data warehouse solution, developing smarter code trumps developing more code. Lack of experience – especially experience at scale – leads some to develop solutions that increase the total cost of ownership. The cost of maintaining the enterprise data warehouse is usually distributed over 5-10 years. Since IT departments are often treated as expense centers or sunk costs, the investment required to update, or simply manage, an enterprise data warehouse is easy to underestimate (or overlook entirely).
By contrast, the cost of enterprise data warehouse development is usually capitalized and experienced by the enterprise as a large, lump-sum expense. A well-designed enterprise data warehouse is easier and less-expensive to maintain and extend.
Enterprise data warehouse maintenance often costs more than developing an enterprise data warehouse.
A future post in this series covers reasons why Building an Enterprise Data Warehouse is Not Enough. The current post explains two reasons Enterprise Data Warehouses should not merely be built, but also maintained.
Need Help Implementing Your Enterprise Data Warehouse?
At Enterprise Data & Analytics, we believe you should ask four questions when hiring an enterprise data consultancy.
Enterprise Data & Analytics offers comprehensive Data Warehouse, ETL, SSIS, and Azure Data Factory knowledge, experience, and success to your enterprise. We engage in three phases:
- Training – Enterprise Data & Analytics offers enterprise-scale, experience-based training; sharing knowledge from experience with enterprise teams of all sizes.
- Partnering – Enterprise Data & Analytics team members join your enterprise team members at the developer, project manager, and developer manager levels to shepherd (freshly-trained) team members as we – together – apply data integration patterns to your enterprise data warehouse solution.
- Sustaining – After the code is implemented, tested, documented, and signed-off, Enterprise Data & Analytics team members stand ready to support your enterprise data warehouse team in the event of a disaster, re-join your team to jumpstart the next version, or merely answer a question.
Contact us today!