This post was originally published at SQLBlog.
Why automate? Because, as Jen Underwood (jenunderwood.com | LinkedIn) states in the Data Driven podcast episode Jen Underwood on Data Visualization and Automating Data Science: “The future of data science is automation.”
If automation is the future, how do we decide what to automate? We look for the long pole – the task on the critical path that takes the longest. What’s the long pole in data science?
Data Integration is the Long Pole
According to Lars Nielsen in his book Unicorns Among Us:
“As intellectually stimulating as the work might be, there are also mundane aspects to the data scientist’s job. In fact, there is pure drudge work that can take from 50 to 80 percent of a data scientist’s time: This involves the collecting, cleaning and organizing of unruly, disparate, messy, unstructured data before it can be mined for the gold of actionable BI. Some call this aspect of the job ‘data wrangling.’” (emphasis mine)
The long pole in data science is “data wrangling” or “data munging,” also known as data integration.
“How Do We Automate Data Integration, Andy?”
I’m glad you asked! My answer is:
- Eliminate redundancy.
- Automate execution.
- Practice DILM (Data Integration Lifecycle Management).
First, eliminate the cut-and-paste drudge work from data integration development.
What is the “cut-and-paste drudge work”? A significant amount of data integration is based on repeatable patterns. One can automate SSIS design patterns, such as truncate-and-load and incremental load, using Biml (Business Intelligence Markup Language). Hours, days, and weeks of work can be reduced to minutes by using Biml and Biml Frameworks, such as BimlFlex and the BimlExpress Metadata Framework.
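To see why these patterns invite automation, consider what truncate-and-load and incremental load boil down to in T-SQL. This is a minimal sketch; the stage.Customer and dbo.DimCustomer tables and their columns are hypothetical:

```sql
-- Truncate-and-load: empty the target, then reload it from staging.
TRUNCATE TABLE dbo.DimCustomer;

INSERT INTO dbo.DimCustomer (CustomerID, CustomerName, City)
SELECT CustomerID, CustomerName, City
FROM   stage.Customer;

-- Incremental load: update rows that changed, insert rows that are new.
UPDATE d
SET    d.CustomerName = s.CustomerName
     , d.City         = s.City
FROM   dbo.DimCustomer AS d
JOIN   stage.Customer  AS s ON s.CustomerID = d.CustomerID
WHERE  d.CustomerName <> s.CustomerName
   OR  d.City         <> s.City;

INSERT INTO dbo.DimCustomer (CustomerID, CustomerName, City)
SELECT s.CustomerID, s.CustomerName, s.City
FROM   stage.Customer AS s
WHERE  NOT EXISTS ( SELECT 1 FROM dbo.DimCustomer AS d
                    WHERE d.CustomerID = s.CustomerID );
```

Hand-building that logic (or the equivalent SSIS dataflow) for hundreds of tables is exactly the drudge work Biml eliminates: change the metadata, regenerate the packages.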
Second, automate data integration execution.
Data integration execution has long been grouped into processing jobs or batches. A best practice in SSIS development is to build small, unit-of-work packages that perform specific load or transformation operations. Why? Smaller packages are easier to develop, test, manage, and maintain. Unit-of-work packages promote code re-use, as some packages can be used in multiple batched operations.
There’s no free lunch, and building unit-of-work packages presents a new issue: now there are many more SSIS packages to execute. An execution framework addresses this issue. Enterprise Data & Analytics offers a free execution framework, the SSIS Framework Community Edition.
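To make the framework idea concrete, here is a sketch of what execution-framework metadata might look like: an “application” is an ordered collection of packages executed as one batch. The schema and table names below are hypothetical, for illustration only – they are not the actual SSIS Framework Community Edition schema:

```sql
-- Hypothetical framework metadata (illustrative; not the actual
-- SSIS Framework Community Edition schema).
CREATE SCHEMA fwk;
GO

-- An "SSIS Application" is a named batch of packages.
CREATE TABLE fwk.Applications
( ApplicationID   INT IDENTITY(1,1) CONSTRAINT PK_fwk_Applications PRIMARY KEY
, ApplicationName NVARCHAR(256) NOT NULL UNIQUE
);

-- The packages in each application, with an execution order.
CREATE TABLE fwk.ApplicationPackages
( ApplicationPackageID INT IDENTITY(1,1)
    CONSTRAINT PK_fwk_ApplicationPackages PRIMARY KEY
, ApplicationID  INT NOT NULL
    CONSTRAINT FK_fwk_ApplicationPackages_Applications
    REFERENCES fwk.Applications(ApplicationID)
, PackageName    NVARCHAR(260) NOT NULL  -- e.g. N'LoadDimCustomer.dtsx'
, ExecutionOrder INT NOT NULL CONSTRAINT DF_fwk_ExecutionOrder DEFAULT (10)
);
```

A parent package reads these rows and executes each child package in order.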
The SSIS Framework Community Edition includes a Parent SSIS package that executes packages collected into batches called “SSIS Applications.” Metadata about SSIS Applications is stored in tables integrated (although isolated by a custom schema) into the SSISDB database. Data integration professionals can configure metadata for three (or three hundred) packages in a single batch, and this batch can be executed by starting the Parent SSIS package and overriding a single parameter.
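Starting the Parent package with a single parameter override uses the standard SSIS Catalog stored procedures. A sketch, assuming the project is deployed to a folder named Framework and the Parent package exposes an ApplicationName package parameter (the folder, project, and parameter names here are assumptions, not the Framework’s documented names):

```sql
DECLARE @execution_id BIGINT;

-- Create an execution of the Parent package in the SSIS Catalog.
EXEC SSISDB.catalog.create_execution
     @folder_name     = N'Framework'      -- assumed folder name
   , @project_name    = N'SSISFramework'  -- assumed project name
   , @package_name    = N'Parent.dtsx'
   , @use32bitruntime = 0
   , @execution_id    = @execution_id OUTPUT;

-- Override one parameter to select the batch of packages to run
-- (@object_type 30 = package parameter).
EXEC SSISDB.catalog.set_execution_parameter_value
     @execution_id    = @execution_id
   , @object_type     = 30
   , @parameter_name  = N'ApplicationName'  -- assumed parameter name
   , @parameter_value = N'Nightly Load';

-- Start the execution.
EXEC SSISDB.catalog.start_execution @execution_id;
```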
Third, data integration needs enterprise lifecycle management, like all other software development. (Data integration development is software development, after all.)
Repeatable, collaborative, and communicable processes form the heart of enterprise DevOps. Repeatable releases and source control for SSIS are no longer optional; they improve code quality and reduce downtime. Enterprises need at least three lifecycle “tiers” – Development, Test, and Production. Development and Production are rarely in question, so why the third tier? What it’s called doesn’t matter, nor does it matter whether there are additional tiers between Development and Production. The third tier matters precisely because it is not Production and not Development.
All software works in Development. Software is built in Development, and the design-time defaults all point to Development resources. Enterprises do not want the first deployment of any software to be the Production deployment. Instead, a test deployment – to a different environment that is neither Development nor Production – ensures all externalized parameters are properly configured and included in the deployment plan. A successful test deployment to a tier that matches Production dramatically improves confidence that the Production deployment will succeed.
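Deployment to the test tier can itself be scripted with the SSIS Catalog’s catalog.deploy_project stored procedure, run against the Test instance. A sketch; the folder name, project name, and .ispac path below are placeholders:

```sql
-- Read the compiled project deployment file (.ispac) as a single blob.
DECLARE @ProjectBinary VARBINARY(MAX) =
( SELECT BulkColumn
  FROM   OPENROWSET(BULK N'C:\Deploy\MyProject.ispac', SINGLE_BLOB) AS ispac );

DECLARE @operation_id BIGINT;

-- Deploy the project to the Test-tier SSIS Catalog.
EXEC SSISDB.catalog.deploy_project
     @folder_name    = N'Framework'   -- placeholder folder
   , @project_name   = N'MyProject'   -- placeholder project
   , @project_stream = @ProjectBinary
   , @operation_id   = @operation_id OUTPUT;
```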
When deploying SSIS to the SSIS Catalog, though, you need to be sure your test deployment tier is closely aligned to the Production environment. That can be… challenging, but SSIS Catalog Compare detects – and can script and deploy – differences between SSIS Catalogs residing in different Data Integration Lifecycle tiers. Catalog Compare generates scripts for externalized parameters – parameters that override the design-time defaults – by scripting SSIS Catalog Environments, Environment Variables, Project and Package References, and Reference Mappings.
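The scripts Catalog Compare generates are built from the same catalog stored procedures you could call by hand. A sketch of externalizing one parameter – creating an Environment and an Environment Variable, referencing the Environment from a project, and mapping a project parameter to the variable (all names below are placeholders):

```sql
-- Create an Environment and a variable holding the Production value.
EXEC SSISDB.catalog.create_environment
     @folder_name             = N'Framework'
   , @environment_name        = N'Prod'
   , @environment_description = N'';

EXEC SSISDB.catalog.create_environment_variable
     @folder_name      = N'Framework'
   , @environment_name = N'Prod'
   , @variable_name    = N'TargetConnectionString'
   , @data_type        = N'String'
   , @sensitive        = 0
   , @value            = N'Data Source=ProdSql;Initial Catalog=DW;Integrated Security=SSPI;'
   , @description      = N'';

-- Reference the Environment from the project.
DECLARE @reference_id BIGINT;
EXEC SSISDB.catalog.create_environment_reference
     @folder_name      = N'Framework'
   , @project_name     = N'SSISFramework'
   , @environment_name = N'Prod'
   , @reference_type   = 'R'  -- relative: Environment lives in the project's folder
   , @reference_id     = @reference_id OUTPUT;

-- Map a project parameter to the Environment Variable, overriding
-- the design-time default (@value_type 'R' = referenced value).
EXEC SSISDB.catalog.set_object_parameter_value
     @object_type     = 20   -- project parameter
   , @folder_name     = N'Framework'
   , @project_name    = N'SSISFramework'
   , @parameter_name  = N'TargetConnectionString'
   , @parameter_value = N'TargetConnectionString'  -- environment variable name
   , @value_type      = 'R';
```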
Conclusion
Why is automation important? Automating data integration changes the dynamics for data science: it returns the 50 to 80 percent of a data scientist’s time now lost to data wrangling to analysis that delivers value.
Jen’s right. Lars is right. Automation is the future of data science and automating the long pole – data integration – is the place to begin. For SSIS automation, DILM Suite is a solution.
What can we accomplish by automating data science? We can change the world.
:{>
Learn More:
IESSIS2: Immersion Event on Advanced Integration Services – Oct 2017, Chicago
Tools:
SSIS Framework Community Edition
BimlExpress Metadata Framework
SSIS Catalog Compare
DILM Suite
Recordings and Posts:
SSIS Lifecycle Management
Advanced SSIS Execution
SSIS Academy: Using the SSIS Catalog Day 1 – Create the Catalog and Deploy
SSIS Academy: Using the SSIS Catalog Day 2 – Package Execution and Monitoring
The Data Integration Lifecycle Management (DILM) Blog Series:
An Example of Data Integration Lifecycle Management with SSIS, Part 0
An Example of Data Integration Lifecycle Management with SSIS, Part 1
An Example of Data Integration Lifecycle Management with SSIS, Part 2
An Example of Data Integration Lifecycle Management with SSIS, Part 3
An Example of Data Integration Lifecycle Management with SSIS, Part 4