Why Automate?

Posted on July 29, 2017 | Updated on July 26, 2017 | by Andy Leonard
Categories: Biml, Biml Frameworks, BimlExpress Metadata Framework, DILM, DILM Suite, SSIS, SSIS Frameworks

This post was originally published at SQLBlog.

Because, as Jen Underwood (jenunderwood.com | LinkedIn) states in the Data Driven podcast episode Jen Underwood on Data Visualization and Automating Data Science: “The future of data science is automation.”

If automation is the future, how do we decide what to automate? We look for the long pole: the task that consumes the most time. What’s the long pole in data science?

Data Integration is the Long Pole

According to Lars Nielsen in his book Unicorns Among Us:

“As intellectually stimulating as the work might be, there are also mundane aspects to the data scientist’s job. In fact, there is pure drudge work that can take from 50 to 80 percent of a data scientist’s time: This involves the collecting, cleaning and organizing of unruly, disparate, messy, unstructured data before it can be mined for the gold of actionable BI. Some call this aspect of the job ‘data wrangling.’” (emphasis mine)

The long pole in data science is “data wrangling” or “data munging,” also known as data integration.

“How Do We Automate Data Integration, Andy?”

I’m glad you asked! My answer is:

  1. Eliminate redundancy.
  2. Automate execution.
  3. Practice DILM (Data Integration Lifecycle Management).

First, eliminate the cut-and-paste drudge work from data integration development.

What is the “cut-and-paste drudge work”? A significant amount of data integration is based on repeatable patterns. One can automate SSIS design patterns, such as truncate-and-load and incremental load, using Biml (Business Intelligence Markup Language). Hours, days, and weeks of work can be reduced to minutes by using Biml and Biml Frameworks, such as BimlFlex and the BimlExpress Metadata Framework.
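For illustration, here is a minimal T-SQL sketch of the incremental load pattern itself – the load logic an SSIS package generated from a Biml pattern would implement. The dbo.Source and dbo.Target tables and their columns are hypothetical.

    -- A minimal sketch of the incremental load pattern in T-SQL.
    -- dbo.Source and dbo.Target are hypothetical tables keyed on Id.

    -- Insert rows present in the source but missing from the target:
    INSERT INTO dbo.Target (Id, Name, Amount)
    SELECT s.Id, s.Name, s.Amount
    FROM dbo.Source AS s
    WHERE NOT EXISTS (SELECT 1 FROM dbo.Target AS t WHERE t.Id = s.Id);

    -- Update target rows whose source values have changed:
    UPDATE t
    SET t.Name = s.Name,
        t.Amount = s.Amount
    FROM dbo.Target AS t
    INNER JOIN dbo.Source AS s ON s.Id = t.Id
    WHERE t.Name <> s.Name OR t.Amount <> s.Amount;

Biml’s value is that a pattern like this is written once and then generated, from metadata, for every table that needs it.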

Second, automate data integration execution.

Data integration execution has long been grouped into processing jobs or batches. A best practice in SSIS development is to build small, unit-of-work packages that perform specific load or transformation operations. Why? Smaller packages are easier to develop, test, manage, and maintain. Unit-of-work packages promote code re-use, as some packages can be used in multiple batched operations.

There’s no free lunch, and building unit-of-work packages presents a new issue: Now one has lots more SSIS packages to execute. An execution framework addresses this issue. Enterprise Data & Analytics offers a free execution framework, the SSIS Framework Community Edition.

The SSIS Framework Community Edition includes a Parent SSIS package that executes packages collected into batches called “SSIS Applications.” Metadata about SSIS Applications is stored in tables added to the SSISDB database, isolated in a custom schema. Data integration professionals can configure metadata for three (or three hundred) packages in a single batch, and the batch can be executed by starting the Parent SSIS package and overriding a single parameter.
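To make the execution model concrete, here is a hedged sketch of starting a parent package and overriding one parameter using documented SSISDB catalog stored procedures. The folder, project, package, and parameter names are hypothetical stand-ins; the SSIS Framework Community Edition’s actual names may differ.

    -- Start a (hypothetical) parent package and override one parameter,
    -- using documented SSISDB catalog stored procedures.
    USE SSISDB;
    GO
    DECLARE @execution_id bigint;

    -- Create an execution instance for the parent package:
    EXEC catalog.create_execution
         @folder_name = N'Framework',
         @project_name = N'SSISFramework',
         @package_name = N'Parent.dtsx',
         @execution_id = @execution_id OUTPUT;

    -- Override the single parameter that selects the SSIS Application (batch):
    EXEC catalog.set_execution_parameter_value
         @execution_id = @execution_id,
         @object_type = 30,                    -- 30 = package-level parameter
         @parameter_name = N'ApplicationName',
         @parameter_value = N'Nightly Load';

    -- Run it:
    EXEC catalog.start_execution @execution_id = @execution_id;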


Third, data integration needs enterprise lifecycle management, like all other software development. (Data integration development is software development, after all.)

Repeatable, collaborative, and communicable processes form the heart of enterprise DevOps. Repeatable releases and source control for SSIS are no longer optional; they improve code quality and reduce downtime. Enterprises need at least three lifecycle “tiers” – Development, Test, and Production. Development and Production are rarely in question, so why the third tier? What this tier is called does not matter, nor does it matter whether additional tiers sit between Development and Production. The third tier matters because it is neither Production nor Development.

All software works in Development. Software is built in Development, and its design-time defaults all point to Development resources. Enterprises do not want the first deployment of any software to be the Production deployment. Instead, a test deployment to a different environment (not Development and not Production) helps ensure all external parameters are properly configured and included in the deployment plan. A successful test deployment to an environment (lifecycle tier) that matches Production dramatically improves confidence that the Production deployment will succeed.

When deploying SSIS to the SSIS Catalog, though, you need to be sure your test deployment tier is closely aligned to the Production environment. That can be… challenging, but SSIS Catalog Compare detects – and can script and deploy – differences between SSIS Catalogs residing in different Data Integration Lifecycle tiers. Catalog Compare generates scripts for externalized parameters – parameters that override the design-time defaults – by scripting SSIS Catalog Environments, Environment Variables, Project and Package References, and Reference Mappings.
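As an illustration of what such externalized parameters look like in the SSIS Catalog, here is a sketch built on documented SSISDB catalog stored procedures. The folder, project, environment, and parameter names are hypothetical; this shows the kind of configuration Catalog Compare scripts, not its actual output.

    -- Create a Test-tier environment and one environment variable:
    EXEC catalog.create_environment
         @folder_name = N'Load',
         @environment_name = N'Test',
         @environment_description = N'Test-tier configuration values';

    EXEC catalog.create_environment_variable
         @folder_name = N'Load',
         @environment_name = N'Test',
         @variable_name = N'TargetConnectionString',
         @data_type = N'String',
         @sensitive = 0,
         @value = N'Data Source=TestSql;Initial Catalog=DW;Integrated Security=SSPI;',
         @description = N'Target database for the Test tier';

    -- Reference the environment from a project:
    DECLARE @reference_id bigint;
    EXEC catalog.create_environment_reference
         @folder_name = N'Load',
         @project_name = N'LoadProject',
         @environment_name = N'Test',
         @reference_type = 'R',  -- 'R' = relative: environment in the same folder
         @reference_id = @reference_id OUTPUT;

    -- Map a project parameter to the environment variable ('R' = referenced value):
    EXEC catalog.set_object_parameter_value
         @object_type = 20,      -- 20 = project-level parameter
         @folder_name = N'Load',
         @project_name = N'LoadProject',
         @parameter_name = N'TargetConnectionString',
         @parameter_value = N'TargetConnectionString',
         @value_type = 'R';

Run against each lifecycle tier with tier-appropriate values, these objects override the design-time defaults without touching the deployed packages.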

Conclusion

Why is automation important? Automating data integration changes the dynamics of data science: it reclaims the 50 to 80 percent of a data scientist’s time now spent wrangling data.

Jen’s right. Lars is right. Automation is the future of data science and automating the long pole – data integration – is the place to begin. For SSIS automation, DILM Suite is a solution.

What can we accomplish by automating data science? We can change the world.

:{>

Learn More:
IESSIS2: Immersion Event on Advanced Integration Services – Oct 2017, Chicago

Tools:
SSIS Framework Community Edition
BimlExpress Metadata Framework
SSIS Catalog Compare
DILM Suite

Recordings and Posts:
SSIS Lifecycle Management
Advanced SSIS Execution
SSIS Academy: Using the SSIS Catalog Day 1 – Create the Catalog and Deploy
SSIS Academy: Using the SSIS Catalog Day 2 – Package Execution and Monitoring

The Data Integration Lifecycle Management (DILM) Blog Series:
An Example of Data Integration Lifecycle Management with SSIS, Part 0
An Example of Data Integration Lifecycle Management with SSIS, Part 1
An Example of Data Integration Lifecycle Management with SSIS, Part 2
An Example of Data Integration Lifecycle Management with SSIS, Part 3
An Example of Data Integration Lifecycle Management with SSIS, Part 4


Tagged: Data Integration, Data Science, Data Wrangling, DILM, Munging

