DILM Suite + Azure Data Factory Integration Runtime

DILM Suite was designed to support enterprise data engineering / data integration with SSIS. But the solutions, utilities, and tools work well with Azure Data Factory Integration Runtime (ADF IR, or “SSIS in the cloud”).

As I mentioned in my post titled The Cloud Costs Money, leaving ADF Integration Runtime running can get pricey – and fast.

Most DILM Suite Functions Do Not Require ADF Integration Runtime to be Running

That’s right. Since DILM Suite tools and utilities connect directly to the database, they do not require Azure Data Factory Integration Runtime to be in a running state for most operations.

Here, for example, I’ve connected Framework Browser to an instance of SSIS Framework Community Edition – deployed to the ADF Integration Runtime as demonstrated in the latest Summer-O’-ADF webinar, Designing a Custom ADF SSIS Execution Framework – doesn’t have to be in a Running state to allow interaction with Framework Browser and Catalog Browser (click to enlarge):

That’s handy information right there. And it can save you money.

:{>

Updates to Three DILM Suite Applications

It’s been a busy couple weeks here at Enterprise Data & Analytics in Farmville. I just published updates to not one, not two, but three – yes, three! – applications that are part of the Data Integration Lifecycle Management Suite, or DILM Suite.

SSIS Framework Community Edition

To prepare for Thursday’s (12 Jul 2018) webinar titled Designing a Custom ADF SSIS Execution Framework, I updated the documentation for SSIS Framework Community Edition. Why did I update the documentation? Because I want folks to know they can use SSIS Framework Community Edition in the cloud! That’s right, SSIS Framework Community Edition – still free, still open source – works with the SSIS Catalog behind Azure Data Factory Integration Runtime.

SSIS Framework Browser

Also in preparation for Thursday’s webinar, I updated SSIS Framework Browser! Version 0.4.2.0 (beta) remains free and now also works with SSIS Framework Community Edition implemented in Azure Data Factory Integration Runtime. Use Framework Browser to view metadata for SSIS Applications and Application Packages stored in the Framework.

SSIS Catalog Browser

Finally, I updated the login experience for those using SSIS Catalog Browser to connect to Azure Data Factory Integration Runtime. As with all Catalog Browser updates, I’ve applied the same updates to the user experience in SSIS Catalog Compare. Which reminds me…

SSIS Catalog Compare

On Thursday 19 Jul 2018 at noon EDT, I’m delivering another free webinar titled Use SSIS Catalog Compare to Lift and Shift SSIS to ADF! In this webinar, I will be demonstrating some of the features in version 3, which is approaching Preview.

I hope you will join me for these upcoming webinars and that you’ll take the free DILM Suite products for a test drive!

The Summer-O’-ADF Series Continues!

:{>

Free Webinar – Designing a Custom ADF SSIS Execution Framework

Folks, I’m not going to even pretend to be cool and collected about this webinar about designing an ADF SSIS Execution Framework. I’ve been waiting for this one for a loooooooong time! The next (free!) webinar in the series I’m calling the Summer-O’-ADF is Designing a Custom ADF SSIS Execution Framework.

It’s at noon EDT Thursday 12 Jul 2018.

Why Am I So Excited?

I’ve been designing data engineering frameworks since the days of DTS (Data Transformation Services). Enterprise Data & Analytics (EDNA) implements SSIS Frameworks for enterprises. We even give away a free version of a framework at DILM Suite (DILM == Data Integration Lifecycle Management). It’s called SSIS Framework Community Edition and it’s not only free, it’s open source.

I wrote about SSIS Framework Community Edition and Data Integration Lifecycle Management in my latest book: Data Integration Lifecycle Management with SSIS.

I’m excited about this webinar because it combines a long-held passion – DevOps for SSIS – with a new passion – Azure Data Factory (ADF)! I cannot wait to show you how!

If you haven’t seen my webinars titled The Azure Data Factory Controller Design Pattern and ADF Controller Design Pattern with the SSIS Integration Runtime (also part of the Summer-O’-ADF webinar series), you may want to check them out first (registration required).

Register today!

:{>

SSIS Design Pattern: Controller Pattern

SSIS Framework Community Edition defaults to serial execution. The Controller Pattern can help. How? Read on…

“Great Andy, But What If I Need To Load In Parallel Using A Framework?”

Enter the SSIS Design Pattern named the Controller Pattern. A Controller is an SSIS package that executes other SSIS packages. Controllers can be serial or parallel, or combinations of both. I’ve seen (great and talented) SSIS architects design frameworks around specific Controllers – building a Controller for each collection of loaders related to a subject area.

There’s nothing wrong with those solutions.

SSIS Framework Community Edition ships with a generic metadata-driven serial controller named Parent.dtsx which is found in the Framework SSIS solution.

Specific Controller Pattern Design

A specific Controller can appear as shown here (click to enlarge):

This controller achieves parallel execution. One answer to the question, “How do I execute packages in parallel?” is to build a specific controller like this one.

Advantages

  • Just Works
  • Simple and straightforward, uses out-of-the-box Execute Package Task

Disadvantages

  • “All Executions” Catalog Report is… misleading…

“How is the All Executions Report misleading, Andy?”

I’m glad you asked. If you build and deploy a project such as SerialControllers SSIS project shown here – and then execute the SerialController.dtsx package – the All Executions reports only a single package execution: SerialController.dtsx (click to enlarge):

We see one and only one execution listed in the All Executions report. If we click on the link to view the Overview report we see each package listed individually:

The All Executions report accurately reflects an important aspect of the execution of the SerialController.dtsx SSIS package. The execution of this package – and the packages called by SerialController.dtsx – share the same Execution ID value. This is not necessarily a bad thing, but it is something of which to be aware.

Specific Controller Design in SSIS Framework Community Edition

A specific Controller built using SSIS Framework Community Edition can appear as shown here:

This controller uses Execute SQL Tasks instead of Execute Package Tasks. The T-SQL in the Execute SQL Tasks calls a stored procedure named custom.execute_catalog_package that is part of SSIS Framework Community Edition.

One answer to the question, “How do I execute packages in parallel using SSIS Framework Community Edition?” is to build a Controller.

Advantages

  • Just Works
  • The SSIS Catalog All Executions report is accurate

Disadvantages

  • Adds complexity

The All Executions Report is no longer misleading. If you build and deploy a project such as SerialControllersInFrameworkCE SSIS project shown here – and then execute the SerialControllerInFrameworkCE.dtsx package – the All Executions reports each package execution (click to enlarge):

We now see one execution listed in the All Executions report for each package. As before, All Executions accurately reflects an important aspect of the execution of the SerialControllerInFrameworkCE.dtsx SSIS package: The execution of the Controller and each Child package now have distinct Execution ID values.

When using specific Controllers with an SSIS Framework it’s common to create a single-package SSIS Application that simply starts the Controller, and then let the Controller package call the child packages. Parent.dtsx in SSIS Framework Community Edition is a generic metadata-driven Controller, but it doesn’t mind executing specific Controllers one bit!

Once Upon A Time…

Not too long ago, Kent Bradshaw and I endeavored to add automated parallel package execution to our Commercial and Enterprise SSIS Frameworks. We achieved our goal, but the solution added so much complexity to the Framework and its associated metadata that we opted to not market the implemented solution.

Why? Here are some reasons:

Starting SSIS packages in parallel is very easy to accomplish in the SSIS Catalog. The SYNCHRONIZED execution parameter is False by default. That means we could build a controller package similar to the SerialControllerInFrameworkCE.dtsx SSIS package – with precedence constraints between each Execute SQL Task, even – and the SSIS Catalog would start the packages in rapid succession. In some scenarios – such as the scenario discussed in this post (from which the current post was derived) – this then becomes a race condition engine.

A Race Condition Engine?

Yes. Because controlling only when packages start is not enough to effectively manage race conditions. To mitigate the race condition described in this post I need to make sure the dimension loaders complete before starting the fact loaders. A (much simplified) Controller for such a process could appear as shown here (click to enlarge):

I grabbed this screenshot after the dimension loader and some (vague) pre-operations process have completed in parallel but while the fact loader is still executing. Please note the combination of the Sequence Container and precedence constraint which ensure the fact loader does not start executing until the dimension loader execution is complete. The sequence container creates a “virtual step” whereby all tasks within the container must complete before the sequence container evaluates the precedence constraint. Since each task inside this container starts an SSIS package (and since the SYNCHRONIZED execution parameter is set to True by default in SSIS Framework Community Edition), nothing downstream of this container can begin executing until everything inside the container has completed executing. This is how we avoid the race condition scenario described earlier.

How does one automate this process in a framework?

It’s not simple.

The method Kent and I devised was to create and operate upon metadata used to define and configure a “virtual step.” In  SSIS Framework Community Edition the Application Packages table is where we store the Execution Order attribute. We reasoned if two Application Package entities shared the same value for Execution Order, then the associated package(s) (I’m leaving out some complexity in the design here, but imagine executing the same package in parallel with itself…) compose a virtual step.

In a virtual step packages would start together, execute, and not proceed to the next virtual step – which could be another serial package execution or another collection of packages executing in parallel in yet another virtual step – until all packages in the current virtual step had completed execution. Here, again, I gloss over even more complexity regarding fault tolerance. Kent and I added metadata to configure whether a virtual step should fail if an individual package execution failed.

This was but one of our designs (we tried three). We learned managing execution dependency in a framework is not trivial. We opted instead to share the Controller pattern.

We Learned Something Else

While visiting a client who had deployed the Controller Pattern, we noticed something. The client used a plotter to print large copies of Controller control flows and post them on the walls outside his cubicle.

When we saw this we got it.

The tasks in the Controller’s control flow were well-named. They were, one could say, self-documenting. By posting updated versions of the Controller control flows whenever the design changed, the data engineer was informing his colleagues of changes to the process.

He didn’t need to explain what had changed. It was obvious to anyone stopping by his cubicle for a few minutes. Briliant!

Conclusion

In this post I wrote about some data integration theory. I also answered a question I regularly receive about performing parallel loads using SSIS Framework Community Edition. I finally covered some of the challenges of automating a solution to manage parallel execution of SSIS packages in a Framework.

Note: much of this material was shared earlier in this post. I broke the Controller Pattern part out in this post because the other post was really too long.

:{>

Two Recordings Available: SSIS Catalog and SSIS Lifecycle

I recently delivered two hour-long webinars: Introduction to Configuring SSIS in the SSIS Catalog and Introduction to the SSIS Lifecycle. These two webinars can be considered Part 1 and Part 2 for managing enterprise data engineering with SSIS – using the SSIS Catalog and utilities I built as part of the DILM Suite (DILM == Data Integration Lifecycle Management).

The recordings are now available and you can access them by clicking the links above.

What’s Covered

In Introduction to Configuring SSIS in the SSIS Catalog I cover deploying SSIS projects to the SSIS Catalog, configuring literal overrides, and using references to environments to externalize values via reference mappings.

In Introduction to the SSIS Lifecycle I demonstrate gaps in SSIS Catalog lifecycle management and demonstrate utilities I built to close said gaps – utilities that are available at DILM Suite. Most solutions and utilities at DILM Suite are free and some are even open source. Some – like SSIS Catalog Compare – are not free.

  1. If your enterprise is using SSIS 2012+, you should consider the SSIS Catalog.
  2. If you are going to use the SSIS Catalog, you should consider using environments and references for configurations management.
  3. If you’re going to use environments and references for configurations management, you will hit a wall when it’s time to promote environments and references to the next tier in your lifecycle.

In these webinars, I show you the wall.
And then I show you the stairs I built over that wall.

Check out the webinars (registration required):

Enjoy!

:{>

You Might Need an SSIS Framework

You might need an SSIS framework. “How can I tell if I need an SSIS framework, Andy?” I’m glad you asked.

Does your enterprise:

  • Practice DevOps?
  • Execute lots of SQL Server Integration Services (SSIS) packages?
  • Execute SSIS packages several times per day?
  • Execute “SSIS in the Cloud” using the Azure Data Factory version 2 Integration Runtime (ADFv2 IR)?
  • Require configuration options not available in off-the-shelf solutions?

How an SSIS Framework Can Help

One SSIS best practice is to develop small, unit-of-work packages. You can think of them as data engineering functions. Design SSIS packages with the fewest number of Data Flow Tasks, optimally one.

There’s a saying in engineering (and life): “There’s no free lunch.” Applied to data engineering with SSIS, if you apply the best practice of creating small, unit-of-work packages, you end up with a bunch of SSIS packages. How do you manage executing all these packages? An SSIS framework.

An SSIS framework manages package execution, configuration, and logging.

SSIS Framework Community Edition

The SSIS Framework Community Edition is part of the DILM (Data Integration Lifecycle Management) Suite. SSIS Framework Community Edition groups the execution of several SSIS packages into SSIS Applications, which are a collection of SSIS packages configured to execute in a specific order.

“Can’t I just use Execute Package Tasks for that, Andy?”

Yes. And no. When deploying to the SSIS Catalog, the Execute Package Task can be used to execute any package as long as that package exists in the same project. What if you have a utility package – say a package that archives flat files after you’ve loaded them – that you want to store in a single SSIS Catalog folder and project but call from different processes (or SSIS applications)? SSIS Framework Community Edition can execute that package as part of an SSIS application.

SSIS Framework Community Edition is Catalog-Integrated

SSIS Framework Community Edition is integrated into the SSIS Catalog. When packages execute as part of an SSIS application, operational metadata and execution information is sent to the SSIS Catalog’s tables. You can view operational metrics and metadata using the catalog reports solution built into SQL Server Management Studio (SSMS)…

…or you could view SSIS execution logs and operational metadata using Catalog Reports – a free and open-source SQL Server Reporting Services (SSRS) solution from DILM Suite.

SSIS Framework Community Edition is Free. And Open-Source.

SSIS Framework Community Edition is free and open-source. In fact, the documentation walks you through building your own SSIS framework – it teaches you how you would design your own SSIS framework.

SSIS Framework Community Edition is Customize-able

Customization is one of the coolest features of open-source software. If you need some unique functionality, you have the source code and can code it up yourself!

If you don’t have time to code your own unique functionality, Enterprise Data & Analytics can help. It’s possible SSIS Framework Commercial or Enterprise Edition already has the functionality you seek. Compare editions to learn more.

SSIS Framework Community Edition is Cloud-Ready

I can hear you thinking, “Wait. It’s free. It’s open-source. And it runs in the cloud?” Yep, yep, and yep!

We Can Help

At Enterprise Data & Analytics, we’ve been building data integration frameworks for over 15 years. I wrote a book about Data Integration Lifecycle Management (DILM):

We built the DILM Suite– a collection of utilities and solutions, many of which are free (and some even open-source!):

We grok frameworks.

Learn more at Enterprise Data & Analytics.

Contact us today!

Why I Built SSIS Framework Community Edition, by Andy Leonard

few weeks ago I published a blog post titled Why I Built DILM Suite, by Andy Leonard – an excerpt from the book Data Integration Life Cycle Management with SSIS. Last week I published Why I Built SSIS Catalog Compare, by Andy Leonard, another excerpt from the book. This post is another excerpt of the same book. Enjoy!

A best practice in SSIS development is to build small, unit-of-work SSIS packages. There are several reasons for this:

  • SSIS is software development and a best practice with software development is separation of concerns. Separation of concerns is primarily achieved by decoupling. One way to decouple SSIS is to build small, single-function SSIS packages.
  • If an SSIS package contains seven Data Flow Tasks and the design of a source table changes and breaks one Data Flow Task, all tasks in the SSIS Package should be tested. Fewer Data Flow Tasks means less and quicker testing.
  • If all SSIS packages contain the minimum number of Data Flow Tasks (optimally one) and a package execution fails in the middle of the night, on-call support has a pretty good idea where to begin troubleshooting.

“There’s No Free Lunch”

While these are good and valid reasons to build SSIS solutions with several smaller SSIS packages, following this advice creates new issues. One issue: You now have a bunch of SSIS packages that require executing in some order. What’s a data integration developer to do?

8.1   SSIS Framework Community Edition

Consider the SSIS Framework Community Edition, a free and open-source solution available at DILM Suite. SSIS Framework Community Edition allows the execution of one or more SSIS packages – in a specified execution order – by executing a single stored procedure and passing it a single argument. For example, I can execute a test Framework Application with the following Transact-SQL statement:

Continuing my theme of “there’s no free lunch,” SSIS execution frameworks greatly simplify execution commands like the one listed above but they create another issue: the need to manage a lot of metadata. SSIS Framework Community Edition relies on metadata to build a Framework Application – mentioned earlier. A Framework Application is a collection of SSIS packages configured to execute in a specified order. If you build idempotent (re-executable) Transact-SQL that includes print statements (to inform you of what the T-SQL is doing) and use any kind of formatting, you’re looking at 30-40 lines of Transact-SQL per SSIS package.

That’s a lot of T-SQL.

8.1.1  Help for SSIS Catalog Projects Already Deployed

Perhaps you are reading this and thinking, “That’s awesome, but I have a bajillion SSIS packages already deployed to my SSIS Catalog. What about them?” I wrote a blog post titled Adding an SSIS Application to SSIS Framework Community Edition. I included a script at the end of that post that uses three parameters – Framework Application Name, Catalog Folder Name, and Catalog Project Name – and from those three pieces of metadata loads the metadata for a new Framework Application into SSIS Framework Community Edition’s metadata tables, as shown in Figure 8-1:


Figure 8-1. Building a Framework Application from an SSIS Catalog Project

The script reads SSIS Catalog Project metadata shown in Figure 8-2 and loads the Framework Application metadata into SSIS Framework Community Edition metadata tables in a few seconds:


Figure 8-2. The SSIS Catalog Project

The Framework Application named “Load AdventureWorks2014 Stage” which contains 71 SSIS packages can now be executed with the following Transact-SQL statement:

8.2   Viewing SSIS Catalog Reports

We can view the executions of these 71 SSIS packages using the Catalog Reporting solution built into SSMS. To view all SSIS package executions, right-click the SSMS Object Explorer Integration Services Catalogs node’s SSISDB node, hover over Reports, hover over Standard Reports, and click All Executions as shown in Figure 8-3:


Figure 8-3. Opening the Built-In SSIS Catalog Reports

The All Executions report displays and surfaces SSIS package execution logs as shown in Figure 8-4:


Figure 8-4. The SSIS Catalog All Executions Report

To summarize, we supplied three pieces of metadata to a Transact-SQL script that built a Framework Application containing 71 SSIS packages, and then we executed those 71 SSIS packages by starting a single stored procedure and passing it one parameter value.

8.3   Viewing SSIS Framework Community Edition Metadata

I mentioned earlier there’s quite a bit of metadata required for the SSIS Framework Community Edition. The script we used earlier is a nice piece of automation for entering SSIS Framework Community Edition metadata, but what happens when we want to view the Framework Applications already stored?

Framework Browser

Enter Framework Browser, another free utility from DILM Suite that you can download from DILM Suite – shown in Figure 8-5:


Figure 8-5. The Framework Application Load AdventureWorks2014 Stage

A Framework Application is a collection of SSIS packages – called Application Packages in the Framework – that execute in a specified order. Framework Browser lists Application Packages in the order they execute.

If we expand the Application Properties virtual folder, we see Framework Application metadata. Expand the Application Package node and the Application Package Properties node to surface Application Package metadata as shown in Figure 8-6:


Figure 8-6. Surfacing Application and Application Package Framework Metadata

Framework Browser is another free utility from DILM Suite.

Parallel Execution in SSIS Framework Community Edition

A number of folks are using the SSIS Framework Community Edition. I know because I continue to get interesting questions about it! Before I write more, I’d like to invite users to the DILM Suite Slack channel for Community Edition. If you cannot join, please let me know (andy.leonard@dilmsuite.com) and I will invite you!

For those who may not know, SSIS Framework Community Edition is free and open source.

I’m writing about parallel execution in SSIS Framework Community Edition because I’ve received more than one question about parallel execution in the past several days. Rather than copying and pasting responses, I figured I’d write a post about it and copy and paste the URL!

This is a long post and you may not be interested in reading all of it. It can be divided into three sections:

  1. One scenario to consider when parallelizing loads in SSIS (which starts right after this list) – laden with data integration theory
  2. Ways to parallelize loads in SSIS with and without a Framework – practical answers to the question at hand
  3. Thoughts regarding designing a Framework to automate parallel execution

Parallel Execution in SSIS

If you search online for ways to improve SSIS performance you will find myriad tips and tricks. Most of them are valid, some are not, and at least one has been invalid since mid-2006 (and it may be the one you find first). Why is this so? The internet is an echo chamber, and search engines are the medium through which the echoes travel.

One tip you will doubtless encounter is: Execute SSIS packages in parallel to improve performance. This is excellent advice.*

*Please note the asterisk. I place an asterisk there because there’s more to the story. Lots more, in fact. I can hear thinking, “What more is there to the story, Andy?” I’m glad you asked.

Dependencies and Race Conditions

Sit a spell and let Grandpa Andy tell yall a story about some data integratin’.

Suppose for a minute that you’ve read and taken my advice about writing small, unit-of-work SSIS packages. I wouldn’t blame you for taking this advice. It’s not only online, it’s written in a couple books (I know, I wrote that part of those books). One reason for building small, function-y SSIS packages is that it promotes code re-use. For example, SSIS packages that perform daily incremental loads can be re-used to perform monthly incremental loads to a database that serves as a data mart by simply changing a few parameters.

Change the parameter values and the monthly incremental load can load both quarterly and yearly data marts.

You want better performance out of the daily process, so you read and implement the parallel execution advice* you’ve found online. For our purposes let’s assume you’ve designed a star schema instead of one of those pesky data vaults (with their inherent many-to-many relationships and the ability to withstand isolated and independent loads and refreshes…).

You have dependencies. The dimensions must be loaded before the facts. You decide to manage parallelism by examining historical execution times. Since you load data in chronological order and use a brute-force change detection pattern, the daily dimension loads always complete before the fact loads reach the latest data. You decide to fire all packages at the same time and your daily execution time drops by half, monthly executions time drops to 40% of its former execution time, and everyone is ecstatic…

…until the quarterly loads.

The fact loaders begin failing. You troubleshoot and see the lookups that return dimension surrogate keys are failing to find a match. By the time you identify and check for the missing surrogate keys, you find them all loaded into the dimensions. What is going on?

What Happened

Facts are, by nature, “skinny.” Although you may load many more fact rows when loading a star schema, those rows are composed mostly of numbers and are not very wide.

When an SSIS Data Flow Task loads rows, it places the rows in buffers in RAM. You can visualize these buffers as buckets or boxes. If you choose boxes, think of the buffers as rectangles that can change shape. Some rectangles are square (remember, from high school geometry?), some are wide and not tall, some are tall and not wide. No matter what, the area of the rectangle doesn’t change although the shape will change based on the shape of each table. The shape of the buffer is coupled to the shape of the table – and is specifically driven by the length of the table row.

Aside: When I talk about this in Expert SSIS, I refer to this as “thinking like a Data Flow.”

One result is that fact loaders, when they start, most often load rows differently compared the dimension loaders. In engineering-speak, fact and dimension loaders have a different performance characteristic. I learned about characteristics studying electronics. When I became an electrician in a manufacturing plant, repairing and later designing and building electrical control systems, I learned about race conditions, which is what we have here.

Because the SSIS Data Flow Task buffers data and because dimensional data are most often wide, dimensions load fewer rows per buffer. Conversely, because fact data are narrow, facts load more rows per buffer.

There exists a (hidden) threshold in this scenario defined by when the fact loaders (executing in parallel with dimension loaders) reach new rows (which reference new dimension surrogate keys) compared with when the dimension loaders reach and load their new rows, thereby assigning said new rows their dimensional surrogate keys.

At lower row counts – such as daily or even monthly loads – the difference between dimensional and fact loader performance characteristics remains below the threshold, and everything appears peachy.

The race condition I describe here occurs because, at higher row counts, the dimensions load slower – much slower – than the facts load. The difference in performance characteristics crosses the threshold when one or more fact loaders reach new rows for which corresponding dimensions have not yet been loaded.

This is a race condition.

Not a Late-Arriving Dimension

Now, you may read this scenario and think, “No Andy, this is a late-arriving dimension or early-arriving fact.” It’s not. Both dimension and fact source records are present in the source database when this load process begins. Late-arriving facts occur when the dimension load starts before a particular row in the dimension has been added, and the fact row referencing that particular dimension record is added after the dimension load completes but before the fact load starts.

Inferred Members

But since you brought this up, is there a solution for late-arriving dimensions? Yep. Inferred members.

Let’s say you have implemented inferred member logic in your fact load and you encounter this race condition. What happens? Remember, the dimension and fact loaders are running in parallel and they work just fine for daily and monthly processing. In our scenario, the quarterly load crosses the threshold and causes a race condition.

When the fact loader reaches a dimension record that has not yet been loaded, the dimension loader is still running.

If you implement inferred member logic, you insert a row into the dimension – inserting the natural key which generates an (artificial) surrogate key – which is then returned to the fact loader so that the fact can continue loading without failure or interruption. While inferred members is a cool and helpful pattern, what happens when the dimension loader catches up? Hopefully, you have either a unique constraint on the natural key (and you haven’t disabled it to speed up the dimensional load process…another idea I and others have shared that you may have read) or you’ve incorporated inferred logic into the dimension loaders so that you check for the existence of the natural key or business key first – which updates the dimension row if the natural key already exists (and adds logic which further complicates and slows down the dimension loading process… there’s no free lunch…). Otherwise, you have a failed dimension loader execution or (worse, much worse), your race condition coupled with missing design elements combine to create duplicate dimension rows (which render the data mart useless).

Solving the Race Condition

The easy way to solve for race conditions – especially hidden race conditions – is to execute loaders serially. This goes directly against the SSIS performance advice of executing loaders in parallel.

Applied in SSIS Framework Community Edition

This is why SSIS Framework Community Edition defaults to serial execution.

“Great Andy, But What If I Need To Load In Parallel Using A Framework And I Mitigate Race Conditions?”

Enter the SSIS Design Pattern named the Controller Pattern. A Controller is an SSIS package that executes other SSIS packages. Controllers can be serial or parallel, or combinations of both. I’ve seen (great and talented) SSIS architects design frameworks around specific Controllers – building a Controller for each collection of loaders related to a subject area. SSIS Framework Community Edition ships with a generic metadata-driven serial controller named Parent.dtsx, found in the Framework SSIS solution.

Specific Controller Design

A specific Controller can appear as shown here (click to enlarge):

This controller achieves parallel execution. One answer to the question, “How do I execute packages in parallel?” is to build a specific controller like this one.

Advantages

  • Works
  • Simple and straightforward, uses out-of-the-box Execute Package Task

Disadvantages

  • “All Executions” Catalog Report is… misleading

How is the All Executions Report misleading? If you build and deploy a project such as SerialControllers SSIS project shown here – and then execute the SerialController.dtsx package – the All Executions reports only a single package execution: SerialController.dtsx (click to enlarge):

We see one and only one execution listed in the All Executions report, but if we click on the link to view the Overview report, we see each package listed individually:

All Executions accurately reflects an important aspect of the execution of the SerialController.dtsx SSIS package: The execution of this package – and the packages called by SerialController.dtsx – share the same Execution ID value. This is not necessarily a bad thing, but it is something of which to be aware.

Specific Controller Design in SSIS Framework Community Edition

A specific Controller built using SSIS Framework Community Edition can appear as shown here:

This controller uses Execute SQL Tasks instead of Execute Package Tasks. The T-SQL in the Execute SQL Tasks call a stored procedure named custom.execute_catalog_package that is part of SSIS Framework Community Edition.

One answer to the question, “How do I execute packages in parallel using SSIS Framework Community Edition?” is to build a controller such as this.

Advantages

  • Works
  • All Executions report is accurate

Disadvantages

  • Adds complexity

The All Executions Report is no longer misleading. If you build and deploy a project such as SerialControllersInFrameworkCE SSIS project shown here – and then execute the SerialControllerInFrameworkCE.dtsx package – the All Executions reports each package execution (click to enlarge):

We now see one execution listed in the All Executions report for each package. As before, All Executions accurately reflects an important aspect of the execution of the SerialControllerInFrameworkCE.dtsx SSIS package: The execution of the Controller and each Child package now have distinct Execution ID values.

When using specific Controllers with an SSIS Framework, it’s common to create a single-package SSIS Application that simply starts the Controller, and let the Controller package call the child packages. Parent.dtsx in SSIS Framework Community Edition is a generic metadata-driven Controller, but it doesn’t mind executing specific Controllers one bit!

 

Once Upon A Time…

Not too long ago, Kent Bradshaw and I endeavored to add automated parallel package execution to our Commercial and Enterprise SSIS Frameworks. We achieved our goal, but the solution added so much complexity to the Framework and its associated metadata that we opted to not market the implemented solution.

Why? Here are some reasons:

Starting SSIS packages in parallel is very easy to accomplish in the SSIS Catalog. The SYNCHRONIZED execution parameter is False by default, and that means we could build a controller package similar to the SerialControllerInFrameworkCE.dtsx SSIS package – with precedence constraints between each Execute SQL Task, even – and the SSIS Catalog would start the packages in rapid succession. In some scenarios – such as the scenario discussed in the first section of this post – this then becomes a race condition engine.

Why?

Because controlling only when packages start is not enough to effectively manage race conditions. To mitigate the race condition described earlier I need to make sure the dimension loaders complete before starting the fact loaders. A (much simplified) Controller for such a process could appear as shown here (click to enlarge):

I grabbed this screenshot after the dimension loader and some (vague) pre-operations process have completed in parallel, but while the fact loader is still executing. Please note the combination of the Sequence Container and precedence constraint which ensure the fact loader does not start executing until the dimension loader execution is complete. The sequence container creates a “virtual step” whereby all tasks within the container must complete before the sequence container evaluates the precedence constraint. Since each task inside this container starts an SSIS package (and since the SYNCHRONIZED execution parameter is set to True by default in SSIS Framework Community Edition), nothing downstream of this container can begin executing until everything inside the container has completed executing. This is how we avoid the race condition scenario described earlier.

How does one automate this process in a framework?

It’s not simple.

The method Kent and I devised was to create and operate upon metadata used to define and configure a “virtual step.” In  SSIS Framework Community Edition the Application Packages table is where we store the Execution Order attribute. We reasoned if two Application Package entities shared the same value for Execution Order, the associated package(s) (I’m leaving out some complexity in the design here, but imagine executing the same package in parallel with itself…) compose a virtual step.

In a virtual step the packages would start together, execute, and not proceed to the next virtual step – which could be another serial package execution or another collection of packages executing in parallel in yet another virtual step – until all packages in the current virtual step had completed execution. Here, again, I gloss over even more complexity regarding fault tolerance – we added metadata to configure whether a virtual step should fail if an individual package execution failed.

This was but one of our designs (we tried three). We learned managing execution dependency in a framework is not trivial. We opted instead to share the Controller pattern.

Conclusion

In this post I wrote about some data integration theory. I also answered a question I regularly receive about performing parallel loads using SSIS Framework Community Edition. I finally covered some of the challenges of automating a solution to manage parallel execution of SSIS packages in a Framework.