SSIS Framework Community Edition defaults to serial execution. The Controller Pattern can help. How? Read on…
“Great Andy, But What If I Need To Load In Parallel Using A Framework?”
Enter the SSIS Design Pattern named the Controller Pattern. A Controller is an SSIS package that executes other SSIS packages. Controllers can be serial or parallel, or combinations of both. I’ve seen (great and talented) SSIS architects design frameworks around specific Controllers – building a Controller for each collection of loaders related to a subject area.
There’s nothing wrong with those solutions.
SSIS Framework Community Edition ships with a generic metadata-driven serial controller named Parent.dtsx which is found in the Framework SSIS solution.
Specific Controller Pattern Design
A specific Controller can appear as shown here (click to enlarge):
This controller achieves parallel execution. One answer to the question, “How do I execute packages in parallel?” is to build a specific controller like this one.
- Just Works
- Simple and straightforward, uses out-of-the-box Execute Package Task
- “All Executions” Catalog Report is… misleading…
“How is the All Executions Report misleading, Andy?”
I’m glad you asked. If you build and deploy a project such as SerialControllers SSIS project shown here – and then execute the SerialController.dtsx package – the All Executions reports only a single package execution: SerialController.dtsx (click to enlarge):
We see one and only one execution listed in the All Executions report. If we click on the link to view the Overview report we see each package listed individually:
The All Executions report accurately reflects an important aspect of the execution of the SerialController.dtsx SSIS package. The execution of this package – and the packages called by SerialController.dtsx – share the same Execution ID value. This is not necessarily a bad thing, but it is something of which to be aware.
Specific Controller Design in SSIS Framework Community Edition
A specific Controller built using SSIS Framework Community Edition can appear as shown here:
This controller uses Execute SQL Tasks instead of Execute Package Tasks. The T-SQL in the Execute SQL Tasks calls a stored procedure named custom.execute_catalog_package that is part of SSIS Framework Community Edition.
One answer to the question, “How do I execute packages in parallel using SSIS Framework Community Edition?” is to build a Controller.
- Just Works
- The SSIS Catalog All Executions report is accurate
- Adds complexity
The All Executions Report is no longer misleading. If you build and deploy a project such as SerialControllersInFrameworkCE SSIS project shown here – and then execute the SerialControllerInFrameworkCE.dtsx package – the All Executions reports each package execution (click to enlarge):
We now see one execution listed in the All Executions report for each package. As before, All Executions accurately reflects an important aspect of the execution of the SerialControllerInFrameworkCE.dtsx SSIS package: The execution of the Controller and each Child package now have distinct Execution ID values.
When using specific Controllers with an SSIS Framework it’s common to create a single-package SSIS Application that simply starts the Controller, and then let the Controller package call the child packages. Parent.dtsx in SSIS Framework Community Edition is a generic metadata-driven Controller, but it doesn’t mind executing specific Controllers one bit!
Once Upon A Time…
Not too long ago, Kent Bradshaw and I endeavored to add automated parallel package execution to our Commercial and Enterprise SSIS Frameworks. We achieved our goal, but the solution added so much complexity to the Framework and its associated metadata that we opted to not market the implemented solution.
Why? Here are some reasons:
Starting SSIS packages in parallel is very easy to accomplish in the SSIS Catalog. The SYNCHRONIZED execution parameter is False by default. That means we could build a controller package similar to the SerialControllerInFrameworkCE.dtsx SSIS package – with precedence constraints between each Execute SQL Task, even – and the SSIS Catalog would start the packages in rapid succession. In some scenarios – such as the scenario discussed in this post (from which the current post was derived) – this then becomes a race condition engine.
A Race Condition Engine?
Yes. Because controlling only when packages start is not enough to effectively manage race conditions. To mitigate the race condition described in this post I need to make sure the dimension loaders complete before starting the fact loaders. A (much simplified) Controller for such a process could appear as shown here (click to enlarge):
I grabbed this screenshot after the dimension loader and some (vague) pre-operations process have completed in parallel but while the fact loader is still executing. Please note the combination of the Sequence Container and precedence constraint which ensure the fact loader does not start executing until the dimension loader execution is complete. The sequence container creates a “virtual step” whereby all tasks within the container must complete before the sequence container evaluates the precedence constraint. Since each task inside this container starts an SSIS package (and since the SYNCHRONIZED execution parameter is set to True by default in SSIS Framework Community Edition), nothing downstream of this container can begin executing until everything inside the container has completed executing. This is how we avoid the race condition scenario described earlier.
How does one automate this process in a framework?
It’s not simple.
The method Kent and I devised was to create and operate upon metadata used to define and configure a “virtual step.” In SSIS Framework Community Edition the Application Packages table is where we store the Execution Order attribute. We reasoned if two Application Package entities shared the same value for Execution Order, then the associated package(s) (I’m leaving out some complexity in the design here, but imagine executing the same package in parallel with itself…) compose a virtual step.
In a virtual step packages would start together, execute, and not proceed to the next virtual step – which could be another serial package execution or another collection of packages executing in parallel in yet another virtual step – until all packages in the current virtual step had completed execution. Here, again, I gloss over even more complexity regarding fault tolerance. Kent and I added metadata to configure whether a virtual step should fail if an individual package execution failed.
This was but one of our designs (we tried three). We learned managing execution dependency in a framework is not trivial. We opted instead to share the Controller pattern.
We Learned Something Else
While visiting a client who had deployed the Controller Pattern, we noticed something. The client used a plotter to print large copies of Controller control flows and post them on the walls outside his cubicle.
When we saw this we got it.
The tasks in the Controller’s control flow were well-named. They were, one could say, self-documenting. By posting updated versions of the Controller control flows whenever the design changed, the data engineer was informing his colleagues of changes to the process.
He didn’t need to explain what had changed. It was obvious to anyone stopping by his cubicle for a few minutes. Briliant!
In this post I wrote about some data integration theory. I also answered a question I regularly receive about performing parallel loads using SSIS Framework Community Edition. I finally covered some of the challenges of automating a solution to manage parallel execution of SSIS packages in a Framework.
Note: much of this material was shared earlier in this post. I broke the Controller Pattern part out in this post because the other post was really too long.