I started answering a question on SQL Community Slack’s #ssis channel and I realized this would be better served as a blog post. The question was about three SSIS Data Flow properties: DefaultBufferSize, Engine Thread and DefaultBufferMaxRows.
I rarely change the EngineThreads property.
DefaultBufferSize and DefaultBufferMaxRows are two ways of managing the size limits of a Data Flow buffer. The two Data Flow Task properties can – and should – be treated as a single property. DefaultBufferSize is the number of bytes per buffer. DefaultBufferMaxRows is the number of rows per buffer. The defaults are 10,485,760 (10M) and 10,000, respectively.
For source rows that are 1K in width it’s a wash because 10,000 rows * 1K == 10M.
Let’s look at some implications of the math(s):
If the source rows are wider than 1K, you will hit the 10M memory limit set by DefaultBufferSize before you encounter the DefaultBufferMaxRows limit.
If the rows are less than 1K wide, you will hit the row count limit set by DefaultBufferMaxRows before you encounter the DefaultBufferSize limit.
Together, these properties make up what I refer to as the Data Flow Task “Buffer Size” property.
When tuning a Data Flow Task I adjust both properties together by a similar factor because, in my little mind (at least), they are one property.
It’s possible to mitigate for data types and I cover some ways to tune for the integration of BLOb data types in my series on Loading BLObs.
Data Size and Shape
As a function of the relationship between hard drive disk speeds (compounded by the fact that current systems were designed to work with spinning platters and read/write heads), RAM, and processor cache, the graph for moving “chunks” of data in any data movement operation (whether data or files) results in a curve*. In general, breaking an integration operation into ten chunks will almost always* perform the total load / move operation faster than loading / moving the same amount in a single operation. There is a point where the number of chunks results in a cursor-y operation. Here the performance curve trends downward (or upward, if you draw the graph like I do in my classes). The top (or bottom) of the curve represents maximum throughput for that size and shape of data containing the source data types.
Something to consider (I will not cover more in this post): The max-throughput can, and often almost always moves over time.
To demonstrate, I’ve created a sample SSIS project in SSDT v15.1.61801.210 for SSIS v14.0.3002.92 (Visual Studio 2017 for SSIS 2017). I select from the AdventureWorks2014 Person.Person table using an OLE DB Source adapter (click to enlarge):
I map to an almost identical table – via an OLE DB Destination adapter – in a test database (named TestDB) on the same instance of SQL Server:
Executing with the defaults (10,000 rows for DefaultBufferMaxRows and 10M for DefaultBufferSize), the Data Flow Task executes in 10.594 seconds:
Dropping the Buffer Size properties an order of magnitude…
… results in an execution time of 6.328 seconds:
For those curious about the effects of queuing, returning the properties to 10,000 and 10,485,760 results in an execution time of 9.61 seconds on my demo VM.
Your mileage may vary.
In this post, I’ve briefly discussed and demonstrated one impact of managing SSIS Data Flow Task performance by adjusting the Buffer Size properties – DefaultBufferMaxRows and DefaultBufferSize.
Intentionally equivocal language in use… caveats abound.
For those who may not know, SSIS Framework Community Edition is free and open source.
I’m writing about parallel execution in SSIS Framework Community Edition because I’ve received more than one question about parallel execution in the past several days. Rather than copying and pasting responses, I figured I’d write a post about it and copy and paste the URL!
This is a long post and you may not be interested in reading all of it. It can be divided into three sections:
One scenario to consider when parallelizing loads in SSIS (which starts right after this list) – laden with data integration theory
If you search online for ways to improve SSIS performance you will find myriad tips and tricks. Most of them are valid, some are not, and at least one has been invalid since mid-2006 (and it may be the one you find first). Why is this so? The internet is an echo chamber, and search engines are the medium through which the echoes travel.
One tip you will doubtless encounter is: Execute SSIS packages in parallel to improve performance. This is excellent advice.*
*Please note the asterisk. I place an asterisk there because there’s more to the story. Lots more, in fact. I can hear thinking, “What more is there to the story, Andy?” I’m glad you asked.
Dependencies and Race Conditions
Sit a spell and let Grandpa Andy tell yall a story about some data integratin’.
Suppose for a minute that you’ve read and taken my advice about writing small, unit-of-work SSIS packages. I wouldn’t blame you for taking this advice. It’s not only online, it’s written in a couple books (I know, I wrote that part of those books). One reason for building small, function-y SSIS packages is that it promotes code re-use. For example, SSIS packages that perform daily incremental loads can be re-used to perform monthly incremental loads to a database that serves as a data mart by simply changing a few parameters.
Change the parameter values and the monthly incremental load can load both quarterly and yearly data marts.
You want better performance out of the daily process, so you read and implement the parallel execution advice* you’ve found online. For our purposes let’s assume you’ve designed a star schema instead of one of those pesky data vaults (with their inherent many-to-many relationships and the ability to withstand isolated and independent loads and refreshes…).
You have dependencies. The dimensions must be loaded before the facts. You decide to manage parallelism by examining historical execution times. Since you load data in chronological order and use a brute-force change detection pattern, the daily dimension loads always complete before the fact loads reach the latest data. You decide to fire all packages at the same time and your daily execution time drops by half, monthly executions time drops to 40% of its former execution time, and everyone is ecstatic…
…until the quarterly loads.
The fact loaders begin failing. You troubleshoot and see the lookups that return dimension surrogate keys are failing to find a match. By the time you identify and check for the missing surrogate keys, you find them all loaded into the dimensions. What is going on?
Facts are, by nature, “skinny.” Although you may load many more fact rows when loading a star schema, those rows are composed mostly of numbers and are not very wide.
When an SSIS Data Flow Task loads rows, it places the rows in buffers in RAM. You can visualize these buffers as buckets or boxes. If you choose boxes, think of the buffers as rectangles that can change shape. Some rectangles are square (remember, from high school geometry?), some are wide and not tall, some are tall and not wide. No matter what, the area of the rectangle doesn’t change although the shape will change based on the shape of each table. The shape of the buffer is coupled to the shape of the table – and is specifically driven by the length of the table row.
Aside: When I talk about this in Expert SSIS, I refer to this as “thinking like a Data Flow.”
One result is that fact loaders, when they start, most often load rows differently compared the dimension loaders. In engineering-speak, fact and dimension loaders have a different performance characteristic. I learned about characteristics studying electronics. When I became an electrician in a manufacturing plant, repairing and later designing and building electrical control systems, I learned about race conditions, which is what we have here.
Because the SSIS Data Flow Task buffers data and because dimensional data are most often wide, dimensions load fewer rows per buffer. Conversely, because fact data are narrow, facts load more rows per buffer.
There exists a (hidden) threshold in this scenario defined by when the fact loaders (executing in parallel with dimension loaders) reach new rows (which reference new dimension surrogate keys) compared with when the dimension loaders reach and load their new rows, thereby assigning said new rows their dimensional surrogate keys.
At lower row counts – such as daily or even monthly loads – the difference between dimensional and fact loader performance characteristics remains below the threshold, and everything appears peachy.
The race condition I describe here occurs because, at higher row counts, the dimensions load slower – much slower – than the facts load. The difference in performance characteristics crosses the threshold when one or more fact loaders reach new rows for which corresponding dimensions have not yet been loaded.
This is a race condition.
Not a Late-Arriving Dimension
Now, you may read this scenario and think, “No Andy, this is a late-arriving dimension or early-arriving fact.” It’s not. Both dimension and fact source records are present in the source database when this load process begins. Late-arriving facts occur when the dimension load starts before a particular row in the dimension has been added, and the fact row referencing that particular dimension record is added after the dimension load completes but before the fact load starts.
But since you brought this up, is there a solution for late-arriving dimensions? Yep. Inferred members.
Let’s say you have implemented inferred member logic in your fact load and you encounter this race condition. What happens? Remember, the dimension and fact loaders are running in parallel and they work just fine for daily and monthly processing. In our scenario, the quarterly load crosses the threshold and causes a race condition.
When the fact loader reaches a dimension record that has not yet been loaded, the dimension loader is still running.
If you implement inferred member logic, you insert a row into the dimension – inserting the natural key which generates an (artificial) surrogate key – which is then returned to the fact loader so that the fact can continue loading without failure or interruption. While inferred members is a cool and helpful pattern, what happens when the dimension loader catches up? Hopefully, you have either a unique constraint on the natural key (and you haven’t disabled it to speed up the dimensional load process…another idea I and others have shared that you may have read) or you’ve incorporated inferred logic into the dimension loaders so that you check for the existence of the natural key or business key first – which updates the dimension row if the natural key already exists (and adds logic which further complicates and slows down the dimension loading process… there’s no free lunch…). Otherwise, you have a failed dimension loader execution or (worse, much worse), your race condition coupled with missing design elements combine to create duplicate dimension rows (which render the data mart useless).
Solving the Race Condition
The easy way to solve for race conditions – especially hidden race conditions – is to execute loaders serially. This goes directly against the SSIS performance advice of executing loaders in parallel.
“Great Andy, But What If I Need To Load In Parallel Using A Framework And I Mitigate Race Conditions?”
Enter the SSIS Design Pattern named the Controller Pattern. A Controller is an SSIS package that executes other SSIS packages. Controllers can be serial or parallel, or combinations of both. I’ve seen (great and talented) SSIS architects design frameworks around specific Controllers – building a Controller for each collection of loaders related to a subject area. SSIS Framework Community Edition ships with a generic metadata-driven serial controller named Parent.dtsx, found in the Framework SSIS solution.
Specific Controller Design
A specific Controller can appear as shown here (click to enlarge):
This controller achieves parallel execution. One answer to the question, “How do I execute packages in parallel?” is to build a specific controller like this one.
Simple and straightforward, uses out-of-the-box Execute Package Task
“All Executions” Catalog Report is… misleading
How is the All Executions Report misleading? If you build and deploy a project such as SerialControllers SSIS project shown here – and then execute the SerialController.dtsx package – the All Executions reports only a single package execution: SerialController.dtsx (click to enlarge):
We see one and only one execution listed in the All Executions report, but if we click on the link to view the Overview report, we see each package listed individually:
All Executions accurately reflects an important aspect of the execution of the SerialController.dtsx SSIS package: The execution of this package – and the packages called by SerialController.dtsx – share the same Execution ID value. This is not necessarily a bad thing, but it is something of which to be aware.
Specific Controller Design in SSIS Framework Community Edition
A specific Controller built using SSIS Framework Community Edition can appear as shown here:
This controller uses Execute SQL Tasks instead of Execute Package Tasks. The T-SQL in the Execute SQL Tasks call a stored procedure named custom.execute_catalog_package that is part of SSIS Framework Community Edition.
One answer to the question, “How do I execute packages in parallel using SSIS Framework Community Edition?” is to build a controller such as this.
All Executions report is accurate
The All Executions Report is no longer misleading. If you build and deploy a project such as SerialControllersInFrameworkCE SSIS project shown here – and then execute the SerialControllerInFrameworkCE.dtsx package – the All Executions reports each package execution (click to enlarge):
We now see one execution listed in the All Executions report for each package. As before, All Executions accurately reflects an important aspect of the execution of the SerialControllerInFrameworkCE.dtsx SSIS package: The execution of the Controller and each Child package now have distinct Execution ID values.
When using specific Controllers with an SSIS Framework, it’s common to create a single-package SSIS Application that simply starts the Controller, and let the Controller package call the child packages. Parent.dtsx in SSIS Framework Community Edition is a generic metadata-driven Controller, but it doesn’t mind executing specific Controllers one bit!
Once Upon A Time…
Not too long ago, Kent Bradshaw and I endeavored to add automated parallel package execution to our Commercial and Enterprise SSIS Frameworks. We achieved our goal, but the solution added so much complexity to the Framework and its associated metadata that we opted to not market the implemented solution.
Why? Here are some reasons:
Starting SSIS packages in parallel is very easy to accomplish in the SSIS Catalog. The SYNCHRONIZED execution parameter is False by default, and that means we could build a controller package similar to the SerialControllerInFrameworkCE.dtsx SSIS package – with precedence constraints between each Execute SQL Task, even – and the SSIS Catalog would start the packages in rapid succession. In some scenarios – such as the scenario discussed in the first section of this post – this then becomes a race condition engine.
Because controlling only when packages start is not enough to effectively manage race conditions. To mitigate the race condition described earlier I need to make sure the dimension loaders complete before starting the fact loaders. A (much simplified) Controller for such a process could appear as shown here (click to enlarge):
I grabbed this screenshot after the dimension loader and some (vague) pre-operations process have completed in parallel, but while the fact loader is still executing. Please note the combination of the Sequence Container and precedence constraint which ensure the fact loader doesnot start executing until the dimension loader execution is complete. The sequence container creates a “virtual step” whereby all tasks within the container must complete before the sequence container evaluates the precedence constraint. Since each task inside this container starts an SSIS package (and since the SYNCHRONIZED execution parameter is set to True by default in SSIS Framework Community Edition), nothing downstream of this container can begin executing until everything inside the container has completed executing. This is how we avoid the race condition scenario described earlier.
How does one automate this process in a framework?
It’s not simple.
The method Kent and I devised was to create and operate upon metadata used to define and configure a “virtual step.” In SSIS Framework Community Edition the Application Packages table is where we store the Execution Order attribute. We reasoned if two Application Package entities shared the same value for Execution Order, the associated package(s) (I’m leaving out some complexity in the design here, but imagine executing the same package in parallel with itself…) compose a virtual step.
In a virtual step the packages would start together, execute, and not proceed to the next virtual step – which could be another serial package execution or another collection of packages executing in parallel in yet another virtual step – until all packages in the current virtual step had completed execution. Here, again, I gloss over even more complexity regarding fault tolerance – we added metadata to configure whether a virtual step should fail if an individual package execution failed.
This was but one of our designs (we tried three). We learned managing execution dependency in a framework is not trivial. We opted instead to share the Controller pattern.
In this post I wrote about some data integration theory. I also answered a question I regularly receive about performing parallel loads using SSIS Framework Community Edition. I finally covered some of the challenges of automating a solution to manage parallel execution of SSIS packages in a Framework.
Ever wonder why SSIS runs so slow? Watch SSIS author Andy Leonard as he runs test loads using sample and real-world data and shows you how to tune SQL Server 2016 Integration Services (SSIS 2016) packages.
We’ll start by experimenting with SSIS design patterns to improve performance loading AdventureWorks data. We will implement different change detection patterns and compare execution performance for each. Then, we’ll explain a Data Flow Task’s bottleneck when loading binary large objects – or Blobs.
Finally, we’ll demonstrate a design pattern that uses a Script Component in a Data Flow to boost load performance to MySql, whether on-premises or in the cloud.
Description What is Intelligent Data Integration? SSIS packages developed using tried and true design patterns, built to participate in a DevOps enterprise practicing DILM, produced using Biml and executed using an SSIS Framework.
Attend a day of training focused on intelligent data integration delivered by an experienced SSIS consultant who has also led an enterprise team of several ETL developers during multiple projects that spanned 2.5 years. And delivered. Attendees will learn:
– a holistic approach to data integration design. – a methodology for enterprise data integration that spans development through operational support. – how automation changes everything. Including data integration with SSIS.
Topics include: 1. SSIS Design Patterns 2. Executing SSIS in the Enterprise 3. Custom SSIS Execution Frameworks 4. DevOps and SSIS 5. Biml, Biml Frameworks, and Tools
What is Intelligent Data Integration? SSIS packages developed using tried and true design patterns, built to participate in a DevOps enterprise practicing DILM, produced using Biml and executed using an SSIS Framework.
Attend a day of training focused on intelligent data integration delivered by an experienced SSIS consultant who has also led an enterprise team of several ETL developers during multiple projects that spanned 2.5 years. And delivered.
Attendees will learn: – a holistic approach to data integration design. – a methodology for enterprise data integration that spans development through operational support. – how automation changes everything. Including data integration with SSIS.
Topics include: 1. SSIS Design Patterns Data Flow Performance ETL Instrumentation 2. Executing SSIS in the Enterprise The SSIS Catalog – the good, the bad, and the ugly. 3. Custom SSIS Execution Frameworks 4. DevOps and SSIS A (Condensed) Sprint in the Life of a Data Integration Solution Version Control and SSIS 5. Business Intelligence Markup Language A Brief Introduction to Biml in the free utilities, BimlExpress and BimlOnline 6. SSIS Design Patterns + Biml Putting the DILM (Data Integration Lifecycle Management) components together. 7. SSIS Design Patterns + Biml + Custom SSIS Execution Frameworks Executing the new combinations. 8. SSIS Design Patterns + Biml + Custom SSIS Execution Frameworks => DevOps Enterprise-class data integration with SSIS.
The target audience for this course is data integration developers and architects who want to learn more about SSIS performance, DevOps, execution, and automation.
Do you need to use SSIS to load some data from a large source? This post covers some thoughts on a couple/three patterns you may apply to help. I can hear you thinking, “Which pattern should I use, Andy?” That’s an excellent question! The answer is, “It depends.” You should test each pattern and see which performs best for you.
Pattern 0: Just Do It
The first pattern is easy. Select the rows you need in an SSIS Data Flow Task OLE DB Source adapter. You can build a query in SSMS (like I did here):
Is there anything wrong with this query? Well, maybe. Functions – like DatePart() – in the WHERE clause can slow things down a bit. It’s a valid pattern, though, so you should use it in your SSIS Data Flow Task’s OLE DB Source adapter to test performance.
Pattern 1: Stage Select Rows Using a Data Flow
Another option is to stage the subset of rows you’re after – or just a list of uniquely-identifying columns from those rows – also known as a business key or natural key. You could create a Data Flow Task that selects the rows you wish to load, similar to that shown here:
Those rows could be loaded to a staging table and later read from that table.
Reading all the columns – as shown above – is really just adding a step that you don’t need if your goal is to load data straight from a source to a destination. If, however, you need to join to other tables or lookup reference data, loading more than just the business keys into a staging table may be the way to go.
Pattern 2: Load Staged Rows Using T-SQL
A third option is to use T-SQL to load the target table after you’re loaded a stage table via Data Flow Task. Configure the T-SQL query similar to that shown here:
Your Control Flow now appears as shown:
Pattern 3: Stage and Load Rows Using T-SQL and Execute SQL Tasks
Another option is to stage the subset of rows you’re after using an Execute SQL Task and then use a second Execute SQL Task to load those rows into the target table. Your stage query would appear as shown here:
The Control flow would now contain an Execute SQL Task for Staging and another for Loading:
Pattern 4: Skipping the Stage Step Altogether
If you’re just doing a simple single-table load (like in this demo), and if the source and destination are on the same instance, you can often achieve great performance using T-SQL to load from the source directly to the destination:
Your Control Flow becomes remarkably simple, as shown here:
Staging a subset of the data you need, or even unique keys for a subset of the desired data, is one way to improve load performance.
Actually not. But as a fan of Seinfeld, it’s cool to work in that quote.
Seriously, though, No Change Detection is a valid change detection use case. In SSIS it looks like this:
Whack ‘n’ Load
I (lovingly) refer to this SSIS Design Pattern as the Whack ‘n’ Load Pattern. Is it an Incremental Load design pattern? Maybe?? It depends on how you want to classify incremental loads. I will argue for “yes, it’s an incremental load pattern because it picks up new and changed rows since the last execution.” But even as I was just typing that, I could feel the definition is a stretch.
The SSIS package shown above loads data from a flat file into a SQL Server database table. The Execute SQL Task named “SQL Truncate Target” executes a statement to truncate the destination table similar to:
Truncate Table dbo.SomeTable
The Data Flow Task named “DFT Load Target” contains the components shown in the “breakout,” a Flat File Source Adapter named “FFSrc Flat File” and an OLE DB Destination Adapter named “OLEDBDest Target.” The OLE DB Destination Adapter is aimed at the same table truncated in the Execute SQL task named “SQL Truncate Target.”
When executed, the SSIS package appears as shown here:
The flat file contains only five rows at this time. Tomorrow it make contain 5, 6, or 6,000 rows. Sometimes, data scales.
The Whack ‘n’ Load pattern always clears the destination table and then loads the contents of the source into the destination table.
There are few moving parts so there’s less opportunity for the load process to fail. When a Whack ‘n’ Load package fails, it’s usually due to unanticipated data conditions such as a NULL where no NULLs are allowed, string truncation (trying to put ten pounds of string into a five-pound column, for example), or some catastrophic failure in either the SSIS execution engine, SQL Server, the Operating System (OS), or hardware.
It’s possible for change detection to “miss” a changed record in the source. There are various reasons this occurs and sundry ways in which it happens when it occurs. One reason changes are not detected is hash collisions. A hash collision occurs when different values (or combinations of values) produce the same hash value. All hash algorithms have a collision probability and Jeff Preshing does a great job explaining both the theory and how to calculate the hash collision probability in his post titled Hash Collision Probabilities. The Whack ‘n’ Load design pattern does not miss changes.
Whack ‘n’ Load is slow – especially for larger data sets.
When to Use the Whack ‘n’ Load Design Pattern
I often use the Whack ‘n’ Load design pattern when loading small data sets like reference lookup data.
I often build Whack ‘n’ Load SSIS packages to prototype staging databases or to quickly stand up a reporting instance. These days I use Business Intelligence Markup Language (Biml) to accomplish the initial portion of a data integration project because I can generate a metric tonne of SSIS packages in a few minutes. Later I identify poorly-performing packages and modify them to use another design pattern.
When Not to Use the Whack ‘n’ Load Design Pattern
I don’t use the Whack ‘n’ Load design pattern when loading larger data sets because of performance concerns.
I don’t use the Whack ‘n’ Load design pattern when loading data that requires more complex storage schemes, such as the change history requirement for loading a Type II Star Schema Dimension.
Use the Whack ‘n’ Load design pattern to build a quick and dirty staging solution – especially if you’re using automation like Biml.
Data integration is the foundation of data science, business intelligence, and enterprise data warehousing. This instructor-led training class is specifically designed for SQL Server Integration Services (SSIS) professionals responsible for developing, deploying, and managing data integration at enterprise-scale.
You will learn to improve data integration with SSIS by:
Building better data integration.
Making data integration execution more manageable.
Building data integration faster.
SSIS Design Patterns for Performance – how to build SSIS packages that execute and load data faster by tuning SSIS data flows and implementing performance patterns.
SSIS Deployment, Configuration, Execution, and Monitoring – the “Ops” part of DevOps with SSIS using out-of-the-box tools and the latest utilities.
Automation – how to use Business Intelligence Markup Language (Biml) to improve SSIS quality and reduce development time.
I was honored to write with the inventor of Biml, Scott Currie, and an awesome team of BimlHeroes: Jacob Alley, Martin Andersson, Peter Avenant, Bill Fellows, Simon Peck, Reeves Smith, Raymond Sondak, Benjamin Weissman, and Cathrine Wilhelmsen.
Part I: Learning Biml
Introduction to the Biml Language
Basic Staging Operations
Reusing Code, Helper Classes and Methods
Part II: Biml Frameworks
A Custom Biml Framework
Using Biml as an SSIS Design Patterns Engine
Integration with a Custom SSIS Execution Framework
Advanced Biml Frameworks and BimlFlex
Part III: Biml Topics
Biml and Analysis Services
Biml for T-SQL
Documenting Your Biml Solution
Part IV: Appendices
Parallel Load Patterns in Biml
Amazon projects the book’s release in early December but we anticipate the book will be available sooner.
The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.