I am honored to join my friends at the Triad SQL PASS BI Group in Greensboro North Carolina 30 Oct 2018, where I will present Faster SSIS!
Ever wonder why SSIS runs so slowly? Watch SSIS author Andy Leonard as he runs test loads using sample and real-world data, and shows you how to tune SQL Server 2016 Integration Services (SSIS 2016) packages. We’ll start by experimenting with SSIS design patterns to improve performance loading AdventureWorks data. We will implement different change detection patterns and compare execution performance for each. Then, we’ll explain a Data Flow Task’s bottleneck when loading binary large objects – or Blobs. Finally, we’ll demonstrate a design pattern that uses a Script Component in a Data Flow to boost load performance to MySql, whether on-premises or in the cloud. Prerequisites: Familiarity with SQL Server Integration Services (SSIS).
Benjamin Franklin is credited with saying, “Time is money.”
The 2011 movie, In Time, depicts an entire economy based on time. (A friend pointed out that time is not fungible, but I digress…)
So, what could your enterprise do with more time?
When you think about it, this question lies at the heart of many data- and software-related enterprise activities, such as:
Software development methodologies
Project management methodologies
I believe this is an important question – perhaps the important question of the cloud-era. I believe time is replacing money in a way we’ve never before experienced. I believe the cloud is driving this new economy.
Technology Economy Conversations
Not long ago I wrote of my experience when I left an instance of Azure Data Factory SSIS Integration Runtime running overnight and it cost me about $30USD. The post was titled The Cloud Costs Money and reflects some of the thinking of this post.
Not long after that, I was honored to chat with Stuart Ainsworth (@codegumbo) at Atlanta Azure DataFest:
Chatting with Stu
In this Data Driven DataPoint, captured while attending the inaugural Atlanta Azure Data Fest, I was honored to speak with Rie Irish, Julie Smith, Tim Radney, Geoff Hiten, and Stuart Ainsworth.
The event itself was an astounding event on two levels:
The velocity of technological innovation is increasing (“well duh, Captain Obvious”) so, if you haven’t attended such an event recently – and by “recently” I mean the past eighteen months – you should attend to see how folks are combining cloud, Internet-of-Things (IoT), analytics, machine learning, artificial intelligence, on-premises, and hybrid technologies to deliver – frankly – amazing solutions.
Community. Networking with people will change your career. It will change your career in a way that will change your life. Ask anyone who is engaged in a Microsoft data community. My synopsis of Atlanta Azure DataFest is here and my theme is “it is not too late to jump in”:
A [Database Transaction Unit] is a blended measure of CPU, memory, and data I/O and transaction log I/O in a ratio determined by an OLTP benchmark workload designed to be typical of real-world OLTP workloads. Doubling the DTUs by increasing the performance level of a database equates to doubling the set of resource available to that database.
That’s a great definition. But what are the implications?
Stu and I discussed the following data integration scenario: Your enterprise hardware is currently fixed – which fixes the capacity of your data-related workload. You can change your enterprise’s workload capacity at any time; you can increase capacity by buying more or better hardware.
Imagine your enterprise migrates your data and data-related workloads to the cloud. (I know a company that can help! :)) After migration, your enterprise can scale hardware up to meet demand, and then scale it back down again when demand drops. The economics of pay-for-only-what-you-need-when-you-need-it is compelling, to be sure, and it drives almost all decisions to migrate to the cloud.
But there’s more.
Time to market matters to many enterprises. Time to market matters more than ever to some enterprises. The impact of time to market is easy to underestimate.
Thinking in DTUs
Consider the math: A DTU is a DTU. How the DTU cycles are distributed across time and processors doesn’t really matter.
Let’s say you pay $100 to incrementally load your data warehouse and the load takes 24 hours to execute at the scale you’ve selected in the cloud. Prior to thinking in DTUs, engineers and business people would think, “That’s just the way it is. If I want more or faster, I need to pay for more or faster.” But DTU math doesn’t quite work that way. Depending on your workload and DTU pricing at the time (FULL DISCLOSURE: DTU PRICING CHANGES REGULARLY!), you may be able to spend that same $100 on more compute capabilities and reduce the amount of time required to load the same data into the same data warehouse to minutes instead of hours.
That’s DTU Math.
The shift to DTU thinking is subtle but vital.
We are used to thinking the only way to make things faster is to spend more money. That’s simply no longer accurate. The shape of the line between cost and performance is may still trend linear but you can dramatically – and very, very quickly – alter the slope of that line, especially with regards to time.
The fact that the cost/performance curve can be altered in seconds instead of months meta-changes everything.
The statements above are examples of DTU thinking and DTU math. So, please ask yourself: “What could my enterprise do with more time?”
Why is that so important? Because Ben was right: Time is money.
Check out this schedule – there are a bunch of smart people presenting – plus me!
Ever wonder why SSIS runs so slow? Watch SSIS author Andy Leonard as he runs test loads using sample and real-world data and shows you how to tune SQL Server 2017 Integration Services (SSIS 2017) packages.
We’ll start by experimenting with SSIS design patterns to improve performance loading AdventureWorks data. We will implement different change detection patterns and compare execution performance for each. Then, we’ll explain a Data Flow Task’s bottleneck when loading binary large objects – or Blobs.
Finally, we’ll demonstrate a design pattern that uses a Script Component in a Data Flow to boost load performance to MySql, whether on-premises or in the cloud. Prerequisites: None. Some SSIS and SQL knowledge will be helpful but is not required.
I hope to see you there! I’d love to meet you if you read this blog and attend – just walk up and introduce yourself, I’m the southern guy with the braided beard.
I started answering a question on SQL Community Slack’s #ssis channel and I realized this would be better served as a blog post. The question was about three SSIS Data Flow properties: DefaultBufferSize, Engine Thread and DefaultBufferMaxRows.
I rarely change the EngineThreads property.
DefaultBufferSize and DefaultBufferMaxRows are two ways of managing the size limits of a Data Flow buffer. The two Data Flow Task properties can – and should – be treated as a single property. DefaultBufferSize is the number of bytes per buffer. DefaultBufferMaxRows is the number of rows per buffer. The defaults are 10,485,760 (10M) and 10,000, respectively.
For source rows that are 1K in width it’s a wash because 10,000 rows * 1K == 10M.
Let’s look at some implications of the math(s):
If the source rows are wider than 1K, you will hit the 10M memory limit set by DefaultBufferSize before you encounter the DefaultBufferMaxRows limit.
If the rows are less than 1K wide, you will hit the row count limit set by DefaultBufferMaxRows before you encounter the DefaultBufferSize limit.
Together, these properties make up what I refer to as the Data Flow Task “Buffer Size” property.
When tuning a Data Flow Task I adjust both properties together by a similar factor because, in my little mind (at least), they are one property.
It’s possible to mitigate for data types and I cover some ways to tune for the integration of BLOb data types in my series on Loading BLObs.
Data Size and Shape
As a function of the relationship between hard drive disk speeds (compounded by the fact that current systems were designed to work with spinning platters and read/write heads), RAM, and processor cache, the graph for moving “chunks” of data in any data movement operation (whether data or files) results in a curve*. In general, breaking an integration operation into ten chunks will almost always* perform the total load / move operation faster than loading / moving the same amount in a single operation. There is a point where the number of chunks results in a cursor-y operation. Here the performance curve trends downward (or upward, if you draw the graph like I do in my classes). The top (or bottom) of the curve represents maximum throughput for that size and shape of data containing the source data types.
Something to consider (I will not cover more in this post): The max-throughput can, and often almost always moves over time.
To demonstrate, I’ve created a sample SSIS project in SSDT v15.1.61801.210 for SSIS v14.0.3002.92 (Visual Studio 2017 for SSIS 2017). I select from the AdventureWorks2014 Person.Person table using an OLE DB Source adapter (click to enlarge):
I map to an almost identical table – via an OLE DB Destination adapter – in a test database (named TestDB) on the same instance of SQL Server:
Executing with the defaults (10,000 rows for DefaultBufferMaxRows and 10M for DefaultBufferSize), the Data Flow Task executes in 10.594 seconds:
Dropping the Buffer Size properties an order of magnitude…
… results in an execution time of 6.328 seconds:
For those curious about the effects of queuing, returning the properties to 10,000 and 10,485,760 results in an execution time of 9.61 seconds on my demo VM.
Your mileage may vary.
In this post, I’ve briefly discussed and demonstrated one impact of managing SSIS Data Flow Task performance by adjusting the Buffer Size properties – DefaultBufferMaxRows and DefaultBufferSize.
Intentionally equivocal language in use… caveats abound.
For those who may not know, SSIS Framework Community Edition is free and open source.
I’m writing about parallel execution in SSIS Framework Community Edition because I’ve received more than one question about parallel execution in the past several days. Rather than copying and pasting responses, I figured I’d write a post about it and copy and paste the URL!
This is a long post and you may not be interested in reading all of it. It can be divided into three sections:
One scenario to consider when parallelizing loads in SSIS (which starts right after this list) – laden with data integration theory
If you search online for ways to improve SSIS performance you will find myriad tips and tricks. Most of them are valid, some are not, and at least one has been invalid since mid-2006 (and it may be the one you find first). Why is this so? The internet is an echo chamber, and search engines are the medium through which the echoes travel.
One tip you will doubtless encounter is: Execute SSIS packages in parallel to improve performance. This is excellent advice.*
*Please note the asterisk. I place an asterisk there because there’s more to the story. Lots more, in fact. I can hear thinking, “What more is there to the story, Andy?” I’m glad you asked.
Dependencies and Race Conditions
Sit a spell and let Grandpa Andy tell yall a story about some data integratin’.
Suppose for a minute that you’ve read and taken my advice about writing small, unit-of-work SSIS packages. I wouldn’t blame you for taking this advice. It’s not only online, it’s written in a couple books (I know, I wrote that part of those books). One reason for building small, function-y SSIS packages is that it promotes code re-use. For example, SSIS packages that perform daily incremental loads can be re-used to perform monthly incremental loads to a database that serves as a data mart by simply changing a few parameters.
Change the parameter values and the monthly incremental load can load both quarterly and yearly data marts.
You want better performance out of the daily process, so you read and implement the parallel execution advice* you’ve found online. For our purposes let’s assume you’ve designed a star schema instead of one of those pesky data vaults (with their inherent many-to-many relationships and the ability to withstand isolated and independent loads and refreshes…).
You have dependencies. The dimensions must be loaded before the facts. You decide to manage parallelism by examining historical execution times. Since you load data in chronological order and use a brute-force change detection pattern, the daily dimension loads always complete before the fact loads reach the latest data. You decide to fire all packages at the same time and your daily execution time drops by half, monthly executions time drops to 40% of its former execution time, and everyone is ecstatic…
…until the quarterly loads.
The fact loaders begin failing. You troubleshoot and see the lookups that return dimension surrogate keys are failing to find a match. By the time you identify and check for the missing surrogate keys, you find them all loaded into the dimensions. What is going on?
Facts are, by nature, “skinny.” Although you may load many more fact rows when loading a star schema, those rows are composed mostly of numbers and are not very wide.
When an SSIS Data Flow Task loads rows, it places the rows in buffers in RAM. You can visualize these buffers as buckets or boxes. If you choose boxes, think of the buffers as rectangles that can change shape. Some rectangles are square (remember, from high school geometry?), some are wide and not tall, some are tall and not wide. No matter what, the area of the rectangle doesn’t change although the shape will change based on the shape of each table. The shape of the buffer is coupled to the shape of the table – and is specifically driven by the length of the table row.
Aside: When I talk about this in Expert SSIS, I refer to this as “thinking like a Data Flow.”
One result is that fact loaders, when they start, most often load rows differently compared the dimension loaders. In engineering-speak, fact and dimension loaders have a different performance characteristic. I learned about characteristics studying electronics. When I became an electrician in a manufacturing plant, repairing and later designing and building electrical control systems, I learned about race conditions, which is what we have here.
Because the SSIS Data Flow Task buffers data and because dimensional data are most often wide, dimensions load fewer rows per buffer. Conversely, because fact data are narrow, facts load more rows per buffer.
There exists a (hidden) threshold in this scenario defined by when the fact loaders (executing in parallel with dimension loaders) reach new rows (which reference new dimension surrogate keys) compared with when the dimension loaders reach and load their new rows, thereby assigning said new rows their dimensional surrogate keys.
At lower row counts – such as daily or even monthly loads – the difference between dimensional and fact loader performance characteristics remains below the threshold, and everything appears peachy.
The race condition I describe here occurs because, at higher row counts, the dimensions load slower – much slower – than the facts load. The difference in performance characteristics crosses the threshold when one or more fact loaders reach new rows for which corresponding dimensions have not yet been loaded.
This is a race condition.
Not a Late-Arriving Dimension
Now, you may read this scenario and think, “No Andy, this is a late-arriving dimension or early-arriving fact.” It’s not. Both dimension and fact source records are present in the source database when this load process begins. Late-arriving facts occur when the dimension load starts before a particular row in the dimension has been added, and the fact row referencing that particular dimension record is added after the dimension load completes but before the fact load starts.
But since you brought this up, is there a solution for late-arriving dimensions? Yep. Inferred members.
Let’s say you have implemented inferred member logic in your fact load and you encounter this race condition. What happens? Remember, the dimension and fact loaders are running in parallel and they work just fine for daily and monthly processing. In our scenario, the quarterly load crosses the threshold and causes a race condition.
When the fact loader reaches a dimension record that has not yet been loaded, the dimension loader is still running.
If you implement inferred member logic, you insert a row into the dimension – inserting the natural key which generates an (artificial) surrogate key – which is then returned to the fact loader so that the fact can continue loading without failure or interruption. While inferred members is a cool and helpful pattern, what happens when the dimension loader catches up? Hopefully, you have either a unique constraint on the natural key (and you haven’t disabled it to speed up the dimensional load process…another idea I and others have shared that you may have read) or you’ve incorporated inferred logic into the dimension loaders so that you check for the existence of the natural key or business key first – which updates the dimension row if the natural key already exists (and adds logic which further complicates and slows down the dimension loading process… there’s no free lunch…). Otherwise, you have a failed dimension loader execution or (worse, much worse), your race condition coupled with missing design elements combine to create duplicate dimension rows (which render the data mart useless).
Solving the Race Condition
The easy way to solve for race conditions – especially hidden race conditions – is to execute loaders serially. This goes directly against the SSIS performance advice of executing loaders in parallel.
“Great Andy, But What If I Need To Load In Parallel Using A Framework And I Mitigate Race Conditions?”
Enter the SSIS Design Pattern named the Controller Pattern. A Controller is an SSIS package that executes other SSIS packages. Controllers can be serial or parallel, or combinations of both. I’ve seen (great and talented) SSIS architects design frameworks around specific Controllers – building a Controller for each collection of loaders related to a subject area. SSIS Framework Community Edition ships with a generic metadata-driven serial controller named Parent.dtsx, found in the Framework SSIS solution.
Specific Controller Design
A specific Controller can appear as shown here (click to enlarge):
This controller achieves parallel execution. One answer to the question, “How do I execute packages in parallel?” is to build a specific controller like this one.
Simple and straightforward, uses out-of-the-box Execute Package Task
“All Executions” Catalog Report is… misleading
How is the All Executions Report misleading? If you build and deploy a project such as SerialControllers SSIS project shown here – and then execute the SerialController.dtsx package – the All Executions reports only a single package execution: SerialController.dtsx (click to enlarge):
We see one and only one execution listed in the All Executions report, but if we click on the link to view the Overview report, we see each package listed individually:
All Executions accurately reflects an important aspect of the execution of the SerialController.dtsx SSIS package: The execution of this package – and the packages called by SerialController.dtsx – share the same Execution ID value. This is not necessarily a bad thing, but it is something of which to be aware.
Specific Controller Design in SSIS Framework Community Edition
A specific Controller built using SSIS Framework Community Edition can appear as shown here:
This controller uses Execute SQL Tasks instead of Execute Package Tasks. The T-SQL in the Execute SQL Tasks call a stored procedure named custom.execute_catalog_package that is part of SSIS Framework Community Edition.
One answer to the question, “How do I execute packages in parallel using SSIS Framework Community Edition?” is to build a controller such as this.
All Executions report is accurate
The All Executions Report is no longer misleading. If you build and deploy a project such as SerialControllersInFrameworkCE SSIS project shown here – and then execute the SerialControllerInFrameworkCE.dtsx package – the All Executions reports each package execution (click to enlarge):
We now see one execution listed in the All Executions report for each package. As before, All Executions accurately reflects an important aspect of the execution of the SerialControllerInFrameworkCE.dtsx SSIS package: The execution of the Controller and each Child package now have distinct Execution ID values.
When using specific Controllers with an SSIS Framework, it’s common to create a single-package SSIS Application that simply starts the Controller, and let the Controller package call the child packages. Parent.dtsx in SSIS Framework Community Edition is a generic metadata-driven Controller, but it doesn’t mind executing specific Controllers one bit!
Once Upon A Time…
Not too long ago, Kent Bradshaw and I endeavored to add automated parallel package execution to our Commercial and Enterprise SSIS Frameworks. We achieved our goal, but the solution added so much complexity to the Framework and its associated metadata that we opted to not market the implemented solution.
Why? Here are some reasons:
Starting SSIS packages in parallel is very easy to accomplish in the SSIS Catalog. The SYNCHRONIZED execution parameter is False by default, and that means we could build a controller package similar to the SerialControllerInFrameworkCE.dtsx SSIS package – with precedence constraints between each Execute SQL Task, even – and the SSIS Catalog would start the packages in rapid succession. In some scenarios – such as the scenario discussed in the first section of this post – this then becomes a race condition engine.
Because controlling only when packages start is not enough to effectively manage race conditions. To mitigate the race condition described earlier I need to make sure the dimension loaders complete before starting the fact loaders. A (much simplified) Controller for such a process could appear as shown here (click to enlarge):
I grabbed this screenshot after the dimension loader and some (vague) pre-operations process have completed in parallel, but while the fact loader is still executing. Please note the combination of the Sequence Container and precedence constraint which ensure the fact loader doesnot start executing until the dimension loader execution is complete. The sequence container creates a “virtual step” whereby all tasks within the container must complete before the sequence container evaluates the precedence constraint. Since each task inside this container starts an SSIS package (and since the SYNCHRONIZED execution parameter is set to True by default in SSIS Framework Community Edition), nothing downstream of this container can begin executing until everything inside the container has completed executing. This is how we avoid the race condition scenario described earlier.
How does one automate this process in a framework?
It’s not simple.
The method Kent and I devised was to create and operate upon metadata used to define and configure a “virtual step.” In SSIS Framework Community Edition the Application Packages table is where we store the Execution Order attribute. We reasoned if two Application Package entities shared the same value for Execution Order, the associated package(s) (I’m leaving out some complexity in the design here, but imagine executing the same package in parallel with itself…) compose a virtual step.
In a virtual step the packages would start together, execute, and not proceed to the next virtual step – which could be another serial package execution or another collection of packages executing in parallel in yet another virtual step – until all packages in the current virtual step had completed execution. Here, again, I gloss over even more complexity regarding fault tolerance – we added metadata to configure whether a virtual step should fail if an individual package execution failed.
This was but one of our designs (we tried three). We learned managing execution dependency in a framework is not trivial. We opted instead to share the Controller pattern.
In this post I wrote about some data integration theory. I also answered a question I regularly receive about performing parallel loads using SSIS Framework Community Edition. I finally covered some of the challenges of automating a solution to manage parallel execution of SSIS packages in a Framework.
Ever wonder why SSIS runs so slow? Watch SSIS author Andy Leonard as he runs test loads using sample and real-world data and shows you how to tune SQL Server 2016 Integration Services (SSIS 2016) packages.
We’ll start by experimenting with SSIS design patterns to improve performance loading AdventureWorks data. We will implement different change detection patterns and compare execution performance for each. Then, we’ll explain a Data Flow Task’s bottleneck when loading binary large objects – or Blobs.
Finally, we’ll demonstrate a design pattern that uses a Script Component in a Data Flow to boost load performance to MySql, whether on-premises or in the cloud.
Description What is Intelligent Data Integration? SSIS packages developed using tried and true design patterns, built to participate in a DevOps enterprise practicing DILM, produced using Biml and executed using an SSIS Framework.
Attend a day of training focused on intelligent data integration delivered by an experienced SSIS consultant who has also led an enterprise team of several ETL developers during multiple projects that spanned 2.5 years. And delivered. Attendees will learn:
– a holistic approach to data integration design. – a methodology for enterprise data integration that spans development through operational support. – how automation changes everything. Including data integration with SSIS.
Topics include: 1. SSIS Design Patterns 2. Executing SSIS in the Enterprise 3. Custom SSIS Execution Frameworks 4. DevOps and SSIS 5. Biml, Biml Frameworks, and Tools
What is Intelligent Data Integration? SSIS packages developed using tried and true design patterns, built to participate in a DevOps enterprise practicing DILM, produced using Biml and executed using an SSIS Framework.
Attend a day of training focused on intelligent data integration delivered by an experienced SSIS consultant who has also led an enterprise team of several ETL developers during multiple projects that spanned 2.5 years. And delivered.
Attendees will learn: – a holistic approach to data integration design. – a methodology for enterprise data integration that spans development through operational support. – how automation changes everything. Including data integration with SSIS.
Topics include: 1. SSIS Design Patterns Data Flow Performance ETL Instrumentation 2. Executing SSIS in the Enterprise The SSIS Catalog – the good, the bad, and the ugly. 3. Custom SSIS Execution Frameworks 4. DevOps and SSIS A (Condensed) Sprint in the Life of a Data Integration Solution Version Control and SSIS 5. Business Intelligence Markup Language A Brief Introduction to Biml in the free utilities, BimlExpress and BimlOnline 6. SSIS Design Patterns + Biml Putting the DILM (Data Integration Lifecycle Management) components together. 7. SSIS Design Patterns + Biml + Custom SSIS Execution Frameworks Executing the new combinations. 8. SSIS Design Patterns + Biml + Custom SSIS Execution Frameworks => DevOps Enterprise-class data integration with SSIS.
The target audience for this course is data integration developers and architects who want to learn more about SSIS performance, DevOps, execution, and automation.
Do you need to use SSIS to load some data from a large source? This post covers some thoughts on a couple/three patterns you may apply to help. I can hear you thinking, “Which pattern should I use, Andy?” That’s an excellent question! The answer is, “It depends.” You should test each pattern and see which performs best for you.
Pattern 0: Just Do It
The first pattern is easy. Select the rows you need in an SSIS Data Flow Task OLE DB Source adapter. You can build a query in SSMS (like I did here):
Is there anything wrong with this query? Well, maybe. Functions – like DatePart() – in the WHERE clause can slow things down a bit. It’s a valid pattern, though, so you should use it in your SSIS Data Flow Task’s OLE DB Source adapter to test performance.
Pattern 1: Stage Select Rows Using a Data Flow
Another option is to stage the subset of rows you’re after – or just a list of uniquely-identifying columns from those rows – also known as a business key or natural key. You could create a Data Flow Task that selects the rows you wish to load, similar to that shown here:
Those rows could be loaded to a staging table and later read from that table.
Reading all the columns – as shown above – is really just adding a step that you don’t need if your goal is to load data straight from a source to a destination. If, however, you need to join to other tables or lookup reference data, loading more than just the business keys into a staging table may be the way to go.
Pattern 2: Load Staged Rows Using T-SQL
A third option is to use T-SQL to load the target table after you’re loaded a stage table via Data Flow Task. Configure the T-SQL query similar to that shown here:
Your Control Flow now appears as shown:
Pattern 3: Stage and Load Rows Using T-SQL and Execute SQL Tasks
Another option is to stage the subset of rows you’re after using an Execute SQL Task and then use a second Execute SQL Task to load those rows into the target table. Your stage query would appear as shown here:
The Control flow would now contain an Execute SQL Task for Staging and another for Loading:
Pattern 4: Skipping the Stage Step Altogether
If you’re just doing a simple single-table load (like in this demo), and if the source and destination are on the same instance, you can often achieve great performance using T-SQL to load from the source directly to the destination:
Your Control Flow becomes remarkably simple, as shown here:
Staging a subset of the data you need, or even unique keys for a subset of the desired data, is one way to improve load performance.
The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.