SSIS Design Pattern: Controller Pattern

SSIS Framework Community Edition defaults to serial execution. The Controller Pattern can help. How? Read on…

“Great Andy, But What If I Need To Load In Parallel Using A Framework?”

Enter the SSIS Design Pattern named the Controller Pattern. A Controller is an SSIS package that executes other SSIS packages. Controllers can be serial or parallel, or combinations of both. I’ve seen (great and talented) SSIS architects design frameworks around specific Controllers – building a Controller for each collection of loaders related to a subject area.

There’s nothing wrong with those solutions.

SSIS Framework Community Edition ships with a generic metadata-driven serial controller named Parent.dtsx, which is found in the Framework SSIS solution.

Specific Controller Pattern Design

A specific Controller can appear as shown here:

This controller achieves parallel execution. One answer to the question, “How do I execute packages in parallel?” is to build a specific controller like this one.

Advantages

  • Just Works
  • Simple and straightforward, uses out-of-the-box Execute Package Task

Disadvantages

  • “All Executions” Catalog Report is… misleading…

“How is the All Executions Report misleading, Andy?”

I’m glad you asked. If you build and deploy a project such as the SerialControllers SSIS project shown here – and then execute the SerialController.dtsx package – the All Executions report lists only a single package execution, SerialController.dtsx:

We see one and only one execution listed in the All Executions report. If we click the link to view the Overview report, we see each package listed individually:

The All Executions report accurately reflects an important aspect of the execution of the SerialController.dtsx SSIS package. The execution of this package – and the executions of the packages called by SerialController.dtsx – share the same Execution ID value. This is not necessarily a bad thing, but it is something of which to be aware.

Specific Controller Design in SSIS Framework Community Edition

A specific Controller built using SSIS Framework Community Edition can appear as shown here:

This controller uses Execute SQL Tasks instead of Execute Package Tasks. The T-SQL in the Execute SQL Tasks calls a stored procedure named custom.execute_catalog_package that is part of SSIS Framework Community Edition.
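For the curious, here is a minimal sketch of the kind of T-SQL such an Execute SQL Task ultimately drives into the SSIS Catalog. I’m showing the Catalog’s built-in stored procedures rather than the Framework’s custom.execute_catalog_package (whose parameters I won’t reproduce here), and the folder, project, and package names are placeholders:

```sql
-- Minimal sketch: start a package in the SSIS Catalog from T-SQL.
-- Folder, project, and package names are placeholders.
DECLARE @execution_id bigint;

EXEC SSISDB.catalog.create_execution
     @folder_name  = N'Framework',
     @project_name = N'MyLoaders',
     @package_name = N'LoadDimCustomer.dtsx',
     @execution_id = @execution_id OUTPUT;

EXEC SSISDB.catalog.start_execution @execution_id;
```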

One answer to the question, “How do I execute packages in parallel using SSIS Framework Community Edition?” is to build a Controller.

Advantages

  • Just Works
  • The SSIS Catalog All Executions report is accurate

Disadvantages

  • Adds complexity

The All Executions Report is no longer misleading. If you build and deploy a project such as the SerialControllersInFrameworkCE SSIS project shown here – and then execute the SerialControllerInFrameworkCE.dtsx package – the All Executions report lists each package execution:

We now see one execution listed in the All Executions report for each package. As before, All Executions accurately reflects an important aspect of the execution of the SerialControllerInFrameworkCE.dtsx SSIS package: the Controller execution and each child package execution now have distinct Execution ID values.
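If you prefer T-SQL to reports, you can confirm the distinct Execution ID values by querying the catalog.executions view. A quick sketch – the project name filter simply matches the example project above:

```sql
-- One row per execution: the Controller and each child package
-- now appear with their own execution_id values.
SELECT e.execution_id,
       e.folder_name,
       e.project_name,
       e.package_name,
       e.status,        -- 7 = succeeded, 4 = failed
       e.start_time,
       e.end_time
FROM SSISDB.catalog.executions AS e
WHERE e.project_name = N'SerialControllersInFrameworkCE'
ORDER BY e.execution_id DESC;
```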

When using specific Controllers with an SSIS Framework, it’s common to create a single-package SSIS Application that simply starts the Controller and lets the Controller package call the child packages. Parent.dtsx in SSIS Framework Community Edition is a generic metadata-driven Controller, but it doesn’t mind executing specific Controllers one bit!

Once Upon A Time…

Not too long ago, Kent Bradshaw and I endeavored to add automated parallel package execution to our Commercial and Enterprise SSIS Frameworks. We achieved our goal, but the solution added so much complexity to the Framework and its associated metadata that we opted to not market the implemented solution.

Why? Here are some reasons:

Starting SSIS packages in parallel is very easy to accomplish in the SSIS Catalog. The SYNCHRONIZED execution parameter is False by default. That means we could build a controller package similar to the SerialControllerInFrameworkCE.dtsx SSIS package – even with precedence constraints between the Execute SQL Tasks – and the SSIS Catalog would still start the packages in rapid succession. In some scenarios – such as the scenario discussed in this post (from which the current post was derived) – this then becomes a race condition engine.

A Race Condition Engine?

Yes, because controlling only when packages start is not enough to effectively manage race conditions. To mitigate the race condition described in this post, I need to make sure the dimension loaders complete before starting the fact loaders. A (much simplified) Controller for such a process could appear as shown here:

I grabbed this screenshot after the dimension loader and some (vague) pre-operations process had completed in parallel, but while the fact loader was still executing. Please note the combination of the Sequence Container and precedence constraint, which ensures the fact loader does not start executing until the dimension loader execution is complete.

The sequence container creates a “virtual step” whereby all tasks within the container must complete before the sequence container evaluates the precedence constraint. Since each task inside this container starts an SSIS package (and since the SYNCHRONIZED execution parameter is set to True by default in SSIS Framework Community Edition), nothing downstream of this container can begin executing until everything inside the container has completed executing. This is how we avoid the race condition scenario described earlier.
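For reference, SYNCHRONIZED is a system parameter you set on a Catalog execution before starting it. Here’s a minimal sketch (folder, project, and package names are placeholders); with SYNCHRONIZED set to 1, start_execution does not return until the package completes – which is exactly what makes the “virtual step” wait work:

```sql
-- Sketch: make a Catalog execution synchronous so the calling
-- Execute SQL Task waits for the package to finish.
DECLARE @execution_id bigint;

EXEC SSISDB.catalog.create_execution
     @folder_name  = N'Framework',              -- placeholders, as before
     @project_name = N'MyLoaders',
     @package_name = N'LoadDimProduct.dtsx',
     @execution_id = @execution_id OUTPUT;

DECLARE @synchronized smallint = 1;              -- Catalog default is 0 (fire-and-forget)
EXEC SSISDB.catalog.set_execution_parameter_value
     @execution_id,
     @object_type     = 50,                      -- 50 = system parameter
     @parameter_name  = N'SYNCHRONIZED',
     @parameter_value = @synchronized;

-- With SYNCHRONIZED = 1, this call blocks until the package completes.
EXEC SSISDB.catalog.start_execution @execution_id;
```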

How does one automate this process in a framework?

It’s not simple.

The method Kent and I devised was to create and operate upon metadata used to define and configure a “virtual step.” In SSIS Framework Community Edition, the Application Packages table is where we store the Execution Order attribute. We reasoned that if two Application Package entities shared the same value for Execution Order, then the associated package(s) (I’m leaving out some complexity in the design here, but imagine executing the same package in parallel with itself…) compose a virtual step.

In a virtual step packages would start together, execute, and not proceed to the next virtual step – which could be another serial package execution or another collection of packages executing in parallel in yet another virtual step – until all packages in the current virtual step had completed execution. Here, again, I gloss over even more complexity regarding fault tolerance. Kent and I added metadata to configure whether a virtual step should fail if an individual package execution failed.
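To make the idea concrete, here’s a hypothetical query over that kind of metadata. The table and column names below are illustrative only – they follow the description above, not the exact SSIS Framework Community Edition schema:

```sql
-- Hypothetical metadata query: packages sharing an Execution Order value
-- form one "virtual step." Table and column names are illustrative only.
DECLARE @ApplicationName nvarchar(128) = N'My SSIS Application';

SELECT ap.ExecutionOrder AS VirtualStep,
       COUNT(*)          AS PackagesInStep
FROM custom.ApplicationPackages AS ap
JOIN custom.Applications        AS a
  ON a.ApplicationID = ap.ApplicationID
WHERE a.ApplicationName = @ApplicationName
GROUP BY ap.ExecutionOrder
ORDER BY ap.ExecutionOrder;
```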

This was but one of our designs (we tried three). We learned managing execution dependency in a framework is not trivial. We opted instead to share the Controller pattern.

We Learned Something Else

While visiting a client who had deployed the Controller Pattern, we noticed something. The client used a plotter to print large copies of Controller control flows and post them on the walls outside his cubicle.

When we saw this we got it.

The tasks in the Controller’s control flow were well-named. They were, one could say, self-documenting. By posting updated versions of the Controller control flows whenever the design changed, the data engineer was informing his colleagues of changes to the process.

He didn’t need to explain what had changed. It was obvious to anyone stopping by his cubicle for a few minutes. Brilliant!

Conclusion

In this post I wrote about some data integration theory. I also answered a question I regularly receive about performing parallel loads using SSIS Framework Community Edition. Finally, I covered some of the challenges of automating a solution to manage parallel execution of SSIS packages in a Framework.

Note: much of this material was shared earlier in this post. I broke the Controller Pattern portion out into its own post because the other post was really too long.

:{>

Experience Matters

This is a picture of Kent Bradshaw, Microsoft Certified Data Scientist. Kent and I work together at Enterprise Data & Analytics delivering data engineering, business intelligence, and data-based solutions that help businesses turn their data into actionable information. We do not do this alone; we are surrounded by an excellent team with diverse skills, perspectives, and work and life experience.

In this post I want to focus on experience because I believe experience is the most valuable component a consultant brings to a customer.

When Experience Hurts

Experience can hurt if we allow it. We can become locked into that’s-the-way-it’s-done syndrome. There are new frameworks and methodologies introduced every day (literally). It’s too much for any one person to keep up with the latest and greatest and most shiny new stuff.

This is why a team delivers value to customers.

A diverse team – like the team at Enterprise Data & Analytics – is incredibly valuable. We’re not just a bunch of old guys; we are also Millennials. Our younger team members are constantly learning about new technologies and methodologies and bringing them to the attention of our older team members. The energy our younger team members bring to our projects is inspiring and amazing, and…

One of the most rewarding parts of my job as Chief Data Engineer is watching our younger folks thrive, grow, and succeed.

I love how our team works together! I love it when we all pile on WebEx to solve a problem for a customer or design part of a data warehouse solution. It’s thrilling! I’m proud beyond words of our team.

Although we haven’t advertised it much, the team at Enterprise Data & Analytics has grown – nearly doubling in 2018 (so far).

Diversity in backgrounds, experiences, and levels of experience helps us overcome the downsides of experience.

When Experience Helps

Kent and I (and others on the Enterprise Data & Analytics team) have delivered many data-related solutions. I like to joke and say, “when we started working with data we had to carve our own chips out of wood!” That’s not entirely accurate, but I learned Motorola 6800 machine code in 1975 (I was 11) and Kent started when punch cards were the norm.

Having lots of experience gives Kent and me perspective. We understand the maturity cycles of technologies, corporate cultures, and even projects. Why? We’ve experienced a number of said cycles. When the inevitable “bumps in the road” occur, we don’t panic or lose perspective. We’ve likely experienced something very similar in the past, made mistakes attempting to recover, and learned the best way forward from those mistakes.

Do we still make mistakes? Yep. But fewer, and we have a deeper, richer well from which to draw experience-based, time-tested solutions. Perhaps most important to our customers, we have the confidence that only comes from recovering from a mistake (or several).

Plus, our more experienced team members have the privilege and honor of sharing our experience with our younger team members. Together, we learn from each other. I love that!

Our more experienced team members bring leadership and management experience to bear on each and every Enterprise Data & Analytics project. My experience leading a team of 40 ETL developers at Unisys gives me a deeper appreciation for the total cost of ownership of any solution (which can dwarf the cost of development – especially short-sighted development).

Snapping It Together

With our diverse backgrounds, experiences, and levels of experience, Enterprise Data & Analytics is able to deliver the best solutions, period.

How may our team serve you? We offer:

  • Data Warehouse Rescue – is your data warehouse project stalled? We can help.
  • SQL Server and MySQL database performance tuning, development, and support – both on-premises and in the cloud (AWS and Azure).
  • SSIS (SQL Server Integration Services) coaching, development, performance tuning, and training.
  • Biml (Business Intelligence Markup Language) coaching, development, and training.
    • Coaching is my favorite! We navigate and you fly.
  • Data science to help you visualize your data, and data engineering to prepare your data for data science and business intelligence.
  • Automation solutions including SSIS Catalog Compare, Catalog Browser, SSIS Framework Community Edition, the BimlExpress Metadata Framework, and more!
  • Our team writes the books others read to learn SSIS.

Contact us today for more information.

:{>

Biml Case Study: Automate an Extract for Patrick LeBlanc’s K12 Solution

My friend and brother, Patrick LeBlanc (@patrickdba), built a cool Power BI dashboard (and database!) to help K12 education systems track students and student activities such as attendance. Better yet, he gives it away! He gives away the code and he shares a free Guy in a Cube video walking through the features.

Patrick emailed and asked if I could build a simple extract process in SSIS. I said, “Sure!” and was able to turn the work around in about 2.5 hours. He was blown away that I cranked out the solution in such a short amount of time, but I had a secret weapon: Biml!

In this video, I walk through how I did it and show you how you can too! You can score a free trial of BimlStudio here. You can grab my code – which includes the Biml project and the SSIS project – here.

Want to learn how it’s done? Sign up for Biml Basic Training at Biml Academy.

Happy Biml’ing!

:{>

Data Integration Pattern: The Pessimistic Load

“What is a Pessimistic Load?”

I’m glad you asked! A pessimistic load is any load pattern that incorporates a strict reconciliation process. Strict reconciliation throws an error when the reconciliation effort fails.

How Does It Work?

First you load some records. Then you check to see if the records you intended to load have been loaded. That’s it.

The first part is the load process; the second part is the reconciliation. The reconciliation process can be broken into chunks – and the chunks can be executed during the load process. For example, one can stage data and then reconcile stage-to-source.

Reconciliation Method 1: Row Counts

There are a number of ways to accomplish reconciliation. The most common is record counts. Count the number of rows you expected to load from a source – perhaps a flat file or database table – and then count the number of rows you loaded. This is easiest when you are deleting all the records from the destination before the load and then loading all the records from the source to the destination. You simply count the rows in the source, count the rows in the destination, and check to see if they are identical.
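Here’s a minimal sketch of strict row-count reconciliation for that simple truncate-and-load case. Table names are placeholders:

```sql
-- Sketch: strict row-count reconciliation for a full (truncate-and-load) pattern.
-- Source and destination table names are placeholders.
DECLARE @SourceRows bigint = (SELECT COUNT_BIG(*) FROM SourceDB.dbo.Orders);
DECLARE @DestRows   bigint = (SELECT COUNT_BIG(*) FROM StageDB.dbo.Orders);

IF @SourceRows <> @DestRows
    THROW 50001, N'Reconciliation failed: source and destination row counts differ.', 1;
```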

Things get tricksy when you are only loading some of the rows from a source, or when you are loading rows to a destination that already contains records – or both.

Reconciliation Method 2: Numeric Values

Another way to accomplish reconciliation is to use numeric values. Star schema data warehouses contain dimensions and facts. Dimensions are usually descriptive, and facts are usually collections of numeric measures (plus artificial keys to related dimensions). Summing numeric measures in a fact is one way to reconcile values between a fact source and destination, but sometimes the numeric values become very large – they can grow to exceed the maximum value for the data type.

One way to address exceeding a numeric data type is to hash the numeric values. There are a couple of techniques for hashing numeric data, but the goal is the same: present a unique-ish value that can be compared between the source and the destination. One of my favorite tricks is to use a running modulus with a prime number – the larger the prime number, the better.
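Here’s a sketch of the idea. The table, column, scaling factor, and choice of prime are all illustrative:

```sql
-- Sketch: reconcile a numeric measure without overflowing the data type.
-- Sum each value modulo a large prime on both sides and compare the results.
DECLARE @Prime bigint = 1000000007;   -- a large prime keeps the checksum bounded

DECLARE @SourceCheck bigint =
    (SELECT SUM(CAST(SaleAmount * 100 AS bigint) % @Prime) % @Prime
     FROM SourceDB.dbo.FactSales);

DECLARE @DestCheck bigint =
    (SELECT SUM(CAST(SaleAmount * 100 AS bigint) % @Prime) % @Prime
     FROM WarehouseDB.dbo.FactSales);

IF @SourceCheck <> @DestCheck
    THROW 50002, N'Reconciliation failed: numeric checksums differ.', 1;
```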

The Development Lifecycle of a Pessimistic Load

Pessimistic loads flow through a maturity model, or lifecycle. Like every load process, pessimistic loads begin by failing a lot. Initial failures occur because the source data is not cleansed (or not properly cleansed) and raises errors – caused by NULLs or duplicates, for example – when one tries to load a destination from a source.

As the process matures NULLs are replaced or managed, duplicates are detected and removed, and then quality / consistency checks – soft rules – are added to make sure the data is correct. Soft rules include checks for things like “is the start date of a record earlier than the end date?”

Therefore the first phase of implementing a pessimistic load is filled with data quality errors.


The second phase is interesting because data quality errors begin dropping in the load process (and in the load packages if one is loading using SSIS) but the reconciliation process continues to throw errors – stopping the load process if so designed – because the reconciliation checks do not know the data being loaded is good.

Process Fault Inversion

It is common for the enterprise to witness a kind of “process fault inversion” as the load process cleanses more and more data, while the reconciliation process lags behind the loaders in recognizing and separating (and counting or calculating hash values for) data that conforms to updated cleansing rules. Eventually the load process reaches a point where the only failures are due to the reconciliation process rejecting or missing or miscalculating essentially good data.

Although this is a frustrating phase of the project, it is a necessary and positive development in the maturity of a pessimistic load because once the reconciliation rules are synchronized with the data cleansing rules, the enterprise achieves a load process that is nigh bulletproof.

Conclusion

Confidence in the integrity of the load increases as errors decrease – optimally to 0 – and, on the rare error, the mature pessimistic load process is trusted to detect truly new and different source data conditions. Hence the name: Pessimistic load.

Immersion Event on Learning SSIS – May 2018!

I am honored to partner with the awesome team at SQLSkills to deliver IESSIS1: Immersion Event on Learning SQL Server Integration Services in person in Chicago 7-11 May 2018!

Click here to register!

Here’s what I’ll be covering:

Module 1: Introduction

In the introduction we share information about the instructor, course flow, and content, and gather information about attendee expectations for the course. Topics covered include:

  • Training overview
  • Expectations
  • Introductions

Module 2: Creating Your First SSIS package

We believe many learn SSIS best by doing, so we get attendees’ hands on SSIS as soon as possible. In this module, we demonstrate how to build an SSIS package and invite attendees to walk through building a basic SSIS package with us. Topics covered include:

  • Introduction to SQL Server Data Tools – Business Intelligence (SSDT-BI), a version of Microsoft Visual Studio that serves as the SSIS integrated development environment (IDE)
  • Visual Studio tips and tricks
  • Exercise: Create an SSIS package

Module 3: Introduction to the Data Flow Task

SSIS was designed to move data. The SSIS Data Flow Task is a complex data integration engine built to support generic data load use cases. In this module, we introduce SSIS Data Flow concepts. Topics covered include:

  • Introduction to the Data Flow Task
  • Connection Manager and Data Flow Task basics
  • Source and destination adapters
  • Exercise: Create a Real Data Flow, Part 1

Module 4: Data Flow Task 201

Once the basics of Data Flow development are understood, we introduce a basic version of the SSIS design pattern called the incremental load. Building idempotent (safely re-executable) SSIS packages is one key to successful data integration in the enterprise. Topics covered include:

  • Intermediate Data Flow concepts
  • Building re-executable data loaders
  • Exercise: Create a Real Data Flow, Part 2

Module 5: Data Flow Task 202

Change detection is one of the most important components of Extract, Transform, and Load (ETL) projects. In this module, we expand the basic change detection introduced in the previous module and begin to examine SSIS design patterns that improve SSIS Data Flow performance. Topics covered include:

  • Advanced Data Flow concepts
  • Building and tuning robust incremental loaders
  • Exercise: Create a Real Data Flow, Parts 3 and 4

Module 6: The Control Flow

The SSIS Control Flow is the work flow manager for an SSIS package. In this module, we demonstrate work flow containment and management in the SSIS Control Flow, including advanced Precedence Constraint configuration and package execution control. Topics covered include:

  • Containers, precedence, and work flow
  • Transactions, restartability, and locking

Module 7: Loop Containers

Loop containers provide execution work flow isolation and iterative functionality. In this module, we examine traditional and non-traditional configurations for the For Loop and ForEach Loop containers. Topics covered include:

  • Using For Loop and Foreach Loop Containers

Module 8: Data Flow Task 301

After turning our attention to the SSIS Control Flow, we return to the SSIS Data Flow for an extended period of hands-on labs. In this module, attendees are tasked with the first of three labs that build a robust data integration loader using SSIS, with an instructor available to assist with questions. This lab focuses on transforming the data by applying strong-typing rules and managing data that violates those rules. Topics covered include:

  • Data typing basics
  • Building an incremental loader for real-world data
  • Exercise: Build a Robust Incremental Staging Loader, Part 1

Module 9: Data Flow Task 302

In this module, attendees are tasked with the second of three labs that build a robust data integration loader using SSIS, with an instructor available to assist with questions. This lab focuses on using built-in SSIS Adapter functionality to manage and execute DDL operations during data integration staging development. Topics covered include:

  • Intermediate data staging
  • Managing real-world changes to data sources
  • Exercise: Build a Robust Incremental Staging Loader, Part 2

Module 10: Data Flow Task 303

In this module, attendees are tasked with the third of three labs that build a robust data integration loader using SSIS, with an instructor available to assist with questions. This lab reinforces writing idempotent SSIS Data Flow packages and introduces data quality concepts via data cleansing in SSIS. Topics covered include:

  • Basic data cleansing
  • Deriving values
  • Exercise: Build a Robust Incremental Cleansing Loader

Module 11: Event Handlers, Logging, and Configurations

An important component of any solution is messaging and configuration. In this module, we demonstrate SSIS’s messaging model in the context of event handlers, logging, and managing externalized parameter and variable values. Topics covered include:

  • A survey of SSIS event handlers focusing on OnError and OnInformation events
  • Using SSIS’s built-in logging facility to capture package execution details and statistics
  • Using built-in package configurations to externalize variable values

Module 12: Security, Deployment, and Execution

SSIS facilitates secure connection and parameter value management. In this module, we build on our understanding of messaging, logging, and externalization by examining project, package, and connection security. We then turn our attention to deployment and execution patterns, and discuss the role of security in each. Topics covered include:

  • SSIS Package deployment options
  • Security implications
  • SSIS Package execution

Module 13: ETL Design Patterns

SSIS is a powerful enterprise data integration engine. In this module, we discuss experiences using SSIS to solve complex and “edgy” use cases, and some patterns we’ve realized from those engagements. Topics covered include:

  • Leveraging less-documented features of SSIS
  • Using Parent-Child design pattern to achieve “hands-free” custom logging and creative custom configuration
  • ETL Instrumentation

Module 14: Enterprise Execution Patterns

Data integration in a large enterprise is different from data integration in a smaller shop. In this module, we discuss patterns of SSIS development and management that lower the total cost of owning a data integration solution built using SSIS. Topics covered include:

  • Leveraging the Parent-Child design pattern and much of what we’ve learned over the past four days to build a metadata-driven SSIS execution framework engine

I hope to see you there!

:{>

Available for Pre-Order – The Biml Book!

 

I am excited to announce The Biml Book is now available for pre-order from Amazon and Apress!

I was honored to write with the inventor of Biml, Scott Currie, and an awesome team of BimlHeroes: Jacob Alley, Martin Andersson, Peter Avenant, Bill Fellows, Simon Peck, Reeves Smith, Raymond Sondak, Benjamin Weissman, and Cathrine Wilhelmsen.

Chapters

Part I: Learning Biml

  • Biml Tools
  • Introduction to the Biml Language
  • Basic Staging Operations
  • Importing Metadata
  • Reusing Code, Helper Classes and Methods

Part II: Biml Frameworks

  • A Custom Biml Framework
  • Using Biml as an SSIS Design Patterns Engine
  • Integration with a Custom SSIS Execution Framework
  • Metadata Automation
  • Advanced Biml Frameworks and BimlFlex

Part III: Biml Topics

  • Biml and Analysis Services
  • Biml for T-SQL
  • Documenting Your Biml Solution
  • Troubleshooting Metadata
  • Troubleshooting Biml

Part IV: Appendices

  • Source Control
  • Parallel Load Patterns in Biml
  • Metadata Persistence

Amazon projects the book’s release in early December but we anticipate the book will be available sooner.

Reserve your copy today!

:{>

Why I Do Not Mix High Availability with Reporting Solutions

One way to offload reporting requirements from the Production OLTP (OnLine Transaction Processing) database while providing a near-real-time copy of Production data is to execute reporting and analytics queries against a High Availability solution replica – such as SQL Server Availability Groups.

Learn more about Always On Availability Groups at: Overview of Always On Availability Groups (SQL Server).

“Why Don’t You Mix High Availability and Reporting Solutions, Andy?”

Two reasons:

  1. Indexes
  2. Additional columns

“We Don’t Need Different Indexes for Reporting…”

“… our OLTP database is optimized for OLTP and analytics queries.”

Um, are you sure about that?

I do a little database tuning. It’s not my forte but I’m not ignorant of the principles. Plus I have friends who are really, really good at database performance tuning.

We talk.

Mostly they talk and I listen (because I have a lot more to learn about… well, everything). They say things like, “This database I tuned had way too many indexes. The write performance was being clobbered by a bunch of unnecessary indexes;” or, “The clustered index was clustered on the wrong column(s) and sub-queries were dying as a result.”

Is your enterprise Production OLTP database impacted by a hybrid indexing strategy? I can help.

Reporting Index Strategy

Indexing for reporting is almost always different than indexing for OLTP. One example is reference data. Before you get all froggy on me and jump to a conclusion that reference data is always a small subset, I present exhibit A: the National Provider Identifier (NPI) file. At over 5GB and just under 5 million records, this reference dataset is not small. For the sake of brevity I will only mention, in passing, pre-computed drill-across merged fact tables in advanced star schema designs. You can learn more about them – and star schemas in general – from Christopher Adamson’s (excellent) book, Star Schema.

Beyond reference data, star schemas are simply different beasts when compared to OLTP databases. Many OLTP databases are normalized to some normal form; the most popular is third normal form. Star schemas reshape and, sometimes, repeat data – a big no-no in third normal form. The star schema isn’t the only alternative for reporting database models, either – there’s also Data Vault, which, again, requires a different indexing strategy than an OLTP database. There are other models, but this post is long enough.

“Why All The Fuss About Indexes, Andy?”

Replica databases in high availability solutions match the primary database. If you think about it, this must be so in order for the replica to take over in the event of a primary failure. When the primary fails, the last thing you want is for a database that looks and performs differently to take over. They need to be the same or high availability may be compromised.

This “same-ness” requirement includes indexes.

The same goes for mirroring solutions.
The same goes for backup and restore.

Backup and Restore is Decoupled

An advantage of using backup and restore is that the restored database is decoupled from the source. You could back up the OLTP system, restore it to a reporting instance of SQL Server, and then execute scripts to change the indexing strategy so the database is more amenable to analytics queries…

If the database is small enough and/or your enterprise reporting can tolerate the latency.
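A minimal sketch of that approach – database names, file paths, and the index itself are placeholders:

```sql
-- Sketch: restore last night's OLTP backup to a reporting instance,
-- then index the copy for analytics. Names and paths are placeholders.
RESTORE DATABASE SalesReporting
FROM DISK = N'\\backupshare\SalesOLTP_Full.bak'
WITH MOVE N'SalesOLTP'     TO N'D:\Data\SalesReporting.mdf',
     MOVE N'SalesOLTP_log' TO N'L:\Log\SalesReporting.ldf',
     REPLACE;
GO

USE SalesReporting;
GO

-- Reporting-friendly index: it lives only in the restored copy,
-- so the OLTP source is untouched.
CREATE NONCLUSTERED INDEX IX_Orders_OrderDate_CustomerID
ON dbo.Orders (OrderDate, CustomerID)
INCLUDE (OrderTotal);
GO
```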

“We Don’t Need Additional Columns for Analytics Queries…”

“… we’ve added all those columns to our OLTP database and OLTP is unaffected by them.”

Um, are you sure about that?

I’ve seen lots of solutions over the decades: views, functions, computed columns, etc. I haven’t seen any solutions that have zero performance impact. Not. One.

Again, HA solutions require replicas that are identical to the source database (I cannot fathom the overhead required if they weren’t…). This requirement applies to columns as well as indexes.

“What About Hybrid Approaches?”

I’ve seen some interesting hybrid approaches:

These and similar solutions work well because, like backup and restore, the reporting database is decoupled from the source.

Our next considerations are latency tolerance and data integration.

Latency

Find an enterprise that permits analytics queries against the Production OLTP database and ask an analyst if they absolutely must have up-to-the-nanosecond results. As surely as a project manager will ask a developer, “Do we need to add resources?” that analyst will emphatically reply, “Yes!”

Um, are you sure about that?

I like to follow the Data Trail through the enterprise. I ask (pesky) questions like, “How much time elapses between the time this report query is executed and someone acts on the results?” When I hear the answer, I listen for something we data engineers call grain. And not just any old grain. I’m listening for time dimension grain.

I’ve been doing this for decades. No one has ever used the time dimension grain “seconds” when answering my earlier question.

There are near-real-time data integration use cases, such as data signal processing.

Data Integration

Grain plays a vital role in data warehouse and data integration design – especially time dimension grain. Time dimension grain drives aggregation requirements. It’s a fantastic question to ask on Day One. You want to carefully weigh the answer.

I always ask the preferred grain of the time dimension.
I almost always design the data warehouse to match the time dimension grain of the source data.

I can hear you thinking, “When would you not design the data warehouse to match the time dimension grain of the source data, Andy?” That’s an excellent question. My answer is, “It depends.”

“How Do You Do Analytics Data Integration, Andy?”

I bet you know the answer to that question already. Yep, most of the time SSIS is involved. It may shock you to learn that these days I design more ELT (Extract, Load, and Transform) – with the “transform” step built in the relational engine – than ETL (Extract, Transform, and Load), especially with SQL Server 2012+. Why? The relational engine often outperforms SSIS Data Flow transformations.

When I execute the “T” in SQL Server, this is almost always accomplished via an SSIS Execute SQL Task. Why? If I’m using an SSIS Data Flow as a “data pump,” I want to log performance metadata (ETL Instrumentation). Since SSIS is my execution engine for the “EL” portion, it’s convenient to collect the same metrics from the “T” part of the process. It’s mighty convenient to collect the metrics in the same place and in the same format, too.
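Here’s a hedged sketch of what that instrumentation can look like wrapped around the “T” step. The log table and transform procedure names are hypothetical:

```sql
-- Sketch: instrument the "T" step the same way the "EL" steps are instrumented.
-- The log table and the transform procedure are hypothetical names.
DECLARE @StartTime datetime2(3) = SYSDATETIME();

EXEC dbo.TransformStageToWarehouse;   -- the "T" step (hypothetical stored procedure)

INSERT INTO dbo.EtlInstrumentation (ProcessName, StartTime, EndTime)
VALUES (N'TransformStageToWarehouse', @StartTime, SYSDATETIME());
```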

Advantages of Using SSIS

One nice thing about using a data integration platform like SSIS is I have the flexibility to choose my integration approach. I can load TableA using ELT and TableB using ETL.

I can also select different design patterns for different table loads: I can truncate and load smaller reference tables and use an incremental load design pattern to load larger tables.
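Here are minimal sketches of both patterns in T-SQL (table and column names are placeholders, and NULL handling is omitted for brevity):

```sql
-- Pattern 1: truncate-and-load a small reference table (placeholder names).
TRUNCATE TABLE dbo.RefCountry;

INSERT INTO dbo.RefCountry (CountryCode, CountryName)
SELECT CountryCode, CountryName
FROM stage.RefCountry;

-- Pattern 2: incremental load of a larger table --
-- update changed rows, then insert new rows.
UPDATE d
SET    d.CustomerName = s.CustomerName,
       d.City         = s.City
FROM   dbo.Customer   AS d
JOIN   stage.Customer AS s ON s.CustomerID = d.CustomerID
WHERE  d.CustomerName <> s.CustomerName
   OR  d.City         <> s.City;

INSERT INTO dbo.Customer (CustomerID, CustomerName, City)
SELECT s.CustomerID, s.CustomerName, s.City
FROM   stage.Customer AS s
WHERE  NOT EXISTS (SELECT 1 FROM dbo.Customer AS d
                   WHERE d.CustomerID = s.CustomerID);
```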

Integration approach and design patterns apply to calculating additional column values: I may choose to use the relational engine for one aggregation and use .Net in SSIS for another calculation.

Executing SSIS in the SSIS Catalog means logging is automatic. Runtime metrics are collected, no fuss, no muss.
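If you want to see those metrics with T-SQL instead of the built-in reports, the Catalog exposes them in views like catalog.executions and catalog.executable_statistics:

```sql
-- Runtime metrics the Catalog records automatically for every execution.
SELECT ex.execution_id,
       ex.package_name,
       es.execution_path,
       es.start_time,
       es.end_time,
       es.execution_duration    -- milliseconds
FROM SSISDB.catalog.executions            AS ex
JOIN SSISDB.catalog.executable_statistics AS es
  ON es.execution_id = ex.execution_id
ORDER BY ex.execution_id DESC, es.start_time;
```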

Automation

Loading a reporting instance relies on SSIS design patterns. If you’re designing a reporting solution for a source database with more than a few tables, consider using Biml to auto-generate SSIS packages.

Not every reporting and analytics database solution requires the level of automation demonstrated in the short (2:32) video shown here, but some will. Spoiler: the video demonstrates performing about 24 hours of work in just over 6 minutes.

Conclusion

High availability solution requirements – combined with analytics and reporting database index strategy, the need for additional reporting columns, and latency constraints – mean many enterprise reporting and analytics solutions require a separate database maintained in some decoupled-from-Production-OLTP manner.

:{>

Learn more:
Expert SSIS Training (Live, Online)
IESSIS1: Immersion Event on Learning SQL Server Integration Services (In-person)
Enterprise Data & Analytics