One way to offload reporting requirements from the Production OLTP (OnLine Transaction Processing) database while providing a near-real-time copy of Production data is to execute reporting and analytics queries against a High Availability solution replica – such as SQL Server Always On.
Learn more about Always On at: Overview of Always On Availability Groups (SQL Server).
“Why Don’t You Mix High Availability and Reporting Solutions, Andy?”
- Different indexes
- Additional columns
- Latency tolerance
“We Don’t Need Different Indexes for Reporting…”
“… our OLTP database is optimized for OLTP and analytics queries.”
Um, are you sure about that?
I do a little database tuning. It’s not my forte but I’m not ignorant of the principles. Plus I have friends who are really, really good at database performance tuning.
Mostly they talk and I listen (because I have a lot to learn about… well, everything). They say things like, “This database I tuned had way too many indexes. The write performance was being clobbered by a bunch of unnecessary indexes;” or, “The clustered index was clustered on the wrong column(s) and sub-queries were dying as a result.”
Is your enterprise Production OLTP database impacted by a hybrid OLTP-plus-reporting indexing strategy? I can help.
Reporting Index Strategy
Indexing for reporting is almost always different than indexing for OLTP. One example is reference data. Before you get all froggy on me and jump to the conclusion that reference data is always a small subset, I present exhibit A: the National Provider Identifier (NPI) file. At over 5GB and just under 5 million records, this reference dataset is not small. For the sake of brevity I will only mention pre-computed drill-across merged fact tables in advanced star schema designs in passing. You can learn more about them – and star schemas in general – from Christopher Adamson’s (excellent) book, Star Schema.
Beyond reference data, star schemas are simply different beasts when compared to OLTP databases. Many OLTP databases are normalized to some normal form. The most popular is third normal form. Star schemas reshape and, sometimes, repeat data – a big no-no in third normal form. The star schema isn’t the only alternative for reporting database models, either – there’s also Data Vault which, again, requires a different indexing strategy than an OLTP database. There are other models but this post is long enough.
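To make the contrast concrete, here is a minimal sketch – table and index names are hypothetical – of the same detail data indexed two ways: a narrow clustered primary key for OLTP writes and lookups, and a nonclustered columnstore index (available since SQL Server 2012, updatable since 2016) for analytics scans and aggregations:

```sql
-- OLTP: a narrow clustered primary key favors singleton reads and writes.
CREATE TABLE dbo.OrderDetail
(
    OrderDetailID int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_OrderDetail PRIMARY KEY CLUSTERED,
    OrderID   int   NOT NULL,
    ProductID int   NOT NULL,
    Quantity  int   NOT NULL,
    UnitPrice money NOT NULL
);

-- Reporting: analytics queries scan and aggregate millions of rows,
-- so a columnstore index typically serves them far better.
CREATE NONCLUSTERED COLUMNSTORE INDEX IX_OrderDetail_Analytics
    ON dbo.OrderDetail (OrderID, ProductID, Quantity, UnitPrice);
```

The point is not this specific index choice; it's that the two workloads pull the indexing strategy in different directions.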
“Why All The Fuss About Indexes, Andy?”
Replica databases in high availability solutions match the primary database. If you think about it, this must be so in order for the replica to take over in the event of a primary failure. When the primary fails, the last thing you want is for a database that looks and performs differently to take over. They need to be the same or high availability may be compromised.
This “same-ness” requirement includes indexes.
The same goes for mirroring solutions.
The same goes for backup and restore.
Backup and Restore is Decoupled
An advantage of using backup and restore is that the restored database is decoupled from the source. You could back up the OLTP database, restore it to a reporting instance of SQL Server, and then execute scripts to change the indexing strategy so the database is more amenable to analytics queries…
If the database is small enough and/or your enterprise reporting can tolerate the latency.
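A sketch of that pattern, with hypothetical database and file names:

```sql
-- Nightly: back up the Production OLTP database...
BACKUP DATABASE SalesOLTP
    TO DISK = N'\\backupshare\SalesOLTP.bak'
    WITH INIT;

-- ...restore it over the reporting copy (decoupled from Production)...
RESTORE DATABASE SalesReporting
    FROM DISK = N'\\backupshare\SalesOLTP.bak'
    WITH REPLACE,
         MOVE N'SalesOLTP'     TO N'D:\Data\SalesReporting.mdf',
         MOVE N'SalesOLTP_log' TO N'L:\Log\SalesReporting.ldf';

-- ...then apply a reporting-friendly indexing strategy.
USE SalesReporting;
GO
CREATE NONCLUSTERED INDEX IX_Orders_OrderDate
    ON dbo.Orders (OrderDate)
    INCLUDE (CustomerID, TotalDue);
```

The latency here is the backup-plus-restore window – acceptable for many enterprises, a deal-breaker for others.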
“We Don’t Need Additional Columns for Analytics Queries…”
“… we’ve added all those columns to our OLTP database and OLTP is unaffected by them.”
Um, are you sure about that?
I’ve seen lots of solutions over the decades: views, functions, computed columns, etc. I haven’t seen any solutions that have zero performance impact. Not. One.
Again, HA solutions require replicas that are identical to the source database (I cannot fathom the overhead required if they didn’t…). This requirement applies to columns as well as indexes.
“What About Hybrid Approaches?”
I’ve seen some interesting hybrid approaches. They work well because, like backup and restore, the reporting database is decoupled from the source.
Our next considerations are latency tolerance and data integration.
Find an enterprise that permits analytics queries against the Production OLTP database and ask an analyst if they absolutely must have up-to-the-nanosecond results. As surely as a project manager will ask a developer, “Do we need to add resources?” that analyst will emphatically reply, “Yes!”
Um, are you sure about that?
I like to follow the Data Trail through the enterprise. I ask (pesky) questions like, “How much time elapses between the time this report query is executed and someone acts on the results?” When I hear the answer, I listen for something we data engineers call grain. And not just any old grain. I’m listening for time dimension grain.
I’ve been doing this for decades. No one has ever used the time dimension grain “seconds” when answering my earlier question.
There are near-real-time data integration use cases, such as data signal processing.
Grain plays a vital role in data warehouse and data integration design – especially time dimension grain. Time dimension grain drives aggregation requirements. It’s a fantastic question to ask on Day One. You want to carefully weigh the answer.
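For example, if the answer to that Day One question is “we act on daily results,” the load can aggregate to daily grain rather than carry every transactional timestamp into the fact table. A sketch, assuming hypothetical source columns:

```sql
-- Aggregate transactional rows to daily time dimension grain
-- during the load.
SELECT   CAST(od.OrderDate AS date) AS OrderDateKey,
         od.ProductID,
         SUM(od.Quantity)  AS TotalQuantity,
         SUM(od.LineTotal) AS TotalSales
FROM     dbo.OrderDetail AS od
GROUP BY CAST(od.OrderDate AS date),
         od.ProductID;
```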
I always ask the preferred grain of the time dimension.
I almost always design the data warehouse to match the time dimension grain of the source data.
I can hear you thinking, “When would you not design the data warehouse to match the time dimension grain of the source data, Andy?” That’s an excellent question. My answer is, “It depends.”
“How Do You Do Analytics Data Integration, Andy?”
I bet you know the answer to that question already. Yep, most of the time SSIS is involved. It may shock you to learn that these days I design more ELT (Extract, Load, and Transform) with the “transform” step built in the relational engine than ETL (Extract, Transform, and Load), especially with SQL Server 2012+. Why? The relational engine often outperforms SSIS Data Flow transformations.
When I execute the “T” in SQL Server, this is almost always accomplished via an SSIS Execute SQL Task. Why? If I’m using an SSIS Data Flow as a “data pump,” I want to log performance metadata (ETL Instrumentation). Since SSIS is my execution engine for the “EL” portion, it’s convenient to collect the same metrics from the “T” part of the process. It’s mighty convenient to collect the metrics in the same place and in the same format, too.
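A minimal sketch of what that looks like – the instrumentation table and names here are hypothetical, not a prescribed schema:

```sql
-- Executed from an SSIS Execute SQL Task so the "T" metrics land in
-- the same place, and in the same format, as the "EL" metrics.
DECLARE @StartTime datetime2(3) = SYSDATETIME();
DECLARE @RowsAffected int;

-- The "T": a set-based transformation in the relational engine.
UPDATE f
SET    f.ExtendedAmount = f.Quantity * f.UnitPrice
FROM   dbo.FactOrderLine AS f;

SET @RowsAffected = @@ROWCOUNT;

-- ETL Instrumentation: record what ran, when, and how many rows.
INSERT INTO dbo.ETLInstrumentation
        (TaskName, StartTime, EndTime, RowsAffected)
VALUES  (N'Transform FactOrderLine', @StartTime, SYSDATETIME(), @RowsAffected);
```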
Advantages of Using SSIS
One nice thing about using a data integration platform like SSIS is that I have the flexibility to choose my integration approach. I can load TableA using ELT and TableB using ETL.
I can also select different design patterns for different table loads: I can truncate and load smaller reference tables and use an incremental load design pattern to load larger tables.
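Both patterns, sketched with hypothetical table names:

```sql
-- Truncate-and-load: simple and fast for smaller reference tables.
TRUNCATE TABLE rpt.DimProductType;

INSERT INTO rpt.DimProductType (ProductTypeID, ProductTypeName)
SELECT ProductTypeID, ProductTypeName
FROM   src.ProductType;

-- Incremental load: only touch new and changed rows in larger tables.
INSERT INTO rpt.DimCustomer (CustomerID, CustomerName)
SELECT s.CustomerID, s.CustomerName
FROM   src.Customer AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   rpt.DimCustomer AS d
                   WHERE  d.CustomerID = s.CustomerID);

UPDATE d
SET    d.CustomerName = s.CustomerName
FROM   rpt.DimCustomer AS d
JOIN   src.Customer    AS s
  ON   s.CustomerID = d.CustomerID
WHERE  d.CustomerName <> s.CustomerName;
```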
Integration approach and design patterns also apply to calculating additional column values: I may choose to use the relational engine for one aggregation and use .NET in SSIS for another calculation.
Executing SSIS in the SSIS Catalog means logging is automatic. Runtime metrics are collected, no fuss, no muss.
Loading a reporting instance relies on SSIS design patterns. If you’re designing a reporting solution for a source database with more than a few tables, consider using Biml to auto-generate SSIS packages.
Not every reporting and analytics database solution requires the level of automation demonstrated in the short (2:32) video shown here, but some will. Spoiler: the video demonstrates performing about 24 hours of work in just over 6 minutes.
High availability solution requirements – combined with analytics and reporting database index strategy, the need for additional reporting columns, and latency constraints – mean many enterprise reporting and analytics solutions require a separate database maintained in some decoupled-from-Production-OLTP manner.
Expert SSIS Training Online with Brent Ozar Unlimited! – 11-12 Sep 2017
IESSIS1: Immersion Event on Learning SQL Server Integration Services – Chicago, 2-6 Oct 2017!
Enterprise Data & Analytics