The question whether data should be staged when performing any ETL or import process is certainly not new, and there are probably as many opinions as there are people who have asked the question. I always enjoy seeing the different opinions, and even the bickering over the smallest of details when it comes to this subject.
As it is the case with beauty, the value of differing opinions are in the eyes of the beholder…you are free to accept or dismiss anything you want/need from that opinion. When looking at it objectively though, you will gain the necessary perspective on the subject. But i digress…
The safest answer to this question is probably what you would expect from any technology-related question: It depends. However, this post would not be complete without sharing my own unique perspective on the subject. To illustrate my point of view, let’s do a little word play…note the emphasis in the questions below.
Question: Do I have to stage the data when developing an ETL or import process?
No, it is very possible to develop an entire ETL or data import process without staging the data at all. SSIS (or any other ETL tool for that matter) is very capable of providing the necessary tasks and functionality for most of your requirements. More complex processes may require workarounds because of a lack of specific functionality, and you may end up deciding that it is better to stage the data anyways…but it doesn’t change the fact that staging isn’t an absolute necessity.
Question: Should I stage data when developing an ETL or import process?
In my opinion, YES!!! Staging data is always a good idea, and my reasons for saying that are the following:
- What happens when something goes terribly wrong? If you’re importing data from large flat files or another database system which you don’t have easy access to, how would you go about investigating a data issue?
I like to plan for worst-case scenarios, and if the data was in a staging area I would have easy access to it in a familiar environment. Needless to say that writing SQL queries is much easier (and more powerful) than visual inspection of text/xml/excel files.
- In most cases, set-based operations (using T-SQL) will perform better than the row-by-row processing of SSIS tasks…even if the entire SSIS operation is performed in-memory.
There can and certainly will be exceptions, but I am yet to come across a case where using a staging area, the appropriate indexes and well-written T-SQL performed very poorly and/or were greatly outperformed by the built-in data loading tasks in SSIS.
- I personally do not prefer to change and redeploy SSIS packages for every small change in business logic (change is the norm in BI projects).
Staging the data gives me an opportunity to apply business rules and logic in stored procedures, without the need to edit and redeploy packages when something changes.
There are no hard and fast rules when it comes to development, but I do believe that preparation is necessary to be in the best possible position in all kinds of situations.
Staging the data gives me the buffer I need to reference source data easily and/or re-load data without the need to go back to the source. It also provides me with the flexibility to use other features within SQL Server like indexing, window functions and set-based operations to maximise efficiency…