Back in 2012 (yes, that long ago!) I wrote about staging your data and why it is important when developing an ETL process for a data warehouse. The comments I made then are still valid, but I want to dive a little deeper into the “how” of designing the staging area and process.
The art of simplicity
The concept of staging is not a complicated one, but you shouldn’t be deceived by its apparent simplicity. There’s a lot you can do in this phase of your ETL process, and it is as much a skill as it is an art to get it all right and still make it look simple.
I have a few primary objectives when designing a staging process: Efficiency, Modularity, Recoverability & Traceability. Let’s take a closer look at each of these, along with some ideas & good practices that will help you achieve them…
Efficiency
I want this part of the process to run as efficiently as it can, with minimal impact on any source system it extracts data from. I’d also like to avoid any failures in this first step of the ETL process, so that I don’t have to extract the data again. Speed is everything, and here are some things I like to do to ensure that my staging process is as efficient as it possibly can be:
- Explicit select-queries
- I suspect this is true for most ETL tools, but speaking from experience with SSIS, there is some overhead if you connect to a table as a whole or use a “select *” type of query…even if you limit the attributes later in the data flow. Let the source system do the heavy lifting, but give it the best possible information to do so by explicitly selecting the attributes you want.
- Filter the source data
- It goes without saying that filtering the source data will speed things up. Only extract what you need (but read the section on recoverability first).
- No (or little) in-flight transformations
- Reducing the number of transformations you perform between the source system and your staging area will maximize efficiency, and I like to do my transformations only once the data is staged. I personally have a copy/paste mentality, doing as little as possible to the data between source and staging area.
- I may explicitly convert to a different data type if required, especially if the source system has data types that won’t convert seamlessly to what’s available in the destination (staging database).
- Heaps
- Inserting into tables with no indexes is extremely efficient, and in my experience the benefit outweighs the annoyance (and time) of creating indexes down the line (if required). With a well-designed process, you’ll be surprised at how few indexes you really need.
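Pulling those points together, here is a minimal sketch of the kind of source query I would hand to the ETL tool, using AdventureWorks2019’s Sales.SalesOrderHeader purely as an illustration; the column list, conversions and window size are assumptions:

```sql
-- Sliding-window start date; in practice this comes from metadata (see "Recoverability" below)
DECLARE @WindowStart date = DATEADD(MONTH, -3, CAST(GETDATE() AS date));

-- Explicit column list: no SELECT *, only the attributes the warehouse actually needs
SELECT
    soh.SalesOrderID,
    soh.OrderDate,
    soh.CustomerID,
    CONVERT(varchar(15), soh.AccountNumber) AS AccountNumber,  -- explicit nvarchar -> varchar conversion
    soh.SubTotal,
    soh.ModifiedDate
FROM Sales.SalesOrderHeader AS soh
WHERE soh.ModifiedDate >= @WindowStart;                        -- filter the source data
```

The destination staging table on the other side of the data flow stays a heap, so the insert itself is as cheap as possible.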
Modularity
At times you may want to execute only a part of the process, so the design needs to be modular to accommodate that need. There’s nothing worse than having to extract large amounts of source data just to get to a few tables or records. At a conceptual level, the design should look something like this:
Each block represents a package or process, and at the most granular level one package/process will extract data from one source table into one destination (staging) table. Moving one level up, you also create a package/process that extracts all the data from a source (Source 1 in this example), which gives you the ability to execute that source easily as a unit.
Doing it this way gives you a lot of flexibility and control, and as your environment evolves over time you may also want to create functional units of execution, for instance a parent package/process that extracts all sales data irrespective of source.
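The building blocks in my environment are SSIS packages, but the same hierarchy can be sketched in plain T-SQL stored procedures if that makes it easier to picture; all of the names below are hypothetical:

```sql
-- Most granular level: one procedure per source table, loading one staging table
CREATE OR ALTER PROCEDURE stg.Load_Source1_Product
AS
BEGIN
    TRUNCATE TABLE stg.Product;

    INSERT INTO stg.Product (ProductID, Name, ModifiedDate)
    SELECT p.ProductID,
           CONVERT(varchar(50), p.Name),
           p.ModifiedDate
    FROM   Production.Product AS p;
END;
GO

-- One level up: a parent procedure per source system, so "Source 1" can be executed as a unit
CREATE OR ALTER PROCEDURE stg.Load_Source1
AS
BEGIN
    EXEC stg.Load_Source1_Product;
    EXEC stg.Load_Source1_ProductCategory;     -- assumed to exist, same pattern as above
    EXEC stg.Load_Source1_ProductSubCategory;  -- assumed to exist, same pattern as above
END;
GO
```

A functional unit of execution (“all sales data”, for instance) is then just another parent package/process that calls the relevant children, irrespective of which source they belong to.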
Recoverability
In the event of failure or downtime, how can I ensure that the process recovers “gracefully”…without any code changes and/or hand-holding? Ultimately I’d like this part of the process to be completely hands-off, but still flexible in case I need to make adjustments.
I prefer a sliding window approach when extracting source data, and it usually looks like this:
- The number of days, months or years (I prefer months) to go back when extracting data (i.e. the size of the sliding window) is stored in a metadata table.
- The ETL process extracts the metadata and converts the size of the sliding window into elements that can be used to filter source data. I usually create a stored procedure to do that.
- Source tables that can (or should) be filtered make use of the parameters to filter the incoming data.
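To make that concrete, here is a minimal sketch of the metadata table and the conversion step, assuming the window is expressed in months and that an etl schema exists; all names are hypothetical:

```sql
-- Metadata table holding the size of the sliding window per process
CREATE TABLE etl.ExtractWindow
(
    ProcessName  varchar(100) NOT NULL PRIMARY KEY,
    WindowMonths int          NOT NULL
);

INSERT INTO etl.ExtractWindow (ProcessName, WindowMonths)
VALUES ('Staging', 3);
GO

-- Converts the window size into a start date that the extract queries can filter on
CREATE OR ALTER PROCEDURE etl.GetExtractWindowStart
    @ProcessName varchar(100),
    @WindowStart date OUTPUT
AS
BEGIN
    SELECT @WindowStart = DATEADD(MONTH, -WindowMonths, CAST(GETDATE() AS date))
    FROM   etl.ExtractWindow
    WHERE  ProcessName = @ProcessName;
END;
GO
```

The source queries then simply add a predicate such as WHERE ModifiedDate >= @WindowStart, and widening the window (for a historic re-load, for example) is a single update against the metadata table.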
This approach has many benefits in my opinion. One such benefit is that the ETL process can fail for a day or two (or ten, depending on how large the window is) and be restarted without any code changes or hand-holding once the issues are resolved.
Loading or re-loading historic information is also now as simple as updating a parameter value in the metadata table. I also like the fact that my staging area holds a little more than just the past day’s worth of data, in case I need to troubleshoot an issue with data that was loaded a few days ago.
Does this contradict the “efficiency” mantra? It’s a fine line, but I don’t think so…you can bring in a little more data than needed, while still making sure that the process is as efficient as possible.
Traceability
My measure of success here is how much my staging environment allows me to troubleshoot or answer data-related questions without the need to go back to the source system. The sliding window approach certainly helps with that, but the design of the staging tables themselves plays an important role in providing traceability, both backwards (to the source) and forwards (to the target dimension or fact table).
To illustrate, let’s assume that I am designing an ETL process to populate a Product dimension. The source database (AdventureWorks2019) has 3 different source tables we’d like to get the data from: Production.Product, Production.ProductCategory & Production.ProductSubCategory.
Focusing on the Production.Product table as it is our primary source table and at the same granularity as our target dimension, we execute the following query in our ETL process to extract the data:
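The sketch below is a simplified version of that kind of query; the column list is trimmed and the target data types are assumptions, purely to illustrate the pattern:

```sql
SELECT
    p.ProductID,
    CONVERT(varchar(50), p.Name)          AS Name,           -- nvarchar(50) in the source
    CONVERT(varchar(25), p.ProductNumber) AS ProductNumber,  -- nvarchar(25) in the source
    CONVERT(varchar(15), p.Color)         AS Color,          -- nvarchar(15) in the source
    p.StandardCost,
    p.ListPrice,
    p.ProductSubcategoryID,
    CONVERT(datetime, p.ModifiedDate)     AS ModifiedDate    -- converted explicitly; the target type here is an assumption
FROM Production.Product AS p;
```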
Why all the data type conversions? Without diving into too much detail, I’ve chosen to convert from unicode (nvarchar) to non-unicode (varchar) as I don’t expect to receive any special characters, and converted the ModifiedDate attribute (which is datetime in the source) because I am not very confident in my ETL tool’s ability to handle date and time data types elegantly…yes, I’m looking at you, SSIS!
We can debate the finer points of these conversions on a different platform, but this is to illustrate how I would design my staging table; the source query is just for reference.
Based on the above, the design of my staging table will look something like this (there’s a T-SQL sketch after the list)…
- Source Attributes
- This section contains all the source attributes, with the same name, data type and order as the source, and with as little transformation as possible. Doing it that way makes it super easy to trace a field back to its origin.
- Always allow nulls here to avoid failures in the first (extract) phase of the ETL process. You’d rather deal with the nulls in your staging table than have to go back and extract the data again.
- Metadata
- The InsertDate attribute defaults to the current date & time, and serves as a reference for when the data was inserted into this staging table.
- I always add this at the end of my source attributes, where it acts as a visual boundary between the data I get from the source and the data I derive during the ETL process.
- Keys, Lookups and Derived Attributes
- In this section I store the surrogate key of the target table (to determine whether it’s an insert or an update), all fields that are looked up from other staging tables, and attributes derived from existing fields…usually in that order.
- I persist the info here to avoid duplicate work and to serve as a reference for the transformations that occurred during the process.
- Change Type Indicators
- After identifying what type of change will be required in the target dimension table, I persist it in these two indicators, which in turn simplifies the insert/update process. Most importantly, it gives me valuable information after the fact, especially if something has gone wrong and I end up with unexpected results.
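Putting those four sections together, a T-SQL sketch of the staging table could look like the one below; the source attribute list is trimmed, and the lookup, derived and indicator column names are placeholders:

```sql
CREATE TABLE stg.Product
(
    -- Source attributes: same names, data types and order as the (converted) source, all nullable
    ProductID              int          NULL,
    Name                   varchar(50)  NULL,
    ProductNumber          varchar(25)  NULL,
    Color                  varchar(15)  NULL,
    StandardCost           money        NULL,
    ListPrice              money        NULL,
    ProductSubcategoryID   int          NULL,
    ModifiedDate           datetime     NULL,

    -- Metadata: when the row landed in staging
    InsertDate             datetime     NOT NULL
        CONSTRAINT DF_stg_Product_InsertDate DEFAULT (GETDATE()),

    -- Keys, lookups and derived attributes (populated after the extract)
    ProductKey             int          NULL,   -- surrogate key of the target dimension
    ProductSubcategoryName varchar(50)  NULL,   -- looked up from another staging table
    ProductCategoryName    varchar(50)  NULL,   -- looked up from another staging table

    -- Change type indicators (populated once the change type is known)
    IsNew                  bit          NULL,
    IsChanged              bit          NULL
);
-- Deliberately left as a heap: no indexes, as discussed under "Efficiency"
```

The indicator names (IsNew, IsChanged) are just placeholders; what matters is that the change type is persisted before the insert/update step runs.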
You can get more granular with this design if you really want to, but I have found this approach to be a really good starting point in terms of providing traceability throughout the ETL process.
That’s a wrap!
It’s worth repeating the phrase “the art of simplicity”: while this part of the ETL process may seem simple, you shouldn’t underestimate its importance. Trust me, you’ll reap the rewards many times over if you think things through and pay attention to the details. Happy staging 🙂