Martin's Blog

Building a framework for orchestration in Azure Data Factory: Orchestration

This blog post is part of the Building a framework for orchestration in Azure Data Factory series.





In the context of what we’re talking about throughout this series – facilitating the execution of an ETL process in a platform like Azure Data Factory – orchestration means that we’re using the ETL tool primarily for the “E” (Extract) part of the process. In addition to that, most people I know would also use the ETL tool to facilitate the workflow, in other words the order of execution and any constraints that go along with that.

In what I’d like to call the “traditional” approach for lack of a better term, all parts of the ETL process are performed natively by the tool (image below), using whatever built-in tasks are available and of course accounting for any nuances. With this approach, transformations are typically performed in transit and in memory.

[Image: the traditional ETL process, with all steps performed natively in the ETL tool]
With the “orchestration” approach we’re using the ETL tool to only extract and move the data, and will perform other transformations and loading tasks somewhere else…usually closer to the destination. Strictly speaking we’re really doing “ELTL” as shown in the image below, where the ETL tool is used to extract the source data and load it into a staging area. From there, you will most likely use T-SQL to perform any transformations before loading the data into the dimensional model…assuming you’re using a relational database for that.

[Image: the ELTL process, where the ETL tool extracts and loads data into staging, and transformations are performed closer to the destination]
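To make the "T" and final "L" steps more concrete, here is a hedged T-SQL sketch of what a transformation close to the destination could look like. The table and procedure names (`stg.Customer`, `dim.Customer`, `etl.LoadDimCustomer`) are my own illustration, not part of any specific framework: the assumption is that the ETL tool has already landed raw data in a staging table, and a stored procedure then transforms and merges it into the dimensional model.

```sql
-- Hypothetical example: the ETL tool has already extracted and loaded
-- raw data into stg.Customer; the transformation and final load
-- happen here, close to the destination.
CREATE PROCEDURE etl.LoadDimCustomer
AS
BEGIN
    SET NOCOUNT ON;

    -- Cleanse and conform the staged data, then merge it
    -- into the dimension table.
    MERGE dim.Customer AS tgt
    USING (
        SELECT
            CustomerID,
            COALESCE(NULLIF(CustomerName, ''), 'Unknown')   AS CustomerName,
            UPPER(LTRIM(RTRIM(CountryCode)))                AS CountryCode
        FROM stg.Customer
    ) AS src
        ON tgt.CustomerID = src.CustomerID
    WHEN MATCHED THEN
        UPDATE SET tgt.CustomerName = src.CustomerName,
                   tgt.CountryCode  = src.CountryCode
    WHEN NOT MATCHED THEN
        INSERT (CustomerID, CustomerName, CountryCode)
        VALUES (src.CustomerID, src.CustomerName, src.CountryCode);
END;
```

In an orchestration setup, the ETL tool's only job after the copy is to call a procedure like this one, which is where the compute-heavy work happens.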
I know this may seem like an over-simplification, and there are certainly many variations on what I describe above. I am deliberately not going to delve into those variations or nuances here, as doing so would distract from the overall intent of this series, but let me give you my personal take on the characteristics of orchestration from an ETL perspective:





Why orchestration?

There is no shortage of opinions around "best practices" when it comes to the development of ETL processes, and it's definitely not a hill I'd like to die on. Depending on your own background, skills and preferences you may decide that orchestration is not for you…and that's ok. I have chosen the path of orchestration however, and here are some of the reasons why it has worked out really well for me over the last 20 years:





An additional item to point out, and one that is only relevant to the cloud, is cost. Performing your transformations in transit may be convenient, but it's also the most expensive thing you will do in the cloud. If you use Data Flows in Azure Data Factory, for instance, your overall cost will be significantly higher once you take into account the minimum cluster size of 8 vCores, amongst other factors. This is ultimately the deal-breaker for me and the reason why I advocate for orchestration. It is the most cost-effective approach for cloud-based ETL, and I have yet to see any evidence to refute that.





What to expect next

My goal with this post is to set the stage for the rest of the series. The ADF framework that I'll be introducing uses metadata to facilitate the movement of data (from source to staging), as well as the execution of stored procedures to implement the required transformations. Data Flows will not be covered as part of this, but it should be possible to adapt the framework to work with Data Flows as well…if that is your preference.
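To give a rough feel for the metadata-driven idea ahead of the next post, here is a hypothetical sketch of the shape such a control table could take. The schema, table and column names below are entirely my own illustration and not the framework's actual metadata, which the next post will cover.

```sql
-- Hypothetical illustration of a metadata (control) table that could
-- drive an orchestration framework: one row per entity to copy,
-- plus the stored procedure to run after staging.
CREATE TABLE etl.PipelineMetadata
(
    MetadataID     int IDENTITY(1,1) PRIMARY KEY,
    SourceSystem   nvarchar(50)  NOT NULL,  -- which source connection to use
    SourceObject   nvarchar(128) NOT NULL,  -- table or query to extract
    StagingTable   nvarchar(128) NOT NULL,  -- destination in the staging area
    TransformProc  nvarchar(128) NULL,      -- stored procedure to execute afterwards
    ExecutionOrder int           NOT NULL,  -- drives workflow sequencing
    IsEnabled      bit           NOT NULL DEFAULT 1
);
```

A pipeline would read enabled rows in `ExecutionOrder`, copy each `SourceObject` to its `StagingTable`, and then execute the associated `TransformProc`.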

Stay tuned for the next blog post, where we will be taking a closer look at the metadata that drives the execution within the framework.
