Martin's Blog

Building a framework for orchestration in Azure Data Factory: Orchestration

This blog post is part of the Building a framework for orchestration in Azure Data Factory series.





In the context of what we’re talking about throughout this series – facilitating the execution of an ETL process in a platform like Azure Data Factory – orchestration means that we’re using the ETL tool primarily for the “E” (Extract) part of the process. In addition to that, most people I know would also use the ETL tool to facilitate the workflow, in other words the order of execution and any constraints that go along with that.

In what I’d like to call the “traditional” approach for lack of a better term, all parts of the ETL process are performed natively by the tool (image below), using whatever built-in tasks are available and of course accounting for any nuances. With this approach, transformations are typically performed in transit and in memory.

[Image: the traditional ETL process, with all steps performed natively in the ETL tool]
With the “orchestration” approach we’re using the ETL tool to only extract and move the data, and will perform other transformations and loading tasks somewhere else…usually closer to the destination. Strictly speaking we’re really doing “ELTL” as shown in the image below, where the ETL tool is used to extract the source data and load it into a staging area. From there, you will most likely use T-SQL to perform any transformations before loading the data into the dimensional model…assuming you’re using a relational database for that.

[Image: the ELTL process, where the ETL tool extracts and loads data into staging, and transformations are performed closer to the destination]
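To make the "T" and final "L" steps more concrete, here is a hedged T-SQL sketch of what a transformation close to the destination could look like. The table and procedure names (`stg.Customer`, `dim.Customer`, `etl.LoadDimCustomer`) are my own illustration, not part of any specific framework: the assumption is that the ETL tool has already landed raw data in a staging table, and a stored procedure then transforms and merges it into the dimensional model.

```sql
-- Hypothetical example: the ETL tool has already extracted and loaded
-- raw data into stg.Customer; the transformation and final load
-- happen here, close to the destination.
CREATE PROCEDURE etl.LoadDimCustomer
AS
BEGIN
    SET NOCOUNT ON;

    -- Cleanse and conform the staged data, then merge it
    -- into the dimension table.
    MERGE dim.Customer AS tgt
    USING (
        SELECT
            CustomerID,
            COALESCE(NULLIF(CustomerName, ''), 'Unknown')   AS CustomerName,
            UPPER(LTRIM(RTRIM(CountryCode)))                AS CountryCode
        FROM stg.Customer
    ) AS src
        ON tgt.CustomerID = src.CustomerID
    WHEN MATCHED THEN
        UPDATE SET tgt.CustomerName = src.CustomerName,
                   tgt.CountryCode  = src.CountryCode
    WHEN NOT MATCHED THEN
        INSERT (CustomerID, CustomerName, CountryCode)
        VALUES (src.CustomerID, src.CustomerName, src.CountryCode);
END;
```

In an orchestration setup, the ETL tool's only job after the copy is to call a procedure like this one, which is where the compute-heavy work happens.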
I know this may seem like an over-simplification, and there are certainly many variations on what I describe above. I am deliberately not going to delve into those variations or nuances here, as doing so would distract from the overall intent of this series, but let me give you my personal take on the characteristics of orchestration from an ETL perspective:





Why orchestration?

There is no shortage of opinions around "best practices" when it comes to the development of ETL processes, and it's definitely not a hill I'd like to die on. Depending on your own background, skills and preferences you may decide that orchestration is not for you…and that's ok. I have chosen the path of orchestration however, and here are some of the reasons why it has worked out really well for me over the last 20 years:





An additional item to point out, and one that is only relevant to the cloud, is cost. Performing your transformations in transit may be convenient, but it's also the most expensive thing you will do in the cloud. If you use Data Flows in Azure Data Factory, for instance, your overall cost will be significantly higher once you take into account the minimum cluster size of 8 vCores, amongst other factors. This is ultimately the deal-breaker for me and the reason why I advocate for orchestration. It is the most cost-effective approach for cloud-based ETL, and I have yet to see any evidence to refute that.





What to expect next

My goal with this post is to set the stage for the rest of the series. The ADF framework that I'll be introducing uses metadata to facilitate the movement of data (from source to staging), as well as the execution of stored procedures to implement the required transformations. Data Flows will not be covered as part of this, but it should be possible to adapt the framework to work with Data Flows as well…if that is your preference.
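To give a rough feel for the metadata-driven idea ahead of the next post, here is a hypothetical sketch of the shape such a control table could take. The schema, table and column names below are entirely my own illustration and not the framework's actual metadata, which the next post will cover.

```sql
-- Hypothetical illustration of a metadata (control) table that could
-- drive an orchestration framework: one row per entity to copy,
-- plus the stored procedure to run after staging.
CREATE TABLE etl.PipelineMetadata
(
    MetadataID     int IDENTITY(1,1) PRIMARY KEY,
    SourceSystem   nvarchar(50)  NOT NULL,  -- which source connection to use
    SourceObject   nvarchar(128) NOT NULL,  -- table or query to extract
    StagingTable   nvarchar(128) NOT NULL,  -- destination in the staging area
    TransformProc  nvarchar(128) NULL,      -- stored procedure to execute afterwards
    ExecutionOrder int           NOT NULL,  -- drives workflow sequencing
    IsEnabled      bit           NOT NULL DEFAULT 1
);
```

A pipeline would read enabled rows in `ExecutionOrder`, copy each `SourceObject` to its `StagingTable`, and then execute the associated `TransformProc`.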

Stay tuned for the next blog post, where we will be taking a closer look at the metadata that drives the execution within the framework.
