This blog post is part of the Building a framework for orchestration in Azure Data Factory series.

Before we dive into the details of the Data Factory pipelines, it is worth explaining the conceptual structure of my framework and its components. How it all fits together is important, and after reading the post on the metadata as well, the pieces of the puzzle will hopefully start falling into place.

When I started thinking about what I’d like the framework to do, three conceptual layers started to emerge, and we’ll review them from the bottom up:

Workers

Worker pipelines perform the actual work during the ETL process, like copying the data from a source system (into staging) or executing a stored procedure to load a dimension or fact table.

Their function is singular and specific, and because of the metadata structure you will only need one worker pipeline for each unique source system, type of source system (if you prefer) or type of task. This part may not be super obvious right now, but hang in there, because it will make perfect sense when we look at the actual pipelines.

To remain autonomous and allow testing independently of the rest of the framework, each worker pipeline receives the task ID (or key) as its only parameter. And while this means that every worker pipeline needs to retrieve its own metadata instead of receiving it from the initiating (parent) pipeline, it goes a long way towards achieving my goals of simplicity and repeatable functionality.
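The framework itself is built from Data Factory activities, but a small Python sketch can make the idea concrete: the worker below receives only a task ID and looks up its own metadata before doing its one job. The etl.Task table, its columns and the pyodbc connection string are hypothetical stand-ins for the metadata store, not the framework’s actual objects.

```python
import pyodbc

def run_worker(task_id: int, conn_str: str) -> None:
    """Sketch of a worker: receives only a task ID and retrieves its own metadata."""
    with pyodbc.connect(conn_str) as conn:
        row = conn.cursor().execute(
            # Placeholder table and column names for the framework's metadata store.
            "SELECT SourceQuery, TargetTable FROM etl.Task WHERE TaskId = ?",
            task_id,
        ).fetchone()

    if row is None:
        raise ValueError(f"No metadata found for task {task_id}")

    source_query, target_table = row
    # A real worker would now perform its single, specific job, e.g. copy the
    # results of the source query into the staging table named in the metadata.
    print(f"Task {task_id}: copying '{source_query}' into {target_table}")
```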

Orchestrators

Orchestrators facilitate the execution of all the worker pipelines, enforcing the sequence defined by the metadata and monitoring the pipeline executions for any runtime errors. Monitoring is critical because of the way the API works: the framework uses it to run the required worker pipelines, and the call returns as soon as a run has been created rather than waiting for it to finish, so the orchestrator has to check the outcome of each run itself.
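To show why that matters, here is a hedged Python sketch of the pattern using the Data Factory REST API: createRun starts a worker pipeline and returns a run ID immediately, so the caller has to poll the pipeline run until it reaches a terminal state. The TaskId parameter name is an assumption, the Azure AD token acquisition is omitted, and the framework itself performs the equivalent calls from within ADF rather than from Python.

```python
import time
import requests

API = "https://management.azure.com"
API_VERSION = "2018-06-01"

def run_and_monitor(token: str, subscription: str, resource_group: str,
                    factory: str, pipeline: str, task_id: int) -> str:
    """Start a worker pipeline via the ADF REST API and poll until it finishes."""
    headers = {"Authorization": f"Bearer {token}"}
    base = (f"{API}/subscriptions/{subscription}/resourceGroups/{resource_group}"
            f"/providers/Microsoft.DataFactory/factories/{factory}")

    # createRun returns as soon as the run is created -- it does not wait.
    response = requests.post(
        f"{base}/pipelines/{pipeline}/createRun?api-version={API_VERSION}",
        headers=headers,
        json={"TaskId": task_id},  # the worker's only parameter (assumed name)
    )
    response.raise_for_status()
    run_id = response.json()["runId"]

    # Hence the monitoring loop: check the run status until it is terminal.
    while True:
        status = requests.get(
            f"{base}/pipelineruns/{run_id}?api-version={API_VERSION}",
            headers=headers,
        ).json()["status"]
        if status in ("Succeeded", "Failed", "Cancelled"):
            return status
        time.sleep(30)
```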

I like keeping this layer logically separate from the worker pipelines, because most of the fancy footwork resides here even though it won’t have to change frequently. This means that you can deploy this layer once and not touch it again unless the mechanics of the framework change. With this approach you’re mitigating risk, something that is overlooked in many frameworks in my opinion.

Controllers

The tasks within the pipelines of this layer will initiate the execution of a single process, the highest-level “unit of execution” in our metadata structure.

As I alluded to in the previous blog post, I prefer to have this explicit boundary, as it would be risky to have the option to execute everything. The metadata tables may contain the details of many different processes and tasks, some of which may not need to run every single time your ETL process runs. Think of maintenance tasks you only want to run once a week, or year-end processes that only run at the end of a fiscal year.

All of these are good enough reasons for me to restrict the framework from executing everything at the same time, and hence this design.
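As a sketch of that boundary (again with hypothetical table and column names), a controller’s entry point would accept exactly one process identifier and pass only that process’s tasks on to the orchestration layer, so there simply is no “run everything” path:

```python
import pyodbc

def run_process(process_id: int, conn_str: str) -> None:
    """Controller sketch: initiates exactly one process, never the whole catalogue."""
    with pyodbc.connect(conn_str) as conn:
        # Fetch only the tasks that belong to this single process, in the
        # sequence defined by the metadata (placeholder names throughout).
        tasks = conn.cursor().execute(
            "SELECT TaskId FROM etl.Task WHERE ProcessId = ? ORDER BY ExecutionOrder",
            process_id,
        ).fetchall()

    for (task_id,) in tasks:
        # In the real framework the orchestration layer runs and monitors each
        # worker pipeline here; a print keeps the sketch self-contained.
        print(f"Process {process_id}: orchestrating task {task_id}")
```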

Show me the code already!!

This blog series is as much about the process of defining and articulating the requirements of an ETL framework as it is about the technical artifacts. If this post feels like a dangling carrot forcing you to read one more post before getting into the meat of things, see it as an attempt to bring you back to the foundational steps of designing such a framework instead.

Andy Leonard (twitter | blog) has a great blog post about his process of “scaffolding” to design a new solution or pipeline. Whether you choose to do it with “dummy” pipelines in ADF like Andy or use some other tool to create a wire-frame, stepping back to think about a solution from a conceptual rather than a tactical perspective is the best thing you can do to help yourself. It takes a lot of practice not to fall into the trap of writing code from the outset, especially when that is your primary job.

Before embarking on the adventure of building your own framework, define what is important to you (your goals) and identify the conceptual steps (or layers) of the process in an abstract manner before jumping into writing the code. Trust me, it will save you a lot of time and frustration down the road…
