This blog post is part of the Building a framework for orchestration in Azure Data Factory series.








The orchestration layer of the framework is where all the magic happens. It facilitates the execution of processes and/or tasks as defined in the metadata, and needs to do it both seamlessly and efficiently. Ideally you would want to deploy this layer only once, and never have to touch it again. And it is really with that in mind that I designed this layer…to function independently and with minimal dependencies in both directions.

I would have loved for this layer to consist of only one pipeline, but there are some nuances in Data Factory that make that impossible, the primary one being that you cannot nest ForEach activities. As a result, this layer contains three pipelines, covered in more detail in the sections below.

Pipeline: Main

This pipeline is the single point of entry for all process and/or task executions, and for lack of a better name I just call it Orchestrator – Main. It has only two functions:

  1. Receive the details of what to execute (the process or task) via parameters.
  2. Get a unique list of execution sequences from the metadata.







Before explaining the second function a little more, let’s take a closer look at the list of parameters for this pipeline (image below):

  • Subscription ID & Resource Group: The API request to execute a pipeline needs the GUID of the Subscription, the name of the Resource Group, and the name of the Data Factory to which it belongs. System variables give us access to the Data Factory name, but if you have multiple Data Factory resources (or multiple Resource Groups) you may have to use a different method to get those.
  • Environment: The metadata tables make provision for different environments, and this parameter will be passed into the stored procedure that returns the relevant information. To mitigate some risk, I prefer to use a non-production value as default here.
  • Process Name & Task Name: In conjunction with the Environment parameter, these two values identify the process or task I want to initiate and will be used when extracting the metadata. The metadata stored procedure accounts for the “All” value, which gives me the ability to specify the name of a process only and initiate every task associated with it. The logic in the same procedure also prevents the execution of all processes and all tasks at the same time, a much-needed safety mechanism to avoid potential chaos.
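To make the parameter list concrete, the REST request used to run a pipeline targets the Azure management endpoint shown below. This is a minimal sketch of how the URL and body fit together; the parameter values are hypothetical examples (authentication and the actual HTTP call are omitted).

```python
# Sketch of the Data Factory "Create Run" REST endpoint that the
# orchestration layer relies on. The subscription ID, resource group and
# factory name are the same values held as default parameters in Main.

def create_run_url(subscription_id: str, resource_group: str,
                   factory_name: str, pipeline_name: str) -> str:
    """Build the management endpoint that starts a pipeline run."""
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.DataFactory"
        f"/factories/{factory_name}"
        f"/pipelines/{pipeline_name}/createRun"
        "?api-version=2018-06-01"
    )

# The POST body carries the pipeline parameters; these names mirror the
# parameter list above, with hypothetical example values:
example_body = {
    "Environment": "Development",     # non-production default, as recommended
    "ProcessName": "LoadDimensions",  # hypothetical process name
    "TaskName": "All",                # "All" initiates every task in the process
}
```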







If you’ve been following the blog series so far you will have noticed the consistent approach in dealing with values that won’t change often, storing them as default parameter values (i.e. the Subscription ID & Resource Group) as opposed to in the metadata tables. The approach is intentional, and something I want to stress again even if it’s becoming annoying at this point.

There are many decisions to be made when you build a framework like this, and most of the time it would be a tradeoff between functionality and complexity. The more bells & whistles you want, the more complexity you will introduce into the solution. It’s ok if you end up there, but my advice to you is that you shouldn’t start there…start with simplicity & consistency in mind, and only change it for the sake of features you will need at least 80% of the time. And by that I don’t mean that you shouldn’t build cool functionality or account for exceptions, but do so with intention and not just for the sake of doing it. This approach will save you an incredible amount of effort, and your future self will thank you!





I digress…back to the second function of this pipeline, which is also the first step within it: getting a unique list of execution sequences. If we have a list of tasks we’d like to initiate, it’s important to know what should be done in parallel and what should wait until a previous task is complete. The combined execution sequences of the process and task (in the metadata tables) dictate the behavior, and I found the best way to start is to get the list of iterations we need to perform.

In the following example (image below), if we are initiating all of the tasks to load Dimension tables (ProcessKey = 2) we will need to perform four iterations (one for each unique task execution sequence).








The GetExecutionSequences stored procedure will take the execution sequence number of the process, multiply it by some arbitrary large number (10,000 if you look at the code), add the sequence number of the task and return that unique list of values. In the example above we will end up with: 20001, 20002, 20003 and 20004. As we enumerate this list, we can go back to the metadata tables and perform the same logic to get all tasks with a specific number and know that if there is more than one, they need to be executed in parallel.
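The encoding performed by GetExecutionSequences can be sketched in a few lines. This is an illustrative re-implementation of the logic described above, not the stored procedure itself:

```python
def combined_sequences(process_seq: int, task_seqs: list[int],
                       multiplier: int = 10_000) -> list[int]:
    """Encode process and task execution sequences into one sortable number.

    The process sequence is multiplied by an arbitrarily large number and the
    task sequence is added; tasks that share a combined value can run in
    parallel, while distinct values run one after another.
    """
    return sorted({process_seq * multiplier + t for t in task_seqs})

# ProcessKey = 2 with tasks at sequences 1, 1, 2, 3, 4 yields four iterations:
print(combined_sequences(2, [1, 1, 2, 3, 4]))  # → [20001, 20002, 20003, 20004]
```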

In theory the logic here is not absolutely necessary, because I have put some guardrails in place to avoid the execution of multiple processes at the same time. Nevertheless, I prefer to have it in case I decide to change that in future. Good housekeeping is never a bad thing.

The second activity will iterate the list of sequence numbers and initiate the Get Tasks to Execute pipeline for each one sequentially.

Pipeline: Get Tasks to Execute








The Get Tasks to Execute pipeline receives all of the parameters from the main entry point, plus the execution sequence number and the GUID from the initiating pipeline (image below). With all of this information, the pipeline can now go back to the metadata tables and get the details of all the tasks that have to be executed as part of this iteration.

The GetTasksToExecute stored procedure takes care of that (the first step in the pipeline), and the tasks it returns will be executed in parallel this time…the key difference between the ForEach activities in the two orchestration pipelines thus far.
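The effect of GetTasksToExecute for one iteration can be sketched as a filter over the metadata: given the current combined sequence number, return every task that matches it, knowing that more than one match means parallel execution. The row and field names here are illustrative, not the actual table schema:

```python
# Illustrative metadata rows; the real values come from the metadata tables.
tasks = [
    {"task": "DimCustomer", "process_seq": 2, "task_seq": 1},
    {"task": "DimProduct",  "process_seq": 2, "task_seq": 1},
    {"task": "DimDate",     "process_seq": 2, "task_seq": 2},
]

def tasks_for_sequence(rows, sequence, multiplier=10_000):
    """Return the tasks belonging to one combined execution sequence.

    More than one match means those tasks are safe to run in parallel.
    """
    return [r["task"] for r in rows
            if r["process_seq"] * multiplier + r["task_seq"] == sequence]

print(tasks_for_sequence(tasks, 20001))  # → ['DimCustomer', 'DimProduct']
```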

You’ll also notice that the same stored procedure serves a dual purpose, as it is used by the worker pipelines as well.








Pipeline: Execute Task








This pipeline is responsible for the initiation of the required worker pipeline (from the task metadata), the monitoring of its execution and the necessary mitigation if there is a failure. The mitigation in this context is to stop any other pipelines from running, because a ForEach activity in Azure Data Factory will complete all of its iterations…even when a failure occurs.

At face value this pipeline appears to be very simple, but there’s quite a bit of fancy footwork going on…the kind that needs some explanation. Let’s start with the variables:

  • Execution Status – This variable is used to store the status of the worker pipeline, which we will continuously monitor. It has a default value of “Unknown”, which serves as the pipeline’s cue that we haven’t executed anything yet.
  • Error Message – This variable will contain the error message, if any errors occur during the execution of the worker pipeline.
  • Run ID – The GUID of the execution is required to monitor the status of the running worker pipeline, and this variable will be used to store its value.




The Switch activity is where the fancy footwork happens. It is encapsulated by an Until activity that forces the pipeline to keep running until a certain condition is met. The condition in our case is either the completion of the worker pipeline or an error, and the Switch activity will take the necessary steps depending on where we are with the execution.








The image above shows the two lanes within the Switch activity, and it’s a little confusing because of the alphabetic ordering of the conditions. The condition of the bottom lane (ExecutionStatus = “Unknown”) will be met the first time this activity runs, and the following will happen:

  1. The worker pipeline will be initiated via the ADF API, and the API response will return the GUID of the pipeline execution, which we will need in order to monitor its progress.
  2. The Execution Status & Run ID variable values will be set, to indicate that the worker pipeline is in progress.




The Until activity’s exit condition hasn’t been met at this point, and we’ll hit the Default lane this time around and do the following:

  1. Wait – There’s no point in checking the execution status every second, so we’ll wait a little before doing that. I usually set this to 10 seconds as a default, but you’ll probably want to find an appropriate number that works for you. Don’t set it too high though, because runtime equals money in the cloud and every bit adds up.
  2. Monitor – Get the execution status of the worker pipeline, again via the Data Factory API.
  3. Update Variables – Update the values of the variables according to the output from the previous monitoring step.
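The Until/Switch combination boils down to a polling loop. A minimal sketch, assuming a get_status callable that wraps the Data Factory “Get Pipeline Run” API (authentication, the initiation step and error handling are omitted):

```python
import time

def wait_for_run(get_status, run_id: str, poll_seconds: int = 10,
                 sleep=time.sleep) -> str:
    """Poll a worker pipeline run until it leaves the In Progress state.

    `get_status` is assumed to return the run status string exposed by the
    Data Factory API (e.g. "InProgress", "Succeeded", "Failed").
    """
    status = "Unknown"  # mirrors the Execution Status variable's default
    while status in ("Unknown", "InProgress"):
        if status != "Unknown":
            sleep(poll_seconds)  # don't hammer the API; runtime costs money
        status = get_status(run_id)
    return status
```

The injectable `sleep` is only there to make the sketch testable; in the real pipeline the Wait activity plays that role.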




As soon as the monitoring step returns a result other than “In Progress”, the exit condition of the Until activity will be met and the last few steps will be executed in the event of a failure:

  • Fail the current orchestration pipeline. The benefit we get out of this is visibility in the logs.
  • Fail the highest parent pipeline (the main entry point). We do this via the API again because it gives us the ability to recursively cancel every child pipeline as well, after which everything comes to a beautiful and complete stop.
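The recursive cancellation mentioned above maps to the Cancel endpoint of the Data Factory REST API, where the isRecursive query parameter does the work of stopping every child run as well. A sketch of the URL (the HTTP POST and authentication are omitted):

```python
def cancel_run_url(subscription_id: str, resource_group: str,
                   factory_name: str, run_id: str,
                   recursive: bool = True) -> str:
    """Build the management endpoint that cancels a pipeline run.

    With isRecursive=true the cancellation cascades to child pipeline runs,
    which is what brings the whole framework to a complete stop on failure.
    """
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        "/providers/Microsoft.DataFactory"
        f"/factories/{factory_name}"
        f"/pipelineruns/{run_id}/cancel"
        f"?isRecursive={'true' if recursive else 'false'}"
        "&api-version=2018-06-01"
    )
```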




No logging steps?!

Yes, you may have noticed that there are no log tables or custom logging steps in this framework. I try to avoid custom logging wherever possible, and I feel that Azure provides enough visibility through the built-in logs and Log Analytics for troubleshooting and monitoring purposes.

My recommendation would be to use Log Analytics to record all of the activity in Data Factory, and Meagan Longoria has a good blog post with the steps required to enable it. The cost is minimal and the benefits are enough to make this a no-brainer in my opinion. If custom logging is your preference (or requirement), it would be easy enough to add some extra steps to the orchestration pipelines of this framework.

Reminder: Data Factory templates

As part of this blog series, I am publishing the templates of my framework in my public GitHub repo. They are free for anyone to use, either as a starting point for their own framework or as a complete solution to play around with.

I recommend that you attempt to deploy these in a test environment first, and please read the documentation before doing so. It contains important information you will need for a successful deployment.

The documentation for the orchestrator pipelines: Readme – Orchestrators

The documentation for the worker pipelines: Readme – Workers

The ARM templates: Data Factory Templates
