Martin's Blog

Building a framework for orchestration in Azure Data Factory: Metadata

This blog post is part of the Building a framework for orchestration in Azure Data Factory series.





The metadata that drives the execution within a framework is probably the most critical part. Going back to our analogy of building a house, the metadata would be the foundation. It is here where you are going to make some architectural decisions outside of which the framework cannot operate.

One such decision is how configurable or flexible you’d like the framework to be. In other words, how many attributes would you like to be dynamic and/or have the option to change during execution? It seems like an easy choice, and most engineers would lean towards “everything” or “as much as possible” as an answer. In reality, however, the trade-off is complexity: the more dynamic you make the framework, the more complicated it becomes. And you pay for that complexity later, when you need to maintain it or add new functionality.

I made that mistake at first, and quickly realized that the metadata monster I was building would not be sustainable in the long run. The answers to the following questions helped me settle on the final design:





<side note> Have you ever wondered why ETL frameworks and/or automation tools have not been more widely adopted as a standard? I have, and I’ll wager an opinion…it’s because these tools force you to buy into their methodology, and in my opinion most ETL processes hardly fit the cookie-cutter profile. Is there a lot of repetition? Yes, but there’s also enough variation to make it extremely difficult for those tools to be a perfect solution. </side note>





I can probably ramble on about metadata considerations for a few more paragraphs, but I think it’s better explained by showing what I ended up with and the reasons for it. The image below shows the metadata structure I settled on, and I’ll explain the individual pieces in a bit more detail.








I ultimately decided on two hierarchical levels of granularity (processes & tasks), which I feel give me enough flexibility to control the order of execution, provide a mechanism for parallel execution, and don’t add too much complexity.
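To make the two-level hierarchy concrete, here is a minimal sketch of it as Python dataclasses. The attribute names (`process_id`, `execution_order`, `enabled`, and so on) are my own illustrative assumptions, not the framework's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    # The most granular unit of execution, e.g. copying one table.
    task_id: int
    task_name: str
    enabled: bool = True

@dataclass
class Process:
    # A logical group of tasks, e.g. "copy source system X to staging".
    process_id: int
    process_name: str
    execution_order: int = 1                   # controls sequencing between processes
    tasks: list = field(default_factory=list)  # tasks within a process can run in parallel

staging = Process(1, "CopySourceToStaging", tasks=[
    Task(1, "CopyInvoices"),
    Task(2, "CopyCustomers"),
])
```

The point of the sketch is the shape: processes impose order, while the tasks inside a process are candidates for parallel execution.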





Table: Process

A process in the context of this framework will be something like copying data from a specific source system to the staging area, or executing a series of stored procedures to load dimension tables. It forms the rigid boundary we discussed in an earlier post, and one I am happy to live with during the execution of my ETL process.

The explicit boundary means that I will need some other way of handling the execution of an entire ETL process from start to finish. It’s a deliberate choice that will become clearer towards the end of the series. Apologies for the suspense but the context is necessary. #SorryNotSorry





Here are some of the metadata attributes I store in the Process table:








Table: Task

A task is the most granular unit of execution. Given the description of what a process might be, a task will be the copying of an individual table from a source system, or the execution of a single stored procedure to load a dimension/fact table. The attributes I like to store in the Task table are the following:





Table: Task Query

At this point you may be asking why I have a separate table for queries, if a task is the most granular unit of execution. I deal with some customers that have multiple instances of the same ERP system. These are at best separate (identical) databases on the same server, and at worst distributed across different physical servers. Extracting data from these is usually cumbersome, because you either have to duplicate a lot of code if you have separate staging tables, or build some mechanism to combine the data before transforming and loading it.

The separate metadata table accounts for this scenario, with multiple source queries that can populate the same staging table. It may not be as useful for you, but it is extremely useful to me for those reasons. Here are the attributes of this table:
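The one-task-to-many-queries idea can be sketched as follows. The rows and column names (`task_id`, `connection`, `query`) are hypothetical, but they show two ERP instances feeding the same staging table through a single task:

```python
# Illustrative metadata rows only; names and values are assumptions.
task_queries = [
    {"task_id": 10, "connection": "ERP_Instance_EU", "query": "select * from Invoices"},
    {"task_id": 10, "connection": "ERP_Instance_US", "query": "select * from Invoices"},
]

def queries_for_task(rows, task_id):
    # All queries belonging to one task land in the same staging table,
    # so the copies are combined before transformation.
    return [r for r in rows if r["task_id"] == task_id]
```

At execution time the framework would run every query returned for a task, each against its own connection.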








Table: Parameter

The Parameter table is nothing more than a store for key/value pairs, for any global parameters you may need. My primary use for it is to store the number of days or months I’d like to go back when extracting source data (a sliding window approach), which is then translated into a date value and injected into source queries where needed.

Yes, I prefer to use a sliding-window approach and develop my ETL processes in such a way that re-extraction doesn’t cause duplicate data. I’d much rather extract a little more data than have to hand-hold the process after a failure…and the sliding-window approach makes that possible.
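Translating the key/value parameter into a date is simple enough that a short sketch covers it. The parameter names and the 20-day value are assumptions for illustration:

```python
from datetime import date, timedelta

# Hypothetical key/value pairs, as they might sit in the Parameter table.
parameters = {"Daily_WindowDays": 20, "Hourly_WindowDays": 1}

def window_start(run_date, window_days):
    # Re-extracting the same window after a failure returns the same rows,
    # so loading idempotently into staging avoids duplicates.
    return run_date - timedelta(days=window_days)

start = window_start(date(2024, 1, 21), parameters["Daily_WindowDays"])
# start is 2024-01-01 for a 20-day daily window
```

The resulting date is what gets injected into the source queries, as shown next.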








As an example, the image above shows two parameter values…one that determines the size of the sliding window for a daily load, and one for an hourly (or more frequent) load. For daily loads I may want to go back 20 days, whereas for an hourly load I may only want to load the last day’s worth of data.

Task queries may look something like this when they contain a placeholder for a dynamic value, and the stored procedure that extracts the metadata will inject the necessary date value before it is passed into the Copy Activity:





With placeholder: select field1, field2, ... from Invoices where InvoiceDate >= '@[Xero_StartDate]' 

After extraction/injection: select field1, field2, ... from Invoices where InvoiceDate >= '1/1/2024'
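The injection step itself is just a token substitution. In my framework it happens inside the metadata stored procedure, but a minimal Python sketch of the same idea (placeholder pattern `@[Name]` taken from the example above, everything else assumed) looks like this:

```python
import re

def inject_parameters(query, parameters):
    # Replace each @[Name] placeholder with its value from the
    # Parameter table (represented here as a plain dict).
    def replace(match):
        return parameters[match.group(1)]
    return re.sub(r"@\[(\w+)\]", replace, query)

query = "select field1, field2 from Invoices where InvoiceDate >= '@[Xero_StartDate]'"
resolved = inject_parameters(query, {"Xero_StartDate": "1/1/2024"})
# resolved: "select field1, field2 from Invoices where InvoiceDate >= '1/1/2024'"
```

Because the substitution happens before the query reaches the Copy Activity, the pipeline itself stays generic and knows nothing about dates or windows.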




Stored Procedures

There are a few stored procedures to extract the necessary metadata for pipeline execution. The contents are relatively straightforward, and the usage will become clear once we start talking about the actual pipelines of the framework. For that reason I will not discuss them in detail here, but they are available in my GitHub repo.





Want the source code for all the metadata tables? Have a look at my GitHub repo.
