Building a framework for orchestration in Azure Data Factory: Recap

Martin Schoombee

2 years ago

This blog post is part of the Building a framework for orchestration in Azure Data Factory series.

We’re wrapping up this series with a short recap of the most important bits and pieces…

Frameworks are extremely useful when they are thoughtfully designed and implemented. I have seen both sides of the coin, but what I see most often is the lack of any framework at all. Typically there are some naming conventions and coding standards, but many companies miss the opportunity to take it one step further and reduce the inefficiencies of repetitive tasks. There’s a ton of repetition in ETL processes, and in my opinion that gives us a really good opportunity to streamline the way we do things with a well-designed framework.

If I have to highlight the most valuable things I’ve learned from building this framework, it would be the following:

Define your goals first. A framework needs to be developed with intent, and not knowing where you’d like to end up will make the process extremely frustrating.
Set your boundaries. Without a solid foundation and knowing what the absolute “must haves” versus the “nice to haves” are, you may end up with a bunch of cool but useless functionality. Don’t fall into that trap.
Know your audience. Who are you developing this for and what are their skills, strengths and weaknesses? If you have a mature team, adding some complexity may not be a big deal…but if you don’t or if you’re a consultant like I am, you probably want to lean towards simplicity and ease of use.
Think ahead. The framework is not just about you, it’s about the person who needs to take it over after you’re no longer there. Account for future growth or changes within the environment. For instance, how much work would it be to move everything to a new subscription or resource group (if you’re in the cloud), or what would happen if data volumes double in the next few years? Thinking ahead and about the bigger picture is critical, and if you don’t have those skills in-house then I would recommend that you find a good consulting partner who can help you with the big-picture items (the architecture).
Eliminate repetition. Repetition is great when you’re learning a new skill, but why would you want to repeat the same twenty steps each time a new pipeline has to be created? There’s a better way, and it is whichever approach eliminates the mundane repetition.
Learn new skills. Developing ETL processes is not just about creating SSIS packages or ADF pipelines, or even stored procedures. If you’re in the cloud, learning about ARM templates to speed up coding or PowerShell to automate deployments will turn out to be invaluable. When you embark on building a framework, think about the secondary skills that could help you in that process.
Time is money. You should always develop with cost containment in mind, but even more so when you’re in the cloud. My advice would be to start with the options that result in the lowest cost first, and then adjust based on your goals and/or requirements.

Where to from here?

Once you’ve successfully built a metadata-driven framework, the most tedious task will be maintaining the metadata itself. Let’s face it, storing all the source queries and column mappings in a database is super useful, but updating them when you need to add a new attribute, for instance, will be very painful…and this is where some big-picture thinking helps.

Think about the ways in which you can address that challenge. I have, and have developed a process whereby I maintain the metadata in an Excel workbook and use PowerShell to generate source queries and column mappings. I take that one step further and generate statements to recreate staging tables, as well as generate and deploy the ARM templates when I’m ready with some changes or have a new environment. Say what you’d like about Excel, but using it in this fashion to maintain the metadata and document your system at the same time is incredibly useful, and much less frustrating than formatting JSON code that’s stored in a database table.
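To make the idea a little more concrete, here is a minimal sketch of generating a source query and a column mapping from tabular metadata. It is written in Python rather than the PowerShell I mention above, and it is not the code from my own process. The metadata rows, their column names (source_table, source_column, sink_column) and both function names are hypothetical, made up purely for illustration. The mapping is shaped like the TabularTranslator structure that the ADF copy activity accepts, on the assumption that this is the kind of JSON you would otherwise be maintaining by hand.

```python
import json

# Hypothetical metadata rows, standing in for what might be exported from a
# workbook. The keys are illustrative and not taken from the actual framework.
METADATA = [
    {"source_table": "dbo.Customer", "source_column": "CustomerID", "sink_column": "customer_id"},
    {"source_table": "dbo.Customer", "source_column": "FullName", "sink_column": "customer_name"},
    {"source_table": "dbo.Customer", "source_column": "Email", "sink_column": "email"},
]


def build_source_query(rows):
    """Build a SELECT statement for the single source table described by rows."""
    tables = {row["source_table"] for row in rows}
    if len(tables) != 1:
        raise ValueError("expected rows for exactly one source table")
    columns = ", ".join(f"[{row['source_column']}]" for row in rows)
    return f"SELECT {columns} FROM {tables.pop()}"


def build_column_mapping(rows):
    """Build a copy-activity style column mapping (TabularTranslator shape)."""
    return {
        "type": "TabularTranslator",
        "mappings": [
            {"source": {"name": row["source_column"]}, "sink": {"name": row["sink_column"]}}
            for row in rows
        ],
    }


if __name__ == "__main__":
    print(build_source_query(METADATA))
    print(json.dumps(build_column_mapping(METADATA), indent=2))
```

With something like this in place, adding a new attribute means adding one row to the metadata and regenerating, instead of editing a query and a JSON document separately.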

The options are plenty, and I hope that this series has given you some ideas on how you could go about creating your own framework…or at minimum a head start towards that goal.

<side note> I am thinking about doing some training based on this series, and I’d love to hear your feedback. Please send me a DM on social media or leave a comment on this blog post if that is something that would interest you </side note>

Reminder: Data Factory templates

As part of this blog series, I have published the templates of my framework in my public GitHub repo. They are free for anyone to use, either as a starting point for their own framework or as a complete solution to play around with.

I recommend that you attempt to deploy these in a test environment first, and please read the documentation before doing so. It contains important information you will need for a successful deployment.

The documentation for the orchestrator pipelines: Readme – Orchestrators

The documentation for the worker pipelines: Readme – Workers

The ARM templates: Data Factory Templates
