One of the most appealing features in Azure Data Factory (ADF) is implicit mapping. The benefit is that I can create one dataset and reuse it multiple times without explicitly mapping the source and destination columns. Compare that to Integration Services (SSIS), where columns have to be explicitly mapped and metadata is validated before execution (forcing you to create granular-level packages), and you can understand why it is so appealing to anyone coming from that world…the SSIS world, I mean.
With implicit mapping you don’t specify a schema in the dataset or a mapping in the copy activity, and as long as your destination column names are the same as the source (and in the same order), it works beautifully and is a welcome time-saver, especially when you’re staging data.
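To make that concrete, here is a rough sketch of what such a copy activity looks like in pipeline JSON; the dataset names are hypothetical, and the point is simply that there is no translator or mapping section at all:
{
    "name": "Copy_Staging",
    "type": "Copy",
    "inputs": [ { "referenceName": "DS_Source_Generic", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "DS_Staging_Generic", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": { "type": "AzureSqlSource" },
        "sink": { "type": "AzureSqlSink" }
    }
}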
It’s all about the source
The catch is that implicit mapping only works if the data types of the source can be successfully interpreted and mapped to the destination. As you can imagine, Azure Data Factory maps data types from one SQL Server database to another SQL Server database exceptionally well, while your mileage will vary with less structured sources. ADF actually converts the source’s native data types to interim data types before converting to the destination’s native data types. This double translation works well enough for most relational database sources, but not so well with others, even if the documentation suggests otherwise.
To illustrate this point, let’s extract some data out of Microsoft Dynamics 365 CRM and see what the difference would be between implicit and explicit mapping. Create two similar staging tables and define the source query (FetchXml) in the following way:
-- Staging Table 1
create table [Staging_CRM].[businessunit_Test1](
    [businessunitid] uniqueidentifier null,
    [name] varchar(255) null
);

-- Staging Table 2
create table [Staging_CRM].[businessunit_Test2](
    [businessunitid] varchar(50) null,
    [name] varchar(255) null
);

-- Source Query (FetchXml)
<fetch mapping="logical">
    <entity name="businessunit">
        <attribute name="businessunitid"/>
        <attribute name="name"/>
    </entity>
</fetch>
The businessunitid attribute shown above is a GUID, while the name attribute is alphanumeric. An Azure Data Factory copy activity into businessunit_Test1 works as expected when using implicit mapping, but try to do the same into businessunit_Test2, where the data type of businessunitid is slightly different, and the copy activity fails with a data type conversion error.
Change the copy activity from an implicit to an explicit mapping while still using the businessunit_Test2 staging table, and everything suddenly works.
What’s happening here?!
It’s difficult to say exactly what is happening, especially considering that the documentation indicates a GUID interim data type should convert to a String without any issues. My guess is that the interim data type (inferred by ADF in this case) is not as compatible as it would seem, and the only way around it is to either change the data type of the staging table column or implement explicit mapping. Depending on your scenario, neither of these may be an ideal solution for you.
Dynamic mapping to the rescue
There is an option that gives us the best of both worlds: more control over the data type mapping while still maintaining the flexibility of implicit mapping. But before we dive into the solution, let’s have a look at what happens underneath the covers when we explicitly map columns. If we look at the JSON source code of the pipeline with the explicit mapping shown earlier, this is what you will see:
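Roughly speaking, the copy activity’s typeProperties section ends up looking like the sketch below (the DynamicsSource and AzureSqlSink types are assumed here for a Dynamics-to-Azure SQL copy; the translator is the part we care about):
"typeProperties": {
    "source": {
        "type": "DynamicsSource",
        "query": "<fetch mapping=\"logical\">…</fetch>"
    },
    "sink": { "type": "AzureSqlSink" },
    "translator": {
        "type": "TabularTranslator",
        "mappings": [
            { "source": { "name": "businessunitid", "type": "Guid" }, "sink": { "name": "businessunitid", "type": "String" } },
            { "source": { "name": "name", "type": "String" }, "sink": { "name": "name", "type": "String" } }
        ]
    }
}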
The typeProperties section of the source code contains the translation (or mapping) between the source and target (sink), and the interface allows us to parameterize this and add it as dynamic content:
The most important thing to remember is that the ColumnMapping parameter needs to be defined as type Object, even though you’re just passing a string as the value. In our example, the parameter value includes everything except the "translator": property name itself and will look like the value below. I prefer to store the mapping in a metadata table somewhere and strip out the line breaks to conserve some space:
{"type": "TabularTranslator","mappings": [{"source": {"name": "businessunitid","type": "Guid"},"sink": {"name": "businessunitid","type": "String"}}, {"source": {"name": "name","type": "String"},"sink": {"name": "name","type": "String"}}]}
My advice when implementing dynamic mapping is to first create an explicit mapping that is representative of all the expected data types. After making sure that your data type mappings work, adding this as a parameter should be fairly straightforward. If you’re dealing with JSON source data (think APIs) that contains nested levels, this first step becomes even more valuable, as it will show you how to traverse the JSON structure successfully (see the sketch below).
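For a nested JSON source, the translator uses JSON paths and a collection reference instead of flat column names. The sketch below is only illustrative and the collection and attribute names are made up; it follows the TabularTranslator syntax for hierarchical sources:
{
    "type": "TabularTranslator",
    "collectionReference": "$['value']",
    "mappings": [
        { "source": { "path": "['id']" }, "sink": { "name": "id", "type": "String" } },
        { "source": { "path": "['parent']['name']" }, "sink": { "name": "parentname", "type": "String" } }
    ]
}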
How to do the same in Databricks?
Does this work when I want to load older parquet files that no longer have the same columns into one staging table? Four columns have recently been added to the source table, but the older parquet files do not have these. I want to load the full data lake at night, and right now this returns errors because the column mapping is not the same for all files.
I am not familiar enough with Parquet sources to definitively answer your question. I know it works for JSON sources (returning null for attributes that don’t exist), and given that Parquet makes provision for optional attributes, my initial gut feeling would be yes…but you’d have to test it to make sure.
very helpful. Thanks