
Data warehouse automation – a practical walk through

There are several principles to live by when it comes to data warehouse automation and its implementation in real life.
By Julian Thomas, Principal consultant at PBT Group
Johannesburg, 13 May 2021

In my previous article, titled “The value of data warehouse automation to the modern business”, I introduced the concept of data warehouse automation (DWA). In this article, I will provide an example of how DWA can be implemented in real life.

First, I believe that it is important to agree on a few key principles to live by. To my mind, these include:

  • The data warehouse must be metadata-driven; as such, a central repository where all metadata can be stored and exploited effectively is essential.
  • Every process that runs is important and must comply with an existing standard, and be executed using the approved toolset. Nothing executes manually or outside of the approved ETL tools or languages.
  • Again, every process that runs is important and must therefore be logged. Nothing that moves data into or out of the data warehouse can run without providing a clear log and audit trail.

Next, it is critical to talk about metadata. All metadata is important, but some holds more value than the rest, and that is where a business must focus next. The data model, of course, is at the absolute centre and defines the very heart of the solution. One can simplify this to a list of tables, attributes, data types, constraints and relationships in a target model. This is the first set of metadata that must be incorporated into the data warehouse automation framework, as it serves as the anchor that everything else will link to.
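To make this more tangible, below is a minimal sketch of what that first slice of metadata might look like. The table and column names are purely illustrative assumptions; in practice this information would live in repository tables rather than in code, but the shape is the same.

```python
# Minimal, illustrative shape for data model metadata (tables, attributes,
# data types, constraints and relationships). All names are hypothetical.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Column:
    name: str
    data_type: str            # e.g. "VARCHAR(100)", "DATE", "DECIMAL(18,2)"
    nullable: bool = True
    primary_key: bool = False

@dataclass
class Table:
    name: str
    columns: List[Column] = field(default_factory=list)
    references: List[str] = field(default_factory=list)   # relationships to other tables

# Registering one target table and its attributes in the repository:
dim_customer = Table(
    name="dim_customer",
    columns=[
        Column("customer_key", "INTEGER", nullable=False, primary_key=True),
        Column("customer_name", "VARCHAR(100)", nullable=False),
        Column("date_of_birth", "DATE"),
    ],
    references=["dim_region"],
)
```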

Almost as important as the data model is what is referred to as the source to target mapping (STM) specification. This maps each table and attribute in the target model back to the original source tables and attributes, and includes any business logic required to cleanse, validate or transform the input data. The data model tells the business WHAT must be loaded; the STM tells it HOW. Storing the STM in the metadata repository is a tipping point in a business’s ETL maturity, and it opens the door to endless opportunities.
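As a rough illustration, and assuming hypothetical source and target names, an STM entry need capture little more than the following fields:

```python
# Illustrative STM entries: each row maps one target attribute back to its
# source, together with any cleansing or transformation rule. All names and
# expressions are assumptions for the sake of the example.
stm_rows = [
    {"target_table": "dim_customer", "target_column": "customer_key",
     "source_table": "crm.customers", "source_column": "cust_id",
     "transform": None},
    {"target_table": "dim_customer", "target_column": "customer_name",
     "source_table": "crm.customers", "source_column": "cust_name",
     "transform": "TRIM(UPPER(cust_name))"},
    {"target_table": "dim_customer", "target_column": "date_of_birth",
     "source_table": "crm.customers", "source_column": "dob",
     "transform": "CASE WHEN dob > CURRENT_DATE THEN NULL ELSE dob END"},
]
```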

The STM, in conjunction with ETL standards and templates, allows the business to dynamically generate a great deal of its ETL code. In most cases, the code to extract and stage source data is similar and repetitive; it is where a business spends a lot of time, and where much of the human error creeps in. Generating this code dynamically from the metadata addresses both problems, and also makes it possible to roll out changes to existing code rapidly.
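The sketch below shows the idea in its simplest form: a small generator that reads STM rows like those above and renders one repetitive staging extract per source table. The template, naming convention and helper function are assumptions, not a prescribed standard.

```python
# Illustrative generator: renders one staging extract per source table from
# STM rows. Template and naming conventions are assumed, not prescriptive.
from collections import defaultdict

stm_rows = [
    {"source_table": "crm.customers", "source_column": "cust_id"},
    {"source_table": "crm.customers", "source_column": "cust_name"},
    {"source_table": "crm.customers", "source_column": "dob"},
]

def generate_staging_sql(rows):
    """Group STM rows by source table and render a CREATE TABLE ... AS SELECT."""
    by_source = defaultdict(list)
    for row in rows:
        by_source[row["source_table"]].append(row["source_column"])
    statements = []
    for source, columns in by_source.items():
        stage_table = "stg_" + source.replace(".", "_")
        select_list = ",\n    ".join(dict.fromkeys(columns))   # de-duplicate, keep order
        statements.append(
            f"CREATE TABLE {stage_table} AS\nSELECT\n    {select_list}\nFROM {source};"
        )
    return statements

for statement in generate_staging_sql(stm_rows):
    print(statement)
```

Rolling out a change to existing code then becomes a matter of updating the STM and rerunning the generator, rather than editing hand-written routines one by one.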

The cleansing, validation and transformation logic also resides in the STM, and as such, a business can use it to generate transformation and load routines. Of course, there will always be some business rules that will not fit cleanly into a metadata model. But that is OK; the goal is not to be perfect, or to generate 100% of the solution generically. The aim is simply to achieve as much improvement and automation as possible.

Next up are the testing requirements. The business will need to define the relevant tests and expected results for the solution. Test cases can be defined in much the same way as the STM and used to dynamically generate the test code. This can be a little ambitious, though, as once again some tests might be too complex to generate dynamically.

For this reason, I tend to go with a hybrid approach, allowing the simpler test cases to be described and dynamically generated, while the more complex ones are created manually. It is important, however, to register the name and location of the manually created test code in the metadata repository as well. By doing this, the business will have everything it needs to generate and execute test cases, regardless of their complexity.
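A hybrid registry of this kind could look something like the sketch below, where simple checks are described as metadata and rendered into SQL, while complex tests are hand-written scripts whose name and location are registered alongside them. The check types, paths and thresholds are all illustrative assumptions.

```python
# Illustrative hybrid test registry: simple checks are generated from metadata;
# complex checks are manually written scripts registered by name and location.
generated_tests = [
    {"table": "dim_customer", "check": "not_null", "column": "customer_key"},
    {"table": "dim_customer", "check": "min_row_count", "threshold": 1},
]
manual_tests = [
    {"table": "dim_customer", "script": "tests/dim_customer_history_checks.sql"},  # hypothetical path
]

def render_check_sql(test):
    """Turn a metadata-described check into an executable SQL statement."""
    if test["check"] == "not_null":
        return (f"SELECT COUNT(*) AS failures FROM {test['table']} "
                f"WHERE {test['column']} IS NULL;")
    if test["check"] == "min_row_count":
        return (f"SELECT CASE WHEN COUNT(*) >= {test['threshold']} THEN 0 ELSE 1 END "
                f"AS failures FROM {test['table']};")
    raise ValueError(f"No generator for check type: {test['check']}")

for test in generated_tests:
    print(render_check_sql(test))
```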

The last core component of a DWA framework is the ability to build and deploy the solution. “Build” here refers to the process of assembling all the disparate parts of the solution into a deployable artefact. There are important metadata requirements here: a business needs to identify each individual component of its solution (all of which should already be registered in the metadata), record the required version of each, and log the fact that it has been incorporated into a build.

It is important that this can be linked to the relevant change control or task management ID associated with the piece of work, for example, a Jira number that identifies an epic or story. The detailed activity of building the deployment pack, as well as the actual deployment with its resulting status, dates, deployment resources, approvals and so on, must all be logged. In this way, the business has a clear view of what was built (and why) and what was deployed, all linked to specific versions and request numbers, providing a very clear audit trail of what went into the production system.
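Assuming hypothetical field names, a simple version scheme and a Jira-style change key, a build record in the metadata repository might capture no more than this:

```python
# Illustrative build/deployment audit record; field names, version scheme and
# the Jira-style change key are assumptions for the sake of the example.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

@dataclass
class BuildRecord:
    build_id: str
    change_request: str                      # e.g. a Jira epic or story key
    components: List[dict] = field(default_factory=list)   # name/version pairs
    built_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    deployed_at: Optional[datetime] = None
    deployed_by: Optional[str] = None
    approved_by: Optional[str] = None
    status: str = "BUILT"                    # BUILT -> DEPLOYED or FAILED

build = BuildRecord(
    build_id="2021.05.13-001",
    change_request="DWH-123",
    components=[
        {"name": "stg_crm_customers.sql", "version": "1.4"},
        {"name": "load_dim_customer", "version": "2.0"},
    ],
)
# Persisting this record in the metadata repository is what provides the audit
# trail of what was built, why, and what ended up in production.
```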

There are other items that can be linked into a DWA metadata framework, and I encourage businesses to carefully consider them all. The more metadata gathered, the more opportunities there are for automation and improvement.

The items I have listed here are the ones I would recommend focusing on first. They are the low-hanging fruit, and will provide immediate relief, improvement and excitement in the day-to-day world of the business.
