Tableau Data Blending Saved Me Two Days' Work!
Data blending came to the rescue when I had to add a new metric to a KPI dashboard!
Picture the scene - you have two years’ worth of clickstream data - that’s a record for every click, across every device, across multiple sites that get millions of hits per day.
The problem
You have a Tableau KPI dashboard which reads in multiple summary tables created from this huge clickstream data. And now, the requirements have changed - you need to add another metric.
The dashboard has many tables feeding into it due to the different summary logic used to calculate the metrics at campaign level, page level, and site level. Specifically, the way transactions were attributed to each customer depended on how many pages and/or campaigns they had viewed - but that’s a topic for another post!
This meant I’d have to join back to different historical tables at various levels, as well as write new code to incorporate the new metric into future runs.
My first thought was to accept that I’d have to rerun the entire PySpark process for all historical dates to include the new metric in each of the summary tables - this would have taken at least two days ⏰ and incurred high AWS processing costs 💰.
The solution
Then, it struck me that I should use a data blend in Tableau!
📊 Tableau Data Blend:
Data blends let you query independent data sources and bring the results together in a Tableau sheet.
The underlying data sources are queried separately and are never joined together at row-level.
This means the data can be aggregated first, and then combined at the aggregate level.
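To make that concrete, here’s a minimal PySpark sketch of the same idea - each source is aggregated to the linking field on its own, and only the aggregates are combined. The table and column names are hypothetical, purely for illustration:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Two independent sources - never joined at row level.
clicks = spark.table("summary_clicks")          # hypothetical primary source
new_metric = spark.table("summary_new_metric")  # hypothetical secondary source

# Each source is aggregated to the linking field first...
clicks_agg = clicks.groupBy("site").agg(F.sum("visits").alias("visits"))
metric_agg = new_metric.groupBy("site").agg(F.sum("new_metric").alias("new_metric"))

# ...and only then combined at the aggregate level - a left join from
# the primary source, which mirrors how Tableau combines blend results.
blended = clicks_agg.join(metric_agg, on="site", how="left")
blended.show()
```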
The new metric could easily be summarized, meaning only one new input table was needed in Tableau.
This input could then be used for all worksheets by aggregating to the required level of detail.
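As a sketch of building that single input, assuming a hypothetical clickstream events table: summarize the new metric once, at the finest grain any worksheet needs, and let Tableau aggregate up from there.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# One summary table at the finest grain the dashboard needs
# (date, site, page, campaign are assumed dimensions); every
# worksheet can then aggregate up from this single input.
new_metric_summary = (
    spark.table("clickstream_events")  # hypothetical source table
    .groupBy("date", "site", "page", "campaign")
    .agg(F.sum("new_metric_events").alias("new_metric"))
)
new_metric_summary.write.mode("overwrite").saveAsTable("summary_new_metric")
```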
Previously, I had only used Tableau data blends to link to small mapping tables for things like tidying up campaign names and tagging. So this was an interesting experiment to see how Tableau performs when blending large datasets. As it turns out, it worked very well, with no noticeable impact on dashboard performance.
This is similar in concept to using APIs in software development, where pieces of code are each set up to perform a single task and are called as needed. By this analogy, the table for the new metric can be called on to provide just that metric, then used in conjunction with the existing data.
The new metric’s source refreshes on a different schedule from the rest of the data. This meant that using a blend had the added benefit that the update could run as a standalone process, as and when the data becomes available - without interfering with the rest of the pipeline.
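A hedged sketch of what that standalone refresh might look like - an incremental append for just the new metric, triggered whenever its source lands, with all names and dates assumed for illustration:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

def refresh_new_metric(run_date: str) -> None:
    """Rebuild only the new metric's summary for one date - the other
    summary tables and their schedules are left untouched."""
    spark = SparkSession.builder.getOrCreate()
    (
        spark.table("clickstream_events")  # hypothetical source table
        .where(F.col("date") == run_date)
        .groupBy("date", "site", "page", "campaign")
        .agg(F.sum("new_metric_events").alias("new_metric"))
        .write.mode("append")
        .saveAsTable("summary_new_metric")
    )

if __name__ == "__main__":
    refresh_new_metric("2024-01-31")  # e.g. run when the source data lands
```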
If you have a similar process - where you need to combine data at different levels of detail from different sources - then a data blend may just come in handy!