The Right Team with the Wrong Tools

Written by: Håkan Bellarp

June 10, 2020

 

What is new?

BI Developers have a long tradition of creating valuable information from data. With the (inevitable) shift from traditional on-premises Data Warehouse solutions to Analytical Platforms in the Cloud, we need to re-evaluate some of our methods and change the tools of the trade. The fact that the data is extracted, cleaned, conformed and integrated in the cloud using platform services instead of software running on Virtual Machines is not the big difference. The big difference is, of course, the rise of the Data Lake/Big Data, which allows for diverse data formats and streaming/real-time data ingestion.

In my opinion, there are two common mistakes that traditional BI developers make when building an analytical platform in the cloud.

The first mistake - use the data lake only for storage

The first common mistake is to use the Data Lake only as storage for the stage-in tables of a traditional Data Warehouse.

Instead, you should take full advantage of the Data Lake's capabilities and store the ingested data indefinitely. A useful metaphor is to think of the data pipeline as moving data through three layers: Bronze, Silver and Gold. In the Bronze layer we keep the raw data perpetually, in its original format. In the Silver layer we have conformed and cleaned the data, ready for integration. In the Gold layer we have our most valuable, integrated data, packaged for consumption by our Dashboards and Reports. At a minimum, the Silver layer should be processed in the Data Lake.
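
As a minimal sketch of this layering, assuming a Databricks/PySpark environment, the flow could look something like the following. The paths, table names and columns are hypothetical examples, not a prescribed standard.

```python
# Sketch of a Bronze/Silver/Gold flow in PySpark.
# Paths, formats and column names are hypothetical examples.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, trim, to_date

spark = SparkSession.builder.getOrCreate()

# Bronze: keep the raw data indefinitely, in its original format.
raw = spark.read.json("/datalake/bronze/pos_sales/2020/06/10/")

# Silver: conform and clean the data, ready for integration.
silver = (
    raw
    .withColumn("store_id", trim(col("store_id")))
    .withColumn("sale_date", to_date(col("sale_ts")))
    .dropDuplicates(["receipt_id"])
)
silver.write.mode("overwrite").parquet("/datalake/silver/pos_sales/")

# Gold: integrate and package for consumption by Dashboards and Reports.
gold = (
    silver
    .groupBy("store_id", "sale_date")
    .sum("amount")
    .withColumnRenamed("sum(amount)", "total_sales")
)
gold.write.mode("overwrite").parquet("/datalake/gold/daily_store_sales/")
```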

The second mistake - fail to design for streaming data

The second mistake is to fail to design for how streaming data will be processed in the Data Lake. If you, for instance, extract your Point of Sale data in an hourly batch today but need the data in near real time tomorrow, you should not have to rebuild your current pipeline, or worse, set up a new parallel pipeline just for the streaming ingestion. This is, by the way, a very likely scenario if you are a Retailer: the need for real-time analytics will only become more important if you are on a digitization journey.
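
One way to avoid that rebuild is to write the pipeline as a stream from the start and let the trigger decide whether it runs as an hourly batch or in near real time. The sketch below assumes Spark Structured Streaming; the paths and schema are hypothetical examples.

```python
# Sketch: one Structured Streaming pipeline that can run either as a
# scheduled batch or in near real time, depending only on the trigger.
# Paths and schema are hypothetical examples.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

spark = SparkSession.builder.getOrCreate()

schema = (
    StructType()
    .add("store_id", StringType())
    .add("amount", DoubleType())
    .add("sale_ts", TimestampType())
)

# Read new files as they arrive in the Bronze area of the Data Lake.
pos_stream = (
    spark.readStream
    .schema(schema)
    .json("/datalake/bronze/pos_sales/")
)

writer = (
    pos_stream.writeStream
    .format("parquet")
    .option("path", "/datalake/silver/pos_sales/")
    .option("checkpointLocation", "/datalake/checkpoints/pos_sales/")
)

# Today: run on a schedule, process everything new, then stop.
writer.trigger(once=True).start()

# Tomorrow: switch to near real time by changing only the trigger.
# writer.trigger(processingTime="1 minute").start()
```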

The right team

Traditional BI Developers have the methods, the experience and an understanding of the importance of a structured approach to building a Data Platform. There are important concepts that we need to preserve from the Data Warehouse tradition, e.g. Data Governance, Compliance and a layered data architecture with standardized design templates. Without a structured approach, your Data Lake will eventually turn into a Data Swamp. But keep in mind that these developers need to be trained to think in pipelines rather than batches.

The right tools

A modern tool has to be able to handle complex structures (Big Data), streaming data and different programming languages, such as Python as well as SQL. Databricks is a very good example of a tool that ticks all the important boxes. You can use a tool like Azure Data Factory together with Databricks: Azure Data Factory can handle the orchestration and scheduling of the pipelines, and possibly also the ingestion of batch sources, while Databricks processes the data inside the platform.
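
To illustrate why that mix of languages matters, the sketch below shows how the same Spark session in Databricks can combine Python and SQL, with a run date supplied by an orchestrator such as Azure Data Factory. The table, column and path names are hypothetical examples.

```python
# Sketch: mixing Python and SQL in one Spark session, with a run date
# supplied by the orchestrator. Names and paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# In a Databricks notebook, a parameter from Azure Data Factory could be
# read with dbutils.widgets.get("run_date"); hard-coded here for the sketch.
run_date = "2020-06-10"

# Python for the heavier transformation logic ...
silver = spark.read.parquet("/datalake/silver/pos_sales/")
silver.createOrReplaceTempView("pos_sales")

# ... and SQL where it is the clearer tool.
gold = spark.sql(f"""
    SELECT store_id, SUM(amount) AS total_sales
    FROM pos_sales
    WHERE sale_date = '{run_date}'
    GROUP BY store_id
""")
gold.write.mode("overwrite").parquet(f"/datalake/gold/daily_store_sales/{run_date}/")
```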

The rise of the Data Engineer

The new paradigm of Cloud-based Analytical Platforms requires a new generation of BI Developers: the Data Engineer. The Data Engineer needs to combine the traditional skills of a BI Developer with skills from System Engineering, e.g. how to develop within a modern DevOps framework and how to work with additional programming languages like Python and Scala.