Why we use Databricks
We have invested thousands of hours building solutions on the Databricks platform, and it is our tool of choice for developing modern data lake solutions. Drawing on more than 20 years of experience, we have also developed a framework, which we call Velocity, that not only leverages the benefits of Databricks but also accelerates development by providing reusable components and reducing maintenance overheads and costs.
In this article, we will explain what Databricks is and why we have chosen it as the primary tool for our solutions.
What is Databricks?
Databricks was developed by the creators of Apache Spark, so before we delve into Databricks itself, it's important to explain what Apache Spark is.
Apache Spark is a data processing framework that can quickly perform processing tasks on very large data sets, and that can distribute those tasks across multiple computers, either on its own or in tandem with other distributed computing tools. These two qualities have made it a cornerstone of big data and machine learning, which require the orchestration of massive computing power to process large amounts of data.
It can be deployed in a variety of ways and gives developers the ability to work in several programming languages, including Java, Scala, Python, and R. It also supports SQL, streaming data, machine learning, and graph processing. Its huge popularity means you will find it used across many industries, from banks, telecommunications companies, gaming companies, and governments to some of the largest companies in the world.
What makes it stand out from the traditional MapReduce technology used in Hadoop is that Spark uses an in-memory data engine, which means it can perform tasks much faster in certain situations and allows processing to be distributed much more effectively.
The magic of Databricks is that it provides a web-based, unified platform, available on Microsoft Azure, Amazon Web Services, and Google Cloud Platform, that enables data engineers, data scientists, and business users to collaborate closely on notebooks, experiments, models, data, libraries, and jobs.
This unified approach simplifies implementations by eliminating the data silos that traditionally separate and complicate data engineering, analytics, BI, data science, and machine learning. Furthermore, because it is built on open-source technology and open standards, it maximizes flexibility, security, and governance, which helps organizations operate more efficiently and innovate faster.
Additionally, because it's a managed, cloud-based platform, it is quick to set up, easily configured to scale, and cost effective, because you only pay for what you use. It eliminates the need for the on-premises hardware or complex setup that native Apache Spark implementations would normally require.
Another key feature of Databricks, the details of which are beyond the scope of this article, is its support for the Delta Lake storage format, which provides ACID transactions, unified streaming and batch data processing, schema enforcement, and time travel.
Now that you have had an introduction to the Databricks platform, we can look at some of the features that make it our preferred tool for building modern cloud data solutions. As we are Microsoft specialists, this list focuses mainly on Azure Databricks:
1. Data sources
The Databricks runtime supports a huge array of file formats and data sources, such as Avro, images, JSON, Parquet, XML, Zip files, and more. It also has connectors for a variety of data stores, such as Azure Storage, AWS S3, Cassandra, MongoDB, Snowflake, and SQL databases. A complete list can be found here.
Additionally, for streaming data sources, Databricks supports the likes of Apache Kafka, Amazon Kinesis, Azure Event Hubs, and more. More details here.
2. Languages and environment
As mentioned earlier, Databricks supports commonly used programming languages such as Python, R, Scala, and SQL. These languages interact with Spark through its APIs. This is great if you have different teams with different specializations.
Note: Although different languages are supported, for production implementations, we recommend choosing one standard for your organization. However, for ad-hoc analysis, it might make sense to be more flexible.
3. Cloud Integration
Azure Databricks is integrated with Microsoft Azure, which provides a wealth of benefits, including:
- The ability to store, retrieve and update data on Azure Data Lake and Blob storage.
- Orchestration and scheduling/triggering of jobs and notebooks with Azure Data Factory.
- Security with Azure Active Directory.
- Deployments and version control with Azure DevOps.
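As an illustration of the first point, reading data that lives in Azure Data Lake Storage typically amounts to a few Spark configuration settings plus a standard read. The sketch below shows the service-principal (OAuth) pattern from the Azure Databricks documentation; every account, tenant, and secret value is a placeholder, not a working credential, and in practice secrets would come from a secret scope rather than being hard-coded.

```python
# Hedged sketch (config fragment): OAuth access to ADLS Gen2 via a service
# principal. All identifiers below are placeholders. Assumes a running
# Databricks notebook where `spark` already exists.
storage_account = "mystorageaccount"  # placeholder storage account name

spark.conf.set(
    f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(
    f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
    "<application-id>")       # placeholder
spark.conf.set(
    f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
    "<client-secret>")        # placeholder; use a secret scope in practice
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
    "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

# A standard read over the abfss:// path then works as normal
df = spark.read.parquet(
    f"abfss://mycontainer@{storage_account}.dfs.core.windows.net/path/to/data")
```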
4. Community and support
Because Databricks is available on all three major cloud platforms and is based on an open-source framework, it benefits from a huge community of Apache Spark engineers. Extensive documentation and support are also available for all aspects of Databricks, including the supported programming languages.
5. Productivity and Collaboration
Through a collaborative and integrated environment, Databricks streamlines the process of exploring data, prototyping, and running data-driven applications in Spark. It allows data exploration to determine how to use the data stored in the data lake, multiple users can document progress in notebooks in real time, and visualizations can be created in just a few clicks.
6. Machine Learning
Databricks also provides an integrated, end-to-end machine learning platform incorporating managed services for experiment tracking, model training, feature development and management, and feature and model serving.
Models can be trained either manually or with AutoML, training parameters and models can be tracked with MLflow Tracking, and feature tables can be created and accessed for both model training and inference. Lastly, using the Model Registry, models can be shared, managed, and served.
More details can be found here.
If you're interested in learning more about Databricks or our Velocity framework, reach out to us today for a free introductory demo. For a limited time, we are also offering a free initial consultation to help you get started and show you how Velocity can accelerate your analytics journey.