Databricks and Google BigQuery

April 27, 2020

Databricks is a powerful unified platform for data and AI. Databricks supports:

·         Reliable data engineering. Large-scale data processing for batch and streaming workloads.

·         Analytics on all your data. Enable analytics on the most complete and recent data.

·         Collaborative data science. Simplify and accelerate data science on large datasets.

·         Production machine learning. Standardize the ML lifecycle from experimentation to production.

BigQuery is Google’s serverless, highly scalable, enterprise data warehouse designed to make all your data analysts productive. Because there is no infrastructure to manage, you can focus on analyzing data to find meaningful insights using familiar SQL without the need for a database administrator.

In this article, we will guide you through setting up a connection between Google BigQuery and Azure Databricks so that you can use a Databricks notebook to read from and write to BigQuery tables.

 

BigQuery Setup

1.      Create a service account by going to APIs & Services > Credentials > Create Credentials > Service Account

2.      Create a JSON key file for your service account. You can find the service account under IAM & Admin > Service Accounts > “Your Service Account” > Edit > Create Key > JSON > Create.

3.      In order to write to a BigQuery table, you need a Cloud Storage bucket for staging temporary files, and you also need to add your service account as a member of that bucket. Go to Storage > “Your Bucket” > Permissions > Add members (or grant the access programmatically, as sketched after these steps).

The service account you use must have admin roles in both Google Cloud Storage and Google BigQuery, or it won't have permission to read or write.
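If you prefer to grant the bucket access from code instead of the console, here is a minimal Python sketch using the google-cloud-storage client. The key-file path, bucket name, and service account e-mail are placeholders, and the credentials used to run it must already be allowed to administer the bucket:

# A sketch of granting the service account access to the temporary bucket.
# BUCKET-NAME, the key-file path, and the service account e-mail are placeholders.
from google.cloud import storage

client = storage.Client.from_service_account_json("/path/to/admin-key.json")
bucket = client.get_bucket("BUCKET-NAME")

policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({
    "role": "roles/storage.admin",  # admin access, as required above
    "members": {"serviceAccount:your-sa@your-project.iam.gserviceaccount.com"},
})
bucket.set_iam_policy(policy)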

Databricks Setup

1.      For Databricks, you need to install the necessary libraries on your cluster, which include the following (a sketch of installing them through the Libraries API follows this list):

·         google-cloud-bigquery

·         google-cloud-storage

·         spark_bigquery_latest.jar

·         com.spotify:spark-bigquery_2.10:0.2.2
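If you would rather script the installation than add the libraries through the cluster UI, the sketch below calls the Databricks Libraries API from Python. The workspace URL, personal access token, cluster ID, and the DBFS path of the jar are placeholders you would need to fill in:

# A sketch of installing the libraries above via the Databricks Libraries API.
# Workspace URL, token, cluster ID, and jar path are placeholders.
import requests

workspace_url = "https://<databricks-instance>"
token = "<personal-access-token>"
cluster_id = "<cluster-id>"

libraries = [
    {"pypi": {"package": "google-cloud-bigquery"}},
    {"pypi": {"package": "google-cloud-storage"}},
    {"jar": "dbfs:/FileStore/jars/spark_bigquery_latest.jar"},
    {"maven": {"coordinates": "com.spotify:spark-bigquery_2.10:0.2.2"}},
]

resp = requests.post(
    f"{workspace_url}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {token}"},
    json={"cluster_id": cluster_id, "libraries": libraries},
)
resp.raise_for_status()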

2.      Then upload the service account key file by going to Data > Add Data > Upload File > Browse and selecting your key file.

 

3.      Specify the location of the key file in your Databricks notebook.

Python:

file_location = "/dbfs/FileStore/tables/bucket_261614_7c8493eb0eef-d33e6.json"

Scala:

val file_location = "/dbfs/FileStore/tables/bucket_261614_7c8493eb0eef-d33e6.json"
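As an optional sanity check, you can confirm from a Python cell that the uploaded key file is actually visible on the driver at that path (the path above is an example; yours will differ):

%python
# Should print True if the key file landed at the expected DBFS location.
import os
print(os.path.exists(file_location))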

Read from BigQuery Table

To read from a BigQuery table using a Databricks notebook, you can follow this example (replace PARENT-PROJECT-ID, PROJECT-ID, DATASET, and TABLE-NAME with your own values):

%python
table = spark.read\
             .format("bigquery")\
             .option("credentialsFile", file_location)\
             .option("parentProject", "PARENT-PROJECT-ID")\
             .option("project", "PROJECT-ID")\
             .option("table", "PROJECT-ID.DATASET.TABLE-NAME")\
             .load()
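Once the read succeeds, it's worth quickly inspecting the result before doing anything else with it:

%python
# Quick sanity check on the data that came back from BigQuery.
table.printSchema()
table.show(5)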

Write to BigQuery Table

To write to a BigQuery table using a Databricks notebook, we need to use Scala, since the Spotify spark-bigquery connector exposes its write helpers there.
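One thing to note: the DataFrame table created in the Python read cell above is not directly visible to a Scala cell, so a simple bridge is to register it as a temporary view from Python first (the view name bq_source is just an example):

%python
# Expose the DataFrame read earlier so the Scala cell below can pick it up.
table.createOrReplaceTempView("bq_source")

The Scala cell then reads the view back with spark.table before saving: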

%scala
import com.spotify.spark.bigquery._

// Pick up the DataFrame registered as a temporary view in the Python cell above.
val table = spark.table("bq_source")

sqlContext.setGcpJsonKeyFile(file_location)
sqlContext.setBigQueryProjectId("PROJECT-ID")
sqlContext.setBigQueryGcsBucket("BUCKET-NAME")

table.saveAsBigQueryTable("PROJECT-ID.DATASET.TABLE-NAME")

Summary

In this article, we've worked through the steps of setting up access to Google BigQuery with a service account, installing the necessary libraries in Databricks, and reading and writing data to and from BigQuery.

This can be incredibly useful for accessing Google BigQuery public datasets for analytics, for building machine learning models, or if you're already using BigQuery in your organization.

Contact us today if you need help implementing solutions using Databricks or Microsoft Azure!