Databricks System Tables with dbt

G the Sheep
4 min read · Jul 29, 2024


Snowflake users had it, and now Databricks does, who’d have thought…

Snowflake users have long relied on their system-generated SNOWFLAKE database to do things like monitor spend, investigate warehouse utilisation, or dig into weirdness in their tagged dbt models (maybe even much, much more with tasty tools like Select.dev).
Databricks hasn’t lacked this information; it just expected a little more legwork from the user. Previously, users would utilise the provided Python-based Notebooks to query the API for this data, or need account console access to go digging in there, but at last, the Databricks System Tables have arrived!

Don’t worry, I secretly like Notebooks too

Databricks System Tables

Before we even begin - yeah, yeah, yeah, we can run notebooks and look at all the same information already, fine! But some people just prefer SQL! Entire data teams can thrive on SQL alone these days, especially when riding the surge of dbt/ SQLMesh/ etc, so if good engineering practices like monitoring and refining resource consumption from our workflows are to be encouraged, then it makes sense to have the data accessible to the tools of choice.

Databricks System Tables seem to be getting added pretty regularly right now, and come with some pretty comprehensive documentation! But for the sake of this, we’ll first just check out 2 key tables related to Billing:

- system.billing.usage
- system.billing.list_prices

If you can’t see any of these tables, check out the enablement instructions.
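To double-check what you can actually see from a SQL warehouse, something like this quick sanity check does the trick:

show tables in system.billing;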

Bringing SQL/ dbt into the mix

Checking out our usage

First up, usage. Maybe this is something you already have a solution for, but it’s also quite a nice table to kick off the system catalog with, if you’ve not explored it yet.

This table is pretty much a row-by-row account of everything you do in Databricks that spends money (terrifying, right?!), but don’t be too worried - it’s only in DBUs right now, and they’re pretty much free…until you convert them to currency. 🙃

select
    workspace_id,
    sku_name,
    usage_date,
    sum(usage_quantity) as total_dbu
from
    system.billing.usage
where
    usage_date >= '2024-01-01'
group by
    all

With very little effort, you can already start picking through which resources are costing you the most each day, which workspaces are giving you grief, and any periodic trends. Now just throw this into a cheeky little visualisation (left to you and your favourite plotting tool).
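For example, a small twist on the query above (just a sketch, adjust the window and limit to taste) surfaces the hungriest SKUs over the last 30 days:

select
    sku_name,
    sum(usage_quantity) as total_dbu
from
    system.billing.usage
where
    usage_date >= date_sub(current_date(), 30)
group by
    all
order by
    total_dbu desc
limit 10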

But DBUs only get us so far; for those of us who aren’t hardwired into the Databricks Spark clusters, we’ll need to start thinking about real money.

All in DBUs

Costing it up

Next, we bring in the prices. Now, this requires a little more understanding…

We’re going to be joining on sku_name, a Databricks-generated identifier that maps to some sort of priced consumption/ activity, i.e. ENTERPRISE_JOBS_COMPUTE is job compute spend on an Enterprise plan, but they can get slightly trickier than that. Prices also change over time, so each usage record needs matching against the price that was in effect when the usage happened, via price_start_time and price_end_time on the list_prices table.
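If you want a feel for list_prices before joining, a quick peek at the pricing columns helps (pricing is a struct; the column names here follow the current docs, so double-check them against your workspace):

select
    sku_name,
    currency_code,
    usage_unit,
    pricing.default as list_price,
    price_start_time,
    price_end_time
from
    system.billing.list_prices
order by
    sku_name,
    price_start_time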

select
    usage.workspace_id,
    usage.sku_name,
    usage.usage_date,
    sum(usage.usage_quantity * list_prices.pricing.default) as total_spend
from
    system.billing.usage
inner join
    system.billing.list_prices
    on usage.sku_name = list_prices.sku_name
where
    usage.usage_end_time >= list_prices.price_start_time
    and (list_prices.price_end_time is null or usage.usage_end_time < list_prices.price_end_time)
    and usage.usage_date >= '2024-01-01'
group by
    all

Huzzah! That’s not too bad, right? A very quick and easy way to start seeing how much you’re spending on various resources, and with minimal effort, you can pull in tags or other metadata from the usage table to start splitting your spend however you see fit!
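As one example of that splitting, the usage table carries a custom_tags map, so something like the sketch below breaks spend down per tag value (assuming you’ve been tagging clusters/ jobs with a team key; swap in whichever tag keys you actually use):

select
    usage.custom_tags['team'] as team,
    usage.usage_date,
    sum(usage.usage_quantity * list_prices.pricing.default) as total_spend
from
    system.billing.usage
inner join
    system.billing.list_prices
    on usage.sku_name = list_prices.sku_name
where
    usage.usage_end_time >= list_prices.price_start_time
    and (list_prices.price_end_time is null or usage.usage_end_time < list_prices.price_end_time)
group by
    all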

But what if even that is too much faff?

Well, here’s one I made earlier…

Installing the package

To install the package linked above, it’s time to get down and dirty with your packages.yml in whichever dbt project seems best.

Currently, you’ll need an entry like so:

packages:
  - git: "https://github.com/GtheSheep/dbt-databricks-monitoring.git"
    revision: 0.0.1

That’s it! Run dbt deps to pull the package in. It’s also worth considering materialising these models under something that isn’t the default schema, by adding a block like this to your dbt_project.yml:

models:
  dbt_databricks_monitoring:
    +schema: monitoring
    staging:
      +schema: staging

After the next dbt run (provided all permissions were granted to read from the relevant tables 👀), you should be greeted with a bunch of new tables to help monitor your Databricks deployment!

usage_spend for your usage, friend
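If that run falls over on permissions instead, the grants required will look something like this (Unity Catalog syntax, with the account users group standing in for whichever principal dbt runs as):

-- let the dbt principal see the system catalog and read the billing schema
grant use catalog on catalog system to `account users`;
grant use schema on schema system.billing to `account users`;
grant select on schema system.billing to `account users`;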

Extending the package

Currently, the package is focused on cost monitoring, hence the available entities are clusters and warehouses and the sole facts are driven by usage, but this is easy to flesh out into jobs, optimisations, or even lineage!
The intention is very much to maintain and extend the dbt package going forward, but as always, all contributions are both welcome and appreciated!
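For instance, the jobs side could start from the lakeflow system schema, with something like this sketch of recent job run durations (the table and column names follow the current system tables docs, so verify them on your workspace before leaning on it):

select
    workspace_id,
    job_id,
    run_id,
    result_state,
    timestampdiff(minute, period_start_time, period_end_time) as duration_minutes
from
    system.lakeflow.job_run_timeline
where
    period_start_time >= date_sub(current_date(), 7)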
