Thursday, December 11, 2025

Databricks Training Notes - Compute

All-purpose compute - R/W/X - More expensive

Serverless version of all purpose compute

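All-purpose compute is one kind of Classic Compute (Job Compute is another). "Classic" means the VMs run in your own cloud account.

Classic Compute - You pay the cloud provider for the VMs and pay Databricks for consumption, metered in DBU/hr.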

Job Compute - R/X - Cheaper

Serverless version of Job Compute

You can't run Scala/R on Serverless compute.

Serverless DBU rate is higher because the VM cost is built into it.
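
A rough sketch of the billing arithmetic behind that note: classic compute is billed in two parts (DBUs plus VMs), serverless as a single DBU charge. Every rate and count below is a made-up placeholder, not a real Databricks or cloud price.

```python
def classic_cost(hours, dbu_per_hour, dbu_rate, vm_rate_per_hour, num_vms):
    """Classic compute: DBU charge (Databricks) plus VM charge (cloud provider)."""
    dbu_charge = hours * dbu_per_hour * dbu_rate
    vm_charge = hours * vm_rate_per_hour * num_vms
    return dbu_charge + vm_charge


def serverless_cost(hours, dbu_per_hour, dbu_rate):
    """Serverless: a single DBU charge; the VM cost is folded into the rate."""
    return hours * dbu_per_hour * dbu_rate


# Hypothetical numbers, chosen only to show the shape of the comparison.
classic = classic_cost(hours=2, dbu_per_hour=4, dbu_rate=0.50,
                       vm_rate_per_hour=0.25, num_vms=3)
serverless = serverless_cost(hours=2, dbu_per_hour=4, dbu_rate=0.75)

print(f"classic:    {classic:.2f}")
print(f"serverless: {serverless:.2f}")
```

Which side comes out cheaper depends entirely on the real rates and on how long the classic VMs sit idle, so compare totals rather than DBU rates alone.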

RDD - Resilient Distributed Dataset

If a worker dies, Spark can recreate the lost data partitions from the RDD's lineage (the recorded chain of transformations) and keep running. RDD keeps extra RAM available.
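
A toy model of the "recreate the partition and keep running" idea. This is not Spark code: `ToyRDD` and its methods are hypothetical names, and the sketch only assumes that the source data survives and that the transformations are recorded.

```python
class ToyRDD:
    """Toy stand-in for an RDD: partitions are derived from durable source
    data by replaying a recorded list of transformations (the lineage)."""

    def __init__(self, source_partitions, lineage=()):
        self.source_partitions = source_partitions  # one list of rows per partition
        self.lineage = tuple(lineage)               # ordered transformations
        self.computed = {}                          # partition index -> rows held by a "worker"

    def map(self, fn):
        # A transformation only extends the lineage; nothing is computed yet.
        return ToyRDD(self.source_partitions, self.lineage + (fn,))

    def compute_partition(self, i):
        # Recompute from source whenever the partition is missing.
        if i not in self.computed:
            rows = self.source_partitions[i]
            for fn in self.lineage:
                rows = [fn(row) for row in rows]
            self.computed[i] = rows
        return self.computed[i]

    def lose_partition(self, i):
        # Simulate the worker holding partition i dying.
        self.computed.pop(i, None)

    def collect(self):
        return [row
                for i in range(len(self.source_partitions))
                for row in self.compute_partition(i)]


rdd = ToyRDD([[1, 2], [3, 4]]).map(lambda x: x * 10).map(lambda x: x + 1)
before = rdd.collect()
rdd.lose_partition(1)   # partition 1 disappears with its worker
after = rdd.collect()   # only partition 1 is rebuilt from the lineage
print(before == after)
```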

Vector Search - Works on word embeddings, which are arrays of floats. A specialized engine builds an index over those vectors so similar items can be looked up quickly.
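
A minimal sketch of what "searching arrays of floats" means, using brute-force cosine similarity. A real vector search engine builds an index so it does not have to scan every vector; the document ids and 3-dimensional embeddings below are invented for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length float vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, index, k=1):
    """Brute-force nearest neighbours: score every vector, keep the best k."""
    scored = [(cosine_similarity(query, vec), doc_id)
              for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Hypothetical embeddings; real ones have hundreds or thousands of dimensions.
index = {
    "doc_spark": [0.9, 0.1, 0.0],
    "doc_sql": [0.1, 0.9, 0.0],
    "doc_ml": [0.0, 0.2, 0.9],
}

nearest = top_k([0.8, 0.2, 0.1], index, k=2)
print(nearest)
```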

Pools - A pool of pre-provisioned VMs that you keep paying for. Classic compute scenario. Pools have largely been superseded by serverless compute.

Serverless Compute - cheaper version

Serverless Compute - performance optimized version - startup usually takes about 5 seconds

Cluster - Driver and worker nodes. On a single-node cluster the driver is also the worker. scikit-learn and pandas run on the driver and consume driver memory.

Use Job or Serverless clusters in production. Avoid interactive clusters in prod. Enable Photon for faster and cheaper execution. Reuse clusters to reduce startup time and cost.

SQL warehouse types:

Serverless - Photon engine, Predictive IO, Intelligent Workload Management

Pro - Photon engine, Predictive IO

Classic - Photon engine

Performance considerations (the 5 S's): Skew, Spill, Storage, Shuffle, Serialization
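
A small illustration of the first item, skew: one hot key makes one partition much larger than the rest, so one task runs far longer than the others. The sketch uses key modulo partition count as a stand-in for hash partitioning during a shuffle; the key values are invented.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Rows per partition when each row goes to (key % num_partitions)."""
    sizes = Counter(key % num_partitions for key in keys)
    return [sizes.get(p, 0) for p in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition relative to the average; 1.0 means perfectly even."""
    return max(sizes) / (sum(sizes) / len(sizes))


# Hypothetical data: key 7 produces 90 rows, ten other keys produce 1 row each.
keys = [7] * 90 + [1, 2, 3, 4, 5, 6, 8, 9, 10, 11]

sizes = partition_sizes(keys, num_partitions=4)
ratio = skew_ratio(sizes)
print(sizes, round(ratio, 2))
```

A ratio well above 1 is the signal to look at the key distribution before tuning anything else.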

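Adaptive Query Execution (AQE) re-optimizes the query plan at runtime using actual statistics, for example by coalescing small shuffle partitions and splitting skewed join partitions.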

Row Filter:

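-- Members of the 'admin' account group see every row; everyone else sees only device_id < 30.
CREATE OR REPLACE FUNCTION device_filter(device_id INT)
  RETURN IF(IS_ACCOUNT_GROUP_MEMBER('admin'), true, device_id < 30);

-- Attach the function to the table; it is evaluated for each row on every query.
ALTER TABLE silver
SET ROW FILTER device_filter ON (device_id);

-- Non-admins should now get no rows with device_id >= 30.
SELECT *
FROM silver
ORDER BY device_id DESC;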
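
The same filter logic as a plain Python stand-in, to show which rows each kind of user would see. The sample rows are invented, and the `is_admin` flag replaces the `IS_ACCOUNT_GROUP_MEMBER('admin')` check, which only exists inside Databricks.

```python
def device_filter(is_admin, device_id):
    """Mirror of the SQL row filter: admins see all rows, others only device_id < 30."""
    return True if is_admin else device_id < 30


# Hypothetical contents of the silver table.
rows = [{"device_id": d} for d in (5, 29, 30, 42)]


def visible_device_ids(rows, is_admin):
    """device_id values that survive the filter for this user."""
    return [r["device_id"] for r in rows
            if device_filter(is_admin, r["device_id"])]


admin_view = visible_device_ids(rows, is_admin=True)
other_view = visible_device_ids(rows, is_admin=False)
print(admin_view, other_view)
```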


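-- Preview raw Parquet files from a volume.
-- user_first_touch_timestamp is divided by 1,000,000 to convert microseconds to seconds.
SELECT
  *,
  cast(from_unixtime(user_first_touch_timestamp/1000000) AS DATE) AS first_touch_date
FROM read_files(
  '/Volumes/dbacademy_ecommerce/v01/raw/users-historical',
  format => 'parquet')
LIMIT 10;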
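
A worked version of the timestamp conversion in the query above, assuming the column holds microseconds since the Unix epoch (which the division by 1000000 suggests). The sketch fixes the time zone to UTC; `from_unixtime` in SQL uses the session time zone, so dates near midnight can differ. The sample value is arbitrary.

```python
from datetime import datetime, timezone


def first_touch_date(timestamp_micros):
    """Convert microseconds since the epoch to a calendar date (UTC assumed)."""
    seconds = timestamp_micros / 1_000_000
    return datetime.fromtimestamp(seconds, tz=timezone.utc).date()


# Arbitrary example value: 1,700,000,000 seconds expressed in microseconds.
example = first_touch_date(1_700_000_000_000_000)
print(example.isoformat())
```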