DataFusion 54.0.0 Lands LATERAL Joins and SQL Lambdas

DataFusion 54.0.0 adds LATERAL joins and lambdas, LanceDB benchmarks lakehouse metadata on S3, plus SQLMesh and Dagster point releases for production fleets

Jun 12, 2026

DataPrincipal Daily - June 12th, 2026

Apache DataFusion closes two long-standing SQL gaps in a single release, and LanceDB puts a number on what lakehouse metadata actually costs at scale.

⚡ TL;DR

Apache DataFusion 54.0.0 ships LATERAL joins, SQL lambda functions, and vector-search functions in a record 740-commit release.
LanceDB benchmarks the Lance table format against Delta Lake and Iceberg on S3, putting a number on object-store metadata cost.
SQLMesh 0.235.4 adds StarRocks support and timezone-aware scheduling.
Dagster 1.13.9 introduces hierarchical asset groups, an is: type filter, and multi-replica user-code deployments.

🏗️ Platforms & Architecture

Apache DataFusion 54.0.0 closes the LATERAL and lambda gaps

Apache DataFusion 54.0.0, released June 12, 2026, landed 740 commits from 139 contributors, a new project record. The release adds LATERAL joins, SQL lambda functions (x -> expr) with higher-order array functions (array_transform, array_filter, array_any_match), spilling nested-loop joins for out-of-memory protection, a new Avro reader built on the arrow-avro crate, and vector-search functions (cosine_distance, inner_product, array_normalize). On performance, near-unique sort-merge joins are cited at 20x to 50x faster, repartitioning is cited at up to 50% faster on some repartition-heavy queries, and Parquet scans are cited at up to roughly 2x faster on skewed scans. Uncorrelated scalar subqueries are now evaluated through a new dedicated physical operator, and the Parquet writer gains content-defined chunking for deduplication.
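For readers who have not used these features in other engines, the sketch below models their semantics in plain Python. It does not call DataFusion; the function names are borrowed from the release notes, and the definitions are conventional ones assumed here (for example, cosine distance as one minus cosine similarity, and array_normalize as L2 normalization), so the engine's exact behavior may differ.

```python
import math

# Higher-order array functions, modeled with Python lambdas.
def array_transform(arr, fn):
    return [fn(x) for x in arr]

def array_filter(arr, pred):
    return [x for x in arr if pred(x)]

def array_any_match(arr, pred):
    return any(pred(x) for x in arr)

# LATERAL join: the right side is re-evaluated for each left row
# and may refer to that row's columns.
def lateral_join(left_rows, right_for):
    return [(left, right) for left in left_rows for right in right_for(left)]

# Vector functions, using conventional definitions (an assumption).
def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def array_normalize(a):
    norm = math.sqrt(inner_product(a, a))
    return [x / norm for x in a]

def cosine_distance(a, b):
    denom = math.sqrt(inner_product(a, a) * inner_product(b, b))
    return 1.0 - inner_product(a, b) / denom

# Hypothetical sample data for illustration only.
orders = [{"id": 1, "items": [5, 12, 30]}, {"id": 2, "items": [3]}]

# Roughly: for each order, join against its own items that exceed 4.
big_items = lateral_join(
    orders, lambda o: array_filter(o["items"], lambda x: x > 4)
)
pairs = [(order["id"], item) for order, item in big_items]
doubled = array_transform([1, 2, 3], lambda x: x * 2)
```

Order 2 drops out of `pairs` because its per-row subquery returns nothing, which is how an inner LATERAL join behaves.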

LanceDB benchmarks lakehouse metadata performance on S3

On June 9, 2026, LanceDB engineer Jack Ye published a Rust-based benchmark comparing the metadata performance of the Lance table format against Delta Lake and Apache Iceberg on Amazon S3 and S3 Express One Zone, run on an EC2 c7i.48xlarge in us-east-1b. It measured three operations: commit latency over 10,000 sequential append commits; metadata load latency across table versions; and concurrent commit throughput with many simultaneous writers. It also tested Iceberg in two configurations (a direct Postgres catalog and an Apache Polaris REST catalog) against Lance and Delta Lake. Ye attributes Lance’s claimed metadata advantage to a compact protobuf manifest plus a best-effort version-hint file that enables parallel discovery, compared with Delta Lake log replay and Iceberg multi-file metadata-tree traversal. This is a vendor-authored benchmark, so its results, including which format comes out ahead, are the author’s claims.
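To make the version-hint idea concrete, here is a toy model that counts object-store round trips when locating the latest table version. Everything in it is an assumption for illustration: the `ToyStore` class, the key names, the batch size, and the deliberately naive replay loop are hypothetical and are not the actual Lance, Delta Lake, or Iceberg algorithms.

```python
class ToyStore:
    """In-memory stand-in for an object store (hypothetical)."""

    def __init__(self):
        self.objects = {}

    def put(self, key, value):
        self.objects[key] = value

    def get(self, key):
        return self.objects.get(key)


def commit(store, version, update_hint=True):
    # Write the manifest, then update the hint on a best-effort basis.
    store.put(f"manifest/{version}", f"manifest-v{version}")
    if update_hint:
        store.put("version_hint", version)


def latest_via_hint(store, batch=4):
    """Start from the (possibly stale) hint, then probe forward in batches.

    Each batch is assumed to be issued in parallel, so it is counted as one
    round trip. Versions are assumed to be contiguous.
    """
    latest = store.get("version_hint") or 0
    round_trips = 1  # reading the hint
    while True:
        round_trips += 1
        hits = 0
        for v in range(latest + 1, latest + 1 + batch):
            if store.get(f"manifest/{v}") is None:
                break
            hits += 1
        latest += hits
        if hits < batch:
            return latest, round_trips


def naive_replay(store):
    """Walk versions one at a time from the start: one round trip each."""
    version, round_trips = 0, 0
    while True:
        round_trips += 1
        if store.get(f"manifest/{version + 1}") is None:
            return version, round_trips
        version += 1


store = ToyStore()
for v in range(1, 11):
    # The hint misses the last two commits, simulating a stale hint.
    commit(store, v, update_hint=(v <= 8))

hinted = latest_via_hint(store)
replayed = naive_replay(store)
```

In this toy setup both paths find version 10, but the hinted path needs far fewer sequential round trips. Real systems use checkpoints, listings, and catalogs that this model ignores.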

🔧 Tools & Products

SQLMesh 0.235.4 adds StarRocks and timezone-aware scheduling

Tobiko Data shipped SQLMesh v0.235.4 on June 11, 2026. It adds StarRocks as a supported execution engine, introduces a cron_tz field in model_defaults so model schedules can be pinned to a specific timezone, and adds an --environment flag to the sqlmesh janitor command for scoped environment cleanup. It also ships a secure field for ClickHouse connection config and a Redshift table-properties handler. The release credits 10 first-time contributors.
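Why pinning a schedule to a timezone matters can be shown with a small, SQLMesh-independent sketch: a job meant to run at 02:00 local time lands on different UTC instants depending on the season. The fixed offsets below are a simplification chosen to keep the example dependency-free; production code would normally rely on a real timezone database, and nothing here reflects how SQLMesh implements cron_tz internally.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets standing in for US Eastern time (an illustrative assumption).
EST = timezone(timedelta(hours=-5), "EST")  # winter
EDT = timezone(timedelta(hours=-4), "EDT")  # summer


def local_run_in_utc(year, month, day, hour, tz):
    """Convert a local wall-clock run time to the equivalent UTC instant."""
    local = datetime(year, month, day, hour, tzinfo=tz)
    return local.astimezone(timezone.utc)


winter_run = local_run_in_utc(2026, 1, 15, 2, EST)
summer_run = local_run_in_utc(2026, 7, 15, 2, EDT)

print(winter_run.isoformat())
print(summer_run.isoformat())
```

A schedule defined only in UTC would fire at one of these instants year-round, drifting an hour against the local clock for part of the year.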

Dagster 1.13.9 brings hierarchical asset groups

Dagster 1.13.9 (libraries 0.29.9) arrived June 11, 2026, one week after 1.13.8. Asset group names can now contain / separators to define hierarchical groups with wildcard selection, and a new is: filter selects assets by type, such as is:external or is:materializable. The release also adds multi-replica Helm support for user-code deployments with consistent gRPC server IDs, extends DltLoadCollectionComponent to honor partitions_def and backfill_policy, and fixes a Postgres and MySQL event-log watcher deadlock alongside a cron-schedule partition-counting error.

💎 Gems & Tools

sqlglot: A no-dependency Python SQL parser, transpiler, and optimizer that translates between 30-plus SQL dialects. Worth a look for anyone normalizing or rewriting SQL across engines.

delta-rs: A Rust and Python implementation of Delta Lake that reads and writes Delta tables without Spark or the JVM. Useful for single-node Delta access from Python pipelines.

Data Principal
