🐦‍⬛ DataPrincipal Weekly - June 18th, 2026

Multigres v0.1 alpha brings horizontal scaling to PostgreSQL, plus Hudi 1.2's native VECTOR type, DuckDB 1.5.4, Materialize transactions, and DataFusion 54.

Jun 18, 2026

This week in data engineering, PostgreSQL’s scaling story led the news, while table formats, query engines, and catalogs all shipped concrete releases.

🐘 Multigres v0.1 alpha arrives to scale Postgres horizontally, as Vitess once scaled MySQL.

🔍 Apache Hudi 1.2 adds a native VECTOR type with built-in similarity search on lakehouse tables.

🦆 DuckDB ships 1.5.4 (Variegata) alongside the 1.4.5 LTS maintenance release on the same day.

✍️ The Week in Data (4 minute read)

A photo-frame company spent Christmas 2024 watching its database go down for three hours, then rebuilt to survive the next one. A brand-new project from the person who once kept YouTube’s database alive is now building that same fix for everyone else. The thread connecting them is the question Postgres has dodged for years: what happens when one machine is not enough?

Start with Aura Frames. The 2024 outage was a write-ahead log overflow caused by replication slots on an Amazon RDS PostgreSQL 14.1 instance, and it occurred during peak holiday traffic with no one at a desk. In response, the team sharded its ten highest-write tables, went from one primary to eight, and grew from 192 to 896 vCPU. The payoff showed up a year later: Christmas 2025 saw peak transaction volume of 226,000 per second, and the lights stayed on.

That is a heroic amount of plumbing for a team that just wanted to store pictures. The database is wonderful right up to the moment a single primary becomes the ceiling, and then you are hand-rolling shard logic that has nothing to do with your product.

This is exactly the gap Multigres is built to fill. Its v0.1 alpha landed this week as an open-source horizontal-scaling layer for Postgres, led by Sugu Sougoumarane, who co-created Vitess, the system that scaled MySQL at YouTube. The alpha ships connection pooling through a multigateway and multipooler, consensus-based automatic failover, pgBackRest backups, and a Kubernetes operator. Sharding, the headline act, is deliberately held back for a later release.

Sharding is the glamorous word, but the unglamorous parts, pooling and failover, are what actually buckle under real production load, and Multigres shipped those first. Multigres is trying to turn the Aura Frames war story into a product so the next team does not have to bleed for the same lesson.

There is a second Postgres story this week, and it is about deleting data rather than scaling it. A large DELETE in Postgres is more expensive than it looks. Every removed row becomes a dead tuple that vacuum has to reclaim, and the space is not even returned to the operating system without a VACUUM FULL. DROP TABLE and TRUNCATE, by contrast, have a cost that is largely independent of how much data they remove. The recommended escape is date-based partitioning, so a recurring purge becomes an occasional DROP TABLE instead of a vacuum marathon.

A large delete is like clearing out a fridge by wiping down each shelf item by item while the door hangs open. Dropping a partition is pulling out the whole drawer and rinsing it in one motion. Same goal, wildly different cost, and the difference compounds at volume.

Postgres at volume is no longer a single-database problem; it is a distributed-systems problem, and the ecosystem is finally building the parts to admit that. Materialize implemented transaction commit and rollback as a SQL view maintained by incremental view maintenance, and the whole thing resolves in about 30 milliseconds.

Apache Hudi 1.2 added a native VECTOR type with built-in similarity search, alongside BLOB and VARIANT, and Lance file support. Milvus published Loon, a versioned manifest storage engine, as part of its 3.0 beta. DuckDB shipped v1.5.4 and v1.4.5 LTS on the same day, with v2.0.0 targeted for autumn 2026, and Apache DataFusion 54.0.0 added LATERAL joins and SQL lambda functions. None of these is the Postgres story directly, but they rhyme with it: the workloads that used to live in separate specialist systems are moving inside the engines we already run.

A year ago the answer to Postgres scale was “shard it yourself and pray.” This week it is a named project from someone who has done this before, validated by a team that learned it the hard way on the worst possible night.

What to watch next week: whether Multigres puts a credible date on actual sharding, since pooling and failover are table stakes and the shard router is the real test. Keep an eye on whether more teams publish their primary-to-shard migration numbers as Aura Frames did, because that is how the next team gets a playbook to copy. And watch the vector-in-engine race, where Hudi and Milvus just raised the stakes.

🔬 Deep Dives

Multigres v0.1 Alpha: an operating system for Postgres (8 minute read)

Multigres released its v0.1 alpha, an open-source horizontal-scaling layer for PostgreSQL led by Sugu Sougoumarane, co-creator of Vitess. The alpha ships connection pooling through a multigateway and multipooler architecture, consensus-based automatic failover, pgBackRest-driven backups, and a Kubernetes operator, and defers sharding to a later release.

From Christmas outage to 226,000 transactions per second: an Aura Frames Postgres scaling retrospective (12 minute read)

Andrew Atkinson documents how Aura Frames re-architected its PostgreSQL infrastructure on Amazon RDS after a three-hour Christmas 2024 outage caused by write-ahead log overflow tied to replication slots on RDS Postgres 14.1. The team sharded its ten highest-write tables, expanding from one primary to eight and from 192 to 896 vCPU, and reports a Christmas 2025 peak of 226,000 transactions per second.
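For readers who have not hand-rolled shard logic, here is a minimal sketch of what routing a row to one of eight primaries can look like. The summary above does not say how Aura Frames maps rows to shards, so the hash-modulo scheme, the key format, and the hostnames below are all assumptions made for illustration.

```python
import hashlib

# The source reports eight primaries; the routing scheme itself is hypothetical.
NUM_SHARDS = 8


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a sharding key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Read the first 8 bytes as an unsigned integer so the result is
    # identical on every run and every host (unlike Python's built-in hash()).
    return int.from_bytes(digest[:8], "big") % num_shards


def dsn_for(key: str) -> str:
    """Return a made-up connection string for the shard that owns this key."""
    return f"postgresql://app@shard-{shard_for(key)}.example.internal/app"
```

The catch with plain modulo routing is that changing the shard count remaps most keys, which is one reason this kind of plumbing tends to grow into a project of its own.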

Transaction processing in the data plane (12 minute read)

Materialize describes implementing transaction commit and rollback logic as a SQL view maintained by incremental view maintenance, which resolves in about 30 milliseconds. The post includes SQL implementations, benchmarks, and an appendix on optimizations.

Why we built Loon, a storage engine for AI data that never stops changing (11 minute read)

Milvus published an engineering post on Loon, a versioned, manifest-based storage engine that backs Milvus 3.0 (in beta) and Zilliz Vector Lakebase, replacing the older segment binlog storage. Loon stores scalar fields as Parquet and dense vectors as Vortex within row-ID-aligned column groups, with a manifest tracking the files, formats, delete logs, statistics, and indexes for each dataset version.

A metadata benchmark of Lance, Delta Lake, and Iceberg on S3 (11 minute read)

This benchmark compares metadata handling across Lance, Delta Lake, and Iceberg on Amazon S3 and S3 Express, measuring sequential append commit latency, metadata load latency, and concurrent commit throughput. It describes Lance’s compact protobuf snapshot design and reverse lexicographical manifest ordering for version discovery through bounded listing.
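The reverse-ordering trick is easier to see in code. The sketch below assumes manifests are named by subtracting the version from the largest unsigned 64-bit integer and zero-padding the result; that naming detail and the function names are illustrative assumptions, not taken from the benchmark post.

```python
U64_MAX = 2**64 - 1


def manifest_name(version: int) -> str:
    """Name a manifest so that newer versions sort first lexicographically."""
    if not 0 <= version <= U64_MAX:
        raise ValueError("version must fit in an unsigned 64-bit integer")
    # Zero-pad to 20 digits (the width of U64_MAX) so string order
    # matches numeric order of the inverted version.
    return f"{U64_MAX - version:020d}.manifest"


def version_of(name: str) -> int:
    """Recover the version number from a manifest name."""
    return U64_MAX - int(name.split(".", 1)[0])


def latest_version(listing: list[str]) -> int:
    """Simulate a bounded listing: only the first key in sort order is needed."""
    first_key = min(listing)  # an object store listing would return this key first
    return version_of(first_key)
```

Because object stores list keys in ascending lexicographic order, a listing capped at one result is enough to find the newest version under this scheme, however many older manifests exist.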

Apache Hudi 1.2 brings a native VECTOR type and Lance file support to the lakehouse (6 minute read)

Apache Hudi 1.2 introduces a native VECTOR logical type with built-in similarity search on Hudi tables, plus new BLOB and VARIANT types for binary objects and semi-structured data under transactional guarantees. The release adds Lance file format support alongside Parquet, ORC, and HFile.

Disk is the data plane: Flight Shuffle in Daft (8 minute read)

The post describes how Daft rebuilt its distributed shuffle around Apache Arrow Flight, local disk, and streaming reads to handle multi-terabyte workloads. The design spills directly to NVMe disk and uses binary transfer, compression, and multi-threaded parallelism to scale beyond memory limits.

How Edgar Allan Poe found bugs in Turso (7 minute read)

Turso engineer Mikael Francoeur described using LLM coding agents for differential testing of the company’s SQLite-compatible database against SQLite, comparing query results as a fitness function. The post details assigning the agents personas to break repetitive exploration cycles and reports one agent finding three novel bugs within five minutes after a persona was assigned.
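Differential testing itself fits in a few lines. This is a minimal sketch rather than Turso's harness: both engines here are in-memory SQLite databases from Python's standard library, and the `Engine` type, `sqlite_engine`, and `differential_test` names are invented for the example. In practice the candidate would be the database under test.

```python
import sqlite3
from typing import Callable, Iterable

# An engine is anything that takes a SQL query and returns a list of rows.
Engine = Callable[[str], list]


def sqlite_engine(setup_sql: str) -> Engine:
    """Build an engine backed by an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(setup_sql)

    def run(query: str) -> list:
        # Sort by repr so result sets without ORDER BY compare equal.
        return sorted(conn.execute(query).fetchall(), key=repr)

    return run


def differential_test(
    reference: Engine, candidate: Engine, queries: Iterable[str]
) -> list[str]:
    """Return the queries whose results differ between the two engines."""
    mismatches = []
    for query in queries:
        if reference(query) != candidate(query):
            mismatches.append(query)
    return mismatches
```

The agents' job in the post is the part this sketch leaves out: generating queries interesting enough to make the two result sets disagree.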

🚀 Launches & Tools

Announcing DuckDB 1.5.4 (Variegata) and 1.4.5 LTS (Andium) (4 minute read)

DuckDB released v1.5.4 (Variegata) alongside the v1.4.5 LTS (Andium) maintenance release on the same day, with bug fixes spanning VARIANT casting under filters, MERGE INTO binding, and crashes in Arrow GeoArrow serialization. The release adds Parquet statistics pruning improvements and decompression security hardening, with v2.0.0 planned for autumn 2026.

Apache DataFusion 54.0.0 (6 minute read)

DataFusion 54.0.0 adds LATERAL joins, SQL lambda functions (including an array_transform UDF), and a new Avro reader, along with performance improvements in join, scan, and query planning. The release represents roughly 11 weeks of development and about 740 commits.

What’s new in Unity Catalog at Data + AI Summit 2026: External Lineage GA and Governance Hub (8 minute read)

At Data + AI Summit 2026, Databricks announced Unity Catalog updates including External Lineage reaching general availability across upstream sources and downstream BI reports, and a Governance Hub for data stewards entering private preview. Additional features include tag propagation for governed tag inheritance, attribute-based access control grant policies in beta, and column-level Table Insights at general availability.

Airbyte ships Destination Redshift 4.0 with Direct Loading and Speed Mode (6 minute read)

Airbyte released Destination Redshift 4.0, which adds a Speed Mode that replaces JSON over stdio with Protobuf over Unix domain sockets, Direct Loading that removes intermediate raw tables, and a move to the Bulk CDK architecture. The post reports benchmark throughput rising from 8.36 to 38.27 MB/s on 1 GB datasets.

Feast 0.64.0 (4 minute read)

Feast 0.64.0 introduces an Apache Flink compute engine for feature engineering and a data quality monitoring capability with native compute, multi-backend support, a REST API, and a CLI. The release also adds a first-class LabelView, a Feast and MLflow integration, and mTLS for remote registry gRPC clients.

📈 Opinions & Advice

The only scalable delete in Postgres is DROP TABLE (8 minute read)

PlanetScale’s Tom Pang argues that large DELETE statements add work to Postgres because each removed row becomes a dead tuple the vacuum process must reclaim, and that space is not returned to the operating system without a VACUUM FULL. The piece contrasts this with DROP TABLE and TRUNCATE, whose cost is largely independent of data size, and recommends date-based partitioning to convert recurring deletes into occasional DROP TABLE operations.
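To make the partitioning advice concrete, here is a small sketch that generates the PostgreSQL statements for monthly partitions. It assumes a parent table already declared with PARTITION BY RANGE on a date column; the table name `events`, the `events_YYYY_MM` naming convention, and the helper functions are hypothetical and not from the article.

```python
from datetime import date


def month_start(d: date) -> date:
    """First day of the month containing d."""
    return d.replace(day=1)


def next_month(d: date) -> date:
    """First day of the month after the one containing d."""
    return date(d.year + 1, 1, 1) if d.month == 12 else date(d.year, d.month + 1, 1)


def create_partition_sql(parent: str, month: date) -> str:
    """DDL for one monthly partition of a range-partitioned parent table."""
    start = month_start(month)
    end = next_month(start)
    # Range bounds are inclusive at FROM and exclusive at TO.
    return (
        f"CREATE TABLE {parent}_{start:%Y_%m} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{start}') TO ('{end}');"
    )


def drop_partition_sql(parent: str, month: date) -> str:
    """DDL that purges a whole month by dropping its partition."""
    return f"DROP TABLE {parent}_{month_start(month):%Y_%m};"
```

For example, `drop_partition_sql("events", date(2025, 12, 5))` yields `DROP TABLE events_2025_12;`, which replaces a month-sized DELETE with a single statement. The helpers interpolate identifiers directly, so they are only suitable for trusted table names.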

How to make the architectural case for Dagster (7 minute read)

Dagster’s James Dzidek presents an orchestration maturity model describing a shift from job-centric to asset-centric architectures. The post argues that an asset-aware approach makes data dependencies, freshness, quality, and lineage explicit at enterprise scale.

YAML vs Python workflows: which is better for orchestration? (6 minute read)

Kestra’s Elliot Gunn argues that YAML suits defining workflow structure and coordination while Python suits execution logic and data transformation. The post recommends separating the two concerns, using YAML for workflow definitions and Python for task implementation.

💎 Gems & Repos

lakeFS

lakeFS is an open-source data versioning layer that brings Git-style branching, committing, and merging to object storage such as S3, GCS, and Azure Blob. It enables isolated experimentation and rollback on data lakes.

Arroyo

Arroyo is a distributed stream processing engine written in Rust that runs stateful SQL queries over real-time sources such as Kafka, with exactly-once semantics and no JVM dependency.

⚡ Quick Links

Apache Flink 2.1.3 (3 minute read) The Apache Flink community released 2.1.3, the third bug-fix release of the 2.1 series, with five bug fixes plus vulnerability fixes. Fixes include a MiniBatchGroupAggFunction record-loss bug and Google Cloud Storage connectivity issues.
TimescaleDB 2.28.0 (4 minute read) TimescaleDB 2.28.0 speeds up first() and last() queries on compressed data by reading aggregates from columnstore metadata and adds vectorized evaluation of CASE expressions on compressed data. It is the final minor release to support PostgreSQL 15.
SQLMesh 0.235.4 (3 minute read) SQLMesh 0.235.4 adds StarRocks as a new engine adapter and an override for the DuckLake data path in DuckDB options. The release also adds a cron_tz setting to model defaults for timezone-aware scheduling and an --environment flag for the janitor command.
Great Expectations 1.18.1 (2 minute read) Great Expectations 1.18.1 is a maintenance release whose primary change is a bug fix to HTML-escape regex angle brackets in Data Docs rendering. It follows 1.18.0, which raises an error on CloudDataContext construction as part of the GX Cloud shutdown.

That is the week in data. See you next Thursday 👋

🤓 You can connect with me on LinkedIn.

Data Principal
