Releases · estuary/flow

v0.4.0

This release introduces a number of big changes in different areas, including:

Schema evolution
Inferred schema handling
Flowctl
Re-using old spec names
General control-plane operation

Schema evolution

Schema evolution in streaming systems is hard. When we first released Flow, the approach to schema evolution was one of its "killer features" because we were able to validate type-compatibility of heterogeneous pipelines (e.g. Postgres->BigQuery) end-to-end. But detecting incompatible schema changes is one thing, and deciding what to do about them is another. Real-life data pipelines have many complex requirements, which clearly can't be handled by the one evolveIncompatibleCollections boolean we had on capture specs. People wanted more control over how our automation responds to incompatible schema changes.

So we're introducing a new onIncompatibleSchemaChange field on materialization specs, which allows you to configure how the system responds when incompatible schema changes are detected. You can specify onIncompatibleSchemaChange at the top level of a materialization spec, and/or as part of each binding. The top-level property serves as a default for any binding that does not set its own onIncompatibleSchemaChange. It has four possible values:

backfill (default if unspecified): increment the backfill counter of affected bindings, which re-creates the destination resources to fit the new schema and backfills them.
disableBinding: disable the affected bindings. A human will need to re-enable them and decide how to resolve the incompatible fields.
disableTask: disable the entire materialization. A human will need to re-enable it and decide how to resolve the incompatible fields.
abort: don't take any automated action. A human will need to decide what to do.

These behaviors apply only when an automated action observes an incompatible schema change. If you're making changes manually via the UI, onIncompatibleSchemaChange is ignored.
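
The defaulting rule described above (a binding's own setting wins, then the top-level setting, then backfill) can be sketched in a few lines of Python. This is only an illustration: it assumes the materialization spec has been loaded into a plain dictionary, and the helper name effective_policy and the sample collection names are hypothetical, not part of Flow.

```python
# Hypothetical helper, not part of Flow: resolve which policy applies to a binding.
VALID_POLICIES = {"backfill", "disableBinding", "disableTask", "abort"}
DEFAULT_POLICY = "backfill"  # the documented default when nothing is specified


def effective_policy(materialization: dict, binding: dict) -> str:
    """Return the binding's own setting, else the top-level one, else the default."""
    policy = (
        binding.get("onIncompatibleSchemaChange")
        or materialization.get("onIncompatibleSchemaChange")
        or DEFAULT_POLICY
    )
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown onIncompatibleSchemaChange value: {policy!r}")
    return policy


# Illustrative spec fragment; the collection names are made up.
spec = {
    "onIncompatibleSchemaChange": "disableBinding",
    "bindings": [
        {"source": "acmeCo/orders", "onIncompatibleSchemaChange": "abort"},
        {"source": "acmeCo/customers"},
    ],
}

policies = {b["source"]: effective_policy(spec, b) for b in spec["bindings"]}
print(policies)
```

Here the first binding keeps its own abort setting, while the second falls back to the top-level disableBinding.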

Note: You won't see onIncompatibleSchemaChange in the main UI yet, but it can now be set using flowctl or the "Advanced specification editor".

Note: With the introduction of onIncompatibleSchemaChange, the behavior of the existing evolveIncompatibleCollections field of captures no longer makes much sense. For the very short term, that behavior will remain unchanged. But soon we will seek to greatly simplify it. Today, that one boolean, on the capture spec, controls how the system responds to incompatible schema changes in any of the captured collections. In the future, evolveIncompatibleCollections will only pertain to collections that need to be re-created entirely. In other words, its meaning will be "re-create collections as necessary in order to publish them". In practice, this would only ever be required if you change either the key of the collection or the logical partitioning configuration.

Inferred schema handling

As a user, it's hard to get direct visibility into what the inferred schema of a collection is at any given moment. That's all changing, because now we're moving to an approach where the inferred schema gets added directly to your collection specs. The inferred schema gets added under $defs with a key of flow://inferred-schema, so it's still possible to customize other parts of the read schema, just as you would have before. The difference is that you can now see the inferred schema that's being used for each collection.

But that's not the only difference, because you can now use inferred schemas with derivations, too! To do so, include "$ref": "flow://inferred-schema" as part of the collection's readSchema, as you would for any other collection. Our automation will periodically update the collection spec to inline the actual inferred schema as it notices it changing.
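
To make the resulting shape more concrete, here is a rough sketch of a read schema before and after inlining, using Python dictionaries in place of the spec. The flow://inferred-schema key under $defs comes from the description above; the helper names, the allOf composition, and the sample inferred schema are assumptions for illustration, and the real automation may produce a somewhat different document.

```python
import copy

INFERRED_KEY = "flow://inferred-schema"


def references_inferred_schema(schema) -> bool:
    """True if any "$ref" in the schema points at the inferred schema."""
    if isinstance(schema, dict):
        if schema.get("$ref") == INFERRED_KEY:
            return True
        return any(references_inferred_schema(v) for v in schema.values())
    if isinstance(schema, list):
        return any(references_inferred_schema(v) for v in schema)
    return False


def inline_inferred_schema(read_schema: dict, inferred: dict) -> dict:
    """Return a copy of read_schema with the inferred schema placed under $defs."""
    updated = copy.deepcopy(read_schema)
    updated.setdefault("$defs", {})[INFERRED_KEY] = copy.deepcopy(inferred)
    return updated


# Hypothetical read schema: the inferred schema plus one hand-written constraint.
read_schema = {
    "allOf": [
        {"$ref": INFERRED_KEY},
        {"required": ["id"]},
    ]
}
# Hypothetical inferred schema, standing in for what the system observed.
inferred = {"type": "object", "properties": {"id": {"type": "integer"}}}

if references_inferred_schema(read_schema):
    read_schema = inline_inferred_schema(read_schema, inferred)
print(read_schema)
```

The hand-written parts of the read schema are left untouched; only the $defs entry changes as the inferred schema evolves.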

Lastly, we're introducing a more aggressive heuristic for inferred schema updates. Collections that have more frequent inferred schema updates will be checked much more frequently, and inferred schemas that have gone a while without any updates will be checked somewhat less frequently, up to a maximum interval of every 2 hours.
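
As a loose illustration of this kind of heuristic, the sketch below shrinks the check interval sharply when an update is observed and grows it gently otherwise, capped at 2 hours. Only the 2 hour maximum comes from the text above. The function name, the 1 minute floor, and the shrink and growth factors are assumptions chosen for the example and do not describe the actual implementation.

```python
from datetime import timedelta

MAX_INTERVAL = timedelta(hours=2)    # maximum stated in the release notes
MIN_INTERVAL = timedelta(minutes=1)  # assumed floor, purely illustrative


def next_check_interval(current: timedelta, schema_changed: bool) -> timedelta:
    """Pick the delay before the next inferred schema check (hypothetical)."""
    if schema_changed:
        candidate = current / 4    # assumed: check much more often after a change
    else:
        candidate = current * 1.5  # assumed: back off somewhat when nothing changed
    return max(MIN_INTERVAL, min(MAX_INTERVAL, candidate))


# Simulate a collection whose schema changes once and then stays stable.
interval = timedelta(minutes=40)
history = []
for changed in [True, False, False, False]:
    interval = next_check_interval(interval, changed)
    history.append(interval)
print(history)
```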

Flowctl changes

All flowctl users will need to upgrade to the latest release in order to maintain compatibility.

In addition, there's some new behavior in flowctl to help prevent accidentally overwriting changes to specs. Flowctl will now set the expectPubId property whenever you run catalog pull-specs. This property contains the id of the publication that most recently modified the spec. When publishing, we return an error if the spec has been published again since the publication identified by expectPubId. If this happens, you'll need to run catalog pull-specs again to get the freshest copy of the spec, then re-apply your changes. This is especially important now that we inline inferred schemas as part of collection specs, as it prevents users from accidentally publishing an outdated inferred schema.
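
The check is a form of optimistic concurrency control, and a minimal model of it might look like the following. The expectPubId property name comes from the text above; the function, the exception class, the publication id format, and the way live state is represented are all hypothetical.

```python
class PublicationConflict(Exception):
    """Raised when a spec was published again after the local copy was pulled."""


def check_expect_pub_id(name: str, draft: dict, live_last_pub_ids: dict) -> None:
    """Reject the draft if its expectPubId no longer matches the live spec."""
    expected = draft.get("expectPubId")
    if expected is None:
        return  # no expectation recorded, so there is nothing to compare
    actual = live_last_pub_ids.get(name)
    if actual != expected:
        raise PublicationConflict(
            f"{name}: expected publication {expected}, but the last publication "
            f"is {actual}; run catalog pull-specs again and re-apply your changes"
        )


# Made-up publication ids; real ids will look different.
live = {"acmeCo/my-collection": "pub-0002"}
fresh_draft = {"expectPubId": "pub-0002"}
stale_draft = {"expectPubId": "pub-0001"}

check_expect_pub_id("acmeCo/my-collection", fresh_draft, live)  # passes silently

try:
    check_expect_pub_id("acmeCo/my-collection", stale_draft, live)
    conflict = None
except PublicationConflict as err:
    conflict = str(err)
print(conflict)
```

The stale draft is rejected rather than silently overwriting whatever the newer publication changed.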

Re-using old spec names

Previously, our control plane would prevent you from re-using a name that you'd used before, even after deleting the original specs. This was because we used the spec names as the storage prefix in cloud storage buckets, so we couldn't be sure that a new collection would be starting out with an empty storage prefix if it had the same name as a previously deleted one. Now, we add a unique alphanumeric path segment to the cloud storage path for each journal, like acmeCo/my-collection/112233445566abcd/. If you delete acmeCo/my-collection, you can now create another collection with the same name, and it will have a different alphanumeric suffix. The previous naming restriction was a common source of annoyance, so we're glad to finally get this working in a way that's much more in line with user expectations.

Note that cloud storage paths for existing collections and task recovery logs will remain unchanged. The suffix will only be added for new specifications.
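
The effect on storage paths can be pictured with a small sketch. The release notes do not say how the unique segment is generated, so the random 16-character hex token below is only a stand-in, and new_storage_prefix is a hypothetical name.

```python
import secrets


def new_storage_prefix(spec_name: str) -> str:
    """Build a storage prefix with a unique segment after the spec name."""
    suffix = secrets.token_hex(8)  # 16 hex characters, similar in shape to the example above
    return f"{spec_name}/{suffix}/"


first = new_storage_prefix("acmeCo/my-collection")
# After deleting the collection and creating another with the same name:
second = new_storage_prefix("acmeCo/my-collection")
print(first)
print(second)
```

Because each new spec gets its own segment, the re-created collection starts from an empty prefix even though the name is reused.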

General control-plane operation

These changes are grouped together because they were all enabled by the same fundamental changes to the code that handles publications and background automations.

We've made publications faster and more reliable by minimizing the tasks that get re-validated as part of a given publication. For example, if you publish a materialization, we no longer re-validate other materializations that happen to source from the same collections. And we now update the data-plane shard/journal specs (that represent the actual work/data of your pipelines) asynchronously, after the publication has committed. This keeps the UI faster, and also allows our data-plane updates to be more reliable.

Finally, we introduced a new internal framework for writing background automations. This is what has enabled the changes to inferred schema handling, schema evolution, and our asynchronous shard/journal spec updates. We're looking forward to many more features that are enabled by this framework.

v0.3.13

What's Changed

sum annotation now supports arbitrary precision using string-encoded numerics
Add experimental flowctl raw stats sub-command
Various minor JSON Schema handling improvements
Switch to simd-json for fast JSON parsing and transcoding

Filtered PRs impacting `flowctl`:

crates/json: don't validate strings with underscores as integers or numbers by @williamhbaker in #1364
Update runtime::container::start() to take a new allow_local flag by @jshearer in #1361
json: fix ordering of integers greater than i64::MAX by @psFried in #1367
validation: fix bucket name validation for GCS and Azure by @psFried in #1370
thread through --allow-local argument when running locally by @psFried in #1374
validation: allow unsatisfiable constraints on excluded fields by @psFried in #1375
update a number of dependencies, including RocksDB (to 8.10) by @jgraettinger in #1389
connector-init: set connector_type on protocol check Spec by @jgraettinger in #1400
models/journals: region configuration for S3 storage mappings by @williamhbaker in #1410
improve schema validation errors by including metadata about the collection that failed by @jgraettinger in #1408
flowctl: resurrect stats subcommand under raw by @psFried in #1432
make: codesign binaries on mac by @mdibaiee in #1436
simd-doc, gazette, avro, and dekaf crates by @jgraettinger in #1448
flowctl(preview): multiple bindings may read from one collection by @mdibaiee in #1466
crates/doc: support arbitrary precision with sum annotation by @jgraettinger in #1477
crates/doc: relax sum inspection to allow numeric strings by @jgraettinger in #1481

Full Changelog: v0.3.12...v0.3.13

