Releases: estuary/flow

v0.5.3

19 Sep 18:47

What's Changed

  • Batch draft spec upserts, rather than attempting to paginate the response by @jshearer in #1629
  • use max build_id for validating local specs by @psFried in #1636
  • Update pagination to use 0-indexed inclusive ranges, and page by response length rather than fixed page size. by @jshearer in #1637
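The pagination change in #1637 can be illustrated with a minimal Python sketch. This is not flowctl's actual code (flowctl is written in Rust); `fetch_range` is a hypothetical callable standing in for a request that returns the rows in a 0-indexed, inclusive range, and `page_size` is an assumed default. The point is that the next offset advances by the number of rows actually returned, so a server that caps responses below the requested size does not cause rows to be skipped.

```python
from typing import Callable, List


def fetch_all(
    fetch_range: Callable[[int, int], List[dict]],
    page_size: int = 1000,  # assumed default, not taken from flowctl
) -> List[dict]:
    """Collect every row by requesting 0-indexed, inclusive ranges."""
    rows: List[dict] = []
    offset = 0
    while True:
        # Inclusive range: [offset, offset + page_size - 1] asks for page_size rows.
        page = fetch_range(offset, offset + page_size - 1)
        if not page:
            break
        rows.extend(page)
        # Advance by the response length, not by the fixed page size.
        offset += len(page)
    return rows
```

The loop stops on the first empty response, which costs one extra request but avoids guessing whether a short page was the last one.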

v0.5.2

13 Sep 18:35

What's Changed

Couple of fixes for flowctl

  • fix: flowctl catalog commands do need pagination after all by @jshearer in #1626
  • .devcontainer/release.Dockerfile: add additional podman packages & newline fix by @jgraettinger in #1625

Full Changelog: v0.5.1...v0.5.2

v0.5.1

05 Sep 15:26

Includes fixes for a few issues (#1590 and #1558) that affected flowctl catalog pull-specs.

v0.5.0

30 Aug 13:09

This release adds preliminary support for the upcoming federated data planes feature 🚀

v0.4.0

24 Jun 13:55
4d74515

This release introduces a number of big changes in different areas, including:

  • Schema evolution
  • Inferred schema handling
  • Flowctl
  • Re-using old spec names
  • General control-plane operation

Schema evolution

Schema evolution in streaming systems is hard. When we first released Flow, the approach to schema evolution was one of its "killer features" because we were able to validate type-compatibility of heterogeneous pipelines (e.g. Postgres->BigQuery) end-to-end. But detecting incompatible schema changes is one thing, and deciding what to do about them is another. Real-life data pipelines have many complex requirements, which clearly can't be handled by the one evolveIncompatibleCollections boolean we had on capture specs. People wanted more control over how our automation responds to incompatible schema changes.

So we're introducing a new onIncompatibleSchemaChange field on materialization specs, which allows you to configure how the system responds when incompatible schema changes are detected. You can specify onIncompatibleSchemaChange at the top level of a materialization spec, and/or as part of each binding. The top-level property serves as a default for any binding that does not set its own onIncompatibleSchemaChange. It has four possible values:

  • backfill (default if unspecified): increment the backfill counter of affected bindings, which re-creates the destination resources to fit the new schema and backfills them.
  • disableBinding: disable the affected bindings. A human will need to re-enable them and decide how to resolve the incompatible fields
  • disableTask: disable the entire materialization. A human will need to re-enable it and decide how to resolve the incompatible fields
  • abort: don't take any automated action. A human will need to decide what to do

These behaviors apply only when an automated action observes an incompatible schema change. If you're making changes manually via the UI, onIncompatibleSchemaChange is ignored.
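The default-and-override rule for onIncompatibleSchemaChange can be sketched in a few lines of Python. This is only an illustration of the rule as described in these notes, not the control plane's implementation; the function names are hypothetical, and the spec is modeled as a plain dict in which only the onIncompatibleSchemaChange and bindings keys matter.

```python
from typing import List, Optional

# The four values listed in the release notes.
VALID_ACTIONS = {"backfill", "disableBinding", "disableTask", "abort"}


def effective_action(task_level: Optional[str], binding_level: Optional[str]) -> str:
    """Binding-level setting wins, then the top-level one, then 'backfill'."""
    action = binding_level or task_level or "backfill"
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown onIncompatibleSchemaChange value: {action!r}")
    return action


def resolve_bindings(materialization: dict) -> List[str]:
    """Return the effective action for each binding of a materialization spec."""
    task_level = materialization.get("onIncompatibleSchemaChange")
    return [
        effective_action(task_level, binding.get("onIncompatibleSchemaChange"))
        for binding in materialization.get("bindings", [])
    ]
```

For example, a spec with a top-level value of disableTask and two bindings, one of which sets abort, resolves to disableTask for the first binding and abort for the second.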

Note: You won't see onIncompatibleSchemaChange in the main UI yet, but it can now be set using flowctl or the "Advanced specification editor".

Note: With the introduction of onIncompatibleSchemaChange, the behavior of the existing evolveIncompatibleCollections field of captures no longer makes much sense. For the very short term, that behavior will remain unchanged. But soon we will seek to greatly simplify it. Today, that one boolean, on the capture spec, controls how the system responds to incompatible schema changes in any of the captured collections. In the future, evolveIncompatibleCollections will only pertain to collections that need to be re-created entirely. In other words, its meaning will be "re-create collections as necessary in order to publish them". In practice, this would only ever be required if you change either the key of the collection or the logical partitioning configuration.

Inferred schema handling

As a user, it's hard to get direct visibility into what the inferred schema of a collection is at any given moment. That's all changing, because now we're moving to an approach where the inferred schema gets added directly to your collection specs. The inferred schema gets added under $defs with a key of flow://inferred-schema, so it's still possible to customize other parts of the read schema, just as you would have before. The difference is that you can now see the inferred schema that's being used for each collection.

But that's not the only difference, because you can now use inferred schemas with derivations, too! To do so, just include "$ref": "flow://inferred-schema" as part of the collection's readSchema, just as you would for any other collection. Our automation will periodically update the collection spec to inline the actual inferred schema as it notices it changing.
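As a rough sketch of what "inlining" means here, the Python below places an inferred schema under $defs with the key flow://inferred-schema, leaving the rest of the read schema untouched. The function name and the example schemas are hypothetical, and the real automation may add further fields; only the $defs key and the $ref value come from these notes.

```python
import copy

INFERRED_KEY = "flow://inferred-schema"


def inline_inferred_schema(read_schema: dict, inferred: dict) -> dict:
    """Return a copy of read_schema with the inferred schema stored under $defs."""
    updated = copy.deepcopy(read_schema)
    # Other $defs entries and the rest of the read schema are preserved.
    updated.setdefault("$defs", {})[INFERRED_KEY] = copy.deepcopy(inferred)
    return updated


# Hypothetical read schema that references the inferred schema.
read_schema = {"allOf": [{"$ref": INFERRED_KEY}]}
# Hypothetical inferred schema, as the automation might observe it.
inferred = {"type": "object", "properties": {"id": {"type": "integer"}}}

inlined = inline_inferred_schema(read_schema, inferred)
```

After the call, the schema is visible at `inlined["$defs"]["flow://inferred-schema"]`, which is the part of the spec these notes say you can now inspect.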

Lastly, we're introducing a more aggressive heuristic for inferred schema updates. Collections that have more frequent inferred schema updates will be checked much more frequently, and inferred schemas that have gone a while without any updates will be checked somewhat less frequently, up to a maximum interval of every 2 hours.
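The notes do not spell out the heuristic itself, so the following Python is purely illustrative of one way such an adaptive interval could behave. Only the 2-hour maximum comes from the text above; the minimum interval and both scaling factors are assumptions chosen to mirror "much more frequently" versus "somewhat less frequently".

```python
from datetime import timedelta

MAX_INTERVAL = timedelta(hours=2)    # stated in the release notes
MIN_INTERVAL = timedelta(minutes=1)  # assumption, not from the release notes


def next_check_interval(current: timedelta, schema_changed: bool) -> timedelta:
    """Shrink the interval sharply after a change, grow it gently otherwise."""
    if schema_changed:
        # Assumed factor: check four times as often after an observed update.
        return max(MIN_INTERVAL, current / 4)
    # Assumed factor: back off by 50% when nothing changed, capped at 2 hours.
    return min(MAX_INTERVAL, current * 1.5)
```

The asymmetry means a collection whose schema is actively changing converges quickly to frequent checks, while a quiet collection drifts slowly toward the 2-hour cap.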

Flowctl changes

All flowctl users will need to upgrade to the latest release in order to maintain compatibility.

In addition, there's some new behavior in flowctl to help prevent accidentally overwriting changes to specs. Flowctl will now set the expectPubId property whenever you run catalog pull-specs. This property contains the id of the publication that most recently modified the spec. When publishing, we return an error if a spec has been published since the expectPubId. If this happens, you'll need to run catalog pull-specs again in order to get the freshest copy of the spec and try your changes again. This is especially important now that we inline inferred schemas as part of collection specs, as it prevents users from accidentally publishing an outdated inferred schema.
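The expectPubId check is a form of optimistic concurrency control. The Python below is a minimal sketch of that idea under stated assumptions: the real check happens in the control plane, publication ids are treated here as opaque strings, and the function and exception names are hypothetical.

```python
from typing import Optional


class StalePublicationError(Exception):
    """Raised when a spec was published after the local copy was pulled."""


def check_expect_pub_id(draft_spec: dict, live_last_pub_id: str) -> None:
    """Reject a draft whose expectPubId no longer matches the live spec."""
    expected: Optional[str] = draft_spec.get("expectPubId")
    # A draft without expectPubId makes no claim about the live version.
    if expected is not None and expected != live_last_pub_id:
        raise StalePublicationError(
            f"spec was last modified by publication {live_last_pub_id}, "
            f"but the draft expected {expected}; pull the specs again and retry"
        )
```

When the error is raised, the remedy is the one described above: run catalog pull-specs again, reapply the change to the fresh copy, and publish.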

Re-using old spec names

Previously, our control plane would prevent you from re-using a name that you'd used before, even after deleting the original specs. This was because we used the spec names as the storage prefix in cloud storage buckets, so we couldn't be sure that a new collection would be starting out with an empty storage prefix if it had the same name as a previously deleted one. Now, we add a unique alphanumeric path segment to the cloud storage path for each journal, like acmeCo/my-collection/112233445566abcd/. If you delete acmeCo/my-collection, you can now create another collection with the same name, and it will have a different alphanumeric suffix. The previous naming restriction was a common source of annoyance, so we're glad to finally get this working in a way that's much more in line with user expectations.

Note that cloud storage paths for existing collections and task recovery logs will remain unchanged. The suffix will only be added for new specifications.
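To make the naming scheme concrete, here is a small Python sketch that builds a storage prefix with a unique path segment. How Flow actually generates the segment is not stated in these notes; using 16 random hexadecimal characters is an assumption that merely matches the shape of the 112233445566abcd example, and the function name is hypothetical.

```python
import secrets


def journal_storage_prefix(catalog_name: str) -> str:
    """Build a storage prefix that is unique per creation of the named spec."""
    # Assumption: 8 random bytes rendered as 16 hex characters.
    suffix = secrets.token_hex(8)
    return f"{catalog_name}/{suffix}/"


first = journal_storage_prefix("acmeCo/my-collection")
# Deleting and re-creating the collection yields a different prefix.
second = journal_storage_prefix("acmeCo/my-collection")
```

Because the two prefixes differ, the re-created collection starts from an empty storage location even though it reuses the name.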

General control-plane operation

These changes are grouped together because they were all enabled by the same fundamental changes to the code that handles publications and background automations.

We've made publications faster and more reliable by minimizing the tasks that get re-validated as part of a given publication. For example, if you publish a materialization, we no longer re-validate other materializations that happen to source from the same collections. And we now update the data-plane shard/journal specs (that represent the actual work/data of your pipelines) asynchronously, after the publication has committed. This keeps the UI faster, and also allows our data-plane updates to be more reliable.

Finally, we introduced a new internal framework for writing background automations. This is what has enabled the changes to inferred schema handling, schema evolution, and our asynchronous shard/journal spec updates. We're looking forward to many more features that are enabled by this framework.

v0.3.13

31 May 18:58

What's Changed

  • sum annotation now supports arbitrary precision using string-encoded numerics
  • Add experimental flowctl raw stats sub-command
  • Various minor JSON Schema handling improvements.
  • Switch to simd-json for fast JSON parsing and transcoding.
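To illustrate why string-encoded numerics help the sum annotation, the Python sketch below adds two numeric strings exactly and returns the result as a string. It is not Flow's implementation (that lives in the Rust crates/doc crate); the function name is hypothetical, and the 1000-digit precision limit is an assumption of this sketch, with the Inexact trap ensuring that a result is never silently rounded.

```python
import decimal
from decimal import Decimal

# Assumed limit for this sketch: up to 1000 significant digits, never rounded.
EXACT = decimal.Context(prec=1000, traps=[decimal.Inexact])


def sum_numeric_strings(lhs: str, rhs: str) -> str:
    """Add two string-encoded numbers without losing precision."""
    # Decimal(str) parses exactly; EXACT.add raises Inexact instead of rounding.
    return str(EXACT.add(Decimal(lhs), Decimal(rhs)))
```

For example, adding "1" to "9223372036854775807" (the largest signed 64-bit integer) yields "9223372036854775808", a value that would overflow an i64 and lose precision as a 64-bit float.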

Filtered PRs impacting flowctl:

  • crates/json: don't validate strings with underscores as integers or numbers by @williamhbaker in #1364
  • Update runtime::container::start() to take a new allow_local flag by @jshearer in #1361
  • json: fix ordering of integers greater than i64::MAX by @psFried in #1367
  • validation: fix bucket name validation for GCS and Azure by @psFried in #1370
  • thread through --allow-local argument when running locally by @psFried in #1374
  • validation: allow unsatisfiable constraints on excluded fields by @psFried in #1375
  • update a number of dependencies, including RocksDB (to 8.10) by @jgraettinger in #1389
  • connector-init: set connector_type on protocol check Spec by @jgraettinger in #1400
  • models/journals: region configuration for S3 storage mappings by @williamhbaker in #1410
  • improve schema validation errors by including metadata about the collection that failed by @jgraettinger in #1408
  • flowctl: resurrect stats subcommand under raw by @psFried in #1432
  • make: codesign binaries on mac by @mdibaiee in #1436
  • simd-doc, gazette, avro, and dekaf crates by @jgraettinger in #1448
  • flowctl(preview): multiple bindings may read from one collection by @mdibaiee in #1466
  • crates/doc: support arbitrary precision with sum annotation by @jgraettinger in #1477
  • crates/doc: relax sum inspection to allow numeric strings by @jgraettinger in #1481

Full Changelog: v0.3.12...v0.3.13

v0.3.12: flowctl: enable self-service task logs

19 Jan 21:52
  • flowctl logs can now be used to read the logs of your tasks.

v0.3.11

07 Dec 14:25

Fixes a panic in flowctl preview, and adds support for the new backfill counter on capture and materialization bindings.

v0.3.9

20 Nov 18:02

flowctl preview has been expanded and is now able to run capture, derivation, and materialization tasks. See flowctl preview --help for details.

flowctl raw capture has been removed, as it's now part of flowctl preview.

v0.3.8

30 Oct 17:34

  • flowctl catalog publish now accounts for changes to a collection's inferred schema. It will now publish collections that have had updates to their inferred schema (if they use schema inference), even if the collection spec itself is unchanged.

  • Fixes a bug where sops or jq would be required, even when not using encrypted endpoint configs