Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API error in nested resource block can modify state unexpectedly #631

Open
Integralist opened this issue Jan 9, 2023 · 0 comments
Open
Labels

Comments

@Integralist
Copy link
Collaborator

Integralist commented Jan 9, 2023

Problem

When creating a service resource (e.g. fastly_service_vcl), if any of the API calls (that are made to create/update the associated nested block's within the resource) happen to fail, then the user's state can show as being incorrect (i.e. showing the 'planned changes' even though they were never actioned by the API/Terraform).

NOTE: We should look at ways to improve this situation, and although this is not ideal behaviour, it's not 'breaking the world' currently (if it was, then we'd be seeing more reports of this as a critical/blocking problem for customers).

Context

The reason this issue occurs is that fundamentally the Fastly Terraform provider appears to violate the programming model that Terraform expects.

Although there is no explicit documentation on how to develop a provider whose resources are more complex, from various sources, I've been informed that Terraform is designed to work with a single resource mapping to a single API call.

Fastly's fastly_service_vcl and fastly_service_compute resources are not a 1:1 mapping with the Fastly API.

These single resources are actually responsible for multiple API entities, and so when any of the nested entity API calls fail it causes the user's state to get messed up (as Terraform will apply the planned changes to the user's state regardless of any errors reported).

This is the way the Fastly provider has been since day one, and I don't think it has ever really been addressed. For the most part, customers occasionally stumble into this and get confused by the behaviour and then figure out workarounds (that has been the feedback I've received).

NOTE: For additional complexity, the Fastly provider blurs the lines between the CRUD (Create, Read, Update, Delete) operations because it doesn't create any of the nested blocks/entities via the resource's Create operation but as part of its Update. Making the reasoning around possible solutions more difficult.

Solution(s)

Option 1

The simplest workaround is for the user to run terraform refresh which will update the state to match the remote system. Now this should probably work in most cases but I can't predict every conceivable scenario that might occur, so it's hard to say for certain if it works every time. Not ideal that the user has to deal with this though.

Option 2

A programmatic solution is to take a copy of the nested block data, then use d.Set to apply a zero value, only then using d.Set to apply the planned changes if the associated API call(s) are successful. If any API call is not successful, then the provider will error and only apply the partially constructed changes to the state. Anything that failed can be reset using the copy of the originally planned data or prior state (if prior data exists). How well this approach works for both Create and Update scenarios remains to be validated (for example, it seems to work in a simple reduced test case but in more complex situations and in environments like CI, I couldn't say for sure how well this actually works).

Or a similar approach might be to blanket re-apply original data...

diff --git a/fastly/block_fastly_service_domain.go b/fastly/block_fastly_service_domain.go
index 0abffed5..9c7fbd3a 100644
--- a/fastly/block_fastly_service_domain.go
+++ b/fastly/block_fastly_service_domain.go
@@ -54,6 +54,9 @@ func (h *DomainServiceAttributeHandler) GetSchema() *schema.Schema {
 
 // Create creates the resource.
 func (h *DomainServiceAttributeHandler) Create(_ context.Context, d *schema.ResourceData, resource map[string]any, serviceVersion int, conn *gofastly.Client) error {
+	// Take copy of the state so in case of error we can restore it.
+	backupState := d.Get(h.GetKey()).(*schema.Set).List()
+
 	opts := gofastly.CreateDomainInput{
 		ServiceID:      d.Id(),
 		ServiceVersion: serviceVersion,
@@ -67,6 +70,9 @@ func (h *DomainServiceAttributeHandler) Create(_ context.Context, d *schema.Reso
 	log.Printf("[DEBUG] Fastly Domain Addition opts: %#v", opts)
 	_, err := conn.CreateDomain(&opts)
 	if err != nil {
+		if err := d.Set(h.GetKey(), backupState); err != nil {
+			log.Printf("[WARN] Error restoring Domains for (%s): %s", d.Id(), err)
+		}
 		return err
 	}
 	return nil

Option 3

Alternatively, should we do as is commonly recommended: redesign the provider to have a 1:1 mapping with its API (so instead of lots of nested blocks we have separate resources for things like domains and backends). The problem with this approach is I don't think it's possible with Fastly's data model.

For example, how does a 'VCL service' get activated? We need a service to exist for a domain, backend, etc to be added to it via our API and so we'd need to have each separate resource depend_on the fastly_service_vcl resource whose Create operation would have to not activate the service. Now we need to understand how we'd then 'activate' the VCL service.

It might work in a non-automated sense if you had an activate attribute set to false for the initial terraform apply and then switch it to true to activate the service. In an automated approach we'd need something like an 'activation' resource which used depend_on to reference all the other resources that were depending on the fastly service resource. This would be tedious and likely require a very large chain of resources like...

fastly_service_vcl <- depended on by -> fastly_domain/fastly_backend/...etc <- depended on by -> fastly_activation

Option 4

Fastly could create an API that accepts a single chunk of data. The Fastly provider could then mould the Terraform configuration into the shape required for a single API call.

Historical Research

I've kept the following comments/update notes for posterity...


I've replicated an issue reported by a customer (#629) with regard to missing state values when there is an API error.

An example of this issue is to define and apply a cache_setting block within a fastly_service_vcl. Then in a separate step, add a snippet block with an invalid name attribute (e.g. name = "/invalid") and also delete the cache_setting block.

When you try to apply these two changes the provider will first process the snippet before processing the cache_setting block as snippet is defined first in the provider logic (while cache_setting is defined after it).

The snippet Create function logic will cause an API error to be returned, and consequently will stop the Terraform provider from processing the cache_snippet deletion.

The problem is that the Terraform state file reflects the final config changes (as if they'd both been successfully processed).

When running the second apply with TF_LOG=TRACE there is no reported call into the cache_setting resource (nor any API call for it made), so it's weird that Terraform deletes the cache_setting resource from the state as the provider shouldn't have reached that step once the error occurred.

I opened a discussion on the HashiCorp forum as I was unsure of the root cause of this issue and it was suggested that there is a strange behaviour in the Terraform v2 SDK that can cause this type of state modification to occur...

Your code is incredibly complex, and I couldn’t figure out what was going on… but what you said made me think of some other weird behaviour I have seen, and I was successfully able to mock up a dummy provider that reproduced the issue.

It appears that when a resource Update function returns a Diagnostics containing an error, even though that error is reported to the user, Terraform SDKv2 is still committing the planned change to the state !!!

This feels like a massive bug to me.

It is supported by this comment in the Terraform v2 SDK...

Although confusing, it has been discovered that during an update when an error is returned, the proposed config is set into state, even without any calls to d.Set.

This was also reported as a bug officially here and was fixed in the new framework that supersedes the v2 SDK here.

So until we migrate the Fastly Terraform Provider to the new framework we'll need to consider implementing the use of Partial to see if this helps avoid the issue.

UPDATE 1

It seems the use of Partial only works for Update operations and not Create or Delete operations.

UPDATE 2

Seems there are still issues open that are related to bug that are tripping up provider authors (1, 2).

UPDATE 3

What makes our provider more complicated is that we are calling Update at the end of the Create operation, and it's actually the Update operation that then triggers the 'Create' for the nested resource blocks. This makes reasoning about how to handle errors in the nested resource blocks a lot harder (especially when it comes to partial state application).

UPDATE 4

I've opened a separate discussion on the HashiCorp forum regarding error handling when a 'resource' isn't a 1:1 mapping to its API.

Tried the following diff to force a re-run of the Read:

diff --git a/fastly/base_fastly_service.go b/fastly/base_fastly_service.go
index 8e3aeea7..1a51ce06 100644
--- a/fastly/base_fastly_service.go
+++ b/fastly/base_fastly_service.go
@@ -10,6 +10,7 @@ import (
 	"time"
 
 	gofastly "github.com/fastly/go-fastly/v7/fastly"
+	"github.com/hashicorp/go-multierror"
 	"github.com/hashicorp/terraform-plugin-sdk/v2/diag"
 	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/customdiff"
 	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
@@ -346,6 +347,9 @@ func resourceServiceUpdate(ctx context.Context, d *schema.ResourceData, meta any
 			}
 		}
 
+		// Wrap any nested resource block errors
+		var result error
+
 		// This delegates the bulk of processing to attribute handlers which manage state
 		// for their own attributes.
 		for _, a := range serviceDef.GetAttributeHandler() {
@@ -355,16 +359,23 @@ func resourceServiceUpdate(ctx context.Context, d *schema.ResourceData, meta any
 					if errors.Is(err, context.Canceled) {
 						return nil
 					}
-
 					return diag.FromErr(err)
 				}
 
 				if err := a.Process(ctx, d, latestVersion, conn); err != nil {
-					return diag.FromErr(err)
+					result = multierror.Append(result, err)
 				}
 			}
 		}
 
+		if result != nil {
+			// IMPORTANT: API errors can still apply terraform diff to state.
+			// https://github.com/fastly/terraform-provider-fastly/issues/631
+			// To workaround this bug we need to re-run the Read operation.
+			resourceServiceRead(ctx, d, meta, serviceDef)
+			return diag.FromErr(result)
+		}
+
 		// Validate version.
 		log.Printf("[DEBUG] Validating Fastly Service (%s), Version (%v)", d.Id(), latestVersion)
 		valid, msg, err := conn.ValidateVersion(&gofastly.ValidateVersionInput{

UPDATE 5

OK, so after asking on Stack Overflow I got a detailed response explaining that we're essentially violating the programming model of Terraform by not having a resource that maps 1:1 with a single API.

The state modification that's occurring is expected (as far as Terraform is concerned) as the planned values provided to each action function (Create, Update etc) are mutable and Terraform expects those to be set even if an error is returned.

The proposed workaround to support the programming model we're using (a one-to-many relationship within our resources), we need to store off the planned values provided and then use d.Set to remove the planned values provided (so that when the function returns the state will show no nested blocks). Only then should you use d.Set to add back in the blocks that have actually been successfully created.

@Integralist Integralist added the bug label Jan 9, 2023
@Integralist Integralist changed the title API error in nested resource block can unexpected modify state API error in nested resource block can modify state unexpectedly Jan 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant