Making Terraform less Terrafying

Akash Nair
7 min read · Aug 13, 2022

If you have been using Terraform for a while to manage your infrastructure, chances are that you started on the right foot but ended up on the wrong one.

You probably tried to do everything by the book, and it looks quite alright: a simple, neat structure for writing reusable Terraform code. Except that it gets worse with every new module or resource you add to your codebase. Your statefile becomes gigantic, the dependency graph needs its own high-end dedicated compute to be generated, and you wait far too long for even a Terraform plan to finish. That’s when you know it’s time to trade one pain point for another: refactoring your Terraform jungle.

Here are some common problems that show up as a Terraform codebase grows:

  1. The pipeline takes forever to run: Terraform plan takes a long time because of the many dependencies created between modules and resources. Making a small change to the node pool can still mean waiting several minutes for planning to finish, and then another few minutes because apply, of course, also has to navigate its way through dependencies and sub-modules!
  2. Cluster creation from scratch no longer works: This is usually one of the main reasons for having IaC in the first place: being able to spin up a duplicate infrastructure setup for development or even disaster recovery. But if you do not test cluster creation regularly while adding more modules and resources, it becomes less and less likely to work, because you have created cyclic dependencies, introduced for_each loops that can’t be computed during the plan, hard-coded a bunch of values, used incompatible default values for some variables, and so on.
  3. Reading the code is a nightmare: Jumping from one module folder to another, from one .tf file to another, from a data block whose for_each refers to a resource block from a different provider. It kills productivity and makes it hard to figure out how everything fits together.

So to get rid of these problems, you can break the code down into different layers and different repositories.

Different Layers

What are these layers? It depends on the kind of homogeneous components you have in your infrastructure, but a general example could be:

  1. Infrastructure: Where you create the Kubernetes cluster, VPC, Databases, etc
  2. CustomResourceDefinitions: Terraform has a problem with CRD dependencies; resources that reference a CRD cannot even be planned until that CRD exists on the cluster
  3. Apps or Helm charts: This would be the final layer in the Apply sequence as these objects not only require the underlying cluster to be present but also the CRDs!

Different Repos

There’s no clear answer as to which is better: multiple repositories or a monorepo for your IaC code. It does make sense to at least break the code down into one “modules” repo and several “clusters” repos that consume from it. A multirepo Terraform approach could look like this:

  1. Modules Repo: A single repository containing all of the modules being used grouped by directories.
  2. Project/Cluster Repos: A group of repositories corresponding to several different clusters with calls to the Modules Repo for importing specific modules

Multi-Layer Terraform Setup

You can separate your Modules into their own individual repos if they are big enough to live on their own. But with the possibility of versioning Modules by using revisions or by publishing versions to a registry, the idea of having several different repositories for each tiny module seems rather unnecessary.
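For example, a cluster repo can pin a module from the Modules Repo to a specific git tag; a minimal sketch, assuming a hypothetical repository URL, module path and input names:

module "cluster" {
  # Pin the shared module to a released tag of the Modules Repo
  source = "git::https://github.com/my-org/terraform-modules.git//infra/cluster?ref=v1.2.0"

  cluster_name = "project-a-staging"
  region       = "europe-west1"
}

Bumping the ref in each cluster repo then becomes an explicit, reviewable change instead of every consumer silently tracking the latest module code.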

Infrastructure: You can put your Kubernetes cluster, VPC, IAM, Storage, Databases, etc in this layer. This lets you remove any depends_on conditions you might have used on a helm_release that relies on the underlying infrastructure to be present. This is a nice way to decouple infrastructure from the apps that are installed on it.
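For the decoupling to work, this layer has to expose whatever the later layers need to connect to the cluster. A minimal sketch, assuming a hypothetical cluster module named gke with endpoint and ca_certificate attributes:

# infra/cluster/cluster.tf (illustrative)
output "cluster_endpoint" {
  value = module.gke.endpoint
}

output "cluster_ca_certificate" {
  value     = module.gke.ca_certificate
  sensitive = true
}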

Custom Resource Definitions: During the plan phase, applications that depend on CRDs will fail to plan if those CRDs are not already installed. But the CRDs only get installed during apply, and apply depends on a plan that won’t pass as long as the CRDs (and the cluster itself) don’t exist yet! To overcome this, you can create a layer that only installs the CRDs on the cluster created by the previous Infrastructure layer. This way we decouple the applications from the CRDs, and the CRDs from the infrastructure!
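One possible sketch of this layer, assuming the community gavinbunney/kubectl provider and a local directory of CRD manifests (paths are illustrative), applies each YAML file against the cluster created by the Infrastructure layer:

# crds/crds.tf (illustrative)
terraform {
  required_providers {
    kubectl = {
      source = "gavinbunney/kubectl"
    }
  }
}

# Apply every CRD manifest found in the local manifests/ directory
resource "kubectl_manifest" "crds" {
  for_each  = fileset("${path.module}/manifests", "*.yaml")
  yaml_body = file("${path.module}/manifests/${each.value}")
}

The provider configuration (cluster endpoint and credentials) would come from the Infrastructure layer’s outputs, as shown for the apps layer below.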

Applications: While tools like FluxCD and ArgoCD should be managing app deployments, you sometimes need a few infrastructure-related apps on the cluster, like an ingress controller or cert-manager. This is the layer where we install these infrastructure apps. It is the final part of the pipeline and results in a fully functional, ready-to-use Kubernetes cluster.
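A sketch of this layer, assuming the Infrastructure layer stores its state in a GCS bucket (the bucket name, prefix and output names are placeholders), reads the cluster details via terraform_remote_state and installs an ingress controller with helm_release:

# apps/nginx-ingress/nginx-ingress.tf (illustrative)
data "terraform_remote_state" "infra" {
  backend = "gcs"
  config = {
    bucket = "my-terraform-states"
    prefix = "project-a/infra"
  }
}

provider "helm" {
  kubernetes {
    host                   = data.terraform_remote_state.infra.outputs.cluster_endpoint
    cluster_ca_certificate = base64decode(data.terraform_remote_state.infra.outputs.cluster_ca_certificate)
    token                  = var.cluster_token # or an exec plugin, depending on your cloud
  }
}

resource "helm_release" "nginx_ingress" {
  name             = "ingress-nginx"
  repository       = "https://kubernetes.github.io/ingress-nginx"
  chart            = "ingress-nginx"
  namespace        = "ingress-nginx"
  create_namespace = true
}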

These layers would all be housed inside the Modules monorepo, and the directory structure could look something like this:

├── apps
│ ├── argocd
│ │ ├── argocd.tf
│ │ └── variables.tf
│ ├── cert-manager
│ └── nginx-ingress
├── crds
│ ├── crds.tf
│ ├── locals.tf
│ ├── provider.tf
│ └── variables.tf
├── infra
│ ├── cluster
│ │ ├── cluster.tf
│ │ ├── locals.tf
│ │ ├── provider.tf
│ │ └── variables.tf
│ └── vpc

Multirepo Terraform Setup

A Multirepo setup could be seen in a few different ways.

You can have a different repository for each of your clusters, for instance, staging, dev, and production clusters.

Or you could have a single repository for a “type of clusters” but different branches for different environments. For example, a “project-A-cluster” repository with branches named staging, dev, and production each of which creates a same-named cluster.

Whichever approach you choose, these infrastructure repositories are supposed to stay lean, with one directory per layer we defined earlier. So in this case, three directories corresponding to our three layers: infrastructure, CRDs and apps.

Branch-based environment deployments can be achieved by using different .tfvars files: for example, create staging.tfvars, production.tfvars, dev.tfvars, etc. and let the CI/CD pipeline pick the right file depending on which branch it is running on (e.g. terraform plan -var-file=cluster_configuration/staging.tfvars on the staging branch).

Here’s an overview of what the cluster directory structure could look like:

├── cluster_configuration
│ ├── dev.tfvars
│ ├── prod.tfvars
│ └── staging.tfvars
├── terraform_crd
│ ├── crd.tf
│ ├── provider.tf
│ └── variables.tf
├── terraform_infra
│ ├── infra.tf
│ ├── outputs.tf
│ ├── provider.tf
│ └── variables.tf
└── terraform_k8s_apps
  ├── apps.tf
  ├── outputs.tf
  ├── provider.tf
  └── variables.tf

By moving all the .tfvars files into cluster_configuration, it becomes possible to create several instances of this cluster, e.g. production, staging and dev.
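Each environment’s file then only holds the values that actually differ between instances; a hypothetical staging.tfvars, for example:

# cluster_configuration/staging.tfvars (illustrative variable names)
cluster_name = "project-a-staging"
region       = "europe-west1"
node_count   = 2
machine_type = "e2-standard-4"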

Migrating from Single Layer to Multi-Layer

Now the multi-layered approach sounds all nice but how can you even get started with such a messy codebase at hand? Well, it’s not as scary as it seems thanks to the ability to manipulate the Terraform State.

Start with a clean State 😉

To avoid messing with the state that is in use, simply create a new state backend and then copy over the existing state to it to create a sandbox where we can test our Plans.

To do this, simply run terraform state pull > current-state.json from the current backend and then terraform state push current-state.json to the new state backend.
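Assuming a GCS backend as an example (bucket and prefix are placeholders), the sandbox could simply be the same backend with a fresh prefix:

# backend.tf in a scratch working directory (illustrative)
terraform {
  backend "gcs" {
    bucket = "my-terraform-states"
    prefix = "project-a/sandbox" # a fresh prefix so the real state stays untouched
  }
}

# 1. terraform state pull > current-state.json  (run against the old backend)
# 2. terraform init -reconfigure                (point the working directory at the sandbox backend)
# 3. terraform state push current-state.json    (seed the sandbox with a copy of the real state)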

Now you can modify and play around with this state and your infrastructure should be safe as long as you don’t accidentally do an Apply with this new state!

mv things around with terraform state mv

In this sandbox state, you can start moving around modules to the desired place. For example, if you had defined your VPC module deep inside myproject.infrastructure.aws.vpc, you can move it out to a higher level by first writing a definition for it and then moving the address of the module in the Terraform state to this new location.

So you would write a new block in your main.tf file inside the infrastructure folder like so:

module "vpc" {
  source = "../my-old-repository/path/to/my/vpc/module"
}

and then run terraform state mv module.myproject.module.infrastructure.module.aws.module.vpc module.vpc

Voila! Terraform now tracks the existing VPC resources at the address module.vpc in the statefile. When you now run terraform plan, it should ideally show no changes!

This way you can move all the other infrastructure modules out into the root folder. Then you can remove the resources that belong to the CRD and apps groups from this state (see the snippet below) and deal with them in their own layers and statefiles.
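For example (the resource address is illustrative), dropping an app from the sandboxed infrastructure state so that its own layer can adopt it later:

# Stops tracking the resource in this state without destroying it on the cluster
terraform state rm module.myproject.helm_release.cert_manager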

Note:

You can also use terraform import for resources that were created manually and were never part of the Terraform state, to map them to a configuration. The import command’s syntax depends on the resource you are trying to import; importing a GCP VPC, for example, could look like the snippet below.
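As an illustrative sketch (resource and network names are placeholders), importing an existing GCP VPC means writing a matching resource block first and then mapping the real network onto that address:

# Configuration for a network that already exists in the project
resource "google_compute_network" "vpc" {
  name                    = "my-existing-vpc"
  auto_create_subnetworks = false
}

# terraform import google_compute_network.vpc my-existing-vpc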

Migrating to the new State

After you have moved over all the resources to the new structured locations defined by this layered approach, you can now switch over from the old state to the new one.

The only prerequisite for this step is that you thoroughly go through the plan results of each layer and make sure no unexpected changes are taking place and, most importantly, no destructive changes that would break anything!

Once confirmed, the migration is simply letting Apply run on each of the layers and deleting the old state.

With this, you will have divided your terraform code into homogeneous layers and made your infrastructure as code easier to manage and maintain.

Conclusion

It’s nice how much manipulation Terraform allows with its statefile. Modifying existing resources in parallel with two statefiles, moving resources around within the state, migrating to a new state backend and importing new resources into the state are all very useful when restructuring a codebase. Terraform works best when there are clear divisions between the kinds of infrastructure being provisioned, and while it’s easy to just use a single Terraform plan and apply to create everything, it can become a big problem when something doesn’t go according to Plan. Having said that, good luck with your refactoring and the first step towards the well-oiled machine that is a properly structured Terraform setup.
