How getsupplied.ai adopted Infrastructure as Code
Intro
GetSupplied.ai is a startup on a mission to provide efficient, compliant suppliers for digital marketplaces - logistics companies, food delivery services, travel agencies and auctions. Our products - KYC/KYB, Reporting, Supplier Management and Communication - form a unique all-in-one toolbox to onboard, manage and report on behalf of suppliers as legislation requires.
When we started in November 2024, I set up the whole infrastructure in AWS manually: I created the IAM roles, provisioned S3 buckets, Lambda functions and layers, and created DynamoDB tables. For a long time it was fine to have a single production environment, because there were very few clients and they used our solution a couple of hours a day at most. I could deploy a change, and even if it was faulty despite the quality control we had, it was not a big deal. But now we are a partner to ~50 enterprises processing data for thousands of suppliers, and we need to have things set up properly.
The Goal
I aspired to achieve the following goals:
- Unify the regions we deploy our resources to, in order to decrease latency
- Enable proper quality assurance without disrupting clients' workflows
- Increase velocity through transparency of our infrastructure
- Improve maintainability by cleaning up and unifying environment variables, secrets and parameters
- Gain disaster recovery capability by enabling a 1-hour RTO
We were able to achieve all of this by implementing infrastructure as code over a two-month period and migrating the manually built production environment to an automated one with minimal downtime and zero data loss. In this article I will cover the approach, the tools and the limitations imposed by the technology we used.
State before
I am a firm believer in serverless technologies: they let you move fast, pay only for the resources you consume, and still scale when required. As a result, our whole solution does not contain a single self-managed or unmanaged stateful component. Our compute is a set of AWS Lambda functions; the data is stored in DynamoDB tables; the documents live in S3 buckets. Some of the inter-function communication is performed with AWS SQS.
However, all those components still have to exist in production; I had to go to the AWS console and provision them manually. That was a justified decision: we had to focus on developing the product rather than worry about second-priority items. While creating a function or uploading a Lambda layer is usually a one-time job, creating and using environment variables is not: you define a variable on the Lambda itself, reference it in the code, and its value has to match the actual DynamoDB table name. The same applies to bucket names: a name has to exist simultaneously in three places: in S3 itself, in the Lambda environment variables and in the code. Those implicit connections create invisible yet painful coupling.
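Jumping ahead a little, here is roughly what that coupling looks like once it is expressed in Terraform: the bucket name is declared once and injected into the function's environment as a reference, so the three hand-maintained copies collapse into one. The resource names, runtime and handler below are made up for illustration and do not reflect our real setup.

```hcl
# Hypothetical example: the bucket is declared exactly once...
resource "aws_s3_bucket" "supplier_documents" {
  bucket = "getsupplied-supplier-documents"
}

# ...a minimal execution role for the function...
resource "aws_iam_role" "kyc_processor" {
  name = "kyc-processor-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "lambda.amazonaws.com" }
    }]
  })
}

# ...and the function receives the bucket name as a reference, not a string,
# so the code, the function configuration and S3 can no longer drift apart.
resource "aws_lambda_function" "kyc_processor" {
  function_name = "kyc-processor"
  role          = aws_iam_role.kyc_processor.arn
  handler       = "index.handler"
  runtime       = "nodejs20.x"
  filename      = "kyc_processor.zip"

  environment {
    variables = {
      DOCUMENTS_BUCKET = aws_s3_bucket.supplier_documents.bucket
    }
  }
}
```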
One of the features of the KYC product is comparing the face on the ID document with the selfie the supplier takes in our web application. We use AWS Rekognition for that. Although Rekognition is available in a handful of regions, the initial region I deployed the Lambda functions to was not one of them. This inconsistency forced me to keep the infrastructure in several regions for the time being, which created accidental complexity as well.
Also, by spring we had hired two contractors who started contributing to our products. And despite having tests and verification procedures, we managed to break core functionality several times, which might be forgivable given our age, but had to be addressed given our ambitions.
One ring to rule them all
How could we address three different problems with a single comprehensive strategy? Richard Rumelt, in his classic book "Good Strategy, Bad Strategy", tells us we need a set of complementary policies to gain leverage. This is what I decided:
- Provision an exact copy of the production environment, called staging (see the sketch after this list)
- Push all changes to staging first and verify them there
- Leverage Terraform to capture all the infrastructure and the intrinsic dependencies between its components
- Use the migration to IaC to clean up and unify all the variables and secrets we have
- Mandate that every new change includes the corresponding Terraform changes as well, to make sure the environments stay identical
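To illustrate how staging ends up as an exact copy of production: the whole configuration is parameterized by an environment (and, eventually, a region) variable, so both environments are provisioned from identical code. This is an illustrative sketch with made-up names, not our actual configuration.

```hcl
# One configuration, two environments: staging and production differ only
# in the variable values passed at apply time.
variable "environment" {
  description = "Deployment environment: staging or production"
  type        = string
}

variable "aws_region" {
  description = "The single region everything is deployed to"
  type        = string
}

provider "aws" {
  region = var.aws_region
}

# Every resource name is derived from var.environment.
resource "aws_dynamodb_table" "suppliers" {
  name         = "suppliers-${var.environment}"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "supplier_id"

  attribute {
    name = "supplier_id"
    type = "S"
  }
}
```

Applying this with, say, a staging.tfvars and a production.tfvars file (or with separate workspaces and state files) then yields two structurally identical environments.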
Here's how we did it.
Attempt #1
My first idea was to use ChatGPT to write the Terraform configuration. I would literally go to the AWS console, take a screenshot of the DynamoDB tables and let the AI generate the configuration for me. Then I would do the same with the S3 buckets and Lambda functions.
There were not one but two huge elephants in the room, though: 1) I had never worked with Terraform properly before, and 2) this approach cannot work at all, because you obviously cannot screenshot all the policies and parameters of everything. It was simply the wrong way; it did serve as a good example of using AI where it does not belong, though.
Attempt #2
Idea number two was to research whether there is a tool that can generate Terraform from the existing infrastructure. And indeed there is! The most popular one is terraformer by Google: it can import the infrastructure with a single command.
This approach works much better: it pulls in all the roles, policies, names and environment variables automatically.
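To give a flavour of the output - this is a paraphrased, made-up snippet rather than literal terraformer output - the generated resources tend to look like this, with every value captured as a literal string instead of a reference to another resource:

```hcl
# Paraphrased example of generated code: all values are invented, and
# everything is hardcoded - the role ARN and the bucket name reference nothing.
resource "aws_lambda_function" "kyc_processor" {
  function_name = "kyc-processor"
  role          = "arn:aws:iam::123456789012:role/kyc-processor-role"
  handler       = "index.handler"
  runtime       = "nodejs20.x"
  filename      = "kyc_processor.zip"

  environment {
    variables = {
      DOCUMENTS_BUCKET = "getsupplied-supplier-documents"
    }
  }
}
```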
Although this is a good start, it is not a good end result. You still don't have the proper connections between resources there. It also made us realize we needed to take a step back and clean up all the variables and names in the code before writing any Terraform at all.
Attempt #3
At this point we realized it made more sense to write the Terraform code manually - though not without the help of GitHub Copilot. The problem ahead of us was clear: provisioning staging would be a piece of cake; but how do you spin up production?
We had two options:
- Try to import the existing production resources into our Terraform state
- Create the new infrastructure and migrate the data.
At the same time we began implementing the Liveness Check feature, which, within the EU, is only available in a single region: eu-west-1. Thus we realized we would need to move the infrastructure to a new single region, which forced us to go with option #2.
Luckily, data migration within AWS is easy: you can simply sync the data between S3 buckets. Migrating DynamoDB tables is more complicated, but at our data volumes we could just do an export/import and call it a day.
In the end, we came up with the following plan:
- Establish proper naming for all the parameters we use. Remove hardcoded values, unify naming - make things right.
- Use the generated Terraform to get an understanding of the overall picture.
- Write the Terraform half-manually with the appropriate region variables, output variables and parameters to make all the coupling explicit (see the sketch after this plan).
- Provision the staging environment.
- Test that everything works as expected.
- Provision the production environment.
- Migrate the data from the old production environment to the new one.
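For the "output variables and parameters" part of step three, the idea is roughly the following: generated names are exposed as Terraform outputs and published to one well-known location that application code reads from. Here I use SSM Parameter Store as that location, which is an assumption for illustration; the names are hypothetical as well.

```hcl
variable "environment" {
  type = string
}

resource "aws_s3_bucket" "supplier_documents" {
  bucket = "getsupplied-${var.environment}-supplier-documents"
}

# Expose the generated name as a Terraform output for other stacks and scripts...
output "documents_bucket_name" {
  value = aws_s3_bucket.supplier_documents.bucket
}

# ...and publish it as a parameter (SSM Parameter Store is assumed here), so
# runtime code looks it up in one place instead of keeping three copies in sync.
resource "aws_ssm_parameter" "documents_bucket_name" {
  name  = "/getsupplied/${var.environment}/documents-bucket"
  type  = "String"
  value = aws_s3_bucket.supplier_documents.bucket
}
```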
Big Switch
Finally, the new infrastructure was fully operational in production in the new region. Now we had to make the big switch: repoint the DNS.
And here a new problem arose: our web applications - the supplier app and the admin panel - are served as static websites from S3 buckets. However, for an app to be accessible from https://app.supplied.eu/ in a browser, the name of the bucket must be exactly the same as the domain name: app.supplied.eu. That means you can't include an environment prefix in the bucket name. What is even worse is that bucket names are GLOBAL. Not region-global, not organization-global, they are TOTALLY GLOBAL. And our old apps already occupied those names, which meant we had to delete the old buckets first.
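Concretely, the website buckets end up being the one place where the name cannot be derived from a variable. In Terraform it looks roughly like this; the website-hosting details below are assumed for illustration and may differ from our real distribution setup.

```hcl
# The bucket name must equal the public domain, so no environment prefix is
# possible here - and because bucket names are globally unique, the old
# manually created bucket had to be deleted before this one could be created.
resource "aws_s3_bucket" "supplier_app" {
  bucket = "app.supplied.eu"
}

resource "aws_s3_bucket_website_configuration" "supplier_app" {
  bucket = aws_s3_bucket.supplier_app.id

  index_document {
    suffix = "index.html"
  }
}
```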
The official documentation says that a bucket name is not freed instantly; it may take up to 48 hours to become available again. Fortunately, for us it took only about 15 minutes from deletion to release, so we were down for roughly 15 minutes.
Final words
Finally, we achieved all of our goals:
- Cleaned up the naming
- Made the connections explicit
- Created identical environments, allowing us to test any new change
- Gained the capability of spinning up new infrastructure in a new region in under an hour