Fearless deployments with CodeDeploy
24 July 2024
CodeDeploy is a fully managed AWS code deployment service for AWS Lambda, EC2, ECS and on-premises servers. In this post, I will show how we used it to reduce overall deployment time by ~12x and application downtime by ~36x on a recent EC2 project, all whilst enabling automatic rollbacks for failed deployments.
Firstly, some context. Prior to using CodeDeploy, our application was deployed as a JAR file to EC2 using a GitHub Action. Here's the GitHub Action's action.yml file:
name: "Deploy to environment"
description: "Installs dependencies, tears down stack and deploys project"
runs:
using: "composite"
steps:
- name: Setup Node
uses: actions/setup-node@v3
with:
node-version: 18.13.0
- name: Install AWS CDK
run: npm install -g aws-cdk
shell: sh
- name: Tear down stack and deploy
run: |
cdk destroy --force --execute --ci --require-approval=never $STACK -c BUILD_NUMBER=2.1.$GITHUB_RUN_NUMBER -c ENVIRONMENT=$ENVIRONMENT
cdk deploy --execute --ci --require-approval=never $STACK -c BUILD_NUMBER=2.1.$GITHUB_RUN_NUMBER -c ENVIRONMENT=$ENVIRONMENT
working-directory: infrastructure
shell: sh
Crucially, the project has several database migration scripts that are executed as part of the deployment. These scripts can update the underlying database schema, invalidating the old version of the application - a deployment therefore had to update the schema and deploy the latest version of the application as a single unit. Consequently, the action executed a cdk destroy followed by a cdk deploy to first tear down and then re-deploy the new version of the application.
A slight disclaimer: the original authors of the action clearly felt the cdk destroy step was warranted because, by itself, a cdk deploy only updates the EC2 instance's 'user data script'. It does not stop the running application and restart it with the new version. Therefore, without the destroy step, we might update the database schema but leave the old version of the application running. The problem, however, with cdk destroy is that it is extremely time-consuming - it would have been better and faster if the action had avoided this step altogether and instead stopped, deployed and then restarted the server. But even with this, the action would have been too slow for our needs (and still much slower than using CodeDeploy).
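For illustration only, here's a rough sketch of what that 'stop, deploy, restart' approach might have looked like using SSM Run Command - the instance ID, S3 path and service name are placeholders rather than our real setup:
#!/bin/bash
# Hypothetical sketch only - the instance ID, S3 path and service name are placeholders.
# Stop the running application, copy the new JAR from S3, then start it again, via SSM Run Command.
aws ssm send-command \
  --instance-ids "$INSTANCE_ID" \
  --document-name "AWS-RunShellScript" \
  --parameters '{"commands":[
    "systemctl stop project",
    "aws s3 cp s3://project-s3-bucket/project-builds/project.jar /opt/project/project.jar",
    "systemctl start project"
  ]}'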
Once the new application is deployed and the EC2 instance restarted, its user data script is executed. Here's how our original script looked before the updates to use CodeDeploy:
#!/bin/bash
# Associate the elastic IP to the EC2 instance
aws ec2 associate-address --instance-id ${'$'}(curl -s http://169.254.169.254/latest/meta-data/instance-id) --allocation-id ${context.elasticIP} --allow-reassociation
yum update -y
yum remove java-1.7.0-openjdk -y
yum install java-1.8.0 -y
yum install java-devel -y
mkdir /opt/project
echo 'export environmentName="${yamlEnvironment.environmentName}"' >> /etc/profile
The BUILD_NUMBER environment variable is used to pull the correct JAR file from S3. Note that we had set things up to use an AWS-provided AMI, ensuring that every deployment used the latest Amazon Linux 2 version.
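That JAR-fetching step isn't shown in the script above. As a rough sketch only - the bucket layout and start command here are hypothetical, not our exact setup - it might look something like this:
#!/bin/bash
# Hypothetical sketch - the bucket name, key layout and start command are placeholders.
# Pull the JAR matching this deployment's BUILD_NUMBER from S3 and start it in the background.
aws s3 cp "s3://project-s3-bucket-${ENVIRONMENT}/project-builds/project-${BUILD_NUMBER}.jar" /opt/project/project.jar
nohup java -jar /opt/project/project.jar > /var/log/project.log 2>&1 &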
Extended downtime
Our application was a vital cog in our client's infrastructure, but it was mostly used during US business hours - we had the luxury of being able to take it offline for short periods before the US working day started. Still, it was important to minimise these windows: a few minutes of downtime was acceptable; tens of minutes was not.
Using cdk to destroy and then deploy the new application resulted in 3 fundamental problems that affected our deployment windows:
Slow deployments: a single deployment took at least 12 minutes to complete.
Extended downtime: destroying the environment meant that the website was down until the deployment had completed i.e. 12+ minutes.
No automatic rollback: If a deployment failed, the website would be down for at least 24 minutes, until the previous working version was redeployed. This time included 12+ minutes for the initial failed deployment, 12+ minutes for the new deployment, plus whatever time it took to react and fix the problem.
Switching to CodeDeploy
CodeDeploy is a fully managed AWS code deployment service for AWS Lambda, EC2, ECS and on-premises servers. For EC2 deployments, CodeDeploy supports 2 deployment types: in place and blue/green.
With an in place deployment, the old version of the application is stopped, then the latest version is installed, started and validated. A blue/green deployment starts the application on new instances and then switches traffic from the old instances to the new instances. For more information, see the deployment types user guide.
Here are the 4 steps we followed to switch from a GitHub Action/CDK based deployment to using CodeDeploy.
1. Create a CodeDeploy application
To create a CodeDeploy application, you will first need to create and configure a ServerApplication and a deployment group. Ours was written in Kotlin:
class Codedeploy(
    private val stack: Stack,
    brand: Brand
) {
    val deploymentGroupName = "IN_PLACE_ALL_AT_ONCE"
    val codeDeployApplicationName = "projectCodeDeploy"
    val autoScalingGroupName = "AutoScalingGroup"

    init {
        createApplication(brand)
        createCfnDeploymentGroup(brand)
    }

    private fun createApplication(brand: Brand) {
        ServerApplication.Builder.create(stack, "CodeDeployApplication")
            .applicationName(codeDeployApplicationName)
            .build()
    }

    private fun createCfnDeploymentGroup(brand: Brand): CfnDeploymentGroup {
        return CfnDeploymentGroup.Builder.create(stack, "CfnDeploymentGroup")
            .applicationName(codeDeployApplicationName)
            .autoScalingGroups(listOf(autoScalingGroupName))
            .deploymentGroupName(deploymentGroupName)
            .serviceRoleArn(buildCodeDeployIamRole(brand).roleArn)
            .deploymentConfigName("CodeDeployDefault.AllAtOnce")
            .autoRollbackConfiguration(
                CfnDeploymentGroup.AutoRollbackConfigurationProperty.builder()
                    .enabled(true)
                    .events(listOf("DEPLOYMENT_FAILURE"))
                    .build()
            )
            .build()
    }

    private fun buildCodeDeployIamRole(brand: Brand): Role {
        return Role.Builder.create(stack, "CodeDeployIamRole")
            .assumedBy(ServicePrincipal("codedeploy.amazonaws.com"))
            .managedPolicies(
                listOf(
                    ManagedPolicy.fromManagedPolicyArn(
                        stack,
                        "CodeDeployRolePolicy",
                        "arn:aws:iam::aws:policy/service-role/AWSCodeDeployRole"
                    )
                )
            )
            .build()
    }
}
A few things to note:
An application can have many deployment groups, allowing for different deployment strategies. In our case, we specified a single deployment group.
deploymentConfigName specifies how many instances are deployed to at once. See the deployment configuration guide for more details. As our project needed to be deployed as a single unit, the entire deployment was run in place and all at once.
Enabling autoRollbackConfiguration ensures that CodeDeploy will automatically deploy the last working version on a failed deployment.
An IAM role is required for CodeDeploy's deployment group to have the necessary permissions. Following AWS's principle of least privilege, the role's managed policies are created from an AWS-managed policy that contains only the permissions needed for CodeDeploy to work.
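With the deployment group in place, a quick way to sanity-check it (using the application and group names from the Kotlin above) is to query it with the AWS CLI:
# Confirm the deployment group exists, and that its config and auto-rollback settings are what we expect.
aws deploy get-deployment-group \
  --application-name projectCodeDeploy \
  --deployment-group-name IN_PLACE_ALL_AT_ONCE \
  --query "deploymentGroupInfo.{config:deploymentConfigName,rollback:autoRollbackConfiguration}"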
2. Update the EC2 user data script
The next step is to update the EC2 instance's user data script to install and run the CodeDeploy agent. The agent requires ruby and wget to be installed.
#!/bin/bash
# Associate the elastic IP to the EC2 instance
aws ec2 associate-address --instance-id ${'$'}(curl -s http://169.254.169.254/latest/meta-data/instance-id) --allocation-id ${context.elasticIP} --allow-reassociation
yum update -y
yum remove java-1.7.0-openjdk -y
yum install java-1.8.0 -y
yum install java-devel -y
yum install ruby -y
yum install wget -y
mkdir /opt/project
echo 'export environmentName="${yamlEnvironment.environmentName}"' >> /etc/profile
wget https://aws-codedeploy-${context.region}.s3.${context.region}.amazonaws.com/latest/install
chmod +x ./install
./install auto
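If a deployment ever hangs or fails, it's worth confirming the agent actually came up on the instance. Running something like the following on the instance itself should tell you:
# Check the CodeDeploy agent is installed and running.
sudo service codedeploy-agent status
# The agent's log is a good first stop when debugging a failed lifecycle hook.
sudo tail -n 50 /var/log/aws/codedeploy-agent/codedeploy-agent.log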
3. Configure the CodeDeploy agent
Next, configure CodeDeploy's agent. Do this by creating an AppSpec file (appspec.yml) in the root of your source code. This file tells CodeDeploy what to install from your GitHub repository and which scripts to run in response to deployment lifecycle events. AppSpec hooks allow you to configure aspects of the deployment; the full list of available hooks is documented in the AppSpec hooks guide.
For EC2 deployments, each hook provides the location of a script (relative to the root of the source code being deployed) which the CodeDeploy agent executes. The timeout is optional, but cannot be greater than 3600 seconds (1 hour). For example:
version: 0.0
os: linux
files:
  - source: /
    destination: /opt/project
hooks:
  ApplicationStop:
    - location: stop_application.sh
      timeout: 180
      runas: root
  ApplicationStart:
    - location: start_application.sh
      timeout: 10
      runas: root
  ValidateService:
    - location: validate_service.sh
      timeout: 60
      runas: root
In the above example, the ValidateService hook runs a script that checks the project's health check endpoint to confirm the deployment has succeeded.
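We haven't included the hook scripts themselves in this post, but as a rough sketch - the port and health check path below are hypothetical - validate_service.sh could look something like this:
#!/bin/bash
# Hypothetical sketch of validate_service.sh - the port and health check path are placeholders.
# Poll the application's health check endpoint until it responds with 200, or fail the hook.
for attempt in $(seq 1 10); do
  status=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/healthcheck)
  if [ "$status" -eq 200 ]; then
    echo "Application is healthy"
    exit 0
  fi
  echo "Attempt $attempt: health check returned $status, retrying..."
  sleep 5
done
echo "Application failed to become healthy in time"
exit 1
Exiting non-zero here is what tells CodeDeploy the deployment failed, which in turn triggers the automatic rollback configured earlier.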
4. Update the GitHub Action
The final step is to update the GitHub action for deploying the project, using the aws cli to create a new deployment. The command-line JSON processor jq is used to extract the deployment ID, which is then used to check whether the deployment succeeded:
- name: Deploy app with AWS CodeDeploy
  run: |
    DEPLOYMENT_OUTPUT=$(aws deploy create-deployment \
      --application-name projectCodeDeploy \
      --deployment-group-name IN_PLACE_ALL_AT_ONCE \
      --revision "{\"revisionType\":\"S3\",\"s3Location\":{\"bucket\":\"project-s3-bucket-${ENVIRONMENT}\",\"key\":\"project-builds/deployment-package-2.1.${GITHUB_RUN_NUMBER}.zip\",\"bundleType\":\"zip\"}}" \
      --description "${GITHUB_SHA}" \
      --region=${AWS_DEFAULT_REGION} \
      --output json)
    DEPLOYMENT_ID=$(echo $DEPLOYMENT_OUTPUT | jq -r .deploymentId)
    echo "DEPLOYMENT_ID=$DEPLOYMENT_ID" >> $GITHUB_ENV
  shell: sh
- name: Check Deployment succeeded
  run: |
    aws deploy wait deployment-successful --deployment-id=${DEPLOYMENT_ID}
  shell: sh
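If the wait step fails, the first question is why. A follow-up step along these lines (not part of our original action, just a sketch) pulls the status, error details and rollback information out of CodeDeploy:
# Sketch of a debugging step: print the failed deployment's status, error details and rollback info.
aws deploy get-deployment \
  --deployment-id "${DEPLOYMENT_ID}" \
  --query "deploymentInfo.{status:status,error:errorInformation,rollback:rollbackInfo}" \
  --output json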
The Result
Wiring all the above together resulted in significant improvements to deploy times:
Reduced deployment time: a deployment now takes 50-60 seconds, roughly 12x faster!
Reduced downtime: downtime was reduced to 15-20 seconds, down from ~12 minutes. This time is limited only by how long it takes to stop the old version and serve the new version. For us, this was a ~36x reduction in downtime compared to using CDK.
Automatic rollback on a failed deployment: If a deployment fails, CodeDeploy will roll back to the latest working version. Made possible with a single line of code!
Great, but how much does it cost?
The good news is that if you are deploying to EC2, Lambda or ECS, there are no additional charges. Indeed, your overall costs should drop, because the much shorter deployment job consumes considerably fewer GitHub Actions minutes. Here's how AWS frames CodeDeploy pricing:
For CodeDeploy on EC2, Lambda, ECS: There is no additional charge for code deployments to Amazon EC2, AWS Lambda or Amazon ECS through AWS CodeDeploy.
For CodeDeploy On-Premises: You pay $0.02 per on-premises instance update using AWS CodeDeploy. There are no minimum fees and no upfront commitments. For example, a deployment to three instances equals three instance updates. You will only be charged if CodeDeploy performs an update to an instance. You will not be charged for any instances skipped during the deployment.
You pay for any other AWS resources (e.g. S3 buckets) you may use in conjunction with CodeDeploy to store and run your application. You only pay for what you use, as you use it; there are no minimum fees and no upfront commitments.
Article By
Oliver Looney
Software Engineer