Grammarly’s writing assistant serves twenty million people every day. Keeping rolling updates of the application cluster smooth for such a large user base could present a real maintenance challenge. But luckily, there are services that facilitate making incremental updates with zero downtime. One option for our engineering team to consider is Amazon ECS, which can simplify the update process with a procedure called rolling update deployment.
In this article, we will talk about how we can make rolling updates with Amazon ECS. We will also look more closely at graceful shutdown procedures along with how to deal with long-lived sessions.
Without further ado, let’s dive into the nitty-gritty.
Run and roll
Amazon ECS works based on tasks. These can be thought of as single application instances, each consisting of one or more Docker containers. (It is common to have just one container in a task.) The ECS service configuration defines the desired number of task instances, or simply tasks, for horizontal scalability.
To carry out a rolling update deployment in ECS, you follow these two steps:
- Register a new task definition using the AWS ECS register-task-definition command.
- Launch a service update using the AWS ECS update-service command.
ECS then restarts all tasks, honoring the minimum healthy percent parameter. This ensures a zero-downtime deployment.
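As an illustration, here is a minimal sketch of those same two steps using the AWS SDK for Java v2 instead of the CLI. The family, cluster, service, and image names are placeholders, and the deployment configuration values are only examples, not a recommendation.

```java
import software.amazon.awssdk.services.ecs.EcsClient;
import software.amazon.awssdk.services.ecs.model.ContainerDefinition;
import software.amazon.awssdk.services.ecs.model.DeploymentConfiguration;
import software.amazon.awssdk.services.ecs.model.RegisterTaskDefinitionRequest;
import software.amazon.awssdk.services.ecs.model.RegisterTaskDefinitionResponse;
import software.amazon.awssdk.services.ecs.model.UpdateServiceRequest;

public class RollingUpdate {
    public static void main(String[] args) {
        try (EcsClient ecs = EcsClient.create()) {
            // Step 1: register a new revision of the task definition.
            RegisterTaskDefinitionResponse registered = ecs.registerTaskDefinition(
                RegisterTaskDefinitionRequest.builder()
                    .family("my-app")                       // placeholder family name
                    .containerDefinitions(ContainerDefinition.builder()
                        .name("my-app")
                        .image("my-registry/my-app:1.2.3")  // the new image to roll out
                        .memory(512)
                        .essential(true)
                        .build())
                    .build());

            String newTaskDefArn = registered.taskDefinition().taskDefinitionArn();

            // Step 2: point the service at the new revision; ECS performs the rolling update.
            ecs.updateService(UpdateServiceRequest.builder()
                .cluster("my-cluster")                      // placeholder cluster name
                .service("my-service")                      // placeholder service name
                .taskDefinition(newTaskDefArn)
                .deploymentConfiguration(DeploymentConfiguration.builder()
                    .minimumHealthyPercent(100)             // never drop below the desired count
                    .maximumPercent(200)                    // allow extra tasks during the rollout
                    .build())
                .build());
        }
    }
}
```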
When ECS replaces a task during a service update, it executes a Docker stop, which sends a SIGTERM signal to the application process and then waits 30 seconds (ECS_CONTAINER_STOP_TIMEOUT) for the container to stop. If the container is still alive after that, a SIGKILL is issued and the container is forcibly stopped.
When a load balancer is attached to the service, there are two further steps. ECS first removes the task from the load balancer. Then, before executing a Docker stop, it waits for the load balancer to drain active connections. This draining delay is called a deregistration delay and can be configured in the load balancer target group settings.
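The deregistration delay is a target group attribute, so it can also be adjusted programmatically. Below is a sketch using the AWS SDK for Java v2; the target group ARN and the 60-second value are placeholders.

```java
import software.amazon.awssdk.services.elasticloadbalancingv2.ElasticLoadBalancingV2Client;
import software.amazon.awssdk.services.elasticloadbalancingv2.model.ModifyTargetGroupAttributesRequest;
import software.amazon.awssdk.services.elasticloadbalancingv2.model.TargetGroupAttribute;

public class DeregistrationDelay {
    public static void main(String[] args) {
        try (ElasticLoadBalancingV2Client elb = ElasticLoadBalancingV2Client.create()) {
            // Shorten (or lengthen) the draining window for the target group.
            elb.modifyTargetGroupAttributes(ModifyTargetGroupAttributesRequest.builder()
                .targetGroupArn("arn:aws:elasticloadbalancing:...:targetgroup/my-tg/abc123") // placeholder ARN
                .attributes(TargetGroupAttribute.builder()
                    .key("deregistration_delay.timeout_seconds")
                    .value("60") // example value; the default is 300 seconds
                    .build())
                .build());
        }
    }
}
```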
Long live the connection
This draining approach works great for the majority of cases. But every once in a while, you need a little more control over the shutdown process—especially when you have long-lived sessions, such as WebSockets or HTTP long polling.
The problem with long-lived WebSocket sessions is that the Application Load Balancer (ALB) starts to forcibly kill all “still active” connections once the draining period has elapsed. By the time the application receives a SIGTERM, the long-lived connections may have already been terminated by the ALB. This makes it impossible to close them gracefully on the application side.
This doesn’t have to be a big problem. The long-lived WebSockets will be closed abnormally, which results in a 1006 Abnormal Closure code on both the server and the client. But that’s all. Clients will reconnect to another healthy task and continue to operate normally.
There may still be rare situations when you’ll want to do some housekeeping before or after the disconnect, or track real abnormal closures and service restart closures differently. Or perhaps the perfectionist in you simply does not want to flood clients with a bunch of 1006 codes during each deployment. (Note: If you use a custom close code, be careful with 1012 Service Restart, as this code is not fully supported by Microsoft Edge.)
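If you do want clients to be able to tell a planned restart apart from a real failure, the stopping task can close its sessions proactively with an explicit close code before the ALB cuts them. Here is a minimal sketch using the standard javax.websocket API; the session registry passed in is hypothetical and stands in for however your application tracks open sessions.

```java
import java.io.IOException;
import java.util.Set;
import javax.websocket.CloseReason;
import javax.websocket.Session;

public class GracefulCloser {
    /**
     * Closes all tracked sessions with an explicit close code so clients can
     * distinguish a planned restart from a 1006 Abnormal Closure.
     */
    static void closeAll(Set<Session> activeSessions) {
        // GOING_AWAY (1001) is widely supported; SERVICE_RESTART (1012) is another
        // option, but as noted above it is not fully supported by Microsoft Edge.
        CloseReason reason = new CloseReason(CloseReason.CloseCodes.GOING_AWAY, "Server is restarting");
        for (Session session : activeSessions) {
            try {
                if (session.isOpen()) {
                    session.close(reason);
                }
            } catch (IOException e) {
                // The connection may already be gone; nothing more to do here.
            }
        }
    }
}
```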
To cope with this type of situation, it’s important to find a way to close connections gracefully. If Amazon Web Services were to notify the application about draining events, it would be possible to initiate the connection’s termination process from the application side. But unfortunately that’s not an option, so we need to explore other ideas. One possibility is constantly polling the target state with the DescribeTargetHealth API. Another option is to introduce a Lambda that triggers when the target state changes and notifies the application. We don’t use either of these approaches at Grammarly—they are both far from ideal.
The method we use to tackle this issue is to detach the ECS service from the load balancer. (Note: Unfortunately, if you already have an existing ECS service running with a load balancer attached, you will have to re-create the service to alter the configuration.)
This brings us back to the original flow:
Upon receiving a SIGTERM signal, the application can terminate all in-flight connections gracefully and shut itself down properly. (Note: In Java, a SIGTERM handler can be added with Runtime.addShutdownHook.)
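A minimal sketch of what such a shutdown hook might look like is shown below; closeAllConnectionsGracefully() and startServer() are hypothetical placeholders for your application’s own logic.

```java
public class Main {
    public static void main(String[] args) {
        // Register a JVM shutdown hook: it runs when the process receives SIGTERM,
        // e.g. from the Docker stop that ECS issues during a service update.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            // Hypothetical application-specific cleanup: notify clients,
            // close WebSocket sessions with a proper close code, flush state, etc.
            closeAllConnectionsGracefully();
        }));

        startServer(); // hypothetical application entry point
    }

    static void closeAllConnectionsGracefully() { /* application-specific */ }

    static void startServer() { /* application-specific */ }
}
```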
So far, so good. We now have full control over the connection closure process. But since we have detached the ECS service from the ALB, we need a way to route and balance traffic to the service tasks. There are multiple ways to implement that, and the choice depends on your infrastructure schema. For example, if you use the EC2 launch type for your service and host your instances in an auto-scaling group, the auto-scaling group itself can be attached to the load balancer. Another approach is to use DNS service discovery, so that tasks can be accessed through an internal domain name and ECS handles the registration and deregistration of tasks in the service discovery registry. Though this second approach looks promising, there is no easy way to route traffic from the ALB to a domain name. Most likely you will have to introduce an intermediate proxy (something like Nginx) to route the traffic, which will make your infrastructure even more complex.
We’ve come a long way and have now managed to get balanced traffic to our application. But we do not want to route traffic to unhealthy or stopping tasks. To make sure we avoid doing so, we should look closely at those tasks. Once a SIGTERM is received, a stopping task has to turn off its health check endpoint and fail all health check requests so that the load balancer moves the stopping target to the unhealthy state. This requires waiting at least UnhealthyThresholdCount * HealthCheckIntervalSeconds (according to the health check settings) before the load balancer considers the instance unhealthy and stops routing traffic to it. It might be a good idea to wait even longer, as that effectively introduces a connection-draining feature into the shutdown process. (Pay attention to ECS_CONTAINER_STOP_TIMEOUT, which has a default value of 30 seconds but can be increased up to 10 minutes.)
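Below is a minimal sketch of that idea using the JDK’s built-in HttpServer: a flag flips the health endpoint to failing inside the shutdown hook, and the hook then waits long enough for the load balancer to notice before closing connections. The port, path, and timings are placeholders, not values from our production setup.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class HealthEndpoint {
    private static final AtomicBoolean healthy = new AtomicBoolean(true);

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/health", exchange -> {
            // Report 200 while healthy, 503 once shutdown has started.
            exchange.sendResponseHeaders(healthy.get() ? 200 : 503, -1);
            exchange.close();
        });
        server.start();

        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            healthy.set(false); // start failing health checks
            try {
                // Wait at least UnhealthyThresholdCount * HealthCheckIntervalSeconds,
                // e.g. 2 * 10 = 20 seconds, so the ALB stops sending new traffic.
                // Keep the total wait below ECS_CONTAINER_STOP_TIMEOUT (or raise that setting).
                TimeUnit.SECONDS.sleep(20);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            // Now close long-lived connections gracefully (application-specific).
        }));
    }
}
```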
We’re almost there, but there is one more thing to address. With the load balancer detached from the ECS service, a task that fails the load balancer health check no longer gets killed and restarted (as was the case with native service load balancing). Luckily, this functionality can easily be added back using a container health check, configured in the task definition.
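The container health check lives in the container definition, so it can be supplied when registering the task definition. A sketch with the AWS SDK for Java v2 follows; the command, port, and timing values are illustrative.

```java
import software.amazon.awssdk.services.ecs.model.ContainerDefinition;
import software.amazon.awssdk.services.ecs.model.HealthCheck;

public class ContainerHealthCheck {
    // Builds a container definition with an ECS-level health check; values are illustrative.
    static ContainerDefinition withHealthCheck() {
        return ContainerDefinition.builder()
            .name("my-app")
            .image("my-registry/my-app:1.2.3")
            .essential(true)
            .healthCheck(HealthCheck.builder()
                // ECS runs this command inside the container.
                .command("CMD-SHELL", "curl -f http://localhost:8080/health || exit 1")
                .interval(30)    // seconds between checks
                .timeout(5)      // seconds before a single check counts as failed
                .retries(3)      // consecutive failures before the task is marked unhealthy
                .startPeriod(60) // grace period after the container starts
                .build())
            .build();
    }
}
```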
Using this setup, we have full control over the application shutdown process, which allows us to handle connection termination on the application side.
Conclusion
In this article, we examined the Amazon ECS service update flow in depth and reviewed the pros and cons of some of the workarounds that may help in non-standard scenarios. This is a window into one method by which Grammarly can smoothly update our product applications to consistently deliver high-quality writing support to our millions of users.