TCP/UDP health

In my current role at an email security company, we’re working to modernize our infrastructure. This means moving a lot of static, mostly EC2-based infrastructure into a highly scalable and, wherever possible, serverless stack.

We know that reaching that end state is going to take a lot of time, planning, testing, and failure. In the meantime, we all agreed that even small steps away from our monolithic structures give us real advantages, along with keen insights into how to better model the architecture we’re after.

My first project was to take a small but important piece of our product infrastructure and make it scalable and pluggable: our RBL (Real-time Blackhole List) server.

Up-to-date RBLs and lightning-fast DNS lookups are a crucial part of providing enterprise email security. For this, we leverage rbldnsd with zone syncing from Spamhaus. I’m not going to go into detail on either of these services, as they’re not the main topic of this post, but essentially we ran them on a tiny Linux EC2 instance that all of our SMTP servers could query. Moving this into Docker was in and of itself extremely simple, and there’s plenty of documentation and online activity around that.
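
For anyone who hasn’t worked with RBLs before, a lookup is just a DNS A-record query: you reverse the octets of the suspect IP and prepend them to the zone. A quick illustration using Spamhaus’s documented test address 127.0.0.2 (the zone name here is just an example; output abbreviated):

# Is 127.0.0.2 listed? Reverse the octets and query the zone.
$ host 2.0.0.127.zen.spamhaus.org
2.0.0.127.zen.spamhaus.org has address 127.0.0.2
...

Any answer in the 127.0.0.0/8 range means “listed”; NXDOMAIN means clean. This same convention shows up again in the health check script later in the post.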

Because fast and reliable RBL lookups are crucial to our product and partners, we hit a small snag in moving this simple Docker conversion into production. The “problem,” such as it was in our use case, is that we host on AWS and run our containers on ECS Fargate; DNS queries travel over UDP, and while AWS’s Network Load Balancers (NLBs) will happily forward UDP traffic, their health checks can only use TCP, HTTP, or HTTPS — never UDP. Ensuring these tasks are actually running is a hard requirement for this to be production-viable.

What we found in AWS’s own documentation, as well as in our online research, is that the recommended solution is to run a sidecar container that answers the NLB’s health check; AWS suggests NGINX. However, when we look at this under a microscope, we find that it is not a trustworthy solution. The sidecar may be running fine, but that does not mean our RBL task is. What if its zone sync is failing? The NGINX health check, which is really only checking itself, says we’re good. But we aren’t. Furthermore, we’d be running a sidecar alongside every task, which we felt was wasteful at the very least.

Enter xinetd. This is such a cool little Linux tool. It runs as a daemon, listens on whatever ports you define, and launches a program of your choosing for each incoming connection. By defining a port and exposing it on the container, we can give the NLB a TCP port to check for health.
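
To give a feel for the wiring, here’s roughly what the xinetd side looks like. This is a sketch rather than our exact config: the service name and script path are illustrative, and port 8088 simply matches the health check port in the Terraform further down. A file like this lives in /etc/xinetd.d/ inside the container:

# /etc/xinetd.d/rbl-healthcheck -- illustrative service definition
service rbl-healthcheck
{
    type        = UNLISTED                      # port isn't in /etc/services
    port        = 8088                          # TCP port the NLB will check
    socket_type = stream
    protocol    = tcp
    wait        = no                            # spawn a handler per connection
    user        = nobody
    server      = /usr/local/bin/healthcheck.sh # the script shown below
    disable     = no
}

Each time the NLB opens a connection to 8088, xinetd runs the script with the socket attached to its stdin/stdout — which is why the script below can simply read and echo.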

I came across this genius Gist which leverages xinetd to check whether a PostgreSQL instance is running. I knew this was something special, so I modified the bash script only slightly to fit our needs:

#!/bin/bash

# Put this file into /usr/local/bin and
# chmod +x it

return_fail() {
  echo  -ne "HTTP/1.1 503 Service Unavailable\r\n"
  echo  -ne "Content-Type: text/html\r\n"
  echo  -ne "Content-Length: 37\r\n"
  echo  -ne "Connection: close\r\n"
  echo  -ne "\r\n"
  echo  -ne "RBL is not running or not accessible.\r\n"
  echo  -ne "\r\n"
  exit 1
}

return_ok() {
  echo  -ne "HTTP/1.1 200 OK\r\n"
  echo  -ne "Content-Type: text/plain\r\n"
  echo  -ne "Content-Length: 15\r\n"
  # echo -ne "Connection: close\r\n"
  echo  -ne "\r\n"
  echo  -ne "RBL is running.\r\n"
  echo  -ne "\r\n"
  exit 0
}

# This part needs to be here: it consumes the "request" half of the
# HTTP exchange. We read lines until we hit the blank (or really short)
# line that ends the request headers, and only then proceed to the
# "response" half, which the functions above produce.
while read -r line; do
  # A blank request line arrives as a bare carriage return, so anything
  # shorter than 2 characters means the headers are done
  if [ "${#line}" -lt 2 ]; then
    break
  fi
done

# Query the RBL test entry (127.0.0.2, octet-reversed) against the
# local rbldnsd instance; a "has address" answer means the zone is
# loaded and answering
if host "2.0.0.127.$RBL_ZONE" localhost | grep -q "has address"; then
  return_ok
else
  return_fail
fi

This essentially performs a positive RBL zone check and reports back an HTTP-like response that the NLB (or any browser, really) can GET. Slick!
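
If you want to sanity-check it before it ever sees an NLB, you can hit the port with curl (8088 here matches the illustrative xinetd config above; use whatever port you exposed):

$ curl -i http://localhost:8088/
HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 15

RBL is running.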

Then all I had to do was define the health check on our NLB target group in the Terraform module, and send it.

resource "aws_lb_target_group" "rbl" {
  name        = var.app_name
  port        = 53
  protocol    = "TCP_UDP"
  vpc_id      = data.aws_vpc.selected.id
  target_type = "ip"
  health_check {
    healthy_threshold   = 2
    unhealthy_threshold = 2
    port                = 8088
    protocol            = "HTTP"
    path                = "/"
    matcher             = "200-399"
    interval            = 10
  }
}
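
For completeness, the target group gets wired to the NLB with a matching TCP_UDP listener on port 53. A sketch, assuming a load balancer resource named aws_lb.rbl (that name is illustrative):

resource "aws_lb_listener" "rbl" {
  load_balancer_arn = aws_lb.rbl.arn # illustrative LB reference
  port              = 53
  protocol          = "TCP_UDP"

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.rbl.arn
  }
}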

The end result is a genuine health check that queries our RBL task to verify it is actually functioning, while also satisfying the NLB’s TCP health-check requirement. Everyone’s happy.

Now of course some Docker purists won’t like this because it breaks the oath of “one process per container,” but for us, running this small daemon alongside our task was a no-brainer concession. We’ve been running this solution in production for months now, and it works. Very well.
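
For the curious, the concession boils down to an entrypoint along these lines. A rough sketch, not our actual startup: it assumes xinetd backgrounds itself by default and that rbldnsd runs in the foreground with -n; the bind address and zone spec are placeholders:

#!/bin/bash
# Illustrative entrypoint: xinetd daemonizes itself and serves the
# TCP health check; rbldnsd then replaces the shell as the
# container's main process
xinetd

# -n: stay in the foreground; -b: address/port to bind.
# The zone spec (name:type:file) is a placeholder.
exec rbldnsd -n -b 0.0.0.0/53 "$RBL_ZONE:ip4set:/var/lib/rbldns/$RBL_ZONE"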

This is a fairly edge-case scenario and solution, but if it finds you and helps you through your project, I’m happy.