Thank you, Prometheus! Part One

If you’re in IT or DevOps, or in any way support multiple web/public endpoints, you’ll probably agree that monitoring sucks!

I’ve used many different monitoring and alerting tools over the years, some of which were full-on SaaS solutions that cost the business handsomely. None of them were fun.

In my current role, we were facing this very issue. Our monitoring “system” (a collection of open source and SaaS tools) was growing a little long in the tooth. Perhaps a more important catalyst for change was the fact that we were collecting a ton of new custom metrics, and our AWS CloudWatch bill had grown to a point that was upsetting Finance.

We’d recently implemented a Grafana instance for visualization, so a senior coworker floated the idea of looking into Prometheus to handle the bulk of the actual monitoring and metrics collection. He’d had a little exposure to it, and colleagues of his had praised it highly. So I was tasked with setting up a proof of concept.

There were many hurdles.

The first minor hurdle was getting around the fact that so much of what’s out there for setting up Prometheus is tied up with Kubernetes. We are running mostly in AWS, leveraging Docker/Fargate for scale, and as I’ve noted before, we lean into Terraform for our infrastructure management. Looking at the overview of Prometheus, it’s easy to see why Kubernetes is a great choice, but none of the team was ready to shift our patterns to a whole new infrastructure model.

Once I understood how the Prometheus components (or services) all fit together, I was able to cobble together a Terraform project to build out the foundational infrastructure. I initially wanted to recreate the Kubernetes structure within Fargate, but as we looked it over, we decided that we could safely run the Prometheus and Alertmanager components together on a single EC2 instance. (This is open to change in the future.)

We lean into Ansible for configuration management, and I found an absolutely invaluable Ansible Galaxy role that became my building block for configuring the multiple services. We have since heavily updated this role and added other exporters to it, but it goes without saying that this role saved me HOURS of legwork.
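For illustration, applying a role like that from a playbook looks roughly like this. The role name, host group, and variables below are placeholders, not the specific Galaxy role we used:

```yaml
# playbook.yml — a sketch; the role name and variables are hypothetical,
# not the exact Galaxy role referenced above.
- hosts: prometheus_servers
  become: true
  roles:
    - role: prometheus          # installs and configures the Prometheus server
      vars:
        prometheus_version: "2.45.0"
    - role: alertmanager        # runs alongside Prometheus on the same host
```

The nice part of this pattern is that the scrape targets and alert rules become role variables, so adding a new exporter is a change to the inventory or vars file rather than hand-editing config on the server.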

Once these base configurations were ironed out and pushed to the Prometheus server, we had our skeleton to build on. Out of the box, and by itself, Prometheus ain’t pretty. In fact, without a visualization tool like Grafana, it doesn’t provide much real-world use.

Our system was immediately monitoring itself. It provided a simple web UI and an HTTP endpoint to cURL, and that was about it.

We needed this thing to monitor our entire infrastructure. And it does so, with a handful of YAML files.
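As a sketch of what those YAML files look like: Prometheus’s own `prometheus.yml` holds the scrape configuration, and pointing it at new infrastructure is just a matter of adding a job. The hostnames below are made up, but the structure is standard:

```yaml
# prometheus.yml (sketch) — target hostnames are hypothetical
global:
  scrape_interval: 15s

scrape_configs:
  # Prometheus scraping itself — the out-of-the-box behavior
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  # A node_exporter running on each application host
  - job_name: node
    static_configs:
      - targets: ["app01.internal:9100", "app02.internal:9100"]
```

You can sanity-check that the server is up with `curl localhost:9090/metrics`, which, without Grafana in front of it, is about all the “UI” you get.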

But, getting to that point proved to be the next—and biggest—hurdle to get over.

…to be continued.