The Office Space meme applied to a Grafana dashboard (That Would Be Great)

The three observability pillars, erm, signal types (logs, metrics, traces) are not equally important to every role or in every situation. In this post we discuss the transition from traditional monitoring of servers by sysadmins to exercising observability of services and business impact by developers and platform teams. And we throw in some Machine Learning, for good measure.

Back in the mid-90s, when I commenced my university studies, I once had to visit the sysadmins because I fat-fingered the (paper) form applying for mail access. I distinctly remember the racks, cables, and servers and the seriousness in the air; these fine folks would wear actual pagers (the ones you see in the movies, with just the phone number to call back displayed) and they almost always had a Leatherman at the ready.

The sysadmins would be concerned about the health and utilization of their servers and, in keeping with IT folklore, would play the odd prank on us students as we tried to figure out what the difference between the WorldWideWeb and Hyper-G was. In any case, this “server/boxes” mentality would last well into the 2010s, and in some shops you can still see it practiced today. More out of necessity than out of conviction, but …

Then, starting maybe around 2013 to 2015, the cloud native train slowly started to pick up steam and with that we saw a shift to the left. In other words, if you visualize the supply chain from the origin (or: left-most place), that is, the 200 bucks mechanical keyboard of yer developer, to the lofty heights of a Lambda function on the right-most end, the operational joy has shifted to the left. This is, of course, a euphemism for: as fewer and fewer boxes are needed, and babysitting servers (boxes) is hence less of a topic in a cloud native setting, developers are increasingly expected to pick up that slack.

OK. So, signals, right?

The above-mentioned sysadmins would focus on lower-level or system-level signals. A page when the hard-disk drive is 80% full, maybe? Red blinking lights when the CPUs are nearing 95% utilization. You get the idea. This is certainly useful and makes sense in the context of servers (boxes). Not so much when we look at operating cloud native services or applications. Think: serverless environments, containers, Lambdas.

So, what you are increasingly interested in are higher-level or business-level signals: How much $tuff did we sell today? What percentage of the folks who clicked on the first link ended up subscribing? What’s the QPS value of that service? You see, rather than manually entering stuff into Excel and sharing PDF docs that report on the monthly business performance, you could just look at a Grafana dashboard with, like, live data?

That all sounds good and doable. In the past ten to twelve months we’ve seen a number of activities around harmonizing o11y: from OpenTelemetry and OpenMetrics in the telemetry domain (defining how you instrument, represent signals on the wire, and transport them to destinations) to more recent efforts around Service Level Objective (SLO) formalization via OpenSLO, we’re getting there. And then there are the destinations: from visual frontends such as Grafana, Jaeger, and OpenSearch Dashboards to Long Term Storage (LTS) of metrics in the Prometheus space (Cortex, Thanos, M3DB, etc.). These are indeed exciting times and the opportunity is real: we can now cover all signal types across the stack.

With the same tools and protocols we can measure, query, visualize, and alert on everything from the amount of memory a process uses to the number of business transactions in the past hour.
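To make that a bit more tangible, here is a minimal sketch of a system-level and a business-level signal rendered in the same Prometheus text exposition format. It uses only the Python standard library; the metric `shop_transactions_total`, the helper names, and the sample values are made up for illustration, and in a real service you would most likely reach for a client library rather than format strings by hand.

```python
def render_metric(name, metric_type, help_text, value):
    """Render one sample in the Prometheus text exposition format."""
    return (
        f"# HELP {name} {help_text}\n"
        f"# TYPE {name} {metric_type}\n"
        f"{name} {value}\n"
    )


def render_all(metrics):
    """Concatenate several rendered metrics into one scrape payload."""
    return "".join(render_metric(*m) for m in metrics)


# One low-level and one business-level signal, same format.
# The values and the second metric name are hypothetical.
metrics = [
    ("process_resident_memory_bytes", "gauge",
     "Resident memory size in bytes.", 52428800),
    ("shop_transactions_total", "counter",
     "Completed business transactions (hypothetical metric).", 1342),
]

if __name__ == "__main__":
    print(render_all(metrics), end="")
```

Once both are scraped, the same query language applies to either one: something along the lines of `increase(shop_transactions_total[1h])` would give you the business transactions in the past hour, right next to a memory panel on the same dashboard.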

And yet, there’s still manual work involved; to name two concrete examples:

  1. the automated configuration of destinations based on the systems under observation, and
  2. the correlation (or: transition) between signal types; for example, I recently toyed around with a logs/traces unification experiment.

Now, where does that leave us? I did promise you some Machine Learning (ML), so here we go: while it might be a marketing term and oftentimes abused, there are tons and tons of real-world use cases. And I’d argue that by and large it’s best used when you don’t notice it. And yes, I am aware that it’s hard to raise VC money with a modest and silent attitude on the ML/AI topic.

Let’s have a look at a concrete example in the o11y domain: you might wonder what abuse of your service looks like, and for that you’d need a baseline for the workload (what does it “usually” look like, for example expressed by low-level metrics like CPU or memory usage or high-level metrics such as the number of transactions). Now, really this is a spectrum: on one end, you can manually look at a dashboard and then write a Prometheus alerting rule. On the other end of the spectrum you might find features like CloudWatch Anomaly Detection or services such as Amazon Lookout for Metrics that you can use to automate this process.
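Somewhere in the middle of that spectrum sits a plain statistical baseline. The following is a minimal sketch, assuming a trailing window and a z-score threshold; it is not how CloudWatch Anomaly Detection or Lookout for Metrics work internally, and the function name, the window size, the threshold, and the sample data are all picked for illustration only.

```python
from statistics import mean, stdev


def find_anomalies(samples, window=5, z_threshold=3.0):
    """Return the indices of samples that deviate from the trailing-window
    baseline by more than z_threshold standard deviations."""
    anomalies = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]  # the "usual" behaviour so far
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Perfectly flat baseline: any change at all stands out.
            if samples[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples[i] - mu) / sigma > z_threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Hypothetical transactions-per-minute series with one burst.
    transactions = [100, 102, 98, 101, 99, 100, 500, 101]
    print(find_anomalies(transactions))
```

Note that a spike ends up in the baseline of the samples that follow it and inflates their standard deviation for a while, which is one of the reasons you may prefer a managed service that handles seasonality and outliers for you.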

I hope you find this open source and open specifications observability space as exciting as I do and that we can travel this journey together for a while! Find me on the CNCF Slack and on Twitter (DMs are open). Ah, and if you want to keep up with the developments, you may want to consider signing up for my weekly newsletter on the topic (mail or RSS) via o11y.news.

Solution Engineering Lead in the AWS open source observability service team