Semantic conventions can save lives!

o11y theory: instrumentation

Michael Hausenblas

--

TL;DR: We discuss why instrumentation matters for observability (o11y) and what the costs and benefits are.

The Cloud Native Computing Foundation (CNCF) defines o11y as follows:

Observability is a system property that defines the degree to which the system can generate actionable insights. It allows users to understand a system’s state from these external outputs and take (corrective) action. […]

Alright. So the goal is to get actionable insights to understand what’s going on in the system under observation and to influence said system. Why would you want to understand the system, say, a serverless app using Lambda, Step Functions, DynamoDB, and SQS or a containerized workload running in Kubernetes? Why would you want to influence it?

That’s an excellent question, and I tried to address it in the short paper Return on Investment Driven Observability: if you’re able to state your motivations up front in a SMART (specific, measurable, achievable, relevant, time-bound) manner, you are much more likely to succeed in implementing your o11y strategy. The costs (SaaS licenses, ingestion costs, storage costs, query costs, etc.) are usually something you can estimate fairly accurately. You may not like the outcome, but it is rather deterministic. The true challenge is to figure out the return on your investment. For example, if you plan to reduce the time to fix an issue in production, that’s a clear goal. Or you may choose to increase the number of releases you can do per week. Maybe you’re mostly interested in optimizing the performance of your API, that is, lower latency and less resource usage.

Any one of these goals, or a combination thereof, is great, but you have to know what you’re after or you cannot effectively sell your stakeholders on your o11y strategy. It doesn’t matter whether you’re arguing to switch o11y vendors, standardize on OpenTelemetry (which, amongst friends, is almost always a good move), or introduce a new telemetry signal type such as traces or profiles. Know your goals and how to measure them, and I guarantee you will have much smoother sailing. This is not a guarantee that you won’t run into adoption blockers or hear from folks in your org that “we’ve always done it this way, why change it” … but at least it gives you a fighting chance ;)
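To make the cost-versus-return comparison a bit more tangible, here is a minimal back-of-the-envelope sketch. Every number in it is a made-up placeholder, and the `roi` helper and the "hours saved per incident" model are assumptions for illustration, not something taken from the paper mentioned above. Plug in your own estimates.

```python
def roi(annual_return: float, annual_cost: float) -> float:
    """Return on investment as a fraction: (return - cost) / cost."""
    if annual_cost <= 0:
        raise ValueError("annual_cost must be positive")
    return (annual_return - annual_cost) / annual_cost


# Cost side: usually easy to estimate (all figures are hypothetical).
costs = {
    "saas_licenses": 40_000,
    "ingestion": 25_000,
    "storage": 10_000,
    "queries": 5_000,
}
annual_cost = sum(costs.values())

# Return side: the hard part. Here the (hypothetical) goal is
# "reduce the time to fix an issue in production".
incidents_per_year = 24
hours_saved_per_incident = 3
cost_per_incident_hour = 2_000
annual_return = incidents_per_year * hours_saved_per_incident * cost_per_incident_hour

print(f"ROI: {roi(annual_return, annual_cost):.0%}")  # prints "ROI: 80%"
```

The point is less the formula and more the discipline: if you cannot fill in the return side with a number you are willing to defend, your goal is probably not measurable yet.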

Back to instrumentation.

Let’s be honest. We all want to get the most out of things with the least amount of effort, and that’s actually not a bad thing. When it comes to telemetry, it means we want to get all the signals (profiles, traces, metrics, and yes, even logs) for free. That is, we would love to automatically generate the telemetry signals without having to change our code, right?

So, some good news up front: no matter whether you’re looking at open source and open standards or at proprietary (vendor-specific) solutions, in general you can and should use auto-instrumentation, that is, automatically generating signals without code changes.

In terms of auto-instrumentation, OpenTelemetry has you covered for a range of interpreted/byte-code languages such as Java, .NET, JavaScript, Python, PHP, and Ruby. For other, compiled languages including C++, Go, and Rust, you can either instrument manually or leverage eBPF to achieve the same effect (either in the collector or directly via eBPF-based auto-instrumentation, as provided by projects like Odigos).

I say “in general you should” because, based on customer conversations and community feedback, I see it as a starting point: a way to get “up and running” fast and to convince stakeholders, such as your developers or management, to buy into your proposed o11y strategy and stack. But I wouldn’t recommend stopping there. There is almost always room for improvement by manually instrumenting the application code. It’s just that auto-instrumentation gets you from 0 to 1 much faster ;)
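A toy illustration of the difference, using only the Python standard library. This is an analogy and not how OpenTelemetry's auto-instrumentation is actually implemented; the `auto_instrument` wrapper, the `RECORDED` list, and the `order.*` attribute names are all hypothetical. It shows that wrapping from the outside yields generic data (name, duration) for free, while domain-specific details only show up once you touch the code.

```python
import functools
import time

RECORDED = []  # stand-in for a telemetry backend


def auto_instrument(func):
    """Wrap func from the outside: its body stays untouched."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            # Generic facts only: the wrapper knows nothing about orders.
            RECORDED.append({
                "name": func.__name__,
                "duration_s": time.perf_counter() - start,
            })
    return wrapper


def place_order(items):
    # Business logic without a single line of telemetry code.
    return sum(price for _, price in items)


# "Auto" step: applied at startup, place_order's source is not edited.
place_order = auto_instrument(place_order)


def place_order_manual(items):
    total = sum(price for _, price in items)
    # Manual step: only the code itself knows these domain details.
    RECORDED.append({
        "name": "place_order_manual",
        "order.item_count": len(items),
        "order.total": total,
    })
    return total
```

Calling `place_order([("book", 2.0), ("pen", 3.0)])` records a name and a duration; calling `place_order_manual` with the same input additionally records how many items were ordered and for how much, which is typically what you want to see when debugging a production issue.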

For manual instrumentation, my tip is simple: use OpenTelemetry. This allows you to “instrument once and ingest everywhere”. By that I mean: from an o11y perspective, you don’t need to touch your code again if you decide to change the destination of your signals, that is, the backend that ingests your logs, metrics, traces, and profiles. OpenTelemetry also comes with semantic conventions: simply put, a set of key-value pairs that define up front where signals come from (such as a pod in a Kubernetes cluster or a certain Lambda function) and what they mean (oh, hey, that’s an HTTP GET request here). This often overlooked and under-appreciated OpenTelemetry feature is really important in the context of instrumentation and will be the subject of a different blog post in the near future.
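Here is a minimal sketch of both ideas. To keep it runnable without any dependencies, it deliberately does not use the OpenTelemetry API: `Span`, `Tracer`, and `handle_get_cart` are hypothetical stand-ins, and the attribute values are invented. The attribute keys, however, follow the naming used in OpenTelemetry's semantic conventions (`service.name`, `k8s.pod.name`, `http.request.method`, and so on).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Span:
    name: str
    resource: Dict[str, object]  # where the signal comes from
    attributes: Dict[str, object] = field(default_factory=dict)  # what it means


Exporter = Callable[[Span], None]


class Tracer:
    """Toy tracer: hands each span to whatever exporter was plugged in."""

    def __init__(self, resource: Dict[str, object], exporter: Exporter) -> None:
        self.resource = resource
        self.exporter = exporter

    def record(self, name: str, attributes: Dict[str, object]) -> None:
        self.exporter(Span(name, dict(self.resource), dict(attributes)))


# "Where does it come from": resource attributes (values are made up).
RESOURCE = {
    "service.name": "checkout",
    "k8s.namespace.name": "shop",
    "k8s.pod.name": "checkout-7d9f",
}


def handle_get_cart(tracer: Tracer) -> None:
    # Instrumented once; this function never learns which backend is used.
    tracer.record("GET /cart", {
        "http.request.method": "GET",
        "url.path": "/cart",
        "http.response.status_code": 200,
    })


# Two different "backends", here just lists; the handler stays the same.
backend_a: list = []
backend_b: list = []
handle_get_cart(Tracer(RESOURCE, backend_a.append))
handle_get_cart(Tracer(RESOURCE, backend_b.append))
```

Because both backends receive the same well-known keys, a query such as "all GET requests served by pods in the `shop` namespace" can be written the same way no matter where the data ends up. With the real OpenTelemetry SDK, the swap happens in the exporter configuration rather than in a constructor argument, but the principle is the same.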

A closing thought: you can certainly overdo it with manual instrumentation. If you’re “over-instrumenting”, your source code’s readability and maintainability might go south, and the maniac who has to fix the code you wrote 3 months ago (who could be, and oftentimes indeed is, you yourself) will not exactly love you for it. I suppose that if you’ve done a good job in the initial phase of defining your o11y goals, you have a good handle on what is “enough” or “too much” manual instrumentation. At this point, I don’t (yet) have hard data on how much is too much, but I would appreciate you sharing your experiences and insights.
