This post defines observability, gives an example of a user issue, and then introduces OpenTelemetry, which helps pinpoint user issues in an application by looking at the application's outputs. This is critical to a good developer experience, which saves developers time.

Observing a System

Software engineers often spend a large amount of time trying to solve user issues that are not easily reproducible. Sometimes system logs don't contain enough information, and one is left reading code and guessing at what happened by a process of elimination.

In theory, we should be able to pinpoint any issue in the code from the outputs of the application. That is, you should be able to tell the internal state of a system by observing it.

This is where observability comes into the picture. The Wikipedia definition is:

In software engineering, more specifically in distributed computing, observability is the ability to collect data about programs' execution, modules' internal states, and the communication among components.[1][2] To improve observability, software engineers use a wide range of logging and tracing techniques to gather telemetry information, and tools to analyze and use it. Observability is foundational to site reliability engineering, as it is the first step in triaging a service outage. One of the goals of observability is to minimize the amount of prior knowledge needed to debug an issue.

Example of an Issue

An example of an issue that could not be reproduced: the browser intermittently made duplicate API calls after a user clicked a button to submit information.

The metrics, logs, and traces were not detailed enough to explain why this was happening. The API call was made idempotent so that a submission only changes state once, which meant that the user was unaffected and the form was only submitted once.
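
One common way to make a submission idempotent is to have the client send a unique key with each logical submission and have the server remember the result for that key. The sketch below is only an illustration of that idea, not the implementation used in the issue above: `SubmissionStore`, `submit`, and the key `"key-123"` are hypothetical names, and an in-memory dict stands in for a real database.

```python
from dataclasses import dataclass, field


@dataclass
class SubmissionStore:
    """Hypothetical in-memory stand-in for a table keyed by idempotency key."""

    results: dict = field(default_factory=dict)
    writes: int = 0  # counts how many times state actually changed

    def submit(self, idempotency_key: str, payload: dict) -> dict:
        # A key we have already seen means a duplicate call: return the
        # stored result instead of changing state a second time.
        if idempotency_key in self.results:
            return self.results[idempotency_key]

        # First time we see this key: perform the state change once.
        self.writes += 1
        result = {
            "status": "created",
            "submission_id": self.writes,
            "payload": payload,
        }
        self.results[idempotency_key] = result
        return result


store = SubmissionStore()
first = store.submit("key-123", {"name": "Ada"})
second = store.submit("key-123", {"name": "Ada"})  # the duplicate POST
```

Both calls return the same result and `store.writes` stays at 1, so the duplicate is harmless to the user. It does not explain why the duplicate happened, which is the observability problem.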

But let's dig into what could be observed:

  • browser network tab - showed one call
  • front end library that does the post - middleware showed two posts
  • network gateway logs showed two posts
  • infrastructure (Kubernetes service and pod) logs showed nothing unusual
  • application logs showed two posts, sometimes on two different Kubernetes pods

Even with all this information, nothing could pinpoint the issue.

Developer Experience

The above scenario is like looking for a needle in a haystack, which is not fun, and should not be the case with well-developed apps.

You might resort to a process of elimination and to asking "why" five times, like in a retrospective.

To add to the haystack, application logs may also use non-standard terms, or even use existing standard terms for other purposes (like correlation id and span id), as many libraries are still catching up with OpenTelemetry, such as those noted here and here.

The answer, to at least make the haystack clearer, is OpenTelemetry.
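
To show what standard terms look like in practice, here is a minimal sketch of the W3C Trace Context `traceparent` header, the format OpenTelemetry uses by default to carry a trace id and span id between services. It uses only the Python standard library. The function names are hypothetical, and the validation is deliberately simplified compared with the full specification.

```python
import re
import secrets

# traceparent layout: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its named fields (simplified checks)."""
    match = TRACEPARENT_RE.match(header)
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, span_id, flags = match.groups()
    # All-zero ids are defined as invalid by the Trace Context spec.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        raise ValueError("trace id and span id must not be all zeros")
    return {
        "version": version,
        "trace_id": trace_id,
        "span_id": span_id,
        "flags": flags,
    }


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: random 16-byte trace id, random 8-byte span id."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(parent: str) -> str:
    """Continue a trace: keep the trace id, mint a new span id."""
    fields = parse_traceparent(parent)
    return f"00-{fields['trace_id']}-{secrets.token_hex(8)}-{fields['flags']}"


root = new_traceparent()
child = child_traceparent(root)
```

The trace id stays the same across every hop of one request, while each hop gets its own span id. When every library agrees on those meanings, logs from different layers can be lined up without guessing what a "correlation id" means in each one.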

OpenTelemetry

OpenTelemetry is:

  • An observability framework and toolkit designed to facilitate the generation, export, and collection of telemetry data such as traces, metrics, and logs.
  • Open source, as well as vendor- and tool-agnostic, meaning that it can be used with a broad variety of observability backends, including open source tools like Jaeger and Prometheus, as well as commercial offerings. OpenTelemetry is not an observability backend itself.

A major goal of OpenTelemetry is to enable easy instrumentation of your applications and systems, regardless of the programming language, infrastructure, and runtime environments used.
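
To connect this back to the duplicate-post issue described earlier, the sketch below assumes every layer had logged the same trace id for a given button click. The records are hand-written and hypothetical, made up to mirror the observations listed earlier. The layer names and the `first_duplicating_layer` helper are illustrative too, not part of OpenTelemetry.

```python
from collections import Counter

# Layers in the order a request passes through them (hypothetical names).
LAYERS = ["browser", "frontend-library", "gateway", "application"]

# Hypothetical telemetry records that all carry the same trace id.
records = [
    {"layer": "browser", "trace_id": "t1", "event": "POST /submit"},
    {"layer": "frontend-library", "trace_id": "t1", "event": "POST /submit"},
    {"layer": "frontend-library", "trace_id": "t1", "event": "POST /submit"},
    {"layer": "gateway", "trace_id": "t1", "event": "POST /submit"},
    {"layer": "gateway", "trace_id": "t1", "event": "POST /submit"},
    {"layer": "application", "trace_id": "t1", "event": "POST /submit"},
    {"layer": "application", "trace_id": "t1", "event": "POST /submit"},
]


def first_duplicating_layer(records, trace_id):
    """Return the earliest layer that saw more than one event for a trace."""
    counts = Counter(r["layer"] for r in records if r["trace_id"] == trace_id)
    for layer in LAYERS:
        if counts[layer] > 1:
            return layer
    return None  # no layer saw a duplicate for this trace


suspect = first_duplicating_layer(records, "t1")
```

With a shared trace id, the question "where did the second post first appear?" becomes a simple query instead of a process of elimination across separate log sources.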

So the lesson is - use OpenTelemetry!