The Developer's Guide to Not Losing the Metrics You Need

Gathering and storing metrics is one of the many parallel tasks a developer must handle throughout the production cycle. Since you never know when an adverse event might occur, you want to have the metrics you need to debug a problem when and if you need them.

However, you cannot store all metrics forever. This even applies to purpose-built time series databases, such as the one InfluxDB offers, which is designed for high-cardinality data. Time series databases may seem “magical” when it comes to scalability, but they cannot scale infinitely, and at some point even InfluxDB will reach a limit.

The limitations of time series databases are why you should manage your storage as a set of tubes of different sizes. When you don’t know whether a metric or a trace will be useful, or when storage becomes too expensive because you collected too many metrics, you can always store them for a short period of time. You can then move them later if they turn out to be useful, or aggregate them first to reduce the pressure they put on your system.

If you are using InfluxDB, you can set a minimal retention policy (of just two to three days, for example) for all your metrics. You can then move the metrics to another InfluxDB instance (with a longer retention policy) using Kapacitor, the TICK Stack’s native data processing engine.
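
If you are on InfluxDB 1.x, the retention-policy side of this might look roughly like the sketch below, which assumes the influxdb-python client; the hostname, database and policy names are placeholders. The Kapacitor task that forwards data to the longer-retention instance is defined separately in TICKscript.

```python
# Minimal sketch, assuming InfluxDB 1.x and the influxdb-python client.
# The hostname, database name and policy name are placeholders.
from influxdb import InfluxDBClient

short_term = InfluxDBClient(host="influx-short.example.com", database="telegraf")

# Keep raw metrics on the "hot" instance for only three days.
short_term.create_retention_policy(
    name="three_days",
    duration="3d",
    replication=1,
    database="telegraf",
    default=True,
)
```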

This is a hard problem, but I’m comfortable with having logs, traces and metrics in the same place. The end result is easy correlation between all these different points of view on your system. Because at the end of the day, metrics and logs are just different representations of the reality of your system, and having them together behaves like a powerful crystal ball capable of answering questions about the state of your system with a much higher level of granularity.

The only way to store the huge amount of data described above is via retention policies and data aggregation. One solution is Kapacitor, which can process both stream and batch data from InfluxDB. It lets you plug in your own custom logic or user-defined functions to process alerts with dynamic thresholds, match metrics for patterns and compute statistical anomalies, and it can perform specific actions based on these alerts, such as dynamic load rebalancing.
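
To make the dynamic-threshold idea concrete, here is a plain-Python sketch of the kind of logic you could wrap in a user-defined function. It does not use Kapacitor’s actual UDF protocol, and the window size and sigma multiplier are arbitrary choices.

```python
# Plain-Python sketch of a dynamic threshold: a point is anomalous when it
# sits more than SIGMAS standard deviations away from the rolling mean.
from collections import deque
from statistics import mean, stdev

WINDOW = 30   # number of recent points used as the baseline
SIGMAS = 3.0  # how far from the rolling mean a point must drift to alert

recent = deque(maxlen=WINDOW)

def is_anomalous(value):
    """Return True if `value` deviates from the rolling baseline."""
    anomalous = False
    if len(recent) == WINDOW:
        baseline, spread = mean(recent), stdev(recent)
        anomalous = abs(value - baseline) > SIGMAS * spread
    recent.append(value)
    return anomalous
```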

Using Kapacitor with InfluxDB is simple and lets you store data as is or forward it as an aggregate. In either case, this is a straightforward way for you to start looking at your metric data and determine what you need to keep, all without the guesswork.
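
As an illustration of the “send it as an aggregate” path, the sketch below performs a single downsampling pass with the influxdb-python client: it reads raw points from the short-retention instance and writes 5-minute means to the long-retention one. In practice a Kapacitor task would do this continuously; the hostnames, database and measurement names are placeholders.

```python
# Sketch of one downsampling pass; a Kapacitor task would automate this.
from influxdb import InfluxDBClient

short_term = InfluxDBClient(host="influx-short.example.com", database="telegraf")
long_term = InfluxDBClient(host="influx-long.example.com", database="metrics_archive")

# Aggregate the last hour of raw CPU points into 5-minute means.
result = short_term.query(
    'SELECT mean("usage_user") AS "usage_user" '
    'FROM "cpu" WHERE time > now() - 1h GROUP BY time(5m)'
)

points = [
    {
        "measurement": "cpu_5m",
        "time": row["time"],
        "fields": {"usage_user": row["usage_user"]},
    }
    for row in result.get_points()
    if row["usage_user"] is not None
]

# Only the aggregates reach the long-retention instance.
long_term.write_points(points)
```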

When it comes to collecting system and application data, developers today also talk about logs, events, and tracing.

Access to metrics is like being able to get a good night’s sleep, while events are like a slap in the face. You first design metrics in the form of pretty graphs to understand normal system operation. When something deviates from that normal behavior, an event is the slap that wakes you up.

With every unexpected slap in the middle of the night, confusion follows. You look around to figure out what is happening, who interrupted your dreams and why, but metrics and events don’t tell the full story. That’s where monitoring alone falls short.

At this point, it is important to remind yourself that monitoring helps you know when something goes wrong but doesn’t answer the following questions:

  1. What is going on?
  2. How can I fix it so I can go back to sleep again?

At this juncture, if the problem resides in your application, you start studying logs to see what is going on. If you are in a small, low-traffic environment, you will probably find what you are looking for, and then you are done.

However, if yours is a complex system (perhaps distributed or heavily reliant on third-party services), the logs are massive and you can’t identify what is broken simply by reading them. In this case, you need to narrow the scope of the outage. To do that, you can look at your traces. Although identifying traces can be a challenge, there are two actions you can take:

  • First, expose the trace_id (the identifier for every trace/request) inside your logs to connect them. It will help you filter logs for a specific request.
  • Second, teach your support team and customers why the trace_id is essential. They should know that it is the key to finding out what is happening. If your customers are technical, it’s easier when you provide them with an HTTP header, for example. If your customers are nontechnical, it is a good strategy to have your UI send back the appropriate identifier. (A sketch covering both of these ideas follows this list.)
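
Here is a minimal sketch of both ideas, assuming Python’s standard logging module; the X-Trace-Id header name and the handler shape are hypothetical rather than any particular framework’s API.

```python
# Minimal sketch: inject a trace_id into every log line and echo it back
# to the caller. The header name and handler shape are hypothetical.
import logging
import uuid

logging.basicConfig(format="%(asctime)s trace_id=%(trace_id)s %(levelname)s %(message)s")
log = logging.getLogger("api")

def handle_request(headers):
    # Reuse the caller's trace_id if present, otherwise create one.
    trace_id = headers.get("X-Trace-Id", str(uuid.uuid4()))

    # Every log line for this request carries the trace_id, so logs and
    # traces can be filtered together later.
    log.warning("payment provider timed out", extra={"trace_id": trace_id})

    # Echo the trace_id back so support (or the UI) can quote it in tickets.
    return {"status": 502, "headers": {"X-Trace-Id": trace_id}}
```

Filtering your logs by that trace_id then gives you exactly the lines that belong to the failing request.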

Real issues I encountered led me to write this blog. Everything I wrote is a lesson learned from troubleshooting distributed applications: metrics, events, logs and traces are not mutually exclusive. They are the tools that make debugging, monitoring and observability possible. I can’t wait for a single solution that groups them all and makes my life as a developer even more awesome than it already is.