Stop sending metrics. Start deriving them.

Metrics aren't that useful. Unless they're derived from deeper sources of information.

John Gallagher

Apr 24, 2024

Stop sending metrics from your app.

Huh? But aren't metrics really useful?

Nope. And here's why.

1. Metrics only answer questions you already thought to ask.

Question: It's 3am, your site is down, it's not obvious what's going wrong and you're half asleep. Will metrics help?

Maybe. If you're lucky, you've encountered an issue you already thought might happen.

But maybe, just maybe, this is a non-trivial issue that you've never seen before.

I've experienced this many times.

Here's how it goes:

1. Oh crap. What's going on?

2. Check all 10 metrics. All look good.

3. Now what?

If every production issue you have falls into 5-10 different repeatable scenarios, great!

But in 2024 our apps are a *lot* more complex than this.

2. Metrics result in dead ends.

Question: If the metric *is* showing that there's a problem, can it explain *why*?

No. Of course it can't.

It's just a number.

Imagine if you went to the doctor.

"I'm fainting all the time... what's going on?"

"OK, let me take your blood pressure. Wow - that's high. Well, thanks for coming to see me. Laters!"

I don't think you'd be very happy.

Unless you can correlate the metric with other anomalous activity, you're not much further forwards towards a solution.

I've experienced this too.

Here's how it goes:

1. Oh crap. What's going on?

2. Check all 10 metrics. Oh! Background job queue latency is massive.

3. It's something to do with the background jobs.

4. Now what?

OK, sometimes you get lucky and there are other metrics that hint at the reasons.

Maybe you've seen a flood of errors in your error tracking software.

All those other times? You're on your own.

3. Metrics can get expensive.

Question: If you set up a feature in your monitoring tool and 3 weeks later you realise that, because of one tiny misunderstanding, the metric has racked up $7,000...

...is that a good feature?

Erm.... NO.

Months ago I had this exact problem.

I'd set up a metric, added a few dimensions to it. Just a handful.

A week later my manager pinged me...

"Looks like this metric is getting expensive... can we turn it off?"

Turns out that we were being charged extra for each unique value of those "few dimensions".

No worries. There were only... 220 unique values... charged at... oh wait…

...well I'm sure you can fill in the gaps.

We switched off the metric before we'd even had a chance to use it.
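To see how "just a handful" of dimensions gets out of hand, here's a rough sketch. It assumes a vendor that bills per unique combination of dimension values (pricing models differ, so check yours). The dimension names and values are made up for illustration, not the ones from my story.

```python
def count_series(dimensions):
    """Worst-case number of distinct time series for one metric:
    one per possible combination of dimension values."""
    total = 1
    for values in dimensions.values():
        total *= len(values)
    return total


# Hypothetical dimensions for a background job metric.
dimensions = {
    "queue": ["default", "mailers", "low"],
    "job_class": [f"Job{i}" for i in range(20)],
    "status": ["ok", "failed"],
}

print(count_series(dimensions))  # 3 * 20 * 2 = 120
```

Three innocent-looking dimensions, 120 potential series. Add one more dimension with ten values and you're at 1,200.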

Metrics have their place.

But they should be derived from other tools.

What’s the plan, Mr Observability Guru?

What should you be doing instead? Try these:

1. Tracing.

Tracing can give you a proper structured map of the call stack.

That's infinitely more useful than a metric.

Make sure you add all the context you care about to each span.
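Here's a toy sketch of what "context on each span" means. In a real app you'd use a tracing library such as OpenTelemetry. This is a standard-library-only stand-in to show the shape, and every name in it (`span`, `finished_spans`, `checkout`, the attribute keys) is hypothetical.

```python
import time
import uuid
from contextlib import contextmanager

finished_spans = []  # a real tracer would export these to a backend
_open_spans = []     # currently open spans, innermost last (not thread-safe)


@contextmanager
def span(name, **attributes):
    """Record one unit of work, linked to its parent, with its context."""
    record = {
        "span_id": uuid.uuid4().hex[:8],
        "parent_id": _open_spans[-1]["span_id"] if _open_spans else None,
        "name": name,
        "attributes": dict(attributes),
    }
    _open_spans.append(record)
    start = time.perf_counter()
    try:
        yield record
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        _open_spans.pop()
        finished_spans.append(record)


def checkout(user_id, order_id):
    # Put the IDs you'll want to search by on every span.
    with span("checkout", user_id=user_id, order_id=order_id) as root:
        with span("charge_card", order_id=order_id, provider="example-pay") as child:
            child["attributes"]["amount_cents"] = 1999
        root["attributes"]["outcome"] = "paid"


checkout(user_id="u_42", order_id="o_1001")
```

Each finished span knows its parent, how long it took and which user and order it belonged to. That's the map.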

I’ll be discussing this in more detail in a future post.

2. Ditch traditional tools.

If you’re already invested in an observability tool, use it.

However - there are always exceptions.

Sometimes you want to make a break with the past.

Using a next generation tool like Honeycomb can get you an awful lot.

Clear, transparent pricing based on events.

A killer UI with instant drill down into the parts of your stack you care the most about.

Fully OpenTelemetry compatible.

And proper, structured, aggregatable events.

3. Logging.

Sigh. Seriously? I looked at the logs the other day and they told me the square root of f**k all!
Engineer Who Hates Logs

Yup, I know, I've experienced useless logs too.

But over the last year I've invested heavily in setting up logs.

And to be crystal clear, when I say “logs”, I mean “structured logs with 50-200 dimensions that prioritise IDs”.

Like… proper, modern logs. Not logs stuck in the 1980s.
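A minimal sketch of the idea, using only the Python standard library: one JSON object per event, with IDs up front. The helper name `log_event` and all the field names are my own illustration, not a standard. A real setup would carry far more fields than this.

```python
import json
import sys
from datetime import datetime, timezone


def log_event(event, stream=sys.stdout, **fields):
    """Write one structured event as a single line of JSON."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "event": event,
        **fields,
    }
    line = json.dumps(record, default=str)
    stream.write(line + "\n")
    return line


log_event(
    "http.request",
    request_id="req_8f3a",   # IDs first: these are what you search by at 3am
    user_id="u_42",
    account_id="acct_7",
    path="/orders",
    status=500,
    duration_ms=812.4,
)
```

Because every line is a flat set of key/value pairs, your log tool can filter and group by any of them.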

Metrics can be derived from logs.

I’ve aggregated logs together into metrics. Totes doable.

This means you can go to the source of the data!

You can look at why that weird slowdown of the site happens on a Monday morning.

Not just that it happens.
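Roughly, deriving a metric from logs looks like this. The events and field names below are invented for the example. The point is that the same raw events give you the number and the reason behind it.

```python
from collections import defaultdict

# Hypothetical structured log events, already parsed from JSON.
events = [
    {"request_id": "r1", "path": "/orders", "status": 200, "account_id": "acct_1"},
    {"request_id": "r2", "path": "/orders", "status": 500, "account_id": "acct_7"},
    {"request_id": "r3", "path": "/orders", "status": 500, "account_id": "acct_7"},
    {"request_id": "r4", "path": "/health", "status": 200, "account_id": None},
]


def error_rate_by(events, key):
    """Share of events with a 5xx status, grouped by any field."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        totals[event[key]] += 1
        if event["status"] >= 500:
            errors[event[key]] += 1
    return {group: errors[group] / totals[group] for group in totals}


# The metric: /orders is failing about two thirds of the time.
print(error_rate_by(events, "path"))

# The why: regroup the same events by a different field.
# Every failure belongs to acct_7.
print(error_rate_by(events, "account_id"))
```

A pre-aggregated metric stops at the first print. With the events, the second one is a single extra line.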

Proper logging is paying dividends for my current client.

Engineers can, in 30 seconds, diagnose a whole bunch of problems from first principles.

Click… type… click… “Ah! So that’s where the errors are coming from…”

However, it takes a lot of setup.

And that’s where I can help.

Free Rails Observability Playbook

Check out my free playbook for observability in Rails.

Then play around with it and have fun.

It’s still a work in progress, so feedback is very welcome!

What is helpful about this playbook?
What would you like to see that’s missing?
Is the survey a good idea?

Have a great week and I’ll see you* soon!

* Not literally, I’m typing this in Belfast, Northern Ireland. Internet and all that.

Joyful Programming
