What's a wide, context-aware event?
Instead of the "three pillars", I'm a big fan of wide, context-aware events. But what are they?
There are two elements to these events:
1. Wide
2. Context aware
Let’s go.
1. Wide
A wide event is one that has lots of attributes.
Think 150. 200.
What goes into a wide event?
The common rule of thumb is "whatever you can think of that you have access to that's vaguely relevant to the current operation."
Many systems work off lots of small, narrow events - events with 20 or 30 dimensions each.
For example, you have an event for "http request started".
Another event for "http API request started".
Another event for "http API request returned".
Another event for "http request finished".
Where does the duration live? Well... we only know the duration once the response has come back.
So the duration has to live on the "returned" or "finished" events.
Likewise with the status.
But the request attributes are on the "started" events.
This means the information about a single operation - the request - is smeared across several events.
This causes big problems when using these events to understand your system.
I've tried working this way - and it's always frustrating when exploring.
You keep having to jump between related events to stitch together attributes that should have been there to begin with!
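Here's that smearing in miniature - a plain Python sketch with made-up event names and fields:

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("narrow")

def handle_request(path: str) -> None:
    # Event 1: the request attributes live here...
    log.info({"event": "http request started", "http.method": "GET", "http.path": path})

    started = time.monotonic()
    status = 200  # pretend we did the actual work here

    # Event 2: ...but status and duration only exist once we're done,
    # so they end up on a different event.
    log.info({
        "event": "http request finished",
        "http.status": status,
        "duration_ms": round((time.monotonic() - started) * 1000, 2),
    })

handle_request("/listings/42")
# Answering "which paths were slow?" now means joining the "started"
# and "finished" events back together after the fact.
```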
Instead, merge attributes that live together into a single wide event.
"HTTP request handled"
"HTTP API request made"
This halves the number of events.
It also makes each event richer - it's more self-contained.
Don't stop there. Throw everything you can possibly think of that might be useful into that event.
Which Heroku dyno served the request?
Which host?
What was the incoming IP address?
What were all the headers?
What's the app name?
The server name?
A full list of attributes for inspiration:
https://www.honeycomb.io/blog/event-foo-what-should-i-add-to-an-event
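Putting the two ideas together, here's a rough sketch in plain Python. The field names and the app name are illustrative, not a standard (Heroku does set the `DYNO` environment variable at runtime; swap in whatever your stack exposes):

```python
import logging
import os
import socket
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("wide")

def handle_request(path: str, headers: dict, remote_ip: str) -> None:
    started = time.monotonic()

    # Start the event with everything we know up front.
    event = {
        "event": "http request handled",
        "http.method": "GET",
        "http.path": path,
        "http.remote_ip": remote_ip,
        "http.headers": headers,
        "app.name": "listings-service",         # hypothetical app name
        "host.name": socket.gethostname(),
        "heroku.dyno": os.environ.get("DYNO"),  # set by Heroku at runtime
    }

    status = 200  # pretend we did the actual work here

    # The outcome lands on the *same* event, not a second one.
    event["http.status"] = status
    event["duration_ms"] = round((time.monotonic() - started) * 1000, 2)

    log.info(event)  # one rich, self-contained event per request

handle_request("/listings/42", {"user-agent": "curl/8.0"}, "203.0.113.9")
```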
2. Context aware
This is a doozy. And it's something I see teams not even understanding, let alone getting right.
Events are never standalone!
They almost always happen in some kind of context - in a job, in a server, in an HTTP cycle, within an API request.
So it's critical to capture this context as much as possible.
You can now say "Show me all GET requests that trigger a job of class X that makes an API call that's longer than 3 seconds"
Seriously powerful!
And it's all possible because the events are "stacked".
Here's where traces can do things that logs just cannot.
You can still have context in your logs - shared tags spread across multiple log lines.
At BiggerPockets, we tag every log line emitted within a job with that job's attributes.
But what happens when two contexts try to write to the same log key? Well... one overwrites the other. Not ideal.
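Here's that clobbering in miniature - the tag names are made up, but the failure mode is real:

```python
# A flat bag of shared log tags, roughly how "tag everything in this
# request with context" schemes end up working.
log_tags = {}

# The HTTP request context writes its attributes...
log_tags.update({"name": "GET /listings", "duration_ms": 420})

# ...then an API call made inside that same request writes *its* attributes.
log_tags.update({"name": "POST /external/api", "duration_ms": 3100})

print(log_tags)
# {'name': 'POST /external/api', 'duration_ms': 3100}
# The request's own name and duration are gone - one context clobbered the other.
```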
Traces are far more powerful.
Traces consist of several spans. Each span can be nested within another.
Ever seen a flame graph?
Yup, that's a trace. And each span can be within another one.
This means you can run the query above with no problems at all!
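As a rough sketch of how that stacking looks with the OpenTelemetry Python SDK (assuming `opentelemetry-sdk` is installed; the span names, attributes and job class are hypothetical):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print spans to stdout just so the sketch is self-contained.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("example")

# Each `with` block opens a span nested inside the current one,
# so the context "stacks" automatically.
with tracer.start_as_current_span("http.request") as request_span:
    request_span.set_attribute("http.method", "GET")
    request_span.set_attribute("http.route", "/listings/:id")

    with tracer.start_as_current_span("job.perform") as job_span:
        job_span.set_attribute("job.class", "SyncListingJob")  # hypothetical job class

        with tracer.start_as_current_span("api.call") as api_span:
            api_span.set_attribute("peer.service", "listings-api")
            # The SDK records each span's duration for you, which is what
            # lets a query filter on API calls longer than 3 seconds.
```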
The big issues?
There are two - cost and performance.
Traces are slower than logs.
They're also a lot more expensive - and your spend can easily explode.
All it takes is a few extra spans in some critical high volume requests...
...and your bill skyrockets.
This is all true with Observability 1.0 tools.
There's a new breed of tools - LightStep (ServiceNow) and Honeycomb - that support wide events.
They have simple pricing models - based on X million events.
This tends to be more cost-effective than the pricing of traditional three-pillars vendors.
This makes sense - after all, they're not charging for three different data types, just one.
That's not all - because they only deal with wide, context-aware events, there are no silos, no separate parts of the app to jump between.
OpenTelemetry - the enabler
OpenTelemetry is developing fast.
OpenTelemetry, for those of you who aren't aware, is finally a standard for observability.
The idea is that you can standardise the instrumentation across your app.
Then the vendor you choose to ship traces, logs and metrics to will be an implementation detail.
It's only a matter of time until we can switch between traditional Observability 1.0 tools and Observability 2.0 tools with three lines of code.
In that new world, many will stick with what they're familiar with - the traditional three pillars.
But it'll open up options for those who are curious about wide, context-aware events.
OpenTelemetry allows multiple exporters and processors.
So you’ll be able to try out different observability stacks with a handful of lines of code.
Don’t like them? No worries, just stop sending data and revert to using your current system.
Like them? Just switch off the old provider.
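As a sketch of what that could look like with the OpenTelemetry Python SDK - the endpoints and header names below are placeholders, and you'd need `opentelemetry-sdk` plus the OTLP HTTP exporter package installed:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

provider = TracerProvider()

# Your current vendor.
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="https://otlp.current-vendor.example/v1/traces",
                     headers={"x-api-key": "..."})
))

# The vendor you're trialling - just one more processor, same data.
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(endpoint="https://otlp.new-vendor.example/v1/traces",
                     headers={"x-api-key": "..."})
))

trace.set_tracer_provider(provider)

# Like the new tool? Delete the first processor.
# Don't? Delete the second. Your instrumentation never changes.
```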
Vendor lock-in will disappear like a bad dream.
Personally? I cannot wait.