Scoped, Structured, and Distributed Logging - Jane Lusby

Okay, so what is this talk? This talk is intended to introduce my coworkers to logging concepts I learned about at a rust meetup. Eliza Wiesman gave a talk on her library tokio-rs/tracing https://www.youtube.com/watch?v=j_kXRg3zlec which I found compelling, in no small part because I felt it would be useful to have similar features in our clusters.

tracing itself is a library that is based on Distributed Tracing as defined by the microservices world.

The main concepts I would like to introduce are the Event+Span model of logging, structured logging, and as little rust syntax as is necessary to demonstrait those two ideas. The Event+Span model of logging encodes relationships between log messsages and units of work done by the system and makes it easier to find logs which are tied logically to some event on another node, and to isolate logs from those on different threads to handle the confusion and chattiness that is common when debugging heavily asyncronous code. Structured logging is a paradigm for how data is formatted so that it can be output in various formats (plain text vs json vs binary) and parsed more effectively.

All of the examples will be in rust, mostly because none of the c++ implementations of distributed tracing or structured logging are anywhere near as clean to look at as tracing is. This is 100% thanks to rust’s powerful macro system.

Definitions

Some quick definitions so the examples make sense

  • Event - Essentially a log message
  • Span - An abstract representation of a unit of work, usually a few ids, a name, and whatever associated data you want.
  • Trace - A Span and all of its children

The secret sauce is in that children bit. Every span and event has a parent span associated with it. These relationships define a treelike structure that represents the various code paths and events you go through when completing a unit of work.

A Basic Example, logging in rust.

The following example is taken from the log crate, not tracing. It is the current existing standard for logging in the rust ecosystem, you can view tracing as an extension upon log’s standard. The includes are from tracing rather than log but they’re a drop in replacement, you can replace tracing with log on line 2 and everything will work the same more or less.

// Import `tracing`'s macros rather than `log`'s
use tracing::{span, info, warn, Level};

// unchanged from here forward
pub fn shave_the_yak(yak: &mut Yak) {
    info!(target: "yak_events", "Commencing yak shaving for {:?}", yak);

    loop {
        match find_a_razor() {
            Ok(razor) => {
                info!("Razor located: {}", razor);
                yak.shave(razor);
                break;
            }
            Err(err) => {
                warn!("Unable to locate a razor: {}, retrying", err);
            }
        }
    }
}

// Dummy impls to make the example compile
#[derive(Debug)] pub struct Yak(String);
impl Yak { fn shave(&mut self, _: u32) {} }
fn find_a_razor() -> Result<u32, u32> { Ok(1) }

So in this example we have a function, on entry we log a message saying we are gonna shave a yak, and what yak we’re going to shave, then we loop and find a razor, and if we get one we shave the yak and log that, and if we dont we warn that we cant find a razor.

The Same Basic Example: Reloaded

So now we will rewrite this example to use features from tracing. This will introduce spans and structured log messages, and will lead into the rest of the talk.

use tracing::{span, info, warn, Level};

pub fn shave_the_yak(yak: &mut Yak) {
    // create and enter a span to represent the scope
    let span = span!(Level::TRACE, "shave_the_yak", ?yak);
    let _enter = span.enter();

    // Since the span is annotated with the yak, it is part of the context
    // for everything happening inside the span. Therefore, we don't need
    // to add it to the message for this event, as the `log` crate does.
    info!(target: "yak_events", "Commencing yak shaving");
    loop {
        match find_a_razor() {
            Ok(razor) => {
                // We can add the razor as a field rather than formatting it
                // as part of the message, allowing subscribers to consume it
                // in a more structured manner:
                info!({ %razor }, "Razor located");
                yak.shave(razor);
                break;
            }
            Err(err) => {
                // However, we can also create events with formatted messages,
                // just as we would for log records.
                warn!("Unable to locate a razor: {}, retrying", err);
            }
        }
    }
}

#[derive(Debug)] pub struct Yak(String);
impl Yak { fn shave(&mut self, _: u32) {} }
fn find_a_razor() -> Result<u32, u32> { Ok(1) }

Ok so what changed, now instead of logging a message on entering the function we create a span and do something with our yak argument and then enter said span. Then we log the same message as before to say that we’re starting our yak shave. Then in the loop when we have a razor instead of formatting the razor into the log string we do something weird with it. The warn log line is unchanged.

Okay so you probably guessed already but those weird things are structured logging. The % and ? infront of them are a special syntax implemented by the macros in tracing that tell the logger whether or not to use Debug or Display when formatting the data. These are just two traits for how to essentially ostream objects in rust. Display is the default user defined version, and is intended to be user facing. Debug is generally automatically generated by the compiler for you and is essentially a jsonification of the struct for use by developers when debugging. In this case % means format this with Display and ? means format this with Debug.

What does the output look like?

This is a tricky question because it’s not actually defined by tracing but by the subscriber implementation, aka the thing that tracing passes all the events to and delegates all decision logic to. The tracing group provides a couple of implementations of subscribers, we will focus on tracing-fmt because its the most basic “normal” logger that converts your events straight into log lines on stdout, just like you’d expect.

The basic example above might print something like this.

Jul 23 14:01:51.803 INFO  shave_the_yak{yak=Yak{name: "hairy"}}: Commencing yak shaving
Jul 23 14:01:51.804 INFO  shave_the_yak{yak=Yak{name: "hairy"}}: Razor located razor=1

As you can see each log line is associated with the parent span, this logger outputs all parent spans and the associated data of each parent span and then the event message followed by any associated data of the log event. The actual output can be whatever you like, the point is that when you go to log you have all of this data available to you in a highly structured manner.

The structured logging lets the logger format and stringify the name of your fields and append them or prepend them nicely and consistently. No more having to write "Shaving yak (" << yak << ")" or similar custom and inconsistent formatting for each and every log message.

What am I proposing? Why is this useful for Scale Computing?

What im not proposing is replacing or rewriting our entire log system. I want to see what incremental changes we can make to our existing log framework to move us towards a Span+Event model of logging.

With our tasks especially, its very common to have a few fields that we put in every log message like the rsdID or journalID or w/e, that really should be associated with the span that represents the task.

Many of our tasks spin up tasks on remote nodes and it can be hard to trace a parent child relationship between them. We do have parent ids for tasks, which is amazing and is already I think part of the way there, but its not really formalized or easy to work with, and its very specific to our Task/PDU system.

With general logging im constantly having to write messages like "variable name = " << variable" just to get some basic debug information available. I want structured logging so I can just write something like DEBUG("whatever event just happened", var) and have it magically and consistently output the variable name and its value, and possibly output the data as json so I can query for a specific set of lines and then grab the associated json data and parse / process it.

The goal here is to make logs easier to write, easier to process, and easier to search. I doubt we’ll ever get all the way there to every feature I can imagine, but I think there are certainly some low hanging fruit that we could pick, and I want to make sure everyone is familiar with these concepts so when opportunities arrise we’re ready for them.