I've seen several articles like this, and there are a number of things to consider.
Logging to ascii means that the standard unix tools work out of the box with your log files. Especially if you use something like a tab delimiter, then you typically don't need to specify the delimiter.
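For example (the field names here are invented for illustration), a tab-delimited line slices cleanly with the coreutils defaults:

```shell
# Hypothetical tab-delimited access log: timestamp, user, status.
printf '2012-02-01T10:00:01\tbob\t200\n2012-02-01T10:00:02\talice\t404\n' > access.tsv

cut -f2 access.tsv                               # user column; cut's delimiter defaults to tab
awk -F'\t' '$3 == 404 { print $2 }' access.tsv   # users who hit a 404
```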
As an upside, you aren't storing the column definition in every single line, which definitely matters if you are doing large-volume traffic. For instance, we store gigabytes of log files per hour; grossing up that space by a significant margin impacts storage, transit and process times during write (marshallers and custom log formatting). Writes are the hardest to scale, so if I'm going to add scale or extra parsing time, I'd rather handle that in Hadoop where I can throw massive parallel resources at it.
Next you can achieve much of the advantages of json or protocol buffers by having a defined format and a structured release process before someone can change the format. Add fields to the end and don't remove defunct fields. This is the same process you have to use with protocol buffers or conceptually with JSON to have it work.
Overall there are advantages to these other formats, but the articles like this that I've seen gloss over the havoc this creates with a standard linux tool chain. You can process a LOT of data with simple tools like Gawk and bash pipelines. It turns out you can even scale those same processes all the way up to Hadoop streaming.
You make a very good point about unix tools. However, as a Javascript/JSON guy myself, I really like the way JSON works. And for a small site like mine, JSON would work much better out of the box than some sort of tab structure.
I work with node.js, which means I can console.log and then pipe into a log file. Any object sent into console.log is automatically converted to JSON.
I can also do stack traces with JSON if I ever end up with some sort of nasty bugs.
When you log Gb a day, absolutely, tabs are the way to go. But when you have a tiny little thing like mine, saving the effort for something more value adding is probably the better choice.
The use of JSON logging plus a tool like Record Stream [1] is very powerful and solves the tool chain issue. Recs is complementary to standard unix tools as well.
We used to do just that, logging as CSV, but switched to JSON.
At a certain scale you don't use Unix tool chain anymore except for a tail and that's for pure debugging. We log >10TB per day and they go to Hadoop for processing.
JSON is crazy verbose, but you pay for flexibility: we want to remove and add fields at will every day with new business requirements. It's a pain to maintain CSV log versions or use the "never remove a column" rule.
Compression rocks on JSON you get 90% gzip compression easily.
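Easy to check for yourself; the exact ratio depends on the data, but repeated key names are exactly what gzip's dictionary handles best:

```shell
# 10,000 log lines with identical key names (values would vary in real logs).
for i in $(seq 1 10000); do
  echo '{"time":1328000000,"event":"login","user":"bob","status":"ok"}'
done > sample.json

gzip -c sample.json > sample.json.gz
wc -c sample.json sample.json.gz   # the compressed file is a tiny fraction of the original
```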
We would have considered logging to thrift if we didn't have this huge flexibility requirement.
> Especially if you use something like a tab delimiter, then you typically don't need to specify the delimiter.
What if the fields contain tabs? For every human-readable format that can contain arbitrary user input (nearly all of them) you need some form of escaping (I guess you could do length prefixing in ASCII decimal, but it wouldn't be pretty either, and it would be incompatible with basic tools).
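A contrived demonstration of the failure mode: one embedded tab, and every downstream field silently shifts by one.

```shell
# The third field should be the status code, but an embedded tab in the
# user-agent value shifts everything after it.
printf 'bob\tMozilla/5.0\t200\n'   | cut -f3   # 200, as intended
printf 'bob\tMozil\tla/5.0\t200\n' | cut -f3   # "la/5.0" -- silently wrong
```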
But by far the biggest problem is the logging of text messages aimed at humans, not the delimiting. Regular expressions can help in searching logs in quick-hack jobs, but if you need to parse logs for visualization or reporting, which is very common in organizations, using them is error-prone. After all, you rely on English messages of a certain form. The complexity of that could quickly move from "easy with regexps" to "we need NLP in our log parser!" (never mind security problems with one field leaking into another due to a slightly wrong regexp).
The application might change the message to make it more readable for humans, or even move fields around, and your automated parser breaks. Structured messages, on the other hand, won't change for those concerns, as the formatting for humans happens in the back-end.
I get a bit annoyed at the "WTF kid, learn UNIX tools" kind of responses here. UNIX tools are one way of doing things, not the holy one perfect way. Tool support is important, but there are also tools for processing JSON, XML, streams available. Shouldn't you use the best tool for the job, not the one you happen to know?
(I don't mean that JSON is necessarily the best format in every use-case, but for automated processing every structured format at all trumps arbitrarily delimited and escaped files. You can easily convert between structured formats should the need arise)
> "Shouldn't you use the best tool for the job, not the one you happen to know?"
Well, 'best' in this case should be determined by combining a number of factors. Certainly technological superiority should be weighted quite heavily, but in certain circumstances the "Nobody ever got fired for buying IBM" effect is also very important.
Of course there is also the "The best tool for the job is the one you have/know." quip, which I don't generally agree with myself... so I guess what I'm saying is that your mileage may vary.
But such thinking can block innovation. It's a form of path lock-in. Both UNIX and Windows "gurus" are guilty of this, of seeing their way as the "one true way". Just because it's always been that way.
New ideas are not always better, but sometimes they might be. In the longer run "I'm used to this" (on its own) is not a good argument as there will always be new people that are not used to your specific blub, and if they can be more productive or make more reliable/secure systems then eventually you'll be out of the market.
Oh certainly, I fully agree. I think I'm just saying that I think logging is going to be one of those things, for better or worse, that most developers look at and do the mental calculus of "It's a good idea, but do I want to go out on a limb here, with this issue?" In at least some cases all the factors added together just won't work out to it being worth the risk/effort.
Basically just the IBM thing. Was IBM always the best choice? No. But even so, it was often the best choice for the individuals making that call. This is the sort of thing that you have to recognize and contend with if you want to introduce change.
Hosting Sysadmin here -- logs don't change that often unless you're developing an app that logs, and for that, I like logging to MySQL ("just add a column in a few places, and voila").
You should use the best tool for the job. If your job is webpage stats for apache, there are literally hundreds of pre-existing tools to parse apache logs, and the name of the *nix game is plaintext.
Why would logging to MySQL be a better tool for the job than NoSQL?
Unless you're routinely purging the MySQL tables fairly often (in which case it almost doesn't matter what you're using), my best guess is that you're going to end up with a slower database on read than you would with almost any NoSQL alternative.
Of course, you could keep the MySQL tables flat, but if that's what you're doing, why use MySQL at all?
I would not say better - different certainly. Jsonpipe is a lot slower and heavier. (Matters when you are adding json based web 2.0 integration to your router. For small cases jshon is 15x faster and uses 1/14th the ram.) And I would really want to avoid using jsonpipe inside of a loop.
It also seems to handle typical use cases inelegantly. Probably the most common thing I use Jshon for is turning JSON into a tab-delimited text file. To compare both, here is a query that returns JSON search results:
Most of that awkwardness is from `paste` really wanting real files to operate on. But if you are going to use jsonpipe, you might as well just write the whole thing in Python.
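For reference, the kind of pipeline I mean looks something like this (the file and field names are hypothetical; jshon's -e extracts a key, -a maps across an array, -u unquotes a string, -p pops back up the path):

```shell
# Hypothetical: results.json holds {"results": [{"title": ..., "url": ...}, ...]}.
# jshon emits one value per line; paste then pairs them into tab-delimited columns.
jshon -e results -a -e title -u -p -e url -u < results.json | paste - -
```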
The one thing that I do like about jsonpipe is that each line has the fully self contained path. So you can shuffle (or otherwise destroy) the output but still have something with usable context. Except for the example above, where the order matters a lot. For really simple cases jsonpipe's method is nice. I might just port it to C so that there can be a fair comparison.
This article feels like it would work just as well with "Protocol Buffers", "Thrift", "XML", or even maybe "ASN.1". If that's truly the case, maybe the better thing to say is "please don't (only) log in ASCII", followed by "please use a format which is hard to get wrong".
JSON scares me a little. Don't you have to worry about escaping a whole bunch of characters just in case something gets the wrong idea about what you have in a field? I saw a page not too long ago which listed about a dozen characters which should be substituted in some manner when used in JSON.
3) JSON feels friendlier than protocol buffers and Thrift because it's human readable instead of binary.
4) JSON is more convenient than multi-line formats like XML because you can grep for things easily. For example, if you have {time:015128752, event:"authentication error", user:"bob smith", details: "..."}, then I can do the following:
cat log.json | grep 'user:"bob smith"'
It's hard to do that with XML because even if you find the right user, the entire log message/object spans multiple lines.
5) JSON is not as compact as binary, but it gets surprisingly close if you gzip it. All of those repeated descriptive attribute names are great for humans, but compression algorithms love them as well.
6) The format is fairly universal, but if you ever come across a language without JSON parsing libraries, the format is not so hard that you couldn't write your own parser (compare that to Thrift, XML, etc)
JSON is a good choice if you have very tight coupling between producers and consumers of your logs.
Twitter (disclosure: I manage/was an early engineer on the analytics infra team) used JSON initially, and it quickly turned into a mess because JSON does not enforce schemas or type safety. There is no way for your consumers to find out what you are logging, other than to sample records. This makes evolving logging, deprecating fields, and doing sanity checking very complex (you basically have to build a separate metadata discovery system).
Use Thrift or Avro. Avro is extra nice because not only does it have a schema, it keeps the schema together with the data; however, the support for it is not as mature/wide. It's improving fast, though. Thrift, and, I think, Avro, have JSON protocols if you really love JSON -- you get the human-readable debugging mode, the metadata, the type safety, and compact binary representation.. happiness ensues.
Just please consider that XPath needs to store the items in a tree structure first (XML needs to be parsed and loaded into memory) before it can be useful. Running that on a large log file would be an interesting performance experiment.
Nonsense, XPath works very well with streamed content, and there are even implementations of XPath engines for FPGA and GPGPU, all of which have very strict memory limitations.
DOM is completely optional, and you only need it for convenience or to observe the modifications, e.g. the way browsers make use of it.
Just think how you would compile any XPath expression — it’s very similar to REGEX, only for structured data.
I think you're overstating the capabilities and implementation of common XPath processors here, and silently implying the use of a subset or modified version of XPath as specified. Arbitrary XPath will likely require loading the entire document in the worst case, because it allows navigation to any portion of the tree.
Navigation to any portion of the tree does not imply any need to keep the entire document in memory at once.
In the worst case, an XPath expression will not yield any results before traversing the entire tree. However, any XPath 1.0 expression (and I imagine 2.0 is no different in that regard) can be compiled into a deterministic finite automaton (DFA), which only needs to keep tabs on how many elements it has seen and which conditions have been met.
What XPath 1.0 specifically doesn’t allow are arbitrary sub-expressions in predicates. Those would be problematic in certain conditions.
The only potential issue with performance is when running a large number of XPEs against a single stream, so there are various techniques to remedy that, including merging state machines for branching expressions, etc.
There's a following:: axis in XPath which lets you look ahead to arbitrary elements. So the processor needs to load the document into memory, which of course is not streaming.
Just /node/node2/following::* in a predicate or whatever. Perhaps you're conflating XPath-as-used with XPath-as-specified? http://msdn.microsoft.com/en-us/library/ms950778.aspx explains that you can't support all of XPath (even 1.0) and process things the way that you describe.
Right, they completely eliminate any buffering. That’s very strict, but you’re right. I thought you meant that there’s a subset of XPath that can’t work on anything but DOM.
Yeah, that's all. The reason it's relevant is that XPath processors often value conformance over expanding the utility of their system. I didn't want someone going off and using libxslt or xalan expecting it to process things in a memory-efficient manner.
OT and pedantic, but common enough that I thought I'd point it out: there's no need to cat and pipe a file into grep, because grep takes a file name as an argument, eg - grep 'user:"bob smith"' log.json
or use < log.json grep 'user:"bob smith"' if you really want L-to-R whilst avoiding the redundant cat and its overhead. That doesn't allow grep to do its speed-ups though, because it's still reading stdin, not a filename.
You can also use tools like recordStream, which allow you to do grep, etc. on individual json fields. https://github.com/benbernard/RecordStream (It has a lot of other features as well)
Sidenote: benbernard worked on Amazon dev tools team
"Don't you have to worry about escaping a whole bunch of characters just in case something gets the wrong idea about what you have in a field?"
Either you're using a length-delimited format, which is pretty hostile for humans to read, or you're using a string that will require something to be escaped.
The only third option is bugs, like a newline that gets out into your logs and oops, new log line.
"I saw a page not too long ago which listed about a dozen characters which should be substituted in some manner when used in JSON."
It is very likely what you saw was the list of Unicode characters that are accepted by the JSON format but rejected by browsers and/or HTML in the context of a Javascript embedded in an HTML page. In this case, it's the browsers and/or JS-in-HTML with the issue, not JSON. And I don't mean that as a who-to-blame sort of issue, I mean, if you're producing JSON and consuming JSON and it doesn't pass through HTML it's a non-issue, and any string that passes through JS-in-HTML will have the same problem even if it isn't "JSON", so it really hasn't got anything to do with JSON if I am correct about what you're referring to.
I'm happy with JSON and it is the native format of MongoDB, CouchDB and similar databases. And grep/sed/awk. The biggest problem with JSON is that it has no standard way of representing dates, times or binary data.
> Don't you have to worry about escaping a whole bunch of characters just in case something gets the wrong idea about what you have in a field?
Yes, but you don't do it yourself. As the article mentions, pretty much every language you'd want to use has a JSON library. Use that, and (bugs aside) it'll take care of escaping for you.
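For instance, from the shell (using Python's json module here purely as an illustration of "let the library escape it"):

```shell
# The library escapes the embedded newline and quotes; the output stays on one line.
python3 -c 'import json; print(json.dumps({"msg": "line1\nline2 \"quoted\""}))'
# -> {"msg": "line1\nline2 \"quoted\""}
```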
I sure hope everyone uses a library. What worries me is that it looks so simple, so someone might just glue together their own encoding and call it "done".
A higher barrier to entry isn't always a bad thing.
rachelbythebay: Yea, you've got some good points. I don't think the author means JSON is the be-all-end-all solution to structured logging. It has its issues (escaping for one), but I think it is a reasonable choice because of its wide adoption.
A nice use of protocol buffers there. Speaking of serialization, my personal favorite is MessagePack (https://github.com/msgpack).
The real takeaway is that log files invariably tend to become interfaces for something. They often end up being used for monitoring tools, business intelligence, system diagnostic tools, system tests, and so on. And they're great for this. But not when they contain sentences like "Opening conection...", which break half those tools the moment someone fixes the typo.
The log strings became an interface. Avoid this. If it's an interface, it has to be specced, and it has to allow for backward compatibility, just like any other interface that crosses component / tool boundaries.
Whether you do the actual data storage with JSON or something else doesn't matter. It's an implementation detail (though I agree that keeping it not only machine-readable, but also human-readable, is probably a good thing).
Design the classes that represent log files, and treat them like they're part of a library API. Don't remove fields. Ideally, use the same classes for writing (from your main software) and parsing the logs (from all that other tooling) and include version information in the parser so that the class interface can be current yet the data can be ancient.
That's exactly the opposite of my understanding of logs, and the reason why I can't agree with the OP.
For me logs are a way to store extensive historical data about what happened, in the cheapest and simplest way possible. Logs are a "write a lot and read almost never" kind of storage. For this kind of storage, the simplest way to do it is tab delimited flat files.
Logs are for debug or "legal" purposes: a client complains he has lost all his data, your boss comes into the room with fumes steaming out of his ears, calling you names because you "can't store the f* client's f* valuables cleanly". Then, using awk or grep kung fu, you come up ten minutes later with the exact millisecond when the client clicked on "Yes, I am sure I want to delete my data", his IP, his session_id, his browser fingerprint and so on. In case of a security audit, you are required to send 10 years of server logs to a trusted third party (they will not do anything with it, they just want to make sure you have logs). You zip the thing and send it by email to them (thus crashing the security auditor's email server). You have done your duty.
These are what logs are for. Having JSON or any other format inside just makes them more fragile and less versatile, messes with line-oriented commands, and is unnecessary.
If you are using your logs regularly to track some business data about your product, you are using logs for the wrong purpose and should consider using something else.
Say you need to follow accesses to a particular file? The following quick and dirty one-liner probably works well enough:
tail -f log | grep --line-buffered file.pdf
How do you do that with json?
Granted, as soon as your logs stop being a sequence of records (lines) with a fixed sequence of neatly delimited fields, you will need something more than text. However,
I still don't know of tools to work with json from the command line that are as concise, efficient, flexible and robust as the standard unix utilities for text.
The problem with your first command is the "cut" - works great for a while. Then you deal with other_vhosts_combined.log, and there's a leading hostname and port, so you want -f2 instead. After that, you start working with the user-agent field, and after remembering that the date and request fields have internal spaces, settle for something like -f13, which works until someone puts a space in their faked-up referer, and you can't use cut any more.
I love plain-text logs, but that apache logformat is a bad example of "neatly delimited records", given the unstable mixture of spaces between and within fields (some of which are quoted, some not, and one even quoted with square brackets!).
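To make that concrete with two synthetic combined-format lines (the second has a space inside the referer): cut on spaces breaks, while splitting on the quote character happens to survive this particular case.

```shell
cat > access.log <<'EOF'
1.2.3.4 - - [01/Feb/2012:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "http://x/" "UA"
1.2.3.4 - - [01/Feb/2012:10:00:01 +0000] "GET /b HTTP/1.1" 200 512 "http://x/evil page" "UA"
EOF

cut -d' ' -f11 access.log             # referer on line 1, mid-referer garbage on line 2
awk -F'"' '{ print $4 }' access.log   # quote-splitting recovers both referers intact
```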
I dunno, this feels like the "web developers" approach to logging. I can't say that it wouldn't be cool to be able to parse logs into a structured format, but honestly, tools are already there to parse logs that are very powerful (gawk + shell pipes + {whatever_unix_tool_you_can_think_of}). If you don't have programmers that can knock out a real quick awk liner to process a logfile for you in any custom way you want, then I could see where this approach (using JSON) is useful because then they can use something they are familiar with instead of something they are not. But really, you should know how to use the Unix tool chain if you're a programmer.
works with awk/cut/join/grep/sort/column/etc./etc./etc.
if you have complicated enough logs that you can't maintain the shell scripts that parse them, you probably also have enough log data that json's going to blow up your space and you probably want indexes anyway, so throw it in a real database (oh hi I work for one of these, log analysis is actually one of our strong suits)
"Alex ... [realizes] that someone added an extra field in each line"
Someone?!? Who's touching the server configuration and why? Unless Alex put a publicly accessible web interface on his .conf files, this shouldn't be happening.
Back on topic. The increase in size for logging in JSON could easily be a deal breaker.
OT, but I really like the way you expressed this: "could easily be a deal breaker". Too many comments on HN instead say "This would never work because of the increase in size" instead. Your comment recognizes the nuance without getting bogged down in what-ifs and disclaimers. Very well put.
Maybe Joe or Tina the other developers. Or maybe it was Mark or Cheryl from systems.
That said, in my experience, this doesn't end up being a common issue because people do the sensible thing and add the new field to the end of the file, so your tools just continue to work.
I've done various different log formats over the years including JSON.
One thing I've done for logging errors or warning is to log them in RSS format. I monitor them just like any RSS feed. It's really handy because there's already tons of ways to read these logs so we don't have to create anything.
I wouldn't use this for a debug log because it would probably be unusable if there was a large volume of logs, but for watching errors it's great.
One of the non-functional requirements of logs is that they should be fast to write. Marshalling data into a structured format takes longer than spitting out sprintfs.
If you really need structure for ease of querying, you might as well go all the way and throw it into a proper data store.
Remember we're talking about writing to disk. The time it takes to marshal the data is pretty much insignificant to the time it takes to do the actual write.
We use JSON on disk heavily in several places where we generate or receive huge amounts of data. JSON-in-a-file is a pretty good datastore for sequential data processing because it's so convenient and you can work with the data using cat, tail or any tool supporting JSON.
> The time it takes to marshal the data is pretty much insignificant to the time it takes to do the actual write.
And JSON is more verbose.
Basically this is another read/write dichotomy.
When do we pay the price of structuring data? At write time? Or at read time? I'd rather pay it at read time, as writing may be shaving performance from my actual primary production system.
This is premature optimization. Do you think it's expensive to write a few brackets, quotes and escape a few characters? We have harder problems to think about than optimizing for writes in our logging system :-)
How can you trade write/read without some structure in the log message? And do you know that sprintf actually costs something too? It's basically a mini-language which must be parsed and interpreted. You can go google sprintf+performance to see stories of people finding this out - but most of the time it doesn't matter, because the cost is insignificant and you can do what's most convenient for you.
My point is that structure is imposed somewhere if you want to query your data. Either that structure is in the data, or in tools that parse the data. There is a "conservation of structure", if you like.
Most of the bunfights between proponents of different technologies are really arguments about where to pay structural tax. You can pay more at write time or read time on a sliding scale.
For logs, I think the smart option is to pay as little write-time overhead as possible. Their purpose is to maximally describe an event with minimal interruption to service. Every strictly unnecessary adornment to structure takes you further away from that core non-functional goal.
a) log writes should be buffered
b) once buffered, or immediately (e.g. UDP [e.g. Etsy's StatsD]), the client can typically continue and even complete without the log being flushed all the way to non-volatile media
Spitting out sprintfs means parsing text. JSON might be slow since it's text, but binary-based formats should be faster, especially if the log includes many integers.
We're also using Fluentd as well as original JSON-based logging libraries.
Fluentd deals with JSON-based logs. JSON is good for human facing interface because it is human readable and GREPable.
On the other side, Fluentd handles logs in MessagePack format internally. Msgpack is a serialization format compatible with JSON and can be an efficient replacement of JSON.
I wrote a plugin for Fluentd that sends those structured logs to Librato Metrics (https://metrics.librato.com/), which provides charting and dashboard features.
With Fluentd, our logs became program-friendly as well as human-friendly.
Loggly supports this, and they provided a good interface for querying the data as well. We used it for a while as a way to unify a couple GB of daily log data from our Rails app running on multiple instances. I even wrote a library that allows you to quickly add arbitrary keys to the request log entry anywhere in the app.
Unfortunately we had to disable it temporarily as the Ruby client did not cope well when latency increased to the Loggly service. It was fine for a while since we are both on AWS, but one day our site started getting super slow. It took a while to track down the problem because the Loggly client has threaded delivery, so a given request would not be delayed. But the problem was that the next request couldn't be started until the delivery thread terminated.
Okay, I realize this is not the best architecture. There should be a completely isolated process that's pushing the queued logs to Loggly so that the app never deals with anything but a local logging service. Loggly supports syslog-ng, but that would be standard logging, not JSON, so I think if we want to go this route we need to come up with something on our own...
Your “plain text” probably has some implicit structure. XML, JSON, ProtocolBuffers, just make that structure explicit.
Dropping to plain text only to run sed or grep is a classic case of “if you have a hammer….” XML has a myriad of tools that do make your life easier — you just need to learn to embrace them.
It all starts as a stream. That is the "universal format".
sed was designed to edit streams. A stream can be transformed via stream editing into any text format, for any downstream consumer. It's line based. That's the only limitation.
lex can handle multiline "records". There is nothing you cannot do with lex. But only if you know how to use it. It's usually faster than any scripting language. Worth learning to use? Your choice. But it is what it is. It works. It, or some clone of it, was used to build the compiler that someone used to compile the shared library you're using as part of your special solution for the format of the month.
If you produce a stream as JSON, that's great. But now we're limited to consumers that understand JSON.
If you know your consumer wants JSON, then sure use some specialised library. But that's not what this guy is suggesting. He wants everything in JSON.
Well, not every consumer wants JSON.
This is a case of "I learned [X]. Please everyone use [X]."
None of us want to have to learn every language and every application.
Now consider if [X] is UNIX. For better or worse, it's the foundation on which most stuff talked about here runs. Perhaps it seems crude, it lacks sophistication in the eyes of a younger generation. It's a "hammer". But what can you build without a "hammer"?
In his case, [X] is Javascript. What's the foundation for Javascript? A "web browser".
Perhaps some people think nothing is possible without a web browser that can run Javascript.
Your plain text is less portable than any structured format. You’re creating ad-hoc parsers to process your ad-hoc format. There has to be some implicit structure to this text, otherwise you wouldn’t be able to use lex.
All it does is it ties your format to the specific implementation of your parser, including all the bugs in your custom stack. Your logs are now .docs, just in plain text.
> It all starts as a stream. That is the "universal format".
False. It starts as a data structure in the memory of the producing entity. The most direct or lightweight format would be a direct memory dump of the process. This would be impractical, so the choice is between a generic portable structured data format and an ad-hoc serialization format.
Here’s your pipeline:
Structured data (Producer) → Plain text → Structured data (Consumer)
It’s like creating JPEGs of your logs and then running OCR to get the structured data back. That would be insane, right? But that’s the exact analogy, just your pipeline is a little lighter.
Now consider the alternative:
Structured data (Producer) → Portable structured data (.xml) → Structured data (Consumer)
The data in a structured, portable and uniform format like XML can be leveraged to offer rich and powerful tools, like XPath/XQuery/XSLT, all the while remaining agnostic to the specific data domain.
Why increase your burden even to that? Computers are there to do stupid repetitive stuff for us so we don't have to. Why design a logging format that requires even minimal manual intervention to keep automated parsing working?
Ya, logs are soooo much more valuable when they're in a database that you can query. We log to MySQL and then correlate logs with users, events, IPs, pages, etc.
Logs are just as relational as anything else; it took me a while to realize it though.
I would love to see a JSON based shell, instead of the traditional shells based on raw strings. Heck, we could have a whole ecosystem of tools built around JSON or similar semi-structured representations.
Log Many things as [my favorite format]. Make My Life Easier by doing the difficult work.
I would log in a fast, compact (but not limited) and heavily documented binary format at a hardware level with lots of fail-safes. Maybe what I am doing is more appropriately called creating a journal. [My favorite scheduler] would very lazily and at opportunistic idle times convert the older non-human-readable binary logs and insert the log data into [my favorite] database as very query-friendly information.
Is your entire log file a giant JSON array? That would be challenging for most parsers I know because they would have to read the entire array into memory first.
Or do you log one JSON object per line? Then you would get problems as soon as you have line breaks inside strings and still have to parse until the object ends in some other line. Also, JSON objects do not have to be single-line to be valid, so you would in fact be working with some self-defined subset of JSON.
That's not necessarily a bad choice, but it's problematic for the reasons you describe. Still, since JSON is concatenative, you could indeed store all of your objects with a comma at the end and then use:
> Then you would get problems as soon as you have line breaks inside strings
Then you are not logging JSON. JSON does not permit that.
> Also, JSON objects do not have to be single-line to be valid, so you would in fact be working with some self-defined subset of JSON.
Yes, and? Working in a subset of JSON which forbids newlines as whitespace -- that's still JSON, and it solves your problem elegantly.
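Concretely, with one object per line (newline-delimited JSON), line tools and real JSON parsers coexist; the python3 filter below is just a stand-in for jq or any JSON-aware tool:

```shell
cat > app.log <<'EOF'
{"time":1328000001,"event":"login","user":"bob smith"}
{"time":1328000002,"event":"auth error","user":"eve"}
EOF

grep '"event":"auth error"' app.log   # plain grep still works, one record per line

# Structure-aware filtering: one json.loads per line, no multi-line parsing needed.
python3 -c '
import json
for line in open("app.log"):
    rec = json.loads(line)
    if rec["event"] == "auth error":
        print(rec["user"])
'
```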
Do... do you have multiple logging programs, logging to the same file, and one of them wants to insert newlines? Is this a real problem in your dev stack?
It is a real problem in someone's dev stack. I have already written code to preprocess such log files before feeding chunks into a real JSON parser. It didn't make my life easier.
I'm not against logging as JSON at all, but as pointed out, you have to use a subset that makes parsing the logs easy.
I like this idea a lot. Frameworks like Rails come with excellent log messages, granularity and a pub/sub mechanism. Often this can be a lower hanging fruit than throwing in a ton of custom instrumentation for some third party analytics tool, especially when you're pressed for time.
My question is how fluentd can be hooked into Rails so that Rails' native messages use it and how does it work in the Heroku infrastructure?
I've been thinking about this recently as well. I wrote a simple JSON logger for Perl recently. It will probably be on CPAN this weekend. Until then you can see it on prepan and github.
Or provide unit tests for said log parser and require (or don't) all tests to pass pre-commit. A JSON struct isn't going to stop your colleague from removing or renaming a field, removing the logging altogether, or changing the format himself, if your company is really set up for allowing colleagues to so easily break your code -- not that anyone's perfect.
Okay, and as soon as I switch to JSON, I have not just 5 million referrers logged per day, I also have 5 million times the word "referrer" in my log. Nice.
Trivially solved by compression. In those five million referrers you’re probably also repeating “Firefox” and similar strings over and over, so compressing logs is already standard practice.