I've seen several articles like this, and there are a number of things to consider.
Logging to ascii means that the standard unix tools work out of the box with your log files. Especially if you use something like a tab delimiter, then you typically don't need to specify the delimiter.
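For example (the field names here are invented for illustration), a tab-delimited line slices cleanly with the coreutils defaults:

```shell
# Hypothetical tab-delimited access log: timestamp, user, status.
printf '2012-02-01T10:00:01\tbob\t200\n2012-02-01T10:00:02\talice\t404\n' > access.tsv

cut -f2 access.tsv                               # user column; cut's delimiter defaults to tab
awk -F'\t' '$3 == 404 { print $2 }' access.tsv   # users who hit a 404
```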
As an upside, you aren't storing the column definition in every single line, which definitely matters if you are doing large-volume traffic. For instance, we store gigabytes of log files per hour; grossing up that space by a significant margin impacts storage, transit and process times during write (marshallers and custom log formatting). Writes are the hardest to scale, so if I'm going to add scale or extra parsing time, I'd rather handle that in Hadoop where I can throw massive parallel resources at it.
Next you can achieve much of the advantages of json or protocol buffers by having a defined format and a structured release process before someone can change the format. Add fields to the end and don't remove defunct fields. This is the same process you have to use with protocol buffers or conceptually with JSON to have it work.
Overall there are advantages to these other formats, but the articles like this that I've seen gloss over the havoc this creates with a standard linux tool chain. You can process a LOT of data with simple tools like Gawk and bash pipelines. It turns out you can even scale those same processes all the way up to Hadoop streaming.
You make a very good point about unix tools. However, as a Javascript/JSON guy myself, I really like the way JSON works. And for a small site like mine, JSON would work much better out of the box than some sort of tab structure.
I work with node.js, which means I can console.log and then pipe into a log file. Any object sent into console.log is automatically converted to JSON.
I can also do stack traces with JSON if I ever end up with some sort of nasty bugs.
When you log Gb a day, absolutely, tabs are the way to go. But when you have a tiny little thing like mine, saving the effort for something more value adding is probably the better choice.
The use of JSON logging plus a tool like Record Stream [1] is very powerful and solves the tool chain issue. Recs is complementary to standard unix tools as well.
We used to do just that, logging as CSV, but switched to JSON.
At a certain scale you don't use Unix tool chain anymore except for a tail and that's for pure debugging. We log >10TB per day and they go to Hadoop for processing.
JSON is crazy verbose, but you pay for flexibility: we want to remove and add fields at will every day with new business requirements. It's a pain to maintain CSV log versions or use the "never remove a column" rule.
Compression rocks on JSON you get 90% gzip compression easily.
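Easy to check for yourself; the exact ratio depends on the data, but repeated key names are exactly what gzip's dictionary handles best:

```shell
# 10,000 log lines with identical key names (values would vary in real logs).
for i in $(seq 1 10000); do
  echo '{"time":1328000000,"event":"login","user":"bob","status":"ok"}'
done > sample.json

gzip -c sample.json > sample.json.gz
wc -c sample.json sample.json.gz   # the compressed file is a tiny fraction of the original
```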
We would have considered logging to thrift if we didn't have this huge flexibility requirement.
> Especially if you use something like a tab delimiter, then you typically don't need to specify the delimiter.
What if the fields contain tabs? For every human-readable format that can contain arbitrary user input (nearly all of them) you need some form of escaping (I guess you could do length prefixing in ASCII decimal, but it wouldn't be pretty either, and it would be incompatible with basic tools).
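A contrived demonstration of the failure mode: one embedded tab, and every downstream field silently shifts by one.

```shell
# The third field should be the status code, but an embedded tab in the
# user-agent value shifts everything after it.
printf 'bob\tMozilla/5.0\t200\n'   | cut -f3   # 200, as intended
printf 'bob\tMozil\tla/5.0\t200\n' | cut -f3   # "la/5.0" -- silently wrong
```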
But by far the biggest problem is the logging of text messages aimed at humans, not the delimiting. Regular expressions can help in searching logs in quick-hack jobs, but if you need to parse logs for visualization or reporting, which is very common in organizations, using them is error-prone. After all, you rely on English messages of a certain form. The complexity of that could quickly move from "easy with regexps" to "we need NLP in our log parser!" (never mind security problems with one field leaking into another due to a slightly wrong regexp).
The application might change the message to make it more readable for humans, or even move fields around, and your automated parser breaks. Structured messages, on the other hand, won't change for those concerns, as the formatting for humans happens in the back-end.
I get a bit annoyed at the "WTF kid, learn UNIX tools" kind of responses here. UNIX tools are one way of doing things, not the holy one perfect way. Tool support is important, but there are also tools for processing JSON, XML, streams available. Shouldn't you use the best tool for the job, not the one you happen to know?
(I don't mean that JSON is necessarily the best format in every use-case, but for automated processing every structured format at all trumps arbitrarily delimited and escaped files. You can easily convert between structured formats should the need arise)
> "Shouldn't you use the best tool for the job, not the one you happen to know?"
Well, 'best' in this case should be determined by combining a number of factors. Certainly technological superiority should be weighted quite heavily, but in certain circumstances the "Nobody ever got fired for buying IBM" effect is also very important.
Of course there is also the "The best tool for the job is the one you have/know." quip, which I don't generally agree with myself... so I guess what I'm saying is that your mileage may vary.
But such thinking can block innovation. It's a form of path lock-in. Both UNIX and Windows "gurus" are guilty of this, of seeing their way as the "one true way". Just because it's always been that way.
New ideas are not always better, but sometimes they might be. In the longer run "I'm used to this" (on its own) is not a good argument as there will always be new people that are not used to your specific blub, and if they can be more productive or make more reliable/secure systems then eventually you'll be out of the market.
Oh certainly, I fully agree. I think I'm just saying that I think logging is going to be one of those things, for better or worse, that most developers look at and do the mental calculus of "It's a good idea, but do I want to go out on a limb here, with this issue?" In at least some cases all the factors added together just won't work out to it being worth the risk/effort.
Basically just the IBM thing. Was IBM always the best choice? No. But even so, it was often the best choice for the individuals making that call. This is the sort of thing that you have to recognize and contend with if you want to introduce change.
Hosting Sysadmin here -- logs don't change that often unless you're developing an app that logs, and for that, I like logging to MySQL ("just add a column in a few places, and voila").
You should use the best tool for the job. If your job is webpage stats for apache, there are literally hundreds of pre-existing tools to parse apache logs, and the name of the *nix game is plaintext.
Why would logging to MySQL be a better tool for the job than NoSQL?
Unless you're routinely purging the MySQL tables fairly often (in which case it almost doesn't matter what you're using), my best guess is that you're going to end up with a slower database on read than you would with almost any NoSQL alternative.
Of course, you could keep the MySQL tables flat, but if that's what you're doing, why use MySQL at all?
I would not say better - different certainly. Jsonpipe is a lot slower and heavier. (Matters when you are adding json based web 2.0 integration to your router. For small cases jshon is 15x faster and uses 1/14th the ram.) And I would really want to avoid using jsonpipe inside of a loop.
It also seems to handle typical use cases inelegantly. Probably the most common thing I use Jshon for is turning JSON into a tab-delimited text file. To compare both, here is a query that returns JSON search results:
Most of that awkwardness is from `paste` really wanting real files to operate on. But if you are going to use jsonpipe, you might as well just write the whole thing in Python.
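For reference, the kind of pipeline I mean looks something like this (the file and field names are hypothetical; jshon's -e extracts a key, -a maps across an array, -u unquotes a string, -p pops back up the path):

```shell
# Hypothetical: results.json holds {"results": [{"title": ..., "url": ...}, ...]}.
# jshon emits one value per line; paste then pairs them into tab-delimited columns.
jshon -e results -a -e title -u -p -e url -u < results.json | paste - -
```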
The one thing that I do like about jsonpipe is that each line has the fully self contained path. So you can shuffle (or otherwise destroy) the output but still have something with usable context. Except for the example above, where the order matters a lot. For really simple cases jsonpipe's method is nice. I might just port it to C so that there can be a fair comparison.
This article feels like it would work just as well with "Protocol Buffers", "Thrift", "XML", or even maybe "ASN.1". If that's truly the case, maybe the better thing to say is "please don't (only) log in ASCII", followed by "please use a format which is hard to get wrong".
JSON scares me a little. Don't you have to worry about escaping a whole bunch of characters just in case something gets the wrong idea about what you have in a field? I saw a page not too long ago which listed about a dozen characters which should be substituted in some manner when used in JSON.
3) JSON feels friendlier than protocol buffers and Thrift because it's human readable instead of binary.
4) JSON is more convenient than multi-line formats like XML because you can grep for things easily. For example, if you have {time:015128752, event:"authentication error", user:"bob smith", details: "..."}, then I can do the following:
cat log.json | grep 'user:"bob smith"'
It's hard to do that with XML because even if you find the right user, the entire log message/object spans multiple lines.
5) JSON is not as compact as binary, but it gets surprisingly close if you gzip it. All of those repeated descriptive attribute names are great for humans, but compression algorithms love them as well.
6) The format is fairly universal, but if you ever come across a language without JSON parsing libraries, the format is not so hard that you couldn't write your own parser (compare that to Thrift, XML, etc)
JSON is a good choice if you have very tight coupling between producers and consumers of your logs.
Twitter (disclosure: I manage/was an early engineer on the analytics infra team) used JSON initially, and it quickly turned into a mess because JSON does not enforce schemas or type safety. There is no way for your consumers to find out what you are logging, other than to sample records. This makes evolving logging, deprecating fields, and doing sanity checking very complex (you basically have to build a separate metadata discovery system).
Use Thrift or Avro. Avro is extra nice because not only does it have a schema, it keeps the schema together with the data; however, the support for it is not as mature/wide. It's improving fast, though. Thrift, and, I think, Avro, have JSON protocols if you really love JSON -- you get the human-readable debugging mode, the metadata, the type safety, and compact binary representation.. happiness ensues.
Just please consider that XPath needs to store the items in a tree structure first (XML needs to be parsed and loaded into memory) before it can be useful. Running that on a large log file would be an interesting performance experiment.
Nonsense, XPath works very well with streamed content, and there are even implementations of XPath engines for FPGA and GPGPU, all of which have very strict memory limitations.
DOM is completely optional, and you only need it for convenience or to observe the modifications, e.g. the way browsers make use of it.
Just think how you would compile any XPath expression — it’s very similar to REGEX, only for structured data.
I think you're overstating the capabilities and implementation of common XPath processors here, and silently implying the use of a subset or modified version of XPath as specified. Arbitrary XPath will likely require loading the entire document in the worst case, because it allows navigation to any portion of the tree.
Navigation to any portion of the tree does not imply any need to keep the entire document in memory at once.
In the worst case, an XPath expression will not yield any results before traversing the entire tree. However, any XPath 1.0 expression (and I imagine 2.0 is no different in that regard) can be compiled into a deterministic finite automaton (DFA), which only needs to keep tabs on how many elements it has seen and which conditions have been met.
What XPath 1.0 specifically doesn’t allow are arbitrary sub-expressions in predicates. Those would be problematic in certain conditions.
The only potential issue with performance is when running a large number of XPEs against a single stream, so there are various techniques to remedy that, including merging state machines for branching expressions, etc.
There's a following:: axis in XPath which lets you look ahead to arbitrary elements. So the processor needs to load the document into memory, which of course is not streaming.
Just /node/node2/following::* in a predicate or whatever. Perhaps you're conflating XPath-as-used with XPath-as-specified? http://msdn.microsoft.com/en-us/library/ms950778.aspx explains that you can't support all of XPath (even 1.0) and process things the way that you describe.
Right, they completely eliminate any buffering. That’s very strict, but you’re right. I thought you meant that there’s a subset of XPath that can’t work on anything but DOM.
Yeah, that's all. The reason it's relevant is that XPath processors often value conformance over expanding the utility of their system. I didn't want someone going off and using libxslt or xalan expecting it to process things in a memory-efficient manner.
OT and pedantic, but common enough that I thought I'd point it out: there's no need to cat and pipe a file into grep, because grep takes a file name as an argument, eg - grep 'user:"bob smith"' log.json
or use < log.json grep 'user:"bob smith"' if you really want L-to-R whilst avoiding the redundant cat and its overhead. That doesn't allow grep to do its speed-ups though, because it's still reading stdin, not a filename.
You can also use tools like recordStream, which allow you to do grep, etc. on individual json fields. https://github.com/benbernard/RecordStream (It has a lot of other features as well)
Sidenote: benbernard worked on Amazon dev tools team
"Don't you have to worry about escaping a whole bunch of characters just in case something gets the wrong idea about what you have in a field?"
Either you're using a length-delimited format, which is pretty hostile for humans to read, or you're using a string that will require something to be escaped.
The only third option is bugs, like a newline that gets out into your logs and oops, new log line.
"I saw a page not too long ago which listed about a dozen characters which should be substituted in some manner when used in JSON."
It is very likely what you saw was the list of Unicode characters that are accepted by the JSON format but rejected by browsers and/or HTML in the context of a Javascript embedded in an HTML page. In this case, it's the browsers and/or JS-in-HTML with the issue, not JSON. And I don't mean that as a who-to-blame sort of issue, I mean, if you're producing JSON and consuming JSON and it doesn't pass through HTML it's a non-issue, and any string that passes through JS-in-HTML will have the same problem even if it isn't "JSON", so it really hasn't got anything to do with JSON if I am correct about what you're referring to.
I'm happy with JSON and it is the native format of MongoDB, CouchDB and similar databases. And grep/sed/awk. The biggest problem with JSON is that it has no standard way of representing dates, times or binary data.
> Don't you have to worry about escaping a whole bunch of characters just in case something gets the wrong idea about what you have in a field?
Yes, but you don't do it yourself. As the article mentions, pretty much every language you'd want to use has a JSON library. Use that, and (bugs aside) it'll take care of escaping for you.
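For instance, from the shell (using Python's json module here purely as an illustration of "let the library escape it"):

```shell
# The library escapes the embedded newline and quotes; the output stays on one line.
python3 -c 'import json; print(json.dumps({"msg": "line1\nline2 \"quoted\""}))'
# -> {"msg": "line1\nline2 \"quoted\""}
```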
I sure hope everyone uses a library. What worries me is that it looks so simple, so someone might just glue together their own encoding and call it "done".
A higher barrier to entry isn't always a bad thing.
rachelbythebay: Yea, you've got some good points. I don't think the author means JSON is the be-all-end-all solution to structured logging. It has its issues (escaping for one), but I think it is a reasonable choice because of its wide adoption.
A nice use of protocol buffers there. Speaking of serialization, my personal favorite is MessagePack (https://github.com/msgpack).
The real takeaway is that log files invariably tend to become interfaces for something. They often end up being used for monitoring tools, business intelligence, system diagnostic tools, system tests, and so on. And they're great for this. But not when they contain sentences like "Opening conection...", which break half those tools the moment someone fixes the typo.
The log strings became an interface. Avoid this. If it's an interface, it has to be specced, and it has to allow for backward compatibility, just like any other interface that crosses component / tool boundaries.
Whether you do the actual data storage with JSON or something else doesn't matter. It's an implementation detail (though I agree that keeping it not only machine-readable, but also human-readable, is probably a good thing).
Design the classes that represent log files, and treat them like they're part of a library API. Don't remove fields. Ideally, use the same classes for writing (from your main software) and parsing the logs (from all that other tooling) and include version information in the parser so that the class interface can be current yet the data can be ancient.
That's exactly the opposite of my understanding of logs, and the reason why I can't agree with the OP.
For me logs are a way to store extensive historical data about what happened, in the cheapest and simplest way possible. Logs are a "write a lot and read almost never" kind of storage. For this kind of storage, the simplest way to do it is tab delimited flat files.
Logs are for debug or "legal" purposes: a client complains he has lost all his data, your boss comes into the room with fumes steaming out of his ears, calling you names because you "can't store the f* client's f* valuables cleanly". Then, using awk or grep kung fu, you come up ten minutes later with the exact millisecond when the client clicked on "Yes, I am sure I want to delete my data", his IP, his session_id, his browser fingerprint and so on. In case of a security audit, you are required to send 10 years of server logs to a trusted third party (they will not do anything with it, they just want to make sure you have logs). You zip the thing and send it by email to them (thus crashing the security auditor's email server). You have done your duty.
These are what logs are for. Having JSON or any other format inside just makes them more fragile and less versatile, messes with line-oriented commands, and is unnecessary.
If you are using your logs regularly to track some business data about your product, you are using logs for the wrong purpose and should consider using something else.
Say you need to follow accesses to a particular file? The following quick and dirty one-liner probably works well enough:
tail -f log | grep --line-buffered file.pdf
How do you do that with json?
Granted, as soon as your logs stop being a sequence of records (lines) with a fixed sequence of neatly delimited fields, you will need something more than text. However,
I still don't know of tools to work with json from the command line that are as concise, efficient, flexible and robust as the standard unix utilities for text.
The problem with your first command is the "cut" - works great for a while. Then you deal with other_vhosts_combined.log, and there's a leading hostname and port, so you want -f2 instead. After that, you start working with the user-agent field, and after remembering that the date and request fields have internal spaces, settle for something like -f13, which works until someone puts a space in their faked-up referer, and you can't use cut any more.
I love plain-text logs, but that apache logformat is a bad example of "neatly delimited records", given the unstable mixture of spaces between and within fields (some of which are quoted, some not, and one even quoted with square brackets!).
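To make that concrete with two synthetic combined-format lines (the second has a space inside the referer): cut on spaces breaks, while splitting on the quote character happens to survive this particular case.

```shell
cat > access.log <<'EOF'
1.2.3.4 - - [01/Feb/2012:10:00:00 +0000] "GET /a HTTP/1.1" 200 512 "http://x/" "UA"
1.2.3.4 - - [01/Feb/2012:10:00:01 +0000] "GET /b HTTP/1.1" 200 512 "http://x/evil page" "UA"
EOF

cut -d' ' -f11 access.log             # referer on line 1, mid-referer garbage on line 2
awk -F'"' '{ print $4 }' access.log   # quote-splitting recovers both referers intact
```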
I dunno, this feels like the "web developers" approach to logging. I can't say that it wouldn't be cool to be able to parse logs into a structured format, but honestly, tools are already there to parse logs that are very powerful (gawk + shell pipes + {whatever_unix_tool_you_can_think_of}). If you don't have programmers that can knock out a real quick awk liner to process a logfile for you in any custom way you want, then I could see where this approach (using JSON) is useful because then they can use something they are familiar with instead of something they are not. But really, you should know how to use the Unix tool chain if you're a programmer.
works with awk/cut/join/grep/sort/column/etc./etc./etc.
if you have complicated enough logs that you can't maintain the shell scripts that parse them, you probably also have enough log data that json's going to blow up your space and you probably want indexes anyway, so throw it in a real database (oh hi I work for one of these, log analysis is actually one of our strong suits)
"Alex ... [realizes] that someone added an extra field in each line"
Someone?!? Who's touching the server configuration and why? Unless Alex put a publicly accessible web interface on his .conf files, this shouldn't be happening.
Back on topic. The increase in size for logging in JSON could easily be a deal breaker.
OT, but I really like the way you expressed this: "could easily be a deal breaker". Too many comments on HN instead say "This would never work because of the increase in size" instead. Your comment recognizes the nuance without getting bogged down in what-ifs and disclaimers. Very well put.
Maybe Joe or Tina the other developers. Or maybe it was Mark or Cheryl from systems.
That said, in my experience, this doesn't end up being a common issue because people do the sensible thing and add the new field to the end of the file, so your tools just continue to work.
I've done various different log formats over the years including JSON.
One thing I've done for logging errors or warning is to log them in RSS format. I monitor them just like any RSS feed. It's really handy because there's already tons of ways to read these logs so we don't have to create anything.
I wouldn't use this for a debug log because it would probably be unusable if there was a large volume of logs, but for watching errors it's great.
One of the non-functional requirements of logs is that they should be fast to write. Marshalling data into a structured format takes longer than spitting out sprintfs.
If you really need structure for ease of querying, you might as well go all the way and throw it into a proper data store.
Remember we're talking about writing to disk. The time it takes to marshal the data is pretty much insignificant to the time it takes to do the actual write.
We use JSON on disk heavily in several places where we generate or receive huge amounts of data. JSON-in-a-file is a pretty good datastore for sequential data processing because it's so convenient and you can work with the data using cat, tail or any tool supporting JSON.
> The time it takes to marshal the data is pretty much insignificant to the time it takes to do the actual write.
And JSON is more verbose.
Basically this is another read/write dichotomy.
When do we pay the price of structuring data? At write time? Or at read time? I'd rather pay it at read time, as writing may be shaving performance from my actual primary production system.
This is premature optimization. Do you think it's expensive to write a few brackets, quotes and escape a few characters? We have harder problems to think about than optimizing for writes in our logging system :-)
How can you trade write/read without some structure in the log message? And do you know that sprintf actually costs something too? It's basically a mini-language which must be parsed and interpreted. You can go google sprintf+performance to see stories of people finding this out - but most of the time it doesn't matter, because the cost is insignificant and you can do what's most convenient for you.
My point is that structure is imposed somewhere if you want to query your data. Either that structure is in the data, or in tools that parse the data. There is a "conservation of structure", if you like.
Most of the bunfights between proponents of different technologies are really arguments about where to pay structural tax. You can pay more at write time or read time on a sliding scale.
For logs, I think the smart option is to pay as little write-time overhead as possible. Their purpose is to maximally describe an event with minimal interruption to service. Every strictly unnecessary adornment to structure takes you further away from that core non-functional goal.
a) log writes should be buffered
b) once buffered, or immediately (e.g. UDP [e.g. Etsy's StatsD]), the client can typically continue and even complete without the log being flushed all the way to non-volatile media
Spitting out sprintfs means parsing text. JSON might be slow since it's text, but binary-based formats should be faster, especially if the log includes many integers.
We're also using Fluentd as well as original JSON-based logging libraries.
Fluentd deals with JSON-based logs. JSON is good for human facing interface because it is human readable and GREPable.
On the other side, Fluentd handles logs in MessagePack format internally. Msgpack is a serialization format compatible with JSON and can be an efficient replacement of JSON.
I wrote a plugin for Fluentd that sends those structured logs to Librato Metrics (https://metrics.librato.com/), which provides charting and dashboard features.
With Fluentd, our logs became program-friendly as well as human-friendly.
Loggly supports this, and they provided a good interface for querying the data as well. We used it for a while as a way to unify a couple GB of daily log data from our Rails app running on multiple instances. I even wrote a library that allows you to quickly add arbitrary keys to the request log entry anywhere in the app.
Unfortunately we had to disable it temporarily as the Ruby client did not cope well when latency increased to the Loggly service. It was fine for a while since we are both on AWS, but one day our site started getting super slow. It took a while to track down the problem because the Loggly client has threaded delivery, so a given request would not be delayed. But the problem was that the next request couldn't be started until the delivery thread terminated.
Okay, I realize this is not the best architecture. There should be a completely isolated process that's pushing the queued logs to Loggly so that the app never deals with anything but a local logging service. Loggly supports syslog-ng, but that would be standard logging, not JSON, so I think if we want to go this route we need to come up with something on our own...
Your “plain text” probably has some implicit structure. XML, JSON, ProtocolBuffers, just make that structure explicit.
Dropping to plain text only to run sed or grep is a classic case of “if you have a hammer….” XML has a myriad of tools that do make your life easier — you just need to learn to embrace them.
It all starts as a stream. That is the "universal format".
sed was designed to edit streams. A stream can be transformed via stream editing into any text format, for any downstream consumer. It's line based. That's the only limitation.
lex can handle multiline "records". There is nothing you cannot do with lex. But only if you know how to use it. It's usually faster than any scripting language. Worth learning to use? Your choice. But it is what it is. It works. It, or some clone of it, was used to build the compiler that someone used to compile the shared library you're using as part of your special solution for the format of the month.
If you produce a stream as JSON, that's great. But now we're limited to consumers that understand JSON.
If you know your consumer wants JSON, then sure use some specialised library. But that's not what this guy is suggesting. He wants everything in JSON.
Well, not every consumer wants JSON.
This is a case of "I learned [X]. Please everyone use [X]."
None of us want to have to learn every language and every application.
Now consider if [X] is UNIX. For better or worse, it's the foundation on which most stuff talked about here runs. Perhaps it seems crude, it lacks sophistication in the eyes of a younger generation. It's a "hammer". But what can you build without a "hammer"?
In his case, [X] is Javascript. What's the foundation for Javascript? A "web browser".
Perhaps some people think nothing is possible without a web browser that can run Javascript.
Your plain text is less portable than any structured format. You’re creating ad-hoc parsers to process your ad-hoc format. There has to be some implicit structure to this text, otherwise you wouldn’t be able to use lex.
All it does is it ties your format to the specific implementation of your parser, including all the bugs in your custom stack. Your logs are now .docs, just in plain text.
> It all starts as a stream. That is the "universal format".
False. It starts as a data structure in the memory of the producing entity. The most direct or lightweight format would be a direct memory dump of the process. This would be impractical, so the choice is between a generic portable structured data format and an ad-hoc serialization format.
Here’s your pipeline:
Structured data (Producer) → Plain text → Structured data (Consumer)
It’s like creating JPEGs of your logs and then running OCR to get the structured data back. That would be insane, right? But that’s the exact analogy, just your pipeline is a little lighter.
Now consider the alternative:
Structured data (Producer) → Portable structured data (.xml) → Structured data (Consumer)
The data in a structured, portable and uniform format like XML can be leveraged to offer rich and powerful tools, like XPath/XQuery/XSLT, all the while remaining agnostic to the specific data domain.
Why increase your burden even to that? Computers are there to do stupid repetitive stuff for us so we don't have to. Why design a logging format that requires even minimal manual intervention to keep automated parsing working?
Ya, logs are soooo much more valuable when they're in a database that you can query. We log to MySQL and then correlate logs with users, events, IPs, pages, etc.
Logs are just as relational as anything else; it took me a while to realize it though.
I would love to see a JSON based shell, instead of the traditional shells based on raw strings. Heck, we could have a whole ecosystem of tools built around JSON or similar semi-structured representations.
Log Many things as [my favorite format]. Make My Life Easier by doing the difficult work.
I would log in a fast, compact (but not limited) and heavily documented binary format at a hardware level with lots of fail-safes. Maybe what I am doing is more appropriately called creating a journal. [My favorite scheduler] would very lazily and at opportunistic idle times convert the older non-human-readable binary logs and insert the log data into [my favorite] database as very query-friendly information.
Is your entire log file a giant JSON array? That would be challenging for most parsers I know because they would have to read the entire array into memory first.
Or do you log one JSON object per line? Then you would get problems as soon as you have line breaks inside strings and still have to parse until the object ends in some other line. Also, JSON objects do not have to be single-line to be valid, so you would in fact be working with some self-defined subset of JSON.
That's not necessarily a bad choice, but it's problematic for the reasons you describe. Still, since JSON is concatenative, you could indeed store all of your objects with a comma at the end and then use:
> Then you would get problems as soon as you have line breaks inside strings
Then you are not logging JSON. JSON does not permit that.
> Also, JSON objects do not have to be single-line to be valid, so you would in fact be working with some self-defined subset of JSON.
Yes, and? Working in a subset of JSON which forbids newlines as whitespace -- that's still JSON, and it solves your problem elegantly.
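Concretely, with one object per line (newline-delimited JSON), line tools and real JSON parsers coexist; the python3 filter below is just a stand-in for jq or any JSON-aware tool:

```shell
cat > app.log <<'EOF'
{"time":1328000001,"event":"login","user":"bob smith"}
{"time":1328000002,"event":"auth error","user":"eve"}
EOF

grep '"event":"auth error"' app.log   # plain grep still works, one record per line

# Structure-aware filtering: one json.loads per line, no multi-line parsing needed.
python3 -c '
import json
for line in open("app.log"):
    rec = json.loads(line)
    if rec["event"] == "auth error":
        print(rec["user"])
'
```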
Do... do you have multiple logging programs, logging to the same file, and one of them wants to insert newlines? Is this a real problem in your dev stack?
It is a real problem in someone's dev stack. I have already written code to preprocess such log files before feeding chunks into a real JSON parser. It didn't make my life easier.
I'm not against logging as JSON at all, but as pointed out, you have to use a subset that makes parsing the logs easy.
I like this idea a lot. Frameworks like Rails come with excellent log messages, granularity and a pub/sub mechanism. Often this can be a lower hanging fruit than throwing in a ton of custom instrumentation for some third party analytics tool, especially when you're pressed for time.
My question is how fluentd can be hooked into Rails so that Rails' native messages use it and how does it work in the Heroku infrastructure?
I've been thinking about this recently as well. I wrote a simple JSON logger for Perl recently. It will probably be on CPAN this weekend. Until then you can see it on prepan and github.
Or provide unit tests for said log parser and require (or don't) all tests to pass pre-commit. A JSON struct isn't going to stop your colleague from removing or renaming a field, removing the logging altogether, or changing the format himself, if your company is really set up for allowing colleagues to so easily break your code -- not that anyone's perfect.
Okay, and as soon as I switch to JSON, I have not just 5 million referrers logged per day, I also have 5 million times the word "referrer" in my log. Nice.
Trivially solved by compression. In those five million referrers you’re probably also repeating “Firefox” and similar strings over and over, so compressing logs is already standard practice.