Log transport and aggregation at scale

Posted by markdrayton on Server Fault See other posts from Server Fault or by markdrayton
Published on 2009-05-01T13:09:46Z Indexed on 2012/09/04 21:40 UTC
Read the original article Hit count: 192

Filed under:
|
|
|

How're you analysing log files from UNIX/Linux machines? We run several hundred servers which all generate their own log files, either directly or through syslog. I'm looking for a decent solution to aggregate these and pick out important events. This problem breaks down into 3 components:

1) Message transport

The classic way is to use syslog to log messages to a remote host. This works fine for applications that log into syslog but less useful for apps that write to a local file. Solutions for this might include having the application log into a FIFO connected to a program to send the message using syslog, or by writing something that will grep the local files and send the output to the central syslog host. However, if we go to the trouble of writing tools to get messages into syslog would we be better replacing the whole lot with something like Facebook's Scribe which offers more flexibility and reliability than syslog?

2) Message aggregation

Log entries seem to fall into one of two types: per-host and per-service. Per-host messages are those which occur on one machine; think disk failures or suspicious logins. Per-service messages occur on most or all of the hosts running a service. For instance, we want to know when Apache finds an SSI error but we don't want the same error from 100 machines. In all cases we only want to see one of each type of message: we don't want 10 messages saying the same disk has failed, and we don't want a message each time a broken SSI is hit.

One approach to solving this is to aggregate multiple messages of the same type into one on each host, send the messages to a central server and then aggregate messages of the same kind into one overall event. SER can do this but it's awkward to use. Even after a couple of days of fiddling I had only rudimentary aggregations working and had to constantly look up the logic SER uses to correlate events. It's powerful but tricky stuff: I need something which my colleagues can pick up and use in the shortest possible time. SER rules don't meet that requirement.

3) Generating alerts

How do we tell our admins when something interesting happens? Mail the group inbox? Inject into Nagios?

So, how're you solving this problem? I don't expect an answer on a plate; I can work out the details myself but some high-level discussion on what is surely a common problem would be great. At the moment we're using a mishmash of cron jobs, syslog and who knows what else to find events. This isn't extensible, maintainable or flexible and as such we miss a lot of stuff we shouldn't.

Updated: we're already using Nagios for monitoring which is great for detected down hosts/testing services/etc but less useful for scraping log files. I know there are log plugins for Nagios but I'm interested in something more scalable and hierarchical than per-host alerts.

© Server Fault or respective owner

Related posts about linux

Related posts about unix