I don’t deal in veiled motives — I really like information theory. A lot. It’s been an invaluable conceptual tool for almost every area of my work; and I’m going to try to convince you of its usefulness for engineering problems. Let’s look at a timestamp parsing algorithm in the Sumo Logic codebase.
The basic idea is that each thread gets some stream of input lines (these are from my local /var/log/appfirewall.log), and we want to parse the timestamps at the start of each line into another numeric field:
Jul 25 08:33:02 vorta.local socketfilterfw[86] <Info>: java: Allow TCP CONNECT (in:5 out:0)
Jul 25 08:39:54 vorta.local socketfilterfw[86] <Info>: Stealth Mode connection attempt to UDP 1 time
Jul 25 08:42:40 vorta.local socketfilterfw[86] <Info>: Stealth Mode connection attempt to UDP 1 time
Jul 25 08:43:01 vorta.local socketfilterfw[86] <Info>: java: Allow TCP LISTEN (in:0 out:1)
Jul 25 08:44:17 vorta.local socketfilterfw[86] <Info>: Stealth Mode connection attempt to UDP 6 time
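To make the task concrete, here's a minimal sketch (mine, not Sumo Logic's production code; the assumed year and time zone are mine too) of pulling the leading timestamp out of one of these lines and turning it into epoch seconds:

```python
from datetime import datetime, timezone

line = "Jul 25 08:33:02 vorta.local socketfilterfw[86] <Info>: java: Allow TCP CONNECT (in:5 out:0)"
stamp = " ".join(line.split()[:3])                        # "Jul 25 08:33:02"
parsed = datetime.strptime(stamp, "%b %d %H:%M:%S")       # syslog omits the year...
parsed = parsed.replace(year=2014, tzinfo=timezone.utc)   # ...so assume one (and UTC)
print(int(parsed.timestamp()))                            # the numeric field we want
```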
Because Sumo Logic is a giant distributed system, we receive logs with hundreds of different timestamp formats, interleaved in the input stream. CPU time on the frontend is dedicated to parsing raw log lines, so if we can derive timestamps more quickly, we can reduce our AWS costs. Let's assume that exactly one timestamp parser will match each line; we'll leave ambiguities for another day.
How can we implement this? The naive approach is to try all of the parsers in an arbitrary sequence for every line and see which one works, but each of them is computationally expensive to evaluate. Maybe we could cache them or parallelize in some creative way? We know that caching would be optimal if the logs were all in the same format, and plain linear search would be optimal if the formats arrived at random.
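As a minimal sketch of the naive approach (the candidate formats and cost accounting here are illustrative, not the real parser set):

```python
from datetime import datetime, timezone

# Illustrative candidate formats, tried in a fixed, arbitrary order.
CANDIDATE_FORMATS = [
    "%a %b %d %H:%M:%S %Y",   # roughly EEE MMM dd HH:mm:ss yyyy (time zones omitted)
    "%b %d %H:%M:%S",         # roughly MMM dd HH:mm:ss
]

def parse_timestamp(text):
    """Try every parser until one matches; the cost grows with the number of attempts."""
    for tries, fmt in enumerate(CANDIDATE_FORMATS, start=1):
        try:
            ts = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
            return ts.timestamp(), tries
        except ValueError:
            continue
    if text.isdigit():                      # epoch-seconds fallback
        return float(text), len(CANDIDATE_FORMATS) + 1
    return None, len(CANDIDATE_FORMATS) + 1
```

Here parse_timestamp("Jul 25 08:52:10") burns two attempts; the rest of this post is about keeping that number small on average.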
In any case, the most efficient way to do this isn’t clear, so let’s do some more analysis: take the sequence of correct timestamp formats and label them:
| Timestamp | Format | Label |
|---|---|---|
| Jul 25 08:52:10 | MMM dd HH:mm:ss | Format 1 |
| Fri Jul 25 09:06:49 PDT 2014 | EEE MMM dd HH:mm:ss ZZZ yyyy | Format 2 |
| 1406304462 | EpochSeconds | Format 3 |
| [Jul 25 08:52:10] | MMM dd HH:mm:ss | Format 1 |
How can we turn this into a normal, solvable optimization problem? Well, if we try our parsers in a fixed order, the index label is just the number of parsing attempts it takes to hit the correct parser. Let's keep the parsers in their original order, add another function that reorders them, and then try them in that new order:
| Format | Parser Label | Parser Index |
|---|---|---|
| MMM dd HH:mm:ss | Format 1 | 2 |
| EEE MMM dd HH:mm:ss ZZZ yyyy | Format 2 | 1 |
| EpochSeconds | Format 3 | 3 |
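As a tiny illustration of the table above (the numbers are just the hypothetical ordering shown there), the reordering is nothing more than a map from parser label to position in the try order, and the cost of a line is the position of its correct parser:

```python
# Hypothetical reordering: parser label -> position in the try order (1 = tried first).
try_position = {"Format 1": 2, "Format 2": 1, "Format 3": 3}

def tries_needed(correct_label):
    """Attempts spent before the right parser fires under this ordering."""
    return try_position[correct_label]

print(tries_needed("Format 2"))   # 1: tried first, so cheapest
```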
This reordering is clearly better, and we can change the reordering function at every time step. Having the correct parser map to a low index is always better, because we're trying to minimize the time delay of the parsing process:
(Time Delay) ∝ (# Tries)
But can we really just optimize over that? It’s not at all clear to me how that translates into an algorithm. While it’s a nice first-order formulation, we’re going to have to change representations to connect it to anything more substantial.
| Parser Index | Parser Index (Binary) | Parser Index (Unary) |
|---|---|---|
| 2 | 10 | 11 |
| 1 | 1 | 1 |
| 3 | 11 | 111 |
This makes it clear that making the parser index small is equivalent to making its decimal/binary/unary representation short. In other words, we want to minimize the information content of the index sequence over our choice of parser ordering.
In mathematical terms, the information (notated H) is just the sum of −p log p over each event, where p is the event's probability. As an analogy, think of −log p as the length of the unary code (as above) and p as the probability of that code occurring; we'll use the empirical distribution over the parser indices that actually occur.
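As a sketch (my own toy code, using the empirical distribution just described), computing that quantity for a sequence of parser indices looks like:

```python
from collections import Counter
from math import log2

def empirical_entropy(indices):
    """H = sum of -p * log2(p) over the empirical distribution of parser indices."""
    counts = Counter(indices)
    total = len(indices)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

# A stream dominated by one format carries less information per line:
print(empirical_entropy([1, 1, 1, 2, 1, 1, 3, 1]))   # ~1.06 bits
print(empirical_entropy([1, 2, 3, 1, 2, 3, 1, 2]))   # ~1.56 bits (closer to uniform)
```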
As long as the probability of needing more tries is strictly decreasing, the information content is strictly increasing in the number of tries, so minimizing the information also minimizes the time required.
arg min{Time Delay} = arg min{Sequence Length × Probability of sequence}
= arg min{−p(# Tries) × log p(# Tries)} = arg min{H(# Tries)}
That’s strongly suggestive that what we want to use as the parser-order-choosing function is actually a compression function, whose entire goal in life is to minimize the information content (and therefore size) of byte sequences. Let’s see if we can make use of one: in the general case, these algorithms look like Seq(Int) → Seq(Int), making the second sequence shorter.
| Parser Index Sequence: Length 13 | Parser Index (LZW Compressed): Length 10 |
|---|---|
| 12,43,32,64,111,33,12,43,32,64,111,33,12 | 12,43,32,64,111,33,256,258,260,12 |
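To make the table concrete, here's a toy LZW compressor over integer sequences (a sketch, not the codec we'd actually ship) that reproduces that row:

```python
def lzw_compress(seq):
    """Minimal LZW: single symbols keep their own codes (0-255); newly seen runs
    get codes 256, 257, ... as the dictionary grows."""
    dictionary = {(i,): i for i in range(256)}
    next_code = 256
    w, out = (), []
    for symbol in seq:
        wc = w + (symbol,)
        if wc in dictionary:
            w = wc
        else:
            out.append(dictionary[w])
            dictionary[wc] = next_code
            next_code += 1
            w = (symbol,)
    if w:
        out.append(dictionary[w])
    return out

seq = [12, 43, 32, 64, 111, 33, 12, 43, 32, 64, 111, 33, 12]
print(lzw_compress(seq))   # [12, 43, 32, 64, 111, 33, 256, 258, 260, 12] -- 13 symbols down to 10
```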
Let’s say that we have some past sequence of parser labels (call it P), and we’re trying to choose the ordering for the next line. I admit that it’s not immediately clear a priori how to do this with a compression algorithm, but if we perturb it slightly, we can compare the candidate parsers for the next line by:

newInfo(parser label) = H(compress(P + [parser label])) − H(compress(P))
Any online compression algorithm lets you hold state so that you don't have to repeat computations when evaluating this. Then we just choose the parser with the least newInfo; and if the compressor does a good job of minimizing information content (which I'll assume it does), our algorithm will minimize the required work. If you'd like a deeper explanation of compression, ITILA [1] is a good reference.
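The metric above is defined in terms of a general-purpose compressor; as a minimal stand-in (my own sketch, not the production scheme), the same idea can be expressed with the code length an adaptive coder would charge for the next label, using a Laplace-smoothed empirical probability:

```python
from collections import Counter
from math import log2

def new_info(past_labels, label, num_parsers):
    """Bits an adaptive coder would charge for `label` given the history so far:
    a stand-in for H(compress(P + [label])) - H(compress(P))."""
    counts = Counter(past_labels)
    p = (counts[label] + 1) / (len(past_labels) + num_parsers)   # Laplace smoothing
    return -log2(p)

def choose_try_order(past_labels, num_parsers):
    """Try the parser whose label would add the least information first."""
    labels = range(1, num_parsers + 1)
    return sorted(labels, key=lambda lab: new_info(past_labels, lab, num_parsers))

history = [1, 1, 2, 1, 1, 1, 3, 1, 1, 1]
print(choose_try_order(history, 3))   # [1, 2, 3]: Format 1 has dominated, so try it first
```

The state here is just the running counts, which is the online, no-repeated-work property mentioned above.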
With a fairly small, reasonable change of representation, we now have a well-defined, implementable, fast metric for making online decisions about parser choice. Note that this system will work regardless of the input stream: there is no worst case beyond that of the compression algorithm itself. In that sense, the formulation is adaptive.
Certainly, the reason we can draw a precise analogy to a solved problem is that analogous situations show up in many fields, including Compression/Coding, Machine Learning [2], and Controls [3]. Information theory is the core conceptual framework here, and if I've succeeded in convincing you of that, Bayesian Theory [4] is my favorite treatment.
References:

1. Information Theory, Inference, and Learning Algorithms by David MacKay
2. Prediction, Learning, and Games by Nicolò Cesa-Bianchi and Gábor Lugosi
3. Notes on Dynamic Programming and Optimal Control by Dimitri Bertsekas
4. Bayesian Theory by José Bernardo and Adrian Smith