I don’t deal in veiled motives — I really like information theory. A lot. It’s been an invaluable conceptual tool for almost every area of my work; and I’m going to try to convince you of its usefulness for engineering problems. Let’s look at a timestamp parsing algorithm in the Sumo Logic codebase.
The basic idea is that each thread gets some stream of input lines (these are from my local /var/log/appfirewall.log), and we want to parse the timestamps at the start of each line into another numeric field:
Jul 25 08:33:02 vorta.local socketfilterfw[86] <Info>: java: Allow TCP CONNECT (in:5 out:0)
Jul 25 08:39:54 vorta.local socketfilterfw[86] <Info>: Stealth Mode connection attempt to UDP 1 time
Jul 25 08:42:40 vorta.local socketfilterfw[86] <Info>: Stealth Mode connection attempt to UDP 1 time
Jul 25 08:43:01 vorta.local socketfilterfw[86] <Info>: java: Allow TCP LISTEN (in:0 out:1)
Jul 25 08:44:17 vorta.local socketfilterfw[86] <Info>: Stealth Mode connection attempt to UDP 6 time
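To make the task concrete, here's a minimal sketch (mine, not Sumo Logic's production code; the assumed year and time zone are mine too) of pulling the leading timestamp out of one of these lines and turning it into epoch seconds:

```python
from datetime import datetime, timezone

line = "Jul 25 08:33:02 vorta.local socketfilterfw[86] <Info>: java: Allow TCP CONNECT (in:5 out:0)"
stamp = " ".join(line.split()[:3])                        # "Jul 25 08:33:02"
parsed = datetime.strptime(stamp, "%b %d %H:%M:%S")       # syslog omits the year...
parsed = parsed.replace(year=2014, tzinfo=timezone.utc)   # ...so assume one (and UTC)
print(int(parsed.timestamp()))                            # the numeric field we want
```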
Because Sumo Logic is a giant distributed system, we receive logs with hundreds of different timestamp formats, interleaved in the input stream. CPU time on the frontend is dedicated to parsing raw log lines, so if we can derive timestamps more quickly, we can reduce our AWS costs. Let's assume that exactly one timestamp parser will match each line; we'll leave ambiguities for another day.
How can we implement this? The naive approach is to try all of the parsers in an arbitrary sequence for every line and see which one works, but each of them is computationally expensive to evaluate. Maybe we could cache them or parallelize in some creative way? We know that caching would be optimal if the logs were all in the same format, and plain linear search would be optimal if the formats arrived at random.
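As a minimal sketch of the naive approach (the candidate formats and cost accounting here are illustrative, not the real parser set):

```python
from datetime import datetime, timezone

# Illustrative candidate formats, tried in a fixed, arbitrary order.
CANDIDATE_FORMATS = [
    "%a %b %d %H:%M:%S %Y",   # roughly EEE MMM dd HH:mm:ss yyyy (time zones omitted)
    "%b %d %H:%M:%S",         # roughly MMM dd HH:mm:ss
]

def parse_timestamp(text):
    """Try every parser until one matches; the cost grows with the number of attempts."""
    for tries, fmt in enumerate(CANDIDATE_FORMATS, start=1):
        try:
            ts = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
            return ts.timestamp(), tries
        except ValueError:
            continue
    if text.isdigit():                      # epoch-seconds fallback
        return float(text), len(CANDIDATE_FORMATS) + 1
    return None, len(CANDIDATE_FORMATS) + 1
```

Here parse_timestamp("Jul 25 08:52:10") burns two attempts; the rest of this post is about keeping that number small on average.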
In any case, the most efficient way to do this isn’t clear, so let’s do some more analysis: take the sequence of correct timestamp formats and label them:
| Timestamp | Format | Label |
|---|---|---|
| Jul 25 08:52:10 | MMM dd HH:mm:ss | Format 1 |
| Fri Jul 25 09:06:49 PDT 2014 | EEE MMM dd HH:mm:ss ZZZ yyyy | Format 2 |
| 1406304462 | EpochSeconds | Format 3 |
| [Jul 25 08:52:10] | MMM dd HH:mm:ss | Format 1 |
How can we turn this into a normal, solvable optimization problem? Well, if we try our parsers in a fixed order, the index label is just the number of parsing attempts it takes to hit the correct parser. Let's keep the parsers in their original order, add another function that reorders them, and then try them in that new order:
| Format | Parser Label | Parser Index |
|---|---|---|
| MMM dd HH:mm:ss | Format 1 | 2 |
| EEE MMM dd HH:mm:ss ZZZ yyyy | Format 2 | 1 |
| EpochSeconds | Format 3 | 3 |
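As a tiny illustration of the table above (the numbers are just the hypothetical ordering shown there), the reordering is nothing more than a map from parser label to position in the try order, and the cost of a line is the position of its correct parser:

```python
# Hypothetical reordering: parser label -> position in the try order (1 = tried first).
try_position = {"Format 1": 2, "Format 2": 1, "Format 3": 3}

def tries_needed(correct_label):
    """Attempts spent before the right parser fires under this ordering."""
    return try_position[correct_label]

print(tries_needed("Format 2"))   # 1: tried first, so cheapest
```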
This reordering is clearly better, and we can change the reordering function at every time step. Having the correct parser map to a low index is always better, because we're trying to minimize the time delay of the parsing process:
(Time Delay) ∝ (# Tries)
But can we really just optimize over that? It’s not at all clear to me how that translates into an algorithm. While it’s a nice first-order formulation, we’re going to have to change representations to connect it to anything more substantial.
| Parser Index | Parser Index (Binary) | Parser Index (Unary) |
|---|---|---|
| 2 | 10 | 11 |
| 1 | 1 | 1 |
| 3 | 11 | 111 |
This makes it clear that making the parser index small is equivalent to making its decimal/binary/unary representation short. In other words, we want to minimize the information content of the index sequence over our choice of parser ordering.
In mathematical terms, the information (notated H) is just the sum of −p log p over each event, where p is the event's probability. As an analogy, think of −log p as the length of the unary code (as above) and p as the probability of that code occurring; we'll use the empirical distribution over the parser indices that actually occur.
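As a sketch (my own toy code, using the empirical distribution just described), computing that quantity for a sequence of parser indices looks like:

```python
from collections import Counter
from math import log2

def empirical_entropy(indices):
    """H = sum of -p * log2(p) over the empirical distribution of parser indices."""
    counts = Counter(indices)
    total = len(indices)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

# A stream dominated by one format carries less information per line:
print(empirical_entropy([1, 1, 1, 2, 1, 1, 3, 1]))   # ~1.06 bits
print(empirical_entropy([1, 2, 3, 1, 2, 3, 1, 2]))   # ~1.56 bits (closer to uniform)
```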
As long as the probability of needing more tries is strictly decreasing, the information content is strictly increasing in the number of tries, so minimizing the information also minimizes the time required.
arg min{Time Delay} = arg min{Sequence Length × Probability of sequence}
= arg min{−p(# Tries) × log p(# Tries)} = arg min{H(# Tries)}
That’s strongly suggestive that what we want to use as the parser-order-choosing function is actually a compression function, whose entire goal in life is to minimize the information content (and therefore size) of byte sequences. Let’s see if we can make use of one: in the general case, these algorithms look like Seq(Int) → Seq(Int), making the second sequence shorter.
| Parser Index Sequence: Length 13 | Parser Index (LZW Compressed): Length 10 |
|---|---|
| 12,43,32,64,111,33,12,43,32,64,111,33,12 | 12,43,32,64,111,33,256,258,260,12 |
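To make the table concrete, here's a toy LZW compressor over integer sequences (a sketch, not the codec we'd actually ship) that reproduces that row:

```python
def lzw_compress(seq):
    """Minimal LZW: single symbols keep their own codes (0-255); newly seen runs
    get codes 256, 257, ... as the dictionary grows."""
    dictionary = {(i,): i for i in range(256)}
    next_code = 256
    w, out = (), []
    for symbol in seq:
        wc = w + (symbol,)
        if wc in dictionary:
            w = wc
        else:
            out.append(dictionary[w])
            dictionary[wc] = next_code
            next_code += 1
            w = (symbol,)
    if w:
        out.append(dictionary[w])
    return out

seq = [12, 43, 32, 64, 111, 33, 12, 43, 32, 64, 111, 33, 12]
print(lzw_compress(seq))   # [12, 43, 32, 64, 111, 33, 256, 258, 260, 12] -- 13 symbols down to 10
```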
Let’s say that we have some past sequence of parser labels (call it P), and we’re trying to choose the ordering for the next line. I admit that it’s not immediately clear a priori how to do this with a compression algorithm, but if we perturb it slightly, we can compare the candidate parsers for the next line by:

newInfo(parser label) = H(compress(P + [parser label])) − H(compress(P))
Any online compression algorithm lets you hold state so that you don't have to repeat computations when evaluating this. Then we just choose the parser with the least newInfo; and if the compressor does a good job of minimizing information content (which I'll assume it does), our algorithm will minimize the required work. If you'd like a deeper explanation of compression, ITILA [1] is a good reference.
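The metric above is defined in terms of a general-purpose compressor; as a minimal stand-in (my own sketch, not the production scheme), the same idea can be expressed with the code length an adaptive coder would charge for the next label, using a Laplace-smoothed empirical probability:

```python
from collections import Counter
from math import log2

def new_info(past_labels, label, num_parsers):
    """Bits an adaptive coder would charge for `label` given the history so far:
    a stand-in for H(compress(P + [label])) - H(compress(P))."""
    counts = Counter(past_labels)
    p = (counts[label] + 1) / (len(past_labels) + num_parsers)   # Laplace smoothing
    return -log2(p)

def choose_try_order(past_labels, num_parsers):
    """Try the parser whose label would add the least information first."""
    labels = range(1, num_parsers + 1)
    return sorted(labels, key=lambda lab: new_info(past_labels, lab, num_parsers))

history = [1, 1, 2, 1, 1, 1, 3, 1, 1, 1]
print(choose_try_order(history, 3))   # [1, 2, 3]: Format 1 has dominated, so try it first
```

The state here is just the running counts, which is the online, no-repeated-work property mentioned above.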
With a fairly small, reasonable change of representation, we now have a well-defined, implementable, fast metric for making online decisions about parser choice. Note that this system will work regardless of the input stream: there is no worst case beyond that of the compression algorithm itself. In that sense, the formulation is adaptive.
Certainly, the reason we can draw a precise analogy to a solved problem is that analogous situations show up in many fields, including Compression/Coding, Machine Learning [2], and Controls [3]. Information theory is the core conceptual framework here, and if I've succeeded in convincing you of that, Bayesian Theory [4] is my favorite treatment.
References:

1. Information Theory, Inference, and Learning Algorithms by David MacKay
2. Prediction, Learning, and Games by Nicolò Cesa-Bianchi and Gábor Lugosi
3. Notes on Dynamic Programming and Optimal Control by Dimitri Bertsekas
4. Bayesian Theory by José Bernardo and Adrian Smith