Differentiable protobuf, snowflake method

Another blogger recently proposed that you could employ the snowflake method in programming. It sounded interesting enough that I decided to try it out.

The snowflake method is an old design heuristic used by fiction writers. The name comes from the Koch snowflake fractal, which starts from a triangle and then substitutes every edge with a pattern of smaller edges, repeating this until further substitutions no longer change the apparent shape. The name describes the heuristic pretty well.

Like every other heuristic, it closes off interesting choices, and there is a small chance that the heuristic is wrong for the purpose and prevents you from doing what you were planning to do.

There's one nice thing in the snowflake method: you start with the one-sentence summary you need for your elevator pitch.

I recently played with these ideas, but we have to start from the beginning for any of it to make sense to you.

JSON

Whenever there are datasets meant to be used by many people, or I otherwise have to store configuration for an application, I use JSON. It's a brilliant format for that purpose and far better than the formats previously used for it, such as XML or INI.

It's popular to grumble about the problems in JSON, but in practice it's an awesome format that has made the world a little bit better: it's efficient to parse and easy to work with. In fact, I tend to convert my XML documents into JSON using beautifulsoup before I do any work on them.
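The conversion itself doesn't need to be fancy. Roughly something like the sketch below is what I mean; the element_to_dict helper and the file name are made up for illustration, and repeated sibling tags would need better handling in real use.

from bs4 import BeautifulSoup
import json

def element_to_dict(tag):
    # Collapse an XML element into dicts and strings so it can be dumped as JSON.
    # Repeated sibling tags overwrite each other here; good enough for a sketch.
    children = [child for child in tag.children if child.name is not None]
    if not children:
        return tag.get_text(strip=True)
    return {child.name: element_to_dict(child) for child in children}

soup = BeautifulSoup(open("document.xml").read(), "xml")  # the "xml" parser needs lxml
root = soup.find()
print(json.dumps({root.name: element_to_dict(root)}, indent=2))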

Despite being a text format, JSON decodes and encodes extremely efficiently. The first bottleneck you face is the disk and network performance, rather than the decoder/encoder.

If your JSON performance starts to suffer, the first thing you'll likely want to do is start compressing the files. I have a friend who keeps saying that it's hard to beat what gzip does to your file by using a more complex encoding.
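To give a feel for how little effort that takes, here's a rough sketch; the file name and the record layout are placeholders, not anything from a real project.

import gzip, json

records = [{"id": i, "name": "item-%d" % i} for i in range(100000)]

# Write the JSON through gzip; on repetitive structured data the compressed
# file tends to be a small fraction of the plain text size.
with gzip.open("records.json.gz", "wt", encoding="utf-8") as f:
    json.dump(records, f)

# Reading it back is just as direct.
with gzip.open("records.json.gz", "rt", encoding="utf-8") as f:
    records_back = json.load(f)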

Problems of JSON

JSON is meant to be used with JavaScript, and therefore it's not an entirely language-neutral format. Users of statically typed languages are hit worst by this problem, but even I often find it tiring to write all those indexing expressions when reading JSON files.

Sometimes you have data that you don't necessarily need to read, but you might have to read it later. If it's encoded in JSON along with everything else, then you must decode it anyway.

JSON only encodes strings, numbers, booleans, null, lists and maps. If you have anything else, you have to represent it with these concepts when you use JSON.

Despite all the problems that come with JSON, at the moment I think nothing will ever replace JSON where it is used now. I think I will keep using JSON just like I'm used to.

The datasets that I work on have grown. Some of those sets are forced to be plain binary blobs with bit arrays tucked into files.

Then there's the middle ground: files that require some structure but are simultaneously quite large, yet still small enough that you can load them up all at once.

Binary JSON

There are only a few places where you could use MessagePack, BSON or CBOR but couldn't use JSON. Over time I've found perhaps one place like this, and that was the bytecode of my programming language.

I told you that JSON encoding/decoding is quite efficient. When JSON gets troublesome, it's not troublesome because we used 4 bytes to say 'true' instead of 1. The troubles come from encoding things we wouldn't have to encode, and from how we handle the data afterwards.

Protobuf

Protobuf is supposed to be a compact replacement for XML. It has been designed around a fixed wire format and a schema that interprets it.

Using the schema we can pack the data into binary arrays inside the file, which is nice because then we won't need to do it outside the data stream. We can also annotate every field in our structure with a number and assign numbers to our enumerated values. These are effective strategies for compacting the data.

If we pretend for a while that there's a middle ground between large database records and JSON, then protobuf is a more useful storage format than any binary JSON alternative.

Problems of the protobuf

There are two really big hitches when it comes to actually using protobuf files: the schemas live in a separate language of their own, and the wire format cannot be interpreted without the schema at hand.

Both of these are design flaws in protobuf itself. The first one can be solved by representing protobuf schemas as protobuf files.

The second problem is harder to solve. To solve it we would have to redesign the wire format.

Why?

If we could solve just the first problem, why solve the second? The reason is to approach the usability aspects that make JSON a good format.

If we want to read JSON files and check something quickly, we can do it with ordinary shell commands.

If there is a solvable merge conflict in a JSON file, we can hand-edit it away, or if we feel fancy we can use one of the diff programs made for JSON.

Without a schema, a protobuf file doesn't distinguish between a message and a binary blob of floats.

I wouldn't strictly need to solve this problem, but file encodings are a recurring thing I keep wondering about from time to time. If I solve the pressing problems now, I can perhaps keep my thoughts away from them for a longer while.

Differentiable protocol buffers

protodiff

Using the snowflake method, we start with a one-sentence summary. In this case it is "Differentiable protocol buffers": structured data that we can run a diff on without knowing the specifics of the schema.

The ability to run a diff on it is a baseline requirement if we want something as versatile as text files are.

The diff will require awareness of ordered and unordered elements. Also, for the purpose of printing the results, we may want to be able to tell apart binary sequences that are text and those that are not.
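A simple heuristic would do for the printing side. This is only my assumption of how it could be decided, not part of the format:

def looks_like_text(blob: bytes) -> bool:
    # Treat the blob as printable text if it decodes as UTF-8 and contains
    # no NUL bytes; otherwise print it as hex or elide it from the diff output.
    if b"\x00" in blob:
        return False
    try:
        blob.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True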

We assume that there are relatively few unordered elements in structured data files.

Differentiable wire encoding

Tags are encoded using LEB128, and the int type with SLEB128. These are stored big endian, because for variable-length encoded values it doesn't matter much which byte order is used. Everything else is little endian by default, since most systems at the moment require or at least support little endian encoding.

0: int
1: [int]
2: utf-8 encoded string
3: [utf-8 encoded string]
4: message
5: [message]
6: data
7: [data]
8: 32-bit value
9: 64-bit value
A: union(a)
B: [union(a)]
C: set(union(a))
D: map(union(a), union(b))
E: meta data
F: meta message

Many of the design choices taken by protobuf are preserved. From the field tag we take a fourth bit for the wire type. This decreases the field numbers that fit into a one-byte tag from 0-15 to 0-7.
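As a sketch, this is how I picture a tag being packed. It uses the conventional LEB128 formulation with 7-bit groups and a continuation bit; the function names are mine and the exact group ordering is a wire format detail this snippet doesn't pin down.

def encode_uvarint(value):
    # LEB128: 7 bits per byte, high bit set on every byte except the last.
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_tag(field_number, wire_type):
    # The low 4 bits carry one of the 16 wire types listed above. Spending a
    # fourth bit on the type means only field numbers 0-7 still fit in one byte.
    assert 0 <= wire_type <= 0xF
    return encode_uvarint((field_number << 4) | wire_type)

# encode_tag(3, 0x2) == b'\x32'      (field 3, utf-8 string, one byte)
# encode_tag(9, 0x4) == b'\x94\x01'  (field 9 no longer fits in a single byte)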

Protobuf has a concept of oneof, a type union. It encodes into a message with only one field. We extend that concept into the wire format and introduce unordered elements with a set and a map.

We provide a mechanism for adding metadata into the data stream. Only the SCHEMA data block is actually introduced, so this part is a bit loose. But if the properties of the wire format aren't enough, then all of this is likely to be replaced anyway.

SCHEMA tag

Whenever the data is stored into a file of any kind, it is annotated with a SCHEMA tag. The tag contains a POSIX path or a URL pointing to the schema that was used.

Schema storage

The schema is itself stored as a differentiable protocol buffer, and there's a schema for schemas that every encoder/decoder must know.

The schema contains declarations and submodules that you can associate with the objects and classes of your programming language.

Every schema file is complete, so when you face a new file with contents you have to parse, you can download its schema into your project and import or include it wherever you need to read such files.

This way the schemas also describe the state the software is in.

Differentiability

To prove the practicality of differentiability, we need to implement at least one tool that produces a minimal or near-minimal set of transformations between two differentiable protocol buffer files.

Notice: Approximate matching turned out to be a much harder problem to solve, and the ideas I wrote down here don't solve it.

Out of many options, I decided to implement KF-Diff+. It is likely to be practical because of its lower complexity.

KF-Diff requires that a key field is picked. At every unordered field in the differentiable protocol buffer model, either the set item itself or its map key can be used as the key value. Otherwise I would have no clue which value represents a 'key'. To come up with an idea of how to solve this, I also read the classic diff paper: "An O(ND) Difference Algorithm and Its Variations" by Eugene W. Myers.

For completeness I studied the Hunt-McIlroy algorithm as well. I didn't figure out how I would make it work for fuzzy equivalences.

For edit scripts, it is important to remember that false equivalence is unacceptable whereas false inequivalence can be accepted.

Edit script interchangeability

Certain edit scripts are interchangeable. For example, two successive deletions followed by two successive insertions are equivalent to any interleaving of the two. When this happens, we would like to prefer the edit script that puts the deletions first.

But when there's a large object, how should we determine whether it has been changed or removed?

Equivalence more important than inequivalence

An important observation is that equivalence is far more important than inequivalence. Therefore if we have the sequences:

{0: "soap"}            
{0: "soap", 1: "clay"} 
{0: "charcoal"}

{0: "soap", 1:"clay"}
{1: "clay"}
{0: "charcoal"}

Then the edit script:

Remove [0]{0: "soap"}
Insert [2]{1: "clay"}

Is preferred over:

Update [0], insert (1: "clay")
Update [1], remove (1: "clay")
Insert [2]{1: "clay"}

Even when elements have different fields, we would like to tell the least equivalent ones apart from the most equivalent ones.

Hashes, equivalence and inequivalence

To check for inequivalence quickly, we can use a hash function. A hash doesn't accurately establish equivalence, but it does tell us about inequivalence.

So if we have two elements with differing hashes, they are inequivalent. Comparing the hashes of the sub-elements that make up those two elements also lets us determine how inequivalent they are.

We can also pretend that two elements are equivalent, and then traverse through them to check that they really are. In the rare case that they aren't, it is tolerable to compute a local update.
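A sketch of what I have in mind, operating on already-decoded values (dicts for messages and maps, lists for repeated fields); the names and the choice of SHA-1 are mine, not part of any implementation yet.

import hashlib

def structure_hash(value):
    # Hash a decoded element recursively. Equal hashes do not prove equivalence,
    # but differing hashes prove inequivalence, which is the cheap test we want.
    h = hashlib.sha1()
    if isinstance(value, dict):              # message or map: order-insensitive
        for key in sorted(value, key=repr):
            h.update(repr(key).encode("utf-8"))
            h.update(structure_hash(value[key]))
    elif isinstance(value, (list, tuple)):   # ordered repeated field
        for item in value:
            h.update(structure_hash(item))
    else:                                    # leaf: int, string, bytes, ...
        h.update(repr(value).encode("utf-8"))
    return h.digest()

# Differing hashes: certainly inequivalent. Equal hashes: pretend equivalent,
# verify by traversal, and fall back to a local update in the rare mismatch.
def clearly_inequivalent(a, b):
    return structure_hash(a) != structure_hash(b)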

Splices and merges

Some advanced diff algorithms can detect when elements have been moved. Also, some algorithms on tree edit distances, such as Zhang-Shasha, define splicing the tree as one of the valid editing operations.

Weighted greedy diff

Now we can start by searching for the shortest edit script. We want to find a script with the least number of insertions, deletions and those pesky small differences.

At the moment I do not define how many insertions and deletions a difference is worth. But we can consider a diagonal everywhere the hash-weight is less than 2, that is, one removal plus one insertion.

Intuition suggests that if there are more differences than equivalences, we should treat the two items as different. But if there are more equivalences than differences, then we should do the inverse.

This approach suggests dividing the differences by the equivalences and multiplying by 2. If there are zero equivalences, then we directly assign the value two. We may vary this heuristic later if that turns out to be useful.
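Written out, the heuristic is tiny; the difference and equivalence counts would come from comparing child hashes of the two items, and the function names are just mine:

def hash_weight(differences, equivalences):
    # Differences divided by equivalences, times 2; with no equivalences at all,
    # assign the value two directly (worth one removal plus one insertion).
    if equivalences == 0:
        return 2.0
    return 2.0 * differences / equivalences

# A weight below 2 means the items look like the same item with local edits,
# so the pair can be treated as a (fuzzy) diagonal instead of a remove + insert.
def fuzzy_equivalent(differences, equivalences):
    return hash_weight(differences, equivalences) < 2.0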

We can use this information when we calculate the D-paths described in Myers' paper. If there's a fuzzy-equivalent path after a clear D-path, we can branch on that path once more.
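For reference, a bare-bones version of Myers' greedy forward pass looks like the sketch below. It only returns the length of the shortest edit script and compares items exactly; the a[x] == b[y] test on the diagonal is where the fuzzy-equivalence check above would plug in.

def myers_distance(a, b):
    # Length of the shortest edit script between sequences a and b,
    # straight from the O(ND) paper, without the weighting discussed above.
    n, m = len(a), len(b)
    v = {1: 0}                               # furthest x reached on diagonal k
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            if k == -d or (k != d and v.get(k - 1, 0) < v.get(k + 1, 0)):
                x = v.get(k + 1, 0)          # step down: an insertion
            else:
                x = v.get(k - 1, 0) + 1      # step right: a deletion
            y = x - k
            while x < n and y < m and a[x] == b[y]:
                x += 1                       # follow the diagonal (a "snake")
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d
    return n + m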

On the unordered elements we can use the KF-Diff+ directly.

For completeness we can also diff the strings. If a string contains newlines, we split it into lines first and do the comparison on those.
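Reusing the sketch above, that just means feeding in lists of lines instead of whole strings:

old_text = "soap\nclay\ncharcoal"
new_text = "soap\ncharcoal"

# Compare line by line with the same routine; here one line was deleted.
print(myers_distance(old_text.splitlines(), new_text.splitlines()))  # -> 1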

The edit script

The edit script can be written out as a differentiable protocol buffer. The format is chosen to make it straightforward to play back the changes or to show the diff overlaid on the files side by side.

The diff has an immediate use within differentiable protocol buffers themselves. We can apply it to different versions of a schema file to determine how that specific schema has changed over time.

MVP

I'm going to write an implementation of this and see how it rots over two months. Experience shows that there will be refinements to the early idea, and they need some incubation time to flourish.

Why?!!

Now that I've been thinking about it, the idea of differentiable protocol buffers is a bit dubious to me. It is a back-of-a-napkin sort of idea, and I cannot be certain that there will be use cases for differentiable protocol buffers.

Differentiable binary files have been explored at least once after this venture. The verification of differentiability is a twist that makes this interesting to explore in itself.

You will soon be able to find the reference implementation of differentiable protocol buffers at github.com/cheery/diffbuff.

Fin

By the way, I recently started using a colorscheme randomizer with my Vim. I have configured it to give me a random colorscheme every time I start a new instance.

syncolors

Sometimes it gives me a color scheme that I cannot use, but more often it makes the day a little bit better.
