Types have tremendous information density

Type annotations and datatype declarations are dense in information. Much of that information is hard to notice, even though it really is there.

Type annotations can even function as a useful form of documentation about the behavior of the program, prompt you to think, and help in writing code that stays correct.

I've found these ideas useful regardless of the programming language you use.

In this post I am going to present some theory and ideas, explore potential uses, and finally show what kind of practical applications this theory has.

C in small pieces

Imagine a small subset of the C type system. It is chosen as an example because it's simple and easy to explain.

C has base types, and composite types that are constructed from other types.

Of the base types, we take only int. It represents some range of mathematical integers.

The composite types we take are union, struct, and pointer.

Every C type has a size in bytes. Fields inside unions and structs additionally have an offset in bytes.

We do not specify what sizeof(int) is, but we assume it is some small constant.

Now let's first consider the following union:

u = union { a; b; c; }
sizeof(u) = max(sizeof(a), sizeof(b), sizeof(c))
offsetof(u._) = 0

I left the labels out because they are not terribly important here. We can give each field a natural number and refer to them that way, for example u.0.

The sizeof clause tells us that the size of this union is the size of the largest field it has.

The offsetof clause tells us that all fields in a union overlap.

Next, let's consider the following struct:

s = struct { a; b; c; }
sizeof(s) = sizeof(a) + sizeof(b) + sizeof(c)
offsetof(s.0) = 0
offsetof(s.n) = offsetof(s.p) + sizeof(s.p) where p = n-1

There are several ways to calculate the size and offsets for C structures. Every platform has its own similar, but incompatible, way to do this.
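
To make the layout rules concrete, here is a small C program (the field types are picked arbitrarily for illustration) that prints the sizes and offsets a real compiler chooses; note that struct offsets may exceed the plain sum-of-sizes rule because of alignment padding:

#include <stddef.h>
#include <stdio.h>

union u { int a; char b; double c; };
struct s { char a; int b; double c; };

int main(void)
{
    /* every union field starts at offset 0; the union is as large
       as its largest field */
    printf("sizeof(union u)  = %zu\n", sizeof(union u));
    printf("offsetof(u.c)    = %zu\n", offsetof(union u, c));

    /* struct fields are laid out in order; the compiler may insert
       padding between fields, so offsets can exceed the plain
       sum-of-sizes rule given above */
    printf("sizeof(struct s) = %zu\n", sizeof(struct s));
    printf("offsetof(s.b)    = %zu\n", offsetof(struct s, b));
    printf("offsetof(s.c)    = %zu\n", offsetof(struct s, c));
    return 0;
}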

Finally we only have to consider pointers. You construct a pointer type by putting a star symbol after the type it points to. An example:

p = s*
sizeof(p) = address_size
offsetof(p[i]) = sizeof(s) * i
p.n = p[0].n

Like int, pointers have some constant size that we don't specify. Despite the name, it is not necessarily the size of a machine address.

The offsetof line abuses the notation a bit to propose that pointers may also refer to arrays. When a pointer does, there is an element count associated with it. The type system in C does not track these details.
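
A common workaround is to carry the count alongside the pointer. A minimal sketch, with names invented for illustration:

#include <stddef.h>

/* a pointer known to refer to an array: the element count travels
   with it, because the C type system won't track it for us */
struct int_slice {
    int    *items;
    size_t  count;
};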

Similarly, for the union we have not described how it is determined which interpretation of the union is in use. We assume those details can be tracked from the context. I think this is a nicer choice than trying to give the C language properties it does not have by design.

Most of the ideas presented above are what most people recognize in types: types are presented as structures for storing data on a computer, and nothing more than that.

Now let's look at what kind of properties these structures have that are easily disregarded.

Decoding & Encoding schemes

Automatic serialization may already be familiar to many, so we start with that.

Many C structures contain everything you need to successfully write encoders and decoders that convert them into some file format such as JSON or XML.

This does not hold for all structures, because when encoding you don't know which shape a union record currently uses, or how many elements a pointer refers to.

The context required for automatic serialization could be easily supplied. For example:

active = struct { int; }
inactive = struct { }
reserved = struct { int; int; }

status = struct { int; union { active; inactive; reserved; } }

status.1 selector = status.0

Here the selector rule tells us that the tag in front selects the union member that is in use. In this case the information is bidirectional: the single rule can be used both to set and to retrieve the union tag.

Such a simple selector wouldn't be suitable for everything, but it would work for many kinds of structures.
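
As a sketch of what an encoder could do with that selector rule, here is one possible rendering in C, with invented field names and a toy line-based output format standing in for JSON:

#include <stdio.h>

struct active   { int value; };
struct inactive { char unused; };   /* C requires at least one member */
struct reserved { int lo, hi; };

struct status {
    int tag;    /* the selector: 0 = active, 1 = inactive, 2 = reserved */
    union {
        struct active   a;
        struct inactive i;
        struct reserved r;
    } u;
};

/* the selector rule read forwards: the tag tells the encoder
   which union member to serialize */
void encode_status(const struct status *s, FILE *out)
{
    fprintf(out, "tag=%d ", s->tag);
    switch (s->tag) {
    case 0: fprintf(out, "active value=%d\n", s->u.a.value); break;
    case 1: fprintf(out, "inactive\n"); break;
    case 2: fprintf(out, "reserved lo=%d hi=%d\n", s->u.r.lo, s->u.r.hi); break;
    }
}

Read backwards, the same rule tells a decoder which union member to fill in after it has parsed the tag.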

Derivable properties

Another piece of functionality closely related to automatic serialization is pretty printing. C types have enough information in them to determine automatically how they should be printed.

The information for selecting a union member could also be used when pretty printing the data. Again, any further information needed by the printer could be supplied in the same way.

Automating serialization and printing alone already covers huge swaths of rote code you don't have to write.
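
As a sketch of how such printing could be derived, assume a hypothetical field-descriptor table generated from the type declaration (limited to int fields here for brevity):

#include <stddef.h>
#include <stdio.h>

/* a minimal field description: a name and an offset */
struct field { const char *name; size_t offset; };

/* walk a description table and print each field; the same table
   could drive an encoder just as well */
void pretty_print(const void *obj, const struct field *fields, size_t n)
{
    for (size_t i = 0; i < n; i++)
        printf("%s = %d\n", fields[i].name,
               *(const int *)((const char *)obj + fields[i].offset));
}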

The type information already present also allows us to derive a concept of 'equality' or 'bigger, smaller' without having to write something like:

int record_cmp(record a, record b)
{
    int k;

    k = cmp(a.group_id, b.group_id);
    if (k != 0)
    {
        return k;
    }
    return cmp(a.user_id, b.user_id);
}

There are many other properties that would end up being easy to derive from the structures. The information for doing this reliably is in the type declarations.
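
Continuing the descriptor idea, here is a sketch of how the comparison above could be derived instead of hand-written; the offset table is the part that would be generated from the type declaration:

#include <stddef.h>

struct record { int group_id; int user_id; };

/* the part that could be generated from the type declaration */
static const size_t record_offsets[] = {
    offsetof(struct record, group_id),
    offsetof(struct record, user_id),
};

/* generic lexicographic comparison over int fields described by a
   table of offsets; the same loop serves any such description */
int derived_cmp(const void *a, const void *b,
                const size_t *offsets, size_t nfields)
{
    for (size_t i = 0; i < nfields; i++) {
        int x = *(const int *)((const char *)a + offsets[i]);
        int y = *(const int *)((const char *)b + offsets[i]);
        if (x != y)
            return x < y ? -1 : 1;
    }
    return 0;
}

Calling derived_cmp(&r1, &r2, record_offsets, 2) then behaves like the hand-written record_cmp above.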

Types also self-describe properties such as: are they recursive? Do they contain pointers?

Structural conversions/maps

Even though C types are not parametric, they provide some details on how to convert from one kind of type to another.

There are large swathes of conversions where the conversion step is so trivial that it can be guessed from just a little bit of information.

For example, if you have structures:

s = struct { a; b; c; }
t = struct { d; b; c; }

And you know that there's a valid conversion from 'a' to 'd', then it is very easy to automatically produce the following function:

t s_to_t(s) {
    return { a_to_d(s.0), s.1, s.2 }
}
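
Spelled out as real C, with field types invented for illustration, the generated function might look like this:

struct s { int a; float b; char c; };
struct t { long d; float b; char c; };

/* the assumed conversion from 'a' to 'd' */
static long a_to_d(int a) { return (long)a; }

/* field 0 is converted, the remaining fields are copied as-is */
struct t s_to_t(struct s x)
{
    struct t y = { a_to_d(x.a), x.b, x.c };
    return y;
}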

In fact, the type declaration for 't' could be described by:

t = transform(s,
    a => d)

Inserting or removing pointers and shuffling the positions of fields is easy as well.

You can even concatenate these structures together. Eventually you arrive at an algebra of sorts, where you can use operators to build up structures from simpler ones:

struct { a; b; c; } = a * b * c
union { a; b; c; }  = a + b + c

union {
    struct { a; b; };
    struct { c; d; }; } = a * b  +  c * d

The * is called a product, and the + is a coproduct.

Now consider: what kind of properties might this algebra have?

Because of our indexing requirement, neither of these binary operators is commutative. They are, however, both associative, e.g. (a * b) * c = a * (b * c).

Are they distributive, for example, left distributive? If they are, the following equation would hold:

a * (b + c) = a*b + a*c

Almost. How you index the values ends up changing, and the memory layout also changes if it's not a simple one like in our example.
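
Written out as C types (with a tag added so the unions are usable, and placeholder definitions for a, b, and c), the two sides look like this; they can hold the same set of values, but the indexing and layout differ:

struct a { int x; };
struct b { int y; };
struct c { int z; };

/* a * (b + c): one copy of 'a', then a choice between 'b' and 'c' */
struct left {
    int tag;                              /* selects b or c */
    struct a a;
    union { struct b b; struct c c; } u;
};

/* a*b + a*c: a choice between two structs that each carry their own 'a' */
struct right {
    int tag;                              /* selects ab or ac */
    union {
        struct { struct a a; struct b b; } ab;
        struct { struct a a; struct c c; } ac;
    } u;
};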

How about the other way around?

a + b * c = (a + b) * (a + c)

The left and right sides clearly have different meanings, so no.

Abstract data

When writing C, it often helps portability if you "forget" how types are encoded and how their fields are laid out, because that information depends on the platform and compiler you are using.

In fact, most of the time you forget that computer memory is just a bag of bytes indexed by addresses.

The tools you use encourage this kind of forgetting. You don't see bytes but pictures on your screen. There are no bytes on your keyboard, only letters. Letters and pictures appear distinct to you.

But it's all bytes, indexed by an address.

Let's imagine for a short moment that you didn't have a way to implement struct. How would you then implement something like the following structure?

struct { a; b; }

If you can imagine that, then you can also imagine that you have an object called 'q'. Then you have four functions:

a get_a(q);
b get_b(q);

set_a(q, a);
set_b(q, b);

Now, if you associate the type 'q' with these four functions, then you can pretend you have a way to implement that kind of a structure.

Therefore you can treat 'q' and that bundle of functions as if they were a struct { a; b; }.
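
A minimal sketch of that bundle in C, with typedefs standing in for the field types 'a' and 'b'; the opaque declaration is the curtain, and the definition behind it just happens to be the plain struct:

typedef int a;      /* stand-ins for the actual field types */
typedef double b;

/* the 'curtain': callers see only an opaque type and four functions */
struct q;

a get_a(const struct q *it);
b get_b(const struct q *it);
void set_a(struct q *it, a val);
void set_b(struct q *it, b val);

/* behind the curtain it may well be the plain struct... */
struct q { a a_field; b b_field; };

a    get_a(const struct q *it)      { return it->a_field; }
b    get_b(const struct q *it)      { return it->b_field; }
void set_a(struct q *it, a val)     { it->a_field = val; }
void set_b(struct q *it, b val)     { it->b_field = val; }
/* ...or any other representation with the same observable behavior */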

The interesting thing here is that this new definition of a structure is indistinguishable from the real structure. You couldn't tell them apart in use if you could not see behind the 'curtains'.

In practice the curtains are there: you don't know, and don't need to know, how the 'struct' is implemented.

But now you see that typed values can pretend to be something else, except that when they do, they actually are that something else, even if that something didn't exist at first.

A viewpoint on types

Types are knowledge about the variables and operations in a program that you have before you run it.

Therefore type information covers much more than a value's membership in some set or category.

Traditionally types have been used to ensure that you do not accidentally put an entirely wrong value into an entirely wrong hole. That is a very unimaginative use of type information.

All the ideas about how to use type information in a versatile manner have been around for a long time. Everything described in this article originates from research that is decades old by now.

If you utilize it well, type theory becomes a constructive, rather than a restrictive, theory.

It allows us to generate things.

  1. Functionality generated automatically from type descriptions.
  2. Increased safety, not because certain classes of errors become impossible, but because there is less hand-written code that could go wrong.
  3. Languages that are far more expressive, due to consistent, logical, coherent, automatic conversions between data types.
  4. Robust, generalized implementations of algorithms, constrained by the properties required for the algorithm to produce correct results.

So far I've shown what kind of information is contained in type annotations and given ideas on what can be done if you give yourself access to all of it.

Next I'll show how type annotations can serve as a useful form of documentation for code.

Example: Path search

Dijkstra's algorithm is so simple to implement that you don't usually generalize it.

Unfortunately any algorithm truly worth generalizing would be so complex that it would require more analysis. But I think this is "real world" enough as an example.

Dijkstra's algorithm works with nodes, and edges between nodes. Every edge has a 'distance', given as input to the algorithm.

A naive implementation might use 'int' to represent both distances and nodes. 'int' everywhere doesn't tell much about the algorithm though.

You could also describe what the algorithm takes in by a set of rules that are relevant for it to produce the correct result, as attempted below:

n must be distinguishable (==)
neighbors: n -> [(n, d)]

magma(d, add)
add(x,y) = w, x <= w
d must have total ordering (<, >, <=, >=, ==, !=)

route.set : (n, Route) -> ()
route.get : n -> Route

Route = unreachable | reachable(d, n) | endpoint(d)

shortest_path(graph : [n], neighbors,
        source : n, initial : d,
        add, route):
    route.set(___, unreachable) for all nodes in graph
    route.set(source, endpoint(initial))
    q = priority_queue(n,d)()
    q.add_with_priority(source, initial)
    while q not empty
        v1, d1 = q.extract_min()
        for v2, d in neighbors(v1)
            alt = add(d1, d)
            case route.get(v2)
                unreachable:
                    route.set(v2, reachable(alt, v1))
                    q.add_with_priority(v2, alt)
                reachable(d2, _):
                    if alt < d2:
                        route.set(v2, reachable(alt, v1))
                        q.decrease_priority(v2, alt)
                endpoint(d2):
                    fail if alt < d2

I do not guarantee that the above code is correct, or that it would ever compile on any compiler. But let's go through what it attempts to represent.

  1. n must be distinguishable, meaning that equality/inequality must be defined for nodes. You must be able to tell two nodes apart. This is already implied by the use of .decrease_priority: we couldn't do that if we could not identify the object.
  2. The neighbors function has a specific shape.
  3. magma refers to a basic kind of algebraic structure. The d is a carrier and add is a binary operation, which must be closed by definition. A distance type may form multiple magmas, so this alone would be ambiguous. Therefore add is passed as an argument, and this clause points out that the distances and the operation must form a magma.
  4. The line after that describes that the result of the addition must monotonically increase with respect to its left argument. This is so that we don't generate a distance that would break the algorithm without raising an error. The presence of a property is a bit easier to guarantee than its absence.
  5. Finally it is denoted that the distances must have a total ordering, so that they can be compared in the queue.
  6. The route functions and the associated data structure are given as parameters, so that you can choose where to store the results during the computation.

The structure of the program describes many of these constraints, so nearly all of them are superfluous. But this is the kind of type information that is in a useful format for the reader. It allows you to reason about the behavior of this program in your use case, and about the possibilities of running it while still staying within the parameters where it works.
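
As a rough illustration of how such requirements might read in plain C, here is a sketch of the interface alone; all names are invented, and the function-pointer signatures carry much of the same documentation as the rules above:

#include <stddef.h>

typedef int node;   /* stand-ins: any types with the listed properties do */
typedef int dist;

/* Route = unreachable | reachable(d, n) | endpoint(d) */
enum route_kind { UNREACHABLE, REACHABLE, ENDPOINT };

struct route {
    enum route_kind kind;
    dist d;         /* valid for REACHABLE and ENDPOINT */
    node prev;      /* valid for REACHABLE */
};

struct graph_ops {
    /* n must be distinguishable (==) */
    int (*node_eq)(node x, node y);
    /* neighbors : n -> [(n, d)]; fills up to max pairs, returns count */
    size_t (*neighbors)(node v, node *out_n, dist *out_d, size_t max);
    /* magma(d, add) with add(x, y) = w, x <= w */
    dist (*add)(dist x, dist y);
    /* d must have a total ordering */
    int (*less)(dist x, dist y);
    /* route.set : (n, Route) -> () and route.get : n -> Route */
    void (*route_set)(node v, struct route r);
    struct route (*route_get)(node v);
};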

The types here also explain what is needed. If the requirements are not satisfied, the algorithm doesn't run. When the guarantees are easy to satisfy through correct use, this is a better option than failing somewhere along the way.

In practice you might end up with a very different structure, so this is not meant as a model example of programming. The point is to demonstrate the conditions under which type annotations are useful.
