Regular languages as types (Rady)

Aside from matching, regular expressions can be used to generate strings. [Invertible regex matching] doesn't seem to be a thing though. That's odd, because it would let you treat text as if it were made of well-typed values.

[Brzozowski derivatives] and GADTs make invertible matching easy. Easy enough that I wrote [Haskell] and [Purescript] implementations. I call these libraries Rady - Regular language matching with types.

Each regular language can be mapped to a type that describes how a string may match. For example, /jan|feb/ can be associated with the type (Either () ()). With unambiguous regular expressions this forms an equivalence between the values of the type and the strings matched by the expression.
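
To make the correspondence concrete, here is a minimal GADT sketch of the idea. The Regex type and its constructors below are illustrative only and are not Rady's actual Shape type:

{-# LANGUAGE GADTs #-}

-- Illustrative only: each regular-expression constructor fixes the type
-- of the values it can match.
data Regex c a where
    Empty :: Regex c ()                               -- the empty string
    Sym   :: (c -> Maybe a) -> Regex c a              -- one symbol
    Seq   :: Regex c a -> Regex c b -> Regex c (a, b) -- concatenation
    Alt   :: Regex c a -> Regex c b -> Regex c (Either a b)
    Star  :: Regex c a -> Regex c [a]                 -- repetition

-- Matching one fixed character carries no information beyond ().
char :: Char -> Regex Char ()
char c = Sym (\x -> if x == c then Just () else Nothing)

-- The analogue of /a|b/, whose matches are described by Either () ().
aOrB :: Regex Char (Either () ())
aOrB = char 'a' `Alt` char 'b'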

Although still novel, bolting regular expression matching to the type system of a programming language is a well-known possibility. I read up and found out that [Oleg Grenrus] derived regexes from types, which is the opposite of what I present here: deriving types for regexes. There was also an example of a dependently-typed [regex library] at POPL17 that was very similar to my recent implementation.

wcard

[wcard] is the first program that uses Rady, although it doesn't use it in an extensive manner. Running wcard on a URL rewards you with a little card that contains the title and description of a webpage. Here's a small example of using it from the terminal:

term$ wcard <<< https://example.org
https://example.org
title: Example Domain
term$

The program takes its input on stdin so that it can be used as a filter in a text editor. It parses the website into [html-parse] tokens and then uses the lparse function in Rady to do loose parsing.

The loose parser in the Rady library skips symbols that cannot be matched by the regex. This kind of matching risks falling into the computing equivalent of [garden path sentences]: it fails to parse if the first interpretation turns out to be a dead end.
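
For orientation, the loose parser can be pictured as a function of roughly the following type. The lparse signature here is my assumption about the API rather than something quoted from the library, and looseMatch is a hypothetical helper, not part of wcard:

import Data.Text (Text)
import Text.HTML.Parser (Token, parseTokens)
-- plus the Rady import providing Shape and lparse; I'm assuming a
-- signature along the lines of lparse :: Shape c a -> [c] -> Maybe a

-- Tokenize a page with html-parse, then loose-parse it against a shape,
-- skipping any tokens the shape cannot match at the current position.
looseMatch :: Shape Token a -> Text -> Maybe a
looseMatch shape html = lparse shape (parseTokens html)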

The wcard program contains the following regular language expression to match opening and closing tags. I omit the implementations of open and close here; they just match and generate the corresponding [html-parse] tokens.

inTags tag match = 
    open tag Empty
    `Group`
    (match `Group` close tag)

Here (open tag Empty) matches a named opening tag with no attributes in it, and (close tag) matches the corresponding closing tag. The type of this thing is something like Shape Token a -> Shape Token ((), (a, ())).
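
Written out as a full signature (assuming tag names are Text), that would be roughly:

-- Approximate signature, spelled out from the type sketch above:
inTags :: Text -> Shape Token a -> Shape Token ((), (a, ()))

-- For instance, inTags "title" text expands to
--   open "title" Empty `Group` (text `Group` close "title")
-- and has a type along the lines of Shape Token ((), (Text, ())).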

If wcard tries to match only the title, it runs lparse with the following expression. This 'shape' can locate the title in a document:

onlyTitle :: Shape Token Text
onlyTitle = can (inTags "html"
                $ inTags "head"
                $ inTags "title" text)

To make it convenient to remove the blank elements from matches, there's the 'can' function, which discards every unit type from an expression, producing a canonical version of the type.
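
To see what can is doing here, consider the same shape written without it. The intermediate type below is worked out by hand from the inTags type above, so treat it as an approximation, and findTitle is a hypothetical helper using the lparse signature assumed earlier:

-- Worked out by hand; the exact type Rady infers may be spelled differently.
rawTitle :: Shape Token ((), (((), (((), (Text, ())), ())), ()))
rawTitle = inTags "html" $ inTags "head" $ inTags "title" text

-- can collapses the unit types, leaving only the payload we care about.
findTitle :: [Token] -> Maybe Text
findTitle = lparse (can rawTitle)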

Although wcard doesn't use the generator feature, we can try it out. Here's what generate onlyTitle "foo" produces:

[TagOpen "html" [],
 TagOpen "head" [],
 TagOpen "title" [],
 ContentText "foo",
 TagClose "title",
 TagClose "head",
 TagClose "html"]

Before just fetching the title, wcard tries to find OpenGraph tags. I don't know if matching these is a good idea, but they're getting common as sites cater to antisocial media. Here's an expression that matches the og:title and og:description meta tags.

ogmeta :: Shape Token (Text, Text)
ogmeta = can
       $ inTags "html"
       $ inTags "head"
         (ogtag "og:title"
          `Interleave`
          ogtag "og:description")

ogtag :: Text -> Shape Token Text
ogtag prop = tagged "meta"
           $ attrEq "property" prop `Interleave` attr "content"

The actual wcard program is a bit messier; I rewrote these expressions here. However, you can write these kinds of patterns with the Rady library and parse and generate with them.
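
As a sketch of what parsing and generating with such a pattern could look like, here is a hypothetical round trip over ogmeta. The signatures I assume are lparse :: Shape c a -> [c] -> Maybe a and generate :: Shape c a -> a -> [c], and renderCard is a made-up helper:

-- Hypothetical: extract the OpenGraph title and description from a page's
-- tokens, then regenerate a minimal token stream carrying the same data.
renderCard :: [Token] -> Maybe [Token]
renderCard tokens = do
    (title, description) <- lparse ogmeta tokens
    pure (generate ogmeta (title, description))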

Background

This thing is a result of working on my blog engine. I've been writing a new one in Haskell.

In Python I used to use a library to parse an HTML template file such as the one below, and then fill it up step by step.

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title> </title>
</head>
<body>
    <article> </article>
</body>
</html>

I prefer to use documents like these, with placeholders. The program used a library to locate the title and article tags in the parsed tagsoup and then fill them with content. Some people use a templating language, but I don't because I find them messy.

I scanned through what Haskell had to offer and concluded it's the same situation again, except that filling a shaped document doesn't seem as nice because it means applying one modification after another.

I started thinking about how I'd get typed templates and how to make it easy to verify that the site generator is working. I also realised it'd be nice if the document were valid HTML when it comes out.

What's valid HTML? Everything's valid HTML, it seems. The minimal valid document would seem to be <!DOCTYPE html><title> </title>. You may even use malformed tags, because there's a spec for how browsers should handle those so that content renders the same way across implementations. Therefore the problem is solved.

But what if we wanted a HTML document that is guaranteed to hold some useful structure?

The leading-edge [validator] scans through the document, checks that it matches a RelaxNG schema and does other form checking. I read up on RelaxNG schemas and found James Clark's RELAX NG validation algorithm. I spotted that he was using [Brzozowski derivatives] to validate.
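
For readers unfamiliar with the technique: the Brzozowski derivative of a regular expression with respect to a symbol is another regular expression describing what remains to be matched, and matching works by deriving symbol by symbol and checking whether the final expression accepts the empty string. The sketch below is the textbook construction over plain character regexes, not the validator's algorithm or Rady's implementation:

-- Plain regular expressions over characters, for illustration only.
data Re = Fail | Eps | Chr Char | Seq Re Re | Alt Re Re | Star Re

-- Does the expression accept the empty string?
nullable :: Re -> Bool
nullable Fail      = False
nullable Eps       = True
nullable (Chr _)   = False
nullable (Seq a b) = nullable a && nullable b
nullable (Alt a b) = nullable a || nullable b
nullable (Star _)  = True

-- Brzozowski derivative: what remains to be matched after reading c.
deriv :: Char -> Re -> Re
deriv _ Fail      = Fail
deriv _ Eps       = Fail
deriv c (Chr d)   = if c == d then Eps else Fail
deriv c (Alt a b) = Alt (deriv c a) (deriv c b)
deriv c (Seq a b)
    | nullable a  = Alt (Seq (deriv c a) b) (deriv c b)
    | otherwise   = Seq (deriv c a) b
deriv c (Star a)  = Seq (deriv c a) (Star a)

-- A string matches when the fully derived expression accepts the empty string.
matches :: Re -> String -> Bool
matches re s = nullable (foldl (flip deriv) re s)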

A related question

Types have nice properties. They represent values that can be taken apart and then reconstructed.

I am curious about what a proper, mathematical format for typeset text would be. Finding one would simplify working with text.

Missing features

Rady is still missing some things I wish it had.

Ambiguity detection is not present; it would require a more capable dependently typed language that can do such a thing. Such validation would give a formal guarantee that translating between a value and its presentation does not affect the result.

Schema printing would let the user of a program see what the program is expecting.

Language edit distance would make it possible to attempt to correct an input so that it matches the regex.
