I got automated summaries

It's a must-have in everybody's toolbelt

This weekend I embarked on a quest to find myself a tool that could construct automated summaries from websites.

After checking out a few tools I ended up studying the huggingface transformers library, which provides pretrained models for various tasks such as text classification, translation and summarization.

Huggingface transformers

The transformers library claims to be a state-of-the-art NLP toolkit. But it runs on Python3. Great, that's just the curveball I needed when I thought I was going to greatly reduce the use of this language in my projects.

The installation was as convenient as pip3 install transformers, though I found out that tensorflow has some problem that makes it crash with "illegal instruction" on my machine.

I found the documentation fairly extensive but lousy at explaining how you're supposed to use this library. To be fair, they've tried to make it really simple to plug into your project:

import transformers
summarize = transformers.pipeline("summarization")
article = """Put your text to summarize here,
  for funsies give it a short insult to summarize."""
print(summarize(article)[0]["summary_text"])

It downloads a huge pretrained model for the task. Since the text I gave above is too short, the stuff the model learned seems to leak into the answer when it expands the short text.

Put your text to summarize here, for funsies give it a short insult to summarize . For fun, give it an insult or a short summary to summarize. Put it into a picture of your favorite things you like to see in the gallery . Use it to help people understand each other's faults in the world .

This seems like a great, simple thing, but let's say you give it too long an article. Instead of a summary you get IndexError: index out of range in self. When you troubleshoot it, you'll find out it's a known problem. These NLP models work on fixed-size inputs, and it's an open research question how to cope when the article you want summarized doesn't fit.
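One blunt way around the size limit, just a sketch of the idea rather than anything the library hands you, is to chop the article into chunks that roughly fit the model and summarize each chunk separately:

def summarize_long(article, summarize, chunk_words=400):
    # Naive chunking: split on whitespace, feed chunk_words words at a time
    # to the pipeline from above and glue the partial summaries together.
    words = article.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    return " ".join(summarize(chunk)[0]["summary_text"] for chunk in chunks)

The chunk boundaries land in arbitrary places, so don't expect the combined result to read smoothly.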

Eventually I found a solution on a website dedicated to Python tutorials and recipes. After all that mess I finally got it working and could get to applying this stuff.

Automatic summarization of my own website

I produced the following script to perform automatic summarization on my website:

#!/usr/bin/env python3
from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config
import torch
import os, sys
import subprocess
import time

blog_directory = "/location/to/your/blog"
blog_entries = blog_directory + "/www/entries"

def all_entries(entries):
    # Yield only the directories whose relative path contains exactly
    # three slashes, i.e. the directories that hold individual posts.
    for path, dirs, files in os.walk(entries):
        name = os.path.relpath(path, entries)
        if name.count('/') == 3:
            yield path

def stale(dst, src):
    # The summary needs regenerating if it doesn't exist yet
    # or is older than its source file.
    if os.path.exists(dst):
        return os.path.getmtime(dst) < os.path.getmtime(src)
    else:
        return True

model = T5ForConditionalGeneration.from_pretrained("t5-small").cuda()
tokenizer = T5Tokenizer.from_pretrained("t5-small")
device = torch.device('cuda')

for path in list(all_entries(blog_entries)):
    autosummary = os.path.join(path, "summary.auto")
    md_index = os.path.join(path, "index.md")
    if not stale(autosummary, md_index):
        continue
    with subprocess.Popen(["python2",
                           os.path.join(blog_directory, "tools/paragraphs.py"),
                           md_index],
                          stdout=subprocess.PIPE) as pop:
        article = pop.stdout.read().decode('utf-8')
    inputs = tokenizer.encode("summarize: " + article,
                return_tensors="pt", max_length=512, truncation=True).to(device)
    outputs = model.generate(
        inputs, 
        max_length=200, 
        min_length=50, 
        length_penalty=2.0, 
        num_beams=4) 
        #early_stopping=True)
    summary = tokenizer.decode(outputs[0])
    print("-"*len(path))
    print(path)
    print("-"*len(path))
    print(summary)
    with open(autosummary, "w") as fd:
        fd.write(summary + "\n")

This script differs from the one on the website only slightly:

  1. It enables cuda processing (and required me to update my graphics card drivers). Look for the .cuda() and .to(device) calls in the code, and see the sketch after this list.
  2. It picks a small model, t5-small, because anything bigger wouldn't fit on my GPU.
  3. "early stopping" is disabled because the longer summaries felt better.

There's one hilarious thing to highlight here: I wrote the paragraph-extraction helper as a python2 script, so I wouldn't need to learn more new stuff.

from markdown2 import markdown
from bs4 import BeautifulSoup, Tag
import sys

def markdown_soup(path):
    # Render the markdown file to HTML and parse it with BeautifulSoup.
    with open(path) as fd:
        soup = BeautifulSoup(markdown(fd.read()), "lxml")
        return soup.body

# Print every paragraph of the post as one line of plain text.
article = markdown_soup(sys.argv[1])
paragraphs = [p.get_text().encode('utf-8').replace('\n', ' ')
              for p in article.find_all("p")]
print("\n".join(paragraphs))

Thanks to this I can build a page on my blog that shows a short summary of each post, even for posts that never had one. On top of that I get automatic summaries for new posts as I write them, and I can decide whether to replace them with hand-written ones.
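I haven't shown that page-builder here, but as a sketch, assuming the same directory layout as the script above (the summaries.md output name is made up), it's not much more than concatenating the summary.auto files into one markdown page:

#!/usr/bin/env python3
# Sketch only: collect the generated summary.auto files into one page.
import os

blog_entries = "/location/to/your/blog/www/entries"

def all_entries(entries):
    for path, dirs, files in os.walk(entries):
        if os.path.relpath(path, entries).count('/') == 3:
            yield path

with open("summaries.md", "w") as page:
    for path in sorted(all_entries(blog_entries)):
        autosummary = os.path.join(path, "summary.auto")
        if not os.path.exists(autosummary):
            continue
        with open(autosummary) as fd:
            page.write("## " + os.path.relpath(path, blog_entries) + "\n\n")
            page.write(fd.read().strip() + "\n\n")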

General summarization server

Running the summarizer on the GPU is much faster, but it still takes considerable time for the model to initialize itself.

Therefore it's convenient to hook it up to a server! Here's a script that puts the summarizer behind a unix socket.

#!/usr/bin/env python3
from transformers import T5Tokenizer, T5ForConditionalGeneration, T5Config
import torch
import socket, os

sock_path = "/tmp/summarizer.sock"

def recv_all(the_socket):
    # Read from the connection until the client closes its write side.
    block = bytearray(8192)
    total_data = []
    while True:
        count = the_socket.recv_into(block, 0)
        if count == 0: break
        total_data.append(bytes(block[:count]))
    return b''.join(total_data).decode('utf-8')

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.bind(sock_path)
try:
    sock.listen(50)

    model = T5ForConditionalGeneration.from_pretrained("t5-small").cuda()
    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    device = torch.device('cuda')

    while True:
        connection, client_address = sock.accept()
        article = recv_all(connection)
        inputs = tokenizer.encode("summarize: " + article,
            return_tensors="pt", max_length=512, truncation=True).to(device)
        outputs = model.generate(
            inputs, 
            max_length=200, 
            min_length=50, 
            length_penalty=2.0, 
            num_beams=4) 
            #early_stopping=True)
        summary = tokenizer.decode(outputs[0])
        connection.sendall(summary.encode('utf-8'))
        connection.close()
except KeyboardInterrupt:
    sock.close()
finally:
    os.unlink(sock_path)

You can connect to that socket with nc -NU /tmp/summarizer.sock, type in the text, press CTRL+D, and then receive the result. This is useful to call from Vim, for instance.
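If nc isn't available, a short Python client does the same dance. This is just a sketch; the shutdown() call plays the role of nc's -N flag by closing the write side so the server's recv_all() sees end-of-input:

#!/usr/bin/env python3
# Sketch of a client: send stdin to the summarizer socket, read back the summary.
import socket, sys

sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.connect("/tmp/summarizer.sock")
sock.sendall(sys.stdin.read().encode('utf-8'))
sock.shutdown(socket.SHUT_WR)   # signal end-of-input to the server

chunks = []
while True:
    block = sock.recv(8192)
    if not block:
        break
    chunks.append(block)
sock.close()
print(b''.join(chunks).decode('utf-8'))

From Vim you can filter the whole buffer straight through the socket with :%!nc -NU /tmp/summarizer.sock.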

Conclusion

One more thing: if any of the scripts give you FutureWarnings, you can silence them with this:

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

Overall I really like that I have some NLP goodness at hand now. To get an idea of what I got access to, here's the summary I get for this post, for instance.

The huggingface transformers -library is a state-of-the-art NLP toolkit. it provides pretrained models to perform various tasks such as text classification, translation and summarization. it runs on Python3, great, that's just the curveball that I needed when I thought I was going to greatly reduce the use of this language in my projects. I found the documentation fairly extensive but lousy when it comes to trying to explain how you're supposed to use this library.

It looks very good to me, although I find these summaries a bit biased: they pick up on the negative remarks in every post I write. I suppose that happens because the people who wrote the summaries used to train the model also picked up on the negatives?
