Semi-Automatically Cleaning Up Scanned Old Zines

I recently ran across a number of old zines and other publications that had been preserved online. It’s really interesting looking at these ‘zines and seeing the kinds of information preserved and passed along in them, and as usual, I’m grateful for the existence of the Internet Archive in preserving these things at all.

However, these ‘zines are often scanned "as is"; that is, they’re unstapled and put on a flatbed scanner, so each scanned page holds two pages of the original side by side. Sometimes they’re also oriented longways, or the same PDF mixes several orientations.

This makes reading them… difficult.

So, of course, I had to find a solution.

The ‘zines I tested this on included:

Reclaiming Our Ancient Wisdom
The Black Peoples’ Prison Survival Guide
Fertility Awareness For Non-Invasive Birth Control
Feral Forager
Herbal Abortion: a woman’s d.i.y. guide
The D.I.Y. Guide II

The first thing you want to do is grab PDFtk, PDFArranger, Ghostscript, ImageMagick, and OCRmyPDF. After that, grab the PDF of the ‘zine you want to clean up.
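
Before starting, it can help to confirm that everything is actually installed. The following is a minimal Python sketch that reports which command-line tools are missing from your PATH. The executable names in it are assumptions based on typical Linux packages (for example, ImageMagick 7 installs `magick` rather than `convert`), so adjust the list for your system.

```python
import shutil

# Assumed executable names; they vary by platform and package version.
TOOLS = ["pdftk", "pdfarranger", "gs", "convert", "ocrmypdf"]


def missing_tools(names=TOOLS, which=shutil.which):
    """Return the names that cannot be found on the PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All tools found.")
```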

Use PDFArranger to get all the pages rotated so that they’re longways and all in the same orientation. You’ll also notice that even though it says "page 3" and "page 4" at the bottom, it’s actually page 80 first, and then page 79. You can try to reorder the pages at this point; I found it easier to do so later.

At this point, I was afraid I’d have to manually cut each page out, but I found an answer on StackExchange that got me most of the way there. The script in that answer largely works, but if the "rotation" is set on some of the PDF pages, it’ll fail miserably.
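
To illustrate why the rotation setting matters, here is a small Python sketch (not the script from the answer or the gist). A PDF page can store its size in one orientation and carry a separate rotation flag that changes how it is displayed, so a split that only looks at the stored size can cut along the wrong axis. The function names and the `(x0, y0, x1, y1)` box layout are hypothetical choices made for this example.

```python
def visible_size(width, height, rotate=0):
    """Return (width, height) as displayed, given the page rotation in degrees."""
    rotate %= 360
    if rotate not in (0, 90, 180, 270):
        raise ValueError("PDF page rotation must be a multiple of 90")
    if rotate in (90, 270):
        # A quarter turn swaps the displayed width and height.
        return height, width
    return width, height


def half_boxes(width, height, rotate=0):
    """Return (left, right) boxes as (x0, y0, x1, y1) in displayed coordinates."""
    w, h = visible_size(width, height, rotate)
    mid = w / 2
    return (0, 0, mid, h), (mid, 0, w, h)
```

With this, a page stored as 612 x 792 points with a rotation of 90 is split the same way as a page stored as 792 x 612 with no rotation.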

I modified the script slightly into this gist to handle such issues. The doitall.sh script takes the PDF filename as the first argument, uses pdftk to "explode" the PDF, ghostscript to slice and dice, and then pdftk to put it back together again into newfile.pdf.
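
The explode, slice, and reassemble sequence can be sketched in Python as a list of commands. This is not the contents of doitall.sh; it is an illustration that assumes every page has the same size in whole points, has no rotation set, and that the Ghostscript options (`-g` plus a `PageOffset` shift) follow the commonly cited recipe for cutting a page in half. The helper names, the `pg_%04d.pdf` naming, and the `work` directory are hypothetical, and the offsets may need adjusting for your Ghostscript version.

```python
def burst_command(pdf, workdir):
    # pdftk's "burst" operation writes one PDF per page.
    return ["pdftk", pdf, "burst", "output", f"{workdir}/pg_%04d.pdf"]


def half_command(page_pdf, out_pdf, half_width_pt, height_pt, x_offset_pt):
    # -g is given in device pixels; pdfwrite defaults to 720 dpi,
    # so a size in points (1/72 inch) is multiplied by 10.
    return [
        "gs", "-o", out_pdf, "-sDEVICE=pdfwrite",
        f"-g{half_width_pt * 10}x{height_pt * 10}",
        "-c", f"<</PageOffset [{x_offset_pt} 0]>> setpagedevice",
        "-f", page_pdf,
    ]


def join_command(parts, out_pdf="newfile.pdf"):
    # pdftk's "cat" operation concatenates the inputs in the order given.
    return ["pdftk", *parts, "cat", "output", out_pdf]


def plan(pdf, workdir, page_count, width_pt, height_pt):
    """Return every command needed to split each page into two halves."""
    half = width_pt // 2
    commands = [burst_command(pdf, workdir)]
    parts = []
    for n in range(1, page_count + 1):
        page = f"{workdir}/pg_{n:04d}.pdf"
        # The left half needs no shift; the right half is shifted left by one half.
        for side, offset in (("left", 0), ("right", -half)):
            out = f"{workdir}/pg_{n:04d}_{side}.pdf"
            commands.append(half_command(page, out, half, height_pt, offset))
            parts.append(out)
    commands.append(join_command(parts))
    return commands
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` in order; printing them first is a cheap way to check the plan before touching any files.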

This does result in a file size twice that of the original, but we’ll deal with that in a minute.

At this point, you’ll want to make sure your pages are in the correct order using PDFArranger. This isn’t a fault of the script so much as the way that the original documents were scanned. Luckily, these ‘zines tend to have large page numbers (as you can see in the screenshot above). Save the rearranged file.

This is where OCRmyPDF comes in. Running it against newfile.pdf not only adds an OCR text layer, but also gets rid of the extra bits that make the file size twice as large. You’ll end up with a single-page format, OCR’d ‘zine only a little bit larger than the original (due to the text layer).
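
As a rough sketch of this step, the Python below builds the `ocrmypdf` command line and compares file sizes afterwards. The output name `zine_ocr.pdf` and the helper names are hypothetical; `--deskew` is an optional OCRmyPDF flag that was not necessarily used for these ‘zines.

```python
import os


def ocr_command(src, dst, deskew=False):
    """Build the argument list for an ocrmypdf run: ocrmypdf [options] input output."""
    cmd = ["ocrmypdf"]
    if deskew:
        # Optional: straighten pages that were scanned slightly crooked.
        cmd.append("--deskew")
    return cmd + [src, dst]


def size_ratio(original, result):
    """Return how many times larger the result file is than the original."""
    return os.path.getsize(result) / os.path.getsize(original)


if __name__ == "__main__":
    # Print the command rather than running it, so the sketch has no side effects.
    print(" ".join(ocr_command("newfile.pdf", "zine_ocr.pdf")))
```

Running the printed command (or passing the list to `subprocess.run`) and then calling `size_ratio` on the original scan and the output gives a quick check that the result really is only slightly larger.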

The results end up looking like this (the image "border" is my computer wallpaper):

You can examine the results yourself with the output PDFs at https://drive.google.com/drive/folders/1pQZussLYmY_O3FvFXC-qf5R5TsLhaLnt?usp=sharing and compare them to the originals.

While this isn’t fully automated, it cut the time to process and clean up each ‘zine to about 15-20 minutes rather than hours. So if you have old ‘zines that you made (or collected), I hope this helps make it a little easier to preserve and share those bits of almost-forgotten history.

DISCLAIMER: It should be noted that I have not authored anything in these ‘zines, have not verified any of the information in them, and they may suggest activities that are illicit, illegal, or harmful to yourself or others. (For example, foraging and scavenging for food.) These are presented only for enhancing educational and historical appreciation and preservation.