I bought a used Brother document scanner on Kleinanzeigen to manage my overflowing desk. Of course, I could have sorted it all manually, but why take the direct route when you can procrastinate with technology instead ┬┴┬┴┤･ω･)ﾉ.

The document scanner is quite simple and can only scan to a computer locally over USB. Luckily, I can also reach it from my Linux machine: scanimage talks to the scanner through the SANE interface.

(1) Scan the documents

After a while I found the proper parameters for my model:

$ scanimage \
  # select device
  --device-name="$device" \

  # select source on the device (there are devices with several input sources)
  --source 'Automatic Document Feeder(centrally aligned,Duplex)' \

  # limit the scan area to where the actual text is (-y: height, 297 mm = A4; -l: left offset)
  -y 297 -l 7 \

  # run automatically, don't ask for input
  --batch="$doc_scan" \

  --resolution=300 \
  --format=pdf

With a single document, this produces a two page PDF file in DIN A4.
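
The listing above is annotated for reading. Pasted into a shell as-is, the comment lines between the continuations would end the command early. Here is a minimal sketch of a paste-able form, assuming a POSIX shell. The function name scan_duplex is hypothetical; the options are exactly the ones from the listing.

```shell
#!/bin/sh
# scan_duplex DEVICE OUTPUT  (hypothetical helper name)
# Builds the argument list step by step so every option can keep its comment.
scan_duplex() {
  device=$1
  doc_scan=$2

  # select device
  set -- --device-name="$device"
  # select source on the device
  set -- "$@" --source 'Automatic Document Feeder(centrally aligned,Duplex)'
  # limit the scan area to the part of the paper that carries text
  set -- "$@" -y 297 -l 7
  # run automatically, write to the given output
  set -- "$@" --batch="$doc_scan"
  set -- "$@" --resolution=300 --format=pdf

  scanimage "$@"
}

# Usage, with the variables from the text:
#   scan_duplex "$device" "$doc_scan"
```

Using `set --` to grow the argument list is just one way to keep the comments next to the options; a plain one-line command works equally well.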

I don’t want to decide each time whether to scan one side or both, and my scanner only does what it’s told. So I always scan both sides (hence the duplex source) and spend CPU time on removing the empty pages afterwards.

My workflow uses the following steps:

Create the original PDF file
Create a copy with Ghostscript to fix the xref table
Analyze the copy to find empty pages
If some are found, they will be extracted for manual verification
A “clean” PDF is created for use with paperless-ngx
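
For the last step, the list of pages to keep is simply every page that was not flagged as empty. The text does not say which tool builds the clean PDF, so the following is only a sketch of the bookkeeping. The helper name pages_to_keep is hypothetical, and handing the result to something like Ghostscript's -sPageList option is an assumption, not the workflow described here.

```shell
#!/bin/sh
# pages_to_keep TOTAL [EMPTY_PAGE...]  (hypothetical helper)
# Prints a comma-separated list of all pages from 1..TOTAL
# that are not among the given empty page numbers.
pages_to_keep() {
  total=$1
  shift
  # "$*" joins the remaining arguments with spaces so awk can split them.
  awk -v total="$total" -v empty="$*" 'BEGIN {
    n = split(empty, e, " ")
    for (i = 1; i <= n; i++) skip[e[i]] = 1
    sep = ""
    for (p = 1; p <= total; p++)
      if (!(p in skip)) { printf "%s%d", sep, p; sep = "," }
    printf "\n"
  }'
}

# A 10 page scan where pages 4, 6 and 8 were flagged as empty:
pages_to_keep 10 4 6 8   # prints 1,2,3,5,7,9,10
```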

Later I want to remove the manual verification step. But I need to use it for a few weeks to verify my threshold first.

(2) Fix the PDF

I’m using Ghostscript to fix the PDF file. For some reason, the PDF that scanimage writes is always “broken”. Errors look like this:

The following errors were encountered at least once while processing this file:
    xref table was repaired
    Incorrect /Length for stream object

It could be the -y 297 -l 7 scan area I’m using, or it could be scanimage itself; I couldn’t bring myself to investigate.

$ gs \
  # result is saved as $doc_fixed
  -o "$doc_fixed" \

  # generate new PDF file
  -sDEVICE=pdfwrite \
  
  # PDF quality preset (there is potential here to reduce the file size)
  -dPDFSETTINGS=/prepress \
  
  # Don't pause between pages
  -dNOPAUSE \
  
  # Quit gs automatically at the end
  -dBATCH \

  # Suppress most messages, except for actual errors
  -dQUIET \

  # input file that is used
  "$doc_scan"

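As with the scan command, the annotated listing is meant for reading and will not run with the comments in place. A minimal sketch of the same call as a small function, assuming a POSIX shell; the name fix_pdf is hypothetical, and the flags are the ones from the listing.

```shell
#!/bin/sh
# fix_pdf INPUT OUTPUT  (hypothetical helper name)
# Lets Ghostscript re-write the scanned PDF, which rebuilds the xref table.
# -o names the output file, pdfwrite generates a new PDF, /prepress is the
# quality preset from the text, and -dQUIET hides everything but real errors.
fix_pdf() {
  doc_scan=$1
  doc_fixed=$2
  gs \
    -o "$doc_fixed" \
    -sDEVICE=pdfwrite \
    -dPDFSETTINGS=/prepress \
    -dNOPAUSE \
    -dBATCH \
    -dQUIET \
    "$doc_scan"
}

# Usage, with the variables from the text:
#   fix_pdf "$doc_scan" "$doc_fixed"
```
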
(3) Look for empty pages

In the past I removed empty pages with pdfarranger, since I also used the tool to split all documents from one scan session into separate files. Today I stumbled upon nklb/remove-blank-pages. Its approach of simply asking Ghostscript how much ink is used on a page made me feel w(°ｏ°)w.

$doc_fixed is a 10 page PDF, and pages 4, 6 and 8 are almost empty. When I ask Ghostscript how much ink is used on each page, I get this result (I added the arrows for visibility):

$ gs -q -o - -sDEVICE=inkcov "$doc_fixed"
    0.10491 0.10328 0.09950 0.04416 CMYK OK
    0.10148 0.10088 0.09943 0.05592 CMYK OK
    0.12084 0.11656 0.11248 0.05224 CMYK OK
->  0.01110 0.01129 0.01081 0.00863 CMYK OK
    0.13958 0.13398 0.12899 0.05891 CMYK OK
->  0.01345 0.01371 0.01331 0.01084 CMYK OK
    0.10316 0.09932 0.09451 0.02920 CMYK OK
->  0.00441 0.00413 0.00468 0.00076 CMYK OK
    0.42938 0.36430 0.36582 0.04960 CMYK OK
    0.36276 0.35747 0.36175 0.07565 CMYK OK

However, I couldn’t get nklb’s grep pattern to work and decided to go with awk instead. awk can also do the counting and the arithmetic, which made it easier to set a threshold for what I consider an “empty page”.

$ awk
# injecting the $threshold variable into awk's context
-v thr="$threshold"

# matching lines that start with optional whitespace followed by a digit
# you can play around with the tokens at https://regex101.com/r/KtqgpB/1
/^[[:space:]]*[0-9]/ 

# awk magic: 
# p is a counter to count rows, where each row is a page in the pdf
# s is the sum of the four CMYK values per row
# then it checks if the sum is below the threshold. 
# If so, it will print p, the page number
{ p++; s=$1+$2+$3+$4; if (s<thr) print p }
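
Put together, the annotated pieces above form a single awk call. Here is a runnable sketch that feeds it the inkcov output shown earlier (arrows removed), so it works without a scanner or Ghostscript. The function names find_empty_pages and sample_inkcov are hypothetical.

```shell
#!/bin/sh
# find_empty_pages THRESHOLD  (hypothetical helper)
# Reads inkcov output on stdin and prints the number of every page
# whose four CMYK values sum to less than THRESHOLD.
find_empty_pages() {
  awk -v thr="$1" '
    /^[[:space:]]*[0-9]/ { p++; s = $1 + $2 + $3 + $4; if (s < thr) print p }
  '
}

# The ten rows from the example above, standing in for real gs output.
sample_inkcov() {
  cat <<'EOF'
 0.10491 0.10328 0.09950 0.04416 CMYK OK
 0.10148 0.10088 0.09943 0.05592 CMYK OK
 0.12084 0.11656 0.11248 0.05224 CMYK OK
 0.01110 0.01129 0.01081 0.00863 CMYK OK
 0.13958 0.13398 0.12899 0.05891 CMYK OK
 0.01345 0.01371 0.01331 0.01084 CMYK OK
 0.10316 0.09932 0.09451 0.02920 CMYK OK
 0.00441 0.00413 0.00468 0.00076 CMYK OK
 0.42938 0.36430 0.36582 0.04960 CMYK OK
 0.36276 0.35747 0.36175 0.07565 CMYK OK
EOF
}

threshold=0.06
sample_inkcov | find_empty_pages "$threshold"   # prints 4, 6 and 8, one per line

# With a real file, the input would come from Ghostscript instead:
#   gs -q -o - -sDEVICE=inkcov "$doc_fixed" | find_empty_pages "$threshold"
```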

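Currently my threshold is set to 0.06, so a page is classified as empty when its four CMYK coverage values add up to less than 0.06 (about 6% ink coverage, summed over all four channels). What is left on such pages is usually just a border from scanning or punch holes that are recognized as “ink”.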
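
To judge whether 0.06 is a sensible cut-off, it helps to see the sums themselves and not only the flagged page numbers. This sketch is a variation on the awk call above, not part of the original workflow. It prints page number, CMYK sum and verdict for the sample output from this post. The names inkcov_sums and sample_rows are hypothetical.

```shell
#!/bin/sh
# inkcov_sums THRESHOLD  (hypothetical helper)
# Prints "page sum verdict" for every inkcov row read from stdin.
inkcov_sums() {
  awk -v thr="$1" '
    /^[[:space:]]*[0-9]/ {
      p++
      s = $1 + $2 + $3 + $4
      printf "%d %.5f %s\n", p, s, (s < thr ? "empty" : "keep")
    }
  '
}

# The ten rows from the inkcov example in this post.
sample_rows() {
  cat <<'EOF'
 0.10491 0.10328 0.09950 0.04416 CMYK OK
 0.10148 0.10088 0.09943 0.05592 CMYK OK
 0.12084 0.11656 0.11248 0.05224 CMYK OK
 0.01110 0.01129 0.01081 0.00863 CMYK OK
 0.13958 0.13398 0.12899 0.05891 CMYK OK
 0.01345 0.01371 0.01331 0.01084 CMYK OK
 0.10316 0.09932 0.09451 0.02920 CMYK OK
 0.00441 0.00413 0.00468 0.00076 CMYK OK
 0.42938 0.36430 0.36582 0.04960 CMYK OK
 0.36276 0.35747 0.36175 0.07565 CMYK OK
EOF
}

sample_rows | inkcov_sums 0.06
```

For this sample, the three near-empty pages come out at 0.04183, 0.05131 and 0.01398, while the lightest page with real content (page 7) sums to 0.32619. Page 6 sits fairly close to the cut-off, which is the kind of case the manual verification step is there to catch.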