I bought a used Brother xxx on Kleinanzeigen to manage my overflowing desk. Of course, I could have sorted it all manually, but why go the straight way if you can procrastinate with technology instead ┬┴┬┴┤・ω・)ノ.
The document scanner is quite simple and can only scan to a computer locally over USB. Luckily I can use scanimage on my Linux machine to access the scanner over the SANE interface as well.
(1) Scan the documents
After a while I found the proper parameters for my model:
$ scanimage --device-name="$device" \
--source 'Automatic Document Feeder(centrally aligned,Duplex)' \
--resolution=300 \
--batch="$doc_scan" \
-y 297 -l 7 \
--format=pdf
With a single sheet, this produces a two-page PDF file in DIN A4.
I don’t want to decide for every sheet whether to scan one or both sides, and my scanner only does what it’s told. So I always scan both sides (hence duplex) and spend CPU resources on getting rid of the empty pages afterwards.
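The device name and the exact --source string differ between scanner models. Here is a minimal sketch of how they might be looked up, assuming scanimage from the SANE utilities is installed. The function names list_devices and device_options are made up for this example.

```shell
# List all SANE devices, one device name per line.
# -f formats the device list: %d is the device name, %n a newline.
list_devices() {
  scanimage -f '%d%n'
}

# Print the options a given device supports (sources, resolutions,
# geometry such as -l and -y). The output is backend-specific.
device_options() {
  scanimage --device-name="$1" --help
}

# Example: take the first device that was found and inspect it.
# device=$(list_devices | head -n 1)
# device_options "$device"
```

The value for --source has to be copied exactly as the backend prints it, including spaces and parentheses.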
My workflow uses the following steps:
- Create the original PDF file
- Create a copy with Ghostscript to fix the xref table
- Analyze the copy to find empty pages
- If some are found, they will be extracted for manual verification
- A “clean” PDF is created for use with paperless-ngx
Later I want to remove the manual verification step. But I need to use it for a few weeks to verify my threshold first.
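The last two steps of the list (extract the empty pages, create the clean PDF) could be sketched like this. It is not the actual script: the helper names keep_list and split_pdf, their argument order, and the choice of Ghostscript's -sPageList option for splitting are assumptions made for illustration.

```shell
# keep_list TOTAL [EMPTY...]: print the pages 1..TOTAL that are not listed
# as empty, comma-separated (the format -sPageList expects).
keep_list() {
  total=$1; shift
  empty=" $* "
  keep=""
  p=1
  while [ "$p" -le "$total" ]; do
    case "$empty" in
      *" $p "*) ;;                      # listed as empty: leave it out
      *) keep="${keep:+$keep,}$p" ;;    # otherwise append to the list
    esac
    p=$((p + 1))
  done
  printf '%s\n' "$keep"
}

# split_pdf SRC EMPTY_OUT CLEAN_OUT TOTAL [EMPTY...]
# Writes the empty pages to EMPTY_OUT (for manual verification) and all
# other pages to CLEAN_OUT. Hypothetical helper, not tested on real scans.
split_pdf() {
  src=$1; empty_out=$2; clean_out=$3; total=$4
  shift 4
  if [ "$#" -eq 0 ]; then
    cp "$src" "$clean_out"              # no empty pages: keep everything
    return 0
  fi
  empty_csv=$(printf '%s,' "$@")
  empty_csv=${empty_csv%,}              # drop the trailing comma
  gs -q -o "$empty_out" -sDEVICE=pdfwrite -sPageList="$empty_csv" "$src"
  gs -q -o "$clean_out" -sDEVICE=pdfwrite \
     -sPageList="$(keep_list "$total" "$@")" "$src"
}
```

For a ten-page scan with empty pages 4, 6 and 8, keep_list 10 4 6 8 prints 1,2,3,5,7,9,10. The sketch does not handle the case where every page is empty.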
(2) Fix the PDF
I’m using Ghostscript to fix the PDF file. For some reason, the PDF that scanimage produces is always “broken”. Errors look like this:
The following errors were encountered at least once while processing this file:
xref table was repaired
Incorrect /Length for stream object
It could be the -y 297 -l 7 offset I’m using or it could be scanimage itself. I couldn’t bring myself to investigate further.
$ gs \
-o "$doc_fixed" \
-sDEVICE=pdfwrite \
-dPDFSETTINGS=/prepress \
-dNOPAUSE \
-dBATCH \
-dQUIET \
"$doc_scan"
# result is saved as $doc_fixed
-o "$doc_fixed"
# generate a new PDF file
-sDEVICE=pdfwrite
# PDF quality preset (there is potential to reduce the file size here)
-dPDFSETTINGS=/prepress
# don't pause between pages (already implied by -o)
-dNOPAUSE
# quit gs automatically at the end (already implied by -o)
-dBATCH
# suppress most messages, except for actual errors
-dQUIET
# input file that is used
"$doc_scan"
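To check whether the copy really is clean, one option is to let Ghostscript read a file without writing anything and look for the two messages quoted above. This is only a sketch: the function name needs_repair is made up, and it assumes that the messages appear with the same wording in your Ghostscript version.

```shell
# needs_repair FILE: returns 0 (true) if Ghostscript reports one of the
# repair messages while reading FILE. The nullpage device discards all
# output, so no file is written. Warnings may go to stderr, hence 2>&1.
needs_repair() {
  gs -o /dev/null -sDEVICE=nullpage "$1" 2>&1 |
    grep -Eq 'xref table was repaired|Incorrect /Length'
}

# Example with the variables used in this post:
# needs_repair "$doc_scan"  && echo "scan needs fixing"
# needs_repair "$doc_fixed" || echo "fixed copy is clean"
```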
(3) Look for empty pages
In the past I removed empty pages with pdfarranger, since I also used the tool to split all documents from one scan session into separate files. Today I stumbled upon nklb/remove-blank-pages. The approach there, simply asking Ghostscript how much ink is used on a page, made me feel w(°o°)w.
$doc_fixed is a 10-page PDF in which pages 4, 6 and 8 are almost empty. When I ask Ghostscript how much ink is used on each page, I get this result (I added the arrows for visibility):
$ gs -q -o - -sDEVICE=inkcov "$doc_fixed"
0.10491 0.10328 0.09950 0.04416 CMYK OK
0.10148 0.10088 0.09943 0.05592 CMYK OK
0.12084 0.11656 0.11248 0.05224 CMYK OK
-> 0.01110 0.01129 0.01081 0.00863 CMYK OK
0.13958 0.13398 0.12899 0.05891 CMYK OK
-> 0.01345 0.01371 0.01331 0.01084 CMYK OK
0.10316 0.09932 0.09451 0.02920 CMYK OK
-> 0.00441 0.00413 0.00468 0.00076 CMYK OK
0.42938 0.36430 0.36582 0.04960 CMYK OK
0.36276 0.35747 0.36175 0.07565 CMYK OK
However, I couldn’t get nklb’s grep pattern to work and decided to go with awk instead. awk can also count, which made it easier to set a threshold for an “empty page” in my view.
$ gs -q -o - -sDEVICE=inkcov "$doc_fixed" | awk -v thr="$threshold" '/^[[:space:]]*[0-9]/ { p++; s=$1+$2+$3+$4; if (s<thr) print p }'
# injecting the $threshold variable into awk's context
-v thr="$threshold"
# matching lines that start with optional whitespace followed by a digit
# you can play around with the tokens at https://regex101.com/r/KtqgpB/1
/^[[:space:]]*[0-9]/
# awk magic:
# p is a counter to count rows, where each row is a page in the pdf
# s is the sum of the four CMYK values per row
# lastly it checks if the sum is below the threshold. If so, it will print p, the page number
{ p++; s=$1+$2+$3+$4; if (s<thr) print p }
Currently my threshold is set to 0.06, so a page is classified as empty when its four CMYK coverage values add up to less than 0.06. What remains on such a page is usually just a border from scanning or punch holes that are recognized as “ink”.
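To see how the threshold plays out on the ten pages from the inkcov output above, the sums can be printed next to the verdict. A small sketch, with classify_pages as a made-up helper name and the arrows removed from the sample lines:

```shell
# classify_pages THRESHOLD: read inkcov output on stdin and print, per page,
# the page number, the sum of the four CMYK values, and "empty" or "keep".
classify_pages() {
  awk -v thr="$1" '/^[[:space:]]*[0-9]/ {
    p++
    s = $1 + $2 + $3 + $4
    printf "%d %.5f %s\n", p, s, (s < thr ? "empty" : "keep")
  }'
}

# The ten lines from the example above, without the added arrows.
classify_pages 0.06 <<'EOF'
0.10491 0.10328 0.09950 0.04416 CMYK OK
0.10148 0.10088 0.09943 0.05592 CMYK OK
0.12084 0.11656 0.11248 0.05224 CMYK OK
0.01110 0.01129 0.01081 0.00863 CMYK OK
0.13958 0.13398 0.12899 0.05891 CMYK OK
0.01345 0.01371 0.01331 0.01084 CMYK OK
0.10316 0.09932 0.09451 0.02920 CMYK OK
0.00441 0.00413 0.00468 0.00076 CMYK OK
0.42938 0.36430 0.36582 0.04960 CMYK OK
0.36276 0.35747 0.36175 0.07565 CMYK OK
EOF
```

Pages 4, 6 and 8 come out as empty with sums of 0.04183, 0.05131 and 0.01398. The next smallest sum is 0.32619 on page 7, so for this scan there is a comfortable gap around 0.06.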