<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Paperless on blog.bdw.li</title>
    <link>https://blog.bdw.li/tags/paperless/</link>
    <description>Recent content in Paperless on blog.bdw.li</description>
    <language>en</language>
    <managingEditor>hello@bdw.li (jwb)</managingEditor>
    <webMaster>hello@bdw.li (jwb)</webMaster>
    <lastBuildDate>Fri, 22 May 2026 22:16:09 +0200</lastBuildDate>
    <atom:link href="https://blog.bdw.li/tags/paperless/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Removing empty pages from PDF files</title>
      <link>https://blog.bdw.li/removing-empty-pages-from-pdf-files/?ref=rss</link>
      <pubDate>Tue, 21 Oct 2025 21:44:00 +0000</pubDate><author>hello@bdw.li (jwb)</author>
      <guid>https://blog.bdw.li/removing-empty-pages-from-pdf-files/?ref=rss</guid>
      <description>I created a script to scan documents and remove empty pages from the final PDF.</description>
      <content:encoded><![CDATA[<p>I bought a used Brother document scanner on Kleinanzeigen to manage my overflowing desk. Of course, I could have sorted it all manually, but why go the straight way if you can procrastinate with technology instead <mark>┬┴┬┴┤･ω･)ﾉ</mark>.</p>
<p>The document scanner is quite simple and can only scan to a computer locally over USB. Luckily I can use <a href="https://linux.die.net/man/1/scanimage" target="_blank" rel="noopener noreferrer">scanimage</a> on my Linux machine to access the scanner over the SANE interface as well.</p>
<h2 id="1-scan-the-documents">(1) Scan the documents</h2>
<p>After a while I found the proper parameters for my model:</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">$ scanimage <span class="se">\
</span></span></span><span class="line"><span class="cl">  <span class="c1"># select device</span>
</span></span><span class="line"><span class="cl">  --device-name<span class="o">=</span><span class="s2">&#34;</span><span class="nv">$device</span><span class="s2">&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="c1"># select source on the device (there are devices with several input sources)</span>
</span></span><span class="line"><span class="cl">  --source <span class="s1">&#39;Automatic Document Feeder(centrally aligned,Duplex)&#39;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="c1"># restrict the scan area to where the actual text is (-y: height, -l: left offset)</span>
</span></span><span class="line"><span class="cl">  -y <span class="m">297</span> -l <span class="m">7</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="c1"># run automatically, don&#39;t ask for input</span>
</span></span><span class="line"><span class="cl">  --batch<span class="o">=</span><span class="s2">&#34;</span><span class="nv">$doc_scan</span><span class="s2">&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  --resolution<span class="o">=</span><span class="m">300</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">  --format<span class="o">=</span>pdf
</span></span></code></pre></div><p>With a single sheet of paper, this produces a two-page PDF file in DIN A4.</p>
<p>I don&rsquo;t want to decide for every sheet whether to scan one side or both, and my scanner only does as it&rsquo;s told. I opted to always scan both sides (hence the duplex parameter) and take care of empty pages with CPU resources instead.</p>
<p>My workflow uses the following steps:</p>
<ul>
<li>Create the original PDF file</li>
<li>Create a copy with Ghostscript to fix the xref table</li>
<li>Analyze the copy to find empty pages</li>
<li>If some are found, they will be extracted for manual verification</li>
<li>A &ldquo;clean&rdquo; PDF is created for use with <a href="https://docs.paperless-ngx.com" target="_blank" rel="noopener noreferrer">paperless-ngx</a></li>
</ul>
<p>Later I want to remove the manual verification step. But I need to use it for a few weeks to verify my threshold first.</p>
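<p>For the last two steps, the list of empty pages has to be turned into the list of pages to keep. A minimal sketch of that, assuming the empty page numbers are already known: <code>pages_to_keep</code> is a made-up helper name for illustration, and the example numbers are taken from the 10-page scan analyzed in section (3).</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"># Sketch: print the pages to keep as a comma-separated list.
# Usage: pages_to_keep TOTAL_PAGES [EMPTY_PAGE ...]
# pages_to_keep is a hypothetical helper, not part of any tool.
pages_to_keep() {
  local total="$1"
  shift
  local empty=" $* "   # padded with spaces so that page 1 does not match page 10
  local keep="" p=1
  while [ "$p" -le "$total" ]; do
    case "$empty" in
      *" $p "*) ;;                       # listed as empty: leave it out
      *) keep="${keep:+$keep,}$p" ;;     # otherwise append to the keep-list
    esac
    p=$((p + 1))
  done
  printf '%s\n' "$keep"
}
</code></pre></div>
<p>Called as <code>pages_to_keep 10 4 6 8</code>, it prints <code>1,2,3,5,7,9,10</code>. Such a list could, for example, be passed to Ghostscript&rsquo;s <code>-sPageList</code> option when writing the clean PDF; that is one possible choice, not necessarily the one used in this workflow.</p>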
<h2 id="2-fix-the-pdf">(2) Fix the PDF</h2>
<p>I&rsquo;m using Ghostscript to fix the PDF file. For some reason, the PDF produced by scanimage is always &ldquo;broken&rdquo;. Errors look like this:</p>
<blockquote>
<p>The following errors were encountered at least once while processing this file: xref table was repaired Incorrect /Length for stream object.</p>
</blockquote>
<p>It could be the <code>-y 297 -l 7</code> offset I&rsquo;m using, or it could be scanimage itself. I couldn&rsquo;t bring myself to investigate.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">$ gs <span class="se">\
</span></span></span><span class="line"><span class="cl">  <span class="c1"># result is saved as $doc_fixed</span>
</span></span><span class="line"><span class="cl">  -o <span class="s2">&#34;</span><span class="nv">$doc_fixed</span><span class="s2">&#34;</span> <span class="se">\
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="c1"># generate new PDF file</span>
</span></span><span class="line"><span class="cl">  -sDEVICE<span class="o">=</span>pdfwrite <span class="se">\
</span></span></span><span class="line"><span class="cl">  
</span></span><span class="line"><span class="cl">  <span class="c1"># PDF quality preset (there is potential to reduce the file size here)</span>
</span></span><span class="line"><span class="cl">  -dPDFSETTINGS<span class="o">=</span>/prepress <span class="se">\
</span></span></span><span class="line"><span class="cl">  
</span></span><span class="line"><span class="cl">  <span class="c1"># Don&#39;t pause between pages</span>
</span></span><span class="line"><span class="cl">  -dNOPAUSE <span class="se">\
</span></span></span><span class="line"><span class="cl">  
</span></span><span class="line"><span class="cl">  <span class="c1"># Quit gs automatically at the end</span>
</span></span><span class="line"><span class="cl">  -dBATCH <span class="se">\
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="c1"># Suppress most messages, except for actual errors</span>
</span></span><span class="line"><span class="cl">  -dQUIET <span class="se">\
</span></span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl">  <span class="c1"># input file that is used</span>
</span></span><span class="line"><span class="cl">  <span class="s2">&#34;</span><span class="nv">$doc_scan</span><span class="s2">&#34;</span>
</span></span></code></pre></div><h2 id="3-look-for-empty-pages">(3) Look for empty pages</h2>
<p>In the past I removed empty pages with <a href="https://github.com/pdfarranger/pdfarranger" target="_blank" rel="noopener noreferrer">pdfarranger</a>, since I also used the tool to split all documents from one scan session into separate files. Today I stumbled upon <a href="https://github.com/nklb/remove-blank-pages" target="_blank" rel="noopener noreferrer">nklb/remove-blank-pages</a>. His approach, simply asking Ghostscript how much ink is used on each page, made me feel <mark>w(°ｏ°)w</mark>.</p>
<p><code>$doc_fixed</code> is a 10-page PDF in which pages 4, 6 and 8 are almost empty. When I ask Ghostscript how much ink is used on each page, I get this result (I added the arrows for visibility):</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">$ gs -q -o - -sDEVICE<span class="o">=</span>inkcov <span class="s2">&#34;</span><span class="nv">$doc_fixed</span><span class="s2">&#34;</span>
</span></span><span class="line"><span class="cl">    0.10491 0.10328 0.09950 0.04416 CMYK OK
</span></span><span class="line"><span class="cl">    0.10148 0.10088 0.09943 0.05592 CMYK OK
</span></span><span class="line"><span class="cl">    0.12084 0.11656 0.11248 0.05224 CMYK OK
</span></span><span class="line"><span class="cl">-&gt;  0.01110 0.01129 0.01081 0.00863 CMYK OK
</span></span><span class="line"><span class="cl">    0.13958 0.13398 0.12899 0.05891 CMYK OK
</span></span><span class="line"><span class="cl">-&gt;  0.01345 0.01371 0.01331 0.01084 CMYK OK
</span></span><span class="line"><span class="cl">    0.10316 0.09932 0.09451 0.02920 CMYK OK
</span></span><span class="line"><span class="cl">-&gt;  0.00441 0.00413 0.00468 0.00076 CMYK OK
</span></span><span class="line"><span class="cl">    0.42938 0.36430 0.36582 0.04960 CMYK OK
</span></span><span class="line"><span class="cl">    0.36276 0.35747 0.36175 0.07565 CMYK OK
</span></span></code></pre></div><p>However, I couldn&rsquo;t get nklb&rsquo;s <a href="https://www.gnu.org/software/grep" target="_blank" rel="noopener noreferrer">grep</a> <a href="https://github.com/nklb/remove-blank-pages/blob/c551f746b3e607a63d24dbb9444996260b63806c/remove-blank-pages#L7" target="_blank" rel="noopener noreferrer">pattern</a> to work and decided to go with <a href="https://www.gnu.org/software/gawk" target="_blank" rel="noopener noreferrer">awk</a> instead. awk can also do arithmetic, which made it easier to set my own threshold for what counts as an &ldquo;empty page&rdquo;.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"><span class="line"><span class="cl">$ awk
</span></span><span class="line"><span class="cl"><span class="c1"># inject the $threshold variable into awk&#39;s context as thr</span>
</span></span><span class="line"><span class="cl">-v <span class="nv">thr</span><span class="o">=</span><span class="s2">&#34;</span><span class="nv">$threshold</span><span class="s2">&#34;</span>
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># matching lines starting with an optional space followed by digits</span>
</span></span><span class="line"><span class="cl"><span class="c1"># you can play around with the tokens at https://regex101.com/r/KtqgpB/1</span>
</span></span><span class="line"><span class="cl">/^<span class="o">[[</span>:space:<span class="o">]]</span>*<span class="o">[</span>0-9<span class="o">]</span>/ 
</span></span><span class="line"><span class="cl">
</span></span><span class="line"><span class="cl"><span class="c1"># awk magic: </span>
</span></span><span class="line"><span class="cl"><span class="c1"># p is a counter to count rows, where each row is a page in the pdf</span>
</span></span><span class="line"><span class="cl"><span class="c1"># s is the sum of the four CMYK values per row</span>
</span></span><span class="line"><span class="cl"><span class="c1"># then it checks if the sum is below the threshold. </span>
</span></span><span class="line"><span class="cl"><span class="c1"># If so, it will print p, the page number</span>
</span></span><span class="line"><span class="cl"><span class="o">{</span> p++<span class="p">;</span> <span class="nv">s</span><span class="o">=</span><span class="nv">$1</span>+<span class="nv">$2</span>+<span class="nv">$3</span>+<span class="nv">$4</span><span class="p">;</span> <span class="k">if</span> <span class="o">(</span>s&lt;thr<span class="o">)</span> print p <span class="o">}</span>
</span></span></code></pre></div><p>Currently my threshold is set to 0.06, so a page is classified as empty when its four CMYK coverage values add up to less than 0.06. What remains is usually just a border from scanning or punch holes that are recognized as &ldquo;ink&rdquo;.</p>
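<p>Put together, the inkcov call and the awk filter could look like the sketch below. It assumes the inkcov output has one line per page, without the arrows added above for visibility; <code>find_empty_pages</code> is a hypothetical helper name, not part of Ghostscript or awk.</p>
<div class="highlight"><pre tabindex="0" class="chroma"><code class="language-bash" data-lang="bash"># Sketch: read Ghostscript inkcov output on stdin and print the numbers
# of all pages whose summed CMYK coverage is below the given threshold.
# find_empty_pages is a hypothetical helper name.
find_empty_pages() {
  local threshold="$1"
  awk -v thr="$threshold" '
    # only lines that start with a number are page rows
    /^[[:space:]]*[0-9]/ {
      p++                      # page counter
      s = $1 + $2 + $3 + $4    # summed C, M, Y and K coverage
      if (thr > s) print p     # below the threshold: report the page
    }'
}

# Usage with a real scan (needs Ghostscript):
#   gs -q -o - -sDEVICE=inkcov "$doc_fixed" | find_empty_pages 0.06
</code></pre></div>
<p>Fed with the ten rows shown above and a threshold of 0.06, this prints 4, 6 and 8, one page number per line.</p>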
]]></content:encoded>
    </item>
  </channel>
</rss>
