Thursday, January 30, 2014
8:03 AM

How to split up PDF files - part 2

In an earlier post, I used the pdftk tool to extract pages from a PDF file. I had no reason to look for alternatives until I ran into the following problem.

I had to extract the first 4 pages of a PDF document. The normally reliable pdftk command failed with a Java exception.


$ pdftk T4.pdf cat 1-4 output outputT4.pdf
Unhandled Java Exception:
Unhandled Java Exception:
java.lang.NullPointerException
at gnu.gcj.runtime.NameFinder.lookup(libgcj.so.12)
at java.lang.Throwable.getStackTrace(libgcj.so.12)
at java.lang.Throwable.stackTraceString(libgcj.so.12)
at java.lang.Throwable.printStackTrace(libgcj.so.12)
at java.lang.Throwable.printStackTrace(libgcj.so.12)

To troubleshoot the problem, I ran the same pdftk command on a different input PDF file. It worked just fine, so the problem appeared to lie with that specific input file.

At that point, I started looking for an alternative tool.

gs, aka Ghostscript, is an interpreter and previewer for PostScript and PDF files.

You can direct gs output to various output devices using the -sDEVICE parameter. The pdfwrite device specifies that the output will be in PDF file format.

The page range to extract is defined by the -dFirstPage and -dLastPage parameters. The name of the output file is specified using the -sOutputFile parameter.
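As a rough sketch, the parameters described above could be collected in a small shell function. The name extract_pages and its argument order are my own invention for illustration, not something Ghostscript provides. The comments also note what the -dNOPAUSE, -dBATCH and -dSAFER switches do.

```shell
# extract_pages INPUT FIRST LAST OUTPUT
# Hypothetical helper (not part of Ghostscript): copies pages FIRST..LAST
# of INPUT into a new PDF file named OUTPUT.
extract_pages() {
    if [ "$#" -ne 4 ]; then
        echo "usage: extract_pages INPUT FIRST LAST OUTPUT" >&2
        return 2
    fi
    input=$1
    first=$2
    last=$3
    output=$4

    # -sDEVICE=pdfwrite  write the result as a PDF file
    # -dNOPAUSE          do not prompt and pause after each page
    # -dBATCH            exit once the input file has been processed
    # -dSAFER            restrict file access from within the interpreter
    gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER \
        -dFirstPage="$first" -dLastPage="$last" \
        -sOutputFile="$output" "$input"
}
```

With that function defined, `extract_pages T4.pdf 1 4 outputT4.pdf` should run the same gs command that I typed by hand.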


$ gs -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER -dFirstPage=1 \
  -dLastPage=4 -sOutputFile=outputT4.pdf T4.pdf
GPL Ghostscript 9.05 (2012-02-08)
Copyright (C) 2010 Artifex Software, Inc. All rights reserved.
This software comes with NO WARRANTY: see the file PUBLIC for details.
**** Warning: considering '0000000000 XXXXX n' as a free entry.
**** Warning: considering '0000000000 XXXXX n' as a free entry.
**** Warning: considering '0000000000 XXXXX n' as a free entry.
Processing pages 1 through 4.
Page 1
Loading NimbusSanL-Regu font from /usr/share/fonts/type1/gsfonts/n019003l.pfb... 4287624 2669241 2475832 1154775 3 done.
Loading NimbusSanL-Bold font from /usr/share/fonts/type1/gsfonts/n019004l.pfb... 4328616 2778664 2516200 1192102 3 done.
Loading NimbusMonL-Regu font from /usr/share/fonts/type1/gsfonts/n022003l.pfb... 4371912 2946486 2677672 1350807 3 done.
Page 2
Loading NimbusSanL-BoldItal font from /usr/share/fonts/type1/gsfonts/n019024l.pfb... 4431472 2877228 2738224 1120988 3 done.
Loading NimbusSanL-ReguItal font from /usr/share/fonts/type1/gsfonts/n019023l.pfb... 4471488 2998784 2758408 1209901 3 done.
Page 3
Page 4
**** This file had errors that were repaired or ignored.
**** The file was produced by:
**** >>>> iText 1.4.5 (by lowagie.com) <<<<
**** Please notify the author of the software that produced this
**** file that it does not conform to Adobe's published PDF
**** specification.

The output messages above provided a clue as to why the input PDF file was problematic: the file does not "conform to Adobe's published PDF specification." To its credit, gs "repaired or ignored" the errors and went on to extract the pages successfully. In this particular example, gs proved more error tolerant than its counterpart, pdftk.
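One way to put this to use is to keep pdftk as the first choice and fall back to gs only when pdftk fails. The following is a minimal sketch under some assumptions: the function name split_pdf is hypothetical, and I am assuming that pdftk exits with a non-zero status when it hits an exception like the one above, which I have not verified for every version.

```shell
# split_pdf INPUT FIRST LAST OUTPUT
# Hypothetical wrapper: try pdftk first, then fall back to gs.
# Assumption: pdftk returns a non-zero exit status when it fails.
split_pdf() {
    input=$1
    first=$2
    last=$3
    output=$4

    if pdftk "$input" cat "${first}-${last}" output "$output" 2>/dev/null; then
        echo "extracted with pdftk"
    elif gs -q -sDEVICE=pdfwrite -dNOPAUSE -dBATCH -dSAFER \
            -dFirstPage="$first" -dLastPage="$last" \
            -sOutputFile="$output" "$input"; then
        # -q suppresses the gs startup banner
        echo "extracted with gs"
    else
        echo "both pdftk and gs failed on $input" >&2
        return 1
    fi
}
```

Called as `split_pdf T4.pdf 1 4 outputT4.pdf`, it reports which tool produced the output file. Any warnings gs prints about a non-conforming file are still worth reading.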
