Docsplit: Break Apart Documents into Images, Text, Pages and PDFs

Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages…).

Under the hood, Docsplit is a thin wrapper around the excellent PDFBox, GraphicsMagick, and JODConverter libraries. PDFBox extracts text and metadata from PDF documents and splits them apart into pages. GraphicsMagick generates the page images (internally, it renders them with Ghostscript). JODConverter communicates with OpenOffice to convert other document formats to PDF.
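As a rough illustration of how those pieces surface in the Ruby library, here is a minimal sketch. It assumes the docsplit gem and its external dependencies are installed, and that the `Docsplit.extract_*` methods behave as described in the project documentation for the version you have. The `split_document` helper, the `out` directory, and the `700x` size are hypothetical choices for this example, not part of Docsplit.

```ruby
# Sketch only: assumes the docsplit gem and its external tools are installed.
begin
  require 'docsplit'
rescue LoadError
  warn 'docsplit gem not found; split_document will fail if called'
end

# Hypothetical helper (not part of Docsplit): splits one PDF into
# plain text and page images, then returns some of its metadata.
def split_document(pdf_path, out_dir)
  # Text extraction (handled by PDFBox, per the description above).
  Docsplit.extract_text(pdf_path, :output => File.join(out_dir, 'text'))

  # Page images (handled by GraphicsMagick, per the description above).
  Docsplit.extract_images(pdf_path,
                          :size   => '700x',
                          :format => [:png],
                          :output => File.join(out_dir, 'images'))

  # Document metadata.
  { :title => Docsplit.extract_title(pdf_path),
    :pages => Docsplit.extract_length(pdf_path) }
end

# Run only when invoked directly with a PDF path as the first argument.
if __FILE__ == $0 && ARGV[0]
  p split_document(ARGV[0], 'out')
end
```

Saved under a name of your choosing, the script would be run with a PDF path as its only argument; the command-line `docsplit` utility covers the same ground without any Ruby code.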

Hat tip: @documentcloud http://twitter.com/documentcloud/status/6436280510

More software releases! Take a look at … http://documentcloud.github.com/docsplit/

[code on GitHub] [documentation]
