Recently I stumbled upon a Java class that did exactly what I had in mind when I started writing my gem: it extracts text from a PDF while preserving the text layout. I was a Java developer once, but I wanted my project to stay in Ruby.

"Let's wrap it in a JRuby gem!" came to mind. I started googling and found excellent tutorials on the topic. However, each of them covered wrapping a jar package rather than a single class. I dug deeper and found the answers scattered across different places on the web, so I decided to collect them in this post.



First, let me introduce the Java class: PDFLayoutTextStripper. It is a fairly standard class by Java standards. One important thing it is missing is a package declaration. Packages in the Java world roughly correspond to modules in Ruby. The tutorials I found assumed every Java class is namespaced by a package name, and to be honest I didn't want to change the class's source. I spotted a challenge here :)

OK, let's start. I mentioned a gem, right? But before we create one, we need to make sure we are using JRuby:



❯ ruby -v

jruby 9.1.12.0 (2.3.3) 2017-06-15 33c6439 Java HotSpot(TM) 64-Bit Server VM 9.0.4+11 on 9.0.4+11 +jit [darwin-x86_64]



To create the gem I followed the standard way described in the Bundler guide:

❯ bundle gem pdf-textstream # naming things is the second hardest thing in IT, right?

Sadly, because we will be calling Java code directly, our gem will only be compatible with JRuby. To make sure it is installed and run only on the JVM, modify the pdf-textstream.gemspec file and set the platform attribute:

spec.platform = 'java'
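For context, here is a minimal sketch of how that line might sit in the gemspec. Only the gem name and the platform setting come from this post; the version, summary, author, and file globs are placeholders made up for illustration, so keep whatever `bundle gem` generated for you.

```ruby
require "rubygems"

# Sketch of pdf-textstream.gemspec. Only the name and the platform line come
# from this post; every other value is a placeholder.
def build_spec
  Gem::Specification.new do |spec|
    spec.name     = "pdf-textstream"
    spec.version  = "0.1.0"                # placeholder
    spec.summary  = "PDF text extraction"  # placeholder
    spec.authors  = ["Your Name"]          # placeholder
    spec.platform = "java"                 # marks the gem as JRuby-only
    # Assumed layout: Ruby code in lib/, jars in jars/, compiled classes in classes/
    spec.files         = Dir["lib/**/*.rb", "jars/*.jar", "classes/**/*.class"]
    spec.require_paths = ["lib"]
  end
end
```

In a real gemspec the `Gem::Specification.new` block is the last expression of the file; the wrapping method exists only to keep the sketch self-contained.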



The wrapper code lives in lib/pdf/textstream.rb. Let me walk you through it, line by line.

require "pdf/textstream/version"

require "java"



To use Java classes from Ruby (our own class as well as the Java standard library), we first have to require the java module.
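Since `require "java"` only works on JRuby, a small guard can fail early with a readable message on other interpreters. This is an optional sketch and not part of the original gem; the method name `ensure_jruby!` is hypothetical.

```ruby
# Hypothetical guard: raises unless the given engine name is JRuby.
# RUBY_ENGINE is "jruby" on JRuby and "ruby" on MRI.
def ensure_jruby!(engine = RUBY_ENGINE)
  return true if engine == "jruby"
  raise LoadError, "pdf-textstream requires JRuby, but is running on #{engine}"
end

# In lib/pdf/textstream.rb this would be called right before require "java":
#   ensure_jruby!
#   require "java"
```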



The next step is to require the Java jars the Ruby way:

# load jars

require_relative "../../jars/pdfbox-2.0.6.jar"

require_relative "../../jars/commons-logging-1.2.jar"

require_relative "../../jars/fontbox-2.0.6.jar"

These are the dependencies of the class introduced above. Of course, you have to download them, put them in the `jars` directory, and distribute them together with your gem.



The next important line is the classpath definition:

$CLASSPATH << "#{File.expand_path(File.dirname(__FILE__))}/../../classes"

module Pdf

module Textstream



The classpath, for those with a Java background, is pretty straightforward: it is the list of directories and jars where the JVM looks for classes. At this point there is no directory named classes in our project; the Java compiler will create it automatically. But we still don't have the compile step in place.

It is probably not the best practice, but I included a build file that executes the following command:
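The path arithmetic in the `$CLASSPATH` line can be sketched as a tiny helper. The name `classes_dir` is hypothetical, and the sketch assumes the file lives in lib/pdf/ so that two levels up is the gem root. The classpath itself is only touched when running on JRuby.

```ruby
# Hypothetical helper: resolves the gem's classes/ directory relative to a
# file located in lib/pdf/ (two levels below the gem root).
def classes_dir(from = __dir__)
  File.expand_path("../../classes", from)
end

# $CLASSPATH only exists as a classpath object on JRuby, so guard the append.
if RUBY_ENGINE == "jruby"
  require "java"
  $CLASSPATH << classes_dir
end
```

`File.expand_path` normalizes the `..` segments, so the entry added to the classpath is an absolute path without the `/../../` in the middle.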



javac -d classes -cp .:./jars/pdfbox-2.0.6.jar:./jars/commons-logging-1.2.jar:./jars/fontbox-2.0.6.jar *.java



You have to run this command manually each time you modify the Java class or change its dependencies.

And finally, the magic bits. First, copy the Java class to the root directory of your gem. Then we can ask JRuby for a proxy class and reference it:
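One way to avoid retyping the command is to assemble it in Ruby, for example from a Rake task. The sketch below only builds the argument list; the helper name `javac_command` is hypothetical, and it assumes the jars sit in jars/ and the Java sources in the gem root, as described in this post.

```ruby
# Hypothetical helper that builds the javac invocation shown above.
# jars and sources are arrays of paths; out_dir is where .class files go.
def javac_command(jars, sources, out_dir = "classes")
  # The current directory comes first, followed by every jar.
  classpath = (["."] + jars).join(File::PATH_SEPARATOR)
  ["javac", "-d", out_dir, "-cp", classpath, *sources]
end

# Possible use from the gem's root directory:
#   args = javac_command(Dir["jars/*.jar"].sort, Dir["*.java"].sort)
#   system(*args) or abort("javac failed")
```

Passing the command to `system` as separate arguments avoids shell quoting issues with unusual file names.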

PDFLayoutTextStripper = JavaUtilities.get_proxy_class("PDFLayoutTextStripper")



The next thing I did was shorten the namespaces of the classes I use. Each Java class can be referenced the Ruby way by walking the Java module tree:

# change namespace

PDFParser = Java::OrgApachePdfboxPdfparser::PDFParser

RandomAccessFile = Java::OrgApachePdfboxIo::RandomAccessFile

PDDocument = Java::OrgApachePdfboxPdmodel::PDDocument

PDFTextStripper = Java::OrgApachePdfboxText::PDFTextStripper
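The naming pattern in these constants is mechanical: each segment of the Java package name is capitalized and the segments are joined together under `Java::`. A small sketch of that mapping, covering the all-lowercase package names used in this post (the helper name `jruby_module_name` is hypothetical):

```ruby
# Hypothetical helper: turns a Java package name into the Ruby constant path
# JRuby exposes for it, e.g. "java.io" becomes "Java::JavaIo".
def jruby_module_name(java_package)
  camel = java_package.split(".").map { |part| part[0].upcase + part[1..-1] }.join
  "Java::#{camel}"
end

jruby_module_name("org.apache.pdfbox.pdfparser") # => "Java::OrgApachePdfboxPdfparser"
```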

To run the class on a file located at a given path, I created a module-level (static) method:

def self.file_path_to_text(path)
  # TODO: exception handling
  pdfParser = PDFParser.new(RandomAccessFile.new(Java::JavaIo::File.new(path), "r"))
  pdfParser.parse()
  pdDocument = PDDocument.new(pdfParser.getDocument())
  pdfTextStripper = PDFLayoutTextStripper.new
  string = pdfTextStripper.getText(pdDocument)
  return string
end

It initializes the PDF parser, parses the PDF file, passes the document to our custom class, and returns the string it read. The trickiest part was that I tried to pass a Ruby file handle to PDFParser as an argument. Of course, it failed: the PDFParser constructor expects a file handle from the Java world. That was new to me, and it is why I had to read the file "the Java way": RandomAccessFile.new(Java::JavaIo::File.new(path), "r")
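As a first step toward the exception handling left as a TODO, the Ruby side can normalize whatever the caller passes in to a plain path string before it reaches Java. This is a sketch with a hypothetical helper name, `pdf_path_for`; it is not part of the original gem.

```ruby
# Hypothetical helper: accepts a String, Pathname, or open File and returns an
# absolute path string, raising early if no regular file exists at that path.
def pdf_path_for(source)
  # File and Tempfile respond to #path; Strings and Pathnames go through #to_s.
  raw = source.respond_to?(:path) ? source.path : source.to_s
  absolute = File.expand_path(raw)
  raise ArgumentError, "no such file: #{absolute}" unless File.file?(absolute)
  absolute
end
```

Inside file_path_to_text, the returned string could then be handed to Java::JavaIo::File.new, so passing a Ruby file handle by mistake no longer ends in a Java-side error.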

And… that's it! Your Java class, packed as a gem, is ready to use! You can find the gem in my GitHub repo. Please keep in mind that it was created as a proof of concept and is not ready for production use.
