Introduction

We are all aware of Machine Learning tools and cloud services that work via the browser and give us an interface we can use to perform our day-to-day data analysis, model training, and evaluation, and other tasks to various degrees of efficiencies.

But what would you do if you want to do these tasks on or from your local machine or infrastructure available in your organisation? And, if these resources available do not meet the pre-requisites to do decent end-to-end Data Science or Machine Learning tasks. That’s when access to a cloud-provider agnostic deep learning management environments like Valohai can help. And to add to this, we will be using the free-tier that is accessible to one and all.

We will be performing the task of building a Java app, and then training and evaluating an NLP model using it, and we will do all of it from the command-line interface with less interaction with the available web interface — basically it will be an end-to-end process all the way to training, saving and evaluation of the NLP model. And we won’t need to worry much about setting up any environments, configuring or managing it.

Purpose or Goals

We will learn to do a bunch of things in this post covering various levels of abstractions (in no particular order):

how to build and run an NLP model on the local machine?

how to build and run an NLP model on the cloud?

how to build NLP Java apps that run on the CPU or GPU?

most examples out there are non-Java based, much less Java-based ones

most examples are CPU based, much less on GPUs

how to perform the above depending on absence/presence of resources i.e. GPU?

how to build a CUDA docker container for Java?

how to do all the above all from the command-line?

via individual commands

via shell scripts

What do we need and how?

Here’s what we need to be able to get started:

a Java app that builds and runs on any operating system

CLI tools that allow connecting to remote cloud services

shell scripts and code configuration to manage all of the above

The how part of this task is not hard once we have our goals and requirements clear, we will expand on this in the following sections.

NLP for Java, DL4J and Valohai

NLP for Java: DL4J

We have all of the code and instructions needed to get started with this post, captured for you on GitHub. Below are the steps you go through to get acquainted with the project:

Quick startup

To quickly get started we need to do just these:

open an account on https://valohai.com, see https://app.valohai.com/accounts/signup/

install Valohai CLI on your local machine

clone the repo https://github.com/valohai/dl4j-nlp-cuda-example/

$ git clone https://github.com/valohai/dl4j-nlp-cuda-example/

$ cd dl4j-nlp-cuda-example

create a Valohai project using the Valohai CLI tool, and give it a name

$ vh project create

link your Valohai project with the GitHub repo https://github.com/valohai/dl4j-nlp-cuda-example/ on the Repository tab of the Settings page (https://app.valohai.com/p/[your-user-id]/dl4j-nlp-cuda-example/settings/repository/)

$ vh project open ### Go to the Settings page > Repository tab and update the git repo address with https://github.com/valohai/dl4j-nlp-cuda-example/

update Valohai project with the latest commits from the git repo

$ vh project fetch

Now you’re ready to start using the power of performing Machine Learning tasks from the command-line.

See Advanced installation and setup section in the README to find out what we need to install and configure on your system to run the app and experiments on your local machine or inside a Docker container — this is not necessary for this post at the moment but you can try it out at a later time.

You will have noticed we have a valohai.yaml in the git repo and our valohai.yaml file contains several steps that you can use, we have enlisted them by their names, which we will use when running our steps:

build-cpu-gpu-uberjar : build our uber jar (both CPU and GPU versions) on Valohai

: build our uber jar (both CPU and GPU versions) on Valohai train-cpu-linux : run the NLP training using the CPU-version of uber jar on Valohai

: run the NLP training using the CPU-version of uber jar on Valohai train-gpu-linux : run the NLP training using the GPU-version of uber jar on Valohai

: run the NLP training using the GPU-version of uber jar on Valohai evaluate-model-linux : evaluate the trained NLP model from one of the above train-* execution steps

: evaluate the trained NLP model from one of the above execution steps know-your-gpus: run on any instance to gather GPU/Nvidia related details on that instance, we run the same script with the other steps above (both the build and run steps)

Building the Java app from the command line

Assuming you are all set up we will start by building the Java app on the Valohai platform from the command prompt, which is as simple as running one of the two commands:

$ vh exec run build-cpu-gpu-uberjar [--adhoc]



### Run `vh exec run --help` to find out more about this command

And you will be prompted with the execution counter, which is nothing by a number:

<--snipped-->

😼 Success! Execution #1 created. See https://app.valohai.com/p/valohai/dl4j-nlp-cuda-example/execution/016dfef8-3a72-22d4-3d9b-7f992e6ac94d/

Note: use --adhoc only if you have not setup your Valohai project with a git repo or have unsaved commits and want to experiment before being sure of the configuration.

You can watch your execution by:

$ vh watch 1



### the parameter 1 is the counter returned by the `vh exec run build-cpu-gpu-uberjar` operation above, it is the index to refer to that execution run

and you can see either we are waiting for an instance to be allocated or console messages move past the screen when the execution has kicked off. You can see the same via the web interface as well.

Note: instances are available based on how popular they are and also how much quota you have left on them, if they have been used recently they are more likely to be available next.

Once the step is completed, you can see it results in a few artifacts, called outputs in the Valohai terminology, we can see them by:

$ vh outputs 1



### Run `vh outputs --help` to find out more about this command

We will need the URLs that look like datum://[....some sha like notation...] for our next steps. You can see we have a log file that has captured the GPU related information about the running instance, you can download this file by:

$ vh outputs --download . --filter *.logs 1



### Run `vh outputs --help` to find out more about this command

Running the NLP training process for CPU/GPU from the command-line

We will use the built artifacts namely the uber jars for the CPU and GPU backends to run our training process:

### Running the CPU uberjar

$ vh exec run train-cpu-linux --cpu-linux-uberjar=datum://016dff00-43b7-b599-0e85-23a16749146e [--adhoc]



### Running the GPU uberjar

$ vh exec run train-gpu-linux --gpu-linux-uberjar=datum://016dff00-2095-4df7-5d9e-02cb7cd009bb [--adhoc]



### Note these datum:// link will vary in your case

### Run `vh exec run train-cpu-linux --help` to get more details on its usage

Note: take a look at the Inputs with Valohai CLI docs to see how to write commands like the above.

We can watch the process if we like but it can be lengthy so we can switch to another task.

The above execution runs finish with saving the model into the ${VH_OUTPUTS} folder to enable it to be archived by Valohai. The model names get suffix to their names, to keep a track of how they were produced.

At any point during our building, training or evaluation steps, we can stop an ongoing execution (queued or running) by just doing this:

$ vh stop 3

(Resolved stop to execution stop.)

⌛ Stopping #3...

=> {"message":"Stop signal sent"}

😁 Success! Done.

Downloading the saved model post successful training

We can query the outputs of execution by its counter number and download it using:

$ vh outputs 2

$ vh outputs --download . --filter Cnn*.pb 2

See how you can evaluate the downloaded model on your local machine, both the models created by the CPU and GPU based processes (respective uber jars). Just pass in the name of the downloaded model as a parameter to the runner shell script provided.

Evaluating the saved NLP model from a previous training execution

### Running the CPU uberjar and evaluating the CPU-verion of the model

$ vh exec run evaluate-model-linux --uber-jar=datum://016dff00-43b7-b599-0e85-23a16749146e --model=datum://016dff2a-a0d4-3e63-d8da-6a61a96a7ba6 [--adhoc]



### Running the GPU uberjar and evaluating the GPU-verion of the model

$ vh exec run evaluate-model-linux --uber-jar=datum://016dff00-2095-4df7-5d9e-02cb7cd009bb --model=datum://016dff2a-a0d4-3e63-d8da-6a61a96a7ba6 [--adhoc]



### Note these datum:// link will vary in your case

### Run `vh exec run train-cpu-linux --help` to get more details on its usage

And at the end of the model evaluation we get the below, model evaluation metrics and confusion matrix after running a test set on the model:

Note: the source code contains ML and NLP-related explanations at various stages in the form of inline comments.

Capturing the environment information about Nvidia’s GPU and CUDA drivers

This step is unrelated to the whole process of building and running a Java app on the cloud and controlling and viewing it remotely using the client tool, although it is useful to be able to know on what kind of system we ran our training on, especially for the GPU aspect of the training:

$ vh exec run know-your-gpus [--adhoc]



### Run `vh exec run --help` to get more details on its usage

Keeping track of your experiments

While writing this post, I ran several experiments and to keep track of the successful versus failed experiments in an efficient manner, I was able to use Valohai’s version control facilities baked into its design by

filtering for executions

searching for specific execution by “token”

re-running the successful and failed executions

confirming that the executions were successful and a failure for the right reasons

also, checkout data-catalogues and data provenance on the Valohai platform below is an example of my project (look for the Trace button):

Comparing the CPU and GPU based processes

We could have discussed comparisons between the CPU and GPU based processes in terms of these:

app-building performance

model training speed

model evaluation accuracy

But we won’t cover these topics in this post, although you have access to the metrics you need for it, in case you wish to investigate further.

Necessary configuration file(s) and shells scripts

All the necessary scripts can be found on the GitHub repo, they can be found in:

the root folder of the project

docker folder

resources-archive folder

Please also have a look at the README.md file for further details on their usages and other additional information that we haven’t mentioned in this post here.

Valohai — Orchestration

If we have noticed all the above tasks were orchestrating the tasks via a few tools at different levels of abstractions:

docker to manage infrastructure and platform-level configuration and version control management

java to be able to run our apps on any platform of choice

shell scripts to be able to again run both building and execution commands in a platform-agnostic manner and also be able to make exceptions for the absence of resources i.e. GPU on MacOSX

a client tool to connect with the remote cloud service i.e. Valohai CLI, and view, control executions and download the end-results

You are orchestrating your tasks from a single point making use of the tools and technologies available to do various Data and Machine Learning tasks.

Conclusion

We have seen that NLP is a resource-consuming task and having the right methods and tools in hands certainly helps. Once again the DeepLearning4J library from Skymind and the Valohai platform have come to our service. Thanks to the creators of both platform. Also, we can see the below benefits (and more) this post highlights.

Benefits

We gain a bunch of things by doing the way we did the things above:

not have to worry about hardware and/or software configuration and version control management — docker containers FTW

able to run manual one-off building, training and evaluation tasks — Valohai CLI tool FTW

automate regularly use tasks for your team to be able to run tasks on remote cloud infrastructure — infrastructure-as-code FTW

overcome the limitations of an old or slow machine or a Mac with no access to the onboard GPU — CUDA-enabled docker image scripts FTW

overcome situations when not enough resources are available on the local or server infrastructure, and still be able to run experiments requiring high-throughput and performant environments — a cloud-provider agnostic platform i.e Valohai environments FTW

run tasks and not have to wait for them to finish and be able to run multiple tasks — concurrently and in-parallel on remote resources in a cost-effective manner — a cloud-provider agnostic platform i.e Valohai CLI tool FTW

remotely view, control both configuration and executions and even download the end-results after a successful execution — a cloud-provider agnostic platform i.e Valohai CLI tool FTW

and many others you will spot yourself

Suggestions

using provided CUDA-enabled docker container: highly recommend not to start installing Nvidia drivers or CUDA or cuDNN on your local machine (Linux or Windows-based) — shelve this for later experimentation

highly recommend not to start installing Nvidia drivers or CUDA or cuDNN on your local machine (Linux or Windows-based) — shelve this for later experimentation use provided shell scripts and configuration files: try not to perform manual CLI command instead use shells scripts to automate repeated tasks, provided examples are a good starting point and take it further from there

try not to perform manual CLI command instead use shells scripts to automate repeated tasks, provided examples are a good starting point and take it further from there try to learn as much : about GPUs, CUDA, cuDNN from resources provided and look for more (see Resources section at the bottom of the post)

: about GPUs, CUDA, cuDNN from resources provided and look for more (see section at the bottom of the post) use version control and infrastructure-as-code systems: git and the valohai.yaml are great examples of this

I felt very productive and my time and resources were effectively used while doing all of the above, and above all, I can share it with others and everyone can just reuse all of this work directly — just clone the repo and off you go.

What we didn’t cover and is potentially a great topic to talk about, is the Valohai Pipelines in some future post!

Resources

Valohai resources

Other resources

Other related posts