Recently I have struggled a lot to profile a CUDA application on the Shield Tablet. If you were thinking “What the hell would you need a CUDA app for on a tablet?”, I would understand :D. CUDA is not for everyday use, but it can be very powerful.

As of now (late 2015), the Shield has the most powerful mobile GPU on the market (a Tegra Kepler architecture with 192 streaming processors). I decided to evaluate and profile physics algorithms on this architecture.

Reading through documentation, GDC keynotes, and presentations, I found out that it is currently not possible to profile a CUDA application from an APK!

NVIDIA offers the Android Works package, previously called the Tegra Android Development Pack. This package provides developers with a large suite of handy tools to debug, test, and deploy applications on the Shield. Recently, I found this presentation from the 2014 GPU Technology Conference about profiling CUDA apps. In general, several graphical and command-line tools exist, but only one is available for Android. See the image below:

Graphical and Command-Line Profiling Tools

As you can see, on Android you can only use nvprof, a command-line tool for profiling CUDA applications; it is explained in the next paragraph. If you look at the red rectangle at the bottom of the picture, you will notice that CUDA APK profiling is not supported yet! In other words, if your APK contains any CUDA kernel, or calls into any library that uses CUDA, you simply can't profile it.

The Solution

Luckily, there is a workaround. The only way is to create a standalone, self-contained CUDA application. This must be deployed onto the Shield (manually, unfortunately) and profiled using nvprof.

Nvprof is a handy tool: you can import its output into the NVIDIA NSight Visual Profiler for Eclipse, available with the Android Works package.

There is another limitation, though it only matters if you are a Windows lover: to compile CUDA for Android you need to use the Android NDK and the CUDA development kit on…Linux! Currently there is no way to compile CUDA kernels using the nvcc shipped with the dev kit on Mac or on Windows.

This is what you need:

Linux (I use Ubuntu 14.04)

Android Works (I use TADP 3.04). During the installation process, you will be asked to plug in the Shield. The CUDA profiling tool will then be deployed by the installer into this folder on the Shield: /data/cuda-toolkit-x.0 (where x is the CUDA version being installed).

A rooted device. You will need to deploy and run the app using the built-in Android shell (adb shell). This can be a pain in the neck; I suggest reading some tutorials on how to root the Shield.



Code Example

Let’s pick the simplest CUDA example ever…the element-wise sum of two vectors:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

static const int N = 256;

__global__ void vectorAdd(float *a, float *b, float *c)
{
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

int main(void)
{
    float *a, *b, *c;
    float *a_dev, *b_dev, *c_dev;
    int i;
    int size_n = N * sizeof(float);

    a = (float *) malloc(size_n);
    b = (float *) malloc(size_n);
    c = (float *) malloc(size_n);

    for (i = 0; i < N; i++) {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }

    CheckError( cudaMalloc((void**) &a_dev, size_n) );
    CheckError( cudaMalloc((void**) &b_dev, size_n) );
    CheckError( cudaMalloc((void**) &c_dev, size_n) );

    CheckError( cudaMemcpy(a_dev, a, size_n, cudaMemcpyHostToDevice) );
    CheckError( cudaMemcpy(b_dev, b, size_n, cudaMemcpyHostToDevice) );

    // One block of N threads: each thread sums one element.
    vectorAdd<<<1, N>>>(a_dev, b_dev, c_dev);
    CheckError( cudaThreadSynchronize() );
    CheckError( cudaGetLastError() );

    CheckError( cudaMemcpy(c, c_dev, size_n, cudaMemcpyDeviceToHost) );

    for (i = 0; i < N; i++)
        printf("Sum of %f + %f = %f\n", a[i], b[i], c[i]);

    CheckError( cudaFree((void*) a_dev) );
    CheckError( cudaFree((void*) b_dev) );
    CheckError( cudaFree((void*) c_dev) );
    free(a);
    free(b);
    free(c);

    CheckError( cudaDeviceReset() );
    return 0;
}

CheckError is just a helper function that checks the return value of each CUDA call. A simple implementation could be:

#include <string>
#include <cstdio>
#include <cstdlib>

void CheckError(cudaError_t value)
{
    static std::string error_msg = "Error %s at line %d in file %s\n";
    cudaError_t _m_cudaStat = value;
    if (_m_cudaStat != cudaSuccess) {
        fprintf(stderr, error_msg.c_str(),
                cudaGetErrorString(_m_cudaStat), __LINE__, __FILE__);
        exit(1);
    }
}

Now let’s compile the CUDA vector add for the Shield.

Compilation

Deploying an APK from a GUI (Android Studio, Eclipse, Visual Studio) is pretty straightforward. However, it is not the same for a manual CUDA compilation step.

The CUDA compilation process works essentially the same way as a C++ compilation. However, targeting an Android device requires some specific compilation parameters. I have collected them for you in the following Makefile:

# For LIBS_ANDROID I used android-19, but you can replace it with any
# version available in your NDK.
LIBS_ANDROID=$(NDK_ROOT)/android-ndk-r10c/platforms/android-19/arch-arm/usr/lib
LIBS_ARM=$(NDK_ROOT)/android-ndk-r10c/sources/cxx-stl/gnu-libstdc++/4.6/libs/armeabi-v7a

GCC=$(NDK_ROOT)/toolchains/arm-linux-androideabi-4.6/gen_standalone/linux-x86_64/bin/arm-linux-androideabi-g++ \
    -L$(LIBS_ANDROID) -landroid -lc -lstdc++ -lm -lgcc \
    -L$(LIBS_ARM) -lgnustl_shared -O3

NVCC=$(CUDA_TOOLKIT_ROOT)/bin/nvcc -ccbin $(GCC) \
    --library-path $(LIBS_ANDROID) --library android \
    -target-cpu-arch=ARM -m32 -arch=sm_32 -Xptxas '-dlcm=ca' \
    -target-os-variant=Android

vectorAdd: vectorAdd.cu
	$(NVCC) -cudart static -o vectorAdd vectorAdd.cu

clean:
	rm -rf vectorAdd

The Makefile above compiles vectorAdd.cu into an executable for Android. Please note that I am using NDK version r10c and targeting a specific Android version (android-19). These are not fixed parameters; you can change them in your Makefile according to your NDK installation.
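The Makefile assumes NDK_ROOT and CUDA_TOOLKIT_ROOT are already defined. A build session might look like the following sketch; the two exported paths are only illustrative and must match your own installation:

```shell
# Illustrative paths -- adjust to where TADP installed the NDK and toolkit.
export NDK_ROOT=$HOME/NVPACK
export CUDA_TOOLKIT_ROOT=/usr/local/cuda-6.5
make vectorAdd
# Sanity check: the output should be a 32-bit ARM binary, not a host one.
file vectorAdd
```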

Profiling

Now that the Android executable has been compiled, it must be deployed onto the Shield. The adb push command can be used to upload the compiled file:

user@user:~$ adb push vectorAdd /data/tmp

As a side note, please keep in mind that to upload a file onto the Shield you need a rooted device and permission to write to the data folder (or any other folder in the Shield filesystem).
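For completeness, this is roughly the session I use to prepare the target folder before pushing; the exact commands depend on your Android version and root method, so treat it as a sketch:

```shell
adb shell            # open a shell on the device
su                   # become root (requires a rooted device)
mkdir -p /data/tmp   # create the target folder if it does not exist
chmod 755 /data/tmp  # make sure we can write and execute in it
exit
```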

The last step is to collect the data. Let’s access the Android shell and run nvprof:

user@user:~$ adb shell
root@shieldtablet:/ # cd /data/tmp
root@shieldtablet:/data/tmp # export TMPDIR=./
root@shieldtablet:/data/tmp # ../cuda-toolkit-x.y/bin/nvprof -o output.nvvp --analysis-metrics ./vectorAdd

x.y in the cuda-toolkit-x.y folder will be 6.0 or 6.5, depending on the toolkit version you previously installed. The export TMPDIR is needed to set a temporary folder where nvprof saves its local data. Finally, we profile the application by calling nvprof on the uploaded CUDA program. Thanks to the --analysis-metrics option, the generated output.nvvp file can be imported into the NVIDIA Visual Profiler (on Linux, available via the nvvp command in the shell; on Windows, in the folder C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.0\libnvvp\nvvp.exe).

To download the file from the Shield, just execute this command from your host OS:

user@user:~$ adb pull /data/tmp/output.nvvp ./

Once you have the metric file locally, you can import it into the Visual Profiler to analyze the metrics of your CUDA program.