Project Sumatra Wiki

This page, with its child pages, contains design notes for Project Sumatra.

OpenJDK project page: http://openjdk.java.net/projects/sumatra

Repositories: http://hg.openjdk.java.net/sumatra/sumatra-dev/{scratch,hotspot,jdk,...} (repo info)

Developer list: http://mail.openjdk.java.net/mailman/listinfo/sumatra-dev

Goals

Enable Java applications to take advantage of heterogeneous processing units (GPUs/APUs)

Extend JVM JITs to generate code for heterogeneous processing hardware

Integrate the JVM data model with data types efficiently processed by such hardware

Allow the JVM to efficiently interoperate with high-performance libraries built for such hardware

Extend the JVM managed runtime to track pointers and storage allocation throughout such a system

Challenges

Here are some of the specific technical challenges.

mitigate the complexities of present-day GPU backends and layered standards

standards include: OpenCL, CUDA, Intel Phi, PTX, HSA (forthcoming), ...

FIXME: choose 1-3 of the standards (e.g., PTX, HSAIL/HSA) for initial backend development

build compromise data schemes for both the JVM and GPU hardware

define a Java model for "value types" which can be pervasively unboxed (like tuples or structs)

need to support flatter data structures (Complex values, vector and RGBA values, 2D arrays) from Java

need to support a mix of primitives and JVM-managed pointers

range of solutions: "don't"; like JNI array-critical; pinning read barrier; stack maps and safepoints in GPU

range of solutions: no pointers; pointers are opaque (e.g., indices into Java-side array); arena pointers; pinning read barrier

need a "foreign data interface" that is competent to interoperate (without copying) with standard sparse array packages

adapt (or extend if necessary) JNI as a foreign invocation interface that is competent to call purpose-built C code for complex GPU requests
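As an illustration of the "flatter data structures" and "opaque pointers" ideas above, here is a minimal sketch in plain Java. The class name `ComplexBuffer` and its methods are hypothetical and are not part of any Sumatra or JDK API. It stores complex values interleaved in a single `double[]` and refers to elements by integer index, which is one way to get a layout that could be handed to a device without per-element object headers.

```java
/**
 * Hypothetical sketch: complex numbers stored flat in one double[]
 * (re0, im0, re1, im1, ...) instead of as an array of boxed objects.
 * An element is named by an int index, which acts as an opaque "pointer".
 */
final class ComplexBuffer {
    private final double[] data; // interleaved real and imaginary parts

    ComplexBuffer(int length) {
        this.data = new double[2 * length];
    }

    int length() {
        return data.length / 2;
    }

    void set(int index, double re, double im) {
        data[2 * index] = re;
        data[2 * index + 1] = im;
    }

    double re(int index) {
        return data[2 * index];
    }

    double im(int index) {
        return data[2 * index + 1];
    }

    /** Multiplies element dst by element src in place, without allocating. */
    void multiply(int dst, int src) {
        double a = re(dst), b = im(dst), c = re(src), d = im(src);
        set(dst, a * c - b * d, a * d + b * c);
    }
}
```

The trade-off in this sketch is that callers lose the convenience of a `Complex` object type; a language-level value type, as proposed above, would aim to recover that convenience while keeping a flat layout.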

reduce data copying and inter-phase latency between ISAs and loop kernels

agreement of data structures will reduce copying

more flexible loop kernel container will allow loop kernel fusion

cope with dynamically varying mixes of managed parallel and serial data and code

use JVM dynamic compilation techniques to build customized kernels and execution strategies

optimize computation requests relative to online data

automatically (at each appropriate level of the system) sense load and distribute cleanly between CPU and GPUs

compile (online) JDK 8 parallel collection pipelines to data parallel compute requests

partition simple Java bytecode call graphs (after profile-directed inlining) into CPU and GPU
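For readers unfamiliar with JDK 8 parallel collection pipelines, the following sketch shows the kind of source-level construct involved. It uses only the standard `java.util.stream.IntStream` API; the class name `SaxpyPipeline` is made up for illustration. On a stock JDK this runs on CPU threads via the Fork/Join framework; whether and how such a pipeline would be compiled to a GPU request is exactly the open question on this page, so nothing here should be read as a description of an actual Sumatra code path.

```java
import java.util.stream.IntStream;

final class SaxpyPipeline {
    /** Computes out[i] = a * x[i] + y[i] with a parallel stream over indices. */
    static float[] saxpy(float a, float[] x, float[] y) {
        float[] out = new float[x.length];
        // Each index is independent, so the lambda body is a natural
        // candidate for a data-parallel loop kernel.
        IntStream.range(0, x.length)
                 .parallel()
                 .forEach(i -> out[i] = a * x[i] + y[i]);
        return out;
    }
}
```

The lambda touches only primitive arrays and captured locals, which is roughly the "simple" shape that seems most amenable to offloading.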

learn to efficiently flatten nested or keyed parallel constructs

apply existing technology on nested data parallelism (to JVM execution of GPU code)

apply existing technology on MapReduce (to JVM execution of GPU code)

ensure that Java views of flattened and grouped parallel data sets are compatible with GPU capabilities

efficiently implement "nonlinear streams" in JDK 8 parallel collections
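One common flattening technique from the nested data parallelism literature is to replace an array of arrays with a single values array plus an offsets array (a segmented representation). The sketch below illustrates that idea in plain Java; the class and method names are hypothetical, and this is not presented as the approach Sumatra will take.

```java
import java.util.stream.IntStream;

final class SegmentedSum {
    /** offsets[r] .. offsets[r + 1] delimits row r in the flat values array. */
    static int[] offsetsOf(int[][] rows) {
        int[] offsets = new int[rows.length + 1];
        for (int r = 0; r < rows.length; r++) {
            offsets[r + 1] = offsets[r] + rows[r].length;
        }
        return offsets;
    }

    /** Copies all rows into one contiguous array, in row order. */
    static int[] flatten(int[][] rows, int[] offsets) {
        int[] values = new int[offsets[rows.length]];
        for (int r = 0; r < rows.length; r++) {
            System.arraycopy(rows[r], 0, values, offsets[r], rows[r].length);
        }
        return values;
    }

    /** Sums each segment independently; one parallel task per row. */
    static long[] rowSums(int[] values, int[] offsets) {
        long[] sums = new long[offsets.length - 1];
        IntStream.range(0, sums.length).parallel().forEach(r -> {
            long s = 0;
            for (int i = offsets[r]; i < offsets[r + 1]; i++) {
                s += values[i];
            }
            sums[r] = s;
        });
        return sums;
    }
}
```

With ragged input such as rows of length 3, 0, and 2, the flat form is a five-element values array and offsets `{0, 3, 3, 5}`, so empty rows cost nothing but one offset entry.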

create a practical and predictable story for loop vectorization, presumably user-assisted, and with useful failure modes

build a low-level library of vector intrinsics (e.g., AVX-style) that can be called (manually) from Java

apply existing technology for loop vectorization

build user-assisted loop vectorizers for Java, possibly based on type annotations (JSR 308)

deal with exceptional conditions as they arise in loop kernels

allow GPU loop kernels to call back to CPU for infrequent edge cases (argument reduction, exceptions, allocation overflows, deoptimization of slow paths)

engineer a loop kernel container API which accounts for multiple CPU outcomes, and aggregates per kernel iteration (perhaps with continuation-passing style)
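A rough sketch of the "call back to CPU for infrequent edge cases" pattern, written in plain Java with no real GPU involved: a fast path handles the common case and merely flags iterations it cannot handle, and a serial pass afterwards runs a slow path for the flagged iterations. All names (`KernelWithFallback`, `fastPath`, `slowPath`) and the saturating policy for division by zero are assumptions chosen for illustration only.

```java
import java.util.stream.IntStream;

final class KernelWithFallback {
    /** Hypothetical fast path: returns false if iteration i needs the slow path. */
    static boolean fastPath(int i, int[] num, int[] den, int[] out) {
        if (den[i] == 0) {
            return false; // edge case the kernel declines to handle
        }
        out[i] = num[i] / den[i];
        return true;
    }

    /** Hypothetical slow path, run serially for flagged iterations only. */
    static void slowPath(int i, int[] num, int[] den, int[] out) {
        // Arbitrary illustrative policy: saturate instead of throwing.
        out[i] = num[i] >= 0 ? Integer.MAX_VALUE : Integer.MIN_VALUE;
    }

    static int[] run(int[] num, int[] den) {
        int n = num.length;
        int[] out = new int[n];
        boolean[] deferred = new boolean[n]; // per-iteration outcome record
        // "Kernel" phase: every iteration either completes or is deferred.
        IntStream.range(0, n).parallel()
                 .forEach(i -> deferred[i] = !fastPath(i, num, den, out));
        // Fallback phase: aggregate the deferred iterations and finish them.
        for (int i = 0; i < n; i++) {
            if (deferred[i]) {
                slowPath(i, num, den, out);
            }
        }
        return out;
    }
}
```

Recording one outcome per iteration, rather than aborting the whole kernel on the first problem, is the simplest form of the per-iteration aggregation described above; a continuation-passing design would generalize the boolean flag to a richer outcome value.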

define a robust and clear data-parallel execution model on top of the JVM bytecode, memory, and thread specifications

interpret (or adapt if necessary) the Java Memory Model (JSR 133) to the needs of data parallel programming

interpret (or adapt if necessary) the thread-based Java concurrency model (define GPU kernel effects in terms of bytecode execution by weakened quasi-threads)

investigate use of Java language constructs and programming idioms that can be effectively compiled for a data-parallel execution engine (such as a GPU)

potential candidate: lambda methods and expressions

other options?

investigate opportunities for GPU-enabled "intrinsic" versions of existing JDK APIs

candidates may be sort, (de)compression, CRC checking, search, convolutions, etc.
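To make the "existing JDK APIs" idea concrete, the sketch below calls two standard JDK 8 APIs that fall into the categories named above: `java.util.Arrays.parallelSort` (sort) and `java.util.zip.CRC32` (CRC checking). The wrapper class `IntrinsicCandidates` is hypothetical. The point is only that application code would keep calling the same API while the JVM substitutes an accelerated implementation; this page does not say that these particular methods have been chosen.

```java
import java.util.Arrays;
import java.util.zip.CRC32;

final class IntrinsicCandidates {
    /** Returns a sorted copy; callers would not change if the sort were offloaded. */
    static int[] sorted(int[] input) {
        int[] copy = input.clone();
        Arrays.parallelSort(copy);
        return copy;
    }

    /** Computes the CRC-32 checksum of a byte array. */
    static long crc32(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }
}
```

Both methods have simple array-in, value-out contracts, which is part of what makes APIs of this shape plausible intrinsic candidates.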

adopt and adapt insights from previous work on data-parallel Java projects

Fork/Join framework

Aparapi

Rootbeer

RIT Parallel Java

Terracotta

jcuda - Java bindings for CUDA

jocl - Java bindings for OpenCL

jogamp-jocl - JogAmp's Java bindings for OpenCL

FIXME: need a good list of references here



FIXME: Most of these items need their own wiki pages and/or email conversations

Roadmap

FIXME: In what order will we address these challenges?

Known investigations

FIXME: Add your work here!
