AFDS was full of talks on OpenCL. You missed them, just like me? Then you will be happy that they put many videos on YouTube!

Enjoy watching! As all videos are around 40 minutes, it is best to take a full day for watching them all. The first part is on OpenCL itself, the second on tools, the third on OpenCL usages, and the fourth on other subjects.

OpenCL itself

OpenCL 1.2

The topics discussed in this session include:
• Device fission
• Host access flags for memory objects
• New APIs for “memset” of buffers and images
• New image formats (1D, 1D_BUFFER, 1D_ARRAY and 2D_ARRAY)
• New API for image creation with a descriptor
• New API for GL interop interface
• Sampler-less image access
• Library support, with a new compile/link interface
• Querying kernel arguments: provides information about kernel arguments
• Printf support in kernels
• Memory object migration: provides an explicit method to assign the device on which an OpenCL memory object resides

OpenCL C++

With the success of programming models such as Khronos’ OpenCL, heterogeneous computing is going mainstream. However, these systems are low-level, even when considering them as systems programming models. For example, OpenCL is effectively an extended subset of C99, limited to the type-unsafe procedural abstraction that C has provided for more than 30 years. Computer systems programming has for more than two decades been able to do a lot better. One successful case in point is the systems programming language C++, known for its strong(er) type system, templates, and object-oriented abstraction features. In this talk we introduce OpenCL C++, an object-oriented programming model (based on C++11) for heterogeneous computing and an alternative for developers targeting OpenCL-enabled devices.

OpenCL tools

Advanced OpenCL Debugging and Profiling Using AMD Tools

Developing robust parallel computing applications is difficult. In this talk we will introduce the audience to AMD’s developer tools and display advanced debugging and profiling techniques that help locate hard-to-find OpenCL related bugs and performance issues.

AMD CodeXL Developer Tool

AMD’s Gabe Gravning and Gilad Yarnitzky showcase AMD CodeXL developer tool from the Experience Zone floor at AFDS 2012.

More info and download at: http://developer.amd.com/tools/hc/CodeXL/Pages/default.aspx

Quickly Optimize OpenCL Applications with SlotMaximizer

SlotMaximizer is a transformation tool that automatically tunes OpenCL™ kernels, helping to increase developer productivity. It helps developers obtain increased performance, higher throughput, and better hardware utilization from their kernels with minimal effort while maintaining a small, readable and maintainable code base. SlotMaximizer enables developers to focus on their original problems and algorithm strategies and leave the details of optimizing the code to the compiler. SlotMaximizer is already incorporated into the AMD Catalyst™ drivers as a preview and can be used by anyone developing applications using the APP SDK. It will be turned on by default to support application execution on end-user systems later in the year. This session will provide a user-oriented tutorial to Fusion developers. The presentation will first introduce the transformation principles applied by SlotMaximizer, and then concrete examples will be demonstrated.

MulticoreWare Task Manager (TM) – A Parallel Building Library for Heterogeneous Computing

TM is an open-source ULL intended to assist developers in optimally programming parallel software on heterogeneous computing systems to achieve the highest performance, throughput and utilization. Its primary function is to provide APIs for designing task-based applications and implementing dynamic workload balancing across the entire heterogeneous system. TM offers popular parallelism methods in both data-parallel and task-parallel forms. The complexities of dynamic task scheduling and heterogeneous hardware configurations are completely transparent to the developers. In the latest TM releases, we have enabled many new features such as Boost support, GMAC integration and TM server.

More info and download via http://www.multicorewareinc.com/index.php?option=com_content&view=article&id=65&Itemid=92

OpenCL usages and applications

OpenCL Acceleration of x264

x264 is the world’s most popular H.264 video encoder, and is highly optimized with hand written vectorized assembly code. This presentation will describe how we ported the lookahead (pre-encode) processing to OpenCL for improved performance and encode efficiency.

Accelerating OpenCV on AMD GPUs with OpenCL

OpenCV is a widely used library of programming functions for real-time computer vision. OpenCL is an open standard for parallel, cross-platform programming. Our work is to implement and maintain an OpenCL version of OpenCV in which all frequently used functions are implemented and optimized with OpenCL. So far, we have implemented and optimized over 50 core functions and an advanced application on AMD GPUs. With a number of optimization techniques, we achieved high performance on AMD GPUs. The talk will demonstrate performance improvements compared with a competing CUDA version. Furthermore, all functions implemented with OpenCL support ROI (Region of Interest), which is currently provided in the CUDA version. Additionally, our implementation supports more image types.

OpenCL Optimizations on ImageMagick: Convert, Edit, and Compose Images

ImageMagick is an open source software suite to create, edit, compose, or convert bitmap images. Our goal is to optimize it with OpenCL to significantly improve its image processing efficiency. So far we have achieved up to 15x speedup on some image processes on Trinity vs SandyBridge. First, a new component named OpenCL object manager was designed to manage all OpenCL objects and the environment, which will not only simplify code maintenance, but also enable further global optimizations. To reduce the fixed overhead of OpenCL initialization and enable better CL memory management, we also designed a standalone OpenCL process responsible for all OpenCL operations, such as kernel launches, outside each image processing command. Finally, the internal image format used by ImageMagick is bitmap, so the amount of data to be transferred is huge. We implemented GPU JPEG decode and encode so the data to move is significantly reduced. We have also utilized the AMD APU’s zero copy capability to improve the data transfer efficiency.

Performance Evaluation of AMD-APARAPI Using Real World Applications

Java APARAPI (Java A PARallel API) allows Java developers to take advantage of the computational power of GPU and APU devices by executing parallel Java code fragments on the GPU rather than being confined to the local CPU. This presentation evaluates the performance of APARAPI when executing parallel Java code on the GPU via OpenCL. Performance analysis is done by running real-world problems programmed in Java using Aparapi. Each program is also written in multi-threaded Java to allow a proper comparison. There will be around 15 real-world programs that are commonly used and well known. Some tuning has also been done in the APARAPI library.

clMAGMA: Heterogeneous High-Performance Linear Algebra with OpenCL

The use of GPUs is becoming pervasive in high-performance scientific computing. To further accelerate and enable this transition, fundamental libraries often must be redesigned to fully exploit the power that GPUs present. Challenges regarding the portability of the new developments also must be addressed. We present clMAGMA – an OpenCL port of the current state-of-the-art developments on “Matrix Algebra on GPU and Multicore Architectures” (MAGMA). clMAGMA incorporates well-established experiences from the LAPACK, ScaLAPACK, PLASMA, and MAGMA efforts. In particular, these are synchronization- and communication-avoiding algorithms, as well as DAG hybrid scheduling. The new developments, combined with the use of OpenCL, will further propel clMAGMA’s portability and impact on the nation’s software cyberinfrastructure, and thus benefit the use of AMD technologies in the forefront of parallel computing.

OpenCL Enabled Face Detection Plug-in for IrfanView

MulticoreWare’s Face Detection plug-in allows IrfanView, a popular freeware graphic viewer, to filter portraits from a large photo gallery. It employs an AdaBoost classifier as the core part and includes a few pre-processing steps such as JPEG decode, resizing and histogram generation. The operations construct a pipeline in which every step could be processed on CPU or GPU. The most computationally intensive part, Haar feature face detection, is a good problem to solve with the GPU. When the user specifies a large photo set to process, the photos could be processed in parallel with either CPU or GPU. Balancing the workload to maximize performance is a challenging problem. Specifically, the pipeline running either on CPU or GPU should minimize the data transfer between host and devices. In this session, we’ll describe our OpenCL porting of OpenCV’s object detection method for AMD’s GPUs and APUs, our optimizations on the face detection pipeline, and the performance speedup we have seen on APUs. Keywords: IrfanView, OpenCV, Face detection, AdaBoost.

Other interesting OpenCL related talks

IOMMUv2: The Ins and Outs of the Heterogeneous GPU Use

Using the GPU in a heterogeneous platform environment requires access to system memory that transcends the use scenarios of traditional IOMMU devices in system software. To that end, AMD introduced the IOMMUv2 device in the platform, which, in addition to the IOMMU functionality used by virtualization software, provides hardware services that can be utilized as an HSA MMU (Memory Management Unit) for more efficient but secure memory access in application software. This session provides an overview of the hardware device and its many uses in system software for virtualization and HSA.

GPU Acceleration of Interactive Large Scale Data Analytics

The extreme volume of unstructured data being generated worldwide that must be analyzed, abstracted and understood has for years fueled extensive research to create intuitive, meaningful insights into the data. Capitalizing on human beings’ innate ability to rapidly comprehend visual imagery, PNNL’s IN-SPIRE processes this data and presents the results to users in a variety of intuitive and interactive visualizations. Within IN-SPIRE users can interactively explore complex relationships between visualizations and their own ad hoc search criteria to discover meaningful insights. This process is both highly dynamic and computationally intensive, as users are continuously drilling down or widening their focus areas while requiring the computations to be accomplished at ‘interaction speed’ where time-to-solution is critical. In this session, we will explore the use of AMD’s Aparapi to accelerate critical computational analytics and user interactions through high performance GPU computations.

Bolt: A C++ Template Library for HSA

In this talk we describe a C++ template library optimized for AMD’s Heterogeneous System Architecture. In many cases developers will be able to create a single source code base which runs efficiently on both the CPU and the GPU. We provide examples that show a dramatic reduction in lines of code. Finally, we show how the library allows developers to easily access the unique capabilities of HSA, including shared virtual memory, tight CPU and GPU communication, and advanced queuing capabilities.
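To give a feeling for the "single source, few lines of code" style the abstract talks about, here is a minimal sketch in plain standard C++. It does not use Bolt itself: as far as I know Bolt's interface is modelled on the C++ Standard Library, so `std::transform` stands in for the library call here, and the names `Saxpy` and `saxpy` are made up for this example. It runs on the CPU only.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical functor for this sketch: computes a * x + y per element.
// In an STL-style heterogeneous library, a functor like this is the only
// "kernel" code the developer writes.
struct Saxpy {
    float a;
    float operator()(float x, float y) const { return a * x + y; }
};

// Applies the functor to every pair of elements from x and y.
// Assumes y has at least as many elements as x.
std::vector<float> saxpy(float a,
                         const std::vector<float>& x,
                         const std::vector<float>& y) {
    std::vector<float> out(x.size());
    // Stand-in for the library call: plain std::transform on the CPU.
    std::transform(x.begin(), x.end(), y.begin(), out.begin(), Saxpy{a});
    return out;
}
```

The whole computation is one functor plus one algorithm call, with no explicit buffer, queue or kernel-launch code. That is the kind of line-count reduction compared to hand-written OpenCL host code that the talk refers to.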

Slides: http://www.slideshare.net/hsafoundation/bolt-for-hsa-by-ben-sanders

An Overview of HSAIL

HSAIL is a new virtual byte code and virtual machine designed for parallel compute on heterogeneous devices. HSAIL makes it easy to compile high-performance code for both current and future architectures. HSAIL programs will run unchanged on future hardware. Unlike AMDIL, which is the graphics byte code, HSAIL has been architected to support modern high-level programming languages such as Java and C++. This talk will introduce HSAIL at a high level and go over the virtual machine. Next we will talk about the compilation model, the reasons for a byte code rather than an exposed ISA, and how HSAIL opens up HSA hardware to compiler and tool developers. We will review how HSAIL is different from PTX/LLVM and Java byte code. Finally we will go over one of HSAIL’s important aspects: the memory model. Unlike previous GPU byte codes, the HSAIL memory model uses a formal design based on acquire/release semantics.
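The acquire/release idea mentioned at the end of the abstract is not specific to HSAIL. As a rough illustration, here is a minimal sketch using standard C++11 atomics. It is not HSAIL code, and the function name `publish_and_consume` is made up for this example. One thread writes ordinary data and then sets a flag with release ordering. The other thread waits on the flag with acquire ordering and is then guaranteed to see the data.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper for this sketch: passes `value` from a producer
// thread to a consumer thread and returns what the consumer observed.
int publish_and_consume(int value) {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // flag that guards the payload
    int seen = -1;

    std::thread producer([&] {
        payload = value;                               // 1. write the data
        ready.store(true, std::memory_order_release);  // 2. publish it
    });

    std::thread consumer([&] {
        // 3. Spin until the release store becomes visible.
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        // 4. Everything written before the release store is visible here.
        seen = payload;
    });

    producer.join();
    consumer.join();
    return seen;
}
```

Without the release/acquire pair (for example with relaxed ordering on both sides), the read of `payload` would be a data race. A formal memory model lets compiler and tool developers reason about exactly this kind of guarantee.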

AMD and AccelerEyes Demo Matlab on AMD’s Trinity APU

AMD’s Sasa Marinkovic and AccelerEyes’ John Melonakos demo Matlab on OpenCL and the AMD Trinity APU, from the Experience Zone at AFDS 2012.

More?

All videos of AFDS are on http://www.youtube.com/playlist?list=PLA5581E4E4FF05061&feature=plcp

Want more information on OpenCL? Check out the rest of the blog, or learn OpenCL yourself by requesting a training in GPU-programming.