Years ago, AMD set forth on a crusade to seamlessly blend multiple functions, which were normally spread throughout disparate regions of a system, into one all-encompassing chip. This “system on a chip” approach wasn’t anything new, but AMD has arguably been the leader in this field, often integrating components long before its rival Intel.

This process started in a straightforward manner with the integration of a memory controller into the CPU die. Subsequent iterations have gradually given birth to what we now call Accelerated Processing Units (or APUs), which have incorporated a reasonably powerful graphics processor into the same package as well. However, AMD isn’t stopping there. Its ultimate goal is to create what’s now called a Heterogeneous System Architecture (HSA), in which all parts of the processing pipeline, be it the parallel processing capabilities of a GPU or the serial-centric focus of the CPU, communicate in harmony without any of the latency associated with multi-chip approaches.

While Trinity and its predecessor Llano took the first tentative steps towards this HSA approach, they featured very few architectural improvements that could be considered next generation in nature. However, AMD’s upcoming Kaveri APUs (due in the second half of 2013) will be the first real leap towards the realization of the company’s long-term goals. One of their key elements will be heterogeneous Uniform Memory Access, or hUMA, which aims to eliminate some of the latent bottlenecks in the HSA approach.

While the last 10 years have seen an almost ludicrous theoretical increase in integer and floating point performance on GPUs, those metrics have largely stagnated on the CPU side. In essence, the GPU can offer up to ten times the throughput of a high-end CPU in highly parallelized workloads.
Now, that may sound like a significant, nearly unbridgeable chasm between two disparate elements within a system, but outside of theoretical performance figures, the GPU has struggled to reach its full potential. We also can’t forget that serial workloads are still largely dominated by the CPU.

Outside of its obvious superiority in processing 3D elements within games and OpenGL software, actually harnessing the GPU’s power has always been an issue for programmers. This is why very few applications take advantage of the GPU’s compute resources. hUMA aims to level that playing field by allowing the graphics-oriented elements of this equation to play a greater role in overall system performance.

hUMA may sound like a new concept, but its roots are firmly planted in the past. In many ways it is the next evolutionary step for the Unified Memory Architecture that AMD was instrumental in pioneering in the desktop market nearly a decade ago. This time, however, the architectural enhancements are focused on facilitating communication between the CPU and GPU.

Instead of the GPU being used for some programs and the CPU for others, hUMA gives programmers the ability to leverage both at the same time, making the most of each processor’s strengths. More importantly, software won’t have to worry about performing the hand-off, since it is handled natively within the APU’s architecture.

As one might expect from its name, hUMA accomplishes this through broad-scale heterogeneous memory integration across the APU’s processing stages. Before hUMA, the x86 processing cores and the graphics architecture each had their own memory controllers and addressable memory pools, even within AMD’s Trinity and Llano APUs.
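To make the difference concrete, here is a deliberately simplified Python sketch. It is purely illustrative and assumes a toy model of memory: `MemoryPool`, `gpu_double`, `split_pools` and `unified_pool` are hypothetical names of our own, not part of AMD’s hUMA software stack or any real driver API. With separate pools, data has to be copied over to the GPU’s pool and back again; with a single shared address space, the “GPU” works directly on the buffer the “CPU” already allocated.

```python
# Toy model for illustration only: MemoryPool, gpu_double, split_pools and
# unified_pool are hypothetical names, not AMD's hUMA API or any real driver.

class MemoryPool:
    """A simple memory pool that stores buffers (lists) by integer address."""

    def __init__(self, name):
        self.name = name
        self.buffers = {}

    def alloc(self, data):
        """Place a copy of `data` in this pool and return its address."""
        addr = len(self.buffers)
        self.buffers[addr] = list(data)
        return addr


def gpu_double(buf):
    """Stand-in for a parallel GPU kernel: doubles every element in place."""
    for i, x in enumerate(buf):
        buf[i] = x * 2


def split_pools(data):
    """Pre-hUMA style: CPU and GPU each own a pool, so data is copied both ways.

    Returns the result as seen by the CPU and the number of CPU<->GPU copies.
    """
    cpu_pool, gpu_pool = MemoryPool("cpu"), MemoryPool("gpu")
    transfers = 0
    cpu_addr = cpu_pool.alloc(data)
    gpu_addr = gpu_pool.alloc(cpu_pool.buffers[cpu_addr])  # copy CPU -> GPU
    transfers += 1
    gpu_double(gpu_pool.buffers[gpu_addr])
    cpu_pool.buffers[cpu_addr] = list(gpu_pool.buffers[gpu_addr])  # copy GPU -> CPU
    transfers += 1
    return cpu_pool.buffers[cpu_addr], transfers


def unified_pool(data):
    """hUMA-style idea: one shared address space, so no CPU<->GPU copies."""
    shared = MemoryPool("shared")
    addr = shared.alloc(data)
    gpu_double(shared.buffers[addr])  # the "GPU" works on the same buffer in place
    return shared.buffers[addr], 0    # the "CPU" reads the result at the same address
```

For an input of `[1, 2, 3]`, both functions return `[2, 4, 6]`; the difference is that `split_pools` reports two copies (one in each direction) while `unified_pool` reports none, which is exactly the redundancy hUMA is designed to remove.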
In some cases, the amount of memory dedicated to each could be adjusted by the user, but for the most part there was nothing dynamic about the split, and efficiency was lost.

hUMA, on the other hand, allows the GPU to enter the world of unified memory by linking it to the same memory address space as the CPU. The result is an intelligent computing architecture that enables the x86 cores, GPU stages and other sub-processors to work in harmony on a single piece of silicon, within a unified memory space, while processing tasks are dynamically directed to the best-suited processor.

This wasn’t an easy accomplishment by any stretch of the imagination. Not only did AMD have to update the GPU’s instruction set so it could communicate more effectively with system memory, but hUMA also opened up a world of potential issues, since programmers would have to come to grips with what could have been a tricky balancing act.

As we’ll discuss a bit later, the programming issue was resolved, resulting in a number of noteworthy advantages for systems with hUMA. While this approach may not allow complete unification between a discrete GPU and its host CPU, it has far-reaching implications for the APU market and its viability against Intel’s upcoming Haswell architecture.

In systems without hUMA, both processors could be used in parallel, but the process was inefficient. It involved playing hot potato with large amounts of data copied back and forth between two memory address spaces, creating redundancy where AMD felt there shouldn’t be any. hUMA instead passes data through the uniform memory interface, where both processors can access it in place, so the handoff is much quicker. It isn’t completely shutting out the CPU, either.
Rather, think of this as an on-the-fly load-balancing act between two fully integrated system components.

In our briefing, AMD said this approach should simplify the art of programming, allowing programmers to spend their time delivering the best possible experience to the end user. That isn’t to say these processing stages weren’t communicating before. However, the relationship will now be more like close siblings having a friendly chat than a divorced couple in the midst of a tug-of-war.

With all of the technical elements pushed aside, what really matters is how this technology will make it into the hands of you and me. Thus far, GPU compute has been largely relegated to the sidelines, since programmers need special languages, tools and memory models to unlock its performance capabilities.

One of the main goals here is to get developers to buy in. Without software that supports hUMA, it will quickly become yet another standard cast aside before realizing its full potential. To earn this sometimes hard-to-attain stamp of approval from the development community, AMD has worked to make programming for hUMA-based systems as efficient as possible. It is fully compatible with industry-standard programming languages and platforms like C++, .NET and Python, so developers can use existing methodologies to attain optimal results.

When designing hUMA, AMD asked a simple question: how do we leverage the relative strengths of our APUs without reinventing the wheel? By creating a direct link between the CPU and GPU, it may have accomplished just that. Instead of a Berlin Wall-like partition between these architectural elements, future APUs will be able to dynamically distribute tasks to the best-suited co-processor in a way that’s completely transparent to the end user.
Naturally, this all hinges on acceptance from the developer community, but with its support for industry-standard programming languages, AMD seems to have that base well covered. With hUMA in place, Kaveri will hopefully be given its chance to shine.