In my previous blog post, I discussed just how powerful the hardware underneath today's high level virtual machines really is; relativistic quantum mechanics can be simulated on home hardware when implemented close to the hardware logic. Far from locking developers into narrow platforms, the virtual architecture of LLVM and OpenCL should work across practically all modern systems.
Though low level, this architecture is still essentially universal. We have one or more processor chips. Each chip has one or more processing elements and a small amount of cache. Some cache is private to each processing element, while some is shared between elements. Processors can have some form of SIMD. Tie this special-purpose workspace to a general C workspace, and we have a cross-platform low level virtual machine.
Even when dealing in SIMD operations, the vector types were designed with forward and backward compatibility in mind: the compiler can break a vector down into pieces if it exceeds the hardware's SIMD width, and elementary types can be packed into vectors wider than the largest SIMD registers shipping today. The OpenCL compilers even make their own attempts at auto-vectorization, and that support will only get better in the future. Manual vectorization always remains an option, though.
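To make that splitting concrete, here is a plain-C sketch (illustrative only; the real lowering happens inside the OpenCL compiler) of how an 8-wide vector add can be expressed as two 4-wide operations when the target's widest SIMD registers hold four floats. The types and function names are my own stand-ins, not OpenCL's built-ins:

```c
#include <stddef.h>

/* Stand-ins for OpenCL's float4/float8 vector types. */
typedef struct { float v[4]; } float4;
typedef struct { float4 lo, hi; } float8;

/* One 4-lane add: the widest operation this hypothetical target supports. */
static float4 add4(float4 a, float4 b) {
    float4 r;
    for (size_t i = 0; i < 4; ++i)
        r.v[i] = a.v[i] + b.v[i];
    return r;
}

/* An 8-wide add is still legal source code; on a 4-wide target the
 * compiler can lower it to two 4-wide adds, exactly as done by hand here. */
float8 add8(float8 a, float8 b) {
    float8 r;
    r.lo = add4(a.lo, b.lo);
    r.hi = add4(a.hi, b.hi);
    return r;
}
```

The same source stays valid whether the hardware is 4-wide, 8-wide, or wider; only the lowering differs per target.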
For the vast majority of commercial and personal applications, low abstraction design is tricky, tedious overkill that costs programmers' time with no benefit in the end product. However, the same can be said of attempting to implement computationally intensive programs via today's popular high level APIs. Many SDK developers seem determined to take the low level out of low abstraction programming interfaces, as in the case of the tragically hobbled RenderScript for Android.
“...Android's needs are very different than what OpenCL tries to provide.
OpenCL uses the execution model first introduced in CUDA. In this model, a kernel is made up of one or many groups of workers, and each group has fast shared memory and synchronization primitives within that group. What this does is cause the description of an algorithm to be intermingled with how that algorithm should be scheduled on a particular architecture (because you're deciding the size of a group and when to synchronize within that group)...”
(Thanks to openclblog.com for the reference and their post on the topic.)
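The execution model the quote describes can be emulated in plain C for illustration (this is not OpenCL itself, and the function is my own sketch): work-items are partitioned into groups of a chosen size, each group gets its own fast scratch storage, and the algorithm ends up written around that group size, which is exactly the intermingling the quote objects to:

```c
#include <stddef.h>

/* Emulation of a grouped reduction: each "group" of work-items sums its
 * slice of the input into group-private scratch, then publishes one
 * result.  The choice of group_size is baked into the algorithm itself. */
void group_reduce(const float *in, float *group_sums,
                  size_t n, size_t group_size) {
    size_t n_groups = n / group_size;       /* assume n divides evenly */
    for (size_t g = 0; g < n_groups; ++g) {
        float scratch = 0.0f;               /* stands in for fast local memory */
        for (size_t i = 0; i < group_size; ++i)   /* the group's work-items */
            scratch += in[g * group_size + i];
        /* in a real kernel, a group barrier would sit here before the
         * group publishes its partial result */
        group_sums[g] = scratch;
    }
}
```

Google reads that coupling as a flaw; the rest of this post argues it is precisely the control we reached for.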
To the folks at Google: We're big girls and boys, and we can handle it, thank you. When a programmer reaches for low level control for a computationally intensive application, we actually want that control. We don't reach for it to write a notepad, a web interface application, or usually even to generate graphics; that would be a waste of time and effort. If we reach for OpenCL or its equivalent, it's because we've weighed the cost of low level micro-management against its power and judged the trade-off well worth it for that very particular situation, in the very limited cases where we need the power.
As programmers, we are perfectly aware that the micro-management we reached for becomes our responsibility as well. Assembly, by contrast, is usually not an option, because assembly level programming locks us into every specific of the first platform we design for, whereas the CUDA paradigm does not.
I'm writing quantum mechanics simulation kernels for a game physics engine. My OpenCL kernels were written from the ground up with consideration for group size scaling. My output matrices necessarily depend in a complicated way on my input matrices, and RenderScript's way of matching one input to one output, with no group size control, has made it very difficult to port what I've written. I even designed my host code to rewrite my kernels, rescaling the SIMD vectorization for different physical parameters.
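One way host code can "rewrite" a kernel, sketched below with a made-up kernel body (this is not my engine's actual code, just an illustration of the technique): since OpenCL compiles kernels from source strings at run time, the host can splice the vector width into the source and rebuild the same algorithm as float4, float8, and so on per device or per parameter set:

```c
#include <stdio.h>

/* Generate OpenCL kernel source with the SIMD vector width spliced in.
 * The kernel body is a trivial stand-in; the point is that "float%d"
 * becomes float4, float8, ... before the source is handed to the
 * OpenCL compiler.  Returns the length written, as snprintf does. */
int make_kernel_src(char *buf, size_t cap, int width) {
    return snprintf(buf, cap,
        "__kernel void scale(__global float%d *x, float k) {\n"
        "    x[get_global_id(0)] *= k;\n"
        "}\n", width);
}
```

The generated string would then go to the usual OpenCL program-build path; RenderScript's ahead-of-time model offers no equivalent hook.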
Why go to the effort? Because I had a project for which it was worth reaching for low level control in the first place. I knew micro-management also became my responsibility, but I didn't want to be locked into assembly.
My OpenCL kernels run fast enough on my laptop to suggest that I might be close to pixel-by-pixel simulation on a modern smart phone, except that porting to RenderScript is impossible! Please don't give me a hobbled NDK because you're worried that I won't understand the interface and then down-vote it out of ignorance. We're developers!
As developers, if we're reaching for RenderScript at all, it's because we want low level control. Nobody would write notepad in OpenCL, except for the person who specifically reaches for OpenCL because they have an idea for the best notepad ever that depends on low level design. That programmer knows that low level design is then also her or his responsibility, but nobody would bother to implement anything worth that effort in assembly.
If SDK developers are going to try to make computationally intensive programming easy for us, let alone simplify control of what's going on at the native level, we should at least have documentation! RenderScript is perfectly opaque. The organization of the work space, for example, will trip up programmers who have no manual to read.
In trying to convert my kernels, I spent an hour or two confirming that large global matrices are effectively confined to the “root()” function and are not accessible in any easy way from functions called from root(), even when pointers are passed. How large can a global array be before it's sequestered? In which level or levels of cache does it reside? I couldn't tell you, because there's no manual.
I use Android. I love Android. I would love to produce a version of my game physics engine for Android, and it kills me that RenderScript's designers are so worried about me not understanding low level design that they've squirreled away my ability to control it when I want it. Now I'm fighting the intended convenience of the SDK for the very control Google was worried I couldn't handle when I reached for low level design in the first place.
Developers, if you have a computationally intensive process for which entries in an output array depend one-to-one on entries in an input array, then RenderScript seems aimed at making a good compromise between the CUDA paradigm and the Android Runtime, and it can likely outperform native C. However, it is absolutely no replacement for OpenCL.
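The contrast between the two shapes of problem can be sketched in plain C (both functions are hypothetical illustrations, not RenderScript or my engine's code). The first is the one-to-one, per-element shape RenderScript's model expresses naturally; the second is the coupled shape it cannot, because every output mixes every input, as in a matrix-vector product:

```c
#include <stddef.h>

/* One-to-one: out[i] depends only on in[i].  This maps cleanly onto a
 * per-element RenderScript-style kernel. */
float map_one(float x) {
    return 2.0f * x + 1.0f;
}

/* Coupled: y = A * x for an n-by-n matrix A.  Every entry of the input
 * vector touches every entry of the output, so a strict one-input,
 * one-output model with no group control fits poorly. */
void coupled(const float *A, const float *x, float *y, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        float acc = 0.0f;
        for (size_t j = 0; j < n; ++j)
            acc += A[i * n + j] * x[j];   /* all inputs feed y[i] */
        y[i] = acc;
    }
}
```

If your workload looks like `map_one`, RenderScript may serve you well; if it looks like `coupled`, you want OpenCL's work groups.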
Google, I see the potential of RenderScript, but I want my OpenCL!