Our legions of dedicated fans (hi Mom) may have noticed a dry spell in the posts of late. This is partly because I spent last week in beautiful Vancouver, B.C., attending the annual Siggraph conference. In between time spent watching the year’s best computer animation and learning about morphological antialiasing, I picked up some stuff that’s specifically applicable to graphics development on iOS. The following are my notes from “Beyond Programmable Shading”, a course about GPU utilization. Note, this is somewhat advanced stuff; if you’re not writing your own shaders, this may not be the blog post for you.
Graphics implementations seem to run in cycles. We go from software graphics to fixed function hardware and back again. Early OpenGL was all about fixed function, but with programmable shaders we’re right back in software. Of course there’s a tradeoff either way; it’s all about flexibility versus speed.
Working on a mobile device, our primary concern is power use. We’ve all seen games that drain the battery in a half hour of play time. The somewhat informed assumption is that doing parallelizable work on the GPU (graphics or otherwise) will always be a win. The reality is more complicated. CPUs take a higher voltage per core, but GPUs have many more cores. (Fixed function hardware, such as a floating point unit, is the cheapest to operate; it’s very fast and very inflexible). Offloading work onto the GPU is only a win if it takes less power overall — not just the power taken to do the work on those cores versus the CPU’s cores, but also the CPU power it takes to upload the data and read it back. For small tasks, this can dominate the time spent running code on the GPU.
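To make that accounting concrete, here is a minimal sketch of the break-even check. The function name and every number in it are made up for illustration (arbitrary energy units, not measurements from any device); the only point is that the transfer costs sit on the GPU side of the ledger.

```python
def offload_saves_energy(cpu_cost, gpu_cost, upload_cost, readback_cost):
    """Return True if doing the work on the GPU costs less energy overall.

    All arguments are in the same arbitrary energy unit. The upload and
    readback costs are paid by the CPU but only exist if we offload.
    """
    total_gpu_path = gpu_cost + upload_cost + readback_cost
    return total_gpu_path < cpu_cost


# Hypothetical small task: the GPU does the math cheaply, but moving the
# data there and back costs more than just doing it on the CPU.
small_task = offload_saves_energy(cpu_cost=1.0, gpu_cost=0.2,
                                  upload_cost=0.6, readback_cost=0.6)

# Hypothetical large task: same transfer overhead, far more work to amortize it.
large_task = offload_saves_energy(cpu_cost=100.0, gpu_cost=20.0,
                                  upload_cost=0.6, readback_cost=0.6)
```

With these placeholder numbers the small task loses and the large one wins; real figures would have to come from profiling on the actual hardware.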
Moving forward, let’s assume that we have good reason to run code on the GPU — like, say, graphics. We can’t control the voltage the chip takes when in use, but we can control how often it’s in use. Sounds like a no-brainer, but the best thing we as software developers can do to minimize power use is to minimize how long the chip spends running.
How can we do this? First, cap your frame rate. The fewer frames per second you draw, the more time the chip spends idle. If you’re writing a graphically complex game that hits 45fps on a good day, you may not think about this; but you could be getting extremely high frame rates on easy content like menus. This can be even worse than expected, because working that fast can cause the chip to heat up, triggering throttling meant to avoid excessive temperatures. That means that when the user closes the menu and gets to the good stuff, you’ll no longer be capable of rendering at as high a frame rate as you’d like.
Now that your frame rate is low, optimize the time you spend rendering a frame. Same as before: the less time spent rendering, the more time the chip is idle. Don’t stop optimizing once you hit 60fps; further performance gains, combined with a capped frame rate, will really help power consumption.
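A quick back-of-the-envelope sketch of why the cap and the optimization work together. This is a toy model, assuming the chip is either busy rendering or fully idle; the function names and the 2 ms menu frame are hypothetical.

```python
def idle_seconds_per_frame(render_seconds, target_fps):
    """Time left over in each frame once rendering is done, under a frame cap."""
    frame_budget = 1.0 / target_fps
    return max(0.0, frame_budget - render_seconds)


def busy_fraction(render_seconds, target_fps):
    """Share of wall-clock time spent rendering when capped at target_fps."""
    frame_budget = 1.0 / target_fps
    return min(1.0, render_seconds * 1.0 / frame_budget)


menu_render = 0.002  # hypothetical: a simple menu takes 2 ms to draw

# Uncapped, the menu would redraw the moment it finishes: busy all the time.
# Capped at 30fps, the same menu keeps the chip busy only a sliver of each frame.
capped = busy_fraction(menu_render, 30)

# Halving the render time under the same cap halves the busy share again.
capped_and_optimized = busy_fraction(menu_render / 2, 30)
```

If a frame takes longer than the budget, the idle time bottoms out at zero, which is the case where the cap buys you nothing and only optimization helps.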
Another way to keep the GPU idle is to coalesce work in the frame. Rather than computing the player’s position, then rendering the player, then computing the enemy’s position, then rendering the enemy, and so on, do all your rendering back to back. This will maximize the unbroken stretch of time the GPU can power off. It’s particularly important to keep the idle time in one large chunk rather than many small ones, because there is some latency associated with switching parts or all of the chip on and off.
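Here is a small sketch of the difference between the two frame layouts. It only models the order of work, labelling each step as "cpu" or "gpu" and counting how many separate bursts of GPU activity a frame contains; the entity list and helper name are invented for the example.

```python
def gpu_bursts(schedule):
    """Count contiguous runs of 'gpu' steps in one frame's schedule."""
    bursts = 0
    previous = None
    for step in schedule:
        if step == "gpu" and previous != "gpu":
            bursts += 1
        previous = step
    return bursts


entities = ["player", "enemy", "pickup"]  # hypothetical scene contents

# Interleaved: compute one entity on the CPU, draw it, move to the next.
interleaved = []
for _ in entities:
    interleaved += ["cpu", "gpu"]

# Coalesced: do every CPU-side update first, then all rendering back to back.
coalesced = ["cpu"] * len(entities) + ["gpu"] * len(entities)

interleaved_bursts = gpu_bursts(interleaved)
coalesced_bursts = gpu_bursts(coalesced)
```

The interleaved frame wakes the GPU once per entity, while the coalesced frame wakes it once, leaving the rest of the frame as a single idle stretch.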
There are plenty of ways to keep your GPU code fast; you’ve probably seen some of them if you’ve read anything about optimizing shaders. One common tip is to minimize branching. I learned why: when a GPU runs conditionals, it actually evaluates both branches, and not in parallel. For an if/else, it simply masks off writes on the cores that don’t meet the condition; runs the first branch on all cores; reverses the mask; and runs the second branch. That’s potentially a high price to pay! It pays to get clever with sign(), swizzling, and so on. Fortunately GLSL gives you lots of ways to avoid branching, if you’re willing to take the time to figure them out.
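To illustrate, here is a rough sketch in Python rather than GLSL, so it can be run anywhere. step() and mix() are plain stand-ins for the GLSL built-ins of the same names, the shading rule is invented, and masked_if_else() is a toy model of the two-pass masking described above, not a description of any particular GPU.

```python
def step(edge, x):
    """Stand-in for GLSL's step(): 0.0 when x < edge, otherwise 1.0."""
    return 0.0 if x < edge else 1.0


def mix(a, b, t):
    """Stand-in for GLSL's mix(): linear blend from a (t=0) to b (t=1)."""
    return a * (1.0 - t) + b * t


def shade_branchy(x):
    # Hypothetical shading rule, written with a conditional.
    if x < 0.5:
        return x * 2.0
    return 1.0 - x


def shade_branchless(x):
    # Same rule: compute both sides, then let step() pick one via mix().
    return mix(x * 2.0, 1.0 - x, step(0.5, x))


def masked_if_else(lanes, condition, then_fn, else_fn):
    """Toy model of an if/else run across many cores with a write mask."""
    mask = [condition(v) for v in lanes]
    out = list(lanes)
    # Pass 1: every lane runs the first branch; only masked-in lanes keep it.
    for i, v in enumerate(lanes):
        result = then_fn(v)
        if mask[i]:
            out[i] = result
    # Pass 2: reverse the mask and run the second branch on every lane.
    for i, v in enumerate(lanes):
        result = else_fn(v)
        if not mask[i]:
            out[i] = result
    return out


lanes = [0.1, 0.25, 0.5, 0.75, 0.9]
masked = masked_if_else(lanes, lambda v: v < 0.5,
                        lambda v: v * 2.0, lambda v: 1.0 - v)
```

The branchless version still computes both sides, but as straight-line math with no mask flipping. Whether that is actually faster depends on the hardware and the shader compiler, so treat it as something to measure rather than a rule.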
The most time-consuming operation in a shader is reading from memory. GPUs utterly lack the sophisticated caching mechanisms CPUs have; that’s the price for massive parallelism. GPUs are clever about hiding the stalls caused by memory loads by switching to work on other units (vertices or fragments, in our common cases); the trick is making sure there’s enough math for them to do to take up the time. Counterintuitively, a good strategy is often to recompute data rather than taking the time to load it. Those little cores are really fast, and reading from memory is really slow! You’d be surprised how many cosines you can calculate in the time it takes to read from your lookup table.
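A toy model of that latency hiding, under loud assumptions: one unit issues a load and waits, every other unit in flight does a fixed amount of math in the meantime, and all the cycle counts below are placeholders rather than figures for any real GPU.

```python
def exposed_stall_cycles(load_latency, units_in_flight, math_cycles_per_unit):
    """Cycles of memory latency left uncovered by other units' math.

    While one unit waits on its load, the remaining units each contribute
    math_cycles_per_unit of useful work. Whatever latency is left over
    after that is time the cores sit stalled.
    """
    cover = (units_in_flight - 1) * math_cycles_per_unit
    return max(0, load_latency - cover)


# Hypothetical numbers: a 400-cycle load, 8 fragments in flight.
# A shader that is mostly lookups leaves little math to hide behind...
lookup_heavy = exposed_stall_cycles(load_latency=400, units_in_flight=8,
                                    math_cycles_per_unit=10)

# ...while one that recomputes its values has enough math to cover the wait.
math_heavy = exposed_stall_cycles(load_latency=400, units_in_flight=8,
                                  math_cycles_per_unit=60)
```

In this model the lookup-heavy shader leaves most of the load latency exposed, and the math-heavy one hides all of it, which is the intuition behind recomputing instead of loading.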
Bonus Notes on Vancouver
There are way more women than US-average wearing sheer tops. And a way higher incidence than I am used to of slight limps in both genders. Causation?