Saving this for future reference:
[20:57:38] <LordHavoc> the trick is to use UBO
[20:57:47] <LordHavoc> fallback if not present
[20:57:59] <LordHavoc> an Intel GPU with uniform buffers is probably going to do just fine with hardware
[20:58:10] <LordHavoc> cuts out all the pre-DX10 chips
[20:58:29] <LordHavoc> which have neither the uniform capacity (too few uniforms to animate a complex character model) nor the performance to really get a benefit
[20:58:58] <LordHavoc> from my time at Valve the "fast path" for geometry appears to be the following strategy:
[21:00:16] <LordHavoc> keep around two uniform buffers (GL_UNIFORM_BUFFER), two vertex buffers (GL_ARRAY_BUFFER), two index buffers (GL_ELEMENT_ARRAY_BUFFER), swap between them each frame, upload data to them using glBufferSubData before you issue your glDrawElements, do not reuse any part of them - always advance (as long as the offset for the new upload is at or past the previous one's end, it will keep the driver happy)
[21:00:30] <LordHavoc> make sure they are big enough to hold a significant amount of data
[21:00:46] <LordHavoc> if you run out of space, just fail over gracefully by switching buffers
[21:01:09] <LordHavoc> at every switch (toggling between the two sets), issue a glBufferData for each one with the same size but NULL data, this discards the contents
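Note to self: the bookkeeping described above (always advance, switch buffers when full, orphan at every switch) might be sketched roughly as below. This is only a CPU-side model under my own assumptions: `StreamPair`, `stream_init`, `stream_switch` and `stream_alloc` are hypothetical names, not anything from the log, and the `align` parameter is my addition (uniform buffer ranges generally need offsets that are a multiple of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT). The real GL calls appear only in comments.

```c
#include <stddef.h>
#include <stdbool.h>

/* Hypothetical bookkeeping for one pair of streaming buffers
   (e.g. the two GL_ARRAY_BUFFER objects). Not from the log. */
typedef struct {
    size_t capacity;     /* size in bytes of each buffer in the pair */
    size_t offset;       /* next free byte in the active buffer */
    int    active;       /* 0 or 1: which buffer is currently in use */
    bool   needs_orphan; /* true until the caller re-specifies the store */
} StreamPair;

void stream_init(StreamPair *s, size_t capacity) {
    s->capacity = capacity;
    s->offset = 0;
    s->active = 0;
    s->needs_orphan = true;
}

/* Toggle to the other buffer: once per frame, and again on overflow. */
void stream_switch(StreamPair *s) {
    s->active ^= 1;
    s->offset = 0;
    s->needs_orphan = true;
}

/* Reserve `size` bytes at an offset that only ever advances.
   Returns false if the request can never fit in one buffer. */
bool stream_alloc(StreamPair *s, size_t size, size_t align,
                  int *buffer, size_t *offset) {
    if (align == 0) align = 1;
    if (size > s->capacity) return false;

    /* Round the cursor up to the requested alignment. */
    size_t start = (s->offset + align - 1) / align * align;

    /* Out of space: switch buffers instead of reusing any region. */
    if (start > s->capacity || size > s->capacity - start) {
        stream_switch(s);
        start = 0;
    }

    *buffer = s->active;
    *offset = start;
    s->offset = start + size;
    return true;
}

/* Assumed usage with a live GL context (ids[] holds the two buffer names):
 *
 *   glBindBuffer(target, ids[buffer]);
 *   if (s.needs_orphan) {
 *       glBufferData(target, s.capacity, NULL, GL_STREAM_DRAW);
 *       s.needs_orphan = false;
 *   }
 *   glBufferSubData(target, offset, size, data);
 */
```

The same structure would presumably be instantiated three times, once each for the uniform, vertex and index buffer pairs.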
[21:01:18] <LordHavoc> Doom3 BFG Edition does this and its performance is great
[21:01:47] <LordHavoc> though Doom3 BFG Edition goes a step further - it does a glMapBuffer on them, spins up a ton of threaded jobs to generate everything, and then does a glUnmapBuffer when the jobs are done
[21:01:54] <LordHavoc> which skips the whole glBufferSubData part
[21:02:09] <LordHavoc> but in practice I find glBufferSubData is a lot easier to work with and quite performant
[21:02:33] <LordHavoc> whenever possible you should try to make sure all your glBufferSubData calls are in a quick burst though
[21:02:44] <LordHavoc> the driver likes it best if you issue no glDrawElements calls between them
[21:03:08] <LordHavoc> but it tolerates them pretty well
[21:03:17] <LordHavoc> so you can go either way, but the basic process is the same
[21:03:35] <LordHavoc> it just means that in practice if you can two-stage your render, it will be a performance win
[21:03:40] <LordHavoc> but you can single stage or two stage at your leisure
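Note to self: the "two-stage" idea (all uploads in one burst, then all draws) could look roughly like this. Everything here is a hypothetical sketch: `DrawItem`, `render_two_stage` and the two stubs are my own names, and the stubs merely record the call order in place of glBufferSubData and glDrawElements so the ordering can be checked without a GL context.

```c
#include <stddef.h>

/* Hypothetical per-draw record: source data plus the offset it is
   assigned in the streaming buffer. */
typedef struct {
    const void *data;
    size_t      size;
    size_t      offset;
} DrawItem;

/* Call-order trace used by the stubs below ('U' = upload, 'D' = draw). */
static char   g_trace[64];
static size_t g_trace_len;

static void trace(char c) {
    if (g_trace_len + 1 < sizeof g_trace) {
        g_trace[g_trace_len++] = c;
        g_trace[g_trace_len] = '\0';
    }
}

/* Stand-in for glBufferSubData(target, offset, size, data). */
static void upload_stub(size_t offset, size_t size, const void *data) {
    (void)offset; (void)size; (void)data;
    trace('U');
}

/* Stand-in for a glDrawElements call that reads from `offset`. */
static void draw_stub(size_t offset, size_t size) {
    (void)offset; (void)size;
    trace('D');
}

void render_two_stage(DrawItem *items, size_t count) {
    size_t cursor = 0;

    /* Stage 1: upload everything back to back, offsets only advancing. */
    for (size_t i = 0; i < count; i++) {
        items[i].offset = cursor;
        upload_stub(cursor, items[i].size, items[i].data);
        cursor += items[i].size;
    }

    /* Stage 2: issue the draws once no uploads remain. */
    for (size_t i = 0; i < count; i++) {
        draw_stub(items[i].offset, items[i].size);
    }
}
```

A single-stage variant would simply interleave the upload and the draw inside one loop, which per the log the driver also tolerates.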