If anyone ever tells you sorting your meshes by material and minimizing redundant calls in OpenGL isn't worth your while, they're wrong.
Frame rates have been slowly decreasing as I've been adding new content. Recently my 2GHz Core i7 with NVidia GeForce GT 555M hit an all-time low of around 50FPS. I needed to do something to crank those frame rates back up so players with lower-spec machines would have a chance.
The obvious candidate was the way my renderer handled materials. It couldn't tell when two materials with different settings were otherwise identical (same textures and shaders). And it did nothing to reduce redundant state changes, happily changing the shader and every rendering setting at the end of each material even if the next material used all the same settings.
I modified the renderer to split every pass of every material on every mesh into an individual operation like so:
uint64 mOrder[ 4 ]; // Used to order these operations
PassId mPassId; // Id of pass
GlMesh* mMesh; // Mesh to draw
GlMaterial* mMaterial; // Material for this submesh
Info* mInfo; // Info for this mesh
GlVertexArray* mVertexArray; // Vertex array for this mesh
GlVertexArray* mFeedbackBuffer; // Feedback buffer for this renderable
unsigned int mModelMatrixIndex; // Model matrix for this mesh
unsigned int mViewMatrixIndex; // View matrix for this mesh
unsigned int mProjectionMatrixIndex; // Projection matrix for this mesh
unsigned int mModelViewMatrixIndex; // Model view matrix for this mesh
unsigned int mModelViewProjectionMatrixIndex; // Model view projection matrix for this mesh
unsigned int mNormalMatrixIndex; // Normal matrix index (if needed)
unsigned int mInverseCameraRotationMatrixIndex; // Inverse camera rotation matrix index (if needed)
unsigned int mPipelineIndex; // Index of current pass ID within pipeline.
unsigned int mMaterialPassIndex; // Index of current pass within material.
unsigned int mSubMeshIndex; // Submesh to render (if renderable is mesh)
bool mQueryFeedbackResult; // True if needs to query the size of the feedback buffer
Those operations are then sorted by pass index, then the user-selected ordering, then shader, then textures, then a whole pile of individual rendering options pulled out of the material definition, totalling about 256 bits of data (mOrder above). To speed up sorting, I allocate an array of pointers to the operations and sort that rather than sorting the operations themselves. The first optimization pass with CodeXL showed that the comparison function was taking up a big portion of the runtime. Originally I did the mOrder comparison using a loop, but over half of the comparisons exited on the first iteration. I unrolled the loop, eliminating the loop setup, which was about 25% of the time spent in the comparison function according to CodeXL, and the comparison function dropped right out of the profiler's list of hot functions.
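Here is a minimal sketch of that sort, not the actual engine code. RenderOp is a hypothetical, cut-down stand-in for the operation record above, and OrderLess and SortedView are names made up for this example. It also assumes mOrder[ 0 ] holds the most significant part of the key (pass index and so on), which the post doesn't spell out.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical, cut-down operation record: just the 256-bit sort key
// plus one payload field so the result can be inspected.
struct RenderOp {
    std::uint64_t mOrder[4];      // assumed: word 0 is most significant
    unsigned int  mSubMeshIndex;
};

// Unrolled comparison. There is no loop counter to set up, and most
// calls return from the first test because the first words differ.
inline bool OrderLess(const RenderOp* a, const RenderOp* b) {
    if (a->mOrder[0] != b->mOrder[0]) return a->mOrder[0] < b->mOrder[0];
    if (a->mOrder[1] != b->mOrder[1]) return a->mOrder[1] < b->mOrder[1];
    if (a->mOrder[2] != b->mOrder[2]) return a->mOrder[2] < b->mOrder[2];
    return a->mOrder[3] < b->mOrder[3];
}

// Sort an array of pointers so the (much larger) operations never move.
inline std::vector<const RenderOp*> SortedView(const std::vector<RenderOp>& ops) {
    std::vector<const RenderOp*> view;
    view.reserve(ops.size());
    for (const RenderOp& op : ops) view.push_back(&op);
    std::sort(view.begin(), view.end(), OrderLess);
    return view;
}
```

Sorting pointers means each swap moves 8 bytes instead of the whole record, at the cost of one extra indirection per comparison.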
With the new material sorting in place and those basic optimizations done, my fastest frames were down to around 10ms.
Next I built a little class that keeps track of a subset of OpenGL's internal state and checks it before making a state change. This class doesn't cover everything, but it does cover the shader and a bunch of miscellaneous state. Next time I need a performance bump I'll expand it to do textures as well.
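The idea behind that class can be sketched as follows. This is an illustration under assumptions rather than the real thing: StateBackend, StateCache and CountingBackend are hypothetical names, and only two pieces of state are tracked. In a real renderer the backend calls would forward to things like glUseProgram and glEnable/glDisable; here they are abstracted away so the sketch compiles without OpenGL headers.

```cpp
// Hypothetical interface standing in for the real OpenGL calls.
struct StateBackend {
    virtual ~StateBackend() = default;
    virtual void UseProgram(unsigned int program) = 0;  // e.g. glUseProgram
    virtual void SetDepthTest(bool enabled) = 0;        // e.g. glEnable/glDisable
};

// Remembers the last value sent for each piece of state and skips the
// backend call when the requested value is already current.
class StateCache {
public:
    explicit StateCache(StateBackend& backend) : mBackend(backend) {}

    void UseProgram(unsigned int program) {
        if (mProgramKnown && mProgram == program) return;  // redundant
        mBackend.UseProgram(program);
        mProgram = program;
        mProgramKnown = true;
    }

    void SetDepthTest(bool enabled) {
        if (mDepthKnown && mDepthTest == enabled) return;  // redundant
        mBackend.SetDepthTest(enabled);
        mDepthTest = enabled;
        mDepthKnown = true;
    }

    // Forget everything, for when code outside the cache changed state.
    void Invalidate() { mProgramKnown = false; mDepthKnown = false; }

private:
    StateBackend& mBackend;
    unsigned int  mProgram = 0;
    bool          mDepthTest = false;
    bool          mProgramKnown = false;  // false until the first call
    bool          mDepthKnown = false;
};

// Backend that only counts calls, handy for checking the cache works.
struct CountingBackend : StateBackend {
    int programCalls = 0;
    int depthCalls = 0;
    void UseProgram(unsigned int) override { ++programCalls; }
    void SetDepthTest(bool) override { ++depthCalls; }
};
```

The "known" flags matter: without them the cache would wrongly skip the very first call whenever the requested value happened to match its default. A cache like this pays off most when the draw list is already sorted by state, since identical requests then arrive back to back.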
Once I had all the planned improvements in, I was shocked to find that the renderer was running at an absolutely consistent 30FPS. Before all the work started it was running at 50FPS! What had I done?!? Even weirder, it would run at 60+FPS for about 10 seconds before slowing to 30FPS and staying there. That weird full-speed to low-speed transition got me thinking that it might not be something I was doing. Looking at Miranda's frame profiler I discovered that SwapBuffers was the culprit, and that its duration was highly variable, ranging from 1ms to 35ms. I read a bunch on the internet about OpenGL drivers throttling if they have too much work to do, so I opened up gDEBugger to see if I was doing anything really stupid. Nope. Well yes, but I'll get to that. My next idea, given that the fan was running full speed, was that the laptop might be running hot. So I shut the PC down for the night.
Before bed I read a bunch more and found that OpenGL drivers typically implement VSync by blocking, often inside SwapBuffers. If you have VSync on and can't deliver a frame within one 60Hz refresh interval (about 16.7ms), the swap waits for the next refresh and you drop back to 30FPS. It was weird given that it had been running at 50FPS previously, but I knew what to look into next.
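The arithmetic behind that drop is easy to work through. The sketch below assumes the simplest case, plain double buffering where every swap waits for the next vertical blank, with no triple buffering or adaptive sync; VsyncedFps is a made-up helper name. A frame that takes 20ms to render misses the 16.7ms deadline, waits for the second refresh, and so is shown every 33.3ms.

```cpp
#include <algorithm>
#include <cmath>

// Displayed frame rate when each swap blocks until the next vertical
// blank. Assumes plain double buffering on a fixed-rate display.
inline double VsyncedFps(double renderMs, double refreshHz) {
    const double intervalMs = 1000.0 / refreshHz;            // 16.67ms at 60Hz
    const double intervals  = std::ceil(renderMs / intervalMs);
    return refreshHz / std::max(intervals, 1.0);             // at least one interval
}
```

With a 60Hz display, VsyncedFps(10.0, 60.0) gives 60, VsyncedFps(20.0, 60.0) gives 30, and VsyncedFps(35.0, 60.0) gives 20, so under these assumptions there is nothing between 60FPS and 30FPS.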
This morning I fired up the game, and yay, it is happily humming along at a consistent 120FPS. The laptop video driver had been throttling Miranda to keep the GPU's temperature down.
Originally, frames in my profiling build were averaging about 22ms. Now I regularly see frames at just 5ms, with the average being around 8ms. Happy! Oh, and that thing that gDEBugger showed me? The redundant call list had glClear on it. Ack! Looking through the code I discovered that when I added multi-pass support, I had failed to remove the original code that cleared the buffer before rendering every frame, so it was clearing the buffer twice.