The One Man MMO Project: Bad Bug
By Robert Basler on 2015-03-22 01:26:40
Homepage: onemanmmo.com Email: one at onemanmmo dot com


If you aren't a programmer, here's a little sample of what we do to make games for you.

Today, at long last, I fixed a bug that goes back to Christmas 2014. The bug first appeared right after I updated the renderer for better performance. One of the things that change added was a hard limit on the number of shader programs the Lair Engine could support: 4096. I figured that should be plenty since I only had a few hundred shaders even with permutations. One afternoon while cruising over Miranda, the game halted because it had hit the limit of shader programs. I was shocked. But when I looked into it, the reason became readily apparent: I never released shader programs from materials when they were unloaded. It was a simple oversight, quickly rectified with this little bit of code:

void Pass::ShutdownProgram( void )
{
    // A handle of 0 means this pass has no program to release.
    if ( mProgramId == 0 )
    {
        return;
    }
    // Detach and destroy each attached shader stage.
    if ( mGeometryShaderId != 0 )
    {
        mGlShaders->DetachShader( mProgramId, mGeometryShaderId );
        mGlShaders->DestroyShader( mGeometryShaderId );
        mGeometryShaderId = 0;
    }
    if ( mVertexShaderId != 0 )
    {
        mGlShaders->DetachShader( mProgramId, mVertexShaderId );
        mGlShaders->DestroyShader( mVertexShaderId );
        mVertexShaderId = 0;
    }
    if ( mPixelShaderId != 0 )
    {
        mGlShaders->DetachShader( mProgramId, mPixelShaderId );
        mGlShaders->DestroyShader( mPixelShaderId );
        mPixelShaderId = 0;
    }
    // Finally release the program itself.
    mGlShaders->DestroyProgram( mProgramId );
    mProgramId = 0;
}



Now I was destroying the shaders and programs when they were no longer needed. Problem solved.

Then the Bad Bug showed up.

[Image: Shader bug.]


At some point within the first 15 minutes of gameplay there would be a rendering problem. Sometimes it would be a terrain block that was all black, or a model that was all black, or a tank that only had a right tread with the left tread and tank body missing. It wasn't consistent. The common thread was that the bug had to do with materials and their shaders.

Since this bug showed up as soon as I added the code above, I had a pretty good idea what the cause of it was. Removing that code did indeed fix the problem, so I spent a half-day checking all the obvious things, but none of them panned out. I was in the middle of a big rendering change and the bug didn't cause a crash, so I wrote up a bug report, then let it go in order to work on more important things.

I revisited the bug a couple of times over the last couple of months when I had a free minute, but I didn't make any progress.

With a Full Indie demo night coming up next week, and given that I'd had to explain away the bug repeatedly last demo night, I decided it was time to figure it out.

I was quickly able to narrow the bug down to just the last function call in that code: DestroyProgram. The problem was that behind that one function call was a whole bunch of code to manage OpenGL shaders and programs. There's a lot of complexity there to keep the number of shaders and programs down, so that any time two objects use the same program, they share that program. That means reference counting, which is hopelessly bug-prone. I spent an hour reviewing all the shader management code to see if I was making an error with the reference counting, until I had the bright idea to run the program with no reference counting at all. The program would never release OpenGL shaders, it ran horribly inefficiently, and it still had the bug. The management of OpenGL shaders and programs was not the problem.
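For readers who haven't met this kind of sharing before, here is a minimal sketch of a reference-counted program cache. It is not the Lair Engine's code: `SharedProgramCache`, `Acquire`, `Release`, and `LiveCount` are hypothetical names, programs are keyed by name purely for illustration, and the ids are plain integers rather than real OpenGL objects.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical cache: one entry per program name, shared by every user.
class SharedProgramCache
{
public:
    // Returns the id for the named program, creating it on first use,
    // and counts one more user of that program.
    unsigned int Acquire( const std::string& name )
    {
        auto it = mEntries.find( name );
        if ( it == mEntries.end() )
        {
            it = mEntries.emplace( name, Entry{ mNextId++, 0 } ).first;
        }
        ++it->second.refCount;
        return it->second.id;
    }

    // Drops one reference. Returns true only when the last user let go
    // and the program was actually removed.
    bool Release( const std::string& name )
    {
        auto it = mEntries.find( name );
        if ( it == mEntries.end() )
        {
            return false;
        }
        assert( it->second.refCount > 0 );
        if ( --it->second.refCount > 0 )
        {
            return false;
        }
        mEntries.erase( it );
        return true;
    }

    // Number of programs currently alive.
    std::size_t LiveCount() const { return mEntries.size(); }

private:
    struct Entry
    {
        unsigned int id;
        unsigned int refCount;
    };
    std::map<std::string, Entry> mEntries;
    unsigned int mNextId = 1;
};
```

Two objects that acquire the same name get the same id, and the program only goes away after both have released it. Every missed or doubled `Release` is a bug waiting to happen, which is why this kind of code is the first suspect.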

A bug that can't be reproduced every single time the code runs can be really difficult to track down. The code runs a million times but only fails on one iteration out of a million. How do you find that one time? What is it about that time that causes it not to work?

If you don't use Visual Assist, you should. Its Find References tool is really helpful for debugging. It is like global find, but it only finds function names and variables that belong to the correct class. That saves a lot of time. (Its function/variable rename is also much, much better than regular search and replace.)

After a couple hours more testing (each test taking up to 15 minutes to reproduce the bug) I narrowed the problem down to this bit of code, specifically the for loop that reuses program slots.

// Returns a non-zero handle: the slot index plus one.
unsigned int GlShaders::CreateProgram( const String& name )
{
    Program program( name );
    // Reuse deleted program slot
    for ( unsigned int a = 0; a < mPrograms.size(); ++a )
    {
        if ( mPrograms[ a ].IsDeleted() )
        {
            mPrograms[ a ] = program;
            return a + 1;
        }
    }
    // No free slot: append, so the new handle equals the new size.
    mPrograms.push_back( program );
    return mPrograms.size();
}


The Lair Engine refers to shader programs using a non-zero numeric handle. That for loop in the middle reuses program slots that have been previously released. With those 8 lines commented out, the game had no rendering problems. It is very, very odd to have a bug like this in code this simple.

I carefully single-stepped through the for loop looking for anything out of place. I looked at how the Program objects were being copied. I looked at the vector code. I used Visual Assist to look everywhere the program handles were used in the Lair Engine. I was at a loss. This is the point where you usually start looking askance at the hardware, or for a freakish error in the assembler output from the compiler.

If I left that code out, it was a memory leak; if I left it in, there were rendering problems. I'd been at it for four hours, so it was an easy choice: I added an explanation of why there was a memory leak there, commented the lines out, and went to bed.

The next morning it came to me. I had missed something. I knew that somewhere in the Lair Engine it had to be storing that Program handle and that the re-use of the handle itself was what was causing the problem. The problem wasn't in the for loop, it was somewhere else. I had checked all that earlier, but I knew there was something I hadn't seen. Looking at the renderer again, I noticed that the Program handle was copied to a temporary variable which much later was used to look up the proper OpenGL VAO to bind during rendering!

When a program was destroyed, the VAO corresponding to it wasn't being destroyed, so when the new program came along, it used the old VAO rather than creating a new one, which resulted in rendering errors. I quickly added code to destroy the VAO when its corresponding program is destroyed.
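In miniature, the failure looks something like the sketch below. It is not the Lair Engine's code: `ProgramTable`, `VaoCache`, and `RecycledHandleGetsStaleVao` are hypothetical names, the ids are plain integers, and no real OpenGL calls are made. It only shows how a side cache keyed by a reusable handle goes stale, and how evicting the entry on destroy avoids that.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Hypothetical program table: 1-based handles, freed slots are reused.
class ProgramTable
{
public:
    unsigned int Create()
    {
        for ( unsigned int a = 0; a < mDeleted.size(); ++a )
        {
            if ( mDeleted[ a ] )
            {
                mDeleted[ a ] = false;
                return a + 1;
            }
        }
        mDeleted.push_back( false );
        return static_cast<unsigned int>( mDeleted.size() );
    }

    void Destroy( unsigned int handle )
    {
        mDeleted[ handle - 1 ] = true;
    }

private:
    std::vector<bool> mDeleted;
};

// Hypothetical side cache mapping a program handle to a VAO id.
class VaoCache
{
public:
    // Returns the cached VAO for this handle, creating one on first use.
    unsigned int GetOrCreate( unsigned int programHandle )
    {
        auto it = mVaos.find( programHandle );
        if ( it != mVaos.end() )
        {
            return it->second;
        }
        unsigned int vao = mNextVao++;
        mVaos[ programHandle ] = vao;
        return vao;
    }

    // The fix: drop the VAO when its program is destroyed.
    void Evict( unsigned int programHandle )
    {
        mVaos.erase( programHandle );
    }

private:
    std::unordered_map<unsigned int, unsigned int> mVaos;
    unsigned int mNextVao = 1;
};

// Returns true if a recycled handle is handed the VAO built for the old program.
bool RecycledHandleGetsStaleVao( bool evictOnDestroy )
{
    ProgramTable programs;
    VaoCache vaos;
    unsigned int first = programs.Create();
    unsigned int firstVao = vaos.GetOrCreate( first );
    programs.Destroy( first );
    if ( evictOnDestroy )
    {
        vaos.Evict( first );
    }
    unsigned int second = programs.Create();
    assert( second == first ); // the freed slot is handed out again
    return vaos.GetOrCreate( second ) == firstVao;
}
```

Calling `RecycledHandleGetsStaleVao( false )` models the bug: the new program silently inherits the old program's VAO. With `true`, the cache entry is gone by the time the handle is reused, so a fresh VAO is created.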

Bug fixed!!! I could cruise over the terrain without rendering errors.

Then the game ran out of memory, but that's a story for another day.
