Author Topic: Code Optimizing  (Read 4293 times)

jitspoe

  • Administrator
  • Autococker
  • Posts: 18802
Code Optimizing
« on: May 08, 2013, 06:46:16 PM »
This is something the community could help out with since it's mostly engine side code that needs optimizing.

I've noticed one of the most expensive functions is R_DrawSkeletalMesh.

Particularly the part where it's looping through all the vertexes:

Code:
for (j = 0, skmverts = mesh->vertexes; j < numverts; ++j, ++skmverts)
{
    VectorClear(inVertsArray[j]);
    VectorClear(inNormalsArray[j]);

    for (l = 0, boneverts = skmverts->verts; l < skmverts->numbones; ++l, ++boneverts)
    {
        pose = skmbonepose + boneverts->bonenum;

        for (k = 0; k < 3; k++)
        {
            inVertsArray[j][k] +=
                boneverts->origin[0] * pose->axis[k][0] +
                boneverts->origin[1] * pose->axis[k][1] +
                boneverts->origin[2] * pose->axis[k][2] +
                boneverts->influence * pose->origin[k];
            inNormalsArray[j][k] +=
                boneverts->normal[0] * pose->axis[k][0] +
                boneverts->normal[1] * pose->axis[k][1] +
                boneverts->normal[2] * pose->axis[k][2];
        }
    }
}

This could probably be sped up a lot using SIMD intrinsics.  It could also be sped up even more on higher end cards with vertex shaders, but I kind of want to optimize for lower end machines, or the more common case of a decent processor but a crappy onboard graphics card.

Any takers?

Foxhound

  • Autococker
  • Posts: 952
Re: Code Optimizing
« Reply #1 on: May 08, 2013, 07:36:48 PM »
Where's Payl when he's not being foreign?

ViciouZ

  • Map Committee
  • Autococker
  • Posts: 2227
Re: Code Optimizing
« Reply #2 on: May 08, 2013, 09:12:10 PM »
http://msdn.microsoft.com/en-us/library/22kbk6t9%28v=vs.90%29.aspx

Looks like the kind of thing required. But I have no experience with this stuff, I don't know how to align/convert the groups of floats into __m128, also can't see an intrinsic to sum the values.
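On the two questions above (getting groups of floats into an __m128, and summing), here is a minimal sketch, assuming an x86 compiler with SSE and <xmmintrin.h>. The function names add4, load3 and hsum4 are made up for illustration. _mm_loadu_ps loads four floats from any address, while _mm_load_ps requires 16-byte alignment. _mm_add_ps adds lane by lane; there is no single SSE1 intrinsic that sums the four lanes of one register, so that part is done with shuffles.

```c
#include <xmmintrin.h>  /* SSE: __m128 and the _mm_* intrinsics */

/* Lane-wise add of two groups of four floats (no alignment required). */
void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

/* Build a register from a 3-float vector without reading a 4th float.
   _mm_set_ps takes its arguments from the highest lane down to lane 0. */
__m128 load3(const float *v)
{
    return _mm_set_ps(0.0f, v[2], v[1], v[0]);
}

/* Horizontal sum of the four lanes of one register. */
float hsum4(__m128 v)
{
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); /* (v1, v0, v3, v2) */
    __m128 sums = _mm_add_ps(v, shuf);                           /* (v0+v1, .., v2+v3, ..) */
    shuf = _mm_movehl_ps(shuf, sums);                            /* lane 0 = v2+v3 */
    sums = _mm_add_ss(sums, shuf);                               /* lane 0 = v0+v1+v2+v3 */
    return _mm_cvtss_f32(sums);
}
```

For the loop in the first post a horizontal sum may not be needed at all: if each register holds the x, y and z results side by side, plain _mm_add_ps accumulates all three outputs at once.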

Also, is performance coming up in the uninstall form a lot or something? I would have thought that the biggest speed bottleneck was still the BSP traversal/immediate mode rendering clusterfudge.

jitspoe

  • Administrator
  • Autococker
  • Posts: 18802
Re: Code Optimizing
« Reply #3 on: May 08, 2013, 10:18:28 PM »
I'm pretty sure rendering the player models is the biggest performance hit right now (unless there are few/no players visible, of course, or the map is super complex).  LOD's would help with this, but that would mean creating a bunch of LOD assets.  The code has a lot of room for improvement.

deadfroggy

  • Autococker
  • Posts: 562
Re: Code Optimizing
« Reply #4 on: May 10, 2013, 12:13:31 AM »
If you want me to have a go send the code to digitalpaintgus@gmail.com.

Is the lag in clearing the array or in populating it?

Loops are generally pretty expensive. If you're trying to optimize the loop, try a different style.

Try using something other than a for loop. Can you achieve the same with if statements or a do-while (though these are also problematic)? It is populating the array, but there can be heaps of issues with arrays, depending on where you're getting the problems. You need to work out whether the delay is occurring at the beginning, when you first enter the for loop, or when it's trying to populate the array. Then you look at the array itself and how you're using it. Plus, you've got two for loops inside the function.

Rockyar_96

  • 68 Carbine
  • Posts: 370
Re: Code Optimizing
« Reply #5 on: May 10, 2013, 01:18:27 PM »
Well, payl could do this if he weren't hated by parts of the community ^^

ViciouZ

  • Map Committee
  • Autococker
  • Posts: 2227
Re: Code Optimizing
« Reply #6 on: May 10, 2013, 02:13:52 PM »
Quote from: deadfroggy on May 10, 2013, 12:13:31 AM
If you want me to have a go send the code to digitalpaintgus@gmail.com.

Is the lag in clearing the array or in populating it?

Loops are generally pretty expensive. If you're trying to optimize the loop, try a different style.

Try using something other than a for loop. Can you achieve the same with if statements or a do-while (though these are also problematic)? It is populating the array, but there can be heaps of issues with arrays, depending on where you're getting the problems. You need to work out whether the delay is occurring at the beginning, when you first enter the for loop, or when it's trying to populate the array. Then you look at the array itself and how you're using it. Plus, you've got two for loops inside the function.

It's great that you want to help out, but I suggest you start out with some of the simpler feature votes; I'm not sure you have quite grasped the issue in this particular post. Code is available from the CVS repo at http://sourceforge.net/p/paintball2/cvs/

whoislyap

  • Guest
Re: Code Optimizing
« Reply #7 on: May 16, 2013, 05:27:55 AM »
Code:
for (j = 0, skmverts = mesh->vertexes; j < numverts; ++j, ++skmverts)
{
    VectorClear(inVertsArray[j]);
    VectorClear(inNormalsArray[j]);

    for (l = 0, boneverts = skmverts->verts; l < skmverts->numbones; ++l, ++boneverts)
    {
        pose = skmbonepose + boneverts->bonenum;

        // for (k = 0; k < 3; k++) unrolled by hand for k = 0, 1, 2
        inVertsArray[j][0] +=
            boneverts->origin[0] * pose->axis[0][0] +
            boneverts->origin[1] * pose->axis[0][1] +
            boneverts->origin[2] * pose->axis[0][2] +
            boneverts->influence * pose->origin[0];
        inNormalsArray[j][0] +=
            boneverts->normal[0] * pose->axis[0][0] +
            boneverts->normal[1] * pose->axis[0][1] +
            boneverts->normal[2] * pose->axis[0][2];
        inVertsArray[j][1] +=
            boneverts->origin[0] * pose->axis[1][0] +
            boneverts->origin[1] * pose->axis[1][1] +
            boneverts->origin[2] * pose->axis[1][2] +
            boneverts->influence * pose->origin[1];
        inNormalsArray[j][1] +=
            boneverts->normal[0] * pose->axis[1][0] +
            boneverts->normal[1] * pose->axis[1][1] +
            boneverts->normal[2] * pose->axis[1][2];
        inVertsArray[j][2] +=
            boneverts->origin[0] * pose->axis[2][0] +
            boneverts->origin[1] * pose->axis[2][1] +
            boneverts->origin[2] * pose->axis[2][2] +
            boneverts->influence * pose->origin[2];
        inNormalsArray[j][2] +=
            boneverts->normal[0] * pose->axis[2][0] +
            boneverts->normal[1] * pose->axis[2][1] +
            boneverts->normal[2] * pose->axis[2][2];
    }
}
I hope I haven't messed it up, as I'm not a C++ guy. This should let the compiler produce better code that executes faster, since no execution transfer (branching) will occur for the k loop.

jitspoe

  • Administrator
  • Autococker
  • Posts: 18802
Re: Code Optimizing
« Reply #8 on: May 16, 2013, 08:40:24 AM »
I haven't checked the assembly to verify, but loop unrolling should be enabled, so the compiler should do that automatically.  I don't think regular scalar code can be made that much faster.  SIMD intrinsics allow one instruction to operate on multiple sets of data at the same time, so it could potentially make this 3x faster.

On a side note, why do you keep creating new accounts?  Your original account was only temp banned and that ban expired long ago...

jitspoe

  • Administrator
  • Autococker
  • Posts: 18802
Re: Code Optimizing
« Reply #9 on: May 27, 2013, 06:26:01 PM »
First attempt at SIMD...

Better than I expected, actually, lol.

jitspoe

  • Administrator
  • Autococker
  • Posts: 18802
Re: Code Optimizing
« Reply #10 on: May 27, 2013, 07:20:45 PM »
Saving this for future reference:

[20:57:38] <LordHavoc> the trick is to use UBO
[20:57:47] <LordHavoc> fallback if not present
[20:57:59] <LordHavoc> an Intel GPU with uniform buffers is probably going to do just fine with hardware :P
[20:58:10] <LordHavoc> cuts out all the pre-DX10 chips
[20:58:29] <LordHavoc> which have neither the practical limits (too few uniforms to animate a complex character model) nor the performance to really get a benefit
[20:58:58] <LordHavoc> from my time at Valve the "fast path" for geometry appears to be the following strategy:
[21:00:16] <LordHavoc> keep around two uniform buffers (GL_UNIFORM_BUFFER), two vertex buffers (GL_ARRAY_BUFFER), two index buffers (GL_ELEMENT_ARRAY_BUFFER), swap between them each frame, upload data to them using glBufferSubData before you issue your glDrawElements, do not reuse any part of them - always advance (as long as the offset for the new upload is at or past the previous one's end, it will keep the driver happy)
[21:00:30] <LordHavoc> make sure they are big enough to hold a significant amount of data
[21:00:46] <LordHavoc> if you run out of space, just gracefully fallover by switching buffers
[21:01:09] <LordHavoc> at every switch (toggling between the two sets), issue a glBufferData for each one with the same size but NULL data, this discards the contents
[21:01:18] <LordHavoc> Doom3 BFG Edition does this and its performance is great
[21:01:47] <LordHavoc> though Doom3 BFG Edition goes a step further - it does a glMapBuffer on them, spins up a ton of threaded jobs to generate everything, and then does a glUnmapBuffer when the jobs are done
[21:01:54] <LordHavoc> which skips the whole glBufferSubData part
[21:02:09] <LordHavoc> but in practice I find glBufferSubData is a lot easier to work with and quite performant
[21:02:33] <LordHavoc> whenever possible you should try to make sure all your glBufferSubData calls are in a quick burst though
[21:02:44] <LordHavoc> the driver likes it best if you issue no glDrawElements calls between them
[21:03:08] <LordHavoc> but it tolerates them pretty well
[21:03:17] <LordHavoc> so you can go either way, but the basic process is the same
[21:03:35] <LordHavoc> it just means that in practice if you can two-stage your render, it will be a performance win
[21:03:40] <LordHavoc> but you can single stage or two stage at your leisure
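The "always advance, switch when full" bookkeeping described in the log above could be sketched roughly as follows. This is only the offset logic, with no GL context; stream_pair_t, stream_init, stream_switch, stream_alloc and STREAM_ALLOC_FAILED are hypothetical names, and the GL calls appear only in comments. GL_STREAM_DRAW as the usage hint is an assumption, the log does not say which hint was used.

```c
#include <stddef.h>

#define STREAM_ALLOC_FAILED ((size_t)-1)

/* Hypothetical bookkeeping for one pair of same-sized streaming buffers. */
typedef struct {
    size_t capacity;  /* size in bytes of each buffer in the pair */
    size_t offset;    /* next unused byte in the active buffer */
    int    active;    /* 0 or 1: which buffer of the pair is current */
    int    discards;  /* how many times a buffer has been discarded */
} stream_pair_t;

void stream_init(stream_pair_t *s, size_t capacity)
{
    s->capacity = capacity;
    s->offset = 0;
    s->active = 0;
    s->discards = 0;
}

/* Toggle to the other buffer and start writing at its beginning.
   Real code would bind the new buffer here and call
   glBufferData(target, capacity, NULL, GL_STREAM_DRAW) to discard its
   old contents (the usage hint is an assumption). */
void stream_switch(stream_pair_t *s)
{
    s->active ^= 1;
    s->offset = 0;
    s->discards++;
}

/* Reserve size bytes and return the offset that would be passed to
   glBufferSubData(target, offset, size, data). Offsets only advance
   within a buffer; running out of room falls over to the other buffer. */
size_t stream_alloc(stream_pair_t *s, size_t size)
{
    size_t offset;

    if (size > s->capacity)
        return STREAM_ALLOC_FAILED;   /* can never fit in one buffer */
    if (size > s->capacity - s->offset)
        stream_switch(s);             /* out of space: switch buffers */

    offset = s->offset;
    s->offset += size;
    return offset;
}
```

In this sketch stream_switch would be called once at the start of every frame, and one such pair would exist for each of the uniform, vertex and index buffers. Uniform buffer ranges additionally need offsets that are multiples of GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT, which the sketch leaves out.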

jitspoe

  • Administrator
  • Autococker
  • Posts: 18802
Re: Code Optimizing
« Reply #11 on: May 28, 2013, 11:36:02 AM »
My first attempt at SIMD instructions...

Code:
for (l = 0, boneverts = skmverts->verts; l < skmverts->numbones; ++l, ++boneverts)
{
    pose = g_skmbonepose + boneverts->bonenum;

    {
        __m128 boneVertOrigin = _mm_loadu_ps(boneverts->origin);
        __m128 boneVertNormal = boneverts->normal; // Assumes normal is already stored as __m128, like pose->axis[] and pose->origin.
        // Broadcast each component across all four lanes (0x00 = x, 0x55 = y, 0xaa = z).
        __m128 boneVertX = _mm_shuffle_ps(boneVertOrigin, boneVertOrigin, 0x00);
        __m128 boneVertY = _mm_shuffle_ps(boneVertOrigin, boneVertOrigin, 0x55);
        __m128 boneVertZ = _mm_shuffle_ps(boneVertOrigin, boneVertOrigin, 0xaa);
        __m128 boneVertNormalX = _mm_shuffle_ps(boneVertNormal, boneVertNormal, 0x00);
        __m128 boneVertNormalY = _mm_shuffle_ps(boneVertNormal, boneVertNormal, 0x55);
        __m128 boneVertNormalZ = _mm_shuffle_ps(boneVertNormal, boneVertNormal, 0xaa);
        // Loads 4 floats starting at influence; only lane 0 is kept by the shuffle.
        __m128 boneVertInfluence = _mm_loadu_ps(&boneverts->influence);
        boneVertInfluence = _mm_shuffle_ps(boneVertInfluence, boneVertInfluence, 0x00);
        vert = _mm_add_ps(vert,
            _mm_add_ps(
                _mm_add_ps(
                    _mm_add_ps(
                        _mm_mul_ps(boneVertX, pose->axis[0]),
                        _mm_mul_ps(boneVertY, pose->axis[1])
                    ),
                    _mm_mul_ps(boneVertZ, pose->axis[2])
                ),
                _mm_mul_ps(boneVertInfluence, pose->origin)
            )
        );
        normal = _mm_add_ps(normal,
            _mm_add_ps(
                _mm_add_ps(
                    _mm_mul_ps(boneVertNormalX, pose->axis[0]),
                    _mm_mul_ps(boneVertNormalY, pose->axis[1])
                ),
                _mm_mul_ps(boneVertNormalZ, pose->axis[2])
            )
        );

        _mm_storeu_ps(inVertsArray[j], vert); // Note: This will overflow 1 float into inNormalsArray if verts are maxed out, but that should be ok.
        _mm_storeu_ps(inNormalsArray[j], normal); // Note: This will overflow 1 float into color array if verts are maxed out.
    }
}

Seems like there should be a better way to do it without all the shuffles, but that boosted my framerate from about 118 to 135 (14-15%) with a bunch of character models, so it's at least faster than it was.
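One possible way to drop the explicit load-plus-shuffle pairs is to broadcast each scalar with _mm_set1_ps. The sketch below is not the engine's code: pose4_t and bonevert_t are hypothetical layouts, with the rotation stored column by column and padded to four floats so that one register holds the x, y and z contribution of a single input component. It also keeps the running sums in registers and stores once after the bone loop into 4-float outputs, so nothing is written past the end. A scalar version with the same math is included for comparison.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical layouts for illustration only; the engine's structs may differ. */
typedef struct {
    float col[3][4];  /* col[i] = (axis[0][i], axis[1][i], axis[2][i], 0) */
    float origin[4];  /* bone translation, padded to 4 floats */
} pose4_t;

typedef struct {
    float origin[3];
    float influence;
    float normal[3];
    int   bonenum;
} bonevert_t;

/* Scalar reference: same math as the original k loop. */
void skin_vertex_scalar(const bonevert_t *bv, int numbones, const pose4_t *poses,
                        float vert[3], float normal[3])
{
    int k, l;

    for (k = 0; k < 3; k++)
        vert[k] = normal[k] = 0.0f;

    for (l = 0; l < numbones; ++l, ++bv) {
        const pose4_t *pose = poses + bv->bonenum;
        for (k = 0; k < 3; k++) {
            vert[k] += bv->origin[0] * pose->col[0][k] +
                       bv->origin[1] * pose->col[1][k] +
                       bv->origin[2] * pose->col[2][k] +
                       bv->influence * pose->origin[k];
            normal[k] += bv->normal[0] * pose->col[0][k] +
                         bv->normal[1] * pose->col[1][k] +
                         bv->normal[2] * pose->col[2][k];
        }
    }
}

/* SSE version: x, y and z are accumulated together, one lane each. */
void skin_vertex_sse(const bonevert_t *bv, int numbones, const pose4_t *poses,
                     float vert[4], float normal[4])
{
    __m128 v = _mm_setzero_ps();
    __m128 n = _mm_setzero_ps();
    int l;

    for (l = 0; l < numbones; ++l, ++bv) {
        const pose4_t *pose = poses + bv->bonenum;
        __m128 c0 = _mm_loadu_ps(pose->col[0]);
        __m128 c1 = _mm_loadu_ps(pose->col[1]);
        __m128 c2 = _mm_loadu_ps(pose->col[2]);
        __m128 t  = _mm_loadu_ps(pose->origin);

        /* _mm_set1_ps copies one float into all four lanes. */
        v = _mm_add_ps(v, _mm_mul_ps(_mm_set1_ps(bv->origin[0]), c0));
        v = _mm_add_ps(v, _mm_mul_ps(_mm_set1_ps(bv->origin[1]), c1));
        v = _mm_add_ps(v, _mm_mul_ps(_mm_set1_ps(bv->origin[2]), c2));
        v = _mm_add_ps(v, _mm_mul_ps(_mm_set1_ps(bv->influence), t));

        n = _mm_add_ps(n, _mm_mul_ps(_mm_set1_ps(bv->normal[0]), c0));
        n = _mm_add_ps(n, _mm_mul_ps(_mm_set1_ps(bv->normal[1]), c1));
        n = _mm_add_ps(n, _mm_mul_ps(_mm_set1_ps(bv->normal[2]), c2));
    }

    /* Store once per vertex, after all bones have been accumulated. */
    _mm_storeu_ps(vert, v);
    _mm_storeu_ps(normal, n);
}
```

The compiler may still emit a shuffle for each _mm_set1_ps, so whether this is actually faster than the version above would need measuring. What it does avoid is the 16-byte load that starts at influence and reads past it.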