Calculate fast polygon global pos.. [SOLVED]
-
On 30/04/2015 at 05:44, xxxxxxxx wrote:
Hi Eclektrik,
Possible optimizations:
1. Call op->GetPointR() only once, ideally already outside of poly_center()
2. Call op->GetMg() only once, ideally already outside of poly_center()
3. Don't use containers at all; make a single Vector called "sum" to which you add the three or four points of the polygon.

Possible pitfalls in your code:
1. You don't check whether a point index is in the point range, which could lead to an access violation.
2. You use the standard library containers, which rely on exceptions; this is unsafe in the Cinema 4D plugin environment, as exception handling is disabled by default.
3. You should use the matrix as the left operand in the multiplication. The other way round is not valid in maths and has been deprecated since R15.

The following code does not multiply by the object's global matrix. Since you should prefer to call op->GetMg() only once, you can simply write mg * poly_center(...) instead of poly_center(..., mg); a short usage sketch follows the function.

Vector poly_center(Int32 pcnt, const Vector* points, const CPolygon& poly)
{
  Vector sum;
  if (poly.a >= 0 && poly.a < pcnt)
    sum += points[poly.a];
  if (poly.b >= 0 && poly.b < pcnt)
    sum += points[poly.b];
  if (poly.c >= 0 && poly.c < pcnt)
    sum += points[poly.c];
  // quads have d != c; triangles are stored with c == d
  if (poly.c != poly.d && poly.d >= 0 && poly.d < pcnt)
  {
    sum += points[poly.d];
    return sum / 4;
  }
  return sum / 3;
}
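For example, a minimal usage sketch (assuming op is a valid PolygonObject*; the variable names are only illustrative):

// Sketch only: fetch everything once, outside the per-polygon loop.
const Matrix mg = op->GetMg();
const Int32 pointCount = op->GetPointCount();
const Int32 polyCount = op->GetPolygonCount();
const Vector* points = op->GetPointR();
const CPolygon* polys = op->GetPolygonR();
for (Int32 i = 0; i < polyCount; ++i)
{
  // matrix as the left operand, local polygon center on the right
  const Vector globalCenter = mg * poly_center(pointCount, points, polys[i]);
  // ... use globalCenter ...
}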
Best,
Niklas -
On 30/04/2015 at 07:31, xxxxxxxx wrote:
Hi,
thanks Niklas for the nice solution (small typo on the second if statement, I think).
I just want to add two things:
- CPolygon also has a [] operator, so if you prefer, you could write the above with a loop as well (although Niklas' code is probably faster; a loop version is sketched after this list)
- more important: We strongly recommend using BaseArray instead of std::vector. Reasons are given by Wilfred, for example, in this thread: Faster Vector list...
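For illustration, a loop variant of poly_center using CPolygon's index operator (a sketch only; the polygon is taken by value so the copy is cheap, and indices 0-3 are assumed to map to a-d):

// Sketch only: loop version built on CPolygon's [] operator.
Vector poly_center_loop(Int32 pcnt, const Vector* points, CPolygon poly)
{
  const Int32 count = (poly.c != poly.d) ? 4 : 3;  // quad or triangle
  Vector sum;
  for (Int32 k = 0; k < count; ++k)
  {
    const Int32 index = poly[k];          // assumed mapping: 0->a, 1->b, 2->c, 3->d
    if (index >= 0 && index < pcnt)       // same range check as above
      sum += points[index];
  }
  return sum / Float(count);
}
-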
On 30/04/2015 at 07:46, xxxxxxxx wrote:
@Andreas, I expected a for loop might be faster, since it may get optimized with SSE by the compiler, but if both generate the same non-optimized code, then the for loop will be slower.
-
On 30/04/2015 at 09:10, xxxxxxxx wrote:
@Andreas: You're right, I'll fix it
-
On 01/05/2015 at 07:40, xxxxxxxx wrote:
I knew I could rely on you guys. Thanks particularly to Niklas for taking the time to go over this.
Yeah, I'm looking at converting everything over to BaseArray, as it does seem to be the way to go: flexible and fast (and obviously compatible). Every fraction of a ms I can shave off at this point is a bonus, so I'll be going in pretty hard and hopefully learning how to do things better and more efficiently in general.
I've also implemented multithreading via C4D's MPThreadPool, but have achieved probably less than a 10% improvement in a fairly short-run (but intensive) deformer plugin. This is on a hex-core (12-thread) Intel, so I'm going to look into what the holdbacks are, i.e. thread prep/setup times maybe, more optimised code, etc.
Thanks again.
-
On 01/05/2015 at 10:06, xxxxxxxx wrote:
Thanks again Niklas, just tested and it is a LOT faster.
I've also tried to reduce calls to ->GetMg() etc. to a single call and reuse the result. Next step is optimising the rest of the code further, and also learning the nuts and bolts of pointers and references, as I feel these would also save time and memory; at the moment I'm almost sort of guessing how to use them as I pass values into functions and again into more functions.
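On the pointers and references point, a minimal sketch of the difference when passing larger objects into functions (the helper names are hypothetical, not from the plugin code):

// By value: the whole Matrix (several Vectors) is copied on every call.
Vector ToGlobalByValue(Matrix mg, Vector local) { return mg * local; }
// By const reference: only an address is passed, no copy, read-only access.
Vector ToGlobalByRef(const Matrix& mg, const Vector& local) { return mg * local; }
// By const pointer: same idea, but the caller passes &mg and the callee
// should guard against nullptr.
Vector ToGlobalByPtr(const Matrix* mg, const Vector& local)
{
  return mg ? (*mg * local) : local;
}

For small types like a single Vector the copy is usually as cheap as the reference; the gain mainly shows up with larger objects and arrays.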
-
On 01/05/2015 at 11:13, xxxxxxxx wrote:
@Eclectrik multi-threading will give you performance, but it may be memory bound in some cases; here I get much more performance when I split memcpy calls across threads, where possible.
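To illustrate the idea (a rough sketch with hypothetical names; the actual thread setup is omitted), each worker copies only its own slice of the data, so the copies run in parallel instead of as one large serial memcpy:

#include <cstring>  // std::memcpy

// Sketch only: a worker thread copies just its slice [start, start + count)
// of an array whose elements are elemSize bytes each.
static void CopySlice(void* dst, const void* src, Int32 start, Int32 count, Int32 elemSize)
{
  const Int64 offset = Int64(start) * elemSize;
  std::memcpy(static_cast<char*>(dst) + offset,
              static_cast<const char*>(src) + offset,
              static_cast<std::size_t>(count) * static_cast<std::size_t>(elemSize));
}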
-
On 01/05/2015 at 12:53, xxxxxxxx wrote:
Originally posted by xxxxxxxx
@Eclectrik multi-threading will give you performance, but it may be memory bound in some cases; here I get much more performance when I split memcpy calls across threads, where possible.
Cheers Mohamed - what would that involve exactly?
At the moment, what I have is, for example:
The base control thread takes the PolygonObject and assigns each thread a selection of its polygons to work on. Then each thread starts and runs the above poly_center function on each polygon in its share.
So basically -
const Matrix matr = op->GetMg();            // global matrix, fetched once
const Int32 pcnt = op->GetPointCount();     // point count, used for the range checks in poly_center
const Vector* pnts = op->GetPointR();       // read-only points
const CPolygon* pPoly = op->GetPolygonR();  // read-only polygons
for (Int32 i = pIndexStart; i < (pIndexStart + pLength); i++)
{
  Vector polyPos = matr * poly_center(pcnt, pnts, pPoly[i]);
  // ... per-polygon work ...
}
Something like that, with some other stuff thrown in, where pIndexStart and pLength feed it the share of polygon indices the thread gets.
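For reference, one way those per-thread values could be derived (a sketch; polyCount, threadIndex and threadCount are hypothetical names):

// Sketch only: split polyCount polygons into threadCount contiguous ranges.
const Int32 perThread = (polyCount + threadCount - 1) / threadCount;  // ceiling division
const Int32 pIndexStart = threadIndex * perThread;
const Int32 remaining = polyCount - pIndexStart;
const Int32 pLength = (remaining > 0) ? ((remaining < perThread) ? remaining : perThread) : 0;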
With what you are saying, would it be more efficient, for example, to copy each share of polygons into a separate CPolygon buffer, so that each thread is accessing a separate chunk of memory, i.e. not all reading through the same pointer from op->GetPolygonR()? Or am I looking at this the wrong way?
PS - It's fun throwing a dense mesh at this, and watching CPU usage on all cores hit 100%, not something you often see in the viewport
-
On 01/05/2015 at 14:18, xxxxxxxx wrote:
As this function is called a lot in your case, I think you can optimize poly_center() further with these 2 things:
1- avoid division, so instead of sum / 3, make it sum * 0.333333
2- you may try some SSE intrinsics; they "may" improve performance, depending on how bad the compiler is. With a good compiler it won't add much. -
On 18/05/2015 at 05:38, xxxxxxxx wrote:
Small addition to 1:
It's probably better to write * (1.0 / 3.0). In this way the compiler can do the division at compile time and you end up with the same multiplication as Mohamed suggested, while preserving as much precision as possible.
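In code, the same idea as a tiny sketch (Vector and Float as in the examples above):

// Sketch: division by a constant vs. multiplication by its compile-time reciprocal.
inline Vector AverageOfThree(const Vector& sum)
{
  // 1.0 / 3.0 is evaluated by the compiler, so only a multiplication remains
  // at runtime, while keeping as much precision as the constant allows.
  return sum * (1.0 / 3.0);
}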