fenomas Posted July 21, 2017

Hi, I have a scene with around 800 non-static meshes moving around, and I find the main performance bottleneck is the time Babylon spends in evaluateActiveMeshes, which in turn calls computeWorldMatrix and _updateBoundingInfo on most of the meshes.

However, the nature of my scene is that most of the meshes never rotate, and I separately track their locations and bounding info. So in principle, it seems like I could tell Babylon they're static (by calling freezeWorldMatrix?), then manually update their boundingInfo objects and set their worldMatrices to simple translation matrices.

Would this be a safe approach? Has anyone tried it, or is there some built-in way of achieving a similar result? Or does freezing the world matrix have other implications that would cause this to break?

Thanks!
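To make the "manually update their boundingInfo" half of the idea concrete, here is a minimal framework-free sketch. It assumes a mesh that only ever translates, so its local-space AABB is fixed and the world-space box is just the local box shifted by the current position. The plain objects and function names here are hypothetical stand-ins, not Babylon's BoundingInfo API.

```javascript
// A never-rotating, never-scaling mesh can keep its local-space AABB fixed
// and derive the world-space box by adding the current position to min/max.
// This is the cheap manual update you would do before locking Babylon's
// boundingInfo, instead of re-transforming all 8 box corners each frame.
function translatedBox(localMin, localMax, pos) {
  return {
    min: [localMin[0] + pos.x, localMin[1] + pos.y, localMin[2] + pos.z],
    max: [localMax[0] + pos.x, localMax[1] + pos.y, localMax[2] + pos.z],
  };
}

// Unit cube centered at the origin, moved to (5, 0, -2):
const box = translatedBox([-1, -1, -1], [1, 1, 1], { x: 5, y: 0, z: -2 });
```

This only holds for pure translation; any rotation or non-uniform scale would require re-deriving the box from the transformed corners.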
jerome Posted July 21, 2017

Mmmmh... I'm afraid you can't skip the world matrix computation that easily, because this matrix is passed to the GPU so it can compute all the meshes' final vertex positions and then all the projections to the screen. Computing your own world matrix from a simple translation matrix (plus freezing it) should work, though. It's worth a try... but I'm not sure the gain is really high, because computing 800 quaternions (for the complete world matrix) is actually really fast. updateBoundingInfo() should also be quite fast, as it only updates the 8 bounding box vertex positions (plus the bbox center). You can update your own bInfo and then lock it to skip the automatic computation with: http://doc.babylonjs.com/classes/3.0/boundinginfo#islocked-boolean

Usually, evaluateActiveMeshes() spends most of its time in the call to isInFrustum(): culling. By the way, I tried a few weeks ago to implement a faster culling algorithm, but I wasn't satisfied by the results: http://jerome.bousquie.fr/BJS/test/frustum.html (fast duration = experimental algo, frustum duration = legacy algo)

If you're sure (and I'm pretty sure you are, because I know you're a profiler pro) that the time is spent in the world matrix and bInfo computations, maybe you could think about other approaches: do some logical pre-culling (set meshes inactive from your game logic before the camera has to evaluate them), freeze/unfreeze the world matrix for meshes you know haven't moved for some frames, force the selection of meshes you know are almost always in the frustum, etc.

Maybe using an SPS holding all these meshes (or most of them, even if each solid particle is a different model) could help, as the SPS computes only one world matrix, and each particle's bInfo is updated within the particle loop (so it's faster)... but culling happens at the global level (so it's less accurate: either all the particles are culled or none are).

Usually one draw call, even with false positives (things passed to the GPU that won't ultimately be rendered because they're off screen), is faster than more pre-computation. This must be tested on your very specific case to check what the best solution could be.
haestflod Posted July 21, 2017

Have you tried using octrees? I also had 800+ meshes in my scene, and octrees helped my CPU performance a lot.
fenomas Posted July 22, 2017 (Author)

20 hours ago, jerome said: "not sure the gain is really high though because computing 800 quaternions (for the complete WM) is really fast actually."

This was my expectation as well, but for scenes with lots of simple meshes it seems to be the bottleneck, by a long shot. Here's a simple playground that demonstrates roughly what I'm talking about; for me, profiling it shows that about 50% of the total scripting time is spent inside computeWorldMatrix. (Profiling in the playground is iffy, but if you load that link and profile it without changing anything, nothing should get deopted, so it should be fine.)

14 hours ago, haestflod said: "Have you tried using octrees?"

I am using octrees. Performance is better with them than without them, but computeWorldMatrix is still the biggest bottleneck either way.
jerome Posted July 22, 2017

I'm not sure octrees are a good option when the meshes move in the world. The profiler confirms what you say... I'll have a look at why computeWorldMatrix() spends this much time.

[EDIT] When displaying the profile results as a tree (top-down), the time used by computeWorldMatrix() is "only" 37% of the 7200 ms total for me... which still seems a high ratio, imho:

self       self %    total      total %   function
226.6 ms   6.34 %    1322.5 ms  36.99 %   i.computeWorldMatrix          babylon.js:7
101.3 ms   2.83 %    198.7 ms   5.56 %    t.isSynchronized              babylon.js:6
80.6 ms    2.25 %    370.1 ms   10.35 %   t.multiplyToRef               babylon.js:2
48.2 ms    1.35 %    48.2 ms    1.35 %    t.copyFrom                    babylon.js:2
31.4 ms    0.88 %    31.4 ms    0.88 %    i.copyFrom                    babylon.js:1
21.0 ms    0.59 %    36.0 ms    1.01 %    t.RotationYawPitchRollToRef   babylon.js:2
14.0 ms    0.39 %    14.0 ms    0.39 %    t.ScalingToRef                babylon.js:2
12.6 ms    0.35 %    12.6 ms    0.35 %    get                           babylon.js:6
10.7 ms    0.30 %    10.7 ms    0.30 %    t.TranslationToRef            babylon.js:2
10.3 ms    0.29 %    10.3 ms    0.29 %    t.getScene                    babylon.js:6
4.8 ms     0.13 %    4.8 ms     0.13 %    get                           babylon.js:7
4.1 ms     0.12 %    4.1 ms     0.12 %    get                           babylon.js:7
3.4 ms     0.10 %    3.4 ms     0.10 %    get                           babylon.js:7
1.0 ms     0.03 %    1.0 ms     0.03 %    r.getRenderId                 babylon.js:9
0.9 ms     0.02 %    0.9 ms     0.02 %    get                           babylon.js:6
0 ms       0 %       349.8 ms   9.78 %    i._updateBoundingInfo         babylon.js:0

So most of the time in computeWorldMatrix() is spent in multiplyToRef() (10.35%) and in _updateBoundingInfo() (9.78%).

Matrix.multiplyToRef() then calls Matrix.multiplyToArray(), which consumes 8% of the total time: https://github.com/BabylonJS/Babylon.js/blob/master/src/Math/babylon.math.ts#L3376

That's 32 float allocations and 16 linear operations per call... so in your case, 32 x 800 = 25,600 float allocations per frame each time multiplyToRef() runs across the scene! I guess we could get rid of the float allocations, since we can't skip the linear operations. I used to make this kind of little optimization for ComputeNormals() or the SPS. Dozens of float allocations per frame don't really matter, but tens of thousands really start to matter.

For the bInfo update, most of the time (9.22%) is spent in bBox._update(), with no particular sub-call dominating: https://github.com/BabylonJS/Babylon.js/blob/master/src/Culling/babylon.boundingBox.ts#L66

Well, it's just that we do 800 x 8 box vertex computations and checks to localize them in the world.
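For reference, the hot function under discussion is essentially a 4x4 multiply into a preallocated output array. A simplified stand-in (not Babylon's actual code, which unrolls the loops and caches each element in a temp var) looks like this:

```javascript
// Simplified stand-in for Matrix.multiplyToArray: multiply two column-major
// 4x4 matrices (flat arrays of 16 floats, element (row, col) at col*4+row)
// into a caller-supplied output, so no new array is allocated per call.
function multiplyToArray(a, b, out) {
  for (let col = 0; col < 4; col++) {
    for (let row = 0; row < 4; row++) {
      let sum = 0;
      for (let k = 0; k < 4; k++) {
        sum += a[k * 4 + row] * b[col * 4 + k];
      }
      out[col * 4 + row] = sum;
    }
  }
  return out;
}

const I = [1, 0, 0, 0,  0, 1, 0, 0,  0, 0, 1, 0,  0, 0, 0, 1];
const T = [1, 0, 0, 0,  0, 1, 0, 0,  0, 0, 1, 0,  2, 3, 4, 1]; // translate (2,3,4)
const out = new Array(16);
multiplyToArray(T, I, out); // T times identity leaves T unchanged
```

The allocation question in the thread is not about `out` (which is reused) but about the 32 scalar temp vars the unrolled version declares on every call.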
JCPalmer Posted July 22, 2017

Worth mentioning: right after freezeWorldMatrix() was added, I changed it so that it calls computeWorldMatrix(true) internally. This can be valuable for meshes that only rarely move, cutting out one step. Not sure this applies to your situation, but if not all the meshes move every frame, you could just re-freeze each mesh on any frame it moves.

Some of the reason computeWorldMatrix() is so heavy is the parent checking and sync checking. It might be worth modifying it to check whether rotation or scale changed; if not, updating the translation only requires copying position [x, y, z] into matrix elements [12, 13, 14]. Then again, that's more checking.
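A sketch of that fast path, outside Babylon: compare the current rotation against a cached copy, and if nothing but position changed, write the translation slots directly. The mesh shape and caching fields here are hypothetical, not Babylon internals; a real version would compare scale (and quaternion rotation) the same way.

```javascript
// Translation-only fast path: if rotation is unchanged since the last
// update, skip the full rebuild and copy position into elements 12-14 of
// the column-major world matrix. Scale would be checked identically.
function updateWorldMatrix(mesh) {
  const r = mesh.rotation, c = mesh._cachedRot;
  if (r.x === c.x && r.y === c.y && r.z === c.z) {
    mesh.worldMatrix[12] = mesh.position.x;
    mesh.worldMatrix[13] = mesh.position.y;
    mesh.worldMatrix[14] = mesh.position.z;
    return "fast";
  }
  // ...otherwise fall through to the full rotation/scale/translation
  // rebuild and refresh the cache (omitted in this sketch)...
  return "full";
}

const mesh = {
  position: { x: 1, y: 2, z: 3 },
  rotation: { x: 0, y: 0, z: 0 },
  _cachedRot: { x: 0, y: 0, z: 0 },
  worldMatrix: new Float32Array([1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1]),
};
const path = updateWorldMatrix(mesh);
```

As the post says, the trade-off is that the equality checks themselves cost something on every mesh, every frame.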
jerome Posted July 22, 2017

If no billboard mode and no parent are used, multiplyToRef() (matrix multiplication) is still called several times in each call to computeWorldMatrix():

https://github.com/BabylonJS/Babylon.js/blob/master/src/Mesh/babylon.abstractMesh.ts#L1157
https://github.com/BabylonJS/Babylon.js/blob/master/src/Mesh/babylon.abstractMesh.ts#L1158

So, if I'm not wrong, at least 4 times per call to computeWorldMatrix(). This means, in @fenomas's case, 25,600 x 4 = 102,400 float allocations per frame. This could be avoided. I'll talk about this with @Deltakosh.
fenomas Posted July 23, 2017 (Author)

21 hours ago, jerome said: "when displaying the profile results as a tree (topdown), the percentage of the time used by computeWorldMatrix() ... is 'only' 37% of the total time for me ... what still seems a high ratio imho"

So, the absolute numbers will change from profile to profile, depending on your machine's CPU and so forth. What I usually do is compare the total time spent executing scripts to the time spent at some point in the call-tree graph. For example, if the root of the tree ("Animation frame fired") has a total time of 50%, and further down in the tree computeWorldMatrix has a total time of 25%, one can say that computeWorldMatrix accounts for about half the script execution time. On a slower machine it might be 80% and 40%, so the absolute numbers can be misleading, but the ratios tell you roughly what's going on.

21 hours ago, jerome said: "I guess we could get rid of the float allocations since we can't skip the linear operations."

When you start to talk about stuff like this, you really have to know what's going on inside V8 to make predictions about what will improve performance. For code like this, just because there are a lot of "var tm5 = this.m[5];" statements doesn't necessarily mean that the JS engine is allocating new floats; the optimizing compiler does a lot of magic, and it's hard to predict how it all works.

The best way I've found to test performance improvements for low-level stuff like this is to make two versions of the function I want to compare, and then route the code so that it alternates between the versions. Then you can just profile and see which function took more execution time.

For example, here's what this would look like for testing multiplyToArray: http://www.babylonjs-playground.com/#E2HVNG#1

Down at the bottom you can see that I define two alternate versions, one just like the original and one that doesn't declare temp vars. If you profile that, you should find that the original version is somewhat faster than the alternate. (Not to say that the function can't be improved; I think I can speed it up moderately, but the other stuff you're looking at sounds more likely to be valuable.)

10 hours ago, jerome said: "This could be avoided. I'll talk about this to @Deltakosh"

This part of the code I don't understand at all, but if calls can be skipped, that'd be cool. Thanks for looking at it!
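The alternating-profile technique described in that post, reduced to a skeleton: a dispatcher flips between two implementations on successive calls, so in a profiler each version accumulates roughly half the calls and their self-times can be compared directly. The two summing functions below are placeholder stand-ins for the variants under test.

```javascript
// A/B profiling harness sketch: route the hot path alternately through two
// versions of the same function. Both must produce identical results; only
// their execution cost differs, and the profiler shows each one's share.
function sumLoop(arr) {
  let s = 0;
  for (let i = 0; i < arr.length; i++) s += arr[i];
  return s;
}
function sumReduce(arr) {
  return arr.reduce((a, b) => a + b, 0);
}

let flip = false;
function sumAlternating(arr) {
  flip = !flip;
  return flip ? sumLoop(arr) : sumReduce(arr);
}
```

Call sumAlternating from the real hot path, profile, and compare the two functions' self-times in the call tree; this sidesteps the usual microbenchmark pitfalls of warming up and deopting one variant but not the other.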
jerome Posted July 23, 2017

As you said, the V8 engine does a lot of magic under the hood, and we can't easily predict where the gain would be. Nevertheless, when doing "var tm5 = someFloat", the engine has to create a float var anyway (floats are stored in the heap in JS, not on the stack), because tm5 can then be set to any other value.

I'm not that much of an expert, but I spent hours comparing the behavior of ComputeNormals() with and without the temp variables, which were only there for readability, at the time I optimized it (up to 5x faster). The same (I spent days there) with all the internal computations of the SPS (positions, normals, rotations, quaternions, uvs, colors), to try to make it almost as fast as the legacy 2D particle system.

Using a 10-year-old laptop to make those comparisons, I can say there is a substantial gain when we deal with more than 8-10K calls per frame. This is an empirical value obviously, but I noticed there was, on every machine, a limit where the CPU has so many things to do within 16 ms that skipping 10K scalar variable allocations (not objects; I'm not even talking about the GC here) per frame could make a real difference. Your case seems to reach this limit because, just counting them, it's about 100K floats pushed to and removed from the heap per frame. Even if it's only part of 8% of the time on my machine, I guess it's worth avoiding, since it can be done. DK is OK with this. Unfortunately I won't do it before the end of August or early September (no code for now).

Not sure it's possible in your case, but did you try the SPS approach? Store your 800 meshes in one SPS (if possible), then move them and compare the perfs...
fenomas Posted July 23, 2017 (Author)

7 hours ago, jerome said: "because tm5 can be set then with any other value"

Just some trivia, but V8 uses static single assignment (SSA) form, so internally local variables never change values. That is, if you reassign a new value to tm5, the optimizing compiler will compile it as if you had created a new variable.

7 hours ago, jerome said: "the engine has to create a float var anyway (floats are stored in the heap in JS, not in the stack)"

AFAIK V8 is smart enough to keep temp variables in registers if it knows they won't be needed for long. For the function we're talking about, I imagine it may not have enough registers to do this for all the floats, so some may end up on the heap as well, but this gets into the kind of area where V8 may not work the same way on all platforms, or may allocate registers differently next month than it does today. I don't think it's possible to say anything with certainty without looking at the decompiled IR.

That said, for what it's worth, I played a little with optimizing multiplyToArray and got a moderate speedup just by moving code around: http://www.babylonjs-playground.com/#E2HVNG#2

All the alternate version does is move some var assignments down to occur right before they're needed. My guess is that this lets the compiler make better guesses about how to reuse registers (e.g. putting tm0-3 into registers, and then later putting tm4-7 into those same registers once tm0-3 are no longer needed).

7 hours ago, jerome said: "Not sure it's possible for your own case, but did you try the SPS approach?"

I've experimented with it, but it's not ideal, since my scene doesn't always have 800 meshes; that's just a rough upper limit. So I'd probably need to create and destroy SPSes on demand, which would be hairy.
adam Posted July 31, 2017

Unfortunately, that multiplyToArray optimization would create a bug when a user does this: mat1.multiplyToRef(mat2, mat1)

Edit: I just looked more closely at the function (on my phone); I might be wrong. You should keep this case in mind, though.
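The aliasing hazard adam is pointing at, in miniature: when the output array is also one of the inputs, any input element read after an output element is written may see a clobbered value. Snapshotting every operand into locals before writing anything (which is exactly what the temp-var version of the real function does) makes `mat1.multiplyToRef(mat2, mat1)`-style calls safe. A 2x2 row-major toy version, not Babylon's code:

```javascript
// Unsafe: reads a[] after writing out[], so passing out === a corrupts the
// remaining reads. Looks fine in tests where the inputs and output differ.
function mulUnsafe(a, b, out) {
  out[0] = a[0] * b[0] + a[1] * b[2];
  out[1] = a[0] * b[1] + a[1] * b[3]; // a[0] already clobbered if out === a
  out[2] = a[2] * b[0] + a[3] * b[2];
  out[3] = a[2] * b[1] + a[3] * b[3];
  return out;
}

// Safe: snapshot all inputs into locals first; writes can then alias freely.
function mulSafe(a, b, out) {
  const a0 = a[0], a1 = a[1], a2 = a[2], a3 = a[3];
  const b0 = b[0], b1 = b[1], b2 = b[2], b3 = b[3];
  out[0] = a0 * b0 + a1 * b2;
  out[1] = a0 * b1 + a1 * b3;
  out[2] = a2 * b0 + a3 * b2;
  out[3] = a2 * b1 + a3 * b3;
  return out;
}
```

With distinct output arrays the two agree; the difference only shows up under aliasing, which is why such a bug can slip past a test suite.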
GameMonetize Posted July 31, 2017

Yes, that was my thinking as well.