JCPalmer Posted April 7, 2015

The Single Instruction Multiple Data (SIMD) discussion started on the Wingnut Chronicles, but I think it needs its own topic. To recap, so that most of it is here:

Intel & browser makers were working on allowing a set of special CPU instructions to be called from within Javascript. See https://software.intel.com/en-us/articles/simd-javascript-faster-html5-apps

Deltakosh has enabled many parallel methods in Math.ts to be swapped out for SIMD versions when SIMD is detected.

Though not yet in production, testing can be run on Firefox nightly, Version 40, which I put on my system.

I found a SIMD test page, which I ran on both Firefox 37 and 40. Here is a picture of improvement. Remember not everything can be done using this.

I did some more testing over the weekend using BJS scenes, but found no discernible change. I have been thinking about this for a while. Those math methods are probably not called often enough to make a difference, and probably more than this is required to make a great impact. SIMD works using Float32Arrays, which BJS does very little of. Those Math.ts methods seem to have to load stuff to float32. (Hard to tell without any documentation.)

The Morph extension I made deals exclusively with Float32Arrays. In the beforeRender, each morph is an interpolation of position & normal end points. I am thinking about coming up with a 2nd way to do this which does the +, the -, and the * in batches. No loading to float32 is required, but I do not know if or where SIMD.js is documented, so I would have to reverse engineer it from Math.ts and that test page.
Does this look feasible?

    /**
     * Called by the beforeRender() registered by this._mesh
     * ShapeKeyGroup is a subclass of POV.BeforeRenderer, so need to call its beforeRender method, _incrementallyMove()
     * @param {Float32Array} positions - Array of the positions for the entire mesh, portion updated based on _affectedPositionElements
     * @param {Float32Array} normals - Array of the normals for the entire mesh, portion updated based on _affectedVertices
     */
    public _incrementallyDeform(positions : Float32Array, normals : Float32Array) : boolean {
        super._incrementallyMove();

        // test of this._currentSeries is duplicated, since super.incrementallyMove() cannot return a value
        // is possible to have a MotionEvent(with no deformation), which is not a ReferenceDeformation sub-class
        if (this._currentSeries === null || !(this._currentStepInSeries instanceof MORPH.ReferenceDeformation) ) return false;

        if (this._ratioComplete < 0) return false; // MotionEvent.BLOCKED or MotionEvent.WAITING

        // update the positions
        for (var i = 0; i < this._nPosElements; i++){
            positions[this._affectedPositionElements[i]] = this._priorFinalPositionVals[i] + ((this._currFinalPositionVals[i] - this._priorFinalPositionVals[i]) * this._ratioComplete);
        }

        // update the normals
        var mIdx : number, kIdx : number;
        for (var i = 0; i < this._nVertices; i++){
            mIdx = 3 * this._affectedVertices[i]; // offset for this vertex in the entire mesh
            kIdx = 3 * i;                         // offset for this vertex in the shape key group
            normals[mIdx    ] = this._priorFinalNormalVals[kIdx    ] + ((this._currFinalNormalVals[kIdx    ] - this._priorFinalNormalVals[kIdx    ]) * this._ratioComplete);
            normals[mIdx + 1] = this._priorFinalNormalVals[kIdx + 1] + ((this._currFinalNormalVals[kIdx + 1] - this._priorFinalNormalVals[kIdx + 1]) * this._ratioComplete);
            normals[mIdx + 2] = this._priorFinalNormalVals[kIdx + 2] + ((this._currFinalNormalVals[kIdx + 2] - this._priorFinalNormalVals[kIdx + 2]) * this._ratioComplete);
        }
        return true;
    }
GameMonetize Posted April 7, 2015

I'll ask people from Intel to swing by here to help you. But AFAIC, this is PERFECTLY feasible. From the Babylon.js point of view, bones for instance can get a lot of improvement when SIMD.js is available.
PeterJensen Posted April 8, 2015

Thank you @JCPalmer for taking an interest in SIMD.js.

Currently, our documentation is in the form of a polyfill. We've gone to great lengths to have the functionality and semantics of the polyfill match the implementations. The polyfill, tests, and benchmarks we've been using currently reside here: https://github.com/johnmccutchan/ecmascript_simd

Your example code is a perfect candidate for using SIMD to get a ~4x speedup. I've taken a stab at rewriting your code. This is just me writing code, so there might be both syntax errors and functional errors, but at least you'll get the gist.

    // update the positions. 4 at a time
    for (var i = 0; i <= this._nPosElements-4; i += 4){
        var priorFinalPositionVals = SIMD.float32x4.load(this._priorFinalPositionVals, i);
        var currFinalPositionVals  = SIMD.float32x4.load(this._currFinalPositionVals, i);
        var ratioComplete          = SIMD.float32x4.splat(this._ratioComplete);
        var positionx4             = SIMD.float32x4.add(priorFinalPositionVals, SIMD.float32x4.mul(SIMD.float32x4.sub(currFinalPositionVals, priorFinalPositionVals), ratioComplete));
        // store writes 4 consecutive floats starting at the given index,
        // so this assumes the 4 affected elements are contiguous in positions
        SIMD.float32x4.store(positions, this._affectedPositionElements[i], positionx4);
    }

    // handle possible remainder
    for (var i = this._nPosElements & ~0x3; i < this._nPosElements; i++){
        positions[this._affectedPositionElements[i]] = this._priorFinalPositionVals[i] + ((this._currFinalPositionVals[i] - this._priorFinalPositionVals[i]) * this._ratioComplete);
    }

    // update the normals
    var mIdx : number, kIdx : number;
    for (var i = 0; i < this._nVertices; i++){
        mIdx = 3 * this._affectedVertices[i]; // offset for this vertex in the entire mesh
        kIdx = 3 * i;                         // offset for this vertex in the shape key group
        var currFinalNormalVals  = SIMD.float32x4.loadXYZ(this._currFinalNormalVals, kIdx);
        var priorFinalNormalVals = SIMD.float32x4.loadXYZ(this._priorFinalNormalVals, kIdx);
        var ratioComplete        = SIMD.float32x4.splat(this._ratioComplete);
        var normalx4             = SIMD.float32x4.add(priorFinalNormalVals, SIMD.float32x4.mul(SIMD.float32x4.sub(currFinalNormalVals, priorFinalNormalVals), ratioComplete));
        SIMD.float32x4.storeXYZ(normals, mIdx, normalx4);
    }

Besides these extensions being in FF nightly, there's also a Chromium prototype available (developed by Intel). You should be able to download that here: https://drive.google.com/open?id=0B9RVWZYRtYFeWTFoNUJfUkdDRlE&authuser=0

I'll try to get your little code snippet extracted into a benchmark kernel that we can use in our benchmarking framework. Again, thanks for writing this post and providing the code snippet.

Peter Jensen
Intel
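For readers without a SIMD-enabled build, the loop shape in the post above (groups of 4, then a scalar remainder loop) can be tried out in plain TypeScript. The following is only a sketch under assumptions: `lerpInChunksOf4` and `scatter` are hypothetical names, no SIMD API is called, and the result is written to a dense buffer first and then scattered through an index list, which is one possible way to handle affected elements that are not contiguous in the mesh array.

```typescript
// Scalar stand-in for a 4-lane interpolation. No SIMD API is used here;
// the function and parameter names are hypothetical.
function lerpInChunksOf4(
    prior: Float32Array,
    curr: Float32Array,
    ratio: number,
    out: Float32Array
): void {
    const n = prior.length;
    const vectorEnd = n & ~0x3; // largest multiple of 4 that is <= n

    // Main loop: each iteration covers one group of 4 elements,
    // which is the unit a float32x4 operation would process.
    for (let i = 0; i < vectorEnd; i += 4) {
        out[i]     = prior[i]     + (curr[i]     - prior[i])     * ratio;
        out[i + 1] = prior[i + 1] + (curr[i + 1] - prior[i + 1]) * ratio;
        out[i + 2] = prior[i + 2] + (curr[i + 2] - prior[i + 2]) * ratio;
        out[i + 3] = prior[i + 3] + (curr[i + 3] - prior[i + 3]) * ratio;
    }

    // Remainder loop: the 0 to 3 elements left when n is not a multiple of 4.
    for (let i = vectorEnd; i < n; i++) {
        out[i] = prior[i] + (curr[i] - prior[i]) * ratio;
    }
}

// Copies the dense result into the full mesh array through an index list.
function scatter(dense: Float32Array, indices: Uint32Array, target: Float32Array): void {
    for (let i = 0; i < indices.length; i++) {
        target[indices[i]] = dense[i];
    }
}
```

Whether the extra scatter pass costs more than the batching saves would have to be measured; nothing here predicts the outcome.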
JCPalmer Posted April 8, 2015

Thanks! That was more than I had expected. I already have changes to that file that are not checked in. Early next week, I will try to implement this. I also have a sample scene to test with. I could add a switch to the scene to force it not to use SIMD, but will just switch between browsers initially.
JCPalmer Posted May 11, 2015

Probably just a note to myself, but in case anyone cares: Firefox nightly v40 runs pretty poorly relative to v37. This makes any improvements difficult to see. I am just going to stick with it for now though, and get the thing running first.
JCPalmer Posted May 11, 2015

Ok, now I am into this, finally. I started looking at the doc link. I even started trying to make a d.ts from the full API source code. (I saw how Math.ts got around a d.ts in line 3.) I have my first questions, but first: I have broken out my 2 operations (positions & normals) into separate functions, so I can do the swap out like Math.ts. Here are both versions of updatePositions:

    private updatePositions(positions : Float32Array) : void {
        for (var i = 0; i < this._nPosElements; i++){
            positions[this._affectedPositionElements[i]] = this._priorFinalPositionVals[i] + ((this._currFinalPositionVals[i] - this._priorFinalPositionVals[i]) * this._ratioComplete);
        }
    }

    private updatePositionsSIMD(positions : Float32Array) : void{
        for (var i = 0; i <= this._nPosElements-4; i += 4){
            var priorFinalPositionVals = SIMD.float32x4.load(this._priorFinalPositionVals, i);
            var currFinalPositionVals  = SIMD.float32x4.load(this._currFinalPositionVals, i);
            var ratioComplete          = SIMD.float32x4.splat(this._ratioComplete);
            var positionx4             = SIMD.float32x4.add(priorFinalPositionVals, SIMD.float32x4.mul(SIMD.float32x4.sub(currFinalPositionVals, priorFinalPositionVals), ratioComplete));
            SIMD.float32x4.store(positions, this._affectedPositionElements[i], positionx4);
        }
    }

When I looked at the source code for the static function float32x4.load(), I found it is a helper function with all kinds of checking & calling of other helper functions. Trust me, I value checking arguments & see its importance in a typeless Javascript world. But I am coming from Typescript, and my args are explicitly Float32Array. Paying all this overhead seems like it would cost more than I would save.

    /**
     * @param {Typed array} tarray An instance of a typed array.
     * @param {Number} index An instance of Number.
     * @return {float32x4} New instance of float32x4.
     */
    SIMD.float32x4.load = function(tarray, index) {
        if (!isTypedArray(tarray))
            throw new TypeError("The 1st argument must be a typed array.");
        if (!isInt32(index))
            throw new TypeError("The 2nd argument must be an Int32.");
        var bpe = tarray.BYTES_PER_ELEMENT;
        if (index < 0 || (index * bpe + 16) > tarray.byteLength)
            throw new RangeError("The value of index is invalid.");
        var f32temp = _f32x4;
        var array = bpe == 1 ? _i8x16 :
                    bpe == 2 ? _i16x8 :
                    bpe == 4 ? (tarray instanceof Float32Array ? f32temp : _i32x4) :
                    _f64x2;
        var n = 16 / bpe;
        for (var i = 0; i < n; ++i)
            array[i] = tarray[index + i];
        return SIMD.float32x4(f32temp[0], f32temp[1], f32temp[2], f32temp[3]);
    }

I wrote a 2nd SIMD version, using the float32x4 constructor directly, bypassing all that.

    private updatePositionsSIMDToo(positions : Float32Array) : void{
        var ratioComplete = SIMD.float32x4(this._ratioComplete, this._ratioComplete, this._ratioComplete, this._ratioComplete);
        for (var i = 0; i <= this._nPosElements-4; i += 4){
            var priorFinalPositionVals = SIMD.float32x4(this._priorFinalPositionVals[i], this._priorFinalPositionVals[i + 1], this._priorFinalPositionVals[i + 2], this._priorFinalPositionVals[i + 3]);
            var currFinalPositionVals  = SIMD.float32x4(this._currFinalPositionVals [i], this._currFinalPositionVals [i + 1], this._currFinalPositionVals [i + 2], this._currFinalPositionVals [i + 3]);
            var positionx4             = SIMD.float32x4.add(priorFinalPositionVals, SIMD.float32x4.mul(SIMD.float32x4.sub(currFinalPositionVals, priorFinalPositionVals), ratioComplete));
            SIMD.float32x4.store(positions, this._affectedPositionElements[i], positionx4);
        }
    }

As soon as I get the swapper ready I will test both ways. I thought I would give you this feedback for comment. Also, I think I will need to take into account at the end that the # of positions may not be evenly divisible by 4, right?
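The "swapper" mentioned in the post above could take roughly the following shape. This is a hedged sketch, not the Math.ts mechanism: `PositionUpdater`, `updateScalar`, `updateSimdPlaceholder`, `simdAvailable`, and `chooseUpdater` are all hypothetical names, the SIMD branch is a placeholder that reuses the scalar loop so the code runs anywhere, and detection is assumed to be a check for a global named `SIMD`. The `forceScalar` flag corresponds to the idea, raised earlier in the thread, of a switch that disables SIMD for comparison runs.

```typescript
// Signature shared by every interchangeable implementation (hypothetical).
type PositionUpdater = (
    prior: Float32Array,
    curr: Float32Array,
    ratio: number,
    out: Float32Array
) => void;

// Plain per-element interpolation.
function updateScalar(prior: Float32Array, curr: Float32Array, ratio: number, out: Float32Array): void {
    for (let i = 0; i < prior.length; i++) {
        out[i] = prior[i] + (curr[i] - prior[i]) * ratio;
    }
}

// Placeholder for a SIMD-backed version. It delegates to the scalar loop
// so that this sketch stays runnable where no SIMD global exists.
function updateSimdPlaceholder(prior: Float32Array, curr: Float32Array, ratio: number, out: Float32Array): void {
    updateScalar(prior, curr, ratio, out);
}

// Assumption: support is signalled by a global object named SIMD.
function simdAvailable(): boolean {
    return typeof (globalThis as any).SIMD !== "undefined";
}

// Picks the implementation once, outside the render loop.
function chooseUpdater(forceScalar: boolean): PositionUpdater {
    return !forceScalar && simdAvailable() ? updateSimdPlaceholder : updateScalar;
}

const updatePositions: PositionUpdater = chooseUpdater(false);
```

Choosing once at start-up keeps the per-frame call free of any feature test, and both paths can be timed in the same browser by flipping `forceScalar`.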
JCPalmer Posted May 12, 2015

Well, I have not built the swapper yet; I just compiled referencing a different method each time (pure Javascript, SIMD.load, & SIMD constructor). Morph has a built-in wall clock tracker isolating just the deformations. They were all very close, but Javascript was the fastest. I have only done positions so far. Adding the normals interpolation is next.

One thing I saw in the readme.md was Float32x4Array. The entire class is using typed arrays already. Maybe generate it as Float32x4Array & Uint32x4Array. Then just do the calc without all the throwaway float32x4 values to garbage collect.
JCPalmer Posted May 12, 2015

Wow, the normals using SIMD were 4x slower than Javascript. I am thinking more about using the Float32x4Array. Doing the whole class is too much work. Making versions of the prior & current arrays (positions & normals) as Float32x4Arrays, outside of the render loop, is next.
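Preparing the arrays outside the render loop, as proposed in the post above, might start with layout changes like the ones below. This is a sketch under assumptions: `padToMultipleOf4` and `xyzToXyzw` are hypothetical helpers, plain Float32Array is used because the Float32x4Array API is not documented in this thread, and storing normals as xyzw with an unused fourth lane is just one layout that makes every vertex a full group of 4. Its effect on speed has not been measured here.

```typescript
// Returns a zero-padded copy whose length is the next multiple of 4,
// so a 4-at-a-time loop needs no remainder handling.
function padToMultipleOf4(src: Float32Array): Float32Array {
    const paddedLength = (src.length + 3) & ~0x3;
    const padded = new Float32Array(paddedLength); // typed arrays start zero-filled
    padded.set(src);
    return padded;
}

// Re-packs xyz triples into xyzw groups of 4, leaving w at 0.
// Any trailing elements that do not form a full triple are ignored.
function xyzToXyzw(src: Float32Array): Float32Array {
    const vertexCount = Math.floor(src.length / 3);
    const out = new Float32Array(vertexCount * 4);
    for (let v = 0; v < vertexCount; v++) {
        out[4 * v]     = src[3 * v];
        out[4 * v + 1] = src[3 * v + 1];
        out[4 * v + 2] = src[3 * v + 2];
    }
    return out;
}
```

Both helpers allocate, so they belong in set-up code that runs when a shape key is created or changed, not in the per-frame path.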