Notice
In light of the passing of Ben Daglish this week, I would like to dedicate this page to him. He went to great pains to explain to us that his surname was spelled Duh-Ah-Ger-Lish. Taken way too soon.
Introduction
I just want to write a little bit about optimising your games, one or two of the things I have learned over the years. How relevant these things are now is rather moot, as the CPU has been relieved of the burden of plotting the graphics for the most part and therefore has plenty of spare time to do things deeply, or slowly.However, once you find a faster way of doing something, as long as it`s not ten times harder, then why not use those techniques knowing that you`re saving time that you could later use for something clever?
Over-defensive Coding
This came up a few articles ago, so I`d like to clarify what I mean.
Imagine you write a routine that takes as its input a number between one and a hundred. Having set the rule for the expectations of the function, should that function test the input to make sure it does indeed receive a number between one and a hundred, and what should it do if it doesn`t?
The answer slightly depends on whether you`re writing a black box function for someone else to use, or whether it`s just your internal function. If it`s a black box for someone else to use, then yes, it should always test the input value passed, and you need to publish what it will do if you get the input wrong, i.e. will it crash, will it pass a message back, or some failure-indicating value? Just for info: it should NEVER crash!
If you`re writing a function for your own use then you can be a bit cleverer. We usually develop code in a DEBUG mode, which produces less optimised code that can be seen by the debugger, and the final RELEASE mode code is compiled to be optimised, and not debuggable. In theory, both versions of the code will behave the same, but most of us know that differences can occur because the compiler does some helpful things in DEBUG mode, such as zeroise all your variables. Additionally the programmer can detect which mode is being compiled, and you can put your additional validation tests only in the DEBUG mode. The idea being that you thoroughly test your code in DEBUG mode, and deal with any incorrect calls long before delivery. Error messages can be pushed out to log files that won`t be produced in the final RELEASE mode. You can also just plant a break-point in the test failure side to make darned sure the code stops if you send in an invalid value. If it kicks in then you can investigate and fix it there and then.
Way back in the 8-bit days when CPUs were running at 1MHz and scrolling the screen took 60% of your time per frame, every test you didn`t need to do was important.
Scaling your numbers to be computer-friendly was important. I have spoken to some of my work colleagues about hexadecimal numbers and they just glaze over or look blankly back. Most modern applications don`t directly involve binary or hexadecimal, so they`ve never learned to think in hex nor binary, but for games coding they offer some lovely short-cuts.
Let`s take the above example of validating your input of between zero and a hundred, an inherently decimal test. You`re going to have to code something like this:
long Answer = 0;
if ((Input >= 0) && (Input <= 100 ) { // Two tests
Answer = DoOurStuff(Input);
} else {
Answer = -1; // Signal failure
printf("Invalid value %ld passed.\n", Input );
}
return(Answer);
We`ve created two tests that will be done every time, and if we`ve debugged our calling code correctly they will never fail. You`ve also had to code two more lines of error reporting, and somewhere a jump will have been compiled in to get to the return statement.
Now supposing we re-scale our input to be a value between 0 and 127, or in hex: 0x00 and 0x7f. We can ensure we always get a valid value by logically ANDing the input with 0x7f. That doesn`t even require a test, we have simply removed any bits from the input that we are not interested in. We should however still test the input for range, in DEBUG mode at least to ensure that the caller knows they`ve passed something bad through, but we can do that in one test now:
{
long Answer = 0;
long Answer = 0;
long VettedInput = Input&0x7f;
if (VettedInput == Input) { // One test only
Answer = DoOurStuff(Input);
} else {
Answer = -1;
printf("Invalid value %ld passed.\n", Input );
}
return(Answer);
}
}
Things get a bit messy with a DEBUG mode test as we`re removing two chunks of code. Any time we remove one squirly with conditional compilation there must be another block with the same test that removes the corresponding squirly.
{
long Answer = 0;
#ifdef _DEBUG
#ifdef _DEBUG
long VettedInput = Input&0x7f;
if (VettedInput == Input) { // One test only
#endif
#endif
Answer = DoOurStuff(Input);
#ifdef _DEBUG
#ifdef _DEBUG
} else {
Answer = -1;
printf("Invalid value %ld passed.\n", Input );
}
#endif
#endif
return(Answer);
}_DEBUG is a compiler flag that is only defined in DEBUG mode, so we can test if it exists and include the extra code. Usually the non-debug RELEASE code has NDEBUG defined. You can, of course, include your own user-defined switches unique to each mode.
The code is even simpler if you need a signed number between -128 and 127, or an unsigned byte from 0 to 255, you can just use the byte as-is because there are no invalid inputs.
When I was writing Morpheus, way back in 1987, and decided that I wanted some circular patterns, I developed some custom sine and cosine lookup tables. You can always decide on the size of such tables depending on how much accuracy you want. Did we use degrees? No, 360 doesn`t fit in a byte, which was a convenient size to pass around. Did we use radians? No, we didn`t have floating point on the C64, and though we could have scaled some hex values and use 2.6 bits, we decided that there should be 256 degrees in a circle, that would be enough. Just fits in a byte, and if you`re adding angles you get automatic wraparound if you stick to bytes. My array of sines and cosines would be byte values of S1.6 bits, so not especially accurate, but good enough. The sine and cosine values can be read from the same table, offset by 90 degrees, or hex $40. Since the offsets are byte values, and there are 256 of them, adding $40 just wraps around nicely if necessary.
Even today when using SFML, the tests that I have to keep doing after angle calculations to ensure that my angles in degrees don`t go over 360 or under 0 irritate me. SFML must be doing something like this too when called upon to display an object rotated.
Even today when using SFML, the tests that I have to keep doing after angle calculations to ensure that my angles in degrees don`t go over 360 or under 0 irritate me. SFML must be doing something like this too when called upon to display an object rotated.
Being a library for everyone to use, you can`t trust that everyone will get it right so you have to do range-checks. Range-checking a floating point value and resolving it into a valid one may well involve a number of instructions. In any case, we don`t want our angles going out range either, so we should always be doing our limit checks. In fact, any one operation that I do on angles by addition or subtraction will only push numbers into the range -359 to + 718, so I know that if my angle goes below zero I just have to add 360, and if I go into 360 or above then I just subtract 360, I'll never have to deal with numbers any bigger, i.e. repeat the test and correction. Of course logically ANDing the number with 0xff for a 256-degree system is much faster!
Actually there is nothing to stop anyone from implementing a 256 degree system of their own. We had one on the C64, and then again in 68000 assembler. Once you`ve been thinking about 256 degrees in a circle for a few weeks it becomes totally acceptable. After all, 360 was pretty arbitrarily decided upon, it`s just a nice number that you can divide by 2, 3, 4, 6, 8, 10, 12, 15, 20, probably just to make slicing cakes easier! I always use a protractor to make sure I get the biggest slice. Doesn`t everyone?
As it turned out, angle 0 wasn`t straight up either, it was to the right. With a co-ordinate origin of 0,0 in the top left, the sines and cosines just work out that 3 o`clock is angle zero and they went clockwise. There's nothing to stop you setting them up however you want to, of course, but we want the maths to be as simple as possible.
In keeping with making things as simple as possible on the Amiga, we used signed 1.14 numbers for the sines and cosines, or 14 binary places in a 16 bit two-byte word. We need values from 1 to -1, and while we could fit in some invalid bigger numbers than 1, we can`t quite get the full range in 15 bits, with a sign bit.
To accommodate the maths nicely, we then chose to store our 16-bit polar speed as an S5.10 signed number, making the hex value 0x0400 equivalent to 1 pixel. That way, when we multiplied an S1.14 sine or cosine by a polar S5.10 speed we get S7.24 signed 32-bit answers for X and Y, where suddenly the number of pixels is in the top byte, making it easier to read. We then had to sign-shift it to the right by 8 bits to make our Cartesian co-ordinate system S15.16. That way we could lift the pixel position directly out of the top word for plotting. 8 bits of pixel positions isn`t enough even for a single screen, as anyone who has set up a C64 hardware sprite will tell you.
There`s another technique in there then, whereby we have the pixels in the top word of a 4-byte long, and the sub-pixels in the lower word. We can add very fine movements in 32 bits and then store the value in our object stuctures, but then for plotting the object where we only need the whole pixel positions we can just tell the CPU to lift out the top half of the variable, ignoring the rest.
It`s useful to be able to read your values easily with the debugger, so keep the numbers scaled to byte boundaries where possible. Floating point offers its own solutions and as long as it is accurate enough for you and fast enough then it is certainly the modern way. The debugger will happily decode them for you into something displayable. Bear in mind that they are not able to resolve all values, unfortunately they do struggle with base 10. I wouldn`t recommend them for monetary calculations because of the inaccuracies, you at least need to be familiar with principles of rounding, and how many digits you can represent. Some sort of packed decimal representation would be better. The 8086 chips had packed decimal instructions, and while C doesn`t directly support packed decimal, I would expect there to be libraries that do support it. For financial organisations, the accuracy is more important than execution speed. Currency conversions are an especial delight.
Saving Space
This isn`t so important any more, but how many times have you seen someone code a single flag in a long variable? That`s 4 bytes, or nowadays even 8, that simply says "Yes!" or "No!" Being familiar with the assembler instruction sets is such an advantage when writing C because you can tell how it`s going to translate your C instruction into machine code. If you write C that does what a single or pair of assembler instructions can do, then the compiler will do just that, it`s not going to mess around creating a mass of code, especially in RELEASE optimised mode. Testing a single bit in a long is something that machine code can do. Putting all of your boolean flags into one long keeps them all together, and you can give the bit numbers a name in an enum list, equivalent to giving each a variable name. Setting, clearing or testing one arbitrary bit resolves down to a pair of assembler instructions too, just as quick as testing a variable. Now you can get 32 (or 64) bit flags in one variable. You quickly get used to the logical operations that can read, clear, set, or toggle one specific setting without altering any of the others.Binary Mathematics
When working in assembler, especially on a CPU with no divide, and possibly not even a multiply, you start to think about how to do maths in binary. There are instructions on the CPU to shift numbers to the left or right. Sometimes you might need those to slide graphics into the correct position, since in the 8-bit days 8 pixels would share a byte, or 4 multi-colour pixels, and in the 16-bit Amiga days, you`d have 16 bits in a word representing 16 pixels in one bit-plane, and the display chip needed to read multiple bit-planes to determine the colours for those 16 pixels, having an address register for each bit-plane. On the one one hand we got user-definable numbers of bit-planes, but plotting was a nightmare. Yes the blitter did make the job a tad easier, but understanding how to get it to do what we wanted resulted in any number of desperate phone calls to Commodore technical support. The later PC graphics cards got it right with 1 byte per pixel, leading to the 4 bytes per pixel that we enjoy nowadays.Taking a byte value and shifting it left multiplies it by 2. Shifting right offers a choice of 2 instructions, since the top bit is the sign bit of the number. If you are dealing with negative numbers then you would need an arithmetic shift right to divide by 2, which retains the value of the top bit after the shift. A logical shift right regards all numbers as unsigned and shifts all the bits down, introducing a zero at the top. Now we can divide and multiply by powers of 2, so there's a real effort needed to make sure you`re using powers of 2. On the Amiga and ST it was faster to get a multiple of 40, for example, by shifting left by 3, copying the value to another register, shift left by another 2, and then add the copied value. We multiplied the number by 8, saved it, multiplied it by another 4, making 32, then added the multiple of 8, making 40.
It`s probably not worth doing that on a modern PC, though the syntax exists to do it in C. I use left shifts to get my boolean flag bits tested, i.e.
enum {
SessionOver = 0,
Paused,
SuspendTime,
GameFlagsMax};
long glGameFlags = 0;
if ((glGameFlags & ( 1 << SuspendTime)) == 0) {
// Time not suspended code
}
In the above, SuspendTime will be 2, so we need to isolate bit 2 to perform our test. We logically AND only the bit we're interested in. The compiler will actually do the shifting and resolve bit 2 to be the value 0x00000004, it won`t be done at run-time anyway.
Random Numbers
I`ve seen people ask questions such as: "how do I get a random number between 1 and 3 quickly?" The fact is that most random number generators are algorithm-based, so no random numbers are delivered all that quickly. In fact, the only time I`ve found a hardware way of getting a random number was on the C64 SID chip. You could set the third channel to noise waveform, rack up the frequency, and read out out the current waveform value. I used to direct all the noise sound effects to that channel, and when it had finished I switched the sound channel to play silent high-frequency white noise. Any time you need a random number; just read it off the SID chip: 1 assembler instruction, nothing faster!The advantage of algorithm-generated random numbers is that they can often be seeded, so you can get the same set of numbers out every time. In my latest game I could watch the first titles sequence, with supposedly random things happening, and every time the first demo sequence ran, it did exactly the same thing. To change that, I now get a timestamp of when the program starts, and seed the system random number generator with that. Now I get a different demo every time I start the program. I`m still not happy about the time that the system algorithm takes to supply a random number. Firstly, I have no idea actually what algorithm it is using, but secondly, that might change at any time if someone decides to alter the C library.
In order to supply my game with random numbers faster than the system can, I get an array of 256 numbers in advance from the system. I refresh that every time a game starts. Getting 256 numbers at the beginning of a game doesn`t take a great deal of time in the grand scheme of things, the player is just getting ready to play and likely you`re doing some kind of arrival sequence anyway. I then keep one index pointer into the array of random numbers and my fast call pulls out the next entry and shifts the index along by one. It just has to make sure it doesn`t fall off the end of the table. Actually an assembler optimisation for that would be to start at the end of the table so you decrement the index and easily detect the underflow case that resets the index to the end. It saves a compare. In C, the get a random number function is so short that on a RELEASE build it`ll likely drop in the code in the function rather than calling it.
Coming back to the random number between 1 and 3 then... I would always try to avoid anything that doesn`t involves powers of 2. Scaling our 32-bit random number to a lower power of 2 is again just a case of logically ANDing the random number with a mask to reduce the scale. ANDing with 0x03 gives you an equal chance of 0, 1, 2 or 3. None of the powers of 2 are divisible by 3, strangely enough. If you do want to do that then you could do it quicker by, say, getting number between 0 and 255, and testing it for being less than 86 for 1 route, less than 172 for the second, else the third. The only faster way is to set up a specific array of random numbers between 0 and 2 for later diving into.
The process of setting up an array with answers to complex calculations in advance is nothing new. On the C64 the screen was 40 characters wide, and there was no multiply instruction, so a little table of 25 entries with multiples of 40 in it was great for calculating the screen character address at a particular co-ordinate. You would divide the Y pixel co-ordinate by 8 by using the logical-shift right instruction 3 times, then look up the multiple of 40 in your table, then take the X pixel co-ordinate and divide by 8, and add that on.
Square Roots
In the days before floating point, some calculations used square roots, which is a hideously slow answer to get, for example, in a Pythagoras distance calculation, so we needed a simpler way. Some clever mathematicians have worked out some simpler calculations that give approximations. The calculation we used gave an answer within 10%, which is fine to know whether something is close to the player or not. In fact, just taking the bigger of X or Y (or Z, in 3D) would probably be enough to be informed, with a worst case 45 degree 41% inaccuracy (in 2D).
Just as a curiosity, here`s our 68000 approximate square root function. I didn`t write it, as far as I remember, I just said "Dominic, how do I..." and it was done.
;---------------------------------------------------------------
; Square root function
; Expects - d0.l unsigned number to find root of
; Returns - d0.l unchanged
; - d1.w root of d0.l
;
;----------------------------------------------------------------
SquareRoot
push.l d2
move.w #$0100,d1 ; First guess
Rept 4
move.l d0,d2 ; Square
divu d1,d2 ; result=square/guess
add.w d2,d1 ; guess=result+guess
asr.w #1,d1 ; better guess=guess/2
cmp.w d1,d2
beq.s .DoneRoot
EndR
move.l d0,d2 ; Last attempt if
divu d1,d2 ; not already got
add.w d2,d1 ; the correct
asr.w #1,d1 ; result
.DoneRoot
pop.l d2
rts
We found this to be more than accurate enough for our needs. The "Rept 4" is an assembler directive to repeat the block of code to the EndR. You could increase the number to get more accurate results, but we found that those 4 with the fast get-out plus the final one will get you good results. Depending on what you need the square root for; your degree of accuracy might vary. There`s nothing to stop you having one super-accurate routine if you need it, and a less accurate one where you don`t need total accuracy but it`s quicker.
There`s no reason why you shouldn`t have a bit of apparently random behaviour that affects things that is purely down to mathematical inaccuracies. After all, behaviour depends upon a lot of other factors which you`re not necessarily including. Quite often we`d put in some random number fetching to affect behaviour anyway.
Just as a curiosity, here`s our 68000 approximate square root function. I didn`t write it, as far as I remember, I just said "Dominic, how do I..." and it was done.
;---------------------------------------------------------------
; Square root function
; Expects - d0.l unsigned number to find root of
; Returns - d0.l unchanged
; - d1.w root of d0.l
;
;----------------------------------------------------------------
SquareRoot
push.l d2
move.w #$0100,d1 ; First guess
Rept 4
move.l d0,d2 ; Square
divu d1,d2 ; result=square/guess
add.w d2,d1 ; guess=result+guess
asr.w #1,d1 ; better guess=guess/2
cmp.w d1,d2
beq.s .DoneRoot
EndR
move.l d0,d2 ; Last attempt if
divu d1,d2 ; not already got
add.w d2,d1 ; the correct
asr.w #1,d1 ; result
.DoneRoot
pop.l d2
rts
We found this to be more than accurate enough for our needs. The "Rept 4" is an assembler directive to repeat the block of code to the EndR. You could increase the number to get more accurate results, but we found that those 4 with the fast get-out plus the final one will get you good results. Depending on what you need the square root for; your degree of accuracy might vary. There`s nothing to stop you having one super-accurate routine if you need it, and a less accurate one where you don`t need total accuracy but it`s quicker.
There`s no reason why you shouldn`t have a bit of apparently random behaviour that affects things that is purely down to mathematical inaccuracies. After all, behaviour depends upon a lot of other factors which you`re not necessarily including. Quite often we`d put in some random number fetching to affect behaviour anyway.
Conclusion
If you can work out some answers in advance and look them up at run-time then you`ll save some CPU, at the expense of the space that the answers take up. All the better if the calculations are quite complex to get the answers you need, as long as the set of different inputs isn`t too large.
It`ll take a lot longer if you have to optimise at the end of your project because the game isn`t fast enough. Better to think efficiency all the way through and see what you can do with your desired frame rate.
It`ll take a lot longer if you have to optimise at the end of your project because the game isn`t fast enough. Better to think efficiency all the way through and see what you can do with your desired frame rate.