Dec 192017
Hints on how to optimize C code for the utmost in performance. | |||
---|---|---|---|
File Name | File Size | Zip Size | Zip Type |
HINTS.DOC | 31706 | 11283 | deflated |
Download File OPTC38.ZIP Here
Contents of the HINTS.DOC file
HINTS FOR EFFICIENT PROGRAMMING
Copyright (c) 1992 by Omega Point, Inc.
Author: Ratko V. Tomic
-------------------------------------------------------------------------
NOTE: The text below is taken from the CodeRunneR (R) manual. CodeRunneR
is a TSR library for C and assembly programmers, thus many hints
deal implicitly with size optimization. For more information about
CodeRunneR call Omega Point (508) 877-1819 or fax (508) 877-0915.
-------------------------------------------------------------------------
CodeRunneR was designed with the greatest attention for efficiency (size
and speed), all functions were carefully crafted in assembly language.
Combined with some care in the C code you can produce programs rivaling
in size to average assembly language programs, especially on medium and
large projects. On very small programs (few hundred bytes), the assembly
language will still be ahead. In terms of speed, the video functions
(at least) should be competitive with pure assembly code.
The C programming hints below have been found useful in reducing the
efficiency gap between C and assembly code.
1. INT or CHAR - Current compilers are notoriously bad in handling
expressions involving byte size data. The reason for this is an
overly literal interpretation of C "abstract machine", which mandates
that the result should be evaluated as if extended to integer. On the
other hand, the xxx86 processors have a rich subset of instructions
operating on byte size items. These instructions make great majority
of "integral promotions" unnecessary. Unfortunately, only a small
fraction of these are utilized by C compilers, resulting in much more
expensive operations on char versus int variables. Thus a saving of
one byte for the variable storage is practically always more than
paid for in code size.
The following rules have been found useful in reducing this type
of inefficiency:
a) If a char variable is used within expressions, or is being passed to
functions, or returned by functions, it is better to use int instead.
This is true even in case when variable is only used as TRUE/FALSE
flag. If defined as an int, the variable will occupy 1 more byte,
but for each use one or more bytes are saved in generated code.
In addition the speed is often improved due to an even address
alignment of variables, preferred by the 16/32 bit CPU-s.
b) In some situation rule (a) cannot be applied, since an array of
characters is the source for byte size variables. In that case
the characters should be cast as SIGNED whenever possible since
the instruction CBW used to extend signed char to int is 1 byte,
while the instruction MOV AH,0 used for unsigned is 2 bytes.
c) CodeRunner functions which receive char or byte as an argument
will ignore high byte passed. Hence you can safely pass integers
without having to mask out high byte. This also allows application
of the rule (b) independently of char being below or above 0x80.
d) Auto variables of char type will occupy two bytes on stack
anyways, hence one should always gain by using int instead.
e) If an integer variable is to be compared to a char, or passed to
function expecting char, casting integer to char is better than
AND-ing it with 0xff.
2. INITIALIZED AUTO variables for strings, arrays, structures or any
composite objects are very inefficient. Each such object will
occupy exactly twice the size defined. Also, the compiler will
on every function entry call an internal transfer function to
copy the variable from original place into the auto variable.
Besides wasting time, this transfer function is not usable by
the C code as a memory copy function. And lastly, the space
used as the destination is also internal space (stack space),
which will often place much heavier demand on stack size. You
can often create a "local copy" of such item in some temporarily
free buffer (e.g. a disk i/o buffer when disk i/o is idle).
a) If this type of variables is truly needed (rare case), one may
use other general purpose memory copy function (like memcpy())
which is usable for other purposes as well. In case of Turbo C
this is even more important, since Turbo C uses far call and
32 bit pointers to perform the copying, even in small model.
b) In most situations, the intention was only to keep a variable
local to the function and not to reinitialize variable each time
function is activated. In that case the variable should be turned
into static, accomplishing same result in half the size.
Hence instead of:
func()
{
char msg[]="This is an initialized local string.";
}
it is better to use:
func()
{
static char msg[]="This is an initialized local string.";
}
This is one example where ANSI C standard has violated a useful
classic C rule that more efficient constructs are shorter to type.
3. Another case of the same Classic C rule violation is the passing of
structures by value to/from functions. Such deceivingly efficient
constructs are wasteful on memory and time even when implemented
well, and especially in the current C compilers where they were added
as a marketing feature to boost the count of ANSI compliance points.
Hence instead of:
struct x_type func(struct x_type x)
{
....
process(x.field);
....
return(x);
....
}
it is more efficient to use:
void funct(struct x_type *x)
{
...
process(x->field);
...
return;
}
In rare circumstances, where a local copy of a structure is needed in
order to be modified while preserving the callers copy, it is still
better to use a general purpose memory copy, rather than have compiler
link in an internal-use-only transfer function.
4. Uninitialized AUTO variables are more efficient than the equivalent
static/global variables both in memory (typically 1 byte per access)
and speed. Additionally, the space used by auto variables is released
when function exits. Therefore both simple and composite objects
needed temporarily and locally should be defined as auto variables.
Similar situation is with arguments a function receives - they are
equivalent to auto variables in efficiency, thus they are often
better than global variables.
5. Variables which are OFTEN being passed as arguments between functions
should be redefined as globals, ar at least static for that module.
This rule needs to be counterbalanced with the previous rule - the
decision parameter is "often". Namely each access to a global/static
costs on average 1 extra byte (compared to access of a function
argument), but the cost of passing such variable is on average 5 bytes.
6. If a variable occurs in the code as say: n-1, two or more times, it
is better to define another variable, n1=n-1 and use n1 instead.
Just the code space saved on the single re-use pays for the
extra space of variable n1. The program speed will benefit too.
This rule is applicable to char, int, long variables as well as
pointers to any objects. More complex expressions (than n-1)
will benefit even more.
Note also that the assignment (n1=n-1; above) should be combined
with the first use of (n-1):
For example instead of:
y = f(x) + n - 1;
z = v * (n-1);
use:
y = f(x) + (n1=n-1);
z = v * n1;
7. 2-D arrays are not efficiently handled by C compilers. Each access
will produce multiply instruction to compute an offset into the array.
If a 2-D array is used mostly in a sequential manner (like screens,
edit buffers, matrices, etc.) one should define them as 1-D arrays
and compute starting offset explicitly, and from then on use ++,--
operators (on either pointers or indices).
8. Larger expressions are compiled more efficiently than the equivalent
step-by-step sequence of expressions. This is also true for a function
invocation within an argument of another function invocation. The
reason for this is the freedom compilers have when evaluating
expressions - compiler need not worry about side effects occurring
during evaluation, their sequencing is in most cases undefined, thus
compilers can ignore aliasing. On the other hand, the equivalent
sequence of simple expressions has so called "sequence points" after
each step, requiring often that the intermediate results be stored
in memory. In other words the sequence of small expressions places
additional, often useless, constraints on evaluation.
The drawbacks to be weighed in a decision are higher susceptibility
to errors, lower readability and lower flexibility of complex expressions.
9. Functions which accept string pointers and their lengths as arguments,
should have lengths placed before pointers. This is advantageous when
the following invocation method is used:
char *s;
....
str_func( str_len(s), s);
....
The reason for this is the order in which arguments are pushed on the
stack - in case of C parameter convention this is from right to left.
Thus in the example above, the code will fetch the value s and push
it twice than call str_len(). Had the argument order been opposite,
the value s would be accessed (computed) twice.
10. When segments and offsets are defined within structures, offset should
be placed first, followed by the segment. When passing far pointers
as a separate words for offset and segment, offset should be first.
The target function should receive this pair of words as the far
pointer, and will access it using single instruction (LES/LDS).
11. If the total amount of data used by program exceeds 64K, the bulk of
this space is normally occupied by arrays (programs with 30,000 simple
variables are rare). The largest of these arrays should be moved into
far heap (see section on (far) memory allocation). This is always more
efficient than switching globally to large-data models.
CodeRunner is compiled in small data model, allowing for 64K of data
and 64K of code using near access, and up to memory size of far data.
The library functions can move data to/from far memory. One can also
use far pointers available in Turbo and MS C compilers (mixed models).
12. Accessing array elements via pointers (as opposed to index) is more
efficient, especially if arrays have int elements (or larger).
This only holds if the pointer is used without additional offset,
i.e. as *p instead of *(p+i) or p[i].
13. If an index is used to access array elements, global arrays
will compile more efficiently than arrays defined via starting
pointer. Hence if an alternatives are p[i] and a[i] (p is a
pointer, a[] is an array), the array is preferable. This should not
be confused with the "abstract C machine" which mandates that array
references are converted to pointers - the actual code does make
distinction. The reason is the same as for the constants being
more efficient than variables - in this case a[] is a constant,
while p is a variable, both representing some memory address.
One can unify this and the preceding rules as: The pointers are
more efficient than arrays if they follow the array elements,
and less efficient if they are fixed (a variable playing role
of a constant).
14. Loops which always execute at least once, are best implemented
as do {...} while() loops, as opposed to for() {...} loops or
while() {...} loops. The saving is 2-3 bytes.
15. Switch statement is often implemented more efficiently using two
arrays - one with words (or preferably bytes) containing case values,
and another array containing function pointers for the matching cases.
There are three reasons for this:
A. Switch statement will often generate code that does exactly
same thing: it has the two arrays, both with 16-bit elements
plus the code which scans the first one (case values) then
dispatches from the second one. You can not only use common
search function, but also often reduce the case values to 8 bits.
B. Actions done for particular case can be delegated to functions
which are reusable in other switch statements or other code.
C. For each case there will be a jump code to join the common
end point of the switch statement - this is 2-3 byte instruction.
With the suggested replacement, the RET instruction is used
which is always 1 byte.
This method also has a side benefit of allowing variables for case
labels. Within the context of CodeRunner, this method is useful for
hot-key response. The hot-key service routine receives as an argument
the hot-key which has triggered TSR activation. The CR.LIB function
word_pos() when applied to hot-key and the array of all hot-keys
will return 1-based index of the hot-key in the array. The matching
array of function pointers should have first pointer set
to an error function (no match case), and the rest in the same
order as the hot-key array. The whole decision process could
look as follows:
word hot_key_list[] = { key1, key2, key3 }; /* Hot keys */
fp dispatch[]={err_func,func1,func2,func3 }; /* Function pointers */
hot_key_service(key)
{
.....
dispatch[word_pos(key,hot_key_list,3)]();
.....
}
When a switch statement is coded in this manner, the err_func()
should replace the default case of the switch statement.
As usually there is a tradeoff involved - the switch statement allows
access of auto variables belonging to the same function. If these
have to be recoded as globals, or passed to individual case functions
the overhead for globals (or function entry frame setup) may offset
the benefits.
16. When left to choose register variables, compilers will generally
pick the first two integer variables. You may override the default
choice by weighing space/time priorities. If the space is important
select variables which occur (literally) most frequently within a
function. If the speed is important (and you don't use assembler)
select variables used most frequently in the inner-most loop within
a function.
17. For pop-up applications, compiler should be set to optimize for size,
since the bulk of time consuming (low level) screen handling is done
by the CodeRunner functions. In some cases the size optimization
will actually produce faster code than speed optimization due to
compilers disregard of CPU prefetch queue effects. And in almost
all cases the speed difference will be beyond human perception.
18. Compiler switch for stack-overflow checking should be disabled for
TSR application. Namely, the normal action of the compiler will
be to abort the program and exit to DOS. This obviously will
not work once program goes resident, since officially the program
has already "exited" once. Thus all the checking code is at best
useless in this environment (CodeRunneR library substitutes checking
and exit function with a neutral function). An alternate approach
is offered by a pair of CodeRunner functions _watch_stack() and
_unused_stack(). These functions are intended for the test stage
of development, when they can help determine minimal, but still
safe stack size. They should be removed from the final program.
19. When selecting functions from the library, if there is a tossup
choice between the two functions, you should select one already
used somewhere else in the code (or at least the one more likely
usable elsewhere). When the choice is not a tossup, but requires
extra C code to make it use the same function, one should use the
best suited library function. In other words, one should avoid
variety used for variety sake.
20. The CodeRunner video functions are fast enough to make redisplaying
of a line or a window an acceptable alternative to trying to repaint
only the changed portions of the screen. This can save a great deal
of code space.
21. For the same reason as in previous rule, the nested windows may be
implemented without buffers to save an overlayed portion of another
window. Calling the same function which drew the overlayed window to
start with, will save on the data space. In case of two levels of
nesting, a swap_blk() functions may be used to swap the screen block
and another buffer.
22. Goto statement is still legal in C language. Your C compiler will
never generate more than 3 bytes per goto (most often just 2).
In the same time, even a single variable used to control the early
exit from nested loops will generate 4 bytes to get set and 4-5
bytes for each test, and then 2-3 bytes to jump anyways.
23. CodeRunneR's data compaction will not dispose strings specified
as function arguments. These strings should be defined in the
disposable data module, and accessed as an external variable
from the disposable code module.
For example, instead of:
main()
{
dsp("Signon Screen"); /* This string is not disposable */
}
it is more efficient to use:
extern char msg[]; /* Defined in disposable data module */
main()
{
dsp(msg);
}
In assembler you can just define the string within the same module,
but in _IDATA segment to make it disposable.
24. Educational C texts providing lists of DO-s and DON'T-s for portable,
modular, readable, structured and object-oriented coding practices
are a rich source of DON'T-s and DO-s for efficient programs.
This advice should be taken with grain of salt. Namely, when sufficient
effort has been spent polishing a program and when it looks that no
more improvements are possible using "noble" methods, the point of
tradeoffs has been reached. Only at this point, if further size/speed
improvements are needed, one should look at the DON'T-s from such
lists (or DO-s in your code) and choose the least of evils which will
help achieve the desired performance.
Violation of the good practices before the tradeoff point is sloppy
rather than efficient programming.
25. ELSE statement generates a JMP instruction around the ELSE-block.
This can be eliminated as shown in the following example:
Instead of:
if (x
Use more compact form:
z=0;
if (x
NOTE: This also applies to ?: operator, which is just a compactly
written variation of IF-ELSE statement.
26. ELSE IF statement may often be skipped gaining in space at the
expense of the execution speed. This is especially useful when the
statements within each conditional block are simple assignment or
single integer operator like +,-,^,|,&,++,--. In such case there is
almost no speed penalty for the space gained.
For example:
if ((c>='a')&&(c<='z')) c-=0x20;
else if (c<0x20) c=0x20;
can be replaced with seemingly redundant, but more compact check:
if ((c>='a')&&(c<='z')) c-=0x20;
if (c<0x20) c=0x20;
This type of savings is especially useful in keystroke processing code
which, due to human typing speed, need not be very fast.
In addition to greater compactness, the decoupling of conditional
statements makes code simpler to shuffle around.
27. EXTRACTION of common expressions into functions can often eliminate
lots of repetitive expressions. Simple call to a function with no
arguments takes only 3 bytes. Any simple assignment will take at least
that much (usually more). Therefore, any pair of assignments
which repeats in the code 2 or more times can be made into a function.
For example:
...
crs_x=X0; crs_y=Y0;/* Put cursor in the corner */
...
crs_x=X0; crs_y=Y0;/* Reset cursor to the same corner */
...
can be done in less space using:
...
corner0 ();
...
corner0 ();
...
corner0 ()
{
crs_x=X0; crs_y=Y0;
}
This example will save space even if Y0 is not used, and X0 is some
global variable.
28. The previous hint is an example of TIME/SPACE tradeoff. One can
generalize the simple case above to any piece of code used more than
once within a program. Such space optimizations are often done after
the program is already finished, and can be done almost mechanically.
Pieces of code which are similar, can be often made identical and
then replaced with a function. Variables originally placed arbitrarily
can be reorganized into records, allowing for simpler argument passing
to common functions.
In interactive applications, the response speed faster than the screen
refresh time (around 20 ms) is without any visible effect. Since the
function replacement method described above adds under 20 microseconds
even on a 4.77 Mhz IBM PC, then fifty such replacements will add only
one millisecond to the execution time, which is humanly unperceptible
difference. On the other hand, few hundred bytes saved are certainly
perceptible with ever growing DOS with each new version.
The function-blocking method has an additional benefit that the code
is easier to read and modify, especially if function names are
well chosen. The resulting code will look as if designed top-down,
without having to think through entire program up-front.
29. EXAMINE GENERATED CODE by making compiler produce ASSEMBLER output.
Most of todays C compilers have a switch to generate ASM file (-S for
Turbo C and /Fa for Microsoft C). If you are familiar with assembly
language, this exploration will show which constructs need
improvements. Typically, you would look for expressions in which
compiler extends char or byte to int or word (instructions like
CBW or MOV AH,0 or XOR AH,AH). Such expressions can usually be
typecast so that no extension instructions are generated.
Other things to watch are Jumps-Around-Jumps necessitated by
large conditional statement blocks. Reducing the size of these
blocks by redistributing some work-load to functions will produce
smaller and cleaner code (perhaps slightly slower).
30. Arithmetic expressions can often beneficially replace IF/ELSE
decisions. Consider case of PING-PONG buffers: A pointer (char*)p
is toggled between buffers buf1 and buf2. The switching of the
pointer is usually done as:
if (p==buf1) p=buf2;
else p=buf1;
More efficient construct is:
p = (char*) ((word)p ^ ((word)buf1 ^ (word)buf2));
NOTE: The cast to word is done to allow use of ^ operator which is
generally not defined on pointers. In this case it is well defined.
In practical use the buf1^buf2 would be a constant computed in
disposable code.
This XOR trick can be used whenever some (integer) variable x needs to
be toggled between two specified values a and b. You simply create
variable ab=a^b, and toggle x via x^=ab; .
In many cases the control variables could be defined in such way
that the plain arithmetic could be applied to produce desired
result.
31. Along the lines of previous example, you should check the bit
patterns of variables occurring in the program. There are often
surprising properties of binary 2-complement arithmetic which can
greatly optimize the code size and speed.
A trivial example is replacement of say (x % 8) with (x & 7), or
the same way for any power of two.
Less obvious is, for example, the task of extracting the bit mask
for the right-most bit set in a word.
word x,mask;
Instead of:
mask=1;
do if (mask & x) break; /* Test if non-zero bit found */
while ((mask<<=1)!=0);
Use:
mask = -x & x;
Similarly if you are checking if a non-zero integer is power of two
(i.e. it has exactly 1 bit set) you can use:
#define is_power_of_2(x) ( (x) & ((x)-1) )
Thus to count bits in a word x, instead of checking all bits via
shifting a bit mask, you can use the following:
for (cnt=0; x; ++cnt) x &= x-1;
This will produce bit count (cnt) on average twice as fast as the
simple shift and test method.
NOTE: The expressions above work for signed and unsigned integral
values (char, int, long).
32. Integer arithmetic in C is modular (also in assembler). Although
the ANSI C guarantees this only for unsigned types, in practice
on most processors (including 8086,68000,...) this is also true
for signed integer variables (consequence of the binary 2-complement
representation).
This can be most often used when a bit-pattern is shifted to the
left (also to the right for unsigned) to eliminate additional
count variable set to 8, 16 or 32.
Thus instead of:
for (i=0,mask=1; i<16; i++,mask<<=1) {...}
you can often use (assuming you need a mask):
mask=1;
do {
...
} while ((mask<<=1)!=0);
This way the mask doubles as a bit mask and as a counter.
NOTE: We have also applied above the rule for loops which always
execute at least once, therefore do-while form is used.
33. Another consequence of modular arithmetic is that pointer/index
overflow around 64K can be checked without resorting to long
arithmetic.
Thus instead of:
char *p;
word step; /* Assumed forward step */
if (((long)p+step)<0x10000L) p+=step;
You can use:
if (p+step>p) p+=step;
In case of backward step (check for wrap around 0) you can use:
if (p-step
35. You can avoid long arithmetic when subtracting from 64K.
Thus instead:
word used,free;
...
free = 0x10000L-(long)used;
...
you can do:
free = -used;
34. Loop counters are more efficient when counting down to 0 rather than
from 0 up.
Hence instead of:
i=0; do {....} while (++i
it is better to use (whenever possible):
i=N; do {....} while (--i);
The latter often exploits CPU zero flag when decrementing i.
The more recent C compilers also utilize LOOP instruction.
35. Pre-increment/decrement operators are often more efficient than the
post-increment/decrement version. This is valid mostly when these
operators are combined in the expressions using compare (like ++i
The reason for this is that the processor flags are changed by ++ or --,
thus the (i++ < N) must generate an extra code (usually a JMP) to
prevent ++ from destroying the result of compare.
36. Function argument passing has dual overhead - an overhead for pushing
each argument on the stack, plus an overhead for clearing arguments
from the stack. The latter is 3 bytes for MSC for any number of
arguments (2 bytes for 1 argument in recent versions), but for Turbo C
it is 1,2 or 3 bytes for 1,2 or 3 and more arguments, respectively.
Thus to reduce stack clearing overhead it is useful (especially for TC)
to keep number of arguments to 1 or 2. Of course, any reduction helps
also in the argument pushing phase.
One way to reduce number of arguments is to replace them with globals.
As mentioned earlier, this has a drawback of more expensive global
(or static) variable access. Another way is to use structures (or
arrays) which hold several variables, and pass only the pointer to
the structure (array). The third method is useful when several
arguments are passed via a chain of nested calls.
For example instead of:
func1(int a,int b,int c,int d)
{
....
func2(b,c,d);
....
}
func2(int x,int y,int z) {...}
you can use:
func1(int a,int b,int c,int d)
{
...
func2(&b);
...
}
func2(int *p)
{
/*** x is p[0], y is p[1] and z is p[2] ***/
}
Structures can be in the same manner for more complicated argument
lists (or for code documentation purposes).
NOTE: You should not use this for independent auto variables inside
the func1(), such as int e,f,g; since the variables are not
necessarily allocated in the order of declaration. On the
other hand, the passing of arguments to functions is
additionally constrained with the CPU stack discipline for
C convention calls, thus it has well defined behavior across
MS-DOS C compilers.
37. If there are 2 or more occurences of some function call, check the
sourrounding statements. There are aften common pieces of code just
before (or after) the call. The common section should be moved into
the function. If the surrounding code is similar, one can often
modify it by replacing some auto variable with a global, thus making
it identical so that it can be moved to the function.
If the code surrounding the call appears in the same form in several,
but not all occurences consider what effect it would have to move it
into the function anyways. Often the places which don't have the
common piece are not affected by such code being executed, thus it
can be safely moved into the function.
The gain with this rule is even greater than the one from hint #27
since the call already exists.
38. A switch statement may have two or more cases which are similar
except for some variables being assigned different value. The
similar case-blocks can be made to reuse same code (i.e. fall thru)
by initializing the variables to one of the cases and modifying
it in the remaining cases.
For example this code:
switch (i)
{
...
case KEY0: func(0,a,b,c);
break;
case KEY1: func(1,a,b,c);
break;
case KEY2: func(2,a,b,c);
break;
...
}
can be optimized (for size) as follows:
x=0;
switch (i)
{
...
case KEY2: x++;
case KEY1: x++;
case KEY0: func(x,a,b,c);
break;
...
}
December 19, 2017
Add comments