This is the last installment about the straightforward implementation of Brian's Brain, in which I'm going to try to squeeze some more performance out of the brute-force implementation (you should get familiar with the first and second posts to get the full picture). The point of the article is not how fast I can get it, but rather what performance difference can be achieved using "simple" performance-tweaking measures.
As any good programmer will tell you – optimizing without profiling first is a waste of time. To profile my code I will use SBCL's statistical profiler. It is called statistical because of the way it works: it interrupts the running program at regular intervals and looks at what code is running at that moment.
To get an idea of where the time is spent in my program I'll do a single simulation from our benchmark (with an increased iteration count to get more accurate statistics):
CL-USER> (sb-sprof:with-profiling (:report :flat :loop nil)
           (simulate 4096 (make-initialised-brain 128 128) 128 128))

Number of samples:   4426
Sample interval:     0.01 seconds
Total sampling time: 44.26 seconds
Number of cycles:    0
Sampled threads:
 #<SB-THREAD:THREAD "initial thread" RUNNING {10029559D1}>

           Self        Total        Cumul
  Nr  Count     %  Count     %  Count     %    Calls  Function
------------------------------------------------------------------------
   1   1401  31.7   1401  31.7   1401  31.7        -  "foreign function sigprocmask"
   2   1000  22.6   1737  39.2   2401  54.2        -  NEIGHBOURS
   3    552  12.5   1056  23.9   2953  66.7        -  COUNT
   4    200   4.5    200   4.5   3153  71.2        -  SB-IMPL::OPTIMIZED-DATA-VECTOR-REF
   5    179   4.0    179   4.0   3332  75.3        -  SB-KERNEL:HAIRY-DATA-VECTOR-REF/CHECK-BOUNDS
   6    173   3.9    173   3.9   3505  79.2        -  SB-VM::GENERIC-+
   7    145   3.3    247   5.6   3650  82.5        -  (LAMBDA (SB-IMPL::X))
   8    143   3.2    143   3.2   3793  85.7        -  TRUNCATE
   9    128   2.9    128   2.9   3921  88.6        -  SB-KERNEL:%COERCE-CALLABLE-TO-FUN
  10    116   2.6    139   3.1   4037  91.2        -  LENGTH
  11     80   1.8     80   1.8   4117  93.0        -  EQL
  12     67   1.5   2970  67.1   4184  94.5        -  EVOLVE
  13     51   1.2     51   1.2   4235  95.7        -  (FLET LT)
  14     42   0.9     42   0.9   4277  96.6        -  (FLET RT)
  15     41   0.9     41   0.9   4318  97.6        -  SB-KERNEL:SEQUENCEP
  16     30   0.7     34   0.8   4348  98.2        -  NTHCDR
  17     27   0.6   1076  24.3   4375  98.8        -  RULES
  18     22   0.5     22   0.5   4397  99.3        -  (FLET #:CLEANUP-FUN-[CALL-WITH-SYSTEM-MUTEX/WITHOUT-GCING]146)
  19     17   0.4     17   0.4   4414  99.7        -  "foreign function mach_msg_server"
  20     11   0.2     11   0.2   4425 100.0        -  (FLET SB-IMPL::FAST-NTHCDR)
  21      1  0.02      1  0.02   4426 100.0        -  (FLET #:CLEANUP-FUN-[INVOKE-INTERRUPTION]41)
  22      0   0.0   2973  67.2   4426 100.0        -  SIMULATE
...
Ignoring the first entry, related to sigprocmask (that's surely some implementation detail related to GC, the profiling itself, or something else I should not bother with), we can see our top candidates for optimization:
- neighbours – as expected, this is the function called most frequently (for every cell at every simulation iteration).
- count – standard Common Lisp function (called with the results of neighbours to count the number of on cells).
- Some implementation-specific functions dealing with vectors. One of these functions contains the word "hairy", which I think means that SBCL could use some hints. Declaring the cells array type should hopefully get rid of these. It should also help with the calls to length seen a few rows below.
- sb-vm::generic-+ – most probably from the neighbours function. The name implies that the generic version of the + function (addition) is used instead of a fixnum-specific one.
- eql – might be from the call to count in the rules function, being the default :test function.
- (flet lt) and (flet rt) – these are from within neighbours. I'm pretty sure they are counted in one of the columns in the neighbours row.
The optimization plan now is simple¹:
- Declare the type of the cells array.
- Optimize the neighbours function.
- Sprinkle some more declarations as needed.
The first step is easy – all I have to do is add the following declaration to uses of the cells array, which tells the compiler that the array is one-dimensional (of unspecified length) and its elements are of type (integer 0 2) (integers from 0 to 2):
(declare (type (simple-array (integer 0 2) (*)) cells))
OK, I might as well make it a bit more generic (although just for the sake of example; the program is too short for this to be worthwhile). Two types are introduced as "aliases" (that is, upon seeing one of these the compiler will expand it to the corresponding type from the definition; much more complicated type declarations are possible):
(deftype cell () '(integer 0 2))
(deftype brain () '(simple-array cell (*)))
One important thing I should note here is that type declarations are optional in Common Lisp and the compiler is allowed to ignore them altogether. But they still provide a structured way to document code, and all implementations I'm familiar with will use the declarations if provided.
There are two extreme cases in type-checking:
- Everything is checked and declarations are ignored.
- Declarations are trusted and no type-checking is done.
In Common Lisp there is a way to instruct the compiler how to operate: generate safe (but, due to the checking, usually relatively slow) or unsafe (probably fast due to the absence of checks) code. There are two settings (optimization qualities) to directly control this behavior: speed and safety. These settings can take values from 0 to 3 (zero meaning "unimportant", 3 meaning "very important"). The optimize declaration identifier is used to specify these qualities, like this:
(declare (optimize (speed 3) (safety 1)))
Specifying that speed is very important will also make some compilers (including SBCL) very chatty about why they cannot optimize particular parts of the code. Other implementations have specific declarations to make the compiler output such optimization notes.
Since I'll be experimenting with different optimization settings I'm going to specify them once for the whole file. It can be done with the following form at the beginning of the file:
(declaim (optimize (speed 3) (safety 1)))
Such optimization declarations can also be added on a per-function basis (or even for a specific form within a function), which is actually what is done in day-to-day programming to optimize only the speed-critical sections of code.
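A per-function declaration might look like the following sketch (a hypothetical function of my own, not part of the simulation code):

```lisp
;; Hypothetical example: optimization qualities declared for one
;; function only, and narrowed further for a single form via LOCALLY.
(defun sum-squares (v)
  (declare (optimize (speed 3) (safety 1))
           (type (simple-array fixnum (*)) v))
  (let ((sum 0))
    (loop for x across v
          ;; safety can be lowered for just this speed-critical form
          do (locally (declare (optimize (safety 0)))
               (incf sum (* x x))))
    sum))
```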
And by the way – never, never ever run code with safety 0 (unless you know what you are doing, which in practice is never). But since this is a toy program anyway, I might as well try it to see what performance impact it has.
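To make the danger concrete, here is a hypothetical sketch (again not from the simulation code) of what safety 0 permits:

```lisp
;; With safety 0 the compiler may elide the bounds check entirely,
;; so an out-of-range index reads arbitrary memory instead of
;; signalling an error.  Illustration only – don't do this for real.
(defun unsafe-ref (v i)
  (declare (optimize (speed 3) (safety 0))
           (type (simple-array fixnum (*)) v)
           (type fixnum i))
  (aref v i))

;; (unsafe-ref (make-array 4 :element-type 'fixnum) 1000)
;; may return garbage, corrupt the heap, or crash – anything but a
;; clean error.
```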
Here are the neighbours and evolve functions with the added array declarations:
(defun neighbours (cells i w h)
  (declare (type brain cells))
  (let ((l (length cells))
        (mx (1- w))
        (my (1- h)))
    (multiple-value-bind (y x)
        (truncate i w)
      (flet ((up (i) (if (zerop y) (- (+ i l) w) (- i w)))
             (dn (i) (if (= y my) (- (+ i w) l) (+ i w)))
             (lt (i) (if (zerop x) (1- (+ i w)) (1- i)))
             (rt (i) (if (= x mx) (1+ (- i w)) (1+ i))))
        (let* ((u (up i))
               (d (dn i))
               (l (lt i))
               (r (rt i))
               (ul (lt u))
               (ur (rt u))
               (dl (lt d))
               (dr (rt d)))
          (mapcar (lambda (i) (aref cells i))
                  (list ul u ur l r dl d dr)))))))

(defun evolve (src w h)
  (declare (type brain src))
  (let* ((l (length src))
         (dst (make-brain w h)))
    (declare (type brain dst))
    (loop for i below l
          do (setf (aref dst i)
                   (rules (aref src i) (neighbours src i w h))))
    dst))
Compiling these produces a lot of complaints from the compiler. Some of the notes for the neighbours function look like this:
; in: DEFUN NEIGHBOURS
;     (ZEROP X)
; ==>
;   (= X 0)
;
; note: unable to
;   open-code FLOAT to RATIONAL comparison
; due to type uncertainty:
;   The first argument is a REAL, not a FLOAT.
;
; note: unable to open code because: The operands might not be the same type.

;     (+ I W)
;
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a REAL, not a RATIONAL.
;   The second argument is a NUMBER, not a FLOAT.
;
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a REAL, not a FLOAT.
;   The second argument is a NUMBER, not a RATIONAL.
;
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a REAL, not a SINGLE-FLOAT.
;   The second argument is a NUMBER, not a DOUBLE-FLOAT.
;
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a REAL, not a DOUBLE-FLOAT.
;   The second argument is a NUMBER, not a SINGLE-FLOAT.
...
Profiling the code now shows that I've got rid of the two vector-ref functions. But there is still a bit of work to do on the neighbours function, judging by all the noise the compiler generates. First I'll declare the types of the parameters (showing only the relevant forms here):
(deftype array-index () `(integer 0 ,array-dimension-limit))

(declare (type brain cells)
         (type array-index i)
         (type fixnum w h))
A note on the array-index type: the actual integer range depends on the value of array-dimension-limit, which is implementation-dependent (but is usually between 2^24 and most-positive-fixnum).
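Both constants can be inspected at the REPL; the exact values are implementation- and platform-specific, so treat any particular numbers you see as examples only:

```lisp
;; Both of these are implementation-defined constants; 64-bit builds
;; typically report much larger values than 32-bit ones.
(list array-dimension-limit most-positive-fixnum)
```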
Now compiling the neighbours function emits quite a short list of compiler notes:
; in: DEFUN NEIGHBOURS
;     (LIST UL U UR L R DL D DR)
;
; note: doing signed word to integer coercion (cost 20)
;
; note: doing signed word to integer coercion (cost 20)

;     (FLET ((UP (I)
;              (IF (ZEROP Y)
;                  (- # W)
;                  (- I W)))
;            (DN (I)
;              (IF (= Y MY)
;                  (- # L)
;                  (+ I W)))
;            (LT (I)
;              (IF (ZEROP X)
;                  (1- #)
;                  (1- I)))
;            (RT (I)
;              (IF (= X MX)
;                  (1+ #)
;                  (1+ I))))
;       (LET* ((U (UP I))
;              (D (DN I))
;              (L (LT I))
;              (R (RT I))
;              (UL (LT U))
;              (UR (RT U))
;              (DL (LT D))
;              (DR (RT D)))
;         (MAPCAR (LAMBDA (I) (AREF CELLS I)) (LIST UL U UR L R DL D DR))))
;
; note: doing signed word to integer coercion (cost 20) to "<return value>"
;
; note: doing signed word to integer coercion (cost 20) to "<return value>"
;
; compilation unit finished
;   printed 4 notes
The two offending forms are the flet and the list. One solution would be to declare the types of all eight cells in the let* form. But is there anything about this code that bothers you? You see, lists in Common Lisp can hold any kind of value, and making a list takes time and memory. This code actually creates two lists (one with the cell indices, and the one created by the mapcar call). So telling the compiler that the things stuffed into those lists are in fact numbers does not make much sense (or performance difference), because the lists are still being created and traversed.
So, in the name of optimization I'll go on and do what has to be done: use a single array for the neighbours calculation and update it on each call. Here's the ugly-but-supposedly-fast version:
(deftype neighbours () '(simple-array cell (8)))

(defun neighbours (cells i w h)
  (declare (type brain cells)
           (type array-index i)
           (type fixnum w h))
  (let ((result (the neighbours (load-time-value (make-array 8 :element-type 'cell))))
        (l (length cells))
        (mx (1- w))
        (my (1- h)))
    (multiple-value-bind (y x)
        (truncate i w)
      (flet ((up (i) (if (zerop y) (- (+ i l) w) (- i w)))
             (dn (i) (if (= y my) (- (+ i w) l) (+ i w)))
             (lt (i) (if (zerop x) (1- (+ i w)) (1- i)))
             (rt (i) (if (= x mx) (1+ (- i w)) (1+ i))))
        (let* ((u (up i))
               (d (dn i))
               (l (lt i))
               (r (rt i))
               (ul (lt u))
               (ur (rt u))
               (dl (lt d))
               (dr (rt d)))
          (setf (aref result 0) (aref cells ul)
                (aref result 1) (aref cells u)
                (aref result 2) (aref cells ur)
                (aref result 3) (aref cells l)
                (aref result 4) (aref cells r)
                (aref result 5) (aref cells dl)
                (aref result 6) (aref cells d)
                (aref result 7) (aref cells dr))
          result)))))
First I define a new type that describes the neighbours array (known at compile time to be 8 elements long). In the function itself a new binding is introduced: result. There are two interesting things about this binding:
- The value is created only once – when the code is loaded (in this case the behaviour is similar to static variables in C and some other languages).
- The type of the array is declared (using the the form) because the value is only known at load time (as opposed to compile time). Theoretically I'd expect the compiler to figure this out by itself, but since arbitrary code can be given to load-time-value, the compiler apparently does not even try.
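The load-time-value behaviour can be demonstrated with a minimal sketch (a hypothetical function of my own, not part of the simulation code): the wrapped form is evaluated once, when the compiled code is loaded, and every call then reuses the same object.

```lisp
(defun scratch-buffer ()
  ;; The MAKE-ARRAY form runs once, at load time; every call to
  ;; SCRATCH-BUFFER then returns the very same array object.
  (load-time-value (make-array 8 :element-type 'fixnum)))

;; In compiled code: (eq (scratch-buffer) (scratch-buffer)) => T
```

Note that this sharing makes the new neighbours non-reentrant: the returned array is overwritten on every call, so its contents must be consumed before the next call (which is exactly how rules uses it here).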
So there: compiling the neighbours function now gives no warnings, and the code is as ugly as if it were written in C. And since this function now returns an array, this information can be added to the rules function:
(defun rules (state neighbours)
  (declare (type cell state)
           (type neighbours neighbours))
  (case state
    (1 2)
    (2 0)
    (t (if (= 2 (count 1 neighbours)) 1 0))))
Benchmarking now gives numbers around 11 seconds for each simulation run (i.e., the code runs almost twice as fast as before). The top of the profiler output now looks like this:
           Self        Total        Cumul
  Nr  Count     %  Count     %  Count     %    Calls  Function
------------------------------------------------------------------------
   1    466  24.7    527  27.9    466  24.7        -  NEIGHBOURS
   2    376  19.9   1152  61.0    842  44.6        -  COUNT
   3    194  10.3    194  10.3   1036  54.8        -  SB-IMPL::OPTIMIZED-DATA-VECTOR-REF
   4    141   7.5    238  12.6   1177  62.3        -  (LAMBDA (SB-IMPL::X))
   5    136   7.2    136   7.2   1313  69.5        -  SB-KERNEL:HAIRY-DATA-VECTOR-REF/CHECK-BOUNDS
   6    127   6.7    127   6.7   1440  76.2        -  SB-KERNEL:%COERCE-CALLABLE-TO-FUN
   7    110   5.8    118   6.2   1550  82.1        -  EQL
   8     86   4.6     86   4.6   1636  86.6        -  "foreign function sigprocmask"
   9     84   4.4   1800  95.3   1720  91.1        -  EVOLVE
...
A few notable things:
- I think I'm starting to understand the second column². That would explain why evolve has 95% and other entries less. If so, then count takes about twice the time of neighbours.
- The vector-ref friends are back! Since neighbours only sets array values, it sure looks like count has dragged these into the picture.
Anyway, I have not done much Lisp code profiling (or performance tuning, for that matter), so I'll do a little experiment here and write my own loop to count the live neighbours:
(defun rules (state neighbours)
  (declare (type cell state)
           (type neighbours neighbours))
  (case state
    (1 2)
    (2 0)
    (t (if (= 2 (loop for x across neighbours counting (= 1 x))) 1 0))))
There is one compiler note for this loop – the count accumulation variable is checked for fixnum overflow. I'll leave it at that, wrap things up, and call this my final version.
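If one wanted to silence that note, the accumulator's type could be bounded explicitly – a sketch of a possible fix (my assumption, not something I benchmarked; the cell type is the one defined earlier):

```lisp
;; Since there are only 8 neighbours, the count is known to fit in
;; (integer 0 8), which removes any fixnum-overflow concern.
(loop for x of-type cell across neighbours
      counting (= 1 x) into n of-type (integer 0 8)
      finally (return n))
```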
Here are the timings (with the safety optimization setting at 1 and 0):
To be honest, I did not expect getting rid of count to make such a big difference. I can only hope that this example will help implementation developers make the compiler better!
The detailed timings can be seen in transcripts (below), but generally they are:
- 3.9s / 3s for SBCL.
- 8.5s / 6.9s for Clozure CL (32-bit).
- 6s / 5.5s for Clozure CL (64-bit).
I'm pretty sure it is possible to squeeze some more performance out of this. I actually think that this kind of low-level optimization is an art in itself, and I'm not very good at it. That is why I will leave this topic for now. But readers are welcome to contribute in the comments or, better yet, by writing about it on their own blogs.
Data files
Footnotes
¹ A comment from pkhuong on the #lisp IRC channel on approaching optimization (after reading a draft version of this article):

"Only NEIGHBOURS shows up fairly high in the list. However, before working on it directly, I'd attack what I consider noise: COUNT, TRUNCATE, LENGTH and EQL shouldn't be called out of line. *-DATA-VECTOR-REF and GENERIC-+ shouldn't have to be called either. Inserting type declarations and asking for speed optimization where appropriate in order to eliminate these entries should be the first step. As it is, the profile is mostly telling you that you're doing it wrong if you really want execution speed.

Only then, with the noise eliminated, would it be time to work on the program itself, from a new profile (with the declarations and with speed optimization where needed) that may be radically different from the one above."
² Another comment from pkhuong, this time about the columns in the profiler output (confirming my understanding of the "total" column):

""Count" columns count samples, "%" columns the percentage of time (samples). Self is the amount/% of samples with the function at the right at the top of the call stack (directly executing); Total is the amount/% of time with the function somewhere on the call stack (up to a certain depth). Cumul is the sum of Self values; I find it helpful to guesstimate where the repartition of self times tapers off into a long tail.
The cycle output (not shown here nor in the blog post) is extremely useful if you know that a lot of time is spent in a function, either directly or in total, but not what the callers or the callees are. You'll be able to find the function you're looking for, and its most common (in terms of samples) callers and callees. With that information, you could decide how to make a certain callee faster (or special-case it for the slow function), change the way the function is used in the slow callers, special-case the function for the slow callers, etc. The cycle output is somewhat harder to read, but the format is the same as that of gprof, so you can look around for more resources."
Hi Jānis,
I'm not sure if you've seen my previous reply.
I've optimized your code to be about 2x faster without using the following techniques you used here:
1) integer representation of state (I use symbol)
2) 1-d array (I use 2-d array)
1) didn't make a lot of difference in this micro-benchmark, but it can save some memory.
The benchmark (with safety 0) shows your code finished between 2.6s ~ 2.8s. And mine finished between 1.3s ~ 1.8s.
Comparing the source code and the assembly of your code and mine, it's likely the difference lies in fixnum tagging.
The assembly of my rules and neighbours functions is about half the size of yours, and I've seen many SAR, SHR and SHL instructions in yours. Otherwise it's quite similar.
After doing the following changes, I managed to reduce the timing of your code to 1.9s.
1) change type cell from (integer 0 2) to (unsigned-byte 8)
2) inline rules and neighbours
I have no idea how to make it better, or why my code is still faster than yours in small-size benchmarks. But if you want to dig further, you can check my code:
http://huangjs.net/code/brain.lisp
Cheers,
Jianshi
Hi Jianshi,
I have explained why I use numbers instead of symbols to represent cells in my first article about the Common Lisp implementation: "big" unspecialised arrays are extra work for the GC. This might explain why your version gets slower in the "not small size" benchmarks (if you refer to brain size here).
I assume the bit-munging instructions are related to arrays specialised to 2-bit numbers. But the speed difference between 2-bit and 8-bit versions is quite significant, definitely more than I expected. I'm glad you brought this one up!
Inlining is a bit dangerous beast, though – I would not suggest people to use it until they know what they are doing. But maybe I should have at least mentioned it in the article...
Thank you for your suggestions!
Hi Jānis,
1) Unspecialised arrays may stress the GC only if the elements are independent objects. But using symbols, the elements all refer to the same objects (3 in total: :on, :dying and :off). I suppose the slowdown comes from cache misses (each element is a pointer to the symbol, which is 64 bits here).
2) You are right, the specialization is responsible for the shifting.
On my machine (SBCL 64-bit, Ubuntu Linux x86-64), the array element type is specialized to (unsigned-byte 4).
CL-USER> (make-array 1 :element-type '(integer 0 2))
#(0)
CL-USER> (type-of *)
(SIMPLE-ARRAY (UNSIGNED-BYTE 4) (1))
Cheers,
Jianshi
I'll clarify the GC issue if the wording in the first article is not very clear.
You are right that there are only 3 possible values for each element of the cells array. But since the array is not specialised, those elements can be any Common Lisp value. Each time the GC encounters such an array it must traverse it completely. This is not the case with specialised arrays, because they contain the values directly, not pointers to them, and thus the GC can see immediately upon encountering such an array that it need not be scanned at all.
It does not matter much for small arrays, because the whole array most probably becomes garbage before the next GC kicks in. The bigger the arrays become, the larger the chance that the GC will encounter them (because they live longer), and thus the chance of having to scan the array increases. And the amount of scanning increases proportionally to the size of the array.
Ok, the GC behavior is understood. You are right, it makes scanning longer and explains the slowdown.
Thanks for the explanation, and I'm sorry for not having looked at the GC time carefully.
Cheers,
Jianshi