2009-12-07

Brian's brain, on Common Lisp, take 3

This is the last installment about the straight-forward implementation of Brian's Brain in which I'm going to try to squeeze some more performance out of the brute-force implementation (you should get familiar with first and second post to get the full picture). The point of the article is not about how fast I can get it, but rather about what performance defference can be achieved using "simple" performance tweaking measures.

As any good programmer will tell you – optimizing without profiling first is a waste of time. To profile my code I will use SBCL's statistical profiler. It is called statistical because of the way it works: interrupting the running program at regular intervals and looking what code is running at that moment.

To get the idea of where the time is spent in my program I'll do a single simulation from our benchmark (with an increased iteration count to get more correct statistics):

 1:  CL-USER> (sb-sprof:with-profiling (:report :flat :loop nil)
 2:             (simulate 4096 (make-initialised-brain 128 128) 128 128))
 3:  
 4:  Number of samples:   4426
 5:  Sample interval:     0.01 seconds
 6:  Total sampling time: 44.26 seconds
 7:  Number of cycles:    0
 8:  Sampled threads:
 9:   #<SB-THREAD:THREAD "initial thread" RUNNING {10029559D1}>
10:  
11:             Self        Total        Cumul
12:    Nr  Count     %  Count     %  Count     %    Calls  Function
13:  ------------------------------------------------------------------------
14:     1   1401  31.7   1401  31.7   1401  31.7        -  "foreign function sigprocmask"
15:     2   1000  22.6   1737  39.2   2401  54.2        -  NEIGHBOURS
16:     3    552  12.5   1056  23.9   2953  66.7        -  COUNT
17:     4    200   4.5    200   4.5   3153  71.2        -  SB-IMPL::OPTIMIZED-DATA-VECTOR-REF
18:     5    179   4.0    179   4.0   3332  75.3        -  SB-KERNEL:HAIRY-DATA-VECTOR-REF/CHECK-BOUNDS
19:     6    173   3.9    173   3.9   3505  79.2        -  SB-VM::GENERIC-+
20:     7    145   3.3    247   5.6   3650  82.5        -  (LAMBDA (SB-IMPL::X))
21:     8    143   3.2    143   3.2   3793  85.7        -  TRUNCATE
22:     9    128   2.9    128   2.9   3921  88.6        -  SB-KERNEL:%COERCE-CALLABLE-TO-FUN
23:    10    116   2.6    139   3.1   4037  91.2        -  LENGTH
24:    11     80   1.8     80   1.8   4117  93.0        -  EQL
25:    12     67   1.5   2970  67.1   4184  94.5        -  EVOLVE
26:    13     51   1.2     51   1.2   4235  95.7        -  (FLET LT)
27:    14     42   0.9     42   0.9   4277  96.6        -  (FLET RT)
28:    15     41   0.9     41   0.9   4318  97.6        -  SB-KERNEL:SEQUENCEP
29:    16     30   0.7     34   0.8   4348  98.2        -  NTHCDR
30:    17     27   0.6   1076  24.3   4375  98.8        -  RULES
31:    18     22   0.5     22   0.5   4397  99.3        -  (FLET #:CLEANUP-FUN-[CALL-WITH-SYSTEM-MUTEX/WITHOUT-GCING]146)
32:    19     17   0.4     17   0.4   4414  99.7        -  "foreign function mach_msg_server"
33:    20     11   0.2     11   0.2   4425 100.0        -  (FLET SB-IMPL::FAST-NTHCDR)
34:    21      1  0.02      1  0.02   4426 100.0        -  (FLET #:CLEANUP-FUN-[INVOKE-INTERRUPTION]41)
35:    22      0   0.0   2973  67.2   4426 100.0        -  SIMULATE
36:  ...

Ignoring the first entry related to sigprocmask (because that's sure some implementation detail related to GC, profiling itself, or something else I should not bother with) we can see our top candidates for optimization:

  • neighbours – as expected, this is the function called most frequently (for every cell at every simulation iteration).
  • count – standard Common Lisp function (called with the results of neighbours to count the number of on cells).
  • Some implementation-specific functions dealing with vectors. One of these functions contains a word "hairy" which I think means that SBCL could use some hints. Declaring the cells array type should hopefully get rid of these. Also should help with calls to length seen a few rows below.
  • sb-vm::generic-+ – most probably from neighbours function. The name implies that the generic version of + function (addition) is used, instead of using fixnum specific one.
  • eql – might be from call to count in rules function, being the default :test function.
  • (flet lt) and (flet rt) – these are from within neighbours. I'm pretty sure they are counted in one of the columns in neighbours row.

The optimization plan now is simple1:

  1. Declare the type of cells array.
  2. Optimize neighbours function.
  3. Sprinkle some more declarations as needed.

The first step is easy – all I have to do is add the following declaration to uses of cells array, which tells the compiler that the array is one-dimensional (undefined length) and its elements are of type (integer 0 2) (integers from 0 to 2):

(declare (type (simple-array (integer 0 2) (*)) cells))

OK, might as well make it a bit more generic (although just for the sake of example; the program is too short to make this worthwhile). Two types are introduced as "aliases" (that is, upon seeing one of these the compiler will expand them to corresponding types in the definition; a lot more complicated type declarations are possible):

1:  (deftype cell () '(integer 0 2))
2:  (deftype brain () '(simple-array cell (*)))

One important thing I should note here is that type declarations are optional in Common Lisp and the compiler is allowed to ignore them altogether. But they still provide a structured way to document code and all implementations I'm familiar with will use the declarations, if provided.

There are two extreme cases in type-checking:

  1. Everything is checked and declarations are ignored.
  2. Declarations are trusted and no type-checking is done.

In Common Lisp there is a way to instruct the compiler how to operate: generate safe (but due to checking usually relatively slow) or unsafe (probably fast due to the absence of checks) code. There are two settings (optimization qualities) to directlyl control this behavior: speed and safety. These settings can take values from 0 to 3 (zero being the "unimportant" value, 3 being "very important"). optimize declaration identifier is used to specify these qualities, like this:

(declare (optimize (speed 3) (safety 1)))

Specifying that speed is very important will also make some compilers (including SBCL) to be very chatty about why it cannot optimize particular parts of the code. Other implementations have specific declarations to make the compiler output optimization such notes.

Since I'll be experimenting with different optimization settings I'm going to specify them once for the whole file. It can be done with the following form at the beggining of the file:

(declaim (optimize (speed 3) (safety 1)))

Such optimization declarations can also be added on a per-function basis (or even for a specific form within a function), and is actually what is being done in day-to-day programming to optimize only the speed-critical sections of code.

And by the way – never, never ever run code with safety 0 (unless you know what you are doing, which practically is never). But since this is a toy program anyway, might as well try doing it to see what performance impact does it have.

Here are the neighbours and evolve functions with added array declarations:

 1:  (defun neighbours (cells i w h)
 2:    (declare (type brain cells))
 3:    (let ((l (length cells))
 4:          (mx (1- w))
 5:          (my (1- h)))
 6:      (multiple-value-bind (y x)
 7:          (truncate i w)
 8:        (flet ((up (i) (if (zerop y) (- (+ i l) w) (- i w)))
 9:               (dn (i) (if (= y  my) (- (+ i w) l) (+ i w)))
10:               (lt (i) (if (zerop x) (1- (+ i w))  (1- i)))
11:               (rt (i) (if (= x  mx) (1+ (- i w))  (1+ i))))
12:          (let* ((u (up i))
13:                 (d (dn i))
14:                 (l (lt i))
15:                 (r (rt i))
16:                 (ul (lt u))
17:                 (ur (rt u))
18:                 (dl (lt d))
19:                 (dr (rt d)))
20:            (mapcar (lambda (i) (aref cells i))
21:                    (list ul u ur l r dl d dr)))))))
22:  
23:  (defun evolve (src w h)
24:    (declare (type brain src))
25:    (let* ((l (length src))
26:           (dst (make-brain w h)))
27:      (declare (type brain dst))
28:      (loop for i below l
29:         do (setf (aref dst i)
30:                  (rules (aref src i) (neighbours src i w h))))
31:      dst))

Compiling these produces a lot of complaints from the compiler. Some of the notes from the neighbours function look like this:

 1:  ; in: DEFUN NEIGHBOURS
 2:  ;     (ZEROP X)
 3:  ; ==>
 4:  ;   (= X 0)
 5:  ; 
 6:  ; note: unable to
 7:  ;   open-code FLOAT to RATIONAL comparison
 8:  ; due to type uncertainty:
 9:  ;   The first argument is a REAL, not a FLOAT.
10:  ; 
11:  ; note: unable to open code because: The operands might not be the same type.
12:  
13:  ;     (+ I W)
14:  ; 
15:  ; note: unable to
16:  ;   optimize
17:  ; due to type uncertainty:
18:  ;   The first argument is a REAL, not a RATIONAL.
19:  ;   The second argument is a NUMBER, not a FLOAT.
20:  ; 
21:  ; note: unable to
22:  ;   optimize
23:  ; due to type uncertainty:
24:  ;   The first argument is a REAL, not a FLOAT.
25:  ;   The second argument is a NUMBER, not a RATIONAL.
26:  ; 
27:  ; note: unable to
28:  ;   optimize
29:  ; due to type uncertainty:
30:  ;   The first argument is a REAL, not a SINGLE-FLOAT.
31:  ;   The second argument is a NUMBER, not a DOUBLE-FLOAT.
32:  ; 
33:  ; note: unable to
34:  ;   optimize
35:  ; due to type uncertainty:
36:  ;   The first argument is a REAL, not a DOUBLE-FLOAT.
37:  ;   The second argument is a NUMBER, not a SINGLE-FLOAT.
38:  ...

Profiling the code now shows that I've got rid of the two vector-ref functions. But there is still a bit of work to do on neighbours function with all the noise compiler generates. First I'll declare the types of parameters (and show only the relevant forms here):

1:  (deftype array-index () `(integer 0 ,array-dimension-limit))
2:  
3:  (declare (type brain cells)
4:           (type array-index i)
5:           (type fixnum w h))

A note on the array-index type: the actual integer range depends on the value of array-dimension-limit, which is implementation dependent (but usually is between 224 and most-positive-fixnum).

Now compiling neighbours function emits quite a short list of compiler notes:

 1:  ; in: DEFUN NEIGHBOURS
 2:  ;     (LIST UL U UR L R DL D DR)
 3:  ; 
 4:  ; note: doing signed word to integer coercion (cost 20)
 5:  ; 
 6:  ; note: doing signed word to integer coercion (cost 20)
 7:  
 8:  ;     (FLET ((UP (I)
 9:  ;              (IF (ZEROP Y)
10:  ;                  (- # W)
11:  ;                  (- I W)))
12:  ;            (DN (I)
13:  ;              (IF (= Y MY)
14:  ;                  (- # L)
15:  ;                  (+ I W)))
16:  ;            (LT (I)
17:  ;              (IF (ZEROP X)
18:  ;                  (1- #)
19:  ;                  (1- I)))
20:  ;            (RT (I)
21:  ;              (IF (= X MX)
22:  ;                  (1+ #)
23:  ;                  (1+ I))))
24:  ;       (LET* ((U (UP I))
25:  ;              (D (DN I))
26:  ;              (L (LT I))
27:  ;              (R (RT I))
28:  ;              (UL (LT U))
29:  ;              (UR (RT U))
30:  ;              (DL (LT D))
31:  ;              (DR (RT D)))
32:  ;         (MAPCAR (LAMBDA (I) (AREF CELLS I)) (LIST UL U UR L R DL D DR))))
33:  ; 
34:  ; note: doing signed word to integer coercion (cost 20) to "<return value>"
35:  ; 
36:  ; note: doing signed word to integer coercion (cost 20) to "<return value>"
37:  ; 
38:  ; compilation unit finished
39:  ;   printed 4 notes

The two offending forms are flet and list. One solution now would be to maybe declare types of all the eight cells in let* form. But is there anything about this code that bothers you? You see, lists in Common Lisp can hold any kind of values, and making a list takes time and memory. And this code actually creates two lists (one with the cell indices, and the one created by mapcar function). So telling the compiler that the things that are stuffed into those lists are in fact numbers does not make much sense (or performance difference) because the lists are still being created and traversed.

So, in the name of optimization I'll go on and do what has to be done: make a single array for neighbours calculation and update it on each call. Here's the ugly-but-supposedly-fast version:

 1:  (deftype neighbours () '(simple-array cell (8)))
 2:  
 3:  (defun neighbours (cells i w h)
 4:    (declare (type brain cells)
 5:             (type array-index i)
 6:             (type fixnum w h))
 7:    (let ((result (the neighbours (load-time-value (make-array 8 :element-type 'cell))))
 8:          (l (length cells))
 9:          (mx (1- w))
10:          (my (1- h)))
11:      (multiple-value-bind (y x)
12:          (truncate i w)
13:        (flet ((up (i) (if (zerop y) (- (+ i l) w) (- i w)))
14:               (dn (i) (if (= y  my) (- (+ i w) l) (+ i w)))
15:               (lt (i) (if (zerop x) (1- (+ i w))  (1- i)))
16:               (rt (i) (if (= x  mx) (1+ (- i w))  (1+ i))))
17:          (let* ((u (up i))
18:                 (d (dn i))
19:                 (l (lt i))
20:                 (r (rt i))
21:                 (ul (lt u))
22:                 (ur (rt u))
23:                 (dl (lt d))
24:                 (dr (rt d)))
25:            (setf (aref result 0) (aref cells ul)
26:                  (aref result 1) (aref cells u)
27:                  (aref result 2) (aref cells ur)
28:                  (aref result 3) (aref cells l)
29:                  (aref result 4) (aref cells r)
30:                  (aref result 5) (aref cells dl)
31:                  (aref result 6) (aref cells d)
32:                  (aref result 7) (aref cells dr))
33:            result)))))

First I define a new type that describes the neighbours array (known to be 8 elements long at compile time). In the function itself a new binding is introduced: result. There are two interesting things about this binding:

  • The value is created only once – when the code is loaded (in this case the behaviour is similar to static variables in C and some other languages).
  • The type of array is declared (using the form) because the value is only known at load time (as opposed to compile time). Theoretically I'd expect the compiler to figure this out by itself, but generally any code can be given to load-time-value so the compiler apparently is not even trying to figure anything out.

So there, compiling neighbours function now gives no warnings and is as ugly as if it was written in C. And since this function now returns an array, this information can now be added to rules function:

1:  (defun rules (state neighbours)
2:    (declare (type cell state)
3:             (type neighbours neighbours))
4:    (case state
5:      (1 2)
6:      (2 0)
7:      (t (if (= 2 (count 1 neighbours)) 1 0))))

Benchmarking now gives numbers around 11 seconds for each simulation run (i.e., the code runs almost two times faster than before). The top of the profiler output now looks like this:

 1:             Self        Total        Cumul
 2:    Nr  Count     %  Count     %  Count     %    Calls  Function
 3:  ------------------------------------------------------------------------
 4:     1    466  24.7    527  27.9    466  24.7        -  NEIGHBOURS
 5:     2    376  19.9   1152  61.0    842  44.6        -  COUNT
 6:     3    194  10.3    194  10.3   1036  54.8        -  SB-IMPL::OPTIMIZED-DATA-VECTOR-REF
 7:     4    141   7.5    238  12.6   1177  62.3        -  (LAMBDA (SB-IMPL::X))
 8:     5    136   7.2    136   7.2   1313  69.5        -  SB-KERNEL:HAIRY-DATA-VECTOR-REF/CHECK-BOUNDS
 9:     6    127   6.7    127   6.7   1440  76.2        -  SB-KERNEL:%COERCE-CALLABLE-TO-FUN
10:     7    110   5.8    118   6.2   1550  82.1        -  EQL
11:     8     86   4.6     86   4.6   1636  86.6        -  "foreign function sigprocmask"
12:     9     84   4.4   1800  95.3   1720  91.1        -  EVOLVE
13:  ...

A few notable things:

  • I think I'm starting to understand the second column2. That would explain why evolve has 95% and other entries less. If so, then count takes about twice the time of neighbours.
  • The vector-ref friends are back! Since neighbours only sets array values, it sure looks like count has dragged these into the picture.

Anyway, I have not done much lisp code profiling (and performance tuning for that matter) so I'll do a little experiment here: write my own function to count live neighbour count:

1:  (defun rules (state neighbours)
2:    (declare (type cell state)
3:             (type neighbours neighbours))
4:    (case state
5:      (1 2)
6:      (2 0)
7:      (t (if (= 2 (loop for x across neighbours counting (= 1 x))) 1 0))))

There is one compiler note for this loop – the count accumulation variable is checked for fixnum overflow. I'll leave it at that, wrap things up and call this my final version.

Here are the timings (with safety optimization setting of 1 and 0):

http://www.ltn.lv/~jonis/blog/4-bb-cl-2/bm-sbcl.png

http://www.ltn.lv/~jonis/blog/4-bb-cl-2/bm-ccl32.png

http://www.ltn.lv/~jonis/blog/4-bb-cl-2/bm-ccl64.png

To be honest, I did not expect getting rid of count to make such a big difference. I may only hope that this example will let implementation developers make the compiler better!

The detailed timings can be seen in transcripts (below), but generally they are:

  • 3.9s / 3s for SBCL.
  • 8.5s / 6.9s for Clozure CL (32-bit).
  • 6s / 5.5s for Clozure CL (64-bit).

I'm pretty sure it is possible to squeeze out some more performance out of this. I actually think that this kind of low-level optimisation is an art in itself, and I'm not very good at it. That is why I will leave this topic for now. But readers are welcome to contribute using comments or better yet – writing about it in their own blogs.

Footnotes

1 A comment from pkhuong on #lisp IRC channel on approaching optimization (after reading a draft version of this article):

"Only NEIGHBOURS shows up fairly high in the list. However, before working on it directly, I'd attack what I consider noise: COUNT, TRUNCATE, LENGTH and EQL shouldn't be called out of line. *-DATA-VECTOR-REF and GENERIC-+ shouldn't have to be called either. Inserting type declarations and asking for speed optimization where appropriate in order to eliminate these entries should be the first step. As it is, the profile is mostly telling you that you're doing it wrong if you really want execution speed.

Only then, with the noise eliminated, would it be time to work on the program itself, from a new profile (with the declarations and with speed optimization where needed) that may be radically different from the one above."

2 Another comment from pkhuong, this time about the columns in profiler output (confirming my understanding of the "total" column):

"Count" columns count samples, "%" % of time (samples). Self is the amount/% of samples with the function at the right at the top of the call stack (directly executing), total is the amount/% of time with the function somewhere on the call stack (up to a certain depth). Cumul is the sum of Self values; I find it helpful to guesstimate where the repartition of self times tapers off into a long tail.

The cycle output (not shown here nor in the blog post) is extremely useful if you know that a lot of time is spent in a function, either directly or in total, but now what the callers or the callees are. You'll be able to find the function you're looking for, the most common (in terms of sample) callers and callee. With that information, you could decide how to make a certain callee faster (or special-case it for the slow function), change the way the function is used in the slow callers, special-case the function for the slow callers, etc. The cycle output is somewhat harder to read, but the format is the same as that by gprof, so you can look around for more resources.

2009-10-29

Brian's brain, on Common Lisp, take 2

While writing code for the first version of Brian's Brain some things crept into my mind, but I left them out of the article to keep it focused. The main issues are:

  • Using symbols to represent cells is not very efficient since a symbol is a kind of object, which means it is a pointer, which means it takes 32 or 64 bits of storage. 2 bits would be enough in our case.
  • Computer memory is linear (well, from a programmer's point of view), and 2-dimensional arrays are figments of our imagination. I mean, in the end it's still a 1-dimensional array, with some support from the programming language to make it appear as something else. Is it worth the trouble to implement the 2D world with a 1D array if the language (Common Lisp in this case) provides us with 2D arrays?
  • Currently a new array is allocated for each simulation step, which is not strictly necessary. Each simulation step needs only two arrays – for current world state and the next one. Since all cells of the new state are overwritten on each step, we don't need to create a completely new array to store them – we can reuse the one from previous step.

Specialised arrays

I'll start with changing the cell representation – instead of symbols I'll use numbers. And the world is going to be a suitable array. This is also a good time to tell a bit about arrays and their types to those unfamiliar with Common Lisp.

The arrays I used for the first implementation of Brian's Brain were of type T, which is the most generic array type possible, and it can store any kind of objects. It is also possible to specify the type of objects that will be stored in an array when creating it. And Common Lisp implementation will choose the most specific type of array that can store objects of specified type.

In my case I want to use array store numbers from 0 to 2. There is an integer type in Common Lisp which is the type of mathematical integer (with unlimited magnitude). It is also possible to limit the magnitude by specifying the lower and upper limit. Here's how to create an array that is suitable for storing integers from 0 to 2 in Common Lisp (SBCL):

1:  CL-USER> (make-array 7 :element-type '(integer 0 2))
2:  #(0 0 0 0 0 0 0)

We can also check what is the exact type of the object that has been created:

3:  CL-USER> (type-of (make-array 7 :element-type '(integer 0 2)))
4:  (SIMPLE-ARRAY (UNSIGNED-BYTE 2) (7))

As we can see, the type of the array created is (unsigned-byte 2), which means all non-negative integers representable by 2 bits (see the documentation of unsigned-byte for more details). The type returned is more general than what I have asked for, but as I said, it is up to Common Lisp implementation to choose what type of array to create. There is also a function that can tell what type of array will be created given a type specifier. Let's see, again on SBCL:

5:  CL-USER> (upgraded-array-element-type '(integer 0 2))
6:  (UNSIGNED-BYTE 2)
7:  CL-USER> (upgraded-array-element-type '(integer 0 42))
8:  (UNSIGNED-BYTE 7)

A different Common Lisp implementation is allowed to work differently. Here is what 32-bit Clozure CL tells us:

 9:  CL-USER> (upgraded-array-element-type '(integer 0 2))
10:  (UNSIGNED-BYTE 8)
11:  CL-USER> (upgraded-array-element-type '(integer 0 42))
12:  (UNSIGNED-BYTE 8)
13:  CL-USER> (type-of (make-array 7 :element-type '(integer 0 2)))
14:  (SIMPLE-ARRAY (UNSIGNED-BYTE 8) (7))

As you can see, in Clozure CL 8-bit-byte is the most specific array type available for arrays of small integers. There are many places in Common Lisp specification where implementations are given freedom of implementation choices. For instance, this specific issue is described in array upgrading section.

So now we are ready to work with numbers as cell representations. 0 for dead cell, 1 for alive cell, and 2 for dying cell:

15:  (defun make-brain (w h)
16:    (make-array (list h w) :element-type '(integer 0 2)))
17:  
18:  (defun make-initialised-brain (w h)
19:    (let ((cells (make-brain w h))
20:          (mid (floor w 2)))
21:      (setf (aref cells 0 mid) 1)
22:      (setf (aref cells 0 (1+ mid)) 1)
23:      cells))
24:  
25:  (defun rules (state neighbours)
26:    (case state
27:      (1 2)
28:      (2 0)
29:      (t (if (= 2 (count 1 neighbours)) 1 0))))

All other code stays as it was. Now is a good time to recall the odd graph from last time (scaled to match other graphs below):

http://www.ltn.lv/~jonis/blog/3-bb-cl/bm-symbolic-sbcl.png

And the new version:

http://www.ltn.lv/~jonis/blog/3-bb-cl/bm-num-sbcl.png

Quite a nice improvement for such a small change. I guess my theory about GCing large arrays was correct. The times for Clozure CL have also improved, not quite that dramatically though:

http://www.ltn.lv/~jonis/blog/3-bb-cl/bm-num-ccl32.png

http://www.ltn.lv/~jonis/blog/3-bb-cl/bm-num-ccl64.png

What do we learn? Lisp programmers can learn the price of some things1.

Destructive action

Next step is to avoid allocating a new array at each evolution step. I'm not expecting any serious performance gains from this because at this point each iteration is creating a single object that does not have to be traversed by GC. Let's see if I can shed some light on the believers in manual memory management lurking in the dark corners of intarwebs.

First I'll make evolve take another parameter – the array where the next generation cells will be put into:

30:  (defun evolve (src dst)
31:    (let* ((w (array-dimension src 1))
32:           (h (array-dimension src 0)))
33:      (loop for j below h
34:         do (loop for i below w
35:               do (setf (aref dst j i)
36:                        (funcall 'rules (aref src j i) (neighbours src i j)))))
37:      dst))

Easy, one parameter more, one line less. Now simulate will need two brains and will alternate them at each call to evolve:

38:  (defun simulate (steps a b)
39:    (loop repeat steps
40:       do (psetf a (evolve a b)
41:                 b a)
42:       finally (return a)))

benchmark also must be updated to pass brain (of proper size) to simulate:

43:  (defun benchmark ()
44:    (format *trace-output* "Benchmarking on ~A ~A~%"
45:            (lisp-implementation-type)
46:            (lisp-implementation-version))
47:    ;; Warmup.
48:    (simulate 10000 (make-initialised-brain 16 16))
49:    (loop
50:       for (w h i) in '((32    32  32768)
51:                        (64    64  8192)
52:                        (128  128  2048)
53:                        (256  256  512)
54:                        (512  512  128)
55:                        (1024 1024 32)
56:                        (2048 2048 8)
57:                        (4096 4096 2))
58:       do #+ccl (gc)
59:          #+sbcl (gc :full t)
60:          (let ((initial (make-initialised-brain w h))
61:                (spare (make-brain w h)))
62:            (format *trace-output* "*** ~Dx~D ~D iteration~:P ***~%" w h i)
63:            (time (simulate i initial spare))
64:            (finish-output *trace-output*)))
65:    (values))

Here, take a close look at the graphs now:

http://www.ltn.lv/~jonis/blog/3-bb-cl/bm-des-sbcl.png

http://www.ltn.lv/~jonis/blog/3-bb-cl/bm-des-ccl32.png

http://www.ltn.lv/~jonis/blog/3-bb-cl/bm-des-ccl64.png

Notice anything interesting? Right, the performance has not improved noticeably. It has actually decreased slightly for SBCL and 64-bit Clozure CL. What's the lesson? Garbage Collector will make your life easier (if you don't deliberately spam it).

Dimension warp

The last exercise today will be to test whether it is worth the trouble of implementing 2-dimensional arrays on top of 1-dimensional arrays manually. This means doing the index calculations manually. I'll branch this experiment off the numerical branch2.

Let's start with the easy things – creating the arrays:

66:  (defun make-brain (w h)
67:    (make-array (* w h) :element-type '(integer 0 2)))

All that has to be changed in make-initialised-brain is to remove the extra dimension (because we only change the first row, and first row in our array are first w cells):

68:  (defun make-initialised-brain (w h)
69:    (let ((cells (make-brain w h))
70:          (mid (floor w 2)))
71:      (setf (aref cells mid) 1)
72:      (setf (aref cells (1+ mid)) 1)
73:      cells))

The main function that has to be changed is neighbours. The function is very simple, although a bit lengthy (comparing to other incorrect versions), but I like to keep the intent of the code close to how I'd do things myself if I were pretending to be a computer:

74:  (defun neighbours (cells i w h)
75:    (let ((l (length cells))
76:          (mx (1- w))
77:          (my (1- h)))
78:      (multiple-value-bind (y x)
79:          (truncate i w)
80:        (flet ((up (i) (if (zerop y) (- (+ i l) w) (- i w)))
81:               (dn (i) (if (= y  my) (- (+ i w) l) (+ i w)))
82:               (lt (i) (if (zerop x) (1- (+ i w))  (1- i)))
83:               (rt (i) (if (= x  mx) (1+ (- i w))  (1+ i))))
84:          (let* ((u (up i))
85:                 (d (dn i))
86:                 (l (lt i))
87:                 (r (rt i))
88:                 (ul (lt u))
89:                 (ur (rt u))
90:                 (dl (lt d))
91:                 (dr (rt d)))
92:            (mapcar (lambda (i) (aref cells i))
93:                    (list ul u ur l r dl d dr)))))))

cells is the 1-dimensional array that is used to simulate the 2-dimensional array. i is the index of a cell for which the neighbours should be calculated. And since cells is a 1-dimensional array and does not have any information about what dimensions are stored in it, w and h specify the dimensions of 2-dimensional array. (If it is not clear to you why these dimensions must be specified think about how many different 2-dimensional arrays can be stored in 24-element 1-dimensional array).

I have written this function in a way that allows me to think about the 1D array as a 2D array. So from i and w the x and y coordinates are calculated. These are used to detect if the cell under question is on the top (0) row or leftmost (0) column. For bottom row and rightmost column I have my and mx, respectively.

Then come four local utility functions to determine an index that is up (up), down (dn), left (lt) or right (rt) of a given cell. The logic is the same as in the 2D array version.

Next I calculate indices for up, down, left and right cells using the above mentioned functions. The cells on diagonals are calculated relative to these. For instance, up-left cell can be obtained by going up from left cell or going left from up cell.

In the end the obtained indices are used to get cell values from cells array. And that's all there is to it.

Now evolve also can be simplified a bit since only one dimension has to be traversed. However, the brain dimensions must be specified explicitly since the dimensions of the brain cannot be obtained from the 1D array.

 94:  (defun evolve (src w h)
 95:    (let* ((l (length src))
 96:           (dst (make-brain w h)))
 97:      (loop for i below l
 98:         do (setf (aref dst i)
 99:                  (funcall 'rules (aref src i) (neighbours src i w h))))
100:      dst))

The dimensions must be dragged through simulate function, too:

101:  (defun simulate (steps initial w h)
102:    (loop with brain = initial
103:       repeat steps
104:       do (setf brain (funcall 'evolve brain w h))
105:       finally (return brain)))

The code gets ever uglier. Another way to solve this problem would be to create a container object (like a structure or class) to hold the cells array and the two dimensions. But that is really not the point of this exercise – I actually expect the gains from this change to be so minimal (if any!) that it is not worth the trouble at all.

OK, last change – brain dimensions must be passed to simulate:

106:  (defun benchmark ()
107:    (format *trace-output* "Benchmarking on ~A ~A~%"
108:            (lisp-implementation-type)
109:            (lisp-implementation-version))
110:    ;; Warmup.
111:    (simulate 10000 (make-initialised-brain 16 16) 16 16)
112:    (loop
113:       for (w h i) in '((32    32  32768)
114:                        (64    64  8192)
115:                        (128  128  2048)
116:                        (256  256  512)
117:                        (512  512  128)
118:                        (1024 1024 32)
119:                        (2048 2048 8)
120:                        (4096 4096 2))
121:       do #+ccl (gc)
122:          #+sbcl (gc :full t)
123:          (let ((initial (make-initialised-brain w h)))
124:            (format *trace-output* "*** ~Dx~D ~D iteration~:P ***~%" w h i)
125:            (time (simulate i initial w h))
126:            (finish-output *trace-output*)))
127:    (values))

So, let's see what do we get:

http://www.ltn.lv/~jonis/blog/3-bb-cl/bm-1d-sbcl.png

http://www.ltn.lv/~jonis/blog/3-bb-cl/bm-1d-ccl32.png

http://www.ltn.lv/~jonis/blog/3-bb-cl/bm-1d-ccl64.png

Oh man was I wrong! Why would the difference be so big? A profiler would come in handy, but before that I'd like to do another experiment because I have a feeling about the source of slowness: there are no declarations and generic array access code for 2D arrays might be more complex than for 1D arrays. Or maybe I should give profiler a shot?

Looking into future

The problem is that performance tweaking this brute-force solution seems a bit pointless. The point of these articles was to show how I would implement the Brian's Brain in Common Lisp (and do some apples-to-oranges comparisons to spice things up). A really interesting thing to try out would be to implement Hashlife, which promises an astronomic speed explosion, and see how it works with Brian's Brain. And I suspect that generating "6 octillion generations in 30 seconds" would not quite work out because Brian's Brain is a lot more chaotic (I have not noticed a single oscillator while watching my animations) – the presence of "dead" cells force a lot of movement (and expansion). I noticed quite a few spaceships, though.

I also installed Golly – a wonderful application to play with cellular automata. It has the Hashlife algorithm implemented, and it really does wonders. However it did not work out for Brian's Brain (starting with 2 live cells as I've been doing all this time) because:

  • In 30 seconds it does not even get to generation 2000.
  • The pattern is ever-expanding and I could not find any way to limit the world size (i.e., I think the torus-world is not implemented in Golly).

So I'd have to implement Hashlife myself. Except that I'm hesitant to hand out $30 for the original B. Gosper's paper, but might as well do it in the future (since as an extra bonus the original implementation was done on Lisp Machine).

So, I think I'll do one more iteration with the brute-force method (with profiling and declarations). And then we'll see what will happen.

Updates

2009-11-01: Please don't send me any more copies of B. Gosper's paper :)

Footnotes

1 Alan Perlis, Epigrams in Programming, "55. A LISP programmer knows the value of everything, but the cost of nothing."

2 Yes, I'm using git, and branching like crazy, even in this toy project.

2009-10-26

Brian's Brain on Common Lisp

The real thing

If you are reading this you might have already read the previous entry and are familiar with the problem domain. And today I'm going to walk you through implementing Brian's Brain in Common Lisp. First, the straight-forward implementation with brain represented as a 2-dimensional array (that's what it is, right?), with just enough code to get it running. There are no functions to abstract away the brain implementation details or cell representation or anything else.

First, a function to create a brain:

1:  (defun make-brain (w h)
2:    (make-array (list h w) :initial-element :off))

And another function that will make us an initialised brain (like the one in Clojure version):

1:  (defun make-initialised-brain (w h)
2:    (let ((cells (make-brain w h))
3:          (mid (floor w 2)))
4:      (setf (aref cells 0 mid) :on)
5:      (setf (aref cells 0 (1+ mid)) :on)
6:      cells))

Why are the dimensions to make-array passed as (h w) and not (w h) you might ask? Because I like to see that my functions work as soon as I write them. Let's see how it works:

CL-USER> (make-initialised-brain 7 5)
#2A((:OFF :OFF :OFF :ON  :ON  :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF :OFF :OFF))

What would happen if we had them in the opposite order:

CL-USER> (make-initialised-brain 5 7)
#2A((:OFF :OFF :ON  :ON  :OFF)
    (:OFF :OFF :OFF :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF))

Incidentally, this is how Lau's version has them (if you have followed the examples carefully enough). There is no real difference having them either way, only a matter of convenience: if I look at the brain in REPL I want to see the same thing I'd see in animated graphical output.

The rules are independent of brain representation as long as we can provide the neighbouring cells as a parameter:

1:  (defun rules (state neighbours)
2:    (case state
3:      (:on    :dying)
4:      (:dying :off)
5:      (t (if (= 2 (count :on neighbours)) :on :off))))

How do we find the neighbours of a given cell? Easy, like this:

 1:  (defun neighbours (cells x y)
 2:    (let* ((mx (1- (array-dimension cells 1)))
 3:           (my (1- (array-dimension cells 0)))
 4:           (l (if (zerop x) mx (1- x)))
 5:           (r (if (= x mx) 0 (1+ x)))
 6:           (u (if (zerop y) my (1- y)))
 7:           (d (if (= y my) 0 (1+ y))))
 8:      (mapcar (lambda (x y)
 9:                (aref cells y x))
10:              (list l x r l r l x r)
11:              (list u u u y y d d d))))

What happens here should be pretty obvious, but I'll explain a bit anyway since this is a one-way communication channel. mx and my are maximal values for x and y coordinates. Left of the cell (l) is current x coordinate minus 1, unless we're on the leftmost column (0), in which case we get mx. Similarly for right cell, except we look if we're on the rightmost column (mx), and wrap to 0 if we are. Similarly for y axis. In short, referencing a cell off the edge gets us a cell on the opposite side.

Then for each pair of coordinates around our cell we get the value from the cells array. These pairs are given by two lists: one fore x coordinates and one for y coordinates. Function mapcar goes over both lists simultaneously and applies given function to each successive pair of items from both lists.

Also note how indices are passed to aref, with y and x in unnatural positions. This is for the reasons explained above – the rows are first dimension, and columns the second. But neighbours function expects them in the natural order, leaving the implementation details out of the way.

Let's check if our neighbours function works as expected:

CL-USER> (neighbours (make-initialised-brain 7 5) 3 4)
(:OFF :OFF :OFF :OFF :OFF :OFF :ON :ON)

The resulting list is cell values for, respectively, left-up, up, right-up, left, right, left-down, down and right-down cells from the specified x, y coordinate.

What's left? Evolution. The function which will create next state of a brain:

1:  (defun evolve (src)
2:    (let* ((w (array-dimension src 1))
3:           (h (array-dimension src 0))
4:           (dst (make-brain w h)))
5:      (loop for j below h
6:         do (loop for i below w
7:               do (setf (aref dst j i)
8:                        (funcall 'rules (aref src j i) (neighbours src i j)))))
9:      dst))

Ordinary loop over rows and columns, setting values in newly created brain by applying the rules to a cell and its neighbours in the current brain. We're ready to play now:

CL-USER> (evolve (make-initialised-brain 7 5))
#2A((:OFF :OFF :OFF :DYING :DYING :OFF :OFF)
    (:OFF :OFF :OFF :ON    :ON    :OFF :OFF)
    (:OFF :OFF :OFF :OFF   :OFF   :OFF :OFF)
    (:OFF :OFF :OFF :OFF   :OFF   :OFF :OFF)
    (:OFF :OFF :OFF :ON    :ON    :OFF :OFF))
CL-USER> (evolve *)
#2A((:OFF :OFF :ON  :OFF   :OFF   :ON  :OFF)
    (:OFF :OFF :OFF :DYING :DYING :OFF :OFF)
    (:OFF :OFF :OFF :ON    :ON    :OFF :OFF)
    (:OFF :OFF :OFF :ON    :ON    :OFF :OFF)
    (:OFF :OFF :OFF :DYING :DYING :OFF :OFF))

Using numbers for cell values would be much better visually, but I'm staying close to the Clojure version (for now).

Getting ready for blastoff

Almost ready to do some timing. All we need are the simulate and benchmark functions:

 1:  (defun simulate (steps initial)
 2:    (loop repeat steps
 3:       for brain = initial then (funcall 'evolve brain)
 4:       finally (return brain)))
 5:  
 6:  (defun benchmark ()
 7:    (format *trace-output* "Benchmarking on ~A ~A~%"
 8:            (lisp-implementation-type)
 9:            (lisp-implementation-version))
10:    ;; Warmup.
11:    (simulate 10000 (make-initialised-brain 16 16))
12:    (loop
13:       for (w h i) in '((32    32  32768)
14:                        (64    64  8192)
15:                        (128  128  2048)
16:                        (256  256  512)
17:                        (512  512  128)
18:                        (1024 1024 32)
19:                        (2048 2048 8)
20:                        (4096 4096 2))
21:       do (let ((initial (make-initialised-brain w h)))
22:            (format *trace-output* "*** ~Dx~D ~D iteration~:P ***~%" w h i)
23:            (time (simulate i initial))
24:            (finish-output *trace-output*)))
25:    (values))

Notice that there is not a single type annotation in this code. Running it on my laptop1:

CL-USER> (benchmark)
Benchmarking on SBCL 1.0.31.32
*** 32x32 32768 iterations ***
Evaluation took:
  34.782 seconds of real time
  34.064263 seconds of total run time (33.060215 user, 1.004048 system)
  [ Run times consist of 7.670 seconds GC time, and 26.395 seconds non-GC time. ]
  97.94% CPU
  96,906,498,234 processor cycles
  14,769,770,512 bytes consed

*** 64x64 8192 iterations ***
Evaluation took:
  33.512 seconds of real time
  32.776229 seconds of total run time (31.254474 user, 1.521755 system)
  [ Run times consist of 5.903 seconds GC time, and 26.874 seconds non-GC time. ]
  97.80% CPU
  93,366,482,790 processor cycles
  14,762,592,768 bytes consed
...
snipped

Running from the terminal would look something like this:

$ sbcl --noinform --disable-debugger --load simple.fasl --eval "(benchmark)" --eval "(quit)"

This time we have more information to display in the graphs: total run time and GC time. And just look at the numbers:

http://www.ltn.lv/~jonis/blog/2-bb-cl/bm-sbcl.png

Interesting. Very interesting, indeed. What we see here is that the time to run one iteration (blue bars) is increasing a bit first and then declines quite sharply. GC time (green), on the other hand, increases quite sharply starting from 256x256 simulation. Can you come up with an explanation?

I have one. Smaller arrays are processed quite fast, so are short lived, and become garbage before GC kicks in. The bigger arrays are processed longer, so they are alive when GC starts. And GC has to walk all array elements each time, since arrays are not specialised (that is, can contain anything). I told you using numbers would be better (for different reasons, though)!

I also did another run of this same benchmark (transcripts available below), and the numbers were the same up to sub-second precision.

Let's do this same thing with a different Common Lisp implementation, which in my case will be Clozure CL. The nice thing about this implementation that its compiler is very snappy. Running from shell looks like this:

$ ccl -n -Q -l simple.dx32fsl -e "(benchmark)"

http://www.ltn.lv/~jonis/blog/2-bb-cl/bm-ccl.png

One thing to note is that arrays of size 4096x4096 exceed the limit of array length (24-bit number) of 32-bit Clozure CL, which, coincidentally, is just 1 short of what we need:

CL-USER> array-dimension-limit
16777216
CL-USER> (integer-length array-dimension-limit)
25
CL-USER> (integer-length (1- array-dimension-limit))
24
CL-USER> (* 4096 4096)
16777216

But otherwise there is less variation in 32-bit Clozure CL. Let's look at 64-bit version:

http://www.ltn.lv/~jonis/blog/2-bb-cl/bm-ccl64.png

Nice, behaviour similar to SBCL, except that GC times don't grow as fast.

Missing things

I hear somebody in the corner mumbling something about animation and graphics. Oh, right. Missed that one. But it so happens that the very thing that made me start playing with Brian's Brain is that I installed cl-opengl. And guess what? It worked right out of the box. On all the Common Lisp implementation I have on my computer. So I had start playing with it. And the rest, as they say, is history.

It works like this:

CL-USER> (asdf:operate 'asdf:load-op :cl-glut-examples)
; System loading output snipped...
NIL
CL-USER> (cl-glut-examples:run-examples)

And all the examples pop up, many of them animating. So I peek at some examples to see how to set up a window. Easy as a pie:

 1:  (defclass bb (glut:window)
 2:    ((cells :accessor cells-of :initarg :cells))
 3:    (:default-initargs
 4:     :title "Brian's Brain in CL"
 5:     :mode '(:double :rgb)))
 6:  
 7:  (defmethod glut:display-window :before ((w bb))
 8:    (gl:clear-color 0 0 0 0)
 9:    (gl:matrix-mode :projection)
10:    (gl:load-identity)
11:    (let ((cells (cells-of w)))
12:     (gl:ortho 0 (array-dimension cells 1)  0 (array-dimension cells 0) -1 1)))

Then we need a function to render a single cell at specified position. Drawing squares is easy enough:

 1:  (defun render-cell (x y cell)
 2:    (flet ((draw-cell (x y)
 3:             (gl:with-pushed-matrix
 4:                 (gl:translate x y 0)
 5:               (gl:with-primitive :polygon
 6:                 (gl:vertex 0.1 0.1 0)
 7:                 (gl:vertex 0.9 0.1 0)
 8:                 (gl:vertex 0.9 0.9 0)
 9:                 (gl:vertex 0.1 0.9 0)))))
10:      (case cell
11:        (:on (gl:color 1 1 1)
12:             (draw-cell x y))
13:        (:dying (gl:color 0.5 0.5 0.5)
14:                (draw-cell x y)))))

All that's left are some callbacks to draw the whole window and run the animation. The following two methods will do just fine:

 1:  (defmethod glut:display ((w bb))
 2:    (gl:clear :color-buffer)
 3:    (let* ((cells (cells-of w))
 4:           (w (array-dimension cells 1))
 5:           (h (array-dimension cells 0)))
 6:      (loop for j below h
 7:         do (loop for i below w
 8:               do (render-cell i j (aref cells j i)))))
 9:    (glut:swap-buffers))
10:  
11:  
12:  (defmethod glut:idle ((w bb))
13:    (setf (cells-of w) (evolve (cells-of w)))
14:    (glut:post-redisplay))

Everything is ready now. An animated Brian's Brain can be created like this:

(glut:display-window (make-instance 'bb
                                    :cells (make-initialised-brain 128 128)
                                    :width 512
                                    :height 512))

But since I don't like to put things on the toplevel which run when just loading a file, I'll put the code into a function:

1:  (defun run (w h ww wh)
2:    (glut:display-window
3:     (make-instance 'bb
4:                    :cells (make-initialised-brain w h)
5:                    :width ww
6:                    :height wh)))

Feel free to start a never-ending Brian's Brain simulation:

CL-USER> (run 160 100 320 200)

http://www.ltn.lv/~jonis/blog/2-bb-cl/bb-160x100.png

Finishing touches

To make this all easily loadable I'll put all code in its own package and create a system definition (which nowadays means a ASDF) file. Refer to Xach's intro for a nice description of why and how to do this.

Package definition is very simple:

1:  (defpackage :brians-brain-1
2:    (:use :common-lisp)
3:    (:export #:run))

And the system definition is nothing complicated, either:

1:  (asdf:defsystem :brians-brain-1
2:      :version "1.0"
3:      :author "Jānis Džeriņš"
4:      :license "Send me money if you find this stuff useful."
5:      :depends-on (cl-opengl cl-glut)
6:      :components ((:file "package")
7:                   (:file "simple" :depends-on ("package"))
8:                   (:file "display" :depends-on ("package" "simple"))))

At this point we can get to a running animated Brian's Brain from the shell prompt:

$ sbcl --noinform --disable-debugger \
       --eval "(asdf:operate 'asdf:load-op :brians-brain-1)" \
       --eval "(brians-brain-1:run 160 100 320 200)" \
       --eval "(quit)"

Looking forward

Next time I'm going to play with different brain representations.

Updates

2009-10-28: Noticed a bug in simulate function which runs the simulation for one step less than asked. The corrected version looks like this:

1:  (defun simulate (steps initial)
2:    (loop with brain = initial
3:       repeat steps
4:       do (setf brain (funcall 'evolve brain))
5:       finally (return brain)))

Also now invoking GC before each simulation so that garbage from previous simulation has less chance to influence the next:

 6:  (defun benchmark ()
 7:    (format *trace-output* "Benchmarking on ~A ~A~%"
 8:            (lisp-implementation-type)
 9:            (lisp-implementation-version))
10:    ;; Warmup.
11:    (simulate 10000 (make-initialised-brain 16 16))
12:    (loop
13:       for (w h i) in '((32    32  32768)
14:                        (64    64  8192)
15:                        (128  128  2048)
16:                        (256  256  512)
17:                        (512  512  128)
18:                        (1024 1024 32)
19:                        (2048 2048 8)
20:                        (4096 4096 2))
21:       do #+ccl (gc)
22:          #+sbcl (gc :full t)
23:          (let ((initial (make-initialised-brain w h)))
24:           (format *trace-output* "*** ~Dx~D ~D iteration~:P ***~%" w h i)
25:           (time (simulate i initial))
26:           (finish-output *trace-output*)))
27:    (values))

The graphs are generally very similar, except the rightmost columns don't look fishy:

http://www.ltn.lv/~jonis/blog/2-bb-cl/bm-sbcl-2.png

http://www.ltn.lv/~jonis/blog/2-bb-cl/bm-ccl32-2.png

http://www.ltn.lv/~jonis/blog/2-bb-cl/bm-ccl64-2.png

New transcripts:

bm-sbcl-2.txt
bm-ccl32-2.txt
bm-ccl64-2.txt

Footnotes

1 See the previous blog entry for the specs.

Brian's Brain on Clojure

Here's your brain, and here's your brain on Clojure

A couple weeks ago I came across a blog entry by Lau B. Jensen about Brian's Brain implementation in Clojure. And so it happened that briefly before this I have decided to devote some of my time to programming in Common Lisp. What a lucky coincidence – this looks like a perfect job to swap Common Lisp back into the operational area of my brain.

You see, implementing Brian's Brain is easy. So easy that it is not worthy of a blog post except if one spices it up with some orthogonal material. The reference blog entry mentioned above does just that: implement something simple to tell everybody how cool some features of Clojure are. (And then tell that those features come at a drastic performance price.) I smell a recipe for a successful first blog entry: controversy! And the best harmless controversy one can get on the internet is apples to oranges comparisons.

So, as you might expect I'm going to compare the Clojure and Common Lisp implementations1 of Brian's Brain. And today I'll give you some numbers about the Clojure implementations (both original and transient). And to make apples and oranges more similar I'm going to leave only the cell simulation code, leaving graphics and rendering out of the discussion. For a while.

Let's measure something

The simulation and measuring is going to be simple: I'll measure the time it takes to simulate brains of various sizes for a set number of iterations. Since it is apparent that the time to advance the state of brain one step is proportional to the size of the brain, it is reasonable to expect the time spent to increase proportionally to the number of cells in the brain. And since all of the brain's cells are the same, there is nothing interesting to expect in the resulting numbers.

The only interesting thing I'd expect from varying brain sizes is the impact of runtime's memory management and maybe cache locality effects. Looking at these things is also interesting because the brain representation in both Lau's implementations are different: [lazy] lists in the original version and a vector in the transient version. Therefore I'll run the code for brains of different sizes, but adjusting the number of simulation steps so that the number of cells processed stays the same. This way I'd expect graphs be more-or-less straight horizontal.

But first

The code to run simulation, nothing complicated:

1:  (defn simulate [steps initial f]
2:    (loop [n steps, brain initial]
3:      (if (< 0 n)
4:        (recur (dec n) (step brain f))
5:        brain)))

This is a typical Clojure loop that has two "state" variables, which are updated on each iteration (by means of recur invocation): n is decreased, and brain is "transformed" using the step function. That is, on each iteration of the loop new values for these two variables are used; no variables (or values) are really modified.

The value of f is either map or pmap. This way it is easy to run a simulation in serial or parallel fasion by just passing a different function to the simulation. The original step function accepts only a single parameter:

1:  (defn step [board]
2:    (doall
3:     (map (fn [window]
4:            (apply #(doall (apply map rules %&))
5:                   (doall (map torus-window window))))
6:          (torus-window board))))

In his article Lau suggests that to get the parallel version of this function only one function call must be changed. Except that after the change the function must be recompiled. Contrast this to my changed version:

1:  (defn step [board f]
2:    (doall
3:     (f (fn [window]
4:          (apply #(doall (apply map rules %&))
5:                 (doall (map torus-window window))))
6:        (torus-window board))))

Now nothing must be recompiled when we decide to change this function from serial to parallel version.

Another thing I had to change is how the initial brain is created. Lau uses global variables quite a lot (and we'll see how it hurts on many occassions ahead). Besides not being a nice thing in general, it also is very un-lispy (and un-functional as well, but I'll have an article on this some other time). There is nothing in Lau's code that requires global variables, so I re-wrote his code to get rid of them.

This code:

1:  (def dim-board [90 90])
2:  
3:  (def board
4:       (for [x (range (dim-board 0))]
5:         (for [y (range (dim-board 1))]
6:           [(if (< 50 (rand-int 100)) :on :off) x y])))

now has become this:

1:  (defn make-random-board [w h]
2:    (for [x (range w)]
3:      (for [y (range h)]
4:        [(if (< 50 (rand-int 100)) :on :off) x y])))

Now we're ready to experiment with various combinations of brain size and iteration counts in the REPL (manually-reformatted by me for aesthetic reasons):

user> (simulate 0 (make-random-board 4 3) map)
(([:on  0 0] [:on  0 1] [:on  0 2])
 ([:on  1 0] [:on  1 1] [:off 1 2])
 ([:on  2 0] [:off 2 1] [:off 2 2])
 ([:off 3 0] [:off 3 1] [:on  3 2]))
user> (simulate 1 (make-random-board 4 3) pmap)
(([:off   0 0] [:dying 0 1] [:off   0 2])
 ([:off   1 0] [:dying 1 1] [:dying 1 2])
 ([:dying 2 0] [:off   2 1] [:off   2 2])
 ([:dying 3 0] [:dying 3 1] [:off   3 2]))
user> (simulate 2 (make-random-board 4 3) map)
(([:off 0 0] [:off 0 1] [:off 0 2])
 ([:off 1 0] [:off 1 1] [:off 1 2])
 ([:off 2 0] [:off 2 1] [:off 2 2])
 ([:off 3 0] [:off 3 1] [:off 3 2]))

One interesting observation: the board after 2 iterations looks broken (all cells are dead). Is the code somehow broken? Things like these are easy to find out interactively. But only if it's easy to specify various parameters. And that's where the global variables make life hard. In Lau's implementation, for each of the little REPL tests I'd have to go and redefine two global variables (in order). In my version one can play to their heart content without leaving the REPL.

So, what exactly is going on here? Nothing special: having half the brain's cells alive at the start makes this little brain overpopulated and by the rules of the world all cells die. And since I'm not going to manually perform pretty-printing or write a function to do it for me, I'll show another advantage of avoiding unnecessary global variables.

You see, for doing reliable experiments (like this one I'm doing here) they must be repeatable. So using randomly-initialised brains is not going to work (unless a specific pseudo-random-number-generator is seeded with some known seed, but that's besides the point). And it so happens that the other Lau's implementation uses a specific initial brain setup: two adjacent cells in the first row are initially alive. The code to create such a brain is:

1:  (defn make-special-board [w h]
2:    (let [mid (/ w 2)]
3:     (for [x (range w)]
4:       (for [y (range h)]
5:         [(if (and (= y 0) (<= mid x (inc mid))) :on :off) x y]))))

Now, let's play a bit in the REPL:

user> (simulate 0 (make-special-board 4 3) map)
(([:off 0 0] [:off 0 1] [:off 0 2])
 ([:off 1 0] [:off 1 1] [:off 1 2])
 ([:on  2 0] [:off 2 1] [:off 2 2])
 ([:on  3 0] [:off 3 1] [:off 3 2]))
user> (simulate 1 (make-special-board 4 3) map)
(([:off   0 0] [:off 0 1] [:off 0 2])
 ([:off   1 0] [:off 1 1] [:off 1 2])
 ([:dying 2 0] [:on  2 1] [:on  2 2])
 ([:dying 3 0] [:on  3 1] [:on  3 2]))
user> (simulate 2 (make-special-board 4 3) map)
(([:on  0 0] [:on    0 1] [:on    0 2])
 ([:on  1 0] [:on    1 1] [:on    1 2])
 ([:off 2 0] [:dying 2 1] [:dying 2 2])
 ([:off 3 0] [:dying 3 1] [:dying 3 2]))
user> (= *1 (simulate 2 (make-special-board 4 3) pmap))
true
user> (= (simulate 10 (make-special-board 32 32) map)
         (simulate 10 (make-special-board 32 32) pmap))
true

Great, not only can we experiment with different brain dimensions but also with different contents. And we can easily check if parallel and sequential versions work the same. Now try doing this in the Lau's original version!

Almost there

Now we are almost ready to measure things. Here's the function I'll use:

 1:  (defn benchmark []
 2:    ;; Warmup
 3:    (simulate 7500 (make-special-board 16 16) map)
 4:    (simulate 7500 (make-special-board 16 16) pmap)
 5:    (println (clojure-version))
 6:    (doseq [[w h i] [[32    32  32768]
 7:                     [64    64  8192]
 8:                     [128  128  2048]
 9:                     [256  256  512]
10:                     [512  512  128]]]
11:      (let [initial (doall (make-special-board w h))]
12:        (print (str "S " w "x" h ", " i " iteration(s): "))
13:        (time (simulate i initial map))
14:        (flush)
15:        (print (str "P " w "x" h ", " i " iteration(s): "))
16:        (time (simulate i initial pmap))
17:        (flush))))

Nothing special, repeat serial and parallel versions for given brain dimensions and iteration number. The results of running this on my MacBook Pro2 (with the warmup timings removed):

Running this benchmark from shell can be done like this:

$ java -server -jar clojure.jar -i ca.clj -e "(benchmark)"

After a while of intensive contributing to the global warming we get some output:

$ java -server -jar clojure.jar -i ca.clj -e "(benchmark)"
1.1.0-alpha-SNAPSHOT
S 32x32, 32768 iteration(s): "Elapsed time: 415506.172 msecs"
P 32x32, 32768 iteration(s): "Elapsed time: 289353.533 msecs"
S 64x64, 8192 iteration(s): "Elapsed time: 425244.632 msecs"
P 64x64, 8192 iteration(s): "Elapsed time: 290253.746 msecs"
S 128x128, 2048 iteration(s): "Elapsed time: 421812.085 msecs"
P 128x128, 2048 iteration(s): "Elapsed time: 287244.863 msecs"
S 256x256, 512 iteration(s): "Elapsed time: 422331.897 msecs"
P 256x256, 512 iteration(s): "Elapsed time: 289623.448 msecs"
S 512x512, 128 iteration(s): "Elapsed time: 440064.214 msecs"
P 512x512, 128 iteration(s): "Elapsed time: 286282.461 msecs"

One thing you might have noticed is that the number of iterations increases as the brain size decreases. This is to have the number of cells processed in each simulation constant and the resulting numbers easy to compare.

I also run all the "benchmarks" on two JVMs (server VMs at that): OSX default 1.5 (32-bit) and 1.6 (64-bit, there's only server VM). The switching is done using the Java Preferences utility:

$ java -version
java version "1.5.0_20"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_20-b02-315)
Java HotSpot(TM) Client VM (build 1.5.0_20-141, mixed mode, sharing)

$ java -server -version
java version "1.5.0_20"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_20-b02-315)
Java HotSpot(TM) Server VM (build 1.5.0_20-141, mixed mode)

$ java -version
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03-226)
Java HotSpot(TM) 64-Bit Server VM (build 14.1-b02-92, mixed mode)

Here are the numbers for the original Brian's Brain implementation:

http://www.ltn.lv/~jonis/blog/1-bb-clj/bm-1.png

Looks like the parallel version is about 45% (JVM1.5) to 60% (JVM1.6) faster than the serial.

The times for 512x512 are not included for JVM1.6 because the serial version was running for 2416341.265ms (that's about 40 minutes!), which is way out of scale for this graph. I guess it is because of tight memory situation because the next (parallel) benchmark bailed out with out of memory exception.

Going transient

Lau's code for transient version is much worse. More global variables. And functions depending on them. What a mess… Just look at it:

 1:  (def dim-board   [130  130])
 2:  (def dim-screen  [600  600])
 3:  (def dim-scale   (vec (map / dim-screen dim-board)))
 4:  
 5:  (def board-size (apply * dim-board))
 6:  (def board      (-> (vec (repeat board-size :off))
 7:                      (assoc (/ (dim-board 0) 2)       :on)
 8:                      (assoc (inc (/ (dim-board 0) 2)) :on)))
 9:  
10:  (def cords (for [y (range (dim-board 0)) x (range (dim-board 1))] [x y]))
11:  
12:  (defn torus-coordinate
13:    [idx]
14:    (cond (neg? idx)          (+ idx board-size)
15:          (>= idx board-size) (- idx board-size)
16:      :else idx))
17:  
18:  (def above     (- (dim-board 0)))
19:  (def below     (dim-board 0))
20:  (def neighbors [-1 1 (dec above) above (inc above) (dec below) below (inc below)])
21:  
22:  (defn on-neighbors [i board]
23:    (count
24:     (filter #(= :on (nth board %))
25:             (map #(torus-coordinate (+ i %)) neighbors))))
26:  
27:  (defn step [board]
28:    (loop [i 0 next-state (transient board)]
29:      (if (< i board-size)
30:        (let [self         (nth board i)]
31:          (recur (inc i)
32:                 (assoc! next-state i (cond
33:                                        (= :on    self)                :dying
34:                                        (= :dying self)                :off
35:                                        (= 2 (on-neighbors i board))   :on
36:                                        :else                          :off))))
37:        (persistent! next-state))))

9 global variables in 37 lines of code! Bad Lau! Bad! Also, the results of torus-coordinate, on-neighbors and step (in short – all of them) depend on things besides the parameters. Very un-functional indeed.

Ok, let's put horrors aside and do what we can to get this mess under control. My only goal is to get the simulation and benchmarking functions working, so first, here they are:

 1:  (defn simulate [steps initial]
 2:    (loop [n steps, board initial]
 3:      (if (< 0 n)
 4:        (recur (dec n) (step board))
 5:        board)))
 6:  
 7:  (defn benchmark []
 8:    ;; Warmup
 9:    (simulate 10000 (make-board 16 16))
10:    (println (clojure-version))
11:    (doseq [[w h i] [[32    32  32768]
12:                     [64    64  8192]
13:                     [128  128  2048]
14:                     [256  256  512]
15:                     [512  512  128]
16:                     [1024 1024 32]]]
17:      (let [initial (make-board w h)]
18:        (print (str "S " w "x" h ", " i " iteration(s): "))
19:        (time (simulate i initial))
20:        (flush))))

Notice that the transient version cannot be parallelized by simply changing a single function, so there is no parallel version at all.

Here's the changed code that allows me to run the benchmark:

 1:  (defn make-board [w h]
 2:    (let [above (- w), below w]
 3:      (def neighbors [-1 1 (dec above) above (inc above) (dec below) below (inc below)]))
 4:  
 5:    (-> (vec (repeat (* w h) :off))
 6:        (assoc (/ w 2)       :on)
 7:        (assoc (inc (/ w 2)) :on)))
 8:  
 9:  (defn torus-coordinate [idx board]
10:    (let [board-size (count board)]
11:      (cond (neg? idx)          (+ idx board-size)
12:            (>= idx board-size) (- idx board-size)
13:            :else idx)))
14:  
15:  (defn on-neighbors [i board]
16:    (count
17:     (filter #(= :on (nth board %))
18:             (map #(torus-coordinate (+ i %) board) neighbors))))
19:  
20:  (defn step [board]
21:    (let [board-size (count board)]
22:      (loop [i 0 next-state (transient board)]
23:        (if (< i board-size)
24:          (let [self         (nth board i)]
25:            (recur (inc i)
26:                   (assoc! next-state i (cond
27:                                         (= :on    self)                :dying
28:                                         (= :dying self)                :off
29:                                         (= 2 (on-neighbors i board))   :on
30:                                         :else                          :off))))
31:          (persistent! next-state)))))

make-board is ugly, and defines global variables. Will have to live with it. torus-coordinate and step don't use global variables any more. And we're down to one global variable which on-neighbors depends on. I'll let this one remain on Lau's conscience.

Another thing that I just can't let slip is that the torus-coordinate function is JUST PLAIN WRONG. Sorry, I might have raised my voice. The problem is that for the cell in the leftmost column the cell to its left is on the opposite side of the board. Similarly for the rightmost. But torus-coordinate cheats and for the leftmost returns the cell on the rightmost column (correct) but one row above (wrong!). For the rightmost cells it returns the cell on the leftmost column, but one row below. This error would instantly disqualify the transient implementation in any half-serious apples-to-apples comparison.

It is easy to see that torus-coordinate is called for every "off" cell in a brain. And there are more "off" cells than "dying" or "on" or the sum of those, respectively (based on my observations of running the simulations graphically; empirical proof is easy, but will have to wait). Allowing this function to do less work definitely impacts its performance. I'm not going to write the correct Clojure version of this function so let's pretend everything is just fine and look at the performance:

$ java -server -jar clojure.jar -i transient-ca.clj -e "(benchmark)"
1.1.0-alpha-SNAPSHOT
S 32x32, 32768 iteration(s): "Elapsed time: 68259.648 msecs"
S 64x64, 8192 iteration(s): "Elapsed time: 69464.992 msecs"
S 128x128, 2048 iteration(s): "Elapsed time: 67646.623 msecs"
S 256x256, 512 iteration(s): "Elapsed time: 70889.975 msecs"
S 512x512, 128 iteration(s): "Elapsed time: 72941.713 msecs"
S 1024x1024, 32 iteration(s): "Elapsed time: 73866.637 msecs"

Quite an improvement I'd say, indeed. And since the benchmark finished relatively fast, I decided to run it again:

$ java -server -jar clojure.jar -i transient-ca.clj -e "(benchmark)"
1.1.0-alpha-SNAPSHOT
S 32x32, 32768 iteration(s): "Elapsed time: 90808.629 msecs"
S 64x64, 8192 iteration(s): "Elapsed time: 92661.493 msecs"
S 128x128, 2048 iteration(s): "Elapsed time: 89570.418 msecs"
S 256x256, 512 iteration(s): "Elapsed time: 92398.728 msecs"
S 512x512, 128 iteration(s): "Elapsed time: 94843.865 msecs"
S 1024x1024, 32 iteration(s): "Elapsed time: 97478.63 msecs"

Weird is the least I can say about this. So I run it one more time… And then again. All the times getting different timings (I was not doing anything else while running the benchmarks on the computer, so I don't have a slightest clue about the cause of variation). Then I gave up. Here's a graph of all four runs:

http://www.ltn.lv/~jonis/blog/1-bb-clj/bm-2.png

Three more runs on JVM1.6, which are quite a bit faster and closer to each other. But they run out of memory on the 1024x1024 run.

http://www.ltn.lv/~jonis/blog/1-bb-clj/bm-3.png

Here are the averages of all the runs together in one graph:

http://www.ltn.lv/~jonis/blog/1-bb-clj/bm-4.png

Around 25% speed increase in JVM1.6 if I'm reading the numbers correctly.

Wrapping things up

Now, the complete picture of original and transient implementations:

http://www.ltn.lv/~jonis/blog/1-bb-clj/bm-5.png

As expected the lines are quite horizontal, but not in all cases. Especially non-horizontal are the original implementation on the JVM1.6. It might be related to it being 64-bit, but I'm not really going to investigate this any further. So, as we can see, the transient version is about 4 times faster than the parallel original version, and about 6 times faster than serial version.

It is apparent that the transient version is quite a bit faster than the original. And with the original version we can see that the parallel version really does speed things up on two cores. How about 8 cores? It so happens that I have SSH access to a quite recent Mac Pro3. One notable thing about this system is that it is set to energy-saving mode, so don't be surprised by the absolute numbers, look for their relative magnitudes:

http://www.ltn.lv/~jonis/blog/1-bb-clj/bm-6.png

Watching the java process in top was showing around 600% CPU usage when running the parallel simulations. What surprises me is that the performance gain from more cores is not there. But surprise is not big enough for me to go investigate this any further.

Final words

This is only an introduction for what I really wanted to write about. It turned out much longer than I thought it would be. Cangratulations to everybody who got this far (reading the content, not searching for pretty pictures).

This article is in no way intended as bashing on anything. Yes, I didsagree with Lau on some software design points, but that should be only considered as that – disagreement. I have shown how I'd write some parts of his program. He will be welcome to comment on my Common Lisp implementation after my next post.

Please don't use these numbers to draw any conclusions about the performance of my computer, Clojure, Java, my or Lau's code or anything else. The numbers are as un-scientific as can be. I was intentionally not doing anyhting else while running the benchmarks on the computer, but there are all the usual desktop things running (except the screensaver which I disabled).

Updates

2009-10-28: Apparently I have been misguided about the energy-savings mode on the MacPro. Going to the box physically and checking if the option is turned on or not, I found out (to my big surprise) that the option is gone. For those who have a Mac can compare their "Energy Saver" window to what it was like before: http://support.apple.com/kb/HT3092. So, I'm not sure whether the "reduced" mode is enabled or not.

Footnotes

1 When I'm talking about Clojure or Common Lisp implementation, I'm talking about brain's implementation in that programming language respectively, not the implementation of language.

2 2.8GHz Intel Core 2 Duo, 6MB cache, 1.07GHz bus speed.

3 3GHz Quad-Core Intel Xeon (2 processors, 8 cores total), 8MB cache per processor. 1.33GHz bus speed.