The beauty of grey: 2009/10

2009-10-29

Brian's brain, on Common Lisp, take 2

While writing code for the first version of Brian's Brain some things crept into my mind, but I left them out of the article to keep it focused. The main issues are:

Using symbols to represent cells is not very efficient since a symbol is a kind of object, which means it is a pointer, which means it takes 32 or 64 bits of storage. 2 bits would be enough in our case.
Computer memory is linear (well, from a programmer's point of view), and 2-dimensional arrays are figments of our imagination. I mean, in the end it's still a 1-dimensional array, with some support from the programming language to make it appear as something else. Is it worth the trouble to implement the 2D world with a 1D array if the language (Common Lisp in this case) provides us with 2D arrays?
Currently a new array is allocated for each simulation step, which is not strictly necessary. Each simulation step needs only two arrays – for current world state and the next one. Since all cells of the new state are overwritten on each step, we don't need to create a completely new array to store them – we can reuse the one from previous step.

Specialised arrays

I'll start with changing the cell representation – instead of symbols I'll use numbers. And the world is going to be a suitable array. This is also a good time to tell a bit about arrays and their types to those unfamiliar with Common Lisp.

The arrays I used for the first implementation of Brian's Brain were of type T, which is the most generic array type possible, and it can store any kind of objects. It is also possible to specify the type of objects that will be stored in an array when creating it. And Common Lisp implementation will choose the most specific type of array that can store objects of specified type.

In my case I want to use array store numbers from 0 to 2. There is an integer type in Common Lisp which is the type of mathematical integer (with unlimited magnitude). It is also possible to limit the magnitude by specifying the lower and upper limit. Here's how to create an array that is suitable for storing integers from 0 to 2 in Common Lisp (SBCL):

1:  CL-USER> (make-array 7 :element-type '(integer 0 2))
2:  #(0 0 0 0 0 0 0)

We can also check what is the exact type of the object that has been created:

3:  CL-USER> (type-of (make-array 7 :element-type '(integer 0 2)))
4:  (SIMPLE-ARRAY (UNSIGNED-BYTE 2) (7))

As we can see, the type of the array created is (unsigned-byte 2), which means all non-negative integers representable by 2 bits (see the documentation of unsigned-byte for more details). The type returned is more general than what I have asked for, but as I said, it is up to Common Lisp implementation to choose what type of array to create. There is also a function that can tell what type of array will be created given a type specifier. Let's see, again on SBCL:

5:  CL-USER> (upgraded-array-element-type '(integer 0 2))
6:  (UNSIGNED-BYTE 2)
7:  CL-USER> (upgraded-array-element-type '(integer 0 42))
8:  (UNSIGNED-BYTE 7)

A different Common Lisp implementation is allowed to work differently. Here is what 32-bit Clozure CL tells us:

 9:  CL-USER> (upgraded-array-element-type '(integer 0 2))
10:  (UNSIGNED-BYTE 8)
11:  CL-USER> (upgraded-array-element-type '(integer 0 42))
12:  (UNSIGNED-BYTE 8)
13:  CL-USER> (type-of (make-array 7 :element-type '(integer 0 2)))
14:  (SIMPLE-ARRAY (UNSIGNED-BYTE 8) (7))

As you can see, in Clozure CL 8-bit-byte is the most specific array type available for arrays of small integers. There are many places in Common Lisp specification where implementations are given freedom of implementation choices. For instance, this specific issue is described in array upgrading section.

So now we are ready to work with numbers as cell representations. 0 for dead cell, 1 for alive cell, and 2 for dying cell:

15:  (defun make-brain (w h)
16:    (make-array (list h w) :element-type '(integer 0 2)))
17:  
18:  (defun make-initialised-brain (w h)
19:    (let ((cells (make-brain w h))
20:          (mid (floor w 2)))
21:      (setf (aref cells 0 mid) 1)
22:      (setf (aref cells 0 (1+ mid)) 1)
23:      cells))
24:  
25:  (defun rules (state neighbours)
26:    (case state
27:      (1 2)
28:      (2 0)
29:      (t (if (= 2 (count 1 neighbours)) 1 0))))

All other code stays as it was. Now is a good time to recall the odd graph from last time (scaled to match other graphs below):

And the new version:

Quite a nice improvement for such a small change. I guess my theory about GCing large arrays was correct. The times for Clozure CL have also improved, not quite that dramatically though:

What do we learn? Lisp programmers can learn the price of some things¹.

Destructive action

Next step is to avoid allocating a new array at each evolution step. I'm not expecting any serious performance gains from this because at this point each iteration is creating a single object that does not have to be traversed by GC. Let's see if I can shed some light on the believers in manual memory management lurking in the dark corners of intarwebs.

First I'll make evolve take another parameter – the array where the next generation cells will be put into:

30:  (defun evolve (src dst)
31:    (let* ((w (array-dimension src 1))
32:           (h (array-dimension src 0)))
33:      (loop for j below h
34:         do (loop for i below w
35:               do (setf (aref dst j i)
36:                        (funcall 'rules (aref src j i) (neighbours src i j)))))
37:      dst))

Easy, one parameter more, one line less. Now simulate will need two brains and will alternate them at each call to evolve:

38:  (defun simulate (steps a b)
39:    (loop repeat steps
40:       do (psetf a (evolve a b)
41:                 b a)
42:       finally (return a)))

benchmark also must be updated to pass brain (of proper size) to simulate:

43:  (defun benchmark ()
44:    (format *trace-output* "Benchmarking on ~A ~A~%"
45:            (lisp-implementation-type)
46:            (lisp-implementation-version))
47:    ;; Warmup.
48:    (simulate 10000 (make-initialised-brain 16 16))
49:    (loop
50:       for (w h i) in '((32    32  32768)
51:                        (64    64  8192)
52:                        (128  128  2048)
53:                        (256  256  512)
54:                        (512  512  128)
55:                        (1024 1024 32)
56:                        (2048 2048 8)
57:                        (4096 4096 2))
58:       do #+ccl (gc)
59:          #+sbcl (gc :full t)
60:          (let ((initial (make-initialised-brain w h))
61:                (spare (make-brain w h)))
62:            (format *trace-output* "*** ~Dx~D ~D iteration~:P ***~%" w h i)
63:            (time (simulate i initial spare))
64:            (finish-output *trace-output*)))
65:    (values))

Here, take a close look at the graphs now:

Notice anything interesting? Right, the performance has not improved noticeably. It has actually decreased slightly for SBCL and 64-bit Clozure CL. What's the lesson? Garbage Collector will make your life easier (if you don't deliberately spam it).

Dimension warp

The last exercise today will be to test whether it is worth the trouble of implementing 2-dimensional arrays on top of 1-dimensional arrays manually. This means doing the index calculations manually. I'll branch this experiment off the numerical branch².

Let's start with the easy things – creating the arrays:

66:  (defun make-brain (w h)
67:    (make-array (* w h) :element-type '(integer 0 2)))

All that has to be changed in make-initialised-brain is to remove the extra dimension (because we only change the first row, and first row in our array are first w cells):

68:  (defun make-initialised-brain (w h)
69:    (let ((cells (make-brain w h))
70:          (mid (floor w 2)))
71:      (setf (aref cells mid) 1)
72:      (setf (aref cells (1+ mid)) 1)
73:      cells))

The main function that has to be changed is neighbours. The function is very simple, although a bit lengthy (comparing to other incorrect versions), but I like to keep the intent of the code close to how I'd do things myself if I were pretending to be a computer:

74:  (defun neighbours (cells i w h)
75:    (let ((l (length cells))
76:          (mx (1- w))
77:          (my (1- h)))
78:      (multiple-value-bind (y x)
79:          (truncate i w)
80:        (flet ((up (i) (if (zerop y) (- (+ i l) w) (- i w)))
81:               (dn (i) (if (= y  my) (- (+ i w) l) (+ i w)))
82:               (lt (i) (if (zerop x) (1- (+ i w))  (1- i)))
83:               (rt (i) (if (= x  mx) (1+ (- i w))  (1+ i))))
84:          (let* ((u (up i))
85:                 (d (dn i))
86:                 (l (lt i))
87:                 (r (rt i))
88:                 (ul (lt u))
89:                 (ur (rt u))
90:                 (dl (lt d))
91:                 (dr (rt d)))
92:            (mapcar (lambda (i) (aref cells i))
93:                    (list ul u ur l r dl d dr)))))))

cells is the 1-dimensional array that is used to simulate the 2-dimensional array. i is the index of a cell for which the neighbours should be calculated. And since cells is a 1-dimensional array and does not have any information about what dimensions are stored in it, w and h specify the dimensions of 2-dimensional array. (If it is not clear to you why these dimensions must be specified think about how many different 2-dimensional arrays can be stored in 24-element 1-dimensional array).

I have written this function in a way that allows me to think about the 1D array as a 2D array. So from i and w the x and y coordinates are calculated. These are used to detect if the cell under question is on the top (0) row or leftmost (0) column. For bottom row and rightmost column I have my and mx, respectively.

Then come four local utility functions to determine an index that is up (up), down (dn), left (lt) or right (rt) of a given cell. The logic is the same as in the 2D array version.

Next I calculate indices for up, down, left and right cells using the above mentioned functions. The cells on diagonals are calculated relative to these. For instance, up-left cell can be obtained by going up from left cell or going left from up cell.

In the end the obtained indices are used to get cell values from cells array. And that's all there is to it.

Now evolve also can be simplified a bit since only one dimension has to be traversed. However, the brain dimensions must be specified explicitly since the dimensions of the brain cannot be obtained from the 1D array.

 94:  (defun evolve (src w h)
 95:    (let* ((l (length src))
 96:           (dst (make-brain w h)))
 97:      (loop for i below l
 98:         do (setf (aref dst i)
 99:                  (funcall 'rules (aref src i) (neighbours src i w h))))
100:      dst))

The dimensions must be dragged through simulate function, too:

101:  (defun simulate (steps initial w h)
102:    (loop with brain = initial
103:       repeat steps
104:       do (setf brain (funcall 'evolve brain w h))
105:       finally (return brain)))

The code gets ever uglier. Another way to solve this problem would be to create a container object (like a structure or class) to hold the cells array and the two dimensions. But that is really not the point of this exercise – I actually expect the gains from this change to be so minimal (if any!) that it is not worth the trouble at all.

OK, last change – brain dimensions must be passed to simulate:

106:  (defun benchmark ()
107:    (format *trace-output* "Benchmarking on ~A ~A~%"
108:            (lisp-implementation-type)
109:            (lisp-implementation-version))
110:    ;; Warmup.
111:    (simulate 10000 (make-initialised-brain 16 16) 16 16)
112:    (loop
113:       for (w h i) in '((32    32  32768)
114:                        (64    64  8192)
115:                        (128  128  2048)
116:                        (256  256  512)
117:                        (512  512  128)
118:                        (1024 1024 32)
119:                        (2048 2048 8)
120:                        (4096 4096 2))
121:       do #+ccl (gc)
122:          #+sbcl (gc :full t)
123:          (let ((initial (make-initialised-brain w h)))
124:            (format *trace-output* "*** ~Dx~D ~D iteration~:P ***~%" w h i)
125:            (time (simulate i initial w h))
126:            (finish-output *trace-output*)))
127:    (values))

So, let's see what do we get:

Oh man was I wrong! Why would the difference be so big? A profiler would come in handy, but before that I'd like to do another experiment because I have a feeling about the source of slowness: there are no declarations and generic array access code for 2D arrays might be more complex than for 1D arrays. Or maybe I should give profiler a shot?

Looking into future

The problem is that performance tweaking this brute-force solution seems a bit pointless. The point of these articles was to show how I would implement the Brian's Brain in Common Lisp (and do some apples-to-oranges comparisons to spice things up). A really interesting thing to try out would be to implement Hashlife, which promises an astronomic speed explosion, and see how it works with Brian's Brain. And I suspect that generating "6 octillion generations in 30 seconds" would not quite work out because Brian's Brain is a lot more chaotic (I have not noticed a single oscillator while watching my animations) – the presence of "dead" cells force a lot of movement (and expansion). I noticed quite a few spaceships, though.

I also installed Golly – a wonderful application to play with cellular automata. It has the Hashlife algorithm implemented, and it really does wonders. However it did not work out for Brian's Brain (starting with 2 live cells as I've been doing all this time) because:

In 30 seconds it does not even get to generation 2000.
The pattern is ever-expanding and I could not find any way to limit the world size (i.e., I think the torus-world is not implemented in Golly).

So I'd have to implement Hashlife myself. Except that I'm hesitant to hand out $30 for the original B. Gosper's paper, but might as well do it in the future (since as an extra bonus the original implementation was done on Lisp Machine).

So, I think I'll do one more iteration with the brute-force method (with profiling and declarations). And then we'll see what will happen.

Data files

Sources:

numeric.lisp
destructive.lisp
1d.lisp

Benchmarking transcripts:

bm-num-sbcl.txt
bm-num-ccl32.txt
bm-num-ccl64.txt
bm-des-sbcl.txt
bm-des-ccl32.txt
bm-des-ccl64.txt
bm-1d-sbcl.txt
bm-1d-ccl32.txt
bm-1d-ccl64.txt

Updates

2009-11-01: Please don't send me any more copies of B. Gosper's paper :)

Footnotes

¹ Alan Perlis, Epigrams in Programming, "55. A LISP programmer knows the value of everything, but the cost of nothing."

² Yes, I'm using git, and branching like crazy, even in this toy project.

2009-10-26

Brian's Brain on Common Lisp

The real thing

If you are reading this you might have already read the previous entry and are familiar with the problem domain. And today I'm going to walk you through implementing Brian's Brain in Common Lisp. First, the straight-forward implementation with brain represented as a 2-dimensional array (that's what it is, right?), with just enough code to get it running. There are no functions to abstract away the brain implementation details or cell representation or anything else.

First, a function to create a brain:

1:  (defun make-brain (w h)
2:    (make-array (list h w) :initial-element :off))

And another function that will make us an initialised brain (like the one in Clojure version):

1:  (defun make-initialised-brain (w h)
2:    (let ((cells (make-brain w h))
3:          (mid (floor w 2)))
4:      (setf (aref cells 0 mid) :on)
5:      (setf (aref cells 0 (1+ mid)) :on)
6:      cells))

Why are the dimensions to make-array passed as (h w) and not (w h) you might ask? Because I like to see that my functions work as soon as I write them. Let's see how it works:

CL-USER> (make-initialised-brain 7 5)
#2A((:OFF :OFF :OFF :ON  :ON  :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF :OFF :OFF))

What would happen if we had them in the opposite order:

CL-USER> (make-initialised-brain 5 7)
#2A((:OFF :OFF :ON  :ON  :OFF)
    (:OFF :OFF :OFF :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF))

Incidentally, this is how Lau's version has them (if you have followed the examples carefully enough). There is no real difference having them either way, only a matter of convenience: if I look at the brain in REPL I want to see the same thing I'd see in animated graphical output.

The rules are independent of brain representation as long as we can provide the neighbouring cells as a parameter:

1:  (defun rules (state neighbours)
2:    (case state
3:      (:on    :dying)
4:      (:dying :off)
5:      (t (if (= 2 (count :on neighbours)) :on :off))))

How do we find the neighbours of a given cell? Easy, like this:

 1:  (defun neighbours (cells x y)
 2:    (let* ((mx (1- (array-dimension cells 1)))
 3:           (my (1- (array-dimension cells 0)))
 4:           (l (if (zerop x) mx (1- x)))
 5:           (r (if (= x mx) 0 (1+ x)))
 6:           (u (if (zerop y) my (1- y)))
 7:           (d (if (= y my) 0 (1+ y))))
 8:      (mapcar (lambda (x y)
 9:                (aref cells y x))
10:              (list l x r l r l x r)
11:              (list u u u y y d d d))))

What happens here should be pretty obvious, but I'll explain a bit anyway since this is a one-way communication channel. mx and my are maximal values for x and y coordinates. Left of the cell (l) is current x coordinate minus 1, unless we're on the leftmost column (0), in which case we get mx. Similarly for right cell, except we look if we're on the rightmost column (mx), and wrap to 0 if we are. Similarly for y axis. In short, referencing a cell off the edge gets us a cell on the opposite side.

Then for each pair of coordinates around our cell we get the value from the cells array. These pairs are given by two lists: one fore x coordinates and one for y coordinates. Function mapcar goes over both lists simultaneously and applies given function to each successive pair of items from both lists.

Also note how indices are passed to aref, with y and x in unnatural positions. This is for the reasons explained above – the rows are first dimension, and columns the second. But neighbours function expects them in the natural order, leaving the implementation details out of the way.

Let's check if our neighbours function works as expected:

CL-USER> (neighbours (make-initialised-brain 7 5) 3 4)
(:OFF :OFF :OFF :OFF :OFF :OFF :ON :ON)

The resulting list is cell values for, respectively, left-up, up, right-up, left, right, left-down, down and right-down cells from the specified x, y coordinate.

What's left? Evolution. The function which will create next state of a brain:

1:  (defun evolve (src)
2:    (let* ((w (array-dimension src 1))
3:           (h (array-dimension src 0))
4:           (dst (make-brain w h)))
5:      (loop for j below h
6:         do (loop for i below w
7:               do (setf (aref dst j i)
8:                        (funcall 'rules (aref src j i) (neighbours src i j)))))
9:      dst))

Ordinary loop over rows and columns, setting values in newly created brain by applying the rules to a cell and its neighbours in the current brain. We're ready to play now:

CL-USER> (evolve (make-initialised-brain 7 5))
#2A((:OFF :OFF :OFF :DYING :DYING :OFF :OFF)
    (:OFF :OFF :OFF :ON    :ON    :OFF :OFF)
    (:OFF :OFF :OFF :OFF   :OFF   :OFF :OFF)
    (:OFF :OFF :OFF :OFF   :OFF   :OFF :OFF)
    (:OFF :OFF :OFF :ON    :ON    :OFF :OFF))
CL-USER> (evolve *)
#2A((:OFF :OFF :ON  :OFF   :OFF   :ON  :OFF)
    (:OFF :OFF :OFF :DYING :DYING :OFF :OFF)
    (:OFF :OFF :OFF :ON    :ON    :OFF :OFF)
    (:OFF :OFF :OFF :ON    :ON    :OFF :OFF)
    (:OFF :OFF :OFF :DYING :DYING :OFF :OFF))

Using numbers for cell values would be much better visually, but I'm staying close to the Clojure version (for now).

Getting ready for blastoff

Almost ready to do some timing. All we need are the simulate and benchmark functions:

 1:  (defun simulate (steps initial)
 2:    (loop repeat steps
 3:       for brain = initial then (funcall 'evolve brain)
 4:       finally (return brain)))
 5:  
 6:  (defun benchmark ()
 7:    (format *trace-output* "Benchmarking on ~A ~A~%"
 8:            (lisp-implementation-type)
 9:            (lisp-implementation-version))
10:    ;; Warmup.
11:    (simulate 10000 (make-initialised-brain 16 16))
12:    (loop
13:       for (w h i) in '((32    32  32768)
14:                        (64    64  8192)
15:                        (128  128  2048)
16:                        (256  256  512)
17:                        (512  512  128)
18:                        (1024 1024 32)
19:                        (2048 2048 8)
20:                        (4096 4096 2))
21:       do (let ((initial (make-initialised-brain w h)))
22:            (format *trace-output* "*** ~Dx~D ~D iteration~:P ***~%" w h i)
23:            (time (simulate i initial))
24:            (finish-output *trace-output*)))
25:    (values))

Notice that there is not a single type annotation in this code. Running it on my laptop¹:

CL-USER> (benchmark)
Benchmarking on SBCL 1.0.31.32
*** 32x32 32768 iterations ***
Evaluation took:
  34.782 seconds of real time
  34.064263 seconds of total run time (33.060215 user, 1.004048 system)
  [ Run times consist of 7.670 seconds GC time, and 26.395 seconds non-GC time. ]
  97.94% CPU
  96,906,498,234 processor cycles
  14,769,770,512 bytes consed

*** 64x64 8192 iterations ***
Evaluation took:
  33.512 seconds of real time
  32.776229 seconds of total run time (31.254474 user, 1.521755 system)
  [ Run times consist of 5.903 seconds GC time, and 26.874 seconds non-GC time. ]
  97.80% CPU
  93,366,482,790 processor cycles
  14,762,592,768 bytes consed
...
snipped

Running from the terminal would look something like this:

$ sbcl --noinform --disable-debugger --load simple.fasl --eval "(benchmark)" --eval "(quit)"

This time we have more information to display in the graphs: total run time and GC time. And just look at the numbers:

Interesting. Very interesting, indeed. What we see here is that the time to run one iteration (blue bars) is increasing a bit first and then declines quite sharply. GC time (green), on the other hand, increases quite sharply starting from 256x256 simulation. Can you come up with an explanation?

I have one. Smaller arrays are processed quite fast, so are short lived, and become garbage before GC kicks in. The bigger arrays are processed longer, so they are alive when GC starts. And GC has to walk all array elements each time, since arrays are not specialised (that is, can contain anything). I told you using numbers would be better (for different reasons, though)!

I also did another run of this same benchmark (transcripts available below), and the numbers were the same up to sub-second precision.

Let's do this same thing with a different Common Lisp implementation, which in my case will be Clozure CL. The nice thing about this implementation that its compiler is very snappy. Running from shell looks like this:

$ ccl -n -Q -l simple.dx32fsl -e "(benchmark)"

One thing to note is that arrays of size 4096x4096 exceed the limit of array length (24-bit number) of 32-bit Clozure CL, which, coincidentally, is just 1 short of what we need:

CL-USER> array-dimension-limit
16777216
CL-USER> (integer-length array-dimension-limit)
25
CL-USER> (integer-length (1- array-dimension-limit))
24
CL-USER> (* 4096 4096)
16777216

But otherwise there is less variation in 32-bit Clozure CL. Let's look at 64-bit version:

Nice, behaviour similar to SBCL, except that GC times don't grow as fast.

Missing things

I hear somebody in the corner mumbling something about animation and graphics. Oh, right. Missed that one. But it so happens that the very thing that made me start playing with Brian's Brain is that I installed cl-opengl. And guess what? It worked right out of the box. On all the Common Lisp implementation I have on my computer. So I had start playing with it. And the rest, as they say, is history.

It works like this:

CL-USER> (asdf:operate 'asdf:load-op :cl-glut-examples)
; System loading output snipped...
NIL
CL-USER> (cl-glut-examples:run-examples)

And all the examples pop up, many of them animating. So I peek at some examples to see how to set up a window. Easy as a pie:

 1:  (defclass bb (glut:window)
 2:    ((cells :accessor cells-of :initarg :cells))
 3:    (:default-initargs
 4:     :title "Brian's Brain in CL"
 5:     :mode '(:double :rgb)))
 6:  
 7:  (defmethod glut:display-window :before ((w bb))
 8:    (gl:clear-color 0 0 0 0)
 9:    (gl:matrix-mode :projection)
10:    (gl:load-identity)
11:    (let ((cells (cells-of w)))
12:     (gl:ortho 0 (array-dimension cells 1)  0 (array-dimension cells 0) -1 1)))

Then we need a function to render a single cell at specified position. Drawing squares is easy enough:

 1:  (defun render-cell (x y cell)
 2:    (flet ((draw-cell (x y)
 3:             (gl:with-pushed-matrix
 4:                 (gl:translate x y 0)
 5:               (gl:with-primitive :polygon
 6:                 (gl:vertex 0.1 0.1 0)
 7:                 (gl:vertex 0.9 0.1 0)
 8:                 (gl:vertex 0.9 0.9 0)
 9:                 (gl:vertex 0.1 0.9 0)))))
10:      (case cell
11:        (:on (gl:color 1 1 1)
12:             (draw-cell x y))
13:        (:dying (gl:color 0.5 0.5 0.5)
14:                (draw-cell x y)))))

All that's left are some callbacks to draw the whole window and run the animation. The following two methods will do just fine:

 1:  (defmethod glut:display ((w bb))
 2:    (gl:clear :color-buffer)
 3:    (let* ((cells (cells-of w))
 4:           (w (array-dimension cells 1))
 5:           (h (array-dimension cells 0)))
 6:      (loop for j below h
 7:         do (loop for i below w
 8:               do (render-cell i j (aref cells j i)))))
 9:    (glut:swap-buffers))
10:  
11:  
12:  (defmethod glut:idle ((w bb))
13:    (setf (cells-of w) (evolve (cells-of w)))
14:    (glut:post-redisplay))

Everything is ready now. An animated Brian's Brain can be created like this:

(glut:display-window (make-instance 'bb
                                    :cells (make-initialised-brain 128 128)
                                    :width 512
                                    :height 512))

But since I don't like to put things on the toplevel which run when just loading a file, I'll put the code into a function:

1:  (defun run (w h ww wh)
2:    (glut:display-window
3:     (make-instance 'bb
4:                    :cells (make-initialised-brain w h)
5:                    :width ww
6:                    :height wh)))

Feel free to start a never-ending Brian's Brain simulation:

CL-USER> (run 160 100 320 200)

Finishing touches

To make this all easily loadable I'll put all code in its own package and create a system definition (which nowadays means a ASDF) file. Refer to Xach's intro for a nice description of why and how to do this.

Package definition is very simple:

1:  (defpackage :brians-brain-1
2:    (:use :common-lisp)
3:    (:export #:run))

And the system definition is nothing complicated, either:

1:  (asdf:defsystem :brians-brain-1
2:      :version "1.0"
3:      :author "Jānis Džeriņš"
4:      :license "Send me money if you find this stuff useful."
5:      :depends-on (cl-opengl cl-glut)
6:      :components ((:file "package")
7:                   (:file "simple" :depends-on ("package"))
8:                   (:file "display" :depends-on ("package" "simple"))))

At this point we can get to a running animated Brian's Brain from the shell prompt:

$ sbcl --noinform --disable-debugger \
       --eval "(asdf:operate 'asdf:load-op :brians-brain-1)" \
       --eval "(brians-brain-1:run 160 100 320 200)" \
       --eval "(quit)"

Looking forward

Next time I'm going to play with different brain representations.

Data files

Code:

brians-brain-1.asd
package.lisp
simple.lisp
display.lisp

Benchmarking transcripts:

bm-sbcl-mbp.txt
bm-ccl-mbp.txt
bm-ccl64-mbp.txt

Updates

2009-10-28: Noticed a bug in simulate function which runs the simulation for one step less than asked. The corrected version looks like this:

1:  (defun simulate (steps initial)
2:    (loop with brain = initial
3:       repeat steps
4:       do (setf brain (funcall 'evolve brain))
5:       finally (return brain)))

Also now invoking GC before each simulation so that garbage from previous simulation has less chance to influence the next:

 6:  (defun benchmark ()
 7:    (format *trace-output* "Benchmarking on ~A ~A~%"
 8:            (lisp-implementation-type)
 9:            (lisp-implementation-version))
10:    ;; Warmup.
11:    (simulate 10000 (make-initialised-brain 16 16))
12:    (loop
13:       for (w h i) in '((32    32  32768)
14:                        (64    64  8192)
15:                        (128  128  2048)
16:                        (256  256  512)
17:                        (512  512  128)
18:                        (1024 1024 32)
19:                        (2048 2048 8)
20:                        (4096 4096 2))
21:       do #+ccl (gc)
22:          #+sbcl (gc :full t)
23:          (let ((initial (make-initialised-brain w h)))
24:           (format *trace-output* "*** ~Dx~D ~D iteration~:P ***~%" w h i)
25:           (time (simulate i initial))
26:           (finish-output *trace-output*)))
27:    (values))

The graphs are generally very similar, except the rightmost columns don't look fishy:

New transcripts:

bm-sbcl-2.txt
bm-ccl32-2.txt
bm-ccl64-2.txt

Footnotes

¹ See the previous blog entry for the specs.

Brian's Brain on Clojure

Here's your brain, and here's your brain on Clojure

A couple weeks ago I came across a blog entry by Lau B. Jensen about Brian's Brain implementation in Clojure. And so it happened that briefly before this I have decided to devote some of my time to programming in Common Lisp. What a lucky coincidence – this looks like a perfect job to swap Common Lisp back into the operational area of my brain.

You see, implementing Brian's Brain is easy. So easy that it is not worthy of a blog post except if one spices it up with some orthogonal material. The reference blog entry mentioned above does just that: implement something simple to tell everybody how cool some features of Clojure are. (And then tell that those features come at a drastic performance price.) I smell a recipe for a successful first blog entry: controversy! And the best harmless controversy one can get on the internet is apples to oranges comparisons.

So, as you might expect I'm going to compare the Clojure and Common Lisp implementations¹ of Brian's Brain. And today I'll give you some numbers about the Clojure implementations (both original and transient). And to make apples and oranges more similar I'm going to leave only the cell simulation code, leaving graphics and rendering out of the discussion. For a while.

Let's measure something

The simulation and measuring is going to be simple: I'll measure the time it takes to simulate brains of various sizes for a set number of iterations. Since it is apparent that the time to advance the state of brain one step is proportional to the size of the brain, it is reasonable to expect the time spent to increase proportionally to the number of cells in the brain. And since all of the brain's cells are the same, there is nothing interesting to expect in the resulting numbers.

The only interesting thing I'd expect from varying brain sizes is the impact of runtime's memory management and maybe cache locality effects. Looking at these things is also interesting because the brain representation in both Lau's implementations are different: [lazy] lists in the original version and a vector in the transient version. Therefore I'll run the code for brains of different sizes, but adjusting the number of simulation steps so that the number of cells processed stays the same. This way I'd expect graphs be more-or-less straight horizontal.

But first

The code to run simulation, nothing complicated:

1:  (defn simulate [steps initial f]
2:    (loop [n steps, brain initial]
3:      (if (< 0 n)
4:        (recur (dec n) (step brain f))
5:        brain)))

This is a typical Clojure loop that has two "state" variables, which are updated on each iteration (by means of recur invocation): n is decreased, and brain is "transformed" using the step function. That is, on each iteration of the loop new values for these two variables are used; no variables (or values) are really modified.

The value of f is either map or pmap. This way it is easy to run a simulation in serial or parallel fasion by just passing a different function to the simulation. The original step function accepts only a single parameter:

1:  (defn step [board]
2:    (doall
3:     (map (fn [window]
4:            (apply #(doall (apply map rules %&))
5:                   (doall (map torus-window window))))
6:          (torus-window board))))

In his article Lau suggests that to get the parallel version of this function only one function call must be changed. Except that after the change the function must be recompiled. Contrast this to my changed version:

1:  (defn step [board f]
2:    (doall
3:     (f (fn [window]
4:          (apply #(doall (apply map rules %&))
5:                 (doall (map torus-window window))))
6:        (torus-window board))))

Now nothing must be recompiled when we decide to change this function from serial to parallel version.

Another thing I had to change is how the initial brain is created. Lau uses global variables quite a lot (and we'll see how it hurts on many occassions ahead). Besides not being a nice thing in general, it also is very un-lispy (and un-functional as well, but I'll have an article on this some other time). There is nothing in Lau's code that requires global variables, so I re-wrote his code to get rid of them.

This code:

1:  (def dim-board [90 90])
2:  
3:  (def board
4:       (for [x (range (dim-board 0))]
5:         (for [y (range (dim-board 1))]
6:           [(if (< 50 (rand-int 100)) :on :off) x y])))

now has become this:

1:  (defn make-random-board [w h]
2:    (for [x (range w)]
3:      (for [y (range h)]
4:        [(if (< 50 (rand-int 100)) :on :off) x y])))

Now we're ready to experiment with various combinations of brain size and iteration counts in the REPL (manually-reformatted by me for aesthetic reasons):

user> (simulate 0 (make-random-board 4 3) map)
(([:on  0 0] [:on  0 1] [:on  0 2])
 ([:on  1 0] [:on  1 1] [:off 1 2])
 ([:on  2 0] [:off 2 1] [:off 2 2])
 ([:off 3 0] [:off 3 1] [:on  3 2]))
user> (simulate 1 (make-random-board 4 3) pmap)
(([:off   0 0] [:dying 0 1] [:off   0 2])
 ([:off   1 0] [:dying 1 1] [:dying 1 2])
 ([:dying 2 0] [:off   2 1] [:off   2 2])
 ([:dying 3 0] [:dying 3 1] [:off   3 2]))
user> (simulate 2 (make-random-board 4 3) map)
(([:off 0 0] [:off 0 1] [:off 0 2])
 ([:off 1 0] [:off 1 1] [:off 1 2])
 ([:off 2 0] [:off 2 1] [:off 2 2])
 ([:off 3 0] [:off 3 1] [:off 3 2]))

One interesting observation: the board after 2 iterations looks broken (all cells are dead). Is the code somehow broken? Things like these are easy to find out interactively. But only if it's easy to specify various parameters. And that's where the global variables make life hard. In Lau's implementation, for each of the little REPL tests I'd have to go and redefine two global variables (in order). In my version one can play to their heart content without leaving the REPL.

So, what exactly is going on here? Nothing special: having half the brain's cells alive at the start makes this little brain overpopulated and by the rules of the world all cells die. And since I'm not going to manually perform pretty-printing or write a function to do it for me, I'll show another advantage of avoiding unnecessary global variables.

You see, for doing reliable experiments (like this one I'm doing here) they must be repeatable. So using randomly-initialised brains is not going to work (unless a specific pseudo-random-number-generator is seeded with some known seed, but that's besides the point). And it so happens that the other Lau's implementation uses a specific initial brain setup: two adjacent cells in the first row are initially alive. The code to create such a brain is:

1:  (defn make-special-board [w h]
2:    (let [mid (/ w 2)]
3:     (for [x (range w)]
4:       (for [y (range h)]
5:         [(if (and (= y 0) (<= mid x (inc mid))) :on :off) x y]))))

Now, let's play a bit in the REPL:

user> (simulate 0 (make-special-board 4 3) map)
(([:off 0 0] [:off 0 1] [:off 0 2])
 ([:off 1 0] [:off 1 1] [:off 1 2])
 ([:on  2 0] [:off 2 1] [:off 2 2])
 ([:on  3 0] [:off 3 1] [:off 3 2]))
user> (simulate 1 (make-special-board 4 3) map)
(([:off   0 0] [:off 0 1] [:off 0 2])
 ([:off   1 0] [:off 1 1] [:off 1 2])
 ([:dying 2 0] [:on  2 1] [:on  2 2])
 ([:dying 3 0] [:on  3 1] [:on  3 2]))
user> (simulate 2 (make-special-board 4 3) map)
(([:on  0 0] [:on    0 1] [:on    0 2])
 ([:on  1 0] [:on    1 1] [:on    1 2])
 ([:off 2 0] [:dying 2 1] [:dying 2 2])
 ([:off 3 0] [:dying 3 1] [:dying 3 2]))
user> (= *1 (simulate 2 (make-special-board 4 3) pmap))
true
user> (= (simulate 10 (make-special-board 32 32) map)
         (simulate 10 (make-special-board 32 32) pmap))
true

Great, not only can we experiment with different brain dimensions but also with different contents. And we can easily check if parallel and sequential versions work the same. Now try doing this in the Lau's original version!

Almost there

Now we are almost ready to measure things. Here's the function I'll use:

 1:  (defn benchmark []
 2:    ;; Warmup
 3:    (simulate 7500 (make-special-board 16 16) map)
 4:    (simulate 7500 (make-special-board 16 16) pmap)
 5:    (println (clojure-version))
 6:    (doseq [[w h i] [[32    32  32768]
 7:                     [64    64  8192]
 8:                     [128  128  2048]
 9:                     [256  256  512]
10:                     [512  512  128]]]
11:      (let [initial (doall (make-special-board w h))]
12:        (print (str "S " w "x" h ", " i " iteration(s): "))
13:        (time (simulate i initial map))
14:        (flush)
15:        (print (str "P " w "x" h ", " i " iteration(s): "))
16:        (time (simulate i initial pmap))
17:        (flush))))

Nothing special, repeat serial and parallel versions for given brain dimensions and iteration number. The results of running this on my MacBook Pro² (with the warmup timings removed):

Running this benchmark from shell can be done like this:

$ java -server -jar clojure.jar -i ca.clj -e "(benchmark)"

After a while of intensive contributing to the global warming we get some output:

$ java -server -jar clojure.jar -i ca.clj -e "(benchmark)"
1.1.0-alpha-SNAPSHOT
S 32x32, 32768 iteration(s): "Elapsed time: 415506.172 msecs"
P 32x32, 32768 iteration(s): "Elapsed time: 289353.533 msecs"
S 64x64, 8192 iteration(s): "Elapsed time: 425244.632 msecs"
P 64x64, 8192 iteration(s): "Elapsed time: 290253.746 msecs"
S 128x128, 2048 iteration(s): "Elapsed time: 421812.085 msecs"
P 128x128, 2048 iteration(s): "Elapsed time: 287244.863 msecs"
S 256x256, 512 iteration(s): "Elapsed time: 422331.897 msecs"
P 256x256, 512 iteration(s): "Elapsed time: 289623.448 msecs"
S 512x512, 128 iteration(s): "Elapsed time: 440064.214 msecs"
P 512x512, 128 iteration(s): "Elapsed time: 286282.461 msecs"

One thing you might have noticed is that the number of iterations increases as the brain size decreases. This is to have the number of cells processed in each simulation constant and the resulting numbers easy to compare.

I also run all the "benchmarks" on two JVMs (server VMs at that): OSX default 1.5 (32-bit) and 1.6 (64-bit, there's only server VM). The switching is done using the Java Preferences utility:

$ java -version
java version "1.5.0_20"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_20-b02-315)
Java HotSpot(TM) Client VM (build 1.5.0_20-141, mixed mode, sharing)

$ java -server -version
java version "1.5.0_20"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_20-b02-315)
Java HotSpot(TM) Server VM (build 1.5.0_20-141, mixed mode)

$ java -version
java version "1.6.0_15"
Java(TM) SE Runtime Environment (build 1.6.0_15-b03-226)
Java HotSpot(TM) 64-Bit Server VM (build 14.1-b02-92, mixed mode)

Here are the numbers for the original Brian's Brain implementation:

Looks like the parallel version is about 45% (JVM1.5) to 60% (JVM1.6) faster than the serial.

The times for 512x512 are not included for JVM1.6 because the serial version was running for 2416341.265ms (that's about 40 minutes!), which is way out of scale for this graph. I guess it is because of tight memory situation because the next (parallel) benchmark bailed out with out of memory exception.

Going transient

Lau's code for transient version is much worse. More global variables. And functions depending on them. What a mess… Just look at it:

 1:  (def dim-board   [130  130])
 2:  (def dim-screen  [600  600])
 3:  (def dim-scale   (vec (map / dim-screen dim-board)))
 4:  
 5:  (def board-size (apply * dim-board))
 6:  (def board      (-> (vec (repeat board-size :off))
 7:                      (assoc (/ (dim-board 0) 2)       :on)
 8:                      (assoc (inc (/ (dim-board 0) 2)) :on)))
 9:  
10:  (def cords (for [y (range (dim-board 0)) x (range (dim-board 1))] [x y]))
11:  
12:  (defn torus-coordinate
13:    [idx]
14:    (cond (neg? idx)          (+ idx board-size)
15:          (>= idx board-size) (- idx board-size)
16:      :else idx))
17:  
18:  (def above     (- (dim-board 0)))
19:  (def below     (dim-board 0))
20:  (def neighbors [-1 1 (dec above) above (inc above) (dec below) below (inc below)])
21:  
22:  (defn on-neighbors [i board]
23:    (count
24:     (filter #(= :on (nth board %))
25:             (map #(torus-coordinate (+ i %)) neighbors))))
26:  
27:  (defn step [board]
28:    (loop [i 0 next-state (transient board)]
29:      (if (< i board-size)
30:        (let [self         (nth board i)]
31:          (recur (inc i)
32:                 (assoc! next-state i (cond
33:                                        (= :on    self)                :dying
34:                                        (= :dying self)                :off
35:                                        (= 2 (on-neighbors i board))   :on
36:                                        :else                          :off))))
37:        (persistent! next-state))))

9 global variables in 37 lines of code! Bad Lau! Bad! Also, the results of torus-coordinate, on-neighbors and step (in short – all of them) depend on things besides the parameters. Very un-functional indeed.

Ok, let's put horrors aside and do what we can to get this mess under control. My only goal is to get the simulation and benchmarking functions working, so first, here they are:

 1:  (defn simulate [steps initial]
 2:    (loop [n steps, board initial]
 3:      (if (< 0 n)
 4:        (recur (dec n) (step board))
 5:        board)))
 6:  
 7:  (defn benchmark []
 8:    ;; Warmup
 9:    (simulate 10000 (make-board 16 16))
10:    (println (clojure-version))
11:    (doseq [[w h i] [[32    32  32768]
12:                     [64    64  8192]
13:                     [128  128  2048]
14:                     [256  256  512]
15:                     [512  512  128]
16:                     [1024 1024 32]]]
17:      (let [initial (make-board w h)]
18:        (print (str "S " w "x" h ", " i " iteration(s): "))
19:        (time (simulate i initial))
20:        (flush))))

Notice that the transient version cannot be parallelized by simply changing a single function, so there is no parallel version at all.

Here's the changed code that allows me to run the benchmark:

 1:  (defn make-board [w h]
 2:    (let [above (- w), below w]
 3:      (def neighbors [-1 1 (dec above) above (inc above) (dec below) below (inc below)]))
 4:  
 5:    (-> (vec (repeat (* w h) :off))
 6:        (assoc (/ w 2)       :on)
 7:        (assoc (inc (/ w 2)) :on)))
 8:  
 9:  (defn torus-coordinate [idx board]
10:    (let [board-size (count board)]
11:      (cond (neg? idx)          (+ idx board-size)
12:            (>= idx board-size) (- idx board-size)
13:            :else idx)))
14:  
15:  (defn on-neighbors [i board]
16:    (count
17:     (filter #(= :on (nth board %))
18:             (map #(torus-coordinate (+ i %) board) neighbors))))
19:  
20:  (defn step [board]
21:    (let [board-size (count board)]
22:      (loop [i 0 next-state (transient board)]
23:        (if (< i board-size)
24:          (let [self         (nth board i)]
25:            (recur (inc i)
26:                   (assoc! next-state i (cond
27:                                         (= :on    self)                :dying
28:                                         (= :dying self)                :off
29:                                         (= 2 (on-neighbors i board))   :on
30:                                         :else                          :off))))
31:          (persistent! next-state)))))

make-board is ugly, and defines global variables. Will have to live with it. torus-coordinate and step don't use global variables any more. And we're down to one global variable which on-neighbors depends on. I'll let this one remain on Lau's conscience.

Another thing that I just can't let slip is that the torus-coordinate function is JUST PLAIN WRONG. Sorry, I might have raised my voice. The problem is that for the cell in the leftmost column the cell to its left is on the opposite side of the board. Similarly for the rightmost. But torus-coordinate cheats and for the leftmost returns the cell on the rightmost column (correct) but one row above (wrong!). For the rightmost cells it returns the cell on the leftmost column, but one row below. This error would instantly disqualify the transient implementation in any half-serious apples-to-apples comparison.

It is easy to see that torus-coordinate is called for every "off" cell in a brain. And there are more "off" cells than "dying" or "on" or the sum of those, respectively (based on my observations of running the simulations graphically; empirical proof is easy, but will have to wait). Allowing this function to do less work definitely impacts its performance. I'm not going to write the correct Clojure version of this function so let's pretend everything is just fine and look at the performance:

$ java -server -jar clojure.jar -i transient-ca.clj -e "(benchmark)"
1.1.0-alpha-SNAPSHOT
S 32x32, 32768 iteration(s): "Elapsed time: 68259.648 msecs"
S 64x64, 8192 iteration(s): "Elapsed time: 69464.992 msecs"
S 128x128, 2048 iteration(s): "Elapsed time: 67646.623 msecs"
S 256x256, 512 iteration(s): "Elapsed time: 70889.975 msecs"
S 512x512, 128 iteration(s): "Elapsed time: 72941.713 msecs"
S 1024x1024, 32 iteration(s): "Elapsed time: 73866.637 msecs"

Quite an improvement I'd say, indeed. And since the benchmark finished relatively fast, I decided to run it again:

$ java -server -jar clojure.jar -i transient-ca.clj -e "(benchmark)"
1.1.0-alpha-SNAPSHOT
S 32x32, 32768 iteration(s): "Elapsed time: 90808.629 msecs"
S 64x64, 8192 iteration(s): "Elapsed time: 92661.493 msecs"
S 128x128, 2048 iteration(s): "Elapsed time: 89570.418 msecs"
S 256x256, 512 iteration(s): "Elapsed time: 92398.728 msecs"
S 512x512, 128 iteration(s): "Elapsed time: 94843.865 msecs"
S 1024x1024, 32 iteration(s): "Elapsed time: 97478.63 msecs"

Weird is the least I can say about this. So I run it one more time… And then again. All the times getting different timings (I was not doing anything else while running the benchmarks on the computer, so I don't have a slightest clue about the cause of variation). Then I gave up. Here's a graph of all four runs:

Three more runs on JVM1.6, which are quite a bit faster and closer to each other. But they run out of memory on the 1024x1024 run.

Here are the averages of all the runs together in one graph:

Around 25% speed increase in JVM1.6 if I'm reading the numbers correctly.

Wrapping things up

Now, the complete picture of original and transient implementations:

As expected the lines are quite horizontal, but not in all cases. Especially non-horizontal are the original implementation on the JVM1.6. It might be related to it being 64-bit, but I'm not really going to investigate this any further. So, as we can see, the transient version is about 4 times faster than the parallel original version, and about 6 times faster than serial version.

It is apparent that the transient version is quite a bit faster than the original. And with the original version we can see that the parallel version really does speed things up on two cores. How about 8 cores? It so happens that I have SSH access to a quite recent Mac Pro³. One notable thing about this system is that it is set to energy-saving mode, so don't be surprised by the absolute numbers, look for their relative magnitudes:

Watching the java process in top was showing around 600% CPU usage when running the parallel simulations. What surprises me is that the performance gain from more cores is not there. But surprise is not big enough for me to go investigate this any further.

Final words

This is only an introduction for what I really wanted to write about. It turned out much longer than I thought it would be. Cangratulations to everybody who got this far (reading the content, not searching for pretty pictures).

This article is in no way intended as bashing on anything. Yes, I didsagree with Lau on some software design points, but that should be only considered as that – disagreement. I have shown how I'd write some parts of his program. He will be welcome to comment on my Common Lisp implementation after my next post.

Please don't use these numbers to draw any conclusions about the performance of my computer, Clojure, Java, my or Lau's code or anything else. The numbers are as un-scientific as can be. I was intentionally not doing anyhting else while running the benchmarks on the computer, but there are all the usual desktop things running (except the screensaver which I disabled).

Data files

Original: source MacBook Pro transcript MacPro transcript
Transient: source MacBook Pro transcript MacPro transcript

Updates

2009-10-28: Apparently I have been misguided about the energy-savings mode on the MacPro. Going to the box physically and checking if the option is turned on or not, I found out (to my big surprise) that the option is gone. For those who have a Mac can compare their "Energy Saver" window to what it was like before: http://support.apple.com/kb/HT3092. So, I'm not sure whether the "reduced" mode is enabled or not.

Footnotes

¹ When I'm talking about Clojure or Common Lisp implementation, I'm talking about brain's implementation in that programming language respectively, not the implementation of language.

² 2.8GHz Intel Core 2 Duo, 6MB cache, 1.07GHz bus speed.

³ 3GHz Quad-Core Intel Xeon (2 processors, 8 cores total), 8MB cache per processor. 1.33GHz bus speed.

The beauty of grey

2009-10-29

Brian's brain, on Common Lisp, take 2

Specialised arrays

Destructive action

Dimension warp

Looking into future

Data files

Updates

Footnotes

2009-10-26

Brian's Brain on Common Lisp

The real thing

Getting ready for blastoff

Missing things

Finishing touches

Looking forward

Data files

Updates

Footnotes

Brian's Brain on Clojure

Here's your brain, and here's your brain on Clojure

Let's measure something

But first

Almost there

Going transient

Wrapping things up

Final words

Data files

Updates

Footnotes

Blog Archive

Labels

About Me