
2009-12-07

Brian's brain, on Common Lisp, take 3

This is the last installment about the straightforward implementation of Brian's Brain, in which I'm going to try to squeeze some more performance out of the brute-force implementation (you should read the first and second posts to get the full picture). The point of the article is not how fast I can get it, but rather what performance difference can be achieved using "simple" performance-tweaking measures.

As any good programmer will tell you – optimizing without profiling first is a waste of time. To profile my code I will use SBCL's statistical profiler. It is called statistical because of the way it works: it interrupts the running program at regular intervals and looks at what code is running at that moment.

To get an idea of where the time is spent in my program I'll do a single simulation from our benchmark (with an increased iteration count to get more accurate statistics):

CL-USER> (sb-sprof:with-profiling (:report :flat :loop nil)
           (simulate 4096 (make-initialised-brain 128 128) 128 128))

Number of samples:   4426
Sample interval:     0.01 seconds
Total sampling time: 44.26 seconds
Number of cycles:    0
Sampled threads:
 #<SB-THREAD:THREAD "initial thread" RUNNING {10029559D1}>

           Self        Total        Cumul
  Nr  Count     %  Count     %  Count     %    Calls  Function
------------------------------------------------------------------------
   1   1401  31.7   1401  31.7   1401  31.7        -  "foreign function sigprocmask"
   2   1000  22.6   1737  39.2   2401  54.2        -  NEIGHBOURS
   3    552  12.5   1056  23.9   2953  66.7        -  COUNT
   4    200   4.5    200   4.5   3153  71.2        -  SB-IMPL::OPTIMIZED-DATA-VECTOR-REF
   5    179   4.0    179   4.0   3332  75.3        -  SB-KERNEL:HAIRY-DATA-VECTOR-REF/CHECK-BOUNDS
   6    173   3.9    173   3.9   3505  79.2        -  SB-VM::GENERIC-+
   7    145   3.3    247   5.6   3650  82.5        -  (LAMBDA (SB-IMPL::X))
   8    143   3.2    143   3.2   3793  85.7        -  TRUNCATE
   9    128   2.9    128   2.9   3921  88.6        -  SB-KERNEL:%COERCE-CALLABLE-TO-FUN
  10    116   2.6    139   3.1   4037  91.2        -  LENGTH
  11     80   1.8     80   1.8   4117  93.0        -  EQL
  12     67   1.5   2970  67.1   4184  94.5        -  EVOLVE
  13     51   1.2     51   1.2   4235  95.7        -  (FLET LT)
  14     42   0.9     42   0.9   4277  96.6        -  (FLET RT)
  15     41   0.9     41   0.9   4318  97.6        -  SB-KERNEL:SEQUENCEP
  16     30   0.7     34   0.8   4348  98.2        -  NTHCDR
  17     27   0.6   1076  24.3   4375  98.8        -  RULES
  18     22   0.5     22   0.5   4397  99.3        -  (FLET #:CLEANUP-FUN-[CALL-WITH-SYSTEM-MUTEX/WITHOUT-GCING]146)
  19     17   0.4     17   0.4   4414  99.7        -  "foreign function mach_msg_server"
  20     11   0.2     11   0.2   4425 100.0        -  (FLET SB-IMPL::FAST-NTHCDR)
  21      1  0.02      1  0.02   4426 100.0        -  (FLET #:CLEANUP-FUN-[INVOKE-INTERRUPTION]41)
  22      0   0.0   2973  67.2   4426 100.0        -  SIMULATE
...

Ignoring the first entry related to sigprocmask (because that is surely some implementation detail related to GC, profiling itself, or something else I should not bother with), we can see our top candidates for optimization:

  • neighbours – as expected, this is the function called most frequently (for every cell at every simulation iteration).
  • count – standard Common Lisp function (called with the results of neighbours to count the number of on cells).
  • Some implementation-specific functions dealing with vectors. One of these functions contains the word "hairy", which I think means that SBCL could use some hints. Declaring the type of the cells array should hopefully get rid of these. It should also help with the calls to length seen a few rows below.
  • sb-vm::generic-+ – most probably from the neighbours function. The name implies that the generic version of the + function (addition) is used instead of a fixnum-specific one.
  • eql – might be from the call to count in the rules function, eql being the default :test function.
  • (flet lt) and (flet rt) – these are from within neighbours. I'm pretty sure they are also counted in one of the columns of the neighbours row.
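To make the generic-+ point concrete, here is a small sketch (standalone toy functions, not part of the simulation) of the kind of declarations that let SBCL open-code fixnum addition instead of calling the generic routine:

```lisp
;; Without declarations, + must be able to handle fixnums, bignums,
;; ratios, floats and complexes, so a generic routine may be called.
(defun add-generic (a b)
  (+ a b))

;; With the argument types declared, and the result promised to be a
;; fixnum via THE, the compiler can emit a plain machine addition.
(defun add-fixnum (a b)
  (declare (type fixnum a b)
           (optimize (speed 3)))
  (the fixnum (+ a b)))
```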

The optimization plan now is simple1:

  1. Declare the type of cells array.
  2. Optimize neighbours function.
  3. Sprinkle some more declarations as needed.

The first step is easy – all I have to do is add the following declaration to the code that uses the cells array, which tells the compiler that the array is one-dimensional (of unspecified length) and that its elements are of type (integer 0 2) (integers from 0 to 2):

(declare (type (simple-array (integer 0 2) (*)) cells))

OK, I might as well make it a bit more generic (although just for the sake of example; the program is too short to make this worthwhile). Two types are introduced as "aliases" (that is, upon seeing one of these the compiler will expand it to the corresponding type in the definition; much more complicated type declarations are possible):

(deftype cell () '(integer 0 2))
(deftype brain () '(simple-array cell (*)))

One important thing I should note here is that type declarations are optional in Common Lisp, and the compiler is allowed to ignore them altogether. But they still provide a structured way to document code, and all implementations I'm familiar with will use the declarations if provided.

There are two extreme cases in type-checking:

  1. Everything is checked and declarations are ignored.
  2. Declarations are trusted and no type-checking is done.

In Common Lisp there is a way to instruct the compiler how to operate: generate safe (but, due to checking, usually relatively slow) or unsafe (probably fast, due to the absence of checks) code. There are two settings (optimization qualities) that directly control this behavior: speed and safety. These settings can take values from 0 to 3 (zero meaning "unimportant", 3 meaning "very important"). The optimize declaration identifier is used to specify these qualities, like this:

(declare (optimize (speed 3) (safety 1)))

Specifying that speed is very important will also make some compilers (including SBCL) very chatty about why they cannot optimize particular parts of the code. Other implementations have specific declarations to make the compiler output such optimization notes.

Since I'll be experimenting with different optimization settings I'm going to specify them once for the whole file. It can be done with the following form at the beginning of the file:

(declaim (optimize (speed 3) (safety 1)))

Such optimization declarations can also be added on a per-function basis (or even for a specific form within a function), which is actually what is done in day-to-day programming to optimize only the speed-critical sections of code.
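For illustration, a per-function version might look like this (the function is made up; only the placement of the declaration matters):

```lisp
(defun sum-vector (v)
  ;; These qualities apply only while compiling this function,
  ;; overriding any file-wide DECLAIM.
  (declare (optimize (speed 3) (safety 1))
           (type (simple-array fixnum (*)) v))
  (loop for x across v sum x))
```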

And by the way – never, never ever run code with safety 0 (unless you know what you are doing, which in practice is never). But since this is a toy program anyway, I might as well try it to see what performance impact it has.

Here are the neighbours and evolve functions with added array declarations:

(defun neighbours (cells i w h)
  (declare (type brain cells))
  (let ((l (length cells))
        (mx (1- w))
        (my (1- h)))
    (multiple-value-bind (y x)
        (truncate i w)
      (flet ((up (i) (if (zerop y) (- (+ i l) w) (- i w)))
             (dn (i) (if (= y  my) (- (+ i w) l) (+ i w)))
             (lt (i) (if (zerop x) (1- (+ i w))  (1- i)))
             (rt (i) (if (= x  mx) (1+ (- i w))  (1+ i))))
        (let* ((u (up i))
               (d (dn i))
               (l (lt i))
               (r (rt i))
               (ul (lt u))
               (ur (rt u))
               (dl (lt d))
               (dr (rt d)))
          (mapcar (lambda (i) (aref cells i))
                  (list ul u ur l r dl d dr)))))))

(defun evolve (src w h)
  (declare (type brain src))
  (let* ((l (length src))
         (dst (make-brain w h)))
    (declare (type brain dst))
    (loop for i below l
       do (setf (aref dst i)
                (rules (aref src i) (neighbours src i w h))))
    dst))

Compiling these produces a lot of complaints from the compiler. Some of the notes for the neighbours function look like this:

; in: DEFUN NEIGHBOURS
;     (ZEROP X)
; ==>
;   (= X 0)
; 
; note: unable to
;   open-code FLOAT to RATIONAL comparison
; due to type uncertainty:
;   The first argument is a REAL, not a FLOAT.
; 
; note: unable to open code because: The operands might not be the same type.

;     (+ I W)
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a REAL, not a RATIONAL.
;   The second argument is a NUMBER, not a FLOAT.
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a REAL, not a FLOAT.
;   The second argument is a NUMBER, not a RATIONAL.
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a REAL, not a SINGLE-FLOAT.
;   The second argument is a NUMBER, not a DOUBLE-FLOAT.
; 
; note: unable to
;   optimize
; due to type uncertainty:
;   The first argument is a REAL, not a DOUBLE-FLOAT.
;   The second argument is a NUMBER, not a SINGLE-FLOAT.
...

Profiling the code now shows that I've gotten rid of the two vector-ref functions. But there is still a bit of work to do on the neighbours function, judging by all the noise the compiler generates. First I'll declare the types of the parameters (and show only the relevant forms here):

(deftype array-index () `(integer 0 ,array-dimension-limit))

(declare (type brain cells)
         (type array-index i)
         (type fixnum w h))

A note on the array-index type: the actual integer range depends on the value of array-dimension-limit, which is implementation-dependent (but usually lies between 2^24 and most-positive-fixnum).

Now compiling the neighbours function emits quite a short list of compiler notes:

; in: DEFUN NEIGHBOURS
;     (LIST UL U UR L R DL D DR)
; 
; note: doing signed word to integer coercion (cost 20)
; 
; note: doing signed word to integer coercion (cost 20)

;     (FLET ((UP (I)
;              (IF (ZEROP Y)
;                  (- # W)
;                  (- I W)))
;            (DN (I)
;              (IF (= Y MY)
;                  (- # L)
;                  (+ I W)))
;            (LT (I)
;              (IF (ZEROP X)
;                  (1- #)
;                  (1- I)))
;            (RT (I)
;              (IF (= X MX)
;                  (1+ #)
;                  (1+ I))))
;       (LET* ((U (UP I))
;              (D (DN I))
;              (L (LT I))
;              (R (RT I))
;              (UL (LT U))
;              (UR (RT U))
;              (DL (LT D))
;              (DR (RT D)))
;         (MAPCAR (LAMBDA (I) (AREF CELLS I)) (LIST UL U UR L R DL D DR))))
; 
; note: doing signed word to integer coercion (cost 20) to "<return value>"
; 
; note: doing signed word to integer coercion (cost 20) to "<return value>"
; 
; compilation unit finished
;   printed 4 notes

The two offending forms are flet and list. One solution would be to declare the types of all eight cells in the let* form. But is there anything about this code that bothers you? You see, lists in Common Lisp can hold any kind of value, and making a list takes time and memory. And this code actually creates two lists (one with the cell indices, and one created by the mapcar function). So telling the compiler that the things stuffed into those lists are in fact numbers does not make much sense (or performance difference), because the lists are still being created and traversed.

So, in the name of optimization I'll go on and do what has to be done: make a single array for neighbours calculation and update it on each call. Here's the ugly-but-supposedly-fast version:

(deftype neighbours () '(simple-array cell (8)))

(defun neighbours (cells i w h)
  (declare (type brain cells)
           (type array-index i)
           (type fixnum w h))
  (let ((result (the neighbours (load-time-value (make-array 8 :element-type 'cell))))
        (l (length cells))
        (mx (1- w))
        (my (1- h)))
    (multiple-value-bind (y x)
        (truncate i w)
      (flet ((up (i) (if (zerop y) (- (+ i l) w) (- i w)))
             (dn (i) (if (= y  my) (- (+ i w) l) (+ i w)))
             (lt (i) (if (zerop x) (1- (+ i w))  (1- i)))
             (rt (i) (if (= x  mx) (1+ (- i w))  (1+ i))))
        (let* ((u (up i))
               (d (dn i))
               (l (lt i))
               (r (rt i))
               (ul (lt u))
               (ur (rt u))
               (dl (lt d))
               (dr (rt d)))
          (setf (aref result 0) (aref cells ul)
                (aref result 1) (aref cells u)
                (aref result 2) (aref cells ur)
                (aref result 3) (aref cells l)
                (aref result 4) (aref cells r)
                (aref result 5) (aref cells dl)
                (aref result 6) (aref cells d)
                (aref result 7) (aref cells dr))
          result)))))

First I define a new type that describes the neighbours array (known to be 8 elements long at compile time). In the function itself a new binding is introduced: result. There are two interesting things about this binding:

  • The value is created only once – when the code is loaded (in this case the behaviour is similar to static variables in C and some other languages).
  • The type of the array is declared (using the the form) because the value is only known at load time (as opposed to compile time). Theoretically I'd expect the compiler to figure this out by itself, but arbitrary code can be given to load-time-value, so the compiler apparently does not even try.
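The static-variable behaviour of load-time-value can be demonstrated in isolation with a toy function (not part of the simulation):

```lisp
;; The MAKE-ARRAY form is evaluated once, when the compiled code is
;; loaded; afterwards every call returns that very same array object.
(defun scratch-buffer ()
  (load-time-value (make-array 8 :element-type 'bit)))

;; (eq (scratch-buffer) (scratch-buffer)) evaluates to T.
```

Note that this also means neighbours now returns the same array on every call, so the function is no longer reentrant: callers must consume the result before calling it again, and it is not safe to use from multiple threads.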

So there, compiling the neighbours function now gives no warnings, and the function is as ugly as if it were written in C. And since it now returns an array, this information can be added to the rules function:

(defun rules (state neighbours)
  (declare (type cell state)
           (type neighbours neighbours))
  (case state
    (1 2)
    (2 0)
    (t (if (= 2 (count 1 neighbours)) 1 0))))

Benchmarking now gives numbers around 11 seconds for each simulation run (i.e., the code runs almost two times faster than before). The top of the profiler output now looks like this:

           Self        Total        Cumul
  Nr  Count     %  Count     %  Count     %    Calls  Function
------------------------------------------------------------------------
   1    466  24.7    527  27.9    466  24.7        -  NEIGHBOURS
   2    376  19.9   1152  61.0    842  44.6        -  COUNT
   3    194  10.3    194  10.3   1036  54.8        -  SB-IMPL::OPTIMIZED-DATA-VECTOR-REF
   4    141   7.5    238  12.6   1177  62.3        -  (LAMBDA (SB-IMPL::X))
   5    136   7.2    136   7.2   1313  69.5        -  SB-KERNEL:HAIRY-DATA-VECTOR-REF/CHECK-BOUNDS
   6    127   6.7    127   6.7   1440  76.2        -  SB-KERNEL:%COERCE-CALLABLE-TO-FUN
   7    110   5.8    118   6.2   1550  82.1        -  EQL
   8     86   4.6     86   4.6   1636  86.6        -  "foreign function sigprocmask"
   9     84   4.4   1800  95.3   1720  91.1        -  EVOLVE
...

A few notable things:

  • I think I'm starting to understand the second column2. That would explain why evolve has 95% and other entries less. If so, then count takes about twice the time of neighbours.
  • The vector-ref friends are back! Since neighbours only sets array values, it sure looks like count has dragged these into the picture.

Anyway, I have not done much Lisp code profiling (or performance tuning, for that matter), so I'll do a little experiment here and write my own code to count live neighbours:

(defun rules (state neighbours)
  (declare (type cell state)
           (type neighbours neighbours))
  (case state
    (1 2)
    (2 0)
    (t (if (= 2 (loop for x across neighbours counting (= 1 x))) 1 0))))

There is one compiler note for this loop – the count accumulation variable is checked for fixnum overflow. I'll leave it at that, wrap things up and call this my final version.
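If I wanted to silence that last note as well, the loop's accumulation variable can be given a type with an into/of-type clause; a sketch (I have not benchmarked whether this makes a measurable difference):

```lisp
(defun rules (state neighbours)
  (declare (type cell state)
           (type neighbours neighbours))
  (case state
    (1 2)
    (2 0)
    ;; Accumulate into an explicitly fixnum-typed variable to avoid
    ;; the overflow check on the counter.
    (t (if (= 2 (loop for x across neighbours
                      count (= 1 x) into n of-type fixnum
                      finally (return n)))
           1 0))))
```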

Here are the timings (with safety optimization setting of 1 and 0):

http://www.ltn.lv/~jonis/blog/4-bb-cl-2/bm-sbcl.png

http://www.ltn.lv/~jonis/blog/4-bb-cl-2/bm-ccl32.png

http://www.ltn.lv/~jonis/blog/4-bb-cl-2/bm-ccl64.png

To be honest, I did not expect getting rid of count to make such a big difference. I can only hope that this example will help implementation developers make their compilers better!

The detailed timings can be seen in transcripts (below), but generally they are:

  • 3.9s / 3s for SBCL.
  • 8.5s / 6.9s for Clozure CL (32-bit).
  • 6s / 5.5s for Clozure CL (64-bit).

I'm pretty sure it is possible to squeeze some more performance out of this. I actually think this kind of low-level optimisation is an art in itself, and I'm not very good at it. That is why I will leave this topic for now. But readers are welcome to contribute in the comments or, better yet, by writing about it on their own blogs.

Footnotes

1 A comment from pkhuong on #lisp IRC channel on approaching optimization (after reading a draft version of this article):

"Only NEIGHBOURS shows up fairly high in the list. However, before working on it directly, I'd attack what I consider noise: COUNT, TRUNCATE, LENGTH and EQL shouldn't be called out of line. *-DATA-VECTOR-REF and GENERIC-+ shouldn't have to be called either. Inserting type declarations and asking for speed optimization where appropriate in order to eliminate these entries should be the first step. As it is, the profile is mostly telling you that you're doing it wrong if you really want execution speed.

Only then, with the noise eliminated, would it be time to work on the program itself, from a new profile (with the declarations and with speed optimization where needed) that may be radically different from the one above."

2 Another comment from pkhuong, this time about the columns in profiler output (confirming my understanding of the "total" column):

"Count" columns count samples, "%" % of time (samples). Self is the amount/% of samples with the function at the right at the top of the call stack (directly executing), total is the amount/% of time with the function somewhere on the call stack (up to a certain depth). Cumul is the sum of Self values; I find it helpful to guesstimate where the repartition of self times tapers off into a long tail.

The cycle output (not shown here nor in the blog post) is extremely useful if you know that a lot of time is spent in a function, either directly or in total, but not what the callers or the callees are. You'll be able to find the function you're looking for, and the most common (in terms of samples) callers and callees. With that information, you could decide how to make a certain callee faster (or special-case it for the slow function), change the way the function is used in the slow callers, special-case the function for the slow callers, etc. The cycle output is somewhat harder to read, but the format is the same as that of gprof, so you can look around for more resources.

2009-10-29

Brian's brain, on Common Lisp, take 2

While writing code for the first version of Brian's Brain some things crept into my mind, but I left them out of the article to keep it focused. The main issues are:

  • Using symbols to represent cells is not very efficient since a symbol is a kind of object, which means it is a pointer, which means it takes 32 or 64 bits of storage. 2 bits would be enough in our case.
  • Computer memory is linear (well, from a programmer's point of view), and 2-dimensional arrays are figments of our imagination. I mean, in the end it's still a 1-dimensional array, with some support from the programming language to make it appear as something else. Is it worth the trouble to implement the 2D world with a 1D array if the language (Common Lisp in this case) provides us with 2D arrays?
  • Currently a new array is allocated for each simulation step, which is not strictly necessary. Each simulation step needs only two arrays – one for the current world state and one for the next. Since all cells of the new state are overwritten on each step, we don't need to create a completely new array to store them – we can reuse the one from the previous step.

Specialised arrays

I'll start with changing the cell representation – instead of symbols I'll use numbers. And the world is going to be a suitable array. This is also a good time to tell a bit about arrays and their types to those unfamiliar with Common Lisp.

The arrays I used for the first implementation of Brian's Brain were of type T, which is the most generic array type possible; it can store any kind of object. It is also possible to specify the type of objects that will be stored in an array when creating it, and the Common Lisp implementation will choose the most specific array type that can store objects of the specified type.

In my case I want the array to store numbers from 0 to 2. There is an integer type in Common Lisp, which is the type of mathematical integers (with unlimited magnitude). It is also possible to limit the magnitude by specifying lower and upper limits. Here's how to create an array suitable for storing integers from 0 to 2 in Common Lisp (SBCL):

CL-USER> (make-array 7 :element-type '(integer 0 2))
#(0 0 0 0 0 0 0)

We can also check what the exact type of the created object is:

CL-USER> (type-of (make-array 7 :element-type '(integer 0 2)))
(SIMPLE-ARRAY (UNSIGNED-BYTE 2) (7))

As we can see, the element type of the created array is (unsigned-byte 2), which covers all non-negative integers representable in 2 bits (see the documentation of unsigned-byte for more details). The type returned is more general than what I asked for, but as I said, it is up to the Common Lisp implementation to choose what type of array to create. There is also a function that can tell what type of array will be created for a given type specifier. Let's see, again on SBCL:

CL-USER> (upgraded-array-element-type '(integer 0 2))
(UNSIGNED-BYTE 2)
CL-USER> (upgraded-array-element-type '(integer 0 42))
(UNSIGNED-BYTE 7)

A different Common Lisp implementation is allowed to behave differently. Here is what 32-bit Clozure CL tells us:

CL-USER> (upgraded-array-element-type '(integer 0 2))
(UNSIGNED-BYTE 8)
CL-USER> (upgraded-array-element-type '(integer 0 42))
(UNSIGNED-BYTE 8)
CL-USER> (type-of (make-array 7 :element-type '(integer 0 2)))
(SIMPLE-ARRAY (UNSIGNED-BYTE 8) (7))

As you can see, in Clozure CL an 8-bit byte is the most specific array element type available for arrays of small integers. There are many places in the Common Lisp specification where implementations are given freedom of choice. For instance, this specific issue is described in the array upgrading section.

So now we are ready to work with numbers as cell representations. 0 for dead cell, 1 for alive cell, and 2 for dying cell:

(defun make-brain (w h)
  (make-array (list h w) :element-type '(integer 0 2)))

(defun make-initialised-brain (w h)
  (let ((cells (make-brain w h))
        (mid (floor w 2)))
    (setf (aref cells 0 mid) 1)
    (setf (aref cells 0 (1+ mid)) 1)
    cells))

(defun rules (state neighbours)
  (case state
    (1 2)
    (2 0)
    (t (if (= 2 (count 1 neighbours)) 1 0))))

All other code stays as it was. Now is a good time to recall the odd graph from last time (scaled to match other graphs below):

http://www.ltn.lv/~jonis/blog/3-bb-cl/bm-symbolic-sbcl.png

And the new version:

http://www.ltn.lv/~jonis/blog/3-bb-cl/bm-num-sbcl.png

Quite a nice improvement for such a small change. I guess my theory about GCing large arrays was correct. The times for Clozure CL have also improved, though not quite as dramatically:

http://www.ltn.lv/~jonis/blog/3-bb-cl/bm-num-ccl32.png

http://www.ltn.lv/~jonis/blog/3-bb-cl/bm-num-ccl64.png

What do we learn? Lisp programmers can learn the price of some things1.

Destructive action

Next step is to avoid allocating a new array at each evolution step. I'm not expecting any serious performance gains from this because at this point each iteration is creating a single object that does not have to be traversed by GC. Let's see if I can shed some light on the believers in manual memory management lurking in the dark corners of intarwebs.

First I'll make evolve take another parameter – the array where the next generation cells will be put into:

(defun evolve (src dst)
  (let* ((w (array-dimension src 1))
         (h (array-dimension src 0)))
    (loop for j below h
       do (loop for i below w
             do (setf (aref dst j i)
                      (funcall 'rules (aref src j i) (neighbours src i j)))))
    dst))

Easy, one parameter more, one line less. Now simulate will need two brains and will alternate them at each call to evolve:

(defun simulate (steps a b)
  (loop repeat steps
     do (psetf a (evolve a b)
               b a)
     finally (return a)))

benchmark also must be updated to pass a spare brain (of the proper size) to simulate:

(defun benchmark ()
  (format *trace-output* "Benchmarking on ~A ~A~%"
          (lisp-implementation-type)
          (lisp-implementation-version))
  ;; Warmup.  SIMULATE now needs a spare brain, too.
  (simulate 10000 (make-initialised-brain 16 16) (make-brain 16 16))
  (loop
     for (w h i) in '((32    32  32768)
                      (64    64  8192)
                      (128  128  2048)
                      (256  256  512)
                      (512  512  128)
                      (1024 1024 32)
                      (2048 2048 8)
                      (4096 4096 2))
     do #+ccl (gc)
        #+sbcl (gc :full t)
        (let ((initial (make-initialised-brain w h))
              (spare (make-brain w h)))
          (format *trace-output* "*** ~Dx~D ~D iteration~:P ***~%" w h i)
          (time (simulate i initial spare))
          (finish-output *trace-output*)))
  (values))

Here, take a close look at the graphs now:

http://www.ltn.lv/~jonis/blog/3-bb-cl/bm-des-sbcl.png

http://www.ltn.lv/~jonis/blog/3-bb-cl/bm-des-ccl32.png

http://www.ltn.lv/~jonis/blog/3-bb-cl/bm-des-ccl64.png

Notice anything interesting? Right, the performance has not improved noticeably. It has actually decreased slightly for SBCL and 64-bit Clozure CL. What's the lesson? Garbage Collector will make your life easier (if you don't deliberately spam it).

Dimension warp

The last exercise today will be to test whether it is worth the trouble of implementing 2-dimensional arrays on top of 1-dimensional arrays manually. This means doing the index calculations manually. I'll branch this experiment off the numerical branch2.

Let's start with the easy things – creating the arrays:

(defun make-brain (w h)
  (make-array (* w h) :element-type '(integer 0 2)))

All that has to be changed in make-initialised-brain is to remove the extra dimension (because we only change the first row, and the first row of our array is the first w cells):

(defun make-initialised-brain (w h)
  (let ((cells (make-brain w h))
        (mid (floor w 2)))
    (setf (aref cells mid) 1)
    (setf (aref cells (1+ mid)) 1)
    cells))

The main function that has to be changed is neighbours. The function is very simple, although a bit lengthy (compared to other, incorrect versions), but I like to keep the intent of the code close to how I'd do things myself if I were pretending to be a computer:

(defun neighbours (cells i w h)
  (let ((l (length cells))
        (mx (1- w))
        (my (1- h)))
    (multiple-value-bind (y x)
        (truncate i w)
      (flet ((up (i) (if (zerop y) (- (+ i l) w) (- i w)))
             (dn (i) (if (= y  my) (- (+ i w) l) (+ i w)))
             (lt (i) (if (zerop x) (1- (+ i w))  (1- i)))
             (rt (i) (if (= x  mx) (1+ (- i w))  (1+ i))))
        (let* ((u (up i))
               (d (dn i))
               (l (lt i))
               (r (rt i))
               (ul (lt u))
               (ur (rt u))
               (dl (lt d))
               (dr (rt d)))
          (mapcar (lambda (i) (aref cells i))
                  (list ul u ur l r dl d dr)))))))

cells is the 1-dimensional array that is used to simulate the 2-dimensional array. i is the index of the cell for which the neighbours should be calculated. And since cells is a 1-dimensional array and does not carry any information about what dimensions are stored in it, w and h specify the dimensions of the 2-dimensional array. (If it is not clear to you why these dimensions must be specified, think about how many different 2-dimensional arrays can be stored in a 24-element 1-dimensional array.)

I have written this function in a way that allows me to think about the 1D array as a 2D array. So from i and w the x and y coordinates are calculated. These are used to detect whether the cell in question is in the top (0) row or the leftmost (0) column. For the bottom row and rightmost column I have my and mx, respectively.

Then come four local utility functions to determine an index that is up (up), down (dn), left (lt) or right (rt) of a given cell. The logic is the same as in the 2D array version.

Next I calculate indices for up, down, left and right cells using the above mentioned functions. The cells on diagonals are calculated relative to these. For instance, up-left cell can be obtained by going up from left cell or going left from up cell.

In the end the obtained indices are used to get cell values from cells array. And that's all there is to it.
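The index arithmetic described above can be checked in isolation; here is a small sketch with hypothetical helper names, mapping between (x, y) coordinates and the flat index:

```lisp
;; Row-major layout: cell (x, y) of a w-wide world lives at x + y*w.
(defun xy->index (x y w)
  (+ x (* y w)))

;; TRUNCATE returns quotient and remainder, i.e. y and then x,
;; matching the MULTIPLE-VALUE-BIND at the top of NEIGHBOURS.
(defun index->xy (i w)
  (multiple-value-bind (y x) (truncate i w)
    (values x y)))

;; For a 4-wide world: (xy->index 2 1 4) => 6
;;                     (index->xy 6 4)   => 2, 1
```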

Now evolve can also be simplified a bit, since only one dimension has to be traversed. However, the brain dimensions must be passed explicitly, since they cannot be obtained from the 1D array.

(defun evolve (src w h)
  (let* ((l (length src))
         (dst (make-brain w h)))
    (loop for i below l
       do (setf (aref dst i)
                (funcall 'rules (aref src i) (neighbours src i w h))))
    dst))

The dimensions must be dragged through simulate function, too:

(defun simulate (steps initial w h)
  (loop with brain = initial
     repeat steps
     do (setf brain (funcall 'evolve brain w h))
     finally (return brain)))

The code gets ever uglier. Another way to solve this problem would be to create a container object (like a structure or class) to hold the cells array and the two dimensions. But that is really not the point of this exercise – I actually expect the gains from this change to be so minimal (if any!) that it is not worth the trouble at all.
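For completeness, such a container could be sketched with a structure (all names here are hypothetical; as said above, I would not actually bother for this program):

```lisp
(defstruct (world (:constructor %make-world))
  (cells (make-array 0 :element-type '(integer 0 2))
         :type (simple-array (integer 0 2) (*)))
  (w 0 :type fixnum)
  (h 0 :type fixnum))

(defun make-world (w h)
  (%make-world :cells (make-array (* w h) :element-type '(integer 0 2))
               :w w :h h))

;; EVOLVE and friends would then take a single WORLD argument
;; instead of threading W and H through every call.
```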

OK, last change – brain dimensions must be passed to simulate:

106:  (defun benchmark ()
107:    (format *trace-output* "Benchmarking on ~A ~A~%"
108:            (lisp-implementation-type)
109:            (lisp-implementation-version))
110:    ;; Warmup.
111:    (simulate 10000 (make-initialised-brain 16 16) 16 16)
112:    (loop
113:       for (w h i) in '((32    32  32768)
114:                        (64    64  8192)
115:                        (128  128  2048)
116:                        (256  256  512)
117:                        (512  512  128)
118:                        (1024 1024 32)
119:                        (2048 2048 8)
120:                        (4096 4096 2))
121:       do #+ccl (gc)
122:          #+sbcl (gc :full t)
123:          (let ((initial (make-initialised-brain w h)))
124:            (format *trace-output* "*** ~Dx~D ~D iteration~:P ***~%" w h i)
125:            (time (simulate i initial w h))
126:            (finish-output *trace-output*)))
127:    (values))

So, let's see what we get:

http://www.ltn.lv/~jonis/blog/3-bb-cl/bm-1d-sbcl.png

http://www.ltn.lv/~jonis/blog/3-bb-cl/bm-1d-ccl32.png

http://www.ltn.lv/~jonis/blog/3-bb-cl/bm-1d-ccl64.png

Oh man, was I wrong! Why would the difference be so big? A profiler would come in handy, but before that I'd like to do another experiment, because I have a feeling about the source of the slowness: there are no declarations, and the generic array access code for 2D arrays might be more complex than for 1D arrays. Or maybe I should give the profiler a shot?
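
To give an idea of what such declarations would look like, here is a sketch only (not yet the declared version I'd actually benchmark):

```lisp
;; With the array declared as a SIMPLE-VECTOR and the index a FIXNUM,
;; the compiler can open-code AREF instead of going through the generic
;; (hairy) data-vector access routines seen in the profile.
(defun cell-ref (cells i)
  (declare (type simple-vector cells)
           (type fixnum i)
           (optimize (speed 3) (safety 1)))
  (aref cells i))
```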

Looking into future

The problem is that performance-tweaking this brute-force solution seems a bit pointless. The point of these articles was to show how I would implement Brian's Brain in Common Lisp (and do some apples-to-oranges comparisons to spice things up). A really interesting thing to try out would be to implement Hashlife, which promises an astronomical speed explosion, and see how it works with Brian's Brain. And I suspect that generating "6 octillion generations in 30 seconds" would not quite work out, because Brian's Brain is a lot more chaotic (I have not noticed a single oscillator while watching my animations) – the presence of "dead" cells forces a lot of movement (and expansion). I noticed quite a few spaceships, though.

I also installed Golly – a wonderful application to play with cellular automata. It has the Hashlife algorithm implemented, and it really does wonders. However, it did not work out for Brian's Brain (starting with 2 live cells as I've been doing all this time) because:

  • In 30 seconds it does not even get to generation 2000.
  • The pattern is ever-expanding and I could not find any way to limit the world size (i.e., I think the torus-world is not implemented in Golly).

So I'd have to implement Hashlife myself. Except that I'm hesitant to hand over $30 for B. Gosper's original paper, but I might as well do it in the future (since, as an extra bonus, the original implementation was done on a Lisp Machine).

So, I think I'll do one more iteration with the brute-force method (with profiling and declarations). And then we'll see what happens.

Updates

2009-11-01: Please don't send me any more copies of B. Gosper's paper :)

Footnotes

1 Alan Perlis, Epigrams in Programming, "55. A LISP programmer knows the value of everything, but the cost of nothing."

2 Yes, I'm using git, and branching like crazy, even in this toy project.

2009-10-26

Brian's Brain on Common Lisp

The real thing

If you are reading this you might have already read the previous entry and are familiar with the problem domain. And today I'm going to walk you through implementing Brian's Brain in Common Lisp. First, the straight-forward implementation with the brain represented as a 2-dimensional array (that's what it is, right?), with just enough code to get it running. There are no functions to abstract away the brain implementation details, cell representation or anything else.

First, a function to create a brain:

1:  (defun make-brain (w h)
2:    (make-array (list h w) :initial-element :off))

And another function that will make us an initialised brain (like the one in Clojure version):

1:  (defun make-initialised-brain (w h)
2:    (let ((cells (make-brain w h))
3:          (mid (floor w 2)))
4:      (setf (aref cells 0 mid) :on)
5:      (setf (aref cells 0 (1+ mid)) :on)
6:      cells))

Why are the dimensions to make-array passed as (h w) and not (w h) you might ask? Because I like to see that my functions work as soon as I write them. Let's see how it works:

CL-USER> (make-initialised-brain 7 5)
#2A((:OFF :OFF :OFF :ON  :ON  :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF :OFF :OFF))

What would happen if we had them in the opposite order:

CL-USER> (make-initialised-brain 5 7)
#2A((:OFF :OFF :ON  :ON  :OFF)
    (:OFF :OFF :OFF :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF)
    (:OFF :OFF :OFF :OFF :OFF))

Incidentally, this is how Lau's version has them (if you have followed the examples carefully enough). There is no real difference having them either way; it is only a matter of convenience: if I look at the brain in the REPL I want to see the same thing I'd see in the animated graphical output.

The rules are independent of brain representation as long as we can provide the neighbouring cells as a parameter:

1:  (defun rules (state neighbours)
2:    (case state
3:      (:on    :dying)
4:      (:dying :off)
5:      (t (if (= 2 (count :on neighbours)) :on :off))))

How do we find the neighbours of a given cell? Easy, like this:

 1:  (defun neighbours (cells x y)
 2:    (let* ((mx (1- (array-dimension cells 1)))
 3:           (my (1- (array-dimension cells 0)))
 4:           (l (if (zerop x) mx (1- x)))
 5:           (r (if (= x mx) 0 (1+ x)))
 6:           (u (if (zerop y) my (1- y)))
 7:           (d (if (= y my) 0 (1+ y))))
 8:      (mapcar (lambda (x y)
 9:                (aref cells y x))
10:              (list l x r l r l x r)
11:              (list u u u y y d d d))))

What happens here should be pretty obvious, but I'll explain a bit anyway since this is a one-way communication channel. mx and my are the maximal values for the x and y coordinates. Left of the cell (l) is the current x coordinate minus 1, unless we're in the leftmost column (0), in which case we get mx. Similarly for the right cell, except we check whether we're in the rightmost column (mx), and wrap to 0 if we are. The same goes for the y axis. In short, referencing a cell off the edge gets us a cell on the opposite side.
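
The wrap-around logic can also be expressed more compactly with mod (a hypothetical helper, not what the code above uses):

```lisp
;; Wrap a coordinate onto a torus of size N: -1 wraps to N-1, N wraps
;; to 0, and in-range coordinates pass through unchanged. CL's MOD
;; always returns a result with the sign of the divisor, so negative
;; coordinates wrap correctly.
(defun wrap (c n)
  (mod c n))
```

So (wrap -1 5) is 4 and (wrap 5 5) is 0.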

Then for each pair of coordinates around our cell we get the value from the cells array. These pairs are given by two lists: one for x coordinates and one for y coordinates. The function mapcar goes over both lists simultaneously and applies the given function to each successive pair of items from the lists.
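
A minimal standalone illustration of how mapcar pairs up two lists:

```lisp
;; MAPCAR takes successive elements from each list in parallel and
;; stops at the end of the shortest list.
(mapcar (lambda (x y) (list x y))
        '(1 2 3)
        '(a b c))
;; => ((1 A) (2 B) (3 C))
```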

Also note how the indices are passed to aref, with y and x in unnatural positions. This is for the reasons explained above – the rows are the first dimension, and the columns the second. But the neighbours function expects them in the natural order, keeping the implementation details out of the way.

Let's check if our neighbours function works as expected:

CL-USER> (neighbours (make-initialised-brain 7 5) 3 4)
(:OFF :OFF :OFF :OFF :OFF :OFF :ON :ON)

The resulting list is cell values for, respectively, left-up, up, right-up, left, right, left-down, down and right-down cells from the specified x, y coordinate.

What's left? Evolution. The function which will create next state of a brain:

1:  (defun evolve (src)
2:    (let* ((w (array-dimension src 1))
3:           (h (array-dimension src 0))
4:           (dst (make-brain w h)))
5:      (loop for j below h
6:         do (loop for i below w
7:               do (setf (aref dst j i)
8:                        (funcall 'rules (aref src j i) (neighbours src i j)))))
9:      dst))

An ordinary loop over rows and columns, setting values in the newly created brain by applying the rules to each cell and its neighbours in the current brain. We're ready to play now:

CL-USER> (evolve (make-initialised-brain 7 5))
#2A((:OFF :OFF :OFF :DYING :DYING :OFF :OFF)
    (:OFF :OFF :OFF :ON    :ON    :OFF :OFF)
    (:OFF :OFF :OFF :OFF   :OFF   :OFF :OFF)
    (:OFF :OFF :OFF :OFF   :OFF   :OFF :OFF)
    (:OFF :OFF :OFF :ON    :ON    :OFF :OFF))
CL-USER> (evolve *)
#2A((:OFF :OFF :ON  :OFF   :OFF   :ON  :OFF)
    (:OFF :OFF :OFF :DYING :DYING :OFF :OFF)
    (:OFF :OFF :OFF :ON    :ON    :OFF :OFF)
    (:OFF :OFF :OFF :ON    :ON    :OFF :OFF)
    (:OFF :OFF :OFF :DYING :DYING :OFF :OFF))

Using numbers for cell values would be much better visually, but I'm staying close to the Clojure version (for now).

Getting ready for blastoff

Almost ready to do some timing. All we need are the simulate and benchmark functions:

 1:  (defun simulate (steps initial)
 2:    (loop repeat steps
 3:       for brain = initial then (funcall 'evolve brain)
 4:       finally (return brain)))
 5:  
 6:  (defun benchmark ()
 7:    (format *trace-output* "Benchmarking on ~A ~A~%"
 8:            (lisp-implementation-type)
 9:            (lisp-implementation-version))
10:    ;; Warmup.
11:    (simulate 10000 (make-initialised-brain 16 16))
12:    (loop
13:       for (w h i) in '((32    32  32768)
14:                        (64    64  8192)
15:                        (128  128  2048)
16:                        (256  256  512)
17:                        (512  512  128)
18:                        (1024 1024 32)
19:                        (2048 2048 8)
20:                        (4096 4096 2))
21:       do (let ((initial (make-initialised-brain w h)))
22:            (format *trace-output* "*** ~Dx~D ~D iteration~:P ***~%" w h i)
23:            (time (simulate i initial))
24:            (finish-output *trace-output*)))
25:    (values))

Notice that there is not a single type annotation in this code. Running it on my laptop1:

CL-USER> (benchmark)
Benchmarking on SBCL 1.0.31.32
*** 32x32 32768 iterations ***
Evaluation took:
  34.782 seconds of real time
  34.064263 seconds of total run time (33.060215 user, 1.004048 system)
  [ Run times consist of 7.670 seconds GC time, and 26.395 seconds non-GC time. ]
  97.94% CPU
  96,906,498,234 processor cycles
  14,769,770,512 bytes consed

*** 64x64 8192 iterations ***
Evaluation took:
  33.512 seconds of real time
  32.776229 seconds of total run time (31.254474 user, 1.521755 system)
  [ Run times consist of 5.903 seconds GC time, and 26.874 seconds non-GC time. ]
  97.80% CPU
  93,366,482,790 processor cycles
  14,762,592,768 bytes consed
... snipped

Running from the terminal would look something like this:

$ sbcl --noinform --disable-debugger --load simple.fasl --eval "(benchmark)" --eval "(quit)"

This time we have more information to display in the graphs: total run time and GC time. And just look at the numbers:

http://www.ltn.lv/~jonis/blog/2-bb-cl/bm-sbcl.png

Interesting. Very interesting, indeed. What we see here is that the time to run one iteration (blue bars) increases a bit at first and then declines quite sharply. GC time (green), on the other hand, increases quite sharply starting from the 256x256 simulation. Can you come up with an explanation?

I have one. Smaller arrays are processed quite fast, so they are short-lived and become garbage before GC kicks in. The bigger arrays are processed for longer, so they are still alive when GC starts. And GC has to walk all array elements each time, since the arrays are not specialised (that is, they can contain anything). I told you using numbers would be better (for different reasons, though)!
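
For illustration, a specialised variant (a sketch with my own 0/1/2 encoding for off/on/dying, not the representation used in this post) would give the GC a block of raw integers it never needs to scan element-by-element:

```lisp
;; A specialised brain: cells are (UNSIGNED-BYTE 2) integers, which the
;; implementation can pack into a raw numeric vector instead of a
;; generic array of symbols.
(defun make-numeric-brain (w h)
  (make-array (list h w)
              :element-type '(unsigned-byte 2)
              :initial-element 0))
```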

I also did another run of this same benchmark (transcripts available below), and the numbers were the same up to sub-second precision.

Let's do the same thing with a different Common Lisp implementation, which in my case will be Clozure CL. The nice thing about this implementation is that its compiler is very snappy. Running from the shell looks like this:

$ ccl -n -Q -l simple.dx32fsl -e "(benchmark)"

http://www.ltn.lv/~jonis/blog/2-bb-cl/bm-ccl.png

One thing to note is that an array of size 4096x4096 exceeds the array size limit (a 24-bit number) of 32-bit Clozure CL, which, coincidentally, is just 1 short of what we need:

CL-USER> array-dimension-limit
16777216
CL-USER> (integer-length array-dimension-limit)
25
CL-USER> (integer-length (1- array-dimension-limit))
24
CL-USER> (* 4096 4096)
16777216

But otherwise there is less variation in 32-bit Clozure CL. Let's look at the 64-bit version:

http://www.ltn.lv/~jonis/blog/2-bb-cl/bm-ccl64.png

Nice, behaviour similar to SBCL, except that GC times don't grow as fast.

Missing things

I hear somebody in the corner mumbling something about animation and graphics. Oh, right. Missed that one. But it so happens that the very thing that made me start playing with Brian's Brain is that I installed cl-opengl. And guess what? It worked right out of the box. On all the Common Lisp implementations I have on my computer. So I had to start playing with it. And the rest, as they say, is history.

It works like this:

CL-USER> (asdf:operate 'asdf:load-op :cl-glut-examples)
; System loading output snipped...
NIL
CL-USER> (cl-glut-examples:run-examples)

And all the examples pop up, many of them animating. So I peek at some examples to see how to set up a window. Easy as pie:

 1:  (defclass bb (glut:window)
 2:    ((cells :accessor cells-of :initarg :cells))
 3:    (:default-initargs
 4:     :title "Brian's Brain in CL"
 5:     :mode '(:double :rgb)))
 6:  
 7:  (defmethod glut:display-window :before ((w bb))
 8:    (gl:clear-color 0 0 0 0)
 9:    (gl:matrix-mode :projection)
10:    (gl:load-identity)
11:    (let ((cells (cells-of w)))
12:      (gl:ortho 0 (array-dimension cells 1) 0 (array-dimension cells 0) -1 1)))

Then we need a function to render a single cell at a specified position. Drawing squares is easy enough:

 1:  (defun render-cell (x y cell)
 2:    (flet ((draw-cell (x y)
 3:             (gl:with-pushed-matrix
 4:                 (gl:translate x y 0)
 5:               (gl:with-primitive :polygon
 6:                 (gl:vertex 0.1 0.1 0)
 7:                 (gl:vertex 0.9 0.1 0)
 8:                 (gl:vertex 0.9 0.9 0)
 9:                 (gl:vertex 0.1 0.9 0)))))
10:      (case cell
11:        (:on (gl:color 1 1 1)
12:             (draw-cell x y))
13:        (:dying (gl:color 0.5 0.5 0.5)
14:                (draw-cell x y)))))

All that's left are some callbacks to draw the whole window and run the animation. The following two methods will do just fine:

 1:  (defmethod glut:display ((w bb))
 2:    (gl:clear :color-buffer)
 3:    (let* ((cells (cells-of w))
 4:           (w (array-dimension cells 1))
 5:           (h (array-dimension cells 0)))
 6:      (loop for j below h
 7:         do (loop for i below w
 8:               do (render-cell i j (aref cells j i)))))
 9:    (glut:swap-buffers))
10:  
11:  
12:  (defmethod glut:idle ((w bb))
13:    (setf (cells-of w) (evolve (cells-of w)))
14:    (glut:post-redisplay))

Everything is ready now. An animated Brian's Brain can be created like this:

(glut:display-window (make-instance 'bb
                                    :cells (make-initialised-brain 128 128)
                                    :width 512
                                    :height 512))

But since I don't like to put things at the toplevel that run when a file is merely loaded, I'll put the code into a function:

1:  (defun run (w h ww wh)
2:    (glut:display-window
3:     (make-instance 'bb
4:                    :cells (make-initialised-brain w h)
5:                    :width ww
6:                    :height wh)))

Feel free to start a never-ending Brian's Brain simulation:

CL-USER> (run 160 100 320 200)

http://www.ltn.lv/~jonis/blog/2-bb-cl/bb-160x100.png

Finishing touches

To make this all easily loadable I'll put the code in its own package and create a system definition (which nowadays means an ASDF) file. Refer to Xach's intro for a nice description of why and how to do this.

Package definition is very simple:

1:  (defpackage :brians-brain-1
2:    (:use :common-lisp)
3:    (:export #:run))

And the system definition is nothing complicated, either:

1:  (asdf:defsystem :brians-brain-1
2:      :version "1.0"
3:      :author "Jānis Džeriņš"
4:      :license "Send me money if you find this stuff useful."
5:      :depends-on (cl-opengl cl-glut)
6:      :components ((:file "package")
7:                   (:file "simple" :depends-on ("package"))
8:                   (:file "display" :depends-on ("package" "simple"))))

At this point we can get to a running animated Brian's Brain from the shell prompt:

$ sbcl --noinform --disable-debugger \
       --eval "(asdf:operate 'asdf:load-op :brians-brain-1)" \
       --eval "(brians-brain-1:run 160 100 320 200)" \
       --eval "(quit)"

Looking forward

Next time I'm going to play with different brain representations.

Updates

2009-10-28: Noticed a bug in the simulate function, which ran the simulation for one step less than asked. The corrected version looks like this:

1:  (defun simulate (steps initial)
2:    (loop with brain = initial
3:       repeat steps
4:       do (setf brain (funcall 'evolve brain))
5:       finally (return brain)))
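
For the curious, the culprit in the old version was the for ... = ... then clause: on the first of the steps iterations, brain is merely bound to initial, so evolve runs only steps - 1 times. A minimal reproduction with a counting stand-in for evolve (hypothetical name) shows it:

```lisp
;; With REPEAT 3 the step function fires only twice, because the first
;; iteration spends its turn on the initial binding.
(let ((n 0))
  (flet ((fake-evolve (x) (incf n) x))
    (loop repeat 3
       for v = :init then (fake-evolve v)
       finally (return v)))
  n)
;; => 2
```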

Also, GC is now invoked before each simulation so that garbage from the previous simulation has less chance to influence the next:

 6:  (defun benchmark ()
 7:    (format *trace-output* "Benchmarking on ~A ~A~%"
 8:            (lisp-implementation-type)
 9:            (lisp-implementation-version))
10:    ;; Warmup.
11:    (simulate 10000 (make-initialised-brain 16 16))
12:    (loop
13:       for (w h i) in '((32    32  32768)
14:                        (64    64  8192)
15:                        (128  128  2048)
16:                        (256  256  512)
17:                        (512  512  128)
18:                        (1024 1024 32)
19:                        (2048 2048 8)
20:                        (4096 4096 2))
21:       do #+ccl (gc)
22:          #+sbcl (gc :full t)
23:          (let ((initial (make-initialised-brain w h)))
24:           (format *trace-output* "*** ~Dx~D ~D iteration~:P ***~%" w h i)
25:           (time (simulate i initial))
26:           (finish-output *trace-output*)))
27:    (values))

The graphs are generally very similar, except that the rightmost columns no longer look fishy:

http://www.ltn.lv/~jonis/blog/2-bb-cl/bm-sbcl-2.png

http://www.ltn.lv/~jonis/blog/2-bb-cl/bm-ccl32-2.png

http://www.ltn.lv/~jonis/blog/2-bb-cl/bm-ccl64-2.png

New transcripts:

bm-sbcl-2.txt
bm-ccl32-2.txt
bm-ccl64-2.txt

Footnotes

1 See the previous blog entry for the specs.