[Larceny-users] Parallel assignment optimization and complicatedness.

Wed Feb 7 21:56:03 EST 2007

Once upon a time, I was curious about Larceny/Twobit's code generation,
and so wound up with this simple function:

  (lambda (v)
    (let loop ((n 0) (a 0))
      (if (>= n 26) a (loop (+ n 1) (fx+ a (vector-ref v n))))))

This, on SPARC (with Larceny v0.93) with 'fast-unsafe, compiles to:

4       or      %g0, 0, %r3
8       subcc   %r2, 104, %g0   ! 0x68
12      bl,a    #36
16      add     %r2, 4, %r7
20      ld      [ %stkp + 4 ], %o7
24      jmpl    %o7 + 8, %g0
28      or      %g0, %r3, %result
32      add     %r2, 4, %r7
36      add     %r1, 1, %tmp0
40      ld      [ %tmp0+%r2 ], %r6
44      or      %g0, %r7, %r2
48      add     %r3, %r6, %r3
52      subcc   %timer, 1, %timer
56      bne,a   #12
60      subcc   %r2, 104, %g0   ! 0x68
64      jmpl    %globals + 1376, %o7    ! timer-exception
68      add     %o7, -64, %o7

Which is nice, except that the increment is evaluated before the
vector-ref, and thus requires a temporary -- and I recalled having read
about parallel assignment optimization, which is supposed to prevent
exactly that.

So, after far too much use of trace, I found that: the PAO in pass 4 (as
mentioned by the old notes) wasn't being terribly pointful, because the
subexpressions had all been flattened into registers by the conversion
to A-normal form in pass 3; and that, while A-normal-form itself might
have applied argument reordering, the code it got had the vector-ref call
expanded to a beta-redex festooned with safety checks.

This last was a problem because complicated? considers any non-primitive
call to be complicated, and would also give up early due to the
expression's size.  So, for the sake of experiment, I adjusted
complicated? to consider a call of a lambda with an uncomplicated body
the same way as a primitive (which may well not be the right thing in
general, but this is just a heuristic) and raised the budget (and made
it tunable with set!).  And that fixes that (and should be attached to
this message), though I haven't yet tried to measure what if any effect
it has on actual performance.

But it gets more fun.  I was originally looking at this on Intel, where
a stack frame and lots of unnecessary moves are introduced[*]:

   0:   31 d2                   xor    %edx,%edx
   2:   31 ff                   xor    %edi,%edi
   4:   83 fa 68                cmp    $0x68,%edx
   7:   0f 8c 06 00 00 00       jl     0x13
   d:   89 fb                   mov    %edi,%ebx
   f:   ff 75 04                pushl  0x4(%ebp)
  12:   c3                      ret    
  13:   83 ed 18                sub    $0x18,%ebp
  16:   3b 6c 24 0c             cmp    0xc(%esp),%ebp
  1a:   7d 0a                   jge    0x26
  1c:   83 c5 18                add    $0x18,%ebp
  1f:   ff 54 24 14             call   *0x14(%esp)
  23:   90                      nop    
  24:   eb ed                   jmp    0x13
  26:   c7 45 00 14 00 00 00    movl   $0x14,0x0(%ebp)
  2d:   31 db                   xor    %ebx,%ebx
  2f:   89 5d 04                mov    %ebx,0x4(%ebp)
  32:   89 75 0c                mov    %esi,0xc(%ebp)
  35:   89 7d 14                mov    %edi,0x14(%ebp)
  38:   89 5d 10                mov    %ebx,0x10(%ebp)
  3b:   8d 5a 04                lea    0x4(%edx),%ebx
  3e:   89 5c 24 7c             mov    %ebx,0x7c(%esp)
  42:   8b 54 11 01             mov    0x1(%ecx,%edx,1),%edx
  46:   89 55 10                mov    %edx,0x10(%ebp)
  49:   8b 54 24 7c             mov    0x7c(%esp),%edx
  4d:   8b 7d 10                mov    0x10(%ebp),%edi
  50:   8b 5d 14                mov    0x14(%ebp),%ebx
  53:   01 fb                   add    %edi,%ebx
  55:   89 df                   mov    %ebx,%edi
  57:   83 c5 18                add    $0x18,%ebp
  5a:   ff 4c 24 08             decl   0x8(%esp)
  5e:   0f 85 a0 ff ff ff       jne    0x4
  64:   ff 54 24 18             call   *0x18(%esp)
  68:   e9 97 ff ff ff          jmp    0x4

This is a little hideous.  Felix (in private mail) noticed that raising
*nhwregs* makes it not do that (but doesn't seem to help general
performance).  With my change to complicated?, it also doesn't do that:

   0:   31 d2                   xor    %edx,%edx
   2:   31 ff                   xor    %edi,%edi
   4:   83 fa 68                cmp    $0x68,%edx
   7:   0f 8c 06 00 00 00       jl     0x13
   d:   89 fb                   mov    %edi,%ebx
   f:   ff 75 04                pushl  0x4(%ebp)
  12:   c3                      ret    
  13:   8b 5c 11 01             mov    0x1(%ecx,%edx,1),%ebx
  17:   89 5c 24 7c             mov    %ebx,0x7c(%esp)
  1b:   89 fb                   mov    %edi,%ebx
  1d:   03 5c 24 7c             add    0x7c(%esp),%ebx
  21:   89 df                   mov    %ebx,%edi
  23:   8d 52 04                lea    0x4(%edx),%edx
  26:   ff 4c 24 08             decl   0x8(%esp)
  2a:   0f 85 d4 ff ff ff       jne    0x4
  30:   ff 54 24 18             call   *0x18(%esp)
  34:   e9 cb ff ff ff          jmp    0x4

There continues to not be a stack frame with safety enabled.  (Note also
the unnecessary soft register, but lamenting that is beyond the scope of
this message.)


[*] Obtained by writing the code vector to a file and calling an
external disassembler.
-- 
(let ((C call-with-current-continuation)) (apply (lambda (x y) (x y)) (map
((lambda (r) ((C C) (lambda (s) (r (lambda l (apply (s s) l))))))  (lambda
(f) (lambda (l) (if (null? l) C (lambda (k) (display (car l)) ((f (cdr l))
(C k)))))))    '((#\J #\d #\D #\v #\s) (#\e #\space #\a #\i #\newline)))))
-------------- next part --------------
Index: src/Compiler/pass3anormal.sch
===================================================================

--- src/Compiler/pass3anormal.sch	(revision 3972)
+++ src/Compiler/pass3anormal.sch	(working copy)
@@ -133,6 +133,8 @@
 
 (define A-normal-form-declaration (list 'anf))
 
+(define *anf-complication* 100)
+
 (define (A-normal-form E . rest)
   
   (define (A-normal-form E)
@@ -530,7 +532,7 @@
   
   (define (complicated? exp)
     ; Let's not spend all day on this.
-    (let ((budget 10))
+    (let ((budget *anf-complication*))
       (define (complicated? exp)
         (set! budget (- budget 1))
         (if (zero? budget)
@@ -546,11 +548,15 @@
               ((begin? exp) (some? complicated?
                                      (begin.exprs exp)))
               ((call? exp) (let ((proc (call.proc exp)))
-                             (if (and (variable? proc)
-                                      (prim-entry (variable.name proc)))
-                                 (some? complicated?
-                                        (call.args exp))
-                                 #t)))
+                             (if (cond
+				  ((variable? proc)
+				   (prim-entry (variable.name proc)))
+				  ((lambda? proc)
+				   (not (complicated? (lambda.body proc))))
+				  (else #f))
+				 (some? complicated?
+					(call.args exp))
+				 #t)))
               (else (error "Unrecognized expression" exp)))))
       (complicated? exp)))