[PRL] POPL Trip Report

Mitchell Wand wand at ccs.neu.edu
Wed Jan 9 22:48:10 EST 2008


If I wait til after POPL, I'll never get around to sending this out.  So
here is my report on POPL so far.  I'll only be around for one more day, so
there may not be much more to report from me.

I've mentioned some people by name, so please at least skim it and see the
items I've highlighted for you.

--Mitch

POPL Trip Report (part 1).

================================================================

Monday: PEPM

================================================================

Monica Lam  -- Invited talk at PEPM

Her thesis: Moore's Law means program analysis for security is now
more important than analysis for performance.  The talk was mostly a
presentation of bddbddb.  This
was all stuff I knew about, but it was interesting to hear her present
it.

She said it was a big step forward to code the analyses in Datalog.
Of course Jens Palsberg and I had been doing this forever...  I should
check her papers w/ John Whaley to see if they cite us.
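
The flavor of coding an analysis in Datalog, in a tiny Python sketch
(reachability over a call graph stands in for her points-to rules; the
facts and names here are illustrative, not from the talk):

```python
# Datalog-style analysis: facts are tuples, rules are applied until a
# fixpoint.  The two rules below are
#   reach(a, b) :- edge(a, b).
#   reach(a, c) :- reach(a, b), edge(b, c).

def fixpoint(edges):
    reach = set(edges)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(reach):
            for (b2, c) in edges:
                if b == b2 and (a, c) not in reach:
                    reach.add((a, c))
                    changed = True
    return reach

# a toy call graph
calls = {("main", "parse"), ("parse", "lex"), ("lex", "error")}
print(sorted(fixpoint(calls)))
```

Real systems like bddbddb evaluate such rules with BDDs instead of a
naive loop, but the rule set is the whole specification of the analysis.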

****************************************************************

Spoke to Cordell Green of Kestrel.

He says Li-Mei Wu is doing "superbly" at Kestrel  (is she from NU or
Indiana?)

They have formalized an entire (very small) OS in Specware.  No file
system, but lots of crypto & authorization stuff.

He says that Gregor Kiczales said that macros in {C#, LINQ, F#, CLR??}
provide a good way to sneak program transformations into the MS
toolset.  Dave H. says he's probably referring to what they call
"quotations" or something like that.  Worth checking out.

================================================================

Oleg says that Nick Benton has a neat paper in FLOPS 08 on contracts.
He puts a call/cc in the coercion, and then optimistically runs the
function anyway.  If something goes wrong later, then you get to jump
back to the coercion and retroactively declare that the coercion
failed.
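
A rough Python rendering of the idea (Python has no call/cc, so an
exception crudely plays the role of the captured continuation; the
names and the int-returning contract are my invention):

```python
# A coercion on a function can't be checked when it is applied to the
# function, so we optimistically run the wrapped function anyway; if a
# result later violates the contract, we abort back and blame the
# original coercion site.

class CoercionFailed(Exception):
    pass

def coerce_returns_int(f, site):
    """Claim that f returns ints; check retroactively at each call."""
    def wrapped(x):
        r = f(x)
        if not isinstance(r, int):
            # "jump back" to the coercion and declare it failed
            raise CoercionFailed(f"coercion at {site} failed on {r!r}")
        return r
    return wrapped

g = coerce_returns_int(lambda x: x * 2, site="line 42")
print(g(3))            # fine: 6
try:
    g("a")             # "aa" is not an int: retroactive failure
except CoercionFailed as e:
    print(e)
```

Benton's trick is stronger than this sketch: with a real continuation
the failure can rewind past arbitrary intervening computation, not just
unwind the stack.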

================================================================

Yitzhak Mandelbaum says he is interested in internships at AT&T.
Students should send him a vita.

================================================================

Tuesday: Coq workshop.  They had 70 people in the room, and about 8
UPenn folks in bright red T-shirts walking around helping people who
looked lost, just like we do in HtDP bootcamp.  It was fabulous.  When
they announced coffee break, everybody just stayed hunched over their
machines.  Nobody really wanted to stop at 5:00pm.

================================================================

Wednesday: DAMP

================================================================

Bratin Saha, Intel:  Intel 64 Architecture Memory Ordering Model
(invited talk)

At least 5 different kinds of memory:  strong uncacheable, write
combining, write back, write through, write protected.

These have different properties, performance, users, etc.

UC: no speculation, mainly for devices.  (Sequential consistency, but
very slow, because every opn has to go all the way through to memory).

WC: memory not cached, weaker consistency, mainly for video frame
buffers.  (Writes get combined in a separate buffer & batched.  Reads
can be speculative.)

WB memory.  All memory opns cached; memory opns can cause line
fills.  Speculation allowed.  Writes performed in cache when possible.
Most common kind of memory for programmers.  This will be main topic
of talk.

Assume no aliasing of memory types (ie the same physical address
mapped to 2 different virtual addresses w/ different memory types;
this is permitted by the processor, but "heavily deprecated").

WB is not sequentially consistent.

First attempt (P6, ca. 1991): Processor consistency: stores from a
single processor are seen in order, store buffer forwarding allowed
(breaks sequential consistency).

But: ordering of loads unclear, no guarantees about total ordering in
a MP system, no guarantees about causal consistency.

Memory Ordering considerations:  Weaker ordering allows implementation
flexibility, but this leads to conservative coding (many fences, etc.).

Should there be any form of total order?  Stores, locked opns, etc.
This is hard when we go to many-core.

Need to accommodate legacy programming practices.  (eg: in most code,
mutex acquire is an atomic test-and-set, release is an ordinary
store.  This is not actually correct in previous memory models.  Want
to accommodate this.)
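
That legacy pattern, sketched in Python (the atomicity of the
test-and-set is simulated with a Python lock standing in for a LOCK
XCHG instruction; the release really is just an ordinary store):

```python
import threading

_simulate_atomic = threading.Lock()   # stands in for the LOCK prefix

class SpinMutex:
    def __init__(self):
        self.flag = 0

    def test_and_set(self):
        # atomically read the old flag and set it to 1
        with _simulate_atomic:
            old, self.flag = self.flag, 1
            return old

    def acquire(self):
        while self.test_and_set():    # spin until we saw 0
            pass

    def release(self):
        self.flag = 0                 # ordinary store, no locked opn

m = SpinMutex()
counter = 0

def worker():
    global counter
    for _ in range(1000):
        m.acquire()
        counter += 1
        m.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)   # 4000: no lost updates
```

The point of the memory-model guarantees below is exactly that the
plain store in `release` is enough on real hardware.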

What is an atomic opn?

Intel 64 MO: Atomic accesses:

1. Byte accesses.

2. Reading/writing a 16/32/64 bit quantity aligned on a 16/32/64 bit
boundary (same size).  ie, aligned accesses are atomic.  (Unaligned
accesses are a problem because debuggers change their behavior.)

3. All locked opns.

Intraprocessor memory ordering:  /rules about intra-processor
reordering of loads and stores/:

   Stores not reordered w/ each other

   Loads not reordered w/ each other

   Stores are not reordered with older loads  [Stores do not pass loads]

   Loads may be reordered w older stores except loads do not get
    reordered w older stores to the same location [Loads may pass
    stores to a non-overlapping location]

   Loads and stores are not reordered w a locked instruction  (locks
    act as full fences)

/* "reordering" refers to the order in which memory operations in one
processor *appear* to have been performed, perhaps to some other
processor (I think). */
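
The fourth rule is what makes the classic store-buffer litmus test
observable: P0 does x=1; r1=y while P1 does y=1; r2=x.  A small Python
check (my own caricature of the model: interleave program order, then
also allow each processor's load to pass its older store):

```python
from itertools import combinations

def interleavings(a, b):
    """All interleavings of two op sequences, preserving each order."""
    n, m = len(a), len(b)
    for positions in combinations(range(n + m), n):
        out, ai, bi, pos = [], 0, 0, set(positions)
        for i in range(n + m):
            if i in pos:
                out.append(a[ai]); ai += 1
            else:
                out.append(b[bi]); bi += 1
        yield out

def execute(seq):
    mem, regs = {"x": 0, "y": 0}, {}
    for kind, *args in seq:
        if kind == "store":
            loc, val = args
            mem[loc] = val
        else:                      # ("load", reg, loc)
            reg, loc = args
            regs[reg] = mem[loc]
    return (regs["r1"], regs["r2"])

p0 = [("store", "x", 1), ("load", "r1", "y")]
p1 = [("store", "y", 1), ("load", "r2", "x")]

sc = {execute(s) for s in interleavings(p0, p1)}
print(sorted(sc))          # (0, 0) absent: SC forbids it

relaxed = set(sc)          # now let each load pass its older store
for q0 in (p0, p0[::-1]):
    for q1 in (p1, p1[::-1]):
        relaxed |= {execute(s) for s in interleavings(q0, q1)}
print((0, 0) in relaxed)   # True: allowed by the Intel 64 rules
```

(The same-location exception never fires here because each load reads
the other processor's location.)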

Multiprocessor memory ordering:

  Stores to the same location have a total order

  Locked instructions have a total order.  Software models seem to
  want this.  Serialization is expensive, but can use
  invalidation-based coherence.

  Memory ordering obeys causality - ie transitive visibility.  This is
  hard to implement, depending on your interconnect.  This is
  necessary to allow mutex to be released w/ an ordinary store.

Some exceptions for complex instructions.

http://www.intel.com/products/processor/manuals/318147.pdf

================================================================

Manuel Chakravarty
Partial Vectorization of Haskell Programs

Running example:  Barnes-Hut Algorithm for the n-body problem.  The naive
soln using Newton's laws has n^2 complexity.  B-H approximates the force
exerted by a cluster of far particles by their center of mass.
Requires adaptive spatial decomposition.  Gives an n log n algorithm.
Same approach works for other forces and other dimensions.

Represent spatial decomposition by a quad tree. In each node, keep
center of gravity of points in its region.
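
The B-H step, sketched sequentially in Python before the data-parallel
version (the opening criterion theta and the G=1 force law are the
usual illustrative choices, not from the talk):

```python
import math

class Node:
    """Quadtree node: total mass and center of mass of its region."""
    def __init__(self, mass, cx, cy, size, children=()):
        self.mass, self.cx, self.cy = mass, cx, cy
        self.size = size            # side length of the region
        self.children = children    # empty for a single particle

def accel(node, px, py, theta=0.5):
    dx, dy = node.cx - px, node.cy - py
    d = math.hypot(dx, dy) or 1e-9
    # far enough away (or a leaf): treat subtree as one point mass
    if not node.children or node.size / d < theta:
        f = node.mass / (d * d)     # G = 1
        return (f * dx / d, f * dy / d)
    ax = ay = 0.0
    for c in node.children:         # otherwise, open the node
        cax, cay = accel(c, px, py, theta)
        ax += cax; ay += cay
    return (ax, ay)

# two unit-mass particles summarized by their parent node
leaf1 = Node(1.0, 0.0, 0.0, 0.0)
leaf2 = Node(1.0, 1.0, 0.0, 0.0)
root = Node(2.0, 0.5, 0.0, 1.0, (leaf1, leaf2))
print(accel(root, 100.0, 0.0))   # far: acts like one mass-2 point
```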

In Data Parallel Haskell (DPH):

[: T :]  -- parallel array of T's.

data Tree = Node MassPnt [: Tree :]  ("Rose tree")

Can have comprehensions
  [: moveParticle p | p <- accelerate tree ps :]

Quadtree is irregular
Need data-parallel version of moveParticle
Data-Parallel recursive calls in buildTree.

To handle all this irregularity, go to vectorized version.

All this described in earlier papers.

Errm, what happens if we try to do IO in the middle?

Partial vectorization allows free mixing of vectorizable and
non-vectorizable code.

================================================================

Clemens Grelck
Efficient Heap Management for Declarative Data Parallel Programming on
Multicores.

Setting: declarative data-parallel programming in SAC
("Single-Assignment C").

In SAC, everything is an array:  vectors, matrices, scalars(!), eg:

Fact(N) = prod(1 + iota(N))

L2Norm(A) = sqrt(sum(square(A)))

convolution:  W1*shift(-1,A) + W2*A + W1*shift(1,A)
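
The same one-liners transcribed into Python, to make the
"everything is an array" style concrete (the helper names mirror the
SAC builtins; I'm assuming shift fills with zeros, which may not match
SAC exactly):

```python
import math

def iota(n):      return list(range(n))            # [0, 1, ..., n-1]
def prod(xs):     return math.prod(xs)
def square(xs):   return [x * x for x in xs]
def shift(k, xs):                                  # zero-fill shift
    n = len(xs)
    return [xs[i - k] if 0 <= i - k < n else 0 for i in range(n)]

def fact(n):      return prod(x + 1 for x in iota(n))
def l2norm(a):    return math.sqrt(sum(square(a)))
def convolve(w1, w2, a):
    return [w1 * l + w2 * c + w1 * r
            for l, c, r in zip(shift(-1, a), a, shift(1, a))]

print(fact(5))                        # 120
print(l2norm([3, 4]))                 # 5.0
print(convolve(0.25, 0.5, [0, 4, 0]))
```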

Arrays are large, but not cyclic or deeply nested.  So use reference
counting rather than gc.

Need to reduce overhead of RC; identify memory reuse opportunities.
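
The reuse opportunity, sketched in Python (my own illustration: with
RC, an "update" that holds the only reference can reuse the buffer in
place instead of allocating a copy, which a tracing GC can't know):

```python
allocations = 0

class RCArray:
    def __init__(self, data):
        global allocations
        allocations += 1
        self.data = list(data)
        self.rc = 1

    def incref(self):
        self.rc += 1
        return self

    def updated(self, i, v):
        """Functional update: reuse storage when we hold the only ref."""
        global allocations
        if self.rc == 1:
            self.data[i] = v       # sole owner: update in place
            return self
        self.rc -= 1               # drop our ref to the shared array
        new = RCArray(self.data)   # otherwise, copy-on-write
        new.data[i] = v
        return new

a = RCArray([1, 2, 3])
b = a.updated(0, 9)        # rc == 1: in place, no new allocation
shared = b.incref()        # now two references
c = b.updated(1, 8)        # must copy
print(allocations, b is a, c is b)   # 2 True False
```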

But, today's topic:  how to organize heap to make
allocation/deallocation work well w/ RC.   /sleep/

================================================================

Philipp Haller (EPFL)
Implementing Joins Using Extensible Pattern Matching

Joins as in the join calculus.  In scala...  looks
interesting... Can't stay awake...

================================================================

Lunch: Spoke to Neil Jones.  He's retired now, but is finishing off
a journal paper on termination analysis for untyped LC.  He uses CFA
to make a control-flow graph, and then counts changes in the size of
the environment along each edge.  The moral is that if every cycle in
the control-flow graph is net decreasing, your term terminates.
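
A caricature of that moral in Python (not Jones's analysis: edges
carry the change in argument size, and I just brute-force all simple
cycles, which is fine for toy graphs):

```python
def simple_cycles(edges):
    """All simple cycles in a weighted digraph, with their total weight."""
    nodes = {a for a, _, _ in edges} | {b for _, b, _ in edges}
    adj = {n: [(b, w) for a, b, w in edges if a == n] for n in nodes}
    cycles = []
    def dfs(start, node, path, total):
        for nxt, w in adj[node]:
            if nxt == start:
                cycles.append((path + [nxt], total + w))
            elif nxt not in path:
                dfs(start, nxt, path + [nxt], total + w)
    for n in nodes:
        dfs(n, n, [n], 0)
    return cycles

def terminates(edges):
    # every cycle must be net size-decreasing
    return all(total < 0 for _, total in simple_cycles(edges))

# factorial: the recursive call shrinks its argument by 1
fact_graph = [("fact", "fact", -1)]
# a loop that grows by 2 then shrinks by 1: net increasing
bad_graph = [("f", "g", +2), ("g", "f", -1)]
print(terminates(fact_graph))   # True
print(terminates(bad_graph))    # False
```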

This appears to be enough to prove termination for lots of interesting
terms, including factorial (written with Y!), Ackermann's function,
etc., even using Church numerals.

This sounds a lot like Matt Might's Logic-Flow Analysis.  Aaron, can
you say offhand how close this might be to what you did for Pete?

I invited him to stop by the next time he's in the US, so he could
talk to Olin, Harry, et al.

================================================================

John Reppy,
Toward a Parallel Implementation of Concurrent ML

Nice discussion of Concurrent ML -- the clearest I've seen.  It
cleared up some misconceptions I had about it, at least.  This is part
of Manticore, which Riccardo will be talking about soon.

================================================================

Ian Buck, NVIDIA
GPU Computing with CUDA (invited talk)

The folks at NVIDIA are trying to reach out to the general
high-performance computing community by making their cards usable for
non-graphics processing.

Their GeForce 8800 has 128 cores, grouped in sets of 16 that share a
local bunch of memory, plus a memory buffer ("data parallel memory")
for communication w/ CPU memory.  (And sells for $599 at Fry's)

Each core is a full-featured CPU with registers, room for "millions"
of instructions, etc.

Their slogan is: Computation is Free.  All the cost is in moving the
data around.  So this design gets the data next to the processor.
It's often cheaper to recompute rather than save & restore.

The local shared memory looks like a fast register, but reads/writes
to main memory take effectively forever and block the thread.  So they
assign _thousands_ of threads to each core to mask the latency.

CUDA exposes this to the programmer by some modest extensions to the C
language that allow the programmer to specify that certain quantities
will be kept in this "data parallel memory".

He showed some matrix code that made it look very simple.
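
I didn't copy the code, but the shape of the trick is tiling: each
block of threads loads a tile of the operands into the fast per-block
memory and works there, so main memory is touched once per tile rather
than once per output element.  A sequential Python rendition:

```python
def matmul_tiled(A, B, tile=2):
    """Square matmul, processed tile by tile (the lists a and b stand
    in for the per-block "data parallel memory")."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):
                # "load" the two tiles into fast memory
                a = [row[k0:k0 + tile] for row in A[i0:i0 + tile]]
                b = [row[j0:j0 + tile] for row in B[k0:k0 + tile]]
                # compute entirely out of the tiles
                for i in range(len(a)):
                    for j in range(len(b[0])):
                        C[i0 + i][j0 + j] += sum(
                            a[i][k] * b[k][j] for k in range(len(b)))
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_tiled(A, B))   # [[19.0, 22.0], [43.0, 50.0]]
```

In CUDA the two "load" lines become cooperative copies into
`__shared__` arrays with a barrier, but the loop structure is the same.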

Lots of research problems here.  To target their architecture, a
compiler needs to optimize in some very non-standard ways: ie, don't
use too many registers (to minimize save/restore costs), and make sure
the PCs in thread groups stay reasonably close together.

They are very interested in education.  Their chief scientist, David
Kirk, has developed a whole course on this stuff, which he's
taught at a university whose name I forget.  Plus it's all available
on the web.

CUDA is a free download, including an in-core simulator. They have a
bunch of configurations available for non-graphics use (ie GPU's
without the graphics components :), including deskside and blade
configurations.  They are anxious to work with the research community
and the speaker allowed as how they might make donations.

Sounds like neat stuff.  Do we have anybody interested in this?
Anybody who has large parallelizable computations for which this might
be useful?

================================================================

After this ended, I went across the hall to PLAN-X and caught

Gareth Smith
Hoare Reasoning About DOM

He gave a short introduction to Context Logic, a variation of
separation logic for talking about trees that was invented by Philippa
Gardner.  This is probably worth looking into for our reasoning about
contexts.

Then he showed how this scaled up to develop Hoare-style and
weakest-precondition logics for a core of the XML DOM.  This might be
of interest to Dave H.

====================  end  ============================================