wiiki - DuffsDeviceInDuffsOwnWords

This is the text of a document by TomDuff describing DuffsDevice, which he sent to comp.lang.c in 1988 when DuffsDevice came up. This version was taken from http://groups-beta.google.com/group/comp.lang.c/msg/bb78298175c42411, with the formatting intact, except where Wiki's TextFormattingRules interfered (WikiWords, italics, etc).
The original netnews version (from 1984) Tom refers to and reconstructed can be found at http://groups.google.com/groups?selm=2748%40alice.UUCP, and Tom presumably knows this.
From: td@alice.UUCP (Tom Duff)
Newsgroups: comp.lang.c
Subject: Re: Explanation, please!
Summary: Original citation
Message-ID: <8144@alice.UUCP>
Date: 29 Aug 88 20:33:51 GMT
References: <638@paris.ICS.UCI.EDU> <634@proxftl.UUCP> <660@proxftl.UUCP>
Organization: AT&T Bell Laboratories, Murray Hill NJ
Lines: 151

I normally do not read comp.lang.c, but Jim McKie told me
that "Duff's device" had come up in comp.lang.c again.  I
have lost the version that was sent to netnews in May 1984,
but I have reproduced below the note in which I originally
proposed the device.  (If anybody has a copy of the netnews
version, I would gratefully receive a copy at research!td or
td@research.att.com.)

To clear up a few points:
1)      The point of the device is to express general
        loop unrolling directly in C.  People who have
        posted saying `just use memcpy' have missed the
        point, as have those who have criticized it using
        various machine-dependent memcpy implementations
        as support.  In fact, the example in the message is
        not implementable as memcpy, nor is any computer
        likely to have an memcpy-like idiom that implements
        it.

2)      Somebody claimed that while the device was named
        for me, I probably didn't invent it.  I almost
        certainly did invent it.  I had definitely not
        seen or heard of it when I came upon it, and nobody
        has ever even claimed prior knowledge, let alone
        provided dates and times.  Note the headers on the
        message below:  apparently I invented the device
        on November 9, 1983, and was proud (or disgusted)
        enough to send mail to dmr.  Please note that I
        do not claim to have invented loop unrolling, merely
        this particular expression of it in C.

3)      The device is legal dpANS C.  I cannot quote chapter
        and verse, but Larry Rosler, who was chairman of the
        language subcommittee (I think), has assured me that X3J11
        considered it carefully and decided that it was legal.
        Somewhere I have a note from dmr certifying that all
        the compilers that he believes in accept it.  Of course,
        the device is also legal C++, since Bjarne uses it in
        his book.

4)      Somebody invoked (or more properly, banished) the
        'false god of efficiency.'  Careful reading of my
        original note will put this slur to rest.  The
        alternative to genuflecting before the god of
        code-bumming is finding a better algorithm.  It
        should be clear that none such was available.  If
        your code is too slow, you must make it faster.  If no
        better algorithm is available, you must trim cycles.

5)      The same person claimed that the device wouldn't exhibit
        the desired speed-up.  The argument was flawed in two
        regards:  first, it didn't address the performance of
        the device, but rather the performance of one of its
        few uses (implementing memcpy) for which many machines
        have a high-performance idiom.  Second, the poster
        made his claims in the absence of timing data, which
        renders his assertion suspect.  A second poster tried
        the test, but botched the implementation, proving
        only that with diligence it is possible to make anything
        run slowly.

6)      Even Henry Spencer, who hit every other nail square on
        the end with the flat round thing stuck to it, made a
        mistake (albeit a trivial one).  Here is Henry replying
        to bill@proxftl.UUCP (T. William Wells):
        >>... Dollars to doughnuts this code
        >>was written on a RISC machine.

        >Nope.  Bell Labs Research uses VAXen and 68Ks, mostly.

        I was at Lucasfilm when I invented the device.

7)      Transformations like this can only be justified by measuring the
        resulting code.  Be careful when you use this thing that you don't
        unwind the loop so much that you overflow your machine's instruction
        cache.  Don't try to be smarter than an over-clever C compiler that
        recognizes loops that implement block move or block clear and compiles
        them into machine idioms.

Here then, is the original document describing Duff's device:

From research!ucbvax!dagobah!td  Sun Nov 13 07:35:46 1983
Received: by ucbvax.ARPA (4.16/4.13)
        id AA18997; Sun, 13 Nov 83 07:35:46 pst
Received: by dagobah.LFL (4.6/4.6b)
        id AA01034; Thu, 10 Nov 83 17:57:56 PST
Date: Thu, 10 Nov 83 17:57:56 PST
From: ucbvax!dagobah!td (Tom Duff)
Message-Id: <8311110157.AA01034@dagobah.LFL>
To: ucbvax!decvax!hcr!rrg, ucbvax!ihnp4!hcr!rrg, ucbvax!research!dmr,
        ucbvax!research!rob

Consider the following routine, abstracted from code which copies an
array of shorts into the Programmed IO data register of an Evans &
Sutherland Picture System II:

        send(to, from, count)
        register short *to, *from;
        register count;
        {
                do
                *to = *from++;
                while(--count>0);
        },

(Obviously, this fails if the count is zero.)
The VAX C compiler compiles the loop into 2 instructions (a movw and
a sobleq, I think.)  As it turns out, this loop was the bottleneck in
a real-time animation playback program which ran too slowly by about 50%.
The standard way to get more speed out of something like this is to unwind
the loop a few times, decreasing the number of sobleqs.  When you do that,
you wind up with a leftover partial loop.  I usually handle this in C with
a switch that indexes a list of copies of the original loop body.  Of
course, if I were writing assembly language code, I'd just jump into the
middle of the unwound loop to deal with the leftovers.  Thinking about this
yesterday, the following implementation occurred to me:

        send(to, from, count)
        register short *to, *from;
        register count;
        {
                register n=(count+7)/8;
                switch(count%8){
                case 0: do{     *to = *from++;
                case 7:         *to = *from++;
                case 6:         *to = *from++;
                case 5:         *to = *from++;
                case 4:         *to = *from++;
                case 3:         *to = *from++;
                case 2:         *to = *from++;
                case 1:         *to = *from++;
                        },while(--n>0);
                },
        },

Disgusting, no?  But it compiles and runs just fine.  I feel a combination
of pride and revulsion at this discovery.  If no one's thought of it before,
I think I'll name it after myself.

It amazes me that after 10 years of writing C there are still little corners
that I haven't explored fully.  (Actually, I have another revolting way to
use switches to implement interrupt driven state machines but it's too
horrid to go into.)

Many people (even bwk?) have said that the worst feature of C is that
switches don't break automatically before each case label.  This code forms
some sort of argument in that debate, but I'm not sure whether it's for or
against.

                        yrs trly
                        Tom