updated 2002-12-21.
Some notes on Computer Architecture. Very incomplete.
Contents:
[FIXME: should I put links to Beowulf here ?]
David also maintains related files:
``I wisdom dwell with prudence, and find out knowledge of witty inventions.'' -- Proverbs 8:12
computer architecture news
The latest computer architecture information is at
news:comp.arch.hobbyist mirror http://groups.google.com/groups?group=comp.arch.hobbyist and
news:comp.arch.embedded mirror http://groups.google.com/groups?group=comp.arch.embedded
The LEON core is a SPARC* compatible integer unit developed for future space missions. It has been implemented as a highly configurable, synthesisable VHDL model. To promote the SPARC standard and enable development of system-on-a-chip (SOC) devices using SPARC cores, the European Space Agency is making the full source code freely available under the GNU LGPL license.
The LEON core has been extensively tested against the SPARC V8 architecture manual and the IEEE-P1754 (SPARC) standard, but have not been formally tested and certified by SPARC international as being SPARC V8 compliant. ...
DO YOU WANT TO HELP? If you wish to contribute to LEON and work on (or donate) any of these modules, please Jiri Gaisler.
X-URL: http://www.egroups.com/list/f-cpu/ Date: Tue, 12 Jan 1999 21:47:57 -0500 (EST) From: Robert Dale <rob at nb.net> To: F-CPU <f-cpu at egroups.com> Subject: [f-cpu] CVS server We have anonymous (read-only) CVS access to AlphaRISC's TTA sim he kindly wrote. $ export CVSROOT=:pserver:anonymous@www.deepfreeze.org:/home/src/f-cpu $ cvs login (Logging in to anonymous@www.deepfreeze.org) CVS password: <hit Enter> $ cvs co f-cpu If you need help and/or want write access, feel free to email me. We'll probably put some help documentation on the website. ;) -- Robert Dale
is one of the architectures suggested for the F-CPU. Its major advantage is: gcc has already been ported to it, so we don't need to write a new compiler on top of everything else. So we can immediately compile the Linux kernel and other code for this chip *now*, before we even start doing anything else, which should make it easy to cache analysis and other trade-off analysis. Also, it means that when the hardware is turned on for the first time, it could boot directly into Linux, which would be cool.
(Can we really go to 64 bit architecture and still use this compiler ?)
(Can we really use a TTA architecture for the attached FPUs ?)
is one of the architectures suggested for the F-CPU.
// $X = ($Y ^ rM) v ($Z|Z ^ ~rM). MUX $X, $Y, $Z|Z // sequence equivalent to MUX $X, $Y, $Z|Z AND $t1, $Y, $rM ORN $t2, $rM, $Z|Z ORN $X, $t1, $t2. // sequence equivalent to MUX $X, $Y, $Z|Z NAND $t1, $Y, $rM ORN $t2, $rM, $Z|Z NAND $X, $t1, $t2. // special peephole optimization for MUX $X, $Y, 0 AND $X, $Y, $rM // special peephole optimization for MUX $X, $Y, $Z // where $Z contains all ones: ORN $X, $Y, $rM
// simulate MXOR $X, $Y, $Z|Z NOT $znot, $Z|Z NOT $ynot, $y MOR $t3, $y, $znot MOR $t4, $ynot, $Z|Z AND $X, $t2, $t4)
NOT $X, $Z|Z // set $X = ~$Z, i.e., bitcomplement; i.e., flip all the bitscan often be merged with previous or following bit instructions by replacing AND, OR with NAND, NOR, ANDN, ORN. When it can't, the assembler can use any of these equivalent instructions:
// When Z is a register, either one of these instructions // bitcomplements it. NOR $X, $Z, 0 ANDN $X, $Z, 0 // When Z is a literal in the range -1 <= Z < #FF, // we can complement it in one instruction // at compile time. // if Z is outside this ranges, // just load ~Z directly into a register. SUB $X, zero, (1+Z) // sets $X = ~Z for -1 <= Z <= #FE. ANDN $X, zero, Z // sets $X = ~Z for 0 <= Z <= #FF.
is one of the architectures suggested for the F-CPU.
(only with FPGAs can you have reconfigurable computing)
see also robot_links.html#PLD Programmable Logic (FPGA, PLD, CPLD, etc.) and the devices needed to program them.
see also simple_cpu for simple CPU architectures that one would think would be simple to implement in a FPGA.
see also vlsi.html#PCI_on_FPGA for FPGA devices compatible with the PCI buss.
The b16 Processor is a minimalistic stack processor, inspired by Chuck Moore's recent work. The original incarnation is 16 bit, byte addressed. There are 32 instructions, each is 5 bits. Three and a bit (literally!) are packed into a 16 bit word (bundle). The first slot in the bundle can only be a nop or a call.
-- http://b16-cpu.de/ (designed in Verilog)
DAV:
As CPUs get relatively faster than RAM, preload becomes more important.
After deciding on an address and sending it to RAM,
it takes several CPU cycles before the RAM is able to reply with the data.
If the very next instruction requires that data,
then the CPU has to wait.
But if the instruction set is designed so that
the programmer can designate an address ahead of time,
this allows the programmer to squeeze in a few ``internal'' instructions
instead of pausing.
(This is very similar to the "branch delay slot" on some RISC processors).
Early (asynchronous) RAM chips allowed you to present the data at the same time as the address
when you wanted to write to the RAM chip. This *seems* faster
than presenting the address, then later presenting the data, in isolation.
But in the context of alternating reads and writes,
where naturally the data coming from the RAM only arrives a memory cycle *after*
the RAM is presented with an address,
it turns out to be faster to always delay the data (whichever direction it is going)
for writes as well. This leads to synchronous RAM ( in particular, SDRAM --
but SRAM stands for something different: static RAM).
DAV: does simply A! help, or do we need to also indicate whether it will be a load or a store ?
There are several improvements that Chuck has added to his newer Forth virtual machine model.
One of them is the address register and the other is the circular stacks.
Chuck has explained that hardware considerations aside,
the idea of the address register was that the Forth words @ and ! ( fetch and store)
were clumsy
at the top of the stack and
were based on smaller atomic operations that the programmer could take advantage of.
@ was broken into two operations A! and @A. Likewise ! was broken into A! and !A.
...
advantage is the use of auto-increment addressing memory access opcodes
...
--
Jeff Fox, talking about Chuck Moore
http://www.ultratechnology.com/forth3.htm
[FIXME: Why isn't this listed on that page ? ]
[FIXME: get this book _HDL Chip Design_ book by Douglas J. Smith, 1998, Doone Publications, ISBN 0-9651934-3-8 ( Computer Literacy , $65+tax). ]... I strongly recommend the book HDL Chip Design ...
Prentice-Hall has published a Xilinx FPGA lab book and win95/NT software which includes a coupon for an XESS FPGA board. Adding the cost of a power supply, you can be doing experiments for less than $250. ...
Only 3 instructions:
br adr Load adr into PC if F=0 Clear F add adr1,adr2 Add the contents of memory location adr1 and the contents of memory location adr2 and store the result in memory location adr2. Store the carry of the addition in F. nor adr1,adr2 Calculate the NOR function of the contents of memory location adr1 and the contents of memory location adr2 and store the result in memory location adr2. F=1 if result =0 , else F=0.and only 1 user-accessible register (PC) and only 1 user-accessible flag (F).
DAV: Unfortunately, self-modifying code is required for subroutine call/return. This makes re-entrant code impossible.
DAV: some simple macros by DAV:
; These macros require 3 special locations: ; ZERO: a location reserved to handle the value 0. ; ONES: a location reserved to handle the value with every bit set (sometimes called -1 or ~0). ; Set them up with ; really_clear ZERO ; really_mov -1, ONES ; . ;;;;;;;;; single-instruction macros (aliases) not a ; invert each bit in a nor a,a ; or alternatively nor ZERO, a ; clear a ; clear all bits in a to zero. Side effect: sets F1 nor ONES, a ; mov 0, a ; clear a ;;;;;;;;; multi-instruction macros (perhaps should be subroutines) really_clear a; -- this is used only to set up location ONES and ZERO; add a,a add a,a add a,a ... (repeated for 16 bits, or whatever the wordsize is) add a,a really_mov -1, a; -- this is used only to set up location ONES; really_clear a not a mov -1, a ; set location a to the value -1 clear a not a mov -2, a ; set location a to the value -2 mov -1, a add a,a mov 1, a ; set location a to the value 1 -- this is used to set up location ONE mov -2, a not a ; this expands to the 4 instructions clear a not a add a,a not a rotl a ; rotate a left, moving MSB to LSB ; side effect: always clears F. add a,a br continue: add ONE, a continue: rotlc a ; move F into LSB of a, shifting all bits left, leaving old MSB in F br F_was_0: add a,a br F_was_1_high_bit_was_0 ; ; handle case of ``F was 1, high bit was 1'' add ONE, a br set_F_and_continue ; execution never gets here F_was_1_high_bit_was_0: add ONE, a br continue ; execution never gets here F_was_0: add a,a ; the next 3 instruction seem redundant in the F_was_zero case, ; but they are necessary to handle the F_was_1 case. br continue set_F_and_continue: ; this must fall through to the end, because branches always clear F nor ONES, ZERO ; continue: mov a, b ; set location b to the value in location a. WARNING: doesn't work when a=b. clear b add a,b jump addr ; unconditionally jump to given address (side effect: clears F) ; isochronous code should use this: add ZERO, ZERO br addr ; alternatively, one instruction quicker when F=0: br addr br addr
lots of details -- the complete schematic diagram of the CPU, a pretty photograph of the test board (FPGA + RAM + ROM + some LEDs to simulate a traffic light) ... some assembly-language code ...
Date: Thu, 11 Jun 1998 05:32:23 +0200 (MET DST)
To: David Cary <d.cary at ieee.org>
From: Majordomo at lslsun.epfl.ch
Subject: Welcome to nlc
--
Welcome to the nlc mailing list!
If you ever want to remove yourself from this mailing list,
you can send mail to "Majordomo@lslsun.epfl.ch" with the following command
in the body of your email message:
unsubscribe nlc David Cary <d.cary@ieee.org>
Here's the general information for the list you've
subscribed to, in case you don't already have it:
This is a mailing list for discussing the 'C -> netlist' compiler, or 'nlc'
for short.
To send mail messages to the nlc mailing list, send your message to:
nlc@lslsun.epfl.ch
I have made nlc, the C to netlist compiler, available for anonymous
ftp on lslsun5.epfl.ch (128.178.150.25) in /pub/nlc-0.9.tar.gz.
It is compressed using gzip (available at a GNU archive site near you).
A compiled binary version for Sparc Solaris 2.x is in the file
/pub/nlc-0.9.bin.gz.
If you have any trouble getting the file, let me know. I can probably
mail it to you.
For those that are interested in the Spyder project, you will find
below references to already published work.
Please let me know if you have trouble locating these papers, and I'll
see what I can do.
Have fun and please keep in touch,
Christian
---
@Article{key,
author = "Christian Iseli and Eduardo Sanchez",
title = "Spyder: A {SURE} ({SU}perscalar and
{RE}configurable) Processor",
journal = "The Journal of Supercomputing",
year = 1995,
volume = 9,
number = 3,
pages = "231-252"
}
@InProceedings{key0,
author = "Christian Iseli and Eduardo Sanchez",
title = "A Superscalar and Reconfigurable Processor",
volume = 849,
series = "Lecture Notes in Computer Science",
pages = "168--174",
booktitle = "Field-Programmable Logic Architectures, Synthesis
and Applications",
year = 1994,
publisher = "Springer-Verlag",
address = "Prague, Czech Republic",
month = "September"
}
@inproceedings{key1,
author = "Christian Iseli and Eduardo Sanchez",
title = "{Beyond Superscalar Using FPGAs}",
booktitle = "IEEE International Conference on Computer Design",
address = "Cambridge Mass.",
month = "October",
year = 1993
}
@inproceedings{key2,
author = "Christian Iseli and Eduardo Sanchez",
title = "{Spyder: A reconfigurable VLIW processor using FPGAs}",
booktitle = "IEEE Workshop on FPGAs for Custom Computing Machines",
address = "Napa",
month = "April",
year = 1993
}
@InProceedings{key3,
author = "Christian Iseli and Eduardo Sanchez",
title = "Augmentation du parall\'{e}lisme par la reconfigurabilit\'{e}",
editor = "L. Boug\'{e} and M. Cosnard and P. Fraigniaud",
pages = "3--6",
booktitle = "Actes des $6^{\`{e}me}$ Rencontres Francophones du
Parall\'{e}lisme, RenPar'6",
year = 1994,
organization = "\'{E}cole normale sup\'{e}rieure",
address = "Lyon",
month = "June"
}
@InProceedings{key4,
author = "Serge Durand and Christian Iseli",
title = "Developing a reconfigurable coprocessor on an {SBus} board",
pages = "5--9",
booktitle = "Sun User Group 1993 Proceedings",
year = 1993,
address = "San Jose, California",
month = "December"
}
@inproceedings{key5,
author = "Christian Iseli and Eduardo Sanchez",
title = "{A \Cplusplus{} compiler for FPGA custom execution units synthesis}",
pages = "173--179",
booktitle = "IEEE Symposium on FPGAs for Custom Computing Machines",
address = "Napa, CA",
month = "April",
year = 1995
}
---
Here are some more info about Brent Chapman's "Majordomo" mailing list manager.
In the description below items contained in []'s are optional. When
providing the item, do not include the []'s around it.
Majordomo understands the following commands:
subscribe <list> [<address>]
Subscribe yourself (or <address> if specified) to the named <list>.
(You probably have done that already if you are receiving this message).
unsubscribe <list> [<address>]
Unsubscribe yourself (or <address> if specified) from the named <list>.
get <list> <filename>
Get a file related to <list>.
index <list>
Return an index of files you can "get" for <list>.
which [<address>]
Find out which lists you (or <address> if specified) are on.
who <list>
Find out who is on the named <list>.
info <list>
Retrieve the general introductory information for the named <list>.
lists
Show the lists served by this Majordomo server.
help
Retrieve this message.
end
Stop processing commands (useful if your mailer adds a signature).
Commands should be sent in the body of an email message to
majordomo@lslsun.epfl.ch
Commands in the "Subject:" line NOT processed.
If you have any questions or problems, please contact
majordomo-owner@lslsun.epfl.ch
From: Christian Iseli <chris@lslsun.epfl.ch>
Date: Wed, 12 Aug 1998 13:53:07 +0200 (MET DST)
To: nlc@lslsun.epfl.ch
Subject: Re: Is this list dead ?
Sender: owner-nlc@lslsun.epfl.ch
Precedence: bulk
>I heard this mailing list talked about converting (mathematically
>intensive) C programs to run at high speed on FPGAs.
>
>I thought I would lurk and learn, but I haven't gotten any messages since I
>got the "Welcome to nlc" message on 1998-06-11.
>
>Is this list dead ?
Not really, but it was never very lively either... ;-)
As it happens, I'm now done with my PhD thesis and since I had no means of
pursuing nlc further, I now have another job and have little time to actively
develop nlc.
The PhD thesis report is available at:
http://lslwww.epfl.ch/pages/publications/rcnt_theses/home.html
The latest nlc source code is available at:
ftp://lslwww.epfl.ch/pub/
HTH. Cheers,
Christian
The exact opposite thing is also being developed:
vhdl2c "vhdl2c is a vhdl to `C' converter by Michael Knieser"
see also "Tiny robots" robot_links.html#tiny
I am fascinated by CPU designs that are extremely simple, approaching "minimal", in several (sometimes incompatible) senses of the word "simple".
Here "CPU", "MCU", and "PE" are almost interchangeable.
Often these characteristics exhibit synergy -- when you eliminate some opcodes, that eliminates some gates (making it lower-cost) and makes it easier to understand (less documentation required). Occasionally, though, driving a design to extreme simplicity according to one measure causes extra complexity in another area to compensate. This is called the the Turing tarpit #tarpit .
For purposes of robotics, "low-cost" and "Simple interface" are usually the dominant considerations. Some of the concepts of massively parallel processing are also present when one builds swarms of simple robots, but the kind of random communication between constantly-changing arrangements of simple robots is very different than the communication between the rigidly connected (most commonly in a 2 D mesh or a hypercube) elements of common cellular automata and multi-processors.
See also Using FPGAs to simulate CPUs #FPGA for some very simple CPUs designed to fit onto FPGAs. 2 very different reasons for simplicity there: (a) a simpler CPU can fit onto a smaller FPGA, making it much less expensive. (b) making the CPU smaller allows you to fit more copies of the CPU on a given FPGA, increasing MIPs at no cost. [FIXME: should combine these into 1 section ? Since the same op-code set may be implemented many ways, in TTL, in FPGA, in custom VLSI, it doesn't really make sense to split them into seperate sections ... On the other hand, ``less hardware'' means something a little different in these 3 technologies. ]
[here I *list* simple CPUs I know about; Opcode considerations #considerations and #FPGA also talk about tips for designing new CPU architectures. ]
... Here I also list all the CPUs I know about that were built up out of TTL.
...
The Block I version used 4,100 ICs, each containing a single 3-input NOR gate. The later Block II version used dual 3-input NOR gates in a flat-pack; appoximately 5,600 gates in all. The gates were resistor-transistor logic (RTL). ...
The instruction format was 3 bits for opcode, 12 bits for address.
(details about each instruction in the instruction set there). (Knowing what we now know about MISC, how much smaller could we make a "reasonable" CPU out of NOR gates ? )
For as long as I've been using microprocessors -- and I was designing the Intel 8080 into radiation-monitoring equipment in 1977 -- I've always itched to have control over the instruction set. The makers always seem to leave something out....
Picaro makes a pleasant change from systems with 16-MB minimum memories and operating systems no human being can comprehend.
describes a very easy-to-build system: There's the CPU connected to some EPROM (for programs and data) and a crystal. The EPROM is serial EPROM (makes things *much* easier to wire up). The "CPU" is really a PIC with an internal interpreter with 2 modes. If the magic value is discovered in EPROM, the PIC starts fetching and "executing" (interpreting) the program in EPROM. Otherwise, it assumes the EPROM is blank, and waits for a properly-formatted program to stream in on the serial port which it programs into the EPROM. [FIXME: todo: build something like this ... but with a completely different instruction set, natch.]
(in other words, no ROM or FLASH is needed for initial testing).From: Ben Franchuk Date: Wed Jul 25, 2001 7:10 am Subject: Re: [fpga-cpu] A debugger for xr16 the easy way... On my cpu on reset a bootstrap loader is run from the serial input. This way all I need for a external parts is [RAM] and a buffer for serial lines from the FPGA.
``The Jupiter Ace hardware page: How to build your own !'' by Grant Searle BSc. http://www.home-micros.freeserve.co.uk/JupiterAce/JupiterAce.html
Build your own ZX81 http://www.home-micros.freeserve.co.uk/zx80/zx80nmi.html
interesting wire-wrap style ... see ``Back of PCB'' photo. While this uses a Z80 cpu (and so is ``cheating'' in the context of building the CPU itself out of standard components), the TV / monitor output circuitry may be interesting... [FIXME: computer museum] [FIXME: crosslink to schematic.html ?]... old micros which can still be built since they don't use custom components. ...
-- http://www.pcengines.com/toy2.htm`` TOY/2 - a minimal 16 bit CPU
TOY/2 was designed by Pascal Dornier and Stephan Paschedag ... in 1988 ...
TOY/2 can address a full 64K word program and memory address space. ... and takes up a whopping 3300 transistors (excluding I/O pads). ...
All instructions are 16 bit, 4 bits for the opcode, and 12 bits for the direct address. ''
DAV: I think the JMP instruction should be P := [vec] to make consistent with description and the conditional branches, and to allow jumping to code anywhere in the full 64 Kword program space. Only 15 defined instructions:
Programmer-visible registers: A: accumulator P: program counter T: temporary register for indirect store C: carry flag Z: zero flag (DAV: does this *always* indicate the current state of A ?) ; 3 for program flow control JMP vec P := [vec] ; indirect jump BCC vec IF (carry clear) then PC := [vec] BNE vec IF (not zero, i.e., 0==Z) then PC := [vec] ; 3 for direct load and store LDC src A := [src], C:= 0. ; load A and clear carry LDA src A := [src] ; load A, don't clear carry STA dest [dest] := A ; ; 3 logical ops (op) src A := A (op) [src] ; arithmetic ops: ; (op) is one of: XOR, OR, AND. ; 2 arithmetic ADC src A := A + [src] SBC src A := A - [src] - C ; subtract ; 4 implicit instructions that don't use the 12 bit address: ROR A,C := A,C ror 1 ; rotate right through carry TAT T := A ; transfer A to T LDI A := [A] ; load A indirect STT [A] := T ; store T indirect
DAV: Given that code is in ROM, and we choose some arbitrary RAM area ``stack_start'' to hold the stack, and we choose some other arbitrary RAM location SP to hold the return stack pointer, call instructions could be implemented like this:
(Is this the most compact implementation of CALL ?). ... ; DAV: This implementation of CALL eats up ; 2 instructions + 1 word of ROM = 3 words per call. LDA $+2 JMP &(banana1) LDA $+2 JMP &(banana2) ... ; linker must allocate a word of ROM to hold the $+2 value, 1 per CALL. ; linker must allocate a word of ROM to hold the &(banana2) value as well, ; but only 1 per subroutine, shared among all calls of banana2(). banana1: ; non-leaf subroutine ; we have return address in A, ; and SP points to an empty location to store it. ; adjust SP to point to empty location TAT LDA SP STT ADC &(1) ; assembler must allocate word of ROM to hold the constant 1 STA SP ... ; return sequence JMP &(non_leaf_return) non_leaf_return: ; common to all non-leaf subroutines LDA SP ADC &(-1) STA SP LDI STA temp JMP temp banana2: ; leaf subroutine ; we have return address in A, ; and SP points to an empty location to store it. TAT LDA SP STT ; leave SP full until end of routine ... ; return sequence JMP &(subroutine_return) subroutine_return: ; common to all subroutines LDA SP LDI STA temp JMP temp
DAV: From the instruction set listed here, I attempted to reverse-engineer a schematic suitable for implementation in TTL. I wonder how close this matches the original design by Pascal Dornier and Stephan Paschedag ?
TOY/2 as reverse-engineered by David Cary
(Warning: not yet implemented -- does this really work ?)
Every instruction takes 2 cycles:
1 SRAM memory cycle cycle (either reading [address] or [A], or writing T to [A]).
1 SRAM memory cycle to read next instruction (from [PC]).
Registers are labeled with their contents in the 1st cycle;
PC and A are swapped in the last cycle of the instruction.
+------------------------------+
| +----------+ |
| | V V
| | +-------------------+ +-------+
| | | opcode /\ address | | A |(PC)
| | +-------/ \--------+ +-------+
| | V V V
| | (control | +--------+
| | lines) | | V
| | V V |
| | +-\/--+ |
| | 1->| mux | |
| | +-----+ |
| | V |
| | (note 1) V address |
| | | +------------+ |
| | V | | |
| | +---+ | SRAM | |
| |2->| T | | | |
| | +---+ +------------+ |
| | V ^ data I/O |
| | | V V
| | +--------->+<| buf |--<+
| ^ V V
| +----------------+ |
| V V
| +-----------+ +----+
| | [address] | | PC |(A)
| +-----------+ +----+
| V V
| +------\ /------+
| 4->\ \/ /-->Z,C
| \ ALU /
| +----------+
| V
^ |
+-----------------------+
Control line summary:
0: instruction register : opcode : next value always latched at end of last cycle.
0: instruction register : address : next value always latched at end of last cycle,
never used during last cycle. (Perhaps share with some other register only used during last cycle ?)
0: [address] transparent register: always transparent during 1st cycle,
latched at end of 1st cycle, and held during last cycle.
0: A(PC) next value always latched at end of every cycle
0: PC(A) next value always latched at end of every cycle
2: Mux selection: depends on opcode and C and Z during 1st cycle;
always comes from PC on last cycle.
2: T: output enabled only on STI ( [A] := T ) instruction; latched only during TAT instruction.
1: SRAM: last cycle always a read; 1st cycle may be read or write.
4: ALU function select (is this really all the control signals needed ?)
0: Z: zero flag always set or cleared during last cycle to reflect current state of A.
1: C: new value of carry flag only latched on LDC, ADC, SBC, ROR instructions.
--
10 bits ? Is this all ?
Notes:
1. Only 1 instruction loads T: the TAT instruction.
There are several ways to implement this:
* the output of A(PC) (during the 1st cycle)
* the output of PC(A) (during the last cycle), or
* the output of the ALU (during the last cycle),
whichever is easiest.
2. The register to hold [address] must be transparent,
to implement JMP and BCC and BNE properly.
3. There's still room for 1 more instruction. What should it be ?
Some options that require *zero* more hardware (just setting up the uinstruction decoder):
* PC-relative jumps: PC := PC + [vec] (for position-independent code)
* TPT: Transfer PC to T: T := PC (to allow position-independent code)
* T := PC + [vec] (to allow position-independent code)
* [A] := PC+1 (to allow position-independent CALL)
This implementation calculates next PC first (using transparent [address] register), then next A.
(I think this ended up simpler than calculate next A first, then next PC).
(Of course, faster CPUs calculate both at the same time,
and have Harvard-style fetches from data cache simultaneous with fetches from instruction cache).
"Microcoded Versus Hard-wired Control: A comparison of two methods for implementing the control logic for a simple CPU" article by Phil Koopman BYTE Magazine January 1987 http://www.skidmore.edu/~pvonk/cs318/calendars/Documents/hard-wired.pdf describes the Toy CPU.
The schematic diagram of the PISC-1a can be downloaded here. DAV: While the paper notes ``no conditional branch microinstruction -- an important need'', I agree that conditionals are important, but "machines do not need conditional branches, they only need conditional subroutine return instructions." -- Philip J. Koopman, Jr. #conditional_returnThe PISC is a processor constructed from discrete TTL logic, which illustrates the operation of both hardwired and microcoded CPUs. ... simple hardware modifications demonstrate interrupts, memory segmentation, microsequencers, parallelism, and pipelining. A standalone PISC board should be an economical and effective tool for teaching processor design.
... Pathetic Instruction Set Computer ... Requiring only 22 standard TTL chips (excluding memory), it is well within the ability of a student to construct and understand. Its writeable microprogram store uses inexpensive EPROM and RAM. ...
Microprogram store and main program store are one and the same. Indeed, the PISC has characteristics of both a hardwired CPU and a microcoded CPU.
... Many weaknesses of the PISC become evident after a short period of use ... It can be argued that the PISC is a valuable educational tool because these faults, and several potential solutions, are painfully obvious. ...
... 2100 gates ...
Subject:
Re: Home-made CPUs
Date:
20 Nov 1998 00:00:00 GMT
From:
jones@cs.uiowa.edu (Douglas W. Jones,201H MLH,3193350740,3193382879)
Organization:
The University of Iowa
Newsgroups:
sci.electronics.design
From article <365533b8.334246@wingate>,
by t.dorrington@dial.pipex.com (Tim Dorrington):
>
> Has anyone out there managed to design/build their own very simplistic
> CPU from basic logic chips, RAM and ROM? I know that this is
> obviously a huge project which has no real practical use, but I just
> wondered if anyone has done it for the challenge?
It's been done, and it's not that huge a project. Reasonable CPU designs
take about 50 SSI and MSI TTL chips. See the Ultimate RISC and the
Minimal CISC architectures indexed on http://www.cs.uiowa.edu/~jones/arch/.
A fair number of people have built that particular RISC.
Of course, I just wrote the specs for it, you've got to figure out how to
reduce it to a chip level design. It's a fun exercise!
Doug Jones
jones@cs.uiowa.edu
Mountain View Press: The Forth Source http://www.TheForthSource.com/ Sells
WISC CPU/16 - A patented hardware design of a 16-bit stack machine with a user writable instruction set. Available in several forms: Complete schematics and documentation with which you can build your own ($50.00); Kits with all of the parts to wire wrap your own ($500.00); bare PC board which you can stuff with your own chips ($250.00); and Assembled and tested system ($750.00). Design uses a PC/AT as an I/O server. FAST!
and also sells lots of educational materials for learning the Forth programming language.
(FIXME: move to http://c2.com/cgi/wiki?WritableInstructionSetComputer ? )
-- _Stack Computers: the new wave_ book by Philip J. Koopman, Jr. 1989 http://www.cs.cmu.edu/~koopman/stack_computers/... one major design tradeoff ... hardwired control and microcoded control. ...
An introduction to the concepts ... may be found in (Koopman 1987a).
Hardwired designs traditionally allow faster and more space efficient implementations to be made. ...
... a microcoded machine can use fewer bits to specify the same possible instructions, ...
... microcoded implementations are more convenient to implement in discrete component designs, so they predominate in board-level implementations. Most single-chip implementations are hardwired.
... Koopman, P. (1987a) Microcoded versus hard-wired control. Byte, January 1987, 12(1) 235-242
DAV: the ``Mark's 8 chip uP'' at the venturalink site looks very clever. I am very impressed. It is far more compact than any other TTL/PAL based CPU I've seen. (It is even smaller than most ``single-board computers'' that use monolithic CPU chips). It makes me want to design a CPU again.
From: Bill Buzbee (bill at buzbees.com) Subject: Re: building a 8 or 16 bit cpu out of TTL parts Newsgroups: comp.arch.hobbyist Date: 2002-10-30 14:30:45 PST "john dobbs" wrote... > Anyone ever thought of building a 8 or 16bit cpu using traditional TTL > chips (NANDs,ORs,Flipflops etc) I could see how it would be a useful > learning device, or for the hardcore guys using transistors/diodes and > no IC's period. I agree with the other posters that there's no *rational* reason to do this using TTL - FPGA is the way to go these days. However, if you're feeling irrational, it can be a fun experience. Here's a link to recent simple CPU project you should check out: http://www.venturalink.net/~jamesc/ttl/ Also, I found the following books very useful when trying to come up to speed on the subject: Digital Computer Electronics, by Albert Paul Malvino, McGraw-Hill, 1977 Understanding Digital Computers, by Forrest Mimms III, Radio Shack, 2nd. Edition,1987. Good luck, ..Bill Buzbee
``... building my own computer from scratch. By "scratch", I mean designing my own instruction set, wire-wrapping a CPU out of a pile of 74 series TTL devices and writing (or porting) my own assembler, compiler, linker, text editor and very rudimentary operating system.
... User and supervisor modes exist, along with hardware address translation ...
... I really want to better understand hardware and operating systems. ... pushing ... the limits of my hobbyist abilities, ... support for preemptive multitasking and paging to enable me to support a "real" (though, I hope, greatly simplified) operating system for my machine.
... wherever possible I'll trade off speed for simpler circuits. Also, as I get closer to actually having to start wrapping wires, I find myself more freely trading off speed for fewer connections.
Compactness. One of my pet peeves is how bloated modern software is. I think there's a lot you can do with 128K bytes of addressing, and I like the idea of keeping things compact and utilitarian. I've tried to construct an expressively dense set of 1-byte opcodes.
At the end of the day, I'd like to have a working, and useful, machine that I understand completely. Oh, and it's also got to have a real front panel with lots and lots of cool blinky lights. ''
The ``links'' page points to many other web pages and books that deal with: small processor design and implementation, digital logic, retargeting C compilers and Forth compilers. [FIXME: does that make this section redundant ?]
[FIXME: finish reading]... relatively inexpensive project (<$100 including power supply, ICS, breadboards and assorted hardware) ...
cellular automata is related to other interests I have:
[FIXME: CA stuff scattered elsewhere ...]
The cellular automata questions I'm most interested in are:
There's a little loop here -- first we use (simulated) cellular automata to learn about replication, then we use that information to design replicating tools. We also use (simulated) cellular automata to explore good rules and good patterns for computronium. Then we use those replicating tools to (build enough copies of themselves to) build computronium. Then we use *that* computronium (hardware cellular automata) as a better computer.
Patterns, Programs, and Links for Conway's Game of Life http://www.radicaleye.com/lifepage/ [FIXME: this page lists a lot of the Conway information I have here. Mirror that page; send the author other related links I've collected; delete redundant stuff from my pages]
"Excellent paper (several chapters of a forthcoming book, actually), if you are interested in cellular automata, reversible computation, hardware implementations and CAM modelling, you should check it out. ... Caution, this expands into 17381429 Bytes of PostScript." -- recc. Eugene Leitl <eugene@liposome.genebee.msu.su>
DAV: I agree. This paper has good stuff about the interaction (co-evolution) of hardware and software, efficient realization of computational architectures, some of the background thinking behind the massively parallel CAM-8 processor, and likely directions of future computing devices." -- DAV
the theory behind the CAM-8 http://www.im.lcs.mit.edu/cam8.html a dedicated cellular automata processor. http://www.im.lcs.mit.edu/broch/med1.html mentions that it is fast at generating the data needed for holograms.
Small size is not everything - Q*Bert-based systems will require more cells for some tasks than the equivalent when expressed in (say) the Margolus neighbourhood. However for the majority of applications which involve performing cmputations, the additional richness provided by a larger LUT seems likely to be wasted much of the time, doing the equivalent of transporting signals from one point to another. We believe the advantage of the smaller cells in terms of their smaller size and higher update frequency will commonly win out.
Finally the Q*Bert neighbourhood is essentially hexagonal - and there is a general additional efficiency of hexagonal packing schemes compared to rectangular ones - a matter I will not go into further here.
[89] Toffoli, T. and N. Margolus, _Cellular Automata Machines -- a new environment for modeling_, MIT Press (1987)
DAV: Which of the 2 CA rules detailed in the _Crystalline Computation_ paper is more "interesting" (can build more compact logic circuits out of): "BBMCA", or "Critters" ? (David Cary thinks they are both "more interesting" than the Conway rules) More importantly, since hydrodynamics required a triangular grid, what rules are most "interesting" on triangular grids ? (Perhaps that grid has rules that are "more interesting" than any square-grid rules).
(a) blocking on even steps: aabbccddee..... aabbccddee..... ffgghhiijj..... ffgghhiijj..... kkllmmnnoo..... kkllmmnnoo..... ... ...zz blocking on odd steps: abbccddee.....a fgghhiijj.....f fgghhiijj.....f kllmmnnoo.....k kllmmnnoo.....k ... abbccddee...zza (b) OO -> OO OO OO *O -> *O OO OO *O -> O* O* *O *O -> O* *O O* O* -> ** ** *O ** -> ** ** **
Fig. 1.5 The invertible "Critters" CA.
(a) The solid and dotted blockings are used alternately.
(b) The Critters rule.
The Critters rule is ... "rotationally symmetric" ... "conserves 1's (particles), and only 3 cases change." ... "each of the 16 possible initial states of a block is turned into a distinct result state. Thus the Critters rule is invertable." ... "Unlike the gliders in Life, Critters gliders are quite robust. ... when two of these gliders collide in an empty region ... ... If nothing hits this blob for a while, we always see at least one of the gliders emerge."
(a) OO -> OO OO OO *O -> OO OO O* *O -> O* O* *O *O -> *O *O *O O* -> O* ** ** ** -> ** ** **
Fig. 1.8 A invertible Billiard Ball Model CA.
(a) The BBMCA rule
...
"rotationally symmetric"
"conserves 1's (particles), and only 2 cases change."
"we can recover macroscopic 2D hydrodynamics from a model that is only slightly more complicated than the HPP gas. A single-speed model with 6 particles per site, moving in 6 directions on a triangular lattice, will do. If all zero-net-momentum collisions cause the molecules at the collision site to scatter into a rotated configuration, and otherwise the particles go straight, then in the low speed limit we recover isotropic macroscopic fluid dynamics."
[100] Yakhot, V. and S. Orszag, "Reynolds number scaling of cellular-automaton hydrodynamics", _Phys. Rev. Lett._ (1986), 1691-1693.
It has been shown that self-replicators can be designed in cellular automata #cellular_automata . some of this deals with ``real'', physical replication. While other parts -- cellular automata, quines, etc -- deal only with patterns in a computer. Should I separate them ? But some of the theory applies to both.
related to reconfigurable robots robot_links.html#reconfigurable and robot construction (humans building robots) robot_links.html#construction and tool closure 3d_design.html#closure . and the bootstrap problem [FIXME:] [FIXME: cross-link all the self-replication stuff on my web pages. nano, robot, cellular automata, etc. Point back and forth from ``replication'' section to computer architecture # cellular automata robots nanotech idea_space and unknowns ]
The ``Game of Life'' http://www.astro.virginia.edu/~eww6n/life/ is the most well-known Cellular Automaton, invented by John Conway and popularized by Martin Gardner. Lots and lots of interesting properties and pretty animations have been collected for this cellular automaton. But David Cary doubts that this is really the most interesting ("computationally compact") cellular automata; other cellular automata are likely to have even more pretty animations and properties. "replicator - a Life object which repeatedly forms copies of itself. Such things are known to be possible in Life, but no example is known. But in the HighLife variant, there is a simple replicator." "HighLife - an alternate set of rules similar to Conway's, but with the additional rule that 6 neighbors generates a birth. Most of the interest in this variant is due to the replicator that evolves from:
***. ...* ...* ...*-- http://www.cs.jhu.edu/~callahan/glossary.html "unit Life cell: a pattern with two states, which is determined by its previous state and the previous state of its neighbors, using exactly the rules used to compute it; that is, it simulates its own universe. None have been constructed in Conway's Life yet." When I (DAV) talk about one entity replicating, I mean that we end up with 2 (or more) practically identical copies of the original, in every way (in particular, size). A ``unit Life cell'' seems to me to be somehow related to the so-called ``replicating-tile'' http://www.geocities.com/alclarke0/PolyPages/Reptiles.htm (which creates identical copies, but scaled much larger)
"Looking (and dreaming) toward the future, one can imagine nano-scale (bioware) systems becoming a reality, which will be endowed with evolutionary, reproductive, regenerative, and learning capabilities." http://lslwww.epfl.ch/pages/tutorials/poe/home.html
-- ``Beyond computation: a talk with Rodney Brooks'' 2002 http://www.kurzweilai.net/meme/frame.html?main=/articles/art0475.html | http://www.edge.org/3rd_culture/brooks_beyond/beyond_p4.htmlWe're also trying to build self-reproducing robots. We've been doing experiments with Fischer Technik and Lego. We're trying to build a robot out of Lego which can put together a copy of itself with Lego pieces. Obviously you need motors and some little computational units, but the big question is to determine what the fixed points in mechanical space are to create objects that can manipulate components of themselves and construct themselves. There is a deep mathematical question to get at there, and for now we're using these off-the-shelf technologies to explore that. Ultimately we expect we're going to get to some other generalized set of components which have lots and lots of ways of cooperatively being put together, and hope that we can get them to be able to manipulate themselves. You can do this computationally in simulation very easily, but in the real world the mechanical properties matter. What is that self-reflective point of mechanical systems? ...
see #simple_cpu and minimal_instruction_set.html
------------------------------ Date: Mon, 19 Jun 2000 16:27:44 +0200 From: Tim Böscke To: <MISC> Subject: Re: MISC > >The _real_ minimum instruction set is one with a > >subtract-and-branch-when-negative instruction btw. > > > > Eugene Styer mentions six 'one instruction computers' at > <http://eagle.eku.edu/faculty/styer/oisc.html> > Well - if you look at them closely it becomes obvious that the six machines are after all just flavours of two ideas. 1) The move machine 2) subtract, branch The mentioned "subtract" machine is a combination of both. However I dont really consider 1) as a one instruction machine. After all it is just a fancy way of doing microcoding. A move to the PC is after all not just a move, but a JMP. And thus you already have two instructions. The same goes for the ALU stuff. ------------------------------
------------------------------ Date: Thu, 22 Jun 2000 00:27:00 -0600 From: Roger Ivie <rivie at teraglobal.com> To: MISC Subject: Re: MISC Content-Type: text/plain; charset="us-ascii" ; format="flowed" >Well - if you look at them closely it becomes obvious that >the six machines are after all just flavours of two ideas. > >1) The move machine >2) subtract, branch > >The mentioned "subtract" machine is a combination of both. > >However I dont really consider 1) as a one instruction machine. >After all it is just a fancy way of doing microcoding. A move >to the PC is after all not just a move, but a JMP. And thus >you already have two instructions. The same goes for the ALU >stuff. Bear in mind that the PC doesn't have to be an actual register. A good example of this is the PDP-5, which stored the PC in location 0. Executing an instruction began by fetching location 0 and using it as the address of the instruction to be executed. It was possible for a DMA device to cause a jump by writing to location 0; this was, in fact, how the front panel loaded an address into the PC. -- Roger Ivie rivie@teraglobal.com Not speaking for TeraGlobal Communications Corporation ------------------------------
Steamer16: a high-performance homebrewer's microprocessor http://www3.sympatico.ca/myron.plichota/steamer1.htm by Myron Plichota <myron.plichota at sympatico.ca>. written in VHDL and prototyped on a single wire-wrapped protoboard. Has only 7 instructions. Packets of 5 instructions (5 instructions * 3 bits = 15 bits/packet) execute in 6 cycles/packet at a cycle rate of 20 MHz. Both the address and data busses are 16 bits. Unusual ``ArF'' call/return protocol.
qUark ../mirror/quark.txt a viable stack-computer with 4-bit opcodes (c) vic plichota, original concept by Myron Plichota Dec '98.
------------------------------
Date: Sat, 20 Feb 1999 12:54:34 +0300
From: "Stas Pereverzev"
To: MISC
Subject: Re: nFORTH v2.3
Content-Type: text/plain;
charset="koi8-r"
Content-Transfer-Encoding: 7bit
>Comments, folks?
You need only five instructions, not 16. They are:
ALU:
1. nand
2. shr
RAM:
3. store ( addr n -- )
4. lit ( -- n )
CONTROL:
5. ncret \ JUMP to addr in N, if carry flag isn't set in T,
\ also drop both T and N
Also, if PC is memory variable (or can be addressed as memory variable)
we can awoid "ncret" instruction:
In that case we sholud use NCSTORE instead STORE:
ncstore (addr n flag -- )
ncstore (addr n 0 ) - same as store,
ncstore (PC n -1 ) - same as ncret
We need only 2 bits per instruction in that case.
That all folks ;-)
Stas.
------------------------------
(does this really work ???)
Turing Tar Pit.
Some things you might think about if you are designing a new instruction set (for a Von Neumann machine).
]
The ``JavaOS'' #JavaOS attempts to do this in software, so it doesn't need MMUs, etc.
DAV: If you're going to be implementing FORTH on top of the architecture anyway, what with it's threaded code and all (lists of subroutines to call, rather than lists of call instructions to those subroutines) ... do you really need a ``call'' instruction ? Or do all the ``calls'' you need really reduce down to the ``next'' instruction at the end of a subroutine, to lookup the next subroutine on a list and call it ? -- DAV
[FIXME: very muddled thinking here:]
With threaded execution, what we have is: blocks: some register RN point to the ``next item'' in the list. (RN never points to a real assembly-language opcode; it points somewhere in the middle of a list of pointers, each of those pointers point directly to real assembly-language opcodes) Let's say that happens to be a (leaf, assembly-language) routine. When the (leaf, assembly-language) routine is done, it does NEXT (conditional ?). If the ``next item'' on the list is a pointer to another (leaf, assembly-language) routine, NEXT simply does PC <-- [[RN]] RN++ (compare to Normal return instructions do PC <-- [SP] SP++ ) So how does NEXT know whether this new routine is a leaf or not ? Or do we just assume it's always a leaf, and add a special "receive" opcode instruction(s) at the beginning if it happens to not be a leaf (reminiscent of the ARM procedure call standard) ? something like this: ; function header ("receive") ; push old RN to return stack [SP] <-- RN SP-- ; make RN point into block of subroutine, rather than calling block RN <-- [PC + n] ; NEXT PC <-- [[RN]] RN++ At the end of a (non-leaf) subroutine, the instruction at the end of the block needs to (leaf) point to code that properly does a return. ; pop old RN from return stack SP++ RN <-- [SP] ; NEXT PC <-- [RN] RN++ Control structures (Things that are not blocks): sequence, selection, iteration: [FIXME:] Perhaps it would be interesting to deal with these only at the threaded-call level, not on the assembly-langauge level. Then we have: selection: using conditional returns: ... perhaps add extra header to every subroutine; to jump *after* that header implies ``unconditionally execute this routine'', but to jump right at the beginning implies ``conditionally execute this routine: only when T is not zero''. I.e., that prepended extra header drops T, does a NEXT when that old value of T is zero, otherwise falls through.
How do I loop through a million pixels without leaving a million return addresses on the stack, when my only conditional is a conditional return ? just use ``fake jump'' of call/drop ?
[/muddled thinking]
-- Joseph H Allen http://www.dejanews.com/getdoc.xp?AN=408499073.1Basically I would not have made a weaker CPU if it came at the expense of being able to use data structures. This means that I prefer to have some form of indexing instead of the cheaper alternatives: direct addressing only (like the original Von Neumann computer with the resulting need for self-modifying code and the execute instructions of future computers), register indirect addressing only (like the 8080), or memory indirect addressing only (like the pdp-8). I had considered the 6502 method of indexing (where the base address is fixed, or stored in a zero-page location, and the 8-bit offset is in a register), but having programmed both 6502s and 6800s, I know that fixed offsets are to be prefered to fixed base-addresses (thus the 6502 index registers are, IMHO, useless). So I know that I need to encode two fields for indexed instructions: where the base is and the value of the offset. You want the offset to be as large as possible since that limits the size of the data structures you can have, thus the current 2-bit/7-bit format. If the base address was not going to be in an actual register, the next best place is the top of stack (which is much better than in a zero-page location).
The other nice thing about indexing is that it allows you to make fast move unrolled move loops:
lda 0,x sta 0,y lda 2,x sta 2,y etc.At 8 cycles for loads and 10 cycles for stores and two bytes per transfer, this gives a move speed of 9 cycles per byte.
If I could get high memory to memory copy speed (which I think is important) with indexing, then I wouldn't have to worry wasting op-code space with any other method (I.E., special instructions or auto-increment addressing modes).
Another lesson from the 6502 (and the 8088 for that matter), is that it just really sucks if the largest datum you can manipulate is smaller than your address size. This means that the accumulator needs to be the same size as the PC -- 16-bits.
So there you go. If you have any ideas for improving it without expending too much extra hardware, I'd like to hear them.
Lots of people have written far better descriptions.
BBW Branch Both Ways BEW Branch Either Way BBBF Branch on Bit Bucket Full BH Branch and Hang BMR Branch Multiple Registers BOB Branch On Bug BPO Branch on Power Off BST Backspace and Stretch Tape CDS Condense and Destroy System CLBR Clobber Register CLBRI Clobber Register Immediately CM Circulate Memory CMFRM Come From -- essential for truly structured programming CPPR Crumple Printer Paper and Rip CRN Convert to Roman Numerals
[FIXME: I have a lot of information on subroutines and the importance of well-factored code scattered about ... should I gather it all together here, or since it's so very important, leave a brief reference at each place ?]
Calls and returns are very important. Calls obviously take much less RAM space than in-line code except for the most trivial subroutines. Few people realize that calls can also take significantly less time than in-line code, because the most heavily re-used subroutines fit in the high-speed cache. Some programming styles (FORTH) create heavily factored code, where calls are the most frequently executed instruction at run time and typically 1/3 of the (static) instructions [see _Stack Computers_ by Philip Koopman 1989, in particular the section "7.1 the importance of fast subroutine calls" http://www.ece.cmu.edu/~koopman/stack_computers/sec7_1.html which says ``expensive procedure calls lead to poorly structured programs''. Also,
-- _Stack Computers_ by Philip Koopman http://www.ece.cmu.edu/~koopman/stack_computers/sec3_3.html#3322 ]. Since calls occur so frequently, it is important to minimize their RAM and cycle time overhead. Koopman suggests combining them with other instructions, so every instruction combines a call, branch, or exit with some other operation (since, at the hardware level, a data operation and a program flow operation can be executed in parallel).... Good Forth programming style ... Subroutines often only consist of 5 or 10 instructions. A static frequency of approximately 50% of the instructions being subroutine calls ... especially effective in environments with limited memory capacity. It also encourages the use of machines with fast subroutine calls.
From a runtime perspective, every call is paired with a return. So (everything else being equal) whatever minimizes (calling time + receiving time + returning time + resume time) is best. Sometimes you can get a net improvement by making one of these slightly slower and another one faster.
From a code space perspective, each function is likely to be called more than once, there are many more call statements than return statements (unless your subroutines have lots of exit points). So to minimize space, it's probably better to reduce (call size + resume size) even at the expense of (receive size + return size).
also see Koopman's taxonomy of microprocessors [FIXME: ]
_Computation Structures_ by Stephen A. Ward and Robert H. Halstead, Jr. (c)1990 by The Massachusetts Institute of Technology The MIT Press
p.606 Models of Computation The von Nuemann model of computation ... alternative models ... include:
The above list is neither exhaustive nor even representative...
... p.612 Taxonomy of Multiprocessors ... Michael Flynn's taxonomy ...
Ward and Halstead
quotes from _Computation Structures_ by Stephen A. Ward and Robert H. Halstead, Jr. (c)1990 by The Massachusetts Institute of Technology The MIT Press
... p. 347 Programming languages offer several /classes/ (allocation disciplines) for storage. These differ ... in the degree to which responsibility for allocation and deallocation of storage is assumed implicitly by the system (as opposed as being left to the programmer).
... common storage classes: