Include pairs of short instructions as part of the ISA, but make the
short instructions 14 bits long instead of 15, so that a pair leaves a
4-bit prefix and takes only 1/16 of the opcode space. This way, the
compromise is placed in something that's
less important. In the CISC mode, 17-bit short instructions will still
be present, after all.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 7/21/2025 8:45 AM, John Savard wrote:
On Sun, 20 Jul 2025 22:27:27 -0700, Stephen Fuld wrote:
But independent of that, I do miss Ivan's posts in this newsgroup, even
if they aren't about the Mill. I do hope he can find time to post at
least occasionally.
Although I agree, I am also satisfied as long as he is well and healthy.
If he can't waste time with USENET for now, that is all right with me.
But I am instead concerned if he is unable to find funding to make any
progress with the Mill, given that it appears to have been a very promising
project. That is much more important.
Based on the posts at the link I posted above, they are making progress,
albeit quite slowly. I understand the patents issue, as they require
real money. But I thought their model of doing work for a share of the
possible eventual profits, if any, would attract enough people to get
the work done. After all, there are lots of people who contribute to
many open source projects for no monetary return at all. And the Mill
needs only a few people. But apparently, I was wrong.
It's easy to underestimate the resources required to bring a new
processor architecture to a point where it makes sense to build
a test chip. Then to optimize the design for the target node.
That's just the hardware side. Then there is the software infrastructure (processor ABI, processor-specific library code, etc.).
Not to mention marketing and Hot Chips.
Looking at the webpage, the belt seems to have some characteristics
in common with stack-based architectures, bringing to mind Burroughs
large systems and the HP-3000.
However, try as I may, it may well be that the cost of this will turn
out to be too great. But if I can manage it, a significant restructuring
of the opcodes of this iteration of Concertina II may be coming soon.
More importantly, I need 256-character strings if I'm using them as
translate tables. Fine, I can use a pair of registers for a long string.
On S/360, that is exactly what you did. The first instruction in an assembler program was typically BALR (Branch and Link Register), which
is essentially a subroutine call. You would BALR to the next
instruction, which put that instruction's address into a register.
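To make the translate-table idea mentioned above concrete: the operation is a byte-wise lookup through a 256-entry table, essentially what the S/360 TR instruction does. A minimal C sketch (the function name and signature are illustrative, not anyone's actual code):

#include <stddef.h>

/* Translate n bytes of buf in place through a 256-entry table,
   the same effect as S/360 TR. */
static void translate(unsigned char *buf, size_t n,
                      const unsigned char table[256])
{
    for (size_t i = 0; i < n; i++)
        buf[i] = table[buf[i]];
}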
But I'm not sure; cramming more and more stuff in has brought me to a
point of being uneasy.
Except one. Any unused opcode space would still allow me to assign two
bit combinations to the first and second 32-bit parts of a 64-bit
instruction that is available without block structure. This might be
very inefficient,
I have now added that in. But the level of inefficiency was so high that
I couldn't include some of the instructions I would have liked to have in these 64-bit instructions... so I resorted to a very desperate measure to make it possible.
On Thu, 22 May 2025 6:51:05 +0000, David Chmelik wrote:
What is Concertina 2?
Roughly speaking, it is a design where most of the non-power-of-2 data
types {36, 48, 60 bits} are supported along with the
standard power-of-2 lengths {8, 16, 32, 64}.
This creates "interesting" situations with respect to instruction
formatting and the constants needed to support those instructions, and interesting requirements in other areas of the ISA.
VAX tried too hard in my opinion to close the semantic gap.
Any operand could be accessed with any address mode. Now while this
makes the puny 16-register file seem larger,
what VAX designers forgot is that each address mode was an instruction
in its own right.
So, VAX shot at minimum instruction count, and purposely miscounted
address modes not equal to %k as free.
On Sat, 14 Jun 2025 17:00:08 +0000, MitchAlsup1 wrote:
VAX tried too hard in my opinion to close the semantic gap.
Any operand could be accessed with any address mode. Now while this
makes the puny 16-register file seem larger,
what VAX designers forgot is that each address mode was an instruction
in its own right.
So, VAX shot at minimum instruction count, and purposely miscounted
address modes not equal to %k as free.
Fancy addressing modes certainly aren't _free_. However, they are,
in my opinion, often cheaper than achieving the same thing with an
extra instruction.
So it makes sense to add an addressing mode _if_ what that addressing
mode does is pretty common.
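As a concrete instance of the kind of "pretty common" case meant here, consider a scaled array access, which a [base+index*scale] addressing mode folds into a single load. A hedged C sketch (the comment describes typical code generation, not any particular machine):

long element(long *base, long idx)
{
    /* With a scaled base+index addressing mode this is one load.
       Without it, and with 8-byte longs, the compiler has to emit an
       explicit shift-by-3 and an add before the load, i.e. one or two
       extra instructions per access. */
    return base[idx];
}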
That being said, though, designing a new machine today like the VAX
would be a huge mistake.
But the VAX, in its day, was very successful. And I don't think that
this was just a result of riding on the coattails of the huge popularity
of the PDP-11. It was a good match to the technology *of its time*,
that being machines that were implemented using microcode.
John Savard
But the VAX, in its day, was very successful. And I don't think that
this was just a result of riding on the coattails of the huge popularity
of the PDP-11. It was a good match to the technology *of its time*,
that being machines that were implemented using microcode.
So going for microcode no longer was the best choice for the VAX, but
neither the VAX designers nor their competition realized this, and
commercial RISCs only appeared in 1986.
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
So going for microcode no longer was the best choice for the VAX, but
neither the VAX designers nor their competition realized this, and
commercial RISCs only appeared in 1986.
That is certainly true but there were other mistakes too. One is that
they underestimated how cheap memory would get, leading to the overcomplex
instruction and address modes and the tiny 512 byte page size.
Another, which is not entirely their fault, is that they did not expect
compilers to improve as fast as they did, leading to a machine which was fun to
program in assembler but full of stuff that was useless to compilers and
instructions like POLY that should have been subroutines. The 801 project and
PL.8 compiler were well underway at IBM by the time the VAX shipped, but DEC
presumably didn't know about it.
Related to the microcode issue they also don't seem to have anticipated how
important pipelining would be. Some minor changes to the VAX, like not letting
one address modify another in the same instruction, would have made it a lot
easier to pipeline.
John Levine <johnl@taugh.com> writes:
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
So going for microcode no longer was the best choice for the VAX, but
neither the VAX designers nor their competition realized this, and
commercial RISCs only appeared in 1986.
That is certainly true but there were other mistakes too. One is that
they underestimated how cheap memory would get, leading to the overcomplex
instruction and address modes and the tiny 512 byte page size.
Concerning code density, while VAX code is compact, RISC-V code with the
C extension is more compact
<2025Mar4.093916@mips.complang.tuwien.ac.at>, so in our time-traveling scenario that would not be a reason for going for the VAX ISA.
Another aspect from those measurements is that the 68k instruction set
(with only one memory operand for any compute instructions, and 16-bit granularity) has a code density similar to the VAX.
Another, which is not entirely their fault, is that they did not expect
compilers to improve as fast as they did, leading to a machine which was fun to
program in assembler but full of stuff that was useless to compilers and
instructions like POLY that should have been subroutines. The 801 project and
PL.8 compiler were well underway at IBM by the time the VAX shipped, but DEC
presumably didn't know about it.
DEC probably was aware from the work of William Wulf and his students
what optimizing compilers can do and how to write them. After all,
they used his language BLISS and its compiler themselves.
POLY would have made sense in a world where microcode makes sense: If microcode can be executed faster than subroutines, put a building
stone for transcendental library functions into microcode. Of course,
given that microcode no longer made sense for VAX, POLY did not make
sense for it, either.
Related to the microcode issue they also don't seem to have anticipated how
important pipelining would be. Some minor changes to the VAX, like not letting
one address modify another in the same instruction, would have made it a lot
easier to pipeline.
My RISC alternative to the VAX 11/780 (RISC-VAX) would probably have
to use pipelining (maybe a three-stage pipeline like the first ARM) to achieve its clock rate goals; that would eat up some of the savings in implementation complexity that avoiding the actual VAX would have
given us.
Another issue is how to implement the PDP-11 emulation mode.
I would add a PDP-11 decoder (as the actual VAX 11/780 probably has)
that would decode PDP-11 code into RISC-VAX instructions, or into what RISC-VAX instructions are decoded into. The cost of that is probably
similar to that in the actual VAX 11/780. If the RISC-VAX ISA has a MIPS/Alpha/RISC-V-like handling of conditions, the common microcode
would have to support both the PDP-11 and the RISC-VAX handling of conditions; probably not that expensive, but maybe one still would
prefer an ARM/SPARC/HPPA-like handling of conditions.
- anton
I can't say much for or against VAX, as I don't currently have any
compilers that target it.
BGB <cr88192@gmail.com> schrieb:
I can't say much for or against VAX, as I don't currently have any
compilers that target it.
If you want to look at code, godbolt has a few gcc versions for it.
POLY would have made sense in a world where microcode makes sense: If
microcode can be executed faster than subroutines, put a building
stone for transcendental library functions into microcode. Of course,
given that microcode no longer made sense for VAX, POLY did not make
sense for it, either.
IIUC the original idea was that POLY should be more accurate than a
sequence of separate instructions and reproducible between models.
I must admit that I do not understand why VAX needed so many
cycles per instruction. Namely, a register argument can be
recognized by looking at the 4 high bits of the operand byte.
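A toy sketch of that check in C, assuming the usual VAX operand-specifier layout (high nibble = addressing mode, low nibble = register number; modes 0-3 are short literals, mode 5 is register):

#include <stdint.h>

/* Classify a VAX operand-specifier byte by its mode nibble. */
static int is_short_literal(uint8_t spec) { return (spec >> 4) <= 3; }
static int is_register(uint8_t spec)      { return (spec >> 4) == 5; }
static int reg_number(uint8_t spec)       { return spec & 0x0f; }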
To summarize, a VAX with a pipeline and a modest number of operand
decoders should be able to execute "normal" instructions
at RISC speed (in a RISC each memory operand would require a
load or store, so an extra cycle, like the scheme above).
According to Waldek Hebisch <antispam@fricas.org>:
POLY would have made sense in a world where microcode makes sense: If
microcode can be executed faster than subroutines, put a building
stone for transcendental library functions into microcode. Of course,
given that microcode no longer made sense for VAX, POLY did not make
sense for it, either.
IIUC the original idea was that POLY should be more accurate than a
sequence of separate instructions and reproducible between models.
That was the plan but the people building Vaxen didn't get the memo
so even on the original 780, it got different answers with and without
the optional floating point accelerator.
If they wanted more accurate results, they should have
https://simh.trailing-edge.com/docs/vax_poly.pdf
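For reference, POLY evaluates a polynomial by Horner's rule over a coefficient table in memory; the subroutine it replaces is only a few lines of C. A sketch (argument order is illustrative, and the real instruction was also supposed to pin down intermediate precision and rounding, which this plain version does not attempt):

/* Horner's-rule polynomial evaluation, the job POLY did in one
   instruction: r = c[0]*x^degree + ... + c[degree]. */
static double poly(double x, const double *c, int degree)
{
    double r = c[0];
    for (int i = 1; i <= degree; i++)
        r = r * x + c[i];
    return r;
}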
I must admit that I do not understand why VAX needed so many
cycles per instruction. Namely, a register argument can be
recognized by looking at the 4 high bits of the operand byte.
It can, but autoincrement or decrement modes change the contents
of the register so the operands have to be evaluated in strict
order or you need a lot of logic to check for hazards and stall.
In practice I don't think it was very common to do that, except
for the immediate and absolute address modes which were (PC)+
and @(PC)+, and which needed to be special cased since they took
data from the instruction stream. The size of the immediate
operand could be from 1 to 8 bytes depending on both the instruction
and which operand of the instruction it was.
Looking at the MACRO-32 source for a FOCAL interpreter, I see
CVTLF 12(SP),@(SP)+
MOVL (SP)+, R0
CMPL (AP)+,#1
MOVL (AP)+,R7
TSTL (SP)+
MOVZBL (R8)+,R5
BICB3 #240,(R8)+,R2
LOCC (R8)+,R0,(R8) ;FIND THE MATCH <<< note R8 used twice
LOCC (R8)+,S^#OPN,OPRATRS
MOVL (SP)+,(R7)[R6]
CMPB (R8)+,#^A/;/ ;VALID END OF STATEMENT
CASE (SP)+,<30$,20$,10$>,-
LIMIT=#0,TYPE=L ;DISPATCH ON NO. OF ARGS
MOVF (SP)+,@(SP)+ ;JUST DO SET
(SP)+ was far and away the most common. (PC)+ wasn't
used in that application.
There were some adjacent dependencies:
ADDB3 #48,R0,(R9)+ ;PUT AS DIGIT INTO BUFFER
ADDB3 #48,R1,(R9)+ ;AND NEXT
and a handful of others. Probably only a single-digit
percentage of instructions used autoincrement/decrement and only
a couple used the updated register in the same
instruction.
According to Scott Lurndal <slp53@pacbell.net>:
Wow, that's some funky code.
According to Waldek Hebisch <antispam@fricas.org>:
To summarize, a VAX with a pipeline and a modest number of operand
decoders should be able to execute "normal" instructions
at RISC speed (in a RISC each memory operand would require a
load or store, so an extra cycle, like the scheme above).
Right, but detecting the abnormal cases wasn't trivial.
On 6/17/2025 10:59 AM, quadibloc wrote:
So the fact that it uses 10x the electrical power, while only having 2x
the raw power - for an embarrassingly parallel problem, which doesn't
happen to be the one I need to solve - doesn't matter.
Can you break your processing down into units that can be executed in parallel, or do you get into an interesting issue where step B cannot
proceed until step A is finished?
If you treat [Base+Disp] and [Base+Index] as two mutually exclusive
cases, one gets most of the benefit with less issues.
On Tue, 10 Jun 2025 22:45:05 -0500, BGB wrote:
If you treat [Base+Disp] and [Base+Index] as two mutually exclusive
cases, one gets most of the benefit with less issues.
That's certainly a way to do it. But then you either need to dedicate
one base register to each array - perhaps easier if there's opcode
space to use all 32 registers as base registers, which this would allow -
or you would have to load the base register with the address of the
array.
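A small C illustration of the two cases being kept separate above (hypothetical code, just to show which accesses want which mode):

struct point { long x, y, z; };

long pick(struct point *p, long *v, long i)
{
    long a = p->y;    /* [Base+Disp]: fixed offset from a base register */
    long b = v[i];    /* [Base+Index]: base register plus scaled index  */
    return a + b;
}

A global array, or a field of an element of an array of structs, is where the two would otherwise combine; with the modes kept mutually exclusive the compiler first has to materialize the array's address in a base register, which is the point made in the reply above.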
My idea was that instruction decoder could essentially translate
ADDL (R2)+, R2, R3
into
MOV (R2)+, TMP
ADDL TMP, R2, R3
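A rough C sketch of that cracking step, just to make the mechanism concrete (the micro-op representation and the TMP register are invented for illustration):

enum uop_kind { UOP_LOAD_POSTINC, UOP_ADD_REG };

struct uop {
    enum uop_kind kind;
    int dst, src1, src2;   /* architectural register numbers, or TMP */
};

#define TMP 16             /* pipeline temporary, not architecturally visible */

/* Crack "ADDL (Rn)+, Rm, Rd" into a load micro-op plus a register add. */
static int crack_addl_autoinc(int rn, int rm, int rd, struct uop out[2])
{
    out[0] = (struct uop){ UOP_LOAD_POSTINC, TMP, rn, 0 };  /* TMP = M[Rn]; Rn += 4 */
    out[1] = (struct uop){ UOP_ADD_REG, rd, TMP, rm };      /* Rd = TMP + Rm */
    return 2;
}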
On 7/30/2025 12:59 AM, Anton Ertl wrote:...
John Levine <johnl@taugh.com> writes:
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
So going for microcode no longer was the best choice for the VAX, but
neither the VAX designers nor their competition realized this, and
commercial RISCs only appeared in 1986.
That is certainly true but there were other mistakes too. One is that
they underestimated how cheap memory would get, leading to the overcomplex
instruction and address modes and the tiny 512 byte page size.
Concerning code density, while VAX code is compact, RISC-V code with the
C extension is more compact
<2025Mar4.093916@mips.complang.tuwien.ac.at>, so in our time-traveling
scenario that would not be a reason for going for the VAX ISA.
But, if so, it speaks more to the weakness of VAX code density
than the goodness of RISC-V.
There is, however, a fairly notable size difference between RV32 and
RV64 here, but I had usually been messing with RV64.
If I were to put it on a ranking (for ISAs I have messed with), it would
be, roughly (smallest first):
i386 with VS2008 or GCC 3.x (*1)
In the days of VAX-11/780, it was "obvious" that operating systems would
be written in assembler in order to be efficient, and the instruction
set allowed high productivity for writing systems programs in "native"
code.
As for a RISC-VAX: To little old naive me, it seems that it would have
been possible to create an alternative microcode load that would be able
to support a RISC ISA on the same hardware, if the idea had occurred to a
well-connected group of graduate students. How good a RISC might have
been feasible?
It appears that Waldek Hebisch <antispam@fricas.org> said:
My idea was that instruction decoder could essentially translate
ADDL (R2)+, R2, R3
into
MOV (R2)+, TMP
ADDL TMP, R2, R3
But how about this?
ADDL3 (R2)+,(R2)+,(R2)+
Now you need at least two temps, the second of which depends on the
first,
and there are instructions with six operands.
Or how about
this:
ADDL3 (R2)+,#1234,(R2)+
This is encoded as
OPCODE (R2)+ (PC)+ <1234> (R2)+
The immediate word is in the middle of the instruction. You have to decode the operands one at a time so you can recognize immediates and skip over them.
It must have seemed clever at the time, but ugh.
Lars Poulsen <lars@cleo.beagle-ears.com> writes:
In the days of VAX-11/780, it was "obvious" that operating systems would
be written in assembler in order to be efficient, and the instruction
set allowed high productivity for writing systems programs in "native"
code.
Yes. I don't think that the productivity would have suffered from a
load/store architecture, though.
As for a RISC-VAX: To little old naive me, it seems that it would have
been possible to create an alternative microcode load that would be able
to support a RISC ISA on the same hardware, if the idea had occurred to a
well-connected group of graduate students. How good a RISC might have
been feasible?
Did the VAX 11/780 have writable microcode?
Given that the VAX 11/780 was not (much) pipelined, I don't expect
that using an alternative microcode that implements a RISC ISA would
have performed well.
John Levine <johnl@taugh.com> wrote:<snip>
ADDL3 (R2)+,#1234,(R2)+
This is encoded as
OPCODE (R2)+ (PC)+ <1234> (R2)+
The immediate word is in the middle of the instruction. You have to decode
the operands one at a time so you can recognize immediates and skip over them.
Actually decoder that I propose could decode _this_ one in one
cycle.
But for this instruction one cycle decoding is not needed,
because execution will take multiple clocks. One cycle decoding
is needed for
ADDL3 R2,#1234,R2
which should be executed in one cycle. And to handle it one needs
7 operand decoders looking at 7 consecutive bytes, so that the last
decoder sees the last register argument.
It must have seemed clever at the time, but ugh.
VAX designers clearly had microcode in mind, even though small changes
could have made hardware decoding easier.
I have a book by A. Tanenbaum about computer architecture that
was written in a similar period as the VAX design.
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
John Levine <johnl@taugh.com> writes:
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
POLY would have made sense in a world where microcode makes sense: If
microcode can be executed faster than subroutines, put a building
stone for transcendental library functions into microcode. Of course,
given that microcode no longer made sense for VAX, POLY did not make
sense for it, either.
IIUC the original idea was that POLY should be more accurate than a
sequence of separate instructions and reproducible between models.
Another issue is how to implement the PDP-11 emulation mode.
I would add a PDP-11 decoder (as the actual VAX 11/780 probably has)
that would decode PDP-11 code into RISC-VAX instructions, or into what
RISC-VAX instructions are decoded into. The cost of that is probably
similar to that in the actual VAX 11/780. If the RISC-VAX ISA has a
MIPS/Alpha/RISC-V-like handling of conditions, the common microcode
would have to support both the PDP-11 and the RISC-VAX handling of
conditions; probably not that expensive, but maybe one still would
prefer an ARM/SPARC/HPPA-like handling of conditions.
I looked into the VAX architecture handbook from 1977. The handbook claims
that the VAX-780 used 96-bit microcode words. That is enough bits to
control a pipelined machine at 1 instruction per cycle, provided there are
enough execution resources (register ports, buses and 1-cycle
execution units). However, the VAX hardware allowed only one memory
access per cycle, so instructions with multiple memory addresses
or using indirection through memory by necessity needed multiple
cycles.
I must admit that I do not understand why VAX needed so many
cycles per instruction.
For a 1-byte opcode with all register arguments the operand specifiers
are in predictable places, so together a modest number of gates could
recognize register-only operand specifiers.
To summarize, a VAX with a pipeline and a modest number of operand
decoders should be able to execute "normal" instructions
at RISC speed (in a RISC each memory operand would require a
load or store, so an extra cycle, like the scheme above).
Given the actual speed of the VAX, the possibilities seem to be:
- extra factors slowing both VAX and RISC, like cache
misses (the VAX architecture handbook says that due to
misses the cache had an effective access time of 290 ns),
- the VAX designers could not afford a pipeline,
- maybe the VAX designers decided to avoid a pipeline to reduce
complexity.
If the VAX designers could not afford a pipeline, then it is
not clear if a RISC could afford it: removing the microcode
engine would reduce complexity and cost and give some
free space. But microcode engines tend to be simple.
Also, PDP-11 compatibility depended on microcode.
Without a microcode engine one would need a parallel set
of hardware instruction decoders, which could add
more complexity than was saved by removing the microcode
engine.
To summarize, it is not clear to me if a RISC in VAX technology
could be significantly faster than the VAX. Without
insight into the future it is hard to say that they were
wrong.
John Levine <johnl@taugh.com> wrote:
It appears that Waldek Hebisch <antispam@fricas.org> said:
My idea was that instruction decoder could essentially translate
ADDL (R2)+, R2, R3
into
MOV (R2)+, TMP
ADDL TMP, R2, R3
But how about this?
ADDL3 (R2)+,(R2)+,(R2)+
Now you need at least two temps, the second of which depends on the
first,
3 actually, the translation should be
MOVL (R2)+, TMP1
MOVL (R2)+, TMP2
ADDL TMP1, TMP2, TMP3
MOVL TMP3, (R2)+
Of course, temporaries are only within pipeline, so they probably
do not need real registers. But the instruction would need
4 clocks.
BGB <cr88192@gmail.com> writes:
On 7/30/2025 12:59 AM, Anton Ertl wrote:...
John Levine <johnl@taugh.com> writes:
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
So going for microcode no longer was the best choice for the VAX, but neither the VAX designers nor their competition realized this, and
commercial RISCs only appeared in 1986.
That is certainly true but there were other mistakes too. One is that
they underestimated how cheap memory would get, leading to the overcomplex
instruction and address modes and the tiny 512 byte page size.
Concerning code density, while VAX code is compact, RISC-V code with the
C extension is more compact
<2025Mar4.093916@mips.complang.tuwien.ac.at>, so in our time-traveling
scenario that would not be a reason for going for the VAX ISA.
But, if so, it speaks more to the weakness of VAX code density
than the goodness of RISC-V.
For the question at hand, what counts is that one can do a RISC that
is more compact than the VAX.
And neither among the Debian binaries nor among the NetBSD binaries I
measured have I found anything consistently more compact than RISC-V
with the C extension. There is one strong competitor, though: armhf
(Thumb2) on Debian, which is a little smaller than RV64GC in 2 out of
3 cases and a little larger in the third case.
There is, however, a fairly notable size difference between RV32 and
RV64 here, but I had usually been messing with RV64.
NetBSD has both RV32GC and RV64GC binaries, and there is no consistent advantage of RV32GC over RV64GC there:
NetBSD numbers from <2025Mar4.093916@mips.complang.tuwien.ac.at>:
libc ksh pax ed
1102054 124726 66218 26226 riscv-riscv32
1077192 127050 62748 26550 riscv-riscv64
If I were to put it on a ranking (for ISAs I have messed with), it would
be, roughly (smallest first):
i386 with VS2008 or GCC 3.x (*1)
i386 has significantly larger binaries than RV64GC on both Debian and
NetBSD, also bigger than AMD64 and ARM A64.
For those who want to see all the numbers in one posting: <2025Jun17.161742@mips.complang.tuwien.ac.at>.
- anton
antispam@fricas.org (Waldek Hebisch) writes:
John Levine <johnl@taugh.com> wrote:<snip>
ADDL3 (R2)+,#1234,(R2)+
This is encoded as
OPCODE (R2)+ (PC)+ <1234> (R2)+
The immediate word is in the middle of the instruction. You have to decode
the operands one at a time so you can recognize immediates and skip over them.
Actually decoder that I propose could decode _this_ one in one
cycle.
Assuming it didn't cross a cache line, which is possible with any
variable length instruction encoding.
But for this instruction one cycle decoding is not needed,
because execution will take multiple clocks. One cycle decoding
is needed for
ADDL3 R2,#1234,R2
which should be executed in one cycle. And to handle it one needs
7 operand decoders looking at 7 consecutive bytes, so that the last
decoder sees the last register argument.
It must have seemed clever at the time, but ugh.
VAX designers clearly had microcode in mind, even though small changes
could have made hardware decoding easier.
I have a book by A. Tanenbaum about computer architecture that
was written in a similar period as the VAX design.
That would be:
$ author tanenbaum
Enter password:
artist title format location
Tanenbaum, Andrew S. Structured Computer Organization Hard A029
It's currently in box A029 in storage, but my recollection is that
it was rather vax-centric.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Lars Poulsen <lars@cleo.beagle-ears.com> writes:
In the days of VAX-11/780, it was "obvious" that operating systems would
be written in assembler in order to be efficient, and the instruction
set allowed high productivity for writing systems programs in "native"
code.
Yes. I don't think that the productivity would have suffered from a
load/store architecture, though.
As for a RISC-VAX: To little old naive me, it seems that it would have
been possible to create an alternative microcode load that would be able
to support a RISC ISA on the same hardware, if the idea had occurred to a
well-connected group of graduate students. How good a RISC might have
been feasible?
Did the VAX 11/780 have writable microcode?
Yes.
Given that the VAX 11/780 was not (much) pipelined, I don't expect
that using an alternative microcode that implements a RISC ISA would
have performed well.
A new ISA also requires development of the complete software
infrastructure for building applications (compilers, linkers,
assemblers); updating the OS, rebuilding existing applications
for the new ISA, field and customer training, etc.
Digital eventually did move VMS to Alpha, but it was neither
cheap, nor easy. Most alpha customers were existing VAX
customers - it's not clear that DEC actually grew the customer
base by switching to Alpha.
Wasn't PRISM/MICA supposed to solve this problem, or am I confusing it
with something else?
On Fri, 1 Aug 2025 20:06:43 -0700, Peter Flass wrote:
Wasn't PRISM/MICA supposed to solve this problem, or am I confusing it
with something else?
PRISM was going to be a new hardware architecture, and MICA the OS to run
on it. Yes, they were supposed to solve the problem of where DEC was going
to go since the VAX architecture was clearly being left in the dust by
RISC.
I think the MICA kernel was going to support the concept of
“personalities”, so that a VMS-compatible environment could be implemented
by one set of upper layers, while another set could provide Unix
functionality.
I think the project was taking too long, and not making enough progress.
So DEC management cancelled the whole thing, and brought out a MIPS-based
machine instead.
The guy in charge got annoyed at the killing of his pet project and left
in a huff. He took some of those ideas with him to his new employer, to
create a new OS for them.
The new employer was Microsoft. The guy in question was Dave Cutler. The
OS they brought out was called “Windows NT”.
In article <106k15u$qgip$6@dont-email.me>,
Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
The new employer was Microsoft. The guy in question was Dave Cutler. The
OS they brought out was called “Windows NT”.
And it's *still* not finished!
Wasn't PRISM/MICA supposed to solve this problem, or am I confusing it
with something else?
IIUC PRISM eventually became Alpha.
And Windows on Alpha had a brief shining moment in the sun (no
pun intended).
I can understand why DEC abandoned VAX: already in 1985 they
had some disadvantage and they saw no way to compete against
superscalar machines which were on the horizon. In 1985 they
probably realized that their features add no value in a world
using optimizing compilers.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Given that the VAX 11/780 was not (much) pipelined, I don't expect
that using an alternative microcode that implements a RISC ISA would
have performed well.
A new ISA also requires development of the complete software
infrastructure for building applications (compilers, linkers,
assemblers); updating the OS, rebuilding existing applications
for the new ISA, field and customer training, etc.
Digital eventually did move VMS to Alpha, but it was neither
cheap, nor easy. Most alpha customers were existing VAX
customers - it's not clear that DEC actually grew the customer
base by switching to Alpha.
1) Performance, and that cost DEC customers since RISCs were
introduced in the mid-1980s. DecStations were introduced to reduce
this bleeding, but of course this meant that these customers were
not VAX customers.
But, it seems to have a few obvious weak points for RISC-V:
Crappy with arrays;
Crappy with code with lots of large immediate values;
Crappy with code which mostly works using lots of global variables;
Say, for example, a lot of Apogee / 3D Realms code;
They sure do like using lots of global variables.
id Software also likes globals, but not as much.
...
NetBSD has both RV32GC and RV64GC binaries, and there is no consistent
advantage of RV32GC over RV64GC there:
NetBSD numbers from <2025Mar4.093916@mips.complang.tuwien.ac.at>:
libc ksh pax ed
1102054 124726 66218 26226 riscv-riscv32
1077192 127050 62748 26550 riscv-riscv64
I guess it can be noted: is the overhead of any ELF metadata being excluded?...
Granted, newer compilers do support newer versions of the C standard,
and also typically get better performance.
BGB <cr88192@gmail.com> writes:
But, it seems to have a few obvious weak points for RISC-V:
Crappy with arrays;
Crappy with code with lots of large immediate values;
Crappy with code which mostly works using lots of global variables;
Say, for example, a lot of Apogee / 3D Realms code;
They sure do like using lots of global variables.
id Software also likes globals, but not as much.
...
Let's see:
#include <stddef.h>
long arrays(long *v, size_t n)
{
long i, r;
for (i=0, r=0; i<n; i++)
r+=v[i];
return r;
}
long a, b, c, d;
void globals(void)
{
a = 0x1234567890abcdefL;
b = 0xcdef1234567890abL;
c = 0x567890abcdef1234L;
d = 0x5678901234abcdefL;
}
gcc-10.3 -Wall -O2 compiles this to the following RV64GC code:
0000000000010434 <arrays>:
10434: cd81 beqz a1,1044c <arrays+0x18>
10436: 058e slli a1,a1,0x3
10438: 87aa mv a5,a0
1043a: 00b506b3 add a3,a0,a1
1043e: 4501 li a0,0
10440: 6398 ld a4,0(a5)
10442: 07a1 addi a5,a5,8
10444: 953a add a0,a0,a4
10446: fed79de3 bne a5,a3,10440 <arrays+0xc>
1044a: 8082 ret
1044c: 4501 li a0,0
1044e: 8082 ret
0000000000010450 <globals>:
10450: 8201b583 ld a1,-2016(gp) # 12020 <__SDATA_BEGIN__>
10454: 8281b603 ld a2,-2008(gp) # 12028 <__SDATA_BEGIN__+0x8>
10458: 8301b683 ld a3,-2000(gp) # 12030 <__SDATA_BEGIN__+0x10>
1045c: 8381b703 ld a4,-1992(gp) # 12038 <__SDATA_BEGIN__+0x18>
10460: 86b1b423 sd a1,-1944(gp) # 12068 <a>
10464: 86c1b023 sd a2,-1952(gp) # 12060 <b>
10468: 84d1bc23 sd a3,-1960(gp) # 12058 <c>
1046c: 84e1b823 sd a4,-1968(gp) # 12050 <d>
10470: 8082 ret
When using -Os, arrays becomes 2 bytes shorter, but the inner loop
becomes longer.
gcc-12.2 -Wall -O2 -falign-labels=1 -falign-loops=1 -falign-jumps=1 -falign-functions=1
compiles this to the following AMD64 code:
000000001139 <arrays>:
1139: 48 85 f6 test %rsi,%rsi
113c: 74 13 je 1151 <arrays+0x18>
113e: 48 8d 14 f7 lea (%rdi,%rsi,8),%rdx
1142: 31 c0 xor %eax,%eax
1144: 48 03 07 add (%rdi),%rax
1147: 48 83 c7 08 add $0x8,%rdi
114b: 48 39 d7 cmp %rdx,%rdi
114e: 75 f4 jne 1144 <arrays+0xb>
1150: c3 ret
1151: 31 c0 xor %eax,%eax
1153: c3 ret
000000001154 <globals>:
1154: 48 b8 ef cd ab 90 78 movabs $0x1234567890abcdef,%rax
115b: 56 34 12
115e: 48 89 05 cb 2e 00 00 mov %rax,0x2ecb(%rip) # 4030 <a>
1165: 48 b8 ab 90 78 56 34 movabs $0xcdef1234567890ab,%rax
116c: 12 ef cd
116f: 48 89 05 b2 2e 00 00 mov %rax,0x2eb2(%rip) # 4028 <b>
1176: 48 b8 34 12 ef cd ab movabs $0x567890abcdef1234,%rax
117d: 90 78 56
1180: 48 89 05 99 2e 00 00 mov %rax,0x2e99(%rip) # 4020 <c>
1187: 48 b8 ef cd ab 34 12 movabs $0x5678901234abcdef,%rax
118e: 90 78 56
1191: 48 89 05 80 2e 00 00 mov %rax,0x2e80(%rip) # 4018 <d>
1198: c3 ret
gcc-10.2 -Wall -O2 -falign-labels=1 -falign-loops=1 -falign-jumps=1 -falign-functions=1
compiles this to the following ARM A64 code:
0000000000000734 <arrays>:
734: b4000121 cbz x1, 758 <arrays+0x24>
738: aa0003e2 mov x2, x0
73c: d2800000 mov x0, #0x0 // #0
740: 8b010c43 add x3, x2, x1, lsl #3
744: f8408441 ldr x1, [x2], #8
748: 8b010000 add x0, x0, x1
74c: eb03005f cmp x2, x3
750: 54ffffa1 b.ne 744 <arrays+0x10> // b.any
754: d65f03c0 ret
758: d2800000 mov x0, #0x0 // #0
75c: d65f03c0 ret
0000000000000760 <globals>:
760: d299bde2 mov x2, #0xcdef // #52719
764: b0000081 adrp x1, 11000 <__cxa_finalize@GLIBC_2.17>
768: f2b21562 movk x2, #0x90ab, lsl #16
76c: 9100e020 add x0, x1, #0x38
770: f2cacf02 movk x2, #0x5678, lsl #32
774: d2921563 mov x3, #0x90ab // #37035
778: f2e24682 movk x2, #0x1234, lsl #48
77c: f9001c22 str x2, [x1, #56]
780: d2824682 mov x2, #0x1234 // #4660
784: d299bde1 mov x1, #0xcdef // #52719
788: f2aacf03 movk x3, #0x5678, lsl #16
78c: f2b9bde2 movk x2, #0xcdef, lsl #16
790: f2a69561 movk x1, #0x34ab, lsl #16
794: f2c24683 movk x3, #0x1234, lsl #32
798: f2d21562 movk x2, #0x90ab, lsl #32
79c: f2d20241 movk x1, #0x9012, lsl #32
7a0: f2f9bde3 movk x3, #0xcdef, lsl #48
7a4: f2eacf02 movk x2, #0x5678, lsl #48
7a8: f2eacf01 movk x1, #0x5678, lsl #48
7ac: a9008803 stp x3, x2, [x0, #8]
7b0: f9000c01 str x1, [x0, #24]
7b4: d65f03c0 ret
So, the overall sizes (including data size for globals() on RV64GC) are:
arrays globals Architecture
28 66 (34+32) RV64GC
27 69 AMD64
44 84 ARM A64
So RV64GC is smallest for the globals/large-immediate test here, and
only beaten by one byte by AMD64 for the array test. Looking at the
code generated for the inner loop of arrays(), all the inner loops
contain four instructions, so certainly in this case RV64GC is not
crappier than the others. Interestingly, the reasons for using four instructions (rather than five) are different on these architectures:
* RV64GC uses a compare-and-branch instruction.
* AMD64 uses a load-and-add instruction.
* ARM A64 uses an auto-increment instruction.
NetBSD has both RV32GC and RV64GC binaries, and there is no consistent
advantage of RV32GC over RV64GC there:
NetBSD numbers from <2025Mar4.093916@mips.complang.tuwien.ac.at>:
libc ksh pax ed
1102054 124726 66218 26226 riscv-riscv32
1077192 127050 62748 26550 riscv-riscv64
I guess it can be noted: is the overhead of any ELF metadata being
excluded?...
These are sizes of the .text section extracted with objdump -h. So
no, these numbers do not include ELF metadata, nor the sizes of other sections. The latter may be relevant, because RV64GC has "immediates"
in .sdata that other architectures have in .text; however, .sdata can
contain other things than just "immediates", so one cannot just add the .sdata size to the .text size.
Granted, newer compilers do support newer versions of the C standard,
and also typically get better performance.
The latter is not the case in my experience, except in cases where autovectorization succeeds (but I also have seen a horrible slowdown
from auto-vectorization).
There is one other improvement: gcc register allocation has improved
in recent years to a point where we 1) no longer need explicit
register allocation for Gforth on AMD64, and 2) with a lot of manual
help, we could increase the number of stack cache registers from 1 to
3 on AMD64, which gives some speedups typically in the 0%-20% range in Gforth.
But, e.g., for the example from <http://www.complang.tuwien.ac.at/anton/lvas/effizienz/tsp.html>,
which is vectorizable, I still have not been able to get gcc to auto-vectorize it, even with some transformations which should help.
I have not measured the scalar versions again, but given that there
were no consistent speedups between gcc-2.7 (1995) and gcc-5.2 (2015),
I doubt that I will see consistent speedups with newer gcc (or clang) versions.
- anton
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
1) Performance, and that cost DEC customers since RISCs were
introduced in the mid-1980s. DecStations were introduced to reduce
this bleeding, but of course this meant that these customers were
not VAX customers.
Or, even more importantly, VMS customers.
One big selling point of Alpha was the 64-bit architecture, but IIUC
VMS was never fully ported to 64 bits; that is, a lot of VMS
software used 32-bit addresses and some system interfaces were
32-bit only. OTOH Unix for Alpha was claimed to be pure 64-bit.
I guess I'm getting DecStations and VaxStations mixed up. Maybe one of
their problems was brand confusion.
In my RISC-VAX scenario, the RISC-VAX would be the PDP-11 follow-on
instead of the actual (CISC) VAX, so there would be no additional
ISA.
Vobis (a German discount computer reseller) offered Alpha-based Windows
boxes in 1993 and another model in 1997. Far too expensive for private
users ...
On 8/2/2025 10:33 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
But, it seems to have a few obvious weak points for RISC-V:
Crappy with arrays;
Crappy with code with lots of large immediate values;
Crappy with code which mostly works using lots of global variables;
Say, for example, a lot of Apogee / 3D Realms code;
They sure do like using lots of global variables.
id Software also likes globals, but not as much.
...
Let's see:
#include <stddef.h>
long arrays(long *v, size_t n)
{
long i, r;
for (i=0, r=0; i<n; i++)
r+=v[i];
return r;
}
What if I manually translate to XG3?:
arrays:
MOV 0, X14
MOV 0, X13
BLE X11, X0, .L0
.L1:
MOV.Q (X10, X13), X12
ADD 1, X13
ADD X12, X14
BLT X11, X13, .L1
.L0:
MOV X14, X10
RTS
OK, 9 words.
If I added the pair-packing feature, it could potentially be reduced to 7
words (4 instructions could be merged into 2 words).
On Sat, 2 Aug 2025 09:07:14 -0000 (UTC), Thomas Koenig wrote:
Vobis (a German discount computer reseller) offered Alpha-based Windows
boxes in 1993 and another model in 1997. Far too expensive for private
users ...
And what a waste of a 64-bit architecture, to run it in 32-bit-only
mode ...
Lawrence D'Oliveiro [2025-08-02 23:21:18] wrote:
On Sat, 2 Aug 2025 09:07:14 -0000 (UTC), Thomas Koenig wrote:
Vobis (a German discount computer reseller) offered Alpha-based
Windows boxes in 1993 and another model in 1997. Far too expensive
for private users ...
And what a waste of a 64-bit architecture, to run it in 32-bit-only
mode ...
What do you mean by that?
On Sat, 02 Aug 2025 23:10:56 -0400, Stefan Monnier wrote:
Lawrence D'Oliveiro [2025-08-02 23:21:18] wrote:
On Sat, 2 Aug 2025 09:07:14 -0000 (UTC), Thomas Koenig wrote:
Vobis (a German discount computer reseller) offered Alpha-based
Windows boxes in 1993 and another model in 1997. Far too expensive
for private users ...
And what a waste of a 64-bit architecture, to run it in 32-bit-only
mode ...
What do you mean by that?
Of all the major OSes for Alpha, Windows NT was the only one
that couldn’t take advantage of the 64-bit architecture.
In comp.arch Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
Did the VAX 11/780 have writable microcode?
Yes, 12 kB (2K words of 96 bits each).
One piece of supporting software
was a VAX emulator IIRC called FX11: it allowed running unmodified
VAX binaries.
OTOH Unix for Alpha was claimed to be pure 64-bit.
The C environment for DEC OSF/1 was an I32LP64 setup, not an ILP64
setup, so can you really call it pure?
On Sun, 03 Aug 2025 16:51:10 GMT, Anton Ertl wrote:
The C environment for DEC OSF/1 was an I32LP64 setup, not an ILP64
setup, so can you really call it pure?
As far as I’m aware, I32LP64 is the standard across 64-bit *nix systems.
Microsoft’s compilers for 64-bit Windows do LLP64. Not aware of any platforms that do/did ILP64.
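A quick, self-contained way to see which data model a given compiler uses; the three models named above show up directly in these sizes:

#include <stdio.h>

int main(void)
{
    /* I32LP64 (typical 64-bit Unix): int=4, long=8, long long=8, void*=8
       LLP64   (64-bit Windows):      int=4, long=4, long long=8, void*=8
       ILP64   (rare):                int=8, long=8, long long=8, void*=8 */
    printf("int=%zu long=%zu long long=%zu void*=%zu\n",
           sizeof(int), sizeof(long), sizeof(long long), sizeof(void *));
    return 0;
}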
On 8/3/2025 7:04 PM, Lawrence D'Oliveiro wrote:
On Sun, 03 Aug 2025 16:51:10 GMT, Anton Ertl wrote:
The C environment for DEC OSF/1 was an I32LP64 setup, not an ILP64
setup, so can you really call it pure?
As far as I’m aware, I32LP64 is the standard across 64-bit *nix systems.
Microsoft’s compilers for 64-bit Windows do LLP64. Not aware of any
platforms that do/did ILP64.
Yeah, pretty much nothing does ILP64, and doing so would actually be a problem.
Also, C type names:
char : 8 bit
short : 16 bit
int : 32 bit
long : 64 bit
long long: 64 bit
If 'int' were 64-bits, then what about 16 and/or 32 bit types.
short short?
long short?
...
Current system seems preferable.
Well, at least in absence of maybe having the compiler specify actual fixed-size types.
Or, say, what if there was a world where the actual types were, say:
_Int8, _Int16, _Int32, _Int64, _Int128
And, then, say:
char, short, int, long, ...
Were seen as aliases.
Well, maybe along with __int64 and friends, but __int64 and _Int64 could
be seen as equivalent.
Then of course, the "stdint.h" types.
Traditionally, these are a bunch of typedef's to the 'int' and friends.
But, one can imagine a hypothetical world where stdint.h contained
things like, say:
typedef _Int32 int32_t;
C keeps borrowing more and more PL/I features.
On 8/3/2025 7:04 PM, Lawrence D'Oliveiro wrote:
On Sun, 03 Aug 2025 16:51:10 GMT, Anton Ertl wrote:
The C environment for DEC OSF/1 was an I32LP64 setup, not an ILP64
setup, so can you really call it pure?
As far as I’m aware, I32LP64 is the standard across 64-bit *nix
systems.
Microsoft’s compilers for 64-bit Windows do LLP64. Not aware of any platforms that do/did ILP64.
Yeah, pretty much nothing does ILP64, and doing so would actually be
a problem.
Also, C type names:
char : 8 bit
short : 16 bit
int : 32 bit
Except in embedded, 16 bit are not rare
long : 64 bit
Except for majority of the world where long is 32 bit
long long: 64 bit
If 'int' were 64-bits, then what about 16 and/or 32 bit types.
short short?
long short?
...
Current system seems preferable.
Well, at least in absence of maybe having the compiler specify actual fixed-size types.
Or, say, what if there was a world where the actual types were, say:
_Int8, _Int16, _Int32, _Int64, _Int128
And, then, say:
char, short, int, long, ...
Were seen as aliases.
Well, maybe along with __int64 and friends, but __int64 and _Int64
could be seen as equivalent.
Then of course, the "stdint.h" types.
Traditionally, these are a bunch of typedef's to the 'int' and
friends. But, one can imagine a hypothetical world where stdint.h
contained things like, say:
typedef _Int32 int32_t;
...
On 8/3/25 19:07, BGB wrote:
Like PL/I which lets you specify any precision: FIXED BINARY(31),
FIXED BINARY(63) etc.
C keeps borrowing more and more PL/I features.
antispam@fricas.org (Waldek Hebisch) writes:
One piece of supporting software
was a VAX emulator IIRC called FX11: it allowed running unmodified
VAX binaries.
There was also a static binary translator for DecStation binaries. I
never used it, but a collegue tried to. He found that on the Prolog
systems that he tried it with (I think it was Quintus or SICStus), it
did not work, because that system did unusual things with the binary,
and that did not work on the result of the binary translation. Moral
of the story: Better use dynamic binary translation (which Apple did
for their 68K->PowerPC transition at around the same time).
OTOH Unix for Alpha was claimed to be pure 64-bit.
It depends on the kind of purity you are aspiring to. After a bunch
of renamings it was finally called Tru64 UNIX. Not Pur64, but
Tru64:-) Before that, it was called Digital UNIX (but once DEC had
been bought by Compaq, that was no longer appropriate), and before
that, DEC OSF/1 AXP.
The C environment for DEC OSF/1 was an I32LP64 setup, not an ILP64
setup, so can you really call it pure?
In addition there were some OS features for running ILP32 programs,
similar to Linux' MAP_32BIT flag for mmap(). IIRC Netscape Navigator
was compiled as ILP32 program (the C compiler had a flag for that),
and needed these OS features.
- anton
Maybe MIPS-to-Alpha was static simply because it had much lower
priority within DEC?
Actually, in our world the latest C standard (C23) has them, but the
spelling is different: _BitInt(32) and unsigned _BitInt(32).
I'm not sure if any major compiler already has them implemented. Bing
copilot says that clang does, but I don't tend to believe everything Bing
copilot says.
Maybe MIPS-to-Alpha was static simply because it had much lower
priority within DEC?
On Sun, 3 Aug 2025 21:07:02 -0500
BGB <cr88192@gmail.com> wrote:
Except for majority of the world where long is 32 bit
Michael S <already5chosen@yahoo.com> writes:
On Sun, 3 Aug 2025 21:07:02 -0500
BGB <cr88192@gmail.com> wrote:
Except for majority of the world where long is 32 bit
What majority? Linux owns the server market, the
appliance market and much of the handset market (which apple
dominates with their OS). And all Unix/Linux systems have
64-bit longs on 64-bit CPUs.
Michael S <already5chosen@yahoo.com> writes:
Actually, in our world the latest C standard (C23) has them, but the
spelling is different: _BitInt(32) and unsigned _BitInt(32).
I'm not sure if any major compiler already has them implemented. Bing
copilot says that clang does, but I don't tend to believe everything Bing
copilot says.
I asked godbolt, and tried the following program:
typedef ump unsigned _BitInt(65535);
ump sum3(ump a, ump b, ump c)
{
return a+b+c;
}
and for the C setting gcc-15.1 AMD64 produces 129 lines of assembly
language code; for C++ it complains about the syntax. For 65536 bits,
it complains about being beyond the maximum number of 65535 bits.
For the same program with the C setting clang-20.1 produces 29547
lines of assembly language code; that's more than 28 instructions for
every 64-bit word of output, which seems excessive to me, even if you
don't use ADX instructions (which clang apparently does not); I expect
that clang will produce better code at some point in the future.
Compiling this function also takes noticeable time, and when I ask for
1000000 bits, clang still does not complain about too many bits, but
godbolt's timeout strikes; I finally found out clang's limit: 8388608
bits. On clang-20.1 the C++ setting also accepts this kind of input.
Scott Lurndal wrote:
Michael S <already5chosen@yahoo.com> writes:
On Sun, 3 Aug 2025 21:07:02 -0500
BGB <cr88192@gmail.com> wrote:
Except for majority of the world where long is 32 bit
What majority? Linux owns the server market, the
appliance market and much of the handset market (which apple
dominates with their OS). And all Unix/Linux systems have
64-bit longs on 64-bit CPUs.
Apple/iPhone might dominate in the US market (does it?), but in the rest
of the world Android (with linux) is far larger. World total is 72%
Android, 28% iOS.
Michael S <already5chosen@yahoo.com> writes:
On Sun, 3 Aug 2025 21:07:02 -0500
BGB <cr88192@gmail.com> wrote:
Except for majority of the world where long is 32 bit
What majority? Linux owns the server market, the
appliance market and much of the handset market (which apple
dominates with their OS). And all Unix/Linux systems have
64-bit longs on 64-bit CPUs.
Michael S <already5chosen@yahoo.com> writes:
Actually, in our world the latest C standard (C23) has them, but the
spelling is different: _BitInt(32) and unsigned _BitInt(32).
I'm not sure if any major compiler already has them implemented. Bing
copilot says that clang does, but I don't tend to believe everything
Bing copilot says.
I asked godbolt, and tried the following program:
typedef ump unsigned _BitInt(65535);
ump sum3(ump a, ump b, ump c)
{
return a+b+c;
}
and for the C setting gcc-15.1 AMD64 produces 129 lines of assembly
language code; for C++ it complains about the syntax. For 65536 bits,
it complains about being beyond the maximum number of 65535 bits.
For the same program with the C setting clang-20.1 produces 29547
lines of assembly language code; that's more than 28 instructions for
every 64-bit word of output, which seems excessive to me, even if you
don't use ADX instructions (which clang apparently does not); I expect
that clang will produce better code at some point in the future.
Compiling this function also takes noticeable time, and when I ask for
1000000 bits, clang still does not complain about too many bits, but
godbolt's timeout strikes; I finally found out clang's limit: 8388608
bits. On clang-20.1 the C++ setting also accepts this kind of input.
Followups set to comp.arch.
- anton
And what a waste of a 64-bit architecture, to run it in 32-bit-only
mode ...
What do you mean by that? IIUC, the difference between 32bit and
64bit (in terms of cost of designing and producing the CPU) was very
small. MIPS happily designed their R4000 as 64bit while knowing that
most of them would never get a chance to execute an instruction that
makes use of the upper 32bits.
On Mon, 04 Aug 2025 14:22:14 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Sun, 3 Aug 2025 21:07:02 -0500
BGB <cr88192@gmail.com> wrote:
Except for majority of the world where long is 32 bit
What majority? Linux owns the server market, the
appliance market and much of the handset market (which apple
dominates with their OS). And all Unix/Linux systems have
64-bit longs on 64-bit CPUs.
Majority of the world is embedded. Overwhelming majority of embedded is
32-bit or narrower.
Skimming the article on "Binary Translation" in Digital Technical
Journal Vol. 4 No. 4, 1992
<https://dn790009.ca.archive.org/0/items/bitsavers_decdtjdtjv_19086731/dtj_v04-04_1992.pdf>,
it seems that both VEST (VAX VMS->Alpha VMS) and mx (MIPS Ultrix ->
Alpha OSF/1) used a hybrid approach. These binary translators took an
existing binary for one system and produced a binary for the other
system, but included a run-time system that would do binary
translation of run-time-generated code.
Michael S <already5chosen@yahoo.com> writes:
Actually, in our world the latest C standard (C23) has them, but the
spelling is different: _BitInt(32) and unsigned _BitInt(32).
I'm not sure if any major compiler already has them implemented. Bing
copilot says that clang does, but I don't tend to believe everything Bing
copilot says.
I asked godbolt, and tried the following program:
typedef ump unsigned _BitInt(65535);
The actual compiling version is:
typedef unsigned _BitInt(65535) ump;
ump sum3(ump a, ump b, ump c)
{
return a+b+c;
}
and for the C setting gcc-15.1 AMD64 produces 129 lines of assembly
language code; for C++ it complains about the syntax. For 65536 bits,
it complains about being beyond the maximum number of 65535 bits.
For the same program with the C setting clang-20.1 produces 29547
lines of assembly language code; that's more than 28 instructions for
every 64-bit word of output, which seems excessive to me, even if you
don't use ADX instructions (which clang apparently does not);
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Michael S <already5chosen@yahoo.com> writes:
Actually, in our world the latest C standard (C23) has them, but the
spelling is different: _BitInt(32) and unsigned _BitInt(32).
I'm not sure if any major compiler already has them implemented.
Bing copilot says that clang does, but I don't tend to believe
everything Bing copilot says.
I asked godbolt, and tried the following program:
typedef ump unsigned _BitInt(65535);
The actual compiling version is:
typedef unsigned _BitInt(65535) ump;
ump sum3(ump a, ump b, ump c)
{
return a+b+c;
}
and for the C setting gcc-15.1 AMD64 produces 129 lines of assembly
language code; for C++ it complains about the syntax. For 65536
bits, it complains about being beyond the maximum number of 65535
bits.
For the same program with the C setting clang-20.1 produces 29547
lines of assembly language code; that's more than 28 instructions for
every 64-bit word of output, which seems excessive to me, even if you
don't use ADX instructions (which clang apparently does not); I
expect that clang will produce better code at some point in the
future. Compiling this function also takes noticeable time, and when
I ask for 1000000 bits, clang still does not complain about too many
bits, but godbolt's timeout strikes; I finally found out clang's
limit: 8388608 bits. On clang-20.1 the C++ setting also accepts
this kind of input.
- anton
On Mon, 04 Aug 2025 14:51:41 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Michael S <already5chosen@yahoo.com> writes:
Actually, in our world the latest C standard (C23) has them, but
the spelling is different: _BitInt(32) and unsigned _BitInt(32).
I'm not sure if any major compiler already has them implemented.
Bing copilot says that clang does, but I don't tend to believe
everything Bing copilot says.
I asked godbolt, and tried the following program:
typedef ump unsigned _BitInt(65535);
The actual compiling version is:
typedef unsigned _BitInt(65535) ump;
ump sum3(ump a, ump b, ump c)
{
return a+b+c;
}
and for the C setting gcc-15.1 AMD64 produces 129 lines of assembly
language code; for C++ it complains about the syntax. For 65536
bits, it complains about being beyond the maximum number of 65535
bits.
For the same program with the C setting clang-20.1 produces 29547
lines of assembly language code; that's more than 28 instructions
for every 64-bit word of output, which seems excessive to me, even
if you don't use ADX instructions (which clang apparently does
not); I expect that clang will produce better code at some point
in the future. Compiling this function also takes noticeable time,
and when I ask for 1000000 bits, clang still does not complain about
too many bits, but godbolt's timeout strikes; I finally found out
clang's limit: 8388608 bits. On clang-20.1 the C++ setting also
accepts this kind of input.
- anton
On my PC, with the following flags '-S -O -pedantic -std=c23 -march=native',
compilation was not too slow - approximately 200 msec for gcc, 600
msec for clang. In the case of gcc, most of the time was likely consumed
by the [anti]virus rather than by the compiler itself.
Sizes (instructions only, directives and labels removed) were as
follows:
    N      gcc-win64   gcc-sysv   clang-win64
  65472        56          46        10041
  65535        71          62        10050
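For anyone who wants to repeat the measurement, the complete test file
is just the corrected typedef plus the sum3 function quoted above (the
file name here is made up); it builds with the flags given above, e.g.
gcc -S -O -pedantic -std=c23 sum3.c:

/* sum3.c - the C23 _BitInt test case from this thread */
typedef unsigned _BitInt(65535) ump;

ump sum3(ump a, ump b, ump c)
{
    return a + b + c;
}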
On Sat, 02 Aug 2025 23:10:56 -0400
Stefan Monnier <monnier@iro.umontreal.ca> wrote:
And what a waste of a 64-bit architecture, to run it in 32-bit-only
mode ...
What do you mean by that? IIUC, the difference between 32bit and
64bit (in terms of cost of designing and producing the CPU) was very
small. MIPS happily designed their R4000 as 64bit while knowing that
most of them would never get a chance to execute an instruction that
makes use of the upper 32bits.
This notion that the only advantage of a 64-bit architecture is a large address space is very curious to me. Obviously that's *one* advantage,
but while I don't know the in-the-field history of heavy-duty business/ scientific computing the way some folks here do, I have not gotten the impression that a lot of customers were commonly running up against the
4 GB limit in the early '90s; meanwhile, the *other* advantage - higher performance for the same MIPS on a variety of compute-bound tasks - is
being overlooked entirely, it seems.
On Mon, 04 Aug 2025 12:09:32 GMT[...]
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
typedef ump unsigned _BitInt(65535);
ump sum3(ump a, ump b, ump c)
{
return a+b+c;
}
1. Both gcc and clang happily* accept _BitInt() syntax even when
-std=c17 or lower. Isn't there a potential name clash for existing
sources that use _BitInt() as the name of a function? I should think
more about it.
* - the only sign of less-than-perfect happiness is a warning produced
with the -pedantic flag.
Cross-posted to c.lang.c
In C17 and earlier, _BitInt is a reserved identifier. Any attempt to
use it has undefined behavior. That's exactly why new keywords are
often defined with that ugly syntax.
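To make the clash concrete, here is a hedged sketch (not from either
poster) of the kind of pre-C23 code being worried about: using the
reserved identifier as an ordinary function name. Under C17 rules this
is formally undefined behaviour but is accepted by a compiler that does
not know the keyword; once _BitInt is treated as a keyword, the
declaration no longer parses.

/* legacy code, written before _BitInt became a keyword */
int _BitInt(int width)           /* uses a reserved identifier */
{
    return (width + 7) / 8;      /* e.g. bytes needed for 'width' bits */
}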
On 8/3/25 19:07, BGB wrote:
On 8/3/2025 7:04 PM, Lawrence D'Oliveiro wrote:
On Sun, 03 Aug 2025 16:51:10 GMT, Anton Ertl wrote:
The C environment for DEC OSF/1 was an I32LP64 setup, not an ILP64
setup, so can you really call it pure?
As far as I’m aware, I32LP64 is the standard across 64-bit *nix systems.
Microsoft’s compilers for 64-bit Windows do LLP64. Not aware of any
platforms that do/did ILP64.
Yeah, pretty much nothing does ILP64, and doing so would actually be a
problem.
Also, C type names:
char : 8 bit
short : 16 bit
int : 32 bit
long : 64 bit
long long: 64 bit
If 'int' were 64-bits, then what about 16 and/or 32 bit types.
short short?
long short?
...
Current system seems preferable.
Well, at least in absence of maybe having the compiler specify actual
fixed-size types.
Or, say, what if there was a world where the actual types were, say:
_Int8, _Int16, _Int32, _Int64, _Int128
And, then, say:
char, short, int, long, ...
Were seen as aliases.
Well, maybe along with __int64 and friends, but __int64 and _Int64
could be seen as equivalent.
Then of course, the "stdint.h" types.
Traditionally, these are a bunch of typedef's to the 'int' and friends.
But, one can imagine a hypothetical world where stdint.h contained
things like, say:
typedef _Int32 int32_t;
Like PL/I which lets you specify any precision: FIXED BINARY(31), FIXED BINARY(63) etc.
C keeps borrowing more and more PL/I features.
On Sun, 3 Aug 2025 21:07:02 -0500
BGB <cr88192@gmail.com> wrote:
On 8/3/2025 7:04 PM, Lawrence D'Oliveiro wrote:
On Sun, 03 Aug 2025 16:51:10 GMT, Anton Ertl wrote:
The C environment for DEC OSF/1 was an I32LP64 setup, not an ILP64
setup, so can you really call it pure?
As far as I’m aware, I32LP64 is the standard across 64-bit *nix
systems.
Microsoft’s compilers for 64-bit Windows do LLP64. Not aware of any
platforms that do/did ILP64.
Yeah, pretty much nothing does ILP64, and doing so would actually be
a problem.
Also, C type names:
char : 8 bit
short : 16 bit
int : 32 bit
Except in embedded, where 16-bit int is not rare
long : 64 bit
Except for majority of the world where long is 32 bit
long long: 64 bit
If 'int' were 64-bits, then what about 16 and/or 32 bit types.
short short?
long short?
...
Current system seems preferable.
Well, at least in absence of maybe having the compiler specify actual
fixed-size types.
Or, say, what if there was a world where the actual types were, say:
_Int8, _Int16, _Int32, _Int64, _Int128
And, then, say:
char, short, int, long, ...
Were seen as aliases.
Actually, in our world the latest C standard (C23) has them, but the
spelling is different: _BitInt(32) and unsigned _BitInt(32).
I'm not sure if any major compiler already has them implemented. Bing
copilot says that clang does, but I don't tend to believe everything Bing
copilot says.
Well, maybe along with __int64 and friends, but __int64 and _Int64
could be seen as equivalent.
Then of course, the "stdint.h" types.
Traditionally, these are a bunch of typedef's to the 'int' and
friends. But, one can imagine a hypothetical world where stdint.h
contained things like, say:
typedef _Int32 int32_t;
...
On 8/2/2025 10:33 AM, Anton Ertl wrote:...
BGB <cr88192@gmail.com> writes:
But, it seems to have a few obvious weak points for RISC-V:
Crappy with arrays;
Crappy with code with lots of large immediate values;
Crappy with code which mostly works using lots of global variables;
Say, for example, a lot of Apogee / 3D Realms code;
They sure do like using lots of global variables.
id Software also likes globals, but not as much.
...
Let's see:
#include <stddef.h>
long arrays(long *v, size_t n)
{
long i, r;
for (i=0, r=0; i<n; i++)
r+=v[i];
return r;
}
long a, b, c, d;
void globals(void)
{
a = 0x1234567890abcdefL;
b = 0xcdef1234567890abL;
c = 0x567890abcdef1234L;
d = 0x5678901234abcdefL;
}
gcc-10.3 -Wall -O2 compiles this to the following RV64GC code:
0000000000010434 <arrays>:
10434: cd81 beqz a1,1044c <arrays+0x18>
10436: 058e slli a1,a1,0x3
10438: 87aa mv a5,a0
1043a: 00b506b3 add a3,a0,a1
1043e: 4501 li a0,0
10440: 6398 ld a4,0(a5)
10442: 07a1 addi a5,a5,8
10444: 953a add a0,a0,a4
10446: fed79de3 bne a5,a3,10440 <arrays+0xc>
1044a: 8082 ret
1044c: 4501 li a0,0
1044e: 8082 ret
0000000000010450 <globals>:
10450: 8201b583 ld a1,-2016(gp) # 12020 <__SDATA_BEGIN__>
10454: 8281b603 ld a2,-2008(gp) # 12028 <__SDATA_BEGIN__+0x8>
10458: 8301b683 ld a3,-2000(gp) # 12030 <__SDATA_BEGIN__+0x10>
1045c: 8381b703 ld a4,-1992(gp) # 12038 <__SDATA_BEGIN__+0x18>
10460: 86b1b423 sd a1,-1944(gp) # 12068 <a>
10464: 86c1b023 sd a2,-1952(gp) # 12060 <b>
10468: 84d1bc23 sd a3,-1960(gp) # 12058 <c>
1046c: 84e1b823 sd a4,-1968(gp) # 12050 <d>
10470: 8082 ret
When using -Os, arrays becomes 2 bytes shorter, but the inner loop
becomes longer.
I had not usually seen globals handled this way in RV with GCC...
When I throw it at godbolt.org, I see:
globals:
li a1,593920
addi a1,a1,-1347
li a2,38178816
li a5,-209993728
li a0,863748096
li a3,1450741760
li a4,725372928
slli a1,a1,12
addi a2,a2,-1329
addi a5,a5,1165
li a7,1450741760
addi a0,a0,1165
addi a3,a3,171
addi a4,a4,-2039
li a6,883675136
addi a1,a1,-529
addi a7,a7,171
slli a0,a0,2
slli a2,a2,35
slli a5,a5,34
slli a3,a3,32
slli a4,a4,33
addi a6,a6,-529
add a2,a2,a1
add a5,a5,a7
add a3,a3,a0
lui t1,%hi(a)
lui a7,%hi(b)
lui a0,%hi(c)
add a4,a4,a6
lui a1,%hi(d)
sd a2,%lo(a)(t1)
sd a5,%lo(b)(a7)
sd a3,%lo(c)(a0)
sd a4,%lo(d)(a1)
ret
Though, I was more talking about i386 having good code density, not so
much x86-64.
So RV64GC is smallest for the globals/large-immediate test here, and
only beaten by one byte by AMD64 for the array test. Looking at the
code generated for the inner loop of arrays(), all the inner loops
contain four instructions, so certainly in this case RV64GC is not
crappier than the others. Interestingly, the reasons for using four
instructions (rather than five) are different on these architectures:
These are micro-examples...
Makes more sense to compare something bigger.
On Sat, 02 Aug 2025 23:10:56 -0400
Stefan Monnier <monnier@iro.umontreal.ca> wrote:
And what a waste of a 64-bit architecture, to run it in 32-bit-only
mode ...
What do you mean by that? IIUC, the difference between 32bit and
64bit (in terms of cost of designing and producing the CPU) was very
small. MIPS happily designed their R4000 as 64bit while knowing that
most of them would never get a chance to execute an instruction that
makes use of the upper 32bits.
This notion that the only advantage of a 64-bit architecture is a large
address space is very curious to me. Obviously that's *one* advantage,
but while I don't know the in-the-field history of heavy-duty business/
scientific computing the way some folks here do, I have not gotten the
impression that a lot of customers were commonly running up against the
4 GB limit in the early '90s; meanwhile, the *other* advantage - higher
performance for the same MIPS on a variety of compute-bound tasks - is
being overlooked entirely, it seems.
Michael S <already5chosen@yahoo.com> writes:
Actually, in our world the latest C standard (C23) has them, but the
spelling is different: _BitInt(32) and unsigned _BitInt(32).
I'm not sure if any major compiler already has them implemented. Bing
copilot says that clang does, but I don't tend to believe everything Bing
copilot says.
I asked godbolt, and tried the following program:
typedef ump unsigned _BitInt(65535);
ump sum3(ump a, ump b, ump c)
{
return a+b+c;
}
and for the C setting gcc-15.1 AMD64 produces 129 lines of assembly
language code; for C++ it complains about the syntax. For 65536 bits,
it complains about being beyond the maximum number of 65535 bits.
For the same program with the C setting clang-20.1 produces 29547
lines of assembly language code; that's more than 28 instructions for
every 64-bit word of output, which seems excessive to me, even if you
don't use ADX instructions (which clang apparently does not); I expect
that clang will produce better code at some point in the future.
Compiling this function also takes noticeable time, and when I ask for
1000000 bits, clang still does not complain about too many bits, but
godbolt's timeout strikes; I finally found out clang's limit: 8388608
bits. On clang-20.1 the C++ setting also accepts this kind of input.
Followups set to comp.arch.
- anton
On Sat, 02 Aug 2025 09:28:17 GMT, Anton Ertl wrote:
In my RISC-VAX scenario, the RISC-VAX would be the PDP-11 followon
instead of the actual (CISC) VAX, so there would be no additional
ISA.
In order to be RISC, it would have had to add registers and remove
addressing modes from the non-load/store instructions (and replace "move"
with separate "load" and "store" instructions).
"No additional ISA" or
not, it would still have broken existing code.
Remember that VAX development started in the early-to-mid-1970s.
RISC was
still nothing more than a research idea at that point, which had yet to
prove itself.
The claim by John Savard was that the VAX "was a good match to the
technology *of its time*". It was not. It may have been a good match
for the beliefs of the time, but that's a different thing.
Stefan Monnier <monnier@iro.umontreal.ca> wrote:
scientific computing the way some folks here do, I have not gotten the
impression that a lot of customers were commonly running up against the
4 GB limit in the early '90s;
Even simple data movement (e.g. optimized memcpy) will require half
the instructions on a 64-bit architecture.
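A toy illustration of that point (not Scott's code): copying the same
buffer word by word takes half as many load/store iterations when the
word is 64 bits; a real memcpy would of course also use wider SIMD
registers, alignment handling, and so on.

#include <stddef.h>
#include <stdint.h>

/* 32-bit-word copy: e.g. 256 iterations for 1 KiB */
void copy_words32(uint32_t *dst, const uint32_t *src, size_t nbytes)
{
    for (size_t i = 0; i < nbytes / sizeof(uint32_t); i++)
        dst[i] = src[i];
}

/* 64-bit-word copy: 128 iterations for the same 1 KiB */
void copy_words64(uint64_t *dst, const uint64_t *src, size_t nbytes)
{
    for (size_t i = 0; i < nbytes / sizeof(uint64_t); i++)
        dst[i] = src[i];
}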
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
The claim by John Savard was that the VAX "was a good match to the
technology *of its time*". It was not. It may have been a good match
for the beliefs of the time, but that's a different thing.
I concur; also, the evidence of the 801 supports that (and that
was designed around the same time as the VAX).
Although, personally, I think Data General might have been the
better target. Going to Edson de Castro and telling him that he
was on the right track with the Nova from the start, and his ideas
should be extended, might have been politically easier than going
to DEC.
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
The claim by John Savard was that the VAX "was a good match to the
technology *of its time*". It was not. It may have been a good match
for the beliefs of the time, but that's a different thing.
I concur; also, the evidence of the 801 supports that (and that
was designed around the same time as the VAX).
Michael S <already5chosen@yahoo.com> writes:
On Mon, 04 Aug 2025 12:09:32 GMT[...]
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
typedef ump unsigned _BitInt(65535);
The correct syntax is :
typedef unsigned _BitInt(65535) ump;
ump sum3(ump a, ump b, ump c)
{
return a+b+c;
}
[...]
1. Both gcc and clang happily* accept _BitInt() syntax even when
-std=c17 or lower. Isn't there a potential name clash for existing
sources that use _BitInt() as the name of a function? I should think
more about it.
In C17 and earlier, _BitInt is a reserved identifier. Any attempt to
use it has undefined behavior. That's exactly why new keywords are
often defined with that ugly syntax.
Michael S <already5chosen@yahoo.com> writes:
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
BGB <cr88192@gmail.com> wrote:
Except for majority of the world where long is 32 bit
What majority? Linux owns the server market, the
appliance market and much of the handset market (which apple
dominates with their OS). And all Unix/Linux systems have
64-bit longs on 64-bit CPUs.
Majority of the world is embedded. Overwhelming majority of embedded is
32-bit or narrower.
In terms of shipped units, perhaps (although many are narrower, as you
point out). In terms of programmers, it's a fairly small fraction that
do embedded programming.
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
The claim by John Savard was that the VAX "was a good match to the technology *of its time*". It was not. It may have been a good
match for the beliefs of the time, but that's a different thing.
I concur; also, the evidence of the 801 supports that (and that
was designed around the same time as the VAX).
Although, personally, I think Data General might have been the
better target. Going to Edson de Castro and telling him that he
was on the right track with the Nova from the start, and his ideas
should be extended, might have been politically easier than going
to DEC.
According to Scott Lurndal <slp53@pacbell.net>:
Stefan Monnier <monnier@iro.umontreal.ca> wrote:
scientific computing the way some folks here do, I have not gotten
the impression that a lot of customers were commonly running up
against the 4 GB limit in the early '90s;
Mainframes certainly had more than 4GB. In 1990 the ES/9000 had more
than 4GB of "expanded" memory and by 1994 there was 8GB of main
memory, using a variety of mapping and segmentation kludges to
address from a 32 bit architecture.
Even simple data movement (e.g. optimized memcpy) will require half
the instructions on a 64-bit architecture.
Er, maybe. There were plenty of 32 bit systems with 64 bit memory.
I would expect that systems with string move instructions would
take advantage of the underlying hardware.
On Mon, 04 Aug 2025 09:53:51 -0700...
Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
In C17 and earlier, _BitInt is a reserved identifier. Any attempt to
use it has undefined behavior. That's exactly why new keywords are
often defined with that ugly syntax.
That is a language lawyer's type of reasoning. Normally gcc maintainers
are wiser than that because, well, by chance gcc happens to be a widely
used production compiler. I don't know why this time they had chosen a
less conservative road.
Although, personally, I think Data General might have been the
better target. Going to Edson de Castro and telling him that he
was on the right track with the Nova from the start, and his ideas
should be extended, might have been politically easier than going
to DEC.
Scott Lurndal [2025-08-04 15:32:55] wrote:
Michael S <already5chosen@yahoo.com> writes:
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
BGB <cr88192@gmail.com> wrote:
Except for majority of the world where long is 32 bit
What majority? Linux owns the server market, the
appliance market and much of the handset market (which apple
dominates with their OS). And all Unix/Linux systems have
64-bit longs on 64-bit CPUs.
Majority of the world is embedded. Overwhelming majority of
embedded is 32-bit or narrower.
In terms of shipped units, perhaps (although many are narrower, as
you point out). In terms of programmers, it's a fairly small
fraction that do embedded programming.
Yeah, the unit of measurement is a problem.
I wonder how it compares if you look at number of programmers paid to
write C code (after all, we're talking about C).
In the desktop/server/laptop/handheld world, AFAICT the market share
of C has shrunk significantly over the years whereas I get the
impression that it's still quite strong in the embedded space. But I
don't have any hard data.
Stefan
Stefan Monnier <monnier@iro.umontreal.ca> wrote:
What do you mean by that? IIUC, the difference between 32bit and
64bit (in terms of cost of designing and producing the CPU) was very
small. MIPS happily designed their R4000 as 64bit while knowing that
most of them would never get a chance to execute an instruction that
makes use of the upper 32bits.
This notion that the only advantage of a 64-bit architecture is a large
address space is very curious to me.
On 2025-08-04 15:03, Michael S wrote:
On Mon, 04 Aug 2025 09:53:51 -0700...
Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
In C17 and earlier, _BitInt is a reserved identifier. Any attempt
to use it has undefined behavior. That's exactly why new keywords
are often defined with that ugly syntax.
That is language lawyer's type of reasoning. Normally gcc
maintainers are wiser than that because, well, by chance gcc
happens to be widely used production compiler. I don't know why
this time they had chosen less conservative road.
If _BitInt is accepted by older versions of gcc, that means it was
supported as a fully-conforming extension to C. Allowing
implementations to support extensions in a fully-conforming manner is
one of the main purposes for which the standard reserves identifiers.
If you thought that gcc was too conservative to support extensions,
you must be thinking of the wrong organization.
On Mon, 4 Aug 2025 15:25:54 -0400
James Kuyper <jameskuyper@alumni.caltech.edu> wrote:
On 2025-08-04 15:03, Michael S wrote:
On Mon, 04 Aug 2025 09:53:51 -0700...
Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
In C17 and earlier, _BitInt is a reserved identifier. Any attempt
to use it has undefined behavior. That's exactly why new keywords
are often defined with that ugly syntax.
That is language lawyer's type of reasoning. Normally gcc
maintainers are wiser than that because, well, by chance gcc
happens to be widely used production compiler. I don't know why
this time they had chosen less conservative road.
If _BitInt is accepted by older versions of gcc, that means it was
supported as a fully-conforming extension to C. Allowing
implementations to support extensions in a fully-conforming manner is
one of the main purposes for which the standard reserves identifiers.
If you thought that gcc was too conservative to support extensions,
you must be thinking of the wrong organization.
I know that gcc supports extensions.
I also know that gcc didn't support *this particular extension* up
until quite recently. I would guess, up until this calendar year.
Introducing a new extension without a way to disable it is different from
supporting gradually introduced extensions, typically with names that
start with a double underscore and often with __builtin.
BTW, I still haven't thought deeply about it and still hope that outside
of C23 mode gcc somehow took care to make a name clash unlikely.
Thomas Koenig wrote:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
The claim by John Savard was that the VAX "was a good match to the
technology *of its time*". It was not. It may have been a good match
for the beliefs of the time, but that's a different thing.
I concur; also, the evidence of the 801 supports that (and that
was designed around the same time as the VAX).
Although, personally, I think Data General might have been the
better target. Going to Edson de Castro and telling him that he
was on the right track with the Nova from the start, and his ideas
should be extended, might have been politically easier than going
to DEC.
DG's 32-bit Eclipse MV-8000 was also microcoded.
The ECLIPSE MV-8000 Microsequencer 1980 https://dl.acm.org/doi/pdf/10.1145/1014190.802716
In the IBM 5100, the cpu name PALM stands for "Put All Logic in Microcode".
They weren't looking at this with the necessary set of eyes.
The microcoded design approach views instruction execution as a large, single, *monolithic* state machine performing a sequential series of steps (aside from maybe having a prefetch buffer).
Few viewed this as a set of simple, parallel hardware tasks passing values
between them. Once one looks at it this way, one starts to look for
bottlenecks in that process, and many of the RISC design guidelines emerge
as potential optimizations.
According to Scott Lurndal <slp53@pacbell.net>:
Stefan Monnier <monnier@iro.umontreal.ca> wrote:
scientific computing the way some folks here do, I have not gotten the
impression that a lot of customers were commonly running up against the
4 GB limit in the early '90s;
Mainframes certainly had more than 4GB. In 1990 the ES/9000 had more
than 4GB of "expanded" memory and by 1994 there was 8GB of main memory,
using a variety of mapping and segmentation kludges to address from a
32 bit architecture.
On Mon, 4 Aug 2025 18:16:45 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Although, personally, I think Data General might have been the
better target. Going to Edson de Castro and telling him that he
was on the right track with the Nova from the start, and his ideas
should be extended, might have been politically easier than going
to DEC.
I don't quite understand the context of this comment. Can you elaborate?
On Mon, 04 Aug 2025 15:09:55 -0400
Stefan Monnier <monnier@iro.umontreal.ca> wrote:
Scott Lurndal [2025-08-04 15:32:55] wrote:
Michael S <already5chosen@yahoo.com> writes:
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
BGB <cr88192@gmail.com> wrote:
Except for majority of the world where long is 32 bit
What majority? Linux owns the server market, the
appliance market and much of the handset market (which apple
dominates with their OS). And all Unix/Linux systems have
64-bit longs on 64-bit CPUs.
Majority of the world is embedded. Overwhelming majority of
embedded is 32-bit or narrower.
In terms of shipped units, perhaps (although many are narrower, as
you point out). In terms of programmers, it's a fairly small
fraction that do embedded programming.
Yeah, the unit of measurement is a problem.
I wonder how it compares if you look at number of programmers paid to
write C code (after all, we're talking about C).
In the desktop/server/laptop/handheld world, AFAICT the market share
of C has shrunk significantly over the years whereas I get the
impression that it's still quite strong in the embedded space. But I
don't have any hard data.
Stefan
Personally, [outside of Usenet and rwt forum] I know no one except
myself who writes C targeting user mode on "big" computers (big, in my
definitions, starts at smartphone).
Myself, I am doing it more as a
hobby and to make a point rather than out of professional needs.
Professionally, in this range I tend to use C++. Not a small part of it
is that C++ is more familiar than C for my younger co-workers.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Michael S <already5chosen@yahoo.com> writes:
Actually, in our world the latest C standard (C23) has them, but the
spelling is different: _BitInt(32) and unsigned _BitInt(32).
I'm not sure if any major compiler already has them implemented. Bing
copilot says that clang does, but I don't tend to believe everything Bing
copilot says.
I asked godbolt, and tried the following program:
typedef ump unsigned _BitInt(65535);
The actual compiling version is:
typedef unsigned _BitInt(65535) ump;
ump sum3(ump a, ump b, ump c)
{
return a+b+c;
}
Michael S <already5chosen@yahoo.com> schrieb:
On Mon, 4 Aug 2025 18:16:45 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Although, personally, I think Data General might have been the
better target. Going to Edson de Castro and telling him that he
was on the right track with the Nova from the start, and his ideas
should be extended, might have been politically easier than going
to DEC.
I don't quite understand the context of this comment. Can you
elaborate?
De Castro had had a big success with a simple load-store
architecture, the Nova. He did that to reduce CPU complexity
and cost, to compete with DEC and its PDP-8. (Byte addressing
was horrible on the Nova, though).
Now, assume that, as a time traveler wanting to kick off an early
RISC revolution, you are not allowed to reveal that you are a time
traveler (which would have larger effects than just a different
computer architecture). What do you do?
a) You go to DEC
b) You go to Data General
c) You found your own company
My guess would be that, with DEC, you would have the least chance of convincing corporate brass of your ideas. With Data General, you
could try appealing to the CEO's personal history of creating the
Nova, and thus his vanity. That could work. But your own company
might actually be the best choice, if you can get the venture
capital funding.
John Levine <johnl@taugh.com> schrieb:
According to Scott Lurndal <slp53@pacbell.net>:
Stefan Monnier <monnier@iro.umontreal.ca> wrote:
scientific computing the way some folks here do, I have not gotten the
impression that a lot of customers were commonly running up against the
4 GB limit in the early '90s;
Mainframes certainly had more than 4GB. In 1990 the ES/9000 had more
than 4GB of "expanded" memory and by 1994 there was 8GB of main memory,
using a variety of mapping and segmentation kludges to address from a
32 bit architecture.
#ifdef PEDANTIC
Actually, 31 bits.
#endif
This notion that the only advantage of a 64-bit architecture is a large address space is very curious to me. Obviously that's *one* advantage,
but while I don't know the in-the-field history of heavy-duty business/ scientific computing the way some folks here do, I have not gotten the impression that a lot of customers were commonly running up against the
4 GB limit in the early '90s;
Michael S <already5chosen@yahoo.com> writes:
On Mon, 04 Aug 2025 15:09:55 -0400
Stefan Monnier <monnier@iro.umontreal.ca> wrote:
Scott Lurndal [2025-08-04 15:32:55] wrote:
Michael S <already5chosen@yahoo.com> writes:
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
BGB <cr88192@gmail.com> wrote:
Except for majority of the world where long is 32 bit
What majority? Linux owns the server market, the
appliance market and much of the handset market (which apple
dominates with their OS). And all Unix/Linux systems have
64-bit longs on 64-bit CPUs.
Majority of the world is embedded. Overwhelming majority of
embedded is 32-bit or narrower.
In terms of shipped units, perhaps (although many are narrower,
as you point out). In terms of programmers, it's a fairly small
fraction that do embedded programming.
Yeah, the unit of measurement is a problem.
I wonder how it compares if you look at number of programmers paid
to write C code (after all, we're talking about C).
In the desktop/server/laptop/handheld world, AFAICT the market
share of C has shrunk significantly over the years whereas I get
the impression that it's still quite strong in the embedded space.
But I don't have any hard data.
Stefan
Personally, [outside of Usenet and rwt forum] I know no one except
myself who writes C targeting user mode on "big" computers (big, in
my definitions, starts at smartphone).
Linux developers would be a significant, if not large, pool
of C programmers.
Myself, I am doing it more as a
hobby and to make a point rather than out of professional needs.
Professionally, in this range I tend to use C++. Not a small part of
it is that C++ is more familiar than C for my younger co-workers.
Likewise, I've been using C++ rather than C since 1989, including for large-scale operating systems and hypervisors (both running on bare
metal).
Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Michael S <already5chosen@yahoo.com> writes:
Actually, in our world the latest C standard (C23) has them, but
the spelling is different: _BitInt(32) and unsigned _BitInt(32).
I'm not sure if any major compiler already has them implemented.
Bing copilot says that clang does, but I don't tend to believe
everything Bing copilot says.
I asked godbolt, and tried the following program:
typedef ump unsigned _BitInt(65535);
The actual compiling version is:
typedef unsigned _BitInt(65535) ump;
ump sum3(ump a, ump b, ump c)
{
return a+b+c;
}
I would naively expect the ump type to be defined as an array of
unsigned (byte/short/int/long), possibly with a header defining how
large the allocation is and how many bits are currently defined.
The actual code to add three of them could be something like
xor rax,rax
next:
add rax,[rsi+rcx*8]
adc rdx,0
add rax,[r8+rcx*8]
adc rdx,0
add rax,[r9+rcx*8]
adc rdx,0
mov [rdi+rcx*8],rax
mov rax,rdx
inc rcx
cmp rcx,r10
jb next
The main problem here is of course that every add operation depends
on the previous, so max speed would be 4-5 clock cycles/iteration.
Terje
On 8/4/2025 8:32 AM, John Ames wrote:
snip
This notion that the only advantage of a 64-bit architecture is a
large address space is very curious to me. Obviously that's *one* advantage, but while I don't know the in-the-field history of
heavy-duty business/ scientific computing the way some folks here
do, I have not gotten the impression that a lot of customers were
commonly running up against the 4 GB limit in the early '90s;
Not exactly the same, but I recall an issue with Windows NT where it initially divided the 4GB address space in 2 GB for the OS, and 2GB
for users. Some users were "running out of address space", so
Microsoft came up with an option to reduce the OS space to 1 GB, thus allowing up to 3 GB for users. I am sure others here will know more
details.
On Mon, 04 Aug 2025 20:29:35 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
On Mon, 04 Aug 2025 15:09:55 -0400
Stefan Monnier <monnier@iro.umontreal.ca> wrote:
Scott Lurndal [2025-08-04 15:32:55] wrote:
Michael S <already5chosen@yahoo.com> writes:
scott@slp53.sl.home (Scott Lurndal) wrote:
Michael S <already5chosen@yahoo.com> writes:
BGB <cr88192@gmail.com> wrote:
Except for majority of the world where long is 32 bit
What majority? Linux owns the server market, the
appliance market and much of the handset market (which apple
dominates with their OS). And all Unix/Linux systems have
64-bit longs on 64-bit CPUs.
Majority of the world is embedded. Overwhelming majority of
embedded is 32-bit or narrower.
In terms of shipped units, perhaps (although many are narrower,
as you point out). In terms of programmers, it's a fairly small
fraction that do embedded programming.
Yeah, the unit of measurement is a problem.
I wonder how it compares if you look at number of programmers paid
to write C code (after all, we're talking about C).
In the desktop/server/laptop/handheld world, AFAICT the market
share of C has shrunk significantly over the years whereas I get
the impression that it's still quite strong in the embedded space.
But I don't have any hard data.
Stefan
Personally, [outside of Usenet and rwt forum] I know no one except
myself who writes C targeting user mode on "big" computers (big, in
my definitions, starts at smartphone).
Linux developers would be a significant, if not large, pool
of C programmers.
According to my understanding, Linux developers *maintain* user-mode C
programs. They very rarely start new user-mode C programs from scratch.
The last big one I can think about was git almost 2 decades ago. And
even that happened more due to personal idiosyncrasies of its
originator than for solid technical reasons.
I could be wrong about it, of course.
For a few of your previous projects I am convinced that it was the wrong
tool.
Why not go to somebody who has money and interest to build a
microprocessor, but no existing mini/mainframe/SuperC business?
On 8/4/2025 8:32 AM, John Ames wrote:
snip
This notion that the only advantage of a 64-bit architecture is a large
address space is very curious to me. Obviously that's *one* advantage,
but while I don't know the in-the-field history of heavy-duty business/
scientific computing the way some folks here do, I have not gotten the
impression that a lot of customers were commonly running up against the
4 GB limit in the early '90s;
Not exactly the same, but I recall an issue with Windows NT where it
initially divided the 4GB address space in 2 GB for the OS, and 2GB for
users. Some users were "running out of address space", so Microsoft
came up with an option to reduce the OS space to 1 GB, thus allowing up
to 3 GB for users. I am sure others here will know more details.
On Mon, 4 Aug 2025 20:13:54 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Michael S <already5chosen@yahoo.com> schrieb:
On Mon, 4 Aug 2025 18:16:45 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Although, personally, I think Data General might have been the
better target. Going to Edson de Castro and telling him that he
was on the right track with the Nova from the start, and his ideas
should be extended, might have been politically easier than going
to DEC.
I don't quite understand the context of this comment. Can you
elaborate?
De Castro had had a big success with a simple load-store
architecture, the Nova. He did that to reduce CPU complexity
and cost, to compete with DEC and its PDP-8. (Byte addressing
was horrible on the Nova, though).
Now, assume that, as a time traveler wanting to kick off an early
RISC revolution, you are not allowed to reveal that you are a time
traveler (which would have larger effects than just a different
computer architecture). What do you do?
a) You go to DEC
b) You go to Data General
c) You found your own company
My guess would be that, with DEC, you would have the least chance of
convincing corporate brass of your ideas. With Data General, you
could try appealing to the CEO's personal history of creating the
Nova, and thus his vanity. That could work. But your own company
might actually be the best choice, if you can get the venture
capital funding.
Why not go to somebody who has money and interest to build a
microprocessor, but no existing mini/mainframe/SuperC business?
If we limit ourselves to the USA, then Moto, Intel, AMD, NatSemi...
Maybe even AT&T? Or was AT&T still banned from making computers in
the mid 70s?
On Mon, 4 Aug 2025 22:49:23 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Michael S <already5chosen@yahoo.com> writes:
Actually, in our world the latest C standard (C23) has them, but
the spelling is different: _BitInt(32) and unsigned _BitInt(32).
I'm not sure if any major compiler already has them implemented.
Bing copilot says that clang does, but I don't tend to believe
everything Bing copilot says.
I asked godbolt, and tried the following program:
typedef ump unsigned _BitInt(65535);
The actual compiling version is:
typedef unsigned _BitInt(65535) ump;
ump sum3(ump a, ump b, ump c)
{
return a+b+c;
}
I would naively expect the ump type to be defined as an array of
unsigned (byte/short/int/long), possibly with a header defining how
large the allocation is and how many bits are currently defined.
The actual code to add three of them could be something like
xor rax,rax
next:
add rax,[rsi+rcx*8]
adc rdx,0
add rax,[r8+rcx*8]
adc rdx,0
add rax,[r9+rcx*8]
adc rdx,0
mov [rdi+rcx*8],rax
mov rax,rdx
inc rcx
cmp rcx,r10
jb next
The main problem here is of course that every add operation depends
on the previous, so max speed would be 4-5 clock cycles/iteration.
Terje
I would guess that even a pair of x86-style loops would likely be
faster than that on most x86-64 processors made in the last 15 years,
despite doing 1.5x more memory accesses.
; rcx = dst
; rdx = a - dst
; r8 = b - dst
mov $1024, %esi
clc
.loop1:
mov (%rcx,%r8), %rax
adc (%rcx,%rdx), %rax
mov %rax, (%rcx)
lea 8(%rcx), %rcx
dec %esi
jnz .loop1
sub $65536, %rcx
mov ..., %rdx ; %rdx = c-dst
mov $1024, %esi
clc
.loop2:
mov (%rcx,%rdx), %rax
adc %rax, (%rcx)
lea 8(%rcx), %rcx
dec %esi
jnz .loop2
...
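For comparison, a hedged C sketch of the same idea (not Michael's code)
using the x86 _addcarry_u64 intrinsic; the two carry chains are kept in
separate variables, mirroring the structure of the two loops above
(whether the compiler keeps both carries in flags is another matter).
The width of 1024 words (65536 bits) and the function name are made up
for illustration; a real unsigned _BitInt(65535) result would also need
its top bit masked.

#include <stddef.h>
#include <immintrin.h>   /* _addcarry_u64 on x86-64 */

#define NWORDS 1024      /* 65536 bits / 64 */

void ump_add3(unsigned long long *dst,
              const unsigned long long *a,
              const unsigned long long *b,
              const unsigned long long *c)
{
    unsigned char cy1 = 0, cy2 = 0;              /* one carry per chain */
    for (size_t i = 0; i < NWORDS; i++) {
        unsigned long long t;
        cy1 = _addcarry_u64(cy1, a[i], b[i], &t);    /* t   = a + b */
        cy2 = _addcarry_u64(cy2, t, c[i], &dst[i]);  /* dst = t + c */
    }
    /* carries out of the top word are discarded (wrap-around). */
}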
On Sat, 02 Aug 2025 23:10:56 -0400
Stefan Monnier <monnier@iro.umontreal.ca> wrote:
And what a waste of a 64-bit architecture, to run it in 32-bit-only
mode ...
What do you mean by that? IIUC, the difference between 32bit and
64bit (in terms of cost of designing and producing the CPU) was very
small. MIPS happily designed their R4000 as 64bit while knowing that
most of them would never get a chance to execute an instruction that
makes use of the upper 32bits.
This notion that the only advantage of a 64-bit architecture is a large address space is very curious to me. Obviously that's *one* advantage,
but while I don't know the in-the-field history of heavy-duty business/ scientific computing the way some folks here do, I have not gotten the impression that a lot of customers were commonly running up against the
4 GB limit in the early '90s; meanwhile, the *other* advantage - higher performance for the same MIPS on a variety of compute-bound tasks - is
being overlooked entirely, it seems.
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
On 8/4/2025 8:32 AM, John Ames wrote:
snip
This notion that the only advantage of a 64-bit architecture is a large
address space is very curious to me. Obviously that's *one* advantage,
but while I don't know the in-the-field history of heavy-duty business/
scientific computing the way some folks here do, I have not gotten the
impression that a lot of customers were commonly running up against the
4 GB limit in the early '90s;
Not exactly the same, but I recall an issue with Windows NT where it
initially divided the 4GB address space in 2 GB for the OS, and 2GB for
users. Some users were "running out of address space", so Microsoft
came up with an option to reduce the OS space to 1 GB, thus allowing up
to 3 GB for users. I am sure others here will know more details.
AT&T SVR[34] Unix systems had the same issue on x86, as did Linux. They
mainly used the same solution as well (give the user 3GB of virtual
address space).
I believe SVR4 was also able to leverage 36-bit physical addressing to
use more than 4GB of DRAM, while still limiting a single process to 2 or 3GB
of user virtual address space.
On 8/2/25 1:07 AM, Waldek Hebisch wrote:
IIUC PRISM eventually became Alpha.
Not really. Documents for both, including
the rare PRISM docs are on bitsavers.
PRISM came out of Cutler's DEC West group,
Alpha from the East Coast. I'm not aware
of any team member overlap.
antispam@fricas.org (Waldek Hebisch) writes:
<snip>
OTOH Unix for Alpha was claimed to be pure 64-bit.
It depends on the kind of purity you are aspiring to. After a bunch
of renamings it was finally called Tru64 UNIX. Not Pur64, but
Tru64:-) Before that, it was called Digital UNIX (but once DEC had
been bought by Compaq, that was no longer appropriate), and before
that, DEC OSF/1 AXP.
The C environment for DEC OSF/1 was an I32LP64 setup, not an ILP64
setup, so can you really call it pure?
In addition there were some OS features for running ILP32 programs,
similar to Linux' MAP_32BIT flag for mmap(). IIRC Netscape Navigator
was compiled as ILP32 program (the C compiler had a flag for that),
and needed these OS features.
antispam@fricas.org (Waldek Hebisch) writes:
In comp.arch Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
Did the VAX 11/780 have writable microcode?
Yes, 12 kB (2K words 96-bit each).
So that's 12KB of fast RAM that could have been reused for making the
cache larger in a RISC-VAX, maybe increasing its size from 2KB to
12KB.
[snip]
The C environment for DEC OSF/1 was an I32LP64 setup, not an ILP64
setup, so can you really call it pure?
In addition there were some OS features for running ILP32 programs,
similar to Linux' MAP_32BIT flag for mmap(). IIRC Netscape Navigator
was compiled as ILP32 program (the C compiler had a flag for that),
and needed these OS features.
This notion that the only advantage of a 64-bit architecture is a large address space is very curious to me.
Obviously that's *one* advantage, but while I don't know the
in-the-field history of heavy-duty business/ scientific computing
the way some folks here do, I have not gotten the impression that a
lot of customers were commonly running up against the 4 GB limit in
the early '90s ...
... meanwhile, the *other* advantage - higher performance for the
same MIPS on a variety of compute-bound tasks - is being overlooked
entirely, it seems.
Didn't the majority of 32-bit RISC machines with general-purpose ambitions
have 64-bit FP registers?
... I recall an issue with Windows NT where it initially divided the
4GB address space in 2 GB for the OS, and 2GB for users. Some users
were "running out of address space", so Microsoft came up with an
option to reduce the OS space to 1 GB, thus allowing up to 3 GB for
users. I am sure others here will know more details.
BTW: AMD-64 was a special case: since 64-bit mode was bundled with an
increased number of GPRs, with PC-relative addressing and with a
register-based calling convention, on average 64-bit code was faster than
32-bit code. And since AMD-64 was relatively late in the 64-bit game, there
was limited motivation to develop a mode using 32-bit addressing and
64-bit instructions. It works in compilers and in Linux, but support is
much worse than for using 64-bit addressing.
antispam@fricas.org (Waldek Hebisch) writes:
I can understand why DEC abandoned VAX: already in 1985 they
had some disadvantage and they saw no way to compete against
superscalar machines which were on the horizon. In 1985 they
probably realized, that their features add no value in world
using optimizing compilers.
Optimizing compilers increase the advantages of RISCs, but even with a
simple compiler Berkeley RISC II (which was made by hardware people,
not compiler people) has between 85% and 256% of VAX (11/780) speed.
It also has 16-bit and 32-bit instructions for improved code density
and (apparently from memory bandwidth issues) performance.
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
On Sat, 02 Aug 2025 09:28:17 GMT, Anton Ertl wrote:
In my RISC-VAX scenario, the RISC-VAX would be the PDP-11 followon
instead of the actual (CISC) VAX, so there would be no additional
ISA.
In order to be RISC, it would have had to add registers and remove
addressing modes from the non-load/store instructions (and replace
"move" with separate "load" and "store" instructions).
Add registers: No, ARM A32 is RISC and has as many registers as VAX ...
The essence of RISC really is just exposing what existed in the
microcode engines to user-level programming and didn't really make
sense until main memory systems got a lot faster.
a) You go to DEC
b) You go to Data General
c) You found your own company
The ban on AT&T was the whole reason they released Unix freely.
Then when things lifted (after the AT&T break-up), they tried to
re-assert their control over Unix, which backfired.
And, they tried to make and release a workstation, but by then they
were competing against the IBM PC Clone market (and also everyone
else trying to sell Unix workstations at the time), ...
On Mon, 4 Aug 2025 15:25:54 -0400
James Kuyper <jameskuyper@alumni.caltech.edu> wrote:
On 2025-08-04 15:03, Michael S wrote:
On Mon, 04 Aug 2025 09:53:51 -0700...
Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
In C17 and earlier, _BitInt is a reserved identifier. Any attempt
to use it has undefined behavior. That's exactly why new keywords
are often defined with that ugly syntax.
That is language lawyer's type of reasoning. Normally gcc
maintainers are wiser than that because, well, by chance gcc
happens to be widely used production compiler. I don't know why
this time they had chosen less conservative road.
If _BitInt is accepted by older versions of gcc, that means it was
supported as a fully-conforming extension to C. Allowing
implementations to support extensions in a fully-conforming manner is
one of the main purposes for which the standard reserves identifiers.
If you thought that gcc was too conservative to support extensions,
you must be thinking of the wrong organization.
I know that gcc supports extensions.
I also know that gcc didn't support *this particular extension* up
until quite recently. I would guess, up until this calendar year.
Introducing a new extension without a way to disable it is different from
supporting gradually introduced extensions, typically with names that
start with a double underscore and often with __builtin.
BTW, I still haven't thought deeply about it and still hope that outside
of C23 mode gcc somehow took care to make a name clash unlikely.
And as others noticed, I32LP64 was very common.
MIPS products came out of DECWRL (the research group started to build
Titan) and were stopgaps until the "real" architecture came out
(Cutler's out of DECWest)
I don't think they ever got much love out of DEC corporate and were just
done so DEC didn't completely get their lunch eaten in the Unix
workstation market.
Except for majority of the world where long is 32 bit
In comp.arch Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
antispam@fricas.org (Waldek Hebisch) writes:
In comp.arch Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
Did the VAX 11/780 have writable microcode?
Yes, 12 kB (2K words 96-bit each).
So that's 12KB of fast RAM that could have been reused for making the
cache larger in a RISC-VAX, maybe increasing its size from 2KB to
12KB.
The VAX-780 architecture handbook says the cache was 8 KB and used 8-byte
lines. So an extra 12KB of fast RAM could double the cache size.
That would be a nice improvement, but not as dramatic as an increase
from 2 KB to 12 KB.
I am not sure what technology they used
for the register file. For me the most likely is fast RAM, but that
normally would give 1 R/W port.
On Mon, 4 Aug 2025 23:24:15 -0000 (UTC), Waldek Hebisch wrote:
BTW: AMD-64 was a special case: since 64-bit mode was bundled with an
increased number of GPRs, with PC-relative addressing and with a
register-based call convention, on average 64-bit code was faster than
32-bit code. And since AMD-64 was relatively late to the 64-bit game, there
was limited motivation to develop a mode using 32-bit addressing and
64-bit instructions. It works in compilers and in Linux, but support is
much worse than for using 64-bit addressing.
Intel was trying to promote this in the form of the “X32” ABI. The Linux kernel and some distros did include support for this. I don’t think it was very popular, and it may be extinct now.
Majority of the world is embedded. Overwhelming majority of embedded is
32-bit or narrower.
On Mon, 4 Aug 2025 18:07:48 +0300, Michael S wrote:
Majority of the world is embedded. Overwhelming majority of embedded is
32-bit or narrower.
Embedded CPUs are mostly ARM, MIPS, RISC-V ... all of which are available
in 64-bit variants.
On Mon, 4 Aug 2025 14:06:17 -0700, Stephen Fuld wrote:
... I recall an issue with Windows NT where it initially divided the
4GB address space into 2 GB for the OS and 2 GB for users. Some users
were "running out of address space", so Microsoft came up with an
option to reduce the OS space to 1 GB, thus allowing up to 3 GB for
users. I am sure others here will know more details.
That would have been prone to breakage in poorly-written programs that
were using signed instead of unsigned comparisons on memory block sizes.
I hit an earlier version of this problem in about the mid-1980s, trying to help a user install WordStar on his IBM PC, which was one of the earliest machines to have 640K of RAM. The WordStar installer balked, saying he didn’t have enough free RAM!
The solution: create a dummy RAM disk to bring the free memory size down below 512K. Then after the installation succeeded, the RAM disk could be removed.
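A minimal sketch of the kind of signed-comparison bug alluded to above (hypothetical code, not from any particular program): once a user-mode block size or address can exceed 2 GB, keeping it in a signed 32-bit integer makes it look negative and perfectly good memory gets rejected.

#include <stdio.h>
#include <stdint.h>

/* Hypothetical 32-bit-era check: the size of a free memory block is held
   in a signed 32-bit int.  With a 2GB/2GB split it can never exceed
   0x7FFFFFFF, so the signed compare works; with a 3GB user split a
   2.5 GB block wraps to a negative value and is wrongly rejected. */
static int block_is_large_enough(int32_t block_size, int32_t wanted)
{
    return block_size >= wanted;     /* signed compare: the bug */
}

int main(void)
{
    uint32_t real_size = 0xA0000000u;          /* 2.5 GB */
    int32_t  as_signed = (int32_t)real_size;   /* implementation-defined; wraps negative in practice */
    printf("%d -> %s\n", as_signed,
           block_is_large_enough(as_signed, 1 << 20) ? "accepted" : "rejected");
    return 0;
}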
In article <2025Aug3.185110@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
[snip]
The C environment for DEC OSF/1 was an I32LP64 setup, not an ILP64
setup, so can you really call it pure?
In the OS kernel, often times you want to allocate physical
address space below 4GiB for e.g. device BARs; many devices are
either 32-bit (but have to work on 64-bit systems) or work
better with 32-bit BARs.
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
antispam@fricas.org (Waldek Hebisch) writes:
I can understand why DEC abandoned VAX: already in 1985 they
had some disadvantage and they saw no way to compete against
superscalar machines which were on the horizon. In 1985 they
probably realized that their features add no value in a world
using optimizing compilers.
Optimizing compilers increase the advantages of RISCs, but even with a
simple compiler Berkeley RISC II (which was made by hardware people,
not compiler people) has between 85% and 256% of VAX (11/780) speed.
It also has 16-bit and 32-bit instructions for improved code density
and (apparently from memory bandwidth issues) performance.
The basic question is if VAX could afford the pipeline. VAX had
rather complex memory and bus interface, cache added complexity
too. Ditching microcode could allow more resources for execution
path. Clearly VAX could afford and probably had 1-cycle 32-bit
ALU. I doubt that they could afford 1-cycle multiply or
even a barrel shifter. So they needed a sequencer for sane
assembly programming. I am not sure what technology they used
for the register file. To me, fast RAM seems most likely, but that
normally would give 1 R/W port. Multiported register file
probably would need a lot of separate register chips and
multiplexer. Alternatively, they could try some very fast
RAM and run it at multiple of base clock frequency (66 ns
cycle time caches were available at that time, so 3 ports
via multiplexing seem possible). But any of this adds
considerable complexity. Sane pipeline needs interlocks
and forwarding.
On Mon, 4 Aug 2025 20:13:54 -0000 (UTC), Thomas Koenig wrote:
a) You go to DEC
b) You go to Data General
c) You found your own company
How about d) Go talk to the man responsible for the fastest machines in
the world around that time, i.e. Seymour Cray?
Waldek Hebisch <antispam@fricas.org> schrieb:
I am not sure what technolgy they used
for register file. For me most likely is fast RAM, but that
normally would give 1 R/W port.
They used fast SRAM and had three copies of their registers,
for 2R1W.
On 8/4/2025 8:32 AM, John Ames wrote:
snip
This notion that the only advantage of a 64-bit architecture is a large
address space is very curious to me. Obviously that's *one* advantage,
but while I don't know the in-the-field history of heavy-duty business/
scientific computing the way some folks here do, I have not gotten the
impression that a lot of customers were commonly running up against the
4 GB limit in the early '90s;
Not exactly the same, but I recall an issue with Windows NT where it initially divided the 4GB address space into 2 GB for the OS and 2 GB for users. Some users were "running out of address space", so Microsoft
came up with an option to reduce the OS space to 1 GB, thus allowing up
to 3 GB for users. I am sure others here will know more details.
On Tue, 5 Aug 2025 00:14:43 +0300
Michael S <already5chosen@yahoo.com> wrote:
On Mon, 4 Aug 2025 22:49:23 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Michael S <already5chosen@yahoo.com> writes:
Actually, in our world the latest C standard (C23) has them, but
the spelling is different: _BitInt(32) and unsigned _BitInt(32).
I'm not sure if any major compiler already has them implemented.
Bing copilot says that clang does, but I don't tend to believe
everything Bing copilot says.
I asked godbolt, and tried the following program:
typedef ump unsigned _BitInt(65535);
The actual compiling version is:
typedef unsigned _BitInt(65535) ump;
ump sum3(ump a, ump b, ump c)
{
return a+b+c;
}
I would naively expect the ump type to be defined as an array of
unsigned (byte/short/int/long), possibly with a header defining how
large the allocation is and how many bits are currently defined.
The actual code to add three of them could be something like
xor rax,rax
next:
add rax,[rsi+rcx*8]
adc rdx,0
add rax,[r8+rcx*8]
adc rdx,0
add rax,[r9+rcx*8]
adc rdx,0
mov [rdi+rcx*8],rax
mov rax,rdx
inc rcx
cmp rcx,r10
jb next
The main problem here is of course that every add operation depends
on the previous, so max speed would be 4-5 clock cycles/iteration.
Terje
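For reference, a plain C sketch of the word-by-word three-way add that the loop above implements (my own illustration, assuming 64-bit limbs stored least-significant first and the gcc/clang unsigned __int128 extension for the wide intermediate):

#include <stdint.h>
#include <stddef.h>

/* Add three n-limb unsigned bigints; the 128-bit intermediate holds
   three 64-bit limbs plus the incoming carry, so the carry out of each
   step is at most 2. */
static void add3(uint64_t *dst, const uint64_t *a,
                 const uint64_t *b, const uint64_t *c, size_t n)
{
    unsigned __int128 carry = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 sum = carry;
        sum += a[i];
        sum += b[i];
        sum += c[i];
        dst[i] = (uint64_t)sum;   /* low 64 bits of the limb sum */
        carry  = sum >> 64;       /* 0, 1 or 2 */
    }
}

The loop-carried dependency through carry is exactly the chain that limits the asm above to roughly 4-5 cycles per iteration.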
I would guess that even a pair of x86-style loops would likely be
faster than that on most x86-64 processors made in last 15 years.
Despite doing 1.5x more memory accesses.
; rcx = dst
; rdx = a - dst
; r8 = b - dst
mov $1024, %esi
clc
.loop1:
mov (%rcx,%r8), %rax
adc (%rcx,%rdx), %rax
mov %rax, (%rcx)
lea 8(%rcx), %rcx
dec %esi
jnz .loop1
sub $65536, %rcx
mov ..., %rdx ; %rdx = c-dst
mov $1024, %esi
clc
.loop2:
mov (%rcx,%rdx), %rax
adc %rax, (%rcx)
lea 8(%rcx), %rcx
dec %esi
jnz .loop2
...
For extremely wide cores, like Apple's M (modulo ISA), AMD Zen5 and
Intel Lion Cove, I'd do the following modification to your inner loop
(back in Intel syntax):
xor ebx,ebx
next:
xor edx, edx
mov rax,[rsi+rcx*8]
add rax,[r8+rcx*8]
adc edx,edx
add rax,[r9+rcx*8]
adc edx,0
add rbx,rax
jc incremen_edx
; eliminate data dependency between loop iteration
; replace it by very predictable control dependency
edx_ready:
mov edx, ebx
mov [rdi+rcx*8],rax
inc rcx
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely. For random inputs, approximately never.
incremen_edx:
inc edx
jmp edx_ready
Less wide cores will likely benefit from reduction of the number of
executed instructions (and more importantly the number of decoded and
renamed instructions) through unrolling by 2, 3 or 4.
Stephen Fuld wrote:
On 8/4/2025 8:32 AM, John Ames wrote:
snip
This notion that the only advantage of a 64-bit architecture is a large
address space is very curious to me. Obviously that's *one* advantage,
but while I don't know the in-the-field history of heavy-duty business/
scientific computing the way some folks here do, I have not gotten the
impression that a lot of customers were commonly running up against the
4 GB limit in the early '90s;
Not exactly the same, but I recall an issue with Windows NT where it
initially divided the 4GB address space into 2 GB for the OS and 2 GB for
users. Some users were "running out of address space", so Microsoft
came up with an option to reduce the OS space to 1 GB, thus allowing up
to 3 GB for users. I am sure others here will know more details.
Any program written to Microsoft/Windows spec would work transparently
with a 3:1 split; the problem was all the programs ported from unix
which assumed that any negative return value was a failure code.
[snip]
We tend to be spoiled by modern process densities. The
VAX 11/780 was built using SSI logic chips, thus board
space and backplane wiring were significant constraints
on the logic designs of the era.
Michael S wrote:
On Tue, 5 Aug 2025 00:14:43 +0300
Michael S <already5chosen@yahoo.com> wrote:
On Mon, 4 Aug 2025 22:49:23 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Anton Ertl wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Michael S <already5chosen@yahoo.com> writes:
Actually, in our world the latest C standard (C23) has them,
but the spelling is different: _BitInt(32) and unsigned
_BitInt(32). I'm not sure if any major compiler already has
them implemented. Bing copilot says that clang does, but I
don't tend to believe everything Bing copilot says.
I asked godbolt, and tried the following program:
typedef ump unsigned _BitInt(65535);
The actual compiling version is:
typedef unsigned _BitInt(65535) ump;
ump sum3(ump a, ump b, ump c)
{
return a+b+c;
}
I would naively expect the ump type to be defined as an array of
unsigned (byte/short/int/long), possibly with a header defining
how large the allocation is and how many bits are currently
defined.
The actual code to add three of them could be something like
xor rax,rax
next:
add rax,[rsi+rcx*8]
adc rdx,0
add rax,[r8+rcx*8]
adc rdx,0
add rax,[r9+rcx*8]
adc rdx,0
mov [rdi+rcx*8],rax
mov rax,rdx
inc rcx
cmp rcx,r10
jb next
The main problem here is of course that every add operation
depends on the previous, so max speed would be 4-5 clock
cycles/iteration.
Terje
I would guess that even a pair of x86-style loops would likely be
faster than that on most x86-64 processors made in last 15 years.
Despite doing 1.5x more memory accesses.
; rcx = dst
; rdx = a - dst
; r8 = b - dst
mov $1024, %esi
clc
.loop1:
mov (%rcx,%r8), %rax
adc (%rcx,%rdx), %rax
mov %rax, (%rcx)
lea 8(%rcx), %rcx
dec %esi
jnz .loop1
sub $65536, %rcx
mov ..., %rdx ; %rdx = c-dst
mov $1024, %esi
clc
.loop2:
mov (%rcx,%rdx), %rax
adc %rax, (%rcx)
lea 8(%rcx), %rcx
dec %esi
jnz .loop2
...
For extremely wide cores, like Apple's M (modulo ISA), AMD Zen5 and
Intel Lion Cove, I'd do the following modification to your inner
loop (back in Intel syntax):
xor ebx,ebx
next:
xor edx, edx
mov rax,[rsi+rcx*8]
add rax,[r8+rcx*8]
adc edx,edx
add rax,[r9+rcx*8]
adc edx,0
add rbx,rax
jc incremen_edx
; eliminate data dependency between loop iteration
; replace it by very predictable control dependency
edx_ready:
mov edx, ebx
mov [rdi+rcx*8],rax
inc rcx
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely. For random inputs, approximately never.
incremen_edx:
inc edx
jmp edx_ready
Less wide cores will likely benefit from reduction of the number of executed instructions (and more importantly the number of decoded
and renamed instructions) through unrolling by 2, 3 or 4.
Interesting code, not totally sure that I understand how the
'ADC EDX,EDX'
really works, i.e. shifting the previous contents up while saving the
current carry.
Anyway, the three main ADD RAX,... operations still define the
minimum possible latency, right?
Terje
cross@spitfire.i.gajendra.net (Dan Cross) writes:
In article <2025Aug3.185110@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
[snip]
The C environment for DEC OSF/1 was an I32LP64 setup, not an ILP64
setup, so can you really call it pure?
In the OS kernel, often times you want to allocate physical
address space below 4GiB for e.g. device BARs; many devices are
either 32-bit (but have to work on 64-bit systems) or work
better with 32-bit BARs.
Indeed. Modern PCI controllers tend to support remapping
a 64-bit physical address in the hardware to support devices
that only advertise 32-bit bars[*]. The firmware (e.g. UEFI
or BIOS) will setup the remapping registers and provide the
address of the 64-bit aperture to the kernel via device tree
or ACPI tables.
[*] AHCI is the typical example, which uses BAR5.
Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:
In C17 and earlier, _BitInt is a reserved identifier. Any attempt to
use it has undefined behavior. That's exactly why new keywords are
often defined with that ugly syntax.
Sometimes I think there is reason to Fortran's approach of not
having defined keywords - old programs just continue to run, even
with new statements or intrinsic procedures, maybe with an addition
of an EXTERNAL statement.
On Mon, 4 Aug 2025 17:18:24 -0500, BGB wrote:
The ban on AT&T was the whole reason they released Unix freely.
It was never really “freely” available.
I'll say. We had to pay $20,000 for it in 1975. That was a lot.
Then when things lifted (after the AT&T break-up), they tried to
re-assert their control over Unix, which backfired.
They were already tightening things up from the Seventh Edition onwards -- remember, this version rescinded the permission to use the source code for classroom teaching purposes, neatly strangling the entire market for the legendary Lions Book. Which continued to spread afterwards via samizdat, nonetheless.
And, they tried to make and release a workstation, but by then they
were competing against the IBM PC Clone market (and also everyone
else trying to sell Unix workstations at the time), ...
That was a very successful market, from about the mid-1980s until the mid- to-latter 1990s. In spite of all the vendor lock-in and fragmentation, it managed to survive, I think, because of the sheer performance available in the RISC processors, which Microsoft tried to support with its new
“Windows NT” OS, but was never able to get quite right.
On Mon, 4 Aug 2025 18:07:48 +0300, Michael S wrote:
Majority of the world is embedded. Overwhelming majority of embedded is
32-bit or narrower.
Embedded CPUs are mostly ARM, MIPS, RISC-V ... all of which are available
in 64-bit variants.
On Tue, 5 Aug 2025 17:31:34 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
In this case 'adc edx,edx' is just a slightly shorter encoding
of 'adc edx,0'. The EDX register is zeroed a few lines above.
Anyway, the three main ADD RAX,... operations still define the
minimum possible latency, right?
I don't think so.
It seems to me that there is only one chain of data dependencies
between iterations of the loop - a trivial dependency through RCX. Some
modern processors are already capable of eliminating this sort of
dependency in the renamer. Probably not yet when it is coded as 'inc', but
when coded as 'add' or 'lea'.
The dependency through RDX/RBX does not form a chain. The next value
of [rdi+rcx*8] does depend on value of rbx from previous iteration, but
the next value of rbx depends only on [rsi+rcx*8], [r8+rcx*8] and
[r9+rcx*8]. It does not depend on the previous value of rbx, except for control dependency that hopefully would be speculated around.
antispam@fricas.org (Waldek Hebisch) writes:
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
antispam@fricas.org (Waldek Hebisch) writes:
I can understand why DEC abandoned VAX: already in 1985 they
had some disadvantage and they saw no way to compete against
superscalar machines which were on the horizon. In 1985 they
probably realized that their features add no value in a world
using optimizing compilers.
Optimizing compilers increase the advantages of RISCs, but even with a
simple compiler Berkeley RISC II (which was made by hardware people,
not compiler people) has between 85% and 256% of VAX (11/780) speed.
It also has 16-bit and 32-bit instructions for improved code density
and (apparently from memory bandwidth issues) performance.
The basic question is if VAX could afford the pipeline. VAX had
rather complex memory and bus interface, cache added complexity
too. Ditching microcode could allow more resources for execution
path. Clearly VAX could afford and probably had 1-cycle 32-bit
ALU. I doubt that they could afford 1-cycle multiply or
even a barrel shifter. So they needed a sequencer for sane
assembly programming. I am not sure what technology they used
for the register file. To me, fast RAM seems most likely, but that
normally would give 1 R/W port. Multiported register file
probably would need a lot of separate register chips and
multiplexer. Alternatively, they could try some very fast
RAM and run it at multiple of base clock frequency (66 ns
cycle time caches were available at that time, so 3 ports
via multiplexing seem possible). But any of this adds
considerable complexity. Sane pipeline needs interlocks
and forwarding.
We tend to be spoiled by modern process densities. The
VAX 11/780 was built using SSI logic chips, thus board
space and backplane wiring were significant constraints
on the logic designs of the era.
On Mon, 4 Aug 2025 20:13:54 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
My guess would be that, with DEC, you would have the least chance of
convincing corporate brass of your ideas. With Data General, you
could try appealing to the CEO's personal history of creating the
Nova, and thus his vanity. That could work. But your own company
might actually be the best choice, if you can get the venture
capital funding.
Why not go to somebody who has money and interest in building a
microprocessor, but no existing mini/mainframe/SuperC business?
If we limit ourselves to the USA then Moto, Intel, AMD, NatSemi...
Maybe even AT&T? Or was AT&T still banned from making computers in
the mid 70s?
On Mon, 04 Aug 2025 09:53:51 -0700
Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
In C17 and earlier, _BitInt is a reserved identifier. Any attempt to
use it has undefined behavior. That's exactly why new keywords are
often defined with that ugly syntax.
That is language lawyer's type of reasoning. Normally gcc maintainers
are wiser than that because, well, by chance gcc happens to be widely
used production compiler. I don't know why this time they had chosen
less conservative road.
On Mon, 4 Aug 2025 15:25:54 -0400
James Kuyper <jameskuyper@alumni.caltech.edu> wrote:
On 2025-08-04 15:03, Michael S wrote:
On Mon, 04 Aug 2025 09:53:51 -0700...
Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
In C17 and earlier, _BitInt is a reserved identifier. Any attempt
to use it has undefined behavior. That's exactly why new keywords
are often defined with that ugly syntax.
That is language lawyer's type of reasoning. Normally gcc
maintainers are wiser than that because, well, by chance gcc
happens to be widely used production compiler. I don't know why
this time they had chosen less conservative road.
If _BitInt is accepted by older versions of gcc, that means it was
supported as a fully-conforming extension to C. Allowing
implementations to support extensions in a fully-conforming manner is
one of the main purposes for which the standard reserves identifiers.
If you thought that gcc was too conservative to support extensions,
you must be thinking of the wrong organization.
I know that gcc supports extensions.
I also know that gcc didn't support *this particular extension* up
until quite recently.
I would guess, up until this calendar year.
Introducing a new extension without a way to disable it is different from
supporting gradually introduced extensions, typically with names that
start with a double underscore and often with __builtin.
Michael S wrote:
On Tue, 5 Aug 2025 17:31:34 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
In this case 'adc edx,edx' is just a slightly shorter encoding
of 'adc edx,0'. The EDX register is zeroed a few lines above.
OK, nice.
Anyway, the three main ADD RAX,... operations still define the
minimum possible latency, right?
I don't think so.
It seems to me that there is only one chain of data dependencies
between iterations of the loop - a trivial dependency through RCX.
Some modern processors are already capable of eliminating this sort of
dependency in the renamer. Probably not yet when it is coded as 'inc',
but when coded as 'add' or 'lea'.
The dependency through RDX/RBX does not form a chain. The next value
of [rdi+rcx*8] does depend on value of rbx from previous iteration,
but the next value of rbx depends only on [rsi+rcx*8], [r8+rcx*8]
and [r9+rcx*8]. It does not depend on the previous value of rbx,
except for control dependency that hopefully would be speculated
around.
I believe we are doing a bigint three-way add, so each result word
depends on the three corresponding input words, plus any carries from
the previous round.
This is the carry chain that I don't see any obvious way to break...
Terje
Breaking existing code that uses "_BitInt" as an identifier is
a non-issue. There very probably is no such code.
Waldek Hebisch <antispam@fricas.org> schrieb:
I am not sure what technology they used
for the register file. To me, fast RAM seems most likely, but that
normally would give 1 R/W port.
They used fast SRAM and had three copies of their registers,
for 2R1W.
... the problem was all the programs ported from unix which assumed
that any negative return value was a failure code.
So... a strategy could have been to establish the concept with
minicomputers, to make money (the VAX sold big) and then move
aggressively towards microprocessors, trying the disruptive move towards workstations within the same company (which would be HARD).
As for the PC - a scaled-down, cheap, compatible, multi-cycle per
instruction microprocessor could have worked for that market,
but it is entirely unclear to me what this would / could have done to
the PC market, if IBM could have been prevented from gaining such market dominance.
A bit like the /360 strategy, offering a wide range of machines (or CPUs
and systems) with different performance.
On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
Breaking existing code that uses "_BitInt" as an identifier is
a non-issue. There very probably is no such code.
However, that doesn't mean GCC can carelessly introduce identifiers
in this namespace.
On Tue, 5 Aug 2025 21:01:20 -0000 (UTC), Thomas Koenig wrote:
So... a strategy could have been to establish the concept with
minicomputers, to make money (the VAX sold big) and then move
aggressively towards microprocessors, trying the disruptive move towards
workstations within the same company (which would be HARD).
None of the companies which tried to move in that direction were
successful. The mass micro market had much higher volumes and lower
margins, and those accustomed to lower-volume, higher-margin operation
simply couldn’t adapt.
As for the PC - a scaled-down, cheap, compatible, multi-cycle per
instruction microprocessor could have worked for that market,
but it is entirely unclear to me what this would / could have done to
the PC market, if IBM could have been prevented from gaining such market
dominance.
IBM had massive marketing clout in the mainframe market. I think that was
the basis on which customers gravitated to their products. And remember,
the IBM PC was essentially a skunkworks project that totally went against
the entire IBM ethos. Internally, it was seen as a one-off mistake that
they determined never to repeat. Hence the PS/2 range.
DEC was bigger in the minicomputer market. If DEC could have offered an open-standard machine, that could have offered serious competition to IBM. But what OS would they have used? They were still dominated by Unix-haters then.
A bit like the /360 strategy, offering a wide range of machines (or CPUs
and systems) with different performance.
That strategy was radical in 1964, less so by the 1970s and 1980s. DEC,
for example, offered entire ranges of machines in each of its various minicomputer families.
Kaz Kylheku <643-408-1753@kylheku.com> writes:
On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
Breaking existing code that uses "_BitInt" as an identifier is
a non-issue. There very probably is no such code.
However, that doesn't mean GCC can carelessly introduce identifiers
in this namespace.
Agreed -- and gcc did not do that in this case. I was referring to _BitInt, not to other identifiers in the reserved namespace.
Do you have any reason to believe that gcc's use of _BitInt will break
any existing code?
The plurality of embedded systems are 8 bit processors - about 40
percent of the total. They are largely used for things like industrial
automation, Internet of Things, SCADA, kitchen appliances, etc. 16 bit
accounts for a small, and shrinking, percentage. 32 bit is next (IIRC
~30-35%), but 64 bit is the fastest growing. Perhaps surprisingly, there
is still a small market for 4 bit processors for things like TV remote
controls, where battery life is more important than the highest performance.
There is far more to the embedded market than phones and servers.
The support issues alone were killers. Think about the
Orange/Grey/(Blue?) Wall of VAX documentation, and then look at the five-page flimsy you got with a micro. The customers were willing to
accept cr*p from a small startup, but wouldn't put up with it from IBM
or DEC.
In article <44okQ.831008$QtA1.573001@fx16.iad>,
Scott Lurndal <slp53@pacbell.net> wrote:
[snip]
We tend to be spoiled by modern process densities. The
VAX 11/780 was built using SSI logic chips, thus board
space and backplane wiring were significant constraints
on the logic designs of the era.
Indeed. I find this speculation about the VAX, kind of odd: the
existence of the 801 as a research project being used as an
existence proof to justify assertions that a pipelined RISC
design would have been "better" don't really hold up, when we
consider that the comparison is to a processor designed for
commercial applications on a much shorter timeframe.
Does anybody have an estimate how many CPUs humanity has made so far?
Using UNIX faced stiff competition from AT&T's internal IT people, who
wanted to run DEC's operating systems on all PDP-11 within the company (basically, they wanted to kill UNIX).
But the _real_ killer application for UNIX wasn't writing patents, it
was phototypesetting speeches for the CEO of AT&T, who, for reasons of vanity, did not want to wear glasses, and it was possible to scale the
output of the phototypesetter so he would be able to read them.
On 2025-08-06, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
Kaz Kylheku <643-408-1753@kylheku.com> writes:
On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com>
wrote:
Breaking existing code that uses "_BitInt" as an identifier is
a non-issue. There very probably is no such code.
However, that doesn't mean GCC can carelessly introduce identifiers
in this namespace.
Agreed -- and gcc did not do that in this case. I was referring
to _BitInt, not to other identifiers in the reserved namespace.
Do you have any reason to believe that gcc's use of _BitInt will
break any existing code?
It has landed, and we don't hear reports that the sky is falling.
If it does break someone's obscure project with few users, unless that
person makes a lot of noise in some forums I read, I will never know.
My position has always been to think about the threat of real,
or at least probable clashes.
I can turn it around: I have not heard of any compiler or library
using _CreamPuff as an identifier, or of a compiler which misbehaves
when a program uses it, on grounds of it being undefined behavior.
Someone using _CreamPuff in their code is taking a risk that is
vanishingly small, the same way that introducing _BitInt is a risk
that is vanishingly small.
In fact, in some sense the risk is smaller because the audience of
programs facing an implementation (or language) that has introduced
some identifier is vastly larger than the audience of implementations
that a given program will face that has introduced some funny
identifier.
Of all the major OSes for Alpha, Windows NT was the only one
that couldn’t take advantage of the 64-bit architecture.
Peter Flass <Peter@Iron-Spring.com> schrieb:
The support issues alone were killers. Think about the
Orange/Grey/(Blue?) Wall of VAX documentation, and then look at the
five-page flimsy you got with a micro. The customers were willing to
accept cr*p from a small startup, but wouldn't put up with it from IBM
or DEC.
Using UNIX faced stiff competition from AT&T's internal IT people,
who wanted to run DEC's operating systems on all PDP-11 within
the company (basically, they wanted to kill UNIX). They pointed
towards the large amount of documentation that DEC provided, compared
to the low amount of UNIX, as proof of superiority. The UNIX people
saw it differently...
But the _real_ killer application for UNIX wasn't writing patents,
it was phototypesetting speeches for the CEO of AT&T, who, for
reasons of vanity, did not want to wear glasses, and it was possible
to scale the output of the phototypesetter so he would be able
to read them.
After somebody pointed out that having confidential speeches on
one of the most well-known machines in the world, where loads of
people had dial-up access, was not a good idea, his secretary got
her own PDP-11 for that.
And with support from that high up, the project flourished.
Not aware of any platforms that do/did ILP64.
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <44okQ.831008$QtA1.573001@fx16.iad>,
Scott Lurndal <slp53@pacbell.net> wrote:
[snip]
We tend to be spoiled by modern process densities. The
VAX 11/780 was built using SSI logic chips, thus board
space and backplane wiring were significant constraints
on the logic designs of the era.
Indeed. I find this speculation about the VAX, kind of odd: the
existence of the 801 as a research project being used as an
existence proof to justify assertions that a pipelined RISC
design would have been "better" don't really hold up, when we
consider that the comparison is to a processor designed for
commercial applications on a much shorter timeframe.
I disagree. The 801 was a research project without much time
pressure, and they simulated the machine (IIRC at the gate level)
before they ever built one. Plus, they developed an excellent
compiler which implemented graph coloring.
But IBM had zero interest in competition to their own /370 line,
although the 801 would have brought performance improvements
over that line.
If 'int' were 64-bits, then what about 16 and/or 32 bit types.
short short?
long short?
counter-argument to ILP64, where the more natural alternative is LP64.
On Tue, 5 Aug 2025 17:24:34 +0200, Terje Mathisen wrote:
... the problem was all the programs ported from unix which assumed
that any negative return value was a failure code.
If the POSIX API spec says a negative return for a particular call is an
error, then a negative return for that particular call is an error.
On Tue, 5 Aug 2025 22:17:00 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
On Tue, 5 Aug 2025 17:31:34 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
In this case 'adc edx,edx' is just a slightly shorter encoding
of 'adc edx,0'. The EDX register is zeroed a few lines above.
OK, nice.
BTW, it seems that in your code fragment above you forgot to zero EDX
at the beginning of the iteration. Or am I missing something?
Anyway, the three main ADD RAX,... operations still define the
minimum possible latency, right?
I don't think so.
It seems to me that there is only one chain of data dependencies
between iterations of the loop - a trivial dependency through RCX.
Some modern processors are already capable of eliminating this sort of
dependency in the renamer. Probably not yet when it is coded as 'inc',
but when coded as 'add' or 'lea'.
The dependency through RDX/RBX does not form a chain. The next value
of [rdi+rcx*8] does depend on value of rbx from previous iteration,
but the next value of rbx depends only on [rsi+rcx*8], [r8+rcx*8]
and [r9+rcx*8]. It does not depend on the previous value of rbx,
except for control dependency that hopefully would be speculated
around.
I believe we are doing a bigint three-way add, so each result word
depends on the three corresponding input words, plus any carries from
the previous round.
This is the carry chain that I don't see any obvious way to break...
You break the chain by *predicting* that
carry[i] = CARRY(a[i]+b[i]+c[i]+carry[i-1]) is equal to
CARRY(a[i]+b[i]+c[i]). If the prediction turns out wrong then you pay a
heavy price of branch misprediction. But outside of specially crafted
inputs it is extremely rare.
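A C rendering of that idea (my own sketch, not the posted asm): compute the carry-out from the three limbs alone, add the incoming carry separately, and handle the rare case where that last addition itself overflows with a branch, so the carry chain becomes a highly predictable control dependency instead of a data dependency.

#include <stdint.h>
#include <stddef.h>

static void add3_predicted(uint64_t *dst, const uint64_t *a,
                           const uint64_t *b, const uint64_t *c, size_t n)
{
    uint64_t carry_in = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t carry_out = 0;
        uint64_t s = a[i] + b[i];
        carry_out += (s < a[i]);          /* carry from a+b */
        s += c[i];
        carry_out += (s < c[i]);          /* carry from +c */
        uint64_t r = s + carry_in;
        if (r < s)                        /* almost never taken for random data */
            carry_out++;                  /* adding carry_in itself overflowed */
        dst[i] = r;
        carry_in = carry_out;             /* 0..3 */
    }
}

The trick only pays off if the compiler keeps the 'if' as a real branch rather than a conditional move, mirroring the jc to the out-of-line fixup in the asm.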
On Tue, 5 Aug 2025 05:48:16 -0000 (UTC), Thomas Koenig <tkoenig@netcologne.de> wrote:
Waldek Hebisch <antispam@fricas.org> schrieb:
I am not sure what technology they used
for the register file. To me, fast RAM seems most likely, but that
normally would give 1 R/W port.
They used fast SRAM and had three copies of their registers,
for 2R1W.
I did use 11/780, 8600, and briefly even MicroVax - but I'm primarily
a software person, so please forgive this stupid question.
Why three copies?
Also did you mean 3 total? Or 3 additional copies (4 total)?
Given 1 R/W port each I can see needing a pair to handle cases where destination is also a source (including autoincrement modes). But I
don't see a need ever to sync them - you just keep track of which was
updated most recently, read that one and - if applicable - write the
other and toggle.
Since (at least) the early models evaluated operands sequentially,
there doesn't seem to be a need for more. Later models had some
semblance of pipeline, but it seems that if the /same/ value was
needed multiple times, it could be routed internally to all users
without requiring additional reads of the source.
Or do I completely misunderstand? [Definitely possible.]
DEC was bigger in the minicomputer market. If DEC could have offered
an open-standard machine, that could have offered serious competition
to IBM. But what OS would they have used? They were still dominated
by Unix-haters then.
On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:...
On Mon, 4 Aug 2025 15:25:54 -0400
James Kuyper <jameskuyper@alumni.caltech.edu> wrote:
If _BitInt is accepted by older versions of gcc, that means it was
supported as a fully-conforming extension to C. Allowing
implementations to support extensions in a fully-conforming manner is
one of the main purposes for which the standard reserves identifiers.
If you thought that gcc was too conservative to support extensions,
you must be thinking of the wrong organization.
I know that gcc supports extensions.
I also know that gcc didn't support *this particular extension* up
until quite recently.
I think what James means is that GCC supports, as an extension,
the use of any _[A-Z].* identifier whatsoever that it has not claimed
for its purposes.
E.g., the designers of ARM A64 included addressing modes for using
32-bit indices (but not 16-bit indices) into arrays. The designers of
RV64G added several sign-extending 32-bit instructions (ending in
"W"), but not corresponding instructions for 16-bit operations. The
RISC-V manual justifies this with
|A few new instructions (ADD[I]W/SUBW/SxxW) are required for addition
|and shifts to ensure reasonable performance for 32-bit values.
Why were 32-bit indices and 32-bit operations more important than
16-bit indices and 16-bit operations? Because with 32-bit int, every
integer type is automatically promoted to at least 32 bits.
Likewise, with ILP64 the size of integers in computations would always
be 64 bits, and many scalar variables (of type int and unsigned) would
also be 64 bits. As a result, 32-bit indices and 32-bit operations
would be rare enough that including these addressing modes and
instructions would not be justified.
But, you might say, what about memory usage? We would use int32_t
where appropriate in big arrays and in fields of structs/classes with
many instances. We would access these array elements and fields with
LW/SW on RV64G and the corresponding instructions on ARM A64, no need
for the addressing modes and instructions mentioned above.
So the addressing mode bloat of ARM A64 and the instruction set bloat
of RV64G that I mentioned above is courtesy of I32LP64.
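A small illustration of the point (my own example, not from the post): with I32LP64 the indices below are 32-bit ints, so each one ends up in a register and has to be sign-extended to a 64-bit offset before it can be used in an address, which is where A64's "register, SXTW" operands and RV64G's W-form arithmetic come in. Under ILP64 the indices would already be 64 bits wide.

/* Hypothetical gather: idx[] holds 32-bit int indices into a[]. */
double gather_sum(const double *a, const int *idx, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[idx[i]];   /* 32-bit index widened to a 64-bit offset */
    return s;
}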
On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
Breaking existing code that uses "_BitInt" as an identifier is
a non-issue. There very probably is no such code.
However, that doesn't mean GCC can carelessly introduce identifiers
in this namespace.
GCC does not define a complete C implementation; it doesn't provide a library. Libraries are provided by other projects: Glibc, Musl,
uClibc, ...
Those libraries are C implementors also, and get to name things
in the reserved namespace.
On Mon, 4 Aug 2025 18:16:45 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
The claim by John Savard was that the VAX "was a good match to the
technology *of its time*". It was not. It may have been a good
match for the beliefs of the time, but that's a different thing.
The evidence of the 801 is that the 801 did not deliver until more than a
decade later. And the variant that delivered was quite different from the
original 801.
Actually, it can be argued that the 801 didn't deliver until more than 15
years later.
[RISC] didn't really make sense until main
memory systems got a lot faster.
In article <106uqej$36gll$3@dont-email.me>,
Thomas Koenig <tkoenig@netcologne.de> wrote:
Peter Flass <Peter@Iron-Spring.com> schrieb:
The support issues alone were killers. Think about the
Orange/Grey/(Blue?) Wall of VAX documentation, and then look at the
five-page flimsy you got with a micro. The customers were willing to
accept cr*p from a small startup, but wouldn't put up with it from IBM
or DEC.
Using UNIX faced stiff competition from AT&T's internal IT people,
who wanted to run DEC's operating systems on all PDP-11 within
the company (basically, they wanted to kill UNIX). They pointed
towards the large amount of documentation that DEC provided, compared
to the low amount of UNIX, as proof of superiority. The UNIX people
saw it differently...
I've never heard this before, and I do not believe that it is
true. Do you have a source?
The same happened to some extent with the early amd64 machines, which
ended up running 32bit Windows and applications compiled for the i386
ISA. Those processors were successful mostly because they were fast at
running i386 code (with the added marketing benefit of being "64bit
ready"): it took 2 years for MS to release a matching OS.
BGB <cr88192@gmail.com> writes:
counter-argument to ILP64, where the more natural alternative is LP64.
I am curious what makes you think that I32LP64 is "more natural",
given that C is a human creation.
ILP64 is more consistent with the historic use of int: int is the
integer type corresponding to the unnamed single type of B
(predecessor of C), which was used for both integers and pointers.
You can see that in various parts of C, e.g., in the integer type
promotion rules (all integers are promoted at least to int in any
case, beyond that only when another bigger integer is involved).
Another example is
main(argc, argv)
char *argv[];
{
return 0;
}
Here the return type of main() defaults to int, and the type of argc
defaults to int.
As a consequence, one should be able to cast int->pointer->int and pointer->int->pointer without loss. That's not the case with I32LP64.
It is the case for ILP64.
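A hypothetical two-liner showing the difference: under ILP64, int and pointers have the same width, so the B-style round trip below is lossless; under I32LP64 the first cast can silently drop the upper 32 bits of the address.

#include <stdint.h>

void *roundtrip(void *p)
{
    int as_int = (int)(intptr_t)p;      /* I32LP64: truncates to 32 bits */
    return (void *)(intptr_t)as_int;    /* may no longer compare equal to p */
}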
Some people conspired in 1992 to set the de-facto standard, and made
the mistake of deciding on I32LP64 <https://queue.acm.org/detail.cfm?id=1165766>, and we have paid for
this mistake ever since, one way or the other.
E.g., the designers of ARM A64 included addressing modes for using
32-bit indices (but not 16-bit indices) into arrays. The designers of
RV64G added several sign-extending 32-bit instructions (ending in
"W"), but not corresponding instructions for 16-bit operations. The
RISC-V manual justifies this with
|A few new instructions (ADD[I]W/SUBW/SxxW) are required for addition
|and shifts to ensure reasonable performance for 32-bit values.
Why were 32-bit indices and 32-bit operations more important than
16-bit indices and 16-bit operations? Because with 32-bit int, every
integer type is automatically promoted to at least 32 bits.
Likewise, with ILP64 the size of integers in computations would always
be 64 bits, and many scalar variables (of type int and unsigned) would
also be 64 bits. As a result, 32-bit indices and 32-bit operations
would be rare enough that including these addressing modes and
instructions would not be justified.
But, you might say, what about memory usage? We would use int32_t
where appropriate in big arrays and in fields of structs/classes with
many instances. We would access these array elements and fields with
LW/SW on RV64G and the corresponding instructions on ARM A64, no need
for the addressing modes and instructions mentioned above.
So the addressing mode bloat of ARM A64 and the instruction set bloat
of RV64G that I mentioned above is courtesy of I32LP64.
- anton
BGB <cr88192@gmail.com> writes:
If 'int' were 64-bits, then what about 16 and/or 32 bit types.
short short?
long short?
Of course int16_t uint16_t int32_t uint32_t
On what keywords should these types be based? That's up to the
implementor. In C23 one could
typedef signed _BitInt(16) int16_t;
etc. Around 1990, one would have just followed the example of "long
long" of accumulating several modifiers. I would go for 16-bit
"short" and 32-bit "long short".
- anton
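Spelling out the "etc." above, a sketch of how a hypothetical ILP64 C23 implementation could still provide the exact-width types (illustrative only, not any actual library's <stdint.h>):

/* Hypothetical <stdint.h> fragment for an ILP64 C23 implementation. */
typedef signed _BitInt(16)   int16_t;
typedef unsigned _BitInt(16) uint16_t;
typedef signed _BitInt(32)   int32_t;
typedef unsigned _BitInt(32) uint32_t;
typedef int                  int64_t;    /* int is 64 bits under ILP64 */
typedef unsigned int         uint64_t;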
In any case, RISCs delivered, starting in 1986.
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
Not aware of any platforms that do/did ILP64.
AFAIK the Cray-1 (1976) was the first 64-bit machine, ...
De Castro had had a big success with a simple load-store
architecture, the Nova. He did that to reduce CPU complexity
and cost, to compete with DEC and its PDP-8. (Byte addressing
was horrible on the Nova, though).
Now, assume that, as a time traveler wanting to kick off an early
RISC revolution, you are not allowed to reveal that you are a time
traveler (which would have larger effects than just a different
computer architecture). What do you do?
a) You go to DEC
b) You go to Data General
c) You found your own company
Michael S wrote:
On Tue, 5 Aug 2025 22:17:00 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
On Tue, 5 Aug 2025 17:31:34 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
In this case 'adc edx,edx' is just a slightly shorter encoding
of 'adc edx,0'. The EDX register is zeroed a few lines above.
OK, nice.
BTW, it seems that in your code fragment above you forgot to
zero EDX at the beginning of the iteration. Or am I missing
something?
No, you are not. I skipped pretty much all the setup code. :-)
Anyway, the three main ADD RAX,... operations still define the
minimum possible latency, right?
I don't think so.
It seems to me that there is only one chain of data dependencies
between iterations of the loop - a trivial dependency through RCX.
Some modern processors are already capable of eliminating this sort
of dependency in the renamer. Probably not yet when it is coded as
'inc', but when coded as 'add' or 'lea'.
The dependency through RDX/RBX does not form a chain. The next
value of [rdi+rcx*8] does depend on value of rbx from previous
iteration, but the next value of rbx depends only on [rsi+rcx*8],
[r8+rcx*8] and [r9+rcx*8]. It does not depend on the previous
value of rbx, except for control dependency that hopefully would
be speculated around.
I believe we are doing a bigint three-way add, so each result word
depends on the three corresponding input words, plus any carries
from the previous round.
This is the carry chain that I don't see any obvious way to
break...
You break the chain by *predicting* that
carry[i] = CARRY(a[i]+b[i]+c[i]+carry[i-1]) is equal to
CARRY(a[i]+b[i]+c[i]). If the prediction turns out wrong then you
pay a heavy price of branch misprediction. But outside of specially
crafted inputs it is extremely rare.
Aha!
That's _very_ nice.
Terje
In comp.arch Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
E.g., the designers of ARM A64 included addressing modes for using
32-bit indices (but not 16-bit indices) into arrays. The designers of
RV64G added several sign-extending 32-bit instructions (ending in
"W"), but not corresponding instructions for 16-bit operations. The
RISC-V manual justifies this with
|A few new instructions (ADD[I]W/SUBW/SxxW) are required for addition
|and shifts to ensure reasonable performance for 32-bit values.
Why were 32-bit indices and 32-bit operations more important than
16-bit indices and 16-bit operations? Because with 32-bit int, every
integer type is automatically promoted to at least 32 bits.
Objectively, a lot of programs fit into 32-bit address space and
may wish to run as 32-bit code for increased performance. Code
that fits into 16-bit address space is rare enough on 64-bit
machines to ignore.
Likewise, with ILP64 the size of integers in computations would always
be 64 bits, and many scalar variables (of type int and unsigned) would
also be 64 bits. As a result, 32-bit indices and 32-bit operations
would be rare enough that including these addressing modes and
instructions would not be justified.
But, you might say, what about memory usage? We would use int32_t
where appropriate in big arrays and in fields of structs/classes with
many instances. We would access these array elements and fields with
LW/SW on RV64G and the corresponding instructions on ARM A64, no need
for the addressing modes and instructions mentioned above.
So the addressing mode bloat of ARM A64 and the instruction set bloat
of RV64G that I mentioned above is courtesy of I32LP64.
It is more complex. There are machines on the market with 64 MB
RAM and 64-bit RISCV processor. There are (or were) machines
with 512 MB RAM and 64-bit ARM processor. On such machines it
is quite natural to use 32-bit pointers. With 32-bit pointers
there is the possibility of using existing 32-bit code. And
ILP32 is the natural model.
You can say that 32-bit pointers on 64-bit hardware are rare.
But we really do not know. And especially in embedded space one
big customer may want a feature and vendor to avoid fragmentation
provides that feature to everyone.
Why does such code need 32-bit addressing? Well, if enough parts of
C were undefined, the compiler could just extend everything during
load to 64 bits. So equally well you can claim that the real problem
is that the C standard should have more undefined behaviour.
On 8/6/2025 6:05 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
If 'int' were 64-bits, then what about 16 and/or 32 bit types.
short short?
long short?
Of course int16_t uint16_t int32_t uint32_t
Well, assuming a post C99 world.
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
Not aware of any platforms that do/did ILP64.
AFAIK the Cray-1 (1976) was the first 64-bit machine, ...
The IBM 7030 STRETCH was the first 64 bit machine, shipped in 1961,
but I would be surprised if anyone had written a C compiler for it.
It was bit addressable but memories in those days were so small that a full bit
address was only 24 bits. So if I were writing a C compiler, pointers and ints
would be 32 bits, char 8 bits, long 64 bits.
(There is a thing called STRETCH C Compiler but it's completely unrelated.)
Even if I am allowed to reveal that I am a time traveler, that may not
help; how would I prove it?
It was bit addressable but memories in those days were so small that a full bit
address was only 24 bits. So if I were writing a C compiler, pointers and ints
would be 32 bits, char 8 bits, long 64 bits.
(There is a thing called STRETCH C Compiler but it's completely unrelated.)
I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
too, and it seems like all it does is drastically shrink your address
space and complexify instruction and operand fetch to (maybe) save a few
bytes.
In article <106uqki$36gll$4@dont-email.me>,
Thomas Koenig <tkoenig@netcologne.de> wrote:
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <44okQ.831008$QtA1.573001@fx16.iad>,
Scott Lurndal <slp53@pacbell.net> wrote:
[snip]
We tend to be spoiled by modern process densities. The
VAX 11/780 was built using SSI logic chips, thus board
space and backplane wiring were significant constraints
on the logic designs of the era.
Indeed. I find this speculation about the VAX, kind of odd: the
existence of the 801 as a research project being used as an
existence proof to justify assertions that a pipelined RISC
design would have been "better" don't really hold up, when we
consider that the comparison is to a processor designed for
commercial applications on a much shorter timeframe.
I disagree. The 801 was a research project without much time
pressure, and they simulated the machine (IIRC at the gate level)
before they ever built one. Plus, they developed an excellent
compiler which implemented graph coloring.
But IBM had zero interest in competition to their own /370 line,
although the 801 would have brought performance improvements
over that line.
I'm not sure what, precisely, you're disagreeing with.
I'm saying that the line of thought that goes, "the 801 existed,
therefore a RISC VAX would have been better than the
architecture DEC ultimately produced" is specious, and the
conclusion does not follow.
According to Peter Flass <Peter@Iron-Spring.com>:
It was bit addressable but memories in those days were so small that a full bit
address was only 24 bits. So if I were writing a C compiler, pointers and ints
would be 32 bits, char 8 bits, long 64 bits.
(There is a thing called STRETCH C Compiler but it's completely unrelated.)
I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
too, and it seems like all it does is drastically shrink your address
space and complexify instruction and operand fetch to (maybe) save a few
bytes.
STRETCH had a severe case of second system syndrome, and was full of
complex features that weren't worth the effort and it was impressive
that IBM got it to work and to run as fast as it did.
In that era memory was expensive, and usually measured in K, not M.
The idea was presumably to pack data as tightly as possible.
In the 1970s I briefly used a B1700 which was bit addressable and had reloadable
microcode so COBOL programs used the COBOL instruction set, FORTRAN programs
used the FORTRAN instruction set, and so forth, with each one having whatever
word or byte sizes they wanted. In retrospect it seems like a lot of
premature optimization.
On 2025-08-05 17:13, Kaz Kylheku wrote:
On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:...
On Mon, 4 Aug 2025 15:25:54 -0400
James Kuyper <jameskuyper@alumni.caltech.edu> wrote:
If _BitInt is accepted by older versions of gcc, that means it was
supported as a fully-conforming extension to C. Allowing
implementations to support extensions in a fully-conforming manner is
one of the main purposes for which the standard reserves identifiers.
If you thought that gcc was too conservative to support extensions,
you must be thinking of the wrong organization.
I know that gcc supports extensions.
I also know that gcc didn't support *this particular extension* up
until quite recently.
I think what James means is that GCC supports, as an extension,
the use of any _[A-Z].* identifier whatsoever that it has not claimed
for its purposes.
No, I meant very specifically that if, as reported, _BitInt was
supported even in earlier versions, then it was supported as an extension.
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <106uqki$36gll$4@dont-email.me>,
Thomas Koenig <tkoenig@netcologne.de> wrote:
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <44okQ.831008$QtA1.573001@fx16.iad>,
Scott Lurndal <slp53@pacbell.net> wrote:
[snip]
We tend to be spoiled by modern process densities. The
VAX 11/780 was built using SSI logic chips, thus board
space and backplane wiring were significant constraints
on the logic designs of the era.
Indeed. I find this speculation about the VAX, kind of odd: the
existence of the 801 as a research project being used as an
existence proof to justify assertions that a pipelined RISC
design would have been "better" don't really hold up, when we
consider that the comparison is to a processor designed for
commercial applications on a much shorter timeframe.
I disagree. The 801 was a research project without much time
pressure, and they simulated the machine (IIRC at the gate level)
before they ever built one. Plus, they developed an excellent
compiler which implemented graph coloring.
But IBM had zero interest in competition to their own /370 line,
although the 801 would have brought performance improvements
over that line.
I'm not sure what, precisely, you're disagreeing with.
I'm saying that the line of thought that goes, "the 801 existed,
therefore a RISC VAX would have been better than the
architecture DEC ultimately produced" is specious, and the
conclusion does not follow.
There are a few intermediate steps.
The 801 demonstrated that a RISC, including caches and pipelining,
would have been feasible at the time. It also demonstrated that
somebody had thought of graph coloring algorithms.
There can also be no doubt that a RISC-type machine would have
exhibited the same performance advantages (at least in integer
performance) as a RISC vs CISC 10 years later. The 801 did so
vs. the /370, as did the RISC processors vs, for example, the
680x0 family of processors (just compare ARM vs. 68000).
Or look at the performance of the TTL implementation of HP-PA,
which used PALs which were not available to the VAX 11/780
designers, so it could be clocked a bit higher, but at
a multiple of the performance of the VAX.
So, Anton visiting DEC or me visiting Data General could have
brought them a technology which would have significantly outperformed
the VAX (especially if we brought along the algorithm for graph
coloring). Some people at IBM would have been peeved at having
somebody else "develop" this at the same time, but OK.
Thomas Koenig wrote:
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <106uqki$36gll$4@dont-email.me>,
Thomas Koenig <tkoenig@netcologne.de> wrote:
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <44okQ.831008$QtA1.573001@fx16.iad>,
Scott Lurndal <slp53@pacbell.net> wrote:
[snip]Indeed. I find this speculation about the VAX, kind of odd: the
We tend to be spoiled by modern process densities. The
VAX 11/780 was built using SSI logic chips, thus board
space and backplane wiring were significant constraints
on the logic designs of the era.
existence of the 801 as a research project being used as an
existence proof to justify assertions that a pipelined RISC
design would have been "better" don't really hold up, when we
consider that the comparison is to a processor designed for
commercial applications on a much shorter timeframe.
Signetics 82S100/101 Field Programmable Logic Array FPLA (an AND-OR matrix)
were available in 1975. Mask programmable PLA were available from TI
circa 1970 but masks would be too expensive.
If I was building a TTL risc cpu in 1975 I would definitely be using
lots of FPLA's, not just for decode but also state machines in fetch,
page table walkers, cache controllers, etc.
Thomas Koenig wrote:
Or look at the performance of the TTL implementation of HP-PA,
which used PALs which were not available to the VAX 11/780
designers, so it could be clocked a bit higher, but at
a multiple of the performance of the VAX.
So, Anton visiting DEC or me visiting Data General could have
brought them a technology which would have significantly outperformed
the VAX (especially if we brought along the algorithm for graph
coloring). Some people at IBM would have been peeved at having
somebody else "develop" this at the same time, but OK.
Signetics 82S100/101 Field Programmable Logic Array FPLA (an AND-OR matrix)
were available in 1975. Mask programmable PLA were available from TI
circa 1970 but masks would be too expensive.
If I was building a TTL risc cpu in 1975 I would definitely be using
lots of FPLA's, not just for decode but also state machines in fetch,
page table walkers, cache controllers, etc.
For comparison:
SPARC: Berkeley RISC research project between 1980 and 1984; <https://en.wikipedia.org/wiki/Berkeley_RISC> does not mention the IBM
801 as inspiration, but a 1978 paper by Tanenbaum. Samples for RISC-I
in May 1982 (but could only run at 0.5MHz). No date for the completion
of RISC-II, but given that the research project ended in 1984, it was probably at that time. Sun developed Berkeley RISC into SPARC, and the
first SPARC machine, the Sun-4/260 appeared in July 1987 with a 16.67MHz processor.
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
Not aware of any platforms that do/did ILP64.
AFAIK the Cray-1 (1976) was the first 64-bit machine, and C for the
Cray-1 and successors implemented, as far as I can determine
type bits
char 8
short int 64
int 64
long int 64
pointer 64
AFAIK the Cray-1 (1976) was the first 64-bit machine, and C for the
Cray-1 and successors implemented, as far as I can determine
type bits
char 8
short int 64
int 64
long int 64
pointer 64
Not having a 16-bit integer type and not having a 32-bit integer type
would make it very hard to adapt portable code, such as TCP/IP protocol processing.
AFAIK the Cray-1 (1976) was the first 64-bit machine, and C for the
Cray-1 and successors implemented, as far as I can determine
type bits
char 8
short int 64
int 64
long int 64
pointer 64
Not having a 16-bit integer type and not having a 32-bit integer type
would make it very hard to adapt portable code, such as TCP/IP protocol processing.
I'd think this was obvious, but if the code depends on word sizes and doesn't declare its variables to use those word sizes, I don't think "portable" is the
right term.
Why were 32-bit indices and 32-bit operations more important than 16-bit indices and 16-bit operations?
I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
too, and it seems like all it does is drastically shrink your address
space and complexify instruction and operand fetch to (maybe) save a few bytes.
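To put a rough number on the address-space cost (simple arithmetic, not
taken from any of the posts above): spending three address bits on the
bit-within-byte position means that with 32-bit addresses you can reach

  2^32 bits = 2^29 bytes = 512 MiB

instead of the 4 GiB that plain byte addressing would give.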
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
Not aware of any platforms that do/did ILP64.
AFAIK the Cray-1 (1976) was the first 64-bit machine ...
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
Of all the major OSes for Alpha, Windows NT was the only one that
couldn’t take advantage of the 64-bit architecture.
Actually, Windows took good advantage of the 64-bit architecture:
"64-bit Windows was initially developed on the Alpha AXP." <https://learn.microsoft.com/en-us/previous-versions/technet-magazine/cc718978(v=msdn.10)>
Thomas Koenig wrote:
Or look at the performance of the TTL implementation of HP-PA, which
used PALs which were not available to the VAX 11/780 designers, so it
could be clocked a bit higher, but at a multiple of the performance
than the VAX.
So, Anton visiting DEC or me visiting Data General could have brought
them a technology which would significantly outperformed the VAX
(especially if we brought along the algorithm for graph coloring. Some
people at IBM would have been peeved at having somebody else "develop"
this at the same time, but OK.
Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix)
were available in 1975. Mask programmable PLA were available from TI
circa 1970 but masks would be too expensive.
If I was building a TTL risc cpu in 1975 I would definitely be using
lots of FPLA's, not just for decode but also state machines in fetch,
page table walkers, cache controllers, etc.
CP/M owes a lot to the DEC lineage, although it dispenses with some
of the more tedious mainframe-isms - e.g. the RUN [program]
[parameters] syntax vs. just treating executable files on disk as
commands in themselves.)
On Wed, 06 Aug 2025 14:00:56 GMT, Anton Ertl wrote:
For comparison:
SPARC: Berkeley RISC research project between 1980 and 1984;
<https://en.wikipedia.org/wiki/Berkeley_RISC> does not mention the IBM
801 as inspiration, but a 1978 paper by Tanenbaum. Samples for RISC-I
in May 1982 (but could only run at 0.5MHz). No date for the completion
of RISC-II, but given that the research project ended in 1984, it was
probably at that time. Sun developed Berkeley RISC into SPARC, and the
first SPARC machine, the Sun-4/260 appeared in July 1987 with a 16.67MHz
processor.
The Katevenis thesis on RISC-II contains a timeline on p6; it lists fabrication of it in spring 83 with testing during summer 83.
There is also a bibliography entry of an informal discussion with John
Cocke at Berkeley about the 801 in June 1983
Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix)
                                                    ^^^^
were available in 1975. Mask programmable PLA were available from TI
circa 1970 but masks would be too expensive.
If I was building a TTL risc cpu in 1975 I would definitely be using
lots of FPLA's, not just for decode but also state machines in fetch,
page table walkers, cache controllers, etc.
On 8/6/25 09:47, Anton Ertl wrote:
Even if I am allowed to reveal that I am a time traveler, that may not
help; how would I prove it?
I'm a time-traveler from the 1960s!
On Wed, 6 Aug 2025 08:28:03 -0700, John Ames wrote:
CP/M owes a lot to the DEC lineage, although it dispenses with some
of the more tedious mainframe-isms - e.g. the RUN [program]
[parameters] syntax vs. just treating executable files on disk as
commands in themselves.)
It added its own misfeatures, though. Like single-letter device names,
but only for disks. Non-file-structured devices were accessed via “reserved” file names, which continue to bedevil Microsoft Windows to this day, aggravated by a totally perverse extension of the concept to
paths with hierarchical directory names.
There is a citation to Cocke as "private communication" in 1980 by
Patterson in The Case for the Reduced Instruction Set Computer,
1980.
"REASONS FOR INCREASED COMPLEXITY
Why have computers become more complex? We can think of several
reasons: Speed of Memory vs. Speed of CPU. John Cocke says that the complexity began with the transition from the 701 to the 709
[Cocke80]. The 701 CPU was about ten times as fast as the core main
memory; this made any primitives that were implemented as
subroutines much slower than primitives that were instructions. Thus
the floating point subroutines became part of the 709 architecture
with dramatic gains. Making the 709 more complex resulted in an
advance that made it more cost-effective than the 701. Since then,
many "higher-level" instructions have been added to machines in an
attempt to improve performance. Note that this trend began because
of the imbalance in speeds; it is not clear that architects have
asked themselves whether this imbalance still holds for their
designs."
["Followup-To:" header set to comp.arch.]
On 2025-08-06, John Levine <johnl@taugh.com> wrote:
AFAIK the Cray-1 (1976) was the first 64-bit machine, and C for the
Cray-1 and successors implemented, as far as I can determine
type bits
char 8
short int 64
int 64
long int 64
pointer 64
Not having a 16-bit integer type and not having a 32-bit integer type
would make it very hard to adapt portable code, such as TCP/IP protocol
processing.
I'd think this was obvious, but if the code depends on word sizes and doesn't
declare its variables to use those word sizes, I don't think "portable" is the
right term.
My concern is how do you express your desire for having e.g. an int16?
All the portable code I know defines int8, int16, int32 by means of a
typedef that adds an appropriate alias for each of these back to a
native type. If "short" is 64 bits, how do you define a 16 bit?
Or did the compiler have native types __int16 etc?
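For concreteness, a minimal sketch of the portability-header idiom
described above (the names follow the usual convention; this is an
illustration, not code from any particular project):

typedef signed char  int8;
typedef short        int16;  /* no correct choice exists on Cray C,      */
typedef int          int32;  /* where short, int and long are all 64 bit */
typedef long         int64;

On an ILP32 or I32LP64 compiler each alias maps to a native type of the
right width; on a compiler whose narrowest non-char type is 64 bits
there is simply nothing valid to write on the 16- and 32-bit lines.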
Thomas Koenig <tkoenig@netcologne.de> writes:
De Castro had had a big success with a simple load-store
architecture, the Nova. He did that to reduce CPU complexity
and cost, to compete with DEC and its PDP-8. (Byte addressing
was horrible on the Nova, though).
The PDP-8, and its 16-bit followup, the Nova, may be load/store, but
it is not a register machine nor byte-addressed, while the PDP-11 is,
and the RISC-VAX would be, too.
Now, assume that, as a time traveler wanting to kick off an early
RISC revolution, you are not allowed to reveal that you are a time
traveler (which would have larger effects than just a different
computer architecture). What do you do?
a) You go to DEC
b) You go to Data General
c) You found your own company
Even if I am allowed to reveal that I am a time traveler, that may not
help; how would I prove it?
Yes, convincing people in the mid-1970s to bet the company on RISC is
a hard sell, that's why I asked for "a magic wand that would convince the
DEC management and workforce that I know how to design their next architecture, and how to compile for it" in
<2025Mar1.125817@mips.complang.tuwien.ac.at>.
Some arguments that might help:
Complexity in CISC and how it breeds complexity elsewhere; e.g., the interaction of having more than one data memory access per
instruction, virtual memory, and precise exceptions.
How the CDC 6600 achieved performance (pipelining) and how non-complex
its instructions are.
I guess I would read through RISC-vs-CISC literature before entering
the time machine in order to have some additional arguments.
Concerning your three options, I think it will be a problem in any
case. Data General's first bet was on FHP, a microcoded machine with user-writeable microcode,
so maybe even more in the wrong direction
than VAX; I can imagine a high-performance OoO VAX implementation, but
for an architecture with exposed microcode like FHP an OoO
implementation would probably be pretty challenging. The backup
project that eventually came through was also a CISC.
Concerning founding ones own company, one would have to convince
venture capital, and then run the RISC of being bought by one of the
big players, who buries the architecture. And even if you survive,
you then have to build up the whole thing: production, marketing,
sales, software support, ...
In any case, the original claim was about the VAX, so of course the
question at hand is what DEC could have done instead.
- anton
EricP <ThatWouldBeTelling@thevillage.com> writes:
There is a citation to Cocke as "private communication" in 1980 by
Patterson in The Case for the Reduced Instruction Set Computer, 1980.
"REASONS FOR INCREASED COMPLEXITY
Why have computers become more complex? We can think of several reasons:
Speed of Memory vs. Speed of CPU. John Cocke says that the complexity began
with the transition from the 701 to the 709 [Cocke80]. The 701 CPU was about
ten times as fast as the core main memory; this made any primitives that
were implemented as subroutines much slower than primitives that were
instructions. Thus the floating point subroutines became part of the 709
architecture with dramatic gains. Making the 709 more complex resulted
in an advance that made it more cost-effective than the 701. Since then,
many "higher-level" instructions have been added to machines in an attempt
to improve performance. Note that this trend began because of the imbalance
in speeds; it is not clear that architects have asked themselves whether
this imbalance still holds for their designs."
At the start of this thread
<2025Jul29.104514@mips.complang.tuwien.ac.at>, I made exactly this
argument about the relation between memory speed and clock rate. In
that posting, I wrote:
|my guess is that in the VAX 11/780 timeframe, 2-3MHz DRAM access
|within a row would have been possible. Moreover, the VAX 11/780 has a
|cache
In the meantime, this discussion and some additional searching has
unearthed that the VAX 11/780 memory subsystem has 600ns main memory
cycle time (apparently without contiguous-access (row) optimization),
with the cache lowering the average memory cycle time to 290ns.
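A quick sanity check on those numbers (assuming, purely for
illustration, a 200 ns cache hit time, in line with the cache cycle
time mentioned elsewhere in this thread):

  t_avg = h * t_hit + (1 - h) * t_miss
  290   = h * 200   + (1 - h) * 600
  =>  h = (600 - 290) / (600 - 200) = 0.775

i.e. a hit rate of a bit under 80% would be enough to pull the average
from 600 ns down to 290 ns.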
On Wed, 06 Aug 2025 17:00:03 -0400, EricP wrote:
If I was building a TTL risc cpu in 1975 I would definitely be using
lots of FPLA's, not just for decode but also state machines in fetch,
page table walkers, cache controllers, etc.
The DG MV/8000 used PALs but The Soul of a New Machine hints that there
were supply problems with them at the time.
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <106uqki$36gll$4@dont-email.me>,
Thomas Koenig <tkoenig@netcologne.de> wrote:
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <44okQ.831008$QtA1.573001@fx16.iad>,
Scott Lurndal <slp53@pacbell.net> wrote:
[snip]
We tend to be spoiled by modern process densities. The
VAX 11/780 was built using SSI logic chips, thus board
space and backplane wiring were significant constraints
on the logic designs of the era.
Indeed. I find this speculation about the VAX, kind of odd: the
existence of the 801 as a research project being used as an
existence proof to justify assertions that a pipelined RISC
design would have been "better" don't really hold up, when we
consider that the comparison is to a processor designed for
commercial applications on a much shorter timeframe.
I disagree. The 801 was a research project without much time
pressure, and they simulated the machine (IIRC at the gate level)
before they ever bulit one. Plus, they developed an excellent
compiler which implemented graph coloring.
But IBM had zero interest in competition to their own /370 line,
although the 801 would have brought performance improvements
over that line.
I'm not sure what, precisely, you're disagreeing with.
I'm saying that the line of though that goes, "the 801 existed,
therefore a RISC VAX would have been better than the
architecture DEC ultimately produced" is specious, and the
conclusion does not follow.
There are a few intermediate steps.
The 801 demonstrated that a RISC, including caches and pipelining,
would have been feasible at the time. It also demonstrated that
somebody had thought of graph coloring algorithms.
EricP wrote:
Thomas Koenig wrote:
Or look at the performance of the TTL implementation of HP-PA,
which used PALs which were not available to the VAX 11/780
designers, so it could be clocked a bit higher, but at
a multiple of the performance than the VAX.
So, Anton visiting DEC or me visiting Data General could have
brought them a technology which would significantly outperformed
the VAX (especially if we brought along the algorithm for graph
coloring. Some people at IBM would have been peeved at having
somebody else "develop" this at the same time, but OK.
Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix)
were available in 1975. Mask programmable PLA were available from TI
circa 1970 but masks would be too expensive.
If I was building a TTL risc cpu in 1975 I would definitely be using
lots of FPLA's, not just for decode but also state machines in fetch,
page table walkers, cache controllers, etc.
The question isn't could one build a modern risc-style pipelined cpu
from TTL in 1975 - of course one could. Nor do I see any question of
could it beat a VAX 780 0.5 MIPS at 5 MHz - of course it could, easily.
The question is could one build this at a commercially competitive price?
["Followup-To:" header set to comp.arch.]...
On 2025-08-06, John Levine <johnl@taugh.com> wrote:
AFAIK the Cray-1 (1976) was the first 64-bit machine, and C for the
Cray-1 and successors implemented, as far as I can determine
type bits
char 8
short int 64
int 64
long int 64
pointer 64
Not having a 16-bit integer type and not having a 32-bit integer type >>>would make it very hard to adapt portable code, such as TCP/IP protocol >>>processing.
My concern is how do you express yopur desire for having e.g. an int16 ?
All the portable code I know defines int8, int16, int32 by means of a
typedef that adds an appropriate alias for each of these back to a
native type. If "short" is 64 bits, how do you define a 16 bit?
Or did the compiler have native types __int16 etc?
EricP wrote:
Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix)
were available in 1975. Mask programmable PLA were available from TI
circa 1970 but masks would be too expensive.
If I was building a TTL risc cpu in 1975 I would definitely be using
lots of FPLA's, not just for decode but also state machines in fetch,
page table walkers, cache controllers, etc.
The question isn't could one build a modern risc-style pipelined cpu
from TTL in 1975 - of course one could. Nor do I see any question of
could it beat a VAX 780 0.5 MIPS at 5 MHz - of course it could, easily.
I'm pretty sure I could use my Mk-I risc ISA and build a 5 stage pipeline
running at 5 MHz getting 1 IPC sustained when hitting the 200 ns cache
(using some in-order superscalar ideas and two reg file write ports
to "catch up" after pipeline bubbles).
TTL risc would also be much cheaper to design and prototype.
VAX took hundreds of people many many years.
The question is could one build this at a commercially competitive price?
There is a reason people did things sequentially in microcode.
All those control decisions that used to be stored as bits in microcode now
become real logic gates. And in SSI TTL you don't get many to the $.
And many of those sequential microcode states become independent concurrent
state machines, each with its own logic sequencer.
EricP <ThatWouldBeTelling@thevillage.com> writes:
Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix)
were available in 1975. Mask programmable PLA were available from TI
circa 1970 but masks would be too expensive.
Burroughs mainframers started designing with ECL gate arrays circa
1981, and they shipped in 1987[*]. I suspect even FPAL or other PLAs
would have been far too expensive to use to build a RISC CPU,
especially for one of the BUNCH, for whom backward compatibility was
paramount.
On Wed, 6 Aug 2025 16:19:11 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
On Tue, 5 Aug 2025 22:17:00 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
On Tue, 5 Aug 2025 17:31:34 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
In this case 'adc edx,edx' is just slightly shorter encoding
of 'adc edx,0'. The EDX register is zeroized a few lines above.
OK, nice.
BTW, it seems that in your code fragment above you forgot to
zeroize EDX at the beginning of the iteration. Or am I missing
something?
No, you are not. I skipped pretty much all the setup code. :-)
It's not the setup code that looks missing to me, but the zeroing of
RDX in the body of the loop.
I did a few tests on a few machines: Raptor Cove (i7-14700 P core),
Gracemont (i7-14700 E core), Skylake-C (Xeon E-2176G) and Zen3 (EPYC
7543P).
In order to see the effects more clearly I had to modify Anton's function
to one that operates on pointers, because otherwise too much time was
spent at the caller's site copying things around, which made the
measurements too noisy.
void add3(uintNN_t *dst, const uintNN_t* a, const uintNN_t* b, const uintNN_t* c) {
*dst = *a + *b + *c;
}
After the change I saw a significant speed-up on 3 out of 4 platforms.
The only platform where the speed-up was insignificant was Skylake,
probably because its rename stage is too narrow to profit from the
change. The widest machine (Raptor Cove) benefited most.
The results appear inconclusive with regard to the question of whether
the dependency between loop iterations is eliminated completely or just
shortened to 1-2 clock cycles per iteration. Even the widest of my
cores is relatively narrow. Considering that my variant of the loop
contains 13 x86-64 instructions and 16 uOps, I am afraid that even the
likes of Apple M4 would be too narrow :(
Here are results in nanoseconds for N=65472
Platform        RC      GM      SK      Z3
clang        896.1  1476.7  1453.2  1348.0
gcc          879.2  1661.4  1662.9  1655.0
x86          585.8  1489.3   901.5   672.0
Terje's      772.6  1293.2  1012.6  1127.0
My           397.5   803.8   965.3   660.0
ADX          579.1  1650.1   728.9   853.0
x86/u2       581.5  1246.2   679.9   584.0
Terje's/u3   503.7   954.3   630.9   755.0
My/u3        266.6   487.2   486.5   440.0
ADX/u8       350.4   839.3   490.4   451.0
'x86' is a variant that was sketched in one of my above posts. It
calculates the sum in two passes over the arrays.
'ADX' is a variant that uses ADCX/ADOX instructions as suggested by
Anton, but unlike his suggestion does it in a loop rather than in a
long straight-line code sequence.
/u2, /u3, /u8 indicate unroll factors of the inner loop.
Frequency:
RC 5.30 GHz (Est)
GM 4.20 GHz (Est)
SK 4.25 GHz
Z3 3.70 GHz
Lars Poulsen <lars@cleo.beagle-ears.com> writes:
["Followup-To:" header set to comp.arch.]
On 2025-08-06, John Levine <johnl@taugh.com> wrote:
...
My concern is how do you express yopur desire for having e.g. an int16 ? >>All the portable code I know defines int8, int16, int32 by means of a >>typedef that adds an appropriate alias for each of these back to a
native type. If "short" is 64 bits, how do you define a 16 bit?
Or did the compiler have nativea types __int16 etc?
I doubt it. If you want to implement TCP/IP protocol processing on a
Cray-1 or its successors, better use shifts for picking apart or
assembling the headers. One might also think about using C's bit
fields, but, at least if you want the result to be portable, AFAIK bit
fields are too laxly defined to be usable for that.
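As a concrete illustration of the shift-and-mask approach (my sketch,
not anything from the Cray compiler; it assumes the first 64 bits of an
IPv4 header have already been loaded big-endian into a 64-bit integer):

#include <stdint.h>

/* extract 'width' bits starting 'bitoff' bits from the most significant
   end of a 64-bit word; assumes width < 64 */
static uint64_t get_field(uint64_t word, unsigned bitoff, unsigned width)
{
    return (word >> (64 - bitoff - width)) & ((1ULL << width) - 1);
}

void parse_ipv4_word0(uint64_t w0)
{
    unsigned version = get_field(w0,  0,  4);
    unsigned ihl     = get_field(w0,  4,  4);
    unsigned tot_len = get_field(w0, 16, 16);
    unsigned ident   = get_field(w0, 32, 16);
    (void)version; (void)ihl; (void)tot_len; (void)ident;
}

Nothing here needs a 16-bit or 32-bit type; everything stays in 64-bit
registers, which is why the shift approach ports to an ILP64 machine
where narrow typedefs and bit fields do not.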
On 8/6/25 10:25, John Levine wrote:
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
Not aware of any platforms that do/did ILP64.
AFAIK the Cray-1 (1976) was the first 64-bit machine, ...
The IBM 7030 STRETCH was the first 64 bit machine, shipped in 1961,
but I would be surprised if anyone had written a C compiler for it.
It was bit addressable but memories in those days were so small that a
full bit address was only 24 bits. So if I were writing a C compiler,
pointers and ints would be 32 bits, char 8 bits, long 64 bits.
(There is a thing called STRETCH C Compiler but it's completely
unrelated.)
I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
too, and it seems like all it does is drastically shrink your address
space and complexify instruction and operand fetch to (maybe) save a few bytes.
That is one of the things I find astonishing - how a company like
DG grew from a kitchen-table affair to the size they had.
Bit addressing, presumably combined with an easy way to mask the
results/pick an arbitrary number of bits less than or equal to register
width, makes it easier to implement compression/decompression/codecs.
However, since the only thing needed to do the same on current CPUs is a single shift after an aligned load, this feature costs far too much in reduced address space compared to what you gain.
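A sketch of what that looks like in practice (my example, not Terje's
code; it uses a possibly-unaligned 64-bit load via memcpy and assumes a
little-endian host and fields of at most 57 bits; on a machine that
only allows aligned loads, a field straddling an 8-byte boundary would
need a second load and shift):

#include <stdint.h>
#include <string.h>

/* read n bits (n <= 57) starting at absolute bit position 'bitpos'
   in a byte buffer */
static uint64_t get_bits(const uint8_t *buf, uint64_t bitpos, unsigned n)
{
    uint64_t w;
    memcpy(&w, buf + (bitpos >> 3), sizeof w);       /* one 64-bit load */
    return (w >> (bitpos & 7)) & ((1ULL << n) - 1);  /* shift and mask  */
}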
scott@slp53.sl.home (Scott Lurndal) writes:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix)
were available in 1975. Mask programmable PLA were available from TI
circa 1970 but masks would be too expensive.
Burroughs mainframers started designing with ECL gate arrays circa
1981, and they shipped in 1987[*]. I suspect even FPAL or other PLAs
would have been far too expensive to use to build a RISC CPU,
The Signetics 82S100 was used in early Commodore 64s, so it could not
have been expensive (at least in 1982, when these early C64s were
built). PLAs were also used by HP when building the first HPPA CPU.
especially for one of the BUNCH, for whom backward compatibility was paramount.
Why should the cost of building a RISC CPU depend on whether you are
in the BUNCH (Burroughs, UNIVAC, NCR, Control Data Corporation (CDC),
and Honeywell)? And how is the cost of building a RISC CPU related to
backwards compatibility?
It added its own misfeatures, though.
I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
too
That disparity between CPU and RAM speeds is even greater today than
it was back then. Yet we have moved away from adding ever-more-complex instructions, and are getting better performance with simpler ones.
How come? Caching.
On Thu, 7 Aug 2025 02:22:05 -0000 (UTC)
Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
That disparity between CPU and RAM speeds is even greater today than
it was back then. Yet we have moved away from adding ever-more-complex
instructions, and are getting better performance with simpler ones.
How come? Caching.
Yes, but complex instructions also make pipelining and out-of-order
execution much more difficult - to the extent that, as far back as the Pentium Pro, Intel has had to implement the x86 instruction set as a microcoded program running on top of a simpler RISC architecture.
However, in the case of the IBM STRETCH, I think there's a good
excuse: If you go from word addressing to subunit addressing (not sure
why Stretch went there, however; does a supercomputer need that?), why
stop at characters (especially given that character size at the time
was still not settled)? Why not continue down to bits?
It's a 32 bit architecture with 31 bit addressing, kludgily extended
from 24 bit addressing in the 1970s.
Peter Flass <Peter@Iron-Spring.com> writes:
[IBM STRETCH bit-addressable]
I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
too
One might come to think that it's the signature of overambitious
projects that eventually fail.
However, in the case of the IBM STRETCH, I think there's a good
excuse: If you go from word addressing to subunit addressing (not sure
why Stretch went there, however; does a supercomputer need that?)
stop at characters (especially given that character size at the time
was still not settled)? Why not continue down to bits?
The S/360 then found the compromise that conquered the world: Byte
addressing with 8-bit bytes.
Why iAPX432 went for bit addressing at a time when byte addressing and
the 8-bit byte was firmly established, over ten years after the S/360
and 5 years after the PDP-11 is a mystery, however.
I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
too, and it seems like all it does is drastically shrink your address
space and complexify instruction and operand fetch to (maybe) save a few
bytes.
Bit addressing, presumably combined with an easy way to mask the
results/pick an arbitrary number of bits less than or equal to register
width, makes it easier to implement compression/decompression/codecs.
On Tue, 5 Aug 2025 13:04:39 -0500
"Brian G. Lucas" <bagel99@gmail.com> wrote:
Hi, Brian
By chance, do you happen to know why Mitch Alsup recently disappeared
from the Usenet?
John Ames wrote:
On Thu, 7 Aug 2025 02:22:05 -0000 (UTC)
Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
That disparity between CPU and RAM speeds is even greater today than
it was back then. Yet we have moved away from adding ever-more-complex
instructions, and are getting better performance with simpler ones.
How come? Caching.
Yes, but complex instructions also make pipelining and out-of-order
execution much more difficult - to the extent that, as far back as the
Pentium Pro, Intel has had to implement the x86 instruction set as a
microcoded program running on top of a simpler RISC architecture.
That's simply wrong:
The PPro had close to zero microcode actually running in any user program.
What it did have was decoders that would look at complex operations and
spit out two or more basic operations, like load+execute.
Later on we've seen the opposite where cmp+branch could be combined into
a single internal op.
Terje
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <106uqej$36gll$3@dont-email.me>,
Thomas Koenig <tkoenig@netcologne.de> wrote:
Peter Flass <Peter@Iron-Spring.com> schrieb:
The support issues alone were killers. Think about the
Orange/Grey/(Blue?) Wall of VAX documentation, and then look at the
five-page flimsy you got with a micro. The customers were willing to
accept cr*p from a small startup, but wouldn't put up with it from IBM or DEC.
Using UNIX faced stiff competition from AT&T's internal IT people,
who wanted to run DEC's operating systems on all PDP-11 within
the company (basically, they wanted to kill UNIX). They pointed
towards the large amount of documentation that DEC provided, compared
to the low amount of UNIX, as proof of superiority. The UNIX people
saw it differently...
I've never heard this before, and I do not believe that it is
true. Do you have a source?
Hmm... I _think_ it was on a talk given by the UNIX people,
but I may be misremembering.
On 8/6/25 22:29, Thomas Koenig wrote:
That is one of the things I find astonishing - how a company like DG
grew from a kitchen-table affair to the size they had.
Recent history is littered with companies like this.
On 8/7/25 3:48 PM, Michael S wrote:
On Tue, 5 Aug 2025 13:04:39 -0500
"Brian G. Lucas" <bagel99@gmail.com> wrote:
Hi, Brian
By chance, do you happen to know why Mitch Alsup recently disappeared
from the Usenet?
No, I do not. And I am worried.
brian
On Thu, 7 Aug 2025 17:52:05 +0200, Terje Mathisen
<terje.mathisen@tmsw.no> wrote:
John Ames wrote:
The PPro had close to zero microcode actually running in any user program.
What it did have was decoders that would look at complex operations and
spit out two or more basic operations, like load+execute.
Later on we've seen the opposite where cmp+branch could be combined into
a single internal op.
Terje
You say "tomato". 8-)
It's still "microcode" for some definition ... just not a classic >"interpreter" implementation where a library of routines implements
the high level instructions.
The decoder converts x86 instructions into traces of equivalent wide
micro instructions which are directly executable by the core. The
traces then are cached separately [there is a $I0 "microcache" below
$I1] and can be re-executed (e.g., for loops) as long as they remain
in the microcache.
I guess they thought that 32 address bits left plenty to spare for
something like this. But I think it just shortened the life of their
32-bit architecture by that much more.
Michael S wrote:
On Tue, 5 Aug 2025 13:04:39 -0500
"Brian G. Lucas" <bagel99@gmail.com> wrote:
Hi, Brian
By chance, do you happen to know why Mitch Alsup recently
disappeared from the Usenet?
I've been in contact,
he lost his usenet provider,
Terje
and the one I am using does not seem to accept new registrations
any longer.
Robert Swindells <rjs@fdy2.co.uk> writes:
On Wed, 06 Aug 2025 17:00:03 -0400, EricP wrote:
If I was building a TTL risc cpu in 1975 I would definitely be usingThe DG MV/8000 used PALs but The Soul of a New Machine hints that there
lots of FPLA's, not just for decode but also state machines in fetch,
page table walkers, cache controllers, etc.
were supply problems with them at the time.
The PALs used for the MV/8000 were different, came out in 1978 (i.e.,
very recent when the MV/8000 was designed), addressed shortcomings of
the PLA Signetics 82S100 that had been available since 1975, and the
PALs initially had yield problems; see <https://en.wikipedia.org/wiki/Programmable_Array_Logic#History>.
Concerning the speed of the 82S100 PLA, <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
reports propagation delays of 25ns-35ns for specific signals in Table
3.4, and EricP found 50ns "max access" in the data sheet of the
82S100. That does not sound too slow to be usable in a CPU with 200ns
cycle time, so yes, one could have used that for the VAX.
- anton
On Fri, 8 Aug 2025 11:58:39 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
On Tue, 5 Aug 2025 13:04:39 -0500
"Brian G. Lucas" <bagel99@gmail.com> wrote:
Hi, Brian
By chance, do you happen to know why Mitch Alsup recently
disappeared from the Usenet?
I've been in cantact,
Good.
he lost his usenet provider,
Terje
I was suspecting that much. What made me worry is that almost at the
same date he stopped posting on RWT forum.
and the one I am > using does not seem to accept new registrations
any langer.
Eternal September does not accept new registrations?
I think, if it is true, Ray Banana will make an exception for Mitch if
asked personally.
Michael S <already5chosen@yahoo.com> writes:
On Fri, 8 Aug 2025 11:58:39 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
On Tue, 5 Aug 2025 13:04:39 -0500
"Brian G. Lucas" <bagel99@gmail.com> wrote:
Hi, Brian
By chance, do you happen to know why Mitch Alsup recently
disappeared from the Usenet?
I've been in cantact,
Good.
he lost his usenet provider,
Terje
I was suspecting that much. What made me worrying is that almost at the
same date he stopped posting on RWT forum.
and the one I am > using does not seem to accept new registrations
any langer.
Eternal September does not accept new registrations?
I think, if it is true, Ray Banana will make excception for Mitch if
asked personally.
www.usenetserver.com is priced reasonably. I've been using them
for well over a decade now.
George Neuner <gneuner2@comcast.net> writes:
The decoder converts x86 instructions into traces of equivalent wide
micro instructions which are directly executable by the core. The
traces then are cached separately [there is a $I0 "microcache" below
$I1] and can be re-executed (e.g., for loops) as long as they remain
in the microcache.
No such cache in the P6 or any of its descendents until the Sandy
Bridge (2011). The Pentium 4 has a microop cache, but eventually
(with Core Duo, Core2 Duo) was replaced with P6 descendents that have
no microop cache. Actually, the Core 2 Duo has a loop buffer which
might be seen as a tiny microop cache. Microop caches and loop
buffers still have to contain information about which microops belong
to the same CISC instruction, because otherwise the reorder buffer
could not commit/execute* CISC instructions.
* OoO microarchitecture terminology calls what the reorder buffer does
"retire" or "commit". But this is where the speculative execution
becomes architecturally visible ("commit"), so from an architectural
view it is execution.
Followups set to comp.arch
- anton
George Neuner wrote:
On Tue, 5 Aug 2025 05:48:16 -0000 (UTC), Thomas Koenig
<tkoenig@netcologne.de> wrote:
Waldek Hebisch <antispam@fricas.org> schrieb:
I am not sure what technology they used
for register file. For me most likely is fast RAM, but that
normally would give 1 R/W port.
They used fast SRAM and had three copies of their registers,
for 2R1W.
I did use 11/780, 8600, and briefly even MicroVax - but I'm primarily
a software person, so please forgive this stupid question.
Why three copies?
Also did you mean 3 total? Or 3 additional copies (4 total)?
Given 1 R/W port each I can see needing a pair to handle cases where
destination is also a source (including autoincrement modes). But I
don't see a need ever to sync them - you just keep track of which was
updated most recently, read that one and - if applicable - write the
other and toggle.
Since (at least) the early models evaluated operands sequentially,
there doesn't seem to be a need for more. Later models had some
semblance of pipeline, but it seems that if the /same/ value was
needed multiple times, it could be routed internally to all users
without requiring additional reads of the source.
Or do I completely misunderstand? [Definitely possible.]
To make a 2R 1W port reg file from a single port SRAM you use two banks
which can be addressed separately during the read phase at the start of
the clock phase, and at the end of the clock phase you write both banks
at the same time on the same port number.
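A toy C model of that arrangement (my sketch, not DEC's actual logic;
register count and width are just for illustration):

#include <stdint.h>

#define NREGS 16

typedef struct {
    uint32_t bank_a[NREGS];   /* 1R/W SRAM bank #1 */
    uint32_t bank_b[NREGS];   /* 1R/W SRAM bank #2 */
} regfile_2r1w;

/* read phase: the two banks are addressed independently,
   giving two read ports */
static void rf_read(const regfile_2r1w *rf, unsigned ra, unsigned rb,
                    uint32_t *va, uint32_t *vb)
{
    *va = rf->bank_a[ra];
    *vb = rf->bank_b[rb];
}

/* write phase: both banks are written at the same port number,
   so their contents never diverge */
static void rf_write(regfile_2r1w *rf, unsigned rw, uint32_t value)
{
    rf->bank_a[rw] = value;
    rf->bank_b[rw] = value;
}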
The 780 wiring parts list shows Nat Semi 85S68 which are
16*4b 1RW port, 40 ns access SRAMS, tri-state output,
with latched read output to eliminate data race through on write.
So they have two 16 * 32b banks for the 16 general registers.
The third 16 * 32b bank was likely for microcode temp variables.
The thing is, yes, they only needed 1R port for instruction operands
because sequential decode could only produce one operand at a time.
Even on later machines circa 1990 like 8700/8800 or NVAX the general
register file is only 1R1W port, the temp register bank is 2R1W.
So the 780 second read port is likely used the same as later VAXen,
its for reading the temp values concurrently with an operand register.
The operand registers were read one at a time because of the decode
bottleneck.
I'm wondering how they handled modifying address modes like autoincrement
and still had precise interrupts.
ADDLL (r2)+, (r2)+, (r2)+
the first (left) operand reads r2 then adds 4, which the second r2 reads
and also adds 4, then the third again. It doesn't have a renamer so
it has to stash the first modified r2 in the temp registers,
and (somehow) pass that info to decode of the second operand
so Decode knows to read the temp r2 not the general r2,
and same for the third operand.
At the end of the instruction if there is no exception then
temp r2 is copied to general r2 and memory value is stored.
I'm guessing in Decode someplace there are comparators to detect when
the operand registers are the same so microcode knows to switch to the
temp bank for a modified register.
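A very rough C sketch of the scheme guessed at above (the names and the
structure are mine, purely illustrative, not DEC's microcode):

#include <stdint.h>
#include <stdbool.h>

#define NREGS 16

static uint32_t gpr[NREGS];        /* architectural registers          */
static uint32_t temp[NREGS];       /* microcode temp copies            */
static bool     modified[NREGS];   /* "read the temp bank" flag        */

static uint32_t read_reg(unsigned r)
{
    /* the comparator in Decode: use the temp copy if an earlier
       operand of this instruction already modified the register */
    return modified[r] ? temp[r] : gpr[r];
}

static uint32_t autoinc_operand(unsigned r, unsigned size)
{
    uint32_t addr = read_reg(r);
    temp[r] = addr + size;         /* stash the bumped value ...        */
    modified[r] = true;            /* ... without touching gpr[r] yet   */
    return addr;                   /* the address used by this operand  */
}

static void end_of_instruction(bool exception)
{
    for (unsigned r = 0; r < NREGS; r++) {
        if (modified[r]) {
            if (!exception)
                gpr[r] = temp[r];  /* commit the side effect            */
            modified[r] = false;   /* discard it on an exception        */
        }
    }
}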
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <106uqki$36gll$4@dont-email.me>,
Thomas Koenig <tkoenig@netcologne.de> wrote:
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <44okQ.831008$QtA1.573001@fx16.iad>,
Scott Lurndal <slp53@pacbell.net> wrote:
[snip]
We tend to be spoiled by modern process densities. The
VAX 11/780 was built using SSI logic chips, thus board
space and backplane wiring were significant constraints
on the logic designs of the era.
Indeed. I find this speculation about the VAX, kind of odd: the
existence of the 801 as a research project being used as an
existence proof to justify assertions that a pipelined RISC
design would have been "better" don't really hold up, when we
consider that the comparison is to a processor designed for
commercial applications on a much shorter timeframe.
I disagree. The 801 was a research project without much time
pressure, and they simulated the machine (IIRC at the gate level)
before they ever bulit one. Plus, they developed an excellent
compiler which implemented graph coloring.
But IBM had zero interest in competition to their own /370 line,
although the 801 would have brought performance improvements
over that line.
I'm not sure what, precisely, you're disagreeing with.
I'm saying that the line of though that goes, "the 801 existed,
therefore a RISC VAX would have been better than the
architecture DEC ultimately produced" is specious, and the
conclusion does not follow.
There are a few intermediate steps.
The 801 demonstrated that a RISC, including caches and pipelining,
would have been feasible at the time. It also demonstrated that
somebody had thought of graph coloring algorithms.
There can also be no doubt that a RISC-type machine would have
exhibited the same performance advantages (at least in integer
performance) as a RISC vs CISC 10 years later. The 801 did so
vs. the /370, as did the RISC processors vs, for example, the
680x0 family of processors (just compare ARM vs. 68000).
Or look at the performance of the TTL implementation of HP-PA,
which used PALs which were not available to the VAX 11/780
designers, so it could be clocked a bit higher, but at
a multiple of the performance than the VAX.
So, Anton visiting DEC or me visiting Data General could have
brought them a technology which would significantly outperformed
the VAX (especially if we brought along the algorithm for graph
coloring. Some people at IBM would have been peeved at having
somebody else "develop" this at the same time, but OK.
In article <1070cj8$3jivq$1@dont-email.me>,
Thomas Koenig <tkoenig@netcologne.de> wrote:
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <106uqki$36gll$4@dont-email.me>,
Thomas Koenig <tkoenig@netcologne.de> wrote:
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <44okQ.831008$QtA1.573001@fx16.iad>,
Scott Lurndal <slp53@pacbell.net> wrote:
[snip]
We tend to be spoiled by modern process densities. The
VAX 11/780 was built using SSI logic chips, thus board
space and backplane wiring were significant constraints
on the logic designs of the era.
Indeed. I find this speculation about the VAX, kind of odd: the
existence of the 801 as a research project being used as an
existence proof to justify assertions that a pipelined RISC
design would have been "better" don't really hold up, when we
consider that the comparison is to a processor designed for
commercial applications on a much shorter timeframe.
I disagree. The 801 was a research project without much time
pressure, and they simulated the machine (IIRC at the gate level) >>>>before they ever bulit one. Plus, they developed an excellent
compiler which implemented graph coloring.
But IBM had zero interest in competition to their own /370 line, >>>>although the 801 would have brought performance improvements
over that line.
I'm not sure what, precisely, you're disagreeing with.
I'm saying that the line of though that goes, "the 801 existed,
therefore a RISC VAX would have been better than the
architecture DEC ultimately produced" is specious, and the
conclusion does not follow.
There are a few intermediate steps.
The 801 demonstrated that a RISC, including caches and pipelining,
would have been feasible at the time. It also demonstrated that
somebody had thought of graph coloring algorithms.
This is the part where the argument breaks down. VAX and 801
were roughly contemporaneous, with VAX being commercially
available around the time the first 801 prototypes were being
developed. There's simply no way in which the 801,
specifically, could have had significant impact on VAX
development.
If you're just talking about RISC design techniques generically,
then I dunno, maybe, sure, why not,
but that's a LOT of
speculation with hindsight-colored glasses.
Furthermore, that
speculation focuses solely on technology, and ignores the
business realities that VAX was born into. Maybe you're right,
maybe you're wrong, we can never _really_ say, but there was a
lot more that went into the decisions around the VAX design than
just technology.
While it's always fun to speculate about alternate timelines, if
all you are talking about is a hypothetical that someone at DEC
could have independently used the same techniques, producing a
more performance RISC-y VAX with better compilers, then sure, I
guess, why not.
But as with all alternate history, this is
completely unknowable.
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Concerning the speed of the 82S100 PLA,
<http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
reports propagation delays of 25ns-35ns for specific signals in Table
3.4, and EricP found 50ns "max access" in the data sheet of the
82S100. That does not sound too slow to be usable in a CPU with 200ns
cycle time, so yes, one could have used that for the VAX.
Were there different versions, maybe?
https://deramp.com/downloads/mfe_archive/050-Component%20Specifications/Signetics-Philips/82S100%20FPGA.pdf
gives an I/O propagation delay of 80 ns max.
By comparison, you could get an eight-input NAND gate with a
maximum delay of 12 ns (the 74H030), so putting two in sequence
to simulate a PLA would have been significantly faster.
I can understand people complaining that PALs were slow.
Thomas Koenig wrote:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Concerning the speed of the 82S100 PLA,
<http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
reports propagation delays of 25ns-35ns for specific signals in Table
3.4, and EricP found 50ns "max access" in the data sheet of the
82S100. That does not sound too slow to be usable in a CPU with 200ns
cycle time, so yes, one could have used that for the VAX.
Were there different versions, maybe?
https://deramp.com/downloads/mfe_archive/050-Component%20Specifications/Signetics-Philips/82S100%20FPGA.pdf
gives an I/O propagation delay of 80 ns max.
Yes, must be different versions.
I'm looking at this 1976 datasheet which says 50 ns max access:
http://www.bitsavers.org/components/signetics/_dataBooks/1976_Signetics_Field_Programmable_Logic_Arrays.pdf
By comparison, you could get an eight-input NAND gate with a
maximum delay of 12 ns (the 74H030), so putting two in sequence
to simulate a PLA would have been significantly faster.
I can undersand people complaining that PALs were slow.
The 82S100 PLA is logic equivalent to:
- 16 inputs each with an optional input invertor,
- optionally wired to 48 16-input AND's,
- optionally wired to 8 48-input OR's,
- with 8 optional XOR output invertors,
- driving 8 tri-state or open collector buffers.
So I count roughly 7 or 8 equivalent gate delays.
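For concreteness, a small C model of that AND-OR-XOR structure (a
functional sketch only; in the real part the three arrays below are
fuse-programmed, and the tri-state/open-collector outputs are ignored):

#include <stdint.h>

typedef struct {
    uint16_t need_one[48];   /* inputs that must be 1 for each product term */
    uint16_t need_zero[48];  /* inputs that must be 0 for each product term */
    uint8_t  or_mask[48];    /* which of the 8 outputs each term feeds      */
    uint8_t  xor_invert;     /* optional per-output inversion               */
} pla82s100;

static uint8_t pla_eval(const pla82s100 *p, uint16_t in)
{
    uint8_t out = 0;
    for (int t = 0; t < 48; t++) {
        /* a product term fires when all its required-1 inputs are 1
           and all its required-0 inputs are 0 */
        if ((in & p->need_one[t]) == p->need_one[t] &&
            (in & p->need_zero[t]) == 0)
            out |= p->or_mask[t];
    }
    return out ^ p->xor_invert;
}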
Also the decoder would need a lot of these so I doubt we can afford the
power and heat for H series. That 74H30 typical is 22 mW but the max
looks like 110 mW max each (I_ol output low of 20 mA * 5.5V max).
74LS30 is 20 ns max, 44 mW max.
Looking at a TI Bipolar Memory Data Manual from 1977,
it was about the same speed as say a 256b mask programmable TTL ROM,
7488A 32w * 8b, 45 ns max access.
One question: Did TTL people actually use the "typical" delays
from the handbooks, or did they use the maximum delays for their
designs?
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
[snip]
If you're just talking about RISC design techniques generically,
then I dunno, maybe, sure, why not,
Absolutely. The 801 demonstrated that it was a feasible
development _at the time_.
but that's a LOT of
speculation with hindsight-colored glasses.
Graph-colored glasses, for the register allocation, please :-)
Furthermore, that
speculation focuses solely on technology, and ignores the
business realities that VAX was born into. Maybe you're right,
maybe you're wrong, we can never _really_ say, but there was a
lot more that went into the decisions around the VAX design than
just technology.
I'm not sure what you mean here. Do you include the ISA design
in "technology" or not?
[...]
While it's always fun to speculate about alternate timelines, if
all you are talking about is a hypothetical that someone at DEC
could have independently used the same techniques, producing a
more performance RISC-y VAX with better compilers, then sure, I
guess, why not.
Yep, that would have been possible, either as an alternate
VAX or a competitor.
But as with all alternate history, this is
completely unknowable.
We know it was feasible, we know that there were a large
number of minicomputer companies at the time. We cannot
predict what a successful minicomputer implementation with
two or three times the performance of the VAX could have
done. We do know that this was the performance advantage
that Fountainhead from DG aimed for via programmable microcode
(which failed to deliver on time due to complexity), and
we can safely assume that DG would have given DEC a run
for its money if they had a system which significantly
outperformed the VAX.
So, "completely unknowable" isn't true; "quite plausible"
would be a more accurate description.
In article <107768m$17rul$1@dont-email.me>,
Thomas Koenig <tkoenig@netcologne.de> wrote:
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
While it's always fun to speculate about alternate timelines, if
all you are talking about is a hypothetical that someone at DEC
could have independently used the same techniques, producing a
more performance RISC-y VAX with better compilers, then sure, I
guess, why not.
Yep, that would have been possible, either as an alternate
VAX or a competitor.
But as with all alternate history, this is
completely unknowable.
Sure.
We know it was feasible, we know that there were a large
number of minicomputer companies at the time. We cannot
predict what a succesfull minicomputer implementation with
two or three times the performance of the VAX could have
done. We do know that this was the performance advantage
that Fountainhead from DG aimed for via programmable microcode
(which failed to deliver on time due to complexity), and
we can safely assume that DG would have given DEC a run
for its money if they had system which significantly
outperformed the VAX.
My contention is that while it was _feasible_ to build a
RISC-style machine for what became the VAX, that by itself is
only a part of the puzzle. One must also take into account
market and business contexts; perhaps such a machine would have
been faster, but I don't think anyone _really_ knew that to be
the case in 1975 when design work on the VAX started, and even
fewer would have believed it absent a working prototype, which
wouldn't arrive with the 801 for several years after the VAX had
shipped commercially. Furthermore, Digital would have
understood that many customers would have expected to be able to
program their new machine in macro assembler.
Interesting quote that indicates the direction they were looking:
"Many of the instructions in this specification could only
be used by COBOL if 9-bit ASCII were supported. There is currently
no plan for COBOL to support 9-bit ASCII".
"The following goals were taken into consideration when deriving an
address scheme for addressing 9-bit byte strings:"
Fundamentally, 36-bit words ended up being a dead-end.
My contention is that while it was _feasible_ to build a
RISC-style machine for what became the VAX,
that by itself is
only a part of the puzzle. One must also take into account
market and business contexts; perhaps such a machine would have
been faster,
but I don't think anyone _really_ knew that to be
the case in 1975 when design work on the VAX started,
and even
fewer would have believed it absent a working prototype,
which
wouldn't arrive with the 801 for several years after the VAX had
shipped commercially.
Furthermore, Digital would have
understood that many customers would have expected to be able to
program their new machine in macro assembler.
One must also keep in mind that the VAX group was competing
internally with the PDP-10 minicomputer.
Fundamentally, 36-bit words ended up being a dead-end.
scott@slp53.sl.home (Scott Lurndal) writes:
One must also keep in mind that the VAX group was competing
internally with the PDP-10 minicomputer.
This does not make the actual VAX more attractive relative to the
hypothetical RISC-VAX IMO.
Fundamentally, 36-bit words ended up being a dead-end.
The reasons why this once-common architectural style died out are:
* 18-bit addresses
Univac sold the 1100/2200 series, and later Unisys continued to
support that in the Unisys Clearpath systems.
<https://en.wikipedia.org/wiki/UNIVAC_1100/2200_series#Unisys_ClearPath_IX_series>
says:
http://bitsavers.informatik.uni-stuttgart.de/pdf/dec/pdp10/KC10_Jupiter/Jupiter_CIS_Instructions_Oct80.pdf
Interesting quote that indicates the direction they were looking:
"Many of the instructions in this specification could only
be used by COBOL if 9-bit ASCII were supported. There is currently
no plan for COBOL to support 9-bit ASCII".
"The following goals were taken into consideration when deriving an
address scheme for addressing 9-bit byte strings:"
Fundamentally, 36-bit words ended up being a dead-end.
The VAX-780 architecture handbook says the cache was 8 KB and used
8-byte lines. So an extra 12 KB of fast RAM could double the cache size.
That would be a nice improvement, but not as dramatic as the increase
from 2 KB to 12 KB.
MAP_32BIT is only used on x86-64 on Linux, and was originally
a performance hack for allocating thread stacks: apparently, it
was cheaper to do a thread switch with a stack below the 4GiB
barrier (sign extension artifact maybe? Who knows...). But it's
no longer required for that. But there's no indication that it
was for supporting ILP32 on a 64-bit system.
MAP_32BIT is only used on x86-64 on Linux, and was originally
a performance hack for allocating thread stacks: apparently, it
was cheaper to do a thread switch with a stack below the 4GiB
barrier (sign extension artifact maybe? Who knows...). But it's
no longer required for that. But there's no indication that it
was for supporting ILP32 on a 64-bit system.
Reading up about x32, it requires quite a bit more than just
allocating everything in the low 2GB.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
cross@spitfire.i.gajendra.net (Dan Cross) writes:
MAP_32BIT is only used on x86-64 on Linux, and was originally
a performance hack for allocating thread stacks: apparently, it
was cheaper to do a thread switch with a stack below the 4GiB
barrier (sign extension artifact maybe? Who knows...). But it's
no longer required for that. But there's no indication that it
was for supporting ILP32 on a 64-bit system.
Reading up about x32, it requires quite a bit more than just
allocating everything in the low 2GB.
The primary issue on x86 was with the API definitions. Several
legacy API declarations used signed integers (int) for
address parameters. This limited addresses to 2GB on
a 32-bit system.
https://en.wikipedia.org/wiki/Large-file_support
The Large File Summit (I was one of the Unisys reps at the LFS)
specified a standard way to support files larger than 2GB
on 32-bit systems that used signed integers for file offsets
and file size.
Also, https://en.wikipedia.org/wiki/2_GB_limit
The basic question is if VAX could afford the pipeline.
I doubt that they could afford 1-cycle multiply
or
even a barrel shifter.
It was accepted in this era that using more hardware could
give a substantial speedup. IIUC IBM used a quadratic rule:
performance was supposed to be proportional to the square of the
CPU price. That was partly marketing, but partly due to
compromises needed in smaller machines.
Also, IIRC, the major point of X32 was that it would narrow pointers and similar back down to 32 bits, requiring special versions of any shared libraries or similar.
But, it is unattractive to have both 32 and 64 bit versions of all the SO's.
In comp.arch BGB <cr88192@gmail.com> wrote:
Also, IIRC, the major point of X32 was that it would narrow pointers and
similar back down to 32 bits, requiring special versions of any shared
libraries or similar.
But, it is unattractive to have both 32 and 64 bit versions of all the SO's.
We have done something similar for years at Red Hat: not X32, but
x86_32, and it was pretty easy. If you're building a 32-bit OS anyway
(which we were) all you have to do is copy all 32-bit libraries from
one repo to the other.
I thought the AArch64 ILP32 design was pretty neat, but no one seems
to have been interested. I guess there wasn't an advantage worth the
effort.
That said, Unix generally defined -1 as the return value for all
other system calls, and code that checked for "< 0" instead of
-1 when calling a standard library function or system call was fundamentally
broken.
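A small illustration of why the distinction matters (my example; mmap
is the textbook case, not something named in the post above):

#include <stddef.h>
#include <sys/mman.h>

void *map_file(int fd, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)        /* correct: MAP_FAILED is (void *)-1    */
        return NULL;
    /* if ((long)p < 0) ... looks equivalent, but would wrongly reject
       a perfectly valid mapping placed in the upper half of the
       address space */
    return p;
}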
To be efficient, a RISC needs a full-width (presumably 32 bit)
external data bus, plus a separate address bus, which should at
least be 26 bits, better 32. A random ARM CPU I looked at at
bitsavers had 84 pins, which sounds reasonable.
Building an ARM-like instead of a 68000 would have been feasible,
but the resulting systems would have been more expensive (the
68000 had 64 pins).
So... a strategy could have been to establish the concept with
minicomputers, to make money (the VAX sold big) and then move
aggressively towards microprocessors, trying the disruptive move
towards workstations within the same company (which would be HARD).
As for the PC - a scaled-down, cheap, compatible, multi-cycle per
instruction microprocessor could have worked for that market,
but it is entirely unclear to me what this would / could
have done to the PC market, if IBM could have been prevented
from gaining such market dominance.
On Tue, 5 Aug 2025 21:01:20 -0000 (UTC), Thomas Koenig wrote:
So... a strategy could have been to establish the concept with
minicomputers, to make money (the VAX sold big) and then move
aggressively towards microprocessors, trying the disruptive move towards
workstations within the same company (which would be HARD).
None of the companies which tried to move in that direction were
successful. The mass micro market had much higher volumes and lower
margins, and those accustomed to lower-volume, higher-margin operation simply couldn’t adapt.
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
[Snipping the previous long discussion]
My contention is that while it was _feasible_ to build a
RISC-style machine for what became the VAX,
There, we agree.
that by itself is
only a part of the puzzle. One must also take into account
market and business contexts; perhaps such a machine would have
been faster,
With a certainty, if they followed RISC principles.
[snip]
which
wouldn't arrive with the 801 for several years after the VAX had
shipped commercially.
That is clear. It was the premise of this discussion that the
knowledge had been made available (via time travel or some other
strange means) to a company, which would then have used the
knowledge.
Furthermore, Digital would have
understood that many customers would have expected to be able to
program their new machine in macro assembler.
Programming a RISC in assembler is not so hard, at least in my
experience. Plus, people overestimated use of assembler even in
the mid-1970s, and underestimated the use of compilers.
[...]
I thought the AArch64 ILP32 design was pretty neat, but no one seems
to have been interested. I guess there wasn't an advantage worth the effort.
Alpha: On Digital OSF/1 the advantage was to be able to run programs
that work on ILP32, but not I32LP64.
x32: I expect that maintained Unix programs ran on I32LP64 in 2012,
and unmaintained ones did not get an x32 port anyway. And if there
are cases where my expectations do not hold, there still is i386. The
only advantage of x32 was a speed advantage on select programs.
That's apparently not enough to gain a critical mass of x32 programs.
Aarch64-ILP32: My guess is that the situation is very similar to the
x32 situation.
Admittedly, there are CPUs without ARM A32/T32
Thomas Koenig <tkoenig@netcologne.de> writes:
To be efficient, a RISC needs a full-width (presumably 32 bit)
external data bus, plus a separate address bus, which should at
least be 26 bits, better 32. A random ARM CPU I looked at at
bitsavers had 84 pins, which sounds reasonable.
Building an ARM-like instead of a 68000 would have been feasible,
but the resulting systems would have been more expensive (the
68000 had 64 pins).
One could have done a RISC-VAX microprocessor with 16-bit data bus and
24-bit address bus.
Thomas Koenig <tkoenig@netcologne.de> writes:<snip>
So how could one capture the PC market? The RISC-VAX would probably
have been too expensive for a PC, even with an 8-bit data bus and a
reduced instruction set, along the lines of RV32E. Or maybe that
would have been feasible, in which case one would provide 8080->reduced-RISC-VAX and 6502->reduced-RISC-VAX assemblers to make
porting easier. And then try to sell it to IBM Boca Raton.
scott@slp53.sl.home (Scott Lurndal) writes:
That said, Unix generally defined -1 as the return value for all
other system calls, and code that checked for "< 0" instead of
-1 when calling a standard library function or system call was fundamentally broken.
That may be the interface of the C system call wrapper,
errno, but at the actual system call level, the error is indicated in
an architecture-specific way, and the ones I have looked at before
today use the sign of the result register or the carry flag. On those architectures, where the sign is used, mmap(2) cannot return negative addresses, or must have a special wrapper.
Let's look at what the system call wrappers do on RV64G(C) (which has
no carry flag). For read(2) the wrapper contains:
0x3ff7f173be <read+20>: ecall
0x3ff7f173c2 <read+24>: lui a5,0xfffff
0x3ff7f173c4 <read+26>: mv s0,a0
0x3ff7f173c6 <read+28>: bltu a5,a0,0x3ff7f1740e <read+100>
For dup(2) the wrapper contains:
0x3ff7e7fe9a <dup+2>: ecall
0x3ff7e7fe9e <dup+6>: lui a7,0xfffff
0x3ff7e7fea0 <dup+8>: bltu a7,a0,0x3ff7e7fea6 <dup+14>
and for mmap(2):
0x3ff7e86b6e <mmap64+12>: ecall
0x3ff7e86b72 <mmap64+16>: lui a5,0xfffff
0x3ff7e86b74 <mmap64+18>: bltu a5,a0,0x3ff7e86b8c <mmap64+42>
So instead of checking for the sign flag, on RV64G the wrapper checks
if the result is above 0xfffffffffffff000, i.e., in the -4095..-1
error range. This costs one instruction more
than just checking the sign flag, and allows one to almost double the
number of bytes read(2) can read in one call, the number of file ids
that can be returned by dup(2), and the address range returnable by
mmap(2). Will we ever see processes that need more than 8EB? Maybe
not, but the designers of the RV64G(C) ABI obviously did not want to
be the ones that are quoted as saying "8EB should be enough for
anyone":-).
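In C, the test those three instructions perform amounts to an unsigned
comparison against -4096; a minimal sketch (not the actual glibc
source):

/* "lui a5,0xfffff" loads a constant that sign-extends to
   0xfffffffffffff000 == (unsigned long)-4096, so "bltu" flags any raw
   return value strictly above it -- i.e. -4095..-1 -- as an error. */
static int is_error_return(unsigned long raw)
{
    return raw > (unsigned long)-4096L;
}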
Followups to comp.arch
- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
That said, Unix generally defined -1 as the return value for all
other system calls, and code that checked for "< 0" instead of
-1 when calling a standard library function or system call was fundamentally broken.
That may be the interface of the C system call wrapper,
It _is_ the interface that the programmers need to be
concerned with when using POSIX C language bindings.
at the actual system call level, the error is indicated in
an architecture-specific way, and the ones I have looked at before
today use the sign of the result register or the carry flag. On those architectures, where the sign is used, mmap(2) cannot return negative addresses, or must have a special wrapper.
Why would the wrapper care if the system call failed?
lseek(2) and mmap(2) both require the return of arbitrary 32-bit
or 64-bit values, including those which when interpreted as signed
values are negative.
Clearly POSIX defines the interfaces and the underlying OS and/or
library functions implement the interfaces. The kernel interface
to the language library (e.g. libc) is irrelevant to typical programmers, except in the case where it doesn't provide the correct semantics.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Thomas Koenig <tkoenig@netcologne.de> writes:<snip>
So how could one capture the PC market? The RISC-VAX would probably
have been too expensive for a PC, even with an 8-bit data bus and a
reduced instruction set, along the lines of RV32E. Or maybe that
would have been feasible, in which case one would provide 8080->reduced-RISC-VAX and 6502->reduced-RISC-VAX assemblers to make porting easier. And then try to sell it to IBM Boca Raton.
https://en.wikipedia.org/wiki/Rainbow_100
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Thomas Koenig <tkoenig@netcologne.de> writes:
Building an ARM-like instead of a 68000 would have been feasible,
but the resulting systems would have been more expensive (the
68000 had 64 pins).
One could have done a RISC-VAX microprocessor with 16-bit data bus and 24-bit address bus.
LSI11?
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
That said, Unix generally defined -1 as the return value for all
other system calls, and code that checked for "< 0" instead of
-1 when calling a standard library function or system call was fundamentally
broken.
That may be the interface of the C system call wrapper,
It _is_ the interface that the programmers need to be
concerned with when using POSIX C language bindings.
True, but not relevant for the question at hand.
at the actual system call level, the error is indicated in
an architecture-specific way, and the ones I have looked at before
today use the sign of the result register or the carry flag. On those architectures, where the sign is used, mmap(2) cannot return negative addresses, or must have a special wrapper.
Why would the wrapper care if the system call failed?
The actual system call returns an error flag and a register. On some architectures, they support just a register. If there is no error,
the wrapper returns the content of the register. If the system call indicates an error, you see from the value of the register which error
it is; the wrapper then typically transforms the register in some way
(e.g., by negating it) and stores the result in errno, and returns -1.
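A minimal sketch of that wrapper logic, assuming the single-register
convention where the kernel hands back a negated errno code on
failure; do_raw_syscall() is a hypothetical stand-in for the
architecture-specific trap stub, not a real API:

#include <errno.h>

extern long do_raw_syscall(long nr, long a0, long a1, long a2); /* hypothetical */

long syscall_wrapper(long nr, long a0, long a1, long a2)
{
    long raw = do_raw_syscall(nr, a0, a1, a2);
    if ((unsigned long)raw >= (unsigned long)-4095L) { /* raw in -4095..-1 */
        errno = (int)-raw;   /* negate the register and store it in errno */
        return -1;
    }
    return raw;              /* success value returned unchanged */
}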
lseek(2) and mmap(2) both require the return of arbitrary 32-bit
or 64-bit values, including those which when interpreted as signed
values are negative.
For lseek(2):
| Upon successful completion, lseek() returns the resulting offset
| location as measured in bytes from the beginning of the file.
Given that off_t is signed, lseek(2) can only return positive values.
For mmap(2):
| On success, mmap() returns a pointer to the mapped area.
So it's up to the kernel which user-level addresses it returns. E.g.,
32-bit Linux originally only produced user-level addresses below 2GB.
When memories grew larger, on some architectures (e.g., i386) Linux
increased that to 3GB.
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Thomas Koenig wrote:
Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
Concerning the speed of the 82S100 PLA,
<http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
reports propagation delays of 25ns-35ns for specific signals in Table
3.4, and EricP found 50ns "max access" in the data sheet of the
82S100. That does not sound too slow to be usable in a CPU with 200ns
cycle time, so yes, one could have used that for the VAX.
Were there different versions, maybe?
Yes, must be different versions.
https://deramp.com/downloads/mfe_archive/050-Component%20Specifications/Signetics-Philips/82S100%20FPGA.pdf
gives an I/O propagation delay of 80 ns max.
I'm looking at this 1976 datasheet which says 50 ns max access:
http://www.bitsavers.org/components/signetics/_dataBooks/1976_Signetics_Field_Programmable_Logic_Arrays.pdf
That is strange. Why would they make the chip worse?
Unless... maybe somebody (a customer, or they themselves)
discovered that there may have been conditions where they could
only guarantee 80 ns. Maybe a combination of tolerances to one
side and a certain logic programming, and they changed the
data sheet.
By comparison, you could get an eight-input NAND gate with a
maximum delay of 12 ns (the 74H030), so putting two in sequence
to simulate a PLA would have been significantly faster.
I can understand people complaining that PALs were slow.
The 82S100 PLA is logic equivalent to:
- 16 inputs each with an optional input invertor,
Should be free coming from a Flip-Flop.
- optionally wired to 48 16-input AND's,
- optionally wired to 8 48-input OR's,
Those would be the two layers of NAND gates, so depending
on which ones you chose, you have to add those.
- with 8 optional XOR output invertors,
I don't find that in the diagrams (but I might be missing that,
I am not an expert at reading them).
- driving 8 tri-state or open collector buffers.
A 74265 had switching times of max. 18 ns, driving 30
output loads, so that would be on top.
One question: Did TTL people actually use the "typical" delays
from the handbooks, or did they use the maximum delays for their
designs? Using anything below the maximum would sound dangerous to
me, but maybe this was possible to a certain extent.
So I count roughly 7 or 8 equivalent gate delays.
Another point... if you don't need 16 inputs or 8 outputs, you
are also paying a lot more. If you have a 6-bit primary opcode,
you don't need a full 16 bits of input.
Also the decoder would need a lot of these so I doubt we can afford the
power and heat for H series. That 74H30 typical is 22 mW but the max
looks like 110 mW max each (I_ol output low of 20 mA * 5.5V max).
74LS30 is 20 ns max, 44 mW max.
Looking at a TI Bipolar Memory Data Manual from 1977,
it was about the same speed as say a 256b mask programmable TTL ROM,
7488A 32w * 8b, 45 ns max access.
Hmm... did the VAX, for example, actually use them, or were they
using logic built from conventional chips?
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Thomas Koenig <tkoenig@netcologne.de> writes:<snip>
So how could one capture the PC market? The RISC-VAX would probably
have been too expensive for a PC, even with an 8-bit data bus and a reduced instruction set, along the lines of RV32E. Or maybe that
would have been feasible, in which case one would provide 8080->reduced-RISC-VAX and 6502->reduced-RISC-VAX assemblers to make porting easier. And then try to sell it to IBM Boca Raton.
https://en.wikipedia.org/wiki/Rainbow_100
That's completely different from what I suggest above, and DEC
obviously did not capture the PC market with that.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
aph@littlepinkcloud.invalid writes:
I thought the AArch64 ILP32 design was pretty neat, but no one seems
to have been interested. I guess there wasn't an advantage worth the effort.
Alpha: On Digital OSF/1 the advantage was to be able to run programs
that work on ILP32, but not I32LP64.
I understand what you're saying here, but disagree. A program that
works on ILP32 but not I32LP64 is fundamentally broken, IMHO.
x32: I expect that maintained Unix programs ran on I32LP64 in 2012,
and unmaintained ones did not get an x32 port anyway. And if there
are cases where my expectations do not hold, there still is i386. The
only advantage of x32 was a speed advantage on select programs.
I suspect that performance advantage was minimal, the primary advantage would have been that existing applications didn't need to be rebuilt
and requalified.
Aarch64-ILP32: My guess is that the situation is very similar to the
x32 situation.
In the early days of AArch64 (2013), we actually built a toolchain to support Aarch64-ILP32. Not a single customer exhibited _any_ interest in that
and the project was dropped.
Admittedly, there are CPUs without ARM A32/T32
Very few AArch64 designs included AArch32 support; even the Cortex
chips supported it only at exception level zero (user mode).
The markets for AArch64 (servers, high-end appliances) didn't have
a huge existing reservoir of 32-bit ARM applications, so there was
no demand to support them.
While looking for the handbook, I also found
http://hps.ece.utexas.edu/pub/patt_micro22.pdf
which describes some parts of the microarchitecture of the VAX 11/780, 11/750, 8600, and 8800.
Interestingly, Patt wrote this in 1990, after participating in the HPS
papers on an OoO implementation of the VAX architecture.
- anton
scott@slp53.sl.home (Scott Lurndal) writes:
[snip]
errno, but at the actual system call level, the error is indicated in
an architecture-specific way, and the ones I have looked at before
today use the sign of the result register or the carry flag. On those architectures, where the sign is used, mmap(2) cannot return negative addresses, or must have a special wrapper.
Why would the wrapper care if the system call failed? The
return value from the kernel should be passed through to
the application as per the POSIX language binding requirements.
lseek(2) and mmap(2) both require the return of arbitrary 32-bit
or 64-bit values, including those which when interpreted as signed
values are negative.
Clearly POSIX defines the interfaces and the underlying OS and/or
library functions implement the interfaces. The kernel interface
to the language library (e.g. libc) is irrelevant to typical programmers, except in the case where it doesn't provide the correct semantics.
In article <2025Aug13.194659@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
Thomas Koenig <tkoenig@netcologne.de> writes:<snip>
So how could one capture the PC market? The RISC-VAX would probably
have been too expensive for a PC, even with an 8-bit data bus and a
reduced instruction set, along the lines of RV32E. Or maybe that
would have been feasible, in which case one would provide
8080->reduced-RISC-VAX and 6502->reduced-RISC-VAX assemblers to make
porting easier. And then try to sell it to IBM Boca Raton.
https://en.wikipedia.org/wiki/Rainbow_100
That's completely different from what I suggest above, and DEC
obviously did not capture the PC market with that.
They did manage to crack the college market some where CS departments
had DEC hardware anyway. I know USC (original) had a Rainbow computer
lab circa 1985. That "in" didn't translate to anything else though.
scott@slp53.sl.home (Scott Lurndal) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
That said, Unix generally defined -1 as the return value for all
other system calls, and code that checked for "< 0" instead of
-1 when calling a standard library function or system call was fundamentally
broken.
That may be the interface of the C system call wrapper,
It _is_ the interface that the programmers need to be
concerned with when using POSIX C language bindings.
True, but not relevant for the question at hand.
at the actual system call level, the error is indicated in
an architecture-specific way, and the ones I have looked at before
today use the sign of the result register or the carry flag. On those architectures, where the sign is used, mmap(2) cannot return negative addresses, or must have a special wrapper.
Why would the wrapper care if the system call failed?
The actual system call returns an error flag and a register. On some architectures, they support just a register. If there is no error,
the wrapper returns the content of the register. If the system call indicates an error, you see from the value of the register which error
it is; the wrapper then typically transforms the register in some way
(e.g., by negating it) and stores the result in errno, and returns -1.
lseek(2) and mmap(2) both require the return of arbitrary 32-bit
or 64-bit values, including those which when interpreted as signed
values are negative.
For lseek(2):
| Upon successful completion, lseek() returns the resulting offset
| location as measured in bytes from the beginning of the file.
Given that off_t is signed, lseek(2) can only return positive values.
For mmap(2):
| On success, mmap() returns a pointer to the mapped area.
So it's up to the kernel which user-level addresses it returns. E.g.,
32-bit Linux originally only produced user-level addresses below 2GB.
When memories grew larger, on some architectures (e.g., i386) Linux
increased that to 3GB.
Clearly POSIX defines the interfaces and the underlying OS and/or
library functions implement the interfaces. The kernel interface
to the language library (e.g. libc) is irrelevant to typical programmers
Sure, but system calls are first introduced in real kernels using the
actual system call interface, and are limited by that interface. And
that interface is remarkably similar between the early days of Unix
and recent Linux kernels for various architectures.
And when you look
closely, you find how the system calls are designed to support returning
the error indication, success value, and errno in one register.
lseek64 on 32-bit platforms is an exception (the success value does
not fit in one register), and looking at the machine code of the
wrapper and comparing it with the machine code for the lseek wrapper,
some funny things are going on, but I would have to look at the source
code to understand what is going on. One other interesting thing I
noticed is that the system call wrappers from libc-2.36 on i386 now
draws the boundary between success returns and error returns at
0xfffff000:
0xf7d853c4 <lseek+68>: call *%gs:0x10
0xf7d853cb <lseek+75>: cmp $0xfffff000,%eax
0xf7d853d0 <lseek+80>: ja 0xf7d85410 <lseek+144>
So now the kernel can produce 4095 error values, and the rest can be
success values. In particular, mmap() can return all possible page
addresses as success values with these wrappers. When I last looked
at how system calls are done, I found just a check of the N or the C
flag.
I wonder how the kernel is informed that it can now return more
addresses from mmap().
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Stephen Fuld wrote:
On 8/4/2025 8:32 AM, John Ames wrote:
[snip]
This notion that the only advantage of a 64-bit architecture is a large
address space is very curious to me. Obviously that's *one* advantage,
but while I don't know the in-the-field history of heavy-duty business/
scientific computing the way some folks here do, I have not gotten the
impression that a lot of customers were commonly running up against the
4 GB limit in the early '90s;
Not exactly the same, but I recall an issue with Windows NT where it
initially divided the 4GB address space in 2 GB for the OS, and 2GB for
users. Some users were "running out of address space", so Microsoft
came up with an option to reduce the OS space to 1 GB, thus allowing up
to 3 GB for users. I am sure others here will know more details.
Any program written to Microsoft/Windows spec would work transparently
with a 3:1 split, the problem was all the programs ported from unix
which assumed that any negative return value was a failure code.
The only interfaces that I recall this being an issue for were
mmap(2) and lseek(2). The latter was really related to maximum
file size (although it applied to /dev/[k]mem and /proc/<pid>/mem
as well). The former was handled by the standard specifying
MAP_FAILED as the return value.
That said, Unix generally defined -1 as the return value for all
other system calls, and code that checked for "< 0" instead of
-1 when calling a standard library function or system call was fundamentally broken.
[snip]
all that said, my initial point about -1 was that applications
should always check for -1 (or MAP_FAILED), not for return
values less than zero. The actual kernel interface to the
C library is clearly implementation dependent although it
must preserve the user-visible required semantics.
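A short illustration of the distinction being drawn here (sketch only):

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {      /* correct: the one documented error value */
        perror("mmap");
        return 1;
    }
    /* Wrong: "if ((long)p < 0)" would misclassify high but perfectly
       valid mapping addresses as failures. */
    printf("mapped at %p\n", p);
    munmap(p, 4096);
    return 0;
}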
Thomas Koenig wrote:
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Unless... maybe somebody (a customer, or they themselves)
discovered that there may have been conditions where they could
only guarantee 80 ns. Maybe a combination of tolerances to one
side and a certain logic programming, and they changed the
data sheet.
Manufacturing process variation leads to timing differences that
testing sorts into speed bins. The faster bins sell at higher price.
By comparison, you could get an eight-input NAND gate with a
maximum delay of 12 ns (the 74H030), so putting two in sequence
to simulate a PLA would have been significantly faster.
I can understand people complaining that PALs were slow.
The 82S100 PLA is logic equivalent to:
- 16 inputs each with an optional input invertor,
Should be free coming from a Flip-Flop.
Depends on what chips you use for registers.
If you want both Q and Qb then you only get 4 FF in a package like 74LS375.
For a wide instruction or stage register I'd look at chips such as a 74LS377 with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock enable, vcc, gnd.
Another point... if you don't need 16 inputs or 8 outputs, you
are also paying a lot more. If you have a 6-bit primary opcode,
you don't need a full 16 bits of input.
I'm just showing why it was more than just an AND gate.
I'm still exploring whether it can be variable length instructions or
has to be fixed 32-bit. In either case all the instruction "code" bits
(as in op code or function code or whatever) should be checked,
even if just to verify that should-be-zero bits are zero.
There would also be instruction buffer Valid bits and other state bits
like Fetch exception detected, interrupt request, that might feed into
a bank of PLA's multiple wide and deep.
In article <MO1nQ.2$Bui1.0@fx10.iad>, Scott Lurndal <slp53@pacbell.net> wrote:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
scott@slp53.sl.home (Scott Lurndal) writes:
For mmap, at least the only documented error return value is
`MAP_FAILED`, and programmers must check for that explicitly.
It strikes me that this implies that the _value_ of `MAP_FAILED`
need not be -1; on x86_64, for instance, it _could_ be any
non-canonical address.
In article <2025Aug13.181010@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
For lseek(2):
| Upon successful completion, lseek() returns the resulting offset
| location as measured in bytes from the beginning of the file.
Given that off_t is signed, lseek(2) can only return positive values.
This is incorrect; or rather, it's accidentally correct now, but
was not previously. The 1990 POSIX standard did not explicitly
forbid a file that was so large that the offset couldn't
overflow, hence why in 1990 POSIX you have to be careful about
error handling when using `lseek`.
It is true that POSIX 2024 _does_ prohibit seeking so far that
the offset would become negative, however.
But, POSIX 2024
(still!!) supports multiple definitions of `off_t` for multiple
environments, in which overflow is potentially unavoidable.
For mmap(2):
| On success, mmap() returns a pointer to the mapped area.
So it's up to the kernel which user-level addresses it returns. E.g.,
32-bit Linux originally only produced user-level addresses below 2GB.
When memories grew larger, on some architectures (e.g., i386) Linux
increased that to 3GB.
The point is that the programmer shouldn't have to care.
Sure, but system calls are first introduced in real kernels using the actual system call interface, and are limited by that interface. And
that interface is remarkably similar between the early days of Unix
and recent Linux kernels for various architectures.
Not precisely. On x86_64, for example, some Unixes use a flag
bit to determine whether the system call failed, and return
(positive) errno values; Linux returns negative numbers to
indicate errors, and constrains those to values between -4095
and -1.
Presumably that specific set of values is constrained by `mmap`:
assuming a minimum 4KiB page size, the last architecturally
valid address where a page _could_ be mapped is equivalent to
-4096 and the first is 0. If they did not have that constraint,
they'd have to treat `mmap` specially in the system call path.
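Making that arithmetic explicit (a small sketch, assuming a 64-bit
machine with 4 KiB pages):

#include <assert.h>

int main(void)
{
    /* Highest possible page base with 4 KiB pages: */
    unsigned long last_page = (unsigned long)-4096L;  /* 0xfffffffffffff000 */
    assert(last_page % 4096 == 0);                    /* page-aligned */
    /* Nothing in the error range -4095..-1 is page-aligned, so those
       values can never collide with a valid mmap() result. */
    assert((unsigned long)-4095L % 4096 != 0);
    assert((unsigned long)-1L % 4096 != 0);
    return 0;
}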
I wonder how the kernel is informed that it can now return more
addresses from mmap().
Assuming you mean the Linux kernel, when it loads an ELF
executable, the binary image itself is "branded" with an ABI
type that it can use to make that determination.
I am pretty sure that in the old times, Linux-i386 indicated failure
by returning a value with the MSB set, and the wrapper just checked
whether the return value was negative.
Bottom line: If Linux-i386 ever had a different way of determining
whether a system call has an error result, it was changed to the
current way early on. Given that IIRC I looked into that later than
in 2000, my memory is obviously not of Linux. I must have looked at
source code for a different system.
In article <2025Aug13.181010@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
For lseek(2):
| Upon successful completion, lseek() returns the resulting offset
| location as measured in bytes from the beginning of the file.
Given that off_t is signed, lseek(2) can only return positive values.
This is incorrect; or rather, it's accidentally correct now, but
was not previously. The 1990 POSIX standard did not explicitly
forbid a file that was so large that the offset couldn't
overflow, hence why in 1990 POSIX you have to be careful about
error handling when using `lseek`.
It is true that POSIX 2024 _does_ prohibit seeking so far that
the offset would become negative, however.
I don't think that this is accidental. In 1990 signed overflow had
reliable behaviour on common 2s-complement hardware with the C
compilers of the day.
Nowadays the exotic hardware where this would
not work that way has almost completely died out (and C is not used on
the remaining exotic hardware),
but now compilers sometimes do funny
things on integer overflow, so better don't go there or anywhere near
it.
But, POSIX 2024
(still!!) supports multiple definitions of `off_t` for multiple environments, in which overflow is potentially unavoidable.
POSIX also has the EOVERFLOW error for exactly that case.
Bottom line: The off_t returned by lseek(2) is signed and always
positive.
For mmap(2):
| On success, mmap() returns a pointer to the mapped area.
So it's up to the kernel which user-level addresses it returns. E.g.,
32-bit Linux originally only produced user-level addresses below 2GB.
When memories grew larger, on some architectures (e.g., i386) Linux
increased that to 3GB.
The point is that the programmer shouldn't have to care.
True, but completely misses the point.
Sure, but system calls are first introduced in real kernels using the actual system call interface, and are limited by that interface. And that interface is remarkably similar between the early days of Unix
and recent Linux kernels for various architectures.
Not precisely. On x86_64, for example, some Unixes use a flag
bit to determine whether the system call failed, and return
(positive) errno values; Linux returns negative numbers to
indicate errors, and constrains those to values between -4095
and -1.
Presumably that specific set of values is constrained by `mmap`:
assuming a minimum 4KiB page size, the last architecturally
valid address where a page _could_ be mapped is equivalent to
-4096 and the first is 0. If they did not have that constraint,
they'd have to treat `mmap` specially in the system call path.
I am pretty sure that in the old times, Linux-i386 indicated failure
by returning a value with the MSB set, and the wrapper just checked
whether the return value was negative. And for mmap() that worked
because user-mode addresses were all below 2GB. Addresses further up
were reserved for the kernel.
I wonder how the kernel is informed that it can now return more
addresses from mmap().
Assuming you mean the Linux kernel, when it loads an ELF
executable, the binary image itself is "branded" with an ABI
type that it can use to make that determination.
I have checked that with binaries compiled in 2003 and 2000:
-rwxr-xr-x 1 root root 44660 Sep 26 2000 /usr/local/bin/gforth-0.5.0*
-rwxr-xr-x 1 root root 92352 Sep 7 2003 /usr/local/bin/gforth-0.6.2*
[~:160080] file /usr/local/bin/gforth-0.5.0
/usr/local/bin/gforth-0.5.0: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, stripped
[~:160081] file /usr/local/bin/gforth-0.6.2
/usr/local/bin/gforth-0.6.2: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, for
GNU/Linux 2.0.0, stripped
So there is actually a difference between these two. However, if I
just strace them as they are now, they both happily produce very high addresses with mmap, e.g.,
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf7f64000
I don't know what the difference is between "for GNU/Linux 2.0.0" and
not having that,
but the addresses produced by mmap() seem unaffected.
However, by calling the binaries with setarch -L, mmap() returns only addresses < 2GB in all calls I have looked at. I guess if I had
statically linked binaries, i.e., with old system call wrappers, I
would have to use
setarch -L <binary>
to make it work properly with mmap(). Or maybe Linux is smart enough
to do it by itself when it encounters a statically-linked old binary.
In article <2025Aug13.232334@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
cross@spitfire.i.gajendra.net (Dan Cross) writes:
In article <2025Aug13.181010@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
For lseek(2):
| Upon successful completion, lseek() returns the resulting offset
| location as measured in bytes from the beginning of the file.
Given that off_t is signed, lseek(2) can only return positive values.
This is incorrect; or rather, it's accidentally correct now, but
was not previously. The 1990 POSIX standard did not explicitly
forbid a file that was so large that the offset couldn't
overflow, hence why in 1990 POSIX you have to be careful about
error handling when using `lseek`.
It is true that POSIX 2024 _does_ prohibit seeking so far that
the offset would become negative, however.
I don't think that this is accidental. In 1990 signed overflow had
reliable behaviour on common 2s-complement hardware with the C
compilers of the day.
This is simply not true. If anything, there was more variety of
hardware supported by C90, and some of those systems were 1's
complement or sign/mag, not 2's complement. Consequently,
signed integer overflow has _always_ had undefined behavior in
ANSI/ISO C.
However, conversion from signed to unsigned has always been
well-defined, and follows effectively 2's complement semantics.
Conversion from unsigned to signed is a bit more complex, and is
implementation defined, but not UB. Given that the system call
interface is necessarily deeply intertwined with the implementation
I see no reason why the semantics of signed overflow should be
an issue here.
Nowadays the exotic hardware where this would
not work that way has almost completely died out (and C is not used on
the remaining exotic hardware),
If by "C is not used" you mean newer editions of the C standard
are not used on very old computers with strange representations
of signed integers, then maybe.
but now compilers sometimes do funny
things on integer overflow, so better don't go there or anywhere near
it.
This isn't about signed overflow. The issue here is conversion
of an unsigned value to signed; almost certainly, the kernel
performs the calculation of the actual file offset using
unsigned arithmetic, and relies on the (assembler, mind you)
system call stubs to map those to the appropriate userspace
type.
I think this is mostly irrelevant, as the system call stub,
almost by necessity, must be written in assembler in order to
have precise control over the use of specific registers and so
on. From C's perspective, a program making a system call just
calls some function that's defined to return a signed integer;
the assembler code that swizzles the register that integer will
be extracted from sets things up accordingly. In other words,
the conversion operation that the C standard mentions isn't at
play, since the code that does the "conversion" is in assembly.
Again from C's perspective the return value of the syscall stub
function is already signed with no need of conversion.
No, for `lseek`, the POSIX rationale explains the reasoning here
quite clearly: the 1990 standard permitted negative offsets, and
programs were expected to accommodate this by special handling
of `errno` before and after calls to `lseek` that returned
negative values. This was deemed onerous and fragile, so they
modified the standard to prohibit calls that would result in
negative offsets.
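For reference, the POSIX.1-1990-era idiom being described looks
roughly like this (sketch only; not needed now that negative offsets
are prohibited):

#include <errno.h>
#include <stdio.h>
#include <unistd.h>

off_t careful_lseek(int fd, off_t off, int whence)
{
    errno = 0;                            /* clear errno first ...          */
    off_t r = lseek(fd, off, whence);
    if (r == (off_t)-1 && errno != 0) {   /* ... so -1 plus errno means error */
        perror("lseek");
        return (off_t)-1;
    }
    return r;   /* under the 1990 rules this could legitimately be negative */
}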
But, POSIX 2024
(still!!) supports multiple definitions of `off_t` for multiple environments, in which overflow is potentially unavoidable.
POSIX also has the EOVERFLOW error for exactly that case.
Bottom line: The off_t returned by lseek(2) is signed and always
positive.
As I said earlier, post POSIX.1-1990, this is true.
For mmap(2):
| On success, mmap() returns a pointer to the mapped area.
So it's up to the kernel which user-level addresses it returns. E.g.,
32-bit Linux originally only produced user-level addresses below 2GB.
When memories grew larger, on some architectures (e.g., i386) Linux
increased that to 3GB.
The point is that the programmer shouldn't have to care.
True, but completely misses the point.
I don't see why. You were talking about the system call stubs,
which run in userspace, and are responsible for setting up state
so that the kernel can perform some requested action on entry,
whether by trap, call gate, or special instruction, and then for
tearing down that state and handling errors on return from the
kernel.
For mmap, there is exactly one value that may be returned from
its stub that indicates an error; any other value, by
definition, represents a valid mapping. Whether such a mapping
falls in the first 2G, 3G, anything except the upper 256MiB, or
some hole in the middle is the part that's irrelevant, and
focusing on that misses the main point: all the stub has to do
is detect the error, using whatever convention the kernel
specifies for communicating such things back to the program, and
ensure that in an error case, MAP_FAILED is returned from the
stub and `errno` is set appropriately. Everything else is
superfluous.
Sure, but system calls are first introduced in real kernels using the actual system call interface, and are limited by that interface. And
and recent Linux kernels for various architectures.
Not precisely. On x86_64, for example, some Unixes use a flag
bit to determine whether the system call failed, and return
(positive) errno values; Linux returns negative numbers to
indicate errors, and constrains those to values between -4095
and -1.
Presumably that specific set of values is constrained by `mmap`:
assuming a minimum 4KiB page size, the last architecturally
valid address where a page _could_ be mapped is equivalent to
-4096 and the first is 0. If they did not have that constraint,
they'd have to treat `mmap` specially in the system call path.
I am pretty sure that in the old times, Linux-i386 indicated failure
by returning a value with the MSB set, and the wrapper just checked
whether the return value was negative. And for mmap() that worked
because user-mode addresses were all below 2GB. Addresses further up were reserved for the kernel.
Define "Linux-i386" in this case. For the kernel, I'm confident
that was NOT the case, and it is easy enough to research, since
old kernel versions are online. Looking at e.g. 0.99.15, one
can see that they set the carry bit in the flags register to
indicate an error, along with returning a negative errno value:
https://kernel.googlesource.com/pub/scm/linux/kernel/git/nico/archive/+/refs/tags/v0.99.15/kernel/sys_call.S
By 2.0, they'd stopped setting the carry bit, though they
continued to clear it on entry.
But remember, `mmap` returns a pointer, not an integer, relying
on libc to do the necessary translation between whatever the
kernel returns and what the program expects. So if the behavior
you describe were in effect anywhere, it would be in libc. Given that
they have, and had, a mechanism for signaling an error
independent of C already, and necessarily the fixup of the
return value must happen in the syscall stub in whatever library
the system used, relying solely on negative values to detect
errors seems like a poor design decision for a C library.
So if what you're saying were true, such a check would have to
be in the userspace library that provides the syscall stubs; the
kernel really doesn't care. I don't know what version libc
Torvalds started with, or if he did his own bespoke thing
initially or something, but looking at some commonly used C
libraries of a certain age, such as glibc 2.0 from 1997-ish, one
can see that they're explicitly testing the error status against
-4095 (as an unsigned value) in the stub (e.g., in sysdeps/unix/sysv/linux/i386/syscall.S).
But glibc-1.06.1 is a different story, and _does_ appear to
simply test whether the return value is negative and then jump
to an error handler if so. So mmap may have worked incidentally
due to the restriction on where in the address space it would
place a mapping in very early kernel versions, as you described,
but that's a library issue, not a kernel issue: again, the
kernel doesn't care.
The old version of libc5 available on kernel.org similarly; it
looks like HJ Lu changed the error handling path to explicitly
compare against -4095 in October of 1996.
So, fixed in the most common libc's used with Linux on i386 for
nearly 30 years, well before the existence of x86_64.
I wonder how the kernel is informed that it can now return more addresses from mmap().
Assuming you mean the Linux kernel, when it loads an ELF
executable, the binary image itself is "branded" with an ABI
type that it can use to make that determination.
I have checked that with binaries compiled in 2003 and 2000:
-rwxr-xr-x 1 root root 44660 Sep 26 2000 /usr/local/bin/gforth-0.5.0*
-rwxr-xr-x 1 root root 92352 Sep 7 2003 /usr/local/bin/gforth-0.6.2*
[~:160080] file /usr/local/bin/gforth-0.5.0
/usr/local/bin/gforth-0.5.0: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, stripped
[~:160081] file /usr/local/bin/gforth-0.6.2
/usr/local/bin/gforth-0.6.2: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, for
GNU/Linux 2.0.0, stripped
So there is actually a difference between these two. However, if I
just strace them as they are now, they both happily produce very high addresses with mmap, e.g.,
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf7f64000
I don't see any reason why it wouldn't.
I don't know what the difference is between "for GNU/Linux 2.0.0" and
not having that,
`file` is pulling that from a `PT_NOTE` segment defined in the
program header for that second file. A better tool for picking
apart the details of those binaries is probably `objdump`.
I'm mildly curious what version of libc those are linked against
(e.g., as reported by `ldd`).
but the addresses produced by mmap() seem unaffected.
I don't see why it would be. Any common libc post 1997-ish
handles errors in a way that permits this to work correctly. If
you tried glibc 1.0, it might be a different story, but the
Linux folks forked that in 1994 and modified it as "Linux libc"
and the
However, by calling the binaries with setarch -L, mmap() returns only addresses < 2GB in all calls I have looked at. I guess if I had
statically linked binaries, i.e., with old system call wrappers, I
would have to use
setarch -L <binary>
to make it work properly with mmap(). Or maybe Linux is smart enough
to do it by itself when it encounters a statically-linked old binary.
Unclear without looking at the kernel source code, but possibly.
`setarch -L` turns on the "legacy" virtual address space layout,
but I suspect that the number of binaries that _actually care_
is pretty small, indeed.
In article <2025Aug13.232334@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
cross@spitfire.i.gajendra.net (Dan Cross) writes:
In article <2025Aug13.181010@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
For lseek(2):
| Upon successful completion, lseek() returns the resulting offset
| location as measured in bytes from the beginning of the file.
Given that off_t is signed, lseek(2) can only return positive values.
This is incorrect; or rather, it's accidentally correct now, but
was not previously. The 1990 POSIX standard did not explicitly
forbid a file that was so large that the offset couldn't
overflow, hence why in 1990 POSIX you have to be careful about
error handling when using `lseek`.
It is true that POSIX 2024 _does_ prohibit seeking so far that
the offset would become negative, however.
I don't think that this is accidental. In 1990 signed overflow had
reliable behaviour on common 2s-complement hardware with the C
compilers of the day.
This is simply not true. If anything, there was more variety of
hardware supported by C90, and some of those systems were 1's
complement or sign/mag, not 2's complement. Consequently,
signed integer overflow has _always_ had undefined behavior in
ANSI/ISO C.
cross@spitfire.i.gajendra.net (Dan Cross) writes:
In article <2025Aug13.232334@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
cross@spitfire.i.gajendra.net (Dan Cross) writes:
In article <2025Aug13.181010@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
For lseek(2):
| Upon successful completion, lseek() returns the resulting offset
| location as measured in bytes from the beginning of the file.
Given that off_t is signed, lseek(2) can only return positive values.
This is incorrect; or rather, it's accidentally correct now, but
was not previously. The 1990 POSIX standard did not explicitly
forbid a file that was so large that the offset couldn't
overflow, hence why in 1990 POSIX you have to be careful about
error handling when using `lseek`.
It is true that POSIX 2024 _does_ prohibit seeking so far that
the offset would become negative, however.
I don't think that this is accidental. In 1990 signed overflow had
reliable behaviour on common 2s-complement hardware with the C
compilers of the day.
This is simply not true. If anything, there was more variety of
hardware supported by C90, and some of those systems were 1's
complement or sign/mag, not 2's complement. Consequently,
signed integer overflow has _always_ had undefined behavior in
ANSI/ISO C.
Both Burroughs Large Systems (48-bit stack machine) and the
Sperry 1100/2200 (36-bit) systems had (have, in emulation today)
C compilers.
The LSI11 uses four 40-pin chips from the MCP-1600 chipset (which is fascinating in itself <https://en.wikipedia.org/wiki/MCP-1600>) for a
total of 160 pins; and it supported only 16 address bits without extra chips. That was certainly even more expensive (and also slower and
less capable) than what I suggest above, but it was several years
earlier, and what I envision was not possible in one chip then.
In article <sknnQ.168942$Bui1.63359@fx10.iad>,
Scott Lurndal <slp53@pacbell.net> wrote:
Both Burroughs Large Systems (48-bit stack machine) and the
Sperry 1100/2200 (36-bit) systems had (have, in emulation today)
C compilers.
Yup. The 1100-series machines were (are) 1's complement. Those
are the ones I usually think of when cursing that signed integer
overflow is UB in C.
I don't think anyone is compiling C23 code for those machines,
but back in the late 1980s, they were still enough of a going
concern that they could influence the emerging C standard. Not
so much anymore.
Regardless, signed integer overflow remains UB in the current C
standard, nevermind definitionally following 2s complement
semantics. Usually this is done on the basis of performance
arguments: some seemingly-important loop optimizations can be
made if the compiler can assert that overflow Cannot Happen.
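A textbook illustration (not from this thread) of the sort of
optimization meant here:

/* Because signed overflow is UB, a compiler may assume "i" never wraps
   past INT_MAX, i.e. that this loop runs exactly n+1 times and
   terminates.  If int wrapped modulo 2^32, the case n == INT_MAX would
   make the loop infinite, blocking trip-count and vectorization
   transforms that depend on knowing the iteration count. */
void scale(float *a, int n, float s)
{
    for (int i = 0; i <= n; i++)
        a[i] *= s;
}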
And of course, even today, C still targets oddball platforms
like DSPs and custom chips, where assumptions about the ubiquity
of 2's comp may not hold.
The LSI11 uses four 40-pin chips from the MCP-1600 chipset (which is fascinating in itself <https://en.wikipedia.org/wiki/MCP-1600>) for a total of 160 pins; and it supported only 16 address bits without extra chips. That was certainly even more expensive (and also slower and
less capable) than what I suggest above, but it was several years
earlier, and what I envision was not possible in one chip then.
Maybe compare 808x to something more in its weight class? The 8-bit
8080 was 1974, 16-bit 8086 1978, 16/8-bit 8088 1979.
The DEC F-11 (~1979) and J-11 (~1982) microprocessor designs were
capable of 22 bit addressing on a single 40-pin carrier.
The DEC F-11 (~1979) and J-11 (~1982) microprocessor designs were capable of 22 bit addressing on a single 40-pin carrier.
On 14.08.2025 17:44, Dan Cross wrote:
In article <sknnQ.168942$Bui1.63359@fx10.iad>,
Scott Lurndal <slp53@pacbell.net> wrote:
Both Burroughs Large Systems (48-bit stack machine) and the
Sperry 1100/2200 (36-bit) systems had (have, in emulation today)
C compilers.
Yup. The 1100-series machines were (are) 1's complement. Those
are the ones I usually think of when cursing that signed integer
overflow is UB in C.
I don't think anyone is compiling C23 code for those machines,
but back in the late 1980s, they were still enough of a going
concern that they could influence the emerging C standard. Not
so much anymore.
They would presumably have been part of the justification for supporting multiple signed integer formats at the time.
UB on signed integer
arithmetic overflow is a different matter altogether.
Regardless, signed integer overflow remains UB in the current C
standard, nevermind definitionally following 2s complement
semantics. Usually this is done on the basis of performance
arguments: some seemingly-important loop optimizations can be
made if the compiler can assert that overflow Cannot Happen.
The justification for "signed integer arithmetic overflow is UB" is in
the C standards 6.5p5 under "Expressions" :
"""
If an exceptional condition occurs during the evaluation of an
expression (that is, if the result is not mathematically defined or not
in the range of representable values for its type), the behavior is undefined.
"""
It actually has absolutely nothing to do with signed integer
representation, or machine hardware.
It doesn't even have much to do
with integers at all. It is simply that if the calculation can't give a
correct answer, then the C standards don't say anything about the
results or effects.
The point is that when the results of an integer computation are
too big, there is no way to get the correct answer in the types used.
Two's complement wrapping is /not/ correct. If you add two real-world
positive integers, you don't get a negative integer.
And of course, even today, C still targets oddball platforms
like DSPs and custom chips, where assumptions about the ubiquity
of 2's comp may not hold.
Modern C and C++ standards have dropped support for signed integer
representation other than two's complement, because they are not in use
in any modern hardware (including any DSP's) - at least, not for
general-purpose integers. Both committees have consistently voted to
keep overflow as UB.
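Code that wants a defined answer on overflow has to ask for it
explicitly; a hedged sketch using the GCC/Clang builtin (C23 offers the
analogous ckd_add() in <stdckdint.h>):

#include <limits.h>
#include <stdio.h>

int main(void)
{
    int a = INT_MAX, b = 1, sum;
    if (__builtin_add_overflow(a, b, &sum))   /* GCC/Clang extension */
        puts("overflow: result not representable in int");
    else
        printf("sum = %d\n", sum);
    return 0;
}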
antispam@fricas.org (Waldek Hebisch) writes:
VAX-780 architecture handbook says cache was 8 KB and used 8-byte
lines. So extra 12KB of fast RAM could double cache size.
That would be nice improvement, but not as dramatic as increase
from 2 KB to 12 KB.
The handbook is: https://ia903400.us.archive.org/26/items/bitsavers_decvaxhandHandbookVol11977_10941546/VAX_Architecture_Handbook_Vol1_1977_text.pdf
The cache is indeed 8KB in size, two-way set associative and write-through.
Section 2.7 also mentions an 8-byte instruction buffer, and that the instruction fetching happens concurrently with the microcoded execution. So here we have a little bit of pipelining.
Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
have "typically 97% hit rate". I would go for larger pages, which
would reduce the TLB miss rate.
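The arithmetic behind that suggestion, assuming the 128-entry TLB
itself stays the same size and the VAX's 512-byte pages as the
baseline:

\[
\text{TLB reach} = 128 \times 512\,\text{B} = 64\,\text{KiB},
\qquad
128 \times 4\,\text{KiB} = 512\,\text{KiB}.
\]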
In article <107b1bu$252qo$1@dont-email.me>,
Programming a RISC in assembler is not so hard, at least in my
experience. Plus, people overestimated use of assembler even in
the mid-1970s, and underestimated the use of compilers.
[...]
They certainly did! I'm not saying that they're right; I'm
saying that business needs must have, at least in part,
influenced the ISA design. That is, while mistaken, it was part
of the business decision process regardless.
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <107b1bu$252qo$1@dont-email.me>,
Programming a RISC in assembler is not so hard, at least in my
experience. Plus, people overestimated use of assembler even in
the mid-1970s, and underestimated the use of compilers.
[...]
They certainly did! I'm not saying that they're right; I'm
saying that business needs must have, at least in part,
influenced the ISA design. That is, while mistaken, it was part
of the business decision process regardless.
It's not clear to me what the distinction of technical vs. business
is supposed to be in the context of ISA design. Could you explain?
In article <107mf9l$u2si$1@dont-email.me>,
Thomas Koenig <tkoenig@netcologne.de> wrote:
Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
In article <107b1bu$252qo$1@dont-email.me>,
Programming a RISC in assembler is not so hard, at least in my
experience. Plus, people overestimated use of assembler even in the
mid-1970s, and underestimated the use of compilers.
[...]
They certainly did! I'm not saying that they're right; I'm saying
that business needs must have, at least in part, influenced the ISA
design. That is, while mistaken, it was part of the business decision
process regardless.
It's not clear to me what the distinction of technical vs. business is supposed to be in the context of ISA design. Could you explain?
I can attempt to, though I'm not sure if I can be successful.
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
antispam@fricas.org (Waldek Hebisch) writes:
VAX-780 architecture handbook says cache was 8 KB and used 8-byte
lines. So extra 12KB of fast RAM could double cache size.
That would be nice improvement, but not as dramatic as increase
from 2 KB to 12 KB.
The handbook is:
https://ia903400.us.archive.org/26/items/bitsavers_decvaxhandHandbookVol11977_10941546/VAX_Architecture_Handbook_Vol1_1977_text.pdf
The cache is indeed 8KB in size, two-way set associative and write-through.
Section 2.7 also mentions an 8-byte instruction buffer, and that the
instruction fetching happens concurrently with the microcoded
execution. So here we have a little bit of pipelining.
Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
have "typically 97% hit rate". I would go for larger pages, which
would reduce the TLB miss rate.
I think that in 1979 the VAX's 512-byte page was close to optimal.
Namely, IIUC smallest supported configuration was 128 KB RAM.
That gives 256 pages, enough for sophisticated system with
fine-grained access control. Bigger pages would reduce
number of pages. For example 4 KB pages would mean 32 pages
in minimal configuration, significantly reducing the usefulness of
such machine.
According to <aph@littlepinkcloud.invalid>:
In comp.arch BGB <cr88192@gmail.com> wrote:
Also, IIRC, the major point of X32 was that it would narrow pointers and similar back down to 32 bits, requiring special versions of any shared
libraries or similar.
But, it is unattractive to have both 32 and 64 bit versions of all the SO's.
We have done something similar for years at Red Hat: not X32, but
x86_32, and it was pretty easy. If you're building a 32-bit OS anyway
(which we were) all you have to do is copy all 32-bit libraries from
one repo to the other.
FreeBSD does the same thing. The 32 bit libraries are installed by default on 64 bit systems because, by current standards, they're not very big.
I've stopped installing them because I know I don't have any 32 bit apps
left but on systems with old packages, who knows?
In article <107l5ju$k78a$1@dont-email.me>,
David Brown <david.brown@hesbynett.no> wrote:
On 14.08.2025 17:44, Dan Cross wrote:
In article <sknnQ.168942$Bui1.63359@fx10.iad>,
Scott Lurndal <slp53@pacbell.net> wrote:
Both Burroughs Large Systems (48-bit stack machine) and the
Sperry 1100/2200 (36-bit) systems had (have, in emulation today)
C compilers.
Yup. The 1100-series machines were (are) 1's complement. Those
are the ones I usually think of when cursing that signed integer
overflow is UB in C.
I don't think anyone is compiling C23 code for those machines,
but back in the late 1980s, they were still enough of a going
concern that they could influence the emerging C standard. Not
so much anymore.
They would presumably have been part of the justification for supporting
multiple signed integer formats at the time.
C90 doesn't have much to say about this at all, other than
saying that the actual representation and ranges of the integer
types are implementation defined (G.3.5 para 1).
C90 does say that, "The representations of integral types shall
define values by use of a pure binary numeration system" (sec
6.1.2.5).
C99 tightens this up and talks about 2's comp, 1's comp, and
sign/mag as being the permissible representations (J.3.5, para
1).
UB on signed integer
arithmetic overflow is a different matter altogether.
I disagree.
Regardless, signed integer overflow remains UB in the current C
standard, nevermind definitionally following 2s complement
semantics. Usually this is done on the basis of performance
arguments: some seemingly-important loop optimizations can be
made if the compiler can assert that overflow Cannot Happen.
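To make the loop-optimization argument concrete, here is a minimal C
sketch (my own illustration, not code from the thread) of the kind of
transformation that relies on signed overflow being UB:

/* The compiler wants to rewrite a[i*4] into a pointer that advances by
   four elements per iteration, and to keep i in a 64-bit register.  If
   i*4 could wrap modulo 2^32, the rewritten pointer walk and the
   original indexing would diverge; because signed overflow is UB, the
   compiler may assume it cannot wrap and do the strength reduction. */
long sum_strided(const int *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i * 4];
    return s;
}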
The justification for "signed integer arithmetic overflow is UB" is in
the C standards 6.5p5 under "Expressions" :
Not in ANSI/ISO 9899-1990. In that revision of the standard,
sec 6.5 covers declarations.
"""
If an exceptional condition occurs during the evaluation of an
expression (that is, if the result is not mathematically defined or not
in the range of representable values for its type), the behavior is
undefined.
"""
In C90, this language appears in sec 6.3 para 5. Note, however,
that they do not define what an exception _is_, only a few
things that _may_ cause one. See below.
It actually has absolutely nothing to do with signed integer
representation, or machine hardware.
Consider this language from the (non-normative) example 4 in sec
5.1.2.3:
|On a machine in which overflows produce an exception and in
|which the range of values representable by an *int* is
|[-32768,+32767], the implementation cannot rewrite this
|expression as [continues with the specifics of the example]....
That seems pretty clear that they're thinking about machines
that actually generate a hardware trap of some kind on overflow.
It doesn't even have much to do
with integers at all. It is simply that if the calculation can't give a
correct answer, then the C standards don't say anything about the
results or effects.
The point is that when the results of an integer computation are
too big, there is no way to get the correct answer in the types used.
Two's complement wrapping is /not/ correct. If you add two real-world
positive integers, you don't get a negative integer.
Sorry, but I don't buy this argument as anything other than a
justification after the fact. We're talking about history and
motivation here, not the behavior described in the standard.
In particular, C is a programming language for actual machines,
not a mathematical notation; the language is free to define the
behavior of arithmetic expressions in any way it chooses, though
one presumes it would do so in a way that makes sense for the
machines that it targets.
Thus, it could have formalized the
result of signed integer overflow to follow 2's complement
semantics had the committee so chosen, in which case the result
would not be "incorrect", it would be well-defined with respect
to the semantics of the language. Java, for example, does this,
as does C11 (and later) atomic integer operations. Indeed, the
C99 rationale document makes frequent reference to twos
complement, where overflow and modular behavior are frequently
equivalent, being the common case. But aside from the more
recent atomics support, C _chose_ not to do this.
Also, consider that _unsigned_ arithmetic is defined as having
wrap-around semantics similar to modular arithmetic, and thus
incapable of overflow.
But that's simply a fiction invented for
the abstract machine described informally in the standard: it
requires special handling on machines like the 1100 series,
because those machines might trap on overflow. The C committee
could just as well have said that the unsigned arithmetic
_could_ overflow and that the result was UB.
So why did C chose this way? The only logical reason is that
there were machines at the time where a) integer overflow
caused machine exceptions, and b) the representation of signed
integers was not well-defined, so that the actual value
resulting from overflow could not be rigorously defined. Given
that C90 mandated a binary representation for integers and so
the representation of unsigned integers is basically common,
there was no need to do that for unsigned arithmetic.
And of course, even today, C still targets oddball platforms
like DSPs and custom chips, where assumptions about the ubiquity
of 2's comp may not hold.
Modern C and C++ standards have dropped support for signed integer
representation other than two's complement, because they are not in use
in any modern hardware (including any DSP's) - at least, not for
general-purpose integers. Both committees have consistently voted to
keep overflow as UB.
Yes. As I said, performance is often the justification.
I'm not convinced that there are no custom chips and/or DSPs
that are not manufactured today. They may not be common, their
mere existence is certainly dumb and offensive, but that does
not mean that they don't exist. Note that the survey in, e.g., https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2218.htm
only mentions _popular_ DSPs, not _all_ DSPs.
Of course, if such machines exist, I will certainly concede that
I doubt very much that anyone is targeting them with C code
written to a modern standard.
David Brown <david.brown@hesbynett.no> schrieb:
The point is that there when the results of an integer computation are
too big, there is no way to get the correct answer in the types used.
Two's complement wrapping is /not/ correct. If you add two real-world
positive integers, you don't get a negative integer.
I believe it was you who wrote "If you add enough apples to a
pile, the number of apples becomes negative", so there is
clearly a defined physical meaning to overflow.
:-)
Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
have "typically 97% hit rate". I would go for larger pages, which
would reduce the TLB miss rate.
I think that in 1979 VAX 512 bytes page was close to optimal. ...
One must also consider that the disks in that era were
fairly small, and 512 bytes was a common sector size.
Convenient for both swapping and loading program text
without wasting space on the disk by clustering
pages in groups of 2, 4 or 8.
According to Scott Lurndal <slp53@pacbell.net>:
Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
have "typically 97% hit rate". I would go for larger pages, which
would reduce the TLB miss rate.
I think that in 1979 VAX 512 bytes page was close to optimal. ...
One must also consider that the disks in that era were
fairly small, and 512 bytes was a common sector size.
Convenient for both swapping and loading program text
without wasting space on the disk by clustering
pages in groups of 2, 4 or 8.
That's probably it but even at the time the pages seemed rather small.
Pages on the PDP-10 were 512 words which was about 2K bytes.
On 14.08.2025 23:44, Dan Cross wrote:
In article <107l5ju$k78a$1@dont-email.me>,
David Brown <david.brown@hesbynett.no> wrote:
[snip]
UB on signed integer
arithmetic overflow is a different matter altogether.
I disagree.
You have overflow when the mathematical result of an operation cannot be
expressed accurately in the type - regardless of the representation
format for the numbers. Your options, as a language designer or
implementer, for handling the overflow are the same regardless of the
representation. You can pick a fixed value to return, or saturate, or
invoke some kind of error handler mechanism, or return a "don't care"
unspecified value of the type, or perform a specified algorithm to get a
representable value (such as reduction modulo 2^n), or you can simply
say the program is broken if this happens (it is UB).
I don't see where the representation comes into it - overflow is a
matter of values and the ranges that can be stored in a type, not how
those values are stored in the bits of the data.
Regardless, signed integer overflow remains UB in the current C
standard, nevermind definitionally following 2s complement
semantics. Usually this is done on the basis of performance
arguments: some seemingly-important loop optimizations can be
made if the compiler can assert that overflow Cannot Happen.
The justification for "signed integer arithmetic overflow is UB" is in
the C standards 6.5p5 under "Expressions" :
Not in ANSI/ISO 9899-1990. In that revision of the standard,
sec 6.5 covers declarations.
"""
If an exceptional condition occurs during the evaluation of an
expression (that is, if the result is not mathematically defined or not
in the range of representable values for its type), the behavior is
undefined.
"""
In C90, this language appears in sec 6.3 para 5. Note, however,
that they do not define what an exception _is_, only a few
things that _may_ cause one. See below.
It's basically the same in C90 onwards, with just small changes to the
wording. And it /does/ define what is meant by an "exceptional
condition" (or just "exception" in C90) - that is done by the part in
parentheses.
It actually has absolutely nothing to do with signed integer
representation, or machine hardware.
Consider this language from the (non-normative) example 4 in sec
5.1.2.3:
|On a machine in which overflows produce an exception and in
|which the range of values representable by an *int* is
|[-32768,+32767], the implementation cannot rewrite this
|expression as [continues with the specifics of the example]....
That seems pretty clear that they're thinking about machines
that actually generate a hardware trap of some kind on overflow.
They are thinking about that possibility, yes. In C90, the term
"exception" here was not clearly defined - and it is definitely not the
same as the term "exception" in 6.3p5. The wording was improved in C99
without changing the intended meaning - there the term in the paragraph
under "Expressions" is "exceptional condition" (defined in that
paragraph), while in the example in "Execution environments", it says
"On a machine in which overflows produce an explicit trap". (C11
further clarifies what "performs a trap" means.)
But this is about re-arrangements the compiler is allowed to make, or
barred from making - it can't make re-arrangements that would mean
execution failed when the direct execution of the code according to the
C abstract machine would have worked correctly (without ever having
encountered an "exceptional condition" or other UB). Representation is
not relevant here - there is nothing about two's complement, ones'
complement, sign-magnitude, or anything else. Even the machine hardware
is not actually particularly important, given that most processors
support non-trapping integer arithmetic instructions, and for those that
don't have explicit trap instructions, a compiler could generate "jump
if overflow flag set" or similar instructions to emulate traps
reasonably efficiently. (Many compilers support that kind of thing as
an option to aid debugging.)
It doesn't even have much to do
with integers at all. It is simply that if the calculation can't give a
correct answer, then the C standards don't say anything about the
results or effects.
The point is that when the results of an integer computation are
too big, there is no way to get the correct answer in the types used.
Two's complement wrapping is /not/ correct. If you add two real-world
positive integers, you don't get a negative integer.
Sorry, but I don't buy this argument as anything other than a
justification after the fact. We're talking about history and
motivation here, not the behavior described in the standard.
It is a fair point that I am describing a rational and sensible reason
for UB on arithmetic overflow - and I do not know the motivation of the
early C language designers, compiler implementers, and authors of the
first C standard.
I do know, however, that the principle of "garbage in, garbage out" was
well established long before C was conceived. And programmers of that
time were familiar with the concept of functions and operations being
defined for appropriate inputs, and having no defined behaviour for
invalid inputs. C is full of other things where behaviour is left
undefined when no sensible correct answer can be specified, and that is
not just because the behaviour of different hardware could vary. It
seems perfectly reasonable to me to suppose that signed integer
arithmetic overflow is just another case, no different from
dereferencing an invalid pointer, dividing by zero, or any one of the
other UB's in the standards.
In particular, C is a programming language for actual machines,
not a mathematical notation; the language is free to define the
behavior of arithmetic expressions in any way it chooses, though
one presumes it would do so in a way that makes sense for the
machines that it targets.
Yes, that is true. It is, however, also important to remember that it
was based on a general abstract machine, not any particular hardware,
and that the operations were intended to follow standard mathematics as
well as practically possible - operations and expressions in C were not
designed for any particular hardware. (Though some design choices were
biased by particular hardware.)
Thus, it could have formalized the
result of signed integer overflow to follow 2's complement
semantics had the committee so chosen, in which case the result
would not be "incorrect", it would be well-defined with respect
to the semantics of the language. Java, for example, does this,
as does C11 (and later) atomic integer operations. Indeed, the
C99 rationale document makes frequent reference to twos
complement, where overflow and modular behavior are frequently
equivalent, being the common case. But aside from the more
recent atomics support, C _chose_ not to do this.
It could have made signed integer overflow defined behaviour, but it did
not. The C standards committee have explicitly chosen not to do that,
even after deciding that two's complement is the only supported
representation for signed integers in C23 onwards. It is fine to have
two's complement representation, and fine to have modulo arithmetic in
some circumstances, while leaving other arithmetic overflow undefined.
Unsigned integer operations in C have always been defined as modulo
arithmetic - addition of unsigned values is a different operation from
addition of signed values. Having some modulo behaviour does not in any
way imply that signed arithmetic should be modulo.
In Java, the language designers decided that integer arithmetic
operations would be modulo operations. Wrapping therefore gives the
correct answer for those operations - it does not give the correct
answer for mathematical integer operations. And Java loses common
mathematical identities which C retains - such as the identity that
adding a positive integer to another integer will increase its value.
Something always has to be lost when approximating unbounded
mathematical integers in a bounded implementation - I think C made the
right choices here about what to keep and what to lose, and Java made
the wrong choices. (Others may of course have different opinions.)
In Zig, unsigned integer arithmetic overflow is also UB as these
operations are not defined as modulo. I think that is a good natural
choice too - but it is useful for a language to have a way to do
wrapping arithmetic on the occasions you need it.
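When a C programmer does want wrapping or checked signed arithmetic
explicitly, the usual idioms look something like the sketch below (my
own illustration; __builtin_add_overflow is a GCC/Clang extension, not
standard C):

#include <stdint.h>
#include <stdbool.h>

/* Wrapping signed add: do the addition in unsigned arithmetic, which is
   defined as modulo 2^32, then convert back.  The conversion of an
   out-of-range value to int32_t is implementation-defined in C17 and
   earlier, but wraps on every mainstream compiler. */
int32_t add_wrap(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)a + (uint32_t)b);
}

/* Checked add: report overflow instead of invoking UB. */
bool add_checked(int32_t a, int32_t b, int32_t *result)
{
    return !__builtin_add_overflow(a, b, result);   /* true = no overflow */
}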
Also, consider that _unsigned_ arithmetic is defined as having
wrap-around semantics similar to modular arithmetic, and thus
incapable of overflow.
Yes. Unsigned arithmetic operations are different operations from
signed arithmetic operations in C.
But that's simply a fiction invented for
the abstract machine described informally in the standard: it
requires special handling on machines like the 1100 series,
because those machines might trap on overflow. The C committee
could just as well have said that the unsigned arithmetic
_could_ overflow and that the result was UB.
They could have done that (as the Zig folk did).
So why did C chose this way? The only logical reason is that
there were machines at the time where a) integer overflow
caused machine exceptions, and b) the representation of signed
integers was not well-defined, so that the actual value
resulting from overflow could not be rigorously defined. Given
that C90 mandated a binary representation for integers and so
the representation of unsigned integers is basically common,
there was no need to do that for unsigned arithmetic.
Not at all. Usually when someone says "the only logical reason is...",
they really mean "the only logical reason /I/ can think of is...", or
"the only reason that /I/ can think of that /I/ think is logical is...".
For a language that can be used as a low-level systems language, it is
important to be able to do modulo arithmetic efficiently. It is needed
for a number of low-level tasks, including the implementation of large
arithmetic operations, handling timers, counters, and other bits and
pieces. So it was definitely a useful thing to have in C.
For a language that can be used as a fast and efficient application
language, it must have a reasonable approximation to mathematical
integer arithmetic. Implementations should not be forced to have
behaviours beyond the mathematically sensible answers - if a calculation
can't be done correctly, there's no point in doing it. Giving nonsense
results does not help anyone - C programmers or toolchain implementers,
so the language should not specify any particular result. More sensible
defined overflow behaviour - saturation, error values, language
exceptions or traps, etc., would be very inefficient on most hardware.
So UB is the best choice - and implementations can do something
different if they like.
Too many options make a language bigger - harder to implement, harder to
learn, harder to use. So it makes sense to have modulo arithmetic for
unsigned types, and normal arithmetic for signed types.
I am not claiming to know that this is the reasoning made by the C
language pioneers. But it is definitely an alternative logical reason
for C being the way it is.
And of course, even today, C still targets oddball platforms
like DSPs and custom chips, where assumptions about the ubiquity
of 2's comp may not hold.
Modern C and C++ standards have dropped support for signed integer
representation other than two's complement, because they are not in use
in any modern hardware (including any DSP's) - at least, not for
general-purpose integers. Both committees have consistently voted to
keep overflow as UB.
Yes. As I said, performance is often the justification.
I'm not convinced that there are no custom chips and/or DSPs
that are not manufactured today. They may not be common, their
mere existence is certainly dumb and offensive, but that does
not mean that they don't exist. Note that the survey in, e.g.,
https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2218.htm
only mentions _popular_ DSPs, not _all_ DSPs.
I think you might have missed a few words in that paragraph, but I
believe I know what you intended. There are certainly DSPs and other
cores that have strong support for alternative overflow behaviour -
saturation is very common in DSPs, and it is also common to have a
"sticky overflow" flag so that you can do lots of calculations in a
tight loop, and check for problems once you are finished. I think it is
highly unlikely that you'll find a core with something other than two's
complement as the representation for signed integer types, though I
can't claim that I know /all/ devices! (I do know a bit about more
cores than would be considered popular or common.)
Of course, if such machines exist, I will certainly concede that
I doubt very much that anyone is targeting them with C code
written to a modern standard.
Modern C is definitely used on DSPs with strong saturation support.
(Even ARM cores have saturated arithmetic instructions.) But they can
also handle two's complement wrapped signed integer arithmetic if the
programmer wants that - after all, it's exactly the same in the hardware
as modulo unsigned arithmetic (except for division). That doesn't mean
that wrapping signed integer overflow is useful or desired behaviour.
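For readers who have not met it, saturation looks roughly like this
when spelled out in portable C (my own sketch; real DSP toolchains
usually expose it through intrinsics or fixed-point _Sat types rather
than code like this):

#include <stdint.h>

/* Saturating 32-bit add: clamp to the representable range instead of
   wrapping or trapping.  The intermediate is 64-bit, so no signed
   overflow can occur inside the function itself. */
int32_t add_sat32(int32_t a, int32_t b)
{
    int64_t s = (int64_t)a + (int64_t)b;
    if (s > INT32_MAX) return INT32_MAX;
    if (s < INT32_MIN) return INT32_MIN;
    return (int32_t)s;
}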
On 8/15/2025 11:53 AM, John Levine wrote:
According to Scott Lurndal <slp53@pacbell.net>:
Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
have "typically 97% hit rate". I would go for larger pages, which
would reduce the TLB miss rate.
I think that in 1979 VAX 512 bytes page was close to optimal. ...
One must also consider that the disks in that era were
fairly small, and 512 bytes was a common sector size.
Convenient for both swapping and loading program text
without wasting space on the disk by clustering
pages in groups of 2, 4 or 8.
That's probably it but even at the time the pages seemed rather small.
Pages on the PDP-10 were 512 words which was about 2K bytes.
Yeah.
Can note in some of my own testing, I tested various page sizes, and seemingly found a local optimum at around 16K.
Where, going from 4K or 8K to 16K sees a reduction in TLB miss rates,
but 16K to 32K or 64K did not see any significant reduction; but did see
a more significant increase in memory footprint due to allocation
overheads (where, OTOH, going from 4K to 16K pages does not see much increase in memory footprint).
Patterns seemed consistent across multiple programs tested, but harder
to say if this pattern would be universal.
Had noted if running stats on where in the pages memory accesses land:
4K: Pages tend to be accessed fairly evenly
16K: Minor variation as to what parts of the page are being used.
64K: Significant variation between parts of the page.
Basically, tracking per-page memory accesses on a finer grain boundary
(eg, 512 bytes).
Say, for example, at 64K one part of the page may be being accessed
readily but another part of the page isn't really being accessed at all
(and increasing page size only really sees benefit for TLB miss rate so
long as the whole page is "actually being used").
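The kind of bookkeeping described above can be approximated from an
address trace with something like the following (a rough sketch under
my own assumptions about the trace format, not BGB's actual tooling):

#include <stdint.h>
#include <stdlib.h>

/* Count distinct pages touched by a trace of byte addresses for a given
   page size - a crude proxy for TLB footprint.  Running it with
   page_bits = 12, 14 and 16 (4K, 16K, 64K) shows how much a larger page
   actually buys.  Note: modifies addrs in place, so pass a copy for
   each page size. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

static size_t distinct_pages(uint64_t *addrs, size_t n, unsigned page_bits)
{
    if (n == 0)
        return 0;
    for (size_t i = 0; i < n; i++)
        addrs[i] >>= page_bits;          /* byte address -> page number */
    qsort(addrs, n, sizeof addrs[0], cmp_u64);
    size_t count = 1;
    for (size_t i = 1; i < n; i++)
        count += (addrs[i] != addrs[i - 1]);
    return count;
}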
On 8/15/2025 11:19 AM, BGB wrote:
On 8/15/2025 11:53 AM, John Levine wrote:
According to Scott Lurndal <slp53@pacbell.net>:
Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
have "typically 97% hit rate". I would go for larger pages, which
would reduce the TLB miss rate.
I think that in 1979 VAX 512 bytes page was close to optimal. ...
One must also consider that the disks in that era were
fairly small, and 512 bytes was a common sector size.
Convenient for both swapping and loading program text
without wasting space on the disk by clustering
pages in groups of 2, 4 or 8.
That's probably it but even at the time the pages seemed rather small.
Pages on the PDP-10 were 512 words which was about 2K bytes.
Yeah.
Can note in some of my own testing, I tested various page sizes, and
seemingly found a local optimum at around 16K.
I think that is consistent with what some others have found. I suspect
the average page size should grow as memory gets cheaper, which leads to
more memory on average in systems. This also leads to larger programs,
as they can "fit" in larger memory with less paging. And as disks
(spinning or SSD) get faster transfer rates, the cost (in time) of
paging a larger page goes down. While 4K was the sweet spot some
decades ago, I think it has increased, probably to 16K. At some point
in the future, it may get to 64K, but not for some years yet.
Say, for example, at 64K one part of the page may be being accessed
readily but another part of the page isn't really being accessed at all
(and increasing page size only really sees benefit for TLB miss rate so
long as the whole page is "actually being used").
Not necessarily. Consider the case of a 16K (or larger) page with two
"hot spots" that are more than 4K apart. That takes 2 TLB slots with
4K pages, but only one with larger pages.
ARM64 (ARMv8) architecturally supports 4k, 16k and 64k.
These days it doesn't make much sense to have pages smaller than 4K since >that's the block size on most disks.
John Levine <johnl@taugh.com> writes:
These days it doesn't make much sense to have pages smaller than 4K since
that's the block size on most disks.
Two block devices bought less than a year ago:
Disk model: KINGSTON SEDC2000BM8960G
Disk model: WD Blue SN580 2TB
SSDs often let you do 512 byte reads and writes for backward compatibility even
though the physical block size is much larger.
Disk model: WD Blue SN580 2TB
I can't find anything on its internal structure but I see the vendor's
random read/write benchmarks all use 4K blocks so that's probably the
internal block size.
On 8/15/2025 11:19 AM, BGB wrote:
On 8/15/2025 11:53 AM, John Levine wrote:
According to Scott Lurndal <slp53@pacbell.net>:
Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
have "typically 97% hit rate". I would go for larger pages, which
would reduce the TLB miss rate.
I think that in 1979 VAX 512 bytes page was close to optimal. ...
One must also consider that the disks in that era were
fairly small, and 512 bytes was a common sector size.
Convenient for both swapping and loading program text
without wasting space on the disk by clustering
pages in groups of 2, 4 or 8.
That's probably it but even at the time the pages seemed rather small.
Pages on the PDP-10 were 512 words which was about 2K bytes.
Yeah.
Can note in some of my own testing, I tested various page sizes, and
seemingly found a local optimum at around 16K.
I think that is consistent with what some others have found. I suspect
the average page size should grow as memory gets cheaper, which leads to more memory on average in systems. This also leads to larger programs,
as they can "fit" in larger memory with less paging. And as disks
(spinning or SSD) get faster transfer rates, the cost (in time) of
paging a larger page goes down. While 4K was the sweet spot some
decades ago, I think it has increased, probably to 16K. At some point
in the future, it may get to 64K, but not for some years yet.
Where, going from 4K or 8K to 16K sees a reduction in TLB miss rates,
but 16K to 32K or 64K did not see any significant reduction; but did
see a more significant increase in memory footprint due to allocation
overheads (where, OTOH, going from 4K to 16K pages does not see much
increase in memory footprint).
Patterns seemed consistent across multiple programs tested, but harder
to say if this pattern would be universal.
Had noted if running stats on where in the pages memory accesses land:
4K: Pages tend to be accessed fairly evenly
16K: Minor variation as to what parts of the page are being used.
64K: Significant variation between parts of the page.
Basically, tracking per-page memory accesses on a finer grain boundary
(eg, 512 bytes).
Interesting.
Say, for example, at 64K one part of the page may be being accessed
readily but another part of the page isn't really being accessed at
all (and increasing page size only really sees benefit for TLB miss
rate so long as the whole page is "actually being used").
Not necessarily. Consider the case of a 16K (or larger) page with two
"hot spots" that are more than 4K apart. That takes 2 TLB slots with 4K pages, but only one with larger pages.
John Levine <johnl@taugh.com> writes:
SSDs often let you do 512 byte reads and writes for backward compatibility even
though the physical block size is much larger.
Yes. But if the argument had any merit that 512B is a good page size
because it avoids having to transfer 8, 16, or 32 sectors at a time,
it would still have merit, because the interface still shows 512B
sectors.
EricP <ThatWouldBeTelling@thevillage.com> writes:
EricP wrote:
Signetics 82S100/101 Field Programmable Logic Array FPLA (an AND-OR matrix)
were available in 1975. Mask programmable PLAs were available from TI
circa 1970 but masks would be too expensive.
If I was building a TTL risc cpu in 1975 I would definitely be using
lots of FPLA's, not just for decode but also state machines in fetch,
page table walkers, cache controllers, etc.
The question isn't could one build a modern risc-style pipelined cpu
from TTL in 1975 - of course one could. Nor do I see any question of
could it beat a VAX 780 0.5 MIPS at 5 MHz - of course it could, easily.
I'm pretty sure I could use my Mk-I risc ISA and build a 5 stage pipeline
running at 5 MHz getting 1 IPC sustained when hitting the 200 ns cache
(using some in-order superscalar ideas and two reg file write ports
to "catch up" after pipeline bubbles).
TTL risc would also be much cheaper to design and prototype.
VAX took hundreds of people many many years.
The question is could one build this at a commercially competitive price?
There is a reason people did things sequentially in microcode.
All those control decisions that used to be stored as bits in microcode now >> become real logic gates. And in SSI TTL you don't get many to the $.
And many of those sequential microcode states become independent concurrent >> state machines, each with its own logic sequencer.
I am confused. You gave a possible answer in the posting you are
replying to.
Concerning page table walker: The MIPS R2000 just has a TLB and traps
on a TLB miss, and then does the table walk in software. While that's
not a solution that's appropriate for a wide superscalar CPU, it was
good enough for beating the actual VAX 11/780 by a good margin; at
some later point, you would implement the table walker in hardware,
but probably not for the design you do in 1975.
- anton
It is maybe pushing it a little if one wants to use an AVL-tree or
B-Tree for virtual memory vs a page-table
On 8/7/2025 6:38 AM, Anton Ertl wrote:
Concerning page table walker: The MIPS R2000 just has a TLB and traps
on a TLB miss, and then does the table walk in software. While that's
not a solution that's appropriate for a wide superscalar CPU, it was
good enough for beating the actual VAX 11/780 by a good margin; at
some later point, you would implement the table walker in hardware,
but probably not for the design you do in 1975.
Yeah, this approach works a lot better than people seem to give it
credit for...
BGB wrote:
On 8/7/2025 6:38 AM, Anton Ertl wrote:
Concerning page table walker: The MIPS R2000 just has a TLB and traps
on a TLB miss, and then does the table walk in software. While that's
not a solution that's appropriate for a wide superscalar CPU, it was
good enough for beating the actual VAX 11/780 by a good margin; at
some later point, you would implement the table walker in hardware,
but probably not for the design you do in 1975.
Yeah, this approach works a lot better than people seem to give it
credit for...
Both HW and SW table walkers incur the cost of reading the PTE's.
The pipeline drain and load of the software TLB miss handler,
then a drain and reload of the original code on return
are a large expense that HW walkers do not have.
"For example, Anderson, et al. [1] show TLB miss handlers to be among
the most commonly executed OS primitives; Huck and Hays [10] show that
TLB miss handling can account for more than 40% of total run time;
and Rosenblum, et al. [18] show that TLB miss handling can account
for more than 80% of the kernel’s computation time.
BGB <cr88192@gmail.com> writes:
It is maybe pushing it a little if one wants to use an AVL-tree or
B-Tree for virtual memory vs a page-table
I assume that you mean a balanced search tree (binary (AVL) or n-ary
(B)) vs. the now-dominant hierarchical multi-level page tables, which
are tries.
In both a hardware and a software implementation, one could implement
a balanced search tree, but what would be the advantage?
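For comparison, here is what the dominant multi-level (trie/radix) walk
looks like when written out. This is a sketch with x86-64-style field
widths (four levels of 9 index bits over 4K pages); names such as
phys_to_virt() and pte_t are my own placeholders, not any particular
kernel's API:

#include <stdint.h>

typedef uint64_t pte_t;

#define PTE_PRESENT  0x1ull
#define PTE_ADDR(p)  ((p) & 0x000ffffffffff000ull)   /* physical frame */

extern void *phys_to_virt(uint64_t pa);   /* placeholder mapping */

static pte_t *walk(uint64_t root_pa, uint64_t va)
{
    uint64_t table_pa = root_pa;
    for (int level = 3; level >= 0; level--) {
        pte_t *table = phys_to_virt(table_pa);
        unsigned idx = (va >> (12 + 9 * level)) & 0x1ff;
        pte_t e = table[idx];
        if (!(e & PTE_PRESENT))
            return 0;                 /* not mapped: page fault */
        if (level == 0)
            return &table[idx];       /* leaf PTE for this VA */
        table_pa = PTE_ADDR(e);       /* descend to next-level table */
    }
    return 0;                         /* not reached */
}

The advantage over a balanced search tree shows in the loop body: the
VA bits themselves are the index, so each level is one load with no
comparisons and no rebalancing, and the depth is fixed, which is also
what makes the walk easy to cast into hardware.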
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Thomas Koenig wrote:
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Unless... maybe somebody (a customer, or they themselves)
discovered that there may have been conditions where they could
only guarantee 80 ns. Maybe a combination of tolerances to one
side and a certain logic programming, and they changed the
data sheet.
Manufacturing process variation leads to timing differences that
testing sorts into speed bins. The faster bins sell at higher price.
Is that possible with a PAL before it has been programmed?
The 82S100 PLA is logic equivalent to:
- 16 inputs each with an optional input invertor,
By comparison, you could get an eight-input NAND gate with a
maximum delay of 12 ns (the 74H030), so putting two in sequence
to simulate a PLA would have been significantly faster.
I can understand people complaining that PALs were slow.
Should be free coming from a Flip-Flop.
Depends on what chips you use for registers.
If you want both Q and Qb then you only get 4 FF in a package like 74LS375.
For a wide instruction or stage register I'd look at chips such as a 74LS377
with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock enable, vcc, gnd.
So if you need eight outputs, your choice is to use two 74LS375
(presumably more expensive) or a 74LS377 and an eight-chip
inverter (a bit slower, but inverters should be fast).
I'm just showing why it was more than just an AND gate.
Another point... if you don't need 16 inputs or 8 outputs, you
are also paying a lot more. If you have a 6-bit primary opcode,
you don't need a full 16 bits of input.
Two layers of NAND :-)
I'm still exploring whether it can be variable length instructions or
has to be fixed 32-bit. In either case all the instruction "code" bits
(as in op code or function code or whatever) should be checked,
even if just to verify that should-be-zero bits are zero.
There would also be instruction buffer Valid bits and other state bits
like Fetch exception detected, interrupt request, that might feed into
a bank of PLA's multiple wide and deep.
Agreed, the logic has to go somewhere. Regularity in the
instruction set would have been even more important then than now,
to reduce the logic requirements for decoding.
BGB wrote:
On 8/7/2025 6:38 AM, Anton Ertl wrote:
Concerning page table walker: The MIPS R2000 just has a TLB and traps
on a TLB miss, and then does the table walk in software. While that's
not a solution that's appropriate for a wide superscalar CPU, it was
good enough for beating the actual VAX 11/780 by a good margin; at
some later point, you would implement the table walker in hardware,
but probably not for the design you do in 1975.
Yeah, this approach works a lot better than people seem to give it
credit for...
Both HW and SW table walkers incur the cost of reading the PTE's.
The pipeline drain and load of the software TLB miss handler,
then a drain and reload of the original code on return
are a large expense that HW walkers do not have.
In-Line Interrupt Handling for Software-Managed TLBs 2001 https://terpconnect.umd.edu/~blj/papers/iccd2001.pdf
"For example, Anderson, et al. [1] show TLB miss handlers to be among
the most commonly executed OS primitives; Huck and Hays [10] show that
TLB miss handling can account for more than 40% of total run time;
and Rosenblum, et al. [18] show that TLB miss handling can account
for more than 80% of the kernel’s computation time.
Recent studies show that TLB-related precise interrupts occur
once every 100–1000 user instructions on all ranges of code, from
SPEC to databases and engineering workloads [5, 18]."
On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:
On Mon, 04 Aug 2025 09:53:51 -0700
Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
In C17 and earlier, _BitInt is a reserved identifier. Any attempt to
use it has undefined behavior. That's exactly why new keywords are
often defined with that ugly syntax.
That is a language lawyer's type of reasoning. Normally gcc maintainers
are wiser than that because, well, by chance gcc happens to be a widely
used production compiler. I don't know why this time they chose a
less conservative road.
They invented an identifier which lands in the _[A-Z].* namespace
designated as reserved by the standard.
What would be an example of a more conservative way to name the
identifier?
EricP <ThatWouldBeTelling@thevillage.com> writes:
Why not treat the SW TLB miss handler as similar to a call as
possible? Admittedly, calls occur as part of the front end, while (in
an OoO core) the TLB miss comes from the execution engine or the
reorder buffer, but still: could it just be treated like a call
inserted in the instruction stream at the time when it is noticed,
with the instructions running in a special context (read access to
page tables allowed). You may need to flush the pipeline anyway,
though, if the TLB miss
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
EricP <ThatWouldBeTelling@thevillage.com> writes:
Why not treat the SW TLB miss handler as similar to a call as
possible? Admittedly, calls occur as part of the front end, while (in
an OoO core) the TLB miss comes from the execution engine or the
reorder buffer, but still: could it just be treated like a call
inserted in the instruction stream at the time when it is noticed,
with the instructions running in a special context (read access to
page tables allowed). You may need to flush the pipeline anyway,
though, if the TLB miss
... if the buffers fill up and there are not enough resources left for
the TLB miss handler.
- anton
Thomas Koenig wrote:
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Thomas Koenig wrote:
EricP <ThatWouldBeTelling@thevillage.com> schrieb:
Unless... maybe somebody (a customer, or they themselves)
discovered that there may have been conditions where they could
only guarantee 80 ns. Maybe a combination of tolerances to one
side and a certain logic programming, and they changed the
data sheet.
Manufacturing process variation leads to timing differences that
testing sorts into speed bins. The faster bins sell at higher price.
Is that possible with a PAL before it has been programmed?
They can speed and partially function test it.
It's programmed by blowing internal fuses, which is a one-shot thing,
so that function can't be tested.
The 82S100 PLA is logic equivalent to:
- 16 inputs each with an optional input invertor,
By comparison, you could get an eight-input NAND gate with a
maximum delay of 12 ns (the 74H030), so putting two in sequence
to simulate a PLA would have been significantly faster.
I can understand people complaining that PALs were slow.
Should be free coming from a Flip-Flop.
Depends on what chips you use for registers.
If you want both Q and Qb then you only get 4 FF in a package like
74LS375.
For a wide instruction or stage register I'd look at chips such as a
74LS377
with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock enable,
vcc, gnd.
So if you need eight outputs, your choice is to use two 74LS375
(presumably more expensive) or a 74LS377 and an eight-chip
inverter (a bit slower, but inverters should be fast).
I'm just showing why it was more than just an AND gate.
Another point... if you don't need 16 inputs or 8 outputs, you
are also paying a lot more. If you have a 6-bit primary opcode,
you don't need a full 16 bits of input.
Two layers of NAND :-)
Thinking about different ways of doing this...
If the first NAND layer has open collector outputs then we can use
a wired-AND logic driving an invertor for the second NAND plane.
If the instruction buffer outputs to a set of 74159 4:16 demux with
open collector outputs, then we can just wire the outputs we want
together with a 10k pull-up resistor and drive an invertor,
to form the second output NAND layer.
inst buf <15:8> <7:0>
| | | |
4:16 4:16 4:16 4:16
vvvv vvvv vvvv vvvv
10k ---|---|---|---|------>INV->
10k ---------------------->INV->
10k ---------------------->INV->
I'm still exploring whether it can be variable length instructions or
has to be fixed 32-bit. In either case all the instruction "code" bits
(as in op code or function code or whatever) should be checked,
even if just to verify that should-be-zero bits are zero.
There would also be instruction buffer Valid bits and other state bits
like Fetch exception detected, interrupt request, that might feed into
a bank of PLA's multiple wide and deep.
Agreed, the logic has to go somewhere. Regularity in the
instruction set would have been even more important then than now,
to reduce the logic requirements for decoding.
The question is whether in 1975 main memory is so expensive that
we cannot afford the wasted space of a fixed 32-bit ISA.
In 1975 the widely available DRAM was the Intel 1103 1k*1b.
The 4kb DRAMs were just making it to customers, 16kb were preliminary.
Looking at the instruction set usage of VAX in
Measurement and Analysis of Instruction Use in VAX 780, 1982 https://dl.acm.org/doi/pdf/10.1145/1067649.801709
we see that the top 25 instructions covers about 80-90% of the usage,
and many of them would fit into 2 or 3 bytes.
A fixed 32-bit instruction would waste 1 to 2 bytes on most instructions.
But a fixed 32-bit instruction is very much easier to fetch, and
decode needs a lot less logic for shifting prefetch buffers,
compared to, say, variable lengths of 1 to 12 bytes.
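A toy C model (my own sketch, not from the thread) of why the fixed
width is so much cheaper to fetch: locating instruction k is a pure
array index, whereas with 1 to 12 byte instructions it is a serial walk
that needs a length decode per instruction, which is what the shifting
prefetch buffer logic has to do in hardware:

#include <stdint.h>
#include <stddef.h>

/* Fixed 32-bit encoding: instruction k is simply word k. */
static uint32_t fetch_fixed(const uint32_t *code, size_t k)
{
    return code[k];
}

/* Variable 1..12 byte encoding: the start of instruction k is not known
   until the lengths of instructions 0..k-1 have been decoded.
   insn_length() is a hypothetical stand-in for the length decoder. */
size_t insn_length(const uint8_t *p);

static const uint8_t *fetch_variable(const uint8_t *code, size_t k)
{
    const uint8_t *p = code;
    while (k--)
        p += insn_length(p);
    return p;
}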
I am unsure how GCC --pedantic deals with the standards-contrary
features in the GNUC89 language, such as the different type of (foo,
'C') (GNUC says char, C89 says int), maybe specifying standard C
instead of GNUC reverts those to the standard definition .
In article <107mf9l$u2si$1@dont-email.me>,
Thomas Koenig <tkoenig@netcologne.de> wrote:
It's not clear to me what the distinction of technical vs. business
is supposed to be in the context of ISA design. Could you explain?
I can attempt to, though I'm not sure if I can be successful.
And so with the VAX, I can imagine the work (which started in,
what, 1975?) being informed by a business landscape that saw an
increasing trend towards favoring high-level languages, but also
saw the continued development of large, bespoke, business
applications for another five or more years, and with customers
wanting to be able to write (say) complex formatting sequences
easily in assembler (the EDIT instruction!), in a way that was
compatible with COBOL (so make the COBOL compiler emit the EDIT instruction!), while also trying to accommodate the scientific
market (POLYF/POLYG!) who would be writing primarily in FORTRAN
but jumping to assembler for the fuzz-busting speed boost (so
stabilize what amounts to an ABI very early on!), and so forth.
Jakob Bohm <egenagwemdimtapsar@jbohm.dk> writes:
[...]
I am unsure how GCC --pedantic deals with the standards-contrary
features in the GNUC89 language, such as the different type of (foo,
'C') (GNUC says char, C89 says int), maybe specifying standard C
instead of GNUC reverts those to the standard definition .
I'm not sure what you're referring to. You didn't say what foo is.
I believe that in all versions of C, the result of a comma operator has
the type and value of its right operand, and the type of an unprefixed character constant is int.
Can you show a complete example where `sizeof (foo, 'C')` yields
sizeof (int) in any version of GNUC?
EricP <ThatWouldBeTelling@thevillage.com> writes:
BGB wrote:
On 8/7/2025 6:38 AM, Anton Ertl wrote:Both HW and SW table walkers incur the cost of reading the PTE's.
Concerning page table walker: The MIPS R2000 just has a TLB and trapsYeah, this approach works a lot better than people seem to give it
on a TLB miss, and then does the table walk in software. While that's >>>> not a solution that's appropriate for a wide superscalar CPU, it was
good enough for beating the actual VAX 11/780 by a good margin; at
some later point, you would implement the table walker in hardware,
but probably not for the design you do in 1975.
credit for...
The pipeline drain and load of the software TLB miss handler,
then a drain and reload of the original code on return
are a large expense that HW walkers do not have.
Why not treat the SW TLB miss handler as similar to a call as
possible? Admittedly, calls occur as part of the front end, while (in
an OoO core) the TLB miss comes from the execution engine or the
reorder buffer, but still: could it just be treated like a call
inserted in the instruction stream at the time when it is noticed,
with the instructions running in a special context (read access to
page tables allowed). You may need to flush the pipeline anyway,
though, if the TLB miss
"For example, Anderson, et al. [1] show TLB miss handlers to be among
the most commonly executed OS primitives; Huck and Hays [10] show that
TLB miss handling can account for more than 40% of total run time;
and Rosenblum, et al. [18] show that TLB miss handling can account
for more than 80% of the kernel’s computation time.
I have seen ~90% of the time spent on TLB handling on an Ivy Bridge
with hardware table walking, on a 1000x1000 matrix multiply with
pessimal spatial locality (2 TLB misses per iteration). Each TLB miss
cost about 20 cycles.
- anton
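For reference, the pessimal-locality case reads roughly like the sketch
below (my own reconstruction, not Anton's actual benchmark): both
operand streams are walked down columns, so with 1000x1000 doubles each
inner-loop step lands on two fresh 4K pages, far more pages per matrix
than a typical TLB holds.

#define N 1000

/* Computes c = a^T * b.  a[k][i] and b[k][j] both stride by a full row
   (8000 bytes) per k, i.e. a different 4K page every iteration, giving
   about two TLB misses per inner-loop step once the ~1000 pages per
   column walk exceed the TLB capacity. */
void matmul_colwise(double (*a)[N], double (*b)[N], double (*c)[N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += a[k][i] * b[k][j];
            c[i][j] = s;
        }
}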
There were a number of proposals around then, the paper I linked to
also suggested injecting the miss routine into the ROB.
My idea back then was a HW thread.
All of these are attempts to fix inherent drawbacks and limitations
in the SW-miss approach, and all of them run counter to the only
advantage SW-miss had: its simplicity.
The SW approach is inherently synchronous and serial -
it can only handle one TLB miss at a time, one PTE read at a time.
While HW walkers are serial for translating one VA,
the translations are inherently concurrent provided one can
implement an atomic RMW for the Accessed and Modified bits.
Each PTE read can cache miss and stall that walker.
As most OoO caches support multiple pending misses and hit-under-miss,
you can create as many HW walkers as you can afford.
EricP <ThatWouldBeTelling@thevillage.com> writes:
There were a number of proposals around then, the paper I linked to
also suggested injecting the miss routine into the ROB.
My idea back then was a HW thread.
While HW walkers are serial for translating one VA,
the translations are inherently concurrent provided one can
implement an atomic RMW for the Accessed and Modified bits.
It's always a one-way street (towards accessed and towards modified,
never the other direction), so it's not clear to me why one would want
atomicity there.
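To make the point concrete, here is a sketch of the PTE update in
question (my own illustration, using C11 atomics and x86-style bit
positions): because the bits only ever go from 0 to 1 during a walk, a
single atomic OR is enough - there is no read-compute-write cycle whose
intermediate state could be lost - which is one way of reading both
EricP's requirement and Anton's question.

#include <stdatomic.h>
#include <stdint.h>

#define PTE_ACCESSED  (UINT64_C(1) << 5)   /* bit positions as on x86-64 */
#define PTE_DIRTY     (UINT64_C(1) << 6)

/* Hypothetical walker-side update.  The OS clears these bits under its
   own locking; the walker only ever sets them, so fetch_or is
   sufficient and several walkers can update the same PTE safely. */
static void mark_referenced(_Atomic uint64_t *pte, int is_write)
{
    uint64_t bits = PTE_ACCESSED | (is_write ? PTE_DIRTY : 0);
    atomic_fetch_or_explicit(pte, bits, memory_order_relaxed);
}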
On 18.08.2025 07:18, Keith Thompson wrote:
Jakob Bohm <egenagwemdimtapsar@jbohm.dk> writes:
[...]
I am unsure how GCC --pedantic deals with the standards-contraryI'm not sure what you're referring to. You didn't say what foo is.
features in the GNUC89 language, such as the different type of (foo,
'C') (GNUC says char, C89 says int), maybe specifying standard C
instead of GNUC reverts those to the standard definition .
I believe that in all versions of C, the result of a comma operator
has
the type and value of its right operand, and the type of an unprefixed
character constant is int.
Can you show a complete example where `sizeof (foo, 'C')` yields
sizeof (int) in any version of GNUC?
Presumably that's a typo - you meant to ask when the size is /not/ the
size of "int" ? After all, you said yourself that "(foo, 'C')"
evaluates to 'C' which is of type "int". It would be very interesting
if Jakob can show an example where gcc treats the expression as any
other type than "int".
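A self-contained test along the lines Keith asks for might look like
this (my own sketch); as far as I know, gcc agrees with the standard
here in every -std= mode and prints sizeof(int) for both expressions
(in C++, where 'C' has type char, both would print 1 instead):

#include <stdio.h>

int main(void)
{
    double foo = 0.0;
    /* The comma operator yields its right operand, and an unprefixed
       character constant has type int in C, so both of these should
       equal sizeof(int). */
    printf("sizeof 'C'        = %zu\n", sizeof 'C');
    printf("sizeof (foo, 'C') = %zu\n", sizeof (foo, 'C'));
    printf("sizeof (int)      = %zu\n", sizeof (int));
    return 0;
}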
For extremely wide cores, like Apple's M (modulo ISA), AMD Zen5 and
Intel Lion Cove, I'd do the following modification to your inner loop
(back in Intel syntax):
xor ebx,ebx
next:
xor edx, edx
mov rax,[rsi+rcx*8]
add rax,[r8+rcx*8]
adc edx,edx
add rax,[r9+rcx*8]
adc edx,0
add rbx,rax
jc incremen_edx
; eliminate data dependency between loop iteration
; replace it by very predictable control dependency
edx_ready:
mov edx, ebx
mov [rdi+rcx*8],rax
inc rcx
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never.
incremen_edx:
inc edx
jmp edx_ready
The idea is interesting, but I don't understand the code. The
following looks funny to me:
1) You increment edx in increment_edx, then jump back to edx_ready and
immediately overwrite edx with ebx. Then you do nothing with it,
and then you clear edx in the next iteration. So both the "inc
edx" and the "mov edx, ebx" look like dead code to me that can be
optimized away.
2) There is a loop-carried dependency through ebx, and the number
accumulating in ebx and the carry check makes no sense with that.
Could it be that you wanted to do "mov ebx, edx" at edx_ready? It all
makes more sense with that. ebx then contains the carry from the last
cycle on entry. The carry dependency chain starts at clearing edx,
then gets to additional carries, then is copied to ebx, transferred
into the next iteration, and is ended there by overwriting ebx. No
dependency cycles (except the loop counter and addresses, which can be
dealt with by hardware or by unrolling), and ebx contains the carry
from the last iteration
One other problem is that according to Agner Fog's instruction tables,
even the latest and greatest CPUs from AMD and Intel that he measured
(Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
and adc has a latency of 1, so breaking the dependency chain in a
beneficial way should avoid the use of adc. For our three-summand
add, it's not clear if adcx and adox can run in the same cycle, but
looking at your measurements, it is unlikely.
So we would need something other than "adc edx, edx" to set the carry
register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
(and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
have 1 in edi, and then do, for two-summand addition:
mov edi,1
xor ebx,ebx
next:
xor edx, edx
mov rax,[rsi+rcx*8]
add rax,[r8+rcx*8]
cmovc edx, edi
add rbx,rax
jc incremen_edx
; eliminate data dependency between loop iteration
; replace it by very predictable control dependency
edx_ready:
mov edx, ebx
mov [rdi+rcx*8],rax
inc rcx
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never.
incremen_edx:
inc edx
jmp edx_ready
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
The idea is interesting, but I don't understand the code. The
following looks funny to me:
1) You increment edx in increment_edx, then jump back to edx_ready and
immediately overwrite edx with ebx. Then you do nothing with it,
and then you clear edx in the next iteration. So both the "inc
edx" and the "mov edx, ebx" look like dead code to me that can be
optimized away.
2) There is a loop-carried dependency through ebx, and the number
accumulating in ebx and the carry check makes no sense with that.
Could it be that you wanted to do "mov ebx, edx" at edx_ready? It all
makes more sense with that. ebx then contains the carry from the last
cycle on entry. The carry dependency chain starts at clearing edx,
then gets to additional carries, then is copied to ebx, transferred
into the next iteration, and is ended there by overwriting ebx. No
dependency cycles (except the loop counter and addresses, which can be
dealt with by hardware or by unrolling), and ebx contains the carry
from the last iteration
One other problem is that according to Agner Fog's instruction tables,
even the latest and greatest CPUs from AMD and Intel that he measured
(Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
and adc has a latency of 1, so breaking the dependency chain in a
beneficial way should avoid the use of adc. For our three-summand
add, it's not clear if adcx and adox can run in the same cycle, but
looking at your measurements, it is unlikely.
So we would need something other than "adc edx, edx" to set the carry
register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
(and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
have 1 in edi, and then do, for two-summand addition:
mov edi,1
xor ebx,ebx
next:
xor edx, edx
mov rax,[rsi+rcx*8]
add rax,[r8+rcx*8]
cmovc edx, edi
add rbx,rax
jc incremen_edx
; eliminate data dependency between loop iteration
; replace it by very predictable control dependency
edx_ready:
mov edx, ebx
mov [rdi+rcx*8],rax
inc rcx
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely.For random inputs-approximately never
incremen_edx:
inc edx
jmp edx_ready
Forgot to fix the "mov edx, ebx" here. One other thing: I think that
the "add rbx, rax" should be "add rax, rbx". You want to add the
carry to rax before storing the result. So the version with just one iteration would be:
mov edi,1
xor ebx,ebx
next:
xor edx, edx
mov rax,[rsi+rcx*8]
add rax,[r8+rcx*8]
cmovc edx, edi
add rax,rbx
jc incremen_edx
; eliminate data dependency between loop iteration
; replace it by very predictable control dependency
edx_ready:
mov ebx, edx
mov [rdi+rcx*8],rax
inc rcx
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never.
incremen_edx:
inc edx
jmp edx_ready
And the version with the two additional adc-using iterations would be
(with an additional correction):
mov edi,1
xor ebx,ebx
next:
mov rax,[rsi+rcx*8]
add [r8+rcx*8], rax
mov rax,[rsi+rcx*8+8]
adc [r8+rcx*8+8], rax
xor edx, edx
mov rax,[rsi+rcx*8+16]
adc rax,[r8+rcx*8+16]
cmovc edx, edi
add rax,rbx
jc incremen_edx
; eliminate data dependency between loop iteration
; replace it by very predictable control dependency
edx_ready:
mov ebx, edx
mov [rdi+rcx*8+16],rax
add rcx,3
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely. For random inputs - approximately never.
incremen_edx:
inc edx
jmp edx_ready
- anton
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
The idea is interesting, but I don't understand the code. The
following looks funny to me:
1) You increment edx in increment_edx, then jump back to edx_ready
and
immediately overwrite edx with ebx. Then you do nothing with it,
and then you clear edx in the next iteration. So both the "inc
edx" and the "mov edx, ebx" look like dead code to me that can be
optimized away.
2) There is a loop-carried dependency through ebx, and the number
accumulating in ebx and the carry check makes no sense with that.
Could it be that you wanted to do "mov ebx, edx" at edx_ready? It
all makes more sense with that. ebx then contains the carry from
the last cycle on entry. The carry dependency chain starts at
clearing edx, then gets to additional carries, then is copied to
ebx, transferred into the next iteration, and is ended there by
overwriting ebx. No dependency cycles (except the loop counter and
addresses, which can be dealt with by hardware or by unrolling), and
ebx contains the carry from the last iteration.
One other problem is that according to Agner Fog's instruction
tables, even the latest and greatest CPUs from AMD and Intel that he
measured (Zen5 and Tiger Lake) can only execute one adc/adcx/adox
per cycle, and adc has a latency of 1, so breaking the dependency
chain in a beneficial way should avoid the use of adc. For our
three-summand add, it's not clear if adcx and adox can run in the
same cycle, but looking at your measurements, it is unlikely.
So we would need something other than "adc edx, edx" to set the carry
register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
(and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
have 1 in edi, and then do, for two-summand addition:
mov edi,1
xor ebx,ebx
next:
xor edx, edx
mov rax,[rsi+rcx*8]
add rax,[r8+rcx*8]
cmovc edx, edi
add rbx,rax
jc incremen_edx
; eliminate data dependency between loop iteration
; replace it by very predictable control dependency
edx_ready:
mov edx, ebx
mov [rdi+rcx*8],rax
inc rcx
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely. For random inputs, approximately never.
incremen_edx:
inc edx
jmp edx_ready
Forgot to fix the "mov edx, ebx" here. One other thing: I think that
the "add rbx, rax" should be "add rax, rbx". You want to add the
carry to rax before storing the result. So the version with just one iteration would be:
mov edi,1
xor ebx,ebx
next:
xor edx, edx
mov rax,[rsi+rcx*8]
add rax,[r8+rcx*8]
cmovc edx, edi
add rax,rbx
jc incremen_edx
; eliminate data dependency between loop iterations
; replace it by a very predictable control dependency
edx_ready:
mov ebx, edx
mov [rdi+rcx*8],rax
inc rcx
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely. For random inputs, approximately never.
incremen_edx:
inc edx
jmp edx_ready
And the version with the two additional adc-using iterations would be
(with an additional correction):
mov edi,1
xor ebx,ebx
next:
mov rax,[rsi+rcx*8]
add [r8+rcx*8], rax
mov rax,[rsi+rcx*8+8]
adc [r8+rcx*8+8], rax
xor edx, edx
mov rax,[rsi+rcx*8+16]
adc rax,[r8+rcx*8+16]
cmovc edx, edi
add rax,rbx
jc incremen_edx
; eliminate data dependency between loop iterations
; replace it by a very predictable control dependency
edx_ready:
mov ebx, edx
mov [rdi+rcx*8+16],rax
add rcx,3
cmp rcx,r10
jb next
...
ret
; that code is placed after return
; it is executed extremely rarely. For random inputs, approximately never.
incremen_edx:
inc edx
jmp edx_ready
- anton
Anton, I like what you and Michael have done, but I'm still not sure
everything is OK:
In your code, I only see two input arrays [rsi] and [r8], instead of
three? (Including [r9])
It would also be possible to use SETC to save the intermediate carries...
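For reference, here is a plain C version of the three-summand element-wise
addition that the add3 kernels aim at (my sketch, not code from the thread;
the name and signature are made up). It shows where the third array comes in
and why the carry between elements can reach 2.
#include <stdint.h>
#include <stddef.h>
/* dst = a + b + c, treating the arrays as little-endian strings of
   64-bit limbs; purely a reference, with none of the carry-breaking
   or ADX tricks of the assembly versions. */
void add3_ref(uint64_t *dst, const uint64_t *a, const uint64_t *b,
              const uint64_t *c, size_t n)
{
    unsigned carry = 0;                  /* 0..2 between limbs */
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + b[i];
        unsigned cy = (s < a[i]);        /* carry out of a+b       */
        uint64_t t = s + c[i];
        cy += (t < s);                   /* carry out of +c        */
        uint64_t r = t + carry;
        cy += (r < t);                   /* carry out of +carry_in */
        dst[i] = r;
        carry = cy;                      /* at most 2 per limb     */
    }
}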
One other problem is that according to Agner Fog's instruction tables,
even the latest and greatest CPUs from AMD and Intel that he measured
(Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
John Levine <johnl@taugh.com> writes:
SSDs often let you do 512 byte reads and writes for backward compatibility even
though the physical block size is much larger.
Yes. But if the argument had any merit that 512B is a good page size
because it avoids having to transfer 8, 16, or 32 sectors at a time,
it would still have merit, because the interface still shows 512B
sectors.
I think we're agreeing that even in the early 1980s a 512 byte page was
too small. They certainly couldn't have made it any smaller, but they
should have made it larger.
S/370 was a decade before that and its pages were 2K or 4K. The KI-10,
the first PDP-10 with paging, had 2K pages in 1972. Its pager was based
on BBN's add-on pager for TENEX, built in 1970 also with 2K pages.
Note that 360 had optional page protection used only for access
control. In the 370 era they had a legacy of 2K or 4K pages, and
AFAICS IBM was mainly aiming at bigger machines, so they
were not so worried about fragmentation.
PDP-11 experience possibly contributed to using smaller pages for VAX.
Microprocessors were designed with different constraints, which
led to bigger pages. But VAX apparently could afford a reasonably
large TLB, and due to the VMS structure the gain was bigger than for
other OSes.
And a little correction: the VAX architecture handbook is dated 1977,
so the decision about page size had to be made by 1977, and
possibly earlier.
antispam@fricas.org (Waldek Hebisch) writes:
The basic question is if VAX could afford the pipeline.
VAX 11/780 only performed instruction fetching concurrently with the
rest (a two-stage pipeline, if you want). The 8600, 8700/8800 and
NVAX applied more pipelining, but CPI remained high.
VUPs      MHz    CPI     Machine
   1       5     10      11/780
   4      12.5    6.25   8600
   6      22.2    7.4    8700
  35      90.9    5.1    NVAX+
SPEC92    MHz    VAX CPI  Machine
  1/1       5    10/10    VAX 11/780
133/200   200    3/2      Alpha 21064 (DEC 7000 model 610)
VUPs and SPEC numbers from
<https://pghardy.net/paul/programs/vms_cpus.html>.
The 10 CPI (cycles per instruction) figure for the VAX 11/780 is anecdotal.
The other CPIs are computed from VUP/SPEC and MHz numbers; all of that
is probably somewhat off (due to the anecdotal base being off), but if
you relate them to each other, the offness cancels itself out.
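As a worked example of that computation (my sketch; it just reuses the 0.5
native MIPS implied by 5 MHz / 10 CPI for the 11/780 as the 1-VUP
reference), this reproduces the VAX CPI column above:
#include <stdio.h>
int main(void)
{
    const double ref_mips = 5.0 / 10.0;   /* 11/780: 5 MHz / 10 CPI */
    struct { const char *name; double vups, mhz; } m[] = {
        {"8600",   4.0, 12.5},
        {"8700",   6.0, 22.2},
        {"NVAX+", 35.0, 90.9},
    };
    for (int i = 0; i < 3; i++) {
        double vax_mips = m[i].vups * ref_mips;   /* equivalent VAX MIPS */
        printf("%-6s VAX CPI = %.2f\n", m[i].name, m[i].mhz / vax_mips);
    }
    return 0;   /* prints 6.25, 7.40 and 5.19 */
}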
Note that the NVAX+ was made in the same process as the 21064, the
21064 has about 2.2 times the clock rate, and has 4-6 times the performance,
resulting not just in a lower native CPI, but also in a lower "VAX
CPI" (the CPI a VAX would have needed to achieve the same performance
at this clock rate).
On Tue, 19 Aug 2025 05:47:01 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
One other problem is that according to Agner Fog's instruction tables,
even the latest and greatest CPUs from AMD and Intel that he measured
(Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
I didn't measure on either TGL or Zen5, but both Raptor Cove and Zen3
are certainly capable of more than 1 adcx|adox per cycle.
Below are execution times of very heavily unrolled adcx/adox code with the dependency broken by a trick similar to the above:
Platform          RC     GM     SK     Z3
add3_my_adx_u17  244.5  471.1  482.4  407.0
Considering that there are 2166 adcx/adox/adc instructions, we have the
following number of adcx/adox/adc instructions per clock:
Platform   RC    GM    SK    Z3
           1.67  1.10  1.05  1.44
For Gracemont and Skylake there is a possibility of a small
measurement mistake, but Raptor Cove appears to be capable of at least 2
instructions of this type per clock, while Zen3 is capable of at least 1.5,
but more likely also 2.
It looks to me like the bottleneck on both RC and Z3 is either the rename
phase or, more likely, L1$ access. It seems that while Golden/Raptor Cove
can occasionally issue 3 loads + 2 stores per clock, it cannot sustain
more than 3 load-or-store accesses per clock.
Code:
.file "add3_my_adx_u17.s"
.text
.p2align 4
.globl add3
.def add3; .scl 2; .type 32; .endef
.seh_proc add3
add3:
pushq %rsi
.seh_pushreg %rsi
pushq %rbx
.seh_pushreg %rbx
.seh_endprologue
# %rcx - dst
# %rdx - a
# %r8 - b
# %r9 - c
sub %rdx, %rcx
mov %rcx, %r10 # r10 = dst - a
sub %rdx, %r8 # r8 = b - a
sub %rdx, %r9 # r9 = c - a
mov %rdx, %r11 # r11 = a
mov $60, %edx
xor %ecx, %ecx
.p2align 4
.loop:
xor %ebx, %ebx # CF <= 0, OF <= 0, EBX <= 0
mov (%r11), %rsi
adcx (%r11,%r8), %rsi
adox (%r11,%r9), %rsi
mov 8(%r11), %rax
adcx 8(%r11,%r8), %rax
adox 8(%r11,%r9), %rax
mov %rax, 8(%r10,%r11)
Very impressive Michael!
I particularly like how you are interleaving ADOX and ADCX to gain
two carry bits without having to save them off to an additional
register.
Terje
Overall, I think that the time spent by Intel engineers on the invention of
ADX could have been spent much better.
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
EricP <ThatWouldBeTelling@thevillage.com> writes:
There were a number of proposals around then, the paper I linked to
also suggested injecting the miss routine into the ROB.
My idea back then was a HW thread.
While HW walkers are serial for translating one VA,
the translations are inherently concurrent provided one can
implement an atomic RMW for the Accessed and Modified bits.
It's always a one-way street (towards accessed and towards modified,
never the other direction), so it's not clear to me why one would want
atomicity there.
To avoid race conditions with software clearing those bits, presumably.
ARM64 originally didn't support hardware updates in V8.0; they were
independent hardware features added in V8.1.
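A minimal C11 sketch (mine, not from the thread; the PTE bit positions and
the helper name are invented) of the kind of atomic RMW being discussed: a
walker that only ever sets the Accessed/Modified bits, but must not clobber
a concurrent software update to the same PTE.
#include <stdatomic.h>
#include <stdint.h>
#define PTE_ACCESSED (UINT64_C(1) << 5)   /* made-up bit positions */
#define PTE_MODIFIED (UINT64_C(1) << 6)
static void pte_mark(_Atomic uint64_t *pte, uint64_t bits)
{
    uint64_t old = atomic_load_explicit(pte, memory_order_relaxed);
    uint64_t want;
    do {
        want = old | bits;          /* one-way: bits are only ever set */
        if (want == old)
            return;                 /* already set, nothing to publish */
    } while (!atomic_compare_exchange_weak_explicit(
                 pte, &old, want,
                 memory_order_acq_rel, memory_order_relaxed));
    /* The CAS loop keeps the update from overwriting a concurrent
       software change (e.g. clearing A/M or remapping the page). */
}
The point is only that the update is a read-modify-write of the PTE rather
than a blind store of the whole entry.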
EricP <ThatWouldBeTelling@thevillage.com> writes:
There were a number of proposals around then, the paper I linked to
also suggested injecting the miss routine into the ROB.
My idea back then was a HW thread.
All of these are attempts to fix inherent drawbacks and limitations
in the SW-miss approach, and all of them run counter to the only
advantage SW-miss had: its simplicity.
Another advantage is the flexibility: you can implement any
translation scheme you want: hierarchical page tables, inverted page
tables, search trees, .... However, given that hierarchical page
tables have won, this is no longer an advantage anyone cares for.
The SW approach is inherently synchronous and serial -
it can only handle one TLB miss at a time, one PTE read at a time.
On an OoO engine, I don't see that. The table walker software is
called in its special context and the instructions in the table walker
are then run through the front end and the OoO engine. Another table
walk could be started at any time (even when the first table walk has
not yet finished feeding its instructions to the front end), and once
inside the OoO engine, the execution is OoO and concurrent anyway. It
would be useful to avoid two searches for the same page at the same
time, but hardware walkers have the same problem.
While HW walkers are serial for translating one VA,
the translations are inherently concurrent provided one can
implement an atomic RMW for the Accessed and Modified bits.
It's always a one-way street (towards accessed and towards modified,
never the other direction), so it's not clear to me why one would want atomicity there.
Each PTE read can cache miss and stall that walker.
As most OoO caches support multiple pending misses and hit-under-miss,
you can create as many HW walkers as you can afford.
Which poses the question: is it cheaper to implement n table walkers,
or to add some resources and mechanism that allows doing SW table
walks until the OoO engine runs out of resources, and a recovery
mechanism in that case.
I see other performance and conceptual disadvantages for the envisioned
SW walkers, however:
1) The SW walker is inserted at the front end and there may be many
ready instructions ahead of it before the instructions of the SW
walker get their turn. By contrast, a hardware walker sits in the
load/store unit and can do its own loads and stores with priority over
the program-level loads and stores. However, it's not clear that
giving priority to table walking is really a performance advantage.
2) Some decisions will have to be implemented as branches, resulting
in branch misses, which cost time and lead to all kinds of complexity
if you want to avoid resetting the whole pipeline (which is the normal reaction to a branch misprediction).
3) The reorder buffer processes instructions in architectural order.
If the table walker's instructions get their sequence numbers from
where they are inserted into the instruction stream, they will not
retire until after the memory access that waits for the table walker
is retired. Deadlock!
It may be possible to solve these problems (your idea of doing it with something like hardware threads may point in the right direction), but
it's probably easier to stay with hardware walkers.
- anton
On 8/17/2025 12:35 PM, EricP wrote:
The question is whether in 1975 main memory is so expensive that
we cannot afford the wasted space of a fixed 32-bit ISA.
In 1975 the widely available DRAM was the Intel 1103 1k*1b.
The 4kb DRAMs were just making it to customers; 16kb were preliminary.
Looking at the instruction set usage of VAX in
Measurement and Analysis of Instruction Use in VAX 780, 1982
https://dl.acm.org/doi/pdf/10.1145/1067649.801709
we see that the top 25 instructions covers about 80-90% of the usage,
and many of them would fit into 2 or 3 bytes.
A fixed 32-bit instruction would waste 1 to 2 bytes on most instructions.
But a fixed 32-bit instruction is very much easier to fetch and
decode, and needs a lot less logic for shifting prefetch buffers,
compared to, say, variable lengths of 1 to 12 bytes.
When code/density is the goal, a 16/32 RISC can do well.
Can note:
Maximizing code density often prefers fewer registers;
For 16-bit instructions, 8 or 16 registers is good;
8 is rather limiting;
32 registers uses too many bits (see the bit-budget sketch below).
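The bit-budget behind that rule of thumb, as a trivial sketch (mine):
#include <stdio.h>
int main(void)
{
    /* opcode bits left in a 16-bit encoding with two register fields */
    int regbits[] = {3, 4, 5};                 /* 8, 16, 32 registers */
    for (int i = 0; i < 3; i++)
        printf("%2d registers: 16 - 2*%d = %2d opcode bits left\n",
               1 << regbits[i], regbits[i], 16 - 2 * regbits[i]);
    return 0;   /* 10, 8 and 6 opcode bits respectively */
}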
Can note ISAs with 16 bit encodings:
PDP-11: 8 registers
M68K : 2x 8 (A and D)
MSP430: 16
Thumb : 8|16
RV-C : 8|32
SuperH: 16
XG1 : 16|32 (Mostly 16)
In my recent fiddling for trying to design a pair encoding for XG3, can
note the top-used instructions are mostly, it seems (non Ld/St):
ADD Rs, 0, Rd //MOV Rs, Rd
ADD X0, Imm, Rd //MOV Imm, Rd
ADDW Rs, 0, Rd //EXTS.L Rs, Rd
ADDW Rd, Imm, Rd //ADDW Imm, Rd
ADD Rd, Imm, Rd //ADD Imm, Rd
Followed by:
ADDWU Rs, 0, Rd //EXTU.L Rs, Rd
ADDWU Rd, Imm, Rd //ADDWu Imm, Rd
ADDW Rd, Rs, Rd //ADDW Rs, Rd
ADD Rd, Rs, Rd //ADD Rs, Rd
ADDWU Rd, Rs, Rd //ADDWU Rs, Rd
Most every other ALU instruction and usage pattern either follows a bit further behind or could not be expressed in a 16-bit op.
For Load/Store:
SD Rn, Disp(SP)
LD Rn, Disp(SP)
LW Rn, Disp(SP)
SW Rn, Disp(SP)
LD Rn, Disp(Rm)
LW Rn, Disp(Rm)
SD Rn, Disp(Rm)
SW Rn, Disp(Rm)
For registers, there is a split:
Leaf functions:
R10..R17, R28..R31 dominate.
Non-Leaf functions:
R10, R18..R27, R8/R9
For 3-bit configurations:
R8..R15 Reg3A
R18/R19, R20/R21, R26/R27, R10/R11 Reg3B
Reg3B was a bit hacky, but had similar hit rates while using less encoding space than a 4-bit R8..R23 (saving 1 bit in the relevant scenarios).
BGB wrote:
On 8/17/2025 12:35 PM, EricP wrote:
The question is whether in 1975 main memory is so expensive that
we cannot afford the wasted space of a fixed 32-bit ISA.
In 1975 the widely available DRAM was the Intel 1103 1k*1b.
The 4kb DRAMs were just making it to customers; 16kb were preliminary.
Looking at the instruction set usage of VAX in
Measurement and Analysis of Instruction Use in VAX 780, 1982
https://dl.acm.org/doi/pdf/10.1145/1067649.801709
we see that the top 25 instructions covers about 80-90% of the usage,
and many of them would fit into 2 or 3 bytes.
A fixed 32-bit instruction would waste 1 to 2 bytes on most
instructions.
But a fixed 32-bit instruction is very much easier to fetch and
decode, and needs a lot less logic for shifting prefetch buffers,
compared to, say, variable lengths of 1 to 12 bytes.
When code/density is the goal, a 16/32 RISC can do well.
Can note:
Maximizing code density often prefers fewer registers;
For 16-bit instructions, 8 or 16 registers is good;
8 is rather limiting;
32 registers uses too many bits.
I'm assuming 16 32-bit registers, plus a separate RIP.
The 74172 is a single chip 3 port 16*2b register file, 1R,1W,1RW.
With just 16 registers there would be no zero register.
The 4-bit register allows many 2-byte accumulate style instructions
(where a register is both source and dest)
8-bit opcode plus two 4-bit registers,
or a 12-bit opcode, one 4-bit register, and an immediate 1-8 bytes.
A flags register allows 2-byte short conditional branch instructions,
8-bit opcode and 8-bit offset. With no flags register the shortest conditional branch would be 3 bytes as it needs a register specifier.
If one is doing variable byte-length instructions, then the
highest-frequency instructions can be made as compact as possible,
e.g. an ADD with a 32-bit immediate in 6 bytes.
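Roughly what such encoders would look like (a sketch under my own
assumptions; the field layout, opcode widths and host-order immediate are
invented for illustration, not EricP's actual proposal):
#include <stdint.h>
#include <stddef.h>
#include <string.h>
/* 2-byte accumulator form: 8-bit opcode, two 4-bit register fields,
   with Rd doubling as source and destination. */
static size_t emit_acc2(uint8_t *out, uint8_t op, unsigned rd, unsigned rs)
{
    out[0] = op;
    out[1] = (uint8_t)((rd << 4) | (rs & 0xF));
    return 2;
}
/* 6-byte form: 12-bit opcode, one 4-bit register, 32-bit immediate. */
static size_t emit_op12_imm32(uint8_t *out, uint16_t op12, unsigned rd,
                              uint32_t imm)
{
    out[0] = (uint8_t)(op12 >> 4);                  /* high 8 opcode bits */
    out[1] = (uint8_t)(((op12 & 0xF) << 4) | (rd & 0xF));
    memcpy(out + 2, &imm, 4);                       /* immediate, host order */
    return 6;
}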
Can note ISAs with 16 bit encodings:
PDP-11: 8 registers
M68K : 2x 8 (A and D)
MSP430: 16
Thumb : 8|16
RV-C : 8|32
SuperH: 16
XG1 : 16|32 (Mostly 16)
The saving for fixed 32-bit instructions is that it only needs to
prefetch aligned 4 bytes ahead of the current instruction to maintain
1 decode per clock.
With variable length instructions from 1 to 12 bytes it could need
a 16 byte fetch buffer to maintain that decode rate.
And a 16 byte variable shifter (collapsing buffer) is much more logic.
I was thinking the variable instruction buffer shifter could be built
from tri-state buffers in a cross-bar rather than muxes.
The difference between supporting 16-bit-aligned variable-length instructions
and byte-aligned ones is that byte alignment doubles the number of tri-state
buffers.
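As a software caricature of that collapsing buffer (my sketch, not EricP's
circuit): each decode shifts the consumed bytes out and refills the tail;
the cost being pointed at is that in hardware this shift is a wide
byte-granular (or 16-bit-granular) crossbar rather than a memmove.
#include <stdint.h>
#include <string.h>
typedef struct {
    uint8_t  bytes[16];   /* 16-byte fetch window          */
    unsigned valid;       /* number of valid bytes present */
} fetchbuf;
/* Consume one decoded instruction of 'len' bytes, then top the buffer
   back up from the instruction stream. */
static void consume(fetchbuf *fb, unsigned len,
                    const uint8_t *stream, unsigned *pos)
{
    memmove(fb->bytes, fb->bytes + len, fb->valid - len);  /* collapse */
    fb->valid -= len;
    while (fb->valid < sizeof fb->bytes)                   /* refill   */
        fb->bytes[fb->valid++] = stream[(*pos)++];
}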
In my recent fiddling for trying to design a pair encoding for XG3,
can note the top-used instructions are mostly, it seems (non Ld/St):
ADD Rs, 0, Rd //MOV Rs, Rd
ADD X0, Imm, Rd //MOV Imm, Rd
ADDW Rs, 0, Rd //EXTS.L Rs, Rd
ADDW Rd, Imm, Rd //ADDW Imm, Rd
ADD Rd, Imm, Rd //ADD Imm, Rd
Followed by:
ADDWU Rs, 0, Rd //EXTU.L Rs, Rd
ADDWU Rd, Imm, Rd //ADDWu Imm, Rd
ADDW Rd, Rs, Rd //ADDW Rs, Rd
ADD Rd, Rs, Rd //ADD Rs, Rd
ADDWU Rd, Rs, Rd //ADDWU Rs, Rd
Most every other ALU instruction and usage pattern either follows a
bit further behind or could not be expressed in a 16-bit op.
For Load/Store:
SD Rn, Disp(SP)
LD Rn, Disp(SP)
LW Rn, Disp(SP)
SW Rn, Disp(SP)
LD Rn, Disp(Rm)
LW Rn, Disp(Rm)
SD Rn, Disp(Rm)
SW Rn, Disp(Rm)
For registers, there is a split:
Leaf functions:
R10..R17, R28..R31 dominate.
Non-Leaf functions:
R10, R18..R27, R8/R9
For 3-bit configurations:
R8..R15 Reg3A
R18/R19, R20/R21, R26/R27, R10/R11 Reg3B
Reg3B was a bit hacky, but had similar hit rates while using less
encoding space than a 4-bit R8..R23 (saving 1 bit in the relevant
scenarios).
Let me reformulate my position a bit: clearly in 1977 some RISC
design was possible. But probably it would be something
even more primitive than Berkeley RISC. Putting in hardware
things that later RISC designs put in hardware almost surely would
exceed allowed cost. Technically at 1 mln transistors one should
be able to do acceptable RISC and IIUC IBM 360/90 used about
1 mln transistors in less dense technology, so in 1977 it was
possible to do 1 mln transistor machine.
On Sun, 20 Jul 2025 17:28:37 +0000, MitchAlsup1 wrote:
I do agree with some of what Mill does, including placing the preserved registers in memory where they cannot be damaged.
My 66000 calls this mode of operation "safe stack".
This sounds like an idea worth stealing, although no doubt the way I
would attempt to copy it would be a failure which removed all the
usefulness of it.
For one thing, I don't have a stack for calling subroutines, or any other purpose.
But I could easily add a feature where a mode is turned on, and instead of using the registers, it works off of a workspace pointer, like the TI 9900.
The trouble is, though, that this would be an extremely slow mode. When registers are _saved_, they're already saved to memory, as I can't think
of anywhere else to save them. (There might be multiple sets of registers, for things like SMT, but *not* for user vs supervisor or anything like
that.)
So I've probably completely misunderstood you here.
John Savard
Waldek Hebisch <antispam@fricas.org> schrieb:
Let me reformulate my position a bit: clearly in 1977 some RISC
design was possible. But probably it would be something
even more primitive than Berkeley RISC. Putting in hardware
things that later RISC designs put in hardware almost surely would
exceed allowed cost. Technically at 1 mln transistors one should
be able to do acceptable RISC and IIUC IBM 360/90 used about
1 mln transistors in less dense technology, so in 1977 it was
possible to do 1 mln transistor machine.
HUH? That is more than an order of magnitude more than what is needed
for a RISC chip.
Consider ARM2, which had 27000 transistors and which is sort of
the minimum RISC design you can manage (although it had a Booth
multiplier).
An ARMv2 implementation with added I and D cache, plus virtual
memory, would not have been the ideal design (too few registers, too
many bits wasted on conditional execution, ...) but it would have
run rings around the VAX.
Thomas Koenig <tkoenig@netcologne.de> wrote:
Waldek Hebisch <antispam@fricas.org> schrieb:
Let me reformulate my position a bit: clearly in 1977 some RISC
design was possible. But probably it would be something
even more primitive than Berkeley RISC. Putting in hardware
things that later RISC designs put in hardware almost surely would
exceed allowed cost. Technically at 1 mln transistors one should
be able to do acceptable RISC and IIUC IBM 360/90 used about
1 mln transistors in less dense technology, so in 1977 it was
possible to do 1 mln transistor machine.
HUH? That is more than an order of magnitude more than what is needed
for a RISC chip.
Consider ARM2, which had 27000 transistors and which is sort of
the minimum RISC design you can manage (although it had a Booth
multiplier).
An ARMv2 implementation with added I and D cache, plus virtual
memory, would not have been the ideal design (too few registers, too
many bits wasted on conditional execution, ...) but it would have
run rings around the VAX.
1 mln transistors is an upper estimate. But the low numbers given
for early RISC chips are IMO misleading: RISC became commercially
viable for high-end machines only in later generations, when
designers added a few "expensive" instructions.
Also, to fit
the design into a single chip, designers moved some functionality
like the bus interface to support chips. A RISC processor with
mixed 16-32 bit instructions (needed to get reasonable code
density), hardware multiply and FPU, including cache controller,
paging hardware and memory controller, is much more than the
100 thousand transistors cited for early workstation chips.