• Re: Why I've Dropped In

    From John Savard@quadibloc@invalid.invalid to comp.arch on Tue Jul 22 04:30:28 2025
    From Newsgroup: comp.arch

    On Tue, 10 Jun 2025 22:53:27 +0000, quadibloc wrote:

    Include pairs of short instructions as part of the ISA, but make the
    short instructions 14 bits long instead of 15 so they get only 1/16 of
    the opcode space. This way, the compromise is placed in something that's
    less important. In the CISC mode, 17-bit short instructions will still
    be present, after all.

    After this change, I have been busily making minor tweaks to the ISA.

    The latest one involved a header format which allowed room for fourteen alternate 17-bit short instructions in a block, in order to permit
    a higher level of superscalar operation.

    I made opcode space for this header by using two opcodes from the standard memory-reference instruction set for it; they were the ones formerly used
    for load address and jump to subroutine with offset.

    I was not happy with doing this, however. Right now, I am engaging in a
    mighty struggle to squeeze the available opcode space to avoid doing this. However, try as I may, it may well be that the cost of this will turn out
    to be too great. But if I can manage it, a significant restructuring of
    the opcodes of this iteration of Concertina II may be coming soon.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Jul 21 22:02:28 2025
    From Newsgroup: comp.arch

    On 7/21/2025 12:56 PM, Scott Lurndal wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 7/21/2025 8:45 AM, John Savard wrote:
    On Sun, 20 Jul 2025 22:27:27 -0700, Stephen Fuld wrote:

    But independent of that, I do miss Ivan's posts in this newsgroup, even
    if they aren't about the Mill. I do hope he can find time to post at
    least occasionally.

    Although I agree, I am also satisfied as long as he is well and healthy.
    If he can't waste time with USENET for now, that is all right with me.

    But I am instead concerned if he is unable to find funding to make any
    progress with the Mill, given that it appears to have been a very promising
    project. That is much more important.

    Based on the posts at the link I posted above, they are making progress,
    albeit quite slowly. I understand the patents issue, as they require
    real money. But I thought their model of doing work for a share of the
    possible eventual profits, if any, would attract enough people to get
    the work done. After all, there are lots of people who contribute to
    many open source projects for no monetary return at all. And the Mill
    needs only a few people. But apparently, I was wrong.

    It's easy to underestimate the resources required to bring a new
    processor architecture to a point where it makes sense to build
    a test chip. Then to optimize the design for the target node.

    I get the impression, from the kind of people they are looking for,
    that they are concentrating on the software side. They are working on
    Verilog, but more on porting SW tools.



    That's just the hardware side. Then there is the software infrastructure (processor ABI, processor-specific library code, etc.).

    Yes. I think that is what they are concentrating on now.

    Not to mention
    marketing and hotchips.

    No real marketing effort yet.



    Looking at the webpage, the belt seems to have some characteristics
    in common with stack-based architectures, bringing to mind Burroughs
    large systems and the HP-3000.

    IMHO, only sort of. The Burroughs large systems are true stack based
    systems, with a real HW stack, etc. While, if you squint enough, the
    Mill has sort of a stack, it has enough differences to be a totally
    different thing.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Tue Jul 22 16:29:28 2025
    From Newsgroup: comp.arch

    On Tue, 22 Jul 2025 04:30:28 +0000, John Savard wrote:

    However, try as I may, it may well be that the cost of this will turn
    out to be too great. But if I can manage it, a significant restructuring
    of the opcodes of this iteration of Concertina II may be coming soon.

    I have now revised my pages on Concertina II to reflect this latest
    change. Its most shocking result is that the three-operand arithmetic instructions in the basic 32-bit instruction set now only have six-bit opcodes. However, this didn't actually result in the omission of any
    useful instructions that had been defined for them when they had seven-bit opcodes.

    And the header mechanism, of course, allows the instruction set to be
    massively extended. Thus, I shouldn't really view this as an unacceptable
    cost requiring me to do a major rollback of the design... I think.

    But I'm not sure; cramming more and more stuff in has brought me to a point
    of being uneasy.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Wed Jul 23 16:03:33 2025
    From Newsgroup: comp.arch

    On Fri, 20 Jun 2025 19:46:42 +0000, quadibloc wrote:

    More importantly, I need 256-character strings if I'm using them as
    translate tables. Fine, I can use a pair of registers for a long string.

    I've realized now that I can have eight 256-character string registers
    if I instead use the extended register bank of 128 floating-point
    registers for the string registers; this provides another use for a
    set of registers that would otherwise be little used outside of VLIW
    code.
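
    To make the translate-table use concrete, here is a minimal C sketch (my
    illustration, not code from the Concertina II pages) of the classic
    table-driven byte translation such a 256-entry register would hold; the
    function and table names are made up. Every input byte indexes the table,
    which is why the table has to cover all 256 possible byte values.

    #include <stddef.h>
    #include <stdint.h>

    /* Minimal sketch of table-driven byte translation. */
    void translate_bytes(uint8_t *buf, size_t len, const uint8_t table[256])
    {
        for (size_t i = 0; i < len; i++)
            buf[i] = table[buf[i]];   /* each byte value selects one of 256 entries */
    }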

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Thu Jul 24 10:47:09 2025
    From Newsgroup: comp.arch

    On Sat, 14 Jun 2025 09:24:02 -0700, Stephen Fuld wrote:

    On S/360, that is exactly what you did. The first instruction in an assembler program was typically BALR (Branch and Link Register), which
    is essentially a subroutine call. You would BALR to the next
    instruction which put that instruction's address into a register.

    That's almost right. However, you can't really "BALR to the next
    instruction", because BALR is a register-to-register instruction.
    Therefore, it doesn't reference memory.

    It's the register-to-register version of BAL, the jump to subroutine instruction (Branch and Link), and because of that, it doesn't do any branching, and has no branch target.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Thu Jul 24 11:07:43 2025
    From Newsgroup: comp.arch

    On Tue, 22 Jul 2025 16:29:28 +0000, John Savard wrote:

    But I'm not sure; cramming more and more stuff in has brought me to a
    point of being uneasy.

    I am now feeling more comfortable with the state that the main
    32-bit instruction set without the use of headers is in. I dare
    not remove anything more from it, but I see no good way to enlarge
    it to put things back in.

    Except one. Any unused opcode space would still allow me to assign
    two bit combinations to the first and second 32-bit parts of a
    64-bit instruction that is available without block structure. This
    might be very inefficient, but it would allow non-block programs to still include things like the supplementary memory-reference instructions,
    allowing access to special types like register packed decimal and
    simple floating without using blocks. I still do think it would be
    highly desirable to make the full architecture available without the
    block structure if at all possible.

    However, at least some of the types of header are not so intrinsically
    complicated as to make a compiler impossible. The code generator can
    produce a string of instructions, and then a packager can note whether any
    of the instructions require a header, and then apply a header to the block
    such an instruction falls in.

    The only problematic case is where such an instruction turns up in the
    final position of a block; a header would bump it into the next block.
    This could be dealt with by a NOP. Using an un-needed header in the
    preceding block, however, is normally preferred, as a NOP, being an instruction, would always take a little time (particularly as, not being
    within a block with a header, there would be no indication directing it
    to be executed in parallel with other instructions).
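
    A rough C sketch of that packaging pass (my sketch, not Concertina II's
    actual format or tooling): the block size of eight 32-bit slots, a header
    occupying the first slot, and the helpers needs_header()/make_header() are
    assumptions made purely for illustration. Branch targets, alignment, and
    the NOP-versus-preceding-header choice at block boundaries are ignored.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define SLOTS_PER_BLOCK 8                 /* assumed block size */

    extern bool needs_header(uint32_t insn);  /* does this insn need a header? */
    extern uint32_t make_header(const uint32_t *insns, size_t n);

    /* Greedily repack a flat instruction stream into blocks, emitting a
       header word in front of any block that holds an instruction which
       requires one.  Bounds checking of 'out' is omitted. */
    size_t package_blocks(const uint32_t *in, size_t n_in, uint32_t *out)
    {
        size_t n_out = 0, i = 0;
        while (i < n_in) {
            size_t take = n_in - i;
            if (take > SLOTS_PER_BLOCK)
                take = SLOTS_PER_BLOCK;

            bool header = false;
            for (size_t k = 0; k < take; k++)
                if (needs_header(in[i + k]))
                    header = true;

            if (header) {
                /* The header occupies one slot, so one fewer instruction
                   fits; anything bumped out is reconsidered next block. */
                if (take > SLOTS_PER_BLOCK - 1)
                    take = SLOTS_PER_BLOCK - 1;
                out[n_out++] = make_header(&in[i], take);
            }
            for (size_t k = 0; k < take; k++)
                out[n_out++] = in[i + k];
            i += take;
        }
        return n_out;
    }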

    Surely, though, that's a sufficiently standard technique that compilers
    for many historical architectures have done this. Writing a compiler for
    the Itanium, I would think, would be more difficult than writing one to
    make use of my architecture. Particularly as the basic no-header
    instruction set includes all the instructions a compiler would typically
    use (but not the string instructions, which, say, a COBOL compiler would generate, rather than leaving inside standard subroutines).

    Now that the "jump to subroutine with offset" instruction is a special
    case, present with no headers, but still block-aware in that it can't
    begin a block, compilers do have to work a bit - replacing a normal
    jump to subroutine, if it isn't indexed, by a JSO where one would be
    helpful to skip over stuff between blocks. Again, that should be
    trivial - it does mean, though, that some code would likely have to be
    written, as it wouldn't mesh with the prefabricated code generator of an
    existing compiler kit. So even if "writing a compiler" isn't particularly hard,
    "porting GCC to the architecture" - or, for that matter, porting LLVM -
    might involve difficulties.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Thu Jul 24 16:45:33 2025
    From Newsgroup: comp.arch

    On Thu, 24 Jul 2025 11:07:43 +0000, John Savard wrote:

    Except one. Any unused opcode space would still allow me to assign two
    bit combinations to the first and second 32-bit parts of a 64-bit
    instruction that is available without block structure. This might be
    very inefficient,

    I have now added that in. The level of inefficiency, though, was so high
    that I couldn't fit some of the instructions I would have liked to include
    in these 64-bit instructions... so I resorted to a very desperate measure
    to make it possible.

    Gaze upon the bottom of this page

    http://www.quadibloc.com/arch/cad0102.htm

    if you dare.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Thu Jul 24 20:04:43 2025
    From Newsgroup: comp.arch

    On Thu, 24 Jul 2025 16:45:33 +0000, John Savard wrote:

    I have now added that in. As the level of inefficiency, though, was so
    high that I couldn't put some of the instructions I would have liked to include in these 64-bit instructions... I resorted to a very desperate measure to make it possible.

    At least, a way to simplify that desperate measure finally occurred to me.

    Initially, I tried to just move the front of the instructions, while
    keeping things aligned, so different instructions were changed in different ways between the two cases.

    Then the obvious dawned on me.

    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Sat Jul 26 05:57:49 2025
    From Newsgroup: comp.arch

    On Thu, 22 May 2025 17:42:14 +0000, MitchAlsup1 wrote:
    On Thu, 22 May 2025 6:51:05 +0000, David Chmelik wrote:

    What is Concertina 2?

    Roughly speaking, it is a design where most of the non-power of 2 data
    types are being supported {36-bits, 48-bits, 60-bits} along with the
    standard power of 2 lengths {8, 16, 32, 64}.

    As this is such a fondly remembered feature, I have finally gotten
    around to adding one header type to the ISA that enables it. I do,
    however, carefully note that this is a highly specialized feature,
    and thus it is not expected to be included in most implementations.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Sat Jul 26 06:14:47 2025
    From Newsgroup: comp.arch

    On Thu, 22 May 2025 17:42:14 +0000, MitchAlsup1 wrote:

    This creates "interesting" situations with respect to instruction
    formatting and to the requirements of constants in support of those instructions; and interesting requirements in other areas of ISA.

    Oh, there are indeed challenges, but they're hardly insurmountable.

    Compilers are the obvious case. Since the instruction set is built
    around 32-bit instructions, obviously the architecture will need to
    be running in conventional mode for compilation.

    The data width is, of course, specified by the block header. It
    isn't a switchable mode. So a program can have memory allocated to
    it of different widths, put pointers to those regions of memory in
    different base registers, and include code operating on data of
    those various lengths.

    So the compiler can call subroutines designed to craft things like
    36-bit floats for inclusion in object modules. From data placed in
    registers by normal code.

    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Mon Jul 28 23:18:52 2025
    From Newsgroup: comp.arch

    On Sat, 14 Jun 2025 17:00:08 +0000, MitchAlsup1 wrote:

    VAX tried too hard in my opinion to close the semantic gap.
    Any operand could be accessed with any address mode. Now while this
    makes the puny 16-register file seem larger,
    what VAX designers forgot, is that each address mode was an instruction
    in its own right.

    So, VAX shot at minimum instruction count, and purposely miscounted
    address modes not equal to %k as free.

    Fancy addressing modes certainly aren't _free_. However, they are,
    in my opinion, often cheaper than achieving the same thing with an
    extra instruction.

    So it makes sense to add an addressing mode _if_ what that addressing
    mode does is pretty common.

    That being said, though, designing a new machine today like the VAX
    would be a huge mistake.

    But the VAX, in its day, was very successful. And I don't think that
    this was just a result of riding on the coattails of the huge popularity
    of the PDP-11. It was a good match to the technology *of its time*,
    that being machines that were implemented using microcode.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Jul 28 22:56:33 2025
    From Newsgroup: comp.arch

    On 7/28/2025 6:18 PM, John Savard wrote:
    On Sat, 14 Jun 2025 17:00:08 +0000, MitchAlsup1 wrote:

    VAX tried too hard in my opinion to close the semantic gap.
    Any operand could be accessed with any address mode. Now while this
    makes the puny 16-register file seem larger,
    what VAX designers forgot, is that each address mode was an instruction
    in its own right.

    So, VAX shot at minimum instruction count, and purposely miscounted
    address modes not equal to %k as free.

    Fancy addressing modes certainly aren't _free_. However, they are,
    in my opinion, often cheaper than achieving the same thing with an
    extra instruction.

    So it makes sense to add an addressing mode _if_ what that addressing
    mode does is pretty common.


    The use of addressing modes drops off pretty sharply though.

    Like, if one could stat it out, one might see a static-use pattern
    something like:
    80%: [Rb+disp]
    15%: [Rb+Ri*Sc]
    3%: (Rb)+ / -(Rb)
    1%: [Rb+Ri*Sc+Disp]
    <1%: Everything else

    Though, I am counting [PC+Disp] and [GP+Disp] as part of [Rb+Disp] here.
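
    For illustration (my example, not BGB's data), these are the kinds of
    source constructs that typically account for the two dominant modes above;
    the bracketed comments are generic assembly-like notation, not any
    particular ISA:

    struct point { int x, y; };

    int field(struct point *p)    /* [Rb+disp]:  load from p plus a small constant offset */
    {
        return p->y;
    }

    int element(int *arr, int i)  /* [Rb+Ri*Sc]: load from arr plus i scaled by 4 */
    {
        return arr[i];
    }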

    Granted, the dominance of [Rb+Disp] does drop off slightly when
    considering dynamic instruction use. Part of it is due to the
    prolog/epilog sequences.

    If one had instead used (SP)+ and -(SP) addressing for prologs and
    epilogs, then one might see around 20% or so going to these instead.
    Or, if one had PUSH/POP, to PUSH/POP.

    The discrepancy between static and dynamic instruction counts is then
    mostly due to things like loops and similar.

    Estimating the effect of loops in a compiler is hard, but I had noted that
    assuming a scale factor of around 1.5^D for loop nesting depth (D)
    seemed to be in the right area. Many loops never reach large iteration
    counts, or only run a few times, so, perhaps counter-intuitively, it is
    often faster to assume that a loop body will only cycle 2 or 3 times
    rather than hundreds or thousands of times; trying to aggressively
    optimize loops by assuming large N tends to be detrimental to performance.
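
    A minimal sketch of that heuristic (not BGBCC's actual code): weight a
    basic block by roughly 1.5^D for its loop nesting depth D, rather than
    assuming hundreds or thousands of iterations per level.

    /* Estimated execution weight of a basic block at a given loop nesting
       depth, using the ~1.5^D heuristic described above. */
    double block_weight(int loop_depth)
    {
        double w = 1.0;
        for (int d = 0; d < loop_depth; d++)
            w *= 1.5;    /* assume each enclosing loop only runs a few times */
        return w;
    }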

    Well, and at least thus far, profiler-driven optimization isn't really a
    thing in my case.



    One could maybe argue for some LoadOp instructions, but even this is debatable. If the compiler is designed mostly for Load/Store, and the
    ISA has a lot of registers, the relative benefit of LoadOp is reduced.

    LoadOp being mostly a benefit if the value is loaded exactly once, and
    there is some other ALU operation or similar that can be fused with it.

    Practically, it limits the usefulness of LoadOp mostly to saving an instruction for things like:
    z=arr[i]+x;


    But, the relative incidence of things like this is low enough as to not
    save that much.
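
    For illustration (my example, with generic pseudo-assembly in the
    comments rather than any specific ISA's encoding), this is the narrow
    case where a LoadOp form saves exactly one instruction, and the common
    case where it saves nothing:

    int load_used_once(int *arr, int i, int x)
    {
        /* load/store:  LOAD t, [arr + i*4] ; ADD z, t, x   -- two instructions
           load-op:     ADD  z, [arr + i*4], x              -- one instruction */
        return arr[i] + x;
    }

    int operand_in_register(int y, int x)
    {
        /* With 32 or 64 registers, y is normally already in a register,
           so a memory-operand form buys nothing here. */
        return y + x;
    }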

    The other thing is that one has to implement it in a way that does not increase pipeline length, since if one makes the pipeline linger for
    sake of LoadOp or OpStore, then this is likely to be a net negative for performance vs prioritizing Load/Store, unless the pipeline had already
    needed to be lengthened for other reasons.


    One can be like, "But what if the local variables are not in registers?"
    but on a machine with 32 or 64 registers, most likely your local
    variable is already going to be in a register.

    So, the main potential merit of LoadOp being "doesn't hurt as bad on a register-starved machine".


    That being said, though, designing a new machine today like the VAX
    would be a huge mistake.

    But the VAX, in its day, was very successful. And I don't think that
    this was just a result of riding on the coattails of the huge popularity
    of the PDP-11. It was a good match to the technology *of its time*,
    that being machines that were implemented using microcode.


    Yeah.

    There are some living descendants of that family, but pretty much
    everything now is Reg/Mem or Load/Store with a greatly reduced set of addressing modes.


    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Jul 29 08:45:14 2025
    From Newsgroup: comp.arch

    John Savard <quadibloc@invalid.invalid> writes:
    But the VAX, in its day, was very successful. And I don't think that
    this was just a result of riding on the coattails of the huge popularity
    of the PDP-11. It was a good match to the technology *of its time*,
    that being machines that were implemented using microcode.

    Microcode may have been a good thing somewhat earlier when ROM or the
    writable control store (WCS) could be run at speeds much higher than
    core memory (how was the WCS actually implemented?), but core memory
    had been replaced by semiconductor DRAM by the time the VAX was
    introduced, and that was faster (already the Nova 800 of 1971 had an
    800ns cycle, and Acorn managed to access DRAM at 8MHz (but only when
    staying within the same row) in 1987); my guess is that in the VAX
    11/780 timeframe, 2-3MHz DRAM access within a row would have been
    possible. Moreover, the VAX 11/780 has a cache (it also has a WCS).
    So going for microcode no longer was the best choice for the VAX, but
    neither the VAX designers nor their competition realized this, and
    commercial RISCs only appeared in 1986.

    Nevertheless, if I time-traveled to the start of the VAX design, and
    was put in charge of designing the VAX, I would design a RISC, and I
    am sure that it would outperform the actual VAX 11/780 by at least a
    factor of 2. So no, I don't think that the VAX architecture was a
    good match for the technology of the time.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Tue Jul 29 16:44:35 2025
    From Newsgroup: comp.arch

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    So going for microcode no longer was the best choice for the VAX, but
    neither the VAX designers nor their competition realized this, and
    commercial RISCs only appeared in 1986.

    That is certainly true but there were other mistakes too. One is that
    they underestimated how cheap memory would get, leading to the overcomplex instruction and address modes and the tiny 512 byte page size.

    Another, which is not entirely their fault, is that they did not expect compilers to improve as fast as they did, leading to a machine which was fun to program in assembler but full of stuff that was useless to compilers and instructions like POLY that should have been subroutines. The 801 project and PL.8 compiler were well underway at IBM by the time the VAX shipped, but DEC presumably didn't know about it.

    Related to the microcode issue they also don't seem to have anticipated how important pipelining would be. Some minor changes to the VAX, like not letting one address modify another in the same instruction, would have made it a lot easier to pipeline.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Jul 30 05:59:18 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    So going for microcode no longer was the best choice for the VAX, but
    neither the VAX designers nor their competition realized this, and
    commercial RISCs only appeared in 1986.

    That is certainly true but there were other mistakes too. One is that
    they underestimated how cheap memory would get, leading to the overcomplex
    instruction and address modes and the tiny 512 byte page size.

    Concerning code density, while VAX code is compact, RISC-V code with the
    C extension is more compact
    <2025Mar4.093916@mips.complang.tuwien.ac.at>, so in our time-traveling
    scenario that would not be a reason for going for the VAX ISA.

    Another aspect from those measurements is that the 68k instruction set
    (with only one memory operand for any compute instructions, and 16-bit granularity) has a code density similar to the VAX.

    Another, which is not entirely their fault, is that they did not expect
    compilers to improve as fast as they did, leading to a machine which was fun to
    program in assembler but full of stuff that was useless to compilers and
    instructions like POLY that should have been subroutines. The 801 project and
    PL.8 compiler were well underway at IBM by the time the VAX shipped, but DEC
    presumably didn't know about it.

    DEC probably was aware from the work of William Wulf and his students
    what optimizing compilers can do and how to write them. After all,
    they used his language BLISS and its compiler themselves.

    POLY would have made sense in a world where microcode makes sense: If
    microcode can be executed faster than subroutines, put a building
    stone for transcendental library functions into microcode. Of course,
    given that microcode no longer made sense for VAX, POLY did not make
    sense for it, either.
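
    For readers who have not met it, POLY evaluates a polynomial from a
    coefficient table by Horner's rule, the usual building block for
    transcendental approximations. A plain C version of the same computation,
    roughly the subroutine John Levine argues it should have been (the
    coefficient ordering here is this sketch's convention, not necessarily
    the VAX table layout):

    /* Horner's-rule polynomial evaluation; coef[0] is the highest-order
       coefficient in this sketch. */
    double poly_eval(double x, const double *coef, int degree)
    {
        double r = coef[0];
        for (int i = 1; i <= degree; i++)
            r = r * x + coef[i];
        return r;
    }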

    Related to the microcode issue they also don't seem to have anticipated how
    important pipelining would be. Some minor changes to the VAX, like not letting
    one address modify another in the same instruction, would have made it a lot
    easier to pipeline.

    My RISC alternative to the VAX 11/780 (RISC-VAX) would probably have
    to use pipelining (maybe a three-stage pipeline like the first ARM) to
    achieve its clock rate goals; that would eat up some of the savings in implementation complexity that avoiding the actual VAX would have
    given us.

    Another issue would be is how to implement the PDP-11 emulation mode.
    I would add a PDP-11 decoder (as the actual VAX 11/780 probably has)
    that would decode PDP-11 code into RISC-VAX instructions, or into what
    RISC-VAX instructions are decoded into. The cost of that is probably
    similar to that in the actual VAX 11/780. If the RISC-VAX ISA has a MIPS/Alpha/RISC-V-like handling of conditions, the common microcode
    would have to support both the PDP-11 and the RISC-VAX handling of
    conditions; probably not that expensive, but maybe one still would
    prefer a ARM/SPARC/HPPA-like handling of conditions.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Jul 30 04:02:19 2025
    From Newsgroup: comp.arch

    On 7/30/2025 12:59 AM, Anton Ertl wrote:
    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    So going for microcode no longer was the best choice for the VAX, but
    neither the VAX designers nor their competition realized this, and
    commercial RISCs only appeared in 1986.

    That is certainly true but there were other mistakes too. One is that
    they underestimated how cheap memory would get, leading to the overcomplex
    instruction and address modes and the tiny 512 byte page size.

    Concerning code density, while VAX code is compact, RISC-V code with the
    C extension is more compact
    <2025Mar4.093916@mips.complang.tuwien.ac.at>, so in our time-traveling scenario that would not be a reason for going for the VAX ISA.

    Another aspect from those measurements is that the 68k instruction set
    (with only one memory operand for any compute instructions, and 16-bit granularity) has a code density similar to the VAX.


    I can't say much for or against VAX, as I don't currently have any
    compilers that target it.

    But, if so, it would more speak for the weakness of VAX code density
    than the goodness of RISC-V.

    Where, at least in my testing, the front-runner for code density has not
    been RV64GC.

    Granted, the C extension does seem to give around a 20% size reduction
    vs RV64G; and RV64G is at least "not horrible".


    There is, however, a fairly notable size difference between RV32 and
    RV64 here, but I had usually been messing with RV64.

    Among other things, RV32 allows encoding Abs32 memory addressing in 2 instructions, which is not really an option in RV64 (except in
    fixed-address static-linked binaries).


    For most of my uses, the setup (with GCC) had effectively been an improvised
    static-PIE with my own C library. This does negatively affect
    code density for RV64G and RV64GC. IIRC, the typical pattern is to load an
    address from the GOT and then access globals through the loaded address.

    With BGBCC, all access to global variables is via the Global Pointer
    (generally a 32-bit offset relative to GP for RV64 and similar).
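
    Schematically (my illustration; the comments are approximate, not exact
    compiler output), the two global-access patterns being contrasted look
    like this:

    int g_counter;    /* some global variable */

    int read_counter(void)
    {
        /* GOT-based (typical PIC code on RV64):
               auipc a5, %got_pcrel_hi(g_counter)   -- address of the GOT entry
               ld    a5, %pcrel_lo(...)(a5)         -- load g_counter's address
               lw    a0, 0(a5)                      -- load the value
           GP-relative (roughly what BGBCC is described as doing, glossing
           over how a 32-bit displacement is encoded):
               lw    a0, offset_of_g_counter(gp)                              */
        return g_counter;
    }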



    If I were to put it on a ranking (for ISAs I have messed with), it would
    be, roughly (smallest first):
    i386 with VS2008 or GCC 3.x (*1)
    Thumb2 with GCC (*2)
    x86-64 (GCC, "-Os", etc)
    XG1
    XG2
    XG3
    RV64GC
    SH-4
    ARMv8
    RV64G
    X64 (Modern MSVC)
    (Presumably followed by many of the classic RISCs).

    *1: Compilers from the 2000s seem to be a high point for size
    optimization; more modern versions of both MSVC and GCC tend to do worse,
    but MSVC has taken the much bigger hit.
    Though, GCC 3.x also has another quirk: its build times are molasses
    (later versions of GCC got a little faster again).


    *2: Usual options for GCC were like "-Os -ffunction-sections
    -fdata-sections -Wl,--gc-sections ..."

    To be more accurate though, one needs to control for how the C library
    is linked. In my stuff, I had generally used a statically linked C library.
    Though, on i386 or Thumb, the binaries are non-functional in this case.

    Mostly using Doom as a reference case.


    Can't compare VAX or PDP or M68K because no compilers.
    Some tables I can find online imply M68K is in similar areas to i386
    here. More tables: i386 beats M68K, M68K beats VAX. SH4 also usually
    beats VAX. Seemingly about the only ISAs consistently worse than VAX are
    PowerPC and MIPS and similar. PDP-11 is seemingly often close to M68K.



    Between ISA variants on my core (code size, best first):
    XG1, XG2, XG3, RV64GC, RV64G
    Performance:
    XG2, XG3, RV64G, XG1, RV64GC
    Where XG1 gives the smallest binaries, but is also slowest.
    XG2 and XG3 currently fight for the speed crown.

    In my CPU design, both XG1 and RV64GC operate with a speed penalty. Full
    speed operation is only really achieved with strict 32-bit instruction alignment.


    For ratios of the ".text' section:
    Ratio between XG1 and XG2 is 109%.
    Ratio between XG1 and XG3 is 121%.
    Ratio between XG1 and RV64GC is 149%.
    Ratio between XG1 and RV64G is 179%.




    There is a very large size difference between x86-64 via GCC, and X64
    via MSVC, in that MSVC output tends to be significantly larger. Also
    "/Os" in modern MSVC doesn't really seem to work, seems to behave mostly
    like an alias to "/O1".


    Another, which is not entirely their fault, is that they did not expect
    compilers to improve as fast as they did, leading to a machine which was fun to
    program in assembler but full of stuff that was useless to compilers and
    instructions like POLY that should have been subroutines. The 801 project and
    PL.8 compiler were well underway at IBM by the time the VAX shipped, but DEC
    presumably didn't know about it.

    DEC probably was aware from the work of William Wulf and his students
    what optimizing compilers can do and how to write them. After all,
    they used his language BLISS and its compiler themselves.

    POLY would have made sense in a world where microcode makes sense: If microcode can be executed faster than subroutines, put a building
    stone for transcendental library functions into microcode. Of course,
    given that microcode no longer made sense for VAX, POLY did not make
    sense for it, either.

    Related to the microcode issue they also don't seem to have anticipated how
    important pipelining would be. Some minor changes to the VAX, like not letting
    one address modify another in the same instruction, would have made it a lot
    easier to pipeline.

    My RISC alternative to the VAX 11/780 (RISC-VAX) would probably have
    to use pipelining (maybe a three-stage pipeline like the first ARM) to achieve its clock rate goals; that would eat up some of the savings in implementation complexity that avoiding the actual VAX would have
    given us.

    Another issue would be is how to implement the PDP-11 emulation mode.
    I would add a PDP-11 decoder (as the actual VAX 11/780 probably has)
    that would decode PDP-11 code into RISC-VAX instructions, or into what RISC-VAX instructions are decoded into. The cost of that is probably
    similar to that in the actual VAX 11/780. If the RISC-VAX ISA has a MIPS/Alpha/RISC-V-like handling of conditions, the common microcode
    would have to support both the PDP-11 and the RISC-VAX handling of conditions; probably not that expensive, but maybe one still would
    prefer a ARM/SPARC/HPPA-like handling of conditions.

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Jul 30 16:24:40 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> schrieb:

    I can't say much for or against VAX, as I don't currently have any
    compilers that target it.

    If you want to look at code, godbolt has a few gcc versions for it.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lars Poulsen@lars@cleo.beagle-ears.com to comp.arch,alt.folklore.computers on Wed Jul 30 17:17:28 2025
    From Newsgroup: comp.arch

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    So going for microcode no longer was the best choice for the VAX, but
    neither the VAX designers nor their competition realized this, and
    commercial RISCs only appeared in 1986.

    John Levine <johnl@taugh.com> writes:
    That is certainly true but there were other mistakes too. One is that
    they underestimated how cheap memory would get, leading to the overcomplex
    instruction and address modes and the tiny 512 byte page size.

    On 2025-07-30, Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Concerning code density, while VAX code is compact, RISC-V code with the
    C extension is more compact
    <2025Mar4.093916@mips.complang.tuwien.ac.at>, so in our time-traveling
    scenario that would not be a reason for going for the VAX ISA.

    Another aspect from those measurements is that the 68k instruction set (with only one memory operand for any compute instructions, and 16-bit granularity) has a code density similar to the VAX.

    Another, which is not entirely their fault, is that they did not expect
    compilers to improve as fast as they did, leading to a machine which was fun to
    program in assembler but full of stuff that was useless to compilers and
    instructions like POLY that should have been subroutines. The 801 project and
    PL.8 compiler were well underway at IBM by the time the VAX shipped, but DEC
    presumably didn't know about it.

    DEC probably was aware from the work of William Wulf and his students
    what optimizing compilers can do and how to write them. After all,
    they used his language BLISS and its compiler themselves.

    POLY would have made sense in a world where microcode makes sense: If microcode can be executed faster than subroutines, put a building
    stone for transcendental library functions into microcode. Of course, given that microcode no longer made sense for VAX, POLY did not make
    sense for it, either.

    Related to the microcode issue they also don't seem to have anticipated how
    important pipelining would be. Some minor changes to the VAX, like not letting
    one address modify another in the same instruction, would have made it a lot
    easier to pipeline.

    My RISC alternative to the VAX 11/780 (RISC-VAX) would probably have
    to use pipelining (maybe a three-stage pipeline like the first ARM) to achieve its clock rate goals; that would eat up some of the savings in implementation complexity that avoiding the actual VAX would have
    given us.

    Another issue would be is how to implement the PDP-11 emulation mode.
    I would add a PDP-11 decoder (as the actual VAX 11/780 probably has)
    that would decode PDP-11 code into RISC-VAX instructions, or into what RISC-VAX instructions are decoded into. The cost of that is probably similar to that in the actual VAX 11/780. If the RISC-VAX ISA has a MIPS/Alpha/RISC-V-like handling of conditions, the common microcode
    would have to support both the PDP-11 and the RISC-VAX handling of conditions; probably not that expensive, but maybe one still would
    prefer a ARM/SPARC/HPPA-like handling of conditions.

    In the days of VAX-11/780, it was "obvious" that operating systems would
    be written in assembler in order to be efficient, and the instruction
    set allowed high productivity for writing systems programs in "native"
    code. Yes, UNIX - written in C - existed, but was not all that well
    known. DEC had developed BLISS in -11 and -10 variants and they decided
    to do a -32 for the VAX and a number of system utilities were written in BLISS-32, but I think that the BLISS-32 compiler was written in
    BLISS-10. This all had a feeling of experimentation. "It may be the
    future, but we are not there yet".

    As for a RISC-VAX: To little old naive me, it seems that it would have
    been possible to create an alternative microcode load that would be able
    to support a RISC ISA on the same hardware, if the idea had occurred to a
    well-connected group of graduate students. How good a RISC might have
    been feasible?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Jul 30 13:24:29 2025
    From Newsgroup: comp.arch

    On 7/30/2025 11:24 AM, Thomas Koenig wrote:
    BGB <cr88192@gmail.com> schrieb:

    I can't say much for or against VAX, as I don't currently have any
    compilers that target it.

    If you want to look at code, godbolt has a few gcc versions for it.


    ===

    Testing a function:
    u16 GANN_AddFp16SF(u16 va, u16 vb)
    {
        int fra, frb, frc;
        int sga, sgb, sgc;
        int exa, exb, exc;

        fra=2048|(va&127); exa=(va>>10)&31; sga=(va>>15)&1;
        frb=2048|(vb&127); exb=(vb>>10)&31; sgb=(vb>>15)&1;

        if(!exa)
            fra=0;
        if(!exb)
            frb=0;

        if((va&0x7FFF)>=(vb&0x7FFF))
        {
            sgc=sga;
            exc=exa;
            if((exa-exb)>=8)
                return(va);
            if(sga==sgb)
                { frc=fra+(frb>>(exa-exb)); }
            else
                { frc=fra-(frb>>(exa-exb)); }
        }else
        {
            sgc=sgb;
            exc=exb;
            if((exb-exa)>=8)
                return(vb);
            if(sga==sgb)
                { frc=frb+(fra>>(exb-exa)); }
            else
                { frc=frb-(fra>>(exb-exa)); }
        }

        if(!frc)
            return(0x0000);

        if(frc<0)
            { sgc=!sgc; frc=-frc; }
        if(frc&0x800)
            { exc++; frc=frc>>1; }
        if(!(frc&0x400))
        {
            exc--; frc=frc<<1;
            if(!(frc&0x400))
            {
                exc--; frc=frc<<1;
                while(!(frc&0x400))
                    { exc--; frc=frc<<1; }
            }
        }

        if(exc< 0)
            return(0x0000);
        if(exc>=31)
            return(0x7C00|(sgc<<15));
        return((frc&1023)|(exc<<10)|(sgc<<15));
    }



    === ASM Lines ===

    x86-64: 178
    VAX lines: 164
    RISC-V lines: 146
    XG3 lines: 156
    XG2 lines: 138
    But: + 18 + 22 (for compressed prolog/epilog)

    So, RISC-V has the fewest instructions in this example.
    XG2 would win, except that, counting the folded prolog/epilog, it does not.

    For the XG2 and XG3 examples, I needed to edit out a bunch of blank and
    debug-comment lines to match the same "style" as the others.


    In the XG2 example, BGBCC also saved/restored the most registers, so
    this isn't exactly a clear win for XG2.

    However, XG3 also seems to be using stack spill-and-fill for local
    variables, which isn't a win for XG3 either (it uses the same ABI as
    RISC-V, whereas GCC RISC-V managed to not do spill-and-fill).

    Technically, the XG3 mode also contains the RV64G instructions, so any
    loss vs RV here is more the compilers' fault.


    Can note performance in some past tests:
    XG2 tended to win performance-wise in the emulator.
    But, XG3 wins for performance in my Verilog CPU core.

    But, this example isn't really a win.

    Harder to say for exact ".text" sizes from these examples though...


    === x86-64

    "GANN_AddFp16SF":
    mov ecx, esi
    mov edx, edi
    mov r9d, esi
    push r12
    shr cx, 10
    shr dx, 15
    push rbp
    mov eax, edi
    mov esi, ecx
    mov r8d, ecx
    push rbx
    mov ecx, r9d
    movzx ebx, dx
    mov edx, r9d
    shr cx, 15
    mov r12d, edi
    and edx, 127
    mov ebp, r9d
    shr ax, 10
    and esi, 31
    or dh, 8
    and r8d, 31
    movzx r10d, cx
    and bp, 32767
    and r12w, 32767
    and ax, 31
    jne .L2
    test si, si
    jne .L43
    .L1:
    pop rbx
    mov eax, esi
    pop rbp
    pop r12
    ret
    .L2:
    mov r11d, edi
    movzx eax, ax
    mov edi, r11d
    and edi, 127
    or edi, 2048
    test si, si
    jne .L5
    cmp r12w, bp
    jnb .L44
    cmp ebx, r10d
    je .L41
    xor ecx, 1
    movzx r10d, cx
    .L41:
    sar edi
    mov eax, 1
    mov edx, edi
    .L10:
    sal eax, 10
    and dx, 1023
    mov esi, r10d
    pop rbx
    or edx, eax
    sal esi, 15
    pop rbp
    pop r12
    or esi, edx
    mov eax, esi
    ret
    .L5:
    cmp r12w, bp
    jb .L11
    mov ecx, eax
    mov esi, r11d
    sub ecx, r8d
    cmp ecx, 7
    jg .L1
    sar edx, cl
    cmp ebx, r10d
    je .L7
    sub edi, edx
    mov r10d, ebx
    .L14:
    xor esi, esi
    test edi, edi
    je .L1
    .L8:
    test edi, edi
    js .L17
    mov edx, edi
    and edx, 2048
    .L13:
    test edx, edx
    jne .L45
    test edi, 1024
    jne .L31
    lea edx, [rdi+rdi]
    test edi, 512
    je .L18
    sub eax, 1
    jmp .L10
    .L43:
    xor esi, esi
    cmp r12w, bp
    jnb .L1
    mov esi, r9d
    lea eax, [r8+1]
    sar edx
    test r9w, 24576
    jne .L1
    jmp .L10
    .L11:
    mov ecx, r8d
    mov esi, r9d
    sub ecx, eax
    cmp ecx, 7
    jg .L1
    sar edi, cl
    mov eax, r8d
    mov ecx, edi
    mov edi, edx
    sub edi, ecx
    cmp ebx, r10d
    jne .L14
    lea edi, [rdx+rcx]
    mov edx, edi
    and edx, 2048
    jmp .L13
    .L44:
    mov esi, r11d
    cmp eax, 7
    jg .L1
    cmp ebx, r10d
    je .L26
    mov r10d, ebx
    jmp .L8
    .L17:
    neg edi
    xor r10d, 1
    .L18:
    sub eax, 2
    lea edx, [0+rdi*4]
    and edi, 256
    jne .L22
    .L23:
    add edx, edx
    sub eax, 1
    test dh, 4
    je .L23
    .L22:
    xor esi, esi
    test eax, eax
    js .L1
    jmp .L10
    .L45:
    sar edi
    add eax, 1
    mov edx, edi
    .L20:
    cmp eax, 30
    jle .L10
    mov esi, r10d
    pop rbx
    pop rbp
    sal esi, 15
    pop r12
    or si, 31744
    mov eax, esi
    ret
    .L26:
    xor edx, edx
    .L7:
    add edi, edx
    mov r10d, ebx
    mov edx, edi
    and edx, 2048
    jmp .L13
    .L31:
    mov edx, edi
    jmp .L20


    === VAX

    GANN_AddFp16SF:
    subl2 $4,%sp
    movl 4(%ap),%r4
    rotl $22,4(%ap),%r1
    bicl2 $-32,%r1
    movzwl %r1,%r2
    rotl $17,%r4,%r6
    bicl2 $-2,%r6
    movzwl %r6,%r6
    movl 8(%ap),%r5
    rotl $22,8(%ap),%r0
    bicl2 $-32,%r0
    movzwl %r0,%r3
    rotl $17,%r5,%r7
    bicl2 $-2,%r7
    movzwl %r7,%r7
    tstw %r1
    jeql .L24
    bicl3 $-128,%r4,%r1
    bisl2 $2048,%r1
    bicw3 $32768,%r4,%r9
    bicw3 $32768,%r5,%r8
    tstw %r0
    jneq .L3
    .L54:
    cmpw %r9,%r8
    jlssu .L51
    clrl %r0
    .L4:
    subl3 %r3,%r2,%r3
    cmpl %r3,$7
    jgtr .L30
    mnegb %r3,%r3
    ashl %r3,%r0,%r3
    cmpl %r6,%r7
    jeql .L52
    subl2 %r3,%r1
    tstl %r1
    jeql .L32
    .L59:
    jlss .L53
    bicl3 $-2049,%r1,%r0
    jneq .L13
    .L15:
    bicl3 $-1025,%r1,%r0
    jneq .L16
    .L57:
    addl3 %r1,%r1,%r3
    bicl3 $-1025,%r3,%r4
    jneq .L17
    subl2 $2,%r2
    moval 0[%r1],%r1
    bicl3 $-1025,%r1,%r0
    jneq .L18
    .L19:
    decl %r2
    addl2 %r1,%r1
    bicl3 $-1025,%r1,%r0
    jeql .L19
    tstl %r2
    jgeq .L21
    .L32:
    clrw %r0
    ret
    .L24:
    clrl %r1
    bicw3 $32768,%r4,%r9
    bicw3 $32768,%r5,%r8
    tstw %r0
    jeql .L54
    .L3:
    bicl3 $-128,%r5,%r0
    bisl2 $2048,%r0
    cmpw %r9,%r8
    jgequ .L4
    subl3 %r2,%r3,%r2
    cmpl %r2,$7
    jleq .L55
    movw %r5,%r0
    ret
    .L51:
    mnegl %r2,%r2
    cmpl %r6,%r7
    jeql .L56
    mnegb %r2,%r2
    ashl %r2,%r1,%r2
    jeql .L1
    mnegl %r2,%r1
    clrl %r2
    xorl3 $1,%r7,%r6
    mnegl %r1,%r1
    .L61:
    bicl3 $-2049,%r1,%r0
    jeql .L15
    jbr .L13
    .L56:
    mnegb %r2,%r2
    ashl %r2,%r1,%r1
    jeql .L1
    clrl %r2
    .L13:
    incl %r2
    ashl $-1,%r1,%r1
    bicl3 $-1025,%r1,%r0
    jeql .L57
    .L16:
    cmpl %r2,$30
    jleq .L21
    ashl $15,%r6,%r0
    bisw2 $31744,%r0
    ret
    .L55:
    cmpl %r6,%r7
    jeql .L58
    mnegb %r2,%r2
    ashl %r2,%r1,%r2
    subl3 %r2,%r0,%r1
    movl %r3,%r2
    movl %r7,%r6
    tstl %r1
    jneq .L59
    jbr .L32
    .L17:
    decl %r2
    cmpl %r2,$-1
    jneq .L60
    .L1:
    ret
    .L18:
    tstl %r2
    jlss .L30
    .L21:
    bicw2 $64512,%r1
    ashl $10,%r2,%r2
    bisw2 %r2,%r1
    ashl $15,%r6,%r0
    bisw2 %r1,%r0
    ret
    .L30:
    movw %r4,%r0
    ret
    .L52:
    addl2 %r3,%r1
    jeql .L32
    bicl3 $-2049,%r1,%r0
    jeql .L15
    jbr .L13
    .L60:
    movl %r3,%r1
    jbr .L16
    .L53:
    movl %r6,%r7
    xorl3 $1,%r7,%r6
    mnegl %r1,%r1
    jbr .L61
    .L58:
    mnegb %r2,%r2
    ashl %r2,%r1,%r2
    addl3 %r2,%r0,%r1
    movl %r3,%r2
    bicl3 $-2049,%r1,%r0
    jeql .L15
    jbr .L13



    === RISC-V

    GANN_AddFp16SF:
    li a3,4096
    andi a4,a1,127
    addi a3,a3,-2048
    srliw a5,a0,10
    srliw a7,a1,10
    or a4,a4,a3
    mv a6,a0
    slli t3,a0,49
    slli t1,a1,49
    slliw a2,a4,16
    andi a0,a7,31
    srli t0,a1,15
    andi a5,a5,31
    srliw a2,a2,16
    mv a7,a0
    srli t3,t3,49
    srli t1,t1,49
    srli t6,a6,15
    mv t4,t0
    bne a5,zero,.L2
    bne a0,zero,.L44
    .L3:
    ret
    .L2:
    andi a4,a6,127
    or a4,a4,a3
    slliw a4,a4,16
    srliw t5,a4,16
    mv a4,t5
    sext.w a5,a5
    bne a0,zero,.L5
    bgeu t3,t1,.L45
    beq t6,t0,.L41
    xori t4,t0,1
    .L41:
    sraiw a4,a4,1
    li a5,1
    .L10:
    andi a0,a4,1023
    slliw a5,a5,10
    or a0,a0,a5
    slliw t4,t4,15
    or a0,a0,t4
    slli a0,a0,48
    srli a0,a0,48
    ret
    .L5:
    bltu t3,t1,.L11
    subw a7,a5,a0
    li a3,7
    mv a0,a6
    bgt a7,a3,.L3
    sraw a2,a2,a7
    mv t5,a2
    beq t6,t0,.L7
    subw t5,a4,a2
    mv t4,t6
    .L14:
    li a0,0
    beq t5,zero,.L3
    .L8:
    blt t5,zero,.L17
    li a3,4096
    addi a3,a3,-2048
    and a3,t5,a3
    .L13:
    bne a3,zero,.L46
    andi a4,t5,1024
    bne a4,zero,.L31
    slliw a4,t5,1
    andi a3,a4,1024
    beq a3,zero,.L18
    addiw a5,a5,-1
    j .L10
    .L44:
    li a0,0
    bgeu t3,t1,.L3
    srli a5,a1,13
    andi a5,a5,3
    mv a0,a1
    bne a5,zero,.L3
    addiw a5,a7,1
    sraiw a4,a2,1
    j .L10
    .L45:
    li a3,7
    mv a0,a6
    bgt a5,a3,.L3
    beq t6,t0,.L26
    mv t4,t6
    j .L8
    .L11:
    subw a5,a0,a5
    li a6,7
    mv a0,a1
    bgt a5,a6,.L3
    sraw a4,t5,a5
    subw t5,a2,a4
    mv a5,a7
    bne t6,t0,.L14
    addw t5,a2,a4
    and a3,t5,a3
    j .L13
    .L17:
    negw t5,t5
    xori t4,t4,1
    .L18:
    slliw a4,t5,2
    andi a3,a4,1024
    addiw a5,a5,-2
    bne a3,zero,.L22
    .L23:
    slliw a4,a4,1
    andi a3,a4,1024
    addiw a5,a5,-1
    beq a3,zero,.L23
    .L22:
    li a0,0
    bge a5,zero,.L10
    ret
    .L46:
    addiw a5,a5,1
    sraiw a4,t5,1
    .L20:
    li a3,30
    ble a5,a3,.L10
    andi a0,t4,1
    li a5,32768
    addi a5,a5,-1024
    slli a0,a0,15
    or a0,a0,a5
    ret
    .L26:
    li t5,0
    .L7:
    li a3,4096
    addw t5,a4,t5
    addi a3,a3,-2048
    mv t4,t6
    and a3,t5,a3
    j .L13
    .L31:
    mv a4,t5
    j .L20




    === XG3

    GANN_AddFp16SF:
    ADD R2, -64, R2
    MOV.X R20, (R2, 32)
    MOV.X R8, (R2, 0)
    MOV.X R26, (R2, 48)
    MOV.X R18, (R2, 16)
    ADD R2, -272, R2
    MOV.L RD10, (R2, 44)
    MOV.L RD11, (R2, 40)
    MOV.L (R2, 44), RD27
    AND RD27, 127, RQ26
    ADD RQ26, RQ0, RD8
    MOV 2048, RD13
    OR RD8, RD13, RD8
    SHAD.L RD27, -10, RQ26
    AND RQ26, 31, RD9
    SHAD.L RD27, -15, RQ26
    AND RQ26, 1, RD12
    MOV.L RD12, (R2, 32)
    MOV.L (R2, 40), RD11
    AND RD11, 127, RQ26
    MOV.L RD26, (R2, 36)
    MOV.L (R2, 36), RD10
    OR RD10, RD13, RD10
    MOV.L RD10, (R2, 36)
    SHAD.L RD11, -10, RQ26
    AND RQ26, 31, RD18
    SHAD.L RD11, -15, RQ26
    AND RQ26, 1, RD17
    MOV.L RD17, (R2, 28)
    BRNE.L R0, RD9, .L00800F15
    MOV 0, RD8
    .L00800F15:
    BRNE.L R0, RD18, .L00800F16
    MOV.L RD0, (R2, 36)
    .L00800F16:
    MOV.L (R2, 44), RD26
    MOV 32767, RD27
    AND RQ26, RD27, RQ26
    MOV.L (R2, 40), RD19
    AND RQ19, RD27, RQ19
    BRLT.L RQ19, RQ26, .L00800F17
    MOV.L (R2, 32), RD27
    ADD RD27, RQ0, RD13
    MOV.L RD13, (R2, 24)
    ADD RD9, RQ0, RD20
    SUBS.L RD9, RD18, RQ26
    MOV 8, RD12
    BRLT.L RD12, RQ26, .L00800F18
    MOV.L (R2, 44), RD10
    BSR .L00C008F0, R0
    .L00800F18:
    MOV.L (R2, 32), RD27
    MOV.L (R2, 28), RD13
    BRNE.Q RD13, RD27, .L00800F19
    SUBS.L RD9, RD18, RQ19
    MOV.L (R2, 36), RD27
    SHAR RD27, RQ19, RQ26
    ADDS.L RD8, RQ26, RD21
    BSR .L00800F1A, R0
    .L00800F19:
    SUBS.L RD9, RD18, RQ19
    MOV.L (R2, 36), RD27
    SHAR RD27, RQ19, RQ26
    SUBS.L RD8, RQ26, RD21
    .L00800F1A:
    BSR .L00800F1B, R0
    .L00800F17:
    MOV.L (R2, 28), RD27
    ADD RD27, RQ0, RD13
    MOV.L RD13, (R2, 24)
    ADD RD18, RQ0, RD20
    SUBS.L RD18, RD9, RQ19
    MOV 8, RD12
    BRLT.L RD12, RQ19, .L00800F1C
    MOV.L (R2, 40), RD10
    BSR .L00C008F0, R0
    .L00800F1C:
    MOV.L (R2, 32), RD27
    MOV.L (R2, 28), RD13
    BRNE.Q RD13, RD27, .L00800F1D
    SUBS.L RD18, RD9, RQ26
    SHAR RD8, RQ26, RQ19
    MOV.L (R2, 36), RD27
    ADDS.L RD27, RQ19, RD21
    BSR .L00800F1E, R0
    .L00800F1D:
    SUBS.L RD18, RD9, RQ26
    SHAR RD8, RQ26, RQ19
    MOV.L (R2, 36), RD27
    SUBS.L RD27, RQ19, RD21
    .L00800F1E:
    .L00800F1B:
    BREQ.L R0, RD21, .L00C00206
    BRGE.L R0, RD21, .L00800F1F
    MOV.L (R2, 24), RD27
    CMPEQ.Q RD27, 0, RD27
    MOV.L RD27, (R2, 24)
    SUBS.L R0, RD21, RD21
    .L00800F1F:
    MOV 2048, RD27
    BTST.L RD27, RD21, .L00800F20
    ADDS.L RD20, 1, RD20
    SHAD.L RD21, -1, RD21
    .L00800F20:
    MOV 1024, RD27
    BTSTN.L RD27, RD21, .L00800F21
    ADDS.L RD20, -1, RD20
    SHAD.L RD21, 1, RD21
    MOV 1024, RD27
    BTSTN.L RD27, RD21, .L00800F22
    ADDS.L RD20, -1, RD20
    SHAD.L RD21, 1, RD21
    .L00800F23:
    MOV 1024, RD27
    BTSTN.L RD27, RD21, .L00800F24
    ADDS.L RD20, -1, RD20
    SHAD.L RD21, 1, RD21
    BSR .L00800F23, R0
    .L00800F24:
    .L00800F22:
    .L00800F21:
    BRLT.L R0, RD20, .L00C00206
    MOV 31, RD27
    BRLT.L RD27, RD20, .L00800F26
    MOV.L (R2, 24), RD27
    SHAD.L RD27, 15, RQ26
    OR RQ26, 31744, RQ19
    EXTU.W RQ19, RQ26
    ADD RQ26, RQ0, RD10
    BSR .L00C008F0, R0
    .L00800F26:
    AND RD21, 1023, RQ19
    SHAD.L RD20, 10, RQ26
    OR RQ19, RQ26, RQ27
    MOV.L (R2, 24), RD13
    SHAD.L RD13, 15, RQ26
    OR RQ27, RQ26, RQ19
    EXTU.W RQ19, RQ27
    ADD RQ27, RQ0, RD10
    BSR .L00C008F0, R0
    .L00C00206:
    MOV 0, RQ10
    .L00C008F0:
    ADD R2, 272, R2
    MOV.X (R2, 0), R8
    MOV.X (R2, 16), R18
    MOV.X (R2, 32), R20
    MOV.X (R2, 48), R26
    ADD R2, 64, R2
    JSR R1, 0, R0
    ADD R1, RQ0, RQ6
    MOV.Q RQ6, (R4, 64)
    ADD RQ10, RQ0, R1
    BSR .L00C008F0, R0





    === XG2

    GANN_AddFp16SF:
    MOV LR, RQ1
    BSR __prolog_0005_C0FF0200FFFF
    ADD SP, -112, SP
    MOV RD4, RD29
    MOV RD5, RD28
    MOV 1, RD24
    MOV 1024, RD47
    MOV 0, RD46
    MOV 15, RD45
    MOV 2048, RD44
    MOV 31, RD43
    MOV 10, RD42
    MOV 127, RD41
    MOV 8, RD40
    MOV 31744, RD63
    MOV 1023, RD62
    AND RD29, 127, RQ14
    EXTS.L RQ14, RD8
    OR RD8, RD44, RD8
    SHAD.L RD29, -10, RQ14
    AND RQ14, 31, RD9
    SHAD.L RD29, -15, RQ14
    AND RQ14, 1, RD27
    AND RD28, 127, RQ14
    MOV RQ14, RD31
    OR RD31, RD44, RD31
    SHAD.L RD28, -10, RQ14
    AND RQ14, 31, RD10
    SHAD.L RD28, -15, RQ14
    AND RQ14, 1, RD26
    CMPEQ.L 0, RD9
    MOV?T RD46, RD8
    CMPEQ.L 0, RD10
    MOV?T RD46, RD31
    MOV RD29, RQ14
    MOV 32767, RD7
    AND RQ14, RD7, RQ14
    MOV RD28, RQ11
    AND RQ11, RD7, RQ11
    BRLT.L RQ11, RQ14, .L00800DEA
    MOV RD27, RD30
    EXTS.L RD9, RD12
    SUBS.L RD9, RD10, RQ14
    CMPGE.L 8, RQ14
    BT .L00800DEB
    MOV RD29, RD2
    BRA .L00C01071
    .L00800DEB:
    BRNE.Q RD26, RD27, .L00800DEC
    SUBS.L RD9, RD10, RQ11
    SHAR RD31, RQ11, RQ14
    ADDS.L RD8, RQ14, RD13
    BRA .L00800DED
    .L00800DEC:
    SUBS.L RD9, RD10, RQ11
    SHAR RD31, RQ11, RQ14
    SUBS.L RD8, RQ14, RD13
    .L00800DED:
    BRA .L00800DEE
    .L00800DEA:
    MOV RD26, RD30
    EXTS.L RD10, RD12
    SUBS.L RD10, RD9, RQ11
    CMPGE.L 8, RQ11
    BT .L00800DEF
    MOV RD28, RD2
    BRA .L00C01071
    .L00800DEF:
    BRNE.Q RD26, RD27, .L00800DF0
    SUBS.L RD10, RD9, RQ14
    SHAR RD8, RQ14, RQ11
    ADDS.L RD31, RQ11, RD13
    BRA .L00800DF1
    .L00800DF0:
    SUBS.L RD10, RD9, RQ14
    SHAR RD8, RQ14, RQ11
    SUBS.L RD31, RQ11, RD13
    .L00800DF1:
    .L00800DEE:
    BREQ.L RD13, .L00C0054A
    BRGE.L RD13, .L00800DF2
    CMPEQ.Q RD30, 0, RD30
    SUBS.L RD46, RD13, RD13
    .L00800DF2:
    TST.L RD44, RD13
    BT .L00800DF3
    ADDS.L RD12, 1, RD12
    SHAD.L RD13, -1, RD13
    .L00800DF3:
    TST.L RD47, RD13
    BT .L00800DF4
    ADDS.L RD12, -1, RD12
    SHAD.L RD13, 1, RD13
    TST.L RD47, RD13
    BT .L00800DF5
    ADDS.L RD12, -1, RD12
    SHAD.L RD13, 1, RD13
    BRA .L00800DF8
    .L00800DF6:
    ADDS.L RD12, -1, RD12
    SHAD.L RD13, 1, RD13
    .L00800DF8:
    TST.L RD47, RD13
    BT .L00800DF6
    .L00800DF7:
    .L00800DF5:
    .L00800DF4:
    BRLT.L RD12, .L00C0054A
    CMPGE.L 31, RD12
    BT .L00800DF9
    SHAD.L RD30, 15, RQ14
    EXTS.L RQ14, RQ11
    OR RQ11, RD63, RQ11
    EXTU.W RQ11, RQ14
    EXTU.W RQ14, RD2
    BRA .L00C01071
    .L00800DF9:
    EXTS.L RD13, RQ11
    AND RQ11, RD62, RQ11
    SHAD.L RD12, 10, RQ14
    OR RQ11, RQ14, RQ25
    SHAD.L RD30, 15, RQ14
    OR RQ25, RQ14, RQ11
    EXTU.W RQ11, RQ25
    MOV RQ25, RD2
    BRA .L00C01071
    .L00C0054A:
    MOV 0, RQ2
    .L00C01071:
    ADD SP, 112, SP
    BRA __epilog_0005_C0FF0200FFFF
    MOV.Q (SP, 312), RQ16
    MOV 64, R0
    MOV.Q RQ16, (TBR, DLR)
    MOV.Q RQ4, (SP, 312)
    BRA .L00C01071

    ==== XG2 penalty

    __prolog_0005_C0FF0200FFFF:
    ADD SP, -208, SP
    MOV.Q RQ1, (SP, 200)
    MOV.X R9, (SP, 120)
    MOV.X R10, (SP, 16)
    MOV.X R13, (SP, 152)
    MOV.X R26, (SP, 72)
    MOV.X R8, (SP, 0)
    MOV.X R24, (SP, 56)
    MOV.X R30, (SP, 104)
    MOV.Q R14, (SP, 48)
    MOV.X R12, (SP, 32)
    MOV.X R11, (SP, 136)
    MOV.X R31, (SP, 184)
    MOV.X R28, (SP, 88)
    MOV.Q R46, (SP, 168)
    MOV.Q R47, (SP, 176)
    RTSU
    __epilog_0005_C0FF0200FFFF:
    MOV.Q (SP, 200), RQ1
    MOV.X (SP, 0), R8
    MOV.X (SP, 16), R10
    MOV.X (SP, 32), R12
    MOV.Q (SP, 48), R14
    MOV.X (SP, 56), R24
    MOV.X (SP, 72), R26
    MOV.X (SP, 88), R28
    MOV.X (SP, 104), R30
    MOV.X (SP, 120), R9
    MOV.X (SP, 136), R11
    MOV.X (SP, 152), R13
    MOV.Q (SP, 168), R46
    MOV.Q (SP, 176), R47
    MOV.X (SP, 184), R31
    ADD SP, 208, SP
    JMP RQ1
    MOV.Q (SP, 312), RQ16
    MOV 64, R0
    MOV.Q RQ16, (TBR, DLR)
    MOV.Q RQ4, (SP, 312)
    BRA .L00C006F9

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Thu Jul 31 04:26:27 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    So going for microcode no longer was the best choice for the VAX, but
    neither the VAX designers nor their competition realized this, and
    commercial RISCs only appeared in 1986.

    That is certainly true but there were other mistakes too. One is that
    they underestimated how cheap memory would get, leading to the overcomplex
    instruction and address modes and the tiny 512 byte page size.

    Concerning code density, while VAX code is compact, RISC-V code with the
    C extension is more compact
    <2025Mar4.093916@mips.complang.tuwien.ac.at>, so in our time-traveling scenario that would not be a reason for going for the VAX ISA.

    Another aspect from those measurements is that the 68k instruction set
    (with only one memory operand for any compute instructions, and 16-bit granularity) has a code density similar to the VAX.

    Another, which is not entirely their fault, is that they did not expect
    compilers to improve as fast as they did, leading to a machine which was fun to
    program in assembler but full of stuff that was useless to compilers and
    instructions like POLY that should have been subroutines. The 801 project and
    PL.8 compiler were well underway at IBM by the time the VAX shipped, but DEC
    presumably didn't know about it.

    DEC probably was aware from the work of William Wulf and his students
    what optimizing compilers can do and how to write them. After all,
    they used his language BLISS and its compiler themselves.

    POLY would have made sense in a world where microcode makes sense: If microcode can be executed faster than subroutines, put a building
    stone for transcendental library functions into microcode. Of course,
    given that microcode no longer made sense for VAX, POLY did not make
    sense for it, either.

    IIUC the original idea was that POLY should be more accurate than a
    sequence of separate instructions and reproducible between models.

    Related to the microcode issue they also don't seem to have anticipated how important pipelining would be. Some minor changes to the VAX, like not letting
    one address modify another in the same instruction, would have made it a lot easier to pipeline.

    My RISC alternative to the VAX 11/780 (RISC-VAX) would probably have
    to use pipelining (maybe a three-stage pipeline like the first ARM) to achieve its clock rate goals; that would eat up some of the savings in implementation complexity that avoiding the actual VAX would have
    given us.

    Another issue would be is how to implement the PDP-11 emulation mode.
    I would add a PDP-11 decoder (as the actual VAX 11/780 probably has)
    that would decode PDP-11 code into RISC-VAX instructions, or into what RISC-VAX instructions are decoded into. The cost of that is probably
    similar to that in the actual VAX 11/780. If the RISC-VAX ISA has a MIPS/Alpha/RISC-V-like handling of conditions, the common microcode
    would have to support both the PDP-11 and the RISC-VAX handling of conditions; probably not that expensive, but maybe one still would
    prefer an ARM/SPARC/HPPA-like handling of conditions.

    I looked into the VAX architecture handbook from 1977. The handbook claims
    that the VAX-780 used 96-bit microcode words. That is enough bits to
    control a pipelined machine with 1 instruction per cycle, provided
    enough execution resources (register ports, buses and 1-cycle
    execution units). However, VAX hardware allowed only one memory
    access per cycle, so instructions with multiple memory addresses
    or using indirection through memory by necessity needed multiple
    cycles.

    I must admit that I do not understand why the VAX needed so many
    cycles per instruction. Namely, a register argument can be
    recognized by looking at the 4 high bits of the operand byte. That
    can be done using 2 inverters and a 4-input NAND gate. For normal
    instructions the lowest bit of the opcode seems to select between 2-
    and 3-operand instructions. For a 1-byte opcode with all
    register arguments the operand specifiers are in predictable places,
    so together a modest number of gates could recognize register-only
    operand specifiers. Of course, to be sure that this is a
    register instruction one needs to look at the opcode. I am
    guessing that the VAX fetches a microcode word based on the opcode,
    so this microcode word could conditionally (based on the result
    of the circuit mentioned above) pass the instruction to the pipeline
    and initiate processing of the next instruction, or start
    argument processing. Such a one-cycle conditional branch
    in general may be problematic, but I would be surprised if
    it were problematic for VAX microcode. Namely, it was
    usual for microcode to specify the address of the next microcode
    word. So with a pipeline and a small number of extra gates the
    VAX should be able to do register-only instructions in
    1 cycle. Escalating a bit, with a manageable number of
    gates one should be able to recognize operands of
    "deferred mode", "autodecrement mode" and "autoincrement mode".
    For each such input operand the microcode engine could
    insert a load into the pipeline and proceed with the rest of the
    instruction. Similarly, for a write operand the microcode
    could pass the instruction to the pipeline, but also pass a
    special bit changing the destination and insert a store
    after the instruction. Once a given memory operand is
    handled, decoding gates would indicate if this was the last
    memory operand, which would allow either going to the next
    instruction or handling the next memory operand. Together,
    for normal instructions each memory operand should add
    one cycle to the execution time. Also, short immediates
    could be handled in a similar way. This leaves some nasty
    cases: longer immediates, displacements and modes with
    double indirection. Displacements could probably be handled
    at the cost of an extra cycle. Other modes probably would
    cost a one or two cycle penalty.
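
    To make the mode test concrete, here is a rough C sketch of the
    classification I have in mind (a minimal model, not from any real
    implementation; the function names are made up):

    #include <stdint.h>
    #include <stdbool.h>

    /* A VAX operand specifier byte: high nibble = mode, low nibble = Rn. */
    static bool is_register_specifier(uint8_t spec)
    {
        /* Mode 0101b is "register mode"; testing for it is just the
           "2 inverters + 4-input NAND" circuit mentioned above. */
        return (spec >> 4) == 0x5;
    }

    static bool is_short_literal(uint8_t spec)
    {
        return (spec >> 4) <= 0x3;      /* modes 0..3: 6-bit literal */
    }

    static bool is_autoincrement(uint8_t spec)
    {
        return (spec >> 4) == 0x8;      /* (Rn)+; mode 9 is @(Rn)+ */
    }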

    To summarize, a VAX with a pipeline and a modest amount of operand
    decoders should be able to execute "normal" instructions
    at RISC speed (in RISC each memory operand would require a
    load or store, so an extra cycle, like the scheme above).

    Given the actual speed of the VAX, the possibilities seem to be:
    - extra factors slowing both VAX and RISC, like cache
    misses (the VAX architecture handbook says that due to
    misses the cache had an effective access time of 290 ns),
    - the VAX designers could not afford a pipeline,
    - maybe the VAX designers decided to avoid a pipeline to reduce
    complexity.

    If the VAX designers could not afford a pipeline, then it is
    not clear if a RISC could afford it: removing the microcode
    engine would reduce complexity and cost and give some
    free space. But microcode engines tend to be simple.
    Also, PDP-11 compatibility depended on microcode.
    Without a microcode engine one would need a parallel set
    of hardware instruction decoders, which could add
    more complexity than was saved by removing the microcode
    engine.

    To summarize, it is not clear to me if a RISC in VAX technology
    could be significantly faster than the VAX, especially given the constraint
    of PDP-11 compatibility. OTOH the VAX designers probably felt
    that the CISC nature added significant value: they understood
    that the cost of programming was significant and believed that
    an orthogonal instruction set, in particular allowing complex
    addressing on all operands, made programming simpler. They
    probably thought that providing reasonably common procedures
    as microcoded instructions made the work of programmers simpler
    even if the routines were only marginally faster than ordinary
    code. Part of this thinking was probably like the "future
    system" motivation at IBM: Digital did not want to produce
    "commodity" systems, they wanted something with unique
    features that customers would want to use. Without
    insight into the future it is hard to say that they were
    wrong.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Thu Jul 31 16:05:14 2025
    From Newsgroup: comp.arch

    According to Waldek Hebisch <antispam@fricas.org>:
    POLY would have made sense in a world where microcode makes sense: If
    microcode can be executed faster than subroutines, put a building
    stone for transcendental library functions into microcode. Of course,
    given that microcode no longer made sense for VAX, POLY did not make
    sense for it, either.

    IIUC the original idea was that POLY should be more accurate than a
    sequence of separate instructions and reproducible between models.

    That was the plan but the people building Vaxen didn't get the memo
    so even on the original 780, it got different answers with and without
    the optional floating point accelerator.

    If they wanted more accurate results, they should have

    https://simh.trailing-edge.com/docs/vax_poly.pdf

    I must admit that I do not understand why the VAX needed so many
    cycles per instruction. Namely, a register argument can be
    recognized by looking at the 4 high bits of the operand byte.

    It can, but autoincrement or decrement modes change the contents
    of the register so the operands have to be evaluated in strict
    order or you need a lot of logic to check for hazards and stall.

    In practice I don't think it was very common to do that, except
    for the immediate and absolute address modes which were (PC)+
    and @(PC)+, and which needed to be special cased since they took
    data from the instruction stream. The size of the immediate
    operand could be from 1 to 8 bytes depending on both the instruction
    and which operand of the instruction it was.

    To summarize, a VAX with a pipeline and a modest amount of operand
    decoders should be able to execute "normal" instructions
    at RISC speed (in RISC each memory operand would require a
    load or store, so an extra cycle, like the scheme above).

    Right, but detecting the abnormal cases wasn't trivial.

    R's,
    John
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Jul 31 19:01:36 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    According to Waldek Hebisch <antispam@fricas.org>:
    POLY would have made sense in a world where microcode makes sense: If
    microcode can be executed faster than subroutines, put a building
    stone for transcendental library functions into microcode. Of course,
    given that microcode no longer made sense for VAX, POLY did not make
    sense for it, either.

    IIUC the original idea was that POLY should be more accurate than a
    sequence of separate instructions and reproducible between models.

    That was the plan but the people building Vaxen didn't get the memo
    so even on the original 780, it got different answers with and without
    the optional floating point accelerator.

    If they wanted more accurate results, they should have

    https://simh.trailing-edge.com/docs/vax_poly.pdf

    I must admit that I do not understand why the VAX needed so many
    cycles per instruction. Namely, a register argument can be
    recognized by looking at the 4 high bits of the operand byte.

    It can, but autoincrement or decrement modes change the contents
    of the register so the operands have to be evaluated in strict
    order or you need a lot of logic to check for hazards and stall.

    In practice I don't think it was very common to do that, except
    for the immediate and absolute address modes which were (PC)+
    and @(PC)+, and which needed to be special cased since they took
    data from the instruction stream. The size of the immediate
    operand could be from 1 to 8 bytes depending on both the instruction
    and which operand of the instruction it was.

    Looking at the MACRO-32 source for a focal interpreter, I
    see
    CVTLF 12(SP),@(SP)+
    MOVL (SP)+, R0
    CMPL (AP)+,#1
    MOVL (AP)+,R7
    TSTL (SP)+
    MOVZBL (R8)+,R5
    BICB3 #240,(R8)+,R2
    LOCC (R8)+,R0,(R8) ;FIND THE MATCH <<< note R8 used twice
    LOCC (R8)+,S^#OPN,OPRATRS
    MOVL (SP)+,(R7)[R6]
    CMPB (R8)+,#^A/;/ ;VALID END OF STATEMENT
    CASE (SP)+,<30$,20$,10$>,-
    LIMIT=#0,TYPE=L ;DISPATCH ON NO. OF ARGS
    MOVF (SP)+,@(SP)+ ;JUST DO SET

    (SP)+ was far and away the most common. (PC)+ wasn't
    used in that application.

    There were some adjacent dependencies:

    ADDB3 #48,R0,(R9)+ ;PUT AS DIGIT INTO BUFFER
    ADDB3 #48,R1,(R9)+ ;AND NEXT


    and a handful of others. Probably only a single-digit
    percentage of instructions used autoincrement/decrement and only
    a couple used the updated register in the same
    instruction.

    in some of my code from the era, I used auto-decrement frequently,
    mainly to push 8 or 16bit data onto the stack.

    ;
    ; Deallocate Virtual Memory used to buffer records in copy.
    ;
    pushl copy_in_rab+rab$l_ubf ; Record address
    movzwl copy_in_rab+rab$w_usz,-(sp) ; Record size
    pushab 4(sp)
    pushab 4(sp)
    calls #2,g^lib$free_vm ; Get rid of vm
    ret

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Thu Jul 31 19:57:43 2025
    From Newsgroup: comp.arch

    According to Scott Lurndal <slp53@pacbell.net>:
    Looking at the MACRO-32 source for a focal interpreter, I
    see
    CVTLF 12(SP),@(SP)+
    MOVL (SP)+, R0
    CMPL (AP)+,#1
    MOVL (AP)+,R7
    TSTL (SP)+
    MOVZBL (R8)+,R5
    BICB3 #240,(R8)+,R2
    LOCC (R8)+,R0,(R8) ;FIND THE MATCH <<< note R8 used twice
    LOCC (R8)+,S^#OPN,OPRATRS
    MOVL (SP)+,(R7)[R6]
    CMPB (R8)+,#^A/;/ ;VALID END OF STATEMENT
    CASE (SP)+,<30$,20$,10$>,-
    LIMIT=#0,TYPE=L ;DISPATCH ON NO. OF ARGS
    MOVF (SP)+,@(SP)+ ;JUST DO SET

    (SP)+ was far and away the most common. (PC)+ wasn't
    used in that application.

    Wow, that's some funky code. The #240 is syntactic sugar for (PC)+
    followed by a byte with 240 (octal) in it. VAX had an immediate
    address mode that could represent 0 to 77 octal so the assembler used
    that for immediates that would fit, (PC)+ if not. The S^#OPN explicitly
    tells it to use the short immediate mode. #^A/;/ is a literal
    semicolon which fits in an immediate.
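
    As a rough sketch of the choice the assembler is making (my own toy
    model, not DEC's code; the names are invented):

    #include <stdint.h>

    enum imm_form { IMM_SHORT_LITERAL, IMM_PC_IMMEDIATE };

    /* Integer literals 0..63 fit the one-byte short literal specifier
       (modes 0..3); anything else becomes (PC)+ with the value placed in
       the instruction stream.  S^# forces the short form, I^# the long. */
    static enum imm_form pick_immediate_form(int64_t value, int forced_short)
    {
        if (forced_short)
            return IMM_SHORT_LITERAL;
        return (value >= 0 && value <= 63) ? IMM_SHORT_LITERAL
                                           : IMM_PC_IMMEDIATE;
    }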

    There were some adjacent dependencies:

    ADDB3 #48,R0,(R9)+ ;PUT AS DIGIT INTO BUFFER
    ADDB3 #48,R1,(R9)+ ;AND NEXT


    and a handful of others. Probably only a single-digit
    percentage of instructions used autoincrement/decrement and only
    a couple used the updated register in the same
    instruction.

    Right, but it always had to check for it. As I said a few messages ago,
    if they didn't allow register updates to affect other operands, or changed
    the spec so the registers were all updated at the end of the instruction, it wouldn't have affected much code but would have made decoding and pipelining easier.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Jul 31 21:24:29 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    According to Scott Lurndal <slp53@pacbell.net>:
    Looking at the MACRO-32 source for a focal interpreter, I
    see
    CVTLF 12(SP),@(SP)+
    MOVL (SP)+, R0
    CMPL (AP)+,#1
    MOVL (AP)+,R7
    TSTL (SP)+
    MOVZBL (R8)+,R5
    BICB3 #240,(R8)+,R2
    LOCC (R8)+,R0,(R8) ;FIND THE MATCH <<< note R8 used twice
    LOCC (R8)+,S^#OPN,OPRATRS
    MOVL (SP)+,(R7)[R6]
    CMPB (R8)+,#^A/;/ ;VALID END OF STATEMENT
    CASE (SP)+,<30$,20$,10$>,-
    LIMIT=#0,TYPE=L ;DISPATCH ON NO. OF ARGS
    MOVF (SP)+,@(SP)+ ;JUST DO SET

    (SP)+ was far and away the most common. (PC)+ wasn't
    used in that application.

    Wow, that's some funky code.

    .TITLE FOCAL MAIN SEGMENT
    ;FOCAL MAIN SEGMENT
    ;DAVE MONAHAN MARCH 1978

    ...

    HEADER: .ASCII /C VAX FOCAL V1.0 /
    DATE: .BLKB 24
    .ASCII / -NOT A DEC PRODUCT/

    I had it on a 9-track from 1980 that Al was nice enough to
    copy to a CD-ROM for me.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Fri Aug 1 02:18:17 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> wrote:
    According to Waldek Hebisch <antispam@fricas.org>:

    I must admit that I do not understand why the VAX needed so many
    cycles per instruction. Namely, a register argument can be
    recognized by looking at the 4 high bits of the operand byte.

    It can, but autoincrement or decrement modes change the contents
    of the register so the operands have to be evaluated in strict
    order or you need a lot of logic to check for hazards and stall.

    My idea was that the instruction decoder could essentially translate

    ADDL (R2)+, R2, R3

    into

    MOV (R2)+, TMP
    ADDL TMP, R2, R3

    where TMP is a special forwarding register in the CPU. AFAICS normal
    forwarding in the pipeline would handle this. In case of

    ADDL R2, (R2)+, R3

    one would need something which we could denote

    MOV (R2)+, TMP
    ADDL R2*, TMP, R3

    where R2* denotes the previous value of R2, which introduces an extra
    complication, but does not look hard to handle.

    Note that I do _not_ aim at executing complex VAX instructions in
    one cycle. Rather, each memory operand is handled separately,
    and they are handled in order.
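
    As a minimal sketch of the intended cracking for the second case above
    (plain C standing in for the micro-ops; the toy memory/register arrays
    and names are mine, purely for illustration):

    #include <stdint.h>

    static uint32_t mem[1024];    /* word-addressed toy memory */
    static uint32_t reg[16];      /* R0..R15                   */

    static void addl_r2_autoinc_r2_r3(void)   /* ADDL R2, (R2)+, R3 */
    {
        uint32_t r2_star = reg[2];             /* "R2*": pre-increment value */
        uint32_t tmp     = mem[reg[2] / 4];    /* MOV (R2)+, TMP             */
        reg[2] += 4;                           /* autoincrement side effect  */
        reg[3]  = r2_star + tmp;               /* ADDL R2*, TMP, R3          */
    }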

    In practice I don't think it was very common to do that, except
    for the immediate and absolute address modes which were (PC)+
    and @(PC)+, and which needed to be special cased since they took
    data from the instruction stream. The size of the immediate
    operand could be from 1 to 8 bytes depending on both the instruction
    and which operand of the instruction it was.

    I considered only popular integer instructions; everything else
    would be handled by microcode at the same speed as the real VAX.
    The VAX had a 32-bit bus, so an 8-byte operand needed 2 cycles anyway,
    so slower decoding for such operands would not be a problem.

    To summarize, a VAX with a pipeline and a modest amount of operand
    decoders should be able to execute "normal" instructions
    at RISC speed (in RISC each memory operand would require a
    load or store, so an extra cycle, like the scheme above).

    Right, but detecting the abnormal cases wasn't trivial.

    Maybe I was unclear, but the whole point was that distinguishing
    between normal cases and abnormal ones could be done by
    moderately complex hardware. Also, I am comparing to execution
    time for equivalent functionality: a VAX instruction with 1 memory
    operand would take 2 cycles (the same as the 2 instructions needed
    by a RISC). And I am comparing to early RISC, that is 32-bit
    integer operations. A similar speedup for floating point operations
    or for 64-bit operands would need bigger decoders; handling
    more than 1 memory operand per cycle or going superscalar
    probably would lead to too complex decoders.

    And a little correction: the proposed decoder effectively adds 1 more
    pipeline stage, so a taken jump would be 1 cycle slower than on a
    classic RISC having the same pipeline (and 2 cycles slower than on a
    RISC with delayed jumps). OTOH RISC-V compressed instructions
    seem to require a similar decoding stage, so Anton's VAX-RISC-V
    would have similar timing.

    I can understand why DEC abandoned the VAX: already in 1985 they
    had some disadvantage and they saw no way to compete against the
    superscalar machines which were on the horizon. In 1985 they
    probably realized that their features added no value in a world
    using optimizing compilers.

    But after your post I find it more likely that DEC could
    not afford a pipeline for the VAX-780: even with simple instructions
    one has to decide between accessing the register file and using a
    forwarded value, one needs interlocks to wait for cache misses,
    etc.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Fri Aug 1 04:31:00 2025
    From Newsgroup: comp.arch

    On Tue, 17 Jun 2025 12:45:44 -0700, Chris M. Thomasson wrote:
    On 6/17/2025 10:59 AM, quadibloc wrote:

    So the fact that it uses 10x the electrical power, while only having 2x
    the raw power - for an embarrassingly parallel problem, which doesn't
    happen to be the one I need to solve - doesn't matter.

    Can you break your processing down into units that can be executed in parallel, or do you get into an interesting issue where step B cannot
    proceed until step A is finished?

    I'm assuming that the latter case is true often enough for real-world
    programs that out-of-order processors with massive overhead and power consumption are worth using instead of many small processors in
    parallel with greater throughput.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Fri Aug 1 04:42:28 2025
    From Newsgroup: comp.arch

    On Tue, 10 Jun 2025 22:45:05 -0500, BGB wrote:

    If you treat [Base+Disp] and [Base+Index] as two mutually exclusive
    cases, one gets most of the benefit with less issues.

    That's certainly a way to do it. But then you either need to dedicate
    one base register to each array - perhaps easier if there's opcode
    space to use all 32 registers as base registers, which this would allow -
    or you would have to load the base register with the address of the
    array.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Aug 1 05:03:07 2025
    From Newsgroup: comp.arch

    John Savard <quadibloc@invalid.invalid> schrieb:
    On Tue, 10 Jun 2025 22:45:05 -0500, BGB wrote:

    If you treat [Base+Disp] and [Base+Index] as two mutually exclusive
    cases, one gets most of the benefit with less issues.

    That's certainly a way to do it. But then you either need to dedicate
    one base register to each array - perhaps easier if there's opcode
    space to use all 32 registers as base registers, which this would allow -
    or you would have to load the base register with the address of the
    array.

    Which is what everybody does. Loading a register with the address
    of a small array on the stack is a simple addition, usually one
    cycle of latency. If the array came as an argument, it is (usually)
    in a register to start with. If you allocate the array dynamically,
    you get its address for free after the function call. If you have
    enough GP registers, chances are it will still be in a register;
    otherwise you can spill it to the stack and restore it, with restoring
    needing one L1 cache access.
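
    A small C example of what "loading the base register" amounts to (my
    own illustration; the RISC-V instruction in the comment is just typical
    compiler output, nothing special):

    extern void consume(int *p);

    void demo(void)
    {
        int a[8];          /* lives in the current stack frame            */
        consume(a);        /* passing the array passes its address, i.e.  */
                           /* one add: e.g. RISC-V "addi a0, sp, offset"  */
    }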
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Fri Aug 1 15:30:56 2025
    From Newsgroup: comp.arch

    It appears that Waldek Hebisch <antispam@fricas.org> said:
    My idea was that instruction decoder could essentially translate

    ADDL (R2)+, R2, R3

    into

    MOV (R2)+, TMP
    ADDL TMP, R2, R3

    But how about this?

    ADDL3 (R2)+,(R2)+,(R2)+

    Now you need at least two temps, the second of which depends on the
    first, and there are instructions with six operands. Or how about
    this:

    ADDL3 (R2)+,#1234,(R2)+

    This is encoded as

    OPCODE (R2)+ (PC)+ <1234> (R2)+

    The immediate word is in the middle of the instruction. You have to decode
    the operands one at a time so you can recognize immediates and skip over them. It must have seemed clever at the time, but ugh.
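
    To spell out why decoding is inherently sequential, here is a rough
    sketch of a specifier-length walker (my own illustration, not any real
    decoder; it ignores combinations the architecture forbids, such as an
    index prefix on a literal):

    #include <stdint.h>
    #include <stddef.h>

    /* Length in bytes of one operand specifier starting at p, given the
       operand's data size in bytes (needed only for immediates). */
    static size_t specifier_length(const uint8_t *p, size_t operand_size)
    {
        uint8_t mode = p[0] >> 4;
        uint8_t rn   = p[0] & 0x0f;

        if (mode <= 0x3) return 1;                    /* short literal        */
        if (mode == 0x4)                              /* index prefix         */
            return 1 + specifier_length(p + 1, operand_size);
        if (mode <= 0x7) return 1;                    /* Rn, (Rn), -(Rn)      */
        if (mode == 0x8)                              /* (Rn)+; (PC)+ = imm   */
            return (rn == 0xf) ? 1 + operand_size : 1;
        if (mode == 0x9)                              /* @(Rn)+; @(PC)+ = abs */
            return (rn == 0xf) ? 1 + 4 : 1;
        if (mode <= 0xb) return 1 + 1;                /* byte displacement    */
        if (mode <= 0xd) return 1 + 2;                /* word displacement    */
        return 1 + 4;                                 /* long displacement    */
    }

    /* Operand N+1 starts only where operand N ends, so the walk is serial:
       off += specifier_length(instr + off, size_of_operand(opcode, n));    */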
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Aug 1 17:02:33 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    On 7/30/2025 12:59 AM, Anton Ertl wrote:
    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    So going for microcode no longer was the best choice for the VAX, but
    neither the VAX designers nor their competition realized this, and
    commercial RISCs only appeared in 1986.

    That is certainly true but there were other mistakes too. One is that
    they underestimated how cheap memory would get, leading to the overcomplex instruction and address modes and the tiny 512 byte page size.

    Concerning code density, while VAX code is compact, RISC-V code with the
    C extension is more compact
    <2025Mar4.093916@mips.complang.tuwien.ac.at>, so in our time-traveling
    scenario that would not be a reason for going for the VAX ISA.
    ...
    But, if so, it would more speak for the weakness of VAX code density
    than the goodness of RISC-V.

    For the question at hand, what counts is that one can do a RISC that
    is more compact than the VAX.

    And neither among the Debian binaries nor among the NetBSD binaries I
    measured have I found anything consistently more compact than RISC-V
    with the C extension. There is one strong competitor, though: armhf
    (Thumb2) on Debian, which is a little smaller than RV64GC in 2 out of
    3 cases and a little larger in the third case.

    There is, however, a fairly notable size difference between RV32 and
    RV64 here, but I had usually been messing with RV64.

    NetBSD has both RV32GC and RV64GC binaries, and there is no consistent advantage of RV32GC over RV64GC there:

    NetBSD numbers from <2025Mar4.093916@mips.complang.tuwien.ac.at>:

    libc ksh pax ed
    1102054 124726 66218 26226 riscv-riscv32
    1077192 127050 62748 26550 riscv-riscv64

    If I were to put it on a ranking (for ISAs I have messed with), it would
    be, roughly (smallest first):
    i386 with VS2008 or GCC 3.x (*1)

    i386 has significantly larger binaries than RV64GC on both Debian and
    NetBSD, also bigger than AMD64 and ARM A64.

    For those who want to see all the numbers in one posting: <2025Jun17.161742@mips.complang.tuwien.ac.at>.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Fri Aug 1 17:16:48 2025
    From Newsgroup: comp.arch

    Lars Poulsen <lars@cleo.beagle-ears.com> writes:
    In the days of VAX-11/780, it was "obvious" that operating systems would
    be written in assembler in order to be efficient, and the instruction
    set allowed high productivity for writing systems programs in "native"
    code.

    Yes. I don't think that the productivity would have suffered from a
    load/store architecture, though.

    As for a RISC-VAX: To little old naive me, it seems that it would have
    been possible to create an alternative microcode load that would be able
    to support a RISC ISA on the same hardware, if the idea had occurred to a well-connected group of graduate students. How good a RISC might have
    been feasible?

    Did the VAX 11/780 have writable microcode?

    Given that the VAX 11/780 was not (much) pipelined, I don't expect
    that using an alternative microcode that implements a RISC ISA would
    have performed well.

    Crossposted to comp.arch, alt.folklore.computers

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Fri Aug 1 18:08:30 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> wrote:
    It appears that Waldek Hebisch <antispam@fricas.org> said:
    My idea was that instruction decoder could essentially translate

    ADDL (R2)+, R2, R3

    into

    MOV (R2)+, TMP
    ADDL TMP, R2, R3

    But how about this?

    ADDL3 (R2)+,(R2)+,(R2)+

    Now you need at least two temps, the second of which depends on the
    first,

    3 actually, the translation should be

    MOVL (R2)+, TMP1
    MOVL (R2)+, TMP2
    ADDL TMP1, TMP2, TMP3
    MOVL TMP3, (R2)+

    Of course, the temporaries exist only within the pipeline, so they probably
    do not need real registers. But the instruction would need
    4 clocks.

    and there are instructions with six operands.

    Those would be classified as hairy and done by microcode.

    Or how about
    this:

    ADDL3 (R2)+,#1234,(R2)+

    This is encoded as

    OPCODE (R2)+ (PC)+ <1234> (R2)+

    The immediate word is in the middle of the instruction. You have to decode the operands one at a time so you can recognize immediates and skip over them.

    Actually, the decoder that I propose could decode _this_ one in one
    cycle. But for this instruction one-cycle decoding is not needed,
    because execution will take multiple clocks. One-cycle decoding
    is needed for

    ADDL3 R2,#1234,R2

    which should be executed in one cycle. And to handle it one needs
    7 operand decoders looking at 7 consecutive bytes, so that the last
    decoder sees the last register argument.
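
    A rough model of that parallel variant (again mine, just to make the
    gate-count claim concrete): classify all 7 bytes after the opcode at
    once, then a little logic checks whether they line up as
    register / long immediate / register.

    #include <stdint.h>
    #include <stdbool.h>

    static bool is_reg(uint8_t b)    { return (b >> 4) == 0x5; }  /* mode 5 */
    static bool is_pc_imm(uint8_t b) { return b == 0x8f; }        /* (PC)+  */

    /* p points at the 7 bytes following a 3-operand longword opcode,
       e.g. ADDL3 R2,#1234,R2 -> 52 8F <4 immediate bytes> 52. */
    static bool fast_path_reg_imm_reg(const uint8_t *p)
    {
        bool op1 = is_reg(p[0]);       /* first operand: register          */
        bool op2 = is_pc_imm(p[1]);    /* second operand: long immediate   */
        bool op3 = is_reg(p[6]);       /* third operand, after 4 imm bytes */
        return op1 && op2 && op3;      /* all checks happen "in parallel"  */
    }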

    It must have seemed clever at the time, but ugh.

    The VAX designers clearly had microcode in mind; even small changes
    could have made hardware decoding easier.

    I have a book by A. Tanenbaum about computer architecture that
    was written in a similar period as the VAX design. Tanenbaum was
    very positive about microcode and advocated adding instructions
    that directly correspond to higher-level language constructs.
    In a sense, Tanenbaum could see the advantages of RISC. Namely,
    he cites a report about compiling Fortran to IBM microcode:
    Fortran compiled to microcode could run 45 times faster than
    Fortran compiled to native code. So it was implicitly
    known to him that a very primitive machine language was
    pretty adequate to get high speed from compiled languages.
    Yet Tanenbaum still wanted microcode and gave made-up
    examples of microcode advantages. No wonder that the VAX
    designers were in the same camp.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch,alt.folklore.computers on Fri Aug 1 18:11:28 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Lars Poulsen <lars@cleo.beagle-ears.com> writes:
    In the days of VAX-11/780, it was "obvious" that operating systems would
    be written in assembler in order to be efficient, and the instruction
    set allowed high productivity for writing systems programs in "native" code.

    Yes. I don't think that the productivity would have suffered from a load/store architecture, though.

    As for a RISC-VAX: To little old naive me, it seems that it would have
    been possible to create an alternative microcode load that would be able
    to support a RISC ISA on the same hardware, if the idea had occurred to a well-connected group of graduate students. How good a RISC might have
    been feasible?

    Did the VAX 11/780 have writable microcode?

    Yes.


    Given that the VAX 11/780 was not (much) pipelined, I don't expect
    that using an alternative microcode that implements a RISC ISA would
    have performed well.

    A new ISA also requires development of the complete software
    infrastructure for building applications (compilers, linkers,
    assemblers); updating the OS, rebuilding existing applications
    for the new ISA, field and customer training, etc.

    Digital eventually did move VMS to Alpha, but it was neither
    cheap, nor easy. Most alpha customers were existing VAX
    customers - it's not clear that DEC actually grew the customer
    base by switching to Alpha.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Aug 1 18:33:31 2025
    From Newsgroup: comp.arch

    antispam@fricas.org (Waldek Hebisch) writes:
    John Levine <johnl@taugh.com> wrote:
    <snip>
    ADDL3 (R2)+,#1234,(R2)+

    This is encoded as

    OPCODE (R2)+ (PC)+ <1234> (R2)+

    The immediate word is in the middle of the instruction. You have to decode the operands one at a time so you can recognize immediates and skip over them.

    Actually, the decoder that I propose could decode _this_ one in one
    cycle.

    Assuming it didn't cross a cache line, which is possible with any
    variable length instruction encoding.

    But for this instruction one cycle decoding is not needed,
    because execution will take multiple clocks. One cycle decoding
    is needed for

    ADDL3 R2,#1234,R2

    which should be executed in one cycle. And to handle it one needs
    7 operand decoders looking at 7 consecutive bytes, so that the last
    decoder sees the last register argument.

    It must have seemed clever at the time, but ugh.

    The VAX designers clearly had microcode in mind; even small changes
    could have made hardware decoding easier.

    I have a book by A. Tanenbaum about computer architecture that
    was written in a similar period as the VAX design.

    That would be:

    $ author tanenbaum
    Enter password:
    artist title format location
    Tanenbaum, Andrew S. Structured Computer Organization Hard A029

    It's currently in box A029 in storage, but my recollection is that
    it was rather vax-centric.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Aug 1 17:25:22 2025
    From Newsgroup: comp.arch

    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    POLY would have made sense in a world where microcode makes sense: If
    microcode can be executed faster than subroutines, put a building
    stone for transcendental library functions into microcode. Of course,
    given that microcode no longer made sense for VAX, POLY did not make
    sense for it, either.

    IIUC the original idea was that POLY should be more accurate than a
    sequence of separate instructions and reproducible between models.

    The reproducability did not happen.

    It actually might have been better if the ISA contained instructions
    for the individual steps. According to <http://simh.trailing-edge.com/docs/vax_poly.pdf>

    |For example, POLY specified that in the central an*x+bn step:
    |- The multiply result was truncated to 31b/63b prior to normalization.
    |- The extended-precision multiply result was added to the next coefficient.
    |- The addition result was truncated to 31b/63b prior to normalization and
    | rounding.

    One could specify an FMA instruction for that step like many recent
    ISAs have done, but I think that the reproducibility would be better
    if the truncation was a separate instruction. And of course, all of
    this would require having at least a few registers with extra bits.
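
    For reference, a plain-C sketch of what one POLY evaluation computes
    (mine; ordinary doubles round instead of performing the VAX's 31/63-bit
    truncation, which the comments only mark, and I am assuming the
    coefficient table is ordered highest degree first):

    /* Horner evaluation: (((c[0]*x + c[1])*x + c[2])*x + ...) */
    static double poly_sketch(double x, const double *c, int degree)
    {
        double r = c[0];
        for (int i = 1; i <= degree; i++) {
            /* VAX POLY: the product was truncated to 31/63 fraction bits
               before normalization, then the coefficient was added and the
               sum truncated again before normalization and rounding. */
            r = r * x + c[i];
        }
        return r;
    }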

    Another issue would be is how to implement the PDP-11 emulation mode.
    I would add a PDP-11 decoder (as the actual VAX 11/780 probably has)
    that would decode PDP-11 code into RISC-VAX instructions, or into what
    RISC-VAX instructions are decoded into. The cost of that is probably
    similar to that in the actual VAX 11/780. If the RISC-VAX ISA has a
    MIPS/Alpha/RISC-V-like handling of conditions, the common microcode
    would have to support both the PDP-11 and the RISC-VAX handling of
    conditions; probably not that expensive, but maybe one still would
    prefer an ARM/SPARC/HPPA-like handling of conditions.

    I looked into the VAX architecture handbook from 1977. The handbook claims
    that the VAX-780 used 96-bit microcode words. That is enough bits to
    control a pipelined machine with 1 instruction per cycle, provided
    enough execution resources (register ports, buses and 1-cycle
    execution units). However, VAX hardware allowed only one memory
    access per cycle, so instructions with multiple memory addresses
    or using indirection through memory by necessity needed multiple
    cycles.

    I must admit that I do not understand why the VAX needed so many
    cycles per instruction.

    It was not pipelined much. Assuming a totally unpipelined machine, an
    ADD3.L R1,R2,R3 instruction might be executed in the following steps:

    decode add3.l
    decode first operand (r1)
    read r1 from the register file | decode second operand (r2)
    read r2 from the register file
    add r1 and r2 | decode r3
    write the result to r3

    That's 6 cycles, and without any cycles for instruction fetching.

    For a 1-byte opcode with all
    register arguments the operand specifiers are in predictable places,
    so together a modest number of gates could recognize register-only
    operand specifiers.

    Yes, but they wanted to implement the VAX, where every operand can be
    anything. If they thought that focusing on register-only instructions
    was the way to go, they would not have designed the VAX, but the IBM
    801. The ISA was designed for a non-pipelined microcoded
    implementation, obviously without any thought given to future
    pipelined implementations, and that's how the VAX 11/780 was
    implemented.

    To summarize, a VAX with a pipeline and a modest amount of operand
    decoders should be able to execute "normal" instructions
    at RISC speed (in RISC each memory operand would require a
    load or store, so an extra cycle, like the scheme above).

    The VAX 11/780 was not pipelined. The VAX 8700/8800 (introduced 1986,
    but apparently the 8800 replaced the 8600 as high-end VAX only
    starting from 1987) was pipelined at the microcode level, like you
    suggest, but despite having a 4.4 times higher clock rate, the 8700
    achieved only 6 VUP, i.e., 6 times the VAX 11/780 performance (the
    8800 just had two CPUs, but each CPU with the same speed). So if the
    VAX 11/780 takes 10 cycles/instruction on average, the VAX 8700 still
    takes 7.4 cycles per instruction on average, whereas typical RISCs
    contemporary with the VAX 8700 required <2 CPI. They needed
    more instructions, but the bottom line was still a big speed
    advantage for the RISCs.

    A few years later, there was the pipelined 91MHz NVAX+ with 35 VUP,
    and, implemented in the same process, the 200MHz 21064 with 106.9
    SPECint92 and 159.6 SPECfp92 (https://ntrs.nasa.gov/api/citations/19960008936/downloads/19960008936.pdf). Note that both VUP and SPEC92 scale relative to the VAX 11/780 (i.e.,
    the 11/780 has 1 VUP and SPEC92 int and fp results of 1). So we see
    that they did not manage to get the NVAX+ up to the same clock rate as
    the 21064 in the same process, and that the performance disadvantage
    of the VAX is even higher than the clock rate disadvantage.

    Given the actual speed of the VAX, the possibilities seem to be:
    - extra factors slowing both VAX and RISC, like cache
    misses (the VAX architecture handbook says that due to
    misses the cache had an effective access time of 290 ns),
    - the VAX designers could not afford a pipeline,
    - maybe the VAX designers decided to avoid a pipeline to reduce
    complexity.

    Yes to all. And even when they finally pipelined the VAX, it was far
    less effective than for RISCs.

    If the VAX designers could not afford a pipeline, then it is
    not clear if a RISC could afford it: removing the microcode
    engine would reduce complexity and cost and give some
    free space. But microcode engines tend to be simple.

    RISCs like the ARM, MIPS R2000, and SPARC implemented a pipelined
    integer instruction set in one chip in 1985/86, with the R2000 running
    at up to 12.5MHz. At around the same time the MicroVAX 78032 appeared
    with a similar number of transistors (R2000 110,000, 78032 125,000).
    The 78032 runs at 5MHz and has a similar performance to the VAX
    11/780. So for these single-chip implementations, the RISC could be
    pipelined (and clocked higher), whereas the VAX could not*. I expect
    that with the resources needed for the VAX 11/780, a pipelined RISC
    could be implemented.

    * And did the 78032 implement the whole integer instruction set? I
    have certainly read about MicroVAXen that trapped rare instructions
    and implemented them in software.

    Also, PDP-11 compatibility depended on microcode.
    Without microcode engine one would need parallel set
    of hardware instruction decoders, which could add
    more complexity than was saved by removing microcode
    engine.

    The PDP-11 instruction set is relatively simple. I expect that the
    effort for decoding it to the RISC-VAX (whether in hardware or with
    microcode) would not take that many resources.

    To summarize, it is not clear to me if RISC in VAX technology
    could be significantly faster than VAX

    They were significantly faster in later technologies, and the IBM 801 demonstrates the superiority of RISC at around the time of the VAX, so
    it is very likely that a pipelined and faster RISC-VAX would have been
    doable with the resources of the VAX.

    Without
    insight into the future it is hard to say that they were
    wrong.

    It's now the past. And now we have all the data to see that the
    result was certainly not very future-proof, and very likely not even
    the best-performing design possible at the time. But ok, they did not
    know better, that's why there's a time-machine involved in my
    scenario.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Aug 1 19:13:53 2025
    From Newsgroup: comp.arch

    Waldek Hebisch <antispam@fricas.org> schrieb:
    John Levine <johnl@taugh.com> wrote:
    It appears that Waldek Hebisch <antispam@fricas.org> said:
    My idea was that instruction decoder could essentially translate

    ADDL (R2)+, R2, R3

    into

    MOV (R2)+, TMP
    ADDL TMP, R2, R3

    But how about this?

    ADDL3 (R2)+,(R2)+,(R2)+

    Now you need at least two temps, the second of which depends on the
    first,

    3 actually, the translation should be

    MOVL (R2)+, TMP1
    MOVL (R2)+, TMP2
    ADDL TMP1, TMP2, TMP3
    MOVL TMP3, (R2)+

    Of course, temporaries are only within pipeline, so they probably
    do not need real registers. But the instruction would need
    4 clocks.

    It would be, unoptimized (my VAX assembler is very probably wrong)

    MOVL (R2),TMP1
    ADDL #4,R2
    MOVL (R2),TMP2
    ADDL #4,R2
    ADDL TMP1,TMP2,TMP2 ! That could be one register,
    ! or an implied forwarding register
    MOVL TMP2,(R2)
    ADDL #4,R2

    which could better be expressed by

    MOVL (R2),TMP1
    MOVL 4(R2),TMP2
    ADDL TMP1,TMP2,TMP2
    MOVL TMP2,8(R2)
    ADDL #12,R2
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Aug 1 15:24:37 2025
    From Newsgroup: comp.arch

    On 8/1/2025 12:02 PM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    On 7/30/2025 12:59 AM, Anton Ertl wrote:
    John Levine <johnl@taugh.com> writes:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    So going for microcode no longer was the best choice for the VAX, but neither the VAX designers nor their competition realized this, and
    commercial RISCs only appeared in 1986.

    That is certainly true but there were other mistakes too. One is that they underestimated how cheap memory would get, leading to the overcomplex instruction and address modes and the tiny 512 byte page size.

    Concerning code density, while VAX code is compact, RISC-V code with the C extension is more compact
    <2025Mar4.093916@mips.complang.tuwien.ac.at>, so in our time-traveling
    scenario that would not be a reason for going for the VAX ISA.
    ...
    But, if so, it would more speak for the weakness of VAX code density
    than the goodness of RISC-V.

    For the question at hand, what counts is that one can do a RISC that
    is more compact than the VAX.


    Fair enough.

    I have noted it seems that RISC's can be more compact than VAX.


    And neither among the Debian binaries nor among the NetBSD binaries I measured have I found anything consistently more compact than RISC-V
    with the C extension. There is one strong competitor, though: armhf
    (Thumb2) on Debian, which is a little smaller than RV64GC in 2 out of
    3 cases and a little larger in the third case.


    This differs from my experience.

    In my own testing, RISC-V has not usually been the front-runner on code density. Though, when checking micro-examples, it tends to do fairly
    well (in many cases, having the smaller instruction count among the
    various ISAs).


    But, it seems to have a few obvious weak points for RISC-V:
    Crappy with arrays;
    Crappy with code with lots of large immediate values;
    Crappy with code which mostly works using lots of global variables;
    Say, for example, a lot of Apogee / 3D Realms code;
    They sure do like using lots of global variables.
    id Software also likes globals, but not as much.
    ...

    Though, across ISAs, can note that a few of the bigger binaries tend to
    be from:
    Q3A / Quake 3 Arena;
    ROTT / Rise of the Triad.
    Both of which tend to have ".text" sizes of around 1MB.


    Vs, Doom, where:
    Median value would be 370K.
    Smallest ".text" in my current lineup:
    ~ 265K, XG1 (BGBCC)
    Biggest ".text" in my current lineup:
    ~ 475K, RV64G (GCC)
    Vs:
    ~ 380K for RV64GC
    ~ 320K for XG3

    OTOH, if BGBCC is limited to plain RV64G, its binary sizes and
    performance are both worse than those from GCC.

    In non-functional tests though, Thumb2 seemed to get Doom down to around
    240K, and i386 to around 235K (yes, I am rounding to multiples of 5K,
    but there tends to be "noise" at the single kB levels). Seems to depend
    a lot on which program it is whether i386 or Thumb2 wins; both seem
    pretty close here.

    But, RV64GC doesn't seem to compete well with Thumb2 in this case, as it
    is seemingly around 60% larger (for ".text" and similar).




    In my own testing also, the C extension tends to result in around a 20% reduction (or seen from the other side, its absence around a 30%
    expansion). This however has not usually been enough to move it into top place.


    I have actually seen larger space deltas due to things like adding a combination of:
    Register-indexed addressing;
    Jumbo prefixes for larger immediate and displacement values;
    Load/Store Pair.

    At least with my compiler, this makes the extended RISC-V more
    competitive with my own ISA designs.

    The Jumbo scheme does allow for a merged register space, however with RV
    (and with each non-native register requiring a 64-bit encoding), trying
    to enable a full 64-register mode tends to have a negative impact on
    both code density and performance.


    The RV + Jumbo extensions are compatible with the RISC-V 'C' extension. However, my XG3 thing is incompatible with the C extension (as it uses
    the same encoding space).



    I had done a mock-up of an idea I mentioned elsewhere:
    Adding 13/14 bit pair-encoded instructions to XG3.
    These preserve overall 32-bit alignment of the instruction stream;
    Only non-dependent instructions can be paired;
    ...

    However, in my modeling I am only seeing around a 4 to 5% potential
    reduction in binary sizes.

    Or, basically, would get Doom down to ~ 310K or maybe 305K.


    It uses Reg3 for many of the instructions.
    Traditional Reg3 is R8..R15, similar to the RV-C extension.
    If tuned for BGBCC's output, best sane option seems to be:
    R8..R11, R24..R27

    Mostly because BGBCC tends not to do register allocation in scratch
    registers, but the first 2 arguments, and return value, appear frequently.

    Note that R8/R9 and R24..R27 are the most high-priority registers in
    BGBCC's register allocator.


    Though, it is debatable, and seems possibly not worth the added
    complexity for a fairly modest size reduction.


    As noted, current instruction mix for this experiment is (hybrid notation):
    MOV Rm5, Rn5
    ADD Rm5, Rn5
    LI Imm5s, Rn5
    ADDI Imm5s, Rn5
    ADDWI Imm5s, Rn5 //debated, but high hit rate
    SUB/XOR/AND/OR Rm3, Rn3
    SUBW/ADDW/SUBWU/ADDWU Rm3, Rn3
    SLL /SLA /SRL /SRA Rm3, Rn3
    SLLW/SLAW/SRLW/SRAW Rm3, Rn3

    SUBWI/ADDWI/SUBWUI/ADDWUI Imm3u/n, Rn3
    SLLI /SLAI /SRLI /SRAI Imm3u/n, Rn3
    SLLWI/SLAWI/SRLWI/SRAWI Imm3u/n, Rn3

    LW/LD/SW/SD Rn5, Disp5u(SP)
    LW/LD/SW/SD Rn3, Disp2u(Rm3)
    LB/LH/SB/SH Rn3, 0(Rm3)
    LBU/LHU/LWU Rn3, 0(Rm3)

    No stack-adjustment or control-flow instructions, as these weren't
    common enough to make the cut.

    As for why to include ADDWU/SUBWU variants, which don't even exist in
    RISC-V: they do exist in XG3 and BGBCC uses them often (as it
    uses zero-extended "unsigned int").

    ...



    There is, however, a fairly notable size difference between RV32 and
    RV64 here, but I had usually been messing with RV64.

    NetBSD has both RV32GC and RV64GC binaries, and there is no consistent advantage of RV32GC over RV64GC there:

    NetBSD numbers from <2025Mar4.093916@mips.complang.tuwien.ac.at>:

    libc ksh pax ed
    1102054 124726 66218 26226 riscv-riscv32
    1077192 127050 62748 26550 riscv-riscv64


    I guess it should be asked: is the overhead of any ELF metadata being excluded?...


    In my own testing, things like binary size overheads due to ELF metadata
    can be quite substantial (without debugging, ELF metadata is still often around half of the overall size of the binary).

    I prefer PE/COFF here as it tends to have less metadata related
    overheads (say, due binaries not including symbol tables and trying to
    export everything).


    Though, for BGBCC, the debugging strategy is a bit more limited:
    It emits a symbol map file (roughly 'nm' format) along with the binary;
    This has symbols and line-number info; I had also started very early work
    on STABS debugging (though the specifics of how STABS metadata is used
    by compilers and debuggers are not particularly well documented).

    I have at times had thoughts about a possibly more compact binary format
    and/or LZ-compressing the map.


    If I were to put it on a ranking (for ISAs I have messed with), it would
    be, roughly (smallest first):
    i386 with VS2008 or GCC 3.x (*1)

    i386 has significantly larger binaries than RV64GC on both Debian and
    NetBSD, also bigger than AMD64 and ARM A64.


    I was seeing i386 as usually having the smallest binaries, but:
    When using full static linking with a controlled C library;
    When using roughly 20 year old compilers in size-optimization mode;
    Excluding metadata.

    Also, one needs to build with options to optimize for size, prune
    unreachable code and data, ...


    With most newer compilers, it seems i386 and x86-64 code density has
    gotten worse (particularly with MSVC, where newer versions tend to
    produce huge binaries vs the older versions).

    Granted, newer compilers do support newer versions of the C standard,
    and also typically get better performance.


    There has usually been a difference between x86 and x86-64 in that
    seemingly the average instruction size in i386 is around 3.25 bytes, vs
    around 4.5 for x86-64.

    And, i386 doesn't always win in terms of instruction counts either.


    So, i386 and Thumb don't win on instruction counts, but do have smaller average size instructions.

    Among mainline ISAs, RISC-V seems to be fairly competitive in terms of instruction counts.



    For those who want to see all the numbers in one posting: <2025Jun17.161742@mips.complang.tuwien.ac.at>.

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch,alt.folklore.computers on Fri Aug 1 20:41:06 2025
    From Newsgroup: comp.arch

    In article <kr7jQ.442699$Tc12.355083@fx17.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    Digital eventually did move VMS to Alpha, but it was neither
    cheap, nor easy. Most alpha customers were existing VAX
    customers - it's not clear that DEC actually grew the customer
    base by switching to Alpha.

    Not for VMS, anyway.

    DEC was decently well regarded in the Unix world even then, and
    OSF/1 seemed pretty nifty, if you were coming from a BSD-ish
    place. A lot of Sun shops that didn't want SVR4 and Solaris on
    SPARC looked hard at OSF/1 on Alpha, though I don't know how
    many ultimately jumped.

    And Windows on Alpha had a brief shining moment in the sun (no
    pun intended).

    Interesting, the first OS brought up on Alpha was Ultrix, though
    it never shipped as a product.

    I wonder, if you broke it down by OS, what shipped on the most
    units.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Fri Aug 1 21:24:42 2025
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    John Levine <johnl@taugh.com> wrote:
    <snip>
    ADDL3 (R2)+,#1234,(R2)+

    This is encoded as

    OPCODE (R2)+ (PC)+ <1234> (R2)+

    The immediate word is in the middle of the instruction. You have to decode the operands one at a time so you can recognize immediates and skip over them.

    Actually, the decoder that I propose could decode _this_ one in one
    cycle.

    Assuming it didn't cross a cache line, which is possible with any
    variable length instruction encoding.

    Assuming that the instruction is in the prefetch buffer. IIUC the VAX accessed
    the cache in 4-byte units, so the length of a cache line did not matter.

    But for this instruction one cycle decoding is not needed,
    because execution will take multiple clocks. One cycle decoding
    is needed for

    ADDL3 R2,#1234,R2

    which should be executed in one cycle. And to handle it one needs
    7 operand decoders looking at 7 consecutive bytes, so that the last
    decoder sees the last register argument.

    It must have seemed clever at the time, but ugh.

    The VAX designers clearly had microcode in mind; even small changes
    could have made hardware decoding easier.

    I have a book by A. Tanenbaum about computer architecture that
    was written in a similar period as the VAX design.

    That would be:

    $ author tanenbaum
    Enter password:
    artist title format location
    Tanenbaum, Andrew S. Structured Computer Organization Hard A029

    It's currently in box A029 in storage, but my recollection is that
    it was rather vax-centric.

    Maybe you have a later edition. Mine had the IBM-360, PDP-11 and Cyber-6600
    as example instruction sets and the IBM-360, PDP-11 and a Burroughs
    machine as examples of the microcode level. The VAX may be mentioned,
    but I am not sure. In the library I saw a later edition with quite
    different (post-VAX) examples.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Aug 1 18:07:01 2025
    From Newsgroup: comp.arch

    On 7/31/2025 11:42 PM, John Savard wrote:
    On Tue, 10 Jun 2025 22:45:05 -0500, BGB wrote:

    If you treat [Base+Disp] and [Base+Index] as two mutually exclusive
    cases, one gets most of the benefit with less issues.

    That's certainly a way to do it. But then you either need to dedicate
    one base register to each array - perhaps easier if there's opcode
    space to use all 32 registers as base registers, which this would allow -
    or you would have to load the base register with the address of the
    array.


    Yeah.


    Whenever one deals with an array, internally one is dealing
    with a pointer to the array.

    Can note that in BGBCC, as far as code generation is concerned, both
    structs and arrays exist primarily as pointers to the memory holding the struct or array in question.

    This pointer is then subject to register allocation in roughly the same
    way as something like an integer value (though, with some optimizations possible as one can know that the pointer's value is essentially
    immutable as far as the code generation is concerned; similar to a
    literal value).

    But, with 32 or 64 GPRs, this isn't too much of an issue.
    With 64 GPRs, one can mostly static-assign everything to registers. So
    the main cost is often more with the prolog and epilog rather than spill-and-fill.

    Potentially, a compiler like GCC could go further and make more
    effective use of scratch registers in non-leaf functions (BGBCC only
    using scratch registers for register allocation in leaf functions; and
    not for variables held across basic-block boundaries unless it can also static-assign everything).


    But, the register allocation parts of BGBCC have turned into a mess, and
    could potentially use a redesign at some point.

    It is like a whole pile of things that kinda suck, but fixing them
    invariably proves to be too much effort (and not quite sufficiently bad
    to justify throwing it all out and starting over).


    Well, and now I have 3 major ISA variants, without a clear winner:
    XG1: Oldest, cruftiest, but best code density;
    XG2: Intermediate, wonky/hacky encoding scheme;
    XG3: Cleaned up encodings, can coexist more easily with RISC-V.



    While it is theoretically possible to mix and match RV-C and XG3, there
    would be a bit of cruft involved here. Then again, ARM managed with the original Thumb.

    For now, XG3 uses the RISC-V ABI, but it is considered possible that
    an "XG3 native" ABI could be added which makes a few tweaks:
    F10..F17 become Arguments 9..16;
    F4..F7 become Callee Save (*).

    *:
    F0 ..F3 : Scratch
    F4 ..F9 : Callee Save
    F10..F17: Argument / Scratch
    F18..F27: Callee Save
    F28..F31: Scratch

    Thus, increasing the total available callee-save registers from 24 to 28
    (with 31 total scratch registers). Within the F registers, it would be a
    50/50 split (or 16/16 split).

    Though, for leaf functions (or probably for a compiler like GCC), it is possible that a 24 / 35 split is preferable.

    ...


    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch,alt.folklore.computers on Fri Aug 1 23:41:36 2025
    From Newsgroup: comp.arch

    In comp.arch Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Lars Poulsen <lars@cleo.beagle-ears.com> writes:
    In the days of VAX-11/780, it was "obvious" that operating systems would
    be written in assembler in order to be efficient, and the instruction
    set allowed high productivity for writing systems programs in "native" code.

    Yes. I don't think that the productivity would have suffered from a load/store architecture, though.

    As for a RISC-VAX: To little old naive me, it seems that it would have
    been possible to create an alternative microcode load that would be able
    to support a RISC ISA on the same hardware, if the idea had occurred to a
    well-connected group of graduate students. How good a RISC might have
    been feasible?

    Did the VAX 11/780 have writable microcode?

    Yes, 12 kB (2K words 96-bit each).

    Given that the VAX 11/780 was not (much) pipelined, I don't expect
    that using an alternative microcode that implements a RISC ISA would
    have performed well.

    Yes.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Peter Flass@Peter@Iron-Spring.com to comp.arch,alt.folklore.computers on Fri Aug 1 20:06:43 2025
    From Newsgroup: comp.arch

    On 8/1/25 11:11, Scott Lurndal wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Lars Poulsen <lars@cleo.beagle-ears.com> writes:
    In the days of VAX-11/780, it was "obvious" that operating systems would
    be written in assembler in order to be efficient, and the instruction
    set allowed high productivity for writing systems programs in "native"
    code.

    Yes. I don't think that the productivity would have suffered from a
    load/store architecture, though.

    As for a RISC-VAX: To little old naive me, it seems that it would have
    been possible to create an alternative microcode load that would be able
    to support a RISC ISA on the same hardware, if the idea had occurred to a
    well-connected group of graduate students. How good a RISC might have
    been feasible?

    Did the VAX 11/780 have writable microcode?

    Yes.


    Given that the VAX 11/780 was not (much) pipelined, I don't expect
    that using an alternative microcode that implements a RISC ISA would
    have performed well.

    A new ISA also requires development of the complete software
    infrastructure for building applications (compilers, linkers,
    assemblers); updating the OS, rebuilding existing applications
    for the new ISA, field and customer training, etc.

    Digital eventually did move VMS to Alpha, but it was neither
    cheap, nor easy. Most alpha customers were existing VAX
    customers - it's not clear that DEC actually grew the customer
    base by switching to Alpha.


    Wasn't PRISM/MICA supposed to solve this problem, or am I confusing it
    with something else?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Sat Aug 2 03:37:34 2025
    From Newsgroup: comp.arch

    On Fri, 1 Aug 2025 20:06:43 -0700, Peter Flass wrote:

    Wasn't PRISM/MICA supposed to solve this problem, or am I confusing it
    with something else?

    PRISM was going to be a new hardware architecture, and MICA the OS to run
    on it. Yes, they were supposed to solve the problem of where DEC was going
    to go since the VAX architecture was clearly being left in the dust by
    RISC.

    I think the MICA kernel was going to support the concept of “personalities”, so that a VMS-compatible environment could be implemented by one set of upper layers, while another set could provide Unix functionality.

    I think the project was taking too long, and not making enough progress.
    So DEC management cancelled the whole thing, and brought out a MIPS-based machine instead.

    The guy in charge got annoyed at the killing of his pet project and left
    in a huff. He took some of those ideas with him to his new employer, to
    create a new OS for them.

    The new employer was Microsoft. The guy in question was Dave Cutler. The
    OS they brought out was called “Windows NT”.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From ted@loft.tnolan.com (Ted Nolan@tednolan to comp.arch,alt.folklore.computers on Sat Aug 2 04:14:47 2025
    From Newsgroup: comp.arch

    In article <106k15u$qgip$6@dont-email.me>,
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Fri, 1 Aug 2025 20:06:43 -0700, Peter Flass wrote:

    Wasn't PRISM/MICA supposed to solve this problem, or am I confusing it
    with something else?

    PRISM was going to be a new hardware architecture, and MICA the OS to run
    on it. Yes, they were supposed to solve the problem of where DEC was going
    to go since the VAX architecture was clearly being left in the dust by
    RISC.

    I think the MICA kernel was going to support the concept of
    “personalities”, so that a VMS-compatible environment could be implemented
    by one set of upper layers, while another set could provide Unix
    functionality.

    I think the project was taking too long, and not making enough progress.
    So DEC management cancelled the whole thing, and brought out a MIPS-based
    machine instead.

    The guy in charge got annoyed at the killing of his pet project and left
    in a huff. He took some of those ideas with him to his new employer, to
    create a new OS for them.

    The new employer was Microsoft. The guy in question was Dave Cutler. The
    OS they brought out was called “Windows NT”.

    And it's *still* not finished!
    --
    columbiaclosings.com
    What's not in Columbia anymore..
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch,alt.folklore.computers on Fri Aug 1 21:35:26 2025
    From Newsgroup: comp.arch

    On 8/1/2025 9:14 PM, Ted Nolan <tednolan> wrote:
    In article <106k15u$qgip$6@dont-email.me>,
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:
    On Fri, 1 Aug 2025 20:06:43 -0700, Peter Flass wrote:

    Wasn't PRISM/MICA supposed to solve this problem, or am I confusing it
    with something else?

    PRISM was going to be a new hardware architecture, and MICA the OS to run
    on it. Yes, they were supposed to solve the problem of where DEC was going
    to go since the VAX architecture was clearly being left in the dust by
    RISC.

    I think the MICA kernel was going to support the concept of
    “personalities”, so that a VMS-compatible environment could be implemented
    by one set of upper layers, while another set could provide Unix
    functionality.

    I think the project was taking too long, and not making enough progress.
    So DEC management cancelled the whole thing, and brought out a MIPS-based
    machine instead.

    The guy in charge got annoyed at the killing of his pet project and left
    in a huff. He took some of those ideas with him to his new employer, to
    create a new OS for them.

    The new employer was Microsoft. The guy in question was Dave Cutler. The
    OS they brought out was called “Windows NT”.

    And it's *still* not finished!

    Well, what about:

    https://github.com/ZoloZiak/WinNT4

    Humm... A little?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch,alt.folklore.computers on Sat Aug 2 08:07:50 2025
    From Newsgroup: comp.arch

    In comp.arch Peter Flass <Peter@iron-spring.com> wrote:
    On 8/1/25 11:11, Scott Lurndal wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Lars Poulsen <lars@cleo.beagle-ears.com> writes:
    In the days of VAX-11/780, it was "obvious" that operating systems would
    be written in assembler in order to be efficient, and the instruction
    set allowed high productivity for writing systems programs in "native"
    code.

    Yes. I don't think that the productivity would have suffered from a
    load/store architecture, though.

    As for a RISC-VAX: To little old naive me, it seems that it would have
    been possible to create an alternative microcode load that would be able
    to support a RISC ISA on the same hardware, if the idea had occurred to a
    well-connected group of graduate students. How good a RISC might have
    been feasible?

    Did the VAX 11/780 have writable microcode?

    Yes.


    Given that the VAX 11/780 was not (much) pipelined, I don't expect
    that using an alternative microcode that implements a RISC ISA would
    have performed well.

    A new ISA also requires development of the complete software
    infrastructure for building applications (compilers, linkers,
    assemblers); updating the OS, rebuilding existing applications
    for the new ISA, field and customer training, etc.

    Digital eventually did move VMS to Alpha, but it was neither
    cheap, nor easy. Most alpha customers were existing VAX
    customers - it's not clear that DEC actually grew the customer
    base by switching to Alpha.


    Wasn't PRISM/MICA supposed to solve this problem, or am I confusing it
    with something else?

    IIUC PRISM eventually became Alpha. One piece of supporting software
    was a VAX emulator IIRC called FX11: it allowed running unmodified
    VAX binaries. Another supporting piece was Macro32, which effectively
    was a compiler from VAX assembly to Alpha binaries.

    One big selling point of Alpha was 64-bit architecture, but IIUC
    VMS was never fully ported to 64-bits, that is a lot of VMS
    software used 32-bit addresses and some system interfaces were
    32-bit only. OTOH Unix for Alpha was claimed to be pure 64-bit.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Al Kossow@aek@bitsavers.org to comp.arch,alt.folklore.computers on Sat Aug 2 01:48:39 2025
    From Newsgroup: comp.arch

    On 8/2/25 1:07 AM, Waldek Hebisch wrote:

    IIUC PRISM eventually became Alpha.

    Not really. Documents for both, including
    the rare PRISM docs are on bitsavers.
    PRISM came out of Cutler's DEC West group,
    Alpha from the East Coast. I'm not aware
    of any team member overlap.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch,alt.folklore.computers on Sat Aug 2 09:07:14 2025
    From Newsgroup: comp.arch

    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:

    And Windows on Alpha had a brief shining moment in the sun (no
    pun intended).

    Vobis (a German discount computer reseller) offered Alpha-based
    Windows boxes in 1993 and another model in 1997. Far too expensive
    for private users (cost was 9999 DM for the two models, the latter
    one with SCSI; IDE was cheaper, equivalent to ~10000 Euros today)
    for a machine with very limited software support.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Aug 2 09:02:37 2025
    From Newsgroup: comp.arch

    antispam@fricas.org (Waldek Hebisch) writes:
    I can understand why DEC abandoned VAX: already in 1985 they
    had some disadvantage and they saw no way to compete against
    superscalar machines which were on the horizon. In 1985 they
    probably realized, that their features add no value in world
    using optimizing compilers.

    Optimizing compilers increase the advantages of RISCs, but even with a
    simple compiler Berkeley RISC II (which was made by hardware people,
    not compiler people) has between 85% and 256% of VAX (11/780) speed.
    It also has 16-bit and 32-bit instructions for improved code density
    and (apparently due to memory bandwidth issues) performance.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Sat Aug 2 09:28:17 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Given that the VAX 11/780 was not (much) pipelined, I don't expect
    that using an alternative microcode that implements a RISC ISA would
    have performed well.

    A new ISA also requires development of the complete software
    infrastructure for building applications (compilers, linkers,
    assemblers); updating the OS, rebuilding existing applications
    for the new ISA, field and customer training, etc.

    The VAX was a new ISA, a followon to the PDP-11, which was different
    in many respects (e.g., 16-bit instruction granularity on PDP-11,
    8-bit granularity on VAX). In my RISC-VAX scenario, the RISC-VAX
    would be the PDP-11 followon instead of the actual (CISC) VAX, so
    there would be no additional ISA.

    Digital eventually did move VMS to Alpha, but it was neither
    cheap, nor easy. Most alpha customers were existing VAX
    customers - it's not clear that DEC actually grew the customer
    base by switching to Alpha.

    Our group had no VAX in the years before we bought our first Alphas in
    1995, but we had DecStations. My recommendation in 1995 was to go for
    Linux on Pentium, but the Alpha camp won, and we ran OSF/1 on them for
    some years. Later we ran Linux on our Alphas, and eventually we
    switched to Linux on Intel and AMD.

    As for the VAX-Alpha transition, there were two reasons for the
    switch:

    1) Performance, and that cost DEC customers since RISCs were
    introduced in the mid-1980s. DecStations were introduced to reduce
    this bleeding, but of course this meant that these customers were
    not VAX customers.

    2) The transition to 64 bits. Almost everyone in the workstation
    market introduced hardware for that in the 1990s: MIPS R4000 in
    1991 (MIPS III architecture); DEC Alpha 21064 in 1992; SPARCv9
    (specification) 1993 with first implementation 1995; HP PA-8000
    1995; PowerPC 620 1997 (originally planned earlier); "The original
    goal year for delivering the first [IA-64] product, Merced, was
    1998." I think, though, that for many customers that need arose
    only in the 2000s; e.g., our last Alpha (bought in the year 2000)
    only has 1GB of RAM, so a 64-bit architecture was not necessary for
    us until a few years later, maybe 2005.

    DEC obviously failed to convert its thriving VAX business from the
    1980s into a sustainable Alpha business. Maybe the competitive
    landscape was such that they would have run into problems in any case;
    DEC were not alone in getting problems. OTOH, HP was a mini and
    workstation manufacturer that replaced its CISC architectures with
    RISC early, and managed to survive (and buy Compaq, which had bought
    DEC), although it eventually abandoned its own RISC architecture as
    well as IA-64, the intended successor.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch,alt.folklore.computers on Sat Aug 2 15:29:38 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    1) Performance, and that cost DEC customers since RISCs were
    introduced in the mid-1980s. DecStations were introduced to reduce
    this bleeding, but of course this meant that these customers were
    not VAX customers.

    Or, even more importantly, VMS customers.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Aug 2 15:33:07 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    But, it seems to have a few obvious weak points for RISC-V:
    Crappy with arrays;
    Crappy with code with lots of large immediate values;
    Crappy with code which mostly works using lots of global variables;
    Say, for example, a lot of Apogee / 3D Realms code;
    They sure do like using lots of global variables.
    id Software also likes globals, but not as much.
    ...

    Let's see:

    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
      long i, r;
      for (i=0, r=0; i<n; i++)
        r+=v[i];
      return r;
    }

    long a, b, c, d;

    void globals(void)
    {
      a = 0x1234567890abcdefL;
      b = 0xcdef1234567890abL;
      c = 0x567890abcdef1234L;
      d = 0x5678901234abcdefL;
    }

    gcc-10.3 -Wall -O2 compiles this to the following RV64GC code:

    0000000000010434 <arrays>:
    10434: cd81 beqz a1,1044c <arrays+0x18>
    10436: 058e slli a1,a1,0x3
    10438: 87aa mv a5,a0
    1043a: 00b506b3 add a3,a0,a1
    1043e: 4501 li a0,0
    10440: 6398 ld a4,0(a5)
    10442: 07a1 addi a5,a5,8
    10444: 953a add a0,a0,a4
    10446: fed79de3 bne a5,a3,10440 <arrays+0xc>
    1044a: 8082 ret
    1044c: 4501 li a0,0
    1044e: 8082 ret

    0000000000010450 <globals>:
    10450: 8201b583 ld a1,-2016(gp) # 12020 <__SDATA_BEGIN__>
    10454: 8281b603 ld a2,-2008(gp) # 12028 <__SDATA_BEGIN__+0x8>
    10458: 8301b683 ld a3,-2000(gp) # 12030 <__SDATA_BEGIN__+0x10>
    1045c: 8381b703 ld a4,-1992(gp) # 12038 <__SDATA_BEGIN__+0x18>
    10460: 86b1b423 sd a1,-1944(gp) # 12068 <a>
    10464: 86c1b023 sd a2,-1952(gp) # 12060 <b>
    10468: 84d1bc23 sd a3,-1960(gp) # 12058 <c>
    1046c: 84e1b823 sd a4,-1968(gp) # 12050 <d>
    10470: 8082 ret

    When using -Os, arrays becomes 2 bytes shorter, but the inner loop
    becomes longer.

    gcc-12.2 -Wall -O2 -falign-labels=1 -falign-loops=1 -falign-jumps=1 -falign-functions=1
    compiles this to the following AMD64 code:

    000000001139 <arrays>:
    1139: 48 85 f6 test %rsi,%rsi
    113c: 74 13 je 1151 <arrays+0x18>
    113e: 48 8d 14 f7 lea (%rdi,%rsi,8),%rdx
    1142: 31 c0 xor %eax,%eax
    1144: 48 03 07 add (%rdi),%rax
    1147: 48 83 c7 08 add $0x8,%rdi
    114b: 48 39 d7 cmp %rdx,%rdi
    114e: 75 f4 jne 1144 <arrays+0xb>
    1150: c3 ret
    1151: 31 c0 xor %eax,%eax
    1153: c3 ret

    000000001154 <globals>:
    1154: 48 b8 ef cd ab 90 78 movabs $0x1234567890abcdef,%rax
    115b: 56 34 12
    115e: 48 89 05 cb 2e 00 00 mov %rax,0x2ecb(%rip) # 4030 <a>
    1165: 48 b8 ab 90 78 56 34 movabs $0xcdef1234567890ab,%rax
    116c: 12 ef cd
    116f: 48 89 05 b2 2e 00 00 mov %rax,0x2eb2(%rip) # 4028 <b>
    1176: 48 b8 34 12 ef cd ab movabs $0x567890abcdef1234,%rax
    117d: 90 78 56
    1180: 48 89 05 99 2e 00 00 mov %rax,0x2e99(%rip) # 4020 <c>
    1187: 48 b8 ef cd ab 34 12 movabs $0x5678901234abcdef,%rax
    118e: 90 78 56
    1191: 48 89 05 80 2e 00 00 mov %rax,0x2e80(%rip) # 4018 <d>
    1198: c3 ret

    gcc-10.2 -Wall -O2 -falign-labels=1 -falign-loops=1 -falign-jumps=1 -falign-functions=1
    compiles this to the following ARM A64 code:

    0000000000000734 <arrays>:
    734: b4000121 cbz x1, 758 <arrays+0x24>
    738: aa0003e2 mov x2, x0
    73c: d2800000 mov x0, #0x0 // #0
    740: 8b010c43 add x3, x2, x1, lsl #3
    744: f8408441 ldr x1, [x2], #8
    748: 8b010000 add x0, x0, x1
    74c: eb03005f cmp x2, x3
    750: 54ffffa1 b.ne 744 <arrays+0x10> // b.any
    754: d65f03c0 ret
    758: d2800000 mov x0, #0x0 // #0
    75c: d65f03c0 ret

    0000000000000760 <globals>:
    760: d299bde2 mov x2, #0xcdef // #52719
    764: b0000081 adrp x1, 11000 <__cxa_finalize@GLIBC_2.17>
    768: f2b21562 movk x2, #0x90ab, lsl #16
    76c: 9100e020 add x0, x1, #0x38
    770: f2cacf02 movk x2, #0x5678, lsl #32
    774: d2921563 mov x3, #0x90ab // #37035
    778: f2e24682 movk x2, #0x1234, lsl #48
    77c: f9001c22 str x2, [x1, #56]
    780: d2824682 mov x2, #0x1234 // #4660
    784: d299bde1 mov x1, #0xcdef // #52719
    788: f2aacf03 movk x3, #0x5678, lsl #16
    78c: f2b9bde2 movk x2, #0xcdef, lsl #16
    790: f2a69561 movk x1, #0x34ab, lsl #16
    794: f2c24683 movk x3, #0x1234, lsl #32
    798: f2d21562 movk x2, #0x90ab, lsl #32
    79c: f2d20241 movk x1, #0x9012, lsl #32
    7a0: f2f9bde3 movk x3, #0xcdef, lsl #48
    7a4: f2eacf02 movk x2, #0x5678, lsl #48
    7a8: f2eacf01 movk x1, #0x5678, lsl #48
    7ac: a9008803 stp x3, x2, [x0, #8]
    7b0: f9000c01 str x1, [x0, #24]
    7b4: d65f03c0 ret

    So, the overall sizes (including data size for globals() on RV64GC) are:

    arrays globals Architecture
    28 66 (34+32) RV64GC
    27 69 AMD64
    44 84 ARM A64

    So RV64GC is smallest for the globals/large-immediate test here, and
    only beaten by one byte by AMD64 for the array test. Looking at the
    code generated for the inner loop of arrays(), all the inner loops
    contain four instructions, so certainly in this case RV64GC is not
    crappier than the others. Interestingly, the reasons for using four instructions (rather than five) are different on these architectures:

    * RV64GC uses a compare-and-branch instruction.
    * AMD64 uses a load-and-add instruction.
    * ARM A64 uses an auto-increment instruction.

    NetBSD has both RV32GC and RV64GC binaries, and there is no consistent
    advantage of RV32GC over RV64GC there:

    NetBSD numbers from <2025Mar4.093916@mips.complang.tuwien.ac.at>:

    libc ksh pax ed
    1102054 124726 66218 26226 riscv-riscv32
    1077192 127050 62748 26550 riscv-riscv64


    I guess it can be noted, is the overhead of any ELF metadata being
    excluded?...

    These are sizes of the .text section extracted with objdump -h. So
    no, these numbers do not include ELF metadata, nor the sizes of other
    sections. The latter may be relevant, because RV64GC has "immediates"
    in .sdata that other architectures have in .text; however, .sdata can
    contain other things than just "immediates", so one cannot just add the
    .sdata size to the .text size.

    Granted, newer compilers do support newer versions of the C standard,
    and also typically get better performance.

    The latter is not the case in my experience, except in cases where autovectorization succeeds (but I also have seen a horrible slowdown
    from auto-vectorization).

    There is one other improvement: gcc register allocation has improved
    in recent years to a point where we 1) no longer need explicit
    register allocation for Gforth on AMD64, and 2) with a lot of manual
    help, we could increase the number of stack cache registers from 1 to
    3 on AMD64, which gives some speedups typically in the 0%-20% range in
    Gforth.

    But, e.g., for the example from <http://www.complang.tuwien.ac.at/anton/lvas/effizienz/tsp.html>,
    which is vectorizable, I still have not been able to get gcc to
    auto-vectorize it, even with some transformations which should help.
    I have not measured the scalar versions again, but given that there
    were no consistent speedups between gcc-2.7 (1995) and gcc-5.2 (2015),
    I doubt that I will see consistent speedups with newer gcc (or clang)
    versions.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Aug 2 15:15:12 2025
    From Newsgroup: comp.arch

    On 8/2/2025 10:33 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    But, it seems to have a few obvious weak points for RISC-V:
    Crappy with arrays;
    Crappy with code with lots of large immediate values;
    Crappy with code which mostly works using lots of global variables;
    Say, for example, a lot of Apogee / 3D Realms code;
    They sure do like using lots of global variables.
    id Software also likes globals, but not as much.
    ...

    Let's see:

    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
    long i, r;
    for (i=0, r=0; i<n; i++)
    r+=v[i];
    return r;
    }

    long a, b, c, d;

    void globals(void)
    {
    a = 0x1234567890abcdefL;
    b = 0xcdef1234567890abL;
    c = 0x567890abcdef1234L;
    d = 0x5678901234abcdefL;
    }

    gcc-10.3 -Wall -O2 compiles this to the following RV64GC code:

    0000000000010434 <arrays>:
    10434: cd81 beqz a1,1044c <arrays+0x18>
    10436: 058e slli a1,a1,0x3
    10438: 87aa mv a5,a0
    1043a: 00b506b3 add a3,a0,a1
    1043e: 4501 li a0,0
    10440: 6398 ld a4,0(a5)
    10442: 07a1 addi a5,a5,8
    10444: 953a add a0,a0,a4
    10446: fed79de3 bne a5,a3,10440 <arrays+0xc>
    1044a: 8082 ret
    1044c: 4501 li a0,0
    1044e: 8082 ret


    So, 7 words.


    What if I manually translate to XG3?:
    arrays:
    MOV 0, X14
    MOV 0, X13
    BLE X11, X0, .L0
    .L1:
    MOV.Q (X10, X13), X12
    ADD 1, X13
    ADD X12, X14
    BLT X11, X13, .L1
    .L0:
    MOV X14, X10
    RTS

    OK, 9 words.

    If I added the pair-packing feature, could potentially be reduced to 7
    words (4 instructions could be merged into 2 words).



    Checking with godbolt.org:
    arrays:
    beq a1,zero,.L49
    slli a1,a1,3
    mv a5,a0
    add a1,a0,a1
    li a0,0
    .L48:
    ld a4,0(a5)
    addi a5,a5,8
    add a0,a0,a4
    bne a5,a1,.L48
    ret
    .L49:
    li a0,0
    ret

    So, basically the same.


    0000000000010450 <globals>:
    10450: 8201b583 ld a1,-2016(gp) # 12020 <__SDATA_BEGIN__>
    10454: 8281b603 ld a2,-2008(gp) # 12028 <__SDATA_BEGIN__+0x8>
    10458: 8301b683 ld a3,-2000(gp) # 12030 <__SDATA_BEGIN__+0x10>
    1045c: 8381b703 ld a4,-1992(gp) # 12038 <__SDATA_BEGIN__+0x18>
    10460: 86b1b423 sd a1,-1944(gp) # 12068 <a>
    10464: 86c1b023 sd a2,-1952(gp) # 12060 <b>
    10468: 84d1bc23 sd a3,-1960(gp) # 12058 <c>
    1046c: 84e1b823 sd a4,-1968(gp) # 12050 <d>
    10470: 8082 ret

    When using -Os, arrays becomes 2 bytes shorter, but the inner loop
    becomes longer.


    I had not usually seen globals handled this way in RV with GCC...

    When I throw it at godbolt.org, I see:
    globals:
    li a1,593920
    addi a1,a1,-1347
    li a2,38178816
    li a5,-209993728
    li a0,863748096
    li a3,1450741760
    li a4,725372928
    slli a1,a1,12
    addi a2,a2,-1329
    addi a5,a5,1165
    li a7,1450741760
    addi a0,a0,1165
    addi a3,a3,171
    addi a4,a4,-2039
    li a6,883675136
    addi a1,a1,-529
    addi a7,a7,171
    slli a0,a0,2
    slli a2,a2,35
    slli a5,a5,34
    slli a3,a3,32
    slli a4,a4,33
    addi a6,a6,-529
    add a2,a2,a1
    add a5,a5,a7
    add a3,a3,a0
    lui t1,%hi(a)
    lui a7,%hi(b)
    lui a0,%hi(c)
    add a4,a4,a6
    lui a1,%hi(d)
    sd a2,%lo(a)(t1)
    sd a5,%lo(b)(a7)
    sd a3,%lo(c)(a0)
    sd a4,%lo(d)(a1)
    ret

    So, seems GCC has decided to generate inline constants and use split
    Abs32 addressing in this case...


    If using -fPIC:
    globals:
    ld a7,.LC0
    ld a0,.LC1
    ld a2,.LC2
    ld a4,.LC3
    la a6,a
    la a1,b
    la a3,c
    la a5,d
    sd a7,0(a6)
    sd a0,0(a1)
    sd a2,0(a3)
    sd a4,0(a5)
    ret

    Checking "la" IIRC expands to an AUIPC+LD pair (using PC-relative
    addressing to load the address of the variable from the GOT).

    It is also using an AUIPC+LD pair for each constant.


    gcc-12.2 -Wall -O2 -falign-labels=1 -falign-loops=1 -falign-jumps=1 -falign-functions=1
    compiles this to the following AMD64 code:

    000000001139 <arrays>:
    1139: 48 85 f6 test %rsi,%rsi
    113c: 74 13 je 1151 <arrays+0x18>
    113e: 48 8d 14 f7 lea (%rdi,%rsi,8),%rdx
    1142: 31 c0 xor %eax,%eax
    1144: 48 03 07 add (%rdi),%rax
    1147: 48 83 c7 08 add $0x8,%rdi
    114b: 48 39 d7 cmp %rdx,%rdi
    114e: 75 f4 jne 1144 <arrays+0xb>
    1150: c3 ret
    1151: 31 c0 xor %eax,%eax
    1153: c3 ret


    27 bytes, ~ 6.75 words.

    Though, I was more talking about i386 having good code density, not so
    much x86-64.

    Though, in this example, i386 would do worse than x86-64 (while it has no
    REX prefixes, it would be operating out of memory).


    000000001154 <globals>:
    1154: 48 b8 ef cd ab 90 78 movabs $0x1234567890abcdef,%rax
    115b: 56 34 12
    115e: 48 89 05 cb 2e 00 00 mov %rax,0x2ecb(%rip) # 4030 <a>
    1165: 48 b8 ab 90 78 56 34 movabs $0xcdef1234567890ab,%rax
    116c: 12 ef cd
    116f: 48 89 05 b2 2e 00 00 mov %rax,0x2eb2(%rip) # 4028 <b>
    1176: 48 b8 34 12 ef cd ab movabs $0x567890abcdef1234,%rax
    117d: 90 78 56
    1180: 48 89 05 99 2e 00 00 mov %rax,0x2e99(%rip) # 4020 <c>
    1187: 48 b8 ef cd ab 34 12 movabs $0x5678901234abcdef,%rax
    118e: 90 78 56
    1191: 48 89 05 80 2e 00 00 mov %rax,0x2e80(%rip) # 4018 <d>
    1198: c3 ret


    Here it doesn't get the "free pass" of having the constants elsewhere.

    On i386, would need to use two loads and two stores for each 64 bit
    value. Though, 'long' is 32-bit on i386.



    gcc-10.2 -Wall -O2 -falign-labels=1 -falign-loops=1 -falign-jumps=1 -falign-functions=1
    compiles this to the following ARM A64 code:

    0000000000000734 <arrays>:
    734: b4000121 cbz x1, 758 <arrays+0x24>
    738: aa0003e2 mov x2, x0
    73c: d2800000 mov x0, #0x0 // #0
    740: 8b010c43 add x3, x2, x1, lsl #3
    744: f8408441 ldr x1, [x2], #8
    748: 8b010000 add x0, x0, x1
    74c: eb03005f cmp x2, x3
    750: 54ffffa1 b.ne 744 <arrays+0x10> // b.any
    754: d65f03c0 ret
    758: d2800000 mov x0, #0x0 // #0
    75c: d65f03c0 ret


    11 words.


    0000000000000760 <globals>:
    760: d299bde2 mov x2, #0xcdef // #52719
    764: b0000081 adrp x1, 11000 <__cxa_finalize@GLIBC_2.17>
    768: f2b21562 movk x2, #0x90ab, lsl #16
    76c: 9100e020 add x0, x1, #0x38
    770: f2cacf02 movk x2, #0x5678, lsl #32
    774: d2921563 mov x3, #0x90ab // #37035
    778: f2e24682 movk x2, #0x1234, lsl #48
    77c: f9001c22 str x2, [x1, #56]
    780: d2824682 mov x2, #0x1234 // #4660
    784: d299bde1 mov x1, #0xcdef // #52719
    788: f2aacf03 movk x3, #0x5678, lsl #16
    78c: f2b9bde2 movk x2, #0xcdef, lsl #16
    790: f2a69561 movk x1, #0x34ab, lsl #16
    794: f2c24683 movk x3, #0x1234, lsl #32
    798: f2d21562 movk x2, #0x90ab, lsl #32
    79c: f2d20241 movk x1, #0x9012, lsl #32
    7a0: f2f9bde3 movk x3, #0xcdef, lsl #48
    7a4: f2eacf02 movk x2, #0x5678, lsl #48
    7a8: f2eacf01 movk x1, #0x5678, lsl #48
    7ac: a9008803 stp x3, x2, [x0, #8]
    7b0: f9000c01 str x1, [x0, #24]
    7b4: d65f03c0 ret

    So, the overall sizes (including data size for globals() on RV64GC) are:

    arrays globals Architecture
    28 66 (34+32) RV64GC
    27 69 AMD64
    44 84 ARM A64

    So RV64GC is smallest for the globals/large-immediate test here, and
    only beaten by one byte by AMD64 for the array test. Looking at the
    code generated for the inner loop of arrays(), all the inner loops
    contain four instructions, so certainly in this case RV64GC is not
    crappier than the others. Interestingly, the reasons for using four instructions (rather than five) are different on these architectures:


    These are micro-examples...

    Makes more sense to compare something bigger.

    As noted, RISC-V often does well in micro-examples, but I have seen it
    not necessarily do the best for larger examples.


    For example, Doom often gets the smallest binary sizes on i386...
    Though, I don't have many other test programs set up for an
    ISA-by-ISA comparison.

    After setting up some basic tests here for ROTT:
    x86-64 : 410K (gcc 13.2.0, "-Os", ...)
    RV64GC : 505K
    XG1 : 555K
    XG2 : 590K
    RV64G : 630K
    X64 : 1475K (VS2022)

    So, in this case RV64GC got a 20% reduction vs RV64G, which was enough
    to beat out my own ISA on code size it seems.


    Seemingly ROTT is a little more friendly to RISC-V than my Doom port is.
    Maybe Doom was just not a particularly friendly test case for RV64?...



    Using current settings:
    -I. -I$(TKCLPATH)/include -L$(TKCLPATH) \
    -march=rv64gc \
    -nostdinc -nostdlib -nostartfiles -fno-builtin \
    -fwrapv -fno-strict-aliasing -fno-inline \
    -mabi=lp64 -mtune=sifive-s76 \
    -mno-strict-align \
    -Wmaybe-uninitialized -Wuninitialized -Os -fcommon -std=gnu99

    -lclib_rv64 -ltkgdi_rv64 -fPIC -shared \
    -ffunction-sections -fdata-sections -Wl,-gc-sections

    The "-std=gnu99" was partly because it seems like now GCC wants to take
    a harder stance by default on missing prototypes and similar.


    Granted, the x86-64 build "cheats" here slightly by not using the same static-linked C library (in this case it was using the native dynamically-linked glibc).

    I guess this example might be a bit closer if controlling for the C library.


    Technically, -fPIC or similar is needed to get RV64 code to run on
    TestKern, as otherwise it is incompatible with how TestKern uses the
    address space.

    Just building the Linux version for RV64 won't work as the RV64 cross
    compiler lacks any builds of SDL or similar.


    * RV64GC uses a compare-and-branch instruction.
    * AMD64 uses a load-and-add instruction.
    * ARM A64 uses an auto-increment instruction.


    OK.


    NetBSD has both RV32GC and RV64GC binaries, and there is no consistent
    advantage of RV32GC over RV64GC there:

    NetBSD numbers from <2025Mar4.093916@mips.complang.tuwien.ac.at>:

    libc ksh pax ed
    1102054 124726 66218 26226 riscv-riscv32
    1077192 127050 62748 26550 riscv-riscv64


    I guess it can be noted, is the overhead of any ELF metadata being
    excluded?...

    These are sizes of the .text section extracted with objdump -h. So
    no, these numbers do not include ELF metadata, nor the sizes of other sections. The latter may be relevant, because RV64GC has "immediates"
    in .sdata that other architectures have in .text; however, .sdata can
    contain other things than just "immediates", so one cannot just add the .sdata size to the .text size.


    OK.

    Just making sure mostly.


    Granted, newer compilers do support newer versions of the C standard,
    and also typically get better performance.

    The latter is not the case in my experience, except in cases where autovectorization succeeds (but I also have seen a horrible slowdown
    from auto-vectorization).

    There is one other improvement: gcc register allocation has improved
    in recent years to a point where we 1) no longer need explicit
    register allocation for Gforth on AMD64, and 2) with a lot of manual
    help, we could increase the number of stack cache registers from 1 to
    3 on AMD64, which gives some speedups typically in the 0%-20% range in Gforth.

    But, e.g., for the example from <http://www.complang.tuwien.ac.at/anton/lvas/effizienz/tsp.html>,
    which is vectorizable, I still have not been able to get gcc to auto-vectorize it, even with some transformations which should help.
    I have not measured the scalar versions again, but given that there
    were no consistent speedups between gcc-2.7 (1995) and gcc-5.2 (2015),
    I doubt that I will see consistent speedups with newer gcc (or clang) versions.


    As noted, the issue is particularly notable with MSVC, where modern
    versions generate particularly bloated (though faster) binaries, with
    support for C99 features (whereas, say, VS2008 is mostly limited to
    C89/C95-level functionality).

    It does mostly disable auto-vectorization if using /O1 or /Os, but the binaries are still quite bloated.


    The difference for GCC seems to be much smaller; it at least supports newer
    C standards, but is more prone to break old code, and sometimes even refuses
    to compile it without jumping through hoops.


    - anton


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Peter Flass@Peter@Iron-Spring.com to comp.arch,alt.folklore.computers on Sat Aug 2 15:33:15 2025
    From Newsgroup: comp.arch

    On 8/2/25 08:29, Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    1) Performance, and that cost DEC customers since RISCs were
    introduced in the mid-1980s. DecStations were introduced to reduce
    this bleeding, but of course this meant that these customers were
    not VAX customers.

    Or, even more importantly, VMS customers.

    I guess I'm getting DecStations and VaxStations mixed up. Maybe one of
    their problems was brand confusion.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Sat Aug 2 23:08:39 2025
    From Newsgroup: comp.arch

    On Sat, 2 Aug 2025 08:07:50 -0000 (UTC), Waldek Hebisch wrote:

    One big selling point of Alpha was 64-bit architecture, but IIUC
    VMS was never fully ported to 64-bits, that is a lot of VMS
    software used 32-bit addresses and some system interfaces were
    32-bit only. OTOH Unix for Alpha was claimed to be pure 64-bit.

    Of the four main OSes for Alpha, the only fully-64-bit ones were DEC Unix
    and Linux. OpenVMS was a hybrid 32/64-bit implementation, and Windows NT
    was 32-bit-only.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Sat Aug 2 23:17:34 2025
    From Newsgroup: comp.arch

    On Sat, 2 Aug 2025 15:33:15 -0700, Peter Flass wrote:

    I guess I'm getting DecStations and VaxStations mixed up. Maybe one of
    their problems was brand confusion.

    Wot fun.

    “VAXstation” = graphical workstation with VAX processor.

    “DECstation” = short-lived DEC machine range with MIPS processor.

    “DECserver” = dedicated terminal server running LAT protocol.

    “DECmate” = one of their 3 different PC families. This one was based around a PDP-8-compatible processor.

    “VAXmate” = a quick look at the docs indicates this was some kind of Microsoft-PC-compatible, bundled with extra DEC-specific connectivity features.

    Any others ... ?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Sat Aug 2 23:20:37 2025
    From Newsgroup: comp.arch

    On Sat, 02 Aug 2025 09:28:17 GMT, Anton Ertl wrote:

    In my RISC-VAX scenario, the RISC-VAX would be the PDP-11 followon
    instead of the actual (CISC) VAX, so there would be no additional
    ISA.

    In order to be RISC, it would have had to add registers and remove
    addressing modes from the non-load/store instructions (and replace “move” with separate “load” and “store” instructions). “No additional ISA” or
    not, it would still have broken existing code.

    Remember that VAX development started in the early-to-mid-1970s. RISC was still nothing more than a research idea at that point, which had yet to
    prove itself.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Sat Aug 2 23:21:18 2025
    From Newsgroup: comp.arch

    On Sat, 2 Aug 2025 09:07:14 -0000 (UTC), Thomas Koenig wrote:

    Vobis (a German discount computer reseller) offered Alpha-based Windows
    boxes in 1993 and another model in 1997. Far too expensive for private
    users ...

    And what a waste of a 64-bit architecture, to run it in 32-bit-only
    mode ...
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Aug 2 18:55:51 2025
    From Newsgroup: comp.arch

    On 8/2/2025 3:15 PM, BGB wrote:
    On 8/2/2025 10:33 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    But, it seems to have a few obvious weak points for RISC-V:
       Crappy with arrays;
       Crappy with code with lots of large immediate values;
       Crappy with code which mostly works using lots of global variables;
         Say, for example, a lot of Apogee / 3D Realms code;
         They sure do like using lots of global variables.
         id Software also likes globals, but not as much.
       ...

    Let's see:

    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
       long i, r;
       for (i=0, r=0; i<n; i++)
         r+=v[i];
       return r;
    }


    ...


    What if I manually translate to XG3?:
      arrays:
      MOV    0, X14
      MOV    0, X13
      BLE    X11, X0, .L0
      .L1:
      MOV.Q  (X10, X13), X12
      ADD    1, X13
      ADD    X12, X14
      BLT    X11, X13, .L1
      .L0:
      MOV    X14, X10
      RTS

    OK, 9 words.

    If I added the pair-packing feature, could potentially be reduced to 7
    words (4 instructions could be merged into 2 words).


    So, somehow I had a brain fart here and used X register notation rather
    than R register notation, though both are equivalent in the context of
    XG3; X belongs with RV ASM syntax, and R with the syntax style I was
    using. So, a bit of a screw-up on my part.


    Technically, this would also map 1:1 to the RV+Jx mode.
    If BGBCC supported RV+Jx+C mode, this would be reduced to 6 words (24
    bytes).


    But, what does BGBCC generate for RV mode?:
    arrays:
    ADD RQ0, 0, RQ13
    ADD RQ0, 0, RQ12
    .L008010FA:
    BRGEU.Q RQ11, RQ13, .L008010FC
    SHAD.Q RQ13, 3, R5
    ADD RQ10, R5, R5
    MOV.Q (R5, 0), RQ17
    ADD RQ12, RQ17, RQ12
    ADD RQ13, 1, RQ13
    BSR .L008010FA, R0
    .L008010FC:
    ADD RQ12, 0, RQ10
    .L00C00EB5:
    JSR R1, 0, R0

    OK, so a fairly naive solution.


    With RV+JX:
    arrays:
    ADD RQ0, 0, RQ13
    ADD RQ0, 0, RQ12
    .L008010FA:
    BRGEU.Q RQ11, RQ13, .L008010FC
    MOV.Q (RQ10, RQ13), RQ17
    ADD RQ12, RQ17, RQ12
    ADD RQ13, 1, RQ13
    BSR .L008010FA, R0
    .L008010FC:
    ADD RQ12, 0, RQ10
    .L00C008D9:
    JSR R1, 0, R0


    So, seems what I generated by hand was pretty close to what BGBCC spits
    out in this case...

    Disasm version:
    arrays: //@006A48
    0000.0693 ADD R0, 0, R13
    0000.0613 ADD R0, 0, R12
    .L008010FA: //@006A50
    .reloc .L008010FC 34/RELW12_RVI
    00B6.F063 BRGEU.Q R13, R11, 0
    36D5.38AF MOV.Q (R10, R13), R17
    0116.0633 ADD R12, R17, R12
    0016.8693 ADD R13, 1, R13
    .reloc .L008010FA 35/RELW20_RVI
    0000.006F BSR 0, R0
    .L008010FC: //@006A64
    0006.0513 ADD R12, 0, R10
    .L00C008D9: //@006A68
    0000.8067 JMP (R1, 0), R0

    Note that XG3 output would likely be similar, just with some instructons
    that are not valid RV instructions.

    Note that 36D538AF is not a valid encoding in normal RV, as my extended variant had encoded some of these ops via the AMO block.

    As of yet, BGBCC doesn't fully support the RV-C extension (partly
    implemented, but would need more work to "actually work").

    Then I note some inconsistencies in operand ordering, but not all of
    this is necessarily consistent...

    As for RQnn vs RDnn vs Rnn ... Early on I thought I was going to go in a direction more like Intel-style x86 ASM and use register types to encode
    the operation. Didn't go that way, but BGBCC still sorta associates
    types with register IDs even if in most cases they don't do much.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch,alt.folklore.computers on Sat Aug 2 23:10:56 2025
    From Newsgroup: comp.arch

    Lawrence D'Oliveiro [2025-08-02 23:21:18] wrote:
    On Sat, 2 Aug 2025 09:07:14 -0000 (UTC), Thomas Koenig wrote:
    Vobis (a German discount computer reseller) offered Alpha-based Windows
    boxes in 1993 and another model in 1997. Far too expensive for private
    users ...
    And what a waste of a 64-bit architecture, to run it in 32-bit-only
    mode ...

    What do you mean by that? IIUC, the difference between 32bit and 64bit
    (in terms of cost of designing and producing the CPU) was very small.
    MIPS happily designed their R4000 as 64bit while knowing that most of
    them would never get a chance to execute an instruction that makes use
    of the upper 32bits.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Sun Aug 3 09:14:10 2025
    From Newsgroup: comp.arch

    On Sat, 02 Aug 2025 23:10:56 -0400, Stefan Monnier wrote:

    Lawrence D'Oliveiro [2025-08-02 23:21:18] wrote:

    On Sat, 2 Aug 2025 09:07:14 -0000 (UTC), Thomas Koenig wrote:

    Vobis (a German discount computer reseller) offered Alpha-based
    Windows boxes in 1993 and another model in 1997. Far too expensive
    for private users ...

    And what a waste of a 64-bit architecture, to run it in 32-bit-only
    mode ...

    What do you mean by that?

    Of all the major OSes for Alpha, Windows NT was the only one
    that couldn’t take advantage of the 64-bit architecture.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Peter Flass@Peter@Iron-Spring.com to comp.arch,alt.folklore.computers on Sun Aug 3 07:41:14 2025
    From Newsgroup: comp.arch

    On 8/3/25 02:14, Lawrence D'Oliveiro wrote:
    On Sat, 02 Aug 2025 23:10:56 -0400, Stefan Monnier wrote:

    Lawrence D'Oliveiro [2025-08-02 23:21:18] wrote:

    On Sat, 2 Aug 2025 09:07:14 -0000 (UTC), Thomas Koenig wrote:

    Vobis (a German discount computer reseller) offered Alpha-based
    Windows boxes in 1993 and another model in 1997. Far too expensive
    for private users ...

    And what a waste of a 64-bit architecture, to run it in 32-bit-only
    mode ...

    What do you mean by that?

    Of all the major OSes for Alpha, Windows NT was the only one
    that couldn’t take advantage of the 64-bit architecture.

    At that point they should have renamed it "Windows OT".

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Sun Aug 3 16:42:20 2025
    From Newsgroup: comp.arch

    antispam@fricas.org (Waldek Hebisch) writes:
    In comp.arch Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Did the VAX 11/780 have writable microcode?

    Yes, 12 kB (2K words 96-bit each).

    So that's 12KB of fast RAM that could have been reused for making the
    cache larger in a RISC-VAX, maybe increasing its size from 2KB to
    12KB.

    Followups set to comp.arch. Change it if you think this is still
    on-topic for afc.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Sun Aug 3 16:51:10 2025
    From Newsgroup: comp.arch

    antispam@fricas.org (Waldek Hebisch) writes:
    One piece of supporting software
    was a VAX emulator IIRC called FX11: it allowed running unmodified
    VAX binaries.

    There was also a static binary translator for DecStation binaries. I
    never used it, but a colleague tried to. He found that on the Prolog
    systems that he tried it with (I think it was Quintus or SICStus), it
    did not work, because that system did unusual things with the binary,
    and that did not work on the result of the binary translation. Moral
    of the story: Better use dynamic binary translation (which Apple did
    for their 68K->PowerPC transition at around the same time).

    OTOH Unix for Alpha was claimed to be pure 64-bit.

    It depends on the kind of purity you are aspiring to. After a bunch
    of renamings it was finally called Tru64 UNIX. Not Pur64, but
    Tru64:-) Before that, it was called Digital UNIX (but once DEC had
    been bought by Compaq, that was no longer appropriate), and before
    that, DEC OSF/1 AXP.

    The C environment for DEC OSF/1 was an I32LP64 setup, not an ILP64
    setup, so can you really call it pure?

    In addition there were some OS features for running ILP32 programs,
    similar to Linux' MAP_32BIT flag for mmap(). IIRC Netscape Navigator
    was compiled as ILP32 program (the C compiler had a flag for that),
    and needed these OS features.
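
    As a minimal sketch of the Linux analogue mentioned above (my own
    illustration, specific to Linux/x86-64, not the Tru64 mechanism):
    MAP_32BIT asks mmap() to place the mapping in the low 2 GB of the
    address space, so that 32-bit pointers can still reach it.

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        /* MAP_32BIT: keep the anonymous mapping below the 2 GB mark. */
        void *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }
        printf("mapped at %p\n", p);
        return 0;
    }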

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Mon Aug 4 00:04:54 2025
    From Newsgroup: comp.arch

    On Sun, 03 Aug 2025 16:51:10 GMT, Anton Ertl wrote:

    The C environment for DEC OSF/1 was an I32LP64 setup, not an ILP64
    setup, so can you really call it pure?

    As far as I’m aware, I32LP64 is the standard across 64-bit *nix systems.

    Microsoft’s compilers for 64-bit Windows do LLP64. Not aware of any platforms that do/did ILP64.
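
    As a quick illustration of the difference between these data models (a
    minimal sketch, not from the post): on an I32LP64 system this prints
    4/8/8, on an LLP64 system 4/4/8, and a hypothetical ILP64 system would
    print 8/8/8.

    #include <stdio.h>

    int main(void)
    {
        /* Illustration: print the basic type sizes for this platform's data model. */
        printf("int: %zu, long: %zu, void*: %zu\n",
               sizeof(int), sizeof(long), sizeof(void *));
        return 0;
    }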
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch,alt.folklore.computers on Sun Aug 3 21:07:02 2025
    From Newsgroup: comp.arch

    On 8/3/2025 7:04 PM, Lawrence D'Oliveiro wrote:
    On Sun, 03 Aug 2025 16:51:10 GMT, Anton Ertl wrote:

    The C environment for DEC OSF/1 was an I32LP64 setup, not an ILP64
    setup, so can you really call it pure?

    As far as I’m aware, I32LP64 is the standard across 64-bit *nix systems.

    Microsoft’s compilers for 64-bit Windows do LLP64. Not aware of any platforms that do/did ILP64.

    Yeah, pretty much nothing does ILP64, and doing so would actually be a problem.

    Also, C type names:
    char : 8 bit
    short : 16 bit
    int : 32 bit
    long : 64 bit
    long long: 64 bit

    If 'int' were 64 bits, then what about 16- and/or 32-bit types?
    short short?
    long short?
    ...

    Current system seems preferable.
    Well, at least in absence of maybe having the compiler specify actual fixed-size types.

    Or, say, what if there was a world where the actual types were, say:
    _Int8, _Int16, _Int32, _Int64, _Int128
    And, then, say:
    char, short, int, long, ...
    Were seen as aliases.

    Well, maybe along with __int64 and friends, but __int64 and _Int64 could
    be seen as equivalent.


    Then of course, the "stdint.h" types.
    Traditionally, these are a bunch of typedef's to the 'int' and friends.
    But, one can imagine a hypothetical world where stdint.h contained
    things like, say:
    typedef _Int32 int32_t;


    ...
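
    For reference, a minimal sketch of what the traditional approach looks
    like on an I32LP64 platform (illustrative only, not any particular
    libc's actual header):

    /* hypothetical <stdint.h> fragment for an I32LP64 target */
    typedef signed char  int8_t;
    typedef short        int16_t;
    typedef int          int32_t;
    typedef long         int64_t;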


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Peter Flass@Peter@Iron-Spring.com to comp.arch,alt.folklore.computers on Sun Aug 3 20:39:52 2025
    From Newsgroup: comp.arch

    On 8/3/25 19:07, BGB wrote:
    On 8/3/2025 7:04 PM, Lawrence D'Oliveiro wrote:
    On Sun, 03 Aug 2025 16:51:10 GMT, Anton Ertl wrote:

    The C environment for DEC OSF/1 was an I32LP64 setup, not an ILP64
    setup, so can you really call it pure?

    As far as I’m aware, I32LP64 is the standard across 64-bit *nix systems.
    Microsoft’s compilers for 64-bit Windows do LLP64. Not aware of any
    platforms that do/did ILP64.

    Yeah, pretty much nothing does ILP64, and doing so would actually be a problem.

    Also, C type names:
      char     :  8 bit
      short    : 16 bit
      int      : 32 bit
      long     : 64 bit
      long long: 64 bit

    If 'int' were 64-bits, then what about 16 and/or 32 bit types.
      short short?
      long short?
      ...

    Current system seems preferable.
    Well, at least in absence of maybe having the compiler specify actual fixed-size types.

    Or, say, what if there was a world where the actual types were, say:
      _Int8, _Int16, _Int32, _Int64, _Int128
    And, then, say:
      char, short, int, long, ...
    Were seen as aliases.

    Well, maybe along with __int64 and friends, but __int64 and _Int64 could
    be seen as equivalent.


    Then of course, the "stdint.h" types.
    Traditionally, these are a bunch of typedef's to the 'int' and friends.
    But, one can imagine a hypothetical world where stdint.h contained
    things like, say:
      typedef _Int32 int32_t;



    Like PL/I which lets you specify any precision: FIXED BINARY(31), FIXED BINARY(63) etc.

    C keeps borrowing more and more PL/I features.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Mon Aug 4 04:50:11 2025
    From Newsgroup: comp.arch

    On Sun, 3 Aug 2025 20:39:52 -0700, Peter Flass wrote:

    C keeps borrowing more and more PL/I features.

    Struct definitions with level numbers??
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch,alt.folklore.computers on Mon Aug 4 12:19:38 2025
    From Newsgroup: comp.arch

    On Sun, 3 Aug 2025 21:07:02 -0500
    BGB <cr88192@gmail.com> wrote:
    On 8/3/2025 7:04 PM, Lawrence D'Oliveiro wrote:
    On Sun, 03 Aug 2025 16:51:10 GMT, Anton Ertl wrote:

    The C environment for DEC OSF/1 was an I32LP64 setup, not an ILP64
    setup, so can you really call it pure?

    As far as I’m aware, I32LP64 is the standard across 64-bit *nix
    systems.

    Microsoft’s compilers for 64-bit Windows do LLP64. Not aware of any platforms that do/did ILP64.

    Yeah, pretty much nothing does ILP64, and doing so would actually be
    a problem.

    Also, C type names:
    char : 8 bit
    short : 16 bit
    int : 32 bit
    Except in embedded, where 16-bit ints are not rare.
    long : 64 bit
    Except for the majority of the world, where long is 32-bit.
    long long: 64 bit

    If 'int' were 64-bits, then what about 16 and/or 32 bit types.
    short short?
    long short?
    ...

    Current system seems preferable.
    Well, at least in absence of maybe having the compiler specify actual fixed-size types.

    Or, say, what if there was a world where the actual types were, say:
    _Int8, _Int16, _Int32, _Int64, _Int128
    And, then, say:
    char, short, int, long, ...
    Were seen as aliases.

    Actually, in our world the latest C standard (C23) has them, but the
    spelling is different: _BitInt(32) and unsigned _BitInt(32).
    I'm not sure if any major compiler already has them implemented. Bing
    copilot says that clang does, but I don't tend to believe everything Bing
    copilot says.
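
    For what it's worth, a minimal sketch of the C23 syntax in question
    (recent clang releases, and gcc 14 on some targets, are reported to
    accept it; the widths used here are just for illustration):

    #include <stdio.h>

    int main(void)
    {
        _BitInt(32) a = 123456789;          /* 32-bit bit-precise integer */
        unsigned _BitInt(12) b = 0x7ff;     /* 12-bit unsigned quantity   */
        a += (_BitInt(32))b;
        printf("%d %u\n", (int)a, (unsigned)b);
        return 0;
    }
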
    Well, maybe along with __int64 and friends, but __int64 and _Int64
    could be seen as equivalent.


    Then of course, the "stdint.h" types.
    Traditionally, these are a bunch of typedef's to the 'int' and
    friends. But, one can imagine a hypothetical world where stdint.h
    contained things like, say:
    typedef _Int32 int32_t;


    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch,alt.folklore.computers on Mon Aug 4 12:35:01 2025
    From Newsgroup: comp.arch

    On Sun, 3 Aug 2025 20:39:52 -0700
    Peter Flass <Peter@Iron-Spring.com> wrote:
    On 8/3/25 19:07, BGB wrote:
    On 8/3/2025 7:04 PM, Lawrence D'Oliveiro wrote:
    On Sun, 03 Aug 2025 16:51:10 GMT, Anton Ertl wrote:

    The C environment for DEC OSF/1 was an I32LP64 setup, not an ILP64
    setup, so can you really call it pure?

    As far as I’m aware, I32LP64 is the standard across 64-bit *nix
    systems.

    Microsoft’s compilers for 64-bit Windows do LLP64. Not aware of any
    platforms that do/did ILP64.

    Yeah, pretty much nothing does ILP64, and doing so would actually
    be a problem.

    Also, C type names:
      char     :  8 bit
      short    : 16 bit
      int      : 32 bit
      long     : 64 bit
      long long: 64 bit

    If 'int' were 64-bits, then what about 16 and/or 32 bit types.
      short short?
      long short?
      ...

    Current system seems preferable.
    Well, at least in absence of maybe having the compiler specify
    actual fixed-size types.

    Or, say, what if there was a world where the actual types were, say:
      _Int8, _Int16, _Int32, _Int64, _Int128
    And, then, say:
      char, short, int, long, ...
    Were seen as aliases.

    Well, maybe along with __int64 and friends, but __int64 and _Int64
    could be seen as equivalent.


    Then of course, the "stdint.h" types.
    Traditionally, these are a bunch of typedef's to the 'int' and
    friends. But, one can imagine a hypothetical world where stdint.h
    contained things like, say:
      typedef _Int32 int32_t;



    Like PL/I which lets you specify any precision: FIXED BINARY(31),
    FIXED BINARY(63) etc.

    C23 does not let you specify any precision.
    The implementation defines BITINT_MAXWIDTH, which, according to my
    understanding (I didn't read the standard), is allowed to be quite
    small.
    It seems that in real life BITINT_MAXWIDTH >= 128 will be supported
    on all platforms that go to the trouble of implementing complete C23,
    even on 32-bit hardware.
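
    As a concrete sketch (mine, not from the post), code that wants a wide
    type can test the limit from <limits.h> before relying on it:

      #include <limits.h>

      #if defined(BITINT_MAXWIDTH) && BITINT_MAXWIDTH >= 128
      typedef unsigned _BitInt(128) u128bi;   /* only if the implementation permits 128 bits */
      #endif
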
    C keeps borrowing more and more PL/I features.

    How can we know that the feature is borrowed from PL/I and not from one of
    the few other languages that had similar features?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch,alt.folklore.computers on Mon Aug 4 12:42:44 2025
    From Newsgroup: comp.arch

    On Sun, 03 Aug 2025 16:51:10 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    antispam@fricas.org (Waldek Hebisch) writes:
    One piece of supporting software
    was a VAX emulator IIRC called FX11: it allowed running unmodified
    VAX binaries.

    There was also a static binary translator for DecStation binaries. I
    never used it, but a colleague tried to. He found that on the Prolog
    systems that he tried it with (I think it was Quintus or SICStus), it
    did not work, because that system did unusual things with the binary,
    and that did not work on the result of the binary translation. Moral
    of the story: Better use dynamic binary translation (which Apple did
    for their 68K->PowerPC transition at around the same time).


    IIRC, the x86-to-Alpha translator was dynamic. Supposedly, VAX-to-Alpha
    was also dynamic.
    Maybe MIPS-to-Alpha was static simply because it had much lower
    priority within DEC?

    OTOH Unix for Alpha was claimed to be pure 64-bit.

    It depends on the kind of purity you are aspiring to. After a bunch
    of renamings it was finally called Tru64 UNIX. Not Pur64, but
    Tru64:-) Before that, it was called Digital UNIX (but once DEC had
    been bought by Compaq, that was no longer appropriate), and before
    that, DEC OSF/1 AXP.

    The C environment for DEC OSF/1 was an I32LP64 setup, not an ILP64
    setup, so can you really call it pure?

    In addition there were some OS features for running ILP32 programs,
    similar to Linux' MAP_32BIT flag for mmap(). IIRC Netscape Navigator
    was compiled as ILP32 program (the C compiler had a flag for that),
    and needed these OS features.
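
    A minimal sketch of the Linux-side analogue mentioned above (my example,
    not DEC's mechanism; assumes glibc on x86-64 with _GNU_SOURCE):

      #define _GNU_SOURCE
      #include <sys/mman.h>

      /* MAP_32BIT asks for the mapping to be placed in the low 2 GiB, so a
         32-bit-clean program can store the resulting pointer; returns
         MAP_FAILED on error. */
      static void *map_low_page(void)
      {
          return mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
      }
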

    - anton


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Al Kossow@aek@bitsavers.org to comp.arch,alt.folklore.computers on Mon Aug 4 03:32:52 2025
    From Newsgroup: comp.arch

    On 8/4/25 2:42 AM, Michael S wrote:

    May be, MIPS-to-Alpha was static simply because it had much lower
    priority within DEC?

    MIPS products came out of DECWRL (the research group
    started to build Titan) and were stopgaps until
    the "real" architecture came out (Cutler's, out of DECWest).
    I don't think they ever got much love from DEC corporate;
    they were just done so DEC didn't completely get their
    lunch eaten in the Unix workstation market.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Mon Aug 4 12:09:32 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    Actually, in our world the latest C standard (C23) has them, but the
    spelling is different: _BitInt(32) and unsigned _BitInt(32).
    I'm not sure if any major compiler already has them implemented. Bing
    copilot says that clang does, but I don't tend to believe everything Bing copilot says.

    I asked godbolt, and tried the following program:

    typedef ump unsigned _BitInt(65535);

    ump sum3(ump a, ump b, ump c)
    {
    return a+b+c;
    }

    and for the C setting gcc-15.1 AMD64 produces 129 lines of assembly
    language code; for C++ it complains about the syntax. For 65536 bits,
    it complains about being beyond the maximum number of 65535 bits.

    For the same program with the C setting clang-20.1 produces 29547
    lines of assembly language code; that's more than 28 instructions for
    every 64-bit word of output, which seems excessive to me, even if you
    don't use ADX instructions (which clang apparently does not); I expect
    that clang will produce better code at some point in the future.
    Compiling this function also takes noticeable time, and when I ask for
    1000000 bits, clang still does complain about too many bits, but
    godbolt's timeout strikes; I finally found out clang's limit: 8388608
    bits. On clang-20.1 the C++ setting also accepts this kind of input.

    Followups set to comp.arch.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Mon Aug 4 13:42:32 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    May be, MIPS-to-Alpha was static simply because it had much lower
    priority within DEC?

    Skimming the article on "Binary Translation" in Digital Technical
    Journal Vol. 4 No. 4, 1992 <https://dn790009.ca.archive.org/0/items/bitsavers_decdtjdtjv_19086731/dtj_v04-04_1992.pdf>,
    it seems that both VEST (VAX VMS->Alpha VMS) and mx (MIPS Ultrix ->
    Alpha OSF/1) used a hybrid approach. These binary translators took an
    existing binary for one system and produced a binary for the other
    system, but included a run-time system that would do binary
    translation of run-time-generated code.

    But for the Prolog system that did not work with mx the problem was
    that the binary looked different (IIRC Ultrix uses a.out format, and
    Digital OSF/1 used a different binary format), so the run-time
    component of the binary translator did not help.

    What would have been needed for that is a way to run the MIPS-Ultrix
    binary as-is, with the binary translation coming in out-of-band,
    either completely at run-time, or with the static part of the
    translated code looked up based on the original binary and loaded into
    address space beyond the reach of the 32-bit MIPS architecture
    supported by Ultrix.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch,alt.folklore.computers on Mon Aug 4 14:22:14 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 3 Aug 2025 21:07:02 -0500
    BGB <cr88192@gmail.com> wrote:


    Except for majority of the world where long is 32 bit


    What majority? Linux owns the server market, the
    appliance market and much of the handset market (which apple
    dominates with their OS). And all Unix/Linux systems have
    64-bit longs on 64-bit CPUs.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch,alt.folklore.computers on Mon Aug 4 16:46:03 2025
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 3 Aug 2025 21:07:02 -0500
    BGB <cr88192@gmail.com> wrote:


    Except for majority of the world where long is 32 bit


    What majority? Linux owns the server market, the
    appliance market and much of the handset market (which apple
    dominates with their OS). And all Unix/Linux systems have
    64-bit longs on 64-bit CPUs.

    Apple/iPhone might dominate in the US market (does it?), but in the rest
    of the world Android (with linux) is far larger. World total is 72%
    Android, 28% iOS.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Aug 4 14:51:41 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Michael S <already5chosen@yahoo.com> writes:
    Actually, in our world the latest C standard (C23) has them, but the spelling is different: _BitInt(32) and unsigned _BitInt(32).
    I'm not sure if any major compiler already has them implemented. Bing copilot says that clang does, but I don't tend to believe everything Bing copilot says.

    I asked godbolt, and tried the following program:

    typedef ump unsigned _BitInt(65535);

    The actual compiling version is:

    typedef unsigned _BitInt(65535) ump;

    ump sum3(ump a, ump b, ump c)
    {
    return a+b+c;
    }

    and for the C setting gcc-15.1 AMD64 produces 129 lines of assembly
    language code; for C++ it complains about the syntax. For 65536 bits,
    it complains about being beyond the maximum number of 65535 bits.

    For the same program with the C setting clang-20.1 produces 29547
    lines of assembly language code; that's more than 28 instructions for
    every 64-bit word of output, which seems excessive to me, even if you
    don't use ADX instructions (which clang apparently does not); I expect
    that clang will produce better code at some point in the future.
    Compiling this function also takes noticable time, and when I ask for
    1000000 bits, clang still does complain about too many bits, but
    godbolt's timeout strikes; I finally found out clang's limit: 8388608
    bits. On clang-20.1 the C++ setting also accepts this kind of input.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch,alt.folklore.computers on Mon Aug 4 15:05:51 2025
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 3 Aug 2025 21:07:02 -0500
    BGB <cr88192@gmail.com> wrote:


    Except for majority of the world where long is 32 bit


    What majority? Linux owns the server market, the
    appliance market and much of the handset market (which apple
    dominates with their OS). And all Unix/Linux systems have
    64-bit longs on 64-bit CPUs.

    Apple/iPhone might dominate in the US market (does it?), but in the rest
    of the world Android (with linux) is far larger. World total is 72%
    Android, 28% iOS.

    Good point, thanks.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch,alt.folklore.computers on Mon Aug 4 18:07:48 2025
    From Newsgroup: comp.arch

    On Mon, 04 Aug 2025 14:22:14 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 3 Aug 2025 21:07:02 -0500
    BGB <cr88192@gmail.com> wrote:


    Except for majority of the world where long is 32 bit


    What majority? Linux owns the server market, the
    appliance market and much of the handset market (which apple
    dominates with their OS). And all Unix/Linux systems have
    64-bit longs on 64-bit CPUs.

    The majority of the world is embedded. The overwhelming majority of embedded is
    32-bit or narrower.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch,comp.lamg.c on Mon Aug 4 18:25:35 2025
    From Newsgroup: comp.arch

    On Mon, 04 Aug 2025 12:09:32 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    Actually, in our world the latest C standard (C23) has them, but the >spelling is different: _BitInt(32) and unsigned _BitInt(32).
    I'm not sure if any major compiler already has them implemented. Bing >copilot says that clang does, but I don't tend to believe eveything
    Bing copilot says.

    I asked godbolt, and tried the following program:


    It turned out that I didn't even need godbolt.
    I already had sufficiently advanced gcc and clang installed on my new
    Windows PC at work. I probably have them installed on my very old home PC
    as well, but right now I am at work. Playing with compilers instead of
    working.

    typedef ump unsigned _BitInt(65535);

    ump sum3(ump a, ump b, ump c)
    {
    return a+b+c;
    }

    and for the C setting gcc-15.1 AMD64 produces 129 lines of assembly
    language code; for C++ it complains about the syntax. For 65536 bits,
    it complains about being beyond the maximum number of 65535 bits.

    For the same program with the C setting clang-20.1 produces 29547
    lines of assembly language code; that's more than 28 instructions for
    every 64-bit word of output, which seems excessive to me, even if you
    don't use ADX instructions (which clang apparently does not); I expect
    that clang will produce better code at some point in the future.
    Compiling this function also takes noticable time, and when I ask for
    1000000 bits, clang still does complain about too many bits, but
    godbolt's timeout strikes; I finally found out clang's limit: 8388608
    bits. On clang-20.1 the C++ setting also accepts this kind of input.

    Followups set to comp.arch.

    - anton

    I didn't pay much attention to code size yet. Had seen two other
    worrying things.

    1. Both gcc and clang happily* accept _BitInt() syntax even when
    -std=c17 or lower. Isn't there a potential name clash for existing
    sources that use _BitInt() as the name of a function? I should think
    more about it.

    2. Windows-specific.
    gcc and clang appear to have different ABIs for return value of type _BitInt(128).


    * - the only sign of less than perfect happiness is a warning produced
    with -pedantic flag.


    Cross-posted to c.lang.c

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Ames@commodorejohn@gmail.com to comp.arch,alt.folklore.computers on Mon Aug 4 08:32:19 2025
    From Newsgroup: comp.arch

    On Sat, 02 Aug 2025 23:10:56 -0400
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:

    And what a waste of a 64-bit architecture, to run it in 32-bit-only
    mode ...

    What do you mean by that? IIUC, the difference between 32bit and
    64bit (in terms of cost of designing and producing the CPU) was very
    small. MIPS happily designed their R4000 as 64bit while knowing that
    most of them would never get a chance to execute an instruction that
    makes use of the upper 32bits.

    This notion that the only advantage of a 64-bit architecture is a large
    address space is very curious to me. Obviously that's *one* advantage,
    but while I don't know the in-the-field history of heavy-duty business/ scientific computing the way some folks here do, I have not gotten the impression that a lot of customers were commonly running up against the
    4 GB limit in the early '90s; meanwhile, the *other* advantage - higher performance for the same MIPS on a variety of compute-bound tasks - is
    being overlooked entirely, it seems.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch,alt.folklore.computers on Mon Aug 4 15:32:55 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 04 Aug 2025 14:22:14 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Sun, 3 Aug 2025 21:07:02 -0500
    BGB <cr88192@gmail.com> wrote:


    Except for majority of the world where long is 32 bit


    What majority? Linux owns the server market, the
    appliance market and much of the handset market (which apple
    dominates with their OS). And all Unix/Linux systems have
    64-bit longs on 64-bit CPUs.

    Majority of the world is embedded. Ovewhelming majority of embedded is
    32-bit or narrower.

    in terms of shipped units, perhaps (although many are narrower, as you
    point out). In terms of programmers, it's a fairly small fraction that
    do embedded programming.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Andy Burns@usenet@andyburns.uk to comp.arch,alt.folklore.computers on Mon Aug 4 16:50:41 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:

    Skimming the article on "Binary Translation" in Digital Technical
    Journal Vol. 4 No. 4, 1992 <https://dn790009.ca.archive.org/0/items/bitsavers_decdtjdtjv_19086731/dtj_v04-04_1992.pdf>,
    it seems that both VEST (VAX VMS->Alpha VMS) and mx (MIPS Ultrix ->
    Alpha OSF/1) used a hybrid approach. These binary translators took an existing binary for one system and produced a binary for the the other system, but included a run-time system that would do binary
    translation of run-time-generated code.

    Since our systems were recently re-written from BASIC to C, we didn't
    look at the translator, just recompiled for Alpha.

    There was the FX!32 runtime that allowed Intel windows programs to run
    under Alpha NT, never looked at Windows on Alpha at all.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Aug 4 15:11:38 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Michael S <already5chosen@yahoo.com> writes:
    Actually, in our world the latest C standard (C23) has them, but the spelling is different: _BitInt(32) and unsigned _BitInt(32).
    I'm not sure if any major compiler already has them implemented. Bing copilot says that clang does, but I don't tend to believe everything Bing copilot says.

    I asked godbolt, and tried the following program:

    typedef ump unsigned _BitInt(65535);

    The actual compiling version is:

    typedef unsigned _BitInt(65535) ump;

    ump sum3(ump a, ump b, ump c)
    {
    return a+b+c;
    }

    and for the C setting gcc-15.1 AMD64 produces 129 lines of assembly language code; for C++ it complains about the syntax. For 65536 bits,
    it complains about being beyond the maximum number of 65535 bits.

    For the same program with the C setting clang-20.1 produces 29547
    lines of assembly language code; that's more than 28 instructions for
    every 64-bit word of output, which seems excessive to me, even if you
    don't use ADX instructions (which clang apparently does not);

    I forgot to enable optimization. With -Os (and similarly with -O2),
    the result takes 9328 instructions, about 9 per output word, which
    seems appropriate if you don't use ADX:

    ...
    load a word from a
    adc a word from b
    store a word to result
    ...
    #when you are done with a+b
    ...
    load a word from result
    adc a word from c
    store a word to result
    ...

    Ok, I only produce 6 instructions per word of result (and could
    improve by one instruction with an RMW instruction, but apparently
    clang does not use that). clang produces a lot of additional moves,
    maybe some problem with register allocation in this huge basic block.
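
    In plain C, the per-word scheme outlined above looks roughly like this (my
    sketch; the compilers generate their adc-based code directly from _BitInt):

      #include <stddef.h>
      #include <stdint.h>

      /* r = a + b over n 64-bit limbs, propagating the carry by hand; a good
         compiler can map the carry updates onto adc. Safe for r == a. */
      static void add_limbs(uint64_t *r, const uint64_t *a,
                            const uint64_t *b, size_t n)
      {
          uint64_t carry = 0;
          for (size_t i = 0; i < n; i++) {
              uint64_t s  = a[i] + carry;
              uint64_t c1 = (s < carry);       /* carry out of adding the carry-in */
              r[i] = s + b[i];
              uint64_t c2 = (r[i] < b[i]);     /* carry out of adding b[i] */
              carry = c1 + c2;                 /* at most 1 */
          }
      }

      /* sum3 as two passes: add_limbs(r, a, b, n); add_limbs(r, r, c, n); */
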

    gcc produces 49 instructions with -Os, with a loop containing 20
    instructions for two result words (10 instructions per result word):

    .L2:
    add cl, -1
    mov rcx, QWORD PTR [rsp+8+rax]
    adc rcx, QWORD PTR [rsp+8200+rax]
    setc r8b
    add dl, -1
    adc rcx, QWORD PTR [rsp+16392+rax]
    mov rdx, QWORD PTR [rsp+16+rax]
    setc sil
    add r8b, -1
    adc rdx, QWORD PTR [rsp+8208+rax]
    mov QWORD PTR [rdi+rax], rcx
    setc cl
    add sil, -1
    mov rsi, rdx
    adc rsi, QWORD PTR [rsp+16400+rax]
    mov QWORD PTR [rdi+8+rax], rsi
    setc dl
    add rax, 16
    cmp rax, 8176
    jne .L2

    So gcc performs both additions in one loop, and one word of additions
    after the other, so reifies the carry in a register after every adc.

    At least for this code I don't see why gcc implements a
    BITINT_MAXWIDTH that's smaller than the largest number; the same
    approach would work with arbitrarily large _BitInts. Maybe one of the
    other operations has problems with big _BitInts.

    The ideal for sum3() would be to use ADX:

    load word from a
    adcx word from b
    adox word from c
    store word to result

    But gcc produces no ADX instructions even with -Os -march=x86-64-v4.
    Clang does not produce ADX instructions, either.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Aug 4 19:00:13 2025
    From Newsgroup: comp.arch

    On Mon, 04 Aug 2025 14:51:41 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Michael S <already5chosen@yahoo.com> writes:
    Actually, in our world the latest C standard (C23) has them, but the spelling is different: _BitInt(32) and unsigned _BitInt(32).
    I'm not sure if any major compiler already has them implemented.
    Bing copilot says that clang does, but I don't tend to believe
    eveything Bing copilot says.

    I asked godbolt, and tried the following program:

    typedef ump unsigned _BitInt(65535);

    The actual compiling version is:

    typedef unsigned _BitInt(65535) ump;

    ump sum3(ump a, ump b, ump c)
    {
    return a+b+c;
    }

    and for the C setting gcc-15.1 AMD64 produces 129 lines of assembly language code; for C++ it complains about the syntax. For 65536
    bits, it complains about being beyond the maximum number of 65535
    bits.

    For the same program with the C setting clang-20.1 produces 29547
    lines of assembly language code; that's more than 28 instructions for
    every 64-bit word of output, which seems excessive to me, even if you
    don't use ADX instructions (which clang apparently does not); I
    expect that clang will produce better code at some point in the
    future. Compiling this function also takes noticable time, and when
    I ask for 1000000 bits, clang still does complain about too many
    bits, but godbolt's timeout strikes; I finally found out clang's
    limit: 8388608 bits. On clang-20.1 the C++ setting also accepts
    this kind of input.

    - anton

    On my PC with the following flags '-S -O -pedantic -std=c23 -march=native', compilation was not too slow - approximately 200 msec for gcc, 600 msec
    for clang. In the case of gcc, most of the time was likely consumed by
    the [anti]virus rather than by the compiler itself.
    Sizes (instructions only, directives and labels removed) were as
    following:

    N      gcc-win64  gcc-sysv  clang-win64
    65472         56        46        10041
    65535         71        62        10050




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Aug 4 19:04:16 2025
    From Newsgroup: comp.arch

    On Mon, 4 Aug 2025 19:00:13 +0300
    Michael S <already5chosen@yahoo.com> wrote:

    On Mon, 04 Aug 2025 14:51:41 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Michael S <already5chosen@yahoo.com> writes:
    Actually, in our world the latest C standard (C23) has them, but
    the spelling is different: _BitInt(32) and unsigned _BitInt(32).
    I'm not sure if any major compiler already has them implemented.
    Bing copilot says that clang does, but I don't tend to believe everything Bing copilot says.

    I asked godbolt, and tried the following program:

    typedef ump unsigned _BitInt(65535);

    The actual compiling version is:

    typedef unsigned _BitInt(65535) ump;

    ump sum3(ump a, ump b, ump c)
    {
    return a+b+c;
    }

    and for the C setting gcc-15.1 AMD64 produces 129 lines of assembly language code; for C++ it complains about the syntax. For 65536
    bits, it complains about being beyond the maximum number of 65535
    bits.

    For the same program with the C setting clang-20.1 produces 29547
    lines of assembly language code; that's more than 28 instructions
    for every 64-bit word of output, which seems excessive to me, even
    if you don't use ADX instructions (which clang apparently does
    not); I expect that clang will produce better code at some point
    in the future. Compiling this function also takes noticable time,
    and when I ask for 1000000 bits, clang still does complain about
    too many bits, but godbolt's timeout strikes; I finally found out
    clang's limit: 8388608 bits. On clang-20.1 the C++ setting also
    accepts this kind of input.

    - anton

    On my PC with following flags '-S -O -pedantic -std=c23 -march=native' compilation was not too slow - approximately 200 msec for gcc, 600
    msec for clang. In case of gcc, most of the time was likely consumed
    by [anti]virus rather than by compiler itself.
    Sizes (instructions only, directives and labels removed) were as
    following:

    N gcc-win64 gcc-sysv clang-win64
    65472 56 46 10041
    65535 71 62 10050





    For reference, the shortest result:

    .file "tst3.c"
    .text
    .globl foo
    .def foo; .scl 2; .type 32; .endef
    foo:
    pushq %rbp
    pushq %rbx
    movq %rdi, %rdx
    leaq 24(%rsp), %rcx
    leaq 8208(%rsp), %rbx
    leaq 16392(%rsp), %r11
    movq %rdi, %r10
    leaq 8200(%rsp), %rbp
    movl $0, %r9d
    movl $0, %r8d
    .L2:
    addb $-1, %r8b
    movq (%rcx), %rax
    adcq (%rbx), %rax
    setc %dil
    addb $-1, %r9b
    adcq (%r11), %rax
    setc %sil
    movq %rax, (%r10)
    addb $-1, %dil
    movq 8(%rcx), %rax
    adcq 8(%rbx), %rax
    setc %dil
    movzbl %dil, %edi
    movq %rdi, %r8
    addb $-1, %sil
    adcq 8(%r11), %rax
    setc %sil
    movzbl %sil, %esi
    movq %rsi, %r9
    movq %rax, 8(%r10)
    addq $16, %rcx
    addq $16, %rbx
    addq $16, %r11
    addq $16, %r10
    cmpq %rbp, %rcx
    jne .L2
    addb $-1, %dil
    movq 16384(%rsp), %rax
    adcq 8200(%rsp), %rax
    addb $-1, %sil
    adcq 24568(%rsp), %rax
    movq %rax, 8176(%rdx)
    movq %rdx, %rax
    popq %rbx
    popq %rbp
    ret
    .ident "GCC: (Rev7, Built by MSYS2 project) 15.1.0"


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch,alt.folklore.computers on Mon Aug 4 11:47:41 2025
    From Newsgroup: comp.arch

    On 8/4/2025 10:32 AM, John Ames wrote:
    On Sat, 02 Aug 2025 23:10:56 -0400
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:

    And what a waste of a 64-bit architecture, to run it in 32-bit-only
    mode ...

    What do you mean by that? IIUC, the difference between 32bit and
    64bit (in terms of cost of designing and producing the CPU) was very
    small. MIPS happily designed their R4000 as 64bit while knowing that
    most of them would never get a chance to execute an instruction that
    makes use of the upper 32bits.

    This notion that the only advantage of a 64-bit architecture is a large address space is very curious to me. Obviously that's *one* advantage,
    but while I don't know the in-the-field history of heavy-duty business/ scientific computing the way some folks here do, I have not gotten the impression that a lot of customers were commonly running up against the
    4 GB limit in the early '90s; meanwhile, the *other* advantage - higher performance for the same MIPS on a variety of compute-bound tasks - is
    being overlooked entirely, it seems.


    Yeah.

    Using 64-bit values mostly for data manipulation, but with a 32 bit
    address space, also makes a lot of sense.

    In my project, ATM, the main reason I went to using a 48 bit address
    space was mostly because I was also using a global address space; and 32
    bit gets cramped pretty quick. Also, 48 bit means more space for 16 tag
    bits.
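
    A minimal sketch of such a layout (my illustration, not BGB's actual
    encoding): a 64-bit word holding a 48-bit address plus a 16-bit tag.

      #include <stdint.h>

      #define ADDR48_MASK ((UINT64_C(1) << 48) - 1)

      static inline uint64_t pack_tagged(uint64_t addr, uint16_t tag)
      { return (addr & ADDR48_MASK) | ((uint64_t)tag << 48); }

      static inline uint64_t tagged_addr(uint64_t w) { return w & ADDR48_MASK; }
      static inline uint16_t tagged_tag (uint64_t w) { return (uint16_t)(w >> 48); }
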

    For smaller configurations, it can make sense to drop back down to 32
    bits, possibly with a 24-bit physical space if lacking a DDR RAM chip or similar.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.arch,comp.lang.c on Mon Aug 4 09:53:51 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 04 Aug 2025 12:09:32 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    [...]
    typedef ump unsigned _BitInt(65535);

    The correct syntax is :

    typedef unsigned _BitInt(65535) ump;

    ump sum3(ump a, ump b, ump c)
    {
    return a+b+c;
    }

    [...]

    1. Both gcc and clang happily* accept _BitInt() syntax even when
    -std=c17 or lower. Is not here a potential name clash for existing
    sources that use _BitInt() as a name of the function? I should think
    more about it.

    In C17 and earlier, _BitInt is a reserved identifier. Any attempt to
    use it has undefined behavior. That's exactly why new keywords are
    often defined with that ugly syntax.
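
    A hypothetical pre-C23 source illustrates the clash being discussed (my
    example): the declaration is syntactically valid C17, but the identifier
    is reserved, and under C23 (or under gcc/clang, which accept the keyword
    even in C17 mode, as noted earlier in this thread) it no longer parses.

      /* C17: reserved identifier, so undefined behavior; C23: syntax error,
         because _BitInt is now a keyword. */
      unsigned _BitInt(unsigned x) { return x & 0xFFFFu; }
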

    Both gcc and clang warn about _BitInt when invoked with "-std=c17 -pedantic".

    [...]

    * - the only sign of less than perfect happiness is a warning produced
    with -pedantic flag.

    Yes, both are behaving reasonably. If you don't use "-pedantic",
    you're telling the compiler you don't want standard conformance.
    (I'd be happier if conformance were the default, but we're stuck
    with it.) But accepting _BitInt in pre-C23 mode is conforming.

    Cross-posted to c.lang.c

    I've kept the cross-post to comp.lang.c and comp.arch.
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Aug 4 16:58:04 2025
    From Newsgroup: comp.arch

    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:

    In C17 and earlier, _BitInt is a reserved identifier. Any attempt to
    use it has undefined behavior. That's exactly why new keywords are
    often defined with that ugly syntax.

    Sometimes I think there is reason to Fortran's approach of not
    having defined keywords - old programs just continue to run, even
    with new statements or intrinsic procedures, maybe with an addition
    of an EXTERNAL statement.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch,alt.folklore.computers on Mon Aug 4 11:59:57 2025
    From Newsgroup: comp.arch

    On 8/3/2025 10:39 PM, Peter Flass wrote:
    On 8/3/25 19:07, BGB wrote:
    On 8/3/2025 7:04 PM, Lawrence D'Oliveiro wrote:
    On Sun, 03 Aug 2025 16:51:10 GMT, Anton Ertl wrote:

    The C environment for DEC OSF/1 was an I32LP64 setup, not an ILP64
    setup, so can you really call it pure?

    As far as I’m aware, I32LP64 is the standard across 64-bit *nix systems. >>>
    Microsoft’s compilers for 64-bit Windows do LLP64. Not aware of any
    platforms that do/did ILP64.

    Yeah, pretty much nothing does ILP64, and doing so would actually be a
    problem.

    Also, C type names:
       char     :  8 bit
       short    : 16 bit
       int      : 32 bit
       long     : 64 bit
       long long: 64 bit

    If 'int' were 64-bits, then what about 16 and/or 32 bit types.
       short short?
       long short?
       ...

    Current system seems preferable.
    Well, at least in absence of maybe having the compiler specify actual
    fixed-size types.

    Or, say, what if there was a world where the actual types were, say:
       _Int8, _Int16, _Int32, _Int64, _Int128
    And, then, say:
       char, short, int, long, ...
    Were seen as aliases.

    Well, maybe along with __int64 and friends, but __int64 and _Int64
    could be seen as equivalent.


    Then of course, the "stdint.h" types.
    Traditionally, these are a bunch of typedef's to the 'int' and friends.
    But, one can imagine a hypothetical world where stdint.h contained
    things like, say:
       typedef _Int32 int32_t;



    Like PL/I which lets you specify any precision: FIXED BINARY(31), FIXED BINARY(63) etc.

    C keeps borrowing more and more PL/I features.


    This would be _BitInt(n) ...


    Though, despite originally making it so that power-of-2 _BitInt(n) would
    map to the corresponding types when available, I ended up later needing
    to make them distinct, to remember the exact bit-widths, and to preserve
    the expected overflow behavior for these widths.
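
    My illustration of why the exact width matters (assumes a C23 compiler):
    bit-precise types are exempt from the integer promotions, so arithmetic
    stays at the declared width, unlike the corresponding short types.

      unsigned _BitInt(16) bi_sq(unsigned _BitInt(16) x)
      { return x * x; }    /* computed in 16 bits, wraps modulo 2^16 */

      unsigned short us_sq(unsigned short x)
      { return x * x; }    /* operands promote to int; 0xFFFF * 0xFFFF overflows int */
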

    There is apparently a discrepancy between BGBCC and Clang when it comes
    to this type:
    BGBCC: Storage is padded to a power of 2;
    Up to 256 bits, after which it is the next multiple of 128.
    Clang: Storage is the next multiple of 1 byte.

    But, efficiently loading and storing arbitrary N byte values is a harder problem than using a power-of-2 type and then ignoring or
    masking/extending the HOBs.

    The harder case is store, which would need to be turned into a Load+Mask+Store absent special ISA support.
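
    A sketch of that fallback at the C level (mine, not BGBCC's code): storing
    the low 48 bits of a value by loading, masking, and re-storing the
    containing 64-bit word (note that it touches all 8 bytes).

      #include <stdint.h>
      #include <string.h>

      static void store48(uint8_t *p, uint64_t v)
      {
          const uint64_t mask = (UINT64_C(1) << 48) - 1;
          uint64_t w;
          memcpy(&w, p, 8);                 /* load the containing 64-bit word */
          w = (w & ~mask) | (v & mask);     /* merge in the new 48-bit value */
          memcpy(p, &w, 8);                 /* store it back */
      }
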


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch,alt.folklore.computers on Mon Aug 4 12:12:49 2025
    From Newsgroup: comp.arch

    On 8/4/2025 4:19 AM, Michael S wrote:
    On Sun, 3 Aug 2025 21:07:02 -0500
    BGB <cr88192@gmail.com> wrote:

    On 8/3/2025 7:04 PM, Lawrence D'Oliveiro wrote:
    On Sun, 03 Aug 2025 16:51:10 GMT, Anton Ertl wrote:

    The C environment for DEC OSF/1 was an I32LP64 setup, not an ILP64
    setup, so can you really call it pure?

    As far as I’m aware, I32LP64 is the standard across 64-bit *nix
    systems.

    Microsoft’s compilers for 64-bit Windows do LLP64. Not aware of any
    platforms that do/did ILP64.

    Yeah, pretty much nothing does ILP64, and doing so would actually be
    a problem.

    Also, C type names:
    char : 8 bit
    short : 16 bit
    int : 32 bit

    Except in embedded 16 bit are not rare

    long : 64 bit

    Except for majority of the world where long is 32 bit


    Possibly, this wasn't meant to address every possible use case, but rather as a counter-argument to ILP64, where the more natural alternative is LP64.

    long long: 64 bit

    If 'int' were 64-bits, then what about 16 and/or 32 bit types.
    short short?
    long short?
    ...

    Current system seems preferable.
    Well, at least in absence of maybe having the compiler specify actual
    fixed-size types.

    Or, say, what if there was a world where the actual types were, say:
    _Int8, _Int16, _Int32, _Int64, _Int128
    And, then, say:
    char, short, int, long, ...
    Were seen as aliases.


    Actually, in our world the latest C standard (C23) has them, but the
    spelling is different: _BitInt(32) and unsigned _BitInt(32).
    I'm not sure if any major compiler already has them implemented. Bing
    copilot says that clang does, but I don't tend to believe eveything Bing copilot says.


    Essentially, _BitInt(n) semantics mean that, say, _BitInt(32) is not
    strictly equivalent to _Int32 or 'int', and _BitInt(16) is not
    equivalent to what _Int16 or 'short' would be.

    So, a range of power-of-2 integer types may still be needed.


    Well, maybe along with __int64 and friends, but __int64 and _Int64
    could be seen as equivalent.


    Then of course, the "stdint.h" types.
    Traditionally, these are a bunch of typedef's to the 'int' and
    friends. But, one can imagine a hypothetical world where stdint.h
    contained things like, say:
    typedef _Int32 int32_t;


    ...





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Aug 4 16:33:32 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    On 8/2/2025 10:33 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    But, it seems to have a few obvious weak points for RISC-V:
    Crappy with arrays;
    Crappy with code with lots of large immediate values;
    Crappy with code which mostly works using lots of global variables;
    Say, for example, a lot of Apogee / 3D Realms code;
    They sure do like using lots of global variables.
    id Software also likes globals, but not as much.
    ...

    Let's see:

    #include <stddef.h>

    long arrays(long *v, size_t n)
    {
    long i, r;
    for (i=0, r=0; i<n; i++)
    r+=v[i];
    return r;
    }

    long a, b, c, d;

    void globals(void)
    {
    a = 0x1234567890abcdefL;
    b = 0xcdef1234567890abL;
    c = 0x567890abcdef1234L;
    d = 0x5678901234abcdefL;
    }

    gcc-10.3 -Wall -O2 compiles this to the following RV64GC code:

    0000000000010434 <arrays>:
    10434: cd81 beqz a1,1044c <arrays+0x18>
    10436: 058e slli a1,a1,0x3
    10438: 87aa mv a5,a0
    1043a: 00b506b3 add a3,a0,a1
    1043e: 4501 li a0,0
    10440: 6398 ld a4,0(a5)
    10442: 07a1 addi a5,a5,8
    10444: 953a add a0,a0,a4
    10446: fed79de3 bne a5,a3,10440 <arrays+0xc>
    1044a: 8082 ret
    1044c: 4501 li a0,0
    1044e: 8082 ret
    ...
    0000000000010450 <globals>:
    10450: 8201b583 ld a1,-2016(gp) # 12020 <__SDATA_BEGIN__>
    10454: 8281b603 ld a2,-2008(gp) # 12028 <__SDATA_BEGIN__+0x8>
    10458: 8301b683 ld a3,-2000(gp) # 12030 <__SDATA_BEGIN__+0x10>
    1045c: 8381b703 ld a4,-1992(gp) # 12038 <__SDATA_BEGIN__+0x18>
    10460: 86b1b423 sd a1,-1944(gp) # 12068 <a>
    10464: 86c1b023 sd a2,-1952(gp) # 12060 <b>
    10468: 84d1bc23 sd a3,-1960(gp) # 12058 <c>
    1046c: 84e1b823 sd a4,-1968(gp) # 12050 <d>
    10470: 8082 ret

    When using -Os, arrays becomes 2 bytes shorter, but the inner loop
    becomes longer.


    I had not usually seen globals handled this way in RV with GCC...

    When I throw it at godbolt.org, I see:
    globals:
    li a1,593920
    addi a1,a1,-1347
    li a2,38178816
    li a5,-209993728
    li a0,863748096
    li a3,1450741760
    li a4,725372928
    slli a1,a1,12
    addi a2,a2,-1329
    addi a5,a5,1165
    li a7,1450741760
    addi a0,a0,1165
    addi a3,a3,171
    addi a4,a4,-2039
    li a6,883675136
    addi a1,a1,-529
    addi a7,a7,171
    slli a0,a0,2
    slli a2,a2,35
    slli a5,a5,34
    slli a3,a3,32
    slli a4,a4,33
    addi a6,a6,-529
    add a2,a2,a1
    add a5,a5,a7
    add a3,a3,a0
    lui t1,%hi(a)
    lui a7,%hi(b)
    lui a0,%hi(c)
    add a4,a4,a6
    lui a1,%hi(d)
    sd a2,%lo(a)(t1)
    sd a5,%lo(b)(a7)
    sd a3,%lo(c)(a0)
    sd a4,%lo(d)(a1)
    ret

    What compiler and what options do you use?

    One interesting aspect here is that when I compile with gcc-10.3 -Os -S xxx-array.c, I get:

    globals:
    lui a4,%hi(.LC0)
    ld a4,%lo(.LC0)(a4)
    lui a5,%hi(a)
    sd a4,%lo(a)(a5)
    lui a4,%hi(.LC1)
    ld a4,%lo(.LC1)(a4)
    lui a5,%hi(b)
    sd a4,%lo(b)(a5)
    lui a4,%hi(.LC2)
    ld a4,%lo(.LC2)(a4)
    lui a5,%hi(c)
    sd a4,%lo(c)(a5)
    lui a4,%hi(.LC3)
    ld a4,%lo(.LC3)(a4)
    lui a5,%hi(d)
    sd a4,%lo(d)(a5)
    ret

    Once that is assembled, it looks as follows:

    000000000000001a <globals>:
    1a: 00000737 lui a4,0x0
    1e: 00073703 ld a4,0(a4) # 0 <arrays>
    22: 000007b7 lui a5,0x0
    26: 00e7b023 sd a4,0(a5) # 0 <arrays>
    2a: 00000737 lui a4,0x0
    2e: 00073703 ld a4,0(a4) # 0 <arrays>
    32: 000007b7 lui a5,0x0
    36: 00e7b023 sd a4,0(a5) # 0 <arrays>
    3a: 00000737 lui a4,0x0
    3e: 00073703 ld a4,0(a4) # 0 <arrays>
    42: 000007b7 lui a5,0x0
    46: 00e7b023 sd a4,0(a5) # 0 <arrays>
    4a: 00000737 lui a4,0x0
    4e: 00073703 ld a4,0(a4) # 0 <arrays>
    52: 000007b7 lui a5,0x0
    56: 00e7b023 sd a4,0(a5) # 0 <arrays>
    5a: 8082 ret

    Finally, when this object file is linked to a file containing a main()
    (without or with -fPIC), and the resulting binary is disassembled,
    globals() looks as follows:

    000000000001044e <globals>:
    1044e: 8201b703 ld a4,-2016(gp) # 12020 <__SDATA_BEGIN__>
    10452: 86e1b423 sd a4,-1944(gp) # 12068 <a>
    10456: 8281b703 ld a4,-2008(gp) # 12028 <__SDATA_BEGIN__+0x8>
    1045a: 86e1b023 sd a4,-1952(gp) # 12060 <b>
    1045e: 8301b703 ld a4,-2000(gp) # 12030 <__SDATA_BEGIN__+0x10>
    10462: 84e1bc23 sd a4,-1960(gp) # 12058 <c>
    10466: 8381b703 ld a4,-1992(gp) # 12038 <__SDATA_BEGIN__+0x18>
    1046a: 84e1b823 sd a4,-1968(gp) # 12050 <d>
    1046e: 8082 ret

    So the linker optimizes all these luis away, and you have to look at
    RISC-V code after linking to evaluate how big its code for globals and
    large immediates is.

    Though, I was more talking about i386 having good code density, not so
    much x86-64.

    The constants are too large for long on IA-32.

    So RV64GC is smallest for the globals/large-immediate test here, and
    only beaten by one byte by AMD64 for the array test. Looking at the
    code generated for the inner loop of arrays(), all the inner loops
    contain four instructions, so certainly in this case RV64GC is not
    crappier than the others. Interestingly, the reasons for using four
    instructions (rather than five) are different on these architectures:


    These are micro-examples...

    Makes more sense to compare something bigger.

    One example is enough to refute a general claim like "Crappy for ...".

    One example is also one more than you provided.

    Plus, I also gave results for several programs on Debian and NetBSD,
    where RV64GC (and RV32GC) has code size similar to ARM T32 and smaller
    code size than all other architectures that were available on Debian
    and NetBSD.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch,alt.folklore.computers on Mon Aug 4 17:20:33 2025
    From Newsgroup: comp.arch

    John Ames <commodorejohn@gmail.com> writes:
    On Sat, 02 Aug 2025 23:10:56 -0400
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:

    And what a waste of a 64-bit architecture, to run it in 32-bit-only
    mode ...

    What do you mean by that? IIUC, the difference between 32bit and
    64bit (in terms of cost of designing and producing the CPU) was very
    small. MIPS happily designed their R4000 as 64bit while knowing that
    most of them would never get a chance to execute an instruction that
    makes use of the upper 32bits.

    This notion that the only advantage of a 64-bit architecture is a large address space is very curious to me. Obviously that's *one* advantage,
    but while I don't know the in-the-field history of heavy-duty business/scientific computing the way some folks here do, I have not gotten the impression that a lot of customers were commonly running up against the
    4 GB limit in the early '90s; meanwhile, the *other* advantage - higher performance for the same MIPS on a variety of compute-bound tasks - is
    being overlooked entirely, it seems.

    Even simple data movement (e.g. optimized memcpy) will require half
    the instructions on a 64-bit architecture.
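
    As a toy illustration (mine): a word-at-a-time copy moves twice as many
    bytes per load/store pair with 64-bit registers (alignment and tail
    handling omitted).

      #include <stddef.h>
      #include <stdint.h>

      /* Copy n bytes, n assumed to be a multiple of 8 and both buffers
         suitably aligned; a 32-bit version needs twice as many iterations. */
      static void copy64(uint64_t *dst, const uint64_t *src, size_t n)
      {
          for (size_t i = 0; i < n / 8; i++)
              dst[i] = src[i];
      }
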
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Aug 4 12:56:13 2025
    From Newsgroup: comp.arch

    On 8/4/2025 7:09 AM, Anton Ertl wrote:
    Michael S <already5chosen@yahoo.com> writes:
    Actually, in our world the latest C standard (C23) has them, but the
    spelling is different: _BitInt(32) and unsigned _BitInt(32).
    I'm not sure if any major compiler already has them implemented. Bing
    copilot says that clang does, but I don't tend to believe eveything Bing
    copilot says.

    I asked godbolt, and tried the following program:

    typedef ump unsigned _BitInt(65535);

    ump sum3(ump a, ump b, ump c)
    {
    return a+b+c;
    }


    IIRC, that is over the limit of what BGBCC supports...


    and for the C setting gcc-15.1 AMD64 produces 129 lines of assembly
    language code; for C++ it complains about the syntax. For 65536 bits,
    it complains about being beyond the maximum number of 65535 bits.

    For the same program with the C setting clang-20.1 produces 29547
    lines of assembly language code; that's more than 28 instructions for
    every 64-bit word of output, which seems excessive to me, even if you
    don't use ADX instructions (which clang apparently does not); I expect
    that clang will produce better code at some point in the future.
    Compiling this function also takes noticable time, and when I ask for
    1000000 bits, clang still does complain about too many bits, but
    godbolt's timeout strikes; I finally found out clang's limit: 8388608
    bits. On clang-20.1 the C++ setting also accepts this kind of input.

    Followups set to comp.arch.


    FWIW:
    The biggest handled inline in BGBCC is 128 bits.
    And, this depends on ISA features;
    Otherwise, runtime calls.
    At 129..256 bits, it uses runtime calls for 256 bit ops;
    At 257+, it uses runtime calls for variable-width large integers.

    A problem for very large types is still the need for large storage though...


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Mon Aug 4 17:23:24 2025
    From Newsgroup: comp.arch

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Sat, 02 Aug 2025 09:28:17 GMT, Anton Ertl wrote:

    In my RISC-VAX scenario, the RISC-VAX would be the PDP-11 followon
    instead of the actual (CISC) VAX, so there would be no additional
    ISA.

    In order to be RISC, it would have had to add registers and remove addressing modes from the non-load/store instructions (and replace "move" with separate "load" and "store" instructions).

    Add registers: No, ARM A32 is RISC and has as many registers as VAX
    (including the misfeature of having the PC addressable as a GPR). But
    yes, I would tend towards more registers.

    Remove addressig modes: The memory-indirect addressing modes certainly
    don't occur in any RISC and add complexity, so I would not include
    them.

    Move: It does not matter how these instructions are called.

    "No additional ISA" or
    not, it would still have broken existing code.

    There was no existing VAX code before the VAX ISA was designed.

    Remember that VAX development started in the early-to-mid-1970s.

    This is exactly the point where the time machine would deliver the
    RISC-VAX ideas.

    RISC was
    still nothing more than a research idea at that point, which had yet to prove itself.

    Certainly, that's why I have a time-machine in my scenario that deals
    with this problem.

    The claim by John Savard was that the VAX "was a good match to the
    technology *of its time*". It was not. It may have been a good match
    for the beliefs of the time, but that's a different thing.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch,alt.folklore.computers on Mon Aug 4 18:16:45 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The claim by John Savard was that the VAX "was a good match to the
    technology *of its time*". It was not. It may have been a good match
    for the beliefs of the time, but that's a different thing.

    I concur; also, the evidence of the 801 supports that (and that
    was designed around the same time as the VAX).

    Although, personally, I think Data General might have been the
    better target. Going to Edson de Castro and telling him that he
    was on the right track with the Nova from the start, and his ideas
    should be extended, might have been politically easier than going
    to DEC.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch,alt.folklore.computers on Mon Aug 4 18:17:54 2025
    From Newsgroup: comp.arch

    According to Scott Lurndal <slp53@pacbell.net>:
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    scientific computing the way some folks here do, I have not gotten the impression that a lot of customers were commonly running up against the
    4 GB limit in the early '90s;

    Mainframes certainly had more than 4GB. In 1990 the ES/9000 had more
    than 4GB of "expanded" memory and by 1994 there was 8GB of main memory,
    using a variety of mapping and segmentation kludges to address from a
    32 bit architecture.

    Even simple data movement (e.g. optimized memcpy) will require half
    the instructions on a 64-bit architecture.

    Er, maybe. There were plenty of 32 bit systems with 64 bit memory.
    I would expect that systems with string move instructions would
    take advantage of the underlying hardware.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Aug 4 14:39:22 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The claim by John Savard was that the VAX "was a good match to the
    technology *of its time*". It was not. It may have been a good match
    for the beliefs of the time, but that's a different thing.

    I concur; also, the evidence of the 801 supports that (and that
    was designed around the same time as the VAX).

    Although, personally, I think Data General might have been the
    better target. Going to Edson de Castro and telling him that he
    was on the right track with the Nova from the start, and his ideas
    should be extended, might have been politically easier than going
    to DEC.

    DG's 32-bit Eclipse MV-8000 was also microcoded.

    The ECLIPSE MV-8000 Microsequencer 1980 https://dl.acm.org/doi/pdf/10.1145/1014190.802716

    In the IBM 5100, the cpu name PALM stands for "Put All Logic in Microcode".

    They weren't looking at this with the necessary set of eyes.
    The microcoded design approach views instruction execution as a large,
    single, *monolithic* state machine performing a sequential series of steps (aside from maybe having a prefetch buffer).

    Few viewed this as a set of simple, parallel hardware tasks passing values between them. Once one looks at it this way, one starts to look for bottlenecks in that process, and many of the RISC design guidelines emerge
    as potential optimizations.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch,alt.folklore.computers on Mon Aug 4 18:59:02 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The claim by John Savard was that the VAX "was a good match to the
    technology *of its time*". It was not. It may have been a good match
    for the beliefs of the time, but that's a different thing.

    I concur; also, the evidence of the 801 supports that (and that
    was designed around the same time as the VAX).

    Looking back at it after 50 years, hindsight is 20-20. It's
    difficult to judge the decisions made at DEC during the 70's;
    but it is easy to criticize them :-)

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch,comp.lang.c on Mon Aug 4 22:03:15 2025
    From Newsgroup: comp.arch

    On Mon, 04 Aug 2025 09:53:51 -0700
    Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 04 Aug 2025 12:09:32 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    [...]
    typedef ump unsigned _BitInt(65535);

    The correct syntax is :

    typedef unsigned _BitInt(65535) ump;

    ump sum3(ump a, ump b, ump c)
    {
    return a+b+c;
    }

    [...]

    1. Both gcc and clang happily* accept _BitInt() syntax even when
    -std=c17 or lower. Is not here a potential name clash for existing
    sources that use _BitInt() as a name of the function? I should think
    more about it.

    In C17 and earlier, _BitInt is a reserved identifier. Any attempt to
    use it has undefined behavior. That's exactly why new keywords are
    often defined with that ugly syntax.


    That is a language lawyer's type of reasoning. Normally gcc maintainers
    are wiser than that because, well, gcc happens to be a widely
    used production compiler. I don't know why this time they chose a
    less conservative road.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch,alt.folklore.computers on Mon Aug 4 15:09:55 2025
    From Newsgroup: comp.arch

    Scott Lurndal [2025-08-04 15:32:55] wrote:
    Michael S <already5chosen@yahoo.com> writes:
    scott@slp53.sl.home (Scott Lurndal) wrote:
    Michael S <already5chosen@yahoo.com> writes:
    BGB <cr88192@gmail.com> wrote:
    Except for majority of the world where long is 32 bit
    What majority? Linux owns the server market, the
    appliance market and much of the handset market (which apple
    dominates with their OS). And all Unix/Linux systems have
    64-bit longs on 64-bit CPUs.
    Majority of the world is embedded. Ovewhelming majority of embedded is
    32-bit or narrower.
    In terms of shipped units, perhaps (although many are narrower, as you
    point out). In terms of programmers, it's a fairly small fraction that
    do embedded programming.

    Yeah, the unit of measurement is a problem.
    I wonder how it compares if you look at number of programmers paid to
    write C code (after all, we're talking about C).

    In the desktop/server/laptop/handheld world, AFAICT the market share of
    C has shrunk significantly over the years whereas I get the impression
    that it's still quite strong in the embedded space. But I don't have
    any hard data.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch,alt.folklore.computers on Mon Aug 4 22:12:13 2025
    From Newsgroup: comp.arch

    On Mon, 4 Aug 2025 18:16:45 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The claim by John Savard was that the VAX "was a good match to the technology *of its time*". It was not. It may have been a good
    match for the beliefs of the time, but that's a different thing.


    The evidence of the 801 is that the 801 did not deliver until more than a
    decade later. And the variant that delivered was quite different from the
    original 801.
    Actually, it can be argued that the 801 didn't deliver until more than 15
    years later. I remember the RSC from 1992H1. It was underwhelming.

    I concur; also, the evidence of the 801 supports that (and that
    was designed around the same time as the VAX).

    Although, personally, I think Data General might have been the
    better target. Going to Edson de Castro and telling him that he
    was on the right track with the Nova from the start, and his ideas
    should be extended, might have been politically easier than going
    to DEC.

    I don't quite understand the context of this comment. Can you elaborate?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch,alt.folklore.computers on Mon Aug 4 22:17:24 2025
    From Newsgroup: comp.arch

    On Mon, 4 Aug 2025 18:17:54 -0000 (UTC)
    John Levine <johnl@taugh.com> wrote:

    According to Scott Lurndal <slp53@pacbell.net>:
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    scientific computing the way some folks here do, I have not gotten
    the impression that a lot of customers were commonly running up
    against the 4 GB limit in the early '90s;

    Mainframes certainly had more than 4GB. In 1990 the ES/9000 had more
    than 4GB of "expanded" memory and by 1994 there was 8GB of main
    memory, using a variety of mapping and segmentation kludges to
    address from a 32 bit architecture.

    Even simple data movement (e.g. optimized memcpy) will require half
    the instructions on a 64-bit architecture.

    Er, maybe. There were plenty of 32 bit systems with 64 bit memory.
    I would expect that systems with string move instructions would
    take advantage of the underlying hardware.


    Also there existed the possibility of using non-GPR registers.
    Didn't the majority of 32-bit RISC machines with general-purpose ambitions
    have 64-bit FP registers?
    Plus LDM/STM. I think ARM was not the only one that had them.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From James Kuyper@jameskuyper@alumni.caltech.edu to comp.arch,comp.lang.c on Mon Aug 4 15:25:54 2025
    From Newsgroup: comp.arch

    On 2025-08-04 15:03, Michael S wrote:
    On Mon, 04 Aug 2025 09:53:51 -0700
    Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    ...
    In C17 and earlier, _BitInt is a reserved identifier. Any attempt to
    use it has undefined behavior. That's exactly why new keywords are
    often defined with that ugly syntax.


    That is language lawyer's type of reasoning. Normally gcc maintainers
    are wiser than that because, well, by chance gcc happens to be widely
    used production compiler. I don't know why this time they had chosen
    less conservative road.

    If _BitInt is accepted by older versions of gcc, that means it was
    supported as a fully-conforming extension to C. Allowing implementations
    to support extensions in a fully-conforming manner is one of the main
    purposes for which the standard reserves identifiers.
    If you thought that gcc was too conservative to support extensions, you
    must be thinking of the wrong organization.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Al Kossow@aek@bitsavers.org to comp.arch,alt.folklore.computers on Mon Aug 4 12:27:04 2025
    From Newsgroup: comp.arch

    On 8/4/25 11:16 AM, Thomas Koenig wrote:

    Although, personally, I think Data General might have been the
    better target. Going to Edson de Castro and telling him that he
    was on the right track with the Nova from the start, and his ideas
    should be extended, might have been politically easier than going
    to DEC.


    A word-oriented 4 accumulator machine with skips, reduced to a 4 bit
    ALU to keep the cost down vs what came out of CMU to become the PDP-11?

    The essence of RISC really is just exposing what existed in the microcode engines to user-level programming, and it didn't really make sense until main memory systems got a lot faster.




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch,alt.folklore.computers on Mon Aug 4 22:31:03 2025
    From Newsgroup: comp.arch

    On Mon, 04 Aug 2025 15:09:55 -0400
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:

    Scott Lurndal [2025-08-04 15:32:55] wrote:
    Michael S <already5chosen@yahoo.com> writes:
    scott@slp53.sl.home (Scott Lurndal) wrote:
    Michael S <already5chosen@yahoo.com> writes:
    BGB <cr88192@gmail.com> wrote:
    Except for majority of the world where long is 32 bit
    What majority? Linux owns the server market, the
    appliance market and much of the handset market (which apple
    dominates with their OS). And all Unix/Linux systems have
    64-bit longs on 64-bit CPUs.
    Majority of the world is embedded. Overwhelming majority of
    embedded is 32-bit or narrower.
    In terms of shipped units, perhaps (although many are narrower, as
    you point out). In terms of programmers, it's a fairly small
    fraction that do embedded programming.

    Yeah, the unit of measurement is a problem.
    I wonder how it compares if you look at number of programmers paid to
    write C code (after all, we're talking about C).

    In the desktop/server/laptop/handheld world, AFAICT the market share
    of C has shrunk significantly over the years whereas I get the
    impression that it's still quite strong in the embedded space. But I
    don't have any hard data.


    Stefan


    Personally, [outside of Usenet and the rwt forum] I know no one except
    myself who writes C targeting user mode on "big" computers (big, by my
    definition, starts at smartphone). Myself, I am doing it more as a hobby
    and to make a point rather than out of professional need. Professionally,
    in this range I tend to use C++. Not a small part of it is that C++ is
    more familiar than C to my younger co-workers.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch,alt.folklore.computers on Mon Aug 4 15:40:12 2025
    From Newsgroup: comp.arch

    John Ames [2025-08-04 08:32:19] wrote:
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    What do you mean by that? IIUC, the difference between 32bit and
    64bit (in terms of cost of designing and producing the CPU) was very
    small. MIPS happily designed their R4000 as 64bit while knowing that
    most of them would never get a chance to execute an instruction that
    makes use of the upper 32bits.
    This notion that the only advantage of a 64-bit architecture is a large address space is very curious to me.

    By "upper bits" I didn't mean to restrict it to the address space.
    AFAIK it would take several years before the OS and the rest of the
    tools started to support the use of instructions manipulating 64bits.
    By that time, many of those machines started to be decommissioned:
    The R4000 came out in late 1991, while the first version of Irix with
    support for the 64bit ISA on that CPU was released only in early 1996
    (there was an earlier 64bit version of Irix but only for the R8000
    processor).

    The same happened to some extent with the early amd64 machines, which
    ended up running 32bit Windows and applications compiled for the i386
    ISA. Those processors were successful mostly because they were fast at
    running i386 code (with the added marketing benefit of being "64bit
    ready"): it took 2 years for MS to release a matching OS.

    And I can't see why anyone would consider it a waste.
    AFAIK it was cheap to implement, and without it, there wouldn't have
    been the installed base of 64bit machines needed to justify investing
    into software development for that new ISA.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch,comp.lang.c on Mon Aug 4 22:40:49 2025
    From Newsgroup: comp.arch

    On Mon, 4 Aug 2025 15:25:54 -0400
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    On 2025-08-04 15:03, Michael S wrote:
    On Mon, 04 Aug 2025 09:53:51 -0700
    Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    ...
    In C17 and earlier, _BitInt is a reserved identifier. Any attempt
    to use it has undefined behavior. That's exactly why new keywords
    are often defined with that ugly syntax.


    That is language lawyer's type of reasoning. Normally gcc
    maintainers are wiser than that because, well, by chance gcc
    happens to be widely used production compiler. I don't know why
    this time they had chosen less conservative road.

    If _BitInt is accepted by older versions of gcc, that means it was
    supported as a fully-conforming extension to C. Allowing
    implementations to support extensions in a fully-conforming manner is
    one of the main purposes for which the standard reserves identifiers.
    If you thought that gcc was too conservative to support extensions,
    you must be thinking of the wrong organization.


    I know that gcc supports extensions.
    I also know that gcc didn't support *this particular extension* until
    quite recently. I would guess, not until this calendar year.
    Introducing a new extension without a way to disable it is different from
    supporting gradually introduced extensions, typically with names that
    start with a double underscore and often with __builtin.

    BTW, I still haven't thought deeply about it, and I still hope that
    outside of C23 mode gcc somehow took care to make a name clash unlikely.
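
    One way to stay out of trouble on the application side is to gate the
    spelling on the language version, so that C17 builds never see the new
    keyword at all; a minimal sketch (illustrative only):

    #if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 202311L
    typedef unsigned _BitInt(65535) ump;   /* C23 and later */
    #else
    #error "this code assumes C23 _BitInt support"
    #endif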

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch,comp.lang.c on Mon Aug 4 12:44:20 2025
    From Newsgroup: comp.arch

    On 8/4/2025 12:40 PM, Michael S wrote:
    On Mon, 4 Aug 2025 15:25:54 -0400
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    On 2025-08-04 15:03, Michael S wrote:
    On Mon, 04 Aug 2025 09:53:51 -0700
    Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    ...
    In C17 and earlier, _BitInt is a reserved identifier. Any attempt
    to use it has undefined behavior. That's exactly why new keywords
    are often defined with that ugly syntax.


    That is language lawyer's type of reasoning. Normally gcc
    maintainers are wiser than that because, well, by chance gcc
    happens to be widely used production compiler. I don't know why
    this time they had chosen less conservative road.

    If _BitInt is accepted by older versions of gcc, that means it was
    supported as a fully-conforming extension to C. Allowing
    implementations to support extensions in a fully-conforming manner is
    one of the main purposes for which the standard reserves identifiers.
    If you thought that gcc was too conservative to support extensions,
    you must be thinking of the wrong organization.


    I know that gcc supports extensions.
    I also know that gcc didn't support *this particular extension* up
    until quite recently. I would guess, up until this calendar year.
    Introducing new extension without way to disable it is different from supporting gradually introduced extensions, typically with names that
    start by double underscore and often starting with __builtin.

    Well, if there is a "new exotic" extension in a C compiler that does
    _not_ have the ability to be turned on or off, that would be "bad"...
    Well, it can be a way to sort of try to "lock" one into a compiler?


    BTW, I still didn't think deeply about it and still hope that outside
    of C23 mode gcc somehow cared to make name clash unlikely.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Aug 4 19:59:07 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The claim by John Savard was that the VAX "was a good match to the
    technology *of its time*". It was not. It may have been a good match
    for the beliefs of the time, but that's a different thing.

    I concur; also, the evidence of the 801 supports that (and that
    was designed around the same time as the VAX).

    Although, personally, I think Data General might have been the
    better target. Going to Edson de Castro and telling him that he
    was on the right track with the Nova from the start, and his ideas
    should be extended, might have been politically easier than going
    to DEC.

    DG's 32-bit Eclipse MV-8000 was also microcoded.

    And their Fountainhead project was about different types of microcode
    for different languages. The Nova wasn't microcoded, but it was far
    from a single-cycle machine. The 16-bit Eclipse was also microcoded.


    The ECLIPSE MV-8000 Microsequencer 1980 https://dl.acm.org/doi/pdf/10.1145/1014190.802716

    Nice article, thanks!

    In the IBM 5100, the cpu name PALM stands for "Put All Logic in Microcode".

    They weren't looking at this with the necessary set of eyes.
    The microcoded design approach views instruction execution as a large, single, *monolithic* state machine performing a sequential series of steps (aside from maybe having a prefetch buffer).

    Few viewed this as a set of simple, parallel hardware tasks passing
    values between them. Once one looks at it this way, one starts to look
    for bottlenecks in that process, and many of the RISC design guidelines
    emerge as potential optimizations.

    Throughout "The Soul of A New Machine", I kept trying to reach into
    the pages and tell the people to build a RISC instead...
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch,alt.folklore.computers on Mon Aug 4 20:00:10 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> schrieb:
    According to Scott Lurndal <slp53@pacbell.net>:
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    scientific computing the way some folks here do, I have not gotten the impression that a lot of customers were commonly running up against the
    4 GB limit in the early '90s;

    Mainframes certainly had more than 4GB. In 1990 the ES/9000 had more
    than 4GB of "expanded" memory and by 1994 there was 8GB of main memory,
    using a variety of mapping and segmentation kludges to address from a
    32 bit architecture.

    #ifdef PEDANTIC
    Actually, 31 bits.
    #endif
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch,alt.folklore.computers on Mon Aug 4 20:13:54 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:
    On Mon, 4 Aug 2025 18:16:45 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Although, personally, I think Data General might have been the
    better target. Going to Edson de Castro and telling him that he
    was on the right track with the Nova from the start, and his ideas
    should be extended, might have been politically easier than going
    to DEC.

    I don't quite understand the context of this comment. Can you elaborate?

    De Castro had had a big success with a simple load-store
    architecture, the Nova. He did that to reduce CPU complexity
    and cost, to compete with DEC and its PDP-8. (Byte addressing
    was horrible on the Nova, though).

    Now, assume that, as a time traveler wanting to kick off an early
    RISC revolution, you are not allowed to reveal that you are a time
    traveler (which would have larger effects than just a different
    computer architecture). What do you do?

    a) You go to DEC

    b) You go to Data General

    c) You found your own company

    My guess would be that, with DEC, you would have the least chance of
    convincing corporate brass of your ideas. With Data General, you
    could try appealing to the CEO's personal history of creating the
    Nova, and thus his vanity. That could work. But your own company
    might actually be the best choice, if you can get the venture
    capital funding.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch,alt.folklore.computers on Mon Aug 4 20:29:35 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 04 Aug 2025 15:09:55 -0400
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:

    Scott Lurndal [2025-08-04 15:32:55] wrote:
    Michael S <already5chosen@yahoo.com> writes:
    scott@slp53.sl.home (Scott Lurndal) wrote:
    Michael S <already5chosen@yahoo.com> writes:
    BGB <cr88192@gmail.com> wrote:
    Except for majority of the world where long is 32 bit
    What majority? Linux owns the server market, the
    appliance market and much of the handset market (which apple
    dominates with their OS). And all Unix/Linux systems have
    64-bit longs on 64-bit CPUs.
    Majority of the world is embedded. Overwhelming majority of
    embedded is 32-bit or narrower.
    In terms of shipped units, perhaps (although many are narrower, as
    you point out). In terms of programmers, it's a fairly small
    fraction that do embedded programming.

    Yeah, the unit of measurement is a problem.
    I wonder how it compares if you look at number of programmers paid to
    write C code (after all, we're talking about C).

    In the desktop/server/laptop/handheld world, AFAICT the market share
    of C has shrunk significantly over the years whereas I get the
    impression that it's still quite strong in the embedded space. But I
    don't have any hard data.


    Stefan


    Personally, [outside of Usenet and rwt forum] I know no one except
    myself who writes C targeting user mode on "big" computers (big, in my definitions, starts at smartphone).

    Linux developers would be a significant, if not large, pool
    of C programmers.

    Myself, I am doing it more as a
    hobby and to make a point rather than out of professional needs. Professionally, in this range I tend to use C++. Not a small part of it
    is that C++ is more familiar than C for my younger co-workers.

    Likewise, I've been using C++ rather than C since 1989, including for large-scale operating systems and hypervisors (both running on bare metal).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Aug 4 22:49:23 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Michael S <already5chosen@yahoo.com> writes:
    Actually, in our world the latest C standard (C23) has them, but the
    spelling is different: _BitInt(32) and unsigned _BitInt(32).
    I'm not sure if any major compiler already has them implemented. Bing
    copilot says that clang does, but I don't tend to believe everything Bing copilot says.

    I asked godbolt, and tried the following program:

    typedef ump unsigned _BitInt(65535);

    The actual compiling version is:

    typedef unsigned _BitInt(65535) ump;

    ump sum3(ump a, ump b, ump c)
    {
    return a+b+c;
    }

    I would naively expect the ump type to be defined as an array of
    unsigned (byte/short/int/long), possibly with a header defining how
    large the allocation is and how many bits are currently defined.

    The actual code to add three of them could be something like

    xor rax,rax
    next:
    add rax,[rsi+rcx*8]
    adc rdx,0
    add rax,[r8+rcx*8]
    adc rdx,0
    add rax,[r9+rcx*8]
    adc rdx,0
    mov [rdi+rcx*8],rax
    mov rax,rdx
    inc rcx
    cmp rcx,r10
    jb next

    The main problem here is of course that every add operation depends on
    the previous, so max speed would be 4-5 clock cycles/iteration.
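
    For comparison, the same word-at-a-time scheme in portable-ish C (a
    sketch only; the limb layout is illustrative rather than any compiler's
    actual _BitInt representation, and it assumes the common unsigned
    __int128 extension):

    #include <stdint.h>
    #include <stddef.h>

    /* Add three n-limb numbers, propagating the carry between 64-bit limbs. */
    static void sum3_limbs(uint64_t *dst, const uint64_t *a,
                           const uint64_t *b, const uint64_t *c, size_t n)
    {
        unsigned __int128 carry = 0;
        for (size_t i = 0; i < n; i++) {
            unsigned __int128 s = carry;
            s += a[i];
            s += b[i];
            s += c[i];
            dst[i] = (uint64_t)s;   /* low 64 bits of the limb sum */
            carry  = s >> 64;       /* at most 2, feeds the next limb */
        }
    }

    The serial carry is the bottleneck here as well: each iteration's carry
    feeds the next one.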

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch,alt.folklore.computers on Mon Aug 4 23:54:51 2025
    From Newsgroup: comp.arch

    On Mon, 4 Aug 2025 20:13:54 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Mon, 4 Aug 2025 18:16:45 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Although, personally, I think Data General might have been the
    better target. Going to Edson de Castro and telling him that he
    was on the right track with the Nova from the start, and his ideas
    should be extended, might have been politically easier than going
    to DEC.

    I don't quite understand the context of this comment. Can you
    elaborate?

    De Castro had had a big success with a simple load-store
    architecture, the Nova. He did that to reduce CPU complexity
    and cost, to compete with DEC and its PDP-8. (Byte addressing
    was horrible on the Nova, though).

    Now, assume that, as a time traveler wanting to kick off an early
    RISC revolution, you are not allowed to reveal that you are a time
    traveler (which would have larger effects than just a different
    computer architecture). What do you do?

    a) You go to DEC

    b) You go to Data General

    c) You found your own company

    My guess would be that, with DEC, you would have the least chance of convincing corporate brass of your ideas. With Data General, you
    could try appealing to the CEO's personal history of creating the
    Nova, and thus his vanity. That could work. But your own company
    might actually be the best choice, if you can get the venture
    capital funding.


    Why not go to somebody who has the money and the interest to build a
    microprocessor, but no existing mini/mainframe/SuperC business?
    If we limit ourselves to the USA, then Moto, Intel, AMD, NatSemi...
    Maybe even AT&T? Or was AT&T still banned from making computers in
    the mid-70s?









    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch,alt.folklore.computers on Mon Aug 4 21:04:40 2025
    From Newsgroup: comp.arch

    According to Thomas Koenig <tkoenig@netcologne.de>:
    John Levine <johnl@taugh.com> schrieb:
    According to Scott Lurndal <slp53@pacbell.net>:
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:
    scientific computing the way some folks here do, I have not gotten the impression that a lot of customers were commonly running up against the 4 GB limit in the early '90s;

    Mainframes certainly had more than 4GB. In 1990 the ES/9000 had more
    than 4GB of "expanded" memory and by 1994 there was 8GB of main memory,
    using a variety of mapping and segmentation kludges to address from a
    32 bit architecture.

    #ifdef PEDANTIC
    Actually, 31 bits.
    #endif

    It's a 32 bit architecture with 31 bit addressing, kludgily extended
    from 24 bit addressing in the 1970s.

    R's,
    John
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch,alt.folklore.computers on Mon Aug 4 14:06:17 2025
    From Newsgroup: comp.arch

    On 8/4/2025 8:32 AM, John Ames wrote:

    snip

    This notion that the only advantage of a 64-bit architecture is a large address space is very curious to me. Obviously that's *one* advantage,
    but while I don't know the in-the-field history of heavy-duty business/ scientific computing the way some folks here do, I have not gotten the impression that a lot of customers were commonly running up against the
    4 GB limit in the early '90s;

    Not exactly the same, but I recall an issue with Windows NT where it
    initially divided the 4GB address space into 2 GB for the OS and 2 GB for
    users. Some users were "running out of address space", so Microsoft
    came up with an option to reduce the OS space to 1 GB, thus allowing up
    to 3 GB for users. I am sure others here will know more details.
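
    As a rough illustration (a sketch only, using the documented
    GetSystemInfo() call), a Win32 program can see which split it got; by
    default it reports just under 2 GB, and close to 3 GB when the boot
    option and a /LARGEADDRESSAWARE-linked image are both in effect:

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        SYSTEM_INFO si;
        GetSystemInfo(&si);
        /* Highest address usable by user-mode code in this process. */
        printf("max application address: %p\n", si.lpMaximumApplicationAddress);
        return 0;
    }
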
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch,alt.folklore.computers on Tue Aug 5 00:08:38 2025
    From Newsgroup: comp.arch

    On Mon, 04 Aug 2025 20:29:35 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 04 Aug 2025 15:09:55 -0400
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:

    Scott Lurndal [2025-08-04 15:32:55] wrote:
    Michael S <already5chosen@yahoo.com> writes:
    scott@slp53.sl.home (Scott Lurndal) wrote:
    Michael S <already5chosen@yahoo.com> writes:
    BGB <cr88192@gmail.com> wrote:
    Except for majority of the world where long is 32 bit
    What majority? Linux owns the server market, the
    appliance market and much of the handset market (which apple
    dominates with their OS). And all Unix/Linux systems have
    64-bit longs on 64-bit CPUs.
    Majority of the world is embedded. Overwhelming majority of
    embedded is 32-bit or narrower.
    In terms of shipped units, perhaps (although many are narrower,
    as you point out). In terms of programmers, it's a fairly small
    fraction that do embedded programming.

    Yeah, the unit of measurement is a problem.
    I wonder how it compares if you look at number of programmers paid
    to write C code (after all, we're talking about C).

    In the desktop/server/laptop/handheld world, AFAICT the market
    share of C has shrunk significantly over the years whereas I get
    the impression that it's still quite strong in the embedded space.
    But I don't have any hard data.


    Stefan


    Personally, [outside of Usenet and rwt forum] I know no one except
    myself who writes C targeting user mode on "big" computers (big, in
    my definitions, starts at smartphone).

    Linux developers would be a significant, if not large, pool
    of C programmers.


    According to my understanding, Linux developers *maintain* user-mode C programs. They very rarely start new user-mode C programs from scratch.
    The last big one I can think of was git, almost 2 decades ago. And
    even that happened more due to personal idiosyncrasies of its
    originator than for solid technical reasons.
    I could be wrong about it, of course.

    Myself, I am doing it more as a
    hobby and to make a point rather than out of professional needs. Professionally, in this range I tend to use C++. Not a small part of
    it is that C++ is more familiar than C for my younger co-workers.

    Likewise, I've been using C++ rather than C since 1989, including for large-scale operating systems and hypervisors (both running on bare
    metal).

    You know my opinion about it.
    For your current project, C++ appears to be the right tool. Or, at least,
    more right than C.
    For a few of your previous projects I am convinced that it was the wrong
    tool.
    And I know that you are convinced that I am wrong about it, so we don't
    have to repeat it.









    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Aug 5 00:14:43 2025
    From Newsgroup: comp.arch

    On Mon, 4 Aug 2025 22:49:23 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Michael S <already5chosen@yahoo.com> writes:
    Actually, in our world the latest C standard (C23) has them, but
    the spelling is different: _BitInt(32) and unsigned _BitInt(32).
    I'm not sure if any major compiler already has them implemented.
    Bing copilot says that clang does, but I don't tend to believe
    everything Bing copilot says.

    I asked godbolt, and tried the following program:

    typedef ump unsigned _BitInt(65535);

    The actual compiling version is:

    typedef unsigned _BitInt(65535) ump;

    ump sum3(ump a, ump b, ump c)
    {
    return a+b+c;
    }

    I would naively expect the ump type to be defined as an array of
    unsigned (byte/short/int/long), possibly with a header defining how
    large the allocation is and how many bits are currently defined.

    The actual code to add three of them could be something like

    xor rax,rax
    next:
    add rax,[rsi+rcx*8]
    adc rdx,0
    add rax,[r8+rcx*8]
    adc rdx,0
    add rax,[r9+rcx*8]
    adc rdx,0
    mov [rdi+rcx*8],rax
    mov rax,rdx
    inc rcx
    cmp rcx,r10
    jb next

    The main problem here is of course that every add operation depends
    on the previous, so max speed would be 4-5 clock cycles/iteration.

    Terje


    I would guess that even a pair of x86-style loops would likely be faster
    than that on most x86-64 processors made in the last 15 years, despite
    doing 1.5x more memory accesses.
    ; rcx = dst
    ; rdx = a - dst
    ; r8 = b - dst
    mov $1024, %esi
    clc
    .loop1:
    mov (%rcx,%r8), %rax
    adc (%rcx,%rdx), %rax
    mov %rax, (%rcx)
    lea 8(%rcx), %rcx
    dec %esi
    jnz .loop1

    sub $65536, %rcx
    mov ..., %rdx ; %rdx = c-dst
    mov $1024, %esi
    clc
    .loop2:
    mov (%rcx,%rdx), %rax
    adc %rax, (%rcx)
    lea 8(%rcx), %rcx
    dec %esi
    jnz .loop2
    ...






    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch,alt.folklore.computers on Tue Aug 5 00:21:34 2025
    From Newsgroup: comp.arch

    On Mon, 4 Aug 2025 14:06:17 -0700
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:

    On 8/4/2025 8:32 AM, John Ames wrote:

    snip

    This notion that the only advantage of a 64-bit architecture is a
    large address space is very curious to me. Obviously that's *one* advantage, but while I don't know the in-the-field history of
    heavy-duty business/ scientific computing the way some folks here
    do, I have not gotten the impression that a lot of customers were
    commonly running up against the 4 GB limit in the early '90s;

    Not exactly the same, but I recall an issue with Windows NT where it initially divided the 4GB address space in 2 GB for the OS, and 2GB
    for users. Some users were "running out of address space", so
    Microsoft came up with an option to reduce the OS space to 1 GB, thus allowing up to 3 GB for users. I am sure others here will know more
    details.



    IIRC, it wasn't a problem for the absolute majority of NT users up until
    approximately the turn of the millennium. Even as late as 1999, 128 MB
    was considered a mid-range PC. 64 MB PCs were still sold and bought in
    the dozens of millions.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch,alt.folklore.computers on Mon Aug 4 21:23:10 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 04 Aug 2025 20:29:35 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 04 Aug 2025 15:09:55 -0400
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:

    Scott Lurndal [2025-08-04 15:32:55] wrote:
    Michael S <already5chosen@yahoo.com> writes:
    scott@slp53.sl.home (Scott Lurndal) wrote:
    Michael S <already5chosen@yahoo.com> writes:
    BGB <cr88192@gmail.com> wrote:
    Except for majority of the world where long is 32 bit
    What majority? Linux owns the server market, the
    appliance market and much of the handset market (which apple
    dominates with their OS). And all Unix/Linux systems have
    64-bit longs on 64-bit CPUs.
    Majority of the world is embedded. Overwhelming majority of
    embedded is 32-bit or narrower.
    In terms of shipped units, perhaps (although many are narrower,
    as you point out). In terms of programmers, it's a fairly small
    fraction that do embedded programming.

    Yeah, the unit of measurement is a problem.
    I wonder how it compares if you look at number of programmers paid
    to write C code (after all, we're talking about C).

    In the desktop/server/laptop/handheld world, AFAICT the market
    share of C has shrunk significantly over the years whereas I get
    the impression that it's still quite strong in the embedded space.
    But I don't have any hard data.


    Stefan


    Personally, [outside of Usenet and rwt forum] I know no one except
    myself who writes C targeting user mode on "big" computers (big, in
    my definitions, starts at smartphone).

    Linux developers would be a significant, if not large, pool
    of C programmers.


    According to my understanding, Linux developers *maintain* user-mode C programs. They very rarely start new user-mode C programs from scratch.
    The last big one I can think about was git almost 2 decades ago. And
    even that happened more due to personal idiosyncrasies of its
    originator than for solid technical reasons.
    I could be wrong about it, of course.

    I meant to say 'kernel developers'. My bad.


    For few of your previous project I am convinced that it was a wrong
    tool.

    Sans further details on how you consider C++ as the wrong tool
    for bare-metal operating system/hypervisor development (particularly as the subset used for those projects, which did _not_ include any
    of the standard C++ library, was just as efficient as C but provided
    much better modularization and encapsulation), I'd just say
    that your opinion wasn't widely shared amongst those who actually
    did the work.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Al Kossow@aek@bitsavers.org to comp.arch,alt.folklore.computers on Mon Aug 4 14:41:59 2025
    From Newsgroup: comp.arch

    On 8/4/25 1:54 PM, Michael S wrote:

    Why not go to somebody who has money and interest to build
    microprocessor, but no existing mini/mainframe/SuperC buisness?

    MOS technology was still in the stone age.
    High speed CMOS didn't exist, bipolar wasn't
    very dense, and it was power hungry.
    It took a lot of power to even get 8MIPs (FPS-120B array processor)
    in 1975 and the working memory was tiny.

    HP probably had the most advanced tech with their SOS
    process but they were building stack machines (3000)
    and wouldn't integrate them until the 80s

    None of this makes any sense with the memory performance
    available at the time.

    and.. who would be the buyers?


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch,alt.folklore.computers on Mon Aug 4 21:51:47 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 8/4/2025 8:32 AM, John Ames wrote:

    snip

    This notion that the only advantage of a 64-bit architecture is a large
    address space is very curious to me. Obviously that's *one* advantage,
    but while I don't know the in-the-field history of heavy-duty business/
    scientific computing the way some folks here do, I have not gotten the
    impression that a lot of customers were commonly running up against the
    4 GB limit in the early '90s;

    Not exactly the same, but I recall an issue with Windows NT where it initially divided the 4GB address space in 2 GB for the OS, and 2GB for users. Some users were "running out of address space", so Microsoft
    came up with an option to reduce the OS space to 1 GB, thus allowing up
    to 3 GB for users. I am sure others here will know more details.

    AT&T SVR[34] Unix systems had the same issue on x86, as did Linux. They
    mainly used the same solution as well (give the user 3GB of virtual
    address space).

    I believe SVR4 was also able to leverage 36-bit physical addressing to
    use more than 4GB of DRAM, while still limiting a single process to 2 or
    3GB of user virtual address space.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch,alt.folklore.computers on Mon Aug 4 17:18:24 2025
    From Newsgroup: comp.arch

    On 8/4/2025 3:54 PM, Michael S wrote:
    On Mon, 4 Aug 2025 20:13:54 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Michael S <already5chosen@yahoo.com> schrieb:
    On Mon, 4 Aug 2025 18:16:45 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Although, personally, I think Data General might have been the
    better target. Going to Edson de Castro and telling him that he
    was on the right track with the Nova from the start, and his ideas
    should be extended, might have been politically easier than going
    to DEC.

    I don't quite understand the context of this comment. Can you
    elaborate?

    De Castro had had a big success with a simple load-store
    architecture, the Nova. He did that to reduce CPU complexity
    and cost, to compete with DEC and its PDP-8. (Byte addressing
    was horrible on the Nova, though).

    Now, assume that, as a time traveler wanting to kick off an early
    RISC revolution, you are not allowed to reveal that you are a time
    traveler (which would have larger effects than just a different
    computer architecture). What do you do?

    a) You go to DEC

    b) You go to Data General

    c) You found your own company

    My guess would be that, with DEC, you would have the least chance of
    convincing corporate brass of your ideas. With Data General, you
    could try appealing to the CEO's personal history of creating the
    Nova, and thus his vanity. That could work. But your own company
    might actually be the best choice, if you can get the venture
    capital funding.


    Why not go to somebody who has money and interest to build
    microprocessor, but no existing mini/mainframe/SuperC buisness?
    If we limit ourselves to USA then Moto, Intel, AMD, NatSemi...
    May be, even AT&T ? Or was AT&T stil banned from making computers in
    the mid 70s?


    AFAIK (from what I heard about all of this):
    The ban on AT&T was the whole reason they released Unix freely.

    Then, when the restrictions lifted (after the AT&T break-up), they tried
    to re-assert their control over Unix, which backfired. And they tried to
    make and release a workstation, but by then they were competing against
    the IBM PC clone market (and also everyone else trying to sell Unix
    workstations at the time), ...

    Then, while they were trying to re-consolidate Unix under their control
    and fighting with the BSD people over copyright, etc., Linux and
    Microsoft came in and mostly ate what market they might have had.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Aug 5 01:43:56 2025
    From Newsgroup: comp.arch

    On Tue, 5 Aug 2025 00:14:43 +0300
    Michael S <already5chosen@yahoo.com> wrote:

    On Mon, 4 Aug 2025 22:49:23 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Michael S <already5chosen@yahoo.com> writes:
    Actually, in our world the latest C standard (C23) has them, but
    the spelling is different: _BitInt(32) and unsigned _BitInt(32).
    I'm not sure if any major compiler already has them implemented.
    Bing copilot says that clang does, but I don't tend to believe
    everything Bing copilot says.

    I asked godbolt, and tried the following program:

    typedef ump unsigned _BitInt(65535);

    The actual compiling version is:

    typedef unsigned _BitInt(65535) ump;

    ump sum3(ump a, ump b, ump c)
    {
    return a+b+c;
    }

    I would naively expect the ump type to be defined as an array of
    unsigned (byte/short/int/long), possibly with a header defining how
    large the allocation is and how many bits are currently defined.

    The actual code to add three of them could be something like

    xor rax,rax
    next:
    add rax,[rsi+rcx*8]
    adc rdx,0
    add rax,[r8+rcx*8]
    adc rdx,0
    add rax,[r9+rcx*8]
    adc rdx,0
    mov [rdi+rcx*8],rax
    mov rax,rdx
    inc rcx
    cmp rcx,r10
    jb next

    The main problem here is of course that every add operation depends
    on the previous, so max speed would be 4-5 clock cycles/iteration.

    Terje


    I would guess that even a pair of x86-style loops would likely be
    faster than that on most x86-64 processors made in last 15 years.
    Despite doing 1.5x more memory accesses.
    ; rcx = dst
    ; rdx = a - dst
    ; r8 = b - dst
    mov $1024, %esi
    clc
    .loop1:
    mov (%rcx,%r8), %rax
    adc (%rcx,%rdx), %rax
    mov %rax, (%rcx)
    lea 8(%rcx), %rcx
    dec %esi
    jnz .loop1

    sub $65536, %rcx
    mov ..., %rdx ; %rdx = c-dst
    mov $1024, %esi
    clc
    .loop2:
    mov (%rcx,%rdx), %rax
    adc %rax, (%rcx)
    lea 8(%rcx), %rcx
    dec %esi
    jnz .loop2
    ...



    For extremely wide cores, like Apple's M (modulo ISA), AMD Zen5 and
    Intel Lion Cove, I'd do the following modification to your inner loop
    (back in Intel syntax):

    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    adc edx,edx
    add rax,[r9+rcx*8]
    adc edx,0
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret


    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never.
    incremen_edx:
    inc edx
    jmp edx_ready


    Less wide cores will likely benefit from reduction of the number of
    executed instructions (and more importantly the number of decoded and
    renamed instructions) through unrolling by 2, 3 or 4.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch,alt.folklore.computers on Mon Aug 4 23:24:15 2025
    From Newsgroup: comp.arch

    In comp.arch John Ames <commodorejohn@gmail.com> wrote:
    On Sat, 02 Aug 2025 23:10:56 -0400
    Stefan Monnier <monnier@iro.umontreal.ca> wrote:

    And what a waste of a 64-bit architecture, to run it in 32-bit-only
    mode ...

    What do you mean by that? IIUC, the difference between 32bit and
    64bit (in terms of cost of designing and producing the CPU) was very
    small. MIPS happily designed their R4000 as 64bit while knowing that
    most of them would never get a chance to execute an instruction that
    makes use of the upper 32bits.

    This notion that the only advantage of a 64-bit architecture is a large address space is very curious to me. Obviously that's *one* advantage,
    but while I don't know the in-the-field history of heavy-duty business/ scientific computing the way some folks here do, I have not gotten the impression that a lot of customers were commonly running up against the
    4 GB limit in the early '90s; meanwhile, the *other* advantage - higher performance for the same MIPS on a variety of compute-bound tasks - is
    being overlooked entirely, it seems.

    Well, as long as an app fits into a 32-bit address space, all other
    factors being equal one can expect 10-20% better performance
    from 32-bit addresses. Due to this, customers had motivation
    to stay with 32 bits as long as possible.

    But the matter is somewhat different for an OS vendor: once a machine
    gets more than 1GB of memory, 64-bit addressing in the kernel avoids
    various troubles.

    Concerning applications: a server with multiple processes sharing
    memory may operate with several gigabytes while using 32-bit
    addresses for applications.

    But for numeric work, 512 MB of real memory and more than 3 GB
    virtual (with swapping to disc) may give adequate performance.
    However, it is quite inconvenient for a 32-bit OS to provide more than
    3 GB of address space to applications.

    Also, a heavily multithreaded application with some threads needing
    large stacks is inconvenient in a 32-bit address space.

    Of course software developers wanting to develop for 64-bit
    systems need 64-bit system interfaces.

    So, supporting 32-bit applications was natural, and one could expect
    that for some (possibly quite long) time 32-bit applications
    would be the majority. But supporting 64-bit operation was also
    important, both for customers and for the OS itself.

    BTW: AMD-64 was a special case: since 64-bit mode was bundled
    with an increased number of GPRs, with PC-relative addressing
    and with a register-based call convention, on average 64-bit
    code was faster than 32-bit code. And since AMD-64 was
    relatively late to the 64-bit game, there was limited motivation
    to develop a mode using 32-bit addressing with 64-bit instructions.
    It works in compilers and in Linux, but support is much worse
    than for 64-bit addressing.
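
    As an illustration (assuming a toolchain and kernel built with x32
    support), the same trivial source compiled with gcc -m64 and with
    gcc -mx32 differs only in the pointer and long sizes:

    #include <stdio.h>

    int main(void)
    {
        /* -m64 prints 8 and 8; -mx32 prints 4 and 4, while the code still
           uses the 64-bit registers and instruction set. */
        printf("sizeof(long)   = %zu\n", sizeof(long));
        printf("sizeof(void *) = %zu\n", sizeof(void *));
        return 0;
    }
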
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch,alt.folklore.computers on Mon Aug 4 23:38:53 2025
    From Newsgroup: comp.arch

    In comp.arch Scott Lurndal <scott@slp53.sl.home> wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 8/4/2025 8:32 AM, John Ames wrote:

    snip

    This notion that the only advantage of a 64-bit architecture is a large
    address space is very curious to me. Obviously that's *one* advantage,
    but while I don't know the in-the-field history of heavy-duty business/
    scientific computing the way some folks here do, I have not gotten the
    impression that a lot of customers were commonly running up against the
    4 GB limit in the early '90s;

    Not exactly the same, but I recall an issue with Windows NT where it initially divided the 4GB address space in 2 GB for the OS, and 2GB for users. Some users were "running out of address space", so Microsoft
    came up with an option to reduce the OS space to 1 GB, thus allowing up
    to 3 GB for users. I am sure others here will know more details.

    AT&T SVR[34] Unix systems had the same issue on x86, as did linux. They mainly used the same solution as well (give the user 3GB) of virtual
    address space.

    I believe SVR4 was also able to leverage 36-bit physical addressing to
    use more 4GB of DRAM, while still limiting a single process to 2 or 3GB
    of user virtual address space.

    IIRC Linux pretty early used 3 GB for users and 1 GB for the kernel.
    Other splits (including 2 GB + 2 GB) were available as an option.
    With PAE Linux offered 4 GB (or maybe 3.5 GB) and whatever amount
    of RAM was supported by PAE, but in this mode the kernel was slower
    than the standard one.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch,alt.folklore.computers on Mon Aug 4 23:45:21 2025
    From Newsgroup: comp.arch

    In comp.arch Al Kossow <aek@bitsavers.org> wrote:
    On 8/2/25 1:07 AM, Waldek Hebisch wrote:

    IIUC PRISM eventually became Alpha.

    Not really. Documents for both, including
    the rare PRISM docs are on bitsavers.
    PRISM came out of Cutler's DEC West group,
    Alpha from the East Coast. I'm not aware
    of any team member overlap.

    Well, from the people's point of view they were different efforts.
    From the company's point of view there was a project to deliver a
    high-performance RISC-y machine, and it finally succeeded when a
    new team did the work. I think that at least the high-level
    knowledge gained in the PRISM project was useful for Alpha.
    I would expect that some detailed work was reused, but I
    do not know how much.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch,alt.folklore.computers on Mon Aug 4 23:52:55 2025
    From Newsgroup: comp.arch

    In comp.arch Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    <snip>
    OTOH Unix for Alpha was claimed to be pure 64-bit.

    It depends on the kind of purity you are aspiring to. After a bunch
    of renamings it was finally called Tru64 UNIX. Not Pur64, but
    Tru64:-) Before that, it was called Digital UNIX (but once DEC had
    been bought by Compaq, that was no longer appropriate), and before
    that, DEC OSF/1 AXP.

    The C environment for DEC OSF/1 was an I32LP64 setup, not an ILP64
    setup, so can you really call it pure?

    What counts are the OS interfaces. C, while playing a prominent role, is
    just one of the programming languages. While 'int' leaked into early
    system interfaces, later ones used abstract types for most things.
    So as long as C provided a 64-bit integer type (that is, long) and
    64-bit pointers, this was OK.

    And as others noticed, I32LP64 was very common.

    Anyway, given the system interfaces, one could naturally implement a
    language where the only integer type is 64-bit. That is enough
    for me to call this pure 64-bit.

    In addition there were some OS features for running ILP32 programs,
    similar to Linux' MAP_32BIT flag for mmap(). IIRC Netscape Navigator
    was compiled as ILP32 program (the C compiler had a flag for that),
    and needed these OS features.

    Again, that is not a problem for _my_ notion of purity.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Tue Aug 5 00:55:09 2025
    From Newsgroup: comp.arch

    In comp.arch Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    In comp.arch Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Did the VAX 11/780 have writable microcode?

    Yes, 12 kB (2K words 96-bit each).

    So that's 12KB of fast RAM that could have been reused for making the
    cache larger in a RISC-VAX, maybe increasing its size from 2KB to
    12KB.

    The VAX-780 architecture handbook says the cache was 8 KB and used 8-byte
    lines. So an extra 12 KB of fast RAM could more than double the cache size.
    That would be a nice improvement, but not as dramatic as an increase
    from 2 KB to 12 KB.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch,alt.folklore.computers on Tue Aug 5 01:31:31 2025
    From Newsgroup: comp.arch

    In article <2025Aug3.185110@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    [snip]
    The C environment for DEC OSF/1 was an I32LP64 setup, not an ILP64
    setup, so can you really call it pure?

    I would. The definition of "purity" I usually adopted during the
    transition to 64-bit CPUs was that pointers were 64 bits. The
    catchphrase at the time was "64-bit clean", which usually meant
    that you didn't use ints to type-pun for pointers.

    Many ABIs on modern-day 64-bit machines are still I32LP64; ILP64
    is really too large in many respects.

    In addition there were some OS features for running ILP32 programs,
    similar to Linux' MAP_32BIT flag for mmap(). IIRC Netscape Navigator
    was compiled as ILP32 program (the C compiler had a flag for that),
    and needed these OS features.

    MAP_32BIT is only used on x86-64 on Linux, and was originally
    a performance hack for allocating thread stacks: apparently, it
    was cheaper to do a thread switch with a stack below the 4GiB
    barrier (sign extension artifact maybe? Who knows...). It's
    no longer required for that, and there's no indication that it
    was for supporting ILP32 on a 64-bit system.
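
    For reference, a minimal sketch of how the flag is requested on
    x86-64 Linux (illustrative only); the kernel then places the mapping
    in the low, 32-bit-addressable part of the address space:

    #define _GNU_SOURCE             /* MAP_32BIT is a Linux extension */
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        void *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }
        printf("mapped at %p\n", p);   /* a low address, fits in 32 bits */
        return 0;
    }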

    In the OS kernel, often times you want to allocate physical
    address space below 4GiB for e.g. device BARs; many devices are
    either 32-bit (but have to work on 64-bit systems) or work
    better with 32-bit BARs.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Tue Aug 5 01:34:00 2025
    From Newsgroup: comp.arch

    On Mon, 4 Aug 2025 08:32:19 -0700, John Ames wrote:

    This notion that the only advantage of a 64-bit architecture is a large address space is very curious to me.

    That is basically it.

    Obviously that's *one* advantage, but while I don't know the
    in-the-field history of heavy-duty business/ scientific computing
    the way some folks here do, I have not gotten the impression that a
    lot of customers were commonly running up against the 4 GB limit in
    the early '90s ...

    By the latter 1990s, as GPUs became popular in the consumer market, the
    amount of VRAM on them kept growing, taking up more and more
    significant chunks of a 32-bit address space. So that was one of the
    drivers towards 64-bit addressing.

    ... meanwhile, the *other* advantage - higher performance for the
    same MIPS on a variety of compute-bound tasks - is being overlooked
    entirely, it seems.

    I don’t think there is one. A lot of computation involves floating point, and the floating-point formats mostly remain the same ones defined by
    IEEE-754 back in the 1980s.

    In the x86 world, there is the performance boost in the switch from the
    old register-poor 32-bit 80386 instruction set to the larger register pool available in AMD’s 64-bit extensions.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Tue Aug 5 01:36:06 2025
    From Newsgroup: comp.arch

    On Mon, 4 Aug 2025 22:17:24 +0300, Michael S wrote:

    Didn't majority 32-bit RISC machines with general-purpose ambitions have 64-bit FP registers?

    A common “extended” floating-point format for IEEE-754-compatible use was 80 bits. E.g. Apple’s 1980s-vintage SANE numerics library used this as its internal format for all computations. Motorola implemented a 96-bit format
    in its 68881-and-following hardware floating-point processors, but
    actually only 80 of those bits were used -- the rest was just alignment padding.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Tue Aug 5 01:39:14 2025
    From Newsgroup: comp.arch

    On Mon, 4 Aug 2025 14:06:17 -0700, Stephen Fuld wrote:

    ... I recall an issue with Windows NT where it initially divided the
    4GB address space in 2 GB for the OS, and 2GB for users. Some users
    were "running out of address space", so Microsoft came up with an
    option to reduce the OS space to 1 GB, thus allowing up to 3 GB for
    users. I am sure others here will know more details.

    That would have been prone to breakage in poorly-written programs that
    were using signed instead of unsigned comparisons on memory block sizes.

    I hit an earlier version of this problem in about the mid-1980s, trying to help a user install WordStar on his IBM PC, which was one of the earliest machines to have 640K of RAM. The WordStar installer balked, saying he didn’t have enough free RAM!

    The solution: create a dummy RAM disk to bring the free memory size down
    below 512K. Then after the installation succeeded, the RAM disk could be removed.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Tue Aug 5 01:41:15 2025
    From Newsgroup: comp.arch

    On Mon, 4 Aug 2025 23:24:15 -0000 (UTC), Waldek Hebisch wrote:

    BTW: AMD-64 was a special case: since 64-bit mode was bundled with
    increasing number of GPR-s, with PC-relative addressing and with register-based call convention on average 64-bit code was faster than
    32-bit code. And since AMD-64 was relatively late in 64-bit game there
    was limited motivation to develop mode using 32-bit addressing and
    64-bit instructions. It works in compilers and in Linux, but support is
    much worse than for using 64-bit addressing.

    Intel was trying to promote this in the form of the “X32” ABI. The Linux kernel and some distros did include support for this. I don’t think it was very popular, and it may be extinct now.
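
    For anyone who never met X32, the practical difference shows up in the
    basic type sizes. A minimal sketch (mine): built with "gcc -mx32" on a
    system that still carries the support it prints 4/4/4, while a plain
    64-bit build prints 4/8/8.

    #include <stdio.h>

    int main(void)
    {
        /* X32: 32-bit longs and pointers, but the full AMD64 register set. */
        printf("int=%zu long=%zu ptr=%zu\n",
               sizeof(int), sizeof(long), sizeof(void *));
        return 0;
    }
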
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Tue Aug 5 01:43:14 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    I can understand why DEC abandoned VAX: already in 1985 they
    had some disadvantage and they saw no way to compete against
    superscalar machines which were on the horizon. In 1985 they
    probably realized, that their features add no value in world
    using optimizing compilers.

    Optimizing compilers increase the advantages of RISCs, but even with a
    simple compiler Berkeley RISC II (which was made by hardware people,
    not compiler people) has between 85% and 256% of VAX (11/780) speed.
    It also has 16-bit and 32-bit instructions for improved code density
    and (apparently from memory bandwidth issues) performance.

    The basic question is whether the VAX could afford the pipeline. The VAX
    had a rather complex memory and bus interface, and the cache added
    complexity too. Ditching microcode could free up resources for the
    execution path. Clearly the VAX could afford, and probably had, a
    1-cycle 32-bit ALU. I doubt that they could afford a 1-cycle multiply
    or even a barrel shifter, so they needed a sequencer for sane assembly
    programming. I am not sure what technology they used for the register
    file. Most likely it was fast RAM, but that normally would give one R/W
    port. A multiported register file probably would need a lot of separate
    register chips and a multiplexer. Alternatively, they could try some
    very fast RAM and run it at a multiple of the base clock frequency
    (66 ns cycle time caches were available at that time, so 3 ports via
    multiplexing seem possible). But any of this adds considerable
    complexity. A sane pipeline needs interlocks and forwarding.

    It was accepted in that era that using more hardware could give a
    substantial speedup. IIUC IBM used a quadratic rule: performance was
    supposed to be proportional to the square of the CPU price. That was
    partly marketing, but partly due to compromises needed in smaller
    machines.

    Concerning RISC-II, IIUC its instruction set was too simplified; later
    RISCs added more complex instructions and removed a few nasty ones. So
    RISC-II was a good proof of concept, but one needed a bigger machine to
    start a viable product line.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Tue Aug 5 01:43:36 2025
    From Newsgroup: comp.arch

    On Mon, 04 Aug 2025 17:23:24 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:

    On Sat, 02 Aug 2025 09:28:17 GMT, Anton Ertl wrote:

    In my RISC-VAX scenario, the RISC-VAX would be the PDP-11 followon
    instead of the actual (CISC) VAX, so there would be no additional
    ISA.

    In order to be RISC, it would have had to add registers and remove
    addressing modes from the non-load/store instructions (and replace
    "move" with separate "load" and "store" instructions).

    Add registers: No, ARM A32 is RISC and has as many registers as VAX ...

    It was the PDP-11 we were talking about as the starting point.
    Remember Anton’s claim is that it was unnecessary to do the complete
    redesign that was the VAX, that something could have been done that
    was more backward-compatible with the PDP-11.

    But no, I don’t think that was possible, and the above is why.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Tue Aug 5 01:46:03 2025
    From Newsgroup: comp.arch

    On Mon, 4 Aug 2025 12:27:04 -0700, Al Kossow wrote:

    The essence of RISC really is just exposing what existed in the
    microcode engines to user-level programming and didn't really make
    sense until main memory systems got a lot faster.

    How do you reconcile this with the fact that the CPU-RAM speed gap is
    even wider now than it was back then?

    I would amend that to say, RISC started to make sense when fast RAM
    became cheap enough to use as a cache to bridge the gap.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Tue Aug 5 01:47:48 2025
    From Newsgroup: comp.arch

    On Mon, 4 Aug 2025 20:13:54 -0000 (UTC), Thomas Koenig wrote:

    a) You go to DEC

    b) You go to Data General

    c) You found your own company

    How about d) Go talk to the man responsible for the fastest machines in
    the world around that time, i.e. Seymour Cray?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Tue Aug 5 01:53:08 2025
    From Newsgroup: comp.arch

    On Mon, 4 Aug 2025 17:18:24 -0500, BGB wrote:

    The ban on AT&T was the whole reason they released Unix freely.

    It was never really “freely” available.

    Then when things lifted (after the AT&T break-up), they tried to
    re-assert their control over Unix, which backfired.

    They were already tightening things up from the Seventh Edition onwards -- remember, this version rescinded the permission to use the source code for classroom teaching purposes, neatly strangling the entire market for the legendary Lions Book. Which continued to spread afterwards via samizdat, nonetheless.

    And, they tried to make and release a workstation, but by then they
    were competing against the IBM PC Clone market (and also everyone
    else trying to sell Unix workstations at the time), ...

    That was a very successful market, from about the mid-1980s until the mid-to-latter 1990s. In spite of all the vendor lock-in and fragmentation, it managed to survive, I think, because of the sheer performance available in
    the RISC processors, which Microsoft tried to support with its new
    “Windows NT” OS, but was never able to get quite right.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.arch,comp.lang.c on Mon Aug 4 22:21:08 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 4 Aug 2025 15:25:54 -0400
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:
    On 2025-08-04 15:03, Michael S wrote:
    On Mon, 04 Aug 2025 09:53:51 -0700
    Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    ...
    In C17 and earlier, _BitInt is a reserved identifier. Any attempt
    to use it has undefined behavior. That's exactly why new keywords
    are often defined with that ugly syntax.


    That is language lawyer's type of reasoning. Normally gcc
    maintainers are wiser than that because, well, by chance gcc
    happens to be widely used production compiler. I don't know why
    this time they had chosen less conservative road.

    If _BitInt is accepted by older versions of gcc, that means it was
    supported as a fully-conforming extension to C. Allowing
    implementations to support extensions in a fully-conforming manner is
    one of the main purposes for which the standard reserves identifiers.
    If you thought that gcc was too conservative to support extensions,
    you must be thinking of the wrong organization.


    I know that gcc supports extensions.
    I also know that gcc didn't support *this particular extension* up
    until quite recently. I would guess, up until this calendar year.
    Introducing new extension without way to disable it is different from supporting gradually introduced extensions, typically with names that
    start by double underscore and often starting with __builtin.

    BTW, I still didn't think deeply about it and still hope that outside
    of C23 mode gcc somehow cared to make name clash unlikely.

    Using the keyword _BitInt makes name clashes nearly impossible.

    In C23, it's a new keyword. In pre-C23, it's an
    implementation-defined keyword that will not clash with any
    identifier that any portable code will use, or that any non-portable
    code is at all likely to use for anything other than its new
    standard meaning.

    Breaking existing code that uses "_BitInt" as an identifier is
    a non-issue. There very probably is no such code.

    The feature can be disabled by not using "_BitInt" in your code,
    or by using the "-pedantic-errors" option with a standard earlier
    than C23.
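
    For concreteness, a minimal sketch (mine, not from the post) of the
    feature under discussion; per the above, a compiler may accept this in
    pre-C23 modes as a conforming extension, while -std=c17 -pedantic-errors
    should reject it.

    #include <stdio.h>

    int main(void)
    {
        unsigned _BitInt(40) x = 1;     /* exact 40-bit unsigned integer */
        x <<= 39;                       /* still fits in 40 bits */
        printf("%llu\n", (unsigned long long)x);
        return 0;
    }
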

    I'm not sure how much more conservative gcc could have been.
    (Likewise for clang.)
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Tue Aug 5 05:35:41 2025
    From Newsgroup: comp.arch

    On Mon, 4 Aug 2025 23:52:55 -0000 (UTC), Waldek Hebisch wrote:

    And as others noticed, I32LP64 was very common.

    Still is the most common.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Tue Aug 5 05:37:22 2025
    From Newsgroup: comp.arch

    On Mon, 4 Aug 2025 03:32:52 -0700, Al Kossow wrote:

    MIPS products came out of DECWRL (the research group started to build
    Titan) and were stopgaps until the "real" architecture came out
    (Cutler's out of DECWest)
    I don't think it ever got much love out of DEC corporate and were just
    done so DEC didn't completely get their lunch eaten in the Unix
    workstation market.

    There were many in high places at DEC who didn’t like Unix at all. Dave Cutler was one of them, and I think Ken Olsen, right at the top, as well.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Tue Aug 5 05:38:08 2025
    From Newsgroup: comp.arch

    On Mon, 4 Aug 2025 12:19:38 +0300, Michael S wrote:

    Except for majority of the world where long is 32 bit

    That only applies on Windows, as far as we can tell.

    The majority of the world is I32LP64.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Aug 5 05:44:59 2025
    From Newsgroup: comp.arch

    Waldek Hebisch <antispam@fricas.org> schrieb:
    In comp.arch Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    In comp.arch Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Did the VAX 11/780 have writable microcode?

    Yes, 12 kB (2K words 96-bit each).

    So that's 12KB of fast RAM that could have been reused for making the
    cache larger in a RISC-VAX, maybe increasing its size from 2KB to
    12KB.

    VAX-780 architecture handbook says cache was 8 KB and used 8-byte
    lines. So extra 12KB of fast RAM could double cache size.
    That would be nice improvement, but not as dramatic as increase
    from 2 KB to 12 KB.

    It could have been used as icache, for example.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Aug 5 05:48:16 2025
    From Newsgroup: comp.arch

    Waldek Hebisch <antispam@fricas.org> schrieb:
    I am not sure what technolgy they used
    for register file. For me most likely is fast RAM, but that
    normally would give 1 R/W port.

    They used fast SRAM and had three copies of their registers,
    for 2R1W.
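
    As a toy C model (my illustration, not DEC's actual design) of the
    general replication trick: every write updates all copies and each read
    port is served by its own copy, so single-ported RAM still yields
    several reads per cycle. Whether the 11/780 used its third copy for an
    extra read port or for something else, the above doesn't say.

    #include <stdint.h>

    #define NREGS   16
    #define NCOPIES 3

    static uint32_t regfile[NCOPIES][NREGS];

    static void reg_write(unsigned r, uint32_t value)
    {
        for (unsigned c = 0; c < NCOPIES; c++)   /* keep all copies in sync */
            regfile[c][r] = value;
    }

    static uint32_t reg_read_port(unsigned port, unsigned r)
    {
        return regfile[port][r];                 /* one copy per read port */
    }
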
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From vallor@vallor@cultnix.org to comp.arch,alt.folklore.computers on Tue Aug 5 05:56:45 2025
    From Newsgroup: comp.arch

    On Tue, 5 Aug 2025 01:41:15 -0000 (UTC), Lawrence D'Oliveiro wrote:

    On Mon, 4 Aug 2025 23:24:15 -0000 (UTC), Waldek Hebisch wrote:

    BTW: AMD-64 was a special case: since 64-bit mode was bundled with
    increasing number of GPR-s, with PC-relative addressing and with
    register-based call convention on average 64-bit code was faster than
    32-bit code. And since AMD-64 was relatively late in 64-bit game there
    was limited motivation to develop mode using 32-bit addressing and
    64-bit instructions. It works in compilers and in Linux, but support is
    much worse than for using 64-bit addressing.

    Intel was trying to promote this in the form of the “X32” ABI. The Linux kernel and some distros did include support for this. I don’t think it was very popular, and it may be extinct now.

    It's still in the Linux kernel, but off by default.

    arch/x86/Kconfig

    I went to an O'Reilly "Foo Camp" where AMD was showing off their
    new 64-bit processor. Found it fascinating, if a little over my
    head. But I did gather that the instruction set made sense for transitioning from 32-bit software, and I think Intel missed the boat with their IA-64.

    (And I have memories of when Intel started making "EM64T" processors...)
    --
    -Scott System76 Thelio Mega v1.1 x86_64 NVIDIA RTX 3090Ti 24G
    OS: Linux 6.16.0 D: Mint 22.1 DE: Xfce 4.18
    NVIDIA: 575.64.05 Mem: 258G
    "Excuse me for butting in, but I'm interrupt-driven."
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Tue Aug 5 06:46:18 2025
    From Newsgroup: comp.arch

    On Mon, 4 Aug 2025 18:07:48 +0300, Michael S wrote:

    Majority of the world is embedded. Overwhelming majority of embedded is
    32-bit or narrower.

    Embedded CPUs are mostly ARM, MIPS, RISC-V ... all of which are available
    in 64-bit variants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch,alt.folklore.computers on Tue Aug 5 03:14:17 2025
    From Newsgroup: comp.arch

    On 8/5/2025 1:46 AM, Lawrence D'Oliveiro wrote:
    On Mon, 4 Aug 2025 18:07:48 +0300, Michael S wrote:

    Majority of the world is embedded. Overwhelming majority of embedded is
    32-bit or narrower.

    Embedded CPUs are mostly ARM, MIPS, RISC-V ... all of which are available
    in 64-bit variants.

    Well, along with, traditionally, 6502 and Z80, and MSP430.

    The Atmel AVR was also pretty popular for a while, though AFAIK more in
    the hobbyist space (say, more popularity due to Arduino than due to its
    use in consumer electronics). Whereas the MSP430 was fairly widespread
    in the latter (and a fairly common chip for running things like mice and keyboards).

    There were more advanced versions of the MSP430, with a 20 bit address
    space, etc. But the most readily available versions typically used a
    16-bit address space (with typically between 0.25K and 2K of RAM; and 1K
    to 48K of ROM).


    In most cases, one got C with a similar programming model; namely 'int'
    being 16 bit. Though, the Arduino platform used C++.

    I was left thinking that I had still seen a lot of K&R style C in the
    6502 and Z80 spaces, but can't seem to confirm.




    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Kerr-Mudd, John@admin@127.0.0.1 to comp.arch,alt.folklore.computers on Tue Aug 5 09:25:28 2025
    From Newsgroup: comp.arch

    On Tue, 5 Aug 2025 01:39:14 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    On Mon, 4 Aug 2025 14:06:17 -0700, Stephen Fuld wrote:

    ... I recall an issue with Windows NT where it initially divided the
    4GB address space in 2 GB for the OS, and 2GB for users. Some users
    were "running out of address space", so Microsoft came up with an
    option to reduce the OS space to 1 GB, thus allowing up to 3 GB for
    users. I am sure others here will know more details.

    That would have been prone to breakage in poorly-written programs that
    were using signed instead of unsigned comparisons on memory block sizes.

    I hit an earlier version of this problem in about the mid-1980s, trying to help a user install WordStar on his IBM PC, which was one of the earliest machines to have 640K of RAM. The WordStar installer balked, saying he didn’t have enough free RAM!

    The solution: create a dummy RAM disk to bring the free memory size down below 512K. Then after the installation succeeded, the RAM disk could be removed.

    I recall the time our DOS-based install disks (network boot and re-image a
    PC from a server) failed. It was the first time we'd seen a PC with 4G (I think) of RAM; DOS was wrapping addressed memory and overwriting the
    running batch file!
    --
    Bah, and indeed Humbug.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch,alt.folklore.computers on Tue Aug 5 13:46:13 2025
    From Newsgroup: comp.arch

    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <2025Aug3.185110@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    [snip]
    The C environment for DEC OSF/1 was an I32LP64 setup, not an ILP64
    setup, so can you really call it pure?

    In the OS kernel, often times you want to allocate physical
    address space below 4GiB for e.g. device BARs; many devices are
    either 32-bit (but have to work on 64-bit systems) or work
    better with 32-bit BARs.

    Indeed. Modern PCI controllers tend to support remapping
    a 64-bit physical address in the hardware to support devices
    that only advertise 32-bit bars[*]. The firmware (e.g. UEFI
    or BIOS) will setup the remapping registers and provide the
    address of the 64-bit aperture to the kernel via device tree
    or ACPI tables.

    [*] AHCI is the typical example, which uses BAR5.
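
    A small sketch (mine, not from the post) of how software tells the two
    kinds of memory BAR apart: bit 0 of a BAR distinguishes I/O space, and
    bits [2:1] of a memory BAR encode the type (0b00 = 32-bit only,
    0b10 = 64-bit capable). A 32-bit-only BAR has to be assigned an address
    below 4 GiB, hence the remapping described above.

    #include <stdint.h>
    #include <stdbool.h>

    static bool bar_is_64bit(uint32_t bar)
    {
        if (bar & 0x1)                       /* I/O space BAR, not memory */
            return false;
        return ((bar >> 1) & 0x3) == 0x2;    /* memory BAR type field */
    }

    int main(void)
    {
        /* 0x0000000C: memory BAR, 64-bit capable, prefetchable. */
        return bar_is_64bit(0x0000000Cu) ? 0 : 1;
    }
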
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Aug 5 13:56:16 2025
    From Newsgroup: comp.arch

    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    I can understand why DEC abandoned VAX: already in 1985 they
    had some disadvantage and they saw no way to compete against
    superscalar machines which were on the horizon. In 1985 they
    probably realized, that their features add no value in world
    using optimizing compilers.

    Optimizing compilers increase the advantages of RISCs, but even with a
    simple compiler Berkeley RISC II (which was made by hardware people,
    not compiler people) has between 85% and 256% of VAX (11/780) speed.
    It also has 16-bit and 32-bit instructions for improved code density
    and (apparently from memory bandwidth issues) performance.

    The basic question is if VAX could afford the pipeline. VAX had
    rather complex memory and bus interface, cache added complexity
    too. Ditching microcode could allow more resources for execution
    path. Clearly VAX could afford and probably had 1-cycle 32-bit
    ALU. I doubt that they could afford 1-cycle multiply or
    even a barrel shifter. So they needed a seqencer for sane
    assembly programming. I am not sure what technolgy they used
    for register file. For me most likely is fast RAM, but that
    normally would give 1 R/W port. Multiported register file
    probably would need a lot of separate register chips and
    multiplexer. Alternatively, they could try some very fast
    RAM and run it at multiple of base clock frequency (66 ns
    cycle time caches were available at that time, so 3 ports
    via multiplexing seem possible). But any of this adds
    considerable complexity. Sane pipeline needs interlocks
    and forwarding.

    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch,alt.folklore.computers on Tue Aug 5 13:58:12 2025
    From Newsgroup: comp.arch

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Mon, 4 Aug 2025 20:13:54 -0000 (UTC), Thomas Koenig wrote:

    a) You go to DEC

    b) You go to Data General

    c) You found your own company

    How about d) Go talk to the man responsible for the fastest machines in
    the world around that time, i.e. Seymour Cray?

    I did speak with him, once, when he was visiting my
    godfather in Chippewa Falls. I was rather young at the time
    and had no clue who he was until years later, sadly.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Tue Aug 5 14:13:39 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:
    I am not sure what technolgy they used
    for register file. For me most likely is fast RAM, but that
    normally would give 1 R/W port.

    They used fast SRAM and had three copies of their registers,
    for 2R1W.

    Hmm, normal RAM cannot do a read during a write, and AFAICS one needs
    to do writes to all copies. So this seems to require doubling the clock
    to put 2 RAM cycles into one CPU cycle. Or do you mean 2 reads or
    1 write per cycle?
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch,alt.folklore.computers on Tue Aug 5 17:24:34 2025
    From Newsgroup: comp.arch

    Stephen Fuld wrote:
    On 8/4/2025 8:32 AM, John Ames wrote:

    snip

    This notion that the only advantage of a 64-bit architecture is a large
    address space is very curious to me. Obviously that's *one* advantage,
    but while I don't know the in-the-field history of heavy-duty business/
    scientific computing the way some folks here do, I have not gotten the
    impression that a lot of customers were commonly running up against the
    4 GB limit in the early '90s;

    Not exactly the same, but I recall an issue with Windows NT where it initially divided the 4GB address space in 2 GB for the OS, and 2GB for users.  Some users were "running out of address space", so Microsoft
    came up with an option to reduce the OS space to 1 GB, thus allowing up
    to 3 GB for users.  I am sure others here will know more details.
    Any program written to Microsoft/Windows spec would work transparently
    with a 3:1 split, the problem was all the programs ported from unix
    which assumed that any negative return value was a failure code.
    In effect, the program had to promise the OS that it would behave
    correctly before it was allowed to allocate more than 2GB of memory.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Aug 5 17:31:34 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Tue, 5 Aug 2025 00:14:43 +0300
    Michael S <already5chosen@yahoo.com> wrote:

    On Mon, 4 Aug 2025 22:49:23 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Michael S <already5chosen@yahoo.com> writes:
    Actually, in our world the latest C standard (C23) has them, but
    the spelling is different: _BitInt(32) and unsigned _BitInt(32).
    I'm not sure if any major compiler already has them implemented.
    Bing copilot says that clang does, but I don't tend to believe
    everything Bing copilot says.

    I asked godbolt, and tried the following program:

    typedef ump unsigned _BitInt(65535);

    The actual compiling version is:

    typedef unsigned _BitInt(65535) ump;

    ump sum3(ump a, ump b, ump c)
    {
    return a+b+c;
    }

    I would naively expect the ump type to be defined as an array of
    unsigned (byte/short/int/long), possibly with a header defining how
    large the allocation is and how many bits are currently defined.

    The actual code to add three of them could be something like

    xor rax,rax
    next:
    add rax,[rsi+rcx*8]
    adc rdx,0
    add rax,[r8+rcx*8]
    adc rdx,0
    add rax,[r9+rcx*8]
    adc rdx,0
    mov [rdi+rcx*8],rax
    mov rax,rdx
    inc rcx
    cmp rcx,r10
    jb next

    The main problem here is of course that every add operation depends
    on the previous, so max speed would be 4-5 clock cycles/iteration.

    Terje


    I would guess that even a pair of x86-style loops would likely be
    faster than that on most x86-64 processors made in last 15 years.
    Despite doing 1.5x more memory accesses.
    ; rcx = dst
    ; rdx = a - dst
    ; r8 = b - dst
    mov $1024, %esi
    clc
    .loop1:
    mov (%rcx,%r8), %rax
    adc (%rcx,%rdx), %rax
    mov %rax, (%rcx)
    lea 8(%rcx), %rcx
    dec %esi
    jnz .loop1

    sub $65536, %rcx
    mov ..., %rdx ; %rdx = c-dst
    mov $1024, %esi
    clc
    .loop2:
    mov (%rcx,%rdx), %rax
    adc %rax, (%rcx)
    lea 8(%rcx), %rcx
    dec %esi
    jnz .loop2
    ...



    For extremely wide cores, like Apple's M (modulo ISA), AMD Zen5 and
    Intel Lion Cove, I'd do the following modification to your inner loop
    (back in Intel syntax):

    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    adc edx,edx
    add rax,[r9+rcx*8]
    adc edx,0
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret


    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never
    incremen_edx:
    inc edx
    jmp edx_ready


    Less wide cores will likely benefit from reduction of the number of
    executed instructions (and more importantly the number of decoded and
    renamed instructions) through unrolling by 2, 3 or 4.


    Interesting code, not totally sure that I understand how the

    'ADC EDX,EDX'

    really works, i.e. shifting the previous contents up while saving the current carry.

    Anyway, the three main ADD RAX,... operations still define the minimum possible latency, right?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch,alt.folklore.computers on Tue Aug 5 15:41:29 2025
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Stephen Fuld wrote:
    On 8/4/2025 8:32 AM, John Ames wrote:
    snip

    This notion that the only advantage of a 64-bit architecture is a large
    address space is very curious to me. Obviously that's *one* advantage,
    but while I don't know the in-the-field history of heavy-duty business/
    scientific computing the way some folks here do, I have not gotten the
    impression that a lot of customers were commonly running up against the
    4 GB limit in the early '90s;

    Not exactly the same, but I recall an issue with Windows NT where it
    initially divided the 4GB address space in 2 GB for the OS, and 2GB for
    users. Some users were "running out of address space", so Microsoft
    came up with an option to reduce the OS space to 1 GB, thus allowing up
    to 3 GB for users. I am sure others here will know more details.

    Any program written to Microsoft/Windows spec would work transparently
    with a 3:1 split, the problem was all the programs ported from unix
    which assumed that any negative return value was a failure code.

    The only interfaces that I recall this being an issue for were
    mmap(2) and lseek(2). The latter was really related to maximum
    file size (although it applied to /dev/[k]mem and /proc/<pid>/mem
    as well). The former was handled by the standard specifying
    MAP_FAILED as the return value.

    That said, Unix generally defined -1 as the return value for all
    other system calls, and code that checked for "< 0" instead of
    -1 when calling a standard library function or system call was fundamentally broken.
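
    A short C illustration of that distinction (my sketch; MAP_ANONYMOUS is
    a widespread extension rather than strict POSIX): mmap() signals failure
    through MAP_FAILED, so a mapping that happens to land in the upper half
    of the address space is not an error.

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        void *p = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {               /* the specified error check */
            perror("mmap");
            return 1;
        }
        /* The historically broken check: if ((long)p < 0) ... */
        printf("mapped at %p\n", p);
        return 0;
    }
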

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Tue Aug 5 16:44:30 2025
    From Newsgroup: comp.arch

    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Aug 5 19:49:33 2025
    From Newsgroup: comp.arch

    On Tue, 5 Aug 2025 17:31:34 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 00:14:43 +0300
    Michael S <already5chosen@yahoo.com> wrote:

    On Mon, 4 Aug 2025 22:49:23 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Michael S <already5chosen@yahoo.com> writes:
    Actually, in our world the latest C standard (C23) has them,
    but the spelling is different: _BitInt(32) and unsigned
    _BitInt(32). I'm not sure if any major compiler already has
    them implemented. Bing copilot says that clang does, but I
    don't tend to believe eveything Bing copilot says.

    I asked godbolt, and tried the following program:

    typedef ump unsigned _BitInt(65535);

    The actual compiling version is:

    typedef unsigned _BitInt(65535) ump;

    ump sum3(ump a, ump b, ump c)
    {
    return a+b+c;
    }

    I would naively expect the ump type to be defined as an array of
    unsigned (byte/short/int/long), possibly with a header defining
    how large the allocation is and how many bits are currently
    defined.

    The actual code to add three of them could be something like

    xor rax,rax
    next:
    add rax,[rsi+rcx*8]
    adc rdx,0
    add rax,[r8+rcx*8]
    adc rdx,0
    add rax,[r9+rcx*8]
    adc rdx,0
    mov [rdi+rcx*8],rax
    mov rax,rdx
    inc rcx
    cmp rcx,r10
    jb next

    The main problem here is of course that every add operation
    depends on the previous, so max speed would be 4-5 clock
    cycles/iteration.

    Terje


    I would guess that even a pair of x86-style loops would likely be
    faster than that on most x86-64 processors made in last 15 years.
    Despite doing 1.5x more memory acceses.
    ; rcx = dst
    ; rdx = a - dst
    ; r8 = b - dst
    mov $1024, %esi
    clc
    .loop1:
    mov (%rcx,%r8), %rax
    adc (%rcx,%rdx), %rax
    mov %rax, (%rcx)
    lea 8(%rcx), %rcx
    dec %esi
    jnz .loop1

    sub $65536, %rcx
    mov ..., %rdx ; %rdx = c-dst
    mov $1024, %esi
    clc
    .loop2:
    mov (%rcx,%rdx), %rax
    adc %rax, (%rcx)
    lea 8(%rcx), %rcx
    dec %esi
    jnz .loop2
    ...



    For extremely wide cores, like Apple's M (modulo ISA), AMD Zen5 and
    Intel Lion Cove, I'd do the following modification to your inner
    loop (back in Intel syntax):

    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    adc edx,edx
    add rax,[r9+rcx*8]
    adc edx,0
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret


    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never
    incremen_edx:
    inc edx
    jmp edx_ready


    Less wide cores will likely benefit from reduction of the number of executed instructions (and more importantly the number of decoded
    and renamed instructions) through unrolling by 2, 3 or 4.


    Interesting code, not totally sure that I understand how the

    'ADC EDX,EDX'

    really works, i.e. shiftin previous contents up while saving the
    current carry.


    In this case 'adc edx,edx' is just a slightly shorter encoding
    of 'adc edx,0'. The EDX register is zeroed a few lines above.

    Anyway, the three main ADD RAX,... operations still define the
    minimum possible latency, right?


    I don't think so.
    It seems to me that there is only one chain of data dependencies
    between iterations of the loop - a trivial dependency through RCX. Some
    modern processors are already capable of eliminating this sort of
    dependency in the renamer. Probably not yet when it is coded as 'inc',
    but when it is coded as 'add' or 'lea'.

    The dependency through RDX/RBX does not form a chain. The next value
    of [rdi+rcx*8] does depend on the value of rbx from the previous
    iteration, but the next value of rbx depends only on [rsi+rcx*8],
    [r8+rcx*8] and [r9+rcx*8]. It does not depend on the previous value of
    rbx, except for a control dependency that hopefully would be speculated
    around.

    I haven't measured it yet, and haven't finished coding it either.
    But even when the code is finished, the widest processors I have right
    now are only Intel Raptor Cove (the P-core of the i7-14700) and AMD
    Zen3. I am afraid that neither is sufficiently wide to show the full
    effect of de-coupling the iterations.


    Terje



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch,alt.folklore.computers on Tue Aug 5 17:21:19 2025
    From Newsgroup: comp.arch

    In article <FWnkQ.830336$QtA1.728878@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <2025Aug3.185110@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    [snip]
    The C environment for DEC OSF/1 was an I32LP64 setup, not an ILP64
    setup, so can you really call it pure?

    In the OS kernel, often times you want to allocate physical
    address space below 4GiB for e.g. device BARs; many devices are
    either 32-bit (but have to work on 64-bit systems) or work
    better with 32-bit BARs.

    Indeed. Modern PCI controllers tend to support remapping
    a 64-bit physical address in the hardware to support devices
    that only advertise 32-bit bars[*]. The firmware (e.g. UEFI
    or BIOS) will setup the remapping registers and provide the
    address of the 64-bit aperture to the kernel via device tree
    or ACPI tables.

    [*] AHCI is the typical example, which uses BAR5.

    Yes; AHCI is an odd duck. They probably should have chosen BAR4
    for the ABAR and reserved 5; then they could have extended it to
    64-bit using the (BAR4, BAR5) pair.

    With the IOHC we have a lot more flexibility than we did
    previously.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Brian G. Lucas@bagel99@gmail.com to comp.arch on Tue Aug 5 13:03:09 2025
    From Newsgroup: comp.arch

    On 8/4/25 11:58 AM, Thomas Koenig wrote:
    Keith Thompson <Keith.S.Thompson+u@gmail.com> schrieb:

    In C17 and earlier, _BitInt is a reserved identifier. Any attempt to
    use it has undefined behavior. That's exactly why new keywords are
    often defined with that ugly syntax.

    Sometimes I think there is reason to Fortran's approach of not
    having defined keywords - old programs just continue to run, even
    with new statements or intrinsic procedures, maybe with an addition
    of an EXTERNAL statement.

    I agree. I designed my personal programming language <https://github.com/bagel99/esl> without reserved words.
    It is really not that hard to do in a compiler.

    brian

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Brian G. Lucas@bagel99@gmail.com to comp.arch,alt.folklore.computers on Tue Aug 5 13:04:39 2025
    From Newsgroup: comp.arch

    On 8/4/25 8:53 PM, Lawrence D'Oliveiro wrote:
    On Mon, 4 Aug 2025 17:18:24 -0500, BGB wrote:

    The ban on AT&T was the whole reason they released Unix freely.

    It was never really “freely” available.
    I'll say. We had to pay $20,000 for it in 1975. That was a lot
    of money for software on a mini-computer.


    Then when things lifted (after the AT&T break-up), they tried to
    re-assert their control over Unix, which backfired.

    They were already tightening things up from the Seventh Edition onwards -- remember, this version rescinded the permission to use the source code for classroom teaching purposes, neatly strangling the entire market for the legendary Lions Book. Which continued to spread afterwards via samizdat, nonetheless.

    And, they tried to make and release a workstation, but by then they
    were competing against the IBM PC Clone market (and also everyone
    else trying to sell Unix workstations at the time), ...

    That was a very successful market, from about the mid-1980s until the mid-to-latter 1990s. In spite of all the vendor lock-in and fragmentation, it managed to survive, I think, because of the sheer performance available in the RISC processors, which Microsoft tried to support with its new
    “Windows NT” OS, but was never able to get quite right.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch,alt.folklore.computers on Tue Aug 5 11:52:38 2025
    From Newsgroup: comp.arch

    On 8/4/2025 11:46 PM, Lawrence D'Oliveiro wrote:
    On Mon, 4 Aug 2025 18:07:48 +0300, Michael S wrote:

    Majority of the world is embedded. Overwhelming majority of embedded is
    32-bit or narrower.

    Embedded CPUs are mostly ARM, MIPS, RISC-V ... all of which are available
    in 64-bit variants.

    I recently looked this up and it confirmed my earlier information. Unfortunately, I can't find the reference. :-(

    The plurality of embedded systems are 8-bit processors - about 40
    percent of the total. They are largely used for things like industrial
    automation, Internet of Things, SCADA, kitchen appliances, etc. 16-bit
    parts account for a small, and shrinking, percentage. 32-bit is next
    (IIRC ~30-35%), but 64-bit is the fastest growing. Perhaps surprisingly,
    there is still a small market for 4-bit processors for things like TV
    remote controls, where battery life is more important than the highest
    performance.

    There is far more to the embedded market than phones and servers.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Aug 5 22:17:00 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Tue, 5 Aug 2025 17:31:34 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    In this case 'adc edx,edx' is just slightly shorter encoding
    of 'adc edx,0'. EDX register zeroize few lines above.

    OK, nice.

    Anyway, the three main ADD RAX,... operations still define the
    minimum possible latency, right?


    I don't think so.
    It seems to me that there is only one chains of data dependencies
    between iterations of the loop - a trivial dependency through RCX. Some modern processors are already capable to eliminate this sort of
    dependency in renamer. Probably not yet when it is coded as 'inc', but
    when coded as 'add' or 'lea'.

    The dependency through RDX/RBX does not form a chain. The next value
    of [rdi+rcx*8] does depend on value of rbx from previous iteration, but
    the next value of rbx depends only on [rsi+rcx*8], [r8+rcx*8] and
    [r9+rcx*8]. It does not depend on the previous value of rbx, except for control dependency that hopefully would be speculated around.

    I believe we are doing a bigint three-way add, so each result word
    depends on the three corresponding input words, plus any carries from
    the previous round.

    This is the carry chain that I don't see any obvious way to break...
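
    In C the chain looks like this (a minimal sketch of mine, using the
    GCC/Clang __int128 extension for the wide intermediate): the carry
    produced by word i is consumed by word i+1, which is the serial
    dependency in question.

    #include <stdint.h>
    #include <stddef.h>

    void add3_chained(uint64_t *dst, const uint64_t *a, const uint64_t *b,
                      const uint64_t *c, size_t n)
    {
        unsigned __int128 carry = 0;                  /* 0..2 */
        for (size_t i = 0; i < n; i++) {
            unsigned __int128 s =
                (unsigned __int128)a[i] + b[i] + c[i] + carry;
            dst[i] = (uint64_t)s;                     /* low 64 bits */
            carry  = s >> 64;                         /* feeds the next word */
        }
    }
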

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Tue Aug 5 20:34:27 2025
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    I can understand why DEC abandoned VAX: already in 1985 they
    had some disadvantage and they saw no way to compete against >>>>superscalar machines which were on the horizon. In 1985 they
    probably realized, that their features add no value in world
    using optimizing compilers.

    Optimizing compilers increase the advantages of RISCs, but even with a
    simple compiler Berkeley RISC II (which was made by hardware people,
    not compiler people) has between 85% and 256% of VAX (11/780) speed.
    It also has 16-bit and 32-bit instructions for improved code density
    and (apparently from memory bandwidth issues) performance.

    The basic question is if VAX could afford the pipeline. VAX had
    rather complex memory and bus interface, cache added complexity
    too. Ditching microcode could allow more resources for execution
    path. Clearly VAX could afford and probably had 1-cycle 32-bit
    ALU. I doubt that they could afford 1-cycle multiply or
    even a barrel shifter. So they needed a seqencer for sane
    assembly programming. I am not sure what technolgy they used
    for register file. For me most likely is fast RAM, but that
    normally would give 1 R/W port. Multiported register file
    probably would need a lot of separate register chips and
    multiplexer. Alternatively, they could try some very fast
    RAM and run it at multiple of base clock frequency (66 ns
    cycle time caches were available at that time, so 3 ports
    via multiplexing seem possible). But any of this adds
    considerable complexity. Sane pipeline needs interlocks
    and forwarding.

    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Using the terminology of the late seventies, the VAX was a mixture of
    SSI, MSI and LSI chips. I am not sure if the VAX used them, but there
    were 4-bit TTL ALU chips; 8 such chips would give a 32-bit ALU
    (for better speed one would add carry-propagation chips,
    which would increase the chip count).

    Probably only memory used LSI chips. That could add a bias toward
    microcode: microcode used the densest MOS chips (memory) and
    replaced less dense random TTL logic. After the switch to CMOS,
    on-chip logic was more comparable to memory, so the balance
    shifted.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch,alt.folklore.computers on Tue Aug 5 21:01:20 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:
    On Mon, 4 Aug 2025 20:13:54 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    My guess would be that, with DEC, you would have the least chance of
    convincing corporate brass of your ideas. With Data General, you
    could try appealing to the CEO's personal history of creating the
    Nova, and thus his vanity. That could work. But your own company
    might actually be the best choice, if you can get the venture
    capital funding.


    Why not go to somebody who has money and interest to build
    microprocessor, but no existing mini/mainframe/SuperC buisness?
    If we limit ourselves to USA then Moto, Intel, AMD, NatSemi...
    May be, even AT&T ? Or was AT&T stil banned from making computers in
    the mid 70s?

    To be efficient, a RISC needs a full-width (presumably 32 bit)
    external data bus, plus a separate address bus, which should at
    least be 26 bits, better 32. A random ARM CPU I looked at at
    bitsavers had 84 pins, which sounds reasonable.

    Building an ARM-like instead of a 68000 would have been feasible,
    but the resulting systems would have been more expensive (the
    68000 had 64 pins).

    So... a strategy could have been to establish the concept with
    minicomputers, to make money (the VAX sold big) and then move
    aggressively towards microprocessors, trying the disruptive move
    towards workstations within the same company (which would be HARD).

    As for the PC - a scaled-down, cheap, compatible, multi-cycle per
    instruction microprocessor could have worked for that market,
    but it is entirely unclear to me what this would / could
    have done to the PC market, if IBM could have been prevented
    from gaining such market dominance.

    A bit like the /360 strategy, offering a wide range of machines
    (or CPUs and systems) with different performance.

    Might have worked, might have ended as a footnote in the
    minicomputer history. As with all pieces of alternate
    history, we'll never know.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Kaz Kylheku@643-408-1753@kylheku.com to comp.arch,comp.lang.c on Tue Aug 5 21:08:53 2025
    From Newsgroup: comp.arch

    On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 04 Aug 2025 09:53:51 -0700
    Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    In C17 and earlier, _BitInt is a reserved identifier. Any attempt to
    use it has undefined behavior. That's exactly why new keywords are
    often defined with that ugly syntax.

    That is language lawyer's type of reasoning. Normally gcc maintainers
    are wiser than that because, well, by chance gcc happens to be widely
    used production compiler. I don't know why this time they had chosen
    less conservative road.

    They invented an identifier which lands in the _[A-Z].* namespace
    designated as reserved by the standard.

    What would be an example of a more conservative way to name the
    identifier?
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Kaz Kylheku@643-408-1753@kylheku.com to comp.arch,comp.lang.c on Tue Aug 5 21:13:50 2025
    From Newsgroup: comp.arch

    On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 4 Aug 2025 15:25:54 -0400
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    On 2025-08-04 15:03, Michael S wrote:
    On Mon, 04 Aug 2025 09:53:51 -0700
    Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    ...
    In C17 and earlier, _BitInt is a reserved identifier. Any attempt
    to use it has undefined behavior. That's exactly why new keywords
    are often defined with that ugly syntax.


    That is language lawyer's type of reasoning. Normally gcc
    maintainers are wiser than that because, well, by chance gcc
    happens to be widely used production compiler. I don't know why
    this time they had chosen less conservative road.

    If _BitInt is accepted by older versions of gcc, that means it was
    supported as a fully-conforming extension to C. Allowing
    implementations to support extensions in a fully-conforming manner is
    one of the main purposes for which the standard reserves identifiers.
    If you thought that gcc was too conservative to support extensions,
    you must be thinking of the wrong organization.


    I know that gcc supports extensions.
    I also know that gcc didn't support *this particular extension* up
    until quite recently.

    I think what James means is that GCC supports, as an extension,
    the use of any _[A-Z].* identifier whatsoever that it has not claimed
    for its purposes.

    (I don't know that to be true; an extension has to be documented other
    than by omission. But anyway, if the GCC documentation says somewhere
    something like, "no other identifier is reserved in this version of
    GCC", then it means that the remaining portions of the reserved
    namespaces are available to the program. Since it is undefined behavior
    to use those identifiers (or in certain ways in certain circumstances,
    as the case may be), being able to use them with the documentation's
    blessing constitutes use of a documented extension.)

    I would guess, up until this calendar year.
    Introducing new extension without way to disable it is different from supporting gradually introduced extensions, typically with names that
    start by double underscore and often starting with __builtin.

    __builtin is also in a standard-defined reserved namespace: the double
    underscore namespace. It is no more or less conservative to name
    something __bitInt than _BitInt.
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Aug 6 00:21:25 2025
    From Newsgroup: comp.arch

    On Tue, 5 Aug 2025 22:17:00 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 17:31:34 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    In this case 'adc edx,edx' is just slightly shorter encoding
    of 'adc edx,0'. EDX register zeroize few lines above.

    OK, nice.

    BTW, it seems that in your code fragment above you forgot to zeroize EDX
    at the beginning of the iteration. Or am I missing something?


    Anyway, the three main ADD RAX,... operations still define the
    minimum possible latency, right?


    I don't think so.
    It seems to me that there is only one chains of data dependencies
    between iterations of the loop - a trivial dependency through RCX.
    Some modern processors are already capable to eliminate this sort of dependency in renamer. Probably not yet when it is coded as 'inc',
    but when coded as 'add' or 'lea'.

    The dependency through RDX/RBX does not form a chain. The next value
    of [rdi+rcx*8] does depend on value of rbx from previous iteration,
    but the next value of rbx depends only on [rsi+rcx*8], [r8+rcx*8]
    and [r9+rcx*8]. It does not depend on the previous value of rbx,
    except for control dependency that hopefully would be speculated
    around.

    I believe we are doing a bigint thre-way add, so each result word
    depends on the three corresponding input words, plus any carries from
    the previous round.

    This is the carry chain that I don't see any obvious way to break...

    Terje



    You break the chain by *predicting* that
    carry[i] = CARRY(a[i]+b[i]+c[i]+carry[i-1]) is equal to
    CARRY(a[i]+b[i]+c[i]). If the prediction turns out to be wrong then you
    pay the heavy price of a branch misprediction. But outside of specially
    crafted inputs that is extremely rare.
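
    Roughly the same idea in C (my paraphrase, not Michael's exact code):
    count the carries that a[i]+b[i]+c[i] generates on its own, fold the
    incoming carry in separately, and leave the "incoming carry itself
    overflowed" case to a branch that is almost never taken.

    #include <stdint.h>
    #include <stddef.h>

    void add3_decoupled(uint64_t *dst, const uint64_t *a, const uint64_t *b,
                        const uint64_t *c, size_t n)
    {
        uint64_t carry_in = 0;                /* carries from previous word */
        for (size_t i = 0; i < n; i++) {
            uint64_t carry_out = 0;
            uint64_t s = a[i] + b[i];
            carry_out += (s < a[i]);          /* carry out of a+b */
            uint64_t t = s + c[i];
            carry_out += (t < s);             /* carry out of +c */
            uint64_t r = t + carry_in;
            if (r < t)                        /* rarely taken: the incoming */
                carry_out++;                  /* carry itself overflowed */
            dst[i] = r;
            carry_in = carry_out;
        }
    }
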







    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Kaz Kylheku@643-408-1753@kylheku.com to comp.arch,comp.lang.c on Tue Aug 5 21:25:17 2025
    From Newsgroup: comp.arch

    On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Breaking existing code that uses "_BitInt" as an identifier is
    a non-issue. There very probably is no such code.

    However, that doesn't mean GCC can carelessly introduce identifiers
    in this namespace.

    GCC does not define a complete C implementation; it doesn't provide a
    library. Libraries are provided by other projects: Glibc, Musl,
    ucLibc, ...

    Those libraries are C implementors also, and get to name things
    in the reserved namespace.

    It would be unthinkable for GCC to introduce, say, an extension
    using the identifier __libc_malloc.

    In addition to libraries, if some other important project that serves as
    a base package in many distributions happens to claim identifiers in
    those spaces, it wouldn't be wise for GCC (or the C libraries) to start
    taking them away.

    You can't just rename the identifier out of the way in the offending
    package, because that only fixes the issue going forward. Older versions
    of the package can't be compiled with the new compiler without a patch. Compiling older things with newer GCC happens.

    There are always the questions:

    1. Is there an issue? Is anything broken?

    2. If so, is what is broken important such that it becomes a showstopper
    if the compiler change is rolled out (major distros are on fire?)
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch on Tue Aug 5 17:41:30 2025
    From Newsgroup: comp.arch

    On Tue, 5 Aug 2025 05:48:16 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    Waldek Hebisch <antispam@fricas.org> schrieb:
    I am not sure what technolgy they used
    for register file. For me most likely is fast RAM, but that
    normally would give 1 R/W port.

    They used fast SRAM and had three copies of their registers,
    for 2R1W.


    I did use 11/780, 8600, and briefly even MicroVax - but I'm primarily
    a software person, so please forgive this stupid question.


    Why three copies?
    Also did you mean 3 total? Or 3 additional copies (4 total)?


    Given 1 R/W port each I can see needing a pair to handle cases where destination is also a source (including autoincrement modes). But I
    don't see a need ever to sync them - you just keep track of which was
    updated most recently, read that one and - if applicable - write the
    other and toggle.

    Since (at least) the early models evaluated operands sequentially,
    there doesn't seem to be a need for more. Later models had some
    semblance of pipeline, but it seems that if the /same/ value was
    needed multiple times, it could be routed internally to all users
    without requiring additional reads of the source.

    Or do I completely misunderstand? [Definitely possible.]
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Wed Aug 6 00:49:21 2025
    From Newsgroup: comp.arch

    On Tue, 5 Aug 2025 17:24:34 +0200, Terje Mathisen wrote:

    ... the problem was all the programs ported from unix which assumed
    that any negative return value was a failure code.

    If the POSIX API spec says a negative return for a particular call is an error, then a negative return for that particular call is an error.

    I can’t imagine this kind of thing blithely being carried over to any non-POSIX API calls.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Wed Aug 6 00:59:07 2025
    From Newsgroup: comp.arch

    On Tue, 5 Aug 2025 21:01:20 -0000 (UTC), Thomas Koenig wrote:

    So... a strategy could have been to establish the concept with
    minicomputers, to make money (the VAX sold big) and then move
    aggressively towards microprocessors, trying the disruptive move towards workstations within the same company (which would be HARD).

    None of the companies which tried to move in that direction were
    successful. The mass micro market had much higher volumes and lower
    margins, and those accustomed to lower-volume, higher-margin operation
    simply couldn’t adapt.

    As for the PC - a scaled-down, cheap, compatible, multi-cycle per
    instruction microprocessor could have worked for that market,
    but it is entirely unclear to me what this would / could have done to
    the PC market, if IBM could have been prevented from gaining such market dominance.

    IBM had massive marketing clout in the mainframe market. I think that was
    the basis on which customers gravitated to their products. And remember,
    the IBM PC was essentially a skunkworks project that totally went against
    the entire IBM ethos. Internally, it was seen as a one-off mistake that
    they determined never to repeat. Hence the PS/2 range.

    DEC was bigger in the minicomputer market. If DEC could have offered an open-standard machine, that could have offered serious competition to IBM.
    But what OS would they have used? They were still dominated by Unix-haters then.

    A bit like the /360 strategy, offering a wide range of machines (or CPUs
    and systems) with different performance.

    That strategy was radical in 1964, less so by the 1970s and 1980s. DEC,
    for example, offered entire ranges of machines in each of its various minicomputer families.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.arch,comp.lang.c on Tue Aug 5 19:14:48 2025
    From Newsgroup: comp.arch

    Kaz Kylheku <643-408-1753@kylheku.com> writes:
    On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Breaking existing code that uses "_BitInt" as an identifier is
    a non-issue. There very probably is no such code.

    However, that doesn't mean GCC can carelessly introduce identifiers
    in this namespace.

    Agreed -- and gcc did not do that in this case. I was referring to
    _BitInt, not to other identifiers in the reserved namespace.

    Do you have any reason to believe that gcc's use of _BitInt will break
    any existing code? My best guess is that there is no such code, that
    the only real world uses of the name _BitInt are deliberate uses of the
    new C23 feature, and that gcc's support of _BitInt in non-C23 mode
    will not break anything.

    It is of course possible that I'm wrong.

    If the name _BitInt did break (non-portable) existing C code, then the
    fault would lie with the C committee, not with the gcc maintainers.
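    As a contrived illustration, the only kind of code the new keyword
    rejects is something like the following, and it typically fails loudly
    at compile time rather than changing behavior silently:

    #include <stdio.h>

    int main(void)
    {
        /* _BitInt was always in the reserved namespace; once it is a
           keyword, this declaration no longer compiles. */
        int _BitInt = 42;
        printf("%d\n", _BitInt);
        return 0;
    }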
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Peter Flass@Peter@Iron-Spring.com to comp.arch,alt.folklore.computers on Tue Aug 5 20:15:11 2025
    From Newsgroup: comp.arch

    On 8/5/25 17:59, Lawrence D'Oliveiro wrote:
    On Tue, 5 Aug 2025 21:01:20 -0000 (UTC), Thomas Koenig wrote:

    So... a strategy could have been to establish the concept with
    minicomputers, to make money (the VAX sold big) and then move
    aggressively towards microprocessors, trying the disruptive move towards
    workstations within the same company (which would be HARD).

    None of the companies which tried to move in that direction were
    successful. The mass micro market had much higher volumes and lower
    margins, and those accustomed to lower-volume, higher-margin operation
    simply couldn’t adapt.

    The support issues alone were killers. Think about the
    Orange/Grey/(Blue?) Wall of VAX documentation, and then look at the
    five-page flimsy you got with a micro. The customers were willing to
    accept cr*p from a small startup, but wouldn't put up with it from IBM
    or DEC.


    As for the PC - a scaled-down, cheap, compatible, multi-cycle per
    instruction microprocessor could have worked for that market,
    but it is entirely unclear to me what this would / could have done to
    the PC market, if IBM could have been prevented from gaining such market
    dominance.

    IBM had massive marketing clout in the mainframe market. I think that was
    the basis on which customers gravitated to their products. And remember,
    the IBM PC was essentially a skunkworks project that totally went against
    the entire IBM ethos. Internally, it was seen as a one-off mistake that
    they determined never to repeat. Hence the PS/2 range.

    DEC was bigger in the minicomputer market. If DEC could have offered an open-standard machine, that could have offered serious competition to IBM. But what OS would they have used? They were still dominated by Unix-haters then.

    VMS was a heckuva good OS.


    A bit like the /360 strategy, offering a wide range of machines (or CPUs
    and systems) with different performance.

    That strategy was radical in 1964, less so by the 1970s and 1980s. DEC,
    for example, offered entire ranges of machines in each of its various minicomputer families.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Kaz Kylheku@643-408-1753@kylheku.com to comp.arch,comp.lang.c on Wed Aug 6 04:31:59 2025
    From Newsgroup: comp.arch

    On 2025-08-06, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Kaz Kylheku <643-408-1753@kylheku.com> writes:
    On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Breaking existing code that uses "_BitInt" as an identifier is
    a non-issue. There very probably is no such code.

    However, that doesn't mean GCC can carelessly introduce identifiers
    in this namespace.

    Agreed -- and gcc did not do that in this case. I was referring to _BitInt, not to other identifiers in the reserved namespace.

    Do you have any reason to believe that gcc's use of _BitInt will break
    any existing code?

    It has landed, and we don't hear reports that the sky is falling.

    If it does break someone's obscure project with few users, unless that
    person makes a lot of noise in some forums I read, I will never know.

    My position has always been to think about the threat of real,
    or at least probable clashes.

    I can turn it around: I have not heard of any compiler or library using _CreamPuff as an identifier, or of a compiler which misbehaves when a
    program uses it, on grounds of it being undefined behavior. Someone
    using _CreamPuff in their code is taking a risk that is vanishingly
    small, the same way that introducing _BitInt is a risk that is
    vanishingly small.

    In fact, in some sense the risk is smaller because the audience of
    programs facing an implementation (or language) that has introduced some identifier is vastly larger than the audience of implementations that a
    given program will face that has introduced some funny identifier.
    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch,alt.folklore.computers on Wed Aug 6 05:37:32 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    The plurality of embedded systems are 8 bit processors - about 40
    percent of the total. They are largely used for things like industrial automation, Internet of Things, SCADA, kitchen appliances, etc.

    I believe heart pacemakers run with a 6502 (well, a 65C02).

    16 bit processors
    account for a small, and shrinking, percentage. 32 bit is next (IIRC ~30-35%), but 64 bit is the fastest growing. Perhaps surprisingly, there
    is still a small market for 4 bit processors for things like TV remote controls, where battery life is more important than the highest performance.

    There is far more to the embedded market than phones and servers.

    Also, the processors which run in earphones etc...

    Does anybody have an estimate how many CPUs humanity has made
    so far?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch,alt.folklore.computers on Wed Aug 6 05:50:11 2025
    From Newsgroup: comp.arch

    Peter Flass <Peter@Iron-Spring.com> schrieb:

    The support issues alone were killers. Think about the
    Orange/Grey/(Blue?) Wall of VAX documentation, and then look at the five-page flimsy you got with a micro. The customers were willing to
    accept cr*p from a small startup, but wouldn't put up with it from IBM
    or DEC.

    Using UNIX faced stiff competition from AT&T's internal IT people,
    who wanted to run DEC's operating systems on all PDP-11 within
    the company (basically, they wanted to kill UNIX). They pointed
    towards the large amount of documentation that DEC provided, compared
    to the small amount for UNIX, as proof of superiority. The UNIX people
    saw it differently...

    But the _real_ killer application for UNIX wasn't writing patents,
    it was phototypesetting speeches for the CEO of AT&T, who, for
    reasons of vanity, did not want to wear glasses, and it was possible
    to scale the output of the phototypesetter so he would be able
    to read them.

    After somebody pointed out that having confidential speeches on
    one of the most well-known machines in the world, where loads of
    people had dial-up access, was not a good idea, his secretary got
    her own PDP-11 for that.

    And with support from that high up, the project flourished.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Aug 6 05:53:22 2025
    From Newsgroup: comp.arch

    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" doesn't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Wed Aug 6 06:20:57 2025
    From Newsgroup: comp.arch

    On Wed, 6 Aug 2025 05:37:32 -0000 (UTC), Thomas Koenig wrote:

    Does anybody have an estimate how many CPUs humanity has made so far?

    More ARM chips are made each year than the entire population of the Earth.

    I think RISC-V has also achieved that status.

    Where are they all going??
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Wed Aug 6 07:28:52 2025
    From Newsgroup: comp.arch

    On Wed, 6 Aug 2025 05:50:11 -0000 (UTC), Thomas Koenig wrote:

    Using UNIX faced stiff competition from AT&T's internal IT people, who
    wanted to run DEC's operating systems on all PDP-11 within the company (basically, they wanted to kill UNIX).

    But because AT&T controlled Unix, they were able to mould it like putty to their own uses. E.g. look at the MERT project which supported real-time
    tasks (as needed in telephone exchanges) besides conventional Unix ones.
    No way they could do this with an outside proprietary system, like those
    from DEC.

    AT&T also created its own hardware (the 3B range) to complement the
    software in serving those high-availability needs.

    But the _real_ killer application for UNIX wasn't writing patents, it
    was phototypesetting speeches for the CEO of AT&T, who, for reasons of vanity, did not want to wear glasses, and it was possible to scale the
    output of the phototoypesetter so he would be able to read them.

    Heck, no. The biggest use for the Unix documentation tools was in the
    legal department, writing up patent applications. troff was just about the only software around that could do automatic line-numbering, which was
    crucial for this purpose.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch,comp.lang.c on Wed Aug 6 11:48:09 2025
    From Newsgroup: comp.arch

    On Wed, 6 Aug 2025 04:31:59 -0000 (UTC)
    Kaz Kylheku <643-408-1753@kylheku.com> wrote:

    On 2025-08-06, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Kaz Kylheku <643-408-1753@kylheku.com> writes:
    On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com>
    wrote:
    Breaking existing code that uses "_BitInt" as an identifier is
    a non-issue. There very probably is no such code.

    However, that doesn't mean GCC can carelessly introduce identifiers
    in this namespace.

    Agreed -- and gcc did not do that in this case. I was referring
    to _BitInt, not to other identifiers in the reserved namespace.

    Do you have any reason to believe that gcc's use of _BitInt will
    break any existing code?

    It has landed, and we don't hear reports that the sky is falling.

    If it does break someone's obscure project with few users, unless that
    person makes a lot of noise in some forums I read, I will never know.


    Exactly.
    The world is a very big place. Even nowadays it is not completely
    transparent. Even those parts that are publicly visible in theory have not necessarily been observed recently by any single person, even
    if the person in question is Keith.
    Besides, according to my understanding, the majority of gcc users haven't
    yet migrated to gcc 14 or 15.

    My position has always been to think about the threat of real,
    or at least probable clashes.

    I can turn it around: I have not heard of any compiler or library
    using _CreamPuff as an identifier, or of a compiler which misbehaves
    when a program uses it, on grounds of it being undefined behavior.
    Someone using _CreamPuff in their code is taking a risk that is
    vanishingly small, the same way that introducing _BitInt is a risk
    that is vanishingly small.

    In fact, in some sense the risk is smaller because the audience of
    programs facing an implementation (or language) that has introduced
    some identifier is vastly larger than the audience of implementations
    that a given program will face that has introduced some funny
    identifier.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 6 10:24:49 2025
    From Newsgroup: comp.arch

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Of all the major OSes for Alpha, Windows NT was the only one
    that couldn’t take advantage of the 64-bit architecture.

    Actually, Windows took good advantage of the 64-bit architecture:
    "64-bit Windows was initially developed on the Alpha AXP." <https://learn.microsoft.com/en-us/previous-versions/technet-magazine/cc718978(v=msdn.10)>

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch,alt.folklore.computers on Wed Aug 6 10:48:51 2025
    From Newsgroup: comp.arch

    In article <106uqej$36gll$3@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Peter Flass <Peter@Iron-Spring.com> schrieb:

    The support issues alone were killers. Think about the
    Orange/Grey/(Blue?) Wall of VAX documentation, and then look at the
    five-page flimsy you got with a micro. The customers were willing to
    accept cr*p from a small startup, but wouldn't put up with it from IBM
    or DEC.

    Using UNIX faced stiff competition from AT&T's internal IT people,
    who wanted to run DEC's operating systems on all PDP-11 within
    the company (basically, they wanted to kill UNIX). They pointed
    towards the large amount of documentation that DEC provided, compared
    to the small amount for UNIX, as proof of superiority. The UNIX people
    saw it differently...

    I've never heard this before, and I do not believe that it is
    true. Do you have a source?

    Bell Telephone's computer center was basically an IBM shop
    before Unix was written, having written BESYS for the IBM 704,
    for instance. They made investments in GE machines around the
    time of the Multics project (e.g., they had a GE 645 and at
    least one 635). The PDP-11 used for Unix was so new that they
    had to wait a few weeks for its disk to arrive.

    Unix escaped out of research, and into the larger Bell System,
    via the legal department, as has been retold many times. It
    spread widely internally after that. After divestiture, when
    AT&T was freed to be able to compete in the computer industry,
    it was seen as a strategic asset.

    But the _real_ killer application for UNIX wasn't writing patents,
    it was phototypesetting speeches for the CEO of AT&T, who, for
    reasons of vanity, did not want to wear glasses, and it was possible
    to scale the output of the phototypesetter so he would be able
    to read them.

    After somebody pointed out that having confidential speeches on
    one of the most well-known machines in the world, where loads of
    people had dial-up access, was not a good idea, his secretary got
    her own PDP-11 for that.

    And with support from that high up, the project flourished.

    While it is true that Charlie Brown's office got a Unix system
    of their own to run troff because its output scaled to large
    sizes, the speeches weren't the data they were worried about
    protecting: those were records from AT&T board meetings.

    At the time, the research PDP-11 used for Unix at Bell Labs was
    not one of the, "most well-known machines in the world, where
    loads of people had dial-up access" in any sense; in the grand
    scheme of things, it was pretty obscure, and had a few dozen
    users. But it was a machine where most users had "root" access,
    and it was agreed that these documents shouldn't be on the
    research machine out of concern for confidentiality.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 6 10:32:39 2025
    From Newsgroup: comp.arch

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Not aware of any platforms that do/did ILP64.

    AFAIK the Cray-1 (1976) was the first 64-bit machine, and C for the
    Cray-1 and successors implemented, as far as I can determine

    type        bits
    char           8
    short int     64
    int           64
    long int      64
    pointer       64

    ILP64 for Cray is documented in <https://en.cppreference.com/w/c/language/arithmetic_types.html>. For
    short int, I don't have a direct reference, only the statement

    |Firstly there was the word size, one rather large size fitted all,
    |integers and floats were represented in 64 bits

    <https://cray-history.net/faq-1-cray-supercomputer-families/faq-3/>

    For the 8-bit characters I found a reference (maybe somewhere else in
    that document), but I do not find it at the moment.

    Followups set to comp.arch.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Wed Aug 6 11:10:46 2025
    From Newsgroup: comp.arch

    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" doesn't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.

    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 6 11:05:30 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    If 'int' were 64-bits, then what about 16 and/or 32 bit types.
    short short?
    long short?

    Of course int16_t uint16_t int32_t uint32_t

    On what keywords should these types be based? That's up to the
    implementor. In C23 one could

    typedef signed _BitInt(16) int16_t;

    etc. Around 1990, one would have just followed the example of "long
    long" of accumulating several modifiers. I would go for 16-bit
    "short" and 32-bit "long short".

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 6 11:28:45 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    counter-argument to ILP64, where the more natural alternative is LP64.

    I am curious what makes you think that I32LP64 is "more natural",
    given that C is a human creation.

    ILP64 is more consistent with the historic use of int: int is the
    integer type corresponding to the unnamed single type of B
    (predecessor of C), which was used for both integers and pointers.
    You can see that in various parts of C, e.g., in the integer type
    promotion rules (all integers are promoted at least to int in any
    case, beyond that only when another bigger integer is involved).
    Another example is

    main(argc, argv)
    char *argv[];
    {
        return 0;
    }

    Here the return type of main() defaults to int, and the type of argc
    defaults to int.

    As a consequence, one should be able to cast int->pointer->int and pointer->int->pointer without loss. That's not the case with I32LP64.
    It is the case for ILP64.
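    As a small, contrived illustration (whether the round trip survives
    depends on where the object happens to lie in the address space):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        int x = 0;
        void *p = &x;

        int   through_int = (int)(intptr_t)p;        /* I32LP64: may truncate */
        void *back = (void *)(intptr_t)through_int;

        printf("sizeof(int)=%zu sizeof(void *)=%zu, round trip %s\n",
               sizeof(int), sizeof(void *),
               back == p ? "survived" : "lost the high bits");
        return 0;
    }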

    Some people conspired in 1992 to set the de-facto standard, and made
    the mistake of deciding on I32LP64 <https://queue.acm.org/detail.cfm?id=1165766>, and we have paid for
    this mistake ever since, one way or the other.

    E.g., the designers of ARM A64 included addressing modes for using
    32-bit indices (but not 16-bit indices) into arrays. The designers of
    RV64G added several sign-extending 32-bit instructions (ending in
    "W"), but not corresponding instructions for 16-bit operations. The
    RISC-V manual justifies this with

    |A few new instructions (ADD[I]W/SUBW/SxxW) are required for addition
    |and shifts to ensure reasonable performance for 32-bit values.

    Why were 32-bit indices and 32-bit operations more important than
    16-bit indices and 16-bit operations? Because with 32-bit int, every
    integer type is automatically promoted to at least 32 bits.

    Likewise, with ILP64 the size of integers in computations would always
    be 64 bits, and many scalar variables (of type int and unsigned) would
    also be 64 bits. As a result, 32-bit indices and 32-bit operations
    would be rare enough that including these addressing modes and
    instructions would not be justified.

    But, you might say, what about memory usage? We would use int32_t
    where appropriate in big arrays and in fields of structs/classes with
    many instances. We would access these array elements and fields with
    LW/SW on RV64G and the corresponding instructions on ARM A64, no need
    for the addressing modes and instructions mentioned above.

    So the addressing mode bloat of ARM A64 and the instruction set bloat
    of RV64G that I mentioned above is courtesy of I32LP64.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch,alt.folklore.computers on Wed Aug 6 13:48:17 2025
    From Newsgroup: comp.arch

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Tue, 5 Aug 2025 17:24:34 +0200, Terje Mathisen wrote:

    ... the problem was all the programs ported from unix which assumed
    that any negative return value was a failure code.

    If the POSIX API spec says a negative return for a particular call is an error, then a negative return for that particular call is an error.

    Please find a single POSIX API that says a negative return is an error.

    You won't have much success. POSIX explicitly states in most
    cases that the API returns -1 on error (mmap returns MAP_FAILED,
    which happens to be -1 on most implementations; regardless a
    POSIX application _must_ check for MAP_FAILED, not a negative
    return value).
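    A minimal example of the correct check versus the bogus one
    (MAP_ANONYMOUS is a common extension, not core POSIX):

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED) {          /* the check POSIX actually specifies */
            perror("mmap");
            return 1;
        }
        if ((intptr_t)p < 0)            /* the bogus "negative means error" test */
            puts("valid mapping misdiagnosed as a failure");

        munmap(p, 4096);
        return 0;
    }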

    More misinformation from LDO.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Aug 6 16:19:11 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Tue, 5 Aug 2025 22:17:00 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 17:31:34 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    In this case 'adc edx,edx' is just a slightly shorter encoding
    of 'adc edx,0'. The EDX register is zeroized a few lines above.

    OK, nice.

    BTW, it seems that in your code fragment above you forgot to zeroize EDX
    at the beginning of the iteration. Or am I missing something?

    No, you are not. I skipped pretty much all the setup code. :-)


    Anyway, the three main ADD RAX,... operations still define the
    minimum possible latency, right?


    I don't think so.
    It seems to me that there is only one chain of data dependencies
    between iterations of the loop - a trivial dependency through RCX.
    Some modern processors are already capable of eliminating this sort of
    dependency in the renamer. Probably not yet when it is coded as 'inc',
    but when coded as 'add' or 'lea'.

    The dependency through RDX/RBX does not form a chain. The next value
    of [rdi+rcx*8] does depend on value of rbx from previous iteration,
    but the next value of rbx depends only on [rsi+rcx*8], [r8+rcx*8]
    and [r9+rcx*8]. It does not depend on the previous value of rbx,
    except for control dependency that hopefully would be speculated
    around.

    I believe we are doing a bigint three-way add, so each result word
    depends on the three corresponding input words, plus any carries from
    the previous round.

    This is the carry chain that I don't see any obvious way to break...


    You break the chain by *predicting* that
    carry[i] = CARRY(a[i]+b[i]+c[i]+carry[i-1]) is equal to
    CARRY(a[i]+b[i]+c[i]). If the prediction turns out wrong then you pay a
    heavy price of branch misprediction. But outside of specially crafted
    inputs it is extremely rare.
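    In C, the trick might be sketched like this (illustrative names; real
    code would use adc plus a conditional branch so the branch predictor
    can hide the correction):

    #include <stddef.h>
    #include <stdint.h>

    /* dst = a + b + c over n 64-bit limbs; the carry handed to the next
       limb is computed from a[i]+b[i]+c[i] alone, and the incoming carry
       only affects it through a rarely-taken correction. */
    static void add3_predicted(uint64_t *dst, const uint64_t *a,
                               const uint64_t *b, const uint64_t *c, size_t n)
    {
        unsigned carry_in = 0;                /* 0..2 */
        for (size_t i = 0; i < n; i++) {
            uint64_t s = a[i] + b[i];
            unsigned k = (s < a[i]);          /* carry out of a+b       */
            uint64_t t = s + c[i];
            k += (t < s);                     /* carry out of (a+b)+c   */

            uint64_t r = t + carry_in;        /* fold in incoming carry  */
            if (r < t)                        /* rare: carry rippled out */
                k += 1;

            dst[i] = r;
            carry_in = k;
        }
    }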

    Aha!

    That's _very_ nice.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 6 10:23:26 2025
    From Newsgroup: comp.arch

    George Neuner wrote:
    On Tue, 5 Aug 2025 05:48:16 -0000 (UTC), Thomas Koenig <tkoenig@netcologne.de> wrote:

    Waldek Hebisch <antispam@fricas.org> schrieb:
    I am not sure what technology they used
    for register file. For me most likely is fast RAM, but that
    normally would give 1 R/W port.
    They used fast SRAM and had three copies of their registers,
    for 2R1W.


    I did use 11/780, 8600, and briefly even MicroVax - but I'm primarily
    a software person, so please forgive this stupid question.


    Why three copies?
    Also did you mean 3 total? Or 3 additional copies (4 total)?


    Given 1 R/W port each I can see needing a pair to handle cases where destination is also a source (including autoincrement modes). But I
    don't see a need ever to sync them - you just keep track of which was
    updated most recently, read that one and - if applicable - write the
    other and toggle.

    Since (at least) the early models evaluated operands sequentially,
    there doesn't seem to be a need for more. Later models had some
    semblance of pipeline, but it seems that if the /same/ value was
    needed multiple times, it could be routed internally to all users
    without requiring additional reads of the source.

    Or do I completely misunderstand? [Definitely possible.]

    To make a 2R 1W port reg file from a single port SRAM you use two banks
    which can be addressed separately during the read phase at the start of
    the clock phase, and at the end of the clock phase you write both banks
    at the same time on the same port number.
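    In a toy C model, that scheme looks like this (bank and register names
    here are illustrative, not taken from DEC schematics):

    #include <stdint.h>

    /* Two 1RW SRAM banks acting as one 2R1W register file: reads may use
       different addresses per bank, the single write updates both banks
       so they always hold identical contents. */
    typedef struct {
        uint32_t bank_a[16];
        uint32_t bank_b[16];
    } regfile_2r1w;

    static void rf_read(const regfile_2r1w *rf, unsigned ra, unsigned rb,
                        uint32_t *va, uint32_t *vb)
    {
        *va = rf->bank_a[ra];   /* read port A */
        *vb = rf->bank_b[rb];   /* read port B, independent address */
    }

    static void rf_write(regfile_2r1w *rf, unsigned rw, uint32_t v)
    {
        rf->bank_a[rw] = v;     /* one write, broadcast to both banks */
        rf->bank_b[rw] = v;
    }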

    The 780 wiring parts list shows Nat Semi 85S68 which are
    16*4b 1RW port, 40 ns access SRAMS, tri-state output,
    with latched read output to eliminate data race through on write.

    So they have two 16 * 32b banks for the 16 general registers.
    The third 16 * 32b bank was likely for microcode temp variables.

    The thing is, yes, they only needed 1R port for instruction operands
    because sequential decode could only produce one operand at a time.
    Even on later machines circa 1990 like 8700/8800 or NVAX the general
    register file is only 1R1W port, the temp register bank is 2R1W.

    So the 780 second read port is likely used the same as later VAXen,
    it's for reading the temp values concurrently with an operand register.
    The operand registers were read one at a time because of the decode
    bottleneck.

    I'm wondering how they handled modifying address modes like autoincrement
    and still had precise interrupts.

    ADDL3 (r2)+, (r2)+, (r2)+

    the first (left) operand reads r2 then adds 4, which the second r2 reads
    and also adds 4, then the third again. It doesn't have a renamer so
    it has to stash the first modified r2 in the temp registers,
    and (somehow) pass that info to decode of the second operand
    so Decode knows to read the temp r2 not the general r2,
    and same for the third operand.
    At the end of the instruction if there is no exception then
    temp r2 is copied to general r2 and memory value is stored.

    I'm guessing in Decode someplace there are comparators to detect when
    the operand registers are the same so microcode knows to switch to the
    temp bank for a modified register.
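    A toy C model of that guess (purely illustrative; names are mine):
    operand decode reads through a temp copy, stages the auto-increment
    there, and only commits to the general registers when the whole
    instruction completes without an exception.

    #include <stdbool.h>
    #include <stdint.h>

    enum { NREGS = 16 };

    typedef struct {
        uint32_t regs[NREGS];   /* architectural general registers */
        uint32_t temp[NREGS];   /* per-instruction working copies  */
        bool     dirty[NREGS];  /* registers modified so far       */
    } decode_state;

    /* Read an operand register, preferring the in-flight temp copy. */
    static uint32_t read_reg(const decode_state *d, unsigned r)
    {
        return d->dirty[r] ? d->temp[r] : d->regs[r];
    }

    /* (Rn)+ operand: return the address, stage the increment in the temps. */
    static uint32_t autoinc(decode_state *d, unsigned r, unsigned size)
    {
        uint32_t addr = read_reg(d, r);
        d->temp[r] = addr + size;
        d->dirty[r] = true;
        return addr;
    }

    /* At the end of the instruction: commit only if nothing faulted. */
    static void retire(decode_state *d, bool fault)
    {
        for (unsigned r = 0; r < NREGS; r++) {
            if (d->dirty[r] && !fault)
                d->regs[r] = d->temp[r];
            d->dirty[r] = false;
        }
    }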

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Ames@commodorejohn@gmail.com to comp.arch,alt.folklore.computers on Wed Aug 6 08:28:03 2025
    From Newsgroup: comp.arch

    On Wed, 6 Aug 2025 00:59:07 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    DEC was bigger in the minicomputer market. If DEC could have offered
    an open-standard machine, that could have offered serious competition
    to IBM. But what OS would they have used? They were still dominated
    by Unix-haters then.

    DEC had plenty of experience in small-system single-user OSes by then;
    their bigger challenge would've been picking one. (CP/M owes a lot to
    the DEC lineage, although it dispenses with some of the more tedious mainframe-isms - e.g. the RUN [program] [parameters] syntax vs. just
    treating executable files on disk as commands in themselves.)

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From James Kuyper@jameskuyper@alumni.caltech.edu to comp.arch,comp.lang.c on Wed Aug 6 11:54:57 2025
    From Newsgroup: comp.arch

    On 2025-08-05 17:13, Kaz Kylheku wrote:
    On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 4 Aug 2025 15:25:54 -0400
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    ...
    If _BitInt is accepted by older versions of gcc, that means it was
    supported as a fully-conforming extension to C. Allowing
    implementations to support extensions in a fully-conforming manner is
    one of the main purposes for which the standard reserves identifiers.
    If you thought that gcc was too conservative to support extensions,
    you must be thinking of the wrong organization.


    I know that gcc supports extensions.
    I also know that gcc didn't support *this particular extension* up
    until quite recently.

    I think what James means is that GCC supports, as an extension,
    the use of any _[A-Z].* identifier whatsoever that it has not claimed
    for its purposes.

    No, I meant very specifically that if, as reported, _BitInt was
    supported even in earlier versions, then it was supported as an extension.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Wed Aug 6 15:55:06 2025
    From Newsgroup: comp.arch

    In comp.arch Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

    E.g., the designers of ARM A64 included addressing modes for using
    32-bit indices (but not 16-bit indices) into arrays. The designers of
    RV64G added several sign-extending 32-bit instructions (ending in
    "W"), but not corresponding instructions for 16-bit operations. The
    RISC-V manual justifies this with

    |A few new instructions (ADD[I]W/SUBW/SxxW) are required for addition
    |and shifts to ensure reasonable performance for 32-bit values.

    Why were 32-bit indices and 32-bit operations more important than
    16-bit indices and 16-bit operations? Because with 32-bit int, every
    integer type is automatically promoted to at least 32 bits.

    Objectively, a lot of programs fit into a 32-bit address space and
    may wish to run as 32-bit code for increased performance. Code
    that fits into 16-bit address space is rare enough on 64-bit
    machines to ignore.

    Likewise, with ILP64 the size of integers in computations would always
    be 64 bits, and many scalar variables (of type int and unsigned) would
    also be 64 bits. As a result, 32-bit indices and 32-bit operations
    would be rare enough that including these addressing modes and
    instructions would not be justified.

    But, you might say, what about memory usage? We would use int32_t
    where appropriate in big arrays and in fields of structs/classes with
    many instances. We would access these array elements and fields with
    LW/SW on RV64G and the corresponding instructions on ARM A64, no need
    for the addressing modes and instructions mentioned above.

    So the addressing mode bloat of ARM A64 and the instruction set bloat
    of RV64G that I mentioned above is courtesy of I32LP64.

    It is more complex. There are machines on the market with 64 MB
    RAM and a 64-bit RISC-V processor. There are (or were) machines
    with 512 MB RAM and a 64-bit ARM processor. On such machines it
    is quite natural to use 32-bit pointers. With 32-bit pointers
    there is the possibility of using existing 32-bit code. And
    ILP32 is the natural model.

    You can say that 32-bit pointers on 64-bit hardware are rare.
    But we really do not know. And especially in the embedded space, one
    big customer may want a feature, and the vendor, to avoid fragmentation,
    provides that feature to everyone.

    Why does such code need 32-bit addressing? Well, if enough parts of
    C were undefined, the compiler could just extend everything during
    load to 64 bits. So you could equally well claim that the real problem
    is that the C standard should have more undefined behaviour.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From James Kuyper@jameskuyper@alumni.caltech.edu to comp.arch,comp.lang.c on Wed Aug 6 11:56:04 2025
    From Newsgroup: comp.arch

    On 2025-08-05 17:25, Kaz Kylheku wrote:
    On 2025-08-05, Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    Breaking existing code that uses "_BitInt" as an identifier is
    a non-issue. There very probably is no such code.

    However, that doesn't mean GCC can carelessly introduce identifiers
    in this namespace.

    GCC does not define a complete C implementation; it doesn't provide a library. Libraries are provided by other projects: Glibc, Musl,
    ucLibc, ...

    Those libraries are C implementors also, and get to name things
    in the reserved namespace.

    GCC cannot be implemented in such a way as to create a fully conforming implementation of C when used in connection with an arbitrary
    implementation of the C standard library. This is just one example of a
    more general potential problem: Both gcc and the library must use some
    reserved identifiers, and they might have made conflicting choices.
    That's just one example of the many things that might prevent them from
    being combined to form a conforming implementation of C. It doesn't mean
    that either one is defective. It does mean that the two groups of
    implementors should consider working together to resolve the conflicts.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 6 14:00:56 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 4 Aug 2025 18:16:45 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:

    The claim by John Savard was that the VAX "was a good match to the
    technology *of its time*". It was not. It may have been a good
    match for the beliefs of the time, but that's a different thing.


    The evidence of the 801 is that it did not deliver until more than a decade
    later. And the variant that delivered was quite different from the original
    801.
    Actually, it can be argued that the 801 didn't deliver until more than 15
    years later.

    Maybe for IBM. IBM had its successful S/370 business, and no real
    need for the IBM 801 after the telephone switch project for which it
    was originally developed had been canceled, so they were in no hurry to productize it. <https://en.wikipedia.org/wiki/IBM_ROMP> says:

    |The architectural work on the ROMP began in late spring of 1977, as a
    |spin-off of IBM Research's 801 RISC processor (hence the "Research"
    |in the acronym). Most of the architectural changes were for cost
    |reduction, such as adding 16-bit instructions for
    |byte-efficiency. [...]
    |
    |The first chips were ready in early 1981 [...] ROMP first appeared in
    |a commercial product as the processor for the IBM RT PC workstation,
    |which was introduced in 1986. To provide examples for RT PC
    |production, volume production of the ROMP and its MMU began in
    |1985. The delay between the completion of the ROMP design, and
    |introduction of the RT PC was caused by overly ambitious software
    |plans for the RT PC and its operating system (OS).

    If IBM had been in a hurry to introduce ROMP, they would have had a
    contingency plan for the RT PC system software.

    For comparison:

    HPPA: "In early 1982, work on the Precision Architecture began at HP Laboratories, defining the instruction set and virtual memory
    system. Development of the first TTL implementation started in April
    1983. With simulation of the processor having completed in 1983, a
    final processor design was delivered to software developers in July
    1984. Systems prototyping followed, with "lab prototypes" being
    produced in 1985 and product prototypes in 1986. The first processors
    were introduced in products during 1986, with the first HP 9000 Series
    840 units shipping in November of that year." <https://en.wikipedia.org/wiki/PA-RISC>

    MIPS: Inspired by IBM 801, Stanford MIPS research project 1981-1984,
    1984 MIPS Inc, R2000 and R2010 (FP) introduced May 1986 (12.5MHz), and according to
    <https://en.wikipedia.org/wiki/MIPS_Computer_Systems#History> MIPS
    delivered a workstation in the same year.

    SPARC: Berkeley RISC research project between 1980 and 1984; <https://en.wikipedia.org/wiki/Berkeley_RISC> does not mention the IBM
    801 as inspiration, but a 1978 paper by Tanenbaum. Samples for RISC-I
    in May 1982 (but could only run at 0.5MHz). No date for the
    completion of RISC-II, but given that the research project ended in
    1984, it was probably at that time. Sun developed Berkeley RISC into
    SPARC, and the first SPARC machine, the Sun-4/260 appeared in July
    1987 with a 16.67MHz processor.

    ARM: Inspired by Berkeley RISC, "Acorn initiated its RISC research
    project in October 1983" <https://en.wikipedia.org/wiki/Acorn_Computers#New_RISC_architecture>
    "The first samples of ARM silicon worked properly when first received
    and tested on 26 April 1985. Known as ARM1, these versions ran at 6
    MHz.[...] late 1986 introduction of the ARM2 design running at 8 MHz
    [...] Acorn Archimedes personal computer models A305, A310, and A440,
    launched on the 6th June 1987." <https://en.wikipedia.org/wiki/ARM_architecture_family#History> Note
    that the Acorn people originally were not computer architects or
    circuit designers. ARM1 and ARM2 did not include an MMU, cache
    controller, or FPU, however.

    There are examples of Motorola (88000, 1988), Intel (i960, 1988), IBM
    (RS/6000, 1990), and DEC (Alpha, 1992) which had successful
    established architectures, and that caused the problem of how to place
    the RISC architecture in the market, and a certain lack of urgency.
    Read up on the individual architectures and their predecessors to
    learn about the individual causes for delays (there's not much in
    Wikipedia about the development of the 88000, however).

    HP might have been in the same camp, but apparently someone high up at
    HP decided to replace all their existing architectures with RISC ASAP,
    and they succeeded.

    In any case, RISCs delivered, starting in 1986. There is no reason
    they could not have delivered earlier.


    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 6 16:21:51 2025
    From Newsgroup: comp.arch

    Al Kossow <aek@bitsavers.org> writes:
    [RISC] didn't really make sense until main
    memory systems got a lot faster.

    The memory system of the VAX 11/780 was plenty fast for RISC to make
    sense:

    Cache cycle time: 200ns
    Memory cycle time: 600ns
    Average memory access time: 290ns
    Average VAX instruction execution time: 2000ns

    If we assume 1.5 RISC instructions per average VAX instruction, and a
    RISC CPI of 2 cycles (400ns: the 290ns plus extra time for data memory
    accesses and branches), the equivalent of a VAX instruction takes
    600ns, more than 3 times as fast as the actual VAX.

    Followups to comp.arch.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch,alt.folklore.computers on Wed Aug 6 16:35:23 2025
    From Newsgroup: comp.arch

    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqej$36gll$3@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Peter Flass <Peter@Iron-Spring.com> schrieb:

    The support issues alone were killers. Think about the
    Orange/Grey/(Blue?) Wall of VAX documentation, and then look at the
    five-page flimsy you got with a micro. The customers were willing to
    accept cr*p from a small startup, but wouldn't put up with it from IBM
    or DEC.

    Using UNIX faced stiff competition from AT&T's internal IT people,
    who wanted to run DEC's operating systems on all PDP-11 within
    the company (basically, they wanted to kill UNIX). They pointed
    towards the large amount of documentation that DEC provided, compared
    to the small amount for UNIX, as proof of superiority. The UNIX people
    saw it differently...

    I've never heard this before, and I do not believe that it is
    true. Do you have a source?

    Hmm... I _think_ it was in a talk given by the UNIX people,
    but I may be misremembering.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 6 16:34:55 2025
    From Newsgroup: comp.arch

    Stefan Monnier <monnier@iro.umontreal.ca> writes:
    The same happened to some extent with the early amd64 machines, which
    ended up running 32bit Windows and applications compiled for the i386
    ISA. Those processors were successful mostly because they were fast at running i386 code (with the added marketing benefit of being "64bit
    ready"): it took 2 years for MS to release a matching OS.

    Apr 2003: Opteron launch
    Sep 2003: Athlon 64 launch
    Oct 2003 (IIRC): I buy an Athlon 64
    Nov 2003: Fedora Core 1 released for IA-32, X86-64, PowerPC

    I installed Fedora Core 1 on my Athlon64 box in early 2004.

    Why wait for MS?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Aug 6 12:00:36 2025
    From Newsgroup: comp.arch

    On 8/6/2025 6:28 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    counter-argument to ILP64, where the more natural alternative is LP64.

    I am curious what makes you think that I32LP64 is "more natural",
    given that C is a human creation.


    We would have needed a new type to be able to express 32 bit values.

    Though, goes and looks it up, apparently the solution was to add __int32
    to address this issue.

    So, it seems, an early occurrence of the __int8, __int16, __int32,
    __int64, __int128 system.


    ILP64 is more consistent with the historic use of int: int is the
    integer type corresponding to the unnamed single type of B
    (predecessor of C), which was used for both integers and pointers.
    You can see that in various parts of C, e.g., in the integer type
    promotion rules (all integers are promoted at least to int in any
    case, beyond that only when another bigger integer is involved).
    Another example is

    main(argc, argv)
    char *argv[];
    {
        return 0;
    }

    Here the return type of main() defaults to int, and the type of argc
    defaults to int.

    As a consequence, one should be able to cast int->pointer->int and pointer->int->pointer without loss. That's not the case with I32LP64.
    It is the case for ILP64.


    Possibly.

    Though, in BGBCC I did make a minor tweak in the behavior of K&R and C89
    style code:
    The 'implicit int' was replaced with 'implicit long'...

    Which, ironically, allows a lot more K&R style code to run unmodified on
    a 64-bit machine. Where, if one assumes 'int', then a lot of K&R style
    code doesn't work correctly.


    Some people conspired in 1992 to set the de-facto standard, and made
    the mistake of deciding on I32LP64 <https://queue.acm.org/detail.cfm?id=1165766>, and we have paid for
    this mistake ever since, one way or the other.

    E.g., the designers of ARM A64 included addressing modes for using
    32-bit indices (but not 16-bit indices) into arrays. The designers of
    RV64G added several sign-extending 32-bit instructions (ending in
    "W"), but not corresponding instructions for 16-bit operations. The
    RISC-V manual justifies this with

    |A few new instructions (ADD[I]W/SUBW/SxxW) are required for addition
    |and shifts to ensure reasonable performance for 32-bit values.

    Why were 32-bit indices and 32-bit operations more important than
    16-bit indices and 16-bit operations? Because with 32-bit int, every
    integer type is automatically promoted to at least 32 bits.


    It is a tradeoff.

    A lot of 32 bit code expected int to be 32 bits, and also expects int to
    wrap on overflow. Without ADDW and friends, the expected wrap on
    overflow behavior is not preserved.

    Early BitManip would have added an ADDWU instruction (ADDW but zero extending); but then they dropped it.

    In my own RV extensions, I re-added ADDWU because IMHO dropping it was a mistake.


    In Zba, they have ADDUW instead, which zero-extends Rs1; so "ADDUW Rd,
    Rs, X0" can be used to zero-extend stuff, but this isn't as good. There
    was at one point an ADDIWU instruction, but I did not re-add it. I
    managed to add the original form of my jumbo prefix into the same
    encoding space; but have since relocated it.


    Re-adding ADDIWU is more debatable as the relative gains are smaller
    than for ADDWU (in a compiler with zero-extended unsigned int).

    For RV64G, it still needs, say:
    ADD Rd, Rs, Rt
    SLLI Rd, Rd, 32
    SRLI Rd, Rd, 32
    Which isn't ideal.

    Though, IMHO, the cost of needing 2 shifts for "unsigned int" ADD is
    less than the mess that results from sign-extending "unsigned int".

    Like, Zba adds "SHnADD.UW" and similar, which with zero-extended
    "unsigned int" would have been entirely unnecessary.


    So, that was my partial act of rebellion against the RV ABI spec (well,
    that and different handling of passing and returning structs by value).

    Where, BGBCC handles it in a way more like that in MS style ABIs, where:
    1-16 bytes, pass in registers or register pair;
    17+ bytes: pass or return via memory reference.

    As opposed to using on-stack copying as the fallback case.
    Though, arguably at least less of a mess than whatever was going on in
    the design of the SysV AMD64 ABI.



    Likewise, with ILP64 the size of integers in computations would always
    be 64 bits, and many scalar variables (of type int and unsigned) would
    also be 64 bits. As a result, 32-bit indices and 32-bit operations
    would be rare enough that including these addressing modes and
    instructions would not be justified.

    But, you might say, what about memory usage? We would use int32_t
    where appropriate in big arrays and in fields of structs/classes with
    many instances. We would access these array elements and fields with
    LW/SW on RV64G and the corresponding instructions on ARM A64, no need
    for the addressing modes and instructions mentioned above.

    So the addressing mode bloat of ARM A64 and the instruction set bloat
    of RV64G that I mentioned above is courtesy of I32LP64.


    This assumes though that all 64 bit operations can have the same latency
    as 32 bit operations.

    If you have a machine where common 32-bit ops can have 1 cycle latency
    but 64 needs 2 cycles, then it may be preferable to have 32 bit types
    for cases where 64 isn't needed.


    But, yeah, in an idealized world, maybe yeah, the avoidance of 32-bit
    int, or at least the avoidance of a dependency on assumed wrap on
    overflow semantics, or implicit promotion to whatever is the widest
    natively supported type, could have led to less of a mess.


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch,alt.folklore.computers on Wed Aug 6 12:12:32 2025
    From Newsgroup: comp.arch

    On 8/6/2025 6:05 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    If 'int' were 64-bits, then what about 16 and/or 32 bit types.
    short short?
    long short?

    Of course int16_t uint16_t int32_t uint32_t


    Well, assuming a post C99 world.


    On what keywords should these types be based? That's up to the
    implementor. In C23 one could

    typedef signed _BitInt(16) int16_t;


    Possible, though one can realize that _BitInt(16) is not equivalent to a normal 16-bit integer.

    _BitInt(16) sa, sb;
    _BitInt(32) lc;
    sa=0x5678;
    sb=0x789A;
    lc=sa+sb;

    Would give:
    0xFFFFCF12
    Rather than 0xCF12 (as would be expected with 'short' or similar).

    Because _BitInt(16) would not auto-promote before the addition, but
    rather would produce a _BitInt(16) result which is then widened to 32
    bits via sign extension.
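    The same point can be made with unsigned types, where the wrap is well
    defined and can be checked with a C23 compiler that supports _BitInt
    (e.g. a recent gcc or clang):

    #include <stdio.h>

    int main(void)
    {
        unsigned short       us_a = 0xC000, us_b = 0xC000;
        unsigned _BitInt(16) bi_a = 0xC000uwb, bi_b = 0xC000uwb;

        unsigned long r1 = us_a + us_b;   /* promoted to int: 0x18000 on a
                                             typical 32-bit-int platform   */
        unsigned long r2 = bi_a + bi_b;   /* no promotion, wraps at 16 bits:
                                             0x8000                        */

        printf("%lx %lx\n", r1, r2);
        return 0;
    }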


    etc. Around 1990, one would have just followed the example of "long
    long" of accumulating several modifiers. I would go for 16-bit
    "short" and 32-bit "long short".


    OK.

    Apparently at least some went for "__int32" instead.


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Al Kossow@aek@bitsavers.org to comp.arch,alt.folklore.computers on Wed Aug 6 10:20:18 2025
    From Newsgroup: comp.arch

    On 8/6/25 7:00 AM, Anton Ertl wrote:

    In any case, RISCs delivered, starting in 1986.

    http://bitsavers.org/pdf/ridge/Ridge_Hardware_Reference_Manual_Aug82.pdf


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch,alt.folklore.computers on Wed Aug 6 17:25:25 2025
    From Newsgroup: comp.arch

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Not aware of any platforms that do/did ILP64.

    AFAIK the Cray-1 (1976) was the first 64-bit machine, ...

    The IBM 7030 STRETCH was the first 64 bit machine, shipped in 1961,
    but I would be surprised if anyone had written a C compiler for it.

    It was bit addressable but memories in those days were so small that a full bit address was only 24 bits. So if I were writing a C compiler, pointers and ints would be 32 bits, char 8 bits, long 64 bits.

    (There is a thing called STRETCH C Compiler but it's completely unrelated.)
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 6 16:47:39 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    De Castro had had a big success with a simple load-store
    architecture, the Nova. He did that to reduce CPU complexity
    and cost, to compete with DEC and its PDP-8. (Byte addressing
    was horrible on the Nova, though).

    The PDP-8, and its 16-bit followup, the Nova, may be load/store, but
    it is not a register machine nor byte-addressed, while the PDP-11 is,
    and the RISC-VAX would be, too.

    Now, assume that, as a time traveler wanting to kick off an early
    RISC revolution, you are not allowed to reveal that you are a time
    traveler (which would have larger effects than just a different
    computer architecture). What do you do?

    a) You go to DEC

    b) You go to Data General

    c) You found your own company

    Even if I am allowed to reveal that I am a time traveler, that may not
    help; how would I prove it?

    Yes, convincing people in the mid-1970s to bet the company on RISC is
    a hard sell, that's I asked for "a magic wand that would convince the
    DEC management and workforce that I know how to design their next
    architecture, and how to compile for it" in <2025Mar1.125817@mips.complang.tuwien.ac.at>.

    Some arguments that might help:

    Complexity in CISC and how it breeds complexity elsewhere; e.g., the interaction of having more than one data memory access per
    instruction, virtual memory, and precise exceptions.

    How the CDC 6600 achieved performance (pipelining) and how non-complex
    its instructions are.

    I guess I would read through RISC-vs-CISC literature before entering
    the time machine in order to have some additional arguments.


    Concerning your three options, I think it will be a problem in any
    case. Data General's first bet was on FHP, a microcoded machine with user-writeable microcode, so maybe even more in the wrong direction
    than VAX; I can imagine a high-performance OoO VAX implementation, but
    for an architecture with exposed microcode like FHP an OoO
    implementation would probably be pretty challenging. The backup
    project that eventually came through was also a CISC.

    Concerning founding one's own company, one would have to convince
    venture capital, and then run the RISC of being bought by one of the
    big players, who buries the architecture. And even if you survive,
    you then have to build up the whole thing: production, marketing,
    sales, software support, ...

    In any case, the original claim was about the VAX, so of course the
    question at hand is what DEC could have done instead.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Aug 6 20:43:33 2025
    From Newsgroup: comp.arch

    On Wed, 6 Aug 2025 16:19:11 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 22:17:00 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 17:31:34 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    In this case 'adc edx,edx' is just a slightly shorter encoding
    of 'adc edx,0'. The EDX register is zeroized a few lines above.

    OK, nice.

    BTW, it seems that in your code fragment above you forgot to
    zeroize EDX at the beginning of the iteration. Or am I missing
    something?

    No, you are not. I skipped pretty much all the setup code. :-)

    It's not the setup code that looks missing to me, but the zeroing of
    RDX in the body of the loop.



    Anyway, the three main ADD RAX,... operations still define the
    minimum possible latency, right?


    I don't think so.
    It seems to me that there is only one chain of data dependencies
    between iterations of the loop - a trivial dependency through RCX.
    Some modern processors are already capable of eliminating this sort
    of dependency in the renamer. Probably not yet when it is coded as
    'inc', but when coded as 'add' or 'lea'.

    The dependency through RDX/RBX does not form a chain. The next
    value of [rdi+rcx*8] does depend on value of rbx from previous
    iteration, but the next value of rbx depends only on [rsi+rcx*8],
    [r8+rcx*8] and [r9+rcx*8]. It does not depend on the previous
    value of rbx, except for control dependency that hopefully would
    be speculated around.

    I believe we are doing a bigint three-way add, so each result word
    depends on the three corresponding input words, plus any carries
    from the previous round.

    This is the carry chain that I don't see any obvious way to
    break...

    You break the chain by *predicting* that
    carry[i] = CARRY(a[i]+b[i]+c[i]+carry[i-1]) is equal to
    CARRY(a[i]+b[i]+c[i]). If the prediction turns out wrong then you
    pay the heavy price of a branch misprediction. But outside of specially
    crafted inputs that is extremely rare.

    Aha!

    That's _very_ nice.

    Terje



    I did a few tests on a few machines: Raptor Cove (i7-14700 P core),
    Gracemont (i7-14700 E core), Skylake-C (Xeon E-2176G) and Zen3 (EPYC
    7543P).
    In order to see the effects more clearly I had to modify Anton's function
    into one that operates on pointers, because otherwise too much time was
    spent at the call site copying things around, which made the
    measurements too noisy.

    void add3(uintNN_t *dst, const uintNN_t *a, const uintNN_t *b,
              const uintNN_t *c) {
        *dst = *a + *b + *c;
    }


    After the change I saw a significant speed-up on 3 out of 4 platforms.
    The only platform where the speed-up was not significant was Skylake,
    probably because its rename stage is too narrow to profit from the
    change. The widest machine (Raptor Cove) benefited most.
    The results are inconclusive on the question whether the dependency
    between loop iterations is eliminated completely or just shortened to
    1-2 clock cycles per iteration. Even the widest of my cores is
    relatively narrow. Considering that my variant of the loop contains
    13 x86-64 instructions and 16 uOps, I am afraid that even the likes of
    Apple M4 would be too narrow :(

    Here are results in nanoseconds for N=65472
    Platform        RC      GM      SK      Z3
    clang        896.1  1476.7  1453.2  1348.0
    gcc          879.2  1661.4  1662.9  1655.0
    x86          585.8  1489.3   901.5   672.0
    Terje's      772.6  1293.2  1012.6  1127.0
    My           397.5   803.8   965.3   660.0
    ADX          579.1  1650.1   728.9   853.0
    x86/u2       581.5  1246.2   679.9   584.0
    Terje's/u3   503.7   954.3   630.9   755.0
    My/u3        266.6   487.2   486.5   440.0
    ADX/u8       350.4   839.3   490.4   451.0

    'x86' is a variant that was sketched in one of my above
    posts. It calculates the sum in two passes over arrays.
    'ADX' is a variant that uses ADCX/ADOX instructions as suggested by
    Anton, but unlike his suggestion does it in a loop rather than in a
    long straight-line code sequence.
    /u2, /u3, /u8 indicate unroll factors of the inner loop.

    Frequency:
    RC 5.30 GHz (Est)
    GM 4.20 GHz (Est)
    SK 4.25 GHz
    Z3 3.70 GHz
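 
    For readers who want to see the shape of the idea in portable code, here
    is a minimal C sketch of the carry-prediction trick discussed above. It
    is not any of the benchmarked variants; the function name and structure
    are invented for illustration, and whether a compiler and CPU actually
    turn the incoming-carry add into a well-predicted branch is a separate
    matter.
 
    #include <stddef.h>
    #include <stdint.h>
 
    /* Three-way add of n 64-bit words, least-significant word first.
       Each word's sum is formed independently of the incoming carry; the
       incoming carry is added afterwards, and only in the rare case where
       that extra add itself overflows does it lengthen the carry chain. */
    static void add3_words(uint64_t *dst, const uint64_t *a,
                           const uint64_t *b, const uint64_t *c, size_t n)
    {
        unsigned carry = 0;                      /* carry into word i (0..2) */
        for (size_t i = 0; i < n; i++) {
            uint64_t s = a[i] + b[i];
            unsigned cout = (s < a[i]);          /* carry out of a+b */
            uint64_t t = s + c[i];
            cout += (t < s);                     /* carry out of +c */
            uint64_t r = t + carry;
            if (r < t)                           /* rare: incoming carry ripples */
                cout += 1;
            dst[i] = r;
            carry = cout;
        }
    }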





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Aug 6 12:47:10 2025
    From Newsgroup: comp.arch

    On 8/6/2025 10:55 AM, Waldek Hebisch wrote:
    In comp.arch Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:

    E.g., the designers of ARM A64 included addressing modes for using
    32-bit indices (but not 16-bit indices) into arrays. The designers of
    RV64G added several sign-extending 32-bit instructions (ending in
    "W"), but not corresponding instructions for 16-bit operations. The
    RISC-V manual justifies this with

    |A few new instructions (ADD[I]W/SUBW/SxxW) are required for addition
    |and shifts to ensure reasonable performance for 32-bit values.

    Why were 32-bit indices and 32-bit operations more important than
    16-bit indices and 16-bit operations? Because with 32-bit int, every
    integer type is automatically promoted to at least 32 bits.

    Objectively, a lot of programs fit into 32-bit address space and
    may wish to run as 32-bit code for increased performance. Code
    that fits into 16-bit address space is rare enough on 64-bit
    machines to ignore.

    Likewise, with ILP64 the size of integers in computations would always
    be 64 bits, and many scalar variables (of type int and unsigned) would
    also be 64 bits. As a result, 32-bit indices and 32-bit operations
    would be rare enough that including these addressing modes and
    instructions would not be justified.

    But, you might say, what about memory usage? We would use int32_t
    where appropriate in big arrays and in fields of structs/classes with
    many instances. We would access these array elements and fields with
    LW/SW on RV64G and the corresponding instructions on ARM A64, no need
    for the addressing modes and instructions mentioned above.

    So the addressing mode bloat of ARM A64 and the instruction set bloat
    of RV64G that I mentioned above is courtesy of I32LP64.

    It is more complex. There are machines on the market with 64 MB of
    RAM and a 64-bit RISC-V processor. There are (or were) machines
    with 512 MB of RAM and a 64-bit ARM processor. On such machines it
    is quite natural to use 32-bit pointers. With 32-bit pointers
    there is the possibility of reusing existing 32-bit code. And
    ILP32 is the natural model.

    You can say that 32-bit pointers on 64-bit hardware are rare.
    But we really do not know. And especially in the embedded space, one
    big customer may want a feature, and the vendor, to avoid fragmentation,
    provides that feature to everyone.

    Why would such code need 32-bit addressing? Well, if enough parts of
    C were undefined, the compiler could just extend everything to 64 bits
    during loads. So you could equally well claim that the real problem
    is that the C standard should have more undefined behaviour.


    Something like the X32-style ABIs almost makes sense, since most
    processes need less than 4GB of RAM.

    But, then the problem becomes that one would need both 32 and 64 bit
    variants of most of the OS shared libraries, which may well offset the
    savings from using less RAM for pointers.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch,alt.folklore.computers on Wed Aug 6 18:22:03 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    On 8/6/2025 6:05 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    If 'int' were 64-bits, then what about 16 and/or 32 bit types.
    short short?
    long short?

    Of course int16_t uint16_t int32_t uint32_t


    Well, assuming a post C99 world.

    'typedef' was around long before C99 happened to
    standardize the aforementioned typedefs.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Peter Flass@Peter@Iron-Spring.com to comp.arch,alt.folklore.computers on Wed Aug 6 12:11:08 2025
    From Newsgroup: comp.arch

    On 8/6/25 10:25, John Levine wrote:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Not aware of any platforms that do/did ILP64.

    AFAIK the Cray-1 (1976) was the first 64-bit machine, ...

    The IBM 7030 STRETCH was the first 64 bit machine, shipped in 1961,
    but I would be surprised if anyone had written a C compiler for it.

    It was bit addressable but memories in those days were so small that a full bit
    address was only 24 bits. So if I were writing a C compiler, pointers and ints
    would be 32 bits, char 8 bits, long 64 bits.

    (There is a thing called STRETCH C Compiler but it's completely unrelated.)

    I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
    too, and it seems like all it does is drastically shrink your address
    space and complexify instruction and operand fetch to (maybe) save a few bytes.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Peter Flass@Peter@Iron-Spring.com to comp.arch,alt.folklore.computers on Wed Aug 6 12:12:30 2025
    From Newsgroup: comp.arch

    On 8/6/25 09:47, Anton Ertl wrote:


    Even if I am allowed to reveal that I am a time traveler, that may not
    help; how would I prove it?

    I'm a time-traveler from the 1960s!

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch,alt.folklore.computers on Wed Aug 6 19:50:17 2025
    From Newsgroup: comp.arch

    According to Peter Flass <Peter@Iron-Spring.com>:
    It was bit addressable but memories in those days were so small that a full bit
    address was only 24 bits. So if I were writing a C compiler, pointers and ints
    would be 32 bits, char 8 bits, long 64 bits.

    (There is a thing called STRETCH C Compiler but it's completely unrelated.)

    I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
    too, and it seems like all it does is drastically shrink your address
    space and complexify instruction and operand fetch to (maybe) save a few
    bytes.

    STRETCH had a severe case of second system syndrome, and was full of
    complex features that weren't worth the effort; it was impressive
    that IBM got it to work and run as fast as it did.

    In that era memory was expensive, and usually measured in K, not M.
    The idea was presumably to pack data as tightly as possible.

    In the 1970s I briefly used a B1700 which was bit addressable and had reloadable
    microcode so COBOL programs used the COBOL instruction set, FORTRAN programs used the FORTRAN instruction set, and so forth, with each one having whatever word or byte sizes they wanted. In retrospect it seems like a lot of
    premature optimization.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Aug 6 20:06:00 2025
    From Newsgroup: comp.arch

    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX kind of odd: the
    existence of the 801 as a research project, used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better", doesn't really hold up when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.

    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    There are a few intermediate steps.

    The 801 demonstrated that a RISC, including caches and pipelining,
    would have been feasible at the time. It also demonstrated that
    somebody had thought of graph coloring algorithms.

    There can also be no doubt that a RISC-type machine would have
    exhibited the same performance advantages (at least in integer
    performance) as a RISC vs CISC 10 years later. The 801 did so
    vs. the /370, as did the RISC processors vs, for example, the
    680x0 family of processors (just compare ARM vs. 68000).

    Or look at the performance of the TTL implementation of HP-PA,
    which used PALs which were not available to the VAX 11/780
    designers, so it could be clocked a bit higher, but ran at
    a multiple of the performance of the VAX.

    So, Anton visiting DEC or me visiting Data General could have
    brought them a technology which would have significantly outperformed
    the VAX (especially if we brought along the algorithm for graph
    coloring). Some people at IBM would have been peeved at having
    somebody else "develop" this at the same time, but OK.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch,alt.folklore.computers on Wed Aug 6 20:30:00 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    According to Peter Flass <Peter@Iron-Spring.com>:
    It was bit addressable but memories in those days were so small that a full bit
    address was only 24 bits. So if I were writing a C compiler, pointers and ints
    would be 32 bits, char 8 bits, long 64 bits.

    (There is a thing called STRETCH C Compiler but it's completely unrelated.)
 
    I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
    too, and it seems like all it does is drastically shrink your address
    space and complexify instruction and operand fetch to (maybe) save a few
    bytes.

    STRETCH had a severe case of second system syndrome, and was full of
    complex features that weren't worth the effort and it was impressive
    that IBM got it to work and to run as fast as it did.

    In that era memory was expensive, and usually measured in K, not M.
    The idea was presumably to pack data as tightly as possible.

    In the 1970s I briefly used a B1700 which was bit addressable and had reloadable
    microcode so COBOL programs used the COBOL instruction set, FORTRAN programs
    used the FORTRAN instruction set, and so forth, with each one having whatever
    word or byte sizes they wanted. In retrospect it seems like a lot of
    premature optimization.

    We had a B1900 in the software lab, but I don't recall anyone
    actually using it - I believe it had been moved from Santa
    Barbara (Small Systems plant) and may have been used for
    reproducing customer issues, but by 1983, there weren't many
    small systems customers remaining.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.arch,comp.lang.c on Wed Aug 6 13:58:51 2025
    From Newsgroup: comp.arch

    James Kuyper <jameskuyper@alumni.caltech.edu> writes:
    On 2025-08-05 17:13, Kaz Kylheku wrote:
    On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 4 Aug 2025 15:25:54 -0400
    James Kuyper <jameskuyper@alumni.caltech.edu> wrote:

    ...
    If _BitInt is accepted by older versions of gcc, that means it was
    supported as a fully-conforming extension to C. Allowing
    implementations to support extensions in a fully-conforming manner is
    one of the main purposes for which the standard reserves identifiers.
    If you thought that gcc was too conservative to support extensions,
    you must be thinking of the wrong organization.


    I know that gcc supports extensions.
    I also know that gcc didn't support *this particular extension* up
    until quite recently.

    I think what James means is that GCC supports, as an extension,
    the use of any _[A-Z].* identifier whatsoever that it has not claimed
    for its purposes.

    No, I meant very specifically that if, as reported, _BitInt was
    supported even in earlier versions, then it was supported as an extension.

    gcc 13.4.0 does not recognize _BitInt at all.

    gcc 14.2.0 handles _BitInt as a language feature in C23 mode,
    and as an "extension" in pre-C23 modes.

    It warns about _BitInt with "-std=c17 -pedantic", but not with
    just "-std=c17". I think I would have preferred a warning with
    "-std=c17", but it doesn't bother me. There's no mention of _BitInt
    as an extension or feature in the documentation. An implementation
    is required to document the implementation-defined value of
    BITINT_MAXWIDTH, so that's a conformance issue. In pre-C23 mode,
    since it's not documented, support for _BitInt is not formally an
    "extension"; it's an allowed behavior in the presence of code that
    has undefined behavior due to its use of a reserved identifier.
    (This is a picky language-lawyerly interpretation.)
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 6 17:00:03 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.
    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.
    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever bulit one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.
    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of though that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    There are a few intermediate steps.

    The 801 demonstrated that a RISC, including caches and pipelining,
    would have been feasible at the time. It also demonstrated that
    somebody had thought of graph coloring algorithms.

    There can also be no doubt that a RISC-type machine would have
    exhibited the same performance advantages (at least in integer
    performance) as a RISC vs CISC 10 years later. The 801 did so
    vs. the /370, as did the RISC processors vs, for example, the
    680x0 family of processors (just compare ARM vs. 68000).

    Or look at the performance of the TTL implementation of HP-PA,
    which used PALs which were not available to the VAX 11/780
    designers, so it could be clocked a bit higher, but at
    a multiple of the performance than the VAX.

    So, Anton visiting DEC or me visiting Data General could have
    brought them a technology which would significantly outperformed
    the VAX (especially if we brought along the algorithm for graph
    coloring. Some people at IBM would have been peeved at having
    somebody else "develop" this at the same time, but OK.


    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix) were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Aug 6 21:14:07 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Thomas Koenig wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.
    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.


    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix)
    were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    Burroughs mainframers started designing with ECL gate arrays circa
    1981, and they shipped in 1987[*]. I suspect even FPAL or other PLAs
    would have been far too expensive to use to build a RISC CPU,
    especially for one of the BUNCH, for whom backward compatibility was
    paramount.

    [*] The machine (Unisys V530) sold for well over a megabuck in
    a single processor configuration.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 6 17:57:03 2025
    From Newsgroup: comp.arch

    EricP wrote:
    Thomas Koenig wrote:

    Or look at the performance of the TTL implementation of HP-PA,
    which used PALs which were not available to the VAX 11/780
    designers, so it could be clocked a bit higher, but at
    a multiple of the performance than the VAX.

    So, Anton visiting DEC or me visiting Data General could have
    brought them a technology which would significantly outperformed
    the VAX (especially if we brought along the algorithm for graph
    coloring. Some people at IBM would have been peeved at having
    somebody else "develop" this at the same time, but OK.

    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix) were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The question isn't could one build a modern risc-style pipelined cpu
    from TTL in 1975 - of course one could. Nor do I see any question of
    could it beat a VAX 780 0.5 MIPS at 5 MHz - of course it could, easily.

    I'm pretty sure I could use my Mk-I risc ISA and build a 5 stage pipeline running at 5 MHz getting 1 IPC sustained when hitting the 200 ns cache
    (using some in-order superscalar ideas and two reg file write ports
    to "catch up" after pipeline bubbles).

    TTL risc would also be much cheaper to design and prototype.
    VAX took hundreds of people many many years.

    The question is could one build this at a commercially competitive price?
    There is a reason people did things sequentially in microcode.
    All those control decisions that used to be stored as bits in microcode now become real logic gates. And in SSI TTL you don't get many to the $.
    And many of those sequential microcode states become independent concurrent state machines, each with its own logic sequencer.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Swindells@rjs@fdy2.co.uk to comp.arch,alt.folklore.computers on Wed Aug 6 22:30:56 2025
    From Newsgroup: comp.arch

    On Wed, 06 Aug 2025 14:00:56 GMT, Anton Ertl wrote:

    For comparison:

    SPARC: Berkeley RISC research project between 1980 and 1984; <https://en.wikipedia.org/wiki/Berkeley_RISC> does not mention the IBM
    801 as inspiration, but a 1978 paper by Tanenbaum. Samples for RISC-I
    in May 1982 (but could only run at 0.5MHz). No date for the completion
    of RISC-II, but given that the research project ended in 1984, it was probably at that time. Sun developed Berkeley RISC into SPARC, and the
    first SPARC machine, the Sun-4/260 appeared in July 1987 with a 16.67MHz processor.

    The Katevenis thesis on RISC-II contains a timeline on p6; it lists fabrication in spring '83, with testing during summer '83.

    There is also a bibliography entry of an informal discussion with John
    Cocke at Berkeley about the 801 in June 1983.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lars Poulsen@lars@cleo.beagle-ears.com to comp.arch,alt.folklore.computers on Wed Aug 6 23:12:26 2025
    From Newsgroup: comp.arch

    On 2025-08-06, Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Not aware of any platforms that do/did ILP64.

    AFAIK the Cray-1 (1976) was the first 64-bit machine, and C for the
    Cray-1 and successors implemented, as far as I can determine

    type bits
    char 8
    short int 64
    int 64
    long int 64
    pointer 64

    Not having a 16-bit integer type and not having a 32-bit integer type
    would make it very hard to adapt portable code, such as TCP/IP protocol processing.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch,alt.folklore.computers on Wed Aug 6 23:15:54 2025
    From Newsgroup: comp.arch

    AFAIK the Cray-1 (1976) was the first 64-bit machine, and C for the
    Cray-1 and successors implemented, as far as I can determine

    type bits
    char 8
    short int 64
    int 64
    long int 64
    pointer 64

    Not having a 16-bit integer type and not having a 32-bit integer type
    would make it very hard to adapt portable code, such as TCP/IP protocol
    processing.

    I'd think this was obvious, but if the code depends on word sizes and doesn't declare its variables to use those word sizes, I don't think "portable" is the right term.

    Perhaps "happens to work on some computers similar to the one it was originally written on."
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lars Poulsen@lars@cleo.beagle-ears.com to comp.arch,alt.folklore.computers on Wed Aug 6 23:32:47 2025
    From Newsgroup: comp.arch

    ["Followup-To:" header set to comp.arch.]
    On 2025-08-06, John Levine <johnl@taugh.com> wrote:
    AFAIK the Cray-1 (1976) was the first 64-bit machine, and C for the
    Cray-1 and successors implemented, as far as I can determine

    type bits
    char 8
    short int 64
    int 64
    long int 64
    pointer 64

    Not having a 16-bit integer type and not having a 32-bit integer type
    would make it very hard to adapt portable code, such as TCP/IP protocol
    processing.

    I'd think this was obvious, but if the code depends on word sizes and doesn't declare its variables to use those word sizes, I don't think "portable" is the
    right term.

    My concern is how you express your desire for having e.g. an int16?
    All the portable code I know defines int8, int16, int32 by means of a
    typedef that adds an appropriate alias for each of these back to a
    native type. If "short" is 64 bits, how do you define a 16 bit?
    Or did the compiler have native types __int16 etc?

    - Lars
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Wed Aug 6 23:34:04 2025
    From Newsgroup: comp.arch

    On Wed, 06 Aug 2025 11:28:45 GMT, Anton Ertl wrote:

    Why were 32-bit indices and 32-bit operations more important than 16-bit indices and 16-bit operations?

    32 bits was considered a kind of “sweet spot” in the evolution of computer
    architectures. It was the first point at which memory-addressability
    constraints were no longer at the top of the list of things to worry about
    when designing a software architecture.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Wed Aug 6 23:36:11 2025
    From Newsgroup: comp.arch

    On Wed, 6 Aug 2025 12:11:08 -0700, Peter Flass wrote:

    I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
    too, and it seems like all it does is drastically shrink your address
    space and complexify instruction and operand fetch to (maybe) save a few bytes.

    But with 64-bit addressing, it only means sacrificing the bottom 3 bits.

    With normal load/store, you can insist that these 3 bits be zero, whereas
    in bit-aligned load/store, they can specify a nonzero bit offset.
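 
    (A tiny sketch of the layout being described, assuming 8-bit bytes; the
    helper names are invented for illustration.)
 
    #include <stdint.h>
 
    /* In a bit-addressed view, the low 3 bits of an address select the bit
       within a byte; a byte-aligned access simply requires them to be zero. */
    static inline uint64_t byte_of(uint64_t bitaddr) { return bitaddr >> 3; }
    static inline unsigned bit_of(uint64_t bitaddr)  { return (unsigned)(bitaddr & 7u); }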
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Wed Aug 6 23:38:15 2025
    From Newsgroup: comp.arch

    On Wed, 06 Aug 2025 10:32:39 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:

    Not aware of any platforms that do/did ILP64.

    AFAIK the Cray-1 (1976) was the first 64-bit machine ...

    But it was not byte-addressable. Its precursor CDC machines had 60-bit
    words, as I recall. DEC’s “large systems” family from around that era (PDP-6, PDP-10) had 36-bit words. And there were likely some other vendors offering 48-bit words, that kind of thing. Maybe some with word lengths
    even longer than 64 bits.

    I was thinking more specifically of machines from the byte-addressable
    era.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Wed Aug 6 23:40:48 2025
    From Newsgroup: comp.arch

    On Wed, 06 Aug 2025 10:24:49 GMT, Anton Ertl wrote:

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:

    Of all the major OSes for Alpha, Windows NT was the only one that
    couldn’t take advantage of the 64-bit architecture.

    Actually, Windows took good advantage of the 64-bit architecture:
    "64-bit Windows was initially developed on the Alpha AXP." <https://learn.microsoft.com/en-us/previous-versions/technet-magazine/cc718978(v=msdn.10)>

    Remember the Alpha was first released in 1992. No shipping version of
    Windows NT ever ran on it in anything other than “TASO” (“Truncated Address-Space Option”, i.e. 32-bit-only addressing) mode.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Swindells@rjs@fdy2.co.uk to comp.arch on Wed Aug 6 23:43:12 2025
    From Newsgroup: comp.arch

    On Wed, 06 Aug 2025 17:00:03 -0400, EricP wrote:

    Thomas Koenig wrote:

    Or look at the performance of the TTL implementation of HP-PA, which
    used PALs which were not available to the VAX 11/780 designers, so it
    could be clocked a bit higher, but at a multiple of the performance
    than the VAX.

    So, Anton visiting DEC or me visiting Data General could have brought
    them a technology which would significantly outperformed the VAX
    (especially if we brought along the algorithm for graph coloring. Some
    people at IBM would have been peeved at having somebody else "develop"
    this at the same time, but OK.


    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR
    matrix)
    were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The DG MV/8000 used PALs but The Soul of a New Machine hints that there
    were supply problems with them at the time.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Wed Aug 6 23:45:44 2025
    From Newsgroup: comp.arch

    On Wed, 6 Aug 2025 08:28:03 -0700, John Ames wrote:

    CP/M owes a lot to the DEC lineage, although it dispenses with some
    of the more tedious mainframe-isms - e.g. the RUN [program]
    [parameters] syntax vs. just treating executable files on disk as
    commands in themselves.)

    It added its own misfeatures, though. Like single-letter device names,
    but only for disks. Non-file-structured devices were accessed via “reserved” file names, which continue to bedevil Microsoft Windows to
    this day, aggravated by a totally perverse extension of the concept to
    paths with hierarchical directory names.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch,alt.folklore.computers on Wed Aug 6 20:21:31 2025
    From Newsgroup: comp.arch

    Robert Swindells wrote:
    On Wed, 06 Aug 2025 14:00:56 GMT, Anton Ertl wrote:

    For comparison:

    SPARC: Berkeley RISC research project between 1980 and 1984;
    <https://en.wikipedia.org/wiki/Berkeley_RISC> does not mention the IBM
    801 as inspiration, but a 1978 paper by Tanenbaum. Samples for RISC-I
    in May 1982 (but could only run at 0.5MHz). No date for the completion
    of RISC-II, but given that the research project ended in 1984, it was
    probably at that time. Sun developed Berkeley RISC into SPARC, and the
    first SPARC machine, the Sun-4/260 appeared in July 1987 with a 16.67MHz
    processor.

    The Katevenis thesis on RISC-II contains a timeline on p6, it lists fabrication of it in spring 83 with testing during summer 83.

    There is also a bibliography entry of an informal discussion with John
    Cocke at Berkeley about the 801 in June 1983

    There is a citation to Cocke as "private communication" in 1980 by
    Patterson in The Case for the Reduced Instruction Set Computer, 1980.

    "REASONS FOR INCREASED COMPLEXITY

    Why have computers become more complex? We can think of several reasons:
    Speed of Memory vs. Speed of CPU. John Cocke says that the complexity began with the transition from the 701 to the 709 [Cocke80]. The 701 CPU was about ten times as fast as the core main memory; this made any primitives that
    were implemented as subroutines much slower than primitives that were instructions. Thus the floating point subroutines became part of the 709 architecture with dramatic gains. Making the 709 more complex resulted
    in an advance that made it more cost-effective than the 701. Since then,
    many "higher-level" instructions have been added to machines in an attempt
    to improve performance. Note that this trend began because of the imbalance
    in speeds; it is not clear that architects have asked themselves whether
    this imbalance still holds for their designs."



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 6 20:41:44 2025
    From Newsgroup: comp.arch

    EricP wrote:

    Signetics 82S100/101 Field Programmable Logic Array FPAL (an AND-OR matrix)
    ^^^^
    Oops... typo. Should be FPLA.
    PAL or Programmable Array Logic was a slightly different thing,
    also an AND-OR matrix from Monolithic Memories.

    were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    And PAL's too. Whatever works and is cheapest.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Charlie Gibbs@cgibbs@kltpzyxm.invalid to comp.arch,alt.folklore.computers on Thu Aug 7 01:36:50 2025
    From Newsgroup: comp.arch

    On 2025-08-06, Peter Flass <Peter@Iron-Spring.com> wrote:

    On 8/6/25 09:47, Anton Ertl wrote:

    Even if I am allowed to reveal that I am a time traveler, that may not
    help; how would I prove it?

    I'm a time-traveler from the 1960s!

    I'm starting to tell people that I'm a traveller
    from a distant land known as the past.
    --
    /~\ Charlie Gibbs | Growth for the sake of
    \ / <cgibbs@kltpzyxm.invalid> | growth is the ideology
    X I'm really at ac.dekanfrus | of the cancer cell.
    / \ if you read it the right way. | -- Edward Abbey
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Charlie Gibbs@cgibbs@kltpzyxm.invalid to comp.arch,alt.folklore.computers on Thu Aug 7 01:49:18 2025
    From Newsgroup: comp.arch

    On 2025-08-06, Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    On Wed, 6 Aug 2025 08:28:03 -0700, John Ames wrote:

    CP/M owes a lot to the DEC lineage, although it dispenses with some
    of the more tedious mainframe-isms - e.g. the RUN [program]
    [parameters] syntax vs. just treating executable files on disk as
    commands in themselves.)

    It added its own misfeatures, though. Like single-letter device names,
    but only for disks. Non-file-structured devices were accessed via “reserved” file names, which continue to bedevil Microsoft Windows to this day, aggravated by a totally perverse extension of the concept to
    paths with hierarchical directory names.

    Funny how people ridicule COBOL's reserved words, while accepting MS-DOS/Windows' CON, LPT, etc. If only a trailing colon (which I
    always used) were mandatory; that would put device names cleanly
    into a different name space, eliminating the problem.

    But, you know, Microsoft...
    --
    /~\ Charlie Gibbs | Growth for the sake of
    \ / <cgibbs@kltpzyxm.invalid> | growth is the ideology
    X I'm really at ac.dekanfrus | of the cancer cell.
    / \ if you read it the right way. | -- Edward Abbey
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Thu Aug 7 02:22:05 2025
    From Newsgroup: comp.arch

    On Wed, 06 Aug 2025 20:21:31 -0400, EricP wrote:

    There is a citation to Cocke as "private communication" in 1980 by
    Patterson in The Case for the Reduced Instruction Set Computer,
    1980.

    "REASONS FOR INCREASED COMPLEXITY

    Why have computers become more complex? We can think of several
    reasons: Speed of Memory vs. Speed of CPU. John Cocke says that the complexity began with the transition from the 701 to the 709
    [Cocke80]. The 701 CPU was about ten times as fast as the core main
    memory; this made any primitives that were implemented as
    subroutines much slower than primitives that were instructions. Thus
    the floating point subroutines became part of the 709 architecture
    with dramatic gains. Making the 709 more complex resulted in an
    advance that made it more cost-effective than the 701. Since then,
    many "higher-level" instructions have been added to machines in an
    attempt to improve performance. Note that this trend began because
    of the imbalance in speeds; it is not clear that architects have
    asked themselves whether this imbalance still holds for their
    designs."

    That disparity between CPU and RAM speeds is even greater today than
    it was back then. Yet we have moved away from adding ever-more-complex instructions, and are getting better performance with simpler ones.

    How come? Caching.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch,alt.folklore.computers on Thu Aug 7 02:56:08 2025
    From Newsgroup: comp.arch

    According to Lars Poulsen <lars@cleo.beagle-ears.com>:
    ["Followup-To:" header set to comp.arch.]
    On 2025-08-06, John Levine <johnl@taugh.com> wrote:
    AFAIK the Cray-1 (1976) was the first 64-bit machine, and C for the
    Cray-1 and successors implemented, as far as I can determine

    type bits
    char 8
    short int 64
    int 64
    long int 64
    pointer 64

    Not having a 16-bit integer type and not having a 32-bit integer type >>>would make it very hard to adapt portable code, such as TCP/IP protocol >>>processing.

    I'd think this was obvious, but if the code depends on word sizes and doesn't
    declare its variables to use those word sizes, I don't think "portable" is the
    right term.

    My concern is how do you express yopur desire for having e.g. an int16 ?
    All the portable code I know defines int8, int16, int32 by means of a
    typedef that adds an appropriate alias for each of these back to a
    native type. If "short" is 64 bits, how do you define a 16 bit?

    In modern C you use the values in limits.h to pick the type, and define
    macros that mask values to the size you need. In older C you did the same thing in much uglier ways. Writing code that is portable across different
    word sizes has always been tedious.
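 
    A hedged sketch of that approach (the type and macro names are invented):
    pick a typedef from the limit macros, then mask wherever the algorithm
    depends on exact 16-bit wraparound.
 
    #include <limits.h>
 
    /* Choose the narrowest standard type that holds at least 16 bits. */
    #if UCHAR_MAX >= 0xFFFF
    typedef unsigned char  uint16;
    #else
    typedef unsigned short uint16;   /* guaranteed to be at least 16 bits */
    #endif
 
    /* Emulate exact 16-bit semantics where they are required. */
    #define MASK16(x) ((x) & 0xFFFFu)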

    Or did the compiler have native types __int16 etc?

    Given how long ago it was, I doubt it.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch,alt.folklore.computers on Thu Aug 7 05:29:33 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    De Castro had had a big success with a simple load-store
    architecture, the Nova. He did that to reduce CPU complexity
    and cost, to compete with DEC and its PDP-8. (Byte addressing
    was horrible on the Nova, though).

    The PDP-8, and its 16-bit followup, the Nova, may be load/store, but
    it is not a register machine nor byte-addressed, while the PDP-11 is,
    and the RISC-VAX would be, too.

    Now, assume that, as a time traveler wanting to kick off an early
    RISC revolution, you are not allowed to reveal that you are a time
    traveler (which would have larger effects than just a different
    computer architecture). What do you do?

    a) You go to DEC

    b) You go to Data General

    c) You found your own company

    Even if I am allowed to reveal that I am a time traveler, that may not
    help; how would I prove it?

    Bring a mobile phone or tablet with you, install Stockfish,
    and beat everybody at chess.

    But making it known that you are a time traveller (and being able
    to prove it) would very probably invite all sorts of questions
    from all sorts of people about the future (or even about things
    in the then-present which were declassified in the future), and
    these people might not take "no" or "I don't know" for an answer.

    [...]

    Yes, convincing people in the mid-1970s to bet the company on RISC is
    a hard sell, that's I asked for "a magic wand that would convince the
    DEC management and workforce that I know how to design their next architecture, and how to compile for it" in
    <2025Mar1.125817@mips.complang.tuwien.ac.at>.

    Some arguments that might help:

    Complexity in CISC and how it breeds complexity elsewhere; e.g., the interaction of having more than one data memory access per
    instruction, virtual memory, and precise exceptions.

    How the CDC 6600 achieved performance (pipelining) and how non-complex
    its instructions are.

    I guess I would read through RISC-vs-CISC literature before entering
    the time machine in order to have some additional arguments.


    Concerning your three options, I think it will be a problem in any
    case. Data General's first bet was on FHP, a microcoded machine with user-writeable microcode,

    That would have been the right time, I think - convince de Castro
    that, instead of writable microcode, RISC is the right direction.
    The Fountainhead project started in July 1975, more or less contemporary
    with the VAX, and an alternate-Fountainhead could probably have
    been introduced at the same time, in 1977.

    so maybe even more in the wrong direction
    than VAX; I can imagine a high-performance OoO VAX implementation, but
    for an architecture with exposed microcode like FHP an OoO
    implementation would probably be pretty challenging. The backup
    project that eventually came through was also a CISC.

    Sure.


    Concerning founding ones own company, one would have to convince
    venture capital, and then run the RISC of being bought by one of the
    big players, who buries the architecture. And even if you survive,
    you then have to build up the whole thing: production, marketing,
    sales, software support, ...

    That is one of the things I find astonishing - how a company like
    DG grew from a kitchen-table affair to the size they had.

    In any case, the original claim was about the VAX, so of course the
    question at hand is what DEC could have done instead.

    - anton
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Thu Aug 7 10:27:40 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There is a citation to Cocke as "private communication" in 1980 by
    Patterson in The Case for the Reduced Instruction Set Computer, 1980.

    "REASONS FOR INCREASED COMPLEXITY

    Why have computers become more complex? We can think of several reasons:
    Speed of Memory vs. Speed of CPU. John Cocke says that the complexity began
    with the transition from the 701 to the 709 [Cocke80]. The 701 CPU was about
    ten times as fast as the core main memory; this made any primitives that
    were implemented as subroutines much slower than primitives that were
    instructions. Thus the floating point subroutines became part of the 709
    architecture with dramatic gains. Making the 709 more complex resulted
    in an advance that made it more cost-effective than the 701. Since then,
    many "higher-level" instructions have been added to machines in an attempt
    to improve performance. Note that this trend began because of the imbalance
    in speeds; it is not clear that architects have asked themselves whether
    this imbalance still holds for their designs."

    At the start of this thread
    <2025Jul29.104514@mips.complang.tuwien.ac.at>, I made exactly this
    argument about the relation between memory speed and clock rate. In
    that posting, I wrote:

    |my guess is that in the VAX 11/780 timeframe, 2-3MHz DRAM access
    |within a row would have been possible. Moreover, the VAX 11/780 has a
    |cache

    In the meantime, this discussion and some additional searching has
    unearthed that the VAX 11/780 memory subsystem has 600ns main memory
    cycle time (apparently without contiguous-access (row) optimization),
    with the cache lowering the average memory cycle time to 290ns.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch,alt.folklore.computers on Thu Aug 7 11:06:06 2025
    From Newsgroup: comp.arch

    In comp.arch Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There is a citation to Cocke as "private communication" in 1980 by
    Patterson in The Case for the Reduced Instruction Set Computer, 1980.

    "REASONS FOR INCREASED COMPLEXITY

    Why have computers become more complex? We can think of several reasons:
    Speed of Memory vs. Speed of CPU. John Cocke says that the complexity began
    with the transition from the 701 to the 709 [Cocke80]. The 701 CPU was about
    ten times as fast as the core main memory; this made any primitives that
    were implemented as subroutines much slower than primitives that were
    instructions. Thus the floating point subroutines became part of the 709
    architecture with dramatic gains. Making the 709 more complex resulted
    in an advance that made it more cost-effective than the 701. Since then,
    many "higher-level" instructions have been added to machines in an attempt
    to improve performance. Note that this trend began because of the imbalance
    in speeds; it is not clear that architects have asked themselves whether
    this imbalance still holds for their designs."

    At the start of this thread
    <2025Jul29.104514@mips.complang.tuwien.ac.at>, I made exactly this
    argument about the relation between memory speed and clock rate. In
    that posting, I wrote:

    |my guess is that in the VAX 11/780 timeframe, 2-3MHz DRAM access
    |within a row would have been possible. Moreover, the VAX 11/780 has a
    |cache

    In the meantime, this discussion and some additional searching has
    unearthed that the VAX 11/780 memory subsystem has 600ns main memory
    cycle time (apparently without contiguous-access (row) optimization),

    The memory subsystem was able to operate at bus speed: during a memory
    cycle the memory delivered 64 bits. The bus was 32 bits wide and needed
    3 cycles (200 ns each) to transfer the 64 bits. Making memory faster
    would have required redesigning the bus.

    with the cache lowering the average memory cycle time to 290ns.

    For the processor, the miss penalty was 1800 ns (the documentation says
    that this was due to bus protocol overhead). The cache hit rate was
    claimed to be 95%.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Aug 7 10:47:50 2025
    From Newsgroup: comp.arch

    Robert Swindells <rjs@fdy2.co.uk> writes:
    On Wed, 06 Aug 2025 17:00:03 -0400, EricP wrote:
    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The DG MV/8000 used PALs but The Soul of a New Machine hints that there
    were supply problems with them at the time.

    The PALs used for the MV/8000 were different, came out in 1978 (i.e.,
    very recent when the MV/8000 was designed), addressed shortcomings of
    the PLA Signetics 82S100 that had been available since 1975, and the
    PALs initially had yield problems; see <https://en.wikipedia.org/wiki/Programmable_Array_Logic#History>.

    Concerning the speed of the 82S100 PLA, <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Thu Aug 7 11:16:20 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" doesn't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.

    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    There are a few intermediate steps.

    The 801 demonstrated that a RISC, including caches and pipelining,
    would have been feasible at the time. It also demonstrated that
    somebody had thought of graph coloring algorithms.

    Russians in the late sixties proposed graph coloring as a way of
    doing memory allocation (and proved that optimal allocation is
    equivalent to graph coloring). They also proposed heuristics
    for graph coloring and experimentally showed that they
    are reasonably effective. This is not the same thing as
    register allocation, but the connection is rather obvious.
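    For readers who have not seen it, the core heuristic is easy to
    sketch. Below is a minimal greedy colouring of an interference graph
    in C; the graph, node count and register count are made-up
    illustration values, and a real allocator (Chaitin-style) would also
    choose the colouring order more carefully via simplification:

    #include <stdio.h>
    #include <stdbool.h>

    #define N 6   /* live ranges (nodes), illustration only */
    #define K 3   /* available registers (colours) */

    /* interference[i][j] != 0 means live ranges i and j overlap */
    static const int interference[N][N] = {
        {0,1,1,0,0,0},
        {1,0,1,1,0,0},
        {1,1,0,1,1,0},
        {0,1,1,0,1,1},
        {0,0,1,1,0,1},
        {0,0,0,1,1,0},
    };

    int main(void)
    {
        int colour[N];
        for (int i = 0; i < N; i++) {
            bool used[K] = {false};
            /* colours already taken by interfering, already-coloured nodes */
            for (int j = 0; j < i; j++)
                if (interference[i][j] && colour[j] >= 0)
                    used[colour[j]] = true;
            colour[i] = -1;                   /* -1 means "spill to memory" */
            for (int c = 0; c < K; c++)
                if (!used[c]) { colour[i] = c; break; }
            if (colour[i] < 0)
                printf("live range %d -> spilled\n", i);
            else
                printf("live range %d -> register %d\n", i, colour[i]);
        }
        return 0;
    }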
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Thu Aug 7 11:29:46 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    EricP wrote:
    Thomas Koenig wrote:

    Or look at the performance of the TTL implementation of HP-PA,
    which used PALs which were not available to the VAX 11/780
    designers, so it could be clocked a bit higher, but at
    a multiple of the performance of the VAX.

    So, Anton visiting DEC or me visiting Data General could have
    brought them a technology which would have significantly outperformed
    the VAX (especially if we brought along the algorithm for graph
    coloring). Some people at IBM would have been peeved at having
    somebody else "develop" this at the same time, but OK.

    Signetics 82S100/101 Field Programmable Logic Array FPLA (an AND-OR
    matrix) were available in 1975. Mask programmable PLA were available
    from TI circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The question isn't could one build a modern risc-style pipelined cpu
    from TTL in 1975 - of course one could. Nor do I see any question of
    could it beat a VAX 780 0.5 MIPS at 5 MHz - of course it could, easily.

    IIUC the description of the IBM 360-85, it had a pipeline which was
    much more aggressively clocked than the VAX. The 360-85 probably used
    ECL, but at VAX clock speeds it should be easily doable in Schottky
    TTL (used in the VAX).

    The question is could one build this at a commercially competitive price?

    Yes.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Aug 7 11:21:56 2025
    From Newsgroup: comp.arch

    Lars Poulsen <lars@cleo.beagle-ears.com> writes:
    ["Followup-To:" header set to comp.arch.]
    On 2025-08-06, John Levine <johnl@taugh.com> wrote:
    AFAIK the Cray-1 (1976) was the first 64-bit machine, and C for the
    Cray-1 and successors implemented, as far as I can determine

    type        bits
    char           8
    short int     64
    int           64
    long int      64
    pointer       64

    Not having a 16-bit integer type and not having a 32-bit integer type
    would make it very hard to adapt portable code, such as TCP/IP
    protocol processing.
    ...
    My concern is how do you express your desire for having e.g. an int16?
    All the portable code I know defines int8, int16, int32 by means of a
    typedef that adds an appropriate alias for each of these back to a
    native type. If "short" is 64 bits, how do you define a 16 bit type?
    Or did the compiler have native types __int16 etc.?

    I doubt it. If you want to implement TCP/IP protocol processing on a
    Cray-1 or its successors, better use shifts for picking apart or
    assembling the headers. One might also think about using C's bit
    fields, but, at least if you want the result to be portable, AFAIK bit
    fields are too laxly defined to be usable for that.
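    To make that concrete, here is a minimal sketch (my illustration, not
    Cray code) of picking apart the first 64-bit word of a TCP header
    with shifts and masks when the only integer type is 64 bits wide:

    #include <stdio.h>

    typedef unsigned long long word64;   /* stands in for the 64-bit Cray "int" */

    /* Extract a 16-bit field; field 0 is the most significant 16 bits of
     * the word, matching network byte order when whole 64-bit words are
     * read from the wire. */
    static word64 get16(word64 w, int field)
    {
        return (w >> (48 - 16 * field)) & 0xFFFF;
    }

    int main(void)
    {
        /* first 64 bits of a TCP header: src port, dst port, seq number (hi) */
        word64 w = 0x005001BB12345678ULL;
        printf("src port %llu, dst port %llu, seq(hi) %llu\n",
               get16(w, 0), get16(w, 1),
               (get16(w, 2) << 16) | get16(w, 3));
        return 0;
    }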

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Aug 7 11:38:54 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    EricP wrote:
    Signetics 82S100/101 Field Programmable Logic Array FPLA (an AND-OR
    matrix) were available in 1975. Mask programmable PLA were available
    from TI circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The question isn't could one build a modern risc-style pipelined cpu
    from TTL in 1975 - of course one could. Nor do I see any question of
    could it beat a VAX 780 0.5 MIPS at 5 MHz - of course it could, easily.

    I'm pretty sure I could use my Mk-I risc ISA and build a 5 stage
    pipeline running at 5 MHz getting 1 IPC sustained when hitting the
    200 ns cache
    (using some in-order superscalar ideas and two reg file write ports
    to "catch up" after pipeline bubbles).

    TTL risc would also be much cheaper to design and prototype.
    VAX took hundreds of people many many years.

    The question is could one build this at a commercially competitive
    price? There is a reason people did things sequentially in microcode.
    All those control decisions that used to be stored as bits in
    microcode now become real logic gates. And in SSI TTL you don't get
    many to the $. And many of those sequential microcode states become
    independent concurrent state machines, each with its own logic
    sequencer.

    I am confused. You gave a possible answer in the posting you are
    replying to.

    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software. While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.
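    As a rough illustration of how little code the software approach
    needs, here is a generic sketch of a refill handler for an assumed
    two-level page table on an assumed 32-bit machine; this is not the
    actual R2000 handler, which as far as I recall refills from a linear
    page table via the Context register:

    #include <stdint.h>

    #define PAGE_SHIFT 12
    #define PTE_VALID  0x1u

    /* tlb_write() stands in for the privileged "write TLB entry" operation. */
    extern void tlb_write(uint32_t vpn, uint32_t pte);
    extern uint32_t *root_table;           /* top-level table, 1024 entries */

    /* Invoked on a TLB miss with the faulting virtual address: a handful
     * of loads and shifts, one TLB write, then return from the exception. */
    void tlb_refill(uint32_t bad_vaddr)
    {
        uint32_t vpn = bad_vaddr >> PAGE_SHIFT;
        uint32_t *second = (uint32_t *)(uintptr_t)root_table[vpn >> 10];
        if (second == 0) {
            /* no mapping: hand off to the full page-fault path (not shown) */
            return;
        }
        uint32_t pte = second[vpn & 0x3FF];
        if (pte & PTE_VALID)
            tlb_write(vpn, pte);
        /* else: page-fault path */
    }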

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Aug 7 11:59:35 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Signetics 82S100/101 Field Programmable Logic Array FPLA (an AND-OR
    matrix) were available in 1975. Mask programmable PLA were available
    from TI circa 1970 but masks would be too expensive.

    Burroughs mainframers started designing with ECL gate arrays circa
    1981, and they shipped in 1987[*]. I suspect even FPLA or other PLAs
    would have been far too expensive to use to build a RISC CPU,

    The Signetics 82S100 was used in early Commodore 64s, so it could not
    have been expensive (at least in 1982, when these early C64s were
    built). PLAs were also used by HP when building the first HPPA CPU.

    especially for one of the BUNCH, for whom backward compatibility was
    paramount.

    Why should the cost of building a RISC CPU depend on whether you are
    in the BUNCH (Burroughs, UNIVAC, NCR, Control Data Corporation (CDC),
    and Honeywell)? And how is the cost of building a RISC CPU related to backwards compatibility?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Thu Aug 7 15:15:17 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Wed, 6 Aug 2025 16:19:11 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 22:17:00 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 17:31:34 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:
    In this case 'adc edx,edx' is just a slightly shorter encoding
    of 'adc edx,0'. The EDX register is zeroized a few lines above.

    OK, nice.

    BTW, it seems that in your code fragment above you forgot to
    zeroize EDX at the beginning of the iteration. Or am I missing
    something?

    No, you are not. I skipped pretty much all the setup code. :-)

    It's not the setup code that looks to me to be missing, but the
    zeroing of RDX in the body of the loop.

    I don't remember my code exactly, but the intent was that RDX would
    contain any incoming carries (0,1,2) from the previous iteration.

    Using ADCX/ADOX would not be an obvious speedup, at least not obvious to me.

    Terje

    I did a few tests on a few machines: Raptor Cove (i7-14700 P core),
    Gracemont (i7-14700 E core), Skylake-C (Xeon E-2176G) and Zen3 (EPYC
    7543P).
    In order to see the effects more clearly I had to modify Anton's
    function to one that operates on pointers, because otherwise too much
    time was spent at the caller's site copying things around, which made
    the measurements too noisy.

    void add3(uintNN_t *dst, const uintNN_t* a, const uintNN_t* b, const uintNN_t* c) {
    *dst = *a + *b + *c;
    }
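    (uintNN_t is not spelled out in this excerpt; presumably it is a
    typedef for an N-bit unsigned integer type. One way to get such a
    type from current compilers - an assumption on my part, not
    necessarily what was actually used - is C23's _BitInt:

    typedef unsigned _BitInt(65472) uintNN_t;  /* N = 65472 bits */

    Recent clang and gcc accept this width on x86-64, and the compiled
    code for add3 is what the clang/gcc rows below measure.)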


    After the change I saw a significant speed-up on 3 out of 4
    platforms. The only platform where the speed-up was non-significant
    was Skylake, probably because its rename stage is too narrow to
    profit from the change. The widest machine (Raptor Cove) benefited
    most.
    The results appear non-conclusive with regard to the question of
    whether the dependency between loop iterations is eliminated
    completely or just shortened to 1-2 clock cycles per iteration. Even
    the widest of my cores is relatively narrow. Considering that my
    variant of the loop contains 13 x86-64 instructions and 16 uOps, I am
    afraid that even the likes of the Apple M4 would be too narrow :(

    Here are results in nanoseconds for N=65472

    Platform    RC      GM      SK      Z3
    clang       896.1   1476.7  1453.2  1348.0
    gcc         879.2   1661.4  1662.9  1655.0
    x86         585.8   1489.3   901.5   672.0
    Terje's     772.6   1293.2  1012.6  1127.0
    My          397.5    803.8   965.3   660.0
    ADX         579.1   1650.1   728.9   853.0
    x86/u2      581.5   1246.2   679.9   584.0
    Terje's/u3  503.7    954.3   630.9   755.0
    My/u3       266.6    487.2   486.5   440.0
    ADX/u8      350.4    839.3   490.4   451.0

    'x86' is a variant that was sketched in one of my above
    posts. It calculates the sum in two passes over the arrays.
    'ADX' is a variant that uses the ADCX/ADOX instructions as suggested
    by Anton, but unlike his suggestion it does so in a loop rather than
    in a long straight-line code sequence.
    /u2, /u3, /u8 indicate unroll factors of the inner loop.

    Frequency:
    RC 5.30 GHz (Est)
    GM 4.20 GHz (Est)
    SK 4.25 GHz
    Z3 3.70 GHz


    Thanks for an interesting set of tests/results!

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Aug 7 13:34:26 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Lars Poulsen <lars@cleo.beagle-ears.com> writes:
    ["Followup-To:" header set to comp.arch.]
    On 2025-08-06, John Levine <johnl@taugh.com> wrote:

    ...
    My concern is how do you express your desire for having e.g. an int16?
    All the portable code I know defines int8, int16, int32 by means of a
    typedef that adds an appropriate alias for each of these back to a
    native type. If "short" is 64 bits, how do you define a 16 bit type?
    Or did the compiler have native types __int16 etc.?

    I doubt it. If you want to implement TCP/IP protocol processing on a
    Cray-1 or its successors, better use shifts for picking apart or
    assembling the headers. One might also think about using C's bit
    fields, but, at least if you want the result to be portable, AFAIK bit
    fields are too laxly defined to be usable for that.

    The more likely solution would be to push the protocol processing
    into an attached I/O processor, in those days.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch,alt.folklore.computers on Thu Aug 7 15:44:55 2025
    From Newsgroup: comp.arch

    Peter Flass wrote:
    On 8/6/25 10:25, John Levine wrote:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    Not aware of any platforms that do/did ILP64.

    AFAIK the Cray-1 (1976) was the first 64-bit machine, ...

    The IBM 7030 STRETCH was the first 64 bit machine, shipped in 1961,
    but I would be surprised if anyone had written a C compiler for it.

    It was bit addressable but memories in those days were so small that
    a full bit address was only 24 bits. So if I were writing a C
    compiler, pointers and ints would be 32 bits, char 8 bits, long 64
    bits.

    (There is a thing called STRETCH C Compiler but it's completely
    unrelated.)

    I don't get why bit-addressability was a thing? Intel iAPX 432 had
    it, too, and it seems like all it does is drastically shrink your
    address space and complexify instruction and operand fetch to (maybe)
    save a few bytes.
    Bit addressing, presumably combined with an easy way to mask the
    results/pick an arbitrary number of bits less than or equal to the
    register width, makes it easier to implement
    compression/decompression/codecs.

    However, since the only thing needed to do the same on current CPUs
    is a single shift after an aligned load, this feature costs far too
    much in reduced address space compared to what you gain.
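    A minimal sketch of that "shift after an aligned load" idea (not from
    any particular codec; it assumes little-endian, LSB-first bit
    numbering and ignores buffer-end handling):

    #include <stdint.h>
    #include <string.h>

    /* Read 'len' bits (1..57) starting at absolute bit offset 'bitpos' in
     * 'buf'. One 64-bit load, one shift, one mask. */
    static uint64_t get_bits(const uint8_t *buf, uint64_t bitpos, int len)
    {
        uint64_t word;
        memcpy(&word, buf + (bitpos >> 3), sizeof word);   /* load 8 bytes */
        word >>= (bitpos & 7);                             /* drop leading bits */
        return word & ((1ULL << len) - 1);                 /* keep 'len' bits */
    }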
    In the real world, all important codecs (like mp4 or aes crypto) end up
    as dedicated hardware, either AES opcodes or a standalone VLSI slice
    capable of CABAC decoding. The main reason is energy: A cell phone or
    laptop cannot stream video all day without having hardware support for
    the decoding task.
    One possibly relevant anecdote: Back in the late 1990s, when Intel
    was producing the first quad core Pentium Pro-style CPUs, I showed them
    that it was in fact possible for one of those CPUs to decode a maximum
    h264 bitstream, with 40 Mbit/s of CABAC coded data, in pure software.
    (Their own sw engineers had claimed that every other frame of a 60 Hz HD video would have to be skipped.)
    What Intel did was to license h264 decoding IP since that would use far
    less power and leave 3 of the 4 cores totally idle.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Peter Flass@Peter@Iron-Spring.com to comp.arch,alt.folklore.computers on Thu Aug 7 07:26:32 2025
    From Newsgroup: comp.arch

    On 8/6/25 22:29, Thomas Koenig wrote:


    That is one of the things I find astonishing - how a company like
    DG grew from a kitchen-table affair to the size they had.


    Recent history is littered with companies like this. The
    microcomputer revolution spawned scores of companies that started in
    someone's garage, ballooned to a major presence overnight, and then
    disappeared - bankrupt, bought out, split up, etc. Look at all the
    players in the S-100 CP/M space, or Digital Research. Only a few,
    like Apple and Microsoft, made it out alive.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Peter Flass@Peter@Iron-Spring.com to comp.arch,alt.folklore.computers on Thu Aug 7 07:34:28 2025
    From Newsgroup: comp.arch

    On 8/7/25 06:44, Terje Mathisen wrote:

    Bit addressing, presumably combined with an easy way to mask the
    results/pick an arbitrary number of bits less than or equal to the
    register width, makes it easier to implement
    compression/decompression/codecs.

    However, since the only thing needed to do the same on current CPUs
    is a single shift after an aligned load, this feature costs far too
    much in reduced address space compared to what you gain.


    Bit addressing *as an option* (Bit Load, Bit store instructions, etc.)
    is a great idea, for example it greatly simplifies BitBlt logic. The
    432's use of bit addressing for everything, especially instructions,
    seems just too cute. I forget the details, it's been a while since I
    looked, but it forced extremely small code segments which, combined with
    the segmentation logic, etc. really impacted performance.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Aug 7 15:03:23 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Signetics 82S100/101 Field Programmable Logic Array FPLA (an AND-OR
    matrix) were available in 1975. Mask programmable PLA were available
    from TI circa 1970 but masks would be too expensive.

    Burroughs mainframers started designing with ECL gate arrays circa
    1981, and they shipped in 1987[*]. I suspect even FPLA or other PLAs
    would have been far too expensive to use to build a RISC CPU,

    The Signetics 82S100 was used in early Commodore 64s, so it could not
    have been expensive (at least in 1982, when these early C64s were
    built). PLAs were also used by HP when building the first HPPA CPU.

    especially for one of the BUNCH, for whom backward compatibility was
    paramount.

    Why should the cost of building a RISC CPU depend on whether you are
    in the BUNCH (Burroughs, UNIVAC, NCR, Control Data Corporation (CDC),
    and Honeywell)? And how is the cost of building a RISC CPU related to
    backwards compatibility?

    Because you need to sell it. Without disrupting your existing
    customer base.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Ames@commodorejohn@gmail.com to comp.arch,alt.folklore.computers on Thu Aug 7 08:28:52 2025
    From Newsgroup: comp.arch

    On Wed, 6 Aug 2025 23:45:44 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    It added its own misfeatures, though.

    Unfortunately, yes. "User areas" in particular are just a completely
    useless bastard child of proper subdirectories and something like
    TOPS-10's programmer/project pairs; even making user area 0 a "common
    area" accessible from any of the others would've helped, but they
    didn't do that. It's a sign of how misconceived they were that MS-DOS
    (in re-implementing CP/M) dropped them entirely and nobody complained,
    then added real subdirectories later.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Thu Aug 7 14:57:59 2025
    From Newsgroup: comp.arch

    Peter Flass <Peter@Iron-Spring.com> writes:
    [IBM STRETCH bit-addressable]
    I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
    too

    One might come to think that it's the signature of overambitious
    projects that eventually fail.

    However, in the case of the IBM STRETCH, I think there's a good
    excuse: If you go from word addressing to subunit addressing (not sure
    why Stretch went there, however; does a supercomputer need that?), why
    stop at characters (especially given that character size at the time
    was still not settled)? Why not continue down to bits?

    The S/360 then found the compromise that conquered the world: Byte
    addressing with 8-bit bytes.

    Why iAPX432 went for bit addressing at a time when byte addressing and
    the 8-bit byte was firmly established, over ten years after the S/360
    and 5 years after the PDP-11 is a mystery, however.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Ames@commodorejohn@gmail.com to comp.arch,alt.folklore.computers on Thu Aug 7 08:38:56 2025
    From Newsgroup: comp.arch

    On Thu, 7 Aug 2025 02:22:05 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    That disparity between CPU and RAM speeds is even greater today than
    it was back then. Yet we have moved away from adding ever-more-complex instructions, and are getting better performance with simpler ones.

    How come? Caching.

    Yes, but complex instructions also make pipelining and out-of-order
    execution much more difficult - to the extent that, as far back as the
    Pentium Pro, Intel has had to implement the x86 instruction set as a
    microcoded program running on top of a simpler RISC architecture.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch,alt.folklore.computers on Thu Aug 7 17:52:05 2025
    From Newsgroup: comp.arch

    John Ames wrote:
    On Thu, 7 Aug 2025 02:22:05 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    That disparity between CPU and RAM speeds is even greater today than
    it was back then. Yet we have moved away from adding ever-more-complex
    instructions, and are getting better performance with simpler ones.

    How come? Caching.

    Yes, but complex instructions also make pipelining and out-of-order
    execution much more difficult - to the extent that, as far back as the Pentium Pro, Intel has had to implement the x86 instruction set as a microcoded program running on top of a simpler RISC architecture.

    That's simply wrong:

    The PPro had close to zero microcode actually running in any user program.

    What it did have was decoders that would look at complex operations and
    spit out two or more basic operations, like load+execute.

    Later on we've seen the opposite where cmp+branch could be combined into
    a single internal op.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From drb@drb@ihatespam.msu.edu (Dennis Boone) to comp.arch,alt.folklore.computers on Thu Aug 7 15:54:16 2025
    From Newsgroup: comp.arch

    However, in the case of the IBM STRETCH, I think there's a good
    excuse: If you go from word addressing to subunit addressing (not sure
    why Stretch went there, however; does a supercomputer need that?), why
    stop at characters (especially given that character size at the time
    was still not settled)? Why not continue down to bits?

    Remember who they built STRETCH for.

    De
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lynn Wheeler@lynn@garlic.com to comp.arch,alt.folklore.computers on Thu Aug 7 07:32:35 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    It's a 32 bit architecture with 31 bit addressing, kludgily extended
    from 24 bit addressing in the 1970s.

    2nd half of the 70s kludge, with 370s that could have 64mbytes of real
    memory but only 24bit addressing ... the virtual memory page table
    entry (PTE) had 16bits with 2 "unused bits" ... and a 12bit page
    number (12bits of 4kbyte pages, i.e. 24bits) ... the two unused bits
    were then defined to be prepended to the page number ... making a
    14bit page number ... for 26bits, i.e. 64mbytes (instructions were
    still 24bit, but virtual memory was used to translate to 26bit real
    addressing).

    original 360 I/O had only 24bit addressing; adding virtual memory (to
    all 370s) added IDALs. The CCW was still 24bit, and CCWs were still
    being built by applications, now running in virtual memory ... which
    (effectively) assumed that any large storage location consisted of one
    contiguous area. Moving to virtual memory, a large "contiguous" I/O
    area was now broken into page-size chunks in non-contiguous areas.
    When translating a "virtual" I/O program, the original virtual CCW ...
    would be converted to a CCW with real addresses and flagged as IDAL
    ... where the CCW pointed to an IDAL list of real addresses ... 32bit
    words (31 bits specifying the real address) for each (possibly
    non-contiguous) real page involved.
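    A rough sketch of that CCW-to-IDAL rewrite (the field layout below is
    a simplified assumption, not the real S/370 format, and v2r() stands
    in for the OS's virtual-to-real page translation):

    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SIZE    4096u
    #define CCW_FLAG_IDA 0x04u   /* assumed flag bit for indirect data addressing */

    typedef struct {             /* simplified CCW: command, flags, count, address */
        uint8_t  cmd;
        uint8_t  flags;
        uint16_t count;
        uint32_t addr;           /* virtual address as built by the application */
    } ccw_t;

    extern uint32_t v2r(uint32_t vaddr);   /* virtual-to-real translation */

    /* Rewrite one virtual CCW: translate the buffer page by page into an
     * IDAL (list of real addresses), point the CCW at the IDAL, set the flag. */
    static size_t build_idal(ccw_t *ccw, uint32_t *idal, size_t max_idaws)
    {
        uint32_t va  = ccw->addr;
        uint32_t end = va + ccw->count;
        size_t n = 0;

        while (va < end && n < max_idaws) {
            idal[n++] = v2r(va);                         /* real address of this piece */
            va = (va & ~(PAGE_SIZE - 1)) + PAGE_SIZE;    /* advance to next page boundary */
        }
        ccw->addr   = (uint32_t)(uintptr_t)idal;         /* CCW now points at the IDAL */
        ccw->flags |= CCW_FLAG_IDA;
        return n;
    }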
    --
    virtualization experience starting Jan1968, online at home since Mar1970
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch,alt.folklore.computers on Thu Aug 7 13:01:07 2025
    From Newsgroup: comp.arch

    On 8/7/2025 7:57 AM, Anton Ertl wrote:
    Peter Flass <Peter@Iron-Spring.com> writes:
    [IBM STRETCH bit-addressable]
    I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
    too

    One might come to think that it's the signature of overambitious
    projects that eventually fail.

    Interesting. While it seems to be sufficient to predict the failure of
    a project, it certainly isn't necessary. So I think calling it a
    signature is too extreme.


    However, in the case of the IBM STRETCH, I think there's a good
    excuse: If you go from word addressing to subunit addressing (not sure
    why Stretch went there, however; does a supercomputer need that?)

    While perhaps not absolutely necessary, it is very useful. For
    example, for inputting the parameters for, and showing the results
    of, a simulation in human-readable format, and for a compiler. While
    you could do all of those things on another (different architecture)
    computer, and transfer the results via, say, magnetic tape, that is
    pretty inconvenient and increases the cost for that additional
    computer. And there is interaction with the console.


    , why
    stop at characters (especially given that character size at the time
    was still not settled)? Why not continue down to bits?

    According to Wikipedia

    https://en.wikipedia.org/wiki/IBM_7030_Stretch#Data_formats

    it supported both binary and decimal fixed point arithmetic (so it
    helps to have four-bit "characters"), the floating point
    representation had a four-bit sign, and alphanumeric characters could
    be anywhere from 1-8 bits. And as you say, 6 bit characters were
    common, especially for scientific computers.


    The S/360 then found the compromise that conquered the world: Byte
    addressing with 8-bit bytes.

    Yes, but several years later.

    Another factor that may have contributed. According to the same
    Wikipedia article, the requirements for the system came from Edward
    Teller then at Lawrence Livermore Labs, so there may have been some
    classified requirement that led to bit addressability.


    Why iAPX432 went for bit addressing at a time when byte addressing and
    the 8-bit byte was firmly established, over ten years after the S/360
    and 5 years after the PDP-11 is a mystery, however.

    Agreed.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Al Kossow@aek@bitsavers.org to comp.arch,alt.folklore.computers on Thu Aug 7 13:34:09 2025
    From Newsgroup: comp.arch

    The TI TMS34020 graphics processor may have been the last CPU to have bit addressing.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Aug 7 23:48:10 2025
    From Newsgroup: comp.arch

    On Tue, 5 Aug 2025 13:04:39 -0500
    "Brian G. Lucas" <bagel99@gmail.com> wrote:

    Hi, Brian
    By chance, do you happen to know why Mitch Alsup recently disappeared
    from the Usenet?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch,alt.folklore.computers on Thu Aug 7 20:54:01 2025
    From Newsgroup: comp.arch

    According to Terje Mathisen <terje.mathisen@tmsw.no>:
    I don't get why bit-addressability was a thing? Intel iAPX 432 had it,
    too, and it seems like all it does is drastically shrink your address
    space and complexify instruction and operand fetch to (maybe) save a few
    bytes.

    Bit addressing, presumably combined with an easy way to mask the >results/pick an arbitrary number of bits less or equal to register
    width, makes it easier to impement compression/decompression/codecs.

    STRETCH was designed in the late 1950s. Shannon-Fano coding was
    invented in the 1940s, and Huffman published his paper on optimal
    coding in 1952, but modern codes like LZ were only invented in the
    1970s. I doubt anyone did compression or decompression on STRETCH
    other than packing and unpacking bit fields.

    IBM's commercial machines were digit or character addressed, with a variety of different representations. They didn't know what the natural byte size would be so they let you use whatever you wanted. That made it easy to pack and unpack bitfields to store data compactly in fields of exactly the minimum size.

    The NSA was an important customer, for whom they built the 7950 HARVEST coprocessor
    and it's quite plausible that they had applications for which bit addressing was useful.

    The paper on the design of S/360 said they looked at addressing of 6 bit characters, and 8 bit characters, with 4-bit BCD digits sometimes stored in them. It was evident at the time that 6 bit characters were too small, so
    8 bits it was. They don't mention bit addressing, so they'd presumably already decided that was a bad idea.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Brian G. Lucas@bagel99@gmail.com to comp.arch on Thu Aug 7 16:01:07 2025
    From Newsgroup: comp.arch

    On 8/7/25 3:48 PM, Michael S wrote:
    On Tue, 5 Aug 2025 13:04:39 -0500
    "Brian G. Lucas" <bagel99@gmail.com> wrote:

    Hi, Brian
    By chance, do you happen to know why Mitch Alsup recently disappeared
    from the Usenet?

    No, I do not. And I am worried.

    brian

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch,alt.folklore.computers on Thu Aug 7 21:53:11 2025
    From Newsgroup: comp.arch

    On Thu, 7 Aug 2025 17:52:05 +0200, Terje Mathisen
    <terje.mathisen@tmsw.no> wrote:

    John Ames wrote:
    On Thu, 7 Aug 2025 02:22:05 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

    That disparity between CPU and RAM speeds is even greater today than
    it was back then. Yet we have moved away from adding ever-more-complex
    instructions, and are getting better performance with simpler ones.

    How come? Caching.

    Yes, but complex instructions also make pipelining and out-of-order
    execution much more difficult - to the extent that, as far back as the
    Pentium Pro, Intel has had to implement the x86 instruction set as a
    microcoded program running on top of a simpler RISC architecture.

    That's simply wrong:

    The PPro had close to zero microcode actually running in any user program.

    What it did have was decoders that would look at complex operations and
    spit out two or more basic operations, like load+execute.

    Later on we've seen the opposite where cmp+branch could be combined into
    a single internal op.

    Terje

    You say "tomato". 8-)

    It's still "microcode" for some definition ... just not a classic
    "interpreter" implementation where a library of routines implements
    the high level instructions.

    The decoder converts x86 instructions into traces of equivalent wide
    micro instructions which are directly executable by the core. The
    traces then are cached separately [there is a $I0 "microcache" below
    $I1] and can be re-executed (e.g., for loops) as long as they remain
    in the microcache. If they age out, the decoder has to produce them
    again from the "source" x86 instructions.

    So the core is executing microinstructions - not x86 - and the program
    as executed reasonably can be said to be "microcoded" ... again for
    some definition.

    YMMV.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch,alt.folklore.computers on Fri Aug 8 01:57:53 2025
    From Newsgroup: comp.arch

    In article <107008b$3g8jl$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqej$36gll$3@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Peter Flass <Peter@Iron-Spring.com> schrieb:

    The support issues alone were killers. Think about the
    Orange/Grey/(Blue?) Wall of VAX documentation, and then look at the
    five-page flimsy you got with a micro. The customers were willing to
    accept cr*p from a small startup, but wouldn't put up with it from IBM >>>> or DEC.

    Using UNIX faced stiff competition from AT&T's internal IT people,
    who wanted to run DEC's operating systems on all PDP-11s within
    the company (basically, they wanted to kill UNIX). They pointed
    towards the large amount of documentation that DEC provided, compared
    to the small amount for UNIX, as proof of superiority. The UNIX people
    saw it differently...

    I've never heard this before, and I do not believe that it is
    true. Do you have a source?

    Hmm... I _think_ it was in a talk given by the UNIX people,
    but I may be misremembering.

    I have heard similar stories about DEC, but not AT&T. The Unix
    fortune file used to (in)famously have a quote from Ken Olsen
    about the relative volume of documentation between Unix and VMS
    (reproduced below).

    - Dan C.

    BEGIN FORTUNE<---

    One of the questions that comes up all the time is: How
    enthusiastic is our support for UNIX?
    Unix was written on our machines and for our machines many
    years ago. Today, much of UNIX being done is done on our machines.
    Ten percent of our VAXs are going for UNIX use. UNIX is a simple
    language, easy to understand, easy to get started with. It's great for students, great for somewhat casual users, and it's great for
    interchanging programs between different machines. And so, because of
    its popularity in these markets, we support it. We have good UNIX on
    VAX and good UNIX on PDP-11s.
    It is our belief, however, that serious professional users will
    run out of things they can do with UNIX. They'll want a real system and
    will end up doing VMS when they get to be serious about programming.
    With UNIX, if you're looking for something, you can easily and
    quickly check that small manual and find out that it's not there. With
    VMS, no matter what you look for -- it's literally a five-foot shelf of documentation -- if you look long enough it's there. That's the
    difference -- the beauty of UNIX is it's simple; and the beauty of VMS
    is that it's all there.
    -- Ken Olsen, President of DEC, 1984

    END FORTUNE<---
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Fri Aug 8 03:51:08 2025
    From Newsgroup: comp.arch

    On Thu, 7 Aug 2025 15:44:55 +0200, Terje Mathisen wrote:

    However, since the only thing needed to do the same on current CPUs is a single shift after an aligned load, this feature costs far too much in reduced address space compared to what you gain.

    Reserving the bottom 3 bits for a bit offset in a 64-bit address, even if
    it is unused in most instructions, doesn’t seem like such a big cost. And
    it unifies the pointer representation for all data types, which can make things more convenient in a higher-level language.
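    A sketch of what such a pointer could look like to a higher-level
    language (my illustration of the idea, not any actual architecture):
    the byte address lives in the upper 61 bits, the bit offset in the
    bottom 3.

    #include <stdint.h>

    typedef uint64_t bitptr;   /* byte address << 3, bit offset in the low 3 bits */

    static bitptr make_bitptr(const void *byte_addr, unsigned bit)   /* bit: 0..7 */
    {
        return ((uint64_t)(uintptr_t)byte_addr << 3) | (bit & 7u);
    }

    static unsigned load_bit(bitptr p)
    {
        const uint8_t *byte = (const uint8_t *)(uintptr_t)(p >> 3);
        return (*byte >> (p & 7u)) & 1u;   /* LSB-first bit numbering assumed */
    }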
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch,alt.folklore.computers on Fri Aug 8 03:57:17 2025
    From Newsgroup: comp.arch

    On Thu, 7 Aug 2025 07:26:32 -0700, Peter Flass wrote:

    On 8/6/25 22:29, Thomas Koenig wrote:

    That is one of the things I find astonishing - how a company like DG
    grew from a kitchen-table affair to the size they had.

    Recent history is littered with companies like this.

    DG were famously the setting for that Tracy Kidder book, “The Soul Of A
    New Machine”, chronicling their belated and high-pressure project to enter the 32-bit virtual-memory supermini market and compete with DEC’s VAX.

    Looking at things with the eyes of a software guy, I found some of their hardware decisions questionable. Like they thought they were very clever
    to avoid having separate privilege modes in the processor status register
    like the VAX did: instead, they encoded the access privilege mode in the address itself.

    I guess they thought that 32 address bits left plenty to spare for
    something like this. But I think it just shortened the life of their 32-
    bit architecture by that much more.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Aug 8 01:41:00 2025
    From Newsgroup: comp.arch

    On 8/7/2025 4:01 PM, Brian G. Lucas wrote:
    On 8/7/25 3:48 PM, Michael S wrote:
    On Tue, 5 Aug 2025 13:04:39 -0500
    "Brian G. Lucas" <bagel99@gmail.com> wrote:

    Hi, Brian
    By chance, do you happen to know why Mitch Alsup recently disappeared
    from the Usenet?

    No, I do not.  And I am worried.


    Yeah, that is concerning...


    brian


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Fri Aug 8 06:16:51 2025
    From Newsgroup: comp.arch

    George Neuner <gneuner2@comcast.net> writes:
    On Thu, 7 Aug 2025 17:52:05 +0200, Terje Mathisen
    <terje.mathisen@tmsw.no> wrote:

    John Ames wrote:
    The PPro had close to zero microcode actually running in any user program.

    What it did have was decoders that would look at complex operations
    and spit out two or more basic operations, like load+execute.

    Later on we've seen the opposite where cmp+branch could be combined into
    a single internal op.

    Terje

    You say "tomato". 8-)

    It's still "microcode" for some definition ... just not a classic >"interpreter" implementation where a library of routines implements
    the high level instructions.

    Exactly, for most instructions there is no microcode. There are
    microops, with 118 bits on the Pentium Pro (P6). They are not RISC instructions (no RISC has 118-bit instructions). At best one might
    argue that one P6 microinstruction typically does what a RISC
    instruction does in a RISC. But in the end the reorder buffer still
    has to deal with the CISC instructions.

    The decoder converts x86 instructions into traces of equivalent wide
    micro instructions which are directly executable by the core. The
    traces then are cached separately [there is a $I0 "microcache" below
    $I1] and can be re-executed (e.g., for loops) as long as they remain
    in the microcache.

    No such cache in the P6 or any of its descendants until the Sandy
    Bridge (2011). The Pentium 4 has a microop cache, but eventually
    (with Core Duo, Core2 Duo) it was replaced with P6 descendants that have
    no microop cache. Actually, the Core 2 Duo has a loop buffer which
    might be seen as a tiny microop cache. Microop caches and loop
    buffers still have to contain information about which microops belong
    to the same CISC instruction, because otherwise the reorder buffer
    could not commit/execute* CISC instructions.

    * OoO microarchitecture terminology calls what the reorder buffer does
    "retire" or "commit". But this is where the speculative execution
    becomes architecturally visible ("commit"), so from an architectural
    view it is execution.

    Followups set to comp.arch

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch,alt.folklore.computers on Fri Aug 8 11:43:00 2025
    From Newsgroup: comp.arch

    On Fri, 8 Aug 2025 03:57:17 -0000 (UTC)
    Lawrence D'Oliveiro <ldo@nz.invalid> wrote:


    I guess they thought that 32 address bits left plenty to spare for
    something like this. But I think it just shortened the life of their
    32- bit architecture by that much more.


    History proved them right. The Eagle series didn't last long enough
    to run out of its 512MB address space.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Aug 8 11:58:39 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Tue, 5 Aug 2025 13:04:39 -0500
    "Brian G. Lucas" <bagel99@gmail.com> wrote:

    Hi, Brian
    By chance, do you happen to know why Mitch Alsup recently disappeared
    from the Usenet?

    I've been in contact, he lost his usenet provider, and the one I am
    using does not seem to accept new registrations any longer.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Aug 8 13:20:33 2025
    From Newsgroup: comp.arch

    On Fri, 8 Aug 2025 11:58:39 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 13:04:39 -0500
    "Brian G. Lucas" <bagel99@gmail.com> wrote:

    Hi, Brian
    By chance, do you happen to know why Mitch Alsup recently
    disappeared from the Usenet?

    I've been in contact,

    Good.

    he lost his usenet provider,

    Terje


    I was suspecting as much. What made me worry is that at almost the
    same date he stopped posting on the RWT forum.

    and the one I am using does not seem to accept new registrations
    any longer.

    Eternal September does not accept new registrations?
    I think, if that is true, Ray Banana will make an exception for Mitch
    if asked personally.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Aug 8 10:08:43 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    Robert Swindells <rjs@fdy2.co.uk> writes:
    On Wed, 06 Aug 2025 17:00:03 -0400, EricP wrote:
    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.
    The DG MV/8000 used PALs but The Soul of a New Machine hints that there
    were supply problems with them at the time.

    The PALs used for the MV/8000 were different, came out in 1978 (i.e.,
    very recent when the MV/8000 was designed), addressed shortcomings of
    the PLA Signetics 82S100 that had been available since 1975, and the
    PALs initially had yield problems; see <https://en.wikipedia.org/wiki/Programmable_Array_Logic#History>.

    I don't know why they think these are problems with the 82S100.
    These complaints sound like they come from a hobbyist.

    Concerning the speed of the 82S100 PLA, <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.

    - anton

    Yes. This risc-VAX would have to decode 1 instruction per clock to
    keep a pipeline full, so I envision running the fetch buffer
    through a bank of those PLAs and generating a uOp out.

    I don't know whether the instructions can be byte aligned variable size
    or have to be fixed 32-bits in order to meet performance requirements.
    I would prefer the flexibility of variable size but
    the Fetch byte alignment shifter adds a lot of logic.

    If variable then the high frequency instructions like MOV rd,rs
    and ADD rsd,rs fit into two bytes. The longest instruction looks like
    12 bytes, 4 bytes operation specifier (opcode plus registers)
    plus 8 bytes immediate FP64.

    If a variable size instruction arranges that all the critical parse
    information is located in the first 8-16 bits then we can just run
    those bits through a PLAs in parallel and have that control the
    alignment shifter as well as generate the uOp.
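    As a software model of that idea (the encoding below is invented
    purely for illustration; the point is only that the length and
    routing decisions come from the first byte, the way a bank of PLAs
    would decide them):

    #include <stdint.h>

    /* Hypothetical RISC-VAX-style encoding, made up for this sketch:
     *   first byte 0x00-0x7F : 2-byte form  (MOV rd,rs / ADD rsd,rs style)
     *   first byte 0x80-0xBF : 4-byte form  (opcode plus registers)
     *   first byte 0xC0-0xFF : 12-byte form (4-byte specifier + 8-byte FP64 immediate)
     * Everything the alignment shifter needs - the length - comes from bits 7:6. */
    static int insn_length(uint8_t first_byte)
    {
        switch (first_byte >> 6) {
        case 0: case 1: return 2;
        case 2:         return 4;
        default:        return 12;
        }
    }

    /* Fetch-buffer consumer: one table/PLA lookup per instruction, then
     * advance the alignment shifter by the decoded length. */
    static const uint8_t *next_insn(const uint8_t *fetch_buf)
    {
        return fetch_buf + insn_length(fetch_buf[0]);
    }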

    I envision this Fetch buffer alignment shifter built from tri-state
    buffers rather than muxes as TTL muxes are very slow and we would
    need a lot of them.

    The whole fetch-parse-decode should fit on a single board.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Aug 8 14:22:28 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Fri, 8 Aug 2025 11:58:39 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 13:04:39 -0500
    "Brian G. Lucas" <bagel99@gmail.com> wrote:

    Hi, Brian
    By chance, do you happen to know why Mitch Alsup recently
    disappeared from the Usenet?

    I've been in contact,

    Good.

    he lost his usenet provider,

    Terje


    I was suspecting as much. What made me worry is that at almost the
    same date he stopped posting on the RWT forum.

    and the one I am using does not seem to accept new registrations
    any longer.

    Eternal September does not accept new registrations?
    I think, if that is true, Ray Banana will make an exception for Mitch
    if asked personally.


    www.usenetserver.com is priced reasonably. I've been using them
    for well over a decade now.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Fri Aug 8 18:34:46 2025
    From Newsgroup: comp.arch

    On 2025-08-08 17:22, Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Fri, 8 Aug 2025 11:58:39 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Tue, 5 Aug 2025 13:04:39 -0500
    "Brian G. Lucas" <bagel99@gmail.com> wrote:

    Hi, Brian
    By chance, do you happen to know why Mitch Alsup recently
    disappeared from the Usenet?

    I've been in contact,

    Good.

    he lost his usenet provider,

    Terje


    I was suspecting as much. What made me worry is that at almost the
    same date he stopped posting on the RWT forum.

    and the one I am using does not seem to accept new registrations
    any longer.

    Eternal September does not accept new registrations?
    I think, if that is true, Ray Banana will make an exception for Mitch
    if asked personally.


    www.usenetserver.com is priced reasonably. I've been using them
    for well over a decade now.


    I have been happy with http://news.individual.net/. 10 euro/year.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch on Fri Aug 8 19:07:17 2025
    From Newsgroup: comp.arch

    On Fri, 8 Aug 2025 13:20:33 +0300, Michael S
    <already5chosen@yahoo.com> wrote:


    Eternal September does not accept new registrations?
    I think, if it is true, Ray Banana will make excception for Mitch if
    asked personally.

    Eternal September still accepts new users. What they don't support is
    shrouding your email address other than using ".invalid" as the
    domain. It's trivially easy to figure out the real addresses, so for
    users who care about address hiding, ES would not be a good choice.

    However, Mitch has tended to use his own name for his addresses in the
    past, so I doubt he cares much about hiding. ES is free (0$) so
    somebody who can reach him ought to mention it.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch on Fri Aug 8 19:48:59 2025
    From Newsgroup: comp.arch

    On Fri, 08 Aug 2025 06:16:51 GMT, anton@mips.complang.tuwien.ac.at
    (Anton Ertl) wrote:

    George Neuner <gneuner2@comcast.net> writes:

    The decoder converts x86 instructions into traces of equivalent wide
    micro instructions which are directly executable by the core. The
    traces then are cached separately [there is a $I0 "microcache" below
    $I1] and can be re-executed (e.g., for loops) as long as they remain
    in the microcache.

    No such cache in the P6 or any of its descendants until the Sandy
    Bridge (2011). The Pentium 4 has a microop cache, but eventually
    (with Core Duo, Core2 Duo) it was replaced with P6 descendants that have
    no microop cache. Actually, the Core 2 Duo has a loop buffer which
    might be seen as a tiny microop cache. Microop caches and loop
    buffers still have to contain information about which microops belong
    to the same CISC instruction, because otherwise the reorder buffer
    could not commit/execute* CISC instructions.

    * OoO microarchitecture terminology calls what the reorder buffer does
    "retire" or "commit". But this is where the speculative execution
    becomes architecturally visible ("commit"), so from an architectural
    view it is execution.

    Followups set to comp.arch

    - anton

    Thanks for the correction. I did a fair amount of SIMD coding for the
    Pentium II, III and IV, so I was more aware of their architecture.
    After the IV, I moved on to other things, so I haven't kept up.

    Question:
    It would seem that, lacking the microop cache the decoder would need
    to be involved, e.g., for every iteration of a loop, and there would
    be more pressure on I$1. Did these prove to be a bottleneck for the
    models lacking cache? [either? or something else?]
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch on Fri Aug 8 21:43:11 2025
    From Newsgroup: comp.arch

    On Wed, 06 Aug 2025 10:23:26 -0400, EricP
    <ThatWouldBeTelling@thevillage.com> wrote:

    George Neuner wrote:
    On Tue, 5 Aug 2025 05:48:16 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    Waldek Hebisch <antispam@fricas.org> schrieb:
    I am not sure what technology they used
    for the register file. To me the most likely is fast RAM, but that
    normally would give 1 R/W port.
    They used fast SRAM and had three copies of their registers,
    for 2R1W.


    I did use 11/780, 8600, and briefly even MicroVax - but I'm primarily
    a software person, so please forgive this stupid question.


    Why three copies?
    Also did you mean 3 total? Or 3 additional copies (4 total)?


    Given 1 R/W port each I can see needing a pair to handle cases where
    destination is also a source (including autoincrement modes). But I
    don't see a need ever to sync them - you just keep track of which was
    updated most recently, read that one and - if applicable - write the
    other and toggle.

    Since (at least) the early models evaluated operands sequentially,
    there doesn't seem to be a need for more. Later models had some
    semblance of pipeline, but it seems that if the /same/ value was
    needed multiple times, it could be routed internally to all users
    without requiring additional reads of the source.

    Or do I completely misunderstand? [Definitely possible.]

    To make a 2R 1W port reg file from a single port SRAM you use two banks
    which can be addressed separately during the read phase at the start of
    the clock phase, and at the end of the clock phase you write both banks
    at the same time on the same port number.

    I was aware of this (thank you), but I was trying to figure out why
    the VAX - particularly early ones - would need it. And also it does
    not mesh with Waldek's comment [at top] about 3 copies.


    The VAX did have one (pathological?) address mode:

    displacement deferred indexed @dis(Rn)[Rx]

    in which Rn and Rx could be the same register. It is the only mode
    for which a single operand could reference a given register more than
    once. I never saw any code that actually did this, but the manual
    does say it was possible.

    But even with this situation, it seems that the register would only
    need to be read once (per operand, at least) and the value could be
    used twice.


    The 780 wiring parts list shows Nat Semi 85S68 which are
    16*4b 1RW port, 40 ns access SRAMS, tri-state output,
    with latched read output to eliminate data race through on write.

    So they have two 16 * 32b banks for the 16 general registers.
    The third 16 * 32b bank was likely for microcode temp variables.

    The thing is, yes, they only needed 1R port for instruction operands
    because sequential decode could only produce one operand at a time.
    Even on later machines circa 1990 like 8700/8800 or NVAX the general
    register file is only 1R1W port, the temp register bank is 2R1W.

    So the 780 second read port is likely used the same as on later VAXen:
    it's for reading the temp values concurrently with an operand register.
    The operand registers were read one at a time because of the decode
    bottleneck.

    I'm wondering how they handled modifying address modes like autoincrement
    and still had precise interrupts.

    ADDL3 (r2)+, (r2)+, (r2)+

    You mean exceptions? Exceptions were handled between instructions.
    VAX had no iterating string-copy/move instructions so every
    instruction logically could stand alone.

    VAX separately identified the case where the instruction completed
    with a problem (trap) from where the instruction could not complete
    because of the problem (fault), but in both cases it indicated the
    offending instruction.


    the first (left) operand reads r2 then adds 4, which the second r2 reads
    and also adds 4, then the third again. It doesn't have a renamer so
    it has to stash the first modified r2 in the temp registers,
    and (somehow) pass that info to decode of the second operand
    so Decode knows to read the temp r2 not the general r2,
    and same for the third operand.
    At the end of the instruction if there is no exception then
    temp r2 is copied to general r2 and memory value is stored.

    I'm guessing in Decode someplace there are comparators to detect when
    the operand registers are the same so microcode knows to switch to the
    temp bank for a modified register.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Aug 9 08:07:12 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Concerning the speed of the 82S100 PLA,
    <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.

    Were there different versions, maybe?

    https://deramp.com/downloads/mfe_archive/050-Component%20Specifications/Signetics-Philips/82S100%20FPGA.pdf
    gives an I/O propagation delay of 80 ns max.

    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Sat Aug 9 09:04:40 2025
    From Newsgroup: comp.arch

    In article <1070cj8$3jivq$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.

    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    There are a few intermediate steps.

    The 801 demonstrated that a RISC, including caches and pipelining,
    would have been feasible at the time. It also demonstrated that
    somebody had thought of graph coloring algorithms.

    This is the part where the argument breaks down. VAX and 801
    were roughly contemporaneous, with VAX being commercially
    available around the time the first 801 prototypes were being
    developed. There's simply no way in which the 801,
    specifically, could have had significant impact on VAX
    development.

    If you're just talking about RISC design techniques generically,
    then I dunno, maybe, sure, why not, but that's a LOT of
    speculation with hindsight-colored glasses. Furthermore, that
    speculation focuses solely on technology, and ignores the
    business realities that VAX was born into. Maybe you're right,
    maybe you're wrong, we can never _really_ say, but there was a
    lot more that went into the decisions around the VAX design than
    just technology.

    There can also be no doubt that a RISC-type machine would have
    exhibited the same performance advantages (at least in integer
    performance) as a RISC vs CISC 10 years later. The 801 did so
    vs. the /370, as did the RISC processors vs, for example, the
    680x0 family of processors (just compare ARM vs. 68000).

    Or look at the performance of the TTL implementation of HP-PA,
    which used PALs which were not available to the VAX 11/780
    designers, so it could be clocked a bit higher, and ran at
    a multiple of the performance of the VAX.

    So, Anton visiting DEC or me visiting Data General could have
    brought them a technology which would have significantly outperformed
    the VAX (especially if we brought along the algorithm for graph
    coloring). Some people at IBM would have been peeved at having
    somebody else "develop" this at the same time, but OK.

    While it's always fun to speculate about alternate timelines, if
    all you are talking about is a hypothetical that someone at DEC
    could have independently used the same techniques, producing a
    more performant RISC-y VAX with better compilers, then sure, I
    guess, why not. But as with all alternate history, this is
    completely unknowable.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Aug 9 10:00:54 2025
    From Newsgroup: comp.arch

    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <1070cj8$3jivq$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <106uqki$36gll$4@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <44okQ.831008$QtA1.573001@fx16.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    We tend to be spoiled by modern process densities. The
    VAX 11/780 was built using SSI logic chips, thus board
    space and backplane wiring were significant constraints
    on the logic designs of the era.

    Indeed. I find this speculation about the VAX, kind of odd: the
    existence of the 801 as a research project being used as an
    existence proof to justify assertions that a pipelined RISC
    design would have been "better" don't really hold up, when we
    consider that the comparison is to a processor designed for
    commercial applications on a much shorter timeframe.

    I disagree. The 801 was a research project without much time
    pressure, and they simulated the machine (IIRC at the gate level)
    before they ever built one. Plus, they developed an excellent
    compiler which implemented graph coloring.

    But IBM had zero interest in competition to their own /370 line,
    although the 801 would have brought performance improvements
    over that line.

    I'm not sure what, precisely, you're disagreeing with.

    I'm saying that the line of thought that goes, "the 801 existed,
    therefore a RISC VAX would have been better than the
    architecture DEC ultimately produced" is specious, and the
    conclusion does not follow.

    There are a few intermediate steps.

    The 801 demonstrated that a RISC, including caches and pipelining,
    would have been feasible at the time. It also demonstrated that
    somebody had thought of graph coloring algorithms.

    This is the part where the argument breaks down. VAX and 801
    were roughly contemporaneous, with VAX being commercially
    available around the time the first 801 prototypes were being
    developed. There's simply no way in which the 801,
    specifically, could have had significant impact on VAX
    development.

    Sure. IBM was in less than no hurry to make a product out of
    the 801.


    If you're just talking about RISC design techniques generically,
    then I dunno, maybe, sure, why not,

    Absolutely. The 801 demonstrated that it was a feasible
    development _at the time_.

    but that's a LOT of
    speculation with hindsight-colored glasses.

    Graph-colored glasses, for the register allocation, please :-)

    Furthermore, that
    speculation focuses solely on technology, and ignores the
    business realities that VAX was born into. Maybe you're right,
    maybe you're wrong, we can never _really_ say, but there was a
    lot more that went into the decisions around the VAX design than
    just technology.

    I'm not sure what you mean here. Do you include the ISA design
    in "technology" or not?

    [...]

    While it's always fun to speculate about alternate timelines, if
    all you are talking about is a hypothetical that someone at DEC
    could have independently used the same techniques, producing a
    more performant RISC-y VAX with better compilers, then sure, I
    guess, why not.

    Yep, that would have been possible, either as an alternate
    VAX or a competitor.

    But as with all alternate history, this is
    completely unknowable.

    We know it was feasible, we know that there were a large
    number of minicomputer companies at the time. We cannot
    predict what a successful minicomputer implementation with
    two or three times the performance of the VAX could have
    done. We do know that this was the performance advantage
    that Fountainhead from DG aimed for via programmable microcode
    (which failed to deliver on time due to complexity), and
    we can safely assume that DG would have given DEC a run
    for its money if they had a system which significantly
    outperformed the VAX.

    So, "completely unknownable" isn't true, "quite plausible"
    would be a more accurate description.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sat Aug 9 10:03:29 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Concerning the speed of the 82S100 PLA,
    <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.

    Were there different versions, maybe?

    https://deramp.com/downloads/mfe_archive/050-Component%20Specifications/Signetics-Philips/82S100%20FPGA.pdf
    gives an I/O propagation delay of 80 ns max.

    Yes, must be different versions.
    I'm looking at this 1976 datasheet which says 50 ns max access:

    http://www.bitsavers.org/components/signetics/_dataBooks/1976_Signetics_Field_Programmable_Logic_Arrays.pdf

    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.

    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,
    - optionally wired to 48 16-input AND's,
    - optionally wired to 8 48-input OR's,
    - with 8 optional XOR output invertors,
    - driving 8 tri-state or open collector buffers.

    So I count roughly 7 or 8 equivalent gate delays.
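    To make the AND-OR structure concrete, an added rough behavioral model
    (not from the data sheet) of a 16-input, 48-term, 8-output PLA of this
    kind; the programming arrays are placeholders.

        #include <stdint.h>

        #define N_TERMS 48

        /* Per product term: which inputs participate, and with which polarity;
           unused terms should leave or_use[] at 0 so they never drive an output. */
        static uint16_t term_use[N_TERMS];  /* bit i = 1: input i is in the term */
        static uint16_t term_pol[N_TERMS];  /* bit i = 1: input i taken inverted */
        static uint8_t  or_use[N_TERMS];    /* bit o = 1: term feeds output o    */
        static uint8_t  out_invert;         /* per-output XOR inversion          */

        static uint8_t pla_eval(uint16_t in)
        {
            uint8_t out = 0;
            for (int t = 0; t < N_TERMS; t++) {
                uint16_t v = in ^ term_pol[t];        /* apply input inversions    */
                if ((v & term_use[t]) == term_use[t]) /* AND of selected inputs    */
                    out |= or_use[t];                 /* OR into the outputs       */
            }
            return out ^ out_invert;                  /* optional output inversion */
        }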
    Also the decoder would need a lot of these so I doubt we can afford the
    power and heat for H series. That 74H30 typical is 22 mW but the max
    looks like 110 mW each (I_ol output low of 20 mA * 5.5V max).
    74LS30 is 20 ns max, 44 mW max.

    Looking at a TI Bipolar Memory Data Manual from 1977,
    it was about the same speed as say a 256b mask programmable TTL ROM,
    7488A 32w * 8b, 45 ns max access.





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Aug 9 20:54:07 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Concerning the speed of the 82S100 PLA,
    <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns
    cycle time, so yes, one could have used that for the VAX.

    Were there different versions, maybe?

    https://deramp.com/downloads/mfe_archive/050-Component%20Specifications/Signetics-Philips/82S100%20FPGA.pdf
    gives an I/O propagation delay of 80 ns max.

    Yes, must be different versions.
    I'm looking at this 1976 datasheet which says 50 ns max access:

    http://www.bitsavers.org/components/signetics/_dataBooks/1976_Signetics_Field_Programmable_Logic_Arrays.pdf

    That is strange. Why would they make the chip worse?

    Unless... maybe somebody (a customer, or they themselves)
    discovered that there may have been conditions where they could
    only guarantee 80 ns. Maybe a combination of tolerances to one
    side and a certain logic programming, and they changed the
    data sheet.


    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.

    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,

    Should be free coming from a Flip-Flop.

    - optionally wired to 48 16-input AND's,
    - optionally wired to 8 48-input OR's,

    Those would be the two layers of NAND gates, so depending
    on which ones you chose, you have to add those.

    - with 8 optional XOR output invertors,

    I don't find that in the diagrams (but I might be missing that,
    I am not an expert at reading them).

    - driving 8 tri-state or open collector buffers.

    A 74265 had switching times of max. 18 ns, driving 30
    output loads, so that would be on top.

    One question: Did TTL people actually use the "typical" delays
    from the handbooks, or did they use the maximum delays for their
    designs? Using anything below the maximum would sound dangerous to
    me, but maybe this was possible to a certain extent.

    So I count roughly 7 or 8 equivalent gate delays.

    Another point... if you don't need 16 inputs or 8 outputs, you
    are also paying for a lot more than you need. If you have a 6-bit primary opcode,
    you don't need a full 16 bits of input.


    Also the decoder would need a lot of these so I doubt we can afford the
    power and heat for H series. That 74H30 typical is 22 mW but the max
    looks like 110 mW max each (I_ol output low of 20 mA * 5.5V max).
    74LS30 is 20 ns max, 44 mW max.

    Looking at a TI Bipolar Memory Data Manual from 1977,
    it was about the same speed as say a 256b mask programmable TTL ROM,
    7488A 32w * 8b, 45 ns max access.

    Hmm... did the VAX, for example, actually use them, or were they
    using logic built from conventional chips?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Al Kossow@aek@bitsavers.org to comp.arch on Sat Aug 9 14:57:03 2025
    From Newsgroup: comp.arch

    On 8/9/25 1:54 PM, Thomas Koenig wrote:

    One question: Did TTL people actually use the "typical" delays
    from the handbooks, or did they use the maximum delays for their
    designs?

    using typicals was a rookie mistake
    also not comparing delay times across vendors


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Sun Aug 10 12:06:46 2025
    From Newsgroup: comp.arch

    In article <107768m$17rul$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    [snip]
    If you're just talking about RISC design techniques generically,
    then I dunno, maybe, sure, why not,

    Absolutely. The 801 demonstrated that it was a feasible
    development _at the time_.

    Ok. Sure.

    but that's a LOT of
    speculation with hindsight-colored glasses.

    Graph-colored glasses, for the register allocation, please :-)

    Heh. :-)

    Furthermore, that
    speculation focuses solely on technology, and ignores the
    business realities that VAX was born into. Maybe you're right,
    maybe you're wrong, we can never _really_ say, but there was a
    lot more that went into the decisions around the VAX design than
    just technology.

    I'm not sure what you mean here. Do you include the ISA design
    in "technology" or not?

    Absolutely.

    [...]

    While it's always fun to speculate about alternate timelines, if
    all you are talking about is a hypothetical that someone at DEC
    could have independently used the same techniques, producing a
    more performant RISC-y VAX with better compilers, then sure, I
    guess, why not.

    Yep, that would have been possible, either as an alternate
    VAX or a competitor.

    But as with all alternate history, this is
    completely unknowable.

    Sure.

    We know it was feasible, we know that there were a large
    number of minicomputer companies at the time. We cannot
    predict what a successful minicomputer implementation with
    two or three times the performance of the VAX could have
    done. We do know that this was the performance advantage
    that Fountainhead from DG aimed for via programmable microcode
    (which failed to deliver on time due to complexity), and
    we can safely assume that DG would have given DEC a run
    for its money if they had a system which significantly
    outperformed the VAX.

    My contention is that while it was _feasible_ to build a
    RISC-style machine for what became the VAX, that by itself is
    only a part of the puzzle. One must also take into account
    market and business contexts; perhaps such a machine would have
    been faster, but I don't think anyone _really_ knew that to be
    the case in 1975 when design work on the VAX started, and even
    fewer would have believed it absent a working prototype, which
    wouldn't arrive with the 801 for several years after the VAX had
    shipped commercially. Furthermore, Digital would have
    understood that many customers would have expected to be able to
    program their new machine in macro assembler.

    Similarly for other minicomputer companies.

    So, "completely unknownable" isn't true, "quite plausible"
    would be a more accurate description.

    Plausibility is orthogonal to whether a thing is knowable.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sun Aug 10 15:18:23 2025
    From Newsgroup: comp.arch

    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <107768m$17rul$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    <snip>

    While it's always fun to speculate about alternate timelines, if
    all you are talking about is a hypothetical that someone at DEC
    could have independently used the same techniques, producing a
    more performant RISC-y VAX with better compilers, then sure, I
    guess, why not.

    Yep, that would have been possible, either as an alternate
    VAX or a competitor.

    But as with all alternate history, this is
    completely unknowable.

    Sure.

    We know it was feasible, we know that there were a large
    number of minicomputer companies at the time. We cannot
    predict what a successful minicomputer implementation with
    two or three times the performance of the VAX could have
    done. We do know that this was the performance advantage
    that Fountainhead from DG aimed for via programmable microcode
    (which failed to deliver on time due to complexity), and
    we can safely assume that DG would have given DEC a run
    for its money if they had a system which significantly
    outperformed the VAX.

    My contention is that while it was _feasible_ to build a
    RISC-style machine for what became the VAX, that by itself is
    only a part of the puzzle. One must also take into account
    market and business contexts; perhaps such a machine would have
    been faster, but I don't think anyone _really_ knew that to be
    the case in 1975 when design work on the VAX started, and even
    fewer would have believed it absent a working prototype, which
    wouldn't arrive with the 801 for several years after the VAX had
    shipped commercially. Furthermore, Digital would have
    understood that many customers would have expected to be able to
    program their new machine in macro assembler.

    One must also keep in mind that the VAX group was competing
    internally with the PDP-10 minicomputer. Considerable
    internal resources were being applied to the Jupiter project
    at the end of the 1970s to support a wider range of applications.

    http://bitsavers.informatik.uni-stuttgart.de/pdf/dec/pdp10/KC10_Jupiter/Jupiter_CIS_Instructions_Oct80.pdf

    Interesting quote that indicates the direction they were looking:
    "Many of the instructions in this specification could only
    be used by COBOL if 9-bit ASCII were supported. There is currently
    no plan for COBOL to support 9-bit ASCII".

    "The following goals were taken into consideration when deriving an
    address scheme for addressing 9-bit byte strings:"

    Fundamentally, 36-bit words ended up being a dead-end.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sun Aug 10 19:55:01 2025
    From Newsgroup: comp.arch

    According to Scott Lurndal <slp53@pacbell.net>:
    http://bitsavers.informatik.uni-stuttgart.de/pdf/dec/pdp10/KC10_Jupiter/Jupiter_CIS_Instructions_Oct80.pdf

    Interesting quote that indicates the direction they were looking:
    "Many of the instructions in this specification could only
    be used by COBOL if 9-bit ASCII were supported. There is currently
    no plan for COBOL to support 9-bit ASCII".

    "The following goals were taken into consideration when deriving an
    address scheme for addressing 9-bit byte strings:"

    Fundamentally, 36-bit words ended up being a dead-end.

    Interesting document. It added half-hearted 9 bit byte addressing to the PDP-10, intended for COBOL string processing and decimal arithmetic.

    Except that the PDP-10's existing byte instructions let you use any byte
    size you wanted, and everyone used 7-bit bytes for ASCII strings. It would have been straightforward but very tedious to add 9 bit byte strings to
    the COBOL compiler since they'd need ways to say which text data was in
    which format and convert as needed. Who knows what they'd have done for
    data files.
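
    For concreteness, an added sketch of the packing arithmetic: five 7-bit
    ASCII characters fit in a 36-bit word with one bit left over, whereas
    9-bit bytes pack exactly four to a word.

        #include <stdint.h>

        /* A 36-bit PDP-10 word is held in the low bits of a uint64_t. */

        static uint64_t pack5x7(const char c[5])      /* usual ASCII text packing */
        {
            uint64_t w = 0;
            for (int i = 0; i < 5; i++)
                w = (w << 7) | (uint64_t)(c[i] & 0x7f);
            return w << 1;          /* 35 bits of characters, low bit spare */
        }

        static uint64_t pack4x9(const uint16_t b[4])  /* 9-bit bytes, 4 per word */
        {
            uint64_t w = 0;
            for (int i = 0; i < 4; i++)
                w = (w << 9) | (uint64_t)(b[i] & 0x1ff);
            return w;               /* 4 * 9 = 36 bits, nothing left over */
        }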

    36 bit word machines had a good run starting in the mid 1950s but once
    S/360 came out with 8 bit bytes and power of two addressing for larger
    data, all of the other addressing models were doomed.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Aug 10 21:01:50 2025
    From Newsgroup: comp.arch

    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:

    [Snipping the previous long discussion]

    My contention is that while it was _feasible_ to build a
    RISC-style machine for what became the VAX,

    There, we agree.

    that by itself is
    only a part of the puzzle. One must also take into account
    market and business contexts; perhaps such a machine would have
    been faster,

    With a certainty, if they followed RISC principles.

    but I don't think anyone _really_ knew that to be
    the case in 1975 when design work on the VAX started,

    That is true. Reading
    https://acg.cis.upenn.edu/milom/cis501-Fall11/papers/cocke-RISC.pdf
    is instructive (I liked the potential tongue-in-cheek "Regular
    Instruction Set-Computer" name for their instruction set).

    and even
    fewer would have believed it absent a working prototype,

    The simulation approach that IBM took is interesting. They built
    a fast simulator, translating one 801 instruction into one (or
    several) /370-instructions on the fly, with a fixed 32-bit size.


    which
    wouldn't arrive with the 801 for several years after the VAX had
    shipped commercially.

    That is clear. It was the premise of this discussion that the
    knowledge had been made available (via time travel or some other
    strange means) to a company, which would then have used the
    knowledge.

    Furthermore, Digital would have
    understood that many customers would have expected to be able to
    program their new machine in macro assembler.

    Programming a RISC in assembler is not so hard, at least in my
    experience. Plus, people overestimated use of assembler even in
    the mid-1970s, and underestimated the use of compilers.
    [...]
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Aug 11 08:17:48 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    One must also keep in mind that the VAX group was competing
    internally with the PDP-10 minicomputer.

    This does not make the actual VAX more attractive relative to the
    hypothetical RISC-VAX IMO.

    Fundamentally, 36-bit words ended up being a dead-end.

    The reasons why this once-common architectural style died out are:

    * 18-bit addresses

    * word addressing

    Sure, one could add 36-bit byte addresses to such an architecture
    (probably with 9-bit bytes to make it easy to deal with words), but it
    would force a completely different ABI and API, so the legacy code
    would still have no good upgrade path and would be limited to its
    256KW address space no matter how much actual RAM there is available.
    IBM decided to switch from this 36-bit legacy to the 32-bit
    byte-addressed S/360 in the early 1960s (with support for their legacy
    lines built into various S/360 implementations); DEC did so when they introduced the VAX.

    Concerning other manufacturers:

    <https://en.wikipedia.org/wiki/36-bit_computing> tells me that the
    GE-600 series was also 36-bit. It continued as Honeywell 6000 series <https://en.wikipedia.org/wiki/Honeywell_6000_series>. Honeywell
    introduced the DPS-88 in 1982; the architecture is described as
    supporting the usual 256KW, but apparently the DPS-88 could be bought
    with up to 128MB; programming that probably was no fun. Honeywell
    later sold the NEC S1000 as DPS-90, which does not sound like the
    Honeywell 6000 line was a growing business. And that's the last I
    read about the Honeywell 6000 line.

    Univac sold the 1100/2200 series, and later Unisys continued to
    support that in the Unisys ClearPath systems.
    <https://en.wikipedia.org/wiki/UNIVAC_1100/2200_series#Unisys_ClearPath_IX_series>
    says:

    |In addition to the IX (1100/2200) CPUs [...], the architecture had
    |Xeon [...] CPUs. Unisys' goal was to provide an orderly transition for
    |their 1100/2200 customers to a more modern architecture.

    So they continued to support it for a long time, but it's a legacy
    thing, not a future-oriented architecture.

    The Wikipedia article also mentions the Symbolics 3600 as 36-bit
    machine, but that was quite different from the 36-bit architectures of
    the 1950s and 1960s: The Symbolics 3600 has 28-bit addresses (the rest apparently taken by tags) and its successor Ivory has 32-bit addresses
    and a 40-bit word. Here the reason for its demise was the AI winter
    of the late 1980s and early 1990s.

    DEC did the right thing when they decided to support VAX as *the*
    future architecture, and the success of the VAX compared to the
    Honeywell 6000 and Univac 1100/2200 series demonstrates this.

    RISC-VAX would have been better than the PDP-10, for the same reasons:
    32-bit addresses and byte addressing. And in addition, the
    performance advantage of RISC-VAX would have made the position of
    RISC-VAX compared to PDP-10 even stronger.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Aug 11 14:51:20 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    One must also keep in mind that the VAX group was competing
    internally with the PDP-10 minicomputer.

    This does not make the actual VAX more attractive relative to the
    hypothetical RISC-VAX IMO.

    Fundamentally, 36-bit words ended up being a dead-end.

    In a sense, they still live in the Unisys Clearpath systems.


    The reason why this once-common architectural style died out are:

    * 18-bit addresses

    An issue for PDP-10, certainly. Not so much for the Univac
    systems.



    Univac sold the 1100/2200 series, and later Unisys continued to
    support that in the Unisys ClearPath systems.
    <https://en.wikipedia.org/wiki/UNIVAC_1100/2200_series#Unisys_ClearPath_IX_series>
    says:


    I spent 14 years at Burroughs/Unisys (on the Burroughs side, mainly).

    Yes, two of the six mainframe lines still exist (albeit in emulation);
    one 48-bit, the other 36-bit.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Aug 11 17:27:30 2025
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> schrieb:

    http://bitsavers.informatik.uni-stuttgart.de/pdf/dec/pdp10/KC10_Jupiter/Jupiter_CIS_Instructions_Oct80.pdf

    Interesting link, thanks!


    Interesting quote that indicates the direction they were looking:
    "Many of the instructions in this specification could only
    be used by COBOL if 9-bit ASCII were supported. There is currently
    no plan for COBOL to support 9-bit ASCII".

    "The following goals were taken into consideration when deriving an
    address scheme for addressing 9-bit byte strings:"

    They were considering byte-addressability; interesting. It is also
    slightly funny that a 9-bit byte address would be made up of
    30 bits of virtual address and 2 bits of byte address, i.e.
    a 32-bit address in total.

    Fundamentally, 36-bit words ended up being a dead-end.

    Pretty much so. It was a pity for floating-point, where they had
    more precision than the 32-bit words (and especially the horrible
    IBM format).

    But byte addressability and power of two won.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Aug 12 15:02:04 2025
    From Newsgroup: comp.arch

    antispam@fricas.org (Waldek Hebisch) writes:
    VAX-780 architecture handbook says cache was 8 KB and used 8-byte
    lines. So extra 12KB of fast RAM could double cache size.
    That would be a nice improvement, but not as dramatic as an increase
    from 2 KB to 12 KB.

    The handbook is: https://ia903400.us.archive.org/26/items/bitsavers_decvaxhandHandbookVol11977_10941546/VAX_Architecture_Handbook_Vol1_1977_text.pdf

    The cache is indeed 8KB in size, two-way set associative and write-through.

    Section 2.7 also mentions an 8-byte instruction buffer, and that instruction fetching happens concurrently with the microcoded execution. So here we have a little bit of pipelining.

    Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
    have "typically 97% hit rate". I would go for larger pages, which
    would reduce the TLB miss rate.

    While looking for the handbook, I also found

    http://hps.ece.utexas.edu/pub/patt_micro22.pdf

    which describes some parts of the microarchitecture of the VAX 11/780,
    11/750, 8600, and 8800.

    Interestingly, Patt wrote this in 1990, after participating in the HPS
    papers on an OoO implementation of the VAX architecture.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Tue Aug 12 15:28:27 2025
    From Newsgroup: comp.arch

    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    MAP_32BIT is only used on x86-64 on Linux, and was originally
    a performance hack for allocating thread stacks: apparently, it
    was cheaper to do a thread switch with a stack below the 4GiB
    barrier (sign extension artifact maybe? Who knows...). But it's
    no longer required for that. But there's no indication that it
    was for supporting ILP32 on a 64-bit system.

    Reading up about x32, it requires quite a bit more than just
    allocating everything in the low 2GB.

    My memories (from reading about it, I never compiled a program for
    that usage myself) are that on Digital OSF/1, the corresponding usage
    did just that: Configure the compiler for ILP32, and allocate all
    memory in the low 2GB. I expect that types such as off_t would be
    defined appropriately, and any pointers in library-defined structures
    (e.g., FILE from <stdio.h>) consumed 8 bytes, even though the ILP32
    code only accessed the bottom 4. Or maybe they had compiled the
    library also for ILP32. In those days fewer shared libraries were in
    play, and the number of system calls and their interface complexity in
    OSF/1 was probably closer to Unix v6 or so than to Linux today (or in
    2012, when x32 was introduced), so all of that required a lot less
    work.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch,alt.folklore.computers on Tue Aug 12 16:08:58 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    MAP_32BIT is only used on x86-64 on Linux, and was originally
    a performance hack for allocating thread stacks: apparently, it
    was cheaper to do a thread switch with a stack below the 4GiB
    barrier (sign extension artifact maybe? Who knows...). But it's
    no longer required for that. But there's no indication that it
    was for supporting ILP32 on a 64-bit system.

    Reading up about x32, it requires quite a bit more than just
    allocating everything in the low 2GB.

    The primary issue on x86 was with the API definitions. Several
    legacy API declarations used signed integers (int) for
    address parameters. This limited addresses to 2GB on
    a 32-bit system.

    https://en.wikipedia.org/wiki/Large-file_support

    The Large File Summit (I was one of the Unisys reps at the LFS)
    specified a standard way to support files larger than 2GB
    on 32-bit systems that used signed integers for file offsets
    and file size.

    Also, https://en.wikipedia.org/wiki/2_GB_limit

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch,alt.folklore.computers on Tue Aug 12 11:53:37 2025
    From Newsgroup: comp.arch

    On 8/12/2025 11:08 AM, Scott Lurndal wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    MAP_32BIT is only used on x86-64 on Linux, and was originally
    a performance hack for allocating thread stacks: apparently, it
    was cheaper to do a thread switch with a stack below the 4GiB
    barrier (sign extension artifact maybe? Who knows...). But it's
    no longer required for that. But there's no indication that it
    was for supporting ILP32 on a 64-bit system.

    Reading up about x32, it requires quite a bit more than just
    allocating everything in the low 2GB.

    The primary issue on x86 was with the API definitions. Several
    legacy API declarations used signed integers (int) for
    address parameters. This limited addresses to 2GB on
    a 32-bit system.

    https://en.wikipedia.org/wiki/Large-file_support

    The Large File Summit (I was one of the Unisys reps at the LFS)
    specified a standard way to support files larger than 2GB
    on 32-bit systems that used signed integers for file offsets
    and file size.

    Also, https://en.wikipedia.org/wiki/2_GB_limit


    Also, IIRC, the major point of X32 was that it would narrow pointers and similar back down to 32 bits, requiring special versions of any shared libraries or similar.

    But, it is unattractive to have both 32 and 64 bit versions of all the SO's.

    Though, admittedly, not messed with it much personally...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Aug 12 15:59:32 2025
    From Newsgroup: comp.arch

    antispam@fricas.org (Waldek Hebisch) writes:
    The basic question is whether the VAX could afford the pipeline.

    VAX 11/780 only performed instruction fetching concurrently with the
    rest (a two-stage pipeline, if you want). The 8600, 8700/8800 and
    NVAX applied more pipelining, but CPI remained high.

    VUPs      MHz    CPI    Machine
       1       5    10      11/780
       4      12.5   6.25   8600
       6      22.2   7.4    8700
      35      90.9   5.1    NVAX+

    SPEC92      MHz   VAX CPI   Machine
      1/1         5    10/10    VAX 11/780
    133/200     200     3/2     Alpha 21064 (DEC 7000 model 610)

    VUPs and SPEC numbers from
    <https://pghardy.net/paul/programs/vms_cpus.html>.

    The 10 CPI (cycles per instruction) of the VAX 11/780 are anecdotal.
    The other CPIs are computed from VUP/SPEC and MHz numbers; all of that
    is probably somewhat off (due to the anecdotal base being off), but if
    you relate them to each other, the offness cancels itself out.
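
    For the record, the CPI column falls out of a few lines of arithmetic,
    assuming 1 VUP corresponds to the 11/780's roughly 0.5 native MIPS
    (5 MHz at ~10 CPI); an added sketch:

        #include <stdio.h>

        /* "VAX CPI" = clock rate divided by VAX-equivalent MIPS,
           where 1 VUP is taken as 0.5 VAX MIPS. */
        static double vax_cpi(double mhz, double vups)
        {
            return mhz / (vups * 0.5);
        }

        int main(void)
        {
            printf("11/780: %.2f\n", vax_cpi(5.0, 1.0));     /* 10.00 */
            printf("8600:   %.2f\n", vax_cpi(12.5, 4.0));    /*  6.25 */
            printf("8700:   %.2f\n", vax_cpi(22.2, 6.0));    /*  7.40 */
            printf("NVAX+:  %.2f\n", vax_cpi(90.9, 35.0));   /*  5.19 */
            printf("21064:  %.2f\n", vax_cpi(200.0, 133.0)); /*  3.01, SPECint92 in place of VUPs */
            return 0;
        }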

    Note that the NVAX+ was made in the same process as the 21064, the
    21064 has about twice the clock rate, and has 4-6 times the performance,
    resulting not just in a lower native CPI, but also in a lower "VAX
    CPI" (the CPI a VAX would have needed to achieve the same performance
    at this clock rate).

    I doubt that they could afford 1-cycle multiply

    Yes, one might do a multiplier and divider with its own sequencer (and
    a more sophisticated one in later implementations), with any user of the
    result stalling the pipeline until that is complete, and any
    following user of the multiplier or divider stalling the pipeline
    until it is free again.

    The idea of providing multiply-step instructions and using a bunch of
    them was short-lived; already the MIPS R2000 included a multiply
    instruction (with its own sequencer), HPPA has multiply-step as well
    as an FPU-based multiply from the start. The idea of avoiding divide instructions had a longer life. MIPS has divide right from the start,
    but Alpha and even IA-64 avoided it. RISC-V includes divide in the M
    extension that also gives multiply.
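
    As an added illustration (not the exact semantics of any particular
    ISA's multiply-step instruction), each step amounts to one conditional
    add plus a shift, so a 32x32 multiply takes 32 such steps:

        #include <stdint.h>

        static uint64_t mul32_by_steps(uint32_t a, uint32_t b)
        {
            uint64_t acc = 0;
            uint64_t m = a;             /* multiplicand, shifted left each step */
            for (int i = 0; i < 32; i++) {
                if (b & 1)              /* one "multiply step":         */
                    acc += m;           /*   conditionally add          */
                m <<= 1;                /*   shift the multiplicand     */
                b >>= 1;                /*   consume one multiplier bit */
            }
            return acc;                 /* full 64-bit product after 32 steps */
        }

    Real multiply-step instructions typically keep the running product and
    the remaining multiplier packed in a register pair, but the per-step
    work is the same: a conditional add and a shift.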

    or
    even a barrel shifter.

    Five levels of 32-bit 2->1 muxes might be doable, but would that be cost-effective?
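
    The five levels correspond to conditional shifts by 1, 2, 4, 8 and 16;
    an added sketch (shown for left shifts only):

        #include <stdint.h>

        /* Each level is a row of 2->1 muxes that either passes the value
           through or shifts it by a fixed power of two, selected by one
           bit of the shift amount. */
        static uint32_t barrel_shift_left(uint32_t x, unsigned amt)
        {
            if (amt & 1)   x <<= 1;     /* level 1 */
            if (amt & 2)   x <<= 2;     /* level 2 */
            if (amt & 4)   x <<= 4;     /* level 3 */
            if (amt & 8)   x <<= 8;     /* level 4 */
            if (amt & 16)  x <<= 16;    /* level 5 */
            return x;
        }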

    It was accepted in that era that using more hardware could
    give substantial speedup. IIUC IBM used a quadratic rule:
    performance was supposed to be proportional to square of
    CPU price. That was partly marketing, but partly due to
    compromises needed in smaller machines.

    That's more of a 1960s thing, probably because low-end S/360
    implementations used all (slow) tricks to minimize hardware. In the
    VAX 11/780 environment, I very much doubt that it is true. Looking at
    the early VAXen, you get the 11/730 with 0.3 VUPs up to the 11/784
    with 3.5 VUPs (from 4 11/780 CPUs). sqrt(3.5/0.3)=3.4. I very much
    doubt that you could get an 11/784 for 3.4 times the price of an
    11/730.

    Searching a little, I find

    |[11/730 is] to be a quarter the price and a quarter the performance of
    |a grown-up VAX (11/780) <https://retrocomputingforum.com/t/price-of-vax-730-with-vms-the-11-730-from-dec/3286>

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From aph@aph@littlepinkcloud.invalid to comp.arch,alt.folklore.computers on Tue Aug 12 17:57:20 2025
    From Newsgroup: comp.arch

    In comp.arch BGB <cr88192@gmail.com> wrote:

    Also, IIRC, the major point of X32 was that it would narrow pointers and similar back down to 32 bits, requiring special versions of any shared libraries or similar.

    But, it is unattractive to have both 32 and 64 bit versions of all the SO's.

    We have done something similar for years at Red Hat: not X32, but
    x86_32, and it was pretty easy. If you're building a 32-bit OS anyway
    (which we were) all you have to do is copy all 32-bit libraries from
    one repo to the other.

    I thought the AArch64 ILP32 design was pretty neat, but no one seems
    to have been interested. I guess there wasn't an advantage worth the
    effort.

    Andrew.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch,alt.folklore.computers on Tue Aug 12 19:09:27 2025
    From Newsgroup: comp.arch

    According to <aph@littlepinkcloud.invalid>:
    In comp.arch BGB <cr88192@gmail.com> wrote:

    Also, IIRC, the major point of X32 was that it would narrow pointers and
    similar back down to 32 bits, requiring special versions of any shared
    libraries or similar.

    But, it is unattractive to have both 32 and 64 bit versions of all the SO's.

    We have done something similar for years at Red Hat: not X32, but
    x86_32, and it was pretty easy. If you're building a 32-bit OS anyway
    (which we were) all you have to do is copy all 32-bit libraries from
    one repo to the other.

    FreeBSD does the same thing. The 32 bit libraries are installed by default
    on 64 bit systems because, by current standards, they're not very big.

    I've stopped installing them because I know I don't have any 32 bit apps
    left but on systems with old packages, who knows?
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 13 06:11:02 2025
    From Newsgroup: comp.arch

    aph@littlepinkcloud.invalid writes:
    I thought the AArch64 ILP32 design was pretty neat, but no one seems
    to have been interested. I guess there wasn't an advantage worth the
    effort.

    Alpha: On Digital OSF/1 the advantage was to be able to run programs
    that work on ILP32, but not I32LP64.

    x32: I expect that maintained Unix programs ran on I32LP64 in 2012,
    and unmaintained ones did not get an x32 port anyway. And if there
    are cases where my expectations do not hold, there still is i386. The
    only advantage of x32 was a speed advantage on select programs.
    That's apparently not enough to gain a critical mass of x32 programs.

    Aarch64-ILP32: My guess is that the situation is very similar to the
    x32 situation. Admittedly, there are CPUs without ARM A32/T32
    support, but if there was any significant program for these CPUs that
    does not work with I32LP64, the manufacturer would have chosen to
    include the A32/T32 option. Given that the situation is the same as
    for x32, the result is the same: What I find about it are discussions
    about deprecation and removal <https://www.phoronix.com/news/GCC-Deprecates-ARM64-ILP32>.

    Concerning performance, <https://static.linaro.org/connect/bkk16/Presentations/Wednesday/BKK16-305B.pdf>
    shows SPECint 2006 benchmarks on two unnamed platforms. Out of 12
    benchmark programs, ILP32 shows a speedup by a factor ~1.55 on
    429.mcf, ~1.2 on 471.omnetpp, ~1.1 on 483.xalancbmk, ~1.05 on 403.gcc,
    and ~0.95 (i.e., slowdowns) on 401.bzip2, 456.hmmer, 458.sjeng.

    That slide deck concludes with:

    |Do We Care? Enough?
    |
    |A lot of code to maintain for little gain.

    Apparently the answers to these questions is no.

    Followups to comp.arch.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 13 07:32:28 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    That said, Unix generally defined -1 as the return value for all
    other system calls, and code that checked for "< 0" instead of
    -1 when calling a standard library function or system call was fundamentally
    broken.

    That may be the interface of the C system call wrapper, along with
    errno, but at the actual system call level, the error is indicated in
    an architecture-specific way, and the ones I have looked at before
    today use the sign of the result register or the carry flag. On those architectures where the sign is used, mmap(2) cannot return negative addresses, or must have a special wrapper.

    Let's look at what the system call wrappers do on RV64G(C) (which has
    no carry flag). For read(2) the wrapper contains:

    0x3ff7f173be <read+20>: ecall
    0x3ff7f173c2 <read+24>: lui a5,0xfffff
    0x3ff7f173c4 <read+26>: mv s0,a0
    0x3ff7f173c6 <read+28>: bltu a5,a0,0x3ff7f1740e <read+100>

    For dup(2) the wrapper contains:

    0x3ff7e7fe9a <dup+2>: ecall
    0x3ff7e7fe9e <dup+6>: lui a7,0xfffff
    0x3ff7e7fea0 <dup+8>: bltu a7,a0,0x3ff7e7fea6 <dup+14>

    and for mmap(2):

    0x3ff7e86b6e <mmap64+12>: ecall
    0x3ff7e86b72 <mmap64+16>: lui a5,0xfffff
    0x3ff7e86b74 <mmap64+18>: bltu a5,a0,0x3ff7e86b8c <mmap64+42>

    So instead of checking for the sign flag, on RV64G the wrapper checks
    if the result is >0xfffff00000000000. This costs one instruction more
    than just checking the sign flag, and allows almost doubling the
    number of bytes read(2) can read in one call, the number of file ids
    that can be returned by dup(2), and the address range returnable by
    mmap(2). Will we ever see processes that need more than 8EB? Maybe
    not, but the designers of the RV64G(C) ABI obviously did not want to
    be the ones that are quoted as saying "8EB should be enough for
    anyone":-).

    Followups to comp.arch

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 13 08:22:17 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    To be efficient, a RISC needs a full-width (presumably 32 bit)
    external data bus, plus a separate address bus, which should at
    least be 26 bits, better 32. A random ARM CPU I looked at at
    bitsavers had 84 pins, which sounds reasonable.

    Building an ARM-like instead of a 68000 would have been feasible,
    but the resulting systems would have been more expensive (the
    68000 had 64 pins).

    One could have done a RISC-VAX microprocessor with 16-bit data bus and
    24-bit address bus, like the 68000, or even an 8-bit data bus, and
    without FPU and MMU and without PDP-11 decoder. The performance would
    have been memory-bandwidth-limited and therefore similar to the 68000
    and 68008, respectively (unless extra love was spent on the memory
    interface, e.g., with row optimization), with a few memory accesses
    saved by having more registers. This would still have made sense in a
    world where the same architecture was available (with better
    performance) on the supermini of the day, the RISC-VAX: Write your
    code on the cheap micro RISC-VAX and this will give you the
    performance advantages in a few years when proper 32-bit computing
    arrives (or on more expensive systems today).

    So... a strategy could have been to establish the concept with
    minicomputers, to make money (the VAX sold big) and then move
    aggressively towards microprocessors, trying the disruptive move
    towards workstations within the same company (which would be HARD).

    For workstations one would need the MMU and the FPU as extra chips.

    Getting a company to avoid trying to milk the cash cow for longer
    (short-term profits) by burying in-company progress (that other
    companies then make, i.e., long-term loss) may be hard, but given that
    some companies have survived, it's obviously possible.

    HP seems to have avoided the problem at various stages: They had their
    own HP3000 and HP9000/500 architectures, but found ways to drop those
    for HPPA without losing too many customers, then they dropped HPPA for
    IA-64, and IA-64 for AMD64, and they still survive. They also managed
    to become one of the biggest PC makers, but found it necessary to
    split the PC and big-machine businesses into two companies.

    As for the PC - a scaled-down, cheap, compatible, multi-cycle per
    instruction microprocessor could have worked for that market,
    but it is entirely unclear to me what this would / could
    have done to the PC market, if IBM could have been prevented
    from gaining such market dominance.

    The IBM PC success was based on the open architecture, on being more
    advanced than the Apple II and not too expensive, and the IBM name
    certainly helped at the start. In the long run it was an Intel and
    Microsoft success, not an IBM success. And Intel's 8086 success was
    initially helped by being able to port 8080 programs (with 8080->8086 assemblers).

    So how could one capture the PC market? The RISC-VAX would probably
    have been too expensive for a PC, even with an 8-bit data bus and a
    reduced instruction set, along the lines of RV32E. Or maybe that
    would have been feasible, in which case one would provide 8080->reduced-RISC-VAX and 6502->reduced-RISC-VAX assemblers to make
    porting easier. And then try to sell it to IBM Boca Raton.

    An alternative would be to sell it as a faster and better upgrade path
    for the 8088 later, as competition to the 80286. Have a RISC-VAX
    (without MMU and FPU) with an additional 8086 decoder for running
    legacy programs (should be possible in the 134,000 transistors that the
    80286 has): Users could run their existing code, as well as
    future-oriented (actually present-oriented) 32-bit code. The next
    step would be adding the TLB for paging.

    Concerning on how to do it from the business side: The microprocessor
    business (at least, maybe more) should probably be spun off as an
    independent company, such that customers would not need to worry about
    being at a disadvantage compared to DEC in-house demands.

    One can also imagine other ways: Instead of the reduced-RISC-VAX, Try
    to get a PDP-11 variant with 8-bit data bus into the actual IBM PC
    (instead of the 8088), or set up your own PC business based on such a processor; and then the logical upgrade path would be to the successor
    of the PDP-11, the RISC-VAX (with PDP-11 decoder).

    What about the fears of the majority in the company working on big
    computers? They would continue to make big computers, with initially
    faster and later more CPUs than PCs. That's what we are seeing today.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 13 09:37:27 2025
    From Newsgroup: comp.arch

    Lawrence D'Oliveiro <ldo@nz.invalid> writes:
    On Tue, 5 Aug 2025 21:01:20 -0000 (UTC), Thomas Koenig wrote:

    So... a strategy could have been to establish the concept with
    minicomputers, to make money (the VAX sold big) and then move
    aggressively towards microprocessors, trying the disruptive move towards
    workstations within the same company (which would be HARD).

    None of the companies which tried to move in that direction were
    successful. The mass micro market had much higher volumes and lower
    margins, and those accustomed to lower-volume, higher-margin operation
    simply couldn’t adapt.

    At least some of the Nova-based microprocessors were relatively cheap,
    and still did not succeed. I think that the essential parts of the
    success of the 8088 were:

    * Offered 1MB of address space. In a cumbersome way, but still; and
    AFAIK less cumbersome than what you would do on a mini or Apple III.
    Intel's architects did not understand that themselves, as shown by
    the 80286, which offered decent support for multiple processes, each
    with 64KB address space. Users actually preferred single-tasking of
    programs that can access more than 64KB easily to multitasking of
    64KB (or 64KB+64KB) processes.

    * Cheap to design computers for, in particular the 8-bit bus and small
    package.

    * Support for porting 8080 assembly code to the 8086 architecture.
    That was not needed for long, but it provided a boost in available
    software at a critical time.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Wed Aug 13 11:25:24 2025
    From Newsgroup: comp.arch

    In article <107b1bu$252qo$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:

    [Snipping the previous long discussion]

    My contention is that while it was _feasible_ to build a
    RISC-style machine for what became the VAX,

    There, we agree.

    that by itself is
    only a part of the puzzle. One must also take into account
    market and business contexts; perhaps such a machine would have
    been faster,

    With a certainty, if they followed RISC principles.

    Sure. I wasn't disputing that, just saying that I don't think
    it mattered that much.

    [snip]
    which
    wouldn't arrive with the 801 for several years after the VAX had
    shipped commercially.

    That is clear. It was the premise of this discussion that the
    knowledge had been made available (via time travel or some other
    strange means) to a company, which would then have used the
    knowledge.

    Well, then we're definitely into the unknowable. :-)

    Furthermore, Digital would have
    understood that many customers would have expected to be able to
    program their new machine in macro assembler.

    Programming a RISC in assembler is not so hard, at least in my
    experience. Plus, people overestimated use of assembler even in
    the mid-1970s, and underestimated the use of compilers.
    [...]

    They certainly did! I'm not saying that they're right; I'm
    saying that business needs must have, at least in part,
    influenced the ISA design. That is, while mistaken, it was part
    of the business decision process regardless.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Aug 13 14:24:48 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    aph@littlepinkcloud.invalid writes:
    I thought the AArch64 ILP32 design was pretty neat, but no one seems
    to have been interested. I guess there wasn't an advantage worth the effort.

    Alpha: On Digital OSF/1 the advantage was to be able to run programs
    that work on ILP32, but not I32LP64.

    I understand what you're saying here, but disagree. A program that
    works on ILP32 but not I32LP64 is fundamentally broken, IMHO.


    x32: I expect that maintained Unix programs ran on I32LP64 in 2012,
    and unmaintained ones did not get an x32 port anyway. And if there
    are cases where my expectations do not hold, there still is i386. The
    only advantage of x32 was a speed advantage on select programs.

    I suspect that performance advantage was minimal, the primary advantage would have been that existing applications didn't need to be rebuilt
    and requalified.

    That's apparently not enough to gain a critical mass of x32 programs.

    Aarch64-ILP32: My guess is that the situation is very similar to the
    x32 situation.

    In the early days of AArch64 (2013), we actually built a toolchain to support Aarch64-ILP32. Not a single customer exhibited _any_ interest in that
    and the project was dropped.

    Admittedly, there are CPUs without ARM A32/T32

    Very few AArch64 designs included AArch32 support; even the Cortex
    chips supported it only at exception level zero (user mode), not
    at the other exception levels. The latest Neoverse chips have,
    for the most part, dropped AArch32 completely, even at EL0.

    The markets for AArch64 (servers, high-end appliances) didn't have
    a huge existing reservoir of 32-bit ARM applications, so there was
    no demand to support them.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch,alt.folklore.computers on Wed Aug 13 14:26:18 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    To be efficient, a RISC needs a full-width (presumably 32 bit)
    external data bus, plus a separate address bus, which should at
    least be 26 bits, better 32. A random ARM CPU I looked at at
    bitsavers had 84 pins, which sounds reasonable.

    Building an ARM-like instead of a 68000 would have been feasible,
    but the resulting systems would have been more expensive (the
    68000 had 64 pins).

    One could have done a RISC-VAX microprocessor with 16-bit data bus and
    24-bit address bus.

    LSI11?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch,alt.folklore.computers on Wed Aug 13 14:44:29 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    <snip>
    So how could one capture the PC market? The RISC-VAX would probably
    have been too expensive for a PC, even with an 8-bit data bus and a
    reduced instruction set, along the lines of RV32E. Or maybe that
    would have been feasible, in which case one would provide 8080->reduced-RISC-VAX and 6502->reduced-RISC-VAX assemblers to make
    porting easier. And then try to sell it to IBM Boca Raton.

    https://en.wikipedia.org/wiki/Rainbow_100
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Aug 13 15:03:08 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    That said, Unix generally defined -1 as the return value for all
    other system calls, and code that checked for "< 0" instead of
    -1 when calling a standard library function or system call was fundamentally broken.

    That may be the interface of the C system call wrapper,

    It _is_ the interface that the programmers need to be
    concerned with when using POSIX C language bindings.

    Other language bindings offer alternative mechanisms.


    errno, but at the actual system call level, the error is indicated in
    an architecture-specific way, and the ones I have looked at before
    today use the sign of the result register or the carry flag. On those
    architectures, where the sign is used, mmap(2) cannot return negative
    addresses, or must have a special wrapper.

    Why would the wrapper care if the system call failed? The
    return value from the kernel should be passed through to
    the application as per the POSIX language binding requirements.

    lseek(2) and mmap(2) both require the return of arbitrary 32-bit
    or 64-bit values, including those which when interpreted as signed
    values are negative.

    Clearly POSIX defines the interfaces and the underlying OS and/or
    library functions implement the interfaces. The kernel interface
    to the language library (e.g. libc) is irrelevant to typical programmers, except in the case where it doesn't provide the correct semantics.


    Let's look at what the system call wrappers do on RV64G(C) (which has
    no carry flag). For read(2) the wrapper contains:

    0x3ff7f173be <read+20>: ecall
    0x3ff7f173c2 <read+24>: lui a5,0xfffff
    0x3ff7f173c4 <read+26>: mv s0,a0
    0x3ff7f173c6 <read+28>: bltu a5,a0,0x3ff7f1740e <read+100>

    For dup(2) the wrapper contains:

    0x3ff7e7fe9a <dup+2>: ecall
    0x3ff7e7fe9e <dup+6>: lui a7,0xfffff
    0x3ff7e7fea0 <dup+8>: bltu a7,a0,0x3ff7e7fea6 <dup+14>

    and for mmap(2):

    0x3ff7e86b6e <mmap64+12>: ecall
    0x3ff7e86b72 <mmap64+16>: lui a5,0xfffff
    0x3ff7e86b74 <mmap64+18>: bltu a5,a0,0x3ff7e86b8c <mmap64+42>

    So instead of checking for the sign flag, on RV64G the wrapper checks
    whether the result is above 0xfffffffffffff000 (-4096 viewed as
    unsigned). This costs one instruction more than just checking the
    sign flag, and allows almost doubling the number of bytes read(2) can
    read in one call, the number of file ids that can be returned by
    dup(2), and the address range returnable by
    mmap(2). Will we ever see processes that need more than 8EB? Maybe
    not, but the designers of the RV64G(C) ABI obviously did not want to
    be the ones that are quoted as saying "8EB should be enough for
    anyone":-).

    Followups to comp.arch

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Aug 13 16:10:10 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    That said, Unix generally defined -1 as the return value for all
    other system calls, and code that checked for "< 0" instead of
    -1 when calling a standard library function or system call was fundamentally broken.

    That may be the interface of the C system call wrapper,

    It _is_ the interface that the programmers need to be
    concerned with when using POSIX C language bindings.

    True, but not relevant for the question at hand.

    at the actual system call level, the error is indicated in
    an architecture-specific way, and the ones I have looked at before
    today use the sign of the result register or the carry flag. On those
    architectures, where the sign is used, mmap(2) cannot return negative
    addresses, or must have a special wrapper.

    Why would the wrapper care if the system call failed?

    The actual system call returns an error flag and a register. On some architectures, they support just a register. If there is no error,
    the wrapper returns the content of the register. If the system call
    indicates an error, you see from the value of the register which error
    it is; the wrapper then typically transforms the register in some way
    (e.g., by negating it) and stores the result in errno, and returns -1.

    lseek(2) and mmap(2) both require the return of arbitrary 32-bit
    or 64-bit values, including those which when interpreted as signed
    values are negative.

    For lseek(2):

    | Upon successful completion, lseek() returns the resulting offset
    | location as measured in bytes from the beginning of the file.

    Given that off_t is signed, lseek(2) can only return positive values.

    For mmap(2):

    | On success, mmap() returns a pointer to the mapped area.

    So it's up to the kernel which user-level addresses it returns. E.g.,
    32-bit Linux originally only produced user-level addresses below 2GB.
    When memories grew larger, on some architectures (e.g., i386) Linux
    increased that to 3GB.

    Clearly POSIX defines the interfaces and the underlying OS and/or
    library functions implement the interfaces. The kernel interface
    to the language library (e.g. libc) is irrelevant to typical programmers

    Sure, but system calls are first introduced in real kernels using the
    actual system call interface, and are limited by that interface. And
    that interface is remarkably similar between the early days of Unix
    and recent Linux kernels for various architectures. And when you look
    closely, you find how the system calls are designed to support returning
    the error indication, success value, and errno in one register.

    lseek64 on 32-bit platforms is an exception (the success value does
    not fit in one register), and looking at the machine code of the
    wrapper and comparing it with the machine code for the lseek wrapper,
    some funny things are going on, but I would have to look at the source
    code to understand what is going on. One other interesting thing I
    noticed is that the system call wrappers from libc-2.36 on i386 now
    draw the boundary between success returns and error returns at
    0xfffff000:

    0xf7d853c4 <lseek+68>: call *%gs:0x10
    0xf7d853cb <lseek+75>: cmp $0xfffff000,%eax
    0xf7d853d0 <lseek+80>: ja 0xf7d85410 <lseek+144>

    So now the kernel can produce 4095 error values, and the rest can be
    success values. In particular, mmap() can return all possible page
    addresses as success values with these wrappers. When I last looked
    at how system calls are done, I found just a check of the N or the C
    flag. I wonder how the kernel is informed that it can now return more addresses from mmap().

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 13 17:46:59 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    <snip>
    So how could one capture the PC market? The RISC-VAX would probably
    have been too expensive for a PC, even with an 8-bit data bus and a
    reduced instruction set, along the lines of RV32E. Or maybe that
    would have been feasible, in which case one would provide
    8080->reduced-RISC-VAX and 6502->reduced-RISC-VAX assemblers to make
    porting easier. And then try to sell it to IBM Boca Raton.

    https://en.wikipedia.org/wiki/Rainbow_100

    That's completely different from what I suggest above, and DEC
    obviously did not capture the PC market with that.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch,alt.folklore.computers on Wed Aug 13 17:50:35 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Building an ARM-like instead of a 68000 would have been feasible,
    but the resulting systems would have been more expensive (the
    68000 had 64 pins).

    One could have done a RISC-VAX microprocessor with 16-bit data bus and
    24-bit address bus.

    LSI11?

    The LSI11 uses four 40-pin chips from the MCP-1600 chipset (which is fascinating in itself <https://en.wikipedia.org/wiki/MCP-1600>) for a
    total of 160 pins; and it supported only 16 address bits without extra
    chips. That was certainly even more expensive (and also slower and
    less capable) than what I suggest above, but it was several years
    earlier, and what I envision was not possible in one chip then.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Aug 13 18:15:23 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    That said, Unix generally defined -1 as the return value for all
    other system calls, and code that checked for "< 0" instead of
    -1 when calling a standard library function or system call was fundamentally
    broken.

    That may be the interface of the C system call wrapper,

    It _is_ the interface that the programmers need to be
    concerned with when using POSIX C language bindings.

    True, but not relevant for the question at hand.

    at the actual system call level, the error is indicated in
    an architecture-specific way, and the ones I have looked at before
    today use the sign of the result register or the carry flag. On those
    architectures, where the sign is used, mmap(2) cannot return negative
    addresses, or must have a special wrapper.

    Why would the wrapper care if the system call failed?

    The actual system call returns an error flag and a register. On some
    architectures, they support just a register. If there is no error,
    the wrapper returns the content of the register. If the system call
    indicates an error, you see from the value of the register which error
    it is; the wrapper then typically transforms the register in some way
    (e.g., by negating it) and stores the result in errno, and returns -1.

    lseek(2) and mmap(2) both require the return of arbitrary 32-bit
    or 64-bit values, including those which when interpreted as signed
    values are negative.

    For lseek(2):

    | Upon successful completion, lseek() returns the resulting offset
    | location as measured in bytes from the beginning of the file.

    Given that off_t is signed, lseek(2) can only return positive values.

    Which was addressed by the LFS (Large File Summit), to support
    files > 2GB in size.

    There is also the degenerate case of open("/dev/mem"...) which
    requires lseek support over the entire physical address space
    and /dev/kmem which supports access to the kernel virtual memory
    address space, which on most systems has the high-order bit
    in the address set to one. Personally, I've used pread/pwrite
    in those cases (once 1003.4 was merged) rather than lseek/read
    and lseek/write.



    For mmap(2):

    | On success, mmap() returns a pointer to the mapped area.

    So it's up to the kernel which user-level addresses it returns. E.g.,
    32-bit Linux originally only produced user-level addresses below 2GB.
    When memories grew larger, on some architectures (e.g., i386) Linux
    increased that to 3GB.

    Aside from mmap-ing /dev/mem or /dev/kmem,
    one must also consider the use of MAP_FIXED, when supported,
    where the kernel doesn't choose the mapped address (although
    it is allowed to refuse to map certain ranges).

    The return value for mmap is 'void *'. The only special value
    for mmap(2) is MAP_FAILED (which is the unsigned equivalent of -1)
    which implies that a one-byte mapping at the end of the address
    space isn't supported.

    all that said, my initial point about -1 was that applications
    should always check for -1 (or MAP_FAILED), not for return
    values less than zero. The actual kernel interface to the
    C library is clearly implementation dependent although it
    must preserve the user-visible required semantics.
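
    A short sketch of the distinction (illustrative only; the function
    name and the descriptor fd are placeholders, not taken from any real
    program):

    #include <sys/mman.h>
    #include <unistd.h>

    int check_examples(int fd)
    {
        /* Correct: compare mmap()'s result against MAP_FAILED only. */
        void *p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED)
            return -1;
        /* Broken: "if ((long)p < 0)" would misreport a perfectly good
         * mapping in the upper half of the address space as an error. */

        /* Correct: compare lseek()'s result against (off_t)-1, not < 0. */
        if (lseek(fd, 0, SEEK_END) == (off_t)-1)
            return -1;
        return 0;
    }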
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 13 14:18:06 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Concerning the speed of the 82S100 PLA,
    <http://skoe.de/docs/c64-dissected/pla/c64_pla_dissected_a4ds.pdf>
    reports propagation delays of 25ns-35ns for specific signals in Table
    3.4, and EricP found 50ns "max access" in the data sheet of the
    82S100. That does not sound too slow to be usable in a CPU with 200ns >>>> cycle time, so yes, one could have used that for the VAX.
    Were there different versions, maybe?

    https://deramp.com/downloads/mfe_archive/050-Component%20Specifications/Signetics-Philips/82S100%20FPGA.pdf
    gives an I/O propagation delay of 80 ns max.
    Yes, must be different versions.
    I'm looking at this 1976 datasheet which says 50 ns max access:

    http://www.bitsavers.org/components/signetics/_dataBooks/1976_Signetics_Field_Programmable_Logic_Arrays.pdf

    That is strange. Why would they make the chip worse?

    Unless... maybe somebody (a customer, or they themselves)
    discovered that there may have been conditions where they could
    only guarantee 80 ns. Maybe a combination of tolerances to one
    side and a certain logic programming, and they changed the
    data sheet.

    Manufacturing process variation leads to timing differences that
    testing sorts into speed bins. The faster bins sell at higher price.

    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.
    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,

    Should be free coming from a Flip-Flop.

    Depends on what chips you use for registers.
    If you want both Q and Qb then you only get 4 FF in a package like 74LS375.

    For a wide instruction or stage register I'd look at chips such as a 74LS377 with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock enable, vcc, gnd.

    - optionally wired to 48 16-input AND's,
    - optionally wired to 8 48-input OR's,

    Those would be the the two layers of NAND gates, so depending
    on which ones you chose, you have to add those.

    - with 8 optional XOR output invertors,

    I don't find that in the diagrams (but I might be missing that,
    I am not an expert at reading them).

    - driving 8 tri-state or open collector buffers.

    A 74265 had switching times of max. 18 ns, driving 30
    output loads, so that would be on top.

    One question: Did TTL people actually use the "typical" delays
    from the handbooks, or did they use the maximum delays for their
    designs? Using anything below the maximum would sound dangerous to
    me, but maybe this was possible to a certain extent.

    I didn't use the typical values. Yes, it would be dangerous to use them.
    I never understood why they even quoted those typical numbers.
    I always considered them marketing fluff.

    So I count roughly 7 or 8 equivalent gate delays.

    Another point... if you don't need 16 inputs or 8 outputs, you
    are also paying a lot more. If you have a 6-bit primary opcode,
    you don't need a full 16 bits of input.

    I'm just showing why it was more than just an AND gate.

    I'm still exploring whether it can be variable length instructions or
    has to be fixed 32-bit. In either case all the instruction "code" bits
    (as in op code or function code or whatever) should be checked,
    even if just to verify that should-be-zero bits are zero.

    There would also be instruction buffer Valid bits and other state bits
    like Fetch exception detected, interrupt request, that might feed into
    a bank of PLA's multiple wide and deep.

    Also the decoder would need a lot of these so I doubt we can afford the
    power and heat for H series. That 74H30 typical is 22 mW but the max
    looks like 110 mW each (I_ol output low of 20 mA * 5.5V max).
    74LS30 is 20 ns max, 44 mW max.

    Looking at a TI Bipolar Memory Data Manual from 1977,
    it was about the same speed as say a 256b mask programmable TTL ROM,
    7488A 32w * 8b, 45 ns max access.

    Hmm... did the VAX, for example, actually use them, or were they
    using logic built from conventional chips?

    I wasn't suggesting that. People used to modern CMOS speeds might not appreciate how slow TTL was. I was showing that its 50 ns speed number
    was not out of line with other MSI parts of that day, and just happened
    to have a PDF TTL manual opened on that part so used it as an example.
    A 74181 4-bit ALU is also of similar complexity and 62 ns max.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From ted@loft.tnolan.com (Ted Nolan@tednolan to comp.arch,alt.folklore.computers on Wed Aug 13 18:26:44 2025
    From Newsgroup: comp.arch

    In article <2025Aug13.194659@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    <snip>
    So how could one capture the PC market? The RISC-VAX would probably
    have been too expensive for a PC, even with an 8-bit data bus and a
    reduced instruction set, along the lines of RV32E. Or maybe that
    would have been feasible, in which case one would provide
    8080->reduced-RISC-VAX and 6502->reduced-RISC-VAX assemblers to make
    porting easier. And then try to sell it to IBM Boca Raton.

    https://en.wikipedia.org/wiki/Rainbow_100

    That's completely different from what I suggest above, and DEC
    obviously did not capture the PC market with that.


    They did manage to crack the college market somewhat, where CS departments
    had DEC hardware anyway. I know USC (original) had a Rainbow computer
    lab circa 1985. That "in" didn't translate to anything else though.
    --
    columbiaclosings.com
    What's not in Columbia anymore..
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Aug 13 18:13:35 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    aph@littlepinkcloud.invalid writes:
    I thought the AArch64 ILP32 design was pretty neat, but no one seems
    to have been interested. I guess there wasn't an advantage worth the >>>effort.

    Alpha: On Digital OSF/1 the advantage was to be able to run programs
    that work on ILP32, but not I32LP64.

    I understand what you're saying here, but disagree. A program that
    works on ILP32 but not I32LP64 is fundamentally broken, IMHO.

    In 1992 most C programs worked on ILP32, but not on I32LP64. That's
    because Digital OSF/1 was the first I32LP64 platform, and it only
    appeared in 1992. ILP32 support was a way to increase the amount of
    available software.

    x32: I expect that maintained Unix programs ran on I32LP64 in 2012,
    and unmaintained ones did not get an x32 port anyway. And if there
    are cases where my expectations do not hold, there still is i386. The
    only advantage of x32 was a speed advantage on select programs.

    I suspect that performance advantage was minimal, the primary advantage
    would have been that existing applications didn't need to be rebuilt
    and requalified.

    You certainly have to rebuild for x32. It's a new ABI.

    Aarch64-ILP32: My guess is that the situation is very similar to the
    x32 situation.

    In the early days of AArch64 (2013), we actually built a toolchain to
    support Aarch64-ILP32. Not a single customer exhibited _any_ interest in that
    and the project was dropped.

    Admittedly, there are CPUs without ARM A32/T32

    Very few AArch64 designs included AArch32 support

    If by Aarch32 you mean what ARM now calls the A32 and T32 instruction
    sets (their constant renamings are confusing, but the A64/A32/T32
    naming makes more sense than earlier ones), every ARMv8 core I use
    (A53, A55, A72, A73, A76) includes A32 and T32 support.


    even the Cortex
    chips supported it only at exception level zero (user mode)

    When you run user-mode software, that's what's important. Only kernel developers care about which instruction set kernel mode supports.

    The markets for AArch64 (servers, high-end appliances) didn't have
    a huge existing reservoir of 32-bit ARM applications, so there was
    no demand to support them.

    Actually there is a huge market for CPUs with ARM A32/T32 ISA
    (earlier) and ARM A64 ISA (now): smartphones and tablets. Apparently
    this market has mechanisms that remove software after relatively few
    years and the customers accept it. So the appearance of cores without
    A32/T32 support indicates that the software compiled to A32/T32 has
    been mostly eliminated. Smartphone SoCs typically still contain some
    cores that support A32/T32 (at least last time I read about them), but
    others don't. It's interesting to see which cores support A32/T32 and
    which don't.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 13 14:40:01 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:

    While looking for the handbook, I also found

    http://hps.ece.utexas.edu/pub/patt_micro22.pdf

    which describes some parts of the microarchitecture of the VAX 11/780, 11/750, 8600, and 8800.

    Interestingly, Patt wrote this in 1990, after participating in the HPS
    papers on an OoO implementation of the VAX architecture.

    - anton

    Yes I saw the Patt paper recently. He has written many microarchitecture papers. I was surprised that in 1990 he would say on page 2:

    "All VAXes are microcoded. The richness of the instruction set urges that
    the flexibility of microcoded control be employed, notwithstanding the conventional mythology that hardwired control is somehow faster than
    microcode. It is instructive to point out that (1) hardwired control
    produces higher performance execution only in situations where the
    critical path is in the microsequencing function, and (2) that this
    should not occur in VAX implementations if one designs with the
    well-understood (to microarchitects) technique that the next control
    store address must be obtained from information available at the start
    of the current microcycle. A variation of this basic old technique is
    the recently popularized delayed branch present in many ISA architectures introduced in the last few years."

    When he refers to the "mythology that hardwired control is somehow faster"
    he appears to still be using the monolithic "eyes" I referred to earlier
    in that everything must go through a single microsequencer.
    He compares a hardwired sequential controller to a microcoded sequential controller and notes that in that case hardwired is no faster.

    What he is not doing is comparing multiple parallel hardware stages
    to a sequential controller, hardwired or microcoded.

    Risc brings with it the concurrent hardware stages view.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Wed Aug 13 18:51:15 2025
    From Newsgroup: comp.arch

    In article <MO1nQ.2$Bui1.0@fx10.iad>, Scott Lurndal <slp53@pacbell.net> wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    [snip]
    errno, but at the actual system call level, the error is indicated in
    an architecture-specific way, and the ones I have looked at before
    today use the sign of the result register or the carry flag. On those
    architectures, where the sign is used, mmap(2) cannot return negative
    addresses, or must have a special wrapper.

    Why would the wrapper care if the system call failed? The
    return value from the kernel should be passed through to
    the application as per the POSIX language binding requirements.

    For the branch to `cerror`. That is, the usual reason is (was?)
    to convert from the system call interface to the C ABI,
    specifically, to populate the (userspace, now thread-local)
    `errno` variable if there was an error. (I know you know this,
    Scott, but others reading the discussion may not.)

    Looking at the 32v code for VAX and 7th Edition on the PDP-11,
    on error the kernel returns a non-zero value and sets the carry
    bit in the PSW. The stub checks whether the C bit is set, and
    if so, copies R0 to `errno` and then sets R0 to -1. On the
    PDP-11, `as` supports the non-standard "bec" mnemonic as an
    alias for "bcc" and the stub is actually something like:

    / Do sys call....land in the kernel `trap` in m40.s
    bec 1f
    jmp cerror
    1:
    rts pc

    cerror:
    mov r0, _errno
    mov $-1, r0
    rts pc

    In other words, if the carry bit is not set, the system call
    was successful, so just return whatever it returned. Otherwise,
    the kernel is returning an error to the user, so do the dance of
    setting up `errno` and returning -1.

    (There's some fiddly bits with popping R5, which Unix used as
    the frame pointer, but I omitted those for brevity).

    lseek(2) and mmap(2) both require the return of arbitrary 32-bit
    or 64-bit values, including those which when interpreted as signed
    values are negative.

    At least for lseek, that was true in the 1990 POSIX standard,
    where the programmer was expected to (maybe save and then) clear
    `errno`, invoke `lseek`, and then check the value of `errno`
    after return to see if there was an error, but has been relaxed
    in subsequent editions (including POSIX 2024) where `lseek` now
    must return `EINVAL` if the offset is negative for a regular
    file, directory, or block-special file. (https://pubs.opengroup.org/onlinepubs/9799919799/functions/lseek.html;
    see "ERRORS")

    For mmap, at least the only documented error return value is
    `MAP_FAILED`, and programmers must check for that explicitly.

    It strikes me that this implies that the _value_ of `MAP_FAILED`
    need not be -1; on x86_64, for instance, it _could_ be any
    non-canonical address.

    Clearly POSIX defines the interfaces and the underlying OS and/or
    library functions implement the interfaces. The kernel interface
    to the language library (e.g. libc) is irrelevant to typical programmers,
    except in the case where it doesn't provide the correct semantics.

    Certainly, these are hidden by the system call stubs in the
    libraries for language-specific bindings, and workaday
    programmers should not be trying to side-step those!

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Peter Flass@Peter@Iron-Spring.com to comp.arch,alt.folklore.computers on Wed Aug 13 12:09:35 2025
    From Newsgroup: comp.arch

    On 8/13/25 11:26, Ted Nolan <tednolan> wrote:
    In article <2025Aug13.194659@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    <snip>
    So how could one capture the PC market? The RISC-VAX would probably
    have been too expensive for a PC, even with an 8-bit data bus and a
    reduced instruction set, along the lines of RV32E. Or maybe that
    would have been feasible, in which case one would provide
    8080->reduced-RISC-VAX and 6502->reduced-RISC-VAX assemblers to make
    porting easier. And then try to sell it to IBM Boca Raton.

    https://en.wikipedia.org/wiki/Rainbow_100

    That's completely different from what I suggest above, and DEC
    obviously did not capture the PC market with that.


    They did manage to crack the college market some where CS departments
    had DEC hardware anyway. I know USC (original) had a Rainbow computer
    lab circa 1985. That "in" didn't translate to anything else though.

    Skidmore College was a DEC shop back in the day.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Wed Aug 13 19:25:31 2025
    From Newsgroup: comp.arch

    In article <2025Aug13.181010@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    That said, Unix generally defined -1 as the return value for all
    other system calls, and code that checked for "< 0" instead of
    -1 when calling a standard library function or system call was fundamentally
    broken.

    That may be the interface of the C system call wrapper,

    It _is_ the interface that the programmers need to be
    concerned with when using POSIX C language bindings.

    True, but not relevant for the question at hand.

    at the actual system call level, the error is indicated in
    an architecture-specific way, and the ones I have looked at before
    today use the sign of the result register or the carry flag. On those
    architectures, where the sign is used, mmap(2) cannot return negative
    addresses, or must have a special wrapper.

    Why would the wrapper care if the system call failed?

    The actual system call returns an error flag and a register. On some
    architectures, they support just a register. If there is no error,
    the wrapper returns the content of the register. If the system call
    indicates an error, you see from the value of the register which error
    it is; the wrapper then typically transforms the register in some way
    (e.g., by negating it) and stores the result in errno, and returns -1.

    lseek(2) and mmap(2) both require the return of arbitrary 32-bit
    or 64-bit values, including those which when interpreted as signed
    values are negative.

    For lseek(2):

    | Upon successful completion, lseek() returns the resulting offset
    | location as measured in bytes from the beginning of the file.

    Given that off_t is signed, lseek(2) can only return positive values.

    This is incorrect; or rather, it's accidentally correct now, but
    was not previously. The 1990 POSIX standard did not explicitly
    forbid a file that was so large that the offset could
    overflow, hence why in 1990 POSIX you have to be careful about
    error handling when using `lseek`.

    It is true that POSIX 2024 _does_ prohibit seeking so far that
    the offset would become negative, however. But, POSIX 2024
    (still!!) supports multiple definitions of `off_t` for multiple
    environments, in which overflow is potentially unavoidable.
    This leads to considerable complexity in implementations that
    try to support such multiple environments in their ABI (for
    instance, for backwards compatibility with old programs).

    For mmap(2):

    | On success, mmap() returns a pointer to the mapped area.

    So it's up to the kernel which user-level addresses it returns. E.g.,
    32-bit Linux originally only produced user-level addresses below 2GB.
    When memories grew larger, on some architectures (e.g., i386) Linux
    increased that to 3GB.

    The point is that the programmer shouldn't have to care. The
    programmer should check the return value against MAP_FAILED, and
    if it is NOT that value, then the returned address may be
    assumed valid. If such an address is not actually valid, that
    indicates a bug in the implementation of `mmap`.

    Clearly POSIX defines the interfaces and the underlying OS and/or
    library functions implement the interfaces. The kernel interface
    to the language library (e.g. libc) is irrelevent to typical programmers

    Sure, but system calls are first introduced in real kernels using the
    actual system call interface, and are limited by that interface. And
    that interface is remarkably similar between the early days of Unix
    and recent Linux kernels for various architectures.

    Not precisely. On x86_64, for example, some Unixes use a flag
    bit to determine whether the system call failed, and return
    (positive) errno values; Linux returns negative numbers to
    indicate errors, and constrains those to values between -4095
    and -1.

    Presumably that specific set of values is constrained by `mmap`:
    assuming a minimum 4KiB page size, the last architecturally
    valid address where a page _could_ be mapped is equivalent to
    -4096 and the first is 0. If they did not have that constraint,
    they'd have to treat `mmap` specially in the system call path.

    Linux _could_ decide to define `MAP_FAILED` as
    0x0fff_ffff_0000_0000, which is non-canonical on all extant
    versions of x86-64, even with 5-level paging, but maybe they do
    not because they're anticipating 6-level paging showing up at
    some point.

    And when you look
    closely, you find how the system calls are designed to support returning
    the error indication, success value, and errno in one register.

    lseek64 on 32-bit platforms is an exception (the success value does
    not fit in one register), and looking at the machine code of the
    wrapper and comparing it with the machine code for the lseek wrapper,
    some funny things are going on, but I would have to look at the source
    code to understand what is going on. One other interesting thing I
    noticed is that the system call wrappers from libc-2.36 on i386 now
    draw the boundary between success returns and error returns at
    0xfffff000:

    0xf7d853c4 <lseek+68>: call *%gs:0x10
    0xf7d853cb <lseek+75>: cmp $0xfffff000,%eax
    0xf7d853d0 <lseek+80>: ja 0xf7d85410 <lseek+144>

    So now the kernel can produce 4095 error values, and the rest can be
    success values. In particular, mmap() can return all possible page
    addresses as success values with these wrappers. When I last looked
    at how system calls are done, I found just a check of the N or the C
    flag.

    Yes; see above.

    I wonder how the kernel is informed that it can now return more
    addresses from mmap().

    Assuming you mean the Linux kernel, when it loads an ELF
    executable, the binary image itself is "branded" with an ABI
    type that it can use to make that determination.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch,alt.folklore.computers on Wed Aug 13 19:35:09 2025
    From Newsgroup: comp.arch

    In comp.arch Scott Lurndal <scott@slp53.sl.home> wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Stephen Fuld wrote:
    On 8/4/2025 8:32 AM, John Ames wrote:

    snip

    This notion that the only advantage of a 64-bit architecture is a large
    address space is very curious to me. Obviously that's *one* advantage,
    but while I don't know the in-the-field history of heavy-duty business/
    scientific computing the way some folks here do, I have not gotten the
    impression that a lot of customers were commonly running up against the
    4 GB limit in the early '90s;

    Not exactly the same, but I recall an issue with Windows NT where it
    initially divided the 4GB address space in 2 GB for the OS, and 2GB for
    users.  Some users were "running out of address space", so Microsoft
    came up with an option to reduce the OS space to 1 GB, thus allowing up
    to 3 GB for users.  I am sure others here will know more details.

    Any program written to Microsoft/Windows spec would work transparently
    with a 3:1 split, the problem was all the programs ported from unix
    which assumed that any negative return value was a failure code.

    The only interfaces that I recall this being an issue for were
    mmap(2) and lseek(2). The latter was really related to maximum
    file size (although it applied to /dev/[k]mem and /proc/<pid>/mem
    as well). The former was handled by the standard specifying
    MAP_FAILED as the return value.

    That said, Unix generally defined -1 as the return value for all
    other system calls, and code that checked for "< 0" instead of
    -1 when calling a standard library function or system call was fundamentally broken.

    I remember RIM. When I compiled it on Linux and tried it I got an error
    due to a check for "< 0". Changing it to "== -1" fixed it. Possibly there
    were similar troubles in other programs that I do not remember.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Wed Aug 13 19:40:17 2025
    From Newsgroup: comp.arch

    In article <%C4nQ.6540$CQJe.2438@fx14.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    [snip]
    all that said, my initial point about -1 was that applications
    should always check for -1 (or MAP_FAILED), not for return
    values less than zero. The actual kernel interface to the
    C library is clearly implementation dependent although it
    must preserve the user-visible required semantics.

    For some reason, I have a vague memory of reading somewhere that
    it was considered "more robust" to check for a negative return
    value, and not just -1 specifically. Perhaps this was just
    superstition, or perhaps someone had been bit by an overly
    permissive environment. It certainly seems like advice that we
    can safely discard at this point.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Aug 13 20:23:53 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    Unlesss... maybe somebody (a customer, or they themselves)
    discovered that there may have been conditions where they could
    only guarantee 80 ns. Maybe a combination of tolerances to one
    side and a certain logic programming, and they changed the
    data sheet.

    Manufacturing process variation leads to timing differences that
    testing sorts into speed bins. The faster bins sell at higher price.

    Is that possible with a PAL before it has been programmed?


    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can undersand people complaining that PALs were slow.
    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input invertor,

    Should be free coming from a Flip-Flop.

    Depends on what chips you use for registers.
    If you want both Q and Qb then you only get 4 FF in a package like 74LS375.

    For a wide instruction or stage register I'd look at chips such as a 74LS377 with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock enable, vcc, gnd.

    So if you need eight outputs, your choice is to use two 74LS375s
    (presumably more expensive) or a 74LS377 and an eight-bit inverter
    chip (a bit slower, but inverters should be fast).

    Another point... if you don't need 16 inputs or 8 outputs, you
    are also paying a lot more. If you have a 6-bit primary opcode,
    you don't need a full 16 bits of input.

    I'm just showing why it was more than just an AND gate.

    Two layers of NAND :-)

    I'm still exploring whether it can be variable length instructions or
    has to be fixed 32-bit. In either case all the instruction "code" bits
    (as in op code or function code or whatever) should be checked,
    even if just to verify that should-be-zero bits are zero.

    There would also be instruction buffer Valid bits and other state bits
    like Fetch exception detected, interrupt request, that might feed into
    a bank of PLA's multiple wide and deep.

    Agreed, the logic has to go somewhere. Regularity in the
    instruction set would have been even more important then than now
    to reduce the logic requirements for decoding.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Aug 13 20:28:13 2025
    From Newsgroup: comp.arch

    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <MO1nQ.2$Bui1.0@fx10.iad>, Scott Lurndal <slp53@pacbell.net> wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:

    For mmap, at least the only documented error return value is
    `MAP_FAILED`, and programmers must check for that explicitly.

    It strikes me that this implies that the _value_ of `MAP_FAILED`
    need not be -1; on x86_64, for instance, it _could_ be any
    non-canonical address.

    And in the very unlikely case that a C compiler was developed
    for the Burroughs B4900, MAP_FAILED could be 0xC0EEEEEE (which
    is how the NULL pointer was encoded in the hardware). Because
    all the data was BCD, undigits (a-f) in an address were
    unconditionally illegal.

    There were instructions to search linked lists, so the hardware
    needed to understand the concept of a NULL pointer (as well
    as deal with the possibility of a loop, using a timer while
    the search instruction was executing).

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Aug 13 21:23:34 2025
    From Newsgroup: comp.arch

    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <2025Aug13.181010@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    For lseek(2):

    | Upon successful completion, lseek() returns the resulting offset
    | location as measured in bytes from the beginning of the file.

    Given that off_t is signed, lseek(2) can only return positive values.

    This is incorrect; or rather, it's accidentally correct now, but
    was not previously. The 1990 POSIX standard did not explicitly
    forbid a file that was so large that the offset could
    overflow, hence why in 1990 POSIX you have to be careful about
    error handling when using `lseek`.

    It is true that POSIX 2024 _does_ prohibit seeking so far that
    the offset would become negative, however.

    I don't think that this is accidental. In 1990 signed overflow had
    reliable behaviour on common 2s-complement hardware with the C
    compilers of the day. Nowadays the exotic hardware where this would
    not work that way has almost completely died out (and C is not used on
    the remaining exotic hardware), but now compilers sometimes do funny
    things on integer overflow, so better don't go there or anywhere near
    it.

    But, POSIX 2024
    (still!!) supports multiple definitions of `off_t` for multiple
    environments, in which overflow is potentially unavoidable.

    POSIX also has the EOVERFLOW error for exactly that case.

    Bottom line: The off_t returned by lseek(2) is signed and always
    positive.
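
    A hedged illustration of that EOVERFLOW case, assuming a build in
    which off_t is 32 bits wide (fd and the helper name are placeholders):

    #include <errno.h>
    #include <unistd.h>

    /* With a 32-bit off_t the largest representable offset is 2^31 - 1.
     * Seeking one byte further cannot be represented, so lseek() fails
     * with EOVERFLOW instead of handing back a wrapped-around negative. */
    int seek_past_2gb(int fd)
    {
        if (lseek(fd, 0x7fffffffL, SEEK_SET) == (off_t)-1)
            return -1;                  /* not expected for a regular file */
        if (lseek(fd, 1, SEEK_CUR) == (off_t)-1 && errno == EOVERFLOW)
            return 0;                   /* the documented behaviour */
        return 1;
    }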

    For mmap(2):

    | On success, mmap() returns a pointer to the mapped area.

    So it's up to the kernel which user-level addresses it returns. E.g.,
    32-bit Linux originally only produced user-level addresses below 2GB.
    When memories grew larger, on some architectures (e.g., i386) Linux
    increased that to 3GB.

    The point is that the programmer shouldn't have to care.

    True, but completely misses the point.

    Sure, but system calls are first introduced in real kernels using the >>actual system call interface, and are limited by that interface. And
    that interface is remarkably similar between the early days of Unix
    and recent Linux kernels for various architectures.

    Not precisely. On x86_64, for example, some Unixes use a flag
    bit to determine whether the system call failed, and return
    (positive) errno values; Linux returns negative numbers to
    indicate errors, and constrains those to values between -4095
    and -1.

    Presumably that specific set of values is constrained by `mmap`:
    assuming a minimum 4KiB page size, the last architecturally
    valid address where a page _could_ be mapped is equivalent to
    -4096 and the first is 0. If they did not have that constraint,
    they'd have to treat `mmap` specially in the system call path.

    I am pretty sure that in the old times, Linux-i386 indicated failure
    by returning a value with the MSB set, and the wrapper just checked
    whether the return value was negative. And for mmap() that worked
    because user-mode addresses were all below 2GB. Addresses further up
    were reserved for the kernel.

    I wonder how the kernel is informed that it can now return more
    addresses from mmap().

    Assuming you mean the Linux kernel, when it loads an ELF
    executable, the binary image itself is "branded" with an ABI
    type that it can use to make that determination.

    I have checked that with binaries compiled in 2003 and 2000:

    -rwxr-xr-x 1 root root 44660 Sep 26 2000 /usr/local/bin/gforth-0.5.0*
    -rwxr-xr-x 1 root root 92352 Sep 7 2003 /usr/local/bin/gforth-0.6.2*

    [~:160080] file /usr/local/bin/gforth-0.5.0
    /usr/local/bin/gforth-0.5.0: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, stripped
    [~:160081] file /usr/local/bin/gforth-0.6.2
    /usr/local/bin/gforth-0.6.2: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, for GNU/Linux 2.0.0, stripped

    So there is actually a difference between these two. However, if I
    just strace them as they are now, they both happily produce very high
    addresses with mmap, e.g.,

    mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf7f64000

    I don't know what the difference is between "for GNU/Linux 2.0.0" and
    not having that, but the addresses produced by mmap() seem unaffected.

    However, by calling the binaries with setarch -L, mmap() returns only
    addresses < 2GB in all calls I have looked at. I guess if I had
    statically linked binaries, i.e., with old system call wrappers, I
    would have to use

    setarch -L <binary>

    to make it work properly with mmap(). Or maybe Linux is smart enough
    to do it by itself when it encounters a statically-linked old binary.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Aug 14 07:58:41 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    I am pretty sure that in the old times, Linux-i386 indicated failure
    by returning a value with the MSB set, and the wrapper just checked
    whether the return value was negative.

    I have now checked this by chrooting into an old Red Hat 6.2 system
    (not RHEL) with glibc-2.1.3 (released in Feb 2000) and its system call wrappers. And already those wrappers use the current way of
    determining whether a system call returns an error or not:

    For mmap():

    0xf7fd984b <__mmap+11>: int $0x80
    0xf7fd984d <__mmap+13>: mov %edx,%ebx
    0xf7fd984f <__mmap+15>: cmp $0xfffff000,%eax
    0xf7fd9854 <__mmap+20>: ja 0xf7fd9857 <__mmap+23>

    Bottom line: If Linux-i386 ever had a different way of determining
    whether a system call has an error result, it was changed to the
    current way early on. Given that IIRC I looked into that later than
    in 2000, my memory is obviously not of Linux. I must have looked at
    source code for a different system.

    Actually, the whole wrapper is short enough to easily understand what
    is going on:

    0xf7fd9840 <__mmap>: mov %ebx,%edx
    0xf7fd9842 <__mmap+2>: mov $0x5a,%eax
    0xf7fd9847 <__mmap+7>: lea 0x4(%esp,1),%ebx
    0xf7fd984b <__mmap+11>: int $0x80
    0xf7fd984d <__mmap+13>: mov %edx,%ebx
    0xf7fd984f <__mmap+15>: cmp $0xfffff000,%eax
    0xf7fd9854 <__mmap+20>: ja 0xf7fd9857 <__mmap+23>
    0xf7fd9856 <__mmap+22>: ret
    0xf7fd9857 <__mmap+23>: push %ebx
    0xf7fd9858 <__mmap+24>: call 0xf7fd985d <__mmap+29>
    0xf7fd985d <__mmap+29>: pop %ebx
    0xf7fd985e <__mmap+30>: xor %edx,%edx
    0xf7fd9860 <__mmap+32>: add $0x400b,%ebx
    0xf7fd9866 <__mmap+38>: sub %eax,%edx
    0xf7fd9868 <__mmap+40>: push %edx
    0xf7fd9869 <__mmap+41>: call 0xf7fd7f80 <__errno_location>
    0xf7fd986e <__mmap+46>: pop %ecx
    0xf7fd986f <__mmap+47>: pop %ebx
    0xf7fd9870 <__mmap+48>: mov %ecx,(%eax)
    0xf7fd9872 <__mmap+50>: or $0xffffffff,%eax
    0xf7fd9875 <__mmap+53>: jmp 0xf7fd9856 <__mmap+22>

    One interesting difference from the current way of invoking a system
    call is that (as far as I understand the wrapper) the wrapper loads
    the arguments from memory (IA-32 ABI passes parameters on the stack)
    into registers and then performs the system call in some newfangled
    way, whereas here the arguments are left in memory, and apparently a
    pointer to the first argument is passed in %ebx; the system call is
    invoked in the old way: int $0x80.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Aug 14 13:28:31 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    Bottom line: If Linux-i386 ever had a different way of determining
    whether a system call has an error result, it was changed to the
    current way early on. Given that IIRC I looked into that later than
    in 2000, my memory is obviously not of Linux. I must have looked at
    source code for a different system.

    I looked around and found
    <2016Sep18.100027@mips.complang.tuwien.ac.at>. I mentioned the Linux
    approach there, but apparently it did not stick in my memory. I
    linked to <http://stackoverflow.com/questions/36845866/history-of-using-negative-errno-values-in-gnu>,
    and there fuz writes:

    |Historically, system calls returned either a positive value (in case
    |of success) or a negative value indicating an error code. This has
    |been the case from the very beginning of UNIX as far as I'm concerned.

    and Steve Summit earlier writes essentially the same. But Lars
    Brinkhoff read my posting and contradicted Steve Summit and fuz, e.g.:

    |PDP-11 Unix V1 does not do this. When there's an error, the system
    |call sets the carry flag in the status register, and returns the error
    |code in register R0. On success, the carry flag is cleared, and R0
    |holds a return value. Unix V7 does the same.

    Why do I know he read my posting? Because he wrote a followup: <868tunivr0.fsf@molnjunk.nocrew.org>.

    In <2016Sep20.160042@mips.complang.tuwien.ac.at> I wrote:

    |Some Linux ports use a second register to indicate that there is an
    |error, and SPARC even uses the carry flag.

    So apparently I had looked at the source code of the C wrappers (or of
    the Linux kernel) at that point. I definitely remember finding this
    in some source code.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Thu Aug 14 15:14:56 2025
    From Newsgroup: comp.arch

    In article <2025Aug13.232334@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <2025Aug13.181010@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    For lseek(2):

    | Upon successful completion, lseek() returns the resulting offset
    | location as measured in bytes from the beginning of the file.

    Given that off_t is signed, lseek(2) can only return positive values.

    This is incorrect; or rather, it's accidentally correct now, but
    was not previously. The 1990 POSIX standard did not explicitly
    forbid a file so large that the offset could overflow, hence why
    in 1990 POSIX you have to be careful about
    error handling when using `lseek`.

    It is true that POSIX 2024 _does_ prohibit seeking so far that
    the offset would become negative, however.

    I don't think that this is accidental. In 1990 signed overflow had
    reliable behaviour on common 2s-complement hardware with the C
    compilers of the day.

    This is simply not true. If anything, there was more variety of
    hardware supported by C90, and some of those systems were 1's
    complement or sign/mag, not 2's complement. Consequently,
    signed integer overflow has _always_ had undefined behavior in
    ANSI/ISO C.

    However, conversion from signed to unsigned has always been
    well-defined, and follows effectively 2's complement semantics.

    Conversion from unsigned to signed is a bit more complex, and is
    implementation-defined, but not UB. Given that the system call
    interface is necessarily deeply intertwined with the implementation,
    I see no reason why the semantics of signed overflow should be
    an issue here.
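
    A small illustration of the two conversion directions at issue here
    (my own example values, not from the thread); the unsigned-to-signed
    case is the implementation-defined one:

    #include <stdio.h>

    int main(void)
    {
        int s = -1;
        unsigned u = (unsigned) s;   /* fully defined: reduction modulo 2^N,
                                        so UINT_MAX on any conforming C */

        unsigned big = 0xfffff000u;  /* out of range for 32-bit int */
        int back = (int) big;        /* implementation-defined result;
                                        -4096 on the usual 2's-complement ABIs */

        printf("%u %d\n", u, back);
        return 0;
    }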

    Nowadays the exotic hardware where this would
    not work that way has almost completely died out (and C is not used on
    the remaining exotic hardware),

    If by "C is not used" you mean newer editions of the C standard
    are not used on very old computers with strange representations
    of signed integers, then maybe.

    but now compilers sometimes do funny
    things on integer overflow, so better don't go there or anywhere near
    it.

    This isn't about signed overflow. The issue here is conversion
    of an unsigned value to signed; almost certainly, the kernel
    performs the calculation of the actual file offset using
    unsigned arithmetic, and relies on the (assembler, mind you)
    system call stubs to map those to the appropriate userspace
    type.

    I think this is mostly irrelevant, as the system call stub,
    almost by necessity, must be written in assembler in order to
    have precise control over the use of specific registers and so
    on. From C's perspective, a program making a system call just
    calls some function that's defined to return a signed integer;
    the assembler code that swizzles the register that integer will
    be extracted from sets things up accordingly. In other words,
    the conversion operation that the C standard mentions isn't at
    play, since the code that does the "conversion" is in assembly.
    Again from C's perspective the return value of the syscall stub
    function is already signed with no need of conversion.

    No, for `lseek`, the POSIX rationale explains the reasoning here
    quite clearly: the 1990 standard permitted negative offsets, and
    programs were expected to accommodate this by special handling
    of `errno` before and after calls to `lseek` that returned
    negative values. This was deemed onerous and fragile, so they
    modified the standard to prohibit calls that would result in
    negative offsets.

    But, POSIX 2024
    (still!!) supports multiple definitions of `off_t` for multiple
    environments, in which overflow is potentially unavoidable.

    POSIX also has the EOVERFLOW error for exactly that case.

    Bottom line: The off_t returned by lseek(2) is signed and always
    positive.

    As I said earlier, post POSIX.1-1990, this is true.

    For mmap(2):

    | On success, mmap() returns a pointer to the mapped area.

    So it's up to the kernel which user-level addresses it returns. E.g.,
    32-bit Linux originally only produced user-level addresses below 2GB.
    When memories grew larger, on some architectures (e.g., i386) Linux
    increased that to 3GB.

    The point is that the programmer shouldn't have to care.

    True, but completely misses the point.

    I don't see why. You were talking about the system call stubs,
    which run in userspace, and are responsible for setting up state
    so that the kernel can perform some requested action on entry,
    whether by trap, call gate, or special instruction, and then for
    tearing down that state and handling errors on return from the
    kernel.

    For mmap, there is exactly one value that may be returned from
    its stub that indicates an error; any other value, by
    definition, represents a valid mapping. Whether such a mapping
    falls in the first 2G, 3G, anything except the upper 256MiB, or
    some hole in the middle is the part that's irrelevant, and
    focusing on that misses the main point: all the stub has to do
    is detect the error, using whatever convention the kernel
    specifies for communicating such things back to the program, and
    ensure that in an error case, MAP_FAILED is returned from the
    stub and `errno` is set appropriately. Everything else is
    superfluous.

    Sure, but system calls are first introduced in real kernels using the
    actual system call interface, and are limited by that interface. And
    that interface is remarkably similar between the early days of Unix
    and recent Linux kernels for various architectures.

    Not precisely. On x86_64, for example, some Unixes use a flag
    bit to determine whether the system call failed, and return
    (positive) errno values; Linux returns negative numbers to
    indicate errors, and constrains those to values between -4095
    and -1.

    Presumably that specific set of values is constrained by `mmap`:
    assuming a minimum 4KiB page size, the last architecturally
    valid address where a page _could_ be mapped is equivalent to
    -4096 and the first is 0. If they did not have that constraint,
    they'd have to treat `mmap` specially in the system call path.
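
    For concreteness, a sketch (hypothetical helper name, not actual libc
    source) of the check such a stub performs on the raw return value:

    #include <errno.h>

    /* Map a raw Linux system call result to the C convention of
       "-1 and errno".  Any value in [-4095, -1], viewed as unsigned,
       denotes an error; everything else is a successful result. */
    static long syscall_ret(unsigned long raw)
    {
        if (raw > -4096UL) {      /* i.e. raw is one of -4095 .. -1 */
            errno = -(long) raw;
            return -1;
        }
        return (long) raw;
    }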

    I am pretty sure that in the old times, Linux-i386 indicated failure
    by returning a value with the MSB set, and the wrapper just checked
    whether the return value was negative. And for mmap() that worked
    because user-mode addresses were all below 2GB. Addresses further up
    were reserved for the kernel.

    Define "Linux-i386" in this case. For the kernel, I'm confident
    that was NOT the case, and it is easy enough to research, since
    old kernel versions are online. Looking at e.g. 0.99.15, one
    can see that they set the carry bit in the flags register to
    indicate an error, along with returning a negative errno value: https://kernel.googlesource.com/pub/scm/linux/kernel/git/nico/archive/+/refs/tags/v0.99.15/kernel/sys_call.S

    By 2.0, they'd stopped setting the carry bit, though they
    continued to clear it on entry.

    But remember, `mmap` returns a pointer, not an integer, relying
    on libc to do the necessary translation between whatever the
    kernel returns and what the program expects. So if the behavior
    you describe were anywhere, it would be in libc. Given that
    they have, and had, a mechanism for signaling an error
    independent of C already, and necessarily the fixup of the
    return value must happen in the syscall stub in whatever library
    the system used, relying solely on negative values to detect
    errors seems like a poor design decision for a C library.

    So if what you're saying were true, such a check would have to
    be in the userspace library that provides the syscall stubs; the
    kernel really doesn't care. I don't know what version libc
    Torvalds started with, or if he did his own bespoke thing
    initially or something, but looking at some commonly used C
    libraries of a certain age, such as glibc 2.0 from 1997-ish, one
    can see that they're explicitly testing the error status against
    -4095 (as an unsigned value) in the stub. (e.g., in sysdeps/unix/sysv/linux/i386/syscall.S).

    But glibc-1.06.1 is a different story, and _does_ appear to
    simply test whether the return value is negative and then jump
    to an error handler if so. So mmap may have worked incidentally
    due to the restriction on where in the address space it would
    place a mapping in very early kernel versions, as you described,
    but that's a library issue, not a kernel issue: again, the
    kernel doesn't care.

    The old version of libc5 available on kernel.org is similar; it
    looks like HJ Lu changed the error handling path to explicitly
    compare against -4095 in October of 1996.

    So, fixed in the most common libc's used with Linux on i386 for
    nearly 30 years, well before the existence of x86_64.

    I wonder how the kernel is informed that it can now return more
    addresses from mmap().

    Assuming you mean the Linux kernel, when it loads an ELF
    executable, the binary image itself is "branded" with an ABI
    type that it can use to make that determination.

    I have checked that with binaries compiled in 2003 and 2000:

    -rwxr-xr-x 1 root root 44660 Sep 26 2000 /usr/local/bin/gforth-0.5.0*
    -rwxr-xr-x 1 root root 92352 Sep 7 2003 /usr/local/bin/gforth-0.6.2*

    [~:160080] file /usr/local/bin/gforth-0.5.0
    /usr/local/bin/gforth-0.5.0: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, stripped
    [~:160081] file /usr/local/bin/gforth-0.6.2
    /usr/local/bin/gforth-0.6.2: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, for
    GNU/Linux 2.0.0, stripped

    So there is actually a difference between these two. However, if I
    just strace them as they are now, they both happily produce very high
    addresses with mmap, e.g.,

    mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf7f64000

    I don't see any reason why it wouldn't.

    I don't know what the difference is between "for GNU/Linux 2.0.0" and
    not having that,

    `file` is pulling that from a `PT_NOTE` segment defined in the
    program header for that second file. A better tool for picking
    apart the details of those binaries is probably `objdump`.

    I'm mildly curious what version of libc those are linked against
    (e.g., as reported by `ldd`).

    but the addresses produced by mmap() seem unaffected.

    I don't see why it would be. Any common libc post 1997-ish
    handles errors in a way that permits this to work correctly. If
    you tried glibc 1.0, it might be a different story, but the
    Linux folks forked that in 1994 and modified it as "Linux libc"
    and the

    However, by calling the binaries with setarch -L, mmap() returns only
    addresses < 2GB in all calls I have looked at. I guess if I had
    statically linked binaries, i.e., with old system call wrappers, I
    would have to use

    setarch -L <binary>

    to make it work properly with mmap(). Or maybe Linux is smart enough
    to do it by itself when it encounters a statically-linked old binary.

    Unclear without looking at the kernel source code, but possibly.
    `setarch -L` turns on the "legacy" virtual address space layout,
    but I suspect that the number of binaries that _actually care_
    is pretty small, indeed.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Thu Aug 14 15:25:13 2025
    From Newsgroup: comp.arch

    In article <107kuhg$8ks$1@reader1.panix.com>,
    Dan Cross <cross@spitfire.i.gajendra.net> wrote:
    In article <2025Aug13.232334@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <2025Aug13.181010@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    For lseek(2):

    | Upon successful completion, lseek() returns the resulting offset
    | location as measured in bytes from the beginning of the file.

    Given that off_t is signed, lseek(2) can only return positive values.

    This is incorrect; or rather, it's accidentally correct now, but
    was not previously. The 1990 POSIX standard did not explicitly
    forbid a file so large that the offset could overflow, hence why
    in 1990 POSIX you have to be careful about
    error handling when using `lseek`.

    It is true that POSIX 2024 _does_ prohibit seeking so far that
    the offset would become negative, however.

    I don't think that this is accidental. In 1990 signed overflow had
    reliable behaviour on common 2s-complement hardware with the C
    compilers of the day.

    This is simply not true. If anything, there was more variety of
    hardware supported by C90, and some of those systems were 1's
    complement or sign/mag, not 2's complement. Consequently,
    signed integer overflow has _always_ had undefined behavior in
    ANSI/ISO C.

    However, conversion from signed to unsigned has always been
    well-defined, and follows effectively 2's complement semantics.

    Conversion from unsigned to signed is a bit more complex, and is
    implementation-defined, but not UB. Given that the system call
    interface is necessarily deeply intertwined with the implementation,
    I see no reason why the semantics of signed overflow should be
    an issue here.

    Nowadays the exotic hardware where this would
    not work that way has almost completely died out (and C is not used on
    the remaining exotic hardware),

    If by "C is not used" you mean newer editions of the C standard
    are not used on very old computers with strange representations
    of signed integers, then maybe.

    but now compilers sometimes do funny
    things on integer overflow, so better don't go there or anywhere near
    it.

    This isn't about signed overflow. The issue here is conversion
    of an unsigned value to signed; almost certainly, the kernel
    performs the calculation of the actual file offset using
    unsigned arithmetic, and relies on the (assembler, mind you)
    system call stubs to map those to the appropriate userspace
    type.

    I think this is mostly irrelevant, as the system call stub,
    almost by necessity, must be written in assembler in order to
    have precise control over the use of specific registers and so
    on. From C's perspective, a program making a system call just
    calls some function that's defined to return a signed integer;
    the assembler code that swizzles the register that integer will
    be extracted from sets things up accordingly. In other words,
    the conversion operation that the C standard mentions isn't at
    play, since the code that does the "conversion" is in assembly.
    Again from C's perspective the return value of the syscall stub
    function is already signed with no need of conversion.

    No, for `lseek`, the POSIX rationale explains the reasoning here
    quite clearly: the 1990 standard permitted negative offsets, and
    programs were expected to accommodate this by special handling
    of `errno` before and after calls to `lseek` that returned
    negative values. This was deemed onerous and fragile, so they
    modified the standard to prohibit calls that would result in
    negative offsets.

    But, POSIX 2024
    (still!!) supports multiple definitions of `off_t` for multiple
    environments, in which overflow is potentially unavoidable.

    POSIX also has the EOVERFLOW error for exactly that case.

    Bottom line: The off_t returned by lseek(2) is signed and always
    positive.

    As I said earlier, post POSIX.1-1990, this is true.

    For mmap(2):

    | On success, mmap() returns a pointer to the mapped area.

    So it's up to the kernel which user-level addresses it returns. E.g.,
    32-bit Linux originally only produced user-level addresses below 2GB.
    When memories grew larger, on some architectures (e.g., i386) Linux
    increased that to 3GB.

    The point is that the programmer shouldn't have to care.

    True, but completely misses the point.

    I don't see why. You were talking about the system call stubs,
    which run in userspace, and are responsible for setting up state
    so that the kernel can perform some requested action on entry,
    whether by trap, call gate, or special instruction, and then for
    tearing down that state and handling errors on return from the
    kernel.

    For mmap, there is exactly one value that may be returned from
    its stub that indicates an error; any other value, by
    definition, represents a valid mapping. Whether such a mapping
    falls in the first 2G, 3G, anything except the upper 256MiB, or
    some hole in the middle is the part that's irrelevant, and
    focusing on that misses the main point: all the stub has to do
    is detect the error, using whatever convention the kernel
    specifies for communicating such things back to the program, and
    ensure that in an error case, MAP_FAILED is returned from the
    stub and `errno` is set appropriately. Everything else is
    superfluous.

    Sure, but system calls are first introduced in real kernels using the
    actual system call interface, and are limited by that interface. And
    that interface is remarkably similar between the early days of Unix
    and recent Linux kernels for various architectures.

    Not precisely. On x86_64, for example, some Unixes use a flag
    bit to determine whether the system call failed, and return
    (positive) errno values; Linux returns negative numbers to
    indicate errors, and constrains those to values between -4095
    and -1.

    Presumably that specific set of values is constrained by `mmap`:
    assuming a minimum 4KiB page size, the last architecturally
    valid address where a page _could_ be mapped is equivalent to
    -4096 and the first is 0. If they did not have that constraint,
    they'd have to treat `mmap` specially in the system call path.

    I am pretty sure that in the old times, Linux-i386 indicated failure
    by returning a value with the MSB set, and the wrapper just checked
    whether the return value was negative. And for mmap() that worked
    because user-mode addresses were all below 2GB. Addresses further up
    were reserved for the kernel.

    Define "Linux-i386" in this case. For the kernel, I'm confident
    that was NOT the case, and it is easy enough to research, since
    old kernel versions are online. Looking at e.g. 0.99.15, one
    can see that they set the carry bit in the flags register to
    indicate an error, along with returning a negative errno value:
    https://kernel.googlesource.com/pub/scm/linux/kernel/git/nico/archive/+/refs/tags/v0.99.15/kernel/sys_call.S

    By 2.0, they'd stopped setting the carry bit, though they
    continued to clear it on entry.

    But remember, `mmap` returns a pointer, not an integer, relying
    on libc to do the necessary translation between whatever the
    kernel returns and what the program expects. So if the behavior
    you describe were anywhere, it would be in libc. Given that
    they have, and had, a mechanism for signaling an error
    independent of C already, and necessarily the fixup of the
    return value must happen in the syscall stub in whatever library
    the system used, relying solely on negative values to detect
    errors seems like a poor design decision for a C library.

    So if what you're saying were true, such a check would have to
    be in the userspace library that provides the syscall stubs; the
    kernel really doesn't care. I don't know what version libc
    Torvalds started with, or if he did his own bespoke thing
    initially or something, but looking at some commonly used C
    libraries of a certain age, such as glibc 2.0 from 1997-ish, one
    can see that they're explicitly testing the error status against
    -4095 (as an unsigned value) in the stub. (e.g., in
    sysdeps/unix/sysv/linux/i386/syscall.S).

    But glibc-1.06.1 is a different story, and _does_ appear to
    simply test whether the return value is negative and then jump
    to an error handler if so. So mmap may have worked incidentally
    due to the restriction on where in the address space it would
    place a mapping in very early kernel versions, as you described,
    but that's a library issue, not a kernel issue: again, the
    kernel doesn't care.

    The old version of libc5 available on kernel.org is similar; it
    looks like HJ Lu changed the error handling path to explicitly
    compare against -4095 in October of 1996.

    So, fixed in the most common libc's used with Linux on i386 for
    nearly 30 years, well before the existence of x86_64.

    I wonder how the kernel is informed that it can now return more >>>>addresses from mmap().

    Assuming you mean the Linux kernel, when it loads an ELF
    executable, the binary image itself is "branded" with an ABI
    type that it can use to make that determination.

    I have checked that with binaries compiled in 2003 and 2000:

    -rwxr-xr-x 1 root root 44660 Sep 26 2000 /usr/local/bin/gforth-0.5.0*
    -rwxr-xr-x 1 root root 92352 Sep 7 2003 /usr/local/bin/gforth-0.6.2*

    [~:160080] file /usr/local/bin/gforth-0.5.0
    /usr/local/bin/gforth-0.5.0: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, stripped
    [~:160081] file /usr/local/bin/gforth-0.6.2
    /usr/local/bin/gforth-0.6.2: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, for
    GNU/Linux 2.0.0, stripped

    So there is actually a difference between these two. However, if I
    just strace them as they are now, they both happily produce very high
    addresses with mmap, e.g.,

    mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xf7f64000

    I don't see any reason why it wouldn't.

    I don't know what the difference is between "for GNU/Linux 2.0.0" and
    not having that,

    `file` is pulling that from a `PT_NOTE` segment defined in the
    program header for that second file. A better tool for picking
    apart the details of those binaries is probably `objdump`.

    I'm mildly curious what version of libc those are linked against
    (e.g., as reported by `ldd`).

    but the addresses produced by mmap() seem unaffected.

    I don't see why it would be. Any common libc post 1997-ish
    handles errors in a way that permits this to work correctly. If
    you tried glibc 1.0, it might be a different story, but the
    Linux folks forked that in 1994 and modified it as "Linux libc"
    and the

    ...and the Linux folks changed this to the present mechanism in
    1996.

    (Sorry 'bout that.)

    However, by calling the binaries with setarch -L, mmap() returns only
    addresses < 2GB in all calls I have looked at. I guess if I had
    statically linked binaries, i.e., with old system call wrappers, I
    would have to use

    setarch -L <binary>

    to make it work properly with mmap(). Or maybe Linux is smart enough
    to do it by itself when it encounters a statically-linked old binary.

    Unclear without looking at the kernel source code, but possibly.
    `setarch -L` turns on the "legacy" virtual address space layout,
    but I suspect that the number of binaries that _actually care_
    is pretty small, indeed.

    - Dan C.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Aug 14 15:32:40 2025
    From Newsgroup: comp.arch

    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <2025Aug13.232334@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <2025Aug13.181010@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    For lseek(2):

    | Upon successful completion, lseek() returns the resulting offset
    | location as measured in bytes from the beginning of the file.

    Given that off_t is signed, lseek(2) can only return positive values.

    This is incorrect; or rather, it's accidentally correct now, but
    was not previously. The 1990 POSIX standard did not explicitly
    forbid a file that was so large that the offset couldn't
    overflow, hence why in 1990 POSIX you have to be careful about
    error handling when using `lseek`.

    It is true that POSIX 2024 _does_ prohibit seeking so far that
    the offset would become negative, however.

    I don't think that this is accidental. In 1990 signed overflow had
    reliable behaviour on common 2s-complement hardware with the C
    compilers of the day.

    This is simply not true. If anything, there was more variety of
    hardware supported by C90, and some of those systems were 1's
    complement or sign/mag, not 2's complement. Consequently,
    signed integer overflow has _always_ had undefined behavior in
    ANSI/ISO C.

    Both Burroughs Large Systems (48-bit stack machine) and the
    Sperry 1100/2200 (36-bit) systems had (have, in emulation today)
    C compilers.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Thu Aug 14 15:44:34 2025
    From Newsgroup: comp.arch

    In article <sknnQ.168942$Bui1.63359@fx10.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <2025Aug13.232334@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    cross@spitfire.i.gajendra.net (Dan Cross) writes:
    In article <2025Aug13.181010@mips.complang.tuwien.ac.at>,
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    For lseek(2):

    | Upon successful completion, lseek() returns the resulting offset
    | location as measured in bytes from the beginning of the file.

    Given that off_t is signed, lseek(2) can only return positive values.

    This is incorrect; or rather, it's accidentally correct now, but
    was not previously. The 1990 POSIX standard did not explicitly
    forbid a file that was so large that the offset couldn't
    overflow, hence why in 1990 POSIX you have to be careful about
    error handling when using `lseek`.

    It is true that POSIX 2024 _does_ prohibit seeking so far that
    the offset would become negative, however.

    I don't think that this is accidental. In 1990 signed overflow had
    reliable behaviour on common 2s-complement hardware with the C
    compilers of the day.

    This is simply not true. If anything, there was more variety of
    hardware supported by C90, and some of those systems were 1's
    complement or sign/mag, not 2's complement. Consequently,
    signed integer overflow has _always_ had undefined behavior in
    ANSI/ISO C.

    Both Burroughs Large Systems (48-bit stack machine) and the
    Sperry 1100/2200 (36-bit) systems had (have, in emulation today)
    C compilers.

    Yup. The 1100-series machines were (are) 1's complement. Those
    are the ones I usually think of when cursing that signed integer
    overflow is UB in C.

    I don't think anyone is compiling C23 code for those machines,
    but back in the late 1980s, they were still enough of a going
    concern that they could influence the emerging C standard. Not
    so much anymore.

    Regardless, signed integer overflow remains UB in the current C
    standard, nevermind definitionally following 2s complement
    semantics. Usually this is done on the basis of performance
    arguments: some seemingly-important loop optimizations can be
    made if the compiler can assert that overflow Cannot Happen.
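
    A small C illustration of the kind of loop in question (my own
    sketch, not from the original posts): with a signed index the
    compiler may assume i++ never wraps, so it can treat the trip count
    as exactly n+1 and, e.g., vectorize; with an unsigned index
    wrap-around is defined, so n == UINT_MAX would make the loop run
    forever and the compiler must allow for that possibility.

    long sum_signed(int n, const long *a)
    {
        long s = 0;
        for (int i = 0; i <= n; i++)       /* signed: overflow assumed impossible */
            s += a[i];
        return s;
    }

    long sum_unsigned(unsigned n, const long *a)
    {
        long s = 0;
        for (unsigned j = 0; j <= n; j++)  /* unsigned: j <= n always true if n == UINT_MAX */
            s += a[j];
        return s;
    }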

    And of course, even today, C still targets oddball platforms
    like DSPs and custom chips, where assumptions about the ubiquity
    of 2's comp may not hold.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From drb@drb@ihatespam.msu.edu (Dennis Boone) to comp.arch,alt.folklore.computers on Thu Aug 14 17:12:40 2025
    From Newsgroup: comp.arch

    The LSI11 uses four 40-pin chips from the MCP-1600 chipset (which is fascinating in itself <https://en.wikipedia.org/wiki/MCP-1600>) for a
    total of 160 pins; and it supported only 16 address bits without extra chips. That was certainly even more expensive (and also slower and
    less capable) than what I suggest above, but it was several years
    earlier, and what I envision was not possible in one chip then.

    Maybe compare 808x to something more in its weight class? The 8-bit
    8080 was 1974, 16-bit 8086 1978, 16/8-bit 8088 1979.

    The DEC F-11 (~1979) and J-11 (~1982) microprocessor designs were
    capable of 22 bit addressing on a single 40-pin carrier.

    De
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Thu Aug 14 19:15:42 2025
    From Newsgroup: comp.arch

    On 14.08.2025 17:44, Dan Cross wrote:
    In article <sknnQ.168942$Bui1.63359@fx10.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    Both Burroughs Large Systems (48-bit stack machine) and the
    Sperry 1100/2200 (36-bit) systems had (have, in emulation today)
    C compilers.

    Yup. The 1100-series machines were (are) 1's complement. Those
    are the ones I usually think of when cursing that signed integer
    overflow is UB in C.

    I don't think anyone is compiling C23 code for those machines,
    but back in the late 1980s, they were still enough of a going
    concern that they could influence the emerginc C standard. Not
    so much anymore.


    They would presumably have been part of the justification for supporting multiple signed integer formats at the time. UB on signed integer
    arithmetic overflow is a different matter altogether.

    Regardless, signed integer overflow remains UB in the current C
    standard, nevermind definitionally following 2s complement
    semantics. Usually this is done on the basis of performance
    arguments: some seemingly-important loop optimizations can be
    made if the compiler can assert that overflow Cannot Happen.


    The justification for "signed integer arithmetic overflow is UB" is in
    the C standards 6.5p5 under "Expressions" :

    """
    If an exceptional condition occurs during the evaluation of an
    expression (that is, if the result is not mathematically defined or not
    in the range of representable values for its type), the behavior is
    undefined.
    """

    It actually has absolutely nothing to do with signed integer
    representation, or machine hardware. It doesn't even have much to do
    with integers at all. It is simply that if the calculation can't give a
    correct answer, then the C standards don't say anything about the
    results or effects.

    The point is that when the results of an integer computation are
    too big, there is no way to get the correct answer in the types used.
    Two's complement wrapping is /not/ correct. If you add two real-world
    positive integers, you don't get a negative integer.

    And of course, even today, C still targets oddball platforms
    like DSPs and custom chips, where assumptions about the ubiquity
    of 2's comp may not hold.


    Modern C and C++ standards have dropped support for signed integer
    representations other than two's complement, because they are not in
    use in any modern hardware (including any DSPs) - at least, not for
    general-purpose integers. Both committees have consistently voted to
    keep overflow as UB.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Aug 14 17:43:50 2025
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> schrieb:

    The point is that when the results of an integer computation are
    too big, there is no way to get the correct answer in the types used.
    Two's complement wrapping is /not/ correct. If you add two real-world
    positive integers, you don't get a negative integer.

    I believe it was you who wrote "If you add enough apples to a
    pile, the number of apples becomes negative", so there is
    clearly a defined physical meaning to overflow.

    :-)
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch,alt.folklore.computers on Thu Aug 14 15:22:46 2025
    From Newsgroup: comp.arch

    Dennis Boone wrote:
    The LSI11 uses four 40-pin chips from the MCP-1600 chipset (which is fascinating in itself <https://en.wikipedia.org/wiki/MCP-1600>) for a total of 160 pins; and it supported only 16 address bits without extra chips. That was certainly even more expensive (and also slower and
    less capable) than what I suggest above, but it was several years
    earlier, and what I envision was not possible in one chip then.

    Maybe compare 808x to something more in its weight class? The 8-bit
    8080 was 1974, 16-bit 8086 1978, 16/8-bit 8088 1979.

    The DEC F-11 (~1979) and J-11 (~1982) microprocessor designs were
    capable of 22 bit addressing on a single 40-pin carrier.

    De

    For those interested in a blast from the past, on the Wikipedia WD16 page https://en.wikipedia.org/wiki/Western_Digital_WD16

    is a link to a copy of Electronic Design magazine from 1977 which
    has a set of articles on microprocessors starting on page 60.

    It's a nice summary of the state of the microprocessor world circa 1977.

    https://www.worldradiohistory.com/Archive-Electronic-Design/1977/Electronic-Design-V25-N21-1977-1011.pdf

    Table 1 General Purpose Microprocessors on pg 62 shows 8 different
    16-bit microprocessor chip sets including the WD16.

    Table 3 on pg 66 shows ~11 bit-slice families that can be used to build
    larger microcoded processors, such as AMD 2900 4-bit slice series.

    It also has many data sheets on various micros starting on pg 88
    and 16-bit ones starting on pg 170, mostly chips you never heard
    of, like the Ferranti F100L, but also some you'll know, like the
    Data General MicroNova mN601 on page 178.
    The Western Digital WD-16 is on pg 190.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Al Kossow@aek@bitsavers.org to comp.arch,alt.folklore.computers on Thu Aug 14 12:59:00 2025
    From Newsgroup: comp.arch

    On 8/14/25 10:12 AM, Dennis Boone wrote:
    The DEC F-11 (~1979) and J-11 (~1982) microprocessor designs were
    capable of 22 bit addressing on a single 40-pin carrier.

    The only single-die PDP-11 DEC produced was the T-11, and it didn't
    have an MMU.

    The J-11 is a Harris two-chip hybrid, and is in a >40 pin chip carrier.
    http://simh.trailing-edge.com/semi/j11.html
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Thu Aug 14 21:44:42 2025
    From Newsgroup: comp.arch

    In article <107l5ju$k78a$1@dont-email.me>,
    David Brown <david.brown@hesbynett.no> wrote:
    On 14.08.2025 17:44, Dan Cross wrote:
    In article <sknnQ.168942$Bui1.63359@fx10.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    Both Burroughs Large Systems (48-bit stack machine) and the
    Sperry 1100/2200 (36-bit) systems had (have, in emulation today)
    C compilers.

    Yup. The 1100-series machines were (are) 1's complement. Those
    are the ones I usually think of when cursing that signed integer
    overflow is UB in C.

    I don't think anyone is compiling C23 code for those machines,
    but back in the late 1980s, they were still enough of a going
    concern that they could influence the emerginc C standard. Not
    so much anymore.

    They would presumably have been part of the justification for supporting
    multiple signed integer formats at the time.

    C90 doesn't have much to say about this at all, other than
    saying that the actual representation and ranges of the integer
    types are implementation defined (G.3.5 para 1).

    C90 does say that, "The representations of integral types shall
    define values by use of a pure binary numeration system" (sec
    6.1.2.5).

    C99 tightens this up and talks about 2's comp, 1's comp, and
    sign/mag as being the permissible representations (J.3.5, para
    1).

    UB on signed integer
    arithmetic overflow is a different matter altogether.

    I disagree.

    Regardless, signed integer overflow remains UB in the current C
    standard, nevermind definitionally following 2s complement
    semantics. Usually this is done on the basis of performance
    arguments: some seemingly-important loop optimizations can be
    made if the compiler can assert that overflow Cannot Happen.

    The justification for "signed integer arithmetic overflow is UB" is in
    the C standards 6.5p5 under "Expressions" :

    Not in ANSI/ISO 9899-1990. In that revision of the standard,
    sec 6.5 covers declarations.

    """
    If an exceptional condition occurs during the evaluation of an
    expression (that is, if the result is not mathematically defined or not
    in the range of representable values for its type), the behavior is
    undefined.
    """

    In C90, this language appears in sec 6.3 para 5. Note, however,
    that they do not define what an exception _is_, only a few
    things that _may_ cause one. See below.

    It actually has absolutely nothing to do with signed integer
    representation, or machine hardware.

    Consider this language from the (non-normative) example 4 in sec
    5.1.2.3:

    |On a machine in which overflows produce an exception and in
    |which the range of values representable by an *int* is
    |[-32768,+32767], the implementation cannot rewrite this
    |expression as [continues with the specifics of the example]....

    That seems pretty clear that they're thinking about machines
    that actually generate a hardware trap of some kind on overflow.

    It doesn't even have much to do
    with integers at all. It is simply that if the calculation can't give a
    correct answer, then the C standards don't say anything about the
    results or effects.

    The point is that when the results of an integer computation are
    too big, there is no way to get the correct answer in the types used.
    Two's complement wrapping is /not/ correct. If you add two real-world
    positive integers, you don't get a negative integer.

    Sorry, but I don't buy this argument as anything other than a
    justification after the fact. We're talking about history and
    motivation here, not the behavior described in the standard.

    In particular, C is a programming language for actual machines,
    not a mathematical notation; the language is free to define the
    behavior of arithmetic expressions in any way it chooses, though
    one presumes it would do so in a way that makes sense for the
    machines that it targets. Thus, it could have formalized the
    result of signed integer overflow to follow 2's complement
    semantics had the committee so chosen, in which case the result
    would not be "incorrect", it would be well-defined with respect
    to the semantics of the language. Java, for example, does this,
    as does C11 (and later) atomic integer operations. Indeed, the
    C99 rationale document makes frequent reference to twos
    complement, where overflow and modular behavior are frequently
    equivalent, being the common case. But aside from the more
    recent atomics support, C _chose_ not to do this.

    Also, consider that _unsigned_ arithmetic is defined as having
    wrap-around semantics similar to modular arithmetic, and thus
    incapable of overflow. But that's simply a fiction invented for
    the abstract machine described informally in the standard: it
    requires special handling on machines like the 1100 series,
    because those machines might trap on overflow. The C committee
    could just as well have said that the unsigned arithmetic
    _could_ overflow and that the result was UB.

    So why did C choose this way? The only logical reason is that
    there were machines at the time where a) integer overflow
    caused machine exceptions, and b) the representation of signed
    integers was not well-defined, so that the actual value
    resulting from overflow could not be rigorously defined. Given
    that C90 mandated a binary representation for integers and so
    the representation of unsigned integers is basically common,
    there was no need to do that for unsigned arithmetic.

    And of course, even today, C still targets oddball platforms
    like DSPs and custom chips, where assumptions about the ubiquity
    of 2's comp may not hold.

    Modern C and C++ standards have dropped support for signed integer
    representations other than two's complement, because they are not in
    use in any modern hardware (including any DSPs) - at least, not for
    general-purpose integers. Both committees have consistently voted to
    keep overflow as UB.

    Yes. As I said, performance is often the justification.

    I'm not convinced that there are no custom chips and/or DSPs
    still being manufactured today. They may not be common, and their
    mere existence is certainly dumb and offensive, but that does
    not mean that they don't exist. Note that the survey in, e.g., https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2218.htm
    only mentions _popular_ DSPs, not _all_ DSPs.

    Of course, if such machines exist, I will certainly concede that
    I doubt very much that anyone is targeting them with C code
    written to a modern standard.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Fri Aug 15 03:20:56 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    VAX-780 architecture handbook says cache was 8 KB and used 8-byte
    lines. So extra 12KB of fast RAM could double cache size.
    That would be nice improvement, but not as dramatic as increase
    from 2 KB to 12 KB.

    The handbook is: https://ia903400.us.archive.org/26/items/bitsavers_decvaxhandHandbookVol11977_10941546/VAX_Architecture_Handbook_Vol1_1977_text.pdf

    The cache is indeed 8KB in size, two-way set associative and write-through.

    Section 2.7 also mentions an 8-byte instruction buffer, and that the
    instruction fetching happens concurrently with the microcoded
    execution. So here we have a little bit of pipelining.

    Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
    have "typically 97% hit rate". I would go for larger pages, which
    would reduce the TLB miss rate.

    I think that in 1979 the VAX's 512-byte page was close to optimal.
    Namely, IIUC the smallest supported configuration was 128 KB RAM.
    That gives 256 pages, enough for a sophisticated system with
    fine-grained access control. Bigger pages would reduce the
    number of pages. For example, 4 KB pages would mean 32 pages
    in the minimal configuration, significantly reducing the usefulness
    of such a machine.

    _For current machines_ there are reasons to use bigger pages, but
    in the VAX's time bigger pages would almost surely have led to higher
    memory use, and consequently to a higher price for the end user. In
    effect the machine would have been much less competitive.

    BTW: Long ago I saw a message about porting an application from
    VAX to Linux. On the VAX the application ran OK in 1GB of memory.
    On 32-bit Intel architecture Linux with 1 GB there was excessive
    paging. The reason was the much smaller number of (bigger) pages.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Aug 15 05:07:01 2025
    From Newsgroup: comp.arch

    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <107b1bu$252qo$1@dont-email.me>,

    Programming a RISC in assembler is not so hard, at least in my
    experience. Plus, people overestimated use of assembler even in
    the mid-1970s, and underestimated the use of compilers.
    [...]

    They certainly did! I'm not saying that they're right; I'm
    saying that business needs must have, at least in part,
    influenced the ISA design. That is, while mistaken, it was part
    of the business decision process regardless.

    It's not clear to me what the distinction of technical vs. business
    is supposed to be in the context of ISA design. Could you explain?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Fri Aug 15 12:57:35 2025
    From Newsgroup: comp.arch

    In article <107mf9l$u2si$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <107b1bu$252qo$1@dont-email.me>,

    Programming a RISC in assembler is not so hard, at least in my
    experience. Plus, people overestimated use of assembler even in
    the mid-1970s, and underestimated the use of compilers.
    [...]

    They certainly did! I'm not saying that they're right; I'm
    saying that business needs must have, at least in part,
    influenced the ISA design. That is, while mistaken, it was part
    of the business decision process regardless.

    It's not clear to me what the distinction of technical vs. business
    is supposed to be in the context of ISA design. Could you explain?

    I can attempt to, though I'm not sure if I can be successful.

    The VAX was built to be a commercial product. As such, it was
    designed to be successful in the market. But in order to be
    successful in the market, it was important that the designers be
    informed by the business landscape at both the time they were
    designing it, and what they could project would be the lifetime
    of the product. Those are considerations that extend beyond
    the purely technical aspects of the design, and are both more
    speculative and more abstract.

    Consider how the business criteria might influence the technical
    design, and how these might play off of one another: obviously,
    DEC understood that the PDP-11 was growing ever more constrained
    by its 16-bit address space, and that any successor would have
    to have a larger address space. From a business perspective, it
    made no sense to create a VAX with a 16-bit address space.
    Similarly, they could have chosen (say) a 20, 24, or 28 bit
    address space, or used segmented memory, or made any number of other
    such decisions, but the model that they did choose (basically a
    flat 32-bit virtual address space: at least as far as the
    hardware was concerned; I know VMS did things differently) was
    ultimately the one that "won".

    Of course, those are obvious examples. What I'm contending is
    that the business<->technical relationship is probably deeper
    and that business has more influence on technology than we
    realize, up to and including the ISA design. I'm not saying
    that the business folks are looking over the engineers'
    shoulders telling them how the opcode space should be arranged,
    but I am saying that they're probably going to engineering with
    broad-strokes requirements based on market analysis and customer
    demand. Indeed, we see examples of this now, with the addition
    of vector instructions to most major ISAs. That's driven by the
    market, not merely engineers saying to each other, "you know
    what would be cool? AVX-512!"

    And so with the VAX, I can imagine the work (which started in,
    what, 1975?) being informed by a business landscape that saw an
    increasing trend towards favoring high-level languages, but also
    saw the continued development of large, bespoke, business
    applications for another five or more years, and with customers
    wanting to be able to write (say) complex formatting sequences
    easily in assembler (the EDIT instruction!), in a way that was
    compatible with COBOL (so make the COBOL compiler emit the EDIT
    instruction!), while also trying to accommodate the scientific
    market (POLYF/POLYG!) who would be writing primarily in FORTRAN
    but jumping to assembler for the fuzz-busting speed boost (so
    stabilize what amounts to an ABI very early on!), and so forth.

    Of course, they messed some of it up; EDITPC was like the
    punchline of a bad joke, and the ways that POLY was messed up
    are well-known.

    Anyway, I apologize for the length of the post, but that's the
    sort of thing I mean.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Swindells@rjs@fdy2.co.uk to comp.arch on Fri Aug 15 13:36:12 2025
    From Newsgroup: comp.arch

    On Fri, 15 Aug 2025 12:57:35 -0000 (UTC), Dan Cross wrote:

    In article <107mf9l$u2si$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <107b1bu$252qo$1@dont-email.me>,

    Programming a RISC in assembler is not so hard, at least in my
    experience. Plus, people overestimated use of assembler even in the
    mid-1970s, and underestimated the use of compilers.
    [...]

    They certainly did! I'm not saying that they're right; I'm saying
    that business needs must have, at least in part, influenced the ISA
    design. That is, while mistaken, it was part of the business decision
    process regardless.

    It's not clear to me what the distinction of technical vs. business is
    supposed to be in the context of ISA design. Could you explain?

    I can attempt to, though I'm not sure if I can be successful.

    [snip]

    There are also bits of the business requirements in each of the
    descriptions of DEC microprocessor projects on Bob Supnik's site
    that Al Kossow linked to earlier:

    <http://simh.trailing-edge.com/dsarchive.html>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Aug 15 15:10:58 2025
    From Newsgroup: comp.arch

    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    VAX-780 architecture handbook says cache was 8 KB and used 8-byte
    lines. So extra 12KB of fast RAM could double cache size.
    That would be nice improvement, but not as dramatic as increase
    from 2 KB to 12 KB.

    The handbook is:
    https://ia903400.us.archive.org/26/items/bitsavers_decvaxhandHandbookVol11977_10941546/VAX_Architecture_Handbook_Vol1_1977_text.pdf

    The cache is indeed 8KB in size, two-way set associative and write-through.
    Section 2.7 also mentions an 8-byte instruction buffer, and that the
    instruction fetching happens concurrently with the microcoded
    execution. So here we have a little bit of pipelining.

    Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
    have "typically 97% hit rate". I would go for larger pages, which
    would reduce the TLB miss rate.

    I think that in 1979 VAX 512 bytes page was close to optimal.
    Namely, IIUC smallest supported configuration was 128 KB RAM.
    That gives 256 pages, enough for sophisticated system with
    fine-grained access control. Bigger pages would reduce
    number of pages. For example 4 KB pages would mean 32 pages
    in the minimal configuration, significantly reducing the usefulness
    of such a machine.

    One must also consider that the disks in that era were
    fairly small, and 512 bytes was a common sector size.

    Convenient for both swapping and loading program text
    without wasting space on the disk by clustering
    pages in groups of 2, 4 or 8.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From OrangeFish@OrangeFish@invalid.invalid to comp.arch,alt.folklore.computers on Fri Aug 15 11:42:09 2025
    From Newsgroup: comp.arch

    On 2025-08-12 15:09, John Levine wrote:
    According to <aph@littlepinkcloud.invalid>:
    In comp.arch BGB <cr88192@gmail.com> wrote:

    Also, IIRC, the major point of X32 was that it would narrow pointers and
    similar back down to 32 bits, requiring special versions of any shared
    libraries or similar.

    But, it is unattractive to have both 32 and 64 bit versions of all the SO's.

    We have done something similar for years at Red Hat: not X32, but
    x86_32, and it was pretty easy. If you're building a 32-bit OS anyway
    (which we were) all you have to do is copy all 32-bit libraries from
    one repo to the other.

    FreeBSD does the same thing. The 32 bit libraries are installed by default on 64 bit systems because, by current standards, they're not very big.

    Same is true for Solaris Sparc.

    OF.


    I've stopped installing them because I know I don't have any 32 bit apps
    left but on systems with old packages, who knows?


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Aug 15 17:49:53 2025
    From Newsgroup: comp.arch

    On 14.08.2025 23:44, Dan Cross wrote:
    In article <107l5ju$k78a$1@dont-email.me>,
    David Brown <david.brown@hesbynett.no> wrote:
    On 14.08.2025 17:44, Dan Cross wrote:
    In article <sknnQ.168942$Bui1.63359@fx10.iad>,
    Scott Lurndal <slp53@pacbell.net> wrote:
    Both Burroughs Large Systems (48-bit stack machine) and the
    Sperry 1100/2200 (36-bit) systems had (have, in emulation today)
    C compilers.

    Yup. The 1100-series machines were (are) 1's complement. Those
    are the ones I usually think of when cursing that signed integer
    overflow is UB in C.

    I don't think anyone is compiling C23 code for those machines,
    but back in the late 1980s, they were still enough of a going
    concern that they could influence the emerginc C standard. Not
    so much anymore.

    They would presumably have been part of the justification for supporting
    multiple signed integer formats at the time.

    C90 doesn't have much to say about this at all, other than
    saying that the actual representation and ranges of the integer
    types are implementation defined (G.3.5 para 1).

    C90 does say that, "The representations of integral types shall
    define values by use of a pure binary numeration system" (sec
    6.1.2.5).

    C99 tightens this up and talks about 2's comp, 1's comp, and
    sign/mag as being the permissible representations (J.3.5, para
    1).

    Yes. Early C didn't go into the details, then C99 described the systems
    that could realistically be used. And now in C23 only two's complement
    is allowed.


    UB on signed integer
    arithmetic overflow is a different matter altogether.

    I disagree.


    You have overflow when the mathematical result of an operation cannot
    be expressed accurately in the type - regardless of the representation
    format for the numbers. Your options, as a language designer or
    implementer, for handling the overflow are the same regardless of the
    representation. You can pick a fixed value to return, or saturate, or
    invoke some kind of error handler mechanism, or return a "don't care"
    unspecified value of the type, or perform a specified algorithm to get
    a representable value (such as reduction modulo 2^n), or you can simply
    say the program is broken if this happens (it is UB).

    I don't see where the representation comes into it - overflow is a
    matter of values and the ranges that can be stored in a type, not how
    those values are stored in the bits of the data.
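    (As a concrete illustration of a few of these options, here is a
    minimal C sketch; the checked variant assumes the GCC/Clang
    __builtin_add_overflow extension, and the wrapping variant relies on
    the conversion back to int behaving as it does on the usual compilers:)

    #include <limits.h>

    /* Saturating add: clamp to INT_MAX / INT_MIN instead of overflowing. */
    static int add_saturate(int a, int b)
    {
        if (a > 0 && b > INT_MAX - a) return INT_MAX;
        if (a < 0 && b < INT_MIN - a) return INT_MIN;
        return a + b;
    }

    /* Wrapping add: reduction modulo 2^N done in unsigned arithmetic
       (which is defined); converting the result back to int is
       implementation-defined in the standard, but wraps on the usual
       compilers. */
    static int add_wrap(int a, int b)
    {
        return (int)((unsigned)a + (unsigned)b);
    }

    /* Checked add: report the overflow to the caller instead of
       computing a nonsense value (GCC/Clang extension). */
    static int add_checked(int a, int b, int *result)
    {
        return __builtin_add_overflow(a, b, result);  /* nonzero on overflow */
    }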

    Regardless, signed integer overflow remains UB in the current C
    standard, nevermind definitionally following 2s complement
    semantics. Usually this is done on the basis of performance
    arguments: some seemingly-important loop optimizations can be
    made if the compiler can assert that overflow Cannot Happen.

    The justification for "signed integer arithmetic overflow is UB" is in
    the C standards 6.5p5 under "Expressions" :

    Not in ANSI/ISO 9899-1990. In that revision of the standard,
    sec 6.5 covers declarations.

    """
    If an exceptional condition occurs during the evaluation of an
    expression (that is, if the result is not mathematically defined or not
    in the range of representable values for its type), the behavior is
    undefined.
    """

    In C90, this language appears in sec 6.3 para 5. Note, however,
    that they do not define what an exception _is_, only a few
    things that _may_ cause one. See below.


    It's basically the same in C90 onwards, with just small changes to the wording. And it /does/ define what is meant by an "exceptional
    condition" (or just "exception" in C90) - that is done by the part in parentheses.

    It actually has absolutely nothing to do with signed integer
    representation, or machine hardware.

    Consider this language from the (non-normative) example 4 in sec
    5.1.2.3:

    |On a machine in which overflows produce an exception and in
    |which the range of values representable by an *int* is
    |[-32768,+32767], the implementation cannot rewrite this
    |expression as [continues with the specifics of the example]....

    That seems pretty clear that they're thinking about machines
    that actually generate a hardware trap of some kind on overflow.


    They are thinking about that possibility, yes. In C90, the term
    "exception" here was not clearly defined - and it is definitely not the
    same as the term "exception" in 6.3p5. The wording was improved in C99 without changing the intended meaning - there the term in the paragraph
    under "Expressions" is "exceptional condition" (defined in that
    paragraph), while in the example in "Execution environments", it says
    "On a machine in which overflows produce an explicit trap". (C11
    further clarifies what "performs a trap" means.)

    But this is about re-arrangements the compiler is allowed to make, or
    barred from making - it can't make re-arrangements that would mean
    execution failed when the direct execution of the code according to the
    C abstract machine would have worked correctly (without ever having encountered an "exceptional condition" or other UB). Representation is
    not relevant here - there is nothing about two's complement, ones'
    complement, sign-magnitude, or anything else. Even the machine hardware
    is not actually particularly important, given that most processors
    support non-trapping integer arithmetic instructions and for those that
    don't have explicit trap instructions, a compiler could generate "jump
    if overflow flag set" or similar instructions to emulate traps
    reasonably efficiently. (Many compilers support that kind of thing as
    an option to aid debugging.)


    It doesn't even have much to do
    with integers at all. It is simply that if the calculation can't give a
    correct answer, then the C standards don't say anything about the
    results or effects.

    The point is that when the results of an integer computation are
    too big, there is no way to get the correct answer in the types used.
    Two's complement wrapping is /not/ correct. If you add two real-world
    positive integers, you don't get a negative integer.

    Sorry, but I don't buy this argument as anything other than a
    justification after the fact. We're talking about history and
    motivation here, not the behavior described in the standard.

    It is a fair point that I am describing a rational and sensible reason
    for UB on arithmetic overflow - and I do not know the motivation of the
    early C language designers, compiler implementers, and authors of the
    first C standard.

    I do know, however, that the principle of "garbage in, garbage out" was
    well established long before C was conceived. And programmers of that
    time were familiar with the concept of functions and operations being
    defined for appropriate inputs, and having no defined behaviour for
    invalid inputs. C is full of other things where behaviour is left
    undefined when no sensible correct answer can be specified, and that is
    not just because the behaviour of different hardware could vary. It
    seems perfectly reasonable to me to suppose that signed integer
    arithmetic overflow is just another case, no different from
    dereferencing an invalid pointer, dividing by zero, or any one of the
    other UB's in the standards.


    In particular, C is a programming language for actual machines,
    not a mathematical notation; the language is free to define the
    behavior of arithmetic expressions in any way it chooses, though
    one presumes it would do so in a way that makes sense for the
    machines that it targets.

    Yes, that is true. It is, however, also important to remember that it
    was based on a general abstract machine, not any particular hardware,
    and that the operations were intended to follow standard mathematics as
    well as practically possible - operations and expressions in C were not designed for any particular hardware. (Though some design choices were
    biased by particular hardware.)

    Thus, it could have formalized the
    result of signed integer overflow to follow 2's complement
    semantics had the committee so chosen, in which case the result
    would not be "incorrect", it would be well-defined with respect
    to the semantics of the language. Java, for example, does this,
    as does C11 (and later) atomic integer operations. Indeed, the
    C99 rationale document makes frequent reference to twos
    complement, where overflow and modular behavior are frequently
    equivalent, being the common case. But aside from the more
    recent atomics support, C _chose_ not to do this.


    It could have made signed integer overflow defined behaviour, but it did
    not. The C standards committee have explicitly chosen not to do that,
    even after deciding that two's complement is the only supported
    representation for signed integers in C23 onwards. It is fine to have
    two's complement representation, and fine to have modulo arithmetic in
    some circumstances, while leaving other arithmetic overflow undefined. Unsigned integer operations in C have always been defined as modulo
    arithmetic - addition of unsigned values is a different operation from addition of signed values. Having some modulo behaviour does not in any
    way imply that signed arithmetic should be modulo.

    In Java, the language designers decided that integer arithmetic
    operations would be modulo operations. Wrapping therefore gives the
    correct answer for those operations - it does not give the correct
    answer for mathematical integer operations. And Java loses common mathematical identities which C retains - such as the identity that
    adding a positive integer to another integer will increase its value. Something always has to be lost when approximating unbounded
    mathematical integers in a bounded implementation - I think C made the
    right choices here about what to keep and what to lose, and Java made
    the wrong choices. (Others may of course have different opinions.)

    In Zig, unsigned integer arithmetic overflow is also UB as these
    operations are not defined as modulo. I think that is a good natural
    choice too - but it is useful for a language to have a way to do
    wrapping arithmetic on the occasions you need it.

    Also, consider that _unsigned_ arithmetic is defined as having
    wrap-around semantics similar to modular arithmetic, and thus
    incapable of overflow.

    Yes. Unsigned arithmetic operations are different operations from
    signed arithmetic operations in C.

    But that's simply a fiction invented for
    the abstract machine described informally in the standard: it
    requires special handling on machines like the 1100 series,
    because those machines might trap on overflow. The C committee
    could just as well have said that the unsigned arithmetic
    _could_ overflow and that the result was UB.


    They could have done that (as the Zig folk did).

    So why did C choose this way? The only logical reason is that
    there were machines at the time where a) integer overflow
    caused machine exceptions, and b) the representation of signed
    integers was not well-defined, so that the actual value
    resulting from overflow could not be rigorously defined. Given
    that C90 mandated a binary representation for integers and so
    the representation of unsigned integers is basically common,
    there was no need to do that for unsigned arithmetic.


    Not at all. Usually when someone says "the only logical reason is...",
    they really mean "the only logical reason /I/ can think of is...", or
    "the only reason that /I/ can think of that /I/ think is logical is...".

    For a language that can be used as a low-level systems language, it is important to be able to do modulo arithmetic efficiently. It is needed
    for a number of low-level tasks, including the implementation of large arithmetic operations, handling timers, counters, and other bits and
    pieces. So it was definitely a useful thing to have in C.
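    (A typical example of why the defined wraparound is handy at this
    level - a sketch assuming a hypothetical free-running 32-bit tick
    counter called read_tick_count():)

    #include <stdint.h>

    uint32_t read_tick_count(void);  /* hypothetical free-running counter */

    /* Elapsed ticks since 'start'.  Unsigned subtraction is defined as
       modulo 2^32, so this stays correct even after the counter wraps
       past 0xFFFFFFFF, as long as the real elapsed time is under 2^32
       ticks. */
    static uint32_t ticks_since(uint32_t start)
    {
        return read_tick_count() - start;
    }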

    For a language that can be used as a fast and efficient application
    language, it must have a reasonable approximation to mathematical
    integer arithmetic. Implementations should not be forced to have
    behaviours beyond the mathematically sensible answers - if a calculation
    can't be done correctly, there's no point in doing it. Giving nonsense results does not help anyone - C programmers or toolchain implementers,
    so the language should not specify any particular result. More sensible defined overflow behaviour - saturation, error values, language
    exceptions or traps, etc., would be very inefficient on most hardware.
    So UB is the best choice - and implementations can do something
    different if they like.

    Too many options make a language bigger - harder to implement, harder to learn, harder to use. So it makes sense to have modulo arithmetic for unsigned types, and normal arithmetic for signed types.

    I am not claiming to know that this is the reasoning made by the C
    language pioneers. But it is definitely an alternative logical reason
    for C being the way it is.

    And of course, even today, C still targets oddball platforms
    like DSPs and custom chips, where assumptions about the ubiquity
    of 2's comp may not hold.

    Modern C and C++ standards have dropped support for signed integer
    representation other than two's complement, because they are not in use
    in any modern hardware (including any DSP's) - at least, not for
    general-purpose integers. Both committees have consistently voted to
    keep overflow as UB.

    Yes. As I said, performance is often the justification.

    I'm not convinced that there are no custom chips and/or DSPs
    that are not manufactured today. They may not be common, their
    mere existence is certainly dumb and offensive, but that does
    not mean that they don't exist. Note that the survey in, e.g., https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2218.htm
    only mentions _popular_ DSPs, not _all_ DSPs.


    I think you might have missed a few words in that paragraph, but I
    believe I know what you intended. There are certainly DSPs and other
    cores that have strong support for alternative overflow behaviour -
    saturation is very common in DSPs, and it is also common to have a
    "sticky overflow" flag so that you can do lots of calculations in a
    tight loop, and check for problems once you are finished. I think it is highly unlikely that you'll find a core with something other than two's complement as the representation for signed integer types, though I
    can't claim that I know /all/ devices! (I do know a bit about more
    cores than would be considered popular or common.)

    Of course, if such machines exist, I will certainly concede that
    I doubt very much that anyone is targeting them with C code
    written to a modern standard.


    Modern C is definitely used on DSPs with strong saturation support.
    (Even ARM cores have saturated arithmetic instructions.) But they can
    also handle two's complement wrapped signed integer arithmetic if the programmer wants that - after all, it's exactly the same in the hardware
    as modulo unsigned arithmetic (except for division). That doesn't mean
    that wrapping signed integer overflow is useful or desired behaviour.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Aug 15 17:49:58 2025
    From Newsgroup: comp.arch

    On 14.08.2025 19:43, Thomas Koenig wrote:
    David Brown <david.brown@hesbynett.no> schrieb:

    The point is that when the results of an integer computation are
    too big, there is no way to get the correct answer in the types used.
    Two's complement wrapping is /not/ correct. If you add two real-world
    positive integers, you don't get a negative integer.

    I believe it was you who wrote "If you add enough apples to a
    pile, the number of apples becomes negative", so there is
    clearly a defined physical meaning to overflow.

    :-)

    Yes, I did say something along those lines - but perhaps not /exactly/
    those words!

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Fri Aug 15 16:53:29 2025
    From Newsgroup: comp.arch

    According to Scott Lurndal <slp53@pacbell.net>:
    Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
    have "typically 97% hit rate". I would go for larger pages, which
    would reduce the TLB miss rate.

    I think that in 1979 VAX 512 bytes page was close to optimal. ...
    One must also consider that the disks in that era were
    fairly small, and 512 bytes was a common sector size.

    Convenient for both swapping and loading program text
    without wasting space on the disk by clustering
    pages in groups of 2, 4 or 8.

    That's probably it but even at the time the pages seemed rather small.
    Pages on the PDP-10 were 512 words which was about 2K bytes.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Aug 15 13:19:36 2025
    From Newsgroup: comp.arch

    On 8/15/2025 11:53 AM, John Levine wrote:
    According to Scott Lurndal <slp53@pacbell.net>:
    Section 2.7 also describes a 128-entry TLB. The TLB is claimed to
    have "typically 97% hit rate". I would go for larger pages, which
    would reduce the TLB miss rate.

    I think that in 1979 VAX 512 bytes page was close to optimal. ...
    One must also consider that the disks in that era were
    fairly small, and 512 bytes was a common sector size.

    Convenient for both swapping and loading program text
    without wasting space on the disk by clustering
    pages in groups of 2, 4 or 8.

    That's probably it but even at the time the pages seemed rather small.
    Pages on the PDP-10 were 512 words which was about 2K bytes.

    Yeah.


    Can note in some of my own testing, I tested various page sizes, and
    seemingly found a local optimum at around 16K.

    Where, going from 4K or 8K to 16K sees a reduction in TLB miss rates,
    but 16K to 32K or 64K did not see any significant reduction; but did see
    a more significant increase in memory footprint due to allocation
    overheads (where, OTOH, going from 4K to 16K pages does not see much
    increase in memory footprint).

    Patterns seemed consistent across multiple programs tested, but harder
    to say if this pattern would be universal.


    Had noted if running stats on where in the pages memory accesses land:
    4K: Pages tend to be accessed fairly evenly
    16K: Minor variation as to what parts of the page are being used.
    64K: Significant variation between parts of the page.
    Basically, tracking per-page memory accesses on a finer grain boundary
    (eg, 512 bytes).

    Say, for example, at 64K one part of the page may be being accessed
    readily but another part of the page isn't really being accessed at all
    (and increasing page size only really sees benefit for TLB miss rate so
    long as the whole page is "actually being used").
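    (One way to gather that kind of stat, sketched with a made-up
    record_access() hook in the emulator: keep a small bitmap of 512-byte
    sub-blocks per tracked page and count how many get touched.)

    #include <stdint.h>

    #define PAGE_SHIFT    16    /* 64K pages in this configuration */
    #define SUB_SHIFT      9    /* 512-byte granules within a page */
    #define SUBS_PER_PAGE (1u << (PAGE_SHIFT - SUB_SHIFT))   /* 128 */

    struct page_stats {
        uint32_t touched[SUBS_PER_PAGE / 32];   /* one bit per sub-block */
    };

    /* Hypothetical hook called by the emulator for every data access. */
    static void record_access(struct page_stats *ps, uint32_t vaddr)
    {
        uint32_t sub = (vaddr >> SUB_SHIFT) & (SUBS_PER_PAGE - 1);
        ps->touched[sub >> 5] |= 1u << (sub & 31);
    }

    /* Afterwards: how much of the page was actually used. */
    static int subs_used(const struct page_stats *ps)
    {
        int n = 0;
        for (unsigned i = 0; i < SUBS_PER_PAGE / 32; i++)
            n += __builtin_popcount(ps->touched[i]);   /* GCC/Clang builtin */
        return n;
    }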


    Though, can also note that a skew appeared in 64K pages where they were
    more likely to be being accessed in the first 32K rather than the latter
    32K. Though, I would expect the opposite pattern with stack pages (in
    this case, a traditional "grows downwards" stack being used).


    Granted, always possible other people might see something different if
    they ran their own tests on different programs.



    Can also note that in this case, IIRC, the "malloc()" tended to operate
    by allocating chunks of memory for the "medium heap" 128K at a time, and
    for objects larger than 64K would fall back to page allocation. This may
    be a poor fit for 64K pages (since a whole class of "just over 64K"
    mallocs needs 128K).

    Arguably, would be better for this page size to grow the "malloc()" heap
    in 1MB chunks; but this only really makes sense if the system has
    "plenty of RAM" and applications tend to use a lot of RAM. Where, say,
    if the program is fine with 64K of stack and 128K of heap, it is kind of
    a waste to allocate 1MB of heap for it (although in TestKern programs
    default to 128K, but this is about how much is needed for Doom and Quake
    and similar; though Quake will exceed 128K if all local arrays are stack-allocated rather than auto-folding large structs or arrays to heap allocation, otherwise Quake needs around 256K of stack space; more for
    GLQuake as it used a very large stack array, ~ 256K IIRC, for texture resizing).

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Fri Aug 15 18:33:07 2025
    From Newsgroup: comp.arch

    In article <107nkv2$1753a$1@dont-email.me>,
    David Brown <david.brown@hesbynett.no> wrote:
    On 14.08.2025 23:44, Dan Cross wrote:
    In article <107l5ju$k78a$1@dont-email.me>,
    David Brown <david.brown@hesbynett.no> wrote:
    [snip]
    UB on signed integer
    arithmetic overflow is a different matter altogether.

    I disagree.

    You have overflow when the mathematical result of an operation cannot be
    expressed accurately in the type - regardless of the representation
    format for the numbers. Your options, as a language designer or
    implementer, of handling the overflow are the same regardless of the
    representation. You can pick a fixed value to return, or saturate, or
    invoke some kind of error handler mechanism, or return a "don't care"
    unspecified value of the type, or perform a specified algorithm to get a
    representable value (such as reduction modulo 2^n), or you can simply
    say the program is broken if this happens (it is UB).

    I don't see where the representation comes into it - overflow is a
    matter of values and the ranges that can be stored in a type, not how
    those values are stored in the bits of the data.

    I understood your point. But we are talking about the history of
    the language here, not the presently defined behavior.

    We do, in fact, have historical source materials we can draw
    from when discussing this; there's little need to guess. Here,
    we know that the earliest C implementations simply ignored the
    possibility of overflow. In K&R1, chap 2, sec 2.5 ("Arithmetic
    Operators") on page 38, the authors write, "the action taken on
    overflow or underflow depends on the machine at hand." In
    Appendix A, sec 7 ("Expressions"), page 185, the authors write:
    "The handling of overflow and divide check in expression
    evaluation is machine-dependent. All existing implementations of C
    ignore integer overflows; treatment of division by 0, and all
    floating point exceptions, varies between machines, and is
    usually adjustable by a library function."

    In other words, different machines give different results; some
    will trap, others will differ due to representation issues.
    Nowhere here does it suggest that the language designers were
    worried about getting the "wrong" result, as you have asserted.

    Regardless, signed integer overflow remains UB in the current C
    standard, nevermind definitionally following 2s complement
    semantics. Usually this is done on the basis of performance
    arguments: some seemingly-important loop optimizations can be
    made if the compiler can assert that overflow Cannot Happen.

    The justification for "signed integer arithmetic overflow is UB" is in
    the C standards 6.5p5 under "Expressions" :

    Not in ANSI/ISO 9899-1990. In that revision of the standard,
    sec 6.5 covers declarations.

    """
    If an exceptional condition occurs during the evaluation of an
    expression (that is, if the result is not mathematically defined or not
    in the range of representable values for its type), the behavior is
    undefined.
    """

    In C90, this language appears in sec 6.3 para 5. Note, however,
    that they do not define what an exception _is_, only a few
    things that _may_ cause one. See below.

    It's basically the same in C90 onwards, with just small changes to the wording.

    Did I suggest otherwise?

    And it /does/ define what is meant by an "exceptional
    condition" (or just "exception" in C90) - that is done by the part in >parentheses.

    That is an interpretation.

    It actually has absolutely nothing to do with signed integer
    representation, or machine hardware.

    Consider this language from the (non-normative) example 4 in sec
    5.1.2.3:

    |On a machine in which overflows produce an exception and in
    |which the range of values representable by an *int* is
    |[-32768,+32767], the implementation cannot rewrite this
    |expression as [continues with the specifics of the example]....

    That seems pretty clear that they're thinking about machines
    that actually generate a hardware trap of some kind on overflow.

    They are thinking about that possibility, yes. In C90, the term
    "exception" here was not clearly defined - and it is definitely not the
    same as the term "exception" in 6.3p5. The wording was improved in C99 without changing the intended meaning - there the term in the paragraph under "Expressions" is "exceptional condition" (defined in that
    paragraph), while in the example in "Execution environments", it says
    "On a machine in which overflows produce an explicit trap". (C11
    further clarifies what "performs a trap" means.)

    But this is about re-arrangements the compiler is allowed to make, or
    barred from making - it can't make re-arrangements that would mean
    execution failed when the direct execution of the code according to the
    C abstract machine would have worked correctly (without ever having
    encountered an "exceptional condition" or other UB). Representation is
    not relevant here - there is nothing about two's complement, ones'
    complement, sign-magnitude, or anything else. Even the machine hardware
    is not actually particularly important, given that most processors
    support non-trapping integer arithmetic instructions and for those that
    don't have explicit trap instructions, a compiler could generate "jump
    if overflow flag set" or similar instructions to emulate traps
    reasonably efficiently. (Many compilers support that kind of thing as
    an option to aid debugging.)

    It doesn't even have much to do
    with integers at all. It is simply that if the calculation can't give a correct answer, then the C standards don't say anything about the
    results or effects.

    The point is that when the results of an integer computation are
    too big, there is no way to get the correct answer in the types used.
    Two's complement wrapping is /not/ correct. If you add two real-world
    positive integers, you don't get a negative integer.

    Sorry, but I don't buy this argument as anything other than a
    justification after the fact. We're talking about history and
    motivation here, not the behavior described in the standard.

    It is a fair point that I am describing a rational and sensible reason
    for UB on arithmetic overflow - and I do not know the motivation of the early C language designers, compiler implementers, and authors of the
    first C standard.

    Then there's really nothing more to discuss. The intent here is
    to understand the motivation of those folks.

    Early C didn't even have unsigned; Dennis Ritchie's paper for
    the History of Programming Languages conference said that it
    came around 1977 (https://www.nokia.com/bell-labs/about/dennis-m-ritchie/chist.html;
    see the section on "portability"), and in pre-ANSI C, struct
    fields of `int` type were effectively unsigned (K&R1,
    pp.138,197). I mentioned the quote from K&R1 about overflow
    above, but we see some other hints about signed overflow
    becoming negative in other documents. For instance, K&R2, p 118
    gives the example of a hash function followed by the sentence,
    "unsigned arithmetic ensures that the hash value is
    non-negative." This does not suggest to me that the authors
    thought that the wrapping behavior of twos-complement arithmetic
    was "incorrect".

    I do know, however, that the principle of "garbage in, garbage out" was
    well established long before C was conceived. And programmers of that
    time were familiar with the concept of functions and operations being defined for appropriate inputs, and having no defined behaviour for
    invalid inputs. C is full of other things where behaviour is left
    undefined when no sensible correct answer can be specified, and that is
    not just because the behaviour of different hardware could vary. It
    seems perfectly reasonable to me to suppose that signed integer
    arithmetic overflow is just another case, no different from
    dereferencing an invalid pointer, dividing by zero, or any one of the
    other UB's in the standards.

    Indeed; this is effectively what I've been saying: signed
    integer overflow is UB because the behavior of overflow varied
    between the machines of the day, so C could not make assumptions
    about what value would result, in part because of representation
    issues: at the hardware level, signed overflow of the largest
    representable positive integer yields different _values_ between
    1s comp and 2s comp machines. Who is to say which is correct?

    In particular, C is a programming language for actual machines,
    not a mathematical notation; the language is free to define the
    behavior of arithmetic expressions in any way it chooses, though
    one presumes it would do so in a way that makes sense for the
    machines that it targets.

    Yes, that is true. It is, however, also important to remember that it
    was based on a general abstract machine, not any particular hardware,
    and that the operations were intended to follow standard mathematics as
    well as practically possible - operations and expressions in C were not designed for any particular hardware. (Though some design choices were biased by particular hardware.)

    This is historically inaccurate.

    C was developed by and for the PDP-11 initially, targeting Unix,
    building from Martin Richards's BCPL (which Ritchie and Thompson
    had used under Multics on the GE-645 machine, and GCOS on the
    635) and Ken Thompson's B language, which he had implemented as
    a chopped-down BCPL to be a systems programming language for
    _very_ early Unix on the PDP-7. B was typeless, as the PDP-7
    was word-oriented, and we see vestiges of this ancestral DNA in
    C today. See Ritchie's C history paper for details.

    Concerns for portability, leading to the development of the
    abstract machine informally described by the C standard, came
    much, much later in its evolutionary development.

    Thus, it could have formalized the
    result of signed integer overflow to follow 2's complement
    semantics had the committee so chosen, in which case the result
    would not be "incorrect", it would be well-defined with respect
    to the semantics of the language. Java, for example, does this,
    as does C11 (and later) atomic integer operations. Indeed, the
    C99 rationale document makes frequent reference to twos
    complement, where overflow and modular behavior are frequently
    equivalent, being the common case. But aside from the more
    recent atomics support, C _chose_ not to do this.

    It could have made signed integer overflow defined behaviour, but it did
    not. The C standards committee have explicitly chosen not to do that,
    even after deciding that two's complement is the only supported
    representation for signed integers in C23 onwards. It is fine to have
    two's complement representation, and fine to have modulo arithmetic in
    some circumstances, while leaving other arithmetic overflow undefined.
    Unsigned integer operations in C have always been defined as modulo
    arithmetic - addition of unsigned values is a different operation from
    addition of signed values. Having some modulo behaviour does not in any
    way imply that signed arithmetic should be modulo.

    In Java, the language designers decided that integer arithmetic
    operations would be modulo operations. Wrapping therefore gives the
    correct answer for those operations - it does not give the correct
    answer for mathematical integer operations. And Java loses common
    mathematical identities which C retains - such as the identity that
    adding a positive integer to another integer will increase its value.
    Something always has to be lost when approximating unbounded
    mathematical integers in a bounded implementation - I think C made the
    right choices here about what to keep and what to lose, and Java made
    the wrong choices. (Others may of course have different opinions.)

    In Zig, unsigned integer arithmetic overflow is also UB as these
    operations are not defined as modulo. I think that is a good natural
    choice too - but it is useful for a language to have a way to do
    wrapping arithmetic on the occasions you need it.

    None of this seems relevant to understanding the motivations of
    the members of the committee that produced the 1990 C standard,
    other than agreeing that the decision could have been different.

    I would add that very early C treated signed and unsigned
    arithmetic as more or less equivalent. It wasn't until they
    started porting C to machines other than the PDP-11 that it
    started to matter.

    Also, consider that _unsigned_ arithmetic is defined as having
    wrap-around semantics similar to modular arithmetic, and thus
    incapable of overflow.

    Yes. Unsigned arithmetic operations are different operations from
    signed arithmetic operations in C.

    This is the second time you have mentioned this. Did I say
    something that led you believe that I suggested otherwise, or
    am somehow unaware of this fact?

    But that's simply a fiction invented for
    the abstract machine described informally in the standard: it
    requires special handling on machines like the 1100 series,
    because those machines might trap on overflow. The C committee
    could just as well have said that the unsigned arithmetic
    _could_ overflow and that the result was UB.

    They could have done that (as the Zig folk did).

    Or the SML folks before the Zig folks.

    So why did C choose this way? The only logical reason is that
    there were machines at the time where a) integer overflow
    caused machine exceptions, and b) the representation of signed
    integers was not well-defined, so that the actual value
    resulting from overflow could not be rigorously defined. Given
    that C90 mandated a binary representation for integers and so
    the representation of unsigned integers is basically common,
    there was no need to do that for unsigned arithmetic.

    Not at all. Usually when someone says "the only logical reason is...",
    they really mean "the only logical reason /I/ can think of is...", or
    "the only reason that /I/ can think of that /I/ think is logical is...".

    I probably should have said that I'm also drawing from direct
    references, as well as hints and inferences from other
    historical documents; both editions of K&R as well as early Unix
    source code and the "C Reference Manual" from 6th and 7th
    Edition Unix (the language described in 7th Ed is quite
    different from the language in 6th Ed; most of this was driven
    by a) portability, and b) the need to support
    phototypesetters, hence why the C implemented in 7th Ed and PCC
    is sometimes called "Typesetter C"). This is complemented with
    direct conversations with some of the original players, though
    admittedly those were quite a while ago.

    For a language that can be used as a low-level systems language, it is
    important to be able to do modulo arithmetic efficiently. It is needed
    for a number of low-level tasks, including the implementation of large
    arithmetic operations, handling timers, counters, and other bits and
    pieces. So it was definitely a useful thing to have in C.

    For a language that can be used as a fast and efficient application
    language, it must have a reasonable approximation to mathematical
    integer arithmetic. Implementations should not be forced to have
    behaviours beyond the mathematically sensible answers - if a calculation
    can't be done correctly, there's no point in doing it. Giving nonsense
    results does not help anyone - C programmers or toolchain implementers,
    so the language should not specify any particular result. More sensible
    defined overflow behaviour - saturation, error values, language
    exceptions or traps, etc., would be very inefficient on most hardware.
    So UB is the best choice - and implementations can do something
    different if they like.

    This is where we differ: you keep asserting notions of
    "correctness", without acknowledging that a) correctness differs
    in this context, and b) the notion of what is "correct" has
    itself differed over time as C has evolved.

    Moreover, when you say, "if a calculation can't be done
    correctly, there's no point in doing it" that seems highly
    specific and reliant on your definition of correctness. My

    Here's an example:

    char foo = 128;
    int x = foo + 1;
    printf("%d\n", x);

    What is printed? (Note: that's rhetorical)

    On the systems I just tested, x86_64, ARM64 and RISCV64, I get
    -127 for the first two, and 129 for the last.

    Of course, we all know that this relies on implementation
    defined behavior around whether `char` is treated as signed or
    unsigned (and resultingly conversion from an unsigned constant
    to signed), but if what you say were true about GIGO, why is
    this not _undefined_ behavior?
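    (For anyone who wants to reproduce it, a self-contained version of the
    snippet with the signedness spelled out, so both results can be seen
    on one machine:)

    #include <stdio.h>

    int main(void)
    {
        /* Plain 'char' may be signed or unsigned (implementation-defined),
           which is why the original snippet prints -127 on some targets
           and 129 on others. */
        signed char   sc = 128;   /* doesn't fit; implementation-defined,
                                     -128 in practice on 2's comp targets */
        unsigned char uc = 128;

        printf("%d\n", sc + 1);   /* -127 */
        printf("%d\n", uc + 1);   /* 129 */
        return 0;
    }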

    Too many options make a language bigger - harder to implement, harder to learn, harder to use. So it makes sense to have modulo arithmetic for unsigned types, and normal arithmetic for signed types.

    I am not claiming to know that this is the reasoning made by the C
    language pioneers. But it is definitely an alternative logical reason
    for C being the way it is.

    But we _can_ see what those pioneers were thinking by reading
    the artifacts they left behind, which we know, again based on
    primary sources, had an impact on the standards committee.

    And of course, even today, C still targets oddball platforms
    like DSPs and custom chips, where assumptions about the ubiquity
    of 2's comp may not hold.

    Modern C and C++ standards have dropped support for signed integer
    representation other than two's complement, because they are not in use
    in any modern hardware (including any DSP's) - at least, not for
    general-purpose integers. Both committees have consistently voted to
    keep overflow as UB.

    Yes. As I said, performance is often the justification.

    I'm not convinced that there are no custom chips and/or DSPs
    that are not manufactured today. They may not be common, their
    mere existence is certainly dumb and offensive, but that does
    not mean that they don't exist. Note that the survey in, e.g.,
    https://www.open-std.org/jtc1/sc22/wg14/www/docs/n2218.htm
    only mentions _popular_ DSPs, not _all_ DSPs.

    I think you might have missed a few words in that paragraph, but I
    believe I know what you intended. There are certainly DSPs and other
    cores that have strong support for alternative overflow behaviour -
    saturation is very common in DSPs, and it is also common to have a
    "sticky overflow" flag so that you can do lots of calculations in a
    tight loop, and check for problems once you are finished. I think it is
    highly unlikely that you'll find a core with something other than two's
    complement as the representation for signed integer types, though I
    can't claim that I know /all/ devices! (I do know a bit about more
    cores than would be considered popular or common.)

    I was referring specifically to integer representation here, not
    saturating (or other) operations, but sure.

    Of course, if such machines exist, I will certainly concede that
    I doubt very much that anyone is targeting them with C code
    written to a modern standard.

    Modern C is definitely used on DSPs with strong saturation support.
    (Even ARM cores have saturated arithmetic instructions.) But they can
    also handle two's complement wrapped signed integer arithmetic if the programmer wants that - after all, it's exactly the same in the hardware
    as modulo unsigned arithmetic (except for division). That doesn't mean
    that wrapping signed integer overflow is useful or desired behaviour.

    So again, the context here is understanding the initial
    motivation. I've mentioned reasons why they don't change it now
    (there _are_ arguments about correctness, but compiler writers
    also argue strongly that making signed integer overflow well
    defined would prohibit them from implementing what they consider
    to be important optimizations).
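    (For what it's worth, the canonical example in that argument is a
    plain counted loop - a sketch, not taken from any particular
    compiler's documentation:)

    /* With "int i" and overflow being UB, the compiler may assume i never
       wraps, so it can treat the trip count as exactly n+1, unroll or
       vectorise the loop, and keep the index in a 64-bit register without
       re-doing the sign extension on every iteration.  With defined
       wrapping it would also have to consider n == INT_MAX, where i wraps
       to INT_MIN and the loop never terminates. */
    void scale(float *a, int n)
    {
        for (int i = 0; i <= n; i++)
            a[i] *= 2.0f;
    }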

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Fri Aug 15 12:03:41 2025
    From Newsgroup: comp.arch

    On 8/15/2025 11:19 AM, BGB wrote:
    On 8/15/2025 11:53 AM, John Levine wrote:
    According to Scott Lurndal <slp53@pacbell.net>:
    Section 2.7 also describes a 128-entry TLB.  The TLB is claimed to
    have "typically 97% hit rate".  I would go for larger pages, which
    would reduce the TLB miss rate.

    I think that in 1979 VAX 512 bytes page was close to optimal. ...
    One must also consider that the disks in that era were
    fairly small, and 512 bytes was a common sector size.

    Convenient for both swapping and loading program text
    without wasting space on the disk by clustering
    pages in groups of 2, 4 or 8.

    That's probably it but even at the time the pages seemed rather small.
    Pages on the PDP-10 were 512 words which was about 2K bytes.

    Yeah.


    Can note in some of my own testing, I tested various page sizes, and seemingly found a local optimum at around 16K.

    I think that is consistent with what some others have found. I suspect
    the average page size should grow as memory gets cheaper, which leads to
    more memory on average in systems. This also leads to larger programs,
    as they can "fit" in larger memory with less paging. And as disk
    (spinning or SSD) get faster transfer rates, the cost (in time) of
    paging a larger page goes down. While 4K was the sweet spot some
    decades ago, I think it has increased, probably to 16K. At some point
    in the future, it may get to 64K, but not for some years yet.


    Where, going from 4K or 8K to 16K sees a reduction in TLB miss rates,
    but 16K to 32K or 64K did not see any significant reduction; but did see
    a more significant increase in memory footprint due to allocation
    overheads (where, OTOH, going from 4K to 16K pages does not see much increase in memory footprint).

    Patterns seemed consistent across multiple programs tested, but harder
    to say if this pattern would be universal.


    Had noted if running stats on where in the pages memory accesses land:
      4K: Pages tend to be accessed fairly evenly
     16K: Minor variation as to what parts of the page are being used.
     64K: Significant variation between parts of the page.
    Basically, tracking per-page memory accesses on a finer grain boundary
    (eg, 512 bytes).

    Interesting.


    Say, for example, at 64K one part of the page may be being accessed
    readily but another part of the page isn't really being accessed at all
    (and increasing page size only really sees benefit for TLB miss rate so
    long as the whole page is "actually being used").

    Not necessarily. Consider the case of a 16K (or larger) page with two
    "hot spots" that are more than 4K apart. That takes 2 TLB slots with 4K pages, but only one with larger pages.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Fri Aug 15 19:19:50 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 8/15/2025 11:19 AM, BGB wrote:
    On 8/15/2025 11:53 AM, John Levine wrote:
    According to Scott Lurndal <slp53@pacbell.net>:
    Section 2.7 also describes a 128-entry TLB.  The TLB is claimed to have "typically 97% hit rate".  I would go for larger pages, which would reduce the TLB miss rate.

    I think that in 1979 VAX 512 bytes page was close to optimal. ...
    One must also consider that the disks in that era were
    fairly small, and 512 bytes was a common sector size.

    Convenient for both swapping and loading program text
    without wasting space on the disk by clustering
    pages in groups of 2, 4 or 8.

    That's probably it but even at the time the pages seemed rather small.
    Pages on the PDP-10 were 512 words which was about 2K bytes.

    Yeah.


    Can note in some of my own testing, I tested various page sizes, and
    seemingly found a local optimum at around 16K.

    I think that is consistent with what some others have found. I suspect
    the average page size should grow as memory gets cheaper, which leads to more memory on average in systems. This also leads to larger programs,
    as they can "fit" in larger memory with less paging. And as disk
    (spinning or SSD) get faster transfer rates, the cost (in time) of
    paging a larger page goes down. While 4K was the sweet spot some
    decades ago, I think it has increased, probably to 16K. At some point
    in the future, it may get to 64K, but not for some years yet.

    ARM64 (ARMv8) architecturally supports 4k, 16k and 64k. When
    ARMv8 first came out, one vendor (Redhat) shipped using 64k pages,
    while Ubuntu shipped with 4k page support. 16k support by the
    processor was optional (although the Neoverse cores support all
    three, some third-party cores developed before ARM added 16k
    pages to the architecture specification only supported 4k and 64k).


    Say, for example, at 64K one part of the page may be being accessed
    readily but another part of the page isn't really being accessed at all
    (and increasing page size only really sees benefit for TLB miss rate so
    long as the whole page is "actually being used").

    Not necessarily. Consider the case of a 16K (or larger) page with two
    "hot spots" that are more than 4K apart. That takes 2 TLB slots with 4K >pages, but only one with larger pages.

    Note that the ARMv8 architecture[*] supports terminating the table walk
    before reaching the smallest level, so with 4K pages[**], a single TLB
    entry can cover 4K, 2M or 1GB blocks. With 16k pages, a single
    TLB entry can cover 16k, 32MB or 64GB blocks. 64k pages support
    64k, 512M and 4TB block sizes.

    [*] Intel, AMD and others have similar "large page" capabilities.
    [**] Granules, in ARM terminology.
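    (The block sizes fall straight out of the granule: each translation
    table is one granule of 8-byte descriptors, so a block entry one level
    up maps granule/8 pages, and two levels up maps (granule/8)^2 pages.
    A quick sketch of just the arithmetic, nothing ARM-specific assumed
    beyond the 8-byte descriptor size:)

    #include <stdio.h>

    int main(void)
    {
        unsigned long long granules[] = { 4ULL << 10, 16ULL << 10, 64ULL << 10 };
        for (int i = 0; i < 3; i++) {
            unsigned long long g = granules[i];
            unsigned long long entries = g / 8;   /* descriptors per table */
            /* Prints 4K/2M/1G, 16K/32M/64G and 64K/512M/4096G (= 4TB). */
            printf("%2lluK granule: blocks of %lluK, %lluM, %lluG\n",
                   g >> 10, g >> 10,
                   (g * entries) >> 20, (g * entries * entries) >> 30);
        }
        return 0;
    }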
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Fri Aug 15 20:40:44 2025
    From Newsgroup: comp.arch

    According to Scott Lurndal <slp53@pacbell.net>:
    ARM64 (ARMv8) architecturally supports 4k, 16k and 64k.

    S/370 had 2K or 4K pages grouped into 64K or 1M segments. By the time it became S/390 it was just 4K pages and 1M segments, in a 31 bit address space.

    In zSeries there are multiple 2G regions consisting of 1M segments and 4K pages.
    A segment can optionally be mapped as a single unit, in effect a 1M page.

    These days it doesn't make much sense to have pages smaller than 4K since that's the block size on most disks. I can believe that with today's giant memories and bloated programs larger than 4K pages would work better.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Aug 15 21:22:53 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    These days it doesn't make much sense to have pages smaller than 4K since that's the block size on most disks.

    Two block devices bought less than a year ago:

    Disk model: KINGSTON SEDC2000BM8960G
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes

    Disk model: WD Blue SN580 2TB
    Units: sectors of 1 * 512 = 512 bytes
    Sector size (logical/physical): 512 bytes / 512 bytes
    I/O size (minimum/optimal): 512 bytes / 512 bytes

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sat Aug 16 01:22:57 2025
    From Newsgroup: comp.arch

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    John Levine <johnl@taugh.com> writes:
    These days it doesn't make much sense to have pages smaller than 4K since that's the block size on most disks.

    Two block devices bought less than a year ago:

    SSDs often let you do 512 byte reads and writes for backward compatibility even though the physical block size is much larger.

    Wikipedia tells us all about it:

    https://en.wikipedia.org/wiki/Advanced_Format#512_emulation_(512e)

    Disk model: KINGSTON SEDC2000BM8960G

    Says here the block size of the 480GB version is 16K, so I'd assume the 960GB is
    the same:

    https://www.techpowerup.com/ssd-specs/kingston-dc2000b-480-gb.d2166

    Disk model: WD Blue SN580 2TB

    I can't find anything on its internal structure but I see the vendor's random read/write benchmarks all use 4K blocks so that's probably the internal block size.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Aug 16 05:09:43 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    SSDs often let you do 512 byte reads and writes for backward compatibility even
    though the physical block size is much larger.

    Yes. But if the argument had any merit that 512B is a good page size
    because it avoids having to transfer 8, 16, or 32 sectors at a time,
    it would still have merit, because the interface still shows 512B
    sectors. In 1985, 1986 and 1992 the common HDDs of the time had
    actual 512B sectors, so if that argument had any merit, the i386
    (1985), MIPS R2000 (1986), SPARC (1986), and Alpha (1992) should have
    been introduced with 512B pages, but they actually were introduced
    with 4KB (386, MIPS, SPARC) and 8KB (Alpha) pages.

    Disk model: WD Blue SN580 2TB

    I can't find anything on its internal structure but I see the vendor's random read/write benchmarks all use 4K blocks so that's probably the internal block size.

    https://www.techpowerup.com/ssd-specs/western-digital-sn580-2-tb.d1542

    claims

    |Page Size: 16 KB
    |Block Size: 1344 Pages

    I assume that the "Block size" means the size of an erase block.
    Where does the number 1344 come from? My guess is that it has to do
    with:

    |Type: TLC
    |Technology: 112-layer

    3*112*4=1344

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Aug 16 03:17:49 2025
    From Newsgroup: comp.arch

    On 8/15/2025 2:03 PM, Stephen Fuld wrote:
    On 8/15/2025 11:19 AM, BGB wrote:
    On 8/15/2025 11:53 AM, John Levine wrote:
    According to Scott Lurndal <slp53@pacbell.net>:
    Section 2.7 also describes a 128-entry TLB.  The TLB is claimed to have "typically 97% hit rate".  I would go for larger pages, which would reduce the TLB miss rate.

    I think that in 1979 VAX 512 bytes page was close to optimal. ...
    One must also consider that the disks in that era were
    fairly small, and 512 bytes was a common sector size.

    Convenient for both swapping and loading program text
    without wasting space on the disk by clustering
    pages in groups of 2, 4 or 8.

    That's probably it but even at the time the pages seemed rather small.
    Pages on the PDP-10 were 512 words which was about 2K bytes.

    Yeah.


    Can note in some of my own testing, I tested various page sizes, and
    seemingly found a local optimum at around 16K.

    I think that is consistent with what some others have found.  I suspect
    the average page size should grow as memory gets cheaper, which leads to more memory on average in systems.  This also leads to larger programs,
    as they can "fit" in larger memory with less paging.  And as disk
    (spinning or SSD) get faster transfer rates, the cost (in time) of
    paging a larger page goes down.  While 4K was the sweet spot some
    decades ago, I think it has increased, probably to 16K.  At some point
    in the future, it may get to 64K, but not for some years yet.


    Some of the programs I have tested don't have particularly large memory footprints by modern standards (~ 10 to 50MB).

    Excluding very small programs (where TLB miss rate becomes negligible)
    had noted that 16K appeared to be reasonably stable.


    Where, going from 4K or 8K to 16K sees a reduction in TLB miss rates,
    but 16K to 32K or 64K did not see any significant reduction; but did
    see a more significant increase in memory footprint due to allocation
    overheads (where, OTOH, going from 4K to 16K pages does not see much
    increase in memory footprint).

    Patterns seemed consistent across multiple programs tested, but harder
    to say if this pattern would be universal.


    Had noted if running stats on where in the pages memory accesses land:
       4K: Pages tend to be accessed fairly evenly
      16K: Minor variation as to what parts of the page are being used.
      64K: Significant variation between parts of the page.
    Basically, tracking per-page memory accesses on a finer grain boundary
    (eg, 512 bytes).

    Interesting.


    Say, for example, at 64K one part of the page may be being accessed
    readily but another part of the page isn't really being accessed at
    all (and increasing page size only really sees benefit for TLB miss
    rate so long as the whole page is "actually being used").

    Not necessarily.  Consider the case of a 16K (or larger) page with two
    "hot spots" that are more than 4K apart.  That takes 2 TLB slots with 4K pages, but only one with larger pages.


    This is part of why 16K has an advantage.

    But, it drops off with 32K or 64K, as one may have a lot of large gaps
    of relatively little activity.

    So, rather than having a 64K page with two or more hot-spots ~ 30K apart
    or less, one may often just have a lot of pages with one hot-spot.

    Granted, my testing was far from exhaustive...


    One may think that larger page would always be better for TLB miss rate,
    but this assumes that most of the pages have most of the page being
    accessed.

    Which, as noted, is fairly true at 4/8/16K, but seemingly not as true at
    32K or 64K.

    And, for the more limited effect of the larger page size on reducing TLB
    miss rate, one does have a lot more memory being wasted by things like "mmap()" type calls.


    Say, for example, you want to allocate 93K via "mmap()":
    4K pages: 96K (waste=3K, 3%)
    8K pages: 96K
    16K pages: 96K
    32K pages: 96K
    64K pages: 128K (waste=35K, 27%)
    OK, 99K:
    4K: 100K (waste= 1K, 1%)
    8K: 104K (waste= 5K, 5%)
    16K: 112K (waste=13K, 12%)
    32K: 128K (waste=29K, 23%)
    64K: 128K
    What about 65K:
    4K: 68K (waste= 3K, 4%)
    8K: 72K (waste= 7K, 10%)
    16K: 80K (waste=15K, 19%)
    32K: 96K (waste=31K, 32%)
    64K: 128K (waste=63K, 49%)

    ...


    So, bigger pages aren't great for "mmap()" with smaller allocation sizes.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sat Aug 16 10:00:18 2025
    From Newsgroup: comp.arch

    On 8/15/2025 10:09 PM, Anton Ertl wrote:
    John Levine <johnl@taugh.com> writes:
    SSDs often let you do 512 byte reads and writes for backward compatibility even
    though the physical block size is much larger.

    Yes. But if the argument had any merit that 512B is a good page size
    because it avoids having to transfer 8, 16, or 32 sectors at a time,
    it would still have merit, because the interface still shows 512B
    sectors.

    I don't think anyone has argued for 512B page sizes. There are two
    issues that are perhaps being conflated. One is whether it would be
    better if page sizes were increased from the current typical 4K to 16K.
    The other is about changing the size of blocks on disks (both hard disks
    and SSDs) from 512 bytes to 4K bytes.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sat Aug 16 17:06:42 2025
    From Newsgroup: comp.arch

    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    John Levine <johnl@taugh.com> writes:
    SSDs often let you do 512 byte reads and writes for backward compatibility even
    though the physical block size is much larger.

    Yes. But if the argument had any merit that 512B is a good page size
    because it avoids having to transfer 8, 16, or 32 sectors at a time,
    it would still have merit, because the interface still shows 512B
    sectors.

    I think we're agreeing that even in the early 1980s a 512 byte page was
    too small. They certainly couldn't have made it any smaller, but they
    should have made it larger.

    S/370 was a decade before that and its pages were 2K or 4K. The KI-10,
    the first PDP-10 with paging, had 2K pages in 1972. Its pager was based
    on BBN's add-on pager for TENEX, built in 1970 also with 2K pages.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Aug 16 15:26:31 2025
    From Newsgroup: comp.arch

    On 8/7/2025 6:38 AM, Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    EricP wrote:
    Signetics 82S100/101 Field Programmable Logic Array FPLA (an AND-OR matrix)
    were available in 1975. Mask programmable PLA were available from TI
    circa 1970 but masks would be too expensive.

    If I was building a TTL risc cpu in 1975 I would definitely be using
    lots of FPLA's, not just for decode but also state machines in fetch,
    page table walkers, cache controllers, etc.

    The question isn't could one build a modern risc-style pipelined cpu
    from TTL in 1975 - of course one could. Nor do I see any question of
    could it beat a VAX 780 0.5 MIPS at 5 MHz - of course it could, easily.

    I'm pretty sure I could use my Mk-I risc ISA and build a 5 stage pipeline
    running at 5 MHz getting 1 IPC sustained when hitting the 200 ns cache
    (using some in-order superscalar ideas and two reg file write ports
    to "catch up" after pipeline bubbles).

    TTL risc would also be much cheaper to design and prototype.
    VAX took hundreds of people many many years.

    The question is could one build this at a commercially competitive price?
    There is a reason people did things sequentially in microcode.
    All those control decisions that used to be stored as bits in microcode now
    become real logic gates. And in SSI TTL you don't get many to the $.
    And many of those sequential microcode states become independent concurrent
    state machines, each with its own logic sequencer.

    I am confused. You gave a possible answer in the posting you are
    replying to.

    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software. While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.


    Yeah, this approach works a lot better than people seem to give it
    credit for...

    It is maybe pushing it a little if one wants to use an AVL-tree or
    B-Tree for virtual memory vs a page-table, but it is otherwise pretty much
    OK, assuming the TLB miss rate isn't too unreasonable.


    For the TLB, had noticed best results with 4 or 8 way associativity:
      1-way: Doesn't work for the main TLB.
             (1-way does work OK for an L1 TLB in a split L1/L2 TLB config.)
      2-way: Barely works.
             In some edge cases and configurations,
             may get stuck in a TLB miss loop.
      4-way: Works fairly well; the cheaper option.
      8-way: Works better, but significantly more expensive.

    A possible intermediate option could be 6-way associativity.
    Full associativity is impractically expensive.
    Also, a large set-associative TLB beats a small fully associative TLB.

    For a lot of the test programs I run, TLB size:
    64x: Small, fairly high TLB miss rate.
    256x: Mostly good
    512x or 1024x: Can mostly eliminate TLB misses, but debatable.

    In practice, this has mostly left 256x4 as the main configuration for
    the Main TLB. Optionally, can use a 64x1 L1 TLB (with the main TLB as an
    L2 TLB), but this is optional.
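
    For illustration, roughly what a 256x4 lookup involves (a sketch only; the
    entry layout, field widths, and ASID handling here are assumptions, not the
    actual core's implementation):

    #include <stdint.h>
    #include <stdbool.h>

    #define TLB_SETS   256
    #define TLB_WAYS   4
    #define PAGE_SHIFT 14            /* 16K pages */

    typedef struct {
        uint64_t vpn;                /* virtual page number (tag) */
        uint64_t ppn;                /* physical page number      */
        uint16_t asid;
        bool     valid;
    } tlb_entry_t;

    static tlb_entry_t tlb[TLB_SETS][TLB_WAYS];

    /* Index by the low bits of the VPN, then compare the tag (and ASID)
       against the 4 ways in that set. */
    static bool tlb_lookup(uint64_t va, uint16_t asid, uint64_t *pa)
    {
        uint64_t vpn = va >> PAGE_SHIFT;
        uint32_t set = (uint32_t)(vpn & (TLB_SETS - 1));

        for (int w = 0; w < TLB_WAYS; w++) {
            const tlb_entry_t *e = &tlb[set][w];
            if (e->valid && e->vpn == vpn && e->asid == asid) {
                *pa = (e->ppn << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1));
                return true;         /* hit */
            }
        }
        return false;                /* miss: trap to the SW miss handler */
    }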


    A hardware page walker or inverted page table has been considered, but
    not crossed into use yet. If I were to add a hardware page walker, it
    would likely be semi-optional (still allowing processes to use
    unconventional memory management as needed, *).

    Supported page sizes thus far are 4K, 16K, and 64K. In test-kern, 16K
    mostly won out, using a 3-level page table and 48-bit address space,
    though technically the current page-table layout only does 47 bits.

    Idea was that the high half of the address space could use a separate
    System page table, but this isn't really used thus far.

    *: One merit of a software TLB is that it allows for things like nested
    page tables or other trickery without needing any actual hardware
    support. Though, you can also easily enough fake this in software as
    well (a host TLB miss pulling from the guest TLB and then translating
    the address again); a sketch of this follows below.
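
    A rough sketch of that guest-then-host translation on a host TLB miss
    (all function names here are hypothetical placeholders, not code from
    TestKern):

    #include <stdint.h>
    #include <stdbool.h>

    /* Hypothetical helpers: the guest's own translation, the host's
       translation, and installing an entry into the host TLB. */
    extern bool guest_translate(uint64_t guest_va, uint64_t *guest_pa);
    extern bool host_translate(uint64_t guest_pa, uint64_t *host_pa);
    extern void host_tlb_install(uint64_t va, uint64_t pa, uint16_t asid);

    /* On a host TLB miss for a guest virtual address:
       1. translate guest-VA -> guest-PA via the guest's TLB / page table,
       2. treat the guest-PA as a host VA and translate it again,
       3. install the combined guest-VA -> host-PA mapping in the host TLB. */
    bool handle_host_tlb_miss(uint64_t guest_va, uint16_t asid)
    {
        uint64_t guest_pa, host_pa;

        if (!guest_translate(guest_va, &guest_pa))
            return false;            /* reflect the fault back to the guest */
        if (!host_translate(guest_pa, &host_pa))
            return false;            /* genuine host-level page fault */

        host_tlb_install(guest_va, host_pa, asid);
        return true;
    }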


    Ended up not as inclined towards inverted page tables, as they offer
    fewer benefits than a page walker but would have many of the same issues
    in terms of implementation complexity (needs to access RAM and perform multiple memory accesses to resolve a miss, ...). The page walker then
    is closer to the end goal, whereas the IPT is basically just a much
    bigger RAM-backed TLB.



    Actually, it is not too far removed from doing a weaker (not-quite-IEEE)
    FPU in hardware, and then using optional traps to emulate full IEEE
    behavior (never mind that such an FPU encountering things like subnormal
    numbers tends to make performance tank; hence the usual temptation
    to just disable the use of full IEEE semantics).

    ...


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Aug 17 06:16:08 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    It is maybe pushing it a little if one wants to use an AVL-tree or
    B-Tree for virtual memory vs a page-table

    I assume that you mean a balanced search tree (binary (AVL) or n-ary
    (B)) vs. the now-dominant hierarchical multi-level page tables, which
    are tries.

    In both a hardware and a software implementation, one could implement
    a balanced search tree, but what would be the advantage?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Aug 17 10:00:56 2025
    From Newsgroup: comp.arch

    BGB wrote:
    On 8/7/2025 6:38 AM, Anton Ertl wrote:

    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software. While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.


    Yeah, this approach works a lot better than people seem to give it
    credit for...

    Both HW and SW table walkers incur the cost of reading the PTE's.
    The pipeline drain and load of the software TLB miss handler,
    then a drain and reload of the original code on return
    are a large expense that HW walkers do not have.

    In-Line Interrupt Handling for Software-Managed TLBs 2001 https://terpconnect.umd.edu/~blj/papers/iccd2001.pdf

    "For example, Anderson, et al. [1] show TLB miss handlers to be among
    the most commonly executed OS primitives; Huck and Hays [10] show that
    TLB miss handling can account for more than 40% of total run time;
    and Rosenblum, et al. [18] show that TLB miss handling can account
    for more than 80% of the kernel’s computation time.
    Recent studies show that TLB-related precise interrupts occur
    once every 100–1000 user instructions on all ranges of code, from
    SPEC to databases and engineering workloads [5, 18]."



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Aug 17 15:21:38 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    BGB wrote:
    On 8/7/2025 6:38 AM, Anton Ertl wrote:

    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software. While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.


    Yeah, this approach works a lot better than people seem to give it
    credit for...

    Both HW and SW table walkers incur the cost of reading the PTE's.
    The pipeline drain and load of the software TLB miss handler,
    then a drain and reload of the original code on return
    are a large expense that HW walkers do not have.

    Why not treat the SW TLB miss handler as similar to a call as
    possible? Admittedly, calls occur as part of the front end, while (in
    an OoO core) the TLB miss comes from the execution engine or the
    reorder buffer, but still: could it just be treated like a call
    inserted in the instruction stream at the time when it is noticed,
    with the instructions running in a special context (read access to
    page tables allowed). You may need to flush the pipeline anyway,
    though, if the TLB miss

    "For example, Anderson, et al. [1] show TLB miss handlers to be among
    the most commonly executed OS primitives; Huck and Hays [10] show that
    TLB miss handling can account for more than 40% of total run time;
    and Rosenblum, et al. [18] show that TLB miss handling can account
    for more than 80% of the kernel’s computation time.

    I have seen ~90% of the time spent on TLB handling on an Ivy Bridge
    with hardware table walking, on a 1000x1000 matrix multiply with
    pessimal spatial locality (2 TLB misses per iteration). Each TLB miss
    cost about 20 cycles.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Aug 17 11:29:20 2025
    From Newsgroup: comp.arch

    On 8/17/2025 1:16 AM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    It is maybe pushing it a little if one wants to use an AVL-tree or
    B-Tree for virtual memory vs a page-table

    I assume that you mean a balanced search tree (binary (AVL) or n-ary
    (B)) vs. the now-dominant hierarchical multi-level page tables, which
    are tries.


    Yes.

    AVL tree is a balanced binary tree that tracks depth and "rotates" nodes
    as needed to keep the depth of one side within +/- 1 of the other.

    The B-Trees would use N elements per node, which are stored in sorted
    order so that one can use a binary search.
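
    For reference, minimal sketches of what a node keyed by virtual page
    number might look like in each case (illustrative only; the field names
    and B-tree order are assumptions):

    #include <stdint.h>

    /* AVL variant: one node per mapping, keyed by virtual page number. */
    typedef struct avl_node {
        uint64_t         vpn;        /* key: virtual page number          */
        uint64_t         pte;        /* value: physical page + flag bits  */
        int              height;     /* subtree height, drives rotations  */
        struct avl_node *left, *right;
    } avl_node;

    /* B-tree variant: N sorted keys per node, binary-searched in-node. */
    #define BT_ORDER 16
    typedef struct bt_node {
        int             nkeys;
        uint64_t        vpn[BT_ORDER];        /* keys, kept sorted   */
        uint64_t        pte[BT_ORDER];        /* values (leaf level) */
        struct bt_node *child[BT_ORDER + 1];  /* subtrees (interior) */
    } bt_node;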


    In both a hardware and a software implementation, one could implement
    a balanced search tree, but what would be the advantage?


    Can use less RAM for large sparse address spaces with aggressive ASLR.
    However, looking up a page or updating the page table is significantly
    slower (enough to be relevant).

    Though, I mostly ended up staying with more conventional page tables and
    weakening the ASLR, where it may try to reuse the previous bits (47:25)
    and (47:36) of the address a few times, to reduce page-table
    fragmentation (sparse, mostly-empty page-table pages).

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Aug 17 13:35:03 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    Unlesss... maybe somebody (a customer, or they themselves)
    discovered that there may have been conditions where they could
    only guarantee 80 ns. Maybe a combination of tolerances to one
    side and a certain logic programming, and they changed the
    data sheet.
    Manufacturing process variation leads to timing differences that
    testing sorts into speed bins. The faster bins sell at higher price.

    Is that possible with a PAL before it has been programmed?

    They can speed and partially function test it.
    Its programmed by blowing internal fuses which is a one-shot thing
    so that function can't be tested.

    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.
    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input inverter,
    Should be free coming from a Flip-Flop.
    Depends on what chips you use for registers.
    If you want both Q and Qb then you only get 4 FF in a package like 74LS375.

    For a wide instruction or stage register I'd look at chips such as a 74LS377
    with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock enable, vcc, gnd.

    So if you need eight outputs, your choice is to use two 74LS375
    (presumably more expensive) or a 74LS377 and an eight-chip
    inverter (a bit slower, but inverters should be fast).

    Another point... if you don't need 16 inputs or 8 outputs, you
    are also paying a lot more. If you have a 6-bit primary opcode,
    you don't need a full 16 bits of input.
    I'm just showing why it was more than just an AND gate.

    Two layers of NAND :-)

    Thinking about different ways of doing this...
    If the first NAND layer has open collector outputs then we can use
    a wired-AND logic driving an inverter for the second NAND plane.

    If the instruction buffer outputs to a set of 74159 4:16 demux with
    open collector outputs, then we can just wire the outputs we want
    together with a 10k pull-up resistor and drive an inverter,
    to form the second output NAND layer.

    inst buf <15:8>   <7:0>
             |    |   |   |
           4:16 4:16 4:16 4:16
           vvvv vvvv vvvv vvvv
      10k  ---|---|---|---|------>INV->
      10k  ---------------------->INV->
      10k  ---------------------->INV->

    I'm still exploring whether it can be variable length instructions or
    has to be fixed 32-bit. In either case all the instruction "code" bits
    (as in op code or function code or whatever) should be checked,
    even if just to verify that should-be-zero bits are zero.

    There would also be instruction buffer Valid bits and other state bits
    like Fetch exception detected, interrupt request, that might feed into
    a bank of PLA's multiple wide and deep.

    Agreed, the logic has to go somewhere. Regularity in the
    instruction set would have been even more important then than now
    to reduce the logic requirements for decoding.

    The question is whether in 1975 main memory is so expensive that
    we cannot afford the wasted space of a fixed 32-bit ISA.
    In 1975 the widely available DRAM was the Intel 1103 1k*1b.
    The 4kb drams were just making it to customers, 16kb were preliminary.

    Looking at the instruction set usage of VAX in

    Measurement and Analysis of Instruction Use in VAX 780, 1982 https://dl.acm.org/doi/pdf/10.1145/1067649.801709

    we see that the top 25 instructions cover about 80-90% of the usage,
    and many of them would fit into 2 or 3 bytes.
    A fixed 32-bit instruction would waste 1 to 2 bytes on most instructions.

    But a fixed 32-bit instruction is very much easier to fetch and
    decode, and needs a lot less logic for shifting prefetch buffers,
    compared to, say, variable lengths of 1 to 12 bytes.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Aug 17 12:53:32 2025
    From Newsgroup: comp.arch

    On 8/17/2025 9:00 AM, EricP wrote:
    BGB wrote:
    On 8/7/2025 6:38 AM, Anton Ertl wrote:

    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software.  While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.


    Yeah, this approach works a lot better than people seem to give it
    credit for...

    Both HW and SW table walkers incur the cost of reading the PTE's.
    The pipeline drain and load of the software TLB miss handler,
    then a drain and reload of the original code on return
    are a large expense that HW walkers do not have.


    I am not saying SW page walkers are fast.
    Though in my experience, the cycle cost of the SW TLB miss handling
    isn't "too bad".

    If it were a bigger issue in my case, could probably add a HW page
    walker, as I had long considered it as a possible optional feature. In
    this case, it could be per-process (with the LOBs of the page-base
    register also encoding whether or not HW page-walking is allowed; along
    with in my case also often encoding the page-table type/layout).


    In-Line Interrupt Handling for Software-Managed TLBs 2001 https://terpconnect.umd.edu/~blj/papers/iccd2001.pdf

    "For example, Anderson, et al. [1] show TLB miss handlers to be among
    the most commonly executed OS primitives; Huck and Hays [10] show that
    TLB miss handling can account for more than 40% of total run time;
    and Rosenblum, et al. [18] show that TLB miss handling can account
    for more than 80% of the kernel’s computation time.
    Recent studies show that TLB-related precise interrupts occur
    once every 100–1000 user instructions on all ranges of code, from
    SPEC to databases and engineering workloads [5, 18]."


    This is around 2 orders of magnitude more than I am often seeing in my
    testing (mind you, with a TLB miss handler that is currently written in C).


    But, this is partly where things like page-sizes and also the size of
    the TLB can have a big effect.

    Ideally, one wants a TLB that has a coverage larger than the working set
    of the typical applications (and OS); at which point miss rate becomes negligible. Granted, if one has GB's of RAM, and larger programs, this
    is a harder problem...


    Then the ratio of working set to TLB coverage comes into play, which
    granted (sadly) appears to follow an (workingSet/coverage)^2 curve...


    I had noted before that some of the 90s era RISC's had comparatively very
    small TLBs, such as 64-entry fully associative, or 16x4.
    Such a TLB with a 4K page size has a coverage of roughly 256K.

    Where, most programs have working sets somewhat larger than 256K.

    Looking it up, the DEC Alpha used a 48 entry TLB, so 192K coverage, yeah...


    The CPU time cost of TLB Miss handling would be significantly reduced
    with a "not pissant" TLB.



    I was mostly using 256x4, with a 16K page size, which covers a working
    set of roughly 16MB.

    A 1024x4 would cover 64MB, and 1024x6 would cover 96MB.

    One possibility though would be to use 64K pages for larger programs,
    which would increase coverage of a 1024x TLB to 256MB or 384MB.

    At present, a 1024x4 TLB would use 64K of Block-RAM, and 1024x6 would
    use 98K.
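
    Spelling out that arithmetic (the 16-byte / 128-bit entry size is inferred
    from the 64K figure, not stated; the 1024x6 case then comes out at ~96K
    rather than 98K, suggesting slightly wider entries):

    #include <stdio.h>

    int main(void)
    {
        const unsigned page_kb = 16;     /* 16K pages                       */
        const unsigned entry_b = 16;     /* assumed: 128 bits per TLB entry */
        const struct { unsigned sets, ways; } cfg[] = {
            { 256, 4 }, { 1024, 4 }, { 1024, 6 }
        };

        for (int i = 0; i < 3; i++) {
            unsigned entries  = cfg[i].sets * cfg[i].ways;
            unsigned cover_mb = entries * page_kb / 1024;   /* coverage, MB */
            unsigned ram_kb   = entries * entry_b / 1024;   /* storage, KB  */
            printf("%4ux%u: %u entries, coverage %u MB, ~%u KB of RAM\n",
                   cfg[i].sets, cfg[i].ways, entries, cover_mb, ram_kb);
        }
        return 0;
    }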

    But, yeah... this is comparable to the apparent TLB sizes on a lot of
    modern ARM processors; which typically deal with somewhat larger working
    sets than I am dealing with.


    Another option is to RAM-back part of the TLB, essentially as an
    "Inverted Page Table", but admittedly, this has similar complexities to
    a HW page walker (and the hassle of still needing a fault handler to
    deal with missing IPT entries).



    In an ideal case, could make sense to write at least the fast path of
    the miss handler in ASM.

    Note that TLB misses are segregated into their own interrupt category
    separate from other interrupt:
    8: General Fault (Memory Faults, Instruction Faults, FPU Traps)
    A: TLB Miss (TLB Miss, ACL Miss)
    C: Interrupt (1kHz HW timer mostly)
    E: Syscall (System Calls)

    Typically, the VBR layout looks like:
    + 0: Reset (typically only used on boot, with VBR reset to 0)
    + 8: General Fault
    +16: TLB Miss
    +24: Interrupt
    +32: Syscall
    With a non-standard alignment requirement (vector table needs to be
    aligned to a multiple of 256 bytes, for "reasons"). Though actual CPU
    core currently only needs a 64B alignment (256B would allow adding a lot
    more vectors while staying with the use of bit-slicing). Each "entry" in
    this table being a branch to the entry point of the ISR handler.

    On initial Boot, as a little bit of a hack, the CPU looks at the
    encoding of the Reset Vector branch to determine the initial ISA Mode
    (such as XG1, XG3, or RV64GC).



    If doing a TLB miss handler in ASM, a possible strategy could be:
      Save off some of the registers;
      Check if it is a simple-case TLB miss or ACL miss;
      If so:
        Try to deal with it;
        Restore registers;
        Return.
      Otherwise:
        Save the rest of the registers;
        Deal with the more complex scenario (probably in C land),
        such as initiating a context switch to the page-fault handler.

    For the simple cases:
    TLB Miss involves walking the page table;
    ACL miss may involve first looking up the ID pairs in a hash table;
    Fallback cases may involve more complex logic in a more general handler.
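
    A sketch of what the simple-case walk could look like in C, assuming the
    3-level table with 16K pages described earlier (11+11+11 index bits plus a
    14-bit offset, i.e. 47 bits); the helper and the PTE layout are
    assumptions, not TestKern's actual handler:

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_SHIFT   14                        /* 16K pages           */
    #define IDX_BITS     11                        /* 2048 PTEs per page  */
    #define IDX_MASK     ((1u << IDX_BITS) - 1)
    #define PTE_VALID    1ull
    #define PTE_ADDR(x)  ((x) & ~0x3FFFull)        /* assumed flag layout */

    extern void tlb_install(uint64_t va, uint64_t pte);   /* hypothetical */

    /* Walk a 3-level table rooted at 'root' (assumed directly addressable
       here) and install the leaf PTE on success; return false to fall
       through to the slow path / page-fault logic. */
    bool tlb_miss_fast_path(uint64_t root, uint64_t fault_va)
    {
        uint64_t vpn = fault_va >> PAGE_SHIFT;
        unsigned i2 = (vpn >> (2 * IDX_BITS)) & IDX_MASK;
        unsigned i1 = (vpn >> (1 * IDX_BITS)) & IDX_MASK;
        unsigned i0 =  vpn                    & IDX_MASK;

        const uint64_t *l2 = (const uint64_t *)(uintptr_t)root;
        if (!(l2[i2] & PTE_VALID)) return false;
        const uint64_t *l1 = (const uint64_t *)(uintptr_t)PTE_ADDR(l2[i2]);
        if (!(l1[i1] & PTE_VALID)) return false;
        const uint64_t *l0 = (const uint64_t *)(uintptr_t)PTE_ADDR(l1[i1]);
        if (!(l0[i0] & PTE_VALID)) return false;

        tlb_install(fault_va, l0[i0]);
        return true;
    }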



    At present, the Interrupt and Syscall handlers have the quirk that
    they require TBR to be set up first, as they directly save to the
    register save area (relative to TBR), rather than using the interrupt
    stack. The main rationale here is that these interrupts frequently
    perform context switches and saving/restoring registers to TBR greatly
    reduces the performance cost of performing a context switch.

    Note though that one ideally wants to use shared address spaces or ASIDs
    to limit the amount of TLB misses.

    Can note that currently my CPU core uses 16-bit ASIDs, split into 6+10
    bits, currently 64 groups, each with 1024 members. Global pages are
    generally only global within a group, and high-numbered groups are
    assumed to not allow global pages. Say, for example, if you were running
    a VM, you wouldn't want its VAS being polluted with global pages from
    the host OS.

    Though, global pages would allow things like DLLs and similar to be
    shared without needing TLB misses for them on context switches.
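
    A small sketch of the 6+10 split and the matching rule as described (the
    exact bit layout and the cutoff for "high numbered" groups are
    assumptions):

    #include <stdint.h>
    #include <stdbool.h>

    #define ASID_GROUP(a)   ((uint16_t)(a) >> 10)     /* 6 bits: 64 groups   */
    #define ASID_MEMBER(a)  ((uint16_t)(a) & 0x3FF)   /* 10 bits: 1024 each  */
    #define NO_GLOBAL_MIN   48   /* assumed cutoff for "high numbered" groups */

    /* Does a TLB entry match the current ASID?  Global pages match any
       member of the same group, but only in groups that allow globals. */
    bool asid_match(uint16_t entry_asid, bool entry_global, uint16_t cur_asid)
    {
        if (entry_asid == cur_asid)
            return true;
        return entry_global &&
               ASID_GROUP(entry_asid) == ASID_GROUP(cur_asid) &&
               ASID_GROUP(cur_asid) < NO_GLOBAL_MIN;
    }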

    ...





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Jakob Bohm@egenagwemdimtapsar@jbohm.dk to comp.arch,comp.lang.c on Sun Aug 17 20:18:36 2025
    From Newsgroup: comp.arch

    On 2025-08-05 23:08, Kaz Kylheku wrote:
    On 2025-08-04, Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 04 Aug 2025 09:53:51 -0700
    Keith Thompson <Keith.S.Thompson+u@gmail.com> wrote:
    In C17 and earlier, _BitInt is a reserved identifier. Any attempt to
    use it has undefined behavior. That's exactly why new keywords are
    often defined with that ugly syntax.

    That is language lawyer's type of reasoning. Normally gcc maintainers
    are wiser than that because, well, by chance gcc happens to be widely
    used production compiler. I don't know why this time they had chosen
    less conservative road.

    They invented an identifier which lands in the _[A-Z].* namespace
    designated as reserved by the standard.

    What would be an example of a more conservative way to name the
    identifier?


    What is actually going on is GCC offering its users a gradual way to transition from C17 to C23, by applying the C23 meaning of any C23
    construct that has no conflicting meaning in C17 . In particular, this
    allows installed library headers to use the new types as part of
    logically opaque (but compiler visible) implementation details, even
    when those libraries are used by pure C17 programs. For example, the
    ISO POSIX datatype struct stat could contain a _BitInt(128) type for
    st_dev or st_ino if the kernel needs that, as was the case with the 1996
    NT kernel . Or a _BitInt(512) for st_uid as used by that same kernel .

    GCC --pedantic is an option to check if a program is a fully conforming portable C program, with the obvious exception of the contents of any
    used "system" headers (including installed libc headers), as those are
    allowed to implement standard or non-standard features in implementation specific ways, and might even include implementation specific logic to
    report the use of non-standard extensions to the library standards when
    the compiler is invoked with --pedantic and no contrary options .

    I am unsure how GCC --pedantic deals with the standards-contrary
    features in the GNUC89 language, such as the different type of (foo,
    'C') (GNUC says char, C89 says int), maybe specifying standard C instead
    of GNUC reverts those to the standard definition .

    Enjoy

    Jakob
    --
    Jakob Bohm, MSc.Eng., I speak only for myself, not my company
    This public discussion message is non-binding and may contain errors
    All trademarks and other things belong to their owners, if any.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Aug 17 19:10:21 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Why not treat the SW TLB miss handler as similar to a call as
    possible? Admittedly, calls occur as part of the front end, while (in
    an OoO core) the TLB miss comes from the execution engine or the
    reorder buffer, but still: could it just be treated like a call
    inserted in the instruction stream at the time when it is noticed,
    with the instructions running in a special context (read access to
    page tables allowed). You may need to flush the pipeline anyway,
    though, if the TLB miss

    ... if the buffers fill up and there is not enough resources left for
    the TLB miss handler.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Aug 17 15:08:14 2025
    From Newsgroup: comp.arch

    On 8/17/2025 2:10 PM, Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Why not treat the SW TLB miss handler as similar to a call as
    possible? Admittedly, calls occur as part of the front end, while (in
    an OoO core) the TLB miss comes from the execution engine or the
    reorder buffer, but still: could it just be treated like a call
    inserted in the instruction stream at the time when it is noticed,
    with the instructions running in a special context (read access to
    page tables allowed). You may need to flush the pipeline anyway,
    though, if the TLB miss

    ... if the buffers fill up and there is not enough resources left for
    the TLB miss handler.


    If the processor has microcode, could try to handle it that way.

    If it could work, and the CPU allows sufficiently complex logic in
    microcode to deal with this.

    ...



    One idea I had considered early on was that there would be a
    special interrupt class that always goes into the ROM; so to the OS it
    would always look as if there were a HW page walker.

    This was eventually dropped though, as I was typically using 32K for the
    Boot ROM, and with the initial startup tests, font initialization, and
    FAT32 driver + PEL and elf loaders, ..., there wasn't much space left
    for "niceties" like TLB miss handling and similar. So, the role of the
    ROM was largely reduced to initial boot-up.

    It could be possible to have a "2-stage ROM", where the first stage boot
    ROM also loads more "ROM" from the SDcard. But, at that point, may as
    well just go over to using the current loader design to essentially try
    to load a UEFI BIOS or similar (which could then load the OS, achieving basically the same effect).

    Where, in effect, UEFI is basically an OS in its own right, which just
    so happens to use similar binary formats to what I am using already (eg, PE/COFF).

    Not yet gone up the learning curve for how to make TestKern behave like
    a UEFI backend though (say, for example, if I wanted to try to get
    "Debian RV64G" or similar to boot on my stuff).


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Aug 17 18:56:49 2025
    From Newsgroup: comp.arch

    On 8/17/2025 12:35 PM, EricP wrote:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    Thomas Koenig wrote:
    EricP <ThatWouldBeTelling@thevillage.com> schrieb:

    Unlesss... maybe somebody (a customer, or they themselves)
    discovered that there may have been conditions where they could
    only guarantee 80 ns.  Maybe a combination of tolerances to one
    side and a certain logic programming, and they changed the
    data sheet.
    Manufacturing process variation leads to timing differences that
    testing sorts into speed bins. The faster bins sell at higher price.

    Is that possible with a PAL before it has been programmed?

    They can speed and partially function test it.
    Its programmed by blowing internal fuses which is a one-shot thing
    so that function can't be tested.

    By comparison, you could get an eight-input NAND gate with a
    maximum delay of 12 ns (the 74H030), so putting two in sequence
    to simulate a PLA would have been significantly faster.
    I can understand people complaining that PALs were slow.
    The 82S100 PLA is logic equivalent to:
    - 16 inputs each with an optional input inverter,
    Should be free coming from a Flip-Flop.
    Depends on what chips you use for registers.
    If you want both Q and Qb then you only get 4 FF in a package like
    74LS375.

    For a wide instruction or stage register I'd look at chips such as a
    74LS377
    with 8 FF in a 20 pin dip, 8 input, 8 Q out, clock, clock enable,
    vcc, gnd.

    So if you need eight outputs, your choice is to use two 74LS375
    (presumably more expensive) or a 74LS377 and an eight-chip
    inverter (a bit slower, but inverters should be fast).

    Another point... if you don't need 16 inputs or 8 outputs, you
    are also paying a lot more.  If you have a 6-bit primary opcode,
    you don't need a full 16 bits of input.
    I'm just showing why it was more than just an AND gate.

    Two layers of NAND :-)

    Thinking about different ways of doing this...
    If the first NAND layer has open collector outputs then we can use
    a wired-AND logic driving an inverter for the second NAND plane.

    If the instruction buffer outputs to a set of 74159 4:16 demux with
    open collector outputs, then we can just wire the outputs we want
    together with a 10k pull-up resistor and drive an inverter,
    to form the second output NAND layer.

    inst buf <15:8>   <7:0>
             |    |   |   |
           4:16 4:16 4:16 4:16
           vvvv vvvv vvvv vvvv
      10k  ---|---|---|---|------>INV->
      10k  ---------------------->INV->
      10k  ---------------------->INV->

    I'm still exploring whether it can be variable length instructions or
    has to be fixed 32-bit. In either case all the instruction "code" bits
    (as in op code or function code or whatever) should be checked,
    even if just to verify that should-be-zero bits are zero.

    There would also be instruction buffer Valid bits and other state bits
    like Fetch exception detected, interrupt request, that might feed into
    a bank of PLA's multiple wide and deep.

    Agreed, the logic has to go somewhere.  Regularity in the
    instruction set would have been even more important then than now
    to reduce the logic requirements for decoding.

    The question is whether in 1975 main memory is so expensive that
    we cannot afford the wasted space of a fixed 32-bit ISA.
    In 1975 the widely available DRAM was the Intel 1103 1k*1b.
    The 4kb drams were just making it to customers, 16kb were preliminary.

    Looking at the instruction set usage of VAX in

    Measurement and Analysis of Instruction Use in VAX 780, 1982 https://dl.acm.org/doi/pdf/10.1145/1067649.801709

    we see that the top 25 instructions cover about 80-90% of the usage,
    and many of them would fit into 2 or 3 bytes.
    A fixed 32-bit instruction would waste 1 to 2 bytes on most instructions.

    But a fixed 32-bit instruction is very much easier to fetch and
    decode, and needs a lot less logic for shifting prefetch buffers,
    compared to, say, variable lengths of 1 to 12 bytes.


    When code density is the goal, a 16/32 RISC can do well.

    Can note:
    Maximizing code density often prefers fewer registers;
    For 16-bit instructions, 8 or 16 registers is good;
    8 is rather limiting;
    32 registers uses too many bits.


    Can note ISAs with 16 bit encodings:
    PDP-11: 8 registers
    M68K : 2x 8 (A and D)
    MSP430: 16
    Thumb : 8|16
    RV-C : 8|32
    SuperH: 16
    XG1 : 16|32 (Mostly 16)


    In my recent fiddling for trying to design a pair encoding for XG3, can
    note the top-used instructions are mostly, it seems (non Ld/St):
    ADD Rs, 0, Rd //MOV Rs, Rd
    ADD X0, Imm, Rd //MOV Imm, Rd
    ADDW Rs, 0, Rd //EXTS.L Rs, Rd
    ADDW Rd, Imm, Rd //ADDW Imm, Rd
    ADD Rd, Imm, Rd //ADD Imm, Rd

    Followed by:
    ADDWU Rs, 0, Rd //EXTU.L Rs, Rd
    ADDWU Rd, Imm, Rd //ADDWu Imm, Rd
    ADDW Rd, Rs, Rd //ADDW Rs, Rd
    ADD Rd, Rs, Rd //ADD Rs, Rd
    ADDWU Rd, Rs, Rd //ADDWU Rs, Rd

    Most every other ALU instruction and usage pattern either follows a bit further behind or could not be expressed in a 16-bit op.

    For Load/Store:
    SD Rn, Disp(SP)
    LD Rn, Disp(SP)
    LW Rn, Disp(SP)
    SW Rn, Disp(SP)

    LD Rn, Disp(Rm)
    LW Rn, Disp(Rm)
    SD Rn, Disp(Rm)
    SW Rn, Disp(Rm)


    For registers, there is a split:
    Leaf functions:
    R10..R17, R28..R31 dominate.
    Non-Leaf functions:
    R10, R18..R27, R8/R9

    For 3-bit configurations:
    R8..R15 Reg3A
    R18/R19, R20/R21, R26/R27, R10/R11 Reg3B

    Reg3B was a bit hacky, but had similar hit rates while using less encoding
    space than a 4-bit R8..R23 (saving 1 bit in the relevant scenarios).


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.arch,comp.lang.c on Sun Aug 17 22:18:28 2025
    From Newsgroup: comp.arch

    Jakob Bohm <egenagwemdimtapsar@jbohm.dk> writes:
    [...]
    I am unsure how GCC --pedantic deals with the standards-contrary
    features in the GNUC89 language, such as the different type of (foo,
    'C') (GNUC says char, C89 says int), maybe specifying standard C
    instead of GNUC reverts those to the standard definition .

    I'm not sure what you're referring to. You didn't say what foo is.

    I believe that in all versions of C, the result of a comma operator has
    the type and value of its right operand, and the type of an unprefixed character constant is int.

    Can you show a complete example where `sizeof (foo, 'C')` yields
    sizeof (int) in any version of GNUC?
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Aug 18 05:48:00 2025
    From Newsgroup: comp.arch

    Dan Cross <cross@spitfire.i.gajendra.net> schrieb:
    In article <107mf9l$u2si$1@dont-email.me>,
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    It's not clear to me what the distinction of technical vs. business
    is supposed to be in the context of ISA design. Could you explain?

    I can attempt to, though I'm not sure if I can be successful.

    [...]

    And so with the VAX, I can imagine the work (which started in,
    what, 1975?) being informed by a business landscape that saw an
    increasing trend towards favoring high-level languages, but also
    saw the continued development of large, bespoke, business
    applications for another five or more years, and with customers
    wanting to be able to write (say) complex formatting sequences
    easily in assembler (the EDIT instruction!), in a way that was
    compatible with COBOL (so make the COBOL compiler emit the EDIT instruction!), while also trying to accommodate the scientific
    market (POLYF/POLYG!) who would be writing primarily in FORTRAN
    but jumping to assembler for the fuzz-busting speed boost (so
    stabilize what amounts to an ABI very early on!), and so forth.

    I had actually forgotten that the VAX also had decimal
    instructions. But the 11/780 also had one really important
    restriction: It could only do one write every six cycles, see https://dl.acm.org/doi/pdf/10.1145/800015.808199 , so that
    severely limited their throughput there (assuming they did
    things bytewise). So yes, decimal arithmetic was important
    in the day for COBOL and related commercial applications.

    So, what to do with decimal arithmetic, which was important
    at the time (and a business consideration)?

    Something like Power's addg6s instruction could have been
    introduced, it adds two numbers together, generating only the
    decimal carries, and puts a nibble "6" into the corresponding
    nibble if there is one, and "0" otherwise. With 32 bits, that
    would allow addition of eight-digit decimal numbers in four
    instructions (see one of the POWER ISA documents for details),
    but the cycle of "read ASCII digits, do arithmetic, write
    ASCII digits" would have needed some extra shifts and masks,
    so it might have been more beneficial to use four digits per
    register.
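
    For reference, the underlying bias-by-6 trick can be written out in plain
    C; this is not the addg6s instruction sequence itself, just the same idea
    applied to eight packed-BCD digits in a 32-bit word:

    #include <stdint.h>

    /* Add two 8-digit packed-BCD numbers.  Pre-bias every digit by 6 so
       that binary nibble carries coincide with decimal carries, then take
       the bias back out of every digit that did not generate a carry. */
    uint32_t bcd_add8(uint32_t a, uint32_t b)
    {
        uint32_t t  = a + 0x66666666u;            /* bias each digit by 6   */
        uint32_t s  = t + b;                      /* binary sum             */
        uint32_t co = (t & b) | ((t | b) & ~s);   /* carry out of each bit  */
        uint32_t nc = (~co >> 3) & 0x11111111u;   /* 1 per nibble w/o carry */
        return s - nc * 6u;                       /* remove unneeded bias   */
    }

    /* e.g. bcd_add8(0x00000019, 0x00000026) == 0x00000045 */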

    The article above is also extremely interesting otherwise. It does
    not give cycle timings for each individual instruction and address
    mode, but it gives statistics on how they were used, and a good
    explanation of the timing implications of their microcode design.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Richard Heathfield@rjh@cpax.org.uk to comp.arch,comp.lang.c on Mon Aug 18 08:02:30 2025
    From Newsgroup: comp.arch

    On 18/08/2025 06:18, Keith Thompson wrote:
    Jakob Bohm <egenagwemdimtapsar@jbohm.dk> writes:
    [...]
    I am unsure how GCC --pedantic deals with the standards-contrary
    features in the GNUC89 language, such as the different type of (foo,
    'C') (GNUC says char, C89 says int), maybe specifying standard C
    instead of GNUC reverts those to the standard definition .

    I'm not sure what you're referring to. You didn't say what foo is.

    I believe that in all versions of C, the result of a comma operator has
    the type and value of its right operand, and the type of an unprefixed character constant is int.

    Can you show a complete example where `sizeof (foo, 'C')` yields
    sizeof (int) in any version of GNUC?

    $ cat so.c
    #include <stdio.h>

    int main(void)
    {
        int foo = 42;
        size_t soa = sizeof (foo, 'C');
        size_t sob = sizeof foo;
        printf("%s.\n", (soa == sob) ? "Yes" : "No");
        return 0;
    }
    $ gcc -o so so.c
    $ ./so
    Yes.
    $ gcc --version
    gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
    --
    Richard Heathfield
    Email: rjh at cpax dot org dot uk
    "Usenet is a strange place" - dmr 29 July 1999
    Sig line 4 vacant - apply within

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch,comp.lang.c on Mon Aug 18 11:34:49 2025
    From Newsgroup: comp.arch

    On 18.08.2025 07:18, Keith Thompson wrote:
    Jakob Bohm <egenagwemdimtapsar@jbohm.dk> writes:
    [...]
    I am unsure how GCC --pedantic deals with the standards-contrary
    features in the GNUC89 language, such as the different type of (foo,
    'C') (GNUC says char, C89 says int), maybe specifying standard C
    instead of GNUC reverts those to the standard definition .

    I'm not sure what you're referring to. You didn't say what foo is.

    I believe that in all versions of C, the result of a comma operator has
    the type and value of its right operand, and the type of an unprefixed character constant is int.

    Can you show a complete example where `sizeof (foo, 'C')` yields
    sizeof (int) in any version of GNUC?


    Presumably that's a typo - you meant to ask when the size is /not/ the
    size of "int" ? After all, you said yourself that "(foo, 'C')"
    evaluates to 'C' which is of type "int". It would be very interesting
    if Jakob can show an example where gcc treats the expression as any
    other type than "int".


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Aug 18 11:03:15 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    BGB wrote:
    On 8/7/2025 6:38 AM, Anton Ertl wrote:
    Concerning page table walker: The MIPS R2000 just has a TLB and traps
    on a TLB miss, and then does the table walk in software. While that's
    not a solution that's appropriate for a wide superscalar CPU, it was
    good enough for beating the actual VAX 11/780 by a good margin; at
    some later point, you would implement the table walker in hardware,
    but probably not for the design you do in 1975.

    Yeah, this approach works a lot better than people seem to give it
    credit for...
    Both HW and SW table walkers incur the cost of reading the PTE's.
    The pipeline drain and load of the software TLB miss handler,
    then a drain and reload of the original code on return
    are a large expense that HW walkers do not have.

    Why not treat the SW TLB miss handler as similar to a call as
    possible? Admittedly, calls occur as part of the front end, while (in
    an OoO core) the TLB miss comes from the execution engine or the
    reorder buffer, but still: could it just be treated like a call
    inserted in the instruction stream at the time when it is noticed,
    with the instructions running in a special context (read access to
    page tables allowed). You may need to flush the pipeline anyway,
    though, if the TLB miss

    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.

    All of these are attempts to fix inherent drawbacks and limitations
    in the SW-miss approach, and all of them run counter to the only
    advantage SW-miss had: its simplicity.
    The SW approach is inherently synchronous and serial -
    it can only handle one TLB miss at a time, one PTE read at a time.

    None of those research papers that I have seen consider the possibility
    that OoO can make use of multiple concurrent HW walkers if the
    cache supports hit-under-miss and multiple pending miss buffers.

    While instruction fetch only needs to occasionally translate a VA one
    at a time, with more aggressive alternate path prefetching all those VA
    have to be translated first before the buffers can be prefetched.
    LSQ could also potentially be translating as many VA as there are entries.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.
    Each PTE read can cache miss and stall that walker.
    As most OoO caches support multiple pending misses and hit-under-miss,
    you can create as many HW walkers as you can afford.

    "For example, Anderson, et al. [1] show TLB miss handlers to be among
    the most commonly executed OS primitives; Huck and Hays [10] show that
    TLB miss handling can account for more than 40% of total run time;
    and Rosenblum, et al. [18] show that TLB miss handling can account
    for more than 80% of the kernel’s computation time.

    I have seen ~90% of the time spent on TLB handling on an Ivy Bridge
    with hardware table walking, on a 1000x1000 matrix multiply with
    pessimal spatial locality (2 TLB misses per iteration). Each TLB miss
    cost about 20 cycles.

    - anton

    I'm looking for papers that separate out the common cost of loading a PTE
    from the extra cost of just the SW-miss handler. I had a paper a while
    back but can't find it now. IIRC in that paper the extra cost of the
    SW miss handler on Alpha was measured at 5-25%.

    One thing to mention about some of these papers looking at TLB performance:
    some papers on virtual address translation appear NOT to be aware
    that Intel's HW walker on its downward walk caches the interior-node
    PTE's in auxiliary TLB's and checks for PTE TLB hits in bottom-to-top
    order (called a bottom-up walk), and thereby avoids many HW walks from the root.

    A SW walker can accomplish the same bottom-up walk by locating
    the different page table levels at *virtual* base addresses,
    and adding each VA of those interior PTE's to the TLB.
    This is what VAX VA translate did, probably Alpha too but I didn't check.
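
    A sketch of such a bottom-up software walk, with the leaf PTEs mapped at a
    fixed virtual base (constants and helpers are illustrative, roughly
    VAX/Alpha-flavored, not any particular OS's code):

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_SHIFT 13                          /* 8K pages, Alpha-style */
    #define PT_VBASE   0xFFFFFE0000000000ull       /* illustrative VA of the
                                                      virtually-mapped leaf
                                                      page table            */

    extern bool tlb_fill(uint64_t va, uint64_t pte);    /* hypothetical       */
    extern bool walk_from_root(uint64_t va);            /* slow top-down walk */

    /* On a TLB miss for 'va': try a single load from the virtually-mapped
       leaf table.  If that load itself misses, it traps again for the
       PTE's own address, which is resolved by the top-down walk; the
       common case needs only one memory access. */
    bool tlb_miss(uint64_t va)
    {
        uint64_t vpn    = va >> PAGE_SHIFT;
        uint64_t pte_va = PT_VBASE + vpn * sizeof(uint64_t);
        uint64_t pte    = *(volatile uint64_t *)(uintptr_t)pte_va;

        if (pte & 1)                  /* valid bit assumed to be bit 0 */
            return tlb_fill(va, pte);
        return walk_from_root(va);    /* not present: take the slow path */
    }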

    This interior PTE node caching is critical for optimal performance
    and some of their stats don't take it into account
    and give much worse numbers than they should.

    Also many papers were written before ASID's were in common use
    so the TLB got invalidated with each address space switch.
    This would penalize any OS which had separate user and kernel space.

    So all these numbers need to be taken with a grain of salt.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Aug 18 15:35:36 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.

    All of these are attempts to fix inherent drawbacks and limitations
    in the SW-miss approach, and all of them run counter to the only
    advantage SW-miss had: its simplicity.

    Another advantage is the flexibility: you can implement any
    translation scheme you want: hierarchical page tables, inverted page
    tables, search trees, .... However, given that hierarchical page
    tables have won, this is no longer an advantage anyone cares for.

    The SW approach is inherently synchronous and serial -
    it can only handle one TLB miss at a time, one PTE read at a time.

    On an OoO engine, I don't see that. The table walker software is
    called in its special context and the instructions in the table walker
    are then run through the front end and the OoO engine. Another table
    walk could be started at any time (even when the first table walk has
    not yet finished feeding its instructions to the front end), and once
    inside the OoO engine, the execution is OoO and concurrent anyway. It
    would be useful to avoid two searches for the same page at the same
    time, but hardware walkers have the same problem.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.

    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want atomicity there.

    Each PTE read can cache miss and stall that walker.
    As most OoO caches support multiple pending misses and hit-under-miss,
    you can create as many HW walkers as you can afford.

    Which poses the question: is it cheaper to implement n table walkers,
    or to add some resources and mechanism that allows doing SW table
    walks until the OoO engine runs out of resources, and a recovery
    mechanism in that case.

    I see other performance and conceptual disadvantages for the envisioned
    SW walkers, however:

    1) The SW walker is inserted at the front end and there may be many
    ready instructions ahead of it before the instructions of the SW
    walker get their turn. By contrast, a hardware walker sits in the
    load/store unit and can do its own loads and stores with priority over
    the program-level loads and stores. However, it's not clear that
    giving priority to table walking is really a performance advantage.

    2) Some decisions will have to be implemented as branches, resulting
    in branch misses, which cost time and lead to all kinds of complexity
    if you want to avoid resetting the whole pipeline (which is the normal
    reaction to a branch misprediction).

    3) The reorder buffer processes instructions in architectural order.
    If the table walker's instructions get their sequence numbers from
    where they are inserted into the instruction stream, they will not
    retire until after the memory access that waits for the table walker
    is retired. Deadlock!

    It may be possible to solve these problems (your idea of doing it with something like hardware threads may point in the right direction), but
    it's probably easier to stay with hardware walkers.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Aug 18 17:19:13 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.
    the same problem.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.

    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want >atomicity there.

    To avoid race conditions with software clearing those bits, presumably.

    ARM64 originally didn't support hardware updates in V8.0, they were
    independent hardware features added to V8.1.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Keith Thompson@Keith.S.Thompson+u@gmail.com to comp.arch,comp.lang.c on Mon Aug 18 21:57:59 2025
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> writes:
    On 18.08.2025 07:18, Keith Thompson wrote:
    Jakob Bohm <egenagwemdimtapsar@jbohm.dk> writes:
    [...]
    I am unsure how GCC --pedantic deals with the standards-contrary
    features in the GNUC89 language, such as the different type of (foo,
    'C') (GNUC says char, C89 says int), maybe specifying standard C
    instead of GNUC reverts those to the standard definition .
    I'm not sure what you're referring to. You didn't say what foo is.
    I believe that in all versions of C, the result of a comma operator
    has
    the type and value of its right operand, and the type of an unprefixed
    character constant is int.
    Can you show a complete example where `sizeof (foo, 'C')` yields
    sizeof (int) in any version of GNUC?

    Presumably that's a typo - you meant to ask when the size is /not/ the
    size of "int" ? After all, you said yourself that "(foo, 'C')"
    evaluates to 'C' which is of type "int". It would be very interesting
    if Jakob can show an example where gcc treats the expression as any
    other type than "int".

    Yes (more of a thinko, actually).

    I meant to ask about `sizeof (foo, 'C')` yielding a value *other than*
    `sizeof (int)`. Jakob implies a difference in this area between GNU C
    and ISO C. I'm not aware of any.
    --
    Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
    void Void(void) { Void(); } /* The recursive call of the void */
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Aug 19 05:47:01 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    For extremely wide cores, like Apple's M (modulo ISA), AMD Zen5 and
    Intel Lion Cove, I'd do the following modification to your inner loop
    (back in Intel syntax):

    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    adc edx,edx
    add rax,[r9+rcx*8]
    adc edx,0
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret


    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never
    incremen_edx:
    inc edx
    jmp edx_ready

    The idea is interesting, but I don't understand the code. The
    following looks funny to me:

    1) You increment edx in increment_edx, then jump back to edx_ready and
    immediately overwrite edx with ebx. Then you do nothing with it,
    and then you clear edx in the next iteration. So both the "inc
    edx" and the "mov edx, ebx" look like dead code to me that can be
    optimized away.

    2) There is a loop-carried dependency through ebx, and the number
    accumulating in ebx and the carry check makes no sense with that.

    Could it be that you wanted to do "mov ebx, edx" at edx_ready? It all
    makes more sense with that. ebx then contains the carry from the last
    cycle on entry. The carry dependency chain starts at clearing edx,
    then gets to additional carries, then is copied to ebx, transferred
    into the next iteration, and is ended there by overwriting ebx. No
    dependency cycles (except the loop counter and addresses, which can be
    dealt with by hardware or by unrolling), and ebx contains the carry
    from the last iteration

    One other problem is that according to Agner Fog's instruction tables,
    even the latest and greatest CPUs from AMD and Intel that he measured
    (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
    and adc has a latency of 1, so breaking the dependency chain in a
    beneficial way should avoid the use of adc. For our three-summand
    add, it's not clear if adcx and adox can run in the same cycle, but
    looking at your measurements, it is unlikely.

    So we would need something other than "adc edx, edx" to set the carry
    register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
    (and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
    have 1 in edi, and then do, for two-summand addition:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never.
    incremen_edx:
    inc edx
    jmp edx_ready

    However, even without the loop overhead (which can be reduced with
    unrolling) that's 8 instructions per iteration, and therefore we will
    have a hard time executing it at less than 1 cycle/iteration on current
    CPUs. What if we mix in some adc-based stuff to bring down the
    instruction count? E.g., with one adc-based and one cmov-based
    iteration:

    mov edi,1
    xor ebx,ebx
    next:
    mov rax,[rsi+rcx*8]
    add [r8+rcx*8], rax
    mov rax,[rsi+rcx*8+8]
    add [r8+rcx*8+8], rax
    xor edx, edx
    mov rax,[rsi+rcx*8+16]
    adc rax,[r8+rcx*8+16]
    cmovc edx, edi
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8+16],rax
    add rcx,3
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never.
    incremen_edx:
    inc edx
    jmp edx_ready

    Now we have 15 instructions per unrolled iteration (3 original
    iterations). Executing an unrolled iteration in less than three
    cycles might be in reach for Zen3 and Raptor Cove (I don't remember if
    all the other resource limits are also satisfied; the load/store unit
    may be at its limit, too).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Aug 19 07:09:51 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    The idea is interesting, but I don't understand the code. The
    following looks funny to me:

    1) You increment edx in increment_edx, then jump back to edx_ready and
    immediately overwrite edx with ebx. Then you do nothing with it,
    and then you clear edx in the next iteration. So both the "inc
    edx" and the "mov edx, ebx" look like dead code to me that can be
    optimized away.

    2) There is a loop-carried dependency through ebx, and the number
    accumulating in ebx and the carry check makes no sense with that.

    Could it be that you wanted to do "mov ebx, edx" at edx_ready? It all
    makes more sense with that. ebx then contains the carry from the last
    cycle on entry. The carry dependency chain starts at clearing edx,
    then gets to additional carries, then is copied to ebx, transferred
    into the next iteration, and is ended there by overwriting ebx. No dependency cycles (except the loop counter and addresses, which can be
    dealt with by hardware or by unrolling), and ebx contains the carry
    from the last iteration

    One other problem is that according to Agner Fog's instruction tables,
    even the latest and greatest CPUs from AMD and Intel that he measured
    (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
    and adc has a latency of 1, so breaking the dependency chain in a
    beneficial way should avoid the use of adc. For our three-summand
    add, it's not clear if adcx and adox can run in the same cycle, but
    looking at your measurements, it is unlikely.

    So we would need something other than "adc edx, edx" to set the carry register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
    (and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
    have 1 in edi, and then do, for two-summand addition:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never.
    incremen_edx:
    inc edx
    jmp edx_ready

    Forgot to fix the "mov edx, ebx" here. One other thing: I think that
    the "add rbx, rax" should be "add rax, rbx". You want to add the
    carry to rax before storing the result. So the version with just one
    iteration would be:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never.
    incremen_edx:
    inc edx
    jmp edx_ready

    And the version with the two additional adc-using iterations would be
    (with an additional correction):

    mov edi,1
    xor ebx,ebx
    next:
    mov rax,[rsi+rcx*8]
    add [r8+rcx*8], rax
    mov rax,[rsi+rcx*8+8]
    adc [r8+rcx*8+8], rax
    xor edx, edx
    mov rax,[rsi+rcx*8+16]
    adc rax,[r8+rcx*8+16]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8+16],rax
    add rcx,3
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never.
    incremen_edx:
    inc edx
    jmp edx_ready

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Aug 19 12:11:56 2025
    From Newsgroup: comp.arch

    Anton, I like what you and Michael have done, but I'm still not sure everything is OK:

    In your code, I only see two input arrays [rsi] and [r8], instead of
    three? (Including [r9])

    Re breaking dependency chains (from Michael):

    In each iteration we have four inputs:

    carry_in from the previous iteration, [rsi+rcx*8], [r8+rcx*8] and
    [r9+rcx*8], and we want to generate [rdi+rcx*8] and the carry_out.

    Assuming effectively random inputs, cin+[rsi]+[r8]+[r9] will result in
    random low-order 64 bits in [rdi], and either 0, 1 or 2 as carry_out.

    In order to break the per-iteration dependency (per Michael), it is
    sufficient to branch out IFF adding cin to the 3-sum produces an
    additional carry:

    ; rdx = cin (0,1,2)
    next:
    mov rbx,rdx ; Save CIN
    xor rdx,rdx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    adc rdx,rdx ; RDX = 0 or 1 (50:50)
    add rax,[r9+rcx*8]
    adc rdx,0 ; RDX = 0, 1 or 2 (33:33:33)

    ; At this point RAX has the 3-sum, now do the cin 0..2 add

    add rax,rbx
    jc fixup ; Pretty much never taken

    save:
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret

    ; placed after the return; taken approximately never for random inputs
    fixup:
    inc rdx
    jmp save

    It would also be possible to use SETC to save the intermediate carries...

    Terje

    Anton Ertl wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    The idea is interesting, but I don't understand the code. The
    following looks funny to me:

    1) You increment edx in increment_edx, then jump back to edx_ready and
    immediately overwrite edx with ebx. Then you do nothing with it,
    and then you clear edx in the next iteration. So both the "inc
    edx" and the "mov edx, ebx" look like dead code to me that can be
    optimized away.

    2) There is a loop-carried dependency through ebx, and the number
    accumulating in ebx and the carry check makes no sense with that.

    Could it be that you wanted to do "mov ebx, edx" at edx_ready? It all
    makes more sense with that. ebx then contains the carry from the last
    cycle on entry. The carry dependency chain starts at clearing edx,
    then gets to additional carries, then is copied to ebx, transferred
    into the next iteration, and is ended there by overwriting ebx. No
    dependency cycles (except the loop counter and addresses, which can be
    dealt with by hardware or by unrolling), and ebx contains the carry
    from the last iteration

    One other problem is that according to Agner Fog's instruction tables,
    even the latest and greatest CPUs from AMD and Intel that he measured
    (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,
    and adc has a latency of 1, so breaking the dependency chain in a
    beneficial way should avoid the use of adc. For our three-summand
    add, it's not clear if adcx and adox can run in the same cycle, but
    looking at your measurements, it is unlikely.

    So we would need something other than "adc edx, edx" to set the carry
    register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
    (and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
    have 1 in edi, and then do, for two-summand addition:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never.
    incremen_edx:
    inc edx
    jmp edx_ready

    Forgot to fix the "mov edx, ebx" here. One other thing: I think that
    the "add rbx, rax" should be "add rax, rbx". You want to add the
    carry to rax before storing the result. So the version with just one iteration would be:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never.
    incremen_edx:
    inc edx
    jmp edx_ready

    And the version with the two additional adc-using iterations would be
    (with an additional correction):

    mov edi,1
    xor ebx,ebx
    next:
    mov rax,[rsi+rcx*8]
    add [r8+rcx*8], rax
    mov rax,[rsi+rcx*8+8]
    adc [r8+rcx*8+8], rax
    xor edx, edx
    mov rax,[rsi+rcx*8+16]
    adc rax,[r8+rcx*8+16]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8+16],rax
    add rcx,3
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never.
    incremen_edx:
    inc edx
    jmp edx_ready

    - anton

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Aug 19 17:20:54 2025
    From Newsgroup: comp.arch

    On Tue, 19 Aug 2025 07:09:51 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    The idea is interesting, but I don't understand the code. The
    following looks funny to me:

    1) You increment edx in increment_edx, then jump back to edx_ready
    and
    immediately overwrite edx with ebx. Then you do nothing with it,
    and then you clear edx in the next iteration. So both the "inc
    edx" and the "mov edx, ebx" look like dead code to me that can be
    optimized away.

    2) There is a loop-carried dependency through ebx, and the number
    accumulating in ebx and the carry check makes no sense with that.

    Could it be that you wanted to do "mov ebx, edx" at edx_ready? It
    all makes more sense with that. ebx then contains the carry from
    the last cycle on entry. The carry dependency chain starts at
    clearing edx, then gets to additional carries, then is copied to
    ebx, transferred into the next iteration, and is ended there by
    overwriting ebx. No dependency cycles (except the loop counter and addresses, which can be dealt with by hardware or by unrolling), and
    ebx contains the carry from the last iteration

    One other problem is that according to Agner Fog's instruction
    tables, even the latest and greatest CPUs from AMD and Intel that he measured (Zen5 and Tiger Lake) can only execute one adc/adcx/adox
    per cycle, and adc has a latency of 1, so breaking the dependency
    chain in a beneficial way should avoid the use of adc. For our three-summand add, it's not clear if adcx and adox can run in the
    same cycle, but looking at your measurements, it is unlikely.

    So we would need something other than "adc edx, edx" to set the carry register. According to Agner Fog Zen3 can perform 2 cmovc per cycle
    (and Zen5 can do 4/cycle), so that might be the way to do it. E.g.,
    have 1 in edi, and then do, for two-summand addition:

    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rbx,rax
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov edx, ebx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never.
    incremen_edx:
    inc edx
    jmp edx_ready

    Forgot to fix the "mov edx, ebx" here. One other thing: I think that
    the "add rbx, rax" should be "add rax, rbx". You want to add the
    carry to rax before storing the result. So the version with just one iteration would be:

    Too many back and forth mental switches between Intel and AT&T syntax.
    The real code that I measured was for Windows platform, but in AT&T
    (gnu) syntax.
    Below is the full function with the loop unrolled by 3. The rest I may
    answer later; right now I don't have time.

    .file "add3_my_u3.s"
    .text
    .p2align 4
    .globl add3
    .def add3; .scl 2; .type 32; .endef
    .seh_proc add3
    add3:
    pushq %r13
    .seh_pushreg %r13
    pushq %r12
    .seh_pushreg %r12
    pushq %rbp
    .seh_pushreg %rbp
    pushq %rdi
    .seh_pushreg %rdi
    pushq %rsi
    .seh_pushreg %rsi
    pushq %rbx
    .seh_pushreg %rbx
    .seh_endprologue
    # %rcx - dst
    # %rdx - a
    # %r8 - b
    # %r9 - c
    sub %rcx, %rdx
    sub %rcx, %r8
    sub %rcx, %r9
    mov $341, %ebx
    xor %eax, %eax
    .loop:
    xor %esi, %esi
    mov (%rcx,%rdx), %rdi
    mov 8(%rcx,%rdx), %rbp
    mov 16(%rcx,%rdx), %r10
    add (%rcx,%r8), %rdi
    adc 8(%rcx,%r8), %rbp
    adc 16(%rcx,%r8), %r10
    adc %esi, %esi
    add (%rcx,%r9), %rdi
    adc 8(%rcx,%r9), %rbp
    adc 16(%rcx,%r9), %r10
    adc $0, %esi
    add %rax, %rdi # add carry from the previous iteration
    jc .prop_carry
    .carry_done:
    mov %esi, %eax
    mov %rdi, (%rcx)
    mov %rbp, 8(%rcx)
    mov %r10, 16(%rcx)
    lea 24(%rcx), %rcx
    dec %ebx
    jnz .loop

    sub $(1023*8), %rcx
    mov %rcx, %rax

    popq %rbx
    popq %rsi
    popq %rdi
    popq %rbp
    popq %r12
    popq %r13
    ret

    .prop_carry:
    add $1, %rbp
    adc $0, %r10
    adc $0, %esi
    jmp .carry_done

    .seh_endproc








    mov edi,1
    xor ebx,ebx
    next:
    xor edx, edx
    mov rax,[rsi+rcx*8]
    add rax,[r8+rcx*8]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8],rax
    inc rcx
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never.
    incremen_edx:
    inc edx
    jmp edx_ready

    And the version with the two additional adc-using iterations would be
    (with an additional correction):

    mov edi,1
    xor ebx,ebx
    next:
    mov rax,[rsi+rcx*8]
    add [r8+rcx*8], rax
    mov rax,[rsi+rcx*8+8]
    adc [r8+rcx*8+8], rax
    xor edx, edx
    mov rax,[rsi+rcx*8+16]
    adc rax,[r8+rcx*8+16]
    cmovc edx, edi
    add rax,rbx
    jc incremen_edx
    ; eliminate data dependency between loop iteration
    ; replace it by very predictable control dependency
    edx_ready:
    mov ebx, edx
    mov [rdi+rcx*8+16],rax
    add rcx,3
    cmp rcx,r10
    jb next
    ...
    ret
    ; that code is placed after return
    ; it is executed extremely rarely. For random inputs - approximately never.
    incremen_edx:
    inc edx
    jmp edx_ready

    - anton


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Aug 19 17:24:23 2025
    From Newsgroup: comp.arch

    Above I posted, by mistake, not the most up-to-date variant, sorry.
    Here is the correct code:

    .file "add3_my_u3.s"
    .text
    .p2align 4
    .globl add3
    .def add3; .scl 2; .type 32; .endef
    .seh_proc add3
    add3:
    pushq %rbp
    .seh_pushreg %rbp
    pushq %rdi
    .seh_pushreg %rdi
    pushq %rsi
    .seh_pushreg %rsi
    pushq %rbx
    .seh_pushreg %rbx
    .seh_endprologue
    # %rcx - dst
    # %rdx - a
    # %r8 - b
    # %r9 - c
    sub %rcx, %rdx
    sub %rcx, %r8
    sub %rcx, %r9
    mov $341, %ebx
    xor %eax, %eax
    .loop:
    xor %esi, %esi
    mov (%rcx,%rdx), %rdi
    mov 8(%rcx,%rdx), %r10
    mov 16(%rcx,%rdx), %r11
    add (%rcx,%r8), %rdi
    adc 8(%rcx,%r8), %r10
    adc 16(%rcx,%r8), %r11
    adc %esi, %esi
    add (%rcx,%r9), %rdi
    adc 8(%rcx,%r9), %r10
    adc 16(%rcx,%r9), %r11
    adc $0, %esi
    add %rax, %rdi # add carry from the previous iteration
    jc .prop_carry
    .carry_done:
    mov %esi, %eax
    mov %rdi, (%rcx)
    mov %r10, 8(%rcx)
    mov %r11, 16(%rcx)
    lea 24(%rcx), %rcx
    dec %ebx
    jnz .loop

    sub $(1023*8), %rcx
    mov %rcx, %rax

    popq %rbx
    popq %rsi
    popq %rdi
    popq %rbp
    ret

    .prop_carry:
    add $1, %r10
    adc $0, %r11
    adc $0, %esi
    jmp .carry_done

    .seh_endproc


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Aug 19 17:43:25 2025
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Anton, I like what you and Michael have done, but I'm still not sure everything is OK:

    In your code, I only see two input arrays [rsi] and [r8], instead of
    three? (Including [r9])

    I implemented a two-summand addition, not three-summand. I wanted the
    minimum of complexity to make it easier to understand, and latency is
    a bigger problem for the two-summand case.

    It would also be possible to use SETC to save the intermediate carries...

    I must have had a bad morning. Instead of xor edx, edx, setc dl (also
    2 per cycle on Zen3), I wrote

    mov edi,1
    ...
    xor edx, edx
    ...
    cmovc edx, edi

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Aug 19 23:03:01 2025
    From Newsgroup: comp.arch

    On Tue, 19 Aug 2025 05:47:01 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    One other problem is that according to Agner Fog's instruction tables,
    even the latest and greatest CPUs from AMD and Intel that he measured
    (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,

    I didn't measure on either TGL or Zen5, but both Raptor Cove and Zen3
    are certainly capable of more than 1 adcx|adox per cycle.

    Below are execution times of very heavily unrolled adcx/adox code with the dependency broken by a trick similar to the above:

    Platform RC GM SK Z3
    add3_my_adx_u17 244.5 471.1 482.4 407.0

    Considering that there are 2166 adcx/adox/adc instructions, we have
    following number of adcx/adox/adc instructions per clock:
    Platform RC GM SK Z3
    1.67 1.10 1.05 1.44

    For Gracemont and Skylake there exists a possibility of a small
    measurement mistake, but Raptor Cove appears to be capable of at least 2
    instructions of this type per clock, while Zen3 is capable of at least 1.5
    but more likely also 2.
    It looks to me that the bottleneck on both RC and Z3 is either the rename
    phase or, more likely, L1$ access. It seems that while Golden/Raptor Cove
    can occasionally issue 3 loads + 2 stores per clock, it cannot sustain
    more than 3 load-or-store accesses per clock.


    Code:

    .file "add3_my_adx_u17.s"
    .text
    .p2align 4
    .globl add3
    .def add3; .scl 2; .type 32; .endef
    .seh_proc add3
    add3:
    pushq %rsi
    .seh_pushreg %rsi
    pushq %rbx
    .seh_pushreg %rbx
    .seh_endprologue
    # %rcx - dst
    # %rdx - a
    # %r8 - b
    # %r9 - c
    sub %rdx, %rcx
    mov %rcx, %r10 # r10 = dst - a
    sub %rdx, %r8 # r8 = b - a
    sub %rdx, %r9 # r9 = c - a
    mov %rdx, %r11 # r11 = a
    mov $60, %edx
    xor %ecx, %ecx
    .p2align 4
    .loop:
    xor %ebx, %ebx # CF <= 0, OF <= 0, EBX <= 0
    mov (%r11), %rsi
    adcx (%r11,%r8), %rsi
    adox (%r11,%r9), %rsi

    mov 8(%r11), %rax
    adcx 8(%r11,%r8), %rax
    adox 8(%r11,%r9), %rax
    mov %rax, 8(%r10,%r11)

    mov 16(%r11), %rax
    adcx 16(%r11,%r8), %rax
    adox 16(%r11,%r9), %rax
    mov %rax, 16(%r10,%r11)

    mov 24(%r11), %rax
    adcx 24(%r11,%r8), %rax
    adox 24(%r11,%r9), %rax
    mov %rax, 24(%r10,%r11)

    mov 32(%r11), %rax
    adcx 32(%r11,%r8), %rax
    adox 32(%r11,%r9), %rax
    mov %rax, 32(%r10,%r11)

    mov 40(%r11), %rax
    adcx 40(%r11,%r8), %rax
    adox 40(%r11,%r9), %rax
    mov %rax, 40(%r10,%r11)

    mov 48(%r11), %rax
    adcx 48(%r11,%r8), %rax
    adox 48(%r11,%r9), %rax
    mov %rax, 48(%r10,%r11)

    mov 56(%r11), %rax
    adcx 56(%r11,%r8), %rax
    adox 56(%r11,%r9), %rax
    mov %rax, 56(%r10,%r11)

    mov 64(%r11), %rax
    adcx 64(%r11,%r8), %rax
    adox 64(%r11,%r9), %rax
    mov %rax, 64(%r10,%r11)

    mov 72(%r11), %rax
    adcx 72(%r11,%r8), %rax
    adox 72(%r11,%r9), %rax
    mov %rax, 72(%r10,%r11)

    mov 80(%r11), %rax
    adcx 80(%r11,%r8), %rax
    adox 80(%r11,%r9), %rax
    mov %rax, 80(%r10,%r11)

    mov 88(%r11), %rax
    adcx 88(%r11,%r8), %rax
    adox 88(%r11,%r9), %rax
    mov %rax, 88(%r10,%r11)

    mov 96(%r11), %rax
    adcx 96(%r11,%r8), %rax
    adox 96(%r11,%r9), %rax
    mov %rax, 96(%r10,%r11)

    mov 104(%r11), %rax
    adcx 104(%r11,%r8), %rax
    adox 104(%r11,%r9), %rax
    mov %rax, 104(%r10,%r11)

    mov 112(%r11), %rax
    adcx 112(%r11,%r8), %rax
    adox 112(%r11,%r9), %rax
    mov %rax, 112(%r10,%r11)

    mov 120(%r11), %rax
    adcx 120(%r11,%r8), %rax
    adox 120(%r11,%r9), %rax
    mov %rax, 120(%r10,%r11)

    lea 136(%r11), %r11

    mov -8(%r11), %rax
    adcx -8(%r11,%r8), %rax
    adox -8(%r11,%r9), %rax
    mov %rax, -8(%r10,%r11)

    mov %ebx, %eax # EAX <= 0
    adcx %ebx, %eax # EAX <= CF, CF <= 0
    adox %ebx, %eax # EAX <= CF+OF, OF <= 0

    add %rcx, %rsi
    jc .prop_carry
    .carry_done:
    mov %rsi, -136(%r10,%r11)
    mov %eax, %ecx
    dec %edx
    jnz .loop

    # last 3
    mov (%r11), %rax
    mov 8(%r11), %rdx
    mov 16(%r11), %rbx
    add (%r11,%r8), %rax
    adc 8(%r11,%r8), %rdx
    adc 16(%r11,%r8), %rbx
    add (%r11,%r9), %rax
    adc 8(%r11,%r9), %rdx
    adc 16(%r11,%r9), %rbx
    add %rcx, %rax
    adc $0, %rdx
    adc $0, %rbx
    mov %rax, (%r10,%r11)
    mov %rdx, 8(%r10,%r11)
    mov %rbx, 16(%r10,%r11)

    lea (-1020*8)(%r10,%r11), %rax
    popq %rbx
    popq %rsi
    ret

    .prop_carry:
    lea -128(%r10,%r11), %rbx
    xor %ecx, %ecx
    addq $1, (%rbx)
    adc %rcx, 8(%rbx)
    adc %rcx, 16(%rbx)
    adc %rcx, 24(%rbx)
    adc %rcx, 32(%rbx)
    adc %rcx, 40(%rbx)
    adc %rcx, 48(%rbx)
    adc %rcx, 56(%rbx)
    adc %rcx, 64(%rbx)
    adc %rcx, 72(%rbx)
    adc %rcx, 80(%rbx)
    adc %rcx, 88(%rbx)
    adc %rcx, 96(%rbx)
    adc %rcx,104(%rbx)
    adc %rcx,112(%rbx)
    adc %rcx,120(%rbx)
    adc %ecx, %eax
    jmp .carry_done
    .seh_endproc








    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Wed Aug 20 01:49:41 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> wrote:
    According to Anton Ertl <anton@mips.complang.tuwien.ac.at>:
    John Levine <johnl@taugh.com> writes:
    SSDs often let you do 512 byte reads and writes for backward compatibility even
    though the physical block size is much larger.

    Yes. But if the argument had any merit that 512B is a good page size because it avoids having to transfer 8, 16, or 32 sectors at a time,
    it would still have merit, because the interface still shows 512B
    sectors.

    I think we're agreeing that even in the early 1980s a 512 byte page was
    too small. They certainly couldn't have made it any smaller, but they
    should have made it larger.

    S/370 was a decade before that and its pages were 2K or 4K. The KI-10,
    the first PDP-10 with paging, had 2K pages in 1972. Its pager was based
    on BBN's add-on pager for TENEX, built in 1970 also with 2K pages.

    Several posts above I wrote:

    : I think that in 1979 VAX 512 bytes page was close to optimal.
    : Namely, IIUC smallest supported configuration was 128 KB RAM.
    : That gives 256 pages, enough for sophisticated system with
    : fine-grained access control.

    Note that the 360 had optional page protection used only for access
    control. In the 370 era they had a legacy of 2K or 4K pages, and
    AFAICS IBM was mainly aiming at bigger machines, so they
    were not so worried about fragmentation. PDP-11 experience
    possibly contributed to using smaller pages for the VAX.

    Microprocessors were designed with different constraints, which
    led to bigger pages. But the VAX apparently could afford a reasonably
    large TLB, and due to the VMS structure the gain was bigger than for
    other OSes.

    And a little correction: the VAX architecture handbook is dated 1977,
    so the decision about page size had to be made in 1977 at the latest,
    and possibly earlier.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Wed Aug 20 02:49:26 2025
    From Newsgroup: comp.arch

    According to Waldek Hebisch <antispam@fricas.org>:
    S/370 was a decade before that and its pages were 2K or 4K. The KI-10,
    the first PDP-10 with paging, had 2K pages in 1972. Its pager was based
    on BBN's add-on pager for TENEX, built in 1970 also with 2K pages.
    ...

    Note that 360 has optional page protection used only for access
    control. In 370 era they had legacy of 2k or 4k pages, and
    AFAICS IBM was mainly aiming at bigger machines, so they
    were not so worried about fragmentation.

    I don't think so. The smallest 370s were 370/115 with 64K to 192K of
    RAM, 370/125 with 96K to 256K, both with paging hardware and running
    DOS/VS. The 115 was shipped in 1973, the 125 in 1972.

    PDP-11 experience possibly contributed to using smaller pages for VAX.

    The PDP-11's pages were 8K, which was too big to be useful as pages, so
    we used them as a single block for swapping. When I was at Yale I did
    a hack that mapped the 32K display memory for a bitmap terminal into
    the high half of the process' data space, but that left too little room
    for regular data, so we addressed the display memory a different way that
    didn't use up address space.

    Microprocessors were designed with different constraints, which
    led to bigger pages. But VAX apparently could afford reasonably
    large TLB and due VMS structure gain was bigger than for other
    OS-es.

    I can only guess what their thinking was, but I can tell you that
    at the time the 512 byte pages seemed oddly small.

    And little correction: VAX architecture handbook is dated 1977,
    so actually decision about page size had to be made at least
    in 1977 and possibly earlier.

    The VAX design started in 1976, well after IBM had shipped those
    low end 370s with tiny memories and 2K pages.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Wed Aug 20 03:47:17 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    The basic question is if VAX could afford the pipeline.

    VAX 11/780 only performed instruction fetching concurrently with the
    rest (a two-stage pipeline, if you want). The 8600, 8700/8800 and
    NVAX applied more pipelining, but CPI remained high.

    VUPs MHz CPI Machine
    1 5 10 11/780
    4 12.5 6.25 8600
    6 22.2 7.4 8700
    35 90.9 5.1 NVAX+

    SPEC92 MHz VAX CPI Machine
    1/1 5 10/10 VAX 11/780
    133/200 200 3/2 Alpha 21064 (DEC 7000 model 610)

    VUPs and SPEC numbers from
    <https://pghardy.net/paul/programs/vms_cpus.html>.

    The 10 CPI (cycles per instruction) of the VAX 11/780 are anecdotal.
    The other CPIs are computed from VUP/SPEC and MHz numbers; all of that
    is probably somewhat off (due to the anecdotal base being off), but if
    you relate them to each other, the offness cancels itself out.

    Note that the NVAX+ was made in the same process as the 21064, the
    21064 has about twice the clock rate, and has 4-6 times the performance,
    resulting not just in a lower native CPI, but also in a lower "VAX
    CPI" (the CPI a VAX would have needed to achieve the same performance
    at this clock rate).

    Prism paper says the following about RISC versus VAX performance:

    : 1. Shorter cycle time. VAX chips have more, and longer, critical
    : paths than RISC chips. The worst VAX paths are the control store
    : loop and the variable length instruction decode loop, both of
    : which are absent in RISC chips.

    : 2. Fewer cycles per function. Although VAX chips require fewer
    : instructions than RISC chips (1:2.3) to implement a given
    : function, VAX instructions take so many more cycles than RISC
    : instructions (5-10:1-1.5) that VAX chips require many more cycles
    : per function than RISC chips.

    : 3. Increased pipelining. VAX chips have more inter- and
    : intra-instruction dependencies, architectural irregularities,
    : instruction formats, address modes, and ordering requirements
    : than RISC chips. This makes VAX chips harder and more
    : complicated to pipeline.

    Point 1 above to me means that VAX chips were microcoded. Point
    2 above suggests that there were limited changes compared to the VAX-780
    microcode.

    IIUC attempts to create better hardware for the VAX were canceled
    just after the PRISM memos, so later VAXes used essentially the same
    logic, just rescaled to a better process.

    I think that the VAX had a problem with hardware decoders because of gate
    delay: in 1987 a hardware decoder probably would have slowed down the clock.
    But the 1977 design looks quite relaxed to me: the main logic was Schottky
    TTL, which nominally has 3 ns of inverter delay. With a 200 ns cycle
    this means about 66 gate delays per cycle. And in critical paths the
    VAX used ECL. I do not know exactly which ECL, but AFAIK 2 ns ECL was
    commonly available in 1970 and 1 ns ECL was leading edge in 1970.

    That is why I think that in 1977 a hardware decoder could give a
    speedup, assuming that the execution units could keep up: the gate delay
    and cycle time mean that a rather deep circuit could fit within the
    cycle time. IIUC the 1987 designs were much more aggressive and the
    decoder delay probably could not fit within a single cycle.

    Quite possibly the hardware designers attempting VAX hardware
    decoders were too ambitious and wanted to decode too-complicated
    instructions in one cycle. AFAICS for instructions that cannot
    be executed in one cycle, decode can be slower than one
    cycle; all one needs is to recognize within one cycle
    that decode will take multiple cycles.

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware the
    things that later RISC designs put in hardware almost surely would
    have exceeded the allowed cost. Technically, at 1 million transistors one
    should be able to do an acceptable RISC, and IIUC the IBM 360/90 used
    about 1 million transistors in a less dense technology, so in 1977 it was
    possible to build a 1-million-transistor machine.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Aug 20 10:50:39 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Tue, 19 Aug 2025 05:47:01 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    One other problem is that according to Agner Fog's instruction tables,
    even the latest and greatest CPUs from AMD and Intel that he measured
    (Zen5 and Tiger Lake) can only execute one adc/adcx/adox per cycle,

    I didn't measure on either TGL or Zen5, but both Raptor Cove and Zen3
    are certainly capable of more than 1 adcx|adox per cycle.

    Below are Execution times of very heavily unrolled adcx/adox code with dependency broken by trick similar to above:

    Platform RC GM SK Z3
    add3_my_adx_u17 244.5 471.1 482.4 407.0

    Considering that there are 2166 adcx/adox/adc instructions, we have
    following number of adcx/adox/adc instructions per clock:
    Platform RC GM SK Z3
    1.67 1.10 1.05 1.44

    For Gracemont and Skylake there exists a possibility of small
    measurement mistake, but Raptor Cove appears to be capable of at least 2 instructions of this type per clock while Zen3 capable of at least 1.5
    but more likely also 2.
    It looks to me that the bottleneck on both RC and Z3 is either the rename
    phase or, more likely, L1$ access. It seems that while Golden/Raptor Cove
    can occasionally issue 3 loads + 2 stores per clock, it cannot sustain
    more than 3 load-or-store accesses per clock


    Code:

    .file "add3_my_adx_u17.s"
    .text
    .p2align 4
    .globl add3
    .def add3; .scl 2; .type 32; .endef
    .seh_proc add3
    add3:
    pushq %rsi
    .seh_pushreg %rsi
    pushq %rbx
    .seh_pushreg %rbx
    .seh_endprologue
    # %rcx - dst
    # %rdx - a
    # %r8 - b
    # %r9 - c
    sub %rdx, %rcx
    mov %rcx, %r10 # r10 = dst - a
    sub %rdx, %r8 # r8 = b - a
    sub %rdx, %r9 # r9 = c - a
    mov %rdx, %r11 # r11 - a
    mov $60, %edx
    xor %ecx, %ecx
    .p2align 4
    .loop:
    xor %ebx, %ebx # CF <= 0, OF <= 0, EBX <= 0
    mov (%r11), %rsi
    adcx (%r11,%r8), %rsi
    adox (%r11,%r9), %rsi

    mov 8(%r11), %rax
    adcx 8(%r11,%r8), %rax
    adox 8(%r11,%r9), %rax
    mov %rax, 8(%r10,%r11)

    [snipped the rest]


    Very impressive Michael!

    I particularly like how you are interleaving ADOX and ADCX to gain two
    carry bits without having to save them off to an additional register.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Aug 20 14:16:55 2025
    From Newsgroup: comp.arch

    On Wed, 20 Aug 2025 10:50:39 +0200
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:



    Very impressive Michael!

    I particularly like how you are interleaving ADOX and ADCX to gain
    two carry bits without having to save them off to an additional
    register.

    Terje


    It is interesting as an exercise in ADX extension programming, but in
    practice it is only 0-10% faster than the much simpler and smaller code
    presented in the other post, which uses no ISA extensions and so runs on
    every AMD64 CPU since K8.
    I suspect that this result is quite representative of the gains that
    can be achieved with ADX. Maybe, if there is a crypto requirement
    of independence of execution time from inputs, the gain would be
    somewhat bigger, but even there I would be very surprised to find a 1.5x
    gain.
    Overall, I think that the time spent by Intel engineers on the invention
    of ADX could have been spent much better.


    Going back to the task of 3-way addition, another approach that can
    utilize the same idea of breaking the data dependency is using SIMD.
    In the case of the 4 cores that I tested, SIMD means AVX2.
    Here are the results of an AVX2 implementation that unrolls by two, i.e.
    512 output bits per iteration of the inner loop.

    Platform RC GM SK Z3
    add3_avxq_u2 226.7 823.3 321.1 309.5

    The speed is about equal to the more unrolled ADX variant on RC, faster on
    Z3, much faster on SK, and much slower on GM. Unlike ADX, it runs on
    Intel Haswell and on a few pre-Zen AMD CPUs.

    .file "add3_avxq_u2.s"
    .text
    .p2align 4
    .globl add3
    .def add3; .scl 2; .type 32; .endef
    .seh_proc add3
    add3:
    subq $56, %rsp
    .seh_stackalloc 56
    vmovups %xmm6, 32(%rsp)
    .seh_savexmm %xmm6, 32
    .seh_endprologue
    # %rcx - dst
    # %rdx - a
    # %r8 - b
    # %r9 - c
    sub %rcx, %rdx # %rdx - a-dst
    sub %rcx, %r8 # %r8 - b-dst
    sub %rcx, %r9 # %r9 - c-dst
    vpcmpeqq %ymm6, %ymm6, %ymm6
    vpsllq $63, %ymm6, %ymm6 # ymm6[0:3] = msbit = 2**63
    vpxor %xmm5, %xmm5, %xmm5 # ymm5[0] = carry = 0
    mov $127, %eax
    .loop:
    vpxor (%rdx,%rcx), %ymm6, %ymm0
    # ymm0[0:3] = iA[0:3] = a[0:3] - msbit
    vpxor 32(%rdx,%rcx), %ymm6, %ymm1
    # ymm1[0:3] = iA[4:7] = a[4:7] - msbit
    vpaddq (%r8, %rcx), %ymm0, %ymm2
    # ymm2[0:3] = iSum1[0:3] = iA[0:3]+b[0:3]
    vpaddq 32(%r8, %rcx), %ymm1, %ymm3
    # ymm3[0:3] = iSum1[4:7] = iA[4:7] + b[4:7]
    vpcmpgtq %ymm2, %ymm0, %ymm4
    # ymm4[0:3] = c1[0:3] = iA[0:3] > iSum1[0:3]
    vpaddq (%r9, %rcx), %ymm2, %ymm0
    # ymm0[0:3] = iSum2[0:3] = iSum1[0:3]+c[0:3]
    vpcmpgtq %ymm0, %ymm2, %ymm2
    # ymm2[0:3] = c2[0:3] = iSum1[0:3] > iSum2[0:3]
    vpaddq %ymm4, %ymm2, %ymm2
    # ymm2[0:3] = cSum0[0:3] = c1[0:3]+c2[0:3]
    vpcmpgtq %ymm3, %ymm1, %ymm4
    # ymm4[0:3] = c1[4:7] = iA[4:7] > iSum1[4:7]
    vpaddq 32(%r9, %rcx), %ymm3, %ymm1
    # ymm1[0:3] = iSum2[4:7] = iSum1[4:7] + c[4:7]
    vpcmpgtq %ymm1, %ymm3, %ymm3
    # ymm3[0:3] = c2[4:7] = iSum1[4:7] > iSum2[4:7]
    vpaddq %ymm4, %ymm3, %ymm3
    # ymm3[0:3] = cSum0[4:7] = c1[4:7] + c2[4:7]
    vpermq $0x93, %ymm2, %ymm4
    # ymm4[0:3] = cSum0[3,0:2]
    vpblendd $3, %ymm5, %ymm4, %ymm2
    # ymm2[0:3] = cSum[0:3] = { carry[0], cSum0[0,1,2] }
    vpermq $0x93, %ymm3, %ymm5
    # ymm5[0:3] = cSum0[7,4:6] == carry
    vpblendd $3, %ymm4, %ymm5, %ymm3
    # ymm3[0:3] = cSum[4:7] = { cSum0[3], cSum0[4:6] }
    .add_carry:
    vpsubq %ymm2, %ymm0, %ymm2
    # ymm2[0:3] = iSum3[0:3] = iSum2[0:3] - cSum[0:3]
    vpsubq %ymm3, %ymm1, %ymm3
    # ymm3[0:3] = iSum3[4:7] = iSum2[4:7] - cSum[4:7]
    vpcmpgtq %ymm2, %ymm0, %ymm0
    # ymm0[0:3] = c3[0:3] = iSum2[0:3] > iSum3[0:3]
    vpcmpgtq %ymm3, %ymm1, %ymm1
    # ymm3[0:3] = c3[4:7] = iSum2[4:7] > iSum3[4:7]
    vpor %ymm0, %ymm1, %ymm4
    vptest %ymm4, %ymm4
    jne .prop_carry
    vpxor %ymm2, %ymm6, %ymm0
    # ymm0[0:3] = uSum3[0:3] = iSum3[0:3] + msbit
    vpxor %ymm3, %ymm6, %ymm1
    # ymm1[4:7] = uSum3[4:7] = iSum3[4:7] + msbit
    vmovdqu %ymm0, (%rcx)
    vmovdqu %ymm1, 32(%rcx)
    addq $64, %rcx
    dec %eax
    jnz .loop

    # last 7
    vpxor (%rdx,%rcx), %ymm6, %ymm0
    # ymm0[0:3] = iA[0:3] = a[0:3] - msbit
    vpxor 24(%rdx,%rcx), %ymm6, %ymm1
    # ymm1[0:3] = iA[3:6] = a[3:6] - msbit
    vpaddq (%r8, %rcx), %ymm0, %ymm2
    # ymm2[0:3] = iSum1[0:3] = iA[0:3]+b[0:3]
    vpaddq 24(%r8, %rcx), %ymm1, %ymm3
    # ymm3[0:3] = iSum1[3:6] = iA[3:6] + b[3:6]
    vpcmpgtq %ymm2, %ymm0, %ymm4
    # ymm4[0:3] = c1[0:3] = iA[0:3] > iSum1[0:3]
    vpaddq (%r9, %rcx), %ymm2, %ymm0
    # ymm0[0:3] = iSum2[0:3] = iSum1[0:3]+c[0:3]
    vpcmpgtq %ymm0, %ymm2, %ymm2
    # ymm2[0:3] = c2[0:3] = iSum1[0:3] > iSum2[0:3]
    vpaddq %ymm4, %ymm2, %ymm2
    # ymm2[0:3] = cSum0[0:3] = c1[0:3]+c2[0:3]
    vpcmpgtq %ymm3, %ymm1, %ymm4
    # ymm4[0:3] = c1[3:6] = iA[3:6] > iSum1[3:6]
    vpaddq 24(%r9, %rcx), %ymm3, %ymm1
    # ymm1[0:3] = iSum2[3:6] = iSum1[3:6] + c[3:6]
    vpcmpgtq %ymm1, %ymm3, %ymm3
    # ymm3[0:3] = c2[3:6] = iSum1[3:6] > iSum2[3:6]
    vpaddq %ymm4, %ymm3, %ymm3
    # ymm3[0:3] = cSum[4:7] = cSum0[3:6] = c1[3:6] + c2[3:6]
    vpermq $0x93, %ymm2, %ymm4
    # ymm4[0:3] = cSum0[3,0,1,2]
    vpblendd $3, %ymm5, %ymm4, %ymm2
    # ymm2[0:3] = cSum[0:3] = { carry[0], cSum0[0,1,2] }
    vpermq $0xF9, %ymm1, %ymm1
    # ymm1[0:3] = iSum2[4:6,6]
    .add_carry2:
    vpsubq %ymm2, %ymm0, %ymm2
    # ymm2[0:3] = iSum3[0:3] = iSum2[0:3] - cSum[0:3]
    vpsubq %ymm3, %ymm1, %ymm3
    # ymm3[0:3] = iSum3[4:7] = iSum2[4:7] - cSum[4:7]
    vpcmpgtq %ymm2, %ymm0, %ymm0
    # ymm0[0:3] = c3[0:3] = iSum2[0:3] > iSum3[0:3]
    vpcmpgtq %ymm3, %ymm1, %ymm1
    # ymm1[0:3] = c3[4:7] = iSum2[4:7] > iSum3[4:7]
    vptest %ymm0, %ymm0
    jne .prop_carry2
    vptest %ymm1, %ymm1
    jne .prop_carry2
    vpxor %ymm2, %ymm6, %ymm0
    # ymm0[0:3] = uSum3[0:3] = iSum3[0:3] + msbit
    vpxor %ymm3, %ymm6, %ymm1
    # ymm1[4:7] = uSum3[4:7] = iSum3[4:7] + msbit
    vmovdqu %ymm0, (%rcx)
    vmovdqu %xmm1, 32(%rcx)
    vextractf128 $1, %ymm1, %xmm1
    vmovq %xmm1, 48(%rcx)

    lea -(127*64)(%rcx), %rax
    vzeroupper
    vmovups 32(%rsp), %xmm6
    addq $56, %rsp
    ret

    .prop_carry:
    # input:
    # ymm0[0:3] = c3[0:3]
    # ymm1[0:3] = c3[4:7]
    # ymm2[0:3] = iSum3[0:3]
    # ymm3[0:3] = iSum3[4:7]
    # ymm5[0] = carry
    # output:
    # ymm0[0:3] = iSum2[0:3]
    # ymm1[0:3] = iSum2[4:7]
    # ymm2[0:3] = cSum [0:3]
    # ymm3[0:3] = cSum [4:7]
    # ymm5[0] = carry
    # scratch: ymm4
    vpermq $0x93, %ymm0, %ymm4
    # ymm4[0:3] = c3[3,0,1,2]
    vmovdqa %ymm2, %ymm0
    # ymm0[0:3] = iSum2[0:3] = iSum3[0:3]
    vpermq $0x93, %ymm1, %ymm2
    # ymm2[0:3] = c3[7,4,5,6]
    vpaddq %xmm2, %xmm5, %xmm5
    # ymm5[0] = carry += c3[7]
    vmovdqa %ymm3, %ymm1
    # ymm1[0:3] = iSum2[4:7] = iSum3[4:7]
    vpblendd $3, %ymm4, %ymm2, %ymm3
    # ymm3[0:3] = cSum[4:7] = { c3[3], c3[4,5,6] }
    vpxor %xmm2, %xmm2, %xmm2
    # ymm2[0:3] = 0
    vpblendd $3, %ymm2, %ymm4, %ymm2
    # ymm2[0:3] = cSum[0:3] = { 0, c3[0,1,2] }
    jmp .add_carry

    .prop_carry2:
    # input:
    # ymm0[0:3] = c3[0:3]
    # ymm1[0:3] = c3[4:7]
    # ymm2[0:3] = iSum3[0:3]
    # ymm3[0:3] = iSum3[4:7]
    # ymm5[0] = carry
    # output:
    # ymm0[0:3] = iSum2[0:3]
    # ymm1[0:3] = iSum2[4:7]
    # ymm2[0:3] = cSum [0:3]
    # ymm3[0:3] = cSum [4:7]
    # ymm5[0] = carry
    # scratch: ymm4
    vpermq $0x93, %ymm0, %ymm4
    # ymm4[0:3] = c3[3,0,1,2]
    vmovdqa %ymm2, %ymm0
    # ymm0[0:3] = iSum2[0:3] = iSum3[0:3]
    vpermq $0x93, %ymm1, %ymm2
    # ymm2[0:3] = c3[7,4,5,6]
    vmovdqa %ymm3, %ymm1
    # ymm1[0:3] = iSum2[4:7] = iSum3[4:7]
    vpblendd $3, %ymm4, %ymm2, %ymm3
    # ymm3[0:3] = cSum[4:7] = { c3[3], c3[4,5,6] }
    vpxor %xmm2, %xmm2, %xmm2
    # ymm2[0:3] = 0
    vpblendd $3, %ymm2, %ymm4, %ymm2
    # ymm2[0:3] = cSum[0:3] = { 0, c3[0,1,2] }
    jmp .add_carry2

    .seh_endproc

    AVX2 is rather poorly suited for this task - it lacks unsigned
    comparison instructions, so the first input should be shifted by
    half-range at the beginning and the result should be shifted back.
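
    For readers who have not seen the trick: it works because vpcmpgtq is a
    signed compare, and flipping the top bit of both operands turns an
    unsigned comparison into a signed one. A scalar C illustration of the
    identity (purely illustrative, not part of the posted code):

    #include <stdint.h>
    #include <assert.h>

    /* unsigned a > b  <=>  signed (a ^ 2^63) > (b ^ 2^63);
       this is why the inputs are XORed with the MSB (ymm6 above)
       and the results XORed back before storing. */
    static int ugt_via_signed(uint64_t a, uint64_t b)
    {
        const uint64_t msbit = UINT64_C(1) << 63;
        return (int64_t)(a ^ msbit) > (int64_t)(b ^ msbit);
    }

    int main(void)
    {
        assert(ugt_via_signed(UINT64_MAX, 1) == 1);
        assert(ugt_via_signed(1, UINT64_MAX) == 0);
        assert(ugt_via_signed(5, 5) == 0);
        return 0;
    }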

    AVX-512 can be more suitable. But the only AVX-512 capable CPU that I
    have access to is a mini PC with a cheap and slow Core i3 used by family
    members almost exclusively for viewing movies. It does not even have
    a minimal programming environment installed.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Aug 20 14:08:34 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    Overall, I think that time spent by Intel engineers on invention of ADX
    could have been spent much better.

    The whitepapers about ADX are about long multiplication and squaring
    (for cryptographic uses), and they are from 2012, and ADX was first
    implemented in Broadwell (2014), when microarchitectures were quite a
    bit narrower than the recent ones.

    If you implement the classical long multiplication algorithm, but add
    each line to the intermediate sum as you create it, you need an
    operation like

    intermediateresult += multiplicand*multiplicator[i]

    where all parts are multi-precision numbers, but only one word of the multiplicator is used. The whole long multiplication would look
    somewhat like:

    intermediateresult=0;
    for (i=0; i<n; i++) {
    intermediateresult += multiplicand*multiplicator[i];
    shift intermediate result by one word; /* actually, you will access it at */
    /* an offset, but how to express this in this pseudocode? */ }

    The operation for a single line can be implemented as:

    carry=0;
    for (j=0; j<m; j++) {
      uint128_t d = intermediateresult[j] +
                    multiplicand[j]*(uint128_t)multiplicator[i] +
                    (uint128_t)carry;
      intermediateresult[j] = d; /* low word */
      carry = d >> 64;
    }
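
    Spelled out as compilable C (a sketch of the pseudocode above, assuming a
    compiler with the unsigned __int128 extension; the function name and
    parameter names are mine):

    #include <stdint.h>
    #include <stddef.h>

    /* One "line" of the long multiplication:
       intermediateresult += multiplicand * multiplicator_i,
       where intermediateresult and multiplicand are m-word numbers and
       multiplicator_i is a single 64-bit word; returns the carry-out word. */
    static uint64_t mul_line(uint64_t *intermediateresult,
                             const uint64_t *multiplicand,
                             uint64_t multiplicator_i, size_t m)
    {
        uint64_t carry = 0;
        for (size_t j = 0; j < m; j++) {
            unsigned __int128 d = (unsigned __int128)intermediateresult[j]
                + (unsigned __int128)multiplicand[j] * multiplicator_i
                + carry;
            intermediateresult[j] = (uint64_t)d;  /* low word */
            carry = (uint64_t)(d >> 64);          /* high word */
        }
        return carry;
    }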

    The computation of d (both words) can be written on AMD64 as:

    #multiplicator[i] in rax
    mov multiplicator[i], rax
    mulq multiplicand[j]
    addq intermediateresult[j], rax
    adcq $0, rdx
    addq carry, rax
    adcq $0, rdx
    mov rdx, carry

    With ADX and BMI2, this can be coded as:

    #carry is represented as carry1+C+O
    mulx ma, m, carry2
    adcx mb, m
    adox carry1, m
    #carry is represented as carry2+C+O
    #unroll by an even factor, and switch roles for carry1 and carry2

    Does it matter? We can apply the usual blocking techniques to
    perform, say, a 4x4 submultiplication in the registers (probably even
    a little larger, but let's stick with these numbers). That's 16
    mulx/adox/adcx combinations, loads of 4+4 words of inputs and stores
    of 8 words of output. mulx is probably limited to one per cycle, but
    if we want to utilize this on a 4-wide machine like the Broadwell, we
    must have at most 3 additional instructions per mulx; with ADX, one
    additional instruction is adcx, another adox, and the third is either
    a load or a store. Any additional overhead, and the code will be
    limited by resources other than the multiplier.

    On today's CPUs, we can reach the 1 mul/cycle limit with the x86-64-v1
    code shown before the ADX code. But then, they might put a second
    multiplier in, and we would profit from ADX again.

    But Intel seems to have second thoughts on ADX itself. ADX has not
    been included in x86-64-v4, despite the fact that every CPU that
    supports the other extensions of x86-64-v4 also supports ADX. And the whitepapers have vanished from Intel's web pages. Some time ago I still
    found it on https://www.intel.cn/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf
    (i.e., Intel China), but now it's gone there, too. I can still find
    it on <https://raw.githubusercontent.com/wiki/intel/intel-ipsec-mb/doc/ia-large-integer-arithmetic-paper.pdf>

    There is another whitepaper on using ADX for squaring numbers, but I
    cannot find that. Looking around what Erdinç Öztürk (aka Erdinc
    Ozturk) has also written, there's a patent "SIMD integer
    multiply-accumulate instruction for multi-precision arithmetic" from
    2016, so maybe Intel's thoughts are now into doing it with SIMD
    instead of with scalar instructions.

    Still, why deemphasize ADX? Do they want to drop it eventually? Why?
    They have to support separate renaming of C, O, and the other three
    because of instructions that go much farther back. The only way would
    be to provide alternatives to these instructions, and then deemphasize
    them over time, and eventually rename all flags together (and the old instructions may then perform slowly).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 20 14:36:43 2025
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.
    the same problem.

    Not quite.
    My idea was to have two HW threads HT1 and HT2 which are like x86 HW
    threads except when HT1 gets a TLB miss it stalls its execution and
    injects the TLB miss handler at the front of HT2 pipeline,
    and a HT2 TLB miss stalls itself and injects its handler into HT1.
    The TLB miss handler never itself TLB misses as it explicitly checks
    the TLB for any VA it needs to translate so recursion is not possible.

    As the handler is injected at the front of the pipeline no drain occurs.
    The only possible problem is if, after HT1 injects its miss handler
    into HT2, HT2's existing pipeline code then also takes a TLB miss.
    As this would cause a deadlock, if this occurs the core detects it
    and both HTs fault and run their TLB miss handlers themselves.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.
    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want
    atomicity there.

    To avoid race conditions with software clearing those bits, presumably.

    ARM64 originally didn't support hardware updates in V8.0; they were independent hardware features added in V8.1.

    Yes. A memory recycler can periodically clear the Accessed bit
    so it can detect page usage, and that might be a different core.
    But it might skip sending TLB shootdowns to all other cores
    to lower the overhead (maybe a lazy usage detector).


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 20 16:41:39 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    There were a number of proposals around then, the paper I linked to
    also suggested injecting the miss routine into the ROB.
    My idea back then was a HW thread.

    All of these are attempts to fix inherent drawbacks and limitations
    in the SW-miss approach, and all of them run counter to the only
    advantage SW-miss had: its simplicity.

    Another advantage is the flexibility: you can implement any
    translation scheme you want: hierarchical page tables, inverted page
    tables, search trees, .... However, given that hierarchical page
    tables have won, this is no longer an advantage anyone cares for.

    The SW approach is inherently synchronous and serial -
    it can only handle one TLB miss at a time, one PTE read at a time.

    On an OoO engine, I don't see that. The table walker software is
    called in its special context and the instructions in the table walker
    are then run through the front end and the OoO engine. Another table
    walk could be started at any time (even when the first table walk has
    not yet finished feeding its instructions to the front end), and once
    inside the OoO engine, the execution is OoO and concurrent anyway. It
    would be useful to avoid two searches for the same page at the same
    time, but hardware walkers have the same problem.

    Hmmm... I don't think that is possible, or if it is then it's really hairy.
    The miss handler needs to LD the memory PTE's, which can happen OoO.
    But it also needs to do things like writing control registers
    (e.g. the TLB) or setting the Accessed or Dirty bits on the in-memory PTE, things that usually only occur at retire. But those handler instructions
    can't get to retire because the older instructions that triggered the
    miss are stalled.

    The miss handler needs general registers so it needs to
    stash the current content someplace and it can't use memory.
    Then add a nested miss handler on top of that.

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.

    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want atomicity there.

    As Scott said, to avoid race conditions with software clearing those bits.
    Plus there might be PTE modifications that an OS could perform on other
    PTE fields concurrently without first acquiring the normal mutexes
    and doing a TLB shoot down of the PTE on all the other cores,
    provided they are done atomically so the updates of one core
    don't clobber the changes of another.

    Each PTE read can cache miss and stall that walker.
    As most OoO caches support multiple pending misses and hit-under-miss,
    you can create as many HW walkers as you can afford.

    Which poses the question: is it cheaper to implement n table walkers,
    or to add some resources and mechanism that allows doing SW table
    walks until the OoO engine runs out of resources, and a recovery
    mechanism in that case.

    A HW walker looks simple to me.
    It has a few bits of state and a couple of registers.
    It needs to detect memory read errors if they occur and abort.
    Otherwise it checks each TLB level in backwards order using the
    appropriate VA bits, and if it gets a hit walks back down the tree
    reading PTE's for each level and adding them to their level TLB,
    checking it is marked present, and performing an atomic OR to set
    the Accessed and Dirty flags if they are clear.

    The HW walker is even simpler if the atomic OR is implemented directly
    in the cache controller as part of the Atomic Fetch And OP series.

    I see other performance and conceptual disadvantages for the envisioned
    SW walkers, however:

    1) The SW walker is inserted at the front end and there may be many
    ready instructions ahead of it before the instructions of the SW
    walker get their turn. By contrast, a hardware walker sits in the
    load/store unit and can do its own loads and stores with priority over
    the program-level loads and stores. However, it's not clear that
    giving priority to table walking is really a performance advantage.

    2) Some decisions will have to be implemented as branches, resulting
    in branch misses, which cost time and lead to all kinds of complexity
    if you want to avoid resetting the whole pipeline (which is the normal reaction to a branch misprediction).

    3) The reorder buffer processes instructions in architectural order.
    If the table walker's instructions get their sequence numbers from
    where they are inserted into the instruction stream, they will not
    retire until after the memory access that waits for the table walker
    is retired. Deadlock!

    It may be possible to solve these problems (your idea of doing it with something like hardware threads may point in the right direction), but
    it's probably easier to stay with hardware walkers.

    - anton

    Yes, and it seems to me that one would spend a lot more time trying to
    fix the SW walker than doing the simple HW walker that just works.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Aug 20 19:17:01 2025
    From Newsgroup: comp.arch

    BGB wrote:
    On 8/17/2025 12:35 PM, EricP wrote:

    The question is whether in 1975 main memory is so expensive that
    we cannot afford the wasted space of a fixed 32-bit ISA.
    In 1975 the widely available DRAM was the Intel 1103 1k*1b.
    The 4kb drams were just making it to customers; 16kb were preliminary.

    Looking at the instruction set usage of VAX in

    Measurement and Analysis of Instruction Use in VAX 780, 1982
    https://dl.acm.org/doi/pdf/10.1145/1067649.801709

    we see that the top 25 instructions cover about 80-90% of the usage,
    and many of them would fit into 2 or 3 bytes.
    A fixed 32-bit instruction would waste 1 to 2 bytes on most instructions.

    But a fixed 32-bit instruction is very much easier to fetch and
    decode, and needs a lot less logic for shifting prefetch buffers,
    compared to, say, variable lengths of 1 to 12 bytes.


    When code density is the goal, a 16/32 RISC can do well.

    Can note:
    Maximizing code density often prefers fewer registers;
    For 16-bit instructions, 8 or 16 registers is good;
    8 is rather limiting;
    32 registers uses too many bits.

    I'm assuming 16 32-bit registers, plus a separate RIP.
    The 74172 is a single chip 3 port 16*2b register file, 1R,1W,1RW.
    With just 16 registers there would be no zero register.

    The 4-bit register allows many 2-byte accumulate style instructions
    (where a register is both source and dest)
    8-bit opcode plus two 4-bit registers,
    or a 12-bit opcode, one 4-bit register, and an immediate 1-8 bytes.

    A flags register allows 2-byte short conditional branch instructions,
    8-bit opcode and 8-bit offset. With no flags register the shortest
    conditional branch would be 3 bytes as it needs a register specifier.

    If one is doing variable byte-length instructions, then the
    highest-frequency instructions can be made as compact as possible.
    E.g. an ADD with a 32-bit immediate in 6 bytes.
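
    To make those byte-granular formats concrete, here is a rough decoder
    sketch in C for the three layouts just mentioned (2-byte accumulate op,
    2-byte flag-based branch, 6-byte ADD with a 32-bit immediate); the
    opcode values and bit assignments are invented purely for illustration:

    /* Decode sketch: 8-bit opcode + two 4-bit regs, 8-bit opcode + 8-bit
       offset, and a 12-bit opcode + 4-bit reg + 32-bit immediate.
       Opcode numbers and field order are made up. */
    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    enum { OP_ADD_ACC = 0x10, OP_BT = 0x20, OP12_ADD_I32 = 0x301 };

    /* Returns the instruction length in bytes and fills in a disassembly. */
    static int decode(const uint8_t *p, char *desc, size_t n)
    {
        if (p[0] == OP_ADD_ACC) {            /* 2 bytes: rd += rs          */
            snprintf(desc, n, "ADD  r%d, r%d", p[1] >> 4, p[1] & 0xf);
            return 2;
        }
        if (p[0] == OP_BT) {                 /* 2 bytes: op + signed disp8 */
            snprintf(desc, n, "BT   %+d", (int8_t)p[1]);
            return 2;
        }
        if (((p[0] << 4) | (p[1] >> 4)) == OP12_ADD_I32) {  /* 6 bytes     */
            uint32_t imm = (uint32_t)p[2]       | (uint32_t)p[3] << 8 |
                           (uint32_t)p[4] << 16 | (uint32_t)p[5] << 24;
            snprintf(desc, n, "ADD  r%d, #0x%08x", p[1] & 0xf, (unsigned)imm);
            return 6;
        }
        snprintf(desc, n, "???");
        return 1;
    }

    int main(void)
    {
        /* ADD r3,r7 ; BT +16 ; ADD r5,#0x12345678 (10 bytes total) */
        const uint8_t code[] = { 0x10, 0x37, 0x20, 0x10,
                                 0x30, 0x15, 0x78, 0x56, 0x34, 0x12 };
        char buf[40];
        for (size_t i = 0; i < sizeof code; ) {
            i += decode(code + i, buf, sizeof buf);
            puts(buf);
        }
        return 0;
    }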

    Can note ISAs with 16 bit encodings:
    PDP-11: 8 registers
    M68K : 2x 8 (A and D)
    MSP430: 16
    Thumb : 8|16
    RV-C : 8|32
    SuperH: 16
    XG1 : 16|32 (Mostly 16)

    The saving for fixed 32-bit instructions is that it only needs to
    prefetch aligned 4 bytes ahead of the current instruction to maintain
    1 decode per clock.

    With variable length instructions from 1 to 12 bytes it could need
    a 16 byte fetch buffer to maintain that decode rate.
    And a 16 byte variable shifter (collapsing buffer) is much more logic.

    I was thinking the variable instruction buffer shifter could be built
    from tri-state buffers in a cross-bar rather than muxes.

    The difference for supporting variable aligned 16-bit instructions and
    byte aligned is that bytes doubles the number of tri-state buffers.
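
    A software model of that bookkeeping might look like the sketch below;
    the helper names are invented, and the memmove stands in for the wide
    collapsing byte shifter (or tri-state crossbar) that is the expensive
    part in hardware:

    /* Model of a 16-byte fetch window for a 1..12-byte variable-length ISA:
       decode one instruction per "cycle", shift the consumed bytes out,
       top the window back up.  All helpers are illustrative. */
    #include <stdint.h>
    #include <string.h>

    #define WINDOW 16

    struct fetch_buf {
        uint8_t  bytes[WINDOW];
        unsigned valid;          /* bytes currently held in the window */
        uint64_t fetch_pc;       /* next address to fetch from         */
    };

    /* may deliver fewer than n bytes, e.g. on an I-cache miss */
    extern unsigned imem_fetch(uint64_t pc, uint8_t *dst, unsigned n);
    extern unsigned insn_length(const uint8_t *b);  /* 1..12, from 1st bytes */
    extern void     dispatch(const uint8_t *b, unsigned len);

    /* One decode slot per cycle, as long as a whole instruction is present. */
    void fetch_cycle(struct fetch_buf *fb)
    {
        /* refill; in HW an aligned block fetch runs in parallel with decode */
        if (fb->valid < WINDOW) {
            unsigned got = imem_fetch(fb->fetch_pc, fb->bytes + fb->valid,
                                      WINDOW - fb->valid);
            fb->fetch_pc += got;
            fb->valid    += got;
        }
        if (fb->valid == 0)
            return;                             /* nothing buffered yet    */

        unsigned len = insn_length(fb->bytes);
        if (len > fb->valid)
            return;                             /* stall until refilled    */
        dispatch(fb->bytes, len);

        /* the "collapsing" shift: bring the next instruction to slot 0 */
        memmove(fb->bytes, fb->bytes + len, fb->valid - len);
        fb->valid -= len;
    }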

    In my recent fiddling for trying to design a pair encoding for XG3, can
    note the top-used instructions are mostly, it seems (non Ld/St):
    ADD Rs, 0, Rd //MOV Rs, Rd
    ADD X0, Imm, Rd //MOV Imm, Rd
    ADDW Rs, 0, Rd //EXTS.L Rs, Rd
    ADDW Rd, Imm, Rd //ADDW Imm, Rd
    ADD Rd, Imm, Rd //ADD Imm, Rd

    Followed by:
    ADDWU Rs, 0, Rd //EXTU.L Rs, Rd
    ADDWU Rd, Imm, Rd //ADDWu Imm, Rd
    ADDW Rd, Rs, Rd //ADDW Rs, Rd
    ADD Rd, Rs, Rd //ADD Rs, Rd
    ADDWU Rd, Rs, Rd //ADDWU Rs, Rd

    Most every other ALU instruction and usage pattern either follows a bit further behind or could not be expressed in a 16-bit op.

    For Load/Store:
    SD Rn, Disp(SP)
    LD Rn, Disp(SP)
    LW Rn, Disp(SP)
    SW Rn, Disp(SP)

    LD Rn, Disp(Rm)
    LW Rn, Disp(Rm)
    SD Rn, Disp(Rm)
    SW Rn, Disp(Rm)


    For registers, there is a split:
    Leaf functions:
    R10..R17, R28..R31 dominate.
    Non-Leaf functions:
    R10, R18..R27, R8/R9

    For 3-bit configurations:
    R8..R15 Reg3A
    R18/R19, R20/R21, R26/R27, R10/R11 Reg3B

    Reg3B was a bit hacky, but had similar hit rates while using less encoding space than a 4-bit R8..R23 (saving 1 bit in the relevant scenarios).



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Aug 20 23:50:52 2025
    From Newsgroup: comp.arch

    On 8/20/2025 6:17 PM, EricP wrote:
    BGB wrote:
    On 8/17/2025 12:35 PM, EricP wrote:

    The question is whether in 1975 main memory is so expensive that
    we cannot afford the wasted space of a fixed 32-bit ISA.
    In 1975 the widely available DRAM was the Intel 1103 1k*1b.
    The 4kb drams were just making it to customers; 16kb were preliminary.

    Looking at the instruction set usage of VAX in

    Measurement and Analysis of Instruction Use in VAX 780, 1982
    https://dl.acm.org/doi/pdf/10.1145/1067649.801709

    we see that the top 25 instructions cover about 80-90% of the usage,
    and many of them would fit into 2 or 3 bytes.
    A fixed 32-bit instruction would waste 1 to 2 bytes on most
    instructions.

    But a fixed 32-bit instruction is very much easier to fetch and
    decode, and needs a lot less logic for shifting prefetch buffers,
    compared to, say, variable lengths of 1 to 12 bytes.


    When code density is the goal, a 16/32 RISC can do well.

    Can note:
      Maximizing code density often prefers fewer registers;
      For 16-bit instructions, 8 or 16 registers is good;
      8 is rather limiting;
      32 registers uses too many bits.

    I'm assuming 16 32-bit registers, plus a separate RIP.
    The 74172 is a single chip 3 port 16*2b register file, 1R,1W,1RW.
    With just 16 registers there would be no zero register.

    The 4-bit register allows many 2-byte accumulate style instructions
    (where a register is both source and dest)
    8-bit opcode plus two 4-bit registers,
    or a 12-bit opcode, one 4-bit register, and an immediate 1-8 bytes.


    Yeah.

    SuperH had:
    ZZZZnnnnmmmmZZZZ //2R
    ZZZZnnnniiiiiiii //2RI (Imm8)
    ZZZZnnnnZZZZZZZZ //1R


    For BJX2/XG1, had went with:
    ZZZZZZZZnnnnmmmm
    But, in retrospect, this layout was inferior to the one SuperH had used
    (and I would almost have been better off just doing a clean-up of the SH
    encoding scheme rather than moving the bits around).
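
    For comparison, field extraction for the two 16-bit layouts, purely as
    an illustration (the opcode splits shown are schematic, not the actual
    SH or XG1 opcode maps):

    /* SuperH-style 2R form:  ZZZZ nnnn mmmm ZZZZ  (opcode at both ends)
       XG1-style 2R form:     ZZZZZZZZ nnnn mmmm   (opcode packed on top) */
    #include <stdint.h>

    struct op2r { unsigned op, rn, rm; };

    static struct op2r decode_sh_style(uint16_t w)
    {
        return (struct op2r){ .op = ((w >> 8) & 0xf0) | (w & 0x0f),
                              .rn = (w >> 8) & 0xf,
                              .rm = (w >> 4) & 0xf };
    }

    static struct op2r decode_xg1_style(uint16_t w)
    {
        return (struct op2r){ .op =  w >> 8,
                              .rn = (w >> 4) & 0xf,
                              .rm =  w       & 0xf };
    }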

    Though, this happened during a transition between B32V and BSR1, where:
    B32V was basically a bare-metal version of SH;
    BSR1 was an instruction repack (with tweaks to try make it more
    competitive with MSP430 while still remaining Load/Store);
    BJX2 was basically rebuilding all the stuff from BJX1 on top of BSR1's encoding scheme (which then mutated more).


    At first, BJX2's 32-bit ops were a prefix:
    111P-YYWY-qnmo-oooo ZZZZ-ZZZZ-nnnn-mmmm

    But, then got reorganized:
    111P-YYWY-nnnn-mmmm ZZZZ-qnmo-oooo-ZZZZ

    Originally, this repack was partly because I had ended up designing some
    Imm9/Disp9 encodings, as it quickly became obvious that Imm5/Disp5 was
    insufficient. But, I had designed the new instructions so that the Imm
    field was not totally dog-chewed, so I ended up changing the layout. Then
    I ended up changing the encoding for the 3R instructions to better match
    that of the new Imm9 encodings.

    Then, XG2:
    NMOP-YYwY-nnnn-mmmm ZZZZ-qnmo-oooo-ZZZZ //3R

    Which moved entirely over to 32/64/96 bit encodings in exchange for
    being able to directly encode 64 GPRs in 32-bit encodings for the whole ISA.


    In the original BJX2 (later renamed XG1), only a small subset of
    instructions had direct access to the higher-numbered registers; the
    rest needed 64-bit encodings to reach them.

    Though, ironically, XG2 never surpassed XG1 in terms of code-density;
    but being able to use 64 registers "pretty much everywhere" was (mostly)
    a good thing for performance.


    For XG3, there was another repack:
    ZZZZ-oooooo-mmmmmm-ZZZZ-nnnnnn-qY-YYPw //3R

    But, this was partly to allow it to co-exist with RISC-V.

    Technically, it still has conditional instructions, but these were demoted
    to optional; if one did a primarily RISC-V core with an XG3 subset as an
    ISA extension, one might not want to deal with the added architectural
    state of a 'T' bit.

    BGBCC doesn't currently use it by default.

    Was also able to figure out how to make the encoding less dog chewed
    than either XG2 or RISC-V.


    Though, ironically, the full merits of XG3 are only really visible in
    cases where XG1 and XG2 are dropped. But, it has a new boat-anchor in
    that it now assumes coexistence with RISC-V (which itself has a fair bit
    of dog chew).

    And, if the goal is RISC-V first, then likely the design of XG3 is a big
    ask; it being essentially its own ISA.

    Though, while giving fairly solid performance, XG3 currently hasn't
    matched the code density of its predecessors (either XG1 or XG2). It is
    more like "RISC-V but faster".

    And, needing to use mode changes to access XG3 or RV-C is a little ugly.



    Though, OTOH, RISC-V land is annoying in a way; lots of people act as if
    "RV-V will save us from all our performance woes!", rather than realizing
    that some issues need to be addressed in the integer ISA, and that SIMD
    and auto-vectorization will not fix inefficiencies there.


    Though, I have seen glimmers of hope that other people in RV land
    realize this...


    A flags register allows 2-byte short conditional branch instructions,
    8-bit opcode and 8-bit offset. With no flags register the shortest conditional branch would be 3 bytes as it needs a register specifier.


    Yeah, "BT/BF Disp8".


    If one is doing variable byte-length instructions, then the
    highest-frequency instructions can be made as compact as possible.
    E.g. an ADD with a 32-bit immediate in 6 bytes.



    In BSR1, I had experimented with:
    LDIZ Imm12u, R0 //R0=Imm12
    LDISH Imm8u //R0=(R0<<8)|Imm8u
    OP Imm4R, Rn //OP [(R0<<4)|Imm4u], Rn

    Which allowed Imm24 in 6 bytes or Imm32 in 8 bytes.
    Granted, as 3 or 4 instructions.
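
    A small semantic model of that immediate-building sequence (register
    and mnemonic names as above, everything else invented):

    /* LDIZ seeds R0 with 12 bits, each LDISH shifts in 8 more, and the
       final OP supplies the low 4 bits: 12+8+4 = Imm24 in 3 ops,
       12+8+8+4 = Imm32 in 4 ops. */
    #include <stdint.h>
    #include <assert.h>

    static uint32_t r0;

    static void     ldiz  (uint32_t imm12) { r0 = imm12 & 0xfff; }
    static void     ldish (uint32_t imm8)  { r0 = (r0 << 8) | (imm8 & 0xff); }
    static uint32_t op_imm(uint32_t imm4)  { return (r0 << 4) | (imm4 & 0xf); }

    int main(void)
    {
        ldiz (0x123);              /* build 0x12345678 in 8 bytes of code */
        ldish(0x45);
        ldish(0x67);
        assert(op_imm(0x8) == 0x12345678);
        return 0;
    }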

    Though, this began the process of allowing the assembler to fake more
    complex instructions which would decompose into simpler instructions.


    But, this was not kept, and in BJX2 was mostly replaced with:
    LDIZ Imm24u, R0
    OP R0, Rn

    Then, when I added Jumbo Prefixes:
    OP Rm, Imm33s, Rn

    Some extensions of RISC-V support Imm32 in 48-bit ops, but this burns
    through lots of encoding space.

    iiiiiiii-iiiiiiii iiiiiiii-iiiiiiii zzzz-nnnnn-z0-11111

    This doesn't go very far.


    Can note ISAs with 16 bit encodings:
      PDP-11: 8 registers
      M68K  : 2x 8 (A and D)
      MSP430: 16
      Thumb : 8|16
      RV-C  : 8|32
      SuperH: 16
      XG1   : 16|32 (Mostly 16)

    The saving for fixed 32-bit instructions is that it only needs to
    prefetch aligned 4 bytes ahead of the current instruction to maintain
    1 decode per clock.

    With variable length instructions from 1 to 12 bytes it could need
    a 16 byte fetch buffer to maintain that decode rate.
    And a 16 byte variable shifter (collapsing buffer) is much more logic.

    I was thinking the variable instruction buffer shifter could be built
    from tri-state buffers in a cross-bar rather than muxes.

    The difference for supporting variable aligned 16-bit instructions and
    byte aligned is that bytes doubles the number of tri-state buffers.


    If the smallest instruction size is 16 bits, it simplifies things
    considerably vs 8 bits.

    If the smallest size is 32-bits, it simplifies things even more.
    Fixed length is the simplest case though.


    As noted, 32/64/96 bit fetch isn't too difficult though.

    For 64/96 bit instructions though, one mostly wants to be able to
    treat them like a superscalar fetch of 2 or 3 32-bit instructions.

    In my CPU, I ended up making it so that only 32-bit instructions support superscalar; whereas 16 and 64/96 bit instructions are scalar only.

    Superscalar only works with native alignment though (for RISC-V), and
    for XG3, 32-bit instruction alignment is mandatory.


    As noted, in terms of code density, a few of the stronger options are
    Thumb2 and RV-C, which have 16 bits as the smallest size.


    I once experimented with having a range of 24-bit instructions, but the
    hair this added (combined with the fairly small gain in code density)
    showed it was not really worth it.


    ...


    In my recent fiddling for trying to design a pair encoding for XG3,
    can note the top-used instructions are mostly, it seems (non Ld/St):
      ADD   Rs, 0, Rd    //MOV     Rs, Rd
      ADD   X0, Imm, Rd  //MOV     Imm, Rd
      ADDW  Rs, 0, Rd    //EXTS.L  Rs, Rd
      ADDW  Rd, Imm, Rd  //ADDW    Imm, Rd
      ADD   Rd, Imm, Rd  //ADD     Imm, Rd

    Followed by:
      ADDWU Rs, 0, Rd    //EXTU.L  Rs, Rd
      ADDWU Rd, Imm, Rd  //ADDWu   Imm, Rd
      ADDW  Rd, Rs, Rd   //ADDW    Rs, Rd
      ADD   Rd, Rs, Rd   //ADD     Rs, Rd
      ADDWU Rd, Rs, Rd   //ADDWU   Rs, Rd

    Most every other ALU instruction and usage pattern either follows a
    bit further behind or could not be expressed in a 16-bit op.

    For Load/Store:
      SD  Rn, Disp(SP)
      LD  Rn, Disp(SP)
      LW  Rn, Disp(SP)
      SW  Rn, Disp(SP)

      LD  Rn, Disp(Rm)
      LW  Rn, Disp(Rm)
      SD  Rn, Disp(Rm)
      SW  Rn, Disp(Rm)


    For registers, there is a split:
      Leaf functions:
        R10..R17, R28..R31 dominate.
      Non-Leaf functions:
        R10, R18..R27, R8/R9

    For 3-bit configurations:
      R8..R15                             Reg3A
      R18/R19, R20/R21, R26/R27, R10/R11  Reg3B

    Reg3B was a bit hacky, but had similar hit rates while using less
    encoding space than a 4-bit R8..R23 (saving 1 bit in the relevant
    scenarios).




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Thu Aug 21 16:21:37 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:

    While HW walkers are serial for translating one VA,
    the translations are inherently concurrent provided one can
    implement an atomic RMW for the Accessed and Modified bits.

    It's always a one-way street (towards accessed and towards modified,
    never the other direction), so it's not clear to me why one would want atomicity there.

    Consider "virgin" page, that is neither accessed nor modified.
    Intruction 1 reads the page, instruction 2 modifies it. After
    both are done you should have both bits set. But if miss handling
    for instruction 1 reads page table entry first, but stores after
    store fomr instruction 2 handler, then you get only accessed bit
    and modified flag is lost. Symbolically we could have

    read PTE for instruction 1
    read PTE for instruction 2
    store PTE for instruction 2 (setting Accessed and Modified)
    store PTE for instruction 1 (setting Accessed but clearing Modified)
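
    In C11-atomics terms the same interleaving looks like this: a plain
    read-modify-write of the PTE can overwrite the Dirty bit set by the
    other walker, while an atomic OR cannot, since the update is a one-way
    accumulation of bits (bit positions are illustrative):

    #include <stdatomic.h>
    #include <stdint.h>

    #define PTE_A (1u << 5)     /* Accessed */
    #define PTE_D (1u << 6)     /* Dirty    */

    /* Racy version, the interleaving described above: */
    void set_bits_racy(uint64_t *pte, uint64_t bits)
    {
        uint64_t old = *pte;    /* walker 1 reads, sees neither A nor D  */
                                /* ... walker 2 stores A|D here ...      */
        *pte = old | bits;      /* walker 1 stores A only: D is lost     */
    }

    /* Safe version: the RMW is indivisible, so both walkers' bits stick. */
    void set_bits_atomic(_Atomic uint64_t *pte, uint64_t bits)
    {
        atomic_fetch_or(pte, bits);
    }
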
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Aug 21 19:26:47 2025
    From Newsgroup: comp.arch

    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than an order of magnitude more than what is needed
    for a RISC chip.

    Consider ARM2, which had 27000 transistors and which is sort of
    the minimum RISC design you can manage (although it had a Booth
    multiplier).

    An ARMv2 implementation with added I and D cache, plus virtual
    memory, would not have been the ideal design (too few registers, too
    many bits wasted on conditional execution, ...) but it would have
    run rings around the VAX.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Aug 21 21:48:03 2025
    From Newsgroup: comp.arch


    John Savard <quadibloc@invalid.invalid> posted:

    On Sun, 20 Jul 2025 17:28:37 +0000, MitchAlsup1 wrote:

    I do agree with some of what Mill does, including placing the preserved registers in memory where they cannot be damaged.
    My 66000 calls this mode of operation "safe stack".

    This sounds like an idea worth stealing, although no doubt the way I
    would attempt to copy it would be a failure which removed all the
    usefulness of it.

    For one thing, I don't have a stack for calling subroutines, or any other purpose.

    But I could easily add a feature where a mode is turned on, and instead of using the registers, it works off of a workspace pointer, like the TI 9900.

    The trouble is, though, that this would be an extremely slow mode. When registers are _saved_, they're already saved to memory, as I can't think
    of anywhere else to save them. (There might be multiple sets of registers, for things like SMT, but *not* for user vs supervisor or anything like
    that.)

    In reverse order:
    If the TI 9900 had used its registers like a write-back cache, then typical
    access would be fast and efficient. When the register pointer is altered,
    the old file is written en masse and a new file is read in en masse
    {possibly with some buffering to lessen the visible cycle count} ... but I
    digress.
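
    A toy model of that idea in C (illustrative only; the real TMS9900 went
    to memory on every register access, which is exactly the cost being
    avoided here):

    /* Registers live in memory at the workspace pointer (WP), 16 x 16 bits.
       Keep one on-chip copy; spill and fill it en masse only when WP moves. */
    #include <stdint.h>
    #include <stdbool.h>

    extern uint16_t mem_read16 (uint32_t addr);
    extern void     mem_write16(uint32_t addr, uint16_t val);

    static struct {
        uint32_t wp;            /* WP the cached file belongs to */
        uint16_t r[16];
        bool     valid, dirty;
    } wfile;

    static void switch_workspace(uint32_t new_wp)
    {
        if (wfile.valid && wfile.wp == new_wp)
            return;                                /* fast path: same WP  */
        if (wfile.valid && wfile.dirty)            /* write old file back */
            for (int i = 0; i < 16; i++)
                mem_write16(wfile.wp + 2*i, wfile.r[i]);
        for (int i = 0; i < 16; i++)               /* read new file in    */
            wfile.r[i] = mem_read16(new_wp + 2*i);
        wfile.wp    = new_wp;
        wfile.valid = true;
        wfile.dirty = false;
    }

    static uint16_t reg_read(uint32_t wp, int n)
    {
        switch_workspace(wp);
        return wfile.r[n];
    }

    static void reg_write(uint32_t wp, int n, uint16_t val)
    {
        switch_workspace(wp);
        wfile.r[n]  = val;
        wfile.dirty = true;
    }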

    {Conceptually}
    My 66000 uses this concept for its GPRs and for its Thread State but only at context switch time, not for subroutine calls and returns. HW saves and restores Thread State and Registers on context switches so that the CPU
    never has to Disable Interrupts (it can, it just doesn't have to). {/Conceptually}
    I bracketed the above with 'Conceptually' because it is completely
    reasonable to envision a Core big enough to have 4 copies of Thread
    State and Register files, and bank switch between them. The important properties are that the switch delivers control reentrantly, HOW any
    given implementation does that is NOT architecture--that it does IS architecture.

    I specifically left how many registers are preserved per CALL up to SW,
    because up to 50% of calls need 0, and only a few % require more than 4.
    This appears to indicate that SPARC using 8 was overkill ... but I digress
    again.

    Safe Stack is a stack used for preserving the ABI contract between caller
    and callee even in the face of buffer overruns, RoP, and other malicious program behavior. SS places the return address and the preserved registers
    in an area of memory where LD and ST instructions have no access (RWE = 000) but ENTER, EXIT, and RET do. This was done in such a way that correct code
    runs both with SS=on and SS=off, so the compiler does not have to know.

    Only CALL, CALX, RET, ENTER, and EXIT are aware of the existence of SS
    and only in HW implementations.

    I have harped on you for a while to start development of your compiler.
    One of the first things a compiler needs to do is to develop its means
    to call subroutines and return. This requires a philosophy of passing
    arguments, returning results, dealing with recursion, and dealing with
    TRY-THROW-CATCH SW-defined exception handling. I KNOW of nobody who does
    this without some kind of stack.

    I happen to use 2 such stacks mostly to harden the environment at low
    cost to malicious attack vectors. It comes with benefits: Lines removed
    from SS do not migrate to L2 or even DRAM, they can be discarded at
    end-of-use, reducing memory traffic; the SW contract between Caller and
    Callee is guaranteed even in the face of malicious code; it can be used
    as a debug tool to catch malicious code. ...

    NOTE: malicious code can still damage data*, just not the preserved regs
    or the return address, guaranteeing that control returns to the instruction
    following CALL. And all without adding a single instruction to the CALL/RET
    instruction sequence.

    (*) memory

    So I've probably completely misunderstood you here.

    Not the first time ...

    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Fri Aug 22 16:36:09 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than an order of magnitude more than what is needed
    for a RISC chip.

    Consider ARM2, which had 27000 transistors and which is sort of
    the minimum RISC design you can manage (although it had a Booth
    multiplier).

    An ARMv2 implementation with added I and D cache, plus virtual
    memory, would not have been the ideal design (too few registers, too
    many bits wasted on conditional execution, ...) but it would have
    run rings around the VAX.

    1 mln transistors is an upper estimate. But the low numbers given
    for early RISC chips are IMO misleading: RISC became commercially
    viable for high-end machines only in later generations, when
    designers added a few "expensive" instructions. Also, to fit the
    design into a single chip, designers moved some functionality,
    like the bus interface, to support chips. A RISC processor with
    mixed 16-32 bit instructions (needed to get reasonable code
    density), hardware multiply and FPU, including cache controller,
    paging hardware and memory controller, is much more than the
    100 thousand transistors cited for early workstation chips.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Fri Aug 22 16:45:56 2025
    From Newsgroup: comp.arch

    According to Thomas Koenig <tkoenig@netcologne.de>:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than an order of magnitude more than what is needed
    for a RISC chip.

    It also seems rather high for the /91. I can't find any authoritative
    numbers, but 100K seems more likely. It was SLT, individual transistors
    mounted a few to a package. The /91 was big, but it wasn't *that* big.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Aug 22 17:21:17 2025
    From Newsgroup: comp.arch

    Waldek Hebisch <antispam@fricas.org> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> wrote:
    Waldek Hebisch <antispam@fricas.org> schrieb:

    Let me reformulate my position a bit: clearly in 1977 some RISC
    design was possible. But probably it would be something
    even more primitive than Berkeley RISC. Putting in hardware
    things that later RISC designs put in hardware almost surely would
    exceed allowed cost. Technically at 1 mln transistors one should
    be able to do acceptable RISC and IIUC IBM 360/90 used about
    1 mln transistors in less dense technology, so in 1977 it was
    possible to do 1 mln transistor machine.

    HUH? That is more than one order of magnitude than what is needed
    for a RISC chip.

    Consider ARM2, which had 27000 transistors and which is sort of
    the minimum RISC design you can manage (although it had a Booth
    multiplier).

    An ARMv2 implementation with added I and D cache, plus virtual
    memory, would not have been the ideal design (too few registers, too
    many bits wasted on conditional execution, ...) but it would have
    run rings around the VAX.

    1 mln transistors is an upper estimate. But the low numbers given
    for early RISC chips are IMO misleading: RISC became commercially
    viable for high-end machines only in later generations, when
    designers added a few "expensive" instructions.

    Like the multiply instruction in ARM2.

    Also, to fit the
    design into a single chip, designers moved some functionality,
    like the bus interface, to support chips. A RISC processor with
    mixed 16-32 bit instructions (needed to get reasonable code
    density), hardware multiply and FPU, including cache controller,
    paging hardware and memory controller, is much more than the
    100 thousand transistors cited for early workstation chips.

    Yep, FP support can be expensive and was an extra option
    on the VAX, which also included integer multiply.

    However, I maintain that a ~1977 supermini with a similar sort
    of bus, MMU, floating point unit etc. to the VAX's, but with an
    architecture similar to ARM2, plus separate icache and dcache, would
    have beaten the VAX hands-down in performance - it would have taken
    fewer chips to implement, less power, and possibly less time to develop.
    HP showed this was possible some time later.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2