• Base-Index Addressing in the Concertina II

    From John Savard@quadibloc@invalid.invalid to comp.arch on Sat Jul 26 00:33:22 2025
    From Newsgroup: comp.arch

    It occurred to me that I didn't really attempt to mount a full defense to
    my decision to include base-index addressing with 16-bit displacements
    in the Concertina II ISA.

Doing so definitely came with a cost. Although the Concertina II ISA uses
banks of 32 registers and, when a block header is not present, instructions
that are all 32 bits in length - thus perhaps making it a wolf in RISC
clothing - it means that only seven, rather than all thirty-one registers
other than register zero, can be used as index registers, and only a
different seven registers can be used as base registers with 16-bit
displacements. That makes it abundantly clear the architecture is not RISC.

    Here is my defense for this design decision:

    1) 16-bit displacements are _important_. Pretty well *all* microprocessors
    use 16-bit displacements, rather than anything shorter like 12 bits. Even though this meant, in most cases, they had to give up indexing.

    2) Index registers were hailed as a great advancement in computers
    when they were added to them, as they allowed avoiding self-modifying
    code. Of course, if one just has base registers, one can still have
    a special base register, used for array accessing, where the base
    address of a segment has had the array offset added to it through
    separate arithmetic instructions.

    Array accesses are common, and not needing extra instructions for them is therefore beneficial.

3) At least one major microprocessor manufacturer, Motorola, did have
base-index addressing with 16-bit displacements, starting with
the 68020.

    (3) may not be much of an argument, but it seems to me that (1) and (2)
    can reasonably be considered fairly strong arguments. But what about the drawbacks?

    If you have to have base registers in order to address memory, those
    base registers will cause register pressure no matter where they're
    put. The usual convention, at least on the System/360, was to put them
    near the end of the register bank (although the last three registers
    were special), so choosing the last seven registers for use as base
    registers with the most common displacement size seemed like it shouldn't confuse register allocation too much.

    Given, though, that one doesn't get absolute addressing - instead, one
    gets other address modes - by specifying base register zero, one can't
    put a pointer into an _index_ register on the Concertina II and get a
    useful result. Hence, instead of the convention being to return pointers
    to arrays in register 1, perhaps I'll have to use register 25. (However,
    there is a good argument for register 17 instead, since some instructions
    don't have 16-bit displacements available, for reasons of compactness,
    and must make do with 12-bit displacements. Thus, using the first base
    register for 12-bit displacements would be more universally useful.)
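
The instruction-saving claim in argument (2) above can be made concrete
with a small sketch. Everything here - the byte-addressed 64K segment, the
helper names, the two-instruction fallback - is my own illustrative model,
not actual Concertina II semantics:

```python
# Illustrative model: accessing word a[i] at (base + disp16 + index)
# with and without a combined base-index-displacement addressing mode.

memory = bytearray(1 << 16)          # one 64K-byte segment
# a[] starts at offset 0x1000 in the segment; store 42 in element 3.
memory[0x1000 + 4 * 3:0x1000 + 4 * 3 + 4] = (42).to_bytes(4, "little")

def load_base_index_disp(base, index, disp16):
    """One instruction: EA = base + index + disp16."""
    ea = base + index + disp16
    return int.from_bytes(memory[ea:ea + 4], "little")

def load_base_only(base, index, disp16):
    """Without the mode: separate address arithmetic, then a plain load."""
    tmp = base + disp16                  # extra ALU instruction
    tmp = tmp + index                    # and possibly another
    return int.from_bytes(memory[tmp:tmp + 4], "little")

print(load_base_index_disp(0, 4 * 3, 0x1000))   # -> 42, one instruction
print(load_base_only(0, 4 * 3, 0x1000))         # -> 42, two or three
```

Both paths compute the same effective address; the difference is only in
how many instructions the address arithmetic costs.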

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Jul 26 08:48:34 2025
    From Newsgroup: comp.arch

    John Savard <quadibloc@invalid.invalid> schrieb:

    1) 16-bit displacements are _important_. Pretty well *all* microprocessors use 16-bit displacements, rather than anything shorter like 12 bits.

    That is a bit of an exaggeration - SPARC has 13-bit constants, RISC-V
    has 12-bit constants.

    Even
    though this meant, in most cases, they had to give up indexing.

    SPARC has both Ra+Rb and Ra+immediate, but not combined. The use
    case for Ra+Rb+13..16 bit is extremely limited.

    2) Index registers were hailed as a great advancement in computers
    when they were added to them, as they allowed avoiding self-modifying
    code.

    GP registers were an even greater achievement.

    Of course, if one just has base registers, one can still have
    a special base register, used for array accessing, where the base
    address of a segment has had the array offset added to it through
    separate arithmetic instructions.

    The usual method is Ra+Rb. Mitch also has Ra+Rb<<n+32-bit or
    Ra+Rb<<n+64-bit, with n from 0 to 3. This is useful for addressing
    global data encoded in the constant, which a 12 to 16 bit-offset
    is not.

    Array accesses are common, and not needing extra instructions for them is therefore beneficial.

    Yes, and indexing without offset can do that particular job just
    fine.

3) At least one major microprocessor manufacturer, Motorola, did have
base-index addressing with 16-bit displacements, starting with
the 68020.

Mitch recently explained that they had microarchitectural reasons.

    (3) may not be much of an argument, but it seems to me that (1) and (2)
    can reasonably be considered fairly strong arguments. But what about the drawbacks?

    I have not seen a strong argument for Ra+Rb+16 bit in what you wrote
    above.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Sat Jul 26 13:18:48 2025
    From Newsgroup: comp.arch

    On Sat, 26 Jul 2025 08:48:34 +0000, Thomas Koenig wrote:

    The use case
    for Ra+Rb+13..16 bit is extremely limited.

    Maybe so. It covers the case where multiple small arrays are located
    in the same kind of 64K-byte segment as simple variables.

    But what if arrays are larger than 64K in size? Well, in that case,
    I've included Array Mode in the standard form of a memory address.

    This is a kind of indirect addressing that uses a table of array
    addresses in memory to supply the address to which the index
    register contents are added.
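
A minimal sketch of what such a table-indirect mode computes, assuming
(purely my assumption, for illustration) 64-bit pointers held in an
in-memory table:

```python
# Sketch of "Array Mode": the displacement selects an entry in a table
# of array base addresses in memory, and the index register contents
# are added to the pointer fetched from that entry.

import struct

memory = bytearray(1 << 20)

def store_u64(addr, value):
    memory[addr:addr + 8] = struct.pack("<Q", value)

def load_u64(addr):
    return struct.unpack("<Q", memory[addr:addr + 8])[0]

TABLE_BASE = 0x8000                       # where the pointer table lives
store_u64(TABLE_BASE + 8 * 2, 0x40000)    # entry 2 -> array at 0x40000

def array_mode_ea(entry, index):
    """EA = mem[TABLE_BASE + 8*entry] + index (costs one extra memory access)."""
    return load_u64(TABLE_BASE + 8 * entry) + index

print(hex(array_mode_ea(2, 4 * 10)))      # element 10 of a 4-byte array
```

The extra load of the table entry is the cost Thomas Koenig objects to
in the follow-up.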

    I have not seen a strong argument for Ra+Rb+16 bit in what you wrote
    above.

    You may not find the arguments strong.

My goal was to provide the instruction set for a very powerful computer;
so I included this addressing mode so as not to lack a feature, present
on other computers, that clearly benefited performance.

    Base plus index plus displacement saves an instruction or two.

    16-bit displacements instead of 12-bit displacements ease register
    pressure.

    That was good enough for me, but of course others may take a different
    view of things.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Jul 26 13:34:19 2025
    From Newsgroup: comp.arch

    On 2025-07-26, John Savard <quadibloc@invalid.invalid> wrote:
    On Sat, 26 Jul 2025 08:48:34 +0000, Thomas Koenig wrote:

    The use case
    for Ra+Rb+13..16 bit is extremely limited.

    Maybe so. It covers the case where multiple small arrays are located
    in the same kind of 64K-byte segment as simple variables.

    The cost of having a register point to each array is much lower than
    the cost of crippling your ISA with the complicated base + index
    register scheme.


    But what if arrays are larger than 64K in size? Well, in that case,
    I've included Array Mode in the standard form of a memory address.

    This is a kind of indirect addressing that uses a table of array
    addresses in memory to supply the address to which the index
    register contents are added.

    Wait.

    Do you really want to have an extra memory access to access an
    array element, instead of loading the base address of your array
    into a register?

    This makes negative sense. Memory accesses, even from L1 cache,
    are very expensive these days.


    I have not seen a strong argument for Ra+Rb+16 bit in what you wrote
    above.

    You may not find the arguments strong.

    My goal was to provide the instruction set for a very powerful computer;
    so I included this addressing mode so as not to lack a feature that
    clearly benefited performance that other computers had.

    Base plus index plus displacement saves an instruction or two.

    How often? Do you have any idea if you're talking about 1%, 0.1%,
    0.01% or 0.001%?

    If you want this to actually be useful, do as Mitch has done, and
    go a full 32-bit constant.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Sat Jul 26 17:43:58 2025
    From Newsgroup: comp.arch

    On Sat, 26 Jul 2025 13:34:19 +0000, Thomas Koenig wrote:

    Wait.

    Do you really want to have an extra memory access to access an array
    element, instead of loading the base address of your array into a
    register?

    This makes negative sense. Memory accesses, even from L1 cache, are
    very expensive these days.

I know that. But computers these days do have such a thing as cache,
and giving the array pointer table a higher priority in the cache,
because it's expected to be used a lot, is doable.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Sat Jul 26 18:23:35 2025
    From Newsgroup: comp.arch

    On Sat, 26 Jul 2025 17:43:58 +0000, John Savard wrote:

    I know that. But computers these days do have such a thing as cache, and giving the array pointer table a higher priority to cache because it's expected to be used a lot is doable.

    As it happens, you've given me an idea. And in preparing to edit my pages
    to note this new feature, I found some errors that I've corrected as
    well; I had not realized that my assignment of functions to the integer registers when used as base registers was different from what I thought it
    was - apparently, I failed to edit some out-of-date text.

My idea? Well, I've shrunk the tables in memory used with Array Mode to
384 pointers from 512. That way, displacement values starting with the
bits 11 now indicate that a register - one of the 128 registers in the
extended integer register bank - is being used to contain the pointer to
an array.

    Since I have plenty of those, and they're usually only used in code
    intended to be massively superscalar, making use of VLIW features, and,
    for that matter, the extended floating register bank may also be used as
    a set of eight _string_ registers... it seemed like a good fit.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Sun Jul 27 00:31:41 2025
    From Newsgroup: comp.arch

    On Sat, 26 Jul 2025 13:34:19 +0000, Thomas Koenig wrote:

    Do you really want to have an extra memory access to access an array
    element, instead of loading the base address of your array into a
    register?

    Obviously, loading the starting address of the array into a register is preferred.

    There are seven registers used as base registers with 16-bit displacements,
    and another seven registers used as base registers with 12-bit
    displacements.

    So, in a normal program that uses only one each of the former group of base registers for its code and data segments respectively, there are enough registers to handle twelve arrays.

    Array Mode is only for use when it's needed, because a program is either dealing with a lot of large arrays, or if it is under extreme register
    pressure in addition to dealing with some large arrays.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Sun Jul 27 01:36:57 2025
    From Newsgroup: comp.arch

    On Sat, 26 Jul 2025 13:34:19 +0000, Thomas Koenig wrote:

    If you want this to actually be useful, do as Mitch has done, and go a
    full 32-bit constant.

    Well, I would actually need to use 64 bits if I wanted to include
    full-size memory addresses in instructions. I think that is too long.

    However, I _do_ have a class of extra-long instructions which use
    a displacement longer than 16 bits. I followed the example of a major
    computer manufacturer in doing so; thus, the longer displacement is
    still only 20 bits.

    John Savard

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Jul 26 23:18:58 2025
    From Newsgroup: comp.arch

    On 7/26/2025 3:48 AM, Thomas Koenig wrote:
    John Savard <quadibloc@invalid.invalid> schrieb:

1) 16-bit displacements are _important_. Pretty well *all* microprocessors
use 16-bit displacements, rather than anything shorter like 12 bits.

    That is a bit of an exaggeration - SPARC has 13-bit constants, RISC-V
    has 12-bit constants.


    Or, XG2/XG3, and SH-5: 10-bits.

    12 bits would be plenty, if it were scaled.
    Unscaled displacements effectively lose 2 or 3 bits of range for no real benefit.
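
The range argument can be quantified with a quick sketch (the 12-bit
width and the element sizes below are just illustrative numbers, not
tied to any one ISA):

```python
# Back-of-the-envelope: maximum positive byte reach of a signed 12-bit
# displacement, unscaled versus scaled by the access size.

def reach(bits, scale):
    """Largest positive byte offset reachable by a signed `bits`-bit
    displacement multiplied by `scale`."""
    return ((1 << (bits - 1)) - 1) * scale

print(reach(12, 1))   # unscaled:             2047 bytes
print(reach(12, 4))   # scaled for 32-bit:    8188 bytes
print(reach(12, 8))   # scaled for 64-bit:   16376 bytes
```

Scaling by the element size buys the 2 or 3 bits of effective range
mentioned above, at the price of not encoding misaligned offsets.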


    Though, for 32-bit instructions, 12 bits is a sensible size with 5 bit register fields. Whereas 10 bits makes more sense for 6-bit registers.
    Going much smaller than this will result in a sharp increase in miss
    rate though.

    So, for example, 6 or 7 displacement bits would be too few.


    As can be noted, the "best case" for scaled displacements was seemingly
    9 bits unsigned, 10 bits signed.
    Where, in terms of hit-rate: 9u>9s, but 10u<10s; since a slight majority
    of those that miss the 9u range were also negative.


    If using unscaled displacements (like RISC-V) the needed range is closer
    to 13 bits.

    One possible downside of scaled displacements is that they can't
    directly encode a misaligned load or store. But this is rare.

In my more recent ISA designs (namely XG3), I had made it so the Disp33 encodings can also encode an unscaled displacement.

    In this case, there is effectively a 1 bit scale selector:
    0: Use the element size as the scale;
    1: Use a Byte scale.

    The need for misaligned displacements being rare enough that needing to jumbo-encode them isn't all that much of an issue.
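
The selector described above amounts to something like this minimal
sketch (the function and parameter names are my own illustration, not
the actual XG3 encoding):

```python
# Sketch of a 1-bit scale selector for displacements:
#   selector value 0: scale the displacement by the element size;
#   selector value 1: treat the displacement as a raw byte offset.

def effective_offset(disp, elem_size, scale_sel):
    """Return the byte offset encoded by `disp` under the selector."""
    return disp * (elem_size if scale_sel == 0 else 1)

print(effective_offset(10, 4, 0))   # -> 40  (10th 32-bit element)
print(effective_offset(10, 4, 1))   # -> 10  (byte offset 10, misaligned OK)
```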

    As noted, the only available displacement sizes here are 10 and 33 bits.
    10 covers the vast majority of load/store;
    33 covers everything else.


    There doesn't seem to be much practical need for larger displacements
    than 33 in the general case. As I see it, it is acceptable to have any
    larger displacement addressing decay to using general purpose ALU instructions.

    In my case, there is generally no absolute addressing mode.
    On 64 bit machines, absolute addressing no longer makes as much sense.
    ...


    Even
    though this meant, in most cases, they had to give up indexing.

    SPARC has both Ra+Rb and Ra+immediate, but not combined. The use
    case for Ra+Rb+13..16 bit is extremely limited.


    Yeah.

    Better IMHO to treat [Base+Disp] and [Base+Index*Sc] as two mutually
    exclusive cases in terms of the encoding scheme.

The gains of full [Base+Index*Sc+Disp] aren't really worth the cost;
    mostly because cases where this is applicable tend to be statistically infrequent.


    Even if one used larger encodings for these, still debatable if the
    gains are worth the added implementation cost.


    On the other side, things like array accesses are common enough that in
    many cases, RISC-V effectively shoots itself in the foot by lacking
    [Rb+Ri*Sc] addressing.

    Well, and while not quite as big of a deal in terms of static
    instruction counts, this addressing mode tends to have a very high
    probability of being used inside loops; so has an oversized impact on performance.


    Though, there was some murmuring that at least some people in RISC-V
    land are considering adding it.


    2) Index registers were hailed as a great advancement in computers
    when they were added to them, as they allowed avoiding self-modifying
    code.

    GP registers were an even greater achievement.


    Yeah.


    Of course, if one just has base registers, one can still have
    a special base register, used for array accessing, where the base
    address of a segment has had the array offset added to it through
    separate arithmetic instructions.

    The usual method is Ra+Rb. Mitch also has Ra+Rb<<n+32-bit or Ra+Rb<<n+64-bit, with n from 0 to 3. This is useful for addressing
    global data encoded in the constant, which a 12 to 16 bit-offset
    is not.


    A 16-bit offset can often be useful for globals if using a Global
    Pointer, but not useful for PC-rel (and not worth it for general Ld/St ops).


    But, this is niche, and can effectively be limited in scope only to 32
    and 64 bit items with a hard-coded base register.

    This can then access the first 256K or 512K of ".data", which is
    typically sufficient for global variables (though, any commonly-accessed globals may need to be promoted to being in ".data" regardless of
    whether or not they are initialized).

    Though, even with separate compilation, this is probably something a
    linker could figure out (just that "common data" could be put either in ".data" or ".bss", rather than only in ".bss").


    I would not recommend Disp16 for normal 32-bit Ld/St ops as this wastes
    too much encoding space.


    Array accesses are common, and not needing extra instructions for them is
    therefore beneficial.

    Yes, and indexing without offset can do that particular job just
    fine.


    Agreed.

    Where:
    [Rb+Ri*Sc]
    Is fairly common.

    But:
    [Rb+Ri*sc+Disp]
    Is much less commonly needed IME.


3) At least one major microprocessor manufacturer, Motorola, did have
base-index addressing with 16-bit displacements, starting with
the 68020.

Mitch recently explained that they had microarchitectural reasons.


The M68K line is probably not a great design reference, as it has such wonk.


    (3) may not be much of an argument, but it seems to me that (1) and (2)
    can reasonably be considered fairly strong arguments. But what about the
    drawbacks?

    I have not seen a strong argument for Ra+Rb+16 bit in what you wrote
    above.

    Agreed.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Jul 27 01:10:18 2025
    From Newsgroup: comp.arch

    On 2025-07-27 12:18 a.m., BGB wrote:
    On 7/26/2025 3:48 AM, Thomas Koenig wrote:
    John Savard <quadibloc@invalid.invalid> schrieb:

    1) 16-bit displacements are _important_. Pretty well *all*
    microprocessors
    use 16-bit displacements, rather than anything shorter like 12 bits.

    That is a bit of an exaggeration - SPARC has 13-bit constants, RISC-V
    has 12-bit constants.


    Or, XG2/XG3, and SH-5: 10-bits.

    12 bits would be plenty, if it were scaled.
    Unscaled displacements effectively lose 2 or 3 bits of range for no real benefit.


    Though, for 32-bit instructions, 12 bits is a sensible size with 5 bit register fields. Whereas 10 bits makes more sense for 6-bit registers.
    Going much smaller than this will result in a sharp increase in miss
    rate though.

    So, for example, 6 or 7 displacement bits would be too few.


    As can be noted, the "best case" for scaled displacements was seemingly
    9 bits unsigned, 10 bits signed.
    Where, in terms of hit-rate: 9u>9s, but 10u<10s; since a slight majority
    of those that miss the 9u range were also negative.


    If using unscaled displacements (like RISC-V) the needed range is closer
    to 13 bits.

    One possible downside of scaled displacements is that they can't
    directly encode a misaligned load or store. But this is rare.

In my more recent ISA designs (namely XG3), I had made it so the Disp33 encodings can also encode an unscaled displacement.

    In this case, there is effectively a 1 bit scale selector:
      0: Use the element size as the scale;
      1: Use a Byte scale.

    The need for misaligned displacements being rare enough that needing to jumbo-encode them isn't all that much of an issue.

    As noted, the only available displacement sizes here are 10 and 33 bits.
      10 covers the vast majority of load/store;
      33 covers everything else.


    There doesn't seem to be much practical need for larger displacements
    than 33 in the general case. As I see it, it is acceptable to have any larger displacement addressing decay to using general purpose ALU instructions.

    In my case, there is generally no absolute addressing mode.
      On 64 bit machines, absolute addressing no longer makes as much sense.
      ...


    Even
    though this meant, in most cases, they had to give up indexing.

    SPARC has both Ra+Rb and Ra+immediate, but not combined.  The use
    case for Ra+Rb+13..16 bit is extremely limited.


    Yeah.

    Better IMHO to treat [Base+Disp] and [Base+Index*Sc] as two mutually exclusive cases in terms of the encoding scheme.

The gains of full [Base+Index*Sc+Disp] aren't really worth the cost;
    mostly because cases where this is applicable tend to be statistically infrequent.


    Even if one used larger encodings for these, still debatable if the
    gains are worth the added implementation cost.


    On the other side, things like array accesses are common enough that in
    many cases, RISC-V effectively shoots itself in the foot by lacking [Rb+Ri*Sc] addressing.

    Well, and while not quite as big of a deal in terms of static
    instruction counts, this addressing mode tends to have a very high probability of being used inside loops; so has an oversized impact on performance.


    Though, there was some murmuring that at least some people in RISC-V
    land are considering adding it.


    2) Index registers were hailed as a great advancement in computers
    when they were added to them, as they allowed avoiding self-modifying
    code.

    GP registers were an even greater achievement.


    Yeah.


    Of course, if one just has base registers, one can still have
    a special base register, used for array accessing, where the base
    address of a segment has had the array offset added to it through
    separate arithmetic instructions.

    The usual method is Ra+Rb.  Mitch also has Ra+Rb<<n+32-bit or
    Ra+Rb<<n+64-bit, with n from 0 to 3.  This is useful for addressing
    global data encoded in the constant, which a 12 to 16 bit-offset
    is not.


    A 16-bit offset can often be useful for globals if using a Global
    Pointer, but not useful for PC-rel (and not worth it for general Ld/St
    ops).


    But, this is niche, and can effectively be limited in scope only to 32
    and 64 bit items with a hard-coded base register.

    This can then access the first 256K or 512K of ".data", which is
    typically sufficient for global variables (though, any commonly-accessed globals may need to be promoted to being in ".data" regardless of
    whether or not they are initialized).

    Though, even with separate compilation, this is probably something a
    linker could figure out (just that "common data" could be put either in ".data" or ".bss", rather than only in ".bss").


    I would not recommend Disp16 for normal 32-bit Ld/St ops as this wastes
    too much encoding space.


    Array accesses are common, and not needing extra instructions for
    them is
    therefore beneficial.

    Yes, and indexing without offset can do that particular job just
    fine.


    Agreed.

    Where:
      [Rb+Ri*Sc]
    Is fairly common.

    But:
      [Rb+Ri*sc+Disp]
    Is much less commonly needed IME.

Although much less commonly needed, there is [Rb+Ri*sc+Disp24] in my
current design, as the scaled index takes another instruction word and
the 24 bits would otherwise be wasted.

I think what is efficient depends somewhat on the instruction encoding.
Given the number of transistors available today, even less commonly
needed functionality could be considered.

3) At least one major microprocessor manufacturer, Motorola, did have
base-index addressing with 16-bit displacements, starting with
the 68020.

Mitch recently explained that they had microarchitectural reasons.


The M68K line is probably not a great design reference, as it has such wonk.

    I think the 68k is not that bad a reference design to study because it
    shows both what to do and what not to do in a single design.
    Understanding drawbacks can help a lot.

    I learned a lot from studying the 6502 and asking why did they do that?


(3) may not be much of an argument, but it seems to me that (1) and (2)
can reasonably be considered fairly strong arguments. But what about the
drawbacks?

    I have not seen a strong argument for Ra+Rb+16 bit in what you wrote
    above.

    Agreed.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Jul 27 07:15:02 2025
    From Newsgroup: comp.arch

    John Savard <quadibloc@invalid.invalid> schrieb:
    On Sat, 26 Jul 2025 13:34:19 +0000, Thomas Koenig wrote:

    Wait.

    Do you really want to have an extra memory access to access an array
    element, instead of loading the base address of your array into a
    register?

    This makes negative sense. Memory accesses, even from L1 cache, are
    very expensive these days.

    I know that. But computers these days do have such a thing as cache,
    and giving the array pointer table a higher priority to cache because
    it's expected to be used a lot is doable.

Doable how, in such a way that both memory accesses are as
fast as a single one with the obvious scheme? For that,
you would need zero overhead for a load from your special
cache, which sounds slightly impossible.

    But leaving that aside for a moment: You also need cycles and
    instructions to set up that table. Is that less than loading the
    base address of an array into a register?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Sun Jul 27 07:28:57 2025
    From Newsgroup: comp.arch

    On Sun, 27 Jul 2025 07:15:02 +0000, Thomas Koenig wrote:

    Doable how, in such a way that both memory accesses are as fast as a
    single one with the obvious scheme?

    Cache is storage inside the processor chip. So are registers. Cache will
    be slower, but not by as much as a normal memory access.

    But leaving that aside for a moment: You also need cycles and
    instructions to set up that table. Is that less than loading the base address of an array into a register?

    No, but it only happens once at the beginning.

    As I've noted, I agree that putting the start address in a register is
    better. When you can do that. When there are enough registers available.
    But what if you have more arrays than you have registers?

    Although you gave me an idea, as I noted, so I have fixed that so it
    won't happen nearly as often. By making use of another silly feature
    of the Concertina II architecture: the Itanium had 128 registers, so
    as to make the pipeline go faster, so I gave the Concertina extended
    register banks of 128 registers each and special instructions to use
    them as well. Since it isn't always practical to generate superscalar
    code for every application that makes full use of all those registers...
    well, now they can also be used as array pointers.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Sun Jul 27 07:37:50 2025
    From Newsgroup: comp.arch

    On Sat, 26 Jul 2025 23:18:58 -0500, BGB wrote:

    12 bits would be plenty, if it were scaled.
    Unscaled displacements effectively lose 2 or 3 bits of range for no real benefit.

    I had tried, in a few places, to allow the values in index registers to
    be scaled. As this isn't a common feature in CPU designs, though, I didn't
    use the opcode space to indicate this scaling in most instructions.

    But scaling *displacements* is an idea that had not even occurred to me.

    There is _one_ important benefit of not scaling the displacements which is
    made very evident by certain other characteristics of the Concertina II architecture.

    The System/360 had 12-bit displacements that were not scaled. As a result,
    when writing code for the System/360, it was known that each base register
    that was used would provide coverage of 4,096 bytes of memory. No more,
    no less, regardless of what type of data you referenced. So you knew when
    to allocate another base register with the address plus 4,096 in it if
    needed.
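
That System/360 convention can be put numerically; the helper below is
my own illustration (not S/360 code) of how many base registers an
unscaled 12-bit displacement costs per region:

```python
# With an unscaled 12-bit displacement, each base register covers
# exactly 4,096 bytes, so covering a region takes ceil(size/4096)
# base registers.

def bases_needed(region_size, disp_bits=12):
    coverage = 1 << disp_bits            # 4,096 bytes per base register
    return -(-region_size // coverage)   # ceiling division

print(bases_needed(4096))     # -> 1
print(bases_needed(65536))    # -> 16  (a full 64K-byte segment)
```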

In the Concertina II, I actually take the 32 registers, and in addition to
taking the last seven of a group of eight as base registers for use with
16-bit displacements, I go with a different group of eight registers for
use with 12-bit displacements, and a different one still for 20-bit
displacements.

Because there is no value in being able to access only the first 4,096
bytes of a 65,536-byte region of memory. So instead of having addressing
modes that do that silly thing, I have more base registers available; if
you run out of 65,536-byte regions of memory to allocate, at least you can
also allocate additional 4,096-byte regions of memory, and that might be
useful.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Jul 27 02:55:11 2025
    From Newsgroup: comp.arch

    On 7/27/2025 12:10 AM, Robert Finch wrote:
    On 2025-07-27 12:18 a.m., BGB wrote:
    On 7/26/2025 3:48 AM, Thomas Koenig wrote:
    John Savard <quadibloc@invalid.invalid> schrieb:

    1) 16-bit displacements are _important_. Pretty well *all*
    microprocessors
    use 16-bit displacements, rather than anything shorter like 12 bits.

    That is a bit of an exaggeration - SPARC has 13-bit constants, RISC-V
    has 12-bit constants.


    Or, XG2/XG3, and SH-5: 10-bits.

    12 bits would be plenty, if it were scaled.
    Unscaled displacements effectively lose 2 or 3 bits of range for no
    real benefit.


    Though, for 32-bit instructions, 12 bits is a sensible size with 5 bit
    register fields. Whereas 10 bits makes more sense for 6-bit registers.
    Going much smaller than this will result in a sharp increase in miss
    rate though.

    So, for example, 6 or 7 displacement bits would be too few.


    As can be noted, the "best case" for scaled displacements was
    seemingly 9 bits unsigned, 10 bits signed.
    Where, in terms of hit-rate: 9u>9s, but 10u<10s; since a slight
    majority of those that miss the 9u range were also negative.


    If using unscaled displacements (like RISC-V) the needed range is
    closer to 13 bits.

    One possible downside of scaled displacements is that they can't
    directly encode a misaligned load or store. But this is rare.

In my more recent ISA designs (namely XG3), I had made it so the Disp33
encodings can also encode an unscaled displacement.

    In this case, there is effectively a 1 bit scale selector:
       0: Use the element size as the scale;
       1: Use a Byte scale.

    The need for misaligned displacements being rare enough that needing
    to jumbo-encode them isn't all that much of an issue.

    As noted, the only available displacement sizes here are 10 and 33 bits.
       10 covers the vast majority of load/store;
       33 covers everything else.


    There doesn't seem to be much practical need for larger displacements
    than 33 in the general case. As I see it, it is acceptable to have any
    larger displacement addressing decay to using general purpose ALU
    instructions.

    In my case, there is generally no absolute addressing mode.
       On 64 bit machines, absolute addressing no longer makes as much sense.
       ...


    Even
    though this meant, in most cases, they had to give up indexing.

    SPARC has both Ra+Rb and Ra+immediate, but not combined.  The use
    case for Ra+Rb+13..16 bit is extremely limited.


    Yeah.

    Better IMHO to treat [Base+Disp] and [Base+Index*Sc] as two mutually
    exclusive cases in terms of the encoding scheme.

    The gains of full [Base+Index*Sc+Disp] aren't really worth the cost;
    mostly because cases where this is applicable tend to be statistically
    infrequent.


    Even if one used larger encodings for these, it is still debatable
    whether the gains are worth the added implementation cost.


    On the other side, things like array accesses are common enough that
    in many cases, RISC-V effectively shoots itself in the foot by lacking
    [Rb+Ri*Sc] addressing.

    Well, and while not quite as big of a deal in terms of static
    instruction counts, this addressing mode tends to have a very high
    probability of being used inside loops; so has an oversized impact on
    performance.


    Though, there was some murmuring that at least some people in RISC-V
    land are considering adding it.


    2) Index registers were hailed as a great advancement in computers
    when they were added to them, as they allowed avoiding self-modifying
    code.

    GP registers were an even greater achievement.


    Yeah.


    Of course, if one just has base registers, one can still have
    a special base register, used for array accessing, where the base
    address of a segment has had the array offset added to it through
    separate arithmetic instructions.

    The usual method is Ra+Rb.  Mitch also has Ra+Rb<<n+32-bit or
    Ra+Rb<<n+64-bit, with n from 0 to 3.  This is useful for addressing
    global data encoded in the constant, which a 12 to 16 bit-offset
    is not.


    A 16-bit offset can often be useful for globals if using a Global
    Pointer, but not useful for PC-rel (and not worth it for general Ld/St
    ops).


    But, this is niche, and can effectively be limited in scope only to 32
    and 64 bit items with a hard-coded base register.

    This can then access the first 256K or 512K of ".data", which is
    typically sufficient for global variables (though, any commonly-
    accessed globals may need to be promoted to being in ".data"
    regardless of whether or not they are initialized).

    Though, even with separate compilation, this is probably something a
    linker could figure out (just that "common data" could be put either
    in ".data" or ".bss", rather than only in ".bss").


    I would not recommend Disp16 for normal 32-bit Ld/St ops as this
    wastes too much encoding space.


    Array accesses are common, and not needing extra instructions for
    them is
    therefore beneficial.

    Yes, and indexing without offset can do that particular job just
    fine.


    Agreed.

    Where:
       [Rb+Ri*Sc]
    Is fairly common.

    But:
       [Rb+Ri*sc+Disp]
    Is much less commonly needed IME.

    Although much less commonly needed, there is [Rb+Ri*Sc+Disp24] in my
    current design, as the scaled index takes another instruction word and
    the 24 bits would otherwise be wasted.

    I think it depends somewhat on the instruction encoding what is
    efficient. Given the number of transistors available today, even less
    commonly needed functionality could be considered.


    I had experimented with similar before, although it was in the form of:
    [Rb+Ri*Sc+Disp13]

    With a 64-bit jumbo encoded form.


    Tried to get my compiler to use it in a few cases:
    Array on the stack, where the array isn't already in a register;
    Array inside struct;
    ...

    But, usage frequency was at best fairly low.



    3) At least one major microprocessor manufacturer, Motorola, did have
    base-index addressing with 16-bit displacements, Motorola, starting
    with
    the 68020.

    Mitch recently explained that they had microarchitectural reasons.


    The M68K line is probably not a great design reference, as it has such wonk.

    I think the 68k is not that bad a reference design to study because it
    shows both what to do and what not to do in a single design.
    Understanding drawbacks can help a lot.

    I learned a lot from studying the 6502 and asking why did they do that?


    OK.

    I guess it can make sense for comparing tradeoffs, but, say, "M68K
    did it, therefore it is a good idea" is a little weak.


    Then again, M68K could be considered as part of a family that seems to
    have largely taken design inspiration from the PDP-11.

    Well, along with the MSP430 and SuperH.
    Then I looked around and noted something.
    SH was designed as a 32-bit redesign of the Hitachi H8 microcontroller;
    The H8 design was itself partly based on the PDP-11.

    Implying that some of the design similarities might not be coincidence.

    Well, and by implication, there is sort of an indirect evolutionary through-line connecting my ISA design efforts back to the PDP-11.

    Looking:
      PDP-11:
        zzzz-yyy-mmm-yyy-nnn
        Regs: R0..R7 (16b), R6=SP, R7=PC
        Condition Codes
          Bcc, Disp8
      H8:
        zzzz-zzzz-mmmm-nnnn
        Regs: R0..R7 (16b), R7=SP
          Can address High/Low Bytes;
          Paired registers
        Condition Codes
          Bcc, Disp8
      SH (and BJX1):
        zzzz-nnnn-mmmm-zzzz
        Regs: R0..R15 (32b), R15=SP
        S and T bits
          BT/BF, Disp8
      XG1 and XG2:
        XG1:
          16b: zzzz-zzzz-nnnn-mmmm
          32b: 111p-zwzz-nnnn-mmmm zzzz-qnmo-oooo-zzzz
        XG2:
          NMOp-zwzz-nnnn-mmmm zzzz-qnmo-oooo-zzzz
        Regs (XG1/XG2):
          R0..R31 (64b), R15=SP
          R32..R63 (XG1-XGPR and XG2)
        T bit
          BT/BF, Disp8 | Disp20
      XG3:
        zzzzz-oooooo-mmmmmm-zzzz-nnnnnn-yyyypp
        Regs:
          R0..R31
            R0..R3: ZR, LR, SP, GP
          R32..R63 (F0..F31)
        T bit (optional)
          BT/BF, Disp23 (optional)
        XG3 was reworked to resemble RISC-V,
          but does not break the evolutionary path per se.

    Some instructions have similar names, but differ in terms of behavior:
      PDP-11: JSR / RTS: Link goes on Stack.
      H8: JSR / RTS: Link goes on Stack.
      SH: JSR / RTS: Link goes in PR (register)
        BSR, Disp12 (+/- 4K)
      XG1/XG2:
        JSR / RTS: Link goes in LR (register)
        BSR, Disp20 (+/- 1MB)
        BSR, Disp23 (XG2, +/- 8MB)
      XG3:
        JSR / RTS: Pseudo-Instructions over JALR
          Link goes in LR
        BSR, Disp23 (+/- 16MB)
          Its BSR is hard-wired to LR, unlike RV's JAL.


    XG3 sort of has an identity crisis between the PDP-derived elements and
    the RV/MIPS-derived elements, along with two wildly different ASM syntax
    styles. Though, admittedly, with BGBCC I had still mostly been using the
    previous AT&T-style ASM syntax, e.g.:
      MOV.L (R18), R10


    The M68K is part of a similar family, also borrowing design elements
    from the PDP-11, though it had its own wonk as well, like separate
    Address and Data registers.

    So, M68K:
      zzzz-mmm-yyy-yyy-nnn
      Regs:
        D0..D7
        A0..A7
        A7=SP
      Condition Codes
      JSR / RTS: Link goes on Stack.
      Bcc, Disp8/Disp16/Disp32
        BRA/BSR, Disp8/Disp16/Disp32
        Escape-Coded Scheme, Disp==0 escapes to Disp16, ...

    In PDP-11 and M68K, instructions may encode an addressing mode for each register, allowing Reg/Mem and Mem/Mem operation.

    M68K seemingly renamed MOV to MOVE.
      Looks like ASM examples lack the '%' on register names.
        Seemingly, '%' prefixed register names were mostly a GAS thing?...


    Contrast, H8, SH, and my BJX2 ISA went over to being Load/Store.
      PDP-11, H8, SH, and M68K have auto-increment addressing.
        Though, I had dropped auto-increment in BJX2.
        Had also been dropped in some of my intermediate subsets (*1).


    *1: B32V and BSR1 represented the transition between SH and what would
    become BJX2. B32V was basically an SH variant with nearly all
    non-essential features removed. BSR1 reorganized the encoding space and
    (after 32-bit encodings were added) became the core of BJX2.

    Something like:
      MOV.L (R8)+, R4  //auto-increment load
    still exists in these later ISA variants as a pseudo-instruction.

    Though, auto-increment notation is currently unusable for the RISC-V or
    XG3 targets. Could re-add the pseudo instructions if it mattered though.

    ...



    (3) may not be much of an argument, but it seems to me that (1) and (2)
    can reasonably be considered fairly strong arguments. But what about the
    drawbacks?

    I have not seen a strong argument for Ra+Rb+16 bit in what you wrote
    above.

    Agreed.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Jul 27 08:17:00 2025
    From Newsgroup: comp.arch

    John Savard <quadibloc@invalid.invalid> schrieb:
    On Sat, 26 Jul 2025 13:34:19 +0000, Thomas Koenig wrote:

    Do you really want to have an extra memory access to access an array
    element, instead of loading the base address of your array into a
    register?

    Obviously, loading the starting address of the array into a register is preferred.

    Then do so, there is no need to add something more complicated than
    necessary.

    There are seven registers used as base registers with 16-bit displacements, and another seven registers used as base registers with 12-bit displacements.

    I have not seen the use case for R1+R2+16 bit.

    So, in a normal program that uses only one each of the former group of base registers for its code and data segments respectively, there are enough registers to handle twelve arrays.

    Simultaneously?

    Array Mode is only for use when it's needed, because a program is either dealing with a lot of large arrays, or if it is under extreme register pressure in addition to dealing with some large arrays.

    s/only for use when it's/not/
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Jul 27 08:18:08 2025
    From Newsgroup: comp.arch

    John Savard <quadibloc@invalid.invalid> schrieb:
    On Sat, 26 Jul 2025 13:34:19 +0000, Thomas Koenig wrote:

    If you want this to actually be useful, do as Mitch has done, and go a
    full 32-bit constant.

    Well, I would actually need to use 64 bits if I wanted to include
    full-size memory addresses in instructions.

    OK, Mitch has this.

    I think that is too long.

    Too long for what? It's simpler than any of the alternatives, and
    even x86_64 (which you are aiming to surpass in complexity, it seems)
    has 64-bit constants.
  • From BGB@cr88192@gmail.com to comp.arch on Sun Jul 27 04:23:23 2025
    From Newsgroup: comp.arch

    On 7/27/2025 3:18 AM, Thomas Koenig wrote:
    John Savard <quadibloc@invalid.invalid> schrieb:
    On Sat, 26 Jul 2025 13:34:19 +0000, Thomas Koenig wrote:

    If you want this to actually be useful, do as Mitch has done, and go a
    full 32-bit constant.

    Well, I would actually need to use 64 bits if I wanted to include
    full-size memory addresses in instructions.

    OK, Mitch has this.

    I think that is too long.

    Too long for what? It's simpler than any of the alternatives, and
    even x86_64 (which you are aiming to surpass in complexity, it seems)
    has 64-bit constants.

    My own ISA also has 33 and 64-bit constants via jumbo prefixes.

    Only no absolute addressing mode (but can be faked if needed), and no
    large displacements (though, it is more a case that the need for larger displacements doesn't justify the cost).


    Technically, there is nothing really in the encoding scheme stopping
    adding an Absolute addressing mode or Disp64 encodings though.


  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Jul 27 09:58:45 2025
    From Newsgroup: comp.arch

    John Savard <quadibloc@invalid.invalid> schrieb:
    On Sun, 27 Jul 2025 07:15:02 +0000, Thomas Koenig wrote:

    Doable how, in such a way that both memory accesses are as fast as a
    single one with the obvious scheme?

    Cache is storage inside the processor chip. So are registers. Cache will
    be slower, but not by as much as a normal memory access.

    A "normal memory access" is hundreds of cycles. If you are operating
    on data outside the cache. that is a whole different game.

    What I am looking at is an inner loop.

    But leaving that aside for a moment: You also need cycles and
    instructions to set up that table. Is that less than loading the base
    address of an array into a register?

    No, but it only happens once at the beginning.

    Can you actually provide example code where this would matter?

    As I've noted, I agree that putting the start address in a register is better. When you can do that. When there are enough registers available.
    But what if you have more arrays than you have registers?

    One instruction per loop, one cycle latency per loop startup
    as opposed to 3-5 cycles latency on every loop iteration.

    If there is a universe and time in which this makes sense, it's
    not ours in 2025.
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Jul 27 06:43:31 2025
    From Newsgroup: comp.arch

    On 2025-07-27 3:37 a.m., John Savard wrote:
    On Sat, 26 Jul 2025 23:18:58 -0500, BGB wrote:

    12 bits would be plenty, if it were scaled.
    Unscaled displacements effectively lose 2 or 3 bits of range for no real
    benefit.

    I had tried, in a few places, to allow the values in index registers to
    be scaled. As this isn't a common feature in CPU designs, though, I didn't use the opcode space to indicate this scaling in most instructions.

    But scaling *displacements* is an idea that had not even occurred to me.

    I think ARM has displacements that can be shifted, IIRC.

    There is _one_ important benefit of not scaling the displacements which is made very evident by certain other characteristics of the Concertina II architecture.

    The System/360 had 12-bit displacements that were not scaled. As a result, when writing code for the System/360, it was known that each base register that was used would provide coverage of 4,096 bytes of memory. No more,
    no less, regardless of what type of data you referenced. So you knew when
    to allocate another base register with the address plus 4,096 in it if needed.

    4096 bytes may not be enough for some apps. I think a larger
    displacement would be better than multiple base registers.


    In the Concertina II, I actually take the 32 registers, and in addition to taking the last seven of a group of eight as base registers for use with 16-bit displacements, I go with a different group of eight registers for
    use with 12-bit displacements, and a different one still for 20-bit displacements.

    I take this to mean some bits of the displacement and register field are
    traded off. If there is a 20-bit displacement available with eight
    registers, that is 23 bits; enough for 32 GPRs with an 18-bit
    displacement. 20 bits is a lot of global data.

    Different groups of registers for different displacements go against
    the idea of GPRs. It would make the compiler trickier to write.

    Because there is no value in being only able to access the first 4,096
    bytes of a 65,536 byte region of memory. So instead of having addressing modes that do that silly thing, I have more base registers available; if
    you run out of 65,536-byte regions of memory to allocate, at least you can also allocate additional 4,096-byte regions of memory, and that might be useful.

    John Savard

  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Jul 27 06:50:17 2025
    From Newsgroup: comp.arch


    My current design fuses a max of one memory op into instructions instead
    of having a load followed by the instruction (or an instruction followed
    by a store). Addressing modes available without adding instruction words
    are Rn, (Rn), (Rn)+, and -(Rn). After that, 32-bit instruction words are
    added to support 32 and 64-bit displacements or addresses.

    The instructions with the extra displacement words are larger, but there
    are fewer instructions to execute.
      LOAD Rd,[Rb+Disp16]
      ADD  Rd,Ra,Rd
    requires two instruction words and executes as two instructions; it gets
    replaced with:
      ADD  Rd,Ra,[Rb+Disp32]
    which also takes two instruction words, but only one instruction.

    Immediate operands and memory operands are routed according to two
    two-bit routing fields. I may be able to compress this to a single
    three-bit field.

    Typical instruction encoding:
      ADD: oooooo ss xx ii ww mmm rrrrr rrrrr ddddd
        oooooo: the opcode
        ss: the operation size
        xx: two extra opcode bits
        ii: indicates which register field represents an immediate value
        ww: indicates which register field is a memory operand
        mmm: the addressing mode, similar to the 68k
        rrrrr: source register spec (or 4+ bit immediate)
        ddddd: destination register spec (or 4+ bit immediate)
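    The listed field widths sum to exactly 32 bits. A small pack helper
    (my own sketch, not an official encoder for this design) makes the
    layout concrete:

```python
def pack(op, ss, xx, ii, ww, mmm, ra, rb, rd):
    """Pack the fields described above into a 32-bit word,
    opcode in the top bits, destination in the bottom bits."""
    word = 0
    for value, width in [(op, 6), (ss, 2), (xx, 2), (ii, 2),
                         (ww, 2), (mmm, 3), (ra, 5), (rb, 5), (rd, 5)]:
        assert 0 <= value < (1 << width)  # each field must fit its width
        word = (word << width) | value
    return word  # 6+2+2+2+2+3+5+5+5 = 32 bits total

print(pack(0, 0, 0, 0, 0, 0, 0, 0, 1))  # 1: rd lands in the low 5 bits
```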

    A 36-bit opcode would work great, allowing operand sign control.



  • From John Savard@quadibloc@invalid.invalid to comp.arch on Sun Jul 27 10:51:43 2025
    From Newsgroup: comp.arch

    On Sun, 27 Jul 2025 07:37:50 +0000, John Savard wrote:

    But scaling *displacements* is an idea that had not even occurred to me.

    I realize now that I wasn't thinking clearly when I said that. In fact,
    I used scaled displacements extensively in some previous iterations of
    the Concertina II architecture.

    I realized this when I noticed another problem with scaled displacements:
    they only allow aligned data to be addressed.

    But instead of scaling a displacement of constant size, in order to
    maintain the size of the segment to which a given base register points
    as a constant, I varied the size of the displacement based on the
    degree to which it was scaled...

    so if I had a 16-bit displacement for references to bytes, then I had
    a 15-bit one to address 16-bit halfwords, a 14-bit one to address 32-bit
    words, and a 13-bit one to address 64-bit long integers. This let me
    use only two sets of opcodes, rather than one for each variable type,
    for memory-reference instructions on integers.
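    Each width/scale pairing above covers the same 65,536-byte segment,
    which a quick check confirms:

```python
# 16-bit byte disp, 15-bit halfword disp, 14-bit word disp, 13-bit
# long disp: the field shrinks as the scale grows, so coverage is fixed.
for bits, size in [(16, 1), (15, 2), (14, 4), (13, 8)]:
    assert (2 ** bits) * size == 65536

print("all four cover 64 KiB")
```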

    John Savard
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Jul 27 11:20:01 2025
    From Newsgroup: comp.arch

    BGB wrote:

    One possible downside of scaled displacements is that they can't
    directly encode a misaligned load or store. But this is rare.

    Scaled displacements are also restricted in what they can address for
    RIP-rel addressing if the RIP has alignment restrictions
    (instructions aligned on 16 or 32 bits).


  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Jul 27 15:31:20 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> schrieb:
    BGB wrote:

    One possible downside of scaled displacements is that they can't
    directly encode a misaligned load or store. But this is rare.

    Scaled displacements are also restricted in what it can address for
    RIP-rel addressing if the RIP has alignment restrictions
    (instructions aligned on 16 or 32 bits).

    Scaled displacements also create asymmetries. Consider

      struct foo {
          int a;
          char b,c,d,e;
      };

    Accessing a might work, but b, c, d or e not.
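    Concretely, reusing the hypothetical 9-bit unsigned scaled field from
    earlier in the thread: a 4-byte load reaches four times as many bytes
    as a byte load, so at some struct offsets the int encodes directly
    while the chars do not.

```python
def reach(bits, scale):
    """Max byte offset encodable in an unsigned scaled displacement."""
    return ((1 << bits) - 1) * scale

word_reach = reach(9, 4)  # 2044 bytes for int-sized loads
byte_reach = reach(9, 1)  # 511 bytes for char-sized loads

# struct foo at byte offset 1000 from the base register:
print(1000 <= word_reach)  # True:  'a' (offset 1000) is reachable
print(1004 <= byte_reach)  # False: 'b' (offset 1004) is not
```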
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Sun Jul 27 11:59:00 2025
    From Newsgroup: comp.arch

    One possible downside of scaled displacements is that they can't directly encode a misaligned load or store. But this is rare.

    Much more frequent is misaligned pointers to aligned data, which are
    used quite commonly in dynamically-typed languages when you use
    a non-zero tag for boxed objects.
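One classic way base-plus-displacement addressing helps here: the constant tag can be folded into the displacement, so no masking instruction is needed. A hedged C model (TAG = 1 is an illustrative choice, not any particular runtime's scheme):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TAG 1  /* assumed low-bit tag on boxed-object pointers */

/* Load the 32-bit field at byte offset k of a tagged object: the
   effective address is base + (k - TAG), so the tag is cancelled by
   the displacement rather than by a separate masking instruction. */
static uint32_t load_field(const unsigned char *tagged_ptr, long k)
{
    uint32_t v;
    memcpy(&v, tagged_ptr + (k - TAG), sizeof v);
    return v;
}
```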

    I think it would be in keeping with John's tradition to "not choose" and instead use one bit to indicate whether the displacement should
    be scaled.


    Stefan
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Jul 27 11:56:27 2025
    From Newsgroup: comp.arch

    On 7/27/2025 10:20 AM, EricP wrote:
    BGB wrote:

    One possible downside of scaled displacements is that they can't
    directly encode a misaligned load or store. But this is rare.

Scaled displacements are also restricted in what they can address for
RIP-relative addressing if the RIP has alignment restrictions
(instructions aligned on 16 or 32 bits).


    It depends...

For example, the SuperH ISA had a PC-relative DWORD load, 16-bit instructions, an offset in DWORDs, and a requirement that loads be aligned.

In this specific case, the strategy was basically to mask off the
low-order bits of PC, pretending that PC was always 32-bit aligned.

    ...
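A sketch of that strategy (the +4 pipeline-visible PC and 8-bit word-scaled displacement follow the SH MOV.L @(disp,PC),Rn description, but treat the details here as illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* SuperH-style PC-relative longword load: the pipeline-visible PC
   (instruction address + 4) is masked down to a 32-bit boundary
   before the word-scaled displacement is added. */
static uint32_t sh_pcrel_long_ea(uint32_t insn_addr, uint8_t disp)
{
    uint32_t pc = insn_addr + 4;
    return (pc & ~3u) + ((uint32_t)disp << 2);
}
```

The masking makes the target independent of whether the instruction itself sits on a 2-byte or 4-byte boundary.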

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Sun Jul 27 17:51:39 2025
    From Newsgroup: comp.arch

    On Sun, 27 Jul 2025 11:59:00 -0400, Stefan Monnier wrote:

    I think it would be in keeping with John's tradition to "not choose" and instead use one bit to indicate whether the displacement should be
    scaled.

Since the way I had used scaled displacements was to shorten the amount
of opcode space needed by varying the displacement length based on the
variable type... I would use a bit to indicate scaled displacements, but
it would be in the header, to note which instruction set I'm using in
this block.

    In one earlier version of Concertina II, however, along with a full set of aligned load-store instructions, I allocated some opcode space to
    load-store instructions with full-width plain displacements - which could
    only work with the first eight of the thirty-two registers. So the
    programmer did have both alternatives available.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Jul 27 13:24:57 2025
    From Newsgroup: comp.arch

    On 7/27/2025 10:59 AM, Stefan Monnier wrote:
    One possible downside of scaled displacements is that they can't directly
    encode a misaligned load or store. But this is rare.

    Much more frequent is misaligned pointers to aligned data, which are
    used quite commonly in dynamically-typed languages when you use
    a non-zero tag for boxed objects.

    I think it would be in keeping with John's tradition to "not choose" and instead use one bit to indicate whether the displacement should
    be scaled.


    Possible as well.


In my ISA, getting this scenario to work (if the LOBs for pointers are
not 0s) would require masking. In a traditional tagging scheme, object
pointers usually had low bits 0, leaving non-zero mostly for value
types, say:
    00: Object Pointer
        000: Plain Object
        100: CONS
    01: Fixnum
    10: Flonum
    11: Other small value types

    Though, in this scheme, one does have to go through a ritual to figure
    out what type of object is pointed to.
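The "ritual" for the low-bit scheme above might look like this (tag values taken from the table; helper names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Low-bit tag checks: bits 1:0 split the broad classes, and bit 2
   further splits object pointers into plain vs CONS. */
static int is_object(uint64_t v)  { return (v & 3) == 0; }
static int is_cons(uint64_t v)    { return (v & 7) == 4; }
static int is_fixnum(uint64_t v)  { return (v & 3) == 1; }
static int is_flonum(uint64_t v)  { return (v & 3) == 2; }
static int64_t fixnum_value(uint64_t v) { return (int64_t)v >> 2; }
```

Fixnums decode with a single arithmetic shift; untagging a CONS pointer just means clearing bit 2 (or folding the constant 4 into a displacement).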


In my ABI, I had instead gone with the high-order bits for type tags,
with most load/store ops effectively ignoring the high-order 16 bits of
a 64-bit address.

For object-pointer types, bits (59:48) generally encoded the object type ID:
    Tag 0/000: Special, used for plain C pointers.
    High-4 tags 3..7 and 8..B were used for Fixnum and Flonum.
    Otherwise, the 12-bit type ID maps to a string giving the actual
    type name.
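A sketch of the high-bit layout described (bit positions per the text; function names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* High-order tagging: load/store ignores bits 63:48; bits 59:48 hold a
   12-bit object type ID (0 = plain C pointer), while the top 4 bits
   select broad classes such as fixnum (3..7) and flonum (8..B). */
static uint64_t addr48(uint64_t v)  { return v & 0x0000FFFFFFFFFFFFull; }
static unsigned type_id(uint64_t v) { return (unsigned)((v >> 48) & 0xFFF); }
static unsigned high4(uint64_t v)   { return (unsigned)(v >> 60); }
static int is_fixnum_hi(uint64_t v) { return high4(v) >= 3 && high4(v) <= 7; }
```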



Though, some special ops, like LDTEX, made use of the HOBs (the low 48
bits encoded the base address, with the high 16 bits encoding the
texture size and block format).

    Note that texture sizes for LDTEX were limited to powers of 2, and only
    square or rectangular images with certain patterns, so it worked.
    64x64, 128x64, 128x128, 256x128, 256x256, ...
    A texture size like 128x256 could be handled by flipping the U/V coords.
    But, 256x64 or similar could not be handled.

Note that in the GL API, typically DXT1 or DXT5 was used, but internally
I used a format called UTX2 that could mimic both. It supported more
features than DXT1, but fewer than DXT5 (so generic DXT5 content would
lose quality), and used RGB555 rather than RGB565.

Part of the reason for the internal UTX2 format was that the logic for
S3TC textures would have been more expensive. Also, BC7 would be notably
more expensive than either.

I also had an internal UTX3 format which was an intermediate between
DXT5 and BC7. It could address BC7 and BC6H use cases.


Format for UTX2 was basically:
    16 bits: ColorA (RGB555)
    16 bits: ColorB (RGB555)
    32 bits: Selector (4x4x2b)
With 4 modes:
    00: Opaque (2-bit color interpolation)
    01: C/A selectors (1 bit color, 1 bit alpha/transparency)
    10: Mimic DXT1's behavior (A, B, (A+B)/2, Transparent)
    11: Translucent (2-bit combined interpolation)
Modes 0/2: Color is RGB555.
Modes 1/3: Color is essentially RGB444A3.
It was ambiguous whether 1/3 + 2/3 or 3/8 + 5/8 interpolation weights
were used; early versions assumed 1/3 + 2/3, but later ones mostly
assume 3/8 + 5/8.
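As a concrete (and partly assumed) illustration, extracting one texel's selector from a UTX2 block; the row-major, LSB-first bit order within the selector word is my assumption, not something the description fixes:

```c
#include <assert.h>
#include <stdint.h>

/* Fetch the 2-bit selector for texel (x,y) of a 4x4 UTX2 block from
   its 32-bit selector word (16 entries of 2 bits each). */
static unsigned utx2_selector(uint32_t sel_word, int x, int y)
{
    int idx = y * 4 + x;               /* assumed row-major order */
    return (sel_word >> (idx * 2)) & 3u;
}
```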

Format for UTX3 was basically:
    32 bits: ColorA (RGBA32, or 4x FP8U)
    32 bits: ColorB (RGBA32, or 4x FP8U)
    32 bits: RGB selector (4x4x2b)
    32 bits: A selector
UTX3 always interpolates. FP8 values are handled post-interpolation (*1).

    *1: Well, actually, much of the HDR path was itself a tweaked RGBA32 LDR
    path. This could greatly reduce the cost of supporting HDR.

    Unlike traditional OpenGL, it assumes RGBA for HDR, but the A handling
    is a little wonky, with A only handling unit-range values.

Essentially, A was shifted left by 1 bit, and then used similarly to the
LDR alpha channel, with some biasing/fudging so that 1.0 = 255.



Operations like LDTEX are a bit niche and not considered part of the
core ISA, though; they mostly exist to make the software OpenGL backend
faster.

It can use a 64-bit jumbo encoding, with a few bits to select which
corner of the texel one wants. The interpolation (for "Poor man's
Bilinear") was then done using SIMD instructions: rather than proper
bilinear filtering, it used a 3-texel scheme partly inspired by the one
used on the Nintendo 64.


Ironically, some of the quirks and limitations of my FPU were similar
to those of the N64 FPU, which didn't quite do full IEEE-754 in
hardware, instead using traps so software could emulate the full
semantics when needed. But N64 software mostly disabled the FPU traps.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sun Jul 27 22:29:13 2025
    From Newsgroup: comp.arch

    On 7/27/2025 3:50 AM, Robert Finch wrote:

    big snip

    First, thanks for posting this. I don't recall you posting much about
    your design. Can you talk about its goals, why you are doing it, its
    status, etc.?

    Specific comments below
My current design fuses a max of one memory op into instructions instead
of having a load followed by the instruction (or an instruction followed
by a store). Addressing modes available without adding instruction words
are Rn, (Rn), (Rn)+, -(Rn). After that, 32-bit instruction words are
added to support 32- and 64-bit displacements or addresses.

The combined mem-op instructions used to be popular, but since the RISC
revolution, are now out of fashion. Their advantages are, as you state,
often eliminating an instruction. The disadvantages include that they
preclude scheduling the load earlier in the instruction stream. Do you
"crack" the instruction into two micro-ops in the decode stage? What
drove your decision to "buck" the trend? I am not saying you are wrong.
I just want to understand your reasoning.



    The instructions with the extra displacement words are larger but there
    are fewer instructions to execute.
      LOAD Rd,[Rb+Disp16]
      ADD Rd,Ra,Rd
    Requiring two instruction words, and executing as two instructions, gets replaced with:
      ADD Rd,Ra,[Rb+Disp32]
    Which also takes two instruction words, but only one instruction.

Immediate operands and memory operands are routed according to two
two-bit routing fields. I may be able to compress this to a single
three-bit field.

    Yes. For example, unless you are doing some special case thing, having
    both registers be immediates probably doesn't make sense. And unless
    you are going to allow two memory references in one instruction, that combination doesn't make sense.


    Typical instruction encoding:
    ADD: oooooo ss xx ii ww mmm rrrrr rrrrr ddddd
    oooooo: is the opcode
    ss: is the operation size
    xx: is two extra opcode bits
    ii: indicates which register field represents an immediate value
    ww: indicates which register field is a memory operand
    mmm: is the addressing mode, similar to a 68k
    rrrrr: source register spec (or 4+ bit immediate)
    ddddd: destination register spec (or 4+ bit immediate)
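Packing the fields as listed (MSB-first, which is an assumption; the post doesn't fix the bit order) gives exactly 32 bits: 6+2+2+2+2+3+5+5+5. A sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Pack the ADD-style encoding: oooooo ss xx ii ww mmm rrrrr rrrrr ddddd,
   assumed MSB-first, totaling 32 bits. */
static uint32_t encode(unsigned op6, unsigned ss, unsigned xx,
                       unsigned ii, unsigned ww, unsigned mmm,
                       unsigned rs1, unsigned rs2, unsigned rd)
{
    return ((op6 & 63u) << 26) | ((ss & 3u) << 24) | ((xx & 3u) << 22)
         | ((ii & 3u) << 20) | ((ww & 3u) << 18) | ((mmm & 7u) << 15)
         | ((rs1 & 31u) << 10) | ((rs2 & 31u) << 5) | (rd & 31u);
}
```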

    A 36-bit opcode would work great, allowing operand sign control.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Mon Jul 28 05:47:45 2025
    From Newsgroup: comp.arch

    On 2025-07-28 1:29 a.m., Stephen Fuld wrote:
    On 7/27/2025 3:50 AM, Robert Finch wrote:

    big snip

    First, thanks for posting this.  I don't recall you posting much about
    your design.  Can you talk about its goals, why you are doing it, its status, etc.?

Just started the design. Lots of details to work out. I like some
features of the 68k and 66k. I have some doubts about starting a new
design; I would prefer to use something existing. I am not terribly fond
of RISC designs though.

    Specific comments below
My current design fuses a max of one memory op into instructions
instead of having a load followed by the instruction (or an
instruction followed by a store). Addressing modes available without
adding instruction words are Rn, (Rn), (Rn)+, -(Rn). After that, 32-bit
instruction words are added to support 32- and 64-bit displacements or
addresses.

The combined mem-op instructions used to be popular, but since the RISC
revolution, are now out of fashion. Their advantages are, as you state,
often eliminating an instruction. The disadvantages include that they
preclude scheduling the load earlier in the instruction stream. Do you
"crack" the instruction into two micro-ops in the decode stage? What
drove your decision to "buck" the trend? I am not saying you are wrong.
I just want to understand your reasoning.

Instructions will be cracked into micro-ops. My compiler does not do
instruction scheduling (yet), relying instead on the processor to
schedule instructions. There are explicit load and store instructions,
which should allow scheduling loads earlier in the instruction stream.
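Cracking such a fused mem-op at decode might be modeled like this (purely illustrative data structures, not the actual decoder):

```c
#include <assert.h>

/* A fused "ADD Rd,Ra,[Rb+disp]" cracks into a load micro-op writing a
   hidden temporary, then the ALU micro-op consuming it; the load
   micro-op can issue as soon as its address register is ready. */
typedef struct { const char *op; int dst, src1, src2; long disp; } Uop;

static int crack_mem_add(int rd, int ra, int rb, long disp, Uop out[2])
{
    enum { TMP = 32 };                    /* hidden temp register number */
    out[0] = (Uop){ "LOAD", TMP, rb, -1, disp };
    out[1] = (Uop){ "ADD",  rd,  ra, TMP, 0  };
    return 2;                             /* micro-ops produced */
}
```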

I am under the impression that with a micro-op-based processor, the ISA
style (RISC/CISC) becomes somewhat less relevant, allowing more
flexibility in the ISA design.


    The instructions with the extra displacement words are larger but
    there are fewer instructions to execute.
       LOAD Rd,[Rb+Disp16]
       ADD Rd,Ra,Rd
    Requiring two instruction words, and executing as two instructions,
    gets replaced with:
       ADD Rd,Ra,[Rb+Disp32]
    Which also takes two instruction words, but only one instruction.

    Immediate operands and memory operands are routed according to two
    two- bit routing fields. I may be able to compress this to a single
    three-bit field.

    Yes.  For example, unless you are doing some special case thing, having both registers be immediates probably doesn't make sense.  And unless
    you are going to allow two memory references in one instruction, that combination doesn't make sense.


    Typical instruction encoding:
    ADD: oooooo ss xx ii ww mmm rrrrr rrrrr ddddd
    oooooo: is the opcode
    ss: is the operation size
    xx: is two extra opcode bits
    ii: indicates which register field represents an immediate value
    ww: indicates which register field is a memory operand
    mmm: is the addressing mode, similar to a 68k
    rrrrr: source register spec (or 4+ bit immediate)
    ddddd: destination register spec (or 4+ bit immediate)

    A 36-bit opcode would work great, allowing operand sign control.
I got some sign control using the two xx bits; for some operations that
is all that is needed, e.g. inverting the sign on the destination and
one of the source operands.


    --- Synchronet 3.21a-Linux NewsLink 1.2