• Pseudo-Immediates as Part of the Instruction

    From John Savard@quadibloc@invalid.invalid to comp.arch on Fri Aug 1 15:11:49 2025
    From Newsgroup: comp.arch

    I couldn't locate a post I finally felt I was ready to respond to, which
    was in reply to one of my posts about Concertina II, which said that immediates ought to be properly considered part of the instruction.

    Well, in nearly all computer architectures, immediates _are_ part of the instruction, and quite obviously so.

    But what Concertina II has are *pseudo* immediates. That is, they're not really immediates, but they pretend to be.

    What does this mean? What could this mean?

    Well, in my register-to-register operate instruction, associated with each _source_ register field, there's a bit which, if set, says that the five
    bits in the field aren't a register specifier, but a pointer to a constant.
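
    A minimal sketch of that source-field decoding, in Python. The function
    name, the returned tuple, and passing the flag bit as a separate argument
    are illustrative assumptions; the actual bit positions inside a
    Concertina II instruction are not given in this post.

```python
def decode_source(flag: int, field: int):
    """Interpret one source operand of a register-to-register operate instruction.

    flag  -- the extra bit associated with this source field (0 or 1)
    field -- the five-bit field itself (0..31)
    """
    if flag not in (0, 1):
        raise ValueError("flag is a single bit")
    if not 0 <= field < 32:
        raise ValueError("field is five bits wide")
    # Flag clear: an ordinary register number.
    # Flag set: the same five bits are a pointer to a constant.
    kind = "constant_pointer" if flag else "register"
    return kind, field


print(decode_source(0, 7))   # ordinary register operand
print(decode_source(1, 28))  # pseudo-immediate: the field is a pointer
```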

    A constant that's addressed by an instruction isn't an immediate; it's a constant. So why do I even call these constants "pseudo-immediates" then?

    Well, that pointer - five bits long - is an awfully short pointer. Where
    does it point?

    Instructions are fetched in blocks that are 256 bits long. One of the
    things this allows for is for the block to begin with a header that
    specifies that a certain number of 32-bit instruction slots at the end of
    the current block are to be skipped over in the sequence of instructions
    to be executed; this space can be used for constants.

    So although the constant is fetched in response to a pointer, and thus is
    not an immediate, the constant is located directly in the instruction
    stream. This is particularly true in implementations where the memory bus
    is 256 bits wide, and a block of instructions is fetched in a single
    memory read.
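
    The block layout just described can be sketched roughly as follows. This
    is an illustration under assumptions: the header is taken to occupy slot
    0, the number of reserved slots is passed in directly rather than decoded
    from a real header (whose encoding is not given here), and split_block is
    a hypothetical name.

```python
SLOT_BITS = 32
BLOCK_BITS = 256
SLOTS_PER_BLOCK = BLOCK_BITS // SLOT_BITS  # eight 32-bit slots per block


def split_block(slots, reserved):
    """Split one block into (header, instruction slots, constant slots).

    slots    -- list of eight 32-bit values; slots[0] is assumed to be the header
    reserved -- how many trailing slots the header marks as holding constants
    """
    if len(slots) != SLOTS_PER_BLOCK:
        raise ValueError("a block is exactly eight 32-bit slots")
    if not 0 <= reserved < SLOTS_PER_BLOCK:
        raise ValueError("reserved slot count out of range")
    first_constant = SLOTS_PER_BLOCK - reserved
    # Execution skips the trailing slots; they are only reached through pointers.
    return slots[0], slots[1:first_constant], slots[first_constant:]


header, instructions, constants = split_block(list(range(8)), reserved=2)
print(header, instructions, constants)
```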

    So the pseudo-immediate value is not part of the _instruction_ in the conventional sense, but if you think of the 256-bit block as being the
    "real" instruction for a VLIW architecture, it's part of *that*.

    Think of the Itanium: the 128-bit thingie is one thing, and each of the
    41-bit thingies that make it up, along with the 5-bit header, is another
    thing.

    The 5-bit header is part of the 128-bit thingy without being part of any
    of the 41-bit thingies. That is the limbo in which my pseudo-immediates
    are found. Data? Or a field in the instruction? It can be either one, depending on whether you define each individual 32-bit instruction as an instruction, or the 256-bit block as the "real" instruction the
    architecture executes.

    John Savard
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Fri Aug 1 16:52:34 2025
    From Newsgroup: comp.arch

    On Fri, 01 Aug 2025 15:11:49 +0000, John Savard wrote:

    The 5-bit header is part of the 128-bit thingy without being part of any
    of the 41-bit thingies. That is the limbo in which my pseudo-immediates
    are found. Data? Or a field in the instruction? It can be either one, depending on whether you define each individual 32-bit instruction as an instruction, or the 256-bit block as the "real" instruction the
    architecture executes.

    ...and if you think that's crazy, in some of the earliest iterations of
    the Concertina II design, I implemented instructions longer than 32 bits
    by having a six-bit pointer in an instruction to the rest of the
    instruction.

    Which, I suppose, argues against the view that pseudo-immediates are not
    part of the instruction, since that which definitely is part of the instruction can be pointed to in the same way.

    I stopped doing that because I felt it involved too much overhead.

    John Savard
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Aug 1 18:08:17 2025
    From Newsgroup: comp.arch

    John Savard <quadibloc@invalid.invalid> schrieb:
    I couldn't locate a post I finally felt I was ready to respond to, which
    was in reply to one of my posts about Concertina II, which said that immediates ought to be properly considered part of the instruction.

    That was probably mine.

    Well, in nearly all computer architectures, immediates _are_ part of the instruction, and quite obviously so.

    But what Concertina II has are *pseudo* immediates. That is, they're not really immediates, but they pretend to be.

    What does this mean? What could this mean?

    Well, in my register-to-register operate instruction, associated with each _source_ register field, there's a bit which, if set, says that the five bits in the field aren't a register specifier, but a pointer to a constant.

    A constant that's addressed by an instruction isn't an immediate; it's a constant. So why do I even call these constants "pseudo-immediates" then?

    Well, that pointer - five bits long - is an awfully short pointer. Where does it point?

    Question: Do the pointers point to the same block only, or also
    to other blocks? With 5 bits, you could address others as well.
    Can you give an example of their use, including the block headers?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Fri Aug 1 21:04:11 2025
    From Newsgroup: comp.arch

    On Fri, 01 Aug 2025 18:08:17 +0000, Thomas Koenig wrote:

    Question: Do the pointers point to the same block only, or also to other blocks? With 5 bits, you could address others as well. Can you give an example of their use, including the block headers?

    Actually, no, 5 bits are only enough to point within the same block.
    That's because it's a byte pointer, as it can be used to point to any type
    of constant, including single byte constants.
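
    The arithmetic behind that: five bits give 32 distinct byte offsets, and
    a 256-bit block is exactly 32 bytes. A small sketch, where fetch_constant
    is a hypothetical helper and the big-endian byte order is an assumption
    made only for the illustration:

```python
BLOCK_BYTES = 256 // 8   # a 256-bit block is 32 bytes
POINTER_BITS = 5         # width of the pseudo-immediate pointer


def fetch_constant(block: bytes, pointer: int, size: int) -> int:
    """Read a size-byte constant starting at byte offset `pointer` in the block."""
    if len(block) != BLOCK_BYTES:
        raise ValueError("block must be 32 bytes")
    if not 0 <= pointer < (1 << POINTER_BITS):
        raise ValueError("pointer must fit in five bits")
    if pointer + size > BLOCK_BYTES:
        raise ValueError("constant would run past the end of the block")
    # Byte order is assumed big-endian here.
    return int.from_bytes(block[pointer:pointer + size], "big")


block = bytes(range(32))
print(fetch_constant(block, 31, 1))       # a single-byte constant in the last byte
print(hex(fetch_constant(block, 28, 4)))  # a 32-bit constant in the last slot
```

    Every one of the 32 pointer values is already used up addressing bytes of
    the current block, which is why nothing is left over to reach another
    block.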

    This is despite the fact that I do have an instruction format for
    conventional style byte immediates (and I've just squeezed in one for
    16-bit immediates as well).

    However, they _can_ point to another block, by means of a sixth bit that
    some instructions have... but when this happens, it does not trigger an
    extra fetch from memory. Instead, the data is retrieved from a copy of an earlier block in the instruction stream that's saved in a special
    register... so as to reduce potential NOP-style problems.
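
    A sketch of how the sixth bit might select between the two sources.
    Treating it as the high bit of a six-bit value is an assumption, as are
    the name fetch_byte and the use of None to stand for "no saved copy
    available":

```python
BLOCK_BYTES = 32


def fetch_byte(pointer6, current_block, saved_block=None):
    """Return the byte a six-bit pseudo-immediate pointer refers to.

    The low five bits are the byte offset; the sixth bit (assumed to be the
    high bit) chooses the saved copy of an earlier block, so no extra memory
    fetch is modelled.
    """
    if not 0 <= pointer6 < 64:
        raise ValueError("pointer must fit in six bits")
    use_saved = pointer6 >> 5
    offset = pointer6 & 0x1F
    if use_saved:
        if saved_block is None:
            raise LookupError("no saved copy of an earlier block is available")
        return saved_block[offset]
    return current_block[offset]


current = bytes(BLOCK_BYTES)           # all zero bytes
saved = bytes([0xAA]) * BLOCK_BYTES    # stand-in for the saved earlier block
print(fetch_byte(0b000011, current, saved))
print(fetch_byte(0b100011, current, saved))
```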

    John Savard
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Aug 1 21:03:17 2025
    From Newsgroup: comp.arch

    On 2025-08-01 5:04 p.m., John Savard wrote:
    On Fri, 01 Aug 2025 18:08:17 +0000, Thomas Koenig wrote:

    Question: Do the pointers point to the same block only, or also to other
    blocks? With 5 bits, you could address others as well. Can you give an
    example of their use, including the block headers?

    Actually, no, 5 bits are only enough to point within the same block.
    That's because it's a byte pointer, as it can be used to point to any type
    of constant, including single byte constants.

    This is despite the fact that I do have an instruction format for conventional style byte immediates (and I've just squeezed in one for
    16-bit immediates as well).

    However, they _can_ point to another block, by means of a sixth bit that
    some instructions have... but when this happens, it does not trigger an
    extra fetch from memory. Instead, the data is retrieved from a copy of an earlier block in the instruction stream that's saved in a special
    register... so as to reduce potential NOP-style problems.

    John Savard

    I tried something similar to this but without block headers and it
    worked okay. But there were a couple of issues. One was the last
    instruction in a cache line could not have an immediate. Or instructions
    had to stop before the end of the cache line to accommodate immediates.
    This resulted in some wasted space. There would sometimes be a 32-bit
    hole between the last instruction and the first immediate. I used a
    four-bit index, with immediates the same 32-bit size as the instruction
    word. Four bits was enough to index a 512-bit cache line. IIRC the
    wasted space was about 5%.
    It made the assembler more complex. I had immediates being positioned
    from the far end of the cache line down (like a stack) towards the instructions which began at the lower end. The assembler had to be able
    to keep track of where things were on the cache line and the assembler
    was not built to handle that.
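
    A rough sketch of that kind of packing, to show where the 32-bit hole
    comes from. The exact rules of the original assembler are not given, so
    the in-order, stop-at-first-misfit policy and the name pack_line are
    assumptions:

```python
LINE_WORDS = 512 // 32   # sixteen 32-bit words, so a four-bit index suffices


def pack_line(instrs):
    """Pack instructions from the low end and immediates from the high end.

    instrs -- list of (mnemonic, immediate-or-None), taken strictly in order
    Returns (line, consumed): the sixteen words and how many instructions fit.
    """
    line = [None] * LINE_WORDS
    lo, hi = 0, LINE_WORDS   # next free instruction word; one past the last free word
    consumed = 0
    for mnemonic, imm in instrs:
        need = 1 if imm is None else 2   # an immediate costs one extra word
        if hi - lo < need:
            break                        # whatever is left becomes a hole
        if imm is None:
            line[lo] = (mnemonic, None)
        else:
            hi -= 1
            line[hi] = ("imm", imm)
            line[lo] = (mnemonic, hi)    # instruction carries the word index
        lo += 1
        consumed += 1
    return line, consumed


stream = [("add", None)] * 13 + [("li", 5), ("li", 6)]
line, consumed = pack_line(stream)
print(consumed, line[13], line[14], line[15])
```

    In this run the second "li" needs two words but only word 14 is free, so
    that word is left empty and the instruction moves to the next line.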
    Also, it made reading listings more difficult as constants were in the
    middle of sequences of instructions.
    Sometimes constants could be shared, but this turned out to be not
    possible in many cases as the assembler needed to emit relocation
    records for some constants and it could not handle having two or more instructions pointing to the same constant.

  • From Lawrence D'Oliveiro@ldo@nz.invalid to comp.arch on Sat Aug 2 03:21:56 2025
    From Newsgroup: comp.arch

    On Fri, 1 Aug 2025 15:11:49 -0000 (UTC), John Savard wrote:

    Well, that pointer - five bits long - is an awfully short pointer. Where
    does it point?

    Instructions are fetched in blocks that are 256 bits long. One of the
    things this allows for is for the block to begin with a header that
    specifies that a certain number of 32-bit instruction slots at the end
    of the current block are to be skipped over in the sequence of
    instructions to be executed; this space can be used for constants.

    Just add a couple of modifier bits: one is the indirect bit, indicating
    that the location referenced contains the address of the value, not the
    value itself, and another “page zero” bit, which indicates that the location is not in the current block, but in another block at a fixed
    address ...

    ... and I start having PDP-8 flashbacks.
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Sat Aug 2 03:22:41 2025
    From Newsgroup: comp.arch

    On Fri, 01 Aug 2025 21:03:17 -0400, Robert Finch wrote:

    I tried something similar to this but without block headers and it
    worked okay. But there were a couple of issues. One was the last
    instruction in a cache line could not have an immediate. Or instructions
    had to stop before the end of the cache line to accommodate immediates.
    This resulted in some wasted space.

    This is interesting. I've tried to keep things simple by making everything explicit.

    Also, it made reading listings more difficult as constants were in the
    middle of sequences of instructions.

    I don't plan on structuring my assembly language that way. It might make reading _core dumps_ more difficult, but pseudo-immediate values would
    appear in the assembler source within the instruction just like
    conventional immediates.

    John Savard

  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Aug 2 09:12:17 2025
    From Newsgroup: comp.arch

    John Savard <quadibloc@invalid.invalid> schrieb:
    On Fri, 01 Aug 2025 18:08:17 +0000, Thomas Koenig wrote:

    Question: Do the pointers point to the same block only, or also to other
    blocks? With 5 bits, you could address others as well. Can you give an
    example of their use, including the block headers?

    Actually, no, 5 bits are only enough to point within the same block.
    That's because it's a byte pointer, as it can be used to point to any type
    of constant, including single byte constants.

    This is despite the fact that I do have an instruction format for conventional style byte immediates (and I've just squeezed in one for
    16-bit immediates as well).

    Is there a reason for that? On the face of it, having both makes
    no sense.

    But even so: Having a single, let's say, 32-bit immediate would require
    a 32-bit header and a 32-bit constant, so 64 bits used instead of
    directly encoding a 32-bit constant.

    However, they _can_ point to another block, by means of a sixth bit that some instructions have...

    Try writing an assembler and disassembler for what you have. I have
    written this for Mitch's ISA, and it turned out to be very difficult
    already. Your method, I would guess, would be much more difficult.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Sat Aug 2 18:57:43 2025
    From Newsgroup: comp.arch

    On Sat, 02 Aug 2025 09:12:17 +0000, Thomas Koenig wrote:
    John Savard <quadibloc@invalid.invalid> schrieb:

    This is despite the fact that I do have an instruction format for
    conventional style byte immediates (and I've just squeezed in one for
    16-bit immediates as well).

    Is there a reason for that? On the face of it, having both makes no
    sense.

    The option of having a pseudo-immediate pointer instead of a register specification is baked into the format of the operate instructions.
    Removing it for some variable types would be messy.

    But even so: Having a single, let's say, 32-bit immediate would require a 32-bit header and a 32-bit constant, so 64 bits used instead of directly encoding a 32-bit constant.

    And avoiding that for eight and sixteen bit constants is the reason for conventional immediates for them, despite the duplication. (Try fitting
    the other sizes of immediate into a 32-bit instruction.)

    But I'm sneaky. Since this situation dismayed me all along with
    Concertina II, I have what I call a "zero-overhead header". In the first instruction slot of a block, one may have a Type I header, which is a two-address operate instruction which *also* supplies a three-bit _decode_ field, reserving slots for pseudo-immediates.

    Since operate instructions are the most common type of instruction, if one
    can re-arrange instructions a little, one might be able to have these pseudo-immediates *without* the crushing burden of a 32-bit overhead!

    John Savard
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Aug 2 19:23:01 2025
    From Newsgroup: comp.arch

    John Savard <quadibloc@invalid.invalid> schrieb:

    Since operate instructions are the most common type of instruction, if one can re-arrange instructions a little, one might be able to have these pseudo-immediates *without* the crushing burden of a 32-bit overhead!

    I read "one might" as "never will".

    You still haven't shown a single piece of code with your header
    scheme, I presume because it is too difficult even for you, the
    author of the ISA.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Sun Aug 3 05:30:34 2025
    From Newsgroup: comp.arch

    On Sat, 02 Aug 2025 19:23:01 +0000, Thomas Koenig wrote:

    You still haven't shown a single piece of code with your header scheme,
    I presume because it is too difficult even for you, the author of the
    ISA.

    I can understand how you might feel that way, but if my block structure
    isn't understandable when illustrated by diagrams showing the basic
    essentials of how it works, I fail to realize how making the extra effort
    to smother that information in a mass of irrelevant detail is going to
    make it any clearer to you.

    John Savard
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Aug 3 11:25:51 2025
    From Newsgroup: comp.arch

    John Savard <quadibloc@invalid.invalid> schrieb:
    On Sat, 02 Aug 2025 19:23:01 +0000, Thomas Koenig wrote:

    You still haven't shown a single piece of code with your header scheme,
    I presume because it is too difficult even for you, the author of the
    ISA.

    I can understand how you might feel that way, but if my block structure
    isn't understandable when illustrated by diagrams showing the basic essentials of how it works, I fail to realize how making the extra effort
    to smother that information in a mass of irrelevant detail is going to
    make it any clearer to you.

    It is not how something appears in a diagram, it is how an actual
    algorithm is transformed into efficient machine language (I would
    have said assembly language, but you put a massive barrier between
    the two with your block structure).

    You wrote, upthread, that you have never done so. My current
    assumption is that you chose not to do it because this would
    be too complicated for you, the inventor of this ISA, let alone
    anybody else.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sun Aug 3 12:50:05 2025
    From Newsgroup: comp.arch

    On 8/2/2025 2:12 AM, Thomas Koenig wrote:
    John Savard <quadibloc@invalid.invalid> schrieb:
    On Fri, 01 Aug 2025 18:08:17 +0000, Thomas Koenig wrote:

    Question: Do the pointers point to the same block only, or also to other
    blocks? With 5 bits, you could address others as well. Can you give an
    example of their use, including the block headers?

    Actually, no, 5 bits are only enough to point within the same block.
    That's because it's a byte pointer, as it can be used to point to any type
    of constant, including single byte constants.

    This is despite the fact that I do have an instruction format for
    conventional style byte immediates (and I've just squeezed in one for
    16-bit immediates as well).

    Is there a reason for that? On the face of it, having both makes
    no sense.

    But even so: Having a single, let's say, 32-bit immediate would require
    a 32-bit header and a 32-bit constant, so 64 bits used instead of
    directly encoding a 32-bit constant.

    Yup. And as Robert Finch pointed out, what if the instruction that
    needs the constant is the last instruction in the block?



    However, they _can_ point to another block, by means of a sixth bit that
    some instructions have...


    But using this capability isn't a solution, as it adds 32 bits to the
    block, which pushes the last instruction in that block into the current
    block, which pushes the instruction that needs the immediate into the
    next block and forces the extra nop anyway.


    Try writing an assembler and disassembler for what you have. I have
    written this for Mitch's ISA, and it turned out to be very difficult
    already.

    I am curious as to what features you found difficult?


    Your method, I would guess, would be much more difficult.

    Agreed!
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sun Aug 3 13:03:21 2025
    From Newsgroup: comp.arch

    On 8/2/2025 10:30 PM, John Savard wrote:
    On Sat, 02 Aug 2025 19:23:01 +0000, Thomas Koenig wrote:

    You still haven't shown a single piece of code with your header scheme,
    I presume because it is too difficult even for you, the author of the
    ISA.

    I can understand how you might feel that way, but if my block structure
    isn't understandable when illustrated by diagrams showing the basic essentials of how it works, I fail to realize how making the extra effort
    to smother that information in a mass of irrelevant detail is going to
    make it any clearer to you.

    I suspect that the purpose of Thomas's suggestion wasn't to make the
    design clearer to him, but to force you to discover/think about the
    utility and ease of use of some of the features you propose *in real
    programs* . If a typical programmer can't figure out how to use some
    CPU feature, it probably won't be used, and thus probably should not be
    in the architecture. The best way to learn about what features are
    useful is to try to use them! and the best way to do that is to write
    actual code for a real program.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Aug 3 22:36:32 2025
    From Newsgroup: comp.arch

    Stephen Fuld wrote:
    On 8/2/2025 10:30 PM, John Savard wrote:
    On Sat, 02 Aug 2025 19:23:01 +0000, Thomas Koenig wrote:

    You still haven't shown a single piece of code with your header scheme,
    I presume because it is too difficult even for you, the author of the
    ISA.

    I can understand how you might feel that way, but if my block structure
    isn't understandable when illustrated by diagrams showing the basic
    essentials of how it works, I fail to realize how making the extra effort
    to smother that information in a mass of irrelevant detail is going to
    make it any clearer to you.

    I suspect that the purpose of Thomas's suggestion wasn't to make the
    design clearer to him, but to force you to discover/think about the
    utility and ease of use of some of the features you propose *in real programs* .  If a typical programmer can't figure out how to use some
    CPU feature, it probably won't be used, and thus probably should not be
    in the architecture.  The best way to learn about what features are
    useful is to try to use them!  and the best way to do that is to write actual code for a real program.
    That is always a required step, but still not enough.
    I.e when I first got the Itanium architecture manual (long before any CPUs/systems were available) I sat down and wrote some (to me)
    interesting kernels, like medium-sized arbitrary precision math, up to a kbit or two, using carry-save in-register storage.
    That persuaded me that it was possible for the Itanium to do these kinds of calculations very fast indeed, but the architecture was still a
    memorable failure.
    Being fit for a number of hand-written asm kernels does not a generally
    useful cpu make.
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sun Aug 3 14:28:11 2025
    From Newsgroup: comp.arch

    On 8/3/2025 1:36 PM, Terje Mathisen wrote:
    Stephen Fuld wrote:
    On 8/2/2025 10:30 PM, John Savard wrote:
    On Sat, 02 Aug 2025 19:23:01 +0000, Thomas Koenig wrote:

    You still haven't shown a single piece of code with your header scheme,
    I presume because it is too difficult even for you, the author of the
    ISA.

    I can understand how you might feel that way, but if my block structure
    isn't understandable when illustrated by diagrams showing the basic
    essentials of how it works, I fail to realize how making the extra
    effort
    to smother that information in a mass of irrelevant detail is going to
    make it any clearer to you.

    I suspect that the purpose of Thomas's suggestion wasn't to make the
    design clearer to him, but to force you to discover/think about the
    utility and ease of use of some of the features you propose *in real
    programs* .  If a typical programmer can't figure out how to use some
    CPU feature, it probably won't be used, and thus probably should not
    be in the architecture.  The best way to learn about what features are
    useful is to try to use them!  and the best way to do that is to write
    actual code for a real program.

    That is always a required step, but still not enough.

    I.e when I first got the Itanium architecture manual (long before any CPUs/systems were available) I sat down and wrote some (to me)
    interesting kernels, like medium-sized arbitrary precision math, up to a kbit or two, using carry-save in-register storage.

    That persuaded me that it was possible for the Itanium to do these kinds
    of calculations very fast indeed, but the architecture was still a
    memorable failure.

    Being fit for a number of hand-written asm kernels does not a generally useful cpu make.

    I absolutely agree, though John seems reluctant to do even that despite Thomas's and my suggestions.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Sun Aug 3 22:14:51 2025
    From Newsgroup: comp.arch

    On Sun, 03 Aug 2025 12:50:05 -0700, Stephen Fuld wrote:

    Yup. And as Robert Finch pointed out, what if the instruction that
    needs the constant is the last instruction in the block?

    The first thing one could do is precede that instruction by a NOP.

    In Concertina II, the preferred way to achieve the same effect is to use a do-nothing header, because that wouldn't consume a whole cycle like a NOP might.

    But I thought of that, and added a feature where instructions can
    (provided a recent branch hadn't taken place) indicate that they're using
    a saved copy of the preceding block, instead of the current block, for the constant.

    Oh, I see you noticed that:
    However, they _can_ point to another block, by means of a sixth bit
    that some instructions have...

    But using this capability isn't a solution, as it adds 32 bits to the
    block, which pushes the last instruction in that block into the current block, which pushes the instruction that needs the immediate into the
    next block and forces the extra nop anyway.

    That isn't quite how it would work out.

    Current issue...

    I I I I I I I I#

    When I fix it, to put the value in the current block, it pushes the
    problem instruction to the next one,

    (1) I I I I I I M1
    I I#

    so pointing to the previous block *does* solve the problem.

    I# - instruction wanting to use a 32-bit pseudo-immediate constant
    (1) - header that reserves one instruction slot at the end of the current block for a constant
    M1 - constant value that's one 32-bit instruction slot long
    I - plain 32-bit instruction

    So, yes, it works just fine.
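
    The two layouts above can be checked mechanically with a small sketch.
    The token spelling is an assumption layered on the legend: "H" is a
    header, "M" a constant slot, "I" a plain instruction, "I#" an instruction
    using a constant in its own block, and "I#<" one using the saved copy of
    the preceding block. layout_ok is a hypothetical name, and the condition
    about recent branches is not modelled.

```python
SLOTS = 8   # 32-bit slots per 256-bit block


def layout_ok(blocks):
    """Check that every pseudo-immediate user can reach a constant slot."""
    prev_has_const = False
    for block in blocks:
        if len(block) > SLOTS:
            return False
        # Count the constant slots at the tail of the block.
        n_const = 0
        while n_const < len(block) and block[-1 - n_const] == "M":
            n_const += 1
        body = block[:len(block) - n_const]
        if "M" in body or "H" in body[1:]:
            return False                  # constants trail, header leads
        if n_const and block[0] != "H":
            return False                  # reserving slots needs a header
        if "I#" in body and n_const == 0:
            return False                  # no constant in this block
        if "I#<" in body and not prev_has_const:
            return False                  # nothing saved to point back to
        prev_has_const = n_const > 0
    return True


before = [["I"] * 7 + ["I#"]]
after = [["H"] + ["I"] * 6 + ["M"], ["I", "I#<"]]
print(layout_ok(before), layout_ok(after))
```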

    John Savard
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Sun Aug 3 22:43:04 2025
    From Newsgroup: comp.arch

    On Sat, 02 Aug 2025 18:57:43 +0000, John Savard wrote:

    But I'm sneaky. Since this situation dismayed me all along with
    Concertina II, I have what I call a "zero-overhead header". In the first instruction slot of a block, one may have a Type I header, which is a two-address operate instruction which *also* supplies a three-bit
    _decode_ field, reserving slots for pseudo-immediates.

    It had also provided a few extra bits to allow some other things to be
    done without overhead.

    Recently, I mistakenly thought I had the opportunity to add one extra bit
    to this instruction, to give me the chance to point to a 35-bit
    instruction without the overhead of a full 32-bit header. I thought that
    might be too good to be true, though, so I did make preparations to revert
    the change.

    Well, indeed I did find the extra opcode space was not available. But I decided not to revert, but to correct things as they now were, because one result of the changes I had made was that the opcode range containing
    operate instructions was now neater - and this applied to some other categories of instructions as well.

    And even though I wasn't able to modify the Type I header as I had wished,
    I ended up figuring out another attainable way of achieving my objective. Since both forms of the Augmented Short Instruction format of 32-bit instructions provided versions of the operate instructions with longer opcodes, I really didn't need to provide them in the Alternate 32-bit Instructions as well. So I took those out, and provided a stripped-down limited form of the memory-to-register operate instruction (which is what
    the 35-bit instructions were, but without being stripped down) within the Alternate 32-bit Instructions... so now this capability is provided, at
    least to an extent, by the Type I header.

    John Savard
  • From Stefan Monnier@monnier@iro.umontreal.ca to comp.arch on Sun Aug 3 19:23:50 2025
    From Newsgroup: comp.arch

    Being fit for a number of hand-written asm kernels does not a generally useful cpu make.

    Beside bignums, other "kernels" worth trying might be something like
    a simple balanced binary tree, including some operation that
    requires recursion, like counting the number of leaves.

    And of course, trying to get a compiler to generate code vaguely similar
    to the ASM you wrote by hand is always a good test, tho it may take
    more effort.


    Stefan
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sun Aug 3 18:11:28 2025
    From Newsgroup: comp.arch

    On 8/3/2025 3:14 PM, John Savard wrote:
    On Sun, 03 Aug 2025 12:50:05 -0700, Stephen Fuld wrote:

    Yup. And as Robert Finch pointed out, what if the instruction that
    needs the constant is the last instruction in the block?

    The first thing one could do is precede that instruction by a NOP.

    In Concertina II, the preferred way to achieve the same effect is to use a do-nothing header, because that wouldn't consume a whole cycle like a NOP might.

    But I thought of that, and added a feature where instructions can
    (provided a recent branch hadn't taken place) indicate that they're using
    a saved copy of the preceding block, instead of the current block, for the constant.

    Oh, I see you noticed that:
    However, they _can_ point to another block, by means of a sixth bit
    that some instructions have...

    But using this capability isn't a solution, as it adds 32 bits to the
    block, which pushes the last instruction in that block into the current
    block, which pushes the instruction that needs the immediate into the
    next block and forces the extra nop anyway.

    That isn't quite how it would work out.

    Current issue...

    I I I I I I I I#

    When I fix it, to put the value in the current block, it pushes the
    problem instruction to the next one,

    (1) I I I I I I M1
    I I#

    so pointing to the previous block *does* solve the problem.

    OK, I see what you are saying.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Mon Aug 4 04:07:41 2025
    From Newsgroup: comp.arch

    On Sun, 03 Aug 2025 13:03:21 -0700, Stephen Fuld wrote:

    I suspect that the purpose of Thomas's suggestion wasn't to make the
    design clearer to him, but to force you to discover/think about the
    utility and ease of use of some of the features you propose *in real programs* . If a typical programmer can't figure out how to use some
    CPU feature, it probably won't be used, and thus probably should not be
    in the architecture. The best way to learn about what features are
    useful is to try to use them! and the best way to do that is to write
    actual code for a real program.

    While I'm not prepared to go to the trouble of creating a fleshed-out
    example, a very short and trivial example will still indicate what my
    goals are.

    X = Y * 2.78 + Z

    On a typical RISC architecture, this would involve instructions like this:

    load 18, Y
    load 19, K#0001
    fmul 18, 18, 19
    load 19, Z
    fadd 18, 18, 19
    fsto X

    Six instructions, each 32 bits long.

    On the IBM System/360, though, it would be something like

    le 12, Y
    me 12, K#0001
    ae 12, Z
    ste 12, x

    All four instructions are memory-reference instructions, so they're also
    32 bits long.

    How would I do this on Concertina II?

    Well, since the sequence has to start with a memory-reference, I can't use
    the zero-overhead header (Type I). Instead, a Type XI header is in order;
    that specifies a decode field, so that space can be reserved for a pseudo-immediate, and instruction slots can be indicated as containing
    instructions from the alternate instruction set.

    Then the instructions can be

    lf 6,y
    mfr 6,#2.78
    af 6,z
    stf 6,x

    with the instruction "af" coming from the alternate 32-bit instruction set.

    The other tricky precondition that must be met is to store z in a data
    region that is only 4,096 bytes or less in size, prefaced with

    USING *,23

    or another register from 17 to 23 could be used as the base register, so
    that it is addressed with a 12-bit displacement. (Also, register 6, from
    the first eight registers, is used to do the arithmetic to meet the limitations of the "add floating" memory to register operate instruction
    in the alternate instruction set.)

    Because it uses a pseudo-immediate, which gets fetched along with the instruction stream, where the 360 uses a constant, it has an advantage
    over the 360. On the other hand, while the actual code is the same length, there's also the 32-bit overhead of the header.

    John Savard


  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Aug 4 05:52:31 2025
    From Newsgroup: comp.arch

    On 2025-08-03, Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
    On 8/2/2025 2:12 AM, Thomas Koenig wrote:

    Try writing an assembler and disassembler for what you have. I have
    written this for Mitch's ISA, and it turned out to be very difficult
    already.

    I am curious as to what features you found difficult?

    A few things.

    First, I wrote this as a port of GNU binutils. binutils internals
    are not very well documented. You do not do ELF stuff directly,
    but rather you have to interface with BFD, which then does the
    ELF stuff. And this interface is hairy, to say the least.

    Second, there are very many instructions with the same name,
    but with different flags doing different things. Things like

    add r1,r2,#Imm16 ! Different major opcode from the rest
    add r1,r2,r3
    add r1,-r2,r3
    add r1,r2,-r3
    add r1,-r2,-r3
    add r1,r2,#Imm5
    add r1,r2,#Imm32
    add r1,r2,#Imm64

    (the list is not complete, and each variant has its own combination
    of flags) make things complex to begin with. Syntactically,
    a 16-bit integer looks like a 32-bit integer, but a 16-bit
    integer should be selected for size reasons.

    There are also 47 different operand types at latest count, which
    makes writing an assembler/disassembler somewhat error-prone.
    (I think the complexity for the assembler works well for a user,
    I find My 66000 assembly very easy to read and write).

    But the most difficult part was getting the relocations and
    fixups right, also for things like a (8,16,32 or 64-bit)
    jump table instruction, and there the main problem was
    a) getting things straight in my head and b) interfacing
    with BFD, see above.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Aug 4 16:56:13 2025
    From Newsgroup: comp.arch

    John Savard <quadibloc@invalid.invalid> schrieb:
    On Sun, 03 Aug 2025 13:03:21 -0700, Stephen Fuld wrote:

    I suspect that the purpose of Thomas's suggestion wasn't to make the
    design clearer to him, but to force you to discover/think about the
    utility and ease of use of some of the features you propose *in real
    programs*. If a typical programmer can't figure out how to use some
    CPU feature, it probably won't be used, and thus probably should not be
    in the architecture. The best way to learn about what features are
    useful is to try to use them! and the best way to do that is to write
    actual code for a real program.

    While I'm not prepared to go to the trouble of creating a fleshed-out example, a very short and trivial example will still indicate what my
    goals are.

    X = Y * 2.78 + Z

    On a typical RISC architecture, this would involve instructions like this:

    load 18, Y
    load 19, K#0001
    fmul 18, 18, 19
    load 19, Z
    fadd 18, 18, 19
    fsto X

    If all the variables were in BSS.

    My 66000 with its compiler:

    double foo (double y, double z)
    {
    return y*2.78 + z;
    }

    yields

    foo: ; @foo
    ; %bb.0:
    fmac r1,r1,#0x40063D70A3D70A3D,r2
    ret

    One instruction for the arithmetic, one for the function return.
    Here's the disassembly:

    0000000000000000 <foo>:
    0: 3021e040 fmac r1,r1,#0x40063D70A3D70A3D,r2
    4: a3d70a3d
    8: 40063d70
    c: 6be00000 ret


    Six instructions, each 32 bits long.

    On the IBM System/360, though, it would be something like

    le 12, Y
    me 12, K#0001
    ae 12, Z
    ste 12, x

    With gcc -O2 -m31, on godbolt:

    foo:
    larl %r5,.L3
    madb %f2,%f0,.L4-.L3(%r5)
    ldr %f0,%f2
    br %r14
    .L3:
    .L4:
    .long 1074150768
    .long -1546188227
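
    As a cross-check on the listings above, the 64-bit immediate in the fmac line and the two .long words emitted by gcc should both be the IEEE 754 double-precision encoding of 2.78. A minimal Python sketch using only the standard `struct` module (the variable names are illustrative):

```python
import struct

# Big-endian IEEE 754 double for 2.78, as eight raw bytes.
raw = struct.pack(">d", 2.78)

# The same bytes viewed as one 64-bit immediate.
immediate = int.from_bytes(raw, "big")

# The same bytes viewed as two signed 32-bit words, high word first,
# which is how a big-endian target would lay out two .long values.
high_word, low_word = struct.unpack(">ii", raw)

print(f"{immediate:#018x}")
print(high_word, low_word)
```

    The negative second word is just the low half, 0xA3D70A3D, printed as a signed 32-bit integer.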

    All four instructions are memory-reference instructions, so they're also
    32 bits long.

    How would I do this on Concertina II?

    Well, since the sequence has to start with a memory-reference, I can't use the zero-overhead header (Type I). Instead, a Type XI header is in order; that specifies a decode field, so that space can be reserved for a pseudo-immediate, and instruction slots can be indicated as containing
    instructions from the alternate instruction set.

    Then the instructions can be

    lf 6,y
    mfr 6,#2.78
    af 6,z
    stf 6,x

    with the instruction "af" coming from the alternate 32-bit instruction set.

    The other tricky precondition that must be met is to store z in a data region that is only 4,096 bytes or less in size, prefaced with

    USING *,23

    or another register from 17 to 23 could be used as the base register, so that it is addressed with a 12-bit displacement.

    Using USING is just horrible, and this makes it worse. Where would
    you need to store this, in an executable page? Newer architectures
    have read, write and execute bits on their page tables for a very
    good reason.

    And... would you like to have a stack in your architecture?

    (Also, register 6, from
    the first eight registers, is used to do the arithmetic to meet the limitations of the "add floating" memory to register operate instruction
    in the alternate instruction set.)

    Because it uses a pseudo-immediate, which gets fetched along with the instruction stream, where the 360 uses a constant, it has an advantage
    over the 360. On the other hand, while the actual code is the same length, there's also the 32-bit overhead of the header.

    Where is the advantage over putting a constant directly in
    the instruction stream?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Tue Aug 5 02:10:40 2025
    From Newsgroup: comp.arch

    On Mon, 04 Aug 2025 16:56:13 +0000, Thomas Koenig wrote:
    John Savard <quadibloc@invalid.invalid> schrieb:

    The other tricky precondition that must be met is to store z in a data
    region that is only 4,096 bytes or less in size, prefaced with

    USING *,23

    or another register from 17 to 23 could be used as the base register,
    so that it is addressed with a 12-bit displacement.

    Using USING is just horrible, and this makes it worse. Where would you
    need to store this, in an executable page? Newer architectures have read,
    write and execute bits on their page tables for a very good reason.

    Never fear. The virtual memory subsystem will indeed mark the DSECTs as writeable but not executable, and the CSECTs as executable but not
    writeable. These operations being privileged, they don't take place in the user code.

    And... would you like to have a stack in your architecture?

    No. One always has to worry about stacks overflowing. The System/360 got
    along just fine without a stack, faking one whenever the need arose.

    To me, having stacks is just asking for trouble; they're a disaster
    waiting to happen and a blatant security hole.

    Because it uses a pseudo-immediate, which gets fetched along with the
    instruction stream, where the 360 uses a constant, it has an advantage
    over the 360. On the other hand, while the actual code is the same
    length, there's also the 32-bit overhead of the header.

    Where is the advantage over putting a constant directly in the
    instruction stream?

    One would need a different instruction format for each length of variable.

    I'm trying to have either all instructions 32 bits long, or, if a limited variation in instruction length is allowed, the header indicates where
    every instruction begins, so all instructions may decode in parallel.

    John Savard
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Aug 4 20:37:04 2025
    From Newsgroup: comp.arch

    On 8/4/2025 9:56 AM, Thomas Koenig wrote:
    John Savard <quadibloc@invalid.invalid> schrieb:
    On Sun, 03 Aug 2025 13:03:21 -0700, Stephen Fuld wrote:

    I suspect that the purpose of Thomas's suggestion wasn't to make the
    design clearer to him, but to force you to discover/think about the
    utility and ease of use of some of the features you propose *in real
    programs*. If a typical programmer can't figure out how to use some
    CPU feature, it probably won't be used, and thus probably should not be
    in the architecture. The best way to learn about what features are
    useful is to try to use them! and the best way to do that is to write
    actual code for a real program.

    While I'm not prepared to go to the trouble of creating a fleshed-out
    example, a very short and trivial example will still indicate what my
    goals are.

    X = Y * 2.78 + Z

    On a typical RISC architecture, this would involve instructions like this:
    load 18, Y
    load 19, K#0001
    fmul 18, 18, 19
    load 19, Z
    fadd 18, 18, 19
    fsto X

    If all the variables were in BSS.

    My 66000 with its compiler:

    double foo (double y, double z)
    {
    return y*2.78 + z;
    }

    yields

    foo: ; @foo
    ; %bb.0:
    fmac r1,r1,#0x40063D70A3D70A3D,r2
    ret

    One instruction for the arithmetic, one for the function return.
    Here's the disassembly:

    0000000000000000 <foo>:
    0: 3021e040 fmac r1,r1,#0x40063D70A3D70A3D,r2
    4: a3d70a3d
    8: 40063d70
    c: 6be00000 ret


    Six instructions, each 32 bits long.

    On the IBM System/360, though, it would be something like

    le 12, Y
    me 12, K#0001
    ae 12, Z
    ste 12, x

    With gcc -O2 -m31, on godbolt:

    foo:
    larl %r5,.L3
    madb %f2,%f0,.L4-.L3(%r5)
    ldr %f0,%f2
    br %r14
    .L3:
    .L4:
    .long 1074150768
    .long -1546188227

    All four instructions are memory-reference instructions, so they're also
    32 bits long.

    How would I do this on Concertina II?

    Well, since the sequence has to start with a memory-reference, I can't use the zero-overhead header (Type I). Instead, a Type XI header is in order;
    that specifies a decode field, so that space can be reserved for a pseudo-immediate, and instruction slots can be indicated as containing
    instructions from the alternate instruction set.

    Then the instructions can be

    lf 6,y
    mfr 6,#2.78
    af 6,z
    stf 6,x

    with the instruction "af" coming from the alternate 32-bit instruction set.

    So, if I got this right, four instructions plus two 32-bit words, one for
    the constant and one for the header required by the constant.
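
    The sizes being compared can be tallied in a few lines of Python. This is only a sketch of the arithmetic in the thread: it counts 32-bit words in the instruction stream, and it does not count the separate data-memory word that the RISC and System/360 versions need for the constant K#0001. The dictionary keys are illustrative labels, not official terminology.

```python
WORD_BITS = 32  # every instruction, header and short constant is one 32-bit word

# Word counts as described in the thread for X = Y * 2.78 + Z.
sequences = {
    "typical RISC": {"instructions": 6, "header": 0, "inline_constant": 0},
    "System/360": {"instructions": 4, "header": 0, "inline_constant": 0},
    "Concertina II": {"instructions": 4, "header": 1, "inline_constant": 1},
}


def stream_bits(parts):
    """Total bits occupied in the instruction stream."""
    return WORD_BITS * sum(parts.values())


totals = {name: stream_bits(parts) for name, parts in sequences.items()}
for name, bits in totals.items():
    print(f"{name}: {bits} bits in the instruction stream")
```

    Counted this way, the Concertina II sequence occupies the same 192 bits as the six-instruction RISC sequence, while the System/360 sequence occupies 128 bits plus its out-of-line constant.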



    The other tricky precondition that must be met is to store z in a data
    region that is only 4,096 bytes or less in size, prefaced with

    USING *,23

    or another register from 17 to 23 could be used as the base register, so
    that it is addressed with a 12-bit displacement.

    This shows why one should use more "complete" examples rather than
    single statements for ISA comparisons. John showed the series of
    instructions for the single source line as if it were pulled from the
    middle of some program. But you showed, since you wanted something
    actually compileable, a function/subroutine. This allowed you to assume
    that all the inputs were already in registers, whereas John had to
    include the instructions to load the values from memory and store the
    result. If you had to do that, it would add two load instructions (for
    Y and Z), and the store for the results. But the MY 66000 has the
    advantage of the FMA, so, eliminating the return instruction, four instructions plus 32 bits for the constant. Of course, a more extensive example might show what inputs were already in registers, etc. If the
    inputs were already in registers, then the MY 66000's instruction count
    goes down, but the Concertina's doesn't.

    So, an apples to apples comparison gives the advantage to the MY 66000, primarily due to the FMA instruction and not requiring a header for the
    inline immediate. But I still maintain a more "complete" example is
    really needed.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Aug 5 04:56:21 2025
    From Newsgroup: comp.arch

    John Savard <quadibloc@invalid.invalid> schrieb:

    And... would you like to have a stack in your architecture?

    No.

    OK. I think that is the final nail in the coffin, I will
    henceforth stop reading (and writing) about your architecture.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Tue Aug 5 16:26:52 2025
    From Newsgroup: comp.arch

    On Mon, 04 Aug 2025 20:37:04 -0700, Stephen Fuld wrote:

    But I still maintain a more "complete" example is
    really needed.

    That may be. But now that the one most ardently seeking that has
    identified my ISA as being dead to him, I'm not going to rush.

    John Savard
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Aug 5 09:51:11 2025
    From Newsgroup: comp.arch

    On 8/4/2025 9:56 PM, Thomas Koenig wrote:
    John Savard <quadibloc@invalid.invalid> schrieb:

    And... would you like to have a stack in your architecture?

    No.

    OK. I think that is the final nail in the coffin, I will
    henceforth stop reading (and writing) about your architecture.

    While I agree that having at least push and pop instructions would be beneficial, I hardly think that is the most "bizarre" and less than
    useful aspect of John's architecture. After all, both of those
    instructions can be accomplished by two "standard" instructions, a store
    and an add (for push) and a load and subtract (for pop). Interchange
    the add and the subtract if you want the stack to grow in the other
    direction.
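
    The two-instruction equivalents can be sketched in Python as follows. This is a toy model, not any real ISA: `TinyMachine`, its word-addressed memory, and the convention that the stack pointer names the next free word of an upward-growing stack are all assumptions made for illustration. Under that convention the pop performs its subtract before its load.

```python
class TinyMachine:
    """Toy word-addressed machine with a software-managed stack."""

    def __init__(self, words=16):
        self.memory = [0] * words
        self.sp = 0  # stack pointer: index of the next free word

    def store(self, address, value):
        self.memory[address] = value

    def load(self, address):
        return self.memory[address]

    def push(self, value):
        self.store(self.sp, value)  # "store" instruction
        self.sp += 1                # "add" instruction

    def pop(self):
        self.sp -= 1                # "subtract" instruction
        return self.load(self.sp)   # "load" instruction


machine = TinyMachine()
machine.push(10)
machine.push(20)
first, second = machine.pop(), machine.pop()
print(first, second, machine.sp)
```

    Swapping the `+= 1` and `-= 1` (and starting `sp` at the top of memory) gives the downward-growing variant mentioned above.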

    Of course, you are free to stop contributing on this topic, but I, for
    one, will miss your contributions.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From BGB@cr88192@gmail.com to comp.arch on Tue Aug 5 18:23:36 2025
    From Newsgroup: comp.arch

    On 8/5/2025 11:51 AM, Stephen Fuld wrote:
    On 8/4/2025 9:56 PM, Thomas Koenig wrote:
    John Savard <quadibloc@invalid.invalid> schrieb:

    And... would you like to have a stack in your architecture?

    No.

    OK.  I think that is the final nail in the coffin, I will
    henceforth stop reading (and writing) about your architecture.

    While I agree that having at least push and pop instructions would be beneficial, I hardly think that is the most "bizarre" and less than
    useful aspect of John's architecture.  After all, both of those instructions can be accomplished by two "standard" instructions, a store
    and an add (for push) and a load and subtract (for pop).  Interchange
    the add and the subtract if you want the stack to grow in the other direction.

    Of course, you are free to stop contributing on this topic, but I, for
    one, will miss your contributions.



    The lack of dedicated PUSH/POP instructions IME has relatively little
    direct impact on the usability of an ISA. Either way, one is likely to
    need stack-frame adjustment, in which case PUSH/POP don't tend to offer
    much over normal Load/Store instructions.


    That said, a lot of John's other ideas come off to me like straight up absurdity. So, I wouldn't hold up much hope personally for it to turn
    into much usable.


  • From John Savard@quadibloc@invalid.invalid to comp.arch on Tue Aug 5 23:49:08 2025
    From Newsgroup: comp.arch

    On Tue, 05 Aug 2025 09:51:11 -0700, Stephen Fuld wrote:

    While I agree that having at least push and pop instructions would be beneficial,

    And I have now added exactly that to the architecture - as I note in the
    new thread titled "By Popular Demand".

    But subroutine calls still don't use them.

    I've also added another requested feature while I was at it; allowing the
    use of a 64-bit displacement without a base register but with an index.

    John Savard
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Aug 6 05:32:41 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 8/4/2025 9:56 PM, Thomas Koenig wrote:
    John Savard <quadibloc@invalid.invalid> schrieb:

    And... would you like to have a stack in your architecture?

    No.

    OK. I think that is the final nail in the coffin, I will
    henceforth stop reading (and writing) about your architecture.

    While I agree that having at least push and pop instructions would be beneficial, I hardly think that is the most "bizarre" and less than
    useful aspect of John's architecture. After all, both of those
    instructions can be accomplished by two "standard" instructions, a store
    and an add (for push) and a load and subtract (for pop). Interchange
    the add and the subtract if you want the stack to grow in the other direction.

    What I meant was that, the way he described his addressing modes,
    he was not considering a stack at all, even implemented by
    the usual RISC method (which is better than push/pop, see the
    special hoops that AMD64 has to jump through to fuse several
    push or pop instructions into one - IIRC, it costs them a cycle
    of pipeline length).

    And stacks _are_ extremely efficient, as everybody except one
    person knows, because they save memory and improve cache locality.

    Of course, you are free to stop contributing on this topic, but I, for
    one, will miss your contributions.

    Hm, thanks. Maybe I'll look into it again.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Sun Aug 10 18:07:59 2025
    From Newsgroup: comp.arch

    On Tue, 05 Aug 2025 18:23:36 -0500, BGB wrote:

    That said, a lot of John's other ideas come off to me like straight up absurdity. So, I wouldn't hold up much hope personally for it to turn
    into much usable.

    While I think that not being able to be put to use isn't really one of the faults of the Concertina II ISA, the block structure, especially at its current level of complexity, is going to come across as quite weird to
    many, and I don't yet see any hope of achieving a drastic simplification
    in that area.

    Each of the sixteen block types serves one or another functionality which
    I see as necessary to give this ISA the breadth of application that I have
    as my goal.

    But I have introduced "scaled displacements" back in, allowing the
    augmented short instruction mode instruction set to be more powerful.

    John Savard

  • From BGB@cr88192@gmail.com to comp.arch on Sun Aug 10 18:59:29 2025
    From Newsgroup: comp.arch

    On 8/10/2025 1:07 PM, John Savard wrote:
    On Tue, 05 Aug 2025 18:23:36 -0500, BGB wrote:

    That said, a lot of John's other ideas come off to me like straight up
    absurdity. So, I wouldn't hold up much hope personally for it to turn
    into much usable.

    While I think that not being able to be put to use isn't really one of the faults of the Concertina II ISA, the block structure, especially at its current level of complexity, is going to come across as quite weird to
    many, and I don't yet see any hope of achieving a drastic simplification
    in that area.


    OK.

    I judge things here by a few criteria:
    Could be affordably implemented in hardware;
    Would be usable and useful;
    Mostly makes sense in terms of relative cost/benefit tradeoffs.

    I am a little more pessimistic on things that I don't really feel
    satisfy the above constraints.

    For comparison, RISC-V mostly satisfies the above, although:
    Many of the extensions are weaker on these points;
    Some of the encodings, and the 'C' extension in general,
    are badly dog chewed.


    Then again, my ISA has potentially ended up with an excess of niche-case format converter instructions and similar.


    Each of the sixteen block types serves one or another functionality which
    I see as necessary to give this ISA the breadth of application that I have
    as my goal.


    Many make it work with plain 32-bit or 16/32 encodings.

    Granted, I have ended up with more:
    16/32/64/96, depending on ISA.
    XG1, 16/32/64/96
    XG2, 32/64/96
    XG3, 32/64/96 (32/64 for RV ops)
    RV, 16/32/(48)/64


    Apparently, Huawei and similar have some 48-bit encodings defined for
    RV64. In my sensibilities, 48-bit only makes sense if one is already
    committed to 16 bit ops, but given how quickly they burnt through the
    encoding space; practically the 48-bit space would just end up being a space-saving subset of the 64-bit space (in my experimental attempt to
    deal with the 48-bit encodings, they were unpacked temporarily into the
    64-bit encoding space).

    Basically, they burnt through most of the 48-bit encoding space with a
    handful of Imm32 and a few Disp32 ops. If it were me I would have gone
    for Imm24 ops and had a little more encoding space left over.

    Did experimentally mock up a 48-bit scheme that did basically extend the 32-bit space to have Imm24 (adding 12 bits to each Imm/Disp for all the Imm12/Disp12 ops), but it was a little dog chewed. Could potentially
    lead to alternate encodings for Imm32 constant load and Disp32 branch (by
    adding 12 bits to LUI and JAL).
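
    The field widths in that mock-up work out as follows. This is a small Python sketch of the arithmetic only; it assumes the standard RISC-V base encodings (12-bit immediates for the Imm12/Disp12 forms, 20-bit immediate fields for LUI and JAL) and treats every field as a plain two's-complement value, ignoring the implicit scaling that JAL applies to its offset.

```python
def signed_range(bits):
    """Inclusive (min, max) of a two's-complement field of the given width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


EXTRA_BITS = 12        # bits gained by going from a 32-bit to a 48-bit encoding
IMM12_BITS = 12        # Imm12/Disp12 forms in the 32-bit encodings
LUI_JAL_BITS = 20      # immediate field width of LUI and JAL

widened_imm = IMM12_BITS + EXTRA_BITS        # the Imm24 forms
widened_lui_jal = LUI_JAL_BITS + EXTRA_BITS  # Imm32 load / Disp32 branch

print("Imm12:", signed_range(IMM12_BITS))
print("Imm24:", signed_range(widened_imm))
print("LUI/JAL widened to", widened_lui_jal, "bits")
```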

    One can argue though, which would they rather have:
    Pretty much all of the 32-bit immediate forms extended to 24 bits;
    Or, 32-bit immediate values,
    but only for a very limited range of ops.

    Though, I suspect for general use, extending the whole ISA to 24 bits
    might be "better" for average case code density (with 64-bit encodings
    for cases when one needs Imm32).

    Then again, I am on the fence about 48 bit encodings in general:
    Helps code density;
    Hurts performance for a cheap core;
    Say, if one doesn't want to spend the cost of dealing with superscalar
    for misaligned instructions and 16 bit ops (doing so would add
    significant resource cost).



    I did experiment with adding the C extension to BGBCC, and RV64GC+Jumbo
    can seemingly get decent code density.

    Granted, both are mostly similar here, both using 5-bit register fields.
    Though, XG1 16-bit ops mostly have access to 16 registers;
    And, RV-C ops mostly are a mix of 8 and 32 registers.

    Did experiment with a pair encoding for XG3 (X3C), which doesn't match
    either XG1 or RV64GC+Jumbo in terms of code density. But not too far off.

    At the moment (Doom ".text" size, static-linked C library):
    XG1: 275K
    XG2: 290K
    RV64GC+Jumbo: 295K (vs 350K RV+Jumbo, or 370K RV64GC)
    XG3+X3C: 305K (vs 320K)

    Granted, XG3 isn't designed for maximum code density, rather performance
    and being able to merge with RV64G.

    It is unclear if the improvement in code density (of X3C) would be worth
    the added decoder cost (and doesn't fit in with the existing decoder
    paths for XG1 or RVC; so would need something new/wacky to deal with it).

    Though, could deal with it (in the core) in a similar way to how I dealt
    with 48-bit ops, namely unpacking it to a 64-bit form (two instructions)
    after fetch.


    In theory, XG3 should be able to match XG2 code density as there isn't
    really anything that XG2 has that XG3 lacks that would significantly
    affect code density. XG3 did drop the 2RI-Imm10 ops, but these had
    largely become redundant. So, the main difference is likely related to
    BGBCC itself, which is mostly treating XG3 as an extension of its RV64G
    mode (which "suffers" slightly by having less usable callee save
    registers in the ABI, and fewer register arguments; but had on/off
    considered tweaking the ABI here).

    Though, if XG3 did match XG2 code density, X3C could potentially also
    reduce it to 275K.

    But, could just focus more on RV64GC here, as I sorta already needed it,
    and recently found/fixed a bug in the decoder in my CPU core that was
    stopping the 'C' extension from working (so now it seems to work).


    Though, to recap (X3C):
    X3C packs a 13 and 14 bit instruction together into a 32 bit word;
    Which serves a similar purpose to RVC;
    Though only allows instruction pairs which can safely co-execute.
    Instructions encode:
    MOV/ADD Rm5, Rn5
    LI/ADD/ADDW Imm5s, Rn5
    SUB/ADDW/ADDWU/AND/OR/XOR Rm3, Rn3
    SLL/SRL/SRA Rm3, Rn3
    SLLW/SRLW/SLAW/SRAW Rm3, Rn3
    SLL/SRL/SRA Imm3, Rn3
    SLLW/SRLW/SLAW/SRAW Imm3, Rn3
    And, for the 14-bit case:
    LD/SD/LW/SW Rn5, Disp5(SP)
    LD/SD/LW/SW Rn3, Disp2(Rm3)
    LB/LBU/LH/LHU Rn3, 0(Rm3)
    SB/SH Rn3, 0(Rm3)
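
    The packing itself can be illustrated with a short Python sketch. Only the 13-bit and 14-bit widths come from the description above; the 5-bit tag (whatever is left of the 32-bit word to mark it as a pair) and the ordering of the three fields are hypothetical, as are the names `pack_pair` and `unpack_pair`.

```python
TAG_BITS, OP13_BITS, OP14_BITS = 5, 13, 14  # field order is an assumption
assert TAG_BITS + OP13_BITS + OP14_BITS == 32


def pack_pair(tag, op13, op14):
    """Pack a tag, a 13-bit op and a 14-bit op into one 32-bit word."""
    for value, width in ((tag, TAG_BITS), (op13, OP13_BITS), (op14, OP14_BITS)):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{value:#x} does not fit in {width} bits")
    return tag | (op13 << TAG_BITS) | (op14 << (TAG_BITS + OP13_BITS))


def unpack_pair(word):
    """Split a 32-bit word back into (tag, op13, op14)."""
    tag = word & ((1 << TAG_BITS) - 1)
    op13 = (word >> TAG_BITS) & ((1 << OP13_BITS) - 1)
    op14 = (word >> (TAG_BITS + OP13_BITS)) & ((1 << OP14_BITS) - 1)
    return tag, op13, op14


word = pack_pair(0b10101, 0x1ABC, 0x2DEF)
print(f"{word:#010x}", unpack_pair(word))
```

    Unpacking after fetch into two ordinary instructions, as described for the 48-bit ops, would correspond to feeding `op13` and `op14` to separate expanders.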

    X3C was put into a hole in the encoding space that previously held the
    PrWEX space (in XG1/XG2), but PrWEX is N/A in XG3. The WEX space is N/A
    (used for RV encodings, and the large-constant instruction was replaced
    with the XG3's Jumbo Prefix). Granted, the scope of X3C is more limited
    than that of RV-C.


    But I have introduced "scaled displacements" back in, allowing the
    augmented short instruction mode instruction set to be more powerful.


    OK.

    Yeah, scaled displacements make sense.


    Ironically, another one of my complaints about RVC is that while they
    saved bits in the displacements, rather than doing something sane like changing scale based on type, they bit-sliced the displacements based on
    type in a way that means it effectively has unique displacement
    encodings for:
    LW, Disp(SP)
    SW, Disp(SP)
    LD, Disp(SP)
    SD, Disp(SP)
    LW, Disp(Reg3)
    SW, Disp(Reg3)
    LD, Disp(Reg3)
    SD, Disp(Reg3)
    Which is, groan...

    Would have been better, say, if all the encodings just sorta had Rd/Rs2
    in the same spot and then not had separate Load/Store encoding.
    IMHO, having Rd and Rs2 in the same location is a lesser evil than
    having twice as many displacement types.

    And, also adjusting scale is a lesser evil than separate bit slicing for
    each type.
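
    A Python sketch of the difference, for the Disp(Reg3) loads only. The bit positions in `c_lw_offset` and `c_ld_offset` follow the RISC-V compressed-instruction layout as I understand it (offset bits 5:3 in instruction bits 12:10 for both, then offset bits 2 and 6 for C.LW versus offset bits 7:6 for C.LD in instruction bits 6:5); `scaled_offset` is a hypothetical alternative, not an existing encoding, that reads the same five bits as one field and multiplies by the access size.

```python
def bits(word, hi, lo):
    """Extract word[hi:lo] as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def c_lw_offset(inst):
    # C.LW: offset[5:3] = inst[12:10], offset[2] = inst[6], offset[6] = inst[5]
    return (bits(inst, 12, 10) << 3) | (bits(inst, 6, 6) << 2) | (bits(inst, 5, 5) << 6)


def c_ld_offset(inst):
    # C.LD: offset[5:3] = inst[12:10], offset[7:6] = inst[6:5]
    return (bits(inst, 12, 10) << 3) | (bits(inst, 6, 5) << 6)


def scaled_offset(inst, access_bytes):
    # Hypothetical: one contiguous 5-bit field, scaled by the access size.
    field = (bits(inst, 12, 10) << 2) | bits(inst, 6, 5)
    return field * access_bytes


only_bit6 = 1 << 6  # the same instruction bit set in each case
print(c_lw_offset(only_bit6), c_ld_offset(only_bit6))
print(scaled_offset(only_bit6, 4), scaled_offset(only_bit6, 8))
```

    In the bit-sliced scheme, instruction bit 6 is worth 4 bytes for a word load and 128 bytes for a doubleword load, so each type needs its own extraction logic; in the scaled scheme the field extraction is shared and only the multiplier changes.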



    Though, it does lead to the partial irony that despite XG3 having a
    longer listing than RV64G, when I wrote a VM that did both RV64 and XG3,
    the XG3 decoder is smaller, due partly to "less dog chew".

    The decoder is bigger in the Verilog core, but this is mostly because
    XG1/2/3 all use a shared decoder. An XG3 exclusive decoder would be smaller.

    Though, maybe moot if one is also going to need a RISC-V decoder, unless
    I make a purely XG3 target that doesn't use any of the RV encodings.




    John Savard


  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Aug 11 10:27:08 2025
    From Newsgroup: comp.arch

    On 8/10/2025 11:07 AM, John Savard wrote:
    On Tue, 05 Aug 2025 18:23:36 -0500, BGB wrote:

    That said, a lot of John's other ideas come off to me like straight up
    absurdity. So, I wouldn't hold up much hope personally for it to turn
    into much usable.

    While I think that not being able to be put to use isn't really one of the faults of the Concertina II ISA,

    I am not sure what you are saying here. Is it that while you agree that
    at least some features cannot be put to use, but that isn't the fault of
    the ISA, or that the fault of not being able to be put to use doesn't
    exist in the ISA?


    the block structure, especially at its
    current level of complexity, is going to come across as quite weird to
    many, and I don't yet see any hope of achieving a drastic simplification
    in that area.

    Each of the sixteen block types serves one or another functionality which
    I see as necessary to give this ISA the breadth of application that I have
    as my goal.

    While I agree that they meet your goals (at least as I understand them),
    I think that you have two problems.

    Your goals, even if you meet them, aren't particularly useful, e.g. being "nearly" plug compatible with S/360

    There are *far* simpler ways to accomplish what most people really want
    to do.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Mon Aug 11 18:20:05 2025
    From Newsgroup: comp.arch

    On Mon, 11 Aug 2025 10:27:08 -0700, Stephen Fuld wrote:
    On 8/10/2025 11:07 AM, John Savard wrote:
    On Tue, 05 Aug 2025 18:23:36 -0500, BGB wrote:

    That said, a lot of John's other ideas come off to me like straight up
    absurdity. So, I wouldn't hold up much hope personally for it to turn
    into much usable.

    While I think that not being able to be put to use isn't really one of
    the faults of the Concertina II ISA,

    I am not sure what you are saying here. Is it that, while you agree that
    at least some features cannot be put to use, that isn't the fault of the
    ISA, or that the fault of not being able to be put to use doesn't exist
    in the ISA?

    What I was trying to say was that while the Concertina II ISA no doubt has
    many flaws, not being able to crank out useful work is, in my opinion, not
    one of them.

    On the other hand, driving insane those who attempt to program it or write
    compilers for it must be admitted to be an obstacle to making use of a
    given CPU, and so I must admit to its usability being limited in that
    manner.

    John Savard
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Mon Aug 11 18:33:14 2025
    From Newsgroup: comp.arch

    On Mon, 11 Aug 2025 10:27:08 -0700, Stephen Fuld wrote:

    Your goals, even if you meet them, aren't particularly useful, e.g. being
    "nearly" plug compatible with S/360.

    There are *far* simpler ways to accomplish what most people really want
    to do.

    Being plug-compatible with System/360 is not among the goals of my ISA.
    The term "plug-compatible" refers to... _plugs_, as one might guess.
    Nothing in my ISA talks about stuff like USB ports, Centronics parallel
    ports... or the kind of port IBM used to connect a 1403 printer to a
    System/360 computer.

    There are certainly far simpler ways to run System/360 code correctly.
    One can just set a mode bit to enter System/360 emulation, for example.

    What I'm doing with the Type V header is to provide a way to imitate the
    behavior of a System/360 program after code conversion. So one could write
    a special FORTRAN compiler to generate code using this header to allow a
    FORTRAN program running on the Concertina II to deliver the same results
    as on a System/360.

    And this isn't simple because it's buried deep down in the instruction set
    as an _afterthought_ within an ISA which is primarily designed to do the
    same sort of work as one might do with an x86-64 chip or a PowerPC chip or
    a SPARC chip even. And secondarily designed to be capable of
    implementations which shine at whatever the TMS20C6000 shines at, or even
    whatever, if anything, the Itanium was good for.

    It may not, however, be lost on implementors that a full implementation of
    the Type V header stuff ends up putting the needed circuitry on the die to
    *provide* a very nice System/360 emulation or implementation, which they
    might offer as an added feature not defined in the Concertina II
    specification.

    John Savard
  • From John Savard@quadibloc@invalid.invalid to comp.arch on Mon Aug 11 19:16:06 2025
    From Newsgroup: comp.arch

    On Mon, 11 Aug 2025 18:33:14 +0000, John Savard wrote:

    implementations which shine at whatever the TMS20C6000 shines at, or

    Oops, the TMS320C6000.

    John Savard