• Re: Tonights Tradeoff

    From Robert Finch@robfi680@gmail.com to comp.arch on Tue Oct 28 23:52:53 2025
    From Newsgroup: comp.arch

    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Same for other registers. GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.
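
    Roughly, in C terms (an illustrative sketch only; the predicate names
    and bit positions here are hypothetical, not the actual encoding):

    #include <stdint.h>

    /* CMP rd,ra,rb: pack several predicates into one GPR as a bit vector. */
    static uint64_t cmp_bits(int64_t a, int64_t b)
    {
        uint64_t r = 0;
        if (a == b)                     r |= 1ull << 0;  /* EQ  */
        if (a <  b)                     r |= 1ull << 1;  /* LT  */
        if ((uint64_t)a < (uint64_t)b)  r |= 1ull << 2;  /* LTU */
        return r;
    }

    /* BBS rn,#bit,target: take the branch if the selected bit is set. */
    #define BBS(r, bit) (((r) >> (bit)) & 1)
    /* e.g.: if (BBS(cmp_bits(x, y), 1)) goto less_than_case; */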

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10, 50, 90,
    or 130 bits.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Oct 29 00:14:08 2025
    From Newsgroup: comp.arch

    On 10/28/2025 8:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Same for other registers.

    I assume the "high" registers are for handling 128 bit operations
    without the need to specify another register name. Do you have 5 or 6
    bit register numbers in the instructions. Five allows you to use the
    high registers for 128 bit operations without needing another register specifier, but then the high registers can only be used for 128 bit operations, which seems a waste. If you have six bits, you can use all
    64 registers for any operation, but how is the "upper" method that
    better than automatically using r(x+1)?
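
    (For illustration only: whichever naming scheme is used, each 128-bit
    operand still occupies two 64-bit registers, e.g. A1:A1H vs R2:R3; the
    hedged C sketch below just shows the halves a 128-bit add has to touch.)

    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } u128;  /* low/high register halves */

    static u128 add128(u128 a, u128 b)
    {
        u128 r;
        r.lo = a.lo + b.lo;
        r.hi = a.hi + b.hi + (r.lo < a.lo);    /* carry out of the low half */
        return r;
    }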



    GPRs may contain either integer or
    floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those seem like a call from the My 66000 playbook, which I like.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 04:29:15 2025
    From Newsgroup: comp.arch

    On 10/28/2025 10:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Same for other registers. GPRs may contain either integer or floating-point values.


    OK.

    I mostly stuck with 32-bit encodings; 40 bits could allow more
    encoding space, but has the drawback of being non-power-of-2.

    But, yeah, occasionally dealing with 128-bit data is a major case for 64
    GPRs and paired registers.


    Well, that and when co-existing with RV64G, it gives somewhere to put
    the FPRs. But, in turn this was initially motivated by me failing to
    figure out how to get GCC configured to target Zfinx/Zdinx.


    Had ended up going with the Even/Odd pairing scheme as it is less wonky
    IMO to deal with R5:R4 than R36:R4.


    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.


    BT/BF works well. I otherwise also ended up using RISC-V style branches,
    which I originally disliked due to higher implementation cost, but they
    do technically allow for higher performance than just BT/BF or Branch-Compare-with-Zero in 2-R cases.

    So, it becomes harder to complain about a feature that does technically
    help with performance.


    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.


    Hmm...

    My case: 10/33/64.
    No direct 128-bit constant, but can use two 64-bit constants whenever
    128 bits is needed.



    Otherwise, goings on in my land:
    ISA development is slow, and had mostly turned into bug hunting;
    There are some unresolved bugs, but I haven't been able to fully hunt
    them down. A lot was in relation to RISC-V's C extension, but at least
    it seems like at this point the C extension is likely fully working.

    There haven't been many features that can usefully increase general-case
    performance. So, it is starting to seem like XG2 and XG3 may be fairly
    stable at this point.

    The longer term future is uncertain.


    My ISAs can beat RISC-V in terms of code-density and performance, but
    when RISC-V is extended with similar features, it is harder to make
    a case that it is "enough".

    Doesn't seem like (within the ISA) there are many obvious ways left to
    grab large general-case performance gains over what I have done already.

    Some code benefits from lots of GPRs, but harder to make the case that
    it reflects the general case.



    Recently got a new very-cheap laptop (a Dell Latitude 7490, for around
    $240), made some curious observations:
    It seems to slightly outperform my main PC in single-threaded performance;
    Its RAM timings don't seem to match the expected values.

    My main PC still wins at multi-threaded performance, and has the
    advantage of 7x more RAM.

    Had noted in Cinebench that my main PC is actually performing a little
    slower than is typical for the 2700X, but then again, it is effectively
    a 2700X running with DDR4-2133 rather than DDR4-2933. Partly this was a
    case of the RAM I have being unstable if run that fast (and in this
    case, more RAM but slightly slower seemed preferable to less RAM but
    slightly faster, or running it slightly faster but having the computer
    be crash-prone).

    They sold the RAM with its on-the-box speed being the XMP2 settings
    rather than the baseline settings, but the RAM in question didn't run
    reliably at the XMP or XMP2 settings (and I wasn't inclined to spend more;
    more so when there was already the annoyance that my MOBO chipset
    apparently doesn't deal with a full 128GB, but can tolerate 112GB, which
    is maybe not an ideal setup for perf).

    So, yeah, it seems that I have a setup where the 2700X is getting worse single-threaded performance than the i7 8650U in the laptop.

    Apparently, going by Cinebench scores, my PC's single threaded
    performance is mostly hanging out with a bunch of Xeons (getting a score
    in R23 of around 700 vs 950).

    Well, could be addressed, in theory, but would need some RAM that
    actually runs reliably at 2933 or 3200 MT/s and is also cheap...


    In both cases, they are CPUs originally released in 2018.

    Had noted, in a few tests:
    LZ4 benchmark (same file):
    Main PC: 3.3 GB/s
    Laptop: 3.9 GB/s
    memcpy (single threaded):
    Main PC: 3.8 GB/s
    Laptop : 5.6 GB/s
    memcpy (all threads):
    Main PC: ~ 15 GB/s
    Laptop : ~ 24 GB/s
    ( Like, what; thing only has 1 stick of RAM... *1 )

    *1: Also, how is a laptop with 1 stick of RAM matching a dual-socket
    Xeon E5410 with like 8 sticks of RAM...

    or, maybe it was just weak that my main PC was failing to beat the Xeon
    at this?... My main PC does at least beat the Xeon at single-threaded performance (was less true of my older Piledriver based PC).


    Granted, then again, I am using (almost) the cheapest MOBO I could find
    at the time (that had an OK number of RAM slots and SATA connectors).
    Can't quite identify the MOBO or chipset as I lost the box (and not
    clearly labeled on the MOBO itself); except that it is a
    something-or-another ASUS board.

    Like, at the time, IIRC:
    Went on Newegg;
    Picked mostly the cheapest parts on the site;
    Say, a Zen+ CPU being a lot cheaper than Zen 2,
    or pretty much anything from Intel.
    ...


    Did get a slightly fancy/beefy case, but partly this was because I was
    annoyed with the late-90s-era beige tower case I had been using. Which I
    had ended up hot gluing a bunch of extra PC fans into the thing in an
    attempt to keep airflow good enough so that it didn't melt. And
    under-clocking the CPU so that it could run reliably.

    Like, 4GHz Piledriver ran too hot and was unreliable, but was far more
    stable at 3.4 GHz. Was technically faster than a Phenom II underclocked
    to 2.8 GHz (for similar reasons).

    Where, at least the Zen+ doesn't overheat at stock settings (but, they
    also supplied the thing with a comparably much bigger stock CPU cooler).

    The case I got is slightly more traditional, with 5.25" bays and similar
    and mostly sheet-steel construction, Vs the "new" trend of mostly glass-covered-box PC cases. Sadly, it seems like companies have mostly
    stopped selling the traditional sheet-steel PC cases with open 5.25"
    bays. Like, where exactly is someone supposed to put their DVD-RW drive,
    or hot-swap HDD trays ?...

    Well, in the past we also had floppy drives, but the MOBOs removed the
    connectors, forcing one to now go the USB route if they want a floppy
    drive (but, now mostly moot as relatively few other computers still have
    floppy drives either).




    Well, in theory could build a PC with newer components and a bigger
    budget for parts. Still wouldn't want to go over to Win11, now it is a
    choice between jumping to Linux or "Windows Server" or similar (like, at
    least they didn't pollute Windows Server with a bunch of random
    pointless crap).

    For now, the inertia option is to just keep using Win10.


    As for the laptop, had noted:
    Can run Minecraft:
    Yes; though best results at an 8-chunk draw distance.
    Much more than this, and the "Intel UHD" graphics struggle.
    At 12 chunks, there is obvious chug.
    At 16 chunks, it starts dropping into single digit territory.
    Can run Doom3:
    Yes: Mostly gets 40-50 fps in Doom 3.

    My main PC can manage a 16-chunk draw distance in Minecraft and mostly
    gets a constant 63 fps in Doom3.

    Don't have many other newer games to test, as I mostly lost interest in
    modern "AAA" games. And, stuff like Doom+RTX, I already know this won't
    work. I can mostly just be happy that Minecraft works and is playable
    (and that its GPU is solidly faster than just using a software renderer...).


    On both fronts, this is a significant improvement over the older laptop.
    For the price, I sort of worried that it would be dead slow, but it significantly outperforms its Vista-era predecessor.

    This is mostly because I had noticed that, right now (unlike a few years
    ago), there are actually OK laptops at cheap prices (along with all the
    $80 Dell OptiPlex computers and similar on Amazon...).



    Otherwise, went and recently wrote up a spec partly based on a BASIC
    dialect I had used in one of my 3D engines, with some design cleanup: https://pastebin.com/2pEE7VE8

    Where I was able to get a usable implementation for something similar in
    a little over 1000 lines of C.

    Though, this was for an Unstructured BASIC dialect.


    Decided then to try something a little harder:
    Doing a small JavaScript like language, and trying to keep the
    interpreter small.

    I don't yet have the full language implemented, but for a partial JS
    like language, I currently have something in around 2500 lines of C.

    I had initially set a target estimate of 4-8 kLOC.
    Unless the remaining functionality ends up eating a lot of code, I am on target towards hitting the lower end of this range (need to get most of
    the rest of the core-language implemented within around 1.5 kLOC or so).

    Note: No 3rd party libraries allowed, only the normal C runtime library.
    Did end up using a few C99 features, but mostly still C95.


    For now, I was calling the language BS3L, where:
    Dynamically typed;
    Supports: Integers, Floating-Point, Strings, Objects, Arrays, ...
    JS style syntax;
    Nothing too exciting here.
    Still has JS style arrays and objects;
    Dynamically scoped.
    Where, dynamic scoping needs less code than lexical scoping;
    But, dynamic scoping is also a potential foot-gun as well.
    Not sure if too much of a foot-gun.
    Vs going to C-style scoping;
    Or, biting the bullet and properly implementing lexical scoping.
    Leaving out most advanced features.
    will be fairly minimal even vs early versions of JS.

    But, in some cases, was borrowing some design ideas from the BASIC interpreter. There were some unavoidable costs, such as in this case
    needing a full parser (that builds an AST) and an AST-walking
    interpreter. Unlike BASIC, it wouldn't be possible to implement an
    interpreter by directly walking and pattern matching lists of tokens.

    And, a parser that builds an AST, and code to walk said AST, necessarily
    needs more code.

    I guess, it is a question whether someone else could manage to implement a
    JavaScript style language in under 1000 lines of C while also writing
    "relatively normal" C (no huge blocks of obfuscated code or rampant abuse
    of the preprocessor). Or, basically, where one has to stick to similar C
    coding conventions to those used in Doom and Quake.


    I am not sure if this would be possible. Both the dynamic type-system
    and parser have eaten up a fair chunk of the code budget. A sub 1000
    line parser is also a little novel; but the parser itself got a little
    wonky and doesn't fully abstract over what it parses (as there is still
    a fair bit of bleed-over from the token stream). And, it sorta ended up abusing the use of binary operators a little.

    For example, it has wonk like dealing with lists of statements as-if
    there were a right-associative semicolon operator (allowing it to be
    walked like a linked list).
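
    A minimal sketch of that shape (node and function names here are
    hypothetical): statement sequences parse into right-nested ";" nodes,
    which the evaluator can then walk like a list:

    struct ast {
        const char *op;         /* ";", "=", "+", "call", "lit", ... */
        struct ast *lhs, *rhs;  /* binary children; NULL where unused */
    };

    static void eval_stmt(struct ast *n);   /* per-statement evaluator */

    /* "a; b; c" parses as (";" a (";" b c)) and is walked iteratively. */
    static void eval_block(struct ast *n)
    {
        while (n && n->op[0] == ';') {
            eval_stmt(n->lhs);
            n = n->rhs;
        }
        if (n)
            eval_stmt(n);
    }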

    There is slightly wonky operator tokenization again to save code:
    Separately matching every possible operator pattern is a bunch of extra
    logic. Was using rules that mostly give the correct operators, but with
    the possibility of nonsense operators. Also the precedence levels don't
    match up exactly, but this is a lower priority issue.


    I guess, if someone thinks they can do so in significantly less code,
    they can try.

    Note that while a language like Lua sort of resembles an intermediate
    between BASIC and JavaScript, I wouldn't expect Lua to save that much
    here (it would still have the cost of needing to build an AST and similar).

    Going from an AST to a bytecode or 3AC IR would allow for higher
    performance.

    But, I decided to go for an AST walking interpreter in this case as it
    would be the least LOC.


    Actually takes more effort trying to keep the code small. Rather than
    just copy-pasting stuff a bunch of times, one spends more time needing
    to try to factor out and reuse common patterns.


    Though, in a way, some of this is revisiting stuff I did 20+ years ago,
    but from a different perspective.

    Like, 20+ years ago, my first interpreters also used AST walkers.

    As for where I will go with this, I don't know.
    Some of it could make sense as a starting point for a GLSL compiler;
    Or maybe adapted into parsing the SCAD language;

    Or, as a cheaper alternative to what my first script VM became.
    By the end of its span, it had become quite massive...
    Though, still not too bad if compared with SpiderMonkey or V8.

    Ironically, my jump to a Java + JVM/.NET like design was actually to
    make it simpler.

    For a simple but slow language, JS works, but if you want it fast it
    quickly turns worse (and simpler to jump to a more traditional
    statically typed language). Like, there was this thing, known as "Hindley-Milner Type Inference", which on one hand, could be used to
    make a JavaScript style language fast (by turning it transparently into
    a statically-typed language), but also, was a huge PITA to deal with
    (this was combined in my VM with optional explicit type declarations;
    with a syntax inspired by ActionScript).


    Well, and when something gets big and complicated enough that one almost
    may as well just use SpiderMonkey or similar to run their JS code, this
    is a problem...

    Still less bad than LLVM, not sure why anyone would willingly submit to
    this.


    Well, there is still a surviving descendant of the original VM (although branching off from an earlier form) in the form of BGBCC.

    Though, makes more sense to do a clean interpreter in this case, than to
    try to build one by copy-pasting the parser from BGBCC or my old VM and
    trying to build a new lighter-weight VM.

    In some of these cases, it is easier to scale up than to scale back down:
    easier to take simpler code and add features or improve performance
    than to take more complex code and try to trim it down.


    And, sometimes it does make more sense to just write something starting
    from a clean slate.

    Well, except for my attempt at a clean-slate C compiler; that was
    more a case of realizing I wouldn't undershoot BGBCC by enough to be
    worthwhile, and there were some new problem points emerging in the
    design. Partly as I was trying to follow a model more like that used
    by GCC and binutils, which I was then left to suspect is not the right
    approach (and in some ways, the approach I had used in BGBCC seemed to
    make more sense than trying to imitate how GCC does things).

    Might still make sense at some point to try for another clean-slate C
    compiler, though if I would still end up taking a similar general
    approach to BGBCC (or .NET), there isn't a huge incentive (vs continuing
    to use BGBCC).

    Where, say, the main things that would ideally need to change are
    BGBCC's compile-time performance and memory footprint. As-is, compiling
    with BGBCC is about as slow as compiling with GCC, which isn't great.

    Comparably, MSVC is typically a bit faster at compiling stuff IME.


    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 08:41:46 2025
    From Newsgroup: comp.arch

    On 2025-10-29 3:14 a.m., Stephen Fuld wrote:
    On 10/28/2025 8:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit
    instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops.
    Registers are named as if there were 32 GPRs, A0 (arg 0 register is
    r1) and A0H (arg 0 high is r33). Same for other registers.

    I assume the "high" registers are for handling 128 bit operations
    without the need to specify another register name.  Do you have 5 or 6
    bit register numbers in the instructions.  Five allows you to use the
    high registers for 128 bit operations without needing another register specifier, but then the high registers can only be used for 128 bit operations, which seems a waste.  If you have six bits, you can use all
    64 registers for any operation, but how is the "upper" method that
    better than automatically using r(x+1)?

    Yes, but it is just a suggested usage. The registers are GPRs that can
    be used for anything, specified using a six bit register number. I
    suggested it that way because most of the time register values would be
    passed around as 64-bit quantities and it keeps the same set of
    registers for the same register type (argument, temp, saved). But since
    it should be using mostly compiled code, it does not make much difference.

    Also, the high registers could be used as FP registers. Maybe allowing
    for saving only the low order 32 regs during a context switch.

    GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch
    on bit-set/clear for conditional branches. Might also include branch
    true / false.

    Using operand routing for immediate constants and an operation size
    for the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be
    10,50,90 or 130 bits.

    Those seem like a call from the My 66000 playbook, which I like.

    Yup.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 08:50:35 2025
    From Newsgroup: comp.arch

    On 2025-10-29 8:41 a.m., Robert Finch wrote:
    On 2025-10-29 3:14 a.m., Stephen Fuld wrote:
    On 10/28/2025 8:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit
    instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops.
    Registers are named as if there were 32 GPRs, A0 (arg 0 register is
    r1) and A0H (arg 0 high is r33). Same for other registers.

    I assume the "high" registers are for handling 128 bit operations
    without the need to specify another register name.  Do you have 5 or 6
    bit register numbers in the instructions.  Five allows you to use the
    high registers for 128 bit operations without needing another register
    specifier, but then the high registers can only be used for 128 bit
    operations, which seems a waste.  If you have six bits, you can use
    all 64 registers for any operation, but how is the "upper" method that
    better than automatically using r(x+1)?

    Yes, but it is just a suggested usage. The registers are GPRs that can
    be used for anything, specified using a six bit register number. I
    suggested it that way because most of the time register values would be passed around as 64-bit quantities and it keeps the same set of
    registers for the same register type (argument, temp, saved). But since
    it should be using mostly compiled code, it does not make much difference.

    Also, the high registers could be used as FP registers. Maybe allowing
    for saving only the low order 32 regs during a context switch.

    GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch
    on bit-set/clear for conditional branches. Might also include branch
    true / false.

    Using operand routing for immediate constants and an operation size
    for the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be
    10,50,90 or 130 bits.

    Those seem like a call from the My 66000 playbook, which I like.

    Yup.


    I should mention that the high registers are available only in user/app
    mode. For other modes of operation only the low order 32 registers are available. I did this to reduce the number of logical registers in the
    design. There are about 160 (64+32+32+32) logical registers then. They
    are supported by 512 physical registers. My previous design had 224
    logical registers which eats up more hardware, probably for little benefit.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Oct 29 17:44:14 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Do you have 5 or 6
    bit register numbers in the instructions. Five allows you to use the
    high registers for 128 bit operations without needing another register
    specifier, but then the high registers can only be used for 128 bit
    operations, which seems a waste.

    These days, that's not so clear. E.g., Zen4 has 192 physical 512-bit
    SIMD registers, despite having only 256-bit wide FUs. The way I
    understand it, a 512-bit operation comes as one uop to the FU,
    occupies it for two cycles (and of course the result latency is
    extra), and then has a 512-bit result.

    The alternative would be to do as AMD did in some earlier cores,
    starting with (I think) K8: have registers that are half as wide and
    split each 512-bit operation into 2 256-bit uops that go through the
    OoO engine individually. This approach would allow more physical
    256-bit registers, and waste less on 32-bit, 64-bit, 128-bit and
    256-bit operations, but would cost additional decoding bandwidth,
    renaming bandwidth, renaming checkpoint size (a little), and scheduler
    space compared to the approach AMD has taken. Apparently the cost of
    this approach is higher than the benefit.

    Doubling the logical register size doubles the renamer checkpoint
    size, no? This way of avoiding "waste" looks quite a bit more
    expensive.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 13:04:42 2025
    From Newsgroup: comp.arch

    On 10/29/2025 7:50 AM, Robert Finch wrote:
    On 2025-10-29 8:41 a.m., Robert Finch wrote:
    On 2025-10-29 3:14 a.m., Stephen Fuld wrote:
    On 10/28/2025 8:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit
    instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops.
    Registers are named as if there were 32 GPRs, A0 (arg 0 register is
    r1) and A0H (arg 0 high is r33). Same for other registers.

    I assume the "high" registers are for handling 128 bit operations
    without the need to specify another register name.  Do you have 5 or
    6 bit register numbers in the instructions.  Five allows you to use
    the high registers for 128 bit operations without needing another
    register specifier, but then the high registers can only be used for
    128 bit operations, which seems a waste.  If you have six bits, you
    can use all 64 registers for any operation, but how is the "upper"
    method that better than automatically using r(x+1)?

    Yes, but it is just a suggested usage. The registers are GPRs that can
    be used for anything, specified using a six bit register number. I
    suggested it that way because most of the time register values would
    be passed around as 64-bit quantities and it keeps the same set of
    registers for the same register type (argument, temp, saved). But
    since it should be using mostly compiled code, it does not make much
    difference.

    Also, the high registers could be used as FP registers. Maybe allowing
    for saving only the low order 32 regs during a context switch.

    I am not as sure about this approach...

    Well, Low 32=GPR, High 32=FPR, makes sense, I did this.

    But, pairing a GPR and FPR for the 128-bit cases seems wonky; or
    subsetting registers on context switch seems like it could turn into a problem.


    Or, if a goal is to allow for encodings with a 5-bit register field,
    would make sense to use 32-bit encodings.

    Where, granted, 6b register fields in a 32-bit instruction does have the drawback of limiting how much encoding space exists for opcode and
    immediate (and one has to be more careful not to "waste" the encoding
    space as badly as RISC-V had done).

    Though, can note that both:
    R6+R6+Imm10
    R5+R5+Imm12
    Use the same amount of encoding space (22 bits of operand fields either way).
    But, R6+R6+R6 uses 3 bits more than R5+R5+R5 (18 vs 15).


    Though, one could debate my case, as I did effectively end up burning
    1/4 of the total encoding space mostly on Jumbo prefixes.

    ...



    GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a
    branch on bit-set/clear for conditional branches. Might also include
    branch true / false.

    Using operand routing for immediate constants and an operation size
    for the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be
    10,50,90 or 130 bits.

    Those seem like a call from the My 66000 playbook, which I like.

    Yup.>


    I should mention that the high registers are available only in user/app mode. For other modes of operation only the low order 32 registers are available. I did this to reduce the number of logical registers in the design. There are about 160 (64+32+32+32) logical registers then. They
    are supported by 512 physical registers. My previous design had 224
    logical registers which eats up more hardware, probably for little benefit.


    FWIW: I have gotten by OK with 128 internal registers:
    00..3F: Array-Mapped Registers (mostly the GPRs)
    40..7F: CRs and SPRs

    Mostly sufficient.

    For the array-mapped registers, these ones use LUTRAM, with a logical
    copy of the array per write port, and some control bits to encode which
    array currently holds the up-to-date copy of the register.

    All this gets internally replicated for each read port.

    So, roughly 18 internal copies of all of the registers with 6R3W (one
    copy per write port, replicated for each of the 6 read ports), but this
    is unavoidable (since LUTRAMs are 1R1W).


    The other option is using flip-flops, which is the strategy mostly used
    for the writable CRs and SPRs. This is done sparingly as the resource
    cost is higher in this case (at least on xilinx, *).

    *: Things went amiss on Altera; when I tried to build on it, I needed
    to use FFs for all the GPRs as well, as these FPGAs lack a direct
    equivalent of LUTRAMs and instead have smaller Block RAMs. The Lattice
    FPGAs also lack LUTRAM IIRC (but, my core doesn't map as well to
    Lattice FPGAs either).


    As for the CR/SPR space:
    Some of it is used for writable registers;
    A big chunk is used for internal read-only registers.
    ZZR, IMM, IMMB, JIMM, etc.
    ZZR: Zero Register / Null Register (Write)
    IMM: Immediate for current lane (33-bit, sign-ext).
    IMMB: Immediate from Lane 3.
    JIMM: 64-bit immediate spanning Lanes A and B.
    ...

    Could also be seen as C0..C63 (or, all control registers) except that
    much of C32..C63 is used for internal read-only SPRs, and a few other
    SPRs (DLR, DHR, and SP).

    Originally, the CRs and SPRs were handled as separate, but now things
    have gotten fuzzy (and, for RISC-V, some of the CRs need to be accessed
    in GPR like ways).

    There is some wonk as they were handled as separate modules, but with
    the current way things are done it would almost make more sense to fold
    all of the CRs into the GPR file module.

    The module might also continue to deal with forwarding, but might also
    make sense to have a RegisterFile module, possibly with a disjoint
    "Register Forwarding And Interlocks" style module (which forwards
    registers if the value is available and signals pipeline stalls as
    needed; this logic currently partly handled by the existing
    register-file module).



    Did experiment with a mechanism to allow bank-swapped registers. This
    would have added an internal 2-bit mode for the registers, and would
    stall the pipeline to swap the current registers with their bank-swapped versions if needed (with the registers internally backed to Block-RAM).
    Ended up mostly not using this though (at best, it wouldn't gain much
    over the existing "Load and Store everything to RAM" strategy; and would
    make context switching slower than it is already).

    It is more likely that a practical mechanism for fast bank swapping
    would need a mechanism to bank-swap the registers to external RAM. Or
    maybe a special "Stall and dump all the registers to this RAM Address" instruction.


    For the RISC-V CSRs:
    Part of the space maps to the CRs, and part maps to CPUID;
    For pretty much everything else, it traps.
    So, pretty much all of the normal RISC-V CSRs will trap.

    Ended up trapping for the RISC-V FPU CSRs as well:
    Rarely accessed;
    Rather than just one CSR for the FPU status, they broke it up into
    multiple sub-registers for parts of the register (like, there is a
    special CSR just for the rounding-mode, ...).

    Also the hardware only supports moving to/from a CR, so any more complex scenarios will also trap. They had gotten a little too fancy with this
    stuff IMO.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Oct 29 18:15:42 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Same for other registers. GPRs may contain either integer or floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some
    alignment that the first instruction of a cache line is always aligned?

    Having register pairs does not make the compiler writer's life easier, unfortunately.

    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.

    Having 64 registers and 64 bit registers makes life easier for that
    particular task :-)

    If you have that many bits available, do you still go for a load-store architecture, or do you have memory operations? This could offset the
    larger size of your instructions.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those sizes are not really a good fit for constants from programs,
    where quite a few constants tend to be 32 or 64 bits. Would a
    64-bit FP constant leave 26 bits empty?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Oct 29 11:29:54 2025
    From Newsgroup: comp.arch

    On 10/29/2025 10:44 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Do you have 5 or 6
    bit register numbers in the instructions. Five allows you to use the
    high registers for 128 bit operations without needing another register
    specifier, but then the high registers can only be used for 128 bit
    operations, which seems a waste.

    At this point, the discussion is academic, as Robert has said he has 6
    bit register specifiers in the instructions. But my issue had nothing
    to do with SIMD registers, as he said he supported 128 bit arithmetic
    and the "high" registers were used for that. e.g.

    Add A1,A2,A3 would be a 64 bit add on those registers but
    Add128 A1,A2,A3 would be a 128 bit add using A1H for the high order
    bits of the destination, etc. So the question becomes how is using
    Rn+32 better than using Rn+1?

    That being said, your points are well taken for a different implementation.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 29 18:33:46 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Same for other registers. GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.

    I have both the bit-vector compare and branch, but also a compare to zero
    and branch as a single instruction. I suggest you should too, if for no
    other reason than:

    if( p && p->next )

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    My 66000 allows for occasional use of 128-bit values but is designed mainly
    for 64-bit and smaller.

    With 32-bit instructions, I provide, {5, 16, 32, and 64}-bit constants.

    Just last week we discovered a case where HW can do a better job than SW. Previously, the compiler would emit:

    CVTfd Rt,Rf
    FMUL Rt,Rt,#1.425D0
    CVTdf Rd,Rt

    Which is subject to double rounding, once at the FMUL and again at the
    down conversion. I thought about the problem and it seems fairly easy
    to gate the 24-bit fraction into the multiplier tree along with the
    53-bit fraction of the constant, and then normalize and round the
    result dropping out of the tree--avoiding the double rounding case.

    Now, the compiler emits:

    FMULf Rd,Rf,#1.425D0

    saving 2 instructions along with the higher precision.
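
    In C terms, the old sequence computes something like the function below,
    rounding twice (once to double at the multiply, once more at the
    narrowing conversion); the fused FMULf instead rounds the exact 24x53-bit
    product once, directly to float, which plain C has no portable way to
    express:

    /* What the CVTfd/FMUL/CVTdf sequence computes. */
    static float scale_double_rounded(float x)
    {
        double t = (double)x * 1.425;   /* exact widen, then round to double */
        return (float)t;                /* second rounding, down to float    */
    }
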
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 29 18:47:09 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 10/28/2025 10:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Same for other registers. GPRs may contain either integer or floating-point values.


    OK.

    I mostly stuck with 32-bit encodings, but 40 could maybe allow more
    encoding space, but the drawback of being non-power-of-2.

    It is definitely an issue.

    But, yeah, occasionally dealing with 128-bit data is a major case for 64 GPRs and paired-registers registers.

    There is always the DBLE pseudo-instruction.

    DBLE Rd,Rs1,Rs2,Rs3

    All DBLE does is to provide more registers for the wide computation
    in such a way that the compiler is not forced to pair or share any
    registers. The other thing DBLE does is to tell the decoder that the
    next instruction is 2× as wide as its OpCode states. In lower end
    machines (and in GPUs) DBLE is sequenced as if it were an instruction.
    In higher end machines, DBLE would be CoIssued with its mate.

    ----------

    My case: 10/33/64.
    No direct 128-bit constant, but can use two 64-bit constants whenever
    128 bits is needed.

    {5, 16, 32, 64}-bit immediates.



    Otherwise, goings on in my land:
    ISA development is slow, and had mostly turned into bug hunting;
    <snip>

    The longer term future is uncertain.


    My ISA's can beat RISC-V in terms of code-density and performance, but
    when when RISC-V is extended with similar features, it is harder to make
    a case that it is "enough".

    I am still running at 70% of RISC-V's instruction count.

    Doesn't seem like (within the ISA) there are many obvious ways left to
    grab large general-case performance gains over what I have done already.

    Fewer instructions, and/or instructions that take fewer cycles to execute.

    Example, ENTER and EXIT instructions move 4 registers per cycle to/from
    cache in a pipeline that has 1 result per cycle.

    Some code benefits from lots of GPRs, but harder to make the case that
    it reflects the general case.

    There is very little to be gained with that many registers.

    Recently got a new very-cheap laptop (a Dell Latitude 7490, for around $240), made some curious observations:
    It seems to slightly outperform my main PC in single-threaded performance; Its RAM timings don't seem to match the expected values.

    My main PC still wins at multi-threaded performance, and has the
    advantage of 7x more RAM.

    My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 14:02:32 2025
    From Newsgroup: comp.arch

    On 10/29/2025 1:15 PM, Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Same for other registers. GPRs may contain either integer or
    floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some alignment that the first instruction of a cache line is always aligned?

    Having register pairs does not make the compiler writer's life easier, unfortunately.


    Yeah, and from the compiler POV, would likely prefer having Even+Odd pairs.

    Going with a bit result vector in any GPR for compares, then a branch on
    bit-set/clear for conditional branches. Might also include branch true /
    false.

    Having 64 registers and 64 bit registers makes life easier for that particular task :-)

    If you have that many bits available, do you still go for a load-store architecture, or do you have memory operations? This could offset the
    larger size of your instructions.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those sizes are not really a good fit for constants from programs,
    where quite a few constants tend to be 32 or 64 bits. Would a
    64-bit FP constant leave 26 bits empty?


    Agreed.

    From what I have seen, the vast bulk of constants tend to come in
    several major clusters:
    0 to 511: The bulk of all constants (peaks near 0, geometric fall-off)
    -64 to -1: Much of what falls outside 0 to 511.
    -32768 to 65535: Second major group
    -2G to +4G: Third group (smaller than second)
    64-bit: Another smaller spike.

    For values between 512 and 16384: Sparsely populated.
    Mostly the continued geometric fall-off from the near-0 peak.
    Likewise for values between 65536 and 1G.
    Values between 4G and 4E tend to be mostly unused.

    Like, in the sense of, if you have 33-bit vs 52 or 56-bit for a
    constant, the larger constants would have very little advantage (in
    terms of statistical hit rate) over the 33 bit constant (and, it isn't
    until you reach 64 bits that it suddenly becomes worthwhile again).
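
    (A hypothetical little helper of the sort one might use to gather such
    stats, bucketing each constant by the smallest sign-extended field that
    holds it, using a 10/17/33/64 split:)

    #include <stdint.h>

    static int imm_bucket(int64_t v)
    {
        if (v >= -(1 << 9)    && v < (1 << 9))       return 10;
        if (v >= -(1 << 16)   && v < (1 << 16))      return 17;
        if (v >= -(1ll << 32) && v < (1ll << 32))    return 33;
        return 64;
    }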


    Partly why I go with 33 bit immediate fields in the pipeline in my core,
    but nothing much bigger or smaller:
    Slightly smaller misses out on a lot, so almost may as well drop back to
    17 in this case;
    Going slightly bigger would gain pretty much nothing.

    Like, in the latter case, does sort of almost turn into a "go all the
    way to 64 bits or don't bother" thing.


    That said, I do use a 48-bit address space, so while in concept 48-bits
    could be useful for pointers: This is statistically insignificant in an
    ISA which doesn't encode absolute addresses in instructions.

    So, ironically, there are a lot of 48-bit values around, just pretty
    much none of them being encoded via instructions.


    Kind of a similar situation to function argument counts:
    8 arguments: Most of the functions;
    12: Vast majority of them;
    16: Often only a few stragglers remain.

    So, 16 gets like 99.95% of the functions, but maybe there are a few
    isolated ones taking 20+ arguments lurking somewhere in the code. One
    would then need to go up to 32 arguments to have reasonable confidence
    of "100%" coverage.

    Or, impose an arbitrary limit, where the stragglers would need to be
    modified to pass arguments using a struct or something.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Oct 29 13:05:08 2025
    From Newsgroup: comp.arch

    On 10/29/2025 11:47 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    snip
    But, yeah, occasionally dealing with 128-bit data is a major case for 64
    GPRs and paired-registers registers.

    There is always the DBLE pseudo-instruction.

    DBLE Rd,Rs1,Rs2,Rs3

    All DBLE does is to provide more registers for the wide computation
    in such a way that compiler is not forced to pair or share any reg-
    isters. The other thing DBLE does is to tell the decoder that the
    next instruction is 2× as wide as its OpCode states. In lower end
    machines (and in GPUs) DBLE is sequenced as if it were an instruction.
    In higher end machines, DBLE would be CoIssued with its mate.

    So if DBLE says the next instruction is double width, does that mean
    that all "128 bit instructions" require 64 bits in the instruction
    stream? So a sequence of say four 128 bit arithmetic instructions would require the I space of 8 instructions?

    If so, I guess it is a tradeoff for not requiring register pairing, e.g.
    Rn and Rn+1.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 15:58:40 2025
    From Newsgroup: comp.arch

    On 10/29/2025 1:47 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 10/28/2025 10:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Same for other registers. GPRs may contain either integer or
    floating-point values.


    OK.

    I mostly stuck with 32-bit encodings, but 40 could maybe allow more
    encoding space, but the drawback of being non-power-of-2.

    it is definitely an issue.

    But, yeah, occasionally dealing with 128-bit data is a major case for 64
    GPRs and paired-registers registers.

    There is always the DBLE pseudo-instruction.

    DBLE Rd,Rs1,Rs2,Rs3

    All DBLE does is to provide more registers for the wide computation
    in such a way that compiler is not forced to pair or share any reg-
    isters. The other thing DBLE does is to tell the decoder that the
    next instruction is 2× as wide as its OpCode states. In lower end
    machines (and in GPUs) DBLE is sequenced as if it were an instruction.
    In higher end machines, DBLE would be CoIssued with its mate.


    OK.

    In my case, a lot of the 128-bit operations are a single 32-bit
    instruction, which splits (in decode) to spanning multiple lanes (using
    the 6R3W register file as a virtual 3R1W 128-bit register file).

    In some cases, pairs of 64-bit SIMD instructions may be merged to send
    both through the SIMD unit at the same time. Say, as a special-case
    co-issue for 2x Binary32 ops (which can basically be handled the same as
    the 4x Binary32 scenario by the SIMD unit).

    ----------

    My case: 10/33/64.
    No direct 128-bit constant, but can use two 64-bit constants whenever
    128 bits is needed.

    {5, 16, 32, 64}-bit immediates.


    The reason 17 and 33 ended up slightly preferable is that both
    zero-extended and sign-extended 16 and 32 bit values are fairly common.

    And, if one has both a zero and sign extended immediate, this eats the
    same encoding space as having a 17-bit immediate, or a separate
    zero-extended and one-extended variant.

    There are a few 5/6 bit immediate instructions, but I didn't really
    count them.

    XG3's equivalent of SLTI and similar only has Imm6 encodings (can be
    extended to 33 bits with a jumbo prefix).



    There isn't much need for a direct 128-bit immediate though:
    This case is exceedingly rare;
    Register-pairs basically make it a non-issue;
    Even if it were supported:
    This would still require a 24-byte encoding...
    Which doesn't save anything over 2x 12-bytes.
    And doesn't gain much, apart from making the CPU more expensive.

    Someone could maybe do 20 bytes by using a 128-bit memory load, but with
    the usual drawbacks of using a memory load (BGBCC doesn't usually do
    this). The memory load will have a higher latency than a pair of
    immediate instructions.





    Otherwise, goings on in my land:
    ISA development is slow, and had mostly turned into bug hunting;
    <snip>

    The longer term future is uncertain.


    My ISA's can beat RISC-V in terms of code-density and performance, but
    when when RISC-V is extended with similar features, it is harder to make
    a case that it is "enough".

    I am still running at 70% RISC-Vs instruction count.


    Basically similar.

    XG3 also uses only 70% as many instructions as RV64G.

    But, if you throw Indexed Load/Store, Load/Store Pair, Jumbo Prefixes,
    etc, at the problem (on top of RISC-V), suddenly RISC-V becomes a lot
    more competitive (30% smaller and 50% faster).

    Haven't found a good way to improve much over this though...


    But, yeah, if comparing against RV64G as it exists in its standard form,
    there is a bit of room for improvement.



    Doesn't seem like (within the ISA) there are many obvious ways left to
    grab large general-case performance gains over what I have done already.

    Fewer instructions, and or instructions that take fewer cycles to execute.

    Example, ENTER and EXIT instructions move 4 registers per cycle to/from
    cache in a pipeline that has 1 result per cycle.

    Some code benefits from lots of GPRs, but harder to make the case that
    it reflects the general case.

    There is very little to be gained with that many registers.


    Granted.

    The main thing it benefits is things like TKRA-GL, ...

    Doom basically sees no real difference between 32 and 64 GPRs (nor does SW-Quake).


    Mostly matters for code where one has functions with around 100+ local variables... Which, are uncommon much outside of TKRA-GL or similar.


    As-is, SW-Quake is one of the cases that does well with RISC-V, though GL-Quake performs like hot dog-crap; mostly as TKRA-GL gets wrecked if
    it is limited to 32 registers and doesn't have SIMD.


    Only real saving point is when running with TKRA-GL over system calls in
    which case it runs in the kernel (as XG1) which is slightly less bad.
    For reasons, TestKern kinda still needs to be built as XG1.


    Recently got a new very-cheap laptop (a Dell Latitude 7490, for around
    $240), made some curious observations:
    It seems to slightly outperform my main PC in single-threaded performance;
    Its RAM timings don't seem to match the expected values.

    My main PC still wins at multi-threaded performance, and has the
    advantage of 7x more RAM.

    My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.

    Desktop PC:
    8C/16T: 3.7 Base, 4.3 Turbo, 112GB RAM (just, not very fast RAM)
    Rarely reaches turbo
    pretty much only happens if just running a single thread...
    With all cores running stuff in the background:
    Idles around 3.6 to 3.8.

    Laptop:
    4C/8T, 1.9 GHz Base, 4.2 GHz Turbo
    If power set to performance, reaches turbo a lot more easily,
    and with multi-core workloads.
    But, puts out a lot of heat while doing so...

    If set to Efficiency, mostly stays below 3 GHz.

    As noted, the laptop is surprisingly speedy for how cheap it was.

    For $240 I was paranoid it still might not have been fast enough to run
    Minecraft...


    Still annoyed as the RAM claimed like DDR4-3200 on the box, but doesn't
    run reliably at more than DDR4-2133... Like, you can try 3200 if you
    don't mind computer blue-screening after a few minutes I guess...



    But, without much RAM, nor enough SSD space to set up a huge pagefile,
    not going to try compiling LLVM on the thing.

    Even with all the RAM, a full rebuild of LLVM still takes several hours
    on my main PC (though, trying to build LLVM or GCC is at least slightly
    faster if one tells the AV software to stop grinding the CPU by looking
    at every file accessed).


    Vs the $80 OptiPlex that came with a 2C/4T Core i3 variant, that wasn't particularly snappy (seemed on-par with the Vista era laptop; though
    this has a 2C/2T CPU).

    Basically, was a small PC that was using mostly laptop-style parts
    internally (laptop DVD-RW drive and laptop style HDD); some sort of ITX
    MOBO layout I think.

    I don't remember there being any card slots; so like if you want to
    install a PCIe card or similar, basically SOL.

    But, it was either this or an off-brand NUC clone...


    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 29 21:52:54 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 10/29/2025 11:47 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    snip
    But, yeah, occasionally dealing with 128-bit data is a major case for 64
    GPRs and paired registers.

    There is always the DBLE pseudo-instruction.

    DBLE Rd,Rs1,Rs2,Rs3

    All DBLE does is to provide more registers for the wide computation
    in such a way that compiler is not forced to pair or share any reg-
    isters. The other thing DBLE does is to tell the decoder that the
    next instruction is 2× as wide as its OpCode states. In lower end
    machines (and in GPUs) DBLE is sequenced as if it were an instruction.
    In higher end machines, DBLE would be CoIssued with its mate.

    So if DBLE says the next instruction is double width, does that mean
    that all "128 bit instructions" require 64 bits in the instruction
    stream? So a sequence of say four 128 bit arithmetic instructions would require the I space of 8 instructions?

    It is a 64-bit machine that provides a small modicum of support for
    larger sizes. It is not and never will be a 128-bit machine--that is
    what vVM is for.

    Key words "small modicum"

    DBLE simply supplies registers to the pipeline and width to decode.

    If so, I guess it is a tradeoff for not requiring register pairing, e.g.
    Rn and Rn+1.

    DBLE supports 128-bits in the ISA at the total cost of 1 instruction
    added per use. In many situations (especially integer) CARRY is the
    better option because it throws a shadow of width over a number of
    instructions and thereby has lower code footprint costs. So, a 256
    bit shift is only 5 instructions instead of 8. And realistically, if
    you want wider than that, you have already run out of registers.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 18:01:17 2025
    From Newsgroup: comp.arch

    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Same for other registers. GPRs may contain either integer or
    floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some alignment that the first instruction of a cache line is always aligned?

    The 40-bit instructions are byte aligned. This does add more shifting in
    the align stage. Once shifted, though, instructions are easily peeled off
    from fixed positions. One consequence is jump targets must be byte
    aligned, OR routines could be required to be 32-bit aligned, for instance.
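
    (Roughly, in C, assuming a little-endian byte stream; names are just
    illustrative:)

    #include <stdint.h>
    #include <string.h>

    /* Fetch the 40-bit instruction word starting at an arbitrary byte PC. */
    static uint64_t fetch40(const uint8_t *istream, uint64_t byte_pc)
    {
        uint64_t w = 0;
        memcpy(&w, istream + byte_pc, 5);   /* little-endian host assumed  */
        return w & 0xFFFFFFFFFFull;         /* keep the low 40 bits        */
    }
    /* The next sequential instruction starts at byte_pc + 5. */
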
    Having register pairs does not make the compiler writer's life easier, unfortunately.

    Going with a bit result vector in any GPR for compares, then a branch on
    bit-set/clear for conditional branches. Might also include branch true /
    false.

    Having 64 registers and 64 bit registers makes life easier for that particular task :-)

    If you have that many bits available, do you still go for a load-store architecture, or do you have memory operations? This could offset the
    larger size of your instructions.

    It is load/store with no memory ops, except possibly atomic memory ops.
    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those sizes are not really a good fit for constants from programs,
    where quite a few constants tend to be 32 or 64 bits. Would a
    64-bit FP constant leave 26 bits empty?

    I found that 16-bit immediates could be encoded instead of 10-bit.
    So, now there are 16, 56, 96, and 136 bit constants possible. The
    56-bit constant likely has enough range for most 64-bit ops. Otherwise, using
    a 96-bit constant for 64-bit ops would leave the upper 32 bits of the
    constant unused. 136 bit constants may not be implemented, but a size
    code is reserved for that size.
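    As a sketch of what decoding one of these might look like (the packing of
    the extra 40-bit word and the sign extension are assumptions on my part,
    not the actual Qupls4 encoding):

        #include <stdint.h>

        /* Hypothetical 56-bit immediate: a 16-bit field in the base word plus
           one trailing 40-bit word, sign-extended to 64 bits. */
        static int64_t imm56(uint16_t base16, uint64_t ext40) {
            uint64_t raw = ((ext40 & 0xFFFFFFFFFFull) << 16) | base16;
            if (raw & (1ull << 55))
                raw |= 0xFF00000000000000ull;   /* replicate bit 55 upward */
            return (int64_t)raw;
        }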


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 18:20:51 2025
    From Newsgroup: comp.arch

    On 2025-10-29 2:33 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on
    bit-set/clear for conditional branches. Might also include branch true /
    false.

    I have both the bit-vector compare and branch, but also a compare to zero
    and branch as a single instruction. I suggest you should too, if for no
    other reason than:

    if( p && p->next )


    Yes, I was going to have at least branch on register 0 (false) 1 (true)
    as there is encoding room to support it. It does add more cases in the
    branch eval, but is probably well worth it.
    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    My 66000 allows for occasional use of 128-bit values but is designed mainly for 64-bit and smaller.


    Following the same philosophy. Expecting only some use for 128-bit
    floats. Integers can only handle 8,16,32, or 64-bits.

    With 32-bit instructions, I provide, {5, 16, 32, and 64}-bit constants.

    Just last week we discovered a case where HW can do a better job than SW. Previously, the compiler would emit:

    CVTfd Rt,Rf
    FMUL Rt,Rt,#1.425D0
    CVTdf Rd,Rt

    Which is subject to double rounding once at the FMUL and again at the
    down conversion. I thought about the problem and it seems fairly easy
    to gate the 24-bit fraction into the multiplier tree along with the
    53-bit fraction of the constant, and then normalize and round the
    result dropping out of the tree--avoiding the double rounding case.

    Now, the compiler emits:

    FMULf Rd,Rf,#1.425D0

    saving 2 instructions along with the higher precision.

    Improves the accuracy? of algorithms, but seems a bit specific to me.
    Are there other instruction sequences where double-rounding would be good
    to avoid? Seems like HW could detect the sequence and fuse the instructions.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 18:26:05 2025
    From Newsgroup: comp.arch

    <snip>>> My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.

    Desktop PC:
      8C/16T: 3.7 Base, 4.3 Turbo, 112GB RAM (just, not very fast RAM)
        Rarely reaches turbo
          pretty much only happens if just running a single thread...
        With all cores running stuff in the background:
          Idles around 3.6 to 3.8.

    Laptop:
      4C/8T, 1.9 GHz Base, 4.2 GHz Turbo
        If power set to performance, reaches turbo a lot more easily,
          and with multi-core workloads.
        But, puts out a lot of heat while doing so...

    If set to Efficiency, mostly stays below 3 GHz.

    As noted, the laptop is surprisingly speedy for how cheap it was.

    <snip>
    For my latest PC I bought a gaming machine – i7-14700KF CPU (20 cores).
    32 GB RAM, 16GB graphics RAM. 3.4 GHz (5.6 GHz in turbo mode). More RAM
    was needed, my last machine only had 16GB, found it using about 20GB. I
    did not want to spring for a machine with even more RAM, they tended to
    be high-end machines.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Oct 29 22:31:12 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    At this point, the discussion is academic, as Robert has said he has 6
    bit register specifiers in the instructions.

    He could still make these registers have 128 bits rather than pairing
    registers for 128-bit operation.

    But my issue had nothing
    to do with SIMD registers, as he said he supported 128 bit arithmetic
    and the "high" registers were used for that.

    As far as waste etc. is concerned, it does not matter if the 128-bit
    operation is a SIMD operation or a scalar 128-bit operation.

    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 18:48:56 2025
    From Newsgroup: comp.arch

    On 10/29/2025 5:26 PM, Robert Finch wrote:
    <snip>>> My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.

    Desktop PC:
       8C/16T: 3.7 Base, 4.3 Turbo, 112GB RAM (just, not very fast RAM)
         Rarely reaches turbo
           pretty much only happens if just running a single thread...
         With all cores running stuff in the background:
           Idles around 3.6 to 3.8.

    Laptop:
       4C/8T, 1.9 GHz Base, 4.2 GHz Turbo
         If power set to performance, reaches turbo a lot more easily,
           and with multi-core workloads.
         But, puts out a lot of heat while doing so...

    If set to Efficiency, mostly stays below 3 GHz.

    As noted, the laptop is surprisingly speedy for how cheap it was.

    <snip>
    For my latest PC I bought a gaming machine – i7-14700KF CPU (20 cores).
    32 GB RAM, 16GB graphics RAM. 3.4 GHz (5.6 GHz in turbo mode). More RAM
    was needed, my last machine only had 16GB, found it using about 20GB. I
    did not want to spring for a machine with even more RAM, they tended to
    be high-end machines.


    IIRC, current PC was something like:
    CPU: $80 (Zen+; Zen 2 and 3 were around, but more expensive)
    MOBO: $60
    Case: $50
    ...

    Spent around $200 for 128GB of RAM.
    Could have gotten a cheaper 64GB kit had I known my MOBO would not
    accept a full 128GB (then could have had 96 GB).


    The RTX card I have (RTX 3060) has 12 GB of VRAM.

    IIRC, it was also about the cheapest semi-modern graphics card I could
    find at the time. Like, while I could have bought an RTX 4090 or similar
    at the time, I am not made of money.

    Like, a prior-generation mid-range card being the cheaper option.
    And, still newer than the GTX980 that had died on me (where, the GTX980
    was itself second-hand).


    Before this, had been running a GTX 460, and before that, a Radeon HD
    4850 (IIRC).

    I think it was a case of:
    Had a Phenom II box, with the HD 4850;
    Switched to GTX 460, as I got one second-hand for free, slightly better;
    Replaced Phenom II board+CPU with FX-8350;
    Got GTX 980 (also second hand);
    Got Ryzen 7 2700X and new MOBO;
    Got RTX 3060 (as the 980 was failing).

    With the RTX 3060, had to go single-monitor, mostly as it only has
    DisplayPort outputs, and DP->HDMI->DVI via adapters doesn't seem to work (whereas HDMI->DVI did work via adapters).

    Well, also the RTX 3060 doesn't have a VGA output either (monitor would
    also accept VGA).

    Though, the current monitor I am using is newer and does support
    DisplayPort.


    I also managed to get a MultiSync CRT a while ago, but it only really
    gives good results at 640x480 and 800x600, 1024x768 sorta-works (but
    1280x1024 does not work), has a roughly 16" CRT or so; VGA input.

    I also have an LCD that goes up to 1280x1024, although it looks like
    garbage if set above 1024x768. Only accepts VGA.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Oct 30 07:13:54 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, >>> 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some
    alignment that the first instruction of a cache line is always aligned?

    The 40-bit instructions are byte aligned. This does add more shifting in
    the align stage. Once shifted though instructions are easily peeled off
    from fixed positions. One consequence is jump targets must be byte
    aligned OR routines could be required to be 32-bit aligned for instance.

    That raises an interesting question. If you want to align a branch
    target on a 32-bit boundary, or even a cache line, how do you fill
    up the rest? If all instructions are 40 bits, you cannot have a
    NOP that is not 40 bits, so there would need to be a jump before
    a gap that does not fit 40 bits.

    If you have that many bits available, do you still go for a load-store
    architecture, or do you have memory operations? This could offset the
    larger size of your instructions.

    It is load/store with no memory ops excepting possibly atomic memory ops.

    OK. Starting with 40 vs 32 bits, you have a factor of 1.25 disadvantage
    in code density to start with. Having memory operations could offset
    that by a certain factor, that was why I was asking.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those sizes are not really a good fit for constants from programs,
    where quite a few constants tend to be 32 or 64 bits. Would a
    64-bit FP constant leave 26 bits empty?

    I found that 16-bit immediates could be encoded instead of 10-bit.

    OK. That should also help for offsets in load/store.

    So, now there are 16,56,96 and 136 bit constants possible. The 56-bit constant likely has enough range for most 64-bit ops.

    For addresses, it will take some time for this to overflow :-)
    For floating point constants, that will be hard.

    I have done some analysis on frequency of floating point constants
    in different programs, and what I found was that there are a few
    floating point constants that keep coming up, like a few integers
    around zero (biased towards the positive side), plus a few more
    golden oldies like 0.5, 1.5 and pi. Apart from that, I found that
    different programs have wildly different floating point constants,
    which is not surprising. (I based that analysis on the grand
    total of three packages, namely Perl, gnuplot and GSL, so coverage
    is not really extensive).

    Otherwise using
    a 96-bit constant for 64-bit ops would leave the upper 32-bit of the constant unused.

    There are also 32-bit floating point constants, and 32-bit integers
    as constants. There are also very many small integer constants, but
    of course there also could be others.

    136 bit constants may not be implemented, but a size
    code is reserved for that size.

    I'm still hoping for good 128-bit IEEE hardware float support.
    POWER has this, but stuck it on their decimal float
    arithmetic, which does not perform very well...
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Oct 30 13:53:04 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, >>>> 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named >>>> as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high >>>> is r33). Sameo for other registers. GPRs may contain either integer or >>>> floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some
    alignment that the first instruction of a cache line is always aligned?

    The 40-bit instructions are byte aligned. This does add more shifting in
    the align stage. Once shifted though instructions are easily peeled off
    from fixed positions. One consequence is jump targets must be byte
    aligned OR routines could be required to be 32-bit aligned for instance.>

    That raises an interesting question. If you want to align a branch
    target on a 32-bit boundary, or even a cache line, how do you fill
    up the rest? If all instructions are 40 bits, you cannot have a
    NOP that is not 40 bits, so there would need to be a jump before
    a gap that is does not fit 40 bits.

    iCache lines could be a multiple of 5-bytes in size (e.g. 80 bytes
    instead of 64).

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 30 16:09:00 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-10-29 2:33 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, >> 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on >> bit-set/clear for conditional branches. Might also include branch true / >> false.

    I have both the bit-vector compare and branch, but also a compare to zero and branch as a single instruction. I suggest you should too, if for no other reason than:

    if( p && p->next )


    Yes, I was going to have at least branch on register 0 (false) 1 (true)
    as there is encoding room to support it. It does add more cases in the branch eval, but is probably well worth it.
    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    My 66000 allows for occasional use of 128-bit values but is designed mainly for 64-bit and smaller.


    Following the same philosophy. Expecting only some use for 128-bit
    floats. Integers can only handle 8,16,32, or 64-bits.

    With 32-bit instructions, I provide, {5, 16, 32, and 64}-bit constants.

    Just last week we discovered a case where HW can do a better job than SW. Previously, the compiler would emit:

    CVTfd Rt,Rf
    FMUL Rt,Rt,#1.425D0
    CVTdf Rd,Rt

    Which is subject to double rounding once at the FMUL and again at the
    down conversion. I though about the problem and it seems fairly easy
    to gate the 24-bit fraction into the multiplier tree along with the
    53-bit fraction of the constant, and then normalize and round the
    result dropping out of the tree--avoiding the double rounding case.

    Now, the compiler emits:

    FMULf Rd,Rf,#1.425D0

    saving 2 instructions along with the higher precision.

    Improves the accuracy? of algorithms, but seems a bit specific to me.

    It is down in the 1% footprint area.

    Are there other instruction sequence where double-rounding would be good
    to avoid?

    Back when I joined Moto (1983) there was a lot of talk about double
    roundings and how it could screw up various algorithms but mainly in
    the 64-bit versus 80-bit stuff of the 68881, where you got 11 more bits
    of precision and thus took a chance of 2/2^10 of a double rounding.
    Today with 32-bit versus 64-bit you take a chance of 2/2^28 so the
    problem is greatly ameliorated although technically still present.

    The problem arises due to a cross product of various {machine,
    language, compiler} features not working "all ends towards the middle".

    LLVM promotes FP calculations with a constant to 64-bits whenever the
    constant cannot be represented exactly in 32-bits. {Strike one}

    C makes no <useful> statements about precision of calculation control.
    {strike two}

    HW almost never provides mixed mode calculations which provide the
    means to avoid the double rounding. {strike three}

    So, technically, My 66000 does not provide general-mixed-mode FP,
    but I wrote a special rule to allow for larger constants used with
    narrower registers to cover exactly this case. {It also saves 2 CVT instructions (latency and footprint).}

    Seems like HW could detect the sequence and fuse the instructions.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 30 16:10:47 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    At this point, the discussion is academic, as Robert has said he has 6
    bit register specifiers in the instructions.

    He could still make these registers have 128 bits rather than pairing registers for 128-bit operation.

    But my issue had nothing
    to do with SIMD registers, as he said he supported 128 bit arithmetic
    and the "high" registers were used for that.

    As far as waste etc. is concerned, it does not matter if the 128-bit operation is a SIMD operation or a scalar 128-bit operation.

    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    Which only goes to prove that x86 is not RISC.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Oct 30 12:29:39 2025
    From Newsgroup: comp.arch

    On 10/30/2025 11:10 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    At this point, the discussion is academic, as Robert has said he has 6
    bit register specifiers in the instructions.

    He could still make these registers have 128 bits rather than pairing
    registers for 128-bit operation.


    Only really makes sense if one assumes these resources are "borderline
    free".

    If you are also paying for logic complexity and wires/routing, then
    having bigger registers just to typically waste most of them is not ideal.


    Granted, one could argue that most of the register is wasted when, say:
    Most integer values could easily fit into 16 bits;
    We have 64-bit registers.

    But, there is enough that actually uses the 64-bits of a 64-bit register
    to make it worthwhile. Would be harder to say the same for 128-bit
    registers.

    It is common on many 32-bit machines to use register pairs for 64-bit operations.


    But my issue had nothing
    to do with SIMD registers, as he said he supported 128 bit arithmetic
    and the "high" registers were used for that.

    As far as waste etc. is concerned, it does not matter if the 128-bit
    operation is a SIMD operation or a scalar 128-bit operation.

    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    Which only goes to prove that x86 is not RISC.


    Also questionable to read, as someone lacking much hardware that actually supports 256- or 512-bit AVX at the actual HW level. And, both AVX and
    AVX-512 had not exactly had clean roll-outs.


    Checks and, ironically, my recent super-cheap laptop was the first thing
    I got that apparently has proper 256-bit AVX support (still no AVX-512 though...).


    Still some oddities though:
    RAM that appears to be faster than it should be;
    The MHz and CAS latency appear abnormally high.
    They do not match the values for DDR4-2400.
    (Nor, even DDR4 in general).
    Appears to exceed expected bandwidth on memcpy test;
    ...
    Windows 11 on an unsupported CPU model;
    More so, Windows 11 Professional, also on something cheap.
    (Listing said it would come with Win10, got Win11 instead, OK).

    So, technically seems good, but also slightly sus...


    Differs slightly from what I was expecting:
    Something kinda old and not super fast;
    Listing said Windows 10, kinda expected Windows 10;
    ...

    Like, something non-standard may have been done here.


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 30 16:46:14 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    Which only goes to prove that x86 is not IRSC.

    I don't see that following at all, but it inspired a closer look at
    the usage/waste of register bits in RISCs:

    Every 64-bit RISC starting with MIPS-IV and Alpha, wastes a lot of
    precious register bits by keeping 8-bit, 16-bit, and 32-bit values in
    64-bit registers rather than following the idea of Intel and Robert
    Finch of splitting the 64-bit register into double the number of 32-bit registers; this idea can be extended to eliminate waste by having the
    quadruple number of 16-bit registers that can be joined into 32-bit
    and 64-bit registers when needed, or even better, the octuple number
    of 8-bit registers that can be joined to 16-bit, 32-bit, and 64-bit
    registers. We can even resurrect the character-oriented or
    digit-oriented architectures of the 1950s.

    Intel split AX into AL and AH, similar for BX, CX, and DX, but not for
    SI, DI, BP, and SP. In the 32-bit extension, they did not add ways to
    access the third and fourth byte, or the second wyde (16-bit value).
    In the 64-bit extension, AMD added ways to access the low byte of
    every register (in addition to AH-DH), but no way to access the second
    byte of other registers than RAX-RDX, nor ways to access higher wydes,
    or 32-bit units. Apparently they were not concerned about this kind
    of waste. For the 8086 the explanation is not trying to avoid waste,
    but an easy automatic mapping from 8080 code to 8086 code.

    Writing to AL-DL or AX-DX,SI,DI,BP,SP leaves the other bits of the
    32-bit register alone, which one can consider to be useful for storing
    data in those bits (and in case of AL, AH actually provides a
    convenient way to access some of the bits, and vice versa), but leads
    to partial-register stalls. The hardware contains fast paths for some
    common cases of partial-register writes, but AFAIK AH-DH do not get
    fast paths in most CPUs.

    By contrast, RISCs waste the other 24 or 56 bits on a byte load by zero-extending or sign-extending the byte.

    Alpha avoids wasting register bits for some idioms by keeping up to 8
    bytes in a register in SIMD style (a few years before the wave of SIMD extensions across the industry), but still provides no direct name for
    the individual bytes of a register.

    IIRC the original HPPA has 32 or so 64-bit FP registers, which they
    then split into 58? 32-bit FP registers. I don't know how they
    further evolved that feature.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Oct 30 17:58:34 2025
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, >>>>> 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named >>>>> as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high >>>>> is r33). Sameo for other registers. GPRs may contain either integer or >>>>> floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some
    alignment that the first instruction of a cache line is always aligned? >>>
    The 40-bit instructions are byte aligned. This does add more shifting in >>> the align stage. Once shifted though instructions are easily peeled off
    from fixed positions. One consequence is jump targets must be byte
    aligned OR routines could be required to be 32-bit aligned for instance.> >>
    That raises an interesting question. If you want to align a branch
    target on a 32-bit boundary, or even a cache line, how do you fill
    up the rest? If all instructions are 40 bits, you cannot have a
    NOP that is not 40 bits, so there would need to be a jump before
    a gap that is does not fit 40 bits.

    iCache lines could be a multiple of 5-bytes in size (e.g. 80 bytes
    instead of 64).

    There is a cache level (L2 usually, I believe) when icache and
    dcache are no longer separate. Wouldn't this cause problems
    or inefficiencies?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Oct 30 23:39:28 2025
    From Newsgroup: comp.arch

    On Thu, 30 Oct 2025 16:46:14 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    Alpha avoids wasting register bits for some idioms by keeping up to 8
    bytes in a register in SIMD style (a few years before the wave of SIMD extensions across the industry), but still provides no direct name for
    the individual bytes of a register.


    According to my understanding, EV4 had no SIMD-style instructions.
    They were introduced in EV5 (Jan 1995). Which makes it only ~6 months
    ahead of VIS in UltraSPARC.




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 30 22:00:50 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    Which only goes to prove that x86 is not RISC.

    I don't see that following at all, but it inspired a closer look at
    the usage/waste of register bits in RISCs:

    Every 64-bit RISC starting with MIPS-IV and Alpha, wastes a lot of
    precious register bits by keeping 8-bit, 16-bit, and 32-bit values in
    64-bit registers rather than following the idea of Intel and Robert
    Finch of splitting the 64-bit register in the double number of 32-bit registers; this idea can be extended to eliminate waste by having the quadruple number of 16-bit registers that can be joined into 32-bit
    anbd 64-bit registers when needed, or even better, the octuple number
    of 8-bit registers that can be joined to 16-bit, 32-bit, and 64-bit registers. We can even ressurrect the character-oriented or
    digit-oriented architectures of the 1950s.

    Consider that being able to address every 2^(3+n) field of a register
    is far from free. Take a simple add of 2 bytes::

    ADDB R8[7], R6[3], R19[4]

    One has to individually align each of the bytes, which is going to blow
    out your timing for forwarding by at least 3 gates of delay (operands)
    and 4 gates for the result (register). The only way it makes "timing"
    sense is if you restrict the patterns to::

    ADDB R8[7], R6[7], R19[7]

    Where there is no "vertical" routing in obtaining operands and delivering results. {{OR you could always just eat a latency cycle when all fields
    are not the same.}}

    I also suspect that you would gain few compiler writers to support random fields in registers.

    Intel split AX into AL and AH, similar for BX, CX, and DX, but not for
    SI, DI, BP, and SP.

    {ABCD}X registers were data.
    {SDBS} registers were pointer registers.

    There are vanishingly few useful manipulations on parts of pointers.

    Oh and BTW:: using x86-history as justification for an architectural
    feature is "bad style".

    In the 32-bit extension, they did not add ways to
    access the third and fourth byte, or the second wyde (16-bit value).
    In the 64-bit extension, AMD added ways to access the low byte of
    every register (in addition to AH-DH), but no way to access the second
    byte of other registers than RAX-RDX, nor ways to access higher wydes,
    or 32-bit units. Apparently they were not concerned about this kind
    of waste. For the 8086 the explanation is not trying to avoid waste,
    but an easy automatic mapping from 8080 code to 8086 code.

    Writing to AL-DL or AX-DX,SI,DI,BP,SP leaves the other bits of the
    32-bit register alone, which one can consider to be useful for storing
    data in those bits (and in case of AL, AH actually provides a
    conventient way to access some of the bits, and vice versa), but leads
    to partial-register stalls. The hardware contains fast paths for some
    common cases of partial-register writes, but AFAIK AH-DH do not get
    fast paths in most CPUs.

    By contrast, RISCs waste the other 24 of 56 bits on a byte load by zero-extending or sign-extending the byte.

    But gains the property that the whole register contains 1 proper value {range-limited to the container size whence it came}. This in turn makes tracking values easy--in fact placing several different sized values
    in a single register makes it essentially impossible to perform value
    analysis in the compiler.

    Alpha avoids wasting register bits for some idioms by keeping up to 8
    bytes in a register in SIMD style (a few years before the wave of SIMD extensions across the industry), but still provides no direct name for
    the individual bytes of a register.

    If your ISA has excellent support for statically positioned bit-fields
    (or even better with dynamically positioned bit fields) fetching the
    fields and depositing them back into containers does not add significant latency. {volatile notwithstanding} While poor ISA support does add
    significant latency.

    IIRC the original HPPA has 32 or so 64-bit FP registers, which they
    then split into 58? 32-bit FP registers. I don't know how they
    further evolved that feature.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 30 22:06:35 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named >>>>> as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high >>>>> is r33). Sameo for other registers. GPRs may contain either integer or >>>>> floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some >>>> alignment that the first instruction of a cache line is always aligned? >>>
    The 40-bit instructions are byte aligned. This does add more shifting in >>> the align stage. Once shifted though instructions are easily peeled off >>> from fixed positions. One consequence is jump targets must be byte
    aligned OR routines could be required to be 32-bit aligned for instance.> >>
    That raises an interesting question. If you want to align a branch >>target on a 32-bit boundary, or even a cache line, how do you fill
    up the rest? If all instructions are 40 bits, you cannot have a
    NOP that is not 40 bits, so there would need to be a jump before
    a gap that is does not fit 40 bits.

    iCache lines could be a multiple of 5-bytes in size (e.g. 80 bytes
    instead of 64).

    There is a cache level (L2 usually, I believe) when icache and
    dcache are no longer separate. Wouldn't this cause problems
    or inefficiencies?

    Consider trying to invalidate an ICache line--this requires looking
    at 2 DCache lines to see if they, too, need invalidation.

    Consider self-modifying code, the data stream overwrites an instruction,
    then later the FETCH engine runs over the modified line, but the modified
    line is 64-bytes of the needed 80-bytes, so you take a hit and a miss on
    a single fetch.

    It also prevents SNARFing updates to ICache instructions, unless the
    SNARFed data is entirely retained in a single ICache line.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 30 22:19:18 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    According to my understanding, EV4 had no SIMD-style instructions.

    My understanding is that CMPBGE and ZAP(NOT), both SIMD-style
    instructions, were already present in EV4. The architecture
    description <https://download.majix.org/dec/alpha_arch_ref.pdf> does
    not say that some implementations don't include these instructons in
    hardware, whereas for the Multimedia support instructions (Section
    4.13), the reference does say that.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Oct 31 00:57:42 2025
    From Newsgroup: comp.arch

    On Thu, 30 Oct 2025 22:19:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    According to my understanding, EV4 had no SIMD-style instructions.

    My understanding ist that CMPBGE and ZAP(NOT), both SIMD-style
    instructions, were already present in EV4.


    Yes, those were in EV4.

    Alpha 21064 and Alpha 21064A HRM is here: https://github.com/JonathanBelanger/DECaxp/blob/master/ExternalDocumentation

    I didn't consider these instructions as SIMD. Maybe I should have.
    Looks like these instructions are intended to accelerate string
    processing. That's unusual for the first wave of SIMD extensions.
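    For context, the kind of byte-parallel trick such compare-byte
    instructions accelerate is scanning a word at a time for a zero byte
    (the inner loop of strlen/strcmp on a machine without byte loads).  A
    portable C equivalent of that test, using the standard SWAR constants
    rather than Alpha assembly:

        #include <stdint.h>

        /* Nonzero iff any of the 8 bytes in w is 0x00. */
        static int has_zero_byte(uint64_t w) {
            return ((w - 0x0101010101010101ull) & ~w & 0x8080808080808080ull) != 0;
        }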

    The architecture
    description <https://download.majix.org/dec/alpha_arch_ref.pdf> does
    not say that some implementations don't include these instructons in hardware, whereas for the Multimedia support instructions (Section
    4.13), the reference does say that.

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Oct 31 14:48:41 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 30 Oct 2025 22:19:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    My understanding ist that CMPBGE and ZAP(NOT), both SIMD-style
    instructions, were already present in EV4.
    ...
    I didn't consider these instructions as SIMD. May be, I should have.

    They definitely are, but they were not touted as such at the time, and
    they use the GPRs, unlike most SIMD extensions to instruction sets.

    Looks like these instructions are intended to accelerated string
    processing. That's unusual for the first wave of SIMD extensions.

    Yes. This was pre-first-wave. The Alpha architects just wanted to
    speed up some common operations that would otherwise have been
    relatively slow thanks to Alpha initially not having BWX instructions. Ironically, when Alpha showed a particularly good result on some
    benchmark (maybe Dhrystone), someone claimed that these string
    instructions gave Alpha an unfair advantage.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Oct 31 13:21:45 2025
    From Newsgroup: comp.arch

    On 10/31/2025 9:48 AM, Anton Ertl wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 30 Oct 2025 22:19:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    My understanding ist that CMPBGE and ZAP(NOT), both SIMD-style
    instructions, were already present in EV4.
    ...
    I didn't consider these instructions as SIMD. May be, I should have.

    They definitely are, but they were not touted as such at the time, and
    they use the GPRs, unlike most SIMD extensions to instruction sets.

    Looks like these instructions are intended to accelerated string
    processing. That's unusual for the first wave of SIMD extensions.

    Yes. This was pre-first-wave. The Alpha architects just wanted to
    speed up some common operations that would otherwise have been
    relatively slow thanks to Alpha initially not having BWX instructions. Ironically, when Alpha showed a particularly good result on some
    benchmark (maybe Dhrystone), someone claimed that these string
    instructions gave Alpha an unfair advantage.


    Most likely Dhrystone:
    It shows disproportionate impact from the relative speed of things like "strcmp()" and integer divide.


    I had experimented with special instructions for packed search, which
    could be used to help with either string compare of implementing
    dictionary objects in my usual way.


    Though, had later fallen back to a more generic way of implementing
    "strcmp()" that could allow more fair comparison between my own ISA and RISC-V. Where, say, one instead makes the determination based on how efficiently the ISA can handle various pieces of C code (rather than the
    use of niche instructions that typically require hand-written ASM or
    similar).



    Generally, makes more sense to use helper instructions that have a
    general impact on performance, say for example, affecting how quickly a
    new image can be drawn into VRAM.

    For example, in my GUI experiments:
    Most of the programs are redrawing the screens as, say, 320x200 RGB555.

    Well, except ROTT, which uses 384x200 8-bit, on top of a bunch of code
    to mimic planar VGA behavior. In this case, for the port it was easier
    to write wrapper code to fake the VGA weirdness than to try to rewrite
    the whole renderer to work with a normal linear framebuffer (like what
    Doom and similar had used).


    In a lot of the cases, I was using an 8-bit indexed color or color-cell
    mode. For indexed color, one needs to send each image through a palette conversion (to the OS color palette); or run a color-cell encoder.
    Mostly because the display HW used 128K of VRAM.

    And, even if RAM backed, there are bandwidth problems with going bigger;
    so higher-resolutions had typically worked to reduce the bits per pixel:
    320x200: 16 bpp
    640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
    800x600: 2 or 4 bpp color-cell
    1024x768: 1 bpp monochrome, other experiments (*1)
    Or, use the 2 bpp mode, for 192K.

    *1: Bayer Pattern Mode/Logic (where the pattern of pixels also encodes
    the color);
    One possibility also being to use an indexed color pair for every 8x8, allowing for a 1.25 bpp color cell mode.

    Though, thus far the 1024x768 mode is still mostly untested on real
    hardware.

    Had experimented some with special instructions to speed up the indexed
    color conversion and color-cell encoding, but had mostly gone back and
    forth between using helper instructions and normal plain C logic, and
    which exact route to take.

    Had at one point had a helper instruction for the "convert 4 RGB555
    colors to 4 indexed colors using a hardware palette", but this broke
    when I later ended up modifying the system palette for better results
    (which was a critical weakness of this approach). Also the naive
    strategy of using a 32K lookup table isn't great, as this doesn't fit
    into the L1 cache.


    So, for 4 bpp color cell:
    Generally, each block of 4x4 pixels understood as 2 RGB555 endpoints,
    and 2 selector bits per pixel. Though, in VRAM, 4 of these are packed
    into a logical 8x8 pixel block; rather than a linear ordering like in
    DXT1 or similar (specifics differ, but general concept is similar to DXT1/S3TC).

    The 2bpp mode generally has 8x8 pixels encoded as 1bpp in raster order
    (same order as a character cell, with MSB in top-left corner and LSB in lower-right corner). And, then typically 2x RGB555 over the 8x8 block.
    IIRC, had also experimented with having each 4x4 sub-block able to use a
    pair of RGB232 colors, but was harder to get good results.

    But, to help with this process, it was useful to have helper operations
    for, say (a rough C sketch follows this list):
    Map RGB555 values to a luma value;
    Select minimum and maximum RGB555 values for block;
    Map luma values to 1 or 2 bit selectors;
    ...
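    A very rough sketch of those per-block steps in plain C, assuming RGB555
    input pixels (illustrative only, not the actual encoder):

        #include <stdint.h>

        /* cheap luma-ish weighting of an RGB555 pixel */
        static uint32_t rgb555_luma(uint16_t c) {
            uint32_t r = (c >> 10) & 31, g = (c >> 5) & 31, b = c & 31;
            return 2 * r + 5 * g + b;
        }

        /* 4x4 block: pick min/max endpoints by luma, then a 2-bit selector
           per pixel along the min..max axis (packed into 32 bits). */
        static void encode_cell4x4(const uint16_t px[16],
                                   uint16_t *minc, uint16_t *maxc, uint32_t *sel)
        {
            uint32_t ymin = ~0u, ymax = 0, i;
            for (i = 0; i < 16; i++) {
                uint32_t y = rgb555_luma(px[i]);
                if (y <  ymin) { ymin = y; *minc = px[i]; }
                if (y >= ymax) { ymax = y; *maxc = px[i]; }
            }
            *sel = 0;
            for (i = 0; i < 16; i++) {
                uint32_t y = rgb555_luma(px[i]);
                uint32_t t = (ymax > ymin) ? (4 * (y - ymin)) / (ymax - ymin + 1) : 0;
                *sel |= (t & 3) << (2 * i);
            }
        }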


    Internally, the GUI mode had worked by drawing everything to an RGB555 framebuffer (~ 512K or 1MB) and then using a bitmap to track which
    blocks had been modified and need to be re-encoded and sent over to VRAM (partly by first flagging during window redraw, then comparing with a
    previous version of the framebuffer and tracking when pixel-blocks will
    differ to refine the selection of blocks that need redraw, copying over
    blocks as needed to keep track of these buffers).
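    A compressed sketch of that dirty-block tracking (names and layout are
    illustrative, not the actual GUI code):

        #include <stdint.h>
        #include <string.h>

        /* Compare each 8x8 block of the RGB555 framebuffer against the shadow
           copy of the previous frame; mark changed blocks in a bitmap and
           refresh the shadow copy for the rows that differ. */
        static void mark_dirty(const uint16_t *cur, uint16_t *prev,
                               int w, int h, uint8_t *dirty)
        {
            int nbx = w / 8, bx, by, y;
            for (by = 0; by < h / 8; by++) {
                for (bx = 0; bx < nbx; bx++) {
                    int differ = 0;
                    for (y = 0; y < 8; y++) {
                        size_t off = (size_t)(by * 8 + y) * w + bx * 8;
                        if (memcmp(cur + off, prev + off, 8 * sizeof(uint16_t))) {
                            differ = 1;
                            memcpy(prev + off, cur + off, 8 * sizeof(uint16_t));
                        }
                    }
                    if (differ) {
                        int idx = by * nbx + bx;
                        dirty[idx >> 3] |= 1u << (idx & 7);
                    }
                }
            }
        }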

    Process wasn't particularly efficient (and performance is considerably
    worse than what Win3.x or Win9x seemed to give).



    As for the packed-search instructions, there were 16-bit versions as
    well, which could be used either to help with UTF-16 operations; or for dictionary objects.

    Where, a common way I implement dictionary objects is to use arrays of
    16-bit keys with 64-bit values (often tagged values or similar).

    Though, this does put a limit on the maximum number of unique symbols
    that can be used as dictionary keys, but this is not often an issue in practice. Generally these are not QNames or C function names, so this reduces the issue
    of running out of symbol names somewhat.
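    A minimal sketch of that layout (hypothetical names; a packed-search
    instruction would essentially run this loop several keys at a time):

        #include <stddef.h>
        #include <stdint.h>

        /* parallel arrays: 16-bit symbol keys, 64-bit (tagged) values */
        typedef struct {
            uint16_t *keys;
            uint64_t *vals;
            size_t    n;
        } dict16;

        static int dict16_lookup(const dict16 *d, uint16_t key, uint64_t *out) {
            for (size_t i = 0; i < d->n; i++) {
                if (d->keys[i] == key) { *out = d->vals[i]; return 1; }
            }
            return 0;   /* not found */
        }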

    One can also differ though on how much sense it makes to have
    ISA level helpers for working with tagrefs and similar (or, getting the
    ABI involved with these matters, like defining in the ABI the encodings
    for things like fixnum/flonum/etc).

    ...


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Oct 31 14:32:00 2025
    From Newsgroup: comp.arch

    On 10/31/2025 1:21 PM, BGB wrote:

    ...


    In a lot of the cases, I was using an 8-bit indexed color or color-cell mode. For indexed color, one needs to send each image through a palette conversion (to the OS color palette); or run a color-cell encoder.
    Mostly because the display HW used 128K of VRAM.

    And, even if RAM backed, there are bandwidth problems with going bigger;
    so higher-resolutions had typically worked to reduce the bits per pixel:
       320x200: 16 bpp
       640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
       800x600: 2 or 4 bpp color-cell
      1024x768: 1 bpp monochrome, other experiments (*1)
        Or, use the 2 bpp mode, for 192K.

    *1: Bayer Pattern Mode/Logic (where the pattern of pixels also encodes
    the color);
    One possibility also being to use an indexed color pair for every 8x8, allowing for a 1.25 bpp color cell mode.



    Expanding on this:
    Idea 1, original:
    Each group of 2x2 pixels understood as:
    G R
    B G
    With each pixel alternating color.

    But, slightly better for quality is to operate on blocks of 4x4 pixels,
    with the pixel bits encoding color indirectly for the whole 4x4 block:
    G R G B
    B G R G
    G R G B
    B G R G
    So, if >= 4 G bits are set, G is High.
    So, if >= 2 R bits are set, R is High.
    So, if >= 2 B bits are set, B is High.
    If > 8 bits are set, I is high.

    The non-set pixels usually assuming either 0000 (Black) or 1000 (Dark
    Grey) depending on I bit. Or, a low intensity version of the main color
    if over 75% of a given bit are set in a given way (say, for mostly flat
    color blocks).

    Still kinda sucks, but allows a crude approximation of 16 color graphics
    at 1 bpp...
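    A sketch of the block-decode side of that scheme (assuming the 16 pattern
    bits are stored raster-order with the MSB at the top-left, as in the 2 bpp
    mode; the output packing is arbitrary here):

        #include <stdint.h>

        static int popcount16(uint16_t x) {
            int n = 0;
            while (x) { n += x & 1; x >>= 1; }
            return n;
        }

        /* masks follow the G R G B / B G R G layout above */
        static uint8_t decode_block_irgb(uint16_t bits) {
            int g = popcount16(bits & 0xA5A5) >= 4;   /* 8 G positions */
            int r = popcount16(bits & 0x4242) >= 2;   /* 4 R positions */
            int b = popcount16(bits & 0x1818) >= 2;   /* 4 B positions */
            int i = popcount16(bits) > 8;             /* intensity */
            return (uint8_t)((i << 3) | (r << 2) | (g << 1) | b);
        }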


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Oct 31 21:09:23 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:
    Improves the accuracy? of algorithms, but seems a bit specific to me.

    It is down in the 1% footprint area.

    Are there other instruction sequence where double-rounding would be good
    to avoid?

    Back when I joined Moto (1983) there was a lot of talk about double
    roundings and how it could screw up various algorithms but mainly in
    the 64-bit versus 80-bit stuff of 68881, where you got 11-more bits
    of precision and thus took a change of 2/2^10 of a double rounding.
    Today with 32-bit versus 64-bit you take a chance of 2/2^28 so the
    problem is greatly ameliorated although technically still present.

    Actually, for the five required basic operations, you can always do the
    op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    This is because the mantissa lengths (including the hidden bit) increase
    to at least 2n+2:

    f16 1:5:10 (1+10=11, 11*2+2 = 24)
    f32 1:8:23 (1+23=24, 24*2+2 = 50)
    f64 1:11:52 (1+52=53, 53*2+2 = 108)
    f128 1:15:112 (1+112=113)

    You can however NOT use f128 FMUL + FADD to emulate f64 FMAC, since that
    would require a triple sized mantissa.
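    As a quick empirical sanity check of that claim for binary32 via binary64
    (53 >= 2*24+2), a brute-force test like the following should never report
    a mismatch for the basic operations; shown for addition over random bit
    patterns (a spot check, not a proof; assumes IEEE formats and
    round-to-nearest):

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        static float rnd_float(uint64_t *s) {
            *s = *s * 6364136223846793005ull + 1442695040888963407ull;
            uint32_t bits = (uint32_t)(*s >> 32);
            float f;
            memcpy(&f, &bits, 4);
            return f;
        }

        int main(void) {
            uint64_t s = 12345;
            for (long i = 0; i < 100000000; i++) {
                float a = rnd_float(&s), b = rnd_float(&s);
                float direct = a + b;                          /* rounded once to f32 */
                float via_d  = (float)((double)a + (double)b); /* f64 op, then rounded down */
                if (direct != direct || via_d != via_d)
                    continue;                                  /* skip NaN results */
                if (memcmp(&direct, &via_d, 4) != 0)
                    printf("mismatch: %a + %a\n", a, b);
            }
            return 0;
        }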

    The Intel+Motorola 80-bit format was a bastard that made it effectively impossible to produce bit-for-bit identical results even when the FPU
    was set to 64-bit precision.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Oct 31 21:12:45 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Thu, 30 Oct 2025 16:46:14 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    Alpha avoids wasting register bits for some idioms by keeping up to 8
    bytes in a register in SIMD style (a few years before the wave of SIMD
    extensions across the industry), but still provides no direct name for
    the individual bytes of a register.


    According to my understanding, EV4 had no SIMD-style instructions.
    They were introduced in EV5 (Jan 1995). Which makes it only ~6 months
    ahead of VIS in UltraSPARC.

    The original (v1?) Alpha had instructions intended to make it "easy" to process character data in 8-byte chunks inside a register.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 1 18:19:48 2025
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:
    Improves the accuracy? of algorithms, but seems a bit specific to me.

    It is down in the 1% footprint area.

    Are there other instruction sequence where double-rounding would be good >> to avoid?

    Back when I joined Moto (1983) there was a lot of talk about double roundings and how it could screw up various algorithms but mainly in
    the 64-bit versus 80-bit stuff of 68881, where you got 11-more bits
    of precision and thus took a change of 2/2^10 of a double rounding.
    Today with 32-bit versus 64-bit you take a chance of 2/2^28 so the
    problem is greatly ameliorated although technically still present.

    Actually, for the five required basic operations, you can always do the
    op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    This is because the mantissa lengths (including the hidden bit) increase
    to at least 2n+2:

    f16 1:5:10 (1+10=11, 11*2+2 = 24)
    f32 1:8:23 (1+23=24, 24*2+2 = 50)
    f64 1:11:52 (1+52=53, 53*2+2 = 108)
    f128 1:15:112 (1+112=113)

    You can however NOT use f128 FMUL + FADD to emulate f64 FMAC, since that would require a triple sized mantissa.

    The Intel+Motorola 80-bit format was a bastard that made it effectively impossible to produce bit-for-bit identical results even when the FPU
    was set to 64-bit precision.

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Nov 1 19:18:39 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Intel split AX into AL and AH, similar for BX, CX, and DX, but not for
    SI, DI, BP, and SP.

    {ABCD}X registers were data.
    {SDBS} registers were pointer registers.

    The 8086 is no 68000. The [BX] addressing mode makes it obvious that
    that's not the case.

    What is actually the case: AL-DL, AH-DH correspond to 8-bit registers
    of the 8080, some of AX-DX correspond to register pairs. SI, DI, BP
    are new, SP corresponds to the 8080 SP, which does not have 8-bit
    components. That's why SI, DI, BP, SP have no low or high
    sub-registers.

    Oh and BTW:: using x86-history as justification for an architectural
    feature is "bad style".

    I think that we can learn a lot from earlier architectures, some
    things to adopt and some things to avoid. Concerning subregisters, I
    lean towards avoiding.

    That's also another reason to avoid load-and-op and RMW instructions.
    With a load/store architecture, load can sign/zero extend as
    necessary, and then most operations can be done at full width.

    But gains the property that the whole register contains 1 proper value >{range-limited to the container size whence it came} This in turn makes >tracking values easy--in fact placing several different sized values
    in a single register makes it essentially impossible to perform value >analysis in the compiler.

    I don't think it's impossible or particularly hard for the compiler. Implementing it in OoO hardware causes complications, though.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Nov 1 21:08:35 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always do the
    op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Nov 2 02:21:18 2025
    From Newsgroup: comp.arch

    On 10/31/2025 2:32 PM, BGB wrote:
    On 10/31/2025 1:21 PM, BGB wrote:

    ...


    In a lot of the cases, I was using an 8-bit indexed color or color-
    cell mode. For indexed color, one needs to send each image through a
    palette conversion (to the OS color palette); or run a color-cell
    encoder. Mostly because the display HW used 128K of VRAM.

    And, even if RAM backed, there are bandwidth problems with going
    bigger; so higher-resolutions had typically worked to reduce the bits
    per pixel:
        320x200: 16 bpp
        640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
        800x600: 2 or 4 bpp color-cell
       1024x768: 1 bpp monochrome, other experiments (*1)
         Or, use the 2 bpp mode, for 192K.

    *1: Bayer Pattern Mode/Logic (where the pattern of pixels also encodes
    the color);
    One possibility also being to use an indexed color pair for every 8x8,
    allowing for a 1.25 bpp color cell mode.



    Expanding on this:
    Idea 1, original:
    Each group of 2x2 pixels understood as:
      G R
      B G
    With each pixel alternating color.

    But, slightly better for quality is to operate on blocks of 4x4 pixels,
    with the pixel bits encoding color indirectly for the whole 4x4 block:
      G R G B
      B G R G
      G R G B
      B G R G
    So, if >= 4 G bits are set, G is High.
    So, if >= 2 R bits are set, R is High.
    So, if >= 2 B bits are set, B is High.
    If > 8 bits are set, I is high.

    The non-set pixels usually assuming either 0000 (Black) or 1000 (Dark
    Grey) depending on I bit. Or, a low intensity version of the main color
    if over 75% of a given bit are set in a given way (say, for mostly flat color blocks).

    Still kinda sucks, but allows a crude approximation of 16 color graphics
    at 1 bpp...


    Well, anyways, here is me testing with another variation of the idea
    (after thinking about it again).

    Using a joke image as a test case here...

    https://x.com/cr88192/status/1984694932666261839

    This variation uses:
    Y R
    B G

    In this case tiling as:
    Y R Y R ...
    B G B G ...
    Y R Y R ...
    B G B G ...
    ...

    Where, Y is a pure luma value.
    May or may not use this, or:
    Y R B G Y R B G
    B G Y R B G Y R
    ...
    But, prior pattern is simpler to deal with.

    Note that having every line follow the same pattern (with no
    alternation) would lead to obvious vertical lines in the output.


    With a different (slightly more complicated color recovery algorithm),
    and was operating on 8x8 pixel blocks.

    With 4x4, there is effectively 4 bits per channel, which is enough to
    recover 1 bit of color per channel.

    With 8x8, there are 16 bits, and it is possible to recover ~ 3 bits per channel, allowing for roughly a RGB333 color space (though, the vectors
    are normalized here).

    Having both a Y and G channel slightly helps with the color-recovery
    process; and allows a way to signal a monochrome block (if Y==G, the
    block is assumed to be monochrome, and the R/B bits can be used more
    freely for expressing luma).

    Where:
    Chroma accuracy comes at the expense of luma accuracy;
    An increased colorspace comes at the cost of spatial resolution of chroma;
    ...


    Dealing with chroma does have the effect of making the dithering process
    more complicated. As noted, reliable recovery of the color vector is
    itself a bit fiddly (and is very sensitive to the encoder side dither process).

    The former image was itself an example of an artifact caused by the
    dithering process, which in this case was over-boosting the green
    channel (and rotating the dither matrix would result in drastic color
    shifts). The latter image was mostly after I realized the issue with the
    dither pattern, and modified how it was being handled (replacing the use
    of an 8x8 ordered dither with a 4x4 ordered dither, and then rotating
    the matrix for each channel).
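
    For reference, a minimal C sketch of a 4x4 ordered (Bayer) dither; the
    matrix is the standard one, and the per-channel offset is only a
    stand-in for the per-channel matrix rotation described above, not the
    actual encoder logic:

    static const unsigned char bayer4[4][4] = {
        {  0,  8,  2, 10 },
        { 12,  4, 14,  6 },
        {  3, 11,  1,  9 },
        { 15,  7, 13,  5 },
    };

    /* Threshold one 8-bit channel value down to a single bit. */
    static int dither_bit(int x, int y, int value, int chan)
    {
        int t = bayer4[(y + 2 * chan) & 3][(x + chan) & 3];
        return value > (t * 255 + 8) / 16;
    }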


    Image quality isn't great, but then again, not sure how to do that much
    better with a naive 1 bit/pixel encoding.


    I guess, an open question here is whether the color-recovery algorithm
    would be practical for hardware / FPGA.

    One possible approach could be:
    Use LUT4 to map 4b -> 2b (as a count)
    Then, map 2x2b -> 3b (adder)
    Then, map 2x3b -> 4b (adder), then discard LSB.
    Then, select max or R/G/B/Y;
    This is used as an inverse normalization scale.
    Feed each value and scale through a LUT (for R/G/B)
    Getting a 5-bit scaled RGB;
    Roughly: (Val<<5)/Max
    Compose an RGB555 value (from the three 5-bit components) used for each pixel that is set.
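
    As a software model of the same recovery step, a sketch under assumptions:
    one 64-bit word per 8x8 block, with bit 0 taken here as the upper-left
    pixel (the actual 1bpp cells put the upper-left corner in the MSB, so a
    real decoder would index accordingly), the Y R / B G tiling above, and an
    (r<<10)|(g<<5)|b RGB555 layout. Not the actual HW logic:

    #include <stdint.h>

    static uint16_t recover_block_color(uint64_t bits)
    {
        int cy = 0, cr = 0, cb = 0, cg = 0;
        for (int y = 0; y < 8; y++) {
            for (int x = 0; x < 8; x++) {
                if (!((bits >> (y * 8 + x)) & 1))
                    continue;
                if (!(y & 1)) { if (!(x & 1)) cy++; else cr++; }  /* Y R rows */
                else          { if (!(x & 1)) cb++; else cg++; }  /* B G rows */
            }
        }
        int max = cy;                       /* inverse normalization scale */
        if (cr > max) max = cr;
        if (cg > max) max = cg;
        if (cb > max) max = cb;
        if (max == 0)
            return 0;                       /* empty block: black */
        int r = (cr << 5) / max; if (r > 31) r = 31;   /* ~ (Val<<5)/Max */
        int g = (cg << 5) / max; if (g > 31) g = 31;
        int b = (cb << 5) / max; if (b > 31) b = 31;
        return (uint16_t)((r << 10) | (g << 5) | b);
    }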

    The actual pixel decoding process works the same as with 8x8 blocks of 1-bit monochrome, selecting the minimum or maximum color based on each bit.

    Possibly, Y could also be used to select "relative" minimum and maximum values, vs full intensity and black, but this would add more logic
    complexity.


    Pros/Cons:
    +: Looks better than per-pixel Bayer-RGB
    +: Looks better than 4x4 RGBI
    -: Would require more complex decoder logic;
    -: Requires specialized dither logic to not look like broken crap.
    -: Doesn't give passable results if handed naive grayscale dithering.

    Per-Pixel RGB still holds up OK with naive grayscale dither.
    But, this approach is a lot more particular.

    The RGBI approach seems intermediate, more likely to decode grayscale
    patterns as gray.



    I guess a more open question is if such a thing could be useful (it is
    pretty far down the image-quality scale). But, OTOH, with simpler
    (non-randomized) dither patterns, it can LZ compress OK (depending on
    image, can get 0.1 to 0.8 bpp, which is generally JPEG territory).

    If combined with delta encoding or similar; could almost be adapted into
    a very crappy video codec.

    Well, or LZ4, where (at 320x200) one could potentially hold several
    frames of video in a 64K sliding window.

    But, image quality might be unacceptably poor. Also if decoded in
    software, the color-reconstruction is likely to be more computationally expensive than just using a CRAM style codec (while also giving worse
    image quality).


    More just interesting that I was able to get things "almost half-way
    passable" from 1 bpp monochrome.


    ...



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Nov 2 11:36:36 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always do the
    op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).

    Rounding to odd is basically the same as rounding to sticky, i.e. if
    there are any trailing 1 bits in the exact result, then put a 1 in the
    ulp position.
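
    A minimal C sketch of that rule (assuming the exact result is held in a
    wide integer and dropped_bits is less than 64; purely illustrative):

    #include <stdint.h>

    static uint64_t round_to_odd(uint64_t exact, unsigned dropped_bits)
    {
        uint64_t kept = exact >> dropped_bits;
        uint64_t lost = exact & ((1ULL << dropped_bits) - 1);
        if (lost)
            kept |= 1;      /* the "stickiness" lands in the ulp position */
        return kept;
    }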

    We have known since before the 1978 ieee754 standard that guard+sticky
    (plus sign and ulp) is enough to get the rounding correct in all modes.

    The single exception is when rounding up from the maximum magnitude
    value to inf should be suppressed, there you do in fact need to check
    all the bits.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Nov 2 15:56:12 2025
    From Newsgroup: comp.arch

    On Sun, 2 Nov 2025 11:36:36 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always
    do the op in the next higher precision, then round again down to
    the target, and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit floating point arithmetic, for that very reason (I assume).

    Rounding to odd is basically the same as rounding to sticky, i.e if
    there are any trailing 1 bits in the exact result, then put that in
    the ulp position.

    We have known since before the 1978 ieee754 standard that
    guard+sticky (plus sign and ulp) is enough to get the rounding
    correct in all modes.

    The single exception is when rounding up from the maximum magnitude
    value to inf should be suppressed, there you do in fact need to check
    all the bits.

    Terje


    People use names like guard and sticky bits, and sometimes also rounding
    bit (e.g. in the Wikipedia article), without explanation, as if everybody
    had agreed about what they mean. But I don't think that everybody
    really agrees.

    Shockingly, the absence of strict definitions applies even to the most
    widely cited article, David Goldberg's "What Every Computer Scientist
    Should Know About Floating-Point Arithmetic". It seems people copy the
    name of the article from one another, but only a very small fraction of
    them have bothered to actually read it.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 2 09:39:27 2025
    From Newsgroup: comp.arch

    Contemplating having conditional branch instructions branch to a target
    value in a register instead of using a displacement.

    I think this has about the same code density as having a branch to a displacement from the IP.

    Using a fused compare-and-branch instruction for Qupls4, there is not
    enough room in the instruction for a large branch displacement (10
    bits). So, my thought is to branch to a register value instead.
    There is already an add-to-instruction-pointer instruction that can be
    used to generate relative addresses.

    By moving the register load outside of a loop, the dynamic instruction
    count can be reduced. I think this solution is a bit better than having compare and branch as two separate instructions, or having an extended constant added to the branch instruction.

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    The 10-bit displacement format could also be supported, but it is yet
    another branch instruction format. I may leave holes in the instruction
    set for future support, but I think it is best to start with just a
    single format.

    Code:
    AIPSI R3,1234 ; add displacement to IP and store in R3 (hoist-able)
    BLT R1,R2,R3 ; branch to R3 if R1 < R2

    Versus:
    CMP R3,R1,R2
    BLT R3,displacement


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 2 10:06:42 2025
    From Newsgroup: comp.arch

    On 2025-11-02 3:21 a.m., BGB wrote:
    On 10/31/2025 2:32 PM, BGB wrote:
    On 10/31/2025 1:21 PM, BGB wrote:

    ...


    In a lot of the cases, I was using an 8-bit indexed color or color-
    cell mode. For indexed color, one needs to send each image through a
    palette conversion (to the OS color palette); or run a color-cell
    encoder. Mostly because the display HW used 128K of VRAM.

    And, even if RAM backed, there are bandwidth problems with going
    bigger; so higher-resolutions had typically worked to reduce the bits
    per pixel:
        320x200: 16 bpp
        640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
        800x600: 2 or 4 bpp color-cell
       1024x768: 1 bpp monochrome, other experiments (*1)
         Or, use the 2 bpp mode, for 192K.

    *1: Bayer Pattern Mode/Logic (where the pattern of pixels also
    encodes the color);
    One possibility also being to use an indexed color pair for every
    8x8, allowing for a 1.25 bpp color cell mode.



    Expanding on this:
    Idea 1, original:
    Each group of 2x2 pixels understood as:
       G R
       B G
    With each pixel alternating color.

    But, slightly better for quality is to operate on blocks of 4x4
    pixels, with the pixel bits encoding color indirectly for the whole
    4x4 block:
       G R G B
       B G R G
       G R G B
       B G R G
    So, if >= 4 G bits are set, G is High.
    So, if >= 2 R bits are set, R is High.
    So, if >= 2 B bits are set, B is High.
    If > 8 bits are set, I is high.

    The non-set pixels usually assuming either 0000 (Black) or 1000 (Dark
    Grey) depending on I bit. Or, a low intensity version of the main
    color if over 75% of a given bit are set in a given way (say, for
    mostly flat color blocks).

    Still kinda sucks, but allows a crude approximation of 16 color
    graphics at 1 bpp...


    Well, anyways, here is me testing with another variation of the idea
    (after thinking about it again).

    Using a joke image as a test case here...

    https://x.com/cr88192/status/1984694932666261839

    This variation uses:
      Y R
      B G

    In this case tiling as:
      Y R Y R ...
      B G B G ...
      Y R Y R ...
      B G B G ...
      ...

    Where, Y is a pure luma value.
      May or may not use this, or:
        Y R B G Y R B G
        B G Y R B G Y R
        ...
      But, prior pattern is simpler to deal with.

    Note that having every line follow the same pattern (with no
    alternation) would lead to obvious vertical lines in the output.


    With a different (slightly more complicated color recovery algorithm),
    and was operating on 8x8 pixel blocks.

    With 4x4, there is effectively 4 bits per channel, which is enough to recover 1 bit of color per channel.

    With 8x8, there are 16 bits, and it is possible to recover ~ 3 bits per channel, allowing for roughly a RGB333 color space (though, the vectors
    are normalized here).

    Having both a Y and G channel slightly helps with the color-recovery process; and allows a way to signal a monochrome block (if Y==G, the
    block is assumed to be monochrome, and the R/B bits can be used more
    freely for expressing luma).

    Where:
    Chroma accuracy comes at the expense of luma accuracy;
    An increased colorspace comes at the cost of spatial resolution of chroma; ...


    Dealing with chroma does have the effect of making the dithering process more complicated. As noted, reliable recovery of the color vector is
    itself a bit fiddly (and is very sensitive to the encoder side dither process).

    The former image was itself an example of an artifact caused by the dithering process, which in this case was over-boosting the green
    channel (and rotating the dither matrix would result in drastic color shifts). The later image was mostly after I realized the issue with the dither pattern, and modified how it was being handled (replacing the use
    of an 8x8 ordered dither with a 4x4 ordered dither, and then rotating
    the matrix for each channel).


    Image quality isn't great, but then again, not sure how to do that much better with a naive 1 bit/pixel encoding.


    I guess, an open question here is whether the color-recovery algorithm
    would be practical for hardware / FPGA.

    One possible could be:
      Use LUT4 to map 4b -> 2b (as a count)
      Then, map 2x2b -> 3b (adder)
      Then, map 2x3b -> 4b (adder), then discard LSB.
      Then, select max or R/G/B/Y;
        This is used as an inverse normalization scale.
      Feed each value and scale through a LUT (for R/G/B)
        Getting a 5-bit scaled RGB;
        Roughly: (Val<<5)/Max
      Compose a 5-bit RGB555 value used for each pixel that is set.

    Actual pixel decoding process works the same as with 8x8 blocks of 1 bit monochome, selecting minimum or maximum color based on each bit.

    Possibly, Y could also be used to select "relative" minimum and maximum values, vs full intensity and black, but this would add more logic complexity.


    Pros/Cons:
      +: Looks better than per-pixel Bayer-RGB
      +: Looks better than 4x4 RGBI
      -: Would require more complex decoder logic;
      -: Requires specialized dither logic to not look like broken crap.
      -: Doesn't give passable results if handed naive grayscale dithering.

    Per-Pixel RGB still holds up OK with naive grayscale dither.
    But, this approach is a lot more particular.

    the RGBI approach seems intermediate, more likely to decode grayscale patterns as gray.



    I guess a more open question is if such a thing could be useful (it is pretty far down the image-quality scale). But, OTOH, with simpler (non- randomized) dither patterns; it can LZ compress OK (depending on image,
    can get 0.1 to 0.8 bpp; which is generally JPEG territory).

    If combined with delta encoding or similar; could almost be adapted into
    a very crappy video codec.

    Well, or LZ4, where (at 320x200) one could potentially hold several
    frames of video in a 64K sliding window.

    But, image quality might be unacceptably poor. Also if decoded in
    software, the color-reconstruction is likely to be more computationally expensive than just using a CRAM style codec (while also giving worse
    image quality).


    More just interesting that I was able to get things "almost half-way passable" from 1 bpp monochrome.


    ...



    I think your support for graphics is interesting; something to keep in
    mind for displays with limited RAM.

    I use a high-speed DDR memory interface and video fifo (line cache).
    Colors are broken into components, with the number of bits per
    component (up to 10) specified in CRs. Colors are passed around as 32-bit values
    for video processing. Using the colors directly is much easier than
    dealing with dithered colors.
    The graphics accelerator just spits out colors to the frame buffer
    without needing to go through a dithering stage.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Nov 2 16:09:10 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Sun, 2 Nov 2025 11:36:36 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always
    do the op in the next higher precision, then round again down to
    the target, and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).

    Rounding to odd is basically the same as rounding to sticky, i.e if
    there are any trailing 1 bits in the exact result, then put that in
    the ulp position.

    We have known since before the 1978 ieee754 standard that
    guard+sticky (plus sign and ulp) is enough to get the rounding
    correct in all modes.

    The single exception is when rounding up from the maximum magnitude
    value to inf should be suppressed, there you do in fact need to check
    all the bits.

    Terje


    People use names like guard and sticky bits and sometimes also rounding
    bit (e.g. in Wikipedia article) without explanation, as if everybody
    had agreed about what they mean. But I don't think that everybody
    really agree.

    Within the 754 working group the definition is totally clear:

    Guard is the first bit after the normal mantissa.

    Sticky is the bit following the guard bit, it is generated by OR'ing
    together all subsequent bits in the exact/infinitely precise result.

    I.e if an exact result is exactly halfway between two representable
    numbers, the Guard bit will be set and Sticky unset.

    Ulp (Unit in Last Place) is the final mantissa bit.

    Sign is of course the sign in the Sign-Magnitude format used for all fp numbers.

    This means that those four bits in combination suffice to distinguish
    between rounding directions:

    Default rounding is nearest-even: (In this case Sign does not matter.)

    Ulp | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
    Guard | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
    Sticky | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |

    Round | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
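
    A minimal C sketch of that Round row, assuming the mantissa is held as a
    plain integer and the caller renormalizes on any carry-out (illustrative
    only):

    #include <stdint.h>

    static uint64_t round_nearest_even(uint64_t mantissa, int guard, int sticky)
    {
        int ulp = (int)(mantissa & 1);
        int round_up = guard && (sticky || ulp);   /* matches the table above */
        return mantissa + (uint64_t)round_up;
    }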

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Nov 2 18:14:54 2025
    From Newsgroup: comp.arch

    On Sun, 2 Nov 2025 16:09:10 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Sun, 2 Nov 2025 11:36:36 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always
    do the op in the next higher precision, then round again down to
    the target, and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its
    128-bit floating point arithmetic, for that very reason (I
    assume).

    Rounding to odd is basically the same as rounding to sticky, i.e if
    there are any trailing 1 bits in the exact result, then put that in
    the ulp position.

    We have known since before the 1978 ieee754 standard that
    guard+sticky (plus sign and ulp) is enough to get the rounding
    correct in all modes.

    The single exception is when rounding up from the maximum magnitude
    value to inf should be suppressed, there you do in fact need to
    check all the bits.

    Terje


    People use names like guard and sticky bits and sometimes also
    rounding bit (e.g. in Wikipedia article) without explanation, as if everybody had agreed about what they mean. But I don't think that
    everybody really agree.

    Within the 754 working group the definition is totally clear:


    I could believe that there is consensus about these names among
    current members of the 754 working group. But nothing of that sort is
    mentioned in the text of the Standard, which among other things means
    that you cannot rely on being understood even by new members of the 754
    working group.

    Guard is the first bit after the normal mantissa.

    Sticky is the bit following the guard bit, it is generated by OR'ing together all subsequent bits in the exact/infinitely precise result.

    I.e if an exact result is exactly halfway between two representable
    numbers, the Guard bit will be set and Sticky unset.

    Ulp (Unit in Last Place)) is the final mantissa bit

    Sign is of course the sign in the Sign-Magnitude format used for all
    fp numbers.

    This means that those four bits in combination suffices to separate
    between rounding directions:

    Default rounding is nearest or even: (In this case Sign does not
    matter.)

    Ulp | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
    Guard | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
    Sticky | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |

    Round | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |

    Terje


    I mostly use ULP/Guard/Sticky in the same meaning. Except when I use
    them, esp. Guard, differently.
    Given the choice, [in the context of binary floating point] I'd rather
    not use the term 'guard' at all. Names like 'rounding bit' or
    'half-ULP' are far more self-describing.







    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Nov 2 20:19:10 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Sun, 2 Nov 2025 16:09:10 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Sun, 2 Nov 2025 11:36:36 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always
    do the op in the next higher precision, then round again down to
    the target, and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its
    128-bit floating point arithmetic, for that very reason (I
    assume).

    Rounding to odd is basically the same as rounding to sticky, i.e if
    there are any trailing 1 bits in the exact result, then put that in
    the ulp position.

    We have known since before the 1978 ieee754 standard that
    guard+sticky (plus sign and ulp) is enough to get the rounding
    correct in all modes.

    The single exception is when rounding up from the maximum magnitude
    value to inf should be suppressed, there you do in fact need to
    check all the bits.

    Terje


    People use names like guard and sticky bits and sometimes also
    rounding bit (e.g. in Wikipedia article) without explanation, as if
    everybody had agreed about what they mean. But I don't think that
    everybody really agree.

    Within the 754 working group the definition is totally clear:


    I could believe that there is consensus about these names between
    current members of 754 working group. But nothing of that sort is
    mentioned in the text of the Standard. Which among other things means
    that you can not rely on being understood even by new members of 754
    working group.

    Guard is the first bit after the normal mantissa.

    Sticky is the bit following the guard bit, it is generated by OR'ing
    together all subsequent bits in the exact/infinitely precise result.

    I.e if an exact result is exactly halfway between two representable
    numbers, the Guard bit will be set and Sticky unset.

    Ulp (Unit in Last Place)) is the final mantissa bit

    Sign is of course the sign in the Sign-Magnitude format used for all
    fp numbers.

    This means that those four bits in combination suffices to separate
    between rounding directions:

    Default rounding is nearest or even: (In this case Sign does not
    matter.)

    Ulp | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
    Guard | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
    Sticky | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |

    Round | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |

    Terje


    I mostly use ULP/Guard/Sticky in the same meaning. Except when I use
    them, esp. Guard, differently.
    Given the choice, [in the context of binary floating point] I'd rather
    not use the term 'guard' at all. Names like 'rounding bit' or
    'half-ULP' are far more self-describing.

    Guard also works for decimal FP, where you need a single Sticky bit if
    the Guard digit is equal to 5. If you work with the binary
    representation for decimal, then you just need two extra bits, just like
    BFP.

    Correct rounding also works when Guard temporarily contains more than one
    bit, possibly due to normalization, but you would normally squash this
    down to (Guard, Sticky) by OR'ing any secondary guard bits into Sticky.
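
    As a sketch of the decimal case (one guard digit plus a sticky bit; the
    significand is assumed to be held as a plain integer, the caller handles
    any carry-out, and the names are illustrative rather than from any
    particular decimal-FP implementation):

    #include <stdint.h>

    static uint64_t dec_round_nearest_even(uint64_t sig, int guard_digit, int sticky)
    {
        if (guard_digit > 5 ||
            (guard_digit == 5 && (sticky || (sig & 1))))
            sig += 1;
        return sig;
    }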

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Nov 2 14:58:36 2025
    From Newsgroup: comp.arch

    On 11/2/2025 9:06 AM, Robert Finch wrote:
    On 2025-11-02 3:21 a.m., BGB wrote:
    On 10/31/2025 2:32 PM, BGB wrote:
    On 10/31/2025 1:21 PM, BGB wrote:

    ...


    In a lot of the cases, I was using an 8-bit indexed color or color-
    cell mode. For indexed color, one needs to send each image through a
    palette conversion (to the OS color palette); or run a color-cell
    encoder. Mostly because the display HW used 128K of VRAM.

    And, even if RAM backed, there are bandwidth problems with going
    bigger; so higher-resolutions had typically worked to reduce the
    bits per pixel:
        320x200: 16 bpp
        640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
        800x600: 2 or 4 bpp color-cell
       1024x768: 1 bpp monochrome, other experiments (*1)
         Or, use the 2 bpp mode, for 192K.

    *1: Bayer Pattern Mode/Logic (where the pattern of pixels also
    encodes the color);
    One possibility also being to use an indexed color pair for every
    8x8, allowing for a 1.25 bpp color cell mode.



    Expanding on this:
    Idea 1, original:
    Each group of 2x2 pixels understood as:
       G R
       B G
    With each pixel alternating color.

    But, slightly better for quality is to operate on blocks of 4x4
    pixels, with the pixel bits encoding color indirectly for the whole
    4x4 block:
       G R G B
       B G R G
       G R G B
       B G R G
    So, if >= 4 G bits are set, G is High.
    So, if >= 2 R bits are set, R is High.
    So, if >= 2 B bits are set, B is High.
    If > 8 bits are set, I is high.

    The non-set pixels usually assuming either 0000 (Black) or 1000 (Dark
    Grey) depending on I bit. Or, a low intensity version of the main
    color if over 75% of a given bit are set in a given way (say, for
    mostly flat color blocks).

    Still kinda sucks, but allows a crude approximation of 16 color
    graphics at 1 bpp...


    Well, anyways, here is me testing with another variation of the idea
    (after thinking about it again).

    Using a joke image as a test case here...

    https://x.com/cr88192/status/1984694932666261839

    This variation uses:
       Y R
       B G

    In this case tiling as:
       Y R Y R ...
       B G B G ...
       Y R Y R ...
       B G B G ...
       ...

    Where, Y is a pure luma value.
       May or may not use this, or:
         Y R B G Y R B G
         B G Y R B G Y R
         ...
       But, prior pattern is simpler to deal with.

    Note that having every line follow the same pattern (with no
    alternation) would lead to obvious vertical lines in the output.


    With a different (slightly more complicated color recovery algorithm),
    and was operating on 8x8 pixel blocks.

    With 4x4, there is effectively 4 bits per channel, which is enough to
    recover 1 bit of color per channel.

    With 8x8, there are 16 bits, and it is possible to recover ~ 3 bits
    per channel, allowing for roughly a RGB333 color space (though, the
    vectors are normalized here).

    Having both a Y and G channel slightly helps with the color-recovery
    process; and allows a way to signal a monochrome block (if Y==G, the
    block is assumed to be monochrome, and the R/B bits can be used more
    freely for expressing luma).

    Where:
    Chroma accuracy comes at the expense of luma accuracy;
    An increased colorspace comes at the cost of spatial resolution of
    chroma;
    ...


    Dealing with chroma does have the effect of making the dithering
    process more complicated. As noted, reliable recovery of the color
    vector is itself a bit fiddly (and is very sensitive to the encoder
    side dither process).

    The former image was itself an example of an artifact caused by the
    dithering process, which in this case was over-boosting the green
    channel (and rotating the dither matrix would result in drastic color
    shifts). The later image was mostly after I realized the issue with
    the dither pattern, and modified how it was being handled (replacing
    the use of an 8x8 ordered dither with a 4x4 ordered dither, and then
    rotating the matrix for each channel).


    Image quality isn't great, but then again, not sure how to do that
    much better with a naive 1 bit/pixel encoding.


    I guess, an open question here is whether the color-recovery algorithm
    would be practical for hardware / FPGA.

    One possible could be:
       Use LUT4 to map 4b -> 2b (as a count)
       Then, map 2x2b -> 3b (adder)
       Then, map 2x3b -> 4b (adder), then discard LSB.
       Then, select max or R/G/B/Y;
         This is used as an inverse normalization scale.
       Feed each value and scale through a LUT (for R/G/B)
         Getting a 5-bit scaled RGB;
         Roughly: (Val<<5)/Max
       Compose a 5-bit RGB555 value used for each pixel that is set.

    Actual pixel decoding process works the same as with 8x8 blocks of 1
    bit monochome, selecting minimum or maximum color based on each bit.

    Possibly, Y could also be used to select "relative" minimum and
    maximum values, vs full intensity and black, but this would add more
    logic complexity.


    Pros/Cons:
       +: Looks better than per-pixel Bayer-RGB
       +: Looks better than 4x4 RGBI
       -: Would require more complex decoder logic;
       -: Requires specialized dither logic to not look like broken crap.
       -: Doesn't give passable results if handed naive grayscale dithering.
    Per-Pixel RGB still holds up OK with naive grayscale dither.
    But, this approach is a lot more particular.

    the RGBI approach seems intermediate, more likely to decode grayscale
    patterns as gray.



    I guess a more open question is if such a thing could be useful (it is
    pretty far down the image-quality scale). But, OTOH, with simpler
    (non- randomized) dither patterns; it can LZ compress OK (depending on
    image, can get 0.1 to 0.8 bpp; which is generally JPEG territory).

    If combined with delta encoding or similar; could almost be adapted
    into a very crappy video codec.

    Well, or LZ4, where (at 320x200) one could potentially hold several
    frames of video in a 64K sliding window.

    But, image quality might be unacceptably poor. Also if decoded in
    software, the color-reconstruction is likely to be more
    computationally expensive than just using a CRAM style codec (while
    also giving worse image quality).


    More just interesting that I was able to get things "almost half-way
    passable" from 1 bpp monochrome.


    ...



    I think your support for graphics is interesting; something to keep in
    mind for displays with limited RAM.

    I use a high-speed DDR memory interface and video fifo (line cache).
    Colors are broken into components specifying the number of bits per component (up to 10) in CRs. Colors are passed around as 32-bit values
    for video processing. Using the colors directly is much easier than
    dealing with dithered colors.
    The graphics accelerator just spits out colors to the frame buffer
    without needing to go through a dithering stage.



    No real need to go much beyond RGB555, as the FPGA boards have VGA DACs
    that generally fall below this (e.g., 4 bits/channel on the Nexys A7). And
    2 bits/channel for many VGA PMods (a PMod allows 8 IO pins, so RGB222+H/V
    Sync; otherwise 2 PMOD connectors are needed for the VGA). The usual
    workaround was to perform dithering while driving the VGA output (with
    ordered dither in the Verilog).

    But, yeah, even the theoretical framebuffer images generally look better
    than what one sees on actual monitors.

    Even then, modern LCD panels mostly can't display even full RGB24 color
    depth; more often it is 6-bit / channel or similar (then the panels
    dither for full 24). But, IIRC a lot of OLEDs are back up to full
    color-depth (but, OLEDs are more expensive and have often had
    notoriously short lifespans, ...).

    But, yeah, my current monitor seems to be LCD based.



    In my case, the video HW uses prefetch requests along a ring-bus, which
    goes to the L2 cache, and then to external RAM. It then works on the
    hope that the requests get around the bus and can be resolved in time.

    In this case, the memory works in a vaguely similar way to the CPU's L1
    caches (although with line-oriented access), and a module that
    translates this to color-values during screen refresh. General access
    pattern was built around "character cells".


    It can give stable results at 8MB/s to 16MB/s (with more glitches as it
    goes higher), but breaks down too much past this point.

    So, switching to a RAM-backed framebuffer didn't significantly increase
    the usable screen resolutions or color depths.

    Also, I am mostly limited to using either a 25 or 50 MHz pixel
    clock, so some timings were tweaked to fit this. This doesn't really fit
    standard VESA timings, but it seems like monitors can tolerate
    nonstandard timings and are limited mainly by their operating range.

    So, say:
     320x200 70Hz, 25MHz; 9 MB/s @ 16bpp (hi-color)
     640x400 70Hz, 25MHz; 9 MB/s @ 4bpp, 18 MB/s @ 8bpp
     640x480 60Hz, 50MHz; 9 MB/s @ 4bpp, 18 MB/s @ 8bpp
     800x600 72Hz, 50MHz; 8.6 MB/s @ 2bpp, 17 MB/s @ 4bpp
    1024x768 48Hz, 50MHz; 5 MB/s @ 1bpp, 10 MB/s @ 2bpp

    So, this implies that just running 1024x768 at 2bpp should be acceptable
    (even if it exceeds the usual 128K limit).
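
    (For reference, the MB/s figures above follow from width x height x
    refresh x bpp / 8, ignoring blanking; a trivial sketch of the arithmetic:)

    static double fb_mb_per_sec(int w, int h, double hz, int bpp)
    {
        /* e.g. 640x400 @ 70 Hz, 8 bpp -> ~17.9 MB/s */
        return (double)w * h * hz * bpp / 8.0 / 1000000.0;
    }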


    Earlier on, I had 800x600/36Hz and 1024x768/25Hz modes; these would have
    allowed 8bpp color, but are below the minimum refresh rate of most
    monitors (it seems VGA monitors don't like going below around 40Hz).


    Of these modes, 8bpp (Indexed color) is technically newest.
    Originally the graphics hardware was written for color-cell.

    Earliest design had 32-bit cells (for 8x8 pixels):
    10 bits: Glyph
    2x 6b color + Attrib (RGB222)
    2x 9b Color: RGB333

    This was later expanded first to 64b cells, then to 128b and 256b.
    Some control bits affect cell size.
    Also with the ability to specify 8x8 or 4x4 cells,
    where 4x4 cells reduce the effective resolution.
    In the bitmap modes:
      4x4 + 256b: 16bpp Hicolor
      4x4 + 128b: 8bpp Indexed
      4x4 +  64b: 4bpp RGBI (Alt2)
      8x8 + 256b: 4bpp RGBI (Alt1)
      8x8 + 128b: 2bpp (4-color, CGA-like)
        With a range of color palettes available (more than CGA).
        Black/White/Cyan/Magenta, Black/White/Red/Green, ...
        Black/White/DarkGray/LightGray, also with Green and Amber, ...
      8x8 +  64b: 1bpp (Monochrome)
        Can select between RGBI colors and some special sub-modes.
        The recent idea, if added to HW, would slot into this mode.
    The color-cell modes:
      8x8 + 256b: 4bpp (DXT1 like, 4x 4x4 cells per 256-bit cell)
      8x8 + 128b: 2bpp (2bpp cells)
        Each cell has 2x RGB555 colors, and 8x8x1 for pixel data.
        Had experimented with 8x RGB232; didn't catch on (looked terrible).
      8x8 +  64b: Text-Mode + Early Graphics (4x4 cells)


    Generally, the text mode operates in a 640x200 mode with 8x8 + 128b
    cells, so 32K of VRAM used (for 80x25 cells).

    The 640x200 mode is the same as 640x400 (for VGA) but with the vertical resolution halved. The 320x200 mode also halves the horizontal
    resolution (so 40x25 cells).


    In this case, a 40x25 color-cell mode (with 256-bit cells) could be used
    for graphics (32K). Early on, this was used as the graphics mode for
    Doom and similar, before I later expanded VRAM to 128K and switched to
    320x200 Hicolor.


    The bitmap modes are non-raster, generally with pixels packed into 8x8
    or 4x4 blocks.
    4x4:
    16bpp: pixels in raster order.
    8bpp: raster order, 32-bits per row
    4bpp: Raster order, 16-bits per row
    And, 8x8:
    4bpp: Takes 16bpp layout, splits each pixel into 2x2.
    2bpp: Takes 8bpp layout, splits each pixel into 2x2.
    1bpp: Raster order, 1bpp, but same order as text glyphs.
    With MSB in upper left, LSB in lower right.

    Can note that the 8x8x1b cells have the upper-left corner in the MSB.
    This differs from most other modes where the upper left corner is in the
    LSB (so, pixels flipped both horizontally and vertically).


    Can note that in this case, the video memory had several parts:
      VRAM / Framebuffer
        Note: Uses 64-bit word addressing.
      Font RAM: Stores character glyphs as 8x8 patterns.
        Originally, there was a FontROM, but I dropped this feature.
        This means BootROM needs to supply the initial glyph set.
        I went with 5x6 pixel cells in the ROM to save space,
        where 5x6 does ASCII well enough.
      Palette RAM: Stores 256x 16-bits (as RGB555).

    Though, TestKern typically uses what is effectively color-cell graphics
    for the text mode (so, just draws 8x8 pixel blocks for the character
    glyphs).


    All this differs notably from CGA/EGA/VGA, which had used mostly raster-ordered modes. Except for the oddity of bit-planes for 16 color
    modes in EGA and VGA.


    I did experiment with raster ordered modes which worked by effectively stretching the character cell horizontally while reducing vertical
    height to 1 pixel. Ended up not going with this, as it was prone to a lot
    more glitches with the screen refresh (turned out to be a lot more
    sensitive to timing than the use of 8x8 or 4x4 cells).

    But, since generally programs don't draw directly into VRAM, the use of non-raster VRAM is mostly less of an issue.


    Well, apart from the computational cost of converting from internal
    RGB555 frame-buffers. Though, part of the reason RGB555 ended up used so
    often was because it was faster to do RGB555 -> ColorCell encoding than
    8-bit indexed color to color-cell, as indexed color typically also
    requires a bunch of palette lookups (which could end up more expensive
    than the additional RAM bandwidth from the RGB555).

    Also, there isn't really a "good and simple" way to generalize 8-bit
    colors in a way that leads to acceptable image quality. Invariably, one
    ends up needing palettes or encoding schemes that are slightly irregular.



    For color-cell, there are different approaches depending on how fast it
    needs to be (a rough sketch of the "Faster" path follows below):
      Faster: Simply select minimum and maximum luma;
        Selector encoding is often via comparing against thresholds.
        Except on x86, where multiply+bias+shift is faster.
      Medium: Calculate along 4 axes in parallel;
        Select axis which gives highest contrast;
          Usually: Luma, Cyan, Magenta, Yellow.
        Adjust endpoints to better reflect standard deviation,
          vs simply min/max.
      Slower:
        Calculate centroid and mass distribution and similar;
        Better quality, more for offline / batch encoding.
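
    A minimal C sketch of the "Faster" path, under assumptions: a 4x4 block
    of RGB555 pixels in, two RGB555 endpoints and 16 one-bit selectors out;
    the luma weights and names here are illustrative, not the actual encoder:

    #include <stdint.h>

    static void cc_encode_fast(const uint16_t px[16],
                               uint16_t *c_min, uint16_t *c_max, uint16_t *sel)
    {
        int luma[16], lmin = 255, lmax = 0, imin = 0, imax = 0;
        for (int i = 0; i < 16; i++) {
            int r = (px[i] >> 10) & 31, g = (px[i] >> 5) & 31, b = px[i] & 31;
            luma[i] = 2 * r + 5 * g + b;        /* cheap luma approximation */
            if (luma[i] < lmin) { lmin = luma[i]; imin = i; }
            if (luma[i] > lmax) { lmax = luma[i]; imax = i; }
        }
        *c_min = px[imin];                      /* min/max-luma endpoints */
        *c_max = px[imax];
        int thresh = (lmin + lmax) / 2;         /* selector = compare vs threshold */
        uint16_t bits = 0;
        for (int i = 0; i < 16; i++)
            if (luma[i] > thresh)
                bits |= (uint16_t)(1u << i);
        *sel = bits;
    }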


    As noted, early on, I was mostly using real-time color-cell encoders for
    Doom and Quake and similar (hence part of why they were modified to use RGB555).

    Some of this is also related to the existence of a lot of RGB555-related helper ops. Though, early on, I had also used YUV655 as well, but RGB555
    mostly won out over YUV655 (even if it is easier to get a luma from
    YUV655 vs RGB555).

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 2 16:56:05 2025
    From Newsgroup: comp.arch

    On 2025-11-02 3:58 p.m., BGB wrote:
    <snip>

    No real need to go much beyond RGB555, as the FPGA boards have VGA DACs
    that generally fall below this (Eg: 4 bit/channel on the Nexys A7). And, 2-bit for many VGA PMods (PMod allowing 8 IO pins, so RGB222+H/V Sync;
    or needing to use 2 PMOD connections for the VGA). The usual workaround
    was also to perform dithering while driving the VGA output (with ordered dither in the Verilog).


    I am using an HDMI interface so the monitor is fed 24-bit RGB digitally.
    I tried to get a display channel interface working but no luck. VGA is
    so old.

    Have you tried dithering based on the frame (temporal dithering vs
    spatial dithering)? First frame is one set of colors, the next frame is
    a second set of colors. I think it may work if the refresh rate is high
    enough (120 Hz). IIRC I tried this a while ago and was not happy with
    the results. I also tried rotating the dithering pattern around each frame.

    <snip>

    Generally, the text mode operates in a 640x200 mode with 8x8 + 128b
    cells, so 32K of VRAM used (for 80x25 cells).

    For the text mode, an 800x600 mode is used on my system, with 12x18 cells
    so that I can read the display at a distance (64x32 characters).

    The font then has 64 block-graphic characters (2x3 blocks). Low-res
    graphics can be done in text mode with the appropriate font size and
    block graphics characters. Color selection is limited though.
    In this case, a 40x25 color-cell mode (with 256-bit cells) could be used
    for graphics (32K). Early on, this was used as the graphics mode for
    Doom and similar, before I later expanded VRAM to 128K and switched to 320x200 Hicolor.


    The bitmap modes are non-raster, generally with pixels packed into 8x8
    or 4x4 blocks.
    4x4:
      16bpp: pixels in raster order.
       8bpp: raster order, 32-bits per row
       4bpp: Raster order, 16-bits per row
    And, 8x8:
       4bpp: Takes 16bpp layout, splits each pixel into 2x2.
       2bpp: Takes  8bpp layout, splits each pixel into 2x2.
       1bpp: Raster order, 1bpp, but same order as text glyphs.
         With MSB in upper left, LSB in lower right.


    <snip>

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Nov 2 17:21:52 2025
    From Newsgroup: comp.arch

    On 11/2/2025 3:56 PM, Robert Finch wrote:
    On 2025-11-02 3:58 p.m., BGB wrote:
    <snip>

    No real need to go much beyond RGB555, as the FPGA boards have VGA
    DACs that generally fall below this (Eg: 4 bit/channel on the Nexys
    A7). And, 2-bit for many VGA PMods (PMod allowing 8 IO pins, so
    RGB222+H/V Sync; or needing to use 2 PMOD connections for the VGA).
    The usual workaround was also to perform dithering while driving the
    VGA output (with ordered dither in the Verilog).


    I am using an HDMI interface so the monitor is fed 24-bit RGB digitally.
    I tried to get a display channel interface working but no luck. VGA is
    so old.


    Never went up the learning curve for HDMI.
    Would likely need to drive the monitor outputs with SERDES or similar
    though.


    Have you tried dithering based on the frame (temporal dithering vs
    space-al dithering)? First frame is one set of colors, the next frame is
    a second set of colors. I think it may work if the refresh rate is high enough (120 Hz). IIRC I tried this a while ago and was not happy with
    the results. I also tried rotating the dithering pattern around each frame.


    Temporal dithering seems to generate annoying artifacts on the monitors
    I tried it on; it tended to result in wavy/rippling artifacts.

    Likewise, PWM'ing the pixels also makes LCD monitors unhappy (rainbow
    banding artifacts), but seems to work OK on CRTs. I suspect it is an
    issue that the monitors expect a 25MHz pixel clock (when using 640x400
    or 640x480 timing) with an ADC that doesn't like sudden changes in level
    (say, if updating the pixels at 50MHz internally).


    <snip>

    Generally, the text mode operates in a 640x200 mode with 8x8 + 128b
    cells, so 32K of VRAM used (for 80x25 cells).

    For the text mode 800x600 mode is used on my system, with 12x18 cells so that I can read the display at a distance (64x32 characters).

    The font then has 64 block graphic characters of 2x3 block. Low-res
    graphics can be done in text mode with the appropriate font size and
    block graphics characters. Color selection is limited though.

    I went with 80x25 as it is pretty standard;
    80x50 is also possible, but less standard.

    Though, Linux seems to often like using high-res text modes rather than
    the usual 80x25 or similar.

    As for 8x8 character cells:
    Also pretty standard, and fit nicely into 64 bits.



    In theory, for a text mode, could drive a monitor at 1280x400 with
    640x400 timings for 16x16 character cells, but LCD monitors don't like
    this sort of thing.


    Even at 640x400/70Hz timings, the monitor didn't consistently recognize
    it as 640x400, and would sometimes try to detect it as 720x400 or
    similar (which would look wonky).

    The other option being to output 640x480 and simply black-fill the extra
    lines (so, add 20 lines of black-fill at the top and bottom of the
    screen). Where, the monitors were able to more reliably detect 640x480/60Hz


    The main tradeoff is that mostly I have a limited selection of pixel
    clocks available:
    25, 50, maybe 100.

    Mostly because the pixel clocks are high enough and clock-edges
    sensitive enough where accumulation timers don't really work.

    Though, accumulation timers do work for driving an NTSC composite
    output. But, NTSC composite looks poor, can't even really do an 80x25
    text mode acceptably (if using colorburst); but can do 80x25 if one can
    accept black-and-white.

    Well, there was also component video, but this is basically the same as driving VGA (just with it being able to accept both NTSC and VGA
    timings; eg, 15 to 70 kHz for horizontal refresh, 40 to 90 Hz vertical,
    ...).

    Though, I no longer have the display that had component video inputs.


    In contrast, there is generally a very limited range of timings for
    composite or S-Video (generally, these don't accept VGA-like timings). Whereas, VGA only really accepts VGA-like timings, and is unhappy if
    given NTSC timings (eg: 15 kHz horizontal refresh).


    Not sure why component video is seemingly the only "accepts whatever you
    throw at it" analog input (say, on a display with multiple input types
    and presumably similar hardware internally).


    Checking, it is annoyingly hard to find plain LCD monitors with component
    video inputs that are not also full TVs with a TV tuner (but, a little
    easier to find ones with both VGA and composite). Closest I can find are apparently intended mostly as CCTV monitors.


    But, mostly using VGA anyways, so...


    ...




    In this case, a 40x25 color-cell mode (with 256-bit cells) could be
    used for graphics (32K). Early on, this was used as the graphics mode
    for Doom and similar, before I later expanded VRAM to 128K and
    switched to 320x200 Hicolor.


    The bitmap modes are non-raster, generally with pixels packed into 8x8
    or 4x4 blocks.
    4x4:
       16bpp: pixels in raster order.
        8bpp: raster order, 32-bits per row
        4bpp: Raster order, 16-bits per row
    And, 8x8:
        4bpp: Takes 16bpp layout, splits each pixel into 2x2.
        2bpp: Takes  8bpp layout, splits each pixel into 2x2.
        1bpp: Raster order, 1bpp, but same order as text glyphs.
          With MSB in upper left, LSB in lower right.


    <snip>


    ...

    But, yeah, my makeshift graphics hardware is a little wonky.
    And it works in an almost entirely different way from the VGA style hardware.

    Ironically, software doesn't configure timings itself, but rather uses selector bits to control various properties:
    Base Resolution (640x400, 640x480, 800x600, ...);
    Character cell size in pixels (4x4 or 8x8);
    Settings to modify the number of horizontal and vertical cells relative
    to the base resolution;
    ...

    But, for the most part, I had been using 640x400 or similar; with 800x600
    as more experimental (and doesn't look great with 2bpp cells).

    The 1024x768 mode had gone mostly unused, and is still untested on real hardware.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Nov 3 15:22:44 2025
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Michael S wrote:

    I mostly use ULP/Guard/Sticky in the same meaning. Except when I use
    them, esp. Guard, differently.
    Given the choice, [in the context of binary floating point] I'd rather
    not use the term 'guard' at all. Names like 'rounding bit' or
    'half-ULP' are far more self-describing.

    Guard also works for decimal FP, where you need a single Sticky bit if
    the Guard digit is equal to 5.

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Nov 3 11:53:48 2025
    From Newsgroup: comp.arch

    On 11/3/2025 9:22 AM, Scott Lurndal wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Michael S wrote:

    I mostly use ULP/Guard/Sticky in the same meaning. Except when I use
    them, esp. Guard, differently.
    Given the choice, [in the context of binary floating point] I'd rather
    not use the term 'guard' at all. Names like 'rounding bit' or
    'half-ULP' are far more self-describing.

    Guard also works for decimal FP, where you need a single Sticky bit if
    the Guard digit is equal to 5.

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?


    I would assume he meant something like either the newer IEEE-754 decimal formats, or a decimal-FP format that MS had used in .NET, ...

    The IEEE formats are generally one of:
      Linear mantissa understood as decimal;
      Groups of 10 bits, each used to encode 3 digits,
        as Densely Packed Decimal.
    With a power-of-10 exponent.

    The .NET format was similar, except using groups of 32 bits as linear
    values representing 9 digits.

    When I looked at it before, the most practical way for me to support
    something like this seemed to be to not do it directly in hardware, but
    to support a subset of operations:
      Operations to pack and unpack DPD into BCD;
        Say: a 64-bit value holds 15 BCD digits, mapped to 50 bits of DPD.
      Some basic operations to help with arithmetic on BCD.

    I partly implemented these as an experiment before, but then noted I
    have basically no use case for Decimal-FP in my project.

    And, ironically, the main benefit the helpers would have provided would
    be to allow for faster Binary<->Decimal conversion. But even that is
    debatable, as Binary<->Decimal conversion doesn't itself consume enough CPU
    time to justify making it faster at the cost of needing to drag around BCD
    helper instructions.

    One downside is that there was no multiplier, so the BCD helpers would
    need to be used to effectively implement a Radix-10 Shift-and-Add.

    ...


    Though, it is debatable; something more like the .NET approach could
    make more sense for a SW implementation.

    If one wants to make the encoding use the bits more efficiently, a
    hybrid approach could make sense, say:
      Use 3 groups of 30 bits, and another group of 20 bits (6 digits);
      Use a 17-bit linear exponent and a sign bit.

    This would be slightly cheaper to implement vs what is defined in the
    standard (for the BID variant), and could achieve a similar effect
    (though, with 33 digits rather than 34).

    Internally, it could work similarly to the .NET approach, just with a
    little more up-front work to pack/unpack the 30-bit components. The merit of
    30 bit groups being that they map internally onto 32-bit integer
    operations (which would also provide a space internally for carry/borrow signaling in operations).

    Most CPUs at least have native support for 32-bit integer math, and for
    SW (on a 32/64 bit machine) this could be an easier chunking size than
    10 bits. Someone could argue for 60 bit chunking on a 64-bit machine
    (or, one 60 bit chunk, and a 50 bit chunk), but likely this wouldn't
    save much over 30 bit chunking.

    Also, 60-bit chunking would imply access to a 64*64->128 bit widening multiply, which is asking more than 32*32->64. It also precludes some
    ways to more cheaply implement the divide/modulo step for each chunk
    (*). So, it is likely in this sense 30 bit chunks could still be preferable.

    *:
      high = product >> 30;
      low  = product - (high * 1000000000LL);
      if (low >= 1000000000)
        { high++; low -= 1000000000; }
    Where, 60-bit chunking would require 128-bit math here.

    Where, effectively, the multiply step is operating in radix-1-billion.

    ...
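
    A minimal C sketch of the chunked (radix-1-billion) multiply step, under
    assumptions: least-significant chunk first, each chunk 0..999999999,
    multiplying by a small integer. Purely illustrative; it uses a plain
    divide where the shift+correct trick above could be substituted:

    #include <stdint.h>

    #define CHUNK_BASE 1000000000u

    static void chunks_mul_small(uint32_t *chunks, int n, uint32_t m)
    {
        uint64_t carry = 0;
        for (int i = 0; i < n; i++) {
            uint64_t product = (uint64_t)chunks[i] * m + carry;   /* 32*32->64 */
            carry = product / CHUNK_BASE;
            chunks[i] = (uint32_t)(product - carry * CHUNK_BASE);
        }
        /* a nonzero carry here means the result needs one more chunk */
    }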



    Still don't have much of a use-case though.

    In general, Decimal-FP seems more like a solution in search of a problem.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Nov 3 18:47:36 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:
    Contemplating having conditional branch instructions branch to a target value in a register instead of using a displacement.

    I think this has about the same code density as having a branch to a displacement from the IP.

    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register), tell the CPU
    what the target is (like VEC in My66000) or just use a general
    purpose register with a general-purpose instruction.

    Using a fused compare-and-branch instruction for Qupls4

    Is that the name of your architecture, or an instruction? (That
    may have been mentioned upthread, in that case I don't remember).

    there is not
    enough room in the instruction for a large branch displacement (10
    bits). So, my thought is to branch to a register value instead.
    There is already an add-to-instruction-pointer instruction that can be
    used to generate relative addresses.

    That makes sense.

    By moving the register load outside of a loop, the dynamic instruction
    count can be reduced. I think this solution is a bit better than having compare and branch as two separate instructions, or having an extended constant added to the branch instruction.

    Are you talking about a normal loop condition or a jump out of
    a loop?

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    If you use a link register or a special instruction, the CPU could
    do that.

    The 10-bit displacement format could also be supported, but it is yet another branch instruction format. I may leave holes in the instruction
    set for future support, but I think it is best to start with just a
    single format.

    Code:
    AIPSI R3,1234 ; add displacement to IP and store in R3 (hoist-able)
    BLT R1,R2,R3 ; branch to R3 if R1 < R2

    Versus:
    CMP R3,R1,R2
    BLT R3,displacement
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Nov 3 19:03:13 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always do the >> op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).

    Likely. My 66000 also has RNO, and
    Round Nearest Random is defined but not yet available;
    Round Away from Zero is also defined and available.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Nov 3 19:13:50 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Contemplating having conditional branch instructions branch to a target value in a register instead of using a displacement.

    I think this has about the same code density as having a branch to a displacement from the IP.

    Using a fused compare-and-branch instruction for Qupls4 there is not
    enough room in the instruction for a large branch displacement (10
    bits). So, my thought is to branch to a register value instead.
    There is already an add-to-instruction-pointer instruction that can be
    used to generate relative addresses.

    The VEC instruction (My 66000) provides a register that is used for
    the address of the top of the loop and the address of the VEC inst
    itself. So, when running in the loop, the LOOP instruction branches
    to the register value, and when taking an exception in the loop,
    the register leads back to the VEC instruction for after the excpt
    has been performed.

    By moving the register load outside of a loop, the dynamic instruction
    count can be reduced. I think this solution is a bit better than having compare and branch as two separate instructions, or having an extended constant added to the branch instruction.

    VEC-{ }-LOOP always saves at least 1 instruction per iteration.

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    VEC does its own predictions. LOOP does not overrun the loop-count,
    so loop termination is not a pipeline flush.

    The 10-bit displacement format could also be supported, but it is yet another branch instruction format. I may leave holes in the instruction
    set for future support, but I think it is best to start with just a
    single format.

    Code:
    AIPSI R3,1234 ; add displacement to IP and store in R3 (hoist-able)

    LDA Rd,[IP,displacement]

    BLT R1,R2,R3 ; branch to R3 if R1 < R2

    Versus:
    CMP R3,R1,R2
    BLT R3,displacement

    But if you create "R3" from your VEC instruction, you KNOW that
    the compiler is only allowed to use "r3" as a branch target, and
    that "R3" is static over the duration of the loop, so you can get
    the reservation stations moving faster/easier.

    I have a "special" RS for the VEC-LOOP brackets.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Nov 3 23:04:53 2025
    From Newsgroup: comp.arch

    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Michael S wrote:

    I mostly use ULP/Guard/Sticky in the same meaning. Except when I
    use them, esp. Guard, differently.
    Given the choice, [in the context of binary floating point] I'd
    rather not use the term 'guard' at all. Names like 'rounding bit'
    or 'half-ULP' are far more self-describing.

    Guard also works for decimal FP, where you need a single Sticky bit
    if the Guard digit is equal to 5.

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it became a
    part of the main standard).
    IEEE 754 has two options for encoding of the mantissa: IBM's DPD, which is
    a clever variation on Base 1000, and Intel's binary (BID).
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option; its information density is insufficient to
    supply the required semantics in the given container size.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Nov 4 08:50:25 2025
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Michael S wrote:

    I mostly use ULP/Guard/Sticky in the same meaning. Except when I use
    them, esp. Guard, differently.
    Given the choice, [in the context of binary floating point] I'd rather
    not use the term 'guard' at all. Names like 'rounding bit' or
    'half-ULP' are far more self-describing.

    Guard also works for decimal FP, where you need a single Sticky bit if
    the Guard digit is equal to 5.

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    No, I meant ieee754 DFP, where you either store the decimal digits in
    packed modulo-1000 groups, or as a binary mantissa with a decimal exponent/scaling value.

    When you do math with these you have to handle all the required
    (financial?) rounding modes.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Nov 4 07:50:33 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register),

    There is this idea of splitting an (indirect) branch into a
    prepare-to-branch instruction and a take-branch instruction. The prepare-to-branch instruction announces the branch target to the CPU,
    and Power's mtlr and mtctr are examples of that (somewhat muddled by
    the fact that the ctr register can also be used for counted loops as
    well as for indirect branches), and IA-64's branch-target registers
    and the instructions that move there are another example. AFAIK SPARC
    acquired something in this direction (touted as good for accelerating
    Java) in the early 2000s. The take-branch instruction on Power is
    blr/bctr.

    I used to think that this kind of splitting is a good idea, and it is
    certainly better than a branch-delay slot or a branch with a fixed
    number of delay slots.

    But in practice, it turned out that Intel and AMD processors had much
    better performance on indirect-branch intensive workloads in the early
    2000s without this architectural feature. What happened?

    The IA-32 and AMD64 microarchitects implemented indirect-branch
    prediction; in the early 2000s it was based on the BTB, which these
    CPUs need for fast direct branching anyway. They were not content
    with that, and have implemented history-based indirect branch
    predictors in the meantime, which improve the performance even more.

    By contrast, Power and IA-64 implementations apparently rely on
    getting the target-address early enough, and typically predict that
    the indirect branch will go to the current contents of the
    branch-target register when the front-end encounters the take-branch instruction; but if the prepare-to-branch instruction is in the
    instruction stream just before the take-branch instruction, it takes
    several cycles until the prepare-to-branch actually can move the
    target to the branch-target register. In case of an OoO
    implementation, the number of cycles tends to be longer. It's
    essentially a similar latency as in a branch misprediction.

    That all would not be so bad, if the compilers would move the
    prepare-to-branch instructions sufficiently far away from the
    take-branch instruction. But gcc certainly has not done so whenever I
    looked at code it generated for PowerPC or IA-64.

    Here is some data for code that focusses on indirect-branch
    performance (with indirect branches that vary their targets), from <https://www.complang.tuwien.ac.at/forth/threading/>:

    Numbers are cycles per indirect branch, smaller is faster, the years
    are the release dates of the CPUs:

    First, machines from the early 2000s:

    sub-               in-                       repl.
    routine   direct   direct   switch   call    switch   CPU                           year
    9.6       8.0      9.5      23.1     38.6             Alpha 21264B 800MHz           ~2000
    4.7       8.1      9.5      19.0     21.3             Pentium III 1000MHz           2000
    18.4      8.5      10.3     24.5     29.0             Athlon 1200MHz                2000
    8.6       14.2     15.3     23.4     30.2             Pentium 4 2.26                2002
    13.3      10.3     12.3     15.7     18.7             Itanium 2 (McKinley) 900MHz   2002
    5.7       9.2      12.3     16.3     17.9             PPC 7447A 1066MHz             2004
    7.8       12.8     12.9     30.2     39.0             PPC 970 2000MHz               2002

    Ignore the first column (it uses call and return), the others all need
    an indirect branch or indirect call ("call" column) per dispatch, with
    varying amounts of other instructions; "direct" needs the least
    instructions.

    And here are results with some newer machines:

    sub-               in-                       repl.
    routine   direct   direct   switch   call    switch   CPU                           year
    4.9       5.6      4.3      5.1      7.64             Pentium M 755 2000MHz         2004
    4.4       2.2      2.0      20.3     18.6    3.3      Xeon E3-1220 3100MHz          2011
    4.0       2.3      2.3      4.0      5.1     3.5      Core i7-4790K 4400MHz         2013
    4.2       2.1      2.0      4.9      5.2     2.7      Core i5-6600K 4000MHz         2015
    5.7       3.2      3.9      7.0      8.6     3.7      Cortex-A73 1800MHz            2016
    4.2       3.3      3.2      17.9     23.1    4.2      Ryzen 5 1600X 3600MHz         2017
    6.9       24.5     27.3     37.1     33.5    36.6     Power9 3800MHz                2017
    3.8       1.0      1.1      3.8      6.2     2.2      Core i5-1135G7 4200MHz        2020

    The age of the Pentium M would suggest putting it into the earlier
    table, but given its clear performance-per-clock advantage over the
    other IA-32 and AMD64 CPUs of its day, it was probably the first CPU
    to have a history-based indirect-branch predictor.

    It seems that, while the AMD64 microarchitectures improved not just in
    clock rate, but also in performance per clock for this microbenchmark
    (thanks to history-based indirect-branch predictors), the Power 9
    still relies on its split-branch architectural feature, resulting in
    slowness. And it's not just slowness in "direct", but the additional instructions in the other benchmarks add more cycles than in most
    other CPUs.

    Particularly notable is the Core i5-1135G7, which takes one indirect
    branch per cycle.

    I have to take additional measurements with other Power and AMD64
    processors.

    Couldn't the Power and IA-64 CPUs use history-based branch prediction,
    too? Of course, but then it would be even more obvious that the
    split-branch architecture provides no benefit.

    Bottom line: History-based branch prediction has won, any kind of
    delayed branches (including split-branch designs) turn out to be a bad idea.

    tell the CPU
    what the target is (like VEC in My66000)

    I have no idea what VEC does, but all indirect-branch architectures
    are about telling the CPU what the target is.

    just use a general
    purpose register with a general-purpose instruction.

    That turns out to be the winner.

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    If you want to be able to perform one taken branch per cycle (or
    more), you always need prediction.

    If you use a link register or a special instruction, the CPU could
    do that.

    It turns out that this does not work well in practice.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Nov 4 15:19:08 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it became a
    part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
    a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Nov 4 17:41:07 2025
    From Newsgroup: comp.arch

    On Tue, 04 Nov 2025 15:19:08 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it
    became a part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
    a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    What is not clear about 'in given size of container' ?
    Semantics of IEEE Decimal128 call for 33 decimal digits + 1 binary bit
    to be contained within 111 bits.
    With BCD encoding one would need 133 bits.

    Decimal32 and Decimal64 would suffer from a similar mismatch, but those
    formats are probably not important. IMHO, IEEE defined them for the sake
    of completeness rather than because they are useful in the real world.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Nov 4 07:47:50 2025
    From Newsgroup: comp.arch

    On 11/4/2025 7:19 AM, Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it became a
    part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
    a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    By "information density" I think he means that for almost any (I won't
    say any because there might be some edge cases where the isn't true)
    value, it takes fewer bits to represent in the IEEE scheme than in your beloved Burroughs Medium system's scheme. :-) Fewer bits per value
    means higher information density.

    Fewer bits means less hardware, thus lower cost, less power
    required, etc.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Nov 4 16:52:18 2025
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it became a
    part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
    a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    It needs to be comparable to binary FP:

    A 64-bit double provides 53 mantissa bits; this corresponds to nearly 16
    decimal digits, while fp128 gives us 113 bits or a smidgen over 34 digits.

    The corresponding 128-bit DFP format also provides 34 decimal digits,
    with an exponent range which covers 10^-6143 to 10^6144, while the 15
    exponent bits in binary128 cover 2^-16k to 2^16k, corresponding to 5.9e(+/-)4931.

    I.e. the DFP format has the same precision and a larger range than BFP.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Nov 4 18:54:58 2025
    From Newsgroup: comp.arch

    On Tue, 4 Nov 2025 16:52:18 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it
    became a part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which
    is a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    It is needed to be comparable to binary FP:

    A 64-bit double provides 54 mantissa bits, this corresponds to 16+
    decimal digits, while fp128 gives us 113 bits or a smidgen over 34
    digits.

    The corresponding 128-bit DFP format also provides 34 decimal digts,
    with an exponent range which covers 10^-6143 to 10^6144, while the 15 exponent bits in binary128 covers 2^-16k to 2^16k, corresponding to 5.9e(+/-)4931.

    I.e. the DFP format has the same precision and a larger range than
    BFP.

    Terje


    Nitpick:
    In the best case, i.e. cases where mantissa of BFP is close to 2 and MS
    digit of DFP =9, [relative] precision is indeed almost identical.
    But in the worst case, i.e. cases where mantissa of BFP is close to 1
    and MS digit of DFP =1, [relative] precision of BFP is 5 times better.
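
    A concrete instance of that worst case (worked numbers, not from the
    post): near 1.0, binary128's ulp is 2^-112 ~= 1.9e-34, while decimal128
    with a leading significand digit of 1 has an ulp of 1e-33, about 5.2
    times coarser.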


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Nov 4 17:12:54 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 04 Nov 2025 15:19:08 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it
    became a part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
    a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    What is not clear about 'in given size of container' ?
    Semantics of IEEE Decimal128 call for 33 decimal digits + 1 binary bit
    to be contained within 111 bits.
    With BCD encoding one would need 133 bits.

    I guess it wasn't clear that my question was regarding
    the necessity of providing 'hidden' bits for BCD floating
    point.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Nov 4 20:13:36 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Tue, 4 Nov 2025 16:52:18 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it
    became a part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which
    is a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    It is needed to be comparable to binary FP:

    A 64-bit double provides 54 mantissa bits, this corresponds to 16+
    decimal digits, while fp128 gives us 113 bits or a smidgen over 34
    digits.

    The corresponding 128-bit DFP format also provides 34 decimal digts,
    with an exponent range which covers 10^-6143 to 10^6144, while the 15
    exponent bits in binary128 covers 2^-16k to 2^16k, corresponding to
    5.9e(+/-)4931.

    I.e. the DFP format has the same precision and a larger range than
    BFP.

    Terje


    Nitpick:
    In the best case, i.e. cases where mantissa of BFP is close to 2 and MS
    digit of DFP =9, [relative] precision is indeed almost identical.
    But in the worst case, i.e. cases where mantissa of BFP is close to 1
    and MS digit of DFP =1, [relative] precision of BFP is 5 times better.

    Agreed.

    It is somewhat similar to the very old hex FP, which had a wider exponent
    range but more variable precision.

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Nov 4 19:15:31 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register),

    There is this idea of splitting an (indirect) branch into a
    prepare-to-branch instruction and a take-branch instruction. The

    I first heard about this 1982 from Burton Smith.

    prepare-to-branch instruction announces the branch target to the CPU,
    and Power's mtlr and mtctr are examples of that (somewhat muddled by
    the fact that the ctr register can also be used for counted loops as
    well as for indirect branches), and IA-64's branch-target registers
    and the instructions that move there are another example. AFAIK SPARC acquired something in this direction (touted as good for accelerating
    Java) in the early 2000s. The take-branch instruction on Power is
    blr/bctr.

    I used to think that this kind of splitting is a good idea, and it is certainly better than a branch-delay slot or a branch with a fixed
    number of delay slots.

    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    But in practice, it turned out that Intel and AMD processors had much
    better performance on indirect-branch intensive workloads in the early
    2000s without this architectural feature. What happened?

    We threw HW at the problem.

    The IA-32 and AMD64 microarchitects implemented indirect-branch
    prediction; in the early 2000s it was based on the BTB, which these
    CPUs need for fast direct branching anyway. They were not content
    with that, and have implemented history-based indirect branch
    predictors in the meantime, which improve the performance even more.

    By contrast, Power and IA-64 implementations apparently rely on
    getting the target-address early enough, and typically predict that
    the indirect branch will go to the current contents of the
    branch-target register when the front-end encounters the take-branch instruction; but if the prepare-to-branch instruction is in the
    instruction stream just before the take-branch instruction, it takes
    several cycles until the prepare-to-branch actually can move the
    target to the branch-target register. In case of an OoO
    implementation, the number of cycles tends to be longer. It's
    essentially a similar latency as in a branch misprediction.

    That all would not be so bad, if the compilers would move the prepare-to-branch instructions sufficiently far away from the
    take-branch instruction. But gcc certainly has not done so whenever I
    looked at code it generated for PowerPC or IA-64.

    Here is some data for code that focusses on indirect-branch
    performance (with indirect branches that vary their targets), from <https://www.complang.tuwien.ac.at/forth/threading/>:

    Numbers are cycles per indirect branch, smaller is faster, the years
    are the release dates of the CPUs:

    First, machines from the early 2000s:

    sub-               in-                       repl.
    routine   direct   direct   switch   call    switch   CPU                           year
    9.6       8.0      9.5      23.1     38.6             Alpha 21264B 800MHz           ~2000
    4.7       8.1      9.5      19.0     21.3             Pentium III 1000MHz           2000
    18.4      8.5      10.3     24.5     29.0             Athlon 1200MHz                2000
    8.6       14.2     15.3     23.4     30.2             Pentium 4 2.26                2002
    13.3      10.3     12.3     15.7     18.7             Itanium 2 (McKinley) 900MHz   2002
    5.7       9.2      12.3     16.3     17.9             PPC 7447A 1066MHz             2004
    7.8       12.8     12.9     30.2     39.0             PPC 970 2000MHz               2002

    Ignore the first column (it uses call and return), the others all need
    an indirect branch or indirect call ("call" column) per dispatch, with varying amounts of other instructions; "direct" needs the least
    instructions.

    And here are results with some newer machines:

    sub-               in-                       repl.
    routine   direct   direct   switch   call    switch   CPU                           year
    4.9       5.6      4.3      5.1      7.64             Pentium M 755 2000MHz         2004
    4.4       2.2      2.0      20.3     18.6    3.3      Xeon E3-1220 3100MHz          2011
    4.0       2.3      2.3      4.0      5.1     3.5      Core i7-4790K 4400MHz         2013
    4.2       2.1      2.0      4.9      5.2     2.7      Core i5-6600K 4000MHz         2015
    5.7       3.2      3.9      7.0      8.6     3.7      Cortex-A73 1800MHz            2016
    4.2       3.3      3.2      17.9     23.1    4.2      Ryzen 5 1600X 3600MHz         2017
    6.9       24.5     27.3     37.1     33.5    36.6     Power9 3800MHz                2017
    3.8       1.0      1.1      3.8      6.2     2.2      Core i5-1135G7 4200MHz        2020

    The age of the Pentium M would suggest putting it into the earlier
    table, but given its clear performance-per-clock advantage over the
    other IA-32 and AMD64 CPUs of its day, it was probably the first CPU
    to have a history-based indirect-branch predictor.

    It seems that, while the AMD64 microarchitectures improved not just in
    clock rate, but also in performance per clock for this microbenchmark
    (thanks to history-based indirect-branch predictors), the Power 9
    still relies on its split-branch architectural feature, resulting in slowness. And it's not just slowness in "direct", but the additional instructions in the other benchmarks add more cycles than in most
    other CPUs.

    Particularly notable is the Core i5-1135G7, which takes one indirect
    branch per cycle.

    I have to take additional measurements with other Power and AMD64
    processors.

    Couldn't the Power and IA-64 CPUs use history-based branch prediction,
    too? Of course, but then it would be even more obvious that the
    split-branch architecture provides no benefit.

    Bottom line: History-based branch prediction has won, any kind of
    delayed branches (including split-branch designs) turn out to be
    a bad idea.

    Or "Never bet against branch prediction".

    tell the CPU
    what the target is (like VEC in My66000)

    I have no idea what VEC does, but all indirect-branch architectures
    are about telling the CPU what the target is.

    VEC is the bracket at the top of a loop. VEC supplies a register
    which will contain the address of the instruction at the top of
    the loop, and a 21-bit vector used to specify those registers which
    are "Live" out of the loop. VEC is "executed" as the loop is entered
    and then not again until the loop is entered again.

    The LOOP instruction is the bottom bracket of the loop and performs
    the ADD-CMP-BC sequence as a single instruction. There are 3 flavors
    {counted, value terminated, counter value terminated} that use the
    3 registers similarly but differently.
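
    A rough C-level gloss (my reading of the description above, not the ISA
    definition) of the per-iteration work a counted LOOP folds into a single
    instruction at the bottom of the loop:

    #include <stddef.h>

    static void loop_gloss(size_t n)
    {
        size_t i = 0;
    loop_top:
        /* ... loop body ... */
        i = i + 1;          /* ADD: advance the induction variable */
        if (i < n)          /* CMP: compare against the trip count */
            goto loop_top;  /* BC : branch back to the VEC'd top   */
    }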

    just use a general
    purpose register with a general-purpose instruction.

    That turns out to be the winner.

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    With VEC-LOOP you are guaranteed that the branch and its target are
    100% correlated.

    If you want to be able to perform one taken branch per cycle (or
    more), you always need prediction.

    Greater than 1 branch per FETCH latency.

    If you use a link register or a special instruction, the CPU could
    do that.

    It turns out that this does not work well in practice.

    Agreed.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Nov 4 20:16:59 2025
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 04 Nov 2025 15:19:08 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it
    became a part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
    a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    What is not clear about 'in given size of container' ?
    Semantics of IEEE Decimal128 call for 33 decimal digits + 1 binary bit
    to be contained within 111 bits.
    With BCD encoding one would need 133 bits.

    I guess it wasn't clear that my question was regarding
    the necessity of providing 'hidden' bits for BCD floating
    point.

    I thought that was obvious:

    When you learned how to do decimal rounding back in your pen & paper
    math classes, you probably realized that for any calculation which could
    not be done exactly, you had to generate enough extra digits to be sure
    how to round.

    Those extra digits play exactly the same role as Guard + Sticky do in
    binary FP.
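
    As a small illustration (my sketch, not Terje's code): round-to-nearest-
    even of a decimal significand, given the guard digit and a sticky flag:

    #include <stdint.h>

    /* 'guard' is the first dropped digit (0..9); 'sticky' says whether any
       later dropped digit was non-zero.  A carry out of the top digit must
       be handled by the caller. */
    static uint64_t round_nearest_even_dec(uint64_t digits, int guard, int sticky)
    {
        /* digits & 1 is the parity of the last decimal digit (10 is even) */
        if (guard > 5 || (guard == 5 && (sticky || (digits & 1))))
            return digits + 1;   /* round up   */
        return digits;           /* round down */
    }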

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Nov 4 21:07:43 2025
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Nov 4 22:44:21 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Bottom line: History-based branch prediction has won, any kind of
    delayed branches (including split-branch designs) turn out to be
    a bad idea.

    Or "Never bet against branch prediction".

    I have probably mentioned this before, once or twice, but I'm actually
    quite proud of the meeting I had with Intel Santa Clara in the spring of
    1995:

    I had (accidentally) written the first public mention of the FDIV bug
    (on comp.sys.intel) in Oct 1994, then together with Cleve Moler of MathWorks/MatLab fame led the effort to develop a minimum cost sw
    workaround for the bug. (My code became part of all/most x86 compiler
    runtimes for the next few years.)

    Due to this Intel invited me to receive an early engineering prototype
    of the PentiumPro, together with an NDA-covered briefing about its architecture.

    Before the start of that briefing I suggested that I should start off on
    the blackboard by showing what I had been able to figure out on my own,
    then I proceeded to pretty much exactly cover every single feature on
    the cpu, with one glaring exception:

    Based on the useful but not great branch predictor on the Pentium I told
    them that I expected the P6 to employ eager execution, i.e. execute both
    ways of one or two layers of branches, discarding the non-taken paths as
    the branch direction info became available.

    That's the point when they got to brag about how having a much, much
    better branch predictor was better both from a performance and a power viewpoint, since out of order execution could predict much deeper than
    any eager execution would have the resources for.

    As you said: "Never bet against branch prediction".

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Nov 4 22:52:46 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Several options; the easiest is of course a set of full forward/reverse
    lookup tables, but you can take advantage of the regularities by using
    smaller tables together with a little bit of logic.

    You also need a way to extract one or two digits from the top/bottom of
    each mod1000 container in order to handle normalization.
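
    A tiny sketch of the kind of digit motion that implies, assuming the
    mod-1000 groups have already been unpacked from DPD into plain 0..999
    integers in a little-endian array (my code, nothing standard):

    #include <stdint.h>

    /* Shift a significand left by one decimal digit. */
    static void shl1_digit(uint32_t *g, int n)
    {
        uint32_t carry = 0;                    /* digit entering from below  */
        for (int i = 0; i < n; i++) {
            uint32_t t = g[i] * 10u + carry;   /* at most 9999               */
            g[i]  = t % 1000u;                 /* keep three digits          */
            carry = t / 1000u;                 /* top digit moves up a group */
        }
        /* 'carry' now holds the digit shifted out of the top group */
    }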

    For the Intel binary-mantissa dfp128, normalization is the hard issue.
    Michael S has figured out some really nice tricks to speed it up, but
    when you have a (worst case) temporary 220+ bit product mantissa,
    scaling is not that easy.

    The saving grace is that almost all DFP calculations tend to employ
    relatively small numbers, mostly dfadd/dfsub/dfmul operations with fixed precision, and those will always be faster (in software) using the
    binary mantissa.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Nov 4 22:51:28 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain-dead easy: one table of 1024 entries, each 12 bits wide (DPD to BCD),
    and one table of 4096 entries, each 10 bits wide (BCD to DPD);
    isolate the 10-bit field, LD the converted value;
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Nov 4 15:46:06 2025
    From Newsgroup: comp.arch

    On 11/4/2025 11:15 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register),

    There is this idea of splitting an (indirect) branch into a
    prepare-to-branch instruction and a take-branch instruction. The

    I first heard about this 1982 from Burton Smith.

    prepare-to-branch instruction announces the branch target to the CPU,
    and Power's mtlr and mtctr are examples of that (somewhat muddled by
    the fact that the ctr register can also be used for counted loops as
    well as for indirect branches), and IA-64's branch-target registers
    and the instructions that move there are another example. AFAIK SPARC
    acquired something in this direction (touted as good for accelerating
    Java) in the early 2000s. The take-branch instruction on Power is
    blr/bctr.

    I used to think that this kind of splitting is a good idea, and it is
    certainly better than a branch-delay slot or a branch with a fixed
    number of delay slots.

    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 00:44:18 2025
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Bottom line: History-based branch prediction has won, any kind of
    delayed branches (including split-branch designs) turn out to be
    a bad idea.

    Or "Never bet against branch prediction".

    I have probably mentioned this before, once or twice, but I'm actually
    quite proud of the meeting I had with Intel Santa Clara in the spring of 1995:

    I had (accidentally) written the first public mention of the FDIV bug
    (on comp.sys.intel) in Oct 1994, then together with Cleve Moler of MathWorks/MatLab fame led the effort to develop a minimum cost sw
    workaround for the bug. (My code became part of all/most x86 compiler runtimes for the next few years.)

    Due to this Intel invited me to receive an early engineering prototype
    of the PentiumPro, together with an NDA-covered briefing about its architecture.

    Before the start of that briefing I suggested that I should start off on
    the blackboard by showing what I had been able to figure out on my own,
    then I proceeded to pretty much exactly cover every single feature on
    the cpu, with one glaring exception:

    Based on the useful but not great branch predictor on the Pentium I told them that I expected the P6 to employ eager execution, i.e execute both
    ways of one or two layers of branches, discarding the non-taken paths as
    the branch direction info became available.

    That's the point when they got to brag about how having a much, much
    better branch predictor was better both from a performance and a power viewpoint, since out of order execution could predict much deeper than
    any eager execution would have the resources for.

    I remember you relating this story about 6-8 years ago.

    As you said: "Never bet against branch prediction".

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 02:51:10 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register),

    There is this idea of splitting an (indirect) branch into a
    prepare-to-branch instruction and a take-branch instruction. The

    I first heard about this 1982 from Burton Smith.

    prepare-to-branch instruction announces the branch target to the CPU,
    and Power's mtlr and mtctr are examples of that (somewhat muddled by
    the fact that the ctr register can also be used for counted loops as
    well as for indirect branches), and IA-64's branch-target registers
    and the instructions that move there are another example. AFAIK SPARC
    acquired something in this direction (touted as good for accelerating
    Java) in the early 2000s. The take-branch instruction on Power is
    blr/bctr.

    I used to think that this kind of splitting is a good idea, and it is
    certainly better than a branch-delay slot or a branch with a fixed
    number of delay slots.

    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Probably.

    I find it somewhat amusing that modern languages moved away from
    label variables and into method calls -- which if you look at it
    from 5,000 feet/metres -- is just a more expensive "label".

    I also find it amusing that the backbone of modern software is
    a static version of label variables -- we call them switch state-
    ments.

    But you can be sure COBOL got them from assembly language programmers.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Nov 4 23:43:48 2025
    From Newsgroup: comp.arch

    On 11/4/2025 4:51 PM, MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.


    In SW, you would still need to burn 16 bits per entry on the table, and possibly have code to fill in the tables (well, unless the numbers are expressed in code).


    A similar strategy is often used for sin/cos in many 90s era games,
    though the table is big enough that it would likely be impractical to
    type out by hand (or calculate using mental math).

    It is likely someone at ID Software or similar wrote out code at one
    point to spit out the sin+cos lookup table as a big blob of C (say,
    because an 8192 entry table is likely too big to be reasonable to type
    out by hand).
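
    A sketch of the sort of generator being speculated about (my code, not
    id's): spit out an 8192-entry sine table as a big blob of C.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        const double pi = 3.14159265358979323846;
        printf("static const float sintab[8192] = {\n");
        for (int i = 0; i < 8192; i++) {
            double a = (2.0 * pi * i) / 8192.0;
            printf("\t%.8ff,%s", sin(a), ((i & 7) == 7) ? "\n" : "");
        }
        printf("};\n");
        return 0;
    }

    (Link with -lm; the output is the kind of blob one pastes into the game
    source rather than typing 8192 constants by hand.)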


    Sometimes it becomes a question of where exactly the tradeoff lies in
    these cases: when to use typing and mental math, and when to write some
    code to spit out a table.

    For me, the tradeoff is often somewhere around 256 numbers, or less if
    the calculation is mentally difficult (namely, whether typing or
    calculating is the bottleneck).


    Most likely, for DPD<->BCD, I would resort to using code to generate
    the lookup table.

    Then again, it might depend a lot on the person...



    You still need to build 12-bit decimal ALUs to string together

    When I did it experimentally, I had done 16 BCD digits in 64 bits...

    The cost was slightly higher than that of a 64-bit ADD/SUB unit.

    Generally, it was combining the normal 4-bit CARRY4 style logic with
    some LUTs on the output side to turn it into a sort of BCD equivalent of
    a CARRY4.

    Granted, doing it with 3/6/9 digits would be cheaper than with 16 digits.
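
    For comparison, the same 16 digits can be added branch-free in software
    with the classic bias-by-6 trick (a sketch of mine, not the HDL above):

    #include <stdint.h>

    /* Packed-BCD add, 16 digits per 64-bit word.  Bias every digit by 6 so
       binary nibble carries coincide with decimal carries, then un-bias the
       digits that did not produce a carry. */
    static uint64_t bcd_add64(uint64_t a, uint64_t b, int *carry_out)
    {
        uint64_t t1 = a + 0x6666666666666666ULL;   /* bias all 16 digits    */
        uint64_t t2 = t1 + b;                      /* plain binary add      */
        uint64_t co = (t2 < t1);                   /* carry out of digit 15 */
        uint64_t t3 = t1 ^ b;
        uint64_t t4 = t2 ^ t3;                     /* per-bit carry-ins     */
        uint64_t t5 = ~t4 & 0x1111111111111110ULL; /* boundaries w/o carry  */
        uint64_t t6 = (t5 >> 2) | (t5 >> 3);       /* 0x6 at digits 0..14   */
        if (!co)
            t6 += 0x6000000000000000ULL;           /* un-bias digit 15 too  */
        *carry_out = (int)co;
        return t2 - t6;
    }

    (E.g. adding 0x05 and 0x05 yields 0x10 with carry_out 0, i.e. BCD "10".)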


    Though, if doing it purely in software, may make sense to go a different route:
    Map DPD to a linear integer between 0 and 999;
    Combine groups of 3 values into a 32 bit value;
    Work 32 bits at a time;
    Split back up to groups of 3 digits, and map back to DPD.

    Though, depends on the operation, for some it may be faster to operate
    in groups of 3 digits at a time (and sidestep the costs of combining or splitting the values).
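
    A minimal sketch (mine) of that combine/split step, with g2 as the most
    significant of the three groups:

    #include <stdint.h>

    static uint32_t combine3(uint32_t g2, uint32_t g1, uint32_t g0)
    {
        return (g2 * 1000u + g1) * 1000u + g0;   /* 0..999999999 */
    }

    static void split3(uint32_t v, uint32_t *g2, uint32_t *g1, uint32_t *g0)
    {
        *g0 = v % 1000u;  v /= 1000u;
        *g1 = v % 1000u;
        *g2 = v / 1000u;
    }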


    Then again, thinking about it, it is possible that for the Decimal128
    BID format, the mantissa could be broken up into smaller chunks (say, 9 digits) without the need for a full-width 128-bit multiply.

    In this case, could use a narrower multiply, and the "error" from the
    overflow would exist outside of the range of digits that are being
    worked on, so effectively becomes irrelevant for the operation in
    question (so, may be able to use 32 or 64 bit multiply, and 128-bit ADD).

    Granted, this is untested.

    Well, apart from how to recombine the parts without the need for wide multiply.

    In theory, could turn it into a big pile of shifts-and-add. Not sure if
    there is a good way to limit the number of shifts-and-adds needed. Well, unless turned into multiply-by-100 (3 shift 2 add) 4x times followed by multiply by 10 (1 shift 1 add), to implement multiply by 1 billion, but
    this also sucks (vs 13 shift 12 add).

    Hmm...


    Ironically, the DPD option almost looks preferable...


    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Nov 5 05:17:53 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
    Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this
    feature?

    <https://riptutorial.com/fortran/example/11872/assigned-goto> says:
    |It can be avoided in modern code by using procedures, internal
    |procedures, procedure pointers and other features.

    I know no feature in Fortran or standard C which replaces my use of labels-as-values, the GNU C equivalent of the assigned goto. If you
    look at <https://www.complang.tuwien.ac.at/forth/threading/>, "direct"
    and "indirect" use labels-as-values, whereas "switch", "call" and
    "repl. switch" use standard C features (switch, indirect calls, and
    switch+goto respectively). "direct" and "indirect" usually outperform
    these others, sometimes by a lot.
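
    A minimal direct-threaded dispatch loop using GNU C labels-as-values, as
    a sketch (mine, not the code from that page):

    #include <stdio.h>

    int main(void)
    {
        /* the "program": an array of label addresses, one per operation */
        static void *prog[] = { &&op_inc, &&op_inc, &&op_print, &&op_halt };
        void **ip = prog;
        long acc = 0;

        goto **ip++;                    /* dispatch the first operation */
    op_inc:
        acc++;
        goto **ip++;                    /* tail-dispatch to the next op */
    op_print:
        printf("%ld\n", acc);
        goto **ip++;
    op_halt:
        return 0;
    }

    Run as-is it prints 2; the point is that every operation ends in its own
    indirect jump straight to the next one, with no central dispatch loop.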

    I also find it amusing that the backbone of modern software is
    a static version of label variables -- we call them switch state-
    ments.

    I am not sure if it's "the" backbone. Fortran has (had?) a feature
    called "computed goto" that's closer to C's switch than "assigned
    goto". Ironically, the gcc people usually call their labels-as-values
    feature "computed goto" rather than "labels as values" or "assigned
    goto".

    But you can be sure COBOL got them from assembly language programmers.

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the 6th
    edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition. Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 01:41:30 2025
    From Newsgroup: comp.arch

    On 2025-11-03 2:03 p.m., MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always do the >>>> op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).

    Likely, My 66000 also has RNO and
    Round Nearest Random is defined but not yet available
    Round Away from Zero is also defined and available.

    Round nearest random? How about round externally guided (RXG) by an
    input signal? For instance, the rounding could come from a feedback
    filter of some sort.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Nov 5 06:44:54 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
    Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler supports it
    <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    That is the problem with deleted features - compiler writers have
    to support them forever, and interaction with other features can
    lead to problems.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 01:47:56 2025
    From Newsgroup: comp.arch

    On 2025-11-03 1:47 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Contemplating having conditional branch instructions branch to a target
    value in a register instead of using a displacement.

    I think this has about the same code density as having a branch to a
    displacement from the IP.

    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register), tell the CPU
    what the target is (like VEC in My66000) or just use a general
    purpose register with a general-purpose instruction.

    Using a fused compare-and-branch instruction for Qupls4

    Is that the name of your architecture, or an instruction? (That
    may have been mentioned upthread, in that case I don't remember).

    That was the name of the architecture, but I am being fickle and
    scrapping it, restarting from the Qupls2024 architecture and evolving it into Qupls2026.


    there is not
    enough room in the instruction for a large branch displacement (10
    bits). So, my thought is to branch to a register value instead.
    There is already an add-to-instruction-pointer instruction that can be
    used to generate relative addresses.

    That makes sense.

    Using 48-bit instructions now, so there is enough room for an 18-bit
    displacement. Still having branch to register as well.
    By moving the register load outside of a loop, the dynamic instruction
    count can be reduced. I think this solution is a bit better than having
    compare and branch as two separate instructions, or having an extended
    constant added to the branch instruction.

    Are you talking about a normal loop condition or a jump out of
    a loop?

    Any loop condition that needs a displacement constant. The constant
    being loaded into a register.

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    If you use a link register or a special instruction, the CPU could
    do that.

    The 10-bit displacement format could also be supported, but it is yet
    another branch instruction format. I may leave holes in the instruction
    set for future support, but I think it is best to start with just a
    single format.

    Code:
    AIPSI R3,1234 ; add displacement to IP and store in R3 (hoist-able)
    BLT R1,R2,R3 ; branch to R3 if R1 < R2

    Versus:
    CMP R3,R1,R2
    BLT R3,displacement


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Nov 4 22:53:49 2025
    From Newsgroup: comp.arch

    On 11/4/2025 9:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
    Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this feature?

    Because it could, and often did, make the code "unfollowable". That is,
    you are reading the code, following it to try to figure out what it is
    doing and come to an assigned/alter goto, and you don't know where to go
    next. The value was set some place else in the code, who knows where,
    and thus what value it was set to, and people/programmers just aren't
    used to being able to follow code like that. BTDT.

    BTW, you mentioned that it could be implemented as an indirect jump. It
    could for those architectures that supported that feature, but it could
    also be implemented by having the Alter/Assign modify the code (i.e.
    change the address in the jump/branch instruction), and self modifying
    code is just bad.

    I am not saying it couldn't be used well. Just that it was often not,
    and when not, it caused a lot of problems.




    <https://riptutorial.com/fortran/example/11872/assigned-goto> says:
    |It can be avoided in modern code by using procedures, internal
    |procedures, procedure pointers and other features.

    I know no feature in Fortran or standard C which replaces my use of labels-as-values, the GNU C equivalent of the assigned goto. If you
    look at <https://www.complang.tuwien.ac.at/forth/threading/>, "direct"
    and "indirect" use labels-as-values, whereas "switch", "call" and
    "repl. switch" use standard C features (switch, indirect calls, and switch+goto respectively). "direct" and "indirect" usually outperform
    these others, sometimes by a lot.

    I also find it amusing that the backbone of modern software is
    a static version of label variables -- we call them switch statements.

    I am not sure if it's "the" backbone. Fortran has (had?) a feature
    called "computed goto" that's closer to C's switch than "assigned
    goto".

    As did COBOL, where it was called GO TO DEPENDING ON, but those features didn't suffer
    the problems of assigned/alter gotos.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Nov 5 06:55:49 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    That is the problem with deleted features - compiler writers have
    to support them forever, and interaction with other features can
    lead to problems.

    So does gfortran support assigned goto, too? What problems in
    interaction with other features do you see?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 5 01:00:32 2025
    From Newsgroup: comp.arch

    On 11/4/2025 3:44 PM, Terje Mathisen wrote:
    MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Bottom line: History-based branch prediction has won, any kind of
    delayed branches (including split-branch designs) turn out to be
    a bad idea.

    Or "Never bet against branch prediction".

    I have probably mentioned this before, once or twice, but I'm actually
    quite proud of the meeting I had with Intel Santa Clara in the spring of 1995:

    I had (accidentally) written the first public mention of the FDIV bug
    (on comp.sys.intel) in Oct 1994, then together with Cleve Moler of MathWorks/MatLab fame led the effort to develop a minimum cost sw
    workaround for the bug. (My code became part of all/most x86 compiler runtimes for the next few years.)

    Due to this Intel invited me to receive an early engineering prototype
    of the PentiumPro, together with an NDA-covered briefing about its architecture.

    Before the start of that briefing I suggested that I should start off on
    the blackboard by showing what I had been able to figure out on my own,
    then I proceeded to pretty much exactly cover every single feature on
    the cpu, with one glaring exception:

    Based on the useful but not great branch predictor on the Pentium I told them that I expected the P6 to employ eager execution, i.e execute both
    ways of one or two layers of branches, discarding the non-taken paths as
    the branch direction info became available.

    That's the point when they got to brag about how having a much, much
    better branch predictor was better both from a performance and a power viewpoint, since out of order execution could predict much deeper than
    any eager execution would have the resources for.

    As you said: "Never bet against branch prediction".


    Branch prediction is fun.


    When I looked around online before, a lot of stuff about branch
    prediction was talking about fairly large and convoluted schemes for the branch predictors.

    But, then always at the end of it using 2-bit saturating counters:
    weakly taken, weakly not-taken, strongly taken, strongly not taken.

    But, in my fiddling, there was seemingly a simple but moderately
    effective strategy:
    Keep a local history of taken/not-taken;
    XOR this with the low-order-bits of PC for the table index;
    Use a 5/6-bit finite-state-machine or similar.
    Can model repeating patterns up to ~ 4 bits.

    Where, the idea was that the state-machine is updated with the current
    state and branch direction, giving the next state and next predicted
    branch direction (for this state).
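    The lookup side of that can be sketched in C roughly as below (table
    sizes are made up, and the conventional 2-bit saturating counter stands
    in for the 5/6-bit FSM; replacing the counter update with a
    next_state[state][taken] table lookup gives the FSM version):

      #include <stdint.h>

      #define TBL_BITS 10
      #define TBL_MASK ((1u << TBL_BITS) - 1)

      static uint8_t ctr[1 << TBL_BITS];   /* predictor state per entry */
      static uint8_t hist[1 << TBL_BITS];  /* per-branch local history (4 bits) */

      static uint32_t pht_index(uint32_t pc)
      {
          /* XOR the branch's local history with the low-order PC bits. */
          return (pc ^ hist[pc & TBL_MASK]) & TBL_MASK;
      }

      static int predict_branch(uint32_t pc)
      {
          return ctr[pht_index(pc)] >= 2;  /* MSB of the 2-bit counter */
      }

      static void update_branch(uint32_t pc, int taken)
      {
          uint32_t i = pht_index(pc);
          if (taken  && ctr[i] < 3) ctr[i]++;   /* saturate upward   */
          if (!taken && ctr[i] > 0) ctr[i]--;   /* saturate downward */
          hist[pc & TBL_MASK] = ((hist[pc & TBL_MASK] << 1) | (taken & 1)) & 0xF;
      }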


    Could model slightly more complex patterns than the 2-bit saturating
    counters, but it is sort of a partial mystery why (for mainstream
    processors) more complex lookup schemes and 2 bit state, was preferable
    to a simpler lookup scheme and 5-bit state.

    Well, apart from the relative "dark arts" needed to cram 4-bit patterns
    into a 5 bit FSM (it is a bit easier if limiting the patterns to 3 bits).



    Then again, I had noted before that the LLMs are seemingly also not really
    able to figure out how to make a 5 bit FSM to model a full set of 4 bit patterns.


    Then again, I wouldn't expect it to be all that difficult of a problem
    for someone that is "actually smart"; so presumably chip designers could
    have done similar.

    Well, unless maybe the argument is that 5 or 6 bits of storage would
    cost more than 2 bits, but then presumably needing to have significantly larger tables (to compensate for the relative predictive weakness of
    2-bit state) would have cost more than the cost of smaller tables of 6
    bit state ?...

    Say, for example, 2b:
    00_0 => 10_0 //Weakly not-taken, dir=0, goes strong not-taken
    00_1 => 01_0 //Weakly not-taken, dir=1, goes weakly taken
    01_0 => 00_1 //Weakly taken, dir=0, goes weakly not-taken
    01_1 => 11_1 //Weakly taken, dir=1, goes strongly taken
    10_0 => 10_0 //strongly not taken, dir=0
    10_1 => 00_0 //strongly not taken, dir=1 (goes weak)
    11_0 => 01_1 //strongly taken, dir=0
    11_1 => 11_1 //strongly taken, dir=1 (goes weak)

    Can expand it to 3-bits, for 2-bit patterns
    As above, and 4-more alternating states
    And slightly different transition logic.
    Say (abbreviated):
    000 weak, not taken
    001 weak, taken
    010 strong, not taken
    011 strong, taken
    100 weak, alternating, not-taken
    101 weak, alternating, taken
    110 strong, alternating, not-taken
    111 strong, alternating, taken
    The alternating states just flip-flopping between taken and not taken.
    The weak states can move between any of the 4.
    The strong states used if the pattern is reinforced.

    Going up to 3 bit patterns is more of the same (add another bit,
    doubling the number of states). Seemingly something goes nasty when
    getting to 4 bit patterns though (and can't fit both weak and strong
    states for longer patterns, so the 4b patterns effectively only exist as
    weak states which partly overlap with the weak states for the 3-bit
    patterns).

    But, yeah, not going to type out state tables for these ones.


    Not proven, but I suspect that an arbitrary 5 bit pattern within a 6 bit
    state might be impossible. Although there would be sufficient
    state-space for the looping 5-bit patterns, there may not be sufficient state-space to distinguish whether to move from a mismatched 4-bit
    pattern to a 3 or 5 bit pattern. Whereas, at least with 4-bit, any
    mismatch of the 4-bit pattern can always decay to a 3-bit pattern, etc.
    One needs to be able to express decay both to shorter patterns and to
    longer patterns, and I suspect at this point, the pattern breaks down
    (but can't easily confirm; it is either this or the pattern extends indefinitely, I don't know...).


    Could almost have this sort of thing as a "brain teaser" puzzle or something...

    Then again, maybe other people would not find any particular difficulty
    in these sorts of tasks.


    Terje


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 02:06:50 2025
    From Newsgroup: comp.arch

    On 2025-11-05 1:47 a.m., Robert Finch wrote:
    On 2025-11-03 1:47 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Contemplating having conditional branch instructions branch to a target
    value in a register instead of using a displacement.

    I think this has about the same code density as having a branch to a
    displacement from the IP.

    Should be possible.  A question is if you want to have a special
    register for that (like POWER's link register), tell the CPU
    what the target is (like VEC in My66000) or just use a general
    purpose register with a general-purpose instruction.

    Using a fused compare-and-branch instruction for Qupls4

    Is that the name of your architecture, or an instruction?  (That
    may have been mentioned upthread, in that case I don't remember).

    That was the name of the architecture, but I am being fickle and
    scrapping it, restarting from the Qupls2024 architecture and evolving it into Qupls2026.


    there is not
    enough room in the instruction for a large branch displacement (10
    bits). So, my thought is to branch to a register value instead.
    There is already an add-to-instruction-pointer instruction that can be
    used to generate relative addresses.

    That makes sense.

    Using 48-bit instructions now, so there is enough room for an 18-bit
    displacement. Still having branch to register as well.
    By moving the register load outside of a loop, the dynamic instruction
    count can be reduced. I think this solution is a bit better than having
    compare and branch as two separate instructions, or having an extended
    constant added to the branch instruction.

    Are you talking about a normal loop condition or a jump out of
    a loop?

    Any loop condition that needs a displacement constant. The constant
    being loaded into a register.

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    If you use a link register or a special instruction, the CPU could
    do that.

    The 10-bit displacement format could also be supported, but it is yet
    another branch instruction format. I may leave holes in the instruction
    set for future support, but I think it is best to start with just a
    single format.

    Code:
    AIPSI R3,1234    ; add displacement to IP and store in R3 (hoist-able)
    BLT R1,R2,R3     ; branch to R3 if R1 < R2

    Versus:
    CMP R3,R1,R2
    BLT R3,displacement


    I am now modifying Qupls2024 into Qupls2026 rather than starting a
    completely new ISA. The big difference is Qupls2024 uses 64-bit
    instructions and Qupls2026 uses 48-bit instructions making the code 25%
    more compact with no real loss of operations.

    Qupls2024 also used 8-bit register specs. This was a bit of overkill and
    not really needed. Register specs are reduced to 6-bits. Right away, that shaved eight bits off most instructions.

    I decided I liked the dual operations that some instructions supported,
    which need a wide instruction format.

    One gotcha is that 64-bit constant overrides need to be modified. For Qupls2024 a 64-bit constant override could be specified using only a
    single additional instruction word. This is not possible with 48-bit instruction words. Qupls2024 only allowed a single additional constant
    word. I may maintain this for Qupls2026, but that means that a max
    constant override of 48-bits would be supported. A 64-bit constant can
    still be built up in a register using the add-immediate with shift instruction. It is ugly and takes about three instructions.

    I could reduce the 64-bit constant build to two instructions by adding a load-immediate instruction.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Nov 5 07:13:46 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    I played around with the formulas from the POWER manual a bit,
    using Berkeley abc for logic optimization, for the conversion
    of the packed modulo 1000 to three BCD digits.

    Without spending too much effort, I arrived at four gate delays
    (INV -> OAI21 -> NAND2 -> NAND2) with a total of 37 gates optimizing
    for speed, or five gate delays optimizing for space.

    I strongly suspect that IBM is doing something similar :-)
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 5 01:38:30 2025
    From Newsgroup: comp.arch

    On 11/5/2025 1:00 AM, BGB wrote:
    On 11/4/2025 3:44 PM, Terje Mathisen wrote:
    MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Bottom line: History-based branch prediction has won, any kind of
    delayed branches (including split-branch designs) turn out to be
    a bad idea.

    Or "Never bet against branch prediction".

    I have probably mentioned this before, once or twice, but I'm actually
    quite proud of the meeting I had with Intel Santa Clara in the spring
    of 1995:

    I had (accidentally) written the first public mention of the FDIV bug
    (on comp.sys.intel) in Oct 1994, then together with Cleve Moler of
    MathWorks/MatLab fame led the effort to develop a minimum cost sw
    workaround for the bug. (My code became part of all/most x86 compiler
    runtimes for the next few years.)

    Due to this Intel invited me to receive an early engineering prototype
    of the PentiumPro, together with an NDA-covered briefing about its
    architecture.

    Before the start of that briefing I suggested that I should start off
    on the blackboard by showing what I had been able to figure out on my
    own, then I proceeded to pretty much exactly cover every single
    feature on the cpu, with one glaring exception:

    Based on the useful but not great branch predictor on the Pentium I
    told them that I expected the P6 to employ eager execution, i.e
    execute both ways of one or two layers of branches, discarding the
    non-taken paths as the branch direction info became available.

    That's the point when they got to brag about how having a much, much
    better branch predictor was better both from a performance and a power
    viewpoint, since out of order execution could predict much deeper than
    any eager execution would have the resources for.

    As you said: "Never bet against branch prediction".


    Branch prediction is fun.


    When I looked around online before, a lot of stuff about branch
    prediction was talking about fairly large and convoluted schemes for the branch predictors.

    But, then always at the end of it using 2-bit saturating counters:
      weakly taken, weakly not-taken, strongly taken, strongly not taken.

    But, in my fiddling, there was seemingly a simple but moderately
    effective strategy:
      Keep a local history of taken/not-taken;
      XOR this with the low-order-bits of PC for the table index;
      Use a 5/6-bit finite-state-machine or similar.
        Can model repeating patterns up to ~ 4 bits.

    Where, the idea was that the state-machine is updated with the current
    state and branch direction, giving the next state and next predicted
    branch direction (for this state).


    Could model slightly more complex patterns than the 2-bit saturating counters, but it is sort of a partial mystery why (for mainstream processors) more complex lookup schemes and 2 bit state, was preferable
    to a simpler lookup scheme and 5-bit state.

    Well, apart from the relative "dark arts" needed to cram 4-bit patterns
    into a 5 bit FSM (it is a bit easier if limiting the patterns to 3 bits).



    Then again, I had noted before that the LLMs are seemingly also not really able to figure out how to make a 5 bit FSM to model a full set of 4 bit patterns.



    Errm...

    I just decided to test it, and it appears Grok was able to figure it out
    (more or less).

    This is concerning: either the AIs are getting smart enough to deal with semi-difficult problems, or in fact it is not difficult and I was just
    dumb for thinking there is any difficulty in working out the state
    tables for the longer patterns.

    I tried before with DeepSeek R1 and similar, which had failed.



    Then again, I wouldn't expect it to be all that difficult of a problem
    for someone that is "actually smart"; so presumably chip designers could have done similar.

    Well, unless maybe the argument is that 5 or 6 bits of storage would
    cost more than 2 bits, but then presumably needing to have significantly larger tables (to compensate for the relative predictive weakness of 2-
    bit state) would have cost more than the cost of smaller tables of 6
    bit state ?...

    Say, for example, 2b:
     00_0 => 10_0  //Weakly not-taken, dir=0, goes strong not-taken
     00_1 => 01_0  //Weakly not-taken, dir=1, goes weakly taken
     01_0 => 00_1  //Weakly taken, dir=0, goes weakly not-taken
     01_1 => 11_1  //Weakly taken, dir=1, goes strongly taken
     10_0 => 10_0  //strongly not taken, dir=0
     10_1 => 00_0  //strongly not taken, dir=1 (goes weak)
     11_0 => 01_1  //strongly taken, dir=0
     11_1 => 11_1  //strongly taken, dir=1 (goes weak)

    Can expand it to 3-bits, for 2-bit patterns
      As above, and 4-more alternating states
      And slightly different transition logic.
    Say (abbreviated):
      000   weak, not taken
      001   weak, taken
      010   strong, not taken
      011   strong, taken
      100   weak, alternating, not-taken
      101   weak, alternating, taken
      110   strong, alternating, not-taken
      111   strong, alternating, taken
    The alternating states just flip-flopping between taken and not taken.
      The weak states can move between any of the 4.
      The strong states used if the pattern is reinforced.

    Going up to 3 bit patterns is more of the same (add another bit,
    doubling the number of states). Seemingly something goes nasty when
    getting to 4 bit patterns though (and can't fit both weak and strong
    states for longer patterns, so the 4b patterns effectively only exist as weak states which partly overlap with the weak states for the 3-bit patterns).

    But, yeah, not going to type out state tables for these ones.


    Not proven, but I suspect that an arbitrary 5 bit pattern within a 6 bit state might be impossible. Although there would be sufficient state-
    space for the looping 5-bit patterns, there may not be sufficient state-space to distinguish whether to move from a mismatched 4-bit pattern to
    a 3 or 5 bit pattern. Whereas, at least with 4-bit, any mismatch of the 4-bit pattern can always decay to a 3-bit pattern, etc. One needs to be
    able to express decay both to shorter patterns and to longer patterns,
    and I suspect at this point, the pattern breaks down (but can't easily confirm; it is either this or the pattern extends indefinitely, I don't know...).


    Could almost have this sort of thing as a "brain teaser" puzzle or something...

    Then again, maybe other people would not find any particular difficulty
    in these sorts of tasks.


    But, alas, sometimes I wonder if I am just kinda stupid and everyone
    else has already kinda figured this out, but doesn't say much...

    Like, just smart enough to do the things that I do, but not so much otherwise... In theory, I am kinda OK, but often it mostly seems like I
    mostly just suck at everything.



    Terje



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 5 02:01:35 2025
    From Newsgroup: comp.arch

    On 11/4/2025 11:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
    Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this feature?

    <https://riptutorial.com/fortran/example/11872/assigned-goto> says:
    |It can be avoided in modern code by using procedures, internal
    |procedures, procedure pointers and other features.

    I know no feature in Fortran or standard C which replaces my use of labels-as-values, the GNU C equivalent of the assigned goto. If you
    look at <https://www.complang.tuwien.ac.at/forth/threading/>, "direct"
    and "indirect" use labels-as-values, whereas "switch", "call" and
    "repl. switch" use standard C features (switch, indirect calls, and switch+goto respectively). "direct" and "indirect" usually outperform
    these others, sometimes by a lot.


    I usually used call threading, because:
      In my testing it was one of the faster options;
        At least if excluding 32-bit x86,
        which often has slow function calls,
        because pretty much every function needs a stack frame, ...
      It is usable in standard C.

    Often "while loop and switch()" was notably slower than using unrolled
    lists of indirect function calls (usually with the main dispatch loop
    based on "traces", which would call each of the opcode functions and
    then return the next trace to be run).

    Granted, "while loop and switch" is the more traditional way of writing
    an interpreter.
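    As a rough sketch (not BGBCC's actual code; all names here are
    hypothetical), the trace-call pattern looks something like:

      #include <stddef.h>

      struct vm;                                   /* interpreter state, opaque here */
      typedef void (*op_fn)(struct vm *);

      struct trace {
          op_fn         ops[8];                    /* unrolled list of opcode handlers */
          size_t        n_ops;
          struct trace *next;                      /* successor trace (real code may pick dynamically) */
      };

      static void op_load (struct vm *vm) { (void)vm; /* ... */ }
      static void op_add  (struct vm *vm) { (void)vm; /* ... */ }
      static void op_store(struct vm *vm) { (void)vm; /* ... */ }

      /* Run one trace: indirect-call each handler, then hand back the successor. */
      static struct trace *run_trace(struct vm *vm, struct trace *t)
      {
          for (size_t i = 0; i < t->n_ops; i++)
              t->ops[i](vm);
          return t->next;
      }

      /* The main dispatch loop just keeps calling whatever trace comes back. */
      static void run(struct vm *vm, struct trace *t)
      {
          while (t != NULL)
              t = run_trace(vm, t);
      }

    A trace for "load; add; store" would then just be
    { { op_load, op_add, op_store }, 3, &next_trace }.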


    I also find it amusing that the backbone of modern software is
    a static version of label variables -- we call them switch statements.

    I am not sure if it's "the" backbone. Fortran has (had?) a feature
    called "computed goto" that's closer to C's switch than "assigned
    goto". Ironically, the gcc people usually call their labels-as-values feature "computed goto" rather than "labels as values" or "assigned
    goto".

    But you can be sure COBOL got them from assembly language programmers.

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the 6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition. Ritchie wrote <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    But, if you use it, you are basically stuck with GCC...


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Nov 5 11:18:50 2025
    From Newsgroup: comp.arch

    On Tue, 4 Nov 2025 22:52:46 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    For the Intel binary mantissa dfp128, normalization is the hard issue; Michael S has figured out some really nice tricks to speed it up,

    I remember that I played with that, but don't remember what I did
    exactly. I dimly recollect that the fastest solution was relatively straight-forward. It was trying to minimize the length of dependency
    chains rather than total number of multiplications.
    An important point here is that I played on relatively old x86-64
    hardware. My solution is not necessarily optimal for newer hardware.
    The differences between old and new are two-fold and they push
    optimal solution into different directions.
    1. Increase in throughput of integer multiplier
    2. Decrease in latency of integer division

    The first factor suggests even more intense push toward "eager"
    solutions.

    The second factor suggests, possibly, much simpler code, especially in
    common case of division by 1 to 27 decimal digits (5**27 < 2**64).
    As they say, sometimes a division is just a division.

    but when you have a (worst case) temporary 220+ bit product mantissa, scaling is not that easy.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Nov 5 11:21:32 2025
    From Newsgroup: comp.arch

    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

    Are we talking about hardware or software?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 09:25:45 2025
    From Newsgroup: comp.arch

    On 2025-11-05 2:13 a.m., Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    I played around with the formulas from the POWER manual a bit,
    using Berkeley abc for logic optimization, for the conversion
    of the packed modulo 1000 to three BCD digits.

    Without spending too much effort, I arrived at four gate delays
    (INV -> OAI21 -> NAND2 -> NAND2) with a total of 37 gates optimizing
    for speed, or five gate delays optimizing for space.

    I strongly suspect that IBM is doing something similar :-)

    I like that IBM packing method.

    I have some RTL code to pack and unpack modulo 1000 to BCD. I think it
    is fast and small enough that it can be used inline at the input and
    output of DFP operations. The DFP values can then be passed around in
    the CPU as 128-bit values instead of the expanded BCD value.

    Only 128-bit DFP is supported on my machine under the assumption that
    one is wanting the extended decimal precision for engineering / finance. Otherwise, why would one use it? Better to use BFP.

    One headache I have not worked out yet is how to convert between DFP
    and BFP in a sensible fashion. I have tried a couple of means but the
    results are way off. Using log/exp type functions. I suppose I could
    rely on conversions to and from text strings.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Nov 5 15:27:48 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register),

    There is this idea of splitting an (indirect) branch into a
    prepare-to-branch instruction and a take-branch instruction. The

    I first heard about this 1982 from Burton Smith.

    prepare-to-branch instruction announces the branch target to the CPU,
    and Power's mtlr and mtctr are examples of that (somewhat muddled by
    the fact that the ctr register can also be used for counted loops as
    well as for indirect branches), and IA-64's branch-target registers
    and the instructions that move there are another example. AFAIK SPARC >>>> acquired something in this direction (touted as good for accelerating
    Java) in the early 2000s. The take-branch instruction on Power is
    blr/bctr.

    I used to think that this kind of splitting is a good idea, and it is
    certainly better than a branch-delay slot or a branch with a fixed
    number of delay slots.

    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
    Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Probably.

    I find it somewhat amusing that modern languages moved away from
    label variables and into method calls -- which if you look at it
    from 5,000 feet/metres -- is just a more expensive "label".

    I also find it amusing that the backbone of modern software is
    a static version of label variables -- we call them switch statements.

    But you can be sure COBOL got them from assembly language programmers.

    Back before caches and branch predictors, my fastest word count (wc)
    asm program employed runtime code generation, it started by filling in a
    64kB segment with code snippets aligned every 128 bytes: Even block
    counts were for scanning outside a word and the odd entries were used
    when a word start had been found, then each snippet would load the next
    byte into BH and jump to BX. (BL contained the outside/inside flag value
    as 0/128)

    Fast forward a few years and a branchless data state machine ran far
    faster, culminating at (a measured) 1.5 clock cycles/byte on a Pentium.
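    Just to show the shape of the idea (this is not Terje's code, only a
    toy version), a branchless two-state word counter in C:

      #include <stddef.h>

      static size_t count_words(const unsigned char *p, size_t n)
      {
          unsigned char is_word[256];
          size_t words = 0;
          unsigned in_word = 0;

          /* Crude character-class table: anything but whitespace starts/continues a word. */
          for (size_t i = 0; i < 256; i++)
              is_word[i] = !(i == ' ' || i == '\t' || i == '\n' || i == '\r');

          /* Inner loop has no conditional branches other than the loop itself. */
          for (size_t i = 0; i < n; i++) {
              unsigned w = is_word[p[i]];
              words  += w & ~in_word & 1;   /* count 0->1 transitions (word starts) */
              in_word = w;
          }
          return words;
      }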

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Nov 5 15:42:37 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Tue, 4 Nov 2025 22:52:46 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    For the Intel binary mantissa dfp128, normalization is the hard issue;
    Michael S has figured out some really nice tricks to speed it up,

    I remember that I played with that, but don't remember what I did
    exactly. I dimly recollect that the fastest solution was relatively straight-forward. It was trying to minimize the length of dependency
    chains rather than total number of multiplications.
    An important point here is that I played on relatively old x86-64
    hardware. My solution is not necessarily optimal for newer hardware.
    The differences between old and new are two-fold and they push
    optimal solution into different directions.
    1. Increase in throughput of integer multiplier
    2. Decrease in latency of integer division

    The first factor suggests even more intense push toward "eager"
    solutions.

    The second factor suggests, possibly, much simpler code, especially in
    common case of division by 1 to 27 decimal digits (5**27 < 2**64).
    As they say, sometimes a division is just a division.

    I suspect that a model using pre-calculated reciprocals which generate
    ~10+ approximate digits, back-multiply and subtract, repeat once or
    twice, could perform OK.

    Having full ~225 bit reciprocals in order to generate the exact result
    in a single iteration would require 256-bit storage for each of them and
    the 256x256->512 MUL would use 16 64x64->128 MULs, but here we do have
    the possibility to start from the top and as soon as you get the high
    end 128 bits of the mantissa fixed (modulo any propagating carries from
    lower down) you could inspect the preliminary result and see that it
    would usually be far enough away from a tipping point so that you could
    stop there.
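    For the 64-bit building block, the back-multiply-and-correct idea can be
    sketched as below (a sketch only: it assumes the GCC/Clang unsigned
    __int128 extension, d >= 2, and a precomputed r = floor(2^64 / d)):

      #include <stdint.h>

      static uint64_t div_by_recip(uint64_t n, uint64_t d, uint64_t r)
      {
          uint64_t q   = (uint64_t)(((unsigned __int128)n * r) >> 64); /* first guess */
          uint64_t rem = n - q * d;                                    /* back-multiply, subtract */
          while (rem >= d) {                                           /* q never overshoots; fix any shortfall */
              q++;
              rem -= d;
          }
          return q;                                                    /* rem is the remainder if needed */
      }

    With r = floor(2^64 / d) the first guess is short by at most one, so the
    correction loop runs at most once.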

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 09:56:12 2025
    From Newsgroup: comp.arch

    Qupls2026 currently supports 48-bit inline constants. I am debating
    whether to support 89 and 130-bit inline constants as well. Constant
    sizes increase in 41-bit steps due to the 48-bit instruction word size. The
    larger constants would require more instruction words to be available to
    be processed in decode. Not sure if it is even possible to pass a
    constant larger than 64-bits in the machine.

    I just realized that constant operand routing was already in Qupls, I
    had just not specifically identified it. The operand routing bits are
    just moved into a postfix instruction word rather than the first
    instruction word. This gives more bits available in the instruction
    word. Rather than burn a couple of bits in every R3 type instruction,
    another couple of opcodes are used to represent constant extensions.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Wed Nov 5 17:26:44 2025
    From Newsgroup: comp.arch

    On 2025-11-05 7:17, Anton Ertl wrote:

    [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the 6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition. Ritchie wrote <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and
    not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value" statement
    where the label-value refers to a label in a different function. Does
    gcc prevent that at compile time? If not, I would expect the semantics
    to be Undefined Behavior, the usual cop-out when nothing useful can be said.

    (In an earlier discussion on this group, some years ago, I explained how labels-as-values could be added to Ada, using the type system to ensure
    safe and defined semantics. But I don't think such an extension would be accepted for the Ada standard.)

    Niklas
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Nov 5 10:49:10 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it
    <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>
    That is the problem with deleted features - compiler writers have
    to support them forever, and interaction with other features can
    lead to problems.

    So does gfortran support assigned goto, too? What problems in
    interaction with other features do you see?

    - anton

    For a code analysis, an assigned goto, aka label variables,
    looks equivalent to:
    - make a list of all the target labels assigned to each label variable
    - at each "goto variable" substitute a switch statement with that list

    Where this might be a problem is if the label variable was a
    global symbol and the target labels were in other name spaces.
    At that point it could treat it like a pointer to a function and
    have to spill all live register variables to memory.
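    In C terms the substitution is roughly (illustrative names only):

      /* The label variable becomes an enum of the possible targets; each
         "goto variable" becomes a switch over that list of labels. */
      enum target { LAB10, LAB20 };

      static void demo(int flag)
      {
          enum target t = flag ? LAB10 : LAB20;   /* the "assign" side */

          switch (t) {                            /* the "goto variable" side */
          case LAB10: goto lab10;
          case LAB20: goto lab20;
          }
      lab10:
          /* ... */
          return;
      lab20:
          /* ... */
          return;
      }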


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 5 10:15:00 2025
    From Newsgroup: comp.arch

    On 11/5/2025 3:21 AM, Michael S wrote:
    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

    Are we talking about hardware or software?


    I had interpreted it as being about software with BCD helper ops.

    Otherwise, would probably go a different route.

    One other tradeoff is whether to go for Decimal128 in DPD or BID.

    Stuff online says BID is better for a software implementation, but I am
    having doubts. It is possible that DPD could make more sense in both
    cases, although, in the absence of BCD helpers, it likely makes sense
    to map DPD to linear 10-bit values.

    While BID could make sense, it has the drawback of requiring some way
    of quickly performing power-of-10 multiplies on large integer
    values. If you have a CPU where the fastest way to perform generic
    128-bit multiply is to break it down into 32 bit multiplies, and/or use shift-and-add, it is not a particularly attractive option.

    Contrast, working with 16-bit chunks holding 10 bit values is likely to
    work out being cheaper.
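    For reference, the declet-to-binary mapping needs only a 1024-entry
    table; below is a sketch of the standard DPD decode used to generate it
    (bit numbering b9..b0, most significant first; this is the textbook
    decode, not anyone's production code):

      #include <stdint.h>

      /* Decode one 10-bit DPD declet to a binary value 0..999. */
      static unsigned declet_to_bin(unsigned d)
      {
          unsigned b9 = (d >> 9) & 1, b8 = (d >> 8) & 1, b7 = (d >> 7) & 1;
          unsigned b6 = (d >> 6) & 1, b5 = (d >> 5) & 1, b4 = (d >> 4) & 1;
          unsigned b3 = (d >> 3) & 1, b2 = (d >> 2) & 1, b1 = (d >> 1) & 1;
          unsigned b0 = d & 1;
          unsigned hi, mid, lo;

          if (!b3) {                      /* all three digits 0..7 */
              hi = 4*b9 + 2*b8 + b7;  mid = 4*b6 + 2*b5 + b4;  lo = 4*b2 + 2*b1 + b0;
          } else if (!b2 && !b1) {        /* low digit is 8 or 9 */
              hi = 4*b9 + 2*b8 + b7;  mid = 4*b6 + 2*b5 + b4;  lo = 8 + b0;
          } else if (!b2 && b1) {         /* middle digit is 8 or 9 */
              hi = 4*b9 + 2*b8 + b7;  mid = 8 + b4;            lo = 4*b6 + 2*b5 + b0;
          } else if (b2 && !b1) {         /* high digit is 8 or 9 */
              hi = 8 + b7;            mid = 4*b6 + 2*b5 + b4;  lo = 4*b9 + 2*b8 + b0;
          } else if (!b6 && !b5) {        /* high and middle large */
              hi = 8 + b7;            mid = 8 + b4;            lo = 4*b9 + 2*b8 + b0;
          } else if (!b6 && b5) {         /* high and low large */
              hi = 8 + b7;            mid = 4*b9 + 2*b8 + b4;  lo = 8 + b0;
          } else if (b6 && !b5) {         /* middle and low large */
              hi = 4*b9 + 2*b8 + b7;  mid = 8 + b4;            lo = 8 + b0;
          } else {                        /* all three digits 8 or 9 */
              hi = 8 + b7;            mid = 8 + b4;            lo = 8 + b0;
          }
          return hi*100 + mid*10 + lo;
      }

      static uint16_t dpd_tab[1024];      /* fill once: dpd_tab[i] = declet_to_bin(i) */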

    Despite BID being more conceptually similar to Binary128, they differ in
    that Binary128 would only need to use large-integer multiply sparingly (namely, for multiply operations).



    Though, likely fastest option would be to map the DPD values to 30-bit
    linear values, then internally use the 30-bit linear values, and convert
    back to DPD at the end. Though, the performance of this is likely to
    depend on the operation.

    A non-standard variant, representing the value as packed 30 bit fields,
    could likely be the fastest option. Could use the same basic layout as
    the existing Decimal128 format.


    So, my guess for a performance ranking, fast to slow, being:
    1: Dense packed, 30b linear, 30+30+30+20+digit
    2: DPD
    3: BID


    As for whether or not to support Decimal128 (in either form), dunno.

    Closest I have to a use-case is that well, technically there is a
    _Decimal128 type in C, and it might make sense for it to be usable.

    But, then one needs to decide on which possible format to use here.
    And, whether to aim for performance or compatibility.


    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 5 10:23:16 2025
    From Newsgroup: comp.arch

    On 11/5/2025 9:26 AM, Niklas Holsti wrote:
    On 2025-11-05 7:17, Anton Ertl wrote:

       [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it?  C had it up to and including the 6th
    edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition.  Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and
    not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value" statement where the label-value refers to a label in a different function. Does
    gcc prevent that at compile time? If not, I would expect the semantics
    to be Undefined Behavior, the usual cop-out when nothing useful can be
    said.

    (In an earlier discussion on this group, some years ago, I explained how labels-as-values could be added to Ada, using the type system to ensure
    safe and defined semantics. But I don't think such an extension would be accepted for the Ada standard.)


    My guess here:
    It is an "oh crap" situation and program either immediately or (maybe
    not as immediately) explodes...

    Otherwise, it would need to function more like a longjmp, which would
    mean that it would likely be painfully slow.


    So, yeah, most likely UB, of a "particularly destructive" / "unlikely to
    be useful" kind.


    FWIW:
    This was not a feature that I feel inclined to support in BGBCC...


    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Nov 5 17:22:48 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    On 11/5/2025 9:26 AM, Niklas Holsti wrote:
    On 2025-11-05 7:17, Anton Ertl wrote:

    <computed goto>

    My guess here:
    It is an "oh crap" situation and program either immediately or (maybe
    not as immediately) explodes...

    Otherwise, it would need to function more like a longjmp, which would
    mean that it would likely be painfully slow.

    In my experience, longjmp is far faster than e.g. C++ exceptions.

    Granted, the code needs to be designed to allow longjmp without
    orphaning or leaking memory (i.e. in a context where there isn't any
    dynamic memory allocation) for the best speed.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Nov 5 18:03:31 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    That is the problem with deleted features - compiler writers have
    to support them forever, and interaction with other features can
    lead to problems.

    So does gfortran support assigned goto, too?

    Yes.

    What problems in
    interaction with other features do you see?

    In this case, it is more the problem of modern architectures.
    On 32-bit architectures, it might have been possible to stash
    the address of a jump target in an actual INTEGER variable and
    GO TO there. On a 64-bit architecture, this is not possible, so
    you need to have a shadow variable for the pointer, and possibly
    (if you want to catch GOTO when no variable has been assigned)
    a second variable.

    But it means extra work for compiler writers - additional effort, warnings,
    testing, ...
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Wed Nov 5 21:30:11 2025
    From Newsgroup: comp.arch

    On 2025-11-05 18:23, BGB wrote:
    On 11/5/2025 9:26 AM, Niklas Holsti wrote:
    On 2025-11-05 7:17, Anton Ertl wrote:

        [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it?  C had it up to and including the 6th
    edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition.  Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and
    not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value"
    statement where the label-value refers to a label in a different
    function. Does gcc prevent that at compile time? If not, I would
    expect the semantics to be Undefined Behavior, the usual cop-out when
    nothing useful can be said.

    (In an earlier discussion on this group, some years ago, I explained
    how labels-as-values could be added to Ada, using the type system to
    ensure safe and defined semantics. But I don't think such an extension
    would be accepted for the Ada standard.)


    My guess here:
    It is an "oh crap" situation and program either immediately or (maybe
    not as immediately) explodes...

    Or silently produces wrong results.

    Otherwise, it would need to function more like a longjmp, which would
    mean that it would likely be painfully slow.

    But then you could get the problem of a longjmp to a setjmp value that
    is stale because the targeted function invocation (stack frame) is no
    longer there.

    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 20:30:05 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-03 2:03 p.m., MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always do the
    op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).

    Likely, My 66000 also has RNO and
    Round Nearest Random is defined but not yet available
    Round Away from Zero is also defined and available.

    Round nearest random?

    Another unbiased rounding mode. Not yet available because I don't have
    a truly random source to guide the rounding.

    How about round externally guided (RXG) by an
    input signal?

    I guess that would be OK, but you could not make the statement that
    the rounding mode was unbiased.

    For instance, the rounding could come from a feedback
    filter of some sort.

    Sure, just you can't state "unbiased".
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 20:43:58 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 11/4/2025 3:44 PM, Terje Mathisen wrote:
    MitchAlsup wrote:
    ---------------

    As you said: "Never bet against branch prediction".


    Branch prediction is fun.


    When I looked around online before, a lot of stuff about branch
    prediction was talking about fairly large and convoluted schemes for the branch predictors.

    But, then always at the end of it using 2-bit saturating counters:
    weakly taken, weakly not-taken, strongly taken, strongly not taken.

    But, in my fiddling, there was seemingly a simple but moderately
    effective strategy:
    Keep a local history of taken/not-taken;
    XOR this with the low-order-bits of PC for the table index;
    Use a 5/6-bit finite-state-machine or similar.
    Can model repeating patterns up to ~ 4 bits.

    Where, the idea was that the state-machine is updated with the current
    state and branch direction, giving the next state and next predicted
    branch direction (for this state).


    Could model slightly more complex patterns than the 2-bit saturating counters, but it is sort of a partial mystery why (for mainstream processors) more complex lookup schemes and 2-bit state were preferable
    to a simpler lookup scheme and 5-bit state.

    In 1991 Mike Shebanow, Tse-Yu Yeh, and I tried out a Correlation predictor where strings of {T, !T}** were pattern matched to create a prediction.
    While it was somewhat competitive with Global History Table, it ultimately failed.

    I am now working on predictors for a 6-wide My 66000 machine--which is a bit different.
    a) VEC-LOOP loops do not alter the branch prediction tables.
    b) Predication clauses do not alter the BPTs.
    c) Jump Through Table is not predicted through jump-indirect table-like
    prediction; what is predicted is the value (switch variable), and this
    is used to index the table (early).
    d) CMOV gets rid of another 8%

    These strip out about 40% of branches from needing prediction, causing
    the remaining branches to be harder to predict but having less total
    latency in execution.

    -----------------
    Not proven, but I suspect that an arbitrary 5 bit pattern within a 6 bit state might be impossible. Although there would be sufficient
    state-space for the looping 5-bit patterns, there may not be sufficient state-space to distinguish whether to move from a mismatched 4-bit
    pattern to a 3 or 5 bit pattern. Whereas, at least with 4-bit, any
    mismatch of the 4-bit pattern can always decay to a 3-bit pattern, etc.
    One needs to be able to express decay both to shorter patterns and to
    longer patterns, and I suspect at this point, the pattern breaks down
    (but can't easily confirm; it is either this or the pattern extends indefinitely, I don't know...).

    Tried some of these (1991) mostly with little to no success.
    Be my guest and try again.
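
    A minimal C sketch of the simple scheme described above (outcome history
    XORed with low-order PC bits to index a small table), using a plain
    2-bit saturating counter per entry; the 5/6-bit FSM state would slot
    into the same table. A single global history register is used for
    brevity, where a per-branch local history would instead come from a
    small history table indexed by PC. Names are illustrative only.

    #include <stdint.h>
    #include <stdbool.h>

    #define TABLE_BITS 12
    #define TABLE_SIZE (1u << TABLE_BITS)
    #define IDX_MASK   (TABLE_SIZE - 1)

    /* Per-entry state: 0,1 = predict not-taken; 2,3 = predict taken. */
    static uint8_t  ctr[TABLE_SIZE];
    static uint32_t history;               /* shift register of recent outcomes */

    static uint32_t pred_index(uint32_t pc)
    {
        /* XOR the outcome history with low-order PC bits. */
        return ((pc >> 2) ^ history) & IDX_MASK;
    }

    bool predict(uint32_t pc)
    {
        return ctr[pred_index(pc)] >= 2;
    }

    void train(uint32_t pc, bool taken)
    {
        uint8_t *c = &ctr[pred_index(pc)];
        if (taken) { if (*c < 3) (*c)++; }
        else       { if (*c > 0) (*c)--; }
        history = ((history << 1) | (taken ? 1u : 0u)) & IDX_MASK;
    }
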
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 20:52:22 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-05 1:47 a.m., Robert Finch wrote:
    -----------
    I am now modifying Qupls2024 into Qupls2026 rather than starting a completely new ISA. The big difference is Qupls2024 uses 64-bit
    instructions and Qupls2026 uses 48-bit instructions making the code 25%
    more compact with no real loss of operations.

    Qupls2024 also used 8-bit register specs. This was a bit of overkill and
    not really needed. Register specs are reduced to 6 bits. Right away, that reduced most instructions by eight bits.

    4 register specifiers: check.

    I decided I liked the dual operations that some instructions supported, which need a wide instruction format.

    With 48-bits, if you can get 2 instructions 50% of the time, you are only
    12% bigger than a 32-bit ISA.

    One gotcha is that 64-bit constant overrides need to be modified. For Qupls2024 a 64-bit constant override could be specified using only a
    single additional instruction word. This is not possible with 48-bit instruction words. Qupls2024 only allowed a single additional constant
    word. I may maintain this for Qupls2026, but that means that a max
    constant override of 48-bits would be supported. A 64-bit constant can
    still be built up in a register using the add-immediate with shift instruction. It is ugly and takes about three instructions.

    It was that sticking problem of constants that drove most of My 66000
    ISA style--variable length and how to encode access to these constants
    and routing thereof.

    Motto: never execute any instructions fetching or building constants.

    I could reduce the 64-bit constant build to two instructions by adding a load-immediate instruction.

    May I humbly suggest this is the wrong direction.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 20:53:59 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    I played around with the formulas from the POWER manual a bit,
    using Berkeley abc for logic optimization, for the conversion
    of the packed modulo 1000 to three BCD digits.

    Without spending too much effort, I arrived at four gate delays
    (INV -> OAI21 -> NAND2 -> NAND2) with a total of 37 gates optimizing
    for speed, or five gate delays optimizing for space.

    Since the gates hang off flip-flops, you don't need the inv gate
    at the front. Flip-flops can easily give both true and complement
    outputs.

    I strongly suspect that IBM is doing something similar :-)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 21:04:57 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 11/4/2025 11:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this feature?

    <https://riptutorial.com/fortran/example/11872/assigned-goto> says:
    |It can be avoided in modern code by using procedures, internal
    |procedures, procedure pointers and other features.

    I know no feature in Fortran or standard C which replaces my use of labels-as-values, the GNU C equivalent of the assigned goto. If you
    look at <https://www.complang.tuwien.ac.at/forth/threading/>, "direct"
    and "indirect" use labels-as-values, whereas "switch", "call" and
    "repl. switch" use standard C features (switch, indirect calls, and switch+goto respectively). "direct" and "indirect" usually outperform these others, sometimes by a lot.


    I usually used call threading, because:
    In my testing it was one of the faster options;
    At least if excluding 32-bit x86,
    which often has slow function calls.
    Because pretty much every function needs a stack frame, ...
    It is usable in standard C.

    I have converged on call-threading as a way to eliminate "if-statements":
    -----------------------
    extern uint64_t operation( uint64_t src1, uint64_t src2, uint8_t size );
    // uadd, sadd, umul, ..., or, xor, and share operation()'s signature (defined elsewhere)

    static uint64_t (*int2optab[32])( uint64_t src1, uint64_t src2, uint8_t size ) =
    {   // integer 2-operand decoding table
        /* 00 */ operation,
        /* 01 */ operation,
        /* 02 */ uadd,
        /* 03 */ sadd,
        /* 04 */ umul,
        /* 05 */ smul,
        /* 06 */ udiv,
        /* 07 */ sdiv,
        /* 10 */ cmp,
        /* 11 */ operation,
        /* 12 */ operation,
        /* 13 */ operation,
        /* 14 */ umax,
        /* 15 */ smax,
        /* 16 */ umin,
        /* 17 */ smin,
        /* 20 */ or,
        /* 21 */ operation,
        /* 22 */ xor,
        /* 23 */ operation,
        /* 24 */ and,
        /* 25 */ operation,
        /* 26 */ operation,
        /* 27 */ operation,
        /* 30 */ operation,
        /* 31 */ operation,
        /* 32 */ operation,
        /* 33 */ operation,
        /* 34 */ operation,
        /* 35 */ operation,
        /* 36 */ operation,
        /* 37 */ operation
    };

    /*
     * Integer 2-Operand Table Caller -- 16-bit immediate form
     */
    bool intimm16( coreStack *cpu, Context *c, Major I )
    {
        uint8_t  or = I.or;                    // routing field (unused in this form)
        uint64_t src1 = c->ctx.reg[ I.src1 ],
                 src2 = c->ctx.reg[ I.src2 ],
                 *dst = &c->ctx.reg[ I.dst ];
        *dst = int2optab[ (I.major&15)<<1 ]( src1, src2, 0 );
        return true;
    }

    /*
     * Integer 2-Operand Table Caller -- register/routed form
     */
    bool int2op( coreStack *cpu, Context *c, OpRoute I )
    {
        uint8_t  or = I.or,                    // operand-routing field
                 s  = I.size;
        uint64_t *src1 = &c->ctx.reg[ I.src1 ],
                 *src2 = &c->ctx.reg[ I.src2 ],
                 *dst  = &c->ctx.reg[ I.dst ];
        iorTable[ or ]( *c, I, src1, src2 );   // operand routing / constant insertion, defined elsewhere
        *dst = int2optab[ I.minor ]( *src1, *src2, s );
        return true;
    }
    -----------------------

    One does not have to check for unimplemented instructions, just place
    a call to the operation() subroutine where they are not defined. The operation() subroutine raises an exception which is caught at the
    next instruction fetch.

    I show both 16-bit immediates and general 2-Operand instructions use
    the same table (with a trifling of bit twiddling).

    Often "while loop and switch()" was notably slower than using unrolled
    lists of indirect function calls (usually with the main dispatch loop
    based on "traces", which would call each of the opcode functions and
    then return the next trace to be run).

    Table-calls are faster than many switches unless you can demonstrate
    the switch is dense and there are no missing cases.

    Granted, "while loop and switch" is the more traditional way of writing
    an interpreter.

    Just not a fast one...
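
    For reference, the trace-style dispatch described a few lines up (unrolled
    lists of indirect calls, each trace returning the next trace to run) might
    look roughly like this; a sketch with types and names of my choosing, not
    anyone's actual interpreter:

    #include <stddef.h>

    typedef struct Vm    Vm;        /* interpreter state, defined elsewhere */
    typedef struct Trace Trace;
    typedef void (*OpFn)(Vm *vm);

    struct Trace {
        size_t  n;                  /* number of ops in this trace          */
        OpFn    ops[16];            /* unrolled handlers, one per VM op     */
        Trace *(*next)(Vm *vm);     /* returns the successor trace to run   */
    };

    static void run(Vm *vm, Trace *t)
    {
        while (t) {
            for (size_t i = 0; i < t->n; i++)
                t->ops[i](vm);      /* indirect call per opcode             */
            t = t->next(vm);        /* the trace decides what runs next     */
        }
    }
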
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 21:06:16 2025
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

    Are we talking about hardware or software?

    A SW solution based on how it would be done in HW.
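
    In C, the two-table scheme above comes out as a pair of lookups. A
    minimal sketch, with the tables assumed precomputed from the IEEE
    754-2008 DPD mapping (table and function names are mine):

    #include <stdint.h>

    extern const uint16_t dpd_to_bcd[1024];   /* 10-bit declet -> 3 BCD digits (12 bits) */
    extern const uint16_t bcd_to_dpd[4096];   /* 3 BCD digits (12 bits) -> 10-bit declet */

    /* Expand the declet sitting at bit position 'pos' of a significand word. */
    static inline uint16_t declet_to_bcd(uint64_t sig, unsigned pos)
    {
        return dpd_to_bcd[(sig >> pos) & 0x3FF];
    }

    /* Pack three BCD digits back into a declet. */
    static inline uint16_t bcd_to_declet(uint16_t bcd3)
    {
        return bcd_to_dpd[bcd3 & 0xFFF];
    }
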
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 21:21:34 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Qupls2026 currently supports 48-bit inline constants. I am debating
    whether to support 89 and 130-bit inline constants as well. Constant
    sizes increase by 41-bits due to the 48-bit instruction word size. The larger constants would require more instruction words to be available to
    be processed in decode. Not sure if it is even possible to pass a
    constant larger than 64-bits in the machine.

    I just realized that constant operand routing was already in Qupls, I
    had just not specifically identified it. The operand routing bits are
    just moved into a postfix instruction word rather than the first
    instruction word. This gives more bits available in the instruction
    word. Rather than burn a couple of bits in every R3 type instruction, another couple of opcodes are used to represent constant extensions.

    My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
    that can supply constants and perform operand routing. Within this
    range, instruction<8:5> specifies the following table:

    0 0 0 0 +Src1 +Src2
    0 0 0 1 +Src1 -Src2
    0 0 1 0 -Src1 +Src2
    0 0 1 1 -Src1 -Src2
    0 1 0 0 +Src1 +imm5
    0 1 0 1 +Imm5 +Src2
    0 1 1 0 -Src1 -Imm5
    0 1 1 1 +Imm5 -Src2
    1 0 0 0 +Src1 Imm32
    1 0 0 1 Imm32 +Src2
    1 0 1 0 -Src1 Imm32
    1 0 1 1 Imm32 -Src2
    1 1 0 0 +Src1 Imm64
    1 1 0 1 Imm64 +Src2
    1 1 1 0 -Src1 Imm64
    1 1 1 1 Imm64 -Src2

    Here we have access to {5, 32, 64}-bit constants, 16-bit constants
    come from different OpCodes.

    Imm5 are the register specifier bits: range {-31..31} for integer and
    logical, range {-15.5..15.5} for floating point.
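
    A sketch of how a decoder might consume that table in C; the 16 rows are
    transcribed from the listing above, while the type and field names are
    mine, not My 66000's:

    #include <stdint.h>

    typedef enum { SRC1, SRC2, IMM5, IMM32, IMM64 } OpndKind;

    typedef struct {
        OpndKind op1; int neg1;     /* operand-1 source, 1 = negated */
        OpndKind op2; int neg2;     /* operand-2 source, 1 = negated */
    } Route;

    static const Route routing[16] = {
        { SRC1,0,  SRC2,0  }, { SRC1,0,  SRC2,1  },   /* 0000 0001 */
        { SRC1,1,  SRC2,0  }, { SRC1,1,  SRC2,1  },   /* 0010 0011 */
        { SRC1,0,  IMM5,0  }, { IMM5,0,  SRC2,0  },   /* 0100 0101 */
        { SRC1,1,  IMM5,1  }, { IMM5,0,  SRC2,1  },   /* 0110 0111 */
        { SRC1,0,  IMM32,0 }, { IMM32,0, SRC2,0  },   /* 1000 1001 */
        { SRC1,1,  IMM32,0 }, { IMM32,0, SRC2,1  },   /* 1010 1011 */
        { SRC1,0,  IMM64,0 }, { IMM64,0, SRC2,0  },   /* 1100 1101 */
        { SRC1,1,  IMM64,0 }, { IMM64,0, SRC2,1  },   /* 1110 1111 */
    };

    static Route decode_routing(uint32_t inst)
    {
        return routing[(inst >> 5) & 0xF];            /* instruction<8:5> */
    }
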
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 21:24:07 2025
    From Newsgroup: comp.arch


    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-05 7:17, Anton Ertl wrote:

    [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the 6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition. Ritchie wrote <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and
    not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value" statement where the label-value refers to a label in a different function. Does
    gcc prevent that at compile time?

    This is where the call-table approach works better--the scope is well
    defined.

    If not, I would expect the semantics
    to be Undefined Behavior, the usual cop-out when nothing useful can be said.

    (In an earlier discussion on this group, some years ago, I explained how labels-as-values could be added to Ada, using the type system to ensure
    safe and defined semantics. But I don't think such an extension would be accepted for the Ada standard.)

    Niklas
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 21:28:16 2025
    From Newsgroup: comp.arch


    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-05 18:23, BGB wrote:
    On 11/5/2025 9:26 AM, Niklas Holsti wrote:
    On 2025-11-05 7:17, Anton Ertl wrote:

        [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it?  C had it up to and including the 6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition.  Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and
    not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value"
    statement where the label-value refers to a label in a different
    function. Does gcc prevent that at compile time? If not, I would
    expect the semantics to be Undefined Behavior, the usual cop-out when
    nothing useful can be said.

    (In an earlier discussion on this group, some years ago, I explained
    how labels-as-values could be added to Ada, using the type system to
    ensure safe and defined semantics. But I don't think such an extension
    would be accepted for the Ada standard.)


    My guess here:
    It is an "oh crap" situation and program either immediately or (maybe
    not as immediately) explodes...

    Or silently produces wrong results.

    Otherwise, it would need to function more like a longjmp, which would
    mean that it would likely be painfully slow.

    But then you could get the problem of a longjmp to a setjmp value that
    is stale because the targeted function invocation (stack frame) is no
    longer there.

    But YOU had to pass the jumpbuf out of the setjump() scope.

    Now, YOU complain there is a hole in your own foot with a smoking gun
    in your own hand.

    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Thu Nov 6 00:45:19 2025
    From Newsgroup: comp.arch

    On 2025-11-05 23:28, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-05 18:23, BGB wrote:
    On 11/5/2025 9:26 AM, Niklas Holsti wrote:
    On 2025-11-05 7:17, Anton Ertl wrote:

        [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it?  C had it up to and including the 6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition.  Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value"
    statement where the label-value refers to a label in a different
    function. Does gcc prevent that at compile time? If not, I would
    expect the semantics to be Undefined Behavior, the usual cop-out when
    nothing useful can be said.

    (In an earlier discussion on this group, some years ago, I explained
    how labels-as-values could be added to Ada, using the type system to
    ensure safe and defined semantics. But I don't think such an extension would be accepted for the Ada standard.)


    My guess here:
    It is an "oh crap" situation and program either immediately or (maybe
    not as immediately) explodes...

    Or silently produces wrong results.

    Otherwise, it would need to function more like a longjmp, which would
    mean that it would likely be painfully slow.

    But then you could get the problem of a longjmp to a setjmp value that
    is stale because the targeted function invocation (stack frame) is no
    longer there.

    But YOU had to pass the jumpbuf out of the setjump() scope.

    Now, YOU complain there is a hole in your own foot with a smoking gun
    in your own hand.

    That is not the issue. The question is if the semantics of "goto label-valued-variable" are hard to define, as Ritchie said, or not, as
    Anton thinks Stallman said or would have said.

    The discussion above shows that whether a label value is implemented as
    a bare code address, or as a jumpbuf, some cases will have Undefined
    Behavior semantics. So I think Ritchie was right, unless the undefined
    cases can be excluded at compile time.

    The undefined cases could be excluded at compile-time, even in C, by
    requiring all label-valued variables to be local to some function and forbidding passing such values as parameters or function results. In
    addition, the use of an uninitialized label-valued variable should be prevented or detected. Perhaps Anton could accept such restrictions.

    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 20:41:18 2025
    From Newsgroup: comp.arch

    On 2025-11-05 3:52 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-05 1:47 a.m., Robert Finch wrote:
    -----------
    I am now modifying Qupls2024 into Qupls2026 rather than starting a
    completely new ISA. The big difference is Qupls2024 uses 64-bit
    instructions and Qupls2026 uses 48-bit instructions making the code 25%
    more compact with no real loss of operations.

    Qupls2024 also used 8-bit register specs. This was a bit of overkill and
    not really needed. Register specs are reduced to 6-bits. Right-away that
    reduced most instructions eight bits.

    4 register specifiers: check.

    I decided I liked the dual operations that some instructions supported,
    which need a wide instruction format.

    With 48-bits, if you can get 2 instructions 50% of the time, you are only
    12% bigger than a 32-bit ISA.

    One gotcha is that 64-bit constant overrides need to be modified. For
    Qupls2024 a 64-bit constant override could be specified using only a
    single additional instruction word. This is not possible with 48-bit
    instruction words. Qupls2024 only allowed a single additional constant
    word. I may maintain this for Qupls2026, but that means that a max
    constant override of 48-bits would be supported. A 64-bit constant can
    still be built up in a register using the add-immediate with shift
    instruction. It is ugly and takes about three instructions.

    It was that sticking problem of constants that drove most of My 66000
    ISA style--variable length and how to encode access to these constants
    and routing thereof.

    Motto: never execute any instructions fetching or building constants.

    I could reduce the 64-bit constant build to two instructions by adding a
    load-immediate instruction.

    May I humbly suggest this is the wrong direction.

    agree.

    Taking heed of the motto, I have
    scrapped a bunch of shifted immediate instructions and load immediate.
    These were present as an alternate means to work with large constants.
    They were really redundant with the ability to specify constant
    overrides (routing) for registers, and they would increase the dynamic instruction count (bad!). Scrapping the extra instructions will also make writing a compiler simpler.

    One instruction scrapped was an add to IP. So, another means of forming relative addresses was required. I am sacrificing a register code (code 32)
    to represent the instruction pointer. This will allow the easy formation
    of IP-relative addresses.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 21:49:19 2025
    From Newsgroup: comp.arch

    On 2025-11-05 4:21 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Qupls2026 currently supports 48-bit inline constants. I am debating
    whether to support 89 and 130-bit inline constants as well. Constant
    sizes increase by 41-bits due to the 48-bit instruction word size. The
    larger constants would require more instruction words to be available to
    be processed in decode. Not sure if it is even possible to pass a
    constant larger than 64-bits in the machine.

    I just realized that constant operand routing was already in Qupls, I
    had just not specifically identified it. The operand routing bits are
    just moved into a postfix instruction word rather than the first
    instruction word. This gives more bits available in the instruction
    word. Rather than burn a couple of bits in every R3 type instruction,
    another couple of opcodes are used to represent constant extensions.

    My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
    that can supply constants and perform operand routing. Within this
    range; instruction<8:5> specify the following table:

    0 0 0 0 +Src1 +Src2
    0 0 0 1 +Src1 -Src2
    0 0 1 0 -Src1 +Src2
    0 0 1 1 -Src1 -Src2
    0 1 0 0 +Src1 +imm5
    0 1 0 1 +Imm5 +Src2
    0 1 1 0 -Src1 -Imm5
    0 1 1 1 +Imm5 -Src2
    1 0 0 0 +Src1 Imm32
    1 0 0 1 Imm32 +Src2
    1 0 1 0 -Src1 Imm32
    1 0 1 1 Imm32 -Src2
    1 1 0 0 +Src1 Imm64
    1 1 0 1 Imm64 +Src2
    1 1 1 0 -Src1 Imm64
    1 1 1 1 Imm64 -Src2

    What happens if one tries to use an unsupported combination?

    Here we have access to {5, 32, 64}-bit constants, 16-bit constants
    come from different OpCodes.

    Imm5 are the register specifier bits: range {-31..31} for integer and logical, range {-15.5..15.5} for floating point.

    I just realized that Qupls2026 does not accommodate small constants very
    well except for a few instructions like shift and bitfield instructions
    which have special formats. Sure, constants can be made to override
    register specs, but they take up a whole additional word. I am not sure
    how big a deal this is as there are also immediate forms of instructions
    with the constant encoded in the instruction, but these do not allow
    operand routing. There is a dedicated subtract from immediate
    instruction. A lot of other instructions are commutative, so operand
    routing is not needed.

    Qupls has potentially 25, 48, 89 and 130-bit constants. 7-bit constants
    are available for shifts and bitfield ops. Leaving the 130-bit constants
    out for now. They may be useful for 128-bit SIMD against constant operands.

    The constant routing issue could maybe be fixed as there are 30+ free
    opcodes still. But there needs to be more routing bits with three source operands. All the permutations may get complicated to encode and allow
    for in the compiler. May want to permute two registers and a constant,
    or two constants and a register, and then three or four different sizes.

    Qupls strives to be the low-cost processor.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Nov 5 19:20:57 2025
    From Newsgroup: comp.arch

    On 11/5/2025 1:21 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Qupls2026 currently supports 48-bit inline constants. I am debating
    whether to support 89 and 130-bit inline constants as well. Constant
    sizes increase by 41-bits due to the 48-bit instruction word size. The
    larger constants would require more instruction words to be available to
    be processed in decode. Not sure if it is even possible to pass a
    constant larger than 64-bits in the machine.

    I just realized that constant operand routing was already in Qupls, I
    had just not specifically identified it. The operand routing bits are
    just moved into a postfix instruction word rather than the first
    instruction word. This gives more bits available in the instruction
    word. Rather than burn a couple of bits in every R3 type instruction,
    another couple of opcodes are used to represent constant extensions.

    My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
    that can supply constants and perform operand routing. Within this
    range; instruction<8:5> specify the following table:

    0 0 0 0 +Src1 +Src2
    0 0 0 1 +Src1 -Src2
    0 0 1 0 -Src1 +Src2
    0 0 1 1 -Src1 -Src2
    0 1 0 0 +Src1 +imm5
    0 1 0 1 +Imm5 +Src2
    0 1 1 0 -Src1 -Imm5
    0 1 1 1 +Imm5 -Src2
    1 0 0 0 +Src1 Imm32
    1 0 0 1 Imm32 +Src2
    1 0 1 0 -Src1 Imm32
    1 0 1 1 Imm32 -Src2
    1 1 0 0 +Src1 Imm64
    1 1 0 1 Imm64 +Src2
    1 1 1 0 -Src1 Imm64
    1 1 1 1 Imm64 -Src2

    Here we have access to {5, 32, 64}-bit constants, 16-bit constants
    come from different OpCodes.

    Imm5 are the register specifier bits: range {-31..31} for integer and logical, range {-15.5..15.5} for floating point.

    Some time ago, we discussed using the 5 bit immediates in floating point instructions as an index to an internal ROM with frequently used
    constants. The idea is that it would save some space in the instruction stream. Are you implementing that, and if not, why not?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Nov 6 11:24:24 2025
    From Newsgroup: comp.arch

    On Wed, 05 Nov 2025 21:06:16 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

    Are talking about hardware or software?

    A SW solution based on how it would be done in HW.

    Then, I suspect that you didn't understand the objection of Thomas Koenig.

    1. Format of interest is Decimal128. https://en.wikipedia.org/wiki/Decimal128_floating-point_format

    2. According to my understanding, Thomas didn't suggest that *slow*
    software implementation of DPD-encoded DFP, i.e. implementation that
    only cares about correctness, is hard.

    3. OTOH, he seems to suspect, and I agree with him, that a *non-slow*
    software implementation, one comparable in speed (say, within a
    factor of 1.5-2) to a competent implementation of the same DFP operations
    in BID format, is not easy. If at all possible.

    4. All said above assumes an absence of HW assists.



    BTW, at least for multiplication, I would probably not do my
    arithmetic in the BCD domain.
    Instead, I'd convert 10+ DPD digits to two Base_1e18 digits (11 look
    ups per operand, 22 total look ups + ~40 shifts + ~20 ANDs + ~20
    additions).

    Then I'd do multiplication and normalization and rounding in Base_1e18.

    Then I'd convert from Base_1e18 to Base_1000. The ideas of such
    conversion are similar to fast binary-to-BCD conversion that I
    demonstrated here a decade or so ago. AVX2 could be quite helpful at that
    stage.

    Then I'd have to convert the result from Base_1000 to DPD. Here, again,
    11 table look-ups + plenty of ANDs/shift/ORs seem inevitable.
    Maybe, at that stage, SIMD gather can be of help, but I have my doubts.
    So far, every time I tried gather I was disappointed with performance.

    Overall, even with a seemingly decent plan like the one sketched above, I'd expect
    DPD multiplication to be 2.5x to 3x slower than BID. But, then again,
    in the past my early performance estimates were wrong quite often.
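
    A sketch of the DPD-to-Base_1e18 step in C, assuming the 11 declets have
    already been extracted from the Decimal128 encoding and that a
    declet-to-binary table (name mine) has been precomputed; this is the 11
    look-ups per operand mentioned above:

    #include <stdint.h>

    extern const uint16_t dpd_to_bin[1024];  /* 10-bit declet -> 0..999 */

    /* 34 significand digits -> two base-1e18 limbs: hi gets the leading
       digit plus the first 5 declets (16 digits), lo the last 6 declets
       (18 digits), so value = hi * 10^18 + lo. */
    void dpd_sig_to_limbs(unsigned msd, const uint16_t declets[11],
                          uint64_t *hi, uint64_t *lo)
    {
        uint64_t h = msd, l = 0;
        for (int i = 0; i < 5; i++)
            h = h * 1000 + dpd_to_bin[declets[i]];
        for (int i = 5; i < 11; i++)
            l = l * 1000 + dpd_to_bin[declets[i]];
        *hi = h;
        *lo = l;
    }

    The multiplication, normalization and rounding would then work on the
    (hi, lo) pairs, with the reverse conversions producing declets again at
    the end.
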






    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 6 08:46:40 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/4/2025 9:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it
    <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this
    feature?

    Because it could, and often did, make the code "unfollowable". That is,
    you are reading the code, following it to try to figure out what it is
    doing and come to an assigned/alter goto, and you don't know where to go next. The value was set some place else in the code, who knows where,
    and thus what value it was set to, and people/programmers just aren't
    used to being able to follow code like that.

    Take an example use: A VM interpreter. With labels-as-values it looks
    like this:

    void engine(char *source)
    {
      void *insts[] = {&&add, &&load, &&store, ...};

      void **ip=compile_to_vm_code(source,insts);

      goto *ip++;

     add:
      ...
      goto *ip++;
     load:
      ...
      goto *ip++;
     store:
      ...
      goto *ip++;
      ...
    }

    So of course you don't know where one of the gotos goes to, because
    that depends on the VM code, which depends on the source code.

    Now let's see how it looks with switch:

    void engine(char *source)
    {
      typedef enum {add, load, store,...} inst;
      inst *ip=compile_to_vm_code(source);  /* VM code is an array of inst values here */

      for (;;) {
        switch (*ip++) {
        case add:
          ...
          break;
        case load:
          ...
          break;
        case store:
          ...
          break;
        ...
        }
      }
    }

    Do you know any better which of the "..." is executed next? Of course
    not, for the same reason. Likewise for call threading, but there the
    VM instruction implementations can be distributed across many source
    files. With the replicated switch, the problem of predictability is
    the same, but there is lots of extra code, with many direct gotos.

    If you implement, say, a state machine using labels-as-values, or
    switch, again, the logic behind it is the same and the predictability
    is the same between the two implementations.

    BTW, you mentioned that it could be implemented as an indirect jump. It could for those architectures that supported that feature, but it could
    also be implemented by having the Alter/Assign modify the code (i.e.
    change the address in the jump/branch instruction), and self modifying
    code is just bad.

    On such architectures switch would also be implemented by modifying
    the code, and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone architectures using self-modifying code are bad by association, then
    we have to get rid of all of these language features ASAP.

    One interesting aspect here is that the Fortran assigned goto and GNU
    C's goto * (to go with labels-as-values) look more like something that
    may have been inspired by a modern indirect branch than by
    self-modifying code. I only dimly remember the Cobol thing, but IIRC
    this looked more like something that's intended to be implemented by self-modifying code. I don't know what the PL/I solution looked like.

    As did COBOL (called "go to ... depending on"), but those features didn't suffer
    the problems of assigned/alter gotos.

    As demonstrated above, they do. And if you fall back to using ifs, it
    does not get any better, either.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Nov 6 11:43:57 2025
    From Newsgroup: comp.arch

    On Wed, 5 Nov 2025 17:26:44 +0200
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2025-11-05 7:17, Anton Ertl wrote:

    [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the
    6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away between 6th and 7th edition. Ritchie wrote <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics
    and not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value"
    statement where the label-value refers to a label in a different
    function. Does gcc prevent that at compile time? If not, I would
    expect the semantics to be Undefined Behavior, the usual cop-out when
    nothing useful can be said.

    Yes, UB sounds like the best answer. Inter-procedural assigned goto is
    no different from an out-of-bounds array access or from an attempt to use a
    pointer to a local variable when the block/function that originally
    declared the variable is no longer active.
    But the compiler should try to detect as many cases of such misuse as it can.


    (In an earlier discussion on this group, some years ago, I explained
    how labels-as-values could be added to Ada, using the type system to
    ensure safe and defined semantics. But I don't think such an
    extension would be accepted for the Ada standard.)

    Niklas


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Thu Nov 6 12:11:54 2025
    From Newsgroup: comp.arch

    On 2025-11-06 11:43, Michael S wrote:
    On Wed, 5 Nov 2025 17:26:44 +0200
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2025-11-05 7:17, Anton Ertl wrote:

    [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the
    6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition. Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics
    and not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value"
    statement where the label-value refers to a label in a different
    function. Does gcc prevent that at compile time? If not, I would
    expect the semantics to be Undefined Behavior, the usual cop-out when
    nothing useful can be said.

    Yes, UB sounds like the best answer.

    The point is that Ritchie was not satisfied with that answer, which is
    why he removed labels-as-values from his version of C. I doubt that
    Stallman had any better answer for gcc, but he did not care.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Thu Nov 6 12:37:16 2025
    From Newsgroup: comp.arch

    On 2025-11-06 10:46, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/4/2025 9:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it
    <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this
    feature?

    Because it could, and often did, make the code "unfollowable". That is,
    you are reading the code, following it to try to figure out what it is
    doing and come to an assigned/alter goto, and you don't know where to go
    next. The value was set some place else in the code, who knows where,
    and thus what value it was set to, and people/programmers just aren't
    used to being able to follow code like that.

    Take an example use: A VM interpreter. With labels-as-values it looks
    like this:

    void engine(char *source)
    {
      void *insts[] = {&&add, &&load, &&store, ...};

      void **ip=compile_to_vm_code(source,insts);

      goto *ip++;

     add:
      ...
      goto *ip++;
     load:
      ...
      goto *ip++;
     store:
      ...
      goto *ip++;
      ...
    }

    So of course you don't know where one of the gotos goes to, because
    that depends on the VM code, which depends on the source code.

    I'm not sure if you are trolling or serious, but I will assume the latter.

    The point is that without a deep analysis of the program you cannot be
    sure that these goto's actually go to one of the labels in the engine() function, and not to some other location in the code, perhaps in some
    other function. That analysis would have to discover that the compile_to_vm_code() function returns a pointer to a vector of addresses picked from the insts[] vector. That could need an analysis of many
    functions called from compile_to_vm_code(), the history of the whole
    program execution, and so on. NOT easy.

    Now let's see how it looks with switch:

    void engine(char *source)
    {
      typedef enum {add, load, store,...} inst;
      inst *ip=compile_to_vm_code(source);  /* VM code is an array of inst values here */

      for (;;) {
        switch (*ip++) {
        case add:
          ...
          break;
        case load:
          ...
          break;
        case store:
          ...
          break;
        ...
        }
      }
    }

    Do you know any better which of the "..." is executed next?

    You know, without any deep analysis or understanding, that the execution
    goes to one of the cases in the switch, and /not/ into the wild blue yonder.

    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Nov 6 13:14:55 2025
    From Newsgroup: comp.arch

    On Thu, 6 Nov 2025 12:11:54 +0200
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2025-11-06 11:43, Michael S wrote:
    On Wed, 5 Nov 2025 17:26:44 +0200
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2025-11-05 7:17, Anton Ertl wrote:

    [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the
    6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went
    away between 6th and 7th edition. Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of
    the) label to which the value refers", which is machine-level
    semantics and not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value"
    statement where the label-value refers to a label in a different
    function. Does gcc prevent that at compile time? If not, I would
    expect the semantics to be Undefined Behavior, the usual cop-out
    when nothing useful can be said.

    Yes, UB sounnds as the best answer..

    The point is that Ritchie was not satisfied with that answer, which
    is why he removed labels-as-values from his version of C. I doubt
    that Stallman had any better answer for gcc, but he did not care.


    I suspect that the reason was different: DMR had no satisfying answer
    even for some of the intra-procedural cases.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Thu Nov 6 07:44:38 2025
    From Newsgroup: comp.arch

    Taking direction from the VAX’s AOBLEQ/AOBLSS (add one and branch) instructions
    and the DBcc instruction of the 68k, the Qupls Rs1 register of a compare-and-branch instruction may be incremented or decremented. This
    is really a form of instruction fusion, folding the op performed on the branch
    register into the branch instruction.

    I was thinking of modifying this to support additional ops and constant values. Why just add, if one can shift right or XOR as well? It may be
    useful to increment by a structure size. Also, a ring counter might be
    handy which could be implemented as a right shift. This could be
    supported by adding a postfix word to the branch instruction. It would
    make the instruction wider but it would not increase the dynamic
    instruction count.

    Not sure about the syntax to use for coding such instructions.

    BEQ Rs1,Rs2,label:ADD Rs1,256
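
    A minimal C sketch (mine, not Qupls output) of the loop shape such a
    fused compare-and-branch targets: the pointer bump by a structure size,
    the compare against a limit register, and the branch would collapse into
    a single BNE Rs1,Rs2,loop:ADD Rs1,sizeof(struct rec) style instruction.

    struct rec { int key; int val; };

    int sum_vals(const struct rec *p, const struct rec *end)
    {
        int s = 0;
        while (p != end) {   /* compare-and-branch: Rs1 = p, Rs2 = end */
            s += p->val;
            p++;             /* the ADD folded into the branch         */
        }
        return s;
    }
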


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Nov 6 07:57:23 2025
    From Newsgroup: comp.arch

    On 11/6/2025 12:46 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/4/2025 9:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it
    <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this
    feature?

    Because it could, and often did, make the code "unfollowable". That is,
    you are reading the code, following it to try to figure out what it is
    doing and come to an assigned/alter goto, and you don't know where to go
    next. The value was set some place else in the code, who knows where,
    and thus what value it was set to, and people/programmers just aren't
    used to being able to follow code like that.

    Take an example use: A VM interpreter. With labels-as-values it looks
    like this:

    void engine(char *source)
    {
      void *insts[] = {&&add, &&load, &&store, ...};

      void **ip=compile_to_vm_code(source,insts);

      goto *ip++;

     add:
      ...
      goto *ip++;
     load:
      ...
      goto *ip++;
     store:
      ...
      goto *ip++;
      ...
    }

    So of course you don't know where one of the gotos goes to, because
    that depends on the VM code, which depends on the source code.

    Now let's see how it looks with switch:

    void engine(char *source)
    {
      typedef enum {add, load, store,...} inst;
      inst *ip=compile_to_vm_code(source);  /* VM code is an array of inst values here */

      for (;;) {
        switch (*ip++) {
        case add:
          ...
          break;
        case load:
          ...
          break;
        case store:
          ...
          break;
        ...
        }
      }
    }

    Do you know any better which of the "..." is executed next? Of course
    not, for the same reason. Likewise for call threading, but there the
    VM instruction implementations can be distributed across many source
    files. With the replicated switch, the problem of predictability is
    the same, but there is lots of extra code, with many direct gotos.

    If you implement, say, a state machine using labels-as-values, or
    switch, again, the logic behind it is the same and the predictability
    is the same between the two implementations.

    Nick responded better than I could to this argument, demonstrating how
    it isn't true. As I said, in the hands of a good programmer, you might
    assume that the goto goes to one of those labels, but you can't be sure
    of it.


    BTW, you mentioned that it could be implemented as an indirect jump. It
    could for those architectures that supported that feature, but it could
    also be implemented by having the Alter/Assign modify the code (i.e.
    change the address in the jump/branch instruction), and self modifying
    code is just bad.

    On such architectures switch would also be implemented by modifying
    the code,

    I don't think so. Switch can be, and I understand usually is, implemented
    via an index into a jump table. No self-modifying code required.


    and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone architectures using self-modifying code are bad by association, then
    we have to get rid of all of these language features ASAP.

    And, by and large they have. BTW, I can accept the argument for keeping
    it in C on the grounds that C is "lower level" than, say, Fortran, COBOL
    or PL/1, and people using it are used to the language allowing "risky" constructs.


    One interesting aspect here is that the Fortran assigned goto and GNU
    C's goto * (to go with labels-as-values) look more like something that
    may have been inspired by a modern indirect branch than by
    self-modifying code.

    Well, the Fortran feature was designed in what, the late 1950s? Back
    then, self modifying code wasn't considered as bad as it now is.


    I only dimly remember the Cobol thing, but IIRC
    this looked more like something that's intended to be implemented by self-modifying code. I don't know what the PL/I solution looked like.

    As did COBOL (called "go to ... depending on"), but those features didn't suffer
    the problems of assigned/alter gotos.

    As demonstrated above, they do.

    No, they are implemented as an indexed jump table.


    And if you fall back to using ifs, it
    does not get any better, either.

    - anton
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Nov 6 17:44:32 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    I played around with the formulas from the POWER manual a bit,
    using Berkeley abc for logic optimization, for the conversion
    of the packed modulo 1000 to three BCD digits.

    Without spending too much effort, I arrived at four gate delays
    (INV -> OAI21 -> NAND2 -> NAND2) with a total of 37 gates optimizing
    for speed, or five gate delays optimizing for space.

    Since the gates hang off flip-flops, you don't need the inv gate
    at the front. Flip-flops can easily give both true and complement
    outputs.

    Agreed. Unfortunately, I have a hard time (i.e. "have not managed")
    convincing abc that both signals are available, and asserting that
    exactly one of them is 1 at any given time, without completely
    blowing up the optimization routines. It also does not handle
    external don't cares. But as I use it purely to play around with
    things, that is not too bad :-)
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 6 17:52:32 2025
    From Newsgroup: comp.arch

    Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
    On 2025-11-05 7:17, Anton Ertl wrote:
    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and
    not semantics in the abstract C machine.

    You can look at his specification in the documentation of, say, 7th
    edition Unix (where Ritchie apparently took the effort to document
    semantics), and see how he specified that. I doubt he specified
    "semantics in the abstract C machine", but I expect that he specified
    semantics at the C level.

    Concerning how Stallman documented it, you can look at the gcc
    documentation from 2.0 until Stallman passed maintainership on
    (gcc-2.7?).

    If you look at the current documentation <https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html>, it talks
    about the "address of a label" and "jump to one", which you might
    consider to be a machine-level description. You can also describe
    this at a C source level or "C abstract machine" level, but I don't
    expect the description to become any clearer.

    The problem in the abstract C machine is a "goto label-value" statement where the label-value refers to a label in a different function. Does
    gcc prevent that at compile time? If not, I would expect the semantics
    to be Undefined Behavior, the usual cop-out when nothing useful can be said.

    The gcc documentation says:

    |You may not use this mechanism to jump to code in a different
    |function. If you do that, totally unpredictable things happen.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 6 18:14:54 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Where this might be a problem is if the label variable was a
    global symbol and the target labels were in other name spaces.
    At that point it could treat it like a pointer to a function and
    have to spill all live register variables to memory.

    Does the assigned goto support that? What about regular goto and
    computed goto?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 18:28:19 2025
    From Newsgroup: comp.arch


    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-05 23:28, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
    ----------------
    But then you could get the problem of a longjmp to a setjmp value that
    is stale because the targeted function invocation (stack frame) is no
    longer there.

    But YOU had to pass the jumpbuf out of the setjump() scope.

    Now, YOU complain there is a hole in your own foot with a smoking gun
    in your own hand.

    That is not the issue. The question is if the semantics of "goto label-valued-variable" are hard to define, as Ritchie said, or not, as
    Anton thinks Stallman said or would have said.

    So, label-variables are hard to define, but function-variables are not ?!?

    The discussion above shows that whether a label value is implemented as
    a bare code address, or as a jumpbuf, some cases will have Undefined Behavior semantics. So I think Ritchie was right, unless the undefined
    cases can be excluded at compile time.

    The undefined cases could be excluded at compile-time, even in C, by requiring all label-valued variables to be local to some function and forbidding passing such values as parameters or function results. In addition, the use of an uninitialized label-valued variable should be prevented or detected. Perhaps Anton could accept such restrictions.

    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 6 18:17:31 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    So does gfortran support assigned goto, too?

    Yes.

    Cool.

    What problems in
    interaction with other features do you see?

    In this case, it is more the problem of modern architectures.
    On 32-bit architectures, it might have been possible to stash
    the address of a jump target in an actual INTEGER variable and
    GO TO there. On a 64-bit architecture, this is not possible, so
    you need to have a shadow variable for the pointer

    Implementation options that come to my mind are:

    1) Have the code in the bottom 4GB (or maybe 2GB), and a 32-bit
    variable is sufficient. AFAIK on some 64-bit architectures the
    default memory model puts the code in the bottom 4GB or 2GB.

    2) Put the offset from the start of the function or compilation unit
    (whatever scope the assigned goto can be used in) in the 32-bit
    variable. 32 bits should be enough for that. Of course, if Fortran
    assigns labels between shared libraries and the main program, that
    approach probably does not work, but does anybody really do that?
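
    For option 2, a GNU C sketch using label-difference arithmetic (a GNU
    extension) may make the idea concrete; the function and label names are
    made up here, and it assumes all offsets fit in 32 bits:

    #include <stdint.h>

    void f(int which)
    {
        int32_t target;                      /* the "INTEGER variable" */

        /* ASSIGN: store an offset from a base label, not a full pointer */
        target = (int32_t)((which ? &&l20 : &&l10) - &&base);

        /* assigned GOTO: rebuild the address and jump */
        goto *(&&base + target);

    base:
    l10:
        /* ... */
        return;
    l20:
        /* ... */
        return;
    }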

    How does ifort deal with this problem?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 18:36:33 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-05 4:21 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Qupls2026 currently supports 48-bit inline constants. I am debating
    whether to support 89 and 130-bit inline constants as well. Constant
    sizes increase by 41-bits due to the 48-bit instruction word size. The
    larger constants would require more instruction words to be available to be processed in decode. Not sure if it is even possible to pass a
    constant larger than 64-bits in the machine.

    I just realized that constant operand routing was already in Qupls, I
    had just not specifically identified it. The operand routing bits are
    just moved into a postfix instruction word rather than the first
    instruction word. This gives more bits available in the instruction
    word. Rather than burn a couple of bits in every R3 type instruction,
    another couple of opcodes are used to represent constant extensions.

    My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
    that can supply constants and perform operand routing. Within this
    range; instruction<8:5> specify the following table:

    0 0 0 0 +Src1 +Src2
    0 0 0 1 +Src1 -Src2
    0 0 1 0 -Src1 +Src2
    0 0 1 1 -Src1 -Src2
    0 1 0 0 +Src1 +imm5
    0 1 0 1 +Imm5 +Src2
    0 1 1 0 -Src1 -Imm5
    0 1 1 1 +Imm5 -Src2
    1 0 0 0 +Src1 Imm32
    1 0 0 1 Imm32 +Src2
    1 0 1 0 -Src1 Imm32
    1 0 1 1 Imm32 -Src2
    1 1 0 0 +Src1 Imm64
    1 1 0 1 Imm64 +Src2
    1 1 1 0 -Src1 Imm64
    1 1 1 1 Imm64 -Src2

    What happens if one tries to use an unsupported combination?

    For 2-operands and 3-operand instructions, they are all present.
    For 1-Operand instructions, only the ones targeting Src2 are
    available and if you use one not allowed you take an OPERATION
    exception.

    Here we have access to {5, 32, 64}-bit constants, 16-bit constants
    come from different OpCodes.

    Imm5 are the register specifier bits: range {-31..31} for integer and logical, range {-15.5..15.5} for floating point.

    I just realized that Qupls2026 does not accommodate small constants very well except for a few instructions like shift and bitfield instructions which have special formats. Sure, constants can be made to override
    register specs, but they take up a whole additional word. I am not sure
    how big a deal this is as there are also immediate forms of instructions with the constant encoded in the instruction, but these do not allow
    operand routing. There is a dedicated subtract from immediate
    instruction. A lot of other instructions are commutative, so operand
    routing is not needed.

    1<<const // performed at compile time
    1<<var // 1-instruction {1-word in My 66000}

    17/var // 1-instruction {1-word}

    You might notice My 66000 does not even HAVE a SUB instruction,
    instead:

    ADD Rd,Rs1,-Rs2

    Qupls has potentially 25, 48, 89 and 130-bit constants. 7-bit constants
    are available for shifts and bitfield ops. Leaving the 130-bit constants
    out for now. They may be useful for 128-bit SIMD against constant operands.

    The constant routing issue could maybe be fixed as there are 30+ free opcodes still. But there needs to be more routing bits with three source operands. All the permutations may get complicated to encode and allow
    for in the compiler. May want to permute two registers and a constant,
    or two constants and a register, and then three or four different sizes.

    Out of the 64-slot Major OpCode space, 23 slots are left over, 6 reserved
    in perpetuity to catch random jumps into integer or fp data.

    Qupls strives to be the low-cost processor.

    My 66000 strives to be the low-instruction-count processor.

    But remember, ISA is only the first 1/3rd of an architecture.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 18:39:55 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/5/2025 1:21 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Qupls2026 currently supports 48-bit inline constants. I am debating
    whether to support 89 and 130-bit inline constants as well. Constant
    sizes increase by 41-bits due to the 48-bit instruction word size. The
    larger constants would require more instruction words to be available to be processed in decode. Not sure if it is even possible to pass a
    constant larger than 64-bits in the machine.

    I just realized that constant operand routing was already in Qupls, I
    had just not specifically identified it. The operand routing bits are
    just moved into a postfix instruction word rather than the first
    instruction word. This gives more bits available in the instruction
    word. Rather than burn a couple of bits in every R3 type instruction,
    another couple of opcodes are used to represent constant extensions.

    My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
    that can supply constants and perform operand routing. Within this
    range; instruction<8:5> specify the following table:

    0 0 0 0 +Src1 +Src2
    0 0 0 1 +Src1 -Src2
    0 0 1 0 -Src1 +Src2
    0 0 1 1 -Src1 -Src2
    0 1 0 0 +Src1 +imm5
    0 1 0 1 +Imm5 +Src2
    0 1 1 0 -Src1 -Imm5
    0 1 1 1 +Imm5 -Src2
    1 0 0 0 +Src1 Imm32
    1 0 0 1 Imm32 +Src2
    1 0 1 0 -Src1 Imm32
    1 0 1 1 Imm32 -Src2
    1 1 0 0 +Src1 Imm64
    1 1 0 1 Imm64 +Src2
    1 1 1 0 -Src1 Imm64
    1 1 1 1 Imm64 -Src2

    Here we have access to {5, 32, 64}-bit constants, 16-bit constants
    come from different OpCodes.

    Imm5 are the register specifier bits: range {-31..31} for integer and logical, range {-15.5..15.5} for floating point.

    Some time ago, we discussed using the 5 bit immediates in floating point instructions as an index to an internal ROM with frequently used
    constants. The idea is that it would save some space in the instruction stream. Are you implementing that, and if not, why not?

    The constant ROM[specifier] seems to be the easiest way of taking
    5-bits and converting it into a FP number. It was only a few weeks
    ago that we changed the range from {-31..+31} to {-15.5..+15.5} as
    this covers <slightly> more fp constant uses. In My case, one always
    has access to larger constants at the same instruction-count price,
    just a larger code footprint.
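
    A rough software model of that decode step, with the 32 ROM entries left
    as placeholders rather than the actual My 66000 values:

    /* 5-bit FP immediate -> constant, via a small ROM indexed by the
       register-specifier bits; the entries shown are placeholders only. */
    static const double fp_imm_rom[32] = {
        0.0, 0.5, 1.0, 2.0 /* ..., remaining ISA-defined entries */
    };

    static double decode_fp_imm5(unsigned imm5)
    {
        return fp_imm_rom[imm5 & 31];
    }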

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 18:45:41 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/4/2025 9:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it
    <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this
    feature?

    Because it could, and often did, make the code "unfollowable". That is, you are reading the code, following it to try to figure out what it is doing and come to an assigned/alter goto, and you don't know where to go next. The value was set some place else in the code, who knows where,
    and thus what value it was set to, and people/programmers just aren't
    used to being able to follow code like that.

    Take an example use: A VM interpreter. With labels-as-values it looks
    like this:

    void engine(char *source)
    {
    void *insts[] = {&&add, &&load, &&store, ...};

    void **ip=compile_to_vm_code(source,insts);

    goto *ip++;

    add:
    ...
    goto *ip++;
    load:
    ...
    goto *ip++;
    store:
    ...
    goto *ip++;
    ...
    }

    So of course you don't know where one of the gotos goes to, because
    that depends on the VM code, which depends on the source code.

    Now let's see how it looks with switch:

    void engine(char *source)
    {
    typedef enum {add, load, store,...} inst;
    inst *ip=compile_to_vm_code(source);

    for (;;) {
    switch (*ip++) {
    case add:
    ...
    break;
    case load:
    ...
    break;
    case store:
    ...
    break;
    ...
    }
    }
    }

    Now let us look at it with tabularized functions:: {Ignore the
    interrupt and exception stuff at your peril}

    bool RunInst( Chip chip )
    {
    for( uint64_t i = 0; i < cores; i++ )
    {
    ContextStack *cpu = &core[i];
    uint8_t cs = cpu->cs;
    Thread *t = cpu->context[cs];
    Inst I;

    if( cpu->interrupt & ((((signed)1)<<63) >> cpu->priority) )
    { // take an interrupt
    cpu->cs = cpu->interrupt.cs;
    cpu->priority = cpu->interrupt.priority;
    t = cpu->context[cpu->cs];
    t->reg[0] = cpu->interrupt.message;
    }
    else if( uint16_t raised = t->raised & t->enabled )
    { // take an exception
    cpu->cs--;
    t = cpu->context[cpu->cs];
    t->reg[0] = FT1( raised ) | EXCPT;
    t->reg[1] = I.inst;
    t->reg[2] = I.src1;
    t->reg[3] = I.src2;
    t->reg[4] = I.src3;
    }
    else
    { // run an instruction
    t->ip += memory( FETCH, t->ip, &I.inst );
    t->raised |= majorTable[ I.major ]( cpu, t, &I );
    }
    }
    }

    Do you know any better which of the "..." is executed next? Of course
    not, for the same reason. Likewise for call threading, but there the
    VM instruction implementations can be distributed across many source
    files. With the replicated switch, the problem of predictability is
    the same, but there is lots of extra code, with many direct gotos.

    If you implement, say, a state machine using labels-as-values, or
    switch, again, the logic behind it is the same and the predictability
    is the same between the two implementations.

    BTW, you mentioned that it could be implemented as an indirect jump. It could for those architectures that supported that feature, but it could also be implemented by having the Alter/Assign modify the code (i.e. change the address in the jump/branch instruction), and self modifying code is just bad.

    On such architectures switch would also be implemented by modifying
    the code, and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone architectures using self-modifying code are bad by association, then
    we have to get rid of all of these language features ASAP.

    One interesting aspect here is that the Fortran assigned goto and GNU
    C's goto * (to go with labels-as-values) look more like something that
    may have been inspired by a modern indirect branch than by
    self-modifying code. I only dimly remember the Cobol thing, but IIRC
    this looked more like something that's intended to be implemented by self-modifying code. I don't know what the PL/I solution looked like.

    As did COBOL, called GO TO DEPENDING ON, but those features didn't suffer the problems of assigned/alter gotos.

    As demonstrated above, they do. And if you fall back to using ifs, it
    does not get any better, either.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Nov 6 13:11:10 2025
    From Newsgroup: comp.arch

    On 11/6/2025 3:24 AM, Michael S wrote:
    On Wed, 05 Nov 2025 21:06:16 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

    Are talking about hardware or software?

    A SW solution based on how it would be done in HW.

    Then, I suspect that you didn't understand the objection of Thomas Koenig.

    1. Format of interest is Decimal128. https://en.wikipedia.org/wiki/Decimal128_floating-point_format

    2. According to my understanding, Thomas didn't suggest that *slow*
    software implementation of DPD-encoded DFP, i.e. implementation that
    only cares about correctness, is hard.

    3. OTOH, he seems to suspect, and I agree with him, that a *non-slow*
    software implementation, one comparable in speed (say, within a
    factor of 1.5-2) to a competent implementation of the same DFP operations
    in BID format, is not easy. If at all possible.

    4. All said above assumes an absence of HW assists.



    BTW, at least for multiplication, I would probably not do my
    arithmetic in the BCD domain.
    Instead, I'd convert 10+ DPD digits to two Base_1e18 digits (11 look
    ups per operand, 22 total look ups + ~40 shifts + ~20 ANDs + ~20
    additions).

    Then I'd do multiplication and normalization and rounding in Base_1e18.

    Then I'd convert from Base_1e18 to Base_1000. The ideas of such
    conversion are similar to fast binary-to-BCD conversion that I
    demonstrated here a decade or so ago. AVX2 could be quite helpful at that
    stage.

    Then I'd have to convert the result from Base_1000 to DPD. Here, again,
    11 table look-ups + plenty of ANDs/shift/ORs seem inevitable.
    May be, at that stage SIMD gather can be of help, but I have my doubts.
    So far, every time I tried gather I was disappointed with performance.

    Overall, even with seemingly decent plan like sketched above, I'd expect
    DPD multiplication to be 2.5x to 3x slower than BID. But, then again,
    in the past my early performance estimates were wrong quite often.


    I decided to start working on a mockup (quickly thrown together).
    I don't expect to have much use for it, but meh.


    It works by packing/unpacking the values into an internal format along
    vaguely similar lines to the .NET format, just bigger to accommodate
    more digits:
    4x 32-bit values each holding 9 digits
    Except the top one generally holding 7 digits.
    16-bit exponent, sign byte.

    Then wrote a few pack/unpack scenarios:
    X30: Directly packing 20/30 bit chunks, non-standard;
    DPD: Use the DPD format;
    BID: Use the BID format.

    For the pack/unpack step (taken in isolation):
    X30 is around 10x faster than either DPD or BID;
    Both DPD and BID need a similar amount of time.
    BID needs a bunch of 128-bit arithmetic handlers.
    DPD needs a bunch of merge/split and table lookups.
    Seems to mostly balance out in this case.


    For DPD, merge is effectively:
    Do the table lookups;
    v=v0+(v1*1000)+(v2*1000000);
    With a split step like:
    v0=v;
    v1=v/1000;
    v0-=v1*1000;
    v2=v1/1000;
    v1-=v2*1000;
    Then, use table lookups to go back to DPD.

    Did look into possible faster ways of doing the splitting, but then
    noted that I have not yet found a faster way that gives correct results
    (where one can assume the compiler already knows how to turn divide by constant into multiply by reciprocal).
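
    For concreteness, a minimal C sketch of the declet-lookup plus
    merge/split scheme above; the two tables are assumed to be precomputed
    elsewhere and the names are placeholders:

    #include <stdint.h>

    /* dpd2bin maps a 10-bit declet to 0..999, bin2dpd maps 0..999 back. */
    extern const uint16_t dpd2bin[1024];
    extern const uint16_t bin2dpd[1000];

    /* Merge three declets into one 9-digit binary chunk (0..999999999). */
    static uint32_t dpd_merge3(uint32_t d0, uint32_t d1, uint32_t d2)
    {
        return (uint32_t)dpd2bin[d0 & 1023]
             + (uint32_t)dpd2bin[d1 & 1023] * 1000u
             + (uint32_t)dpd2bin[d2 & 1023] * 1000000u;
    }

    /* Split a 9-digit chunk back into three declets; the divisions by a
       constant become multiplies by the reciprocal as noted above. */
    static void dpd_split3(uint32_t v, uint32_t *d0, uint32_t *d1, uint32_t *d2)
    {
        uint32_t v1 = v / 1000u;
        uint32_t v0 = v - v1 * 1000u;
        uint32_t v2 = v1 / 1000u;
        v1 -= v2 * 1000u;
        *d0 = bin2dpd[v0];
        *d1 = bin2dpd[v1];
        *d2 = bin2dpd[v2];
    }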


    At first it seemed like a strong reason to favor X30 over either DPD or
    BID. Except, that the cost of the ADD and MUL operations effectively
    dwarf that of the pack/unpack operations, so the relative cost
    difference between X30 and DPD may not matter much.


    As is, it seems MUL and ADD cost roughly 6x more than the
    DPD pack/unpack steps.

    So, it seems, while DPD pack/unpack isn't free, it is not something that
    would lead to X30 being a decisive win either in terms of performance.



    It might make more sense, if supporting BID, to just do it as its own
    thing (and embrace just using a bunch of 128-bit arithmetic, and a 128*128=>256 bit widening multiply, ...). Also, can note that the BID
    case ends up needing a lot more clutter, mostly again because C lacks
    native support for 128-bit arithmetic.

    If working based on digit chunks, likely better to stick with DPD due to
    less clutter, etc. Though, this part would be less bad if C had had
    widespread support for 128-bit integers.



    Though, in this case, the ADD and MUL operations currently work by
    internally doubling the width and then narrowing the result after normalization. This is slower, but could give exact results.


    Though, still not complete nor confirmed to produce correct results.



    But, yeah, might be more worthwhile to look into digit chunking:
    12x 3 digits (16b chunk)
    4x 9 digits (32b chunk)
    2x 18 digits (64b chunk)
    3x 12 digits (64b chunk)

    Likely I think:
    3 digits, likely slower because of needing significantly more operations;
    9 digits, seemed sensible, option I went with, internal operations fully
    fit within the limits of 64 bit arithmetic;
    18 digits, possible, but runs into many cases internally that would
    require using 128-bit arithmetic.

    12 digits, fits more easily into 64-bit arithmetic, but would still
    sometimes exceed it; and isn't that much more than 9 digits (but would
    reduce the number of chunks needed from 4 to 3).


    While 18 digits conceptually needs fewer abstract operations than 9
    digits, it would suffer the drawback of many of these operations being
    notably slower.

    However, if running on RV64G with the standard ABI, it is likely the
    9-digit case would also take a performance hit due to sign-extended
    unsigned int (and needing to spend 2 shifts whenever zero-extending a
    value).


    With 3x 12 digits, while not exactly the densest scheme, there is a little
    more "working space", so it would reduce cases which exceed the limits of
    64-bit arithmetic. Well, except multiply, where 24 > 18 ...

    The main merit of 9 digit chunking here being that it fully stays within
    the limits of 64-bit arithmetic (where multiply temporarily widens to
    working with 18 digits, but then narrows back to 9 digit chunks).

    Also 9 digit chunking may be preferable when one has a faster 32*32=>64
    bit multiplier, but 64*64=>128 is slower.


    One other possibility could be to use BCD rather than chunking, but I
    expect BCD emulation to be painfully slow in the absence of ISA level
    helpers.


    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Nov 6 19:38:54 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    Some time ago, we discussed using the 5 bit immediates in floating point instructions as an index to an internal ROM with frequently used
    constants. The idea is that it would save some space in the instruction stream. Are you implementing that, and if not, why not?

    I did some statistics on which floating point constants occurred how
    often, looking at three different packages (Perl, gnuplot and GSL).
    GSL implements a lot of special functions, so it has a lot of
    constants you are not likely to find often in a random sample of
    other packages :-) Perl has very little floating point. gnuplot
    is also special in its own way, of course.

    A few constants occur quite often, but there are a lot of
    differences between the floating point constants for different
    programs, to nobody's surprise (presumably).

    Here is the head of an output of a little script I wrote to count
    all floating-point constants from My66000 assembler. Note that
    the compiler is for the version that does not yet do 0.5 etc as
    floating point. The first number is the number of occurrences,
    the second one is the constant itself.

    5-bit constants: 886
    32-bit constants: 566
    64-bit constants: 597
    303 0
    290 1
    96 0.5
    81 6
    58 -1
    58 1e-14
    49 2
    46 -2
    45 -8.98846567431158e+307
    44 10
    44 255
    37 8.98846567431158e+307
    29 -0.5
    28 3
    27 90
    27 360
    26 -1e-05
    21 0.0174532925199433
    20 0.9
    18 -3
    17 180
    17 0.1
    17 0.01
    [...]
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Nov 6 20:04:37 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Where this might be a problem is if the label variable was a
    global symbol and the target labels were in other name spaces.
    At that point it could treat it like a pointer to a function and
    have to spill all live register variables to memory.

    Does the assigned goto support that?

    No, that would be beyond horrible.

    What about regular goto and
    computed goto?

    Neither; according to F77, it must be "defined in the same program
    unit".

    An extra feature: When using GOTO variable, you can also supply a
    list of labels that it should jump to; if the jump target is not
    in the list, the GOTO variable is illegal.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Nov 6 20:07:16 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    So does gfortran support assigned goto, too?

    Yes.

    Cool.

    What problems in
    interaction with other features do you see?

    In this case, it is more the problem of modern architectures.
    On 32-bit architectures, it might have been possible to stash
    the address of a jump target in an actual INTEGER variable and
    GO TO there. On a 64-bit architecture, this is not possible, so
    you need to have a shadow variable for the pointer

    Implementation options that come to my mind are:

    1) Have the code in the bottom 4GB (or maybe 2GB), and a 32-bit
    variable is sufficient. AFAIK on some 64-bit architectures the
    default memory model puts the code in the bottom 4GB or 2GB.

    Compiler writers should never box themselves in like that.

    2) Put the offset from the start of the function or compilation unit (whatever scope the assigned goto can be used in) in the 32-bit
    variable. 32 bits should be enough for that.

    That would make jumps very inefficient.

    Of course, if Fortran
    assigns labels between shared libraries and the main program,

    It does not.

    How does ifort deal with this problem?

    I have no idea, and no inclination to find out; check out
    assembly code at godbolt if you are really interested.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Nov 6 12:14:33 2025
    From Newsgroup: comp.arch

    On 11/6/2025 11:38 AM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    Some time ago, we discussed using the 5 bit immediates in floating point
    instructions as an index to an internal ROM with frequently used
    constants. The idea is that it would save some space in the instruction
    stream. Are you implementing that, and if not, why not?

    I did some statistics on which floating point constants occurred how
    often, looking at three different packages (Perl, gnuplot and GSL).
    GSL implements a lot of special functions, so it has a lot of
    constants you are not likely to find often in a random sample of
    other packages :-) Perl has very little floating point. gnuplot
    is also special in its own way, of course.

    A few constants occur quite often, but there are a lot of
    differences between the floating point constants for different
    programs, to nobody's surprise (presumably).

    Here is the head of an output of a little script I wrote to count
    all floating-point constants from My66000 assembler. Note that
    the compiler is for the version that does not yet do 0.5 etc as
    floating point. The first number is the number of occurrences,
    the second one is the constant itself.

    5-bit constants: 886
    32-bit constants: 566
    64-bit constants: 597
    303 0
    290 1
    96 0.5
    81 6
    58 -1
    58 1e-14
    49 2
    46 -2
    45 -8.98846567431158e+307
    44 10
    44 255
    37 8.98846567431158e+307
    29 -0.5
    28 3
    27 90
    27 360
    26 -1e-05
    21 0.0174532925199433
    20 0.9
    18 -3
    17 180
    17 0.1
    17 0.01
    [...]

    Interesting! No values related to pi? And what are the ...e+307 used for?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 20:24:23 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    So does gfortran support assigned goto, too?

    Yes.

    Cool.

    What problems in
    interaction with other features do you see?

    In this case, it is more the problem of modern architectures.
    On 32-bit architectures, it might have been possible to stash
    the address of a jump target in an actual INTEGER variable and
    GO TO there. On a 64-bit architecture, this is not possible, so
    you need to have a shadow variable for the pointer

    Implementation options that come to my mind are:

    1) Have the code in the bottom 4GB (or maybe 2GB), and a 32-bit
    variable is sufficient. AFAIK on some 64-bit architectures the
    default memory model puts the code in the bottom 4GB or 2GB.

    2) Put the offset from the start of the function or compilation unit (whatever scope the assigned goto can be used in) in the 32-bit
    variable. 32 bits should be enough for that.

    After 4 years of looking, we are still waiting for a single function
    that needs more than a scaled 16-bit displacement from current IP
    {±17-bits} to reach all labels within the function.

    Of course, if Fortran
    assigns labels between shared libraries and the main program, that
    approach probably does not work, but does anybody really do that?

    How does ifort deal with this problem?

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Nov 6 16:24:28 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Where this might be a problem is if the label variable was a
    global symbol and the target labels were in other name spaces.
    At that point it could treat it like a pointer to a function and
    have to spill all live register variables to memory.

    Does the assigned goto support that? What about regular goto and
    computed goto?

    - anton

    I didn't mean to imply that it did.
    As far as I remember, Fortran 77 does not allow it.
    I never used later Fortrans.

    I hadn't given the dynamic branch topic any thought until you raised it
    and this was just me working through the things a compiler might have
    to deal with.

    I have written jump dispatch table code myself where the destinations
    came from symbols external to the routine, but I had to switch to
    inline assembler for this as MS C does not support goto variables,
    and it was up to me to make sure the registers were all handled correctly.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 21:59:31 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    Some time ago, we discussed using the 5 bit immediates in floating point instructions as an index to an internal ROM with frequently used constants. The idea is that it would save some space in the instruction stream. Are you implementing that, and if not, why not?

    I did some statistics on which floating point constants occurred how
    often, looking at three different packages (Perl, gnuplot and GSL).
    GSL implements a lot of special functions, so it has a lot of
    constants you are not likely to find often in a random sample of
    other packages :-) Perl has very little floating point. gnuplot
    is also special in its own way, of course.

    A few constants occur quite often, but there are a lot of
    differences between the floating point constants for different
    programs, to nobody's surprise (presumably).

    Here is the head of an output of a little script I wrote to count
    all floating-point constants from My66000 assembler. Note that

    There is a space between the y and the 6 in My 66000.

    the compiler is for the version that does not yet do 0.5 etc as
    floating point. The first number is the number of occurrences,
    the second one is the constant itself.

    5-bit constants: 886
    32-bit constants: 566
    64-bit constants: 597
    303 0
    290 1
    96 0.5
    81 6
    58 -1
    58 1e-14
    49 2
    46 -2
    45 -8.98846567431158e+307
    44 10
    44 255
    37 8.98846567431158e+307
    29 -0.5
    28 3
    27 90
    27 360
    26 -1e-05
    21 0.0174532925199433
    20 0.9
    18 -3
    17 180
    17 0.1
    17 0.01
    [...]

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Thu Nov 6 22:09:25 2025
    From Newsgroup: comp.arch

    It appears that MitchAlsup <user5857@newsgrouper.org.invalid> said:
    That is not the issue. The question is if the semantics of "goto
    label-valued-variable" are hard to define, as Ritchie said, or not, as
    Anton thinks Stallman said or would have said.

    So, label-variables are hard to define, but function-variables are not ?!?

    Relatively speaking, yeah. In languages with nested scopes, label gotos
    can jump to an outer scope so they have to unwind some frames. Back when people used such things, a common use was on an error to jump out to some recovery code.

    Function pointers have a sort of similar problem in that they need to carry along pointers to all of the enclosing frames the function can see. That is reasonably well solved by displays, give or take the infamous Knuth man or boy program, 13 lines of Algol60 horror for which Knuth himself got the results wrong.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 22:53:09 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Where this might be a problem is if the label variable was a
    global symbol and the target labels were in other name spaces.
    At that point it could treat it like a pointer to a function and
    have to spill all live register variables to memory.

    Does the assigned goto support that? What about regular goto and
    computed goto?

    - anton

    I didn't mean to imply that it did.
    As far as I remember, Fortran 77 does not allow it.
    I never used later Fortrans.

    I hadn't given the dynamic branch topic any thought until you raised it
    and this was just me working through the things a compiler might have
    to deal with.

    I have written jump dispatch table code myself where the destinations
    came from symbols external to the routine, but I had to switch to
    inline assembler for this as MS C does not support goto variables,

    Oh sure it does--it is called Return-Oriented-Programming.
    You take the return address off the stack and insert your
    go-to label on the stack and then just return.

    Or you could do some "foul play" on a jumpbuf and longjump.

    {{Be careful not to shoot yourself in the foot.}}

    and it was up to me to make sure the registers were all handled correctly.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 6 22:21:05 2025
    From Newsgroup: comp.arch

    Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
    That is not the issue. The question is if the semantics of "goto label-valued-variable" are hard to define, as Ritchie said, or not, as
    Anton thinks Stallman said or would have said.

    The discussion above shows that whether a label value is implemented as
    a bare code address, or as a jumpbuf, some cases will have Undefined Behavior semantics. So I think Ritchie was right, unless the undefined
    cases can be excluded at compile time.

    Ritchie designed lots of features into C for which the C
    standardization committee later decided that some cases are undefined behaviour. I don't think that Ritchie had any qualms at designing
    something like labels-as-values with unchecked limitations (what would
    later become undefined or implementation-defined behaviour), or
    documenting these limitations.

    Here is my attempt (from 1999) at a specification for
    labels-as-values:

    |"goto *<expr>" [or whatever the syntax was] is equivalent to "goto <label>"
    |if <expr> evaluates to the same value as the expression "&&<label>" [or
    |whatever the syntax was]. If <expr> does not evaluate to a label of the
    |function that contains the "goto *<expr>", the result is undefined.

    The undefined cases could be excluded at compile-time, even in C, by requiring all label-valued variables to be local to some function and forbidding passing such values as parameters or function results.

    Gforth certainly passes the labels out, for use by the compiler that
    generates the VM code.

    In
    addition, the use of an uninitialized label-valued variable should be prevented or detected.

    Using an uninitialized variable is undefined behaviour in C, but not
    prevented, and not always detected (compilers emit warnings in some
    cases when they detect a use of an uninitialized variable). Why
    should it be any different for an uninitialized variable used with
    "goto *"?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Nov 6 20:10:19 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:
    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Where this might be a problem is if the label variable was a
    global symbol and the target labels were in other name spaces.
    At that point it could treat it like a pointer to a function and
    have to spill all live register variables to memory.
    Does the assigned goto support that? What about regular goto and
    computed goto?

    - anton
    I didn't mean to imply that it did.
    As far as I remember, Fortran 77 does not allow it.
    I never used later Fortrans.

    I hadn't given the dynamic branch topic any thought until you raised it
    and this was just me working through the things a compiler might have
    to deal with.

    I have written jump dispatch table code myself where the destinations
    came from symbols external to the routine, but I had to switch to
    inline assembler for this as MS C does not support goto variables,

    Oh sure it does--it is called Return-Oriented-Programming.
    You take the return address off the stack and insert your
    go-to label on the stack and then just return.

    Or you could do some "foul play" on a jumpbuf and longjump.

    {{Be careful not to shoot yourself in the foot.}}

    Or worse... shoot yourself in the foot and then step in a cow pie.
    I hate when that happens.

    and it was up to me to make sure the registers were all handled correctly.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Nov 7 06:55:08 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    After 4 years of looking, we are still waiting for a single function
    that needs more than a scaled 16-bit displacement from current IP
    {±17-bits} to reach all labels within the function.

    Some people use auto-generated code (for example from computer
    algebra systems), which generate really, really long procedures.
    A good stress-test for compilers, too; they tend to expose
    O(n^2) or worse behavior where nobody looked. So it is good that
    branch instructions within functions are expanded by the assembler
    if needed :-)

    Even having 64-bit offsets like My 66000 can lead into a trap (and will
    require future optimization work on the compiler). This is a simplified version of something that came up in a PR.

    SUBROUTINE FOO
    DOUBLE PRECISION A,B,C,D,E
    COMMON A,B,C,D,E
    C very many statements involving A,B,C,D,E

    If you load and store each access to one of the variables via its
    64-bit access, you can end up using very many 96-bit instructions,
    where a single load of the base address of the COMMON block would
    save a lot of code space at the expense of a single instruction
    at the beginning.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 7 08:06:41 2025
    From Newsgroup: comp.arch

    Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
    On 2025-11-06 11:43, Michael S wrote:
    On Wed, 5 Nov 2025 17:26:44 +0200
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2025-11-05 7:17, Anton Ertl wrote:
    Why does standard C not have it? C had it up to and including the
    6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition. Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.
    ...
    Yes, UB sounnds as the best answer..

    The point is that Ritchie was not satisfied with that answer, which is
    why he removed labels-as-values from his version of C.

    He did not write that, and given the rest of C, I very much doubt that
    this was the reason.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 7 08:08:42 2025
    From Newsgroup: comp.arch

    Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
    On 2025-11-06 10:46, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    [Fortran's assigned goto]
    Because it could, and often did, make the code "unfollowable". That is, you are reading the code, following it to try to figure out what it is
    doing and come to an assigned/alter goto, and you don't know where to go next. The value was set some place else in the code, who knows where,
    and thus what value it was set to, and people/programmers just aren't
    used to being able to follow code like that.

    Take an example use: A VM interpreter. With labels-as-values it looks
    like this:

    void engine(char *source)
    {
    void *insts[] = {&&add, &&load, &&store, ...};

    void **ip=compile_to_vm_code(source,insts);

    goto *ip++;

    add:
    ...
    goto *ip++;
    load:
    ...
    goto *ip++;
    store:
    ...
    goto *ip++;
    ...
    }

    So of course you don't know where one of the gotos goes to, because
    that depends on the VM code, which depends on the source code.

    I'm not sure if you are trolling or serious, but I will assume the latter.

    This is the problem that Stephen Fuld mentioned, and that is actually
    a practical problem that I have experience in some cases when
    debugging programs with indirect control flow, usually with various
    forms of indirect calls, e.g., method calls. I have not experienced
    it for threaded-code interpreters that use labels-as-values (as
    outlined above), because there I can always look at ip[0], ip[1]
    etc. to see where the next executions of goto *ip will go.

    The point is that without a deep analysis of the program you cannot be
    sure that these goto's actually go to one of the labels in the engine() function, and not to some other location in the code, perhaps in some
    other function. That analysis would have to discover that the compile_to_vm_code() function returns a pointer to a vector of addresses picked from the insts[] vector. That could need an analysis of many functions called from compile_to_vm_code(), the history of the whole
    program execution, and so on. NOT easy.

    That has never been a problem in my experience, and I have been using labels-as-values since 1992. Up to gforth-0.6 (2003), all instances
    of &&label and all instances of goto *expr were in the same function,
    so if labels had a separate type, that could not be converted by
    casts, the analysis would be trivial, at least if GNU C was an
    Ada-like language, where labels have their own type that cannot be
    converted to other types. As it is, Fortran's assigned goto uses
    integer numbers, and labels-as-values uses void *, so if anybody was
    really interested in performing such an analysis, they would have a
    lot of work to do. But the design of these features with using
    existing types makes it obvious that performing such an analysis was
    not intended.

    Interestingly, if somebody wanted to work in that direction, checking
    at run-time that the target of a goto is inside the function that
    contains the goto is easy and not particularly expensive. With the
    newfangled "control-flow integrity" features in hardware, you could
    even check relatively cheaply that only &&label instances are targets
    of goto *.
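
    A crude version of such a run-time check in GNU C, as a fragment inside
    a dispatch loop like the engine() above; first_label and last_label are
    hypothetical sentinels bracketing the VM-instruction labels, abort() is
    from <stdlib.h>, and ordering comparisons on label addresses are not
    blessed by the standard, though they work on flat-address-space targets:

        if (*ip < &&first_label || *ip > &&last_label)
            abort();      /* target is not a label of this function */
        goto *ip++;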

    Ok, so what about gforth-0.6 (2003) and later? First of all, they
    contain two functions with goto * and &&label instances, so the
    trivial analysis would no longer work. Has there ever been any mixup
    where a goto * jumped to a label in the other function? Not that I
    know of; if it happened, it would actually work, because the two
    functions are identical apart from some code-space padding.

    What's more relevant is that gforth-0.6 added code-copying dynamic
    native code generation: It copies code snippets (using the addresses
    gotten with &&label to determine where they start and where they end)
    to some RWX data region, concatenating the snippets in this way,
    resulting in a compiled program in the RWX region. It then uses one
    of the goto * in one of the functions to actually start executing this dynamically-generated code.

    This is probably outside of what Stallman had in mind for
    labels-as-values, but fortunately Stallman did not try to limit what
    can be done to what he had in mind, the way that many programming
    language designers do, and the way that many people discussing
    programming languages think. This is a feature that Ritchie's C also
    has, which cannot be said about the C of people who think that
    "undefined behaviour" is enough justification to declare a program
    "buggy".

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 7 10:09:02 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/6/2025 12:46 AM, Anton Ertl wrote:
    If you implement, say, a state machine using labels-as-values, or
    switch, again, the logic behind it is the same and the predictability
    is the same between the two implementations.

    Nick responded better than I could to this argument, demonstrating how
    it isn't true. As I said, in the hands of a good programmer, you might assume that the goto goes to one of those labels, but you can't be sure
    of it.

    In <1762311070-5857@newsgrouper.org> you mentioned method calls as
    'just a more expensive "label"', there you know that the method call
    calls one of the implementations of the method with the name, like
    with the switch. You did not find that satisfying in <1762311070-5857@newsgrouper.org>, but now knowing that it's one of a
    large number of switch targets is good enough for you, whereas Niklas
    Holsti's problem (which does not occur in my practical experience with labels-as-values) has become your problem?

    BTW, you mentioned that it could be implemented as an indirect jump. It could for those architectures that supported that feature, but it could
    also be implemented by having the Alter/Assign modify the code (i.e.
    change the address in the jump/branch instruction), and self modifying
    code is just bad.

    On such architectures switch would also be implemented by modifying
    the code,

    I don't think so. Switch can, and I understand usually is, implemented
    via an index into a jump table. No self modifying code required.

    What does "index into a jump table" mean in one of those architectures
    that did not have indirect jumps and used self-modifying code instead?
    I bet that it ends up in self-modifying code, too, because these
    architectures usually don't have indirect jumps through jump tables,
    either. If they had, the easy way to implement indirect branches
    without self-modifying code would be to have a one-entry jump table,
    store the target in that entry, and then perform an indirect jump
    through that jump table.

    and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone
    architectures using self-modifying code are bad by association, then
    we have to get rid of all of these language features ASAP.

    And, by and large they have.

    We have gotten rid of indirect calls, e.g., in higher-order functions
    in functional programming languages? We have gotten rid of dynamic
    method dispatch in object-oriented programs?

    Thinking about the things that self-modifying code has been used for
    on some architecture, IIRC that also includes array indexing. So have
    we gotten rid of array indexing in programming languages?

    One interesting aspect here is that the Fortran assigned goto and GNU
    C's goto * (to go with labels-as-values) look more like something that
    may have been inspired by a modern indirect branch than by
    self-modifying code.

    Well, the Fortran feature was designed in what, the late 1950s? Back
    then, self modifying code wasn't considered as bad as it now is.

    Did you read what you are replying to?

    Does the IBM 704 (for which FORTRAN has been designed originally)
    support indirect branches, or was it necessary to implement the
    assigned goto (and computed goto) with self-modifying code on that architecture?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 7 10:32:08 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    An extra feature: When using GOTO variable, you can also supply a
    list of labels that it should jump to; if the jump target is not
    in the list, the GOTO variable is illegal.

    The benefit I see from that is that data-flow analysis must only
    consider the control flows from the assigned goto to these targets and
    not to all assigned labels (in contrast to labels-as-values), and
    conversely, if every assigned goto has such a list, data-flow analysis
    knows more precisely which gotos can actually jump to a given label.

    This would make a small difference in Gforth since 0.6, which has
    introduced hybrid direct/indirect-threaded code, and where some goto *
    are for indirect-threaded dispatches, and some labels are only reached
    from these goto * instances, and a certain variable is only alive
    across these jumps. GNU C does not have this option, so what we did
    instead is to kill the variable right before all the gotos that do not
    jump to these labels.

    It might also help with static stack caching: There are stack states
    with 0-n stack items in registers, and a particular VM instruction
    code snippet starts in a particular state (say, 2 stack items in a
    register) and ends with another state S (say, 1 stack item in a
    register). It will jump to code that expects the same state S. All
    variables that contain stack items beyond what S has are dead at that
    point. If we could tell that the goto * from state S only goes to
    targets in state S, the data-flow analysis could determine that.
    Instead, what we do is to kill these additional variables in a subset
    of uses. When we tried to kill them at all uses, the quality of the
    code produced by gcc deteriorated significantly.

    This variable-killing happens by having empty asm statements that
    claim to write to these variables, so if this is used incorrectly, the
    produced code will be incorrect. So the benefit of this assigned-goto
    feature would be to replace a dangerous feature with another dangerous
    one: if you fail to list all the jumped-to labels, the data-flow
    analysis would be wrong, too. It seems more elegant to describe the
    actual control flow, and then let the data-flow analysis do its work
    than the heavy-handed direct influence on the data-flow analysis that
    our variable-killing does.
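
    To make the trick concrete, here is a minimal GNU C sketch (hypothetical
    names, not Gforth's actual code) of killing a cached stack item with an
    empty asm statement before a dispatch that never reaches the labels where
    that item is live:

    /* The empty asm claims to write tos, so gcc's data-flow analysis
       treats the cached value as dead at that point. */
    #define KILL(var) __asm__ ("" : "=r" (var))

    long run(void)
    {
        void *prog[] = { &&add1, &&add1, &&exit_state, &&done };
        void **ip = prog;
        long tos = 0;               /* one stack item cached in a register */

        goto *ip++;
    add1:                           /* runs in the "tos cached" state */
        tos += 1;
        goto *ip++;
    exit_state:                     /* leaves the cached state, so ... */
        KILL(tos);                  /* ... declare the cached tos dead here */
        goto *ip++;
    done:
        return 0;                   /* no tos-using label reachable from exit_state */
    }

    Getting the KILL placement wrong silently produces wrong code, which is
    exactly the danger described above.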

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 7 15:26:38 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    In languages with nested scopes, label gotos
    can jump to an outer scope so they have to unwind some frames. Back when people used such things, a common use was on an error to jump out to some recovery code.

    Pascal has that feature. Concerning error handling, jumping to an
    error handler in a statically enclosing scope has fallen out of
    favour, but throwing an exception to the next dynamically enclosing
    exception handler is supported in a number of languages.

    Function pointers have a sort of similar problem in that they need to carry along pointers to all of the enclosing frames the function can see. That is reasonably well solved by displays, give or take the infamous Knuth man-or-boy program, 13 lines of Algol 60 horror for which Knuth himself got the results wrong.

    Displays and static link chains are among the techniques that can be
    used to implement static scoping correctly, i.e., where the man-or-boy
    test produces the correct result. Knuth initially got the result
    wrong, because he only had boy compilers, and the computation is too
    involved to do it by hand.

    The main horror in the original version is that for some of the Algol
    60 syntax that is used, it is not obvious without studying the Algol
    60 report what it means. <https://rosettacode.org/wiki/Man_or_boy_test#ALGOL_60> contains some discussion, and one can find it in various other programming
    languages, more or (often) less close to the original. The discussion
    at <https://rosettacode.org/wiki/Man_or_boy_test#TXR> and the
    difference between the "proper job" version and the "crib the Common
    Lisp or Scheme solution" version gives some insight.

    The fact that "less close" also produces the correct result suggests
    that the man-or-boy test is less discerning than Knuth probably
    intended. That's a common problem with testing.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Fri Nov 7 08:26:41 2025
    From Newsgroup: comp.arch

    On 11/7/2025 2:09 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/6/2025 12:46 AM, Anton Ertl wrote:
    If you implement, say, a state machine using labels-as-values, or
    switch, again, the logic behind it is the same and the predictability
    is the same between the two implementations.

    Nick responded better than I could to this argument, demonstrating how
    it isn't true. As I said, in the hands of a good programmer, you might
    assume that the goto goes to one of those labels, but you can't be sure
    of it.

    In <1762311070-5857@newsgrouper.org> you

    I think the attributions are messed up, as I didn't say what you next
    say I said.


    mentioned method calls as
    'just a more expensive "label"', there you know that the method call
    calls one of the implementations of the method with the name, like
    with the switch. You did not find that satisfying in <1762311070-5857@newsgrouper.org>, but now knowing that it's one of a
    large number of switch targets is good enough for you, whereas Niklas Holsti's problem (which does not occur in my practical experience with labels-as-values) has become your problem?

    BTW, you mentioned that it could be implemented as an indirect jump. It could for those architectures that supported that feature, but it could also be implemented by having the Alter/Assign modify the code (i.e.
    change the address in the jump/branch instruction), and self modifying code is just bad.

    On such architectures switch would also be implemented by modifying
    the code,

    I don't think so. Switch can, and I understand usually is, implemented
    via an index into a jump table. No self modifying code required.

    What does "index into a jump table" mean in one of those architectures
    that did not have indirect jumps and used self-modifying code instead?

    For example, the following Fortran code

    goto (10,20,30,40) I @ will jump to label 10 if I = 1, 20 if I = 2, etc.

    would be compiled to something like (add any required "bounds checking"
    for I)

    load R1,I
    Jump $,R1
    Jump 10
    Jump 20
    Jump 30
    Jump 40

    No code modification nor indirection required.

    Yes, it does require execution of an "extra" jump instruction.
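
    For comparison, the same computed GOTO written in C (hypothetical labels)
    is just a dense switch; on most modern targets a compiler turns this into
    an indirect jump through a jump table, whereas the sequence above jumps
    into a run of jump instructions and so needs neither an indirect jump nor
    code modification:

    void computed_goto(int i)
    {
        switch (i) {                /* goto (10,20,30,40) I */
        case 1: goto L10;
        case 2: goto L20;
        case 3: goto L30;
        case 4: goto L40;
        default: return;            /* the "bounds checking" for I */
        }
    L10: /* ... */ return;
    L20: /* ... */ return;
    L30: /* ... */ return;
    L40: /* ... */ return;
    }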


    I bet that it ends up in self-modifying code, too, because these architectures usually don't have indirect jumps through jump tables,
    either.

    Not required.


    If they had, the easy way to implement indirect branches
    without self-modifying code would be to have a one-entry jump table,
    store the target in that entry, and then perform an indirect jump
    through that jump table.

    and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone
    architectures using self-modifying code are bad by association, then
    we have to get rid of all of these language features ASAP.

    And, by and large, they have.

    We have gotten rid of indirect calls, e.g., in higher-order functions
    in functional programming languages? We have gotten rid of dynamic
    method dispatch in object-oriented programs.

    No, and I defer to you, or others here, on how these features are
    implemented, specifically whether code modification is required. I was referring to features such as assigned goto in Fortran, and Alter goto
    in Cobol.


    Thinking about the things that self-modifying code has been used for
    on some architecture, IIRC that also includes array indexing. So have
    we gotten rid of array indexing in programming languages?

    Of course not. But I suspect that we have "gotten rid of" any
    architecture that *requires* code modification for array indexing.


    One interesting aspect here is that the Fortran assigned goto and GNU
    C's goto * (to go with labels-as-values) look more like something that
    may have been inspired by a modern indirect branch than by
    self-modifying code.

    Well, the Fortran feature was designed in what, the late 1950s? Back
    then, self modifying code wasn't considered as bad as it now is.

    Did you read what you are replying to?

    Does the IBM 704 (for which FORTRAN has been designed originally)
    support indirect branches, or was it necessary to implement the
    assigned goto (and computed goto) with self-modifying code on that architecture?

    I don't know what the 704 implemented, but I have shown above that self
    modifying code is not necessary for computed goto, and I suspect
    assigned goto was implemented with self modifying code. But as I said,
    back then self modifying code was not considered as bad as it is now.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Nov 7 17:29:07 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 11/6/2025 11:38 AM, Thomas Koenig wrote:

    [...]

    Here is the head of an output of a little script I wrote to count
    all floating-point constants from My66000 assembler. Note that
    the compiler is for the version that does not yet do 0.5 etc as
    floating point. The first number is the number of occurrences,
    the second one is the constant itself.

    5-bit constants: 886
    32-bit constants: 566
    64-bit constants: 597
    303 0
    290 1
    96 0.5
    81 6
    58 -1
    58 1e-14
    49 2
    46 -2
    45 -8.98846567431158e+307
    44 10
    44 255
    37 8.98846567431158e+307
    29 -0.5
    28 3
    27 90
    27 360
    26 -1e-05
    21 0.0174532925199433
    20 0.9
    18 -3
    17 180
    17 0.1
    17 0.01
    [...]

    Interesting! No values related to pi? And what are the ...e+307 used for?

    If you look closely, you'll see pi/180 in that list. But pi is
    also there (I cut it off the list), it occurs 11 times. And the
    large numbers are +/- DBL_MAX*0.5, I don't know what they are
    used for.

    By comparison, here are the values which are most frequently
    contained in GSL:

    5-bit constants: 5148
    32-bit constants: 3769
    64-bit constants: 3140
    2678 1
    1518 0
    687 -1
    424 2
    329 0.5
    298 -2
    291 2.22044604925031e-16
    275 4.44089209850063e-16
    273 3
    132 -3
    131 -0.5
    131 3.14159265358979
    88 4
    86 1.34078079299426e+154
    77 6
    70 0.25
    70 5
    68 2.2250738585072e-308
    66 10
    64 -4
    50 -6
    46 0.1
    45 5.87747175411144e-39
    43 0.333333333333333
    42 1e+50
    38 6.28318530717959
    35 9
    31 0.2
    30 7
    30 -0.25

    [...]

    So, having values between -15.5 and +15.5 is a choice that will
    cover quite a few floating point constants. For different packages,
    FP constant distributions probably vary too much to create something
    that is much more useful.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 7 17:15:59 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/7/2025 2:09 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/6/2025 12:46 AM, Anton Ertl wrote:
    On such architectures switch would also be implemented by modifying
    the code,

    I don't think so. Switch can, and I understand usually is, implemented
    via an index into a jump table. No self modifying code required.

    What does "index into a jump table" mean in one of those architectures
    that did not have indirect jumps and used self-modifying code instead?

    For example, the following Fortran code

    goto (10,20,30,40) I @ will jump to label 10 if I = 1, 20 if I = 2, etc.

    would be compiled to something like (add any required "bounds checking"
    for I)

    load R1,I
    Jump $,R1
    Jump 10
    Jump 20
    Jump 30
    Jump 40

    Which architecture is that?

    No code modification nor indirection required.

    The "Jump $,R1" is an indirect jump. With that the assigned goto can
    be implemented as (for "GOTO X")

    load R1,X
    Jump 0,R1

    and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone >>>> architectures using self-modifying code are bad by association, then
    we have to get rid of all of these language features ASAP.

    And, by and large, they have.

    We have gotten rid of indirect calls, e.g., in higher-order functions
    in functional programming languages? We have gotten rid of dynamic
    method dispatch in object-oriented programs.

    No, and I defer to you, or others here, on how these features are implemented, specifically whether code modification is required. I was referring to features such as assigned goto in Fortran, and Alter goto
    in Cobol.

    On modern architectures higher-order functions are implemented with
    indirect branches or indirect calls (depending on whether it's a
    tail-call or not); likewise for method dispatch.

    I do not know how Lisp, FORTRAN, Algol 60 and other early languages
    with higher-order functions were implemented on architectures that do
    not have indirect branches; but if the assigned goto was implemented
    with self-modifying code, the call to a function in a variable was
    probably implemented like that, too.

    Thinking about the things that self-modifying code has been used for
    on some architecture, IIRC that also includes array indexing. So have
    we gotten rid of array indexing in programming languages?

    Of course not. But I suspect that we have "gotten rid of" any
    architecture that *requires* code modification for array indexing.

    We have also gotten rid of any architecture that requires
    self-modifying code for implementing the assigned goto.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Bill Findlay@findlaybill@blueyonder.co.uk to comp.arch on Fri Nov 7 17:54:33 2025
    From Newsgroup: comp.arch

    On 7 Nov 2025, Anton Ertl wrote
    (in article<2025Nov7.162638@mips.complang.tuwien.ac.at>):

    John Levine <johnl@taugh.com> writes:
    In languages with nested scopes, label gotos
    can jump to an outer scope so they have to unwind some frames. Back when people used such things, a common use was on an error to jump out to some recovery code.

    Pascal has that feature. Concerning error handling, jumping to an
    error handler in a statically enclosing scope has fallen out of
    favour, but throwing an exception to the next dynamically enclosing
    exception handler is supported in a number of languages.

    Function pointers have a sort of similar problem in that they need to carry along pointers to all of the enclosing frames the function can see. That is reasonably well solved by displays, give or take the infamous Knuth man or boy
    program, 13 lines of Algol 60 horror for which Knuth himself got the results wrong.

    Displays and static link chains are among the techniques that can be
    used to implement static scoping correctly, i.e., where the man-or-boy
    test produces the correct result. Knuth initially got the result
    wrong, because he only had boy compilers, and the computation is too
    involved to do it by hand.

    I append a run of MANORBOY in Pascal for the KDF9.
    No display was used.
    A static frame pointer as part of the functional parameter
    suffices logically and gives better performance.

    Paskal : the KDF9 Pascal cross-compiler V19.2a, compiled ... on 2025-11-07.
    1 u | %storage = 32767
    2 u | %ystores = 30100
    3 u |
    4 u | program MAN_OR_BOY;
    5 u |
    6 u | { See: }
    7 u | { "Man or boy?", }
    8 u | { by Donald Knuth, }
    9 u | { ALGOL Bulletin 17.2.4, p7; July 1964. }
    10 u |
    11 u | var
    12 u | i : integer;
    13 u | function A (
    14 u | k : integer;
    15 u | function x1 : integer;
    16 u | function x2 : integer;
    17 u | function x3 : integer;
    18 u | function x4 : integer;
    19 u | function x5 : integer
    20 u | ) : integer;
    21 u |
    22 u | function B : integer;
    23 u 1b| begin
    24 u | k := k - 1;
    25 u | B := A (k, B, x1, x2, x3, x4);
    26 u 1e| end { B };
    27 u |
    28 u 1b| begin { A }
    29 u | if k <= 0 then
    30 u | A := x4 + x5
    31 u | else
    32 u | A := B;
    33 u 1e| end { A };
    34 u |
    35 u | function pos_one : integer;
    36 u | begin pos_one := 1 end;
    37 u |
    38 u | function neg_one : integer;
    39 u | begin neg_one := -1 end;
    40 u |
    41 u | function zero : integer;
    42 u | begin zero := 0 end;
    43 u |
    44 u 1b| begin { MAN_OR_BOY }
    45 u | rewrite(1, 3);
    46 u | for i := 0 to 11 do
    47 u | write(A(i, pos_one, neg_one, neg_one, pos_one, zero):6);
    48 u | writeln;
    49 u 1e| end { MAN_OR_BOY }.

    Compilation complete : 0 error(s) and 0 warning(s) were reported.
    ...
    This is ee9 17.0a, compiled by GNAT ... on 2025-11-07.
    Running the KDF9 problem program Binary/MANORBOY
    ...
    Final State: Normal end of run.
    ...
    LP0 on buffer #05 printed 1 line.

    LP0:
    ===
    1 0 -2 0 1 0 1 -1 -10 -30 -67 -138
    ===
    --
    Bill Findlay

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Fri Nov 7 10:45:39 2025
    From Newsgroup: comp.arch

    On 11/7/2025 9:15 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/7/2025 2:09 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/6/2025 12:46 AM, Anton Ertl wrote:
    On such architectures switch would also be implemented by modifying
    the code,

    I don't think so. Switch can, and I understand usually is, implemented via an index into a jump table. No self modifying code required.

    What does "index into a jump table" mean in one of those architectures
    that did not have indirect jumps and used self-modifying code instead?

    For example, the following Fortran code

    goto (10,20,30,40) I @ will jump to label 10 if I = 1, 20 if I = 2, etc.
    would be compiled to something like (add any required "bounds checking"
    for I)

    load R1,I
    Jump $,R1
    Jump 10
    Jump 20
    Jump 30
    Jump 40

    Which architecture is that?

    It is generic enough that it could be lots of architectures, but the one
    I know best is the Univac 1100.



    No code modification nor indirection required.

    The "Jump $,R1" is an indirect jump.

    Perhaps we just have a terminology disagreement. I don't call that
    indirect addressing. The 1100 architecture supports indirect addressing
    in the hardware. An indirect reference was represented in the assembler
    by an asterisk preceding the label, which set a bit in the instruction
    that told the hardware to go to the address specified in the instruction
    and treat what it found there as the address of the operand for the instruction.

    So, for example:

    J *tag

    tag finaladdress

    would cause the hardware to fetch the address at tag and use that as the operand, thus causing a jump to "finaladdress".

    This is what I call indirect addressing.

    So to use this in an assigned goto, the assign statement would store the desired address at tag such that when the jump was executed, it would
    jump to the desired address.

    I call the construct with several consecutive jump instructions an
    indexed jump, not an indirect one.



    With that the assigned goto can
    be implemented as (for "GOTO X")

    load R1,X
    Jump 0,R1


    Yes.


    and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone >>>>> architectures using self-modifying code are bad by association, then >>>>> we have to get rid of all of these language features ASAP.

    And, by and large, they have.

    We have gotten rid of indirect calls, e.g., in higher-order functions
    in functional programming languages? We have gotten rid of dynamic
    method dispatch in object-oriented programs.

    No, and I defer to you, or others here, on how these features are
    implemented, specifically whether code modification is required. I was
    referring to features such as assigned goto in Fortran, and Alter goto
    in Cobol.

    On modern architectures higher-order functions are implemented with
    indirect branches or indirect calls (depending on whether it's a
    tail-call or not); likewise for method dispatch.

    I do not know how Lisp, FORTRAN, Algol 60 and other early languages
    with higher-order functions were implemented on architectures that do
    not have indirect branches; but if the assigned goto was implemented
    with self-modifying code, the call to a function in a variable was
    probably implemented like that, too.

    Thinking about the things that self-modifying code has been used for
    on some architecture, IIRC that also includes array indexing. So have
    we gotten rid of array indexing in programming languages?

    Of course not. But I suspect that we have "gotten rid of" any
    architecture that *requires* code modification for array indexing.

    We have also gotten rid of any architecture that requires
    self-modifying code for implementing the assigned goto.

    True. But we still have my original argument, better expressed by
    Niklas about code readability/followability.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Nov 7 14:28:48 2025
    From Newsgroup: comp.arch

    On 11/6/2025 1:11 PM, BGB wrote:
    On 11/6/2025 3:24 AM, Michael S wrote:
    On Wed, 05 Nov 2025 21:06:16 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
    Thomas Koenig <tkoenig@netcologne.de> posted:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever.  It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
                      1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

    Are we talking about hardware or software?
    A SW solution based on how it would be done in HW.

    Then, I suspect that you didn't understand the objection of Thomas Koenig.

    1. Format of interest is Decimal128.
    https://en.wikipedia.org/wiki/Decimal128_floating-point_format

    2. According to my understanding, Thomas didn't suggest that *slow*
    software implementation of DPD-encoded DFP, i.e. implementation that
    only cares about correctness, is hard.

    3. OTOH, he seems to suspect, and I agree with him, that *non-slow*
    software implementation, the one comparable in speed (say, within
    factor of 1.5-2) to competent implementation of the same DFP operations
    in BID format, is not easy. If at all possible.

    4. All said above assumes an absence of HW assists.



    BTW, at least for multiplication, I would probably would not do my
    arithmetic in BCD domain.
    Instead, I'd convert 10+ DPD digits to two Base_1e18 digits (11 look
    ups per operand, 22 total look ups + ~40 shifts + ~20 ANDs + ~20
    additions).

    Then I'd do multiplication and normalization and rounding in Base_1e18.

    Then I'd convert from Base_1e18 to Base_1000. The ideas of such
    conversion are similar to fast binary-to-BCD conversion that I
    demonstrated here a decade or so ago. AVX2 could be quite helpful at that
    stage.

    Then I'd have to convert the result from Base_1000 to DPD. Here, again,
    11 table look-ups + plenty of ANDs/shift/ORs seem inevitable.
    May be, at that stage SIMD gather can be of help, but I have my doubts.
    So far, every time I tried gather I was disappointed with performance.

    Overall, even with a seemingly decent plan like the one sketched above, I'd expect
    DPD multiplication to be 2.5x to 3x slower than BID. But, then again,
    in the past my early performance estimates were wrong quite often.


    I decided to start working on a mockup (quickly thrown together).
      I don't expect to have much use for it, but meh.


    It works by packing/unpacking the values into an internal format along vaguely similar lines to the .NET format, just bigger to accommodate
    more digits:
      4x 32-bit values each holding 9 digits
        Except the top one generally holding 7 digits.
      16-bit exponent, sign byte.

    Then wrote a few pack/unpack scenarios:
      X30: Directly packing 20/30 bit chunks, non-standard;
      DPD: Use the DPD format;
      BID: Use the BID format.

    For the pack/unpack step (taken in isolation):
      X30 is around 10x faster than either DPD or BID;
      Both DPD and BID need a similar amount of time.
        BID needs a bunch of 128-bit arithmetic handlers.
        DPD needs a bunch of merge/split and table lookups.
        Seems to mostly balance out in this case.


    For DPD, merge is effectively:
      Do the table lookups;
      v=v0+(v1*1000)+(v2*1000000);
    With a split step like:
      v0=v;
      v1=v/1000;
      v0-=v1*1000;
      v2=v1/1000;
      v1-=v2*1000;
      Then, use table lookups to go back to DPD.

    Did look into possible faster ways of doing the splitting, but then
    noted that I have not yet found a faster way that gives correct results
    (where one can assume the compiler already knows how to turn divide by constant into multiply by reciprocal).
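
    As a concrete (if simplified) sketch of that arithmetic, here is the
    base-1000 merge/split for one 9-digit chunk; the declet lookup tables
    (DPD <-> binary, per group of 3 digits) are omitted, and the divides by
    constants are left for the compiler to strength-reduce:

    #include <stdint.h>

    /* Merge three 0..999 groups into one 9-digit chunk (0..999999999). */
    static uint32_t merge1000(uint32_t v0, uint32_t v1, uint32_t v2)
    {
        return v0 + v1 * 1000u + v2 * 1000000u;
    }

    /* Split a 9-digit chunk back into three 0..999 groups. */
    static void split1000(uint32_t v, uint32_t *v0, uint32_t *v1, uint32_t *v2)
    {
        uint32_t t1 = v / 1000u;            /* drop the low 3 digits    */
        *v0 = v - t1 * 1000u;               /* low group                */
        uint32_t t2 = t1 / 1000u;           /* drop the middle 3 digits */
        *v1 = t1 - t2 * 1000u;              /* middle group             */
        *v2 = t2;                           /* high group               */
    }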


    At first it seemed like a strong reason to favor X30 over either DPD or
    BID. Except, that the cost of the ADD and MUL operations effectively
    dwarf that of the pack/unpack operations, so the relative cost
    difference between X30 and DPD may not matter much.


    As is, it seems MUL and ADD cost roughly 6x more than the
    DPD pack/unpack steps.

    So, it seems, while DPD pack/unpack isn't free, it is not something that would lead to X30 being a decisive win either in terms of performance.



    It might make more sense, if supporting BID, to just do it as its own
    thing (and embrace just using a bunch of 128-bit arithmetic, and a 128*128=>256 bit widening multiply, ...). Also, can note that the BID
    case ends up needing a lot more clutter, mostly again because C lacks
    native support for 128-bit arithmetic.

    If working based on digit chunks, likely better to stick with DPD due to less clutter, etc. Though, this part would be less bad if C had had widespread support for 128-bit integers.



    Though, in this case, the ADD and MUL operations currently work by internally doubling the width and then narrowing the result after normalization. This is slower, but could give exact results.


    Though, still not complete nor confirmed to produce correct results.



    But, yeah, might be more worthwhile to look into digit chunking:
      12x  3 digits (16b chunk)
      4x   9 digits (32b chunk)
      2x  18 digits (64b chunk)
      3x  12 digits (64b chunk)

    Likely I think:
    3 digits, likely slower because of needing significantly more operations;
    9 digits, seemed sensible, option I went with, internal operations fully
    fit within the limits of 64 bit arithmetic;
    18 digits, possible, but runs into many cases internally that would
    require using 128-bit arithmetic.

    12 digits, fits more easily into 64-bit arithmetic, but would still sometimes exceed it; and isn't that much more than 9 digits (but would reduce the number of chunks needed from 4 to 3).


    While 18 digits conceptually needs fewer abstract operations than 9
    digits, it would suffer the drawback of many of these operations being notably slower.

    However, if running on RV64G with the standard ABI, it is likely the 9-digit case would also take a performance hit due to sign-extended
    unsigned int (and needing to spend 2 shifts whenever zero-extending a value).


    With 3x 12 digits, while not exactly the densest scheme, there is a little
    more "working space", so it would reduce the cases which exceed the limits of
    64-bit arithmetic. Well, except multiply, where 24 > 18 ...

    The main merit of 9 digit chunking here being that it fully stays within
    the limits of 64-bit arithmetic (where multiply temporarily widens to working with 18 digits, but then narrows back to 9 digit chunks).

    Also 9 digit chunking may be preferable when one has a faster 32*32=>64
    bit multiplier, but 64*64=>128 is slower.


    One other possibility could be to use BCD rather than chunking, but I
    expect BCD emulation to be painfully slow in the absence of ISA level helpers.


    I don't know yet if my implementation of DPD is actually correct.

    Seems Decimal128 DPD is obscure enough that I don't currently have any alternate options to confirm if my encoding is correct.

    Here is an example value:
    2DFFCC1AEB53B3FB_B4E262D0DAB5E680

    Which, in theory, should resemble PI.


    Annoyingly, it seems like pretty much everyone else either went with
    BID, or with other non-standard Decimal encodings.

    Can't seem to find:
    Any examples of hard-coded numbers in this format on the internet;
    Any obvious way to generate them involving "stuff I already have".
    As, in, not going and using some proprietary IBM library or similar.

    Also Grok wasn't much help here, just keeps trying to use Python's
    "decimal", which quickly becomes obvious is not using Decimal128 (much
    less DPD), but seemingly some other 256-bit format.

    And, Grok fails to notice that what it is saying is nowhere close to
    correct in this case.

    Neither DeepSeek nor QWen being much help either... Both just sort of go
    down a rabbit hole, and eventually fall back to "Here is how you might
    go about trying to decode this format...".


    Not helpful, I more would just want some way to confirm whether or not I
    got the format correct.

    Which is easier if one has some example numbers or something that they
    can decode and verify the value, or something that is able to decode
    these numbers (which isn't just trying to stupidly shove it into
    Python's Decimal class...).


    Looking around, there is Decimal128 support in MongoDB/BSON, PyArrow,
    and Boost C++, but in these cases, less helpful because they went with BID.

    ...




    Checking, after things are a little more complete, in MHz (millions of
    times per second), on my desktop PC:
    DPD Pack/Unpack: 63.7 MHz (58 cycles)
    X30 Pack/Unpack: 567 MHz ( 7 cycles) ?...

    FMUL (unwrap) : 21.0 MHz (176 cycles)
    FADD (unwrap) : 11.9 MHz (311 cycles)

    FDIV : 0.4 MHz (very slow; Newton Raphson)

    FMUL (DPD) : 11.2 MHz (330 cycles)
    FADD (DPD) : 8.6 MHz (430 cycles)
    FMUL (X30) : 12.4 MHz (298 cycles)
    FADD (X30) : 9.8 MHz (378 cycles)

    The relative performance impact of the wrap/unwrap step is somewhat
    larger than expected (vs the unwrapped case).

    Though, there seems to only be a small difference here between DPD and
    X30 (so, likely whatever is affecting performance here is not directly
    related to the cost of the pack/unpack process).

    The wrapped cases basically just add a wrapper function that unpacks the
    input values to the internal format, and then re-packs the result.

    For using the wrapped functions to estimate pack/unpack cost:
    DPD cost: 51 cycles.
    X30 cost: 41 cycles.


    Not really a good way to make X30 much faster. It does pay for the cost
    of dealing with the combination field.

    Not sure why they would be so close:
    DPD case does a whole lot of stuff;
    X30 case is mostly some shifts and similar.

    Though, in this case, it does use these functions by passing/returning
    structs by value. It is possible a by-reference design might be faster
    in this case.


    This could possibly be cheapened slightly by going to, say:
    S.E13.M114
    In effect trading off some exponent range for cheaper handling of the exponent.


    Can note:
    MUL and ADD use double-width internal mantissa, so should be accurate;
    Current test doesn't implement rounding modes though, could do so.
    Currently hard-wired at Round-Nearest-Even.

    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary FP.
    Partly as the strategy for generating the initial guess is far less
    accurate.

    So, it first uses a loop with hard-coded checks and scales to get it in
    the general area, before then letting N-R take over. If the value isn't
    close enough (seemingly +/- 25% or so), N-R flies off into space.

    Namely:
    Exponent is wrong:
    Scale by factors of 2 until correct;
    Off by more than 50%, scale by +/- 25%;
    Off by more than 25%, scale by +/- 12.5%;
    Else: Good enough, let normal N-R take over.

    Precondition step is usually simpler with Binary-FP as the initial guess
    is usually within the correct range. So, one can use a single modified
    N-R step (that undershoots) followed by letting N-R take over.

    More of an issue though when the initial guess is "maybe within a factor
    of 10" because the usual reciprocal-approximation strategy used for
    Binary-FP isn't quite as effective.
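
    For reference, the refinement itself is the standard Newton-Raphson
    reciprocal recurrence r <- r*(2 - d*r); an illustrative sketch in plain
    double arithmetic (not the decimal-chunk code) looks like:

    /* Newton-Raphson reciprocal refinement: each step roughly doubles the
       number of correct digits, provided the initial guess r0 is already
       close enough (here, roughly within +/-25% of 1/d). */
    static double nr_recip(double d, double r0)
    {
        double r = r0;
        for (int i = 0; i < 6; i++)
            r = r * (2.0 - d * r);
        return r;
    }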


    ...


    Still don't have a use-case, mostly just messing around with this...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Nov 7 22:57:14 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:
    --------------snip---------------

    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary FP.
    Partly as the strategy for generating the initial guess is far less accurate.

    Binary FDIV NR uses a 9-bit in, 11-bits out table which results in
    an 8-bit accurate first iteration result.

    Other than DFP not being normalized, once you find the HoD, you should
    be able to use something like a 10-bit in 13-bit out table to get the
    first 2 decimal digits correct, and N-R from there.

    That 10-bits in could be the packed DFP representation (it's denser and
    has smaller tables). This way, table lookup overlaps unpacking.
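
    A rough sketch of such a seed table for the binary case (sizes and
    scaling here are illustrative, not any particular hardware's values):
    index by the top 9 fraction bits of a significand m in [1,2) and store
    an 11-bit approximation of 1/m.

    #include <stdint.h>

    static uint16_t recip_seed[512];            /* 9 bits in, 11 bits out */

    static void init_recip_seed(void)
    {
        for (int i = 0; i < 512; i++) {
            double m = 1.0 + (i + 0.5) / 512.0;     /* interval midpoint  */
            recip_seed[i] = (uint16_t)(2048.0 / m); /* 1/m scaled by 2^11 */
        }
    }

    /* frac9 = top 9 bits of the significand's fraction; result is in (0.5, 1],
       good to roughly 8 bits, ready for a Newton-Raphson step to extend. */
    static double recip_guess(int frac9)
    {
        return recip_seed[frac9] / 2048.0;
    }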
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Nov 7 20:23:40 2025
    From Newsgroup: comp.arch

    On 11/7/2025 4:57 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:
    --------------snip---------------

    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary FP.
    Partly as the strategy for generating the initial guess is far less
    accurate.

    Binary FDIV NR uses a 9-bit in, 11-bits out table which results in
    an 8-bit accurate first iteration result.

    Other than DFP not being normalized, once you find the HoD, you should
    be able to use something like a 10-bit in 13-bit out table to get the
    first 2 decimal digits correct, and N-R from there.

    That 10-bits in could be the packed DFP representation (its denser and
    has smaller tables). This way, table lookup overlaps unpacking.


    FWIW: Dump of the test code as it exists...
    https://pastebin.com/NcvCi5gD

    I had since found the decNumber library, and with this was able to
    confirm that I had in-fact figured out the specifics of the format (I
    was unsure whether or not my version was correct; as I had implemented
    it based mostly on descriptions of the format on Wikipedia; which were
    not entirely consistent).

    Otherwise, experiment / proof of concept.
    Unlikely to actually be useful.



    Way I had usually started out with binary FDIV/reciprocal:
    Turn the reciprocal into a modified integer subtract;
    Or, subtract for HOB's, everything else is a bitwise inversion.
    Can often get within the top 4 bits of the mantissa or so.

    Way I had tried to do so for decimal:
    Invert the exponent in a similar way as binary FP;
    Set the mantissa to the 9s complement value.


    Issue:
    The 9s complement method doesn't give a value particularly close to the
    actual target value.

    For example:
    Taking the reciprocal of 3.14159x, I get 0.685840x, but the actual target is 0.318309x.

    Like, I almost may as well just leave the mantissa as-is, or fill it
    with all 5s or something.


    Granted, feeding the high 3 digits through a lookup table and just
    setting all the low digits to whatever is probably also an option, and probably faster than using an initial coarse convergence to try to get
    it somewhere in the right general area.


    I realized after finding decNumber and using it to generate a test
    number, that it seems to use the format in a very different way,
    effectively keeping the value right-aligned and normalized, rather than left-aligned and normalized.

    My code sort of assumed keeping values normalized (as with traditional floating point).

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Nov 7 22:18:08 2025
    From Newsgroup: comp.arch

    On 2025-11-07 3:28 p.m., BGB wrote:
    On 11/6/2025 1:11 PM, BGB wrote:
    On 11/6/2025 3:24 AM, Michael S wrote:
    On Wed, 05 Nov 2025 21:06:16 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
    Thomas Koenig <tkoenig@netcologne.de> posted:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
                      1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

    Are we talking about hardware or software?
    A SW solution based on how it would be done in HW.

    Then, I suspect that you didn't understand the objection of Thomas Koenig.

    1. Format of interest is Decimal128.
    https://en.wikipedia.org/wiki/Decimal128_floating-point_format

    2. According to my understanding, Thomas didn't suggest that *slow*
    software implementation of DPD-encoded DFP, i.e. implementation that
    only cares about correctness, is hard.

    3. OTOH, he seems to suspect, and I agree with him, that *non-slow*
    software implementation, the one comparable in speed (say, within
    factor of 1.5-2) to competent implementation of the same DFP operations
    in BID format, is not easy. If at all possible.

    4. All said above assumes an absence of HW assists.



    BTW, at least for multiplication, I would probably would not do my
    arithmetic in BCD domain.
    Instead, I'd convert 10+ DPD digits to two Base_1e18 digits (11 look
    ups per operand, 22 total look ups + ~40 shifts + ~20 ANDs + ~20
    additions).

    Then I'd do multiplication and normalization and rounding in Base_1e18.

    Then I'd convert from Base_1e18 to Base_1000. The ideas of such
    conversion are similar to fast binary-to-BCD conversion that I
    demonstrated here a decade or so ago. AVX2 could be quite helpful at that
    stage.

    Then I'd have to convert the result from Base_1000 to DPD. Here, again,
    11 table look-ups + plenty of ANDs/shift/ORs seem inevitable.
    May be, at that stage SIMD gather can be of help, but I have my doubts.
    So far, every time I tried gather I was disappointed with performance.

    Overall, even with a seemingly decent plan like the one sketched above, I'd expect DPD multiplication to be 2.5x to 3x slower than BID. But, then again,
    in the past my early performance estimates were wrong quite often.


    I decided to start working on a mockup (quickly thrown together).
       I don't expect to have much use for it, but meh.


    It works by packing/unpacking the values into an internal format along
    vaguely similar lines to the .NET format, just bigger to accommodate
    more digits:
       4x 32-bit values each holding 9 digits
         Except the top one generally holding 7 digits.
       16-bit exponent, sign byte.

    Then wrote a few pack/unpack scenarios:
       X30: Directly packing 20/30 bit chunks, non-standard;
       DPD: Use the DPD format;
       BID: Use the BID format.

    For the pack/unpack step (taken in isolation):
       X30 is around 10x faster than either DPD or BID;
       Both DPD and BID need a similar amount of time.
         BID needs a bunch of 128-bit arithmetic handlers.
         DPD needs a bunch of merge/split and table lookups.
         Seems to mostly balance out in this case.


    For DPD, merge is effectively:
       Do the table lookups;
       v=v0+(v1*1000)+(v2*1000000);
    With a split step like:
       v0=v;
       v1=v/1000;
       v0-=v1*1000;
       v2=v1/1000;
       v1-=v2*1000;
       Then, use table lookups to go back to DPD.

    Did look into possible faster ways of doing the splitting, but then
    noted that I have not yet found a faster way that gives correct results
    (where one can assume the compiler already knows how to turn divide by
    constant into multiply by reciprocal).


    At first it seemed like a strong reason to favor X30 over either DPD
    or BID. Except, that the cost of the ADD and MUL operations
    effectively dwarf that of the pack/unpack operations, so the relative
    cost difference between X30 and DPD may not matter much.


    As is, it seems MUL and ADD cost roughly 6x more than the
    DPD pack/unpack steps.

    So, it seems, while DPD pack/unpack isn't free, it is not something
    that would lead to X30 being a decisive win either in terms of
    performance.



    It might make more sense, if supporting BID, to just do it as its own
    thing (and embrace just using a bunch of 128-bit arithmetic, and a
    128*128=>256 bit widening multiply, ...). Also, can note that the BID
    case ends up needing a lot more clutter, mostly again because C lacks
    native support for 128-bit arithmetic.

    If working based on digit chunks, likely better to stick with DPD due
    to less clutter, etc. Though, this part would be less bad if C had had
    widespread support for 128-bit integers.



    Though, in this case, the ADD and MUL operations currently work by
    internally doubling the width and then narrowing the result after
    normalization. This is slower, but could give exact results.


    Though, still not complete nor confirmed to produce correct results.



    But, yeah, might be more worthwhile to look into digit chunking:
       12x  3 digits (16b chunk)
       4x   9 digits (32b chunk)
       2x  18 digits (64b chunk)
       3x  12 digits (64b chunk)

    Likely I think:
    3 digits, likely slower because of needing significantly more operations;
    9 digits, seemed sensible, option I went with, internal operations
    fully fit within the limits of 64 bit arithmetic;
    18 digits, possible, but runs into many cases internally that would
    require using 128-bit arithmetic.

    12 digits, fits more easily into 64-bit arithmetic, but would still
    sometimes exceed it; and isn't that much more than 9 digits (but would
    reduce the number of chunks needed from 4 to 3).


    While 18 digits conceptually needs fewer abstract operations than 9
    digits, it would suffer the drawback of many of these operations being
    notably slower.

    However, if running on RV64G with the standard ABI, it is likely the
    9-digit case would also take a performance hit due to sign-extended
    unsigned int (and needing to spend 2 shifts whenever zero-extending a
    value).


    With 3x 12 digits, while not exactly the densest scheme, there is a
    little more "working space", so it would reduce the cases which exceed the
    limits of 64-bit arithmetic. Well, except multiply, where 24 > 18 ...

    The main merit of 9 digit chunking here being that it fully stays
    within the limits of 64-bit arithmetic (where multiply temporarily
    widens to working with 18 digits, but then narrows back to 9 digit
    chunks).

    Also 9 digit chunking may be preferable when one has a faster
    32*32=>64 bit multiplier, but 64*64=>128 is slower.


    One other possibility could be to use BCD rather than chunking, but I
    expect BCD emulation to be painfully slow in the absence of ISA level
    helpers.


    I don't know yet if my implementation of DPD is actually correct.

    Seems Decimal128 DPD is obscure enough that I don't currently have any alternate options to confirm if my encoding is correct.

    Here is an example value:
      2DFFCC1AEB53B3FB_B4E262D0DAB5E680

    Which, in theory, should resemble PI.


    Annoyingly, it seems like pretty much everyone else either went with
    BID, or with other non-standard Decimal encodings.

    Can't seem to find:
      Any examples of hard-coded numbers in this format on the internet;
      Any obvious way to generate them involving "stuff I already have".
        As, in, not going and using some proprietary IBM library or similar.

    Also Grok wasn't much help here, just keeps trying to use Python's "decimal", which quickly becomes obvious is not using Decimal128 (much
    less DPD), but seemingly some other 256-bit format.

    And, Grok fails to notice that what it is saying is nowhere close to
    correct in this case.

    Neither DeepSeek nor QWen being much help either... Both just sort of go down a rabbit hole, and eventually fall back to "Here is how you might
    go about trying to decode this format...".


    Not helpful, I more would just want some way to confirm whether or not I
    got the format correct.

    Which is easier if one has some example numbers or something that they
    can decode and verify the value, or something that is able to decode
    these numbers (which isn't just trying to stupidly shove it into
    Python's Decimal class...).


    Looking around, there is Decimal128 support in MongoDB/BSON, PyArrow,
    and Boost C++, but in these cases, less helpful because they went with BID.

    ...




    Checking, after things are a little more complete, in MHz (millions of
    times per second), on my desktop PC:
      DPD Pack/Unpack: 63.7 MHz (58 cycles)
      X30 Pack/Unpack: 567 MHz  ( 7 cycles) ?...

      FMUL (unwrap)  : 21.0 MHz (176 cycles)
      FADD (unwrap)  : 11.9 MHz (311 cycles)

      FDIV           :  0.4 MHz (very slow; Newton Raphson)

      FMUL (DPD)     : 11.2 MHz (330 cycles)
      FADD (DPD)     :  8.6 MHz (430 cycles)
      FMUL (X30)     : 12.4 MHz (298 cycles)
      FADD (X30)     :  9.8 MHz (378 cycles)

    The relative performance impact of the wrap/unwrap step is somewhat
    larger than expected (vs the unwrapped case).

    Though, there seems to only be a small difference here between DPD and
    X30 (so, likely whatever is affecting performance here is not directly related to the cost of the pack/unpack process).

    The wrapped cases basically just add a wrapper function that unpacks the input values to the internal format, and then re-packs the result.

    For using the wrapped functions to estimate pack/unpack cost:
      DPD cost: 51 cycles.
      X30 cost: 41 cycles.


    Not really a good way to make X30 much faster. It does pay for the cost
    of dealing with the combination field.

    Not sure why they would be so close:
      DPD case does a whole lot of stuff;
      X30 case is mostly some shifts and similar.

    Though, in this case, it does use these functions by passing/returning structs by value. It is possible a by-reference design might be faster
    in this case.


    This could possibly be cheapened slightly by going to, say:
      S.E13.M114
    In effect trading off some exponent range for cheaper handling of the exponent.


    Can note:
      MUL and ADD use double-width internal mantissa, so should be accurate;
      Current test doesn't implement rounding modes though, could do so.
        Currently hard-wired at Round-Nearest-Even.

    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary FP.
    Partly as the strategy for generating the initial guess is far less accurate.

    So, it first uses a loop with hard-coded checks and scales to get it in
    the general area, before then letting N-R take over. If the value isn't close enough (seemingly +/- 25% or so), N-R flies off into space.

    Namely:
      Exponent is wrong:
        Scale by factors of 2 until correct;
      Off by more than 50%, scale by +/- 25%;
      Off by more than 25%, scale by +/- 12.5%;
      Else: Good enough, let normal N-R take over.

    Precondition step is usually simpler with Binary-FP as the initial guess
    is usually within the correct range. So, one can use a single modified
    N-R step (that undershoots) followed by letting N-R take over.

    More of an issue though when the initial guess is "maybe within a factor
    of 10" because the usual reciprocal-approximation strategy used for Binary-FP isn't quite as effective.


    ...


    Still don't have a use-case, mostly just messing around with this...



    When I built my decimal float code I ran into the same issue. There are
    not really examples on the web. I built integer to decimal-float and decimal-float to integer converters then compared results.

    Some DFP encodings for 1,10,100,1000,1000000,12345678 (I hope these are
    right, no guarantees).
    Integer (hex)                        Decimal-float (hex)
    u 00000000000000000000000000000001 25ffc000000000000000000000000000
    u 0000000000000000000000000000000a 26000000000000000000000000000000
    u 00000000000000000000000000000064 26004000000000000000000000000000
    u 000000000000000000000000000003e8 26008000000000000000000000000000
    u 000000000000000000000000000f4240 26014000000000000000000000000000
    u 00000000000000000000000000bc614e 2601934b9c0c00000000000000000000
    u 00000000000000000000000000000002 29ffc000000000000000000000000000


    I have used the decimal float code (96 bit version) with Tiny BASIC and
    it seems to work.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Nov 7 22:30:36 2025
    From Newsgroup: comp.arch

    Cache-line constants were tried with the StarkCPU and seemed to work
    fine, but wasted cache-line space when constants and instructions could
    not be packed evenly into the cache-line.

    However, for Qupls2026 using constants stored on the cache-line might be
    just as efficient storage wise as having the constants follow
    instruction words because of the 48-bit word width. Constants typically
    do not need to be multiples of 48 bits. If stored on the cache-line they
    could be multiples of 16-bits. There are potentially 32-bits of wasted
    space if an instruction is not able to be packed onto the cache-line.
    There may just be as much wasted space due to the support of over-sized constants in-line with 48-bit parcels. A 32-bit constant uses 48 bits,
    wasting 16-bits of storage.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Nov 8 00:34:37 2025
    From Newsgroup: comp.arch

    <snip>>
    Here is an example value:
      2DFFCC1AEB53B3FB_B4E262D0DAB5E680

    <snip>

    I multiplied PI by 10^31 and ran it through the int to decimal-float converter. It should give the same sequence of digits although the
    exponent may be off.

    2e078c2aeb53b3fbb4e262d0dab5e680

    The sequence of digits is the same, except it begins C2 instead of C1.

    <snip>

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Nov 8 01:30:43 2025
    From Newsgroup: comp.arch

    On 11/7/2025 11:34 PM, Robert Finch wrote:
    <snip>>
    Here is an example value:
       2DFFCC1AEB53B3FB_B4E262D0DAB5E680

    <snip>

    I multiplied PI by 10^31 and ran it through the int to decimal-float converter. It should give the same sequence of digits although the
    exponent may be off.

    2e078c2aeb53b3fbb4e262d0dab5e680

    The sequence of digits is the same, except it begins C2 instead of C1.


    Does appear to work, mostly, but decodes as:
    31425926535897932384626433832795.0

    Well, except some of the digits don't match up with PI...

    one of the examples from the prior post decodes as:
    12345678.0


    But, yeah, mostly getting consistency across multiple implementations
    does imply that I have implemented the base format correctly.


    As for use-case, this is less clear. It is likely to be slower than the
    usual Binary128 format.

    And, likewise, it would appear that BID is slightly more popular, though
    both less common than people just rolling their own formats.

    So, it looks like:
    Boost, MongoDB, PyArrow: BID
    Python, Java: Custom formats
    .NET: Custom format.

    Leaving mine, yours, and IBM's decNumber, as using DPD.

    It looks like decNumber is using BCD internally.
    Mine is using a "9 digits in 32-bit chunks" scheme.

    In the case of the .NET format, it uses 9-digit chunks, so it is pretty
    obvious it is probably using 9-digit chunks internally as well.


    I left the BID code out of my example.

    Partly because I realized the reason the BID case was coming out at basically
    the same speed as DPD was that I was in effect still using DPD. If BID
    were actually used, it would be somewhat slower than DPD.

    It is more likely that for BID to be effective, it would need to be implemented directly using 128-bit math (likely as its own thing).

    I also had my experimental X30 variant, which can be slightly faster
    than DPD, but seems the relative savings would be small. Though, the
    cost estimates in my microbenchmarks are not showing consistent results.
    It is looking like some sort of weirdness is going on.


    Also the micro benchmarks don't test for values with varied levels of normalization, which is likely to affect performance.

    And, can note that it seems my code and decNumber was very different
    regarding the handling of normalization.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Nov 8 10:02:24 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    void engine(char *source)
    {
    void *insts[] = {&&add, &&load, &&ip, ...};

    void **ip=compile_to_vm_code(source,insts);

    goto *ip++;

    add:
    ...
    goto *ip++;

    One problem with assigned GOTO is data-flow analysis for a compiler.

    Compilers typically break down structured control flow into GOTO
    and then perform analysis. A label whose address is assigned
    anywhere in the program unit to a variable must be considered to
    be reachable by any GOTO to said variable, so any variable in that
    piece of code must be in a known place (i.e. memory). If it
    is kept in a register in some places that could jump to that
    particular label, the contents of that register must be stored
    to memory before the jump is executed. Alternatively, memory
    allocation must make sure that the same register is always used.

    This was probably less of a problem when assigned goto was invented
    (I assume this was for FORTRAN 66) when few variables were kept in
    registers, and register allocation was in its infancy. Now, this is
    a much bigger impediment to optimization.
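
    As a small GNU C illustration of the same point (hypothetical code,
    labels-as-values standing in for the assigned goto): once a label's
    address escapes, the compiler must assume every indirect goto can reach
    it, so anything live across such a goto has to survive in memory or in
    a consistently chosen register.

    long f(long x, int sel)
    {
        void *p = sel ? &&lab : &&out;  /* &&lab escapes into p               */

        x = x * 3 + 1;                  /* live across the indirect goto      */
        goto *p;                        /* may reach any address-taken label  */

    lab:
        return x + 1;                   /* so x must be available here ...    */
    out:
        return 0;                       /* ... constraining its allocation    */
    }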

    In other words, assigned goto confuses both programmers and
    compilers.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Nov 8 11:28:36 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> schrieb:

    I don't know yet if my implementation of DPD is actually correct.

    The POWER ISA has a pretty good description, see the OpenPower
    foundation.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Nov 8 14:11:33 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    The constant ROM[specifier] seems to be the easiest way of taking
    5-bits and converting it into a FP number. It was only a few weeks
    ago that we changed the range from {-31..+31} to {-15.5..+15.5} as
    this covers <slightly> more fp constant uses.

    These days, I would assume that software would choose between a
    ROM and random logic with a specification. I gave this a spin,
    again using espresso, followed by Berkeley ABC.

    5-bit FP constants in My 66000 are effectively sign + magnitude,
    which makes the logic quite simple; the sign can be just passed
    through. The equations (e7 down to e0 are exponent bits, m22 down
    to m0 are mantissa bits) for converting are

    e7 = (i4) | (i3) | (i2);
    e6 = (!i4&!i3&!i2&i1) | (!i4&!i3&!i2&i0);
    e5 = (!i4&!i3&!i2&i1) | (!i4&!i3&!i2&i0);
    e4 = (!i4&!i3&!i2&i1) | (!i4&!i3&!i2&i0);
    e3 = (!i4&!i3&!i2&i1) | (!i4&!i3&!i2&i0);
    e2 = (!i4&!i3&!i2&i1) | (!i4&!i3&!i2&i0);
    e1 = (!i3&!i2&i1) | (!i3&!i2&i0) | (i4);
    e0 = (!i4&!i2&i1) | (!i4&i3);
    m22 = (!i4&!i3&i1&i0) | (!i4&i2&i1) | (i4&i3) | (i3&i2);
    m21 = (!i4&i3&i1) | (i4&i2) | (!i3&i2&i0);
    m20 = (!i4&i3&i0) | (i4&i1);
    m19 = (i4&i0);

    Sign is separate and not shown, all other mantissa bits are
    always zero. ABC, optimizing for area, turns this into (in BLIF format,
    which is halfway readable)

    .model i2f
    .inputs i4 i3 i2 i1 i0
    .outputs e7 e6 e5 e4 e3 e2 e1 e0 m22 m21 m20 m19

    .gate NOR2_X1 A1=i4 A2=i2 ZN=new_n18
    .gate INV_X1 A=i3 ZN=new_n19
    .gate NAND2_X1 A1=new_n18 A2=new_n19 ZN=e7
    .gate INV_X1 A=i1 ZN=new_n21
    .gate INV_X1 A=i0 ZN=new_n22
    .gate AOI21_X1 A=e7 B1=new_n21 B2=new_n22 ZN=e6
    .gate BUF_X1 A=e6 Z=e5
    .gate BUF_X1 A=e6 Z=e4
    .gate BUF_X1 A=e6 Z=e3
    .gate BUF_X1 A=e6 Z=e2
    .gate OR2_X1 A1=e6 A2=i4 ZN=e1
    .gate INV_X1 A=i4 ZN=new_n29
    .gate NAND2_X1 A1=new_n29 A2=i3 ZN=new_n30
    .gate INV_X1 A=new_n18 ZN=new_n31
    .gate OAI21_X1 A=new_n30 B1=new_n31 B2=new_n21 ZN=e0
    .gate AOI21_X1 A=i2 B1=new_n19 B2=i0 ZN=new_n33
    .gate NAND2_X1 A1=new_n29 A2=i1 ZN=new_n34
    .gate OAI22_X1 A1=new_n33 A2=new_n34 B1=new_n19 B2=new_n18 ZN=m22
    .gate AOI21_X1 A=i4 B1=new_n19 B2=i0 ZN=new_n36
    .gate INV_X1 A=i2 ZN=new_n37
    .gate OAI22_X1 A1=new_n36 A2=new_n37 B1=new_n30 B2=new_n21 ZN=m21
    .gate OAI22_X1 A1=new_n30 A2=new_n22 B1=new_n29 B2=new_n21 ZN=m20
    .gate NOR2_X1 A1=new_n29 A2=new_n22 ZN=m19
    .end

    The inverter gates on the input bits are not needed when they come
    from flip-flops, and I am also not sure the buffers are needed.
    If both are taken out, 14 gates are left, which is not a lot
    (I assume that this is smaller than a small ROM, but I don't know).
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sat Nov 8 10:31:54 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    No, and I defer to you, or others here, on how these features are
    implemented, specifically whether code modification is required. I was
    referring to features such as assigned goto in Fortran, and Alter goto
    in Cobol.

    On modern architectures higher-order functions are implemented with
    indirect branches or indirect calls (depending on whether it's a
    tail-call or not); likewise for method dispatch.

    I do not know how Lisp, FORTRAN, Algol 60 and other early languages
    with higher-order functions were implemented on architectures that do
    not have indirect branches; but if the assigned goto was implemented
    with self-modifying code, the call to a function in a variable was
    probably implemented like that, too.

    What architecture cannot do an indirect branch, which I assume
    means a branch/jump to a variable location in a register?
    And how would the operating system on such a machine get programs running?

    Even if an ISA did not have a JMP reg instruction one can create it
    using CALL to copy the IP to the stack where you modify it and
    RET to pop the new IP value.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 8 18:04:04 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    void engine(char *source)
    {
    void *insts[] = {&&add, &&load, &&ip, ...};

    void **ip=compile_to_vm_code(source,insts);

    goto *ip++;

    add:
    ...
    goto *ip++;

    One problem with assigned GOTO is data flow analysis for a compiler.

    Compilers typically break down structured control flow into GOTO
    and then perform analysis. A label whose address is assigned
    anywhere in the program unit to a variable must be considered to
    be reachable by any GOTO to said variable, so any variable in that
    piece of code must be in a known place (i.e. memory). If it
    is kept in a register in some places that could jump to that
    particular label, the contents of that register must be stored
    to memory before the jump is executed. Alternatively, memory
    allocation must make sure that the same register is always used.

    This was probably less of a problem when assigned goto was invented
    (I assume this was for FORTRAN 66)

    I think FORTRAN 66 inherited from FORTRAN II or even FORTRAN (1),
    it was available in WATFOR and WATFIV.

    when few variables were kept in
    registers, and register allocation was in its infancy. Now, this is
    a much bigger impediment to optimization.

    In other words, assigned goto confuses both programmers and
    compilers.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 8 18:08:28 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    The constant ROM[specifier] seems to be the easiest way of taking
    5-bits and converting it into a FP number. It was only a few weeks
    ago that we changed the range from {-31..+31} to {-15.5..+15.5} as
    this covers <slightly> more fp constant uses.

    These days, I would assume that software would choose between a
    ROM and random logic with a specification. I gave this a spin,
    again using espresso, followed by Berkeley ABC.

    5-bit FP constants in My 66000 are effectively sign + magnitude,
    which makes the logic quite simple; the sign can be just passed
    through. The equations (e7 down to e0 are exponent bits, m22 down
    to m0 are mantissa bits) for converting are

    e7 = (i4) | (i3) | (i2);
    e6 = (!i4&!i3&!i2&i1) | (!i4&!i3&!i2&i0);
    e5 = (!i4&!i3&!i2&i1) | (!i4&!i3&!i2&i0);
    e4 = (!i4&!i3&!i2&i1) | (!i4&!i3&!i2&i0);
    e3 = (!i4&!i3&!i2&i1) | (!i4&!i3&!i2&i0);
    e2 = (!i4&!i3&!i2&i1) | (!i4&!i3&!i2&i0);
    e1 = (!i3&!i2&i1) | (!i3&!i2&i0) | (i4);
    e0 = (!i4&!i2&i1) | (!i4&i3);
    m22 = (!i4&!i3&i1&i0) | (!i4&i2&i1) | (i4&i3) | (i3&i2);
    m21 = (!i4&i3&i1) | (i4&i2) | (!i3&i2&i0);
    m20 = (!i4&i3&i0) | (i4&i1);
    m19 = (i4&i0);

    Then you need a multiplexer to Mux between (double) and (float).

    With a special case of 0.0, the range is 0.5..15.5 so I think only
    3 exponent bits need computed/created:: exponent range {-1..+4}.

    Sign is separate and not shown, all other mantissa bits are
    always zero. ABC, optimizing for area, turns into (in BLIF format,
    which is halfway readable)

    .model i2f
    .inputs i4 i3 i2 i1 i0
    .outputs e7 e6 e5 e4 e3 e2 e1 e0 m22 m21 m20 m19

    .gate NOR2_X1 A1=i4 A2=i2 ZN=new_n18
    .gate INV_X1 A=i3 ZN=new_n19
    .gate NAND2_X1 A1=new_n18 A2=new_n19 ZN=e7
    .gate INV_X1 A=i1 ZN=new_n21
    .gate INV_X1 A=i0 ZN=new_n22
    .gate AOI21_X1 A=e7 B1=new_n21 B2=new_n22 ZN=e6
    .gate BUF_X1 A=e6 Z=e5
    .gate BUF_X1 A=e6 Z=e4
    .gate BUF_X1 A=e6 Z=e3
    .gate BUF_X1 A=e6 Z=e2
    .gate OR2_X1 A1=e6 A2=i4 ZN=e1
    .gate INV_X1 A=i4 ZN=new_n29
    .gate NAND2_X1 A1=new_n29 A2=i3 ZN=new_n30
    .gate INV_X1 A=new_n18 ZN=new_n31
    .gate OAI21_X1 A=new_n30 B1=new_n31 B2=new_n21 ZN=e0
    .gate AOI21_X1 A=i2 B1=new_n19 B2=i0 ZN=new_n33
    .gate NAND2_X1 A1=new_n29 A2=i1 ZN=new_n34
    .gate OAI22_X1 A1=new_n33 A2=new_n34 B1=new_n19 B2=new_n18 ZN=m22
    .gate AOI21_X1 A=i4 B1=new_n19 B2=i0 ZN=new_n36
    .gate INV_X1 A=i2 ZN=new_n37
    .gate OAI22_X1 A1=new_n36 A2=new_n37 B1=new_n30 B2=new_n21 ZN=m21
    .gate OAI22_X1 A1=new_n30 A2=new_n22 B1=new_n29 B2=new_n21 ZN=m20
    .gate NOR2_X1 A1=new_n29 A2=new_n22 ZN=m19
    .end

    The inverter gates on the input bits are not needed when they come
    from flip-flops, and I am also not sure the buffers are needed.
    If both are taken out, 14 gates are left, which is not a lot
    (I assume that this is smaller than a small ROM, but I don't know).
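
    For illustration, here is a table-lookup version of the ROM[specifier]
    idea in C. The value mapping below (0.0 plus 0.5..15.5 in steps of 0.5,
    with the sign handled separately) is assumed from the range described
    above; it is a sketch, not the actual My 66000 definition:

    #include <stdio.h>

    /* Decode a 5-bit FP-constant specifier by table lookup, equivalent in
       spirit to a 32-entry ROM; the random logic above computes the same
       value directly.  Assumed mapping: spec 0 -> 0.0, spec 1..31 ->
       0.5..15.5 in 0.5 steps, sign applied separately. */
    static double decode_fp5(unsigned spec, int negative)
    {
        static double rom[32];
        static int filled = 0;
        if (!filled) {                     /* fill the "ROM" once */
            for (int i = 0; i < 32; i++)
                rom[i] = i * 0.5;
            filled = 1;
        }
        double v = rom[spec & 31];
        return negative ? -v : v;
    }

    int main(void)
    {
        printf("%g %g %g\n", decode_fp5(0, 0), decode_fp5(1, 0),
               decode_fp5(31, 1));          /* prints: 0 0.5 -15.5 */
        return 0;
    }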

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 8 18:13:59 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    No, and I defer to you, or others here, on how these features are
    implemented, specifically whether code modification is required. I was referring to features such as assigned goto in Fortran, and Alter goto
    in Cobol.

    On modern architectures higher-order functions are implemented with indirect branches or indirect calls (depending on whether it's a
    tail-call or not); likewise for method dispatch.

    I do not know how Lisp, FORTRAN, Algol 60 and other early languages
    with higher-order functions were implemented on architectures that do
    not have indirect branches; but if the assigned goto was implemented
    with self-modifying code, the call to a function in a variable was
    probably implemented like that, too.

    What architecture cannot do an indirect branch, which I assume
    means a branch/jump to a variable location in a register?

    PDP-8, 4004, IBM 650, ... And any machine without "registers".

    And how would the operating system on such a machine get programs running?

    Load them at a known location and branch to the known location.

    Even if an ISA did not have a JMP reg instruction one can create it
    using CALL to copy the IP to the stack where you modify it and
    RET to pop the new IP value.

    Pure stack machines did a lot of this.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Nov 8 18:25:18 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    What architecture cannot do an indirect branch, which I assume
    means a branch/jump to a variable location in a register?

    Or, in case of the 6502, in memory.

    I don't know of any architecture (except maybe some one-instruction proof-of-concepts) that does not have indirect branches in one form or
    another, but I am not that familiar with architectures from the 1950s
    or some of the extremely deprived embedded-control processors.

    Maybe the thing about self-modifying code was thrown in to taint the
    assigned goto through guilt-by-association.

    Even if an ISA did not have a JMP reg instruction one can create it
    using CALL to copy the IP to the stack where you modify it and
    RET to pop the new IP value.

    In most cases that is possible (even if the return address is stored
    in a register and not on the stack), but the return addresses might
    live on a separate stack (IIRC the Intel 8008 or the 8080 has such a
    stack), and the call might be the only thing that pushes on that
    stack. But yes, in most cases, it's a good argument that even very
    deprived processors usually have some form of indirect branching.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sat Nov 8 20:56:33 2025
    From Newsgroup: comp.arch

    On Sat, 08 Nov 2025 18:25:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    What architecture cannot do an indirect branch, which I assume
    means a branch/jump to a variable location in a register?

    Or, in case of the 6502, in memory.

    I don't know of any architecture (except maybe some one-instruction proof-of-concepts) that does not have indirect branches in one form or another, but I am not that familiar with architectures from the 1950s
    or some of the extremely deprived embedded-control processors.

    Maybe the thing about self-modifying code was thrown in to taint the
    assigned goto through guilt-by-association.

    Even if an ISA did not have a JMP reg instruction one can create it
    using CALL to copy the IP to the stack where you modify it and
    RET to pop the new IP value.

    In most cases that is possible (even if the return address is stored
    in a register and not on the stack), but the return addresses might
    live on a separate stack (IIRC the Intel 8008 or the 8080 has such a
    stack), and the call might be the only thing that pushes on that
    stack. But yes, in most cases, it's a good argument that even very
    deprived processors usually have some form of indirect branching.

    - anton

    I would imagine that in old times a return instruction was less common
    than indirect addressing itself.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Nov 8 18:37:48 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    void engine(char *source)
    {
    void *insts[] = {&&add, &&load, &&ip, ...};

    void **ip=compile_to_vm_code(source,insts);

    goto *ip++;

    add:
    ...
    goto *ip++;

    One problem with assigned GOTO is data flow analysis for a compiler.

    Compilers typically break down structured control flow into GOTO
    and then perform analysis. A label whose address is assigned
    anywhere in the program unit to a variable must be considered to
    be reachable by any GOTO to said variable, so any variable in that
    piece of code must be in a known place (i.e. memory). If it
    is kept in a register in some places that could jump to that
    particular label, the contents of that register must be stored
    to memory before the jump is executed. Alternatively, memory
    allocation must make sure that the same register is always used.

    The data flow analysis for labels-as-values (and assigned goto) is
    just the same as for any other control flow. Every goto * has to be
    considered to potentially jump to any label whose address is taken
    with &&label, just as a switch has to be considered to go to any of
    the case labels, an if has to be considered to go to either of the two
    paths. Similarly, a label has to be considered to be reachable from
    any of the gotos that jump to it, and the statement behind a switch
    statement has to be considered to be reachable from any of the break
    statements in the switch statement. So, having many outgoing or
    incoming control flow edges is nothing that only labels-as-values
    produces. Consider that the replicated switch is intended to produce
    a control-flow graph that's as close as possible to the one produced
    by using labels-as-values.
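
    For concreteness, a minimal self-contained labels-as-values dispatcher
    (GCC/Clang extension); the toy opcodes and the hand-built "program"
    below are made up for illustration:

    #include <stdio.h>

    /* Toy VM with three opcodes dispatched via labels-as-values.  Every
       "goto *ip++" can reach any label whose address was taken with &&,
       which is exactly the edge structure discussed above. */
    static int run(void)
    {
        void *insts[] = { &&op_push1, &&op_add, &&op_halt };
        void *prog[]  = { insts[0], insts[0], insts[1], insts[2] }; /* 1 1 + halt */
        void **ip = prog;
        int stack[16], *sp = stack;

        goto *ip++;

    op_push1:
        *sp++ = 1;
        goto *ip++;
    op_add:
        sp--;
        sp[-1] += sp[0];
        goto *ip++;
    op_halt:
        return sp[-1];
    }

    int main(void)
    {
        printf("%d\n", run());   /* prints 2 */
        return 0;
    }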

    Concerning register allocation (never heard of memory allocation), of
    course variables have to live in the same register or memory location
    at either end of a control-flow edge; and when multiple control-flow
    edges start or end at the same point, they have to live in the same
    location for all of these edges.

    This is certainly something that gcc has known how to do from when labels-as-values were introduced in 2.0 (admittedly I only tried using
    it a few months later, when the version was already at 2.2.2).

    There have been a few episodes (e.g., in gcc-3.0 and 3.1) when gcc put
    a lot of register-memory-shuffling code in each VM instruction, but
    they were fixed, or we found a workaround (a recent case was due to auto-vectorization, and we fixed it with -fno-tree-vectorize, which
    would be counterproductive for the engine() function anyway).

    As for the control-flow, all these edges going from every goto to
    every label whose address is taken lead to a quadratic number of
    control-flow edges, so starting with gcc-3.x gcc replaced all goto *
    with gotos to a common goto *. So now you have m edges to that goto *
    (for m instances of goto * in the source code) and n edges from that
    goto * to the labels whose address is taken (for n such labels),
    resulting in n+m edges instead of n*m edges. During the 3.x and early
    4.x series gcc failed to turn the jump-to-indirect-jump instructions
    back into plain indirect-jump instructions afterwards, but they have
    fixed that later in the 4.x series, and that works now (we still have workarounds for that in Gforth).

    By contrast, clang completely drops the ball: First of all, it takes
    forever to compile the code, and then the code contains lots of
    shuffling between registers and memory, leading to low performance.
    Why is clang doing worse in 2021 (and probably in 2025, too) than gcc
    was doing in 1992?

    I described this in <2021May29.164810@mips.complang.tuwien.ac.at>,
    here are some of the data from there:

    Building gforth on a Ryzen 5800X:

    |         gcc10       clang11
    |         make -j     make -j
    | real    11.930s     33m22.542s
    | user    53.876s     143m45.884s
    | sys      3.110s     22.699s

    Running Gforth's small benchmarks:

    | Time in seconds user time
    | sieve bubble matrix fib fft
    | 0.056 0.055 0.034 0.047 0.021 Ryzen 5800X gcc-10
    | 1.100 0.933 0.970 1.265 0.560 Ryzen 5800X clang-11
    |
    |I looked at the generated code, and for a primitive like + which can
    |be done in 4 instructions and which gcc-10 does in 5 instructions:
    |
    |563FB2DED3BF: add r13,$08
    |563FB2DED3C3: add r15,$08
    |563FB2DED3C7: add r8,$00[r13]
    |563FB2DED3CB: mov rcx,-$08[r15]
    |563FB2DED3CF: jmp ecx
    |
    |clang-11 produces 183 instructions for +.

    This was probably less of a problem when assigned goto was invented
    (I assume this was for FORTRAN 66) when few variables were kept in
    registers, and register allocation was in its infancy. Now, this is
    a much bigger impediment to optimization.

    On what basis do you make this claim? Labels-as-values does not
    impede optimization, so why should the assigned goto do so?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Nov 8 19:32:47 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    This was probably less of a problem when assigned goto was invented
    (I assume this was for FORTRAN 66)

    I think FORTRAN 66 inherited from FORTRAN II or even FORTRAN (1),
    it was available in WATFOR and WATFIV.

    I looked it up: It was at least in Fortran II, according to https://archive.computerhistory.org/resources/text/Fortran/102663119.05.01.acc.pdf
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sat Nov 8 21:47:18 2025
    From Newsgroup: comp.arch

    On Sat, 08 Nov 2025 18:13:59 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    No, and I defer to you, or others here, on how these features
    are implemented, specifically whether code modification is
    required. I was referring to features such as assigned goto in
    Fortran, and Alter goto in Cobol.

    On modern architectures higher-order functions are implemented
    with indirect branches or indirect calls (depending on whether
    it's a tail-call or not); likewise for method dispatch.

    I do not know how Lisp, FORTRAN, Algol 60 and other early
    languages with higher-order functions were implemented on
    architectures that do not have indirect branches; but if the
    assigned goto was implemented with self-modifying code, the call
    to a function in a variable was probably implemented like that,
    too.

    What architecture cannot do an indirect branch, which I assume
    means a branch/jump to a variable location in a register?

    PDP-8,

    PDP-8 has an indirect jump through an address stored in memory.
    It also counts.

    4004,

    Are you sure?
    http://www.e4004.szyc.org/iset.html


    IBM 650,

    Sounds like that.
    It seems that the earlier, but more expensive, IBM 702 already had
    indirect jumps through the content of a word in memory.


    ... And any machine without "registers".


    Not necessarily.
    Indirect jump through word in memory also counts.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sat Nov 8 21:07:01 2025
    From Newsgroup: comp.arch

    It appears that Anton Ertl <anton@mips.complang.tuwien.ac.at> said:
    I don't know of any architecture (except maybe some one-instruction proof-of-concepts) that does not have indirect branches in one form or another, but I am not that familiar with architectures from the 1950s
    or some of the extremely deprived embedded-control processors.

    Some of the 1950s machines didn't have indirect branches. You got the
    effect by patching the address into a branch instruction and then
    flowing or jumping to it.

    Maybe the thing about self-modifying code was thrown in to taint the
    assigned goto through guilt-by-association.

    if you want guilt by association, the word is ALTER.

    stack. But yes, in most cases, it's a good argument that even very
    deprived processors usually have some form of indirect branching.

    I agree. Indirect addressing and indexing appeared quite early.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sat Nov 8 21:08:39 2025
    From Newsgroup: comp.arch

    According to Michael S <already5chosen@yahoo.com>:
    I would imagine that in old times a return instruction was less common
    than indirect addressing itself.

    On several of the machines I used a subroutine call stored the return
    address in the first word of the routine and branched to that address+1.
    The return was just an indirect jump.

    Stacks? What's a stack? We barely had registers.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sat Nov 8 21:14:22 2025
    From Newsgroup: comp.arch

    According to Thomas Koenig <tkoenig@netcologne.de>:
    This was probably less of a problem when assigned goto was invented
    (I assume this was for FORTRAN 66) ..

    Not 1966, 1956. It was in the original FORTRAN compiler.

    In its defense, there were no user defined subroutines so that
    was how you faked it. The biggest improvement in FORTRAN II
    was SUBROUTINE and FUNCTION.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sun Nov 9 17:06:18 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    No, and I defer to you, or others here, on how these features are
    implemented, specifically whether code modification is required. I was referring to features such as assigned goto in Fortran, and Alter goto in Cobol.

    On modern architectures higher-order functions are implemented with
    indirect branches or indirect calls (depending on whether it's a
    tail-call or not); likewise for method dispatch.

    I do not know how Lisp, FORTRAN, Algol 60 and other early languages
    with higher-order functions were implemented on architectures that do
    not have indirect branches; but if the assigned goto was implemented
    with self-modifying code, the call to a function in a variable was
    probably implemented like that, too.

    What architecture cannot do an indirect branch, which I assume
    means a branch/jump to a variable location in a register?

    PDP-8, 4004, IBM 650, ... And any machine without "registers".

    To be fair, addresses 10 through 17 in the PDP-8 were effectively auto-increment registers and indirect branches were their
    primary function. The PDP-8 accumulator is considered
    a register, plus the optional multiply hardware provided additional
    registers although they couldn't be used with branch instructions.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Nov 9 13:01:56 2025
    From Newsgroup: comp.arch

    John Levine wrote:
    According to Michael S <already5chosen@yahoo.com>:
    I would imagine that in old times a return instruction was less common
    than indirect addressing itself.

    On several of the machines I used a subroutine call stored the return
    address in the first word of the routine and branched to that address+1.
    The return was just an indirect jump.

    Stacks? What's a stack? We barely had registers.

    Yes, I saw the PDP-8 did that for JMS Jump Subroutine.
    I've never used one but it looks like by playing with the
    Indirect and Page-zero memory addressing options you could
    treat page-zero a bit like a register bank,
    but also store some short but critical routines in page-zero
    to manually move the return PC to/from a stack.
    And use indirect addressing to access its full sumptuous 4kW address space.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sun Nov 9 20:00:25 2025
    From Newsgroup: comp.arch

    According to Scott Lurndal <slp53@pacbell.net>:
    PDP-8, 4004, IBM 650, ... And any machine without "registers".

    To be fair, addresses 10 through 17 in the PDP-8 were effectively auto-increment registers and indirect branches were their
    primary function. ....

    I did a fair amount of PDP-8 programming and I don't ever recall using
    the auto-index locations for branches. They were used to step
    through a table of data, e.g. to add up a list of numbers:

    10, 1007 ; list starts at 1010

    100, -50 ; list is 50 (octal) long

    CLA
    LOOP,
    TAD I 10
    ISZ 100
    JMP LOOP
    ; sum is in the accumulator

    I suppose you could use them for threaded code, but I didn't run into
    any PDP-8 progams that used that.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sun Nov 9 20:18:31 2025
    From Newsgroup: comp.arch

    It appears that EricP <ThatWouldBeTelling@thevillage.com> said:
    On several of the machines I used a subroutine call stored the return
    address in the first word of the routine and branched to that address+1.
    The return was just an indirect jump.

    Stacks? What's a stack? We barely had registers.

    Yes, I saw the PDP-8 did that for JMS Jump Subroutine.
    I've never used one but it looks like by playing with the
    Indirect and Page-zero memory addressing options you could
    treat page-zero a bit like a register bank,
    but also store some short but critical routines in page-zero
    to manually move the return PC to/from a stack.
    And use indirect addressing to access its full sumptuous 4kW address space.

    You wouldn't put routines in page zero but you might put pointers to
    them so you could do JMS I 123 to call the routine pointed to by page
    zero location 123. We rarely did recursive stuff so there wasn't any
    need to simulate a stack.

    Storing the return address in the first word was pretty common. Even
    the PDP-6/10 had a JSR instruction that did that. On machines without
    index registers, there's no better place to put the return address.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Nov 9 21:11:52 2025
    From Newsgroup: comp.arch


    John Levine <johnl@taugh.com> posted:

    According to Michael S <already5chosen@yahoo.com>:
    I would imagine that in old times a return instruction was less common
    than indirect addressing itself.

    On several of the machines I used a subroutine call stored the return
    address in the first word of the routine and branched to that address+1.
    The return was just an indirect jump.

    Stacks? What's a stack? We barely had registers.

    Heck, back then we barely had memory !!


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Nov 9 21:14:57 2025
    From Newsgroup: comp.arch


    John Levine <johnl@taugh.com> posted:

    According to Scott Lurndal <slp53@pacbell.net>:
    PDP-8, 4004, IBM 650, ... And any machine without "registers".

    To be fair, addresses 10 through 17 in the PDP-8 were effectively auto-increment registers and indirect branches were their
    primary function. ....

    I did a fair amount of PDP-8 programming and I don't ever recall using
    the auto-index locations for branches. They were used to step
    through a table of data, e.g. to add up a list of numbers:

    10, 1007 ; list starts at 1010

    100, -50 ; list is 50 (octal) long

    CLA
    LOOP,
    TAD I 10
    ISZ 100
    JMP LOOP
    ; sum is in the accumulator

    I suppose you could use them for threaded code, but I didn't run into
    any PDP-8 progams that used that.


    Way back when (1970) I did a bunch of PDP-8 asm--but it is one of the few
    I don't remember enough about to carry a cogent conversation.

    On the other hand it had a decent ALGOL 60 compiler.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sun Nov 9 14:54:28 2025
    From Newsgroup: comp.arch

    On 11/7/2025 9:29 AM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 11/6/2025 11:38 AM, Thomas Koenig wrote:

    [...]

    Here is the head of an output of a little script I wrote to count
    all floating-point constants from My66000 assembler. Note that
    the compiler is for the version that does not yet do 0.5 etc as
    floating point. The first number is the number of occurrences,
    the second one is the constant itself.

    5-bit constants: 886
    32-bit constants: 566
    64-bit constants: 597
    303 0
    290 1
    96 0.5
    81 6
    58 -1
    58 1e-14
    49 2
    46 -2
    45 -8.98846567431158e+307
    44 10
    44 255
    37 8.98846567431158e+307
    29 -0.5
    28 3
    27 90
    27 360
    26 -1e-05
    21 0.0174532925199433
    20 0.9
    18 -3
    17 180
    17 0.1
    17 0.01
    [...]

    Interesting! No values related to pi? And what are the ...e+307 used for?

    If you look closely, you'll see pi/180 in that list. But pi is
    also there (I cut it off the list), it occurs 11 times. And the
    large numbers are +/- DBL_MAX*0.5, I don't know what they are
    used for.

    By comparision, here are the values which are most frequently
    contained in GSL:

    5-bit constants: 5148
    32-bit constants: 3769
    64-bit constants: 3140
    2678 1
    1518 0
    687 -1
    424 2
    329 0.5
    298 -2
    291 2.22044604925031e-16
    275 4.44089209850063e-16
    273 3
    132 -3
    131 -0.5
    131 3.14159265358979
    88 4
    86 1.34078079299426e+154
    77 6
    70 0.25
    70 5
    68 2.2250738585072e-308
    66 10
    64 -4
    50 -6
    46 0.1
    45 5.87747175411144e-39
    43 0.333333333333333
    42 1e+50
    38 6.28318530717959
    35 9
    31 0.2
    30 7
    30 -0.25

    [...]

    So, having values between -15.5 and +15.5 is a choice that will
    cover quite a few floating point constants.

    Agreed. And the switch from +-31 to +-15.5 seems like a very good choice.

    For different packages,
    FP constant distributions probably vary too much to create something
    that is much more useful.

    I am not convinced of that but it would take an analysis similar to what
    you did but for more packages to resolve that issue. It is an
    interesting question of what packages to use to get the most information
    out of the least number of packages. I don't know enough about package
    usage to have an opinion about that. Perhaps LAPACK to pick up SCIPY,
    one of your CFD packages, Octave????

    But given what we have, and given that it would take no additional HW
    cost, it might make sense to change the ROM table to substitute say
    3.14159... (which occurs 131 times above) for -13.5 (which I assume
    occurs approximately never :-))

    I think there is some gain in object code size to be had for things like
    this, but it is probably modest.

    One related question, and it is really a compiler question. Say I am
    writing a program and I know I will need the value of pi say 10 times in
    the source code. I decide to make my coding easier, and the source code
    more compact by creating a constant, called PI, with a value of
    3.14159..., then write the word PI instead of the numerical constant 10
    times in the source code. Will/should the compiler generate inline
    immediates for the ten references or will it generate a load of the
    actual constant variable? Tradeoffs either way.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Nov 9 17:22:28 2025
    From Newsgroup: comp.arch

    On 11/8/2025 5:28 AM, Thomas Koenig wrote:
    BGB <cr88192@gmail.com> schrieb:

    I don't know yet if my implementation of DPD is actually correct.

    The POWER ISA has a pretty good description, see the OpenPower
    foundation.

    Luckily, I have since figured it out and confirmed it.


    Otherwise, fiddled with the division algorithm some more, and it is now "slightly less awful", and converges a bit faster...

    Relatedly, also added Square-Root...



    My previous strategies for square-root didn't really work as effectively
    in this case, so just sorta fiddled with stuff until I got something
    that worked...

    Algorithm I came up with (to find sqrt(S)):
    Make an initial guess of the square root, calling it C;
    Make an initial guess for the reciprocal of C, calling it H;
    Take a few passes (threading the needle, *1):
    C[n+1] = C + (S - C*C)*(H*0.375)
    Redo approximate reciprocal of C, as H (*2);
    Refine H: H = H*(2 - C*H)
    Enter main iteration pass:
    C[n+1] = C + (S - C*C)*(H*0.5)
    H[n+1] = H*(2 - C*H)   //(*3)


    *1: Usual "try to keep stuff from flying off into space" step, using a
    scale of 0.375 to undershoot convergence and increase stability (lower
    means more stability but slower convergence; closer to 0.5 means faster,
    but more likely to "fly off into space" depending on the accuracy of the initial guesses).

    *2: Seemed better to start over from a slightly better guess of C, than
    to directly iterate from the initial (much less accurate) guess.

    *3: Noting that if H is also converged, the convergence rate for C is significantly improved (the gains from faster C convergence are enough
    to offset the added cost of also converging H).
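
    Transcribed into plain C doubles as a sanity check of the recipe above;
    the frexp-based initial guess and the exact 1.0/C reciprocal are
    stand-ins for the Decimal128 approximations, i.e. assumptions for
    illustration only (S > 0 assumed):

    #include <stdio.h>
    #include <math.h>

    /* Coupled iteration: C converges to sqrt(S), H to 1/sqrt(S). */
    static double sqrt_iter(double S)
    {
        int e;
        double m = frexp(S, &e);                   /* S = m * 2^e, m in [0.5, 1) */
        double C = ldexp((m + 1.0) * 0.5, e / 2);  /* rough guess near sqrt(S) (assumed) */
        double H = 1.0 / C;                        /* reciprocal guess; the real code approximates this */

        for (int i = 0; i < 4; i++) {     /* damped passes: 0.375 undershoots for stability */
            C = C + (S - C * C) * (H * 0.375);
            H = 1.0 / C;                  /* redo the reciprocal from the better C */
            H = H * (2.0 - C * H);        /* refine H one step */
        }
        for (int i = 0; i < 6; i++) {     /* main passes: 0.5 factor, H refined alongside C */
            C = C + (S - C * C) * (H * 0.5);
            H = H * (2.0 - C * H);
        }
        return C;
    }

    int main(void)
    {
        printf("%.17g vs %.17g\n", sqrt_iter(2.0), sqrt(2.0));
        return 0;
    }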

    Seems to be effective, though still slower than divide (which is still
    23x slower than an ADD or MUL).


    In this case, the more complex algorithm is (ironically) partly
    justified by the comparably higher relative cost per operation (and the
    issue that I can't resort to tricks like handling the floating-point
    values as integers; doesn't work so hot with Decimal128).




    Felt curious and tried asking Grok about this; it identified this approach
    as the Goldschmidt algorithm, OK. If so, kinda weird that I arrived at a
    well-known (?) algorithm mostly by fiddling with it.

    Looking on Wikipedia, though, it doesn't look like the same algorithm.


    Well, apart from some weird thing, where it initially responded in
    Arabic for some reason (seems odd, it has recently gotten smart enough
    to almost start being useful; apart from when it is being stupid, or
    just doing something weird like responding in the wrong language).

    ...


    Well, also was fiddling with code to try to improve "general
    robustness", like making the compare operation still work if inputs were
    not normalized; dealing with some related edge cases in the ADD/SUB
    logic; ...



    So, ATM, this means it now has:
    ADD, SUB, MUL, DIV, SQRT
    Compare;
    Printing and Parsing numbers as strings;
    ...

    In theory, could expand it out with other math functions if needed.


    Still unclear if there is a use-case.
    Drawback is that it is very slow, even vs Binary128.

    Well, except maybe that the Square-Root algorithm could be applicable to Binary128, which has a similar issue of slow operations. Though, in this
    case, could just copy/paste the existing double-precision code for
    long-double in my C library, which uses an unrolled Taylor Series in
    that case.

    ...



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Nov 10 02:00:26 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/7/2025 9:29 AM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 11/6/2025 11:38 AM, Thomas Koenig wrote:
    ----------snip-------------
    I think there is some gain in object code size to be had for things like this, but it is probably modest.

    The gain in instruction count is constant (sic) since one can represent
    any FP constant as an operand with 1 instruction--what we are striving
    for is code footprint.

    One related question, and it is really a compiler question. Say I am writing a program and I know I will need the value of pi say 10 times in
    the source code. I decide to make my coding easier, and the source code more compact by creating a constant, called PI, with a value of
    3.14159..., then write the word PI instead of the numerical constant 10 times in the source code. Will/should the compiler generate inline immediates for the ten references or will it generate a load of the
    actually constant variable? Tradeoffs either way.

    The number of instructions executed will be exactly the same, the size of
    the code footprint will be lower if/when the compiler can figure out
    when to allocate PI into a register for some duration.

    Currently, a) if there are free registers, and b) the constant is used
    3 times, you gain 1 word of code footprint.
    but (BUT), c) if there are no free registers, and d) the constant is
    used more than 6 times, you gain your first word of code footprint.

    So, it is a bit tricky trading off instruction count for instruction
    footprint.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Nov 10 02:12:53 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 11/8/2025 5:28 AM, Thomas Koenig wrote:
    BGB <cr88192@gmail.com> schrieb:

    I don't know yet if my implementation of DPD is actually correct.

    The POWER ISA has a pretty good description, see the OpenPower
    foundation.

    Luckily, I have since figured it out and confirmed it.


    Otherwise, fiddled with the division algorithm some more, and it is now "slightly less awful", and converges a bit faster...

    Relatedly, also added Square-Root...



    My previous strategies for square-root didn't really work as effectively
    in this case, so just sorta fiddled with stuff until I got something
    that worked...

    Algorithm I came up with (to find sqrt(S)):
    Make an initial guess of the square root, calling it C;
    Make an initial guess for the reciprocal of C, calling it H;
    Take a few passes (threading the needle, *1):
    C[n+1] = C + (S - C*C)*(H*0.375)
    Redo approximate reciprocal of C, as H (*2);
    Refine H: H = H*(2 - C*H)
    Enter main iteration pass:
    C[n+1] = C + (S - C*C)*(H*0.5)
    H[n+1] = H*(2 - C*H)   //(*3)


    *1: Usual "try to keep stuff from flying off into space" step, using a
    scale of 0.375 to undershoot convergence and increase stability (lower
    means more stability but slower convergence; closer to 0.5 means faster,
    but more likely to "fly off into space" depending on the accuracy of the initial guesses).

    *2: Seemed better to start over from a slightly better guess of C, than
    to directly iterate from the initial (much less accurate) guess.

    *3: Noting that if H is also converged, the convergence rate for C is significantly improved (the gains from faster C convergence are enough
    to offset the added cost of also converging H).

    Seems to be effective, though still slower than divide (which is still
    23x slower than an ADD or MUL).

    SQRT should be 20%-30% slower than DIV.


    In this case, the more complex algorithm being (ironically) partly
    justified by the comparably higher relative cost per operation (and the issue that I can't resort to tricks like handling the floating-point
    values as integers; doesn't work so hot with Decimal128).

    If you have binary SQRT and a quick way from DFP128 to BFP32, take SQRT
    in binary, convert back and do 2 iterations. Should be faster. {{I need
    to remind some folks that {float; float; FDIV; fix} was faster than
    IDIV on many 2nd generation RISC machines.}}

    Felt curious, tried asking Grok about this, it identified this approach
    as the Goldschmidt Algorithm, OK. If so, kinda weird that I arrived at a well known (?) algorithm mostly by fiddling with it.

    Feels like it is 1965--does it not ?!?

    Looking on Wikipedia though, this doesn't look like the same algorithm though.

    Goldschmidt is just an N-R where the arithmetic has been arranged so
    that the multiplies are not data-dependent (as they are in N-R). In
    exchange for this independence, GS lacks the automatic correction N-R has.
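
    For comparison, a minimal Goldschmidt-style division in doubles; note
    that the two multiplies in each step are independent of each other, and
    that rounding error accumulates because there is no residual to correct
    against.  The frexp pre-scaling and iteration count are assumptions for
    illustration (d > 0 assumed):

    #include <stdio.h>
    #include <math.h>

    /* Goldschmidt division N/D: scale both by the same factor F each step;
       D tends to 1 and N tends to N/D. */
    static double goldschmidt_div(double n, double d)
    {
        int e;
        double D = frexp(d, &e);          /* bring the divisor into [0.5, 1) */
        double N = ldexp(n, -e);          /* scale the dividend the same way */

        for (int i = 0; i < 5; i++) {
            double F = 2.0 - D;           /* correction factor */
            N *= F;                       /* independent multiply #1 */
            D *= F;                       /* independent multiply #2 */
        }
        return N;
    }

    int main(void)
    {
        printf("%.17g vs %.17g\n", goldschmidt_div(355.0, 113.0), 355.0 / 113.0);
        return 0;
    }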
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sun Nov 9 20:03:12 2025
    From Newsgroup: comp.arch

    On 11/9/2025 6:00 PM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/7/2025 9:29 AM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 11/6/2025 11:38 AM, Thomas Koenig wrote:
    ----------snip-------------
    I think there is some gain in object code size to be had for things like
    this, but it is probably modest.

    The gain in instruction count is constant (sic) since one can represent
    any FP constant as an operand with 1 instruction--what we are striving
    for is code footprint.

    Yes, agreed. The gain would come from being able to express highly used values (e.g. pi) rather than lesser used values (e.g. -13.5) as 5 bit immediates, thus avoiding the extra 32/64 bits of separate immediate
    word or two.


    One related question, and it is really a compiler question. Say I am
    writing a program and I know I will need the value of pi say 10 times in
    the source code. I decide to make my coding easier, and the source code
    more compact by creating a constant, called PI, with a value of
    3.14159..., then write the word PI instead of the numerical constant 10
    times in the source code. Will/should the compiler generate inline
    immediates for the ten references or will it generate a load of the
    actual constant variable? Tradeoffs either way.

    The number of instructions executed will be exactly the same,

    Yes, but execution time may not be. Presumably the load of a
    non-immediate data value might take longer, certainly so if the value is
    not in the L1 data cache.


    the size of
    the code footprint will be lower if/when the compiler can figure out
    when to allocate PI into a register for some duration.

    Yes.

    Currently, a) if there are free registers, and b) the constant is used
    3 times, you gain 1 word of code footprint.
    but (BUT), c) if there are no free registers, and d) the constant is
    used more than 6 times, you gain your first word of code footprint.

    So, it is a bit tricky trading off instruction count for instruction footprint.

    Yes. That is why I thought it was an interesting question. Your
    heuristic seems as good as any, at least to my uninformed thoughts.

    Thanks.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Nov 10 06:30:21 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> schrieb:
    On 11/8/2025 5:28 AM, Thomas Koenig wrote:
    BGB <cr88192@gmail.com> schrieb:

    I don't know yet if my implementation of DPD is actually correct.

    The POWER ISA has a pretty good description, see the OpenPower
    foundation.

    Luckily, I have since figured it out and confirmed it.

    Did you also implement the rounding modes? That's where all the
    "fun" (and utility) of decimal FP is...

    It's in section 5.5.2 of the 3.1 version of the ISA.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Nov 10 08:16:07 2025
    From Newsgroup: comp.arch

    BGB wrote:
    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary FP.
    Partly as the strategy for generating the initial guess is far less accurate.

    So, it first uses a loop with hard-coded checks and scales to get it in
    the general area, before then letting N-R take over. If the value isn't close enough (seemingly +/- 25% or so), N-R flies off into space.

    Namely:
      Exponent is wrong:
        Scale by factors of 2 until correct;
      Off by more than 50%, scale by +/- 25%;
      Off by more than 25%, scale by +/- 12.5%;
      Else: Good enough, let normal N-R take over.

    My possibly naive idea would extract the top 9-15 digits from divisor
    and dividend, convert both to binary FP, do the division and convert back.
    That would reduce the NR step to two or three iterations, right?
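
    Roughly that idea in plain C, with a float division standing in for the
    "top digits in binary" seed and doubles standing in for Decimal128; the
    two reciprocal steps and the final correction are assumptions for
    illustration (d nonzero assumed):

    #include <stdio.h>

    /* Seed a divide from a low-precision reciprocal, then refine with
       Newton-Raphson: each step roughly doubles the number of good bits. */
    static double div_refined(double n, double d)
    {
        double r = (double)(1.0f / (float)d);   /* ~24-bit reciprocal seed */
        r = r * (2.0 - d * r);                  /* N-R step 1: ~48 bits */
        r = r * (2.0 - d * r);                  /* N-R step 2: ~53 bits */
        double q = n * r;
        q = q + r * (n - d * q);                /* one correction of the quotient */
        return q;
    }

    int main(void)
    {
        printf("%.17g vs %.17g\n", div_refined(355.0, 113.0), 355.0 / 113.0);
        return 0;
    }
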
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Nov 10 08:27:56 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    No, and I defer to you, or others here, on how these features are
    implemented, specifically whether code modification is required. I was referring to features such as assigned goto in Fortran, and Alter goto in Cobol.

    On modern architectures higher-order functions are implemented with
    indirect branches or indirect calls (depending on whether it's a
    tail-call or not); likewise for method dispatch.

    I do not know how Lisp, FORTRAN, Algol 60 and other early languages
    with higher-order functions were implemented on architectures that do
    not have indirect branches; but if the assigned goto was implemented
    with self-modifying code, the call to a function in a variable was
    probably implemented like that, too.

    What architecture cannot do an indirect branch, which I assume
    means a branch/jump to a variable location in a register?

    PDP-8, 4004, IBM 650, ... And any machine without "registers".

    And how would the operating system on such a machine get programs running?

    Load them at a known location and branch to the known location.

    Even if an ISA did not have a JMP reg instruction one can create it
    using CALL to copy the IP to the stack where you modify it and
    RET to pop the new IP value.

    Pure stack machines did a lot of this.

    We even did similar stuff in low-level x86 code. For example, very early
    8088 CPUs could allow an interrupt between the loading of the stack
    pointer and the stack segment (double-plus ungood!); the fix was to
    munge the stack so that an IRET could be used instead.

    I seem to remember that there could also be a similar issue when doing a
    far return? If so, that was also solved by setting up the stack to allow
    IRET to have the same effect.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Nov 10 07:46:47 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    [indirect branches through auto-increment locations]
    I suppose you could use them for threaded code, but I didn't run into
    any PDP-8 progams that used that.

    You can use them for (direct) threaded code if the indirect branch is
    not to the auto-incremented address, but if there is one additional
    indirection involved. E.g, on RISC-V this is a direct-threaded code
    dispatch:

    addi s5,s5,8
    ld a5,0(s5)
    jr a5

    If the use of the auto-increment location would be equivalent to

    addi s5,s5,8
    jr s5

    it would not be useful for direct-threaded code.

    The paper on (direct) threaded code was only published in 1973, so
    that technique may not have been widely known at the time when much of
    the PDP-8 software was developed.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Nov 10 03:40:26 2025
    From Newsgroup: comp.arch

    On 11/9/2025 8:12 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 11/8/2025 5:28 AM, Thomas Koenig wrote:
    BGB <cr88192@gmail.com> schrieb:

    I don't know yet if my implementation of DPD is actually correct.

    The POWER ISA has a pretty good description, see the OpenPower
    foundation.

    Luckily, I have since figured it out and confirmed it.


    Otherwise, fiddled with the division algorithm some more, and it is now
    "slightly less awful", and converges a bit faster...

    Relatedly, also added Square-Root...



    My previous strategies for square-root didn't really work as effectively
    in this case, so just sorta fiddled with stuff until I got something
    that worked...

    Algorithm I came up with (to find sqrt(S)):
    Make an initial guess of the square root, calling it C;
    Make an initial guess for the reciprocal of C, calling it H;
    Take a few passes (threading the needle, *1):
    C[n+1] = C + (S - C*C)*(H*0.375)
    Redo approximate reciprocal of C, as H (*2);
    Refine H: H = H*(2 - C*H)
    Enter main iteration pass:
    C[n+1] = C + (S - C*C)*(H*0.5)
    H[n+1] = H*(2 - C*H)   //(*3)


    *1: Usual "try to keep stuff from flying off into space" step, using a
    scale of 0.375 to undershoot convergence and increase stability (lower
    means more stability but slower convergence; closer to 0.5 means faster,
    but more likely to "fly off into space" depending on the accuracy of the
    initial guesses).

    *2: Seemed better to start over from a slightly better guess of C, than
    to directly iterate from the initial (much less accurate) guess.

    *3: Noting that if H is also converged, the convergence rate for C is
    significantly improved (the gains from faster C convergence are enough
    to offset the added cost of also converging H).

    Seems to be effective, though still slower than divide (which is still
    23x slower than an ADD or MUL).

    SQRT should be 20%-30% slower than DIV.


    It is currently around 2.5x slower.


    Though, the number of loop iterations isn't that much different; rather
    the complexity of the loop is higher (as it is iterating both the square
    root and the reciprocal of the square-root).


    Compared to the version I put on pastebin, there has been around an 8x improvement to the speed of performing the divide operation.


    And, sqrt is around 3x faster than DIV in the pastebin version...


    So, at the moment:
    MUL: 19 MHz
    ADD: 13 MHz
    DIV: 0.83 MHz
    SQRT: 0.34 MHz



    In this case, the more complex algorithm being (ironically) partly
    justified by the comparably higher relative cost per operation (and the
    issue that I can't resort to tricks like handling the floating-point
    values as integers; doesn't work so hot with Decimal128).

    If you have binary SQRT and a quick way from DFP128 to BFP32, take SQRT
    in binary, convert back and do 2 iterations. Should be faster. {{I need
    to remind some folks that {float; float; FDIV; fix} was faster than
    IDIV on many 2st generation RISC machines.


    Yeah, this is a possible option.

    A hardware FPU could give much better starting values for starting the iteration.

    Depends mostly on having reasonably fast and accurate format conversion.



    Felt curious, tried asking Grok about this, it identified this approach
    as the Goldschmidt Algorithm, OK. If so, kinda weird that I arrived at a
    well known (?) algorithm mostly by fiddling with it.

    Feels like it is 1965--does it not ?!?


    I don't know there.

    Back then, my parents would have still been children...

    All I really know about this era is stuff I have seen in TV shows.



    Though, ironically, I did previously go and watch through some of the Krofft
    brothers' shows ("H.R. Pufnstuf" and "Lidsville" and similar), which were
    around when my parents were young. Kinda surreal...


    Though, it seemed like both shows were trying to create a fantastical
    world on as little budget as possible. Pufnstuf seemed more ambitious,
    but with much cheaper SFX. Lidsville was a little more conservative here,
    but generally did a better job in terms of the quality of both effects
    and costumes.

    Pufnstuf had used a lot of fabric and stuffing for costumes (sorta like
    pillows), and when puppets were used, they were often crudely constructed
    and controlled. There were a few cases where they used rigid sticks
    (though this was more a Henson thing), but more often it was pulling on
    flexible strings.

    Some small puppets used foam rubber, but it appears to have been used sparingly.


    Scenery was often indoor sets with painted backgrounds, colored tarps of
    the floors (sometimes with some sort of sand-like material on the
    tarps), and flat cut outs for plants (usually hand-painted).

    Contrast, Lidsville was less ambitious with its use of special effects,
    but when used, were typically better done. A lot of the costumes
    appeared better made as well.


    But, I guess, one can compare/contrast with other types of shows, say:
    Toho: Godzilla movies:
    Foam rubber suits and what look like a lot model train-set parts;
    Likely a lot more expensive;
    Toei: Super Sentai / Power Rangers
    Heavy use of foam rubber for costumes;
    Spandex or vinyl for protagonist suits;
    Frequent use of styrofoam for destructible objects;
    Something gets smashed/broken/exploded, often styrofoam;
    City scenes often used modified cardboard boxes;
    Or, actors super-imposed onto scenes made using miniatures;
    CGI sometimes used, but sparingly.
    And, then, mostly compositing type effects.
    The 90s show more liked using things like pyrotechnics.
    Some of the later shows used CGI for things like explosions.
    ...

    Though, did see a recent movie "Psycho Goreman" which seemed to be
    approaching special effects in a very similar way to Power Rangers (a
    lot of foam rubber and occasional "obviously bad" CGI). I suspect they
    may have been intentionally going for a Power Ranger's kinda look though.

    Contrast, likely the effects in Godzilla would have been more expensive
    than those in Power Rangers.

    But, they were still kind of a holdout for using a lot of practical
    effects, in an era where people elsewhere were rapidly jumping over to
    the use of CGI, as did most newer Godzilla movies (like, CGI isn't quite
    the same as rubber suits and puppets).

    Feels sometimes like something was lost here.



    A few times, seems like it would be funny though if a person did a show,
    but instead deliberately used Pufnstuf style effects.

    Well, and/or mixed with Ed Wood style effects.
    Like, say, a paper plate on a string for UFO;
    or BBQ lighter rocket engines...

    Or, have some costumes with some really cheap rubber masks (like the
    sort that sometimes come with Halloween costumes).
    Or maybe papercraft (like construction paper or cardstock). Maybe in combination with fabric+stuffing and googly eyes.



    Maybe also cool if they could capture some of that "terrible holiday
    special" vibes. Or, maybe some musical numbers, but it is mostly
    "Schoolhouse Rock" style stuff.

    Though, preferably "so bad it is funny" kind of effects...
    Not so much "Manos: The Hands of Fate" bad, which was also technically
    bad, but not in a way that I found particularly amusing.


    Well, and while in theory could be cheaper still to use sock-puppets,
    this is going a little too far.



    Or, the extreme opposite that was 90s CGI jank. Proceeds to watch
    episodes of "Donkey Kong Country" or similar, "Yeah, that's the crap".

    Not everything needs to look good though, sometimes there is a certain
    charm in the "jank".

    Where, could maybe classify CGI into a few buckets:
    80s/experimental:
    Tron;
    "Money for Nothing";
    Various CGI "fever dream" stuff.
    Looked like they really liked CGI solids,
    and some kind of ray-casting.
    Some early/mid 90s stuff:
    ReBoot, Donkey Kong Country, Beast Machines, ...
    Some late 1990s/2000s stuff:
    Where human type characters got *very ugly*.
    Side Branch:
    Shows like "Jimmy Neutron" going to a more cartoony style
    Humanoids still looked OK, if kept cartoon-like.
    2010s to present:
    Paths solidly split into "photo realistic" and cartoon styles.
    Or, Pixar liking to sit right on the edge.
    Like, they want to do photo-realism,
    but if they try too hard, it gets ugly.


    If I were to do anything, might try to borrow some from the 80s style,
    and use a lot of CSG.

    Could maybe make "artistic choice" effects, like dithering, or saving
    JPEG images at 0% quality (so that the image looks "kinda cooked").


    Though, reminds me of a funny observation with my color-via-monochrome experiment:
    The images could be LZ compressed to fairly small sizes, and seemingly
    beat JPEG in terms of Q/bpp while doing so (because one needs to save
    the JPEG at 0%, and then it looks cooked; with the dithered image
    looking less bad than the 0% JPEG).

    Though, not sure what to make of this exactly.


    Looking on Wikipedia, though, this doesn't look like the same algorithm.

    Goldschmidt is just an N-R where the arithmetic has been arranged so
    that the multiplies are not data-dependent (unlike N-R). And for this
    independence, GS lacks the automatic correction N-R has.

    Dunno.

    In my case, both terms being iterated do depend on each other.

    The actual calculation didn't look the same either.
    But, it does involve iteration, like N-R, and uses 2 terms with one
    being a reciprocal of the square root (like Goldschmidt), and appears to converge in a relatively small number of loop iterations.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Nov 10 14:52:36 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    According to Scott Lurndal <slp53@pacbell.net>:
    PDP-8, 4004, IBM 650, ... And any machine without "registers".

    To be fair, addresses 10 through 17 in the PDP-8 were effectively
    auto-increment registers and indirect branches were their
    primary function. ....

    I did a fair amount of PDP-8 programming and I don't ever recall using
    the auto-index locations for branches. They were used to step
    through a table of data, e.g. to add up a list of numbers:

    10, 1007 ; list starts at 1010

    100, -50 ; list is 50 (octal) long

    CLA
    LOOP,
    TAD I 10
    ISZ 100
    JMP LOOP
    ; sum is in the accumulator

    I suppose you could use them for threaded code, but I didn't run into
    any PDP-8 progams that used that.

    Yes, mainly for data. I do have a vague recollection of hand-disassembling[*] the BASIC interpreter and finding some unexpected indirect branches through 010-017.

    [*] Paper and pencil from an octal dump in high school.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Mon Nov 10 18:53:34 2025
    From Newsgroup: comp.arch

    According to Scott Lurndal <slp53@pacbell.net>:
    I suppose you could use them for threaded code, but I didn't run into
    any PDP-8 progams that used that.

    Yes, mainly for data. I do have a vague recollection of hand-disassembling[*] the BASIC interpreter and finding some unexpected indirect branches through 010-017.

    The usual way to do threaded code needs double indirection, like on the PDP-11 JMP @(R5)+ which jumps to the address that the word at R5 points to, then increments R5. The PDP-8 only had single indirect so the autoindex would have to
    point at a list of JMP instructions, which in turn would usually have to be indirect unless the routine was so small it could fit on the page with the JMP list.
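
    Purely as an illustration of that double indirection (C function pointers
    standing in for machine jumps, nothing PDP-specific): each slot of the
    thread points at a cell that holds the routine address, so dispatch
    follows two pointers per step. All of the names here are made up.

    #include <stdio.h>

    typedef void (*prim_fn)(void);

    static void prim_hello(void) { puts("hello"); }
    static void prim_bye(void)   { puts("bye");   }

    /* cells holding the routine addresses (the "code field" words) */
    static prim_fn cell_hello = prim_hello;
    static prim_fn cell_bye   = prim_bye;

    /* the thread: pointers to those cells, NULL-terminated */
    static prim_fn *thread_body[] = { &cell_hello, &cell_bye, NULL };

    /* fetch the cell address, follow it, call the routine, advance the "IP" */
    static void run(prim_fn **ip)
    {
        while (*ip)
            (**ip++)();
    }

    int main(void) { run(thread_body); return 0; }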

    People did all sorts of strange stuff to cram programs into the PDP-8 so I can imagine other sorts of autoindex JMP tricks, like doing one thing the first time
    through a loop and something else after that.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Nov 10 13:54:23 2025
    From Newsgroup: comp.arch

    On 11/10/2025 1:16 AM, Terje Mathisen wrote:
    BGB wrote:
    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary FP.
    Partly as the strategy for generating the initial guess is far less
    accurate.

    So, it first uses a loop with hard-coded checks and scales to get it
    in the general area, before then letting N-R take over. If the value
    isn't close enough (seemingly +/- 25% or so), N-R flies off into space.

    Namely:
       Exponent is wrong:
         Scale by factors of 2 until correct;
       Off by more than 50%, scale by +/- 25%;
       Off by more than 25%, scale by +/- 12.5%;
       Else: Good enough, let normal N-R take over.

    My possibly naive idea would extract the top 9-15 digits from divisor
    and dividend, convert both to binary FP, do the division and convert back.

    That would reduce the NR step to two or three iterations, right?


    After adding code to convert to/from 'double', and using this
    for the initial reciprocal and square-root:
    DIV gets around 50% faster: ~ 1.5 MHz (~ 12x slower than MUL);
    SQRT gets around 260% faster: ~ 0.9 MHz (~ 22x slower than MUL);


    Single-stepping in the debugger:
    SQRT takes around 3 iterations.

    With the initial worse estimates, it requires 7 iterations.

    the iteration has a special case to stop once the adjustment would
    effectively become too small to make a difference.

    ...
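
    Roughly, the structure looks like this in plain doubles, with a float
    divide standing in for the "convert the top digits to binary FP, divide
    there, convert back" seeding (the Decimal128 conversion routines
    themselves are not shown; this is only a sketch of the shape):

    #include <stdio.h>

    /* Newton-Raphson reciprocal, seeded from a lower-precision binary divide. */
    double recip_nr(double d)
    {
        /* seed: ~24 bits correct (ignoring range issues);
           two or three N-R steps then finish the job.       */
        double x = (double)(1.0f / (float)d);

        for (int i = 0; i < 8; i++) {
            double xn = x * (2.0 - d * x);   /* classic N-R step for 1/d      */
            if (xn == x)                     /* adjustment too small: stop    */
                break;
            x = xn;
        }
        return x;
    }

    int main(void) { printf("%.17g\n", recip_nr(3.0)); return 0; }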


    Otherwise, it is possible I could add the fancy rounding modes.

    Though, I can note that another library that uses this format also uses
    funky normalization (primarily keeping numbers right aligned rather than normalizing to left-alignment) which could affect the behavior of
    rounding (it would nominally round to however many digits exist past the decimal point in the ASCII strings it parses as input).


    Though, it could be possible to add a feature to partly defeat the floating-point behavior and behave as-if there were always at least N
    digits above the decimal point (for normalization/rounding).

    For example, if specifying that ADD should behave as-if there were 31
    digits above the decimal point, then operations would be rounded to
    3 digits below the decimal point.

    This sort of behavior would likely need to be per-operation though.

    Well, unless "how many digits exist past the decimal point in an ASCII
    string representation" is itself a semantically important detail?...


    For most contexts where it could matter, would expect setting a minimum
    exponent to make more sense. Though, in these use-cases, it is not clear
    how the added complexity (and overhead) of decimal floating-point could
    make sense over using some sort of decimal fixed-point scheme (such as
    storing 64 or 128 bit integers with a fixed scale of 1000 or something).
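
    For illustration, the fixed-scale idea is just something like this
    (names made up, overflow handling omitted):

    #include <stdint.h>

    typedef int64_t fix3;                       /* value scaled by 1000        */

    fix3 fix3_add(fix3 a, fix3 b) { return a + b; }
    fix3 fix3_mul(fix3 a, fix3 b) { return (a * b) / 1000; }  /* can overflow  */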

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Nov 11 00:08:48 2025
    From Newsgroup: comp.arch

    On Mon, 10 Nov 2025 13:54:23 -0600
    BGB <cr88192@gmail.com> wrote:
    On 11/10/2025 1:16 AM, Terje Mathisen wrote:
    BGB wrote:
    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary
    FP. Partly as the strategy for generating the initial guess is far
    less accurate.

    So, it first uses a loop with hard-coded checks and scales to get
    it in the general area, before then letting N-R take over. If the
    value isn't close enough (seemingly +/- 25% or so), N-R flies off
    into space.

    Namely:
    Exponent is wrong:
    Scale by factors of 2 until correct;
    Off by more than 50%, scale by +/- 25%;
    Off by more than 25%, scale by +/- 12.5%;
    Else: Good enough, let normal N-R take over.

    My possibly naive idea would extract the top 9-15 digits from
    divisor and dividend, convert both to binary FP, do the division
    and convert back.

    That would reduce the NR step to two or three iterations, right?


    After adding code to feed to convert to/from 'double', and using this
    for initial reciprocal and square-root:
    DIV gets around 50% faster: ~ 1.5 MHz (~ 12x slower than MUL);
    That is your timing for Decimal128 on a modern desktop PC?
    Dependent divisions or independent?
    Even for dependent, it sounds slow.
    Did you try to compare against brute force calculation using GMP? https://gmplib.org/
    I.e. assuming that num < den < 10*num, use GMP to calculate 40 decimal
    digits of intermediate result y as follows:
    Numx = num * 1e40;
    y = Numx/den;
    Yi = y / 1e6, Yf = y % 1e6 (this step does not require GMP, figure out
    why).
    If Yf != 5e5 then you are finished. Only in the extremely rare case (1 in
    a million) of Yf == 5e5 will you have to calculate the remainder of
    Numx/den to find the correct rounding.
    Somehow, I suspect that on a modern PC even a non-optimized method like
    the above will be faster than 670 usec.
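
    For what it's worth, a sketch of that recipe using GMP (the 34-digit
    sample values are made up, and the exact-half rounding case is left out
    as the comment says; build with -lgmp):

    #include <gmp.h>
    #include <stdio.h>

    int main(void)
    {
        mpz_t num, den, numx, y, yi, p40;
        mpz_inits(num, den, numx, y, yi, p40, NULL);

        mpz_set_str(num, "1234567890123456789012345678901234", 10);
        mpz_set_str(den, "9876543210987654321098765432109876", 10);

        mpz_ui_pow_ui(p40, 10, 40);           /* 1e40                           */
        mpz_mul(numx, num, p40);              /* Numx = num * 1e40  (74 digits) */
        mpz_tdiv_q(y, numx, den);             /* y: ~40 decimal digits          */

        /* Yi = y / 1e6, Yf = y % 1e6 -- GMP used here only for brevity;
           this step is small enough to do with plain 64/128-bit arithmetic. */
        unsigned long yf = mpz_tdiv_q_ui(yi, y, 1000000ul);

        gmp_printf("Yi = %Zd\nYf = %lu\n", yi, yf);
        if (yf == 500000ul)
            puts("exact-half case: need the remainder of Numx/den to round");

        mpz_clears(num, den, numx, y, yi, p40, NULL);
        return 0;
    }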
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Mon Nov 10 21:56:45 2025
    From Newsgroup: comp.arch

    Typical process for NaN boxing is to set the high order bits of the
    value which causes the value to appear to be a NaN at higher precision.
    I have been thinking about using some of the high order bits of the NaN
    (eg bits 32 to 51) to indicate the precision of the boxed value. This
    would allow detection of the use of a lower precision value in
    arithmetic. Suppose a conversion from single to double precision is being
    done, but the value to be converted is only half precision. If this were
    indicated by the NaN, software might be able to fix the result. I also
    preserve the sign bit of the number in the NaN box.
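
    A minimal C sketch of that boxing scheme, assuming a tag field in bits
    32..51 and the tag values shown (both are illustrative, not a defined
    format):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    enum { BOX_PREC_HALF = 1, BOX_PREC_SINGLE = 2 };   /* illustrative tags    */

    /* Box a binary32 pattern inside a binary64 NaN: exponent all ones, the
       value's sign mirrored into bit 63, tag in bits 32..51, value in 0..31. */
    uint64_t nanbox_single(float f)
    {
        uint32_t u;
        memcpy(&u, &f, sizeof u);

        uint64_t box = 0x7FF0000000000000ull;       /* exponent field all ones */
        box |= (uint64_t)(u >> 31) << 63;           /* preserve the sign bit   */
        box |= (uint64_t)BOX_PREC_SINGLE << 32;     /* precision tag           */
        box |= (uint64_t)u;                         /* boxed 32-bit pattern    */
        return box;                                 /* nonzero tag => a NaN    */
    }

    /* Return the precision tag if the word looks like a NaN box, else 0. */
    unsigned boxed_precision(uint64_t box)
    {
        if ((box & 0x7FF0000000000000ull) != 0x7FF0000000000000ull)
            return 0;
        return (unsigned)((box >> 32) & 0xFFFFFu);
    }

    int main(void)
    {
        uint64_t b = nanbox_single(-1.5f);
        printf("box = %016llx  precision tag = %u\n",
               (unsigned long long)b, boxed_precision(b));
        return 0;
    }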

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Nov 10 21:25:47 2025
    From Newsgroup: comp.arch

    On 11/10/2025 4:08 PM, Michael S wrote:
    On Mon, 10 Nov 2025 13:54:23 -0600
    BGB <cr88192@gmail.com> wrote:

    On 11/10/2025 1:16 AM, Terje Mathisen wrote:
    BGB wrote:
    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary
    FP. Partly as the strategy for generating the initial guess is far
    less accurate.

    So, it first uses a loop with hard-coded checks and scales to get
    it in the general area, before then letting N-R take over. If the
    value isn't close enough (seemingly +/- 25% or so), N-R flies off
    into space.

    Namely:
       Exponent is wrong:
         Scale by factors of 2 until correct;
       Off by more than 50%, scale by +/- 25%;
       Off by more than 25%, scale by +/- 12.5%;
       Else: Good enough, let normal N-R take over.

    My possibly naive idea would extract the top 9-15 digits from
    divisor and dividend, convert both to binary FP, do the division
    and convert back.

    That would reduce the NR step to two or three iterations, right?


    After adding code to feed to convert to/from 'double', and using this
    for initial reciprocal and square-root:
    DIV gets around 50% faster: ~ 1.5 MHz (~ 12x slower than MUL);

    That is your timing for Decimal128 on modern desktop PC?
    Dependent divisions or independent?
    Even for dependent, it sounds slow.


    Modern-ish...

    I am running a CPU type that was originally released 7 years ago, with
    slower RAM than it was designed to work with.


    Did you try to compare against brute force calculation using GMP? https://gmplib.org/
    I.e. asuming that num < den < 10*num use GMP to calculate 40 decimal
    digits of intermediate result y as follows:
    Numx = num * 1e40;
    y = Numx/den;
    Yi = y / 1e6, Yf = y % 1e6 (this step does not require GMP, figure out
    why).
    If Yf != 5e5 then you finished. Only in extremely rare case (1 in a
    million) of Yf == 5e5 you will have to calculate reminder of Numx/den
    to found correct rounding.
    Somehow, I suspect that on modern PC even non-optimized method like
    above will be faster tham 670 usec.




    Well, first step is building with GCC rather than MSVC...

    It would appear that it gets roughly 79% faster when built with GCC.
    So, around 2 million divides per second.



    As for GMP, dividing two 40 digit numbers:
    22 million per second.
    If I do both a divide and a remainder:
    16 million.

    I don't really get what you are wanting me to measure exactly though...


    If I compare against the IBM decNumber library:
    Multiply: 14 million.
    Divide: 7 million

    The decNumber library doesn't appear to have a square-root function...


    Granted, there are possibly faster ways to do divide, versus using Newton-Raphson in this case...

    It was not the point that I could pull the fastest possible
    implementation out of thin air. But, does appear I am beating decNumber
    at least for multiply performance and similar.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Nov 11 12:02:07 2025
    From Newsgroup: comp.arch

    On Mon, 10 Nov 2025 21:25:47 -0600
    BGB <cr88192@gmail.com> wrote:
    On 11/10/2025 4:08 PM, Michael S wrote:
    On Mon, 10 Nov 2025 13:54:23 -0600
    BGB <cr88192@gmail.com> wrote:

    On 11/10/2025 1:16 AM, Terje Mathisen wrote:
    BGB wrote:
    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary
    FP. Partly as the strategy for generating the initial guess is
    far less accurate.

    So, it first uses a loop with hard-coded checks and scales to get
    it in the general area, before then letting N-R take over. If the
    value isn't close enough (seemingly +/- 25% or so), N-R flies off
    into space.

    Namely:
    Exponent is wrong:
    Scale by factors of 2 until correct;
    Off by more than 50%, scale by +/- 25%;
    Off by more than 25%, scale by +/- 12.5%;
    Else: Good enough, let normal N-R take over.

    My possibly naive idea would extract the top 9-15 digits from
    divisor and dividend, convert both to binary FP, do the division
    and convert back.

    That would reduce the NR step to two or three iterations, right?


    After adding code to feed to convert to/from 'double', and using
    this for initial reciprocal and square-root:
    DIV gets around 50% faster: ~ 1.5 MHz (~ 12x slower than
    MUL);

    That is your timing for Decimal128 on modern desktop PC?
    Dependent divisions or independent?
    Even for dependent, it sounds slow.


    Modern-ish...

    Zen2 ?
    I consider it the last of non-modern. Zen3 and Ice Lake are first
    of modern. 128by64 bit integer division on Zen2 is still quite slow
    and overall uArch is even less advanced than 10 y.o. Intel Skylake.
    In majority of real-world workloads it's partially compensated by
    Zen2 bigger L3 cache. In our case big cache does not help.
    But even last non-modern CPU shall be capable to divide faster than
    suggested by your numbers.
    I am running a CPU type that was originally released 7 years ago,
    with slower RAM than it was designed to work with.


    Did you try to compare against brute force calculation using GMP? https://gmplib.org/
    I.e. asuming that num < den < 10*num use GMP to calculate 40
    decimal digits of intermediate result y as follows:
    Numx = num * 1e40;
    y = Numx/den;
    Yi = y / 1e6, Yf = y % 1e6 (this step does not require GMP, figure
    out why).
    If Yf != 5e5 then you finished. Only in extremely rare case (1 in a million) of Yf == 5e5 you will have to calculate reminder of
    Numx/den to found correct rounding.
    Somehow, I suspect that on modern PC even non-optimized method like
    above will be faster tham 670 usec.




    Well, first step is building with GCC rather than MSVC...

    It would appear that it gets roughly 79% faster when built with GCC.
    So, around 2 million divides per second.



    As for GMP, dividing two 40 digit numbers:
    22 million per second.
    If I do both a divide and a remainder:
    16 million.

    I don't really get what you are wanting me to measure exactly
    though...
    I want you to measure division of a 74-digit integer by a 34-digit
    integer, because it is the slowest part [of a brute force implementation]
    of Decimal128 division. The rest of the division is approximately the
    same as multiplication.
    So, [unoptimized] Decimal128 division time should be no worse than
    t1+t2, where t1 is the duration of a Decimal128 multiplication and t2 is
    the duration of the above-mentioned integer division. The estimate is
    pessimistic, because post-division normalization tends to be simpler
    than post-multiplication normalization.
    Optimized division would be faster yet.


    If I compare against the IBM decNumber library:
    Multiply: 14 million.
    Divide: 7 million

    The decNumber library doesn't appear to have a square-root function...


    Granted, there are possibly faster ways to do divide, versus using Newton-Raphson in this case...

    It was not the point that I could pull the fastest possible
    implementation out of thin air. But, does appear I am beating
    decNumber at least for multiply performance and similar.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Nov 11 04:44:48 2025
    From Newsgroup: comp.arch

    On 11/11/2025 4:02 AM, Michael S wrote:
    On Mon, 10 Nov 2025 21:25:47 -0600
    BGB <cr88192@gmail.com> wrote:

    On 11/10/2025 4:08 PM, Michael S wrote:
    On Mon, 10 Nov 2025 13:54:23 -0600
    BGB <cr88192@gmail.com> wrote:

    On 11/10/2025 1:16 AM, Terje Mathisen wrote:
    BGB wrote:
    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary
    FP. Partly as the strategy for generating the initial guess is
    far less accurate.

    So, it first uses a loop with hard-coded checks and scales to get
    it in the general area, before then letting N-R take over. If the
    value isn't close enough (seemingly +/- 25% or so), N-R flies off
    into space.

    Namely:
       Exponent is wrong:
         Scale by factors of 2 until correct;
       Off by more than 50%, scale by +/- 25%;
       Off by more than 25%, scale by +/- 12.5%;
       Else: Good enough, let normal N-R take over.

    My possibly naive idea would extract the top 9-15 digits from
    divisor and dividend, convert both to binary FP, do the division
    and convert back.

    That would reduce the NR step to two or three iterations, right?


    After adding code to feed to convert to/from 'double', and using
    this for initial reciprocal and square-root:
    DIV gets around 50% faster: ~ 1.5 MHz (~ 12x slower than
    MUL);

    That is your timing for Decimal128 on modern desktop PC?
    Dependent divisions or independent?
    Even for dependent, it sounds slow.


    Modern-ish...


    Zen2 ?
    I consider it the last of non-modern. Zen3 and Ice Lake are first
    of modern. 128by64 bit integer division on Zen2 is still quite slow
    and overall uArch is even less advanced than 10 y.o. Intel Skylake.
    In majority of real-world workloads it's partially compensated by
    Zen2 bigger L3 cache. In our case big cache does not help.
    But even last non-modern CPU shall be capable to divide faster than
    suggested by your numbers.


    Zen+

    Or, a slightly tweaked version of Zen1.


    It is very well possible to do big integer divide faster than this.
    Such as via shift-and-add.

    But, as for decimal, this makes it harder.


    I could do long division, but this is a much more complicated algorithm (versus using Newton-Raphson).

    But, N-R is slow as it is basically a bunch of operations, which are
    granted themselves, each kinda slow.



    I am running a CPU type that was originally released 7 years ago,
    with slower RAM than it was designed to work with.


    Did you try to compare against brute force calculation using GMP?
    https://gmplib.org/
    I.e. asuming that num < den < 10*num use GMP to calculate 40
    decimal digits of intermediate result y as follows:
    Numx = num * 1e40;
    y = Numx/den;
    Yi = y / 1e6, Yf = y % 1e6 (this step does not require GMP, figure
    out why).
    If Yf != 5e5 then you finished. Only in extremely rare case (1 in a
    million) of Yf == 5e5 you will have to calculate reminder of
    Numx/den to found correct rounding.
    Somehow, I suspect that on modern PC even non-optimized method like
    above will be faster tham 670 usec.




    Well, first step is building with GCC rather than MSVC...

    It would appear that it gets roughly 79% faster when built with GCC.
    So, around 2 million divides per second.



    As for GMP, dividing two 40 digit numbers:
    22 million per second.
    If I do both a divide and a remainder:
    16 million.

    I don't really get what you are wanting me to measure exactly
    though...


    I want you to measure division of 74-digit integer by 34-digit integer, because it is the slowest part [of brute force implementation] of
    Decimal128 division. The rest of division is approximately the same as multiplication.
    So, [unoptimized] Decimal128 division time should be no worse than
    t1+t2, where t1 is duration of Decimal128 multiplication and t2 is
    duration of above-mentioned integer division. An estimate is
    pessimistic, because post-division normalization tends to be simpler
    than post-multiplication normalization.
    Optimized division would be faster yet.


    If it is a big-integer divide, this is not quite the same thing.

    And, if I were to use big-integer divide (probably not via GMP though,
    this would be too big of a dependency), there is still the issue of efficiently converting between big-integer and the "groups of 9 digits
    in 32-bits" format.


    This is partly why I removed the BID code:
    At first, it seemed like the DPD and BID converters were of similar speed;
    But, it turns out I was still testing the DPD converter, and in fact the
    BID converter was significantly slower.

    And, if I were going to do BID, would make more sense to do it as its
    own thing, and build it mostly around 128-bit integer math.


    But, in this case, I had decided to experiment with DPD.


    Most likely, in this case if I wanted faster divide, that also played
    well with the existing format, I would need to do long division or similar.





    If I compare against the IBM decNumber library:
    Multiply: 14 million.
    Divide: 7 million

    The decNumber library doesn't appear to have a square-root function...


    Granted, there are possibly faster ways to do divide, versus using
    Newton-Raphson in this case...

    It was not the point that I could pull the fastest possible
    implementation out of thin air. But, does appear I am beating
    decNumber at least for multiply performance and similar.




    Can note that while decNumber exists, at the moment, it is over 10x more code...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Nov 11 14:03:40 2025
    From Newsgroup: comp.arch

    On Tue, 11 Nov 2025 04:44:48 -0600
    BGB <cr88192@gmail.com> wrote:
    On 11/11/2025 4:02 AM, Michael S wrote:
    On Mon, 10 Nov 2025 21:25:47 -0600
    BGB <cr88192@gmail.com> wrote:

    On 11/10/2025 4:08 PM, Michael S wrote:
    On Mon, 10 Nov 2025 13:54:23 -0600
    BGB <cr88192@gmail.com> wrote:

    On 11/10/2025 1:16 AM, Terje Mathisen wrote:
    BGB wrote:
    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary
    FP. Partly as the strategy for generating the initial guess is
    far less accurate.

    So, it first uses a loop with hard-coded checks and scales to
    get it in the general area, before then letting N-R take over.
    If the value isn't close enough (seemingly +/- 25% or so), N-R
    flies off into space.

    Namely:
    Exponent is wrong:
    Scale by factors of 2 until correct;
    Off by more than 50%, scale by +/- 25%;
    Off by more than 25%, scale by +/- 12.5%;
    Else: Good enough, let normal N-R take over.

    My possibly naive idea would extract the top 9-15 digits from
    divisor and dividend, convert both to binary FP, do the division
    and convert back.

    That would reduce the NR step to two or three iterations, right?


    After adding code to feed to convert to/from 'double', and using
    this for initial reciprocal and square-root:
    DIV gets around 50% faster: ~ 1.5 MHz (~ 12x slower than
    MUL);

    That is your timing for Decimal128 on modern desktop PC?
    Dependent divisions or independent?
    Even for dependent, it sounds slow.


    Modern-ish...


    Zen2 ?
    I consider it the last of non-modern. Zen3 and Ice Lake are first
    of modern. 128by64 bit integer division on Zen2 is still quite slow
    and overall uArch is even less advanced than 10 y.o. Intel Skylake.
    In majority of real-world workloads it's partially compensated by
    Zen2 bigger L3 cache. In our case big cache does not help.
    But even last non-modern CPU shall be capable to divide faster than suggested by your numbers.


    Zen+

    Or, a slightly tweaked version of Zen1.


    It is very well possible to do big integer divide faster than this.
    Such as via shift-and-add.

    But, as for decimal, this makes it harder.


    I could do long division, but this is a much more complicated
    algorithm (versus using Newton-Raphson).

    But, N-R is slow as it is basically a bunch of operations, which are
    granted themselves, each kinda slow.



    I am running a CPU type that was originally released 7 years ago,
    with slower RAM than it was designed to work with.


    Did you try to compare against brute force calculation using GMP?
    https://gmplib.org/
    I.e. asuming that num < den < 10*num use GMP to calculate 40
    decimal digits of intermediate result y as follows:
    Numx = num * 1e40;
    y = Numx/den;
    Yi = y / 1e6, Yf = y % 1e6 (this step does not require GMP, figure
    out why).
    If Yf != 5e5 then you finished. Only in extremely rare case (1 in
    a million) of Yf == 5e5 you will have to calculate reminder of
    Numx/den to found correct rounding.
    Somehow, I suspect that on modern PC even non-optimized method
    like above will be faster tham 670 usec.




    Well, first step is building with GCC rather than MSVC...

    It would appear that it gets roughly 79% faster when built with
    GCC. So, around 2 million divides per second.



    As for GMP, dividing two 40 digit numbers:
    22 million per second.
    If I do both a divide and a remainder:
    16 million.

    I don't really get what you are wanting me to measure exactly
    though...


    I want you to measure division of 74-digit integer by 34-digit
    integer, because it is the slowest part [of brute force
    implementation] of Decimal128 division. The rest of division is approximately the same as multiplication.
    So, [unoptimized] Decimal128 division time should be no worse than
    t1+t2, where t1 is duration of Decimal128 multiplication and t2 is
    duration of above-mentioned integer division. An estimate is
    pessimistic, because post-division normalization tends to be simpler
    than post-multiplication normalization.
    Optimized division would be faster yet.


    If it is a big-integer divide, this is not quite the same thing.

    And, if I were to use big-integer divide (probably not via GMP
    though,
    Certainly not via GMP in final product. But doing 1st version via GMP
    makes perfect sense.
    this would be too big of a dependency), there is still the
    issue of efficiently converting between big-integer and the "groups
    of 9 digits in 32-bits" format.
    No, no, no. Not "group of 9 digits"! Plain unadulterated binary. 64
    binary 'digits' per 64-bit word.


    This is partly why I removed the BID code:
    At first, it seemed like the DPD and BID converters were similar
    speed; But, turns out I was still testing the DPD converter, and
    in-fact the BID converter was significantly slower.

    DPD-specific code and algorithms make sense for multiplication.
    They likely make sense for addition/subtraction as well; I didn't try
    to think deeply about it.
    But for division I wouldn't bother with DPD-specific things. Just
    convert mantissa from DPD to binary, then divide, normalize, round then
    convert back.
    And, if I were going to do BID, would make more sense to do it as its
    own thing, and build it mostly around 128-bit integer math.


    But, in this case, I had decided to experiment with DPD.


    Most likely, in this case if I wanted faster divide, that also played
    well with the existing format, I would need to do long division or
    similar.





    If I compare against the IBM decNumber library:
    Multiply: 14 million.
    Divide: 7 million

    The decNumber library doesn't appear to have a square-root
    function...


    Granted, there are possibly faster ways to do divide, versus using
    Newton-Raphson in this case...

    It was not the point that I could pull the fastest possible
    implementation out of thin air. But, does appear I am beating
    decNumber at least for multiply performance and similar.




    Can note that while decNumber exists, at the moment, it is over 10x
    more code...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Tue Nov 11 18:50:20 2025
    From Newsgroup: comp.arch

    On 2025-11-06 20:28, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-05 23:28, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
    ----------------
    But then you could get the problem of a longjmp to a setjmp value that
    is stale because the targeted function invocation (stack frame) is no
    longer there.

    But YOU had to pass the jumpbuf out of the setjump() scope.

    Now, YOU complain there is a hole in your own foot with a smoking gun
    in your own hand.

    That is not the issue. The question is if the semantics of "goto
    label-valued-variable" are hard to define, as Ritchie said, or not, as
    Anton thinks Stallman said or would have said.

    So, label-variables are hard to define, but function-variables are not ?!?

    Depends on the level at which you want to define it.

    At the machine level, where semantics are (usually) defined for each instruction separately, a jump to a dynamic address (using a
    "label-variable") is not much different from a call to a dynamic address (using a "function-variable"), and the effect of the single instruction
    on the machine state is much the same as for the static address case.
    The higher-level effect on the further execution of the program is out
    of scope, whatever the actual value of the target address in the
    instruction.

    It is only if your machine has some semantics for instruction
    combinations, such as your VEC-LOOP pair, that you have to define what
    happens if a jump or call to some address leads to later executing only
    some of those instructions or executing them in the wrong order, such as trying to execute a LOOP without having executed a preceding VEC.

    At the higher programming-language level, the label case can be much
    harder to define and less useful than the function case, depending on
    the programming language and its abstract model of execution, and also depending on what compile-time checks you assume.

    Consider an imperative language such as C with no functions nested
    within other functions or other blocks (where by "block" I mean some syntactical construct that sets up its local context with local
    variables etc.). If you have a function-variable (that is, a pointer to
    a function) that actually refers to a function with the same parameter profile, it is easy to define the semantics of a call via this function variable: it is the same as for a call that names the referenced
    function statically, and such a call is always legal. Problems arise
    only if the function-variable has some invalid value such as NULL, or
    the address of a function with a different profile, or some code address
    that does not refer to (the start of) a function. Such invalid values
    can be prevented at compile time, except (usually) for NULL.

    In the same language setting, the semantics of a jump using a
    label-variable are easy to define only if the label-variable refers to a
    label in the same block as the jump. A jump from one block into another
    would mess up the context, omitting the set-up of the target block's
    context and/or omitting the tear-down of the source block's context. The further results of program execution are machine-dependent and so
    undefined behavior.

    A compiler could enforce the label-in-same-block rule, but it seems that
    GNU C does not do so.

    In a programming language that allows nested functions the same kind of context-crossing problems arise for function-variables. Traditional
    languages solve them by allowing, at compile-time, calls via function-variables only if it is certain that the containing context of
    the callee still exists (if the callee is nested), or by (expensively) preserving that context as a dynamically constructed closure. In either
    case, the caller's context never needs to be torn down to execute the
    call, differing from the jump case.

    In summary, jumps via label-variables are useful only for control
    transfers within one function, and do not help to build up a computation
    by combining several functions -- the main method of program design at present. In contrast, calls via function-variables are a useful
    extension to static calls, actually helping to combine several functions
    in a computation, as shown by the general adoption of
    class/object/method coding styles.

    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Tue Nov 11 19:58:55 2025
    From Newsgroup: comp.arch

    On 2025-11-08 23:08, John Levine wrote:
    According to Michael S <already5chosen@yahoo.com>:
    I would imagine that in old times the return instruction was less common
    than indirect addressing itself.

    On several of the machines I used a subroutine call stored the return
    address in the first word of the routine and branched to that address+1.
    The return was just an indirect jump.

    One such machine was the HP 2100; I used some of those.

    Stacks? What's a stack? We barely had registers.
    And indeed the Algol 60 compiler for the HP 2100 did not support
    recursion. My programs did real-time control, so I wrote a small non-preemptive but priority-driven multi-threading kernel. Thread switch
    was easy as there were very few registers and no stack. But you had to
    be careful because no subroutines were re-entrant.

    Speaking of indirect addressing, the HP 2100 had a special feature: it
    had a 64 KB address space, but with word addressing of 16-bit words, so addresses were only 15 bits, leaving the MSbit in each word free.

    When using indirect addressing there was an "indirect" bit in the
    instruction which, in the usual way, made the machine use the 16-bit
    content of the (directly) addressed word as the actual target address,
    but only if the MSbit of that content was zero. If the MSbit was one, it caused a further level of indirection, using the 15 other bits as the
    address of another word that again would contain the actual target
    address, if the MSbit of /that/ content was zero, and so on.

    So an indirect instruction could cause a chain of indirections which
    ended when an address-word had a zero in its MSbit. And the machine
    could get stuck in an eternal indirection loop, which IIRC happened to
    me once :-)
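
    A toy C sketch of that indirection chain (memory contents and addresses
    below are made up; as noted, a real machine could loop forever here):

    #include <stdint.h>
    #include <stdio.h>

    static uint16_t memory[32768];        /* 32K 16-bit words, 15-bit addresses */

    /* Follow the chain: while the MSbit of the fetched word is set, use the
       other 15 bits as the address of the next word; return the final target. */
    static uint16_t resolve_indirect(uint16_t addr)
    {
        uint16_t w = memory[addr & 0x7FFF];
        while (w & 0x8000)
            w = memory[w & 0x7FFF];
        return (uint16_t)(w & 0x7FFF);
    }

    int main(void)
    {
        memory[0100] = 0x8000 | 0200;     /* word 0100: indirect again via 0200 */
        memory[0200] = 0300;              /* word 0200: final target is 0300    */
        printf("target = %o\n", (unsigned)resolve_indirect(0100));  /* 300 */
        return 0;
    }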
    --
    Niklas Holsti

    niklas holsti tidorum fi
    . @ .

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Nov 11 18:48:47 2025
    From Newsgroup: comp.arch

    Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
    On 2025-11-08 23:08, John Levine wrote:
    According to Michael S <already5chosen@yahoo.com>:
    I would imagine that in old times return iinstruction was less common
    than indirect addressing itself.

    On several of the machines I used a subroutine call stored the return
    address in the first word of the routine and branched to that address+1.
    The return was just an indirect jump.

    One such machine was the HP 2100; I used some of those.

    Stacks? What's a stack? We barely had registers.
    And indeed the Algol 60 compiler for the HP 2100 did not support
    recursion. My programs did real-time control, so I wrote a small
    non-preemptive but priority-driven multi-threading kernel. Thread switch
    was easy as there were very few registers and no stack. But you had to
    be careful because no subroutines were re-entrant.

    Speaking of indirect addressing, the HP 2100 had a special feature: it
    had a 64 KB address space, but with word addressing of 16-bit words, so
    addresses were only 15 bits, leaving the MSbit in each word free.

    When using indirect addressing there was an "indirect" bit in the
    instruction which, in the usual way, made the machine use the 16-bit
    content of the (directly) addressed word as the actual target address,
    but only if the MSbit of that content was zero. If the MSbit was one, it
    caused a further level of indirection, using the 15 other bits as the
    address of another word that again would contain the actual target
    address, if the MSbit of /that/ content was zero, and so on.

    So an indirect instruction could cause a chain of indirections which
    ended when an address-word had a zero in its MSbit. And the machine
    could get stuck in an eternal indirection loop, which IIRC happened to
    me once :-)

    The Burroughs B3500 and sucessors had a similar feature. An
    instruction operand contained the address of the operand plus
    four control bits (BCD architecture). Two of the control bits
    could select one of three index registers that would be summed
    with the address (the index registers are signed, the address
    unsigned). The other two control bits specified the operand
    type (UN - Unsigned Numeric, SN - Signed Numeric,
    UA - Unsigned Alphanumeric, IA - Indirect Address).

    If the IA bit was set for an operand, the processor would read
    a new operand from the target address and process it as if it
    were an operand. This indirection continued until an operand
    specified a data type other than IA.

    The processor started a timer before each instruction, if the
    instruction execution time exceeded the timer value, the MCP
    would terminate the program.

    In the B3500 operands were six digits, and the control
    bits consumed the high-order digit, allowing addresses ranging
    from 000000 to 099999 (100 kilo digits). The B4700
    added extended operands which supported 000000 through 999999
    by placing an undigit (12 or 0xC) in the second digit position
    of the operand and extending the operand to 32 bits (8 BCD digits).

    The first digit still contained the operand type bits, the second
    digit the value 0xc and the remaining six digits were the
    program address.

    The V380 (upgraded B4900) extended further by supporting four
    additional index registers; if the second digit of the operand
    was 0xd, the data type index register bits selected IX4 through
    IX7.

    In all cases, a "segment" was limited to one million digits in
    size. Before the V380, a program was limited to a single segment;
    the V380 added an entirely new virtual memory subsystem (segment based)
    that supported 100,000 environments per process, with up to
    100 segments per environment. A single segment was still limited
    to 500KB, however, for backward binary compatibility with programs
    from 1965. Large programs (e.g. the COBOL compiler) used an
    operating system (MCP) provided overlay mechanism (the MCP cached
    overlays in other parts of main memory or on a fast RAMdisk).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Tue Nov 11 14:23:38 2025
    From Newsgroup: comp.arch

    Niklas Holsti wrote:
    On 2025-11-06 20:28, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-05 23:28, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
    ----------------
    But then you could get the problem of a longjmp to a setjmp value that
    is stale because the targeted function invocation (stack frame) is no
    longer there.

    But YOU had to pass the jumpbuf out of the setjump() scope.

    Now, YOU complain there is a hole in your own foot with a smoking gun
    in your own hand.

    That is not the issue. The question is if the semantics of "goto
    label-valued-variable" are hard to define, as Ritchie said, or not, as
    Anton thinks Stallman said or would have said.

    So, label-variables are hard to define, but function-variables are not
    ?!?

    Depends on the level at which you want to define it.

    At the machine level, where semantics are (usually) defined for each instruction separately, a jump to a dynamic address (using a "label-variable") is not much different from a call to a dynamic address (using a "function-variable"), and the effect of the single instruction
    on the machine state is much the same as for the static address case.
    The higher-level effect on the further execution of the program is out
    of scope, whatever the actual value of the target address in the instruction.

    It is only if your machine has some semantics for instruction
    combinations, such as your VEC-LOOP pair, that you have to define what happens if a jump or call to some address leads to later executing only
    some of those instructions or executing them in the wrong order, such as trying to execute a LOOP without having executed a preceding VEC.

    At the higher programming-language level, the label case can be much
    harder to define and less useful than the function case, depending on
    the programming language and its abstract model of execution, and also depending on what compile-time checks you assume.

    Consider an imperative language such as C with no functions nested
    within other functions or other blocks (where by "block" I mean some syntactical construct that sets up its local context with local
    variables etc.). If you have a function-variable (that is, a pointer to
    a function) that actually refers to a function with the same parameter profile, it is easy to define the semantics of a call via this function variable: it is the same as for a call that names the referenced
    function statically, and such a call is always legal. Problems arise
    only if the function-variable has some invalid value such as NULL, or
    the address of a function with a different profile, or some code address that does not refer to (the start of) a function. Such invalid values
    can be prevented at compile time, except (usually) for NULL.

    In the same language setting, the semantics of a jump using a
    label-variable are easy to define only if the label-variable refers to a label in the same block as the jump. A jump from one block into another would mess up the context, omitting the set-up of the target block's
    context and/or omitting the tear-down of the source block's context. The further results of program execution are machine-dependent and so
    undefined behavior.

    A compiler could enforce the label-in-same-block rule, but it seems that
    GNU C does not do so.

    In a programming language that allows nested functions the same kind of context-crossing problems arise for function-variables. Traditional languages solve them by allowing, at compile-time, calls via function-variables only if it is certain that the containing context of
    the callee still exists (if the callee is nested), or by (expensively) preserving that context as a dynamically constructed closure. In either case, the caller's context never needs to be torn down to execute the
    call, differing from the jump case.

    In summary, jumps via label-variables are useful only for control
    transfers within one function, and do not help to build up a computation
    by combining several functions -- the main method of program design at present. In contrast, calls via function-variables are a useful
    extension to static calls, actually helping to combine several functions
    in a computation, as shown by the general adoption of
    class/object/method coding styles.

    Niklas


    I was curious about the interaction between dynamic stack allocations
    and goto variables to see if it handled the block scoping correctly.
    Ada should have the same issues as C.
    It appears GCC x86-64 15.2 with -O3 does not properly recover
    stack space with dynamic goto's.

    Test1 allocates a dynamic sized buffer and has a static goto Loop
    for which GCC generates a jne .L6 to a mov rsp, rbx that recovers
    the stack allocation inside the {} block.

    Test2 is the same but does a goto *dest and GCC does not generate
    code to recover the inner {} block allocation. It just loops over
    the sub rsp, rbx so the stack space just grows.

    long Sub (long len, char buf[]);

    void Test1 (long len)
    {
        long ok;

    Loop:
        {
            char buf[len];

            ok = Sub (len, buf);
            if (ok)
                goto Loop;
        }
    }

    # Compilation provided by Compiler Explorer at https://godbolt.org/
    Test1(long):
    push rbp
    mov rbp, rsp
    push r13
    mov r13, rdi
    push r12
    lea r12, [rdi+15]
    push rbx
    shr r12, 4
    sal r12, 4
    sub rsp, 8
    jmp .L2
    .L6:
    mov rsp, rbx
    .L2:
    mov rbx, rsp
    sub rsp, r12
    mov rdi, r13
    mov rsi, rsp
    call Sub(long, char*)
    test rax, rax
    jne .L6
    lea rsp, [rbp-24]
    pop rbx
    pop r12
    pop r13
    pop rbp
    ret

    void Test2 (long len)
    {
        long ok;
        void *dest;

        dest = &&Loop;
    Loop:
        {
            char buf[len];

            ok = Sub (len, buf);
            if (ok)
                goto *dest;
        }
    }

    Test2(long):
    push rbp
    mov rbp, rsp
    push r12
    mov r12, rdi
    push rbx
    lea rbx, [rdi+15]
    shr rbx, 4
    sal rbx, 4
    .L8:
    sub rsp, rbx
    mov rdi, r12
    mov rsi, rsp
    call Sub(long, char*)
    test rax, rax
    jne .L8
    lea rsp, [rbp-16]
    pop rbx
    pop r12
    pop rbp
    ret






    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Nov 11 19:30:43 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Typical process for NaN boxing is to set the high order bits of the
    value which causes the value to appear to be a NaN at higher precision.

    Any FP value representable in lower precision can be exactly represented
    in higher precision.

    I have been thinking about using some of the high order bits of the NaN
    (eg bits 32 to 51) to indicate the precision of the boxed value.

    When My 66000 generates a NaN it inserts the cause in the 3 HoBs and
    inserts IP in the LoBs. Nothing prevents you from overwriting the NaN,
    but I thought it was best to point at the causing-instruction and an
    encoded "why" the nan was generated. The cause is a 3-bit index to the
    7 defined IEEE exceptions.

    There are rules when more than 1 NaN are an operand to an instruction
    designed to leave the more important NaN as the result. {Where more
    important is generally the first to be generated.}

    This
    would allow detection of the use of a lower precision value in
    arithmetic. Suppose a convert from single to double precision is being
    done, but the value to be converted is only half precision. If it were indicated by the NaN software might be able to fix the result.

    I think it is better to fix the SW that thinks a (half) is a (float).

    I also preserve the sign bit of the number in the NaN box.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Nov 11 19:46:39 2025
    From Newsgroup: comp.arch


    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-06 20:28, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-05 23:28, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
    ----------------
    But then you could get the problem of a longjmp to a setjmp value that
    is stale because the targeted function invocation (stack frame) is no
    longer there.

    But YOU had to pass the jumpbuf out of the setjump() scope.

    Now, YOU complain there is a hole in your own foot with a smoking gun
    in your own hand.

    That is not the issue. The question is if the semantics of "goto
    label-valued-variable" are hard to define, as Ritchie said, or not, as
    Anton thinks Stallman said or would have said.

    So, label-variables are hard to define, but function-variables are not ?!?

    Depends on the level at which you want to define it.

    At the machine level, where semantics are (usually) defined for each instruction separately, a jump to a dynamic address (using a "label-variable") is not much different from a call to a dynamic address (using a "function-variable"), and the effect of the single instruction
    on the machine state is much the same as for the static address case.

    Yes,

    The higher-level effect on the further execution of the program is out
    of scope, whatever the actual value of the target address in the instruction.

    A good point:

    It is only if your machine has some semantics for instruction
    combinations, such as your VEC-LOOP pair, that you have to define what happens if a jump or call to some address leads to later executing only
    some of those instructions or executing them in the wrong order, such as trying to execute a LOOP without having executed a preceding VEC.

    BTW, encountering a LOOP without encountering a VEC is a natural
    occurrence when returning from an exception or interrupt. The VEC
    register points at the VEC+1 instruction, from which it is easy to
    return to the VEC instruction.

    It is the very mechanism whereby vectorized and multi-lane execution
    becomes scalar so that the debugger only sees scalar instructions.

    At the higher programming-language level, the label case can be much
    harder to define and less useful than the function case, depending on
    the programming language and its abstract model of execution, and also depending on what compile-time checks you assume.

    And what block boundaries are preserved (scope).

    Consider an imperative language such as C with no functions nested
    within other functions or other blocks (where by "block" I mean some syntactical construct that sets up its local context with local
    variables etc.). If you have a function-variable (that is, a pointer to
    a function) that actually refers to a function with the same parameter profile,

    It is this parameter profile (argument list) which separates
    goto label[i];
    from
    value = function[i](argument list);

    The dynamic goto is expected, by the SW writer, to carry all of the local
    scope content to the new label--and yet none of it is specified. It is
    this local scope content which is (IS) precisely specified with the
    dynamic call.

    it is easy to define the semantics of a call via this function variable: it is the same as for a call that names the referenced
    function statically, and such a call is always legal. Problems arise
    only if the function-variable has some invalid value such as NULL, or
    the address of a function with a different profile, or some code address that does not refer to (the start of) a function. Such invalid values
    can be prevented at compile time, except (usually) for NULL.

    In the same language setting, the semantics of a jump using a
    label-variable are easy to define only if the label-variable refers to a label in the same block as the jump. A jump from one block into another would mess up the context, omitting the set-up of the target block's
    context and/or omitting the tear-down of the source block's context. The further results of program execution are machine-dependent and so
    undefined behavior.

    Or worse:: when said label-variable was "trashed" by some attack vector,
    the label-variable can transfer control to literally anywhere.

    A compiler could enforce the label-in-same-block rule, but it seems that
    GNU C does not do so.

    In a programming language that allows nested functions the same kind of context-crossing problems arise for function-variables. Traditional languages solve them by allowing, at compile-time, calls via function-variables only if it is certain that the containing context of
    the callee still exists (if the callee is nested), or by (expensively) preserving that context as a dynamically constructed closure. In either case, the caller's context never needs to be torn down to execute the
    call, differing from the jump case.

    In summary, jumps via label-variables are useful only for control
    transfers within one function, and do not help to build up a computation
    by combining several functions -- the main method of program design at present. In contrast, calls via function-variables are a useful
    extension to static calls, actually helping to combine several functions
    in a computation, as shown by the general adoption of
    class/object/method coding styles.

    Thanks for your clear wording on why and why not.

    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Nov 11 20:44:47 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Test1 allocates a dynamic sized buffer and has a static goto Loop
    for which GCC generates a jne .L6 to a mov rsp, rbx that recovers
    the stack allocation inside the {} block.

    Test2 is the same but does a goto *dest and GCC does not generate
    code to recover the inner {} block allocation. It just loops over
    the sub rsp, rbx so the stack space just grows.

    Interestingly, gcc optimizes the indirect branch with a constant
    target into a direct branch, but then does not continue with the same
    code as you get with a plain goto.

    void Test2 (long len)
    {
    long ok;
    void *dest;

    dest = &&Loop;
    Loop:
    {
    char buf[len];

    ok = Sub (len, buf);
    if (ok)
    goto *dest;
    }
    }

    Test2(long):
    push rbp
    mov rbp, rsp
    push r12
    mov r12, rdi
    push rbx
    lea rbx, [rdi+15]
    shr rbx, 4
    sal rbx, 4
    .L8:
    sub rsp, rbx
    mov rdi, r12
    mov rsi, rsp
    call Sub(long, char*)
    test rax, rax
    jne .L8
    lea rsp, [rbp-16]
    pop rbx
    pop r12
    pop rbp
    ret

    Interesting that this bug has not been fixed in the >33 years that labels-as-values have been in gcc; I don't know how long these
    dynamically sized arrays have been in gcc, but IIRC alloca(), a
    similar feature, has been available at least as long as
    labels-as-values. The bug has apparently been avoided or worked
    around by the users of labels-as-values (e.g., Gforth does not use
    alloca or dynamically-sized arrays in the function that contains all
    the taken labels and all the "goto *"s).

    As long as all taken labels have the same stack depth, the bugfix does
    not look particularly hard: just put code before each goto * that
    adjusts the stack depth to the depth of these labels.

    Things become more interesting if there are labels with different
    stack depths, because labels are stored in "void *" variables, and
    there is not enough room for a target and a stack depth. One can use
    the same approach as is used in Test1, however: have the stack depth
    for a specific target in some location, and have a copy from that
    location to the stack pointer right behind the label.

    ...
    jmp .L2
    .L6:
    mov rsp, rbx
    .L2:
    ...
    jne .L6

    All the code that works now would not need these extra copy
    intructions, so the bugfix should special-case the case where all the
    targets have the same depth.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Tue Nov 11 21:10:09 2025
    From Newsgroup: comp.arch

    According to Niklas Holsti <niklas.holsti@tidorum.invalid>:
    Speaking of indirect addressing, the HP 2100 had a special feature: it
    had a 64 KB address space, but with word addressing of 16-bit words, so
    addresses were only 15 bits, leaving the MSbit in each word free.

    [multi-level indirect chains]

    That was quite common back in the day.

    The Data General Nova and Varian 620i (both popular for OEM
    applications) did exactly the same thing, 15 bit addresses with the
    high bit saying indirect.
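
    A hedged C sketch of that convention: one 16-bit word per memory cell,
    the MSbit meaning "indirect", resolved by chasing the chain. The names
    and the explicit depth limit are made up for illustration; the real
    machines differed in how, or whether, they bounded the chain, as
    discussed below.

    #include <stdint.h>

    /* Follow an indirect chain: while the MSbit is set, the low 15 bits
       name another word holding the (possibly again indirect) address. */
    uint16_t resolve(const uint16_t mem[32768], uint16_t word, int max_depth)
    {
        while ((word & 0x8000) && max_depth-- > 0)
            word = mem[word & 0x7FFF];
        return word & 0x7FFF;    /* final 15-bit effective address */
    }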

    The PDP-6/10 was a 36 bit machine with 18 bit addresses and a rather overimplemented addressing scheme -- each instruction had an address, an indirect bit, and an index register, so it added the address to the index register (if the register number wasn't zero), then if the indirect bit was set,
    fetch the addressed word and interpret its address, indirect bit, and index register the same way, ad infinitum.

    An interesting question is what happened if a computer got into an indirect loop. The Nova just hung unless it had the memory protection option which limited it to two levels of indirection. The PDP-6/10 could take an interrupt before each address calculation, which restarted when the interrupt returned. One day when I was feeling bored I wrote a program that did an ever longer indirect chain until the program stalled because it took longer than a clock interrupt time. The system was fine, only my program stalled. Dunno what
    the 620i did, I never ran into that particular bug and the manual doesn't say.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Nov 11 21:18:32 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:
    Typical process for NaN boxing is to set the high order bits of the
    value which causes the value to appear to be a NaN at higher precision.
    I have been thinking about using some of the high order bits of the NaN
    (eg bits 32 to 51) to indicate the precision of the boxed value. This
    would allow detection of the use of a lower precision value in
    arithmetic. Suppose a convert from single to double precision is being
    done, but the value to be converted is only half precision.

    Do you mean a type mismatch, a conversion, or digits lost due to
    cancellation?

    If it were
    indicated by the NaN software might be able to fix the result.

    Fixing a result after a NaN has occurred is too late, I think.

    I also
    preserve the sign bit of the number in the NaN box.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Nov 11 15:55:16 2025
    From Newsgroup: comp.arch

    On 11/11/2025 11:46 AM, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:


    snip
    It is only if your machine has some semantics for instruction
    combinations, such as your VEC-LOOP pair, that you have to define what
    happens if a jump or call to some address leads to later executing only
    some of those instructions or executing them in the wrong order, such as
    trying to execute a LOOP without having executed a preceding VEC.

    BTW, encountering a LOOP without encountering a VEC is a natural
    occurrence when returning from exception or interrupt. The VEC
    register points at the VEC+1 instruction which is easy to return
    to the VEC instruction.

    OK, but what if, say through an errant pointer, the code, totally
    unrelated to the VEC, jumps somewhere in the middle of a VEC/LOOP pair?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Nov 11 16:06:46 2025
    From Newsgroup: comp.arch

    On 11/11/2025 1:10 PM, John Levine wrote:
    According to Niklas Holsti <niklas.holsti@tidorum.invalid>:
    Speaking of indirect addressing, the HP 2100 had a special feature: it
    had a 64 KB address space, but with word addressing of 16-bit words, so
    addresses were only 15 bits, leaving the MSbit in each word free.

    [multi-level indirect chains]

    That was quite common back in the day.

    Yes, as I mentioned earlier in this thread, so did the Univac 1100 series.
    The Data General Nova and Varian 620i (both popular for OEM
    applications) did exactly the same thing, 15 bit addresses with the
    high bit saying indirect.

    The PDP-6/10 was a 36 bit machine with 18 bit addresses and a rather overimplemented addressing scheme -- each instruction had an address, an indirect bit, and an index register, so it added the address to the index register (if the register number wasn't zero), then if the indirect bit was set,
    fetch the addressed word and interpret its address, indirect bit, and index register the same way, ad infinitum.

    Yup. Similarly the 1100 series, a 36 bit machine with 18 bit addresses,
    had all of those features, plus one more. If the index register
    increment bit was set (in the instruction itself, or in each of the
    indirect words), the upper 18 bits of the index register were added
    (after indexing) to the lower 18 bits. This allowed some really
    "interesting" possible code when this was within a loop. :-)



    An interesting question is what happened if a computer got into an indirect loop.


    Yup. The 1100 prevented an infinite loop by having a hardware timer for
    each instruction. If the timer expired, an illegal operation exception occurred.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 12 00:31:24 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/11/2025 11:46 AM, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:


    snip
    It is only if your machine has some semantics for instruction
    combinations, such as your VEC-LOOP pair, that you have to define what
    happens if a jump or call to some address leads to later executing only
    some of those instructions or executing them in the wrong order, such as
    trying to execute a LOOP without having executed a preceding VEC.

    BTW, encountering a LOOP without encountering a VEC is a natural
    occurrence when returning from exception or interrupt. The VEC
    register points at the VEC+1 instruction which is easy to return
    to the VEC instruction.

    OK, but what if, say through an errant pointer, the code, totally
    unrelated to the VEC, jumps somewhere in the middle of a VEC/LOOP pair?

    All taken branches clear the V-bit associated with vectorization.
    So encountering the LOOP instruction would raise an exception.

    Flow control WITHIN a VEC-LOOP pair is by predication-only.
    Exception Control Transfer is special in this regards.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Nov 11 17:18:11 2025
    From Newsgroup: comp.arch

    On 11/11/2025 4:31 PM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/11/2025 11:46 AM, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:


    snip
    It is only if your machine has some semantics for instruction
    combinations, such as your VEC-LOOP pair, that you have to define what
    happens if a jump or call to some address leads to later executing only
    some of those instructions or executing them in the wrong order, such as
    trying to execute a LOOP without having executed a preceding VEC.

    BTW, encountering a LOOP without encountering a VEC is a natural
    occurrence when returning from exception or interrupt. The VEC
    register points at the VEC+1 instruction which is easy to return
    to the VEC instruction.

    OK, but what if, say through an errant pointer, the code, totally
    unrelated to the VEC, jumps somewhere in the middle of a VEC/LOOP pair?

    All taken branches clear the V-bit associated with vectorization.
    So encountering the LOOP instruction would raise an exception.

    Seems like the right thing to do. I believe this resolves Niklas's issue.

    Flow control WITHIN a VEC-LOOP pair is by predication-only.
    Exception Control Transfer is special in this regards.

    Makes sense.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Tue Nov 11 21:16:22 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Test1 allocates a dynamic sized buffer and has a static goto Loop
    for which GCC generates a jne .L6 to a mov rsp, rbx that recovers
    the stack allocation inside the {} block.

    Test2 is the same but does a goto *dest and GCC does not generate
    code to recover the inner {} block allocation. It just loops over
    the sub rsp, rbx so the stack space just grows.

    Interestingly, gcc optimizes the indirect branch with a constant
    target into a direct branch, but then does not continue with the same
    code as you get with a plain goto.

    void Test2 (long len)
    {
    long ok;
    void *dest;

    dest = &&Loop;
    Loop:
    {
    char buf[len];

    ok = Sub (len, buf);
    if (ok)
    goto *dest;
    }
    }

    Test2(long):
    push rbp
    mov rbp, rsp
    push r12
    mov r12, rdi
    push rbx
    lea rbx, [rdi+15]
    shr rbx, 4
    sal rbx, 4
    .L8:
    sub rsp, rbx
    mov rdi, r12
    mov rsi, rsp
    call Sub(long, char*)
    test rax, rax
    jne .L8
    lea rsp, [rbp-16]
    pop rbx
    pop r12
    pop rbp
    ret

    Interesting that this bug has not been fixed in the >33 years that labels-as-values have been in gcc; I don't know how long these
    dynamically sized arrays have been in gcc, but IIRC alloca(), a
    similar feature, has been available at least as long as
    labels-as-values. The bug has apparently been avoided or worked
    around by the users of labels-as-values (e.g., Gforth does not use
    alloca or dynamically-sized arrays in the function that contains all
    the taken labels and all the "goto *"s.

    alloca is not required to recover storage at the {} block level.
    MS C does not recover alloca space until the subroutine returns.

    But when they added dynamic allocation to C as a first class feature
    I figured it should recover storage at the end of a {} block,
    and I wondered if the superficially non-deterministic nature of
    goto variable would be a problem.

    As long as all taken labels have the same stack depth, the bugfix does
    not look particularly hard: just put code before each goto * that
    adjusts the stack depth to the depth of these labels.

    Things become more interesting if there are labels with different
    stack depths, because labels are stored in "void *" variables, and
    there is not enough room for a target and a stack depth. One can use
    the same approach as is used in Test1, however: have the stack depth
    for a specific target in some location, and have a copy from that
    location to the stack pointer right behind the label.

    ....
    jmp .L2
    .L6:
    mov rsp, rbx
    .L2:
    ....
    jne .L6

    All the code that works now would not need these extra copy
    instructions, so the bugfix should special-case the case where all the
    targets have the same depth.

    - anton

    Below in Test3 I replace the goto variable with a switch statement
    arranged to be nondeterministic, and it does get it right.
    I suggest GCC forgot to treat the goto variable as equivalent to a switch statement and threw up its hands and treated the buffer as an alloca.

    This all relates to Niklas's comments as to why the label variables must
    all be within the current context, so it knows when to recover storage.
    If the language had destructors, the goto variable would have to call them,
    which alloca also does not deal with.

    long Sub (long len, char buf[]);

    void Test3 (long len)
    {
    long ok, dest;

    dest = 0;
    Loop:
    {
    char buf[len];

    ok = Sub (len, buf);
    if (ok)
    dest = 1;

    switch (dest)
    {
    case 0:
    goto Loop;
    case 1:
    goto Out;
    }
    Out:
    ;
    }
    }

    # Compilation provided by Compiler Explorer at https://godbolt.org/
    Test3(long):
    push rbp
    mov rbp, rsp
    push r13
    mov r13, rdi
    push r12
    lea r12, [rdi+15]
    push rbx
    shr r12, 4
    sal r12, 4
    sub rsp, 8
    jmp .L2
    .L6:
    mov rsp, rbx
    .L2:
    mov rbx, rsp
    sub rsp, r12
    mov rdi, r13
    mov rsi, rsp
    call Sub(long, char*)
    test rax, rax
    je .L6
    lea rsp, [rbp-24]
    pop rbx
    pop r12
    pop r13
    pop rbp
    ret




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Nov 11 21:42:49 2025
    From Newsgroup: comp.arch

    On 2025-11-11 2:30 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Typical process for NaN boxing is to set the high order bits of the
    value which causes the value to appear to be a NaN at higher precision.

    Any FP value representable in lower precision can be exactly represented
    in higher precision.

    I have been thinking about using some of the high order bits of the NaN
    (eg bits 32 to 51) to indicate the precision of the boxed value.

    When My 66000 generates a NaN it inserts the cause in the 3 HoBs and
    inserts IP in the LoBs. Nothing prevents you from overwriting the NaN,
    but I thought it was best to point at the causing-instruction and an
    encoded "why" the nan was generated. The cause is a 3-bit index to the
    7 defined IEEE exceptions.

    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.
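
    A minimal C sketch of the "cause in the 3 LoBs" idea for binary64, just
    to make the bit layout concrete; the cause codes themselves are whatever
    the package defines, and this is not taken from either implementation.

    #include <stdint.h>
    #include <string.h>

    double make_nan_with_cause(unsigned cause3)
    {
        uint64_t bits = 0x7FF8000000000000ull     /* quiet NaN               */
                      | (uint64_t)(cause3 & 0x7); /* 3-bit cause in the LoBs */
        double d;
        memcpy(&d, &bits, sizeof d);
        return d;
    }
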

    There are rules when more than 1 NaN are an operand to an instruction designed to leave the more important NaN as the result. {Where more
    important is generally the first to be generated.}

    Hopefully the package follows the rules correctly. NaN operation is one
    thing not tested yet.

    This
    would allow detection of the use of a lower precision value in
    arithmetic. Suppose a convert from single to double precision is being
    done, but the value to be converted is only half precision. If it were
    indicated by the NaN software might be able to fix the result.

    I think it is better to fix the SW that thinks a (half) is a (float).

    It would be better, but some software is so complex that the values
    coming in may be unknown. The SW does not really need to croak if it's a
    lower-precision value, as such values are always representable in a
    higher precision.
    I also
    preserve the sign bit of the number in the NaN box.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Nov 11 21:46:02 2025
    From Newsgroup: comp.arch

    On 2025-11-11 4:18 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Typical process for NaN boxing is to set the high order bits of the
    value which causes the value to appear to be a NaN at higher precision.
    I have been thinking about using some of the high order bits of the NaN
    (eg bits 32 to 51) to indicate the precision of the boxed value. This
    would allow detection of the use of a lower precision value in
    arithmetic. Suppose a convert from single to double precision is being
    done, but the value to be converted is only half precision.

    Do you mean a type mismatch, a conversion, or digits lost due to cancellation?

    It would be an input type mismatch.
    If it were
    indicated by the NaN software might be able to fix the result.

    Fixing a result after an NaN has occurred is too late, I think.

    I suppose the float package could always just automatically upgrade the precision from lower to higher when it goes to do the calculation. But
    maybe with a trace warning. It would be able to if the precision were indicated in the NaN.

    I also
    preserve the sign bit of the number in the NaN box.
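
    A hypothetical C sketch of the boxing scheme being described: the
    half-precision bit pattern sits in the low bits, the binary64 exponent
    field is forced to all-ones so the value reads as a NaN at the higher
    precision, the sign bit is preserved, and a made-up tag in the bits
    32..51 region records the boxed precision. The tag values and exact
    layout here are illustrative only.

    #include <stdint.h>

    #define BOX_TAG_HALF   1ull   /* illustrative tag values */
    #define BOX_TAG_SINGLE 2ull

    uint64_t box_half(uint16_t h)
    {
        uint64_t box = 0x7FF8000000000000ull        /* quiet NaN at binary64  */
                     | (BOX_TAG_HALF << 32)         /* precision tag          */
                     | (uint64_t)(h & 0x7FFF);      /* half-precision payload */
        if (h & 0x8000)
            box |= 1ull << 63;                      /* preserve the sign bit  */
        return box;
    }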


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Nov 11 21:34:08 2025
    From Newsgroup: comp.arch

    On 11/11/2025 6:03 AM, Michael S wrote:
    On Tue, 11 Nov 2025 04:44:48 -0600
    BGB <cr88192@gmail.com> wrote:

    On 11/11/2025 4:02 AM, Michael S wrote:
    On Mon, 10 Nov 2025 21:25:47 -0600
    BGB <cr88192@gmail.com> wrote:

    On 11/10/2025 4:08 PM, Michael S wrote:
    On Mon, 10 Nov 2025 13:54:23 -0600
    BGB <cr88192@gmail.com> wrote:

    On 11/10/2025 1:16 AM, Terje Mathisen wrote:
    BGB wrote:
    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary
    FP. Partly as the strategy for generating the initial guess is
    far less accurate.

    So, it first uses a loop with hard-coded checks and scales to
    get it in the general area, before then letting N-R take over.
    If the value isn't close enough (seemingly +/- 25% or so), N-R
    flies off into space.

    Namely:
       Exponent is wrong:
         Scale by factors of 2 until correct;
       Off by more than 50%, scale by +/- 25%;
       Off by more than 25%, scale by +/- 12.5%;
       Else: Good enough, let normal N-R take over.

    My possibly naive idea would extract the top 9-15 digits from
    divisor and dividend, convert both to binary FP, do the division
    and convert back.

    That would reduce the NR step to two or three iterations, right?

    After adding code to feed to convert to/from 'double', and using
    this for initial reciprocal and square-root:
    DIV gets around 50% faster: ~ 1.5 MHz (~ 12x slower than
    MUL);

    That is your timing for Decimal128 on modern desktop PC?
    Dependent divisions or independent?
    Even for dependent, it sounds slow.


    Modern-ish...


    Zen2 ?
    I consider it the last of non-modern. Zen3 and Ice Lake are first
    of modern. 128by64 bit integer division on Zen2 is still quite slow
    and overall uArch is even less advanced than 10 y.o. Intel Skylake.
    In majority of real-world workloads it's partially compensated by
    Zen2 bigger L3 cache. In our case big cache does not help.
    But even last non-modern CPU shall be capable to divide faster than
    suggested by your numbers.


    Zen+

    Or, a slightly tweaked version of Zen1.


    It is very well possible to do big integer divide faster than this.
    Such as via shift-and-add.

    But, as for decimal, this makes it harder.


    I could do long division, but this is a much more complicated
    algorithm (versus using Newton-Raphson).

    But, N-R is slow as it is basically a bunch of operations, which are
    granted themselves, each kinda slow.



    I am running a CPU type that was originally released 7 years ago,
    with slower RAM than it was designed to work with.


    Did you try to compare against brute force calculation using GMP?
    https://gmplib.org/
    I.e. assuming that num < den < 10*num, use GMP to calculate 40
    decimal digits of intermediate result y as follows:
    Numx = num * 1e40;
    y = Numx/den;
    Yi = y / 1e6, Yf = y % 1e6 (this step does not require GMP, figure
    out why).
    If Yf != 5e5 then you are finished. Only in the extremely rare case (1 in
    a million) of Yf == 5e5 will you have to calculate the remainder of
    Numx/den to find the correct rounding.
    Somehow, I suspect that on a modern PC even a non-optimized method
    like the above will be faster than 670 usec.
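
    A rough GMP sketch of the step above, assuming num and den are already
    held as mpz_t integers; the rounding check on Yf is left out.

    #include <gmp.h>

    /* y = floor(num * 10^40 / den), i.e. ~40 decimal digits of num/den */
    void div_approx_40(mpz_t y, const mpz_t num, const mpz_t den)
    {
        mpz_t numx, p40;
        mpz_inits(numx, p40, NULL);
        mpz_ui_pow_ui(p40, 10, 40);    /* p40  = 10^40        */
        mpz_mul(numx, num, p40);       /* numx = num * 10^40  */
        mpz_tdiv_q(y, numx, den);      /* y    = numx / den   */
        mpz_clears(numx, p40, NULL);
    }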




    Well, first step is building with GCC rather than MSVC...

    It would appear that it gets roughly 79% faster when built with
    GCC. So, around 2 million divides per second.



    As for GMP, dividing two 40 digit numbers:
    22 million per second.
    If I do both a divide and a remainder:
    16 million.

    I don't really get what you are wanting me to measure exactly
    though...


    I want you to measure division of 74-digit integer by 34-digit
    integer, because it is the slowest part [of brute force
    implementation] of Decimal128 division. The rest of division is
    approximately the same as multiplication.
    So, [unoptimized] Decimal128 division time should be no worse than
    t1+t2, where t1 is duration of Decimal128 multiplication and t2 is
    duration of above-mentioned integer division. An estimate is
    pessimistic, because post-division normalization tends to be simpler
    than post-multiplication normalization.
    Optimized division would be faster yet.


    If it is a big-integer divide, this is not quite the same thing.

    And, if I were to use big-integer divide (probably not via GMP
    though,

    Certainly not via GMP in final product. But doing 1st version via GMP
    makes perfect sense.


    GMP is only really an option for targets where GMP exists;
    Needed to jump over to GCC in WSL just to test GMP here.

    If avoidable, you don't want to use anything beyond the C standard
    library, and ideally limit things to a C95 style dialect for maximum portability.

    Granted, it does appear like the GMP divider is faster than expected.
    Like, possibly something faster than "ye olde shift-and-subtract".




    Though, can note a curious property:
    This code is around 79% faster when built with GCC vs MSVC;
    In GCC, the relative speed of MUL and ADD trade places:
    In MSVC, MUL is faster;
    In GCC, ADD is faster.

    Though, the code in question tends to frequently use struct members
    directly, rather than caching multiply-accessed struct members in local variables. MSVC tends not to fully optimize away this sort of thing,
    whereas GCC tends to act as-if the struct members had in-fact been
    cached in local variables.


    this would be too big of a dependency), there is still the
    issue of efficiently converting between big-integer and the "groups
    of 9 digits in 32-bits" format.

    No, no, no. Not "group of 9 digits"! Plain unadulterated binary. 64
    binary 'digits' per 64-bit word.


    Alas, the code was written mostly to use 9-digit groupings, and going
    between 9-digit groupings and 128-bit integers is a bigger chunk of code
    than I want to have for this.

    This would mean an additional ~ 500 LOC, plus probably whatever code I
    need to do a semi-fast 256 by 128 bit integer divider.




    This is partly why I removed the BID code:
    At first, it seemed like the DPD and BID converters were similar
    speed; But, turns out I was still testing the DPD converter, and
    in-fact the BID converter was significantly slower.


    DPD-specific code and algorithms make sense for multiplication.
    They likely makes sense for addition/subtraction as well, I didn't try
    to think deeply about it.
    But for division I wouldn't bother with DPD-specific things. Just
    convert mantissa from DPD to binary, then divide, normalize, round then convert back.


    It is the 9-digit-decimal <-> Large Binary Integer converter step that
    is the main issue here.

    Going to/from 128-bit integer adds a few "there be dragons here" issues regarding performance.

    At the moment, I don't have a fast (and correct) converter between these
    two representations (that also does not rely on any external libraries
    or similar; or nothing outside of the C standard library).




    Like, if you need to crack 128 bits into 9-digit chunks using 128-bit
    divide, and if the 128-bit divider in question is a shift-and-subtract
    loop, this sucks.

    There are faster ways to do multiply by powers of 10, but divide by powers-of-10 is still a harder problem at the moment.
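
    For what it is worth, dividing by 10^9 does not need a general 128-bit
    divider if the number is kept as an array of 32-bit limbs; a sketch,
    with made-up names, of peeling off one 9-digit group per pass:

    #include <stdint.h>

    /* Divide a little-endian array of 32-bit limbs by 10^9 in place and
       return the remainder (the low 9 decimal digits). */
    uint32_t div_limbs_by_1e9(uint32_t *limb, int n)
    {
        uint64_t rem = 0;
        int i;
        for (i = n - 1; i >= 0; i--) {            /* high limb first */
            uint64_t cur = (rem << 32) | limb[i];
            limb[i] = (uint32_t)(cur / 1000000000u);
            rem     = cur % 1000000000u;
        }
        return (uint32_t)rem;
    }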

    Well, and also there is the annoyance that it is difficult to write an efficient 128-bit integer multiply if staying within the limits of
    portable C95.
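
    The usual workaround is to split the operands into 32-bit halves; a
    sketch, assuming an unsigned 64-bit type is available at all (which
    strict C95 does not guarantee, and which is part of the annoyance):

    typedef unsigned long long u64;   /* assumption: a 64-bit unsigned type */

    /* Full 64x64 -> 128-bit unsigned multiply built from 32x32 -> 64 pieces. */
    void mul64x64_128(u64 a, u64 b, u64 *hi, u64 *lo)
    {
        u64 a0 = a & 0xFFFFFFFFu, a1 = a >> 32;
        u64 b0 = b & 0xFFFFFFFFu, b1 = b >> 32;
        u64 p00 = a0 * b0, p01 = a0 * b1;
        u64 p10 = a1 * b0, p11 = a1 * b1;
        u64 mid = p01 + (p00 >> 32) + (p10 & 0xFFFFFFFFu);   /* cannot overflow */
        *lo = (mid << 32) | (p00 & 0xFFFFFFFFu);
        *hi = p11 + (p10 >> 32) + (mid >> 32);
    }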


    ...



    Goes off and tries a few things:
    128-bit integer divider;
    Various attempts at decimal long divide;
    ...

    Thus far, things have either not worked correctly, or have ended up
    slower than the existing Newton-Raphson divider.


    the most promising option would be Radix-10e9 long-division, but
    couldn't get this working thus far.

    Did also try Radix-10 long division (working on 72 digit sequences), but
    this was slower than the existing N-R divider.


    One possibility could be to try doing divide with Radix-10 in an
    unpacked BCD variant (likely using bytes from 0..9). Here, compare and
    subtract would be slower, but shifting could be faster, and allows a
    faster way (lookup tables) to find "A goes into B, N times".

    I still don't have much confidence in it though.


    Radix-10e9 has a higher chance of OK performance, if I could get the long-division algo to work correctly with it. Thus far, I was having difficulty getting it to give the correct answer. Integer divide was
    tending to overshoot the "A goes into B N times" logic, and trying to
    fudge it (eg, by adding 1 to the initial divisor) wasn't really
    working; kinda need an accurate answer here, and a reliable way to scale
    and add the divisor, ...


    Granted, one possibility could be to expand out each group of 9 digits
    to 64 bits, so effectively it has an intermediate 10 decimal digits of headroom (or two 10e9 "digits").

    But, yeah, long-division is a lot more of a PITA than N-R or shift-and-subtract.



    And, if I were going to do BID, would make more sense to do it as its
    own thing, and build it mostly around 128-bit integer math.


    But, in this case, I had decided to experiment with DPD.


    Most likely, in this case if I wanted faster divide, that also played
    well with the existing format, I would need to do long division or
    similar.





    If I compare against the IBM decNumber library:
    Multiply: 14 million.
    Divide: 7 million

    The decNumber library doesn't appear to have a square-root
    function...


    Granted, there are possibly faster ways to do divide, versus using
    Newton-Raphson in this case...

    It was not the point that I could pull the fastest possible
    implementation out of thin air. But, does appear I am beating
    decNumber at least for multiply performance and similar.




    Can note that while decNumber exists, at the moment, it is over 10x
    more code...





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Wed Nov 12 06:20:53 2025
    From Newsgroup: comp.arch

    In article <1762377694-5857@newsgrouper.org>,
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
    My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
    that can supply constants and perform operand routing. Within this
    range; instruction<8:5> specify the following table:

    0 0 0 0 +Src1 +Src2
    0 0 0 1 +Src1 -Src2
    0 0 1 0 -Src1 +Src2
    0 0 1 1 -Src1 -Src2
    0 1 0 0 +Src1 +imm5
    0 1 0 1 +Imm5 +Src2
    0 1 1 0 -Src1 -Imm5
    0 1 1 1 +Imm5 -Src2
    1 0 0 0 +Src1 Imm32
    1 0 0 1 Imm32 +Src2
    1 0 1 0 -Src1 Imm32
    1 0 1 1 Imm32 -Src2
    1 1 0 0 +Src1 Imm64
    1 1 0 1 Imm64 +Src2
    1 1 1 0 -Src1 Imm64
    1 1 1 1 Imm64 -Src2

    Here we have access to {5, 32, 64}-bit constants, 16-bit constants
    come from different OpCodes.

    Imm5 are the register specifier bits: range {-31..31} for integer and
    logical, range {-15.5..15.5} for floating point.

    For FP, Arm32 has an 8-bit immediate turned into an FP number as follows:

    sign = imm8<7>;
    exp = NOT(imm8<6>):Replicate(imm8<6>,E-3):imm8<5:4>;
    frac = imm8<3:0>:Zeros(F-4);
    result = sign : exp : frac;
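
    A C rendering of that expansion rule for single precision (E=8, F=23);
    a sketch to make the pseudo-code concrete, not lifted from any Arm
    source.

    #include <stdint.h>
    #include <string.h>

    float vfp_expand_imm8(uint8_t imm8)
    {
        uint32_t sign = (imm8 >> 7) & 1;
        uint32_t b6   = (imm8 >> 6) & 1;
        /* exp = NOT(b6) : Replicate(b6, 5) : imm8<5:4>   (8 bits) */
        uint32_t exp  = ((b6 ^ 1) << 7) | ((b6 ? 0x1Fu : 0) << 2) | ((imm8 >> 4) & 3);
        uint32_t frac = (uint32_t)(imm8 & 0xF) << 19;   /* imm8<3:0> : Zeros(19) */
        uint32_t bits = (sign << 31) | (exp << 23) | frac;
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;    /* e.g. imm8 = 0x70 decodes to 1.0f */
    }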

    For Float, exp[7:0] can be 0x80-0x83 or 0x7c-0x7f, which is 2^1 through 2^4
    and 2^-3 through 2^0. And the mantissa upper 4 bits are from the immediate field. Note that 0.0 is not encodeable, and I'm going to assume you
    don't need it either.

    For your FP, the sign comes from elsewhere, so you have 5 bits for the
    FP number. I suggest you use the Arm32 encoding for the exponent (using
    3 bits), and then set the upper 2 bits of the mantissa from the remaining
    two immediate bits.

    This encodes integers from 1.0 through 8.0, and can also encode 10.0, 12.0, 14.0, 16.0, 20.0, 24.0, and 28.0. And it can do 0.5, 1.5, 2.5, 3.5.
    And it can encode 0.125 and 0.25.

    This encoding makes a lot of sense from ease of decode. However, it
    would be nice to be able to encode 100.0, 1000.0 and .1, .01 and .001, each
    of which is likely to be more useful than 12.0 or 3.5.

    From a compiler standpoint, having arbitrary constants is perfectly fine,
    it can just look up if it's available. So you can make 1000.0 and .001
    and PI and lg2(e) and ln(2), and whatever available, if you want.
    GCC looks up Arm64 integer 13-bit immediates in a hashtable--the encoding
    is almost a one-way function, so it's just faster to look it up rather than
    try to figure out if 0xaaaaaaaa is encodeable by inspecting the value.
    So something similar could be done for FP constants. Since the values will
    be fixed, a perfect hash can be created ensuring it's a fast lookup.

    Kent
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Nov 12 07:19:36 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-11-11 4:18 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Typical process for NaN boxing is to set the high order bits of the
    value which causes the value to appear to be a NaN at higher precision.
    I have been thinking about using some of the high order bits of the NaN
    (eg bits 32 to 51) to indicate the precision of the boxed value. This
    would allow detection of the use of a lower precision value in
    arithmetic. Suppose a convert from single to double precision is being
    done, but the value to be converted is only half precision.

    Do you mean a type mismatch, a conversion, or digits lost due to
    cancellation?

    It would be an input type mismatch.

    I think this can only happen when software is buggy; compilers should
    deal with it, unless the user intentionally accesses data with
    the wrong type.

    If it were
    indicated by the NaN software might be able to fix the result.

    Fixing a result after an NaN has occurred is too late, I think.

    I suppose the float package could always just automatically upgrade the precision from lower to higher when it goes to do the calculation. But
    maybe with a trace warning. It would be able to if the precision were indicated in the NaN.

    I have implemented a few warning about conversions in gfortran.
    For example, -Wconversion-extra gives you, for the program

    program main
    print *,0.3333333333
    end program main

    the warning

    2 | print *,0.3333333333
    | 1
    Warning: Non-significant digits in 'REAL(4)' number at (1), maybe incorrect KIND [-Wconversion-extra]

    But my favorite is

    3 | print *,a**(3/5)
    | 1
    Warning: Integer division truncated to constant '0' at (1) [-Winteger-division]

    which (presumably) has caught that particular idiom in a few codes.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Nov 12 08:01:09 2025
    From Newsgroup: comp.arch

    Kent Dickey <kegs@provalid.com> schrieb:

    For FP, Arm32 has an 8-bit immediate turned into an FP number as follows:

    sign = imm8<7>;
    exp = NOT(imm8<6>):Replicate(imm8<6>,E-3):imm8<5:4>;
    frac = imm8<3:0>:Zeros(F-4);
    result = sign : exp : frac;

    For Float, exp[7:0] can be 0x80-0x83 or 0x7c-0x7f, which is 2^1 through 2^4 and 2^-3 through 2^0. And the mantissa upper 4 bits are from the immediate field. Note that 0.0 is not encodeable, and I'm going to assume you
    don't need it either.

    Looking at the statistics upthread, 0.0 is the most common floating
    point constant for My 66000 code.

    For your FP, the sign comes from elsewhere, so you have 5 bits for the
    FP number. I suggest you use the Arm32 encoding for the exponent (using
    3 bits), and then set the upper 2 bits of the mantissa from the remaining
    two immediate bits.

    This encodes integers from 1.0 through 8.0, and can also encode 10.0, 12.0, 14.0, 16.0, 20.0, 24.0, and 28.0. And it can do 0.5, 1.5, 2.5, 3.5.
    And it can encode 0.125 and 0.25.

    This encoding makes a lot of sense from ease of decode. However, it
    would be nice to be able to encode 100.0, 1000.0 and .1, .01 and .001, each of which is likely to be more useful than 12.0 or 3.5.

    This is really hard to quantify, and going by gut feeling is likely to
    give wrong results. Do you have any statistics, done on more software
    packages than what I have done, on the distribution of floating point
    constants?

    From a compiler standpoint, having arbitrary constants is perfectly fine,
    it can just look up if it's available. So you can make 1000.0 and .001
    and PI and lg2(e) and ln(2), and whatever available, if you want.
    GCC looks up Arm64 integer 13-bit immediates in a hashtable--the encoding
    is almost a one-way function, so it's just faster to look it up rather than try to figure out if 0xaaaaaaaa is encodeable out by inspecting the value.
    So something similar could be done for FP constants. Since the values will be fixed, a perfect hash can be created ensuring it's a fast lookup.

    Sure, it can be done, but I would like to do it on the basis of hard(er)
    data.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Nov 12 11:47:34 2025
    From Newsgroup: comp.arch

    On Tue, 11 Nov 2025 21:34:08 -0600
    BGB <cr88192@gmail.com> wrote:

    On 11/11/2025 6:03 AM, Michael S wrote:
    On Tue, 11 Nov 2025 04:44:48 -0600
    BGB <cr88192@gmail.com> wrote:


    Certainly not via GMP in final product. But doing 1st version via
    GMP makes perfect sense.


    GMP is only really an option for targets where GMP exists;

    Decimal128 is of interest only on targets where GMP exists.

    Needed to jump over to GCC in WSL just to test GMP here.


    So, you don't like msys2. It's your problem. Many Windows developers,
    myself included, find it handy. Esp. newer variant of tools, prefixed mingw-w64-ucrt-x86_64- .


    If avoidable, you don't want to use anything beyond the C standard
    library, and ideally limit things to a C95 style dialect for maximum portability.


    I almost agree, except for C95.
    C99 is maybe too much, but the C99 sub/superset known as C11 sounds
    about right.
    Also, I wouldn't consider such project without few extensions of
    standard language. As a minimum:
    - ability to get upper 64 bit of 64b*64b product
    - convenient way to exploit 64-bit add with carry
    - MS _BitScanReverse64 or Gnu __builtin_ctzll or equivalent
    The first and the second items are provided by Gnu __int128.
    All 3 items are available as standard features in C23, but I realize
    that for your purposes it is a bit too early to rely on C23.
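
    For the first two items, a GNU-extension sketch of what that looks like;
    these helper names are illustrative, not part of any library.

    typedef unsigned long long u64;

    u64 mulhi64(u64 a, u64 b)          /* upper 64 bits of the 64x64 product */
    {
        return (u64)(((unsigned __int128)a * b) >> 64);
    }

    u64 add64_with_carry(u64 a, u64 b, unsigned *carry)  /* 64-bit add w/ carry */
    {
        unsigned __int128 s = (unsigned __int128)a + b + *carry;
        *carry = (unsigned)(s >> 64);
        return (u64)s;
    }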

    But all that only applies to the final version of the library. At the stage
    of experimentation and proof of concept I suggest using any available
    tool. Including GMP.


    Granted, it does appear like the GMP divider is faster than expected.
    Like, possibly something faster than "ye olde shift-and-subtract".


    You see! It has already shown you something.
    The mere knowledge that something has already been done successfully by
    others is 2/3rds of what you need to accomplish the same by yourself.
    Even without looking at GMP sources. Which is certainly an option.




    Though, can note a curious property:
    This code is around 79% faster when built with GCC vs MSVC;
    In GCC, the relative speed of MUL and ADD trade places:
    In MSVC, MUL is faster;
    In GCC, ADD is faster.

    Though, the code in question tends to frequently use struct members directly, rather than caching multiply-accessed struct members in
    local variables. MSVC tends not to fully optimize away this sort of
    thing, whereas GCC tends to act as-if the struct members had in-fact
    been cached in local variables.


    this would be too big of a dependency), there is still the
    issue of efficiently converting between big-integer and the "groups
    of 9 digits in 32-bits" format.

    No, no, no. Not "group of 9 digits"! Plain unadulterated binary. 64
    binary 'digits' per 64-bit word.


    Alas, the code was written mostly to use 9-digit groupings, and going between 9-digit groupings and 128-bit integers is a bigger chunk of
    code than I want to have for this.


    Using 9-digit groups during conversions is a bad idea, both speed-wise
    and code complexity wise. Much better to use groups of 18 digits. Or
    15+19.


    This would mean an additional ~ 500 LOC, plus probably whatever code
    I need to do a semi-fast 256 by 128 bit integer divider.




    This is partly why I removed the BID code:
    At first, it seemed like the DPD and BID converters were similar
    speed; But, turns out I was still testing the DPD converter, and
    in-fact the BID converter was significantly slower.


    DPD-specific code and algorithms make sense for multiplication.
    They likely makes sense for addition/subtraction as well, I didn't
    try to think deeply about it.
    But for division I wouldn't bother with DPD-specific things. Just
    convert mantissa from DPD to binary, then divide, normalize, round
    then convert back.


    It is the 9-digit-decimal <-> Large Binary Integer converter step
    that is the main issue here.


    See above.

    Going to/from 128-bit integer adds a few "there be dragons here"
    issues regarding performance.


    Not really.
    That is, conversions are not blazingly fast, but still much better
    than any attempt to divide in any form of decimal. And helps to
    preserve your sanity.
    There is also a psychological factor at play - your users expect
    division and square root to be slower than other primitive FP
    operations, so they are not disappointed. Possibly they are even
    pleasantly surprised when they find out that the difference in
    throughput between division and multiplication is smaller than the
    factor of 20-30 that they were accustomed to for 'double' on their
    20 y.o. Intel and AMD.

    At the moment, I don't have a fast (and correct) converter between
    these two representations (that also does not rely on any external
    libraries or similar; or nothing outside of the C standard library).


    For 'correct', don't hesitate to use GMP.
    For 'not slow and correct' don't hesitate to use gnu extensions like
    __int128. After majority of work is done and you are reasonably
    satisfied with result, you can re-code in MS dialect, if that is your
    wish. That would be a simple mechanical work.




    Like, if you need to crack 128 bits into 9-digit chunks using 128-bit divide, and if the 128-bit divider in question is a
    shift-and-subtract loop, this sucks.

    There are faster ways to do multiply by powers of 10, but divide by powers-of-10 is still a harder problem at the moment.

    Well, and also there is the annoyance that it is difficult to write
    an efficient 128-bit integer multiply if staying within the limits of portable C95.


    ...



    Goes off and tries a few things:
    128-bit integer divider;
    Various attempts at decimal long divide;
    ...

    Thus far, things have either not worked correctly, or have ended up
    slower than the existing Newton-Raphson divider.


    the most promising option would be Radix-10e9 long-division, but
    couldn't get this working thus far.


    No, just no. Anything non-binary is no good for division.

    Did also try Radix-10 long division (working on 72 digit sequences),
    but this was slower than the existing N-R divider.


    One possibility could be to try doing divide with Radix-10 in an
    unpacked BCD variant (likely using bytes from 0..9). Here, compare
    and subtract would be sower, but shifting could be faster, and allows
    a faster way (lookup tables) to find "A goes into B, N times".

    I still don't have much confidence in it though.


    Radix-10e9 has a higher chance of OK performance, if I could get the long-division algo to work correctly with it. Thus far, I was having difficulty getting it to give the correct answer. Integer divide was
    tending to overshoot the "A goes into B N times" logic, and trying to
    fudge it (eg, but adding 1 to the initial divisor) wasn't really
    working; kinda need an accurate answer here, and a reliable way to
    scale and add the divisor, ...


    Granted, one possibility could be to expand out each group of 9
    digits to 64 bits, so effectively it has an intermediate 10 decimal
    digits of headroom (or two 10e9 "digits").

    But, yeah, long-division is a lot more of a PITA than N-R or shift-and-subtract.



    I am not totally sure what you mean by 'long division', 'N-R' and 'shift-and-subtract'. In my view, they are not really distinct. Shades
    of gray, rather than black-and-white.
    Without experimentation, I'd recommend something similar to what Terje suggested - calculate approximate reciprocal with 52-bit precision (by
    FP_DP division), then do 3 iterations. You can call them as you
    like, all three names above apply.
    I am not sure that it is the fastest method. It is possible that it
    is better to improve the reciprocal initially to 62-63 bits and then
    proceed with 2 iterations instead of 3.
    I *am* sure that the difference in speed between the two variants is not
    dramatic and that both of them are a lot faster than what you are doing
    today.
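
    The iteration being referred to, as a sketch: start from an approximate
    reciprocal r of the divisor d and apply r = r*(2 - d*r); each pass
    roughly doubles the number of correct bits. In the real Decimal128 code
    this would run on the extended-precision significand rather than on
    doubles.

    double refine_recip(double d, double r, int iters)
    {
        int i;
        for (i = 0; i < iters; i++)
            r = r * (2.0 - d * r);   /* Newton-Raphson step for 1/d */
        return r;
    }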


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 12 19:22:24 2025
    From Newsgroup: comp.arch


    kegs@provalid.com (Kent Dickey) posted:

    In article <1762377694-5857@newsgrouper.org>,
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
    My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
    that can supply constants and perform operand routing. Within this
    range; instruction<8:5> specify the following table:

    0 0 0 0 +Src1 +Src2
    0 0 0 1 +Src1 -Src2
    0 0 1 0 -Src1 +Src2
    0 0 1 1 -Src1 -Src2
    0 1 0 0 +Src1 +imm5
    0 1 0 1 +Imm5 +Src2
    0 1 1 0 -Src1 -Imm5
    0 1 1 1 +Imm5 -Src2
    1 0 0 0 +Src1 Imm32
    1 0 0 1 Imm32 +Src2
    1 0 1 0 -Src1 Imm32
    1 0 1 1 Imm32 -Src2
    1 1 0 0 +Src1 Imm64
    1 1 0 1 Imm64 +Src2
    1 1 1 0 -Src1 Imm64
    1 1 1 1 Imm64 -Src2

    Here we have access to {5, 32, 64}-bit constants, 16-bit constants
    come from different OpCodes.

    Imm5 are the register specifier bits: range {-31..31} for integer and
    logical, range {-15.5..15.5} for floating point.

    For FP, Arm32 has an 8-bit immediate turned into an FP number as follows:

    sign = imm8<7>;
    exp = NOT(imm8<6>):Replicate(imm8<6>,E-3):imm8<5:4>;
    frac = imm8<3:0>:Zeros(F-4);
    result = sign : exp : frac;

    For Float, exp[7:0] can be 0x80-0x83 or 0x7c-0x7f, which is 2^1 through 2^4 and 2^-3 through 2^0. And the mantissa upper 4 bits are from the immediate field. Note that 0.0 is not encodeable, and I'm going to assume you
    don't need it either.

    For your FP, the sign comes from elsewhere, so you have 5 bits for the
    FP number. I suggest you use the Arm32 encoding for the exponent (using
    3 bits), and then set the upper 2 bits of the mantissa from the remaining
    two immediate bits.

    This encodes integers from 1.0 through 8.0, and can also encode 10.0, 12.0, 14.0, 16.0, 20.0, 24.0, and 28.0. And it can do 0.5, 1.5, 2.5, 3.5.
    And it can encode 0.125 and 0.25.

    Thank you for this suggestion and clear explanation.

    This encoding makes a lot of sense from ease of decode. However, it
    would be nice to be able to encode 100.0, 1000.0 and .1, .01 and .001, each of which is likely to be more useful than 12.0 or 3.5.

    My 66000 also has complete 32-bit and 64-bit FP constants; somewhat
    lessening the need for imm5's to cover as wide a ground as possible.
    I will keep your scheme in mind.

    From a compiler standpoint, having arbitrary constants is perfectly fine,
    it can just look up if it's available.

    That is not the way My 66000 ISA works. All constants are available--
    the only thing the compiler has to determine is: does the constant fit
    in imm5, imm32, or imm64.

    So you can make 1000.0 and .001
    and PI and lg2(e) and ln(2), and whatever available, if you want.

    I already did. The compiler also uses CVT instructions when CVT can
    create a FP constant (say for an call argument) that has a smaller
    code footprint than just MOVing the constant to a register.

    GCC looks up Arm64 integer 13-bit immediates in a hashtable--the encoding
    is almost a one-way function, so it's just faster to look it up rather than try to figure out if 0xaaaaaaaa is encodeable out by inspecting the value.
    So something similar could be done for FP constants. Since the values will be fixed, a perfect hash can be created ensuring it's a fast lookup.

    Kent
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Wed Nov 12 21:56:32 2025
    From Newsgroup: comp.arch

    On 2025-11-12 3:18, Stephen Fuld wrote:
    On 11/11/2025 4:31 PM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/11/2025 11:46 AM, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:


    snip
    It is only if your machine has some semantics for instruction
    combinations, such as your VEC-LOOP pair, that you have to define what
    happens if a jump or call to some address leads to later executing only
    some of those instructions or executing them in the wrong order, such as
    trying to execute a LOOP without having executed a preceding VEC.

    BTW, encountering a LOOP without encountering a VEC is a natural
    occurrence when returning from exception or interrupt. The VEC
    register points at the VEC+1 instruction which is easy to return
    to the VEC instruction.

    OK, but what if, say through an errant pointer, the code, totally
    unrelated to the VEC, jumps somewhere in the middle of a VEC/LOOP pair?

    All taken branches clear the V-bit associated with vectorization.
    So encountering the LOOP instruction would raise an exception.

    Seems like the right thing to do.  I believe this resolves Niklas's issue.

    Yes, in the sense that this example supports my statement (above) that
    in a machine that has instruction combinations (like VEC-LOOP) that must
    be executed in a certain order, it is necessary to address what happens
    if a jump or call breaks that order, complicating the semantics
    definition. I agree that an exception seems the right thing to do here,
    and I expected it.

    Connecting this to the labels-as-values discussion, this means that a C compiler that compiles a C loop into a VEC-LOOP machine loop, and allows
    a "goto" to a label within that loop, from outside the loop, would
    result in execution that fails due to this exception, whether the label
    is statically named or referenced by a label-valued variable. So I would
    wish that the compiler would prevent that at compile time, to avoid
    possible UB.
    --
    Niklas Holsti

    niklas holsti tidorum fi
    . @ .

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 12 20:25:33 2025
    From Newsgroup: comp.arch


    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-12 3:18, Stephen Fuld wrote:
    On 11/11/2025 4:31 PM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/11/2025 11:46 AM, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:


    snip
    It is only if your machine has some semantics for instruction
    combinations, such as your VEC-LOOP pair, that you have to define what
    happens if a jump or call to some address leads to later executing only
    some of those instructions or executing them in the wrong order, such as
    trying to execute a LOOP without having executed a preceding VEC.

    BTW, encountering a LOOP without encountering a VEC is a natural
    occurrence when returning from exception or interrupt. The VEC
    register points at the VEC+1 instruction which is easy to return
    to the VEC instruction.

    OK, but what if, say through an errant pointer, the code, totally
    unrelated to the VEC, jumps somewhere in the middle of a VEC/LOOP pair?
    All taken branches clear the V-bit associated with vectorization.
    So encountering the LOOP instruction would raise an exception.

    Seems like the right thing to do.  I believe this resolves Niklas's issue.

    Yes, in the sense that this example supports my statement (above) that
    in a machine that has instruction combinations (like VEC-LOOP) that must
    be executed in a certain order, it is necessary to address what happens
    if a jump or call breaks that order, complicating the semantics
    definition. I agree that an exception seems the right thing to do here,
    and I expected it.

    Connecting this to the labels-as-values discussion, this means that a C compiler that compiles a C loop into a VEC-LOOP machine loop, and allows
    a "goto" to a label within that loop, from outside the loop, would
    result in execution that fails due to this exception, whether the label
    is statically named or referenced by a label-valued variable. So I would wish that the compiler would prevent that at compile time, to avoid
    possible UB.

It seems to me that taking the value of a label within a VEC-LOOP
could be prevented by the compiler--or cause the potentially vectorized
loop to become a scalar loop with spaghetti control flow. Just as taking
the address of a variable prevents the compiler from allocating it to a
register, taking the address of a label forces the encompassing loop
to remain scalar.
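
As a concrete (and purely illustrative) GNU C sketch of the situation,
a computed goto whose target labels sit inside the loop body; the names
here are made up, only the &&/goto * machinery is the real feature:

/* A compiler that would otherwise emit a VEC...LOOP pair for this loop
   has to keep it scalar once label addresses inside it are taken and
   reached through a computed goto. */
void scale(float *a, int n, int odd_case)
{
    void *resume = odd_case ? &&fixup : &&next;

    for (int i = 0; i < n; i++) {
        a[i] *= 2.0f;
        goto *resume;        /* indirect jump to a label inside the loop */
    fixup:
        a[i] += 1.0f;
    next:
        ;
    }
}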


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 12 20:27:43 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-11-11 4:18 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
Typical process for NaN boxing is to set the high order bits of the
value which causes the value to appear to be a NaN at higher precision.
I have been thinking about using some of the high order bits of the NaN
(eg bits 32 to 51) to indicate the precision of the boxed value. This
would allow detection of the use of a lower precision value in
arithmetic. Suppose a convert from single to double precision is being
done, but the value to be converted is only half precision.

    Do you mean a type mismatch, a conversion, or digits lost due to
    cancellation?

It would be an input type mismatch.

    I think this can only happen when software is buggy; compilers should
    deal with it, unless the user intentionally accesses data with
    the wrong type.

    If it were
    indicated by the NaN software might be able to fix the result.

Fixing a result after a NaN has occurred is too late, I think.

    I suppose the float package could always just automatically upgrade the precision from lower to higher when it goes to do the calculation. But maybe with a trace warning. It would be able to if the precision were indicated in the NaN.
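
For concreteness, a sketch in C of what such a boxing scheme might look
like; the tag position (bits 48..49 here) and the tag values are made up
for illustration, only the general shape follows the scheme described above:

#include <stdint.h>
#include <string.h>

/* Upper bits force a quiet NaN at double precision, a 2-bit tag in
   bits 48..49 records the precision of the boxed value, and the
   payload sits in the low 32 bits. */
#define BOX_NAN_HI      0xFFF8000000000000ull  /* sign, exponent, quiet bit */
#define BOX_PREC_SHIFT  48
#define BOX_PREC_HALF   1ull
#define BOX_PREC_SINGLE 2ull

static uint64_t box_single(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return BOX_NAN_HI | (BOX_PREC_SINGLE << BOX_PREC_SHIFT) | bits;
}

static int boxed_precision(uint64_t v)   /* 0 = no tag, 1 = half, 2 = single */
{
    return (int)((v >> BOX_PREC_SHIFT) & 0x3);
}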

I have implemented a few warnings about conversions in gfortran.
    For example, -Wconversion-extra gives you, for the program

    program main
    print *,0.3333333333
    end program main

    the warning

    2 | print *,0.3333333333
    | 1
    Warning: Non-significant digits in 'REAL(4)' number at (1), maybe incorrect KIND [-Wconversion-extra]

    But my favorite is

    3 | print *,a**(3/5)

    BTW, this works in eXcel where 3/5 = 0.6

    AND, in My 66000, a**0.6 is a single instruction. ...

    | 1
    Warning: Integer division truncated to constant '0' at (1) [-Winteger-division]

    which (presumably) has caught that particular idiom in a few codes.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Thu Nov 13 01:35:37 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Niklas Holsti wrote:
    On 2025-11-06 20:28, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-05 23:28, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
    ----------------
But then you could get the problem of a longjmp to a setjmp value that
is stale because the targeted function invocation (stack frame) is no
longer there.

    But YOU had to pass the jumpbuf out of the setjump() scope.

Now, YOU complain there is a hole in your own foot with a smoking gun
in your own hand.

That is not the issue. The question is if the semantics of "goto
label-valued-variable" are hard to define, as Ritchie said, or not, as
Anton thinks Stallman said or would have said.

    So, label-variables are hard to define, but function-variables are not
    ?!?

    Depends on the level at which you want to define it.

    At the machine level, where semantics are (usually) defined for each
    instruction separately, a jump to a dynamic address (using a
    "label-variable") is not much different from a call to a dynamic address
    (using a "function-variable"), and the effect of the single instruction
    on the machine state is much the same as for the static address case.
    The higher-level effect on the further execution of the program is out
    of scope, whatever the actual value of the target address in the
    instruction.

    It is only if your machine has some semantics for instruction
    combinations, such as your VEC-LOOP pair, that you have to define what
    happens if a jump or call to some address leads to later executing only
    some of those instructions or executing them in the wrong order, such as
    trying to execute a LOOP without having executed a preceding VEC.

    At the higher programming-language level, the label case can be much
    harder to define and less useful than the function case, depending on
    the programming language and its abstract model of execution, and also
    depending on what compile-time checks you assume.

    Consider an imperative language such as C with no functions nested
    within other functions or other blocks (where by "block" I mean some
    syntactical construct that sets up its local context with local
    variables etc.). If you have a function-variable (that is, a pointer to
    a function) that actually refers to a function with the same parameter
    profile, it is easy to define the semantics of a call via this function
    variable: it is the same as for a call that names the referenced
    function statically, and such a call is always legal. Problems arise
    only if the function-variable has some invalid value such as NULL, or
    the address of a function with a different profile, or some code address
    that does not refer to (the start of) a function. Such invalid values
    can be prevented at compile time, except (usually) for NULL.

    In the same language setting, the semantics of a jump using a
    label-variable are easy to define only if the label-variable refers to a
    label in the same block as the jump. A jump from one block into another
    would mess up the context, omitting the set-up of the target block's
    context and/or omitting the tear-down of the source block's context. The
    further results of program execution are machine-dependent and so
    undefined behavior.

    A compiler could enforce the label-in-same-block rule, but it seems that
    GNU C does not do so.

    In a programming language that allows nested functions the same kind of
    context-crossing problems arise for function-variables. Traditional
    languages solve them by allowing, at compile-time, calls via
    function-variables only if it is certain that the containing context of
    the callee still exists (if the callee is nested), or by (expensively)
    preserving that context as a dynamically constructed closure. In either
    case, the caller's context never needs to be torn down to execute the
    call, differing from the jump case.

    In summary, jumps via label-variables are useful only for control
    transfers within one function, and do not help to build up a computation
    by combining several functions -- the main method of program design at
    present. In contrast, calls via function-variables are a useful
    extension to static calls, actually helping to combine several functions
    in a computation, as shown by the general adoption of
    class/object/method coding styles.

    Niklas


    I was curious about the interaction between dynamic stack allocations
    and goto variables to see if it handled the block scoping correctly.
    Ada should have the same issues as C.
    It appears GCC x86-64 15.2 with -O3 does not properly recover
    stack space with dynamic goto's.

    Test1 allocates a dynamic sized buffer and has a static goto Loop
    for which GCC generates a jne .L6 to a mov rsp, rbx that recovers
    the stack allocation inside the {} block.

    Test2 is the same but does a goto *dest and GCC does not generate
    code to recover the inner {} block allocation. It just loops over
    the sub rsp, rbx so the stack space just grows.

    long Sub (long len, char buf[]);

    void Test1 (long len)
    {
    long ok;

    Loop:
    {
    char buf[len];

    ok = Sub (len, buf);
    if (ok)
    goto Loop;
    }
    }

IIRC there is clear statement in the C standard that you are not
allowed to jump into a scope after a dynamic declaration. This
restriction is because otherwise compiler would need some twisty
logic to run allocation code. With label variables that obviously
generalizes to jumps outside of scope of dynamic allocation:
compiler does not try to recover allocated storage. Your code
does not differ much from infinite recursion. In case of
infinite recursion compiler _may_ be able to optimize things
so that they run in constant memory, but usually such
recursion will lead to stack overflow.

    So natural restriction is: when jumping to label variable
    dynamic locals may be released only at function exit.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 12 23:59:30 2025
    From Newsgroup: comp.arch

    On 2025-11-12 3:27 p.m., MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-11-11 4:18 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Typical process for NaN boxing is to set the high order bits of the
value which causes the value to appear to be a NaN at higher precision.
I have been thinking about using some of the high order bits of the NaN
(eg bits 32 to 51) to indicate the precision of the boxed value. This
would allow detection of the use of a lower precision value in
arithmetic. Suppose a convert from single to double precision is being
done, but the value to be converted is only half precision.

    Do you mean a type mismatch, a conversion, or digits lost due to
    cancellation?

It would be an input type mismatch.

    I think this can only happen when software is buggy; compilers should
    deal with it, unless the user intentionally accesses data with
    the wrong type.

    If it were
    indicated by the NaN software might be able to fix the result.

    Fixing a result after an NaN has occurred is too late, I think.

    I suppose the float package could always just automatically upgrade the
    precision from lower to higher when it goes to do the calculation. But
    maybe with a trace warning. It would be able to if the precision were
    indicated in the NaN.

    I have implemented a few warning about conversions in gfortran.
    For example, -Wconversion-extra gives you, for the program

    program main
    print *,0.3333333333
    end program main

    the warning

    2 | print *,0.3333333333
    | 1
    Warning: Non-significant digits in 'REAL(4)' number at (1), maybe incorrect KIND [-Wconversion-extra]

    But my favorite is

    3 | print *,a**(3/5)

    It has been a long while since I did any Fortran code – back in school
    40ish years ago. I hardly recognize it. I think I kept my Fortran
    textbook somewhere.
    I have used VBA in eXcel with varying degrees of luck.

    The number line is infinitely discontinuous!

    BTW, this works in eXcel where 3/5 = 0.6

    AND, in My 66000, a**0.6 is a single instruction. ...

    The right way of doing things.

    Qupls allows up to three constants per instruction which follow the instruction in specialized NOPs. It is only slightly less compact to
    encode the constants in NOPs. While the opcode for a NOP does use some
    room, multiple constants can be encoded in it. It sure makes the front
    end easier as there are no variable length instructions to deal with.

    Coded a fused dot product today. Prelim testing shows it matches the
    output of the compiler running an executable on the PC about 50% of the
    time. I checked a few of the mismatches and they were out only by 1 in
    the LSB. So, it is probably good enough for my purposes.
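
For comparison on the PC side, one plausible reference is to fold each
product into the accumulator with C's fma(), so there is a single
rounding per element; a different accumulation order or width is
already enough to explain 1-ulp mismatches:

#include <math.h>

/* Host-side reference for a fused dot product. */
static double dot_fma(const double *a, const double *b, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc = fma(a[i], b[i], acc);   /* one rounding per element */
    return acc;
}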

    | 1
    Warning: Integer division truncated to constant '0' at (1) [-Winteger-division]

    which (presumably) has caught that particular idiom in a few codes.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Nov 13 07:24:15 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    But my favorite is

    3 | print *,a**(3/5)

    BTW, this works in eXcel where 3/5 = 0.6

    C has the same semantics for integer division:

    $ cat int.c && gcc int.c && ./a.out
    #include <stdio.h>
    int main()
    {
    printf("%d\n",3/5);
    return 0;
    }
    0

    It's one of those things that take people by surprise, and
    exponentiation is one of the places where it may not be seen easily.
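
The same trap, spelled in C for comparison:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double a = 2.0;
    printf("%g\n", pow(a, 3 / 5));      /* 3/5 is integer 0, so this prints 1 */
    printf("%g\n", pow(a, 3.0 / 5.0));  /* what was probably meant */
    return 0;
}
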
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 13 08:42:35 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    void Test2 (long len)
    {
    long ok;
    void *dest;

    dest = &&Loop;
    Loop:
    {
    char buf[len];

    ok = Sub (len, buf);
    if (ok)
    goto *dest;
    }
    }

    Test2(long):
    push rbp
    mov rbp, rsp
    push r12
    mov r12, rdi
    push rbx
    lea rbx, [rdi+15]
    shr rbx, 4
    sal rbx, 4
    .L8:
    sub rsp, rbx
    mov rdi, r12
    mov rsi, rsp
    call Sub(long, char*)
    test rax, rax
    jne .L8
    lea rsp, [rbp-16]
    pop rbx
    pop r12
    pop rbp
    ret

    Interesting that this bug has not been fixed in the >33 years that
    labels-as-values have been in gcc; I don't know how long these
    dynamically sized arrays have been in gcc, but IIRC alloca(), a
    similar feature, has been available at least as long as
    labels-as-values. The bug has apparently been avoided or worked
    around by the users of labels-as-values (e.g., Gforth does not use
    alloca or dynamically-sized arrays in the function that contains all
    the taken labels and all the "goto *"s.

    alloca is not required to recover storage at the {} block level.

    Good point. So if you do

    for (i=0; i<1000000000; i++) {
    char *s = alloca(1000+i%1024);
    ... use s ...
    }

    and the program runs out of memory, it's a bug in your C source code,
    whereas if you do

    for (i=0; i<1000000000; i++) {
    char s[1000+i%1024];
    ... use s ...
    }

    and the program runs out of memory, it's a bug in the compiler.

    So this bug has only existed since dynamically-sized arrays were added
    to gcc (probably just a quarter-century or so).

    But when they added dynamic allocation to C as a first class feature
    I figured it should recover storage at the end of a {} block,
and I wondered if the superficially non-deterministic nature of
    goto variable would be a problem.

    I outlined a correct implementation in my previous posting. The
    general way is basically the same that gcc already uses for the direct
    goto, as shown in your test1. Have a jump target that copies the
    stack depth for the label from another location, and use that jump
    target as the taken address. E.g.:

    L1:
    ...
    { int foo[n];
    ...
    L2:
    ...
    { int bar[n2];
    ...
    L3:
    ...
    void *labels[] = {&&L1, &&L2, &&L3, &&L4, &&L5};
    ...
    goto *labels[i];
    }
    ...
    L4:
    ...
    }
    ...
    L5:
    ...

    would be compiled to

    L1x: # used for &&L1 and for direct gotos where %rsp may be different
    mov L1L5_depth(%rbp), %rsp
    L1y: # used for direct gotos where %rsp is the same
    ...
    L2x: # used for &&L2 and for direct gotos where %rsp may be different
    mov L2L4_depth(%rbp), %rsp
    L2y:
    ...
    L3x: # the only goto * is at the same %rsp depth, so no mov needed
    L3y:
    ...
jmp *%rcx
    ...
L4x: # used for &&L4 and for direct gotos where %rsp may be different
    mov L2L4_depth(%rbp), %rsp
    L4y:
    ...
L5x: # used for &&L5 and for direct gotos where %rsp may be different
    mov L1L5_depth(%rbp), %rsp
    L5y: # used for direct gotos where %rsp is the same
    ...

    And of course, for those programs that do not combine these features,
    all labels would turn out like L3, i.e., without the extra mov.

    This all relates to Niklas's comments as to why the label variables must
    all be within the current context, so it knows when to recover storage.

    The gcc documentation specifies that the labels must be in the same
    function as the goto, so the compiler does not have to do stack
    unwinding which the Pascal compiler has to do for the Pascal goto.

If the language had destructors the goto variable could have to call them,
which alloca also does not deal with.

    GNU C has no destructors.

    long Sub (long len, char buf[]);

    void Test3 (long len)
    {
    long ok, dest;

    dest = 0;
    Loop:
    {
    char buf[len];

    ok = Sub (len, buf);
    if (ok)
    dest = 1;

    switch (dest)
    {
    case 0:
    goto Loop;
    case 1:
    goto Out;
    }
    Out:
    ;
    }
    }

    That actually tests direct goto. For the switch, one could wonder
    about stuff like

    switch (...) {
    char s[n];
    case 1:
    ... s[i] ...
    { char t[m];
    case 2:
    ... t[i]...
    }
    }

    But I expect that this has been declared undefined behaviour at some point.

    At least the block structure protects the switch from having case
    labels in outer scopes (in contrast to the labels-as-values example
    above).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 13 09:24:20 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    I almost agree, except for C95.

    What is C95? I only know of C89/90, C99, C11, C23.

    Also, I wouldn't consider such project without few extensions of
    standard language. As a minimum:
    - ability to get upper 64 bit of 64b*64b product
    - convenient way to exploit 64-bit add with carry

    I have explored these topics recently in "Multi-precision integer
    arithmetics" <http://www.complang.tuwien.ac.at/anton/tmp/carry2.pdf>.

    Actually, with uint128_t you get pretty far, and _BitInt(bits) has
    been added in C23, which has good potential, but is not quite there.
    Builtins for add-with-carry and intrinsics are somewhat disappointing.
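
A small sketch of how far the __int128 extension alone gets you (the
high half of a 64x64-bit product, and a two-word add), without any
carry builtins:

#include <stdint.h>

typedef unsigned __int128 u128;   /* gcc/clang extension */

static uint64_t mulhi_u64(uint64_t a, uint64_t b)
{
    return (uint64_t)(((u128)a * b) >> 64);   /* upper 64 bits of the product */
}

static void add128(uint64_t a_hi, uint64_t a_lo,
                   uint64_t b_hi, uint64_t b_lo,
                   uint64_t *c_hi, uint64_t *c_lo)
{
    u128 s = (((u128)a_hi << 64) | a_lo) + (((u128)b_hi << 64) | b_lo);
    *c_lo = (uint64_t)s;
    *c_hi = (uint64_t)(s >> 64);
}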

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 13 09:45:51 2025
    From Newsgroup: comp.arch

    antispam@fricas.org (Waldek Hebisch) writes:
    IIRC there is clear statement in the C standard that you are not
    allowed to jump into a scope after a dynamic declaration. This
    restriction is because otherwise compiler would need some twisty
    logic to run allocation code.

    Not just that. If the dynamic definition is not executed, it's
    unclear how much should be allocated. Consider:

    n=-5;
    goto L;
    n = m; // dead code
    {
    int x[n]; // dead code
    n=0; // dead code
    L:
    ... x[3] ...
    ...
    }

    With label variables that obvoiusly
    generalizes to jumps outside of scope of dynamic allocation:

    This is a use of "obviously" that wants the reader to skip thinking
    about the issue (and maybe the writer has not thought about it,
    either). But actually, the cases are completely different.

    If control flow passed through the dynamic definition on the way to
    the goto, the stack depth in its scope is known, and can be restored
    when performing the goto, as I showed in <2025Nov13.094235@mips.complang.tuwien.ac.at>.

    So natural restriction is: when jumping to label variable
    dynamic locals may be released only at function exit.

    A compiler bug is not a natural restriction. Of course, the gcc
    people might decide not to fix the bug (after all, no production code
    is affected by this bug), and declare it undefined behaviour to, say,
    perform a goto * inside a scope with a dynamic array that jumps
    outside the scope, but if they do something like this, it's a human
    decision based on a cost-benefit analysis, not something natural.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Nov 13 12:18:47 2025
    From Newsgroup: comp.arch

    On Thu, 13 Nov 2025 09:24:20 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    I almost agree, except for C95.

    What is C95? I only know of C89/90, C99, C11, C23.



    I didn't hear about it until mentioned by BGB here.
    According to Wikipedia, a minor modification of C89/90 called C94 or
    C95 indeed exists.

    Also, I wouldn't consider such project without few extensions of
    standard language. As a minimum:
    - ability to get upper 64 bit of 64b*64b product
    - convenient way to exploit 64-bit add with carry

    I have explored these topics recently in "Multi-precision integer arithmetics" <http://www.complang.tuwien.ac.at/anton/tmp/carry2.pdf>.

    Actually, with uint128_t you get pretty far, and _BitInt(bits) has
    been added in C23, which has good potential, but is not quite there.

Yes, that's what I wrote above.
    As far as BGB is concerned, the big disadvantage is absence of support
    by MSVC.

    Builtins for add-with-carry and intrinsics are somewhat disappointing.

    - anton

    For me the most disappointing part is that different architectures
    have different spellings. In case of Arm64, I don't even know what is
    correct spelling. Other than that even gcc now mostly able to generate
    decent code for Intel's variant. MSVC and clang were able to do it for
    very long time.
    Or do you have in mind new gcc intrinsic in a group "Arithmetic with
    Overflow Checking" ? Those are for completely different purpose.
    Sometimes they can be abused for multiple-precision arithmetic, but one
    should not be surprised when results are disappointing.




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Thu Nov 13 17:35:50 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    IIRC there is clear statement in the C standard that you are not
    allowed to jump into a scope after a dynamic declaration. This
    restriction is because otherwise compiler would need some twisty
    logic to run allocation code.

    Not just that. If the dynamic definition is not executed, it's
    unclear how much should be allocated. Consider:

    n=-5;
    goto L;
    n = m; // dead code
    {
    int x[n]; // dead code
    n=0; // dead code
    L:
    ... x[3] ...
    ...
    }

    With label variables that obvoiusly
    generalizes to jumps outside of scope of dynamic allocation:

    This is a use of "obviously" that wants the reader to skip thinking
    about the issue (and maybe the writer has not thought about it,
    either). But actually, the cases are completely different.

    If control flow passed through the dynamic definition on the way to
    the goto, the stack depth in its scope is known, and can be restored
    when performing the goto, as I showed in <2025Nov13.094235@mips.complang.tuwien.ac.at>.

    So natural restriction is: when jumping to label variable
    dynamic locals may be released only at function exit.

    A compiler bug is not a natural restriction. Of course, the gcc
    people might decide not to fix the bug (after all, no production code
    is affected by this bug), and declare it undefined behaviour to, say,
    perform a goto * inside a scope with a dynamic array that jumps
    outside the scope, but if they do something like this, it's a human
    decision based on a cost-benefit analysis, not something natural.

    It is natural result of cost-benefit analysis in a language like
    C. I know something about related issues: I tried to implement
    nicer semantic of goto-s in a language having 'finally' blocks
    and destructors. Basically, making goto-s behave similarly to
    exceptions and labels like exception handlers. Simple goto
    got turned into twisted maze taking care that relevant
    exception handler are executed when exiting a scope. That
    worked for one target and one gcc version. It did not work
    for different targets and got completely broken by newer gcc
versions. When my code worked it frequently led to slower
code.

    In higher level language one can have nice semantic for gotos:
    goto is essentially the function call to a parameterless local
    function. But implementing goto this way almost surely will
    negate _your_ reason to use computed gotos: goto implemented in
    such a way is likely to be slower than normal function calls
    via function pointers. C normally offers construct which
    have reasonably simple mapping to machine instructions and
    avoid "nice" constructs that require extensive code
    transformations. So the only natural definition in C is to
    avoid nice semantics like above and declare restrictions.
    Declaring computed jump out of scope as undefined is
    reasonably natural. But jumps are frequently used for
    abnormal exits and in such case one wants to exit a scope.
    So not reclaiming memory allocated in the scope is more
    natural restriction.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Bernd Linsel@bl1-thispartdoesnotbelonghere@gmx.com to comp.arch on Thu Nov 13 19:32:12 2025
    From Newsgroup: comp.arch

    On 11/13/25 09:42, Anton Ertl wrote:

    GNU C has no destructors.
    It has, in limited form via __attribute__((__cleanup__(...)))

    see https://gcc.gnu.org/onlinedocs/gcc-15.2.0/gcc/Common-Variable-Attributes.html#index-cleanup-variable-attribute
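
A minimal sketch of its use (free_ptr is just a helper name made up
here; the attribute itself is the GNU C feature):

#include <stdio.h>
#include <stdlib.h>

static void free_ptr(void *p)        /* receives a pointer to the variable */
{
    free(*(void **)p);
}

void demo(void)
{
    char *buf __attribute__((cleanup(free_ptr))) = malloc(64);
    if (!buf)
        return;                      /* cleanup runs on every scope exit */
    snprintf(buf, 64, "hello");
    puts(buf);
}                                    /* ...including here */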

    Regards,
    Bernd

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 13 18:09:12 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 13 Nov 2025 09:24:20 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    Actually, with uint128_t you get pretty far, and _BitInt(bits) has
    been added in C23, which has good potential, but is not quite there.

    Yes, that what I wrote above.
    As far as BGB is concerned, the big disadvantage is absence of support
    by MSVC.

    Why would that be a disadvantage? If MSVC does not do what he needs,
    there are other C compilers to choose from.

    Builtins for add-with-carry and intrinsics are somewhat disappointing.

    - anton

    For me the most disappointing part is that different architectures
    have different spellings.

    For intrinsics that's by design. They are essentially a way to write
    assembly language instructions in Fortran or C. And assembly language
    is compiler-specific.

    Other than that even gcc now mostly able to generate
    decent code for Intel's variant. MSVC and clang were able to do it for
    very long time.

    When using the Intel intrinsic c_out = _addcarry_u64(c_in, s1, s2,&sum),
    the code from both gcc and clang uses adcq, but cannot preserve the
    carry in CF in a loop, and moves it into a register right after the
    adcq, and back from the register to CF right before:

    addb $-1, %r8b
    adcq (%rdx,%rax,8), %r9
    setb %r8b

    If you (or compiler unrolling) have several _addcarry_u64 in a row,
    with the carry-out becoming the carry-in of the next one, at least one
    of these compilers manages to eliminate the overhead between these
    adcqs, but of course not at the start and end of the sequence.
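
For reference, the kind of loop in question, sketched with the Intel
intrinsic (operand types chosen to match its unsigned long long
signature):

#include <stddef.h>
#include <immintrin.h>

/* r = a + b over n 64-bit words, least significant word first. */
static void addn(unsigned long long *r,
                 const unsigned long long *a,
                 const unsigned long long *b, size_t n)
{
    unsigned char c = 0;
    for (size_t i = 0; i < n; i++)
        c = _addcarry_u64(c, a[i], b[i], &r[i]);
}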

    Or do you have in mind new gcc intrinsic in a group "Arithmetic with
    Overflow Checking" ?

    These are gcc builtins, not intrinsics. The difference is that they
    work on all architectures. However, when I looked (three months ago),
    gcc did not have a builtin with carry-in; the builtins you mention
    only provide carry-out (or overflow-out).

    However, clang has a builtin with carry-in and carry-out:
    sum = __builtin_addcll(s1, s2, c_in, &c_out)

    Unfortunately, the code produced by clang is pretty horrible for ARM
    A64 and AMD64:

    ARM A64: # clang 11.0.1 -Os
    adds x9, x9, x10
    cset w10, hs
    adds x9, x9, x8
    cset w8, hs
    orr w8, w10, w8

    AMD64: # clang 14.0.6 -march=x86-64-v4 -Os
    addq (%rdx,%r8,8), %r9
    setb %r10b
    addq %rax, %r9
    setb %al
    orb %r10b, %al
    movzbl %al, %eax

    For RISC-V the code is a five-instruction sequence, which is the
    minimum that's possible on RISC-V.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 13 19:04:18 2025
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 11 Nov 2025 21:34:08 -0600
    ------------------
    C99 is, may be, too much, but C99 sub/super set known as C11 sounds
    about right.

    Also, I wouldn't consider such project without few extensions of
    standard language. As a minimum:
    - ability to get upper 64 bit of 64b*64b product
    hll:
    {carry, result} = multiplier × multiplicand;
    asm:
    CARRY Rc,{{O}}
    MUL Rr,Rm1,Rm2 // {Rc,Rr} is the 128-bit result

    - convenient way to exploit 64-bit add with carry
    hll:
    {carry, result} = augend + addend;
    asm:
    CARRY Rc,{{O}}
    ADD Rd,Ra1,Ra2
    or
    hll:
    {carry, result} = augend + addend + carry;
    asm:
    CARRY Rc,{{IO}}
    ADD Rd,Ra1,Ra2

- MS _BitScanReverse64 or Gnu __builtin_ctzll or equivalent

    asm:
    CLZ Rd,Rs
    ---------------------------
    Not really.
    That is, conversions are not blazingly fast, but still much better
    than any attempt to divide in any form of decimal. And helps to
    preserve your sanity.

    Are you trying to pull our proverbial leg here ?!?

    There is also psychological factor at play - your users expect
    division and square root to be slower than other primitive FP
    operations, so they are not disappointed. Possibly they are even
    pleasantly surprised, when they find out that the difference in
    throughput between division and multiplication is smaller than factor
    20-30 that they were accustomed to for 'double' on their 20 y.o. Intel
    and AMD.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Nov 13 14:34:43 2025
    From Newsgroup: comp.arch

    On 11/13/2025 3:24 AM, Anton Ertl wrote:
    Michael S <already5chosen@yahoo.com> writes:
    I almost agree, except for C95.

    What is C95? I only know of C89/90, C99, C11, C23.


    Essentially, it is C89, but:
    Has // style comments;
    Has "long long" and similar.
    Vs, plain C89, where one only has
    /* comment */
    long (32-bit).


    More or less what most versions of MSVC supported between ~ 2000 and
    2013 (VS2015 added some C99 stuff).

    Still required if one wants to be able to target Win2K or WinXP, as the versions of the compiler that support these only support C95.


    Much prior to this, and it drops to C89; but mostly only really matters
    if one wants to compile code on Windows 3.11 or similar.

    Though, there is the other option of (for older Windows versions, or
    real-mode MS-DOS) using Borland C instead.

    OTOH, for some other targets there are compilers like SDCC or CC65,
    which IIRC lack support for "long long", but I sorta suspect there is no practical reason to want to run this code on these targets (well, except
    maybe for novelty of running Decimal128 math on a 6502 or something...).

    I did set a limit of mostly ignoring 8/16-bit machines.



    Also, I wouldn't consider such project without few extensions of
    standard language. As a minimum:
    - ability to get upper 64 bit of 64b*64b product
    - convenient way to exploit 64-bit add with carry

    I have explored these topics recently in "Multi-precision integer arithmetics" <http://www.complang.tuwien.ac.at/anton/tmp/carry2.pdf>.

    Actually, with uint128_t you get pretty far, and _BitInt(bits) has
    been added in C23, which has good potential, but is not quite there.
    Builtins for add-with-carry and intrinsics are somewhat disappointing.


    Though, probably going to be a good long time before MSVC gets these...


    For now, if one wants 128-bit math, it is mostly via wrapper structs and explicit function calls.

    Could work OK. Except that shift-and-subtract division is slow.
    At present, I lack a good/efficient way to break a 128-bit integer into
    10e9 chunks (if using the 10e9 divider, this sucks).


    Can note that GCC seemingly doesn't support 128-bit integers on 64-bit
    RISC-V. Also, doing 128-bit arithmetic on RV64 kinda sucks as there is basically no good way to do extended precision arithmetic (essentially,
    the ISA offers nothing more here than what C already gives you).

    Like, you can do what is essentially:
    c_lo = a_lo + b_lo;
    c_hi = a_hi + b_hi;
    if((c_lo<a_lo) || (c_lo<b_lo))
    c_hi++;
    But... This kinda sucks...

    ...


    Though, can at least do multiply by 10e9 and similar fast-ish via fixed-patterns shift-and-add. In premise, could use the "toothpaste
    tube" strategy (multiplying by powers of 10 to squeeze digits out the
    top), but would need to figure out the appropriate magic number to
    multiply against the Int128 value (via a 128*128->256bit multiply,
    keeping high result) to be able to get the value scaled correctly to use
    this algo (this multiply also being an area of concern).

    Ironically, this strategy is more directly relevant to Binary128 or
    similar, as in this case, Binary128 will already have the mantissa bits
    scaled in the correct way (after normalizing to remove the integer part).

    I had experimented with trying to "crack" groups of digits off the
    low-end, eg:
    while(arr[0]>=1000000000)
    {
    arr[0]-=1000000000;
    Inc(arr+1);
    }
    But, alas, it was seemingly not so easy, and this does not give the
    correct results.

    It is possible to use an approach similar to double-dabble (feeding in
    the binary number 1 bit at a time, and adding the decimal vector to
    itself and incrementing for each 1 bit seen). But, alas, this is also
    slow in this case (takes around 128 iterations to convert the Int128 to
    4x 10e9). Though, still slightly faster than using a shift-subtract
    divider to crack off 9 digit chunks by successively dividing by 1000000000.
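
For what it's worth, that cracking step can be written as long division
by 10^9 over radix-2^32 limbs using only 64-bit arithmetic; a sketch,
assuming the 128-bit value is kept as four 32-bit limbs (names
illustrative, not BGBCC's code):

#include <stdint.h>

/* Divides the 128-bit value in v[0..3] (v[3] most significant) by 10^9
   in place and returns the remainder, i.e. the next nine decimal
   digits from the bottom.  Call at most five times for a full Int128. */
static uint32_t div128_by_1e9(uint32_t v[4])
{
    uint64_t rem = 0;
    for (int i = 3; i >= 0; i--) {
        uint64_t cur = (rem << 32) | v[i];
        v[i] = (uint32_t)(cur / 1000000000u);
        rem  = cur % 1000000000u;
    }
    return (uint32_t)rem;
}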


    Or, maybe make another attempt at Radix-10e9 long division and see if I
    can get it to actually work and give the correct result.

    Though, might be worthwhile, since if I could make the DIV operator
    faster, I could claim a result of "faster than IBM's decNumber library".

    Even if in practice it might still be moot, as it is still impractically
    slow if compared with Binary128.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 13 20:40:01 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 13 Nov 2025 09:24:20 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    Actually, with uint128_t you get pretty far, and _BitInt(bits) has
    been added in C23, which has good potential, but is not quite there.

    Yes, that what I wrote above.
    As far as BGB is concerned, the big disadvantage is absence of support
    by MSVC.

    Why would that be a disadvantage? If MSVC does not do what he needs,
    there are other C compilers to choose from.

    Builtins for add-with-carry and intrinsics are somewhat disappointing.

    - anton

    For me the most disappointing part is that different architectures
    have different spellings.

    For intrinsics that's by design. They are essentially a way to write assembly language instructions in Fortran or C. And assembly language
    is compiler-specific.

    {Pedantic mode=ON}
    Assembly language is ASSEMBLER specific.
    Compilers have to spit out what the assembler wants or go directly to
    linker representation.
    {Pedantic mode=OFF}

    Other than that even gcc now mostly able to generate
    decent code for Intel's variant. MSVC and clang were able to do it for
    very long time.

    When using the Intel intrinsic c_out = _addcarry_u64(c_in, s1, s2,&sum),
    the code from both gcc and clang uses adcq, but cannot preserve the
    carry in CF in a loop, and moves it into a register right after the
    adcq, and back from the register to CF right before:

    addb $-1, %r8b
    adcq (%rdx,%rax,8), %r9
    setb %r8b

    CALK R9,what,ever
    CARRY R9,{{IO}}
    ADD R8,Rs1,Rs2
    performs
    {R9, R8} = R9 + Rs1 + Rs2;

    If you (or compiler unrolling) have several _addcarry_u64 in a row,
    with the carry-out becoming the carry-in of the next one, at least one
    of these compilers manages to eliminate the overhead between these
    adcqs, but of course not at the start and end of the sequence.

    Or do you have in mind new gcc intrinsic in a group "Arithmetic with >Overflow Checking" ?

    These are gcc builtins, not intrinsics. The difference is that they
    work on all architectures. However, when I looked (three months ago),
    gcc did not have a builtin with carry-in; the builtins you mention
    only provide carry-out (or overflow-out).

    However, clang has a builtin with carry-in and carry-out:
    sum = __builtin_addcll(s1, s2, c_in, &c_out)

    Unfortunately, the code produced by clang is pretty horrible for ARM
    A64 and AMD64:

    ARM A64: # clang 11.0.1 -Os
    adds x9, x9, x10
    cset w10, hs
    adds x9, x9, x8
    cset w8, hs
    orr w8, w10, w8

    AMD64: # clang 14.0.6 -march=x86-64-v4 -Os
    addq (%rdx,%r8,8), %r9
    setb %r10b
    addq %rax, %r9
    setb %al
    orb %r10b, %al
    movzbl %al, %eax

    For RISC-V the code is a five-instruction sequence, which is the
    minimum that's possible on RISC-V.

    2 in My 66000, 1 if you don't count CARRY as it is an
    instruction-modifier instead of an instruction. There is
    only 1 instruction that "gets executed".


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 13 21:50:59 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 13 Nov 2025 09:24:20 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    Actually, with uint128_t you get pretty far, and _BitInt(bits) has
    been added in C23, which has good potential, but is not quite there.

    Yes, that what I wrote above.
    As far as BGB is concerned, the big disadvantage is absence of support
    by MSVC.

    Why would that be a disadvantage? If MSVC does not do what he needs,
    there are other C compilers to choose from.

    Builtins for add-with-carry and intrinsics are somewhat disappointing.

    - anton

    For me the most disappointing part is that different architectures
    have different spellings.

    For intrinsics that's by design. They are essentially a way to write
    assembly language instructions in Fortran or C. And assembly language
    is compiler-specific.

    {Pedantic mode=ON}
    Assembly language is ASSEMBLER specific.

    What I wanted to write was "And assembly language is
    architecture-specific".

It's the builtin functions that are compiler-specific.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 13 21:58:13 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
Can note that GCC seemingly doesn't support 128-bit integers on 64-bit
RISC-V.

    What makes you think so? It has certainly worked every time I tried
    it. E.g., Gforth's "configure" reports:

    checking size of __int128_t... 16
    checking size of __uint128_t... 16
    [...]
    checking for a C type for double-cells... __int128_t
    checking for a C type for unsigned double-cells... __uint128_t

    That's with gcc 10.3.1

    Also, doing 128-bit arithmetic on RV64 kinda sucks as there is
    basically no good way to do extended precision arithmetic (essentially,
    the ISA offers nothing more here than what C already gives you).

    Like, you can do what is essentially:
    c_lo = a_lo + b_lo;
    c_hi = a_hi + b_hi;
    if((c_lo<a_lo) || (c_lo<b_lo))
    c_hi++;

    You only need to check for c_lo<a_lo (or for c_lo<b_lo), they will
    either both be true or both be false.

    Here's 128-bit arithmetic on RV64GC (and very similar on MIPS and
    Alpha):

    add a4,a4,a5
    sltu a5,a4,a5
    add s8,s8,s9
    add s9,a5,s8

RISC-V (and MIPS and Alpha) becomes really bad when you need add with
    carry-in and carry-out (five instructions).
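
Written out in C, the carry-in/carry-out case looks like this (a
sketch); the two compares and the OR are what turn into the extra
instructions:

#include <stdint.h>

/* 64-bit add with carry-in and carry-out; on RV64 this is
   add, sltu, add, sltu, or. */
static uint64_t addc64(uint64_t a, uint64_t b, unsigned cin, unsigned *cout)
{
    uint64_t s = a + b;
    unsigned c = s < a;        /* carry out of a + b */
    s += cin;
    c |= s < cin;              /* carry out of adding the carry-in */
    *cout = c;
    return s;
}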

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 13 22:13:54 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 13 Nov 2025 09:24:20 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    Actually, with uint128_t you get pretty far, and _BitInt(bits) has
    been added in C23, which has good potential, but is not quite there.

    Yes, that what I wrote above.
    As far as BGB is concerned, the big disadvantage is absence of support
    by MSVC.

    Why would that be a disadvantage? If MSVC does not do what he needs,
    there are other C compilers to choose from.

    Builtins for add-with-carry and intrinsics are somewhat disappointing. >> >>
    - anton

    For me the most disappointing part is that different architectures
    have different spellings.

    For intrinsics that's by design. They are essentially a way to write
    assembly language instructions in Fortran or C. And assembly language
    is compiler-specific.

    {Pedantic mode=ON}
    Assembly language is ASSEMBLER specific.

    What I wanted to write was "And assembly language is
    architecture-specific".

    I have worked on a single machine with several different ASM "compilers". Believe me, one asm can be different than another asm.

    But it is absolutely true that asm is architecture specific.

    It's the builtin function that are compiler-specific.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Nov 14 00:43:07 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
Can note that GCC seemingly doesn't support 128-bit integers on 64-bit
RISC-V.

    What makes you think so? It has certainly worked every time I tried
    it. E.g., Gforth's "configure" reports:

    checking size of __int128_t... 16
    checking size of __uint128_t... 16
    [...]
    checking for a C type for double-cells... __int128_t
    checking for a C type for unsigned double-cells... __uint128_t

    That's with gcc 10.3.1

    Also, doing 128-bit arithmetic on RV64 kinda sucks as there is
basically no good way to do extended precision arithmetic (essentially,
the ISA offers nothing more here than what C already gives you).

    Like, you can do what is essentially:
    c_lo = a_lo + b_lo;
    c_hi = a_hi + b_hi;
    if((c_lo<a_lo) || (c_lo<b_lo))
    c_hi++;

    You only need to check for c_lo<a_lo (or for c_lo<b_lo), they will
    either both be true or both be false.

    Here's 128-bit arithmetic on RV64GC (and very similar on MIPS and
    Alpha):

    add a4,a4,a5
    sltu a5,a4,a5
    add s8,s8,s9
    add s9,a5,s8

    RISC-V (and MIPS and Alpha) becomes relly bad when you need add with
    carry-in and carry-out (five instructions).

    My 66000: // 256-bit add
    CARRY R15,{{O}{IO}{IO}{I}}
    ADD R12,R8,R24
    ADD R13,R9,R25
    ADD R14,R10,R26
    ADD R15,R11,R27
    !!!!!!

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Nov 13 19:17:33 2025
    From Newsgroup: comp.arch

    On 11/13/2025 3:58 PM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    Can note that GCC seemingly doesn't support 128-bit integers on 64-bit
    RISC-V.

    What makes you think so? It has certainly worked every time I tried
    it. E.g., Gforth's "configure" reports:

    checking size of __int128_t... 16
    checking size of __uint128_t... 16
    [...]
    checking for a C type for double-cells... __int128_t
    checking for a C type for unsigned double-cells... __uint128_t

    That's with gcc 10.3.1


    Hmm...

    Seems so.

    Testing again, it does appear to work; the error message I thought I remembered seeing, instead applied to when trying to use the type in
    MSVC. I had thought I remembered checking before and it failing, but it
    seems not.

    But, yeah, good to know I guess.


    As for MSVC:
    tst_int128.c(5): error C4235: nonstandard extension used: '__int128'
    keyword not supported on this architecture

    MSVC doesn't recognize __int128_t at all.

    Where:
    """
    Microsoft (R) C/C++ Optimizing Compiler Version 19.44.35219 for x64
    Copyright (C) Microsoft Corporation. All rights reserved.
    """

    ...



    Either way, it falls outside the scope of the C dialect I was targeting
    here; and at least one of the compilers I am using still doesn't support it.

    It is possible though I could still have an ifdef for targets that have
    this type.


    Misc:
    Decided to post a graph generated by BGBCC: https://x.com/cr88192/status/1989134230648156378/photo/1

    It shows the relative distribution of constants as recorded by the compiler.

    I had considered trying to feed the data into LibreOffice, but the
    interface is awkward enough that it became less effort to just add graph-drawing code to my compiler. Nevermind if graph-drawing
    technically falls outside of the compiler's scope of responsibilities.

    Partly this was to make a case for why it makes sense to have 33-bit
    immediate values, but not bother so much with slightly larger values.

    Basically, one has to cross a big gap of "mostly nothing" before
    reaching a modest spike up near the 64-bit mark.

    Note that Y axis is in Log 2 (it was either this or mask off 0).

    Granted, there are other possible ways to graph this.

    This particular example mostly resulted from compiling Doom.


    Also, doing 128-bit arithmetic on RV64 kinda sucks as there is
    basically no good way to do extended precision arithmetic (essentially,
    the ISA offers nothing more here than what C already gives you).

    Like, you can do what is essentially:
    c_lo = a_lo + b_lo;
    c_hi = a_hi + b_hi;
    if((c_lo<a_lo) || (c_lo<b_lo))
    c_hi++;

    You only need to check for c_lo<a_lo (or for c_lo<b_lo), they will
    either both be true or both be false.


    OK, I wasn't sure here.


    Here's 128-bit arithmetic on RV64GC (and very similar on MIPS and
    Alpha):

    add a4,a4,a5
    sltu a5,a4,a5
    add s8,s8,s9
    add s9,a5,s8

    RISC-V (and MIPS and Alpha) becomes relly bad when you need add with
    carry-in and carry-out (five instructions).


    OK.

    Still not great.

    On my ISAs, it is one of:
    ALUX R10, R12, R10 //if supported
    Or(XG1/XG2):
    CLRT
    ADDC R12, R10
    ADDC R13, R11
    Never got around to adding a 3R ADDC (and as-is is basically the same
    idiom as carried over from SH-4).


    On XG3, the latter is no longer formally allowed (partly for consistency
    with RISC-V), but nothing technically prevents it (support for SR.T and predication was demoted to optional, and currently not enabled by default).

    Could maybe still make sense to add a 3R ADDC though at some point, as
    it could help with 256-bit arithmetic (and 256-bit stuff is not
    addressed by ALUX).

    Or, maybe even ADDCX, for 128-bit ADD-with-Carry?...



    Though, did randomly remember a video I saw recently talking about how,
    if the dinosaurs were still around, it is very unlikely human-like
    creatures would have emerged.

    The idea was that creatures would rise to the peaks of the "fitness
    landscape" and eventually get stuck there, and it would have created a
    world where basically there would have been no paths that would have
    favored anything human-like emerging (and it is unlikely that any
    creatures would descend back into the "valleys" to reach other possible
    peaks in the landscape).


    Does make me wonder if similar ideas could apply to things like software
    and CPU architecture. Like, possible higher peaks that could potentially
    lead to significant improvements in performance or capability, but
    nothing can reach them as there is a "valley of suck" in the way.

    Well, there are always random detours that seem to point this way, such
    as with things like trinary logic, analog electronics, stochastic logic,
    etc. Which have interesting properties but are, strictly speaking,
    inferior to what we have now.


    Say, for example, it does seem like Trinary could be used to drive more
    data over a differential signaling bus, say:
    00: 0 (Z)
    01: +1 (P)
    10: -1 (N)
    11: Hi (H, idle state)
    Then, say, one can drive the equivalent of 9 bits of data in 6 clock
    cycles, with enough additional states that they could be used either for error-detection or DC balancing (could in concept do something like NRZ
    where there would often be multiple possible paths to encode every
    possible bit sequence and the encoder chooses the path that maintains
    the best balance and avoids getting stuck in a non-changing state for
    too many cycles).

    Say, for example, every 2 trits encodes 3 bits, but then leaves one
    redundant option. If one assumes that a scheme similar to NRZ is used,
    then in cases where one gets a long run of 0 bits or similar, then it
    can use a redundant zero encoding such that "on the wire" it still sees
    a state transition, maybe:
    ZZ: 000, ZP: 001
    ZN: 010, PZ: 011
    PP: 100, PN: 101
    NZ: 110, NP: 111
    NN: 000
Though, if the mapping rotates after every odd trit, then it becomes
statistically unlikely that any significant DC-imbalance could arise
    (but would make NRZ redundant). Or, if it does arise somehow, the
    encoder could stick some idle-state pulses into the mix as well
    (possibly understood as repeating the prior trit).

    So, say, ZH/PH/NH being equivalent to ZZ/PP/NN but with an extra
    transition, vs HH being the true idle state.

    Well, unless going through a coupling transformer (like in Ethernet)
    where ZZ/HZ would be ineffective (but would be saved by a ZZ/NN transition).
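
A sketch of the 3-bits-to-2-trits mapping above in code, with the
redundant NN code used as an alternate 000 so a run of zero bits still
toggles the wire (trit numbering and the substitution rule are
illustrative):

/* Trits: 0 = Z, 1 = P, 2 = N. */
typedef struct { unsigned char t0, t1; } trit_pair;

static trit_pair encode3(unsigned bits, trit_pair prev)
{
    static const trit_pair map[8] = {
        {0,0}, {0,1},   /* 000 -> ZZ, 001 -> ZP */
        {0,2}, {1,0},   /* 010 -> ZN, 011 -> PZ */
        {1,1}, {1,2},   /* 100 -> PP, 101 -> PN */
        {2,0}, {2,1},   /* 110 -> NZ, 111 -> NP */
    };
    trit_pair out = map[bits & 7u];
    if ((bits & 7u) == 0 && prev.t0 == 0 && prev.t1 == 0)
        out.t0 = out.t1 = 2;    /* alternate 000 encoding: NN */
    return out;
}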


    But, this seems like one of those things that presumably someone would
    have already thought of it?...

    ...



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Nov 14 03:59:08 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 11/13/2025 3:58 PM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    Can note that GCC seemingly doesn't support 128-bit integers on 64-bit
    RISC-V.

    What makes you think so? It has certainly worked every time I tried
    it. E.g., Gforth's "configure" reports:

    checking size of __int128_t... 16
    checking size of __uint128_t... 16
    [...]
    checking for a C type for double-cells... __int128_t
    checking for a C type for unsigned double-cells... __uint128_t

    That's with gcc 10.3.1


    Hmm...

    Seems so.

    Testing again, it does appear to work; the error message I thought I remembered seeing, instead applied to when trying to use the type in
    MSVC. I had thought I remembered checking before and it failing, but it seems not.

    But, yeah, good to know I guess.


    As for MSVC:
    tst_int128.c(5): error C4235: nonstandard extension used: '__int128'
    keyword not supported on this architecture

    ERRRRRRR:: not supported by this compiler, the architecture has
    ISA level support for doing this, but the compiler does not allow
    you access.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 14 07:18:30 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    Never got around to adding a 3R ADDC (and as-is is basically the same
    idiom as carried over from SH-4).


On XG3, the latter is no longer formally allowed (partly for consistency
with RISC-V), but nothing technically prevents it (support for SR.T and
predication was demoted to optional, and currently not enabled by default).

    Could maybe still make sense to add a 3R ADDC though at some point, as
    it could help with 256-bit arithmetic (and 256-bit stuff is not
    addressed by ALUX).

    In "Extending General-Purpose Registers with Carry and Overflow Bits" <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> I discuss
    adding a carry bit and overflow bit to every GPR of an architecture.
    To make it concrete how that would affect the instruction set, I
    propose such an instruction set extension for RISC-V. It contains the instructions

    addc rd, rs1, rs2

    which adds the carry bit of rs2 to the 65-bit (i.e., including the
    carry bit) data in rs1. The other instruction I proposed is

    bo rs1, rs2, target

which branches if the overflow bit of rs1 or rs2 is set (why check
    two registers? Because it fits in the RISC-V conditional branch
    instruction scheme).

    A 256-bit addition (d1,c1,b1,a1)+(d2,c2,b2,a2) would look as follows

    add a3,a1,a2
    add b3,b1,b2
    addc b3,b3,a3
    add c3,c1,c2
    addc c3,c3,b3
    add d3,d1,d2
    addc d3,d3,c3

    with 4 cycles latency. addc is limited to having two source registers
    (RV64G instructions all have this limit). The decoder could combine a
    pair of add and addc instructions into one three-source
    macro-instruction. Alternatively, one could add a three-source
    instruction addc4 (VAX-inspired naming) to the instruction set, and
    maybe include subc4 as well.

    Does make me wonder if similar ideas could apply to things like software
and CPU architecture. Like, possible higher peaks that could potentially
lead to significant improvements in performance or capability, but
    nothing can reach them as there is a "valley of suck" in the way.

    Network effects favour incumbents, and network effects are strong in
    computer architecture for general-purpose processors. Sometimes I
    think that it's a miracle that we have seen the progress in computer architecture that we have seen:

    1) We used to have a de-facto standard of 36-bit word-addressed
    machines (ok, there were character-addressed and digit-addressed
    machines at the time, too), and it has been superseded by a
    standard of 8-byte-addressed machines with word size 16 bits, 32
    bits, or 64 bits. The mechanism here seems to have been that most
    of the 36-bit machines had 18-bit addresses, and, as Gordon Bell
    wrote, running out of address bits spells doom for an architecture.

    2) At one point (late 1980s) it looked like big-endian would win
    (almost all workstations at the time, with DEC stuff being the
    exception that proved the rule), but eventually little-endian won,
    thanks to PCs (which inherited the Datapoint 2200 byte order) and
    smart phones (which inherited the 6502 byte order).

    Another, less surprising development is that trapping on unaligned
    accesses is dying out in general-purpose machines. In the 1980s and
    1990s most architectures trapped on unaligned accesses. But that's a
    "feature" that almost no software relies on, so there are no network
    effects in its favour. OTOH, porting software from an architecture
    that performs unaligned accesses is easier to architectures that
    perform unaligned accesses. So eventually all general-purpose
    architectures have converted to performing unaligned accesses, or died
    out. One can see this progression already in S/360->S/370.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Nov 14 14:18:02 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    {Pedantic mode=ON}
    Assembly language is ASSEMBLER specific.

    What I wanted to write was "And assembly language is
    architecture-specific".

    foo_:
    add DWORD PTR [rdi], 1
    ret

    and

    foo_:
    addl $1, (%rdi)
    ret

    are written in two different assembly languages, yet have the same
    meaning when compiled.

    It's the builtin function that are compiler-specific.

    Also, not really. For x86, Intel defines them, and other
    compilers like gcc follow suit.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Nov 14 15:57:22 2025
    From Newsgroup: comp.arch

    BGB wrote:
    It is possible to use an approach similar to double-dabble (feeding in
    the binary number 1 bit at a time, and adding the decimal vector to
    itself and incrementing for each 1 bit seen). But, alas, this is also
    slow in this case (takes around 128 iterations to convert the Int128 to
    4x 10e9). Though, still slightly faster than using a shift-subtract
    divider to crack off 9 digit chunks by successively dividing by 1000000000.


    Or, maybe make another attempt at Radix-10e9 long division and see if I
    can get it to actually work and give the correct result.

    I used division by 1e9 to extract groups of 9 digits from the binary
    result I got when calculating pi with arbitrary precision, back then (on
    a 386) I did it with the obvious edx:eax / 1e9 (in ebx) -> remainder
    (edx) and result (eax) in a loop, which was fast enough for something I
    only needed to do once.

    Today, with 64-bit cpus, why not use a reciprocal mul to get a value
    that cannot be too high, save the result, then back-multiply and subtract?

    Any off-by-one error will be caught by the next iteration.
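
    Concretely, something along these lines (a minimal sketch using
    gcc/clang's unsigned __int128, for illustration only; M is
    floor(2^64/1e9), so the estimate can never be high and is low by at
    most one):

        #include <stdint.h>

        static void divmod_1e9(uint64_t x, uint64_t *q, uint32_t *r)
        {
            const uint64_t M = 18446744073ull;               /* floor(2^64 / 1e9) */
            uint64_t est = (uint64_t)(((unsigned __int128)x * M) >> 64);
            uint64_t rem = x - est * 1000000000ull;          /* back-multiply, subtract */
            if (rem >= 1000000000ull) {                      /* estimate low by at most 1 */
                rem -= 1000000000ull;
                est++;
            }
            *q = est;
            *r = (uint32_t)rem;
        }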


    Though, might be worthwhile, since if I could make the DIV operator
    faster, I could claim a result of "faster than IBM's decNumber library".

    :-)


    Even if in practice it might still be moot, as it is still impractically slow if compared with Binary128.

    Right.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Nov 14 18:48:44 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
    Never got around to adding a 3R ADDC (and as-is is basically the same
    idiom as carried over from SH-4).


    On XG3, the latter is no longer formally allowed (partly for consistency
    with RISC-V), but nothing technically prevents it (support for SR.T and
    predication was demoted to optional, and currently not enabled by default).

    Could maybe still make sense to add a 3R ADDC though at some point, as
    it could help with 256-bit arithmetic (and 256-bit stuff is not
    addressed by ALUX).

    In "Extending General-Purpose Registers with Carry and Overflow Bits" <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> I discuss
    adding a carry bit and overflow bit to every GPR of an architecture.

    Which does nothing for MUL and DIV, while creating complications for
    LD/ST if you want to maintain the 66-bit illusion of a GPR through
    memory.

    To make it concrete how that would affect the instruction set, I
    propose such an instruction set extension for RISC-V. It contains the instructions

    addc rd, rs1, rs2

    which adds the carry bit of rs2 to the 65-bit (i.e., including the
    carry bit) data in rs1. The other instruction I proposed is

    bo rs1, rs2, target

    which branches if the overflow bit of rs1 or rs2 are set (why check
    two registers? Because it fits in the RISC-V conditional branch
    instruction scheme).

    A 256-bit addition (d1,c1,b1,a1)+(d2,c2,b2,a2) would look as follows

    add a3,a1,a2
    add b3,b1,b2
    addc b3,b3,a3
    add c3,c1,c2
    addc c3,c3,b3
    add d3,d1,d2
    addc d3,d3,c3

    with 4 cycles latency. addc is limited to having two source registers
    (RV64G instructions all have this limit). The decoder could combine a
    pair of add and addc instructions into one three-source
    macro-instruction. Alternatively, one could add a three-source
    instruction addc4 (VAX-inspired naming) to the instruction set, and
    maybe include subc4 as well.

    CARRY in My 66000 essentially provides an accumulator for a few instructions that supply more operands to and receives another result from a calculation. Most multiprecision calculation sequences are perfectly happy with another register used as an accumulator.

    Does make me wonder if similar ideas could apply to things like software
    and CPU architecture. Like, possible higher peaks that could potentially
    lead to significant improvements in performance or capability, but
    nothing can reach them as there is a "valley of suck" in the way.

    Network effects favour incumbents, and network effects are strong in
    computer architecture for general-purpose processors. Sometimes I
    think that it's a miracle that we have seen the progress in computer architecture that we have seen:

    1) We used to have a de-facto standard of 36-bit word-addressed
    machines (ok, there were character-addressed and digit-addressed
    machines at the time, too), and it has been superseded by a
    standard of 8-byte-addressed machines with word size 16 bits, 32
    bits, or 64 bits. The mechanism here seems to have been that most
    of the 36-bit machines had 18-bit addresses, and, as Gordon Bell
    wrote, running out of address bits spells doom for an architecture.

    2) At one point (late 1980s) it looked like big-endian would win
    (almost all workstations at the time, with DEC stuff being the
    exception that proved the rule), but eventually little-endian won,
    thanks to PCs (which inherited the Datapoint 2200 byte order) and
    smart phones (which inherited the 6502 byte order).

    Another, less surprising development is that trapping on unaligned
    accesses is dying out in general-purpose machines. In the 1980s and
    1990s most architectures trapped on unaligned accesses. But that's a "feature" that almost no software relies on, so there are no network
    effects in its favour. OTOH, porting software from an architecture
    that performs unaligned accesses is easier to architectures that
    perform unaligned accesses. So eventually all general-purpose
    architectures have converted to performing unaligned accesses, or died
    out. One can see this progression already in S/360->S/370.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Nov 14 15:00:14 2025
    From Newsgroup: comp.arch

    <snip>

    In "Extending General-Purpose Registers with Carry and Overflow Bits" <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> I discuss
    adding a carry bit and overflow bit to every GPR of an architecture.
    To make it concrete how that would affect the instruction set, I
    propose such an instruction set extension for RISC-V. It contains the instructions

    Sidetracking a bit here.

    There are 64-regs in Qupls with four flag bits, so loading and storing
    all the flags into a single 64-bit register is not possible. I have
    managed to come up with a scheme that might work.

    Rather than use the dedicated registers ‘storeextra’ and ‘loadextra’, a
    load / store queue entry is directly allocated and used for the purpose.

    The LSQ entry already has enough room to store a cache-line (512-bits)
    for merging operations. The ASTF / ALDF instructions (‘A’ for allocate) supply a bitmask of flag groups that need to be moved. A single 256-bit cache-line data access is performed. ALDF allocates and loads the
    cache-line full of bits. ASTF simply allocates the LSQ entry. Which
    registers need to be moved is indicated by the byte lane selects for the
    LSQ entry (already present in the design).

    For stores an STF instruction sets the byte lane select in the LSQ for
    the flag store for the corresponding register. Once all the byte lane
    selects are set the flags store operation is ready to proceed like any
    other store. (There is already a data valid signal, which could be set).

    For loads the LDF instruction clears the byte lane select for
    corresponding registers. Once all the byte lane selects are cleared,
    then the load is finished.

    The LSQ entry allocated for the load / store remains present in the LSQ
    for the duration of operations.

    <snip>

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Nov 14 14:39:30 2025
    From Newsgroup: comp.arch

    On 11/14/2025 8:57 AM, Terje Mathisen wrote:
    BGB wrote:
    It is possible to use an approach similar to double-dabble (feeding in
    the binary number 1 bit at a time, and adding the decimal vector to
    itself and incrementing for each 1 bit seen). But, alas, this is also
    slow in this case (takes around 128 iterations to convert the Int128
    to 4x 10e9). Though, still slightly faster than using a shift-subtract
    divider to crack off 9 digit chunks by successively dividing by
    1000000000.


    Or, maybe make another attempt at Radix-10e9 long division and see if
    I can get it to actually work and give the correct result.

    I used division by 1e9 to extract groups of 9 digits from the binary
    result I got when calculating pi with arbitrary precision, back then (on
    a 386) I did it with the obvious edx:eax / 1e9 (in ebx) -> remainder
    (edx) and result (eax) in a loop, which was fast enough for something I only needed to do once.

    Today, with 64-bit cpus, why not use a reciprocal mul to get a value
    that cannot be too high, save the result, then back-multiply and subtract?


    Dunno.

    I guess, 128/64 bit IDIV could be possible on the HW, but there isn't a
    good way to access this from C (absent using functionality outside what
    exists in portable C95).

    Could be possible, but at the moment don't really want to go the
    direction of alternate code paths, and wildly different performance
    based on what compiler one is using.


    Though, in simple cases, the compilers are smart enough to turn divide-by-constant into multiply-by-reciprocal internally (if they
    support the type in question).



    If anything, GCC having __int128 support, leans a lot more in favor of
    using BID rather than DPD.

    I decided against using BID with this code for the reason that this
    bottleneck would unnecessarily penalize BID, which would be better
    handled in a different way (namely writing code which works natively
    with numbers in linear integer form; and not in 10e9 form).


    As noted, a double-dabble approach is, say:

        /* a: the Int128 input; arr[]: the value as 4 radix-1e9 digits */
        v=a.hi;
        for(i=0; i<64; i++)
        {
            TKD128_AddArray4(arr, arr, arr);  /* arr = arr + arr (double the decimal vector) */
            arr[0]+=v>>63;                    /* feed in the next binary bit */
            v=v<<1;
        }
        v=a.lo;
        for(i=0; i<64; i++)
        {
            TKD128_AddArray4(arr, arr, arr);
            arr[0]+=v>>63;
            v=v<<1;
        }

    Which technically works, but doesn't really win any awards for speed...


    Any off-by-one error will be caught by the next iteration.


    Yeah...


    Though, did make another attempt at 10e9 long division, and got it
    working correctly this time...

    /* arem is both dividend and remainder, 8 elements; padded.
     * adiv is the divisor (4 elements).
     * aquo is the output quotient (also 8 elements)
     * result derived from high elements.
     */
    void TKD128_LongDivArray8x4(u32 *arem, u32 *adiv, u32 *aquo)
    {
        u32 adtmp[8];
        u64 adx, ady, tdiv;
        u32 ad0, ad1, ad2, ad3, or8;
        int i, j, n, re;

        memset(aquo, 0, 8*sizeof(u32));
        adtmp[0]=0;       adtmp[1]=0;
        adtmp[2]=0;       adtmp[3]=0;
        adtmp[4]=adiv[0]; adtmp[5]=adiv[1];
        adtmp[6]=adiv[2]; adtmp[7]=adiv[3];

        tdiv=adiv[3]; /* assume not zero... */

        for(i=0; i<5; i++)
        {
            /* doesn't always work in a single pass, usually 1 or 2 */
            for(j=0; j<4; j++)
            {
                ad0=arem[7-i];
                ad1=arem[8-i];
                adx=(ad1*1000000000ULL)+ad0;
                ady=adx/(tdiv+1);
                if(!ady)
                    break; /* if was zero, this position is done */
                ad2=ady;
                if(ady>=2000000000) /* range limit so no overflow */
                    { ad2=1999999999; }
                if(ad2>0)
                {
                    TKD128_SubScaleArray8X_30(arem, adtmp, ad2, arem);
                    ad3=aquo[0];
                    ad3+=ad2;
                    if(ad3>=1000000000)
                        { aquo[1]++; ad3-=1000000000; }
                    aquo[0]=ad3;
                }
            }
            TKD128_ScaleLeftArray8_S9(aquo);
            TKD128_ScaleRightArray8_S9(adtmp);
        }
        for(; i<8; i++)
            TKD128_ScaleLeftArray8_S9(aquo);
    }

    TKD128_ScaleLeftArray8_S9:
        Copy elements left (towards a higher index) by 1 position (32 bits).

    TKD128_ScaleRightArray8_S9:
        Copy elements right (towards a lower index) by 1 position (32 bits).

    TKD128_SubScaleArray8X_30(c, a, b, d):
        For each element i (roughly):
            v=a[i]*b;
            v_h=v/1000000000;
            v_l=v-v_h*1000000000;
            d[i+0]=c[i+0]-v_l;
            d[i+1]=c[i+1]-v_h;
        With extra parts to deal with borrow propagation and similar.
        May access out-of-bounds for the c/d arrays
        (arem needs to be padded by a few extra elements).

    Note that it runs for 5 iterations (vs 4) because this is how one gets
    it to produce a full fraction rather than an integer divide (the integer
    divide results are similar, but differ in the low order digits).

    Running for 5 appears sufficient (it could run for 6..8, but these appear
    to deliver the same final result and are slower).

    One could debate whether stopping early could affect the results, but
    the low-order digits are initialized to 0, and 1000000000-999999999 is
    000000001, so in the worst case the low-order borrows would simply be
    absorbed.



    Performance:
    Slightly faster than using N-R.
    But, still nowhere near what I had hoped.



    Though, might be worthwhile, since if I could make the DIV operator
    faster, I could claim a result of "faster than IBM's decNumber library".

    :-)


    Currently, ADD/SUB/MUL seem to be faster in my case.

    DIV is still slower, and seems to be putting up a big fight here.
    decNumber seems to have a DIV that is around 1/2 the speed of the MUL.
    In my case, DIV is still around 10x slower than MUL.

    Currently I have it at 3.4 MHz in GCC, 2.7 MHz in MSVC.

    To match decNumber, would need to get closer to around 6 million divides
    per second (in GCC).


    Current stats in a GCC build are:
    ADD/SUB: 36 MHz (unpacked), 17 MHz (DPD)
    MUL: 27 MHz (unpacked), 14 MHz (DPD)
    DIV: 3.4 MHz (both)
    SQRT: 1.0 MHz (both)


    MSVC scores:
    ADD/SUB: 13 MHz (unpacked), 0.8 MHz (DPD)
    MUL: 18 MHz (unpacked), 1.0 MHz (DPD)
    DIV: 2.7 MHz, 0.7 MHz (DPD)
    SQRT: 0.8 MHz, 0.6 MHz (DPD)

    Everything is slower here with MSVC it seems...
    The DPD pack/unpack kinda wrecks things.
    The X30 pack/unpack is around 36% faster than DPD.

    Not entirely sure why MSVC is sucking so badly here (it doesn't usually
    suck this bad).

    Checking Clang:
    It is slightly faster than MSVC, but much closer to the MSVC performance
    than to the GCC performance in this case (so, whatever issue is
    affecting MSVC here also appears to affect Clang).


    This may require investigation, but then again, a lot of this isn't
    exactly "high performance" code (and does a lot of stuff I would
    normally avoid, but was basically unavoidable due to the whole
    Radix-10e9 thing).



    decNumber uses DPD, but is around 13 and 12 MHz in GCC with similar inputs.
    As noted, its DIV is still a bit faster.
    Currently only seems to build with GCC or Clang.

    SQRT is N/A, as decNumber seemingly lacks SQRT, or any other complex
    math functions. Like, no log/pow, nor sin/cos/tan/..., ...


    Also, a lot of its example programs are for things like calculating
    interest and similar...



    Even if in practice it might still be moot, as it is still
    impractically slow if compared with Binary128.

    Right.

    Terje


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 14 22:32:14 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    {Pedantic mode=ON}
    Assembly language is ASSEMBLER specific.

    What I wanted to write was "And assembly language is
    architecture-specific".

    foo_:
    add DWORD PTR [rdi], 1
    ret

    and

    foo_:
    addl $1, (%rdi)
    ret

    are written in two different assembly languages, yet have the same
    meaning when compiled.

    That does not contradict what I wrote. Both assembly languages are
    specific to the AMD64 architecture.

    It's the builtin function that are compiler-specific.

    Also, not really. For x86, Intel defines them, and other
    compilers like gcc follow suit.

    You are confusing builtins with intrinsics. Builtins are defined by
    the compiler. E.g., __builtin_addcll() is supported by clang on all architectures, but is not supported by gcc. By contrast, the
    intrinsic _addcarry_u64() is defined by Intel and is supported on gcc
    and clang (and, I guess icc, and maybe others), but only when
    compiling for AMD64.
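
    For instance (a small illustration; exact header and version
    availability may vary):

        #include <stdint.h>

        #if defined(__clang__)
        /* clang builtin: works on any target clang supports */
        unsigned long long add_with_carry_builtin(unsigned long long a,
                                                  unsigned long long b,
                                                  unsigned long long cin,
                                                  unsigned long long *cout)
        {
            return __builtin_addcll(a, b, cin, cout);
        }
        #endif

        #if defined(__x86_64__) || defined(_M_X64)
        #include <immintrin.h>
        /* Intel-defined intrinsic: AMD64 only, but gcc/clang/icc all accept it */
        unsigned long long add_with_carry_intrin(unsigned long long a,
                                                 unsigned long long b,
                                                 unsigned char cin,
                                                 unsigned char *cout)
        {
            unsigned long long out;
            *cout = _addcarry_u64(cin, a, b, &out);
            return out;
        }
        #endif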

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 14 22:38:07 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    In "Extending General-Purpose Registers with Carry and Overflow Bits"
    <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> I discuss
    adding a carry bit and overflow bit to every GPR of an architecture.

    Which does nothing for MUL and DIV, while creating complications for
    LD/ST if you want to maintain the 66-bit illusion of a GPR through
    memory.

    I don't think the benefit is worth the cost, as do you, because you
    support your CARRY functionality only in very limited sequences. So
    storing stores only 64 bits, and loading only loads those bits, and
    sets carry and overflow to no overflow.

    A 256-bit addition (d1,c1,b1,a1)+(d2,c2,b2,a2) would look as follows

    add a3,a1,a2
    add b3,b1,b2
    addc b3,b3,a3
    add c3,c1,c2
    addc c3,c3,b3
    add d3,d1,d2
    addc d3,d3,c3

    with 4 cycles latency.

    CARRY in My 66000 essentially provides an accumulator for a few instructions
    that supply more operands to and receives another result from a calculation.
    Most multiprecision calculation sequences are perfectly happy with another
    register used as an accumulator.

    How does a four-input 2048-bit-addition look with your CARRY? For GPRs-with-flags it would look as follows:

    L:
    ld xn, (xp)
    ld yn, (yp)
    ld zn, (zp)
    ld tn, (tp)
    add rn, xn, yn
    addc rn, rn, rm
    add sn, zn, tn
    addc sn, sn, sm
    add vn, rn, tn
    addc vn, vn, vm
    sd vn, (vp)
    ... #mov rn, sn, vn to rm, sm, vm
    ... #increment xp yp zp tp vp
    ... #loop control and branch back to L:

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 15 01:22:03 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    In "Extending General-Purpose Registers with Carry and Overflow Bits"
    <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> I discuss
    adding a carry bit and overflow bit to every GPR of an architecture.

    Which does nothing for MUL and DIV, while creating complications for
    LD/ST if you want to maintain the 66-bit illusion of a GPR through
    memory.

    I don't think the benefit is worth the cost, as do you, because you
    support your CARRY functionality only in very limited sequences. So
    storing stores only 64 bits, and loading only loads those bits, and
    sets carry and overflow to no overflow.

    My point was that 1 bit of carry is not enough when MUL and DIV need
    64 bits--and that is the issue CARRY addresses. In addition, multi-
    width shifts also require <essentially> a whole register of width.

    A 256-bit addition (d1,c1,b1,a1)+(d2,c2,b2,a2) would look as follows

    add a3,a1,a2
    add b3,b1,b2
    addc b3,b3,a3
    add c3,c1,c2
    addc c3,c3,b3
    add d3,d1,d2
    addc d3,d3,c3

    with 4 cycles latency.

    CARRY in My 66000 essentially provides an accumulator for a few instructions
    that supply more operands to and receives another result from a calculation.
    Most multiprecision calculation sequences are perfectly happy with another
    register used as an accumulator.

    How does a four-input 2048-bit-addition look with your CARRY? For GPRs-with-flags it would look as follows:

    L:
    ld xn, (xp)
    ld yn, (yp)
    ld zn, (zp)
    ld tn, (tp)
    add rn, xn, yn
    addc rn, rn, rm
    add sn, zn, tn
    addc sn, sn, sm
    add vn, rn, tn
    addc vn, vn, vm
    sd vn, (vp)
    .. #mov rn, sn, vn to rm, sm, vm
    .. #increment xp yp zp tp vp
    .. #loop control and branch back to L:

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 15 01:28:27 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    In "Extending General-Purpose Registers with Carry and Overflow Bits"
    <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> I discuss
    adding a carry bit and overflow bit to every GPR of an architecture.

    Which does nothing for MUL and DIV, while creating complications for
    LD/ST if you want to maintain the 66-bit illusion of a GPR through
    memory.

    I don't think the benefit is worth the cost, as do you, because you
    support your CARRY functionality only in very limited sequences. So
    storing stores only 64 bits, and loading only loads those bits, and
    sets carry and overflow to no overflow.

    My point was that 1 bit of carry is not enough when MUL and DIV need
    64 bits--and that is the issue CARRY addresses. In addition, multi-
    width shifts also require <essentially> a whole register of width.

    A 256-bit addition (d1,c1,b1,a1)+(d2,c2,b2,a2) would look as follows

    add a3,a1,a2
    add b3,b1,b2
    addc b3,b3,a3
    add c3,c1,c2
    addc c3,c3,b3
    add d3,d1,d2
    addc d3,d3,c3

    with 4 cycles latency.

    CARRY in My 66000 essentially provides an accumulator for a few instructions
    that supply more operands to and receives another result from a calculation.
    Most multiprecision calculation sequences are perfectly happy with another
    register used as an accumulator.

    How does a four-input 2048-bit-addition look with your CARRY? For GPRs-with-flags it would look as follows:

    L:
    ld xn, (xp)
    ld yn, (yp)
    ld zn, (zp)
    ld tn, (tp)
    add rn, xn, yn
    addc rn, rn, rm
    add sn, zn, tn
    addc sn, sn, sm
    add vn, rn, tn
    addc vn, vn, vm
    sd vn, (vp)
    .. #mov rn, sn, vn to rm, sm, vm
    .. #increment xp yp zp tp vp
    .. #loop control and branch back to L:

    //pretty close to::

    MOV R12,#0
    VEC R7,{}
    LDD R8,[Rx,Ri<<3]
    LDD R9,[Ry,Ri<<3]
    LDD R10,[Rz,Ri<<3]
    LDD R11,[Rt,Ri<<3]
    CARRY R12,{{IO}{IO}{IO}}
    ADD R13,R8,R9
    ADD R14,R10,R11
    ADD R14,R14,R13
    STD R14,[Rv,Ri<<3]
    LOOP R7,LT,#1,#32

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Nov 15 10:46:42 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:
    <snip>

    In "Extending General-Purpose Registers with Carry and Overflow Bits"
    <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> I discuss
    adding a carry bit and overflow bit to every GPR of an architecture.
    To make it concrete how that would affect the instruction set, I
    propose such an instruction set extension for RISC-V. It contains the
    instructions

    Sidetracking a bit here.

    There are 64-regs in Qupls with four flag bits,

    What other flags do you use?

    A common set of flags is NZCV. Of these N and Z can be generated from
    the 64 ordinary bits (actually N is the MSB of these bits).

    You might also want NCZV of 32-bit instructions, but in that case all
    flags are derivable from the 64 ordinary bits of the GPR; but in that
    case you may need additional branch instructions: Instructions that
    check only if the bottom 32-bits are 0 (Z), if bit 31 is 1 (N), if bit
    32 is 1 (C), or if bit 32 is different from bit 31 (V).

    Concerning saving the extra bits across interrupts, yes, this has to
    be adapted to the actual architecture, and there are many ways to skin
    this cat. I just outlined one to give an idea how this can be done.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Nov 15 07:48:21 2025
    From Newsgroup: comp.arch

    On 2025-11-15 5:46 a.m., Anton Ertl wrote:
    Robert Finch <robfi680@gmail.com> writes:
    <snip>

    In "Extending General-Purpose Registers with Carry and Overflow Bits"
    <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> I discuss
    adding a carry bit and overflow bit to every GPR of an architecture.
    To make it concrete how that would affect the instruction set, I
    propose such an instruction set extension for RISC-V. It contains the
    instructions

    Sidetracking a bit here.

    There are 64-regs in Qupls with four flag bits,

    What other flags do you use?

    A capabilities tag bit, and possibly a bit to indicate float/integer or pointer data. Because of the implementation there are eight bits
    available in the register file (only a byte update is available with the
    BRAM, so it's eight or none). But I am planning on using only four bits so
    there is less data to move around.


    A common set of flags is NZCV. Of these N and Z can be generated from
    the 64 ordinary bits (actually N is the MSB of these bits).

    You might also want NCZV of 32-bit instructions, but in that case all
    flags are derivable from the 64 ordinary bits of the GPR; but in that
    case you may need additional branch instructions: Instructions that
    check only if the bottom 32-bits are 0 (Z), if bit 31 is 1 (N), if bit
    32 is 1 (C), or if bit 32 is different from bit 31 (V).

    Concerning saving the extra bits across interrupts, yes, this has to
    be adapted to the actual architecture, and there are many ways to skin
    this cat. I just outlined one to give an idea how this can be done.

    - anton


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sat Nov 15 15:36:22 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:
    CARRY in My 66000 essentially provides an accumulator for a few instructions that supply more operands to and receives another result from a calculation. Most multiprecision calculation sequences are perfectly happy with another register used as an accumulator.

    I think I've said so before, but it bears repeating:

    I _really_ love CARRY!

    It provides a lot of "missing link" operations, while adding zero extra
    bits to all the instructions that don't need it.

    That said, if I had infinite resources (in this case infinity == 4
    sources), I would like to have an unsigned integer MulAddAdd like this:

    (hi, lo) = a*b+c+d

    simply because this is the largest possible building block that cannot overflow, the result range covers the full 128 bit space.
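
    A quick check of that bound (using gcc/clang's unsigned __int128; just
    an illustration, not production code): with every input at its maximum
    of 2^64-1, the sum lands exactly on 2^128-1, so a 129th bit is never
    needed:

        #include <assert.h>

        int main(void)
        {
            unsigned __int128 m = ~(unsigned __int128)0 >> 64;  /* 2^64 - 1 */
            unsigned __int128 r = m*m + m + m;   /* (2^64-1)^2 + 2*(2^64-1) */
            assert(r == ~(unsigned __int128)0);  /* == 2^128 - 1: still fits */
            return 0;
        }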

    From what you've taught us about multipliers, adding one (or in this
    case two) extra inputs to the adder that aggregates all the partial multiplication products will be close to free in time, but the routing
    of the extra set of inputs might require an extra cycle?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 15 18:04:16 2025
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:
    CARRY in My 66000 essentially provides an accumulator for a few instructions
    that supply more operands to and receives another result from a calculation.
    Most multiprecision calculation sequences are perfectly happy with another register used as an accumulator.

    I think I've said so before, but it bears repeating:

    I _really_ love CARRY!

    It provides a lot of "missing link" operations, while adding zero extra
    bits to all the instructions that don't need it.

    That said, if I had infinite resources (in this case infinity == 4
    sources), I would like to have an unsigned integer MulAddAdd like this:

    (hi, lo) = a*b+c+d

    Alas:: the best CARRY can do is:

    {hi,c} = a×b+hi

    simply because this is the largest possible building block that cannot overflow, the result range covers the full 128 bit space.

    From what you've taught us about multipliers, adding one (or in this
    case two) extra inputs to the adder that aggregates all the partial multiplication products will be close to free in time, but the routing
    of the extra set of inputs might require an extra cycle?

    In the integer case, there is no rounding.
    In the FP case, FMAC is not part of the CARRY applicable OpCode space.

    Terje
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 15 18:07:19 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Robert Finch <robfi680@gmail.com> writes:
    <snip>

    In "Extending General-Purpose Registers with Carry and Overflow Bits"
    <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> I discuss
    adding a carry bit and overflow bit to every GPR of an architecture.
    To make it concrete how that would affect the instruction set, I
    propose such an instruction set extension for RISC-V. It contains the
    instructions

    Sidetracking a bit here.

    There are 64-regs in Qupls with four flag bits,

    What other flags do you use?

    A common set of flags is NZCV. Of these N and Z can be generated from
    the 64 ordinary bits (actually N is the MSB of these bits).

    You might also want NCZV of 32-bit instructions, but in that case all
    flags are derivable from the 64 ordinary bits of the GPR; but in that
    case you may need additional branch instructions: Instructions that
    check only if the bottom 32-bits are 0 (Z), if bit 31 is 1 (N), if bit
    32 is 1 (C), or if bit 32 is different from bit 31 (V).

    If you write an architectural rule whereby every integer result is
    "proper" one set of bits {top, bottom, dispersed} covers everything.

    Proper means that all the bits in the register are written but the
    value written is range limited to {Sign}×{Size} of the calculation.

    Concerning saving the extra bits across interrupts, yes, this has to
    be adapted to the actual architecture, and there are many ways to skin
    this cat. I just outlined one to give an idea how this can be done.

    On the other hand, with CARRY, none of those bits are needed.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Nov 15 18:01:28 2025
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    That said, if I had infinite resources (in this case infinity == 4
    sources), I would like to have an unsigned integer MulAddAdd like this:

    (hi, lo) = a*b+c+d

    simply because this is the largest possible building block that cannot
    overflow, the result range covers the full 128 bit space.

    Not just that, it is also useful for mpn_addmul_1() (ma = ma+s*mb, where m
    is multi-precision, and s is single-precision), which is a useful
    stepping stone for mpn_mul() (m=ma*mb).

    One iteration of mpn_addmul_1() performs:

    (hi[i], ma[i]) = ma[i]+s*mb[i]+hi[i-1]
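
    In portable C (a sketch of mine using unsigned __int128, not GMP's
    actual code), the whole loop is:

        #include <stdint.h>
        #include <stddef.h>

        /* ma += s * mb, n limbs each; returns the final carry limb */
        static uint64_t addmul_1(uint64_t *ma, const uint64_t *mb,
                                 size_t n, uint64_t s)
        {
            uint64_t hi = 0;                  /* hi[i-1] in the recurrence above */
            for (size_t i = 0; i < n; i++) {
                unsigned __int128 t = (unsigned __int128)s * mb[i] + ma[i] + hi;
                ma[i] = (uint64_t)t;          /* low word back into ma */
                hi    = (uint64_t)(t >> 64);  /* carry word into the next iteration */
            }
            return hi;
        }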

    From what you've taught us about multipliers, adding one (or in this
    case two) extra inputs to the adder that aggregates all the partial
    multiplication products will be close to free in time

    The question is the latency. If you get a latency of 4 cycles from
    each input of the a*b+c+d computation to the results, this means that
    the recurrence from hi[i-1] to hi[i] takes 4 cycles, which becomes a performance problem if m is large. One can work around that for
    m=ma*mb by rearranging the computations, but if you really need
    just something like m=s*ma, that's not possible.

    Alternatively, you might use

    (hi, lo) = ma[i]+s*mb[i]
    (hi[i], ma[i])=(hi,lo)+hi[i-1]

    The first line has no recurrences, and so executing it is only limited
    by CPU resources, the second operation has a recurrence from hi[i-1]
    to hi[i], but it takes only one cycle of latency with the right
    architecture, e.g.:

    AMD64 with ADX:
    #rdx = s
    #carry = carry1+C+O
    mulx ma, m, carry2
    adcx mb, m
    adox carry1, m
    mov carry2, carry1

    Given that, a useful instruction is

    (hi,lo) = a*b+c

    Then you only need one carry flag for the rest. A hypothetical ARM
    A64 which has umaddh in addition to (the existing) madd could do this
    with one cycle of recurrence latency, too:

    ARM A64 with umaddh:
    # carry = carry1+C
    umaddh carry2, s, ma, mb
    madd smab, s, ma, mb
    adcs m, smab, carry1
    mov carry1, carry2

    but the routing
    of the extra set of inputs might require an extra cycle?

    Depends on the microarchitecture. The ARM A64 architects have
    instructions that have 4 inputs (store pair with an addressing modes
    that reads two registers), and it apparently works for them.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 16 08:22:52 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    A common set of flags is NZCV. Of these N and Z can be generated from
    the 64 ordinary bits (actually N is the MSB of these bits).

    You might also want NCZV of 32-bit instructions, but in that case all
    flags are derivable from the 64 ordinary bits of the GPR; but in that
    case you may need additional branch instructions: Instructions that
    check only if the bottom 32-bits are 0 (Z), if bit 31 is 1 (N), if bit
    32 is 1 (C), or if bit 32 is different from bit 31 (V).

    If you write an architectural rule whereby every integer result is
    "proper" one set of bits {top, bottom, dispersed} covers everything.

    Proper means that all the bits in the register are written but the
    value written is range limited to {Sign}×{Size} of the calculation.

    I have no idea what you mean with "one set of bits {top, bottom,
    dispersed}".

    As for "proper": Does this mean that one would have to have add(c),
    sub(c), mul (madd etc.), shift right and shift left (did I forget
    anything?) for i8, i16, i32, i64, u8, u16, u32, and u64? Yes, if you
    specify in the operation which kind of Z, C/V, and maybe N you are
    interested in, you do not need to specify it in the branch that checks
    that result; you also eliminate the sign-extension and zero-extension operations that we discussed some time ago.

    But given that the operations are much more frequent than branches,
    encoding that information in the branches uses less space (for shift
    right, the sign is usually included in the operation). It's
    interesting that AFAIK there are instruction sets (e.g., Power) that
    just have one full-width sign-agnostic add, and do not have
    width-specific flags, either. So when compiling stuff like

    if (a[1]+a[2] == 0) /* unsigned a[] */

    a width-specific compare instruction provides that information. But
    gcc generates a compare instruction even when a[] is "unsigned long",
    so apparently add does not set the flags on addition anyway (and if
    there is an add that sets flags, it is not used by gcc for this code).

    Another case is SPARC v9, which tends to set flags. For

    if ((a[1]^a[2]) < 0)

    I see:

    long a[] int a[]
    ldx [ %i0 + 8 ], %g1 ld [ %i0 + 4 ], %g2
    ldx [ %i0 + 0x10 ], %g2 ld [ %i0 + 8 ], %g1
    xor %g1, %g2, %g1 xorcc %g2, %g1, %g0
    brlz,pn %g1, 24 <foo+0x24> bl,a,pn %icc, 20 <foo+0x20>

    Reading up on SPARC v9, it has two sets of condition codes: 32-bit
    (icc) and 64-bit (xcc), and every instruction that sets condition
    codes (e.g., xorcc) sets both. In the present case, the 32-bit
    sequence sets the ccs and then checks icc, while the 64-bit sequence
    does not set the ccs, and instead uses a branch instruction that
    inspects an integer register (%g1). These branch instructions all
    work for the full 64 bits, and do not provide a way to check a 32-bit
    result. In the present case, an alternate way to use brlz for the
    32-bit case would have been:

    ldsw [ %i0 + 8 ], %g1 #ld is a synonym for lduw
    ldsw [ %i0 + 0x10 ], %g2
    xor %g1, %g2, %g1
    brlz,pn %g1, 24 <foo+0x24>

    because the xor of two sign-extended data is also a correct
    sign-extended result, but instead gcc chose to use xorcc and bl %icc.

    There are many ways to skin this cat.

    Concerning saving the extra bits across interrupts, yes, this has to
    be adapted to the actual architecture, and there are many ways to skin
    this cat. I just outlined one to give an idea how this can be done.

    On the other hand, with CARRY, none of those bits are needed.

    But the mechanism of CARRY is quite a bit more involved: Either store
    the carry in a GPR at every step, or have another mechanism inside a
    CARRY block. And either make the CARRY block atomic or have some way
    to preserve the fact that there is this prefix across interrupts and
    (worse) synchronous traps.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 16 14:34:54 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:
    (hi, lo) = a*b+c+d

    Alas:: the best CARRY can do is:

    {hi,c} = a*b+hi

    What latency?

    simply because this is the largest possible building block that cannot
    overflow, the result range covers the full 128 bit space.

    With the carry in the result GPR, you could achieve that as follows:

    add t,c,d
    umaddc hi,lo,a,b,t

    (or split umaddc into an instruction that produces the low result and
    one that produces the high result).

    The disadvantage here is that, with d being the hi of the last
    iteration, you will see the full latency of the add and the umaddh.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 16 14:45:09 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    [...]
    My point was that 1 bit of carry is not enough when MUL and DIV need
    64 bits--and that is the issue CARRY addresses. In addition, multi-
    width shifts also require <essentially> a whole register of width.

    For multiplication, instruction sets provide widening multiplication,
    with one instruction producing two words, or with instructions that
    produce low and high words. These days I would consider doing this
    with SIMD registers, which would allow the double-width result in a
    single register; that was an option for ARM A64, but they chose to use
    GPRs and low and high instructions; strange. For AMD64, it was not
    really an option (it continued with the multiplication instructions
    that existed in IA-32); for RISC-V, it also was not an option, as its
    M extension was designed long before the V extension.

    For division, again double-width by single-width division has been
    designed into several instruction sets and is a good stepping stone
    for multi-precision by single-width division. Again, using SIMD
    registers appears to be a way to reduce the number of register
    accesses in the instruction.

    The double-width->single-width shift looks like a good stepping stone
    for multi-precision shifts. It's unclear to me why Intel included two instructions for that purpose: SHLD and SHRD.
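
    For the left-shift case the per-word step is just a funnel of two
    adjacent words; a sketch (mine, not from any particular library),
    assuming a shift count c with 0 < c < 64:

        #include <stdint.h>
        #include <stddef.h>

        /* shift an n-word number left by c bits (0 < c < 64); r may alias a */
        static void mp_shl(uint64_t *r, const uint64_t *a, size_t n, unsigned c)
        {
            for (size_t i = n - 1; i > 0; i--)
                r[i] = (a[i] << c) | (a[i - 1] >> (64 - c));  /* one SHLD per word */
            r[0] = a[0] << c;
        }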

    How does a four-input 2048-bit-addition look with your CARRY? For
    GPRs-with-flags it would look as follows:

    L:
    ld xn, (xp)
    ld yn, (yp)
    ld zn, (zp)
    ld tn, (tp)
    add rn, xn, yn
    addc rn, rn, rm
    add sn, zn, tn
    addc sn, sn, sm
    add vn, rn, tn
    addc vn, vn, vm
    sd vn, (vp)
    .. #mov rn, sn, vn to rm, sm, vm
    .. #increment xp yp zp tp vp
    .. #loop control and branch back to L:

    Actually, the way to go would be to unroll by a factor of two, with
    the n registers and m registers switching role between the
    sub-iterations. If you do not want to go there, you would not use a
    RISC-V mov, because that expands to an instruction that destroys the
    carry bit. Instead, you would use an idiom (e.g., or rm, rn, zero)
    that transfers the carry bit.

    //pretty close to::

    MOV R12,#0
    VEC R7,{}
    LDD R8,[Rx,Ri<<3]
    LDD R9,[Ry,Ri<<3]
    LDD R10,[Rz,Ri<<3]
    LDD R11,[Rt,Ri<<3]
    CARRY R12,{{IO}{IO}{IO}}
    ADD R13,R8,R9
    ADD R14,R10,R11
    ADD R14,R14,R13
    STD R14,[Rv,Ri<<3]
    LOOP R7,LT,#1,#32

    I thought up to now that the stuff covered by CARRY means

    (R12,R13) = R8+R9+R12
    (R12,R14) = R10+R11+R12
    (R12,R14) = R14+R13+R12

    Which would be wrong for the desired operation. What is needed
    instead is, maybe

    (R12,R14) = ((R8+R9)+(R10+R11))+R12

    My expectation is that with CARRY, something functionally equivalent
    might be implemented as:

    MOV R12,#0
    MOV R15,#0
    MOV R16,#0
    VEC R7, {}
    ... LDDs
    CARRY R12,{{IO}}
    ADD R13,R8,R9
    CARRY R15,{{IO}}
    ADD R14,R10,R11
    CARRY R16,{{IO}}
    ADD R14,R14,R13
    STD ...
    LOOP ...
    ADD R15,R15,R16
    ADD R12,R12,R15

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Nov 16 18:36:02 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    ERROR "unexpected byte sequence starting at index 853: '\xC3'" while decoding:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    A common set of flags is NZCV. Of these N and Z can be generated from
    the 64 ordinary bits (actually N is the MSB of these bits).

    You might also want NCZV of 32-bit instructions, but in that case all
    flags are derivable from the 64 ordinary bits of the GPR; but in that
    case you may need additional branch instructions: Instructions that
    check only if the bottom 32-bits are 0 (Z), if bit 31 is 1 (N), if bit
    32 is 1 (C), or if bit 32 is different from bit 31 (V).

    If you write an architectural rule whereby every integer result is
    "proper" one set of bits {top, bottom, dispersed} covers everything.

    Proper means that all the bits in the register are written but the
    value written is range limited to {Sign}×{Size} of the calculation.

    I have no idea what you mean with "one set of bits {top, bottom,
    dispersed}".

    typedef struct { uint64_t reg;
                     uint8_t  bits: 4; } gpr;
    or
    typedef struct { uint8_t  bits: 4;
                     uint64_t reg; } gpr;
    or
    typedef struct { uint16_t reg0;
                     uint8_t  bit0: 1;
                     uint16_t reg1;
                     uint8_t  bit1: 1;
                     uint16_t reg2;
                     uint8_t  bit2: 1;
                     uint16_t reg3;
                     uint8_t  bit3: 1; } gpr;

    Did you lose every brain-cell of imagination ?!?

    As for "proper": Does this mean that one would have to have add(c),
    sub(c), mul (madd etc.), shift right and shift left (did I forget
    anything?) for i8, i16, i32, i64, u8, u16, u32, and u64? Yes, if
    specify in the operation which kind of Z, C/V, and maybe N you are
    interested in, you do not need to specify it in the branch that checks
    that result; you also eliminate the sign-extension and zero-extension operations that we discussed some time ago.

    {s8, s16, s32, s64, u8, u16, u32, u64} yes.

    But given that the operations are much more frequent than branches,
    encoding that information in the branches uses less space (for shift
    right, the sign is usually included in the operation). It's

    Which is why I don't have ANY of those extra bits.

    interesting that AFAIK there are instruction sets (e.g., Power) that
    just have one full-width sign-agnostic add, and do not have
    width-specific flags, either. So when compiling stuff like

    if (a[1]+a[2] == 0) /* unsigned a[] */

    a width-specific compare instruction provides that information. But
    gcc generates a compare instruction even when a[] is "unsigned long",
    so apparently add does not set the flags on addition anyway (and if
    there is an add that sets flags, it is not used by gcc for this code).

    Another case is SPARC v9, which tends to set flags. For

    if ((a[1]^a[2]) < 0)

    I see:

    long a[] int a[]
    ldx [ %i0 + 8 ], %g1 ld [ %i0 + 4 ], %g2
    ldx [ %i0 + 0x10 ], %g2 ld [ %i0 + 8 ], %g1
    xor %g1, %g2, %g1 xorcc %g2, %g1, %g0
    brlz,pn %g1, 24 <foo+0x24> bl,a,pn %icc, 20 <foo+0x20>

    Reading up on SPARC v9, it has two sets of condition codes: 32-bit
    (icc) and 64-bit (xcc), and every instruction that sets condition
    codes (e.g., xorcc) sets both.

    Another reason its death is helpful to comp.arch

    In the present case, the 32-bit
    sequence sets the ccs and then checks icc, while the 64-bit sequence
    does not set the ccs, and instead uses a branch instruction that
    inspects an integer register (%g1). These branch instructions all
    work for the full 64 bits, and do not provide a way to check a 32-bit
    result. In the present case, an alternate way to use brlz for the
    32-bit case would have been:

    ldsw [ %i0 + 8 ], %g1 #ld is a synonym for lduw
    ldsw [ %i0 + 0x10 ], %g2
    xor %g1, %g2, %g1
    brlz,pn %g1, 24 <foo+0x24>

    because the xor of two sign-extended data is also a correct
    sign-extended result, but instread gcc chose to use xorcc and bl %icc.

    There are many ways to skin this cat.

    Sure:: close to 20-ways, less than 4 of them are "proper".

    Concerning saving the extra bits across interrupts, yes, this has to
    be adapted to the actual architecture, and there are many ways to skin
    this cat. I just outlined one to give an idea how this can be done.

    On the other hand, with CARRY, none of those bits are needed.

    But the mechanism of CARRY is quite a bit more involved: Either store
    the carry in a GPR at every step, or have another mechanism inside a
    CARRY block. And either make the CARRY block atomic or have some way
    to preserve the fact that there is this prefix across interrupts and
    (worse) synchronous traps.

    During its "life" the bits used in CARRY are simply another feedback
    path on the data-path. Afterwards, carry is written once. CARRY also
    gets written when an exception is taken.


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Nov 16 18:41:03 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:
    (hi, lo) = a*b+c+d

    Alas:: the best CARRY can do is:

    {hi,c} = a*b+hi

    What latency?

    1 multiply latency {likely 4 cycles} but more importantly no more cycles
    than
    c = a*b;

    simply because this is the largest possible building block that cannot
    overflow, the result range covers the full 128 bit space.

    With the carry in the result GPR, you could achieve that as follows:

    add t,c,d
    umaddc hi,lo,a,b,t

    You can do this at the added latency of ADD.

    (or split umaddc into an instruction that produces the low result and
    one that produces the high result).

    CARRY is an instruction-modifier; it is not "executed" {or you can
    consider it "executed" in the DECODE stage of the pipeline}. The
    subsequent MUL takes no more time with CARRY than without.

    The disadvantage here is that, with d being the hi of the last
    iteration, you will see the full latency of the add and the umaddh.

    Does R stand for Reduced or Ridiculous ?!?


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Mon Nov 17 02:49:15 2025
    From Newsgroup: comp.arch

    On 2025-11-16 1:36 p.m., MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    ERROR "unexpected byte sequence starting at index 853: '\xC3'" while decoding:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    A common set of flags is NZCV. Of these N and Z can be generated from >>>> the 64 ordinary bits (actually N is the MSB of these bits).

    You might also want NCZV of 32-bit instructions, but in that case all
    flags are derivable from the 64 ordinary bits of the GPR; but in that
    case you may need additional branch instructions: Instructions that
    check only if the bottom 32-bits are 0 (Z), if bit 31 is 1 (N), if bit >>>> 32 is 1 (C), or if bit 32 is different from bit 31 (V).

    If you write an architectural rule whereby every integer result is
    "proper" one set of bits {top, bottom, dispersed} covers everything.

    Proper means that all the bits in the register are written but the
    value written is range limited to {Sign}×{Size} of the calculation.

    I have no idea what you mean with "one set of bits {top, bottom,
    dispersed}".

    typedef struct { uint64_t reg;
    uint8_t bits: 4; } gpr;
    or
    typedef struct { uint8_t bits: 4;
    uint64_t reg;} gpr;
    or
    typedef struct { uint16_t reg0;
    uint8_t bit0: 1;
    uint16_t reg1;
    uint8_t bit1: 1;
    uint16_t reg2;
    uint8_t bit2: 1;
    uint16_t reg3;
    uint8_t bit3: 1; } gpr;

    Did you loose every brain-cell of imagination ?!?

    As for "proper": Does this mean that one would have to have add(c),
    sub(c), mul (madd etc.), shift right and shift left (did I forget
    anything?) for i8, i16, i32, i64, u8, u16, u32, and u64? Yes, if
    specify in the operation which kind of Z, C/V, and maybe N you are
    interested in, you do not need to specify it in the branch that checks
    that result; you also eliminate the sign-extension and zero-extension
    operations that we discussed some time ago.

    {s8, s16, s32, s64, u8, u16, u32, u64} yes.

    But given that the operations are much more frequent than branches,
    encoding that information in the branches uses less space (for shift
    right, the sign is usually included in the operation). It's

    Which is why I don't have ANY of those extra bits.

    interesting that AFAIK there are instruction sets (e.g., Power) that
    just have one full-width sign-agnostic add, and do not have
    width-specific flags, either. So when compiling stuff like

    if (a[1]+a[2] == 0) /* unsigned a[] */

    a width-specific compare instruction provides that information. But
    gcc generates a compare instruction even when a[] is "unsigned long",
    so apparently add does not set the flags on addition anyway (and if
    there is an add that sets flags, it is not used by gcc for this code).

    Another case is SPARC v9, which tends to set flags. For

    if ((a[1]^a[2]) < 0)

    I see:

    long a[] int a[]
    ldx [ %i0 + 8 ], %g1 ld [ %i0 + 4 ], %g2
    ldx [ %i0 + 0x10 ], %g2 ld [ %i0 + 8 ], %g1
    xor %g1, %g2, %g1 xorcc %g2, %g1, %g0
    brlz,pn %g1, 24 <foo+0x24> bl,a,pn %icc, 20 <foo+0x20>

    Reading up on SPARC v9, it has two sets of condition codes: 32-bit
    (icc) and 64-bit (xcc), and every instruction that sets condition
    codes (e.g., xorcc) sets both.

    Another reason its death is helpful to comp.arch

    In the present case, the 32-bit
    sequence sets the ccs and then checks icc, while the 64-bit sequence
    does not set the ccs, and instead uses a branch instruction that
    inspects an integer register (%g1). These branch instructions all
    work for the full 64 bits, and do not provide a way to check a 32-bit
    result. In the present case, an alternate way to use brlz for the
    32-bit case would have been:

    ldsw [ %i0 + 8 ], %g1 #ld is a synonym for lduw
    ldsw [ %i0 + 0x10 ], %g2
    xor %g1, %g2, %g1
    brlz,pn %g1, 24 <foo+0x24>

    because the xor of two sign-extended values is also a correctly
    sign-extended result, but instead gcc chose to use xorcc and bl %icc.

    There are many ways to skin this cat.

    Sure:: close to 20-ways, less than 4 of them are "proper".

    Concerning saving the extra bits across interrupts, yes, this has to
    be adapted to the actual architecture, and there are many ways to skin
    this cat. I just outlined one to give an idea how this can be done.

    On the other hand, with CARRY, none of those bits are needed.

    But the mechanism of CARRY is quite a bit more involved: Either store
    the carry in a GPR at every step, or have another mechanism inside a
    CARRY block. And either make the CARRY block atomic or have some way
    to preserve the fact that there is this prefix across interrupts and
    (worse) synchronous traps.

    During its "life" the bits used in CARRY are simply another feedback
    path on the data-path. Afterwards, carry is written once. CARRY also
    gets written when an exception is taken.


    - anton

    These posts have inspired me to keep working on the ISA. I am on a simplification mission.

    The CARRY modifier is just a substitute for not having r3w2 port
    instructions directly in the ISA. Since Qupls ISA has room to support
    some r3w2 instructions directly there is no need for CARRY, much as I
    like the idea.

    While not using a carry flag in the register, there are still a
    capabilities bit, an overflow bit, and a pointer bit, plus four
    user-assigned bits. I decided to just have 72-bit register load and
    store instructions along with the usual 8, 16, 32, and 64-bit ones.

    Finding it too difficult to support 128-bit operations using high/low
    register pairs. Getting the reservation stations to pair up the
    registers seems a bit scary. It would be much simpler to just have
    128-bit registers, and it appears it may not be any more logic. The
    benefit of using register pairs is that the internal buses then need
    only be 64 bits wide.

    Sparc v9 died?



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Nov 17 08:33:58 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:
    >Finding it too difficult to support 128-bit operations using high, low
    >register pairs. Getting the reservation stations to pair up the
    >registers seems a bit scary. It would be much simpler to just have
    >128-bit registers and it appears as if it may not be any more logic.

    If you want to support 128-bit operations, using 128-bit registers
    certainly is the way to go. Note how AMD used to split 128-bit SSE
    operations into 64-bit parts on 64-bit registers in the K8, split
    256-bit AVX operations into 128-bit parts on 128-bit registers in Zen,
    but they went away from that: In Zen4 512-bit operations are performed
    in 256-bit-pieces, but the registers are 512 bits wide.

    However, the point of carry bits or Mitch Alsup's CARRY is not 128-bit operations, but multi-precision, which can be 256-bit for some crypto,
    4096 bits for other crypto, or billions of bits for the stuff that
    Alexander Yee is doing.
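
    For concreteness, a minimal C sketch of the multi-precision case (the
    flagless formulation and the names here are illustrative only): with a
    carry flag or a CARRY-style mechanism the loop body is one
    add-with-carry per 64-bit limb; without one, the carry has to be
    recovered by comparing the sum against an operand.

    #include <stdint.h>
    #include <stddef.h>

    /* Multi-word addition, least-significant limb first.  Returns the
       final carry out.  On a flagless ISA each limb costs two adds and
       two compares; with a carry flag (or CARRY) it is one ADC per limb. */
    static uint64_t mp_add(uint64_t *r, const uint64_t *a,
                           const uint64_t *b, size_t n)
    {
        uint64_t carry = 0;
        for (size_t i = 0; i < n; i++) {
            uint64_t s = a[i] + carry;
            uint64_t c1 = (s < a[i]);   /* carry out of adding the old carry */
            s += b[i];
            uint64_t c2 = (s < b[i]);   /* carry out of adding b[i] */
            r[i] = s;
            carry = c1 | c2;            /* at most one of the two can be set */
        }
        return carry;
    }

    On x86-64 a compiler can lower the body to ADD/ADC; the source-level
    contortions are what a carry-less 64-bit register model forces on you,
    which is exactly the case carry bits (or CARRY) are meant to address.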

    Sparc v9 died?

    Oracle discontinued SPARC development in 2017, and Fujitsu announced
    in 2016 that they would switch to ARM A64. Both Oracle and Fujitsu
    released their last new SPARC CPU in 2017. Fujitsu has
    released the ARM A64-based A64FX in 2019. The Leon4 (2017 according
    to <https://en.wikipedia.org/wiki/SPARC#Implementations>) and Leon5
    (2019) implement SPARC v8, not v9.

    The MCST-R2000 (2018) implements SPARC v9, but will it have a
    successor? And even if it has a successor, will it be available in
    relevant numbers? MCST is not married to SPARC, despite their name;
    they have worked on Elbrus 2000 implementations as well; Elbrus 2000
    supports Elbrus VLIW and "Intel x86" instruction sets, and new models
    were released in 2018, 2021, and 2025, so MCST now seems to focus on
    that.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Mon Nov 17 08:17:20 2025
    From Newsgroup: comp.arch

    On 2025-11-17 3:33 a.m., Anton Ertl wrote:
    Robert Finch <robfi680@gmail.com> writes:
    Finding it too difficult to support 128-bit operations using high, low
    register pairs. Getting the reservation stations to pair up the
    registers seems a bit scary. It would be much simpler to just have
    128-bit registers and it appears as if it may not be any more logic.

    If you want to support 128-bit operations, using 128-bit registers
    certainly is the way to go. Note how AMD used to split 128-bit SSE operations into 64-bit parts on 64-bit registers in the K8, split
    256-bit AVX operations into 128-bit parts on 128-bit registers in Zen,
    but they went away from that: In Zen4 512-bit operations are performed
    in 256-bit-pieces, but the registers are 512 bits wide.

    However, the point of carry bits or Mitch Alsup's CARRY is not 128-bit operations, but multi-precision, which can be 256-bit for some crypto,
    4096 bits for other crypto, or billions of bits for the stuff that
    Alexander Yee is doing.

    Sparc v9 died?

    Oracle has discontinued SPARC development in 2017, Fujitsu has
    announced in 2016 that they switch to ARM A64. Both Oracle and
    Fujitsu released their last new SPARC CPU in 2017. Fujitsu has
    released the ARM A64-based A64FX in 2019. The Leon4 (2017 according
    to <https://en.wikipedia.org/wiki/SPARC#Implementations>) and Leon5
    (2019) implement SPARC v8, not v9.

    The MCST-R2000 (2018) implements SPARC v9, but will it have a
    successor? And even if it has a successor, will it be available in
    relevant numbers? MCST is not married to SPARC, despite their name;
    they have worked on Elbrus 2000 implementations as well; Elbrus 2000
    supports Elbrus VLIW and "Intel x86" instruction sets, and new models
    were released in 2018, 2021, and 2025, so MCST now seems to focus on
    that.

    - anton

    Skimming through the SPARC architecture manual I am wondering how they
    handle register renaming with a windowed register file. If the register
    window file is deep there must be a ginormous number of registers for renaming. Would it need to keep track of the renames for all the
    registers? How does it dump the rename state to memory?

    Tried to find some information on Elbrus, but got "page not found" a
    couple of times. Other than that it is a VLIW machine, I do not know
    much about it.

    *****

    I would like a machine able to process 128-bit values directly, but it
    takes up too many resources. It is easier to make the register file deep
    as opposed to wide. BRAM has a max 64-bit width. After that it takes
    more BRAMs to get a wider port. I tried a 128-bit wide register file,
    but it used about 200 BRAMs. Too many.

    There are now 128 logical registers available in Qupls. It turns out
    that the BRAM setup is 512 registers deep no matter whether there are
    32, 64, or 128 registers. So, we may as well make them available.

    Qupls reservation stations were set up with support for eight operands
    (four for each 64-bit half of a 128-bit register). The resulting logic
    was about 25,000 LUTs for just one RS, compared to about 5,000 LUTs
    when there were just four operands. What actually gets implemented is
    considerably less, as most functional units do not need all the
    operands.

    It may be more resource-efficient to use multiple reservation stations
    as opposed to more operands in a single station. But then the operands
    need to be linked together between stations. That might be possible
    using a hash of the PC value and ROB entry number.

    Qupls seems to have an implementation four or five times the size of the
    FPGA again. Back to the drawing board.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Nov 17 17:36:47 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:
    >Skimming through the SPARC architecture manual I am wondering how they
    >handle register renaming with a windowed register file. If the register
    >window file is deep there must be a ginormous number of registers for
    >renaming. Would it need to keep track of the renames for all the
    >registers? How does it dump the rename state to memory?

    There is no need to dump the rename state to memory, not for SPARC nor
    for anything else. It's only microarchitectural.

    The large number of architected registers may have been a reason why
    it took them so long to implement OoO execution.

    I think that the cost is typically a register allocation table (RAT)
    per branch (for maybe 50 branches or potential traps that you want to
    predict, i.e., 50 RATs). With 32 architected registers and 257-512
    physical registers that's 32*9 bits = 288 bits per RAT; with the 136
    architected registers of SPARC, and again <=512 physical registers,
    that would be 1224 bits per RAT.
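
    As a rough illustration, a sketch of what such a RAT and its
    checkpoints amount to (the sizes and the array-of-snapshots layout are
    assumptions for illustration, not a description of any particular
    core):

    #include <stdint.h>

    #define ARCH_REGS   32   /* 136 for the windowed SPARC case above */
    #define CHECKPOINTS 50   /* roughly one per predicted branch in flight */

    /* RAT: architected register -> physical register.  With <=512 physical
       registers each entry needs 9 bits, hence the 32*9 = 288 bits quoted
       above; uint16_t is used here only for simplicity. */
    typedef struct { uint16_t phys[ARCH_REGS]; } rat_t;

    typedef struct {
        rat_t current;              /* speculative map used by rename    */
        rat_t ckpt[CHECKPOINTS];    /* one snapshot per predicted branch */
    } renamer_t;

    /* Snapshot at a branch; restore on misprediction.  Both are purely
       microarchitectural, so nothing here ever reaches memory. */
    static void rat_checkpoint(renamer_t *rn, int slot)
    { rn->ckpt[slot] = rn->current; }

    static void rat_recover(renamer_t *rn, int slot)
    { rn->current = rn->ckpt[slot]; }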

    There are probably other options than using a RAT, but I have
    forgotten them.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Nov 17 18:41:19 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-16 1:36 p.m., MitchAlsup wrote:
    -------------------------------

    During its "life" the bits used in CARRY are simply another feedback
    path on the data-path. Afterwards, carry is written once. CARRY also
    gets written when an exception is taken.


    - anton

    These posts have inspired me to keep working on the ISA. I am on a simplification mission.

    The CARRY modifier is just a substitute for not having r3w2 port instructions directly in the ISA. Since Qupls ISA has room to support
    some r3w2 instructions directly there is no need for CARRY, much as I
    like the idea.

    That is correct at the 95% level.

    While not using a carry flag in the register, there is still a
    capabilities bit, overflow bit and pointer bit plus four user assigned
    bits. I decided to just have 72-bit register store and load instructions along with the usual 8,16,32 and 64.

    Finding it too difficult to support 128-bit operations using high, low register pairs. Getting the reservation stations to pair up the
    registers seems a bit scary.

    It IS scary and hard and tricky to get right.

    It would be much simpler to just have
    128-bit registers and it appears as if it may not be any more logic. The benefit of using register pairs is the internal busses need only be
    64-bits then.

    Almost exactly what we did in the Mc 88120 when facing the same
    problem. Except we kept the 32-bit model and had register files 2
    registers tall {even, odd}, {odd, even}, so any register specifier
    would simply read out the status and values of both registers and then
    let the stations handle the sundry problems.

    Sparc v9 died?

    What was the last year SPARC sold more than 100,000 CPUs ??
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Nov 17 18:45:39 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-17 3:33 a.m., Anton Ertl wrote:
    Robert Finch <robfi680@gmail.com> writes:
    Finding it too difficult to support 128-bit operations using high, low
    register pairs. Getting the reservation stations to pair up the
    registers seems a bit scary. It would be much simpler to just have
    128-bit registers and it appears as if it may not be any more logic.

    If you want to support 128-bit operations, using 128-bit registers certainly is the way to go. Note how AMD used to split 128-bit SSE operations into 64-bit parts on 64-bit registers in the K8, split
    256-bit AVX operations into 128-bit parts on 128-bit registers in Zen,
    but they went away from that: In Zen4 512-bit operations are performed
    in 256-bit-pieces, but the registers are 512 bits wide.

    However, the point of carry bits or Mitch Alsup's CARRY is not 128-bit operations, but multi-precision, which can be 256-bit for some crypto,
    4096 bits for other crypto, or billions of bits for the stuff that Alexander Yee is doing.

    Sparc v9 died?

    Oracle has discontinued SPARC development in 2017, Fujitsu has
    announced in 2016 that they switch to ARM A64. Both Oracle and
    Fujitsu released their last new SPARC CPU in 2017. Fujitsu has
    released the ARM A64-based A64FX in 2019. The Leon4 (2017 according
    to <https://en.wikipedia.org/wiki/SPARC#Implementations>) and Leon5
    (2019) implement SPARC v8, not v9.

    The MCST-R2000 (2018) implements SPARC v9, but will it have a
    successor? And even if it has a successor, will it be available in relevant numbers? MCST is not married to SPARC, despite their name;
    they have worked on Elbrus 2000 implementations as well; Elbrus 2000 supports Elbrus VLIW and "Intel x86" instruction sets, and new models
    were released in 2018, 2021, and 2025, so MCST now seems to focus on
    that.

    - anton

    Skimming through the SPARC architecture manual I am wondering how they handle register renaming with a windowed register file. If the register window file is deep there must be a ginormous number of registers for renaming. Would it need to keep track of the renames for all the
    registers? How does it dump the rename state to memory?

    Tried to find some information on Elbrus. I got page not found a couple
    of times. Other than it’s a VLIW machine I do not know much about it.

    *****

    I would like a machine able to process 128-bit values directly, but it
    takes up too many resources. It is easier to make the register file deep
    as opposed to wide. BRAM has a max 64-bit width. After that it takes
    more BRAMs to get a wider port. I tried a 128-bit wide register file,
    but it used about 200 BRAMs. Too many.

    There are now 128 logical registers available in Qupls. It turns out
    that the BRAM setup is 512 registers deep no matter whether there are
    32,64 or 128 registers. So, may as well make them available.

    Can you read BRAM 2× or 4× per CPU cycle ?!?

    Qupls reservation stations were set up with support for eight operands
    (four each for each ½ 128-bit register). The resulting logic was about 25,000 LUTs for just one RS. This is compared to about 5,000 LUTs when
    there were just four operands. What gets implemented is considerably
    less as most functional units do not need all the operands.

    Ok, you found one way NOT to DO IT.

    It may be resource efficient to use multiple reservation stations as
    opposed to more operands in a single station. But then the operands need
    to be linked together between stations. It may be possible using a hash
    of the PC value and ROB entry number.

    Allow me to dissuade you from this.

    Qupls seems to have an implementation four or five times the size of the FPGA again. Back to the drawing board.

    Live within your means.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Nov 17 18:54:17 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Robert Finch <robfi680@gmail.com> writes:
    >Skimming through the SPARC architecture manual I am wondering how they
    >handle register renaming with a windowed register file. If the register
    >window file is deep there must be a ginormous number of registers for
    >renaming. Would it need to keep track of the renames for all the
    >registers? How does it dump the rename state to memory?

    I don't remember SPARC ever getting OoO. The windowed register file
    is but one cause.

    There is no need to dump the rename state to memory, not for SPARC nor
    for anything else. It's only microarchitectural.

    It does need to be checkpointed if/when going OoO.

    The large number of architected registers may have been a reason why
    they needed so long to implement OoO execution.

    I think that the cost is typically a register allocation table RAT per
    branch (for maybe 50 branches or potential traps that you want to
    predict, i.e., 50 RATs).

    50 RAT entries not 50 RATs.

    With 32 architected registers and 257-512
    physical registers that's 32*9 bits = 288 bits per RAT; with the 136 architected registers of SPARC, and again <=512 physical registers,
    that would be 1224 bits per RAT.

    Register files with more than 128 entries become big and especially SLOW.
    Even 128 register entries is pushing your luck.

    There are probably other options that using a RAT, but I have
    forgotten them.

    Physical register file where reads are done by {cam, valid} and writes
    are done by the decoder. The valid bits are recorded in a history table
    for mispredict recovery between decode cycles.

    There is also the Value-free reservation station model, where the RF
    is not read until the station fires its entry.
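
    A bare-bones sketch of the value-free station idea (field names and
    sizes are illustrative assumptions only, not any shipped design): the
    entry holds only tags and ready bits, and the physical register file is
    read only when the entry issues.

    #include <stdint.h>
    #include <stdbool.h>

    /* Value-free reservation station entry: no operand values are captured
       at dispatch, only physical register tags and ready bits.  The (wide)
       values are read from the PRF in the cycle the entry fires, so the
       station stays narrow even with 128-bit registers. */
    typedef struct {
        bool     busy;
        uint16_t dest_preg;
        uint16_t src_preg[2];
        bool     src_ready[2];   /* set by the wakeup/broadcast network */
    } rs_entry_t;

    /* Wakeup: a completing instruction broadcasts its destination tag. */
    static void rs_wakeup(rs_entry_t *rs, int n, uint16_t done_preg)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; rs[i].busy && j < 2; j++)
                if (rs[i].src_preg[j] == done_preg)
                    rs[i].src_ready[j] = true;
    }

    /* Select: an entry may issue once both sources are ready; only then
       does it actually read the register file. */
    static int rs_select(const rs_entry_t *rs, int n)
    {
        for (int i = 0; i < n; i++)
            if (rs[i].busy && rs[i].src_ready[0] && rs[i].src_ready[1])
                return i;
        return -1;
    }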

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Nov 17 20:58:14 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    I don't remember SPARC ever getting OoO.

    https://dl.acm.org/doi/10.5555/874064.875643 (paywalled, but the
    first few lines are legible) talks about such an implementation.

    The windowed register file
    is but one cause.

    Certainly didn't make it easier...
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Nov 17 23:35:37 2025
    From Newsgroup: comp.arch

    On Mon, 17 Nov 2025 18:54:17 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Robert Finch <robfi680@gmail.com> writes:
    Skimming through the SPARC architecture manual I am wondering how
    they handle register renaming with a windowed register file. If
    the register window file is deep there must be a ginormous number
    of registers for renaming. Would it need to keep track of the
    renames for all the registers? How does it dump the rename state
    to memory?

    I don't remember SPARC ever getting OoO. The windowed register file
    is but one cause.


    The first production OoO SPARC was the HAL SPARC64, manufactured for
    Fujitsu on Fujitsu's own fabs back in 1995, so a contemporary of the
    PPro. It was a 4-die chipset.
    The HAL SPARC64-GP was the first single-chip implementation, in 1997.
    https://en.wikipedia.org/wiki/HAL_SPARC64
    The line was continued by Fujitsu:
    https://en.wikipedia.org/wiki/SPARC64_V
    Since then and up to 2017 there were many generations made by Fujitsu.

    There were also a few OoO SPARCs designed by Oracle, independently of
    Fujitsu. I think that they all shared the same core uArch, originally
    introduced in the SPARC T4 (2011).
    https://en.wikipedia.org/wiki/SPARC_T4

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Mon Nov 17 16:58:31 2025
    From Newsgroup: comp.arch

    On 2025-11-17 1:45 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-17 3:33 a.m., Anton Ertl wrote:
    Robert Finch <robfi680@gmail.com> writes:
    Finding it too difficult to support 128-bit operations using high, low >>>> register pairs. Getting the reservation stations to pair up the
    registers seems a bit scary. It would be much simpler to just have
    128-bit registers and it appears as if it may not be any more logic.

    If you want to support 128-bit operations, using 128-bit registers
    certainly is the way to go. Note how AMD used to split 128-bit SSE
    operations into 64-bit parts on 64-bit registers in the K8, split
    256-bit AVX operations into 128-bit parts on 128-bit registers in Zen,
    but they went away from that: In Zen4 512-bit operations are performed
    in 256-bit-pieces, but the registers are 512 bits wide.

    However, the point of carry bits or Mitch Alsup's CARRY is not 128-bit
    operations, but multi-precision, which can be 256-bit for some crypto,
    4096 bits for other crypto, or billions of bits for the stuff that
    Alexander Yee is doing.

    Sparc v9 died?

    Oracle has discontinued SPARC development in 2017, Fujitsu has
    announced in 2016 that they switch to ARM A64. Both Oracle and
    Fujitsu released their last new SPARC CPU in 2017. Fujitsu has
    released the ARM A64-based A64FX in 2019. The Leon4 (2017 according
    to <https://en.wikipedia.org/wiki/SPARC#Implementations>) and Leon5
    (2019) implement SPARC v8, not v9.

    The MCST-R2000 (2018) implements SPARC v9, but will it have a
    successor? And even if it has a successor, will it be available in
    relevant numbers? MCST is not married to SPARC, despite their name;
    they have worked on Elbrus 2000 implementations as well; Elbrus 2000
    supports Elbrus VLIW and "Intel x86" instruction sets, and new models
    were released in 2018, 2021, and 2025, so MCST now seems to focus on
    that.

    - anton

    Skimming through the SPARC architecture manual I am wondering how they
    handle register renaming with a windowed register file. If the register
    window file is deep there must be a ginormous number of registers for
    renaming. Would it need to keep track of the renames for all the
    registers? How does it dump the rename state to memory?

    Tried to find some information on Elbrus. I got page not found a couple
    of times. Other than it’s a VLIW machine I do not know much about it.

    *****

    I would like a machine able to process 128-bit values directly, but it
    takes up too many resources. It is easier to make the register file deep
    as opposed to wide. BRAM has a max 64-bit width. After that it takes
    more BRAMs to get a wider port. I tried a 128-bit wide register file,
    but it used about 200 BRAMs. Too many.

    There are now 128 logical registers available in Qupls. It turns out
    that the BRAM setup is 512 registers deep no matter whether there are
    32,64 or 128 registers. So, may as well make them available.

    Can you read BRAM 2× or 4× per CPU cycle ?!?

    The BRAM and logic are not fast enough. There is also some logic to
    select BRAM outputs via a live value table.


    Qupls reservation stations were set up with support for eight operands
    (four each for each ½ 128-bit register). The resulting logic was about
    25,000 LUTs for just one RS. This is compared to about 5,000 LUTs when
    there were just four operands. What gets implemented is considerably
    less as most functional units do not need all the operands.

    Ok, you found one way NOT to DO IT.

    It may be resource efficient to use multiple reservation stations as
    opposed to more operands in a single station. But then the operands need
    to be linked together between stations. It may be possible using a hash
    of the PC value and ROB entry number.

    Allow me to dissuade you from this.

    Whew! After several tries I think I found a much better way of doing
    things. The 128-bit op instructions are simply translated into two (or
    more) 64-bit op micro-ops at the micro-op translation stage. There is no messing around with reservation stations or operands then. But the
    performance is potentially cut in half. For a much smaller
    implementation it is worth it. Micro-op translation is only a few
    hundred LUTs.
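
    Roughly, the translation step amounts to something like this sketch
    (the micro-op format, register numbering, and the TMP_CARRY linking
    temporary are made up for illustration; the real linkage could equally
    be an implicit flag or an r3w2 micro-op):

    #include <stdint.h>

    enum uop_kind { UOP_ADD_LO_CO, UOP_ADD_HI_CI };   /* carry-out / carry-in */

    typedef struct {
        enum uop_kind kind;
        uint8_t dst, src1, src2;
        uint8_t link;             /* temp register tying the halves together */
    } uop_t;

    #define TMP_CARRY 127         /* hypothetical reserved linking register */

    /* Crack "ADD128 rd, ra, rb" (each name denoting the low half of a
       register pair rd/rd+1 etc.) into two 64-bit micro-ops. */
    static int crack_add128(uop_t out[2], uint8_t rd, uint8_t ra, uint8_t rb)
    {
        out[0] = (uop_t){ UOP_ADD_LO_CO, rd, ra, rb, TMP_CARRY };
        out[1] = (uop_t){ UOP_ADD_HI_CI, (uint8_t)(rd + 1),
                          (uint8_t)(ra + 1), (uint8_t)(rb + 1), TMP_CARRY };
        return 2;   /* number of micro-ops emitted */
    }

    The reservation stations then only ever see 64-bit operands; the cost
    is the extra issue slot, hence the halved peak rate mentioned above.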

    Qupls seems to have an implementation four or five times the size of the
    FPGA again. Back to the drawing board.

    Live within your means.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Nov 18 08:58:17 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    There is no need to dump the rename state to memory, not for SPARC nor
    for anything else. It's only microarchitectural.

    It does need to be checkpointed if/when going OoO.

    You do register renaming in order to go OoO, so OoO is a given.
    Unless Robert Finch meant the register windowing, but I don't think
    so.

    And yes, the rename state needs to be checkpointed in order to restore
    it when recovering from a branch misprediction or the like. But these checkpoints are also microarchitectural and must not reach
    architectural memory.

    With 32 architected registers and 257-512
    physical registers that's 32*9 bits = 288 bits per RAT; with the 136
    architected registers of SPARC, and again <=512 physical registers,
    that would be 1224 bits per RAT.

    Register files with more than 128 entries become big and especially SLOW.

    The 280 physical integer and 332 physical FP registers of Raptor Cove
    have not prevented it from reaching 6.2GHz. Zen5 also reaches pretty
    high clocks with its 384 physical FP registers.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Nov 18 15:16:23 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    The first production OoO SPARC was HAL SPARC64 manufactured for
    Fujitsu on Fujitsu's own fabs back in 1995, so contemporary of PPro. It
    was 4-die chipset.
    HAL SPARC64-GP was first single-chip implementation in 1997. >https://en.wikipedia.org/wiki/HAL_SPARC64
    The line was continued by Fujitsu:
    https://en.wikipedia.org/wiki/SPARC64_V

    It did not register with me at the time, probably because the
    HAL/Fujitsu SPARCs did not get as much press as the CPUs of the
    American companies, and because the SPARC64 I/II/GP did not come out
    with impressive clock rates for their time. It got better with the
    SPARC64 V, which reached higher clock rates (and SPEC results; CINT2000
    shown):

    SPARC64 V 1350MHz 905 peak, 776 base Jul 2003
    SPARC64 V 1890MHz 1345 peak, 1174 base Jun 2004

    For comparison:

    UltraSPARC III Cu 1200MHz 722 peak, 642 base Apr 2003

    Opteron 1800MHz 1170 peak, 1095 base May 2003
    Athlon 64 2000MHz 1335 peak, 1266 base Sep 2003
    Opteron 2400MHz 1655 peak, 1566 base May 2004

    Pentium 4 3067MHz 1210 peak, 1167 base May 2003
    Pentium 4 EE 3400MHz 1704 peak, 1666 base Feb 2004
    Xeon 3600MHz 1538 peak, 1463 base Aug 2004

    Itanium2 1500MHz 1322 peak, 1322 base Jul 2003

    21364 1300MHz 994 peak, 904 base Aug 2004

    Power 4+ 1700MHz 1113 peak, 1077 base May 2003
    Power 5 1900MHz 1451 peak, 1383 base Nov 2004
    PowerPC 970 2200MHz 1040 peak, 986 base Nov 2004

    The SPARC64 V has the best CINT2000 results of any SPARC published
    before 2005, by a wide margin. It was competitive with its
    contemporaries, but did not surpass them. It needed somewhat higher
    clock rates to match the in-order Itanium II 1500MHz (and even then
    only in peak); maybe the higher clock rate was a result of the OoO
    design, but certainly no higher IPC is visible compared to the Itanium
    II. Compared to the other in-order design in this collection
    (UltraSPARC III Cu), both the clock rate and the IPC are better,
    however.

    BTW, I bought a 2000MHz Athlon 64 3200+ like the one listed above in
    IIRC October or November 2003 (I posted benchmark results here in
    <2003Nov23.094309@a0.complang.tuwien.ac.at>).

    <https://en.wikipedia.org/wiki/HAL_SPARC64#SPARC64_II> says:

    |The number of physical registers was increased to 128 from 116 and the
    |number of register files to five from four.

    I assume that the latter is supposed to mean five instead of four
    register windows. That would mean that 80 of the 116 registers (in
    SPARC64 I) or 96 of the 128 registers (in SPARC64 II) would be
    architectural, if register windows and register renaming happened
    independently; not a lot of renaming capacity, but that's probably in
    line with the vintage (the 1999 Coppermine has 40 ROB entries (with
    valued uops, so each ROB entry has one result register), so around the
    same renaming capacity as the SPARC64 I/II).

    How can register renaming be implemented on SPARC? As discussed
    above, this can be done independently: Have 96 architectural registers
    (plus the window pointer), and make 8 of them global registers, and
    the rest 24 visible registers plus 4 windows of 16 registers, with the
    usual switching. And then rename these 96 architectural registers.

    A variant in the opposite direction would be to treat only the 32
    visible registers as architectural registers, avoiding large RAT
    entries. The save instruction would emit store microinstructions for
    the local and in registers, and then the renamer would rename the out
    registers to the in registers, and would assign 0 to the local and out registers (which would not occupy a physical register at first). This
    approach makes the most sense with a separate renamer as is now
    common. The restore instruction would rename the in registers to the
    out registers, and emit load microinstructions for the local and the
    in registers.
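
    In C-like pseudocode, the renamer-side handling sketched above might
    look roughly like this (visible-register numbering %o0-%o7 = 8..15,
    %l0-%l7 = 16..23, %i0-%i7 = 24..31; the helpers are stand-ins for the
    spill/fill micro-op machinery, not real interfaces):

    #include <stdint.h>

    enum { OUT0 = 8, LOC0 = 16, IN0 = 24 };

    typedef struct {
        uint16_t rat[32];    /* visible architected register -> physical */
        uint16_t phys_zero;  /* physical register holding constant 0    */
    } win_rename_t;

    /* Stand-ins: queue spill/fill micro-ops against the register save
       area, and allocate a free physical register. */
    static void emit_store(uint16_t preg, int slot) { (void)preg; (void)slot; }
    static void emit_load(uint16_t preg, int slot)  { (void)preg; (void)slot; }
    static uint16_t next_free = 64;
    static uint16_t alloc_preg(void) { return next_free++; }

    /* SAVE: spill the current locals and ins, make the outs the new ins,
       and let the new locals and outs read as 0 without allocating
       physical registers for them yet. */
    static void rename_save(win_rename_t *rn)
    {
        for (int i = 0; i < 8; i++) emit_store(rn->rat[LOC0 + i], i);
        for (int i = 0; i < 8; i++) emit_store(rn->rat[IN0 + i], 8 + i);
        for (int i = 0; i < 8; i++) rn->rat[IN0 + i] = rn->rat[OUT0 + i];
        for (int i = 0; i < 8; i++) {
            rn->rat[LOC0 + i] = rn->phys_zero;
            rn->rat[OUT0 + i] = rn->phys_zero;
        }
    }

    /* RESTORE: the ins become the caller's outs again, and the caller's
       locals and ins are refilled from the save area. */
    static void rename_restore(win_rename_t *rn)
    {
        for (int i = 0; i < 8; i++) rn->rat[OUT0 + i] = rn->rat[IN0 + i];
        for (int i = 0; i < 8; i++) {
            uint16_t p = alloc_preg();
            emit_load(p, i);
            rn->rat[LOC0 + i] = p;
        }
        for (int i = 0; i < 8; i++) {
            uint16_t p = alloc_preg();
            emit_load(p, 8 + i);
            rn->rat[IN0 + i] = p;
        }
    }

    The lazier variant discussed further below would keep a couple of
    extra windows' worth of mappings and only emit the stores and loads
    when the windows are exhausted.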

    OoO tends to work fine with storing around calls and loading around
    returns in architectures without register windows, because the storing
    mostly consumes resources, but is not on the critical path, and
    likewise for the loading (the loads tend to be ready earlier than the instructions on the critical path); and store-to-load forwarding
    deals with the problem of a return shortly after a call.

    In the scheme above, these benefits would also happen, but there are
    the following problems: each save saves 16 registers, and each restore
    restores 16 registers, many more than is typical with the usual
    calling conventions. Moreover, at least gcc inserts SAVE and RETURN
    (which includes RESTORE) instructions even for leaf functions with low
    register pressure like

    int foo(int a[])
    {
    if ((a[1]^a[2]) < 0)
    return a[0];
    else
    return a[3]+1;
    }

    Clang OTOH manages to do without a save instruction in this case
    (working with the o registers and %g1..%g4), so with the right
    compiler this would be less of a problem.

    Another problem is that SPARC is specified to have at least three
    register windows. I guess this could be addressed in some way.

    In any case, if we want save to be reasonably fast on average, we may
    want to implement a few more architectural registers for register
    windows (as outlined above), and have, e.g., 2*16 invisible registers
    (for a total of 64 architectural registers), do the storing only when
    all windows are consumed and there is another SAVE, and likewise load
    only when there is a RESTORE and there is no register window that
    contains the calling context.

    After a restore, the registers of the now unused window could be
    freed, making more registers available for renaming. This would
    correspond to having only as many architectural registers as necessary
    for the currently active register windows. An effect of this idea
    would be that, depending on the save and restore patterns leading up to
    some code that needs a lot of renaming capacity, the performance of
    that code would vary, but similar effects have also been seen in other
    areas.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Nov 18 13:15:24 2025
    From Newsgroup: comp.arch

    On 11/17/2025 1:49 AM, Robert Finch wrote:
    On 2025-11-16 1:36 p.m., MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:


    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    A common set of flags is NZCV.  Of these N and Z can be generated from >>>>> the 64 ordinary bits (actually N is the MSB of these bits).

    You might also want NCZV of 32-bit instructions, but in that case all >>>>> flags are derivable from the 64 ordinary bits of the GPR; but in that >>>>> case you may need additional branch instructions: Instructions that
    check only if the bottom 32-bits are 0 (Z), if bit 31 is 1 (N), if bit >>>>> 32 is 1 (C), or if bit 32 is different from bit 31 (V).

    If you write an architectural rule whereby every integer result is
    "proper" one set of bits {top, bottom, dispersed} covers everything.

    Proper means that all the bits in the register are written but the
    value written is range limited to {Sign}×{Size} of the calculation.

    I have no idea what you mean with "one set of bits {top, bottom,
    dispersed}".

    typedef struct { uint64_t reg;
                      uint8_t  bits: 4; } gpr;
    or
    typedef struct { uint8_t  bits: 4;
                      uint64_t reg;} gpr;
    or
    typedef struct { uint16_t reg0;
                      uint8_t  bit0: 1;
                      uint16_t reg1;
                      uint8_t  bit1: 1;
                      uint16_t reg2;
                      uint8_t  bit2: 1;
                      uint16_t reg3;
                      uint8_t  bit3: 1;  } gpr;

    Did you loose every brain-cell of imagination ?!?

    As for "proper": Does this mean that one would have to have add(c),
    sub(c), mul (madd etc.), shift right and shift left (did I forget
    anything?) for i8, i16, i32, i64, u8, u16, u32, and u64?  Yes, if
    specify in the operation which kind of Z, C/V, and maybe N you are
    interested in, you do not need to specify it in the branch that checks
    that result; you also eliminate the sign-extension and zero-extension
    operations that we discussed some time ago.

    {s8, s16, s32, s64, u8, u16, u32, u64} yes.
    But given that the operations are much more frequent than branches,
    encoding that information in the branches uses less space (for shift
    right, the sign is usually included in the operation).  It's

    Which is why I don't have ANY of those extra bits.

    interesting that AFAIK there are instruction sets (e.g., Power) that
    just have one full-width sign-agnostic add, and do not have
    width-specific flags, either.  So when compiling stuff like

    if (a[1]+a[2] == 0) /* unsigned a[] */

    a width-specific compare instruction provides that information.  But
    gcc generates a compare instruction even when a[] is "unsigned long",
    so apparently add does not set the flags on addition anyway (and if
    there is an add that sets flags, it is not used by gcc for this code).

    Another case is SPARC v9, which tends to set flags.  For

       if ((a[1]^a[2]) < 0)

    I see:

    long a[]                      int a[]
    ldx  [ %i0 + 8 ], %g1         ld  [ %i0 + 4 ], %g2
    ldx  [ %i0 + 0x10 ], %g2      ld  [ %i0 + 8 ], %g1
    xor  %g1, %g2, %g1            xorcc  %g2, %g1, %g0
    brlz,pn   %g1, 24 <foo+0x24>  bl,a,pn   %icc, 20 <foo+0x20>

    Reading up on SPARC v9, it has two sets of condition codes: 32-bit
    (icc) and 64-bit (xcc), and every instruction that sets condition
    codes (e.g., xorcc) sets both.

    Another reason its death is helpful to comp.arch

                                     In the present case, the 32-bit
    sequence sets the ccs and then checks icc, while the 64-bit sequence
    does not set the ccs, and instead uses a branch instruction that
    inspects an integer register (%g1).  These branch instructions all
    work for the full 64 bits, and do not provide a way to check a 32-bit
    result.  In the present case, an alternate way to use brlz for the
    32-bit case would have been:

    ldsw  [ %i0 + 8 ], %g1       #ld is a synonym for lduw
    ldsw  [ %i0 + 0x10 ], %g2
    xor  %g1, %g2, %g1
    brlz,pn   %g1, 24 <foo+0x24>

    because the xor of two sign-extended data is also a correct
    sign-extended result, but instread gcc chose to use xorcc and bl %icc.

    There are many ways to skin this cat.

    Sure:: close to 20-ways, less than 4 of them are "proper".
    Concerning saving the extra bits across interrupts, yes, this has to >>>>> be adapted to the actual architecture, and there are many ways to skin >>>>> this cat.  I just outlined one to give an idea how this can be done. >>>>
    On the other hand, with CARRY, none of those bits are needed.

    But the mechanism of CARRY is quite a bit more involved: Either store
    the carry in a GPR at every step, or have another mechanism inside a
    CARRY block.  And either make the CARRY block atomic or have some way
    to preserve the fact that there is this prefix across interrupts and
    (worse) synchronous traps.

    During its "life" the bits used in CARRY are simply another feedback
    path on the data-path. Afterwards, carry is written once. CARRY also
    gets written when an exception is taken.


    - anton

    These posts have inspired me to keep working on the ISA. I am on a simplification mission.

    The CARRY modifier is just a substitute for not having r3w2 port instructions directly in the ISA. Since Qupls ISA has room to support
    some r3w2 instructions directly there is no need for CARRY, much as I
    like the idea.

    While not using a carry flag in the register, there is still a
    capabilities bit, overflow bit and pointer bit plus four user assigned
    bits. I decided to just have 72-bit register store and load instructions along with the usual 8,16,32 and 64.

    Finding it too difficult to support 128-bit operations using high, low register pairs. Getting the reservation stations to pair up the
    registers seems a bit scary. It would be much simpler to just have 128-
    bit registers and it appears as if it may not be any more logic. The
    benefit of using register pairs is the internal busses need only be 64-
    bits then.


    I went with pairs, but I guess maybe pairs are a lot easier for in-order
    than OoO.

    Sparc v9 died?


    Pretty sure SPARC is good and dead at this point...

    Many others in this space are not far behind.

    Basically, anything remaining needs to compete against ARM and RISC-V
    (the latter of which is making an unexpectedly rapid rise in mind-share and prominence...).


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Nov 18 13:22:44 2025
    From Newsgroup: comp.arch

    On 11/17/2025 12:41 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-16 1:36 p.m., MitchAlsup wrote:
    -------------------------------

    During its "life" the bits used in CARRY are simply another feedback
    path on the data-path. Afterwards, carry is written once. CARRY also
    gets written when an exception is taken.


    - anton

    These posts have inspired me to keep working on the ISA. I am on a
    simplification mission.

    The CARRY modifier is just a substitute for not having r3w2 port
    instructions directly in the ISA. Since Qupls ISA has room to support
    some r3w2 instructions directly there is no need for CARRY, much as I
    like the idea.

    That is correct at the 95% level.

    While not using a carry flag in the register, there is still a
    capabilities bit, overflow bit and pointer bit plus four user assigned
    bits. I decided to just have 72-bit register store and load instructions
    along with the usual 8,16,32 and 64.

    Finding it too difficult to support 128-bit operations using high, low
    register pairs. Getting the reservation stations to pair up the
    registers seems a bit scary.

    It IS scary and hard and tricky to get right.

    It would be much simpler to just have
    128-bit registers and it appears as if it may not be any more logic. The
    benefit of using register pairs is the internal busses need only be
    64-bits then.

    Almost exactly what we did in Mc 88120 when facing the same problem.
    Except we kept the 32-bit model and had register files 2 registers
    tall {even, odd},{odd even} so any register specifier would simply
    read out the status and values of both registers and then let the
    stations handle the insundry problems.


    I had actually considered this as a possible implementation strategy
    in the past.

    Either way, strict even+odd pairing does mean that it is possible to
    treat things either as 64 or 128 bit registers internally, except that
    the 64-bit case would still need to be able to operate with
    independently addressable registers (a 3R1W 128-bit regfile can't
    directly mimic a 6R2W or similar).

    One possibility here is that for register pairs, if it functions as a
    128-bit access, one of the 64-bit ID's is effectively ignored/disabled,
    and any OoO magic would mark both registers in the pair as unavailable.
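
    A tiny sketch of that idea (scoreboard-flavoured and purely
    illustrative; an OoO design would track the same thing per physical
    register):

    #include <stdint.h>
    #include <stdbool.h>

    /* One busy bit per 64-bit logical register.  A 128-bit destination
       occupies the strict even/odd pair (r, r|1), so both bits get set and
       both must clear before a dependent instruction may issue. */
    static uint64_t busy;   /* bit i set = register i has a pending write */

    static void mark_dest(int reg, bool is128)
    {
        busy |= 1ull << reg;
        if (is128)
            busy |= 1ull << (reg | 1);
    }

    static bool operand_ready(int reg, bool is128)
    {
        uint64_t mask = 1ull << reg;
        if (is128)
            mask |= 1ull << (reg | 1);
        return (busy & mask) == 0;
    }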

    But, alas, never implemented an OoO CPU, so I don't really know here.


    Sparc v9 died?

    What was the last year SPARC sold more than 100,000 CPUs ??

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Nov 18 19:28:29 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> schrieb:

    Pretty sure SPARC is good and dead at this point...

    Almost, but not quite. I still have a login on a couple of SPARC
    machines:

    $ uname -a
    SunOS s11-sparc.cfarm 5.11 11.4.86.201.2 sun4v sparc sun4v logical-domain
    $ kstat -p cpu_info | head
    cpu_info:0:cpu_info0:brand SPARC-M8
    cpu_info:0:cpu_info0:chip_id 0
    cpu_info:0:cpu_info0:class misc
    cpu_info:0:cpu_info0:clock_MHz 5067
    cpu_info:0:cpu_info0:core_id 8
    cpu_info:0:cpu_info0:cpu_fru hc:///component=
    cpu_info:0:cpu_info0:cpu_type sparcv9
    cpu_info:0:cpu_info0:crtime 12619319,2018106
    cpu_info:0:cpu_info0:cstates_count 0:0
    cpu_info:0:cpu_info0:cstates_nsec 11963950024050:12619342341210000

    Many others in this space are not far behind.

    Basically, anything remaining needs to compete against ARM and RISC-V
    (the latter of which making an unexpectedly rapid rise in mind-share and prominence...).

    Power's not dead, either, if very highly priced. MIPS is still
    being sold, apparently. Then there's Loongarch. As for RISC-V,
    I am not sure how much business they actually generate compared
    to others.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Nov 18 22:25:24 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    BGB <cr88192@gmail.com> schrieb:

    Pretty sure SPARC is good and dead at this point...

    Almost, but not quite. I still have login on a couple of SPARC
    machines:

    My doctor told me that he had given my prostate enough x-ray radiation
    to kill the prostate cancer, but I still had to take medicine because
    the cancer cells had not actually died yet (for 2 more months).

    SPARC has been killed, but is not quite dead.

    A fine line indeed.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Nov 18 20:26:25 2025
    From Newsgroup: comp.arch

    On 2025-11-18 2:15 p.m., BGB wrote:
    On 11/17/2025 1:49 AM, Robert Finch wrote:
    On 2025-11-16 1:36 p.m., MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:


    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    A common set of flags is NZCV.  Of these N and Z can be generated >>>>>> from
    the 64 ordinary bits (actually N is the MSB of these bits).

    You might also want NCZV of 32-bit instructions, but in that case all >>>>>> flags are derivable from the 64 ordinary bits of the GPR; but in that >>>>>> case you may need additional branch instructions: Instructions that >>>>>> check only if the bottom 32-bits are 0 (Z), if bit 31 is 1 (N), if >>>>>> bit
    32 is 1 (C), or if bit 32 is different from bit 31 (V).

    If you write an architectural rule whereby every integer result is
    "proper" one set of bits {top, bottom, dispersed} covers everything. >>>>>
    Proper means that all the bits in the register are written but the
    value written is range limited to {Sign}×{Size} of the calculation. >>>>
    I have no idea what you mean with "one set of bits {top, bottom,
    dispersed}".

    typedef struct { uint64_t reg;
                      uint8_t  bits: 4; } gpr;
    or
    typedef struct { uint8_t  bits: 4;
                      uint64_t reg;} gpr;
    or
    typedef struct { uint16_t reg0;
                      uint8_t  bit0: 1;
                      uint16_t reg1;
                      uint8_t  bit1: 1;
                      uint16_t reg2;
                      uint8_t  bit2: 1;
                      uint16_t reg3;
                      uint8_t  bit3: 1;  } gpr;

    Did you loose every brain-cell of imagination ?!?

    As for "proper": Does this mean that one would have to have add(c),
    sub(c), mul (madd etc.), shift right and shift left (did I forget
    anything?) for i8, i16, i32, i64, u8, u16, u32, and u64?  Yes, if
    specify in the operation which kind of Z, C/V, and maybe N you are
    interested in, you do not need to specify it in the branch that checks >>>> that result; you also eliminate the sign-extension and zero-extension
    operations that we discussed some time ago.

    {s8, s16, s32, s64, u8, u16, u32, u64} yes.
    But given that the operations are much more frequent than branches,
    encoding that information in the branches uses less space (for shift
    right, the sign is usually included in the operation).  It's

    Which is why I don't have ANY of those extra bits.

    interesting that AFAIK there are instruction sets (e.g., Power) that
    just have one full-width sign-agnostic add, and do not have
    width-specific flags, either.  So when compiling stuff like

    if (a[1]+a[2] == 0) /* unsigned a[] */

    a width-specific compare instruction provides that information.  But
    gcc generates a compare instruction even when a[] is "unsigned long",
    so apparently add does not set the flags on addition anyway (and if
    there is an add that sets flags, it is not used by gcc for this code). >>>>
    Another case is SPARC v9, which tends to set flags.  For

       if ((a[1]^a[2]) < 0)

    I see:

    long a[]                      int a[]
    ldx  [ %i0 + 8 ], %g1         ld  [ %i0 + 4 ], %g2
    ldx  [ %i0 + 0x10 ], %g2      ld  [ %i0 + 8 ], %g1
    xor  %g1, %g2, %g1            xorcc  %g2, %g1, %g0
    brlz,pn   %g1, 24 <foo+0x24>  bl,a,pn   %icc, 20 <foo+0x20>

    Reading up on SPARC v9, it has two sets of condition codes: 32-bit
    (icc) and 64-bit (xcc), and every instruction that sets condition
    codes (e.g., xorcc) sets both.

    Another reason its death is helpful to comp.arch

                                     In the present case, the 32-bit
    sequence sets the ccs and then checks icc, while the 64-bit sequence
    does not set the ccs, and instead uses a branch instruction that
    inspects an integer register (%g1).  These branch instructions all
    work for the full 64 bits, and do not provide a way to check a 32-bit
    result.  In the present case, an alternate way to use brlz for the
    32-bit case would have been:

    ldsw  [ %i0 + 8 ], %g1       #ld is a synonym for lduw
    ldsw  [ %i0 + 0x10 ], %g2
    xor  %g1, %g2, %g1
    brlz,pn   %g1, 24 <foo+0x24>

    because the xor of two sign-extended data is also a correct
    sign-extended result, but instread gcc chose to use xorcc and bl %icc. >>>>
    There are many ways to skin this cat.

    Sure:: close to 20-ways, less than 4 of them are "proper".
    Concerning saving the extra bits across interrupts, yes, this has to >>>>>> be adapted to the actual architecture, and there are many ways to >>>>>> skin
    this cat.  I just outlined one to give an idea how this can be done. >>>>>
    On the other hand, with CARRY, none of those bits are needed.

    But the mechanism of CARRY is quite a bit more involved: Either store
    the carry in a GPR at every step, or have another mechanism inside a
    CARRY block.  And either make the CARRY block atomic or have some way >>>> to preserve the fact that there is this prefix across interrupts and
    (worse) synchronous traps.

    During its "life" the bits used in CARRY are simply another feedback
    path on the data-path. Afterwards, carry is written once. CARRY also
    gets written when an exception is taken.


    - anton

    These posts have inspired me to keep working on the ISA. I am on a
    simplification mission.

    The CARRY modifier is just a substitute for not having r3w2 port
    instructions directly in the ISA. Since Qupls ISA has room to support
    some r3w2 instructions directly there is no need for CARRY, much as I
    like the idea.

    While not using a carry flag in the register, there is still a
    capabilities bit, overflow bit and pointer bit plus four user assigned
    bits. I decided to just have 72-bit register store and load
    instructions along with the usual 8,16,32 and 64.

    Finding it too difficult to support 128-bit operations using high, low
    register pairs. Getting the reservation stations to pair up the
    registers seems a bit scary. It would be much simpler to just have
    128- bit registers and it appears as if it may not be any more logic.
    The benefit of using register pairs is the internal busses need only
    be 64- bits then.


    I went with pairs, but I guess maybe pairs are a lot easier for in-order than OoO.

    I have gone with quads now. They are faked by translating one ISA
    instruction into four micro-ops doing 64-bit ops. It could work with
    pairs too in the same manner. The number of registers was upped to 128
    so there can be 32 x 256-bit SIMD registers.

    Shelved the 128-bit ops for now.

    Sparc v9 died?


    Pretty sure SPARC is good and dead at this point...

    Many others in this space are not far behind.

    Basically, anything remaining needs to compete against ARM and RISC-V
    (the latter of which making an unexpectedly rapid rise in mind-share and prominence...).


    I am still waiting to see what else shows up.

    Is the need for backwards compatibility killing things off as
    technology has improved? There seem to be a lot more known good/bad
    approaches now, making me think that the lifetime of newer designs
    could be longer.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 19 01:47:26 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Is the need for backwards compatibility killing things as technology has improved?

    No with respect to::
    Little Endian
    IEEE 754 floating point
    Byte addressable memory
    Misaligned memory
    PCIe peripheral access
    CXL interconnect
    CXL added memory
    CXL added cache
    access to Linux
    access to gnu
    access to LLVM
    access to qemu
    access to gem5
    numerical libraries/packages

    yes with respect to::
    x86 condition codes
    x86 shift by 0
    x86 descriptor tables
    4096 byte pages
    long latency exception/interrupt control transfer
    need source to port application
    SIMD considered harmful
    ATOMIC activities
    Exception walk-back across block structure
    Signal/exception delivery
    language evolution
    environment evolution

    There seems to be a lot more known good/bad approaches making
    me think that the lifetime of newer designs could be longer.

    Yes, but the people making the decisions are still too young to have
    the history needed to make better decisions.

    The graduates of major universities go right out and start designing
    without being exposed to "enough" of the disease of computer architecture
    to be in a position to understand why feature.X of arch.Y was bad overall,
    or why feature.X of architecture.Y was not enough to save it.

    Each generation reaches employment after university at about the same
    level as we did when we invented RISC.

    Architecture is only 1/3rd ISA--and it is the other 2/3rds where the
    {trouble or success} lies (85% confidence level).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Nov 19 07:47:12 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
    environment evolution

    There seems to be a lot more known good/bad approaches making
    me think that the lifetime of newer designs could be longer.

    Yes, but the people making the decisions are still to young to have
    the history needed to make better decisions.

    The graduates of major universities go right out and start designing
    without being exposed to "enough" of the disease of computer architecture
    to be in a position to understand why feature.X of arch.Y was bad overall,
    or why feature.X of architecture.Y was not enough to save it.

    Each generation reaches employment after university at about the same
    level as we did when we invented RISC.

    I recently heard that CS graduates from ETH Zürich had heard about
    pipelines, but thought it was fetch-decode-execute.

    They also did not know about DEC or the VAX. Sic transit gloria
    mundi... Apparently, the most ancient computer history they heard
    about was Nehalem.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 19 12:53:35 2025
    From Newsgroup: comp.arch

    On 11/13/2025 9:59 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 11/13/2025 3:58 PM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    Can note that GCC seemingly doesn't support 128-bit integers on 64-bit >>>> RISC-V.

    What makes you think so? It has certainly worked every time I tried
    it. E.g., Gforth's "configure" reports:

    checking size of __int128_t... 16
    checking size of __uint128_t... 16
    [...]
    checking for a C type for double-cells... __int128_t
    checking for a C type for unsigned double-cells... __uint128_t

    That's with gcc 10.3.1


    Hmm...

    Seems so.

    Testing again, it does appear to work; the error message I thought I
    remembered seeing actually applied to trying to use the type in MSVC.
    I had thought I remembered checking before and it failing, but it
    seems not.

    But, yeah, good to know I guess.


    As for MSVC:
    tst_int128.c(5): error C4235: nonstandard extension used: '__int128'
    keyword not supported on this architecture

    ERRRRRRR:: not supported by this compiler, the architecture has
    ISA level support for doing this, but the compiler does not allow
    you access.

    More or less it seems.


    This leaves, apparently:
    MSVC: Maybe once had it for IA-64, but nowhere else;
    GCC: Supported, but glibc lacks a printf length modifier for it (see the snippet below).
    Clang: Supported, but lacks support for 128-bit integer literals?...
    BGBCC: Supported, with literals and 'I128' printf modifier.
    Where, 'I128' is similar to 'I64' in MSVC,
    as for a long time they also lacked the 'll' modifier and similar.

    ISA's:
    X64: Can build manually via register pairs (any two registers), ADD+ADC
    allows for 128-bit adds in 2 instructions (see the C sketch below);
    Many 128-bit ops can be built using flags bits;
    ISA supports widening multiply and narrowing divide, though typically
    with hardwired registers.

    XG1/XG2:
    CLRT+ADDC+ADDC
    Theoretically arbitrary, BGBCC only uses even pairs;
    CLRT needed to clear the SR.T flag;
    Normal ADD does not modify SR.T.
    Could maybe be better if there were a 3R ADDC variant,
    and maybe a carry-out only variant (so no CLRT was needed).
    ADDX
    Even pairs only, single instruction.

    XG3:
    Support for SR.T was demoted to optional,
    half the encoding space goes unused if predication isn't used though.
    Could fit a "better" RV-C in there (*1).
    ALUX instructions could be used, also optional.
    Otherwise, it is left in a similar situation to RISC-V here.
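
    As a rough illustration of what the ADD+ADC (x86-64) or CLRT+ADDC+ADDC
    (XG1/XG2) sequences above compute, a minimal C sketch of a 128-bit add
    built from two 64-bit halves plus an explicit carry (the type and
    function names here are illustrative, not from any of the compilers
    discussed):

    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } u128;   /* a register pair, in effect */

    static u128 u128_add(u128 a, u128 b)
    {
        u128 r;
        r.lo = a.lo + b.lo;
        /* carry out of the low half: the sum wrapped below either input */
        uint64_t carry = (r.lo < a.lo);
        r.hi = a.hi + b.hi + carry;             /* the ADC / ADDC step */
        return r;
    }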

    *1: Noted before that if one tweaks the design of RV-C some:
    Makes Imm/Disp fields smaller;
    Replaces Reg3 with Reg4 (X8..X27);
    ...
    It is possible to get a set of 16-bit ops that both use less encoding
    space and get a better average hit rate than the existing RV-C ops
    (mostly by not trying to do Imm6/Disp6 in said ops; and only using Reg5
    on a few instructions).

    However, IMO, it makes more sense to support RV-C for binary
    compatibility than on the merits of its encoding scheme (which is kind
    of a turd).

    However, "XG3 sub-variant that drops predicated encodings in favor of re-adding a new/different set of 16-bit encodings" was not a
    particularly attractive option.


    For where it makes sense to use XG3 though, likely it makes sense to
    allow/use SR.T and the predicated encodings, which can still offer a
    small but non-zero performance benefit (even if debatable if it is
    something that is worth spending half of the encoding space on).

    I did also experiment with allowing a few blocks to be used for
    pair-encoded ops. One other possibility could be some additional
    unconditional-only instruction blocks (but, these would be N/E in
    XG1/XG2).



    One possibility could also be an "XG3 Lite" subset:
    Likely unconditional only, and also disallows RISC-V encodings.

    Or, IOW:
    ...xx00 Disallowed
    ...xx01 Disallowed
    ...xx10 Allowed
    ...xx11 Disallowed

    Could maybe make sense if I wanted a core on a smaller FPGA.

    However, there isn't that much incentive to go for much smaller than the XC7S50 with this, and for current use-cases that could involve an XC7S25
    or XC7A35T, you kinda really want to try to maximize code density
    (mostly because the currently available dev-boards with these FPGAs tend
    to lack external RAM).

    The Intel/Altera chips tend to always have integrated ARM cores;
    Boards with Lattice FPGAs (probably ECP5 or similar in this case, *)
    tend to be obscure and overpriced (even if theoretically the FPGAs
    themselves are cheaper).

    *: One is harder pressed to make a non-trivial CPU core that fits into
    an ICE40.


    Though, one other possibility being trying to again implement dual-core
    on an XC7A100T, but possibly sharing FPU and SIMD between the cores (may
    or may not be viable).

    In this case, there would be a mechanism such that inter-core interlocks
    could trigger to disallow both cores trying to access the FPU or SIMD
    unit on the same clock-cycle. Though unclear how this could interact
    with pipeline stalls (would ideally want both cores to have independent pipelines; but then one needs to arbitrate things such that both units
    get their results at the expected clock cycle, ...).

    Though, to that end, may also make sense to consider going to a
    dual-issue superscalar with 4R2W register file.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 20 07:33:36 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Power's not dead, either, if very highly priced.

    New Power CPUs and machines based on them are released regularly. I
    think there is enough business in the iSeries (or whatever its current
    name) to produce enough money for the costs of that development.
    pSeries benefits from that. I guess that the profits from that are
    enough to finance the development of the pSeries machines, but can
    contribute little to finance the development of the CPUs.

    MIPS is still
    being sold, apparently.

    From <https://en.wikipedia.org/wiki/MIPS_architecture>:
    |In March 2021, MIPS announced that the development of the MIPS
    |architecture had ended as the company is making the transition to
    |RISC-V.

    So it's the same status as SPARC. They may be selling to existing
    customers, but nobody sane will use MIPS for a new project.

    As for RISC-V,
    I am not sure how much business they actually generate compared
    to others.

    I think a lot of embedded RISC-Vs are used, e.g., in WD (and now
    Sandisk) HDDs and SSDs; so you can look at the business reports of WD
    if you want to know how much business they make. As for things you
    can actually program, there are a number of SBCs on sale (and we have
    one), from the Raspi Pico 2 (where you apparently can use either
    ARMv8-M (i.e., ARM T32) or RISC-V (probably some RV32 variant)) up to
    stuff like the Visionfive V2, several Chinese offerings, and some
    Hifive SBCs. The latter are not yet competitive in CPU performance
    with the like of RK3588-based SBCs or the Raspi 5, so I expect the
    main reason for buying them is to try out RISC-V (we have a Visionfive
    V1 for that purpose); still, the fact that there are several offerings indicates that there is nonnegligible revenue there.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 20 07:55:48 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:
    Is the need for backwards compatibility killing things as technology has
    improved?

    That is certainly the usual complaint by engineers who are hindered in
    doing what they would otherwise like to do by backwards compatibility
    requirements. It's certainly easier to design on a clean slate. OTOH
    not all of the ideas that are prevented by backwards compatibility
    requirements are good ideas.

    Overall, as I mentioned in this thread, there is architectural
    progress, in some cases (e.g., the establishment of 8/16/32/64-bit
    machines) in ways that are not backwards-compatible. So backwards compatibility is not preventing all progress.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 20 08:05:53 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    I recently heard that CS graduates from ETH Zürich had heard about
    pipelines, but thought it was fetch-decode-execute.

    Why would a CS graduate need to know about pipelines?

    They also did not know about DEC or the VAX. Sic transit gloria
    mundi...

    Yes, a few years ago I asked some students, among them an older one
    who was interested in some older technologies, whether they had heard
    of the VAX. None had. It seems that VAX was big in the 80s, but it
    then vanished from the radar of the computer-interested public. So
    anybody who became interested in computers only afterwards is unlikely
    to have heard of the VAX, unless they are into retrocomputing or read
    old debates about the advantages of RISC.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Nov 21 15:31:49 2025
    From Newsgroup: comp.arch

    On Thu, 13 Nov 2025 19:04:18 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    Not really.
    That is, conversions are not blazingly fast, but still much better
    than any attempt to divide in any form of decimal. And helps to
    preserve your sanity.

    Are you trying to pull our proverbial leg here ?!?


    After reading paragraph 5.2 of the IEEE-754-2008 Standard I am less
    sure of the correctness of my statement above.
    For the case of exact division, preserving one's sanity while
    fulfilling the requirements of that paragraph is far from simple,
    regardless of the numeric base used in the process.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Nov 21 13:36:05 2025
    From Newsgroup: comp.arch

    On 11/21/2025 7:31 AM, Michael S wrote:
    On Thu, 13 Nov 2025 19:04:18 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    Not really.
    That is, conversions are not blazingly fast, but still much better
    than any attempt to divide in any form of decimal. And helps to
    preserve your sanity.

    Are you trying to pull our proverbial leg here ?!?


    After reading paragraph 5.2 of IEEE-754-2008 Standard I am less sure in correctness of my above statement.
    For the case of exact division, preservation of mental sanity during fulfillment of requirements of this paragraph is far from simple,
    regardless of numeric base used in the process.


    One effectively needs to do a special extra-wide divide rather than just
    a normal integer divide, etc.


    But, yeah, fastest I had gotten in my experiments was radix-10e9 long-division, but still not the fastest option.

    So, rough ranking, fast to slow:
    Radix-10e9 Long Divide (fastest)
    Newton-Raphson
    Radix-10 Long Divide
    Integer Shift-Subtract with converters (slowest).
    Fastest converter strategy ATM:
    Radix-10e9 double-dabble (Int->Dec).
    MUL-by-10e9 and ADD (Dec->Int)
    Fastest strategy: Unrolled Shifts and ADDs (*1).


    *1: While it is possible to perform a 128-bit multiply by decomposing
    it into 32-bit partial products and adding them together, it was
    working out slightly faster in this case to do a fixed multiply by
    decomposing it into a series of explicit shifts and ADDs.

    Though, in this case, it is faster (and less ugly) to decompose this
    into a pattern of iteratively multiplying by smaller amounts. I had
    ended up using 4x multiply by 100 followed by a multiply by 10, which,
    while not the fastest strategy, needs less code than 2x multiply by
    10000 + multiply by 10. Most other patterns would need more shifts and
    adds.

    In theory, x86-64 could do it better with multiply ops, but getting
    something optimal out of the C compilers is a bigger issue here it seems.


    Unexplored options:
    Radix 10e2 (byte)
    Radix 10e3 (word)
    Radix 10e4 (word)

    Radix 10e3 could have the closest to direct mapping to DPD.


    Looking at the decNumber code, it appears also to be Radix-10e9 based.
    They also do significant (ab)use of the C preprocessor.

    Apparently, "Why use functions when you can use macros?"...


    For the Radix-10e9 long-divide, part of the magic was in the function to
    scale a value by a radix value and subtract it from another array.

    Ended up trying a few options, fastest was to temporarily turn the
    operation into non-normalized 64-bit pieces and then normalize the
    result (borrow propagation, etc) as an output step.

    Initial attempt kept it normalized within the operation, which was slower.

    It was seemingly compiler-dependent whether it was faster to do a
    combined operation, or separate scale and subtract, but the margins were small. On MSVC the combined operation was slightly faster than the
    separate operations.
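
    For what it's worth, a minimal sketch of such a scale-and-subtract step
    with deferred borrow normalization, assuming little-endian radix-1e9
    digits held in uint32_t and a trial quotient digit q < 10^9 (names,
    layout, and the digit-count limit are illustrative, not the actual
    code):

    #include <stdint.h>

    #define BASE 1000000000u   /* radix 10^9, one "digit" per uint32_t */

    /* a[0..n-1] -= q * b[0..n-1], digits little-endian.
       Per-digit results are kept as non-normalized signed 64-bit pieces,
       with the borrows only propagated in a final normalization pass. */
    static void scale_sub_1e9(uint32_t *a, const uint32_t *b, int n, uint32_t q)
    {
        int64_t tmp[64];                    /* assume n <= 64 for this sketch */
        for (int i = 0; i < n; i++)
            tmp[i] = (int64_t)a[i] - (int64_t)q * b[i];   /* may go negative */

        int64_t carry = 0;                  /* borrow-propagation pass */
        for (int i = 0; i < n; i++) {
            int64_t cur = tmp[i] + carry;
            int64_t d   = cur % (int64_t)BASE;
            carry       = cur / (int64_t)BASE;
            if (d < 0) { d += BASE; carry -= 1; }   /* force 0 <= d < BASE */
            a[i] = (uint32_t)d;
        }
        /* a nonzero final borrow would mean q was too large (caller adjusts) */
    }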

    ...



    Otherwise, after this, just went and fiddled with BGBCC some more,
    adding more options for its resource converter.

    Had before (for image formats):
    In: TGA, BMP (various), PNG, QOI, UPIC
    Out: BMP (various), QOI, UPIC

    Added (now):
    In: PPM, JPG, DDS
    Out: PNG, JPG, DDS (DXT1 and DXT5)

    Considered (not added yet):
    PCX
    Evaluated PCX, possible but not a clear win.


    Fiddled with making the PNG encoder less slow, mostly this was tweaking
    some parameters for the LZ searches. Initial settings were using deeper searches over initially smaller sliding windows (at lower compression
    levels); better in this case to do a shallower search over a max-sized
    sliding window.

    ATM, speed of PNG is now on-par with the JPG encoder (still one of the
    slower options).

    For simple use-cases, PNG still loses (in terms of both speed and
    compression) to 16-color BMP + LZ compression (LZ4 or RP2).
    Theoretically, indexed-color PNG exists, but is less widely supported.

    It is less space-efficient to represent 16 colors as Deflate-compressed
    color differences than it is to just represent the 4-bit RGBI values
    directly.

    However, can note that the RLE compression scheme (used by PCX) is
    clearly inferior to any sort of LZ compression.


    PNG is also a more expensive format to decode (even vs JPEG).


    UPIC can partly address the use-cases of both PNG and JPEG while being
    cheaper to decode than either, but more niche as pretty much nothing
    supports it. Some of its design and properties being mostly JPEG-like.

    QOI is interesting, but suffers some similar limitations to PCX (its
    design is mostly about more compactly encoding color-differences in
    true-color images and otherwise only offers RLE compression).

    QOI is not particularly effective on images with little color variation
    but lots of repeating patterns (I have a modified QOI that does a
    little better here, though it is still not particularly effective with
    16-color graphics).


    Otherwise, also ended up adding a small text format for image drawing
    commands.

    As a simplistic line oriented format containing various commands to
    perform drawing operations or composite images.
    creating a "canvas"
    setting the working color
    drawing lines
    bucket fill
    drawing text strings
    overlaying other images
    ...


    This is maybe (debatable) outside the scope of a C compiler, but could
    have use-cases for preparing resource data (nevermind if scope creep is
    partly also turning it into an asset-packer tool; where it is useful to
    make graphics/sounds/etc in one set of formats and then process and
    convert them into another set of files, usually inside of some sort of
    VFS image or similar).

    Design is much more simplistic than something like SVG and I am
    currently assuming its use for mostly hand-edited files. Unlike SVG, it
    also assumes drawing to a pixel grid rather than some more abstract
    coordinate space (so, its abstract model is more like "MS Paint" or
    similar); also SVG would suck as a human-edited format.

    Granted, one could argue that it might make more sense for
    asset-processing to be its own tool; one would then convert the output
    to a format that the compiler accepts (WAD2 or WAD4 in this case) prior
    to compiling the main binary (and/or not use resource data).

    Still, IMO, an internal WAD image is better than the horrid/unusable
    mess that Windows had used (where nowadays most people don't use the
    resource section for much more than storing a program icon or
    similar...).

    But, realistically, one does still want to limit how much data they
    stick into the EXE.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Nov 21 22:09:00 2025
    From Newsgroup: comp.arch

    On 2025-11-21 2:36 p.m., BGB wrote:
    On 11/21/2025 7:31 AM, Michael S wrote:
    On Thu, 13 Nov 2025 19:04:18 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    Not really.
    That is, conversions are not blazingly fast, but still much better
    than any attempt to divide in any form of decimal. And helps to
    preserve your sanity.

    Are you trying to pull our proverbial leg here ?!?


    After reading paragraph 5.2 of IEEE-754-2008 Standard I am less sure in
    correctness of my above statement.
    For the case of exact division, preservation of mental sanity during
    fulfillment of requirements of this paragraph is far from simple,
    regardless of numeric base used in the process.


    One effectively needs to do a special extra-wide divide rather than just
    a normal integer divide, etc.


    But, yeah, fastest I had gotten in my experiments was radix-10e9 long- division, but still not the fastest option.

    So, rough ranking, fast to slow:
      Radix-10e9 Long Divide (fastest)
      Newton-Raphson
      Radix-10 Long Divide
      Integer Shift-Subtract with converters (slowest).
        Fastest converter strategy ATM:
          Radix-10e9 double-dabble (Int->Dec).
          MUL-by-10e9 and ADD (Dec->Int)
            Fastest strategy: Unrolled Shifts and ADDs (*1).


    *1: While it is possible to perform a 128-bit multiply decomposing into multiplying 32-bit parts and adding them together; it was working out slightly faster in this case to do a fixed multiply by decomposing it
    into a series of explicit shifts and ADDs.

    Though, in this case, it is faster (and less ugly) to decompose this
    into a pattern of iteratively multiplying by smaller amounts. I had
    ended up using 4x multiply by 100 followed by multiply by 10, as while
    not the fastest strategy, needs less code than 2x multiply by 10000 + multiply by 10. Most other patterns would need more shifts and adds.

    In theory, x86-64 could do it better with multiply ops, but getting something optimal out of the C compilers is a bigger issue here it seems.


    Unexplored options:
      Radix 10e2 (byte)
      Radix 10e3 (word)
      Radix 10e4 (word)

    Radix 10e3 could have the closest to direct mapping to DPD.


    Looking at the decNumber code, it appears also to be Radix-10e9 based.
    They also do significant (ab)use of the C preprocessor.

    Apparently, "Why use functions when you can use macros?"...


    For the Radix-10e9 long-divide, part of the magic was in the function to scale a value by a radix value and subtract it from another array.

    Ended up trying a few options, fastest was to temporarily turn the
    operation into non-normalized 64-bit pieces and then normalize the
    result (borrow propagation, etc) as an output step.

    Initial attempt kept it normalized within the operation, which was slower.

    It was seemingly compiler-dependent whether it was faster to do a
    combined operation, or separate scale and subtract, but the margins were small. On MSVC the combined operation was slightly faster than the
    separate operations.

    ...



    Otherwise, after this, just went and fiddled with BGBCC some more,
    adding more options for its resource converter.

    Had before (for image formats):
      In: TGA, BMP (various), PNG, QOI, UPIC
      Out: BMP (various), QOI, UPIC

    Added (now):
      In: PPM, JPG, DDS
      Out: PNG, JPG, DDS (DXT1 and DXT5)

    Considered (not added yet):
      PCX
    Evaluated PCX, possible but not a clear win.


    Fiddled with making the PNG encoder less slow, mostly this was tweaking
    some parameters for the LZ searches. Initial settings were using deeper searches over initially smaller sliding windows (at lower compression levels); better in this case to do a shallower search over a max-sized sliding window.

    ATM, speed of PNG is now on-par with the JPG encoder (still one of the slower options).

    For simple use-cases, PNG still loses (in terms of both speed and compression) to 16-color BMP + LZ compression (LZ4 or RP2).
    Theoretically, indexed-color PNG exists, but is less widely supported.

    It is less space-efficient to represent 16-colors as Deflate-compressed color differences than it is to just represent the 4-bit RGBI values directly.

    However, can note that the RLE compression scheme (used by PCX) is
    clearly inferior to that of any sort of LZ compression.


    Comparably, PNG is also a more expensive format to decode as well (even
    vs JPEG).


    UPIC can partly address the use-cases of both PNG and JPEG while being cheaper to decode than either, but more niche as pretty much nothing supports it. Some of its design and properties being mostly JPEG-like.

    QOI is interesting, but suffers some similar limitations to PCX (its
    design is mostly about more compactly encoding color-differences in true-color images and otherwise only offers RLE compression).

    QOI is not particularly effective against images with little variety in color variation but lots of repeating patterns (I have a modified QOI
    that does a little better here, still not particularly effective with 16-color graphics though).


    Otherwise, also added up adding a small text format for image drawing commands.

    As a simplistic line oriented format containing various commands to
    perform drawing operations or composite images.
      creating a "canvas"
      setting the working color
      drawing lines
      bucket fill
      drawing text strings
      overlaying other images
      ...


    This is maybe (debatable) outside the scope of a C compiler, but could
    have use-cases for preparing resource data (nevermind if scope creep is partly also turning it into an asset-packer tool; where it is useful to
    make graphics/sounds/etc in one set of formats and then process and
    convert them into another set of files, usually inside of some sort of
    VFS image or similar).

    Design is much more simplistic than something like SVG and I am
    currently assuming its use for mostly hand-edited files. Unlike SVG, it
    also assumes drawing to a pixel grid rather than some more abstract coordinate space (so, its abstract model is more like "MS Paint" or similar); also SVG would suck as a human-edited format.

    Granted, one could argue maybe it could make scope that asset-processing
    is its own tool, then one converts it to a format that the compiler
    accepts (WAD2 or WAD4 in this case) prior to compiling the main binary (and/or not use resource data).

    Still, IMO, an internal WAD image is still better than the horrid/
    unusable mess that Windows had used (where anymore most people don't
    bother with the resource section much more than storing a program icon
    or similar...).

    But, realistically, one does still want to limit how much data they
    stick into the EXE.

    ...


    My forays into the world of graphics formats are pretty limited. I tend
    to use libraries already written by other people. I assume people a lot brighter than myself have come up with them.

    A while ago I wrote a set of graphics routines in assembler that were
    quite fast. One format I have dealt with is the .flic file format used
    to render animated graphics. I wanted to write my own CIV-style game.
    It took a little bit of research and some reverse engineering.
    Apparently, the authors used a modified version of the format, making
    it difficult to use the CIV graphics in my own game. I never could get
    it to render as fast as the game’s engine. I wrote the code for my game
    in C or C++; the original game’s engine code was likely in a different
    language.

    *****

    Been working on vectors for the ISA. I split the vector length register
    into eight sections to define up to eight different vector lengths. The
    first five are defined for integer, float, fixed, character, and address
    data types. I figure one may want to use vectors of different lengths at
    the same time, for instance to address data using byte offsets, while
    the data itself might be a float. The vector load / store instructions
    accept a data type to load / store and always use the address type for
    address calculations.

    There is also a vector lane size register split up the same way. I had
    thought of giving each vector register its own format for length and
    lane size. But thought that is a bit much, with limited use cases.

    I think I can get away with only two load and two store instructions.
    One to do a strided load and a second to do a vector indexed load
    (gather/scatter). The addressing mode in use is d[Rbase+Rindex*Scale].
    Where Rindex is used as the stride when scalar or as a supplier of the
    lane offset when Rindex is a vector.
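
    Under one plausible reading of that addressing mode, the per-lane
    effective address for the two forms would be roughly as below (a
    minimal C sketch with illustrative names, not the actual RTL; whether
    the scale also applies to the stride is an assumption):

    #include <stdint.h>

    /* Strided form: scalar Rindex acts as the stride, stepped per lane. */
    static inline uint64_t ea_strided(uint64_t d, uint64_t rbase,
                                      uint64_t stride, int scale, int lane)
    {
        return d + rbase + (uint64_t)lane * stride * (uint64_t)scale;
    }

    /* Indexed (gather/scatter) form: vector Rindex supplies each lane's offset. */
    static inline uint64_t ea_indexed(uint64_t d, uint64_t rbase,
                                      const uint64_t *rindex_vec, int scale, int lane)
    {
        return d + rbase + rindex_vec[lane] * (uint64_t)scale;
    }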

    Writing the RTL code to support the vector memory ops has been
    challenging. Using a simple approach ATM. The instruction needs to be re-issued for each vector lane accessed. Unaligned vector loads and
    stores are also allowed, adding some complexity when the operation
    crosses a cache-line boundary.

    I have the max vector length and max vector size constants returned by
    the GETINFO instruction which returns CPU specific information.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Nov 22 04:54:00 2025
    From Newsgroup: comp.arch

    On 11/21/2025 9:09 PM, Robert Finch wrote:
    On 2025-11-21 2:36 p.m., BGB wrote:
    On 11/21/2025 7:31 AM, Michael S wrote:
    On Thu, 13 Nov 2025 19:04:18 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    Not really.
    That is, conversions are not blazingly fast, but still much better
    than any attempt to divide in any form of decimal. And helps to
    preserve your sanity.

    Are you trying to pull our proverbial leg here ?!?


    After reading paragraph 5.2 of IEEE-754-2008 Standard I am less sure in
    correctness of my above statement.
    For the case of exact division, preservation of mental sanity during
    fulfillment of requirements of this paragraph is far from simple,
    regardless of numeric base used in the process.


    One effectively needs to do a special extra-wide divide rather than
    just a normal integer divide, etc.


    But, yeah, fastest I had gotten in my experiments was radix-10e9 long-
    division, but still not the fastest option.

    So, rough ranking, fast to slow:
       Radix-10e9 Long Divide (fastest)
       Newton-Raphson
       Radix-10 Long Divide
       Integer Shift-Subtract with converters (slowest).
         Fastest converter strategy ATM:
           Radix-10e9 double-dabble (Int->Dec).
           MUL-by-10e9 and ADD (Dec->Int)
             Fastest strategy: Unrolled Shifts and ADDs (*1).


    *1: While it is possible to perform a 128-bit multiply decomposing
    into multiplying 32-bit parts and adding them together; it was working
    out slightly faster in this case to do a fixed multiply by decomposing
    it into a series of explicit shifts and ADDs.

    Though, in this case, it is faster (and less ugly) to decompose this
    into a pattern of iteratively multiplying by smaller amounts. I had
    ended up using 4x multiply by 100 followed by multiply by 10, as while
    not the fastest strategy, needs less code than 2x multiply by 10000 +
    multiply by 10. Most other patterns would need more shifts and adds.

    In theory, x86-64 could do it better with multiply ops, but getting
    something optimal out of the C compilers is a bigger issue here it seems.


    Unexplored options:
       Radix 10e2 (byte)
       Radix 10e3 (word)
       Radix 10e4 (word)

    Radix 10e3 could have the closest to direct mapping to DPD.


    Looking at the decNumber code, it appears also to be Radix-10e9 based.
    They also do significant (ab)use of the C preprocessor.

    Apparently, "Why use functions when you can use macros?"...


    For the Radix-10e9 long-divide, part of the magic was in the function
    to scale a value by a radix value and subtract it from another array.

    Ended up trying a few options, fastest was to temporarily turn the
    operation into non-normalized 64-bit pieces and then normalize the
    result (borrow propagation, etc) as an output step.

    Initial attempt kept it normalized within the operation, which was
    slower.

    It was seemingly compiler-dependent whether it was faster to do a
    combined operation, or separate scale and subtract, but the margins
    were small. On MSVC the combined operation was slightly faster than
    the separate operations.

    ...



    Otherwise, after this, just went and fiddled with BGBCC some more,
    adding more options for its resource converter.

    Had before (for image formats):
       In: TGA, BMP (various), PNG, QOI, UPIC
       Out: BMP (various), QOI, UPIC

    Added (now):
       In: PPM, JPG, DDS
       Out: PNG, JPG, DDS (DXT1 and DXT5)

    Considered (not added yet):
       PCX
    Evaluated PCX, possible but not a clear win.


    Fiddled with making the PNG encoder less slow, mostly this was
    tweaking some parameters for the LZ searches. Initial settings were
    using deeper searches over initially smaller sliding windows (at lower
    compression levels); better in this case to do a shallower search over
    a max-sized sliding window.

    ATM, speed of PNG is now on-par with the JPG encoder (still one of the
    slower options).

    For simple use-cases, PNG still loses (in terms of both speed and
    compression) to 16-color BMP + LZ compression (LZ4 or RP2).
    Theoretically, indexed-color PNG exists, but is less widely supported.

    It is less space-efficient to represent 16-colors as Deflate-
    compressed color differences than it is to just represent the 4-bit
    RGBI values directly.

    However, can note that the RLE compression scheme (used by PCX) is
    clearly inferior to that of any sort of LZ compression.


    Comparably, PNG is also a more expensive format to decode as well
    (even vs JPEG).


    UPIC can partly address the use-cases of both PNG and JPEG while being
    cheaper to decode than either, but more niche as pretty much nothing
    supports it. Some of its design and properties being mostly JPEG-like.

    QOI is interesting, but suffers some similar limitations to PCX (its
    design is mostly about more compactly encoding color-differences in
    true-color images and otherwise only offers RLE compression).

    QOI is not particularly effective against images with little variety
    in color variation but lots of repeating patterns (I have a modified
    QOI that does a little better here, still not particularly effective
    with 16-color graphics though).


    Otherwise, also added up adding a small text format for image drawing
    commands.

    As a simplistic line oriented format containing various commands to
    perform drawing operations or composite images.
       creating a "canvas"
       setting the working color
       drawing lines
       bucket fill
       drawing text strings
       overlaying other images
       ...


    This is maybe (debatable) outside the scope of a C compiler, but could
    have use-cases for preparing resource data (nevermind if scope creep
    is partly also turning it into an asset-packer tool; where it is
    useful to make graphics/sounds/etc in one set of formats and then
    process and convert them into another set of files, usually inside of
    some sort of VFS image or similar).

    Design is much more simplistic than something like SVG and I am
    currently assuming its use for mostly hand-edited files. Unlike SVG,
    it also assumes drawing to a pixel grid rather than some more abstract
    coordinate space (so, its abstract model is more like "MS Paint" or
    similar); also SVG would suck as a human-edited format.

    Granted, one could argue maybe it could make scope that asset-
    processing is its own tool, then one converts it to a format that the
    compiler accepts (WAD2 or WAD4 in this case) prior to compiling the
    main binary (and/or not use resource data).

    Still, IMO, an internal WAD image is still better than the horrid/
    unusable mess that Windows had used (where anymore most people don't
    bother with the resource section much more than storing a program icon
    or similar...).

    But, realistically, one does still want to limit how much data they
    stick into the EXE.

    ...


    My forays into the world of graphics formats are pretty limited. I tend
    to use libraries already written by other people. I assume people a lot brighter than myself have come up with them.


    I usually wrote my own code for most things.

    Not dealt much with FLIC.


    In the past, whenever doing animated stuff, I had usually used the AVI
    file format. A lot of the time, the codecs were custom.

    Both AVI and BMP can be used to hold a wide range of image data,
    partly as a merit of using FOURCCs.

    Over the course of the past 15 years, have fiddled a lot here.



    A few of the longer-lived ones:
    BTIC1C (~ 2010):
    Was a modified version of RPZA with Deflate compression glued on.
    BTIC1H:
    Made use of multiple block formats,
    used STF+AdRice for entropy coding, and Paeth for color endpoints.
    Block formats, IIRC:
    4x4x2, 4x2x2, 2x4x2, 2x2x2, 4x4x1, 2x2x1, flat
    4x4x2: 32-bits for pixel selectors
    2x2x2: 8 bits for pixel selectors
    BTIC4B:
    Similar to BTIC1H, but a lot more complicated.
    Switched to 8x8 blocks, so had a whole lot of block formats.

    Shorter-Lived:
    BTIC2C: Similar design to MPEG;
    IIRC, used Huffman, but updated the Huffman tables for each I-Frame.
    This sort of thing being N/A with STF+AdRice,
    which starts from a clean slate every time.


    1C: Was used for animated textures in my first 3D engine.

    1H and 4B could be used for video, but were also used in my second 3D
    engine for sprites and textures (inside of a BMP packaging).


    My 3rd 3D engine is mostly using a mix of:
    DDS (mostly DXT1)
    BMP (mostly 16 color and 256 color).

    Though, in modern times, things like 16-color graphics are overlooked;
    in some cases they are still usable or useful (or at least sufficient).

    Typically, I had settled on a variant of the CGA/EGA color palette:
    0: 000000 (Black)
    1: 0000AA (Blue)
    2: 00AA00 (Green)
    3: 00AAAA (Cyan)
    4: AA0000 (Red)
    5: AA00AA (Magenta)
    6: AA5500 (Brown)
    7: AAAAAA (LightGray)
    8: 555555 (DarkGray)
    9: 5555FF (LightBlue)
    A: 55FF55 (LightGreen)
    B: 55FFFF (LightCyan)
    C: FF5555 (LightRed)
    D: FF55FF (Violet)
    E: FFFF55 (Yellow)
    F: FFFFFF (White)

    I am not sure why they changed the default 16-color assignments in VGA
    (e.g., in the Windows 256-color system palette). Like, IMO, 00/AA and
    55/FF work better for typical 16-color use-cases than 00/80 and 00/FF
    (see the sketch below).
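
    As a minimal sketch, the palette above can be generated from the 4-bit
    IRGB index using the 00/AA base levels, a 55 boost from the intensity
    bit, and the usual brown special case (the function name is
    illustrative):

    #include <stdint.h>

    static uint32_t rgbi_to_rgb24(int idx)
    {
        int i = (idx >> 3) & 1, r = (idx >> 2) & 1, g = (idx >> 1) & 1, b = idx & 1;
        uint8_t lo = i ? 0x55 : 0x00;   /* channel value when the bit is clear */
        uint8_t hi = i ? 0xFF : 0xAA;   /* channel value when the bit is set   */
        uint8_t R = r ? hi : lo, G = g ? hi : lo, B = b ? hi : lo;
        if (idx == 6) G = 0x55;         /* brown: dark yellow with halved green */
        return ((uint32_t)R << 16) | ((uint32_t)G << 8) | B;
    }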

    Sorta depends on use-case: Sometimes something works well as 16 colors,
    other times it would fall on its face.



    Most other designs sucked so bad they didn't get very far.

    Where, I had ended up categorizing designs:
    BTIC1x: Designs mostly following an RPZA like path.
    1C: RPZA + Deflate
    Mostly built on 4x4x2 blocks (32 bits).
    1D, 1E: Byte-Encoding + Deflate
    Both sucked, quickly dropped.
    Both were like RPZA but with 48-bit 4:2:0 blocks.
    Neither great compression nor particularly fast.
    Deflate carries a high computational overhead.
    1F, 1G: No entropy coding (back to being like RPZA)
    Major innovations: Variable-size pixel blocks.
    1H: STF+AdRice
    Mostly final state of 1x line.
    BTIC2x: Designs mostly influenced by JPEG and MPEG.
    Difficult to make particularly fast.
    1A/1B: Modified MJPEG IIRC.
    Technically, also based on my BTJPEG format (*1).
    2C: IIRC, MPEG-like, Huffman-coded.
    Well influenced by both MPEG and the Xiph Theora codec.
    2D: Like 2C, but STF+AdRice
    2E: Like 2C, but byte stream based
    Was trying, mostly in vain, to make it faster.
    My attempts at this style of codecs were mostly, too slow.
    2F: Goes back to a more JPEG like core in some ways.
    Entropy and VLN scheme borrows more from Deflate.
    Though, uses a shorter limit on max symbol length (13 bit).
    13 bit simplifies things and makes decoding faster vs 15 bit.
    Abandons DCT and YCbCr in favor of Block-Haar and RCT.
    Later, UPIC did similar, just with STF+AdRice versus Huffman.
    BTIC3x:
    Attempts to hybridize 1x and 2x
    Nothing implemented, all designs too complicated to bother with.
    BTIC4x:
    4A: RPZA-like but with 8x8 blocks and multiple block sizes.
    4B: Like 4A but reusing the encoding scheme from 1H.
    BTIC5x:
    5A: Resembled a CRAM/QOI hybrid, but with 8-bit indexed colors.
    No entropy coding.
    5B: Like 5A, but used differential RGB555 (still QOI like).
    Major innovation was to use a 6-bit 64-entry pattern table.
    Optionally, can use per-frame RP2 or TKuLZ compression.
    Used if doing so results in a significant savings.


    *1: BTJPEG was an attempt at making a more advanced image format based
    on tweaking the existing T.81 JPEG format in a way that sorta worked in existing decoders. The more widespread use (and "not totally dead"
    feature) being to allow for an embedded alpha channel as essentially
    another monochrome JPEG inside the APP11 marker.

    I had tried a bunch of other ideas, but it turned into a mess of
    experimental tweaks, and most of it died off. The surviving variant is basically just T.81+JFIF with an optional alpha channel (ignored by a non-aware JPEG decoder).

    Some other (mostly dead) tweaks were things like:
    Allowing multi-layered images (more like Paint.NET's PDN or GIMP's XCF,
    mostly by nesting the images like a Matryoshka doll), where the
    top-level image would contain a view of all the layers rendered together; Allowing lossless images (similar to PNG) by using SERMS-RDCT and RCT
    (where SERMS-RDCT was a trick to make the DCT/IDCT transform exactly reversible, at the cost of speed).


    In the early 2010s, I was pretty bad about massively over-engineering everything.

    Later on, some ideas were reused in 2F and UPIC.
    Though, 2F and UPIC were much less over-engineered.

    Did specify possible use as video codecs, but thus far both were used
    only as still image formats.

    The major goal for UPIC was mostly to address the core use-cases but
    also for the decoder to be small and relatively cheap. Still sorta JPEG
    competitive despite being primarily cost-optimized, to try to make it
    more viable for use in programs running on the BJX2 core (where JPEG
    decoding is slow and expensive).

    As for Static Huffman vs STF+AdRice:
    Huffman:
    + Slightly faster for larger payloads
    + Optimal for a static distribution
    - Higher memory cost for decoding (storing decoder tables)
    - High initial setup cost (setting up decoder tables)
    - Higher constant overhead (storing symbol lengths)
    - Need to provision for storing Huffman tables
    STF+AdRice:
    + Very cheap initial setup (minimal context)
    + No need to transmit tables
    + Better compression for small data
    + Significantly faster than Adaptive Huffman
    + Significantly faster than Range Coding
    - Slower for large data and worse compression vs Huffman.

    Where, STF+AdRice is mostly:
    Have a table of symbols;
    Whenever a symbol is encoded, swap it forwards;
    Next time, it may potentially be encoded with a smaller index.
    Encode indices into table using Adaptive Rice Codes.
    Or, basically, using a lookup table to allow AdRice to pretend to be
    Huffman. Also reasonably fast and simple.
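
    A minimal sketch of the scheme as described: a symbol table where each
    encoded symbol is swapped one step toward the front, with the resulting
    index emitted as a Rice code. The bit-output interface and the
    adaptation rule for the Rice parameter are hypothetical placeholders,
    not the actual BTIC/UPIC code:

    #include <stdint.h>

    static uint8_t stf_tab[256];        /* index -> symbol */
    static int     rice_k = 3;          /* current Rice parameter */

    extern void put_bit(int bit);       /* assumed bitstream writer */

    static void stf_init(void)
    {
        for (int i = 0; i < 256; i++) stf_tab[i] = (uint8_t)i;
    }

    static void rice_put(uint32_t v, int k)
    {
        for (uint32_t q = v >> k; q > 0; q--) put_bit(1);        /* unary quotient */
        put_bit(0);
        for (int i = k - 1; i >= 0; i--) put_bit((v >> i) & 1);  /* k-bit remainder */
    }

    static void stf_adrice_encode(uint8_t sym)
    {
        int idx = 0;
        while (stf_tab[idx] != sym) idx++;          /* index of symbol in table */
        rice_put((uint32_t)idx, rice_k);

        if (idx > 0) {                              /* swap one step forward */
            uint8_t t = stf_tab[idx - 1];
            stf_tab[idx - 1] = stf_tab[idx];
            stf_tab[idx] = t;
        }

        /* hypothetical adaptation: grow k after large indices, shrink on hits */
        if (((uint32_t)idx >> rice_k) > 1 && rice_k < 8) rice_k++;
        else if (idx == 0 && rice_k > 0)                 rice_k--;
    }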


    Block-Haar vs DCT:
    + Block-Haar is faster and easily reversible (lossless);
    + Mostly a drop-in replacement for DCT/IDCT in the design.
    + Also faster than WHT (Walsh-Hadamard Transform)

    RCT vs YCbCr:
    RCT is both slightly faster, and also reversible;
    Had experimented with YCoCg, but saw no real advantage over RCT.
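
    For illustration, a minimal sketch of a reversible Haar pair (in the
    usual lifting/S-transform form) and a JPEG 2000-style RCT; whether
    these match the exact variants used in 2F/UPIC is an assumption, and an
    arithmetic right shift (i.e., floor division) on negative values is
    assumed:

    static void haar_fwd(int a, int b, int *s, int *d)
    {
        *d = a - b;             /* difference */
        *s = b + (*d >> 1);     /* == floor((a + b) / 2) */
    }
    static void haar_inv(int s, int d, int *a, int *b)
    {
        *b = s - (d >> 1);
        *a = *b + d;            /* exact round trip, no loss */
    }

    static void rct_fwd(int r, int g, int b, int *y, int *cb, int *cr)
    {
        *y  = (r + 2 * g + b) >> 2;
        *cb = b - g;
        *cr = r - g;
    }
    static void rct_inv(int y, int cb, int cr, int *r, int *g, int *b)
    {
        *g = y - ((cb + cr) >> 2);
        *r = cr + *g;
        *b = cb + *g;
    }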



    The existence of BTIC5x was mostly because:
    BTIC1H and BTIC4B were too computationally demanding to do 320x200 16Hz
    on a 50MHz BJX2 core;

    MS-CRAM was fast to decode, but needed too much bitrate (SDcard couldn't
    keep the decoder fed with any semblance of image quality).


    So, 5A and 5B were aimed at trying to give tolerable Q/bpp at more
    CRAM-like decoding speeds.

    Also, while reasonably effective (and fast by desktop PC standards),
    one other drawback of the 4B design (and to a lesser degree 1H) was the
    design being overly complicated (and thus the code is large and bulky).

    Part of this was due to having too many block formats.


    If my UPIC format were put into my older naming scheme, would likely be
    called 2G. Design is kinda similar to 2F, but replaces Huffman with STF+AdRice.


    As for RP2 and TKuLZ:
    RP2 is a byte-oriented LZ77 variant, like LZ4,
    but on-average compresses slightly better than LZ4.
    TKuLZ: Is sorta like a simplified/tuned Deflate variant.
    Uses a shorter max symbol length,
    borrows some design elements from LZ4.

    Can note, some past experiments with LZ decompression (at Desktop PC
    speeds), with entropy scheme, and len/dist limits:
      LZMA   : ~   35 MB/sec  (Range Coding,   273 / 4GB)
      Zstd   : ~   60 MB/sec  (tANS,          16MB / 128MB)
      Deflate: ~  175 MB/sec  (Huffman,        258 / 32767)
      TKuLZ  : ~  300 MB/sec  (Huffman,      65535 / 262143)
      RP2    : ~ 1100 MB/sec  (Raw Bytes,      512 / 131071)
      LZ4    : ~ 1300 MB/sec  (Raw Bytes,    16383 / 65535)


    While Zstd is claimed to be fast, my testing tended to show it closer to
    LZMA speeds than to Deflate, but it does give compression closer to
    LZMA. The tANS strategy seems to under-perform claims IME (and is
    notably slower than static Huffman). Also it is the most complicated
    design among these.


    A lot of my older stuff used Deflate, but often Deflate wasn't fast
    enough, so has mostly gotten displaced by RP2 in my uses.

    TKuLZ is an intermediate: generally faster than Deflate, with an option
    to gain some speed (at the expense of compression) by using
    fixed-length symbols in some cases. This can push it to around 500
    MB/sec, but it is hard to get much faster (or anywhere near RP2 or
    LZ4).

    Whether RP2 or LZ4 is faster seems to depend on target:
    BJX2 Core, RasPi, and Piledriver: RP2 is faster.
    Mostly things with in-order cores.
    And Piledriver, which behaved almost more like an in-order machine.
    Zen+, Core 2, and Core i7: LZ4 is faster.

    LZ4 typically needs multiple chained memory accesses for each LZ run,
    whereas for RP2, match length/distance and raw count are typically all available via a single memory load (then maybe a few bit-tests and
    conditional branches).

    ...



    A while ago I wrote a set of graphics routines in assembler that were
    quite fast. One format I have delt with is the .flic file format used to render animated graphics. I wanted to write my own CIV style game. It
    took a little bit of research and some reverse engineering. Apparently,
    the authors used a modified version of the format making it difficult to
    use the CIV graphics in my own game. I never could get it to render as
    fast as the game’s engine. I wrote the code for my game in C or C++, the original’s game engine code was likely in a different language.


    This sort of thing is almost inevitable with this stuff.

    Usually I just ended up using C for nearly everything.


    *****

    Been working on vectors for the ISA. I split the vector length register
    into eight sections to define up to eight different vector lengths. The first five are defined for integer, float, fixed, character, and address data types. I figure one may want to use vectors of different lengths at
    the same time, for instance to address data using byte offsets, while
    the data itself might be a float. The vector load / store instructions accept a data type to load / store and always use the address type for address calculations.

    There is also a vector lane size register split up the same way. I had thought of giving each vector register its own format for length and
    lane size. But thought that is a bit much, with limited use cases.

    I think I can get away with only two load and two store instructions.
    One to do a strided load and a second to do an vector indexed load (gather/scatter). The addressing mode in use is d[Rbase+Rindex*Scale].
    Where Rindex is used as the stride when scalar or as a supplier of the
    lane offset when Rindex is a vector.

    Writing the RTL code to support the vector memory ops has been
    challenging. Using a simple approach ATM. The instruction needs to be re-issued for each vector lane accessed. Unaligned vector loads and
    stores are also allowed, adding some complexity when the operation
    crosses a cache-line boundary.

    I have the max vector length and max vector size constants returned by
    the GETINFO instruction which returns CPU specific information.


    I don't get it...

    Usually makes sense to treat vectors as opaque blobs of bits that are
    then interpreted as one of the available formats for a specific operation.

    In my case, I have a SIMD setup:
    2 or 4 elements in a GPR or GPR pair;
    Most other operations are just the normal GPR operations.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sat Nov 22 18:50:01 2025
    From Newsgroup: comp.arch

    On Fri, 21 Nov 2025 13:36:05 -0600
    BGB <cr88192@gmail.com> wrote:

    On 11/21/2025 7:31 AM, Michael S wrote:
    On Thu, 13 Nov 2025 19:04:18 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    Not really.
    That is, conversions are not blazingly fast, but still much better
    than any attempt to divide in any form of decimal. And helps to
    preserve your sanity.

    Are you trying to pull our proverbial leg here ?!?


    After reading paragraph 5.2 of IEEE-754-2008 Standard I am less
    sure in correctness of my above statement.
    For the case of exact division, preservation of mental sanity during fulfillment of requirements of this paragraph is far from simple, regardless of numeric base used in the process.


    One effectively needs to do a special extra-wide divide rather than
    just a normal integer divide, etc.



    It seems you are talking about the case of inexact division
    (rem(num*10**scale, den) != 0). I don't consider it harmful for sanity.

    It is the opposite case that I find stressful.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Nov 22 12:45:57 2025
    From Newsgroup: comp.arch

    On 2025-11-22 5:54 a.m., BGB wrote:
    On 11/21/2025 9:09 PM, Robert Finch wrote:
    On 2025-11-21 2:36 p.m., BGB wrote:
    On 11/21/2025 7:31 AM, Michael S wrote:
    On Thu, 13 Nov 2025 19:04:18 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    Not really.
    That is, conversions are not blazingly fast, but still much better
    than any attempt to divide in any form of decimal. And helps to
    preserve your sanity.

    Are you trying to pull our proverbial leg here ?!?


    After reading paragraph 5.2 of IEEE-754-2008 Standard I am less sure in
    correctness of my above statement.
    For the case of exact division, preservation of mental sanity during
    fulfillment of requirements of this paragraph is far from simple,
    regardless of numeric base used in the process.


    One effectively needs to do a special extra-wide divide rather than
    just a normal integer divide, etc.


    But, yeah, fastest I had gotten in my experiments was radix-10e9
    long- division, but still not the fastest option.

    So, rough ranking, fast to slow:
       Radix-10e9 Long Divide (fastest)
       Newton-Raphson
       Radix-10 Long Divide
       Integer Shift-Subtract with converters (slowest).
         Fastest converter strategy ATM:
           Radix-10e9 double-dabble (Int->Dec).
           MUL-by-10e9 and ADD (Dec->Int)
             Fastest strategy: Unrolled Shifts and ADDs (*1).


    *1: While it is possible to perform a 128-bit multiply decomposing
    into multiplying 32-bit parts and adding them together; it was
    working out slightly faster in this case to do a fixed multiply by
    decomposing it into a series of explicit shifts and ADDs.

    Though, in this case, it is faster (and less ugly) to decompose this
    into a pattern of iteratively multiplying by smaller amounts. I had
    ended up using 4x multiply by 100 followed by multiply by 10, as
    while not the fastest strategy, needs less code than 2x multiply by
    10000 + multiply by 10. Most other patterns would need more shifts
    and adds.

    In theory, x86-64 could do it better with multiply ops, but getting
    something optimal out of the C compilers is a bigger issue here it
    seems.


    Unexplored options:
       Radix 10e2 (byte)
       Radix 10e3 (word)
       Radix 10e4 (word)

    Radix 10e3 could have the closest to direct mapping to DPD.


    Looking at the decNumber code, it appears also to be Radix-10e9 based.
    They also do significant (ab)use of the C preprocessor.

    Apparently, "Why use functions when you can use macros?"...


    For the Radix-10e9 long-divide, part of the magic was in the function
    to scale a value by a radix value and subtract it from another array.

    Ended up trying a few options, fastest was to temporarily turn the
    operation into non-normalized 64-bit pieces and then normalize the
    result (borrow propagation, etc) as an output step.

    Initial attempt kept it normalized within the operation, which was
    slower.

    It was seemingly compiler-dependent whether it was faster to do a
    combined operation, or separate scale and subtract, but the margins
    were small. On MSVC the combined operation was slightly faster than
    the separate operations.

    ...



    Otherwise, after this, just went and fiddled with BGBCC some more,
    adding more options for its resource converter.

    Had before (for image formats):
       In: TGA, BMP (various), PNG, QOI, UPIC
       Out: BMP (various), QOI, UPIC

    Added (now):
       In: PPM, JPG, DDS
       Out: PNG, JPG, DDS (DXT1 and DXT5)

    Considered (not added yet):
       PCX
    Evaluated PCX, possible but not a clear win.


    Fiddled with making the PNG encoder less slow, mostly this was
    tweaking some parameters for the LZ searches. Initial settings were
    using deeper searches over initially smaller sliding windows (at
    lower compression levels); better in this case to do a shallower
    search over a max-sized sliding window.

    ATM, speed of PNG is now on-par with the JPG encoder (still one of
    the slower options).

    For simple use-cases, PNG still loses (in terms of both speed and
    compression) to 16-color BMP + LZ compression (LZ4 or RP2).
    Theoretically, indexed-color PNG exists, but is less widely supported.

    It is less space-efficient to represent 16-colors as Deflate-
    compressed color differences than it is to just represent the 4-bit
    RGBI values directly.

    However, can note that the RLE compression scheme (used by PCX) is
    clearly inferior to that of any sort of LZ compression.


    Comparably, PNG is also a more expensive format to decode as well
    (even vs JPEG).


    UPIC can partly address the use-cases of both PNG and JPEG while
    being cheaper to decode than either, but more niche as pretty much
    nothing supports it. Some of its design and properties being mostly
    JPEG-like.

    QOI is interesting, but suffers some similar limitations to PCX (its
    design is mostly about more compactly encoding color-differences in
    true-color images and otherwise only offers RLE compression).

    QOI is not particularly effective against images with little variety
    in color variation but lots of repeating patterns (I have a modified
    QOI that does a little better here, still not particularly effective
    with 16-color graphics though).


    Otherwise, also added up adding a small text format for image drawing
    commands.

    As a simplistic line oriented format containing various commands to
    perform drawing operations or composite images.
       creating a "canvas"
       setting the working color
       drawing lines
       bucket fill
       drawing text strings
       overlaying other images
       ...


    This is maybe (debatable) outside the scope of a C compiler, but
    could have use-cases for preparing resource data (nevermind if scope
    creep is partly also turning it into an asset-packer tool; where it
    is useful to make graphics/sounds/etc in one set of formats and then
    process and convert them into another set of files, usually inside of
    some sort of VFS image or similar).

    Design is much more simplistic than something like SVG and I am
    currently assuming its use for mostly hand-edited files. Unlike SVG,
    it also assumes drawing to a pixel grid rather than some more
    abstract coordinate space (so, its abstract model is more like "MS
    Paint" or similar); also SVG would suck as a human-edited format.

    Granted, one could argue maybe it could make scope that asset-
    processing is its own tool, then one converts it to a format that the
    compiler accepts (WAD2 or WAD4 in this case) prior to compiling the
    main binary (and/or not use resource data).

    Still, IMO, an internal WAD image is still better than the horrid/
    unusable mess that Windows had used (where anymore most people don't
    bother with the resource section much more than storing a program
    icon or similar...).

    But, realistically, one does still want to limit how much data they
    stick into the EXE.

    ...


    My forays into the world of graphics formats are pretty limited. I
    tend to use libraries already written by other people. I assume people
    a lot brighter than myself have come up with them.


    I usually wrote my own code for most things.

    Not dealt much with FLIC.


    In the past, whenever doing animated stuff, had usually used the AVI
    file format. A lot of time, the codecs were custom.

    Both AVI (and BMP) can be used to hold a wide range of image data,
    partly as a merit of using FOURCCs.

    Over the course of the past 15 years, have fiddled a lot here.



    A few of the longer-lived ones:
      BTIC1C (~ 2010):
        Was a modified version of RPZA with Deflate compression glued on.
      BTIC1H:
        Made use of multiple block formats,
          used STF+AdRice for entropy coding, and Paeth for color endpoints.
        Block formats, IIRC:
          4x4x2, 4x2x2, 2x4x2, 2x2x2, 4x4x1, 2x2x1, flat
            4x4x2: 32-bits for pixel selectors
            2x2x2: 8 bits for pixel selectors
      BTIC4B:
        Similar to BTIC1H, but a lot more complicated.
        Switched to 8x8 blocks, so had a whole lot of block formats.

    Shorter-Lived:
      BTIC2C: Similar design to MPEG;
      IIRC, used Huffman, but updated the Huffman tables for each I-Frame.
        This sort of thing being N/A with STF+AdRice,
          which starts from a clean slate every time.


    1C: Was used for animated textures in my first 3D engine.

    1H and 4B could be used for video, but were also used in my second 3D
    engine for sprites and textures (inside of a BMP packaging).


    My 3rd 3D engine is mostly using a mix of:
      DDS (mostly DXt1)
      BMP (mostly 16 color and 256 color).

    Though, in modern times, things like 16-color graphics are overlooked,
    in some cases they are still usable or useful (or at least sufficient).

    Typically, I had settled on a variant of the CGA/EGA color palette:
      0: 000000 (Black)
      1: 0000AA (Blue)
      2: 00AA00 (Green)
      3: 00AAAA (Cyan)
      4: AA0000 (Red)
      5: AA00AA (Magenta)
      6: AA5500 (Brown)
      7: AAAAAA (LightGray)
      8: 555555 (DarkGray)
      9: 5555FF (LightBlue)
      A: 55FF55 (LightGreen)
      B: 55FFFF (LightCyan)
      C: FF5555 (LightRed)
      D: FF55FF (Violet)
      E: FFFF55 (Yellow)
      F: FFFFFF (White)

    I am not sure why they changed it for the default 16-color assignments
    in VGA (eg, in the Windows 256-color system palette). Like, IMO, 00/AA
    and 55/FF work better for typical 16-color use-cases than 00/80 and 00/FF.
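
    For reference, the same palette as a C table of 0xRRGGBB values (a
    direct transcription of the list above; the array name is just
    illustrative):

      static const unsigned int pal_cga16[16] = {
          0x000000, 0x0000AA, 0x00AA00, 0x00AAAA, /* Blk, Blue, Grn, Cyan  */
          0xAA0000, 0xAA00AA, 0xAA5500, 0xAAAAAA, /* Red, Mag, Brown, LGry */
          0x555555, 0x5555FF, 0x55FF55, 0x55FFFF, /* DGry, LBlu, LGrn, LCyn*/
          0xFF5555, 0xFF55FF, 0xFFFF55, 0xFFFFFF  /* LRed, Violet, Yel, Wht*/
      };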

    Sorta depends on use-case: Sometimes something works well as 16 colors, other times it would fall on its face.



    Most other designs sucked so bad they didn't get very far.

    Where, I had ended up categorizing designs:
      BTIC1x: Designs mostly following an RPZA like path.
        1C: RPZA + Deflate
          Mostly built on 4x4x2 blocks (32 bits).
        1D, 1E: Byte-Encoding + Deflate
          Both sucked, quickly dropped.
          Both were like RPZA but with 48-bit 4:2:0 blocks.
          Neither great compression nor particularly fast.
            Deflate carries a high computational overhead.
        1F, 1G: No entropy coding (back to being like RPZA)
          Major innovations: Variable-size pixel blocks.
        1H: STF+AdRice
          Mostly final state of 1x line.
      BTIC2x: Designs mostly influenced by JPEG and MPEG.
        Difficult to make particularly fast.
        1A/1B: Modified MJPEG IIRC.
          Technically, also based on my BTJPEG format (*1).
        2C: IIRC, MPEG-like, Huffman-coded.
          Was influenced by both MPEG and the Xiph Theora codec.
        2D: Like 2C, but STF+AdRice
        2E: Like 2C, but byte stream based
          Was trying, mostly in vain, to make it faster.
          My attempts at this style of codec were mostly too slow.
        2F: Goes back to a more JPEG like core in some ways.
          Entropy and VLN scheme borrows more from Deflate.
            Though, uses a shorter limit on max symbol length (13 bit).
            13 bit simplifies things and makes decoding faster vs 15 bit.
          Abandons DCT and YCbCr in favor of Block-Haar and RCT.
            Later, UPIC did similar, just with STF+AdRice versus Huffman.
      BTIC3x:
        Attempts to hybridize 1x and 2x
        Nothing implemented, all designs too complicated to bother with.
      BTIC4x:
        4A: RPZA-like but with 8x8 blocks and multiple block sizes.
        4B: Like 4A but reusing the encoding scheme from 1H.
      BTIC5x:
        5A: Resembled a CRAM/QOI hybrid, but with 8-bit indexed colors.
          No entropy coding.
        5B: Like 5A, but used differential RGB555 (still QOI like).
          Major innovation was to use a 6-bit 64-entry pattern table.
          Optionally, can use per-frame RP2 or TKuLZ compression.
            Used if doing so results in a significant savings.


    *1: BTJPEG was an attempt at making a more advanced image format based
    on tweaking the existing T.81 JPEG format in a way that sorta worked in existing decoders. The more widespread use (and "not totally dead"
    feature) being to allow for an embedded alpha channel as essentially
    another monochrome JPEG inside the APP11 marker.

    I had tried a bunch of other ideas, but it turned into a mess of experimental tweaks, and most of it died off. The surviving variant is basically just T.81+JFIF with an optional alpha channel (ignored by a non-aware JPEG decoder).

    Some other (mostly dead) tweaks were things like:
    Allowing multi-layered images (more like Paint.NET's PDN or GIMP's XCF, mostly by nesting the images like a Matryoshka doll), where the top-
    level image would contain a view of all the layers rendered together; Allowing lossless images (similar to PNG) by using SERMS-RDCT and RCT
    (where SERMS-RDCT was a trick to make the DCT/IDCT transform exactly reversible, at the cost of speed).


    In the early 2010s, I was pretty bad about massively over-engineering everything.

    Later on, some ideas were reused in 2F and UPIC.
    Though, 2F and UPIC were much less over-engineered.

    Did specify possible use as video codecs, but thus far both were used
    only as still image formats.

    The major goal for UPIC was mostly to address the core use-cases but
    also for the decoder to be small and relatively cheap. Still sorta
    JPEG-competitive despite being primarily cost-optimized to try to make
    it more viable for use in programs running on the BJX2 core (where
    JPEG decoding is slow and expensive).

    As for Static Huffman vs STF+AdRice:
      Huffman:
        + Slightly faster for larger payloads
        + Optimal for a static distribution
        - Higher memory cost for decoding (storing decoder tables)
        - High initial setup cost (setting up decoder tables)
        - Higher constant overhead (storing symbol lengths)
        - Need to provision for storing Huffman tables
      STF+AdRice:
        + Very cheap initial setup (minimal context)
        + No need to transmit tables
        + Better compression for small data
        + Significantly faster than Adaptive Huffman
        + Significantly faster than Range Coding
        - Slower for large data and worse compression vs Huffman.

    Where, STF+AdRice is mostly:
      Have a table of symbols;
      Whenever a symbol is encoded, swap it forwards;
        Next time, it may potentially be encoded with a smaller index.
      Encode indices into table using Adaptive Rice Codes.
    Or, basically, using a lookup table to allow AdRice to pretend to be Huffman. Also reasonably fast and simple.
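
    As a rough illustration of the swap-towards-front part (a minimal
    sketch, assuming a single-position swap and leaving the adaptive Rice
    index coder itself elided; names are illustrative, not the actual
    implementation):

      #include <stdint.h>

      typedef struct { uint8_t table[256]; } StfCtx;

      static void stf_init(StfCtx *ctx)
      {
          for (int i = 0; i < 256; i++)
              ctx->table[i] = (uint8_t)i;   /* clean slate: identity map */
      }

      /* 'idx' is the table index as decoded by the adaptive Rice coder. */
      static uint8_t stf_decode_symbol(StfCtx *ctx, int idx)
      {
          uint8_t sym = ctx->table[idx];
          if (idx > 0) {
              /* swap forwards, so a repeated symbol gets a smaller
                 (cheaper to code) index next time */
              ctx->table[idx]     = ctx->table[idx - 1];
              ctx->table[idx - 1] = sym;
          }
          return sym;
      }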


    Block-Haar vs DCT:
      + Block-Haar is faster and easily reversible (lossless);
      + Mostly a drop-in replacement for DCT/IDCT in the design.
      + Also faster than WHT (Walsh-Hadamard Transform)
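
    For context, a minimal sketch of one reversible (lifting-style) Haar
    step on an integer pair, the kind of primitive such a block transform
    can be built from (a generic illustration, not the exact formulation
    used in 2F/UPIC; assumes arithmetic right shift for negative values,
    as on typical compilers):

      /* forward: split a pair into an integer "average" and difference */
      static void haar_fwd(int a, int b, int *s, int *d)
      {
          *d = a - b;
          *s = b + (*d >> 1);   /* together with d, exactly invertible */
      }

      /* inverse: reconstructs the original pair exactly */
      static void haar_inv(int s, int d, int *a, int *b)
      {
          *b = s - (d >> 1);
          *a = *b + d;
      }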

    RCT vs YCbCr:
      RCT is both slightly faster, and also reversible;
      Had experimented with YCoCg, but saw no real advantage over RCT.
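
    And a corresponding sketch of a reversible color transform, assuming
    the JPEG 2000-style RCT is the flavor meant here (exactly invertible
    in integer arithmetic, unlike conventional YCbCr; again assumes
    arithmetic right shift):

      static void rct_fwd(int r, int g, int b, int *y, int *cb, int *cr)
      {
          *y  = (r + 2 * g + b) >> 2;
          *cb = b - g;
          *cr = r - g;
      }

      static void rct_inv(int y, int cb, int cr, int *r, int *g, int *b)
      {
          *g = y - ((cb + cr) >> 2);
          *r = cr + *g;
          *b = cb + *g;
      }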



    The existence of BTIC5x was mostly because:
    BTIC1H and BTIC4B were too computationally demanding to do 320x200 16Hz
    on a 50MHz BJX2 core;

    MS-CRAM was fast to decode, but needed too much bitrate (SDcard couldn't keep the decoder fed with any semblance of image quality).


    So, 5A and 5B were aimed at trying to give tolerable Q/bpp at more CRAM-like decoding speeds.

    Also, while reasonably effective (and fast by desktop PC standards),
    one other drawback of the 4B design (and to a lesser degree 1H) was
    the design being overly complicated (and thus the code is large and
    bulky).

    Part of this was due to having too many block formats.


    If my UPIC format were put into my older naming scheme, would likely be called 2G. Design is kinda similar to 2F, but replaces Huffman with STF+AdRice.


    As for RP2 and TKuLZ:
      RP2 is a byte-oriented LZ77 variant, like LZ4,
        but on-average compresses slightly better than LZ4.
      TKuLZ: Is sorta like a simplified/tuned Deflate variant.
        Uses a shorter max symbol length,
          borrows some design elements from LZ4.

    Can note, some past experiments with LZ decompression (at Desktop PC speeds), with entropy scheme, and len/dist limits:
      LZMA   : ~   35 MB/sec (Range Coding,   273/   4GB)
      Zstd   : ~   60 MB/sec (tANS,          16MB/ 128MB)
      Deflate: ~  175 MB/sec (Huffman,        258/ 32767)
      TKuLZ  : ~  300 MB/sec (Huffman,      65535/262143)
      RP2    : ~ 1100 MB/sec (Raw Bytes,      512/131071)
      LZ4    : ~ 1300 MB/sec (Raw Bytes,    16383/ 65535)


    While Zstd is claimed to be fast, my testing tended to show it closer to LZMA speeds than to Deflate, but it does give compression closer to
    LZMA. The tANS strategy seems to under-perform claims IME (and is
    notably slower than static Huffman). Also it is the most complicated
    design among these.


    A lot of my older stuff used Deflate, but often Deflate wasn't fast
    enough, so has mostly gotten displaced by RP2 in my uses.

    TKuLZ is an intermediate: generally faster than Deflate, with an
    option to gain some speed by using fixed-length symbols in some
    cases. This can push it to around 500 MB/sec (at the expense of
    compression), but it is hard to get much faster (or anywhere near RP2
    or LZ4).

    Whether RP2 or LZ4 is faster seems to depend on target:
      BJX2 Core, RasPi, and Piledriver: RP2 is faster.
        Mostly things with in-order cores.
        And Piledriver, which behaved almost more like an in-order machine.
      Zen+, Core 2, and Core i7: LZ4 is faster.

    LZ4 needs typically multiple chained memory accesses for each LZ run, whereas for RP2, match length/distance and raw count are typically all available via a single memory load (then maybe a few bit-tests and conditional branches).
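
    To illustrate the difference (the LZ4 block format is public; the
    RP2-style tag layout below is purely hypothetical, just showing the
    "one load gets everything" idea, not the actual RP2 bit layout):

      #include <stdint.h>
      #include <string.h>

      /* LZ4-style: token, optional length bytes, literals, a 16-bit
         offset, then optional match-length bytes: dependent loads. */
      static const uint8_t *lz4_parse_seq(const uint8_t *src,
                                          unsigned *nlit, unsigned *mlen,
                                          unsigned *mdist)
      {
          unsigned tok = *src++, n = tok >> 4, b;
          if (n == 15) do { b = *src++; n += b; } while (b == 255);
          *nlit = n;
          src += n;                          /* skip literals (sketch)  */
          *mdist = src[0] | (src[1] << 8);   /* little-endian offset    */
          src += 2;
          n = (tok & 15) + 4;
          if ((tok & 15) == 15) do { b = *src++; n += b; } while (b == 255);
          *mlen = n;
          return src;
      }

      /* Hypothetical single-load tag in the spirit of the RP2
         description: raw count, match length, and distance packed
         into one 32-bit word. */
      static void rp2_style_tag(const uint8_t *src, unsigned *nraw,
                                unsigned *mlen, unsigned *mdist)
      {
          uint32_t tag;
          memcpy(&tag, src, 4);              /* one load gets everything */
          *nraw  =  tag        & 0x1F;
          *mlen  = (tag >>  5) & 0x1FF;
          *mdist =  tag >> 14;
      }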

    ...



    A while ago I wrote a set of graphics routines in assembler that were
    quite fast. One format I have dealt with is the .flic file format used
    to render animated graphics. I wanted to write my own CIV style game.
    It took a little bit of research and some reverse engineering.
    Apparently, the authors used a modified version of the format making
    it difficult to use the CIV graphics in my own game. I never could get
    it to render as fast as the game’s engine. I wrote the code for my
    game in C or C++; the original game’s engine code was likely in a
    different language.


    This sort of thing is almost inevitable with this stuff.

    Usually I just ended up using C for nearly everything.


    *****

    Been working on vectors for the ISA. I split the vector length
    register into eight sections to define up to eight different vector
    lengths. The first five are defined for integer, float, fixed,
    character, and address data types. I figure one may want to use
    vectors of different lengths at the same time, for instance to address
    data using byte offsets, while the data itself might be a float. The
    vector load / store instructions accept a data type to load / store
    and always use the address type for address calculations.

    There is also a vector lane size register split up the same way. I had
    thought of giving each vector register its own format for length and
    lane size, but thought that was a bit much, with limited use cases.

    I think I can get away with only two load and two store instructions.
    One to do a strided load and a second to do a vector indexed load
    (gather/scatter). The addressing mode in use is d[Rbase+Rindex*Scale].
    Where Rindex is used as the stride when scalar or as a supplier of the
    lane offset when Rindex is a vector.
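
    In C terms, the two forms described above would look roughly like the
    following (a sketch only; 'vl', 'scale', and the 64-bit element type
    are placeholders rather than actual Qupls definitions):

      #include <stdint.h>
      #include <stddef.h>

      /* scalar Rindex: acts as the stride */
      static void vload_strided(int64_t *dst, const uint8_t *base,
                                int64_t stride, int scale, int vl)
      {
          for (int i = 0; i < vl; i++)
              dst[i] = *(const int64_t *)(base +
                           (ptrdiff_t)i * stride * scale);
      }

      /* vector Rindex: supplies a per-lane offset (gather) */
      static void vload_indexed(int64_t *dst, const uint8_t *base,
                                const int64_t *index, int scale, int vl)
      {
          for (int i = 0; i < vl; i++)
              dst[i] = *(const int64_t *)(base + index[i] * scale);
      }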

    Writing the RTL code to support the vector memory ops has been
    challenging. Using a simple approach ATM. The instruction needs to be
    re-issued for each vector lane accessed. Unaligned vector loads and
    stores are also allowed, adding some complexity when the operation
    crosses a cache-line boundary.

    I have the max vector length and max vector size constants returned by
    the GETINFO instruction which returns CPU specific information.


    I don't get it...

    Usually makes sense to treat vectors as opaque blobs of bits that are
    then interpreted as one of the available formats for a specific operation.

    In my case, I have a SIMD setup:
      2 or 4 elements in a GPR or GPR pair;
      Most other operations are just the normal GPR operations.

    ...


    Many vector machines (e.g. RISC-V V) have a way of specifying the vector
    length and element size, but it tends to be a global setting which may
    in some cases be overridden by the instruction. For Qupls, it also
    allows setting based on the data type, which is a bit of a misnomer; it
    would be better named data format. It is just three bits in the
    instruction that select one of the fields in the VLEN and VELSZ
    registers. The instruction itself specifies the data type for the
    operation on an opaque bag of bits. It is possible to encode selecting
    the integer size fields, then perform a float operation on the data.

    The size agnostic instructions use the micro-op translator to convert
    the instructions into size specific versions. The translator calculates
    the number of architectural registers required then puts the appropriate number of instructions (up to eight) in the micro-op queue.

    There are therefore lots of vector instructions in the ISA: SIMD-type
    instructions, where the size of a vector is assumed to be one register
    and the element size is specified by the instruction, so separate
    instructions for 1, 2, 4, or 8 elements (for example, 50 instructions *
    four different sizes = 200 instructions); and also size-agnostic
    instructions, where the size/format comes indirectly from the VLEN
    (vector length) and VELSZ (vector lane size) registers.

    The size agnostic instructions allow writing a generic vector routine
    without needing to code the size of the operation. This avoids having a
    switch statement with a whole bunch of cases for different vector
    lengths. It also avoids having thousands of vector instructions. (50 instructions * 5 different lane sizes * 64 different lengths).

    The vectors are opaque blobs of bytes in my case. Size specs are in
    terms of bytes. The vectors are not a fixed length. They may (currently)
    use from 0 to 8 GPR registers. Hence the need to specify the length in
    use. While the length could be specified as part of the format for the instruction, that would require a wide instruction.

    *****

    .flic file format is supposed to be fast enough to allow use “on the
    fly”. But I just decompress all the frames into a matrix of bitmaps at
    game startup, then select the appropriate one based on direction and
    timing. With dozens of different sprites and hundreds of frames, I think
    it takes about 3GB of memory just for the sprite data. I had trouble
    running this on my machine a few years ago, but maybe with newer
    technology it could work.

    Experimented some with LZ4 and Huffman encoding. Huffman used for ECC logic.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Nov 22 14:29:23 2025
    From Newsgroup: comp.arch

    On 11/22/2025 11:45 AM, Robert Finch wrote:
    On 2025-11-22 5:54 a.m., BGB wrote:
    On 11/21/2025 9:09 PM, Robert Finch wrote:
    On 2025-11-21 2:36 p.m., BGB wrote:
    On 11/21/2025 7:31 AM, Michael S wrote:
    On Thu, 13 Nov 2025 19:04:18 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    Not really.
    That is, conversions are not blazingly fast, but still much better
    than any attempt to divide in any form of decimal. And helps to
    preserve your sanity.

    Are you trying to pull our proverbial leg here ?!?


    After reading paragraph 5.2 of IEEE-754-2008 Standard I am less
    sure in
    correctness of my above statement.
    For the case of exact division, preservation of mental sanity during
    fulfillment of requirements of this paragraph is far from simple,
    regardless of numeric base used in the process.


    One effectively needs to do a special extra-wide divide rather than
    just a normal integer divide, etc.


    But, yeah, fastest I had gotten in my experiments was radix-10e9
    long-division, but still not the fastest option.

    So, rough ranking, fast to slow:
       Radix-10e9 Long Divide (fastest)
       Newton-Raphson
       Radix-10 Long Divide
       Integer Shift-Subtract with converters (slowest).
         Fastest converter strategy ATM:
           Radix-10e9 double-dabble (Int->Dec).
           MUL-by-10e9 and ADD (Dec->Int)
             Fastest strategy: Unrolled Shifts and ADDs (*1).


    *1: While it is possible to perform a 128-bit multiply decomposing
    into multiplying 32-bit parts and adding them together; it was
    working out slightly faster in this case to do a fixed multiply by
    decomposing it into a series of explicit shifts and ADDs.

    Though, in this case, it is faster (and less ugly) to decompose this
    into a pattern of iteratively multiplying by smaller amounts. I had
    ended up using 4x multiply by 100 followed by multiply by 10, as
    while not the fastest strategy, needs less code than 2x multiply by
    10000 + multiply by 10. Most other patterns would need more shifts
    and adds.
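
    As a concrete example of one such step, multiplying a 128-bit value by
    10 using shifts and adds (x*10 = (x<<3) + (x<<1)); a minimal sketch
    over a hi/lo pair of 64-bit words, whereas the real code presumably
    works over a larger digit array and chains the *100 and *10 steps as
    described:

      #include <stdint.h>

      static void mul10_u128(uint64_t *hi, uint64_t *lo)
      {
          uint64_t h = *hi, l = *lo;
          uint64_t h8 = (h << 3) | (l >> 61), l8 = l << 3;  /* x << 3 */
          uint64_t h2 = (h << 1) | (l >> 63), l2 = l << 1;  /* x << 1 */
          uint64_t ls = l8 + l2;
          *hi = h8 + h2 + (ls < l8);          /* propagate the carry    */
          *lo = ls;
      }

    A multiply by 100 is the same idea with (x<<6) + (x<<5) + (x<<2).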

    In theory, x86-64 could do it better with multiply ops, but getting
    something optimal out of the C compilers is a bigger issue here it
    seems.


    Unexplored options:
       Radix 10e2 (byte)
       Radix 10e3 (word)
       Radix 10e4 (word)

    Radix 10e3 could have the closest to direct mapping to DPD.


    Looking at the decNumber code, it appears also to be Radix-10e9 based.
    They also do significant (ab)use of the C preprocessor.

    Apparently, "Why use functions when you can use macros?"...


    For the Radix-10e9 long-divide, part of the magic was in the
    function to scale a value by a radix value and subtract it from
    another array.

    Ended up trying a few options, fastest was to temporarily turn the
    operation into non-normalized 64-bit pieces and then normalize the
    result (borrow propagation, etc) as an output step.

    Initial attempt kept it normalized within the operation, which was
    slower.

    It was seemingly compiler-dependent whether it was faster to do a
    combined operation, or separate scale and subtract, but the margins
    were small. On MSVC the combined operation was slightly faster than
    the separate operations.

    ...



    Otherwise, after this, just went and fiddled with BGBCC some more,
    adding more options for its resource converter.

    Had before (for image formats):
       In: TGA, BMP (various), PNG, QOI, UPIC
       Out: BMP (various), QOI, UPIC

    Added (now):
       In: PPM, JPG, DDS
       Out: PNG, JPG, DDS (DXT1 and DXT5)

    Considered (not added yet):
       PCX
    Evaluated PCX, possible but not a clear win.


    Fiddled with making the PNG encoder less slow, mostly this was
    tweaking some parameters for the LZ searches. Initial settings were
    using deeper searches over initially smaller sliding windows (at
    lower compression levels); better in this case to do a shallower
    search over a max-sized sliding window.

    ATM, speed of PNG is now on-par with the JPG encoder (still one of
    the slower options).

    For simple use-cases, PNG still loses (in terms of both speed and
    compression) to 16-color BMP + LZ compression (LZ4 or RP2).
    Theoretically, indexed-color PNG exists, but is less widely supported.
    It is less space-efficient to represent 16-colors as Deflate-
    compressed color differences than it is to just represent the 4-bit
    RGBI values directly.

    However, can note that the RLE compression scheme (used by PCX) is
    clearly inferior to that of any sort of LZ compression.


    Comparably, PNG is also a more expensive format to decode as well
    (even vs JPEG).


    UPIC can partly address the use-cases of both PNG and JPEG while
    being cheaper to decode than either, but more niche as pretty much
    nothing supports it. Some of its design and properties being mostly
    JPEG-like.

    QOI is interesting, but suffers some similar limitations to PCX (its
    design is mostly about more compactly encoding color-differences in
    true-color images and otherwise only offers RLE compression).

    QOI is not particularly effective against images with little variety
    in color variation but lots of repeating patterns (I have a modified
    QOI that does a little better here, still not particularly effective
    with 16-color graphics though).


    Otherwise, also added up adding a small text format for image
    drawing commands.

    As a simplistic line oriented format containing various commands to
    perform drawing operations or composite images.
       creating a "canvas"
       setting the working color
       drawing lines
       bucket fill
       drawing text strings
       overlaying other images
       ...


    This is maybe (debatable) outside the scope of a C compiler, but
    could have use-cases for preparing resource data (nevermind if scope
    creep is partly also turning it into an asset-packer tool; where it
    is useful to make graphics/sounds/etc in one set of formats and then
    process and convert them into another set of files, usually inside
    of some sort of VFS image or similar).

    Design is much more simplistic than something like SVG and I am
    currently assuming its use for mostly hand-edited files. Unlike SVG,
    it also assumes drawing to a pixel grid rather than some more
    abstract coordinate space (so, its abstract model is more like "MS
    Paint" or similar); also SVG would suck as a human-edited format.

    Granted, one could argue maybe it could make scope that asset-
    processing is its own tool, then one converts it to a format that
    the compiler accepts (WAD2 or WAD4 in this case) prior to compiling
    the main binary (and/or not use resource data).

    Still, IMO, an internal WAD image is still better than the horrid/
    unusable mess that Windows had used (where anymore most people don't
    bother with the resource section much more than storing a program
    icon or similar...).

    But, realistically, one does still want to limit how much data they
    stick into the EXE.

    ...


    My forays into the world of graphics formats are pretty limited. I
    tend to use libraries already written by other people. I assume
    people a lot brighter than myself have come up with them.


    I usually wrote my own code for most things.

    Not dealt much with FLIC.


    In the past, whenever doing animated stuff, had usually used the AVI
    file format. A lot of time, the codecs were custom.

    Both AVI (and BMP) can be used to hold a wide range of image data,
    partly as a merit of using FOURCCs.

    Over the course of the past 15 years, have fiddled a lot here.



    A few of the longer-lived ones:
       BTIC1C (~ 2010):
         Was a modified version of RPZA with Deflate compression glued on.
       BTIC1H:
         Made use of multiple block formats,
           used STF+AdRice for entropy coding, and Paeth for color endpoints.
         Block formats, IIRC:
           4x4x2, 4x2x2, 2x4x2, 2x2x2, 4x4x1, 2x2x1, flat
             4x4x2: 32-bits for pixel selectors
             2x2x2: 8 bits for pixel selectors
       BTIC4B:
         Similar to BTIC1H, but a lot more complicated.
         Switched to 8x8 blocks, so had a whole lot of block formats.

    Shorter-Lived:
       BTIC2C: Similar design to MPEG;
       IIRC, used Huffman, but updated the Huffman tables for each I-Frame.
         This sort of thing being N/A with STF+AdRice,
           which starts from a clean slate every time.


    1C: Was used for animated textures in my first 3D engine.

    1H and 4B could be used for video, but were also used in my second 3D
    engine for sprites and textures (inside of a BMP packaging).


    My 3rd 3D engine is mostly using a mix of:
       DDS (mostly DXt1)
       BMP (mostly 16 color and 256 color).

    Though, in modern times, things like 16-color graphics are overlooked,
    in some cases they are still usable or useful (or at least sufficient).

    Typically, I had settled on a variant of the CGA/EGA color palette:
       0: 000000 (Black)
       1: 0000AA (Blue)
       2: 00AA00 (Green)
       3: 00AAAA (Cyan)
       4: AA0000 (Red)
       5: AA00AA (Magenta)
       6: AA5500 (Brown)
       7: AAAAAA (LightGray)
       8: 555555 (DarkGray)
       9: 5555FF (LightBlue)
       A: 55FF55 (LightGreen)
       B: 55FFFF (LightCyan)
       C: FF5555 (LightRed)
       D: FF55FF (Violet)
       E: FFFF55 (Yellow)
       F: FFFFFF (White)

    I am not sure why they changed it for the default 16-color assignments
    in VGA (eg, in the Windows 256-color system palette). Like, IMO, 00/AA
    and 55/FF works better for typical 16-color use-cases than 00/80 and
    00/FF.

    Sorta depends on use-case: Sometimes something works well as 16
    colors, other times it would fall on its face.



    Most other designs sucked so bad they didn't get very far.

    Where, I had ended up categorizing designs:
       BTIC1x: Designs mostly following an RPZA like path.
         1C: RPZA + Deflate
           Mostly built on 4x4x2 blocks (32 bits).
         1D, 1E: Byte-Encoding + Deflate
           Both sucked, quickly dropped.
           Both were like RPZA both with 48-bit 4:2:0 blocks.
           Neither great compression nor particularly fast.
             Deflate carries a high computational overhead.
         1F, 1G: No entropy coding (back to being like RPZA)
           Major innovations: Variable-size pixel blocks.
         1H: STF+AdRice
           Mostly final state of 1x line.
       BTIC2x: Designs mostly influenced by JPEG and MPEG.
         Difficult to make particularly fast.
         1A/1B: Modified MJPEG IIRC.
           Technically, also based on my BTJPEG format (*1).
         2C: IIRC, MPEG-like, Huffman-coded.
           Well influenced by both MPEG and the Xiph Theora codec.
         2D: Like 2C, but STF+AdRice
         2E: Like 2C, but byte stream based
           Was trying, mostly in vain, to make it faster.
           My attempts at this style of codecs were mostly, too slow.
         2F: Goes back to a more JPEG like core in some ways.
           Entropy and VLN scheme borrows more from Deflate.
              Though, uses a shorter limit on max symbol length (13 bit).
              13 bit simplifies things and makes decoding faster vs 15 bit.
           Abandons DCT and YCbCr in favor of Block-Haar and RCT.
             Later, UPIC did similar, just with STF+AdRice versus Huffman.
       BTIC3x:
         Attempts to hybridize 1x and 2x
         Nothing implemented, all designs too complicated to bother with.
       BTIC4x:
         4A: RPZA-like but with 8x8 blocks and multiple block sizes.
         4B: Like 4A but reusing the encoding scheme from 1H.
       BTIC5x:
         5A: Resembled a CRAM/QOI hybrid, but with 8-bit indexed colors.
           No entropy coding.
         5B: Like 5A, but used differential RGB555 (still QOI like).
           Major innovation was to use a 6-bit 64-entry pattern table.
           Optionally, can use per-frame RP2 or TKuLZ compression.
             Used if doing so results in a significant savings.


    *1: BTJPEG was an attempt at making a more advanced image format based
    on tweaking the existing T.81 JPEG format in a way that sorta worked
    in existing decoders. The more widespread use (and "not totally dead"
    feature) being to allow for an embedded alpha channel as essentially
    another monochrome JPEG inside the APP11 marker.

    I had tried a bunch of other ideas, but it turned into a mess of
    experimental tweaks, and most of it died off. The surviving variant is
    basically just T.81+JFIF with an optional alpha channel (ignored by a
    non-aware JPEG decoder).

    Some other (mostly dead) tweaks were things like:
    Allowing multi-layered images (more like Paint.NET's PDN or GIMP's
    XCF, mostly by nesting the images like a Matryoshka doll), where the
    top- level image would contain a view of all the layers rendered
    together;
    Allowing lossless images (similar to PNG) by using SERMS-RDCT and RCT
    (where SERMS-RDCT was a trick to make the DCT/IDCT transform exactly
    reversible, at the cost of speed).


    In the early 2010s, I was pretty bad about massively over-engineering
    everything.

    Later on, some ideas were reused in 2F and UPIC.
    Though, 2F and UPIC were much less over-engineered.

    Did specify possible use as video codecs, but thus far both were used
    only as still image formats.

    The major goal for UPIC was mostly be to address the core use-cases
    but also for the decoder to be small and relatively cheap. Still sorta
    JPEG competitive despite being primarily cost-optimized to try to make
    it more viable for use in programs running on the BJX2 core (where
    JPEG decoding is slow and expensive).

    As for Static Huffman vs STF+AdRice:
       Huffman:
         + Slightly faster for larger payloads
         + Optimal for a static distribution
         - Higher memory cost for decoding (storing decoder tables)
         - High initial setup cost (setting up decoder tables)
         - Higher constant overhead (storing symbol lengths)
         - Need to provision for storing Huffman tables
       STF+AdRice:
         + Very cheap initial setup (minimal context)
         + No need to transmit tables
         + Better compression for small data
         + Significantly faster than Adaptive Huffman
         + Significantly faster than Range Coding
         - Slower for large data and worse compression vs Huffman.

    Where, STF+AdRice is mostly:
       Have a table of symbols;
       Whenever a symbol is encoded, swap it forwards;
         Next time, it may potentially be encoded with a smaller index.
       Encode indices into table using Adaptive Rice Codes.
    Or, basically, using a lookup table to allow AdRice to pretend to be
    Huffman. Also reasonably fast and simple.


    Block-Haar vs DCT:
       + Block-Haar is faster and easily reversible (lossless);
       + Mostly a drop-in replacement for DCT/IDCT in the design.
       + Also faster than WHT (Walsh-Hadamard Transform)

    RCT vs YCbCr:
       RCT is both slightly faster, and also reversible;
       Had experimented with YCoCg, but saw no real advantage over RCT.



    The existence of BTIC5x was mostly because:
    BTIC1H and BTIC4B were too computationally demanding to do 320x200
    16Hz on a 50MHz BJX2 core;

    MS-CRAM was fast to decode, but needed too much bitrate (SDcard
    couldn't keep the decoder fed with any semblance of image quality).


    So, 5A and 5B were aimed at trying to give tolerable Q/bpp at more
    CRAM- like decoding speeds.

    Also, while reasonably effective (and fast desktop by PC standards),
    one other drawback of the 4B design (and to a lesser degree 1H) was
    the design being overly complicated (and thus the code is large and
    bulky).

    Part of this was due to having too many block formats.


    If my UPIC format were put into my older naming scheme, would likely
    be called 2G. Design is kinda similar to 2F, but replaces Huffman with
    STF+AdRice.


    As for RP2 and TKuLZ:
       RP2 is a byte-oriented LZ77 variant, like LZ4,
         but on-average compresses slightly better than LZ4.
       TKuLZ: Is sorta like a simplified/tuned Deflate variant.
         Uses a shorter max symbol length,
           borrows some design elements from LZ4.

    Can note, some past experiments with LZ decompression (at Desktop PC
    speeds), with entropy scheme, and len/dist limits:
       LZMA   : ~   35 MB/sec (Range Coding,   273/   4GB)
       Zstd   : ~   60 MB/sec (tANS,          16MB/ 128MB)
       Deflate: ~  175 MB/sec (Huffman,        258/ 32767)
       TKuLZ  : ~  300 MB/sec (Huffman,      65535/262143)
       RP2    : ~ 1100 MB/sec (Raw Bytes,      512/131071)
       LZ4    : ~ 1300 MB/sec (Raw Bytes,    16383/ 65535)


    While Zstd is claimed to be fast, my testing tended to show it closer
    to LZMA speeds than to Deflate, but it does give compression closer to
    LZMA. The tANS strategy seems to under-perform claims IME (and is
    notably slower than static Huffman). Also it is the most complicated
    design among these.


    A lot of my older stuff used Deflate, but often Deflate wasn't fast
    enough, so has mostly gotten displaced by RP2 in my uses.

    TKuLZ is an intermediate, generally faster than Deflate, had an option
    to get some speed (at the expense of compression) by using fixed
    length symbols in some cases. This can push it to around 500 MB/sec
    (at the expense of compression), hard to get much faster (or anywhere
    near RP2 or LZ4).

    Whether RP2 or LZ4 is faster seems to depend on target:
       BJX2 Core, RasPi, and Piledriver: RP2 is faster.
         Mostly things with in-order cores.
          And Piledriver, which behaved almost more like an in-order machine.
       Zen+, Core 2, and Core i7: LZ4 is faster.

    LZ4 needs typically multiple chained memory accesses for each LZ run,
    whereas for RP2, match length/distance and raw count are typically all
    available via a single memory load (then maybe a few bit-tests and
    conditional branches).

    ...



    A while ago I wrote a set of graphics routines in assembler that were
    quite fast. One format I have delt with is the .flic file format used
    to render animated graphics. I wanted to write my own CIV style game.
    It took a little bit of research and some reverse engineering.
    Apparently, the authors used a modified version of the format making
    it difficult to use the CIV graphics in my own game. I never could
    get it to render as fast as the game’s engine. I wrote the code for
    my game in C or C++, the original’s game engine code was likely in a
    different language.


    This sort of thing is almost inevitable with this stuff.

    Usually I just ended up using C for nearly everything.


    *****

    Been working on vectors for the ISA. I split the vector length
    register into eight sections to define up to eight different vector
    lengths. The first five are defined for integer, float, fixed,
    character, and address data types. I figure one may want to use
    vectors of different lengths at the same time, for instance to
    address data using byte offsets, while the data itself might be a
    float. The vector load / store instructions accept a data type to
    load / store and always use the address type for address calculations.

    There is also a vector lane size register split up the same way. I
    had thought of giving each vector register its own format for length
    and lane size. But thought that is a bit much, with limited use cases.

    I think I can get away with only two load and two store instructions.
    One to do a strided load and a second to do an vector indexed load
    (gather/scatter). The addressing mode in use is
    d[Rbase+Rindex*Scale]. Where Rindex is used as the stride when scalar
    or as a supplier of the lane offset when Rindex is a vector.

    Writing the RTL code to support the vector memory ops has been
    challenging. Using a simple approach ATM. The instruction needs to be
    re-issued for each vector lane accessed. Unaligned vector loads and
    stores are also allowed, adding some complexity when the operation
    crosses a cache-line boundary.

    I have the max vector length and max vector size constants returned
    by the GETINFO instruction which returns CPU specific information.


    I don't get it...

    Usually makes sense to treat vectors as opaque blobs of bits that are
    then interpreted as one of the available formats for a specific
    operation.

    In my case, I have a SIMD setup:
       2 or 4 elements in a GPR or GPR pair;
       Most other operations are just the normal GPR operations.

    ...


    Many vector machines (RISCV-V) have a way of specifying the vector
    length and element size, but it tends to be a global setting which may
    be overridden in some cases by specifying in the instruction. For Qupls
    it also allows setting based on the data type which is a bit of a
    misnomer, it would be better named data format. It is just three bits in
    the instruction that select one of the fields in the VLEN, VELSZ
    registers. The instruction itself specifies the data type for the
    operation on an opaque bag of bits. It is possible to encode selecting
    the integer size fields, then performing a float operation on the data.

    The size agnostic instructions use the micro-op translator to convert
    the instructions into size specific versions. The translator calculates
    the number of architectural registers required then puts the appropriate number of instructions (up to eight) in the micro-op queue.

    Therefore, there are lots of vector instructions in the ISA. SIMD type instructions where the size of a vector is assumed to be one register,
    and the element size is specified by the instruction. So, separate instructions for 1,2,4 or 8 elements. (For example 50 instructions *
    four different sizes = 200 instructions). Then also size agnostic instructions where the size/format comes indirectly from the VLEN
    (vector length) and VELSZ (vector lane size) registers.

    The size agnostic instructions allow writing a generic vector routine without needing to code the size of the operation. This avoids having a switch statement with a whole bunch of cases for different vector
    lengths. It also avoids having thousands of vector instructions. (50 instructions * 5 different lanes sizes * 64 different lengths).

    The vectors are opaque blobs of bytes in my case. Size specs are in
    terms of bytes. The vectors are not a fixed length. They may (currently)
    use from 0 to 8 GPR registers. Hence the need to specify the length in
    use. While the length could be specified as part of the format for the instruction, that would require a wide instruction.


    I am not personally a fan of RV-V, as it seems too complicated and
    expensive.


    I had taken a different approach towards adding SIMD to RISC-V:
    The instructions that operated on narrower types were implicitly
    redefined to operate on SIMD vectors rather than a single narrower
    value (an operation may be understood as scalar if NaN boxed or similar).

    The two remaining rounding modes were redefined to operate on 128-bit
    vectors, defined as register pairs (serving as RNE or RTZ on said
    vectors).

    The DYN rounding mode was defined as scalar-only (only operates on a
    single value and produces NaN boxed results, also supports the IEEE
    emulation mode). This is compatible with GCC-like use of the FPU, where
    GCC tends to always use DYN instructions, which then relies on FPU
    control registers for the rounding mode, and updates status flags (which
    in this case is not done for the instructions using fixed modes).

    The scalar converter ops were silently modified into SIMD converters
    where appropriate.


    A few other instructions were added to help with some SIMD tasks, like
    vector shuffles, etc.


    Annoyingly, there is a split between the F/D extensions and P extension
    in that the P extension operates in GPRs, so can't directly reuse P
    extension encodings on F registers (effectively need to define new
    encodings to map some of the P instructions over to F registers).

    But, even then, it is only grabbing a limited set of instructions from
    P, as P had gone down a combinatorial explosion path and defined way too
    many instructions.


    *****

    .flic file format is supposed to be fast enough to allow use “on the fly”. But I just decompress all the frames into a matrix of bitmaps at game startup, then select the appropriate one based on direction and
    timing. With dozens of different sprites and hundreds of frames, I think
    it takes about 3GB of memory just for the sprite data. I had trouble
    running this on my machine a few years ago, but maybe with newer
    technology it could work.


    Hmm...

    When I was doing animated textures in my first 3D engine, with my '1C'
    codec, it was effectively:
    Transcode frame blocks to DXT1;
    Upload the compressed texture blocks to OpenGL using the same texture
    numbers.
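
    Roughly, the upload step would have looked something like this
    (assuming the usual OpenGL S3TC path via glCompressedTexSubImage2D,
    with the texture already allocated and any extension/loader setup
    handled elsewhere; the function and parameter names are placeholders,
    not the engine's actual code):

      #include <GL/gl.h>
      #include <GL/glext.h>

      static void upload_dxt1_frame(GLuint texnum, int w, int h,
                                    const void *dxt1_blocks)
      {
          /* DXT1 is 8 bytes per 4x4 block */
          GLsizei size = ((w + 3) / 4) * ((h + 3) / 4) * 8;
          glBindTexture(GL_TEXTURE_2D, texnum);
          glCompressedTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                                    GL_COMPRESSED_RGB_S3TC_DXT1_EXT,
                                    size, dxt1_blocks);
      }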


    Main issue I ran into here was that it doesn't work well for large textures.


    IIRC, for this engine had used a 4096x4096 atlas for the main
    block-surface textures (allowing 256x256 for each block).

    If trying to upload a 4096x4096 texture at 10Hz, whole PC bogged down (including mouse, which started "submarining", etc). So, this experiment
    was very short-lived (basically as soon as I could get it exited, which
    was harder when basically the whole OS ground to a halt).


    So, had to use multiple texture numbers for the main animated texture in
    the main animated-texture atlas (advancing the sequence at 10Hz).

    Theoretically, should have been pushing ~ 336 MB/sec to the GPU for this
    (DXT5 with mipmaps), but something was clearly not happy here.

    So, alas, even if one can get gigapixel/second for decoding, doesn't necessarily mean one can push it to the GPU.


    Where, the idea for the atlas is, rather than giving each of the block textures its own texture, one can instead create a much bigger texture
    (say, with 16x16 sub-textures) and then consolidate everything using the
    same atlas into the same vertex array (so fewer draw calls).



    But, if streaming a few 256x256 textures or similar, it worked well enough.

    Some special blocks, like torches and fires, had used their own video
    textures and were not tied to the main animated-texture atlas.

    Mostly, all of this was being done as RIFF AVI files.


    Had experimentally transcoded and streamed full video to blocks, mostly
    using a few videos I scavenged off YouTube as test cases.

    So, errm, a video example of these experiments:
    https://www.youtube.com/watch?v=64LL0GdrxQg

    Errm, yeah, I was a bit into MLP at the time...

    IIRC, the audio effect here was that these blocks could have virtual "speakers" that would stream the audio from the corresponding video
    stream if the player was in range. Audio was IIRC mostly using a tweaked version of IMA ADPCM.



    As can be noted, unlike a conventional animated texture, there can be
    audio, and an arbitrary length.

    IIRC, because of the inability to stream the main animated-texture
    atlas, it was limited to something like 16 frames (or, around 1.6
    seconds of loop).

    IIRC, 10Hz was more standard for animated textures from past games; but
    had usually used 16Hz for full motion video.



    Along this path (decoding to DXT1), both 1C and 1H could get in the area
    of around 600 megapixels/second (my later 4B design could exceed 1 gigapixel/second).

    Can note that 1C (after unpacking the Deflate compression) used an
    encoding scheme sorta like (if looking at bytes), IIRC:
      00..7F: First byte of a raw block, 8 bytes.
        Consisted of two RGB555 values, big endian, and a 4x4x2 pixel block.
        For DXT1, needed to be turned to RGB565 LE.
        Pixel block was also BE and different from DXT1,
          but an easy enough fix (with lookup tables), and shift+or.
        In RPZA, could escape to 16x RGB555 colors, but was not used in 1C.
      80..9F: 1-32 skip blocks (kept as-is from last frame)
      A0..BF: Flat Color, RGB555 value.
      C0..DF: Two RGB555 colors, 1-32 4x4 blocks sharing the same colors.
      E0..FF: Used for something...
        I forget ATM, maybe skip+translate.
        At top-level, would indicate the use of TLV packaging.
        Otherwise, frame would be decoded as a raw command stream.
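
    A skeletal classifier for those command bytes, reconstructed from the
    ranges above (payload parsing is elided since the exact field layouts
    are not spelled out here; this is a sketch, not the original decoder):

      #include <stdint.h>

      typedef enum {
          CMD_RAW_BLOCK,   /* 00..7F: 8-byte raw block (2x RGB555 BE +
                                       4x4x2 pixel bits)                 */
          CMD_SKIP,        /* 80..9F: 1..32 blocks kept from last frame  */
          CMD_FLAT,        /* A0..BF: flat-color block, RGB555 value     */
          CMD_SHARED_PAIR, /* C0..DF: 1..32 4x4 blocks sharing 2 colors  */
          CMD_OTHER        /* E0..FF: skip+translate / TLV (per above)   */
      } Btic1cCmd;

      static Btic1cCmd btic1c_classify(uint8_t op)
      {
          if (op <= 0x7F) return CMD_RAW_BLOCK;
          if (op <= 0x9F) return CMD_SKIP;
          if (op <= 0xBF) return CMD_FLAT;
          if (op <= 0xDF) return CMD_SHARED_PAIR;
          return CMD_OTHER;
      }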

    IIRC, there was an option to encode a separate alpha layer (for decoding
    to DXT5). Another option was to encode the alpha similar to DXT1, namely
    via color endpoint ordering, IIRC:
    C0<C1: Opaque
    C0>C1: 1-bit transparency.
    Or, no alpha of either sort, in which case it was opaque.

    No mipmaps were encoded here. Strategy was to use a quick/dirty approach
    to rebuild mipmaps on the fly.


    The followups, 1D and 1E, were intended to try to give better fidelity
    when decoding to BC7, but mostly failed to be all that fast.


    By 1H, had switched to one-off command-tag codes, with colors being delta-coded and blocks reusing prior colors. Worked OK as it was built
    around Rice coding everything. This format was significantly more
    complicated.

    For 5A/5B, had instead used a unary-coding scheme to encode commands
    (similar to both QOI and RP2).



    By the second engine, I had mostly stopped using video textures, and was instead using shader effects to do some animations (with a static
    atlas). In the shaders, I ended up mostly using dithering for the alpha
    as I sorta liked this effect at the time over the more traditional translucency effects.


    IIRC, was using a 2048x2048 atlas in this case (for 128x128 pixels per
    block).

    For my 3rd engine, it dropped again to 1024x1024, with each block
    texture limited to 64x64. No shaders, as this engine was written to
    assume being limited to roughly OpenGL 1.3 features.

    Instead it redraws all the water blocks and similar using Quake-style ST warping (but only for blocks near the camera).


    If I were to bring back video textures, could maybe use 1C or 5B as a
    base, though if using 1C may modify it to allow for RP2 and TKuLZ,
    mostly because of the whole "Deflate is kinda slow" issue.

    Ironically, it seems the only reason 1H seemed fast may have been
    because it was faster than Deflate; but by my current standards Deflate
    isn't all that fast.



    Experimented some with LZ4 and Huffman encoding. Huffman used for ECC
    logic.


    Yeah. LZ4 works.

    I am mostly using it for PE/COFF compression, as it seems to do much
    better in this case.

    For data compression, mostly ended up with my own custom RP2 design as
    it mostly beats LZ4 in terms of compression, and is similarly fast.

    Which is better mostly depends on the data in question though...


    In a few cases, I had STF+AdRice based LZ compressors, but these mostly
    make sense if the file being compressed is fairly small (lower end of
    the kB range).

    But, RP2 also works well for small data.

    Whereas, both TKuLZ and Deflate need larger data (say, over 16K-64K) to
    be effective (if compressing chunks of data in single-digit kB or less, Deflate kinda sucks...).





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Nov 23 03:20:10 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-11 2:30 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Typical process for NaN boxing is to set the high order bits of the
    value which causes the value to appear to be a NaN at higher precision.

    Any FP value representable in lower precision can be exactly represented
    in higher precision.

    I have been thinking about using some of the high order bits of the NaN
    (eg bits 32 to 51) to indicate the precision of the boxed value.

    When My 66000 generates a NaN it inserts the cause in the 3 HoBs and inserts IP in the LoBs. Nothing prevents you from overwriting the NaN,
    but I thought it was best to point at the causing-instruction and an encoded "why" the nan was generated. The cause is a 3-bit index to the
    7 defined IEEE exceptions.

    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.

    Suggest you read:: https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.

    There are rules, when more than one NaN is an operand to an instruction, designed to leave the more important NaN as the result. {Where more important is generally the first to be generated.}

    Hopefully the package follows the rules correctly. NaN operation is one thing not tested yet.

    This
    would allow detection of the use of a lower precision value in
    arithmetic. Suppose a convert from single to double precision is being
    done, but the value to be converted is only half precision. If it were
    indicated by the NaN software might be able to fix the result.

    I think it is better to fix the SW that thinks a (half) is a (float).

    It would be better, but some software is so complex it may be unknown
    what values are coming in. The SW does not really need to croak if it's
    a lower precision value, as they are always representable in a higher
    precision.
    I also
    preserve the sign bit of the number in the NaN box.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Nov 22 23:16:17 2025
    From Newsgroup: comp.arch

    On 2025-11-22 10:20 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-11 2:30 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Typical process for NaN boxing is to set the high order bits of the
    value which causes the value to appear to be a NaN at higher precision.

    Any FP value representable in lower precision can be exactly represented
    in higher precision.

    I have been thinking about using some of the high order bits of the NaN
    (eg bits 32 to 51) to indicate the precision of the boxed value.

    When My 66000 generates a NaN it inserts the cause in the 3 HoBs and
    inserts IP in the LoBs. Nothing prevents you from overwriting the NaN,
    but I thought it was best to point at the causing-instruction and an
    encoded "why" the nan was generated. The cause is a 3-bit index to the
    7 defined IEEE exceptions.

    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.

    Suggest you read:: https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.


    Okay, it sounds like there are good reasons to use the HoBs. But I think
    it is only when converting precisions that it makes a difference. I have
    the float package moving the LoBs of a larger precision to the LoBs of
    the lower precision if a NaN (or infinity) is present. I do not think
    this consumes any more logic. It looks like just wires. It looks to be a
    three bit mux on the low order bits going the other way.

    I suppose I could code the package to accept NaN values either way.

    The following NaN values are in use.

    `define QSUBINFD   63'h7FF0000000000001 // - infinity - infinity
    `define QINFDIVD   63'h7FF0000000000002 // - infinity / infinity
    `define QZEROZEROD 63'h7FF0000000000003 // - zero / zero
    `define QINFZEROD  63'h7FF0000000000004 // - infinity X zero
    `define QSQRTINFD  63'h7FF0000000000005 // - square root of infinity
    `define QSQRTNEGD  63'h7FF0000000000006 // - square root of negative number


    There are rules when more than 1 NaN are an operand to an instruction
    designed to leave the more important NaN as the result. {Where more
    important is generally the first to be generated.}

    Hopefully the package follows the rules correctly. NaN operation is one
    thing not tested yet.

    This would allow detection of the use of a lower precision value in
    arithmetic. Suppose a convert from single to double precision is being
    done, but the value to be converted is only half precision. If it were
    indicated by the NaN software might be able to fix the result.

    I think it is better to fix the SW that thinks a (half) is a (float).

    It would be better, but some software is so complex it may be unknown
    the values coming in. The SW does not really need to croak if its a
    lower precision value as they are always represent-able in a higher
    precision.>>
    I also
    preserve the sign bit of the number in the NaN box.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Nov 22 23:36:47 2025
    From Newsgroup: comp.arch

    On 2025-11-22 11:16 p.m., Robert Finch wrote:
    On 2025-11-22 10:20 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-11 2:30 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Typical process for NaN boxing is to set the high order bits of the
    value which causes the value to appear to be a NaN at higher
    precision.

    Any FP value representable in lower precision can be exactly
    represented
    in higher precision.

    I have been thinking about using some of the high order bits of the NaN
    (eg bits 32 to 51) to indicate the precision of the boxed value.

    When My 66000 generates a NaN it inserts the cause in the 3 HoBs and
    inserts IP in the LoBs. Nothing prevents you from overwriting the NaN,
    but I thought it was best to point at the causing-instruction and an
    encoded "why" the nan was generated. The cause is a 3-bit index to the
    7 defined IEEE exceptions.

    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.

    Suggest you read::
    https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/
    nan-propagation.pdf
    For conversation about LoBs versus HoBs.


    Okay, it sounds like there are good reasons to use the HoBs. But I think
    it is only when converting precisions that it makes a difference. I have
    the float package moving the LoBs of a larger precision to the LoBs of
    the lower precision if a NaN (or infinity) is present. I do not think
    this consumes any more logic. It looks like just wires. It looks to be a three bit mux on the low order bits going the other way.

    I suppose I could code the package to accept NaN values either way.

    The following NaN values are in use.

    `define QSUBINFD   63'h7FF0000000000001 // - infinity - infinity
    `define QINFDIVD   63'h7FF0000000000002 // - infinity / infinity
    `define QZEROZEROD 63'h7FF0000000000003 // - zero / zero
    `define QINFZEROD  63'h7FF0000000000004 // - infinity X zero
    `define QSQRTINFD  63'h7FF0000000000005 // - square root of infinity
    `define QSQRTNEGD  63'h7FF0000000000006 // - square root of negative number


    When converting a NaN from higher to lower precision, the float package preserves both the low order four bits and as many high order bits of
    the NaN that will fit. The middle bits are dropped.
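
    A minimal sketch of that narrowing rule in C terms, assuming a
    binary64 -> binary32 conversion (52-bit payload down to 23 bits: keep
    the low 4 bits plus the top 19, drop the middle; names are
    illustrative, not the float package's actual code):

      #include <stdint.h>

      /* m52 is assumed to hold only the 52-bit NaN payload */
      static uint32_t narrow_nan_payload(uint64_t m52)
      {
          uint32_t lo4  = (uint32_t)(m52 & 0xF);             /* low 4 bits  */
          uint32_t hi19 = (uint32_t)((m52 >> 33) & 0x7FFFF); /* top 19 bits */
          return (hi19 << 4) | lo4;                          /* 23-bit payload */
      }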

    There are rules when more than 1 NaN are an operand to an instruction
    designed to leave the more important NaN as the result. {Where more
    important is generally the first to be generated.}

    Hopefully the package follows the rules correctly. NaN operation is one
    thing not tested yet.
    thing not tested yet.


    This
    would allow detection of the use of a lower precision value in
    arithmetic. Suppose a convert from single to double precision is being
    done, but the value to be converted is only half precision. If it were
    indicated by the NaN software might be able to fix the result.

    I think it is better to fix the SW that thinks a (half) is a (float).

    It would be better, but some software is so complex it may be unknown
    the values coming in. The SW does not really need to croak if its a
    lower precision value as they are always represent-able in a higher
    precision.
          I also
    preserve the sign bit of the number in the NaN box.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 23 07:04:37 2025
    From Newsgroup: comp.arch

    On 2025-11-22 11:36 p.m., Robert Finch wrote:
    On 2025-11-22 11:16 p.m., Robert Finch wrote:
    On 2025-11-22 10:20 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-11 2:30 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Typical process for NaN boxing is to set the high order bits of the >>>>>> value which causes the value to appear to be a NaN at higher
    precision.

    Any FP value representable in lower precision can be exactly
    represented
    in higher precision.

    I have been thinking about using some of the high order bits of
    the NaN
    (eg bits 32 to 51) to indicate the precision of the boxed value.

    When My 66000 generates a NaN it inserts the cause in the 3 HoBs and
    inserts IP in the LoBs. Nothing prevents you from overwriting the NaN,
    but I thought it was best to point at the causing-instruction and an
    encoded "why" the nan was generated. The cause is a 3-bit index to the
    7 defined IEEE exceptions.

    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.

    Suggest you read::
    https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.


    Okay, it sounds like there are good reasons to use the HoBs. But I
    think it is only when converting precisions that it makes a
    difference. I have the float package moving the LoBs of a larger
    precision to the LoBs of the lower precision if a NaN (or infinity) is
    present. I do not think this consumes any more logic. It looks like
    just wires. It looks to be a three bit mux on the low order bits going
    the other way.

    I suppose I could code the package to accept NaN values either way.

    The following NaN values are in use.

    `define QSUBINFD    63'h7FF0000000000001    // - infinity - infinity
    `define QINFDIVD    63'h7FF0000000000002    // - infinity / infinity
    `define QZEROZEROD  63'h7FF0000000000003    // - zero / zero
    `define QINFZEROD   63'h7FF0000000000004    // - infinity X zero
    `define QSQRTINFD   63'h7FF0000000000005    // - square root of infinity
    `define QSQRTNEGD   63'h7FF0000000000006    // - square root of negative number


    When converting a NaN from higher to lower precision, the float package preserves both the low order four bits and as many high order bits of
    the NaN that will fit. The middle bits are dropped.

    There are rules when more than 1 NaN are an operand to an instruction
    designed to leave the more important NaN as the result. {Where more
    important is generally the first to be generated.}

    Hopefully the package follows the rules correctly. NaN operation is one
    thing not tested yet.

    This would allow detection of the use of a lower precision value in
    arithmetic. Suppose a convert from single to double precision is being
    done, but the value to be converted is only half precision. If it were
    indicated by the NaN, software might be able to fix the result.

    I think it is better to fix the SW that thinks a (half) is a (float).
    It would be better, but some software is so complex it may be unknown
    what values are coming in. The SW does not really need to croak if it's a
    lower precision value, as they are always representable in a higher
    precision.

    I also preserve the sign bit of the number in the NaN box.



    Added a NaN tracing facility as a core option. It can only log two NaNs
    per clock to a buffer, possibly slowing the core down. The NaN addresses
    are logged in order to a 512-entry buffer. The core already tracks
    exceptions so it was not too bad to add a NaN flag to the re-order buffer.
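
    A rough software model of what such a trace option records, assuming
    nothing more than a 512-entry ring buffer of instruction addresses (the
    names and interface here are purely illustrative):

        #include <stdint.h>

        #define NAN_TRACE_ENTRIES 512

        typedef struct {
            uint64_t addr[NAN_TRACE_ENTRIES];  /* addresses of NaN-producing insns */
            unsigned head;                     /* next slot, wraps around          */
        } nan_trace_t;

        /* Called for each retired NaN (at most two per modeled clock);
           the oldest entries are silently overwritten. */
        static void nan_trace_log(nan_trace_t *t, uint64_t insn_addr) {
            t->addr[t->head] = insn_addr;
            t->head = (t->head + 1) % NAN_TRACE_ENTRIES;
        }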

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Nov 23 16:32:46 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.

    Why would a CS graduate need to know about pipelines?

    Why would a chemical engineer know the basics of heat transfer?
    They are going to use commercial programs to design them anyway.

    Why would anybody know the basics of what they are doing?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sun Nov 23 16:51:19 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.

    Why would a CS graduate need to know about pipelines?

    So they can properly simulate a pipelined processor?

    When I got my MSCS, computer engineering courses were
    required, including basic logic elements and overviews
    of processor design.


    Why would a chemical engineer know the basics of heat transfer?
    They are going to use commercial programs to design them anyway.

    Why would anybody know the basics of what they are doing?

    Indeed, a programmer that doesn't understand the underlying
    hardware is crippled.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 23 17:25:12 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.

    Why would a CS graduate need to know about pipelines?

    So they can properly simulate a pipelined processor?

    Sure, if a CS graduate works in an application area, they need to
    learn about that application area, whatever it is.

    But why would knowledge about processor pipelines be part of their CS curriculum?

    When I got my MSCS, computer engineering courses were
    required, including basic logic elements and overviews
    of processor design.

    For me, too. I even learned something about processor pipelines, in a specialized elective course.

    Why would anybody know the basics of what they are doing?

    Processor pipelines are not the basics of what a CS graduate is doing.
    They are an implementation detail in computer engineering.

    Indeed, a programmer that doesn't understand the underlying
    hardware is crippled.

    I certainly have a lot of sympathy for that point of view. However,
    there are a lot of abstractions whose cost a programmer should
    understand if they intend to write efficient code, e.g., the memory
    hierarchy or system calls.

    But CPU pipelines have the nice property that they are mostly
    transparent. What you need to understand for performance is the
    latency of various instructions, and the costs of branch
    misprediction. I teach a course "Efficient programs", and I do not
    discuss hardware pipelining, but I do explain these performance characteristics.

    If anything, understanding OoO execution and its effect on
    performance is more relevant. But looking at the dearth of textbooks,
    and the fact that Henry Wong did his thesis on his own initiative,
    even among computer engineering professors that is a topic that is of
    little interest.

    Back to programmers: There is also the other POV that programmers
    should never concern themselves with low-level details and should
    always leave that to compilers, which supposedly can do all those
    things better than programmers (I call that the compiler supremacy
    position). Compiler supremacy is wishful thinking, but wishful
    thinking has a strong influence in the world.

    A few more examples where compilers are not as good as even I expected:

    Just today, I compiled

    u4 = u1/10;
    u3 = u1%10;

    (plus some surrounding code) with gcc-14 in three contexts. Here's
    the code for two of them (the third one is similar to the second one):

    movabs $0xcccccccccccccccd,%rax movabs $0xcccccccccccccccd,%rsi
    sub $0x8,%r13 mov %r8,%rax
    mul %r8 mov %r8,%rcx
    mov %rdx,%rax mul %rsi
    shr $0x3,%rax shr $0x3,%rdx
    lea (%rax,%rax,4),%rdx lea (%rdx,%rdx,4),%rax
    add %rdx,%rdx add %rax,%rax
    sub %rdx,%r8 sub %rax,%r8
    mov %r8,0x8(%r13) mov %rcx,%rax
    mov %rax,%r8 mul %rsi
    shr $0x3,%rdx
    mov %rdx,%r9

    The major difference is that in the left context, u3 is stored into
    memory (at 0x8(%r13)), while in the right context, it stays in a
    register. In the left context, gcc managed to base its computation of
    u1%10 on the result of u1/10; in the right context, gcc first computes
    u1%10 (computing u1/10 as part of that), and then computes u1/10
    again.
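
    What both sequences compute is the usual reciprocal-multiplication
    division: 0xcccccccccccccccd is ceil(2^67/10), mul keeps the high 64
    bits of the product, shr $3 finishes the divide, and the lea/add/sub
    form u1 - 10*(u1/10). A C sketch of the same arithmetic (relies on the
    GCC/Clang __int128 extension):

        #include <stdint.h>

        /* Unsigned divide/modulo by 10 the way the compiled code does it. */
        static uint64_t udiv10(uint64_t u1) {
            const uint64_t m = 0xCCCCCCCCCCCCCCCDull;         /* ceil(2^67 / 10) */
            uint64_t hi = (uint64_t)(((unsigned __int128)u1 * m) >> 64);
            return hi >> 3;                                    /* u1 / 10         */
        }

        static uint64_t umod10(uint64_t u1) {
            return u1 - 10 * udiv10(u1);                       /* u1 % 10         */
        }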

    Then I looked if there is some unsigned equivalent of ldiv(), but
    there is not, supposedly because the compilers manage to combine the /
    and % operations by themselves.

    I also found that the resulting code was slower on a Rocket Lake than
    a variant of the code that passes the divisor in a variable, but
    that's ok: On Skylake and earlier CPUs division is so slow that the
    replacement code is probably faster.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Nov 23 20:13:25 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-22 10:20 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-11 2:30 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Typical process for NaN boxing is to set the high order bits of the
    value which causes the value to appear to be a NaN at higher precision. >>>
    Any FP value representable in lower precision can be exactly represented >>> in higher precision.

    I have been thinking about using some of the high order bits of the NaN >>>> (eg bits 32 to 51) to indicate the precision of the boxed value.

    When My 66000 generates a NaN it inserts the cause in the 3 HoBs and
    inserts IP in the LoBs. Nothing prevents you from overwriting the NaN, >>> but I thought it was best to point at the causing-instruction and an
    encoded "why" the nan was generated. The cause is a 3-bit index to the >>> 7 defined IEEE exceptions.

    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.

    Suggest you read:: https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.


    Okay, it sounds like there are good reasons to use the HoBs. But I think
    it is only when converting precisions that it makes a difference. I have
    the float package moving the LoBs of a larger precision to the LoBs of
    the lower precision if a NaN (or infinity) is present. I do not think
    this consumes any more logic. It looks like just wires. It looks to be a three bit mux on the low order bits going the other way.

    The other part of the paper's reasoning is that if you want to insert
    some portion of IP in NaN, doing it bit-reversed enables conversions
    to smaller and larger to lose as few bits as possible. The realization
    was a surprise to me (yesterday).
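
    A hedged C sketch of that idea, under an assumed layout: 3-bit cause in
    the payload HoBs, the remaining 49 payload bits holding the low bits of
    the IP bit-reversed, so that narrowing (which keeps the high payload
    bits) discards the least useful high address bits first:

        #include <stdint.h>

        static uint64_t bitrev64(uint64_t x) {
            uint64_t r = 0;
            for (int i = 0; i < 64; i++) { r = (r << 1) | (x & 1); x >>= 1; }
            return r;
        }

        /* Assumed layout: payload[51:49] = cause, payload[48:0] = reversed IP. */
        static uint64_t nan_with_ip(unsigned cause, uint64_t ip) {
            uint64_t rev_ip = bitrev64(ip) >> 15;   /* 49 bits, IP bit 0 on top */
            return 0x7FF0000000000000ull
                 | ((uint64_t)(cause & 7) << 49)
                 | rev_ip;
        }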

    I suppose I could code the package to accept NaN values either way.

    The following NaN values are in use.

    `define QSUBINFD 63'h7FF0000000000001 // - infinity - infinity
    `define QINFDIVD 63'h7FF0000000000002 // - infinity / infinity
    `define QZEROZEROD 63'h7FF0000000000003 // - zero / zero
    `define QINFZEROD 63'h7FF0000000000004 // - infinity X zero
    `define QSQRTINFD 63'h7FF0000000000005 // - square root of infinity
    `define QSQRTNEGD 63'h7FF0000000000006 // - square root of negative number


    There are rules when more than 1 NaN are an operand to an instruction
    designed to leave the more important NaN as the result. {Where more
    important is generally the first to be generated.}

    Hopefully the package follows the rules correctly. NaN operation is one
    thing not tested yet.

    This would allow detection of the use of a lower precision value in
    arithmetic. Suppose a convert from single to double precision is being
    done, but the value to be converted is only half precision. If it were
    indicated by the NaN, software might be able to fix the result.

    I think it is better to fix the SW that thinks a (half) is a (float).

    It would be better, but some software is so complex it may be unknown
    what values are coming in. The SW does not really need to croak if it's a
    lower precision value, as they are always representable in a higher
    precision.

    I also preserve the sign bit of the number in the NaN box.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Nov 23 20:15:47 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.

    Why would a CS graduate need to know about pipelines?

    Why would a chemical engineer know the basics of heat transfer?
    They are going to use commercial programs to design them anyway.

    Why would anybody know the basics of what they are doing?

    Because scientists and engineers actually want to know about things
    they work-on and say--unlike politicians. ...
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Nov 23 20:16:39 2025
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:


    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.

    Why would a CS graduate need to know about pipelines?

    So they can properly simulate a pipelined processor?

    When I got my MSCS, computer engineering courses were
    required, including basic logic elements and overviews
    of processor design.


    Why would a chemical engineer know the basics of heat transfer?
    They are going to use commercial programs to design them anyway.

    Why would anybody know the basics of what they are doing?

    Indeed, a programmer that doesn't understand the underlying
    hardware is crippled.

    So, only 95% of programmers are crippled ?!?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Nov 23 20:46:23 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Just today, I compiled

    u4 = u1/10;
    u3 = u1%10;


    (plus some surrounding code) with gcc-14 in three contexts. Here's
    the code for two of them (the third one is similar to the second one):

    Care to present a self-contained example? Otherwise, your
    example and its analysis are meaningless to the reader.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 23 22:40:02 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Just today, I compiled

    u4 = u1/10;
    u3 = u1%10;


    (plus some surrounding code) with gcc-14 in three contexts. Here's
    the code for two of them (the third one is similar to the second one):

    Care to present a self-contained example? Otherwise, your
    example and its analysis are meaningless to the reader.

    I doubt that a self-contained example will be more meaningful to all
    but the most determined readers, but anyway, the preprocessed C code is at

    https://www.complang.tuwien.ac.at/anton/tmp/engine-fast.i

    You can search for "/10" to get to the three contexts. The compiler
    call is:

    gcc -I./../arch/amd64 -I. -Wall -g -O2 -fomit-frame-pointer -pthread -DHAVE_CONFIG_H -DFORCE_LL -DFORCE_REG -DDEFAULTPATH='".:/usr/local/lib/gforth/site-forth:/usr/local/lib/gforth/0.7.9_20251119:/usr/local/share/gforth/0.7.9_20251119:/usr/share/gforth/site-forth:/usr/local/share/gforth/site-forth"' -c -fno-gcse -fcaller-saves -fno-defer-pop -fno-inline -fwrapv -fno-strict-aliasing -fno-cse-follow-jumps -fno-reorder-blocks -fno-reorder-blocks-and-partition -fno-toplevel-reorder -falign-labels=1 -falign-loops=1 -falign-jumps=1 -fno-delete-null-pointer-checks -fcf-protection=none -fno-tree-vectorize -fno-lto -pthread -DENGINE=2 -fPIC -DPIC -o libengine-fast2-ll-reg.S -S engine-fast.i

    The output of gcc-14 is at

    https://www.complang.tuwien.ac.at/anton/tmp/libengine-fast2-ll-reg.S

    You can find the three contexts by searching for "-3689348814741910323".

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 23 23:58:16 2025
    From Newsgroup: comp.arch

    On 2025-11-23 3:13 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-22 10:20 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-11 2:30 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Typical process for NaN boxing is to set the high order bits of the >>>>>> value which causes the value to appear to be a NaN at higher precision. >>>>>
    Any FP value representable in lower precision can be exactly represented >>>>> in higher precision.

    I have been thinking about using some of the high order bits of the NaN >>>>>> (eg bits 32 to 51) to indicate the precision of the boxed value.

    When My 66000 generates a NaN it inserts the cause in the 3 HoBs and >>>>> inserts IP in the LoBs. Nothing prevents you from overwriting the NaN, >>>>> but I thought it was best to point at the causing-instruction and an >>>>> encoded "why" the nan was generated. The cause is a 3-bit index to the >>>>> 7 defined IEEE exceptions.

    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.

    Suggest you read::
    https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.


    Okay, it sounds like there are good reasons to use the HoBs. But I think
    it is only when converting precisions that it makes a difference. I have
    the float package moving the LoBs of a larger precision to the LoBs of
    the lower precision if a NaN (or infinity) is present. I do not think
    this consumes any more logic. It looks like just wires. It looks to be a
    three bit mux on the low order bits going the other way.

    The other part of the paper's reasoning is that if you want to insert
    some portion of IP in NaN, doing it bit-reversed enables conversions
    to smaller and larger to lose as few bits as possible. The realization
    was a surprise to me (yesterday).


    It is probably not possible to embed enough IP information in smaller floating-point formats (<=16-bit) to be worthwhile. For 32-bit floats
    only about 18 bits of the address can be stored. It looks like different formats are going to handle NaNs differently, which I find somewhat undesirable.

    I am now leaning towards allocating four HOB bits to indicate the NaN
    cause, and then filling the rest of the payload with a bit reversed
    address. There should be some instruction to extract the NaN cause and address.
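
    A hedged C sketch of what such an extract operation would compute, under
    the layout just described (4-bit cause in the payload HoBs, the remaining
    48 payload bits holding the bit-reversed low address bits; the field
    widths are assumptions):

        #include <stdint.h>

        static uint64_t bitrev64(uint64_t x) {
            uint64_t r = 0;
            for (int i = 0; i < 64; i++) { r = (r << 1) | (x & 1); x >>= 1; }
            return r;
        }

        /* Assumed binary64 layout: payload[51:48] = cause,
           payload[47:0] = bit-reversed low 48 address bits. */
        static unsigned nan_cause(uint64_t nanbits) {
            return (unsigned)((nanbits >> 48) & 0xF);
        }

        static uint64_t nan_address(uint64_t nanbits) {
            uint64_t rev = nanbits & 0x0000FFFFFFFFFFFFull;  /* 48 reversed bits */
            return bitrev64(rev << 16);                      /* back to normal   */
        }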

    I like the bit-reversed address idea. Losing high order address bits is
    less of an issue than low order ones.

    The extra bit in the NaN cause may be used by software for when access
    to the payload area is desired for other purposes.

    I still like the idea of a NaN trace facility as an option. Perhaps the debugger logic could trigger a dump to trace on a NaN after a specific address.

    I think that just a cause code to indicate multiple NaNs colliding would
    be good. With the fused-dot-product there could be up to four NaNs. Some
    of the information is going to be lost, so might as well just assign a code.

    Insane idea: use more payload bits to record the colliding NaN causes,
    then dump it to a CSR somewhere when the address is inserted into the
    NaN. The FP status needs to be recorded, so maybe it could be part of
    that status record.

    My float package does not have access to an address, so it cannot be
    inserted in the individual modules where the NaN occurs. It must be
    inserted at a higher level in the FPU which I believe has access to the instruction address.



    I suppose I could code the package to accept NaN values either way.

    The following NaN values are in use.

    `define QSUBINFD 63'h7FF0000000000001 // - infinity - infinity
    `define QINFDIVD 63'h7FF0000000000002 // - infinity / infinity
    `define QZEROZEROD 63'h7FF0000000000003 // - zero / zero
    `define QINFZEROD 63'h7FF0000000000004 // - infinity X zero
    `define QSQRTINFD 63'h7FF0000000000005 // - square root of infinity
    `define QSQRTNEGD 63'h7FF0000000000006 // - square root of negative number


    There are rules when more than 1 NaN are an operand to an instruction >>>>> designed to leave the more important NaN as the result. {Where more
    important is generally the first to be generated.}

    Hopefully the package follows the rules correctly. NaN operation is one >>>> thing not tested yet.

    This >>>>>> would allow detection of the use of a lower precision value in
    arithmetic. Suppose a convert from single to double precision is being >>>>>> done, but the value to be converted is only half precision. If it were >>>>>> indicated by the NaN software might be able to fix the result.

    I think it is better to fix the SW that thinks a (half) is a (float). >>>>>
    It would be better, but some software is so complex it may be unknown
    what values are coming in. The SW does not really need to croak if it's a
    lower precision value, as they are always representable in a higher
    precision.

    I also preserve the sign bit of the number in the NaN box.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Nov 24 18:03:39 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.

    Why would a CS graduate need to know about pipelines?

    So they can properly simulate a pipelined processor?

    Sure, if a CS graduate works in an application area, they need to
    learn about that application area, whatever it is.

    It's useful for code optimization, as well. In general,
    any programmer should have a solid understanding of the
    underlying hardware - generically, and specifically
    for the hardware being programmed.


    But why would knowledge about processor pipelines be part of their CS curriculum?

    When I got my MSCS, computer engineering courses were
    required, including basic logic elements and overviews
    of processor design.

    For me, too. I even learned something about processor pipelines, in a specialized elective course.

    Why would anybody know the basics of what they are doing?

    Processor pipelines are not the basics of what a CS graduate is doing.
    They are an implementation detail in computer engineering.

    Which affect the performance of the software created by the
    software engineer (CS graduate).


    Indeed, a programmer that doesn't understand the underlying
    hardware is crippled.

    <snip>

    If anything, understanding OoO execution and its effect on
    performance is more relevant. But looking at the dearth of textbooks,
    and the fact that Henry Wong did his thesis on his own initiative,
    even among computer engineering professors that is a topic that is of
    little interest.

    Back to programmers: There is also the other POV that programmers
    should never concern themselves with low-level details and should
    always leave that to compilers, which supposedly can do all those
    things better than programmers (I call that the compiler supremacy
    position). Compiler supremacy is wishful thinking, but wishful
    thinking has a strong influence in the world.

    My experience with those who espouse that point of view has
    been uniformly poor.


    A few more examples where compilers are not as good as even I expected:

    Just today, I compiled

    u4 = u1/10;
    u3 = u1%10;

    (plus some surrounding code) with gcc-14 in three contexts. Here's
    the code for two of them (the third one is similar to the second one):

    movabs $0xcccccccccccccccd,%rax movabs $0xcccccccccccccccd,%rsi
    sub $0x8,%r13 mov %r8,%rax
    mul %r8 mov %r8,%rcx
    mov %rdx,%rax mul %rsi
    shr $0x3,%rax shr $0x3,%rdx
    lea (%rax,%rax,4),%rdx lea (%rdx,%rdx,4),%rax
    add %rdx,%rdx add %rax,%rax
    sub %rdx,%r8 sub %rax,%r8
    mov %r8,0x8(%r13) mov %rcx,%rax
    mov %rax,%r8 mul %rsi
    shr $0x3,%rdx
    mov %rdx,%r9

    The major difference is that in the left context, u3 is stored into
    memory (at 0x8(%r13)), while in the right context, it stays in a
    register. In the left context, gcc managed to base its computation of
    u1%10 on the result of u1/10; in the right context, gcc first computes
    u1%10 (computing u1/10 as part of that), and then computes u1/10
    again.

    Sort of emphasizes that programmers need to understand the
    underlying hardware.

    What were u1, u3 and u4 declared as?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Nov 24 20:00:59 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-23 3:13 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-22 10:20 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-11 2:30 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Typical process for NaN boxing is to set the high order bits of the >>>>>> value which causes the value to appear to be a NaN at higher precision.

    Any FP value representable in lower precision can be exactly represented
    in higher precision.

    I have been thinking about using some of the high order bits of the NaN
    (eg bits 32 to 51) to indicate the precision of the boxed value.

    When My 66000 generates a NaN it inserts the cause in the 3 HoBs and >>>>> inserts IP in the LoBs. Nothing prevents you from overwriting the NaN, >>>>> but I thought it was best to point at the causing-instruction and an >>>>> encoded "why" the nan was generated. The cause is a 3-bit index to the >>>>> 7 defined IEEE exceptions.

    My float package puts the cause in the 3 LoBs. The cause is always in >>>> the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have >>>> access to the address. Seems like NaN trace hardware might be useful. >>>
    Suggest you read::
    https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.


    Okay, it sounds like there are good reasons to use the HoBs. But I think >> it is only when converting precisions that it makes a difference. I have >> the float package moving the LoBs of a larger precision to the LoBs of
    the lower precision if a NaN (or infinity) is present. I do not think
    this consumes any more logic. It looks like just wires. It looks to be a >> three bit mux on the low order bits going the other way.

    The other part of the paper's reasoning is that if you want to insert
    some portion of IP in NaN, doing it bit-reversed enables conversions
    to smaller and larger to lose as few bits as possible. The realization
    was a surprise to me (yesterday).


    It is probably not possible to embed enough IP information in smaller floating-point formats (<=16-bit) to be worthwhile. For 32-bit floats
    only about 18-bits of the address can be stored. It looks like different formats are going to handle NaNs differently, which I find somewhat undesirable.

    I am now leaning towards allocating four HOB bits to indicate the NaN
    cause, and then filling the rest of the payload with a bit reversed address. There should be some instruction to extract the NaN cause and address.

    I like the bit-reversed address idea. Losing high order address bits is
    less of an issue than low order ones.

    The extra bit in the NaN cause may be used by software for when access
    to the payload area is desired for other purposes.

    I still like the idea of a NaN trace facility as an option. Perhaps the debugger logic could trigger a dump to trace on a NaN after a specific address.

    I think that just a cause code to indicate multiple NaNs colliding would
    be good. With the fused-dot-product there could be up to four NaNs. Some
    of the information is going to be lost, so might as well just assign a code.

    Insane idea: use more payload bits to record the colliding NaN causes,
    then dump it to a CSR somewhere when the address is inserted into the
    NaN. The FP status needs to be recorded, so maybe it could be part of
    that status record.

    My float package does not have access to an address, so it cannot be inserted in the individual modules where the NaN occurs. It must be
    inserted at a higher level in the FPU which I believe has access to the instruction address.

    In this case, put the cause in a container the instruction drags down
    the pipe, and retrieve it when you do have address access to where it
    needs to go.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Tue Nov 25 00:40:38 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Power's not dead, either, if very highly priced.

    New Power CPUs and machines based on them are released regularly. I
    think there is enough business in the iSeries (or whatever its current
    name) to produce enough money for the costs of that development.
    pSeries benefits from that. I guess that the profits from that are
    enough to finance the development of the pSeries machines, but can
    contribute little to finance the development of the CPUs.

    MIPS is still
    being sold, apparently.

    From <https://en.wikipedia.org/wiki/MIPS_architecture>:
    |In March 2021, MIPS announced that the development of the MIPS
    |architecture had ended as the company is making the transition to
    |RISC-V.

    So it's the same status as SPARC. They may be selling to existing
    customers, but nobody sane will use MIPS for a new project.

    Original MIPS yes. IIUC Chinese bought rights to use MIPS architecture
    and that goes on.

    As for RISC-V,
    I am not sure how much business they actually generate compared
    to others.

    I think a lot of embedded RISC-Vs are used, e.g., in WD (and now
    Sandisk) HDDs and SSDs; so you can look at the business reports of WD
    if you want to know how much business they make. As for things you
    can actually program, there are a number of SBCs on sale (and we have
    one), from the Raspi Pico 2 (where you apparently can use either
    ARMv8-M (i.e., ARM T32) or RISC-V (probably some RV32 variant) up to
    stuff like the Visionfive V2, several Chinese offerings, and some
    Hifive SBCs. The latter are not yet competitive in CPU performance
    with the like of RK3588-based SBCs or the Raspi 5, so I expect the
    main reason for buying them is to try out RISC-V (we have a Visionfive
    V1 for that purpose); still, the fact that there are several offerings indicates that there is nonnegligible revenue there.

    There are several 32-bit MCU-s and they probably have nontrivial
    part of the market. There are also 64-bit processors, ATM
    cheapest 64-bit Linux capable SBC-s known to me are RISC-V
    (but ARM-based ones are quite close). My impression is that
    corresponding chips are used in security cameras (they have
    special-purpose coprocessor for image recognition).
    Several new chips offer choice of RISC-V or ARM, I am not sure
    what percentage of users run them as ARM.

    Currently big questions are:
    - will Chinese dominate CPU market?
    - which architectures will be used by Chinese?

    It seems that main Chinese bet is on RISC-V. They manufacture
    a lot of ARM-s, but are not entirely comfortable with it.
    They have a few architectures that seem to still get some
    development.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Nov 25 21:08:45 2025
    From Newsgroup: comp.arch

    In this case, put the cause in a container the instruction drags down
    the pipe, and retrieve it when you do have address access to where it
    needs to go.

    I may change things to pass the address around in the float package.
    Putting the address into the NaN later may cause issues with timing. It
    adds a mux into things. May be better to use the original NaN mux in the
    float modules. May call it a NaN identity field instead of an address.

    Modified NaN support in the float package to store to the HOBs.

    Survey says:

    The Qupls PUSH and POP instructions have room for six register fields.
    Should one of the fields be used to identify the stack pointer register allowing five registers to be pushed or popped? Or should the stack
    pointer register be assumed so that six registers may be pushed or popped?

    I think the SP should be identified as PUSH / POP would be the only instructions assuming the SP register. Otherwise any register could be
    chosen by the compiler.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Nov 26 07:53:49 2025
    From Newsgroup: comp.arch

    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    IIUC Chinese bought rights to use MIPS architecture
    and that goes on.

    None are known to me. LoongSon originally implemented MIPS, but,
    according to <https://en.wikipedia.org/wiki/Loongson>:

    |Loongson moved to their own processor instruction set architecture
    |(ISA) in 2021 with the release of the Loongson 3 5000 series.

    This instruction set is called LoongArch, and while it is similar to
    MIPS, RISC-V, Alpha, DLX, Nios, it is different enough that Bernd
    Paysan wrote a separate assembler and disassembler for it <https://cgit.git.savannah.gnu.org/cgit/gforth.git/tree/arch/loongarch64> rather than copying and modifying the MIPS assembler/disassembler.

    It seems that main Chinese bet is on RISC-V. They manufacture
    a lot of ARM-s, but are not entirely comfortable with it.

    It seems to me that different companies in China use different
    architectures. Huawei on ARM, Loongson on Loongarch, some on RISC-V
    etc.

    That also seems to be the Chinese approach to other technologies:
    E.g., they build solar power, coal power, wind power, nuclear power,
    hydro power, etc.; and in nuclear power, they built a few of every
    kind of Generation III reactor on the market before developing their
    own designs, some of them based on the Westinghouse AP-1000,
    others (Hualong One) based on earlier Chinese Generation II designs.
    They are also experimenting with Generation IV and SMR designs.

    So, at least in technology, the CP does not pretend to know what's
    best.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Nov 26 12:17:09 2025
    From Newsgroup: comp.arch

    On Wed, 26 Nov 2025 07:53:49 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:



    That also seems to be the Chinese approach to other technologies:
    E.g., they build solar power, coal power, wind power, nuclear power,
    hydro power, etc.; and in nuclear power, they built a few of every
    kind of Generation III reactor on the market before developing their
    own designs, some of them based on the Westinghouse AP-1000,
    others (Hualong One) based on earlier Chinese Generation II designs.
    They are also experimenting with Generation IV and SMR designs.

    So, at least in technology, the CP does not pretend to know what's
    best.

    - anton

    Is not it the same as in all big countries except ultra-pro-nuclear
    France and ultra-anti-nuclear Germany?
    China is just bigger, so capable to build more things simultaneously.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Nov 26 18:08:49 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 26 Nov 2025 07:53:49 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    That also seems to be the Chinese approach to other technologies:
    E.g., they build solar power, coal power, wind power, nuclear power,
    hydro power, etc.; and in nuclear power, they built a few of every
    kind of Generation III reactor on the market before developing their
    own designs, some of them based on the Westinghouse AP-1000,
    others (Hualong One) based on earlier Chinese Generation II designs.
    They are also experimenting with Generation IV and SMR designs.

    So, at least in technology, the CP does not pretend to know what's
    best.
    ...
    Is not it the same as in all big countries except ultra-pro-nuclear
    France and ultra-anti-nuclear Germany?

    Not sure what you mean by "it", but I doubt that many new coal plants
    are built in the first world (maybe in Australia?); Wind power faces significant opposition in some countries.

    Concerning nuclear power: it stagnates or is in decline in the first
    world. E.g., a number of nuclear power plants were shut down in the
    2010s in the USA despite being granted lifetime extensions, due to
    being uneconomical in the fracking age, and the building of new
    reactors led to huge cost overruns (Nukegate) and the bankruptcy of Westinghouse, and to the cancelation of some of the projects.
    Similarly, the first EPRs in Finland and in France led to huge delays
    and cost overruns, and a large part (all?) of the losses were
    shouldered by the French state, which restructured the companies
    involved. The Chinese EPRs also had long delays, but were the first
    to deliver grid energy.

    In any case, no AP-1000 has been built in Europe, and no EPR in the
    USA. Both have been built in China.

    China is just bigger, so capable to build more things simultaneously.

    They are willing to build different things. France has announced the
    building of 15 EPRs to replace much of its aging reactor fleet (which
    is not that much less than what China is building). It will be
    interesting when in 30 years defects are found in one of the reactor
    vessels of an EPR (like happened for an older model in 2022), and all
    EPRs have to be shut down for inspection and repairs (like happened
    for that model in 2022).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 26 20:57:11 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    In this case, put the cause in a container the instruction drags down
    the pipe, and retrieve it when you do have address access to where it
    needs to go.

    I may change things to pass the address around in the float package.
    Putting the address into the NaN later may cause issues with timing. It
    adds a mux into things. May be better to use the original NaN mux in the float modules. May call it a NaN identity field instead of an address.

    For example: when a My 66000 instruction needs to raise an exception
    the Inst *I argument contains a field I->raised which is set (1<<excpt)
    and at the end of the pipe (at retire), t->raised |= I->raised. Where
    we have a *t there is also t->ip. So, you don't have to drag Thread *t
    through all the subroutine calls, but you can easily access t->raised
    at the point you do have access to t->ip.
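
    A small C model of that scheme; Inst and Thread here are illustrative
    stand-ins rather than the actual simulator types:

        #include <stdint.h>

        typedef struct Inst   { uint32_t raised; /* (1 << excpt) bits set in the pipe */ } Inst;
        typedef struct Thread { uint32_t raised; uint64_t ip; } Thread;

        /* Where the fault is detected: only the instruction is at hand. */
        static void raise_exception(Inst *I, int excpt) {
            I->raised |= 1u << excpt;
        }

        /* At retire: the thread state, and hence t->ip, is available. */
        static void retire(Thread *t, const Inst *I) {
            t->raised |= I->raised;
        }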

    Modified NaN support in the float package to store to the HOBs.

    Survey says:

    The Qulps PUSH and POP instructions have room for six register fields. Should one of the fields be used to identify the stack pointer register allowing five registers to be pushed or popped? Or should the stack
    pointer register be assumed so that six registers may be pushed or popped?

    My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
    instead of giving it a number of registers, there is a start register
    and a stop register, so 1-to-32 registers can be saved/restored. The
    immediate contains how much stack space to allocate/deallocate.
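
    A very rough C model of the save half of that, assuming registers
    rstart..rstop are stored in ascending order and the stack grows
    downward (the description pins down neither, and the special
    low-immediate behaviours described below are not modelled):

        #include <stdint.h>
        #include <string.h>

        /* reg[31] is SP, per the description; mem models data memory. */
        static void enter_model(uint64_t reg[32], uint8_t *mem,
                                unsigned rstart, unsigned rstop, uint64_t frame) {
            uint64_t sp = reg[31];
            for (unsigned r = rstart; r <= rstop; r++) {   /* assumed ordering */
                sp -= 8;
                memcpy(mem + sp, &reg[r], 8);
            }
            sp -= frame;          /* immediate: extra stack space to allocate */
            reg[31] = sp;
        }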

    {{when Safe-Stack is enabled:: Rstart-to-R0 are placed on the inaccessible stack, while R1-to-Rstop are placed on the normal stack.}}

    Because the stack is always DoubleWord aligned, the 3-LoBs of the
    immediate are used to indicate "special" activities on a couple of
    registers {R0, R31, R30}. R31 is rarely saved and reloaded from the stack
    but just returned to its previous value by integer arithmetic. FP can
    be updated or it can be treated like "just another register". R0 can
    be loaded directly to t->ip, or loaded into R0 for stack walk-backs.

    The corresponding LDM and STM are seldom used.

    I think the SP should be identified as PUSH / POP would be the only instructions assuming the SP register. Otherwise any register could be chosen by the compiler.

    I started with that philosophy--and begrudgingly went away from it as
    a) the compiler took form
    b) we started adding instructions to ISA to remove instructions from
    code footprint.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 26 21:00:23 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    IIUC Chinese bought rights to use MIPS architecture
    and that goes on.

    None are known to me. LoongSon originally implemented MIPS, but,
    according to <https://en.wikipedia.org/wiki/Loongson>:

    |Loongson moved to their own processor instruction set architecture
    |(ISA) in 2021 with the release of the Loongson 3 5000 series.

    This instruction set is called LoongArch, and while it is similar to
    MIPS, RISC-V, Alpha, DLX, Nios, it is different enough that Bernd
    Paysan wrote a separate assembler and disassembler for it <https://cgit.git.savannah.gnu.org/cgit/gforth.git/tree/arch/loongarch64> rather than copying and modifying the MIPS assembler/disassembler.

    It seems that main Chinese bet is on RISC-V. They manufacture
    a lot of ARM-s, but are not entirely comfortable with it.

    It seems to me that different companies in China use different
    architectures. Huawei on ARM, Loongson on Loongarch, some on RISC-V
    etc.

    That also seems to be the Chinese approach to other technologies:
    E.g., they build solar power, coal power, wind power, nuclear power,
    hydro power, etc.; and in nuclear power, they built a few of every
    kind of Generation III reactor on the market before developing their
    own designs, some of them based on the Westinghouse AP-1000,
    others (Hualong One) based on earlier Chinese Generation II designs.
    They are also experimenting with Generation IV and SMR designs.

    This reminds me of Samsung. They developed both deep trench and stacked capacitor DRAM and had both in production for about 1 full year before
    choosing one for long term production (stacked).

    So, at least in technology, the CP does not pretend to know what's
    best.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Nov 26 22:26:14 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-22 10:20 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-11 2:30 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Typical process for NaN boxing is to set the high order bits of the >>>>>> value which causes the value to appear to be a NaN at higher precision. >>>>>
    Any FP value representable in lower precision can be exactly represented >>>>> in higher precision.

    I have been thinking about using some of the high order bits of the NaN >>>>>> (eg bits 32 to 51) to indicate the precision of the boxed value.

    When My 66000 generates a NaN it inserts the cause in the 3 HoBs and >>>>> inserts IP in the LoBs. Nothing prevents you from overwriting the NaN, >>>>> but I thought it was best to point at the causing-instruction and an >>>>> encoded "why" the nan was generated. The cause is a 3-bit index to the >>>>> 7 defined IEEE exceptions.

    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.

    Suggest you read::
    https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.


    Okay, it sounds like there are good reasons to use the HoBs. But I think
    it is only when converting precisions that it makes a difference. I have
    the float package moving the LoBs of a larger precision to the LoBs of
    the lower precision if a NaN (or infinity) is present. I do not think
    this consumes any more logic. It looks like just wires. It looks to be a
    three bit mux on the low order bits going the other way.

    The other part of the paper's reasoning is that if you want to insert
    some portion of IP in NaN, doing it bit-reversed enables conversions
    to smaller and larger to loose as few bits as possible. The realization
    was a surprise to me (yesterday).

    I think I read about IBM's approach years before the 754-2019 process
    started.

    Storing the offending address in byte-reversed order would do pretty
    much the same thing, but at lower HW cost, right?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 26 21:58:13 2025
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-22 10:20 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-11 2:30 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Typical process for NaN boxing is to set the high order bits of the >>>>>> value which causes the value to appear to be a NaN at higher precision.

    Any FP value representable in lower precision can be exactly represented
    in higher precision.

    I have been thinking about using some of the high order bits of the NaN
    (eg bits 32 to 51) to indicate the precision of the boxed value.

    When My 66000 generates a NaN it inserts the cause in the 3 HoBs and >>>>> inserts IP in the LoBs. Nothing prevents you from overwriting the NaN, >>>>> but I thought it was best to point at the causing-instruction and an >>>>> encoded "why" the nan was generated. The cause is a 3-bit index to the >>>>> 7 defined IEEE exceptions.

    My float package puts the cause in the 3 LoBs. The cause is always in >>>> the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have >>>> access to the address. Seems like NaN trace hardware might be useful. >>>
    Suggest you read::
    https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.


    Okay, it sounds like there are good reasons to use the HoBs. But I think >> it is only when converting precisions that it makes a difference. I have >> the float package moving the LoBs of a larger precision to the LoBs of
    the lower precision if a NaN (or infinity) is present. I do not think
    this consumes any more logic. It looks like just wires. It looks to be a >> three bit mux on the low order bits going the other way.

    The other part of the paper's reasoning is that if you want to insert
    some portion of IP in NaN, doing it bit-reversed enables conversions
    to smaller and larger to lose as few bits as possible. The realization
    was a surprise to me (yesterday).

    I think I read about IBM's approach years before the 754-2019 process started.

    Storing the offending address in byte-reversed order would do pretty
    much the same thing, but at lower HW cost, right?

    Yes, no, and maybe.

    In order to byte/bit-reverse a field/register, you take the horizontal
    data-path bit-lines and turn them 90°. Once so turned, the
    difference in cost between bit-reversal and byte reversal is too
    small to worry about. So, no.

    On the other hand, shifters, and bit-field-reversers are often part
    of the data path (calculation circuits), so you can pretty much get
    one or the other or both at very little extra charge. So, yes.

    It is only at SW use of the bit-vector that one or the other matters
    a little (or a lot). In a machine with either bit-reverse instruction
    or byte reverse instruction, the ISA determines which one is better.
    So, maybe.

    My 66000 has a bit-reverse instruction that can also perform
    pair-reverse, quad-reverse, byte-reverse, half-reverse, and
    word-reverse. So, in this ISA it does not matter which HW choice
    was made.
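
    A C sketch of such a generalized reverse, where one swap network yields
    bit-, pair-, quad-, byte-, half-, or word-reverse depending on how many
    stages are enabled (illustrative of the idea only, not of the actual
    implementation):

        #include <stdint.h>

        /* group_bits = 1 -> bit reverse, 2 -> pair, 4 -> quad, 8 -> byte,
           16 -> half, 32 -> word reverse of a 64-bit value. */
        static uint64_t reverse_groups(uint64_t x, unsigned group_bits) {
            if (group_bits <= 1)
                x = ((x & 0x5555555555555555ull) << 1)  | ((x >> 1)  & 0x5555555555555555ull);
            if (group_bits <= 2)
                x = ((x & 0x3333333333333333ull) << 2)  | ((x >> 2)  & 0x3333333333333333ull);
            if (group_bits <= 4)
                x = ((x & 0x0F0F0F0F0F0F0F0Full) << 4)  | ((x >> 4)  & 0x0F0F0F0F0F0F0F0Full);
            if (group_bits <= 8)
                x = ((x & 0x00FF00FF00FF00FFull) << 8)  | ((x >> 8)  & 0x00FF00FF00FF00FFull);
            if (group_bits <= 16)
                x = ((x & 0x0000FFFF0000FFFFull) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFull);
            if (group_bits <= 32)
                x = (x << 32) | (x >> 32);
            return x;
        }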

    Terje


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Nov 26 22:16:25 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Robert Finch <robfi680@gmail.com> posted:


    The Qulps PUSH and POP instructions have room for six register fields.
    Should one of the fields be used to identify the stack pointer register
    allowing five registers to be pushed or popped? Or should the stack
    pointer register be assumed so that six registers may be pushed or popped?

    My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
    instead of giving it a number of registers, there is a start register
    and a stop register, so 1-to-32 registers can be saved/restored. The
    immediate contains how much stack space to allocate/deallocate.

    That seems both confining for the compiler designers and less
    useful than the VAX-11 register mask stored in the instruction stream
    at the function entry point(s).

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Brian G. Lucas@bagel99@gmail.com to comp.arch on Wed Nov 26 17:20:30 2025
    From Newsgroup: comp.arch

    On 11/26/25 5:16 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Robert Finch <robfi680@gmail.com> posted:


    The Qulps PUSH and POP instructions have room for six register fields.
    Should one of the fields be used to identify the stack pointer register
    allowing five registers to be pushed or popped? Or should the stack
    pointer register be assumed so that six registers may be pushed or popped? >>
    My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
    instead of giving it a number of registers, there is a start register
    and a stop register, so 1-to-32 registers can be saved/restored. The
    immediate contains how much stack space to allocate/deallocate.

    That seems both confining for the compiler designers and less
    useful than the VAX-11 register mask stored in the instruction stream
    at the function entry point(s).

    When the compiler can control the order in which registers are chosen
    to allocate, the ENTER and EXIT stuff works very well.

    Brian


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Nov 26 22:29:33 2025
    From Newsgroup: comp.arch

    "Brian G. Lucas" <bagel99@gmail.com> writes:
    On 11/26/25 5:16 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Robert Finch <robfi680@gmail.com> posted:


    The Qulps PUSH and POP instructions have room for six register fields. >>>> Should one of the fields be used to identify the stack pointer register >>>> allowing five registers to be pushed or popped? Or should the stack
    pointer register be assumed so that six registers may be pushed or popped? >>>
    My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
    instead of giving it a number of registers, there is a start register
    and a stop register, so 1-to-32 registers can be saved/restored. The
    immediate contains how much stack space to allocate/deallocate.

    That seems both confining for the compiler designers and less
    useful than the VAX-11 register mask stored in the instruction stream
    at the function entry point(s).

    When the compiler can control the order in which registers are chosen
    to allocate, the ENTER and EXIT stuff works very well.

    They are often, however, constrained by the processor specific ABI
    which defines the usage model for registers when multiple languages
    are linked to provide code for an application.

    When every enter insn that calls the function has
    that mask, there is the possibility for strange and difficult to locate
    errors when a program links with a library function that was built
    earlier or with a different version of a (or even different language)
    compiler and thus the mask is not necessarily correct for the latest
    version of the called function.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 26 18:19:28 2025
    From Newsgroup: comp.arch

    On 2025-11-26 3:57 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    In this case, put the cause in a container the instruction drags down
    the pipe, and retrieve it when you do have address access to where it
    needs to go.

    I may change things to pass the address around in the float package.
    Putting the address into the NaN later may cause issues with timing. It
    adds a mux into things. May be better to use the original NaN mux in the
    float modules. May call it a NaN identity field instead of an address.

    For example: when a My 66000 instruction needs to raise an exception
    the Inst *I argument contains a field I->raised which is set (1<<excpt)
    and at the end of the pipe (at retire), t->raised |= I->raised. Where
    we have a *t there is also t->ip. So, you don't have to drag Thread *t through all the subroutine calls, but you can easily access t->raised
    at the point you do have access to t->ip.

    Had trouble reading that, sounds like goobly-goop. But I believe I
    figured it out.

    Sounds like the address is inserted at the end of the pipe which I am
    sure is not the case.

    I figured this out: the NaN address must be embedded in the result by
    the time the result updates the bypass network and registers so that it
    is available to other instructions.

    The address is available at the start of the calc from the reservation
    station entry. Me thinks it must be embedded when the NaN result status
    is set, provided there is not already a NaN. The existing (first) NaN
    must propagate through.
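
    As a minimal C sketch of that rule - the helper name make_nan_with_ip and
    the "IP in the low 51 payload bits" layout are invented for illustration -
    the IP is written only when the NaN is freshly created; an operand that is
    already a NaN propagates unchanged, so the first fault site survives:

    #include <math.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical helper: pack the faulting instruction's IP into the low
       payload bits of a quiet NaN.  The field layout is illustrative only. */
    static double make_nan_with_ip(uint64_t ip)
    {
        uint64_t bits = 0x7FF8000000000000ull         /* quiet NaN           */
                      | (ip & 0x0007FFFFFFFFFFFFull); /* low 51 bits of IP   */
        double d;
        memcpy(&d, &bits, sizeof d);
        return d;
    }

    /* "First NaN wins": an operand NaN propagates untouched; only a NaN
       created by this operation gets the current instruction's IP. */
    static double fadd_tracked(double a, double b, uint64_t ip)
    {
        if (isnan(a)) return a;
        if (isnan(b)) return b;
        double r = a + b;
        return isnan(r) ? make_nan_with_ip(ip)   /* e.g. inf + -inf */
                        : r;
    }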

    Modified NaN support in the float package to store to the HOBs.

    Survey says:

    The Qupls PUSH and POP instructions have room for six register fields.
    Should one of the fields be used to identify the stack pointer register
    allowing five registers to be pushed or popped? Or should the stack
    pointer register be assumed so that six registers may be pushed or popped?

    My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
    instead of giving it a number of registers, there is a start register
    and a stop register, so 1-to-32 registers can be saved/restored. The immediate contains how much stack space to allocate/deallocate.

    {{when Safe-Stack is enabled:: Rstart-to-R0 are placed on the inaccessible stack, while R1-to-Rstop are placed on the normal stack.}}

    Because the stack is always DoubleWord aligned, the 3-LoBs of the
    immediate are used to indicate "special" activities on a couple of
    registers {R0, R31, R30}: R31 is rarely saved and reloaded from the stack
    but just returned to its previous value by integer arithmetic. FP can
    be updated or it can be treated like "just another register". R0 can
    be loaded directly to t->ip, or loaded into R0 for stack walk-backs.
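
    Since the stack stays DoubleWord aligned, those three low immediate bits
    really are free to act as flags. A rough C sketch of how a decoder might
    split them out - the field names and which bit means what are guesses for
    illustration, not the real My 66000 encoding:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical decode of an ENTER/EXIT immediate: the stack stays 8-byte
       aligned, so the 3 low bits can carry per-register "special" flags.
       Which bit means what is an assumption made up for this sketch. */
    typedef struct {
        uint64_t frame_bytes;  /* stack space to allocate, multiple of 8     */
        bool     r0_special;   /* e.g. load R0 straight into t->ip at EXIT   */
        bool     r30_special;  /* e.g. treat R30 as FP rather than a GPR     */
        bool     r31_special;  /* e.g. restore SP by arithmetic, not reload  */
    } EnterImm;

    static EnterImm decode_enter_imm(uint64_t imm)
    {
        EnterImm e;
        e.frame_bytes = imm & ~7ull;   /* DoubleWord-aligned allocation */
        e.r0_special  = (imm & 1) != 0;
        e.r30_special = (imm & 2) != 0;
        e.r31_special = (imm & 4) != 0;
        return e;
    }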

    The corresponding LDM and STM are seldom used.

    I ran out of micro-ops for ENTER and EXIT, so they only save the LR and
    FP (on the safe stack). A separate PUSH/POP on safe stack instruction is
    used.

    I figured LDM and STM are not used often enough. PUSH / POP is used in
    many places LDM / STM might be.

    For context switching a whole bunch of load / store instructions are
    used. There is context switching in only a couple of places.

    I think the SP should be identified as PUSH / POP would be the only
    instructions assuming the SP register. Otherwise any register could be
    chosen by the compiler.

    I started with that philosophy--and begrudgingly went away from it as
    a) the compiler took form
    b) we started adding instructions to ISA to remove instructions from
    code footprint.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 26 23:46:47 2025
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Robert Finch <robfi680@gmail.com> posted:


    The Qupls PUSH and POP instructions have room for six register fields.
    Should one of the fields be used to identify the stack pointer register
    allowing five registers to be pushed or popped? Or should the stack
    pointer register be assumed so that six registers may be pushed or popped?

    My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
    instead of giving it a number of registers, there is a start register
    and a stop register, so 1-to-32 registers can be saved/restored. The immediate contains how much stack space to allocate/deallocate.

    That seems both confining for the compiler designers and less
    useful than the VAX-11 register mask stored in the instruction stream
    at the function entry point(s).

    We, and by that I mean Brian, have not found that so. In the early stages
    we did see a bit of that, and then Brian found a way to allocate registers
    from R31-down-to-R16 that fit the ENTER/EXIT model and we find essentially nothing (that is no more instructions in the stream than necessary).

    Part of the distinction is::
    a) how arguments/results are passed to/from subroutines.
    b) having a minimum of 7-temporary registers at entry point.
    c) how the stack frame is designed/allocated wrt:
    1) my arguments and my results,
    2) his arguments and his results,
    3) varargs,
    4) dynamic arrays on stack,
    5) stack frame allocation at ENTER,
    d) freedom to use R30 as FP or as joe-random-register.

    These were all co-designed together, after much of the instruction
    emission logic was sorted out.

    Consider this as a VAX CALL model except that the mask was replaced by
    a list of registers, which were then packed towards R31 instead of a bit vector.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 26 23:53:44 2025
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    "Brian G. Lucas" <bagel99@gmail.com> writes:
    On 11/26/25 5:16 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Robert Finch <robfi680@gmail.com> posted:


    The Qupls PUSH and POP instructions have room for six register fields.
    Should one of the fields be used to identify the stack pointer register
    allowing five registers to be pushed or popped? Or should the stack
    pointer register be assumed so that six registers may be pushed or popped?

    My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
    instead of giving it a number of registers, there is a start register
    and a stop register, so 1-to-32 registers can be saved/restored. The
    immediate contains how much stack space to allocate/deallocate.

    That seems both confining for the compiler designers and less
    useful than the VAX-11 register mask stored in the instruction stream
    at the function entry point(s).

    When the compiler can control the order in which registers are chosen
    to allocate, the ENTER and EXIT stuff works very well.

    They are often, however, constrained by the processor specific ABI
    which defines the usage model for registers when multiple languages
    are linked to provide code for an application.

    When every enter insn that calls the function has that mask,

    a) wrong order: It is the subroutine entry point that has the mask,
    not the calling point. Thus, the mask is universal to the
    subroutine just entered. And, thus, the corresponding EXIT
    will use the same "bit pattern".

    there is the possibility for strange and difficult to locate errors when a program links with a library function that was built
    earlier or with a different version of a (or even different language) compiler and thus the mask is not necessarily correct for the latest
    version of the called function.

    b) this was the x86-32 problem in using one of its "CALL" instructions:
    the stack was manipulated on the calling side instead of the called
    side. Originally this worked fine for PASCAL and nadda-so-gooda for
    C-like languages.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 27 00:08:19 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-26 3:57 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    In this case, put the cause in a container the instruction drags down
    the pipe, and retrieve it when you do have address access to where it
    needs to go.

    I may change things to pass the address around in the float package.
    Putting the address into the NaN later may cause issues with timing. It
    adds a mux into things. May be better to use the original NaN mux in the >> float modules. May call it a NaN identity field instead of an address.

    For example: when a My 66000 instruction needs to raise an exception
    the Inst *I argument contains a field I->raised which is set (1<<excpt)
    and at the end of the pipe (at retire), t->raised |= I->raised. Where
    we have a *t there is also t->ip. So, you don't have to drag Thread *t through all the subroutine calls, but you can easily access t->raised
    at the point you do have access to t->ip.

    Had trouble reading that, sounds like goobly-goop. But I believe I
    figured it out.

    Sounds like the address is inserted at the end of the pipe which I am
    sure is not the case.

    I figured this out: the NaN address must be embedded in the result by
    the time the result updates the bypass network and registers so that it
    is available to other instructions.

    The address is available at the start of the calc from the reservation station entry. Me thinks it must be embedded when the NaN result status
    is set, provided there is not already a NaN. The existing (first) NaN
    must propagate through.

    See last calculation line in the following::

    void RunInst( Chip *chip )
    {
        for( uint64_t i = 0; i < chip->cores; i++ )
        {
            ContextStack *cpu = &core[i];
            uint8_t cs = cpu->cs;
            Thread *t;
            Inst *I;
            uint16_t raised;

            if( cpu->interrupt.raised & ((((signed)1)<<63) >> cpu->priority) )
            {   // take an interrupt
                cpu->cs = cpu->interrupt.cs;
                cpu->priority = cpu->interrupt.priority;
                t = context[cpu->cs];
                t->reg[0] = cpu->interrupt.message;
            }
            else if( raised = t->raised & t->enabled )
            {   // take an exception
                cpu->cs--;
                t = context[cpu->cs];
                t->reg[0] = FT1( raised ) | EXCPT;
                t->reg[1] = I->inst;
                t->reg[2] = I->src1;
                t->reg[3] = I->src2;
                t->reg[4] = I->src3;
            }
            else
            {   // run an instruction
                t = context[cpu->cs];
                memory( FETCH, t->ip, &I->inst );
                t->ip += 4;
                majorTable[ I->inst.major ]( t, I );
                t->raised |= I->raised; // propagate raised here
            }
        }
    }

    Modified NaN support in the float package to store to the HOBs.

    Survey says:

    The Qupls PUSH and POP instructions have room for six register fields.
    Should one of the fields be used to identify the stack pointer register
    allowing five registers to be pushed or popped? Or should the stack
    pointer register be assumed so that six registers may be pushed or popped?

    My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
    instead of giving it a number of registers, there is a start register
    and a stop register, so 1-to-32 registers can be saved/restored. The immediate contains how much stack space to allocate/deallocate.

    {{when Safe-Stack is enabled:: Rstart-to-R0 are placed on the inaccessible stack, while R1-to-Rstop are placed on the normal stack.}}

    Because the stack is always DoubleWord aligned, the 3-LoBs of the
    immediate are used to indicate "special" activities on a couple of registers {R0, R31, R30}: R31 is rarely saved and reloaded from the stack
    but just returned to its previous value by integer arithmetic. FP can
    be updated or it can be treated like "just another register". R0 can
    be loaded directly to t->ip, or loaded into R0 for stack walk-backs.

    The corresponding LDM and STM are seldom used.

    I ran out of micro-ops for ENTER and EXIT, so they only save the LR and
    FP (on the safe stack). A separate PUSH/POP on safe stack instruction is used.

    I figured LDM and STM are not used often enough. PUSH / POP is used in
    many places LDM / STM might be.

    Its a fine line.

    I found more uses for an instruction that moves a number of registers
    randomly allocated to fixed positions (arguments to a call) than to
    move random string of registers to/from memory.

    .
    MOV R1,R10
    MOV R2,R25
    MOV R3,R17
    CALL Subroutine
    . ; deal with any result

    For context switching a whole bunch of load / store instructions are
    used. There is context switching in only a couple of places.

    I use a cache-model for thread-state {program-status-line and the
    register file}.

    The high level simulator, leaves all of the context in memory without
    loading it or storing it. Thus this serves as a pipeline Oracle so if
    the OoO pipeline makes a timing error, the Oracle stops the thread in
    its tracks.

    Thus::

    .
    .
    -----interrupt detected
    . change CS (cs--) <---
    . access threadState[cs]
    . t->ip = dispatcher
    . t->reg[0] = why
    dispatcher in control
    .
    .
    .
    RET
    SVR
    .
    .

    In your typical interrupt/exception control transfers, there is
    no code to actually switch state. Just like there is no code to
    switch a cache line that takes a miss.

    (*) The cs-- is all that is necessary to change from one Thread State
    to another in its entirety.

    I think the SP should be identified as PUSH / POP would be the only
    instructions assuming the SP register. Otherwise any register could be
    chosen by the compiler.

    I started with that philosophy--and begrudgingly went away from it as
    a) the compiler took form
    b) we started adding instructions to ISA to remove instructions from
    code footprint.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Thu Nov 27 00:36:54 2025
    From Newsgroup: comp.arch

    On 2025-11-26 7:08 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-26 3:57 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    In this case, put the cause in a container the instruction drags down >>>>> the pipe, and retrieve it when you do have address access to where it >>>>> needs to go.

    I may change things to pass the address around in the float package.
    Putting the address into the NaN later may cause issues with timing. It >>>> adds a mux into things. May be better to use the original NaN mux in the >>>> float modules. May call it a NaN identity field instead of an address.

    For example: when a My 66000 instruction needs to raise an exception
    the Inst *I argument contains a field I->raised which is set (1<<excpt)
    and at the end of the pipe (at retire), t->raised |= I->raised. Where
    we have a *t there is also t->ip. So, you don't have to drag Thread *t
    through all the subroutine calls, but you can easily access t->raised
    at the point you do have access to t->ip.

    Had trouble reading that, sounds like goobly-goop. But I believe I
    figured it out.

    Sounds like the address is inserted at the end of the pipe which I am
    sure is not the case.

    I figured this out: the NaN address must be embedded in the result by
    the time the result updates the bypass network and registers so that it
    is available to other instructions.

    The address is available at the start of the calc from the reservation
    station entry. Me thinks it must be embedded when the NaN result status
    is set, provided there is not already a NaN. The existing (first) NaN
    must propagate through.

    See last calculation line in the following::

    void RunInst( Chip *chip )
    {
    for( uint64_t i = 0; i < chip->cores; i++ )
    {
    ContextStack *cpu = &core[i];
    uint8_t cs = cpu->cs;
    Thread *t;
    Inst *I;
    uint16_t raised;

    if( cpu->interrupt.raised & ((((signed)1)<<63) >> cpu->priority) )
    { // take an interrupt
    cpu->cs = cpu->interrupt.cs;
    cpu->priority = cpu->interrupt.priority;
    t = context[cpu->cs];
    t->reg[0] = cpu->interrupt.message;
    }
    else if( raised = t->raised & t->enabled )
    { // take an exception
    cpu->cs--;
    t = context[cpu->cs];
    t->reg[0] = FT1( raised ) | EXCPT;
    t->reg[1] = I->inst;
    t->reg[2] = I->src1;
    t->reg[3] = I->src2;
    t->reg[4] = I->src3;
    }
    else
    { // run an instruction
    t = context[cpu->cs];
    memory( FETCH, t->ip, &I->inst );
    t->ip += 4;
    majorTable[ I->inst.major ]( t, I );
    t->raised |= I->raised; // propagate raised here
    }
    }
    }

    That looks like code for a simulator. How closely does it follow the
    operation of the CPU? I do not see where 'I' is initialized.

    It has been a while since I worked on simulator code.

    The IP value is just muxed in via a five-to-one mux for the significand.
    Had to account for NaNs, infinities and overflow anyway. The address gets propagated with some flops, but flops are inexpensive in an FPGA.

    always_comb
      casez({aNan5,bNan5,qNaNOutab5,aInf5,bInf5,overab5})
      6'b1?????: moab6 <= {1'b1,1'b1,a5[fp64Pkg::FMSB-1:0],{fp64Pkg::FMSB+1{1'b0}}};
      6'b01????: moab6 <= {1'b1,1'b1,b5[fp64Pkg::FMSB-1:0],{fp64Pkg::FMSB+1{1'b0}}};
      6'b001???: moab6 <= {1'b1,qNaN|(64'd4 << (fp64Pkg::FMSB-4))|adr5[63:16],{fp64Pkg::FMSB+1{1'b0}}}; // multiply inf * zero
      6'b0001??: moab6 <= 0; // mul inf's
      6'b00001?: moab6 <= 0; // mul inf's
      6'b000001: moab6 <= 0; // mul overflow
      default:   moab6 <= fractab5;
      endcase



    Modified NaN support in the float package to store to the HOBs.

    Survey says:

    The Qupls PUSH and POP instructions have room for six register fields.
    Should one of the fields be used to identify the stack pointer register
    allowing five registers to be pushed or popped? Or should the stack
    pointer register be assumed so that six registers may be pushed or popped?
    My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
    instead of giving it a number of registers, there is a start register
    and a stop register, so 1-to-32 registers can be saved/restored. The
    immediate contains how much stack space to allocate/deallocate.

    {{when Safe-Stack is enabled:: Rstart-to-R0 are placed on the inaccessible >>> stack, while R1-to-Rstop are placed on the normal stack.}}

    Because the stack is always DoubleWord aligned, the 3-LoBs of the
    immediate are used to indicate "special" activities on a couple of
    registers {R0, R31, R30}: R31 is rarely saved and reloaded from the stack
    but just returned to its previous value by integer arithmetic. FP can
    be updated or it can be treated like "just another register". R0 can
    be loaded directly to t->ip, or loaded into R0 for stack walk-backs.

    The corresponding LDM and STM are seldom used.

    I ran out of micro-ops for ENTER and EXIT, so they only save the LR and
    FP (on the safe stack). A separate PUSH/POP on safe stack instruction is
    used.

    I figured LDM and STM are not used often enough. PUSH / POP is used in
    many places LDM / STM might be.

    Its a fine line.

    I found more uses for an instruction that moves a number of registers randomly allocated to fixed positions (arguments to a call) than to
    move random string of registers to/from memory.

    .
    MOV R1,R10
    MOV R2,R25
    MOV R3,R17
    CALL Subroutine
    . ; deal with any result


    My 66000 has an instruction to do that? I'd not seen an instruction like
    that. It is almost like a byte map. I can see how it could be done.
    Another instruction to add to the ISA. My compiler does not do such a
    nice job of packing the register moves together though.
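
    For what it is worth, the semantics of such a "pack the arguments"
    operation fit in a few lines of C: a short list of source register
    numbers is copied into the fixed argument registers R1..Rn. This is
    purely illustrative - per the thread, no such instruction actually
    exists in either ISA:

    #include <stdint.h>

    /* Hypothetical "move list to argument registers" operation: srcs[] holds
       the register numbers the compiler picked; they land in R1..Rn.
       reg[] stands in for the architectural register file. */
    static void mov_args(uint64_t reg[32], const uint8_t *srcs, int n)
    {
        uint64_t tmp[8];
        int i;
        /* read every source first so overlapping cases still work */
        for (i = 0; i < n && i < 8; i++)
            tmp[i] = reg[srcs[i]];
        for (i = 0; i < n && i < 8; i++)
            reg[1 + i] = tmp[i];
    }

    /* The MOV R1,R10 / MOV R2,R25 / MOV R3,R17 sequence above becomes:
           uint8_t srcs[] = { 10, 25, 17 };
           mov_args(reg, srcs, 3);                                      */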

    For context switching a whole bunch of load / store instructions are
    used. There is context switching in only a couple of places.

    I use a cache-model for thread-state {program-status-line and the
    register file}.

    The high level simulator, leaves all of the context in memory without
    loading it or storing it. Thus this serves as a pipeline Oracle so if
    the OoO pipeline makes a timing error, the Oracle stops the thread in
    its tracks.

    Thus::

    .
    .
    -----interrupt detected
    . change CS (cs--) <---
    . access threadState[cs]
    . t->ip = dispatcher
    . t->reg[0] = why
    dispatcher in control
    .
    .
    .
    RET
    SVR
    .
    .

    In your typical interrupt/exception control transfers, there is
    no code to actually switch state. Just like there is no code to
    switch a cache line that takes a miss.

    The My 66000 hardware takes care of it automatically? Interrupts push
    and pop context in my system.

    (*) The cs-- is all that is necessary to change from one Thread State
    to another in its entirety.

    I think the SP should be identified as PUSH / POP would be the only
    instructions assuming the SP register. Otherwise any register could be >>>> chosen by the compiler.

    I started with that philosophy--and begrudgingly went away from it as
    a) the compiler took form
    b) we started adding instructions to ISA to remove instructions from
    code footprint.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch on Thu Nov 27 00:44:25 2025
    From Newsgroup: comp.arch

    On Sun, 23 Nov 2025 23:58:16 -0500, Robert Finch <robfi680@gmail.com>
    wrote:

    On 2025-11-23 3:13 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:
    On 2025-11-22 10:20 p.m., MitchAlsup wrote:
    Robert Finch <robfi680@gmail.com> posted:
    On 2025-11-11 2:30 p.m., MitchAlsup wrote:
    Robert Finch <robfi680@gmail.com> posted:

    Typical process for NaN boxing is to set the high order bits of the
    value which causes the value to appear to be a NaN at higher precision.

    Any FP value representable in lower precision can be exactly represented
    in higher precision.

    I have been thinking about using some of the high order bits of the NaN
    (eg bits 32 to 51) to indicate the precision of the boxed value.

    When My 66000 generates a NaN it inserts the cause in the 3 HoBs and
    inserts IP in the LoBs. Nothing prevents you from overwriting the NaN,
    but I thought it was best to point at the causing-instruction and an
    encoded "why" the NaN was generated. The cause is a 3-bit index to the
    7 defined IEEE exceptions.

    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.

    Suggest you read::
    https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.

    Okay, it sounds like there are good reasons to use the HoBs. But I think
    it is only when converting precisions that it makes a difference. I have
    the float package moving the LoBs of a larger precision to the LoBs of
    the lower precision if a NaN (or infinity) is present. I do not think
    this consumes any more logic. It looks like just wires. It looks to be a
    three bit mux on the low order bits going the other way.

    The other part of the paper's reasoning is that if you want to insert
    some portion of IP in NaN, doing it bit-reversed enables conversions
    to smaller and larger to lose as few bits as possible. The realization
    was a surprise to me (yesterday).


    It is probably not possible to embed enough IP information in smaller
    floating-point formats (<=16-bit) to be worthwhile. For 32-bit floats
    only about 18-bits of the address can be stored. It looks like different
    formats are going to handle NaNs differently, which I find somewhat
    undesirable.
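
    The bit-reversal trick can be made concrete in a few lines of C:
    reversing the low IP bits before storing them puts the most informative
    (low-order) address bits at the top of the payload, which is what a
    narrowing conversion usually keeps. The payload widths below (51 bits
    for binary64, 22 for binary32) are assumptions made for the sketch:

    #include <stdint.h>
    #include <stdio.h>

    /* Reverse the low 'width' bits of x. */
    static uint64_t bitrev(uint64_t x, int width)
    {
        uint64_t r = 0;
        for (int i = 0; i < width; i++)
            r = (r << 1) | ((x >> i) & 1);
        return r;
    }

    /* Store the bit-reversed low IP bits so the most useful (low-order)
       address bits sit at the TOP of the payload and survive narrowing.
       Payload widths (51 and 22 bits) are illustrative assumptions. */
    static uint64_t nan_payload64(uint64_t ip) { return bitrev(ip, 51); }
    static uint32_t nan_payload32(uint64_t ip) { return (uint32_t)bitrev(ip, 22); }

    int main(void)
    {
        uint64_t ip = 0x0000000080001234ull;
        printf("%016llx -> %013llx (64-bit payload)\n",
               (unsigned long long)ip, (unsigned long long)nan_payload64(ip));
        printf("%016llx -> %06lx   (32-bit payload)\n",
               (unsigned long long)ip, (unsigned long)nan_payload32(ip));
        return 0;
    }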

    This discussion reminds me somewhat of Ivan Godard's description of
    NAR faults on the Mill. Because of wide issue, just having the
    address of the offending instruction was not very helpful - you needed
    to know which of the many operations within the instruction was the
    culprit. And because NARs flow through speculated code, the offending
    site could be hundreds of operations away by the time the fault is
    signaled and pops out.

    Ivan discusses NARs in the "metadata" talk. Around 1h:25m, he
    describes the way Mill (approximately) encodes a fault location using
    a hash code created from the address of the code block, the
    instruction's issue cycle within the block, and the slot of the
    operation that failed. They stick the LO bits of this hash into
    however many bits are available for the payload. The NAR itself has a
    type, and the payload width depends on the data type produced by the
    faulting operation.
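
    A toy version of that scheme in C, just to make the shape concrete - the
    mixing constants and field widths are invented here, the talk does not
    give them: hash the code block address, the issue cycle within the block
    and the slot, then keep as many low bits as the faulting result's
    payload allows.

    #include <stdint.h>

    /* NAR-style location hash (toy): mix the code block address, the issue
       cycle within the block and the slot number, then truncate to however
       many payload bits the faulting result's type can hold.  The mixing
       constants are arbitrary (splitmix64-style), not the Mill's. */
    static uint64_t mix64(uint64_t x)
    {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ull;
        x ^= x >> 27; x *= 0x94d049bb133111ebull;
        x ^= x >> 31;
        return x;
    }

    static uint64_t nar_location(uint64_t block_addr, unsigned cycle,
                                 unsigned slot, int payload_bits)
    {
        uint64_t h = mix64(block_addr ^ ((uint64_t)cycle << 8) ^ slot);
        return payload_bits >= 64 ? h : h & ((1ull << payload_bits) - 1);
    }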

    Obviously that all is Mill specific, but it may stimulate another,
    better idea that is relevant to your design.


    YMMV.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Thu Nov 27 15:50:37 2025
    From Newsgroup: comp.arch

    In article <1763868010-5857@newsgrouper.org>,
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Robert Finch <robfi680@gmail.com> posted:
    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.

    Suggest you read:: https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.

    I wasn't sure where to join the NaN conversation, but this seems like a
    good spot.

    We've had 40+ years of different architectures handling NaNs, (what to
    encode in them to indicate where the first problem occurred) and all architectures do something different when operating on two NaNs:

    From that paper:
    - Intel using x87 instructions: NaN2 if both quiet, NaN1 if NaN2 is signalling
    - Intel using SSE instructions: NaN1
    - AMD using x87 instructions: NaN2
    - AMD using SSE instructions: NaN1
    - IBM Power PC: NaN1
    - IBM Z mainframe: NaN1 if both quiet, [precedence] to signalling NaN
    - ARM: NaN1 if both quiet, [precedence] to signalling NaN

    And adding one more not in that paper:
    - RISC-V: Always returns canonical NaN only, for Single: 0x7fc00000

    I'll just say whatever your NaN handling is, for the source code:

    A = B + C + D + E

    then for whatever values B,C,D,E having NaN or not, the value of A should
    be well defined and not dependent on the order of operations. How can you
    use bits in the NaN value for debugging if the hardware is returning arbitrary results when NaNs collide? Users have almost no control over whether
    A = B + C treats B as the first argument or the second.

    I think encoding stuff in NaN is a very 80's idea: turning on exceptions
    costs performance, so we want to debug after-the-fact using NaNs.

    But I think RISC-V has the right modern idea: make hardware fast so you can simply always enable Invalid Operation Traps (and maybe Overflow, if
    infinities are happening), and then stop right at the point of NaN being
    first created. So the NaN propagation doesn't matter.

    I think the common current debug strategy for NaNs is run at full speed
    with exceptions masked, and if you get NaNs in your answer, you re-run
    with exceptions on and then debug the traps that occur. And no one looks at the NaN values at all, just their presence.

    So rather than spending time on NaN encoding, make it so that FP performance
    is not affected by enabling exceptions, so we can skip the re-running step,
    and just run with Invalid Operations trapping enabled. And then just
    return canonical NaNs.

    Kent
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Nov 27 19:16:24 2025
    From Newsgroup: comp.arch

    On Thu, 27 Nov 2025 15:50:37 -0000 (UTC)
    kegs@provalid.com (Kent Dickey) wrote:

    In article <1763868010-5857@newsgrouper.org>,
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Robert Finch <robfi680@gmail.com> posted:
    My float package puts the cause in the 3 LoBs. The cause is always
    in the low order bits of the register then, even when the
    precision is different. But the address is not tracked. The
    package does not have access to the address. Seems like NaN trace
    hardware might be useful.

    Suggest you read:: https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.

    I wasn't sure where to join the NaN conversation, but this seems like
    a good spot.

    We've had 40+ years of different architectures handling NaNs, (what to
    encode in them to indicate where the first problem occurred) and all architectures do something different when operating on two NaNs:

    From that paper:
    - Intel using x87 instructions: NaN2 if both quiet, NaN1 if NaN2 is signalling
    - Intel using SSE instructions: NaN1
    - AMD using x87 instructions: NaN2
    - AMD using SSE instructions: NaN1
    - IBM Power PC: NaN1
    - IBM Z mainframe: NaN1 if both quiet, [precedence] to signalling NaN
    - ARM: NaN1 if both quiet, [precedence] to signalling NaN

    And adding one more not in that paper:
    - RISC-V: Always returns canonical NaN only, for Single: 0x7fc00000

    I'll just say whatever your NaN handling is, for the source code:

    A = B + C + D + E

    then for whatever values B,C,D,E having NaN or not, the value of A
    should be well defined and not dependent on the order of operations.
    How can you use bits in the NaN value for debugging if the hardware
    is returning arbitrary results when NaNs collide? Users have almost
    no control over whether A = B + C treats B as the first argument or
    the second.

    I think encoding stuff in NaN is a very 80's idea: turning on
    exceptions costs performance, so we want to debug after-the-fact
    using NaNs.

    But I think RISC-V has the right modern idea: make hardware fast so
    you can simply always enable Invalid Operation Traps (and maybe
    Overflow, if infinities are happening), and then stop right at the
    point of NaN being first created. So the NaN propagation doesn't
    matter.

    I think the common current debug strategy for NaNs is run at full
    speed with exceptions masked, and if you get NaNs in your answer, you
    re-run with exceptions on and then debug the traps that occur. And
    no one looks at the NaN values at all, just their presence.

    So rather than spending time on NaN encoding, make it so that FP
    performance is not affected by enabling exceptions, so we can skip
    the re-running step, and just run with Invalid Operations trapping
    enabled. And then just return canonical NaNs.

    Kent

    How do you ship your software to the end user? Are exceptions masked off
    or enabled?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Nov 28 06:45:58 2025
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Robert Finch <robfi680@gmail.com> posted:


    The Qupls PUSH and POP instructions have room for six register fields.
    Should one of the fields be used to identify the stack pointer register
    allowing five registers to be pushed or popped? Or should the stack
    pointer register be assumed so that six registers may be pushed or popped?
    My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
    instead of giving it a number of registers, there is a start register
    and a stop register, so 1-to-32 registers can be saved/restored. The immediate contains how much stack space to allocate/deallocate.

    That seems both confining for the compiler designers and less
    useful than the VAX-11 register mask stored in the instruction stream
    at the function entry point(s).

    That's the nice thing when the ISA, the ABI (including the calling
    convention) and the compiler are designed together - this allows
    ENTER and EXIT to work just as well, without needing the full
    generality.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Nov 28 07:17:07 2025
    From Newsgroup: comp.arch

    Kent Dickey <kegs@provalid.com> schrieb:

    I'll just say whatever your NaN handling is, for the source code:

    A = B + C + D + E

    then for whatever values B,C,D,E having NaN or not, the value of A should
    be well defined and not dependent on the order of operations.

    That is not possible in general with normal floating point (you could
    guarantee it if you keep track of all digits). But normally,
    1 + 1e-9 - 1 will be different from 1 - 1 + 1e-9.
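
    A quick C check makes the point (the exact low bits depend on the
    platform's binary64 rounding, but the two results do differ):

    #include <stdio.h>

    int main(void)
    {
        double a = 1.0 + 1e-9 - 1.0;  /* (1 + 1e-9) rounds first, then -1 */
        double b = 1.0 - 1.0 + 1e-9;  /* exactly 0, then + 1e-9           */
        printf("%.17g\n%.17g\nequal: %d\n", a, b, a == b);  /* equal: 0 */
        return 0;
    }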

    (BTW, Fortran allows re-arrangement, unless there are parentheses,
    which have to be honored.)
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Nov 28 02:59:36 2025
    From Newsgroup: comp.arch

    On 2025-11-27 10:50 a.m., Kent Dickey wrote:
    In article <1763868010-5857@newsgrouper.org>,
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Robert Finch <robfi680@gmail.com> posted:
    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.

    Suggest you read::
    https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.

    I wasn't sure where to join the NaN conversation, but this seems like a
    good spot.

    We've had 40+ years of different architectures handling NaNs, (what to
    encode in them to indicate where the first problem occurred) and all architectures do something different when operating on two NaNs:

    From that paper:
    - Intel using x87 instructions: NaN2 if both quiet, NaN1 if NaN2 is signalling
    - Intel using SSE instructions: NaN1
    - AMD using x87 instructions: NaN2
    - AMD using SSE instructions: NaN1
    - IBM Power PC: NaN1
    - IBM Z mainframe: NaN1 if both quiet, [precedence] to signalling NaN
    - ARM: NaN1 if both quiet, [precedence] to signalling NaN

    And adding one more not in that paper:
    - RISC-V: Always returns canonical NaN only, for Single: 0x7fc00000

    I'll just say whatever your NaN handling is, for the source code:

    A = B + C + D + E

    then for whatever values B,C,D,E having NaN or not, the value of A should
    be well defined and not dependent on the order of operations. How can you use bits in the NaN value for debugging if the hardware is returning arbitrary
    results when NaNs collide? Users have almost no control over whether
    A = B + C treats B as the first argument or the second.

    I think encoding stuff in NaN is a very 80's idea: turning on exceptions costs performance, so we want to debug after-the-fact using NaNs.
    But I think RISC-V has the right modern idea: make hardware fast so
    you can
    simply always enable Invalid Operation Traps (and maybe Overflow, if infinities are happening), and then stop right at the point of NaN being first created. So the NaN propagation doesn't matter.

    I think the common current debug strategy for NaNs is run at full speed
    with exceptions masked, and if you get NaNs in your answer, you re-run
    with exceptions on and then debug the traps that occur. And no one looks at the NaN values at all, just their presence.

    So rather than spending time on NaN encoding, make it so that FP performance is not affected by enabling exceptions, so we can skip the re-running step, and just run with Invalid Operations trapping enabled. And then just
    return canonical NaNs.

    Kent

    I do not know how one would make FP performance improve and have
    exceptions at the same time. The FP would have to operate asynchronously.
    The only thing I can think of is to have core(s) specifically dedicated
    to performance FP that do not service interrupts.

    Given that nobody looks at the NaN values it is tempting to leave out
    the NaN info, but I think I will still have it as an input to modules
    where NaNs can be generated (when I get around to it). The NaN info can
    always be set to zeros then and the extra logic should disappear then.

    I think that there may be a reason why nobody looks at the NaN values.
    IDK, but maybe the debugger does not make it easy to spot. A NaN display
    with a random assortment of digits is pretty useless. But if the debugger
    were to display all the address and other info, would it get used?


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 28 07:21:14 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
    instead of giving it a number of registers, there is a start register
    and a stop register, so 1-to-32 registers can be saved/restored. The
    immediate contains how much stack space to allocate/deallocate.

    That seems both confining for the compiler designers and less
    useful than the VAX-11 register mask stored in the instruction stream
    at the function entry point(s).

    We, and by that I mean Brian, have not found that so. In the early stages
    we did see a bit of that, and then Brian found a way to allocate registers from R31-down-to-R16 that fit the ENTER/EXIT model and we find essentially nothing (that is no more instructions in the stream than necessary).

    Part of the distinction is::
    a) how arguments/results are passed to/from subroutines.
    b) having a minimum of 7-temporary registers at entry point.
    c) how the stack frame is designed/allocated wrt:
    1) my arguments and my results,
    2) his arguments and his results,
    3) varargs,
    4) dynamic arrays on stack,
    5) stack frame allocation at ENTER,
    d) freedom to use R30 as FP or as joe-random-register.

    These were all co-designed together, after much of the instruction
    emission logic was sorted out.

    What is "my" and "his"?

    Consider this as a VAX CALL model except that the mask was replaced by
    a list of registers, which were then packed towards R31 instead of a bit vector.

    Do you need both a start and a stop register?

    As far as I understand, ENTER is at the entry point of the callee, and
    EXIT is before the return or tail call; actually, the tail call case
    answers my question above:

    If the tail-caller has m callee-saved registers and the tail-callee
    has n callee-saved registers, then

    if m>n, generate an EXIT that restores the m-n registers;
    if m<n, generate an ENTER that saves the n-m registers;
    Generate a jump to behind the ENTER instruction of the callee.

    That is, assuming that the tail-callee is in the same compilation unit
    as the tail-caller; otherwise the tail-caller needs to do a full EXIT
    and then jump to the normal entry point of the tail-callee, which does
    a full ENTER.

    And in these ENTERs and EXITs, you don't end (or start) at the same
    point as in the regular ENTERs and EXITs.
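
    As a sketch of that selection logic in C (emit_exit, emit_enter and
    emit_jump are placeholders for whatever the code generator really
    provides), the same-compilation-unit tail-call case is one comparison:

    #include <stdio.h>

    /* Placeholder emitters for the sketch. */
    static void emit_exit (int n) { printf("EXIT  restoring %d regs\n", n); }
    static void emit_enter(int n) { printf("ENTER saving %d regs\n", n); }
    static void emit_jump (const char *l) { printf("JMP   %s\n", l); }

    /* Same-compilation-unit tail call: the caller saved m callee-saved
       registers, the callee wants n; adjust the saved set, then jump to
       the point just past the callee's own ENTER. */
    static void emit_tail_call(int m, int n, const char *callee_body)
    {
        if (m > n)
            emit_exit(m - n);       /* drop the extra saves   */
        else if (m < n)
            emit_enter(n - m);      /* add the missing saves  */
        emit_jump(callee_body);
    }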

    And yes, for saving the callee-saved registers I don't see a need for
    a mask. For caller-saved registers, it's different. Consider:

    long foo(...)
    {
        long x = ...;
        long y = ...;
        long z = ...;
        if (...) {
            bar(...);
            x = ...;
        } else if (...) {
            baz(...);
            y = ...;
        } else {
            bla(...);
            z = ...;
        }
        return x+y+z;
    }

    Here one could put x, y, and z in callee-saved registers (and use ENTER
    and EXIT for them), but that would need to save and later restore
    three registers on every path through foo().

    Or one could put it in caller-saved registers and save only two
    registers on every path through foo(). Then one needs to save y and z
    around the call to bar(), x and z around the call to baz(), and x and
    y around the call to bla(). For any register allocation, in one of
    the cases the registers to be saved are not contiguous. So if one
    would use a save-multiple or load-multiple instruction for that, a
    mask would be needed.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Nov 28 12:56:30 2025
    From Newsgroup: comp.arch

    Robert Finch wrote:
    On 2025-11-27 10:50 a.m., Kent Dickey wrote:

    I think encoding stuff in NaN is a very 80's idea: turning on exceptions
    costs performance, so we want to debug after-the-fact using NaNs.
    But I think RISC-V has the right modern idea: make hardware fast so
    you can
    simply always enable Invalid Operation Traps (and maybe Overflow, if
    infinities are happening), and then stop right at the point of NaN being
    first created. So the NaN propagation doesn't matter.

    I think the common current debug strategy for NaNs is run at full speed
    with exceptions masked, and if you get NaNs in your answer, you re-run
    with exceptions on and then debug the traps that occur. And no one
    looks at
    the NaN values at all, just their presence.

    So rather than spending time on NaN encoding, make it so that FP
    performance
    is not affected by enabling exceptions, so we can skip the re-running
    step,
    and just run with Invalid Operations trapping enabled. And then just
    return canonical NaNs.

    Kent

    I do not know how one would make FP performance improve and have
    exceptions at the same time. The FP would have to operate asynchronous.
    The only thing I can think of is to have core(s) specifically dedicated
    to performance FP that do not service interrupts.

    Why do you think that enabling FP exceptions "costs performance",
    by which I assume you mean that, say, an FPADD with exceptions
    enabled is slower than disabled?

    The FP exceptions are rising-edge triggered based on individual
    instruction calculation status, that is before being merged (OR'd)
    into the overall FP status. If an FP instruction has unmasked exceptions
    then mark the uOp as Except'd and recognize it at Retire like any
    other exception. This also assumes that the overall FP status is
    updated (merged) at Retire so it only contains status flags for
    FP instructions older than the exception point.
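
    In C-simulator terms - reusing the I->raised / t->raised idiom from
    earlier in the thread, with fp_flags and fp_enables as invented field
    names - the idea is roughly:

    #include <stdbool.h>
    #include <stdint.h>

    /* Per-uOp FP result status, produced by the functional unit. */
    typedef struct { uint8_t fp_flags; bool excepted; } UopFP;

    /* Architectural FP state: sticky status plus a trap-enable mask. */
    typedef struct { uint8_t fp_status; uint8_t fp_enables; } FpState;

    /* At execute: compare this instruction's own flags with the enables.
       No shared state is touched, so enabled traps add no serialization. */
    static void fp_execute(UopFP *u, const FpState *s, uint8_t result_flags)
    {
        u->fp_flags = result_flags;
        u->excepted = (result_flags & s->fp_enables) != 0;
    }

    /* At retire, in program order: an excepted uOp takes the trap, and the
       sticky status only holds flags from instructions older than it. */
    static bool fp_retire(const UopFP *u, FpState *s)
    {
        if (u->excepted)
            return true;
        s->fp_status |= u->fp_flags;
        return false;
    }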




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Nov 28 19:35:16 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-26 7:08 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-26 3:57 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    In this case, put the cause in a container the instruction drags down >>>>> the pipe, and retrieve it when you do have address access to where it >>>>> needs to go.

    I may change things to pass the address around in the float package. >>>> Putting the address into the NaN later may cause issues with timing. It >>>> adds a mux into things. May be better to use the original NaN mux in the >>>> float modules. May call it a NaN identity field instead of an address. >>>
    For example: when a My 66000 instruction needs to raise an exception
    the Inst *I argument contains a field I->raised which is set (1<<excpt) >>> and at the end of the pipe (at retire), t->raised |= I->raised. Where
    we have a *t there is also t->ip. So, you don't have to drag Thread *t >>> through all the subroutine calls, but you can easily access t->raised
    at the point you do have access to t->ip.

    Had trouble reading that, sounds like goobly-goop. But I believe I
    figured it out.

    Sounds like the address is inserted at the end of the pipe which I am
    sure is not the case.

    I figured this out: the NaN address must be embedded in the result by
    the time the result updates the bypass network and registers so that it
    is available to other instructions.

    The address is available at the start of the calc from the reservation
    station entry. Me thinks it must be embedded when the NaN result status
    is set, provided there is not already a NaN. The existing (first) NaN
    must propagate through.

    See last calculation line in the following::

    void RunInst( Chip *chip )
    {
    for( uint64_t i = 0; i < chip->cores; i++ )
    {
    ContextStack *cpu = &core[i];
    uint8_t cs = cpu->cs;
    Thread *t;
    Inst *I;
    uint16_t raised;

    if( cpu->interrupt.raised & ((((signed)1)<<63) >> cpu->priority) )
    { // take an interrupt
    cpu->cs = cpu->interrupt.cs;
    cpu->priority = cpu->interrupt.priority;
    t = context[cpu->cs];
    t->reg[0] = cpu->interrupt.message;
    }
    else if( raised = t->raised & t->enabled )
    { // take an exception
    cpu->cs--;
    t = context[cpu->cs];
    t->reg[0] = FT1( raised ) | EXCPT;
    t->reg[1] = I->inst;
    t->reg[2] = I->src1;
    t->reg[3] = I->src2;
    t->reg[4] = I->src3;
    }
    else
    { // run an instruction
    t = context[cpu->cs];
    memory( FETCH, t->ip, &I->inst );
    t->ip += 4;
    majorTable[ I->inst.major ]( t, I );
    t->raised |= I->raised; // propagate raised here
    }
    }
    }

    That looks like code for a simulator.

    It is (IS) code for a non-timing simulator {a "right answer" simulator
    if you please.}

    How closely does it follow the operation of the CPU?

    CPUs have a pipeline, I is the quantity that gets dragged down the
    pipe, *t is the control registers of that CPU.

    I do not see where 'I' is initialized.

    Call to memory(). Then as I gets dragged down the pipeline, more
    fields are initialized. I drag the whole structure mostly for
    debug purposes.

    It has been a while since I worked on simulator code.

    The IP value is just muxed in via a five-to-one mux for the significand.
    Had to account for NaNs, infinities and overflow anyway. The address gets propagated with some flops, but flops are inexpensive in an FPGA.

    always_comb
      casez({aNan5,bNan5,qNaNOutab5,aInf5,bInf5,overab5})
      6'b1?????: moab6 <= {1'b1,1'b1,a5[fp64Pkg::FMSB-1:0],{fp64Pkg::FMSB+1{1'b0}}};
      6'b01????: moab6 <= {1'b1,1'b1,b5[fp64Pkg::FMSB-1:0],{fp64Pkg::FMSB+1{1'b0}}};
      6'b001???: moab6 <= {1'b1,qNaN|(64'd4 << (fp64Pkg::FMSB-4))|adr5[63:16],{fp64Pkg::FMSB+1{1'b0}}}; // multiply inf * zero
      6'b0001??: moab6 <= 0; // mul inf's
      6'b00001?: moab6 <= 0; // mul inf's
      6'b000001: moab6 <= 0; // mul overflow
      default:   moab6 <= fractab5;
      endcase



    Modified NaN support in the float package to store to the HOBs.

    Survey says:

    The Qupls PUSH and POP instructions have room for six register fields.
    Should one of the fields be used to identify the stack pointer register
    allowing five registers to be pushed or popped? Or should the stack
    pointer register be assumed so that six registers may be pushed or popped?

    My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
    instead of giving it a number of registers, there is a start register
    and a stop register, so 1-to-32 registers can be saved/restored. The
    immediate contains how much stack space to allocate/deallocate.

    {{when Safe-Stack is enabled:: Rstart-to-R0 are placed on the inaccessible
    stack, while R1-to-Rstop are placed on the normal stack.}}

    Because the stack is always DoubleWord aligned, the 3-LoBs of the
    immediate are used to indicate "special" activities on a couple of
    registers {R0, R31, R30}: R31 is rarely saved and reloaded from the stack
    but just returned to its previous value by integer arithmetic. FP can
    be updated or it can be treated like "just another register". R0 can
    be loaded directly to t->ip, or loaded into R0 for stack walk-backs.

    The corresponding LDM and STM are seldom used.

    I ran out of micro-ops for ENTER and EXIT, so they only save the LR and
    FP (on the safe stack). A separate PUSH/POP on safe stack instruction is >> used.

    I figured LDM and STM are not used often enough. PUSH / POP is used in
    many places LDM / STM might be.

    Its a fine line.

    I found more uses for an instruction that moves a number of registers randomly allocated to fixed positions (arguments to a call) than to
    move random string of registers to/from memory.

    .
    MOV R1,R10
    MOV R2,R25
    MOV R3,R17
    CALL Subroutine
    . ; deal with any result


    My 66000 has an instruction to do that?

    No, but the thought that it could be profitable to have such an
    instruction is a common recurrence.

    I'd not seen an instruction like that. It is almost like a byte map. I can see how it could be done.
    Another instruction to add to the ISA. My compiler does not do such a
    nice job of packing the register moves together though.

    Your instruction size can support such a thing, mine would be difficult.

    For context switching a whole bunch of load / store instructions are
    used. There is context switching in only a couple of places.

    I use a cache-model for thread-state {program-status-line and the
    register file}.

    The high level simulator, leaves all of the context in memory without loading it or storing it. Thus this serves as a pipeline Oracle so if
    the OoO pipeline makes a timing error, the Oracle stops the thread in
    its tracks.

    Thus::

    .
    .
    -----interrupt detected
    . change CS (cs--) <---
    . access threadState[cs]
    . t->ip = dispatcher
    . t->reg[0] = why
    dispatcher in control
    .
    .
    .
    RET
    SVR
    .
    .

    In your typical interrupt/exception control transfers, there is
    no code to actually switch state. Just like there is no code to
    switch a cache line that takes a miss.

    The My 66000 hardware takes care of it automatically? Interrupts push
    and pop context in my system.

    Yes, context switching is automatic and re-entrant. Whereas exceptions
    walk up the privilege stack, interrupts go directly to the specified
    context on the stack. So, you could be operating at high privilege
    and low priority, only to get interrupted by lower privilege at higher priority.

    (*) The cs-- is all that is necessary to change from one Thread State
    to another in its entirety.

    I think the SP should be identified as PUSH / POP would be the only
    instructions assuming the SP register. Otherwise any register could be >>>> chosen by the compiler.

    I started with that philosophy--and begrudgingly went away from it as
    a) the compiler took form
    b) we started adding instructions to ISA to remove instructions from
    code footprint.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Nov 28 19:49:31 2025
    From Newsgroup: comp.arch


    kegs@provalid.com (Kent Dickey) posted:

    In article <1763868010-5857@newsgrouper.org>,
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Robert Finch <robfi680@gmail.com> posted:
    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.

    Suggest you read:: https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.

    I wasn't sure where to join the NaN conversation, but this seems like a
    good spot.

    We've had 40+ years of different architectures handling NaNs, (what to
    encode in them to indicate where the first problem occurred) and all architectures do something different when operating on two NaNs:

    From that paper:
    - Intel using x87 instructions: NaN2 if both quiet, NaN1 if NaN2 is signalling
    - Intel using SSE instructions: NaN1
    - AMD using x87 instructions: NaN2
    - AMD using SSE instructions: NaN1
    - IBM Power PC: NaN1
    - IBM Z mainframe: NaN1 if both quiet, [precedence] to signalling NaN
    - ARM: NaN1 if both quiet, [precedence] to signalling NaN

    And adding one more not in that paper:
    - RISC-V: Always returns canonical NaN only, for Single: 0x7fc00000

    I'll just say whatever your NaN handling is, for the source code:

    A = B + C + D + E

    then for whatever values B,C,D,E having NaN or not, the value of A should
    be well defined and not dependent on the order of operations.

    A nice philosophy, but how does one achieve that when the compiler is allowed to encode the above as::

    A = (B+C)+(D+E)
    or
    A = (B+D)+(C+E)
    or
    A = (B+E)+(C+D)
    or
    A = (B+C)+(E+D)
    or
    ...

    No single set of rules can give the first created NaN because which
    is first created is dependent on how the compiler ordered the FADDs.

    How can you
    use bits in the NaN value for debugging if the hardware is returning arbitrary
    results when NaNs collide?

    My 66000 has specific rules covering {Operand NaNs, Created NaNs}
    which attempt to preserve the earliest created NaN and to properly
    propagate Operand NaN values.

    Users have almost no control over whether
    A = B + C treats B as the first argument or the second.

    Optimizers treat B and C as independent optimization opportunities.

    I think encoding stuff in NaN is a very 80's idea: turning on exceptions costs performance, so we want to debug after-the-fact using NaNs.

    But I think RISC-V has the right modern idea: make hardware fast so you can simply always enable Invalid Operation Traps (and maybe Overflow, if infinities are happening), and then stop right at the point of NaN being first created. So the NaN propagation doesn't matter.

    This is a 1960s idea. Stop at the first occurrence of trouble. More
    workable than NaNs, but has its own set of baggage--for example how
    does one stop 13 elements into a Vector instruction ???

    {{BTW: My 66000 has a way to scalarize vector code 13 elements into
    the vector, and after the exception has been handled, to reenter
    vector operation.}}

    I think the common current debug strategy for NaNs is run at full speed
    with exceptions masked, and if you get NaNs in your answer, you re-run
    with exceptions on and then debug the traps that occur. And no one looks at the NaN values at all, just their presence.

    Yes, this is a common strategy, and with the list of architectures that
    "all do it differently" what else could one expect.

    So rather than spending time on NaN encoding, make it so that FP performance is not affected by enabling exceptions, so we can skip the re-running step, and just run with Invalid Operations trapping enabled. And then just
    return canonical NaNs.

    My 66000 has that option available.

    Kent
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Nov 28 20:05:00 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
    instead of giving it a number of registers, there is a start register
    and a stop register, so 1-to-32 registers can be saved/restored. The
    immediate contains how much stack space to allocate/deallocate.

    That seems both confining for the compiler designers and less
    useful than the VAX-11 register mask stored in the instruction stream
    at the function entry point(s).

    We, and by that I mean Brian, have not found that so. In the early stages we did see a bit of that, and then Brian found a way to allocate registers from R31-down-to-R16 that fit the ENTER/EXIT model and we find essentially nothing (that is no more instructions in the stream than necessary).

    Part of the distinction is::
    a) how arguments/results are passed to/from subroutines.
    b) having a minimum of 7-temporary registers at entry point.
    c) how the stack frame is designed/allocated wrt:
    1) my arguments and my results,
    2) his arguments and his results,
    3) varargs,
    4) dynamic arrays on stack,
    5) stack frame allocation at ENTER,
    d) freedom to use R30 as FP or as joe-random-register.

    These were all co-designed together, after much of the instruction emission logic was sorted out.

    What is "my" and "his"?

    My arguments are the arguments to me (this subroutine)
    His arguments are the arguments to subroutines I call

    Consider this as a VAX CALL model except that the mask was replaced by
    a list of registers, which were then packed towards R31 instead of a bit vector.

    Do you need both a start and a stop register?

    Consider:
    ENTER R19,R31,#constant
    versus
    ENTER R19,R0,#constant

    The former saves R19-through-R31 and leaves the return address in R0.

    The latter saves R19-through-R0, leaving the return address on the stack.

    This should illustrate that the stopping register is compiler chosen.
    It is obvious that the starting point should be compiler chosen.
    Thus, start and stop are independent.

    Now Consider:
    ENTER R19,R9,#constant

    Not only are R19-R0 saved on the stack, R1-R9 are saved on the stack immediately preceding the memory based arguments, thus varargs only
    changes the stop register in ENTER; and this makes a linear vector
    of arguments for valist.
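
    A small C model of which registers ENTER Rstart,Rstop covers, as read
    from the examples above (illustrative only, not the official My 66000
    definition): the set runs from Rstart upward, wrapping from R31 to R0,
    through Rstop inclusive.

        #include <stdio.h>

        static void enter_save_set(int start, int stop)
        {
            int r = start;
            for (;;) {
                printf("save R%d\n", r);
                if (r == stop)
                    break;
                r = (r + 1) % 32;          /* wrap R31 -> R0 */
            }
        }

        int main(void)
        {
            enter_save_set(19, 31);        /* R19..R31                   */
            enter_save_set(19, 0);         /* R19..R31, then R0          */
            enter_save_set(19, 9);         /* R19..R31, R0..R9 (varargs) */
            return 0;
        }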

    As far as I understand, ENTER is at the entry point of the callee, and
    EXIT is before the return or tail call; actually, the tail call case
    answers my question above:

    If the tail-caller has m callee-saved registers and the tail-callee
    has n callee-saved registers, then

    if m>n, generate an EXIT that restores the m-n registers;
    if m<n, generate an ENTER that saves the n-m registers;
    Generate a jump to behind the ENTER instruction of the callee.

    The above sounds complicated enough to simply avoid the tail-call
    optimization if the argument lists are not similar enough.

    That is, assuming that the tail-callee is in the same compilation unit
    as the tail-caller; otherwise the tail-caller needs to do a full EXIT
    and then jump to the normal entry point of the tail-callee, which does
    a full ENTER.

    And in these ENTERs and EXITs, you don't end (or start) at the same
    point as in the regular ENTERs and EXITs.

    And yes, for saving the callee-saved registers I don't see a need for
    a mask. For caller-saved registers, it's different. Consider:

    long foo(...)
    {
        long x = ...;
        long y = ...;
        long z = ...;
        if (...) {
            bar(...);
            x = ...;
        } else if (...) {
            baz(...);
            y = ...;
        } else {
            bla(...);
            z = ...;
        }
        return x+y+z;
    }

    Here one could put x, y, and z in callee-saved registers (and use ENTER
    and EXIT for them), but that would need to save and later restore
    three registers on every path through foo().

    Or one could put it in caller-saved registers and save only two
    registers on every path through foo(). Then one needs to save y and z
    around the call to bar(), x and z around the call to baz(), and x and
    y around the call to bla(). For any register allocation, in one of
    the cases the registers to be saved are not contiguous. So if one
    would use a save-multiple or load-multiple instruction for that, a
    mask would be needed.

    There is a delicate balance between callee-save and caller-save
    registers. In many situations caller-save is better (counting
    instructions) but callee-save is better (counting cycles--mostly
    due to second order cache effects).

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Nov 28 20:09:12 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-27 10:50 a.m., Kent Dickey wrote:
    In article <1763868010-5857@newsgrouper.org>,
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Robert Finch <robfi680@gmail.com> posted:
    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.

    Suggest you read::
    https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.

    I wasn't sure where to join the NaN conversation, but this seems like a good spot.

    We've had 40+ years of different architectures handling NaNs, (what to encode in them to indicate where the first problem occurred) and all architectures do something different when operating on two NaNs:

    From that paper:
    - Intel using x87 instructions: NaN2 if both quiet, NaN1 if NaN2 is signalling
    - Intel using SSE instructions: NaN1
    - AMD using x87 instructions: NaN2
    - AMD using SSE instructions: NaN1
    - IBM Power PC: NaN1
    - IBM Z mainframe: NaN1 if both quiet, [precedence] to signalling NaN
    - ARM: NaN1 if both quiet, [precedence] to signalling NaN

    And adding one more not in that paper:
    - RISC-V: Always returns canonical NaN only, for Single: 0x7fc00000

    I'll just say whatever your NaN handling is, for the source code:

    A = B + C + D + E

    then for whatever values B,C,D,E having NaN or not, the value of A should be well defined and not dependent on the order of operations. How can you use bits in the NaN value for debugging if the hardware is returning arbitrary
    results when NaNs collide? Users have almost no control over whether
    A = B + C treats B as the first argument or the second.

    I think encoding stuff in NaN is a very 80's idea: turning on exceptions costs performance, so we want to debug after-the-fact using NaNs.
    But I think RISC-V has the right modern idea: make hardware fast so
    you can
    simply always enable Invalid Operation Traps (and maybe Overflow, if infinities are happening), and then stop right at the point of NaN being first created. So the NaN propagation doesn't matter.

    I think the common current debug strategy for NaNs is run at full speed with exceptions masked, and if you get NaNs in your answer, you re-run
    with exceptions on and then debug the traps that occur. And no one looks at
    the NaN values at all, just their presence.

    So rather than spending time on NaN encoding, make it so that FP performance
    is not affected by enabling exceptions, so we can skip the re-running step, and just run with Invalid Operations trapping enabled. And then just return canonical NaNs.

    Kent

    I do not know how one would make FP performance improve and have
    exceptions at the same time. The FP would have to operate asynchronously.

    What is it that you fail to understand what reservation stations do
    to instructions arriving at various FPUs !?!?! The stations effectively
    turn the FPUs into asynchronous calculation units.

    The only thing I can think of is to have core(s) specifically dedicated
    to performance FP that do not service interrupts.

    Given that nobody looks at the NaN values it is tempting to leave out
    the NaN info, but I think I will still have it as an input to modules
    where NaNs can be generated (when I get around to it). The NaN info can always be set to zeros, and the extra logic should then disappear.

    I think that there may be a reason why nobody looks at the NaN values.
    IDK, but maybe the debugger does not make it easy to spot. A NaN display
    with a random assortment of digits is pretty useless. But if the debugger were
    to display the address and other info, would it get used?

    That is the idea behind the why code and the IP in My 66000 NaNs.

    I still do not think they will be used "all that often" simply because
    so many other ways to generate and propagate NaNs exist--and there is
    no "universal" consensus.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Nov 28 20:39:07 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Just today, I compiled

    u4 = u1/10;
    u3 = u1%10;


    (plus some surrounding code) with gcc-14 in three contexts. Here's
    the code for two of them (the third one is similar to the second one):

    Care to present a self-contained example? Otherwise, your
    example and its analysis are meaningless to the reader.

    I doubt that a self-contained example will be more meaningful to all
    but the most determined readers, but anyway, the preprocessed C code is at

    https://www.complang.tuwien.ac.at/anton/tmp/engine-fast.i

    Interesting test case. You might be interested to know that there
    is some improvement. With a relatively recent trunk, gcc compiles
    the offending sequence to

    movabsq $-3689348814741910323, %rax
    movq %r13, %rcx
    mulq %r13
    movq %rdx, %r13
    shrq $3, %rdx
    shrq $3, %r13
    movq %rdx, %r9
    leaq 0(%r13,%r13,4), %rax
    addq %rax, %rax
    subq %rax, %rcx
    movq %rcx, %r13

    There is improvement (only a single mulq) but the two shrq
    instructions are clearly redundant, so there is still some
    confusion there.
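
    For reference, a hedged C sketch of the strength reduction being aimed
    at (the movabsq immediate above is 0xCCCCCCCCCCCCCCCD, i.e. 2^67/10
    rounded up, so one multiply-high plus a shift by 3 gives the quotient,
    and a multiply-and-subtract recovers the remainder; uses GCC's
    __int128, and the function name is made up):

        #include <stdint.h>

        void div10(uint64_t u1, uint64_t *quot, uint64_t *rem)
        {
            /* high 64 bits of u1 * ceil(2^67 / 10) */
            uint64_t hi = (uint64_t)(((unsigned __int128)u1
                                       * 0xCCCCCCCCCCCCCCCDull) >> 64);
            *quot = hi >> 3;            /* u1 / 10 */
            *rem  = u1 - *quot * 10;    /* u1 % 10 */
        }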

    Unfortunately, the usual tool for reducing test cases to something
    manageable (cvise) failed because of the size of the test case
    (32 GB of main memory were not enough) and maybe also because cvise may not
    be well suited to the style of programming with goto labels and
    interspersed assembler statements, lots of them. (Looking at your
    code, it also does not seem to be self-sufficient, at least the
    numerous SKIP4 statements require something else).

    My assumption is that the control flow is confusing gcc. For this
    to be fixed, somebody with knowledge of the code would need to
    cut this down to something that still exhibits the behavior, and
    that can be reduced further with cvise (or delta, but cvise is
    usually much better).
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Nov 28 20:41:48 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Robert Finch wrote:
    On 2025-11-27 10:50 a.m., Kent Dickey wrote:

    I think encoding stuff in NaN is a very 80's idea: turning on exceptions costs performance, so we want to debug after-the-fact using NaNs.
    But I think RISC-V has the right modern idea: make hardware fast so
    you can
    simply always enable Invalid Operation Traps (and maybe Overflow, if
    infinities are happening), and then stop right at the point of NaN being first created. So the NaN propagation doesn't matter.

    I think the common current debug strategy for NaNs is run at full speed
    with exceptions masked, and if you get NaNs in your answer, you re-run
    with exceptions on and then debug the traps that occur. And no one
    looks at
    the NaN values at all, just their presence.

    So rather than spending time on NaN encoding, make it so that FP
    performance
    is not affected by enabling exceptions, so we can skip the re-running
    step,
    and just run with Invalid Operations trapping enabled. And then just
    return canonical NaNs.

    Kent

    I do not know how one would make FP performance improve and have exceptions at the same time. The FP would have to operate asynchronous. The only thing I can think of is to have core(s) specifically dedicated
    to performance FP that do not service interrupts.

    Why do you think that enabling FP exceptions "costs performance",
    by which I assume you mean that, say, an FPADD with exceptions
    enabled is slower than disabled?

    It is the control transfer to and from the handler on the occurrence
    of an exception that diminishes performance; and the time consumed
    by the handler itself. The enabled and disabled FPU takes the same
    time regardless of whether an exception transpired or not.

    The FP exceptions are rising-edge triggered based on individual
    instruction calculation status, that is before being merged (OR'd)
    into the overall FP status. If an FP instruction has unmasked exceptions
    then mark the uOp as Except'd and recognize and order it at Retire like any
    other exception. This also assumes that the overall FP status is
    updated (merged) at Retire so it only contains status flags for
    FP instructions older than the retire point.




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 28 23:06:45 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    (Looking at your
    code, it also does not seem to be self-sufficient, at least the
    numerous SKIP4 statements require something else).

    If you want to assemble the resulting .S file, it's assembled once
    with

    -DSKIP4= -Dgforth_engine2=gforth_engine

    and once with

    -DSKIP4=".skip 4"

    (on Linux-GNU AMD64, the .skip assembler directive is autoconfigured
    and may be different on other platforms).

    My assumption is that the control flow is confusing gcc.

    My guess is the same.

    For this
    to be fixed, somebody with knowledge of the code would need to
    cut this down to something that still exhibits the behavior, and
    that can be reduced further with cvise (or delta, but cvise is
    usually much better).

    Everything from

    H_<name1>:

    to the next

    H_<name2>:

    is one implementation of a VM instruction. You can remove a VM instruction's machine instructions and the references to the labels in the tables at the
    start of gforth_engine(), and the thing should still compile, and
    ideally the code for all the other VM instructions should be
    unchanged.

    In the extreme, you could remove everything but H_ten_u_slash_mod and
    the code up to the next H_..., but my guess is that you need more VM instruction implementations to produce the not-so-great code.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Nov 29 09:29:01 2025
    From Newsgroup: comp.arch

    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count
    delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. An interrupt gets deferred because interrupts got disabled by an instruction in the pipeline. I
    guessed 40 instructions would likely be enough for many cases where IRQs
    are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the CPU
    would advance in 40 instruction burps. Alternating between fetching ISR instructions and the desired instruction stream. On the other hand, a
    larger down-count starts to impact the IRQ latency.

    Tradeoffs…

    I suppose I could have the CPU increase the down-count if it is looping
    around fetching ISR instructions. The down-count would be reset to the
    minimum again once an interrupt enable instruction is executed.

    Complex…
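
    A rough C model of the down-count as described (names, the constant,
    and the trace are illustrative only, not the Qupls4 RTL):

        #include <stdio.h>

        #define IRQ_DEFER_CYCLES 10

        static unsigned defer_count;

        /* evaluated once per front-end advance */
        static int accept_irq(int irq_pending, int di_in_pipeline)
        {
            if (defer_count) {
                defer_count--;                     /* still backing off        */
                return 0;
            }
            if (irq_pending && di_in_pipeline) {
                defer_count = IRQ_DEFER_CYCLES;    /* defer and start back-off */
                return 0;
            }
            return irq_pending;                    /* safe to redirect fetch   */
        }

        int main(void)
        {
            /* toy trace: an IRQ is always pending, DI in flight for 3 cycles */
            for (int cycle = 0; cycle < 16; cycle++)
                printf("cycle %2d: accept=%d\n", cycle, accept_irq(1, cycle < 3));
            return 0;
        }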

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sat Nov 29 07:37:20 2025
    From Newsgroup: comp.arch

    On 11/29/2025 6:29 AM, Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being deferred because interrupts got disabled by an instruction in the pipeline. I
    guessed 40 instructions would likely be enough for many cases where IRQs
    are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the CPU would advance in 40 instruction burps. Alternating between fetching ISR instructions and the desired instruction stream. On the other hand, a
    larger down-count starts to impact the IRQ latency.

    Tradeoffs…

    I suppose I could have the CPU increase the down-count if it is looping around fetching ISR instructions. The down-count would be reset to the minimum again once an interrupt enable instruction is executed.

    Complex…

    A simple alternative that I have seen is to have an instruction that
    enables interrupts and jumps to somewhere, probably either the
    interrupted code or the dispatcher that might do a full context switch.
    The ISR would issue this instruction when it has saved everything that
    is necessary to handle the interrupt and thus could be interrupted
    again. This minimizes the time interrupts are locked out without the
    need for an arbitrary timer, etc.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Sat Nov 29 15:48:22 2025
    From Newsgroup: comp.arch

    In article <1764359371-5857@newsgrouper.org>,
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    kegs@provalid.com (Kent Dickey) posted:

    In article <1763868010-5857@newsgrouper.org>,
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Robert Finch <robfi680@gmail.com> posted:
    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.

    Suggest you read::
    https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.
    [snip]
    I'll just say whatever your NaN handling is, for the source code:

    A = B + C + D + E

    then for whatever values B,C,D,E having NaN or not, the value of A should
    be well defined and not dependent on the order of operations.

    A nice philosophy, but how does one achieve that when the compiler is allowed to encode the above as::

    A = (B+C)+(D+E)
    or
    A = (B+D)+(C+E)
    or
    A = (B+E)+(C+D)
    or
    A = (B+C)+(E+D)
    or
    ...

    No single set of rules can give the first created NaN because which
    is first created is dependent on how the compiler ordered the FADDs.

    This is my point: I don't see a great way to encode the first NaN, which
    is why I propose not making that a goal. You're not getting the first
    NaN in any case even if you try to do so in hardware, since the order of operations is a fragile thing that's hard to control unless you write
    assembly code, or the most tedious source code imaginable.

    Several rules easily satisfy my property: canonical NaN (always return 0x7fc00000 as the result of any invalid op or any operation involving a
    NaN), or Max(NaN.mantissa), where you return the largest mantissa value
    of any NaN. An OR of the NaN mantissas also works. This lets you at
    least encode the most serious NaN if you order them, or lets you know
    all the different invalid ops that occurred with the OR of flags stored
    in the mantissa.
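
    A software model of that OR rule might look like this (single
    precision, purely illustrative; no shipping FPU is claimed to do
    exactly this, and it only applies once at least one operand is a NaN):

        #include <stdint.h>

        /* true if the bit pattern is any NaN: exponent all ones, mantissa != 0 */
        static int is_nan_bits(uint32_t b)
        {
            return (b & 0x7F800000u) == 0x7F800000u && (b & 0x007FFFFFu) != 0;
        }

        /* quiet-NaN head, OR of the operands' payloads: the result is the
           same whichever operand the compiler happened to put first */
        uint32_t nan_result_bits(uint32_t a, uint32_t b)
        {
            uint32_t payload = 0;
            if (is_nan_bits(a)) payload |= a & 0x003FFFFFu;
            if (is_nan_bits(b)) payload |= b & 0x003FFFFFu;
            return 0x7FC00000u | payload;
        }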

    But canonical NaN is so much simpler. There's no need to preserve and
    mux around the NaN mantissas, which might save a tiny amount of datapath
    logic in FP units.

    Perhaps clever algorithms involving integer ops on FP values will come
    around and we'll WANT to have simpler FP handling so the integer
    accelerations will be easier to get right.

    Kent
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Nov 29 13:28:58 2025
    From Newsgroup: comp.arch

    On 2025-11-29 10:37 a.m., Stephen Fuld wrote:
    On 11/29/2025 6:29 AM, Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-
    count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being
    deferred because interrupts got disabled by an instruction in the
    pipeline. I guessed 40 instructions would likely be enough for many
    cases where IRQs are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not
    be committed because the IRQs got disabled in the meantime. If the CPU
    were allowed to accept another IRQ right away, it could get stuck in a
    loop flushing the pipeline and reloading with the ISR routine code
    instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be
    programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the
    CPU would advance in 40 instruction burps. Alternating between
    fetching ISR instructions and the desired instruction stream. On the
    other hand, a larger down-count starts to impact the IRQ latency.

    Tradeoffs…

    I suppose I could have the CPU increase the down-count if it is
    looping around fetching ISR instructions. The down-count would be
    reset to the minimum again once an interrupt enable instruction is
    executed.

    Complex…

    A simple alternative that I have seen is to have an instruction that
    enables interrupts and jumps to somewhere, probably either the
    interrupted code or the dispatcher that might do a full context switch.
     The ISR would issue this instruction when it has saved everything that
    is necessary to handle the interrupt and thus could be interrupted
    again.  This minimized the time interrupts are locked out without the
    need for an arbitrary timer, etc.



    That is a decent idea. A special jump and disable interrupts instruction
    to the next instruction might do it. The pipeline needs to be cleared of
    the external interrupt when interrupts are disabled, and the address
    reset. The issue then is that the interrupt gets lost, so it needs to be cached somewhere so that once interrupts are enabled again it can be processed. There could be multiple interrupts in the pipeline that need
    to be cached.

    Seeing as the address needs to be reset, an explicit jump instruction
    may not be necessary. The IP of the interrupted instruction could be used.

    I see now that a stack might be better than a FIFO as only a higher
    priority interrupt would be able to interrupt the lower one. Should they
    be processed in order of occurrence? Order of occurrence = FIFO,
    otherwise stack = FILO. Leave it to the user to decide? Out of order asynchronous interrupts probably are not a big deal. Hardware likely
    does not know what the order is, or care about it.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 29 19:05:00 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being deferred because interrupts got disabled by an instruction in the pipeline. I
    guessed 40 instructions would likely be enough for many cases where IRQs
    are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.

    The above is one of the reasons EricP supports the pipeline notion that interrupts do NOT flush the pipe. Instead, the instructions in the pipe
    are allowed to retire (apace) and new instructions are inserted from
    the interrupt service point. As long as the instructions "IN" the pipe
    can deliver their results to their registers, and update µArchitectural
    state they "own", there is no reason to flush--AND--no corresponding
    reason to delay "taking" the interrupt.

    At the µArchitectural level, you, the designer, see both the front
    and the end of the pipeline, you can change what goes in the front
    and allow what was already in the pipe to come out the back. This
    requires dragging a small amount of information down the pipe, much
    like multi-threaded CPUs.

    I could create a control register for this count and allow it to be programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the CPU would advance in 40 instruction burps. Alternating between fetching ISR instructions and the desired instruction stream. On the other hand, a
    larger down-count starts to impact the IRQ latency.

    Tradeoffs…

    I suppose I could have the CPU increase the down-count if it is looping around fetching ISR instructions. The down-count would be reset to the minimum again once an interrupt enable instruction is executed.

    Complex…

    Make the problem "go away". You will be happier in the end.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Nov 29 19:11:30 2025
    From Newsgroup: comp.arch

    Kent Dickey <kegs@provalid.com> schrieb:

    This is my point: I don't see a great way to encode the first NaN, which
    is why I propose not making that a goal. You're not getting the first
    NaN in any case even if you try to do so in hardware, since the order of operations is a fragile thing that's hard to control unless you write assembly code, or the most tedious source code imaginable.

    Using Fortran, parentheses have to be honored. If you write

    A = (B + C) + (D + E)

    then B + C and D + E have to be calculated before the total sum.
    If you write

    A = B + (C + (D + E))

    then you prescribe the order completely.

    I can imagine source code that is much more tedious than this :-)
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 29 19:23:03 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/29/2025 6:29 AM, Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being deferred because interrupts got disabled by an instruction in the pipeline. I guessed 40 instructions would likely be enough for many cases where IRQs are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the CPU would advance in 40 instruction burps. Alternating between fetching ISR instructions and the desired instruction stream. On the other hand, a larger down-count starts to impact the IRQ latency.

    Tradeoffs…

    I suppose I could have the CPU increase the down-count if it is looping around fetching ISR instructions. The down-count would be reset to the minimum again once an interrupt enable instruction is executed.

    Complex…

    A simple alternative that I have seen is to have an instruction that
    enables interrupts and jumps to somewhere, probably either the
    interrupted code or the dispatcher that might do a full context switch.
    The ISR would issue this instruction when it has saved everything that
    is necessary to handle the interrupt and thus could be interrupted
    again. This minimized the time interrupts are locked out without the
    need for an arbitrary timer, etc.

    Another alternative is to allow ISRs to be interrupted by ISRs of higher priority. All you need here is a clean and precise definition of priority
    and when said priority gets associated with any given interrupt.

    My 66000 goes so far as to never need to disable interrupts because all interrupts of the same or lower priority are automatically disabled by
    the priority of the current ISR/running-thread. That is, one arrives
    at the ISR with interrupts enabled and in a reentrant state with the
    priority given by the I/O MMU when device sent ISR message to MSI-X
    queue.

    If/when an ISR needs to be sure it is not interrupted, it can change
    priority in 1 instruction to "highest" and have the system not allow
    the I/O MMU to associate said "exclusive" priority with any device
    interrupt. When ISR returns, priority reverts to priority at the time
    the interrupt was taken. {No need to back down on priority} This only
    requires that there are enough priorities to spare one exclusively to
    the system.

    EricP has argued that 8-I/O priority levels are enough. I argue that
    64 priority levels are enough for {Guest OS, Host OS, HyperVisor}
    to each have their own somewhat-coordinated structure of priorities.
    AND further I argue that given one is designing a 64-bit machine,
    that 64 priority levels are de rigueur.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sat Nov 29 15:08:05 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    Kent Dickey <kegs@provalid.com> schrieb:

    This is my point: I don't see a great way to encode the first NaN, which
    is why I propose not making that a goal. You're not getting the first
    NaN in any case even if you try to do so in hardware, since the order of
    operations is a fragile thing that's hard to control unless you write
    assembly code, or the most tedious source code imaginable.

    Using Fortran, parentheses have to be honored. If you write

    A = (B + C) + (D + E)

    then B + C and D + E have to be calculated before the total sum.
    If you write

    A = B + (C + (D + E))

    then you prescribe the order completetely.

    I can imagine source code that is much more tedious than this :-)

    That doesn't control which variable is assigned to each source operand.
    If both operands were NaNs and the two-NaN rule was "always take src1"
    then the choice of which to propagate would still be non-deterministic.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Nov 29 15:42:13 2025
    From Newsgroup: comp.arch

    On 2025-11-29 2:05 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count
    delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being deferred
    because interrupts got disabled by an instruction in the pipeline. I
    guessed 40 instructions would likely be enough for many cases where IRQs
    are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not be
    committed because the IRQs got disabled in the meantime. If the CPU were
    allowed to accept another IRQ right away, it could get stuck in a loop
    flushing the pipeline and reloading with the ISR routine code instead of
    progressing through the code where IRQs were disabled.

    The above is one of the reasons EricP supports the pipeline notion that interrupts do NOT flush the pipe. Instead, the instruction in the pipe
    are allowed to retire (apace) and new instructions are inserted from
    the interrupt service point.

    That is how Qupls is working too. The issue is what happens when the instruction in the pipe before the ISR disables the interrupt. Then the
    ISR instructions need to be flushed.

    As long as the instructions "IN" the pipe
    can deliver their results to their registers, and update µArchitectural state they "own", there is no reason to flush--AND--no corresponding
    reason to delay "taking" the interrupt.

    That is the usual case for Qupls too when there is an interrupt.

    At the µArchitectural level, you, the designer, see both the front
    and the end of the pipeline, you can change what goes in the front
    and allow what was already in the pipe to come out the back. This
    requires dragging a small amount of information down the pipe, much
    like multi-threaded CPUs.

    Yes, the IRQ info is being dragged down the pipe.

    I could create a control register for this count and allow it to be
    programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the CPU
    would advance in 40 instruction burps. Alternating between fetching ISR
    instructions and the desired instruction stream. On the other hand, a
    larger down-count starts to impact the IRQ latency.

    Tradeoffs…

    I suppose I could have the CPU increase the down-count if it is looping
    around fetching ISR instructions. The down-count would be reset to the
    minimum again once an interrupt enable instruction is executed.

    Complex…

    Make the problem "go away". You will be happier in the end.

    The interrupt mask is set at fetch time to disable lower priority
    interrupts. I suppose disabling of interrupts by the OS could simply be ignored. The interrupt could only be taken if it is a higher priority
    than the current level.

    I had thought the OS might have good reason to disable interrupts. But
    maybe I am making things too complex.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sat Nov 29 16:10:45 2025
    From Newsgroup: comp.arch

    Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being deferred because interrupts got disabled by an instruction in the pipeline. I
    guessed 40 instructions would likely be enough for many cases where IRQs
    are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the CPU would advance in 40 instruction burps. Alternating between fetching ISR instructions and the desired instruction stream. On the other hand, a
    larger down-count starts to impact the IRQ latency.

    Tradeoffs…

    I suppose I could have the CPU increase the down-count if it is looping around fetching ISR instructions. The down-count would be reset to the minimum again once an interrupt enable instruction is executed.

    Complex…


    You are using this timer to predict the delay for draining the pipeline.
    It would only take a read of a slow IO device register to exceed it.

    I was thinking a simple and cheap way would be to use a variation of the single-step mechanism. An interrupt request would cause Decode to emit a special uOp with the single-step flag set and then stall, to allow the
    pipeline to drain the old stream before accepting the interrupt and
    redirecting Fetch to its handler. That way if there are any interrupt
    enable or disable instructions, or branch mispredicts, or pending exceptions in flight, they are all allowed to finish and the state to settle down.

    Pipelining interrupt delivery looks possible but gets complicated and
    expensive real quick.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 29 22:07:04 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Thomas Koenig wrote:
    Kent Dickey <kegs@provalid.com> schrieb:

    This is my point: I don't see a great way to encode the first NaN, which is why I propose not making that a goal. You're not getting the first
    NaN in any case even if you try to do so in hardware, since the order of operations is a fragile thing that's hard to control unless you write
    assembly code, or the most tedious source code imaginable.

    Using Fortran, parentheses have to be honored. If you write

    A = (B + C) + (D + E)

    then B + C and D + E have to be calculated before the total sum.
    If you write

    A = B + (C + (D + E))

    then you prescribe the order completetely.

    I can imagine source code that is much more tedious than this :-)

    That doesn't control which variable is assigned to each source operand.
    If both operands were Nan's and the two-Nan-rule was "always take src1"
    then the choice of which to propagate would still be non-deterministic.


    In addition, the compiler is still allowed to perform the FORTRAN
    equation as::

    A = (C + B) + (E + D)

    instead of the way expressed in ASCII.

    Parentheses order calculations, not operands.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 29 22:17:36 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-29 2:05 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count
    delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being deferred
    because interrupts got disabled by an instruction in the pipeline. I
    guessed 40 instructions would likely be enough for many cases where IRQs are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop
    flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.

    The above is one of the reasons EricP supports the pipeline notion that interrupts do NOT flush the pipe. Instead, the instruction in the pipe
    are allowed to retire (apace) and new instructions are inserted from
    the interrupt service point.

    That is how Qupls is working too. The issue is what happens when the instruction in the pipe before the ISR disables the interrupt. Then the
    ISR instructions need to be flushed.

    As a general rule of thumb:: an instruction is not "performed" until
    after it retires. {when you cannot undo its deeds}

    Consider the case where you redirect the front of the pipe to an ISR and
    an instruction already in the pipe raises an exception. Here, what I do
    {and have done in the past} is to not retire instructions after the
    exception, so the ISR is not delayed and IP ends up pointing at the
    excepting instruction.

    Since you started ISR before you retired DI, you can treat DI as an
    exception. {DI after ISR control transfer}. If, on the other hand,
    you perform DI at the front of the pipe, you don't "accept" the ISR
    until EI.

    As long as the instructions "IN" the pipe
    can deliver their results to their registers, and update µArchitectural state they "own", there is no reason to flush--AND--no corresponding
    reason to delay "taking" the interrupt.

    That is the usual case for Qupls too when there is an interrupt.

    At the µArchitectural level, you, the designer, see both the front
    and the end of the pipeline, you can change what goes in the front
    and allow what was already in the pipe to come out the back. This
    requires dragging a small amount of information down the pipe, much
    like multi-threaded CPUs.

    Yes, the IRQ info is being dragged down the pipe.

    I could create a control register for this count and allow it to be
    programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the CPU
    would advance in 40 instruction burps. Alternating between fetching ISR
    instructions and the desired instruction stream. On the other hand, a
    larger down-count starts to impact the IRQ latency.

    Tradeoffs…

    I suppose I could have the CPU increase the down-count if it is looping
    around fetching ISR instructions. The down-count would be reset to the
    minimum again once an interrupt enable instruction is executed.

    Complex…

    Make the problem "go away". You will be happier in the end.

    The interrupt mask is set at fetch time to disable lower priority interrupts. I suppose disabling of interrupts by the OS could simply be ignored. The interrupt could only be taken if it is a higher priority
    than the current level.

    I had thought the OS might have good reason to disable interrupts. But
    maybe I am making things too complex.


    The OS DOES have good reasons to DI "every once in a while", IIRC my conversations with EricP, these are short sequences the OS needs
    to be ATOMIC across all OS threads--and almost always without the
    possibility that the ATOMIC event fails {which can happen in user code}.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 29 22:26:21 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being deferred because interrupts got disabled by an instruction in the pipeline. I guessed 40 instructions would likely be enough for many cases where IRQs are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the CPU would advance in 40 instruction burps. Alternating between fetching ISR instructions and the desired instruction stream. On the other hand, a larger down-count starts to impact the IRQ latency.

    Tradeoffs…

    I suppose I could have the CPU increase the down-count if it is looping around fetching ISR instructions. The down-count would be reset to the minimum again once an interrupt enable instruction is executed.

    Complex…


    You are using this timer to predict the delay for draining the pipeline.
    It would only take a read of a slow IO device register to exceed it.

    Yes, exactly::

    Consider a GBOoO processor that performs a LD R9,[deviceCR].

    a) all earlier memory references have to be seen globally
    ...before this LD can be seen globally. {dozens of cycles}
    b) this LD has to arrive at HostBridge. {dozens of cycles}
    c) HostBrdge sends request down PCIe {hundreds of cycles}
    d) device responds to LD {handful of cycles}
    e) PCIe transports response to HB {hundreds of cycles}
    f) HB transfers response to requestor {dozens of cycles}
    g) CPU is allowed to re-enter OoO {handful of cycles}

    Accesses to devices need to have most of the properties of
    "Sequential Consistency" as defined by Lamport.

    Now, several LDs [DeviceCRs] can be seen globally and in order
    before the first (or all responses) but you are going to see all
    that latency in the pipeline; but OoO memory requests are not one
    of them.

    I was thinking a simple and cheap way would be to use a variation of the single-step mechanism. An interrupt request would cause Decode to emit a special uOp with the single-step flag set and then stall, to allow the pipeline to drain the old stream before accepting the interrupt and redirecting Fetch to its handler. That way if there are and interrupt
    enable or disable instructions, or branch mispredicts, or pending exceptions in-flight they all are allowed to finish and the state to settle down.

    Pipelining interrupt delivery looks possible but gets complicated and expensive real quick.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Nov 29 17:45:17 2025
    From Newsgroup: comp.arch

    On 2025-11-29 4:10 p.m., EricP wrote:
    Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-
    count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being
    deferred because interrupts got disabled by an instruction in the
    pipeline. I guessed 40 instructions would likely be enough for many
    cases where IRQs are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not
    be committed because the IRQs got disabled in the meantime. If the CPU
    were allowed to accept another IRQ right away, it could get stuck in a
    loop flushing the pipeline and reloading with the ISR routine code
    instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be
    programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the
    CPU would advance in 40 instruction burps. Alternating between
    fetching ISR instructions and the desired instruction stream. On the
    other hand, a larger down-count starts to impact the IRQ latency.

    Tradeoffs…

    I suppose I could have the CPU increase the down-count if it is
    looping around fetching ISR instructions. The down-count would be
    reset to the minimum again once an interrupt enable instruction is
    executed.

    Complex…


    You are using this timer to predict the delay for draining the pipeline.
    It would only take a read of a slow IO device register to exceed it.

    The down-count counts down only when the front end of the pipeline advances, so instructions are sure to be loaded.

    I was thinking a simple and cheap way would be to use a variation of the single-step mechanism. An interrupt request would cause Decode to emit a special uOp with the single-step flag set and then stall, to allow the pipeline to drain the old stream before accepting the interrupt and redirecting Fetch to its handler. That way if there are and interrupt
    enable or disable instructions, or branch mispredicts, or pending
    exceptions
    in-flight they all are allowed to finish and the state to settle down.

    Pipelining interrupt delivery looks possible but gets complicated and expensive real quick.



    The base down count increases every time the IRQ is found at the commit
    stage. If the base down count is too large (stuck interrupt) then an
    exception is processed. For instance if interrupts were disabled for
    1000 clocks.

    I think the mechanism could work, complicated though.

    Treating the DI as an exception, as mentioned in another post would also
    work. It is a matter then of flushing the instructions between the DI
    and ISR.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 29 23:14:23 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-29 4:10 p.m., EricP wrote:
    Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-
    count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being
    deferred because interrupts got disabled by an instruction in the
    pipeline. I guessed 40 instructions would likely be enough for many
    cases where IRQs are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not
    be committed because the IRQs got disabled in the meantime. If the CPU
    were allowed to accept another IRQ right away, it could get stuck in a
    loop flushing the pipeline and reloading with the ISR routine code
    instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be
    programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the
    CPU would advance in 40 instruction burps. Alternating between
    fetching ISR instructions and the desired instruction stream. On the
    other hand, a larger down-count starts to impact the IRQ latency.

    Tradeoffs…

    I suppose I could have the CPU increase the down-count if it is
    looping around fetching ISR instructions. The down-count would be
    reset to the minimum again once an interrupt enable instruction is
    executed.

    Complex…


    You are using this timer to predict the delay for draining the pipeline.
    It would only take a read of a slow IO device register to exceed it.

    The down count is counting down only when the front-end of the pipeline advances, instructions are sure to be loaded.

    I was thinking a simple and cheap way would be to use a variation of the single-step mechanism. An interrupt request would cause Decode to emit a special uOp with the single-step flag set and then stall, to allow the pipeline to drain the old stream before accepting the interrupt and redirecting Fetch to its handler. That way if there are and interrupt enable or disable instructions, or branch mispredicts, or pending exceptions
    in-flight they all are allowed to finish and the state to settle down.

    Pipelining interrupt delivery looks possible but gets complicated and expensive real quick.



    The base down count increases every time the IRQ is found at the commit stage. If the base down count is too large (stuck interrupt) then an exception is processed. For instance if interrupts were disabled for
    1000 clocks.

    I think the mechanism could work, complicated though.

    Treating the DI as an exception, as mentioned in another post would also work. It is a matter then of flushing the instructions between the DI
    and ISR.

    Which is no different than flushing instructions after a mispredicted branch.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Nov 29 23:37:21 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    (Looking at your
    code, it also does not seem to be self-sufficient, at least the
    numerous SKIP4 statements require something else).

    If you want to assemble the resulting .S file, it's assembled once
    with

    -DSKIP4= -Dgforth_engine2=gforth_engine

    and once with

    -DSKIP4=".skip 4"

    (on Linux-GNU AMD64, the .skip assembler directive is autoconfigured
    and may be different on other platforms).

    My assumption is that the control flow is confusing gcc.

    My guess is the same.

    Both our guesses were wrong, and Scott (I think) was on the right
    track - this is a signed / unsigned issue. A reduced test case is

    void bar(unsigned long, long);

    void foo(unsigned long u1)
    {
        long u3;
        u1 = u1 / 10;
        u3 = u1 % 10;
        bar(u1,u3);
    }

    This is now https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122911 .
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 30 02:17:10 2025
    From Newsgroup: comp.arch

    On 2025-11-29 6:14 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-29 4:10 p.m., EricP wrote:
    Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-
    count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being
    deferred because interrupts got disabled by an instruction in the
    pipeline. I guessed 40 instructions would likely be enough for many
    cases where IRQs are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not
    be committed because the IRQs got disabled in the meantime. If the CPU
    were allowed to accept another IRQ right away, it could get stuck in a
    loop flushing the pipeline and reloading with the ISR routine code
    instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be
    programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the
    CPU would advance in 40 instruction burps. Alternating between
    fetching ISR instructions and the desired instruction stream. On the
    other hand, a larger down-count starts to impact the IRQ latency.

    Tradeoffs…

    I suppose I could have the CPU increase the down-count if it is
    looping around fetching ISR instructions. The down-count would be
    reset to the minimum again once an interrupt enable instruction is
    executed.

    Complex…


    You are using this timer to predict the delay for draining the pipeline.
    It would only take a read of a slow IO device register to exceed it.

    The down-count counts down only when the front end of the pipeline
    advances, so instructions are sure to be loaded.

    I was thinking a simple and cheap way would be to use a variation of
    the single-step mechanism. An interrupt request would cause Decode to
    emit a special uOp with the single-step flag set and then stall, to
    allow the pipeline to drain the old stream before accepting the
    interrupt and redirecting Fetch to its handler. That way if there are
    any interrupt enable or disable instructions, or branch mispredicts, or
    pending exceptions in-flight, they all are allowed to finish and the
    state to settle down.

    Pipelining interrupt delivery looks possible but gets complicated and
    expensive real quick.



    The base down count increases every time the IRQ is found at the commit
    stage. If the base down count is too large (stuck interrupt) then an
    exception is processed. For instance if interrupts were disabled for
    1000 clocks.

    I think the mechanism could work, complicated though.

    Treating the DI as an exception, as mentioned in another post would also
    work. It is a matter then of flushing the instructions between the DI
    and ISR.

    Which is no different than flushing instructions after a mispredicted branch.

    Got fed up with trying to work out how to get interrupts working. It turns
    out to be more challenging than I expected, no matter which way it is
    done. So, I decided to just poll for interrupts, getting rid of most of
    the IRQ logic. I added a branch-on-interrupt BOI instruction that works
    almost the same way as every other branch. Then the micro-op translator
    has been adapted to insert a polling branch periodically. It looks a lot simpler.
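
    A rough C sketch of what that insertion amounts to (the names, the opcode
    value, and the poll interval here are mine, just to illustrate; the real
    logic lives in the decode/translate stage, not in software):

    typedef struct { int opcode; long target; } uop_t;

    enum { POLL_INTERVAL = 16,       /* how often to poll; illustrative only */
           OP_BOI        = 0xB01 };  /* hypothetical branch-on-interrupt opcode */

    static void emit(uop_t u) { (void)u; }   /* stand-in for the uop queue */

    /* Translate one decoded instruction, occasionally prefixing it with a BOI
       uOp.  The BOI behaves like any other branch except that it is taken
       only if an interrupt request line is asserted. */
    static void translate(uop_t decoded, long isr_entry)
    {
        static int since_last_poll = 0;

        if (++since_last_poll >= POLL_INTERVAL) {
            uop_t boi = { OP_BOI, isr_entry };
            emit(boi);
            since_last_poll = 0;
        }
        emit(decoded);
    }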


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Nov 30 10:10:00 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:

    Got fed up with trying to work out how to get interrupts working. It turns
    out to be more challenging than I expected, no matter which way it is
    done. So, I decided to just poll for interrupts, getting rid of most of
    the IRQ logic. I added a branch-on-interrupt BOI instruction that works almost the same way as every other branch. Then the micro-op translator
    has been adapted to insert a polling branch periodically. It looks a lot simpler.

    What is the expected delay until an interrupt is delivered?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 30 06:29:55 2025
    From Newsgroup: comp.arch

    On 2025-11-30 5:10 a.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:

    Got fed up with trying to work out how to get interrupts working. It turns
    out to be more challenging than I expected, no matter which way it is
    done. So, I decided to just poll for interrupts, getting rid of most of
    the IRQ logic. I added a branch-on-interrupt BOI instruction that works
    almost the same way as every other branch. Then the micro-op translator
    has been adapted to insert a polling branch periodically. It looks a lot
    simpler.

    What is the expected delay until an interrupt is delivered?

    I set the timing to 16 clocks, which is about 64 (or more) instructions.
    Did not want to go much over 1% of the number of instructions executed.
    Not every instruction inserts a poll, so sometimes a poll is lacking.
    IDK how well it will work. Making it an instruction means it might also
    be used by software.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 30 06:41:52 2025
    From Newsgroup: comp.arch

    On 2025-11-30 6:29 a.m., Robert Finch wrote:
    On 2025-11-30 5:10 a.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:

    Got fed up with trying to work out how to get interrupts working. It turns
    out to be more challenging than I expected, no matter which way it is
    done. So, I decided to just poll for interrupts, getting rid of most of
    the IRQ logic. I added a branch-on-interrupt BOI instruction that works
    almost the same way as every other branch. Then the micro-op translator
    has been adapted to insert a polling branch periodically. It looks a lot simpler.

    What is the expected delay until an interrupt is delivered?

    I set the timing to 16 clocks, which is about 64 (or more) instructions.
    Did not want to go much over 1% of the number of instructions executed.
    Not every instruction inserts a poll, so sometimes a poll is lacking.
    IDK how well it will work. Making it an instruction means it might also
    be used by software.

    Might be able to modify the branch predictor to predict the interrupt.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 30 14:14:16 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Both our guesses were wrong, and Scott (I think) was on the right
    track - this is a signed / unsigned issue. A reduced test case is

    void bar(unsigned long, long);

    void foo(unsigned long u1)
    {
    long u3;
    u1 = u1 / 10;
    u3 = u1 % 10;
    bar(u1,u3);
    }

    Assigning to u1 changed the meaning, as Andrew Pinski noted; so the
    jury is still out on what the actual problem is.

    This is now https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122911 .

    and a revised one at
    <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122919>

    (The announced attachment is not there yet.)

    The latter case is interesting, because real_ca and spc became global,
    and symbols[] is still local, and no assignment to real_ca happens
    inside foo().

    So one way the compiler could interpret this code might be that
    real_ca gets one of the labels whose address is taken in some way
    unknown to the compiler; then it has to preserve all the code reachable
    through the labels.

    Another way to interpret this code would be that symbols is not used,
    so it is dead and can be optimized away. Consequently, none of the
    addresses of any of the labels is ever taken, and the labels are not
    used by direct jumps, either, so all the code reachable only by
    jumping to the labels is unreachable and can be optimized away.

    Apparently gcc takes the latter attitude if there are <=100 labels in
    symbols, but maybe something like the former attitude if there are
    more than 100 labels in symbols. This may appear strange, but gcc generally
    tends to produce good code in relatively short time for Gforth (while
    clang generates horribly slow code and takes extremely long in doing
    so), and my guess is that having such a cutoff on doing the usual
    analysis has something to do with gcc's superior performance.

    I guess that if you treat symbols like in the original code (i.e.,
    return it in one case), you can reduce the labels more without the
    compiler optimizing everything away. I don't dare to predict when the
    compiler will stop generating the inefficient variant. Maybe it has
    to do with the cutoff.
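
    For readers who have not looked at the engine, the situation is roughly
    like this GNU C sketch (heavily simplified and written by me; the real
    symbols[] has hundreds of entries and the labels dispatch VM instructions):

    /* Labels-as-values sketch (GNU C extension).  If the compiler can prove
       that symbols[] is dead, no label address escapes and the code behind
       the labels is unreachable; if symbols[] may escape (here: when it is
       returned), that code has to be kept. */
    void *foo(long *ip, int return_symbols)
    {
        static void *symbols[] = { &&do_add, &&do_div10 };
        long tos = 0;

        if (return_symbols)
            return symbols;           /* label addresses escape here */

        goto *symbols[ip[0]];         /* threaded-code dispatch */

    do_add:
        tos = ip[1] + ip[2];
        return (void *)tos;

    do_div10:
        tos = (long)((unsigned long)ip[1] / 10);
        return (void *)tos;
    }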

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Nov 30 15:47:03 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Both our guesses were wrong, and Scott (I think) was on the right
    track - this is a signed / unsigned issue. A reduced test case is

    void bar(unsigned long, long);

    void foo(unsigned long u1)
    {
    long u3;
    u1 = u1 / 10;
    u3 = u1 % 10;
    bar(u1,u3);
    }

    Assigning to u1 changed the meaning, as Andrew Pinski noted;

    An example which could be tested at run-time to verify correct
    operation was not provided, so I had to do without.

    In reducing compiler bugs, automated tools such as delta or
    (much better) cvise are essential. Your test case was so
    large that cvise failed, so a lot of manual work was required.

    cvise uses a user-supplied "interestingness script" which returns
    0 if the feature in question is there, or non-zero if it is
    not there. For relatively simple cases like an ICE, it
    can have two steps: a) check that compilation fails, and b)
    check that the error message is output.

    Looking for a missed optimization is more difficult, especially
    in the absence of a run-time test. It is then necessary to

    a) check the source code that the interesting code is still there

    b) compile the code (exiting if this fails)

    c) verify the generated assembly that it still does the same

    a) and c) are very easy to get wrong, and there were numerous
    false reductions where cvise came up with something that the
    scripts didn't catch.


    so the
    jury is still out on what the actual problem is.

    This is now https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122911 .

    and a revised one at
    <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122919>

    (The announced attachment is not there yet.)

    The latter case is interesting, because real_ca and spc became global,
    and symbols[] is still local, and no assignment to real_ca happens
    inside foo().

    That is what cvise does. It sometimes reduces code more than a
    human would.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 30 15:18:21 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.

    Why would a CS graduate need to know about pipelines?

    So they can properly simulate a pipelined processor?

    Sure, if a CS graduate works in an application area, they need to
    learn about that application area, whatever it is.

    It's useful for code optimization, as well.

    In what way?

    In general,
    any programmer should have a solid understanding of the
    underlying hardware - generically, and specifically
    for the hardware being programmed.

    Certainly. But do they need to know the difference between a Wallace multiplier
    and a Dadda multiplier? If not, what is it about pipelined processors
    that would require CS graduates to know about them?

    Processor pipelines are not the basics of what a CS graduate is doing.
    They are an implementation detail in computer engineering.

    Which affect the performance of the software created by the
    software engineer (CS graduate).

    By a constant factor; and the software creator does not need to know
    that the CPU that executes instructions at 2 CPI (486) instead of at
    10 CPI (VAX-11/780) is pipelined; and these days both the 486 and the
    VAX are irrelevant to software creators.

    A few more examples where compilers are not as good as even I expected:

    Just today, I compiled

    u4 = u1/10;
    u3 = u1%10;

    (plus some surrounding code) with gcc-14 in three contexts. Here's
    the code for two of them (the third one is similar to the second one):

    movabs $0xcccccccccccccccd,%rax     movabs $0xcccccccccccccccd,%rsi
    sub $0x8,%r13                       mov %r8,%rax
    mul %r8                             mov %r8,%rcx
    mov %rdx,%rax                       mul %rsi
    shr $0x3,%rax                       shr $0x3,%rdx
    lea (%rax,%rax,4),%rdx              lea (%rdx,%rdx,4),%rax
    add %rdx,%rdx                       add %rax,%rax
    sub %rdx,%r8                        sub %rax,%r8
    mov %r8,0x8(%r13)                   mov %rcx,%rax
    mov %rax,%r8                        mul %rsi
                                        shr $0x3,%rdx
                                        mov %rdx,%r9

    The major difference is that in the left context, u3 is stored into
    memory (at 0x8(%r13)), while in the right context, it stays in a
    register. In the left context, gcc managed to base its computation of
    u1%10 on the result of u1/10; in the right context, gcc first computes
    u1%10 (computing u1/10 as part of that), and then computes u1/10
    again.
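
    For reference, what both sequences compute is the usual fixed-point
    reciprocal trick; a sketch (mine, not gcc output; __uint128_t is the gcc
    extension on this platform):

    #include <stdint.h>

    /* 0xCCCCCCCCCCCCCCCD is ceil(2^67/10), so for 64-bit x the quotient is
       q = floor(x*M / 2^67) = (high 64 bits of x*M) >> 3, and the remainder
       is x - 10*q, which is what the lea/add/sub sequence above computes. */
    static uint64_t div10(uint64_t x)
    {
        const uint64_t M = 0xCCCCCCCCCCCCCCCDull;              /* ceil(2^67/10) */
        uint64_t hi = (uint64_t)(((__uint128_t)x * M) >> 64);  /* mul; take %rdx */
        return hi >> 3;                                        /* shr $0x3 */
    }

    static uint64_t mod10(uint64_t x)
    {
        uint64_t q = div10(x);
        return x - (q + 4 * q) * 2;   /* lea 5*q; add doubles it; sub from x */
    }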

    Sort of emphasizes that programmers need to understand the
    underlying hardware.

    I am the programmer of the code shown above. In what way would better knowledge of the hardware have made me aware that gcc would produce
    suboptimal code in some cases?

    What were u1, u3 and u4 declared as?

    unsigned long (on that platform).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 30 16:39:41 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    In reducing compiler bugs, automated tools such as delta or
    (much better) cvise are essential. Your test case was so
    large that cvise failed, so a lot of manual work was required.

    I have now done a manual reduction myself; essentially I left only the
    3 variants of the VM instruction that performs 10/, plus all the
    surroundings, and I added code to ensure that spTOS, spb, and spc are
    not dead. You find the result at

    http://www.complang.tuwien.ac.at/anton/tmp/engine-fast-red.i

    The result of compiling this with

    gcc -I./../arch/amd64 -I. -Wall -g -O2 -fomit-frame-pointer -pthread -DHAVE_CONFIG_H -DFORCE_LL -DFORCE_REG -DDEFAULTPATH='".:/usr/local/lib/gforth/site-forth:/usr/local/lib/gforth/0.7.9_20251119:/usr/local/share/gforth/0.7.9_20251119:/usr/share/gforth/site-forth:/usr/local/share/gforth/site-forth"' -c -fno-gcse -fcaller-saves -fno-defer-pop -fno-inline -fwrapv -fno-strict-aliasing -fno-cse-follow-jumps -fno-reorder-blocks -fno-reorder-blocks-and-partition -fno-toplevel-reorder -falign-labels=1 -falign-loops=1 -falign-jumps=1 -fno-delete-null-pointer-checks -fcf-protection=none -fno-tree-vectorize -fno-lto -pthread -DENGINE=2 -fPIC -DPIC -o libengine-fast2-ll-reg-red.S -S engine-fast-red.i

    can be found at

    http://www.complang.tuwien.ac.at/anton/tmp/libengine-fast2-ll-reg-red.S

    Now the multiplier is permanently allocated to %r11, so searching for
    it won't help. However, if you search for "mulq", you will find the
    code generated for the three instances of the VM instruction. The
    first is optimized well, the second exhibits two mulqs and two shrqs,
    the third exhibits just one mulq, but two shrqs.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Nov 30 18:59:15 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    In reducing compiler bugs, automated tools such as delta or
    (much better) cvise are essential. Your test case was so
    large that cvise failed, so a lot of manual work was required.

    I have now done a manual reduction myself; essentially I left only the
    3 variants of the VM instruction that performs 10/, plus all the surroundings, and I added code to ensure that spTOS, spb, and spc are
    not dead. You find the result at

    http://www.complang.tuwien.ac.at/anton/tmp/engine-fast-red.i

    Do you have an example which tests the codepath taken for the
    offending piece of code, so it is possible to further reduce this
    case automatically? The example is still quite big (>13000 lines).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Nov 30 19:33:47 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    ERROR "unexpected byte sequence starting at index 356: '\xC3'" while decoding:

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.

    Why would a CS graduate need to know about pipelines?

    So they can properly simulate a pipelined processor?

    Sure, if a CS graduate works in an application area, they need to
    learn about that application area, whatever it is.

    It's useful for code optimization, as well.

    In what way?

    In general,
    any programmer should have a solid understanding of the
    underlying hardware - generically, and specifically
    for the hardware being programmed.

    Certainly. But do they need to know the difference between a Wallace multiplier
    and a Dadda multiplier?

    You do realize that all Wallace multipliers are Dadda multipliers ??
    But there are Dadda Multipliers that are not Wallace multipliers ?!?!?!

    If not, what is it about pipelined processors
    that would require CS graduates to know about them?

    How execution order disturbs things like program order and memory order.
    That is how and when they need to insert Fences in their multi-threaded
    code.

    Processor pipelines are not the basics of what a CS graduate is doing.
    They are an implementation detail in computer engineering.

    Which affect the performance of the software created by the
    software engineer (CS graduate).

    By a constant factor; and the software creator does not need to know
    that the CPU that executes instructions at 2 CPI (486) instead of at
    10 CPI (VAX-11/780) is pipelined; and these days both the 486 and the
    VAX are irrelevant to software creators.

    I do not believe that the word "the" in front of x86 or VAX is proper.

    A few more examples where compilers are not as good as even I expected:

    Just today, I compiled

    u4 = u1/10;
    u3 = u1%10;

    (plus some surrounding code) with gcc-14 in three contexts. Here's
    the code for two of them (the third one is similar to the second one):

    movabs $0xcccccccccccccccd,%rax     movabs $0xcccccccccccccccd,%rsi
    sub $0x8,%r13                       mov %r8,%rax
    mul %r8                             mov %r8,%rcx
    mov %rdx,%rax                       mul %rsi
    shr $0x3,%rax                       shr $0x3,%rdx
    lea (%rax,%rax,4),%rdx              lea (%rdx,%rdx,4),%rax
    add %rdx,%rdx                       add %rax,%rax
    sub %rdx,%r8                        sub %rax,%r8
    mov %r8,0x8(%r13)                   mov %rcx,%rax
    mov %rax,%r8                        mul %rsi
                                        shr $0x3,%rdx
                                        mov %rdx,%r9

    The major difference is that in the left context, u3 is stored into
    memory (at 0x8(%r13)), while in the right context, it stays in a
    register. In the left context, gcc managed to base its computation of
    u1%10 on the result of u1/10; in the right context, gcc first computes
    u1%10 (computing u1/10 as part of that), and then computes u1/10
    again.

    Sort of emphasizes that programmers need to understand the
    underlying hardware.

    I am the programmer of the code shown above. In what way would better knowledge of the hardware have made me aware that gcc would produce
    suboptimal code in some cases?

    Reading and thinking about the asm-code and running the various code
    sequences enough times that you can measure which is better and which
    is worse. That is the engineering part of software Engineering.

    What were u1, u3 and u4 declared as?

    unsigned long (on that platform).

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Sun Nov 30 22:38:39 2025
    From Newsgroup: comp.arch

    On 2025-11-30 21:33, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    ERROR "unexpected byte sequence starting at index 356: '\xC3'" while decoding:

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.

    Why would a CS graduate need to know about pipelines?

    So they can properly simulate a pipelined processor?

    Sure, if a CS graduate works in an application area, they need to
    learn about that application area, whatever it is.

    It's useful for code optimization, as well.

    In what way?

    In general,
    any programmer should have a solid understanding of the
    underlying hardware - generically, and specifically
    for the hardware being programmed.

    Certainly. But do they need to know the difference between a Wallace multiplier
    and a Dadda multiplier?

    You do realize that all Wallace multipliers are Dadda multipliers ??
    But there are Dadda Multipliers that are not Wallace multipliers ?!?!?!

    If not, what is it about pipelined processors
    that would require CS graduates to know about them?

    How execution order disturbs things like program order and memory order.
    That is how and when they need to insert Fences in their multi-threaded
    code.

    That is an aspect of processor architecture that is relevant to some programmers, but not to the large number of programmers who use
    languages or operating systems with built-in multi-threading and safe inter-thread communication primitives and services for input/output.

    I am the programmer of the code shown above. In what way would better
    knowledge of the hardware have made me aware that gcc would produce
    suboptimal code in some cases?

    Reading and thinking about the asm-code and running the various code sequences enough times that you can measure which is better and which
    is worse. That is the engineering part of software Engineering.

    That is a very niche part of software (performance) engineering. Speed
    of execution is only one of many "goodness" dimensions of a piece of SW, others including correctness, reliability, security, portability, maintainability, and so on. All dimensions need and depend on systematic engineering, although some dimensions cannot be quantified as easily as execution speed.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 30 22:11:26 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    In reducing compiler bugs, automated tools such as delta or
    (much better) cvise are essential. Your test case was so
    large that cvise failed, so a lot of manual work was required.

    I have now done a manual reduction myself; essentially I left only the
    3 variants of the VM instruction that performs 10/, plus all the
    surroundings, and I added code to ensure that spTOS, spb, and spc are
    not dead. You find the result at

    http://www.complang.tuwien.ac.at/anton/tmp/engine-fast-red.i

    Do you have an example which tests the codepath taken for the
    offending piece of code,

    Not easily.

    so it is possible to further reduce this
    case automatically? The example is still quite big (>13000 lines).

    Most of which is coming from including stdlib.h etc. The actual code
    of the gforth_engine function in that example is 264 lines, many of
    which are empty or line number indicators.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 30 22:17:19 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:


    scott@slp53.sl.home (Scott Lurndal) writes:
    In general,
    any programmer should have a solid understanding of the
    underlying hardware - generically, and specifically
    for the hardware being programmed.

    Certainly. But do they need to know the difference between a Wallace multiplier
    and a Dadda multiplier?

    You do realize that all Wallace multipliers are Dadda multipliers ??
    But there are Dadda Multipliers that are not Wallace multipliers ?!?!?!

    Good to know, but does not answer the question.

    If not, what is it about pipelined processors
    that would require CS graduates to know about them?

    How execution order disturbs things like program order and memory order.
    That is how and when they need to insert Fences in their multi-threaded code.

    And the relevance of pipelined processors for that issue is what?

    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude. If you implement per-CPU caches and multiple memory controllers as shoddily
    as possible while providing features for programs to slow themselves
    down heavily in order to get memory-ordering guarantees, then you get
    a weak memory model; slightly less shoddy, and you get a "strong" memory
    model. Processor pipelines have no relevance here.

    And, as Niklas Holsti observed, dealing with memory-ordering
    shenanigans is something that a few specialists do; no need for others
    to know about the memory model, except that common CPUs unfortunately
    do not implement sequential consistency.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 1 00:12:15 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:


    scott@slp53.sl.home (Scott Lurndal) writes:
    In general,
    any programmer should have a solid understanding of the
    underlying hardware - generically, and specifically
    for the hardware being programmed.

    Certainly. But do they need to know between a a Wallace multiplier
    and a Dadda multiplier?

    You do realize that all Wallace multipliers are Dadda multipliers ??
    But there are Dadda Multipliers that are not Wallace multipliers ?!?!?!

    Good to know, but does not answer the question.

    {Without contradicting that Wallace got on the correct track first}
    Wallace gets the credit that should rightly go to Dadda.

    If not, what is it about pipelined processors
    that would require CS graduates to know about them?

    How execution order disturbs things like program order and memory order.
    That is how and when they need to insert Fences in their multi-threaded
    code.

    And the relevance of pipelined processors for that issue is what?

    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude.

    And without the SuperComputer attitude, you sell 0 parts.
    {Remember how we talk about performance all the time here ?}

    If you implement per-CPU caches and multiple memory controllers as shoddily
    as possible while providing features for programs to slow themselves
    down heavily in order to get memory-ordering guarantees, then you get
    a weak memory model; slightly less shoddy, and you get a "strong" memory model. Processor pipelines have no relevance here.

    It is the pipelines themselves (along with the SuperComputer attitude)
    that give rise to the weak memory models.

    And, as Niklas Holsti observed, dealing with memory-ordering
    shenanigans is something that a few specialists do; no need for others
    to know about the memory model, except that common CPUs unfortunately
    do not implement sequential consistency.

    Because of the SuperComputer attitude ! {Performance first}

    And only after several languages built their own ATOMIC primitives, so
    the programmers could remain ignorant. But this also ties the hands of
    the designers in such a way that performance grows ever more slowly
    with more threads.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Dec 1 07:56:37 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude.

    And without the SuperComputer attitude, you sell 0 parts.
    {Remember how we talk about performance all the time here ?}

    Wrong. The supercomputer attitude gave us such wonders as IA-64
    (sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not
    only easier to program, but also faster.

    The advocates of weaker memory models justify them by pointing to the
    slowness of sequential consistency if one implements it by using
    fences on hardware optimized for a weaker memory model. But that's
    not the way to implement efficient sequential consistency.

    In an alternate reality where AMD64 did not happen and IA-64 won,
    people would justify the IA-64 ISA complexity as necessary for
    performance, and claim that the IA-32 hardware in the Itanium
    demonstrates the performance superiority of the EPIC approach, just
    like they currently justify the performance superiority of weak and
    "strong" memory models over sequential consistency.

    If hardware designers put their mind to it, they could make sequential consistency perform well, probably better on code that actually
    accesses data shared between different threads than weak and "strong"
    ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the
    slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).

    A similar case: Alpha includes a trapb instruction (an exception
    fence). Programmers have to insert it after FP instructions to get
    precise exceptions. This was justified with performance; i.e., the
    theory went: If you compile without trapb, you get performance and
    imprecise exceptions, if you compile with trapb, you get slowness and
    precise exceptions. I then measured SPEC 95 compiled without and with
    trapb <2003Apr3.202651@a0.complang.tuwien.ac.at>, and on the OoO 21264
    there was hardly any difference; I believe that trapb is a noop on the
    21264. Here's the SPECfp_base95 numbers:

    with     without
    trapb    trapb
     9.56     11.6    AlphaPC164LX 600MHz 21164A
    19.7      20.0    Compaq XP1000 500MHz 21264

    So the machine that needs trapb is much slower even without trapb than
    even the with-trapb variant on the machine where trapb is probably a
    noop. And lots of implementations of architectures without trapb have demonstrated since then that you can have high performance and precise exceptions without trapb.

    And only after several languages built their own ATOMIC primitives, so
    the programmers could remain ignorant. But this also ties the hands of
    the designers in such a way that performance grows ever more slowly
    with more threads.

    Maybe they could free their hands by designing for a
    sequential-consistency interface, just like designing for a simple sequential-execution model without EPIC features freed their hands to
    design microarchitectural features that allowed ordinary code to
    utilize wider and wider OoO cores profitably.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Dec 1 13:23:22 2025
    From Newsgroup: comp.arch

    On Mon, 01 Dec 2025 07:56:37 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude.

    And without the SuperComputer attitude, you sell 0 parts.
    {Remember how we talk about performance all the time here ?}

    Wrong. The supercomputer attitude gave us such wonders as IA-64
    (sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not
    only easier to program, but also faster.

    The advocates of weaker memory models justify them by pointing to the slowness of sequential consistency if one implements it by using
    fences on hardware optimized for a weaker memory model. But that's
    not the way to implement efficient sequential consistency.

    In an alternate reality where AMD64 did not happen and IA-64 won,
    people would justify the IA-64 ISA complexity as necessary for
    performance, and claim that the IA-32 hardware in the Itanium
    demonstrates the performance superiority of the EPIC approach, just
    like they currently justify the performance superiority of weak and
    "strong" memory models over sequential consistency.

    If hardware designers put their mind to it, they could make sequential consistency perform well, probably better on code that actually
    accesses data shared between different threads than weak and "strong" ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).


    Where does sequential consistency simplify programming over the x86 model
    of "TSO + globally ordered synchronization primitives +
    every synchronization primitive has implied barriers"?

    More so, where does it simplify over ARMv8.1-A, assuming that the programmer
    does not try to be too smart and never uses LL/SC and always uses
    8.1-style synchronization instructions with Acquire+Release flags set?

    IMHO, the only simple thing about sequential consistency is its simple
    description. Other than that, it simplifies very little. It does not
    magically make lockless multithreaded programming bearable to
    non-genius coders.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Dec 1 14:07:34 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude.
    And without the SuperComputer attitude, you sell 0 parts.
    {Remember how we talk about performance all the time here ?}

    Wrong. The supercomputer attitude gave us such wonders as IA-64
    (sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not
    only easier to program, but also faster.

    The advocates of weaker memory models justify them by pointing to the slowness of sequential consistency if one implements it by using
    fences on hardware optimized for a weaker memory model. But that's
    not the way to implement efficient sequential consistency.

    In an alternate reality where AMD64 did not happen and IA-64 won,
    people would justify the IA-64 ISA complexity as necessary for
    performance, and claim that the IA-32 hardware in the Itanium
    demonstrates the performance superiority of the EPIC approach, just
    like they currently justify the performance superiority of weak and
    "strong" memory models over sequential consistency.

    If hardware designers put their mind to it, they could make sequential consistency perform well, probably better on code that actually
    accesses data shared between different threads than weak and "strong" ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).

    A similar case: Alpha includes a trapb instruction (an exception
    fence). Programmers have to insert it after FP instructions to get
    precise exceptions. This was justified with performance; i.e., the
    theory went: If you compile without trapb, you get performance and
    imprecise exceptions, if you compile with trapb, you get slowness and
    precise exceptions. I then measured SPEC 95 compiled without and with
    trapb <2003Apr3.202651@a0.complang.tuwien.ac.at>, and on the OoO 21264
    there was hardly any difference; I believe that trapb is a noop on the
    21264. Here's the SPECfp_base95 numbers:

    with     without
    trapb    trapb
     9.56     11.6    AlphaPC164LX 600MHz 21164A
    19.7      20.0    Compaq XP1000 500MHz 21264

    So the machine that needs trapb is much slower even without trapb than
    even the with-trapb variant on the machine where trapb is probably a
    noop. And lots of implementations of architectures without trapb have demonstrated since then that you can have high performance and precise exceptions without trapb.

    The 21264 Hardware Reference Manual says TRAPB (general exception barrier)
    and EXCB (floating point control register barrier) are both NOP's
    internally, are tossed at decode, and don't even take up an
    instruction slot.

    The purpose of the EXCB is to synchronize pipeline access to the
    floating point control and status register with FP operations.
    In the worst case this stalls until the pipeline drains.

    I wonder how much logic it really saved allowing imprecise exceptions
    in the InO 21064 and 21164? Conversely, how much did it cost to deal
    with problems caused by leaving these interlocks off?

    The cores have multiple, parallel pipelines for int, lsq, fadd and fmul. Without exception interlocks, each pipeline only obeys the scoreboard
    rules for when to writeback its result register: WAW and WAR.
    That allows a younger, faster instruction to finish and write its register before an older, slower instruction. If that older instruction then throws
    an exception and does not write its register then we can see the out of
    order register writes.

    For register file writes to be precise in the presence of exceptions
    requires each instruction look ahead at the state of all older
    instructions *in all pipelines*.
    Each uOp can be Unresolved, Resolved_Normal, or Resolved_Exception.
    A writeback can occur if there are no WAW or WAR dependencies,
    and all older uOps are Resolved_Normal.

    Just off the top of my head, in addition to the normal scoreboard,
    a FIFO buffer with a priority selector could be used to look ahead
    at all older uOps and check their status, and allow or stall uOp
    writebacks and ensure registers always appear precise.
    Which really doesn't look that expensive.

    Is there something I missed, or would that FIFO suffice?
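
    Something like this C model of the gate, perhaps (names and sizes are
    mine; one FIFO entry per in-flight uOp, allocated in program order):

    enum status { UNRESOLVED, RESOLVED_NORMAL, RESOLVED_EXCEPTION };

    #define FIFO_SIZE 32           /* illustrative; one slot per in-flight uOp */

    struct uop_fifo {
        enum status st[FIFO_SIZE];
        int head;                  /* slot of the oldest in-flight uOp */
    };

    /* idx is the uOp's distance from the oldest in-flight uOp (0 = oldest);
       scoreboard_ok means the ordinary scoreboard sees no WAW/WAR hazard. */
    static int may_write_back(const struct uop_fifo *f, int idx, int scoreboard_ok)
    {
        if (!scoreboard_ok)
            return 0;
        for (int i = 0; i < idx; i++) {        /* priority scan of older uOps */
            if (f->st[(f->head + i) % FIFO_SIZE] != RESOLVED_NORMAL)
                return 0;          /* stall: an older uOp is unresolved or
                                      will take an exception */
        }
        return 1;                  /* register file stays precise */
    }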



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 1 22:50:15 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude.

    And without the SuperComputer attitude, you sell 0 parts.
    {Remember how we talk about performance all the time here ?}

    Wrong. The supercomputer attitude gave us such wonders as IA-64
    (sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not
    only easier to program, but also faster.

    The advocates of weaker memory models justify them by pointing to the slowness of sequential consistency if one implements it by using
    fences on hardware optimized for a weaker memory model. But that's
    not the way to implement efficient sequential consistency.

    In an alternate reality where AMD64 did not happen and IA-64 won,
    people would justify the IA-64 ISA complexity as necessary for
    performance, and claim that the IA-32 hardware in the Itanium
    demonstrates the performance superiority of the EPIC approach, just
    like they currently justify the performance superiority of weak and
    "strong" memory models over sequential consistency.

    If hardware designers put their mind to it, they could make sequential consistency perform well,

    Depends on your definition of SC and "performs well", but see below:

    probably better on code that actually
    accesses data shared between different threads than weak and "strong" ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).

    In the case of My 66000, there is a slightly weak memory model
    (Causal consistency) for accesses to DRAM, and there is Sequential
    consistency for ATOMIC stuff and device control registers, and then
    there is strongly ordered for configuration space access, and the
    programmer does not have to do "jack" to get these orderings--
    its all programmed in the PTEs.

    {{There is even a way to make DRAM accesses SC should you want.}}

    A similar case: Alpha includes a trapb instruction (an exception
    fence). Programmers have to insert it after FP instructions to get
    precise exceptions. This was justified with performance; i.e., the
    theory went: If you compile without trapb, you get performance and
    imprecise exceptions, if you compile with trapb, you get slowness and
    precise exceptions. I then measured SPEC 95 compiled without and with
    trapb <2003Apr3.202651@a0.complang.tuwien.ac.at>, and on the OoO 21264
    there was hardly any difference; I believe that trapb is a noop on the
    21264. Here's the SPECfp_base95 numbers:

    with     without
    trapb    trapb
     9.56     11.6    AlphaPC164LX 600MHz 21164A
    moderate slowdown
    19.7      20.0    Compaq XP1000 500MHz 21264
    slowdown has disappeared.

    So the machine that needs trapb is much slower even without trapb than
    even the with-trapb variant on the machine where trapb is probably a
    noop. And lots of implementations of architectures without trapb have demonstrated since then that you can have high performance and precise exceptions without trapb.

    And only after several languages built their own ATOMIC primitives, so
    the programmers could remain ignorant. But this also ties the hands of
    the designers in such a way that performance grows ever more slowly
    with more threads.

    Maybe they could free their hands by designing for a
    sequential-consistency interface, just like designing for a simple sequential-execution model without EPIC features freed their hands to
    design microarchitectural features that allowed ordinary code to
    utilize wider and wider OoO cores profitably.

    That is not the property I was getting at--the property I was getting at
    is that the language model for synchronization can only use 1 memory
    location {TS, TTS, CAS, DCAS, LL, SC} and this fundamentally limits the
    amount of work one can do in a single event, and also fundamentally limits
    what one can "say" about a concurrent data structure.

    Given a certain amount of interference--the fewer ATOMIC things one has
    to do the lower the chance of interference, and the greater the chance
    of success. So, if one could move an element of a CDS from one location
    to another in one ATOMIC event rather than 2 (or 3) then the exponent
    of synchronization overhead goes down, and then one can make statements
    like "and no outside observer can see the CDS without that element present"--which cannot be stated with current models.


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 1 23:03:24 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude.
    And without the SuperComputer attitude, you sell 0 parts.
    {Remember how we talk about performance all the time here ?}

    Wrong. The supercomputer attitude gave us such wonders as IA-64
    (sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not
    only easier to program, but also faster.

    The advocates of weaker memory models justify them by pointing to the slowness of sequential consistency if one implements it by using
    fences on hardware optimized for a weaker memory model. But that's
    not the way to implement efficient sequential consistency.

    In an alternate reality where AMD64 did not happen and IA-64 won,
    people would justify the IA-64 ISA complexity as necessary for
    performance, and claim that the IA-32 hardware in the Itanium
    demonstrates the performance superiority of the EPIC approach, just
    like they currently justify the performance superiority of weak and "strong" memory models over sequential consistency.

    If hardware designers put their mind to it, they could make sequential consistency perform well, probably better on code that actually
    accesses data shared between different threads than weak and "strong" ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).

    A similar case: Alpha includes a trapb instruction (an exception
    fence). Programmers have to insert it after FP instructions to get
    precise exceptions. This was justified with performance; i.e., the
    theory went: If you compile without trapb, you get performance and imprecise exceptions, if you compile with trapb, you get slowness and precise exceptions. I then measured SPEC 95 compiled without and with trapb <2003Apr3.202651@a0.complang.tuwien.ac.at>, and on the OoO 21264 there was hardly any difference; I believe that trapb is a noop on the 21264. Here's the SPECfp_base95 numbers:

    with     without
    trapb    trapb
     9.56     11.6    AlphaPC164LX 600MHz 21164A
    19.7      20.0    Compaq XP1000 500MHz 21264

    So the machine that needs trapb is much slower even without trapb than
    even the with-trapb variant on the machine where trapb is probably a
    noop. And lots of implementations of architectures without trapb have demonstrated since then that you can have high performance and precise exceptions without trapb.

    The 21264 Hardware Reference Manual says TRAPB (general exception barrier) and EXCB (floating point control register barrier) are both NOP's
    internally, are tossed at decode, and don't even take up an
    instruction slot.

    The purpose of the EXCB is to synchronize pipeline access to the
    floating point control and status register with FP operations.
    In the worst case this stalls until the pipeline drains.

    I wonder how much logic it really saved allowing imprecise exceptions
    in the InO 21064 and 21164?

    Having done something similar in Mc 88100, I can state that the amount
    of logic saved is too small to justify such naïveté.

    Conversely, how much did it cost to deal
    with problems caused by leaving these interlocks off?

    Way toooooo much. The SW delay to get all those things right cost more
    time than HW designers could have possibly saved leaving them out.

    The cores have multiple, parallel pipelines for int, lsq, fadd and fmul. Without exception interlocks, each pipeline only obeys the scoreboard
    rules for when to writeback its result register: WAW and WAR.
    That allows a younger, faster instruction to finish and write its register before an older, slower instruction. If that older instruction then throws
    an exception and does not write its register then we can see the out of
    order register writes.

    For register file writes to be precise in the presence of exceptions
    requires each instruction look ahead at the state of all older
    instructions *in all pipelines*.

    Or you use dead stages in the pipelines so instructions arrive at
    RF write ports no earlier than their compatriots. You still have to
    look across all the delay slots for forwarding opportunities.

    Each uOp can be Unresolved, Resolved_Normal, or Resolved_Exception.
    A writeback can occur if there are no WAW or WAR dependencies,
    and all older uOps are Resolved_Normal.

    That is the scoreboard model. The Reservation station has a simpler
    model by providing a unique register for each instruction (or µOp).

    Just off the top of my head, in addition to the normal scoreboard,
    a FIFO buffer with a priority selector could be used to look ahead
    at all older uOps and check their status,

    Such a block of logic is called a ReOrder Buffer.

    Given an architectural register file with 16-32 entries, and
    given a reorder buffer of 96+ entries--if you integrate both
    ARF and RoB into a single structure you call it a physical
    register file. A PRF is just a RoB that is big enough never
    to have to migrate registers to the ARF.

    and allow or stall uOp
    writebacks and ensure registers always appear precise.
    Which really doesn't look that expensive.

    Is there something I missed, or would that FIFO suffice?

    If the FiFo is big enough, it works just fine; if you scrimp on
    the FiFo, you will want to play games with orderings to make it
    faster.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Dec 2 07:10:16 2025
    From Newsgroup: comp.arch

    Semi-unaligned memory tradeoff. If unaligned access is required, the
    memory logic just increments the physical address by 64 bytes to fetch
    the next cache line. The issue with this is that it does not go back to
    the TLB to translate the incremented address, meaning no check is made
    for protection or translation of that address.

    It would be quite slow to have the instructions reissued and percolate
    down the cache access again.

    This should only be an issue if an unaligned access crosses a memory
    page boundary.

    The instruction causes an alignment fault if a page boundary crossing is detected.
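
    For what it is worth, the two checks are cheap to express; a C sketch
    (names are mine; 64-byte lines and 4 KiB pages assumed):

    #include <stdint.h>
    #include <stdbool.h>

    #define LINE_BYTES 64u      /* cache line size used above */
    #define PAGE_BYTES 4096u    /* page size; an assumption for the sketch */

    /* Crossing a line needs the second cache fetch (physical address + 64);
       crossing a page would also need a second TLB lookup, which is the case
       that raises the alignment fault here. */
    static bool crosses_line(uint64_t vaddr, unsigned size_bytes)
    {
        return (vaddr / LINE_BYTES) != ((vaddr + size_bytes - 1) / LINE_BYTES);
    }

    static bool crosses_page(uint64_t vaddr, unsigned size_bytes)
    {
        return (vaddr / PAGE_BYTES) != ((vaddr + size_bytes - 1) / PAGE_BYTES);
    }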

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Dec 2 18:50:12 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:
    The issue with this is it does not go backwards to
    get the address fetched again from the TLB. Meaning no check is made for >protection or translation of the address.

    It would be quite slow to have the instructions reissued and percolate
    down the cache access again.

    Unaligned access on a page boundary is extremely slow on the Core 2
    Duo (IIRC 160 cycles for a store). So don't be shy:-)

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Dec 2 19:55:43 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Semi-unaligned memory tradeoff. If unaligned access is required, the
    memory logic just increments the physical address by 64 bytes to fetch
    the next cache line. The issue with this is it does not go backwards to
    get the address fetched again from the TLB. Meaning no check is made for protection or translation of the address.

    You can determine if an access is misaligned "enough" to warrant two
    trips down the pipe.
    a) crosses cache width
    b) crosses page boundary

    Case b ALWAYS needs 2 trips; so the mechanism HAS to be there.

    It would be quite slow to have the instructions reissued and percolate
    down the cache access again.

    An AGEN-like adder has 11 gates of delay; you can determine misalignment
    in 4 gates of delay.

    This should only be an issue if an unaligned access crosses a memory
    page boundary.

    Here you need to access the TLB twice.

    The instruction causes an alignment fault if a page boundary crossing is detected.

    probably not as wise as you think.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Dec 2 21:20:33 2025
    From Newsgroup: comp.arch

    On 2025-12-02 2:55 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Semi-unaligned memory tradeoff. If unaligned access is required, the
    memory logic just increments the physical address by 64 bytes to fetch
    the next cache line. The issue with this is it does not go backwards to
    get the address fetched again from the TLB. Meaning no check is made for
    protection or translation of the address.

    You can determine is an access is misaligned "enough" to warrant two
    trips down the pipe.
    a) crosses cache width
    b) crosses page boundary

    Case b ALLWAYS needs 2 trips; so the mechanism HAS to be there.

    It would be quite slow to have the instructions reissued and percolate
    down the cache access again.

    An AGEN-like adder has 11-gates of delay, you can determine misaligned
    by 4-gates of delay.

    I was thinking in terms of clock cycles. The recalc of the address could
    be triggered by resetting bits in the reorder buffer, which causes the
    instruction to be re-dispatched. I am not sure how many clocks, but
    likely a minimum of four or five. Memory access is sequential, so it
    will stall other accesses too.

    I have a tendency not to think about the gate delays too much, until
    they appear on the timing path. The lookup tables can absorb a good
    chunk of gate delay.


    This should only be an issue if an unaligned access crosses a memory
    page boundary.

    Here you need to access the TLB twice.

    The instruction causes an alignment fault if a page cross boundary is
    detected.

    probably not as wise as you think.

    I coded it so it makes two trips to the TLB now for page boundaries (in theory). I got to thinking that maybe the page size could be made huge
    to avoid page crossings.

    I may need to put more logic in to ensure the same load store queue slot
    is used. I think it should work since things are sequential.

    My toy is broken. It is taking too long to synthesize. Qupls is so
    complex now. I may pick something simpler.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Thu Dec 4 16:54:56 2025
    From Newsgroup: comp.arch

    In article <20251201132322.000051a5@yahoo.com>,
    Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 01 Dec 2025 07:56:37 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    [snip]
    If hardware designers put their mind to it, they could make sequential
    consistency perform well, probably better on code that actually
    accesses data shared between different threads than weak and "strong"
    ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the
    slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).


    Where does sequential consistency simplify programming over the x86 model
    of "TSO + globally ordered synchronization primitives +
    every synchronization primitive has implied barriers"?

    More so, where it simplifies over ARMv8.1-A, assuming that programmer
    does not try to be too smart and never uses LL/SC and always uses
    8.1-style synchronization instructions with Acquire+Release flags set?

    IMHO, the only simple thing about sequential consistency is its simple
    description. Other than that, it simplifies very little. It does not
    magically make lockless multithreaded programming bearable to
    non-genius coders.

    Compiler writers have hidden behind the hardware complexity to make
    writing source code that is thread-safe much harder than it should be.
    If you have to support placing hardware barriers, then the languages
    can get away with requiring lots of <atomic> qualifiers everywhere, even
    on systems which don't need barriers, making the code more complex. And
    language purists still love to sneer at volatile in C-like languages as
    "providing no guarantees, and so is essentially useless"--when volatile
    providing no guarantees is a language and compiler choice, not something
    written in stone. A bunch of useful algorithms could be written with
    merely "volatile"-like semantics, but for some reason people like the
    line-noise-like junk of C++ atomics, where rather than thinking in terms
    of the algorithm, everyone needs to think in terms of release and
    acquire (which are weak-ordering concepts).

    Kent
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Dec 4 18:37:54 2025
    From Newsgroup: comp.arch


    kegs@provalid.com (Kent Dickey) posted:

    In article <20251201132322.000051a5@yahoo.com>,
    Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 01 Dec 2025 07:56:37 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    [snip]
    If hardware designers put their mind to it, they could make sequential
    consistency perform well, probably better on code that actually
    accesses data shared between different threads than weak and "strong"
    ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the
    slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).


    Where does sequential consistency simplifies programming over x86 model
    of "TCO + globally ordered synchronization primitives +
    every synchronization primitives have implied barriers"?

    More so, where it simplifies over ARMv8.1-A, assuming that programmer
    does not try to be too smart and never uses LL/SC and always uses
    8.1-style synchronization instructions with Acquire+Release flags set?

    IMHO, the only simple thing about sequential consistency is simple >description. Other than that, it simplifies very little. It does not >magically make lockless multithreaded programming bearable to
    non-genius coders.

    Compiler writers have hidden behind the hardware complexity to make
    writing source code that is thread-safe much harder than it should be.

    Blaming the wrong people.

    If you have to support placing hardware barriers, then the languages
    can get away with needing lots of <atomic> qualifiers everywhere, even
    on systems which don't need barriers, making the code more complex. And

    Thread-safe, by definition, is (IS) harder.

    language purists still love to sneer at volatile in C-like languages as "providing no guarantees, and so is essentially useless"--when volatile providing no guarantees is a language and compiler choice, not something written in stone.

    The problem with volatile is that all it means is that every time a
    volatile variable is touched, the code has to have a corresponding LD or
    ST. The HW ends up knowing nothing about the value's volatility and ends
    up in no position to help.

    A bunch of useful algorithms could be written with
    merely "volatile" like semantics, but for some reason, people like the line-noise-like junk of C++ atomics, where rather than thinking in terms
    of the algorithm, everyone needs to think in terms of release and acquire. (Which are weakly-ordering concepts).

    As far as ATOMICs go:: until you can code a single ATOMIC event that moves
    an element of a concurrent data structure from one place to another in a
    single event, you are thinking too SMALL (4 pointers in 4 different cache
    lines).

    In addition, the code should NOT have to test for success or failure, but
    be defined in such a way that if you get here, success is known, and if
    you get there, failure is known.

    Kent

    Mitch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Dec 5 11:10:22 2025
    From Newsgroup: comp.arch

    On 04/12/2025 19:37, MitchAlsup wrote:

    kegs@provalid.com (Kent Dickey) posted:


    Thread-safe, by definition, is (IS) harder.

    language purists still love to sneer at volatile in C-like languages as
    "providing no guarantees, and so is essentially useless"--when volatile
    providing no guarantees is a language and compiler choice, not something
    written in stone.

    The problem with volatile is that all it means is the every time a volatile variable is touched, the code has to have a corresponding LD or ST. The HW ends up knowing nothing about the value's volativity and ends up in no position to help.


    "volatile" /does/ provide guarantees - it just doesn't provide enough guarantees for multi-threaded coding on multi-core systems. Basically,
    it only works at the C abstract machine level - it does nothing that
    affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever. But
    you need volatile semantics for atomics and fences as well - there's no
    point in enforcing an order at the hardware level if the accesses can be re-ordered at the software level!

    "volatile" on its own is therefore not sufficient for atomics on big
    modern processors. But it /is/ sufficient for some uses, such as
    accessing hardware registers, or for small atomic loads and stores on
    single processor systems (which are far and away the biggest market, as embedded microcontrollers).
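
    A typical use looks like the following; the register addresses and bit
    layout are made up purely for illustration:

    #include <stdint.h>

    #define UART_DATA   (*(volatile uint32_t *)0x40001000u)  /* hypothetical */
    #define UART_STATUS (*(volatile uint32_t *)0x40001004u)  /* hypothetical */
    #define TX_READY    (1u << 5)

    static void uart_putc(char c)
    {
        while (!(UART_STATUS & TX_READY))
            ;                     /* every poll is a real load */
        UART_DATA = (uint32_t)c;  /* the store cannot be elided or reordered
                                     past other volatile accesses at the C level */
    }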

    As I see it, the biggest problem with "volatile" in C is
    misunderstandings and misuse of all sorts. At least, that's what I see
    in my field of embedded development.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Dec 5 14:37:57 2025
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >guarantees for multi-threaded coding on multi-core systems. Basically,
    it only works at the C abstract machine level - it does nothing that
    affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),
    and I don't think that C with just volatile gives you such guarantees.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Dec 5 18:29:48 2025
    From Newsgroup: comp.arch

    On 05/12/2025 15:37, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough
    guarantees for multi-threaded coding on multi-core systems. Basically,
    it only works at the C abstract machine level - it does nothing that
    affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".

    It says a good deal about the ordering at the C level - but nothing
    about it at the memory level.

    I know very little about the MMU setups on "big" systems like the x86-64 world. But in the embedded microcontroller world, it is very common for
    areas of the memory map to have sequential consistency even if other
    areas can be re-ordered, cached, or otherwise jumbled around. Thus for memory-mapped peripheral areas, memory accesses are kept strictly in
    order and "volatile" is all you need.

    If hardware guaranteed sequential consistency, volatile would provide guarantees that are as good on multi-core machines as on single-core machines.

    Sure. Of course multi-core systems will not have that hardware
    guarantee, at least not on main memory, for performance reasons. So
    there you need something more than just C "volatile" to force specific orderings. But volatile semantics will still be needed in many cases.
    Thus "volatile" is not sufficient, but it is still necessary. Usually,
    of course, all necessary "volatile" qualifiers are included in OS or
    library macros or functions for anything that needs them for locks or inter-process communication and the like. (In Linux, you have the
    READ_ONCE and WRITE_ONCE macros, which are just wrappers forcing
    volatile accesses.)
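
    A simplified sketch in the spirit of those macros, assuming a
    GCC/Clang-style __typeof__; the real kernel versions handle additional
    cases, but the core idea is just a forced volatile access:

    #define READ_ONCE(x)      (*(const volatile __typeof__(x) *)&(x))
    #define WRITE_ONCE(x, v)  (*(volatile __typeof__(x) *)&(x) = (v))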


    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),
    and I don't think that C with just volatile gives you such guarantees.


    Correct.

    Getting this wrong is one of the problems I have seen with volatile
    usage in embedded systems. I've seen people assuming that declaring "x"
    as "volatile" means that "x++;" is an atomic operation, or that volatile
    alone lets you share 64-bit data between threads on a 32-bit processor.

    Used correctly, it /can/ be enough for shared data between pre-emptive
    threads or a main loop and interrupts on a single core system. But
    sometimes you need to do more (for microcontrollers, that usually means disabling interrupts for a short period).
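
    A sketch of that single-core pattern; disable_irq()/enable_irq() are
    placeholders for whatever the target MCU actually provides:

    #include <stdint.h>

    extern void disable_irq(void);   /* assumed platform primitives */
    extern void enable_irq(void);

    static volatile uint32_t tick;        /* written only by the ISR           */
    static volatile uint64_t usec_total;  /* 64-bit: two words on a 32-bit MCU */

    void timer_isr(void)
    {
        tick++;                  /* fine: single writer, aligned 32-bit store */
        usec_total += 1000;
    }

    uint32_t get_tick(void)
    {
        return tick;             /* a volatile read is enough here */
    }

    uint64_t get_usec(void)
    {
        uint64_t t;
        disable_irq();           /* the 64-bit read takes two loads, so lock
                                    out the ISR briefly to keep it consistent */
        t = usec_total;
        enable_irq();
        return t;
    }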

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Dec 5 17:57:48 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >guarantees for multi-threaded coding on multi-core systems. Basically,
    it only works at the C abstract machine level - it does nothing that >affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage >queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide guarantees that are as good on multi-core machines as on single-core machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    and I don't think that C with just volatile gives you such guarantees.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Dec 5 20:10:11 2025
    From Newsgroup: comp.arch

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough
    guarantees for multi-threaded coding on multi-core systems. Basically,
    it only works at the C abstract machine level - it does nothing that
    affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor. Even with a
    single core system you can have pre-emptive multi-threading, or at least interrupt routines that may need to cooperate with other tasks on data.
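
    In portable C11 terms (nothing ISA-specific assumed), the first two
    look like:

    #include <stdatomic.h>

    static atomic_int counter;

    void hit(void)
    {
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    }

    /* Classic compare-and-swap retry loop: atomically raise a maximum. */
    void update_max(atomic_int *max, int v)
    {
        int cur = atomic_load_explicit(max, memory_order_relaxed);
        while (cur < v &&
               !atomic_compare_exchange_weak_explicit(max, &cur, v,
                                                      memory_order_relaxed,
                                                      memory_order_relaxed))
            ;   /* cur now holds the observed value; loop and retry */
    }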


    and I don't think that C with just volatile gives you such guarantees.

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Dec 5 20:54:00 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough
    guarantees for multi-threaded coding on multi-core systems. Basically, >>> it only works at the C abstract machine level - it does nothing that
    affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

    BOOLEAN DCAS( type oldp,  type_t oldq,
                  type *p,    type_t *q,
                  type newp,  type newq )
    {
        type t = esmLOCKload( *p );
        type r = esmLOCKload( *q );
        if( t == oldp && r == oldq )
        {
            *p = newp;
            esmLOCKstore( *q, newq );
            return TRUE;
        }
        return FALSE;
    }

    Move Element from one place to another:

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
        Element *fn = esmLOCKload( fr->next );
        Element *fp = esmLOCKload( fr->prev );
        Element *tn = esmLOCKload( to->next );
        esmLOCKprefetch( fn );
        esmLOCKprefetch( fp );
        esmLOCKprefetch( tn );
        if( !esmINTERFERENCE() )
        {
            fp->next = fn;
            fn->prev = fp;
            to->next = fr;
            tn->prev = fr;
            fr->prev = to;
            esmLOCKstore( fr->next, tn );
            return TRUE;
        }
        return FALSE;
    }

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    Even with a single core system you can have pre-emptive multi-threading, or at least interrupt routines that may need to cooperate with other tasks on data.


    and I don't think that C with just volatile gives you such guarantees.

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 5 14:55:36 2025
    From Newsgroup: comp.arch

    On 12/5/2025 12:54 PM, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>> it only works at the C abstract machine level - it does nothing that >>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>> but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

    BOOLEAN DCAS( type oldp, type_t oldq,
    type *p, type_t *q,
    type newp, type newq )
    {
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
    *p = newp;
    esmLOCKstore( *q, newq );
    return TRUE;
    }
    return FALSE;
    }

    Move Element from one place to another:

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    esmLOCKstore( fr->next, tn );
    return TRUE;
    }
    return FALSE;
    }

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    Any issues with live lock in here?

    [...]
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 5 15:03:53 2025
    From Newsgroup: comp.arch

    On 12/5/2025 11:10 AM, David Brown wrote:
    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough
    guarantees for multi-threaded coding on multi-core systems.  Basically, >>>> it only works at the C abstract machine level - it does nothing that
    affects the hardware.  So volatile writes are ordered at the C level, >>>> but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    It's strange that double-word compare-and-swap (DWCAS), where the words
    are contiguous, is reported by some compilers as not lock-free even on
    x86. For a 32-bit system we have cmpxchg8b; for a 64-bit system,
    cmpxchg16b. But the compiler still reports it as not lock-free. Strange.

    using cmpxchg instead of xadd:
    https://forum.pellesc.de/index.php?topic=7167.0

    trying to tell me that a DWCAS is not lock free: https://forum.pellesc.de/index.php?topic=7311.msg27764#msg27764

    This should be lock-free on an x86, even x64:

    struct ct_proxy_dwcas
    {
        struct ct_proxy_node* node;
        intptr_t count;
    };

    some of my older code:

    AC_SYS_APIEXPORT
    int AC_CDECL
    np_ac_i686_atomic_dwcas_fence
    ( void*,
      void*,
      const void* );


    np_ac_i686_atomic_dwcas_fence PROC
        push esi
        push ebx
        ; after the pushes: [esp+12] = destination, [esp+16] = comparand,
        ; [esp+20] = exchange value
        mov esi, [esp + 16]
        mov eax, [esi]              ; expected value into edx:eax
        mov edx, [esi + 4]
        mov esi, [esp + 20]
        mov ebx, [esi]              ; new value into ecx:ebx
        mov ecx, [esi + 4]
        mov esi, [esp + 12]         ; destination
        lock cmpxchg8b qword ptr [esi]
        jne np_ac_i686_atomic_dwcas_fence_fail
        xor eax, eax                ; success: return 0
        pop ebx
        pop esi
        ret

    np_ac_i686_atomic_dwcas_fence_fail:
        mov esi, [esp + 16]         ; write observed value back to comparand
        mov [esi + 0], eax
        mov [esi + 4], edx
        mov eax, 1                  ; failure: return 1
        pop ebx
        pop esi
        ret
    np_ac_i686_atomic_dwcas_fence ENDP
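
    For comparison, a portable C11 version of the same idea; whether it
    ends up lock-free (cmpxchg8b / cmpxchg16b) or falls back to a lock is
    up to the compiler and target flags, which is exactly the complaint
    above:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    struct dw {
        void     *node;
        intptr_t  count;
    };

    static bool dwcas(_Atomic struct dw *dst, struct dw *expected,
                      struct dw desired)
    {
        /* On failure, *expected is updated with the observed value,
           matching the hand-written routine above. */
        return atomic_compare_exchange_strong(dst, expected, desired);
    }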


    Even with a
    single core system you can have pre-emptive multi-threading, or at least interrupt routines that may need to cooperate with other tasks on data.


    and I don't think that C with just volatile gives you such guarantees.

    - anton


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Dec 6 00:40:11 2025
    From Newsgroup: comp.arch

    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count. Removing the bypassing of r0 from the register file shaved 1000
    LUTs off the design. This is no real loss as most instructions can
    substitute small constants for register values.

    Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
    to zero only in the agen units. So, the bypass is only in a couple of
    places. Otherwise r0 can be used as an ordinary register. Load / store instructions cannot use r0 as a GPR then, but it works for the PowerPC.

    I hit this trying to decide where to bypass another register code to
    represent the instruction pointer. In that case I think it may be better
    to go RISCV style and just add an instruction to add the IP to a
    constant and place it in a register. The alternative might be to
    sacrifice a bit of displacement to indicate IP relative addressing.

    Anyone got a summary of bypassing r0 in different architectures?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Dec 6 07:26:24 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:
    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count.

    My impression is that modern implementations deal with this kind of
    stuff at decoding or in the renamer. That should reduce the number of
    places where it is special-cased to one, but it means that the uops
    have to represent 0 in some way. One way would be to have a physical
    register that is 0 and that is never allocated, but if your
    microarchitecture needs a reduction of actual read ports (compared to
    potential read ports), you may prefer a different representation of 0
    in the uops.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Dec 6 05:13:01 2025
    From Newsgroup: comp.arch

    On 2025-12-06 2:26 a.m., Anton Ertl wrote:
    Robert Finch <robfi680@gmail.com> writes:
    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count.

    My impression is that modern implementations deal with this kind of
    stuff at decoding or in the renamer. That should reduce the number of
    places where it is special-cased to one, but it means that the uops
    have to represent 0 in some way. One way would be to have a physical register that is 0 and that is never allocated, but if your
    microarchitecture needs a reduction of actual read ports (compared to potential read ports), you may prefer a different representation of 0
    in the uops.

    - anton

    Thanks,

    It should have occurred to me to do this at the decode stage. Constants
    are decoded and passed along for all register fields in decode. There
    are only four decoders fortunately.

    Switching the ISA back to having r0 as zero all the time.
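
    As a rough sketch of what that decode step does (structure and field
    names are invented): the r0 specifier becomes a constant-zero operand
    before rename/issue, so no later stage has to special-case it:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     is_const;   /* operand is an immediate, not a register read */
        uint8_t  reg;        /* source register if !is_const                 */
        uint64_t imm;        /* constant value if is_const                   */
    } Operand;

    static Operand decode_src(uint8_t rspec)
    {
        if (rspec == 0)                       /* r0 reads as the constant 0 */
            return (Operand){ .is_const = true, .imm = 0 };
        return (Operand){ .is_const = false, .reg = rspec };
    }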


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Sat Dec 6 14:42:13 2025
    From Newsgroup: comp.arch

    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>> it only works at the C abstract machine level - it does nothing that >>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>> but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.


    The functions below rely on more than that - to make them work, as far as
    I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting
    until the esmLOCKstore instruction. Or am I missing something here?

    It is not easy to have atomic or lock mechanisms on multi-core systems
    that are convenient to use, efficient even in the worst cases, and don't require additional hardware.


    Compare Double, Swap Double::

    BOOLEAN DCAS( type oldp, type_t oldq,
    type *p, type_t *q,
    type newp, type newq )
    {
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
    *p = newp;
    esmLOCKstore( *q, newq );
    return TRUE;
    }
    return FALSE;
    }

    Move Element from one place to another:

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    esmLOCKstore( fr->next, tn );
    return TRUE;
    }
    return FALSE;
    }

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.


    Of course. It is interesting to speculate about possible features of an architecture like yours, but it is not likely to be available to anyone
    else in practice (unless perhaps it can be implemented as an extension
    for RISC-V).

    Even with a
    single core system you can have pre-emptive multi-threading, or at least
    interrupt routines that may need to cooperate with other tasks on data.


    and I don't think that C with just volatile gives you such guarantees. >>>>
    - anton


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat Dec 6 17:16:11 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough
    guarantees for multi-threaded coding on multi-core systems. Basically, >> >>> it only works at the C abstract machine level - it does nothing that
    affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

    BOOLEAN DCAS( type oldp, type_t oldq,
    type *p, type_t *q,
    type newp, type newq )
    {
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
    *p = newp;
    esmLOCKstore( *q, newq );
    return TRUE;
    }
    return FALSE;
    }

    Move Element from one place to another:

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    esmLOCKstore( fr->next, tn );
    return TRUE;
    }
    return FALSE;
    }

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    In my 40 years of SMP OS/HV work, I don't recall a
    situation where 'MoveElement' would be useful or
    required as a hardware atomic operation.

    Individual atomic "Remove Element" and "Insert/Append Element"[*], yes. Combined? Too inflexible.

    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

    Atomic add/sub are useful. The other atomic math operations (min, max, etc) may be useful in certain cases as well.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 6 17:22:55 2025
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/5/2025 12:54 PM, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>> it only works at the C abstract machine level - it does nothing that >>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>> but that says nothing about how they might progress through storage >>>>> queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering >>>> on hardware with weaker memory ordering than sequential consistency". >>>> If hardware guaranteed sequential consistency, volatile would provide >>>> guarantees that are as good on multi-core machines as on single-core >>>> machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems), >>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

    BOOLEAN DCAS( type oldp, type_t oldq,
    type *p, type_t *q,
    type newp, type newq )
    {
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
    *p = newp;
    esmLOCKstore( *q, newq );
    return TRUE;
    }
    return FALSE;
    }

    Move Element from one place to another:

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    esmLOCKstore( fr->next, tn );
    return TRUE;
    }
    return FALSE;
    }

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    Any issues with live lock in here?

    A bit hard to tell because of 2 things::
    a) I carry around the thread priority and when interference occurs,
    the higher-priority thread wins--on ties, the thread already in the
    event wins.
    b) live-lock is resolved (or not) by the caller of these routines, not
    by the routines themselves.

    [...]
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 6 17:29:53 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count. Removing the bypassing of r0 from the register file shaved 1000
    LUTs off the design. This is no real loss as most instructions can substitute small constants for register values.

    Often the use of R0 as an operand causes the calculation to be degenerate.
    That is, R0 is not needed at all.
    ADD R9,R7,R0 // is a MOV instruction
    AND R9,R7,R0 // is a CLR instruction

    So, you don't have to treat R0 in bypassing, but as Operand processing.

    Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
    to zero only in the agen units. So, the bypass is only in a couple of places. Otherwise r0 can be used as an ordinary register. Load / store instructions cannot use r0 as a GPR then, but it works for the PowerPC.

    AGEN Rbase ==R0 implies Rbase = IP
    AGEN Rindex==R0 implies Rindex = 0

    I hit this trying to decide where to bypass another register code to represent the instruction pointer. In that case I think it may be better
    to go RISCV style and just add an instruction to add the IP to a
    constant and place it in a register. The alternative might be to
    sacrifice a bit of displacement to indicate IP relative addressing.

    Anyone got a summary of bypassing r0 in different architectures?

    These are some of the reasons I went with
    a) universal constants
    b) R0 is just another GPR
    So, R0 gets forwarded just as often (or as rarely) as any joe-random register.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 6 17:31:43 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Robert Finch <robfi680@gmail.com> writes:
    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count.

    My impression is that modern implementations deal with this kind of
    stuff at decoding or in the renamer. That should reduce the number of
    places where it is special-cased to one, but it means that the uops
    have to represent 0 in some way. One way would be to have a physical register that is 0 and that is never allocated, but if your
    microarchitecture needs a reduction of actual read ports (compared to potential read ports), you may prefer a different representation of 0
    in the uops.

    Another way to implement R0 is to have an AND gate after the Operand
    flip-flop: if <whatever> was captured is R0, then AND with 0, otherwise
    AND with 1.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 6 17:44:30 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>> it only works at the C abstract machine level - it does nothing that >>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>> but that says nothing about how they might progress through storage >>>>> queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering >>>> on hardware with weaker memory ordering than sequential consistency". >>>> If hardware guaranteed sequential consistency, volatile would provide >>>> guarantees that are as good on multi-core machines as on single-core >>>> machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems), >>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.


    The functions below rely on more than that - to make the work, as far as
    I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting
    until the esmLOCKstore instruction. Or am I missing something here?

    In the above, I was stating that the maximum width of LD/ST can be a lot
    bigger than the size of a single register, not that the above instructions
    make writing ATOMIC events easier.

    There is no bus!

    The esmLOCKload causes the <translated> address to be 'monitored'
    for interference, and to announce participation in the ATOMIC event.

    The FIRST esmLOCKload tells the core that an ATOMIC event is beginning,
    AND sets up a default control point (This instruction itself) so that
    if interference is detected at esmLOCKstore control is transferred to
    that control point.

    So, there is no way to write Test-and-Set !! you get Test-and-Test-and-Set
    for free.

    There is a branch-on-interference instruction that
    a) does what it says,
    b) sets up an alternate atomic control point.

    It is not easy to have atomic or lock mechanisms on multi-core systems
    that are convenient to use, efficient even in the worst cases, and don't require additional hardware.

    I am using the "Miss Buffer" as the point of monitoring for interference.
    a) it already has to monitor "other hits" from outside accesses to deal
    with the coherence mechanism.
    b) the esm additions to the Miss Buffer are on the order of 2%

    c) there are other means to strengthen guarantees of forward progress.


    Compare Double, Swap Double::

    BOOLEAN DCAS( type oldp, type_t oldq,
    type *p, type_t *q,
    type newp, type newq )
    {
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
    *p = newp;
    esmLOCKstore( *q, newq );
    return TRUE;
    }
    return FALSE;
    }

    Move Element from one place to another:

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    esmLOCKstore( fr->next, tn );
    return TRUE;
    }
    return FALSE;
    }

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    Of course. It is interesting to speculate about possible features of an architecture like yours, but it is not likely to be available to anyone
    else in practice (unless perhaps it can be implemented as an extension
    for RISC-V).

    Even with a
    single core system you can have pre-emptive multi-threading, or at least >> interrupt routines that may need to cooperate with other tasks on data.


    and I don't think that C with just volatile gives you such guarantees. >>>>
    - anton


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 6 18:07:50 2025
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >> >>> guarantees for multi-threaded coding on multi-core systems. Basically,
    it only works at the C abstract machine level - it does nothing that >> >>> affects the hardware. So volatile writes are ordered at the C level, >> >>> but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency". >> >> If hardware guaranteed sequential consistency, volatile would provide >> >> guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems), >> >
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

    BOOLEAN DCAS( type oldp, type_t oldq,
    type *p, type_t *q,
    type newp, type newq )
    {
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
    *p = newp;
    esmLOCKstore( *q, newq );
    return TRUE;
    }
    return FALSE;
    }

    Move Element from one place to another:

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    esmLOCKstore( fr->next, tn );
    return TRUE;
    }
    return FALSE;
    }

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    In my 40 years of SMP OS/HV work, I don't recall a
    situation where 'MoveElement' would be useful or
    required as an hardware atomic operation.

    The question is not would "MoveElement" be useful, but
    would it be useful to have a single ATOMIC event be
    able to manipulate {5,6,7,8} pointers in one event ??

    Individual atomic "Remove Element" and "Insert/Append Element"[*], yes. Combined? Too inflexible.

    BOOLEAN InsertElement( Element *el, Element *to )
    {
        Element *tn = esmLOCKload( to->next );
        esmLOCKprefetch( el );
        esmLOCKprefetch( tn );
        if( !esmINTERFERENCE() )
        {
            el->next = tn;
            el->prev = to;
            to->next = el;
            esmLOCKstore( tn->prev, el );
            return TRUE;
        }
        return FALSE;
    }

    BOOLEAN RemoveElement( Element *fr )
    {
        Element *fn = esmLOCKload( fr->next );
        Element *fp = esmLOCKload( fr->prev );
        esmLOCKprefetch( fn );
        esmLOCKprefetch( fp );
        if( !esmINTERFERENCE() )
        {
            fp->next = fn;
            fn->prev = fp;
            fr->prev = NULL;
            esmLOCKstore( fr->next, NULL );
            return TRUE;
        }
        return FALSE;
    }


    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

    Atomic add/sub are useful. The other atomic math operations (min, max, etc) may be useful in certain cases as well.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat Dec 6 19:04:09 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:


    Move Element from one place to another:

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    esmLOCKstore( fr->next, tn );
    return TRUE;
    }
    return FALSE;
    }

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    In my 40 years of SMP OS/HV work, I don't recall a
    situation where 'MoveElement' would be useful or
    required as an hardware atomic operation.

    The question is not would "MoveElement" be useful, but
    would it be useful to have a single ATOMIC event be
    able to manipulate {5,6,7,8} pointers in one event ??

    Nothing comes immediately to mind.


    Individual atomic "Remove Element" and "Insert/Append Element"[*], yes.
    Combined? Too inflexible.

    BOOLEAN InsertElement( Element *el, Element *to )
    {
    tn = esmLOCKload( to->next );
    esmLOCKprefetch( el );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    el->next = tn;
    el->prev = to;
    to->next = el;
    esmLOCKstore( tn->prev, el );
    return TRUE;
    }
    return FALSE;
    }

    BOOLEAN RemoveElement( Element *fr )
    {
    fn = esmLOCKload( fr->next );
    fp = esmLOCKload( fr->prev );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    fr->prev = NULL;
    esmLOCKstore( fr->next, NULL );
    return TRUE;
    }
    return FALSE;
    }


    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

    Yes, you can add special instructions. However, compilers will be unlikely
    to generate them, so applications that want such an instruction generated
    would need a compiler extension (like gcc __builtin functions) or inline
    assembler, which makes any program that uses the capability both
    compiler-specific _and_ hardware-specific.

    Most extant SMP processors provide a compare-and-swap operation, which
    is widely supported by the common compilers that support the C and C++
    threading functionality.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Dec 6 21:36:27 2025
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Yes, you can add special instructions. However, the compilers will be unlikely
    to generate them, thus applications that desired the generation of such an instruction would need to create a compiler extension (like gcc __builtin functions)
    or inline assembler which would then make the program that uses the capability both compiler
    specific _and_ hardware specific.

    Most extant SMP processors provide a compare and swap operation, which
    are widely supported by the common compilers that support the C and C++ threading functionality.

    Interestingly, Linux restartable sequences allow for acquisition of
    a lock with no membarrier or atomic instruction on the fast path,
    at the cost of a syscall on the slow path (no free lunch...)

    But you also need assembler to do it.

    An example can be found at https://gitlab.ethz.ch/extra_projects/cpu-local-lock
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 6 21:44:17 2025
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:


    Move Element from one place to another:

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    esmLOCKstore( fr->next, tn );
    return TRUE;
    }
    return FALSE;
    }

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    In my 40 years of SMP OS/HV work, I don't recall a
    situation where 'MoveElement' would be useful or
    required as a hardware atomic operation.

    The question is not would "MoveElement" be useful, but
    would it be useful to have a single ATOMIC event be
    able to manipulate {5,6,7,8} pointers in one event ??

    Nothing comes immediately to mind.


    Individual atomic "Remove Element" and "Insert/Append Element"[*], yes.
    Combined? Too inflexible.

    BOOLEAN InsertElement( Element *el, Element *to )
    {
    tn = esmLOCKload( to->next );
    esmLOCKprefetch( el );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    el->next = tn;
    el->prev = to;
    to->next = el;
    esmLOCKstore( tn->prev, el );
    return TRUE;
    }
    return FALSE;
    }

    BOOLEAN RemoveElement( Element *fr )
    {
    fn = esmLOCKload( fr->next );
    fp = esmLOCKload( fr->prev );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    fr->prev = NULL;
    esmLOCKstore( fr->next, NULL );
    return TRUE;
    }
    return FALSE;
    }


    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

    Yes, you can add special instructions. However, the compilers will be unlikely
    to generate them, thus applications that desired the generation of such an instruction would need to create a compiler extension (like gcc __builtin functions)
    or inline assembler which would then make the program that uses the capability both compiler
    specific _and_ hardware specific.

    So, in other words, if you can't put it in every ISA known to man,
    don't bother making something better than existent ?!?

    Most extant SMP processors provide a compare and swap operation, which
    are widely supported by the common compilers that support the C and C++ threading functionality.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Dec 6 18:33:55 2025
    From Newsgroup: comp.arch

    On 2025-12-06 12:29 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count. Removing the bypassing of r0 from the register file shaved 1000
    LUTs off the design. This is no real loss as most instructions can
    substitute small constants for register values.

    Often the use of R0 as an operand causes the calculation to be degenerate. That is, R0 is not needed at all.
    ADD R9,R7,R0 // is a MOV instruction
    AND R9,R7,R0 // is a CLR instruction

    We dont want no degenerating instructions.

    So, you don't have to treat R0 in bypassing, but as Operand processing.

    Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
    to zero only in the agen units. So, the bypass is only in a couple of
    places. Otherwise r0 can be used as an ordinary register. Load / store
    instructions cannot use r0 as a GPR then, but it works for the PowerPC.

    AGEN Rbase ==R0 implies Rbase = IP
    AGEN Rindex==R0 implies Rindex = 0

    Qupls now follows a similar paradigm.
    Rbase = r0 bypasses to 0
    Rindex = r0 bypasses to 0
    Rbase = r31 bypasses to IP
    Bypassing r0 for both base and index allows absolute addressing mode.
    Otherwise r0, r31 are general-purpose regs.
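
    A minimal C sketch of those bypass rules (illustrative only; the register-file array, IP value, and displacement width are stand-ins, not actual Qupls RTL):

    #include <stdint.h>

    uint64_t agen(unsigned Rbase, unsigned Rindex, int64_t disp,
                  const uint64_t gpr[64], uint64_t ip)
    {
        uint64_t base  = (Rbase  == 0) ? 0 : (Rbase == 31) ? ip : gpr[Rbase];
        uint64_t index = (Rindex == 0) ? 0 : gpr[Rindex];
        /* Rbase == Rindex == r0 leaves just the displacement: absolute addressing. */
        return base + index + disp;
    }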

    I hit this trying to decide where to bypass another register code to
    represent the instruction pointer. In that case I think it may be better
    to go RISCV style and just add an instruction to add the IP to a
    constant and place it in a register. The alternative might be to
    sacrifice a bit of displacement to indicate IP relative addressing.

    Anyone got a summary of bypassing r0 in different architectures?

    These are some of the reasons I went with
    a) universal constants
    b) R0 is just another GPR
    So, R0, gets forwarded just as often (or lack thereof) as any joe-random register.

    Qupls has IP offset constant loading.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Dec 6 18:55:17 2025
    From Newsgroup: comp.arch

    On 2025-12-06 6:33 p.m., Robert Finch wrote:
    On 2025-12-06 12:29 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count. Removing the bypassing of r0 from the register file shaved 1000
    LUTs off the design. This is no real loss as most instructions can
    substitute small constants for register values.

    Often the use of R0 as an operand causes the calculation to be
    degenerate.
    That is, R0 is not needed at all.
          ADD   R9,R7,R0        // is a MOV instruction
          AND   R9,R7,R0        // is a CLR instruction

    We dont want no degenerating instructions.

    So, you don't have to treat R0 in bypassing, but as Operand processing.
    Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
    to zero only in the agen units. So, the bypass is only in a couple of
    places. Otherwise r0 can be used as an ordinary register. Load / store
    instructions cannot use r0 as a GPR then, but it works for the PowerPC.

    AGEN Rbase ==R0 implies Rbase  = IP
    AGEN Rindex==R0 implies Rindex = 0

    Qupls now follows a similar paradigm.
     Rbase = r0 bypasses to 0
     Rindex = r0 bypasses to 0
     Rbase = r31 bypasses to IP
    Bypassing r0 for both base and index allows absolute addressing mode. Otherwise r0, r31 are general-purpose regs.

    I hit this trying to decide where to bypass another register code to
    represent the instruction pointer. In that case I think it may be better
    to go RISCV style and just add an instruction to add the IP to a
    constant and place it in a register. The alternative might be to
    sacrifice a bit of displacement to indicate IP relative addressing.

    Anyone got a summary of bypassing r0 in different architectures?

    These are some of the reasons I went with
    a) universal constants
    b) R0 is just another GPR
    So, R0, gets forwarded just as often (or lack thereof) as any joe-random
    register.

    Qupls has IP offset constant loading.



    No sooner had I updated the spec than I added two more opcodes to
    perform loads and stores using IP relative addressing. That way, no need
    to use r31, leaving 31 registers completely general purpose. I am
    wanting to cast some aspects of the ISA in stone, or it will never get anywhere.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Dec 7 03:29:05 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-12-06 6:33 p.m., Robert Finch wrote:
    On 2025-12-06 12:29 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count. Removing the bypassing of r0 from the register file shaved 1000
    LUTs off the design. This is no real loss as most instructions can
    substitute small constants for register values.

    Often the use of R0 as an operand causes the calculation to be
    degenerate.
    That is, R0 is not needed at all.
          ADD   R9,R7,R0        // is a MOV instruction
          AND   R9,R7,R0        // is a CLR instruction

    We dont want no degenerating instructions.

    So, you don't have to treat R0 in bypassing, but as Operand processing.
    Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
    to zero only in the agen units. So, the bypass is only in a couple of
    places. Otherwise r0 can be used as an ordinary register. Load / store
    instructions cannot use r0 as a GPR then, but it works for the PowerPC.
    AGEN Rbase ==R0 implies Rbase  = IP
    AGEN Rindex==R0 implies Rindex = 0

    Qupls now follows a similar paradigm.
     Rbase = r0 bypasses to 0
     Rindex = r0 bypasses to 0
     Rbase = r31 bypasses to IP
    Bypassing r0 for both base and index allows absolute addressing mode. Otherwise r0, r31 are general-purpose regs.

    I hit this trying to decide where to bypass another register code to
    represent the instruction pointer. In that case I think it may be better
    to go RISCV style and just add an instruction to add the IP to a
    constant and place it in a register. The alternative might be to
    sacrifice a bit of displacement to indicate IP relative addressing.

    Anyone got a summary of bypassing r0 in different architectures?

    These are some of the reasons I went with
    a) universal constants
    b) R0 is just another GPR
    So, R0, gets forwarded just as often (or lack thereof) as any joe-random
    register.

    Qupls has IP offset constant loading.



    No sooner than having updated the spec, I added two more opcodes to
    perform loads and stores using IP relative addressing. That way, no need
    to use r31, leaving 31 registers completely general purpose. I am
    wanting to cast some aspects of the ISA in stone, or it will never get anywhere.

    Cast some elements in plaster--this will hold for a few years until
    you find the bigger mistakes, then demolish the plaster and fix the
    parts that don't work so well.

    After 6 years of essential stability, I did a major update to My 66000
    ISA last month. The new ISA is ASCII compatible with the last, but not
    at the binary level, which solves several problems and saves another
    2%-4% in code footprint.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Dec 7 09:30:50 2025
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Yes, you can add special instructions. However, the compilers will be unlikely
    to generate them, thus applications that desired the generation of such an instruction would need to create a compiler extension (like gcc __builtin functions)
    or inline assembler which would then make the program that uses the capability both compiler
    specific _and_ hardware specific.

    This would likely be hidden in a header, and need only be
    written once (although gcc and clang, for example, are compatible
    in this respect). And people have been doing this, even for
    microarchitecture specific features, if the need for performance
    gain is large enough.

    A primary example is Intel TSX, which is (was?) required by SAP.
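
    A sketch of what hiding it in a header can look like (illustrative only, not SAP's or glibc's actual code): a small lock that elides itself with the RTM intrinsics when the compiler provides them (needs -mrtm) and otherwise falls back to a plain CAS lock.

    #include <stdatomic.h>
    #if defined(__RTM__)
    #include <immintrin.h>          /* _xbegin/_xend/_xabort/_xtest */
    #endif

    static atomic_int lockword;     /* 0 = free, 1 = held */

    static inline void elided_lock(void)
    {
    #if defined(__RTM__)
        if (_xbegin() == _XBEGIN_STARTED) {
            /* Subscribe to the lock word: if it is free we run transactionally;
               a real acquisition by another thread conflicts and aborts us. */
            if (atomic_load_explicit(&lockword, memory_order_relaxed) == 0)
                return;
            _xabort(0xff);
        }
    #endif
        int expected = 0;           /* fallback: ordinary CAS lock */
        while (!atomic_compare_exchange_weak_explicit(&lockword, &expected, 1,
                   memory_order_acquire, memory_order_relaxed))
            expected = 0;
    }

    static inline void elided_unlock(void)
    {
    #if defined(__RTM__)
        if (_xtest()) { _xend(); return; }   /* commit the transaction */
    #endif
        atomic_store_explicit(&lockword, 0, memory_order_release);
    }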

    POWER also had a transactional memory feature, but they messed it
    up for POWER 9 and dropped it for POWER 10 (IIRC); POWER is the
    only other architecture certified to run SAP, so it seems they
    can do without.

    Googling around, I also find the "Transactional Memory Extension"
    for ARM dated 2022, so ARM also appears to see some value in that,
    at least enough to write a spec for it.

    Most extant SMP processors provide a compare and swap operation, which
    are widely supported by the common compilers that support the C and C++ threading functionality.

    It seems there is a market for going beyond compare and swap.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Dec 7 16:05:32 2025
    From Newsgroup: comp.arch

    On Sun, 7 Dec 2025 09:30:50 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Yes, you can add special instructions. However, the compilers
    will be unlikely to generate them, thus applications that desired
    the generation of such an instruction would need to create a
    compiler extension (like gcc __builtin functions) or inline
    assembler which would then make the program that uses the
    capability both compiler specific _and_ hardware specific.

    This would likely be hidden in a header, and need only be
    written once (although gcc and clang, for example, are compatible
    in this respect). And people have been doing this, even for microarchitecture specific features, if the need for performance
    gain is large enough.

    A primary example is Intel TSX, which is (was?) required by SAP.


    By SAP HANA, I assume.
    Not sure for how long it was true. It sounds very unlikely that it is
    still true.

    POWER also had a transactional memory feature, but they messed it
    up for POWER 9 and dropped it for POWER 10 (IIRC); POWER is the
    only other architecture certified to run SAP, so it seems they
    can do without.

    Googling around, I also find the "Transactional Memory Extension"
    for ARM dated 2022, so ARM also appears to see some value in that,
    at least enough to write a spec for it.

    Most extant SMP processors provide a compare and swap operation,
    which are widely supported by the common compilers that support the
    C and C++ threading functionality.

    It seems there is a market for going beyond compare and swap.

    TSX is close to dead.

    ARM's TME was announced almost 5 years ago. AFAIK, there were no implementations. Recently ARM said that FEAT_TME has been obsoleted. It sounds
    like the whole thing is dead, but there is a small chance that I am misinterpreting.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sun Dec 7 16:13:06 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:



    Yes, you can add special instructions. However, the compilers will be unlikely
    to generate them, thus applications that desired the generation of such an instruction would need to create a compiler extension (like gcc __builtin functions)
    or inline assembler which would then make the program that uses the capability both compiler
    specific _and_ hardware specific.

    So, in other words, if you can't put it in every ISA known to man,
    don't bother making something better than existent ?!?

    Long experience. Back in the early 80's we had fancy instructions
    for searching linked lists (up to 100 digit or byte keys, comparisons for equal, ne, lt, gt, lte, gte, and any-bit-equal). Took special language support to use, which meant that it wasn't usable from COBOL without
    extensions. We also had Lock, Unlock and condition variable instructions (with a small microkernel to handle the contention cases, trapping on acquisition failure, release [when another thread was pending], and
    event signal). Perhaps ahead of its time, as most of the common languages (COBOL and Fortran) had no syntactical support for them. We used them
    in the OS language (SPRITE), but they never got traction in applications (and then the
    entire computer line was discontinued in 1991).

    That's not to suggest that your innovations aren't potentially useful
    or an interesting take on multithreaded instruction primitives;
    just that idealism and the real world are often incompatible :-)

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sun Dec 7 16:28:41 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Yes, you can add special instructions. However, the compilers will be unlikely
    to generate them, thus applications that desired the generation of such an instruction would need to create a compiler extension (like gcc __builtin functions)
    or inline assembler which would then make the program that uses the capability both compiler
    specific _and_ hardware specific.

    This would likely be hidden in a header, and need only be
    written once (although gcc and clang, for example, are compatible
    in this respect). And people have been doing this, even for microarchitecture specific features, if the need for performance
    gain is large enough.

    A primary example is Intel TSX, which is (was?) required by SAP.

    POWER also had a transactional memory feature, but they messed it
    up for POWER 9 and dropped it for POWER 10 (IIRC); POWER is the
    only other architecture certified to run SAP, so it seems they
    can do without.

    Googling around, I also find the "Transactional Memory Extension"
    for ARM dated 2022, so ARM also appears to see some value in that,
    at least enough to write a spec for it.

    The ARM spec has been published. I'm not aware of any implementations
    of it to date, and the spec had been available to architecture partners
    for several years prior to 2022.

    Intel's TSX support seems to be restricted to a subset of Xeon processors,
    and it's not clear how well it's supported by non-Intel compilers.

    AMD has never released their Advanced Synchronization Facility in any
    processor to date.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Dec 7 16:55:26 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:
    On Sun, 7 Dec 2025 09:30:50 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Yes, you can add special instructions. However, the compilers
    will be unlikely to generate them, thus applications that desired
    the generation of such an instruction would need to create a
    compiler extension (like gcc __builtin functions) or inline
    assembler which would then make the program that uses the
    capability both compiler specific _and_ hardware specific.

    This would likely be hidden in a header, and need only be
    written once (although gcc and clang, for example, are compatible
    in this respecct). And people have been doing this, even for
    microarchitecture specific features, if the need for performance
    gain is large enough.

    A primary example is Intel TSX, which is (was?) required by SAP.


    By SAP HANA, I assume.
    Not sure for how long it was true. It sounds very unlikely that it is
    still true.

    https://www.redhat.com/en/blog/red-hat-enterprise-linux-performance-results-5th-gen-intel-xeon-scalable-processors
    from 2024 has benchmarks with TSX for SAP/HANA, and the processors
    (5th generation Xeon) at least pretend to have TSX.

    https://community.sap.com/t5/technology-blog-posts-by-sap/seamless-scaling-of-sap-hana-on-intel-xeon-processors-from-micro-to-mega/ba-p/13968648
    (almost a year old) writes

    "Intel's Transactional Synchronization Extensions (TSX), also
    implemented into the SAP HANA database, further enhances this
    scalability and offers a significant performance boost for critical
    HANA database operations."

    which does not read "required", but certainly sounds like it is an
    advantage.

    POWER also had a transactional memory feature, but they messed it
    up for POWER 9 and dropped it for POWER 10 (IIRC); POWER is the
    only other architecture certified to run SAP, so it seems they
    can do without.

    Googling around, I also find the "Transactional Memory Extension"
    for ARM datetd 2022, so ARM also appears to see some value in that,
    at least enough to write a spec for it.

    Most extant SMP processors provide a compare and swap operation,
    which are widely supported by the common compilers that support the
    C and C++ threading functionality.

    It seems there is a market for going beyond compare and swap.

    TSX is close to dead.

    For general-purpose computers, it seems the security implications
    killed it. An SAP server is a different matter; if you don't trust
    the software you are running there, you have other issues.


    ARM's TME was announced almost 5 years ago. AFAIK, there were no implementations. Recently ARM said that FEAT_TME is obsoleted. It sounds
    like the whole thing is dead, but there is small chance that I am misinterpreting.

    Maybe restartable sequences are the way to go for lock-free
    critical sections. Not sure if everybody is aware of these. A good introduction can be found at https://lwn.net/Articles/883104/ .
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Dec 7 12:19:34 2025
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    scott@slp53.sl.home (Scott Lurndal) posted:
    In my 40 years of SMP OS/HV work, I don't recall a
    situation where 'MoveElement' would be useful or
    required as a hardware atomic operation.
    The question is not would "MoveElement" be useful, but
    would it be useful to have a single ATOMIC event be
    able to manipulate {5,6,7,8} pointers in one event ??

    Nothing comes immediately to mind.

    Atomically moving an object from one doubly linked list to another,
    like when a thread wakes up and moves from the waiting list to the ready list.

    One iteration of balancing a binary tree (AVL, red-black)

    Plus the data structs above might straddle cache lines, so however many
    objects there are, there could be twice that many lines being updated at once.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Dec 7 17:48:50 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    Where does sequential consistency simplify programming over the x86 model
    of "TCO + globally ordered synchronization primitives +
    every synchronization primitive has implied barriers"?

    More so, where does it simplify over ARMv8.1-A, assuming that the programmer
    does not try to be too smart and never uses LL/SC and always uses
    8.1-style synchronization instructions with Acquire+Release flags set?

    IMHO, the only simple thing about sequential consistency is its simple description. Other than that, it simplifies very little. It does not magically make lockless multithreaded programming bearable to
    non-genius coders.

    Is single-core multi-threaded programming bearable to non-genius
    programmers? I think so. Sequential consistency plus atomic sequences
    (where the single-core program disables interrupts to start an atomic
    sequence and enables them to end an atomic sequence) gives the same
    programming model.
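
    A toy illustration of that single-core model (not from the post; di()/ei() are hypothetical stand-ins for whatever interrupt disable/enable the platform provides):

    struct node { struct node *next; };

    /* Hypothetical interrupt control: e.g. cli/sti on x86, cpsid/cpsie i on ARM. */
    #define di()   /* disable interrupts: nothing can preempt the sequence below */
    #define ei()   /* re-enable interrupts */

    void enqueue(struct node **head, struct node *el)
    {
        di();                 /* begin atomic sequence */
        el->next = *head;
        *head    = el;
        ei();                 /* end atomic sequence */
    }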

    Concerning synchronization instructions and memory barriers of
    architectures with weaker memory models, their main problem is that
    they are implemented slowly, because the idea is to make only the
    weaker memory model go fast, and then suffer what you must if you need
    more guarantees. Already the guarantee makes them slow, not just the
    actual synchronization case. This makes the memory model hard to use,
    because you want to minimize the use of these instructions. And
    that's where the need for genius-level coding comes in.

    As for the size of the description, IMO this reflects on the
    simplicity of programming. ARM's memory model was advertized here as:
    "It's only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. If it is
    simple to program, why does it need 32 pages of description?

    Concerning non-genius coders and coders that are not experts in memory
    ordering models, the current setup seems to be designed to have a few
    people who program system software that does such things, and
    everybody else should just use this software (whether it's system
    calls or libraries). That's ok if the need to communicate between
    threads is rare, but not so great if it is frequent (especially the
    system-call variant). And if the need to communicate between threads
    is rare, it's also good enough if the hardware features for that need
    are slow. So maybe this whole setup is good enough.

    OTOH, maybe there are applications that could potentially use multiple
    threads that are currently using sequential programs or context
    switching within a hardware thread (green threads and the like)
    because the communication between the threads is too slow and making
    it faster is too hard to program. In that case the underutilization
    of many of the multi-core CPUs that we have may be due to this
    phenomenon. If so, the argument that it's too expensive in hardware
    resources to implement sequential consistency in hardware well does
    not hold: Is it more expensive than implementing an 8-core CPU where 6 or 7 cores are usually not utilized?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Dec 7 14:51:01 2025
    From Newsgroup: comp.arch

    On 12/5/2025 3:03 PM, Chris M. Thomasson wrote:
    On 12/5/2025 11:10 AM, David Brown wrote:
    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>> guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing that >>>>> affects the hardware.  So volatile writes are ordered at the C level, >>>>> but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    It's strange with double-word compare-and-swap (DWCAS), where the words
    are contiguous. Well, I have seen compilers say it's not lock-free even
    on x86. For a 32-bit system we have cmpxchg8b, for a 64-bit system cmpxchg16b. But the compiler reports not lock-free. Strange.

    using cmpxchg instead of xadd: https://forum.pellesc.de/index.php?topic=7167.0

    trying to tell me that a DWCAS is not lock free: https://forum.pellesc.de/index.php?topic=7311.msg27764#msg27764

    This should be lock-free on an x86, even x64:

    struct ct_proxy_dwcas
    {
        struct ct_proxy_node* node;
        intptr_t count;
    };

    Ideally, struct ct_proxy_dwcas should be aligned on an L2 cache line and
    padded up to the size of a cache line.
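
    The same double-word CAS expressed with C11 atomics looks roughly like this (a sketch, not Chris's code; on x86-64, gcc/clang generally want -mcx16 or a libatomic that uses cmpxchg16b, and atomic_is_lock_free() may still report false, which is the behaviour being complained about above):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    struct ct_proxy_node;                   /* opaque here */

    struct dwcas_pair {
        struct ct_proxy_node *node;
        intptr_t              count;
    };

    static _Atomic struct dwcas_pair g_anchor;

    bool dwcas(struct dwcas_pair *expected, struct dwcas_pair desired)
    {
        /* Compares and swaps both contiguous words as one unit. */
        return atomic_compare_exchange_strong(&g_anchor, expected, desired);
    }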




    some of my older code:

    AC_SYS_APIEXPORT
    int AC_CDECL
    np_ac_i686_atomic_dwcas_fence
    ( void*,
      void*,
      const void* );


    np_ac_i686_atomic_dwcas_fence PROC
      ; cdecl; after the two pushes below: [esp+12] = destination,
      ; [esp+16] = expected value (updated on failure), [esp+20] = desired value
      push esi
      push ebx
      mov esi, [esp + 16]
      mov eax, [esi]                   ; EDX:EAX = expected
      mov edx, [esi + 4]
      mov esi, [esp + 20]
      mov ebx, [esi]                   ; ECX:EBX = desired
      mov ecx, [esi + 4]
      mov esi, [esp + 12]
      lock cmpxchg8b qword ptr [esi]   ; if [esi] == EDX:EAX then [esi] = ECX:EBX
      jne np_ac_i686_atomic_dwcas_fence_fail
      xor eax, eax                     ; success: return 0
      pop ebx
      pop esi
      ret

    np_ac_i686_atomic_dwcas_fence_fail:
      mov esi, [esp + 16]
      mov [esi + 0],  eax; failure: write back the value actually observed
      mov [esi + 4],  edx;
      mov eax, 1                       ; return 1
      pop ebx
      pop esi
      ret
    np_ac_i686_atomic_dwcas_fence ENDP


    Even with a single core system you can have pre-emptive multi-
    threading, or at least interrupt routines that may need to cooperate
    with other tasks on data.


    and I don't think that C with just volatile gives you such guarantees. >>>>
    - anton



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Dec 7 15:09:15 2025
    From Newsgroup: comp.arch

    On 12/6/2025 9:22 AM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/5/2025 12:54 PM, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>>>> it only works at the C abstract machine level - it does nothing that >>>>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>>>> but that says nothing about how they might progress through storage >>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>
    You describe in many words and not really to the point what can be >>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>>> atomic operations beyond load and store (even on single-core systems), >>>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

    BOOLEAN DCAS( type oldp, type_t oldq,
    type *p, type_t *q,
    type newp, type newq )
    {
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
    *p = newp;
    esmLOCKstore( *q, newq );
    return TRUE;
    }
    return FALSE;
    }

    Move Element from one place to another:

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    esmLOCKstore( fr->next, tn );
    return TRUE;
    }
    return FALSE;
    }

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    Any issues with live lock in here?

    A bit hard to tell because of 2 things::
    a) I carry around the thread priority and when interference occurs,
    the higher priority thread wins; on ties, the thread already accessing wins. b) live-lock is resolved or not by the caller of these routines, not
    these routines themselves.

    Hummm... Iirc, I was able to cause damage to a strong CAS. It was around
    20 years ago. A thread was running strong CAS in a tight loop. I counted success vs failure. Then allowed some other threads that altered the
    target word with random data. The failure rate for the CAS increased. Actually, I think cmpxchg, cmpxchg8b, cmpxchg16b, and the strange one on Itanium. Cannot remember it right now. cmp8xchg16? Or some shit.

    Well, they would hit a bus lock if they failed too many times. I think
    Scott knows about it.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Dec 7 15:17:04 2025
    From Newsgroup: comp.arch

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>> guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing that >>>>>> affects the hardware.  So volatile writes are ordered at the C level, >>>>>> but that says nothing about how they might progress through storage >>>>>> queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering >>>>> on hardware with weaker memory ordering than sequential consistency". >>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>> guarantees that are as good on multi-core machines as on single-core >>>>> machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems), >>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32   DWs   as a single ATOMIC instruction.
    MM      can MOV   up to 8192 bytes as a single ATOMIC instruction.


    The functions below rely on more than that - to make them work, as far as
    I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting
    until the esmLOCKstore instruction.  Or am I missing something here?

    Lock the BUS? Only when shit hits the fan. What about locking the cache
    line? Actually, I think we can "force" an x86/x64 to lock the bus if we
    do a LOCK'ed RMW on memory that straddles cache lines?

    [...]

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Dec 7 16:08:03 2025
    From Newsgroup: comp.arch

    On 12/6/2025 1:36 PM, Thomas Koenig wrote:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Yes, you can add special instructions. However, the compilers will be unlikely
    to generate them, thus applications that desired the generation of such an instruction would need to create a compiler extension (like gcc __builtin functions)
    or inline assembler which would then make the program that uses the capability both compiler
    specific _and_ hardware specific.

    Most extant SMP processors provide a compare and swap operation, which
    are widely supported by the common compilers that support the C and C++
    threading functionality.

    Interestingly, Linux restartable sequences allow for acquisition of
    a lock with no membarrier or atomic instruction on the fast path,
    at the cost of a syscall on the slow path (no free lunch...)

    But you also need assembler to do it.

    An example is, for example, at https://gitlab.ethz.ch/extra_projects/cpu-local-lock


    I need to read more about them, but they kind of remind me of an
    asymmetric mutex, or rwmutex. Ones that use a remote membar on the slow
    path. Iirc, FlushProcessWriteBuffers on windows and iirc,
    synchronize_rcu or membarrier on linux.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Dec 7 16:36:59 2025
    From Newsgroup: comp.arch

    On 12/6/2025 10:07 AM, MitchAlsup wrote:

    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>>>> it only works at the C abstract machine level - it does nothing that >>>>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>>>> but that says nothing about how they might progress through storage >>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>
    You describe in many words and not really to the point what can be >>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>>> atomic operations beyond load and store (even on single-core systems), >>>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

    BOOLEAN DCAS( type oldp, type_t oldq,
    type *p, type_t *q,
    type newp, type newq )
    {
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
    *p = newp;
    esmLOCKstore( *q, newq );
    return TRUE;
    }
    return FALSE;
    }

    Move Element from one place to another:

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    esmLOCKstore( fr->next, tn );
    return TRUE;
    }
    return FALSE;
    }

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    In my 40 years of SMP OS/HV work, I don't recall a
    situation where 'MoveElement' would be useful or
    required as a hardware atomic operation.

    The question is not would "MoveElement" be useful, but
    would it be useful to have a single ATOMIC event be
    able to manipulate {5,6,7,8} pointers in one event ??

    Individual atomic "Remove Element" and "Insert/Append Element"[*], yes.
    Combined? Too inflexible.

    BOOLEAN InsertElement( Element *el, Element *to )
    {
    tn = esmLOCKload( to->next );
    esmLOCKprefetch( el );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    el->next = tn;
    el->prev = to;
    to->next = el;
    esmLOCKstore( tn->prev, el );
    return TRUE;
    }
    return FALSE;
    }

    BOOLEAN RemoveElement( Element *fr )
    {
    fn = esmLOCKload( fr->next );
    fp = esmLOCKload( fr->prev );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    fr->prev = NULL;
    esmLOCKstore( fr->next, NULL );
    return TRUE;
    }
    return FALSE;
    }


    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

    Atomic add/sub are useful. The other atomic math operations (min, max, etc) may be useful in certain cases as well.

    Have you ever read about KCSS?

    https://groups.google.com/g/comp.arch/c/shshLdF1uqs

    https://patents.google.com/patent/US7293143
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Mon Dec 8 10:07:25 2025
    From Newsgroup: comp.arch

    On 06/12/2025 18:44, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>>>> it only works at the C abstract machine level - it does nothing that >>>>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>>>> but that says nothing about how they might progress through storage >>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>
    You describe in many words and not really to the point what can be >>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>>> atomic operations beyond load and store (even on single-core systems), >>>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.


    The functions below rely on more than that - to make them work, as far as
    I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting
    until the esmLOCKstore instruction. Or am I missing something here?

    In the above, I was stating that the maximum width of LD/ST can be a lot bigger than the size of a single register, not that the above instructions make writing ATOMIC events easier.


    That's what I assumed.

    Certainly there are situations where it can be helpful to have longer
    atomic reads and writes. I am not so sure about allowing 8 KB atomic accesses, especially in a system with multiple cores - that sounds like letting user programs DoS everything else on the system.

    These is no bus!

    I think there's a typo or some missing words there?


    The esmLOCKload causes the <translated> address to be 'monitored'
    for interference, and to announce participation in the ATOMIC event.

    The FIRST esmLOCKload tells the core that an ATOMIC event is beginning,
    AND sets up a default control point (This instruction itself) so that
    if interference is detected at esmLOCKstore control is transferred to
    that control point.

    So, there is no way to write Test-and-Set !! you get Test-and-Test-and-Set for free.
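
    For readers who have not met the terms, a rough C11 contrast (illustrative only, not My 66000 code): a plain test-and-set lock hammers the line with atomic exchanges, while test-and-test-and-set spins on an ordinary load and only attempts the atomic operation once the lock looks free.

    #include <stdatomic.h>

    static atomic_int lk;       /* 0 = free, 1 = held */

    void tas_lock(void)         /* plain test-and-set */
    {
        while (atomic_exchange_explicit(&lk, 1, memory_order_acquire))
            ;
    }

    void ttas_lock(void)        /* test-and-test-and-set */
    {
        for (;;) {
            while (atomic_load_explicit(&lk, memory_order_relaxed))
                ;               /* spin read-only until the lock looks free */
            if (!atomic_exchange_explicit(&lk, 1, memory_order_acquire))
                return;
        }
    }

    void lk_unlock(void)
    {
        atomic_store_explicit(&lk, 0, memory_order_release);
    }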

    If I understand you correctly here, you basically have a "load-reserve / store-conditional" sequence as commonly found in RISC architectures, but
    you have the associated loop built into the hardware? I can see that potentially improving efficiency, but I also find it very difficult to
    read or write C code that has hidden loops. And I worry about how it
    would all work if another thread on the same core or a different core
    was running similar code in the middle of these sequences. It also
    reduces the flexibility - in some use-cases, you want to have software
    limits on the number of attempts of a lr/sc loop to detect serious synchronisation problems.
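
    The flexibility point can be made concrete with a sketch (C11, illustrative; the bound is arbitrary): with an explicit loop the software chooses how long to keep retrying before reporting a problem, which is exactly what a hardware-hidden loop takes away.

    #include <stdatomic.h>
    #include <stdbool.h>

    #define MAX_TRIES 64          /* arbitrary bound, for illustration */

    bool bounded_increment(_Atomic long *ctr)
    {
        long old = atomic_load_explicit(ctr, memory_order_relaxed);
        for (int i = 0; i < MAX_TRIES; i++) {
            if (atomic_compare_exchange_weak_explicit(ctr, &old, old + 1,
                    memory_order_acq_rel, memory_order_relaxed))
                return true;      /* succeeded within the budget */
        }
        return false;             /* caller can back off or flag a livelock */
    }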


    There is a branch-on-interference instruction that
    a) does what it says,
    b) sets up an alternate atomic control point.

    It is not easy to have atomic or lock mechanisms on multi-core systems
    that are convenient to use, efficient even in the worst cases, and don't
    require additional hardware.

    I am using the "Miss Buffer" as the point of monitoring for interference.
    a) it already has to monitor "other hits" from outside accesses to deal
    with the coherence mechanism.
    b) that esm additions to Miss Buffer are on the order of 2%

    c) there are other means to strengthen guarantees of forward progress.


    Compare Double, Swap Double::

    BOOLEAN DCAS( type oldp, type_t oldq,
    type *p, type_t *q,
    type newp, type newq )
    {
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
    *p = newp;
    esmLOCKstore( *q, newq );
    return TRUE;
    }
    return FALSE;
    }

    Move Element from one place to another:

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    esmLOCKstore( fr->next, tn );
    return TRUE;
    }
    return FALSE;
    }

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    Of course. It is interesting to speculate about possible features of an
    architecture like yours, but it is not likely to be available to anyone
    else in practice (unless perhaps it can be implemented as an extension
    for RISC-V).

    Even with a >>>> single core system you can have pre-emptive multi-threading, or at least >>>> interrupt routines that may need to cooperate with other tasks on data. >>>>

    and I don't think that C with just volatile gives you such guarantees. >>>>>>
    - anton



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Mon Dec 8 10:12:19 2025
    From Newsgroup: comp.arch

    On 08/12/2025 00:17, Chris M. Thomasson wrote:
    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide
    enough
    guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing that >>>>>>> affects the hardware.  So volatile writes are ordered at the C >>>>>>> level,
    but that says nothing about how they might progress through storage >>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>
    You describe in many words and not really to the point what can be >>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>>> atomic operations beyond load and store (even on single-core
    systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32   DWs   as a single ATOMIC instruction.
    MM      can MOV   up to 8192 bytes as a single ATOMIC instruction. >>>

    The functions below rely on more than that - to make them work, as far
    as I can see, you need the first "esmLOCKload" to lock the bus and
    also lock the core from any kind of interrupt or other pre-emption,
    lasting until the esmLOCKstore instruction.  Or am I missing something
    here?

    Lock the BUS? Only when shit hits the fan. What about locking the cache line? Actually, I think we can "force" an x86/x64 to lock the bus if we
    do a LOCK'ed RMW on memory that straddles cache lines?


    Yes, I meant "lock the bus" - but I might have been overcautious.
    However, it seems there is a hidden hardware loop here - the
    esmLOCKstore instruction can fail and the processor jumps back to
    the first esmLOCKload instruction. With that, you don't need to block
    other code from running or accessing the bus.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Mon Dec 8 07:25:42 2025
    From Newsgroup: comp.arch

    <snip>
    BOOLEAN RemoveElement( Element *fr )
    {
    fn = esmLOCKload( fr->next );
    fp = esmLOCKload( fr->prev );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    fr->prev = NULL;
    esmLOCKstore( fr->next, NULL );
    return TRUE;
    }
    return FALSE;
    }


    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

    Yes, you can add special instructions. However, the compilers will be unlikely
    to generate them, thus applications that desired the generation of such an instruction would need to create a compiler extension (like gcc __builtin functions)
    or inline assembler which would then make the program that uses the capability both compiler
    specific _and_ hardware specific.

    So, in other words, if you can't put it in every ISA known to man,
    don't bother making something better than existent ?!?

    Most extant SMP processors provide a compare and swap operation, which
    are widely supported by the common compilers that support the C and C++
    threading functionality.

    I am having trouble understanding how the block of code in the esmINTERFERENCE() block is protected so that the whole thing executes as
    a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including across buffers and bus bridges. It would have to go to the memory coherence
    point. Otherwise, some other device using a bridge could update the same address range in the middle of an update.

    I am assuming the esmLockStore() just unlocks what was previously locked
    and the stores have already happened by that time.

    It would seem that esmINTERFERENCE() would indicate that everybody with
    access out to the coherence point has agreed to the locked area? Does
    that require that all devices respect the esmINTERFERENCE()?


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Mon Dec 8 04:32:39 2025
    From Newsgroup: comp.arch

    On 12/8/2025 1:12 AM, David Brown wrote:
    On 08/12/2025 00:17, Chris M. Thomasson wrote:
    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide >>>>>>>> enough
    guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing >>>>>>>> that
    affects the hardware.  So volatile writes are ordered at the C >>>>>>>> level,
    but that says nothing about how they might progress through storage >>>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>>
    You describe in many words and not really to the point what can be >>>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>>> on hardware with weaker memory ordering than sequential
    consistency".
    If hardware guaranteed sequential consistency, volatile would
    provide
    guarantees that are as good on multi-core machines as on single-core >>>>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>>>> atomic operations beyond load and store (even on single-core
    systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes >>>>> bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32   DWs   as a single ATOMIC instruction. >>>> MM      can MOV   up to 8192 bytes as a single ATOMIC instruction. >>>>

    The functions below rely on more than that - to make them work, as far
    as I can see, you need the first "esmLOCKload" to lock the bus and
    also lock the core from any kind of interrupt or other pre-emption,
    lasting until the esmLOCKstore instruction.  Or am I missing
    something here?

    Lock the BUS? Only when shit hits the fan. What about locking the
    cache line? Actually, I think we can "force" an x86/x64 to lock the
    bus if we do a LOCK'ed RMW on memory that straddles cache lines?


    Yes, I meant "lock the bus" - but I might have been overcautious.
    However, it seems there is a hidden hardware loop here - the
    esmLOCKstore instruction can fail and the processor jumps back to
    the first esmLOCKload instruction.  With that, you don't need to block other code from running or accessing the bus.



    Humm.. For some damn reason it reminds me of a multi lock thing I did a
    while back. Called it the multex. Consisted of a table of locks. A
    thread would take the addresses it wanted to lock, hash then into the
    table, remove duplicates and sorted them and took them all without any
    fear of deadlock.

    (read all when you get some free time to burn...) https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ

    It kind of seems like it might want to work with Mitch's scheme in a
    loose sense?
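
    A rough reconstruction of that idea (details guessed, names invented): hash each address to a slot in a fixed mutex table, deduplicate, sort the slot indices, and acquire them in ascending order, so no two threads ever wait on each other in opposite orders.

    #include <pthread.h>
    #include <stdlib.h>
    #include <stdint.h>

    #define MULTEX_SLOTS 64

    /* GCC/Clang range initializer; otherwise initialize in a startup routine. */
    static pthread_mutex_t multex[MULTEX_SLOTS] = {
        [0 ... MULTEX_SLOTS - 1] = PTHREAD_MUTEX_INITIALIZER
    };

    static int cmp_size(const void *a, const void *b)
    {
        size_t x = *(const size_t *)a, y = *(const size_t *)b;
        return (x > y) - (x < y);
    }

    /* Lock the slots covering n addresses; slots[] must have room for n entries.
       Returns the number of distinct slots taken (pass it to multex_unlock). */
    size_t multex_lock(void *const addrs[], size_t n, size_t slots[])
    {
        size_t m = 0;
        for (size_t i = 0; i < n; i++)
            slots[i] = ((uintptr_t)addrs[i] >> 6) % MULTEX_SLOTS;  /* crude hash */
        qsort(slots, n, sizeof slots[0], cmp_size);
        for (size_t i = 0; i < n; i++)                 /* dedupe after sorting */
            if (m == 0 || slots[i] != slots[m - 1])
                slots[m++] = slots[i];
        for (size_t i = 0; i < m; i++)                 /* fixed global order */
            pthread_mutex_lock(&multex[slots[i]]);
        return m;
    }

    void multex_unlock(const size_t slots[], size_t m)
    {
        while (m--)
            pthread_mutex_unlock(&multex[slots[m]]);
    }
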
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Dec 8 08:23:59 2025
    From Newsgroup: comp.arch

    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>
    BOOLEAN RemoveElement( Element *fr )
    {
         fn = esmLOCKload( fr->next );
         fp = esmLOCKload( fr->prev );
         esmLOCKprefetch( fn );
         esmLOCKprefetch( fp );
         if( !esmINTERFERENCE() )
         {
                       fp->next = fn;
                       fn->prev = fp;
                       fr->prev = NULL;
         esmLOCKstore( fr->next,  NULL );
                       return TRUE;
         }
         return FALSE;
    }


    [*] For which atomic compare-and-swap or atomic swap is generally
    sufficient.

    Yes, you can add special instructions.   However, the compilers will
    be unlikely
    to generate them, thus applications that desired the generation of
    such an
    instruction would need to create a compiler extension (like gcc
    __builtin functions)
    or inline assembler which would then make the program that uses the
    capability both compiler
    specific _and_ hardware specific.

    So, in other words, if you can't put it in every ISA known to man,
    don't bother making something better than existent ?!?

    Most extant SMP processors provide a compare and swap operation, which
    are widely supported by the common compilers that support the C and C++
    threading functionality.

    I am having trouble understanding how the block of code in the esmINTERFERENCE() block is protected so that the whole thing executes as
    a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including across buffers and bus bridges. It would have to go to the memory coherence
    point. Otherwise, some other device using a bridge could update the same address range in the middle of an update.

    I may be wrong about this, but I think you have a misconception. The
    ESM doesn't *prevent* interference, but it *detects* interference. Thus nothing is required of other cores, no locks, etc. If they write to a "protected" location, the write is allowed, but the core in the ESM is notified, so it can redo the ESM-protected code.


    I am assuming the esmLockStore() just unlocks what was previously locked
    and the stores have already happened by that time.

    There is no "locking" in the sense of preventing any accesses.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Dec 8 17:14:11 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

    I am having trouble understanding how the block of code in the
    esmINTERFERENCE() block is protected so that the whole thing executes as
    a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including across
    buffers and bus bridges. It would have to go to the memory coherence
    point. Otherwise, some other device using a bridge could update the same
    address range in the middle of an update.

    >I may be wrong about this, but I think you have a misconception.  The
    >ESM doesn't *prevent* interference, but it *detects* interference.  Thus
    >nothing is required of other cores, no locks, etc.  If they write to a
    >"protected" location, the write is allowed, but the core in the ESM is
    >notified, so it can redo the ESM protected code.

    Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
    (the basis of the Store-Exclusive/Load-Exclusive instructions, which
    mirror the LL/SC paradigm). The ARMv8 monitors an implementation defined
    range surrounding the target address and the store will fail if any other
    agent has modified any byte within the exclusive range.
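    (Loosely, that monitor can be pictured as one reservation per core,
    tagged with a granule-aligned address; the sketch below is a toy model
    for illustration only, not ARM's implementation, and the granule size
    and field names are made up:)

        #include <stdint.h>
        #include <stdbool.h>

        #define GRANULE 64u   /* implementation defined; often cache-line sized */

        struct exclusive_monitor {
            uintptr_t tag;    /* granule-aligned address of the Load-Exclusive  */
            bool      open;   /* cleared by any conflicting write, or by STXR   */
        };

        static void monitor_open(struct exclusive_monitor *m, uintptr_t addr)
        {
            m->tag  = addr & ~(uintptr_t)(GRANULE - 1);
            m->open = true;
        }

        /* Store-Exclusive writes only when this returns true, i.e. when no
           other agent touched the granule since the Load-Exclusive.        */
        static bool monitor_check_and_close(struct exclusive_monitor *m, uintptr_t addr)
        {
            bool ok = m->open && m->tag == (addr & ~(uintptr_t)(GRANULE - 1));
            m->open = false;
            return ok;
        }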

    esmINTERFERENCE seems to require multiple of these exclusive blocks
    to cover non-contiguous address ranges, which on first blush leads
    me to worry both about deadlock situations and starvation issues.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 8 20:06:34 2025
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>> guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing that >>>>>> affects the hardware.  So volatile writes are ordered at the C level, >>>>>> but that says nothing about how they might progress through storage >>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>
    You describe in many words and not really to the point what can be >>>>> explained concisely as: "volatile says nothing about memory ordering >>>>> on hardware with weaker memory ordering than sequential consistency". >>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>> guarantees that are as good on multi-core machines as on single-core >>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>> atomic operations beyond load and store (even on single-core systems), >>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32   DWs   as a single ATOMIC instruction.
    MM      can MOV   up to 8192 bytes as a single ATOMIC instruction. >>

    The functions below rely on more than that - to make the work, as far as
    I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting until the esmLOCKstore instruction.  Or am I missing something here?

    Lock the BUS? Only when shit hits the fan. What about locking the cache line? Actually, I think we can "force" an x86/x64 to lock the bus if we
    do a LOCK'ed RMW on memory that straddles cache lines?

    In the My 66000 case, Mem References can lock up to 8 cache lines.

    [...]

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Dec 8 20:15:13 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >> >>>>>> guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing that >> >>>>>> affects the hardware.  So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage >> >>>>>> queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering >> >>>>> on hardware with weaker memory ordering than sequential consistency". >> >>>>> If hardware guaranteed sequential consistency, volatile would provide >> >>>>> guarantees that are as good on multi-core machines as on single-core >> >>>>> machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems), >> >>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32   DWs   as a single ATOMIC instruction.
    MM      can MOV   up to 8192 bytes as a single ATOMIC instruction. >> >>

    The functions below rely on more than that - to make the work, as far as >> > I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting
    until the esmLOCKstore instruction.  Or am I missing something here?

    Lock the BUS? Only when shit hits the fan. What about locking the cache
    line? Actually, I think we can "force" an x86/x64 to lock the bus if we
    do a LOCK'ed RMW on memory that straddles cache lines?

    In the My 66000 case, Mem References can lock up to 8 cache lines.

    What if two processors have intersecting (but not fully overlapping)
    sets of those 8 cache lines?

    Can you guarantee forward progress?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 8 20:20:27 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 06/12/2025 18:44, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>>> guarantees for multi-threaded coding on multi-core systems. Basically,
    it only works at the C abstract machine level - it does nothing that >>>>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>>>> but that says nothing about how they might progress through storage >>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>
    You describe in many words and not really to the point what can be >>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>>> atomic operations beyond load and store (even on single-core systems), >>>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes >>>> bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.


    The functions below rely on more than that - to make the work, as far as >> I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting
    until the esmLOCKstore instruction. Or am I missing something here?

    In the above, I was stating that the maximum width of LD/ST can be a lot bigger than the size of a single register, not that the above instructions make writing ATOMIC events easier.


    That's what I assumed.

    Certainly there are situations where it can be helpful to have longer
    atomic reads and writes. I am not so sure about allowing 8 KB atomic accesses, especially in a system with multiple cores - that sounds like letting user programs DoS everything else on the system.

    These is no bus!

    I think there's a typo or some missing words there?

    There is a fabric based interconnect to transport data-transfer requests
    around the system, where everyone connected to the transport can send
    a new request, receive a response, and receive a SNOOP simultaneously.

    There is NO single point on the fabric one can GRAB and prevent other
    sections of the fabric from "doing their prescribed transport duties".

    There is a memory ordering protocol in L3/DRAM-controller that prevents
    more than one "SNOOP per cache line" from being "in progress" at the
    same time.


    The esmLOCKload causes the <translated> address to be 'monitored'
    for interference, and to announce participation in the ATOMIC event.

    The FIRST esmLOCKload tells the core that an ATOMIC event is beginning,
    AND sets up a default control point (This instruction itself) so that
    if interference is detected at esmLOCKstore control is transferred to
    that control point.

    So, there is no way to write Test-and-Set !! you get Test-and-Test-and-Set for free.

    If I understand you correctly here, you basically have a "load-reserve / store-conditional" sequence as commonly found in RISC architectures, but
    you have the associated loop built into the hardware?

    In effect, yes. I have a multi-{LoadLocked StoreConditional} scheme
    as found in other RISC architectures with several small/big changes::
    a) you get up to 8 LLs
    b) the last SC causes the rest of the system to see all the memory
    changes at the same time (or nobody sees any changes).
    c) The ATOMIC sequence cannot persist across an exception or interrupt.
    d) only participating memory lines have the ATOMIC property.

    And yes, control transfer is built-into the architecture.
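    (So, in the style of the examples in this thread, a try-lock needs no
    visible loop at all; a sketch only, assuming the BOOLEAN/TRUE/FALSE
    and esm intrinsics used elsewhere in the thread, with lock_t and the
    convention "0 = free, 1 = held" invented here:)

        typedef unsigned long lock_t;

        /* Returns TRUE when the lock was free and we claimed it.  If another
           core writes the line between the load and the store, the hardware
           transfers control back to the esmLOCKload and the code re-runs.  */
        BOOLEAN lock_try_acquire( lock_t *lock )
        {
            lock_t t = esmLOCKload( *lock );    /* begins the event, sets retry point */
            if( t != 0 )
                return FALSE;                   /* held: the free "test" before the set */
            esmLOCKstore( *lock, 1 );           /* terminal store publishes the claim   */
            return TRUE;
        }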

    I can see that potentially improving efficiency, but I also find it very difficult to
    read or write C code that has hidden loops. And I worry about how it
    would all work if another thread on the same core or a different core
    was running similar code in the middle of these sequences. It also
    reduces the flexibility - in some use-cases, you want to have software limits on the number of attempts of a lr/sc loop to detect serious synchronisation problems.

    In this case, said SW would use the Branch-on-interference instruction.


    There is a branch-on-interference instruction that
    a) does what it says,
    b) sets up an alternate atomic control point.

    It is not easy to have atomic or lock mechanisms on multi-core systems
    that are convenient to use, efficient even in the worst cases, and don't >> require additional hardware.

    I am using the "Miss Buffer" as the point of monitoring for interference.
    a) it already has to monitor "other hits" from outside accesses to deal
       with the coherence mechanism.
    b) the esm additions to the Miss Buffer are on the order of 2%

    c) there are other means to strengthen guarantees of forward progress.


    Compare Double, Swap Double::

    BOOLEAN DCAS( type_t oldp, type_t oldq,
                  type_t *p,   type_t *q,
                  type_t newp, type_t newq )
    {
        type_t t = esmLOCKload( *p );
        type_t r = esmLOCKload( *q );
        if( t == oldp && r == oldq )
        {
            *p = newp;
            esmLOCKstore( *q, newq );
            return TRUE;
        }
        return FALSE;
    }

    Move Element from one place to another:

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
        Element *fn = esmLOCKload( fr->next );
        Element *fp = esmLOCKload( fr->prev );
        Element *tn = esmLOCKload( to->next );
        esmLOCKprefetch( fn );
        esmLOCKprefetch( fp );
        esmLOCKprefetch( tn );
        if( !esmINTERFERENCE() )
        {
            fp->next = fn;
            fn->prev = fp;
            to->next = fr;
            tn->prev = fr;
            fr->prev = to;
            esmLOCKstore( fr->next, tn );
            return TRUE;
        }
        return FALSE;
    }
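    (Usage sketch: since the return value already reports interference, a
    caller that wants a software-visible retry bound can simply wrap the
    call; the limit and the fallback are the application's choice and the
    names here are invented:)

        #define MOVE_RETRY_LIMIT 100

        BOOLEAN MoveElementBounded( Element *fr, Element *to )
        {
            for( int tries = 0; tries < MOVE_RETRY_LIMIT; tries++ )
            {
                if( MoveElement( fr, to ) )
                    return TRUE;
            }
            return FALSE;    /* caller escalates: take a lock, log a metric, ... */
        }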

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    Of course. It is interesting to speculate about possible features of an >> architecture like yours, but it is not likely to be available to anyone
    else in practice (unless perhaps it can be implemented as an extension
    for RISC-V).

    Even with a >>>> single core system you can have pre-emptive multi-threading, or at least >>>> interrupt routines that may need to cooperate with other tasks on data. >>>>

    and I don't think that C with just volatile gives you such guarantees. >>>>>>
    - anton



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 8 20:30:34 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    <snip>
    BOOLEAN RemoveElement( Element *fr )
    {
        Element *fn = esmLOCKload( fr->next );
        Element *fp = esmLOCKload( fr->prev );
        esmLOCKprefetch( fn );
        esmLOCKprefetch( fp );
        if( !esmINTERFERENCE() )
        {
            fp->next = fn;
            fn->prev = fp;
            fr->prev = NULL;
            esmLOCKstore( fr->next, NULL );
            return TRUE;
        }
        return FALSE;
    }


    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

    Yes, you can add special instructions. However, the compilers will be unlikely
    to generate them, thus applications that desired the generation of such an >> instruction would need to create a compiler extension (like gcc __builtin functions)
    or inline assembler which would then make the program that uses the capability both compiler
    specific _and_ hardware specific.

    So, in other words, if you can't put it in every ISA known to man,
    don't bother making something better than existent ?!?

    Most extant SMP processors provide a compare and swap operation, which
    are widely supported by the common compilers that support the C and C++
    threading functionality.

    I am having trouble understanding how the block of code in the esmINTERFERENCE() block is protected so that the whole thing executes as
    a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including across buffers and bus bridges. It would have to go to the memory coherence
    point. Otherwise, some other device using a bridge could update the same address range in the middle of an update.

    esmLOCKload sets up monitors (in Miss Buffers) that detect SNOOPs to
    the participating cache lines.

    esmINTERFERENCE sets up a block of code that either executes in its
    entirety or fails in its entirety--and transfers control.

    In "certain circumstances" the code inside the esmINTERFERENCE block
    are allowed to NaK SNOOPs to those lines. So, if interference happens
    this late, you can effectively tell requestor "Yes, I have that cache
    line, No you cannot have it right now".

    If requestor gets a NaK, and requestor was attempting an ATOMIC event,
    the event fails. If requestor was NOT attempting, requestor resubmits
    the request. In both cases, the thread causing the interference is the
    one delayed while the one performing the event has higher probability
    of success.

    I am assuming the esmLockStore() just unlocks what was previously locked
    and the stores have already happened by that time.

    Yes, it is the terminal sentinel.

    It would seem that esmINTERFERENCE() would indicate that everybody with access out to the coherence point has agreed to the locked area? Does
    that require that all devices respect the esmINTERFERENCE()?

    I can see you are getting at something subtle, here. I cannot quite grasp
    what it might be.

    Can you ask the above again but use different words ?!?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 8 20:35:01 2025
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

    I am having trouble understanding how the block of code in the
    esmINTERFERENCE() block is protected so that the whole thing executes as >> a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including across >> buffers and bus bridges. It would have to go to the memory coherence
    point. Otherwise, some other device using a bridge could update the same >> address range in the middle of an update.

    I may be wrong about this, but I think you have a misconception. The
    ESM doesn't *prevent* interference, but it *detect* interference. Thus >nothing is required of other cores, no locks, etc. If they write to a >"protected" location, the write is allowed, but the core in the ESM is >notified, so it can redo the ESM protected code.

    Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
    (the basis of the Store-Exclusive/Load-Exclusive instructions, which
    mirror the LL/SC paradigm). The ARMv8 monitors an implementation defined range surrounding the target address and the store will fail if any other agent has modified any byte within the exclusive range.

    esmINTERFERENCE seems to require multiple of these exclusive blocks
    to cover non-contiguous address ranges, which on first blush leads
    me to worry both about deadlock situations and starvation issues.

    Over in the Miss Buffer there are (at least) 8 miss buffers. Each miss
    buffer has to monitor inbound messages for requests (SNOOPs) to its
    entry.

    So, each MB entry has a bit to tell if it is participating in an event.
    esmINTERFERENCE is a way to sample all participating MB entries
    simultaneously; and in addition, esmINTERFERENCE is part of what enables
    the NaKing of SNOOP requests.
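    (A toy C rendering of that bookkeeping, purely to mirror the wording
    above; the entry count of 8 is from the description, everything else
    is invented for illustration:)

        #include <stdbool.h>
        #include <stdint.h>

        #define MB_ENTRIES 8

        struct mb_entry {
            uintptr_t line;          /* cache line this entry is tracking        */
            bool      participating; /* set by esmLOCKload / esmLOCKprefetch     */
            bool      snooped;       /* set when an inbound SNOOP hits this line */
        };

        /* esmINTERFERENCE conceptually samples all participating entries at once. */
        static bool interference_seen(const struct mb_entry mb[MB_ENTRIES])
        {
            for (int i = 0; i < MB_ENTRIES; i++)
                if (mb[i].participating && mb[i].snooped)
                    return true;
            return false;
        }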
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 8 21:58:00 2025
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    ERROR "unexpected byte sequence starting at index 736: '\xC2'" while decoding:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough
    guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing that
    affects the hardware.  So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage >> >>>>>> queues, caches, inter-processor communication buses, or whatever. >> >>>>>
    You describe in many words and not really to the point what can be >> >>>>> explained concisely as: "volatile says nothing about memory ordering >> >>>>> on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core >> >>>>> machines.

    However, for concurrent manipulations of data structures, one wants >> >>>>> atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes >> >>> bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32   DWs   as a single ATOMIC instruction.
    MM      can MOV   up to 8192 bytes as a single ATOMIC instruction.


    The functions below rely on more than that - to make the work, as far as
    I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting >> > until the esmLOCKstore instruction.  Or am I missing something here? >>
    Lock the BUS? Only when shit hits the fan. What about locking the cache >> line? Actually, I think we can "force" an x86/x64 to lock the bus if we >> do a LOCK'ed RMW on memory that straddles cache lines?

    In the My 66000 case, Mem References can lock up to 8 cache lines.

    What if two processors have intersecting (but not fully overlapping)
    sets of those 8 cache lines?

    Can you guarantee forward progress?

    Yes.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Mon Dec 8 16:31:08 2025
    From Newsgroup: comp.arch

    On 12/8/2025 9:14 AM, Scott Lurndal wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

    I am having trouble understanding how the block of code in the
    esmINTERFERENCE() block is protected so that the whole thing executes as >>> a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including across >>> buffers and bus bridges. It would have to go to the memory coherence
    point. Otherwise, some other device using a bridge could update the same >>> address range in the middle of an update.

    I may be wrong about this, but I think you have a misconception. The
    ESM doesn't *prevent* interference, but it *detect* interference. Thus
    nothing is required of other cores, no locks, etc. If they write to a
    "protected" location, the write is allowed, but the core in the ESM is
    notified, so it can redo the ESM protected code.

    Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
    (the basis of the Store-Exclusive/Load-Exclusive instructions, which
    mirror the LL/SC paradigm). The ARMv8 monitors an implementation defined range surrounding the target address and the store will fail if any other agent has modified any byte within the exclusive range.

    Any mutation to the reservation granule?




    esmINTERFERENCE seems to require multiple of these exclusive blocks
    to cover non-contiguous address ranges, which on first blush leads
    me to worry both about deadlock situations and starvation issues.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Tue Dec 9 09:13:54 2025
    From Newsgroup: comp.arch

    On 08/12/2025 17:23, Stephen Fuld wrote:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

    I am having trouble understanding how the block of code in the
    esmINTERFERENCE() block is protected so that the whole thing executes
    as a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including
    across buffers and bus bridges. It would have to go to the memory
    coherence point. Otherwise, some other device using a bridge could
    update the same address range in the middle of an update.

    I may be wrong about this, but I think you have a misconception.  The
    ESM doesn't *prevent* interference, but it *detect* interference.  Thus nothing is required of other cores, no locks, etc.  If they write to a "protected" location, the write is allowed, but the core in the ESM is notified, so it can redo the ESM protected code.


    Yes, that is correct (as far as I understand it now). The critical part
    is the hidden hardware loop that was not mentioned or indicated in the original code.

    There are basically two ways to handle atomic operations. One way is to
    use locking mechanisms to ensure that nothing (other cores, interrupts
    or other pre-emption on the same core) can break up the sequence. The
    other way is to have a mechanism to detect conflicts and a failure of
    the atomic operation, so that you can try again (or otherwise handle the situation). (You can, of course, combine these - such as by disabling
    local interrupts and detecting conflicts from other cores.)
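    (The two approaches, for something as small as a shared counter, come
    out roughly as below; a minimal C11/pthreads sketch, with the mutex
    standing in for "prevent" and the CAS loop for "detect and retry":)

        #include <pthread.h>
        #include <stdatomic.h>

        static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
        static long            counter_locked;
        static _Atomic long    counter_lockfree;

        /* 1) Locking: nothing else may touch the data during the update. */
        void add_locked(long delta)
        {
            pthread_mutex_lock(&m);
            counter_locked += delta;
            pthread_mutex_unlock(&m);
        }

        /* 2) Detect and retry: the update is attempted, and redone if some
              other thread changed the value in the meantime.              */
        void add_lockfree(long delta)
        {
            long old = atomic_load_explicit(&counter_lockfree, memory_order_relaxed);
            while (!atomic_compare_exchange_weak_explicit(
                       &counter_lockfree, &old, old + delta,
                       memory_order_relaxed, memory_order_relaxed))
                ;   /* 'old' was refreshed by the failed exchange; just retry */
        }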

    The code Mitch posted apparently had neither of these mechanisms, hence
    my confusion. It turns out that it /does/ have conflict detection and a hardware retry loop, all hidden from anyone trying to understand the
    code. (I can appreciate that there may be benefits in doing this in
    hardware, but there are no benefits in hiding it from the programmer!)


    I am assuming the esmLockStore() just unlocks what was previously
    locked and the stores have already happened by that time.

    There is no "locking" in the sense of preventing any accesses.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Dec 9 19:15:48 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 08/12/2025 17:23, Stephen Fuld wrote:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

    I am having trouble understanding how the block of code in the
    esmINTERFERENCE() block is protected so that the whole thing executes
    as a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including
    across buffers and bus bridges. It would have to go to the memory
    coherence point. Otherwise, some other device using a bridge could
    update the same address range in the middle of an update.

    ---------------------------------
    I may be wrong about this, but I think you have a misconception.  The
    ESM doesn't *prevent* interference, but it *detect* interference.  Thus nothing is required of other cores, no locks, etc.  If they write to a "protected" location, the write is allowed, but the core in the ESM is notified, so it can redo the ESM protected code.


    Yes, that is correct (as far as I understand it now). The critical part
    is the hidden hardware loop that was not mentioned or indicated in the original code.
    ---------------------------------

    Mostly esm detects interference but there are times when esm is allowed
    to ignore interference.

    Consider a server-scale esm implementation. In such an implementation, esm
    is enhanced with a system* arbiter.

    After any successful ATOMIC event esm reverts to "Optimistic" mode. In optimistic mode, esm races through the code as fast as possible expecting
    no interference. When interference is detected, the event fails and a HW counter is incremented. The failure diverts control to the ATOMIC control point. We still have the property that all participating memory locations become visible at the same instant.

    At this point the core is in "careful" mode, core becomes sequentially consistent, SW chooses to re-run the event. Here, cache misses leave
    core in program order,... When interference is detected, the event fails
    and that HW counter is incremented. Failure diverts control to the ATOMIC control point, no participating memory is seen to have been modified.
    If core can determine that all writes to participating memory can be
    performed (at the first participating store) core is allowed to NaK
    lower priority interfering accesses.

    At this point the core is in "Slow and Methodological" mode. Now, after
    all participating cache lines have been touched, all the physical pointers
    are bundled into a message and sent to the system arbiter. System arbiter examines each cache line address and if no-other-core has a reservation
    on ANY of them, then system arbiter installs said reservations, and
    returns "success". At this point, core is allowed to NaK interfering
    accesses. This event WILL SUCCEED. After the event is complete, the
    termination of the event at the core, takes the same bundle of addresses
    and sends it back to system arbiter; who removes them from reservation.

    Optimistic mode takes no more cycles than if the memory references were
    not ATOMIC.

    I should also note:: none of this state is preserved across interrupts
    or exceptions. So, an interrupt or exception causes the event to fail
    prior to control transfer. Interrupts do not care about this control
    transfer. Exception control transfer in My 66000 packs everything the
    exception handler needs in registers, so having IP point at ATOMIC
    control point with the registers setup for page fault does not cause
    exception handler any issues whatsoever.

    There are basically two ways to handle atomic operations. One way is to
    use locking mechanisms to ensure that nothing (other cores, interrupts
    or other pre-emption on the same core) can break up the sequence. The
    other way is to have a mechanism to detect conflicts and a failure of
    the atomic operation, so that you can try again (or otherwise handle the situation). (You can, of course, combine these - such as by disabling
    local interrupts and detecting conflicts from other cores.)

    The code Mitch posted apparently had neither of these mechanisms, hence
    my confusion. It turns out that it /does/ have conflict detection and a hardware retry loop, all hidden from anyone trying to understand the
    code. (I can appreciate that there may be benefits in doing this in hardware, but there are no benefits in hiding it from the programmer!)

    How exactly do you inform the programmer that:

    InBound [Address]
    OutBound [Address]

    operates like::

    try_again:
    InBound [Address]
    BIN try_again
    OutBound [Address]

    And why clutter up asm with extraneous labels and require extra
    instructions.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Tue Dec 9 20:51:26 2025
    From Newsgroup: comp.arch

    On 09/12/2025 20:15, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


    There are basically two ways to handle atomic operations. One way is to
    use locking mechanisms to ensure that nothing (other cores, interrupts
    or other pre-emption on the same core) can break up the sequence. The
    other way is to have a mechanism to detect conflicts and a failure of
    the atomic operation, so that you can try again (or otherwise handle the
    situation). (You can, of course, combine these - such as by disabling
    local interrupts and detecting conflicts from other cores.)

    The code Mitch posted apparently had neither of these mechanisms, hence
    my confusion. It turns out that it /does/ have conflict detection and a
    hardware retry loop, all hidden from anyone trying to understand the
    code. (I can appreciate that there may be benefits in doing this in
    hardware, but there are no benefits in hiding it from the programmer!)

    How exactly do you inform the programmer that:

    InBound [Address]
    OutBound [Address]

    operates like::

    try_again:
    InBound [Address]
    BIN try_again
    OutBound [Address]

    And why clutter up asm with extraneous labels and require extra instructions.

    The most obvious answer is that in any code that uses these features,
    good comments are essential so that readers can see what is happening.

    Another method would be to use better names for the intrinsics, as seen
    at the C (or other HLL) level. (Assembly instruction names don't matter nearly as much.)

    So maybe instead of "esmLOCKload()" and "esmLOCKstore()" you have "load_and_set_retry_point()" and "store_or_retry()". Feel free to think
    of better names, but that would at least give the reader a clue that
    there's something odd going on.
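    (E.g., nothing more than thin aliases would already flag the retry
    semantics at the call site; these spellings are hypothetical, taken
    from the suggestion above rather than from the My 66000 documents:)

        #define load_and_set_retry_point( x )   esmLOCKload( x )
        #define store_or_retry( x, v )          esmLOCKstore( (x), (v) )
        #define touch_for_event( x )            esmLOCKprefetch( x )
        #define event_was_interfered()          esmINTERFERENCE()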


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Dec 9 21:28:47 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 09/12/2025 20:15, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


    There are basically two ways to handle atomic operations. One way is to >> use locking mechanisms to ensure that nothing (other cores, interrupts
    or other pre-emption on the same core) can break up the sequence. The
    other way is to have a mechanism to detect conflicts and a failure of
    the atomic operation, so that you can try again (or otherwise handle the >> situation). (You can, of course, combine these - such as by disabling
    local interrupts and detecting conflicts from other cores.)

    The code Mitch posted apparently had neither of these mechanisms, hence
    my confusion. It turns out that it /does/ have conflict detection and a >> hardware retry loop, all hidden from anyone trying to understand the
    code. (I can appreciate that there may be benefits in doing this in
    hardware, but there are no benefits in hiding it from the programmer!)

    How exactly do you inform the programmer that:

    InBound [Address]
    OutBound [Address]

    operates like::

    try_again:
    InBound [Address]
    BIN try_again
    OutBound [Address]

    And why clutter up asm with extraneous labels and require extra instructions.

    The most obvious answer is that in any code that uses these features,
    good comments are essential so that readers can see what is happening.

    Another method would be to use better names for the intrinsics, as seen
    at the C (or other HLL) level. (Assembly instruction names don't matter nearly as much.)

    So maybe instead of "esmLOCKload()" and "esmLOCKstore()" you have "load_and_set_retry_point()" and "store_or_retry()". Feel free to think
    of better names, but that would at least give the reader a clue that
    there's something odd going on.

    This is a useful suggestion; thanks.

    On the other hand, there are some non-vonNeumann actions lurking within
    esm. Where vonNeumann means: that every instruction is executed in its
    entirety before the next instruction appears to start executing.

    1st:: one cannot single step through an ATOMIC event; if you enter an
    ATOMIC event in single-step mode, you will see the 1st instruction in
    the event, then you will receive control after the terminal instruction
    has executed.

    2nd:: the only way to debug an event is to have a buffer of SW locations
    that gets written with non-participating STs. Unlike participating
    memory lines, these locations will be written--but not in a sequentially
    consistent manner (architecturally), and can be examined outside the
    event; whereas the participating lines are either all written
    instantaneously or not modified at all.

    So, here we have non-participating STs having been written and older participating STs have not.
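    (So a debugging harness might look roughly like this: the trace buffer
    is touched only with ordinary stores, so its contents survive a failed
    or retried event even though the participating lines roll back; the
    buffer and its layout are invented here, and the rest follows the
    RemoveElement example posted earlier:)

        #define TRACE_SLOTS 64

        static unsigned trace_idx;                   /* non-participating */
        static Element *trace_buf[TRACE_SLOTS];      /* non-participating */

        BOOLEAN RemoveElementTraced( Element *fr )
        {
            Element *fn = esmLOCKload( fr->next );
            Element *fp = esmLOCKload( fr->prev );
            esmLOCKprefetch( fn );
            esmLOCKprefetch( fp );

            /* Ordinary store: visible even if the event fails; note it may
               appear more than once if the hardware retries the event.    */
            trace_buf[trace_idx++ % TRACE_SLOTS] = fr;

            if( !esmINTERFERENCE() )
            {
                fp->next = fn;
                fn->prev = fp;
                fr->prev = NULL;
                esmLOCKstore( fr->next, NULL );
                return TRUE;
            }
            return FALSE;
        }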

    3rd:: control transfer not under SW control--more like exceptions and interrupts than Br-condition--except that the target of control transfer
    is based on the code in the event.

    4th:: one cannot test esm with a random code generator, since the
    probability that the random code generator creates a legal esm event is
    exceedingly low.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Dec 9 13:55:12 2025
    From Newsgroup: comp.arch

    On 12/9/2025 11:15 AM, MitchAlsup wrote:

    snip


    Mostly esm detects interference but there are times when esm is allowed
    to ignore interference.

    Consider a sever scale esm implementation. In such an implementation, esm
    is enhanced with a system* arbiter.

    After any successful ATOMIC event esm reverts to "Optimistic" mode. In optimistic mode, esm races through the code as fast as possible expecting
    no interference. When interference is detected, the event fails and a HW counter is incremented. The failure diverts control to the ATOMIC control point. We still have the property that all participating memory locations become visible at the same instant.>
    At this point the core is in "careful" mode,

    I am missing some understanding here, about this "counter". This
    paragraph seems to indicate that after one failure, the core goes into "careful" mode, but if that were true, you wouldn't need a "counter",
    just a mode bit. So assuming it is a counter and you need "n" failures
    in a row to go into careful mode, is "n" hardwired or settable by
    software? What are the tradeoffs for smaller or larger values of "n"?

    core becomes sequentially
    consistent, SW chooses to re-run the event. Here, cache misses leave
    core in program order,... When interference is detected, the event fails
    and that HW counter is incremented. Failure diverts control to the ATOMIC control point, no participating memory is seen to have been modified.
    If core can determine that all writes to participating memory can be performed (at the first participating store) core is allowed to NaK
    lower priority interfering accesses.

    Again, after a single failure in careful mode or n failures? If n, is
    it the same value of n as for the transition from optimistic to careful
    mode? Same questions as before about who sets the value and is it
    software changeable?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Dec 9 22:52:31 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 12/9/2025 11:15 AM, MitchAlsup wrote:

    snip


    Mostly esm detects interference but there are times when esm is allowed
    to ignore interference.

    Consider a sever scale esm implementation. In such an implementation, esm is enhanced with a system* arbiter.

    After any successful ATOMIC event esm reverts to "Optimistic" mode. In optimistic mode, esm races through the code as fast as possible expecting no interference. When interference is detected, the event fails and a HW counter is incremented. The failure diverts control to the ATOMIC control point. We still have the property that all participating memory locations become visible at the same instant.>
    At this point the core is in "careful" mode,

    I am missing some understanding here, about this "counter". This
    paragraph seems to indicate that after one failure, the core goes into "careful" mode, but if that were true, you wouldn't need a "counter",
    just a mode bit. So assuming it is a counter and you need "n" failures
    in a row to go into careful mode, is "n" hardwired or settable by
    software? What are the tradeoffs for smaller or larger values of "n"?

    2-bits; 3-states--not part of saved thread state.

    core becomes sequentially
    consistent, SW chooses to re-run the event. Here, cache misses leave
    core in program order,... When interference is detected, the event fails and that HW counter is incremented. Failure diverts control to the ATOMIC control point, no participating memory is seen to have been modified.
    If core can determine that all writes to participating memory can be performed (at the first participating store) core is allowed to NaK
    lower priority interfering accesses.

    Again, after a single failure in careful mode or n failures? If n, is
    it the same value of n as for the transition from optimistic to careful mode? Same questions as before about who sets the value and is it
    software changeable?

    3-state counter::

    00 -> Optimistic
    01 -> Careful
    10 -> Slow and methodological

    success -> counter = 00;
    failure -> counter++;
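    (Or, as a literal C rendering of that table, with the names invented:)

        enum esm_mode { OPTIMISTIC = 0, CAREFUL = 1, SLOW_METHODICAL = 2 };

        /* Two bits, not part of saved thread state: reset on success,
           escalate on failure; the third mode is guaranteed to succeed. */
        static enum esm_mode after_event(enum esm_mode mode, int succeeded)
        {
            if (succeeded)
                return OPTIMISTIC;                         /* counter = 00 */
            return mode < SLOW_METHODICAL
                 ? (enum esm_mode)(mode + 1)               /* counter++    */
                 : SLOW_METHODICAL;
        }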
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Wed Dec 10 10:07:19 2025
    From Newsgroup: comp.arch

    On 09/12/2025 22:28, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 09/12/2025 20:15, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


    There are basically two ways to handle atomic operations. One way is to >>>> use locking mechanisms to ensure that nothing (other cores, interrupts >>>> or other pre-emption on the same core) can break up the sequence. The >>>> other way is to have a mechanism to detect conflicts and a failure of
    the atomic operation, so that you can try again (or otherwise handle the >>>> situation). (You can, of course, combine these - such as by disabling >>>> local interrupts and detecting conflicts from other cores.)

    The code Mitch posted apparently had neither of these mechanisms, hence >>>> my confusion. It turns out that it /does/ have conflict detection and a >>>> hardware retry loop, all hidden from anyone trying to understand the
    code. (I can appreciate that there may be benefits in doing this in
    hardware, but there are no benefits in hiding it from the programmer!)

    How exactly do you inform the programmer that:

    InBound [Address]
    OutBound [Address]

    operates like::

    try_again:
    InBound [Address]
    BIN try_again
    OutBound [Address]

    And why clutter up asm with extraneous labels and require extra instructions.

    The most obvious answer is that in any code that uses these features,
    good comments are essential so that readers can see what is happening.

    Another method would be to use better names for the intrinsics, as seen
    at the C (or other HLL) level. (Assembly instruction names don't matter
    nearly as much.)

    So maybe instead of "esmLOCKload()" and "esmLOCKstore()" you have
    "load_and_set_retry_point()" and "store_or_retry()". Feel free to think
    of better names, but that would at least give the reader a clue that
    there's something odd going on.

    This is a useful suggestion; thanks.

    I can certainly say they would help /me/ understand the code, so maybe
    they would help other people understand it too.


    On the other hand, there are some non-vonNeumann actions lurking within
    esm. Where vonNeumann means: that every instruction is executed in its entirety before the next instruction appears to start executing.


    That's a rather different use of the term "vonNeumann" from anything I
    have heard. I'd just talk about "indivisible" instructions (avoiding "atomic", because that usually refers to a wider view of the system).
    And are we thinking about the instructions purely from the viewpoint of
    the cpu executing them?

    IME, most instructions on most processors are indivisible, but most
    processors have some instructions that are not. For example, processors
    can have load/store multiple instructions that are interruptable - in
    some cases, after returning from the interrupt (and any associated
    thread context switches) the instructions are restarted, in other cases
    they are continued.

    But most instructions /appear/ to be executed entirely before the next instruction /appears/ to start executing. Fast processors have a lot of hardware designed to keep up this appearance - register renaming,
    pipelining, speculative execution, dependency tracking, and all the rest
    of it.

    1st:: one cannot single step through an ATMOIC event, if you enter an
    ATOMIC event in single-step mode, you will see the 1st instruction in
    the event, than you will receive control after the terminal instruction
    has executed.


    That is presumably a choice you made for the debugging features of the
    device.

    2nd::the only way to debug an event is to have a buffer of SW locations
    that gets written with non-participating STs. Unlike participating
    memory lines, these locations will be written--but not in a sequentially consistent manner (architecturally), and can be examined outside the
    event; whereas the participating lines are either all written instan- taneously or not modified at all.

    So, here we have non-participating STs having been written and older participating STs have not.

    3rd:: control transfer not under SW control--more like exceptions and interrupts than Br-condition--except that the target of control transfer
    is based on the code in the event.


    OK. I can see the advantages of that - though there are disadvantages
    too (such as being unable to control a limit on the number of retries,
    or add SW tracking of retry counts for metrics). My main concern was
    the disconnect between how the code was written and what it actually does.

    4th:: one cannot test esm with a random code generator, since the probability that the random code generator creates a legal esm event is exceedingly low.


    Testing and debugging any kind of locking or atomic access solution is
    always very difficult. You can rarely try out conflicts or potential
    race conditions in the lab - they only ever turn up at customer demos!

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Dec 10 08:51:16 2025
    From Newsgroup: comp.arch

    On 12/10/2025 1:07 AM, David Brown wrote:
    On 09/12/2025 22:28, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 09/12/2025 20:15, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


    There are basically two ways to handle atomic operations.  One way >>>>> is to
    use locking mechanisms to ensure that nothing (other cores, interrupts >>>>> or other pre-emption on the same core) can break up the sequence.  The >>>>> other way is to have a mechanism to detect conflicts and a failure of >>>>> the atomic operation, so that you can try again (or otherwise
    handle the
    situation).  (You can, of course, combine these - such as by disabling >>>>> local interrupts and detecting conflicts from other cores.)

    The code Mitch posted apparently had neither of these mechanisms,
    hence
    my confusion.  It turns out that it /does/ have conflict detection >>>>> and a
    hardware retry loop, all hidden from anyone trying to understand the >>>>> code.  (I can appreciate that there may be benefits in doing this in >>>>> hardware, but there are no benefits in hiding it from the programmer!) >>>>
    How exactly do you inform the programmer that:

             InBound   [Address]
             OutBound  [Address]

    operates like::

    try_again:
             InBound   [Address]
             BIN       try_again
             OutBound  [Address]

    And why clutter up asm with extraneous labels and require extra
    instructions.

    The most obvious answer is that in any code that uses these features,
    good comments are essential so that readers can see what is happening.

    Another method would be to use better names for the intrinsics, as seen
    at the C (or other HLL) level.  (Assembly instruction names don't matter >>> nearly as much.)

    So maybe instead of "esmLOCKload()" and "esmLOCKstore()" you have
    "load_and_set_retry_point()" and "store_or_retry()".  Feel free to think >>> of better names, but that would at least give the reader a clue that
    there's something odd going on.

    This is a useful suggestion; thanks.

    I can certainly say they would help /me/ understand the code, so maybe
    they would help other people understand it too.


    On the other hand, there are some non-vonNeumann actions lurking within
    esm. Where vonNeumann means: that every instruction is executed in its
    entirety before the next instruction appears to start executing.


    That's a rather different use of the term "vonNeumann" from anything I
    have heard.  I'd just talk about "indivisible" instructions (avoiding "atomic", because that usually refers to a wider view of the system).
    And are we thinking about the instructions purely from the viewpoint of
    the cpu executing them?

    IME, most instructions on most processors are indivisible, but most processors have some instructions that are not.  For example, processors can have load/store multiple instructions that are interruptable - in
    some cases, after returning from the interrupt (and any associated
    thread context switches) the instructions are restarted, in other cases
    they are continued.

    But most instructions /appear/ to be executed entirely before the next instruction /appears/ to start executing.  Fast processors have a lot of hardware designed to keep up this appearance - register renaming, pipelining, speculative execution, dependency tracking, and all the rest
    of it.

    1st:: one cannot single step through an ATMOIC event, if you enter an
    ATOMIC event in single-step mode, you will see the 1st instruction in
    the event, than you will receive control after the terminal instruction
    has executed.


    That is presumably a choice you made for the debugging features of the device.

    2nd::the only way to debug an event is to have a buffer of SW locations
    that gets written with non-participating STs. Unlike participating
    memory lines, these locations will be written--but not in a sequentially
    consistent manner (architecturally), and can be examined outside the
    event; whereas the participating lines are either all written instan-
    taneously or not modified at all.

    So, here we have non-participating STs having been written and older
    participating STs have not.

    3rd:: control transfer not under SW control--more like exceptions and
    interrupts than Br-condition--except that the target of control transfer
    is based on the code in the event.


    OK.  I can see the advantages of that - though there are disadvantages
    too (such as being unable to control a limit on the number of retries,

    Yes, but. ISTM there is a hardware limit on the number of retries - it
    is two retries, as the third try (second retry) is guaranteed to
    succeed, albeit at a higher cost (in time and interference with other threads/processes) compared to the earlier tries.


    or add SW tracking of retry counts for metrics).

    Again, ISTM that you could do some software tracking by using
    non-participating stores within the locked area to save information
    outside the locked area.  I haven't thought through the cost benefit of
    this, how much to save, etc.

    But I am not sure that the "escalation" to a more "intrusive" mechanism
    upon a single failure is optimal. Perhaps it would be better to retry
    once or twice using the current mechanism. I don't have a good feeling
    for what is optimal here, and to what extent the optimal choice would be workload dependent.


    My main concern was
    the disconnect between how the code was written and what it actually does.

    4th:: one cannot test esm with a random code generator, since the
    probability
    that the random code generator creates a legal esm event is
    exceedingly low.


    Testing and debugging any kind of locking or atomic access solution is always very difficult.

    Yup!


    You can rarely try out conflicts or potential
    race conditions in the lab - they only ever turn up at customer demos!

    :-)
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Dec 10 20:10:43 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 09/12/2025 22:28, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 09/12/2025 20:15, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


    There are basically two ways to handle atomic operations. One way is to >>>> use locking mechanisms to ensure that nothing (other cores, interrupts >>>> or other pre-emption on the same core) can break up the sequence. The >>>> other way is to have a mechanism to detect conflicts and a failure of >>>> the atomic operation, so that you can try again (or otherwise handle the >>>> situation). (You can, of course, combine these - such as by disabling >>>> local interrupts and detecting conflicts from other cores.)

    The code Mitch posted apparently had neither of these mechanisms, hence >>>> my confusion. It turns out that it /does/ have conflict detection and a >>>> hardware retry loop, all hidden from anyone trying to understand the >>>> code. (I can appreciate that there may be benefits in doing this in >>>> hardware, but there are no benefits in hiding it from the programmer!) >>>
    How exactly do you inform the programmer that:

    InBound [Address]
    OutBound [Address]

    operates like::

    try_again:
    InBound [Address]
    BIN try_again
    OutBound [Address]

    And why clutter up asm with extraneous labels and require extra instructions.

    The most obvious answer is that in any code that uses these features,
    good comments are essential so that readers can see what is happening.

    Another method would be to use better names for the intrinsics, as seen
    at the C (or other HLL) level. (Assembly instruction names don't matter >> nearly as much.)

    So maybe instead of "esmLOCKload()" and "esmLOCKstore()" you have
    "load_and_set_retry_point()" and "store_or_retry()". Feel free to think >> of better names, but that would at least give the reader a clue that
    there's something odd going on.

    This is a useful suggestion; thanks.

    I can certainly say they would help /me/ understand the code, so maybe
    they would help other people understand it too.


    On the other hand, there are some non-vonNeumann actions lurking within esm. Where vonNeumann means: that every instruction is executed in its entirety before the next instruction appears to start executing.


    That's a rather different use of the term "vonNeumann" from anything I
    have heard. I'd just talk about "indivisible" instructions (avoiding "atomic", because that usually refers to a wider view of the system).
    And are we thinking about the instructions purely from the viewpoint of
    the cpu executing them?

    An ATOMIC event is a series of instructions that appear to be performed
    all at once--as if the whole series was "indivisible".

    IME, most instructions on most processors are indivisible, but most processors have some instructions that are not. For example, processors
    can have load/store multiple instructions that are interruptable - in
    some cases, after returning from the interrupt (and any associated
    thread context switches) the instructions are restarted, in other cases
    they are continued.

    Go in the other direction, where a series of instructions HAS TO APPEAR
    as if executed instantaneously.

    But most instructions /appear/ to be executed entirely before the next instruction /appears/ to start executing. Fast processors have a lot of hardware designed to keep up this appearance - register renaming, pipelining, speculative execution, dependency tracking, and all the rest
    of it.

    None of those things is ARCHITECTURAL--esm is an architectural window into
    how to program ATOMIC events such that no future generation of the ISA has
    to continuously add more synchronization instructions. One can program
    every known industrial and academic synchronization primitive in esm
    without ever adding new synchronization instructions.

    1st:: one cannot single step through an ATOMIC event; if you enter an ATOMIC event in single-step mode, you will see the 1st instruction in
    the event, then you will receive control after the terminal instruction
    has executed.


    That is presumably a choice you made for the debugging features of the device.

    No, it is the nature of executing a series of instructions as if instantaneously.

    2nd:: the only way to debug an event is to have a buffer of SW locations that gets written with non-participating STs. Unlike participating
    memory lines, these locations will be written--but not in a sequentially consistent manner (architecturally), and can be examined outside the
    event; whereas the participating lines are either all written instantaneously or not modified at all.

    So, here we have non-participating STs having been written and older participating STs have not.

    3rd:: control transfer not under SW control--more like exceptions and interrupts than Br-condition--except that the target of control transfer
    is based on the code in the event.


    OK. I can see the advantages of that - though there are disadvantages
    too (such as being unable to control a limit on the number of retries,
    or add SW tracking of retry counts for metrics).
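
    (For contrast, with a conventional CAS loop the retry loop lives in SW, so a
    retry limit and a retry counter for metrics are trivial to add. A portable
    C11 illustration, nothing My 66000 specific:)

        #include <stdatomic.h>
        #include <stdint.h>

        _Atomic uint64_t retry_total;          /* metrics: global retry count */

        /* Increment *ctr, giving up after max_tries failed CAS attempts. */
        int bounded_inc(_Atomic uint64_t *ctr, int max_tries)
        {
            uint64_t old = atomic_load_explicit(ctr, memory_order_relaxed);
            for (int tries = 0; tries < max_tries; tries++) {
                if (atomic_compare_exchange_weak_explicit(ctr, &old, old + 1,
                        memory_order_acq_rel, memory_order_relaxed))
                    return 0;                  /* success */
                atomic_fetch_add_explicit(&retry_total, 1, memory_order_relaxed);
            }
            return -1;                         /* retry limit hit */
        }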

    esm attempts to allow SW to program with features previously available
    only at the µCode level. µCode allows for many µinstructions to execute before/between any real instructions.

    My main concern was
    the disconnect between how the code was written and what it actually does.

    There is a 26 page specification the programmer needs to read and understand. This includes things we have not talked about--such as::
    a) terminating an event without writing anything
    b) proactively minimizing future interference
    c) modifications to cache coherence model
    at the architectural level.

    The architectural specification allows for various scales of µArchitecture
    to independently choose how to implement esm and provide the architectural features at SW level. For example the kinds of esm activities for a 1-wide In-Order µController are vastly different from those suitable for a server-scale rack of processor ensembles. What we want is one SW model that covers
    the whole gamut.

    4th:: one cannot test esm with a random code generator, since the probability
    that the random code generator creates a legal esm event is exceedingly low.


    Testing and debugging any kind of locking or atomic access solution is always very difficult. You can rarely try out conflicts or potential
    race conditions in the lab - they only ever turn up at customer demos!

    Right at Christmas time !! {Ask me how I know}.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Thu Dec 11 10:05:34 2025
    From Newsgroup: comp.arch

    On 10/12/2025 21:10, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


    OK. I can see the advantages of that - though there are disadvantages
    too (such as being unable to control a limit on the number of retries,
    or add SW tracking of retry counts for metrics).

    esm attempts to allow SW to program with features previously available
    only at the µCode level. µCode allows for many µinstructions to execute before/between any real instructions.

    My main concern was
    the disconnect between how the code was written and what it actually does.


    Perhaps it would be better to think of these sequences in assembler
    rather than C - you want tighter control than C normally allows, and you
    don't want optimisers re-arranging things too much.

    There is a 26 page specification the programmer needs to read and understand. This includes things we have not talked about--such as::
    a) terminating an event without writing anything
    b) proactively minimizing future interference
    c) modifications to cache coherence model
    at the architectural level.

    Fair enough. This is not a minor or simple feature!


    The architectural specification allows for various scales of µArchitecture to independently choose how to implement esm and provide the architectural features at SW level. For example the kinds of esm activities for a 1-wide In-Order µController are vastly different that those suitable for a server scale rack of processor ensembles. What we want is one SW model that covers the whole gamut.

    4th:: one cannot test esm with a random code generator, since the probability
    that the random code generator creates a legal esm event is exceedingly low.


    Testing and debugging any kind of locking or atomic access solution is
    always very difficult. You can rarely try out conflicts or potential
    race conditions in the lab - they only ever turn up at customer demos!

    Right at Christmas time !! {Ask me how I know}.

    We can gather round the fire, and Grampa can settle in his rocking chair
    to tell us war stories from the olden days :-)

    A good story is always nice, so go for it!

    (We once had a system where there was a bug that not only triggered only
    at the customer's site, but did so only on the 30th of September. It
    took years before we made the connection to the date and found the bug.)


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Dec 11 20:26:09 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 10/12/2025 21:10, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


    OK. I can see the advantages of that - though there are disadvantages
    too (such as being unable to control a limit on the number of retries,
    or add SW tracking of retry counts for metrics).

    esm attempts to allow SW to program with features previously available
    only at the µCode level. µCode allows for many µinstructions to execute before/between any real instructions.

    My main concern was
    the disconnect between how the code was written and what it actually does.


    Perhaps it would be better to think of these sequences in assembler
    rather than C - you want tighter control than C normally allows, and you don't want optimisers re-arranging things too much.

    Heck, there are assemblers that rearrange code like this too much--
    until they can be taught not to.

    There is a 26 page specification the programmer needs to read and understand.
    This includes things we have not talked about--such as::
    a) terminating an event without writing anything
    b) proactively minimizing future interference
    c) modifications to cache coherence model
    at the architectural level.

    Fair enough. This is not a minor or simple feature!

    No, it is a design that allows the ISA to remain static while all sorts of synchronization stuff gets written, tested, and tuned.


    The architectural specification allows for various scales of µArchitecture to independently choose how to implement esm and provide the architectural features at SW level. For example the kinds of esm activities for a 1-wide In-Order µController are vastly different that those suitable for a server scale rack of processor ensembles. What we want is one SW model that covers the whole gamut.

    4th:: one cannot test esm with a random code generator, since the probability
    that the random code generator creates a legal esm event is exceedingly low.


    Testing and debugging any kind of locking or atomic access solution is
    always very difficult. You can rarely try out conflicts or potential
    race conditions in the lab - they only ever turn up at customer demos!

    Right at Christmas time !! {Ask me how I know}.

    We can gather round the fire, and Grampa can settle in his rocking chair
    to tell us war stories from the olden days :-)

    A good story is always nice, so go for it!

    Year:: 1997, time 7 days before Christmas:: situation, Customer is
    having (and has had) strange bugs that happen about once a week.
    Customer is unhappy; we have had a senior engineer on site for
    4 months without forward progress. We were told "You don't come home
    until the problem is fixed".

    System:: 2 (or more) of our cache coherent motherboards, connected
    with a proven cache coherent bus.

    On the flight from Austin to Manchester, England, I decided that what
    we had was a physics experiment. So, when we arrived, I had their SW
    guy code up a routine that, as soon as it got a time slice, would
    signal that it no longer needed time, while we hooked up the logic analyzer
    to our motherboards and to their bus. When the SW was ready (about 30 minutes)
    we tried the case--instantly, the time between occurrences of the bug
    went from once a week to milliseconds. We spent the afternoon taking
    logic analyzer traces, and went to dinner.

    The next day, we went through the traces with a fine-tooth comb and
    found a smoking gun--so we ran more experiments, and the same smoking
    gun was found in each trace. After a couple of hours, we found that
    their proven coherent bus was allowing 1 single cycle where our bus
    could be seen in an inconsistent state, and it was only a dozen
    cycles downstream that the crash was transpiring.

    It turned out that their bus was only coherent when the attached bus
    took longer than 4 cycles to respond to a "random coherent message", whereas
    our bus was timed at 2 cycles for this response.

    So, we took apart their FPGA, which ran the bus, figured out how to
    delay one signal, and reprogrammed it--ONLY to run into another message
    that was off by 1 or 2 cycles. This one took a whole day to find and
    program around.

    We both made it home for Christmas, and in some part saved the company...

    (We once had a system where there was a bug that not only triggered only
    at the customer's site, but did so only on the 30th of September. It
    took years before we made the connection to the date and found the bug.)


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Dec 11 20:47:12 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Heck, there are assemblers that rearrange code like this too much--
    until they can be taught not to.

    Any example? This would definitely go against what I would consider
    to be reasonable for an assembler. gdb certainly does not do so.

    What _would_ be useful on occasion would be an assembler which
    could do register assignment, for example for a small function.
    It would be OK if this were to issue an error if there were too
    many variables for assignment.

    Does anybody know of such a beast?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Dec 11 23:51:26 2025
    From Newsgroup: comp.arch

    On Thu, 11 Dec 2025 20:26:09 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    We both made it home for Christmas, and in some part saved the
    company...


    Not for long, though... Wasn't it dead anyway within 6-7 months?


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Dec 11 15:00:53 2025
    From Newsgroup: comp.arch

    On 12/10/2025 1:07 AM, David Brown wrote:
    [...]
    Testing and debugging any kind of locking or atomic access solution is always very difficult.  You can rarely try out conflicts or potential
    race conditions in the lab - they only ever turn up at customer demos!

    Murphy's Law. Actually, have you ever messed around with Relacy Race
    Detector? It's pretty interesting.


    [...]

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Dec 11 15:02:40 2025
    From Newsgroup: comp.arch

    On 12/11/2025 1:05 AM, David Brown wrote:
    On 10/12/2025 21:10, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


    OK.  I can see the advantages of that - though there are disadvantages
    too (such as being unable to control a limit on the number of retries,
    or add SW tracking of retry counts for metrics).

    esm attempts to allow SW to program with features previously available
    only at the µCode level. µCode allows for many µinstructions to execute >> before/between any real instructions.

                                                      My main concern was
    the disconnect between how the code was written and what it actually
    does.


    Perhaps it would be better to think of these sequences in assembler
    rather than C - you want tighter control than C normally allows, and you don't want optimisers re-arranging things too much.

    Right. Way back before C/C++ 11 I would code all of my sensitive lock/wait-free code in assembly.

    [...]



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Dec 11 15:03:29 2025
    From Newsgroup: comp.arch

    On 12/11/2025 3:02 PM, Chris M. Thomasson wrote:
    On 12/11/2025 1:05 AM, David Brown wrote:
    On 10/12/2025 21:10, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


    OK.  I can see the advantages of that - though there are disadvantages >>>> too (such as being unable to control a limit on the number of retries, >>>> or add SW tracking of retry counts for metrics).

    esm attempts to allow SW to program with features previously available
    only at the µCode level. µCode allows for many µinstructions to execute >>> before/between any real instructions.

                                                      My main concern was
    the disconnect between how the code was written and what it actually
    does.


    Perhaps it would be better to think of these sequences in assembler
    rather than C - you want tighter control than C normally allows, and
    you don't want optimisers re-arranging things too much.

    Right. Way back before C/C++ 11 I would code all of my sensitive lock/ wait-free code in assembly.

    [...]




    Actually, I would turn off link-time optimization back then.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Fri Dec 12 01:41:41 2025
    From Newsgroup: comp.arch

    According to Thomas Koenig <tkoenig@netcologne.de>:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Heck, there are assemblers that rearrange code like this too much--
    until they can be taught not to.

    Any example? This would definitely go against what I would consider
    to be reasonable for an assembler. gdb certainly does not do so.

    On machines with delayed branches I've seen assemblers that move
    instructions into the delay slot. Can't think of any others off hand.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Dec 11 18:27:48 2025
    From Newsgroup: comp.arch

    On 12/11/2025 5:41 PM, John Levine wrote:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Heck, there are assemblers that rearrange code like this too much--
    until they can be taught not to.

    Any example? This would definitely go against what I would consider
    to be reasonable for an assembler. gdb certainly does not do so.

    On machines with delayed branches I've seen assemblers that move
    instructions into the delay slot. Can't think of any others off hand.


    That would suck! Back when I used to code in SPARC assembly language, I
    had full control over my delay slots. Actually, IIRC, putting a MEMBAR instruction in a delay slot is VERY bad.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Fri Dec 12 02:48:19 2025
    From Newsgroup: comp.arch

    According to Chris M. Thomasson <chris.m.thomasson.1@gmail.com>:
    On 12/11/2025 5:41 PM, John Levine wrote:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Heck, there are assemblers that rearrange code like this too much--
    until they can be taught not to.

    Any example? This would definitely go against what I would consider
    to be reasonable for an assembler. gdb certainly does not do so.

    On machines with delayed branches I've seen assemblers that move
    instructions into the delay slot. Can't think of any others off hand.

    That would suck! Back when I used to code in SPARC assembly language, I
    had full control over my delay slots. Actually, IIRC, putting a MEMBAR >instruction in a delay slot is VERY bad.

    I think they were smart enough only to move instructions that wouldn't cause problems.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Dec 12 08:59:12 2025
    From Newsgroup: comp.arch

    On 11/12/2025 22:51, Michael S wrote:
    On Thu, 11 Dec 2025 20:26:09 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    We both made it home for Christmas, and in some part saved the
    company...


    Not for long so... Was not it dead anyway in the 6-7 months?


    This is why stories end with "they all lived happily ever after", and
    why sequel movies are almost always terrible! I liked the first story
    better.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Dec 12 08:14:47 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> schrieb:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Heck, there are assemblers that rearrange code like this too much--
    until they can be taught not to.

    Any example? This would definitely go against what I would consider
    to be reasonable for an assembler. gdb certainly does not do so.

    On machines with delayed branches I've seen assemblers that move
    instructions into the delay slot. Can't think of any others off hand.

    Thinking of it a bit more, the optimizing assemblers for drum memory
    computers like the IBM 650 or the LGP-30 of Mel the Programmer
    fame moved around instructions so the next one would be under the
    head when the previous one was done executing.

    Random-access memory made this redundant :-)
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Fri Dec 12 13:05:43 2025
    From Newsgroup: comp.arch

    In article <10hfrsl$145v$1@gal.iecc.com>, John Levine <johnl@taugh.com> wrote: >According to Thomas Koenig <tkoenig@netcologne.de>:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Heck, there are assemblers that rearrange code like this too much--
    until they can be taught not to.

    Any example? This would definitely go against what I would consider
    to be reasonable for an assembler. gdb certainly does not do so.

    On machines with delayed branches I've seen assemblers that move
    instructions into the delay slot. Can't think of any others off hand.

    I've seen things like this, as well, particularly on machines
    with multiple delay slots, where this detail was hidden from the
    programmer. Or at least I have a vague memory of this; perhaps
    I'm hallucinating.

    More dangerous are linkers that do LTO and decide to elide code
    that, no, really, I actually need for reasons that are not
    apparent to the toolchain.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Dec 12 15:28:30 2025
    From Newsgroup: comp.arch

    On 12/12/2025 14:05, Dan Cross wrote:
    In article <10hfrsl$145v$1@gal.iecc.com>, John Levine <johnl@taugh.com> wrote:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Heck, there are assemblers that rearrange code like this too much--
    until they can be taught not to.

    Any example? This would definitely go against what I would consider
    to be reasonable for an assembler. gdb certainly does not do so.

    On machines with delayed branches I've seen assemblers that move
    instructions into the delay slot. Can't think of any others off hand.

    I've seen things like this, as well, particularly on machines
    with multiple delay slots, where this detail was hidden from the
    programmer. Or at least I have a vague memory of this; perhaps
    I'm hallucinating.


    I've seen a few assemblers that do fancy things with jumps and branches
    - giving you generic conditional branch pseudo-instructions that get
    turned into different types of real instructions depending on the
    distance needed for the jumps and the ranges supported by the
    instructions. And there are plenty that have pseudo-instructions for
    loading immediates into registers that generate whatever sequence of
    load immediate, shift-and-or, etc., are needed.


    More dangerous are linkers that do LTO and decide to elide code
    that, no, really, I actually need for reasons that are not
    apparent to the toolchain.


    IME you have control over the details - either using directives in the assembly, or in the linker control files. Of course that might mean
    modifying code that you hoped to use untouched, and it's not hard to
    forget to add a "keep" or "retain" directive.

    I've found link-time dead code elimination quite useful when I have one
    code base but different binary builds - sometimes all you need is a
    different linker file.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Fri Dec 12 16:25:42 2025
    From Newsgroup: comp.arch

    In article <10hh8qe$2v9lm$1@dont-email.me>,
    David Brown <david.brown@hesbynett.no> wrote:
    On 12/12/2025 14:05, Dan Cross wrote:
    In article <10hfrsl$145v$1@gal.iecc.com>, John Levine <johnl@taugh.com> wrote:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Heck, there are assemblers that rearrange code like this too much--
    until they can be taught not to.

    Any example? This would definitely go against what I would consider
    to be reasonable for an assembler. gdb certainly does not do so.

    On machines with delayed branches I've seen assemblers that move
    instructions into the delay slot. Can't think of any others off hand.

    I've seen things like this, as well, particularly on machines
    with multiple delay slots, where this detail was hidden from the
    programmer. Or at least I have a vague memory of this; perhaps
    I'm hallucinating.


    I've seen a few assemblers that do fancy things with jumps and branches
    - giving you generic conditional branch pseudo-instructions that get
    turned into different types of real instructions depending on the
    distance needed for the jumps and the ranges supported by the
    instructions. And there are plenty that have pseudo-instructions for >loading immediates into registers that generate whatever sequence of
    load immediate, shift-and-or, etc., are needed.


    More dangerous are linkers that do LTO and decide to elide code
    that, no, really, I actually need for reasons that are not
    apparent to the toolchain.


    IME you have control over the details - either using directives in the >assembly, or in the linker control files. Of course that might mean >modifying code that you hoped to use untouched, and it's not hard to
    forget to add a "keep" or "retain" directive.

    Provided, of course, that you have access to both the assembly
    and the linker configuration for a given program. Sometimes you
    don't (e.g., if the code in question is in some higher-level
    language) or the linker configuration is just some default.

    For example, the Plan 9 C compiler delegated actual instruction
    selection to the linker; the compiler emitted a high(er)-level
    representation of the operation. This made the linker free to
    perform peephole optimization, potentially eliding important
    instructions (like writes to MMIO regions). Fortunately, the
    Plan 9 authors understood this so effectively all globals were
    volatile, but when porting that code to standard C, one had to
    exercise some care.

    I've found link-time dead code elimination quite useful when I have one
    code base but different binary builds - sometimes all you need is a >different linker file.

    Agreed, it _is_ useful. But sometimes it's inappropriate.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Dec 12 19:17:16 2025
    From Newsgroup: comp.arch


    John Levine <johnl@taugh.com> posted:

    According to Chris M. Thomasson <chris.m.thomasson.1@gmail.com>:
    On 12/11/2025 5:41 PM, John Levine wrote:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Heck, there are assemblers that rearrange code like this too much--
    until they can be taught not to.

    Any example? This would definitely go against what I would consider
    to be reasonable for an assembler. gdb certainly does not do so.

    On machines with delayed branches I've seen assemblers that move
    instructions into the delay slot. Can't think of any others off hand.

    That would suck! Back when I used to code in SPARC assembly language, I >had full control over my delay slots. Actually, IIRC, putting a MEMBAR >instruction in a delay slot is VERY bad.

    I think they were smart enough only to move instructions that wouldn't cause problems.

    Many early RISC assemblers were in charge of moving instructions around
    subject to not altering register dependencies and not altering control
    flow dependencies. This allowed those assemblers to move code across
    memory instructions, across long latency calculation instructions,
    branch instructions, including delay slots; and redefine what "program
    order" now is. A bad side effect of exposing the pipeline to SW.

    We mostly have gotten away from this due to "smart" instruction queueing.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Dec 12 21:12:05 2025
    From Newsgroup: comp.arch

    On 12/12/2025 17:25, Dan Cross wrote:
    In article <10hh8qe$2v9lm$1@dont-email.me>,
    David Brown <david.brown@hesbynett.no> wrote:
    On 12/12/2025 14:05, Dan Cross wrote:
    In article <10hfrsl$145v$1@gal.iecc.com>, John Levine <johnl@taugh.com> wrote:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Heck, there are assemblers that rearrange code like this too much-- >>>>>> until they can be taught not to.

    Any example? This would definitely go against what I would consider >>>>> to be reasonable for an assembler. gdb certainly does not do so.

    On machines with delayed branches I've seen assemblers that move
    instructions into the delay slot. Can't think of any others off hand.

    I've seen things like this, as well, particularly on machines
    with multiple delay slots, where this detail was hidden from the
    programmer. Or at least I have a vague memory of this; perhaps
    I'm hallucinating.


    I've seen a few assemblers that do fancy things with jumps and branches
    - giving you generic conditional branch pseudo-instructions that get
    turned into different types of real instructions depending on the
    distance needed for the jumps and the ranges supported by the
    instructions. And there are plenty that have pseudo-instructions for
    loading immediates into registers that generate whatever sequence of
    load immediate, shift-and-or, etc., are needed.


    More dangerous are linkers that do LTO and decide to elide code
    that, no, really, I actually need for reasons that are not
    apparent to the toolchain.


    IME you have control over the details - either using directives in the
    assembly, or in the linker control files. Of course that might mean
    modifying code that you hoped to use untouched, and it's not hard to
    forget to add a "keep" or "retain" directive.

    Provided, of course, that you have access to both the assembly
    and the linker configuration for a given program. Sometimes you
    don't (e.g., if the code in question is in some higher-level
    language) or the linker configuration is just some default.

    I've managed so far in my own work, but I suppose I work at a lower
    level than most. I don't think it is common for C or C++ programmers to
    know much about linker control files.


    For example, the Plan 9 C compiler delegated actual instruction
    selection to the linker; the compiler emitted a high(er)-level
    representation of the operation. This made the linker free to
    perform peephole optimization, potentially eliding important
    instructions (like writes to MMIO regions). Fortunately, the
    Plan 9 authors understood this so effectively all globals were
    volatile, but when porting that code to standard C, one had to
    exercise some care.

    I've found link-time dead code elimination quite useful when I have one
    code base but different binary builds - sometimes all you need is a
    different linker file.

    Agreed, it _is_ useful. But sometimes it's inappropriate.


    Indeed.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Dec 12 21:02:14 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Many early RISC assemblers were in charge of moving instructions around subject to not altering register dependencies and not altering control
    flow dependencies. This allowed those assemblers to move code across
    memory instructions, across long latency calculation instructions,
    branch instructions, including delay slots; and redefine what "program order" now is. A bad side effect of exposing the pipeline to SW.

    I never heard of that one.

    Sounds like bad design - that should be done by the compiler,
    not the assembler. It is fine for the compiler to have pipeline
    descriptions in the cost model of the CPU under a specific -march
    or -mtune flag.

    (Yes, it is preferred that performance should be rather good for
    code generated for a generic microarchitecture).

    We mostly have gotten away from this due to "smart" instruction queueing.

    What is that?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Dec 12 22:05:14 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Many early RISC assemblers were in charge of moving instructions around subject to not altering register dependencies and not altering control
    flow dependencies. This allowed those assemblers to move code across
    memory instructions, across long latency calculation instructions,
    branch instructions, including delay slots; and redefine what "program order" now is. A bad side effect of exposing the pipeline to SW.

    I never heard of that one.

    Sounds like bad design - that should be done by the compiler,
    not the assembler. It is fine for the compiler to have pipeline
    descriptions in the cost model of the CPU under a specific -march
    or -mtune flag.

    (Yes, it is preferred that performance should be rather good for
    code generated for a generic microarchitecture).

    We mostly have gotten away from this due to "smart" instruction queueing.

    What is that?

    Reservation stations {Value capturing and value free}, Scoreboards,
    Dispatch stacks, and similar.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 12 14:19:29 2025
    From Newsgroup: comp.arch

    On 12/12/2025 2:05 PM, MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Many early RISC assemblers were in charge of moving instructions around
    subject to not altering register dependencies and not altering control
    flow dependencies. This allowed those assemblers to move code across
    memory instructions, across long latency calculation instructions,
    branch instructions, including delay slots; and redefine what "program
    order" now is. A bad side effect of exposing the pipeline to SW.

    I never heard of that one.

    Sounds like bad design - that should be done by the compiler,
    not the assembler. It is fine for the compiler to have pipeline
    descriptions in the cost model of the CPU under a specific -march
    or -mtune flag.

    (Yes, it is preferred that performance should be rather good for
    code generated for a generic microarchitecture).

    We mostly have gotten away from this due to "smart" instruction queueing. >>
    What is that?

    Reservation stations {Value capturing and value free}, Scoreboards,
    Dispatch stacks, and similar.

    IIRC, over on the PPC, wrt LL/SC, it was the reservation granule. I
    think it could be larger than an L2 cache line. So, any interference in
    that granule could cause an LL/SC to fail. This can lead to livelock if the program's data was not aligned and/or padded correctly.
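
    The usual mitigation, sketched in C (the 128-byte granule size below is just
    an assumption for illustration; the real reservation-granule size is
    implementation-specific and has to be checked for the target part):

        #include <stdatomic.h>
        #include <stdint.h>

        /* Assumed granule size, for illustration only. */
        #define RES_GRANULE 128

        /* Give each LL/SC (or CAS) target its own granule so that stores to
           unrelated data cannot keep killing the reservation (livelock). */
        struct padded_counter {
            _Alignas(RES_GRANULE) _Atomic uint64_t value;
            char pad[RES_GRANULE - sizeof(_Atomic uint64_t)];
        };
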
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 12 14:22:30 2025
    From Newsgroup: comp.arch

    On 12/11/2025 6:48 PM, John Levine wrote:
    According to Chris M. Thomasson <chris.m.thomasson.1@gmail.com>:
    On 12/11/2025 5:41 PM, John Levine wrote:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Heck, there are assemblers that rearrange code like this too much--
    until they can be taught not to.

    Any example? This would definitely go against what I would consider
    to be reasonable for an assembler. gdb certainly does not do so.

    On machines with delayed branches I've seen assemblers that move
    instructions into the delay slot. Can't think of any others off hand.

    That would suck! Back when I used to code in SPARC assembly language, I
    had full control over my delay slots. Actually, IIRC, putting a MEMBAR
    instruction in a delay slot is VERY bad.

    I think they were smart enough only to move instructions that wouldn't cause problems.




    I would check the disassembly to see if anything funny happened. Also,
    when my assembled code was used in C, back before C/C++11, I would turn
    off link time optimization. And check again. This was way back, around
    25 years ago. My lock/wait free code was highly sensitive. If something thought it could "optimize" it, well, that was NOT good.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 12 14:37:03 2025
    From Newsgroup: comp.arch

    On 12/8/2025 12:06 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>>>> guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing that >>>>>>>> affects the hardware.  So volatile writes are ordered at the C level, >>>>>>>> but that says nothing about how they might progress through storage >>>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>>
    You describe in many words and not really to the point what can be >>>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>>>> atomic operations beyond load and store (even on single-core systems), >>>>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes >>>>> bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32   DWs   as a single ATOMIC instruction. >>>> MM      can MOV   up to 8192 bytes as a single ATOMIC instruction. >>>>

    The functions below rely on more than that - to make the work, as far as >>> I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting
    until the esmLOCKstore instruction.  Or am I missing something here?

    Lock the BUS? Only when shit hits the fan. What about locking the cache
    line? Actually, I think we can "force" an x86/x64 to lock the bus if we
    do a LOCK'ed RMW on memory that straddles cache lines?

    In the My 66000 case, Mem References can lock up to 8 cache lines.

    Pretty flexible wrt implementing those exotic things back in the day, experimental algos that need DCAS, KCSS, etc... A heck of a lot of
    things can be accomplished with DWCAS, aka cmpxchg8b on a 32-bit system
    or cmpxchg16b on a 64-bit system.

    People would bend over backwards to get a DCAS, or NCAS. It would be
    infested with strange indirection ala "descriptors", and involved a shit
    load of atomic RMWs. CAS, DWCAS, XCHG and XADD can get a lot done.
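
    For the record, a DWCAS on a 64-bit target can be written with the GCC/Clang
    builtins - a sketch, assuming x86-64 and -mcx16 so it maps to LOCK CMPXCHG16B
    (otherwise the builtin falls back to libatomic); the struct layout is just an
    example:

        #include <stdbool.h>
        #include <stdint.h>
        #include <string.h>

        /* Two adjacent 64-bit words updated as one unit, e.g. pointer + tag
           as an ABA counter. */
        struct tagged_ptr {
            void     *ptr;
            uint64_t  tag;
        } __attribute__((aligned(16)));

        static bool dwcas(struct tagged_ptr *loc,
                          struct tagged_ptr *expected,
                          struct tagged_ptr  desired)
        {
            __int128 exp, des;
            memcpy(&exp, expected, sizeof exp);
            memcpy(&des, &desired, sizeof des);
            bool ok = __atomic_compare_exchange_n((__int128 *)loc, &exp, des,
                          false, __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
            memcpy(expected, &exp, sizeof exp);   /* report the observed value */
            return ok;
        }
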
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 12 14:39:16 2025
    From Newsgroup: comp.arch

    On 12/12/2025 2:37 PM, Chris M. Thomasson wrote:
    On 12/8/2025 12:06 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide >>>>>>>>> enough
    guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing >>>>>>>>> that
    affects the hardware.  So volatile writes are ordered at the C >>>>>>>>> level,
    but that says nothing about how they might progress through >>>>>>>>> storage
    queues, caches, inter-processor communication buses, or whatever. >>>>>>>>
    You describe in many words and not really to the point what can be >>>>>>>> explained concisely as: "volatile says nothing about memory
    ordering
    on hardware with weaker memory ordering than sequential
    consistency".
    If hardware guaranteed sequential consistency, volatile would >>>>>>>> provide
    guarantees that are as good on multi-core machines as on single- >>>>>>>> core
    machines.

    However, for concurrent manipulations of data structures, one wants >>>>>>>> atomic operations beyond load and store (even on single-core
    systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes >>>>>> bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32   DWs   as a single ATOMIC instruction. >>>>> MM      can MOV   up to 8192 bytes as a single ATOMIC instruction. >>>>>

    The functions below rely on more than that - to make the work, as
    far as
    I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting >>>> until the esmLOCKstore instruction.  Or am I missing something here?

    Lock the BUS? Only when shit hits the fan. What about locking the cache
    line? Actually, I think we can "force" an x86/x64 to lock the bus if we
    do a LOCK'ed RMW on memory that straddles cache lines?

    In the My 66000 case, Mem References can lock up to 8 cache lines.

    Pretty flexible wrt implementing those exotic things back in the day, experimental algos that need DCAS, KCSS, ect... A heck of a lot of
    things can be accomplished with DWCAS, aka cmpxchg8b on a 32 bit system.
    or cmpxchg16b on a 64-bit system.

    People would bend over backwards to get a DCAS, or NCAS. It would be infested with strange indirection ala d"escriptors", and involved a shit load of atomic RMW's. CAS, DWCAS, XCHG and XADD can get a lot done.

    Have you ever read about KCSS?

    https://groups.google.com/g/comp.arch/c/shshLdF1uqs

    https://patents.google.com/patent/US7293143
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 12 14:47:50 2025
    From Newsgroup: comp.arch

    On 12/12/2025 2:37 PM, Chris M. Thomasson wrote:
    On 12/8/2025 12:06 PM, MitchAlsup wrote:
    [...]
    People would bend over backwards to get a DCAS, or NCAS. It would be infested with strange indirection ala d"escriptors", and involved a shit load of atomic RMW's. CAS, DWCAS, XCHG and XADD can get a lot done.

    I am trying to convey that a lot of neat algos do not even need the
    fancy DCAS, NCAS.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Dec 12 23:39:53 2025
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/12/2025 2:37 PM, Chris M. Thomasson wrote:
    On 12/8/2025 12:06 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide >>>>>>>>> enough
    guarantees for multi-threaded coding on multi-core systems. >>>>>>>>> Basically,
    it only works at the C abstract machine level - it does nothing >>>>>>>>> that
    affects the hardware.  So volatile writes are ordered at the C >>>>>>>>> level,
    but that says nothing about how they might progress through >>>>>>>>> storage
    queues, caches, inter-processor communication buses, or whatever. >>>>>>>>
    You describe in many words and not really to the point what can be >>>>>>>> explained concisely as: "volatile says nothing about memory >>>>>>>> ordering
    on hardware with weaker memory ordering than sequential
    consistency".
    If hardware guaranteed sequential consistency, volatile would >>>>>>>> provide
    guarantees that are as good on multi-core machines as on single- >>>>>>>> core
    machines.

    However, for concurrent manipulations of data structures, one wants >>>>>>>> atomic operations beyond load and store (even on single-core >>>>>>>> systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes >>>>>> bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32   DWs   as a single ATOMIC instruction. >>>>> MM      can MOV   up to 8192 bytes as a single ATOMIC instruction.


    The functions below rely on more than that - to make the work, as
    far as
    I can see, you need the first "esmLOCKload" to lock the bus and also >>>> lock the core from any kind of interrupt or other pre-emption, lasting >>>> until the esmLOCKstore instruction.  Or am I missing something here? >>>
    Lock the BUS? Only when shit hits the fan. What about locking the cache >>> line? Actually, I think we can "force" an x86/x64 to lock the bus if we >>> do a LOCK'ed RMW on memory that straddles cache lines?

    In the My 66000 case, Mem References can lock up to 8 cache lines.

    Pretty flexible wrt implementing those exotic things back in the day, experimental algos that need DCAS, KCSS, ect... A heck of a lot of
    things can be accomplished with DWCAS, aka cmpxchg8b on a 32 bit system. or cmpxchg16b on a 64-bit system.

    People would bend over backwards to get a DCAS, or NCAS. It would be infested with strange indirection ala d"escriptors", and involved a shit load of atomic RMW's. CAS, DWCAS, XCHG and XADD can get a lot done.

    Have you ever read about KCSS?

    https://groups.google.com/g/comp.arch/c/shshLdF1uqs

    https://patents.google.com/patent/US7293143

    While I was not directly exposed to KCSS, I was exposed to the underlying
    need for multi-location Compare and Swap requirements, and provided a means
    to implement same in both ASF and ESM. {All of us (synchronization people)
    were so exposed. And a lot of academic ideas came out of those trends, too.}

    In my case, I simply wanted a way "out" of inventing a new synchronization primitive every ISA generation. What my solution entails is a modification
    to the cache coherence model (NaK) that indicates "Yes I have the line you referenced, but, no you can't have it right now" in order to strengthen
    the guarantees of forward progress.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 12 15:52:45 2025
    From Newsgroup: comp.arch

    On 12/6/2025 11:04 AM, Scott Lurndal wrote:
    [...]
    Most extant SMP processors provide a compare and swap operation, which
    are widely supported by the common compilers that support the C and C++ threading functionality.

    Right. However, a DWCAS is important as well... Well, for me... This
    only works on contiguous words.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 12 15:56:52 2025
    From Newsgroup: comp.arch

    On 12/8/2025 4:31 PM, Chris M. Thomasson wrote:
    On 12/8/2025 9:14 AM, Scott Lurndal wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

    I am having trouble understanding how the block of code in the
    esmINTERFERENCE() block is protected so that the whole thing
    executes as
    a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including
    across
    buffers and bus bridges. It would have to go to the memory coherence
    point. Otherwise, some other device using a bridge could update the
    same
    address range in the middle of an update.

    I may be wrong about this, but I think you have a misconception.  The
    ESM doesn't *prevent* interference, but it *detect* interference.  Thus >>> nothing is required of other cores, no locks, etc.  If they write to a
    "protected" location, the write is allowed, but the core in the ESM is
    notified, so it can redo the ESM protected code.

    Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
    (the basis of the Store-Exclusive/Load-Exclusive instructions, which
    mirror the LL/SC paradigm).  The ARMv8 monitors an implementation defined >> range surrounding the target address and the store will fail if any other
    agent has modified any byte within the exclusive range.

    Any mutation the reservation granule?

    I forgot whether a load from the reservation granule would cause an LL/SC to
    fail; I know a store would. False sharing in poorly written programs
    would cause it to occur--LL/SC experiencing livelock. This was back in
    my PPC days.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Dec 13 09:31:05 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    What my solution entails is a modification
    to the cache coherence model (NaK) that indicates "Yes I have the line you >referenced, but, no you can't have it right now" in order to strengthen
    the guarantees of forward progress.

    How does it strengthen the guarantees of forward progress? My guess:
    If the requester itself is in an atomic sequence B, it will cancel it.
    This could help if the atomic sequence A that caused the NaK then
    tries to get a cache line that would be kept by B.

    There is still a chance of both sequences canceling each other by
    sending NaKs at the same time, but it is smaller and with something
    like exponential backoff eventual forward progress could be achieved.
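
    Something like this, in portable C - the delay constants and the pause hint
    are placeholders, not anything an architecture mandates:

        #include <stdatomic.h>
        #include <stdint.h>

        static inline void cpu_relax(void)
        {
        #if defined(__x86_64__) || defined(__i386__)
            __builtin_ia32_pause();            /* x86 PAUSE hint */
        #endif
        }

        /* CAS retry loop with exponential backoff, so two colliding
           sequences are unlikely to keep cancelling each other forever. */
        void inc_with_backoff(_Atomic uint64_t *ctr)
        {
            uint64_t old = atomic_load_explicit(ctr, memory_order_relaxed);
            unsigned delay = 1;
            while (!atomic_compare_exchange_weak_explicit(ctr, &old, old + 1,
                       memory_order_acq_rel, memory_order_relaxed)) {
                for (unsigned i = 0; i < delay; i++)
                    cpu_relax();
                if (delay < (1u << 12))        /* cap the backoff */
                    delay <<= 1;
            }
        }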

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 13 19:03:07 2025
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/8/2025 4:31 PM, Chris M. Thomasson wrote:
    On 12/8/2025 9:14 AM, Scott Lurndal wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

    I am having trouble understanding how the block of code in the
    esmINTERFERENCE() block is protected so that the whole thing
    executes as
    a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including
    across
    buffers and bus bridges. It would have to go to the memory coherence >>>> point. Otherwise, some other device using a bridge could update the >>>> same
    address range in the middle of an update.

    I may be wrong about this, but I think you have a misconception.  The >>> ESM doesn't *prevent* interference, but it *detect* interference.  Thus >>> nothing is required of other cores, no locks, etc.  If they write to a >>> "protected" location, the write is allowed, but the core in the ESM is >>> notified, so it can redo the ESM protected code.

    Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
    (the basis of the Store-Exclusive/Load-Exclusive instructions, which
    mirror the LL/SC paradigm).  The ARMv8 monitors an implementation defined >> range surrounding the target address and the store will fail if any other >> agent has modified any byte within the exclusive range.

    Any mutation the reservation granule?

    I forgot if a load from the reservation granule would cause a LL/SC to
    fail. I know a store would. False sharing in poorly written programs
    would cause it to occur. LL/SC experiencing live lock. This was back in
    my PPC days.

    A LD to the granule would cause loss of write permission, causing a long
    delay to perform the SC and greatly increasing the probability of interference.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 13 19:12:28 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    What my solution entails is a modification
    to the cache coherence model (NaK) that indicates "Yes I have the line you >referenced, but, no you can't have it right now" in order to strengthen >the guarantees of forward progress.

    How does it strengthen the guarantees of forward progress?

    The allowance of a NaK is only available under somewhat special
    circumstances::
    a) in Careful mode:: when core can see that all STs have write permission
    and data is present, NaKs allow the Modification part to run to
    completion.
    b) In Slow and Methodical mode:: core can NaK any access to any of its
    cache lines--preventing interference.

    My guess:
    If the requester itself is in an atomic sequence B, it will cancel it.

    Yes, the "other guy" takes the hit not the guy who has made more forward progress. If B was an innocent accessor of the data, it retires its request--this generally takes 100-odd cycles, allowing A to complete
    the event by the time the innocent request shows up again.

    This could help if the atomic sequence A that caused the NaK then
    tries to get a cache line that would be kept by B.

    There is still a chance of both sequences canceling each other by
    sending NaKs at the same time, but it is smaller and with something
    like exponential backoff eventual forward progress could be achieved.

    Instead of some contrived back-off policy--at the failure point one can
    read the WHY register. 0 indicates success; negative indicates spurious, positive indicates how far down the line of requestors YOU happen to be.
    So, if you are going after a unit of work, you march down the queue WHY
    units and then YOU are guaranteed that YOU are the only one after that
    unit of work.
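
    If I read that right, in C-ish pseudo-form it comes out something like the
    sketch below; the esmWHY() intrinsic name, the work-queue accessor, and
    try_claim_event() are all hypothetical, only there to illustrate "march down
    the queue WHY units":

        #include <stdbool.h>

        struct work;
        extern struct work *work_queue_at(int slot);    /* hypothetical */
        extern bool try_claim_event(struct work *w);    /* the esm event itself */
        extern int  esmWHY(void);                       /* hypothetical name */

        struct work *claim_next_unit(int slot)
        {
            for (;;) {
                struct work *w = work_queue_at(slot);
                if (try_claim_event(w))
                    return w;                  /* WHY == 0: success */
                int why = esmWHY();
                if (why > 0)
                    slot += why;               /* that many requestors are ahead;
                                                  go claim a later unit instead */
                /* why < 0: spurious failure; retry the same unit */
            }
        }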


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sat Dec 13 11:46:17 2025
    From Newsgroup: comp.arch

    On 12/13/2025 11:12 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    What my solution entails is a modification
    to the cache coherence model (NaK) that indicates "Yes I have the line you >>> referenced, but, no you can't have it right now" in order to strengthen
    the guarantees of forward progress.

    How does it strengthen the guarantees of forward progress?

    The allowance of a NaK is only available under somewhat special circumstances::
    a) in Careful mode:: when core can see that all STs have write permission
    and data is present, NaKs allow the Modification part to run to
    completion.
    b) In Slow and Methodical mode:: core can NaK any access to any of its
    cache lines--preventing interference.

    My guess:
    If the requester itself is in an atomic sequence B, it will cancel it.

    Yes, the "other guy" takes the hit not the guy who has made more forward progress. If B was an innocent accessor of the data, it retires its request--this generally takes 100-odd cycles, allowing A to complete
    the event by the time the innocent request shows up again.

    This could help if the atomic sequence A that caused the NaK then
    tries to get a cache line that would be kept by B.

    There is still a chance of both sequences canceling each other by
    sending NaKs at the same time, but it is smaller and with something
    like exponential backoff eventual forward progress could be achieved.

    Instead of some contrived back-off policy--at the failure point one can
    read the WHY register. 0 indicates success; negative indicates spurious, positive indicates how far down the line of requestors YOU happen to be.
    So, if you are going after a unit of work, you march down the queue WHY
    units and then YOU are guaranteed that YOU are the only one after that
    unit of work.

    Step one: make sure that a failure means another thread made progress.
    Strong CAS does this. Don't let it spuriously fail where nothing makes
    progress... ;^o
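
    In C11 terms, a minimal sketch of the distinction being drawn here: a
    *strong* compare-exchange only fails when the value really changed, i.e.
    some other thread made progress, while the *weak* form--typically mapped
    onto LL/SC--may also fail spuriously and so has to live in a loop:

    #include <stdatomic.h>
    #include <stdbool.h>

    static _Atomic unsigned long counter;

    bool bump_strong(void)
    {
        unsigned long old = atomic_load(&counter);
        /* Failure here implies 'counter' changed under us: system-wide
         * progress even when this thread loses the race.               */
        return atomic_compare_exchange_strong(&counter, &old, old + 1);
    }

    void bump_weak(void)
    {
        unsigned long old = atomic_load(&counter);
        /* The weak form may fail even though 'counter' still equals 'old'
         * (e.g. a lost reservation), hence the retry loop; 'old' is
         * refreshed with the current value on each failure.             */
        while (!atomic_compare_exchange_weak(&counter, &old, old + 1))
            ;
    }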

    Oh my, we got a load on the reservation granule: abort all LL/SC in
    progress wrt that granule. Of course this assumes that the user that
    created the program for it gets things right. For LL/SC on the PPC it
    definitely helps where things are aligned and padded up to a reservation
    granule, not just an L2 cache line. That helps mitigate false sharing
    causing livelock.
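
    A minimal sketch of that padding rule, assuming a 128-byte reservation
    granule (the real figure is implementation defined; substitute the
    target's value):

    #include <stdalign.h>
    #include <stdatomic.h>

    #define RESERVATION_GRANULE 128   /* assumed size, not a mandated PPC value */

    /* One independently updated atomic per granule, so stores to one slot
     * cannot keep killing the reservation (or the CAS) on another slot.    */
    struct counter_slot {
        alignas(RESERVATION_GRANULE) _Atomic unsigned long value;
        char pad[RESERVATION_GRANULE - sizeof(_Atomic unsigned long)];
    };

    static struct counter_slot per_thread_counter[64];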

    Even in weak CAS, akin to LL/SC. Well, how sensitive is that reservation
    granule? Can a simple load cause a failure?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sat Dec 13 11:49:46 2025
    From Newsgroup: comp.arch

    On 12/13/2025 11:03 AM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/8/2025 4:31 PM, Chris M. Thomasson wrote:
    On 12/8/2025 9:14 AM, Scott Lurndal wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

    I am having trouble understanding how the block of code in the
    esmINTERFERENCE() block is protected so that the whole thing executes as
    a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including across
    buffers and bus bridges. It would have to go to the memory coherence
    point. Otherwise, some other device using a bridge could update the same
    address range in the middle of an update.

    I may be wrong about this, but I think you have a misconception.  The
    ESM doesn't *prevent* interference, but it *detects* interference.  Thus
    nothing is required of other cores, no locks, etc.  If they write to a
    "protected" location, the write is allowed, but the core in the ESM is
    notified, so it can redo the ESM protected code.

    Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
    (the basis of the Store-Exclusive/Load-Exclusive instructions, which
    mirror the LL/SC paradigm).  The ARMv8 monitors an implementation-defined
    range surrounding the target address and the store will fail if any other
    agent has modified any byte within the exclusive range.
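
    Roughly what those exclusives look like in practice (hand-written here
    purely for illustration; compilers emit essentially the same loop for
    C11 atomics on ARMv8 cores without the LSE atomics):

    #include <stdint.h>

    /* Illustrative AArch64 atomic add built on the exclusive monitor:
     * LDAXR marks the address exclusive, STLXR succeeds only if no other
     * agent touched the (implementation-defined) exclusive range in between. */
    static inline uint64_t atomic_add_exclusive(uint64_t *p, uint64_t v)
    {
        uint64_t old, tmp;
        uint32_t fail;
        do {
            __asm__ volatile(
                "ldaxr %0, [%3]\n\t"        /* load-exclusive, acquire             */
                "add   %1, %0, %4\n\t"
                "stlxr %w2, %1, [%3]\n\t"   /* store-exclusive, %w2 = 0 on success */
                : "=&r"(old), "=&r"(tmp), "=&r"(fail)
                : "r"(p), "r"(v)
                : "memory");
        } while (fail != 0);
        return old;
    }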

    Any mutation to the reservation granule?

    I forgot whether a load from the reservation granule would cause an LL/SC
    to fail; I know a store would. False sharing in poorly written programs
    would cause it to occur, LL/SC experiencing livelock. This was back in
    my PPC days.

    A LD to the granule would cause loss of write permission, causing a long
    delay to perform the SC and greatly increasing the probability of
    interference.

    So, you need to create a rule. If you program for my system, you MUST
    make sure that everything is properly aligned and padded. Been there,
    done that. Now, think of nefarious agents... I was able to cause damage
    to a simple strong CAS loop with another thread (or threads) mutating the
    cache line on purpose, as a stress test... CAS would start hitting higher
    and higher failure rates, and finally hit the BUS to ensure some sort of
    forward progress.
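
    A sketch of that kind of stress test (names and thread counts invented;
    the effect described shows up on LL/SC-style hardware such as PPC or
    pre-LSE ARM, where a write anywhere in the reservation granule can make
    the weak CAS fail, not on x86's locked CMPXCHG):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    struct shared_line {
        _Atomic unsigned long counter;   /* target of the CAS loop             */
        _Atomic unsigned long noise;     /* deliberately left in the same line */
    };

    static struct shared_line line;
    static atomic_bool  stop;
    static atomic_ulong failures;

    static void *cas_worker(void *arg)
    {
        (void)arg;
        while (!atomic_load(&stop)) {
            unsigned long old = atomic_load(&line.counter);
            /* Weak CAS: on LL/SC hardware, the nefarious stores to 'noise'
             * can make this fail even though 'counter' never changed.      */
            if (!atomic_compare_exchange_weak(&line.counter, &old, old + 1))
                atomic_fetch_add(&failures, 1);
        }
        return NULL;
    }

    static void *nefarious_worker(void *arg)
    {
        (void)arg;
        while (!atomic_load(&stop))
            atomic_fetch_add(&line.noise, 1);    /* keep dirtying the line */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[3];
        pthread_create(&t[0], NULL, cas_worker, NULL);
        pthread_create(&t[1], NULL, nefarious_worker, NULL);
        pthread_create(&t[2], NULL, nefarious_worker, NULL);
        sleep(2);                                /* let the contention run */
        atomic_store(&stop, true);
        for (int i = 0; i < 3; i++)
            pthread_join(t[i], NULL);
        printf("weak-CAS failures observed: %lu\n",
               (unsigned long)atomic_load(&failures));
        return 0;
    }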
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 13 21:58:07 2025
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/13/2025 11:12 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    What my solution entails is a modification
    to the cache coherence model (NaK) that indicates "Yes I have the line you
    referenced, but, no you can't have it right now" in order to strengthen
    the guarantees of forward progress.

    How does it strengthen the guarantees of forward progress?

    The allowance of a NaK is only available under somewhat special circumstances::
    a) in Careful mode:: when core can see that all STs have write permission
    and data is present, NaKs allow the Modification part to run to
    completion.
    b) In Slow and Methodical mode:: core can NaK any access to any of its
    cache lines--preventing interference.

    My guess:
    If the requester itself is in an atomic sequence B, it will cancel it.

    Yes, the "other guy" takes the hit not the guy who has made more forward progress. If B was an innocent accessor of the data, it retires its request--this generally takes 100-odd cycles, allowing A to complete
    the event by the time the innocent request shows up again.

    This could help if the atomic sequence A that caused the NaK then
    tries to get a cache line that would be kept by B.

    There is still a chance of both sequences canceling each other by
    sending NaKs at the same time, but it is smaller and with something
    like exponential backoff eventual forward progress could be achieved.

    Instead of some contrived back-off policy--at the failure point one can
    read the WHY register. 0 indicates success; negative indicates spurious,
    positive indicates how far down the line of requestors YOU happen to be.
    So, if you are going after a unit of work, you march down the queue WHY
    units and then YOU are guaranteed that YOU are the only one after that
    unit of work.

    Step one: make sure that a failure means another thread made progress.
    Strong CAS does this. Don't let it spuriously fail where nothing makes
    progress... ;^o

    Absolutely!

    WHY is only valid in "Slow and Methodical" mode, which has strong guarantees
    of forward progress--at least 1 thread is making forward progress in S&M.

    Spurious has to do with things like "system arbiter buffer overflow" and
    is not related to exceptions or interrupts.

    Oh my we got a load on the reservation granule, abort all LL/SC in
    progress wrt that granule. Of course this assumes that the user that
    created the program for it gets things right.

    This is why I created NaK in the cache coherence protocol--to strengthen
    the guarantee of forward progress.

    For a LL/SC on the PPC it definitely helps where things are aligned and padded up to a reservation granule, not just a l2 cache line. Helps mitigate false sharing causing livelock.

    Even in weak CAS, akin to LL/SC. Well, how sensitive is that reservation granule. Can a simple load cause a failure?

    An innocent LD gets NaKed, causing the innocent thread to waste time while
    allowing the ATOMIC event to make forward progress.

    In my case the reservation granule is a cache line {which is the same across
    the memory hierarchy--but still allows for an implementation-defined size}.

    For example:: HBM can deliver 1024 bits (soon 2048 bits) in a single beat,
    so, for main_memory == HBM, it makes sense to align the LLC line size to
    the width of an HBM beat. Once in the LLC, you can parcel it out any way
    your system prescribes.
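
    For concreteness, the arithmetic: a 1024-bit beat is 128 bytes, so a
    128-byte LLC line fills in exactly one beat and, assuming 64-byte inner
    cache lines, parcels out as two of them; a 2048-bit beat would similarly
    suggest 256-byte LLC lines, or four 64-byte lines.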
    --- Synchronet 3.21a-Linux NewsLink 1.2