• Re: Tonights Tradeoff

    From Robert Finch@robfi680@gmail.com to comp.arch on Tue Oct 28 23:52:53 2025
    From Newsgroup: comp.arch

    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Same for other registers. GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.
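
    Roughly, in C terms (an illustrative sketch only; the predicate names
    and bit positions here are hypothetical, not the actual encoding):

    #include <stdint.h>

    /* CMP rd,ra,rb: pack several predicates into one GPR as a bit vector. */
    static uint64_t cmp_bits(int64_t a, int64_t b)
    {
        uint64_t r = 0;
        if (a == b)                     r |= 1ull << 0;  /* EQ  */
        if (a <  b)                     r |= 1ull << 1;  /* LT  */
        if ((uint64_t)a < (uint64_t)b)  r |= 1ull << 2;  /* LTU */
        return r;
    }

    /* BBS rn,#bit,target: take the branch if the selected bit is set. */
    #define BBS(r, bit) (((r) >> (bit)) & 1)
    /* e.g.: if (BBS(cmp_bits(x, y), 1)) goto less_than_case; */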

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10, 50, 90,
    or 130 bits.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Oct 29 00:14:08 2025
    From Newsgroup: comp.arch

    On 10/28/2025 8:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Same for other registers.

    I assume the "high" registers are for handling 128 bit operations
    without the need to specify another register name. Do you have 5 or 6
    bit register numbers in the instructions. Five allows you to use the
    high registers for 128 bit operations without needing another register specifier, but then the high registers can only be used for 128 bit operations, which seems a waste. If you have six bits, you can use all
    64 registers for any operation, but how is the "upper" method that
    better than automatically using r(x+1)?
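
    (For illustration only: whichever naming scheme is used, each 128-bit
    operand still occupies two 64-bit registers, e.g. A1:A1H vs R2:R3; the
    hedged C sketch below just shows the halves a 128-bit add has to touch.)

    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } u128;  /* low/high register halves */

    static u128 add128(u128 a, u128 b)
    {
        u128 r;
        r.lo = a.lo + b.lo;
        r.hi = a.hi + b.hi + (r.lo < a.lo);    /* carry out of the low half */
        return r;
    }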



    GPRs may contain either integer or
    floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those seem like a call from the My 66000 playbook, which I like.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 04:29:15 2025
    From Newsgroup: comp.arch

    On 10/28/2025 10:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Same for other registers. GPRs may contain either integer or floating-point values.


    OK.

    I mostly stuck with 32-bit encodings; 40 bits could allow more
    encoding space, but has the drawback of being non-power-of-2.

    But, yeah, occasionally dealing with 128-bit data is a major case for 64
    GPRs and paired registers.


    Well, that and when co-existing with RV64G, it gives somewhere to put
    the FPRs. But, in turn this was initially motivated by me failing to
    figure out how to get GCC configured to target Zfinx/Zdinx.


    Had ended up going with the Even/Odd pairing scheme as it is less wonky
    IMO to deal with R5:R4 than R36:R4.


    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.


    BT/BF works well. I otherwise also ended up using RISC-V style branches,
    which I originally disliked due to higher implementation cost, but they
    do technically allow for higher performance than just BT/BF or Branch-Compare-with-Zero in 2-R cases.

    So, it becomes harder to complain about a feature that does technically
    help with performance.


    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.


    Hmm...

    My case: 10/33/64.
    No direct 128-bit constant, but can use two 64-bit constants whenever
    128 bits is needed.



    Otherwise, goings on in my land:
    ISA development is slow, and had mostly turned into bug hunting;
    There are some unresolved bugs, but I haven't been able to fully hunt
    them down. A lot was in relation to RISC-V's C extension, but at least
    it seems like at this point the C extension is likely fully working.

    There haven't been many features that can usefully increase general-case
    performance. So, it is starting to seem like XG2 and XG3 may be fairly
    stable at this point.

    The longer term future is uncertain.


    My ISAs can beat RISC-V in terms of code-density and performance, but
    when RISC-V is extended with similar features, it is harder to make
    a case that it is "enough".

    Doesn't seem like (within the ISA) there are many obvious ways left to
    grab large general-case performance gains over what I have done already.

    Some code benefits from lots of GPRs, but harder to make the case that
    it reflects the general case.



    Recently got a new very-cheap laptop (a Dell Latitude 7490, for around
    $240), made some curious observations:
    It seems to slightly outperform my main PC in single-threaded performance;
    Its RAM timings don't seem to match the expected values.

    My main PC still wins at multi-threaded performance, and has the
    advantage of 7x more RAM.

    Had noted in Cinebench that my main PC is actually performing a little
    slower than is typical for the 2700X, but then again, it is effectively
    a 2700X running with DDR4-2133 rather than DDR4-2933. Partly this was a
    case of the RAM I have being unstable if run that fast (and in this
    case, more RAM but slightly slower seemed preferable to less RAM but
    slightly faster, or running it slightly faster but having the computer
    be crash-prone).

    They sold the RAM with its on-the-box speed being the XMP2 settings
    rather than the baseline settings, but the RAM in question didn't run
    reliably at the XMP or XMP2 settings (and I wasn't inclined to spend more;
    more so when there was already the annoyance that my MOBO chipset
    apparently doesn't deal with a full 128GB, but can tolerate 112GB, which
    is maybe not an ideal setup for perf).

    So, yeah, it seems that I have a setup where the 2700X is getting worse single-threaded performance than the i7 8650U in the laptop.

    Apparently, going by Cinebench scores, my PC's single threaded
    performance is mostly hanging out with a bunch of Xeons (getting a score
    in R23 of around 700 vs 950).

    Well, could be addressed, in theory, but would need some RAM that
    actually runs reliably at 2933 or 3200 MT/s and is also cheap...


    In both cases, they are CPUs originally released in 2018.

    Had noted, in a few tests:
    LZ4 benchmark (same file):
    Main PC: 3.3 GB/s
    Laptop: 3.9 GB/s
    memcpy (single threaded):
    Main PC: 3.8 GB/s
    Laptop : 5.6 GB/s
    memcpy (all threads):
    Main PC: ~ 15 GB/s
    Laptop : ~ 24 GB/s
    ( Like, what; thing only has 1 stick of RAM... *1 )

    *1: Also, how is a laptop with 1 stick of RAM matching a dual-socket
    Xeon E5410 with like 8 sticks of RAM...

    or, maybe it was just weak that my main PC was failing to beat the Xeon
    at this?... My main PC does at least beat the Xeon at single-threaded performance (was less true of my older Piledriver based PC).


    Granted, then again, I am using (almost) the cheapest MOBO I could find
    at the time (that had an OK number of RAM slots and SATA connectors).
    Can't quite identify the MOBO or chipset as I lost the box (and not
    clearly labeled on the MOBO itself); except that it is a
    something-or-another ASUS board.

    Like, at the time, IIRC:
    Went on Newegg;
    Picked mostly the cheapest parts on the site;
    Say, a Zen+ CPU being a lot cheaper than Zen 2,
    or pretty much anything from Intel.
    ...


    Did get a slightly fancy/beefy case, but partly this was because I was
    annoyed with the late-90s-era beige tower case I had been using. Which I
    had ended up hot gluing a bunch of extra PC fans into the thing in an
    attempt to keep airflow good enough so that it didn't melt. And
    under-clocking the CPU so that it could run reliably.

    Like, 4GHz Piledriver ran too hot and was unreliable, but was far more
    stable at 3.4 GHz. Was technically faster than a Phenom II underclocked
    to 2.8 GHz (for similar reasons).

    Where, at least the Zen+ doesn't overheat at stock settings (but, they
    also supplied the thing with a comparably much bigger stock CPU cooler).

    The case I got is slightly more traditional, with 5.25" bays and similar
    and mostly sheet-steel construction, Vs the "new" trend of mostly glass-covered-box PC cases. Sadly, it seems like companies have mostly
    stopped selling the traditional sheet-steel PC cases with open 5.25"
    bays. Like, where exactly is someone supposed to put their DVD-RW drive,
    or hot-swap HDD trays ?...

    Well, in the past we also had floppy drives, but the MOBOs removed the
    connectors, forcing one to now go the USB route if they want a floppy
    drive (but, now mostly moot as relatively few other computers still have
    floppy drives either).




    Well, in theory could build a PC with newer components and a bigger
    budget for parts. Still wouldn't want to go over to Win11, now it is a
    choice between jumping to Linux or "Windows Server" or similar (like, at
    least they didn't pollute Windows Server with a bunch of random
    pointless crap).

    For now, the inertia option is to just keep using Win10.


    As for the laptop, had noted:
    Can run Minecraft:
    Yes; though best results at an 8-chunk draw distance.
    Much more than this, and the "Intel UHD" graphics struggle.
    At 12 chunks, there is obvious chug.
    At 16 chunks, it starts dropping into single digit territory.
    Can run Doom3:
    Yes: Mostly gets 40-50 fps in Doom 3.

    My main PC can manage a 16-chunk draw distance in Minecraft and mostly
    gets a constant 63 fps in Doom3.

    Don't have many other newer games to test, as I mostly lost interest in
    modern "AAA" games. And, stuff like Doom+RTX, I already know this won't
    work. I can mostly just be happy that Minecraft works and is playable
    (and that its GPU is solidly faster than just using a software renderer...).


    On both fronts, this is a significant improvement over the older laptop.
    For the price, I sort of worried that it would be dead slow, but it significantly outperforms its Vista-era predecessor.

    This is mostly because I had noticed that, right now (unlike a few years
    ago), there are actually OK laptops at cheap prices (along with all the
    $80 Dell OptiPlex computers and similar on Amazon...).



    Otherwise, went and recently wrote up a spec partly based on a BASIC
    dialect I had used in one of my 3D engines, with some design cleanup: https://pastebin.com/2pEE7VE8

    Where I was able to get a usable implementation for something similar in
    a little over 1000 lines of C.

    Though, this was for an Unstructured BASIC dialect.


    Decided then to try something a little harder:
    Doing a small JavaScript like language, and trying to keep the
    interpreter small.

    I don't yet have the full language implemented, but for a partial JS
    like language, I currently have something in around 2500 lines of C.

    I had initially set a target estimate of 4-8 kLOC.
    Unless the remaining functionality ends up eating a lot of code, I am on target towards hitting the lower end of this range (need to get most of
    the rest of the core-language implemented within around 1.5 kLOC or so).

    Note: No 3rd party libraries allowed, only the normal C runtime library.
    Did end up using a few C99 features, but mostly still C95.


    For now, I was calling the language BS3L, where:
    Dynamically typed;
    Supports: Integers, Floating-Point, Strings, Objects, Arrays, ...
    JS style syntax;
    Nothing too exciting here.
    Still has JS style arrays and objects;
    Dynamically scoped.
    Where, dynamic scoping needs less code than lexical scoping;
    But, dynamic scoping is also a potential foot-gun as well.
    Not sure if too much of a foot-gun.
    Vs going to C-style scoping;
    Or, biting the bullet and properly implementing lexical scoping.
    Leaving out most advanced features.
    will be fairly minimal even vs early versions of JS.

    But, in some cases, was borrowing some design ideas from the BASIC interpreter. There were some unavoidable costs, such as in this case
    needing a full parser (that builds an AST) and an AST-walking
    interpreter. Unlike BASIC, it wouldn't be possible to implement an
    interpreter by directly walking and pattern matching lists of tokens.

    And, a parser that builds an AST, and code to walk said AST, necessarily
    needs more code.

    I guess, it is a question whether someone else could manage to implement a
    JavaScript style language in under 1000 lines of C while also writing
    "relatively normal" C (no huge blocks of obfuscated code or rampant abuse
    of the preprocessor). Or, basically, where one has to stick to similar C
    coding conventions to those used in Doom and Quake.


    I am not sure if this would be possible. Both the dynamic type-system
    and parser have eaten up a fair chunk of the code budget. A sub 1000
    line parser is also a little novel; but the parser itself got a little
    wonky and doesn't fully abstract over what it parses (as there is still
    a fair bit of bleed-over from the token stream). And, it sorta ended up abusing the use of binary operators a little.

    For example, it has wonk like dealing with lists of statements as-if
    there were a right-associative semicolon operator (allowing it to be
    walked like a linked list).
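
    A minimal sketch of that shape (node and function names here are
    hypothetical): statement sequences parse into right-nested ";" nodes,
    which the evaluator can then walk like a list:

    struct ast {
        const char *op;         /* ";", "=", "+", "call", "lit", ... */
        struct ast *lhs, *rhs;  /* binary children; NULL where unused */
    };

    static void eval_stmt(struct ast *n);   /* per-statement evaluator */

    /* "a; b; c" parses as (";" a (";" b c)) and is walked iteratively. */
    static void eval_block(struct ast *n)
    {
        while (n && n->op[0] == ';') {
            eval_stmt(n->lhs);
            n = n->rhs;
        }
        if (n)
            eval_stmt(n);
    }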

    There is slightly wonky operator tokenization again to save code:
    Separately matching every possible operator pattern is a bunch of extra
    logic. Was using rules that mostly give the correct operators, but with
    the possibility of nonsense operators. Also the precedence levels don't
    match up exactly, but this is a lower priority issue.


    I guess, if someone thinks they can do so in significantly less code,
    they can try.

    Note that while a language like Lua sort of resembles an intermediate
    between BASIC and JavaScript, I wouldn't expect Lua to save that much
    here (it would still have the cost of needing to build an AST and similar).

    Going from an AST to a bytecode or 3AC IR would allow for higher
    performance.

    But, I decided to go for an AST walking interpreter in this case as it
    would be the least LOC.


    Actually takes more effort trying to keep the code small. Rather than
    just copy-pasting stuff a bunch of times, one spends more time needing
    to try to factor out and reuse common patterns.


    Though, in a way, some of this is revisiting stuff I did 20+ years ago,
    but from a different perspective.

    Like, 20+ years ago, my first interpreters also used AST walkers.

    As for where I will go with this, I don't know.
    Some of it could make sense as a starting point for a GLSL compiler;
    Or maybe adapted into parsing the SCAD language;

    Or, as a cheaper alternative to what my first script VM became.
    By the end of its span, it had become quite massive...
    Though, still not too bad if compared with SpiderMonkey or V8.

    Ironically, my jump to a Java + JVM/.NET like design was actually to
    make it simpler.

    For a simple but slow language, JS works, but if you want it fast it
    quickly turns worse (and simpler to jump to a more traditional
    statically typed language). Like, there was this thing, known as "Hindley-Milner Type Inference", which on one hand, could be used to
    make a JavaScript style language fast (by turning it transparently into
    a statically-typed language), but also, was a huge PITA to deal with
    (this was combined in my VM with optional explicit type declarations;
    with a syntax inspired by ActionScript).


    Well, and when something gets big and complicated enough that one almost
    may as well just use SpiderMonkey or similar to run their JS code, this
    is a problem...

    Still less bad than LLVM, not sure why anyone would willingly submit to
    this.


    Well, there is still a surviving descendant of the original VM (although branching off from an earlier form) in the form of BGBCC.

    Though, makes more sense to do a clean interpreter in this case, than to
    try to build one by copy-pasting the parser from BGBCC or my old VM and
    trying to build a new lighter-weight VM.

    In some of these cases, it is easier to scale up than to scale back down:
    easier to take simpler code and add features or improve performance
    than to take more complex code and try to trim it down.


    And, sometimes it does make more sense to just write something starting
    from a clean slate.

    Well, except for my attempt at a clean-slate C compiler; that was
    more a case of realizing I wouldn't undershoot BGBCC by enough to be
    worthwhile, and there were some new problem points emerging in the
    design. Partly as I was trying to follow a model more like that used
    by GCC and binutils, which I was then left to suspect is not the right
    approach (and in some ways, the approach I had used in BGBCC seemed to
    make more sense than trying to imitate how GCC does things).

    Might still make sense at some point to try for another clean-slate C
    compiler, though if I would still end up taking a similar general
    approach to BGBCC (or .NET), there isn't a huge incentive (vs continuing
    to use BGBCC).

    Where, say, the main things that would ideally need to change are
    BGBCC's compile-time performance and memory footprint. As-is, compiling
    with BGBCC is about as slow as compiling with GCC, which isn't great.

    Comparably, MSVC is typically a bit faster at compiling stuff IME.


    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 08:41:46 2025
    From Newsgroup: comp.arch

    On 2025-10-29 3:14 a.m., Stephen Fuld wrote:
    On 10/28/2025 8:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit
    instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops.
    Registers are named as if there were 32 GPRs, A0 (arg 0 register is
    r1) and A0H (arg 0 high is r33). Same for other registers.

    I assume the "high" registers are for handling 128 bit operations
    without the need to specify another register name.  Do you have 5 or 6
    bit register numbers in the instructions.  Five allows you to use the
    high registers for 128 bit operations without needing another register specifier, but then the high registers can only be used for 128 bit operations, which seems a waste.  If you have six bits, you can use all
    64 registers for any operation, but how is the "upper" method that
    better than automatically using r(x+1)?

    Yes, but it is just a suggested usage. The registers are GPRs that can
    be used for anything, specified using a six bit register number. I
    suggested it that way because most of the time register values would be
    passed around as 64-bit quantities and it keeps the same set of
    registers for the same register type (argument, temp, saved). But since
    it should be using mostly compiled code, it does not make much difference.

    Also, the high registers could be used as FP registers. Maybe allowing
    for saving only the low order 32 regs during a context switch.

    GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch
    on bit-set/clear for conditional branches. Might also include branch
    true / false.

    Using operand routing for immediate constants and an operation size
    for the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be
    10,50,90 or 130 bits.

    Those seem like a call from the My 66000 playbook, which I like.

    Yup.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 08:50:35 2025
    From Newsgroup: comp.arch

    On 2025-10-29 8:41 a.m., Robert Finch wrote:
    On 2025-10-29 3:14 a.m., Stephen Fuld wrote:
    On 10/28/2025 8:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit
    instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops.
    Registers are named as if there were 32 GPRs, A0 (arg 0 register is
    r1) and A0H (arg 0 high is r33). Same for other registers.

    I assume the "high" registers are for handling 128 bit operations
    without the need to specify another register name.  Do you have 5 or 6
    bit register numbers in the instructions.  Five allows you to use the
    high registers for 128 bit operations without needing another register
    specifier, but then the high registers can only be used for 128 bit
    operations, which seems a waste.  If you have six bits, you can use
    all 64 registers for any operation, but how is the "upper" method that
    better than automatically using r(x+1)?

    Yes, but it is just a suggested usage. The registers are GPRs that can
    be used for anything, specified using a six bit register number. I
    suggested it that way because most of the time register values would be passed around as 64-bit quantities and it keeps the same set of
    registers for the same register type (argument, temp, saved). But since
    it should be using mostly compiled code, it does not make much difference.

    Also, the high registers could be used as FP registers. Maybe allowing
    for saving only the low order 32 regs during a context switch.

    GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch
    on bit-set/clear for conditional branches. Might also include branch
    true / false.

    Using operand routing for immediate constants and an operation size
    for the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be
    10,50,90 or 130 bits.

    Those seem like a call from the My 66000 playbook, which I like.

    Yup.


    I should mention that the high registers are available only in user/app
    mode. For other modes of operation only the low order 32 registers are available. I did this to reduce the number of logical registers in the
    design. There are about 160 (64+32+32+32) logical registers then. They
    are supported by 512 physical registers. My previous design had 224
    logical registers which eats up more hardware, probably for little benefit.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Oct 29 17:44:14 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Do you have 5 or 6
    bit register numbers in the instructions. Five allows you to use the
    high registers for 128 bit operations without needing another register
    specifier, but then the high registers can only be used for 128 bit
    operations, which seems a waste.

    These days, that's not so clear. E.g., Zen4 has 192 physical 512-bit
    SIMD registers, despite having only 256-bit wide FUs. The way I
    understand it, a 512-bit operation comes as one uop to the FU,
    occupies it for two cycles (and of course the result latency is
    extra), and then has a 512-bit result.

    The alternative would be to do as AMD did in some earlier cores,
    starting with (I think) K8: have registers that are half as wide and
    split each 512-bit operation into 2 256-bit uops that go through the
    OoO engine individually. This approach would allow more physical
    256-bit registers, and waste less on 32-bit, 64-bit, 128-bit and
    256-bit operations, but would cost additional decoding bandwidth,
    renaming bandwidth, renaming checkpoint size (a little), and scheduler
    space compared to the approach AMD has taken. Apparently the cost of
    this approach is higher than the benefit.

    Doubling the logical register size doubles the renamer checkpoint
    size, no? This way of avoiding "waste" looks quite a bit more
    expensive.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 13:04:42 2025
    From Newsgroup: comp.arch

    On 10/29/2025 7:50 AM, Robert Finch wrote:
    On 2025-10-29 8:41 a.m., Robert Finch wrote:
    On 2025-10-29 3:14 a.m., Stephen Fuld wrote:
    On 10/28/2025 8:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit
    instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops.
    Registers are named as if there were 32 GPRs, A0 (arg 0 register is
    r1) and A0H (arg 0 high is r33). Same for other registers.

    I assume the "high" registers are for handling 128 bit operations
    without the need to specify another register name.  Do you have 5 or
    6 bit register numbers in the instructions.  Five allows you to use
    the high registers for 128 bit operations without needing another
    register specifier, but then the high registers can only be used for
    128 bit operations, which seems a waste.  If you have six bits, you
    can use all 64 registers for any operation, but how is the "upper"
    method that better than automatically using r(x+1)?

    Yes, but it is just a suggested usage. The registers are GPRs that can
    be used for anything, specified using a six bit register number. I
    suggested it that way because most of the time register values would
    be passed around as 64-bit quantities and it keeps the same set of
    registers for the same register type (argument, temp, saved). But
    since it should be using mostly compiled code, it does not make much
    difference.

    Also, the high registers could be used as FP registers. Maybe allowing
    for saving only the low order 32 regs during a context switch.

    I am not as sure about this approach...

    Well, Low 32=GPR, High 32=FPR, makes sense, I did this.

    But, pairing a GPR and FPR for the 128-bit cases seems wonky; or
    subsetting registers on context switch seems like it could turn into a problem.


    Or, if a goal is to allow for encodings with a 5-bit register field,
    would make sense to use 32-bit encodings.

    Where, granted, 6b register fields in a 32-bit instruction does have the drawback of limiting how much encoding space exists for opcode and
    immediate (and one has to be more careful not to "waste" the encoding
    space as badly as RISC-V had done).

    Though, can note that both:
    R6+R6+Imm10
    R5+R5+Imm12
    Use the same amount of encoding space (22 bits of operand fields either way).
    But, R6+R6+R6 uses 3 bits more than R5+R5+R5 (18 vs 15).


    Though, one could debate my case, as I did effectively end up burning
    1/4 of the total encoding space mostly on Jumbo prefixes.

    ...



    GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a
    branch on bit-set/clear for conditional branches. Might also include
    branch true / false.

    Using operand routing for immediate constants and an operation size
    for the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be
    10,50,90 or 130 bits.

    Those seem like a call from the My 66000 playbook, which I like.

    Yup.>


    I should mention that the high registers are available only in user/app mode. For other modes of operation only the low order 32 registers are available. I did this to reduce the number of logical registers in the design. There are about 160 (64+32+32+32) logical registers then. They
    are supported by 512 physical registers. My previous design had 224
    logical registers which eats up more hardware, probably for little benefit.


    FWIW: I have gotten by OK with 128 internal registers:
    00..3F: Array-Mapped Registers (mostly the GPRs)
    40..7F: CRs and SPRs

    Mostly sufficient.

    For the array-mapped registers, these ones use LUTRAM, with a logical
    copy of the array per write port, and some control bits to encode which
    array currently holds the up-to-date copy of the register.

    All this gets internally replicated for each read port.

    So, roughly 18 internal copies of all of the registers with 6R3W (one
    copy per write port, replicated for each of the 6 read ports), but this
    is unavoidable (since LUTRAMs are 1R1W).


    The other option is using flip-flops, which is the strategy mostly used
    for the writable CRs and SPRs. This is done sparingly as the resource
    cost is higher in this case (at least on xilinx, *).

    *: Things went amiss on Altera; when I tried to build on it, I needed
    to use FFs for all the GPRs as well, as these FPGAs lack a direct
    equivalent of LUTRAMs and instead have smaller Block RAMs. The Lattice
    FPGAs also lack LUTRAM IIRC (but, my core doesn't map as well to
    Lattice FPGAs either).


    As for the CR/SPR space:
    Some of it is used for writable registers;
    A big chunk is used for internal read-only registers.
    ZZR, IMM, IMMB, JIMM, etc.
    ZZR: Zero Register / Null Register (Write)
    IMM: Immediate for current lane (33-bit, sign-ext).
    IMMB: Immediate from Lane 3.
    JIMM: 64-bit immediate spanning Lanes A and B.
    ...

    Could also be seen as C0..C63 (or, all control registers) except that
    much of C32..C63 is used for internal read-only SPRs, and a few other
    SPRs (DLR, DHR, and SP).

    Originally, the CRs and SPRs were handled as separate, but now things
    have gotten fuzzy (and, for RISC-V, some of the CRs need to be accessed
    in GPR like ways).

    There is some wonk as they were handled as separate modules, but with
    the current way things are done it would almost make more sense to fold
    all of the CRs into the GPR file module.

    The module might also continue to deal with forwarding, but might also
    make sense to have a RegisterFile module, possibly with a disjoint
    "Register Forwarding And Interlocks" style module (which forwards
    registers if the value is available and signals pipeline stalls as
    needed; this logic currently partly handled by the existing
    register-file module).



    Did experiment with a mechanism to allow bank-swapped registers. This
    would have added an internal 2-bit mode for the registers, and would
    stall the pipeline to swap the current registers with their bank-swapped versions if needed (with the registers internally backed to Block-RAM).
    Ended up mostly not using this though (at best, it wouldn't gain much
    over the existing "Load and Store everything to RAM" strategy; and would
    make context switching slower than it is already).

    It is more likely that a practical mechanism for fast bank swapping
    would need a mechanism to bank-swap the registers to external RAM. Or
    maybe a special "Stall and dump all the registers to this RAM Address" instruction.


    For the RISC-V CSRs:
    Part of the space maps to the CRs, and part maps to CPUID;
    For pretty much everything else, it traps.
    So, pretty much all of the normal RISC-V CSRs will trap.

    Ended up trapping for the RISC-V FPU CSRs as well:
    Rarely accessed;
    Rather than just one CSR for the FPU status, they broke it up into
    multiple sub-registers for parts of the register (like, there is a
    special CSR just for the rounding-mode, ...).

    Also the hardware only supports moving to/from a CR, so any more complex scenarios will also trap. They had gotten a little too fancy with this
    stuff IMO.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Oct 29 18:15:42 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Same for other registers. GPRs may contain either integer or floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some
    alignment that the first instruction of a cache line is always aligned?

    Having register pairs does not make the compiler writer's life easier, unfortunately.

    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.

    Having 64 registers and 64 bit registers makes life easier for that
    particular task :-)

    If you have that many bits available, do you still go for a load-store architecture, or do you have memory operations? This could offset the
    larger size of your instructions.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those sizes are not really a good fit for constants from programs,
    where quite a few constants tend to be 32 or 64 bits. Would a
    64-bit FP constant leave 26 bits empty?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Oct 29 11:29:54 2025
    From Newsgroup: comp.arch

    On 10/29/2025 10:44 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    Do you have 5 or 6
    bit register numbers in the instructions. Five allows you to use the
    high registers for 128 bit operations without needing another register
    specifier, but then the high registers can only be used for 128 bit
    operations, which seems a waste.

    At this point, the discussion is academic, as Robert has said he has 6
    bit register specifiers in the instructions. But my issue had nothing
    to do with SIMD registers, as he said he supported 128 bit arithmetic
    and the "high" registers were used for that. e.g.

    Add A1,A2,A3 would be a 64 bit add on those registers but
    Add128 A1,A2,A3 would be a 128 bit add using A1H for the high order
    bits of the destination, etc. So the question becomes how is using
    Rn+32 better than using Rn+1?

    That being said, your points are well taken for a different implementation.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 29 18:33:46 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Same for other registers. GPRs may contain either integer or floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on bit-set/clear for conditional branches. Might also include branch true / false.

    I have both the bit-vector compare and branch, but also a compare to zero
    and branch as a single instruction. I suggest you should too, if for no
    other reason than:

    if( p && p->next )

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    My 66000 allows for occasional use of 128-bit values but is designed mainly
    for 64-bit and smaller.

    With 32-bit instructions, I provide, {5, 16, 32, and 64}-bit constants.

    Just last week we discovered a case where HW can do a better job than SW. Previously, the compiler would emit:

    CVTfd Rt,Rf
    FMUL Rt,Rt,#1.425D0
    CVTdf Rd,Rt

    Which is subject to double rounding, once at the FMUL and again at the
    down conversion. I thought about the problem and it seems fairly easy
    to gate the 24-bit fraction into the multiplier tree along with the
    53-bit fraction of the constant, and then normalize and round the
    result dropping out of the tree--avoiding the double rounding case.

    Now, the compiler emits:

    FMULf Rd,Rf,#1.425D0

    saving 2 instructions along with the higher precision.
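
    In C terms, the old sequence computes something like the function below,
    rounding twice (once to double at the multiply, once more at the
    narrowing conversion); the fused FMULf instead rounds the exact 24x53-bit
    product once, directly to float, which plain C has no portable way to
    express:

    /* What the CVTfd/FMUL/CVTdf sequence computes. */
    static float scale_double_rounded(float x)
    {
        double t = (double)x * 1.425;   /* exact widen, then round to double */
        return (float)t;                /* second rounding, down to float    */
    }
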
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 29 18:47:09 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 10/28/2025 10:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Same for other registers. GPRs may contain either integer or floating-point values.


    OK.

    I mostly stuck with 32-bit encodings, but 40 could maybe allow more
    encoding space, but the drawback of being non-power-of-2.

    It is definitely an issue.

    But, yeah, occasionally dealing with 128-bit data is a major case for 64 GPRs and paired-registers registers.

    There is always the DBLE pseudo-instruction.

    DBLE Rd,Rs1,Rs2,Rs3

    All DBLE does is to provide more registers for the wide computation
    in such a way that the compiler is not forced to pair or share any
    registers. The other thing DBLE does is to tell the decoder that the
    next instruction is 2× as wide as its OpCode states. In lower end
    machines (and in GPUs) DBLE is sequenced as if it were an instruction.
    In higher end machines, DBLE would be CoIssued with its mate.

    ----------

    My case: 10/33/64.
    No direct 128-bit constant, but can use two 64-bit constants whenever
    128 bits is needed.

    {5, 16, 32, 64}-bit immediates.



    Otherwise, goings on in my land:
    ISA development is slow, and had mostly turned into bug hunting;
    <snip>

    The longer term future is uncertain.


    My ISA's can beat RISC-V in terms of code-density and performance, but
    when when RISC-V is extended with similar features, it is harder to make
    a case that it is "enough".

    I am still running at 70% of RISC-V's instruction count.

    Doesn't seem like (within the ISA) there are many obvious ways left to
    grab large general-case performance gains over what I have done already.

    Fewer instructions, and/or instructions that take fewer cycles to execute.

    Example, ENTER and EXIT instructions move 4 registers per cycle to/from
    cache in a pipeline that has 1 result per cycle.

    Some code benefits from lots of GPRs, but harder to make the case that
    it reflects the general case.

    There is very little to be gained with that many registers.

    Recently got a new very-cheap laptop (a Dell Latitude 7490, for around $240), made some curious observations:
    It seems to slightly outperform my main PC in single-threaded performance; Its RAM timings don't seem to match the expected values.

    My main PC still wins at multi-threaded performance, and has the
    advantage of 7x more RAM.

    My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 14:02:32 2025
    From Newsgroup: comp.arch

    On 10/29/2025 1:15 PM, Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Same for other registers. GPRs may contain either integer or
    floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some alignment that the first instruction of a cache line is always aligned?

    Having register pairs does not make the compiler writer's life easier, unfortunately.


    Yeah, and from the compiler POV, would likely prefer having Even+Odd pairs.

    Going with a bit result vector in any GPR for compares, then a branch on
    bit-set/clear for conditional branches. Might also include branch true /
    false.

    Having 64 registers and 64 bit registers makes life easier for that particular task :-)

    If you have that many bits available, do you still go for a load-store architecture, or do you have memory operations? This could offset the
    larger size of your instructions.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those sizes are not really a good fit for constants from programs,
    where quite a few constants tend to be 32 or 64 bits. Would a
    64-bit FP constant leave 26 bits empty?


    Agreed.

    From what I have seen, the vast bulk of constants tend to come in
    several major clusters:
    0 to 511: The bulk of all constants (peaks near 0, geometric fall-off)
    -64 to -1: Much of what falls outside 0 to 511.
    -32768 to 65535: Second major group
    -2G to +4G: Third group (smaller than second)
    64-bit: Another smaller spike.

    For values between 512 and 16384: Sparsely populated.
    Mostly the continued geometric fall-off from the near-0 peak.
    Likewise for values between 65536 and 1G.
    Values between 4G and 4E tend to be mostly unused.

    Like, in the sense of, if you have 33-bit vs 52 or 56-bit for a
    constant, the larger constants would have very little advantage (in
    terms of statistical hit rate) over the 33 bit constant (and, it isn't
    until you reach 64 bits that it suddenly becomes worthwhile again).
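
    (A hypothetical little helper of the sort one might use to gather such
    stats, bucketing each constant by the smallest sign-extended field that
    holds it, using a 10/17/33/64 split:)

    #include <stdint.h>

    static int imm_bucket(int64_t v)
    {
        if (v >= -(1 << 9)    && v < (1 << 9))       return 10;
        if (v >= -(1 << 16)   && v < (1 << 16))      return 17;
        if (v >= -(1ll << 32) && v < (1ll << 32))    return 33;
        return 64;
    }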


    Partly why I go with 33 bit immediate fields in the pipeline in my core,
    but nothing much bigger or smaller:
    Slightly smaller misses out on a lot, so almost may as well drop back to
    17 in this case;
    Going slightly bigger would gain pretty much nothing.

    Like, in the latter case, does sort of almost turn into a "go all the
    way to 64 bits or don't bother" thing.


    That said, I do use a 48-bit address space, so while in concept 48-bits
    could be useful for pointers: This is statistically insignificant in an
    ISA which doesn't encode absolute addresses in instructions.

    So, ironically, there are a lot of 48-bit values around, just pretty
    much none of them being encoded via instructions.


    Kind of a similar situation to function argument counts:
    8 arguments: Most of the functions;
    12: Vast majority of them;
    16: Often only a few stragglers remain.

    So, 16 gets like 99.95% of the functions, but maybe there are a few
    isolated ones taking 20+ arguments lurking somewhere in the code. One
    would then need to go up to 32 arguments to have reasonable confidence
    of "100%" coverage.

    Or, impose an arbitrary limit, where the stragglers would need to be
    modified to pass arguments using a struct or something.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Oct 29 13:05:08 2025
    From Newsgroup: comp.arch

    On 10/29/2025 11:47 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    snip
    But, yeah, occasionally dealing with 128-bit data is a major case for 64
    GPRs and paired-registers registers.

    There is always the DBLE pseudo-instruction.

    DBLE Rd,Rs1,Rs2,Rs3

    All DBLE does is to provide more registers for the wide computation
    in such a way that compiler is not forced to pair or share any reg-
    isters. The other thing DBLE does is to tell the decoder that the
    next instruction is 2× as wide as its OpCode states. In lower end
    machines (and in GPUs) DBLE is sequenced as if it were an instruction.
    In higher end machines, DBLE would be CoIssued with its mate.

    So if DBLE says the next instruction is double width, does that mean
    that all "128 bit instructions" require 64 bits in the instruction
    stream? So a sequence of say four 128 bit arithmetic instructions would require the I space of 8 instructions?

    If so, I guess it is a tradeoff for not requiring register pairing, e.g.
    Rn and Rn+1.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 15:58:40 2025
    From Newsgroup: comp.arch

    On 10/29/2025 1:47 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 10/28/2025 10:52 PM, Robert Finch wrote:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Same for other registers. GPRs may contain either integer or
    floating-point values.


    OK.

    I mostly stuck with 32-bit encodings, but 40 could maybe allow more
    encoding space, but the drawback of being non-power-of-2.

    it is definitely an issue.

    But, yeah, occasionally dealing with 128-bit data is a major case for 64
    GPRs and paired-registers registers.

    There is always the DBLE pseudo-instruction.

    DBLE Rd,Rs1,Rs2,Rs3

    All DBLE does is to provide more registers for the wide computation
    in such a way that compiler is not forced to pair or share any reg-
    isters. The other thing DBLE does is to tell the decoder that the
    next instruction is 2× as wide as its OpCode states. In lower end
    machines (and in GPUs) DBLE is sequenced as if it were an instruction.
    In higher end machines, DBLE would be CoIssued with its mate.


    OK.

    In my case, a lot of the 128-bit operations are a single 32-bit
    instruction, which splits (in decode) to spanning multiple lanes (using
    the 6R3W register file as a virtual 3R1W 128-bit register file).

    In some cases, pairs of 64-bit SIMD instructions may be merged to send
    both through the SIMD unit at the same time. Say, as a special-case
    co-issue for 2x Binary32 ops (which can basically be handled the same as
    the 4x Binary32 scenario by the SIMD unit).

    ----------

    My case: 10/33/64.
    No direct 128-bit constant, but can use two 64-bit constants whenever
    128 bits is needed.

    {5, 16, 32, 64}-bit immediates.


    The reason 17 and 33 ended up slightly preferable is that both
    zero-extended and sign-extended 16 and 32 bit values are fairly common.

    And, if one has both a zero and sign extended immediate, this eats the
    same encoding space as having a 17-bit immediate, or a separate
    zero-extended and one-extended variant.

    There are a few 5/6 bit immediate instructions, but I didn't really
    count them.

    XG3's equivalent of SLTI and similar only has Imm6 encodings (can be
    extended to 33 bits with a jumbo prefix).



    There isn't much need for a direct 128-bit immediate though:
    This case is exceedingly rare;
    Register-pairs basically make it a non-issue;
    Even if it were supported:
    This would still require a 24-byte encoding...
    Which doesn't save anything over 2x 12-bytes.
    And doesn't gain much, apart from making the CPU more expensive.

    Someone could maybe do 20 bytes by using a 128-bit memory load, but with
    the usual drawbacks of using a memory load (BGBCC doesn't usually do
    this). The memory load will have a higher latency than a pair of
    immediate instructions.





    Otherwise, goings on in my land:
    ISA development is slow, and had mostly turned into bug hunting;
    <snip>

    The longer term future is uncertain.


    My ISA's can beat RISC-V in terms of code-density and performance, but
    when when RISC-V is extended with similar features, it is harder to make
    a case that it is "enough".

    I am still running at 70% RISC-Vs instruction count.


    Basically similar.

    XG3 also uses only 70% as many instructions as RV64G.

    But, if you throw Indexed Load/Store, Load/Store Pair, Jumbo Prefixes,
    etc, at the problem (on top of RISC-V), suddenly RISC-V becomes a lot
    more competitive (30% smaller and 50% faster).

    Haven't found a good way to improve much over this though...


    But, yeah, if comparing against RV64G as it exists in its standard form,
    there is a bit of room for improvement.



    Doesn't seem like (within the ISA) there are many obvious ways left to
    grab large general-case performance gains over what I have done already.

    Fewer instructions, and or instructions that take fewer cycles to execute.

    Example, ENTER and EXIT instructions move 4 registers per cycle to/from
    cache in a pipeline that has 1 result per cycle.

    Some code benefits from lots of GPRs, but harder to make the case that
    it reflects the general case.

    There is very little to be gained with that many registers.


    Granted.

    The main thing it benefits is things like TKRA-GL, ...

    Doom basically sees no real difference between 32 and 64 GPRs (nor does SW-Quake).


    Mostly matters for code where one has functions with around 100+ local variables... Which, are uncommon much outside of TKRA-GL or similar.


    As-is, SW-Quake is one of the cases that does well with RISC-V, though GL-Quake performs like hot dog-crap; mostly as TKRA-GL gets wrecked if
    it is limited to 32 registers and doesn't have SIMD.


    Only real saving point is when running with TKRA-GL over system calls in
    which case it runs in the kernel (as XG1) which is slightly less bad.
    For reasons, TestKern kinda still needs to be built as XG1.


    Recently got a new very-cheap laptop (a Dell Latitude 7490, for around
    $240), made some curious observations:
    It seems to slightly outperform my main PC in single-threaded performance;
    Its RAM timings don't seem to match the expected values.

    My main PC still wins at multi-threaded performance, and has the
    advantage of 7x more RAM.

    My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.

    Desktop PC:
    8C/16T: 3.7 Base, 4.3 Turbo, 112GB RAM (just, not very fast RAM)
    Rarely reaches turbo
    pretty much only happens if just running a single thread...
    With all cores running stuff in the background:
    Idles around 3.6 to 3.8.

    Laptop:
    4C/8T, 1.9 GHz Base, 4.2 GHz Turbo
    If power set to performance, reaches turbo a lot more easily,
    and with multi-core workloads.
    But, puts out a lot of heat while doing so...

    If set to Efficiency, mostly stays below 3 GHz.

    As noted, the laptop is surprisingly speedy for how cheap it was.

    For $240 I was paranoid it still might not have been fast enough to run
    Minecraft...


    Still annoyed as the RAM claimed like DDR4-3200 on the box, but doesn't
    run reliably at more than DDR4-2133... Like, you can try 3200 if you
    don't mind computer blue-screening after a few minutes I guess...



    But, without much RAM, nor enough SSD space to set up a huge pagefile,
    not going to try compiling LLVM on the thing.

    Even with all the RAM, a full rebuild of LLVM still takes several hours
    on my main PC (though, trying to build LLVM or GCC is at least slightly
    faster if one tells the AV software to stop grinding the CPU by looking
    at every file accessed).


    Vs the $80 OptiPlex that came with a 2C/4T Core i3 variant, that wasn't particularly snappy (seemed on-par with the Vista era laptop; though
    this has a 2C/2T CPU).

    Basically, was a small PC that was using mostly laptop-style parts
    internally (laptop DVD-RW drive and laptop style HDD); some sort of ITX
    MOBO layout I think.

    I don't remember there being any card slots; so like if you want to
    install a PCIe card or similar, basically SOL.

    But, it was either this or an off-brand NUC clone...


    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Oct 29 21:52:54 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 10/29/2025 11:47 AM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    snip
    But, yeah, occasionally dealing with 128-bit data is a major case for 64
    GPRs and paired registers.

    There is always the DBLE pseudo-instruction.

    DBLE Rd,Rs1,Rs2,Rs3

    All DBLE does is to provide more registers for the wide computation
    in such a way that compiler is not forced to pair or share any reg-
    isters. The other thing DBLE does is to tell the decoder that the
    next instruction is 2× as wide as its OpCode states. In lower end
    machines (and in GPUs) DBLE is sequenced as if it were an instruction.
    In higher end machines, DBLE would be CoIssued with its mate.

    So if DBLE says the next instruction is double width, does that mean
    that all "128 bit instructions" require 64 bits in the instruction
    stream? So a sequence of say four 128 bit arithmetic instructions would require the I space of 8 instructions?

    It is a 64-bit machine that provides a small modicum of support for
    larger sizes. It is not and never will be a 128-bit machine--that is
    what vVM is for.

    Key words "small modicum"

    DBLE simply supplies registers to the pipeline and width to decode.

    If so, I guess it is a tradeoff for not requiring register pairing, e.g.
    Rn and Rn+1.

    DBLE supports 128-bits in the ISA at the total cost of 1 instruction
    added per use. In many situations (especially integer) CARRY is the
    better option because it throws a shadow of width over a number of
    instructions and thereby has lower code footprint costs. So, a 256
    bit shift is only 5 instructions instead of 8. And realistically, if
    you want wider than that, you have already run out of registers.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 18:01:17 2025
    From Newsgroup: comp.arch

    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Same for other registers. GPRs may contain either integer or
    floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some alignment that the first instruction of a cache line is always aligned?

    The 40-bit instructions are byte aligned. This does add more shifting in
    the align stage. Once shifted, though, instructions are easily peeled off
    from fixed positions. One consequence is jump targets must be byte
    aligned, OR routines could be required to be 32-bit aligned, for instance.
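
    (Roughly, in C, assuming a little-endian byte stream; names are just
    illustrative:)

    #include <stdint.h>
    #include <string.h>

    /* Fetch the 40-bit instruction word starting at an arbitrary byte PC. */
    static uint64_t fetch40(const uint8_t *istream, uint64_t byte_pc)
    {
        uint64_t w = 0;
        memcpy(&w, istream + byte_pc, 5);   /* little-endian host assumed  */
        return w & 0xFFFFFFFFFFull;         /* keep the low 40 bits        */
    }
    /* The next sequential instruction starts at byte_pc + 5. */
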
    Having register pairs does not make the compiler writer's life easier, unfortunately.

    Going with a bit result vector in any GPR for compares, then a branch on
    bit-set/clear for conditional branches. Might also include branch true /
    false.

    Having 64 registers and 64 bit registers makes life easier for that particular task :-)

    If you have that many bits available, do you still go for a load-store architecture, or do you have memory operations? This could offset the
    larger size of your instructions.

    It is load/store with no memory ops, except possibly atomic memory ops.
    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those sizes are not really a good fit for constants from programs,
    where quite a few constants tend to be 32 or 64 bits. Would a
    64-bit FP constant leave 26 bits empty?

    I found that 16-bit immediates could be encoded instead of 10-bit.
    So, now there are 16, 56, 96, and 136 bit constants possible. The
    56-bit constant likely has enough range for most 64-bit ops. Otherwise, using
    a 96-bit constant for 64-bit ops would leave the upper 32 bits of the
    constant unused. 136 bit constants may not be implemented, but a size
    code is reserved for that size.
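    As a sketch of what decoding one of these might look like (the packing of
    the extra 40-bit word and the sign extension are assumptions on my part,
    not the actual Qupls4 encoding):

        #include <stdint.h>

        /* Hypothetical 56-bit immediate: a 16-bit field in the base word plus
           one trailing 40-bit word, sign-extended to 64 bits. */
        static int64_t imm56(uint16_t base16, uint64_t ext40) {
            uint64_t raw = ((ext40 & 0xFFFFFFFFFFull) << 16) | base16;
            if (raw & (1ull << 55))
                raw |= 0xFF00000000000000ull;   /* replicate bit 55 upward */
            return (int64_t)raw;
        }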


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 18:20:51 2025
    From Newsgroup: comp.arch

    On 2025-10-29 2:33 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on
    bit-set/clear for conditional branches. Might also include branch true /
    false.

    I have both the bit-vector compare and branch, but also a compare to zero
    and branch as a single instruction. I suggest you should too, if for no
    other reason than:

    if( p && p->next )


    Yes, I was going to have at least branch on register 0 (false) 1 (true)
    as there is encoding room to support it. It does add more cases in the
    branch eval, but is probably well worth it.
    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    My 66000 allows for occasional use of 128-bit values but is designed mainly for 64-bit and smaller.


    Following the same philosophy. Expecting only some use for 128-bit
    floats. Integers can only handle 8,16,32, or 64-bits.

    With 32-bit instructions, I provide, {5, 16, 32, and 64}-bit constants.

    Just last week we discovered a case where HW can do a better job than SW. Previously, the compiler would emit:

    CVTfd Rt,Rf
    FMUL Rt,Rt,#1.425D0
    CVTdf Rd,Rt

    Which is subject to double rounding once at the FMUL and again at the
    down conversion. I thought about the problem and it seems fairly easy
    to gate the 24-bit fraction into the multiplier tree along with the
    53-bit fraction of the constant, and then normalize and round the
    result dropping out of the tree--avoiding the double rounding case.

    Now, the compiler emits:

    FMULf Rd,Rf,#1.425D0

    saving 2 instructions along with the higher precision.

    Improves the accuracy? of algorithms, but seems a bit specific to me.
    Are there other instruction sequences where double-rounding would be good
    to avoid? Seems like HW could detect the sequence and fuse the instructions.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Oct 29 18:26:05 2025
    From Newsgroup: comp.arch

    <snip>>> My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.

    Desktop PC:
      8C/16T: 3.7 Base, 4.3 Turbo, 112GB RAM (just, not very fast RAM)
        Rarely reaches turbo
          pretty much only happens if just running a single thread...
        With all cores running stuff in the background:
          Idles around 3.6 to 3.8.

    Laptop:
      4C/8T, 1.9 GHz Base, 4.2 GHz Turbo
        If power set to performance, reaches turbo a lot more easily,
          and with multi-core workloads.
        But, puts out a lot of heat while doing so...

    If set to Efficiency, mostly stays below 3 GHz.

    As noted, the laptop is surprisingly speedy for how cheap it was.

    <snip>
    For my latest PC I bought a gaming machine – i7-14700KF CPU (20 cores).
    32 GB RAM, 16GB graphics RAM. 3.4 GHz (5.6 GHz in turbo mode). More RAM
    was needed, my last machine only had 16GB, found it using about 20GB. I
    did not want to spring for a machine with even more RAM, they tended to
    be high-end machines.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Oct 29 22:31:12 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    At this point, the discussion is academic, as Robert has said he has 6
    bit register specifiers in the instructions.

    He could still make these registers have 128 bits rather than pairing
    registers for 128-bit operation.

    But my issue had nothing
    to do with SIMD registers, as he said he supported 128 bit arithmetic
    and the "high" registers were used for that.

    As far as waste etc. is concerned, it does not matter if the 128-bit
    operation is a SIMD operation or a scalar 128-bit operation.

    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Oct 29 18:48:56 2025
    From Newsgroup: comp.arch

    On 10/29/2025 5:26 PM, Robert Finch wrote:
    <snip>>> My new Linux box has 64 cores at 4.5 GHz and 96GB of DRAM.

    Desktop PC:
       8C/16T: 3.7 Base, 4.3 Turbo, 112GB RAM (just, not very fast RAM)
         Rarely reaches turbo
           pretty much only happens if just running a single thread...
         With all cores running stuff in the background:
           Idles around 3.6 to 3.8.

    Laptop:
       4C/8T, 1.9 GHz Base, 4.2 GHz Turbo
         If power set to performance, reaches turbo a lot more easily,
           and with multi-core workloads.
         But, puts out a lot of heat while doing so...

    If set to Efficiency, mostly stays below 3 GHz.

    As noted, the laptop is surprisingly speedy for how cheap it was.

    <snip>
    For my latest PC I bought a gaming machine – i7-14700KF CPU (20 cores).
    32 GB RAM, 16GB graphics RAM. 3.4 GHz (5.6 GHz in turbo mode). More RAM
    was needed, my last machine only had 16GB, found it using about 20GB. I
    did not want to spring for a machine with even more RAM, they tended to
    be high-end machines.


    IIRC, current PC was something like:
    CPU: $80 (Zen+; Zen 2 and 3 were around, but more expensive)
    MOBO: $60
    Case: $50
    ...

    Spent around $200 for 128GB of RAM.
    Could have gotten a cheaper 64GB kit had I known my MOBO would not
    accept a full 128GB (then could have had 96 GB).


    The RTX card I have (RTX 3060) has 12 GB of VRAM.

    IIRC, it was also about the cheapest semi-modern graphics card I could
    find at the time. Like, while I could have bought an RTX 4090 or similar
    at the time, I am not made of money.

    Like, a prior-generation mid-range card being the cheaper option.
    And, still newer than the GTX980 that had died on me (where, the GTX980
    was itself second-hand).


    Before this, had been running a GTX 460, and before that, a Radeon HD
    4850 (IIRC).

    I think it was a case of:
    Had a Phenom II box, with the HD 4850;
    Switched to GTX 460, as I got one second-hand for free, slightly better;
    Replaced Phenom II board+CPU with FX-8350;
    Got GTX 980 (also second hand);
    Got Ryzen 7 2700X and new MOBO;
    Got RTX 3060 (as the 980 was failing).

    With the RTX 3060, had to go single-monitor, mostly as it only has
    DisplayPort outputs, and DP->HDMI->DVI via adapters doesn't seem to work (whereas HDMI->DVI did work via adapters).

    Well, also the RTX 3060 doesn't have a VGA output either (monitor would
    also accept VGA).

    Though, the current monitor I am using is newer and does support
    DisplayPort.


    I also managed to get a MultiSync CRT a while ago, but it only really
    gives good results at 640x480 and 800x600, 1024x768 sorta-works (but
    1280x1024 does not work), has a roughly 16" CRT or so; VGA input.

    I also have an LCD that goes up to 1280x1024, although it looks like
    garbage if set above 1024x768. Only accepts VGA.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Oct 30 07:13:54 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, >>> 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some
    alignment that the first instruction of a cache line is always aligned?

    The 40-bit instructions are byte aligned. This does add more shifting in
    the align stage. Once shifted though instructions are easily peeled off
    from fixed positions. One consequence is jump targets must be byte
    aligned OR routines could be required to be 32-bit aligned for instance.

    That raises an interesting question. If you want to align a branch
    target on a 32-bit boundary, or even a cache line, how do you fill
    up the rest? If all instructions are 40 bits, you cannot have a
    NOP that is not 40 bits, so there would need to be a jump before
    a gap that does not fit 40 bits.

    If you have that many bits available, do you still go for a load-store
    architecture, or do you have memory operations? This could offset the
    larger size of your instructions.

    It is load/store with no memory ops excepting possibly atomic memory ops.

    OK. Starting with 40 vs 32 bits, you have a factor of 1.25 disadvantage
    in code density to start with. Having memory operations could offset
    that by a certain factor, that was why I was asking.

    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    Those sizes are not really a good fit for constants from programs,
    where quite a few constants tend to be 32 or 64 bits. Would a
    64-bit FP constant leave 26 bits empty?

    I found that 16-bit immediates could be encoded instead of 10-bit.

    OK. That should also help for offsets in load/store.

    So, now there are 16,56,96 and 136 bit constants possible. The 56-bit constant likely has enough range for most 64-bit ops.

    For addresses, it will take some time for this to overflow :-)
    For floating point constants, that will be hard.

    I have done some analysis on frequency of floating point constants
    in different programs, and what I found was that there are a few
    floating point constants that keep coming up, like a few integers
    around zero (biased towards the positive side), plus a few more
    golden oldies like 0.5, 1.5 and pi. Apart from that, I found that
    different programs have wildly different floating point constants,
    which is not surprising. (I based that analysis on the grand
    total of three packages, namely Perl, gnuplot and GSL, so coverage
    is not really extensive).

    Otherwise using
    a 96-bit constant for 64-bit ops would leave the upper 32-bit of the constant unused.

    There are also 32-bit floating point constants, and 32-bit integers
    as constants. There are also very many small integer constants, but
    of course there also could be others.

    136 bit constants may not be implemented, but a size
    code is reserved for that size.

    I'm still hoping for good 128-bit IEEE hardware float support.
    POWER has this, but stuck it on their decimal float
    arithmetic, which does not perform very well...
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Thu Oct 30 13:53:04 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, >>>> 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named >>>> as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high >>>> is r33). Sameo for other registers. GPRs may contain either integer or >>>> floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some
    alignment that the first instruction of a cache line is always aligned?

    The 40-bit instructions are byte aligned. This does add more shifting in
    the align stage. Once shifted though instructions are easily peeled off
    from fixed positions. One consequence is jump targets must be byte
    aligned OR routines could be required to be 32-bit aligned for instance.>

    That raises an interesting question. If you want to align a branch
    target on a 32-bit boundary, or even a cache line, how do you fill
    up the rest? If all instructions are 40 bits, you cannot have a
    NOP that is not 40 bits, so there would need to be a jump before
    a gap that is does not fit 40 bits.

    iCache lines could be a multiple of 5-bytes in size (e.g. 80 bytes
    instead of 64).

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 30 16:09:00 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-10-29 2:33 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, >> 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named
    as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high
    is r33). Sameo for other registers. GPRs may contain either integer or
    floating-point values.

    Going with a bit result vector in any GPR for compares, then a branch on >> bit-set/clear for conditional branches. Might also include branch true / >> false.

    I have both the bit-vector compare and branch, but also a compare to zero and branch as a single instruction. I suggest you should too, if for no other reason than:

    if( p && p->next )


    Yes, I was going to have at least branch on register 0 (false) 1 (true)
    as there is encoding room to support it. It does add more cases in the branch eval, but is probably well worth it.
    Using operand routing for immediate constants and an operation size for
    the instruction. Constants and operation size may be specified
    independently. With 40-bit instruction words, constants may be 10,50,90
    or 130 bits.

    My 66000 allows for occasional use of 128-bit values but is designed mainly for 64-bit and smaller.


    Following the same philosophy. Expecting only some use for 128-bit
    floats. Integers can only handle 8,16,32, or 64-bits.

    With 32-bit instructions, I provide, {5, 16, 32, and 64}-bit constants.

    Just last week we discovered a case where HW can do a better job than SW. Previously, the compiler would emit:

    CVTfd Rt,Rf
    FMUL Rt,Rt,#1.425D0
    CVTdf Rd,Rt

    Which is subject to double rounding once at the FMUL and again at the
    down conversion. I though about the problem and it seems fairly easy
    to gate the 24-bit fraction into the multiplier tree along with the
    53-bit fraction of the constant, and then normalize and round the
    result dropping out of the tree--avoiding the double rounding case.

    Now, the compiler emits:

    FMULf Rd,Rf,#1.425D0

    saving 2 instructions along with the higher precision.

    Improves the accuracy? of algorithms, but seems a bit specific to me.

    It is down in the 1% footprint area.

    Are there other instruction sequence where double-rounding would be good
    to avoid?

    Back when I joined Moto (1983) there was a lot of talk about double
    roundings and how it could screw up various algorithms but mainly in
    the 64-bit versus 80-bit stuff of the 68881, where you got 11 more bits
    of precision and thus took a chance of 2/2^10 of a double rounding.
    Today with 32-bit versus 64-bit you take a chance of 2/2^28 so the
    problem is greatly ameliorated although technically still present.

    The problem arises due to a cross product of various {machine,
    language, compiler} features not working "all ends towards the middle".

    LLVM promotes FP calculations with a constant to 64-bits whenever the
    constant cannot be represented exactly in 32-bits. {Strike one}

    C makes no <useful> statements about precision of calculation control.
    {strike two}

    HW almost never provides mixed mode calculations which provide the
    means to avoid the double rounding. {strike three}

    So, technically, My 66000 does not provide general-mixed-mode FP,
    but I wrote a special rule to allow for larger constants used with
    narrower registers to cover exactly this case. {It also saves 2 CVT instructions (latency and footprint).}

    Seems like HW could detect the sequence and fuse the instructions.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 30 16:10:47 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    At this point, the discussion is academic, as Robert has said he has 6
    bit register specifiers in the instructions.

    He could still make these registers have 128 bits rather than pairing registers for 128-bit operation.

    But my issue had nothing
    to do with SIMD registers, as he said he supported 128 bit arithmetic
    and the "high" registers were used for that.

    As far as waste etc. is concerned, it does not matter if the 128-bit operation is a SIMD operation or a scalar 128-bit operation.

    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    Which only goes to prove that x86 is not RISC.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Oct 30 12:29:39 2025
    From Newsgroup: comp.arch

    On 10/30/2025 11:10 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    At this point, the discussion is academic, as Robert has said he has 6
    bit register specifiers in the instructions.

    He could still make these registers have 128 bits rather than pairing
    registers for 128-bit operation.


    Only really makes sense if one assumes these resources are "borderline
    free".

    If you are also paying for logic complexity and wires/routing, then
    having bigger registers just to typically waste most of them is not ideal.


    Granted, one could argue that most of the register is wasted when, say:
    Most integer values could easily fit into 16 bits;
    We have 64-bit registers.

    But, there is enough that actually uses the 64-bits of a 64-bit register
    to make it worthwhile. Would be harder to say the same for 128-bit
    registers.

    It is common on many 32-bit machines to use register pairs for 64-bit operations.


    But my issue had nothing
    to do with SIMD registers, as he said he supported 128 bit arithmetic
    and the "high" registers were used for that.

    As far as waste etc. is concerned, it does not matter if the 128-bit
    operation is a SIMD operation or a scalar 128-bit operation.

    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    Which only goes to prove that x86 is not RISC.


    Also questionable to read, as someone lacking much hardware that actually supports 256- or 512-bit AVX at the actual HW level. And, both AVX and
    AVX-512 had not exactly had clean roll-outs.


    Checks and, ironically, my recent super-cheap laptop was the first thing
    I got that apparently has proper 256-bit AVX support (still no AVX-512 though...).


    Still some oddities though:
    RAM that appears to be faster than it should be;
    The MHz and CAS latency appear abnormally high.
    They do not match the values for DDR4-2400.
    (Nor, even DDR4 in general).
    Appears to exceed expected bandwidth on memcpy test;
    ...
    Windows 11 on an unsupported CPU model;
    More so, Windows 11 Professional, also on something cheap.
    (Listing said it would come with Win10, got Win11 instead, OK).

    So, technically seems good, but also slightly sus...


    Differs slightly from what I was expecting:
    Something kinda old and not super fast;
    Listing said Windows 10, kinda expected Windows 10;
    ...

    Like, something non-standard may have been done here.


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 30 16:46:14 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    Which only goes to prove that x86 is not IRSC.

    I don't see that following at all, but it inspired a closer look at
    the usage/waste of register bits in RISCs:

    Every 64-bit RISC starting with MIPS-IV and Alpha, wastes a lot of
    precious register bits by keeping 8-bit, 16-bit, and 32-bit values in
    64-bit registers rather than following the idea of Intel and Robert
    Finch of splitting the 64-bit register into double the number of 32-bit registers; this idea can be extended to eliminate waste by having the
    quadruple number of 16-bit registers that can be joined into 32-bit
    and 64-bit registers when needed, or even better, the octuple number
    of 8-bit registers that can be joined to 16-bit, 32-bit, and 64-bit
    registers. We can even resurrect the character-oriented or
    digit-oriented architectures of the 1950s.

    Intel split AX into AL and AH, similar for BX, CX, and DX, but not for
    SI, DI, BP, and SP. In the 32-bit extension, they did not add ways to
    access the third and fourth byte, or the second wyde (16-bit value).
    In the 64-bit extension, AMD added ways to access the low byte of
    every register (in addition to AH-DH), but no way to access the second
    byte of other registers than RAX-RDX, nor ways to access higher wydes,
    or 32-bit units. Apparently they were not concerned about this kind
    of waste. For the 8086 the explanation is not trying to avoid waste,
    but an easy automatic mapping from 8080 code to 8086 code.

    Writing to AL-DL or AX-DX,SI,DI,BP,SP leaves the other bits of the
    32-bit register alone, which one can consider to be useful for storing
    data in those bits (and in case of AL, AH actually provides a
    convenient way to access some of the bits, and vice versa), but leads
    to partial-register stalls. The hardware contains fast paths for some
    common cases of partial-register writes, but AFAIK AH-DH do not get
    fast paths in most CPUs.

    By contrast, RISCs waste the other 24 or 56 bits on a byte load by zero-extending or sign-extending the byte.

    Alpha avoids wasting register bits for some idioms by keeping up to 8
    bytes in a register in SIMD style (a few years before the wave of SIMD extensions across the industry), but still provides no direct name for
    the individual bytes of a register.

    IIRC the original HPPA has 32 or so 64-bit FP registers, which they
    then split into 58? 32-bit FP registers. I don't know how they
    further evolved that feature.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Oct 30 17:58:34 2025
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions, >>>>> 64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named >>>>> as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high >>>>> is r33). Sameo for other registers. GPRs may contain either integer or >>>>> floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some
    alignment that the first instruction of a cache line is always aligned? >>>
    The 40-bit instructions are byte aligned. This does add more shifting in >>> the align stage. Once shifted though instructions are easily peeled off
    from fixed positions. One consequence is jump targets must be byte
    aligned OR routines could be required to be 32-bit aligned for instance.> >>
    That raises an interesting question. If you want to align a branch
    target on a 32-bit boundary, or even a cache line, how do you fill
    up the rest? If all instructions are 40 bits, you cannot have a
    NOP that is not 40 bits, so there would need to be a jump before
    a gap that is does not fit 40 bits.

    iCache lines could be a multiple of 5-bytes in size (e.g. 80 bytes
    instead of 64).

    There is a cache level (L2 usually, I believe) when icache and
    dcache are no longer separate. Wouldn't this cause problems
    or inefficiencies?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Oct 30 23:39:28 2025
    From Newsgroup: comp.arch

    On Thu, 30 Oct 2025 16:46:14 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    Alpha avoids wasting register bits for some idioms by keeping up to 8
    bytes in a register in SIMD style (a few years before the wave of SIMD extensions across the industry), but still provides no direct name for
    the individual bytes of a register.


    According to my understanding, EV4 had no SIMD-style instructions.
    They were introduced in EV5 (Jan 1995). Which makes it only ~6 months
    ahead of VIS in UltraSPARC.




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 30 22:00:50 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Intel designed SSE with scalar instructions that use only 32 bits out
    of the 128 bits available; SSE2 with 64-bit scalar instructions, AVX
    (and AVX2) with 32-bit and 64-bit scalar operations in a 256-bit
    register, and various AVX-512 variants with 32-bit and 64-bit scalars,
    and 128-bit and 256-bit operations in addition to the 512-bit ones.
    They are obviously not worried about waste.

    Which only goes to prove that x86 is not RISC.

    I don't see that following at all, but it inspired a closer look at
    the usage/waste of register bits in RISCs:

    Every 64-bit RISC starting with MIPS-IV and Alpha, wastes a lot of
    precious register bits by keeping 8-bit, 16-bit, and 32-bit values in
    64-bit registers rather than following the idea of Intel and Robert
    Finch of splitting the 64-bit register in the double number of 32-bit registers; this idea can be extended to eliminate waste by having the quadruple number of 16-bit registers that can be joined into 32-bit
    anbd 64-bit registers when needed, or even better, the octuple number
    of 8-bit registers that can be joined to 16-bit, 32-bit, and 64-bit registers. We can even ressurrect the character-oriented or
    digit-oriented architectures of the 1950s.

    Consider that being able to address every 2^(3+n) field of a register
    is far from free. Take a simple add of 2 bytes::

    ADDB R8[7], R6[3], R19[4]

    One has to individually align each of the bytes, which is going to blow
    out your timing for forwarding by at least 3 gates of delay (operands)
    and 4 gates for the result (register). The only way it makes "timing"
    sense is if you restrict the patterns to::

    ADDB R8[7], R6[7], R19[7]

    Where there is no "vertical" routing in obtaining operands and delivering results. {{OR you could always just eat a latency cycle when all fields
    are not the same.}}

    I also suspect that you would gain few compiler writers to support random fields in registers.

    Intel split AX into AL and AH, similar for BX, CX, and DX, but not for
    SI, DI, BP, and SP.

    {ABCD}X registers were data.
    {SDBS} registers were pointer registers.

    There are vanishingly few useful manipulations on parts of pointers.

    Oh and BTW:: using x86-history as justification for an architectural
    feature is "bad style".

    In the 32-bit extension, they did not add ways to
    access the third and fourth byte, or the second wyde (16-bit value).
    In the 64-bit extension, AMD added ways to access the low byte of
    every register (in addition to AH-DH), but no way to access the second
    byte of other registers than RAX-RDX, nor ways to access higher wydes,
    or 32-bit units. Apparently they were not concerned about this kind
    of waste. For the 8086 the explanation is not trying to avoid waste,
    but an easy automatic mapping from 8080 code to 8086 code.

    Writing to AL-DL or AX-DX,SI,DI,BP,SP leaves the other bits of the
    32-bit register alone, which one can consider to be useful for storing
    data in those bits (and in case of AL, AH actually provides a
    conventient way to access some of the bits, and vice versa), but leads
    to partial-register stalls. The hardware contains fast paths for some
    common cases of partial-register writes, but AFAIK AH-DH do not get
    fast paths in most CPUs.

    By contrast, RISCs waste the other 24 of 56 bits on a byte load by zero-extending or sign-extending the byte.

    But gains the property that the whole register contains 1 proper value {range-limited to the container size whence it came}. This in turn makes tracking values easy--in fact placing several different sized values
    in a single register makes it essentially impossible to perform value
    analysis in the compiler.

    Alpha avoids wasting register bits for some idioms by keeping up to 8
    bytes in a register in SIMD style (a few years before the wave of SIMD extensions across the industry), but still provides no direct name for
    the individual bytes of a register.

    If your ISA has excellent support for statically positioned bit-fields
    (or even better with dynamically positioned bit fields) fetching the
    fields and depositing them back into containers does not add significant latency. {volatile notwithstanding} While poor ISA support does add
    significant latency.

    IIRC the original HPPA has 32 or so 64-bit FP registers, which they
    then split into 58? 32-bit FP registers. I don't know how they
    further evolved that feature.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Oct 30 22:06:35 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-10-29 2:15 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Started working on yet another CPU – Qupls4. Fixed 40-bit instructions,
    64 GPRs. GPRs may be used in pairs for 128-bit ops. Registers are named >>>>> as if there were 32 GPRs, A0 (arg 0 register is r1) and A0H (arg 0 high >>>>> is r33). Sameo for other registers. GPRs may contain either integer or >>>>> floating-point values.

    I understand the temptation to go for more bits :-) What is your
    instruction alignment? Bytewise so 40 bits fit, or do you have some >>>> alignment that the first instruction of a cache line is always aligned? >>>
    The 40-bit instructions are byte aligned. This does add more shifting in >>> the align stage. Once shifted though instructions are easily peeled off >>> from fixed positions. One consequence is jump targets must be byte
    aligned OR routines could be required to be 32-bit aligned for instance.> >>
    That raises an interesting question. If you want to align a branch >>target on a 32-bit boundary, or even a cache line, how do you fill
    up the rest? If all instructions are 40 bits, you cannot have a
    NOP that is not 40 bits, so there would need to be a jump before
    a gap that is does not fit 40 bits.

    iCache lines could be a multiple of 5-bytes in size (e.g. 80 bytes
    instead of 64).

    There is a cache level (L2 usually, I believe) when icache and
    dcache are no longer separate. Wouldn't this cause problems
    or inefficiencies?

    Consider trying to invalidate an ICache line--this requires looking
    at 2 DCache lines to see if they, too, need invalidation.

    Consider self-modifying code, the data stream overwrites an instruction,
    then later the FETCH engine runs over the modified line, but the modified
    line is 64-bytes of the needed 80-bytes, so you take a hit and a miss on
    a single fetch.

    It also prevents SNARFing updates to ICache instructions, unless the
    SNARFed data is entirely retained in a single ICache line.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Oct 30 22:19:18 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    According to my understanding, EV4 had no SIMD-style instructions.

    My understanding is that CMPBGE and ZAP(NOT), both SIMD-style
    instructions, were already present in EV4. The architecture
    description <https://download.majix.org/dec/alpha_arch_ref.pdf> does
    not say that some implementations don't include these instructons in
    hardware, whereas for the Multimedia support instructions (Section
    4.13), the reference does say that.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Oct 31 00:57:42 2025
    From Newsgroup: comp.arch

    On Thu, 30 Oct 2025 22:19:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    According to my understanding, EV4 had no SIMD-style instructions.

    My understanding ist that CMPBGE and ZAP(NOT), both SIMD-style
    instructions, were already present in EV4.


    Yes, those were in EV4.

    Alpha 21064 and Alpha 21064A HRM is here: https://github.com/JonathanBelanger/DECaxp/blob/master/ExternalDocumentation

    I didn't consider these instructions as SIMD. Maybe I should have.
    Looks like these instructions are intended to accelerate string
    processing. That's unusual for the first wave of SIMD extensions.
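    For context, the kind of byte-parallel trick such compare-byte
    instructions accelerate is scanning a word at a time for a zero byte
    (the inner loop of strlen/strcmp on a machine without byte loads).  A
    portable C equivalent of that test, using the standard SWAR constants
    rather than Alpha assembly:

        #include <stdint.h>

        /* Nonzero iff any of the 8 bytes in w is 0x00. */
        static int has_zero_byte(uint64_t w) {
            return ((w - 0x0101010101010101ull) & ~w & 0x8080808080808080ull) != 0;
        }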

    The architecture
    description <https://download.majix.org/dec/alpha_arch_ref.pdf> does
    not say that some implementations don't include these instructons in hardware, whereas for the Multimedia support instructions (Section
    4.13), the reference does say that.

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Oct 31 14:48:41 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 30 Oct 2025 22:19:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    My understanding ist that CMPBGE and ZAP(NOT), both SIMD-style
    instructions, were already present in EV4.
    ...
    I didn't consider these instructions as SIMD. May be, I should have.

    They definitely are, but they were not touted as such at the time, and
    they use the GPRs, unlike most SIMD extensions to instruction sets.

    Looks like these instructions are intended to accelerated string
    processing. That's unusual for the first wave of SIMD extensions.

    Yes. This was pre-first-wave. The Alpha architects just wanted to
    speed up some common operations that would otherwise have been
    relatively slow thanks to Alpha initially not having BWX instructions. Ironically, when Alpha showed a particularly good result on some
    benchmark (maybe Dhrystone), someone claimed that these string
    instructions gave Alpha an unfair advantage.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Oct 31 13:21:45 2025
    From Newsgroup: comp.arch

    On 10/31/2025 9:48 AM, Anton Ertl wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 30 Oct 2025 22:19:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    My understanding ist that CMPBGE and ZAP(NOT), both SIMD-style
    instructions, were already present in EV4.
    ...
    I didn't consider these instructions as SIMD. May be, I should have.

    They definitely are, but they were not touted as such at the time, and
    they use the GPRs, unlike most SIMD extensions to instruction sets.

    Looks like these instructions are intended to accelerated string
    processing. That's unusual for the first wave of SIMD extensions.

    Yes. This was pre-first-wave. The Alpha architects just wanted to
    speed up some common operations that would otherwise have been
    relatively slow thanks to Alpha initially not having BWX instructions. Ironically, when Alpha showed a particularly good result on some
    benchmark (maybe Dhrystone), someone claimed that these string
    instructions gave Alpha an unfair advantage.


    Most likely Dhrystone:
    It shows disproportionate impact from the relative speed of things like "strcmp()" and integer divide.


    I had experimented with special instructions for packed search, which
    could be used to help with either string compare of implementing
    dictionary objects in my usual way.


    Though, had later fallen back to a more generic way of implementing
    "strcmp()" that could allow more fair comparison between my own ISA and RISC-V. Where, say, one instead makes the determination based on how efficiently the ISA can handle various pieces of C code (rather than the
    use of niche instructions that typically require hand-written ASM or
    similar).



    Generally, makes more sense to use helper instructions that have a
    general impact on performance, say for example, affecting how quickly a
    new image can be drawn into VRAM.

    For example, in my GUI experiments:
    Most of the programs are redrawing the screens as, say, 320x200 RGB555.

    Well, except ROTT, which uses 384x200 8-bit, on top of a bunch of code
    to mimic planar VGA behavior. In this case, for the port it was easier
    to write wrapper code to fake the VGA weirdness than to try to rewrite
    the whole renderer to work with a normal linear framebuffer (like what
    Doom and similar had used).


    In a lot of the cases, I was using an 8-bit indexed color or color-cell
    mode. For indexed color, one needs to send each image through a palette conversion (to the OS color palette); or run a color-cell encoder.
    Mostly because the display HW used 128K of VRAM.

    And, even if RAM backed, there are bandwidth problems with going bigger;
    so higher-resolutions had typically worked to reduce the bits per pixel:
    320x200: 16 bpp
    640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
    800x600: 2 or 4 bpp color-cell
    1024x768: 1 bpp monochrome, other experiments (*1)
    Or, use the 2 bpp mode, for 192K.

    *1: Bayer Pattern Mode/Logic (where the pattern of pixels also encodes
    the color);
    One possibility also being to use an indexed color pair for every 8x8, allowing for a 1.25 bpp color cell mode.

    Though, thus far the 1024x768 mode is still mostly untested on real
    hardware.

    Had experimented some with special instructions to speed up the indexed
    color conversion and color-cell encoding, but had mostly gone back and
    forth between using helper instructions and normal plain C logic, and
    which exact route to take.

    Had at one point had a helper instruction for the "convert 4 RGB555
    colors to 4 indexed colors using a hardware palette", but this broke
    when I later ended up modifying the system palette for better results
    (which was a critical weakness of this approach). Also the naive
    strategy of using a 32K lookup table isn't great, as this doesn't fit
    into the L1 cache.


    So, for 4 bpp color cell:
    Generally, each block of 4x4 pixels understood as 2 RGB555 endpoints,
    and 2 selector bits per pixel. Though, in VRAM, 4 of these are packed
    into a logical 8x8 pixel block; rather than a linear ordering like in
    DXT1 or similar (specifics differ, but general concept is similar to DXT1/S3TC).

    The 2bpp mode generally has 8x8 pixels encoded as 1bpp in raster order
    (same order as a character cell, with MSB in top-left corner and LSB in lower-right corner). And, then typically 2x RGB555 over the 8x8 block.
    IIRC, had also experimented with having each 4x4 sub-block able to use a
    pair of RGB232 colors, but was harder to get good results.

    But, to help with this process, it was useful to have helper operations
    for, say (a rough C sketch follows this list):
    Map RGB555 values to a luma value;
    Select minimum and maximum RGB555 values for block;
    Map luma values to 1 or 2 bit selectors;
    ...
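    A very rough sketch of those per-block steps in plain C, assuming RGB555
    input pixels (illustrative only, not the actual encoder):

        #include <stdint.h>

        /* cheap luma-ish weighting of an RGB555 pixel */
        static uint32_t rgb555_luma(uint16_t c) {
            uint32_t r = (c >> 10) & 31, g = (c >> 5) & 31, b = c & 31;
            return 2 * r + 5 * g + b;
        }

        /* 4x4 block: pick min/max endpoints by luma, then a 2-bit selector
           per pixel along the min..max axis (packed into 32 bits). */
        static void encode_cell4x4(const uint16_t px[16],
                                   uint16_t *minc, uint16_t *maxc, uint32_t *sel)
        {
            uint32_t ymin = ~0u, ymax = 0, i;
            for (i = 0; i < 16; i++) {
                uint32_t y = rgb555_luma(px[i]);
                if (y <  ymin) { ymin = y; *minc = px[i]; }
                if (y >= ymax) { ymax = y; *maxc = px[i]; }
            }
            *sel = 0;
            for (i = 0; i < 16; i++) {
                uint32_t y = rgb555_luma(px[i]);
                uint32_t t = (ymax > ymin) ? (4 * (y - ymin)) / (ymax - ymin + 1) : 0;
                *sel |= (t & 3) << (2 * i);
            }
        }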


    Internally, the GUI mode had worked by drawing everything to an RGB555 framebuffer (~ 512K or 1MB) and then using a bitmap to track which
    blocks had been modified and need to be re-encoded and sent over to VRAM (partly by first flagging during window redraw, then comparing with a
    previous version of the framebuffer and tracking when pixel-blocks will
    differ to refine the selection of blocks that need redraw, copying over
    blocks as needed to keep track of these buffers).
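    A compressed sketch of that dirty-block tracking (names and layout are
    illustrative, not the actual GUI code):

        #include <stdint.h>
        #include <string.h>

        /* Compare each 8x8 block of the RGB555 framebuffer against the shadow
           copy of the previous frame; mark changed blocks in a bitmap and
           refresh the shadow copy for the rows that differ. */
        static void mark_dirty(const uint16_t *cur, uint16_t *prev,
                               int w, int h, uint8_t *dirty)
        {
            int nbx = w / 8, bx, by, y;
            for (by = 0; by < h / 8; by++) {
                for (bx = 0; bx < nbx; bx++) {
                    int differ = 0;
                    for (y = 0; y < 8; y++) {
                        size_t off = (size_t)(by * 8 + y) * w + bx * 8;
                        if (memcmp(cur + off, prev + off, 8 * sizeof(uint16_t))) {
                            differ = 1;
                            memcpy(prev + off, cur + off, 8 * sizeof(uint16_t));
                        }
                    }
                    if (differ) {
                        int idx = by * nbx + bx;
                        dirty[idx >> 3] |= 1u << (idx & 7);
                    }
                }
            }
        }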

    Process wasn't particularly efficient (and performance is considerably
    worse than what Win3.x or Win9x seemed to give).



    As for the packed-search instructions, there were 16-bit versions as
    well, which could be used either to help with UTF-16 operations; or for dictionary objects.

    Where, a common way I implement dictionary objects is to use arrays of
    16-bit keys with 64-bit values (often tagged values or similar).

    Though, this does put a limit on the maximum number of unique symbols
    that can be used as dictionary keys, but this is not often an issue in practice. Generally these are not QNames or C function names, so this reduces the issue
    of running out of symbol names somewhat.
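    A minimal sketch of that layout (hypothetical names; a packed-search
    instruction would essentially run this loop several keys at a time):

        #include <stddef.h>
        #include <stdint.h>

        /* parallel arrays: 16-bit symbol keys, 64-bit (tagged) values */
        typedef struct {
            uint16_t *keys;
            uint64_t *vals;
            size_t    n;
        } dict16;

        static int dict16_lookup(const dict16 *d, uint16_t key, uint64_t *out) {
            for (size_t i = 0; i < d->n; i++) {
                if (d->keys[i] == key) { *out = d->vals[i]; return 1; }
            }
            return 0;   /* not found */
        }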

    One can also differ though on how much sense it makes to have
    ISA level helpers for working with tagrefs and similar (or, getting the
    ABI involved with these matters, like defining in the ABI the encodings
    for things like fixnum/flonum/etc).

    ...


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Oct 31 14:32:00 2025
    From Newsgroup: comp.arch

    On 10/31/2025 1:21 PM, BGB wrote:

    ...


    In a lot of the cases, I was using an 8-bit indexed color or color-cell mode. For indexed color, one needs to send each image through a palette conversion (to the OS color palette); or run a color-cell encoder.
    Mostly because the display HW used 128K of VRAM.

    And, even if RAM backed, there are bandwidth problems with going bigger;
    so higher-resolutions had typically worked to reduce the bits per pixel:
       320x200: 16 bpp
       640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
       800x600: 2 or 4 bpp color-cell
      1024x768: 1 bpp monochrome, other experiments (*1)
        Or, use the 2 bpp mode, for 192K.

    *1: Bayer Pattern Mode/Logic (where the pattern of pixels also encodes
    the color);
    One possibility also being to use an indexed color pair for every 8x8, allowing for a 1.25 bpp color cell mode.



    Expanding on this:
    Idea 1, original:
    Each group of 2x2 pixels understood as:
    G R
    B G
    With each pixel alternating color.

    But, slightly better for quality is to operate on blocks of 4x4 pixels,
    with the pixel bits encoding color indirectly for the whole 4x4 block:
    G R G B
    B G R G
    G R G B
    B G R G
    So, if >= 4 G bits are set, G is High.
    So, if >= 2 R bits are set, R is High.
    So, if >= 2 B bits are set, B is High.
    If > 8 bits are set, I is high.

    The non-set pixels usually assuming either 0000 (Black) or 1000 (Dark
    Grey) depending on I bit. Or, a low intensity version of the main color
    if over 75% of a given bit are set in a given way (say, for mostly flat
    color blocks).

    Still kinda sucks, but allows a crude approximation of 16 color graphics
    at 1 bpp...
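    A sketch of the block-decode side of that scheme (assuming the 16 pattern
    bits are stored raster-order with the MSB at the top-left, as in the 2 bpp
    mode; the output packing is arbitrary here):

        #include <stdint.h>

        static int popcount16(uint16_t x) {
            int n = 0;
            while (x) { n += x & 1; x >>= 1; }
            return n;
        }

        /* masks follow the G R G B / B G R G layout above */
        static uint8_t decode_block_irgb(uint16_t bits) {
            int g = popcount16(bits & 0xA5A5) >= 4;   /* 8 G positions */
            int r = popcount16(bits & 0x4242) >= 2;   /* 4 R positions */
            int b = popcount16(bits & 0x1818) >= 2;   /* 4 B positions */
            int i = popcount16(bits) > 8;             /* intensity */
            return (uint8_t)((i << 3) | (r << 2) | (g << 1) | b);
        }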


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Oct 31 21:09:23 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:
    Improves the accuracy? of algorithms, but seems a bit specific to me.

    It is down in the 1% footprint area.

    Are there other instruction sequence where double-rounding would be good
    to avoid?

    Back when I joined Moto (1983) there was a lot of talk about double
    roundings and how it could screw up various algorithms but mainly in
    the 64-bit versus 80-bit stuff of 68881, where you got 11-more bits
    of precision and thus took a change of 2/2^10 of a double rounding.
    Today with 32-bit versus 64-bit you take a chance of 2/2^28 so the
    problem is greatly ameliorated although technically still present.

    Actually, for the five required basic operations, you can always do the
    op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    This is because the mantissa lengths (including the hidden bit) increase
    to at least 2n+2:

    f16 1:5:10 (1+10=11, 11*2+2 = 24)
    f32 1:8:23 (1+23=24, 24*2+2 = 50)
    f64 1:11:52 (1+52=53, 53*2+2 = 108)
    f128 1:15:112 (1+112=113)

    You can however NOT use f128 FMUL + FADD to emulate f64 FMAC, since that
    would require a triple sized mantissa.
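    As a quick empirical sanity check of that claim for binary32 via binary64
    (53 >= 2*24+2), a brute-force test like the following should never report
    a mismatch for the basic operations; shown for addition over random bit
    patterns (a spot check, not a proof; assumes IEEE formats and
    round-to-nearest):

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        static float rnd_float(uint64_t *s) {
            *s = *s * 6364136223846793005ull + 1442695040888963407ull;
            uint32_t bits = (uint32_t)(*s >> 32);
            float f;
            memcpy(&f, &bits, 4);
            return f;
        }

        int main(void) {
            uint64_t s = 12345;
            for (long i = 0; i < 100000000; i++) {
                float a = rnd_float(&s), b = rnd_float(&s);
                float direct = a + b;                          /* rounded once to f32 */
                float via_d  = (float)((double)a + (double)b); /* f64 op, then rounded down */
                if (direct != direct || via_d != via_d)
                    continue;                                  /* skip NaN results */
                if (memcmp(&direct, &via_d, 4) != 0)
                    printf("mismatch: %a + %a\n", a, b);
            }
            return 0;
        }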

    The Intel+Motorola 80-bit format was a bastard that made it effectively impossible to produce bit-for-bit identical results even when the FPU
    was set to 64-bit precision.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Oct 31 21:12:45 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Thu, 30 Oct 2025 16:46:14 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:


    Alpha avoids wasting register bits for some idioms by keeping up to 8
    bytes in a register in SIMD style (a few years before the wave of SIMD
    extensions across the industry), but still provides no direct name for
    the individual bytes of a register.


    According to my understanding, EV4 had no SIMD-style instructions.
    They were introduced in EV5 (Jan 1995). Which makes it only ~6 months
    ahead of VIS in UltraSPARC.

    The original (v1?) Alpha had instructions intended to make it "easy" to process character data in 8-byte chunks inside a register.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 1 18:19:48 2025
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:
    Improves the accuracy? of algorithms, but seems a bit specific to me.

    It is down in the 1% footprint area.

    Are there other instruction sequence where double-rounding would be good >> to avoid?

    Back when I joined Moto (1983) there was a lot of talk about double roundings and how it could screw up various algorithms but mainly in
    the 64-bit versus 80-bit stuff of 68881, where you got 11-more bits
    of precision and thus took a change of 2/2^10 of a double rounding.
    Today with 32-bit versus 64-bit you take a chance of 2/2^28 so the
    problem is greatly ameliorated although technically still present.

    Actually, for the five required basic operations, you can always do the
    op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    This is because the mantissa lengths (including the hidden bit) increase
    to at least 2n+2:

    f16 1:5:10 (1+10=11, 11*2+2 = 24)
    f32 1:8:23 (1+23=24, 24*2+2 = 50)
    f64 1:11:52 (1+52=53, 53*2+2 = 108)
    f128 1:15:112 (1+112=113)

    You can however NOT use f128 FMUL + FADD to emulate f64 FMAC, since that would require a triple sized mantissa.

    The Intel+Motorola 80-bit format was a bastard that made it effectively impossible to produce bit-for-bit identical results even when the FPU
    was set to 64-bit precision.

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Nov 1 19:18:39 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Intel split AX into AL and AH, similar for BX, CX, and DX, but not for
    SI, DI, BP, and SP.

    {ABCD}X registers were data.
    {SDBS} registers were pointer registers.

    The 8086 is no 68000. The [BX] addressing mode makes it obvious that
    that's not the case.

    What is actually the case: AL-DL, AH-DH correspond to 8-bit registers
    of the 8080, some of AX-DX correspond to register pairs. SI, DI, BP
    are new, SP corresponds to the 8080 SP, which does not have 8-bit
    components. That's why SI, DI, BP, SP have no low or high
    sub-registers.

    Oh and BTW:: using x86-history as justification for an architectural
    feature is "bad style".

    I think that we can learn a lot from earlier architectures, some
    things to adopt and some things to avoid. Concerning subregisters, I
    lean towards avoiding.

    That's also another reason to avoid load-and-op and RMW instructions.
    With a load/store architecture, load can sign/zero extend as
    necessary, and then most operations can be done at full width.

    But gains the property that the whole register contains 1 proper value >{range-limited to the container size whence it came} This in turn makes >tracking values easy--in fact placing several different sized values
    in a single register makes it essentially impossible to perform value >analysis in the compiler.

    I don't think it's impossible or particularly hard for the compiler. Implementing it in OoO hardware causes complications, though.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Nov 1 21:08:35 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always do the
    op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Nov 2 02:21:18 2025
    From Newsgroup: comp.arch

    On 10/31/2025 2:32 PM, BGB wrote:
    On 10/31/2025 1:21 PM, BGB wrote:

    ...


    In a lot of the cases, I was using an 8-bit indexed color or color-
    cell mode. For indexed color, one needs to send each image through a
    palette conversion (to the OS color palette); or run a color-cell
    encoder. Mostly because the display HW used 128K of VRAM.

    And, even if RAM backed, there are bandwidth problems with going
    bigger; so higher-resolutions had typically worked to reduce the bits
    per pixel:
        320x200: 16 bpp
        640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
        800x600: 2 or 4 bpp color-cell
       1024x768: 1 bpp monochrome, other experiments (*1)
         Or, use the 2 bpp mode, for 192K.

    *1: Bayer Pattern Mode/Logic (where the pattern of pixels also encodes
    the color);
    One possibility also being to use an indexed color pair for every 8x8,
    allowing for a 1.25 bpp color cell mode.



    Expanding on this:
    Idea 1, original:
    Each group of 2x2 pixels understood as:
      G R
      B G
    With each pixel alternating color.

    But, slightly better for quality is to operate on blocks of 4x4 pixels,
    with the pixel bits encoding color indirectly for the whole 4x4 block:
      G R G B
      B G R G
      G R G B
      B G R G
    So, if >= 4 G bits are set, G is High.
    So, if >= 2 R bits are set, R is High.
    So, if >= 2 B bits are set, B is High.
    If > 8 bits are set, I is high.

    The non-set pixels usually assuming either 0000 (Black) or 1000 (Dark
    Grey) depending on I bit. Or, a low intensity version of the main color
    if over 75% of a given bit are set in a given way (say, for mostly flat color blocks).

    Still kinda sucks, but allows a crude approximation of 16 color graphics
    at 1 bpp...


    Well, anyways, here is me testing with another variation of the idea
    (after thinking about it again).

    Using a joke image as a test case here...

    https://x.com/cr88192/status/1984694932666261839

    This variation uses:
    Y R
    B G

    In this case tiling as:
    Y R Y R ...
    B G B G ...
    Y R Y R ...
    B G B G ...
    ...

    Where, Y is a pure luma value.
    May or may not use this, or:
    Y R B G Y R B G
    B G Y R B G Y R
    ...
    But, prior pattern is simpler to deal with.

    Note that having every line follow the same pattern (with no
    alternation) would lead to obvious vertical lines in the output.


    With a different (slightly more complicated color recovery algorithm),
    and was operating on 8x8 pixel blocks.

    With 4x4, there is effectively 4 bits per channel, which is enough to
    recover 1 bit of color per channel.

    With 8x8, there are 16 bits, and it is possible to recover ~ 3 bits per channel, allowing for roughly a RGB333 color space (though, the vectors
    are normalized here).

    Having both a Y and G channel slightly helps with the color-recovery
    process; and allows a way to signal a monochrome block (if Y==G, the
    block is assumed to be monochrome, and the R/B bits can be used more
    freely for expressing luma).

    Where:
    Chroma accuracy comes at the expense of luma accuracy;
    An increased colorspace comes at the cost of spatial resolution of chroma;
    ...


    Dealing with chroma does have the effect of making the dithering process
    more complicated. As noted, reliable recovery of the color vector is
    itself a bit fiddly (and is very sensitive to the encoder side dither process).

    The former image was itself an example of an artifact caused by the
    dithering process, which in this case was over-boosting the green
    channel (and rotating the dither matrix would result in drastic color
    shifts). The latter image was mostly after I realized the issue with the
    dither pattern, and modified how it was being handled (replacing the use
    of an 8x8 ordered dither with a 4x4 ordered dither, and then rotating
    the matrix for each channel).
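
    For reference, a minimal C sketch of a 4x4 ordered (Bayer) dither; the
    matrix is the standard one, and the per-channel offset is only a
    stand-in for the per-channel matrix rotation described above, not the
    actual encoder logic:

    static const unsigned char bayer4[4][4] = {
        {  0,  8,  2, 10 },
        { 12,  4, 14,  6 },
        {  3, 11,  1,  9 },
        { 15,  7, 13,  5 },
    };

    /* Threshold one 8-bit channel value down to a single bit. */
    static int dither_bit(int x, int y, int value, int chan)
    {
        int t = bayer4[(y + 2 * chan) & 3][(x + chan) & 3];
        return value > (t * 255 + 8) / 16;
    }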


    Image quality isn't great, but then again, not sure how to do that much
    better with a naive 1 bit/pixel encoding.


    I guess, an open question here is whether the color-recovery algorithm
    would be practical for hardware / FPGA.

    One possible approach could be:
    Use LUT4 to map 4b -> 2b (as a count)
    Then, map 2x2b -> 3b (adder)
    Then, map 2x3b -> 4b (adder), then discard LSB.
    Then, select max or R/G/B/Y;
    This is used as an inverse normalization scale.
    Feed each value and scale through a LUT (for R/G/B)
    Getting a 5-bit scaled RGB;
    Roughly: (Val<<5)/Max
    Compose an RGB555 value (from the three 5-bit components) used for each pixel that is set.
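
    As a software model of the same recovery step, a sketch under assumptions:
    one 64-bit word per 8x8 block, with bit 0 taken here as the upper-left
    pixel (the actual 1bpp cells put the upper-left corner in the MSB, so a
    real decoder would index accordingly), the Y R / B G tiling above, and an
    (r<<10)|(g<<5)|b RGB555 layout. Not the actual HW logic:

    #include <stdint.h>

    static uint16_t recover_block_color(uint64_t bits)
    {
        int cy = 0, cr = 0, cb = 0, cg = 0;
        for (int y = 0; y < 8; y++) {
            for (int x = 0; x < 8; x++) {
                if (!((bits >> (y * 8 + x)) & 1))
                    continue;
                if (!(y & 1)) { if (!(x & 1)) cy++; else cr++; }  /* Y R rows */
                else          { if (!(x & 1)) cb++; else cg++; }  /* B G rows */
            }
        }
        int max = cy;                       /* inverse normalization scale */
        if (cr > max) max = cr;
        if (cg > max) max = cg;
        if (cb > max) max = cb;
        if (max == 0)
            return 0;                       /* empty block: black */
        int r = (cr << 5) / max; if (r > 31) r = 31;   /* ~ (Val<<5)/Max */
        int g = (cg << 5) / max; if (g > 31) g = 31;
        int b = (cb << 5) / max; if (b > 31) b = 31;
        return (uint16_t)((r << 10) | (g << 5) | b);
    }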

    The actual pixel decoding process works the same as with 8x8 blocks of 1-bit monochrome, selecting the minimum or maximum color based on each bit.

    Possibly, Y could also be used to select "relative" minimum and maximum values, vs full intensity and black, but this would add more logic
    complexity.


    Pros/Cons:
    +: Looks better than per-pixel Bayer-RGB
    +: Looks better than 4x4 RGBI
    -: Would require more complex decoder logic;
    -: Requires specialized dither logic to not look like broken crap.
    -: Doesn't give passable results if handed naive grayscale dithering.

    Per-Pixel RGB still holds up OK with naive grayscale dither.
    But, this approach is a lot more particular.

    The RGBI approach seems intermediate, more likely to decode grayscale
    patterns as gray.



    I guess a more open question is if such a thing could be useful (it is
    pretty far down the image-quality scale). But, OTOH, with simpler
    (non-randomized) dither patterns, it can LZ compress OK (depending on
    image, can get 0.1 to 0.8 bpp, which is generally JPEG territory).

    If combined with delta encoding or similar; could almost be adapted into
    a very crappy video codec.

    Well, or LZ4, where (at 320x200) one could potentially hold several
    frames of video in a 64K sliding window.

    But, image quality might be unacceptably poor. Also if decoded in
    software, the color-reconstruction is likely to be more computationally expensive than just using a CRAM style codec (while also giving worse
    image quality).


    More just interesting that I was able to get things "almost half-way
    passable" from 1 bpp monochrome.


    ...



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Nov 2 11:36:36 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always do the
    op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).

    Rounding to odd is basically the same as rounding to sticky, i.e. if
    there are any trailing 1 bits in the exact result, then put a 1 in the
    ulp position.
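
    A minimal C sketch of that rule (assuming the exact result is held in a
    wide integer and dropped_bits is less than 64; purely illustrative):

    #include <stdint.h>

    static uint64_t round_to_odd(uint64_t exact, unsigned dropped_bits)
    {
        uint64_t kept = exact >> dropped_bits;
        uint64_t lost = exact & ((1ULL << dropped_bits) - 1);
        if (lost)
            kept |= 1;      /* the "stickiness" lands in the ulp position */
        return kept;
    }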

    We have known since before the 1978 ieee754 standard that guard+sticky
    (plus sign and ulp) is enough to get the rounding correct in all modes.

    The single exception is when rounding up from the maximum magnitude
    value to inf should be suppressed, there you do in fact need to check
    all the bits.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Nov 2 15:56:12 2025
    From Newsgroup: comp.arch

    On Sun, 2 Nov 2025 11:36:36 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always
    do the op in the next higher precision, then round again down to
    the target, and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit floating point arithmetic, for that very reason (I assume).

    Rounding to odd is basically the same as rounding to sticky, i.e if
    there are any trailing 1 bits in the exact result, then put that in
    the ulp position.

    We have known since before the 1978 ieee754 standard that
    guard+sticky (plus sign and ulp) is enough to get the rounding
    correct in all modes.

    The single exception is when rounding up from the maximum magnitude
    value to inf should be suppressed, there you do in fact need to check
    all the bits.

    Terje


    People use names like guard and sticky bits, and sometimes also rounding
    bit (e.g. in the Wikipedia article), without explanation, as if everybody
    had agreed about what they mean. But I don't think that everybody
    really agrees.

    Shockingly, the absence of strict definitions applies even to the most
    widely cited article, David Goldberg's "What Every Computer Scientist
    Should Know About Floating-Point Arithmetic". It seems people copy the
    name of the article from one another, but only a very small fraction of
    them have bothered to actually read it.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 2 09:39:27 2025
    From Newsgroup: comp.arch

    Contemplating having conditional branch instructions branch to a target
    value in a register instead of using a displacement.

    I think this has about the same code density as having a branch to a displacement from the IP.

    Using a fused compare-and-branch instruction for Qupls4, there is not
    enough room in the instruction for a large branch displacement (10
    bits). So, my thought is to branch to a register value instead.
    There is already an add-to-instruction-pointer instruction that can be
    used to generate relative addresses.

    By moving the register load outside of a loop, the dynamic instruction
    count can be reduced. I think this solution is a bit better than having compare and branch as two separate instructions, or having an extended constant added to the branch instruction.

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    The 10-bit displacement format could also be supported, but it is yet
    another branch instruction format. I may leave holes in the instruction
    set for future support, but I think it is best to start with just a
    single format.

    Code:
    AIPSI R3,1234 ; add displacement to IP and store in R3 (hoist-able)
    BLT R1,R2,R3 ; branch to R3 if R1 < R2

    Versus:
    CMP R3,R1,R2
    BLT R3,displacement


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 2 10:06:42 2025
    From Newsgroup: comp.arch

    On 2025-11-02 3:21 a.m., BGB wrote:
    On 10/31/2025 2:32 PM, BGB wrote:
    On 10/31/2025 1:21 PM, BGB wrote:

    ...


    In a lot of the cases, I was using an 8-bit indexed color or color-
    cell mode. For indexed color, one needs to send each image through a
    palette conversion (to the OS color palette); or run a color-cell
    encoder. Mostly because the display HW used 128K of VRAM.

    And, even if RAM backed, there are bandwidth problems with going
    bigger; so higher-resolutions had typically worked to reduce the bits
    per pixel:
        320x200: 16 bpp
        640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
        800x600: 2 or 4 bpp color-cell
       1024x768: 1 bpp monochrome, other experiments (*1)
         Or, use the 2 bpp mode, for 192K.

    *1: Bayer Pattern Mode/Logic (where the pattern of pixels also
    encodes the color);
    One possibility also being to use an indexed color pair for every
    8x8, allowing for a 1.25 bpp color cell mode.



    Expanding on this:
    Idea 1, original:
    Each group of 2x2 pixels understood as:
       G R
       B G
    With each pixel alternating color.

    But, slightly better for quality is to operate on blocks of 4x4
    pixels, with the pixel bits encoding color indirectly for the whole
    4x4 block:
       G R G B
       B G R G
       G R G B
       B G R G
    So, if >= 4 G bits are set, G is High.
    So, if >= 2 R bits are set, R is High.
    So, if >= 2 B bits are set, B is High.
    If > 8 bits are set, I is high.

    The non-set pixels usually assuming either 0000 (Black) or 1000 (Dark
    Grey) depending on I bit. Or, a low intensity version of the main
    color if over 75% of a given bit are set in a given way (say, for
    mostly flat color blocks).

    Still kinda sucks, but allows a crude approximation of 16 color
    graphics at 1 bpp...


    Well, anyways, here is me testing with another variation of the idea
    (after thinking about it again).

    Using a joke image as a test case here...

    https://x.com/cr88192/status/1984694932666261839

    This variation uses:
      Y R
      B G

    In this case tiling as:
      Y R Y R ...
      B G B G ...
      Y R Y R ...
      B G B G ...
      ...

    Where, Y is a pure luma value.
      May or may not use this, or:
        Y R B G Y R B G
        B G Y R B G Y R
        ...
      But, prior pattern is simpler to deal with.

    Note that having every line follow the same pattern (with no
    alternation) would lead to obvious vertical lines in the output.


    With a different (slightly more complicated color recovery algorithm),
    and was operating on 8x8 pixel blocks.

    With 4x4, there is effectively 4 bits per channel, which is enough to recover 1 bit of color per channel.

    With 8x8, there are 16 bits, and it is possible to recover ~ 3 bits per channel, allowing for roughly a RGB333 color space (though, the vectors
    are normalized here).

    Having both a Y and G channel slightly helps with the color-recovery process; and allows a way to signal a monochrome block (if Y==G, the
    block is assumed to be monochrome, and the R/B bits can be used more
    freely for expressing luma).

    Where:
    Chroma accuracy comes at the expense of luma accuracy;
    An increased colorspace comes at the cost of spatial resolution of chroma; ...


    Dealing with chroma does have the effect of making the dithering process more complicated. As noted, reliable recovery of the color vector is
    itself a bit fiddly (and is very sensitive to the encoder side dither process).

    The former image was itself an example of an artifact caused by the dithering process, which in this case was over-boosting the green
    channel (and rotating the dither matrix would result in drastic color shifts). The later image was mostly after I realized the issue with the dither pattern, and modified how it was being handled (replacing the use
    of an 8x8 ordered dither with a 4x4 ordered dither, and then rotating
    the matrix for each channel).


    Image quality isn't great, but then again, not sure how to do that much better with a naive 1 bit/pixel encoding.


    I guess, an open question here is whether the color-recovery algorithm
    would be practical for hardware / FPGA.

    One possible could be:
      Use LUT4 to map 4b -> 2b (as a count)
      Then, map 2x2b -> 3b (adder)
      Then, map 2x3b -> 4b (adder), then discard LSB.
      Then, select max or R/G/B/Y;
        This is used as an inverse normalization scale.
      Feed each value and scale through a LUT (for R/G/B)
        Getting a 5-bit scaled RGB;
        Roughly: (Val<<5)/Max
      Compose a 5-bit RGB555 value used for each pixel that is set.

    Actual pixel decoding process works the same as with 8x8 blocks of 1 bit monochome, selecting minimum or maximum color based on each bit.

    Possibly, Y could also be used to select "relative" minimum and maximum values, vs full intensity and black, but this would add more logic complexity.


    Pros/Cons:
      +: Looks better than per-pixel Bayer-RGB
      +: Looks better than 4x4 RGBI
      -: Would require more complex decoder logic;
      -: Requires specialized dither logic to not look like broken crap.
      -: Doesn't give passable results if handed naive grayscale dithering.

    Per-Pixel RGB still holds up OK with naive grayscale dither.
    But, this approach is a lot more particular.

    the RGBI approach seems intermediate, more likely to decode grayscale patterns as gray.



    I guess a more open question is if such a thing could be useful (it is pretty far down the image-quality scale). But, OTOH, with simpler (non- randomized) dither patterns; it can LZ compress OK (depending on image,
    can get 0.1 to 0.8 bpp; which is generally JPEG territory).

    If combined with delta encoding or similar; could almost be adapted into
    a very crappy video codec.

    Well, or LZ4, where (at 320x200) one could potentially hold several
    frames of video in a 64K sliding window.

    But, image quality might be unacceptably poor. Also if decoded in
    software, the color-reconstruction is likely to be more computationally expensive than just using a CRAM style codec (while also giving worse
    image quality).


    More just interesting that I was able to get things "almost half-way passable" from 1 bpp monochrome.


    ...



    I think your support for graphics is interesting; something to keep in
    mind for displays with limited RAM.

    I use a high-speed DDR memory interface and video fifo (line cache).
    Colors are broken into components, with the number of bits per
    component (up to 10) specified in CRs. Colors are passed around as 32-bit values
    for video processing. Using the colors directly is much easier than
    dealing with dithered colors.
    The graphics accelerator just spits out colors to the frame buffer
    without needing to go through a dithering stage.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Nov 2 16:09:10 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Sun, 2 Nov 2025 11:36:36 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always
    do the op in the next higher precision, then round again down to
    the target, and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).

    Rounding to odd is basically the same as rounding to sticky, i.e if
    there are any trailing 1 bits in the exact result, then put that in
    the ulp position.

    We have known since before the 1978 ieee754 standard that
    guard+sticky (plus sign and ulp) is enough to get the rounding
    correct in all modes.

    The single exception is when rounding up from the maximum magnitude
    value to inf should be suppressed, there you do in fact need to check
    all the bits.

    Terje


    People use names like guard and sticky bits and sometimes also rounding
    bit (e.g. in Wikipedia article) without explanation, as if everybody
    had agreed about what they mean. But I don't think that everybody
    really agree.

    Within the 754 working group the definition is totally clear:

    Guard is the first bit after the normal mantissa.

    Sticky is the bit following the guard bit, it is generated by OR'ing
    together all subsequent bits in the exact/infinitely precise result.

    I.e if an exact result is exactly halfway between two representable
    numbers, the Guard bit will be set and Sticky unset.

    Ulp (Unit in Last Place) is the final mantissa bit.

    Sign is of course the sign in the Sign-Magnitude format used for all fp numbers.

    This means that those four bits in combination suffice to distinguish
    between rounding directions:

    Default rounding is nearest-even: (In this case Sign does not matter.)

    Ulp | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
    Guard | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
    Sticky | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |

    Round | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
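
    A minimal C sketch of that Round row, assuming the mantissa is held as a
    plain integer and the caller renormalizes on any carry-out (illustrative
    only):

    #include <stdint.h>

    static uint64_t round_nearest_even(uint64_t mantissa, int guard, int sticky)
    {
        int ulp = (int)(mantissa & 1);
        int round_up = guard && (sticky || ulp);   /* matches the table above */
        return mantissa + (uint64_t)round_up;
    }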

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Nov 2 18:14:54 2025
    From Newsgroup: comp.arch

    On Sun, 2 Nov 2025 16:09:10 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Sun, 2 Nov 2025 11:36:36 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always
    do the op in the next higher precision, then round again down to
    the target, and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its
    128-bit floating point arithmetic, for that very reason (I
    assume).

    Rounding to odd is basically the same as rounding to sticky, i.e if
    there are any trailing 1 bits in the exact result, then put that in
    the ulp position.

    We have known since before the 1978 ieee754 standard that
    guard+sticky (plus sign and ulp) is enough to get the rounding
    correct in all modes.

    The single exception is when rounding up from the maximum magnitude
    value to inf should be suppressed, there you do in fact need to
    check all the bits.

    Terje


    People use names like guard and sticky bits and sometimes also
    rounding bit (e.g. in Wikipedia article) without explanation, as if everybody had agreed about what they mean. But I don't think that
    everybody really agree.

    Within the 754 working group the definition is totally clear:


    I could believe that there is consensus about these names among
    current members of the 754 working group. But nothing of that sort is
    mentioned in the text of the Standard, which among other things means
    that you cannot rely on being understood even by new members of the 754
    working group.

    Guard is the first bit after the normal mantissa.

    Sticky is the bit following the guard bit, it is generated by OR'ing together all subsequent bits in the exact/infinitely precise result.

    I.e if an exact result is exactly halfway between two representable
    numbers, the Guard bit will be set and Sticky unset.

    Ulp (Unit in Last Place)) is the final mantissa bit

    Sign is of course the sign in the Sign-Magnitude format used for all
    fp numbers.

    This means that those four bits in combination suffices to separate
    between rounding directions:

    Default rounding is nearest or even: (In this case Sign does not
    matter.)

    Ulp | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
    Guard | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
    Sticky | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |

    Round | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |

    Terje


    I mostly use ULP/Guard/Sticky in the same meaning. Except when I use
    them, esp. Guard, differently.
    Given the choice, [in the context of binary floating point] I'd rather
    not use the term 'guard' at all. Names like 'rounding bit' or
    'half-ULP' are far more self-describing.







    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sun Nov 2 20:19:10 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Sun, 2 Nov 2025 16:09:10 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Michael S wrote:
    On Sun, 2 Nov 2025 11:36:36 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always
    do the op in the next higher precision, then round again down to
    the target, and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its
    128-bit floating point arithmetic, for that very reason (I
    assume).

    Rounding to odd is basically the same as rounding to sticky, i.e if
    there are any trailing 1 bits in the exact result, then put that in
    the ulp position.

    We have known since before the 1978 ieee754 standard that
    guard+sticky (plus sign and ulp) is enough to get the rounding
    correct in all modes.

    The single exception is when rounding up from the maximum magnitude
    value to inf should be suppressed, there you do in fact need to
    check all the bits.

    Terje


    People use names like guard and sticky bits and sometimes also
    rounding bit (e.g. in Wikipedia article) without explanation, as if
    everybody had agreed about what they mean. But I don't think that
    everybody really agree.

    Within the 754 working group the definition is totally clear:


    I could believe that there is consensus about these names between
    current members of 754 working group. But nothing of that sort is
    mentioned in the text of the Standard. Which among other things means
    that you can not rely on being understood even by new members of 754
    working group.

    Guard is the first bit after the normal mantissa.

    Sticky is the bit following the guard bit, it is generated by OR'ing
    together all subsequent bits in the exact/infinitely precise result.

    I.e if an exact result is exactly halfway between two representable
    numbers, the Guard bit will be set and Sticky unset.

    Ulp (Unit in Last Place)) is the final mantissa bit

    Sign is of course the sign in the Sign-Magnitude format used for all
    fp numbers.

    This means that those four bits in combination suffices to separate
    between rounding directions:

    Default rounding is nearest or even: (In this case Sign does not
    matter.)

    Ulp | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
    Guard | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
    Sticky | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |

    Round | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |

    Terje


    I mostly use ULP/Guard/Sticky in the same meaning. Except when I use
    them, esp. Guard, differently.
    Given the choice, [in the context of binary floating point] I'd rather
    not use the term 'guard' at all. Names like 'rounding bit' or
    'half-ULP' are far more self-describing.

    Guard also works for decimal FP, where you need a single Sticky bit if
    the Guard digit is equal to 5. If you work with the binary
    representation for decimal, then you just need two extra bits, just like
    BFP.

    Correct rounding also works when Guard temporarily contains more than one
    bit, possibly due to normalization, but you would normally squash this
    down to (Guard, Sticky) by OR'ing any secondary guard bits into Sticky.
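
    As a sketch of the decimal case (one guard digit plus a sticky bit; the
    significand is assumed to be held as a plain integer, the caller handles
    any carry-out, and the names are illustrative rather than from any
    particular decimal-FP implementation):

    #include <stdint.h>

    static uint64_t dec_round_nearest_even(uint64_t sig, int guard_digit, int sticky)
    {
        if (guard_digit > 5 ||
            (guard_digit == 5 && (sticky || (sig & 1))))
            sig += 1;
        return sig;
    }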

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Nov 2 14:58:36 2025
    From Newsgroup: comp.arch

    On 11/2/2025 9:06 AM, Robert Finch wrote:
    On 2025-11-02 3:21 a.m., BGB wrote:
    On 10/31/2025 2:32 PM, BGB wrote:
    On 10/31/2025 1:21 PM, BGB wrote:

    ...


    In a lot of the cases, I was using an 8-bit indexed color or color-
    cell mode. For indexed color, one needs to send each image through a
    palette conversion (to the OS color palette); or run a color-cell
    encoder. Mostly because the display HW used 128K of VRAM.

    And, even if RAM backed, there are bandwidth problems with going
    bigger; so higher-resolutions had typically worked to reduce the
    bits per pixel:
        320x200: 16 bpp
        640x400: 4 bpp (color cell), 8 bpp (uses 256K, sorta works)
        800x600: 2 or 4 bpp color-cell
       1024x768: 1 bpp monochrome, other experiments (*1)
         Or, use the 2 bpp mode, for 192K.

    *1: Bayer Pattern Mode/Logic (where the pattern of pixels also
    encodes the color);
    One possibility also being to use an indexed color pair for every
    8x8, allowing for a 1.25 bpp color cell mode.



    Expanding on this:
    Idea 1, original:
    Each group of 2x2 pixels understood as:
       G R
       B G
    With each pixel alternating color.

    But, slightly better for quality is to operate on blocks of 4x4
    pixels, with the pixel bits encoding color indirectly for the whole
    4x4 block:
       G R G B
       B G R G
       G R G B
       B G R G
    So, if >= 4 G bits are set, G is High.
    So, if >= 2 R bits are set, R is High.
    So, if >= 2 B bits are set, B is High.
    If > 8 bits are set, I is high.

    The non-set pixels usually assuming either 0000 (Black) or 1000 (Dark
    Grey) depending on I bit. Or, a low intensity version of the main
    color if over 75% of a given bit are set in a given way (say, for
    mostly flat color blocks).

    Still kinda sucks, but allows a crude approximation of 16 color
    graphics at 1 bpp...


    Well, anyways, here is me testing with another variation of the idea
    (after thinking about it again).

    Using a joke image as a test case here...

    https://x.com/cr88192/status/1984694932666261839

    This variation uses:
       Y R
       B G

    In this case tiling as:
       Y R Y R ...
       B G B G ...
       Y R Y R ...
       B G B G ...
       ...

    Where, Y is a pure luma value.
       May or may not use this, or:
         Y R B G Y R B G
         B G Y R B G Y R
         ...
       But, prior pattern is simpler to deal with.

    Note that having every line follow the same pattern (with no
    alternation) would lead to obvious vertical lines in the output.


    With a different (slightly more complicated color recovery algorithm),
    and was operating on 8x8 pixel blocks.

    With 4x4, there is effectively 4 bits per channel, which is enough to
    recover 1 bit of color per channel.

    With 8x8, there are 16 bits, and it is possible to recover ~ 3 bits
    per channel, allowing for roughly a RGB333 color space (though, the
    vectors are normalized here).

    Having both a Y and G channel slightly helps with the color-recovery
    process; and allows a way to signal a monochrome block (if Y==G, the
    block is assumed to be monochrome, and the R/B bits can be used more
    freely for expressing luma).

    Where:
    Chroma accuracy comes at the expense of luma accuracy;
    An increased colorspace comes at the cost of spatial resolution of
    chroma;
    ...


    Dealing with chroma does have the effect of making the dithering
    process more complicated. As noted, reliable recovery of the color
    vector is itself a bit fiddly (and is very sensitive to the encoder
    side dither process).

    The former image was itself an example of an artifact caused by the
    dithering process, which in this case was over-boosting the green
    channel (and rotating the dither matrix would result in drastic color
    shifts). The later image was mostly after I realized the issue with
    the dither pattern, and modified how it was being handled (replacing
    the use of an 8x8 ordered dither with a 4x4 ordered dither, and then
    rotating the matrix for each channel).


    Image quality isn't great, but then again, not sure how to do that
    much better with a naive 1 bit/pixel encoding.


    I guess, an open question here is whether the color-recovery algorithm
    would be practical for hardware / FPGA.

    One possible could be:
       Use LUT4 to map 4b -> 2b (as a count)
       Then, map 2x2b -> 3b (adder)
       Then, map 2x3b -> 4b (adder), then discard LSB.
       Then, select max or R/G/B/Y;
         This is used as an inverse normalization scale.
       Feed each value and scale through a LUT (for R/G/B)
         Getting a 5-bit scaled RGB;
         Roughly: (Val<<5)/Max
       Compose a 5-bit RGB555 value used for each pixel that is set.

    Actual pixel decoding process works the same as with 8x8 blocks of 1
    bit monochome, selecting minimum or maximum color based on each bit.

    Possibly, Y could also be used to select "relative" minimum and
    maximum values, vs full intensity and black, but this would add more
    logic complexity.


    Pros/Cons:
       +: Looks better than per-pixel Bayer-RGB
       +: Looks better than 4x4 RGBI
       -: Would require more complex decoder logic;
       -: Requires specialized dither logic to not look like broken crap.
       -: Doesn't give passable results if handed naive grayscale dithering.
    Per-Pixel RGB still holds up OK with naive grayscale dither.
    But, this approach is a lot more particular.

    the RGBI approach seems intermediate, more likely to decode grayscale
    patterns as gray.



    I guess a more open question is if such a thing could be useful (it is
    pretty far down the image-quality scale). But, OTOH, with simpler
    (non- randomized) dither patterns; it can LZ compress OK (depending on
    image, can get 0.1 to 0.8 bpp; which is generally JPEG territory).

    If combined with delta encoding or similar; could almost be adapted
    into a very crappy video codec.

    Well, or LZ4, where (at 320x200) one could potentially hold several
    frames of video in a 64K sliding window.

    But, image quality might be unacceptably poor. Also if decoded in
    software, the color-reconstruction is likely to be more
    computationally expensive than just using a CRAM style codec (while
    also giving worse image quality).


    More just interesting that I was able to get things "almost half-way
    passable" from 1 bpp monochrome.


    ...



    I think your support for graphics is interesting; something to keep in
    mind for displays with limited RAM.

    I use a high-speed DDR memory interface and video fifo (line cache).
    Colors are broken into components specifying the number of bits per component (up to 10) in CRs. Colors are passed around as 32-bit values
    for video processing. Using the colors directly is much easier than
    dealing with dithered colors.
    The graphics accelerator just spits out colors to the frame buffer
    without needing to go through a dithering stage.



    No real need to go much beyond RGB555, as the FPGA boards have VGA DACs
    that generally fall below this (e.g., 4 bits/channel on the Nexys A7). And
    2 bits/channel for many VGA PMods (a PMod allows 8 IO pins, so RGB222+H/V
    Sync; otherwise 2 PMOD connectors are needed for the VGA). The usual
    workaround was to perform dithering while driving the VGA output (with
    ordered dither in the Verilog).

    But, yeah, even the theoretical framebuffer images generally look better
    than what one sees on actual monitors.

    Even then, modern LCD panels mostly can't display even full RGB24 color
    depth; more often it is 6-bit / channel or similar (then the panels
    dither for full 24). But, IIRC a lot of OLEDs are back up to full
    color-depth (but, OLEDs are more expensive and have often had
    notoriously short lifespans, ...).

    But, yeah, my current monitor seems to be LCD based.



    In my case, the video HW uses prefetch requests along a ring-bus, which
    goes to the L2 cache, and then to external RAM. It then works on the
    hope that the requests get around the bus and can be resolved in time.

    In this case, the memory works in a vaguely similar way to the CPU's L1
    caches (although with line-oriented access), and a module that
    translates this to color-values during screen refresh. General access
    pattern was built around "character cells".


    It can give stable results at 8MB/s to 16MB/s (with more glitches as it
    goes higher), but breaks down too much past this point.

    So, switching to a RAM-backed framebuffer didn't significantly increase
    the usable screen resolutions or color depths.

    Also, I am mostly limited to using either a 25 or 50 MHz pixel
    clock, so some timings were tweaked to fit this. This doesn't really fit
    standard VESA timings, but it seems like monitors can tolerate
    nonstandard timings and are limited mainly by their operating range.

    So, say:
     320x200 70Hz, 25MHz; 9 MB/s @ 16bpp (hi-color)
     640x400 70Hz, 25MHz; 9 MB/s @ 4bpp, 18 MB/s @ 8bpp
     640x480 60Hz, 50MHz; 9 MB/s @ 4bpp, 18 MB/s @ 8bpp
     800x600 72Hz, 50MHz; 8.6 MB/s @ 2bpp, 17 MB/s @ 4bpp
    1024x768 48Hz, 50MHz; 5 MB/s @ 1bpp, 10 MB/s @ 2bpp

    So, this implies that just running 1024x768 at 2bpp should be acceptable
    (even if it exceeds the usual 128K limit).
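
    (For reference, the MB/s figures above follow from width x height x
    refresh x bpp / 8, ignoring blanking; a trivial sketch of the arithmetic:)

    static double fb_mb_per_sec(int w, int h, double hz, int bpp)
    {
        /* e.g. 640x400 @ 70 Hz, 8 bpp -> ~17.9 MB/s */
        return (double)w * h * hz * bpp / 8.0 / 1000000.0;
    }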


    Earlier on, I had 800x600/36Hz and 1024x768/25Hz modes; these would have
    allowed 8bpp color, but are below the minimum refresh rate of most
    monitors (it seems VGA monitors don't like going below around 40Hz).


    Of these modes, 8bpp (Indexed color) is technically newest.
    Originally the graphics hardware was written for color-cell.

    Earliest design had 32-bit cells (for 8x8 pixels):
    10 bits: Glyph
    2x 6b color + Attrib (RGB222)
    2x 9b Color: RGB333

    This was later expanded first to 64b cells, then to 128b and 256b.
    Some control bits affect cell size.
    Also with the ability to specify 8x8 or 4x4 cells,
    where 4x4 cells reduce the effective resolution.
    In the bitmap modes:
      4x4 + 256b: 16bpp Hicolor
      4x4 + 128b: 8bpp Indexed
      4x4 +  64b: 4bpp RGBI (Alt2)
      8x8 + 256b: 4bpp RGBI (Alt1)
      8x8 + 128b: 2bpp (4-color, CGA-like)
        With a range of color palettes available (more than CGA).
        Black/White/Cyan/Magenta, Black/White/Red/Green, ...
        Black/White/DarkGray/LightGray, also with Green and Amber, ...
      8x8 +  64b: 1bpp (Monochrome)
        Can select between RGBI colors and some special sub-modes.
        The recent idea, if added to HW, would slot into this mode.
    The color-cell modes:
      8x8 + 256b: 4bpp (DXT1 like, 4x 4x4 cells per 256-bit cell)
      8x8 + 128b: 2bpp (2bpp cells)
        Each cell has 2x RGB555 colors, and 8x8x1 for pixel data.
        Had experimented with 8x RGB232; didn't catch on (looked terrible).
      8x8 +  64b: Text-Mode + Early Graphics (4x4 cells)


    Generally, the text mode operates in a 640x200 mode with 8x8 + 128b
    cells, so 32K of VRAM used (for 80x25 cells).

    The 640x200 mode is the same as 640x400 (for VGA) but with the vertical resolution halved. The 320x200 mode also halves the horizontal
    resolution (so 40x25 cells).


    In this case, a 40x25 color-cell mode (with 256-bit cells) could be used
    for graphics (32K). Early on, this was used as the graphics mode for
    Doom and similar, before I later expanded VRAM to 128K and switched to
    320x200 Hicolor.


    The bitmap modes are non-raster, generally with pixels packed into 8x8
    or 4x4 blocks.
    4x4:
    16bpp: pixels in raster order.
    8bpp: raster order, 32-bits per row
    4bpp: Raster order, 16-bits per row
    And, 8x8:
    4bpp: Takes 16bpp layout, splits each pixel into 2x2.
    2bpp: Takes 8bpp layout, splits each pixel into 2x2.
    1bpp: Raster order, 1bpp, but same order as text glyphs.
    With MSB in upper left, LSB in lower right.

    Can note that the 8x8x1b cells have the upper-left corner in the MSB.
    This differs from most other modes where the upper left corner is in the
    LSB (so, pixels flipped both horizontally and vertically).


    Can note that in this case, the video memory had several parts:
      VRAM / Framebuffer
        Note: Uses 64-bit word addressing.
      Font RAM: Stores character glyphs as 8x8 patterns.
        Originally, there was a FontROM, but I dropped this feature.
        This means BootROM needs to supply the initial glyph set.
        I went with 5x6 pixel cells in the ROM to save space,
        where 5x6 does ASCII well enough.
      Palette RAM: Stores 256x 16-bits (as RGB555).

    Though, TestKern typically uses what is effectively color-cell graphics
    for the text mode (so, just draws 8x8 pixel blocks for the character
    glyphs).


    All this differs notably from CGA/EGA/VGA, which had used mostly raster-ordered modes. Except for the oddity of bit-planes for 16 color
    modes in EGA and VGA.


    I did experiment with raster ordered modes which worked by effectively stretching the character cell horizontally while reducing vertical
    height to 1 pixel. Ended up not going with this, as it was prone to a lot
    more glitches with the screen refresh (turned out to be a lot more
    sensitive to timing than the use of 8x8 or 4x4 cells).

    But, since generally programs don't draw directly into VRAM, the use of non-raster VRAM is mostly less of an issue.


    Well, apart from the computational cost of converting from internal
    RGB555 frame-buffers. Though, part of the reason RGB555 ended up used so
    often was because it was faster to do RGB555 -> ColorCell encoding than
    8-bit indexed color to color-cell, as indexed color typically also
    requires a bunch of palette lookups (which could end up more expensive
    than the additional RAM bandwidth from the RGB555).

    Also, there isn't really a "good and simple" way to generalize 8-bit
    colors in a way that leads to acceptable image quality. Invariably, one
    ends up needing palettes or encoding schemes that are slightly irregular.



    For color-cell, there are different approaches depending on how fast it
    needs to be (a rough sketch of the "Faster" path follows below):
      Faster: Simply select minimum and maximum luma;
        Selector encoding is often via comparing against thresholds.
        Except on x86, where multiply+bias+shift is faster.
      Medium: Calculate along 4 axes in parallel;
        Select axis which gives highest contrast;
          Usually: Luma, Cyan, Magenta, Yellow.
        Adjust endpoints to better reflect standard deviation,
          vs simply min/max.
      Slower:
        Calculate centroid and mass distribution and similar;
        Better quality, more for offline / batch encoding.
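
    A minimal C sketch of the "Faster" path, under assumptions: a 4x4 block
    of RGB555 pixels in, two RGB555 endpoints and 16 one-bit selectors out;
    the luma weights and names here are illustrative, not the actual encoder:

    #include <stdint.h>

    static void cc_encode_fast(const uint16_t px[16],
                               uint16_t *c_min, uint16_t *c_max, uint16_t *sel)
    {
        int luma[16], lmin = 255, lmax = 0, imin = 0, imax = 0;
        for (int i = 0; i < 16; i++) {
            int r = (px[i] >> 10) & 31, g = (px[i] >> 5) & 31, b = px[i] & 31;
            luma[i] = 2 * r + 5 * g + b;        /* cheap luma approximation */
            if (luma[i] < lmin) { lmin = luma[i]; imin = i; }
            if (luma[i] > lmax) { lmax = luma[i]; imax = i; }
        }
        *c_min = px[imin];                      /* min/max-luma endpoints */
        *c_max = px[imax];
        int thresh = (lmin + lmax) / 2;         /* selector = compare vs threshold */
        uint16_t bits = 0;
        for (int i = 0; i < 16; i++)
            if (luma[i] > thresh)
                bits |= (uint16_t)(1u << i);
        *sel = bits;
    }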


    As noted, early on, I was mostly using real-time color-cell encoders for
    Doom and Quake and similar (hence part of why they were modified to use RGB555).

    Some of this is also related to the existence of a lot of RGB555-related helper ops. Though, early on, I had also used YUV655 as well, but RGB555
    mostly won out over YUV655 (even if it is easier to get a luma from
    YUV655 vs RGB555).

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 2 16:56:05 2025
    From Newsgroup: comp.arch

    On 2025-11-02 3:58 p.m., BGB wrote:
    <snip>

    No real need to go much beyond RGB555, as the FPGA boards have VGA DACs
    that generally fall below this (Eg: 4 bit/channel on the Nexys A7). And, 2-bit for many VGA PMods (PMod allowing 8 IO pins, so RGB222+H/V Sync;
    or needing to use 2 PMOD connections for the VGA). The usual workaround
    was also to perform dithering while driving the VGA output (with ordered dither in the Verilog).


    I am using an HDMI interface so the monitor is fed 24-bit RGB digitally.
    I tried to get a display channel interface working but no luck. VGA is
    so old.

    Have you tried dithering based on the frame (temporal dithering vs
    spatial dithering)? First frame is one set of colors, the next frame is
    a second set of colors. I think it may work if the refresh rate is high
    enough (120 Hz). IIRC I tried this a while ago and was not happy with
    the results. I also tried rotating the dithering pattern around each frame.

    <snip>

    Generally, the text mode operates in a 640x200 mode with 8x8 + 128b
    cells, so 32K of VRAM used (for 80x25 cells).

    For the text mode, an 800x600 mode is used on my system, with 12x18 cells
    so that I can read the display at a distance (64x32 characters).

    The font then has 64 block-graphic characters (2x3 blocks). Low-res
    graphics can be done in text mode with the appropriate font size and
    block graphics characters. Color selection is limited though.
    In this case, a 40x25 color-cell mode (with 256-bit cells) could be used
    for graphics (32K). Early on, this was used as the graphics mode for
    Doom and similar, before I later expanded VRAM to 128K and switched to 320x200 Hicolor.


    The bitmap modes are non-raster, generally with pixels packed into 8x8
    or 4x4 blocks.
    4x4:
      16bpp: pixels in raster order.
       8bpp: raster order, 32-bits per row
       4bpp: Raster order, 16-bits per row
    And, 8x8:
       4bpp: Takes 16bpp layout, splits each pixel into 2x2.
       2bpp: Takes  8bpp layout, splits each pixel into 2x2.
       1bpp: Raster order, 1bpp, but same order as text glyphs.
         With MSB in upper left, LSB in lower right.


    <snip>

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Nov 2 17:21:52 2025
    From Newsgroup: comp.arch

    On 11/2/2025 3:56 PM, Robert Finch wrote:
    On 2025-11-02 3:58 p.m., BGB wrote:
    <snip>

    No real need to go much beyond RGB555, as the FPGA boards have VGA
    DACs that generally fall below this (Eg: 4 bit/channel on the Nexys
    A7). And, 2-bit for many VGA PMods (PMod allowing 8 IO pins, so
    RGB222+H/V Sync; or needing to use 2 PMOD connections for the VGA).
    The usual workaround was also to perform dithering while driving the
    VGA output (with ordered dither in the Verilog).


    I am using an HDMI interface so the monitor is fed 24-bit RGB digitally.
    I tried to get a display channel interface working but no luck. VGA is
    so old.


    Never went up the learning curve for HDMI.
    Would likely need to drive the monitor outputs with SERDES or similar
    though.


    Have you tried dithering based on the frame (temporal dithering vs
    space-al dithering)? First frame is one set of colors, the next frame is
    a second set of colors. I think it may work if the refresh rate is high enough (120 Hz). IIRC I tried this a while ago and was not happy with
    the results. I also tried rotating the dithering pattern around each frame.


    Temporal dithering seems to generate annoying artifacts on the monitors
    I tried it on; it tended to result in wavy/rippling artifacts.

    Likewise, PWM'ing the pixels also makes LCD monitors unhappy (rainbow
    banding artifacts), but seems to work OK on CRTs. I suspect it is an
    issue that the monitors expect a 25MHz pixel clock (when using 640x400
    or 640x480 timing) with an ADC that doesn't like sudden changes in level
    (say, if updating the pixels at 50MHz internally).


    <snip>

    Generally, the text mode operates in a 640x200 mode with 8x8 + 128b
    cells, so 32K of VRAM used (for 80x25 cells).

    For the text mode 800x600 mode is used on my system, with 12x18 cells so that I can read the display at a distance (64x32 characters).

    The font then has 64 block graphic characters of 2x3 block. Low-res
    graphics can be done in text mode with the appropriate font size and
    block graphics characters. Color selection is limited though.

    I went with 80x25 as it is pretty standard;
    80x50 is also possible, but less standard.

    Though, Linux seems to often like using high-res text modes rather than
    the usual 80x25 or similar.

    As for 8x8 character cells:
    Also pretty standard, and fit nicely into 64 bits.



    In theory, for a text mode, could drive a monitor at 1280x400 with
    640x400 timings for 16x16 character cells, but LCD monitors don't like
    this sort of thing.


    Even at 640x400/70Hz timings, the monitor didn't consistently recognize
    it as 640x400, and would sometimes try to detect it as 720x400 or
    similar (which would look wonky).

    The other option being to output 640x480 and simply black-fill the extra
    lines (so, add 20 lines of black-fill at the top and bottom of the
    screen). Where, the monitors were able to more reliably detect 640x480/60Hz


    The main tradeoff is that mostly I have a limited selection of pixel
    clocks available:
    25, 50, maybe 100.

    Mostly because the pixel clocks are high enough and clock-edges
    sensitive enough where accumulation timers don't really work.

    Though, accumulation timers do work for driving an NTSC composite
    output. But, NTSC composite looks poor, can't even really do an 80x25
    text mode acceptably (if using colorburst); but can do 80x25 if one can
    accept black-and-white.

    Well, there was also component video, but this is basically the same as driving VGA (just with it being able to accept both NTSC and VGA
    timings; eg, 15 to 70 kHz for horizontal refresh, 40 to 90 Hz vertical,
    ...).

    Though, I no longer have the display that had component video inputs.


    In contrast, there is generally a very limited range of timings for
    composite or S-Video (generally, these don't accept VGA-like timings). Whereas, VGA only really accepts VGA-like timings, and is unhappy if
    given NTSC timings (eg: 15 kHz horizontal refresh).


    Not sure why component video is seemingly the only "accepts whatever you
    throw at it" analog input (say, on a display with multiple input types
    and presumably similar hardware internally).


    Checking, it is annoyingly hard to find plain LCD monitors with component
    video inputs that are not also full TVs with a TV tuner (but, a little
    easier to find ones with both VGA and composite). Closest I can find are apparently intended mostly as CCTV monitors.


    But, mostly using VGA anyways, so...


    ...




    In this case, a 40x25 color-cell mode (with 256-bit cells) could be
    used for graphics (32K). Early on, this was used as the graphics mode
    for Doom and similar, before I later expanded VRAM to 128K and
    switched to 320x200 Hicolor.


    The bitmap modes are non-raster, generally with pixels packed into 8x8
    or 4x4 blocks.
    4x4:
       16bpp: pixels in raster order.
        8bpp: raster order, 32-bits per row
        4bpp: Raster order, 16-bits per row
    And, 8x8:
        4bpp: Takes 16bpp layout, splits each pixel into 2x2.
        2bpp: Takes  8bpp layout, splits each pixel into 2x2.
        1bpp: Raster order, 1bpp, but same order as text glyphs.
          With MSB in upper left, LSB in lower right.


    <snip>


    ...

    But, yeah, my makeshift graphics hardware is a little wonky.
    And it works in an almost entirely different way from the VGA style hardware.

    Ironically, software doesn't configure timings itself, but rather uses selector bits to control various properties:
    Base Resolution (640x400, 640x480, 800x600, ...);
    Character cell size in pixels (4x4 or 8x8);
    Settings to modify the number of horizontal and vertical cells relative
    to the base resolution;
    ...

    But, for the most part, I had been using 640x400 or similar; with 800x600
    as more experimental (and doesn't look great with 2bpp cells).

    The 1024x768 mode had gone mostly unused, and is still untested on real hardware.

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Nov 3 15:22:44 2025
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Michael S wrote:

    I mostly use ULP/Guard/Sticky in the same meaning. Except when I use
    them, esp. Guard, differently.
    Given the choice, [in the context of binary floating point] I'd rather
    not use the term 'guard' at all. Names like 'rounding bit' or
    'half-ULP' are far more self-describing.

    Guard also works for decimal FP, where you need a single Sticky bit if
    the Guard digit is equal to 5.

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Nov 3 11:53:48 2025
    From Newsgroup: comp.arch

    On 11/3/2025 9:22 AM, Scott Lurndal wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Michael S wrote:

    I mostly use ULP/Guard/Sticky in the same meaning. Except when I use
    them, esp. Guard, differently.
    Given the choice, [in the context of binary floating point] I'd rather
    not use the term 'guard' at all. Names like 'rounding bit' or
    'half-ULP' are far more self-describing.

    Guard also works for decimal FP, where you need a single Sticky bit if
    the Guard digit is equal to 5.

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?


    I would assume he meant something like either the newer IEEE-754 decimal formats, or a decimal-FP format that MS had used in .NET, ...

    The IEEE formats are generally one of:
      Linear mantissa understood as decimal;
      Groups of 10 bits, each used to encode 3 digits,
        as Densely Packed Decimal.
    With a power-of-10 exponent.

    The .NET format was similar, except using groups of 32 bits as linear
    values representing 9 digits.

    When I looked at it before, the most practical way for me to support
    something like this seemed to be to not do it directly in hardware, but
    to support a subset of operations:
      Operations to pack and unpack DPD into BCD;
        Say: a 64-bit value holds 15 BCD digits, mapped to 50 bits of DPD.
      Some basic operations to help with arithmetic on BCD.

    I partly implemented these as an experiment before, but then noted I
    have basically no use case for Decimal-FP in my project.

    And, ironically, the main benefit the helpers would have provided would
    be to allow for faster Binary<->Decimal conversion. But even that is
    debatable, as Binary<->Decimal conversion doesn't itself consume enough CPU
    time to justify making it faster at the cost of needing to drag around BCD
    helper instructions.

    One downside is that there was no multiplier, so the BCD helpers would
    need to be used to effectively implement a Radix-10 Shift-and-Add.

    ...


    Though, it is debatable; something more like the .NET approach could
    make more sense for a SW implementation.

    If one wants to make the encoding use the bits more efficiently, a
    hybrid approach could make sense, say:
      Use 3 groups of 30 bits, and another group of 20 bits (6 digits);
      Use a 17-bit linear exponent and a sign bit.

    This would be slightly cheaper to implement vs what is defined in the
    standard (for the BID variant), and could achieve a similar effect
    (though, with 33 digits rather than 34).

    Internally, it could work similarly to the .NET approach, just with a
    little more up-front work to pack/unpack the 30-bit components. The merit of
    30 bit groups being that they map internally onto 32-bit integer
    operations (which would also provide a space internally for carry/borrow signaling in operations).

    Most CPUs at least have native support for 32-bit integer math, and for
    SW (on a 32/64 bit machine) this could be an easier chunking size than
    10 bits. Someone could argue for 60 bit chunking on a 64-bit machine
    (or, one 60 bit chunk, and a 50 bit chunk), but likely this wouldn't
    save much over 30 bit chunking.

    Also, 60-bit chunking would imply access to a 64*64->128 bit widening multiply, which is asking more than 32*32->64. It also precludes some
    ways to more cheaply implement the divide/modulo step for each chunk
    (*). So, it is likely in this sense 30 bit chunks could still be preferable.

    *:
      high = product >> 30;
      low  = product - (high * 1000000000LL);
      if (low >= 1000000000)
        { high++; low -= 1000000000; }
    Where, 60-bit chunking would require 128-bit math here.

    Where, effectively, the multiply step is operating in radix-1-billion.

    ...
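
    A minimal C sketch of the chunked (radix-1-billion) multiply step, under
    assumptions: least-significant chunk first, each chunk 0..999999999,
    multiplying by a small integer. Purely illustrative; it uses a plain
    divide where the shift+correct trick above could be substituted:

    #include <stdint.h>

    #define CHUNK_BASE 1000000000u

    static void chunks_mul_small(uint32_t *chunks, int n, uint32_t m)
    {
        uint64_t carry = 0;
        for (int i = 0; i < n; i++) {
            uint64_t product = (uint64_t)chunks[i] * m + carry;   /* 32*32->64 */
            carry = product / CHUNK_BASE;
            chunks[i] = (uint32_t)(product - carry * CHUNK_BASE);
        }
        /* a nonzero carry here means the result needs one more chunk */
    }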



    Still don't have much of a use-case though.

    In general, Decimal-FP seems more like a solution in search of a problem.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Nov 3 18:47:36 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:
    Contemplating having conditional branch instructions branch to a target value in a register instead of using a displacement.

    I think this has about the same code density as having a branch to a displacement from the IP.

    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register), tell the CPU
    what the target is (like VEC in My66000) or just use a general
    purpose register with a general-purpose instruction.

    Using a fused compare-and-branch instruction for Qupls4

    Is that the name of your architecture, or an instruction? (That
    may have been mentioned upthread, in that case I don't remember).

    there is not
    enough room in the instruction for a large branch displacement (10
    bits). So, my thought is to branch to a register value instead.
    There is already an add-to-instruction-pointer instruction that can be
    used to generate relative addresses.

    That makes sense.

    By moving the register load outside of a loop, the dynamic instruction
    count can be reduced. I think this solution is a bit better than having compare and branch as two separate instructions, or having an extended constant added to the branch instruction.

    Are you talking about a normal loop condition or a jump out of
    a loop?

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    If you use a link register or a special instruction, the CPU could
    do that.

    The 10-bit displacement format could also be supported, but it is yet another branch instruction format. I may leave holes in the instruction
    set for future support, but I think it is best to start with just a
    single format.

    Code:
    AIPSI R3,1234 ; add displacement to IP and store in R3 (hoist-able)
    BLT R1,R2,R3 ; branch to R3 if R1 < R2

    Versus:
    CMP R3,R1,R2
    BLT R3,displacement
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Nov 3 19:03:13 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always do the >> op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).

    Likely. My 66000 also has RNO, and
    Round Nearest Random is defined but not yet available;
    Round Away from Zero is also defined and available.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Nov 3 19:13:50 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Contemplating having conditional branch instructions branch to a target value in a register instead of using a displacement.

    I think this has about the same code density as having a branch to a displacement from the IP.

    Using a fused compare-and-branch instruction for Qupls4 there is not
    enough room in the instruction for a large branch displacement (10
    bits). So, my thought is to branch to a register value instead.
    There is already an add-to-instruction-pointer instruction that can be
    used to generate relative addresses.

    The VEC instruction (My 66000) provides a register that is used for
    the address of the top of the loop and the address of the VEC inst
    itself. So, when running in the loop, the LOOP instruction branches
    to the register value, and when taking an exception in the loop,
    the register leads back to the VEC instruction for after the excpt
    has been performed.

    By moving the register load outside of a loop, the dynamic instruction
    count can be reduced. I think this solution is a bit better than having compare and branch as two separate instructions, or having an extended constant added to the branch instruction.

    VEC-{ }-LOOP always saves at least 1 instruction per iteration.

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    VEC does its own predictions. LOOP does not overrun the loop-count,
    so loop termination is not a pipeline flush.

    The 10-bit displacement format could also be supported, but it is yet another branch instruction format. I may leave holes in the instruction
    set for future support, but I think it is best to start with just a
    single format.

    Code:
    AIPSI R3,1234 ; add displacement to IP and store in R3 (hoist-able)

    LDA Rd,[IP,displacement]

    BLT R1,R2,R3 ; branch to R3 if R1 < R2

    Versus:
    CMP R3,R1,R2
    BLT R3,displacement

    But if you create "R3" from your VEC instruction, you KNOW that
    the compiler is only allowed to use "r3" as a branch target, and
    that "R3" is static over the duration of the loop, so you can get
    the reservation stations moving faster/easier.

    I have a "special" RS for the VEC-LOOP brackets.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Nov 3 23:04:53 2025
    From Newsgroup: comp.arch

    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Michael S wrote:

    I mostly use ULP/Guard/Sticky in the same meaning. Except when I
    use them, esp. Guard, differently.
    Given the choice, [in the context of binary floating point] I'd
    rather not use the term 'guard' at all. Names like 'rounding bit'
    or 'half-ULP' are far more self-describing.

    Guard also works for decimal FP, where you need a single Sticky bit
    if the Guard digit is equal to 5.

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it became a
    part of the main standard).
    IEEE 754 has two options for encoding of the mantissa: IBM's DPD, which is
    a clever variation on Base 1000, and Intel's binary (BID).
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option; its information density is insufficient to
    supply the required semantics in the given container size.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Nov 4 08:50:25 2025
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    Michael S wrote:

    I mostly use ULP/Guard/Sticky in the same meaning. Except when I use
    them, esp. Guard, differently.
    Given the choice, [in the context of binary floating point] I'd rather
    not use the term 'guard' at all. Names like 'rounding bit' or
    'half-ULP' are far more self-describing.

    Guard also works for decimal FP, where you need a single Sticky bit if
    the Guard digit is equal to 5.

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    No, I meant ieee754 DFP, where you either store the decimal digits in
    packed modulo-1000 groups, or as a binary mantissa with a decimal exponent/scaling value.

    When you do math with these you have to handle all the required
    (financial?) rounding modes.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Nov 4 07:50:33 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register),

    There is this idea of splitting an (indirect) branch into a
    prepare-to-branch instruction and a take-branch instruction. The prepare-to-branch instruction announces the branch target to the CPU,
    and Power's mtlr and mtctr are examples of that (somewhat muddled by
    the fact that the ctr register can also be used for counted loops as
    well as for indirect branches), and IA-64's branch-target registers
    and the instructions that move there are another example. AFAIK SPARC
    acquired something in this direction (touted as good for accelerating
    Java) in the early 2000s. The take-branch instruction on Power is
    blr/bctr.

    I used to think that this kind of splitting is a good idea, and it is
    certainly better than a branch-delay slot or a branch with a fixed
    number of delay slots.

    But in practice, it turned out that Intel and AMD processors had much
    better performance on indirect-branch intensive workloads in the early
    2000s without this architectural feature. What happened?

    The IA-32 and AMD64 microarchitects implemented indirect-branch
    prediction; in the early 2000s it was based on the BTB, which these
    CPUs need for fast direct branching anyway. They were not content
    with that, and have implemented history-based indirect branch
    predictors in the meantime, which improve the performance even more.

    By contrast, Power and IA-64 implementations apparently rely on
    getting the target-address early enough, and typically predict that
    the indirect branch will go to the current contents of the
    branch-target register when the front-end encounters the take-branch instruction; but if the prepare-to-branch instruction is in the
    instruction stream just before the take-branch instruction, it takes
    several cycles until the prepare-to-branch actually can move the
    target to the branch-target register. In case of an OoO
    implementation, the number of cycles tends to be longer. It's
    essentially a similar latency as in a branch misprediction.

    That all would not be so bad, if the compilers would move the
    prepare-to-branch instructions sufficiently far away from the
    take-branch instruction. But gcc certainly has not done so whenever I
    looked at code it generated for PowerPC or IA-64.

    Here is some data for code that focusses on indirect-branch
    performance (with indirect branches that vary their targets), from <https://www.complang.tuwien.ac.at/forth/threading/>:

    Numbers are cycles per indirect branch, smaller is faster, the years
    are the release dates of the CPUs:

    First, machines from the early 2000s:

    sub-               in-                       repl.
    routine   direct   direct   switch   call    switch   CPU                           year
    9.6       8.0      9.5      23.1     38.6             Alpha 21264B 800MHz           ~2000
    4.7       8.1      9.5      19.0     21.3             Pentium III 1000MHz           2000
    18.4      8.5      10.3     24.5     29.0             Athlon 1200MHz                2000
    8.6       14.2     15.3     23.4     30.2             Pentium 4 2.26                2002
    13.3      10.3     12.3     15.7     18.7             Itanium 2 (McKinley) 900MHz   2002
    5.7       9.2      12.3     16.3     17.9             PPC 7447A 1066MHz             2004
    7.8       12.8     12.9     30.2     39.0             PPC 970 2000MHz               2002

    Ignore the first column (it uses call and return), the others all need
    an indirect branch or indirect call ("call" column) per dispatch, with
    varying amounts of other instructions; "direct" needs the least
    instructions.

    And here are results with some newer machines:

    sub-               in-                       repl.
    routine   direct   direct   switch   call    switch   CPU                           year
    4.9       5.6      4.3      5.1      7.64             Pentium M 755 2000MHz         2004
    4.4       2.2      2.0      20.3     18.6    3.3      Xeon E3-1220 3100MHz          2011
    4.0       2.3      2.3      4.0      5.1     3.5      Core i7-4790K 4400MHz         2013
    4.2       2.1      2.0      4.9      5.2     2.7      Core i5-6600K 4000MHz         2015
    5.7       3.2      3.9      7.0      8.6     3.7      Cortex-A73 1800MHz            2016
    4.2       3.3      3.2      17.9     23.1    4.2      Ryzen 5 1600X 3600MHz         2017
    6.9       24.5     27.3     37.1     33.5    36.6     Power9 3800MHz                2017
    3.8       1.0      1.1      3.8      6.2     2.2      Core i5-1135G7 4200MHz        2020

    The age of the Pentium M would suggest putting it into the earlier
    table, but given its clear performance-per-clock advantage over the
    other IA-32 and AMD64 CPUs of its day, it was probably the first CPU
    to have a history-based indirect-branch predictor.

    It seems that, while the AMD64 microarchitectures improved not just in
    clock rate, but also in performance per clock for this microbenchmark
    (thanks to history-based indirect-branch predictors), the Power 9
    still relies on its split-branch architectural feature, resulting in
    slowness. And it's not just slowness in "direct", but the additional instructions in the other benchmarks add more cycles than in most
    other CPUs.

    Particularly notable is the Core i5-1135G7, which takes one indirect
    branch per cycle.

    I have to take additional measurements with other Power and AMD64
    processors.

    Couldn't the Power and IA-64 CPUs use history-based branch prediction,
    too? Of course, but then it would be even more obvious that the
    split-branch architecture provides no benefit.

    Bottom line: History-based branch prediction has won, any kind of
    delayed branches (including split-branch designs) turn out to be a bad idea.

    tell the CPU
    what the target is (like VEC in My66000)

    I have no idea what VEC does, but all indirect-branch architectures
    are about telling the CPU what the target is.

    just use a general
    purpose register with a general-purpose instruction.

    That turns out to be the winner.

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    If you want to be able to perform one taken branch per cycle (or
    more), you always need prediction.

    If you use a link register or a special instruction, the CPU could
    do that.

    It turns out that this does not work well in practice.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Nov 4 15:19:08 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it became a
    part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
    a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Nov 4 17:41:07 2025
    From Newsgroup: comp.arch

    On Tue, 04 Nov 2025 15:19:08 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it
    became a part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
    a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    What is not clear about 'in given size of container' ?
    Semantics of IEEE Decimal128 call for 33 decimal digits + 1 binary bit
    to be contained within 111 bits.
    With BCD encoding one would need 133 bits.

    Decimal32 and Decimal64 would suffer from a similar mismatch, but those
    formats are probably not important. IMHO, IEEE defined them for the sake
    of completeness rather than because they are useful in the real world.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Nov 4 07:47:50 2025
    From Newsgroup: comp.arch

    On 11/4/2025 7:19 AM, Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it became a
    part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
    a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    By "information density" I think he means that for almost any (I won't
    say any because there might be some edge cases where the isn't true)
    value, it takes fewer bits to represent in the IEEE scheme than in your beloved Burroughs Medium system's scheme. :-) Fewer bits per value
    means higher information density.

    Fewer bits means less hardware, thus lower cost, less power
    required, etc.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Nov 4 16:52:18 2025
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it became a
    part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
    a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    It needs to be comparable to binary FP:

    A 64-bit double provides 53 mantissa bits; this corresponds to nearly 16
    decimal digits, while fp128 gives us 113 bits or a smidgen over 34 digits.

    The corresponding 128-bit DFP format also provides 34 decimal digits,
    with an exponent range which covers 10^-6143 to 10^6144, while the 15
    exponent bits in binary128 cover 2^-16k to 2^16k, corresponding to 5.9e(+/-)4931.

    I.e. the DFP format has the same precision and a larger range than BFP.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Nov 4 18:54:58 2025
    From Newsgroup: comp.arch

    On Tue, 4 Nov 2025 16:52:18 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it
    became a part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which
    is a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    It is needed to be comparable to binary FP:

    A 64-bit double provides 54 mantissa bits, this corresponds to 16+
    decimal digits, while fp128 gives us 113 bits or a smidgen over 34
    digits.

    The corresponding 128-bit DFP format also provides 34 decimal digts,
    with an exponent range which covers 10^-6143 to 10^6144, while the 15 exponent bits in binary128 covers 2^-16k to 2^16k, corresponding to 5.9e(+/-)4931.

    I.e. the DFP format has the same precision and a larger range than
    BFP.

    Terje


    Nitpick:
    In the best case, i.e. cases where mantissa of BFP is close to 2 and MS
    digit of DFP =9, [relative] precision is indeed almost identical.
    But in the worst case, i.e. cases where mantissa of BFP is close to 1
    and MS digit of DFP =1, [relative] precision of BFP is 5 times better.
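
    A concrete instance of that worst case (worked numbers, not from the
    post): near 1.0, binary128's ulp is 2^-112 ~= 1.9e-34, while decimal128
    with a leading significand digit of 1 has an ulp of 1e-33, about 5.2
    times coarser.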


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Nov 4 17:12:54 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 04 Nov 2025 15:19:08 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it
    became a part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
    a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    What is not clear about 'in given size of container' ?
    Semantics of IEEE Decimal128 call for 33 decimal digits + 1 binary bit
    to be contained within 111 bits.
    With BCD encoding one would need 133 bits.

    I guess it wasn't clear that my question was regarding
    the necessity of providing 'hidden' bits for BCD floating
    point.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Nov 4 20:13:36 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Tue, 4 Nov 2025 16:52:18 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it
    became a part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which
    is a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    It is needed to be comparable to binary FP:

    A 64-bit double provides 54 mantissa bits, this corresponds to 16+
    decimal digits, while fp128 gives us 113 bits or a smidgen over 34
    digits.

    The corresponding 128-bit DFP format also provides 34 decimal digts,
    with an exponent range which covers 10^-6143 to 10^6144, while the 15
    exponent bits in binary128 covers 2^-16k to 2^16k, corresponding to
    5.9e(+/-)4931.

    I.e. the DFP format has the same precision and a larger range than
    BFP.

    Terje


    Nitpick:
    In the best case, i.e. cases where mantissa of BFP is close to 2 and MS
    digit of DFP =9, [relative] precision is indeed almost identical.
    But in the worst case, i.e. cases where mantissa of BFP is close to 1
    and MS digit of DFP =1, [relative] precision of BFP is 5 times better.

    Agreed.

    It is somewhat similar to the very old hex FP, which had a wider exponent
    range but more variable precision.

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Nov 4 19:15:31 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register),

    There is this idea of splitting an (indirect) branch into a
    prepare-to-branch instruction and a take-branch instruction. The

    I first heard about this 1982 from Burton Smith.

    prepare-to-branch instruction announces the branch target to the CPU,
    and Power's mtlr and mtctr are examples of that (somewhat muddled by
    the fact that the ctr register can also be used for counted loops as
    well as for indirect branches), and IA-64's branch-target registers
    and the instructions that move there are another example. AFAIK SPARC acquired something in this direction (touted as good for accelerating
    Java) in the early 2000s. The take-branch instruction on Power is
    blr/bctr.

    I used to think that this kind of splitting is a good idea, and it is certainly better than a branch-delay slot or a branch with a fixed
    number of delay slots.

    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    But in practice, it turned out that Intel and AMD processors had much
    better performance on indirect-branch intensive workloads in the early
    2000s without this architectural feature. What happened?

    We threw HW at the problem.

    The IA-32 and AMD64 microarchitects implemented indirect-branch
    prediction; in the early 2000s it was based on the BTB, which these
    CPUs need for fast direct branching anyway. They were not content
    with that, and have implemented history-based indirect branch
    predictors in the meantime, which improve the performance even more.

    By contrast, Power and IA-64 implementations apparently rely on
    getting the target-address early enough, and typically predict that
    the indirect branch will go to the current contents of the
    branch-target register when the front-end encounters the take-branch instruction; but if the prepare-to-branch instruction is in the
    instruction stream just before the take-branch instruction, it takes
    several cycles until the prepare-to-branch actually can move the
    target to the branch-target register. In case of an OoO
    implementation, the number of cycles tends to be longer. It's
    essentially a similar latency as in a branch misprediction.

    That all would not be so bad, if the compilers would move the prepare-to-branch instructions sufficiently far away from the
    take-branch instruction. But gcc certainly has not done so whenever I
    looked at code it generated for PowerPC or IA-64.

    Here is some data for code that focusses on indirect-branch
    performance (with indirect branches that vary their targets), from <https://www.complang.tuwien.ac.at/forth/threading/>:

    Numbers are cycles per indirect branch, smaller is faster, the years
    are the release dates of the CPUs:

    First, machines from the early 2000s:

    sub-               in-                       repl.
    routine   direct   direct   switch   call    switch   CPU                           year
    9.6       8.0      9.5      23.1     38.6             Alpha 21264B 800MHz           ~2000
    4.7       8.1      9.5      19.0     21.3             Pentium III 1000MHz           2000
    18.4      8.5      10.3     24.5     29.0             Athlon 1200MHz                2000
    8.6       14.2     15.3     23.4     30.2             Pentium 4 2.26                2002
    13.3      10.3     12.3     15.7     18.7             Itanium 2 (McKinley) 900MHz   2002
    5.7       9.2      12.3     16.3     17.9             PPC 7447A 1066MHz             2004
    7.8       12.8     12.9     30.2     39.0             PPC 970 2000MHz               2002

    Ignore the first column (it uses call and return), the others all need
    an indirect branch or indirect call ("call" column) per dispatch, with varying amounts of other instructions; "direct" needs the least
    instructions.

    And here are results with some newer machines:

    sub-               in-                       repl.
    routine   direct   direct   switch   call    switch   CPU                           year
    4.9       5.6      4.3      5.1      7.64             Pentium M 755 2000MHz         2004
    4.4       2.2      2.0      20.3     18.6    3.3      Xeon E3-1220 3100MHz          2011
    4.0       2.3      2.3      4.0      5.1     3.5      Core i7-4790K 4400MHz         2013
    4.2       2.1      2.0      4.9      5.2     2.7      Core i5-6600K 4000MHz         2015
    5.7       3.2      3.9      7.0      8.6     3.7      Cortex-A73 1800MHz            2016
    4.2       3.3      3.2      17.9     23.1    4.2      Ryzen 5 1600X 3600MHz         2017
    6.9       24.5     27.3     37.1     33.5    36.6     Power9 3800MHz                2017
    3.8       1.0      1.1      3.8      6.2     2.2      Core i5-1135G7 4200MHz        2020

    The age of the Pentium M would suggest putting it into the earlier
    table, but given its clear performance-per-clock advantage over the
    other IA-32 and AMD64 CPUs of its day, it was probably the first CPU
    to have a history-based indirect-branch predictor.

    It seems that, while the AMD64 microarchitectures improved not just in
    clock rate, but also in performance per clock for this microbenchmark
    (thanks to history-based indirect-branch predictors), the Power 9
    still relies on its split-branch architectural feature, resulting in slowness. And it's not just slowness in "direct", but the additional instructions in the other benchmarks add more cycles than in most
    other CPUs.

    Particularly notable is the Core i5-1135G7, which takes one indirect
    branch per cycle.

    I have to take additional measurements with other Power and AMD64
    processors.

    Couldn't the Power and IA-64 CPUs use history-based branch prediction,
    too? Of course, but then it would be even more obvious that the
    split-branch architecture provides no benefit.

    Bottom line: History-based branch prediction has won, any kind of
    delayed branches (including split-branch designs) turn out to be
    a bad idea.

    Or "Never bet against branch prediction".

    tell the CPU
    what the target is (like VEC in My66000)

    I have no idea what VEC does, but all indirect-branch architectures
    are about telling the CPU what the target is.

    VEC is the bracket at the top of a loop. VEC supplies a register
    which will contain the address of the instruction at the top of
    the loop, and a 21-bit vector used to specify those registers which
    are "Live" out of the loop. VEC is "executed" as the loop is entered
    and then not again until the loop is entered again.

    The LOOP instruction is the bottom bracket of the loop and performs
    the ADD-CMP-BC sequence as a single instruction. There are 3 flavors
    {counted, value terminated, counter value terminated} that use the
    3 registers similarly but differently.
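
    A rough C-level gloss (my reading of the description above, not the ISA
    definition) of the per-iteration work a counted LOOP folds into a single
    instruction at the bottom of the loop:

    #include <stddef.h>

    static void loop_gloss(size_t n)
    {
        size_t i = 0;
    loop_top:
        /* ... loop body ... */
        i = i + 1;          /* ADD: advance the induction variable */
        if (i < n)          /* CMP: compare against the trip count */
            goto loop_top;  /* BC : branch back to the VEC'd top   */
    }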

    just use a general
    purpose register with a general-purpose instruction.

    That turns out to be the winner.

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    With VEC-LOOP you are guaranteed that the branch and its target are
    100% correlated.

    If you want to be able to perform one taken branch per cycle (or
    more), you always need prediction.

    Greater than 1 branch per FETCH latency.

    If you use a link register or a special instruction, the CPU could
    do that.

    It turns out that this does not work well in practice.

    Agreed.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Nov 4 20:16:59 2025
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    Michael S <already5chosen@yahoo.com> writes:
    On Tue, 04 Nov 2025 15:19:08 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    On Mon, 03 Nov 2025 15:22:44 GMT
    scott@slp53.sl.home (Scott Lurndal) wrote:

    By decimal FP, do you mean BCD? I.e. a format where
    you have a BCD exponent sign digit (BCD 'C' or 'D')
    followed by two BCD exponent digits, followed by a
    mantissa sign digit ('C' or 'D') followed by a variable
    number of mantissa digits (1 to 100)?

    I am pretty sure that by decimal FP Terje means decimal FP :-). As
    defined in IEEE 754 (formerly it was in 854, but since 2008 it
    became a part of the main standard).
    IEEE 754 has two options for encoding of mantissa, IBM's DPD which is
    a clever variation of Base 1000 and Intel's binary.
    DPD encoding is considered preferable for hardware implementations
    while binary encoding is easier for software implementations.
    BCD is not an option, it's information density is insufficient to
    supply required semantics in given size of container.

    How so? The B3500 supported 100 digit (400 bit) signed mantissa and
    a two digit signed exponent using a BCD representation.

    What is not clear about 'in given size of container' ?
    Semantics of IEEE Decimal128 call for 33 decimal digits + 1 binary bit
    to be contained within 111 bits.
    With BCD encoding one would need 133 bits.

    I guess it wasn't clear that my question was regarding
    the necessity of providing 'hidden' bits for BCD floating
    point.

    I thought that was obvious:

    When you learned how to do decimal rounding back in your pen & paper
    math classes, you probably realized that for any calculation which could
    not be done exactly, you had to generate enough extra digits to be sure
    how to round.

    Those extra digits play exactly the same role as Guard + Sticky do in
    binary FP.
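
    As a small illustration (my sketch, not Terje's code): round-to-nearest-
    even of a decimal significand, given the guard digit and a sticky flag:

    #include <stdint.h>

    /* 'guard' is the first dropped digit (0..9); 'sticky' says whether any
       later dropped digit was non-zero.  A carry out of the top digit must
       be handled by the caller. */
    static uint64_t round_nearest_even_dec(uint64_t digits, int guard, int sticky)
    {
        /* digits & 1 is the parity of the last decimal digit (10 is even) */
        if (guard > 5 || (guard == 5 && (sticky || (digits & 1))))
            return digits + 1;   /* round up   */
        return digits;           /* round down */
    }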

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Nov 4 21:07:43 2025
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Nov 4 22:44:21 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Bottom line: History-based branch prediction has won, any kind of
    delayed branches (including split-branch designs) turn out to be
    a bad idea.

    Or "Never bet against branch prediction".

    I have probably mentioned this before, once or twice, but I'm actually
    quite proud of the meeting I had with Intel Santa Clara in the spring of
    1995:

    I had (accidentally) written the first public mention of the FDIV bug
    (on comp.sys.intel) in Oct 1994, then together with Cleve Moler of MathWorks/MatLab fame led the effort to develop a minimum cost sw
    workaround for the bug. (My code became part of all/most x86 compiler
    runtimes for the next few years.)

    Due to this Intel invited me to receive an early engineering prototype
    of the PentiumPro, together with an NDA-covered briefing about its architecture.

    Before the start of that briefing I suggested that I should start off on
    the blackboard by showing what I had been able to figure out on my own,
    then I proceeded to pretty much exactly cover every single feature on
    the cpu, with one glaring exception:

    Based on the useful but not great branch predictor on the Pentium I told
    them that I expected the P6 to employ eager execution, i.e. execute both
    ways of one or two layers of branches, discarding the non-taken paths as
    the branch direction info became available.

    That's the point when they got to brag about how having a much, much
    better branch predictor was better both from a performance and a power viewpoint, since out of order execution could predict much deeper than
    any eager execution would have the resources for.

    As you said: "Never bet against branch prediction".

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Tue Nov 4 22:52:46 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Several options; the easiest is of course a set of full forward/reverse
    lookup tables, but you can take advantage of the regularities by using
    smaller tables together with a little bit of logic.

    You also need a way to extract one or two digits from the top/bottom of
    each mod1000 container in order to handle normalization.
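
    A tiny sketch of the kind of digit motion that implies, assuming the
    mod-1000 groups have already been unpacked from DPD into plain 0..999
    integers in a little-endian array (my code, nothing standard):

    #include <stdint.h>

    /* Shift a significand left by one decimal digit. */
    static void shl1_digit(uint32_t *g, int n)
    {
        uint32_t carry = 0;                    /* digit entering from below  */
        for (int i = 0; i < n; i++) {
            uint32_t t = g[i] * 10u + carry;   /* at most 9999               */
            g[i]  = t % 1000u;                 /* keep three digits          */
            carry = t / 1000u;                 /* top digit moves up a group */
        }
        /* 'carry' now holds the digit shifted out of the top group */
    }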

    For the Intel binary-mantissa dfp128, normalization is the hard issue.
    Michael S has figured out some really nice tricks to speed it up, but
    when you have a (worst case) temporary 220+ bit product mantissa,
    scaling is not that easy.

    The saving grace is that almost all DFP calculations tend to employ
    relatively small numbers, mostly dfadd/dfsub/dfmul operations with fixed precision, and those will always be faster (in software) using the
    binary mantissa.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Nov 4 22:51:28 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain-dead easy: one table of 1024 entries, each 12 bits wide (DPD to BCD),
    and one table of 4096 entries, each 10 bits wide (BCD to DPD);
    isolate the 10-bit field, LD the converted value;
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Nov 4 15:46:06 2025
    From Newsgroup: comp.arch

    On 11/4/2025 11:15 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register),

    There is this idea of splitting an (indirect) branch into a
    prepare-to-branch instruction and a take-branch instruction. The

    I first heard about this 1982 from Burton Smith.

    prepare-to-branch instruction announces the branch target to the CPU,
    and Power's mtlr and mtctr are examples of that (somewhat muddled by
    the fact that the ctr register can also be used for counted loops as
    well as for indirect branches), and IA-64's branch-target registers
    and the instructions that move there are another example. AFAIK SPARC
    acquired something in this direction (touted as good for accelerating
    Java) in the early 2000s. The take-branch instruction on Power is
    blr/bctr.

    I used to think that this kind of splitting is a good idea, and it is
    certainly better than a branch-delay slot or a branch with a fixed
    number of delay slots.

    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 00:44:18 2025
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Bottom line: History-based branch prediction has won, any kind of
    delayed branches (including split-branch designs) turn out to be
    a bad idea.

    Or "Never bet against branch prediction".

    I have probably mentioned this before, once or twice, but I'm actually
    quite proud of the meeting I had with Intel Santa Clara in the spring of 1995:

    I had (accidentally) written the first public mention of the FDIV bug
    (on comp.sys.intel) in Oct 1994, then together with Cleve Moler of MathWorks/MatLab fame led the effort to develop a minimum cost sw
    workaround for the bug. (My code became part of all/most x86 compiler runtimes for the next few years.)

    Due to this Intel invited me to receive an early engineering prototype
    of the PentiumPro, together with an NDA-covered briefing about its architecture.

    Before the start of that briefing I suggested that I should start off on
    the blackboard by showing what I had been able to figure out on my own,
    then I proceeded to pretty much exactly cover every single feature on
    the cpu, with one glaring exception:

    Based on the useful but not great branch predictor on the Pentium I told them that I expected the P6 to employ eager execution, i.e execute both
    ways of one or two layers of branches, discarding the non-taken paths as
    the branch direction info became available.

    That's the point when they got to brag about how having a much, much
    better branch predictor was better both from a performance and a power viewpoint, since out of order execution could predict much deeper than
    any eager execution would have the resources for.

    I remember you relating this story about 6-8 years ago.

    As you said: "Never bet against branch prediction".

    Terje

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 02:51:10 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register),

    There is this idea of splitting an (indirect) branch into a
    prepare-to-branch instruction and a take-branch instruction. The

    I first heard about this 1982 from Burton Smith.

    prepare-to-branch instruction announces the branch target to the CPU,
    and Power's mtlr and mtctr are examples of that (somewhat muddled by
    the fact that the ctr register can also be used for counted loops as
    well as for indirect branches), and IA-64's branch-target registers
    and the instructions that move there are another example. AFAIK SPARC
    acquired something in this direction (touted as good for accelerating
    Java) in the early 2000s. The take-branch instruction on Power is
    blr/bctr.

    I used to think that this kind of splitting is a good idea, and it is
    certainly better than a branch-delay slot or a branch with a fixed
    number of delay slots.

    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Probably.

    I find it somewhat amusing that modern languages moved away from
    label variables and into method calls -- which if you look at it
    from 5,000 feet/metres -- is just a more expensive "label".

    I also find it amusing that the backbone of modern software is
    a static version of label variables -- we call them switch state-
    ments.

    But you can be sure COBOL got them from assembly language programmers.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Nov 4 23:43:48 2025
    From Newsgroup: comp.arch

    On 11/4/2025 4:51 PM, MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.


    In SW, you would still need to burn 16 bits per entry on the table, and possibly have code to fill in the tables (well, unless the numbers are expressed in code).


    A similar strategy is often used for sin/cos in many 90s era games,
    though the table is big enough that it would likely be impractical to
    type out by hand (or calculate using mental math).

    It is likely someone at ID Software or similar wrote out code at one
    point to spit out the sin+cos lookup table as a big blob of C (say,
    because an 8192 entry table is likely too big to be reasonable to type
    out by hand).
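
    A sketch of the sort of generator being speculated about (my code, not
    id's): spit out an 8192-entry sine table as a big blob of C.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        const double pi = 3.14159265358979323846;
        printf("static const float sintab[8192] = {\n");
        for (int i = 0; i < 8192; i++) {
            double a = (2.0 * pi * i) / 8192.0;
            printf("\t%.8ff,%s", sin(a), ((i & 7) == 7) ? "\n" : "");
        }
        printf("};\n");
        return 0;
    }

    (Link with -lm; the output is the kind of blob one pastes into the game
    source rather than typing 8192 constants by hand.)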


    Sometimes it becomes a question of where exactly the tradeoff lies in
    these cases: when to use typing and mental math, and when to write some
    code to spit out a table.

    For me, the tradeoff is often somewhere around 256 numbers, or less if
    the calculation is mentally difficult (namely, whether typing or
    calculating is the bottleneck).


    Most likely, for DPD<->BCD, I would resort to using code to generate
    the lookup table.

    Then again, it might depend a lot on the person...



    You still need to build 12-bit decimal ALUs to string together

    When I did it experimentally, I had done 16 BCD digits in 64 bits...

    The cost was slightly higher than that of a 64-bit ADD/SUB unit.

    Generally, it was combining the normal 4-bit CARRY4 style logic with
    some LUTs on the output side to turn it into a sort of BCD equivalent of
    a CARRY4.

    Granted, doing it with 3/6/9 digits would be cheaper than with 16 digits.
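
    For comparison, the same 16 digits can be added branch-free in software
    with the classic bias-by-6 trick (a sketch of mine, not the HDL above):

    #include <stdint.h>

    /* Packed-BCD add, 16 digits per 64-bit word.  Bias every digit by 6 so
       binary nibble carries coincide with decimal carries, then un-bias the
       digits that did not produce a carry. */
    static uint64_t bcd_add64(uint64_t a, uint64_t b, int *carry_out)
    {
        uint64_t t1 = a + 0x6666666666666666ULL;   /* bias all 16 digits    */
        uint64_t t2 = t1 + b;                      /* plain binary add      */
        uint64_t co = (t2 < t1);                   /* carry out of digit 15 */
        uint64_t t3 = t1 ^ b;
        uint64_t t4 = t2 ^ t3;                     /* per-bit carry-ins     */
        uint64_t t5 = ~t4 & 0x1111111111111110ULL; /* boundaries w/o carry  */
        uint64_t t6 = (t5 >> 2) | (t5 >> 3);       /* 0x6 at digits 0..14   */
        if (!co)
            t6 += 0x6000000000000000ULL;           /* un-bias digit 15 too  */
        *carry_out = (int)co;
        return t2 - t6;
    }

    (E.g. adding 0x05 and 0x05 yields 0x10 with carry_out 0, i.e. BCD "10".)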


    Though, if doing it purely in software, may make sense to go a different route:
    Map DPD to a linear integer between 0 and 999;
    Combine groups of 3 values into a 32 bit value;
    Work 32 bits at a time;
    Split back up to groups of 3 digits, and map back to DPD.

    Though, depends on the operation, for some it may be faster to operate
    in groups of 3 digits at a time (and sidestep the costs of combining or splitting the values).
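
    A minimal sketch (mine) of that combine/split step, with g2 as the most
    significant of the three groups:

    #include <stdint.h>

    static uint32_t combine3(uint32_t g2, uint32_t g1, uint32_t g0)
    {
        return (g2 * 1000u + g1) * 1000u + g0;   /* 0..999999999 */
    }

    static void split3(uint32_t v, uint32_t *g2, uint32_t *g1, uint32_t *g0)
    {
        *g0 = v % 1000u;  v /= 1000u;
        *g1 = v % 1000u;
        *g2 = v / 1000u;
    }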


    Then again, thinking about it, it is possible that for the Decimal128
    BID format, the mantissa could be broken up into smaller chunks (say, 9 digits) without the need for a full-width 128-bit multiply.

    In this case, could use a narrower multiply, and the "error" from the
    overflow would exist outside of the range of digits that are being
    worked on, so effectively becomes irrelevant for the operation in
    question (so, may be able to use 32 or 64 bit multiply, and 128-bit ADD).

    Granted, this is untested.

    Well, apart from how to recombine the parts without the need for wide multiply.

    In theory, could turn it into a big pile of shifts-and-add. Not sure if
    there is a good way to limit the number of shifts-and-adds needed. Well, unless turned into multiply-by-100 (3 shift 2 add) 4x times followed by multiply by 10 (1 shift 1 add), to implement multiply by 1 billion, but
    this also sucks (vs 13 shift 12 add).

    Hmm...


    Ironically, the DPD option almost looks preferable...


    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Nov 5 05:17:53 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
    Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this
    feature?

    <https://riptutorial.com/fortran/example/11872/assigned-goto> says:
    |It can be avoided in modern code by using procedures, internal
    |procedures, procedure pointers and other features.

    I know no feature in Fortran or standard C which replaces my use of labels-as-values, the GNU C equivalent of the assigned goto. If you
    look at <https://www.complang.tuwien.ac.at/forth/threading/>, "direct"
    and "indirect" use labels-as-values, whereas "switch", "call" and
    "repl. switch" use standard C features (switch, indirect calls, and
    switch+goto respectively). "direct" and "indirect" usually outperform
    these others, sometimes by a lot.
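
    A minimal direct-threaded dispatch loop using GNU C labels-as-values, as
    a sketch (mine, not the code from that page):

    #include <stdio.h>

    int main(void)
    {
        /* the "program": an array of label addresses, one per operation */
        static void *prog[] = { &&op_inc, &&op_inc, &&op_print, &&op_halt };
        void **ip = prog;
        long acc = 0;

        goto **ip++;                    /* dispatch the first operation */
    op_inc:
        acc++;
        goto **ip++;                    /* tail-dispatch to the next op */
    op_print:
        printf("%ld\n", acc);
        goto **ip++;
    op_halt:
        return 0;
    }

    Run as-is it prints 2; the point is that every operation ends in its own
    indirect jump straight to the next one, with no central dispatch loop.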

    I also find it amusing that the backbone of modern software is
    a static version of label variables -- we call them switch state-
    ments.

    I am not sure if it's "the" backbone. Fortran has (had?) a feature
    called "computed goto" that's closer to C's switch than "assigned
    goto". Ironically, the gcc people usually call their labels-as-values
    feature "computed goto" rather than "labels as values" or "assigned
    goto".

    But you can be sure COBOL got them from assembly language programmers.

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the 6th
    edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition. Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 01:41:30 2025
    From Newsgroup: comp.arch

    On 2025-11-03 2:03 p.m., MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always do the >>>> op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).

    Likely, My 66000 also has RNO and
    Round Nearest Random is defined but not yet available
    Round Away from Zero is also defined and available.

    Round nearest random? How about round externally guided (RXG) by an
    input signal? For instance, the rounding could come from a feedback
    filter of some sort.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Nov 5 06:44:54 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
    Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler supports it
    <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    That is the problem with deleted features - compiler writers have
    to support them forever, and interaction with other features can
    lead to problems.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 01:47:56 2025
    From Newsgroup: comp.arch

    On 2025-11-03 1:47 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Contemplating having conditional branch instructions branch to a target
    value in a register instead of using a displacement.

    I think this has about the same code density as having a branch to a
    displacement from the IP.

    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register), tell the CPU
    what the target is (like VEC in My66000) or just use a general
    purpose register with a general-purpose instruction.

    Using a fused compare-and-branch instruction for Qupls4

    Is that the name of your architecture, or an instruction? (That
    may have been mentioned upthread, in that case I don't remember).

    That was the name of the architecture, but I am being fickle and
    scrapping it, restarting from the Qupls2024 architecture and evolving it into Qupls2026.


    there is not
    enough room in the instruction for a large branch displacement (10
    bits). So, my thought is to branch to a register value instead.
    There is already an add-to-instruction-pointer instruction that can be
    used to generate relative addresses.

    That makes sense.

    Using 48-bit instructions now, so there is enough room for an 18-bit
    displacement. Still having branch to register as well.
    By moving the register load outside of a loop, the dynamic instruction
    count can be reduced. I think this solution is a bit better than having
    compare and branch as two separate instructions, or having an extended
    constant added to the branch instruction.

    Are you talking about a normal loop condition or a jump out of
    a loop?

    Any loop condition that needs a displacement constant. The constant
    being loaded into a register.

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    If you use a link register or a special instruction, the CPU could
    do that.

    The 10-bit displacement format could also be supported, but it is yet
    another branch instruction format. I may leave holes in the instruction
    set for future support, but I think it is best to start with just a
    single format.

    Code:
    AIPSI R3,1234 ; add displacement to IP and store in R3 (hoist-able)
    BLT R1,R2,R3 ; branch to R3 if R1 < R2

    Versus:
    CMP R3,R1,R2
    BLT R3,displacement


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Nov 4 22:53:49 2025
    From Newsgroup: comp.arch

    On 11/4/2025 9:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
    Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this feature?

    Because it could, and often did, make the code "unfollowable". That is,
    you are reading the code, following it to try to figure out what it is
    doing and come to an assigned/alter goto, and you don't know where to go
    next. The value was set some place else in the code, who knows where,
    and thus what value it was set to, and people/programmers just aren't
    used to being able to follow code like that. BTDT.

    BTW, you mentioned that it could be implemented as an indirect jump. It
    could for those architectures that supported that feature, but it could
    also be implemented by having the Alter/Assign modify the code (i.e.
    change the address in the jump/branch instruction), and self modifying
    code is just bad.

    I am not saying it couldn't be used well. Just that it was often not,
    and when not, it caused a lot of problems.




    <https://riptutorial.com/fortran/example/11872/assigned-goto> says:
    |It can be avoided in modern code by using procedures, internal
    |procedures, procedure pointers and other features.

    I know no feature in Fortran or standard C which replaces my use of labels-as-values, the GNU C equivalent of the assigned goto. If you
    look at <https://www.complang.tuwien.ac.at/forth/threading/>, "direct"
    and "indirect" use labels-as-values, whereas "switch", "call" and
    "repl. switch" use standard C features (switch, indirect calls, and switch+goto respectively). "direct" and "indirect" usually outperform
    these others, sometimes by a lot.

    I also find it amusing that the backbone of modern software is
    a static version of label variables -- we call them switch statements.

    I am not sure if it's "the" backbone. Fortran has (had?) a feature
    called "computed goto" that's closer to C's switch than "assigned
    goto".

    As did COBOL, where it was called GO TO DEPENDING ON, but those features didn't suffer
    the problems of assigned/alter gotos.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Nov 5 06:55:49 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    That is the problem with deleted features - compiler writers have
    to support them forever, and interaction with other features can
    lead to problems.

    So does gfortran support assigned goto, too? What problems in
    interaction with other features do you see?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 5 01:00:32 2025
    From Newsgroup: comp.arch

    On 11/4/2025 3:44 PM, Terje Mathisen wrote:
    MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Bottom line: History-based branch prediction has won, any kind of
    delayed branches (including split-branch designs) turn out to be
    a bad idea.

    Or "Never bet against branch prediction".

    I have probably mentioned this before, once or twice, but I'm actually
    quite proud of the meeting I had with Intel Santa Clara in the spring of 1995:

    I had (accidentally) written the first public mention of the FDIV bug
    (on comp.sys.intel) in Oct 1994, then together with Cleve Moler of MathWorks/MatLab fame led the effort to develop a minimum cost sw
    workaround for the bug. (My code became part of all/most x86 compiler runtimes for the next few years.)

    Due to this Intel invited me to receive an early engineering prototype
    of the PentiumPro, together with an NDA-covered briefing about its architecture.

    Before the start of that briefing I suggested that I should start off on
    the blackboard by showing what I had been able to figure out on my own,
    then I proceeded to pretty much exactly cover every single feature on
    the cpu, with one glaring exception:

    Based on the useful but not great branch predictor on the Pentium I told them that I expected the P6 to employ eager execution, i.e execute both
    ways of one or two layers of branches, discarding the non-taken paths as
    the branch direction info became available.

    That's the point when they got to brag about how having a much, much
    better branch predictor was better both from a performance and a power viewpoint, since out of order execution could predict much deeper than
    any eager execution would have the resources for.

    As you said: "Never bet against branch prediction".


    Branch prediction is fun.


    When I looked around online before, a lot of stuff about branch
    prediction was talking about fairly large and convoluted schemes for the branch predictors.

    But, then always at the end of it using 2-bit saturating counters:
    weakly taken, weakly not-taken, strongly taken, strongly not taken.

    But, in my fiddling, there was seemingly a simple but moderately
    effective strategy:
    Keep a local history of taken/not-taken;
    XOR this with the low-order-bits of PC for the table index;
    Use a 5/6-bit finite-state-machine or similar.
    Can model repeating patterns up to ~ 4 bits.

    Where, the idea was that the state-machine is updated with the current
    state and branch direction, giving the next state and next predicted
    branch direction (for this state).
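    The lookup side of that can be sketched in C roughly as below (table
    sizes are made up, and the conventional 2-bit saturating counter stands
    in for the 5/6-bit FSM; replacing the counter update with a
    next_state[state][taken] table lookup gives the FSM version):

      #include <stdint.h>

      #define TBL_BITS 10
      #define TBL_MASK ((1u << TBL_BITS) - 1)

      static uint8_t ctr[1 << TBL_BITS];   /* predictor state per entry */
      static uint8_t hist[1 << TBL_BITS];  /* per-branch local history (4 bits) */

      static uint32_t pht_index(uint32_t pc)
      {
          /* XOR the branch's local history with the low-order PC bits. */
          return (pc ^ hist[pc & TBL_MASK]) & TBL_MASK;
      }

      static int predict_branch(uint32_t pc)
      {
          return ctr[pht_index(pc)] >= 2;  /* MSB of the 2-bit counter */
      }

      static void update_branch(uint32_t pc, int taken)
      {
          uint32_t i = pht_index(pc);
          if (taken  && ctr[i] < 3) ctr[i]++;   /* saturate upward   */
          if (!taken && ctr[i] > 0) ctr[i]--;   /* saturate downward */
          hist[pc & TBL_MASK] = ((hist[pc & TBL_MASK] << 1) | (taken & 1)) & 0xF;
      }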


    Could model slightly more complex patterns than the 2-bit saturating
    counters, but it is sort of a partial mystery why (for mainstream
    processors) more complex lookup schemes and 2 bit state, was preferable
    to a simpler lookup scheme and 5-bit state.

    Well, apart from the relative "dark arts" needed to cram 4-bit patterns
    into a 5 bit FSM (it is a bit easier if limiting the patterns to 3 bits).



    Then again, I had noted before that the LLMs are seemingly also not really
    able to figure out how to make a 5 bit FSM to model a full set of 4 bit patterns.


    Then again, I wouldn't expect it to be all that difficult of a problem
    for someone that is "actually smart"; so presumably chip designers could
    have done similar.

    Well, unless maybe the argument is that 5 or 6 bits of storage would
    cost more than 2 bits, but then presumably needing to have significantly larger tables (to compensate for the relative predictive weakness of
    2-bit state) would have cost more than the cost of smaller tables of 6
    bit state ?...

    Say, for example, 2b:
    00_0 => 10_0 //Weakly not-taken, dir=0, goes strong not-taken
    00_1 => 01_0 //Weakly not-taken, dir=1, goes weakly taken
    01_0 => 00_1 //Weakly taken, dir=0, goes weakly not-taken
    01_1 => 11_1 //Weakly taken, dir=1, goes strongly taken
    10_0 => 10_0 //strongly not taken, dir=0
    10_1 => 00_0 //strongly not taken, dir=1 (goes weak)
    11_0 => 01_1 //strongly taken, dir=0
    11_1 => 11_1 //strongly taken, dir=1 (goes weak)

    Can expand it to 3-bits, for 2-bit patterns
    As above, and 4-more alternating states
    And slightly different transition logic.
    Say (abbreviated):
    000 weak, not taken
    001 weak, taken
    010 strong, not taken
    011 strong, taken
    100 weak, alternating, not-taken
    101 weak, alternating, taken
    110 strong, alternating, not-taken
    111 strong, alternating, taken
    The alternating states just flip-flopping between taken and not taken.
    The weak states can move between any of the 4.
    The strong states used if the pattern is reinforced.

    Going up to 3 bit patterns is more of the same (add another bit,
    doubling the number of states). Seemingly something goes nasty when
    getting to 4 bit patterns though (and can't fit both weak and strong
    states for longer patterns, so the 4b patterns effectively only exist as
    weak states which partly overlap with the weak states for the 3-bit
    patterns).

    But, yeah, not going to type out state tables for these ones.


    Not proven, but I suspect that an arbitrary 5 bit pattern within a 6 bit
    state might be impossible. Although there would be sufficient
    state-space for the looping 5-bit patterns, there may not be sufficient state-space to distinguish whether to move from a mismatched 4-bit
    pattern to a 3 or 5 bit pattern. Whereas, at least with 4-bit, any
    mismatch of the 4-bit pattern can always decay to a 3-bit pattern, etc.
    One needs to be able to express decay both to shorter patterns and to
    longer patterns, and I suspect at this point, the pattern breaks down
    (but can't easily confirm; it is either this or the pattern extends indefinitely, I don't know...).


    Could almost have this sort of thing as a "brain teaser" puzzle or something...

    Then again, maybe other people would not find any particular difficulty
    in these sorts of tasks.


    Terje


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 02:06:50 2025
    From Newsgroup: comp.arch

    On 2025-11-05 1:47 a.m., Robert Finch wrote:
    On 2025-11-03 1:47 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Contemplating having conditional branch instructions branch to a target
    value in a register instead of using a displacement.

    I think this has about the same code density as having a branch to a
    displacement from the IP.

    Should be possible.  A question is if you want to have a special
    register for that (like POWER's link register), tell the CPU
    what the target is (like VEC in My66000) or just use a general
    purpose register with a general-purpose instruction.

    Using a fused compare-and-branch instruction for Qupls4

    Is that the name of your architecture, or an instruction?  (That
    may have been mentioned upthread, in that case I don't remember).

    That was the name of the architecture, but I am being fickle and
    scrapping it, restarting from the Qupls2024 architecture and evolving it into Qupls2026.


    there is not
    enough room in the instruction for a large branch displacement (10
    bits). So, my thought is to branch to a register value instead.
    There is already an add-to-instruction-pointer instruction that can be
    used to generate relative addresses.

    That makes sense.

    Using 48-bit instructions now, so there is enough room for an 18-bit
    displacement. Still having branch to register as well.
    By moving the register load outside of a loop, the dynamic instruction
    count can be reduced. I think this solution is a bit better than having
    compare and branch as two separate instructions, or having an extended
    constant added to the branch instruction.

    Are you talking about a normal loop condition or a jump out of
    a loop?

    Any loop condition that needs a displacement constant. The constant
    being loaded into a register.

    One gotcha may be that the branch target needs to be predicted as it
    cannot be calculated earlier in the pipeline.

    If you use a link register or a special instruction, the CPU could
    do that.

    The 10-bit displacement format could also be supported, but it is yet
    another branch instruction format. I may leave holes in the instruction
    set for future support, but I think it is best to start with just a
    single format.

    Code:
    AIPSI R3,1234    ; add displacement to IP and store in R3 (hoist-able)
    BLT R1,R2,R3     ; branch to R3 if R1 < R2

    Versus:
    CMP R3,R1,R2
    BLT R3,displacement


    I am now modifying Qupls2024 into Qupls2026 rather than starting a
    completely new ISA. The big difference is Qupls2024 uses 64-bit
    instructions and Qupls2026 uses 48-bit instructions making the code 25%
    more compact with no real loss of operations.

    Qupls2024 also used 8-bit register specs. This was a bit of overkill and
    not really needed. Register specs are reduced to 6-bits. Right away, that shaved eight bits off most instructions.

    I decided I liked the dual operations that some instructions supported,
    which need a wide instruction format.

    One gotcha is that 64-bit constant overrides need to be modified. For Qupls2024 a 64-bit constant override could be specified using only a
    single additional instruction word. This is not possible with 48-bit instruction words. Qupls2024 only allowed a single additional constant
    word. I may maintain this for Qupls2026, but that means that a max
    constant override of 48-bits would be supported. A 64-bit constant can
    still be built up in a register using the add-immediate with shift instruction. It is ugly and takes about three instructions.

    I could reduce the 64-bit constant build to two instructions by adding a load-immediate instruction.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Nov 5 07:13:46 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    I played around with the formulas from the POWER manual a bit,
    using Berkeley abc for logic optimization, for the conversion
    of the packed modulo 1000 to three BCD digits.

    Without spending too much effort, I arrived at four gate delays
    (INV -> OAI21 -> NAND2 -> NAND2) with a total of 37 gates optimizing
    for speed, or five gate delays optimizing for space.

    I strongly suspect that IBM is doing something similar :-)
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 5 01:38:30 2025
    From Newsgroup: comp.arch

    On 11/5/2025 1:00 AM, BGB wrote:
    On 11/4/2025 3:44 PM, Terje Mathisen wrote:
    MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Bottom line: History-based branch prediction has won, any kind of
    delayed branches (including split-branch designs) turn out to be
    a bad idea.

    Or "Never bet against branch prediction".

    I have probably mentioned this before, once or twice, but I'm actually
    quite proud of the meeting I had with Intel Santa Clara in the spring
    of 1995:

    I had (accidentally) written the first public mention of the FDIV bug
    (on comp.sys.intel) in Oct 1994, then together with Cleve Moler of
    MathWorks/MatLab fame led the effort to develop a minimum cost sw
    workaround for the bug. (My code became part of all/most x86 compiler
    runtimes for the next few years.)

    Due to this Intel invited me to receive an early engineering prototype
    of the PentiumPro, together with an NDA-covered briefing about its
    architecture.

    Before the start of that briefing I suggested that I should start off
    on the blackboard by showing what I had been able to figure out on my
    own, then I proceeded to pretty much exactly cover every single
    feature on the cpu, with one glaring exception:

    Based on the useful but not great branch predictor on the Pentium I
    told them that I expected the P6 to employ eager execution, i.e
    execute both ways of one or two layers of branches, discarding the
    non-taken paths as the branch direction info became available.

    That's the point when they got to brag about how having a much, much
    better branch predictor was better both from a performance and a power
    viewpoint, since out of order execution could predict much deeper than
    any eager execution would have the resources for.

    As you said: "Never bet against branch prediction".


    Branch prediction is fun.


    When I looked around online before, a lot of stuff about branch
    prediction was talking about fairly large and convoluted schemes for the branch predictors.

    But, then always at the end of it using 2-bit saturating counters:
      weakly taken, weakly not-taken, strongly taken, strongly not taken.

    But, in my fiddling, there was seemingly a simple but moderately
    effective strategy:
      Keep a local history of taken/not-taken;
      XOR this with the low-order-bits of PC for the table index;
      Use a 5/6-bit finite-state-machine or similar.
        Can model repeating patterns up to ~ 4 bits.

    Where, the idea was that the state-machine is updated with the current
    state and branch direction, giving the next state and next predicted
    branch direction (for this state).


    Could model slightly more complex patterns than the 2-bit saturating counters, but it is sort of a partial mystery why (for mainstream processors) more complex lookup schemes and 2 bit state, was preferable
    to a simpler lookup scheme and 5-bit state.

    Well, apart from the relative "dark arts" needed to cram 4-bit patterns
    into a 5 bit FSM (it is a bit easier if limiting the patterns to 3 bits).



    Then again, I had noted before that the LLMs are seemingly also not really able to figure out how to make a 5 bit FSM to model a full set of 4 bit patterns.



    Errm...

    I just decided to test it, and it appears Grok was able to figure it out
    (more or less).

    This is concerning: either the AIs are getting smart enough to deal with semi-difficult problems, or in fact it is not difficult and I was just
    dumb for thinking there is any difficulty in working out the state
    tables for the longer patterns.

    I tried before with DeepSeek R1 and similar, which had failed.



    Then again, I wouldn't expect it to be all that difficult of a problem
    for someone that is "actually smart"; so presumably chip designers could have done similar.

    Well, unless maybe the argument is that 5 or 6 bits of storage would
    cost more than 2 bits, but then presumably needing to have significantly larger tables (to compensate for the relative predictive weakness of 2-
    bit state) would have cost more than the cost of smaller tables of 6
    bit state ?...

    Say, for example, 2b:
     00_0 => 10_0  //Weakly not-taken, dir=0, goes strong not-taken
     00_1 => 01_0  //Weakly not-taken, dir=1, goes weakly taken
     01_0 => 00_1  //Weakly taken, dir=0, goes weakly not-taken
     01_1 => 11_1  //Weakly taken, dir=1, goes strongly taken
     10_0 => 10_0  //strongly not taken, dir=0
     10_1 => 00_0  //strongly not taken, dir=1 (goes weak)
     11_0 => 01_1  //strongly taken, dir=0
     11_1 => 11_1  //strongly taken, dir=1 (goes weak)

    Can expand it to 3-bits, for 2-bit patterns
      As above, and 4-more alternating states
      And slightly different transition logic.
    Say (abbreviated):
      000   weak, not taken
      001   weak, taken
      010   strong, not taken
      011   strong, taken
      100   weak, alternating, not-taken
      101   weak, alternating, taken
      110   strong, alternating, not-taken
      111   strong, alternating, taken
    The alternating states just flip-flopping between taken and not taken.
      The weak states can move between any of the 4.
      The strong states used if the pattern is reinforced.

    Going up to 3 bit patterns is more of the same (add another bit,
    doubling the number of states). Seemingly something goes nasty when
    getting to 4 bit patterns though (and can't fit both weak and strong
    states for longer patterns, so the 4b patterns effectively only exist as weak states which partly overlap with the weak states for the 3-bit patterns).

    But, yeah, not going to type out state tables for these ones.


    Not proven, but I suspect that an arbitrary 5 bit pattern within a 6 bit state might be impossible. Although there would be sufficient state-
    space for the looping 5-bit patterns, there may not be sufficient state-space to distinguish whether to move from a mismatched 4-bit pattern to
    a 3 or 5 bit pattern. Whereas, at least with 4-bit, any mismatch of the 4-bit pattern can always decay to a 3-bit pattern, etc. One needs to be
    able to express decay both to shorter patterns and to longer patterns,
    and I suspect at this point, the pattern breaks down (but can't easily confirm; it is either this or the pattern extends indefinitely, I don't know...).


    Could almost have this sort of thing as a "brain teaser" puzzle or something...

    Then again, maybe other people would not find any particular difficulty
    in these sorts of tasks.


    But, alas, sometimes I wonder if I am just kinda stupid and everyone
    else has already kinda figured this out, but doesn't say much...

    Like, just smart enough to do the things that I do, but not so much otherwise... In theory, I am kinda OK, but often it mostly seems like I
    mostly just suck at everything.



    Terje



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 5 02:01:35 2025
    From Newsgroup: comp.arch

    On 11/4/2025 11:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
    Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this feature?

    <https://riptutorial.com/fortran/example/11872/assigned-goto> says:
    |It can be avoided in modern code by using procedures, internal
    |procedures, procedure pointers and other features.

    I know no feature in Fortran or standard C which replaces my use of labels-as-values, the GNU C equivalent of the assigned goto. If you
    look at <https://www.complang.tuwien.ac.at/forth/threading/>, "direct"
    and "indirect" use labels-as-values, whereas "switch", "call" and
    "repl. switch" use standard C features (switch, indirect calls, and switch+goto respectively). "direct" and "indirect" usually outperform
    these others, sometimes by a lot.


    I usually used call threading, because:
      In my testing it was one of the faster options;
        At least if excluding 32-bit x86,
        which often has slow function calls,
        because pretty much every function needs a stack frame, ...
      It is usable in standard C.

    Often "while loop and switch()" was notably slower than using unrolled
    lists of indirect function calls (usually with the main dispatch loop
    based on "traces", which would call each of the opcode functions and
    then return the next trace to be run).

    Granted, "while loop and switch" is the more traditional way of writing
    an interpreter.
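    As a rough sketch (not BGBCC's actual code; all names here are
    hypothetical), the trace-call pattern looks something like:

      #include <stddef.h>

      struct vm;                                   /* interpreter state, opaque here */
      typedef void (*op_fn)(struct vm *);

      struct trace {
          op_fn         ops[8];                    /* unrolled list of opcode handlers */
          size_t        n_ops;
          struct trace *next;                      /* successor trace (real code may pick dynamically) */
      };

      static void op_load (struct vm *vm) { (void)vm; /* ... */ }
      static void op_add  (struct vm *vm) { (void)vm; /* ... */ }
      static void op_store(struct vm *vm) { (void)vm; /* ... */ }

      /* Run one trace: indirect-call each handler, then hand back the successor. */
      static struct trace *run_trace(struct vm *vm, struct trace *t)
      {
          for (size_t i = 0; i < t->n_ops; i++)
              t->ops[i](vm);
          return t->next;
      }

      /* The main dispatch loop just keeps calling whatever trace comes back. */
      static void run(struct vm *vm, struct trace *t)
      {
          while (t != NULL)
              t = run_trace(vm, t);
      }

    A trace for "load; add; store" would then just be
    { { op_load, op_add, op_store }, 3, &next_trace }.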


    I also find it amusing that the backbone of modern software is
    a static version of label variables -- we call them switch statements.

    I am not sure if it's "the" backbone. Fortran has (had?) a feature
    called "computed goto" that's closer to C's switch than "assigned
    goto". Ironically, the gcc people usually call their labels-as-values feature "computed goto" rather than "labels as values" or "assigned
    goto".

    But you can be sure COBOL got them from assembly language programmers.

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the 6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition. Ritchie wrote <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    But, if you use it, you are basically stuck with GCC...


    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Nov 5 11:18:50 2025
    From Newsgroup: comp.arch

    On Tue, 4 Nov 2025 22:52:46 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    For the Intel binary mantissa dfp128, normalization is the hard issue; Michael S has figured out some really nice tricks to speed it up,

    I remember that I played with that, but don't remember what I did
    exactly. I dimly recollect that the fastest solution was relatively straight-forward. It was trying to minimize the length of dependency
    chains rather than total number of multiplications.
    An important point here is that I played on relatively old x86-64
    hardware. My solution is not necessarily optimal for newer hardware.
    The differences between old and new are two-fold and they push
    optimal solution into different directions.
    1. Increase in throughput of integer multiplier
    2. Decrease in latency of integer division

    The first factor suggests even more intense push toward "eager"
    solutions.

    The second factor suggests, possibly, much simpler code, especially in
    common case of division by 1 to 27 decimal digits (5**27 < 2**64).
    As they say, sometimes a division is just a division.

    but when you have a (worst case) temporary 220+ bit product mantissa, scaling is not that easy.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Nov 5 11:21:32 2025
    From Newsgroup: comp.arch

    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

    Are we talking about hardware or software?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 09:25:45 2025
    From Newsgroup: comp.arch

    On 2025-11-05 2:13 a.m., Thomas Koenig wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing
    that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    I played around with the formulas from the POWER manual a bit,
    using Berkeley abc for logic optimization, for the conversion
    of the packed modulo 1000 to three BCD digits.

    Without spending too much effort, I arrived at four gate delays
    (INV -> OAI21 -> NAND2 -> NAND2) with a total of 37 gates optimizing
    for speed, or five gate delays optimizing for space.

    I strongly suspect that IBM is doing something similar :-)

    I like that IBM packing method.

    I have some RTL code to pack and unpack modulo 1000 to BCD. I think it
    is fast and small enough that it can be used inline at the input and
    output of DFP operations. The DFP values can then be passed around in
    the CPU as 128-bit values instead of the expanded BCD value.

    Only 128-bit DFP is supported on my machine under the assumption that
    one is wanting the extended decimal precision for engineering / finance. Otherwise, why would one use it? Better to use BFP.

    One headache I have not worked out yet is how to convert between DFP
    and BFP in a sensible fashion. I have tried a couple of means but the
    results are way off. Using log/exp type functions. I suppose I could
    rely on conversions to and from text strings.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Nov 5 15:27:48 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Should be possible. A question is if you want to have a special
    register for that (like POWER's link register),

    There is this idea of splitting an (indirect) branch into a
    prepare-to-branch instruction and a take-branch instruction. The

    I first heard about this 1982 from Burton Smith.

    prepare-to-branch instruction announces the branch target to the CPU,
    and Power's mtlr and mtctr are examples of that (somewhat muddled by
    the fact that the ctr register can also be used for counted loops as
    well as for indirect branches), and IA-64's branch-target registers
    and the instructions that move there are another example. AFAIK SPARC >>>> acquired something in this direction (touted as good for accelerating
    Java) in the early 2000s. The take-branch instruction on Power is
    blr/bctr.

    I used to think that this kind of splitting is a good idea, and it is
    certainly better than a branch-delay slot or a branch with a fixed
    number of delay slots.

    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated,
    Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Probably.

    I find it somewhat amusing that modern languages moved away from
    label variables and into method calls -- which if you look at it
    from 5,000 feet/metres -- is just a more expensive "label".

    I also find it amusing that the backbone of modern software is
    a static version of label variables -- we call them switch statements.

    But you can be sure COBOL got them from assembly language programmers.

    Back before caches and branch predictors, my fastest word count (wc)
    asm program employed runtime code generation, it started by filling in a
    64kB segment with code snippets aligned every 128 bytes: Even block
    counts were for scanning outside a word and the odd entries were used
    when a word start had been found, then each snippet would load the next
    byte into BH and jump to BX. (BL contained the outside/inside flag value
    as 0/128)

    Fast forward a few years and a branchless data state machine ran far
    faster, culminating at (a measured) 1.5 clock cycles/byte on a Pentium.
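    Just to show the shape of the idea (this is not Terje's code, only a
    toy version), a branchless two-state word counter in C:

      #include <stddef.h>

      static size_t count_words(const unsigned char *p, size_t n)
      {
          unsigned char is_word[256];
          size_t words = 0;
          unsigned in_word = 0;

          /* Crude character-class table: anything but whitespace starts/continues a word. */
          for (size_t i = 0; i < 256; i++)
              is_word[i] = !(i == ' ' || i == '\t' || i == '\n' || i == '\r');

          /* Inner loop has no conditional branches other than the loop itself. */
          for (size_t i = 0; i < n; i++) {
              unsigned w = is_word[p[i]];
              words  += w & ~in_word & 1;   /* count 0->1 transitions (word starts) */
              in_word = w;
          }
          return words;
      }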

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Nov 5 15:42:37 2025
    From Newsgroup: comp.arch

    Michael S wrote:
    On Tue, 4 Nov 2025 22:52:46 +0100
    Terje Mathisen <terje.mathisen@tmsw.no> wrote:

    For the Intel binary mantissa dfp128, normalization is the hard issue;
    Michael S has figured out some really nice tricks to speed it up,

    I remember that I played with that, but don't remember what I did
    exactly. I dimly recollect that the fastest solution was relatively straight-forward. It was trying to minimize the length of dependency
    chains rather than total number of multiplications.
    An important point here is that I played on relatively old x86-64
    hardware. My solution is not necessarily optimal for newer hardware.
    The differences between old and new are two-fold and they push
    optimal solution into different directions.
    1. Increase in throughput of integer multiplier
    2. Decrease in latency of integer division

    The first factor suggests even more intense push toward "eager"
    solutions.

    The second factor suggests, possibly, much simpler code, especially in
    common case of division by 1 to 27 decimal digits (5**27 < 2**64).
    As they say, sometimes a division is just a division.

    I suspect that a model using pre-calculated reciprocals which generate
    ~10+ approximate digits, back-multiply and subtract, repeat once or
    twice, could perform OK.

    Having full ~225 bit reciprocals in order to generate the exact result
    in a single iteration would require 256-bit storage for each of them and
    the 256x256->512 MUL would use 16 64x64->128 MULs, but here we do have
    the possibility to start from the top and as soon as you get the high
    end 128 bits of the mantissa fixed (modulo any propagating carries from
    lower down) you could inspect the preliminary result and see that it
    would usually be far enough away from a tipping point so that you could
    stop there.
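    For the 64-bit building block, the back-multiply-and-correct idea can be
    sketched as below (a sketch only: it assumes the GCC/Clang unsigned
    __int128 extension, d >= 2, and a precomputed r = floor(2^64 / d)):

      #include <stdint.h>

      static uint64_t div_by_recip(uint64_t n, uint64_t d, uint64_t r)
      {
          uint64_t q   = (uint64_t)(((unsigned __int128)n * r) >> 64); /* first guess */
          uint64_t rem = n - q * d;                                    /* back-multiply, subtract */
          while (rem >= d) {                                           /* q never overshoots; fix any shortfall */
              q++;
              rem -= d;
          }
          return q;                                                    /* rem is the remainder if needed */
      }

    With r = floor(2^64 / d) the first guess is short by at most one, so the
    correction loop runs at most once.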

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 09:56:12 2025
    From Newsgroup: comp.arch

    Qupls2026 currently supports 48-bit inline constants. I am debating
    whether to support 89 and 130-bit inline constants as well. Constant
    sizes increase in 41-bit steps due to the 48-bit instruction word size. The
    larger constants would require more instruction words to be available to
    be processed in decode. Not sure if it is even possible to pass a
    constant larger than 64-bits in the machine.

    I just realized that constant operand routing was already in Qupls, I
    had just not specifically identified it. The operand routing bits are
    just moved into a postfix instruction word rather than the first
    instruction word. This gives more bits available in the instruction
    word. Rather than burn a couple of bits in every R3 type instruction,
    another couple of opcodes are used to represent constant extensions.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Wed Nov 5 17:26:44 2025
    From Newsgroup: comp.arch

    On 2025-11-05 7:17, Anton Ertl wrote:

    [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the 6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition. Ritchie wrote <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and
    not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value" statement
    where the label-value refers to a label in a different function. Does
    gcc prevent that at compile time? If not, I would expect the semantics
    to be Undefined Behavior, the usual cop-out when nothing useful can be said.

    (In an earlier discussion on this group, some years ago, I explained how labels-as-values could be added to Ada, using the type system to ensure
    safe and defined semantics. But I don't think such an extension would be accepted for the Ada standard.)

    Niklas
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Wed Nov 5 10:49:10 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it
    <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>
    That is the problem with deleted features - compiler writers have
    to support them forever, and interaction with other features can
    lead to problems.

    So does gfortran support assigned goto, too? What problems in
    interaction with other features do you see?

    - anton

    For a code analysis, an assigned goto, aka label variables,
    looks equivalent to:
    - make a list of all the target labels assigned to each label variable
    - at each "goto variable" substitute a switch statement with that list

    Where this might be a problem is if the label variable was a
    global symbol and the target labels were in other name spaces.
    At that point it could treat it like a pointer to a function and
    have to spill all live register variables to memory.
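    In C terms the substitution is roughly (illustrative names only):

      /* The label variable becomes an enum of the possible targets; each
         "goto variable" becomes a switch over that list of labels. */
      enum target { LAB10, LAB20 };

      static void demo(int flag)
      {
          enum target t = flag ? LAB10 : LAB20;   /* the "assign" side */

          switch (t) {                            /* the "goto variable" side */
          case LAB10: goto lab10;
          case LAB20: goto lab20;
          }
      lab10:
          /* ... */
          return;
      lab20:
          /* ... */
          return;
      }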


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 5 10:15:00 2025
    From Newsgroup: comp.arch

    On 11/5/2025 3:21 AM, Michael S wrote:
    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

    Are we talking about hardware or software?


    I had interpreted it as being about software with BCD helper ops.

    Otherwise, would probably go a different route.

    One other tradeoff is whether to go for Decimal128 in DPD or BID.

    Stuff online says BID is better for a software implementation, but I am
    having doubts. It is possible that DPD could make more sense in both
    cases, although, in the absence of BCD helpers, it likely makes sense
    to map DPD to linear 10-bit values.

    While BID could make sense, it has the drawback of requiring some way
    of quickly performing power-of-10 multiplies on large integer
    values. If you have a CPU where the fastest way to perform generic
    128-bit multiply is to break it down into 32 bit multiplies, and/or use shift-and-add, it is not a particularly attractive option.

    Contrast, working with 16-bit chunks holding 10 bit values is likely to
    work out being cheaper.
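    For reference, the declet-to-binary mapping needs only a 1024-entry
    table; below is a sketch of the standard DPD decode used to generate it
    (bit numbering b9..b0, most significant first; this is the textbook
    decode, not anyone's production code):

      #include <stdint.h>

      /* Decode one 10-bit DPD declet to a binary value 0..999. */
      static unsigned declet_to_bin(unsigned d)
      {
          unsigned b9 = (d >> 9) & 1, b8 = (d >> 8) & 1, b7 = (d >> 7) & 1;
          unsigned b6 = (d >> 6) & 1, b5 = (d >> 5) & 1, b4 = (d >> 4) & 1;
          unsigned b3 = (d >> 3) & 1, b2 = (d >> 2) & 1, b1 = (d >> 1) & 1;
          unsigned b0 = d & 1;
          unsigned hi, mid, lo;

          if (!b3) {                      /* all three digits 0..7 */
              hi = 4*b9 + 2*b8 + b7;  mid = 4*b6 + 2*b5 + b4;  lo = 4*b2 + 2*b1 + b0;
          } else if (!b2 && !b1) {        /* low digit is 8 or 9 */
              hi = 4*b9 + 2*b8 + b7;  mid = 4*b6 + 2*b5 + b4;  lo = 8 + b0;
          } else if (!b2 && b1) {         /* middle digit is 8 or 9 */
              hi = 4*b9 + 2*b8 + b7;  mid = 8 + b4;            lo = 4*b6 + 2*b5 + b0;
          } else if (b2 && !b1) {         /* high digit is 8 or 9 */
              hi = 8 + b7;            mid = 4*b6 + 2*b5 + b4;  lo = 4*b9 + 2*b8 + b0;
          } else if (!b6 && !b5) {        /* high and middle large */
              hi = 8 + b7;            mid = 8 + b4;            lo = 4*b9 + 2*b8 + b0;
          } else if (!b6 && b5) {         /* high and low large */
              hi = 8 + b7;            mid = 4*b9 + 2*b8 + b4;  lo = 8 + b0;
          } else if (b6 && !b5) {         /* middle and low large */
              hi = 4*b9 + 2*b8 + b7;  mid = 8 + b4;            lo = 8 + b0;
          } else {                        /* all three digits 8 or 9 */
              hi = 8 + b7;            mid = 8 + b4;            lo = 8 + b0;
          }
          return hi*100 + mid*10 + lo;
      }

      static uint16_t dpd_tab[1024];      /* fill once: dpd_tab[i] = declet_to_bin(i) */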

    Despite BID being more conceptually similar to Binary128, they differ in
    that Binary128 would only need to use large-integer multiply sparingly (namely, for multiply operations).



    Though, likely fastest option would be to map the DPD values to 30-bit
    linear values, then internally use the 30-bit linear values, and convert
    back to DPD at the end. Though, the performance of this is likely to
    depend on the operation.

    A non-standard variant, representing the value as packed 30 bit fields,
    could likely be the fastest option. Could use the same basic layout as
    the existing Decimal128 format.


    So, my guess for a performance ranking, fast to slow, being:
    1: Dense packed, 30b linear, 30+30+30+20+digit
    2: DPD
    3: BID


    As for whether or not to support Decimal128 (in either form), dunno.

    Closest I have to a use-case is that well, technically there is a
    _Decimal128 type in C, and it might make sense for it to be usable.

    But, then one needs to decide on which possible format to use here.
    And, whether to aim for performance or compatibility.


    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 5 10:23:16 2025
    From Newsgroup: comp.arch

    On 11/5/2025 9:26 AM, Niklas Holsti wrote:
    On 2025-11-05 7:17, Anton Ertl wrote:

       [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it?  C had it up to and including the 6th
    edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition.  Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and
    not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value" statement where the label-value refers to a label in a different function. Does
    gcc prevent that at compile time? If not, I would expect the semantics
    to be Undefined Behavior, the usual cop-out when nothing useful can be
    said.

    (In an earlier discussion on this group, some years ago, I explained how labels-as-values could be added to Ada, using the type system to ensure
    safe and defined semantics. But I don't think such an extension would be accepted for the Ada standard.)


    My guess here:
    It is an "oh crap" situation and program either immediately or (maybe
    not as immediately) explodes...

    Otherwise, it would need to function more like a longjmp, which would
    mean that it would likely be painfully slow.


    So, yeah, most likely UB, of a "particularly destructive" / "unlikely to
    be useful" kind.


    FWIW:
    This was not a feature that I feel inclined to support in BGBCC...


    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Nov 5 17:22:48 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    On 11/5/2025 9:26 AM, Niklas Holsti wrote:
    On 2025-11-05 7:17, Anton Ertl wrote:

    <computed goto>

    My guess here:
    It is an "oh crap" situation and program either immediately or (maybe
    not as immediately) explodes...

    Otherwise, it would need to function more like a longjmp, which would
    mean that it would likely be painfully slow.

    In my experience, longjmp is far faster than e.g. C++ exceptions.

    Granted, the code needs to be designed to allow longjmp without
    orphaning or leaking memory (i.e. in a context where there isn't any
    dynamic memory allocation) for the best speed.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Nov 5 18:03:31 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    That is the problem with deleted features - compiler writers have
    to support them forever, and interaction with other features can
    lead to problems.

    So does gfortran support assigned goto, too?

    Yes.

    What problems in
    interaction with other features do you see?

    In this case, it is more the problem of modern architectures.
    On 32-bit architectures, it might have been possible to stash
    the address of a jump target in an actual INTEGER variable and
    GO TO there. On a 64-bit architecture, this is not possible, so
    you need to have a shadow variable for the pointer, and possibly
    (if you want to catch GOTO when no variable has been assigned)
    a second variable.

    But it means extra work for compiler writers - additional effort, warnings,
    testing, ...
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Wed Nov 5 21:30:11 2025
    From Newsgroup: comp.arch

    On 2025-11-05 18:23, BGB wrote:
    On 11/5/2025 9:26 AM, Niklas Holsti wrote:
    On 2025-11-05 7:17, Anton Ertl wrote:

        [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it?  C had it up to and including the 6th
    edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition.  Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and
    not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value"
    statement where the label-value refers to a label in a different
    function. Does gcc prevent that at compile time? If not, I would
    expect the semantics to be Undefined Behavior, the usual cop-out when
    nothing useful can be said.

    (In an earlier discussion on this group, some years ago, I explained
    how labels-as-values could be added to Ada, using the type system to
    ensure safe and defined semantics. But I don't think such an extension
    would be accepted for the Ada standard.)


    My guess here:
    It is an "oh crap" situation and program either immediately or (maybe
    not as immediately) explodes...

    Or silently produces wrong results.

    Otherwise, it would need to function more like a longjmp, which would
    mean that it would likely be painfully slow.

    But then you could get the problem of a longjmp to a setjmp value that
    is stale because the targeted function invocation (stack frame) is no
    longer there.

    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 20:30:05 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-03 2:03 p.m., MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    Actually, for the five required basic operations, you can always do the
    op in the next higher precision, then round again down to the target,
    and get exactly the same result.

    https://guillaume.melquiond.fr/doc/05-imacs17_1-article.pdf

    The PowerISA version 3.0 introduced rounding to odd for its 128-bit
    floating point arithmetic, for that very reason (I assume).

    Likely, My 66000 also has RNO and
    Round Nearest Random is defined but not yet available
    Round Away from Zero is also defined and available.

    Round nearest random?

    Another unbiased rounding mode. Not yet available because I don't have
    a truly random source to guide the rounding.

    How about round externally guided (RXG) by an
    input signal?

    I guess that would be OK, but you could not make the statement that
    the rounding mode was unbiased.

    For instance, the rounding could come from a feedback
    filter of some sort.

    Sure, just you can't state "unbiased".
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 20:43:58 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 11/4/2025 3:44 PM, Terje Mathisen wrote:
    MitchAlsup wrote:
    ---------------

    As you said: "Never bet against branch prediction".


    Branch prediction is fun.


    When I looked around online before, a lot of stuff about branch
    prediction was talking about fairly large and convoluted schemes for the branch predictors.

    But, then always at the end of it using 2-bit saturating counters:
    weakly taken, weakly not-taken, strongly taken, strongly not taken.

    But, in my fiddling, there was seemingly a simple but moderately
    effective strategy:
    Keep a local history of taken/not-taken;
    XOR this with the low-order-bits of PC for the table index;
    Use a 5/6-bit finite-state-machine or similar.
    Can model repeating patterns up to ~ 4 bits.

    Where, the idea was that the state-machine is updated with the current
    state and branch direction, giving the next state and next predicted
    branch direction (for this state).


    Could model slightly more complex patterns than the 2-bit saturating counters, but it is sort of a partial mystery why (for mainstream processors) more complex lookup schemes and 2-bit state were preferable
    to a simpler lookup scheme and 5-bit state.

    In 1991 Mike Shebanow, Tse-Yu Yeh, and I tried out a Correlation predictor where strings of {T, !T}** were pattern matched to create a prediction.
    While it was somewhat competitive with Global History Table, it ultimately failed.

    I am now working on predictors for a 6-wide My 66000 machine--which is a bit different.
    a) VEC-LOOP loops do not alter the branch prediction tables.
    b) Predication clauses do not alter the BPTs.
    c) Jump Through Table is not predicted through jump-indirect table-like
    prediction; what is predicted is the value (switch variable), and this
    is used to index the table (early).
    d) CMOV gets rid of another 8%

    These strip out about 40% of branches from needing prediction, causing
    the remaining branches to be harder to predict but having less total
    latency in execution.

    -----------------
    Not proven, but I suspect that an arbitrary 5 bit pattern within a 6 bit state might be impossible. Although there would be sufficient
    state-space for the looping 5-bit patterns, there may not be sufficient state-space to distinguish whether to move from a mismatched 4-bit
    pattern to a 3 or 5 bit pattern. Whereas, at least with 4-bit, any
    mismatch of the 4-bit pattern can always decay to a 3-bit pattern, etc.
    One needs to be able to express decay both to shorter patterns and to
    longer patterns, and I suspect at this point, the pattern breaks down
    (but can't easily confirm; it is either this or the pattern extends indefinitely, I don't know...).

    Tried some of these (1991) mostly with little to no success.
    Be my guest and try again.
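
    A minimal C sketch of the simple scheme described above (outcome history
    XORed with low-order PC bits to index a small table), using a plain
    2-bit saturating counter per entry; the 5/6-bit FSM state would slot
    into the same table. A single global history register is used for
    brevity, where a per-branch local history would instead come from a
    small history table indexed by PC. Names are illustrative only.

    #include <stdint.h>
    #include <stdbool.h>

    #define TABLE_BITS 12
    #define TABLE_SIZE (1u << TABLE_BITS)
    #define IDX_MASK   (TABLE_SIZE - 1)

    /* Per-entry state: 0,1 = predict not-taken; 2,3 = predict taken. */
    static uint8_t  ctr[TABLE_SIZE];
    static uint32_t history;               /* shift register of recent outcomes */

    static uint32_t pred_index(uint32_t pc)
    {
        /* XOR the outcome history with low-order PC bits. */
        return ((pc >> 2) ^ history) & IDX_MASK;
    }

    bool predict(uint32_t pc)
    {
        return ctr[pred_index(pc)] >= 2;
    }

    void train(uint32_t pc, bool taken)
    {
        uint8_t *c = &ctr[pred_index(pc)];
        if (taken) { if (*c < 3) (*c)++; }
        else       { if (*c > 0) (*c)--; }
        history = ((history << 1) | (taken ? 1u : 0u)) & IDX_MASK;
    }
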
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 20:52:22 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-05 1:47 a.m., Robert Finch wrote:
    -----------
    I am now modifying Qupls2024 into Qupls2026 rather than starting a completely new ISA. The big difference is Qupls2024 uses 64-bit
    instructions and Qupls2026 uses 48-bit instructions making the code 25%
    more compact with no real loss of operations.

    Qupls2024 also used 8-bit register specs. This was a bit of overkill and
    not really needed. Register specs are reduced to 6 bits. Right away, that reduced most instructions by eight bits.

    4 register specifiers: check.

    I decided I liked the dual operations that some instructions supported, which need a wide instruction format.

    With 48-bits, if you can get 2 instructions 50% of the time, you are only
    12% bigger than a 32-bit ISA.

    One gotcha is that 64-bit constant overrides need to be modified. For Qupls2024 a 64-bit constant override could be specified using only a
    single additional instruction word. This is not possible with 48-bit instruction words. Qupls2024 only allowed a single additional constant
    word. I may maintain this for Qupls2026, but that means that a max
    constant override of 48-bits would be supported. A 64-bit constant can
    still be built up in a register using the add-immediate with shift instruction. It is ugly and takes about three instructions.

    It was that sticking problem of constants that drove most of My 66000
    ISA style--variable length and how to encode access to these constants
    and routing thereof.

    Motto: never execute any instructions fetching or building constants.

    I could reduce the 64-bit constant build to two instructions by adding a load-immediate instruction.

    May I humbly suggest this is the wrong direction.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 20:53:59 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    I played around with the formulas from the POWER manual a bit,
    using Berkeley abc for logic optimization, for the conversion
    of the packed modulo 1000 to three BCD digits.

    Without spending too much effort, I arrived at four gate delays
    (INV -> OAI21 -> NAND2 -> NAND2) with a total of 37 gates optimizing
    for speed, or five gate delays optimizing for space.

    Since the gates hang off flip-flops, you don't need the inv gate
    at the front. Flip-flops can easily give both true and complement
    outputs.

    I strongly suspect that IBM is doing something similar :-)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 21:04:57 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 11/4/2025 11:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler supports it <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this feature?

    <https://riptutorial.com/fortran/example/11872/assigned-goto> says:
    |It can be avoided in modern code by using procedures, internal
    |procedures, procedure pointers and other features.

    I know no feature in Fortran or standard C which replaces my use of labels-as-values, the GNU C equivalent of the assigned goto. If you
    look at <https://www.complang.tuwien.ac.at/forth/threading/>, "direct"
    and "indirect" use labels-as-values, whereas "switch", "call" and
    "repl. switch" use standard C features (switch, indirect calls, and switch+goto respectively). "direct" and "indirect" usually outperform these others, sometimes by a lot.


    I usually used call threading, because:
    In my testing it was one of the faster options;
    At least if excluding 32-bit x86,
    which often has slow function calls.
    Because pretty much every function needs a stack frame, ...
    It is usable in standard C.

    I have converged on call-threading as a way to eliminate "if-statements":
    -----------------------
    extern uint64_t operation( uint64_t src1, uint64_t src2, uint8_t size );
    // uadd, sadd, umul, ..., or, xor, and share operation()'s signature (defined elsewhere)

    static uint64_t (*int2optab[32])( uint64_t src1, uint64_t src2, uint8_t size ) =
    {   // integer 2-operand decoding table
        /* 00 */ operation,
        /* 01 */ operation,
        /* 02 */ uadd,
        /* 03 */ sadd,
        /* 04 */ umul,
        /* 05 */ smul,
        /* 06 */ udiv,
        /* 07 */ sdiv,
        /* 10 */ cmp,
        /* 11 */ operation,
        /* 12 */ operation,
        /* 13 */ operation,
        /* 14 */ umax,
        /* 15 */ smax,
        /* 16 */ umin,
        /* 17 */ smin,
        /* 20 */ or,
        /* 21 */ operation,
        /* 22 */ xor,
        /* 23 */ operation,
        /* 24 */ and,
        /* 25 */ operation,
        /* 26 */ operation,
        /* 27 */ operation,
        /* 30 */ operation,
        /* 31 */ operation,
        /* 32 */ operation,
        /* 33 */ operation,
        /* 34 */ operation,
        /* 35 */ operation,
        /* 36 */ operation,
        /* 37 */ operation
    };

    /*
     * Integer 2-Operand Table Caller -- 16-bit immediate form
     */
    bool intimm16( coreStack *cpu, Context *c, Major I )
    {
        uint8_t  or = I.or;                    // routing field (unused in this form)
        uint64_t src1 = c->ctx.reg[ I.src1 ],
                 src2 = c->ctx.reg[ I.src2 ],
                 *dst = &c->ctx.reg[ I.dst ];
        *dst = int2optab[ (I.major&15)<<1 ]( src1, src2, 0 );
        return true;
    }

    /*
     * Integer 2-Operand Table Caller -- register/routed form
     */
    bool int2op( coreStack *cpu, Context *c, OpRoute I )
    {
        uint8_t  or = I.or,                    // operand-routing field
                 s  = I.size;
        uint64_t *src1 = &c->ctx.reg[ I.src1 ],
                 *src2 = &c->ctx.reg[ I.src2 ],
                 *dst  = &c->ctx.reg[ I.dst ];
        iorTable[ or ]( *c, I, src1, src2 );   // operand routing / constant insertion, defined elsewhere
        *dst = int2optab[ I.minor ]( *src1, *src2, s );
        return true;
    }
    -----------------------

    One does not have to check for unimplemented instructions, just place
    a call to the operation() subroutine where they are not defined. The operation() subroutine raises an exception which is caught at the
    next instruction fetch.

    I show both 16-bit immediates and general 2-Operand instructions use
    the same table (with a trifling of bit twiddling).

    Often "while loop and switch()" was notably slower than using unrolled
    lists of indirect function calls (usually with the main dispatch loop
    based on "traces", which would call each of the opcode functions and
    then return the next trace to be run).

    Table-calls are faster than many switches unless you can demonstrate
    the switch is dense and there are no missing cases.

    Granted, "while loop and switch" is the more traditional way of writing
    an interpreter.

    Just not a fast one...
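
    For reference, the trace-style dispatch described a few lines up (unrolled
    lists of indirect calls, each trace returning the next trace to run) might
    look roughly like this; a sketch with types and names of my choosing, not
    anyone's actual interpreter:

    #include <stddef.h>

    typedef struct Vm    Vm;        /* interpreter state, defined elsewhere */
    typedef struct Trace Trace;
    typedef void (*OpFn)(Vm *vm);

    struct Trace {
        size_t  n;                  /* number of ops in this trace          */
        OpFn    ops[16];            /* unrolled handlers, one per VM op     */
        Trace *(*next)(Vm *vm);     /* returns the successor trace to run   */
    };

    static void run(Vm *vm, Trace *t)
    {
        while (t) {
            for (size_t i = 0; i < t->n; i++)
                t->ops[i](vm);      /* indirect call per opcode             */
            t = t->next(vm);        /* the trace decides what runs next     */
        }
    }
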
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 21:06:16 2025
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

    Are we talking about hardware or software?

    A SW solution based on how it would be done in HW.
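
    In C, the two-table scheme above comes out as a pair of lookups. A
    minimal sketch, with the tables assumed precomputed from the IEEE
    754-2008 DPD mapping (table and function names are mine):

    #include <stdint.h>

    extern const uint16_t dpd_to_bcd[1024];   /* 10-bit declet -> 3 BCD digits (12 bits) */
    extern const uint16_t bcd_to_dpd[4096];   /* 3 BCD digits (12 bits) -> 10-bit declet */

    /* Expand the declet sitting at bit position 'pos' of a significand word. */
    static inline uint16_t declet_to_bcd(uint64_t sig, unsigned pos)
    {
        return dpd_to_bcd[(sig >> pos) & 0x3FF];
    }

    /* Pack three BCD digits back into a declet. */
    static inline uint16_t bcd_to_declet(uint16_t bcd3)
    {
        return bcd_to_dpd[bcd3 & 0xFFF];
    }
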
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 21:21:34 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Qupls2026 currently supports 48-bit inline constants. I am debating
    whether to support 89 and 130-bit inline constants as well. Constant
    sizes increase by 41-bits due to the 48-bit instruction word size. The larger constants would require more instruction words to be available to
    be processed in decode. Not sure if it is even possible to pass a
    constant larger than 64-bits in the machine.

    I just realized that constant operand routing was already in Qupls, I
    had just not specifically identified it. The operand routing bits are
    just moved into a postfix instruction word rather than the first
    instruction word. This gives more bits available in the instruction
    word. Rather than burn a couple of bits in every R3 type instruction, another couple of opcodes are used to represent constant extensions.

    My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
    that can supply constants and perform operand routing. Within this
    range, instruction<8:5> specifies the following table:

    0 0 0 0 +Src1 +Src2
    0 0 0 1 +Src1 -Src2
    0 0 1 0 -Src1 +Src2
    0 0 1 1 -Src1 -Src2
    0 1 0 0 +Src1 +imm5
    0 1 0 1 +Imm5 +Src2
    0 1 1 0 -Src1 -Imm5
    0 1 1 1 +Imm5 -Src2
    1 0 0 0 +Src1 Imm32
    1 0 0 1 Imm32 +Src2
    1 0 1 0 -Src1 Imm32
    1 0 1 1 Imm32 -Src2
    1 1 0 0 +Src1 Imm64
    1 1 0 1 Imm64 +Src2
    1 1 1 0 -Src1 Imm64
    1 1 1 1 Imm64 -Src2

    Here we have access to {5, 32, 64}-bit constants, 16-bit constants
    come from different OpCodes.

    Imm5 are the register specifier bits: range {-31..31} for integer and
    logical, range {-15.5..15.5} for floating point.
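
    A sketch of how a decoder might consume that table in C; the 16 rows are
    transcribed from the listing above, while the type and field names are
    mine, not My 66000's:

    #include <stdint.h>

    typedef enum { SRC1, SRC2, IMM5, IMM32, IMM64 } OpndKind;

    typedef struct {
        OpndKind op1; int neg1;     /* operand-1 source, 1 = negated */
        OpndKind op2; int neg2;     /* operand-2 source, 1 = negated */
    } Route;

    static const Route routing[16] = {
        { SRC1,0,  SRC2,0  }, { SRC1,0,  SRC2,1  },   /* 0000 0001 */
        { SRC1,1,  SRC2,0  }, { SRC1,1,  SRC2,1  },   /* 0010 0011 */
        { SRC1,0,  IMM5,0  }, { IMM5,0,  SRC2,0  },   /* 0100 0101 */
        { SRC1,1,  IMM5,1  }, { IMM5,0,  SRC2,1  },   /* 0110 0111 */
        { SRC1,0,  IMM32,0 }, { IMM32,0, SRC2,0  },   /* 1000 1001 */
        { SRC1,1,  IMM32,0 }, { IMM32,0, SRC2,1  },   /* 1010 1011 */
        { SRC1,0,  IMM64,0 }, { IMM64,0, SRC2,0  },   /* 1100 1101 */
        { SRC1,1,  IMM64,0 }, { IMM64,0, SRC2,1  },   /* 1110 1111 */
    };

    static Route decode_routing(uint32_t inst)
    {
        return routing[(inst >> 5) & 0xF];            /* instruction<8:5> */
    }
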
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 21:24:07 2025
    From Newsgroup: comp.arch


    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-05 7:17, Anton Ertl wrote:

    [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the 6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition. Ritchie wrote <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and
    not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value" statement where the label-value refers to a label in a different function. Does
    gcc prevent that at compile time?

    This is where the call-table approach works better--the scope is well
    defined.

    If not, I would expect the semantics
    to be Undefined Behavior, the usual cop-out when nothing useful can be said.

    (In an earlier discussion on this group, some years ago, I explained how labels-as-values could be added to Ada, using the type system to ensure
    safe and defined semantics. But I don't think such an extension would be accepted for the Ada standard.)

    Niklas
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 5 21:28:16 2025
    From Newsgroup: comp.arch


    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-05 18:23, BGB wrote:
    On 11/5/2025 9:26 AM, Niklas Holsti wrote:
    On 2025-11-05 7:17, Anton Ertl wrote:

        [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it?  C had it up to and including the 6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition.  Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and
    not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value"
    statement where the label-value refers to a label in a different
    function. Does gcc prevent that at compile time? If not, I would
    expect the semantics to be Undefined Behavior, the usual cop-out when
    nothing useful can be said.

    (In an earlier discussion on this group, some years ago, I explained
    how labels-as-values could be added to Ada, using the type system to
    ensure safe and defined semantics. But I don't think such an extension
    would be accepted for the Ada standard.)


    My guess here:
    It is an "oh crap" situation and program either immediately or (maybe
    not as immediately) explodes...

    Or silently produces wrong results.

    Otherwise, it would need to function more like a longjmp, which would
    mean that it would likely be painfully slow.

    But then you could get the problem of a longjmp to a setjmp value that
    is stale because the targeted function invocation (stack frame) is no
    longer there.

    But YOU had to pass the jumpbuf out of the setjump() scope.

    Now, YOU complain there is a hole in your own foot with a smoking gun
    in your own hand.

    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Thu Nov 6 00:45:19 2025
    From Newsgroup: comp.arch

    On 2025-11-05 23:28, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-05 18:23, BGB wrote:
    On 11/5/2025 9:26 AM, Niklas Holsti wrote:
    On 2025-11-05 7:17, Anton Ertl wrote:

        [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it?  C had it up to and including the 6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition.  Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value"
    statement where the label-value refers to a label in a different
    function. Does gcc prevent that at compile time? If not, I would
    expect the semantics to be Undefined Behavior, the usual cop-out when
    nothing useful can be said.

    (In an earlier discussion on this group, some years ago, I explained
    how labels-as-values could be added to Ada, using the type system to
    ensure safe and defined semantics. But I don't think such an extension would be accepted for the Ada standard.)


    My guess here:
    It is an "oh crap" situation and program either immediately or (maybe
    not as immediately) explodes...

    Or silently produces wrong results.

    Otherwise, it would need to function more like a longjmp, which would
    mean that it would likely be painfully slow.

    But then you could get the problem of a longjmp to a setjmp value that
    is stale because the targeted function invocation (stack frame) is no
    longer there.

    But YOU had to pass the jumpbuf out of the setjump() scope.

    Now, YOU complain there is a hole in your own foot with a smoking gun
    in your own hand.

    That is not the issue. The question is if the semantics of "goto label-valued-variable" are hard to define, as Ritchie said, or not, as
    Anton thinks Stallman said or would have said.

    The discussion above shows that whether a label value is implemented as
    a bare code address, or as a jumpbuf, some cases will have Undefined
    Behavior semantics. So I think Ritchie was right, unless the undefined
    cases can be excluded at compile time.

    The undefined cases could be excluded at compile-time, even in C, by
    requiring all label-valued variables to be local to some function and forbidding passing such values as parameters or function results. In
    addition, the use of an uninitialized label-valued variable should be prevented or detected. Perhaps Anton could accept such restrictions.

    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 20:41:18 2025
    From Newsgroup: comp.arch

    On 2025-11-05 3:52 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-05 1:47 a.m., Robert Finch wrote:
    -----------
    I am now modifying Qupls2024 into Qupls2026 rather than starting a
    completely new ISA. The big difference is Qupls2024 uses 64-bit
    instructions and Qupls2026 uses 48-bit instructions making the code 25%
    more compact with no real loss of operations.

    Qupls2024 also used 8-bit register specs. This was a bit of overkill and
    not really needed. Register specs are reduced to 6-bits. Right-away that
    reduced most instructions eight bits.

    4 register specifiers: check.

    I decided I liked the dual operations that some instructions supported,
    which need a wide instruction format.

    With 48-bits, if you can get 2 instructions 50% of the time, you are only
    12% bigger than a 32-bit ISA.

    One gotcha is that 64-bit constant overrides need to be modified. For
    Qupls2024 a 64-bit constant override could be specified using only a
    single additional instruction word. This is not possible with 48-bit
    instruction words. Qupls2024 only allowed a single additional constant
    word. I may maintain this for Qupls2026, but that means that a max
    constant override of 48-bits would be supported. A 64-bit constant can
    still be built up in a register using the add-immediate with shift
    instruction. It is ugly and takes about three instructions.

    It was that sticking problem of constants that drove most of My 66000
    ISA style--variable length and how to encode access to these constants
    and routing thereof.

    Motto: never execute any instructions fetching or building constants.

    I could reduce the 64-bit constant build to two instructions by adding a
    load-immediate instruction.

    May I humbly suggest this is the wrong direction.

    agree.

    Taking heed of the motto, I have
    scrapped a bunch of shifted immediate instructions and load immediate.
    These were present as an alternate means to work with large constants.
    They were really redundant with the ability to specify constant
    overrides (routing) for registers, and they would increase the dynamic instruction count (bad!). Scrapping the extra instructions will also make writing a compiler simpler.

    One instruction scrapped was an add to IP. So, another means of forming relative addresses was required. I am sacrificing a register code (code 32)
    to represent the instruction pointer. This will allow the easy formation
    of IP-relative addresses.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 5 21:49:19 2025
    From Newsgroup: comp.arch

    On 2025-11-05 4:21 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Qupls2026 currently supports 48-bit inline constants. I am debating
    whether to support 89 and 130-bit inline constants as well. Constant
    sizes increase by 41-bits due to the 48-bit instruction word size. The
    larger constants would require more instruction words to be available to
    be processed in decode. Not sure if it is even possible to pass a
    constant larger than 64-bits in the machine.

    I just realized that constant operand routing was already in Qupls, I
    had just not specifically identified it. The operand routing bits are
    just moved into a postfix instruction word rather than the first
    instruction word. This gives more bits available in the instruction
    word. Rather than burn a couple of bits in every R3 type instruction,
    another couple of opcodes are used to represent constant extensions.

    My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
    that can supply constants and perform operand routing. Within this
    range; instruction<8:5> specify the following table:

    0 0 0 0 +Src1 +Src2
    0 0 0 1 +Src1 -Src2
    0 0 1 0 -Src1 +Src2
    0 0 1 1 -Src1 -Src2
    0 1 0 0 +Src1 +imm5
    0 1 0 1 +Imm5 +Src2
    0 1 1 0 -Src1 -Imm5
    0 1 1 1 +Imm5 -Src2
    1 0 0 0 +Src1 Imm32
    1 0 0 1 Imm32 +Src2
    1 0 1 0 -Src1 Imm32
    1 0 1 1 Imm32 -Src2
    1 1 0 0 +Src1 Imm64
    1 1 0 1 Imm64 +Src2
    1 1 1 0 -Src1 Imm64
    1 1 1 1 Imm64 -Src2

    What happens if one tries to use an unsupported combination?

    Here we have access to {5, 32, 64}-bit constants, 16-bit constants
    come from different OpCodes.

    Imm5 are the register specifier bits: range {-31..31} for integer and logical, range {-15.5..15.5} for floating point.

    I just realized that Qupls2026 does not accommodate small constants very
    well except for a few instructions like shift and bitfield instructions
    which have special formats. Sure, constants can be made to override
    register specs, but they take up a whole additional word. I am not sure
    how big a deal this is as there are also immediate forms of instructions
    with the constant encoded in the instruction, but these do not allow
    operand routing. There is a dedicated subtract from immediate
    instruction. A lot of other instructions are commutative, so operand
    routing is not needed.

    Qupls has potentially 25, 48, 89 and 130-bit constants. 7-bit constants
    are available for shifts and bitfield ops. Leaving the 130-bit constants
    out for now. They may be useful for 128-bit SIMD against constant operands.

    The constant routing issue could maybe be fixed as there are 30+ free
    opcodes still. But there needs to be more routing bits with three source operands. All the permutations may get complicated to encode and allow
    for in the compiler. May want to permute two registers and a constant,
    or two constants and a register, and then three or four different sizes.

    Qupls strives to be the low-cost processor.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Nov 5 19:20:57 2025
    From Newsgroup: comp.arch

    On 11/5/2025 1:21 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Qupls2026 currently supports 48-bit inline constants. I am debating
    whether to support 89 and 130-bit inline constants as well. Constant
    sizes increase by 41-bits due to the 48-bit instruction word size. The
    larger constants would require more instruction words to be available to
    be processed in decode. Not sure if it is even possible to pass a
    constant larger than 64-bits in the machine.

    I just realized that constant operand routing was already in Qupls, I
    had just not specifically identified it. The operand routing bits are
    just moved into a postfix instruction word rather than the first
    instruction word. This gives more bits available in the instruction
    word. Rather than burn a couple of bits in every R3 type instruction,
    another couple of opcodes are used to represent constant extensions.

    My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
    that can supply constants and perform operand routing. Within this
    range; instruction<8:5> specify the following table:

    0 0 0 0 +Src1 +Src2
    0 0 0 1 +Src1 -Src2
    0 0 1 0 -Src1 +Src2
    0 0 1 1 -Src1 -Src2
    0 1 0 0 +Src1 +imm5
    0 1 0 1 +Imm5 +Src2
    0 1 1 0 -Src1 -Imm5
    0 1 1 1 +Imm5 -Src2
    1 0 0 0 +Src1 Imm32
    1 0 0 1 Imm32 +Src2
    1 0 1 0 -Src1 Imm32
    1 0 1 1 Imm32 -Src2
    1 1 0 0 +Src1 Imm64
    1 1 0 1 Imm64 +Src2
    1 1 1 0 -Src1 Imm64
    1 1 1 1 Imm64 -Src2

    Here we have access to {5, 32, 64}-bit constants, 16-bit constants
    come from different OpCodes.

    Imm5 are the register specifier bits: range {-31..31} for integer and logical, range {-15.5..15.5} for floating point.

    Some time ago, we discussed using the 5 bit immediates in floating point instructions as an index to an internal ROM with frequently used
    constants. The idea is that it would save some space in the instruction stream. Are you implementing that, and if not, why not?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Nov 6 11:24:24 2025
    From Newsgroup: comp.arch

    On Wed, 05 Nov 2025 21:06:16 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

    Are talking about hardware or software?

    A SW solution based on how it would be done in HW.

    Then, I suspect that you didn't understand the objection of Thomas Koenig.

    1. Format of interest is Decimal128. https://en.wikipedia.org/wiki/Decimal128_floating-point_format

    2. According to my understanding, Thomas didn't suggest that *slow*
    software implementation of DPD-encoded DFP, i.e. implementation that
    only cares about correctness, is hard.

    3. OTOH, he seems to suspect, and I agree with him, that a *non-slow*
    software implementation, one comparable in speed (say, within a
    factor of 1.5-2) to a competent implementation of the same DFP operations
    in BID format, is not easy. If at all possible.

    4. All said above assumes an absence of HW assists.



    BTW, at least for multiplication, I would probably not do my
    arithmetic in the BCD domain.
    Instead, I'd convert 10+ DPD digits to two Base_1e18 digits (11 look
    ups per operand, 22 total look ups + ~40 shifts + ~20 ANDs + ~20
    additions).

    Then I'd do multiplication and normalization and rounding in Base_1e18.

    Then I'd convert from Base_1e18 to Base_1000. The ideas of such
    conversion are similar to fast binary-to-BCD conversion that I
    demonstrated here a decade or so ago. AVX2 could be quite helpful at that
    stage.

    Then I'd have to convert the result from Base_1000 to DPD. Here, again,
    11 table look-ups + plenty of ANDs/shift/ORs seem inevitable.
    Maybe, at that stage, SIMD gather can be of help, but I have my doubts.
    So far, every time I tried gather I was disappointed with performance.

    Overall, even with a seemingly decent plan like the one sketched above, I'd expect
    DPD multiplication to be 2.5x to 3x slower than BID. But, then again,
    in the past my early performance estimates were wrong quite often.
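
    A sketch of the DPD-to-Base_1e18 step in C, assuming the 11 declets have
    already been extracted from the Decimal128 encoding and that a
    declet-to-binary table (name mine) has been precomputed; this is the 11
    look-ups per operand mentioned above:

    #include <stdint.h>

    extern const uint16_t dpd_to_bin[1024];  /* 10-bit declet -> 0..999 */

    /* 34 significand digits -> two base-1e18 limbs: hi gets the leading
       digit plus the first 5 declets (16 digits), lo the last 6 declets
       (18 digits), so value = hi * 10^18 + lo. */
    void dpd_sig_to_limbs(unsigned msd, const uint16_t declets[11],
                          uint64_t *hi, uint64_t *lo)
    {
        uint64_t h = msd, l = 0;
        for (int i = 0; i < 5; i++)
            h = h * 1000 + dpd_to_bin[declets[i]];
        for (int i = 5; i < 11; i++)
            l = l * 1000 + dpd_to_bin[declets[i]];
        *hi = h;
        *lo = l;
    }

    The multiplication, normalization and rounding would then work on the
    (hi, lo) pairs, with the reverse conversions producing declets again at
    the end.
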






    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 6 08:46:40 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/4/2025 9:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it
    <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this
    feature?

    Because it could, and often did, make the code "unfollowable". That is,
    you are reading the code, following it to try to figure out what it is
    doing and come to an assigned/alter goto, and you don't know where to go next. The value was set some place else in the code, who knows where,
    and thus what value it was set to, and people/programmers just aren't
    used to being able to follow code like that.

    Take an example use: A VM interpreter. With labels-as-values it looks
    like this:

    void engine(char *source)
    {
      void *insts[] = {&&add, &&load, &&store, ...};

      void **ip=compile_to_vm_code(source,insts);

      goto *ip++;

     add:
      ...
      goto *ip++;
     load:
      ...
      goto *ip++;
     store:
      ...
      goto *ip++;
      ...
    }

    So of course you don't know where one of the gotos goes to, because
    that depends on the VM code, which depends on the source code.

    Now let's see how it looks with switch:

    void engine(char *source)
    {
      typedef enum {add, load, store,...} inst;
      inst *ip=compile_to_vm_code(source);  /* VM code is an array of inst values here */

      for (;;) {
        switch (*ip++) {
        case add:
          ...
          break;
        case load:
          ...
          break;
        case store:
          ...
          break;
        ...
        }
      }
    }

    Do you know any better which of the "..." is executed next? Of course
    not, for the same reason. Likewise for call threading, but there the
    VM instruction implementations can be distributed across many source
    files. With the replicated switch, the problem of predictability is
    the same, but there is lots of extra code, with many direct gotos.

    If you implement, say, a state machine using labels-as-values, or
    switch, again, the logic behind it is the same and the predictability
    is the same between the two implementations.

    BTW, you mentioned that it could be implemented as an indirect jump. It could for those architectures that supported that feature, but it could
    also be implemented by having the Alter/Assign modify the code (i.e.
    change the address in the jump/branch instruction), and self modifying
    code is just bad.

    On such architectures switch would also be implemented by modifying
    the code, and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone architectures using self-modifying code are bad by association, then
    we have to get rid of all of these language features ASAP.

    One interesting aspect here is that the Fortran assigned goto and GNU
    C's goto * (to go with labels-as-values) look more like something that
    may have been inspired by a modern indirect branch than by
    self-modifying code. I only dimly remember the Cobol thing, but IIRC
    this looked more like something that's intended to be implemented by self-modifying code. I don't know what the PL/I solution looked like.

    As did COBOL (called "go to ... depending on"), but those features didn't suffer
    the problems of assigned/alter gotos.

    As demonstrated above, they do. And if you fall back to using ifs, it
    does not get any better, either.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Nov 6 11:43:57 2025
    From Newsgroup: comp.arch

    On Wed, 5 Nov 2025 17:26:44 +0200
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2025-11-05 7:17, Anton Ertl wrote:

    [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the
    6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away between 6th and 7th edition. Ritchie wrote <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics
    and not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value"
    statement where the label-value refers to a label in a different
    function. Does gcc prevent that at compile time? If not, I would
    expect the semantics to be Undefined Behavior, the usual cop-out when
    nothing useful can be said.

    Yes, UB sounds like the best answer. Inter-procedural assigned goto is
    no different from an out-of-bounds array access or from an attempt to use a
    pointer to a local variable when the block/function that originally
    declared the variable is no longer active.
    But the compiler should try to detect as many cases of such misuse as it can.


    (In an earlier discussion on this group, some years ago, I explained
    how labels-as-values could be added to Ada, using the type system to
    ensure safe and defined semantics. But I don't think such an
    extension would be accepted for the Ada standard.)

    Niklas


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Thu Nov 6 12:11:54 2025
    From Newsgroup: comp.arch

    On 2025-11-06 11:43, Michael S wrote:
    On Wed, 5 Nov 2025 17:26:44 +0200
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2025-11-05 7:17, Anton Ertl wrote:

    [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the
    6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition. Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics
    and not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value"
    statement where the label-value refers to a label in a different
    function. Does gcc prevent that at compile time? If not, I would
    expect the semantics to be Undefined Behavior, the usual cop-out when
    nothing useful can be said.

    Yes, UB sounds like the best answer.

    The point is that Ritchie was not satisfied with that answer, which is
    why he removed labels-as-values from his version of C. I doubt that
    Stallman had any better answer for gcc, but he did not care.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Thu Nov 6 12:37:16 2025
    From Newsgroup: comp.arch

    On 2025-11-06 10:46, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/4/2025 9:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it
    <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this
    feature?

    Because it could, and often did, make the code "unfollowable". That is,
    you are reading the code, following it to try to figure out what it is
    doing and come to an assigned/alter goto, and you don't know where to go
    next. The value was set some place else in the code, who knows where,
    and thus what value it was set to, and people/programmers just aren't
    used to being able to follow code like that.

    Take an example use: A VM interpreter. With labels-as-values it looks
    like this:

    void engine(char *source)
    {
      void *insts[] = {&&add, &&load, &&store, ...};

      void **ip=compile_to_vm_code(source,insts);

      goto *ip++;

     add:
      ...
      goto *ip++;
     load:
      ...
      goto *ip++;
     store:
      ...
      goto *ip++;
      ...
    }

    So of course you don't know where one of the gotos goes to, because
    that depends on the VM code, which depends on the source code.

    I'm not sure if you are trolling or serious, but I will assume the latter.

    The point is that without a deep analysis of the program you cannot be
    sure that these goto's actually go to one of the labels in the engine() function, and not to some other location in the code, perhaps in some
    other function. That analysis would have to discover that the compile_to_vm_code() function returns a pointer to a vector of addresses picked from the insts[] vector. That could need an analysis of many
    functions called from compile_to_vm_code(), the history of the whole
    program execution, and so on. NOT easy.

    Now let's see how it looks with switch:

    void engine(char *source)
    {
      typedef enum {add, load, store,...} inst;
      inst *ip=compile_to_vm_code(source);  /* VM code is an array of inst values here */

      for (;;) {
        switch (*ip++) {
        case add:
          ...
          break;
        case load:
          ...
          break;
        case store:
          ...
          break;
        ...
        }
      }
    }

    Do you know any better which of the "..." is executed next?

    You know, without any deep analysis or understanding, that the execution
    goes to one of the cases in the switch, and /not/ into the wild blue yonder.

    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Nov 6 13:14:55 2025
    From Newsgroup: comp.arch

    On Thu, 6 Nov 2025 12:11:54 +0200
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2025-11-06 11:43, Michael S wrote:
    On Wed, 5 Nov 2025 17:26:44 +0200
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2025-11-05 7:17, Anton Ertl wrote:

    [ snip ]

    Yes, assigned goto and labels-as-values (and probably the Cobol
    alter/goto and PL/1 label variables) are there because computer
    architectures have indirect branches and the programming language
    designer wanted to give the programmers a way to express what they
    would otherwise have to express in assembly language.

    Why does standard C not have it? C had it up to and including the
    6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went
    away between 6th and 7th edition. Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.

    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of
    the) label to which the value refers", which is machine-level
    semantics and not semantics in the abstract C machine.

    The problem in the abstract C machine is a "goto label-value"
    statement where the label-value refers to a label in a different
    function. Does gcc prevent that at compile time? If not, I would
    expect the semantics to be Undefined Behavior, the usual cop-out
    when nothing useful can be said.

    Yes, UB sounnds as the best answer..

    The point is that Ritchie was not satisfied with that answer, which
    is why he removed labels-as-values from his version of C. I doubt
    that Stallman had any better answer for gcc, but he did not care.


    I suspect that the reason was different: DMR had no satisfying answer
    even for some of the intra-procedural cases.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Thu Nov 6 07:44:38 2025
    From Newsgroup: comp.arch

    Taking direction from the VAX’s AOBLEQ/AOBLSS (add one and branch) instructions
    and the DBcc instruction of the 68k, the Qupls Rs1 register of a compare-and-branch instruction may be incremented or decremented. This
    is really a form of instruction fusion, folding the op performed on the branch
    register into the branch instruction.

    I was thinking of modifying this to support additional ops and constant values. Why just add, if one can shift right or XOR as well? It may be
    useful to increment by a structure size. Also, a ring counter might be
    handy which could be implemented as a right shift. This could be
    supported by adding a postfix word to the branch instruction. It would
    make the instruction wider but it would not increase the dynamic
    instruction count.

    Not sure about the syntax to use for coding such instructions.

    BEQ Rs1,Rs2,label:ADD Rs1,256
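
    A minimal C sketch (mine, not Qupls output) of the loop shape such a
    fused compare-and-branch targets: the pointer bump by a structure size,
    the compare against a limit register, and the branch would collapse into
    a single BNE Rs1,Rs2,loop:ADD Rs1,sizeof(struct rec) style instruction.

    struct rec { int key; int val; };

    int sum_vals(const struct rec *p, const struct rec *end)
    {
        int s = 0;
        while (p != end) {   /* compare-and-branch: Rs1 = p, Rs2 = end */
            s += p->val;
            p++;             /* the ADD folded into the branch         */
        }
        return s;
    }
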


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Nov 6 07:57:23 2025
    From Newsgroup: comp.arch

    On 11/6/2025 12:46 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/4/2025 9:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it
    <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this
    feature?

    Because it could, and often did, make the code "unfollowable". That is,
    you are reading the code, following it to try to figure out what it is
    doing and come to an assigned/alter goto, and you don't know where to go
    next. The value was set some place else in the code, who knows where,
    and thus what value it was set to, and people/programmers just aren't
    used to being able to follow code like that.

    Take an example use: A VM interpreter. With labels-as-values it looks
    like this:

    void engine(char *source)
    {
      void *insts[] = {&&add, &&load, &&store, ...};

      void **ip=compile_to_vm_code(source,insts);

      goto *ip++;

     add:
      ...
      goto *ip++;
     load:
      ...
      goto *ip++;
     store:
      ...
      goto *ip++;
      ...
    }

    So of course you don't know where one of the gotos goes to, because
    that depends on the VM code, which depends on the source code.

    Now let's see how it looks with switch:

    void engine(char *source)
    {
      typedef enum {add, load, store,...} inst;
      inst *ip=compile_to_vm_code(source);  /* VM code is an array of inst values here */

      for (;;) {
        switch (*ip++) {
        case add:
          ...
          break;
        case load:
          ...
          break;
        case store:
          ...
          break;
        ...
        }
      }
    }

    Do you know any better which of the "..." is executed next? Of course
    not, for the same reason. Likewise for call threading, but there the
    VM instruction implementations can be distributed across many source
    files. With the replicated switch, the problem of predictability is
    the same, but there is lots of extra code, with many direct gotos.

    If you implement, say, a state machine using labels-as-values, or
    switch, again, the logic behind it is the same and the predictability
    is the same between the two implementations.

    Nick responded better than I could to this argument, demonstrating how
    it isn't true. As I said, in the hands of a good programmer, you might
    assume that the goto goes to one of those labels, but you can't be sure
    of it.


    BTW, you mentioned that it could be implemented as an indirect jump. It
    could for those architectures that supported that feature, but it could
    also be implemented by having the Alter/Assign modify the code (i.e.
    change the address in the jump/branch instruction), and self modifying
    code is just bad.

    On such architectures switch would also be implemented by modifying
    the code,

    I don't think so. Switch can be, and I understand usually is, implemented
    via an index into a jump table. No self-modifying code required.


    and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone architectures using self-modifying code are bad by association, then
    we have to get rid of all of these language features ASAP.

    And, by and large they have. BTW, I can accept the argument for keeping
    it in C on the grounds that C is "lower level" than, say, Fortran, COBOL
    or PL/1, and people using it are used to the language allowing "risky" constructs.


    One interesting aspect here is that the Fortran assigned goto and GNU
    C's goto * (to go with labels-as-values) look more like something that
    may have been inspired by a modern indirect branch than by
    self-modifying code.

    Well, the Fortran feature was designed in what, the late 1950s? Back
    then, self modifying code wasn't considered as bad as it now is.


    I only dimly remember the Cobol thing, but IIRC
    this looked more like something that's intended to be implemented by self-modifying code. I don't know what the PL/I solution looked like.

    As did COBOL (called "go to ... depending on"), but those features didn't suffer
    the problems of assigned/alter gotos.

    As demonstrated above, they do.

    No, they are implemented as an indexed jump table.


    And if you fall back to using ifs, it
    does not get any better, either.

    - anton
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Nov 6 17:44:32 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    I played around with the formulas from the POWER manual a bit,
    using Berkeley abc for logic optimization, for the conversion
    of the packed modulo 1000 to three BCD digits.

    Without spending too much effort, I arrived at four gate delays
    (INV -> OAI21 -> NAND2 -> NAND2) with a total of 37 gates optimizing
    for speed, or five gate delays optimizing for space.

    Since the gates hang off flip-flops, you don't need the inv gate
    at the front. Flip-flops can easily give both true and complement
    outputs.

    Agreed. Unfortunately, I have a hard time (i.e. "have not managed")
    convincing abc that both signals are available, and asserting that
    exactly one of them is 1 at any given time, without completely
    blowing up the optimization routines. It also does not handle
    external don't cares. But as I use it purely to play around with
    things, that is not too bad :-)
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 6 17:52:32 2025
    From Newsgroup: comp.arch

    Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
    On 2025-11-05 7:17, Anton Ertl wrote:
    Stallman obviously knew what to say about their semantics when he
    added labels-as-values to GNU C with gcc 2.0.


    I don't know what Stallman said, or would have said if asked, but I
    guess something like "the semantics is a jump to the (address of the)
    label to which the value refers", which is machine-level semantics and
    not semantics in the abstract C machine.

    You can look at his specification in the documentation of, say, 7th
    edition Unix (where Ritchie apparently took the effort to document
    semantics), and see how he specified that. I doubt he specified
    "semantics in the abstract C machine", but I expect that he specified
    semantics at the C level.

    Concerning how Stallman documented it, you can look at the gcc
    documentation from 2.0 until Stallman passed maintainership on
    (gcc-2.7?).

    If you look at the current documentation <https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html>, it talks
    about the "address of a label" and "jump to one", which you might
    consider to be a machine-level description. You can also describe
    this at a C source level or "C abstract machine" level, but I don't
    expect the description to become any clearer.

    The problem in the abstract C machine is a "goto label-value" statement where the label-value refers to a label in a different function. Does
    gcc prevent that at compile time? If not, I would expect the semantics
    to be Undefined Behavior, the usual cop-out when nothing useful can be said.

    The gcc documentation says:

    |You may not use this mechanism to jump to code in a different
    |function. If you do that, totally unpredictable things happen.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 6 18:14:54 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Where this might be a problem is if the label variable was a
    global symbol and the target labels were in other name spaces.
    At that point it could treat it like a pointer to a function and
    have to spill all live register variables to memory.

    Does the assigned goto support that? What about regular goto and
    computed goto?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 18:28:19 2025
    From Newsgroup: comp.arch


    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-05 23:28, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
    ----------------
    But then you could get the problem of a longjmp to a setjmp value that
    is stale because the targeted function invocation (stack frame) is no
    longer there.

    But YOU had to pass the jumpbuf out of the setjump() scope.

    Now, YOU complain there is a hole in your own foot with a smoking gun
    in your own hand.

    That is not the issue. The question is if the semantics of "goto label-valued-variable" are hard to define, as Ritchie said, or not, as
    Anton thinks Stallman said or would have said.

    So, label-variables are hard to define, but function-variables are not ?!?

    The discussion above shows that whether a label value is implemented as
    a bare code address, or as a jumpbuf, some cases will have Undefined Behavior semantics. So I think Ritchie was right, unless the undefined
    cases can be excluded at compile time.

    The undefined cases could be excluded at compile-time, even in C, by requiring all label-valued variables to be local to some function and forbidding passing such values as parameters or function results. In addition, the use of an uninitialized label-valued variable should be prevented or detected. Perhaps Anton could accept such restrictions.

    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 6 18:17:31 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    So does gfortran support assigned goto, too?

    Yes.

    Cool.

    What problems in
    interaction with other features do you see?

    In this case, it is more the problem of modern architectures.
    On 32-bit architectures, it might have been possible to stash
    the address of a jump target in an actual INTEGER variable and
    GO TO there. On a 64-bit architecture, this is not possible, so
    you need to have a shadow variable for the pointer

    Implementation options that come to my mind are:

    1) Have the code in the bottom 4GB (or maybe 2GB), and a 32-bit
    variable is sufficient. AFAIK on some 64-bit architectures the
    default memory model puts the code in the bottom 4GB or 2GB.

    2) Put the offset from the start of the function or compilation unit
    (whatever scope the assigned goto can be used in) in the 32-bit
    variable. 32 bits should be enough for that. Of course, if Fortran
    assigns labels between shared libraries and the main program, that
    approach probably does not work, but does anybody really do that?
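
    For option 2, a GNU C sketch using label-difference arithmetic (a GNU
    extension) may make the idea concrete; the function and label names are
    made up here, and it assumes all offsets fit in 32 bits:

    #include <stdint.h>

    void f(int which)
    {
        int32_t target;                      /* the "INTEGER variable" */

        /* ASSIGN: store an offset from a base label, not a full pointer */
        target = (int32_t)((which ? &&l20 : &&l10) - &&base);

        /* assigned GOTO: rebuild the address and jump */
        goto *(&&base + target);

    base:
    l10:
        /* ... */
        return;
    l20:
        /* ... */
        return;
    }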

    How does ifort deal with this problem?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 18:36:33 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-05 4:21 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Qupls2026 currently supports 48-bit inline constants. I am debating
    whether to support 89 and 130-bit inline constants as well. Constant
    sizes increase by 41-bits due to the 48-bit instruction word size. The
    larger constants would require more instruction words to be available to be processed in decode. Not sure if it is even possible to pass a
    constant larger than 64-bits in the machine.

    I just realized that constant operand routing was already in Qupls, I
    had just not specifically identified it. The operand routing bits are
    just moved into a postfix instruction word rather than the first
    instruction word. This gives more bits available in the instruction
    word. Rather than burn a couple of bits in every R3 type instruction,
    another couple of opcodes are used to represent constant extensions.

    My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
    that can supply constants and perform operand routing. Within this
    range; instruction<8:5> specify the following table:

    0 0 0 0 +Src1 +Src2
    0 0 0 1 +Src1 -Src2
    0 0 1 0 -Src1 +Src2
    0 0 1 1 -Src1 -Src2
    0 1 0 0 +Src1 +imm5
    0 1 0 1 +Imm5 +Src2
    0 1 1 0 -Src1 -Imm5
    0 1 1 1 +Imm5 -Src2
    1 0 0 0 +Src1 Imm32
    1 0 0 1 Imm32 +Src2
    1 0 1 0 -Src1 Imm32
    1 0 1 1 Imm32 -Src2
    1 1 0 0 +Src1 Imm64
    1 1 0 1 Imm64 +Src2
    1 1 1 0 -Src1 Imm64
    1 1 1 1 Imm64 -Src2

    What happens if one tries to use an unsupported combination?

    For 2-operands and 3-operand instructions, they are all present.
    For 1-Operand instructions, only the ones targeting Src2 are
    available and if you use one not allowed you take an OPERATION
    exception.

    Here we have access to {5, 32, 64}-bit constants, 16-bit constants
    come from different OpCodes.

    Imm5 are the register specifier bits: range {-31..31} for integer and logical, range {-15.5..15.5} for floating point.

    I just realized that Qupls2026 does not accommodate small constants very well except for a few instructions like shift and bitfield instructions which have special formats. Sure, constants can be made to override
    register specs, but they take up a whole additional word. I am not sure
    how big a deal this is as there are also immediate forms of instructions with the constant encoded in the instruction, but these do not allow
    operand routing. There is a dedicated subtract from immediate
    instruction. A lot of other instructions are commutative, so operand
    routing is not needed.

    1<<const // performed at compile time
    1<<var // 1-instruction {1-word in My 66000}

    17/var // 1-instruction {1-word}

    You might notice My 66000 does not even HAVE a SUB instruction,
    instead:

    ADD Rd,Rs1,-Rs2

    Qupls has potentially 25, 48, 89 and 130-bit constants. 7-bit constants
    are available for shifts and bitfield ops. Leaving the 130-bit constants
    out for now. They may be useful for 128-bit SIMD against constant operands.

    The constant routing issue could maybe be fixed as there are 30+ free opcodes still. But there needs to be more routing bits with three source operands. All the permutations may get complicated to encode and allow
    for in the compiler. May want to permute two registers and a constant,
    or two constants and a register, and then three or four different sizes.

    Out of the 64-slot Major OpCode space, 23 slots are left over, 6 reserved
    in perpetuity to catch random jumps into integer or fp data.

    Qupls strives to be the low-cost processor.

    My 66000 strives to be the low-instruction-count processor.

    But remember, ISA is only the first 1/3rd of an architecture.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 18:39:55 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/5/2025 1:21 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Qupls2026 currently supports 48-bit inline constants. I am debating
    whether to support 89 and 130-bit inline constants as well. Constant
    sizes increase by 41-bits due to the 48-bit instruction word size. The
    larger constants would require more instruction words to be available to be processed in decode. Not sure if it is even possible to pass a
    constant larger than 64-bits in the machine.

    I just realized that constant operand routing was already in Qupls, I
    had just not specifically identified it. The operand routing bits are
    just moved into a postfix instruction word rather than the first
    instruction word. This gives more bits available in the instruction
    word. Rather than burn a couple of bits in every R3 type instruction,
    another couple of opcodes are used to represent constant extensions.

    My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
    that can supply constants and perform operand routing. Within this
    range; instruction<8:5> specify the following table:

    0 0 0 0 +Src1 +Src2
    0 0 0 1 +Src1 -Src2
    0 0 1 0 -Src1 +Src2
    0 0 1 1 -Src1 -Src2
    0 1 0 0 +Src1 +imm5
    0 1 0 1 +Imm5 +Src2
    0 1 1 0 -Src1 -Imm5
    0 1 1 1 +Imm5 -Src2
    1 0 0 0 +Src1 Imm32
    1 0 0 1 Imm32 +Src2
    1 0 1 0 -Src1 Imm32
    1 0 1 1 Imm32 -Src2
    1 1 0 0 +Src1 Imm64
    1 1 0 1 Imm64 +Src2
    1 1 1 0 -Src1 Imm64
    1 1 1 1 Imm64 -Src2

    Here we have access to {5, 32, 64}-bit constants, 16-bit constants
    come from different OpCodes.

    Imm5 are the register specifier bits: range {-31..31} for integer and logical, range {-15.5..15.5} for floating point.

    Some time ago, we discussed using the 5 bit immediates in floating point instructions as an index to an internal ROM with frequently used
    constants. The idea is that it would save some space in the instruction stream. Are you implementing that, and if not, why not?

    The constant ROM[specifier] seems to be the easiest way of taking
    5-bits and converting it into a FP number. It was only a few weeks
    ago that we changed the range from {-31..+31} to {-15.5..+15.5} as
    this covers <slightly> more fp constant uses. In My case, one always
    has access to larger constants at the same instruction-count price,
    just a larger code footprint.
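
    A rough software model of that decode step, with the 32 ROM entries left
    as placeholders rather than the actual My 66000 values:

    /* 5-bit FP immediate -> constant, via a small ROM indexed by the
       register-specifier bits; the entries shown are placeholders only. */
    static const double fp_imm_rom[32] = {
        0.0, 0.5, 1.0, 2.0 /* ..., remaining ISA-defined entries */
    };

    static double decode_fp_imm5(unsigned imm5)
    {
        return fp_imm_rom[imm5 & 31];
    }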

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 18:45:41 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/4/2025 9:17 PM, Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/4/2025 11:15 AM, MitchAlsup wrote:
    PL/1 allows for Label variables so one can build their own
    switches (and state machines with variable paths). I used
    these in a checkers playing program 1974.

    Wasn't this PL/1 feature "inherited" from the, now rightly deprecated, Alter/Goto in COBOL and Assigned GOTO in Fortran?

    Assigned GOTO has been deleted from the Fortran standard (in Fortran
    95, obsolescent in Fortran 90), but at least Intel's Fortran compiler
    supports it
    <https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/goto-assigned.html>

    What makes you think that it is "rightly" to deprecate or delete this
    feature?

    Because it could, and often did, make the code "unfollowable". That is, you are reading the code, following it to try to figure out what it is doing and come to an assigned/alter goto, and you don't know where to go next. The value was set some place else in the code, who knows where,
    and thus what value it was set to, and people/programmers just aren't
    used to being able to follow code like that.

    Take an example use: A VM interpreter. With labels-as-values it looks
    like this:

    void engine(char *source)
    {
    void *insts[] = {&&add, &&load, &&store, ...};

    void **ip=compile_to_vm_code(source,insts);

    goto *ip++;

    add:
    ...
    goto *ip++;
    load:
    ...
    goto *ip++;
    store:
    ...
    goto *ip++;
    ...
    }

    So of course you don't know where one of the gotos goes to, because
    that depends on the VM code, which depends on the source code.

    Now let's see how it looks with switch:

    void engine(char *source)
    {
    typedef enum {add, load, store,...} inst;
    inst *ip=compile_to_vm_code(source);

    for (;;) {
    switch (*ip++) {
    case add:
    ...
    break;
    case load:
    ...
    break;
    case store:
    ...
    break;
    ...
    }
    }
    }

    Now let us look at it with tabularized functions:: {Ignore the
    interrupt and exception stuff at your peril}

    bool RunInst( Chip chip )
    {
    for( uint64_t i = 0; i < cores; i++ )
    {
    ContextStack *cpu = &core[i];
    uint8_t cs = cpu->cs;
    Thread *t = cpu->context[cs];
    Inst I;

    if( cpu->interrupt & ((((signed)1)<<63) >> cpu->priority) )
    { // take an interrupt
    cpu->cs = cpu->interrupt.cs;
    cpu->priority = cpu->interrupt.priority;
    t = cpu->context[cpu->cs];
    t->reg[0] = cpu->interrupt.message;
    }
    else if( uint16_t raised = t->raised & t->enabled )
    { // take an exception
    cpu->cs--;
    t = cpu->context[cpu->cs];
    t->reg[0] = FT1( raised ) | EXCPT;
    t->reg[1] = I.inst;
    t->reg[2] = I.src1;
    t->reg[3] = I.src2;
    t->reg[4] = I.src3;
    }
    else
    { // run an instruction
    t->ip += memory( FETCH, t->ip, &I.inst );
    t->raised |= majorTable[ I.major ]( cpu, t, &I );
    }
    }
    }

    Do you know any better which of the "..." is executed next? Of course
    not, for the same reason. Likewise for call threading, but there the
    VM instruction implementations can be distributed across many source
    files. With the replicated switch, the problem of predictability is
    the same, but there is lots of extra code, with many direct gotos.

    If you implement, say, a state machine using labels-as-values, or
    switch, again, the logic behind it is the same and the predictability
    is the same between the two implementations.

    BTW, you mentioned that it could be implemented as an indirect jump. It could for those architectures that supported that feature, but it could also be implemented by having the Alter/Assign modify the code (i.e. change the address in the jump/branch instruction), and self modifying code is just bad.

    On such architectures switch would also be implemented by modifying
    the code, and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone architectures using self-modifying code are bad by association, then
    we have to get rid of all of these language features ASAP.

    One interesting aspect here is that the Fortran assigned goto and GNU
    C's goto * (to go with labels-as-values) look more like something that
    may have been inspired by a modern indirect branch than by
    self-modifying code. I only dimly remember the Cobol thing, but IIRC
    this looked more like something that's intended to be implemented by self-modifying code. I don't know what the PL/I solution looked like.

    As did COBOL, called GO TO DEPENDING ON, but those features didn't suffer the problems of assigned/alter gotos.

    As demonstrated above, they do. And if you fall back to using ifs, it
    does not get any better, either.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Nov 6 13:11:10 2025
    From Newsgroup: comp.arch

    On 11/6/2025 3:24 AM, Michael S wrote:
    On Wed, 05 Nov 2025 21:06:16 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
    1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

    Are talking about hardware or software?

    A SW solution based on how it would be done in HW.

    Then, I suspect that you didn't understand the objection of Thomas Koenig.

    1. Format of interest is Decimal128. https://en.wikipedia.org/wiki/Decimal128_floating-point_format

    2. According to my understanding, Thomas didn't suggest that *slow*
    software implementation of DPD-encoded DFP, i.e. implementation that
    only cares about correctness, is hard.

    3. OTOH, he seems to suspect, and I agree with him, that a *non-slow*
    software implementation, one comparable in speed (say, within a
    factor of 1.5-2) to a competent implementation of the same DFP operations
    in BID format, is not easy. If at all possible.

    4. All said above assumes an absence of HW assists.



    BTW, at least for multiplication, I would probably not do my
    arithmetic in the BCD domain.
    Instead, I'd convert 10+ DPD digits to two Base_1e18 digits (11 look
    ups per operand, 22 total look ups + ~40 shifts + ~20 ANDs + ~20
    additions).

    Then I'd do multiplication and normalization and rounding in Base_1e18.

    Then I'd convert from Base_1e18 to Base_1000. The ideas of such
    conversion are similar to fast binary-to-BCD conversion that I
    demonstrated here a decade or so ago. AVX2 could be quite helpful at that
    stage.

    Then I'd have to convert the result from Base_1000 to DPD. Here, again,
    11 table look-ups + plenty of ANDs/shift/ORs seem inevitable.
    May be, at that stage SIMD gather can be of help, but I have my doubts.
    So far, every time I tried gather I was disappointed with performance.

    Overall, even with seemingly decent plan like sketched above, I'd expect
    DPD multiplication to be 2.5x to 3x slower than BID. But, then again,
    in the past my early performance estimates were wrong quite often.


    I decided to start working on a mockup (quickly thrown together).
    I don't expect to have much use for it, but meh.


    It works by packing/unpacking the values into an internal format along
    vaguely similar lines to the .NET format, just bigger to accommodate
    more digits:
    4x 32-bit values each holding 9 digits
    Except the top one generally holding 7 digits.
    16-bit exponent, sign byte.

    Then wrote a few pack/unpack scenarios:
    X30: Directly packing 20/30 bit chunks, non-standard;
    DPD: Use the DPD format;
    BID: Use the BID format.

    For the pack/unpack step (taken in isolation):
    X30 is around 10x faster than either DPD or BID;
    Both DPD and BID need a similar amount of time.
    BID needs a bunch of 128-bit arithmetic handlers.
    DPD needs a bunch of merge/split and table lookups.
    Seems to mostly balance out in this case.


    For DPD, merge is effectively:
    Do the table lookups;
    v=v0+(v1*1000)+(v2*1000000);
    With a split step like:
    v0=v;
    v1=v/1000;
    v0-=v1*1000;
    v2=v1/1000;
    v1-=v2*1000;
    Then, use table lookups to go back to DPD.

    Did look into possible faster ways of doing the splitting, but then
    noted that I have not yet found a faster way that gives correct results
    (where one can assume the compiler already knows how to turn divide by constant into multiply by reciprocal).
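
    For concreteness, a minimal C sketch of the declet-lookup plus
    merge/split scheme above; the two tables are assumed to be precomputed
    elsewhere and the names are placeholders:

    #include <stdint.h>

    /* dpd2bin maps a 10-bit declet to 0..999, bin2dpd maps 0..999 back. */
    extern const uint16_t dpd2bin[1024];
    extern const uint16_t bin2dpd[1000];

    /* Merge three declets into one 9-digit binary chunk (0..999999999). */
    static uint32_t dpd_merge3(uint32_t d0, uint32_t d1, uint32_t d2)
    {
        return (uint32_t)dpd2bin[d0 & 1023]
             + (uint32_t)dpd2bin[d1 & 1023] * 1000u
             + (uint32_t)dpd2bin[d2 & 1023] * 1000000u;
    }

    /* Split a 9-digit chunk back into three declets; the divisions by a
       constant become multiplies by the reciprocal as noted above. */
    static void dpd_split3(uint32_t v, uint32_t *d0, uint32_t *d1, uint32_t *d2)
    {
        uint32_t v1 = v / 1000u;
        uint32_t v0 = v - v1 * 1000u;
        uint32_t v2 = v1 / 1000u;
        v1 -= v2 * 1000u;
        *d0 = bin2dpd[v0];
        *d1 = bin2dpd[v1];
        *d2 = bin2dpd[v2];
    }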


    At first it seemed like a strong reason to favor X30 over either DPD or
    BID. Except, that the cost of the ADD and MUL operations effectively
    dwarf that of the pack/unpack operations, so the relative cost
    difference between X30 and DPD may not matter much.


    As is, it seems MUL and ADD cost roughly 6x more than the
    DPD pack/unpack steps.

    So, it seems, while DPD pack/unpack isn't free, it is not something that
    would lead to X30 being a decisive win either in terms of performance.



    It might make more sense, if supporting BID, to just do it as its own
    thing (and embrace just using a bunch of 128-bit arithmetic, and a 128*128=>256 bit widening multiply, ...). Also, can note that the BID
    case ends up needing a lot more clutter, mostly again because C lacks
    native support for 128-bit arithmetic.

    If working based on digit chunks, likely better to stick with DPD due to
    less clutter, etc. Though, this part would be less bad if C had had
    widespread support for 128-bit integers.



    Though, in this case, the ADD and MUL operations currently work by
    internally doubling the width and then narrowing the result after normalization. This is slower, but could give exact results.


    Though, still not complete nor confirmed to produce correct results.



    But, yeah, might be more worthwhile to look into digit chunking:
    12x 3 digits (16b chunk)
    4x 9 digits (32b chunk)
    2x 18 digits (64b chunk)
    3x 12 digits (64b chunk)

    Likely I think:
    3 digits, likely slower because of needing significantly more operations;
    9 digits, seemed sensible, option I went with, internal operations fully
    fit within the limits of 64 bit arithmetic;
    18 digits, possible, but runs into many cases internally that would
    require using 128-bit arithmetic.

    12 digits, fits more easily into 64-bit arithmetic, but would still
    sometimes exceed it; and isn't that much more than 9 digits (but would
    reduce the number of chunks needed from 4 to 3).


    While 18 digits conceptually needs fewer abstract operations than 9
    digits, it would suffer the drawback of many of these operations being
    notably slower.

    However, if running on RV64G with the standard ABI, it is likely the
    9-digit case would also take a performance hit due to sign-extended
    unsigned int (and needing to spend 2 shifts whenever zero-extending a
    value).


    With 3x 12 digits, while not exactly the densest scheme, there is a little
    more "working space", so it would reduce cases which exceed the limits of
    64-bit arithmetic. Well, except multiply, where 24 > 18 ...

    The main merit of 9 digit chunking here being that it fully stays within
    the limits of 64-bit arithmetic (where multiply temporarily widens to
    working with 18 digits, but then narrows back to 9 digit chunks).

    Also 9 digit chunking may be preferable when one has a faster 32*32=>64
    bit multiplier, but 64*64=>128 is slower.


    One other possibility could be to use BCD rather than chunking, but I
    expect BCD emulation to be painfully slow in the absence of ISA level
    helpers.


    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Nov 6 19:38:54 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    Some time ago, we discussed using the 5 bit immediates in floating point instructions as an index to an internal ROM with frequently used
    constants. The idea is that it would save some space in the instruction stream. Are you implementing that, and if not, why not?

    I did some statistics on which floating point constants occurred how
    often, looking at three different packages (Perl, gnuplot and GSL).
    GSL implements a lot of special functions, so it has a lot of
    constants you are not likely to find often in a random sample of
    other packages :-) Perl has very little floating point. gnuplot
    is also special in its own way, of course.

    A few constants occur quite often, but there are a lot of
    differences between the floating point constants for different
    programs, to nobody's surprise (presumably).

    Here is the head of an output of a little script I wrote to count
    all floating-point constants from My66000 assembler. Note that
    the compiler is for the version that does not yet do 0.5 etc as
    floating point. The first number is the number of occurrences,
    the second one is the constant itself.

    5-bit constants: 886
    32-bit constants: 566
    64-bit constants: 597
    303 0
    290 1
    96 0.5
    81 6
    58 -1
    58 1e-14
    49 2
    46 -2
    45 -8.98846567431158e+307
    44 10
    44 255
    37 8.98846567431158e+307
    29 -0.5
    28 3
    27 90
    27 360
    26 -1e-05
    21 0.0174532925199433
    20 0.9
    18 -3
    17 180
    17 0.1
    17 0.01
    [...]
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Nov 6 20:04:37 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Where this might be a problem is if the label variable was a
    global symbol and the target labels were in other name spaces.
    At that point it could treat it like a pointer to a function and
    have to spill all live register variables to memory.

    Does the assigned goto support that?

    No, that would be beyond horrible.

    What about regular goto and
    computed goto?

    Neither; according to F77, it must be "defined in the same program
    unit".

    An extra feature: When using GOTO variable, you can also supply a
    list of labels that it should jump to; if the jump target is not
    in the list, the GOTO variable is illegal.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Nov 6 20:07:16 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    So does gfortran support assigned goto, too?

    Yes.

    Cool.

    What problems in
    interaction with other features do you see?

    In this case, it is more the problem of modern architectures.
    On 32-bit architectures, it might have been possible to stash
    the address of a jump target in an actual INTEGER variable and
    GO TO there. On a 64-bit architecture, this is not possible, so
    you need to have a shadow variable for the pointer

    Implementation options that come to my mind are:

    1) Have the code in the bottom 4GB (or maybe 2GB), and a 32-bit
    variable is sufficient. AFAIK on some 64-bit architectures the
    default memory model puts the code in the bottom 4GB or 2GB.

    Compiler writers should never box themselves in like that.

    2) Put the offset from the start of the function or compilation unit (whatever scope the assigned goto can be used in) in the 32-bit
    variable. 32 bits should be enough for that.

    That would make jumps very inefficient.

    Of course, if Fortran
    assigns labels between shared libraries and the main program,

    It does not.

    How does ifort deal with this problem?

    I have no idea, and no inclination to find out; check out
    assembly code at godbolt if you are really interested.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Thu Nov 6 12:14:33 2025
    From Newsgroup: comp.arch

    On 11/6/2025 11:38 AM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    Some time ago, we discussed using the 5 bit immediates in floating point
    instructions as an index to an internal ROM with frequently used
    constants. The idea is that it would save some space in the instruction
    stream. Are you implementing that, and if not, why not?

    I did some statistics on which floating point constants occurred how
    often, looking at three different packages (Perl, gnuplot and GSL).
    GSL implements a lot of special functions, so it has a lot of
    constants you are not likely to find often in a random sample of
    other packages :-) Perl has very little floating point. gnuplot
    is also special in its own way, of course.

    A few constants occur quite often, but there are a lot of
    differences between the floating point constants for different
    programs, to nobody's surprise (presumably).

    Here is the head of an output of a little script I wrote to count
    all floating-point constants from My66000 assembler. Note that
    the compiler is for the version that does not yet do 0.5 etc as
    floating point. The first number is the number of occurrences,
    the second one is the constant itself.

    5-bit constants: 886
    32-bit constants: 566
    64-bit constants: 597
    303 0
    290 1
    96 0.5
    81 6
    58 -1
    58 1e-14
    49 2
    46 -2
    45 -8.98846567431158e+307
    44 10
    44 255
    37 8.98846567431158e+307
    29 -0.5
    28 3
    27 90
    27 360
    26 -1e-05
    21 0.0174532925199433
    20 0.9
    18 -3
    17 180
    17 0.1
    17 0.01
    [...]

    Interesting! No values related to pi? And what are the ...e+307 used for?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 20:24:23 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    So does gfortran support assigned goto, too?

    Yes.

    Cool.

    What problems in
    interaction with other features do you see?

    In this case, it is more the problem of modern architectures.
    On 32-bit architectures, it might have been possible to stash
    the address of a jump target in an actual INTEGER variable and
    GO TO there. On a 64-bit architecture, this is not possible, so
    you need to have a shadow variable for the pointer

    Implementation options that come to my mind are:

    1) Have the code in the bottom 4GB (or maybe 2GB), and a 32-bit
    variable is sufficient. AFAIK on some 64-bit architectures the
    default memory model puts the code in the bottom 4GB or 2GB.

    2) Put the offset from the start of the function or compilation unit (whatever scope the assigned goto can be used in) in the 32-bit
    variable. 32 bits should be enough for that.

    After 4 years of looking, we are still waiting for a single function
    that needs more than a scaled 16-bit displacement from current IP
    {±17-bits} to reach all labels within the function.

    Of course, if Fortran
    assigns labels between shared libraries and the main program, that
    approach probably does not work, but does anybody really do that?

    How does ifort deal with this problem?

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Nov 6 16:24:28 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Where this might be a problem is if the label variable was a
    global symbol and the target labels were in other name spaces.
    At that point it could treat it like a pointer to a function and
    have to spill all live register variables to memory.

    Does the assigned goto support that? What about regular goto and
    computed goto?

    - anton

    I didn't mean to imply that it did.
    As far as I remember, Fortran 77 does not allow it.
    I never used later Fortrans.

    I hadn't given the dynamic branch topic any thought until you raised it
    and this was just me working through the things a compiler might have
    to deal with.

    I have written jump dispatch table code myself where the destinations
    came from symbols external to the routine, but I had to switch to
    inline assembler for this as MS C does not support goto variables,
    and it was up to me to make sure the registers were all handled correctly.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 21:59:31 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:

    Some time ago, we discussed using the 5 bit immediates in floating point instructions as an index to an internal ROM with frequently used constants. The idea is that it would save some space in the instruction stream. Are you implementing that, and if not, why not?

    I did some statistics on which floating point constants occurred how
    often, looking at three different packages (Perl, gnuplot and GSL).
    GSL implements a lot of special functions, so it has a lot of
    constants you are not likely to find often in a random sample of
    other packages :-) Perl has very little floating point. gnuplot
    is also special in its own way, of course.

    A few constants occur quite often, but there are a lot of
    differences between the floating point constants for different
    programs, to nobody's surprise (presumably).

    Here is the head of an output of a little script I wrote to count
    all floating-point constants from My66000 assembler. Note that

    There is a space between the y and the 6 in My 66000.

    the compiler is for the version that does not yet do 0.5 etc as
    floating point. The first number is the number of occurrences,
    the second one is the constant itself.

    5-bit constants: 886
    32-bit constants: 566
    64-bit constants: 597
    303 0
    290 1
    96 0.5
    81 6
    58 -1
    58 1e-14
    49 2
    46 -2
    45 -8.98846567431158e+307
    44 10
    44 255
    37 8.98846567431158e+307
    29 -0.5
    28 3
    27 90
    27 360
    26 -1e-05
    21 0.0174532925199433
    20 0.9
    18 -3
    17 180
    17 0.1
    17 0.01
    [...]

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Thu Nov 6 22:09:25 2025
    From Newsgroup: comp.arch

    It appears that MitchAlsup <user5857@newsgrouper.org.invalid> said:
    That is not the issue. The question is if the semantics of "goto
    label-valued-variable" are hard to define, as Ritchie said, or not, as
    Anton thinks Stallman said or would have said.

    So, label-variables are hard to define, but function-variables are not ?!?

    Relatively speaking, yeah. In languages with nested scopes, label gotos
    can jump to an outer scope so they have to unwind some frames. Back when people used such things, a common use was on an error to jump out to some recovery code.

    Function pointers have a sort of similar problem in that they need to carry along pointers to all of the enclosing frames the function can see. That is reasonably well solved by displays, give or take the infamous Knuth man or boy program, 13 lines of Algol60 horror for which Knuth himself got the results wrong.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 6 22:53:09 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Where this might be a problem is if the label variable was a
    global symbol and the target labels were in other name spaces.
    At that point it could treat it like a pointer to a function and
    have to spill all live register variables to memory.

    Does the assigned goto support that? What about regular goto and
    computed goto?

    - anton

    I didn't mean to imply that it did.
    As far as I remember, Fortran 77 does not allow it.
    I never used later Fortrans.

    I hadn't given the dynamic branch topic any thought until you raised it
    and this was just me working through the things a compiler might have
    to deal with.

    I have written jump dispatch table code myself where the destinations
    came from symbols external to the routine, but I had to switch to
    inline assembler for this as MS C does not support goto variables,

    Oh sure it does--it is called Return-Oriented-Programming.
    You take the return address off the stack and insert your
    go-to label on the stack and then just return.

    Or you could do some "foul play" on a jumpbuf and longjump.

    {{Be careful not to shoot yourself in the foot.}}

    and it was up to me to make sure the registers were all handled correctly.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 6 22:21:05 2025
    From Newsgroup: comp.arch

    Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
    That is not the issue. The question is if the semantics of "goto label-valued-variable" are hard to define, as Ritchie said, or not, as
    Anton thinks Stallman said or would have said.

    The discussion above shows that whether a label value is implemented as
    a bare code address, or as a jumpbuf, some cases will have Undefined Behavior semantics. So I think Ritchie was right, unless the undefined
    cases can be excluded at compile time.

    Ritchie designed lots of features into C for which the C
    standardization committee later decided that some cases are undefined behaviour. I don't think that Ritchie had any qualms at designing
    something like labels-as-values with unchecked limitations (what would
    later become undefined or implementation-defined behaviour), or
    documenting these limitations.

    Here is my attempt (from 1999) at a specification for
    labels-as-values:

    |"goto *<expr>" [or whatever the syntax was] is equivalent to "goto <label>"
    |if <expr> evaluates to the same value as the expression "&&<label>" [or
    |whatever the syntax was]. If <expr> does not evaluate to a label of the
    |function that contains the "goto *<expr>", the result is undefined.

    The undefined cases could be excluded at compile-time, even in C, by requiring all label-valued variables to be local to some function and forbidding passing such values as parameters or function results.

    Gforth certainly passes the labels out, for use by the compiler that
    generates the VM code.

    In
    addition, the use of an uninitialized label-valued variable should be prevented or detected.

    Using an uninitialized variable is undefined behaviour in C, but not
    prevented, and not always detected (compilers emit warnings in some
    cases when they detect a use of an uninitialized variable). Why
    should it be any different for an uninitialized variable used with
    "goto *"?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Thu Nov 6 20:10:19 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:
    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Where this might be a problem is if the label variable was a
    global symbol and the target labels were in other name spaces.
    At that point it could treat it like a pointer to a function and
    have to spill all live register variables to memory.
    Does the assigned goto support that? What about regular goto and
    computed goto?

    - anton
    I didn't mean to imply that it did.
    As far as I remember, Fortran 77 does not allow it.
    I never used later Fortrans.

    I hadn't given the dynamic branch topic any thought until you raised it
    and this was just me working through the things a compiler might have
    to deal with.

    I have written jump dispatch table code myself where the destinations
    came from symbols external to the routine, but I had to switch to
    inline assembler for this as MS C does not support goto variables,

    Oh sure it does--it is called Return-Oriented-Programming.
    You take the return address off the stack and insert your
    go-to label on the stack and then just return.

    Or you could do some "foul play" on a jumpbuf and longjump.

    {{Be careful not to shoot yourself in the foot.}}

    Or worse... shoot yourself in the foot and then step in a cow pie.
    I hate when that happens.

    and it was up to me to make sure the registers were all handled correctly.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Nov 7 06:55:08 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    After 4 years of looking, we are still waiting for a single function
    that needs more than a scaled 16-bit displacement from current IP
    {±17-bits} to reach all labels within the function.

    Some people use auto-generated code (for example from computer
    algebra systems), which generate really, really long procedures.
    A good stress-test for compilers, too; they tend to expose
    O(n^2) or worse behavior where nobody looked. So it is good that
    branch instructions within functions are expanded by the assembler
    if needed :-)

    Even having 64-bit offsets like My 66000 can lead into a trap (and will
    require future optimization work on the compiler). This is a simplified version of something that came up in a PR.

    SUBROUTINE FOO
    DOUBLE PRECISION A,B,C,D,E
    COMMON A,B,C,D,E
    C very many statements involving A,B,C,D,E

    If you load and store each access to one of the variables via its
    64-bit access, you can end up using very many 96-bit instructions,
    where a single load of the base address of the COMMON block would
    save a lot of code space at the expense of a single instruction
    at the beginning.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 7 08:06:41 2025
    From Newsgroup: comp.arch

    Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
    On 2025-11-06 11:43, Michael S wrote:
    On Wed, 5 Nov 2025 17:26:44 +0200
    Niklas Holsti <niklas.holsti@tidorum.invalid> wrote:

    On 2025-11-05 7:17, Anton Ertl wrote:
    Why does standard C not have it? C had it up to and including the
    6th edition Unix <3714DA77.6150C99A@bell-labs.com>, but it went away
    between 6th and 7th edition. Ritchie wrote
    <37178013.A1EE3D4F@bell-labs.com>:

    | I eliminated them because I didn't know what to say about their
    | semantics.
    ...
    Yes, UB sounnds as the best answer..

    The point is that Ritchie was not satisfied with that answer, which is
    why he removed labels-as-values from his version of C.

    He did not write that, and given the rest of C, I very much doubt that
    this was the reason.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 7 08:08:42 2025
    From Newsgroup: comp.arch

    Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
    On 2025-11-06 10:46, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    [Fortran's assigned goto]
    Because it could, and often did, make the code "unfollowable". That is, you are reading the code, following it to try to figure out what it is
    doing and come to an assigned/alter goto, and you don't know where to go next. The value was set some place else in the code, who knows where,
    and thus what value it was set to, and people/programmers just aren't
    used to being able to follow code like that.

    Take an example use: A VM interpreter. With labels-as-values it looks
    like this:

    void engine(char *source)
    {
    void *insts[] = {&&add, &&load, &&store, ...};

    void **ip=compile_to_vm_code(source,insts);

    goto *ip++;

    add:
    ...
    goto *ip++;
    load:
    ...
    goto *ip++;
    store:
    ...
    goto *ip++;
    ...
    }

    So of course you don't know where one of the gotos goes to, because
    that depends on the VM code, which depends on the source code.

    I'm not sure if you are trolling or serious, but I will assume the latter.

    This is the problem that Stephen Fuld mentioned, and that is actually
    a practical problem that I have experience in some cases when
    debugging programs with indirect control flow, usually with various
    forms of indirect calls, e.g., method calls. I have not experienced
    it for threaded-code interpreters that use labels-as-values (as
    outlined above), because there I can always look at ip[0], ip[1]
    etc. to see where the next executions of goto *ip will go.

    The point is that without a deep analysis of the program you cannot be
    sure that these goto's actually go to one of the labels in the engine() function, and not to some other location in the code, perhaps in some
    other function. That analysis would have to discover that the compile_to_vm_code() function returns a pointer to a vector of addresses picked from the insts[] vector. That could need an analysis of many functions called from compile_to_vm_code(), the history of the whole
    program execution, and so on. NOT easy.

    That has never been a problem in my experience, and I have been using labels-as-values since 1992. Up to gforth-0.6 (2003), all instances
    of &&label and all instances of goto *expr were in the same function,
    so if labels had a separate type, that could not be converted by
    casts, the analysis would be trivial, at least if GNU C was an
    Ada-like language, where labels have their own type that cannot be
    converted to other types. As it is, Fortran's assigned goto uses
    integer numbers, and labels-as-values uses void *, so if anybody was
    really interested in performing such an analysis, they would have a
    lot of work to do. But the design of these features with using
    existing types makes it obvious that performing such an analysis was
    not intended.

    Interestingly, if somebody wanted to work in that direction, checking
    at run-time that the target of a goto is inside the function that
    contains the goto is easy and not particularly expensive. With the
    newfangled "control-flow integrity" features in hardware, you could
    even check relatively cheaply that only &&label instances are targets
    of goto *.
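
    A crude version of such a run-time check in GNU C, as a fragment inside
    a dispatch loop like the engine() above; first_label and last_label are
    hypothetical sentinels bracketing the VM-instruction labels, abort() is
    from <stdlib.h>, and ordering comparisons on label addresses are not
    blessed by the standard, though they work on flat-address-space targets:

        if (*ip < &&first_label || *ip > &&last_label)
            abort();      /* target is not a label of this function */
        goto *ip++;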

    Ok, so what about gforth-0.6 (2003) and later? First of all, they
    contain two functions with goto * and &&label instances, so the
    trivial analysis would no longer work. Has there ever been any mixup
    where a goto * jumped to a label in the other function? Not that I
    know of; if it happened, it would actually work, because the two
    functions are identical apart from some code-space padding.

    What's more relevant is that gforth-0.6 added code-copying dynamic
    native code generation: It copies code snippets (using the addresses
    gotten with &&label to determine where they start and where they end)
    to some RWX data region, concatenating the snippets in this way,
    resulting in a compiled program in the RWX region. It then uses one
    of the goto * in one of the functions to actually start executing this dynamically-generated code.

    This is probably outside of what Stallman had in mind for
    labels-as-values, but fortunately Stallman did not try to limit what
    can be done to what he had in mind, the way that many programming
    language designers do, and the way that many people discussing
    programming languages think. This is a feature that Ritchie's C also
    has, which cannot be said about the C of people who think that
    "undefined behaviour" is enough justification to declare a program
    "buggy".

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 7 10:09:02 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/6/2025 12:46 AM, Anton Ertl wrote:
    If you implement, say, a state machine using labels-as-values, or
    switch, again, the logic behind it is the same and the predictability
    is the same between the two implementations.

    Nick responded better than I could to this argument, demonstrating how
    it isn't true. As I said, in the hands of a good programmer, you might assume that the goto goes to one of those labels, but you can't be sure
    of it.

    In <1762311070-5857@newsgrouper.org> you mentioned method calls as
    'just a more expensive "label"', there you know that the method call
    calls one of the implementations of the method with the name, like
    with the switch. You did not find that satisfying in <1762311070-5857@newsgrouper.org>, but now knowing that it's one of a
    large number of switch targets is good enough for you, whereas Niklas
    Holsti's problem (which does not occur in my practical experience with labels-as-values) has become your problem?

    BTW, you mentioned that it could be implemented as an indirect jump. It could for those architectures that supported that feature, but it could
    also be implemented by having the Alter/Assign modify the code (i.e.
    change the address in the jump/branch instruction), and self modifying
    code is just bad.

    On such architectures switch would also be implemented by modifying
    the code,

    I don't think so. Switch can, and I understand usually is, implemented
    via an index into a jump table. No self modifying code required.

    What does "index into a jump table" mean in one of those architectures
    that did not have indirect jumps and used self-modifying code instead?
    I bet that it ends up in self-modifying code, too, because these
    architectures usually don't have indirect jumps through jump tables,
    either. If they had, the easy way to implement indirect branches
    without self-modifying code would be to have a one-entry jump table,
    store the target in that entry, and then perform an indirect jump
    through that jump table.

    and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone
    architectures using self-modifying code are bad by association, then
    we have to get rid of all of these language features ASAP.

    And, by and large they have.

    We have gotten rid of indirect calls, e.g., in higher-order functions
    in functional programming languages? We have gotten rid of dynamic
    method dispatch in object-oriented programs?

    Thinking about the things that self-modifying code has been used for
    on some architecture, IIRC that also includes array indexing. So have
    we gotten rid of array indexing in programming languages?

    One interesting aspect here is that the Fortran assigned goto and GNU
    C's goto * (to go with labels-as-values) look more like something that
    may have been inspired by a modern indirect branch than by
    self-modifying code.

    Well, the Fortran feature was designed in what, the late 1950s? Back
    then, self modifying code wasn't considered as bad as it now is.

    Did you read what you are replying to?

    Does the IBM 704 (for which FORTRAN has been designed originally)
    support indirect branches, or was it necessary to implement the
    assigned goto (and computed goto) with self-modifying code on that architecture?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 7 10:32:08 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    An extra feature: When using GOTO variable, you can also supply a
    list of labels that it should jump to; if the jump target is not
    in the list, the GOTO variable is illegal.

    The benefit I see from that is that data-flow analysis must only
    consider the control flows from the assigned goto to these targets and
    not to all assigned labels (in contrast to labels-as-values), and
    conversely, if every assigned goto has such a list, data-flow analysis
    knows more precisely which gotos can actually jump to a given label.

    This would make a small difference in Gforth since 0.6, which has
    introduced hybrid direct/indirect-threaded code, and where some goto *
    are for indirect-threaded dispatches, and some labels are only reached
    from these goto * instances, and a certain variable is only alive
    across these jumps. GNU C does not have this option, so what we did
    instead is to kill the variable right before all the gotos that do not
    jump to these labels.

    It might also help with static stack caching: There are stack states
    with 0-n stack items in registers, and a particular VM instruction
    code snippet starts in a particular state (say, 2 stack items in a
    register) and ends with another state S (say, 1 stack item in a
    register). It will jump to code that expects the same state S. All
    variables that contain stack items beyond what S has are dead at that
    point. If we could tell that the goto * from state S only goes to
    targets in state S, the data-flow analysis could determine that.
    Instead, what we do is to kill these additional variables in a subset
    of uses. When we tried to kill them at all uses, the quality of the
    code produced by gcc deteriorated significantly.

    This variable-killing happens by having empty asm statements that
    claim to write to these variables, so if this is used incorrectly, the
    produced code will be incorrect. So the benefit of this assigned-goto
    feature would be to replace a dangerous feature with another dangerous
    one: if you fail to list all the jumped-to labels, the data-flow
    analysis would be wrong, too. It seems more elegant to describe the
    actual control flow, and then let the data-flow analysis do its work
    than the heavy-handed direct influence on the data-flow analysis that
    our variable-killing does.
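
    To make the trick concrete, here is a minimal GNU C sketch (hypothetical
    names, not Gforth's actual code) of killing a cached stack item with an
    empty asm statement before a dispatch that never reaches the labels where
    that item is live:

    /* The empty asm claims to write tos, so gcc's data-flow analysis
       treats the cached value as dead at that point. */
    #define KILL(var) __asm__ ("" : "=r" (var))

    long run(void)
    {
        void *prog[] = { &&add1, &&add1, &&exit_state, &&done };
        void **ip = prog;
        long tos = 0;               /* one stack item cached in a register */

        goto *ip++;
    add1:                           /* runs in the "tos cached" state */
        tos += 1;
        goto *ip++;
    exit_state:                     /* leaves the cached state, so ... */
        KILL(tos);                  /* ... declare the cached tos dead here */
        goto *ip++;
    done:
        return 0;                   /* no tos-using label reachable from exit_state */
    }

    Getting the KILL placement wrong silently produces wrong code, which is
    exactly the danger described above.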

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 7 15:26:38 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    In languages with nested scopes, label gotos
    can jump to an outer scope so they have to unwind some frames. Back when people used such things, a common use was on an error to jump out to some recovery code.

    Pascal has that feature. Concerning error handling, jumping to an
    error handler in a statically enclosing scope has fallen out of
    favour, but throwing an exception to the next dynamically enclosing
    exception handler is supported in a number of languages.

    Function pointers have a sort of similar problem in that they need to carry along pointers to all of the enclosing frames the function can see. That is reasonably well solved by displays, give or take the infamous Knuth man-or-boy program, 13 lines of Algol 60 horror for which Knuth himself got the results wrong.

    Displays and static link chains are among the techniques that can be
    used to implement static scoping correctly, i.e., where the man-or-boy
    test produces the correct result. Knuth initially got the result
    wrong, because he only had boy compilers, and the computation is too
    involved to do it by hand.

    The main horror in the original version is that for some of the Algol
    60 syntax that is used, it is not obvious without studying the Algol
    60 report what it means. <https://rosettacode.org/wiki/Man_or_boy_test#ALGOL_60> contains some discussion, and one can find it in various other programming
    languages, more or (often) less close to the original. The discussion
    at <https://rosettacode.org/wiki/Man_or_boy_test#TXR> and the
    difference between the "proper job" version and the "crib the Common
    Lisp or Scheme solution" version gives some insight.

    The fact that "less close" also produces the correct result suggests
    that the man-or-boy test is less discerning than Knuth probably
    intended. That's a common problem with testing.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Fri Nov 7 08:26:41 2025
    From Newsgroup: comp.arch

    On 11/7/2025 2:09 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/6/2025 12:46 AM, Anton Ertl wrote:
    If you implement, say, a state machine using labels-as-values, or
    switch, again, the logic behind it is the same and the predictability
    is the same between the two implementations.

    Nick responded better than I could to this argument, demonstrating how
    it isn't true. As I said, in the hands of a good programmer, you might
    assume that the goto goes to one of those labels, but you can't be sure
    of it.

    In <1762311070-5857@newsgrouper.org> you

    I think the attributions are messed up, as I didn't say what you next
    say I said.


    mentioned method calls as
    'just a more expensive "label"', there you know that the method call
    calls one of the implementations of the method with the name, like
    with the switch. You did not find that satisfying in <1762311070-5857@newsgrouper.org>, but now knowing that it's one of a
    large number of switch targets is good enough for you, whereas Niklas Holsti's problem (which does not occur in my practical experience with labels-as-values) has become your problem?

    BTW, you mentioned that it could be implemented as an indirect jump. It could for those architectures that supported that feature, but it could also be implemented by having the Alter/Assign modify the code (i.e.
    change the address in the jump/branch instruction), and self modifying code is just bad.

    On such architectures switch would also be implemented by modifying
    the code,

    I don't think so. Switch can, and I understand usually is, implemented
    via an index into a jump table. No self modifying code required.

    What does "index into a jump table" mean in one of those architectures
    that did not have indirect jumps and used self-modifying code instead?

    For example, the following Fortran code

    goto (10,20,30,40) I @ will jump to label 10 if I = 1, 20 if I = 2, etc.

    would be compiled to something like (add any required "bounds checking"
    for I)

    load R1,I
    Jump $,R1
    Jump 10
    Jump 20
    Jump 30
    Jump 40

    No code modification nor indirection required.

    Yes, it does require execution of an "extra" jump instruction.
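
    For comparison, the same computed GOTO written in C (hypothetical labels)
    is just a dense switch; on most modern targets a compiler turns this into
    an indirect jump through a jump table, whereas the sequence above jumps
    into a run of jump instructions and so needs neither an indirect jump nor
    code modification:

    void computed_goto(int i)
    {
        switch (i) {                /* goto (10,20,30,40) I */
        case 1: goto L10;
        case 2: goto L20;
        case 3: goto L30;
        case 4: goto L40;
        default: return;            /* the "bounds checking" for I */
        }
    L10: /* ... */ return;
    L20: /* ... */ return;
    L30: /* ... */ return;
    L40: /* ... */ return;
    }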


    I bet that it ends up in self-modifying code, too, because these architectures usually don't have indirect jumps through jump tables,
    either.

    Not required.


    If they had, the easy way to implement indirect branches
    without self-modifying code would be to have a one-entry jump table,
    store the target in that entry, and then perform an indirect jump
    through that jump table.

    and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone
    architectures using self-modifying code are bad by association, then
    we have to get rid of all of these language features ASAP.

    And, by and large, they have.

    We have gotten rid of indirect calls, e.g., in higher-order functions
    in functional programming languages? We have gotten rid of dynamic
    method dispatch in object-oriented programs.

    No, and I defer to you, or others here, on how these features are
    implemented, specifically whether code modification is required. I was referring to features such as assigned goto in Fortran, and Alter goto
    in Cobol.


    Thinking about the things that self-modifying code has been used for
    on some architecture, IIRC that also includes array indexing. So have
    we gotten rid of array indexing in programming languages?

    Of course not. But I suspect that we have "gotten rid of" any
    architecture that *requires* code modification for array indexing.


    One interesting aspect here is that the Fortran assigned goto and GNU
    C's goto * (to go with labels-as-values) look more like something that
    may have been inspired by a modern indirect branch than by
    self-modifying code.

    Well, the Fortran feature was designed in what, the late 1950s? Back
    then, self modifying code wasn't considered as bad as it now is.

    Did you read what you are replying to?

    Does the IBM 704 (for which FORTRAN has been designed originally)
    support indirect branches, or was it necessary to implement the
    assigned goto (and computed goto) with self-modifying code on that architecture?

    I don't know what the 704 implemented, but I have shown above that self
    modifying code is not necessary for computed goto, and I suspect
    assigned goto was implemented with self modifying code. But as I said,
    back then self modifying code was not considered as bad as it is now.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Nov 7 17:29:07 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 11/6/2025 11:38 AM, Thomas Koenig wrote:

    [...]

    Here is the head of an output of a little script I wrote to count
    all floating-point constants from My66000 assembler. Note that
    the compiler is for the version that does not yet do 0.5 etc as
    floating point. The first number is the number of occurrences,
    the second one is the constant itself.

    5-bit constants: 886
    32-bit constants: 566
    64-bit constants: 597
    303 0
    290 1
    96 0.5
    81 6
    58 -1
    58 1e-14
    49 2
    46 -2
    45 -8.98846567431158e+307
    44 10
    44 255
    37 8.98846567431158e+307
    29 -0.5
    28 3
    27 90
    27 360
    26 -1e-05
    21 0.0174532925199433
    20 0.9
    18 -3
    17 180
    17 0.1
    17 0.01
    [...]

    Interesting! No values related to pi? And what are the ...e+307 used for?

    If you look closely, you'll see pi/180 in that list. But pi is
    also there (I cut it off the list), it occurs 11 times. And the
    large numbers are +/- DBL_MAX*0.5, I don't know what they are
    used for.

    By comparison, here are the values which are most frequently
    contained in GSL:

    5-bit constants: 5148
    32-bit constants: 3769
    64-bit constants: 3140
    2678 1
    1518 0
    687 -1
    424 2
    329 0.5
    298 -2
    291 2.22044604925031e-16
    275 4.44089209850063e-16
    273 3
    132 -3
    131 -0.5
    131 3.14159265358979
    88 4
    86 1.34078079299426e+154
    77 6
    70 0.25
    70 5
    68 2.2250738585072e-308
    66 10
    64 -4
    50 -6
    46 0.1
    45 5.87747175411144e-39
    43 0.333333333333333
    42 1e+50
    38 6.28318530717959
    35 9
    31 0.2
    30 7
    30 -0.25

    [...]

    So, having values between -15.5 and +15.5 is a choice that will
    cover quite a few floating point constants. For different packages,
    FP constant distributions probably vary too much to create something
    that is much more useful.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 7 17:15:59 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/7/2025 2:09 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/6/2025 12:46 AM, Anton Ertl wrote:
    On such architectures switch would also be implemented by modifying
    the code,

    I don't think so. Switch can, and I understand usually is, implemented
    via an index into a jump table. No self modifying code required.

    What does "index into a jump table" mean in one of those architectures
    that did not have indirect jumps and used self-modifying code instead?

    For example, the following Fortran code

    goto (10,20,30,40) I @ will jump to label 10 if I = 1, 20 if I = 2, etc.

    would be compiled to something like (add any required "bounds checking"
    for I)

    load R1,I
    Jump $,R1
    Jump 10
    Jump 20
    Jump 30
    Jump 40

    Which architecture is that?

    No code modification nor indirection required.

    The "Jump $,R1" is an indirect jump. With that the assigned goto can
    be implemented as (for "GOTO X")

    load R1,X
    Jump 0,R1

    and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone >>>> architectures using self-modifying code are bad by association, then
    we have to get rid of all of these language features ASAP.

    And, by and large, they have.

    We have gotten rid of indirect calls, e.g., in higher-order functions
    in functional programming languages? We have gotten rid of dynamic
    method dispatch in object-oriented programs.

    No, and I defer to you, or others here, on how these features are implemented, specifically whether code modification is required. I was referring to features such as assigned goto in Fortran, and Alter goto
    in Cobol.

    On modern architectures higher-order functions are implemented with
    indirect branches or indirect calls (depending on whether it's a
    tail-call or not); likewise for method dispatch.

    I do not know how Lisp, FORTRAN, Algol 60 and other early languages
    with higher-order functions were implemented on architectures that do
    not have indirect branches; but if the assigned goto was implemented
    with self-modifying code, the call to a function in a variable was
    probably implemented like that, too.

    Thinking about the things that self-modifying code has been used for
    on some architecture, IIRC that also includes array indexing. So have
    we gotten rid of array indexing in programming languages?

    Of course not. But I suspect that we have "gotten rid of" any
    architecture that *requires* code modification for array indexing.

    We have also gotten rid of any architecture that requires
    self-modifying code for implementing the assigned goto.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Bill Findlay@findlaybill@blueyonder.co.uk to comp.arch on Fri Nov 7 17:54:33 2025
    From Newsgroup: comp.arch

    On 7 Nov 2025, Anton Ertl wrote
    (in article<2025Nov7.162638@mips.complang.tuwien.ac.at>):

    John Levine <johnl@taugh.com> writes:
    In languages with nested scopes, label gotos
    can jump to an outer scope so they have to unwind some frames. Back when people used such things, a common use was on an error to jump out to some recovery code.

    Pascal has that feature. Concerning error handling, jumping to an
    error handler in a statically enclosing scope has fallen out of
    favour, but throwing an exception to the next dynamically enclosing
    exception handler is supported in a number of languages.

    Function pointers have a sort of similar problem in that they need to carry along pointers to all of the enclosing frames the function can see. That is reasonably well solved by displays, give or take the infamous Knuth man or boy
    program, 13 lines of Algol 60 horror for which Knuth himself got the results wrong.

    Displays and static link chains are among the techniques that can be
    used to implement static scoping correctly, i.e., where the man-or-boy
    test produces the correct result. Knuth initially got the result
    wrong, because he only had boy compilers, and the computation is too
    involved to do it by hand.

    I append a run of MANORBOY in Pascal for the KDF9.
    No display was used.
    A static frame pointer as part of the functional parameter
    suffices logically and gives better performance.

    Paskal : the KDF9 Pascal cross-compiler V19.2a, compiled ... on 2025-11-07.
    1 u | %storage = 32767
    2 u | %ystores = 30100
    3 u |
    4 u | program MAN_OR_BOY;
    5 u |
    6 u | { See: }
    7 u | { "Man or boy?", }
    8 u | { by Donald Knuth, }
    9 u | { ALGOL Bulletin 17.2.4, p7; July 1964. }
    10 u |
    11 u | var
    12 u | i : integer;
    13 u | function A (
    14 u | k : integer;
    15 u | function x1 : integer;
    16 u | function x2 : integer;
    17 u | function x3 : integer;
    18 u | function x4 : integer;
    19 u | function x5 : integer
    20 u | ) : integer;
    21 u |
    22 u | function B : integer;
    23 u 1b| begin
    24 u | k := k - 1;
    25 u | B := A (k, B, x1, x2, x3, x4);
    26 u 1e| end { B };
    27 u |
    28 u 1b| begin { A }
    29 u | if k <= 0 then
    30 u | A := x4 + x5
    31 u | else
    32 u | A := B;
    33 u 1e| end { A };
    34 u |
    35 u | function pos_one : integer;
    36 u | begin pos_one := 1 end;
    37 u |
    38 u | function neg_one : integer;
    39 u | begin neg_one := -1 end;
    40 u |
    41 u | function zero : integer;
    42 u | begin zero := 0 end;
    43 u |
    44 u 1b| begin { MAN_OR_BOY }
    45 u | rewrite(1, 3);
    46 u | for i := 0 to 11 do
    47 u | write(A(i, pos_one, neg_one, neg_one, pos_one, zero):6);
    48 u | writeln;
    49 u 1e| end { MAN_OR_BOY }.

    Compilation complete : 0 error(s) and 0 warning(s) were reported.
    ...
    This is ee9 17.0a, compiled by GNAT ... on 2025-11-07.
    Running the KDF9 problem program Binary/MANORBOY
    ...
    Final State: Normal end of run.
    ...
    LP0 on buffer #05 printed 1 line.

    LP0:
    ===
    1 0 -2 0 1 0 1 -1 -10 -30 -67 -138
    ===
    --
    Bill Findlay

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Fri Nov 7 10:45:39 2025
    From Newsgroup: comp.arch

    On 11/7/2025 9:15 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/7/2025 2:09 AM, Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 11/6/2025 12:46 AM, Anton Ertl wrote:
    On such architectures switch would also be implemented by modifying
    the code,

    I don't think so. Switch can, and I understand usually is, implemented via an index into a jump table. No self modifying code required.

    What does "index into a jump table" mean in one of those architectures
    that did not have indirect jumps and used self-modifying code instead?

    For example, the following Fortran code

    goto (10,20,30,40) I @ will jump to label 10 if I = 1, 20 if I = 2, etc.
    would be compiled to something like (add any required "bounds checking"
    for I)

    load R1,I
    Jump $,R1
    Jump 10
    Jump 20
    Jump 30
    Jump 40

    Which architecture is that?

    It is generic enough that it could be lots of architectures, but the one
    I know best is the Univac 1100.



    No code modification nor indirection required.

    The "Jump $,R1" is an indirect jump.

    Perhaps we just have a terminology disagreement. I don't call that
    indirect addressing. The 1100 architecture supports indirect addressing
    in the hardware. An indirect reference was represented in the assembler
    by an asterisk preceding the label, which set a bit in the instruction
    that told the hardware to go to the address specified in the instruction
    and treat what it found there as the address of the operand for the instruction.

    So, for example:

    J *tag

    tag finaladdress

    would cause the hardware to fetch the address at tag and use that as the operand, thus causing a jump to "finaladdress".

    This is what I call indirect addressing.

    So to use this in an assigned goto, the assign statement would store the desired address at tag such that when the jump was executed, it would
    jump to the desired address.

    I call the construct with several consecutive jump instructions an
    indexed jump, not an indirect one.



    With that the assigned goto can
    be implemented as (for "GOTO X")

    load R1,X
    Jump 0,R1


    Yes.


    and indirect calls and method dispatch would also be
    implemented by modifying the code. If self-modifying code is "just
    bad", and any language features that are implemented on some long-gone >>>>> architectures using self-modifying code are bad by association, then >>>>> we have to get rid of all of these language features ASAP.

    And, by and large, they have.

    We have gotten rid of indirect calls, e.g., in higher-order functions
    in functional programming languages? We have gotten rid of dynamic
    method dispatch in object-oriented programs.

    No, and I defer to you, or others here, on how these features are
    implemented, specifically whether code modification is required. I was
    referring to features such as assigned goto in Fortran, and Alter goto
    in Cobol.

    On modern architectures higher-order functions are implemented with
    indirect branches or indirect calls (depending on whether it's a
    tail-call or not); likewise for method dispatch.

    I do not know how Lisp, FORTRAN, Algol 60 and other early languages
    with higher-order functions were implemented on architectures that do
    not have indirect branches; but if the assigned goto was implemented
    with self-modifying code, the call to a function in a variable was
    probably implemented like that, too.

    Thinking about the things that self-modifying code has been used for
    on some architecture, IIRC that also includes array indexing. So have
    we gotten rid of array indexing in programming languages?

    Of course not. But I suspect that we have "gotten rid of" any
    architecture that *requires* code modification for array indexing.

    We have also gotten rid of any architecture that requires
    self-modifying code for implementing the assigned goto.

    True. But we still have my original argument, better expressed by
    Niklas about code readability/followability.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Nov 7 14:28:48 2025
    From Newsgroup: comp.arch

    On 11/6/2025 1:11 PM, BGB wrote:
    On 11/6/2025 3:24 AM, Michael S wrote:
    On Wed, 05 Nov 2025 21:06:16 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
    Thomas Koenig <tkoenig@netcologne.de> posted:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever.  It is relatively
    cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
                      1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

    Are we talking about hardware or software?
    A SW solution based on how it would be done in HW.

    Then, I suspect that you didn't understand the objection of Thomas Koenig.

    1. Format of interest is Decimal128.
    https://en.wikipedia.org/wiki/Decimal128_floating-point_format

    2. According to my understanding, Thomas didn't suggest that *slow*
    software implementation of DPD-encoded DFP, i.e. implementation that
    only cares about correctness, is hard.

    3. OTOH, he seems to suspect, and I agree with him, that *non-slow*
    software implementation, the one comparable in speed (say, within
    factor of 1.5-2) to competent implementation of the same DFP operations
    in BID format, is not easy. If at all possible.

    4. All said above assumes an absence of HW assists.



    BTW, at least for multiplication, I would probably would not do my
    arithmetic in BCD domain.
    Instead, I'd convert 10+ DPD digits to two Base_1e18 digits (11 look
    ups per operand, 22 total look ups + ~40 shifts + ~20 ANDs + ~20
    additions).

    Then I'd do multiplication and normalization and rounding in Base_1e18.

    Then I'd convert from Base_1e18 to Base_1000. The ideas of such
    conversion are similar to fast binary-to-BCD conversion that I
    demonstrated here a decade or so ago. AVX2 could be quite helpful at that
    stage.

    Then I'd have to convert the result from Base_1000 to DPD. Here, again,
    11 table look-ups + plenty of ANDs/shift/ORs seem inevitable.
    May be, at that stage SIMD gather can be of help, but I have my doubts.
    So far, every time I tried gather I was disappointed with performance.

    Overall, even with a seemingly decent plan like the one sketched above, I'd expect
    DPD multiplication to be 2.5x to 3x slower than BID. But, then again,
    in the past my early performance estimates were wrong quite often.


    I decided to start working on a mockup (quickly thrown together).
      I don't expect to have much use for it, but meh.


    It works by packing/unpacking the values into an internal format along vaguely similar lines to the .NET format, just bigger to accommodate
    more digits:
      4x 32-bit values each holding 9 digits
        Except the top one generally holding 7 digits.
      16-bit exponent, sign byte.

    Then wrote a few pack/unpack scenarios:
      X30: Directly packing 20/30 bit chunks, non-standard;
      DPD: Use the DPD format;
      BID: Use the BID format.

    For the pack/unpack step (taken in isolation):
      X30 is around 10x faster than either DPD or BID;
      Both DPD and BID need a similar amount of time.
        BID needs a bunch of 128-bit arithmetic handlers.
        DPD needs a bunch of merge/split and table lookups.
        Seems to mostly balance out in this case.


    For DPD, merge is effectively:
      Do the table lookups;
      v=v0+(v1*1000)+(v2*1000000);
    With a split step like:
      v0=v;
      v1=v/1000;
      v0-=v1*1000;
      v2=v1/1000;
      v1-=v2*1000;
      Then, use table lookups to go back to DPD.

    Did look into possible faster ways of doing the splitting, but then
    noted that I have not yet found a faster way that gives correct results
    (where one can assume the compiler already knows how to turn divide by constant into multiply by reciprocal).
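
    As a concrete (if simplified) sketch of that arithmetic, here is the
    base-1000 merge/split for one 9-digit chunk; the declet lookup tables
    (DPD <-> binary, per group of 3 digits) are omitted, and the divides by
    constants are left for the compiler to strength-reduce:

    #include <stdint.h>

    /* Merge three 0..999 groups into one 9-digit chunk (0..999999999). */
    static uint32_t merge1000(uint32_t v0, uint32_t v1, uint32_t v2)
    {
        return v0 + v1 * 1000u + v2 * 1000000u;
    }

    /* Split a 9-digit chunk back into three 0..999 groups. */
    static void split1000(uint32_t v, uint32_t *v0, uint32_t *v1, uint32_t *v2)
    {
        uint32_t t1 = v / 1000u;            /* drop the low 3 digits    */
        *v0 = v - t1 * 1000u;               /* low group                */
        uint32_t t2 = t1 / 1000u;           /* drop the middle 3 digits */
        *v1 = t1 - t2 * 1000u;              /* middle group             */
        *v2 = t2;                           /* high group               */
    }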


    At first it seemed like a strong reason to favor X30 over either DPD or
    BID. Except, that the cost of the ADD and MUL operations effectively
    dwarf that of the pack/unpack operations, so the relative cost
    difference between X30 and DPD may not matter much.


    As is, it seems MUL and ADD cost roughly 6x more than the
    DPD pack/unpack steps.

    So, it seems, while DPD pack/unpack isn't free, it is not something that would lead to X30 being a decisive win either in terms of performance.



    It might make more sense, if supporting BID, to just do it as its own
    thing (and embrace just using a bunch of 128-bit arithmetic, and a 128*128=>256 bit widening multiply, ...). Also, can note that the BID
    case ends up needing a lot more clutter, mostly again because C lacks
    native support for 128-bit arithmetic.

    If working based on digit chunks, likely better to stick with DPD due to less clutter, etc. Though, this part would be less bad if C had had widespread support for 128-bit integers.



    Though, in this case, the ADD and MUL operations currently work by internally doubling the width and then narrowing the result after normalization. This is slower, but could give exact results.


    Though, still not complete nor confirmed to produce correct results.



    But, yeah, might be more worthwhile to look into digit chunking:
      12x  3 digits (16b chunk)
      4x   9 digits (32b chunk)
      2x  18 digits (64b chunk)
      3x  12 digits (64b chunk)

    Likely I think:
    3 digits, likely slower because of needing significantly more operations;
    9 digits, seemed sensible, option I went with, internal operations fully
    fit within the limits of 64 bit arithmetic;
    18 digits, possible, but runs into many cases internally that would
    require using 128-bit arithmetic.

    12 digits, fits more easily into 64-bit arithmetic, but would still sometimes exceed it; and isn't that much more than 9 digits (but would reduce the number of chunks needed from 4 to 3).


    While 18 digits conceptually needs fewer abstract operations than 9
    digits, it would suffer the drawback of many of these operations being notably slower.

    However, if running on RV64G with the standard ABI, it is likely the 9-digit case would also take a performance hit due to sign-extended
    unsigned int (and needing to spend 2 shifts whenever zero-extending a value).


    With 3x 12 digits, while not exactly the densest scheme, there is a little
    more "working space", so it would reduce the cases which exceed the limits of
    64-bit arithmetic. Well, except multiply, where 24 > 18 ...

    The main merit of 9 digit chunking here being that it fully stays within
    the limits of 64-bit arithmetic (where multiply temporarily widens to working with 18 digits, but then narrows back to 9 digit chunks).

    Also 9 digit chunking may be preferable when one has a faster 32*32=>64
    bit multiplier, but 64*64=>128 is slower.


    One other possibility could be to use BCD rather than chunking, but I
    expect BCD emulation to be painfully slow in the absence of ISA level helpers.


    I don't know yet if my implementation of DPD is actually correct.

    Seems Decimal128 DPD is obscure enough that I don't currently have any alternate options to confirm if my encoding is correct.

    Here is an example value:
    2DFFCC1AEB53B3FB_B4E262D0DAB5E680

    Which, in theory, should resemble PI.


    Annoyingly, it seems like pretty much everyone else either went with
    BID, or with other non-standard Decimal encodings.

    Can't seem to find:
    Any examples of hard-coded numbers in this format on the internet;
    Any obvious way to generate them involving "stuff I already have".
    As, in, not going and using some proprietary IBM library or similar.

    Also Grok wasn't much help here, just keeps trying to use Python's
    "decimal", which quickly becomes obvious is not using Decimal128 (much
    less DPD), but seemingly some other 256-bit format.

    And, Grok fails to notice that what it is saying is nowhere close to
    correct in this case.

    Neither DeepSeek nor QWen being much help either... Both just sort of go
    down a rabbit hole, and eventually fall back to "Here is how you might
    go about trying to decode this format...".


    Not helpful, I more would just want some way to confirm whether or not I
    got the format correct.

    Which is easier if one has some example numbers or something that they
    can decode and verify the value, or something that is able to decode
    these numbers (which isn't just trying to stupidly shove it into
    Python's Decimal class...).


    Looking around, there is Decimal128 support in MongoDB/BSON, PyArrow,
    and Boost C++, but in these cases, less helpful because they went with BID.

    ...




    Checking, after things are a little more complete, in MHz (millions of
    times per second), on my desktop PC:
    DPD Pack/Unpack: 63.7 MHz (58 cycles)
    X30 Pack/Unpack: 567 MHz ( 7 cycles) ?...

    FMUL (unwrap) : 21.0 MHz (176 cycles)
    FADD (unwrap) : 11.9 MHz (311 cycles)

    FDIV : 0.4 MHz (very slow; Newton Raphson)

    FMUL (DPD) : 11.2 MHz (330 cycles)
    FADD (DPD) : 8.6 MHz (430 cycles)
    FMUL (X30) : 12.4 MHz (298 cycles)
    FADD (X30) : 9.8 MHz (378 cycles)

    The relative performance impact of the wrap/unwrap step is somewhat
    larger than expected (vs the unwrapped case).

    Though, there seems to only be a small difference here between DPD and
    X30 (so, likely whatever is affecting performance here is not directly
    related to the cost of the pack/unpack process).

    The wrapped cases basically just add a wrapper function that unpacks the
    input values to the internal format, and then re-packs the result.

    For using the wrapped functions to estimate pack/unpack cost:
    DPD cost: 51 cycles.
    X30 cost: 41 cycles.


    Not really a good way to make X30 much faster. It does pay for the cost
    of dealing with the combination field.

    Not sure why they would be so close:
    DPD case does a whole lot of stuff;
    X30 case is mostly some shifts and similar.

    Though, in this case, it does use these functions by passing/returning
    structs by value. It is possible a by-reference design might be faster
    in this case.


    This could possibly be cheapened slightly by going to, say:
    S.E13.M114
    In effect trading off some exponent range for cheaper handling of the exponent.


    Can note:
    MUL and ADD use double-width internal mantissa, so should be accurate;
    Current test doesn't implement rounding modes though, could do so.
    Currently hard-wired at Round-Nearest-Even.

    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary FP.
    Partly as the strategy for generating the initial guess is far less
    accurate.

    So, it first uses a loop with hard-coded checks and scales to get it in
    the general area, before then letting N-R take over. If the value isn't
    close enough (seemingly +/- 25% or so), N-R flies off into space.

    Namely:
    Exponent is wrong:
    Scale by factors of 2 until correct;
    Off by more than 50%, scale by +/- 25%;
    Off by more than 25%, scale by +/- 12.5%;
    Else: Good enough, let normal N-R take over.

    Precondition step is usually simpler with Binary-FP as the initial guess
    is usually within the correct range. So, one can use a single modified
    N-R step (that undershoots) followed by letting N-R take over.

    More of an issue though when the initial guess is "maybe within a factor
    of 10" because the usual reciprocal-approximation strategy used for
    Binary-FP isn't quite as effective.
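
    For reference, the refinement itself is the standard Newton-Raphson
    reciprocal recurrence r <- r*(2 - d*r); an illustrative sketch in plain
    double arithmetic (not the decimal-chunk code) looks like:

    /* Newton-Raphson reciprocal refinement: each step roughly doubles the
       number of correct digits, provided the initial guess r0 is already
       close enough (here, roughly within +/-25% of 1/d). */
    static double nr_recip(double d, double r0)
    {
        double r = r0;
        for (int i = 0; i < 6; i++)
            r = r * (2.0 - d * r);
        return r;
    }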


    ...


    Still don't have a use-case, mostly just messing around with this...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Nov 7 22:57:14 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:
    --------------snip---------------

    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary FP.
    Partly as the strategy for generating the initial guess is far less accurate.

    Binary FDIV NR uses a 9-bit in, 11-bits out table which results in
    an 8-bit accurate first iteration result.

    Other than DFP not being normalized, once you find the HoD, you should
    be able to use something like a 10-bit in 13-bit out table to get the
    first 2 decimal digits correct, and N-R from there.

    That 10-bits in could be the packed DFP representation (it's denser and
    has smaller tables). This way, table lookup overlaps unpacking.
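
    A rough sketch of such a seed table for the binary case (sizes and
    scaling here are illustrative, not any particular hardware's values):
    index by the top 9 fraction bits of a significand m in [1,2) and store
    an 11-bit approximation of 1/m.

    #include <stdint.h>

    static uint16_t recip_seed[512];            /* 9 bits in, 11 bits out */

    static void init_recip_seed(void)
    {
        for (int i = 0; i < 512; i++) {
            double m = 1.0 + (i + 0.5) / 512.0;     /* interval midpoint  */
            recip_seed[i] = (uint16_t)(2048.0 / m); /* 1/m scaled by 2^11 */
        }
    }

    /* frac9 = top 9 bits of the significand's fraction; result is in (0.5, 1],
       good to roughly 8 bits, ready for a Newton-Raphson step to extend. */
    static double recip_guess(int frac9)
    {
        return recip_seed[frac9] / 2048.0;
    }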
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Nov 7 20:23:40 2025
    From Newsgroup: comp.arch

    On 11/7/2025 4:57 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:
    --------------snip---------------

    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary FP.
    Partly as the strategy for generating the initial guess is far less
    accurate.

    Binary FDIV NR uses a 9-bit in, 11-bits out table which results in
    an 8-bit accurate first iteration result.

    Other than DFP not being normalized, once you find the HoD, you should
    be able to use something like a 10-bit in 13-bit out table to get the
    first 2 decimal digits correct, and N-R from there.

    That 10-bits in could be the packed DFP representation (its denser and
    has smaller tables). This way, table lookup overlaps unpacking.


    FWIW: Dump of the test code as it exists...
    https://pastebin.com/NcvCi5gD

    I had since found the decNumber library, and with this was able to
    confirm that I had in-fact figured out the specifics of the format (I
    was unsure whether or not my version was correct; as I had implemented
    it based mostly on descriptions of the format on Wikipedia; which were
    not entirely consistent).

    Otherwise, experiment / proof of concept.
    Unlikely to actually be useful.



    Way I had usually started out with binary FDIV/reciprocal:
    Turn the reciprocal into a modified integer subtract;
    Or, subtract for HOB's, everything else is a bitwise inversion.
    Can often get within the top 4 bits of the mantissa or so.

    Way I had tried to do so for decimal:
    Invert the exponent in a similar way as binary FP;
    Set the mantissa to the 9s complement value.


    Issue:
    The 9s complement method doesn't give a value particularly close to the
    actual target value.

    For example:
    Taking the reciprocal of 3.14159x, I get 0.685840x, but the actual target is 0.318309x.

    Like, I almost may as well just leave the mantissa as-is, or fill it
    with all 5s or something.


    Granted, feeding the high 3 digits through a lookup table and just
    setting all the low digits to whatever is probably also an option, and probably faster than using an initial coarse convergence to try to get
    it somewhere in the right general area.


    I realized after finding decNumber and using it to generate a test
    number, that it seems to use the format in a very different way,
    effectively keeping the value right-aligned and normalized, rather than left-aligned and normalized.

    My code sort of assumed keeping values normalized (as with traditional floating point).

    ...

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Nov 7 22:18:08 2025
    From Newsgroup: comp.arch

    On 2025-11-07 3:28 p.m., BGB wrote:
    On 11/6/2025 1:11 PM, BGB wrote:
    On 11/6/2025 3:24 AM, Michael S wrote:
    On Wed, 05 Nov 2025 21:06:16 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 04 Nov 2025 22:51:28 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
    Thomas Koenig <tkoenig@netcologne.de> posted:
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    I still think the IBM DFP people did an impressively good job
    packing that much data into a decimal representation. :-)

    Yes, that modulo 1000 packing is quite clever. It is relatively cheap to implement in hardware (which is the point, of course).
    Not sure how easy it would be in software.

    Brain dead easy: 1 table of 1024 entries each 12-bits wide,
                      1 table of 4096 entries each 10-bits wide,
    isolate the 10-bit field, LD the converted value.
    isolate the 12-bit field, LD the converted value.

    Other than "crap loads" of {deMorganizing and gate optimization}
    that is essentially what HW actually does.

    You still need to build 12-bit decimal ALUs to string together

    Are we talking about hardware or software?
    A SW solution based on how it would be done in HW.

    Then, I suspect that you didn't understand the objection of Thomas Koenig.

    1. Format of interest is Decimal128.
    https://en.wikipedia.org/wiki/Decimal128_floating-point_format

    2. According to my understanding, Thomas didn't suggest that *slow*
    software implementation of DPD-encoded DFP, i.e. implementation that
    only cares about correctness, is hard.

    3. OTOH, he seems to suspect, and I agree with him, that *non-slow*
    software implementation, the one comparable in speed (say, within
    factor of 1.5-2) to competent implementation of the same DFP operations
    in BID format, is not easy. If at all possible.

    4. All said above assumes an absence of HW assists.



    BTW, at least for multiplication, I would probably would not do my
    arithmetic in BCD domain.
    Instead, I'd convert 10+ DPD digits to two Base_1e18 digits (11 look
    ups per operand, 22 total look ups + ~40 shifts + ~20 ANDs + ~20
    additions).

    Then I'd do multiplication and normalization and rounding in Base_1e18.

    Then I'd convert from Base_1e18 to Base_1000. The ideas of such
    conversion are similar to fast binary-to-BCD conversion that I
    demonstrated here a decade or so ago. AVX2 could be quite helpful at that
    stage.

    Then I'd have to convert the result from Base_1000 to DPD. Here, again,
    11 table look-ups + plenty of ANDs/shift/ORs seem inevitable.
    May be, at that stage SIMD gather can be of help, but I have my doubts.
    So far, every time I tried gather I was disappointed with performance.

    Overall, even with a seemingly decent plan like the one sketched above, I'd expect DPD multiplication to be 2.5x to 3x slower than BID. But, then again,
    in the past my early performance estimates were wrong quite often.


    I decided to start working on a mockup (quickly thrown together).
       I don't expect to have much use for it, but meh.


    It works by packing/unpacking the values into an internal format along
    vaguely similar lines to the .NET format, just bigger to accommodate
    more digits:
       4x 32-bit values each holding 9 digits
         Except the top one generally holding 7 digits.
       16-bit exponent, sign byte.

    Then wrote a few pack/unpack scenarios:
       X30: Directly packing 20/30 bit chunks, non-standard;
       DPD: Use the DPD format;
       BID: Use the BID format.

    For the pack/unpack step (taken in isolation):
       X30 is around 10x faster than either DPD or BID;
       Both DPD and BID need a similar amount of time.
         BID needs a bunch of 128-bit arithmetic handlers.
         DPD needs a bunch of merge/split and table lookups.
         Seems to mostly balance out in this case.


    For DPD, merge is effectively:
       Do the table lookups;
       v=v0+(v1*1000)+(v2*1000000);
    With a split step like:
       v0=v;
       v1=v/1000;
       v0-=v1*1000;
       v2=v1/1000;
       v1-=v2*1000;
       Then, use table lookups to go back to DPD.

    Did look into possible faster ways of doing the splitting, but then
    noted that I have not yet found a faster way that gives correct results
    (where one can assume the compiler already knows how to turn divide by
    constant into multiply by reciprocal).


    At first it seemed like a strong reason to favor X30 over either DPD
    or BID. Except, that the cost of the ADD and MUL operations
    effectively dwarf that of the pack/unpack operations, so the relative
    cost difference between X30 and DPD may not matter much.


    As is, it seems MUL and ADD cost roughly 6x more than the
    DPD pack/unpack steps.

    So, it seems, while DPD pack/unpack isn't free, it is not something
    that would lead to X30 being a decisive win either in terms of
    performance.



    It might make more sense, if supporting BID, to just do it as its own
    thing (and embrace just using a bunch of 128-bit arithmetic, and a
    128*128=>256 bit widening multiply, ...). Also, can note that the BID
    case ends up needing a lot more clutter, mostly again because C lacks
    native support for 128-bit arithmetic.

    If working based on digit chunks, likely better to stick with DPD due
    to less clutter, etc. Though, this part would be less bad if C had had
    widespread support for 128-bit integers.



    Though, in this case, the ADD and MUL operations currently work by
    internally doubling the width and then narrowing the result after
    normalization. This is slower, but could give exact results.


    Though, still not complete nor confirmed to produce correct results.



    But, yeah, might be more worthwhile to look into digit chunking:
       12x  3 digits (16b chunk)
       4x   9 digits (32b chunk)
       2x  18 digits (64b chunk)
       3x  12 digits (64b chunk)

    Likely I think:
    3 digits, likely slower because of needing significantly more operations;
    9 digits, seemed sensible, option I went with, internal operations
    fully fit within the limits of 64 bit arithmetic;
    18 digits, possible, but runs into many cases internally that would
    require using 128-bit arithmetic.

    12 digits, fits more easily into 64-bit arithmetic, but would still
    sometimes exceed it; and isn't that much more than 9 digits (but would
    reduce the number of chunks needed from 4 to 3).


    While 18 digits conceptually needs fewer abstract operations than 9
    digits, it would suffer the drawback of many of these operations being
    notably slower.

    However, if running on RV64G with the standard ABI, it is likely the
    9-digit case would also take a performance hit due to sign-extended
    unsigned int (and needing to spend 2 shifts whenever zero-extending a
    value).


    With 3x 12 digits, while not exactly the densest scheme, there is a
    little more "working space", so it would reduce the cases which exceed the
    limits of 64-bit arithmetic. Well, except multiply, where 24 > 18 ...

    The main merit of 9 digit chunking here being that it fully stays
    within the limits of 64-bit arithmetic (where multiply temporarily
    widens to working with 18 digits, but then narrows back to 9 digit
    chunks).

    Also 9 digit chunking may be preferable when one has a faster
    32*32=>64 bit multiplier, but 64*64=>128 is slower.


    One other possibility could be to use BCD rather than chunking, but I
    expect BCD emulation to be painfully slow in the absence of ISA level
    helpers.


    I don't know yet if my implementation of DPD is actually correct.

    Seems Decimal128 DPD is obscure enough that I don't currently have any alternate options to confirm if my encoding is correct.

    Here is an example value:
      2DFFCC1AEB53B3FB_B4E262D0DAB5E680

    Which, in theory, should resemble PI.


    Annoyingly, it seems like pretty much everyone else either went with
    BID, or with other non-standard Decimal encodings.

    Can't seem to find:
      Any examples of hard-coded numbers in this format on the internet;
      Any obvious way to generate them involving "stuff I already have".
        As, in, not going and using some proprietary IBM library or similar.

    Also Grok wasn't much help here, just keeps trying to use Python's "decimal", which quickly becomes obvious is not using Decimal128 (much
    less DPD), but seemingly some other 256-bit format.

    And, Grok fails to notice that what it is saying is nowhere close to
    correct in this case.

    Neither DeepSeek nor QWen being much help either... Both just sort of go down a rabbit hole, and eventually fall back to "Here is how you might
    go about trying to decode this format...".


    Not helpful, I more would just want some way to confirm whether or not I
    got the format correct.

    Which is easier if one has some example numbers or something that they
    can decode and verify the value, or something that is able to decode
    these numbers (which isn't just trying to stupidly shove it into
    Python's Decimal class...).


    Looking around, there is Decimal128 support in MongoDB/BSON, PyArrow,
    and Boost C++, but in these cases, less helpful because they went with BID.

    ...




    Checking, after things are a little more complete, in MHz (millions of
    times per second), on my desktop PC:
      DPD Pack/Unpack: 63.7 MHz (58 cycles)
      X30 Pack/Unpack: 567 MHz  ( 7 cycles) ?...

      FMUL (unwrap)  : 21.0 MHz (176 cycles)
      FADD (unwrap)  : 11.9 MHz (311 cycles)

      FDIV           :  0.4 MHz (very slow; Newton Raphson)

      FMUL (DPD)     : 11.2 MHz (330 cycles)
      FADD (DPD)     :  8.6 MHz (430 cycles)
      FMUL (X30)     : 12.4 MHz (298 cycles)
      FADD (X30)     :  9.8 MHz (378 cycles)

    The relative performance impact of the wrap/unwrap step is somewhat
    larger than expected (vs the unwrapped case).

    Though, there seems to only be a small difference here between DPD and
    X30 (so, likely whatever is affecting performance here is not directly related to the cost of the pack/unpack process).

    The wrapped cases basically just add a wrapper function that unpacks the input values to the internal format, and then re-packs the result.

    For using the wrapped functions to estimate pack/unpack cost:
      DPD cost: 51 cycles.
      X30 cost: 41 cycles.


    Not really a good way to make X30 much faster. It does pay for the cost
    of dealing with the combination field.

    Not sure why they would be so close:
      DPD case does a whole lot of stuff;
      X30 case is mostly some shifts and similar.

    Though, in this case, it does use these functions by passing/returning structs by value. It is possible a by-reference design might be faster
    in this case.


    This could possibly be cheapened slightly by going to, say:
      S.E13.M114
    In effect trading off some exponent range for cheaper handling of the exponent.


    Can note:
      MUL and ADD use double-width internal mantissa, so should be accurate;
      Current test doesn't implement rounding modes though, could do so.
        Currently hard-wired at Round-Nearest-Even.

    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary FP.
    Partly as the strategy for generating the initial guess is far less accurate.

    So, it first uses a loop with hard-coded checks and scales to get it in
    the general area, before then letting N-R take over. If the value isn't close enough (seemingly +/- 25% or so), N-R flies off into space.

    Namely:
      Exponent is wrong:
        Scale by factors of 2 until correct;
      Off by more than 50%, scale by +/- 25%;
      Off by more than 25%, scale by +/- 12.5%;
      Else: Good enough, let normal N-R take over.

    Precondition step is usually simpler with Binary-FP as the initial guess
    is usually within the correct range. So, one can use a single modified
    N-R step (that undershoots) followed by letting N-R take over.

    More of an issue though when the initial guess is "maybe within a factor
    of 10" because the usual reciprocal-approximation strategy used for Binary-FP isn't quite as effective.


    ...


    Still don't have a use-case, mostly just messing around with this...



    When I built my decimal float code I ran into the same issue. There are
    not really examples on the web. I built integer to decimal-float and decimal-float to integer converters then compared results.

    Some DFP encodings for 1,10,100,1000,1000000,12345678 (I hope these are
    right, no guarantees).
    Integer (hex)                        Decimal-float (hex)
    u 00000000000000000000000000000001 25ffc000000000000000000000000000
    u 0000000000000000000000000000000a 26000000000000000000000000000000
    u 00000000000000000000000000000064 26004000000000000000000000000000
    u 000000000000000000000000000003e8 26008000000000000000000000000000
    u 000000000000000000000000000f4240 26014000000000000000000000000000
    u 00000000000000000000000000bc614e 2601934b9c0c00000000000000000000
    u 00000000000000000000000000000002 29ffc000000000000000000000000000


    I have used the decimal float code (96 bit version) with Tiny BASIC and
    it seems to work.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Nov 7 22:30:36 2025
    From Newsgroup: comp.arch

    Cache-line constants were tried with the StarkCPU and seemed to work
    fine, but wasted cache-line space when constants and instructions could
    not be packed evenly into the cache-line.

    However, for Qupls2026 using constants stored on the cache-line might be
    just as efficient storage wise as having the constants follow
    instruction words because of the 48-bit word width. Constants typically
    do not need to be multiples of 48 bits. If stored on the cache-line they
    could be multiples of 16-bits. There are potentially 32-bits of wasted
    space if an instruction is not able to be packed onto the cache-line.
    There may just be as much wasted space due to the support of over-sized constants in-line with 48-bit parcels. A 32-bit constant uses 48 bits,
    wasting 16-bits of storage.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Nov 8 00:34:37 2025
    From Newsgroup: comp.arch

    <snip>>
    Here is an example value:
      2DFFCC1AEB53B3FB_B4E262D0DAB5E680

    <snip>

    I multiplied PI by 10^31 and ran it through the int to decimal-float converter. It should give the same sequence of digits although the
    exponent may be off.

    2e078c2aeb53b3fbb4e262d0dab5e680

    The sequence of digits is the same, except it begins C2 instead of C1.

    <snip>

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Nov 8 01:30:43 2025
    From Newsgroup: comp.arch

    On 11/7/2025 11:34 PM, Robert Finch wrote:
    <snip>>
    Here is an example value:
       2DFFCC1AEB53B3FB_B4E262D0DAB5E680

    <snip>

    I multiplied PI by 10^31 and ran it through the int to decimal-float converter. It should give the same sequence of digits although the
    exponent may be off.

    2e078c2aeb53b3fbb4e262d0dab5e680

    The sequence of digits is the same, except it begins C2 instead of C1.


    Does appear to work, mostly, but decodes as:
    31425926535897932384626433832795.0

    Well, except some of the digits don't match up with PI...

    one of the examples from the prior post decodes as:
    12345678.0


    But, yeah, mostly getting consistency across multiple implementations
    does imply that I have implemented the base format correctly.


    As for use-case, this is less clear. It is likely to be slower than the
    usual Binary128 format.

    And, likewise, it would appear that BID is slightly more popular, though
    both less common than people just rolling their own formats.

    So, it looks like:
    Boost, MongoDB, PyArrow: BID
    Python, Java: Custom formats
    .NET: Custom format.

    Leaving mine, yours, and IBM's decNumber, as using DPD.

    It looks like decNumber is using BCD internally.
    Mine is using a "9 digits in 32-bit chunks" scheme.

    In the case of the .NET format, it uses 9-digit chunks, so it is pretty
    obvious it is probably using 9-digit chunks internally as well.


    I left the BID code out of my example.

    Partly because I realized the reason the BID case was coming out at basically
    the same speed as DPD was that I was in effect still using DPD. If BID
    were actually used, it would be somewhat slower than DPD.

    It is more likely that for BID to be effective, it would need to be implemented directly using 128-bit math (likely as its own thing).

    I also had my experimental X30 variant, which can be slightly faster
    than DPD, but seems the relative savings would be small. Though, the
    cost estimates in my microbenchmarks are not showing consistent results.
    It is looking like some sort of weirdness is going on.


    Also the micro benchmarks don't test for values with varied levels of normalization, which is likely to affect performance.

    And, can note that it seems my code and decNumber was very different
    regarding the handling of normalization.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Nov 8 10:02:24 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    void engine(char *source)
    {
    void *insts[] = {&&add, &&load, &&ip, ...};

    void **ip=compile_to_vm_code(source,insts);

    goto *ip++;

    add:
    ...
    goto *ip++;

    One problem with assigned GOTO is data-flow analysis for a compiler.

    Compilers typically break down structured control flow into GOTO
    and then perform analysis. A label whose address is assigned
    anywhere in the program unit to a variable must be considered to
    be reachable by any GOTO to said variable, so any variable in that
    piece of code must be in a known place (i.e. memory). If it
    is kept in a register in some places that could jump to that
    particular label, the contents of that register must be stored
    to memory before the jump is executed. Alternatively, memory
    allocation must make sure that the same register is always used.

    This was probably less of a problem when assigned goto was invented
    (I assume this was for FORTRAN 66) when few variables were kept in
    registers, and register allocation was in its infancy. Now, this is
    a much bigger impediment to optimization.
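
    As a small GNU C illustration of the same point (hypothetical code,
    labels-as-values standing in for the assigned goto): once a label's
    address escapes, the compiler must assume every indirect goto can reach
    it, so anything live across such a goto has to survive in memory or in
    a consistently chosen register.

    long f(long x, int sel)
    {
        void *p = sel ? &&lab : &&out;  /* &&lab escapes into p               */

        x = x * 3 + 1;                  /* live across the indirect goto      */
        goto *p;                        /* may reach any address-taken label  */

    lab:
        return x + 1;                   /* so x must be available here ...    */
    out:
        return 0;                       /* ... constraining its allocation    */
    }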

    In other words, assigned goto confuses both programmers and
    compilers.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Nov 8 11:28:36 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> schrieb:

    I don't know yet if my implementation of DPD is actually correct.

    The POWER ISA has a pretty good description, see the OpenPower
    foundation.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Nov 8 14:11:33 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    The constant ROM[specifier] seems to be the easiest way of taking
    5-bits and converting it into a FP number. It was only a few weeks
    ago that we changed the range from {-31..+31} to {-15.5..+15.5} as
    this covers <slightly> more fp constant uses.

    These days, I would assume that software would choose between a
    ROM and random logic with a specification. I gave this a spin,
    again using espresso, followed by Berkeley ABC.

    5-bit FP constants in My 66000 are effectively sign + magnitude,
    which makes the logic quite simple; the sign can be just passed
    through. The equations (e7 down to e0 are exponent bits, m22 down
    to m0 are mantissa bits) for converting are

    e7 = (i4) | (i3) | (i2);
    e6 = (!i4&!i3&!i2&i1) | (!i4&!i3&!i2&i0);
    e5 = (!i4&!i3&!i2&i1) | (!i4&!i3&!i2&i0);
    e4 = (!i4&!i3&!i2&i1) | (!i4&!i3&!i2&i0);
    e3 = (!i4&!i3&!i2&i1) | (!i4&!i3&!i2&i0);
    e2 = (!i4&!i3&!i2&i1) | (!i4&!i3&!i2&i0);
    e1 = (!i3&!i2&i1) | (!i3&!i2&i0) | (i4);
    e0 = (!i4&!i2&i1) | (!i4&i3);
    m22 = (!i4&!i3&i1&i0) | (!i4&i2&i1) | (i4&i3) | (i3&i2);
    m21 = (!i4&i3&i1) | (i4&i2) | (!i3&i2&i0);
    m20 = (!i4&i3&i0) | (i4&i1);
    m19 = (i4&i0);

    Sign is separate and not shown, all other mantissa bits are
    always zero. ABC, optimizing for area, turns this into (in BLIF format,
    which is halfway readable)

    .model i2f
    .inputs i4 i3 i2 i1 i0
    .outputs e7 e6 e5 e4 e3 e2 e1 e0 m22 m21 m20 m19

    .gate NOR2_X1 A1=i4 A2=i2 ZN=new_n18
    .gate INV_X1 A=i3 ZN=new_n19
    .gate NAND2_X1 A1=new_n18 A2=new_n19 ZN=e7
    .gate INV_X1 A=i1 ZN=new_n21
    .gate INV_X1 A=i0 ZN=new_n22
    .gate AOI21_X1 A=e7 B1=new_n21 B2=new_n22 ZN=e6
    .gate BUF_X1 A=e6 Z=e5
    .gate BUF_X1 A=e6 Z=e4
    .gate BUF_X1 A=e6 Z=e3
    .gate BUF_X1 A=e6 Z=e2
    .gate OR2_X1 A1=e6 A2=i4 ZN=e1
    .gate INV_X1 A=i4 ZN=new_n29
    .gate NAND2_X1 A1=new_n29 A2=i3 ZN=new_n30
    .gate INV_X1 A=new_n18 ZN=new_n31
    .gate OAI21_X1 A=new_n30 B1=new_n31 B2=new_n21 ZN=e0
    .gate AOI21_X1 A=i2 B1=new_n19 B2=i0 ZN=new_n33
    .gate NAND2_X1 A1=new_n29 A2=i1 ZN=new_n34
    .gate OAI22_X1 A1=new_n33 A2=new_n34 B1=new_n19 B2=new_n18 ZN=m22
    .gate AOI21_X1 A=i4 B1=new_n19 B2=i0 ZN=new_n36
    .gate INV_X1 A=i2 ZN=new_n37
    .gate OAI22_X1 A1=new_n36 A2=new_n37 B1=new_n30 B2=new_n21 ZN=m21
    .gate OAI22_X1 A1=new_n30 A2=new_n22 B1=new_n29 B2=new_n21 ZN=m20
    .gate NOR2_X1 A1=new_n29 A2=new_n22 ZN=m19
    .end

    The inverter gates on the input bits are not needed when they come
    from flip-flops, and I am also not sure the buffers are needed.
    If both are taken out, 14 gates are left, which is not a lot
    (I assume that this is smaller than a small ROM, but I don't know).
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sat Nov 8 10:31:54 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    No, and I defer to you, or others here, on how these features are
    implemented, specifically whether code modification is required. I was
    referring to features such as assigned goto in Fortran, and Alter goto
    in Cobol.

    On modern architectures higher-order functions are implemented with
    indirect branches or indirect calls (depending on whether it's a
    tail-call or not); likewise for method dispatch.

    I do not know how Lisp, FORTRAN, Algol 60 and other early languages
    with higher-order functions were implemented on architectures that do
    not have indirect branches; but if the assigned goto was implemented
    with self-modifying code, the call to a function in a variable was
    probably implemented like that, too.

    What architecture cannot do an indirect branch, which I assume
    means a branch/jump to a variable location in a register?
    And how would the operating system on such a machine get programs running?

    Even if an ISA did not have a JMP reg instruction one can create it
    using CALL to copy the IP to the stack where you modify it and
    RET to pop the new IP value.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 8 18:04:04 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    void engine(char *source)
    {
    void *insts[] = {&&add, &&load, &&ip, ...};

    void **ip=compile_to_vm_code(source,insts);

    goto *ip++;

    add:
    ...
    goto *ip++;

    One problem with assigned GOTO is data flow analysis for a compiler.

    Compilers typically break down structured control flow into GOTO
    and then perform analysis. A label whose address is assigned
    anywhere in the program unit to a variable must be considered to
    be reachable by any GOTO to said variable, so any variable in that
    piece of code must be in a known place (i.e. memory). If it
    is kept in a register in some places that could jump to that
    particular label, the contents of that register must be stored
    to memory before the jump is executed. Alternatively, memory
    allocation must make sure that the same register is always used.

    This was probably less of a problem when assigned goto was invented
    (I assume this was for FORTRAN 66)

    I think FORTRAN 66 inherited from FORTRAN II or even FORTRAN (1),
    it was available in WATFOR and WATFIV.

    when few variables were kept in
    registers, and register allocation was in its infancy. Now, this is
    a much bigger impediment to optimization.

    In other words, assigned goto confuses both programmers and
    compilers.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 8 18:08:28 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    The constant ROM[specifier] seems to be the easiest way of taking
    5-bits and converting it into a FP number. It was only a few weeks
    ago that we changed the range from {-31..+31} to {-15.5..+15.5} as
    this covers <slightly> more fp constant uses.

    These days, I would assume that software would choose between a
    ROM and random logic with a specification. I gave this a spin,
    again using espresso, followed by Berkeley ABC.

    5-bit FP constants in My 66000 are effectively sign + magnitude,
    which makes the logic quite simple; the sign can be just passed
    through. The equations (e7 down to e0 are exponent bits, m22 down
    to m0 are mantissa bits) for converting are

    e7 = (i4) | (i3) | (i2);
    e6 = (!i4&!i3&!i2&i1) | (!i4&!i3&!i2&i0);
    e5 = (!i4&!i3&!i2&i1) | (!i4&!i3&!i2&i0);
    e4 = (!i4&!i3&!i2&i1) | (!i4&!i3&!i2&i0);
    e3 = (!i4&!i3&!i2&i1) | (!i4&!i3&!i2&i0);
    e2 = (!i4&!i3&!i2&i1) | (!i4&!i3&!i2&i0);
    e1 = (!i3&!i2&i1) | (!i3&!i2&i0) | (i4);
    e0 = (!i4&!i2&i1) | (!i4&i3);
    m22 = (!i4&!i3&i1&i0) | (!i4&i2&i1) | (i4&i3) | (i3&i2);
    m21 = (!i4&i3&i1) | (i4&i2) | (!i3&i2&i0);
    m20 = (!i4&i3&i0) | (i4&i1);
    m19 = (i4&i0);

    Then you need a multiplexer to Mux between (double) and (float).

    With a special case of 0.0, the range is 0.5..15.5 so I think only
    3 exponent bits need computed/created:: exponent range {-1..+4}.

    Sign is separate and not shown, all other mantissa bits are
    always zero. ABC, optimizing for area, turns into (in BLIF format,
    which is halfway readable)

    .model i2f
    .inputs i4 i3 i2 i1 i0
    .outputs e7 e6 e5 e4 e3 e2 e1 e0 m22 m21 m20 m19

    .gate NOR2_X1 A1=i4 A2=i2 ZN=new_n18
    .gate INV_X1 A=i3 ZN=new_n19
    .gate NAND2_X1 A1=new_n18 A2=new_n19 ZN=e7
    .gate INV_X1 A=i1 ZN=new_n21
    .gate INV_X1 A=i0 ZN=new_n22
    .gate AOI21_X1 A=e7 B1=new_n21 B2=new_n22 ZN=e6
    .gate BUF_X1 A=e6 Z=e5
    .gate BUF_X1 A=e6 Z=e4
    .gate BUF_X1 A=e6 Z=e3
    .gate BUF_X1 A=e6 Z=e2
    .gate OR2_X1 A1=e6 A2=i4 ZN=e1
    .gate INV_X1 A=i4 ZN=new_n29
    .gate NAND2_X1 A1=new_n29 A2=i3 ZN=new_n30
    .gate INV_X1 A=new_n18 ZN=new_n31
    .gate OAI21_X1 A=new_n30 B1=new_n31 B2=new_n21 ZN=e0
    .gate AOI21_X1 A=i2 B1=new_n19 B2=i0 ZN=new_n33
    .gate NAND2_X1 A1=new_n29 A2=i1 ZN=new_n34
    .gate OAI22_X1 A1=new_n33 A2=new_n34 B1=new_n19 B2=new_n18 ZN=m22
    .gate AOI21_X1 A=i4 B1=new_n19 B2=i0 ZN=new_n36
    .gate INV_X1 A=i2 ZN=new_n37
    .gate OAI22_X1 A1=new_n36 A2=new_n37 B1=new_n30 B2=new_n21 ZN=m21
    .gate OAI22_X1 A1=new_n30 A2=new_n22 B1=new_n29 B2=new_n21 ZN=m20
    .gate NOR2_X1 A1=new_n29 A2=new_n22 ZN=m19
    .end

    The inverter gates on the input bits are not needed when they come
    from flip-flops, and I am also not sure the buffers are needed.
    If both are taken out, 14 gates are left, which is not a lot
    (I assume that this is smaller than a small ROM, but I don't know).
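
    For illustration, here is a table-lookup version of the ROM[specifier]
    idea in C. The value mapping below (0.0 plus 0.5..15.5 in steps of 0.5,
    with the sign handled separately) is assumed from the range described
    above; it is a sketch, not the actual My 66000 definition:

    #include <stdio.h>

    /* Decode a 5-bit FP-constant specifier by table lookup, equivalent in
       spirit to a 32-entry ROM; the random logic above computes the same
       value directly.  Assumed mapping: spec 0 -> 0.0, spec 1..31 ->
       0.5..15.5 in 0.5 steps, sign applied separately. */
    static double decode_fp5(unsigned spec, int negative)
    {
        static double rom[32];
        static int filled = 0;
        if (!filled) {                     /* fill the "ROM" once */
            for (int i = 0; i < 32; i++)
                rom[i] = i * 0.5;
            filled = 1;
        }
        double v = rom[spec & 31];
        return negative ? -v : v;
    }

    int main(void)
    {
        printf("%g %g %g\n", decode_fp5(0, 0), decode_fp5(1, 0),
               decode_fp5(31, 1));          /* prints: 0 0.5 -15.5 */
        return 0;
    }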

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 8 18:13:59 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    No, and I defer to you, or others here, on how these features are
    implemented, specifically whether code modification is required. I was referring to features such as assigned goto in Fortran, and Alter goto
    in Cobol.

    On modern architectures higher-order functions are implemented with indirect branches or indirect calls (depending on whether it's a
    tail-call or not); likewise for method dispatch.

    I do not know how Lisp, FORTRAN, Algol 60 and other early languages
    with higher-order functions were implemented on architectures that do
    not have indirect branches; but if the assigned goto was implemented
    with self-modifying code, the call to a function in a variable was
    probably implemented like that, too.

    What architecture cannot do an indirect branch, which I assume
    means a branch/jump to a variable location in a register?

    PDP-8, 4004, IBM 650, ... And any machine without "registers".

    And how would the operating system on such a machine get programs running?

    Load them at a known location and branch to the known location.

    Even if an ISA did not have a JMP reg instruction one can create it
    using CALL to copy the IP to the stack where you modify it and
    RET to pop the new IP value.

    Pure stack machines did a lot of this.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Nov 8 18:25:18 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    What architecture cannot do an indirect branch, which I assume
    means a branch/jump to a variable location in a register?

    Or, in case of the 6502, in memory.

    I don't know of any architecture (except maybe some one-instruction proof-of-concepts) that does not have indirect branches in one form or
    another, but I am not that familiar with architectures from the 1950s
    or some of the extremely deprived embedded-control processors.

    Maybe the thing about self-modifying code was thrown in to taint the
    assigned goto through guilt-by-association.

    Even if an ISA did not have a JMP reg instruction one can create it
    using CALL to copy the IP to the stack where you modify it and
    RET to pop the new IP value.

    In most cases that is possible (even if the return address is stored
    in a register and not on the stack), but the return addresses might
    live on a separate stack (IIRC the Intel 8008 or the 8080 has such a
    stack), and the call might be the only thing that pushes on that
    stack. But yes, in most cases, it's a good argument that even very
    deprived processors usually have some form of indirect branching.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sat Nov 8 20:56:33 2025
    From Newsgroup: comp.arch

    On Sat, 08 Nov 2025 18:25:18 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    What architecture cannot do an indirect branch, which I assume
    means a branch/jump to a variable location in a register?

    Or, in case of the 6502, in memory.

    I don't know of any architecture (except maybe some one-instruction proof-of-concepts) that does not have indirect branches in one form or another, but I am not that familiar with architectures from the 1950s
    or some of the extremely deprived embedded-control processors.

    Maybe the thing about self-modifying code was thrown in to taint the
    assigned goto through guilt-by-association.

    Even if an ISA did not have a JMP reg instruction one can create it
    using CALL to copy the IP to the stack where you modify it and
    RET to pop the new IP value.

    In most cases that is possible (even if the return address is stored
    in a register and not on the stack), but the return addresses might
    live on a separate stack (IIRC the Intel 8008 or the 8080 has such a
    stack), and the call might be the only thing that pushes on that
    stack. But yes, in most cases, it's a good argument that even very
    deprived processors usually have some form of indirect branching.

    - anton

    I would imagine that in old times a return instruction was less common
    than indirect addressing itself.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Nov 8 18:37:48 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    void engine(char *source)
    {
    void *insts[] = {&&add, &&load, &&ip, ...};

    void **ip=compile_to_vm_code(source,insts);

    goto *ip++;

    add:
    ...
    goto *ip++;

    One problem with assigned GOTO is data flow analysis for a compiler.

    Compilers typically break down structured control flow into GOTO
    and then perform analysis. A label whose address is assigned
    anywhere in the program unit to a variable must be considered to
    be reachable by any GOTO to said variable, so any variable in that
    piece of code must be in a known place (i.e. memory). If it
    is kept in a register in some places that could jump to that
    particular label, the contents of that register must be stored
    to memory before the jump is executed. Alternatively, memory
    allocation must make sure that the same register is always used.

    The data flow analysis for labels-as-values (and assigned goto) is
    just the same as for any other control flow. Every goto * has to be
    considered to potentially jump to any label whose address is taken
    with &&label, just as a switch has to be considered to go to any of
    the case labels, an if has to be considered to go to either of the two
    paths. Similarly, a label has to be considered to be reachable from
    any of the gotos that jump to it, and the statement behind a switch
    statement has to be considered to be reachable from any of the break
    statements in the switch statement. So, having many outgoing or
    incoming control flow edges is nothing that only labels-as-values
    produces. Consider that the replicated switch is intended to produce
    a control-flow graph that's as close as possible to the one produced
    by using labels-as-values.
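
    For concreteness, a minimal self-contained labels-as-values dispatcher
    (GCC/Clang extension); the toy opcodes and the hand-built "program"
    below are made up for illustration:

    #include <stdio.h>

    /* Toy VM with three opcodes dispatched via labels-as-values.  Every
       "goto *ip++" can reach any label whose address was taken with &&,
       which is exactly the edge structure discussed above. */
    static int run(void)
    {
        void *insts[] = { &&op_push1, &&op_add, &&op_halt };
        void *prog[]  = { insts[0], insts[0], insts[1], insts[2] }; /* 1 1 + halt */
        void **ip = prog;
        int stack[16], *sp = stack;

        goto *ip++;

    op_push1:
        *sp++ = 1;
        goto *ip++;
    op_add:
        sp--;
        sp[-1] += sp[0];
        goto *ip++;
    op_halt:
        return sp[-1];
    }

    int main(void)
    {
        printf("%d\n", run());   /* prints 2 */
        return 0;
    }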

    Concerning register allocation (never heard of memory allocation), of
    course variables have to live in the same register or memory location
    at either end of a control-flow edge; and when multiple control-flow
    edges start or end at the same point, they have to live in the same
    location for all of these edges.

    This is certainly something that gcc has known how to do from when labels-as-values were introduced in 2.0 (admittedly I only tried using
    it a few months later, when the version was already at 2.2.2).

    There have been a few episodes (e.g., in gcc-3.0 and 3.1) when gcc put
    a lot of register-memory-shuffling code in each VM instruction, but
    they were fixed, or we found a workaround (a recent case was due to auto-vectorization, and we fixed it with -fno-tree-vectorize, which
    would be counterproductive for the engine() function anyway).

    As for the control-flow, all these edges going from every goto to
    every label whose address is taken lead to a quadratic number of
    control-flow edges, so starting with gcc-3.x gcc replaced all goto *
    with gotos to a common goto *. So now you have m edges to that goto *
    (for m instances of goto * in the source code) and n edges from that
    goto * to the labels whose address is taken (for n such labels),
    resulting in n+m edges instead of n*m edges. During the 3.x and early
    4.x series gcc failed to turn the jump-to-indirect-jump instructions
    back into plain indirect-jump instructions afterwards, but they have
    fixed that later in the 4.x series, and that works now (we still have workarounds for that in Gforth).

    By contrast, clang completely drops the ball: First of all, it takes
    forever to compile the code, and then the code contains lots of
    shuffling between registers and memory, leading to low performance.
    Why is clang doing worse in 2021 (and probably in 2025, too) than gcc
    was doing in 1992?

    I described this in <2021May29.164810@mips.complang.tuwien.ac.at>,
    here are some of the data from there:

    Building gforth on a Ryzen 5800X:

    |         gcc10       clang11
    |         make -j     make -j
    | real    11.930s     33m22.542s
    | user    53.876s     143m45.884s
    | sys      3.110s     22.699s

    Running Gforth's small benchmarks:

    | Time in seconds user time
    | sieve bubble matrix fib fft
    | 0.056 0.055 0.034 0.047 0.021 Ryzen 5800X gcc-10
    | 1.100 0.933 0.970 1.265 0.560 Ryzen 5800X clang-11
    |
    |I looked at the generated code, and for a primitive like + which can
    |be done in 4 instructions and which gcc-10 does in 5 instructions:
    |
    |563FB2DED3BF: add r13,$08
    |563FB2DED3C3: add r15,$08
    |563FB2DED3C7: add r8,$00[r13]
    |563FB2DED3CB: mov rcx,-$08[r15]
    |563FB2DED3CF: jmp ecx
    |
    |clang-11 produces 183 instructions for +.

    This was probably less of a problem when assigned goto was invented
    (I assume this was for FORTRAN 66) when few variables were kept in
    registers, and register allocation was in its infancy. Now, this is
    a much bigger impediment to optimization.

    On what basis do you make this claim? Labels-as-values does not
    impede optimization, so why should the assigned goto do so?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Nov 8 19:32:47 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    This was probably less of a problem when assigned goto was invented
    (I assume this was for FORTRAN 66)

    I think FORTRAN 66 inherited from FORTRAN II or even FORTRAN (1),
    it was available in WATFOR and WATFIV.

    I looked it up: It was at least in Fortran II, according to https://archive.computerhistory.org/resources/text/Fortran/102663119.05.01.acc.pdf
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sat Nov 8 21:47:18 2025
    From Newsgroup: comp.arch

    On Sat, 08 Nov 2025 18:13:59 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    No, and I defer to you, or others here, on how these features
    are implemented, specifically whether code modification is
    required. I was referring to features such as assigned goto in
    Fortran, and Alter goto in Cobol.

    On modern architectures higher-order functions are implemented
    with indirect branches or indirect calls (depending on whether
    it's a tail-call or not); likewise for method dispatch.

    I do not know how Lisp, FORTRAN, Algol 60 and other early
    languages with higher-order functions were implemented on
    architectures that do not have indirect branches; but if the
    assigned goto was implemented with self-modifying code, the call
    to a function in a variable was probably implemented like that,
    too.

    What architecture cannot do an indirect branch, which I assume
    means a branch/jump to a variable location in a register?

    PDP-8,

    PDP-8 has an indirect jump through an address stored in memory.
    It also counts.

    4004,

    Are you sure?
    http://www.e4004.szyc.org/iset.html


    IBM 650,

    Sounds like that.
    It seems that the earlier, but more expensive, IBM 702 already had
    indirect jumps through the content of a word in memory.


    ... And any machine without "registers".


    Not necessarily.
    Indirect jump through word in memory also counts.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sat Nov 8 21:07:01 2025
    From Newsgroup: comp.arch

    It appears that Anton Ertl <anton@mips.complang.tuwien.ac.at> said:
    I don't know of any architecture (except maybe some one-instruction proof-of-concepts) that does not have indirect branches in one form or another, but I am not that familiar with architectures from the 1950s
    or some of the extremely deprived embedded-control processors.

    Some of the 1950s machines didn't have indirect branches. You got the
    effect by patching the address into a branch instruction and then
    flowing or jumping to it.

    Maybe the thing about self-modifying code was thrown in to taint the
    assigned goto through guilt-by-association.

    if you want guilt by association, the word is ALTER.

    stack. But yes, in most cases, it's a good argument that even very
    deprived processors usually have some form of indirect branching.

    I agree. Indirect addressing and indexing appeared quite early.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sat Nov 8 21:08:39 2025
    From Newsgroup: comp.arch

    According to Michael S <already5chosen@yahoo.com>:
    I would imagine that in old times a return instruction was less common
    than indirect addressing itself.

    On several of the machines I used a subroutine call stored the return
    address in the first word of the routine and branched to that address+1.
    The return was just an indirect jump.

    Stacks? What's a stack? We barely had registers.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sat Nov 8 21:14:22 2025
    From Newsgroup: comp.arch

    According to Thomas Koenig <tkoenig@netcologne.de>:
    This was probably less of a problem when assigned goto was invented
    (I assume this was for FORTRAN 66) ..

    Not 1966, 1956. It was in the original FORTRAN compiler.

    In its defense, there were no user defined subroutines so that
    was how you faked it. The biggest improvement in FORTRAN II
    was SUBROUTINE and FUNCTION.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sun Nov 9 17:06:18 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    No, and I defer to you, or others here, on how these features are
    implemented, specifically whether code modification is required. I was referring to features such as assigned goto in Fortran, and Alter goto in Cobol.

    On modern architectures higher-order functions are implemented with
    indirect branches or indirect calls (depending on whether it's a
    tail-call or not); likewise for method dispatch.

    I do not know how Lisp, FORTRAN, Algol 60 and other early languages
    with higher-order functions were implemented on architectures that do
    not have indirect branches; but if the assigned goto was implemented
    with self-modifying code, the call to a function in a variable was
    probably implemented like that, too.

    What architecture cannot do an indirect branch, which I assume
    means a branch/jump to a variable location in a register?

    PDP-8, 4004, IBM 650, ... And any machine without "registers".

    To be fair, addresses 10 through 17 in the PDP-8 were effectively auto-increment registers and indirect branches were their
    primary function. The PDP-8 accumulator is considered
    a register, plus the optional multiply hardware provided additional
    registers although they couldn't be used with branch instructions.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Nov 9 13:01:56 2025
    From Newsgroup: comp.arch

    John Levine wrote:
    According to Michael S <already5chosen@yahoo.com>:
    I would imagine that in old times a return instruction was less common
    than indirect addressing itself.

    On several of the machines I used a subroutine call stored the return
    address in the first word of the routine and branched to that address+1.
    The return was just an indirect jump.

    Stacks? What's a stack? We barely had registers.

    Yes, I saw the PDP-8 did that for JMS Jump Subroutine.
    I've never used one but it looks like by playing with the
    Indirect and Page-zero memory addressing options you could
    treat page-zero a bit like a register bank,
    but also store some short but critical routines in page-zero
    to manually move the return PC to/from a stack.
    And use indirect addressing to access its full sumptuous 4kW address space.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sun Nov 9 20:00:25 2025
    From Newsgroup: comp.arch

    According to Scott Lurndal <slp53@pacbell.net>:
    PDP-8, 4004, IBM 650, ... And any machine without "registers".

    To be fair, addresses 10 through 17 in the PDP-8 were effectively auto-increment registers and indirect branches were their
    primary function. ....

    I did a fair amount of PDP-8 programming and I don't ever recall using
    the auto-index locations for branches. They were used to step
    through a table of data, e.g. to add up a list of numbers:

    10, 1007 ; list starts at 1010

    100, -50 ; list is 50 (octal) long

    CLA
    LOOP,
    TAD I 10
    ISZ 100
    JMP LOOP
    ; sum is in the accumulator

    I suppose you could use them for threaded code, but I didn't run into
    any PDP-8 progams that used that.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Sun Nov 9 20:18:31 2025
    From Newsgroup: comp.arch

    It appears that EricP <ThatWouldBeTelling@thevillage.com> said:
    On several of the machines I used a subroutine call stored the return
    address in the first word of the routine and branched to that address+1.
    The return was just an indirect jump.

    Stacks? What's a stack? We barely had registers.

    Yes, I saw the PDP-8 did that for JMS Jump Subroutine.
    I've never used one but it looks like by playing with the
    Indirect and Page-zero memory addressing options you could
    treat page-zero a bit like a register bank,
    but also store some short but critical routines in page-zero
    to manually move the return PC to/from a stack.
    And use indirect addressing to access its full sumptuous 4kW address space.

    You wouldn't put routines in page zero but you might put pointers to
    them so you could do JMS I 123 to call the routine pointed to by page
    zero location 123. We rarely did recursive stuff so there wasn't any
    need to simulate a stack.

    Storing the return address in the first word was pretty common. Even
    the PDP-6/10 had a JSR instruction that did that. On machines without
    index registers, there's no better place to put the return address.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Nov 9 21:11:52 2025
    From Newsgroup: comp.arch


    John Levine <johnl@taugh.com> posted:

    According to Michael S <already5chosen@yahoo.com>:
    I would imagine that in old times a return instruction was less common
    than indirect addressing itself.

    On several of the machines I used a subroutine call stored the return
    address in the first word of the routine and branched to that address+1.
    The return was just an indirect jump.

    Stacks? What's a stack? We barely had registers.

    Heck, back then we barely had memory !!


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Nov 9 21:14:57 2025
    From Newsgroup: comp.arch


    John Levine <johnl@taugh.com> posted:

    According to Scott Lurndal <slp53@pacbell.net>:
    PDP-8, 4004, IBM 650, ... And any machine without "registers".

    To be fair, addresses 10 through 17 in the PDP-8 were effectively auto-increment registers and indirect branches were their
    primary function. ....

    I did a fair amount of PDP-8 programming and I don't ever recall using
    the auto-index locations for branches. They were used to step
    through a table of data, e.g. to add up a list of numbers:

    10, 1007 ; list starts at 1010

    100, -50 ; list is 50 (octal) long

    CLA
    LOOP,
    TAD I 10
    ISZ 100
    JMP LOOP
    ; sum is in the accumulator

    I suppose you could use them for threaded code, but I didn't run into
    any PDP-8 progams that used that.


    Way back when (1970) I did a bunch of PDP-8 asm--but it is one of the few
    I don't remember enough about to carry a cogent conversation.

    On the other hand it had a decent ALGOL 60 compiler.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sun Nov 9 14:54:28 2025
    From Newsgroup: comp.arch

    On 11/7/2025 9:29 AM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 11/6/2025 11:38 AM, Thomas Koenig wrote:

    [...]

    Here is the head of an output of a little script I wrote to count
    all floating-point constants from My66000 assembler. Note that
    the compiler is for the version that does not yet do 0.5 etc as
    floating point. The first number is the number of occurrences,
    the second one is the constant itself.

    5-bit constants: 886
    32-bit constants: 566
    64-bit constants: 597
    303 0
    290 1
    96 0.5
    81 6
    58 -1
    58 1e-14
    49 2
    46 -2
    45 -8.98846567431158e+307
    44 10
    44 255
    37 8.98846567431158e+307
    29 -0.5
    28 3
    27 90
    27 360
    26 -1e-05
    21 0.0174532925199433
    20 0.9
    18 -3
    17 180
    17 0.1
    17 0.01
    [...]

    Interesting! No values related to pi? And what are the ...e+307 used for?

    If you look closely, you'll see pi/180 in that list. But pi is
    also there (I cut it off the list), it occurs 11 times. And the
    large numbers are +/- DBL_MAX*0.5, I don't know what they are
    used for.

    By comparision, here are the values which are most frequently
    contained in GSL:

    5-bit constants: 5148
    32-bit constants: 3769
    64-bit constants: 3140
    2678 1
    1518 0
    687 -1
    424 2
    329 0.5
    298 -2
    291 2.22044604925031e-16
    275 4.44089209850063e-16
    273 3
    132 -3
    131 -0.5
    131 3.14159265358979
    88 4
    86 1.34078079299426e+154
    77 6
    70 0.25
    70 5
    68 2.2250738585072e-308
    66 10
    64 -4
    50 -6
    46 0.1
    45 5.87747175411144e-39
    43 0.333333333333333
    42 1e+50
    38 6.28318530717959
    35 9
    31 0.2
    30 7
    30 -0.25

    [...]

    So, having values between -15.5 and +15.5 is a choice that will
    cover quite a few floating point constants.

    Agreed. And the switch from +-31 to +-15.5 seems like a very good choice.

    For different packages,
    FP constant distributions probably vary too much to create something
    that is much more useful.

    I am not convinced of that but it would take an analysis similar to what
    you did but for more packages to resolve that issue. It is an
    interesting question of what packages to use to get the most information
    out of the least number of packages. I don't know enough about package
    usage to have an opinion about that. Perhaps LAPACK to pick up SCIPY,
    one of your CFD packages, Octave????

    But given what we have, and given that it would take no additional HW
    cost, it might make sense to change the ROM table to substitute say
    3.14159... (which occurs 131 times above) for -13.5 (which I assume
    occurs approximately never :-))

    I think there is some gain in object code size to be had for things like
    this, but it is probably modest.

    One related question, and it is really a compiler question. Say I am
    writing a program and I know I will need the value of pi say 10 times in
    the source code. I decide to make my coding easier, and the source code
    more compact by creating a constant, called PI, with a value of
    3.14159..., then write the word PI instead of the numerical constant 10
    times in the source code. Will/should the compiler generate inline
    immediates for the ten references or will it generate a load of the
    actual constant variable? Tradeoffs either way.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sun Nov 9 17:22:28 2025
    From Newsgroup: comp.arch

    On 11/8/2025 5:28 AM, Thomas Koenig wrote:
    BGB <cr88192@gmail.com> schrieb:

    I don't know yet if my implementation of DPD is actually correct.

    The POWER ISA has a pretty good description, see the OpenPower
    foundation.

    Luckily, I have since figured it out and confirmed it.


    Otherwise, fiddled with the division algorithm some more, and it is now "slightly less awful", and converges a bit faster...

    Relatedly, also added Square-Root...



    My previous strategies for square-root didn't really work as effectively
    in this case, so just sorta fiddled with stuff until I got something
    that worked...

    Algorithm I came up with (to find sqrt(S)):
    Make an initial guess of the square root, calling it C;
    Make an initial guess for the reciprocal of C, calling it H;
    Take a few passes (threading the needle, *1):
    C[n+1] = C + (S - C*C)*(H*0.375)
    Redo approximate reciprocal of C, as H (*2);
    Refine H: H = H*(2 - C*H)
    Enter main iteration pass:
    C[n+1] = C + (S - C*C)*(H*0.5)
    H[n+1] = H*(2 - C*H)   //(*3)


    *1: Usual "try to keep stuff from flying off into space" step, using a
    scale of 0.375 to undershoot convergence and increase stability (lower
    means more stability but slower convergence; closer to 0.5 means faster,
    but more likely to "fly off into space" depending on the accuracy of the initial guesses).

    *2: Seemed better to start over from a slightly better guess of C, than
    to directly iterate from the initial (much less accurate) guess.

    *3: Noting that if H is also converged, the convergence rate for C is significantly improved (the gains from faster C convergence are enough
    to offset the added cost of also converging H).
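
    Transcribed into plain C doubles as a sanity check of the recipe above;
    the frexp-based initial guess and the exact 1.0/C reciprocal are
    stand-ins for the Decimal128 approximations, i.e. assumptions for
    illustration only (S > 0 assumed):

    #include <stdio.h>
    #include <math.h>

    /* Coupled iteration: C converges to sqrt(S), H to 1/sqrt(S). */
    static double sqrt_iter(double S)
    {
        int e;
        double m = frexp(S, &e);                   /* S = m * 2^e, m in [0.5, 1) */
        double C = ldexp((m + 1.0) * 0.5, e / 2);  /* rough guess near sqrt(S) (assumed) */
        double H = 1.0 / C;                        /* reciprocal guess; the real code approximates this */

        for (int i = 0; i < 4; i++) {     /* damped passes: 0.375 undershoots for stability */
            C = C + (S - C * C) * (H * 0.375);
            H = 1.0 / C;                  /* redo the reciprocal from the better C */
            H = H * (2.0 - C * H);        /* refine H one step */
        }
        for (int i = 0; i < 6; i++) {     /* main passes: 0.5 factor, H refined alongside C */
            C = C + (S - C * C) * (H * 0.5);
            H = H * (2.0 - C * H);
        }
        return C;
    }

    int main(void)
    {
        printf("%.17g vs %.17g\n", sqrt_iter(2.0), sqrt(2.0));
        return 0;
    }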

    Seems to be effective, though still slower than divide (which is still
    23x slower than an ADD or MUL).


    In this case, the more complex algorithm is (ironically) partly
    justified by the comparably higher relative cost per operation (and the
    issue that I can't resort to tricks like handling the floating-point
    values as integers; doesn't work so hot with Decimal128).




    Felt curious and tried asking Grok about this; it identified this approach
    as the Goldschmidt algorithm, OK. If so, kinda weird that I arrived at a
    well-known (?) algorithm mostly by fiddling with it.

    Looking on Wikipedia, though, it doesn't look like the same algorithm.


    Well, apart from some weird thing, where it initially responded in
    Arabic for some reason (seems odd, it has recently gotten smart enough
    to almost start being useful; apart from when it is being stupid, or
    just doing something weird like responding in the wrong language).

    ...


    Well, also was fiddling with code to try to improve "general
    robustness", like making the compare operation still work if inputs were
    not normalized; dealing with some related edge cases in the ADD/SUB
    logic; ...



    So, ATM, this means it now has:
    ADD, SUB, MUL, DIV, SQRT
    Compare;
    Printing and Parsing numbers as strings;
    ...

    In theory, could expand it out with other math functions if needed.


    Still unclear if there is a use-case.
    Drawback is that it is very slow, even vs Binary128.

    Well, except maybe that the Square-Root algorithm could be applicable to Binary128, which has a similar issue of slow operations. Though, in this
    case, could just copy/paste the existing double-precision code for
    long-double in my C library, which uses an unrolled Taylor Series in
    that case.

    ...



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Nov 10 02:00:26 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/7/2025 9:29 AM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 11/6/2025 11:38 AM, Thomas Koenig wrote:
    ----------snip-------------
    I think there is some gain in object code size to be had for things like this, but it is probably modest.

    The gain in instruction count is constant (sic) since one can represent
    any FP constant as an operand with 1 instruction--what we are striving
    for is code footprint.

    One related question, and it is really a compiler question. Say I am writing a program and I know I will need the value of pi say 10 times in
    the source code. I decide to make my coding easier, and the source code more compact by creating a constant, called PI, with a value of
    3.14159..., then write the word PI instead of the numerical constant 10 times in the source code. Will/should the compiler generate inline immediates for the ten references or will it generate a load of the
    actually constant variable? Tradeoffs either way.

    The number of instructions executed will be exactly the same, the size of
    the code footprint will be lower if/when the compiler can figure out
    when to allocate PI into a register for some duration.

    Currently, a) if there are free registers, and b) the constant is used
    3 times, you gain 1 word of code footprint.
    but (BUT), c) if there are no free registers, and d) the constant is
    used more than 6 times, you gain your first word of code footprint.

    So, it is a bit tricky trading off instruction count for instruction
    footprint.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Nov 10 02:12:53 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 11/8/2025 5:28 AM, Thomas Koenig wrote:
    BGB <cr88192@gmail.com> schrieb:

    I don't know yet if my implementation of DPD is actually correct.

    The POWER ISA has a pretty good description, see the OpenPower
    foundation.

    Luckily, I have since figured it out and confirmed it.


    Otherwise, fiddled with the division algorithm some more, and it is now "slightly less awful", and converges a bit faster...

    Relatedly, also added Square-Root...



    My previous strategies for square-root didn't really work as effectively
    in this case, so just sorta fiddled with stuff until I got something
    that worked...

    Algorithm I came up with (to find sqrt(S)):
    Make an initial guess of the square root, calling it C;
    Make an initial guess for the reciprocal of C, calling it H;
    Take a few passes (threading the needle, *1):
    C[n+1] = C + (S - C*C)*(H*0.375)
    Redo approximate reciprocal of C, as H (*2);
    Refine H: H = H*(2 - C*H)
    Enter main iteration pass:
    C[n+1] = C + (S - C*C)*(H*0.5)
    H[n+1] = H*(2 - C*H)   //(*3)


    *1: Usual "try to keep stuff from flying off into space" step, using a
    scale of 0.375 to undershoot convergence and increase stability (lower
    means more stability but slower convergence; closer to 0.5 means faster,
    but more likely to "fly off into space" depending on the accuracy of the initial guesses).

    *2: Seemed better to start over from a slightly better guess of C, than
    to directly iterate from the initial (much less accurate) guess.

    *3: Noting that if H is also converged, the convergence rate for C is significantly improved (the gains from faster C convergence are enough
    to offset the added cost of also converging H).

    Seems to be effective, though still slower than divide (which is still
    23x slower than an ADD or MUL).

    SQRT should be 20%-30% slower than DIV.


    In this case, the more complex algorithm being (ironically) partly
    justified by the comparably higher relative cost per operation (and the issue that I can't resort to tricks like handling the floating-point
    values as integers; doesn't work so hot with Decimal128).

    If you have binary SQRT and a quick way from DFP128 to BFP32, take SQRT
    in binary, convert back and do 2 iterations. Should be faster. {{I need
    to remind some folks that {float; float; FDIV; fix} was faster than
    IDIV on many 2nd generation RISC machines.}}

    Felt curious, tried asking Grok about this, it identified this approach
    as the Goldschmidt Algorithm, OK. If so, kinda weird that I arrived at a well known (?) algorithm mostly by fiddling with it.

    Feels like it is 1965--does it not ?!?

    Looking on Wikipedia though, this doesn't look like the same algorithm though.

    Goldschmidt is just an N-R where the arithmetic has been arranged so
    that the multiplies are not data-dependent (as they are in N-R). In
    exchange for this independence, GS lacks the automatic correction N-R has.
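
    For comparison, a minimal Goldschmidt-style division in doubles; note
    that the two multiplies in each step are independent of each other, and
    that rounding error accumulates because there is no residual to correct
    against.  The frexp pre-scaling and iteration count are assumptions for
    illustration (d > 0 assumed):

    #include <stdio.h>
    #include <math.h>

    /* Goldschmidt division N/D: scale both by the same factor F each step;
       D tends to 1 and N tends to N/D. */
    static double goldschmidt_div(double n, double d)
    {
        int e;
        double D = frexp(d, &e);          /* bring the divisor into [0.5, 1) */
        double N = ldexp(n, -e);          /* scale the dividend the same way */

        for (int i = 0; i < 5; i++) {
            double F = 2.0 - D;           /* correction factor */
            N *= F;                       /* independent multiply #1 */
            D *= F;                       /* independent multiply #2 */
        }
        return N;
    }

    int main(void)
    {
        printf("%.17g vs %.17g\n", goldschmidt_div(355.0, 113.0), 355.0 / 113.0);
        return 0;
    }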
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sun Nov 9 20:03:12 2025
    From Newsgroup: comp.arch

    On 11/9/2025 6:00 PM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/7/2025 9:29 AM, Thomas Koenig wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> schrieb:
    On 11/6/2025 11:38 AM, Thomas Koenig wrote:
    ----------snip-------------
    I think there is some gain in object code size to be had for things like
    this, but it is probably modest.

    The gain in instruction count is constant (sic) since one can represent
    any FP constant as an operand with 1 instruction--what we are striving
    for is code footprint.

    Yes, agreed. The gain would come from being able to express highly used values (e.g. pi) rather than lesser used values (e.g. -13.5) as 5 bit immediates, thus avoiding the extra 32/64 bits of separate immediate
    word or two.


    One related question, and it is really a compiler question. Say I am
    writing a program and I know I will need the value of pi say 10 times in
    the source code. I decide to make my coding easier, and the source code
    more compact by creating a constant, called PI, with a value of
    3.14159..., then write the word PI instead of the numerical constant 10
    times in the source code. Will/should the compiler generate inline
    immediates for the ten references or will it generate a load of the
    actual constant variable? Tradeoffs either way.

    The number of instructions executed will be exactly the same,

    Yes, but execution time may not be. Presumably the load of a
    non-immediate data value might take longer, certainly so if the value is
    not in the L1 data cache.


    the size of
    the code footprint will be lower if/when the compiler can figure out
    when to allocate PI into a register for some duration.

    Yes.

    Currently, a) if there are free registers, and b) the constant is used
    3 times, you gain 1 word of code footprint.
    but (BUT), c) if there are no free registers, and d) the constant is
    used more than 6 times, you gain your first word of code footprint.

    So, it is a bit tricky trading off instruction count for instruction footprint.

    Yes. That is why I thought it was an interesting question. Your
    heuristic seems as good as any, at least to my uninformed thoughts.

    Thanks.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Nov 10 06:30:21 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> schrieb:
    On 11/8/2025 5:28 AM, Thomas Koenig wrote:
    BGB <cr88192@gmail.com> schrieb:

    I don't know yet if my implementation of DPD is actually correct.

    The POWER ISA has a pretty good description, see the OpenPower
    foundation.

    Luckily, I have since figured it out and confirmed it.

    Did you also implement the rounding modes? That's where all the
    "fun" (and utility) of decimal FP is...

    It's in section 5.5.2 of the 3.1 version of the ISA.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Nov 10 08:16:07 2025
    From Newsgroup: comp.arch

    BGB wrote:
    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary FP.
    Partly as the strategy for generating the initial guess is far less accurate.

    So, it first uses a loop with hard-coded checks and scales to get it in
    the general area, before then letting N-R take over. If the value isn't close enough (seemingly +/- 25% or so), N-R flies off into space.

    Namely:
      Exponent is wrong:
        Scale by factors of 2 until correct;
      Off by more than 50%, scale by +/- 25%;
      Off by more than 25%, scale by +/- 12.5%;
      Else: Good enough, let normal N-R take over.

    My possibly naive idea would extract the top 9-15 digits from divisor
    and dividend, convert both to binary FP, do the division and convert back.
    That would reduce the NR step to two or three iterations, right?
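
    Roughly that idea in plain C, with a float division standing in for the
    "top digits in binary" seed and doubles standing in for Decimal128; the
    two reciprocal steps and the final correction are assumptions for
    illustration (d nonzero assumed):

    #include <stdio.h>

    /* Seed a divide from a low-precision reciprocal, then refine with
       Newton-Raphson: each step roughly doubles the number of good bits. */
    static double div_refined(double n, double d)
    {
        double r = (double)(1.0f / (float)d);   /* ~24-bit reciprocal seed */
        r = r * (2.0 - d * r);                  /* N-R step 1: ~48 bits */
        r = r * (2.0 - d * r);                  /* N-R step 2: ~53 bits */
        double q = n * r;
        q = q + r * (n - d * q);                /* one correction of the quotient */
        return q;
    }

    int main(void)
    {
        printf("%.17g vs %.17g\n", div_refined(355.0, 113.0), 355.0 / 113.0);
        return 0;
    }
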
    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Mon Nov 10 08:27:56 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Anton Ertl wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    No, and I defer to you, or others here, on how these features are
    implemented, specifically whether code modification is required. I was referring to features such as assigned goto in Fortran, and Alter goto in Cobol.

    On modern architectures higher-order functions are implemented with
    indirect branches or indirect calls (depending on whether it's a
    tail-call or not); likewise for method dispatch.

    I do not know how Lisp, FORTRAN, Algol 60 and other early languages
    with higher-order functions were implemented on architectures that do
    not have indirect branches; but if the assigned goto was implemented
    with self-modifying code, the call to a function in a variable was
    probably implemented like that, too.

    What architecture cannot do an indirect branch, which I assume
    means a branch/jump to a variable location in a register?

    PDP-8, 4004, IBM 650, ... And any machine without "registers".

    And how would the operating system on such a machine get programs running?

    Load them at a known location and branch to the known location.

    Even if an ISA did not have a JMP reg instruction one can create it
    using CALL to copy the IP to the stack where you modify it and
    RET to pop the new IP value.

    Pure stack machines did a lot of this.

    We even did similar stuff in low-level x86 code. For example, very early
    8088 CPUs could allow an interrupt between the loading of the stack
    pointer and the stack segment (double-plus ungood!); the fix was to
    munge the stack so that an IRET could be used instead.

    I seem to remember that there could also be a similar issue when doing a
    far return? If so, that was also solved by setting up the stack to allow
    IRET to have the same effect.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Nov 10 07:46:47 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    [indirect branches through auto-increment locations]
    I suppose you could use them for threaded code, but I didn't run into
    any PDP-8 progams that used that.

    You can use them for (direct) threaded code if the indirect branch is
    not to the auto-incremented address, but if there is one additional
    indirection involved. E.g, on RISC-V this is a direct-threaded code
    dispatch:

    addi s5,s5,8
    ld a5,0(s5)
    jr a5

    If the use of the auto-increment location would be equivalent to

    addi s5,s5,8
    jr s5

    it would not be useful for direct-threaded code.

    The paper on (direct) threaded code was only published in 1973, so
    that technique may not have been widely known at the time when much of
    the PDP-8 software was developed.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Nov 10 03:40:26 2025
    From Newsgroup: comp.arch

    On 11/9/2025 8:12 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 11/8/2025 5:28 AM, Thomas Koenig wrote:
    BGB <cr88192@gmail.com> schrieb:

    I don't know yet if my implementation of DPD is actually correct.

    The POWER ISA has a pretty good description, see the OpenPower
    foundation.

    Luckily, I have since figured it out and confirmed it.


    Otherwise, fiddled with the division algorithm some more, and it is now
    "slightly less awful", and converges a bit faster...

    Relatedly, also added Square-Root...



    My previous strategies for square-root didn't really work as effectively
    in this case, so just sorta fiddled with stuff until I got something
    that worked...

    Algorithm I came up with (to find sqrt(S)):
    Make an initial guess of the square root, calling it C;
    Make an initial guess for the reciprocal of C, calling it H;
    Take a few passes (threading the needle, *1):
    C[n+1] = C + (S - C*C)*(H*0.375)
    Redo approximate reciprocal of C, as H (*2);
    Refine H: H = H*(2 - C*H)
    Enter main iteration pass:
    C[n+1] = C + (S - C*C)*(H*0.5)
    H[n+1] = H*(2 - C*H)   //(*3)


    *1: Usual "try to keep stuff from flying off into space" step, using a
    scale of 0.375 to undershoot convergence and increase stability (lower
    means more stability but slower convergence; closer to 0.5 means faster,
    but more likely to "fly off into space" depending on the accuracy of the
    initial guesses).

    *2: Seemed better to start over from a slightly better guess of C, than
    to directly iterate from the initial (much less accurate) guess.

    *3: Noting that if H is also converged, the convergence rate for C is
    significantly improved (the gains from faster C convergence are enough
    to offset the added cost of also converging H).

    Seems to be effective, though still slower than divide (which is still
    23x slower than an ADD or MUL).

    SQRT should be 20%-30% slower than DIV.


    It is currently around 2.5x slower.


    Though, the number of loop iterations isn't that much different; rather
    the complexity of the loop is higher (as it is iterating both the square
    root and the reciprocal of the square-root).


    Compared to the version I put on pastebin, there has been around an 8x improvement to the speed of performing the divide operation.


    And, sqrt is around 3x faster than DIV in the pastebin version...


    So, at the moment:
    MUL: 19 MHz
    ADD: 13 MHz
    DIV: 0.83 MHz
    SQRT: 0.34 MHz



    In this case, the more complex algorithm being (ironically) partly
    justified by the comparably higher relative cost per operation (and the
    issue that I can't resort to tricks like handling the floating-point
    values as integers; doesn't work so hot with Decimal128).

    If you have binary SQRT and a quick way from DFP128 to BFP32, take SQRT
    in binary, convert back and do 2 iterations. Should be faster. {{I need
    to remind some folks that {float; float; FDIV; fix} was faster than
    IDIV on many 2st generation RISC machines.


    Yeah, this is a possible option.

    A hardware FPU could give much better starting values for starting the iteration.

    Depends mostly on having reasonably fast and accurate format conversion.



    Felt curious, tried asking Grok about this, it identified this approach
    as the Goldschmidt Algorithm, OK. If so, kinda weird that I arrived at a
    well known (?) algorithm mostly by fiddling with it.

    Feels like it is 1965--does it not ?!?


    I don't know there.

    Back then, my parents would have still been children...

    All I really know about this era is stuff I have seen in TV shows.



    Though, ironically, I did previously go and watch through some of the Krofft
    brothers' shows ("H.R. Pufnstuf" and "Lidsville" and similar), which were
    around when my parents were young. Kinda surreal...


    Though, it seemed like both shows were trying to create a fantastical
    world on as little budget as possible. Pufnstuf seemed more ambitious,
    but with much cheaper SFX. Lidsville was a little more conservative here,
    but generally did a better job in terms of the quality of both effects
    and costumes.

    Pufnstuf had used a lot of fabric and stuffing for costumes (sorta like
    pillows), and when puppets were used, they were often crudely constructed
    and controlled. There were a few cases where they used rigid sticks
    (though this was more a Henson thing), but more often it was pulling on
    flexible strings.

    Some small puppets used foam rubber, but it appears to have been used sparingly.


    Scenery was often indoor sets with painted backgrounds, colored tarps of
    the floors (sometimes with some sort of sand-like material on the
    tarps), and flat cut outs for plants (usually hand-painted).

    Contrast, Lidsville was less ambitious with its use of special effects,
    but when used, were typically better done. A lot of the costumes
    appeared better made as well.


    But, I guess, one can compare/contrast with other types of shows, say:
    Toho: Godzilla movies:
    Foam rubber suits and what look like a lot model train-set parts;
    Likely a lot more expensive;
    Toei: Super Sentai / Power Rangers
    Heavy use of foam rubber for costumes;
    Spandex or vinyl for protagonist suits;
    Frequent use of styrofoam for destructible objects;
    Something gets smashed/broken/exploded, often styrofoam;
    City scenes often used modified cardboard boxes;
    Or, actors super-imposed onto scenes made using miniatures;
    CGI sometimes used, but sparingly.
    And, then, mostly compositing type effects.
    The 90s show more liked using things like pyrotechnics.
    Some of the later shows used CGI for things like explosions.
    ...

    Though, did see a recent movie "Psycho Goreman" which seemed to be
    approaching special effects in a very similar way to Power Rangers (a
    lot of foam rubber and occasional "obviously bad" CGI). I suspect they
    may have been intentionally going for a Power Ranger's kinda look though.

    Contrast, likely the effects in Godzilla would have been more expensive
    than those in Power Rangers.

    But, they were still kind of a holdout for using a lot of practical
    effects, in an era where people elsewhere were rapidly jumping over to
    the use of CGI, as did most newer Godzilla movies (like, CGI isn't quite
    the same as rubber suits and puppets).

    Feels sometimes like something was lost here.



    A few times, seems like it would be funny though if a person did a show,
    but instead deliberately used Pufnstuf style effects.

    Well, and/or mixed with Ed Wood style effects.
    Like, say, a paper plate on a string for UFO;
    or BBQ lighter rocket engines...

    Or, have some costumes with some really cheap rubber masks (like the
    sort that sometimes come with Halloween costumes).
    Or maybe papercraft (like construction paper or cardstock). Maybe in combination with fabric+stuffing and googly eyes.



    Maybe also cool if they could capture some of that "terrible holiday
    special" vibes. Or, maybe some musical numbers, but it is mostly
    "Schoolhouse Rock" style stuff.

    Though, preferably "so bad it is funny" kind of effects...
    Not so much "Manos: The Hands of Fate" bad, which was also technically
    bad, but not in a way that I found particularly amusing.


    Well, and while in theory could be cheaper still to use sock-puppets,
    this is going a little too far.



    Or, the extreme opposite that was 90s CGI jank. Proceeds to watch
    episodes of "Donkey Kong Country" or similar, "Yeah, that's the crap".

    Not everything needs to look good though, sometimes there is a certain
    charm in the "jank".

    Where, could maybe classify CGI into a few buckets:
    80s/experimental:
    Tron;
    "Money for Nothing";
    Various CGI "fever dream" stuff.
    Looked like they really liked CGI solids,
    and some kind of ray-casting.
    Some early/mid 90s stuff:
    ReBoot, Donkey Kong Country, Beast Machines, ...
    Some late 1990s/2000s stuff:
    Where human type characters got *very ugly*.
    Side Branch:
    Shows like "Jimmy Neutron" going to a more cartoony style
    Humanoids still looked OK, if kept cartoon-like.
    2010s to present:
    Paths solidly split into "photo realistic" and cartoon styles.
    Or, Pixar liking to sit right on the edge.
    Like, they want to do photo-realism,
    but if they try too hard, it gets ugly.


    If I were to do anything, might try to borrow some from the 80s style,
    and use a lot of CSG.

    Could maybe make "artistic choice" effects, like dithering, or saving
    JPEG images at 0% quality (so that the image looks "kinda cooked").


    Though, reminds me of a funny observation with my color-via-monochrome experiment:
    The images could be LZ compressed to fairly small sizes, and seemingly
    beat JPEG in terms of Q/bpp while doing so (because one needs to save
    the JPEG at 0%, and then it looks cooked; with the dithered image
    looking less bad than the 0% JPEG).

    Though, not sure what to make of this exactly.


    Looking on Wikipedia, though, this doesn't look like the same algorithm.

    Goldschmidt is just an N-R where the arithmetic has been arranged so
    that the multiplies are not data-dependent (unlike N-R). And for this
    independence, GS lacks the automatic correction N-R has.

    Dunno.

    In my case, both terms being iterated do depend on each other.

    The actual calculation didn't look the same either.
    But, it does involve iteration, like N-R, and uses 2 terms with one
    being a reciprocal of the square root (like Goldschmidt), and appears to converge in a relatively small number of loop iterations.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Nov 10 14:52:36 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> writes:
    According to Scott Lurndal <slp53@pacbell.net>:
    PDP-8, 4004, IBM 650, ... And any machine without "registers".

    To be fair, addresses 10 through 17 in the PDP-8 were effectively
    auto-increment registers and indirect branches were their
    primary function. ....

    I did a fair amount of PDP-8 programming and I don't ever recall using
    the auto-index locations for branches. They were used to step
    through a table of data, e.g. to add up a list of numbers:

    10, 1007 ; list starts at 1010

    100, -50 ; list is 50 (octal) long

    CLA
    LOOP,
    TAD I 10
    ISZ 100
    JMP LOOP
    ; sum is in the accumulator

    I suppose you could use them for threaded code, but I didn't run into
    any PDP-8 progams that used that.

    Yes, mainly for data. I do have a vague recollection of hand-disassembling[*] the BASIC interpreter and finding some unexpected indirect branches through 010-017.

    [*] Paper and pencil from an octal dump in high school.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Mon Nov 10 18:53:34 2025
    From Newsgroup: comp.arch

    According to Scott Lurndal <slp53@pacbell.net>:
    I suppose you could use them for threaded code, but I didn't run into
    any PDP-8 progams that used that.

    Yes, mainly for data. I do have a vague recollection of hand-disassembling[*] the BASIC interpreter and finding some unexpected indirect branches through 010-017.

    The usual way to do threaded code needs double indirection, like on the PDP-11 JMP @(R5)+ which jumps to the address that the word at R5 points to, then increments R5. The PDP-8 only had single indirect so the autoindex would have to
    point at a list of JMP instructions, which in turn would usually have to be indirect unless the routine was so small it could fit on the page with the JMP list.
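
    Purely as an illustration of that double indirection (C function pointers
    standing in for machine jumps, nothing PDP-specific): each slot of the
    thread points at a cell that holds the routine address, so dispatch
    follows two pointers per step. All of the names here are made up.

    #include <stdio.h>

    typedef void (*prim_fn)(void);

    static void prim_hello(void) { puts("hello"); }
    static void prim_bye(void)   { puts("bye");   }

    /* cells holding the routine addresses (the "code field" words) */
    static prim_fn cell_hello = prim_hello;
    static prim_fn cell_bye   = prim_bye;

    /* the thread: pointers to those cells, NULL-terminated */
    static prim_fn *thread_body[] = { &cell_hello, &cell_bye, NULL };

    /* fetch the cell address, follow it, call the routine, advance the "IP" */
    static void run(prim_fn **ip)
    {
        while (*ip)
            (**ip++)();
    }

    int main(void) { run(thread_body); return 0; }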

    People did all sorts of strange stuff to cram programs into the PDP-8 so I can imagine other sorts of autoindex JMP tricks, like doing one thing the first time
    through a loop and something else after that.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Nov 10 13:54:23 2025
    From Newsgroup: comp.arch

    On 11/10/2025 1:16 AM, Terje Mathisen wrote:
    BGB wrote:
    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary FP.
    Partly as the strategy for generating the initial guess is far less
    accurate.

    So, it first uses a loop with hard-coded checks and scales to get it
    in the general area, before then letting N-R take over. If the value
    isn't close enough (seemingly +/- 25% or so), N-R flies off into space.

    Namely:
       Exponent is wrong:
         Scale by factors of 2 until correct;
       Off by more than 50%, scale by +/- 25%;
       Off by more than 25%, scale by +/- 12.5%;
       Else: Good enough, let normal N-R take over.

    My possibly naive idea would extract the top 9-15 digits from divisor
    and dividend, convert both to binary FP, do the division and convert back.

    That would reduce the NR step to two or three iterations, right?


    After adding code to convert to/from 'double', and using this
    for the initial reciprocal and square-root:
    DIV gets around 50% faster: ~ 1.5 MHz (~ 12x slower than MUL);
    SQRT gets around 260% faster: ~ 0.9 MHz (~ 22x slower than MUL);


    Single-stepping in the debugger:
    SQRT takes around 3 iterations.

    With the initial worse estimates, it requires 7 iterations.

    the iteration has a special case to stop once the adjustment would
    effectively become too small to make a difference.

    ...
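
    Roughly, the structure looks like this in plain doubles, with a float
    divide standing in for the "convert the top digits to binary FP, divide
    there, convert back" seeding (the Decimal128 conversion routines
    themselves are not shown; this is only a sketch of the shape):

    #include <stdio.h>

    /* Newton-Raphson reciprocal, seeded from a lower-precision binary divide. */
    double recip_nr(double d)
    {
        /* seed: ~24 bits correct (ignoring range issues);
           two or three N-R steps then finish the job.       */
        double x = (double)(1.0f / (float)d);

        for (int i = 0; i < 8; i++) {
            double xn = x * (2.0 - d * x);   /* classic N-R step for 1/d      */
            if (xn == x)                     /* adjustment too small: stop    */
                break;
            x = xn;
        }
        return x;
    }

    int main(void) { printf("%.17g\n", recip_nr(3.0)); return 0; }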


    Otherwise, it is possible I could add the fancy rounding modes.

    Though, I can note that another library that uses this format also uses
    funky normalization (primarily keeping numbers right aligned rather than normalizing to left-alignment) which could affect the behavior of
    rounding (it would nominally round to however many digits exist past the decimal point in the ASCII strings it parses as input).


    Though, it could be possible to add a feature to partly defeat the floating-point behavior and behave as-if there were always at least N
    digits above the decimal point (for normalization/rounding).

    For example, if specifying that ADD should behave as-if there were 31
    digits above the decimal point, then operations would be rounded to
    3 digits below the decimal point.

    This sort of behavior would likely need to be per-operation though.

    Well, unless "how many digits exist past the decimal point in an ASCII
    string representation" is itself a semantically important detail?...


    For most contexts where it could matter, would expect setting a minimum
    exponent to make more sense. Though, in these use-cases, it is not clear
    how the added complexity (and overhead) of decimal floating-point could
    make sense over using some sort of decimal fixed-point scheme (such as
    storing 64 or 128 bit integers with a fixed scale of 1000 or something).
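
    For illustration, the fixed-scale idea is just something like this
    (names made up, overflow handling omitted):

    #include <stdint.h>

    typedef int64_t fix3;                       /* value scaled by 1000        */

    fix3 fix3_add(fix3 a, fix3 b) { return a + b; }
    fix3 fix3_mul(fix3 a, fix3 b) { return (a * b) / 1000; }  /* can overflow  */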

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Nov 11 00:08:48 2025
    From Newsgroup: comp.arch

    On Mon, 10 Nov 2025 13:54:23 -0600
    BGB <cr88192@gmail.com> wrote:
    On 11/10/2025 1:16 AM, Terje Mathisen wrote:
    BGB wrote:
    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary
    FP. Partly as the strategy for generating the initial guess is far
    less accurate.

    So, it first uses a loop with hard-coded checks and scales to get
    it in the general area, before then letting N-R take over. If the
    value isn't close enough (seemingly +/- 25% or so), N-R flies off
    into space.

    Namely:
    Exponent is wrong:
    Scale by factors of 2 until correct;
    Off by more than 50%, scale by +/- 25%;
    Off by more than 25%, scale by +/- 12.5%;
    Else: Good enough, let normal N-R take over.

    My possibly naive idea would extract the top 9-15 digits from
    divisor and dividend, convert both to binary FP, do the division
    and convert back.

    That would reduce the NR step to two or three iterations, right?


    After adding code to feed to convert to/from 'double', and using this
    for initial reciprocal and square-root:
    DIV gets around 50% faster: ~ 1.5 MHz (~ 12x slower than MUL);
    That is your timing for Decimal128 on a modern desktop PC?
    Dependent divisions or independent?
    Even for dependent, it sounds slow.
    Did you try to compare against brute force calculation using GMP? https://gmplib.org/
    I.e. assuming that num < den < 10*num, use GMP to calculate 40 decimal
    digits of intermediate result y as follows:
    Numx = num * 1e40;
    y = Numx/den;
    Yi = y / 1e6, Yf = y % 1e6 (this step does not require GMP, figure out
    why).
    If Yf != 5e5 then you are finished. Only in the extremely rare case (1 in
    a million) of Yf == 5e5 will you have to calculate the remainder of
    Numx/den to find the correct rounding.
    Somehow, I suspect that on a modern PC even a non-optimized method like
    the above will be faster than 670 usec.
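
    For what it's worth, a sketch of that recipe using GMP (the 34-digit
    sample values are made up, and the exact-half rounding case is left out
    as the comment says; build with -lgmp):

    #include <gmp.h>
    #include <stdio.h>

    int main(void)
    {
        mpz_t num, den, numx, y, yi, p40;
        mpz_inits(num, den, numx, y, yi, p40, NULL);

        mpz_set_str(num, "1234567890123456789012345678901234", 10);
        mpz_set_str(den, "9876543210987654321098765432109876", 10);

        mpz_ui_pow_ui(p40, 10, 40);           /* 1e40                           */
        mpz_mul(numx, num, p40);              /* Numx = num * 1e40  (74 digits) */
        mpz_tdiv_q(y, numx, den);             /* y: ~40 decimal digits          */

        /* Yi = y / 1e6, Yf = y % 1e6 -- GMP used here only for brevity;
           this step is small enough to do with plain 64/128-bit arithmetic. */
        unsigned long yf = mpz_tdiv_q_ui(yi, y, 1000000ul);

        gmp_printf("Yi = %Zd\nYf = %lu\n", yi, yf);
        if (yf == 500000ul)
            puts("exact-half case: need the remainder of Numx/den to round");

        mpz_clears(num, den, numx, y, yi, p40, NULL);
        return 0;
    }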
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Mon Nov 10 21:56:45 2025
    From Newsgroup: comp.arch

    Typical process for NaN boxing is to set the high order bits of the
    value which causes the value to appear to be a NaN at higher precision.
    I have been thinking about using some of the high order bits of the NaN
    (eg bits 32 to 51) to indicate the precision of the boxed value. This
    would allow detection of the use of a lower precision value in
    arithmetic. Suppose a conversion from single to double precision is being
    done, but the value to be converted is only half precision. If this were
    indicated by the NaN, software might be able to fix the result. I also
    preserve the sign bit of the number in the NaN box.
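
    A minimal C sketch of that boxing scheme, assuming a tag field in bits
    32..51 and the tag values shown (both are illustrative, not a defined
    format):

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    enum { BOX_PREC_HALF = 1, BOX_PREC_SINGLE = 2 };   /* illustrative tags    */

    /* Box a binary32 pattern inside a binary64 NaN: exponent all ones, the
       value's sign mirrored into bit 63, tag in bits 32..51, value in 0..31. */
    uint64_t nanbox_single(float f)
    {
        uint32_t u;
        memcpy(&u, &f, sizeof u);

        uint64_t box = 0x7FF0000000000000ull;       /* exponent field all ones */
        box |= (uint64_t)(u >> 31) << 63;           /* preserve the sign bit   */
        box |= (uint64_t)BOX_PREC_SINGLE << 32;     /* precision tag           */
        box |= (uint64_t)u;                         /* boxed 32-bit pattern    */
        return box;                                 /* nonzero tag => a NaN    */
    }

    /* Return the precision tag if the word looks like a NaN box, else 0. */
    unsigned boxed_precision(uint64_t box)
    {
        if ((box & 0x7FF0000000000000ull) != 0x7FF0000000000000ull)
            return 0;
        return (unsigned)((box >> 32) & 0xFFFFFu);
    }

    int main(void)
    {
        uint64_t b = nanbox_single(-1.5f);
        printf("box = %016llx  precision tag = %u\n",
               (unsigned long long)b, boxed_precision(b));
        return 0;
    }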

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Mon Nov 10 21:25:47 2025
    From Newsgroup: comp.arch

    On 11/10/2025 4:08 PM, Michael S wrote:
    On Mon, 10 Nov 2025 13:54:23 -0600
    BGB <cr88192@gmail.com> wrote:

    On 11/10/2025 1:16 AM, Terje Mathisen wrote:
    BGB wrote:
    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary
    FP. Partly as the strategy for generating the initial guess is far
    less accurate.

    So, it first uses a loop with hard-coded checks and scales to get
    it in the general area, before then letting N-R take over. If the
    value isn't close enough (seemingly +/- 25% or so), N-R flies off
    into space.

    Namely:
       Exponent is wrong:
         Scale by factors of 2 until correct;
       Off by more than 50%, scale by +/- 25%;
       Off by more than 25%, scale by +/- 12.5%;
       Else: Good enough, let normal N-R take over.

    My possibly naive idea would extract the top 9-15 digits from
    divisor and dividend, convert both to binary FP, do the division
    and convert back.

    That would reduce the NR step to two or three iterations, right?


    After adding code to feed to convert to/from 'double', and using this
    for initial reciprocal and square-root:
    DIV gets around 50% faster: ~ 1.5 MHz (~ 12x slower than MUL);

    That is your timing for Decimal128 on modern desktop PC?
    Dependent divisions or independent?
    Even for dependent, it sounds slow.


    Modern-ish...

    I am running a CPU type that was originally released 7 years ago, with
    slower RAM than it was designed to work with.


    Did you try to compare against brute force calculation using GMP? https://gmplib.org/
    I.e. asuming that num < den < 10*num use GMP to calculate 40 decimal
    digits of intermediate result y as follows:
    Numx = num * 1e40;
    y = Numx/den;
    Yi = y / 1e6, Yf = y % 1e6 (this step does not require GMP, figure out
    why).
    If Yf != 5e5 then you finished. Only in extremely rare case (1 in a
    million) of Yf == 5e5 you will have to calculate reminder of Numx/den
    to found correct rounding.
    Somehow, I suspect that on modern PC even non-optimized method like
    above will be faster tham 670 usec.




    Well, first step is building with GCC rather than MSVC...

    It would appear that it gets roughly 79% faster when built with GCC.
    So, around 2 million divides per second.



    As for GMP, dividing two 40 digit numbers:
    22 million per second.
    If I do both a divide and a remainder:
    16 million.

    I don't really get what you are wanting me to measure exactly though...


    If I compare against the IBM decNumber library:
    Multiply: 14 million.
    Divide: 7 million

    The decNumber library doesn't appear to have a square-root function...


    Granted, there are possibly faster ways to do divide, versus using Newton-Raphson in this case...

    It was not the point that I could pull the fastest possible
    implementation out of thin air. But, does appear I am beating decNumber
    at least for multiply performance and similar.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Nov 11 12:02:07 2025
    From Newsgroup: comp.arch

    On Mon, 10 Nov 2025 21:25:47 -0600
    BGB <cr88192@gmail.com> wrote:
    On 11/10/2025 4:08 PM, Michael S wrote:
    On Mon, 10 Nov 2025 13:54:23 -0600
    BGB <cr88192@gmail.com> wrote:

    On 11/10/2025 1:16 AM, Terje Mathisen wrote:
    BGB wrote:
    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary
    FP. Partly as the strategy for generating the initial guess is
    far less accurate.

    So, it first uses a loop with hard-coded checks and scales to get
    it in the general area, before then letting N-R take over. If the
    value isn't close enough (seemingly +/- 25% or so), N-R flies off
    into space.

    Namely:
    Exponent is wrong:
    Scale by factors of 2 until correct;
    Off by more than 50%, scale by +/- 25%;
    Off by more than 25%, scale by +/- 12.5%;
    Else: Good enough, let normal N-R take over.

    My possibly naive idea would extract the top 9-15 digits from
    divisor and dividend, convert both to binary FP, do the division
    and convert back.

    That would reduce the NR step to two or three iterations, right?


    After adding code to feed to convert to/from 'double', and using
    this for initial reciprocal and square-root:
    DIV gets around 50% faster: ~ 1.5 MHz (~ 12x slower than
    MUL);

    That is your timing for Decimal128 on modern desktop PC?
    Dependent divisions or independent?
    Even for dependent, it sounds slow.


    Modern-ish...

    Zen2 ?
    I consider it the last of non-modern. Zen3 and Ice Lake are first
    of modern. 128by64 bit integer division on Zen2 is still quite slow
    and overall uArch is even less advanced than 10 y.o. Intel Skylake.
    In majority of real-world workloads it's partially compensated by
    Zen2 bigger L3 cache. In our case big cache does not help.
    But even last non-modern CPU shall be capable to divide faster than
    suggested by your numbers.
    I am running a CPU type that was originally released 7 years ago,
    with slower RAM than it was designed to work with.


    Did you try to compare against brute force calculation using GMP? https://gmplib.org/
    I.e. asuming that num < den < 10*num use GMP to calculate 40
    decimal digits of intermediate result y as follows:
    Numx = num * 1e40;
    y = Numx/den;
    Yi = y / 1e6, Yf = y % 1e6 (this step does not require GMP, figure
    out why).
    If Yf != 5e5 then you finished. Only in extremely rare case (1 in a million) of Yf == 5e5 you will have to calculate reminder of
    Numx/den to found correct rounding.
    Somehow, I suspect that on modern PC even non-optimized method like
    above will be faster tham 670 usec.




    Well, first step is building with GCC rather than MSVC...

    It would appear that it gets roughly 79% faster when built with GCC.
    So, around 2 million divides per second.



    As for GMP, dividing two 40 digit numbers:
    22 million per second.
    If I do both a divide and a remainder:
    16 million.

    I don't really get what you are wanting me to measure exactly
    though...
    I want you to measure division of a 74-digit integer by a 34-digit
    integer, because it is the slowest part [of a brute force implementation]
    of Decimal128 division. The rest of the division is approximately the
    same as multiplication.
    So, [unoptimized] Decimal128 division time should be no worse than
    t1+t2, where t1 is the duration of a Decimal128 multiplication and t2 is
    the duration of the above-mentioned integer division. The estimate is
    pessimistic, because post-division normalization tends to be simpler
    than post-multiplication normalization.
    Optimized division would be faster yet.


    If I compare against the IBM decNumber library:
    Multiply: 14 million.
    Divide: 7 million

    The decNumber library doesn't appear to have a square-root function...


    Granted, there are possibly faster ways to do divide, versus using Newton-Raphson in this case...

    It was not the point that I could pull the fastest possible
    implementation out of thin air. But, does appear I am beating
    decNumber at least for multiply performance and similar.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Nov 11 04:44:48 2025
    From Newsgroup: comp.arch

    On 11/11/2025 4:02 AM, Michael S wrote:
    On Mon, 10 Nov 2025 21:25:47 -0600
    BGB <cr88192@gmail.com> wrote:

    On 11/10/2025 4:08 PM, Michael S wrote:
    On Mon, 10 Nov 2025 13:54:23 -0600
    BGB <cr88192@gmail.com> wrote:

    On 11/10/2025 1:16 AM, Terje Mathisen wrote:
    BGB wrote:
    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary
    FP. Partly as the strategy for generating the initial guess is
    far less accurate.

    So, it first uses a loop with hard-coded checks and scales to get
    it in the general area, before then letting N-R take over. If the
    value isn't close enough (seemingly +/- 25% or so), N-R flies off
    into space.

    Namely:
       Exponent is wrong:
         Scale by factors of 2 until correct;
       Off by more than 50%, scale by +/- 25%;
       Off by more than 25%, scale by +/- 12.5%;
       Else: Good enough, let normal N-R take over.

    My possibly naive idea would extract the top 9-15 digits from
    divisor and dividend, convert both to binary FP, do the division
    and convert back.

    That would reduce the NR step to two or three iterations, right?


    After adding code to feed to convert to/from 'double', and using
    this for initial reciprocal and square-root:
    DIV gets around 50% faster: ~ 1.5 MHz (~ 12x slower than
    MUL);

    That is your timing for Decimal128 on modern desktop PC?
    Dependent divisions or independent?
    Even for dependent, it sounds slow.


    Modern-ish...


    Zen2 ?
    I consider it the last of non-modern. Zen3 and Ice Lake are first
    of modern. 128by64 bit integer division on Zen2 is still quite slow
    and overall uArch is even less advanced than 10 y.o. Intel Skylake.
    In majority of real-world workloads it's partially compensated by
    Zen2 bigger L3 cache. In our case big cache does not help.
    But even last non-modern CPU shall be capable to divide faster than
    suggested by your numbers.


    Zen+

    Or, a slightly tweaked version of Zen1.


    It is very well possible to do big integer divide faster than this.
    Such as via shift-and-add.

    But, as for decimal, this makes it harder.


    I could do long division, but this is a much more complicated algorithm (versus using Newton-Raphson).

    But, N-R is slow as it is basically a bunch of operations, which are
    granted themselves, each kinda slow.



    I am running a CPU type that was originally released 7 years ago,
    with slower RAM than it was designed to work with.


    Did you try to compare against brute force calculation using GMP?
    https://gmplib.org/
    I.e. asuming that num < den < 10*num use GMP to calculate 40
    decimal digits of intermediate result y as follows:
    Numx = num * 1e40;
    y = Numx/den;
    Yi = y / 1e6, Yf = y % 1e6 (this step does not require GMP, figure
    out why).
    If Yf != 5e5 then you finished. Only in extremely rare case (1 in a
    million) of Yf == 5e5 you will have to calculate reminder of
    Numx/den to found correct rounding.
    Somehow, I suspect that on modern PC even non-optimized method like
    above will be faster tham 670 usec.




    Well, first step is building with GCC rather than MSVC...

    It would appear that it gets roughly 79% faster when built with GCC.
    So, around 2 million divides per second.



    As for GMP, dividing two 40 digit numbers:
    22 million per second.
    If I do both a divide and a remainder:
    16 million.

    I don't really get what you are wanting me to measure exactly
    though...


    I want you to measure division of 74-digit integer by 34-digit integer, because it is the slowest part [of brute force implementation] of
    Decimal128 division. The rest of division is approximately the same as multiplication.
    So, [unoptimized] Decimal128 division time should be no worse than
    t1+t2, where t1 is duration of Decimal128 multiplication and t2 is
    duration of above-mentioned integer division. An estimate is
    pessimistic, because post-division normalization tends to be simpler
    than post-multiplication normalization.
    Optimized division would be faster yet.


    If it is a big-integer divide, this is not quite the same thing.

    And, if I were to use big-integer divide (probably not via GMP though,
    this would be too big of a dependency), there is still the issue of efficiently converting between big-integer and the "groups of 9 digits
    in 32-bits" format.


    This is partly why I removed the BID code:
    At first, it seemed like the DPD and BID converters were of similar speed;
    But, it turns out I was still testing the DPD converter, and in fact the
    BID converter was significantly slower.

    And, if I were going to do BID, would make more sense to do it as its
    own thing, and build it mostly around 128-bit integer math.


    But, in this case, I had decided to experiment with DPD.


    Most likely, in this case if I wanted faster divide, that also played
    well with the existing format, I would need to do long division or similar.





    If I compare against the IBM decNumber library:
    Multiply: 14 million.
    Divide: 7 million

    The decNumber library doesn't appear to have a square-root function...


    Granted, there are possibly faster ways to do divide, versus using
    Newton-Raphson in this case...

    It was not the point that I could pull the fastest possible
    implementation out of thin air. But, does appear I am beating
    decNumber at least for multiply performance and similar.




    Can note that while decNumber exists, at the moment, it is over 10x more code...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Tue Nov 11 14:03:40 2025
    From Newsgroup: comp.arch

    On Tue, 11 Nov 2025 04:44:48 -0600
    BGB <cr88192@gmail.com> wrote:
    On 11/11/2025 4:02 AM, Michael S wrote:
    On Mon, 10 Nov 2025 21:25:47 -0600
    BGB <cr88192@gmail.com> wrote:

    On 11/10/2025 4:08 PM, Michael S wrote:
    On Mon, 10 Nov 2025 13:54:23 -0600
    BGB <cr88192@gmail.com> wrote:

    On 11/10/2025 1:16 AM, Terje Mathisen wrote:
    BGB wrote:
    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary
    FP. Partly as the strategy for generating the initial guess is
    far less accurate.

    So, it first uses a loop with hard-coded checks and scales to
    get it in the general area, before then letting N-R take over.
    If the value isn't close enough (seemingly +/- 25% or so), N-R
    flies off into space.

    Namely:
    Exponent is wrong:
    Scale by factors of 2 until correct;
    Off by more than 50%, scale by +/- 25%;
    Off by more than 25%, scale by +/- 12.5%;
    Else: Good enough, let normal N-R take over.

    My possibly naive idea would extract the top 9-15 digits from
    divisor and dividend, convert both to binary FP, do the division
    and convert back.

    That would reduce the NR step to two or three iterations, right?


    After adding code to feed to convert to/from 'double', and using
    this for initial reciprocal and square-root:
    DIV gets around 50% faster: ~ 1.5 MHz (~ 12x slower than
    MUL);

    That is your timing for Decimal128 on modern desktop PC?
    Dependent divisions or independent?
    Even for dependent, it sounds slow.


    Modern-ish...


    Zen2 ?
    I consider it the last of non-modern. Zen3 and Ice Lake are first
    of modern. 128by64 bit integer division on Zen2 is still quite slow
    and overall uArch is even less advanced than 10 y.o. Intel Skylake.
    In majority of real-world workloads it's partially compensated by
    Zen2 bigger L3 cache. In our case big cache does not help.
    But even last non-modern CPU shall be capable to divide faster than suggested by your numbers.


    Zen+

    Or, a slightly tweaked version of Zen1.


    It is very well possible to do big integer divide faster than this.
    Such as via shift-and-add.

    But, as for decimal, this makes it harder.


    I could do long division, but this is a much more complicated
    algorithm (versus using Newton-Raphson).

    But, N-R is slow as it is basically a bunch of operations, which are
    granted themselves, each kinda slow.



    I am running a CPU type that was originally released 7 years ago,
    with slower RAM than it was designed to work with.


    Did you try to compare against brute force calculation using GMP?
    https://gmplib.org/
    I.e. asuming that num < den < 10*num use GMP to calculate 40
    decimal digits of intermediate result y as follows:
    Numx = num * 1e40;
    y = Numx/den;
    Yi = y / 1e6, Yf = y % 1e6 (this step does not require GMP, figure
    out why).
    If Yf != 5e5 then you finished. Only in extremely rare case (1 in
    a million) of Yf == 5e5 you will have to calculate reminder of
    Numx/den to found correct rounding.
    Somehow, I suspect that on modern PC even non-optimized method
    like above will be faster tham 670 usec.




    Well, first step is building with GCC rather than MSVC...

    It would appear that it gets roughly 79% faster when built with
    GCC. So, around 2 million divides per second.



    As for GMP, dividing two 40 digit numbers:
    22 million per second.
    If I do both a divide and a remainder:
    16 million.

    I don't really get what you are wanting me to measure exactly
    though...


    I want you to measure division of 74-digit integer by 34-digit
    integer, because it is the slowest part [of brute force
    implementation] of Decimal128 division. The rest of division is approximately the same as multiplication.
    So, [unoptimized] Decimal128 division time should be no worse than
    t1+t2, where t1 is duration of Decimal128 multiplication and t2 is
    duration of above-mentioned integer division. An estimate is
    pessimistic, because post-division normalization tends to be simpler
    than post-multiplication normalization.
    Optimized division would be faster yet.


    If it is a big-integer divide, this is not quite the same thing.

    And, if I were to use big-integer divide (probably not via GMP
    though,
    Certainly not via GMP in final product. But doing 1st version via GMP
    makes perfect sense.
    this would be too big of a dependency), there is still the
    issue of efficiently converting between big-integer and the "groups
    of 9 digits in 32-bits" format.
    No, no, no. Not "group of 9 digits"! Plain unadulterated binary. 64
    binary 'digits' per 64-bit word.


    This is partly why I removed the BID code:
    At first, it seemed like the DPD and BID converters were similar
    speed; But, turns out I was still testing the DPD converter, and
    in-fact the BID converter was significantly slower.

    DPD-specific code and algorithms make sense for multiplication.
    They likely make sense for addition/subtraction as well; I didn't try
    to think deeply about it.
    But for division I wouldn't bother with DPD-specific things. Just
    convert mantissa from DPD to binary, then divide, normalize, round then
    convert back.
    And, if I were going to do BID, would make more sense to do it as its
    own thing, and build it mostly around 128-bit integer math.


    But, in this case, I had decided to experiment with DPD.


    Most likely, in this case if I wanted faster divide, that also played
    well with the existing format, I would need to do long division or
    similar.





    If I compare against the IBM decNumber library:
    Multiply: 14 million.
    Divide: 7 million

    The decNumber library doesn't appear to have a square-root
    function...


    Granted, there are possibly faster ways to do divide, versus using
    Newton-Raphson in this case...

    It was not the point that I could pull the fastest possible
    implementation out of thin air. But, does appear I am beating
    decNumber at least for multiply performance and similar.




    Can note that while decNumber exists, at the moment, it is over 10x
    more code...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Tue Nov 11 18:50:20 2025
    From Newsgroup: comp.arch

    On 2025-11-06 20:28, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-05 23:28, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
    ----------------
    But then you could get the problem of a longjmp to a setjmp value that
    is stale because the targeted function invocation (stack frame) is no
    longer there.

    But YOU had to pass the jumpbuf out of the setjump() scope.

    Now, YOU complain there is a hole in your own foot with a smoking gun
    in your own hand.

    That is not the issue. The question is if the semantics of "goto
    label-valued-variable" are hard to define, as Ritchie said, or not, as
    Anton thinks Stallman said or would have said.

    So, label-variables are hard to define, but function-variables are not ?!?

    Depends on the level at which you want to define it.

    At the machine level, where semantics are (usually) defined for each instruction separately, a jump to a dynamic address (using a
    "label-variable") is not much different from a call to a dynamic address (using a "function-variable"), and the effect of the single instruction
    on the machine state is much the same as for the static address case.
    The higher-level effect on the further execution of the program is out
    of scope, whatever the actual value of the target address in the
    instruction.

    It is only if your machine has some semantics for instruction
    combinations, such as your VEC-LOOP pair, that you have to define what
    happens if a jump or call to some address leads to later executing only
    some of those instructions or executing them in the wrong order, such as trying to execute a LOOP without having executed a preceding VEC.

    At the higher programming-language level, the label case can be much
    harder to define and less useful than the function case, depending on
    the programming language and its abstract model of execution, and also depending on what compile-time checks you assume.

    Consider an imperative language such as C with no functions nested
    within other functions or other blocks (where by "block" I mean some syntactical construct that sets up its local context with local
    variables etc.). If you have a function-variable (that is, a pointer to
    a function) that actually refers to a function with the same parameter profile, it is easy to define the semantics of a call via this function variable: it is the same as for a call that names the referenced
    function statically, and such a call is always legal. Problems arise
    only if the function-variable has some invalid value such as NULL, or
    the address of a function with a different profile, or some code address
    that does not refer to (the start of) a function. Such invalid values
    can be prevented at compile time, except (usually) for NULL.

    In the same language setting, the semantics of a jump using a
    label-variable are easy to define only if the label-variable refers to a
    label in the same block as the jump. A jump from one block into another
    would mess up the context, omitting the set-up of the target block's
    context and/or omitting the tear-down of the source block's context. The further results of program execution are machine-dependent and so
    undefined behavior.

    A compiler could enforce the label-in-same-block rule, but it seems that
    GNU C does not do so.

    In a programming language that allows nested functions the same kind of context-crossing problems arise for function-variables. Traditional
    languages solve them by allowing, at compile-time, calls via function-variables only if it is certain that the containing context of
    the callee still exists (if the callee is nested), or by (expensively) preserving that context as a dynamically constructed closure. In either
    case, the caller's context never needs to be torn down to execute the
    call, differing from the jump case.

    In summary, jumps via label-variables are useful only for control
    transfers within one function, and do not help to build up a computation
    by combining several functions -- the main method of program design at present. In contrast, calls via function-variables are a useful
    extension to static calls, actually helping to combine several functions
    in a computation, as shown by the general adoption of
    class/object/method coding styles.

    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Tue Nov 11 19:58:55 2025
    From Newsgroup: comp.arch

    On 2025-11-08 23:08, John Levine wrote:
    According to Michael S <already5chosen@yahoo.com>:
    I would imagine that in old times the return instruction was less common
    than indirect addressing itself.

    On several of the machines I used a subroutine call stored the return
    address in the first word of the routine and branched to that address+1.
    The return was just an indirect jump.

    One such machine was the HP 2100; I used some of those.

    Stacks? What's a stack? We barely had registers.
    And indeed the Algol 60 compiler for the HP 2100 did not support
    recursion. My programs did real-time control, so I wrote a small non-preemptive but priority-driven multi-threading kernel. Thread switch
    was easy as there were very few registers and no stack. But you had to
    be careful because no subroutines were re-entrant.

    Speaking of indirect addressing, the HP 2100 had a special feature: it
    had a 64 KB address space, but with word addressing of 16-bit words, so addresses were only 15 bits, leaving the MSbit in each word free.

    When using indirect addressing there was an "indirect" bit in the
    instruction which, in the usual way, made the machine use the 16-bit
    content of the (directly) addressed word as the actual target address,
    but only if the MSbit of that content was zero. If the MSbit was one, it caused a further level of indirection, using the 15 other bits as the
    address of another word that again would contain the actual target
    address, if the MSbit of /that/ content was zero, and so on.

    So an indirect instruction could cause a chain of indirections which
    ended when an address-word had a zero in its MSbit. And the machine
    could get stuck in an eternal indirection loop, which IIRC happened to
    me once :-)
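
    A toy C sketch of that indirection chain (memory contents and addresses
    below are made up; as noted, a real machine could loop forever here):

    #include <stdint.h>
    #include <stdio.h>

    static uint16_t memory[32768];        /* 32K 16-bit words, 15-bit addresses */

    /* Follow the chain: while the MSbit of the fetched word is set, use the
       other 15 bits as the address of the next word; return the final target. */
    static uint16_t resolve_indirect(uint16_t addr)
    {
        uint16_t w = memory[addr & 0x7FFF];
        while (w & 0x8000)
            w = memory[w & 0x7FFF];
        return (uint16_t)(w & 0x7FFF);
    }

    int main(void)
    {
        memory[0100] = 0x8000 | 0200;     /* word 0100: indirect again via 0200 */
        memory[0200] = 0300;              /* word 0200: final target is 0300    */
        printf("target = %o\n", (unsigned)resolve_indirect(0100));  /* 300 */
        return 0;
    }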
    --
    Niklas Holsti

    niklas holsti tidorum fi
    . @ .

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Tue Nov 11 18:48:47 2025
    From Newsgroup: comp.arch

    Niklas Holsti <niklas.holsti@tidorum.invalid> writes:
    On 2025-11-08 23:08, John Levine wrote:
    According to Michael S <already5chosen@yahoo.com>:
    I would imagine that in old times return iinstruction was less common
    than indirect addressing itself.

    On several of the machines I used a subroutine call stored the return
    address in the first word of the routine and branched to that address+1.
    The return was just an indirect jump.

    One such machine was the HP 2100; I used some of those.

    Stacks? What's a stack? We barely had registers.
    And indeed the Algol 60 compiler for the HP 2100 did not support
    recursion. My programs did real-time control, so I wrote a small
    non-preemptive but priority-driven multi-threading kernel. Thread switch
    was easy as there were very few registers and no stack. But you had to
    be careful because no subroutines were re-entrant.

    Speaking of indirect addressing, the HP 2100 had a special feature: it
    had a 64 KB address space, but with word addressing of 16-bit words, so
    addresses were only 15 bits, leaving the MSbit in each word free.

    When using indirect addressing there was an "indirect" bit in the
    instruction which, in the usual way, made the machine use the 16-bit
    content of the (directly) addressed word as the actual target address,
    but only if the MSbit of that content was zero. If the MSbit was one, it
    caused a further level of indirection, using the 15 other bits as the
    address of another word that again would contain the actual target
    address, if the MSbit of /that/ content was zero, and so on.

    So an indirect instruction could cause a chain of indirections which
    ended when an address-word had a zero in its MSbit. And the machine
    could get stuck in an eternal indirection loop, which IIRC happened to
    me once :-)

    The Burroughs B3500 and sucessors had a similar feature. An
    instruction operand contained the address of the operand plus
    four control bits (BCD architecture). Two of the control bits
    could select one of three index registers that would be summed
    with the address (the index registers are signed, the address
    unsigned). The other two control bits specified the operand
    type (UN - Unsigned Numeric, SN - Signed Numeric,
    UA - Unsigned Alphanumeric, IA - Indirect Address).

    If the IA bit was set for an operand, the processor would read
    a new operand from the target address and process it as if it
    were an operand. This indirection continued until an operand
    specified a data type other than IA.

    The processor started a timer before each instruction, if the
    instruction execution time exceeded the timer value, the MCP
    would terminate the program.

    In the B3500 operands were six digits, and the control
    bits consumed the high-order digit, allowing addresses ranging
    from 000000 to 099999 (100 kilo digits). The B4700
    added extended operands which supported 000000 through 999999
    by placing an undigit (12 or 0xC) in the second digit position
    of the operand and extending the operand to 32 bits (8 BCD digits).

    The first digit still contained the operand type bits, the second
    digit the value 0xc and the remaining six digits were the
    program address.

    The V380 (upgraded B4900) extended further by supporting four
    additional index registers; if the second digit of the operand
    was 0xd, the data type index register bits selected IX4 through
    IX7.

    In all cases, a "segment" was limited to one million digits in
    size. Before the V380, a program was limited to a single segment;
    the V380 added an entirely new virtual memory subsystem (segment based)
    that supported 100,000 environments per process, with up to
    100 segments per environment. A single segment was still limited
    to 500KB, however, for backward binary compatibility with programs
    from 1965. Large programs (e.g. the COBOL compiler) used an
    operating system (MCP) provided overlay mechanism (the MCP cached
    overlays in other parts of main memory or on a fast RAMdisk).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Tue Nov 11 14:23:38 2025
    From Newsgroup: comp.arch

    Niklas Holsti wrote:
    On 2025-11-06 20:28, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-05 23:28, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
    ----------------
    But then you could get the problem of a longjmp to a setjmp value that
    is stale because the targeted function invocation (stack frame) is no
    longer there.

    But YOU had to pass the jumpbuf out of the setjump() scope.

    Now, YOU complain there is a hole in your own foot with a smoking gun
    in your own hand.

    That is not the issue. The question is if the semantics of "goto
    label-valued-variable" are hard to define, as Ritchie said, or not, as
    Anton thinks Stallman said or would have said.

    So, label-variables are hard to define, but function-variables are not
    ?!?

    Depends on the level at which you want to define it.

    At the machine level, where semantics are (usually) defined for each instruction separately, a jump to a dynamic address (using a "label-variable") is not much different from a call to a dynamic address (using a "function-variable"), and the effect of the single instruction
    on the machine state is much the same as for the static address case.
    The higher-level effect on the further execution of the program is out
    of scope, whatever the actual value of the target address in the instruction.

    It is only if your machine has some semantics for instruction
    combinations, such as your VEC-LOOP pair, that you have to define what happens if a jump or call to some address leads to later executing only
    some of those instructions or executing them in the wrong order, such as trying to execute a LOOP without having executed a preceding VEC.

    At the higher programming-language level, the label case can be much
    harder to define and less useful than the function case, depending on
    the programming language and its abstract model of execution, and also depending on what compile-time checks you assume.

    Consider an imperative language such as C with no functions nested
    within other functions or other blocks (where by "block" I mean some syntactical construct that sets up its local context with local
    variables etc.). If you have a function-variable (that is, a pointer to
    a function) that actually refers to a function with the same parameter profile, it is easy to define the semantics of a call via this function variable: it is the same as for a call that names the referenced
    function statically, and such a call is always legal. Problems arise
    only if the function-variable has some invalid value such as NULL, or
    the address of a function with a different profile, or some code address that does not refer to (the start of) a function. Such invalid values
    can be prevented at compile time, except (usually) for NULL.

    In the same language setting, the semantics of a jump using a
    label-variable are easy to define only if the label-variable refers to a label in the same block as the jump. A jump from one block into another would mess up the context, omitting the set-up of the target block's
    context and/or omitting the tear-down of the source block's context. The further results of program execution are machine-dependent and so
    undefined behavior.

    A compiler could enforce the label-in-same-block rule, but it seems that
    GNU C does not do so.

    In a programming language that allows nested functions the same kind of context-crossing problems arise for function-variables. Traditional languages solve them by allowing, at compile-time, calls via function-variables only if it is certain that the containing context of
    the callee still exists (if the callee is nested), or by (expensively) preserving that context as a dynamically constructed closure. In either case, the caller's context never needs to be torn down to execute the
    call, differing from the jump case.

    In summary, jumps via label-variables are useful only for control
    transfers within one function, and do not help to build up a computation
    by combining several functions -- the main method of program design at present. In contrast, calls via function-variables are a useful
    extension to static calls, actually helping to combine several functions
    in a computation, as shown by the general adoption of
    class/object/method coding styles.

    Niklas


    I was curious about the interaction between dynamic stack allocations
    and goto variables to see if it handled the block scoping correctly.
    Ada should have the same issues as C.
    It appears GCC x86-64 15.2 with -O3 does not properly recover
    stack space with dynamic goto's.

    Test1 allocates a dynamic sized buffer and has a static goto Loop
    for which GCC generates a jne .L6 to a mov rsp, rbx that recovers
    the stack allocation inside the {} block.

    Test2 is the same but does a goto *dest and GCC does not generate
    code to recover the inner {} block allocation. It just loops over
    the sub rsp, rbx so the stack space just grows.

    long Sub (long len, char buf[]);

    void Test1 (long len)
    {
        long ok;

    Loop:
        {
            char buf[len];

            ok = Sub (len, buf);
            if (ok)
                goto Loop;
        }
    }

    # Compilation provided by Compiler Explorer at https://godbolt.org/
    Test1(long):
    push rbp
    mov rbp, rsp
    push r13
    mov r13, rdi
    push r12
    lea r12, [rdi+15]
    push rbx
    shr r12, 4
    sal r12, 4
    sub rsp, 8
    jmp .L2
    .L6:
    mov rsp, rbx
    .L2:
    mov rbx, rsp
    sub rsp, r12
    mov rdi, r13
    mov rsi, rsp
    call Sub(long, char*)
    test rax, rax
    jne .L6
    lea rsp, [rbp-24]
    pop rbx
    pop r12
    pop r13
    pop rbp
    ret

    void Test2 (long len)
    {
        long ok;
        void *dest;

        dest = &&Loop;
    Loop:
        {
            char buf[len];

            ok = Sub (len, buf);
            if (ok)
                goto *dest;
        }
    }

    Test2(long):
    push rbp
    mov rbp, rsp
    push r12
    mov r12, rdi
    push rbx
    lea rbx, [rdi+15]
    shr rbx, 4
    sal rbx, 4
    .L8:
    sub rsp, rbx
    mov rdi, r12
    mov rsi, rsp
    call Sub(long, char*)
    test rax, rax
    jne .L8
    lea rsp, [rbp-16]
    pop rbx
    pop r12
    pop rbp
    ret






    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Nov 11 19:30:43 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Typical process for NaN boxing is to set the high order bits of the
    value which causes the value to appear to be a NaN at higher precision.

    Any FP value representable in lower precision can be exactly represented
    in higher precision.

    I have been thinking about using some of the high order bits of the NaN
    (eg bits 32 to 51) to indicate the precision of the boxed value.

    When My 66000 generates a NaN it inserts the cause in the 3 HoBs and
    inserts IP in the LoBs. Nothing prevents you from overwriting the NaN,
    but I thought it was best to point at the causing-instruction and an
    encoded "why" the nan was generated. The cause is a 3-bit index to the
    7 defined IEEE exceptions.

    There are rules when more than 1 NaN are an operand to an instruction
    designed to leave the more important NaN as the result. {Where more
    important is generally the first to be generated.}

    This
    would allow detection of the use of a lower precision value in
    arithmetic. Suppose a convert from single to double precision is being
    done, but the value to be converted is only half precision. If it were indicated by the NaN software might be able to fix the result.

    I think it is better to fix the SW that thinks a (half) is a (float).

    I also preserve the sign bit of the number in the NaN box.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Nov 11 19:46:39 2025
    From Newsgroup: comp.arch


    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-06 20:28, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-05 23:28, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
    ----------------
    But then you could get the problem of a longjmp to a setjmp value that
    is stale because the targeted function invocation (stack frame) is no
    longer there.

    But YOU had to pass the jumpbuf out of the setjump() scope.

    Now, YOU complain there is a hole in your own foot with a smoking gun
    in your own hand.

    That is not the issue. The question is if the semantics of "goto
    label-valued-variable" are hard to define, as Ritchie said, or not, as
    Anton thinks Stallman said or would have said.

    So, label-variables are hard to define, but function-variables are not ?!?

    Depends on the level at which you want to define it.

    At the machine level, where semantics are (usually) defined for each instruction separately, a jump to a dynamic address (using a "label-variable") is not much different from a call to a dynamic address (using a "function-variable"), and the effect of the single instruction
    on the machine state is much the same as for the static address case.

    Yes,

    The higher-level effect on the further execution of the program is out
    of scope, whatever the actual value of the target address in the instruction.

    A good point:

    It is only if your machine has some semantics for instruction
    combinations, such as your VEC-LOOP pair, that you have to define what happens if a jump or call to some address leads to later executing only
    some of those instructions or executing them in the wrong order, such as trying to execute a LOOP without having executed a preceding VEC.

    BTW, encountering a LOOP without encountering a VEC is a natural
    occurrence when returning from an exception or interrupt. The VEC
    register points at the VEC+1 instruction, from which it is easy to
    return to the VEC instruction.

    It is the very mechanism whereby vectorized and multi-lane execution
    becomes scalar so that the debugger only sees scalar instructions.

    At the higher programming-language level, the label case can be much
    harder to define and less useful than the function case, depending on
    the programming language and its abstract model of execution, and also depending on what compile-time checks you assume.

    And what block boundaries are preserved (scope).

    Consider an imperative language such as C with no functions nested
    within other functions or other blocks (where by "block" I mean some syntactical construct that sets up its local context with local
    variables etc.). If you have a function-variable (that is, a pointer to
    a function) that actually refers to a function with the same parameter profile,

    It is this parameter profile (argument list) which separates
    goto label[i];
    from
    value = function[i](argument list);

    The dynamic goto is expected, by the SW writer, to carry all of the local
    scope content to the new label--and yet none of it is specified. It is
    this local scope content which is (IS) precisely specified with the
    dynamic call.

    it is easy to define the semantics of a call via this function variable: it is the same as for a call that names the referenced
    function statically, and such a call is always legal. Problems arise
    only if the function-variable has some invalid value such as NULL, or
    the address of a function with a different profile, or some code address that does not refer to (the start of) a function. Such invalid values
    can be prevented at compile time, except (usually) for NULL.

    In the same language setting, the semantics of a jump using a
    label-variable are easy to define only if the label-variable refers to a label in the same block as the jump. A jump from one block into another would mess up the context, omitting the set-up of the target block's
    context and/or omitting the tear-down of the source block's context. The further results of program execution are machine-dependent and so
    undefined behavior.

    Or worse:: when said label-variable was "trashed" by some attack vector,
    the label-variable can transfer control to literally anywhere.

    A compiler could enforce the label-in-same-block rule, but it seems that
    GNU C does not do so.

    In a programming language that allows nested functions the same kind of context-crossing problems arise for function-variables. Traditional languages solve them by allowing, at compile-time, calls via function-variables only if it is certain that the containing context of
    the callee still exists (if the callee is nested), or by (expensively) preserving that context as a dynamically constructed closure. In either case, the caller's context never needs to be torn down to execute the
    call, differing from the jump case.

    In summary, jumps via label-variables are useful only for control
    transfers within one function, and do not help to build up a computation
    by combining several functions -- the main method of program design at present. In contrast, calls via function-variables are a useful
    extension to static calls, actually helping to combine several functions
    in a computation, as shown by the general adoption of
    class/object/method coding styles.

    Thanks for your clear wording on why and why not.

    Niklas

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Nov 11 20:44:47 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Test1 allocates a dynamic sized buffer and has a static goto Loop
    for which GCC generates a jne .L6 to a mov rsp, rbx that recovers
    the stack allocation inside the {} block.

    Test2 is the same but does a goto *dest and GCC does not generate
    code to recover the inner {} block allocation. It just loops over
    the sub rsp, rbx so the stack space just grows.

    Interestingly, gcc optimizes the indirect branch with a constant
    target into a direct branch, but then does not continue with the same
    code as you get with a plain goto.

    void Test2 (long len)
    {
    long ok;
    void *dest;

    dest = &&Loop;
    Loop:
    {
    char buf[len];

    ok = Sub (len, buf);
    if (ok)
    goto *dest;
    }
    }

    Test2(long):
    push rbp
    mov rbp, rsp
    push r12
    mov r12, rdi
    push rbx
    lea rbx, [rdi+15]
    shr rbx, 4
    sal rbx, 4
    .L8:
    sub rsp, rbx
    mov rdi, r12
    mov rsi, rsp
    call Sub(long, char*)
    test rax, rax
    jne .L8
    lea rsp, [rbp-16]
    pop rbx
    pop r12
    pop rbp
    ret

    Interesting that this bug has not been fixed in the >33 years that labels-as-values have been in gcc; I don't know how long these
    dynamically sized arrays have been in gcc, but IIRC alloca(), a
    similar feature, has been available at least as long as
    labels-as-values. The bug has apparently been avoided or worked
    around by the users of labels-as-values (e.g., Gforth does not use
    alloca or dynamically-sized arrays in the function that contains all
    the taken labels and all the "goto *"s).

    As long as all taken labels have the same stack depth, the bugfix does
    not look particularly hard: just put code before each goto * that
    adjusts the stack depth to the depth of these labels.

    Things become more interesting if there are labels with different
    stack depths, because labels are stored in "void *" variables, and
    there is not enough room for a target and a stack depth. One can use
    the same approach as is used in Test1, however: have the stack depth
    for a specific target in some location, and have a copy from that
    location to the stack pointer right behind the label.

    ...
    jmp .L2
    .L6:
    mov rsp, rbx
    .L2:
    ...
    jne .L6

    All the code that works now would not need these extra copy
    intructions, so the bugfix should special-case the case where all the
    targets have the same depth.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Tue Nov 11 21:10:09 2025
    From Newsgroup: comp.arch

    According to Niklas Holsti <niklas.holsti@tidorum.invalid>:
    Speaking of indirect addressing, the HP 2100 had a special feature: it
    had a 64 KB address space, but with word addressing of 16-bit words, so
    addresses were only 15 bits, leaving the MSbit in each word free.

    [multi-level indirect chains]

    That was quite common back in the day.

    The Data General Nova and Varian 620i (both popular for OEM
    applications) did exactly the same thing, 15 bit addresses with the
    high bit saying indirect.
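
    A hedged C sketch of that convention: one 16-bit word per memory cell,
    the MSbit meaning "indirect", resolved by chasing the chain. The names
    and the explicit depth limit are made up for illustration; the real
    machines differed in how, or whether, they bounded the chain, as
    discussed below.

    #include <stdint.h>

    /* Follow an indirect chain: while the MSbit is set, the low 15 bits
       name another word holding the (possibly again indirect) address. */
    uint16_t resolve(const uint16_t mem[32768], uint16_t word, int max_depth)
    {
        while ((word & 0x8000) && max_depth-- > 0)
            word = mem[word & 0x7FFF];
        return word & 0x7FFF;    /* final 15-bit effective address */
    }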

    The PDP-6/10 was a 36 bit machine with 18 bit addresses and a rather overimplemented addressing scheme -- each instruction had an address, an indirect bit, and an index register, so it added the address to the index register (if the register number wasn't zero), then if the indirect bit was set,
    fetch the addressed word and interpret its address, indirect bit, and index register the same way, ad infinitum.

    An interesting question is what happened if a computer got into an indirect loop. The Nova just hung unless it had the memory protection option which limited it to two levels of indirection. The PDP-6/10 could take an interrupt before each address calculation, which restarted when the interrupt returned. One day when I was feeling bored I wrote a program that did an ever longer indirect chain until the program stalled because it took longer than a clock interrupt time. The system was fine, only my program stalled. Dunno what
    the 620i did, I never ran into that particular bug and the manual doesn't say.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Nov 11 21:18:32 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:
    Typical process for NaN boxing is to set the high order bits of the
    value which causes the value to appear to be a NaN at higher precision.
    I have been thinking about using some of the high order bits of the NaN
    (eg bits 32 to 51) to indicate the precision of the boxed value. This
    would allow detection of the use of a lower precision value in
    arithmetic. Suppose a convert from single to double precision is being
    done, but the value to be converted is only half precision.

    Do you mean a type mismatch, a conversion, or digits lost due to
    cancellation?

    If it were
    indicated by the NaN software might be able to fix the result.

    Fixing a result after a NaN has occurred is too late, I think.

    I also
    preserve the sign bit of the number in the NaN box.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Nov 11 15:55:16 2025
    From Newsgroup: comp.arch

    On 11/11/2025 11:46 AM, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:


    snip
    It is only if your machine has some semantics for instruction
    combinations, such as your VEC-LOOP pair, that you have to define what
    happens if a jump or call to some address leads to later executing only
    some of those instructions or executing them in the wrong order, such as
    trying to execute a LOOP without having executed a preceding VEC.

    BTW, encountering a LOOP without encountering a VEC is a natural
    occurrence when returning from exception or interrupt. The VEC
    register points at the VEC+1 instruction which is easy to return
    to the VEC instruction.

    OK, but what if, say through an errant pointer, the code, totally
    unrelated to the VEC, jumps somewhere in the middle of a VEC/LOOP pair?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Nov 11 16:06:46 2025
    From Newsgroup: comp.arch

    On 11/11/2025 1:10 PM, John Levine wrote:
    According to Niklas Holsti <niklas.holsti@tidorum.invalid>:
    Speaking of indirect addressing, the HP 2100 had a special feature: it
    had a 64 KB address space, but with word addressing of 16-bit words, so
    addresses were only 15 bits, leaving the MSbit in each word free.

    [multi-level indirect chains]

    That was quite common back in the day.

    Yes, as I mentioned earlier in this thread, so did the Univac 1100 series.
    The Data General Nova and Varian 620i (both popular for OEM
    applications) did exactly the same thing, 15 bit addresses with the
    high bit saying indirect.

    The PDP-6/10 was a 36 bit machine with 18 bit addresses and a rather overimplemented addressing scheme -- each instruction had an address, an indirect bit, and an index register, so it added the address to the index register (if the register number wasn't zero), then if the indirect bit was set,
    fetch the addressed word and interpret its address, indirect bit, and index register the same way, ad infinitum.

    Yup. Similarly the 1100 series, a 36 bit machine with 18 bit addresses,
    had all of those features, plus one more. If the index register
    increment bit was set (in the instruction itself, or in each of the
    indirect words), the upper 18 bits of the index register were added
    (after indexing) to the lower 18 bits. This allowed some really
    "interesting" possible code when this was within a loop. :-)



    An interesting question is what happened if a computer got into an indirect loop.


    Yup. The 1100 prevented an infinite loop by having a hardware timer for
    each instruction. If the timer expired, an illegal operation exception occurred.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 12 00:31:24 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/11/2025 11:46 AM, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:


    snip
    It is only if your machine has some semantics for instruction
    combinations, such as your VEC-LOOP pair, that you have to define what
    happens if a jump or call to some address leads to later executing only
    some of those instructions or executing them in the wrong order, such as
    trying to execute a LOOP without having executed a preceding VEC.

    BTW, encountering a LOOP without encountering a VEC is a natural
    occurrence when returning from exception or interrupt. The VEC
    register points at the VEC+1 instruction which is easy to return
    to the VEC instruction.

    OK, but what if, say through an errant pointer, the code, totally
    unrelated to the VEC, jumps somewhere in the middle of a VEC/LOOP pair?

    All taken branches clear the V-bit associated with vectorization.
    So encountering the LOOP instruction would raise an exception.

    Flow control WITHIN a VEC-LOOP pair is by predication-only.
    Exception Control Transfer is special in this regards.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Nov 11 17:18:11 2025
    From Newsgroup: comp.arch

    On 11/11/2025 4:31 PM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/11/2025 11:46 AM, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:


    snip
    It is only if your machine has some semantics for instruction
    combinations, such as your VEC-LOOP pair, that you have to define what
    happens if a jump or call to some address leads to later executing only
    some of those instructions or executing them in the wrong order, such as
    trying to execute a LOOP without having executed a preceding VEC.

    BTW, encountering a LOOP without encountering a VEC is a natural
    occurrence when returning from exception or interrupt. The VEC
    register points at the VEC+1 instruction which is easy to return
    to the VEC instruction.

    OK, but what if, say through an errant pointer, the code, totally
    unrelated to the VEC, jumps somewhere in the middle of a VEC/LOOP pair?

    All taken branches clear the V-bit associated with vectorization.
    So encountering the LOOP instruction would raise an exception.

    Seems like the right thing to do. I believe this resolves Niklas's issue.

    Flow control WITHIN a VEC-LOOP pair is by predication-only.
    Exception Control Transfer is special in this regards.

    Makes sense.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Tue Nov 11 21:16:22 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Test1 allocates a dynamic sized buffer and has a static goto Loop
    for which GCC generates a jne .L6 to a mov rsp, rbx that recovers
    the stack allocation inside the {} block.

    Test2 is the same but does a goto *dest and GCC does not generate
    code to recover the inner {} block allocation. It just loops over
    the sub rsp, rbx so the stack space just grows.

    Interestingly, gcc optimizes the indirect branch with a constant
    target into a direct branch, but then does not continue with the same
    code as you get with a plain goto.

    void Test2 (long len)
    {
    long ok;
    void *dest;

    dest = &&Loop;
    Loop:
    {
    char buf[len];

    ok = Sub (len, buf);
    if (ok)
    goto *dest;
    }
    }

    Test2(long):
    push rbp
    mov rbp, rsp
    push r12
    mov r12, rdi
    push rbx
    lea rbx, [rdi+15]
    shr rbx, 4
    sal rbx, 4
    .L8:
    sub rsp, rbx
    mov rdi, r12
    mov rsi, rsp
    call Sub(long, char*)
    test rax, rax
    jne .L8
    lea rsp, [rbp-16]
    pop rbx
    pop r12
    pop rbp
    ret

    Interesting that this bug has not been fixed in the >33 years that labels-as-values have been in gcc; I don't know how long these
    dynamically sized arrays have been in gcc, but IIRC alloca(), a
    similar feature, has been available at least as long as
    labels-as-values. The bug has apparently been avoided or worked
    around by the users of labels-as-values (e.g., Gforth does not use
    alloca or dynamically-sized arrays in the function that contains all
    the taken labels and all the "goto *"s.

    alloca is not required to recover storage at the {} block level.
    MS C does not recover alloca space until the subroutine returns.

    But when they added dynamic allocation to C as a first class feature
    I figured it should recover storage at the end of a {} block,
    and I wondered if the superficially non-deterministic nature of
    goto variable would be a problem.

    As long as all taken labels have the same stack depth, the bugfix does
    not look particularly hard: just put code before each goto * that
    adjusts the stack depth to the depth of these labels.

    Things become more interesting if there are labels with different
    stack depths, because labels are stored in "void *" variables, and
    there is not enough room for a target and a stack depth. One can use
    the same approach as is used in Test1, however: have the stack depth
    for a specific target in some location, and have a copy from that
    location to the stack pointer right behind the label.

    ....
    jmp .L2
    .L6:
    mov rsp, rbx
    .L2:
    ....
    jne .L6

    All the code that works now would not need these extra copy
    instructions, so the bugfix should special-case the case where all the
    targets have the same depth.

    - anton

    Below in Test3 I replace the goto variable with a switch statement
    arranged to be nondeterministic, and it does get it right.
    I suggest GCC forgot to treat the goto variable as equivalent to a switch statement and threw up its hands and treated the buffer as an alloca.

    This all relates to Niklas's comments as to why the label variables must
    all be within the current context, so it knows when to recover storage.
    If the language had destructors, the goto variable would have to call them,
    which alloca also does not deal with.

    long Sub (long len, char buf[]);

    void Test3 (long len)
    {
    long ok, dest;

    dest = 0;
    Loop:
    {
    char buf[len];

    ok = Sub (len, buf);
    if (ok)
    dest = 1;

    switch (dest)
    {
    case 0:
    goto Loop;
    case 1:
    goto Out;
    }
    Out:
    ;
    }
    }

    # Compilation provided by Compiler Explorer at https://godbolt.org/
    Test3(long):
    push rbp
    mov rbp, rsp
    push r13
    mov r13, rdi
    push r12
    lea r12, [rdi+15]
    push rbx
    shr r12, 4
    sal r12, 4
    sub rsp, 8
    jmp .L2
    .L6:
    mov rsp, rbx
    .L2:
    mov rbx, rsp
    sub rsp, r12
    mov rdi, r13
    mov rsi, rsp
    call Sub(long, char*)
    test rax, rax
    je .L6
    lea rsp, [rbp-24]
    pop rbx
    pop r12
    pop r13
    pop rbp
    ret




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Nov 11 21:42:49 2025
    From Newsgroup: comp.arch

    On 2025-11-11 2:30 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Typical process for NaN boxing is to set the high order bits of the
    value which causes the value to appear to be a NaN at higher precision.

    Any FP value representable in lower precision can be exactly represented
    in higher precision.

    I have been thinking about using some of the high order bits of the NaN
    (eg bits 32 to 51) to indicate the precision of the boxed value.

    When My 66000 generates a NaN it inserts the cause in the 3 HoBs and
    inserts IP in the LoBs. Nothing prevents you from overwriting the NaN,
    but I thought it was best to point at the causing-instruction and an
    encoded "why" the nan was generated. The cause is a 3-bit index to the
    7 defined IEEE exceptions.

    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.
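
    A minimal C sketch of the "cause in the 3 LoBs" idea for binary64, just
    to make the bit layout concrete; the cause codes themselves are whatever
    the package defines, and this is not taken from either implementation.

    #include <stdint.h>
    #include <string.h>

    double make_nan_with_cause(unsigned cause3)
    {
        uint64_t bits = 0x7FF8000000000000ull     /* quiet NaN               */
                      | (uint64_t)(cause3 & 0x7); /* 3-bit cause in the LoBs */
        double d;
        memcpy(&d, &bits, sizeof d);
        return d;
    }
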

    There are rules when more than 1 NaN are an operand to an instruction designed to leave the more important NaN as the result. {Where more
    important is generally the first to be generated.}

    Hopefully the package follows the rules correctly. NaN operation is one
    thing not tested yet.

    This
    would allow detection of the use of a lower precision value in
    arithmetic. Suppose a convert from single to double precision is being
    done, but the value to be converted is only half precision. If it were
    indicated by the NaN software might be able to fix the result.

    I think it is better to fix the SW that thinks a (half) is a (float).

    It would be better, but some software is so complex that the values
    coming in may be unknown. The SW does not really need to croak if it's a
    lower-precision value, as such values are always representable in a
    higher precision.
    I also
    preserve the sign bit of the number in the NaN box.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Nov 11 21:46:02 2025
    From Newsgroup: comp.arch

    On 2025-11-11 4:18 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Typical process for NaN boxing is to set the high order bits of the
    value which causes the value to appear to be a NaN at higher precision.
    I have been thinking about using some of the high order bits of the NaN
    (eg bits 32 to 51) to indicate the precision of the boxed value. This
    would allow detection of the use of a lower precision value in
    arithmetic. Suppose a convert from single to double precision is being
    done, but the value to be converted is only half precision.

    Do you mean a type mismatch, a conversion, or digits lost due to cancellation?

    It would be an input type mismatch.
    If it were
    indicated by the NaN software might be able to fix the result.

    Fixing a result after an NaN has occurred is too late, I think.

    I suppose the float package could always just automatically upgrade the precision from lower to higher when it goes to do the calculation. But
    maybe with a trace warning. It would be able to if the precision were indicated in the NaN.

    I also
    preserve the sign bit of the number in the NaN box.
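
    A hypothetical C sketch of the boxing scheme being described: the
    half-precision bit pattern sits in the low bits, the binary64 exponent
    field is forced to all-ones so the value reads as a NaN at the higher
    precision, the sign bit is preserved, and a made-up tag in the bits
    32..51 region records the boxed precision. The tag values and exact
    layout here are illustrative only.

    #include <stdint.h>

    #define BOX_TAG_HALF   1ull   /* illustrative tag values */
    #define BOX_TAG_SINGLE 2ull

    uint64_t box_half(uint16_t h)
    {
        uint64_t box = 0x7FF8000000000000ull        /* quiet NaN at binary64  */
                     | (BOX_TAG_HALF << 32)         /* precision tag          */
                     | (uint64_t)(h & 0x7FFF);      /* half-precision payload */
        if (h & 0x8000)
            box |= 1ull << 63;                      /* preserve the sign bit  */
        return box;
    }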


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Nov 11 21:34:08 2025
    From Newsgroup: comp.arch

    On 11/11/2025 6:03 AM, Michael S wrote:
    On Tue, 11 Nov 2025 04:44:48 -0600
    BGB <cr88192@gmail.com> wrote:

    On 11/11/2025 4:02 AM, Michael S wrote:
    On Mon, 10 Nov 2025 21:25:47 -0600
    BGB <cr88192@gmail.com> wrote:

    On 11/10/2025 4:08 PM, Michael S wrote:
    On Mon, 10 Nov 2025 13:54:23 -0600
    BGB <cr88192@gmail.com> wrote:

    On 11/10/2025 1:16 AM, Terje Mathisen wrote:
    BGB wrote:
    DIV uses Newton-Raphson
    The process of converging is a lot more fiddly than with Binary
    FP. Partly as the strategy for generating the initial guess is
    far less accurate.

    So, it first uses a loop with hard-coded checks and scales to
    get it in the general area, before then letting N-R take over.
    If the value isn't close enough (seemingly +/- 25% or so), N-R
    flies off into space.

    Namely:
       Exponent is wrong:
         Scale by factors of 2 until correct;
       Off by more than 50%, scale by +/- 25%;
       Off by more than 25%, scale by +/- 12.5%;
       Else: Good enough, let normal N-R take over.

    My possibly naive idea would extract the top 9-15 digits from
    divisor and dividend, convert both to binary FP, do the division
    and convert back.

    That would reduce the NR step to two or three iterations, right?

    After adding code to feed to convert to/from 'double', and using
    this for initial reciprocal and square-root:
    DIV gets around 50% faster: ~ 1.5 MHz (~ 12x slower than
    MUL);

    That is your timing for Decimal128 on modern desktop PC?
    Dependent divisions or independent?
    Even for dependent, it sounds slow.


    Modern-ish...


    Zen2 ?
    I consider it the last of non-modern. Zen3 and Ice Lake are first
    of modern. 128by64 bit integer division on Zen2 is still quite slow
    and overall uArch is even less advanced than 10 y.o. Intel Skylake.
    In majority of real-world workloads it's partially compensated by
    Zen2 bigger L3 cache. In our case big cache does not help.
    But even last non-modern CPU shall be capable to divide faster than
    suggested by your numbers.


    Zen+

    Or, a slightly tweaked version of Zen1.


    It is very well possible to do big integer divide faster than this.
    Such as via shift-and-add.

    But, as for decimal, this makes it harder.


    I could do long division, but this is a much more complicated
    algorithm (versus using Newton-Raphson).

    But, N-R is slow as it is basically a bunch of operations, which are
    granted themselves, each kinda slow.



    I am running a CPU type that was originally released 7 years ago,
    with slower RAM than it was designed to work with.


    Did you try to compare against brute force calculation using GMP?
    https://gmplib.org/
    I.e. assuming that num < den < 10*num, use GMP to calculate 40
    decimal digits of intermediate result y as follows:
    Numx = num * 1e40;
    y = Numx/den;
    Yi = y / 1e6, Yf = y % 1e6 (this step does not require GMP, figure
    out why).
    If Yf != 5e5 then you are finished. Only in the extremely rare case (1 in
    a million) of Yf == 5e5 will you have to calculate the remainder of
    Numx/den to find the correct rounding.
    Somehow, I suspect that on a modern PC even a non-optimized method
    like the above will be faster than 670 usec.
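
    A rough GMP sketch of the step above, assuming num and den are already
    held as mpz_t integers; the rounding check on Yf is left out.

    #include <gmp.h>

    /* y = floor(num * 10^40 / den), i.e. ~40 decimal digits of num/den */
    void div_approx_40(mpz_t y, const mpz_t num, const mpz_t den)
    {
        mpz_t numx, p40;
        mpz_inits(numx, p40, NULL);
        mpz_ui_pow_ui(p40, 10, 40);    /* p40  = 10^40        */
        mpz_mul(numx, num, p40);       /* numx = num * 10^40  */
        mpz_tdiv_q(y, numx, den);      /* y    = numx / den   */
        mpz_clears(numx, p40, NULL);
    }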




    Well, first step is building with GCC rather than MSVC...

    It would appear that it gets roughly 79% faster when built with
    GCC. So, around 2 million divides per second.



    As for GMP, dividing two 40 digit numbers:
    22 million per second.
    If I do both a divide and a remainder:
    16 million.

    I don't really get what you are wanting me to measure exactly
    though...


    I want you to measure division of 74-digit integer by 34-digit
    integer, because it is the slowest part [of brute force
    implementation] of Decimal128 division. The rest of division is
    approximately the same as multiplication.
    So, [unoptimized] Decimal128 division time should be no worse than
    t1+t2, where t1 is duration of Decimal128 multiplication and t2 is
    duration of above-mentioned integer division. An estimate is
    pessimistic, because post-division normalization tends to be simpler
    than post-multiplication normalization.
    Optimized division would be faster yet.


    If it is a big-integer divide, this is not quite the same thing.

    And, if I were to use big-integer divide (probably not via GMP
    though,

    Certainly not via GMP in final product. But doing 1st version via GMP
    makes perfect sense.


    GMP is only really an option for targets where GMP exists;
    Needed to jump over to GCC in WSL just to test GMP here.

    If avoidable, you don't want to use anything beyond the C standard
    library, and ideally limit things to a C95 style dialect for maximum portability.

    Granted, it does appear like the GMP divider is faster than expected.
    Like, possibly something faster than "ye olde shift-and-subtract".




    Though, can note a curious property:
    This code is around 79% faster when built with GCC vs MSVC;
    In GCC, the relative speed of MUL and ADD trade places:
    In MSVC, MUL is faster;
    In GCC, ADD is faster.

    Though, the code in question tends to frequently use struct members
    directly, rather than caching multiply-accessed struct members in local variables. MSVC tends not to fully optimize away this sort of thing,
    whereas GCC tends to act as-if the struct members had in-fact been
    cached in local variables.


    this would be too big of a dependency), there is still the
    issue of efficiently converting between big-integer and the "groups
    of 9 digits in 32-bits" format.

    No, no, no. Not "group of 9 digits"! Plain unadulterated binary. 64
    binary 'digits' per 64-bit word.


    Alas, the code was written mostly to use 9-digit groupings, and going
    between 9-digit groupings and 128-bit integers is a bigger chunk of code
    than I want to have for this.

    This would mean an additional ~ 500 LOC, plus probably whatever code I
    need to do a semi-fast 256 by 128 bit integer divider.




    This is partly why I removed the BID code:
    At first, it seemed like the DPD and BID converters were similar
    speed; But, turns out I was still testing the DPD converter, and
    in-fact the BID converter was significantly slower.


    DPD-specific code and algorithms make sense for multiplication.
    They likely makes sense for addition/subtraction as well, I didn't try
    to think deeply about it.
    But for division I wouldn't bother with DPD-specific things. Just
    convert mantissa from DPD to binary, then divide, normalize, round then convert back.


    It is the 9-digit-decimal <-> Large Binary Integer converter step that
    is the main issue here.

    Going to/from 128-bit integer adds a few "there be dragons here" issues regarding performance.

    At the moment, I don't have a fast (and correct) converter between these
    two representations (that also does not rely on any external libraries
    or similar; or nothing outside of the C standard library).




    Like, if you need to crack 128 bits into 9-digit chunks using 128-bit
    divide, and if the 128-bit divider in question is a shift-and-subtract
    loop, this sucks.

    There are faster ways to do multiply by powers of 10, but divide by powers-of-10 is still a harder problem at the moment.
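
    For what it is worth, dividing by 10^9 does not need a general 128-bit
    divider if the number is kept as an array of 32-bit limbs; a sketch,
    with made-up names, of peeling off one 9-digit group per pass:

    #include <stdint.h>

    /* Divide a little-endian array of 32-bit limbs by 10^9 in place and
       return the remainder (the low 9 decimal digits). */
    uint32_t div_limbs_by_1e9(uint32_t *limb, int n)
    {
        uint64_t rem = 0;
        int i;
        for (i = n - 1; i >= 0; i--) {            /* high limb first */
            uint64_t cur = (rem << 32) | limb[i];
            limb[i] = (uint32_t)(cur / 1000000000u);
            rem     = cur % 1000000000u;
        }
        return (uint32_t)rem;
    }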

    Well, and also there is the annoyance that it is difficult to write an efficient 128-bit integer multiply if staying within the limits of
    portable C95.
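
    The usual workaround is to split the operands into 32-bit halves; a
    sketch, assuming an unsigned 64-bit type is available at all (which
    strict C95 does not guarantee, and which is part of the annoyance):

    typedef unsigned long long u64;   /* assumption: a 64-bit unsigned type */

    /* Full 64x64 -> 128-bit unsigned multiply built from 32x32 -> 64 pieces. */
    void mul64x64_128(u64 a, u64 b, u64 *hi, u64 *lo)
    {
        u64 a0 = a & 0xFFFFFFFFu, a1 = a >> 32;
        u64 b0 = b & 0xFFFFFFFFu, b1 = b >> 32;
        u64 p00 = a0 * b0, p01 = a0 * b1;
        u64 p10 = a1 * b0, p11 = a1 * b1;
        u64 mid = p01 + (p00 >> 32) + (p10 & 0xFFFFFFFFu);   /* cannot overflow */
        *lo = (mid << 32) | (p00 & 0xFFFFFFFFu);
        *hi = p11 + (p10 >> 32) + (mid >> 32);
    }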


    ...



    Goes off and tries a few things:
    128-bit integer divider;
    Various attempts at decimal long divide;
    ...

    Thus far, things have either not worked correctly, or have ended up
    slower than the existing Newton-Raphson divider.


    the most promising option would be Radix-10e9 long-division, but
    couldn't get this working thus far.

    Did also try Radix-10 long division (working on 72 digit sequences), but
    this was slower than the existing N-R divider.


    One possibility could be to try doing divide with Radix-10 in an
    unpacked BCD variant (likely using bytes from 0..9). Here, compare and
    subtract would be slower, but shifting could be faster, and allows a
    faster way (lookup tables) to find "A goes into B, N times".

    I still don't have much confidence in it though.


    Radix-10e9 has a higher chance of OK performance, if I could get the long-division algo to work correctly with it. Thus far, I was having difficulty getting it to give the correct answer. Integer divide was
    tending to overshoot the "A goes into B N times" logic, and trying to
    fudge it (eg, by adding 1 to the initial divisor) wasn't really
    working; kinda need an accurate answer here, and a reliable way to scale
    and add the divisor, ...


    Granted, one possibility could be to expand out each group of 9 digits
    to 64 bits, so effectively it has an intermediate 10 decimal digits of headroom (or two 10e9 "digits").

    But, yeah, long-division is a lot more of a PITA than N-R or shift-and-subtract.



    And, if I were going to do BID, would make more sense to do it as its
    own thing, and build it mostly around 128-bit integer math.


    But, in this case, I had decided to experiment with DPD.


    Most likely, in this case if I wanted faster divide, that also played
    well with the existing format, I would need to do long division or
    similar.





    If I compare against the IBM decNumber library:
    Multiply: 14 million.
    Divide: 7 million

    The decNumber library doesn't appear to have a square-root
    function...


    Granted, there are possibly faster ways to do divide, versus using
    Newton-Raphson in this case...

    It was not the point that I could pull the fastest possible
    implementation out of thin air. But, does appear I am beating
    decNumber at least for multiply performance and similar.




    Can note that while decNumber exists, at the moment, it is over 10x
    more code...





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Wed Nov 12 06:20:53 2025
    From Newsgroup: comp.arch

    In article <1762377694-5857@newsgrouper.org>,
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
    My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
    that can supply constants and perform operand routing. Within this
    range; instruction<8:5> specify the following table:

    0 0 0 0 +Src1 +Src2
    0 0 0 1 +Src1 -Src2
    0 0 1 0 -Src1 +Src2
    0 0 1 1 -Src1 -Src2
    0 1 0 0 +Src1 +imm5
    0 1 0 1 +Imm5 +Src2
    0 1 1 0 -Src1 -Imm5
    0 1 1 1 +Imm5 -Src2
    1 0 0 0 +Src1 Imm32
    1 0 0 1 Imm32 +Src2
    1 0 1 0 -Src1 Imm32
    1 0 1 1 Imm32 -Src2
    1 1 0 0 +Src1 Imm64
    1 1 0 1 Imm64 +Src2
    1 1 1 0 -Src1 Imm64
    1 1 1 1 Imm64 -Src2

    Here we have access to {5, 32, 64}-bit constants, 16-bit constants
    come from different OpCodes.

    Imm5 are the register specifier bits: range {-31..31} for integer and
    logical, range {-15.5..15.5} for floating point.

    For FP, Arm32 has an 8-bit immediate turned into an FP number as follows:

    sign = imm8<7>;
    exp = NOT(imm8<6>):Replicate(imm8<6>,E-3):imm8<5:4>;
    frac = imm8<3:0>:Zeros(F-4);
    result = sign : exp : frac;
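
    A C rendering of that expansion rule for single precision (E=8, F=23);
    a sketch to make the pseudo-code concrete, not lifted from any Arm
    source.

    #include <stdint.h>
    #include <string.h>

    float vfp_expand_imm8(uint8_t imm8)
    {
        uint32_t sign = (imm8 >> 7) & 1;
        uint32_t b6   = (imm8 >> 6) & 1;
        /* exp = NOT(b6) : Replicate(b6, 5) : imm8<5:4>   (8 bits) */
        uint32_t exp  = ((b6 ^ 1) << 7) | ((b6 ? 0x1Fu : 0) << 2) | ((imm8 >> 4) & 3);
        uint32_t frac = (uint32_t)(imm8 & 0xF) << 19;   /* imm8<3:0> : Zeros(19) */
        uint32_t bits = (sign << 31) | (exp << 23) | frac;
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;    /* e.g. imm8 = 0x70 decodes to 1.0f */
    }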

    For Float, exp[7:0] can be 0x80-0x83 or 0x7c-0x7f, which is 2^1 through 2^4
    and 2^-3 through 2^0. And the mantissa upper 4 bits are from the immediate field. Note that 0.0 is not encodeable, and I'm going to assume you
    don't need it either.

    For your FP, the sign comes from elsewhere, so you have 5 bits for the
    FP number. I suggest you use the Arm32 encoding for the exponent (using
    3 bits), and then set the upper 2 bits of the mantissa from the remaining
    two immediate bits.

    This encodes integers from 1.0 through 8.0, and can also encode 10.0, 12.0, 14.0, 16.0, 20.0, 24.0, and 28.0. And it can do 0.5, 1.5, 2.5, 3.5.
    And it can encode 0.125 and 0.25.

    This encoding makes a lot of sense from ease of decode. However, it
    would be nice to be able to encode 100.0, 1000.0 and .1, .01 and .001, each
    of which is likely to be more useful than 12.0 or 3.5.

    From a compiler standpoint, having arbitrary constants is perfectly fine,
    it can just look up if it's available. So you can make 1000.0 and .001
    and PI and lg2(e) and ln(2), and whatever available, if you want.
    GCC looks up Arm64 integer 13-bit immediates in a hashtable--the encoding
    is almost a one-way function, so it's just faster to look it up rather than
    try to figure out if 0xaaaaaaaa is encodeable by inspecting the value.
    So something similar could be done for FP constants. Since the values will
    be fixed, a perfect hash can be created ensuring it's a fast lookup.

    Kent
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Nov 12 07:19:36 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-11-11 4:18 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Typical process for NaN boxing is to set the high order bits of the
    value which causes the value to appear to be a NaN at higher precision.
    I have been thinking about using some of the high order bits of the NaN
    (eg bits 32 to 51) to indicate the precision of the boxed value. This
    would allow detection of the use of a lower precision value in
    arithmetic. Suppose a convert from single to double precision is being
    done, but the value to be converted is only half precision.

    Do you mean a type mismatch, a conversion, or digits lost due to
    cancellation?

    It would be an input type mismatch.

    I think this can only happen when software is buggy; compilers should
    deal with it, unless the user intentionally accesses data with
    the wrong type.

    If it were
    indicated by the NaN software might be able to fix the result.

    Fixing a result after an NaN has occurred is too late, I think.

    I suppose the float package could always just automatically upgrade the precision from lower to higher when it goes to do the calculation. But
    maybe with a trace warning. It would be able to if the precision were indicated in the NaN.

    I have implemented a few warning about conversions in gfortran.
    For example, -Wconversion-extra gives you, for the program

    program main
    print *,0.3333333333
    end program main

    the warning

    2 | print *,0.3333333333
    | 1
    Warning: Non-significant digits in 'REAL(4)' number at (1), maybe incorrect KIND [-Wconversion-extra]

    But my favorite is

    3 | print *,a**(3/5)
    | 1
    Warning: Integer division truncated to constant '0' at (1) [-Winteger-division]

    which (presumably) has caught that particular idiom in a few codes.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Nov 12 08:01:09 2025
    From Newsgroup: comp.arch

    Kent Dickey <kegs@provalid.com> schrieb:

    For FP, Arm32 has an 8-bit immediate turned into an FP number as follows:

    sign = imm8<7>;
    exp = NOT(imm8<6>):Replicate(imm8<6>,E-3):imm8<5:4>;
    frac = imm8<3:0>:Zeros(F-4);
    result = sign : exp : frac;

    For Float, exp[7:0] can be 0x80-0x83 or 0x7c-0x7f, which is 2^1 through 2^4 and 2^-3 through 2^0. And the mantissa upper 4 bits are from the immediate field. Note that 0.0 is not encodeable, and I'm going to assume you
    don't need it either.

    Looking at the statistics upthread, 0.0 is the most common floating
    point constant for My 66000 code.

    For your FP, the sign comes from elsewhere, so you have 5 bits for the
    FP number. I suggest you use the Arm32 encoding for the exponent (using
    3 bits), and then set the upper 2 bits of the mantissa from the remaining
    two immediate bits.

    This encodes integers from 1.0 through 8.0, and can also encode 10.0, 12.0, 14.0, 16.0, 20.0, 24.0, and 28.0. And it can do 0.5, 1.5, 2.5, 3.5.
    And it can encode 0.125 and 0.25.

    This encoding makes a lot of sense from ease of decode. However, it
    would be nice to be able to encode 100.0, 1000.0 and .1, .01 and .001, each of which is likely to be more useful than 12.0 or 3.5.

    This is really hard to quantify, and going by gut feeling is likely to
    give wrong results. Do you have any statistics, done on more software
    packages than what I have done, on the distribution of floating point
    constants?

    From a compiler standpoint, having arbitrary constants is perfectly fine,
    it can just look up if it's available. So you can make 1000.0 and .001
    and PI and lg2(e) and ln(2), and whatever available, if you want.
    GCC looks up Arm64 integer 13-bit immediates in a hashtable--the encoding
    is almost a one-way function, so it's just faster to look it up rather than try to figure out if 0xaaaaaaaa is encodeable out by inspecting the value.
    So something similar could be done for FP constants. Since the values will be fixed, a perfect hash can be created ensuring it's a fast lookup.

    Sure, it can be done, but I would like to do it on the basis of hard(er)
    data.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Nov 12 11:47:34 2025
    From Newsgroup: comp.arch

    On Tue, 11 Nov 2025 21:34:08 -0600
    BGB <cr88192@gmail.com> wrote:

    On 11/11/2025 6:03 AM, Michael S wrote:
    On Tue, 11 Nov 2025 04:44:48 -0600
    BGB <cr88192@gmail.com> wrote:


    Certainly not via GMP in final product. But doing 1st version via
    GMP makes perfect sense.


    GMP is only really an option for targets where GMP exists;

    Decimal128 is of interest only on targets where GMP exists.

    Needed to jump over to GCC in WSL just to test GMP here.


    So, you don't like msys2. It's your problem. Many Windows developers,
    myself included, find it handy. Esp. newer variant of tools, prefixed mingw-w64-ucrt-x86_64- .


    If avoidable, you don't want to use anything beyond the C standard
    library, and ideally limit things to a C95 style dialect for maximum portability.


    I almost agree, except for C95.
    C99 is maybe too much, but the C99 sub/superset known as C11 sounds
    about right.
    Also, I wouldn't consider such project without few extensions of
    standard language. As a minimum:
    - ability to get upper 64 bit of 64b*64b product
    - convenient way to exploit 64-bit add with carry
    - MS _BitScanReverse64 or Gnu __builtin_ctzll or equivalent
    The first and the second items are provided by Gnu __int128.
    All 3 items are available as standard features in C23, but I realize
    that for your purposes it is a bit too early to rely on C23.
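
    For the first two items, a GNU-extension sketch of what that looks like;
    these helper names are illustrative, not part of any library.

    typedef unsigned long long u64;

    u64 mulhi64(u64 a, u64 b)          /* upper 64 bits of the 64x64 product */
    {
        return (u64)(((unsigned __int128)a * b) >> 64);
    }

    u64 add64_with_carry(u64 a, u64 b, unsigned *carry)  /* 64-bit add w/ carry */
    {
        unsigned __int128 s = (unsigned __int128)a + b + *carry;
        *carry = (unsigned)(s >> 64);
        return (u64)s;
    }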

    But all that only applies to the final version of the library. At the stage
    of experimentation and proof of concept I suggest using any available
    tool. Including GMP.


    Granted, it does appear like the GMP divider is faster than expected.
    Like, possibly something faster than "ye olde shift-and-subtract".


    You see! It has already shown you something.
    The mere knowledge that something has already been done successfully by
    others is 2/3rds of what you need to accomplish the same by yourself.
    Even without looking at GMP sources. Which is certainly an option.




    Though, can note a curious property:
    This code is around 79% faster when built with GCC vs MSVC;
    In GCC, the relative speed of MUL and ADD trade places:
    In MSVC, MUL is faster;
    In GCC, ADD is faster.

    Though, the code in question tends to frequently use struct members directly, rather than caching multiply-accessed struct members in
    local variables. MSVC tends not to fully optimize away this sort of
    thing, whereas GCC tends to act as-if the struct members had in-fact
    been cached in local variables.


    this would be too big of a dependency), there is still the
    issue of efficiently converting between big-integer and the "groups
    of 9 digits in 32-bits" format.

    No, no, no. Not "group of 9 digits"! Plain unadulterated binary. 64
    binary 'digits' per 64-bit word.


    Alas, the code was written mostly to use 9-digit groupings, and going between 9-digit groupings and 128-bit integers is a bigger chunk of
    code than I want to have for this.


    Using 9-digit groups during conversions is a bad idea, both speed-wise
    and code complexity wise. Much better to use groups of 18 digits. Or
    15+19.


    This would mean an additional ~ 500 LOC, plus probably whatever code
    I need to do a semi-fast 256 by 128 bit integer divider.




    This is partly why I removed the BID code:
    At first, it seemed like the DPD and BID converters were similar
    speed; But, turns out I was still testing the DPD converter, and
    in-fact the BID converter was significantly slower.


    DPD-specific code and algorithms make sense for multiplication.
    They likely makes sense for addition/subtraction as well, I didn't
    try to think deeply about it.
    But for division I wouldn't bother with DPD-specific things. Just
    convert mantissa from DPD to binary, then divide, normalize, round
    then convert back.


    It is the 9-digit-decimal <-> Large Binary Integer converter step
    that is the main issue here.


    See above.

    Going to/from 128-bit integer adds a few "there be dragons here"
    issues regarding performance.


    Not really.
    That is, conversions are not blazingly fast, but still much better
    than any attempt to divide in any form of decimal. And helps to
    preserve your sanity.
    There is also a psychological factor at play - your users expect
    division and square root to be slower than other primitive FP
    operations, so they are not disappointed. Possibly they are even
    pleasantly surprised when they find out that the difference in
    throughput between division and multiplication is smaller than the
    factor of 20-30 that they were accustomed to for 'double' on their
    20 y.o. Intel and AMD.

    At the moment, I don't have a fast (and correct) converter between
    these two representations (that also does not rely on any external
    libraries or similar; or nothing outside of the C standard library).


    For 'correct', don't hesitate to use GMP.
    For 'not slow and correct' don't hesitate to use gnu extensions like
    __int128. After majority of work is done and you are reasonably
    satisfied with result, you can re-code in MS dialect, if that is your
    wish. That would be a simple mechanical work.




    Like, if you need to crack 128 bits into 9-digit chunks using 128-bit divide, and if the 128-bit divider in question is a
    shift-and-subtract loop, this sucks.

    There are faster ways to do multiply by powers of 10, but divide by powers-of-10 is still a harder problem at the moment.

    Well, and also there is the annoyance that it is difficult to write
    an efficient 128-bit integer multiply if staying within the limits of portable C95.


    ...



    Goes off and tries a few things:
    128-bit integer divider;
    Various attempts at decimal long divide;
    ...

    Thus far, things have either not worked correctly, or have ended up
    slower than the existing Newton-Raphson divider.


    the most promising option would be Radix-10e9 long-division, but
    couldn't get this working thus far.


    No, just no. Anything non-binary is no good for division.

    Did also try Radix-10 long division (working on 72 digit sequences),
    but this was slower than the existing N-R divider.


    One possibility could be to try doing divide with Radix-10 in an
    unpacked BCD variant (likely using bytes from 0..9). Here, compare
    and subtract would be sower, but shifting could be faster, and allows
    a faster way (lookup tables) to find "A goes into B, N times".

    I still don't have much confidence in it though.


    Radix-10e9 has a higher chance of OK performance, if I could get the long-division algo to work correctly with it. Thus far, I was having difficulty getting it to give the correct answer. Integer divide was
    tending to overshoot the "A goes into B N times" logic, and trying to
    fudge it (eg, but adding 1 to the initial divisor) wasn't really
    working; kinda need an accurate answer here, and a reliable way to
    scale and add the divisor, ...


    Granted, one possibility could be to expand out each group of 9
    digits to 64 bits, so effectively it has an intermediate 10 decimal
    digits of headroom (or two 10e9 "digits").

    But, yeah, long-division is a lot more of a PITA than N-R or shift-and-subtract.



    I am not totally sure what you mean by 'long division', 'N-R' and 'shift-and-subtract'. In my view, they are not really distinct. Shades
    of gray, rather than black-and-white.
    Without experimentation, I'd recommend something similar to what Terje suggested - calculate approximate reciprocal with 52-bit precision (by
    FP_DP division), then do 3 iterations. You can call them as you
    like, all three names above apply.
    I am not sure that it is the fastest method. It is possible that it
    is better to improve the reciprocal initially to 62-63 bits and then
    proceed with 2 iterations instead of 3.
    I *am* sure that the difference in speed between the two variants is not
    dramatic and that both of them are a lot faster than what you are doing
    today.
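
    The iteration being referred to, as a sketch: start from an approximate
    reciprocal r of the divisor d and apply r = r*(2 - d*r); each pass
    roughly doubles the number of correct bits. In the real Decimal128 code
    this would run on the extended-precision significand rather than on
    doubles.

    double refine_recip(double d, double r, int iters)
    {
        int i;
        for (i = 0; i < iters; i++)
            r = r * (2.0 - d * r);   /* Newton-Raphson step for 1/d */
        return r;
    }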


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 12 19:22:24 2025
    From Newsgroup: comp.arch


    kegs@provalid.com (Kent Dickey) posted:

    In article <1762377694-5857@newsgrouper.org>,
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:
    My 66000 ISA has OpCodes in the range {I.major >= 8 && I.major < 24}
    that can supply constants and perform operand routing. Within this
    range; instruction<8:5> specify the following table:

    0 0 0 0 +Src1 +Src2
    0 0 0 1 +Src1 -Src2
    0 0 1 0 -Src1 +Src2
    0 0 1 1 -Src1 -Src2
    0 1 0 0 +Src1 +imm5
    0 1 0 1 +Imm5 +Src2
    0 1 1 0 -Src1 -Imm5
    0 1 1 1 +Imm5 -Src2
    1 0 0 0 +Src1 Imm32
    1 0 0 1 Imm32 +Src2
    1 0 1 0 -Src1 Imm32
    1 0 1 1 Imm32 -Src2
    1 1 0 0 +Src1 Imm64
    1 1 0 1 Imm64 +Src2
    1 1 1 0 -Src1 Imm64
    1 1 1 1 Imm64 -Src2

    Here we have access to {5, 32, 64}-bit constants, 16-bit constants
    come from different OpCodes.

    Imm5 are the register specifier bits: range {-31..31} for integer and
    logical, range {-15.5..15.5} for floating point.

    For FP, Arm32 has an 8-bit immediate turned into an FP number as follows:

    sign = imm8<7>;
    exp = NOT(imm8<6>):Replicate(imm8<6>,E-3):imm8<5:4>;
    frac = imm8<3:0>:Zeros(F-4);
    result = sign : exp : frac;

    For Float, exp[7:0] can be 0x80-0x83 or 0x7c-0x7f, which is 2^1 through 2^4 and 2^-3 through 2^0. And the mantissa upper 4 bits are from the immediate field. Note that 0.0 is not encodeable, and I'm going to assume you
    don't need it either.

    For your FP, the sign comes from elsewhere, so you have 5 bits for the
    FP number. I suggest you use the Arm32 encoding for the exponent (using
    3 bits), and then set the upper 2 bits of the mantissa from the remaining
    two immediate bits.

    This encodes integers from 1.0 through 8.0, and can also encode 10.0, 12.0, 14.0, 16.0, 20.0, 24.0, and 28.0. And it can do 0.5, 1.5, 2.5, 3.5.
    And it can encode 0.125 and 0.25.

    Thank you for this suggestion and clear explanation.

    This encoding makes a lot of sense from ease of decode. However, it
    would be nice to be able to encode 100.0, 1000.0 and .1, .01 and .001, each of which is likely to be more useful than 12.0 or 3.5.

    My 66000 also has complete 32-bit and 64-bit FP constants; somewhat
    lessening the need for imm5's to cover as wide a ground as possible.
    I will keep your scheme in mind.

    From a compiler standpoint, having arbitrary constants is perfectly fine,
    it can just look up if it's available.

    That is not the way My 66000 ISA works. All constants are available--
    the only thing the compiler has to determine is: does the constant fit
    in imm5, imm32, or imm64.

    So you can make 1000.0 and .001
    and PI and lg2(e) and ln(2), and whatever available, if you want.

    I already did. The compiler also uses CVT instructions when CVT can
    create a FP constant (say for an call argument) that has a smaller
    code footprint than just MOVing the constant to a register.

    GCC looks up Arm64 integer 13-bit immediates in a hashtable--the encoding
    is almost a one-way function, so it's just faster to look it up rather than try to figure out if 0xaaaaaaaa is encodeable out by inspecting the value.
    So something similar could be done for FP constants. Since the values will be fixed, a perfect hash can be created ensuring it's a fast lookup.

    Kent
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Wed Nov 12 21:56:32 2025
    From Newsgroup: comp.arch

    On 2025-11-12 3:18, Stephen Fuld wrote:
    On 11/11/2025 4:31 PM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/11/2025 11:46 AM, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:


    snip
    It is only if your machine has some semantics for instruction
    combinations, such as your VEC-LOOP pair, that you have to define what
    happens if a jump or call to some address leads to later executing only
    some of those instructions or executing them in the wrong order, such as
    trying to execute a LOOP without having executed a preceding VEC.

    BTW, encountering a LOOP without encountering a VEC is a natural
    occurrence when returning from exception or interrupt. The VEC
    register points at the VEC+1 instruction which is easy to return
    to the VEC instruction.

    OK, but what if, say through an errant pointer, the code, totally
    unrelated to the VEC, jumps somewhere in the middle of a VEC/LOOP pair?

    All taken branches clear the V-bit associated with vectorization.
    So encountering the LOOP instruction would raise an exception.

    Seems like the right thing to do.  I believe this resolves Niklas's issue.

    Yes, in the sense that this example supports my statement (above) that
    in a machine that has instruction combinations (like VEC-LOOP) that must
    be executed in a certain order, it is necessary to address what happens
    if a jump or call breaks that order, complicating the semantics
    definition. I agree that an exception seems the right thing to do here,
    and I expected it.

    Connecting this to the labels-as-values discussion, this means that a C compiler that compiles a C loop into a VEC-LOOP machine loop, and allows
    a "goto" to a label within that loop, from outside the loop, would
    result in execution that fails due to this exception, whether the label
    is statically named or referenced by a label-valued variable. So I would
    wish that the compiler would prevent that at compile time, to avoid
    possible UB.
    --
    Niklas Holsti

    niklas holsti tidorum fi
    . @ .

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 12 20:25:33 2025
    From Newsgroup: comp.arch


    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-12 3:18, Stephen Fuld wrote:
    On 11/11/2025 4:31 PM, MitchAlsup wrote:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/11/2025 11:46 AM, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:


    snip
    It is only if your machine has some semantics for instruction
    combinations, such as your VEC-LOOP pair, that you have to define what
    happens if a jump or call to some address leads to later executing only
    some of those instructions or executing them in the wrong order, such as
    trying to execute a LOOP without having executed a preceding VEC.

    BTW, encountering a LOOP without encountering a VEC is a natural
    occurrence when returning from exception or interrupt. The VEC
    register points at the VEC+1 instruction which is easy to return
    to the VEC instruction.

    OK, but what if, say through an errant pointer, the code, totally
    unrelated to the VEC, jumps somewhere in the middle of a VEC/LOOP pair?
    All taken branches clear the V-bit associated with vectorization.
    So encountering the LOOP instruction would raise an exception.

    Seems like the right thing to do.  I believe this resolves Niklas's issue.

    Yes, in the sense that this example supports my statement (above) that
    in a machine that has instruction combinations (like VEC-LOOP) that must
    be executed in a certain order, it is necessary to address what happens
    if a jump or call breaks that order, complicating the semantics
    definition. I agree that an exception seems the right thing to do here,
    and I expected it.

    Connecting this to the labels-as-values discussion, this means that a C compiler that compiles a C loop into a VEC-LOOP machine loop, and allows
    a "goto" to a label within that loop, from outside the loop, would
    result in execution that fails due to this exception, whether the label
    is statically named or referenced by a label-valued variable. So I would wish that the compiler would prevent that at compile time, to avoid
    possible UB.

It seems to me that taking the value of a label within a VEC-LOOP
could be prevented by the compiler--or cause the potentially vectorized
loop to become a scalar loop with spaghetti control flow. Just as taking
the address of a variable prevents the compiler from allocating it to a
register, taking the address of a label forces the encompassing loop
to remain scalar.
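
As a concrete (and purely illustrative) GNU C sketch of the situation,
a computed goto whose target labels sit inside the loop body; the names
here are made up, only the &&/goto * machinery is the real feature:

/* A compiler that would otherwise emit a VEC...LOOP pair for this loop
   has to keep it scalar once label addresses inside it are taken and
   reached through a computed goto. */
void scale(float *a, int n, int odd_case)
{
    void *resume = odd_case ? &&fixup : &&next;

    for (int i = 0; i < n; i++) {
        a[i] *= 2.0f;
        goto *resume;        /* indirect jump to a label inside the loop */
    fixup:
        a[i] += 1.0f;
    next:
        ;
    }
}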


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 12 20:27:43 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-11-11 4:18 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
Typical process for NaN boxing is to set the high order bits of the
value which causes the value to appear to be a NaN at higher precision.
I have been thinking about using some of the high order bits of the NaN
(eg bits 32 to 51) to indicate the precision of the boxed value. This
would allow detection of the use of a lower precision value in
arithmetic. Suppose a convert from single to double precision is being
done, but the value to be converted is only half precision.

    Do you mean a type mismatch, a conversion, or digits lost due to
    cancellation?

It would be an input type mismatch.

    I think this can only happen when software is buggy; compilers should
    deal with it, unless the user intentionally accesses data with
    the wrong type.

    If it were
    indicated by the NaN software might be able to fix the result.

Fixing a result after a NaN has occurred is too late, I think.

    I suppose the float package could always just automatically upgrade the precision from lower to higher when it goes to do the calculation. But maybe with a trace warning. It would be able to if the precision were indicated in the NaN.
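
For concreteness, a sketch in C of what such a boxing scheme might look
like; the tag position (bits 48..49 here) and the tag values are made up
for illustration, only the general shape follows the scheme described above:

#include <stdint.h>
#include <string.h>

/* Upper bits force a quiet NaN at double precision, a 2-bit tag in
   bits 48..49 records the precision of the boxed value, and the
   payload sits in the low 32 bits. */
#define BOX_NAN_HI      0xFFF8000000000000ull  /* sign, exponent, quiet bit */
#define BOX_PREC_SHIFT  48
#define BOX_PREC_HALF   1ull
#define BOX_PREC_SINGLE 2ull

static uint64_t box_single(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return BOX_NAN_HI | (BOX_PREC_SINGLE << BOX_PREC_SHIFT) | bits;
}

static int boxed_precision(uint64_t v)   /* 0 = no tag, 1 = half, 2 = single */
{
    return (int)((v >> BOX_PREC_SHIFT) & 0x3);
}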

I have implemented a few warnings about conversions in gfortran.
    For example, -Wconversion-extra gives you, for the program

    program main
    print *,0.3333333333
    end program main

    the warning

    2 | print *,0.3333333333
    | 1
    Warning: Non-significant digits in 'REAL(4)' number at (1), maybe incorrect KIND [-Wconversion-extra]

    But my favorite is

    3 | print *,a**(3/5)

    BTW, this works in eXcel where 3/5 = 0.6

    AND, in My 66000, a**0.6 is a single instruction. ...

    | 1
    Warning: Integer division truncated to constant '0' at (1) [-Winteger-division]

    which (presumably) has caught that particular idiom in a few codes.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Thu Nov 13 01:35:37 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> wrote:
    Niklas Holsti wrote:
    On 2025-11-06 20:28, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:

    On 2025-11-05 23:28, MitchAlsup wrote:

    Niklas Holsti <niklas.holsti@tidorum.invalid> posted:
    ----------------
But then you could get the problem of a longjmp to a setjmp value that
is stale because the targeted function invocation (stack frame) is no
longer there.

    But YOU had to pass the jumpbuf out of the setjump() scope.

Now, YOU complain there is a hole in your own foot with a smoking gun
in your own hand.

That is not the issue. The question is if the semantics of "goto
label-valued-variable" are hard to define, as Ritchie said, or not, as
Anton thinks Stallman said or would have said.

    So, label-variables are hard to define, but function-variables are not
    ?!?

    Depends on the level at which you want to define it.

    At the machine level, where semantics are (usually) defined for each
    instruction separately, a jump to a dynamic address (using a
    "label-variable") is not much different from a call to a dynamic address
    (using a "function-variable"), and the effect of the single instruction
    on the machine state is much the same as for the static address case.
    The higher-level effect on the further execution of the program is out
    of scope, whatever the actual value of the target address in the
    instruction.

    It is only if your machine has some semantics for instruction
    combinations, such as your VEC-LOOP pair, that you have to define what
    happens if a jump or call to some address leads to later executing only
    some of those instructions or executing them in the wrong order, such as
    trying to execute a LOOP without having executed a preceding VEC.

    At the higher programming-language level, the label case can be much
    harder to define and less useful than the function case, depending on
    the programming language and its abstract model of execution, and also
    depending on what compile-time checks you assume.

    Consider an imperative language such as C with no functions nested
    within other functions or other blocks (where by "block" I mean some
    syntactical construct that sets up its local context with local
    variables etc.). If you have a function-variable (that is, a pointer to
    a function) that actually refers to a function with the same parameter
    profile, it is easy to define the semantics of a call via this function
    variable: it is the same as for a call that names the referenced
    function statically, and such a call is always legal. Problems arise
    only if the function-variable has some invalid value such as NULL, or
    the address of a function with a different profile, or some code address
    that does not refer to (the start of) a function. Such invalid values
    can be prevented at compile time, except (usually) for NULL.

    In the same language setting, the semantics of a jump using a
    label-variable are easy to define only if the label-variable refers to a
    label in the same block as the jump. A jump from one block into another
    would mess up the context, omitting the set-up of the target block's
    context and/or omitting the tear-down of the source block's context. The
    further results of program execution are machine-dependent and so
    undefined behavior.

    A compiler could enforce the label-in-same-block rule, but it seems that
    GNU C does not do so.

    In a programming language that allows nested functions the same kind of
    context-crossing problems arise for function-variables. Traditional
    languages solve them by allowing, at compile-time, calls via
    function-variables only if it is certain that the containing context of
    the callee still exists (if the callee is nested), or by (expensively)
    preserving that context as a dynamically constructed closure. In either
    case, the caller's context never needs to be torn down to execute the
    call, differing from the jump case.

    In summary, jumps via label-variables are useful only for control
    transfers within one function, and do not help to build up a computation
    by combining several functions -- the main method of program design at
    present. In contrast, calls via function-variables are a useful
    extension to static calls, actually helping to combine several functions
    in a computation, as shown by the general adoption of
    class/object/method coding styles.

    Niklas


    I was curious about the interaction between dynamic stack allocations
    and goto variables to see if it handled the block scoping correctly.
    Ada should have the same issues as C.
    It appears GCC x86-64 15.2 with -O3 does not properly recover
    stack space with dynamic goto's.

    Test1 allocates a dynamic sized buffer and has a static goto Loop
    for which GCC generates a jne .L6 to a mov rsp, rbx that recovers
    the stack allocation inside the {} block.

    Test2 is the same but does a goto *dest and GCC does not generate
    code to recover the inner {} block allocation. It just loops over
    the sub rsp, rbx so the stack space just grows.

    long Sub (long len, char buf[]);

    void Test1 (long len)
    {
    long ok;

    Loop:
    {
    char buf[len];

    ok = Sub (len, buf);
    if (ok)
    goto Loop;
    }
    }

IIRC there is clear statement in the C standard that you are not
allowed to jump into a scope after a dynamic declaration. This
restriction is because otherwise compiler would need some twisty
logic to run allocation code. With label variables that obviously
generalizes to jumps outside of scope of dynamic allocation:
compiler does not try to recover allocated storage. Your code
does not differ much from infinite recursion. In case of
infinite recursion compiler _may_ be able to optimize things
so that they run in constant memory, but usually such
recursion will lead to stack overflow.

    So natural restriction is: when jumping to label variable
    dynamic locals may be released only at function exit.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 12 23:59:30 2025
    From Newsgroup: comp.arch

    On 2025-11-12 3:27 p.m., MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    Robert Finch <robfi680@gmail.com> schrieb:
    On 2025-11-11 4:18 p.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:
    Typical process for NaN boxing is to set the high order bits of the
value which causes the value to appear to be a NaN at higher precision.
I have been thinking about using some of the high order bits of the NaN
(eg bits 32 to 51) to indicate the precision of the boxed value. This
would allow detection of the use of a lower precision value in
arithmetic. Suppose a convert from single to double precision is being
done, but the value to be converted is only half precision.

    Do you mean a type mismatch, a conversion, or digits lost due to
    cancellation?

It would be an input type mismatch.

    I think this can only happen when software is buggy; compilers should
    deal with it, unless the user intentionally accesses data with
    the wrong type.

    If it were
    indicated by the NaN software might be able to fix the result.

    Fixing a result after an NaN has occurred is too late, I think.

    I suppose the float package could always just automatically upgrade the
    precision from lower to higher when it goes to do the calculation. But
    maybe with a trace warning. It would be able to if the precision were
    indicated in the NaN.

    I have implemented a few warning about conversions in gfortran.
    For example, -Wconversion-extra gives you, for the program

    program main
    print *,0.3333333333
    end program main

    the warning

    2 | print *,0.3333333333
    | 1
    Warning: Non-significant digits in 'REAL(4)' number at (1), maybe incorrect KIND [-Wconversion-extra]

    But my favorite is

    3 | print *,a**(3/5)

    It has been a long while since I did any Fortran code – back in school
    40ish years ago. I hardly recognize it. I think I kept my Fortran
    textbook somewhere.
    I have used VBA in eXcel with varying degrees of luck.

    The number line is infinitely discontinuous!

    BTW, this works in eXcel where 3/5 = 0.6

    AND, in My 66000, a**0.6 is a single instruction. ...

    The right way of doing things.

    Qupls allows up to three constants per instruction which follow the instruction in specialized NOPs. It is only slightly less compact to
    encode the constants in NOPs. While the opcode for a NOP does use some
    room, multiple constants can be encoded in it. It sure makes the front
    end easier as there are no variable length instructions to deal with.

    Coded a fused dot product today. Prelim testing shows it matches the
    output of the compiler running an executable on the PC about 50% of the
    time. I checked a few of the mismatches and they were out only by 1 in
    the LSB. So, it is probably good enough for my purposes.
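
For comparison on the PC side, one plausible reference is to fold each
product into the accumulator with C's fma(), so there is a single
rounding per element; a different accumulation order or width is
already enough to explain 1-ulp mismatches:

#include <math.h>

/* Host-side reference for a fused dot product. */
static double dot_fma(const double *a, const double *b, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc = fma(a[i], b[i], acc);   /* one rounding per element */
    return acc;
}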

    | 1
    Warning: Integer division truncated to constant '0' at (1) [-Winteger-division]

    which (presumably) has caught that particular idiom in a few codes.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Nov 13 07:24:15 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    But my favorite is

    3 | print *,a**(3/5)

    BTW, this works in eXcel where 3/5 = 0.6

    C has the same semantics for integer division:

    $ cat int.c && gcc int.c && ./a.out
    #include <stdio.h>
    int main()
    {
    printf("%d\n",3/5);
    return 0;
    }
    0

    It's one of those things that take people by surprise, and
    exponentiation is one of the places where it may not be seen easily.
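
The same trap, spelled in C for comparison:

#include <stdio.h>
#include <math.h>

int main(void)
{
    double a = 2.0;
    printf("%g\n", pow(a, 3 / 5));      /* 3/5 is integer 0, so this prints 1 */
    printf("%g\n", pow(a, 3.0 / 5.0));  /* what was probably meant */
    return 0;
}
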
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 13 08:42:35 2025
    From Newsgroup: comp.arch

    EricP <ThatWouldBeTelling@thevillage.com> writes:
    Anton Ertl wrote:
    EricP <ThatWouldBeTelling@thevillage.com> writes:
    void Test2 (long len)
    {
    long ok;
    void *dest;

    dest = &&Loop;
    Loop:
    {
    char buf[len];

    ok = Sub (len, buf);
    if (ok)
    goto *dest;
    }
    }

    Test2(long):
    push rbp
    mov rbp, rsp
    push r12
    mov r12, rdi
    push rbx
    lea rbx, [rdi+15]
    shr rbx, 4
    sal rbx, 4
    .L8:
    sub rsp, rbx
    mov rdi, r12
    mov rsi, rsp
    call Sub(long, char*)
    test rax, rax
    jne .L8
    lea rsp, [rbp-16]
    pop rbx
    pop r12
    pop rbp
    ret

    Interesting that this bug has not been fixed in the >33 years that
    labels-as-values have been in gcc; I don't know how long these
    dynamically sized arrays have been in gcc, but IIRC alloca(), a
    similar feature, has been available at least as long as
    labels-as-values. The bug has apparently been avoided or worked
    around by the users of labels-as-values (e.g., Gforth does not use
    alloca or dynamically-sized arrays in the function that contains all
    the taken labels and all the "goto *"s.

    alloca is not required to recover storage at the {} block level.

    Good point. So if you do

    for (i=0; i<1000000000; i++) {
    char *s = alloca(1000+i%1024);
    ... use s ...
    }

    and the program runs out of memory, it's a bug in your C source code,
    whereas if you do

    for (i=0; i<1000000000; i++) {
    char s[1000+i%1024];
    ... use s ...
    }

    and the program runs out of memory, it's a bug in the compiler.

    So this bug has only existed since dynamically-sized arrays were added
    to gcc (probably just a quarter-century or so).

    But when they added dynamic allocation to C as a first class feature
    I figured it should recover storage at the end of a {} block,
and I wondered if the superficially non-deterministic nature of
    goto variable would be a problem.

    I outlined a correct implementation in my previous posting. The
    general way is basically the same that gcc already uses for the direct
    goto, as shown in your test1. Have a jump target that copies the
    stack depth for the label from another location, and use that jump
    target as the taken address. E.g.:

    L1:
    ...
    { int foo[n];
    ...
    L2:
    ...
    { int bar[n2];
    ...
    L3:
    ...
    void *labels[] = {&&L1, &&L2, &&L3, &&L4, &&L5};
    ...
    goto *labels[i];
    }
    ...
    L4:
    ...
    }
    ...
    L5:
    ...

    would be compiled to

    L1x: # used for &&L1 and for direct gotos where %rsp may be different
    mov L1L5_depth(%rbp), %rsp
    L1y: # used for direct gotos where %rsp is the same
    ...
    L2x: # used for &&L2 and for direct gotos where %rsp may be different
    mov L2L4_depth(%rbp), %rsp
    L2y:
    ...
    L3x: # the only goto * is at the same %rsp depth, so no mov needed
    L3y:
    ...
jmp *%rcx
    ...
L4x: # used for &&L4 and for direct gotos where %rsp may be different
    mov L2L4_depth(%rbp), %rsp
    L4y:
    ...
L5x: # used for &&L5 and for direct gotos where %rsp may be different
    mov L1L5_depth(%rbp), %rsp
    L5y: # used for direct gotos where %rsp is the same
    ...

    And of course, for those programs that do not combine these features,
    all labels would turn out like L3, i.e., without the extra mov.

    This all relates to Niklas's comments as to why the label variables must
    all be within the current context, so it knows when to recover storage.

    The gcc documentation specifies that the labels must be in the same
    function as the goto, so the compiler does not have to do stack
    unwinding which the Pascal compiler has to do for the Pascal goto.

If the language had destructors the goto variable could have to call them,
which alloca also does not deal with.

    GNU C has no destructors.

    long Sub (long len, char buf[]);

    void Test3 (long len)
    {
    long ok, dest;

    dest = 0;
    Loop:
    {
    char buf[len];

    ok = Sub (len, buf);
    if (ok)
    dest = 1;

    switch (dest)
    {
    case 0:
    goto Loop;
    case 1:
    goto Out;
    }
    Out:
    ;
    }
    }

    That actually tests direct goto. For the switch, one could wonder
    about stuff like

    switch (...) {
    char s[n];
    case 1:
    ... s[i] ...
    { char t[m];
    case 2:
    ... t[i]...
    }
    }

    But I expect that this has been declared undefined behaviour at some point.

    At least the block structure protects the switch from having case
    labels in outer scopes (in contrast to the labels-as-values example
    above).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 13 09:24:20 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    I almost agree, except for C95.

    What is C95? I only know of C89/90, C99, C11, C23.

    Also, I wouldn't consider such project without few extensions of
    standard language. As a minimum:
    - ability to get upper 64 bit of 64b*64b product
    - convenient way to exploit 64-bit add with carry

    I have explored these topics recently in "Multi-precision integer
    arithmetics" <http://www.complang.tuwien.ac.at/anton/tmp/carry2.pdf>.

    Actually, with uint128_t you get pretty far, and _BitInt(bits) has
    been added in C23, which has good potential, but is not quite there.
    Builtins for add-with-carry and intrinsics are somewhat disappointing.
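
A small sketch of how far the __int128 extension alone gets you (the
high half of a 64x64-bit product, and a two-word add), without any
carry builtins:

#include <stdint.h>

typedef unsigned __int128 u128;   /* gcc/clang extension */

static uint64_t mulhi_u64(uint64_t a, uint64_t b)
{
    return (uint64_t)(((u128)a * b) >> 64);   /* upper 64 bits of the product */
}

static void add128(uint64_t a_hi, uint64_t a_lo,
                   uint64_t b_hi, uint64_t b_lo,
                   uint64_t *c_hi, uint64_t *c_lo)
{
    u128 s = (((u128)a_hi << 64) | a_lo) + (((u128)b_hi << 64) | b_lo);
    *c_lo = (uint64_t)s;
    *c_hi = (uint64_t)(s >> 64);
}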

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 13 09:45:51 2025
    From Newsgroup: comp.arch

    antispam@fricas.org (Waldek Hebisch) writes:
    IIRC there is clear statement in the C standard that you are not
    allowed to jump into a scope after a dynamic declaration. This
    restriction is because otherwise compiler would need some twisty
    logic to run allocation code.

    Not just that. If the dynamic definition is not executed, it's
    unclear how much should be allocated. Consider:

    n=-5;
    goto L;
    n = m; // dead code
    {
    int x[n]; // dead code
    n=0; // dead code
    L:
    ... x[3] ...
    ...
    }

    With label variables that obvoiusly
    generalizes to jumps outside of scope of dynamic allocation:

    This is a use of "obviously" that wants the reader to skip thinking
    about the issue (and maybe the writer has not thought about it,
    either). But actually, the cases are completely different.

    If control flow passed through the dynamic definition on the way to
    the goto, the stack depth in its scope is known, and can be restored
    when performing the goto, as I showed in <2025Nov13.094235@mips.complang.tuwien.ac.at>.

    So natural restriction is: when jumping to label variable
    dynamic locals may be released only at function exit.

    A compiler bug is not a natural restriction. Of course, the gcc
    people might decide not to fix the bug (after all, no production code
    is affected by this bug), and declare it undefined behaviour to, say,
    perform a goto * inside a scope with a dynamic array that jumps
    outside the scope, but if they do something like this, it's a human
    decision based on a cost-benefit analysis, not something natural.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Nov 13 12:18:47 2025
    From Newsgroup: comp.arch

    On Thu, 13 Nov 2025 09:24:20 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    Michael S <already5chosen@yahoo.com> writes:
    I almost agree, except for C95.

    What is C95? I only know of C89/90, C99, C11, C23.



    I didn't hear about it until mentioned by BGB here.
    According to Wikipedia, a minor modification of C89/90 called C94 or
    C95 indeed exists.

    Also, I wouldn't consider such project without few extensions of
    standard language. As a minimum:
    - ability to get upper 64 bit of 64b*64b product
    - convenient way to exploit 64-bit add with carry

    I have explored these topics recently in "Multi-precision integer arithmetics" <http://www.complang.tuwien.ac.at/anton/tmp/carry2.pdf>.

    Actually, with uint128_t you get pretty far, and _BitInt(bits) has
    been added in C23, which has good potential, but is not quite there.

Yes, that's what I wrote above.
    As far as BGB is concerned, the big disadvantage is absence of support
    by MSVC.

    Builtins for add-with-carry and intrinsics are somewhat disappointing.

    - anton

    For me the most disappointing part is that different architectures
    have different spellings. In case of Arm64, I don't even know what is
    correct spelling. Other than that even gcc now mostly able to generate
    decent code for Intel's variant. MSVC and clang were able to do it for
    very long time.
    Or do you have in mind new gcc intrinsic in a group "Arithmetic with
    Overflow Checking" ? Those are for completely different purpose.
    Sometimes they can be abused for multiple-precision arithmetic, but one
    should not be surprised when results are disappointing.




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Thu Nov 13 17:35:50 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    antispam@fricas.org (Waldek Hebisch) writes:
    IIRC there is clear statement in the C standard that you are not
    allowed to jump into a scope after a dynamic declaration. This
    restriction is because otherwise compiler would need some twisty
    logic to run allocation code.

    Not just that. If the dynamic definition is not executed, it's
    unclear how much should be allocated. Consider:

    n=-5;
    goto L;
    n = m; // dead code
    {
    int x[n]; // dead code
    n=0; // dead code
    L:
    ... x[3] ...
    ...
    }

    With label variables that obvoiusly
    generalizes to jumps outside of scope of dynamic allocation:

    This is a use of "obviously" that wants the reader to skip thinking
    about the issue (and maybe the writer has not thought about it,
    either). But actually, the cases are completely different.

    If control flow passed through the dynamic definition on the way to
    the goto, the stack depth in its scope is known, and can be restored
    when performing the goto, as I showed in <2025Nov13.094235@mips.complang.tuwien.ac.at>.

    So natural restriction is: when jumping to label variable
    dynamic locals may be released only at function exit.

    A compiler bug is not a natural restriction. Of course, the gcc
    people might decide not to fix the bug (after all, no production code
    is affected by this bug), and declare it undefined behaviour to, say,
    perform a goto * inside a scope with a dynamic array that jumps
    outside the scope, but if they do something like this, it's a human
    decision based on a cost-benefit analysis, not something natural.

    It is natural result of cost-benefit analysis in a language like
    C. I know something about related issues: I tried to implement
    nicer semantic of goto-s in a language having 'finally' blocks
    and destructors. Basically, making goto-s behave similarly to
    exceptions and labels like exception handlers. Simple goto
    got turned into twisted maze taking care that relevant
    exception handler are executed when exiting a scope. That
    worked for one target and one gcc version. It did not work
    for different targets and got completely broken by newer gcc
versions. When my code worked it frequently led to slower
code.

    In higher level language one can have nice semantic for gotos:
    goto is essentially the function call to a parameterless local
    function. But implementing goto this way almost surely will
    negate _your_ reason to use computed gotos: goto implemented in
    such a way is likely to be slower than normal function calls
    via function pointers. C normally offers construct which
    have reasonably simple mapping to machine instructions and
    avoid "nice" constructs that require extensive code
    transformations. So the only natural definition in C is to
    avoid nice semantics like above and declare restrictions.
    Declaring computed jump out of scope as undefined is
    reasonably natural. But jumps are frequently used for
    abnormal exits and in such case one wants to exit a scope.
    So not reclaiming memory allocated in the scope is more
    natural restriction.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Bernd Linsel@bl1-thispartdoesnotbelonghere@gmx.com to comp.arch on Thu Nov 13 19:32:12 2025
    From Newsgroup: comp.arch

    On 11/13/25 09:42, Anton Ertl wrote:

    GNU C has no destructors.
    It has, in limited form via __attribute__((__cleanup__(...)))

    see https://gcc.gnu.org/onlinedocs/gcc-15.2.0/gcc/Common-Variable-Attributes.html#index-cleanup-variable-attribute
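
A minimal sketch of its use (free_ptr is just a helper name made up
here; the attribute itself is the GNU C feature):

#include <stdio.h>
#include <stdlib.h>

static void free_ptr(void *p)        /* receives a pointer to the variable */
{
    free(*(void **)p);
}

void demo(void)
{
    char *buf __attribute__((cleanup(free_ptr))) = malloc(64);
    if (!buf)
        return;                      /* cleanup runs on every scope exit */
    snprintf(buf, 64, "hello");
    puts(buf);
}                                    /* ...including here */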

    Regards,
    Bernd

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 13 18:09:12 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 13 Nov 2025 09:24:20 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    Actually, with uint128_t you get pretty far, and _BitInt(bits) has
    been added in C23, which has good potential, but is not quite there.

    Yes, that what I wrote above.
    As far as BGB is concerned, the big disadvantage is absence of support
    by MSVC.

    Why would that be a disadvantage? If MSVC does not do what he needs,
    there are other C compilers to choose from.

    Builtins for add-with-carry and intrinsics are somewhat disappointing.

    - anton

    For me the most disappointing part is that different architectures
    have different spellings.

    For intrinsics that's by design. They are essentially a way to write
    assembly language instructions in Fortran or C. And assembly language
    is compiler-specific.

    Other than that even gcc now mostly able to generate
    decent code for Intel's variant. MSVC and clang were able to do it for
    very long time.

    When using the Intel intrinsic c_out = _addcarry_u64(c_in, s1, s2,&sum),
    the code from both gcc and clang uses adcq, but cannot preserve the
    carry in CF in a loop, and moves it into a register right after the
    adcq, and back from the register to CF right before:

    addb $-1, %r8b
    adcq (%rdx,%rax,8), %r9
    setb %r8b

    If you (or compiler unrolling) have several _addcarry_u64 in a row,
    with the carry-out becoming the carry-in of the next one, at least one
    of these compilers manages to eliminate the overhead between these
    adcqs, but of course not at the start and end of the sequence.
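
For reference, the kind of loop in question, sketched with the Intel
intrinsic (operand types chosen to match its unsigned long long
signature):

#include <stddef.h>
#include <immintrin.h>

/* r = a + b over n 64-bit words, least significant word first. */
static void addn(unsigned long long *r,
                 const unsigned long long *a,
                 const unsigned long long *b, size_t n)
{
    unsigned char c = 0;
    for (size_t i = 0; i < n; i++)
        c = _addcarry_u64(c, a[i], b[i], &r[i]);
}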

    Or do you have in mind new gcc intrinsic in a group "Arithmetic with
    Overflow Checking" ?

    These are gcc builtins, not intrinsics. The difference is that they
    work on all architectures. However, when I looked (three months ago),
    gcc did not have a builtin with carry-in; the builtins you mention
    only provide carry-out (or overflow-out).

    However, clang has a builtin with carry-in and carry-out:
    sum = __builtin_addcll(s1, s2, c_in, &c_out)

    Unfortunately, the code produced by clang is pretty horrible for ARM
    A64 and AMD64:

    ARM A64: # clang 11.0.1 -Os
    adds x9, x9, x10
    cset w10, hs
    adds x9, x9, x8
    cset w8, hs
    orr w8, w10, w8

    AMD64: # clang 14.0.6 -march=x86-64-v4 -Os
    addq (%rdx,%r8,8), %r9
    setb %r10b
    addq %rax, %r9
    setb %al
    orb %r10b, %al
    movzbl %al, %eax

    For RISC-V the code is a five-instruction sequence, which is the
    minimum that's possible on RISC-V.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 13 19:04:18 2025
    From Newsgroup: comp.arch


    Michael S <already5chosen@yahoo.com> posted:

    On Tue, 11 Nov 2025 21:34:08 -0600
    ------------------
    C99 is, may be, too much, but C99 sub/super set known as C11 sounds
    about right.

    Also, I wouldn't consider such project without few extensions of
    standard language. As a minimum:
    - ability to get upper 64 bit of 64b*64b product
    hll:
    {carry, result} = multiplier × multiplicand;
    asm:
    CARRY Rc,{{O}}
    MUL Rr,Rm1,Rm2 // {Rc,Rr} is the 128-bit result

    - convenient way to exploit 64-bit add with carry
    hll:
    {carry, result} = augend + addend;
    asm:
    CARRY Rc,{{O}}
    ADD Rd,Ra1,Ra2
    or
    hll:
    {carry, result} = augend + addend + carry;
    asm:
    CARRY Rc,{{IO}}
    ADD Rd,Ra1,Ra2

- MS _BitScanReverse64 or Gnu __builtin_ctzll or equivalent

    asm:
    CLZ Rd,Rs
    ---------------------------
    Not really.
    That is, conversions are not blazingly fast, but still much better
    than any attempt to divide in any form of decimal. And helps to
    preserve your sanity.

    Are you trying to pull our proverbial leg here ?!?

    There is also psychological factor at play - your users expect
    division and square root to be slower than other primitive FP
    operations, so they are not disappointed. Possibly they are even
    pleasantly surprised, when they find out that the difference in
    throughput between division and multiplication is smaller than factor
    20-30 that they were accustomed to for 'double' on their 20 y.o. Intel
    and AMD.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Nov 13 14:34:43 2025
    From Newsgroup: comp.arch

    On 11/13/2025 3:24 AM, Anton Ertl wrote:
    Michael S <already5chosen@yahoo.com> writes:
    I almost agree, except for C95.

    What is C95? I only know of C89/90, C99, C11, C23.


    Essentially, it is C89, but:
    Has // style comments;
    Has "long long" and similar.
    Vs, plain C89, where one only has
    /* comment */
    long (32-bit).


    More or less what most versions of MSVC supported between ~ 2000 and
    2013 (VS2015 added some C99 stuff).

    Still required if one wants to be able to target Win2K or WinXP, as the versions of the compiler that support these only support C95.


    Much prior to this, and it drops to C89; but mostly only really matters
    if one wants to compile code on Windows 3.11 or similar.

    Though, there is the other option of (for older Windows versions, or
    real-mode MS-DOS) using Borland C instead.

    OTOH, for some other targets there are compilers like SDCC or CC65,
    which IIRC lack support for "long long", but I sorta suspect there is no practical reason to want to run this code on these targets (well, except
    maybe for novelty of running Decimal128 math on a 6502 or something...).

    I did set a limit of mostly ignoring 8/16-bit machines.



    Also, I wouldn't consider such project without few extensions of
    standard language. As a minimum:
    - ability to get upper 64 bit of 64b*64b product
    - convenient way to exploit 64-bit add with carry

    I have explored these topics recently in "Multi-precision integer arithmetics" <http://www.complang.tuwien.ac.at/anton/tmp/carry2.pdf>.

    Actually, with uint128_t you get pretty far, and _BitInt(bits) has
    been added in C23, which has good potential, but is not quite there.
    Builtins for add-with-carry and intrinsics are somewhat disappointing.


    Though, probably going to be a good long time before MSVC gets these...


    For now, if one wants 128-bit math, it is mostly via wrapper structs and explicit function calls.

    Could work OK. Except that shift-and-subtract division is slow.
    At present, I lack a good/efficient way to break a 128-bit integer into
    10e9 chunks (if using the 10e9 divider, this sucks).


    Can note that GCC seemingly doesn't support 128-bit integers on 64-bit
    RISC-V. Also, doing 128-bit arithmetic on RV64 kinda sucks as there is basically no good way to do extended precision arithmetic (essentially,
    the ISA offers nothing more here than what C already gives you).

    Like, you can do what is essentially:
    c_lo = a_lo + b_lo;
    c_hi = a_hi + b_hi;
    if((c_lo<a_lo) || (c_lo<b_lo))
    c_hi++;
    But... This kinda sucks...

    ...


    Though, can at least do multiply by 10e9 and similar fast-ish via fixed-patterns shift-and-add. In premise, could use the "toothpaste
    tube" strategy (multiplying by powers of 10 to squeeze digits out the
    top), but would need to figure out the appropriate magic number to
    multiply against the Int128 value (via a 128*128->256bit multiply,
    keeping high result) to be able to get the value scaled correctly to use
    this algo (this multiply also being an area of concern).

    Ironically, this strategy is more directly relevant to Binary128 or
    similar, as in this case, Binary128 will already have the mantissa bits
    scaled in the correct way (after normalizing to remove the integer part).

    I had experimented with trying to "crack" groups of digits off the
    low-end, eg:
    while(arr[0]>=1000000000)
    {
    arr[0]-=1000000000;
    Inc(arr+1);
    }
    But, alas, it was seemingly not so easy, and this does not give the
    correct results.

    It is possible to use an approach similar to double-dabble (feeding in
    the binary number 1 bit at a time, and adding the decimal vector to
    itself and incrementing for each 1 bit seen). But, alas, this is also
    slow in this case (takes around 128 iterations to convert the Int128 to
    4x 10e9). Though, still slightly faster than using a shift-subtract
    divider to crack off 9 digit chunks by successively dividing by 1000000000.
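
For what it's worth, that cracking step can be written as long division
by 10^9 over radix-2^32 limbs using only 64-bit arithmetic; a sketch,
assuming the 128-bit value is kept as four 32-bit limbs (names
illustrative, not BGBCC's code):

#include <stdint.h>

/* Divides the 128-bit value in v[0..3] (v[3] most significant) by 10^9
   in place and returns the remainder, i.e. the next nine decimal
   digits from the bottom.  Call at most five times for a full Int128. */
static uint32_t div128_by_1e9(uint32_t v[4])
{
    uint64_t rem = 0;
    for (int i = 3; i >= 0; i--) {
        uint64_t cur = (rem << 32) | v[i];
        v[i] = (uint32_t)(cur / 1000000000u);
        rem  = cur % 1000000000u;
    }
    return (uint32_t)rem;
}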


    Or, maybe make another attempt at Radix-10e9 long division and see if I
    can get it to actually work and give the correct result.

    Though, might be worthwhile, since if I could make the DIV operator
    faster, I could claim a result of "faster than IBM's decNumber library".

    Even if in practice it might still be moot, as it is still impractically
    slow if compared with Binary128.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 13 20:40:01 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 13 Nov 2025 09:24:20 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    Actually, with uint128_t you get pretty far, and _BitInt(bits) has
    been added in C23, which has good potential, but is not quite there.

    Yes, that what I wrote above.
    As far as BGB is concerned, the big disadvantage is absence of support
    by MSVC.

    Why would that be a disadvantage? If MSVC does not do what he needs,
    there are other C compilers to choose from.

    Builtins for add-with-carry and intrinsics are somewhat disappointing.

    - anton

    For me the most disappointing part is that different architectures
    have different spellings.

    For intrinsics that's by design. They are essentially a way to write assembly language instructions in Fortran or C. And assembly language
    is compiler-specific.

    {Pedantic mode=ON}
    Assembly language is ASSEMBLER specific.
    Compilers have to spit out what the assembler wants or go directly to
    linker representation.
    {Pedantic mode=OFF}

    Other than that even gcc now mostly able to generate
    decent code for Intel's variant. MSVC and clang were able to do it for
    very long time.

    When using the Intel intrinsic c_out = _addcarry_u64(c_in, s1, s2,&sum),
    the code from both gcc and clang uses adcq, but cannot preserve the
    carry in CF in a loop, and moves it into a register right after the
    adcq, and back from the register to CF right before:

    addb $-1, %r8b
    adcq (%rdx,%rax,8), %r9
    setb %r8b

    CALK R9,what,ever
    CARRY R9,{{IO}}
    ADD R8,Rs1,Rs2
    performs
    {R9, R8} = R9 + Rs1 + Rs2;

    If you (or compiler unrolling) have several _addcarry_u64 in a row,
    with the carry-out becoming the carry-in of the next one, at least one
    of these compilers manages to eliminate the overhead between these
    adcqs, but of course not at the start and end of the sequence.

    Or do you have in mind new gcc intrinsic in a group "Arithmetic with >Overflow Checking" ?

    These are gcc builtins, not intrinsics. The difference is that they
    work on all architectures. However, when I looked (three months ago),
    gcc did not have a builtin with carry-in; the builtins you mention
    only provide carry-out (or overflow-out).

    However, clang has a builtin with carry-in and carry-out:
    sum = __builtin_addcll(s1, s2, c_in, &c_out)

    Unfortunately, the code produced by clang is pretty horrible for ARM
    A64 and AMD64:

    ARM A64: # clang 11.0.1 -Os
    adds x9, x9, x10
    cset w10, hs
    adds x9, x9, x8
    cset w8, hs
    orr w8, w10, w8

    AMD64: # clang 14.0.6 -march=x86-64-v4 -Os
    addq (%rdx,%r8,8), %r9
    setb %r10b
    addq %rax, %r9
    setb %al
    orb %r10b, %al
    movzbl %al, %eax

    For RISC-V the code is a five-instruction sequence, which is the
    minimum that's possible on RISC-V.

    2 in My 66000, 1 if you don't count CARRY as it is an
    instruction-modifier instead of an instruction. There is
    only 1 instruction that "gets executed".


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 13 21:50:59 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 13 Nov 2025 09:24:20 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    Actually, with uint128_t you get pretty far, and _BitInt(bits) has
    been added in C23, which has good potential, but is not quite there.

    Yes, that what I wrote above.
    As far as BGB is concerned, the big disadvantage is absence of support
    by MSVC.

    Why would that be a disadvantage? If MSVC does not do what he needs,
    there are other C compilers to choose from.

    Builtins for add-with-carry and intrinsics are somewhat disappointing.

    - anton

    For me the most disappointing part is that different architectures
    have different spellings.

    For intrinsics that's by design. They are essentially a way to write
    assembly language instructions in Fortran or C. And assembly language
    is compiler-specific.

    {Pedantic mode=ON}
    Assembly language is ASSEMBLER specific.

    What I wanted to write was "And assembly language is
    architecture-specific".

It's the builtin functions that are compiler-specific.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 13 21:58:13 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
Can note that GCC seemingly doesn't support 128-bit integers on 64-bit
RISC-V.

    What makes you think so? It has certainly worked every time I tried
    it. E.g., Gforth's "configure" reports:

    checking size of __int128_t... 16
    checking size of __uint128_t... 16
    [...]
    checking for a C type for double-cells... __int128_t
    checking for a C type for unsigned double-cells... __uint128_t

    That's with gcc 10.3.1

    Also, doing 128-bit arithmetic on RV64 kinda sucks as there is
    basically no good way to do extended precision arithmetic (essentially,
    the ISA offers nothing more here than what C already gives you).

    Like, you can do what is essentially:
    c_lo = a_lo + b_lo;
    c_hi = a_hi + b_hi;
    if((c_lo<a_lo) || (c_lo<b_lo))
    c_hi++;

    You only need to check for c_lo<a_lo (or for c_lo<b_lo), they will
    either both be true or both be false.

    Here's 128-bit arithmetic on RV64GC (and very similar on MIPS and
    Alpha):

    add a4,a4,a5
    sltu a5,a4,a5
    add s8,s8,s9
    add s9,a5,s8

RISC-V (and MIPS and Alpha) becomes really bad when you need add with
    carry-in and carry-out (five instructions).
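
Written out in C, the carry-in/carry-out case looks like this (a
sketch); the two compares and the OR are what turn into the extra
instructions:

#include <stdint.h>

/* 64-bit add with carry-in and carry-out; on RV64 this is
   add, sltu, add, sltu, or. */
static uint64_t addc64(uint64_t a, uint64_t b, unsigned cin, unsigned *cout)
{
    uint64_t s = a + b;
    unsigned c = s < a;        /* carry out of a + b */
    s += cin;
    c |= s < cin;              /* carry out of adding the carry-in */
    *cout = c;
    return s;
}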

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 13 22:13:54 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Michael S <already5chosen@yahoo.com> writes:
    On Thu, 13 Nov 2025 09:24:20 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    Actually, with uint128_t you get pretty far, and _BitInt(bits) has
    been added in C23, which has good potential, but is not quite there.

    Yes, that what I wrote above.
    As far as BGB is concerned, the big disadvantage is absence of support
    by MSVC.

    Why would that be a disadvantage? If MSVC does not do what he needs,
    there are other C compilers to choose from.

    Builtins for add-with-carry and intrinsics are somewhat disappointing. >> >>
    - anton

    For me the most disappointing part is that different architectures
    have different spellings.

    For intrinsics that's by design. They are essentially a way to write
    assembly language instructions in Fortran or C. And assembly language
    is compiler-specific.

    {Pedantic mode=ON}
    Assembly language is ASSEMBLER specific.

    What I wanted to write was "And assembly language is
    architecture-specific".

    I have worked on a single machine with several different ASM "compilers". Believe me, one asm can be different than another asm.

    But it is absolutely true that asm is architecture specific.

    It's the builtin function that are compiler-specific.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Nov 14 00:43:07 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
Can note that GCC seemingly doesn't support 128-bit integers on 64-bit
RISC-V.

    What makes you think so? It has certainly worked every time I tried
    it. E.g., Gforth's "configure" reports:

    checking size of __int128_t... 16
    checking size of __uint128_t... 16
    [...]
    checking for a C type for double-cells... __int128_t
    checking for a C type for unsigned double-cells... __uint128_t

    That's with gcc 10.3.1

    Also, doing 128-bit arithmetic on RV64 kinda sucks as there is
basically no good way to do extended precision arithmetic (essentially,
the ISA offers nothing more here than what C already gives you).

    Like, you can do what is essentially:
    c_lo = a_lo + b_lo;
    c_hi = a_hi + b_hi;
    if((c_lo<a_lo) || (c_lo<b_lo))
    c_hi++;

    You only need to check for c_lo<a_lo (or for c_lo<b_lo), they will
    either both be true or both be false.

    Here's 128-bit arithmetic on RV64GC (and very similar on MIPS and
    Alpha):

    add a4,a4,a5
    sltu a5,a4,a5
    add s8,s8,s9
    add s9,a5,s8

    RISC-V (and MIPS and Alpha) becomes relly bad when you need add with
    carry-in and carry-out (five instructions).

    My 66000: // 256-bit add
    CARRY R15,{{O}{IO}{IO}{I}}
    ADD R12,R8,R24
    ADD R13,R9,R25
    ADD R14,R10,R26
    ADD R15,R11,R27
    !!!!!!

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Thu Nov 13 19:17:33 2025
    From Newsgroup: comp.arch

    On 11/13/2025 3:58 PM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    Can note that GCC seemingly doesn't support 128-bit integers on 64-bit
    RISC-V.

    What makes you think so? It has certainly worked every time I tried
    it. E.g., Gforth's "configure" reports:

    checking size of __int128_t... 16
    checking size of __uint128_t... 16
    [...]
    checking for a C type for double-cells... __int128_t
    checking for a C type for unsigned double-cells... __uint128_t

    That's with gcc 10.3.1


    Hmm...

    Seems so.

    Testing again, it does appear to work; the error message I thought I remembered seeing, instead applied to when trying to use the type in
    MSVC. I had thought I remembered checking before and it failing, but it
    seems not.

    But, yeah, good to know I guess.


    As for MSVC:
    tst_int128.c(5): error C4235: nonstandard extension used: '__int128'
    keyword not supported on this architecture

    MSVC doesn't recognize __int128_t at all.

    Where:
    """
    Microsoft (R) C/C++ Optimizing Compiler Version 19.44.35219 for x64
    Copyright (C) Microsoft Corporation. All rights reserved.
    """

    ...



    Either way, it falls outside the scope of the C dialect I was targeting
    here; and at least one of the compilers I am using still doesn't support it.

    It is possible though I could still have an ifdef for targets that have
    this type.


    Misc:
    Decided to post a graph generated by BGBCC: https://x.com/cr88192/status/1989134230648156378/photo/1

    It shows the relative distribution of constants as recorded by the compiler.

    I had considered trying to feed the data into LibreOffice, but the
    interface is awkward enough that it became less effort to just add graph-drawing code to my compiler. Nevermind if graph-drawing
    technically falls outside of the compiler's scope of responsibilities.

    Partly this was to make a case for why it makes sense to have 33-bit
    immediate values, but not bother so much with slightly larger values.

    Basically, one has to cross a big gap of "mostly nothing" before
    reaching a modest spike up near the 64-bit mark.

    Note that Y axis is in Log 2 (it was either this or mask off 0).

    Granted, there are other possible ways to graph this.

    This particular example mostly resulted from compiling Doom.


    Also, doing 128-bit arithmetic on RV64 kinda sucks as there is
    basically no good way to do extended precision arithmetic (essentially,
    the ISA offers nothing more here than what C already gives you).

    Like, you can do what is essentially:
    c_lo = a_lo + b_lo;
    c_hi = a_hi + b_hi;
    if((c_lo<a_lo) || (c_lo<b_lo))
    c_hi++;

    You only need to check for c_lo<a_lo (or for c_lo<b_lo), they will
    either both be true or both be false.


    OK, I wasn't sure here.


    Here's 128-bit arithmetic on RV64GC (and very similar on MIPS and
    Alpha):

    add a4,a4,a5
    sltu a5,a4,a5
    add s8,s8,s9
    add s9,a5,s8

    RISC-V (and MIPS and Alpha) becomes relly bad when you need add with
    carry-in and carry-out (five instructions).


    OK.

    Still not great.

    On my ISAs, it is one of:
    ALUX R10, R12, R10 //if supported
    Or(XG1/XG2):
    CLRT
    ADDC R12, R10
    ADDC R13, R11
    Never got around to adding a 3R ADDC (and as-is is basically the same
    idiom as carried over from SH-4).


    On XG3, the latter is no longer formally allowed (partly for consistency
    with RISC-V), but nothing technically prevents it (support for SR.T and predication was demoted to optional, and currently not enabled by default).

    Could maybe still make sense to add a 3R ADDC though at some point, as
    it could help with 256-bit arithmetic (and 256-bit stuff is not
    addressed by ALUX).

    Or, maybe even ADDCX, for 128-bit ADD-with-Carry?...



    Though, did randomly remember a video I saw recently talking about how,
    if the dinosaurs were still around, it is very unlikely human-like
    creatures would have emerged.

    The idea was that creatures would rise to the peaks of the "fitness
    landscape" and eventually get stuck there, and it would have created a
    world where basically there would have been no paths that would have
    favored anything human-like emerging (and it is unlikely that any
    creatures would descend back into the "valleys" to reach other possible
    peaks in the landscape).


    Does make me wonder if similar ideas could apply to things like software
    and CPU architecture. Like, possible higher peaks that could potentially
    lead to significant improvements in performance or capability, but
    nothing can reach them as there is a "valley of suck" in the way.

    Well, there are always random detours that seem to point this way, such
    as with things like trinary logic, analog electronics, stochastic logic,
    etc. Which have interesting properties but are, strictly speaking,
    inferior to what we have now.


    Say, for example, it does seem like Trinary could be used to drive more
    data over a differential signaling bus, say:
    00: 0 (Z)
    01: +1 (P)
    10: -1 (N)
    11: Hi (H, idle state)
    Then, say, one can drive the equivalent of 9 bits of data in 6 clock
    cycles, with enough additional states that they could be used either for error-detection or DC balancing (could in concept do something like NRZ
    where there would often be multiple possible paths to encode every
    possible bit sequence and the encoder chooses the path that maintains
    the best balance and avoids getting stuck in a non-changing state for
    too many cycles).

    Say, for example, every 2 trits encodes 3 bits, but then leaves one
    redundant option. If one assumes that a scheme similar to NRZ is used,
    then in cases where one gets a long run of 0 bits or similar, then it
    can use a redundant zero encoding such that "on the wire" it still sees
    a state transition, maybe:
    ZZ: 000, ZP: 001
    ZN: 010, PZ: 011
    PP: 100, PN: 101
    NZ: 110, NP: 111
    NN: 000
Though, if the mapping rotates after every odd trit, then it becomes
statistically unlikely that any significant DC-imbalance could arise
    (but would make NRZ redundant). Or, if it does arise somehow, the
    encoder could stick some idle-state pulses into the mix as well
    (possibly understood as repeating the prior trit).

    So, say, ZH/PH/NH being equivalent to ZZ/PP/NN but with an extra
    transition, vs HH being the true idle state.

    Well, unless going through a coupling transformer (like in Ethernet)
    where ZZ/HZ would be ineffective (but would be saved by a ZZ/NN transition).
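
A sketch of the 3-bits-to-2-trits mapping above in code, with the
redundant NN code used as an alternate 000 so a run of zero bits still
toggles the wire (trit numbering and the substitution rule are
illustrative):

/* Trits: 0 = Z, 1 = P, 2 = N. */
typedef struct { unsigned char t0, t1; } trit_pair;

static trit_pair encode3(unsigned bits, trit_pair prev)
{
    static const trit_pair map[8] = {
        {0,0}, {0,1},   /* 000 -> ZZ, 001 -> ZP */
        {0,2}, {1,0},   /* 010 -> ZN, 011 -> PZ */
        {1,1}, {1,2},   /* 100 -> PP, 101 -> PN */
        {2,0}, {2,1},   /* 110 -> NZ, 111 -> NP */
    };
    trit_pair out = map[bits & 7u];
    if ((bits & 7u) == 0 && prev.t0 == 0 && prev.t1 == 0)
        out.t0 = out.t1 = 2;    /* alternate 000 encoding: NN */
    return out;
}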


    But, this seems like one of those things that presumably someone would
    have already thought of it?...

    ...



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Nov 14 03:59:08 2025
    From Newsgroup: comp.arch


    BGB <cr88192@gmail.com> posted:

    On 11/13/2025 3:58 PM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    Can note that GCC seemingly doesn't support 128-bit integers on 64-bit
    RISC-V.

    What makes you think so? It has certainly worked every time I tried
    it. E.g., Gforth's "configure" reports:

    checking size of __int128_t... 16
    checking size of __uint128_t... 16
    [...]
    checking for a C type for double-cells... __int128_t
    checking for a C type for unsigned double-cells... __uint128_t

    That's with gcc 10.3.1


    Hmm...

    Seems so.

    Testing again, it does appear to work; the error message I thought I remembered seeing, instead applied to when trying to use the type in
    MSVC. I had thought I remembered checking before and it failing, but it seems not.

    But, yeah, good to know I guess.


    As for MSVC:
    tst_int128.c(5): error C4235: nonstandard extension used: '__int128'
    keyword not supported on this architecture

    ERRRRRRR:: not supported by this compiler, the architecture has
    ISA level support for doing this, but the compiler does not allow
    you access.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 14 07:18:30 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> writes:
    Never got around to adding a 3R ADDC (and as-is is basically the same
    idiom as carried over from SH-4).


On XG3, the latter is no longer formally allowed (partly for consistency
with RISC-V), but nothing technically prevents it (support for SR.T and
predication was demoted to optional, and currently not enabled by default).

    Could maybe still make sense to add a 3R ADDC though at some point, as
    it could help with 256-bit arithmetic (and 256-bit stuff is not
    addressed by ALUX).

    In "Extending General-Purpose Registers with Carry and Overflow Bits" <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> I discuss
    adding a carry bit and overflow bit to every GPR of an architecture.
    To make it concrete how that would affect the instruction set, I
    propose such an instruction set extension for RISC-V. It contains the instructions

    addc rd, rs1, rs2

    which adds the carry bit of rs2 to the 65-bit (i.e., including the
    carry bit) data in rs1. The other instruction I proposed is

    bo rs1, rs2, target

which branches if the overflow bit of rs1 or rs2 is set (why check
    two registers? Because it fits in the RISC-V conditional branch
    instruction scheme).

    A 256-bit addition (d1,c1,b1,a1)+(d2,c2,b2,a2) would look as follows

    add a3,a1,a2
    add b3,b1,b2
    addc b3,b3,a3
    add c3,c1,c2
    addc c3,c3,b3
    add d3,d1,d2
    addc d3,d3,c3

    with 4 cycles latency. addc is limited to having two source registers
    (RV64G instructions all have this limit). The decoder could combine a
    pair of add and addc instructions into one three-source
    macro-instruction. Alternatively, one could add a three-source
    instruction addc4 (VAX-inspired naming) to the instruction set, and
    maybe include subc4 as well.

    Does make me wonder if similar ideas could apply to things like software
and CPU architecture. Like, possible higher peaks that could potentially
lead to significant improvements in performance or capability, but
    nothing can reach them as there is a "valley of suck" in the way.

    Network effects favour incumbents, and network effects are strong in
    computer architecture for general-purpose processors. Sometimes I
    think that it's a miracle that we have seen the progress in computer architecture that we have seen:

    1) We used to have a de-facto standard of 36-bit word-addressed
    machines (ok, there were character-addressed and digit-addressed
    machines at the time, too), and it has been superseded by a
    standard of 8-byte-addressed machines with word size 16 bits, 32
    bits, or 64 bits. The mechanism here seems to have been that most
    of the 36-bit machines had 18-bit addresses, and, as Gordon Bell
    wrote, running out of address bits spells doom for an architecture.

    2) At one point (late 1980s) it looked like big-endian would win
    (almost all workstations at the time, with DEC stuff being the
    exception that proved the rule), but eventually little-endian won,
    thanks to PCs (which inherited the Datapoint 2200 byte order) and
    smart phones (which inherited the 6502 byte order).

    Another, less surprising development is that trapping on unaligned
    accesses is dying out in general-purpose machines. In the 1980s and
    1990s most architectures trapped on unaligned accesses. But that's a
    "feature" that almost no software relies on, so there are no network
    effects in its favour. OTOH, porting software from an architecture
    that performs unaligned accesses is easier to architectures that
    perform unaligned accesses. So eventually all general-purpose
    architectures have converted to performing unaligned accesses, or died
    out. One can see this progression already in S/360->S/370.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Nov 14 14:18:02 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    {Pedantic mode=ON}
    Assembly language is ASSEMBLER specific.

    What I wanted to write was "And assembly language is
    architecture-specific".

    foo_:
    add DWORD PTR [rdi], 1
    ret

    and

    foo_:
    addl $1, (%rdi)
    ret

    are written in two different assembly languages, yet have the same
    meaning when compiled.

    It's the builtin function that are compiler-specific.

    Also, not really. For x86, Intel defines them, and other
    compilers like gcc follow suit.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Fri Nov 14 15:57:22 2025
    From Newsgroup: comp.arch

    BGB wrote:
    It is possible to use an approach similar to double-dabble (feeding in
    the binary number 1 bit at a time, and adding the decimal vector to
    itself and incrementing for each 1 bit seen). But, alas, this is also
    slow in this case (takes around 128 iterations to convert the Int128 to
    4x 10e9). Though, still slightly faster than using a shift-subtract
    divider to crack off 9 digit chunks by successively dividing by 1000000000.


    Or, maybe make another attempt at Radix-10e9 long division and see if I
    can get it to actually work and give the correct result.

    I used division by 1e9 to extract groups of 9 digits from the binary
    result I got when calculating pi with arbitrary precision, back then (on
    a 386) I did it with the obvious edx:eax / 1e9 (in ebx) -> remainder
    (edx) and result (eax) in a loop, which was fast enough for something I
    only needed to do once.

    Today, with 64-bit cpus, why not use a reciprocal mul to get a value
    that cannot be too high, save the result, then back-multiply and subtract?

    Any off-by-one error will be caught by the next iteration.
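
    Concretely, something along these lines (a minimal sketch using
    gcc/clang's unsigned __int128, for illustration only; M is
    floor(2^64/1e9), so the estimate can never be high and is low by at
    most one):

        #include <stdint.h>

        static void divmod_1e9(uint64_t x, uint64_t *q, uint32_t *r)
        {
            const uint64_t M = 18446744073ull;               /* floor(2^64 / 1e9) */
            uint64_t est = (uint64_t)(((unsigned __int128)x * M) >> 64);
            uint64_t rem = x - est * 1000000000ull;          /* back-multiply, subtract */
            if (rem >= 1000000000ull) {                      /* estimate low by at most 1 */
                rem -= 1000000000ull;
                est++;
            }
            *q = est;
            *r = (uint32_t)rem;
        }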


    Though, might be worthwhile, since if I could make the DIV operator
    faster, I could claim a result of "faster than IBM's decNumber library".

    :-)


    Even if in practice it might still be moot, as it is still impractically slow if compared with Binary128.

    Right.

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Nov 14 18:48:44 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    BGB <cr88192@gmail.com> writes:
    Never got around to adding a 3R ADDC (and as-is is basically the same
    idiom as carried over from SH-4).


    On XG3, the latter is no longer formally allowed (partly for consistency
    with RISC-V), but nothing technically prevents it (support for SR.T and
    predication was demoted to optional, and currently not enabled by default).

    Could maybe still make sense to add a 3R ADDC though at some point, as
    it could help with 256-bit arithmetic (and 256-bit stuff is not
    addressed by ALUX).

    In "Extending General-Purpose Registers with Carry and Overflow Bits" <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> I discuss
    adding a carry bit and overflow bit to every GPR of an architecture.

    Which does nothing for MUL and DIV, while creating complications for
    LD/ST if you want to maintain the 66-bit illusion of a GPR through
    memory.

    To make it concrete how that would affect the instruction set, I
    propose such an instruction set extension for RISC-V. It contains the instructions

    addc rd, rs1, rs2

    which adds the carry bit of rs2 to the 65-bit (i.e., including the
    carry bit) data in rs1. The other instruction I proposed is

    bo rs1, rs2, target

    which branches if the overflow bit of rs1 or rs2 are set (why check
    two registers? Because it fits in the RISC-V conditional branch
    instruction scheme).

    A 256-bit addition (d1,c1,b1,a1)+(d2,c2,b2,a2) would look as follows

    add a3,a1,a2
    add b3,b1,b2
    addc b3,b3,a3
    add c3,c1,c2
    addc c3,c3,b3
    add d3,d1,d2
    addc d3,d3,c3

    with 4 cycles latency. addc is limited to having two source registers
    (RV64G instructions all have this limit). The decoder could combine a
    pair of add and addc instructions into one three-source
    macro-instruction. Alternatively, one could add a three-source
    instruction addc4 (VAX-inspired naming) to the instruction set, and
    maybe include subc4 as well.

    CARRY in My 66000 essentially provides an accumulator for a few instructions that supply more operands to and receives another result from a calculation. Most multiprecision calculation sequences are perfectly happy with another register used as an accumulator.

    Does make me wonder if similar ideas could apply to things like software
    and CPU architecture. Like, possible higher peaks that could potentially
    lead to significant improvements in performance or capability, but
    nothing can reach them as there is a "valley of suck" in the way.

    Network effects favour incumbents, and network effects are strong in
    computer architecture for general-purpose processors. Sometimes I
    think that it's a miracle that we have seen the progress in computer architecture that we have seen:

    1) We used to have a de-facto standard of 36-bit word-addressed
    machines (ok, there were character-addressed and digit-addressed
    machines at the time, too), and it has been superseded by a
    standard of 8-byte-addressed machines with word size 16 bits, 32
    bits, or 64 bits. The mechanism here seems to have been that most
    of the 36-bit machines had 18-bit addresses, and, as Gordon Bell
    wrote, running out of address bits spells doom for an architecture.

    2) At one point (late 1980s) it looked like big-endian would win
    (almost all workstations at the time, with DEC stuff being the
    exception that proved the rule), but eventually little-endian won,
    thanks to PCs (which inherited the Datapoint 2200 byte order) and
    smart phones (which inherited the 6502 byte order).

    Another, less surprising development is that trapping on unaligned
    accesses is dying out in general-purpose machines. In the 1980s and
    1990s most architectures trapped on unaligned accesses. But that's a "feature" that almost no software relies on, so there are no network
    effects in its favour. OTOH, porting software from an architecture
    that performs unaligned accesses is easier to architectures that
    perform unaligned accesses. So eventually all general-purpose
    architectures have converted to performing unaligned accesses, or died
    out. One can see this progression already in S/360->S/370.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Nov 14 15:00:14 2025
    From Newsgroup: comp.arch

    <snip>

    In "Extending General-Purpose Registers with Carry and Overflow Bits" <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> I discuss
    adding a carry bit and overflow bit to every GPR of an architecture.
    To make it concrete how that would affect the instruction set, I
    propose such an instruction set extension for RISC-V. It contains the instructions

    Sidetracking a bit here.

    There are 64-regs in Qupls with four flag bits, so loading and storing
    all the flags into a single 64-bit register is not possible. I have
    managed to come up with a scheme that might work.

    Rather than use the dedicated registers ‘storeextra’ and ‘loadextra’, a
    load / store queue entry is directly allocated and used for the purpose.

    The LSQ entry already has enough room to store a cache-line (512-bits)
    for merging operations. The ASTF / ALDF instructions (‘A’ for allocate) supply a bitmask of flag groups that need to be moved. A single 256-bit cache-line data access is performed. ALDF allocates and loads the
    cache-line full of bits. ASTF simply allocates the LSQ entry. Which
    registers need to be moved is indicated by the byte lane selects for the
    LSQ entry (already present in the design).

    For stores an STF instruction sets the byte lane select in the LSQ for
    the flag store for the corresponding register. Once all the byte lane
    selects are set the flags store operation is ready to proceed like any
    other store. (There is already a data valid signal, which could be set).

    For loads the LDF instruction clears the byte lane select for
    corresponding registers. Once all the byte lane selects are cleared,
    then the load is finished.

    The LSQ entry allocated for the load / store remains present in the LSQ
    for the duration of operations.

    <snip>

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Nov 14 14:39:30 2025
    From Newsgroup: comp.arch

    On 11/14/2025 8:57 AM, Terje Mathisen wrote:
    BGB wrote:
    It is possible to use an approach similar to double-dabble (feeding in
    the binary number 1 bit at a time, and adding the decimal vector to
    itself and incrementing for each 1 bit seen). But, alas, this is also
    slow in this case (takes around 128 iterations to convert the Int128
    to 4x 10e9). Though, still slightly faster than using a shift-subtract
    divider to crack off 9 digit chunks by successively dividing by
    1000000000.


    Or, maybe make another attempt at Radix-10e9 long division and see if
    I can get it to actually work and give the correct result.

    I used division by 1e9 to extract groups of 9 digits from the binary
    result I got when calculating pi with arbitrary precision, back then (on
    a 386) I did it with the obvious edx:eax / 1e9 (in ebx) -> remainder
    (edx) and result (eax) in a loop, which was fast enough for something I only needed to do once.

    Today, with 64-bit cpus, why not use a reciprocal mul to get a value
    that cannot be too high, save the result, then back-multiply and subtract?


    Dunno.

    I guess, 128/64 bit IDIV could be possible on the HW, but there isn't a
    good way to access this from C (absent using functionality outside what
    exists in portable C95).

    Could be possible, but at the moment don't really want to go the
    direction of alternate code paths, and wildly different performance
    based on what compiler one is using.


    Though, in simple cases, the compilers are smart enough to turn divide-by-constant into multiply-by-reciprocal internally (if they
    support the type in question).



    If anything, GCC having __int128 support, leans a lot more in favor of
    using BID rather than DPD.

    I decided against using BID with this code for the reason that this
    bottleneck would unnecessarily penalize BID, which would be better
    handled in a different way (namely writing code which works natively
    with numbers in linear integer form; and not in 10e9 form).


    As noted, a double-dabble approach is, say:

        /* a: the Int128 input; arr[]: the value as 4 radix-1e9 digits */
        v=a.hi;
        for(i=0; i<64; i++)
        {
            TKD128_AddArray4(arr, arr, arr);  /* arr = arr + arr (double the decimal vector) */
            arr[0]+=v>>63;                    /* feed in the next binary bit */
            v=v<<1;
        }
        v=a.lo;
        for(i=0; i<64; i++)
        {
            TKD128_AddArray4(arr, arr, arr);
            arr[0]+=v>>63;
            v=v<<1;
        }

    Which technically works, but doesn't really win any awards for speed...


    Any off-by-one error will be caught by the next iteration.


    Yeah...


    Though, did make another attempt at 10e9 long division, and got it
    working correctly this time...

    /* arem is both dividend and remainder, 8 elements; padded.
     * adiv is the divisor (4 elements).
     * aquo is the output quotient (also 8 elements)
     * result derived from high elements.
     */
    void TKD128_LongDivArray8x4(u32 *arem, u32 *adiv, u32 *aquo)
    {
        u32 adtmp[8];
        u64 adx, ady, tdiv;
        u32 ad0, ad1, ad2, ad3, or8;
        int i, j, n, re;

        memset(aquo, 0, 8*sizeof(u32));
        adtmp[0]=0;       adtmp[1]=0;
        adtmp[2]=0;       adtmp[3]=0;
        adtmp[4]=adiv[0]; adtmp[5]=adiv[1];
        adtmp[6]=adiv[2]; adtmp[7]=adiv[3];

        tdiv=adiv[3]; /* assume not zero... */

        for(i=0; i<5; i++)
        {
            /* doesn't always work in a single pass, usually 1 or 2 */
            for(j=0; j<4; j++)
            {
                ad0=arem[7-i];
                ad1=arem[8-i];
                adx=(ad1*1000000000ULL)+ad0;
                ady=adx/(tdiv+1);
                if(!ady)
                    break; /* if was zero, this position is done */
                ad2=ady;
                if(ady>=2000000000) /* range limit so no overflow */
                    { ad2=1999999999; }
                if(ad2>0)
                {
                    TKD128_SubScaleArray8X_30(arem, adtmp, ad2, arem);
                    ad3=aquo[0];
                    ad3+=ad2;
                    if(ad3>=1000000000)
                        { aquo[1]++; ad3-=1000000000; }
                    aquo[0]=ad3;
                }
            }
            TKD128_ScaleLeftArray8_S9(aquo);
            TKD128_ScaleRightArray8_S9(adtmp);
        }
        for(; i<8; i++)
            TKD128_ScaleLeftArray8_S9(aquo);
    }

    TKD128_ScaleLeftArray8_S9:
        Copy elements left (towards a higher index) by 1 position (32 bits).

    TKD128_ScaleRightArray8_S9:
        Copy elements right (towards a lower index) by 1 position (32 bits).

    TKD128_SubScaleArray8X_30(c, a, b, d):
        For each element i (roughly):
            v=a[i]*b;
            v_h=v/1000000000;
            v_l=v-v_h*1000000000;
            d[i+0]=c[i+0]-v_l;
            d[i+1]=c[i+1]-v_h;
        With extra parts to deal with borrow propagation and similar.
        May access out-of-bounds for the c/d arrays
        (arem needs to be padded by a few extra elements).

    Note that it runs for 5 iterations (vs 4) because this is how one gets
    it to produce a full fraction rather than an integer divide (the integer
    divide results are similar, but differ in the low order digits).

    Running for 5 appears sufficient (it could run for 6..8, but these appear
    to deliver the same final result and are slower).

    One could debate whether stopping early could affect the results, but
    the low-order digits are initialized to 0, and 1000000000-999999999 is
    000000001, so in the worst case the low-order borrows would simply be
    absorbed.



    Performance:
    Slightly faster than using N-R.
    But, still nowhere near what I had hoped.



    Though, might be worthwhile, since if I could make the DIV operator
    faster, I could claim a result of "faster than IBM's decNumber library".

    :-)


    Currently, ADD/SUB/MUL seem to be faster in my case.

    DIV is still slower, and seems to be putting up a big fight here.
    decNumber seems to have a DIV that is around 1/2 the speed of the MUL.
    In my case, DIV is still around 10x slower than MUL.

    Currently I have it at 3.4 MHz in GCC, 2.7 MHz in MSVC.

    To match decNumber, would need to get closer to around 6 million divides
    per second (in GCC).


    Current stats in a GCC build are:
    ADD/SUB: 36 MHz (unpacked), 17 MHz (DPD)
    MUL: 27 MHz (unpacked), 14 MHz (DPD)
    DIV: 3.4 MHz (both)
    SQRT: 1.0 MHz (both)


    MSVC scores:
    ADD/SUB: 13 MHz (unpacked), 0.8 MHz (DPD)
    MUL: 18 MHz (unpacked), 1.0 MHz (DPD)
    DIV: 2.7 MHz, 0.7 MHz (DPD)
    SQRT: 0.8 MHz, 0.6 MHz (DPD)

    Everything is slower here with MSVC it seems...
    The DPD pack/unpack kinda wrecks things.
    The X30 pack/unpack is around 36% faster than DPD.

    Not entirely sure why MSVC is sucking so badly here (it doesn't usually
    suck this bad).

    Checking Clang:
    It is slightly faster than MSVC, but much closer to the MSVC performance
    than to the GCC performance in this case (so, whatever issue is
    affecting MSVC here also appears to affect Clang).


    This may require investigation, but then again, a lot of this isn't
    exactly "high performance" code (and does a lot of stuff I would
    normally avoid, but was basically unavoidable due to the whole
    Radix-10e9 thing).



    decNumber uses DPD, but is around 13 and 12 MHz in GCC with similar inputs.
    As noted, its DIV is still a bit faster.
    Currently only seems to build with GCC or Clang.

    SQRT is N/A, as decNumber seemingly lacks SQRT, or any other complex
    math functions. Like, no log/pow, nor sin/cos/tan/..., ...


    Also, a lot of its example programs are for things like calculating
    interest and similar...



    Even if in practice it might still be moot, as it is still
    impractically slow if compared with Binary128.

    Right.

    Terje


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 14 22:32:14 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    {Pedantic mode=ON}
    Assembly language is ASSEMBLER specific.

    What I wanted to write was "And assembly language is
    architecture-specific".

    foo_:
    add DWORD PTR [rdi], 1
    ret

    and

    foo_:
    addl $1, (%rdi)
    ret

    are written in two different assembly languages, yet have the same
    meaning when compiled.

    That does not contradict what I wrote. Both assembly languages are
    specific to the AMD64 architecture.

    It's the builtin function that are compiler-specific.

    Also, not really. For x86, Intel defines them, and other
    compilers like gcc follow suit.

    You are confusing builtins with intrinsics. Builtins are defined by
    the compiler. E.g., __builtin_addcll() is supported by clang on all architectures, but is not supported by gcc. By contrast, the
    intrinsic _addcarry_u64() is defined by Intel and is supported on gcc
    and clang (and, I guess icc, and maybe others), but only when
    compiling for AMD64.
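
    For instance (a small illustration; exact header and version
    availability may vary):

        #include <stdint.h>

        #if defined(__clang__)
        /* clang builtin: works on any target clang supports */
        unsigned long long add_with_carry_builtin(unsigned long long a,
                                                  unsigned long long b,
                                                  unsigned long long cin,
                                                  unsigned long long *cout)
        {
            return __builtin_addcll(a, b, cin, cout);
        }
        #endif

        #if defined(__x86_64__) || defined(_M_X64)
        #include <immintrin.h>
        /* Intel-defined intrinsic: AMD64 only, but gcc/clang/icc all accept it */
        unsigned long long add_with_carry_intrin(unsigned long long a,
                                                 unsigned long long b,
                                                 unsigned char cin,
                                                 unsigned char *cout)
        {
            unsigned long long out;
            *cout = _addcarry_u64(cin, a, b, &out);
            return out;
        }
        #endif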

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 14 22:38:07 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    In "Extending General-Purpose Registers with Carry and Overflow Bits"
    <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> I discuss
    adding a carry bit and overflow bit to every GPR of an architecture.

    Which does nothing for MUL and DIV, while creating complications for
    LD/ST if you want to maintain the 66-bit illusion of a GPR through
    memory.

    I don't think the benefit is worth the cost, as do you, because you
    support your CARRY functionality only in very limited sequences. So
    storing stores only 64 bits, and loading only loads those bits, and
    sets carry and overflow to no overflow.

    A 256-bit addition (d1,c1,b1,a1)+(d2,c2,b2,a2) would look as follows

    add a3,a1,a2
    add b3,b1,b2
    addc b3,b3,a3
    add c3,c1,c2
    addc c3,c3,b3
    add d3,d1,d2
    addc d3,d3,c3

    with 4 cycles latency.

    CARRY in My 66000 essentially provides an accumulator for a few instructions
    that supply more operands to and receives another result from a calculation.
    Most multiprecision calculation sequences are perfectly happy with another
    register used as an accumulator.

    How does a four-input 2048-bit-addition look with your CARRY? For GPRs-with-flags it would look as follows:

    L:
    ld xn, (xp)
    ld yn, (yp)
    ld zn, (zp)
    ld tn, (tp)
    add rn, xn, yn
    addc rn, rn, rm
    add sn, zn, tn
    addc sn, sn, sm
    add vn, rn, tn
    addc vn, vn, vm
    sd vn, (vp)
    ... #mov rn, sn, vn to rm, sm, vm
    ... #increment xp yp zp tp vp
    ... #loop control and branch back to L:

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 15 01:22:03 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    In "Extending General-Purpose Registers with Carry and Overflow Bits"
    <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> I discuss
    adding a carry bit and overflow bit to every GPR of an architecture.

    Which does nothing for MUL and DIV, while creating complications for
    LD/ST if you want to maintain the 66-bit illusion of a GPR through
    memory.

    I don't think the benefit is worth the cost, as do you, because you
    support your CARRY functionality only in very limited sequences. So
    storing stores only 64 bits, and loading only loads those bits, and
    sets carry and overflow to no overflow.

    My point was that 1 bit of carry is not enough when MUL and DIV need
    64 bits--and that is the issue CARRY addresses. In addition, multi-
    width shifts also require <essentially> a whole register of width.

    A 256-bit addition (d1,c1,b1,a1)+(d2,c2,b2,a2) would look as follows

    add a3,a1,a2
    add b3,b1,b2
    addc b3,b3,a3
    add c3,c1,c2
    addc c3,c3,b3
    add d3,d1,d2
    addc d3,d3,c3

    with 4 cycles latency.

    CARRY in My 66000 essentially provides an accumulator for a few instructions
    that supply more operands to and receives another result from a calculation.
    Most multiprecision calculation sequences are perfectly happy with another
    register used as an accumulator.

    How does a four-input 2048-bit-addition look with your CARRY? For GPRs-with-flags it would look as follows:

    L:
    ld xn, (xp)
    ld yn, (yp)
    ld zn, (zp)
    ld tn, (tp)
    add rn, xn, yn
    addc rn, rn, rm
    add sn, zn, tn
    addc sn, sn, sm
    add vn, rn, tn
    addc vn, vn, vm
    sd vn, (vp)
    .. #mov rn, sn, vn to rm, sm, vm
    .. #increment xp yp zp tp vp
    .. #loop control and branch back to L:

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 15 01:28:27 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    In "Extending General-Purpose Registers with Carry and Overflow Bits"
    <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> I discuss
    adding a carry bit and overflow bit to every GPR of an architecture.

    Which does nothing for MUL and DIV, while creating complications for
    LD/ST if you want to maintain the 66-bit illusion of a GPR through
    memory.

    I don't think the benefit is worth the cost, as do you, because you
    support your CARRY functionality only in very limited sequences. So
    storing stores only 64 bits, and loading only loads those bits, and
    sets carry and overflow to no overflow.

    My point was that 1 bit of carry is not enough when MUL and DIV need
    64 bits--and that is the issue CARRY addresses. In addition, multi-
    width shifts also require <essentially> a whole register of width.

    A 256-bit addition (d1,c1,b1,a1)+(d2,c2,b2,a2) would look as follows

    add a3,a1,a2
    add b3,b1,b2
    addc b3,b3,a3
    add c3,c1,c2
    addc c3,c3,b3
    add d3,d1,d2
    addc d3,d3,c3

    with 4 cycles latency.

    CARRY in My 66000 essentially provides an accumulator for a few instructions
    that supply more operands to and receives another result from a calculation.
    Most multiprecision calculation sequences are perfectly happy with another
    register used as an accumulator.

    How does a four-input 2048-bit-addition look with your CARRY? For GPRs-with-flags it would look as follows:

    L:
    ld xn, (xp)
    ld yn, (yp)
    ld zn, (zp)
    ld tn, (tp)
    add rn, xn, yn
    addc rn, rn, rm
    add sn, zn, tn
    addc sn, sn, sm
    add vn, rn, tn
    addc vn, vn, vm
    sd vn, (vp)
    .. #mov rn, sn, vn to rm, sm, vm
    .. #increment xp yp zp tp vp
    .. #loop control and branch back to L:

    //pretty close to::

    MOV R12,#0
    VEC R7,{}
    LDD R8,[Rx,Ri<<3]
    LDD R9,[Ry,Ri<<3]
    LDD R10,[Rz,Ri<<3]
    LDD R11,[Rt,Ri<<3]
    CARRY R12,{{IO}{IO}{IO}}
    ADD R13,R8,R9
    ADD R14,R10,R11
    ADD R14,R14,R13
    STD R14,[Rv,Ri<<3]
    LOOP R7,LT,#1,#32

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Nov 15 10:46:42 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:
    <snip>

    In "Extending General-Purpose Registers with Carry and Overflow Bits"
    <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> I discuss
    adding a carry bit and overflow bit to every GPR of an architecture.
    To make it concrete how that would affect the instruction set, I
    propose such an instruction set extension for RISC-V. It contains the
    instructions

    Sidetracking a bit here.

    There are 64-regs in Qupls with four flag bits,

    What other flags do you use?

    A common set of flags is NZCV. Of these N and Z can be generated from
    the 64 ordinary bits (actually N is the MSB of these bits).

    You might also want NCZV of 32-bit instructions, but in that case all
    flags are derivable from the 64 ordinary bits of the GPR; but in that
    case you may need additional branch instructions: Instructions that
    check only if the bottom 32-bits are 0 (Z), if bit 31 is 1 (N), if bit
    32 is 1 (C), or if bit 32 is different from bit 31 (V).

    Concerning saving the extra bits across interrupts, yes, this has to
    be adapted to the actual architecture, and there are many ways to skin
    this cat. I just outlined one to give an idea how this can be done.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Nov 15 07:48:21 2025
    From Newsgroup: comp.arch

    On 2025-11-15 5:46 a.m., Anton Ertl wrote:
    Robert Finch <robfi680@gmail.com> writes:
    <snip>

    In "Extending General-Purpose Registers with Carry and Overflow Bits"
    <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> I discuss
    adding a carry bit and overflow bit to every GPR of an architecture.
    To make it concrete how that would affect the instruction set, I
    propose such an instruction set extension for RISC-V. It contains the
    instructions

    Sidetracking a bit here.

    There are 64-regs in Qupls with four flag bits,

    What other flags do you use?

    A capabilities tag bit, and possibly a bit to indicate float/integer or pointer data. Because of the implementation there are eight bits
    available in the register file (only a byte update is available with the
    BRAM, so it's eight or none). But I am planning on using only four bits so
    there is less data to move around.


    A common set of flags is NZCV. Of these N and Z can be generated from
    the 64 ordinary bits (actually N is the MSB of these bits).

    You might also want NCZV of 32-bit instructions, but in that case all
    flags are derivable from the 64 ordinary bits of the GPR; but in that
    case you may need additional branch instructions: Instructions that
    check only if the bottom 32-bits are 0 (Z), if bit 31 is 1 (N), if bit
    32 is 1 (C), or if bit 32 is different from bit 31 (V).

    Concerning saving the extra bits across interrupts, yes, this has to
    be adapted to the actual architecture, and there are many ways to skin
    this cat. I just outlined one to give an idea how this can be done.

    - anton


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Sat Nov 15 15:36:22 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:
    CARRY in My 66000 essentially provides an accumulator for a few instructions that supply more operands to and receives another result from a calculation. Most multiprecision calculation sequences are perfectly happy with another register used as an accumulator.

    I think I've said so before, but it bears repeating:

    I _really_ love CARRY!

    It provides a lot of "missing link" operations, while adding zero extra
    bits to all the instructions that don't need it.

    That said, if I had infinite resources (in this case infinity == 4
    sources), I would like to have an unsigned integer MulAddAdd like this:

    (hi, lo) = a*b+c+d

    simply because this is the largest possible building block that cannot overflow, the result range covers the full 128 bit space.
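
    A quick check of that bound (using gcc/clang's unsigned __int128; just
    an illustration, not production code): with every input at its maximum
    of 2^64-1, the sum lands exactly on 2^128-1, so a 129th bit is never
    needed:

        #include <assert.h>

        int main(void)
        {
            unsigned __int128 m = ~(unsigned __int128)0 >> 64;  /* 2^64 - 1 */
            unsigned __int128 r = m*m + m + m;   /* (2^64-1)^2 + 2*(2^64-1) */
            assert(r == ~(unsigned __int128)0);  /* == 2^128 - 1: still fits */
            return 0;
        }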

    From what you've taught us about multipliers, adding one (or in this
    case two) extra inputs to the adder that aggregates all the partial multiplication products will be close to free in time, but the routing
    of the extra set of inputs might require an extra cycle?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 15 18:04:16 2025
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:
    CARRY in My 66000 essentially provides an accumulator for a few instructions
    that supply more operands to and receives another result from a calculation.
    Most multiprecision calculation sequences are perfectly happy with another register used as an accumulator.

    I think I've said so before, but it bears repeating:

    I _really_ love CARRY!

    It provides a lot of "missing link" operations, while adding zero extra
    bits to all the instructions that don't need it.

    That said, if I had infinite resources (in this case infinity == 4
    sources), I would like to have an unsigned integer MulAddAdd like this:

    (hi, lo) = a*b+c+d

    Alas:: the best CARRY can do is:

    {hi,c} = a×b+hi

    simply because this is the largest possible building block that cannot overflow, the result range covers the full 128 bit space.

    From what you've taught us about multipliers, adding one (or in this
    case two) extra inputs to the adder that aggregates all the partial multiplication products will be close to free in time, but the routing
    of the extra set of inputs might require an extra cycle?

    In the integer case, there is no rounding.
    In the FP case, FMAC is not part of the CARRY applicable OpCode space.

    Terje
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 15 18:07:19 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Robert Finch <robfi680@gmail.com> writes:
    <snip>

    In "Extending General-Purpose Registers with Carry and Overflow Bits"
    <http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf> I discuss
    adding a carry bit and overflow bit to every GPR of an architecture.
    To make it concrete how that would affect the instruction set, I
    propose such an instruction set extension for RISC-V. It contains the
    instructions

    Sidetracking a bit here.

    There are 64-regs in Qupls with four flag bits,

    What other flags do you use?

    A common set of flags is NZCV. Of these N and Z can be generated from
    the 64 ordinary bits (actually N is the MSB of these bits).

    You might also want NCZV of 32-bit instructions, but in that case all
    flags are derivable from the 64 ordinary bits of the GPR; but in that
    case you may need additional branch instructions: Instructions that
    check only if the bottom 32-bits are 0 (Z), if bit 31 is 1 (N), if bit
    32 is 1 (C), or if bit 32 is different from bit 31 (V).

    If you write an architectural rule whereby every integer result is
    "proper" one set of bits {top, bottom, dispersed} covers everything.

    Proper means that all the bits in the register are written but the
    value written is range limited to {Sign}×{Size} of the calculation.

    Concerning saving the extra bits across interrupts, yes, this has to
    be adapted to the actual architecture, and there are many ways to skin
    this cat. I just outlined one to give an idea how this can be done.

    On the other hand, with CARRY, none of those bits are needed.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Nov 15 18:01:28 2025
    From Newsgroup: comp.arch

    Terje Mathisen <terje.mathisen@tmsw.no> writes:
    That said, if I had infinite resources (in this case infinity == 4
    sources), I would like to have an unsigned integer MulAddAdd like this:

    (hi, lo) = a*b+c+d

    simply because this is the largest possible building block that cannot
    overflow, the result range covers the full 128 bit space.

    Not just that, it is also useful for mpn_addmul_1() (ma = ma+s*mb, where m
    is multi-precision, and s is single-precision), which is a useful
    stepping stone for mpn_mul() (m=ma*mb).

    One iteration of mpn_addmul_1() performs:

    (hi[i], ma[i]) = ma[i]+s*mb[i]+hi[i-1]
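
    In portable C (a sketch of mine using unsigned __int128, not GMP's
    actual code), the whole loop is:

        #include <stdint.h>
        #include <stddef.h>

        /* ma += s * mb, n limbs each; returns the final carry limb */
        static uint64_t addmul_1(uint64_t *ma, const uint64_t *mb,
                                 size_t n, uint64_t s)
        {
            uint64_t hi = 0;                  /* hi[i-1] in the recurrence above */
            for (size_t i = 0; i < n; i++) {
                unsigned __int128 t = (unsigned __int128)s * mb[i] + ma[i] + hi;
                ma[i] = (uint64_t)t;          /* low word back into ma */
                hi    = (uint64_t)(t >> 64);  /* carry word into the next iteration */
            }
            return hi;
        }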

    From what you've taught us about multipliers, adding one (or in this
    case two) extra inputs to the adder that aggregates all the partial
    multiplication products will be close to free in time

    The question is the latency. If you get a latency of 4 cycles from
    each input of the a*b+c+d computation to the results, this means that
    the recurrence from hi[i-1] to hi[i] takes 4 cycles, which becomes a performance problem if m is large. One can work around that for
    m=ma*mb by rearranging the computations, but if you really need
    just something like m=s*ma, that's not possible.

    Alternatively, you might use

    (hi, lo) = ma[i]+s*mb[i]
    (hi[i], ma[i])=(hi,lo)+hi[i-1]

    The first line has no recurrences, and so executing it is only limited
    by CPU resources, the second operation has a recurrence from hi[i-1]
    to hi[i], but it takes only one cycle of latency with the right
    architecture, e.g.:

    AMD64 with ADX:
    #rdx = s
    #carry = carry1+C+O
    mulx ma, m, carry2
    adcx mb, m
    adox carry1, m
    mov carry2, carry1

    Given that, a useful instruction is

    (hi,lo) = a*b+c

    Then you only need one carry flag for the rest. A hypothetical ARM
    A64 which has umaddh in addition to (the existing) madd could do this
    with one cycle of recurrence latency, too:

    ARM A64 with umaddh:
    # carry = carry1+C
    umaddh carry2, s, ma, mb
    madd smab, s, ma, mb
    adcs m, smab, carry1
    mov carry1, carry2

    but the routing
    of the extra set of inputs might require an extra cycle?

    Depends on the microarchitecture. The ARM A64 architects have
    instructions that have 4 inputs (store pair with an addressing modes
    that reads two registers), and it apparently works for them.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 16 08:22:52 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    A common set of flags is NZCV. Of these N and Z can be generated from
    the 64 ordinary bits (actually N is the MSB of these bits).

    You might also want NCZV of 32-bit instructions, but in that case all
    flags are derivable from the 64 ordinary bits of the GPR; but in that
    case you may need additional branch instructions: Instructions that
    check only if the bottom 32-bits are 0 (Z), if bit 31 is 1 (N), if bit
    32 is 1 (C), or if bit 32 is different from bit 31 (V).

    If you write an architectural rule whereby every integer result is
    "proper" one set of bits {top, bottom, dispersed} covers everything.

    Proper means that all the bits in the register are written but the
    value written is range limited to {Sign}×{Size} of the calculation.

    I have no idea what you mean with "one set of bits {top, bottom,
    dispersed}".

    As for "proper": Does this mean that one would have to have add(c),
    sub(c), mul (madd etc.), shift right and shift left (did I forget
    anything?) for i8, i16, i32, i64, u8, u16, u32, and u64? Yes, if you
    specify in the operation which kind of Z, C/V, and maybe N you are
    interested in, you do not need to specify it in the branch that checks
    that result; you also eliminate the sign-extension and zero-extension operations that we discussed some time ago.

    But given that the operations are much more frequent than branches,
    encoding that information in the branches uses less space (for shift
    right, the sign is usually included in the operation). It's
    interesting that AFAIK there are instruction sets (e.g., Power) that
    just have one full-width sign-agnostic add, and do not have
    width-specific flags, either. So when compiling stuff like

    if (a[1]+a[2] == 0) /* unsigned a[] */

    a width-specific compare instruction provides that information. But
    gcc generates a compare instruction even when a[] is "unsigned long",
    so apparently add does not set the flags on addition anyway (and if
    there is an add that sets flags, it is not used by gcc for this code).

    Another case is SPARC v9, which tends to set flags. For

    if ((a[1]^a[2]) < 0)

    I see:

    long a[] int a[]
    ldx [ %i0 + 8 ], %g1 ld [ %i0 + 4 ], %g2
    ldx [ %i0 + 0x10 ], %g2 ld [ %i0 + 8 ], %g1
    xor %g1, %g2, %g1 xorcc %g2, %g1, %g0
    brlz,pn %g1, 24 <foo+0x24> bl,a,pn %icc, 20 <foo+0x20>

    Reading up on SPARC v9, it has two sets of condition codes: 32-bit
    (icc) and 64-bit (xcc), and every instruction that sets condition
    codes (e.g., xorcc) sets both. In the present case, the 32-bit
    sequence sets the ccs and then checks icc, while the 64-bit sequence
    does not set the ccs, and instead uses a branch instruction that
    inspects an integer register (%g1). These branch instructions all
    work for the full 64 bits, and do not provide a way to check a 32-bit
    result. In the present case, an alternate way to use brlz for the
    32-bit case would have been:

    ldsw [ %i0 + 8 ], %g1 #ld is a synonym for lduw
    ldsw [ %i0 + 0x10 ], %g2
    xor %g1, %g2, %g1
    brlz,pn %g1, 24 <foo+0x24>

    because the xor of two sign-extended data is also a correct
    sign-extended result, but instead gcc chose to use xorcc and bl %icc.

    There are many ways to skin this cat.

    Concerning saving the extra bits across interrupts, yes, this has to
    be adapted to the actual architecture, and there are many ways to skin
    this cat. I just outlined one to give an idea how this can be done.

    On the other hand, with CARRY, none of those bits are needed.

    But the mechanism of CARRY is quite a bit more involved: Either store
    the carry in a GPR at every step, or have another mechanism inside a
    CARRY block. And either make the CARRY block atomic or have some way
    to preserve the fact that there is this prefix across interrupts and
    (worse) synchronous traps.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 16 14:34:54 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:
    (hi, lo) = a*b+c+d

    Alas:: the best CARRY can do is:

    {hi,c} = a*b+hi

    What latency?

    simply because this is the largest possible building block that cannot
    overflow, the result range covers the full 128 bit space.

    With the carry in the result GPR, you could achieve that as follows:

    add t,c,d
    umaddc hi,lo,a,b,t

    (or split umaddc into an instruction that produces the low result and
    one that produces the high result).

    The disadvantage here is that, with d being the hi of the last
    iteration, you will see the full latency of the add and the umaddh.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 16 14:45:09 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    [...]
    My point was that 1 bit of carry is not enough when MUL and DIV need
    64 bits--and that is the issue CARRY addresses. In addition, multi-
    width shifts also require <essentially> a whole register of width.

    For multiplication, instruction sets provide widening multiplication,
    with one instruction producing two words, or with instructions that
    produce low and high words. These days I would consider doing this
    with SIMD registers, which would allow the double-width result in a
    single register; that was an option for ARM A64, but they chose to use
    GPRs and low and high instructions; strange. For AMD64, it was not
    really an option (it continued with the multiplication instructions
    that existed in IA-32); for RISC-V, it also was not an option, as its
    M extension was designed long before the V extension.

    For division, again double-width by single-width division has been
    designed into several instruction sets and is a good stepping stone
    for multi-precision by single-width division. Again, using SIMD
    registers appears to be a way to reduce the number of register
    accesses in the instruction.

    The double-width->single-width shift looks like a good stepping stone
    for multi-precision shifts. It's unclear to me why Intel included two instructions for that purpose: SHLD and SHRD.
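
    For the left-shift case the per-word step is just a funnel of two
    adjacent words; a sketch (mine, not from any particular library),
    assuming a shift count c with 0 < c < 64:

        #include <stdint.h>
        #include <stddef.h>

        /* shift an n-word number left by c bits (0 < c < 64); r may alias a */
        static void mp_shl(uint64_t *r, const uint64_t *a, size_t n, unsigned c)
        {
            for (size_t i = n - 1; i > 0; i--)
                r[i] = (a[i] << c) | (a[i - 1] >> (64 - c));  /* one SHLD per word */
            r[0] = a[0] << c;
        }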

    How does a four-input 2048-bit-addition look with your CARRY? For
    GPRs-with-flags it would look as follows:

    L:
    ld xn, (xp)
    ld yn, (yp)
    ld zn, (zp)
    ld tn, (tp)
    add rn, xn, yn
    addc rn, rn, rm
    add sn, zn, tn
    addc sn, sn, sm
    add vn, rn, tn
    addc vn, vn, vm
    sd vn, (vp)
    .. #mov rn, sn, vn to rm, sm, vm
    .. #increment xp yp zp tp vp
    .. #loop control and branch back to L:

    Actually, the way to go would be to unroll by a factor of two, with
    the n registers and m registers switching role between the
    sub-iterations. If you do not want to go there, you would not use a
    RISC-V mov, because that expands to an instruction that destroys the
    carry bit. Instead, you would use an idiom (e.g., or rm, rn, zero)
    that transfers the carry bit.

    //pretty close to::

    MOV R12,#0
    VEC R7,{}
    LDD R8,[Rx,Ri<<3]
    LDD R9,[Ry,Ri<<3]
    LDD R10,[Rz,Ri<<3]
    LDD R11,[Rt,Ri<<3]
    CARRY R12,{{IO}{IO}{IO}}
    ADD R13,R8,R9
    ADD R14,R10,R11
    ADD R14,R14,R13
    STD R14,[Rv,Ri<<3]
    LOOP R7,LT,#1,#32

    I thought up to now that the stuff covered by CARRY means

    (R12,R13) = R8+R9+R12
    (R12,R14) = R10+R11+R12
    (R12,R14) = R14+R13+R12

    Which would be wrong for the desired operation. What is needed
    instead is, maybe

    (R12,R14) = ((R8+R9)+(R10+R11))+R12

    My expectation is that with CARRY, something functionally equivalent
    might be implemented as:

    MOV R12,#0
    MOV R15,#0
    MOV R16,#0
    VEC R7, {}
    ... LDDs
    CARRY R12,{{IO}}
    ADD R13,R8,R9
    CARRY R15,{{IO}}
    ADD R14,R10,R11
    CARRY R16,{{IO}}
    ADD R14,R14,R13
    STD ...
    LOOP ...
    ADD R15,R15,R16
    ADD R12,R12,R15

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Nov 16 18:36:02 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    ERROR "unexpected byte sequence starting at index 853: '\xC3'" while decoding:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    A common set of flags is NZCV. Of these N and Z can be generated from
    the 64 ordinary bits (actually N is the MSB of these bits).

    You might also want NCZV of 32-bit instructions, but in that case all
    flags are derivable from the 64 ordinary bits of the GPR; but in that
    case you may need additional branch instructions: Instructions that
    check only if the bottom 32-bits are 0 (Z), if bit 31 is 1 (N), if bit
    32 is 1 (C), or if bit 32 is different from bit 31 (V).

    If you write an architectural rule whereby every integer result is
    "proper" one set of bits {top, bottom, dispersed} covers everything.

    Proper means that all the bits in the register are written but the
    value written is range limited to {Sign}×{Size} of the calculation.

    I have no idea what you mean with "one set of bits {top, bottom,
    dispersed}".

    typedef struct { uint64_t reg;
                     uint8_t  bits: 4; } gpr;
    or
    typedef struct { uint8_t  bits: 4;
                     uint64_t reg; } gpr;
    or
    typedef struct { uint16_t reg0;
                     uint8_t  bit0: 1;
                     uint16_t reg1;
                     uint8_t  bit1: 1;
                     uint16_t reg2;
                     uint8_t  bit2: 1;
                     uint16_t reg3;
                     uint8_t  bit3: 1; } gpr;

    Did you lose every brain-cell of imagination ?!?

    As for "proper": Does this mean that one would have to have add(c),
    sub(c), mul (madd etc.), shift right and shift left (did I forget
    anything?) for i8, i16, i32, i64, u8, u16, u32, and u64? Yes, if
    specify in the operation which kind of Z, C/V, and maybe N you are
    interested in, you do not need to specify it in the branch that checks
    that result; you also eliminate the sign-extension and zero-extension operations that we discussed some time ago.

    {s8, s16, s32, s64, u8, u16, u32, u64} yes.

    But given that the operations are much more frequent than branches,
    encoding that information in the branches uses less space (for shift
    right, the sign is usually included in the operation). It's

    Which is why I don't have ANY of those extra bits.

    interesting that AFAIK there are instruction sets (e.g., Power) that
    just have one full-width sign-agnostic add, and do not have
    width-specific flags, either. So when compiling stuff like

    if (a[1]+a[2] == 0) /* unsigned a[] */

    a width-specific compare instruction provides that information. But
    gcc generates a compare instruction even when a[] is "unsigned long",
    so apparently add does not set the flags on addition anyway (and if
    there is an add that sets flags, it is not used by gcc for this code).

    Another case is SPARC v9, which tends to set flags. For

    if ((a[1]^a[2]) < 0)

    I see:

    long a[] int a[]
    ldx [ %i0 + 8 ], %g1 ld [ %i0 + 4 ], %g2
    ldx [ %i0 + 0x10 ], %g2 ld [ %i0 + 8 ], %g1
    xor %g1, %g2, %g1 xorcc %g2, %g1, %g0
    brlz,pn %g1, 24 <foo+0x24> bl,a,pn %icc, 20 <foo+0x20>

    Reading up on SPARC v9, it has two sets of condition codes: 32-bit
    (icc) and 64-bit (xcc), and every instruction that sets condition
    codes (e.g., xorcc) sets both.

    Another reason its death is helpful to comp.arch

    In the present case, the 32-bit
    sequence sets the ccs and then checks icc, while the 64-bit sequence
    does not set the ccs, and instead uses a branch instruction that
    inspects an integer register (%g1). These branch instructions all
    work for the full 64 bits, and do not provide a way to check a 32-bit
    result. In the present case, an alternate way to use brlz for the
    32-bit case would have been:

    ldsw [ %i0 + 8 ], %g1 #ld is a synonym for lduw
    ldsw [ %i0 + 0x10 ], %g2
    xor %g1, %g2, %g1
    brlz,pn %g1, 24 <foo+0x24>

    because the xor of two sign-extended data is also a correct
    sign-extended result, but instread gcc chose to use xorcc and bl %icc.

    There are many ways to skin this cat.

    Sure:: close to 20-ways, less than 4 of them are "proper".

    Concerning saving the extra bits across interrupts, yes, this has to
    be adapted to the actual architecture, and there are many ways to skin
    this cat. I just outlined one to give an idea how this can be done.

    On the other hand, with CARRY, none of those bits are needed.

    But the mechanism of CARRY is quite a bit more involved: Either store
    the carry in a GPR at every step, or have another mechanism inside a
    CARRY block. And either make the CARRY block atomic or have some way
    to preserve the fact that there is this prefix across interrupts and
    (worse) synchronous traps.

    During its "life" the bits used in CARRY are simply another feedback
    path on the data-path. Afterwards, carry is written once. CARRY also
    gets written when an exception is taken.


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Nov 16 18:41:03 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Terje Mathisen <terje.mathisen@tmsw.no> posted:
    (hi, lo) = a*b+c+d

    Alas:: the best CARRY can do is:

    {hi,c} = a*b+hi

    What latency?

    1 multiply latency {likely 4 cycles} but more importantly no more cycles
    than
    c = a*b;

    simply because this is the largest possible building block that cannot
    overflow, the result range covers the full 128 bit space.

    With the carry in the result GPR, you could achieve that as follows:

    add t,c,d
    umaddc hi,lo,a,b,t

    You can do this at the added latency of ADD.

    (or split umaddc into an instruction that produces the low result and
    one that produces the high result).

    CARRY is an instruction-modifier; it is not "executed" {or you can
    consider it "executed" in the DECODE stage of the pipeline}. The
    subsequent MUL takes no more time with CARRY than without.

    The disadvantage here is that, with d being the hi of the last
    iteration, you will see the full latency of the add and the umaddh.

    Does R stand for Reduced or Ridiculous ?!?


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Mon Nov 17 02:49:15 2025
    From Newsgroup: comp.arch

    On 2025-11-16 1:36 p.m., MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    ERROR "unexpected byte sequence starting at index 853: '\xC3'" while decoding:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    A common set of flags is NZCV. Of these N and Z can be generated from >>>> the 64 ordinary bits (actually N is the MSB of these bits).

    You might also want NCZV of 32-bit instructions, but in that case all
    flags are derivable from the 64 ordinary bits of the GPR; but in that
    case you may need additional branch instructions: Instructions that
    check only if the bottom 32-bits are 0 (Z), if bit 31 is 1 (N), if bit >>>> 32 is 1 (C), or if bit 32 is different from bit 31 (V).

    If you write an architectural rule whereby every integer result is
    "proper" one set of bits {top, bottom, dispersed} covers everything.

    Proper means that all the bits in the register are written but the
    value written is range limited to {Sign}×{Size} of the calculation.

    I have no idea what you mean with "one set of bits {top, bottom,
    dispersed}".

    typedef struct { uint64_t reg;
    uint8_t bits: 4; } gpr;
    or
    typedef struct { uint8_t bits: 4;
    uint64_t reg;} gpr;
    or
    typedef struct { uint16_t reg0;
    uint8_t bit0: 1;
    uint16_t reg1;
    uint8_t bit1: 1;
    uint16_t reg2;
    uint8_t bit2: 1;
    uint16_t reg3;
    uint8_t bit3: 1; } gpr;

    Did you loose every brain-cell of imagination ?!?

    As for "proper": Does this mean that one would have to have add(c),
    sub(c), mul (madd etc.), shift right and shift left (did I forget
    anything?) for i8, i16, i32, i64, u8, u16, u32, and u64? Yes, if
    specify in the operation which kind of Z, C/V, and maybe N you are
    interested in, you do not need to specify it in the branch that checks
    that result; you also eliminate the sign-extension and zero-extension
    operations that we discussed some time ago.

    {s8, s16, s32, s64, u8, u16, u32, u64} yes.

    But given that the operations are much more frequent than branches,
    encoding that information in the branches uses less space (for shift
    right, the sign is usually included in the operation). It's

    Which is why I don't have ANY of those extra bits.

    interesting that AFAIK there are instruction sets (e.g., Power) that
    just have one full-width sign-agnostic add, and do not have
    width-specific flags, either. So when compiling stuff like

    if (a[1]+a[2] == 0) /* unsigned a[] */

    a width-specific compare instruction provides that information. But
    gcc generates a compare instruction even when a[] is "unsigned long",
    so apparently add does not set the flags on addition anyway (and if
    there is an add that sets flags, it is not used by gcc for this code).

    Another case is SPARC v9, which tends to set flags. For

    if ((a[1]^a[2]) < 0)

    I see:

    long a[] int a[]
    ldx [ %i0 + 8 ], %g1 ld [ %i0 + 4 ], %g2
    ldx [ %i0 + 0x10 ], %g2 ld [ %i0 + 8 ], %g1
    xor %g1, %g2, %g1 xorcc %g2, %g1, %g0
    brlz,pn %g1, 24 <foo+0x24> bl,a,pn %icc, 20 <foo+0x20>

    Reading up on SPARC v9, it has two sets of condition codes: 32-bit
    (icc) and 64-bit (xcc), and every instruction that sets condition
    codes (e.g., xorcc) sets both.

    Another reason its death is helpful to comp.arch

    In the present case, the 32-bit
    sequence sets the ccs and then checks icc, while the 64-bit sequence
    does not set the ccs, and instead uses a branch instruction that
    inspects an integer register (%g1). These branch instructions all
    work for the full 64 bits, and do not provide a way to check a 32-bit
    result. In the present case, an alternate way to use brlz for the
    32-bit case would have been:

    ldsw [ %i0 + 8 ], %g1 #ld is a synonym for lduw
    ldsw [ %i0 + 0x10 ], %g2
    xor %g1, %g2, %g1
    brlz,pn %g1, 24 <foo+0x24>

    because the xor of two sign-extended values is also a correctly
    sign-extended result, but instead gcc chose to use xorcc and bl %icc.

    There are many ways to skin this cat.

    Sure:: close to 20-ways, less than 4 of them are "proper".

    Concerning saving the extra bits across interrupts, yes, this has to
    be adapted to the actual architecture, and there are many ways to skin
    this cat. I just outlined one to give an idea how this can be done.

    On the other hand, with CARRY, none of those bits are needed.

    But the mechanism of CARRY is quite a bit more involved: Either store
    the carry in a GPR at every step, or have another mechanism inside a
    CARRY block. And either make the CARRY block atomic or have some way
    to preserve the fact that there is this prefix across interrupts and
    (worse) synchronous traps.

    During its "life" the bits used in CARRY are simply another feedback
    path on the data-path. Afterwards, carry is written once. CARRY also
    gets written when an exception is taken.


    - anton

    These posts have inspired me to keep working on the ISA. I am on a simplification mission.

    The CARRY modifier is just a substitute for not having r3w2 port
    instructions directly in the ISA. Since Qupls ISA has room to support
    some r3w2 instructions directly there is no need for CARRY, much as I
    like the idea.

    While not using a carry flag in the register, there are still a
    capabilities bit, an overflow bit, and a pointer bit, plus four
    user-assigned bits. I decided to just have 72-bit register load and
    store instructions along with the usual 8, 16, 32, and 64-bit ones.

    Finding it too difficult to support 128-bit operations using high/low
    register pairs. Getting the reservation stations to pair up the
    registers seems a bit scary. It would be much simpler to just have
    128-bit registers, and it appears it may not be any more logic. The
    benefit of using register pairs is that the internal buses then need
    only be 64 bits wide.

    Sparc v9 died?



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Nov 17 08:33:58 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:
    >Finding it too difficult to support 128-bit operations using high, low
    >register pairs. Getting the reservation stations to pair up the
    >registers seems a bit scary. It would be much simpler to just have
    >128-bit registers and it appears as if it may not be any more logic.

    If you want to support 128-bit operations, using 128-bit registers
    certainly is the way to go. Note how AMD used to split 128-bit SSE
    operations into 64-bit parts on 64-bit registers in the K8, split
    256-bit AVX operations into 128-bit parts on 128-bit registers in Zen,
    but they went away from that: In Zen4 512-bit operations are performed
    in 256-bit-pieces, but the registers are 512 bits wide.

    However, the point of carry bits or Mitch Alsup's CARRY is not 128-bit operations, but multi-precision, which can be 256-bit for some crypto,
    4096 bits for other crypto, or billions of bits for the stuff that
    Alexander Yee is doing.
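
    For concreteness, a minimal C sketch of the multi-precision case (the
    flagless formulation and the names here are illustrative only): with a
    carry flag or a CARRY-style mechanism the loop body is one
    add-with-carry per 64-bit limb; without one, the carry has to be
    recovered by comparing the sum against an operand.

    #include <stdint.h>
    #include <stddef.h>

    /* Multi-word addition, least-significant limb first.  Returns the
       final carry out.  On a flagless ISA each limb costs two adds and
       two compares; with a carry flag (or CARRY) it is one ADC per limb. */
    static uint64_t mp_add(uint64_t *r, const uint64_t *a,
                           const uint64_t *b, size_t n)
    {
        uint64_t carry = 0;
        for (size_t i = 0; i < n; i++) {
            uint64_t s = a[i] + carry;
            uint64_t c1 = (s < a[i]);   /* carry out of adding the old carry */
            s += b[i];
            uint64_t c2 = (s < b[i]);   /* carry out of adding b[i] */
            r[i] = s;
            carry = c1 | c2;            /* at most one of the two can be set */
        }
        return carry;
    }

    On x86-64 a compiler can lower the body to ADD/ADC; the source-level
    contortions are what a carry-less 64-bit register model forces on you,
    which is exactly the case carry bits (or CARRY) are meant to address.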

    Sparc v9 died?

    Oracle discontinued SPARC development in 2017, and Fujitsu announced
    in 2016 that they would switch to ARM A64. Both Oracle and Fujitsu
    released their last new SPARC CPU in 2017. Fujitsu has
    released the ARM A64-based A64FX in 2019. The Leon4 (2017 according
    to <https://en.wikipedia.org/wiki/SPARC#Implementations>) and Leon5
    (2019) implement SPARC v8, not v9.

    The MCST-R2000 (2018) implements SPARC v9, but will it have a
    successor? And even if it has a successor, will it be available in
    relevant numbers? MCST is not married to SPARC, despite their name;
    they have worked on Elbrus 2000 implementations as well; Elbrus 2000
    supports Elbrus VLIW and "Intel x86" instruction sets, and new models
    were released in 2018, 2021, and 2025, so MCST now seems to focus on
    that.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Mon Nov 17 08:17:20 2025
    From Newsgroup: comp.arch

    On 2025-11-17 3:33 a.m., Anton Ertl wrote:
    Robert Finch <robfi680@gmail.com> writes:
    Finding it too difficult to support 128-bit operations using high, low
    register pairs. Getting the reservation stations to pair up the
    registers seems a bit scary. It would be much simpler to just have
    128-bit registers and it appears as if it may not be any more logic.

    If you want to support 128-bit operations, using 128-bit registers
    certainly is the way to go. Note how AMD used to split 128-bit SSE operations into 64-bit parts on 64-bit registers in the K8, split
    256-bit AVX operations into 128-bit parts on 128-bit registers in Zen,
    but they went away from that: In Zen4 512-bit operations are performed
    in 256-bit-pieces, but the registers are 512 bits wide.

    However, the point of carry bits or Mitch Alsup's CARRY is not 128-bit operations, but multi-precision, which can be 256-bit for some crypto,
    4096 bits for other crypto, or billions of bits for the stuff that
    Alexander Yee is doing.

    Sparc v9 died?

    Oracle has discontinued SPARC development in 2017, Fujitsu has
    announced in 2016 that they switch to ARM A64. Both Oracle and
    Fujitsu released their last new SPARC CPU in 2017. Fujitsu has
    released the ARM A64-based A64FX in 2019. The Leon4 (2017 according
    to <https://en.wikipedia.org/wiki/SPARC#Implementations>) and Leon5
    (2019) implement SPARC v8, not v9.

    The MCST-R2000 (2018) implements SPARC v9, but will it have a
    successor? And even if it has a successor, will it be available in
    relevant numbers? MCST is not married to SPARC, despite their name;
    they have worked on Elbrus 2000 implementations as well; Elbrus 2000
    supports Elbrus VLIW and "Intel x86" instruction sets, and new models
    were released in 2018, 2021, and 2025, so MCST now seems to focus on
    that.

    - anton

    Skimming through the SPARC architecture manual I am wondering how they
    handle register renaming with a windowed register file. If the register
    window file is deep there must be a ginormous number of registers for renaming. Would it need to keep track of the renames for all the
    registers? How does it dump the rename state to memory?

    Tried to find some information on Elbrus, but got "page not found" a
    couple of times. Other than that it is a VLIW machine, I do not know
    much about it.

    *****

    I would like a machine able to process 128-bit values directly, but it
    takes up too many resources. It is easier to make the register file deep
    as opposed to wide. BRAM has a max 64-bit width. After that it takes
    more BRAMs to get a wider port. I tried a 128-bit wide register file,
    but it used about 200 BRAMs. Too many.

    There are now 128 logical registers available in Qupls. It turns out
    that the BRAM setup is 512 registers deep no matter whether there are
    32, 64, or 128 registers. So, we may as well make them available.

    Qupls reservation stations were set up with support for eight operands
    (four for each 64-bit half of a 128-bit register). The resulting logic
    was about 25,000 LUTs for just one RS, compared to about 5,000 LUTs
    when there were just four operands. What actually gets implemented is
    considerably less, as most functional units do not need all the
    operands.

    It may be more resource-efficient to use multiple reservation stations
    as opposed to more operands in a single station. But then the operands
    need to be linked together between stations. That might be possible
    using a hash of the PC value and ROB entry number.

    Qupls seems to have an implementation four or five times the size of the
    FPGA again. Back to the drawing board.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Nov 17 17:36:47 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:
    >Skimming through the SPARC architecture manual I am wondering how they
    >handle register renaming with a windowed register file. If the register
    >window file is deep there must be a ginormous number of registers for
    >renaming. Would it need to keep track of the renames for all the
    >registers? How does it dump the rename state to memory?

    There is no need to dump the rename state to memory, not for SPARC nor
    for anything else. It's only microarchitectural.

    The large number of architected registers may have been a reason why
    it took them so long to implement OoO execution.

    I think that the cost is typically a register allocation table (RAT)
    per branch (for maybe 50 branches or potential traps that you want to
    predict, i.e., 50 RATs). With 32 architected registers and 257-512
    physical registers that's 32*9 bits = 288 bits per RAT; with the 136
    architected registers of SPARC, and again <=512 physical registers,
    that would be 1224 bits per RAT.
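
    As a rough illustration, a sketch of what such a RAT and its
    checkpoints amount to (the sizes and the array-of-snapshots layout are
    assumptions for illustration, not a description of any particular
    core):

    #include <stdint.h>

    #define ARCH_REGS   32   /* 136 for the windowed SPARC case above */
    #define CHECKPOINTS 50   /* roughly one per predicted branch in flight */

    /* RAT: architected register -> physical register.  With <=512 physical
       registers each entry needs 9 bits, hence the 32*9 = 288 bits quoted
       above; uint16_t is used here only for simplicity. */
    typedef struct { uint16_t phys[ARCH_REGS]; } rat_t;

    typedef struct {
        rat_t current;              /* speculative map used by rename    */
        rat_t ckpt[CHECKPOINTS];    /* one snapshot per predicted branch */
    } renamer_t;

    /* Snapshot at a branch; restore on misprediction.  Both are purely
       microarchitectural, so nothing here ever reaches memory. */
    static void rat_checkpoint(renamer_t *rn, int slot)
    { rn->ckpt[slot] = rn->current; }

    static void rat_recover(renamer_t *rn, int slot)
    { rn->current = rn->ckpt[slot]; }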

    There are probably other options than using a RAT, but I have
    forgotten them.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Nov 17 18:41:19 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-16 1:36 p.m., MitchAlsup wrote:
    -------------------------------

    During its "life" the bits used in CARRY are simply another feedback
    path on the data-path. Afterwards, carry is written once. CARRY also
    gets written when an exception is taken.


    - anton

    These posts have inspired me to keep working on the ISA. I am on a simplification mission.

    The CARRY modifier is just a substitute for not having r3w2 port instructions directly in the ISA. Since Qupls ISA has room to support
    some r3w2 instructions directly there is no need for CARRY, much as I
    like the idea.

    That is correct at the 95% level.

    While not using a carry flag in the register, there is still a
    capabilities bit, overflow bit and pointer bit plus four user assigned
    bits. I decided to just have 72-bit register store and load instructions along with the usual 8,16,32 and 64.

    Finding it too difficult to support 128-bit operations using high, low register pairs. Getting the reservation stations to pair up the
    registers seems a bit scary.

    It IS scary and hard and tricky to get right.

    It would be much simpler to just have
    128-bit registers and it appears as if it may not be any more logic. The benefit of using register pairs is the internal busses need only be
    64-bits then.

    Almost exactly what we did in the Mc 88120 when facing the same
    problem. Except we kept the 32-bit model and had register files 2
    registers tall {even, odd}, {odd, even}, so any register specifier
    would simply read out the status and values of both registers and then
    let the stations handle the sundry problems.

    Sparc v9 died?

    What was the last year SPARC sold more than 100,000 CPUs ??
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Nov 17 18:45:39 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-17 3:33 a.m., Anton Ertl wrote:
    Robert Finch <robfi680@gmail.com> writes:
    Finding it too difficult to support 128-bit operations using high, low
    register pairs. Getting the reservation stations to pair up the
    registers seems a bit scary. It would be much simpler to just have
    128-bit registers and it appears as if it may not be any more logic.

    If you want to support 128-bit operations, using 128-bit registers certainly is the way to go. Note how AMD used to split 128-bit SSE operations into 64-bit parts on 64-bit registers in the K8, split
    256-bit AVX operations into 128-bit parts on 128-bit registers in Zen,
    but they went away from that: In Zen4 512-bit operations are performed
    in 256-bit-pieces, but the registers are 512 bits wide.

    However, the point of carry bits or Mitch Alsup's CARRY is not 128-bit operations, but multi-precision, which can be 256-bit for some crypto,
    4096 bits for other crypto, or billions of bits for the stuff that Alexander Yee is doing.

    Sparc v9 died?

    Oracle has discontinued SPARC development in 2017, Fujitsu has
    announced in 2016 that they switch to ARM A64. Both Oracle and
    Fujitsu released their last new SPARC CPU in 2017. Fujitsu has
    released the ARM A64-based A64FX in 2019. The Leon4 (2017 according
    to <https://en.wikipedia.org/wiki/SPARC#Implementations>) and Leon5
    (2019) implement SPARC v8, not v9.

    The MCST-R2000 (2018) implements SPARC v9, but will it have a
    successor? And even if it has a successor, will it be available in relevant numbers? MCST is not married to SPARC, despite their name;
    they have worked on Elbrus 2000 implementations as well; Elbrus 2000 supports Elbrus VLIW and "Intel x86" instruction sets, and new models
    were released in 2018, 2021, and 2025, so MCST now seems to focus on
    that.

    - anton

    Skimming through the SPARC architecture manual I am wondering how they handle register renaming with a windowed register file. If the register window file is deep there must be a ginormous number of registers for renaming. Would it need to keep track of the renames for all the
    registers? How does it dump the rename state to memory?

    Tried to find some information on Elbrus. I got page not found a couple
    of times. Other than it’s a VLIW machine I do not know much about it.

    *****

    I would like a machine able to process 128-bit values directly, but it
    takes up too many resources. It is easier to make the register file deep
    as opposed to wide. BRAM has a max 64-bit width. After that it takes
    more BRAMs to get a wider port. I tried a 128-bit wide register file,
    but it used about 200 BRAMs. Too many.

    There are now 128 logical registers available in Qupls. It turns out
    that the BRAM setup is 512 registers deep no matter whether there are
    32,64 or 128 registers. So, may as well make them available.

    Can you read BRAM 2× or 4× per CPU cycle ?!?

    Qupls reservation stations were set up with support for eight operands
    (four each for each ½ 128-bit register). The resulting logic was about 25,000 LUTs for just one RS. This is compared to about 5,000 LUTs when
    there were just four operands. What gets implemented is considerably
    less as most functional units do not need all the operands.

    Ok, you found one way NOT to DO IT.

    It may be resource efficient to use multiple reservation stations as
    opposed to more operands in a single station. But then the operands need
    to be linked together between stations. It may be possible using a hash
    of the PC value and ROB entry number.

    Allow me to dissuade you from this.

    Qupls seems to have an implementation four or five times the size of the FPGA again. Back to the drawing board.

    Live within your means.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Nov 17 18:54:17 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Robert Finch <robfi680@gmail.com> writes:
    >Skimming through the SPARC architecture manual I am wondering how they
    >handle register renaming with a windowed register file. If the register
    >window file is deep there must be a ginormous number of registers for
    >renaming. Would it need to keep track of the renames for all the
    >registers? How does it dump the rename state to memory?

    I don't remember SPARC ever getting OoO. The windowed register file
    is but one cause.

    There is no need to dump the rename state to memory, not for SPARC nor
    for anything else. It's only microarchitectural.

    It does need to be checkpointed if/when going OoO.

    The large number of architected registers may have been a reason why
    they needed so long to implement OoO execution.

    I think that the cost is typically a register allocation table RAT per
    branch (for maybe 50 branches or potential traps that you want to
    predict, i.e., 50 RATs).

    50 RAT entries not 50 RATs.

    With 32 architected registers and 257-512
    physical registers that's 32*9 bits = 288 bits per RAT; with the 136 architected registers of SPARC, and again <=512 physical registers,
    that would be 1224 bits per RAT.

    Register files with more than 128 entries become big and especially SLOW.
    Even 128 register entries is pushing your luck.

    There are probably other options that using a RAT, but I have
    forgotten them.

    Physical register file where reads are done by {cam, valid} and writes
    are done by the decoder. The valid bits are recorded in a history table
    for mispredict recovery between decode cycles.

    There is also the Value-free reservation station model, where the RF
    is not read until the station fires its entry.
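
    A bare-bones sketch of the value-free station idea (field names and
    sizes are illustrative assumptions only, not any shipped design): the
    entry holds only tags and ready bits, and the physical register file is
    read only when the entry issues.

    #include <stdint.h>
    #include <stdbool.h>

    /* Value-free reservation station entry: no operand values are captured
       at dispatch, only physical register tags and ready bits.  The (wide)
       values are read from the PRF in the cycle the entry fires, so the
       station stays narrow even with 128-bit registers. */
    typedef struct {
        bool     busy;
        uint16_t dest_preg;
        uint16_t src_preg[2];
        bool     src_ready[2];   /* set by the wakeup/broadcast network */
    } rs_entry_t;

    /* Wakeup: a completing instruction broadcasts its destination tag. */
    static void rs_wakeup(rs_entry_t *rs, int n, uint16_t done_preg)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; rs[i].busy && j < 2; j++)
                if (rs[i].src_preg[j] == done_preg)
                    rs[i].src_ready[j] = true;
    }

    /* Select: an entry may issue once both sources are ready; only then
       does it actually read the register file. */
    static int rs_select(const rs_entry_t *rs, int n)
    {
        for (int i = 0; i < n; i++)
            if (rs[i].busy && rs[i].src_ready[0] && rs[i].src_ready[1])
                return i;
        return -1;
    }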

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Mon Nov 17 20:58:14 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    I don't remember SPARC ever getting OoO.

    https://dl.acm.org/doi/10.5555/874064.875643 (paywalled, but the
    first few lines are legible) talks about such an implementation.

    The windowed register file
    is but one cause.

    Certainly didn't make it easier...
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Nov 17 23:35:37 2025
    From Newsgroup: comp.arch

    On Mon, 17 Nov 2025 18:54:17 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Robert Finch <robfi680@gmail.com> writes:
    Skimming through the SPARC architecture manual I am wondering how
    they handle register renaming with a windowed register file. If
    the register window file is deep there must be a ginormous number
    of registers for renaming. Would it need to keep track of the
    renames for all the registers? How does it dump the rename state
    to memory?

    I don't remember SPARC ever getting OoO. The windowed register file
    is but one cause.


    The first production OoO SPARC was the HAL SPARC64, manufactured for
    Fujitsu on Fujitsu's own fabs back in 1995, so a contemporary of the
    PPro. It was a 4-die chipset.
    The HAL SPARC64-GP was the first single-chip implementation, in 1997.
    https://en.wikipedia.org/wiki/HAL_SPARC64
    The line was continued by Fujitsu:
    https://en.wikipedia.org/wiki/SPARC64_V
    Since then and up to 2017 there were many generations made by Fujitsu.

    There were also a few OoO SPARCs designed by Oracle, independently of
    Fujitsu. I think that they all shared the same core uArch, originally
    introduced in the SPARC T4 (2011).
    https://en.wikipedia.org/wiki/SPARC_T4

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Mon Nov 17 16:58:31 2025
    From Newsgroup: comp.arch

    On 2025-11-17 1:45 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-17 3:33 a.m., Anton Ertl wrote:
    Robert Finch <robfi680@gmail.com> writes:
    Finding it too difficult to support 128-bit operations using high, low >>>> register pairs. Getting the reservation stations to pair up the
    registers seems a bit scary. It would be much simpler to just have
    128-bit registers and it appears as if it may not be any more logic.

    If you want to support 128-bit operations, using 128-bit registers
    certainly is the way to go. Note how AMD used to split 128-bit SSE
    operations into 64-bit parts on 64-bit registers in the K8, split
    256-bit AVX operations into 128-bit parts on 128-bit registers in Zen,
    but they went away from that: In Zen4 512-bit operations are performed
    in 256-bit-pieces, but the registers are 512 bits wide.

    However, the point of carry bits or Mitch Alsup's CARRY is not 128-bit
    operations, but multi-precision, which can be 256-bit for some crypto,
    4096 bits for other crypto, or billions of bits for the stuff that
    Alexander Yee is doing.

    Sparc v9 died?

    Oracle has discontinued SPARC development in 2017, Fujitsu has
    announced in 2016 that they switch to ARM A64. Both Oracle and
    Fujitsu released their last new SPARC CPU in 2017. Fujitsu has
    released the ARM A64-based A64FX in 2019. The Leon4 (2017 according
    to <https://en.wikipedia.org/wiki/SPARC#Implementations>) and Leon5
    (2019) implement SPARC v8, not v9.

    The MCST-R2000 (2018) implements SPARC v9, but will it have a
    successor? And even if it has a successor, will it be available in
    relevant numbers? MCST is not married to SPARC, despite their name;
    they have worked on Elbrus 2000 implementations as well; Elbrus 2000
    supports Elbrus VLIW and "Intel x86" instruction sets, and new models
    were released in 2018, 2021, and 2025, so MCST now seems to focus on
    that.

    - anton

    Skimming through the SPARC architecture manual I am wondering how they
    handle register renaming with a windowed register file. If the register
    window file is deep there must be a ginormous number of registers for
    renaming. Would it need to keep track of the renames for all the
    registers? How does it dump the rename state to memory?

    Tried to find some information on Elbrus. I got page not found a couple
    of times. Other than it’s a VLIW machine I do not know much about it.

    *****

    I would like a machine able to process 128-bit values directly, but it
    takes up too many resources. It is easier to make the register file deep
    as opposed to wide. BRAM has a max 64-bit width. After that it takes
    more BRAMs to get a wider port. I tried a 128-bit wide register file,
    but it used about 200 BRAMs. Too many.

    There are now 128 logical registers available in Qupls. It turns out
    that the BRAM setup is 512 registers deep no matter whether there are
    32,64 or 128 registers. So, may as well make them available.

    Can you read BRAM 2× or 4× per CPU cycle ?!?

    The BRAM and logic are not fast enough. There is also some logic to
    select BRAM outputs via a live value table.


    Qupls reservation stations were set up with support for eight operands
    (four each for each ½ 128-bit register). The resulting logic was about
    25,000 LUTs for just one RS. This is compared to about 5,000 LUTs when
    there were just four operands. What gets implemented is considerably
    less as most functional units do not need all the operands.

    Ok, you found one way NOT to DO IT.

    It may be resource efficient to use multiple reservation stations as
    opposed to more operands in a single station. But then the operands need
    to be linked together between stations. It may be possible using a hash
    of the PC value and ROB entry number.

    Allow me to dissuade you from this.

    Whew! After several tries I think I found a much better way of doing
    things. The 128-bit op instructions are simply translated into two (or
    more) 64-bit op micro-ops at the micro-op translation stage. There is no messing around with reservation stations or operands then. But the
    performance is potentially cut in half. For a much smaller
    implementation it is worth it. Micro-op translation is only a few
    hundred LUTs.
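
    Roughly, the translation step amounts to something like this sketch
    (the micro-op format, register numbering, and the TMP_CARRY linking
    temporary are made up for illustration; the real linkage could equally
    be an implicit flag or an r3w2 micro-op):

    #include <stdint.h>

    enum uop_kind { UOP_ADD_LO_CO, UOP_ADD_HI_CI };   /* carry-out / carry-in */

    typedef struct {
        enum uop_kind kind;
        uint8_t dst, src1, src2;
        uint8_t link;             /* temp register tying the halves together */
    } uop_t;

    #define TMP_CARRY 127         /* hypothetical reserved linking register */

    /* Crack "ADD128 rd, ra, rb" (each name denoting the low half of a
       register pair rd/rd+1 etc.) into two 64-bit micro-ops. */
    static int crack_add128(uop_t out[2], uint8_t rd, uint8_t ra, uint8_t rb)
    {
        out[0] = (uop_t){ UOP_ADD_LO_CO, rd, ra, rb, TMP_CARRY };
        out[1] = (uop_t){ UOP_ADD_HI_CI, (uint8_t)(rd + 1),
                          (uint8_t)(ra + 1), (uint8_t)(rb + 1), TMP_CARRY };
        return 2;   /* number of micro-ops emitted */
    }

    The reservation stations then only ever see 64-bit operands; the cost
    is the extra issue slot, hence the halved peak rate mentioned above.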

    Qupls seems to have an implementation four or five times the size of the
    FPGA again. Back to the drawing board.

    Live within your means.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Nov 18 08:58:17 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    There is no need to dump the rename state to memory, not for SPARC nor
    for anything else. It's only microarchitectural.

    It does need to be checkpointed if/when going OoO.

    You do register renaming in order to go OoO, so OoO is a given.
    Unless Robert Finch meant the register windowing, but I don't think
    so.

    And yes, the rename state needs to be checkpointed in order to restore
    it when recovering from a branch misprediction or the like. But these checkpoints are also microarchitectural and must not reach
    architectural memory.

    With 32 architected registers and 257-512
    physical registers that's 32*9 bits = 288 bits per RAT; with the 136
    architected registers of SPARC, and again <=512 physical registers,
    that would be 1224 bits per RAT.

    Register files with more than 128 entries become big and especially SLOW.

    The 280 physical integer and 332 physical FP registers of Raptor Cove
    have not prevented it from reaching 6.2GHz. Zen5 also reaches pretty
    high clocks with its 384 physical FP registers.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Nov 18 15:16:23 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    The first production OoO SPARC was HAL SPARC64 manufactured for
    Fujitsu on Fujitsu's own fabs back in 1995, so contemporary of PPro. It
    was 4-die chipset.
    HAL SPARC64-GP was first single-chip implementation in 1997. >https://en.wikipedia.org/wiki/HAL_SPARC64
    The line was continued by Fujitsu:
    https://en.wikipedia.org/wiki/SPARC64_V

    It did not register with me at the time, probably because the
    HAL/Fujitsu SPARCs did not get as much press as the CPUs of the
    American companies, and because the SPARC64 I/II/GP did not come out
    with impressive clock rates for their time. It got better with the
    SPARC64 V, which reached higher clock rates (and SPEC results; CINT2000
    shown):

    SPARC64 V 1350MHz 905 peak, 776 base Jul 2003
    SPARC64 V 1890MHz 1345 peak, 1174 base Jun 2004

    For comparison:

    UltraSPARC III Cu 1200MHz 722 peak, 642 base Apr 2003

    Opteron 1800MHz 1170 peak, 1095 base May 2003
    Athlon 64 2000MHz 1335 peak, 1266 base Sep 2003
    Opteron 2400MHz 1655 peak, 1566 base May 2004

    Pentium 4 3067MHz 1210 peak, 1167 base May 2003
    Pentium 4 EE 3400MHz 1704 peak, 1666 base Feb 2004
    Xeon 3600MHz 1538 peak, 1463 base Aug 2004

    Itanium2 1500MHz 1322 peak, 1322 base Jul 2003

    21364 1300MHz 994 peak, 904 base Aug 2004

    Power 4+ 1700MHz 1113 peak, 1077 base May 2003
    Power 5 1900MHz 1451 peak, 1383 base Nov 2004
    PowerPC 970 2200MHz 1040 peak, 986 base Nov 2004

    The SPARC64 V has the best CINT2000 results of any SPARC published
    before 2005, by a wide margin. It was competitive with its
    contemporaries, but did not surpass them. It needed somewhat higher
    clock rates to match the in-order Itanium II 1500MHz (and even then
    only in peak); maybe the higher clock rate was a result of the OoO
    design, but certainly no higher IPC is visible compared to the Itanium
    II. Compared to the other in-order design in this collection
    (UltraSPARC III Cu), both the clock rate and the IPC are better,
    however.

    BTW, I bought a 2000MHz Athlon 64 3200+ like the one listed above in
    IIRC October or November 2003 (I posted benchmark results here in
    <2003Nov23.094309@a0.complang.tuwien.ac.at>).

    <https://en.wikipedia.org/wiki/HAL_SPARC64#SPARC64_II> says:

    |The number of physical registers was increased to 128 from 116 and the
    |number of register files to five from four.

    I assume that the latter is supposed to mean five instead of four
    register windows. That would mean that 80 of the 116 registers (in
    SPARC64 I) or 96 of the 128 registers (in SPARC64 II) would be
    architectural, if register windows and register renaming happened
    independently; not a lot of renaming capacity, but that's probably in
    line with the vintage (the 1999 Coppermine has 40 ROB entries (with
    valued uops, so each ROB entry has one result register), so around the
    same renaming capacity as the SPARC64 I/II).

    How can register renaming be implemented on SPARC? As discussed
    above, this can be done independently: Have 96 architectural registers
    (plus the window pointer), and make 8 of them global registers, and
    the rest 24 visible registers plus 4 windows of 16 registers, with the
    usual switching. And then rename these 96 architectural registers.

    A variant in the opposite direction would be to treat only the 32
    visible registers as architectural registers, avoiding large RAT
    entries. The save instruction would emit store microinstructions for
    the local and in registers, and then the renamer would rename the out
    registers to the in registers, and would assign 0 to the local and out registers (which would not occupy a physical register at first). This
    approach makes the most sense with a separate renamer as is now
    common. The restore instruction would rename the in registers to the
    out registers, and emit load microinstructions for the local and the
    in registers.
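
    In C-like pseudocode, the renamer-side handling sketched above might
    look roughly like this (visible-register numbering %o0-%o7 = 8..15,
    %l0-%l7 = 16..23, %i0-%i7 = 24..31; the helpers are stand-ins for the
    spill/fill micro-op machinery, not real interfaces):

    #include <stdint.h>

    enum { OUT0 = 8, LOC0 = 16, IN0 = 24 };

    typedef struct {
        uint16_t rat[32];    /* visible architected register -> physical */
        uint16_t phys_zero;  /* physical register holding constant 0    */
    } win_rename_t;

    /* Stand-ins: queue spill/fill micro-ops against the register save
       area, and allocate a free physical register. */
    static void emit_store(uint16_t preg, int slot) { (void)preg; (void)slot; }
    static void emit_load(uint16_t preg, int slot)  { (void)preg; (void)slot; }
    static uint16_t next_free = 64;
    static uint16_t alloc_preg(void) { return next_free++; }

    /* SAVE: spill the current locals and ins, make the outs the new ins,
       and let the new locals and outs read as 0 without allocating
       physical registers for them yet. */
    static void rename_save(win_rename_t *rn)
    {
        for (int i = 0; i < 8; i++) emit_store(rn->rat[LOC0 + i], i);
        for (int i = 0; i < 8; i++) emit_store(rn->rat[IN0 + i], 8 + i);
        for (int i = 0; i < 8; i++) rn->rat[IN0 + i] = rn->rat[OUT0 + i];
        for (int i = 0; i < 8; i++) {
            rn->rat[LOC0 + i] = rn->phys_zero;
            rn->rat[OUT0 + i] = rn->phys_zero;
        }
    }

    /* RESTORE: the ins become the caller's outs again, and the caller's
       locals and ins are refilled from the save area. */
    static void rename_restore(win_rename_t *rn)
    {
        for (int i = 0; i < 8; i++) rn->rat[OUT0 + i] = rn->rat[IN0 + i];
        for (int i = 0; i < 8; i++) {
            uint16_t p = alloc_preg();
            emit_load(p, i);
            rn->rat[LOC0 + i] = p;
        }
        for (int i = 0; i < 8; i++) {
            uint16_t p = alloc_preg();
            emit_load(p, 8 + i);
            rn->rat[IN0 + i] = p;
        }
    }

    The lazier variant discussed further below would keep a couple of
    extra windows' worth of mappings and only emit the stores and loads
    when the windows are exhausted.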

    OoO tends to work fine with storing around calls and loading around
    returns in architectures without register windows, because the storing
    mostly consumes resources, but is not on the critical path, and
    likewise for the loading (the loads tend to be ready earlier than the instructions on the critical path); and store-to-load forwarding
    deals with the problem of a return shortly after a call.

    In the scheme above, these benefits would also happen, but there are
    the following problems: each save saves 16 registers, and each restore
    restores 16 registers, many more than is typical with the usual
    calling conventions. Moreover, at least gcc inserts SAVE and RETURN
    (which includes RESTORE) instructions even for leaf functions with low
    register pressure like

    int foo(int a[])
    {
    if ((a[1]^a[2]) < 0)
    return a[0];
    else
    return a[3]+1;
    }

    Clang OTOH manages to do without a save instruction in this case
    (working with the o registers and %g1..%g4), so with the right
    compiler this would be less of a problem.

    Another problem is that SPARC is specified to have at least three
    register windows. I guess this could be addressed in some way.

    In any case, if we want save to be reasonably fast on average, we may
    want to implement a few more architectural registers for register
    windows (as outlined above), and have, e.g., 2*16 invisible registers
    (for a total of 64 architectural registers), do the storing only when
    all windows are consumed and there is another SAVE, and likewise load
    only when there is a RESTORE and there is no register window that
    contains the calling context.

    After a restore, the registers of the now unused window could be
    freed, making more registers available for renaming. This would
    correspond to having only as many architectural registers as necessary
    for the currently active register windows. An effect of this idea
    would be that, depending on the save and restore patterns leading up to
    some code that needs a lot of renaming capacity, the performance of
    that code would vary, but similar effects have also been seen in other
    areas.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Nov 18 13:15:24 2025
    From Newsgroup: comp.arch

    On 11/17/2025 1:49 AM, Robert Finch wrote:
    On 2025-11-16 1:36 p.m., MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:


    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    A common set of flags is NZCV.  Of these N and Z can be generated from >>>>> the 64 ordinary bits (actually N is the MSB of these bits).

    You might also want NCZV of 32-bit instructions, but in that case all >>>>> flags are derivable from the 64 ordinary bits of the GPR; but in that >>>>> case you may need additional branch instructions: Instructions that
    check only if the bottom 32-bits are 0 (Z), if bit 31 is 1 (N), if bit >>>>> 32 is 1 (C), or if bit 32 is different from bit 31 (V).

    If you write an architectural rule whereby every integer result is
    "proper" one set of bits {top, bottom, dispersed} covers everything.

    Proper means that all the bits in the register are written but the
    value written is range limited to {Sign}×{Size} of the calculation.

    I have no idea what you mean with "one set of bits {top, bottom,
    dispersed}".

    typedef struct { uint64_t reg;
                      uint8_t  bits: 4; } gpr;
    or
    typedef struct { uint8_t  bits: 4;
                      uint64_t reg;} gpr;
    or
    typedef struct { uint16_t reg0;
                      uint8_t  bit0: 1;
                      uint16_t reg1;
                      uint8_t  bit1: 1;
                      uint16_t reg2;
                      uint8_t  bit2: 1;
                      uint16_t reg3;
                      uint8_t  bit3: 1;  } gpr;

    Did you loose every brain-cell of imagination ?!?

    As for "proper": Does this mean that one would have to have add(c),
    sub(c), mul (madd etc.), shift right and shift left (did I forget
    anything?) for i8, i16, i32, i64, u8, u16, u32, and u64?  Yes, if
    specify in the operation which kind of Z, C/V, and maybe N you are
    interested in, you do not need to specify it in the branch that checks
    that result; you also eliminate the sign-extension and zero-extension
    operations that we discussed some time ago.

    {s8, s16, s32, s64, u8, u16, u32, u64} yes.
    But given that the operations are much more frequent than branches,
    encoding that information in the branches uses less space (for shift
    right, the sign is usually included in the operation).  It's

    Which is why I don't have ANY of those extra bits.

    interesting that AFAIK there are instruction sets (e.g., Power) that
    just have one full-width sign-agnostic add, and do not have
    width-specific flags, either.  So when compiling stuff like

    if (a[1]+a[2] == 0) /* unsigned a[] */

    a width-specific compare instruction provides that information.  But
    gcc generates a compare instruction even when a[] is "unsigned long",
    so apparently add does not set the flags on addition anyway (and if
    there is an add that sets flags, it is not used by gcc for this code).

    Another case is SPARC v9, which tends to set flags.  For

       if ((a[1]^a[2]) < 0)

    I see:

    long a[]                      int a[]
    ldx  [ %i0 + 8 ], %g1         ld  [ %i0 + 4 ], %g2
    ldx  [ %i0 + 0x10 ], %g2      ld  [ %i0 + 8 ], %g1
    xor  %g1, %g2, %g1            xorcc  %g2, %g1, %g0
    brlz,pn   %g1, 24 <foo+0x24>  bl,a,pn   %icc, 20 <foo+0x20>

    Reading up on SPARC v9, it has two sets of condition codes: 32-bit
    (icc) and 64-bit (xcc), and every instruction that sets condition
    codes (e.g., xorcc) sets both.

    Another reason its death is helpful to comp.arch

                                     In the present case, the 32-bit
    sequence sets the ccs and then checks icc, while the 64-bit sequence
    does not set the ccs, and instead uses a branch instruction that
    inspects an integer register (%g1).  These branch instructions all
    work for the full 64 bits, and do not provide a way to check a 32-bit
    result.  In the present case, an alternate way to use brlz for the
    32-bit case would have been:

    ldsw  [ %i0 + 8 ], %g1       #ld is a synonym for lduw
    ldsw  [ %i0 + 0x10 ], %g2
    xor  %g1, %g2, %g1
    brlz,pn   %g1, 24 <foo+0x24>

    because the xor of two sign-extended data is also a correct
    sign-extended result, but instread gcc chose to use xorcc and bl %icc.

    There are many ways to skin this cat.

    Sure:: close to 20-ways, less than 4 of them are "proper".
    Concerning saving the extra bits across interrupts, yes, this has to >>>>> be adapted to the actual architecture, and there are many ways to skin >>>>> this cat.  I just outlined one to give an idea how this can be done. >>>>
    On the other hand, with CARRY, none of those bits are needed.

    But the mechanism of CARRY is quite a bit more involved: Either store
    the carry in a GPR at every step, or have another mechanism inside a
    CARRY block.  And either make the CARRY block atomic or have some way
    to preserve the fact that there is this prefix across interrupts and
    (worse) synchronous traps.

    During its "life" the bits used in CARRY are simply another feedback
    path on the data-path. Afterwards, carry is written once. CARRY also
    gets written when an exception is taken.


    - anton

    These posts have inspired me to keep working on the ISA. I am on a simplification mission.

    The CARRY modifier is just a substitute for not having r3w2 port instructions directly in the ISA. Since Qupls ISA has room to support
    some r3w2 instructions directly there is no need for CARRY, much as I
    like the idea.

    While not using a carry flag in the register, there is still a
    capabilities bit, overflow bit and pointer bit plus four user assigned
    bits. I decided to just have 72-bit register store and load instructions along with the usual 8,16,32 and 64.

    Finding it too difficult to support 128-bit operations using high, low register pairs. Getting the reservation stations to pair up the
    registers seems a bit scary. It would be much simpler to just have 128-
    bit registers and it appears as if it may not be any more logic. The
    benefit of using register pairs is the internal busses need only be 64-
    bits then.


    I went with pairs, but I guess maybe pairs are a lot easier for in-order
    than OoO.

    Sparc v9 died?


    Pretty sure SPARC is good and dead at this point...

    Many others in this space are not far behind.

    Basically, anything remaining needs to compete against ARM and RISC-V
    (the latter of which is making an unexpectedly rapid rise in mind-share and prominence...).


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Tue Nov 18 13:22:44 2025
    From Newsgroup: comp.arch

    On 11/17/2025 12:41 PM, MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-16 1:36 p.m., MitchAlsup wrote:
    -------------------------------

    During its "life" the bits used in CARRY are simply another feedback
    path on the data-path. Afterwards, carry is written once. CARRY also
    gets written when an exception is taken.


    - anton

    These posts have inspired me to keep working on the ISA. I am on a
    simplification mission.

    The CARRY modifier is just a substitute for not having r3w2 port
    instructions directly in the ISA. Since Qupls ISA has room to support
    some r3w2 instructions directly there is no need for CARRY, much as I
    like the idea.

    That is correct at the 95% level.

    While not using a carry flag in the register, there is still a
    capabilities bit, overflow bit and pointer bit plus four user assigned
    bits. I decided to just have 72-bit register store and load instructions
    along with the usual 8,16,32 and 64.

    Finding it too difficult to support 128-bit operations using high, low
    register pairs. Getting the reservation stations to pair up the
    registers seems a bit scary.

    It IS scary and hard and tricky to get right.

    It would be much simpler to just have
    128-bit registers and it appears as if it may not be any more logic. The
    benefit of using register pairs is the internal busses need only be
    64-bits then.

    Almost exactly what we did in Mc 88120 when facing the same problem.
    Except we kept the 32-bit model and had register files 2 registers
    tall {even, odd},{odd even} so any register specifier would simply
    read out the status and values of both registers and then let the
    stations handle the insundry problems.


    I had actually considered this as a possible implementation strategy
    in the past.

    Either way, strict even+odd pairing does mean that it is possible to
    treat things either as 64 or 128 bit registers internally, except that
    the 64-bit case would still need to be able to operate with
    independently addressable registers (a 3R1W 128-bit regfile can't
    directly mimic a 6R2W or similar).

    One possibility here is that for register pairs, if it functions as a
    128-bit access, one of the 64-bit ID's is effectively ignored/disabled,
    and any OoO magic would mark both registers in the pair as unavailable.
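
    A tiny sketch of that idea (scoreboard-flavoured and purely
    illustrative; an OoO design would track the same thing per physical
    register):

    #include <stdint.h>
    #include <stdbool.h>

    /* One busy bit per 64-bit logical register.  A 128-bit destination
       occupies the strict even/odd pair (r, r|1), so both bits get set and
       both must clear before a dependent instruction may issue. */
    static uint64_t busy;   /* bit i set = register i has a pending write */

    static void mark_dest(int reg, bool is128)
    {
        busy |= 1ull << reg;
        if (is128)
            busy |= 1ull << (reg | 1);
    }

    static bool operand_ready(int reg, bool is128)
    {
        uint64_t mask = 1ull << reg;
        if (is128)
            mask |= 1ull << (reg | 1);
        return (busy & mask) == 0;
    }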

    But, alas, never implemented an OoO CPU, so I don't really know here.


    Sparc v9 died?

    What was the last year SPARC sold more than 100,000 CPUs ??

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Tue Nov 18 19:28:29 2025
    From Newsgroup: comp.arch

    BGB <cr88192@gmail.com> schrieb:

    Pretty sure SPARC is good and dead at this point...

    Almost, but not quite. I still have a login on a couple of SPARC
    machines:

    $ uname -a
    SunOS s11-sparc.cfarm 5.11 11.4.86.201.2 sun4v sparc sun4v logical-domain
    $ kstat -p cpu_info | head
    cpu_info:0:cpu_info0:brand SPARC-M8
    cpu_info:0:cpu_info0:chip_id 0
    cpu_info:0:cpu_info0:class misc
    cpu_info:0:cpu_info0:clock_MHz 5067
    cpu_info:0:cpu_info0:core_id 8
    cpu_info:0:cpu_info0:cpu_fru hc:///component=
    cpu_info:0:cpu_info0:cpu_type sparcv9
    cpu_info:0:cpu_info0:crtime 12619319,2018106
    cpu_info:0:cpu_info0:cstates_count 0:0
    cpu_info:0:cpu_info0:cstates_nsec 11963950024050:12619342341210000

    Many others in this space are not far behind.

    Basically, anything remaining needs to compete against ARM and RISC-V
    (the latter of which making an unexpectedly rapid rise in mind-share and prominence...).

    Power's not dead, either, if very highly priced. MIPS is still
    being sold, apparently. Then there's Loongarch. As for RISC-V,
    I am not sure how much business they actually generate compared
    to others.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Nov 18 22:25:24 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    BGB <cr88192@gmail.com> schrieb:

    Pretty sure SPARC is good and dead at this point...

    Almost, but not quite. I still have login on a couple of SPARC
    machines:

    My doctor told me that he had given my prostate enough x-ray radiation
    to kill the prostate cancer, but I still had to take medicine because
    the cancer cells had not actually died yet (for 2 more months).

    SPARC has been killed, but is not quite dead.

    A fine line indeed.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Nov 18 20:26:25 2025
    From Newsgroup: comp.arch

    On 2025-11-18 2:15 p.m., BGB wrote:
    On 11/17/2025 1:49 AM, Robert Finch wrote:
    On 2025-11-16 1:36 p.m., MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:


    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    A common set of flags is NZCV.  Of these N and Z can be generated >>>>>> from
    the 64 ordinary bits (actually N is the MSB of these bits).

    You might also want NCZV of 32-bit instructions, but in that case all >>>>>> flags are derivable from the 64 ordinary bits of the GPR; but in that >>>>>> case you may need additional branch instructions: Instructions that >>>>>> check only if the bottom 32-bits are 0 (Z), if bit 31 is 1 (N), if >>>>>> bit
    32 is 1 (C), or if bit 32 is different from bit 31 (V).

    If you write an architectural rule whereby every integer result is
    "proper" one set of bits {top, bottom, dispersed} covers everything. >>>>>
    Proper means that all the bits in the register are written but the
    value written is range limited to {Sign}×{Size} of the calculation. >>>>
    I have no idea what you mean with "one set of bits {top, bottom,
    dispersed}".

    typedef struct { uint64_t reg;
                      uint8_t  bits: 4; } gpr;
    or
    typedef struct { uint8_t  bits: 4;
                      uint64_t reg;} gpr;
    or
    typedef struct { uint16_t reg0;
                      uint8_t  bit0: 1;
                      uint16_t reg1;
                      uint8_t  bit1: 1;
                      uint16_t reg2;
                      uint8_t  bit2: 1;
                      uint16_t reg3;
                      uint8_t  bit3: 1;  } gpr;

    Did you loose every brain-cell of imagination ?!?

    As for "proper": Does this mean that one would have to have add(c),
    sub(c), mul (madd etc.), shift right and shift left (did I forget
    anything?) for i8, i16, i32, i64, u8, u16, u32, and u64?  Yes, if
    specify in the operation which kind of Z, C/V, and maybe N you are
    interested in, you do not need to specify it in the branch that checks >>>> that result; you also eliminate the sign-extension and zero-extension
    operations that we discussed some time ago.

    {s8, s16, s32, s64, u8, u16, u32, u64} yes.
    But given that the operations are much more frequent than branches,
    encoding that information in the branches uses less space (for shift
    right, the sign is usually included in the operation).  It's

    Which is why I don't have ANY of those extra bits.

    interesting that AFAIK there are instruction sets (e.g., Power) that
    just have one full-width sign-agnostic add, and do not have
    width-specific flags, either.  So when compiling stuff like

    if (a[1]+a[2] == 0) /* unsigned a[] */

    a width-specific compare instruction provides that information.  But
    gcc generates a compare instruction even when a[] is "unsigned long",
    so apparently add does not set the flags on addition anyway (and if
    there is an add that sets flags, it is not used by gcc for this code). >>>>
    Another case is SPARC v9, which tends to set flags.  For

       if ((a[1]^a[2]) < 0)

    I see:

    long a[]                      int a[]
    ldx  [ %i0 + 8 ], %g1         ld  [ %i0 + 4 ], %g2
    ldx  [ %i0 + 0x10 ], %g2      ld  [ %i0 + 8 ], %g1
    xor  %g1, %g2, %g1            xorcc  %g2, %g1, %g0
    brlz,pn   %g1, 24 <foo+0x24>  bl,a,pn   %icc, 20 <foo+0x20>

    Reading up on SPARC v9, it has two sets of condition codes: 32-bit
    (icc) and 64-bit (xcc), and every instruction that sets condition
    codes (e.g., xorcc) sets both.

    Another reason its death is helpful to comp.arch

                                     In the present case, the 32-bit
    sequence sets the ccs and then checks icc, while the 64-bit sequence
    does not set the ccs, and instead uses a branch instruction that
    inspects an integer register (%g1).  These branch instructions all
    work for the full 64 bits, and do not provide a way to check a 32-bit
    result.  In the present case, an alternate way to use brlz for the
    32-bit case would have been:

    ldsw  [ %i0 + 8 ], %g1       #ld is a synonym for lduw
    ldsw  [ %i0 + 0x10 ], %g2
    xor  %g1, %g2, %g1
    brlz,pn   %g1, 24 <foo+0x24>

    because the xor of two sign-extended data is also a correct
    sign-extended result, but instread gcc chose to use xorcc and bl %icc. >>>>
    There are many ways to skin this cat.

    Sure:: close to 20-ways, less than 4 of them are "proper".
    Concerning saving the extra bits across interrupts, yes, this has to >>>>>> be adapted to the actual architecture, and there are many ways to >>>>>> skin
    this cat.  I just outlined one to give an idea how this can be done. >>>>>
    On the other hand, with CARRY, none of those bits are needed.

    But the mechanism of CARRY is quite a bit more involved: Either store
    the carry in a GPR at every step, or have another mechanism inside a
    CARRY block.  And either make the CARRY block atomic or have some way >>>> to preserve the fact that there is this prefix across interrupts and
    (worse) synchronous traps.

    During its "life" the bits used in CARRY are simply another feedback
    path on the data-path. Afterwards, carry is written once. CARRY also
    gets written when an exception is taken.


    - anton

    These posts have inspired me to keep working on the ISA. I am on a
    simplification mission.

    The CARRY modifier is just a substitute for not having r3w2 port
    instructions directly in the ISA. Since Qupls ISA has room to support
    some r3w2 instructions directly there is no need for CARRY, much as I
    like the idea.

    While not using a carry flag in the register, there is still a
    capabilities bit, overflow bit and pointer bit plus four user assigned
    bits. I decided to just have 72-bit register store and load
    instructions along with the usual 8,16,32 and 64.

    Finding it too difficult to support 128-bit operations using high, low
    register pairs. Getting the reservation stations to pair up the
    registers seems a bit scary. It would be much simpler to just have
    128- bit registers and it appears as if it may not be any more logic.
    The benefit of using register pairs is the internal busses need only
    be 64- bits then.


    I went with pairs, but I guess maybe pairs are a lot easier for in-order than OoO.

    I have gone with quads now. They are faked by translating one ISA
    instruction into four micro-ops doing 64-bit ops. It could work with
    pairs too in the same manner. The number of registers was upped to 128
    so there can be 32 x 256-bit SIMD registers.

    Shelved the 128-bit ops for now.

    Sparc v9 died?


    Pretty sure SPARC is good and dead at this point...

    Many others in this space are not far behind.

    Basically, anything remaining needs to compete against ARM and RISC-V
    (the latter of which making an unexpectedly rapid rise in mind-share and prominence...).


    I am still waiting to see what else shows up.

    Is the need for backwards compatibility killing things off as
    technology has improved? There seem to be a lot more known good/bad
    approaches now, making me think that the lifetime of newer designs
    could be longer.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 19 01:47:26 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Is the need for backwards compatibility killing things as technology has improved?

    No with respect to::
    Little Endian
    IEEE 754 floating point
    Byte addressable memory
    Misaligned memory
    PCIe peripheral access
    CXL interconnect
    CXL added memory
    CXL added cache
    access to Linux
    access to gnu
    access to LLVM
    access to qemu
    access to gem5
    numerical libraries/packages

    yes with respect to::
    x86 condition codes
    x86 shift by 0
    x86 descriptor tables
    4096 byte pages
    long latency exception/interrupt control transfer
    need source to port application
    SIMD considered harmful
    ATOMIC activities
    Exception walk-back across block structure
    Signal/exception delivery
    language evolution
    environment evolution

    There seems to be a lot more known good/bad approaches making
    me think that the lifetime of newer designs could be longer.

    Yes, but the people making the decisions are still too young to have
    the history needed to make better decisions.

    The graduates of major universities go right out and start designing
    without being exposed to "enough" of the disease of computer architecture
    to be in a position to understand why feature.X of arch.Y was bad overall,
    or why feature.X of architecture.Y was not enough to save it.

    Each generation reaches employment after university at about the same
    level as we did when we invented RISC.

    Architecture is only 1/3rd ISA--and it is the other 2/3rds where the
    {trouble or success} lies (85% confidence level).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Wed Nov 19 07:47:12 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:
    environment evolution

    There seems to be a lot more known good/bad approaches making
    me think that the lifetime of newer designs could be longer.

    Yes, but the people making the decisions are still to young to have
    the history needed to make better decisions.

    The graduates of major universities go right out and start designing
    without being exposed to "enough" of the disease of computer architecture
    to be in a position to understand why feature.X of arch.Y was bad overall,
    or why feature.X of architecture.Y was not enough to save it.

    Each generation reaches employment after university at about the same
    level as we did when we invented RISC.

    I recently heard that CS graduates from ETH Zürich had heard about
    pipelines, but thought it was fetch-decode-execute.

    They also did not know about DEC or the VAX. Sic transit gloria
    mundi... Apparently, the most ancient computer history they heard
    about was Nehalem.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Wed Nov 19 12:53:35 2025
    From Newsgroup: comp.arch

    On 11/13/2025 9:59 PM, MitchAlsup wrote:

    BGB <cr88192@gmail.com> posted:

    On 11/13/2025 3:58 PM, Anton Ertl wrote:
    BGB <cr88192@gmail.com> writes:
    Can note that GCC seemingly doesn't support 128-bit integers on 64-bit >>>> RISC-V.

    What makes you think so? It has certainly worked every time I tried
    it. E.g., Gforth's "configure" reports:

    checking size of __int128_t... 16
    checking size of __uint128_t... 16
    [...]
    checking for a C type for double-cells... __int128_t
    checking for a C type for unsigned double-cells... __uint128_t

    That's with gcc 10.3.1


    Hmm...

    Seems so.

    Testing again, it does appear to work; the error message I thought I
    remembered seeing actually applied to trying to use the type in MSVC.
    I had thought I remembered checking before and it failing, but it
    seems not.

    But, yeah, good to know I guess.


    As for MSVC:
    tst_int128.c(5): error C4235: nonstandard extension used: '__int128'
    keyword not supported on this architecture

    ERRRRRRR:: not supported by this compiler, the architecture has
    ISA level support for doing this, but the compiler does not allow
    you access.

    More or less it seems.


    This leaves, apparently:
    MSVC: Maybe once had it for IA-64, but nowhere else;
    GCC: Supported, but glibc lacks a printf length modifier for it (see the snippet below).
    Clang: Supported, but lacks support for 128-bit integer literals?...
    BGBCC: Supported, with literals and 'I128' printf modifier.
    Where, 'I128' is similar to 'I64' in MSVC,
    as for a long time they also lacked the 'll' modifier and similar.

    ISA's:
    X64: Can build manually via register pairs (any two registers), ADD+ADC
    allows for 128-bit adds in 2 instructions (see the C sketch below);
    Many 128-bit ops can be built using flags bits;
    ISA supports widening multiply and narrowing divide, though typically
    with hardwired registers.

    XG1/XG2:
    CLRT+ADDC+ADDC
    Theoretically arbitrary, BGBCC only uses even pairs;
    CLRT needed to clear the SR.T flag;
    Normal ADD does not modify SR.T.
    Could maybe be better if there were a 3R ADDC variant,
    and maybe a carry-out only variant (so no CLRT was needed).
    ADDX
    Even pairs only, single instruction.

    XG3:
    Support for SR.T was demoted to optional,
    half the encoding space goes unused if predication isn't used though.
    Could fit a "better" RV-C in there (*1).
    ALUX instructions could be used, also optional.
    Otherwise, it is left in a similar situation to RISC-V here.
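
    As a rough illustration of what the ADD+ADC (x86-64) or CLRT+ADDC+ADDC
    (XG1/XG2) sequences above compute, a minimal C sketch of a 128-bit add
    built from two 64-bit halves plus an explicit carry (the type and
    function names here are illustrative, not from any of the compilers
    discussed):

    #include <stdint.h>

    typedef struct { uint64_t lo, hi; } u128;   /* a register pair, in effect */

    static u128 u128_add(u128 a, u128 b)
    {
        u128 r;
        r.lo = a.lo + b.lo;
        /* carry out of the low half: the sum wrapped below either input */
        uint64_t carry = (r.lo < a.lo);
        r.hi = a.hi + b.hi + carry;             /* the ADC / ADDC step */
        return r;
    }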

    *1: Noted before that if one tweaks the design of RV-C some:
    Makes Imm/Disp fields smaller;
    Replaces Reg3 with Reg4 (X8..X27);
    ...
    It is possible to get a set of 16-bit ops that both use less encoding
    space and get a better average hit rate than the existing RV-C ops
    (mostly by not trying to do Imm6/Disp6 in said ops; and only using Reg5
    on a few instructions).

    However, IMO, it makes more sense to support RV-C for binary
    compatibility than on the merits of its encoding scheme (which is kind
    of a turd).

    However, "XG3 sub-variant that drops predicated encodings in favor of re-adding a new/different set of 16-bit encodings" was not a
    particularly attractive option.


    For where it makes sense to use XG3 though, likely it makes sense to
    allow/use SR.T and the predicated encodings, which can still offer a
    small but non-zero performance benefit (even if debatable if it is
    something that is worth spending half of the encoding space on).

    I did also experiment with allowing a few blocks to be used for
    pair-encoded ops. One other possibility could be some additional
    unconditional-only instruction blocks (but, these would be N/E in
    XG1/XG2).



    One possibility could also be an "XG3 Lite" subset:
    Likely unconditional only, and also disallows RISC-V encodings.

    Or, IOW:
    ...xx00 Disallowed
    ...xx01 Disallowed
    ...xx10 Allowed
    ...xx11 Disallowed

    Could maybe make sense if I wanted a core on a smaller FPGA.

    However, there isn't that much incentive to go for much smaller than the XC7S50 with this, and for current use-cases that could involve an XC7S25
    or XC7A35T, you kinda really want to try to maximize code density
    (mostly because the currently available dev-boards with these FPGAs tend
    to lack external RAM).

    The Intel/Altera chips tend to always have integrated ARM cores;
    Boards with Lattice FPGAs (probably ECP5 or similar in this case, *)
    tend to be obscure and overpriced (even if theoretically the FPGAs
    themselves are cheaper).

    *: One is harder pressed to make a non-trivial CPU core that fits into
    an ICE40.


    Though, one other possibility being trying to again implement dual-core
    on an XC7A100T, but possibly sharing FPU and SIMD between the cores (may
    or may not be viable).

    In this case, there would be a mechanism such that inter-core interlocks
    could trigger to disallow both cores trying to access the FPU or SIMD
    unit on the same clock-cycle. Though unclear how this could interact
    with pipeline stalls (would ideally want both cores to have independent pipelines; but then one needs to arbitrate things such that both units
    get their results at the expected clock cycle, ...).

    Though, to that end, may also make sense to consider going to a
    dual-issue superscalar with 4R2W register file.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 20 07:33:36 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Power's not dead, either, if very highly priced.

    New Power CPUs and machines based on them are released regularly. I
    think there is enough business in the iSeries (or whatever its current
    name) to produce enough money for the costs of that development.
    pSeries benefits from that. I guess that the profits from that are
    enough to finance the development of the pSeries machines, but can
    contribute little to finance the development of the CPUs.

    MIPS is still
    being sold, apparently.

    From <https://en.wikipedia.org/wiki/MIPS_architecture>:
    |In March 2021, MIPS announced that the development of the MIPS
    |architecture had ended as the company is making the transition to
    |RISC-V.

    So it's the same status as SPARC. They may be selling to existing
    customers, but nobody sane will use MIPS for a new project.

    As for RISC-V,
    I am not sure how much business they actually generate compared
    to others.

    I think a lot of embedded RISC-Vs are used, e.g., in WD (and now
    Sandisk) HDDs and SSDs; so you can look at the business reports of WD
    if you want to know how much business they make. As for things you
    can actually program, there are a number of SBCs on sale (and we have
    one), from the Raspi Pico 2 (where you apparently can use either
    ARMv8-M (i.e., ARM T32) or RISC-V (probably some RV32 variant)) up to
    stuff like the Visionfive V2, several Chinese offerings, and some
    Hifive SBCs. The latter are not yet competitive in CPU performance
    with the like of RK3588-based SBCs or the Raspi 5, so I expect the
    main reason for buying them is to try out RISC-V (we have a Visionfive
    V1 for that purpose); still, the fact that there are several offerings indicates that there is nonnegligible revenue there.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 20 07:55:48 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:
    Is the need for backwards compatibility killing things as technology has
    improved?

    That is certainly the usual complaint by engineers who are hindered in
    doing what they would otherwise like to do by backwards compatibility
    requirements. It's certainly easier to design on a clean slate. OTOH
    not all of the ideas that are prevented by backwards compatibility
    requirements are good ideas.

    Overall, as I mentioned in this thread, there is architectural
    progress, in some cases (e.g., the establishment of 8/16/32/64-bit
    machines) in ways that are not backwards-compatible. So backwards compatibility is not preventing all progress.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Thu Nov 20 08:05:53 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    I recently heard that CS graduates from ETH Zürich had heard about
    pipelines, but thought it was fetch-decode-execute.

    Why would a CS graduate need to know about pipelines?

    They also did not know about DEC or the VAX. Sic transit gloria
    mundi...

    Yes, a few years ago I asked some students, among them an older one
    who was interested in some older technologies, whether they had heard
    of the VAX. None had. It seems that VAX was big in the 80s, but it
    then vanished from the radar of the computer-interested public. So
    anybody who became interested in computers only afterwards is unlikely
    to have heard of the VAX, unless they are into retrocomputing or read
    old debates about the advantages of RISC.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Fri Nov 21 15:31:49 2025
    From Newsgroup: comp.arch

    On Thu, 13 Nov 2025 19:04:18 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    Not really.
    That is, conversions are not blazingly fast, but still much better
    than any attempt to divide in any form of decimal. And helps to
    preserve your sanity.

    Are you trying to pull our proverbial leg here ?!?


    After reading paragraph 5.2 of the IEEE-754-2008 Standard I am less
    sure of the correctness of my statement above.
    For the case of exact division, preserving one's sanity while
    fulfilling the requirements of that paragraph is far from simple,
    regardless of the numeric base used in the process.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Fri Nov 21 13:36:05 2025
    From Newsgroup: comp.arch

    On 11/21/2025 7:31 AM, Michael S wrote:
    On Thu, 13 Nov 2025 19:04:18 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    Not really.
    That is, conversions are not blazingly fast, but still much better
    than any attempt to divide in any form of decimal. And helps to
    preserve your sanity.

    Are you trying to pull our proverbial leg here ?!?


    After reading paragraph 5.2 of IEEE-754-2008 Standard I am less sure in correctness of my above statement.
    For the case of exact division, preservation of mental sanity during fulfillment of requirements of this paragraph is far from simple,
    regardless of numeric base used in the process.


    One effectively needs to do a special extra-wide divide rather than just
    a normal integer divide, etc.


    But, yeah, fastest I had gotten in my experiments was radix-10e9 long-division, but still not the fastest option.

    So, rough ranking, fast to slow:
    Radix-10e9 Long Divide (fastest)
    Newton-Raphson
    Radix-10 Long Divide
    Integer Shift-Subtract with converters (slowest).
    Fastest converter strategy ATM:
    Radix-10e9 double-dabble (Int->Dec).
    MUL-by-10e9 and ADD (Dec->Int)
    Fastest strategy: Unrolled Shifts and ADDs (*1).


    *1: While it is possible to perform a 128-bit multiply by decomposing
    it into 32-bit partial products and adding them together, it was
    working out slightly faster in this case to do a fixed multiply by
    decomposing it into a series of explicit shifts and ADDs.

    Though, in this case, it is faster (and less ugly) to decompose this
    into a pattern of iteratively multiplying by smaller amounts. I had
    ended up using 4x multiply by 100 followed by a multiply by 10, which,
    while not the fastest strategy, needs less code than 2x multiply by
    10000 + multiply by 10. Most other patterns would need more shifts and
    adds.

    In theory, x86-64 could do it better with multiply ops, but getting
    something optimal out of the C compilers is a bigger issue here it seems.


    Unexplored options:
    Radix 10e2 (byte)
    Radix 10e3 (word)
    Radix 10e4 (word)

    Radix 10e3 could have the closest to direct mapping to DPD.


    Looking at the decNumber code, it appears also to be Radix-10e9 based.
    They also do significant (ab)use of the C preprocessor.

    Apparently, "Why use functions when you can use macros?"...


    For the Radix-10e9 long-divide, part of the magic was in the function to
    scale a value by a radix value and subtract it from another array.

    Ended up trying a few options, fastest was to temporarily turn the
    operation into non-normalized 64-bit pieces and then normalize the
    result (borrow propagation, etc) as an output step.

    Initial attempt kept it normalized within the operation, which was slower.

    It was seemingly compiler-dependent whether it was faster to do a
    combined operation, or separate scale and subtract, but the margins were small. On MSVC the combined operation was slightly faster than the
    separate operations.
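
    For what it's worth, a minimal sketch of such a scale-and-subtract step
    with deferred borrow normalization, assuming little-endian radix-1e9
    digits held in uint32_t and a trial quotient digit q < 10^9 (names,
    layout, and the digit-count limit are illustrative, not the actual
    code):

    #include <stdint.h>

    #define BASE 1000000000u   /* radix 10^9, one "digit" per uint32_t */

    /* a[0..n-1] -= q * b[0..n-1], digits little-endian.
       Per-digit results are kept as non-normalized signed 64-bit pieces,
       with the borrows only propagated in a final normalization pass. */
    static void scale_sub_1e9(uint32_t *a, const uint32_t *b, int n, uint32_t q)
    {
        int64_t tmp[64];                    /* assume n <= 64 for this sketch */
        for (int i = 0; i < n; i++)
            tmp[i] = (int64_t)a[i] - (int64_t)q * b[i];   /* may go negative */

        int64_t carry = 0;                  /* borrow-propagation pass */
        for (int i = 0; i < n; i++) {
            int64_t cur = tmp[i] + carry;
            int64_t d   = cur % (int64_t)BASE;
            carry       = cur / (int64_t)BASE;
            if (d < 0) { d += BASE; carry -= 1; }   /* force 0 <= d < BASE */
            a[i] = (uint32_t)d;
        }
        /* a nonzero final borrow would mean q was too large (caller adjusts) */
    }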

    ...



    Otherwise, after this, just went and fiddled with BGBCC some more,
    adding more options for its resource converter.

    Had before (for image formats):
    In: TGA, BMP (various), PNG, QOI, UPIC
    Out: BMP (various), QOI, UPIC

    Added (now):
    In: PPM, JPG, DDS
    Out: PNG, JPG, DDS (DXT1 and DXT5)

    Considered (not added yet):
    PCX
    Evaluated PCX, possible but not a clear win.


    Fiddled with making the PNG encoder less slow, mostly this was tweaking
    some parameters for the LZ searches. Initial settings were using deeper searches over initially smaller sliding windows (at lower compression
    levels); better in this case to do a shallower search over a max-sized
    sliding window.

    ATM, speed of PNG is now on-par with the JPG encoder (still one of the
    slower options).

    For simple use-cases, PNG still loses (in terms of both speed and
    compression) to 16-color BMP + LZ compression (LZ4 or RP2).
    Theoretically, indexed-color PNG exists, but is less widely supported.

    It is less space-efficient to represent 16 colors as Deflate-compressed
    color differences than it is to just represent the 4-bit RGBI values
    directly.

    However, can note that the RLE compression scheme (used by PCX) is
    clearly inferior to any sort of LZ compression.


    PNG is also a more expensive format to decode (even vs JPEG).


    UPIC can partly address the use-cases of both PNG and JPEG while being
    cheaper to decode than either, but more niche as pretty much nothing
    supports it. Some of its design and properties being mostly JPEG-like.

    QOI is interesting, but suffers some similar limitations to PCX (its
    design is mostly about more compactly encoding color-differences in
    true-color images and otherwise only offers RLE compression).

    QOI is not particularly effective on images with little color variation
    but lots of repeating patterns (I have a modified QOI that does a
    little better here, though it is still not particularly effective with
    16-color graphics).


    Otherwise, also ended up adding a small text format for image drawing
    commands.

    As a simplistic line oriented format containing various commands to
    perform drawing operations or composite images.
    creating a "canvas"
    setting the working color
    drawing lines
    bucket fill
    drawing text strings
    overlaying other images
    ...


    This is maybe (debatable) outside the scope of a C compiler, but could
    have use-cases for preparing resource data (nevermind if scope creep is
    partly also turning it into an asset-packer tool; where it is useful to
    make graphics/sounds/etc in one set of formats and then process and
    convert them into another set of files, usually inside of some sort of
    VFS image or similar).

    Design is much more simplistic than something like SVG and I am
    currently assuming its use for mostly hand-edited files. Unlike SVG, it
    also assumes drawing to a pixel grid rather than some more abstract
    coordinate space (so, its abstract model is more like "MS Paint" or
    similar); also SVG would suck as a human-edited format.

    Granted, one could argue that it might make more sense for
    asset-processing to be its own tool; one would then convert the output
    to a format that the compiler accepts (WAD2 or WAD4 in this case) prior
    to compiling the main binary (and/or not use resource data).

    Still, IMO, an internal WAD image is better than the horrid/unusable
    mess that Windows had used (where nowadays most people don't use the
    resource section for much more than storing a program icon or
    similar...).

    But, realistically, one does still want to limit how much data they
    stick into the EXE.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Nov 21 22:09:00 2025
    From Newsgroup: comp.arch

    On 2025-11-21 2:36 p.m., BGB wrote:
    On 11/21/2025 7:31 AM, Michael S wrote:
    On Thu, 13 Nov 2025 19:04:18 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    Not really.
    That is, conversions are not blazingly fast, but still much better
    than any attempt to divide in any form of decimal. And helps to
    preserve your sanity.

    Are you trying to pull our proverbial leg here ?!?


    After reading paragraph 5.2 of IEEE-754-2008 Standard I am less sure in
    correctness of my above statement.
    For the case of exact division, preservation of mental sanity during
    fulfillment of requirements of this paragraph is far from simple,
    regardless of numeric base used in the process.


    One effectively needs to do a special extra-wide divide rather than just
    a normal integer divide, etc.


    But, yeah, fastest I had gotten in my experiments was radix-10e9 long- division, but still not the fastest option.

    So, rough ranking, fast to slow:
      Radix-10e9 Long Divide (fastest)
      Newton-Raphson
      Radix-10 Long Divide
      Integer Shift-Subtract with converters (slowest).
        Fastest converter strategy ATM:
          Radix-10e9 double-dabble (Int->Dec).
          MUL-by-10e9 and ADD (Dec->Int)
            Fastest strategy: Unrolled Shifts and ADDs (*1).


    *1: While it is possible to perform a 128-bit multiply decomposing into multiplying 32-bit parts and adding them together; it was working out slightly faster in this case to do a fixed multiply by decomposing it
    into a series of explicit shifts and ADDs.

    Though, in this case, it is faster (and less ugly) to decompose this
    into a pattern of iteratively multiplying by smaller amounts. I had
    ended up using 4x multiply by 100 followed by multiply by 10, as while
    not the fastest strategy, needs less code than 2x multiply by 10000 + multiply by 10. Most other patterns would need more shifts and adds.

    In theory, x86-64 could do it better with multiply ops, but getting something optimal out of the C compilers is a bigger issue here it seems.


    Unexplored options:
      Radix 10e2 (byte)
      Radix 10e3 (word)
      Radix 10e4 (word)

    Radix 10e3 could have the closest to direct mapping to DPD.


    Looking at the decNumber code, it appears also to be Radix-10e9 based.
    They also do significant (ab)use of the C preprocessor.

    Apparently, "Why use functions when you can use macros?"...


    For the Radix-10e9 long-divide, part of the magic was in the function to scale a value by a radix value and subtract it from another array.

    Ended up trying a few options, fastest was to temporarily turn the
    operation into non-normalized 64-bit pieces and then normalize the
    result (borrow propagation, etc) as an output step.

    Initial attempt kept it normalized within the operation, which was slower.

    It was seemingly compiler-dependent whether it was faster to do a
    combined operation, or separate scale and subtract, but the margins were small. On MSVC the combined operation was slightly faster than the
    separate operations.

    ...



    Otherwise, after this, just went and fiddled with BGBCC some more,
    adding more options for its resource converter.

    Had before (for image formats):
      In: TGA, BMP (various), PNG, QOI, UPIC
      Out: BMP (various), QOI, UPIC

    Added (now):
      In: PPM, JPG, DDS
      Out: PNG, JPG, DDS (DXT1 and DXT5)

    Considered (not added yet):
      PCX
    Evaluated PCX, possible but not a clear win.


    Fiddled with making the PNG encoder less slow, mostly this was tweaking
    some parameters for the LZ searches. Initial settings were using deeper searches over initially smaller sliding windows (at lower compression levels); better in this case to do a shallower search over a max-sized sliding window.

    ATM, speed of PNG is now on-par with the JPG encoder (still one of the slower options).

    For simple use-cases, PNG still loses (in terms of both speed and compression) to 16-color BMP + LZ compression (LZ4 or RP2).
    Theoretically, indexed-color PNG exists, but is less widely supported.

    It is less space-efficient to represent 16-colors as Deflate-compressed color differences than it is to just represent the 4-bit RGBI values directly.

    However, can note that the RLE compression scheme (used by PCX) is
    clearly inferior to that of any sort of LZ compression.


    Comparably, PNG is also a more expensive format to decode as well (even
    vs JPEG).


    UPIC can partly address the use-cases of both PNG and JPEG while being cheaper to decode than either, but more niche as pretty much nothing supports it. Some of its design and properties being mostly JPEG-like.

    QOI is interesting, but suffers some similar limitations to PCX (its
    design is mostly about more compactly encoding color-differences in true-color images and otherwise only offers RLE compression).

    QOI is not particularly effective against images with little variety in color variation but lots of repeating patterns (I have a modified QOI
    that does a little better here, still not particularly effective with 16-color graphics though).


    Otherwise, also added up adding a small text format for image drawing commands.

    As a simplistic line oriented format containing various commands to
    perform drawing operations or composite images.
      creating a "canvas"
      setting the working color
      drawing lines
      bucket fill
      drawing text strings
      overlaying other images
      ...


    This is maybe (debatable) outside the scope of a C compiler, but could
    have use-cases for preparing resource data (nevermind if scope creep is partly also turning it into an asset-packer tool; where it is useful to
    make graphics/sounds/etc in one set of formats and then process and
    convert them into another set of files, usually inside of some sort of
    VFS image or similar).

    Design is much more simplistic than something like SVG and I am
    currently assuming its use for mostly hand-edited files. Unlike SVG, it
    also assumes drawing to a pixel grid rather than some more abstract coordinate space (so, its abstract model is more like "MS Paint" or similar); also SVG would suck as a human-edited format.

    Granted, one could argue maybe it could make scope that asset-processing
    is its own tool, then one converts it to a format that the compiler
    accepts (WAD2 or WAD4 in this case) prior to compiling the main binary (and/or not use resource data).

    Still, IMO, an internal WAD image is still better than the horrid/
    unusable mess that Windows had used (where anymore most people don't
    bother with the resource section much more than storing a program icon
    or similar...).

    But, realistically, one does still want to limit how much data they
    stick into the EXE.

    ...


    My forays into the world of graphics formats are pretty limited. I tend
    to use libraries already written by other people. I assume people a lot brighter than myself have come up with them.

    A while ago I wrote a set of graphics routines in assembler that were
    quite fast. One format I have dealt with is the .flic file format used
    to render animated graphics. I wanted to write my own CIV-style game.
    It took a little bit of research and some reverse engineering.
    Apparently, the authors used a modified version of the format, making
    it difficult to use the CIV graphics in my own game. I never could get
    it to render as fast as the game’s engine. I wrote the code for my game
    in C or C++; the original game’s engine code was likely in a different
    language.

    *****

    Been working on vectors for the ISA. I split the vector length register
    into eight sections to define up to eight different vector lengths. The
    first five are defined for integer, float, fixed, character, and address
    data types. I figure one may want to use vectors of different lengths at
    the same time, for instance to address data using byte offsets, while
    the data itself might be a float. The vector load / store instructions
    accept a data type to load / store and always use the address type for
    address calculations.

    There is also a vector lane size register split up the same way. I had
    thought of giving each vector register its own format for length and
    lane size. But thought that is a bit much, with limited use cases.

    I think I can get away with only two load and two store instructions.
    One to do a strided load and a second to do a vector indexed load
    (gather/scatter). The addressing mode in use is d[Rbase+Rindex*Scale].
    Where Rindex is used as the stride when scalar or as a supplier of the
    lane offset when Rindex is a vector.
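
    Under one plausible reading of that addressing mode, the per-lane
    effective address for the two forms would be roughly as below (a
    minimal C sketch with illustrative names, not the actual RTL; whether
    the scale also applies to the stride is an assumption):

    #include <stdint.h>

    /* Strided form: scalar Rindex acts as the stride, stepped per lane. */
    static inline uint64_t ea_strided(uint64_t d, uint64_t rbase,
                                      uint64_t stride, int scale, int lane)
    {
        return d + rbase + (uint64_t)lane * stride * (uint64_t)scale;
    }

    /* Indexed (gather/scatter) form: vector Rindex supplies each lane's offset. */
    static inline uint64_t ea_indexed(uint64_t d, uint64_t rbase,
                                      const uint64_t *rindex_vec, int scale, int lane)
    {
        return d + rbase + rindex_vec[lane] * (uint64_t)scale;
    }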

    Writing the RTL code to support the vector memory ops has been
    challenging. Using a simple approach ATM. The instruction needs to be re-issued for each vector lane accessed. Unaligned vector loads and
    stores are also allowed, adding some complexity when the operation
    crosses a cache-line boundary.

    I have the max vector length and max vector size constants returned by
    the GETINFO instruction which returns CPU specific information.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Nov 22 04:54:00 2025
    From Newsgroup: comp.arch

    On 11/21/2025 9:09 PM, Robert Finch wrote:
    On 2025-11-21 2:36 p.m., BGB wrote:
    On 11/21/2025 7:31 AM, Michael S wrote:
    On Thu, 13 Nov 2025 19:04:18 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    Not really.
    That is, conversions are not blazingly fast, but still much better
    than any attempt to divide in any form of decimal. And helps to
    preserve your sanity.

    Are you trying to pull our proverbial leg here ?!?


    After reading paragraph 5.2 of IEEE-754-2008 Standard I am less sure in
    correctness of my above statement.
    For the case of exact division, preservation of mental sanity during
    fulfillment of requirements of this paragraph is far from simple,
    regardless of numeric base used in the process.


    One effectively needs to do a special extra-wide divide rather than
    just a normal integer divide, etc.


    But, yeah, fastest I had gotten in my experiments was radix-10e9 long-
    division, but still not the fastest option.

    So, rough ranking, fast to slow:
       Radix-10e9 Long Divide (fastest)
       Newton-Raphson
       Radix-10 Long Divide
       Integer Shift-Subtract with converters (slowest).
         Fastest converter strategy ATM:
           Radix-10e9 double-dabble (Int->Dec).
           MUL-by-10e9 and ADD (Dec->Int)
             Fastest strategy: Unrolled Shifts and ADDs (*1).


    *1: While it is possible to perform a 128-bit multiply decomposing
    into multiplying 32-bit parts and adding them together; it was working
    out slightly faster in this case to do a fixed multiply by decomposing
    it into a series of explicit shifts and ADDs.

    Though, in this case, it is faster (and less ugly) to decompose this
    into a pattern of iteratively multiplying by smaller amounts. I had
    ended up using 4x multiply by 100 followed by multiply by 10, as while
    not the fastest strategy, needs less code than 2x multiply by 10000 +
    multiply by 10. Most other patterns would need more shifts and adds.

    In theory, x86-64 could do it better with multiply ops, but getting
    something optimal out of the C compilers is a bigger issue here it seems.


    Unexplored options:
       Radix 10e2 (byte)
       Radix 10e3 (word)
       Radix 10e4 (word)

    Radix 10e3 could have the closest to direct mapping to DPD.


    Looking at the decNumber code, it appears also to be Radix-10e9 based.
    They also do significant (ab)use of the C preprocessor.

    Apparently, "Why use functions when you can use macros?"...


    For the Radix-10e9 long-divide, part of the magic was in the function
    to scale a value by a radix value and subtract it from another array.

    Ended up trying a few options, fastest was to temporarily turn the
    operation into non-normalized 64-bit pieces and then normalize the
    result (borrow propagation, etc) as an output step.

    Initial attempt kept it normalized within the operation, which was
    slower.

    It was seemingly compiler-dependent whether it was faster to do a
    combined operation, or separate scale and subtract, but the margins
    were small. On MSVC the combined operation was slightly faster than
    the separate operations.

    ...



    Otherwise, after this, just went and fiddled with BGBCC some more,
    adding more options for its resource converter.

    Had before (for image formats):
       In: TGA, BMP (various), PNG, QOI, UPIC
       Out: BMP (various), QOI, UPIC

    Added (now):
       In: PPM, JPG, DDS
       Out: PNG, JPG, DDS (DXT1 and DXT5)

    Considered (not added yet):
       PCX
    Evaluated PCX, possible but not a clear win.


    Fiddled with making the PNG encoder less slow, mostly this was
    tweaking some parameters for the LZ searches. Initial settings were
    using deeper searches over initially smaller sliding windows (at lower
    compression levels); better in this case to do a shallower search over
    a max-sized sliding window.

    ATM, speed of PNG is now on-par with the JPG encoder (still one of the
    slower options).

    For simple use-cases, PNG still loses (in terms of both speed and
    compression) to 16-color BMP + LZ compression (LZ4 or RP2).
    Theoretically, indexed-color PNG exists, but is less widely supported.

    It is less space-efficient to represent 16-colors as Deflate-
    compressed color differences than it is to just represent the 4-bit
    RGBI values directly.

    However, can note that the RLE compression scheme (used by PCX) is
    clearly inferior to that of any sort of LZ compression.


    Comparably, PNG is also a more expensive format to decode as well
    (even vs JPEG).


    UPIC can partly address the use-cases of both PNG and JPEG while being
    cheaper to decode than either, but more niche as pretty much nothing
    supports it. Some of its design and properties being mostly JPEG-like.

    QOI is interesting, but suffers some similar limitations to PCX (its
    design is mostly about more compactly encoding color-differences in
    true-color images and otherwise only offers RLE compression).

    QOI is not particularly effective against images with little variety
    in color variation but lots of repeating patterns (I have a modified
    QOI that does a little better here, still not particularly effective
    with 16-color graphics though).


    Otherwise, also added up adding a small text format for image drawing
    commands.

    As a simplistic line oriented format containing various commands to
    perform drawing operations or composite images.
       creating a "canvas"
       setting the working color
       drawing lines
       bucket fill
       drawing text strings
       overlaying other images
       ...


    This is maybe (debatable) outside the scope of a C compiler, but could
    have use-cases for preparing resource data (nevermind if scope creep
    is partly also turning it into an asset-packer tool; where it is
    useful to make graphics/sounds/etc in one set of formats and then
    process and convert them into another set of files, usually inside of
    some sort of VFS image or similar).

    Design is much more simplistic than something like SVG and I am
    currently assuming its use for mostly hand-edited files. Unlike SVG,
    it also assumes drawing to a pixel grid rather than some more abstract
    coordinate space (so, its abstract model is more like "MS Paint" or
    similar); also SVG would suck as a human-edited format.

    Granted, one could argue maybe it could make scope that asset-
    processing is its own tool, then one converts it to a format that the
    compiler accepts (WAD2 or WAD4 in this case) prior to compiling the
    main binary (and/or not use resource data).

    Still, IMO, an internal WAD image is still better than the horrid/
    unusable mess that Windows had used (where anymore most people don't
    bother with the resource section much more than storing a program icon
    or similar...).

    But, realistically, one does still want to limit how much data they
    stick into the EXE.

    ...


    My forays into the world of graphics formats are pretty limited. I tend
    to use libraries already written by other people. I assume people a lot brighter than myself have come up with them.


    I usually wrote my own code for most things.

    Not dealt much with FLIC.


    In the past, whenever doing animated stuff, I had usually used the AVI
    file format. A lot of the time, the codecs were custom.

    Both AVI and BMP can be used to hold a wide range of image data,
    partly as a merit of using FOURCCs.

    Over the course of the past 15 years, have fiddled a lot here.



    A few of the longer-lived ones:
    BTIC1C (~ 2010):
    Was a modified version of RPZA with Deflate compression glued on.
    BTIC1H:
    Made use of multiple block formats,
    used STF+AdRice for entropy coding, and Paeth for color endpoints.
    Block formats, IIRC:
    4x4x2, 4x2x2, 2x4x2, 2x2x2, 4x4x1, 2x2x1, flat
    4x4x2: 32-bits for pixel selectors
    2x2x2: 8 bits for pixel selectors
    BTIC4B:
    Similar to BTIC1H, but a lot more complicated.
    Switched to 8x8 blocks, so had a whole lot of block formats.

    Shorter-Lived:
    BTIC2C: Similar design to MPEG;
    IIRC, used Huffman, but updated the Huffman tables for each I-Frame.
    This sort of thing being N/A with STF+AdRice,
    which starts from a clean slate every time.


    1C: Was used for animated textures in my first 3D engine.

    1H and 4B could be used for video, but were also used in my second 3D
    engine for sprites and textures (inside of a BMP packaging).


    My 3rd 3D engine is mostly using a mix of:
    DDS (mostly DXT1)
    BMP (mostly 16 color and 256 color).

    Though, in modern times, things like 16-color graphics are overlooked;
    in some cases they are still usable or useful (or at least sufficient).

    Typically, I had settled on a variant of the CGA/EGA color palette:
    0: 000000 (Black)
    1: 0000AA (Blue)
    2: 00AA00 (Green)
    3: 00AAAA (Cyan)
    4: AA0000 (Red)
    5: AA00AA (Magenta)
    6: AA5500 (Brown)
    7: AAAAAA (LightGray)
    8: 555555 (DarkGray)
    9: 5555FF (LightBlue)
    A: 55FF55 (LightGreen)
    B: 55FFFF (LightCyan)
    C: FF5555 (LightRed)
    D: FF55FF (Violet)
    E: FFFF55 (Yellow)
    F: FFFFFF (White)

    I am not sure why they changed the default 16-color assignments in VGA
    (e.g., in the Windows 256-color system palette). Like, IMO, 00/AA and
    55/FF work better for typical 16-color use-cases than 00/80 and 00/FF
    (see the sketch below).
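
    As a minimal sketch, the palette above can be generated from the 4-bit
    IRGB index using the 00/AA base levels, a 55 boost from the intensity
    bit, and the usual brown special case (the function name is
    illustrative):

    #include <stdint.h>

    static uint32_t rgbi_to_rgb24(int idx)
    {
        int i = (idx >> 3) & 1, r = (idx >> 2) & 1, g = (idx >> 1) & 1, b = idx & 1;
        uint8_t lo = i ? 0x55 : 0x00;   /* channel value when the bit is clear */
        uint8_t hi = i ? 0xFF : 0xAA;   /* channel value when the bit is set   */
        uint8_t R = r ? hi : lo, G = g ? hi : lo, B = b ? hi : lo;
        if (idx == 6) G = 0x55;         /* brown: dark yellow with halved green */
        return ((uint32_t)R << 16) | ((uint32_t)G << 8) | B;
    }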

    Sorta depends on use-case: Sometimes something works well as 16 colors,
    other times it would fall on its face.



    Most other designs sucked so bad they didn't get very far.

    Where, I had ended up categorizing designs:
    BTIC1x: Designs mostly following an RPZA like path.
    1C: RPZA + Deflate
    Mostly built on 4x4x2 blocks (32 bits).
    1D, 1E: Byte-Encoding + Deflate
    Both sucked, quickly dropped.
    Both were like RPZA but with 48-bit 4:2:0 blocks.
    Neither great compression nor particularly fast.
    Deflate carries a high computational overhead.
    1F, 1G: No entropy coding (back to being like RPZA)
    Major innovations: Variable-size pixel blocks.
    1H: STF+AdRice
    Mostly final state of 1x line.
    BTIC2x: Designs mostly influenced by JPEG and MPEG.
    Difficult to make particularly fast.
    1A/1B: Modified MJPEG IIRC.
    Technically, also based on my BTJPEG format (*1).
    2C: IIRC, MPEG-like, Huffman-coded.
    Well influenced by both MPEG and the Xiph Theora codec.
    2D: Like 2C, but STF+AdRice
    2E: Like 2C, but byte stream based
    Was trying, mostly in vain, to make it faster.
    My attempts at this style of codecs were mostly, too slow.
    2F: Goes back to a more JPEG like core in some ways.
    Entropy and VLN scheme borrows more from Deflate.
    Though, uses a shorter limit on max symbol length (13 bit).
    13 bit simplifies things and makes decoding faster vs 15 bit.
    Abandons DCT and YCbCr in favor of Block-Haar and RCT.
    Later, UPIC did similar, just with STF+AdRice versus Huffman.
    BTIC3x:
    Attempts to hybridize 1x and 2x
    Nothing implemented, all designs too complicated to bother with.
    BTIC4x:
    4A: RPZA-like but with 8x8 blocks and multiple block sizes.
    4B: Like 4A but reusing the encoding scheme from 1H.
    BTIC5x:
    5A: Resembled a CRAM/QOI hybrid, but with 8-bit indexed colors.
    No entropy coding.
    5B: Like 5A, but used differential RGB555 (still QOI like).
    Major innovation was to use a 6-bit 64-entry pattern table.
    Optionally, can use per-frame RP2 or TKuLZ compression.
    Used if doing so results in a significant savings.


    *1: BTJPEG was an attempt at making a more advanced image format based
    on tweaking the existing T.81 JPEG format in a way that sorta worked in existing decoders. The more widespread use (and "not totally dead"
    feature) being to allow for an embedded alpha channel as essentially
    another monochrome JPEG inside the APP11 marker.

    I had tried a bunch of other ideas, but it turned into a mess of
    experimental tweaks, and most of it died off. The surviving variant is basically just T.81+JFIF with an optional alpha channel (ignored by a non-aware JPEG decoder).

    Some other (mostly dead) tweaks were things like:
    Allowing multi-layered images (more like Paint.NET's PDN or GIMP's XCF,
    mostly by nesting the images like a Matryoshka doll), where the
    top-level image would contain a view of all the layers rendered together; Allowing lossless images (similar to PNG) by using SERMS-RDCT and RCT
    (where SERMS-RDCT was a trick to make the DCT/IDCT transform exactly reversible, at the cost of speed).


    In the early 2010s, I was pretty bad about massively over-engineering everything.

    Later on, some ideas were reused in 2F and UPIC.
    Though, 2F and UPIC were much less over-engineered.

    Did specify possible use as video codecs, but thus far both were used
    only as still image formats.

    The major goal for UPIC was mostly to address the core use-cases but
    also for the decoder to be small and relatively cheap. Still sorta JPEG
    competitive despite being primarily cost-optimized, to try to make it
    more viable for use in programs running on the BJX2 core (where JPEG
    decoding is slow and expensive).

    As for Static Huffman vs STF+AdRice:
    Huffman:
    + Slightly faster for larger payloads
    + Optimal for a static distribution
    - Higher memory cost for decoding (storing decoder tables)
    - High initial setup cost (setting up decoder tables)
    - Higher constant overhead (storing symbol lengths)
    - Need to provision for storing Huffman tables
    STF+AdRice:
    + Very cheap initial setup (minimal context)
    + No need to transmit tables
    + Better compression for small data
    + Significantly faster than Adaptive Huffman
    + Significantly faster than Range Coding
    - Slower for large data and worse compression vs Huffman.

    Where, STF+AdRice is mostly:
    Have a table of symbols;
    Whenever a symbol is encoded, swap it forwards;
    Next time, it may potentially be encoded with a smaller index.
    Encode indices into table using Adaptive Rice Codes.
    Or, basically, using a lookup table to allow AdRice to pretend to be
    Huffman. Also reasonably fast and simple.
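
    A minimal sketch of the scheme as described: a symbol table where each
    encoded symbol is swapped one step toward the front, with the resulting
    index emitted as a Rice code. The bit-output interface and the
    adaptation rule for the Rice parameter are hypothetical placeholders,
    not the actual BTIC/UPIC code:

    #include <stdint.h>

    static uint8_t stf_tab[256];        /* index -> symbol */
    static int     rice_k = 3;          /* current Rice parameter */

    extern void put_bit(int bit);       /* assumed bitstream writer */

    static void stf_init(void)
    {
        for (int i = 0; i < 256; i++) stf_tab[i] = (uint8_t)i;
    }

    static void rice_put(uint32_t v, int k)
    {
        for (uint32_t q = v >> k; q > 0; q--) put_bit(1);        /* unary quotient */
        put_bit(0);
        for (int i = k - 1; i >= 0; i--) put_bit((v >> i) & 1);  /* k-bit remainder */
    }

    static void stf_adrice_encode(uint8_t sym)
    {
        int idx = 0;
        while (stf_tab[idx] != sym) idx++;          /* index of symbol in table */
        rice_put((uint32_t)idx, rice_k);

        if (idx > 0) {                              /* swap one step forward */
            uint8_t t = stf_tab[idx - 1];
            stf_tab[idx - 1] = stf_tab[idx];
            stf_tab[idx] = t;
        }

        /* hypothetical adaptation: grow k after large indices, shrink on hits */
        if (((uint32_t)idx >> rice_k) > 1 && rice_k < 8) rice_k++;
        else if (idx == 0 && rice_k > 0)                 rice_k--;
    }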


    Block-Haar vs DCT:
    + Block-Haar is faster and easily reversible (lossless);
    + Mostly a drop-in replacement for DCT/IDCT in the design.
    + Also faster than WHT (Walsh-Hadamard Transform)

    RCT vs YCbCr:
    RCT is both slightly faster, and also reversible;
    Had experimented with YCoCg, but saw no real advantage over RCT.
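
    For illustration, a minimal sketch of a reversible Haar pair (in the
    usual lifting/S-transform form) and a JPEG 2000-style RCT; whether
    these match the exact variants used in 2F/UPIC is an assumption, and an
    arithmetic right shift (i.e., floor division) on negative values is
    assumed:

    static void haar_fwd(int a, int b, int *s, int *d)
    {
        *d = a - b;             /* difference */
        *s = b + (*d >> 1);     /* == floor((a + b) / 2) */
    }
    static void haar_inv(int s, int d, int *a, int *b)
    {
        *b = s - (d >> 1);
        *a = *b + d;            /* exact round trip, no loss */
    }

    static void rct_fwd(int r, int g, int b, int *y, int *cb, int *cr)
    {
        *y  = (r + 2 * g + b) >> 2;
        *cb = b - g;
        *cr = r - g;
    }
    static void rct_inv(int y, int cb, int cr, int *r, int *g, int *b)
    {
        *g = y - ((cb + cr) >> 2);
        *r = cr + *g;
        *b = cb + *g;
    }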



    The existence of BTIC5x was mostly because:
    BTIC1H and BTIC4B were too computationally demanding to do 320x200 16Hz
    on a 50MHz BJX2 core;

    MS-CRAM was fast to decode, but needed too much bitrate (SDcard couldn't
    keep the decoder fed with any semblance of image quality).


    So, 5A and 5B were aimed at trying to give tolerable Q/bpp at more
    CRAM-like decoding speeds.

    Also, while reasonably effective (and fast by desktop PC standards),
    one other drawback of the 4B design (and to a lesser degree 1H) was the
    design being overly complicated (and thus the code is large and bulky).

    Part of this was due to having too many block formats.


    If my UPIC format were put into my older naming scheme, would likely be
    called 2G. Design is kinda similar to 2F, but replaces Huffman with STF+AdRice.


    As for RP2 and TKuLZ:
    RP2 is a byte-oriented LZ77 variant, like LZ4,
    but on-average compresses slightly better than LZ4.
    TKuLZ: Is sorta like a simplified/tuned Deflate variant.
    Uses a shorter max symbol length,
    borrows some design elements from LZ4.

    Can note, some past experiments with LZ decompression (at Desktop PC
    speeds), with entropy scheme, and len/dist limits:
      LZMA   : ~   35 MB/sec  (Range Coding,   273 / 4GB)
      Zstd   : ~   60 MB/sec  (tANS,          16MB / 128MB)
      Deflate: ~  175 MB/sec  (Huffman,        258 / 32767)
      TKuLZ  : ~  300 MB/sec  (Huffman,      65535 / 262143)
      RP2    : ~ 1100 MB/sec  (Raw Bytes,      512 / 131071)
      LZ4    : ~ 1300 MB/sec  (Raw Bytes,    16383 / 65535)


    While Zstd is claimed to be fast, my testing tended to show it closer to
    LZMA speeds than to Deflate, but it does give compression closer to
    LZMA. The tANS strategy seems to under-perform claims IME (and is
    notably slower than static Huffman). Also it is the most complicated
    design among these.


    A lot of my older stuff used Deflate, but often Deflate wasn't fast
    enough, so has mostly gotten displaced by RP2 in my uses.

    TKuLZ is an intermediate: generally faster than Deflate, with an option
    to gain some speed (at the expense of compression) by using
    fixed-length symbols in some cases. This can push it to around 500
    MB/sec, but it is hard to get much faster (or anywhere near RP2 or
    LZ4).

    Whether RP2 or LZ4 is faster seems to depend on target:
    BJX2 Core, RasPi, and Piledriver: RP2 is faster.
    Mostly things with in-order cores.
    And Piledriver, which behaved almost more like an in-order machine.
    Zen+, Core 2, and Core i7: LZ4 is faster.

    LZ4 typically needs multiple chained memory accesses for each LZ run,
    whereas for RP2, match length/distance and raw count are typically all available via a single memory load (then maybe a few bit-tests and
    conditional branches).

    ...



    A while ago I wrote a set of graphics routines in assembler that were
    quite fast. One format I have delt with is the .flic file format used to render animated graphics. I wanted to write my own CIV style game. It
    took a little bit of research and some reverse engineering. Apparently,
    the authors used a modified version of the format making it difficult to
    use the CIV graphics in my own game. I never could get it to render as
    fast as the game’s engine. I wrote the code for my game in C or C++, the original’s game engine code was likely in a different language.


    This sort of thing is almost inevitable with this stuff.

    Usually I just ended up using C for nearly everything.


    *****

    Been working on vectors for the ISA. I split the vector length register
    into eight sections to define up to eight different vector lengths. The first five are defined for integer, float, fixed, character, and address data types. I figure one may want to use vectors of different lengths at
    the same time, for instance to address data using byte offsets, while
    the data itself might be a float. The vector load / store instructions accept a data type to load / store and always use the address type for address calculations.

    There is also a vector lane size register split up the same way. I had thought of giving each vector register its own format for length and
    lane size. But thought that is a bit much, with limited use cases.

    I think I can get away with only two load and two store instructions.
    One to do a strided load and a second to do an vector indexed load (gather/scatter). The addressing mode in use is d[Rbase+Rindex*Scale].
    Where Rindex is used as the stride when scalar or as a supplier of the
    lane offset when Rindex is a vector.

    Writing the RTL code to support the vector memory ops has been
    challenging. Using a simple approach ATM. The instruction needs to be re-issued for each vector lane accessed. Unaligned vector loads and
    stores are also allowed, adding some complexity when the operation
    crosses a cache-line boundary.

    I have the max vector length and max vector size constants returned by
    the GETINFO instruction which returns CPU specific information.


    I don't get it...

    Usually makes sense to treat vectors as opaque blobs of bits that are
    then interpreted as one of the available formats for a specific operation.

    In my case, I have a SIMD setup:
    2 or 4 elements in a GPR or GPR pair;
    Most other operations are just the normal GPR operations.

    ...


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sat Nov 22 18:50:01 2025
    From Newsgroup: comp.arch

    On Fri, 21 Nov 2025 13:36:05 -0600
    BGB <cr88192@gmail.com> wrote:

    On 11/21/2025 7:31 AM, Michael S wrote:
    On Thu, 13 Nov 2025 19:04:18 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    Not really.
    That is, conversions are not blazingly fast, but still much better
    than any attempt to divide in any form of decimal. And helps to
    preserve your sanity.

    Are you trying to pull our proverbial leg here ?!?


    After reading paragraph 5.2 of IEEE-754-2008 Standard I am less
    sure in correctness of my above statement.
    For the case of exact division, preservation of mental sanity during fulfillment of requirements of this paragraph is far from simple, regardless of numeric base used in the process.


    One effectively needs to do a special extra-wide divide rather than
    just a normal integer divide, etc.



    It seems you are talking about the case of inexact division
    (rem(num*10**scale, den) != 0). I don't consider it harmful for sanity.

    It is the opposite case that I find stressful.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Nov 22 12:45:57 2025
    From Newsgroup: comp.arch

    On 2025-11-22 5:54 a.m., BGB wrote:
    On 11/21/2025 9:09 PM, Robert Finch wrote:
    On 2025-11-21 2:36 p.m., BGB wrote:
    On 11/21/2025 7:31 AM, Michael S wrote:
    On Thu, 13 Nov 2025 19:04:18 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    Not really.
    That is, conversions are not blazingly fast, but still much better
    than any attempt to divide in any form of decimal. And helps to
    preserve your sanity.

    Are you trying to pull our proverbial leg here ?!?


    After reading paragraph 5.2 of IEEE-754-2008 Standard I am less sure in
    correctness of my above statement.
    For the case of exact division, preservation of mental sanity during
    fulfillment of requirements of this paragraph is far from simple,
    regardless of numeric base used in the process.


    One effectively needs to do a special extra-wide divide rather than
    just a normal integer divide, etc.


    But, yeah, fastest I had gotten in my experiments was radix-10e9
    long- division, but still not the fastest option.

    So, rough ranking, fast to slow:
       Radix-10e9 Long Divide (fastest)
       Newton-Raphson
       Radix-10 Long Divide
       Integer Shift-Subtract with converters (slowest).
         Fastest converter strategy ATM:
           Radix-10e9 double-dabble (Int->Dec).
           MUL-by-10e9 and ADD (Dec->Int)
             Fastest strategy: Unrolled Shifts and ADDs (*1).


    *1: While it is possible to perform a 128-bit multiply decomposing
    into multiplying 32-bit parts and adding them together; it was
    working out slightly faster in this case to do a fixed multiply by
    decomposing it into a series of explicit shifts and ADDs.

    Though, in this case, it is faster (and less ugly) to decompose this
    into a pattern of iteratively multiplying by smaller amounts. I had
    ended up using 4x multiply by 100 followed by multiply by 10, as
    while not the fastest strategy, needs less code than 2x multiply by
    10000 + multiply by 10. Most other patterns would need more shifts
    and adds.

    In theory, x86-64 could do it better with multiply ops, but getting
    something optimal out of the C compilers is a bigger issue here it
    seems.


    Unexplored options:
       Radix 10e2 (byte)
       Radix 10e3 (word)
       Radix 10e4 (word)

    Radix 10e3 could have the closest to direct mapping to DPD.


    Looking at the decNumber code, it appears also to be Radix-10e9 based.
    They also do significant (ab)use of the C preprocessor.

    Apparently, "Why use functions when you can use macros?"...


    For the Radix-10e9 long-divide, part of the magic was in the function
    to scale a value by a radix value and subtract it from another array.

    Ended up trying a few options, fastest was to temporarily turn the
    operation into non-normalized 64-bit pieces and then normalize the
    result (borrow propagation, etc) as an output step.

    Initial attempt kept it normalized within the operation, which was
    slower.

    It was seemingly compiler-dependent whether it was faster to do a
    combined operation, or separate scale and subtract, but the margins
    were small. On MSVC the combined operation was slightly faster than
    the separate operations.

    ...



    Otherwise, after this, just went and fiddled with BGBCC some more,
    adding more options for its resource converter.

    Had before (for image formats):
       In: TGA, BMP (various), PNG, QOI, UPIC
       Out: BMP (various), QOI, UPIC

    Added (now):
       In: PPM, JPG, DDS
       Out: PNG, JPG, DDS (DXT1 and DXT5)

    Considered (not added yet):
       PCX
    Evaluated PCX, possible but not a clear win.


    Fiddled with making the PNG encoder less slow, mostly this was
    tweaking some parameters for the LZ searches. Initial settings were
    using deeper searches over initially smaller sliding windows (at
    lower compression levels); better in this case to do a shallower
    search over a max-sized sliding window.

    ATM, speed of PNG is now on-par with the JPG encoder (still one of
    the slower options).

    For simple use-cases, PNG still loses (in terms of both speed and
    compression) to 16-color BMP + LZ compression (LZ4 or RP2).
    Theoretically, indexed-color PNG exists, but is less widely supported.

    It is less space-efficient to represent 16-colors as Deflate-
    compressed color differences than it is to just represent the 4-bit
    RGBI values directly.

    However, can note that the RLE compression scheme (used by PCX) is
    clearly inferior to that of any sort of LZ compression.


    Comparably, PNG is also a more expensive format to decode as well
    (even vs JPEG).


    UPIC can partly address the use-cases of both PNG and JPEG while
    being cheaper to decode than either, but more niche as pretty much
    nothing supports it. Some of its design and properties being mostly
    JPEG-like.

    QOI is interesting, but suffers some similar limitations to PCX (its
    design is mostly about more compactly encoding color-differences in
    true-color images and otherwise only offers RLE compression).

    QOI is not particularly effective against images with little variety
    in color variation but lots of repeating patterns (I have a modified
    QOI that does a little better here, still not particularly effective
    with 16-color graphics though).


    Otherwise, also added up adding a small text format for image drawing
    commands.

    As a simplistic line oriented format containing various commands to
    perform drawing operations or composite images.
       creating a "canvas"
       setting the working color
       drawing lines
       bucket fill
       drawing text strings
       overlaying other images
       ...


    This is maybe (debatable) outside the scope of a C compiler, but
    could have use-cases for preparing resource data (nevermind if scope
    creep is partly also turning it into an asset-packer tool; where it
    is useful to make graphics/sounds/etc in one set of formats and then
    process and convert them into another set of files, usually inside of
    some sort of VFS image or similar).

    Design is much more simplistic than something like SVG and I am
    currently assuming its use for mostly hand-edited files. Unlike SVG,
    it also assumes drawing to a pixel grid rather than some more
    abstract coordinate space (so, its abstract model is more like "MS
    Paint" or similar); also SVG would suck as a human-edited format.

    Granted, one could argue maybe it could make scope that asset-
    processing is its own tool, then one converts it to a format that the
    compiler accepts (WAD2 or WAD4 in this case) prior to compiling the
    main binary (and/or not use resource data).

    Still, IMO, an internal WAD image is still better than the horrid/
    unusable mess that Windows had used (where anymore most people don't
    bother with the resource section much more than storing a program
    icon or similar...).

    But, realistically, one does still want to limit how much data they
    stick into the EXE.

    ...


    My forays into the world of graphics formats are pretty limited. I
    tend to use libraries already written by other people. I assume people
    a lot brighter than myself have come up with them.


    I usually wrote my own code for most things.

    Not dealt much with FLIC.


    In the past, whenever doing animated stuff, had usually used the AVI
    file format. A lot of time, the codecs were custom.

    Both AVI (and BMP) can be used to hold a wide range of image data,
    partly as a merit of using FOURCCs.

    Over the course of the past 15 years, have fiddled a lot here.



    A few of the longer-lived ones:
      BTIC1C (~ 2010):
        Was a modified version of RPZA with Deflate compression glued on.
      BTIC1H:
        Made use of multiple block formats,
          used STF+AdRice for entropy coding, and Paeth for color endpoints.
        Block formats, IIRC:
          4x4x2, 4x2x2, 2x4x2, 2x2x2, 4x4x1, 2x2x1, flat
            4x4x2: 32-bits for pixel selectors
            2x2x2: 8 bits for pixel selectors
      BTIC4B:
        Similar to BTIC1H, but a lot more complicated.
        Switched to 8x8 blocks, so had a whole lot of block formats.

    Shorter-Lived:
      BTIC2C: Similar design to MPEG;
      IIRC, used Huffman, but updated the Huffman tables for each I-Frame.
        This sort of thing being N/A with STF+AdRice,
          which starts from a clean slate every time.


    1C: Was used for animated textures in my first 3D engine.

    1H and 4B could be used for video, but were also used in my second 3D
    engine for sprites and textures (inside of a BMP packaging).


    My 3rd 3D engine is mostly using a mix of:
      DDS (mostly DXt1)
      BMP (mostly 16 color and 256 color).

    Though, in modern times, things like 16-color graphics are overlooked,
    in some cases they are still usable or useful (or at least sufficient).

    Typically, I had settled on a variant of the CGA/EGA color palette:
      0: 000000 (Black)
      1: 0000AA (Blue)
      2: 00AA00 (Green)
      3: 00AAAA (Cyan)
      4: AA0000 (Red)
      5: AA00AA (Magenta)
      6: AA5500 (Brown)
      7: AAAAAA (LightGray)
      8: 555555 (DarkGray)
      9: 5555FF (LightBlue)
      A: 55FF55 (LightGreen)
      B: 55FFFF (LightCyan)
      C: FF5555 (LightRed)
      D: FF55FF (Violet)
      E: FFFF55 (Yellow)
      F: FFFFFF (White)

    I am not sure why they changed it for the default 16-color assignments
    in VGA (eg, in the Windows 256-color system palette). Like, IMO, 00/AA
    and 55/FF work better for typical 16-color use-cases than 00/80 and 00/FF.
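
    For reference, the same palette as a C table of 0xRRGGBB values (a
    direct transcription of the list above; the array name is just
    illustrative):

      static const unsigned int pal_cga16[16] = {
          0x000000, 0x0000AA, 0x00AA00, 0x00AAAA, /* Blk, Blue, Grn, Cyan  */
          0xAA0000, 0xAA00AA, 0xAA5500, 0xAAAAAA, /* Red, Mag, Brown, LGry */
          0x555555, 0x5555FF, 0x55FF55, 0x55FFFF, /* DGry, LBlu, LGrn, LCyn*/
          0xFF5555, 0xFF55FF, 0xFFFF55, 0xFFFFFF  /* LRed, Violet, Yel, Wht*/
      };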

    Sorta depends on use-case: Sometimes something works well as 16 colors, other times it would fall on its face.



    Most other designs sucked so bad they didn't get very far.

    Where, I had ended up categorizing designs:
      BTIC1x: Designs mostly following an RPZA like path.
        1C: RPZA + Deflate
          Mostly built on 4x4x2 blocks (32 bits).
        1D, 1E: Byte-Encoding + Deflate
          Both sucked, quickly dropped.
          Both were like RPZA but with 48-bit 4:2:0 blocks.
          Neither great compression nor particularly fast.
            Deflate carries a high computational overhead.
        1F, 1G: No entropy coding (back to being like RPZA)
          Major innovations: Variable-size pixel blocks.
        1H: STF+AdRice
          Mostly final state of 1x line.
      BTIC2x: Designs mostly influenced by JPEG and MPEG.
        Difficult to make particularly fast.
        1A/1B: Modified MJPEG IIRC.
          Technically, also based on my BTJPEG format (*1).
        2C: IIRC, MPEG-like, Huffman-coded.
          Was influenced by both MPEG and the Xiph Theora codec.
        2D: Like 2C, but STF+AdRice
        2E: Like 2C, but byte stream based
          Was trying, mostly in vain, to make it faster.
          My attempts at this style of codec were mostly too slow.
        2F: Goes back to a more JPEG like core in some ways.
          Entropy and VLN scheme borrows more from Deflate.
            Though, uses a shorter limit on max symbol length (13 bit).
            13 bit simplifies things and makes decoding faster vs 15 bit.
          Abandons DCT and YCbCr in favor of Block-Haar and RCT.
            Later, UPIC did similar, just with STF+AdRice versus Huffman.
      BTIC3x:
        Attempts to hybridize 1x and 2x
        Nothing implemented, all designs too complicated to bother with.
      BTIC4x:
        4A: RPZA-like but with 8x8 blocks and multiple block sizes.
        4B: Like 4A but reusing the encoding scheme from 1H.
      BTIC5x:
        5A: Resembled a CRAM/QOI hybrid, but with 8-bit indexed colors.
          No entropy coding.
        5B: Like 5A, but used differential RGB555 (still QOI like).
          Major innovation was to use a 6-bit 64-entry pattern table.
          Optionally, can use per-frame RP2 or TKuLZ compression.
            Used if doing so results in a significant savings.


    *1: BTJPEG was an attempt at making a more advanced image format based
    on tweaking the existing T.81 JPEG format in a way that sorta worked in existing decoders. The more widespread use (and "not totally dead"
    feature) being to allow for an embedded alpha channel as essentially
    another monochrome JPEG inside the APP11 marker.

    I had tried a bunch of other ideas, but it turned into a mess of experimental tweaks, and most of it died off. The surviving variant is basically just T.81+JFIF with an optional alpha channel (ignored by a non-aware JPEG decoder).

    Some other (mostly dead) tweaks were things like:
    Allowing multi-layered images (more like Paint.NET's PDN or GIMP's XCF, mostly by nesting the images like a Matryoshka doll), where the top-
    level image would contain a view of all the layers rendered together; Allowing lossless images (similar to PNG) by using SERMS-RDCT and RCT
    (where SERMS-RDCT was a trick to make the DCT/IDCT transform exactly reversible, at the cost of speed).


    In the early 2010s, I was pretty bad about massively over-engineering everything.

    Later on, some ideas were reused in 2F and UPIC.
    Though, 2F and UPIC were much less over-engineered.

    Did specify possible use as video codecs, but thus far both were used
    only as still image formats.

    The major goal for UPIC was mostly to address the core use-cases but
    also for the decoder to be small and relatively cheap. Still sorta
    JPEG-competitive despite being primarily cost-optimized to try to make
    it more viable for use in programs running on the BJX2 core (where
    JPEG decoding is slow and expensive).

    As for Static Huffman vs STF+AdRice:
      Huffman:
        + Slightly faster for larger payloads
        + Optimal for a static distribution
        - Higher memory cost for decoding (storing decoder tables)
        - High initial setup cost (setting up decoder tables)
        - Higher constant overhead (storing symbol lengths)
        - Need to provision for storing Huffman tables
      STF+AdRice:
        + Very cheap initial setup (minimal context)
        + No need to transmit tables
        + Better compression for small data
        + Significantly faster than Adaptive Huffman
        + Significantly faster than Range Coding
        - Slower for large data and worse compression vs Huffman.

    Where, STF+AdRice is mostly:
      Have a table of symbols;
      Whenever a symbol is encoded, swap it forwards;
        Next time, it may potentially be encoded with a smaller index.
      Encode indices into table using Adaptive Rice Codes.
    Or, basically, using a lookup table to allow AdRice to pretend to be Huffman. Also reasonably fast and simple.
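
    As a rough illustration of the swap-towards-front part (a minimal
    sketch, assuming a single-position swap and leaving the adaptive Rice
    index coder itself elided; names are illustrative, not the actual
    implementation):

      #include <stdint.h>

      typedef struct { uint8_t table[256]; } StfCtx;

      static void stf_init(StfCtx *ctx)
      {
          for (int i = 0; i < 256; i++)
              ctx->table[i] = (uint8_t)i;   /* clean slate: identity map */
      }

      /* 'idx' is the table index as decoded by the adaptive Rice coder. */
      static uint8_t stf_decode_symbol(StfCtx *ctx, int idx)
      {
          uint8_t sym = ctx->table[idx];
          if (idx > 0) {
              /* swap forwards, so a repeated symbol gets a smaller
                 (cheaper to code) index next time */
              ctx->table[idx]     = ctx->table[idx - 1];
              ctx->table[idx - 1] = sym;
          }
          return sym;
      }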


    Block-Haar vs DCT:
      + Block-Haar is faster and easily reversible (lossless);
      + Mostly a drop-in replacement for DCT/IDCT in the design.
      + Also faster than WHT (Walsh-Hadamard Transform)
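
    For context, a minimal sketch of one reversible (lifting-style) Haar
    step on an integer pair, the kind of primitive such a block transform
    can be built from (a generic illustration, not the exact formulation
    used in 2F/UPIC; assumes arithmetic right shift for negative values,
    as on typical compilers):

      /* forward: split a pair into an integer "average" and difference */
      static void haar_fwd(int a, int b, int *s, int *d)
      {
          *d = a - b;
          *s = b + (*d >> 1);   /* together with d, exactly invertible */
      }

      /* inverse: reconstructs the original pair exactly */
      static void haar_inv(int s, int d, int *a, int *b)
      {
          *b = s - (d >> 1);
          *a = *b + d;
      }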

    RCT vs YCbCr:
      RCT is both slightly faster, and also reversible;
      Had experimented with YCoCg, but saw no real advantage over RCT.
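
    And a corresponding sketch of a reversible color transform, assuming
    the JPEG 2000-style RCT is the flavor meant here (exactly invertible
    in integer arithmetic, unlike conventional YCbCr; again assumes
    arithmetic right shift):

      static void rct_fwd(int r, int g, int b, int *y, int *cb, int *cr)
      {
          *y  = (r + 2 * g + b) >> 2;
          *cb = b - g;
          *cr = r - g;
      }

      static void rct_inv(int y, int cb, int cr, int *r, int *g, int *b)
      {
          *g = y - ((cb + cr) >> 2);
          *r = cr + *g;
          *b = cb + *g;
      }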



    The existence of BTIC5x was mostly because:
    BTIC1H and BTIC4B were too computationally demanding to do 320x200 16Hz
    on a 50MHz BJX2 core;

    MS-CRAM was fast to decode, but needed too much bitrate (SDcard couldn't keep the decoder fed with any semblance of image quality).


    So, 5A and 5B were aimed at trying to give tolerable Q/bpp at more CRAM-like decoding speeds.

    Also, while reasonably effective (and fast by desktop PC standards),
    one other drawback of the 4B design (and to a lesser degree 1H) was
    the design being overly complicated (and thus the code is large and
    bulky).

    Part of this was due to having too many block formats.


    If my UPIC format were put into my older naming scheme, would likely be called 2G. Design is kinda similar to 2F, but replaces Huffman with STF+AdRice.


    As for RP2 and TKuLZ:
      RP2 is a byte-oriented LZ77 variant, like LZ4,
        but on-average compresses slightly better than LZ4.
      TKuLZ: Is sorta like a simplified/tuned Deflate variant.
        Uses a shorter max symbol length,
          borrows some design elements from LZ4.

    Can note, some past experiments with LZ decompression (at Desktop PC speeds), with entropy scheme, and len/dist limits:
      LZMA   : ~   35 MB/sec (Range Coding,   273/   4GB)
      Zstd   : ~   60 MB/sec (tANS,          16MB/ 128MB)
      Deflate: ~  175 MB/sec (Huffman,        258/ 32767)
      TKuLZ  : ~  300 MB/sec (Huffman,      65535/262143)
      RP2    : ~ 1100 MB/sec (Raw Bytes,      512/131071)
      LZ4    : ~ 1300 MB/sec (Raw Bytes,    16383/ 65535)


    While Zstd is claimed to be fast, my testing tended to show it closer to LZMA speeds than to Deflate, but it does give compression closer to
    LZMA. The tANS strategy seems to under-perform claims IME (and is
    notably slower than static Huffman). Also it is the most complicated
    design among these.


    A lot of my older stuff used Deflate, but often Deflate wasn't fast
    enough, so has mostly gotten displaced by RP2 in my uses.

    TKuLZ is an intermediate: generally faster than Deflate, with an
    option to gain some speed by using fixed-length symbols in some
    cases. This can push it to around 500 MB/sec (at the expense of
    compression), but it is hard to get much faster (or anywhere near RP2
    or LZ4).

    Whether RP2 or LZ4 is faster seems to depend on target:
      BJX2 Core, RasPi, and Piledriver: RP2 is faster.
        Mostly things with in-order cores.
        And Piledriver, which behaved almost more like an in-order machine.
      Zen+, Core 2, and Core i7: LZ4 is faster.

    LZ4 needs typically multiple chained memory accesses for each LZ run, whereas for RP2, match length/distance and raw count are typically all available via a single memory load (then maybe a few bit-tests and conditional branches).
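
    To illustrate the difference (the LZ4 block format is public; the
    RP2-style tag layout below is purely hypothetical, just showing the
    "one load gets everything" idea, not the actual RP2 bit layout):

      #include <stdint.h>
      #include <string.h>

      /* LZ4-style: token, optional length bytes, literals, a 16-bit
         offset, then optional match-length bytes: dependent loads. */
      static const uint8_t *lz4_parse_seq(const uint8_t *src,
                                          unsigned *nlit, unsigned *mlen,
                                          unsigned *mdist)
      {
          unsigned tok = *src++, n = tok >> 4, b;
          if (n == 15) do { b = *src++; n += b; } while (b == 255);
          *nlit = n;
          src += n;                          /* skip literals (sketch)  */
          *mdist = src[0] | (src[1] << 8);   /* little-endian offset    */
          src += 2;
          n = (tok & 15) + 4;
          if ((tok & 15) == 15) do { b = *src++; n += b; } while (b == 255);
          *mlen = n;
          return src;
      }

      /* Hypothetical single-load tag in the spirit of the RP2
         description: raw count, match length, and distance packed
         into one 32-bit word. */
      static void rp2_style_tag(const uint8_t *src, unsigned *nraw,
                                unsigned *mlen, unsigned *mdist)
      {
          uint32_t tag;
          memcpy(&tag, src, 4);              /* one load gets everything */
          *nraw  =  tag        & 0x1F;
          *mlen  = (tag >>  5) & 0x1FF;
          *mdist =  tag >> 14;
      }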

    ...



    A while ago I wrote a set of graphics routines in assembler that were
    quite fast. One format I have dealt with is the .flic file format used
    to render animated graphics. I wanted to write my own CIV style game.
    It took a little bit of research and some reverse engineering.
    Apparently, the authors used a modified version of the format making
    it difficult to use the CIV graphics in my own game. I never could get
    it to render as fast as the game’s engine. I wrote the code for my
    game in C or C++; the original game’s engine code was likely in a
    different language.


    This sort of thing is almost inevitable with this stuff.

    Usually I just ended up using C for nearly everything.


    *****

    Been working on vectors for the ISA. I split the vector length
    register into eight sections to define up to eight different vector
    lengths. The first five are defined for integer, float, fixed,
    character, and address data types. I figure one may want to use
    vectors of different lengths at the same time, for instance to address
    data using byte offsets, while the data itself might be a float. The
    vector load / store instructions accept a data type to load / store
    and always use the address type for address calculations.

    There is also a vector lane size register split up the same way. I had
    thought of giving each vector register its own format for length and
    lane size, but thought that was a bit much, with limited use cases.

    I think I can get away with only two load and two store instructions.
    One to do a strided load and a second to do a vector indexed load
    (gather/scatter). The addressing mode in use is d[Rbase+Rindex*Scale].
    Where Rindex is used as the stride when scalar or as a supplier of the
    lane offset when Rindex is a vector.
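
    In C terms, the two forms described above would look roughly like the
    following (a sketch only; 'vl', 'scale', and the 64-bit element type
    are placeholders rather than actual Qupls definitions):

      #include <stdint.h>
      #include <stddef.h>

      /* scalar Rindex: acts as the stride */
      static void vload_strided(int64_t *dst, const uint8_t *base,
                                int64_t stride, int scale, int vl)
      {
          for (int i = 0; i < vl; i++)
              dst[i] = *(const int64_t *)(base +
                           (ptrdiff_t)i * stride * scale);
      }

      /* vector Rindex: supplies a per-lane offset (gather) */
      static void vload_indexed(int64_t *dst, const uint8_t *base,
                                const int64_t *index, int scale, int vl)
      {
          for (int i = 0; i < vl; i++)
              dst[i] = *(const int64_t *)(base + index[i] * scale);
      }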

    Writing the RTL code to support the vector memory ops has been
    challenging. Using a simple approach ATM. The instruction needs to be
    re-issued for each vector lane accessed. Unaligned vector loads and
    stores are also allowed, adding some complexity when the operation
    crosses a cache-line boundary.

    I have the max vector length and max vector size constants returned by
    the GETINFO instruction which returns CPU specific information.


    I don't get it...

    Usually makes sense to treat vectors as opaque blobs of bits that are
    then interpreted as one of the available formats for a specific operation.

    In my case, I have a SIMD setup:
      2 or 4 elements in a GPR or GPR pair;
      Most other operations are just the normal GPR operations.

    ...


    Many vector machines (e.g. RISC-V V) have a way of specifying the vector
    length and element size, but it tends to be a global setting which may
    in some cases be overridden by the instruction. For Qupls, it also
    allows setting based on the data type, which is a bit of a misnomer; it
    would be better named data format. It is just three bits in the
    instruction that select one of the fields in the VLEN and VELSZ
    registers. The instruction itself specifies the data type for the
    operation on an opaque bag of bits. It is possible to encode selecting
    the integer size fields, then perform a float operation on the data.

    The size agnostic instructions use the micro-op translator to convert
    the instructions into size specific versions. The translator calculates
    the number of architectural registers required then puts the appropriate number of instructions (up to eight) in the micro-op queue.

    There are therefore lots of vector instructions in the ISA: SIMD-type
    instructions, where the size of a vector is assumed to be one register
    and the element size is specified by the instruction, so separate
    instructions for 1, 2, 4, or 8 elements (for example, 50 instructions *
    four different sizes = 200 instructions); and also size-agnostic
    instructions, where the size/format comes indirectly from the VLEN
    (vector length) and VELSZ (vector lane size) registers.

    The size agnostic instructions allow writing a generic vector routine
    without needing to code the size of the operation. This avoids having a
    switch statement with a whole bunch of cases for different vector
    lengths. It also avoids having thousands of vector instructions. (50 instructions * 5 different lane sizes * 64 different lengths).

    The vectors are opaque blobs of bytes in my case. Size specs are in
    terms of bytes. The vectors are not a fixed length. They may (currently)
    use from 0 to 8 GPR registers. Hence the need to specify the length in
    use. While the length could be specified as part of the format for the instruction, that would require a wide instruction.

    *****

    .flic file format is supposed to be fast enough to allow use “on the
    fly”. But I just decompress all the frames into a matrix of bitmaps at
    game startup, then select the appropriate one based on direction and
    timing. With dozens of different sprites and hundreds of frames, I think
    it takes about 3GB of memory just for the sprite data. I had trouble
    running this on my machine a few years ago, but maybe with newer
    technology it could work.

    Experimented some with LZ4 and Huffman encoding. Huffman used for ECC logic.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From BGB@cr88192@gmail.com to comp.arch on Sat Nov 22 14:29:23 2025
    From Newsgroup: comp.arch

    On 11/22/2025 11:45 AM, Robert Finch wrote:
    On 2025-11-22 5:54 a.m., BGB wrote:
    On 11/21/2025 9:09 PM, Robert Finch wrote:
    On 2025-11-21 2:36 p.m., BGB wrote:
    On 11/21/2025 7:31 AM, Michael S wrote:
    On Thu, 13 Nov 2025 19:04:18 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Michael S <already5chosen@yahoo.com> posted:

    Not really.
    That is, conversions are not blazingly fast, but still much better
    than any attempt to divide in any form of decimal. And helps to
    preserve your sanity.

    Are you trying to pull our proverbial leg here ?!?


    After reading paragraph 5.2 of IEEE-754-2008 Standard I am less
    sure in
    correctness of my above statement.
    For the case of exact division, preservation of mental sanity during
    fulfillment of requirements of this paragraph is far from simple,
    regardless of numeric base used in the process.


    One effectively needs to do a special extra-wide divide rather than
    just a normal integer divide, etc.


    But, yeah, fastest I had gotten in my experiments was radix-10e9
    long-division, but still not the fastest option.

    So, rough ranking, fast to slow:
       Radix-10e9 Long Divide (fastest)
       Newton-Raphson
       Radix-10 Long Divide
       Integer Shift-Subtract with converters (slowest).
         Fastest converter strategy ATM:
           Radix-10e9 double-dabble (Int->Dec).
           MUL-by-10e9 and ADD (Dec->Int)
             Fastest strategy: Unrolled Shifts and ADDs (*1).


    *1: While it is possible to perform a 128-bit multiply decomposing
    into multiplying 32-bit parts and adding them together; it was
    working out slightly faster in this case to do a fixed multiply by
    decomposing it into a series of explicit shifts and ADDs.

    Though, in this case, it is faster (and less ugly) to decompose this
    into a pattern of iteratively multiplying by smaller amounts. I had
    ended up using 4x multiply by 100 followed by multiply by 10, as
    while not the fastest strategy, needs less code than 2x multiply by
    10000 + multiply by 10. Most other patterns would need more shifts
    and adds.
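
    As a concrete example of one such step, multiplying a 128-bit value by
    10 using shifts and adds (x*10 = (x<<3) + (x<<1)); a minimal sketch
    over a hi/lo pair of 64-bit words, whereas the real code presumably
    works over a larger digit array and chains the *100 and *10 steps as
    described:

      #include <stdint.h>

      static void mul10_u128(uint64_t *hi, uint64_t *lo)
      {
          uint64_t h = *hi, l = *lo;
          uint64_t h8 = (h << 3) | (l >> 61), l8 = l << 3;  /* x << 3 */
          uint64_t h2 = (h << 1) | (l >> 63), l2 = l << 1;  /* x << 1 */
          uint64_t ls = l8 + l2;
          *hi = h8 + h2 + (ls < l8);          /* propagate the carry    */
          *lo = ls;
      }

    A multiply by 100 is the same idea with (x<<6) + (x<<5) + (x<<2).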

    In theory, x86-64 could do it better with multiply ops, but getting
    something optimal out of the C compilers is a bigger issue here it
    seems.


    Unexplored options:
       Radix 10e2 (byte)
       Radix 10e3 (word)
       Radix 10e4 (word)

    Radix 10e3 could have the closest to direct mapping to DPD.


    Looking at the decNumber code, it appears also to be Radix-10e9 based.
    They also do significant (ab)use of the C preprocessor.

    Apparently, "Why use functions when you can use macros?"...


    For the Radix-10e9 long-divide, part of the magic was in the
    function to scale a value by a radix value and subtract it from
    another array.

    Ended up trying a few options, fastest was to temporarily turn the
    operation into non-normalized 64-bit pieces and then normalize the
    result (borrow propagation, etc) as an output step.

    Initial attempt kept it normalized within the operation, which was
    slower.

    It was seemingly compiler-dependent whether it was faster to do a
    combined operation, or separate scale and subtract, but the margins
    were small. On MSVC the combined operation was slightly faster than
    the separate operations.

    ...



    Otherwise, after this, just went and fiddled with BGBCC some more,
    adding more options for its resource converter.

    Had before (for image formats):
       In: TGA, BMP (various), PNG, QOI, UPIC
       Out: BMP (various), QOI, UPIC

    Added (now):
       In: PPM, JPG, DDS
       Out: PNG, JPG, DDS (DXT1 and DXT5)

    Considered (not added yet):
       PCX
    Evaluated PCX, possible but not a clear win.


    Fiddled with making the PNG encoder less slow, mostly this was
    tweaking some parameters for the LZ searches. Initial settings were
    using deeper searches over initially smaller sliding windows (at
    lower compression levels); better in this case to do a shallower
    search over a max-sized sliding window.

    ATM, speed of PNG is now on-par with the JPG encoder (still one of
    the slower options).

    For simple use-cases, PNG still loses (in terms of both speed and
    compression) to 16-color BMP + LZ compression (LZ4 or RP2).
    Theoretically, indexed-color PNG exists, but is less widely supported.
    It is less space-efficient to represent 16-colors as Deflate-
    compressed color differences than it is to just represent the 4-bit
    RGBI values directly.

    However, can note that the RLE compression scheme (used by PCX) is
    clearly inferior to that of any sort of LZ compression.


    Comparably, PNG is also a more expensive format to decode as well
    (even vs JPEG).


    UPIC can partly address the use-cases of both PNG and JPEG while
    being cheaper to decode than either, but more niche as pretty much
    nothing supports it. Some of its design and properties being mostly
    JPEG-like.

    QOI is interesting, but suffers some similar limitations to PCX (its
    design is mostly about more compactly encoding color-differences in
    true-color images and otherwise only offers RLE compression).

    QOI is not particularly effective against images with little variety
    in color variation but lots of repeating patterns (I have a modified
    QOI that does a little better here, still not particularly effective
    with 16-color graphics though).


    Otherwise, also added up adding a small text format for image
    drawing commands.

    As a simplistic line oriented format containing various commands to
    perform drawing operations or composite images.
       creating a "canvas"
       setting the working color
       drawing lines
       bucket fill
       drawing text strings
       overlaying other images
       ...


    This is maybe (debatable) outside the scope of a C compiler, but
    could have use-cases for preparing resource data (nevermind if scope
    creep is partly also turning it into an asset-packer tool; where it
    is useful to make graphics/sounds/etc in one set of formats and then
    process and convert them into another set of files, usually inside
    of some sort of VFS image or similar).

    Design is much more simplistic than something like SVG and I am
    currently assuming its use for mostly hand-edited files. Unlike SVG,
    it also assumes drawing to a pixel grid rather than some more
    abstract coordinate space (so, its abstract model is more like "MS
    Paint" or similar); also SVG would suck as a human-edited format.

    Granted, one could argue maybe it could make scope that asset-
    processing is its own tool, then one converts it to a format that
    the compiler accepts (WAD2 or WAD4 in this case) prior to compiling
    the main binary (and/or not use resource data).

    Still, IMO, an internal WAD image is still better than the horrid/
    unusable mess that Windows had used (where anymore most people don't
    bother with the resource section much more than storing a program
    icon or similar...).

    But, realistically, one does still want to limit how much data they
    stick into the EXE.

    ...


    My forays into the world of graphics formats are pretty limited. I
    tend to use libraries already written by other people. I assume
    people a lot brighter than myself have come up with them.


    I usually wrote my own code for most things.

    Not dealt much with FLIC.


    In the past, whenever doing animated stuff, had usually used the AVI
    file format. A lot of time, the codecs were custom.

    Both AVI (and BMP) can be used to hold a wide range of image data,
    partly as a merit of using FOURCCs.

    Over the course of the past 15 years, have fiddled a lot here.



    A few of the longer-lived ones:
       BTIC1C (~ 2010):
         Was a modified version of RPZA with Deflate compression glued on.
       BTIC1H:
         Made use of multiple block formats,
           used STF+AdRice for entropy coding, and Paeth for color endpoints.
         Block formats, IIRC:
           4x4x2, 4x2x2, 2x4x2, 2x2x2, 4x4x1, 2x2x1, flat
             4x4x2: 32-bits for pixel selectors
             2x2x2: 8 bits for pixel selectors
       BTIC4B:
         Similar to BTIC1H, but a lot more complicated.
         Switched to 8x8 blocks, so had a whole lot of block formats.

    Shorter-Lived:
       BTIC2C: Similar design to MPEG;
       IIRC, used Huffman, but updated the Huffman tables for each I-Frame.
         This sort of thing being N/A with STF+AdRice,
           which starts from a clean slate every time.


    1C: Was used for animated textures in my first 3D engine.

    1H and 4B could be used for video, but were also used in my second 3D
    engine for sprites and textures (inside of a BMP packaging).


    My 3rd 3D engine is mostly using a mix of:
       DDS (mostly DXt1)
       BMP (mostly 16 color and 256 color).

    Though, in modern times, things like 16-color graphics are overlooked,
    in some cases they are still usable or useful (or at least sufficient).

    Typically, I had settled on a variant of the CGA/EGA color palette:
       0: 000000 (Black)
       1: 0000AA (Blue)
       2: 00AA00 (Green)
       3: 00AAAA (Cyan)
       4: AA0000 (Red)
       5: AA00AA (Magenta)
       6: AA5500 (Brown)
       7: AAAAAA (LightGray)
       8: 555555 (DarkGray)
       9: 5555FF (LightBlue)
       A: 55FF55 (LightGreen)
       B: 55FFFF (LightCyan)
       C: FF5555 (LightRed)
       D: FF55FF (Violet)
       E: FFFF55 (Yellow)
       F: FFFFFF (White)

    I am not sure why they changed it for the default 16-color assignments
    in VGA (eg, in the Windows 256-color system palette). Like, IMO, 00/AA
    and 55/FF works better for typical 16-color use-cases than 00/80 and
    00/FF.

    Sorta depends on use-case: Sometimes something works well as 16
    colors, other times it would fall on its face.



    Most other designs sucked so bad they didn't get very far.

    Where, I had ended up categorizing designs:
       BTIC1x: Designs mostly following an RPZA like path.
         1C: RPZA + Deflate
           Mostly built on 4x4x2 blocks (32 bits).
         1D, 1E: Byte-Encoding + Deflate
           Both sucked, quickly dropped.
           Both were like RPZA both with 48-bit 4:2:0 blocks.
           Neither great compression nor particularly fast.
             Deflate carries a high computational overhead.
         1F, 1G: No entropy coding (back to being like RPZA)
           Major innovations: Variable-size pixel blocks.
         1H: STF+AdRice
           Mostly final state of 1x line.
       BTIC2x: Designs mostly influenced by JPEG and MPEG.
         Difficult to make particularly fast.
         1A/1B: Modified MJPEG IIRC.
           Technically, also based on my BTJPEG format (*1).
         2C: IIRC, MPEG-like, Huffman-coded.
           Well influenced by both MPEG and the Xiph Theora codec.
         2D: Like 2C, but STF+AdRice
         2E: Like 2C, but byte stream based
           Was trying, mostly in vain, to make it faster.
           My attempts at this style of codecs were mostly, too slow.
         2F: Goes back to a more JPEG like core in some ways.
           Entropy and VLN scheme borrows more from Deflate.
              Though, uses a shorter limit on max symbol length (13 bit).
              13 bit simplifies things and makes decoding faster vs 15 bit.
           Abandons DCT and YCbCr in favor of Block-Haar and RCT.
             Later, UPIC did similar, just with STF+AdRice versus Huffman.
       BTIC3x:
         Attempts to hybridize 1x and 2x
         Nothing implemented, all designs too complicated to bother with.
       BTIC4x:
         4A: RPZA-like but with 8x8 blocks and multiple block sizes.
         4B: Like 4A but reusing the encoding scheme from 1H.
       BTIC5x:
         5A: Resembled a CRAM/QOI hybrid, but with 8-bit indexed colors.
           No entropy coding.
         5B: Like 5A, but used differential RGB555 (still QOI like).
           Major innovation was to use a 6-bit 64-entry pattern table.
           Optionally, can use per-frame RP2 or TKuLZ compression.
             Used if doing so results in a significant savings.


    *1: BTJPEG was an attempt at making a more advanced image format based
    on tweaking the existing T.81 JPEG format in a way that sorta worked
    in existing decoders. The more widespread use (and "not totally dead"
    feature) being to allow for an embedded alpha channel as essentially
    another monochrome JPEG inside the APP11 marker.

    I had tried a bunch of other ideas, but it turned into a mess of
    experimental tweaks, and most of it died off. The surviving variant is
    basically just T.81+JFIF with an optional alpha channel (ignored by a
    non-aware JPEG decoder).

    Some other (mostly dead) tweaks were things like:
    Allowing multi-layered images (more like Paint.NET's PDN or GIMP's
    XCF, mostly by nesting the images like a Matryoshka doll), where the
    top- level image would contain a view of all the layers rendered
    together;
    Allowing lossless images (similar to PNG) by using SERMS-RDCT and RCT
    (where SERMS-RDCT was a trick to make the DCT/IDCT transform exactly
    reversible, at the cost of speed).


    In the early 2010s, I was pretty bad about massively over-engineering
    everything.

    Later on, some ideas were reused in 2F and UPIC.
    Though, 2F and UPIC were much less over-engineered.

    Did specify possible use as video codecs, but thus far both were used
    only as still image formats.

    The major goal for UPIC was mostly be to address the core use-cases
    but also for the decoder to be small and relatively cheap. Still sorta
    JPEG competitive despite being primarily cost-optimized to try to make
    it more viable for use in programs running on the BJX2 core (where
    JPEG decoding is slow and expensive).

    As for Static Huffman vs STF+AdRice:
       Huffman:
         + Slightly faster for larger payloads
         + Optimal for a static distribution
         - Higher memory cost for decoding (storing decoder tables)
         - High initial setup cost (setting up decoder tables)
         - Higher constant overhead (storing symbol lengths)
         - Need to provision for storing Huffman tables
       STF+AdRice:
         + Very cheap initial setup (minimal context)
         + No need to transmit tables
         + Better compression for small data
         + Significantly faster than Adaptive Huffman
         + Significantly faster than Range Coding
         - Slower for large data and worse compression vs Huffman.

    Where, STF+AdRice is mostly:
       Have a table of symbols;
       Whenever a symbol is encoded, swap it forwards;
         Next time, it may potentially be encoded with a smaller index.
       Encode indices into table using Adaptive Rice Codes.
    Or, basically, using a lookup table to allow AdRice to pretend to be
    Huffman. Also reasonably fast and simple.


    Block-Haar vs DCT:
       + Block-Haar is faster and easily reversible (lossless);
       + Mostly a drop-in replacement for DCT/IDCT in the design.
       + Also faster than WHT (Walsh-Hadamard Transform)

    RCT vs YCbCr:
       RCT is both slightly faster, and also reversible;
       Had experimented with YCoCg, but saw no real advantage over RCT.



    The existence of BTIC5x was mostly because:
    BTIC1H and BTIC4B were too computationally demanding to do 320x200
    16Hz on a 50MHz BJX2 core;

    MS-CRAM was fast to decode, but needed too much bitrate (SDcard
    couldn't keep the decoder fed with any semblance of image quality).


    So, 5A and 5B were aimed at trying to give tolerable Q/bpp at more
    CRAM- like decoding speeds.

    Also, while reasonably effective (and fast desktop by PC standards),
    one other drawback of the 4B design (and to a lesser degree 1H) was
    the design being overly complicated (and thus the code is large and
    bulky).

    Part of this was due to having too many block formats.


    If my UPIC format were put into my older naming scheme, would likely
    be called 2G. Design is kinda similar to 2F, but replaces Huffman with
    STF+AdRice.


    As for RP2 and TKuLZ:
       RP2 is a byte-oriented LZ77 variant, like LZ4,
         but on-average compresses slightly better than LZ4.
       TKuLZ: Is sorta like a simplified/tuned Deflate variant.
         Uses a shorter max symbol length,
           borrows some design elements from LZ4.

    Can note, some past experiments with LZ decompression (at Desktop PC
    speeds), with entropy scheme, and len/dist limits:
       LZMA   : ~   35 MB/sec (Range Coding,   273/   4GB)
       Zstd   : ~   60 MB/sec (tANS,          16MB/ 128MB)
       Deflate: ~  175 MB/sec (Huffman,        258/ 32767)
       TKuLZ  : ~  300 MB/sec (Huffman,      65535/262143)
       RP2    : ~ 1100 MB/sec (Raw Bytes,      512/131071)
       LZ4    : ~ 1300 MB/sec (Raw Bytes,    16383/ 65535)


    While Zstd is claimed to be fast, my testing tended to show it closer
    to LZMA speeds than to Deflate, but it does give compression closer to
    LZMA. The tANS strategy seems to under-perform claims IME (and is
    notably slower than static Huffman). Also it is the most complicated
    design among these.


    A lot of my older stuff used Deflate, but often Deflate wasn't fast
    enough, so has mostly gotten displaced by RP2 in my uses.

    TKuLZ is an intermediate, generally faster than Deflate, had an option
    to get some speed (at the expense of compression) by using fixed
    length symbols in some cases. This can push it to around 500 MB/sec
    (at the expense of compression), hard to get much faster (or anywhere
    near RP2 or LZ4).

    Whether RP2 or LZ4 is faster seems to depend on target:
       BJX2 Core, RasPi, and Piledriver: RP2 is faster.
         Mostly things with in-order cores.
          And Piledriver, which behaved almost more like an in-order machine.
       Zen+, Core 2, and Core i7: LZ4 is faster.

    LZ4 needs typically multiple chained memory accesses for each LZ run,
    whereas for RP2, match length/distance and raw count are typically all
    available via a single memory load (then maybe a few bit-tests and
    conditional branches).

    ...



    A while ago I wrote a set of graphics routines in assembler that were
    quite fast. One format I have delt with is the .flic file format used
    to render animated graphics. I wanted to write my own CIV style game.
    It took a little bit of research and some reverse engineering.
    Apparently, the authors used a modified version of the format making
    it difficult to use the CIV graphics in my own game. I never could
    get it to render as fast as the game’s engine. I wrote the code for
    my game in C or C++, the original’s game engine code was likely in a
    different language.


    This sort of thing is almost inevitable with this stuff.

    Usually I just ended up using C for nearly everything.


    *****

    Been working on vectors for the ISA. I split the vector length
    register into eight sections to define up to eight different vector
    lengths. The first five are defined for integer, float, fixed,
    character, and address data types. I figure one may want to use
    vectors of different lengths at the same time, for instance to
    address data using byte offsets, while the data itself might be a
    float. The vector load / store instructions accept a data type to
    load / store and always use the address type for address calculations.

    There is also a vector lane size register split up the same way. I
    had thought of giving each vector register its own format for length
    and lane size. But thought that is a bit much, with limited use cases.

    I think I can get away with only two load and two store instructions.
    One to do a strided load and a second to do an vector indexed load
    (gather/scatter). The addressing mode in use is
    d[Rbase+Rindex*Scale]. Where Rindex is used as the stride when scalar
    or as a supplier of the lane offset when Rindex is a vector.

    Writing the RTL code to support the vector memory ops has been
    challenging. Using a simple approach ATM. The instruction needs to be
    re-issued for each vector lane accessed. Unaligned vector loads and
    stores are also allowed, adding some complexity when the operation
    crosses a cache-line boundary.

    I have the max vector length and max vector size constants returned
    by the GETINFO instruction which returns CPU specific information.


    I don't get it...

    Usually makes sense to treat vectors as opaque blobs of bits that are
    then interpreted as one of the available formats for a specific
    operation.

    In my case, I have a SIMD setup:
       2 or 4 elements in a GPR or GPR pair;
       Most other operations are just the normal GPR operations.

    ...


    Many vector machines (RISCV-V) have a way of specifying the vector
    length and element size, but it tends to be a global setting which may
    be overridden in some cases by specifying in the instruction. For Qupls
    it also allows setting based on the data type which is a bit of a
    misnomer, it would be better named data format. It is just three bits in
    the instruction that select one of the fields in the VLEN, VELSZ
    registers. The instruction itself specifies the data type for the
    operation on an opaque bag of bits. It is possible to encode selecting
    the integer size fields, then performing a float operation on the data.

    The size agnostic instructions use the micro-op translator to convert
    the instructions into size specific versions. The translator calculates
    the number of architectural registers required then puts the appropriate number of instructions (up to eight) in the micro-op queue.

    Therefore, there are lots of vector instructions in the ISA. SIMD type instructions where the size of a vector is assumed to be one register,
    and the element size is specified by the instruction. So, separate instructions for 1,2,4 or 8 elements. (For example 50 instructions *
    four different sizes = 200 instructions). Then also size agnostic instructions where the size/format comes indirectly from the VLEN
    (vector length) and VELSZ (vector lane size) registers.

    The size agnostic instructions allow writing a generic vector routine without needing to code the size of the operation. This avoids having a switch statement with a whole bunch of cases for different vector
    lengths. It also avoids having thousands of vector instructions. (50 instructions * 5 different lanes sizes * 64 different lengths).

    The vectors are opaque blobs of bytes in my case. Size specs are in
    terms of bytes. The vectors are not a fixed length. They may (currently)
    use from 0 to 8 GPR registers. Hence the need to specify the length in
    use. While the length could be specified as part of the format for the instruction, that would require a wide instruction.


    I am not personally a fan of RV-V, as it seems too complicated and
    expensive.


    I had taken a different approach towards adding SIMD to RISC-V:
    The instructions that operated on narrower types were implicitly
    redefined to operate on SIMD vectors rather than a single narrower
    value (an operation may be understood as scalar if NaN boxed or similar).

    The two remaining rounding modes were redefined to operate on 128-bit
    vectors, defined as register pairs (serving as RNE or RTZ on said
    vectors).

    The DYN rounding mode was defined as scalar-only (only operates on a
    single value and produces NaN boxed results, also supports the IEEE
    emulation mode). This is compatible with GCC-like use of the FPU, where
    GCC tends to always use DYN instructions, which then relies on FPU
    control registers for the rounding mode, and updates status flags (which
    in this case is not done for the instructions using fixed modes).

    The scalar converter ops were silently modified into SIMD converters
    where appropriate.


    A few other instructions were added to help with some SIMD tasks, like
    vector shuffles, etc.


    Annoyingly, there is a split between the F/D extensions and P extension
    in that the P extension operates in GPRs, so can't directly reuse P
    extension encodings on F registers (effectively need to define new
    encodings to map some of the P instructions over to F registers).

    But, even then, it is only grabbing a limited set of instructions from
    P, as P had gone down a combinatorial explosion path and defined way too
    many instructions.


    *****

    .flic file format is supposed to be fast enough to allow use “on the fly”. But I just decompress all the frames into a matrix of bitmaps at game startup, then select the appropriate one based on direction and
    timing. With dozens of different sprites and hundreds of frames, I think
    it takes about 3GB of memory just for the sprite data. I had trouble
    running this on my machine a few years ago, but maybe with newer
    technology it could work.


    Hmm...

    When I was doing animated textures in my first 3D engine, with my '1C'
    codec, it was effectively:
    Transcode frame blocks to DXT1;
    Upload the compressed texture blocks to OpenGL using the same texture
    numbers.
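
    Roughly, the upload step would have looked something like this
    (assuming the usual OpenGL S3TC path via glCompressedTexSubImage2D,
    with the texture already allocated and any extension/loader setup
    handled elsewhere; the function and parameter names are placeholders,
    not the engine's actual code):

      #include <GL/gl.h>
      #include <GL/glext.h>

      static void upload_dxt1_frame(GLuint texnum, int w, int h,
                                    const void *dxt1_blocks)
      {
          /* DXT1 is 8 bytes per 4x4 block */
          GLsizei size = ((w + 3) / 4) * ((h + 3) / 4) * 8;
          glBindTexture(GL_TEXTURE_2D, texnum);
          glCompressedTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                                    GL_COMPRESSED_RGB_S3TC_DXT1_EXT,
                                    size, dxt1_blocks);
      }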


    Main issue I ran into here was that it doesn't work well for large textures.


    IIRC, for this engine had used a 4096x4096 atlas for the main
    block-surface textures (allowing 256x256 for each block).

    If trying to upload a 4096x4096 texture at 10Hz, whole PC bogged down (including mouse, which started "submarining", etc). So, this experiment
    was very short-lived (basically as soon as I could get it exited, which
    was harder when basically the whole OS ground to a halt).


    So, had to use multiple texture numbers for the main animated texture in
    the main animated-texture atlas (advancing the sequence at 10Hz).

    Theoretically, should have been pushing ~ 336 MB/sec to the GPU for this
    (DXT5 with mipmaps), but something was clearly not happy here.

    So, alas, even if one can get gigapixel/second for decoding, doesn't necessarily mean one can push it to the GPU.


    Where, the idea for the atlas is, rather than giving each of the block textures its own texture, one can instead create a much bigger texture
    (say, with 16x16 sub-textures) and then consolidate everything using the
    same atlas into the same vertex array (so fewer draw calls).



    But, if streaming a few 256x256 textures or similar, it worked well enough.

    Some special blocks, like torches and fires, had used their own video
    textures and were not tied to the main animated-texture atlas.

    Mostly, all of this was being done as RIFF AVI files.


    Had experimentally transcoded and streamed full video to blocks, mostly
    using a few videos I scavenged off YouTube as test cases.

    So, errm, a video example of these experiments:
    https://www.youtube.com/watch?v=64LL0GdrxQg

    Errm, yeah, I was a bit into MLP at the time...

    IIRC, the audio effect here was that these blocks could have virtual "speakers" that would stream the audio from the corresponding video
    stream if the player was in range. Audio was IIRC mostly using a tweaked version of IMA ADPCM.



    As can be noted, unlike a conventional animated texture, there can be
    audio, and an arbitrary length.

    IIRC, because of the inability to stream the main animated-texture
    atlas, it was limited to something like 16 frames (or, around 1.6
    seconds of loop).

    IIRC, 10Hz was more standard for animated textures from past games; but
    had usually used 16Hz for full motion video.



    Along this path (decoding to DXT1), both 1C and 1H could get in the area
    of around 600 megapixels/second (my later 4B design could exceed 1 gigapixel/second).

    Can note that 1C (after unpacking the Deflate compression) used an
    encoding scheme sorta like (if looking at bytes), IIRC:
      00..7F: First byte of a raw block, 8 bytes.
        Consisted of two RGB555 values, big endian, and a 4x4x2 pixel block.
        For DXT1, needed to be turned to RGB565 LE.
        Pixel block was also BE and different from DXT1,
          but an easy enough fix (with lookup tables), and shift+or.
        In RPZA, could escape to 16x RGB555 colors, but was not used in 1C.
      80..9F: 1-32 skip blocks (kept as-is from last frame)
      A0..BF: Flat Color, RGB555 value.
      C0..DF: Two RGB555 colors, 1-32 4x4 blocks sharing the same colors.
      E0..FF: Used for something...
        I forget ATM, maybe skip+translate.
        At top-level, would indicate the use of TLV packaging.
        Otherwise, frame would be decoded as a raw command stream.
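
    A skeletal classifier for those command bytes, reconstructed from the
    ranges above (payload parsing is elided since the exact field layouts
    are not spelled out here; this is a sketch, not the original decoder):

      #include <stdint.h>

      typedef enum {
          CMD_RAW_BLOCK,   /* 00..7F: 8-byte raw block (2x RGB555 BE +
                                       4x4x2 pixel bits)                 */
          CMD_SKIP,        /* 80..9F: 1..32 blocks kept from last frame  */
          CMD_FLAT,        /* A0..BF: flat-color block, RGB555 value     */
          CMD_SHARED_PAIR, /* C0..DF: 1..32 4x4 blocks sharing 2 colors  */
          CMD_OTHER        /* E0..FF: skip+translate / TLV (per above)   */
      } Btic1cCmd;

      static Btic1cCmd btic1c_classify(uint8_t op)
      {
          if (op <= 0x7F) return CMD_RAW_BLOCK;
          if (op <= 0x9F) return CMD_SKIP;
          if (op <= 0xBF) return CMD_FLAT;
          if (op <= 0xDF) return CMD_SHARED_PAIR;
          return CMD_OTHER;
      }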

    IIRC, there was an option to encode a separate alpha layer (for decoding
    to DXT5). Another option was to encode the alpha similar to DXT1, namely
    via color endpoint ordering, IIRC:
    C0<C1: Opaque
    C0>C1: 1-bit transparency.
    Or, no alpha of either sort, in which case it was opaque.

    No mipmaps were encoded here. Strategy was to use a quick/dirty approach
    to rebuild mipmaps on the fly.


    The followups, 1D and 1E, were intended to try to give better fidelity
    when decoding to BC7, but mostly failed to be all that fast.


    By 1H, had switched to one-off command-tag codes, with colors being delta-coded and blocks reusing prior colors. Worked OK as it was built
    around Rice coding everything. This format was significantly more
    complicated.

    For 5A/5B, had instead used a unary-coding scheme to encode commands
    (similar to both QOI and RP2).



    By the second engine, I had mostly stopped using video textures, and was instead using shader effects to do some animations (with a static
    atlas). In the shaders, I ended up mostly using dithering for the alpha
    as I sorta liked this effect at the time over the more traditional translucency effects.


    IIRC, was using a 2048x2048 atlas in this case (for 128x128 pixels per
    block).

    For my 3rd engine, it dropped again to 1024x1024, with each block
    texture limited to 64x64. No shaders, as this engine was written to
    assume being limited to roughly OpenGL 1.3 features.

    Instead it redraws all the water blocks and similar using Quake-style ST warping (but only for blocks near the camera).


    If I were to bring back video textures, could maybe use 1C or 5B as a
    base, though if using 1C may modify it to allow for RP2 and TKuLZ,
    mostly because of the whole "Deflate is kinda slow" issue.

    Ironically, it seems the only reason 1H seemed fast may have been
    because it was faster than Deflate; but by my current standards Deflate
    isn't all that fast.



    Experimented some with LZ4 and Huffman encoding. Huffman used for ECC
    logic.


    Yeah. LZ4 works.

    I am mostly using it for PE/COFF compression, as it seems to do much
    better in this case.

    For data compression, mostly ended up with my own custom RP2 design as
    it mostly beats LZ4 in terms of compression, and is similarly fast.

    Which is better mostly depends on the data in question though...


    In a few cases, I had STF+AdRice based LZ compressors, but these mostly
    make sense if the file being compressed is fairly small (lower end of
    the kB range).

    But, RP2 also works well for small data.

    Whereas, both TKuLZ and Deflate need larger data (say, over 16K-64K) to
    be effective (if compressing chunks of data in single-digit kB or less, Deflate kinda sucks...).





    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Nov 23 03:20:10 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-11 2:30 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Typical process for NaN boxing is to set the high order bits of the
    value which causes the value to appear to be a NaN at higher precision.

    Any FP value representable in lower precision can be exactly represented
    in higher precision.

    I have been thinking about using some of the high order bits of the NaN
    (eg bits 32 to 51) to indicate the precision of the boxed value.

    When My 66000 generates a NaN it inserts the cause in the 3 HoBs and inserts IP in the LoBs. Nothing prevents you from overwriting the NaN,
    but I thought it was best to point at the causing-instruction and an encoded "why" the nan was generated. The cause is a 3-bit index to the
    7 defined IEEE exceptions.

    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.

    Suggest you read:: https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.

    There are rules, when more than one NaN is an operand to an instruction, designed to leave the more important NaN as the result. {Where more important is generally the first to be generated.}

    Hopefully the package follows the rules correctly. NaN operation is one thing not tested yet.

    This
    would allow detection of the use of a lower precision value in
    arithmetic. Suppose a convert from single to double precision is being
    done, but the value to be converted is only half precision. If it were
    indicated by the NaN software might be able to fix the result.

    I think it is better to fix the SW that thinks a (half) is a (float).

    It would be better, but some software is so complex it may be unknown
    what values are coming in. The SW does not really need to croak if it's
    a lower precision value, as they are always representable in a higher
    precision.
    I also
    preserve the sign bit of the number in the NaN box.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Nov 22 23:16:17 2025
    From Newsgroup: comp.arch

    On 2025-11-22 10:20 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-11 2:30 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Typical process for NaN boxing is to set the high order bits of the
    value which causes the value to appear to be a NaN at higher precision.

    Any FP value representable in lower precision can be exactly represented
    in higher precision.

    I have been thinking about using some of the high order bits of the NaN
    (eg bits 32 to 51) to indicate the precision of the boxed value.

    When My 66000 generates a NaN it inserts the cause in the 3 HoBs and
    inserts IP in the LoBs. Nothing prevents you from overwriting the NaN,
    but I thought it was best to point at the causing-instruction and an
    encoded "why" the nan was generated. The cause is a 3-bit index to the
    7 defined IEEE exceptions.

    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.

    Suggest you read:: https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.


    Okay, it sounds like there are good reasons to use the HoBs. But I think
    it is only when converting precisions that it makes a difference. I have
    the float package moving the LoBs of a larger precision to the LoBs of
    the lower precision if a NaN (or infinity) is present. I do not think
    this consumes any more logic. It looks like just wires. It looks to be a
    three bit mux on the low order bits going the other way.

    I suppose I could code the package to accept NaN values either way.

    The following NaN values are in use.

    `define QSUBINFD   63'h7FF0000000000001 // - infinity - infinity
    `define QINFDIVD   63'h7FF0000000000002 // - infinity / infinity
    `define QZEROZEROD 63'h7FF0000000000003 // - zero / zero
    `define QINFZEROD  63'h7FF0000000000004 // - infinity X zero
    `define QSQRTINFD  63'h7FF0000000000005 // - square root of infinity
    `define QSQRTNEGD  63'h7FF0000000000006 // - square root of negative number


    There are rules when more than 1 NaN are an operand to an instruction
    designed to leave the more important NaN as the result. {Where more
    important is generally the first to be generated.}

    Hopefully the package follows the rules correctly. NaN operation is one
    thing not tested yet.

    This would allow detection of the use of a lower precision value in
    arithmetic. Suppose a convert from single to double precision is being
    done, but the value to be converted is only half precision. If it were
    indicated by the NaN software might be able to fix the result.

    I think it is better to fix the SW that thinks a (half) is a (float).

    It would be better, but some software is so complex it may be unknown
    the values coming in. The SW does not really need to croak if its a
    lower precision value as they are always represent-able in a higher
    precision.>>
    I also
    preserve the sign bit of the number in the NaN box.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Nov 22 23:36:47 2025
    From Newsgroup: comp.arch

    On 2025-11-22 11:16 p.m., Robert Finch wrote:
    On 2025-11-22 10:20 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-11 2:30 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Typical process for NaN boxing is to set the high order bits of the
    value which causes the value to appear to be a NaN at higher
    precision.

    Any FP value representable in lower precision can be exactly
    represented
    in higher precision.

    I have been thinking about using some of the high order bits of the NaN
    (eg bits 32 to 51) to indicate the precision of the boxed value.

    When My 66000 generates a NaN it inserts the cause in the 3 HoBs and
    inserts IP in the LoBs. Nothing prevents you from overwriting the NaN,
    but I thought it was best to point at the causing-instruction and an
    encoded "why" the nan was generated. The cause is a 3-bit index to the
    7 defined IEEE exceptions.

    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.

    Suggest you read::
    https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/
    nan-propagation.pdf
    For conversation about LoBs versus HoBs.


    Okay, it sounds like there are good reasons to use the HoBs. But I think
    it is only when converting precisions that it makes a difference. I have
    the float package moving the LoBs of a larger precision to the LoBs of
    the lower precision if a NaN (or infinity) is present. I do not think
    this consumes any more logic. It looks like just wires. It looks to be a three bit mux on the low order bits going the other way.

    I suppose I could code the package to accept NaN values either way.

    The following NaN values are in use.

    `define QSUBINFD   63'h7FF0000000000001 // - infinity - infinity
    `define QINFDIVD   63'h7FF0000000000002 // - infinity / infinity
    `define QZEROZEROD 63'h7FF0000000000003 // - zero / zero
    `define QINFZEROD  63'h7FF0000000000004 // - infinity X zero
    `define QSQRTINFD  63'h7FF0000000000005 // - square root of infinity
    `define QSQRTNEGD  63'h7FF0000000000006 // - square root of negative number


    When converting a NaN from higher to lower precision, the float package preserves both the low order four bits and as many high order bits of
    the NaN that will fit. The middle bits are dropped.
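
    A minimal sketch of that narrowing rule in C terms, assuming a
    binary64 -> binary32 conversion (52-bit payload down to 23 bits: keep
    the low 4 bits plus the top 19, drop the middle; names are
    illustrative, not the float package's actual code):

      #include <stdint.h>

      /* m52 is assumed to hold only the 52-bit NaN payload */
      static uint32_t narrow_nan_payload(uint64_t m52)
      {
          uint32_t lo4  = (uint32_t)(m52 & 0xF);             /* low 4 bits  */
          uint32_t hi19 = (uint32_t)((m52 >> 33) & 0x7FFFF); /* top 19 bits */
          return (hi19 << 4) | lo4;                          /* 23-bit payload */
      }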

    There are rules when more than 1 NaN are an operand to an instruction
    designed to leave the more important NaN as the result. {Where more
    important is generally the first to be generated.}

    Hopefully the package follows the rules correctly. NaN operation is one
    thing not tested yet.
    thing not tested yet.


    This
    would allow detection of the use of a lower precision value in
    arithmetic. Suppose a convert from single to double precision is being
    done, but the value to be converted is only half precision. If it were
    indicated by the NaN software might be able to fix the result.

    I think it is better to fix the SW that thinks a (half) is a (float).

    It would be better, but some software is so complex it may be unknown
    the values coming in. The SW does not really need to croak if its a
    lower precision value as they are always represent-able in a higher
    precision.
          I also
    preserve the sign bit of the number in the NaN box.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 23 07:04:37 2025
    From Newsgroup: comp.arch

    On 2025-11-22 11:36 p.m., Robert Finch wrote:
    On 2025-11-22 11:16 p.m., Robert Finch wrote:
    On 2025-11-22 10:20 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-11 2:30 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Typical process for NaN boxing is to set the high order bits of the >>>>>> value which causes the value to appear to be a NaN at higher
    precision.

    Any FP value representable in lower precision can be exactly
    represented
    in higher precision.

    I have been thinking about using some of the high order bits of
    the NaN
    (eg bits 32 to 51) to indicate the precision of the boxed value.

    When My 66000 generates a NaN it inserts the cause in the 3 HoBs and
    inserts IP in the LoBs. Nothing prevents you from overwriting the NaN,
    but I thought it was best to point at the causing-instruction and an
    encoded "why" the nan was generated. The cause is a 3-bit index to the
    7 defined IEEE exceptions.

    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.

    Suggest you read::
    https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.


    Okay, it sounds like there are good reasons to use the HoBs. But I
    think it is only when converting precisions that it makes a
    difference. I have the float package moving the LoBs of a larger
    precision to the LoBs of the lower precision if a NaN (or infinity) is
    present. I do not think this consumes any more logic. It looks like
    just wires. It looks to be a three bit mux on the low order bits going
    the other way.

    I suppose I could code the package to accept NaN values either way.

    The following NaN values are in use.

    `define QSUBINFD    63'h7FF0000000000001    // - infinity - infinity
    `define QINFDIVD    63'h7FF0000000000002    // - infinity / infinity
    `define QZEROZEROD  63'h7FF0000000000003    // - zero / zero
    `define QINFZEROD   63'h7FF0000000000004    // - infinity X zero
    `define QSQRTINFD   63'h7FF0000000000005    // - square root of infinity
    `define QSQRTNEGD   63'h7FF0000000000006    // - square root of negative number


    When converting a NaN from higher to lower precision, the float package preserves both the low order four bits and as many high order bits of
    the NaN that will fit. The middle bits are dropped.

    There are rules when more than 1 NaN are an operand to an instruction
    designed to leave the more important NaN as the result. {Where more
    important is generally the first to be generated.}

    Hopefully the package follows the rules correctly. NaN operation is one
    thing not tested yet.

    This would allow detection of the use of a lower precision value in
    arithmetic. Suppose a convert from single to double precision is being
    done, but the value to be converted is only half precision. If it were
    indicated by the NaN, software might be able to fix the result.

    I think it is better to fix the SW that thinks a (half) is a (float).
    It would be better, but some software is so complex it may be unknown
    what values are coming in. The SW does not really need to croak if it's a
    lower precision value, as they are always representable in a higher
    precision.

    I also preserve the sign bit of the number in the NaN box.



    Added a NaN tracing facility as a core option. It can only log two NaNs
    per clock to a buffer, possibly slowing the core down. The NaN addresses
    are logged in order to a 512-entry buffer. The core already tracks
    exceptions so it was not too bad to add a NaN flag to the re-order buffer.
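
    A rough software model of what such a trace option records, assuming
    nothing more than a 512-entry ring buffer of instruction addresses (the
    names and interface here are purely illustrative):

        #include <stdint.h>

        #define NAN_TRACE_ENTRIES 512

        typedef struct {
            uint64_t addr[NAN_TRACE_ENTRIES];  /* addresses of NaN-producing insns */
            unsigned head;                     /* next slot, wraps around          */
        } nan_trace_t;

        /* Called for each retired NaN (at most two per modeled clock);
           the oldest entries are silently overwritten. */
        static void nan_trace_log(nan_trace_t *t, uint64_t insn_addr) {
            t->addr[t->head] = insn_addr;
            t->head = (t->head + 1) % NAN_TRACE_ENTRIES;
        }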

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Nov 23 16:32:46 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.

    Why would a CS graduate need to know about pipelines?

    Why would a chemical engineer know the basics of heat transfer?
    They are going to use commercial programs to design them anyway.

    Why would anybody know the basics of what they are doing?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sun Nov 23 16:51:19 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.

    Why would a CS graduate need to know about pipelines?

    So they can properly simulate a pipelined processor?

    When I got my MSCS, computer engineering courses were
    required, including basic logic elements and overviews
    of processor design.


    Why would a chemical engineer know the basics of heat transfer?
    They are going to use commercial programs to design them anyway.

    Why would anybody know the basics of what they are doing?

    Indeed, a programmer that doesn't understand the underlying
    hardware is crippled.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 23 17:25:12 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.

    Why would a CS graduate need to know about pipelines?

    So they can properly simulate a pipelined processor?

    Sure, if a CS graduate works in an application area, they need to
    learn about that application area, whatever it is.

    But why would knowledge about processor pipelines be part of their CS curriculum?

    When I got my MSCS, computer engineering courses were
    required, including basic logic elements and overviews
    of processor design.

    For me, too. I even learned something about processor pipelines, in a specialized elective course.

    Why would anybody know the basics of what they are doing?

    Processor pipelines are not the basics of what a CS graduate is doing.
    They are an implementation detail in computer engineering.

    Indeed, a programmer that doesn't understand the underlying
    hardware is crippled.

    I certainly have a lot of sympathy for that point of view. However,
    there are a lot of abstractions whose cost a programmer should
    understand if they intend to write efficient code, e.g., the memory
    hierarchy or system calls.

    But CPU pipelines have the nice property that they are mostly
    transparent. What you need to understand for performance is the
    latency of various instructions, and the costs of branch
    misprediction. I teach a course "Efficient programs", and I do not
    discuss hardware pipelining, but I do explain these performance characteristics.

    If anything, understanding OoO execution and its effect on
    performance is more relevant. But looking at the dearth of textbooks,
    and the fact that Henry Wong did his thesis on his own initiative,
    even among computer engineering professors that is a topic that is of
    little interest.

    Back to programmers: There is also the other POV that programmers
    should never concern themselves with low-level details and should
    always leave that to compilers, which supposedly can do all those
    things better than programmers (I call that the compiler supremacy
    position). Compiler supremacy is wishful thinking, but wishful
    thinking has a strong influence in the world.

    A few more examples where compilers are not as good as even I expected:

    Just today, I compiled

    u4 = u1/10;
    u3 = u1%10;

    (plus some surrounding code) with gcc-14 in three contexts. Here's
    the code for two of them (the third one is similar to the second one):

    movabs $0xcccccccccccccccd,%rax movabs $0xcccccccccccccccd,%rsi
    sub $0x8,%r13 mov %r8,%rax
    mul %r8 mov %r8,%rcx
    mov %rdx,%rax mul %rsi
    shr $0x3,%rax shr $0x3,%rdx
    lea (%rax,%rax,4),%rdx lea (%rdx,%rdx,4),%rax
    add %rdx,%rdx add %rax,%rax
    sub %rdx,%r8 sub %rax,%r8
    mov %r8,0x8(%r13) mov %rcx,%rax
    mov %rax,%r8 mul %rsi
    shr $0x3,%rdx
    mov %rdx,%r9

    The major difference is that in the left context, u3 is stored into
    memory (at 0x8(%r13)), while in the right context, it stays in a
    register. In the left context, gcc managed to base its computation of
    u1%10 on the result of u1/10; in the right context, gcc first computes
    u1%10 (computing u1/10 as part of that), and then computes u1/10
    again.
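
    What both sequences compute is the usual reciprocal-multiplication
    division: 0xcccccccccccccccd is ceil(2^67/10), mul keeps the high 64
    bits of the product, shr $3 finishes the divide, and the lea/add/sub
    form u1 - 10*(u1/10). A C sketch of the same arithmetic (relies on the
    GCC/Clang __int128 extension):

        #include <stdint.h>

        /* Unsigned divide/modulo by 10 the way the compiled code does it. */
        static uint64_t udiv10(uint64_t u1) {
            const uint64_t m = 0xCCCCCCCCCCCCCCCDull;         /* ceil(2^67 / 10) */
            uint64_t hi = (uint64_t)(((unsigned __int128)u1 * m) >> 64);
            return hi >> 3;                                    /* u1 / 10         */
        }

        static uint64_t umod10(uint64_t u1) {
            return u1 - 10 * udiv10(u1);                       /* u1 % 10         */
        }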

    Then I looked if there is some unsigned equivalent of ldiv(), but
    there is not, supposedly because the compilers manage to combine the /
    and % operations by themselves.

    I also found that the resulting code was slower on a Rocket Lake than
    a variant of the code that passes the divisor in a variable, but
    that's ok: On Skylake and earlier CPUs division is so slow that the
    replacement code is probably faster.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Nov 23 20:13:25 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-22 10:20 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-11 2:30 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Typical process for NaN boxing is to set the high order bits of the
    value which causes the value to appear to be a NaN at higher precision. >>>
    Any FP value representable in lower precision can be exactly represented >>> in higher precision.

    I have been thinking about using some of the high order bits of the NaN >>>> (eg bits 32 to 51) to indicate the precision of the boxed value.

    When My 66000 generates a NaN it inserts the cause in the 3 HoBs and
    inserts IP in the LoBs. Nothing prevents you from overwriting the NaN, >>> but I thought it was best to point at the causing-instruction and an
    encoded "why" the nan was generated. The cause is a 3-bit index to the >>> 7 defined IEEE exceptions.

    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.

    Suggest you read:: https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.


    Okay, it sounds like there are good reasons to use the HoBs. But I think
    it is only when converting precisions that it makes a difference. I have
    the float package moving the LoBs of a larger precision to the LoBs of
    the lower precision if a NaN (or infinity) is present. I do not think
    this consumes any more logic. It looks like just wires. It looks to be a three bit mux on the low order bits going the other way.

    The other part of the paper's reasoning is that if you want to insert
    some portion of IP in NaN, doing it bit-reversed enables conversions
    to smaller and larger to lose as few bits as possible. The realization
    was a surprise to me (yesterday).
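
    A hedged C sketch of that idea, under an assumed layout: 3-bit cause in
    the payload HoBs, the remaining 49 payload bits holding the low bits of
    the IP bit-reversed, so that narrowing (which keeps the high payload
    bits) discards the least useful high address bits first:

        #include <stdint.h>

        static uint64_t bitrev64(uint64_t x) {
            uint64_t r = 0;
            for (int i = 0; i < 64; i++) { r = (r << 1) | (x & 1); x >>= 1; }
            return r;
        }

        /* Assumed layout: payload[51:49] = cause, payload[48:0] = reversed IP. */
        static uint64_t nan_with_ip(unsigned cause, uint64_t ip) {
            uint64_t rev_ip = bitrev64(ip) >> 15;   /* 49 bits, IP bit 0 on top */
            return 0x7FF0000000000000ull
                 | ((uint64_t)(cause & 7) << 49)
                 | rev_ip;
        }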

    I suppose I could code the package to accept NaN values either way.

    The following NaN values are in use.

    `define QSUBINFD 63'h7FF0000000000001 // - infinity - infinity
    `define QINFDIVD 63'h7FF0000000000002 // - infinity / infinity
    `define QZEROZEROD 63'h7FF0000000000003 // - zero / zero
    `define QINFZEROD 63'h7FF0000000000004 // - infinity X zero
    `define QSQRTINFD 63'h7FF0000000000005 // - square root of infinity
    `define QSQRTNEGD 63'h7FF0000000000006 // - square root of negative number


    There are rules when more than 1 NaN are an operand to an instruction
    designed to leave the more important NaN as the result. {Where more
    important is generally the first to be generated.}

    Hopefully the package follows the rules correctly. NaN operation is one
    thing not tested yet.

    This would allow detection of the use of a lower precision value in
    arithmetic. Suppose a convert from single to double precision is being
    done, but the value to be converted is only half precision. If it were
    indicated by the NaN, software might be able to fix the result.

    I think it is better to fix the SW that thinks a (half) is a (float).

    It would be better, but some software is so complex it may be unknown
    what values are coming in. The SW does not really need to croak if it's a
    lower precision value, as they are always representable in a higher
    precision.

    I also preserve the sign bit of the number in the NaN box.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Nov 23 20:15:47 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.

    Why would a CS graduate need to know about pipelines?

    Why would a chemical engineer know the basics of heat transfer?
    They are going to use commercial programs to design them anyway.

    Why would anybody know the basics of what they are doing?

    Because scientists and engineers actually want to know about things
    they work-on and say--unlike politicians. ...
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Nov 23 20:16:39 2025
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:


    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.

    Why would a CS graduate need to know about pipelines?

    So they can properly simulate a pipelined processor?

    When I got my MSCS, computer engineering courses were
    required, including basic logic elements and overviews
    of processor design.


    Why would a chemical engineer know the basics of heat transfer?
    They are going to use commercial programs to design them anyway.

    Why would anybody know the basics of what they are doing?

    Indeed, a programmer that doesn't understand the underlying
    hardware is crippled.

    So, only 95% of programmers are crippled ?!?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Nov 23 20:46:23 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Just today, I compiled

    u4 = u1/10;
    u3 = u1%10;


    (plus some surrounding code) with gcc-14 in three contexts. Here's
    the code for two of them (the third one is similar to the second one):

    Care to present a self-contained example? Otherwise, your
    example and its analysis are meaningless to the reader.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 23 22:40:02 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Just today, I compiled

    u4 = u1/10;
    u3 = u1%10;


    (plus some surrounding code) with gcc-14 in three contexts. Here's
    the code for two of them (the third one is similar to the second one):

    Care to present a self-contained example? Otherwise, your
    example and its analysis are meaningless to the reader.

    I doubt that a self-contained example will be more meaningful to all
    but the most determined readers, but anyway, the preprocessed C code is at

    https://www.complang.tuwien.ac.at/anton/tmp/engine-fast.i

    You can search for "/10" to get to the three contexts. The compiler
    call is:

    gcc -I./../arch/amd64 -I. -Wall -g -O2 -fomit-frame-pointer -pthread -DHAVE_CONFIG_H -DFORCE_LL -DFORCE_REG -DDEFAULTPATH='".:/usr/local/lib/gforth/site-forth:/usr/local/lib/gforth/0.7.9_20251119:/usr/local/share/gforth/0.7.9_20251119:/usr/share/gforth/site-forth:/usr/local/share/gforth/site-forth"' -c -fno-gcse -fcaller-saves -fno-defer-pop -fno-inline -fwrapv -fno-strict-aliasing -fno-cse-follow-jumps -fno-reorder-blocks -fno-reorder-blocks-and-partition -fno-toplevel-reorder -falign-labels=1 -falign-loops=1 -falign-jumps=1 -fno-delete-null-pointer-checks -fcf-protection=none -fno-tree-vectorize -fno-lto -pthread -DENGINE=2 -fPIC -DPIC -o libengine-fast2-ll-reg.S -S engine-fast.i

    The output of gcc-14 is at

    https://www.complang.tuwien.ac.at/anton/tmp/libengine-fast2-ll-reg.S

    You can find the three contexts by searching for "-3689348814741910323".

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 23 23:58:16 2025
    From Newsgroup: comp.arch

    On 2025-11-23 3:13 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-22 10:20 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-11 2:30 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Typical process for NaN boxing is to set the high order bits of the >>>>>> value which causes the value to appear to be a NaN at higher precision. >>>>>
    Any FP value representable in lower precision can be exactly represented >>>>> in higher precision.

    I have been thinking about using some of the high order bits of the NaN >>>>>> (eg bits 32 to 51) to indicate the precision of the boxed value.

    When My 66000 generates a NaN it inserts the cause in the 3 HoBs and >>>>> inserts IP in the LoBs. Nothing prevents you from overwriting the NaN, >>>>> but I thought it was best to point at the causing-instruction and an >>>>> encoded "why" the nan was generated. The cause is a 3-bit index to the >>>>> 7 defined IEEE exceptions.

    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.

    Suggest you read::
    https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.


    Okay, it sounds like there are good reasons to use the HoBs. But I think
    it is only when converting precisions that it makes a difference. I have
    the float package moving the LoBs of a larger precision to the LoBs of
    the lower precision if a NaN (or infinity) is present. I do not think
    this consumes any more logic. It looks like just wires. It looks to be a
    three bit mux on the low order bits going the other way.

    The other part of the paper's reasoning is that if you want to insert
    some portion of IP in NaN, doing it bit-reversed enables conversions
    to smaller and larger to lose as few bits as possible. The realization
    was a surprise to me (yesterday).


    It is probably not possible to embed enough IP information in smaller floating-point formats (<=16-bit) to be worthwhile. For 32-bit floats
    only about 18 bits of the address can be stored. It looks like different formats are going to handle NaNs differently, which I find somewhat undesirable.

    I am now leaning towards allocating four HOB bits to indicate the NaN
    cause, and then filling the rest of the payload with a bit reversed
    address. There should be some instruction to extract the NaN cause and address.
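
    A hedged C sketch of what such an extract operation would compute, under
    the layout just described (4-bit cause in the payload HoBs, the remaining
    48 payload bits holding the bit-reversed low address bits; the field
    widths are assumptions):

        #include <stdint.h>

        static uint64_t bitrev64(uint64_t x) {
            uint64_t r = 0;
            for (int i = 0; i < 64; i++) { r = (r << 1) | (x & 1); x >>= 1; }
            return r;
        }

        /* Assumed binary64 layout: payload[51:48] = cause,
           payload[47:0] = bit-reversed low 48 address bits. */
        static unsigned nan_cause(uint64_t nanbits) {
            return (unsigned)((nanbits >> 48) & 0xF);
        }

        static uint64_t nan_address(uint64_t nanbits) {
            uint64_t rev = nanbits & 0x0000FFFFFFFFFFFFull;  /* 48 reversed bits */
            return bitrev64(rev << 16);                      /* back to normal   */
        }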

    I like the bit-reversed address idea. Losing high order address bits is
    less of an issue than low order ones.

    The extra bit in the NaN cause may be used by software for when access
    to the payload area is desired for other purposes.

    I still like the idea of a NaN trace facility as an option. Perhaps the debugger logic could trigger a dump to trace on a NaN after a specific address.

    I think that just a cause code to indicate multiple NaNs colliding would
    be good. With the fused-dot-product there could be up to four NaNs. Some
    of the information is going to be lost, so might as well just assign a code.

    Insane idea: use more payload bits to record the colliding NaN causes,
    then dump it to a CSR somewhere when the address is inserted into the
    NaN. The FP status needs to be recorded, so maybe it could be part of
    that status record.

    My float package does not have access to an address, so it cannot be
    inserted in the individual modules where the NaN occurs. It must be
    inserted at a higher level in the FPU which I believe has access to the instruction address.



    I suppose I could code the package to accept NaN values either way.

    The following NaN values are in use.

    `define QSUBINFD 63'h7FF0000000000001 // - infinity - infinity
    `define QINFDIVD 63'h7FF0000000000002 // - infinity / infinity
    `define QZEROZEROD 63'h7FF0000000000003 // - zero / zero
    `define QINFZEROD 63'h7FF0000000000004 // - infinity X zero
    `define QSQRTINFD 63'h7FF0000000000005 // - square root of infinity
    `define QSQRTNEGD 63'h7FF0000000000006 // - square root of negative number


    There are rules when more than 1 NaN are an operand to an instruction >>>>> designed to leave the more important NaN as the result. {Where more
    important is generally the first to be generated.}

    Hopefully the package follows the rules correctly. NaN operation is one >>>> thing not tested yet.

    This >>>>>> would allow detection of the use of a lower precision value in
    arithmetic. Suppose a convert from single to double precision is being >>>>>> done, but the value to be converted is only half precision. If it were >>>>>> indicated by the NaN software might be able to fix the result.

    I think it is better to fix the SW that thinks a (half) is a (float). >>>>>
    It would be better, but some software is so complex it may be unknown
    what values are coming in. The SW does not really need to croak if it's a
    lower precision value, as they are always representable in a higher
    precision.

    I also preserve the sign bit of the number in the NaN box.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Nov 24 18:03:39 2025
    From Newsgroup: comp.arch

    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.

    Why would a CS graduate need to know about pipelines?

    So they can properly simulate a pipelined processor?

    Sure, if a CS graduate works in an application area, they need to
    learn about that application area, whatever it is.

    It's useful for code optimization, as well. In general,
    any programmer should have a solid understanding of the
    underlying hardware - generically, and specifically
    for the hardware being programmed.


    But why would knowledge about processor pipelines be part of their CS curriculum?

    When I got my MSCS, computer engineering courses were
    required, including basic logic elements and overviews
    of processor design.

    For me, too. I even learned something about processor pipelines, in a specialized elective course.

    Why would anybody know the basics of what they are doing?

    Processor pipelines are not the basics of what a CS graduate is doing.
    They are an implementation detail in computer engineering.

    Which affect the performance of the software created by the
    software engineer (CS graduate).


    Indeed, a programmer that doesn't understand the underlying
    hardware is crippled.

    <snip>

    If anything, understanding OoO execution and its effect on
    performance is more relevant. But looking at the dearth of textbooks,
    and the fact that Henry Wong did his thesis on his own initiative,
    even among computer engineering professors that is a topic that is of
    little interest.

    Back to programmers: There is also the other POV that programmers
    should never concern themselves with low-level details and should
    always leave that to compilers, which supposedly can do all those
    things better than programmers (I call that the compiler supremacy
    position). Compiler supremacy is wishful thinking, but wishful
    thinking has a strong influence in the world.

    My experience with those who espouse that point of view has
    been uniformly poor.


    A few more examples where compilers are not as good as even I expected:

    Just today, I compiled

    u4 = u1/10;
    u3 = u1%10;

    (plus some surrounding code) with gcc-14 in three contexts. Here's
    the code for two of them (the third one is similar to the second one):

    movabs $0xcccccccccccccccd,%rax movabs $0xcccccccccccccccd,%rsi
    sub $0x8,%r13 mov %r8,%rax
    mul %r8 mov %r8,%rcx
    mov %rdx,%rax mul %rsi
    shr $0x3,%rax shr $0x3,%rdx
    lea (%rax,%rax,4),%rdx lea (%rdx,%rdx,4),%rax
    add %rdx,%rdx add %rax,%rax
    sub %rdx,%r8 sub %rax,%r8
    mov %r8,0x8(%r13) mov %rcx,%rax
    mov %rax,%r8 mul %rsi
    shr $0x3,%rdx
    mov %rdx,%r9

    The major difference is that in the left context, u3 is stored into
    memory (at 0x8(%r13)), while in the right context, it stays in a
    register. In the left context, gcc managed to base its computation of
    u1%10 on the result of u1/10; in the right context, gcc first computes
    u1%10 (computing u1/10 as part of that), and then computes u1/10
    again.

    Sort of emphasizes that programmers need to understand the
    underlying hardware.

    What were u1, u3 and u4 declared as?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Nov 24 20:00:59 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-23 3:13 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-22 10:20 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-11 2:30 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Typical process for NaN boxing is to set the high order bits of the >>>>>> value which causes the value to appear to be a NaN at higher precision.

    Any FP value representable in lower precision can be exactly represented
    in higher precision.

    I have been thinking about using some of the high order bits of the NaN
    (eg bits 32 to 51) to indicate the precision of the boxed value.

    When My 66000 generates a NaN it inserts the cause in the 3 HoBs and >>>>> inserts IP in the LoBs. Nothing prevents you from overwriting the NaN, >>>>> but I thought it was best to point at the causing-instruction and an >>>>> encoded "why" the nan was generated. The cause is a 3-bit index to the >>>>> 7 defined IEEE exceptions.

    My float package puts the cause in the 3 LoBs. The cause is always in >>>> the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have >>>> access to the address. Seems like NaN trace hardware might be useful. >>>
    Suggest you read::
    https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.


    Okay, it sounds like there are good reasons to use the HoBs. But I think >> it is only when converting precisions that it makes a difference. I have >> the float package moving the LoBs of a larger precision to the LoBs of
    the lower precision if a NaN (or infinity) is present. I do not think
    this consumes any more logic. It looks like just wires. It looks to be a >> three bit mux on the low order bits going the other way.

    The other part of the paper's reasoning is that if you want to insert
    some portion of IP in NaN, doing it bit-reversed enables conversions
    to smaller and larger to lose as few bits as possible. The realization
    was a surprise to me (yesterday).


    It is probably not possible to embed enough IP information in smaller floating-point formats (<=16-bit) to be worthwhile. For 32-bit floats
    only about 18-bits of the address can be stored. It looks like different formats are going to handle NaNs differently, which I find somewhat undesirable.

    I am now leaning towards allocating four HOB bits to indicate the NaN
    cause, and then filling the rest of the payload with a bit reversed address. There should be some instruction to extract the NaN cause and address.

    I like the bit-reversed address idea. Losing high order address bits is
    less of an issue than low order ones.

    The extra bit in the NaN cause may be used by software for when access
    to the payload area is desired for other purposes.

    I still like the idea of a NaN trace facility as an option. Perhaps the debugger logic could trigger a dump to trace on a NaN after a specific address.

    I think that just a cause code to indicate multiple NaNs colliding would
    be good. With the fused-dot-product there could be up to four NaNs. Some
    of the information is going to be lost, so might as well just assign a code.

    Insane idea: use more payload bits to record the colliding NaN causes,
    then dump it to a CSR somewhere when the address is inserted into the
    NaN. The FP status needs to be recorded, so maybe it could be part of
    that status record.

    My float package does not have access to an address, so it cannot be inserted in the individual modules where the NaN occurs. It must be
    inserted at a higher level in the FPU which I believe has access to the instruction address.

    In this case, put the cause in a container the instruction drags down
    the pipe, and retrieve it when you do have address access to where it
    needs to go.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From antispam@antispam@fricas.org (Waldek Hebisch) to comp.arch on Tue Nov 25 00:40:38 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Power's not dead, either, if very highly priced.

    New Power CPUs and machines based on them are released regularly. I
    think there is enough business in the iSeries (or whatever its current
    name) to produce enough money for the costs of that development.
    pSeries benefits from that. I guess that the profits from that are
    enough to finance the development of the pSeries machines, but can
    contribute little to finance the development of the CPUs.

    MIPS is still
    being sold, apparently.

    From <https://en.wikipedia.org/wiki/MIPS_architecture>:
    |In March 2021, MIPS announced that the development of the MIPS
    |architecture had ended as the company is making the transition to
    |RISC-V.

    So it's the same status as SPARC. They may be selling to existing
    customers, but nobody sane will use MIPS for a new project.

    Original MIPS yes. IIUC Chinese bought rights to use MIPS architecture
    and that goes on.

    As for RISC-V,
    I am not sure how much business they actually generate compared
    to others.

    I think a lot of embedded RISC-Vs are used, e.g., in WD (and now
    Sandisk) HDDs and SSDs; so you can look at the business reports of WD
    if you want to know how much business they make. As for things you
    can actually program, there are a number of SBCs on sale (and we have
    one), from the Raspi Pico 2 (where you apparently can use either
    ARMv8-M (i.e., ARM T32) or RISC-V (probably some RV32 variant) up to
    stuff like the Visionfive V2, several Chinese offerings, and some
    Hifive SBCs. The latter are not yet competitive in CPU performance
    with the like of RK3588-based SBCs or the Raspi 5, so I expect the
    main reason for buying them is to try out RISC-V (we have a Visionfive
    V1 for that purpose); still, the fact that there are several offerings indicates that there is nonnegligible revenue there.

    There are several 32-bit MCU-s and they probably have nontrivial
    part of the market. There are also 64-bit processors, ATM
    cheapest 64-bit Linux capable SBC-s known to me are RISC-V
    (but ARM-based ones are quite close). My impression is that
    corresponding chips are used in security cameras (they have
    special-purpose coprocessor for image recognition).
    Several new chips offer choice of RISC-V or ARM, I am not sure
    what percentage of users run them as ARM.

    Currently big questions are:
    - will Chinese dominate CPU market?
    - which architectures will be used by Chinese?

    It seems that main Chinese bet is on RISC-V. They manufacture
    a lot of ARM-s, but are not entirely comfortable with it.
    They have a few architectures that seem to still get some
    development.
    --
    Waldek Hebisch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Nov 25 21:08:45 2025
    From Newsgroup: comp.arch

    In this case, put the cause in a container the instruction drags down
    the pipe, and retrieve it when you do have address access to where it
    needs to go.

    I may change things to pass the address around in the float package.
    Putting the address into the NaN later may cause issues with timing. It
    adds a mux into things. May be better to use the original NaN mux in the
    float modules. May call it a NaN identity field instead of an address.

    Modified NaN support in the float package to store to the HOBs.

    Survey says:

    The Qupls PUSH and POP instructions have room for six register fields.
    Should one of the fields be used to identify the stack pointer register allowing five registers to be pushed or popped? Or should the stack
    pointer register be assumed so that six registers may be pushed or popped?

    I think the SP should be identified as PUSH / POP would be the only instructions assuming the SP register. Otherwise any register could be
    chosen by the compiler.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Nov 26 07:53:49 2025
    From Newsgroup: comp.arch

    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    IIUC Chinese bought rights to use MIPS architecture
    and that goes on.

    None are known to me. LoongSon originally implemented MIPS, but,
    according to <https://en.wikipedia.org/wiki/Loongson>:

    |Loongson moved to their own processor instruction set architecture
    |(ISA) in 2021 with the release of the Loongson 3 5000 series.

    This instruction set is called LoongArch, and while it is similar to
    MIPS, RISC-V, Alpha, DLX, Nios, it is different enough that Bernd
    Paysan wrote a separate assembler and disassembler for it <https://cgit.git.savannah.gnu.org/cgit/gforth.git/tree/arch/loongarch64> rather than copying and modifying the MIPS assembler/disassembler.

    It seems that main Chinese bet is on RISC-V. They manufacture
    a lot of ARM-s, but are not entirely comfortable with it.

    It seems to me that different companies in China use different
    architectures. Huawei on ARM, Loongson on Loongarch, some on RISC-V
    etc.

    That also seems to be the Chinese approach to other technologies:
    E.g., they build solar power, coal power, wind power, nuclear power,
    hydro power, etc.; and in nuclear power, they built a few of every
    kind of Generation III reactor on the market before developing their
    own designs, some of them based on the Westinghouse AP-1000,
    others (Hualong One) based on earlier Chinese Generation II designs.
    They are also experimenting with Generation IV and SMR designs.

    So, at least in technology, the CP does not pretend to know what's
    best.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Wed Nov 26 12:17:09 2025
    From Newsgroup: comp.arch

    On Wed, 26 Nov 2025 07:53:49 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:



    That also seems to be the Chinese approach to other technologies:
    E.g., they build solar power, coal power, wind power, nuclear power,
    hydro power, etc.; and in nuclear power, they built a few of every
    kind of Generation III reactor on the market before developing their
    own designs, some of them based on the Westinghouse AP-1000,
    others (Hualong One) based on earlier Chinese Generation II designs.
    They are also experimenting with Generation IV and SMR designs.

    So, at least in technology, the CP does not pretend to know what's
    best.

    - anton

    Is not it the same as in all big countries except ultra-pro-nuclear
    France and ultra-anti-nuclear Germany?
    China is just bigger, so capable to build more things simultaneously.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Wed Nov 26 18:08:49 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    On Wed, 26 Nov 2025 07:53:49 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    That also seems to be the Chinese approach to other technologies:
    E.g., they build solar power, coal power, wind power, nuclear power,
    hydro power, etc.; and in nuclear power, they built a few of every
    kind of Generation III reactor on the market before developing their
    own designs, some of them based on the Westinghouse AP-1000,
    others (Hualong One) based on earlier Chinese Generation II designs.
    They are also experimenting with Generation IV and SMR designs.

    So, at least in technology, the CP does not pretend to know what's
    best.
    ...
    Is not it the same as in all big countries except ultra-pro-nuclear
    France and ultra-anti-nuclear Germany?

    Not sure what you mean by "it", but I doubt that many new coal plants
    are built in the first world (maybe in Australia?); Wind power faces significant opposition in some countries.

    Concerning nuclear power: it stagnates or is in decline in the first
    world. E.g., a number of nuclear power plants were shut down in the
    2010s in the USA despite being granted lifetime extensions, due to
    being uneconomical in the fracking age, and the building of new
    reactors led to huge cost overruns (Nukegate) and the bankruptcy of Westinghouse, and to the cancelation of some of the projects.
    Similarly, the first EPRs in Finland and in France led to huge delays
    and cost overruns, and a large part (all?) of the losses were
    shouldered by the French state, which restructured the companies
    involved. The Chinese EPRs also had long delays, but were the first
    to deliver grid energy.

    In any case, no AP-1000 has been built in Europe, and no EPR in the
    USA. Both have been built in China.

    China is just bigger, so capable to build more things simultaneously.

    They are willing to build different things. France has announced the
    building of 15 EPRs to replace much of its aging reactor fleet (which
    is not that much less than what China is building). It will be
    interesting when in 30 years defects are found in one of the reactor
    vessels of an EPR (like happened for an older model in 2022), and all
    EPRs have to be shut down for inspection and repairs (like happened
    for that model in 2022).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 26 20:57:11 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    In this case, put the cause in a container the instruction drags down
    the pipe, and retrieve it when you do have address access to where it
    needs to go.

    I may change things to pass the address around in the float package.
    Putting the address into the NaN later may cause issues with timing. It
    adds a mux into things. May be better to use the original NaN mux in the float modules. May call it a NaN identity field instead of an address.

    For example: when a My 66000 instruction needs to raise an exception
    the Inst *I argument contains a field I->raised which is set (1<<excpt)
    and at the end of the pipe (at retire), t->raised |= I->raised. Where
    we have a *t there is also t->ip. So, you don't have to drag Thread *t
    through all the subroutine calls, but you can easily access t->raised
    at the point you do have access to t->ip.
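
    A small C model of that scheme; Inst and Thread here are illustrative
    stand-ins rather than the actual simulator types:

        #include <stdint.h>

        typedef struct Inst   { uint32_t raised; /* (1 << excpt) bits set in the pipe */ } Inst;
        typedef struct Thread { uint32_t raised; uint64_t ip; } Thread;

        /* Where the fault is detected: only the instruction is at hand. */
        static void raise_exception(Inst *I, int excpt) {
            I->raised |= 1u << excpt;
        }

        /* At retire: the thread state, and hence t->ip, is available. */
        static void retire(Thread *t, const Inst *I) {
            t->raised |= I->raised;
        }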

    Modified NaN support in the float package to store to the HOBs.

    Survey says:

    The Qulps PUSH and POP instructions have room for six register fields. Should one of the fields be used to identify the stack pointer register allowing five registers to be pushed or popped? Or should the stack
    pointer register be assumed so that six registers may be pushed or popped?

    My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
    instead of giving it a number of registers, there is a start register
    and a stop register, so 1-to-32 registers can be saved/restored. The
    immediate contains how much stack space to allocate/deallocate.
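
    A very rough C model of the save half of that, assuming registers
    rstart..rstop are stored in ascending order and the stack grows
    downward (the description pins down neither, and the special
    low-immediate behaviours described below are not modelled):

        #include <stdint.h>
        #include <string.h>

        /* reg[31] is SP, per the description; mem models data memory. */
        static void enter_model(uint64_t reg[32], uint8_t *mem,
                                unsigned rstart, unsigned rstop, uint64_t frame) {
            uint64_t sp = reg[31];
            for (unsigned r = rstart; r <= rstop; r++) {   /* assumed ordering */
                sp -= 8;
                memcpy(mem + sp, &reg[r], 8);
            }
            sp -= frame;          /* immediate: extra stack space to allocate */
            reg[31] = sp;
        }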

    {{when Safe-Stack is enabled:: Rstart-to-R0 are placed on the inaccessible stack, while R1-to-Rstop are placed on the normal stack.}}

    Because the stack is always DoubleWord aligned, the 3-LoBs of the
    immediate are used to indicate "special" activities on a couple of
    registers {R0, R31, R30}. R31 is rarely saved and reloaded from the stack
    but just returned to its previous value by integer arithmetic. FP can
    be updated or it can be treated like "just another register". R0 can
    be loaded directly to t->ip, or loaded into R0 for stack walk-backs.

    The corresponding LDM and STM are seldom used.

    I think the SP should be identified as PUSH / POP would be the only instructions assuming the SP register. Otherwise any register could be chosen by the compiler.

    I started with that philosophy--and begrudgingly went away from it as
    a) the compiler took form
    b) we started adding instructions to ISA to remove instructions from
    code footprint.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 26 21:00:23 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    antispam@fricas.org (Waldek Hebisch) writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
    IIUC Chinese bought rights to use MIPS architecture
    and that goes on.

    None are known to me. LoongSon originally implemented MIPS, but,
    according to <https://en.wikipedia.org/wiki/Loongson>:

    |Loongson moved to their own processor instruction set architecture
    |(ISA) in 2021 with the release of the Loongson 3 5000 series.

    This instruction set is called LoongArch, and while it is similar to
    MIPS, RISC-V, Alpha, DLX, Nios, it is different enough that Bernd
    Paysan wrote a separate assembler and disassembler for it <https://cgit.git.savannah.gnu.org/cgit/gforth.git/tree/arch/loongarch64> rather than copying and modifying the MIPS assembler/disassembler.

    It seems that main Chinese bet is on RISC-V. They manufacture
    a lot of ARM-s, but are not entirely comfortable with it.

    It seems to me that different companies in China use different
    architectures. Huawei on ARM, Loongson on Loongarch, some on RISC-V
    etc.

    That also seems to be the Chinese approach to other technologies:
    E.g., they build solar power, coal power, wind power, nuclear power,
    hydro power, etc.; and in nuclear power, they built a few of every
    kind of Generation III reactor on the market before developing their
    own designs, some of them based on the Westinghouse AP-1000,
    others (Hualong One) based on earlier Chinese Generation II designs.
    They are also experimenting with Generation IV and SMR designs.

    This reminds me of Samsung. They developed both deep trench and stacked capacitor DRAM and had both in production for about 1 full year before
    choosing one for long term production (stacked).

    So, at least in technology, the CP does not pretend to know what's
    best.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Terje Mathisen@terje.mathisen@tmsw.no to comp.arch on Wed Nov 26 22:26:14 2025
    From Newsgroup: comp.arch

    MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-22 10:20 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-11 2:30 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Typical process for NaN boxing is to set the high order bits of the >>>>>> value which causes the value to appear to be a NaN at higher precision. >>>>>
    Any FP value representable in lower precision can be exactly represented >>>>> in higher precision.

    I have been thinking about using some of the high order bits of the NaN >>>>>> (eg bits 32 to 51) to indicate the precision of the boxed value.

    When My 66000 generates a NaN it inserts the cause in the 3 HoBs and >>>>> inserts IP in the LoBs. Nothing prevents you from overwriting the NaN, >>>>> but I thought it was best to point at the causing-instruction and an >>>>> encoded "why" the nan was generated. The cause is a 3-bit index to the >>>>> 7 defined IEEE exceptions.

    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.

    Suggest you read::
    https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.


    Okay, it sounds like there are good reasons to use the HoBs. But I think
    it is only when converting precisions that it makes a difference. I have
    the float package moving the LoBs of a larger precision to the LoBs of
    the lower precision if a NaN (or infinity) is present. I do not think
    this consumes any more logic. It looks like just wires. It looks to be a
    three bit mux on the low order bits going the other way.

    The other part of the paper's reasoning is that if you want to insert
    some portion of IP in NaN, doing it bit-reversed enables conversions
    to smaller and larger to loose as few bits as possible. The realization
    was a surprise to me (yesterday).

    I think I read about IBM's approach years before the 754-2019 process
    started.

    Storing the offending address in byte-reversed order would do pretty
    much the same thing, but at lower HW cost, right?

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 26 21:58:13 2025
    From Newsgroup: comp.arch


    Terje Mathisen <terje.mathisen@tmsw.no> posted:

    MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-22 10:20 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-11 2:30 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Typical process for NaN boxing is to set the high order bits of the >>>>>> value which causes the value to appear to be a NaN at higher precision.

    Any FP value representable in lower precision can be exactly represented
    in higher precision.

    I have been thinking about using some of the high order bits of the NaN
    (eg bits 32 to 51) to indicate the precision of the boxed value.

    When My 66000 generates a NaN it inserts the cause in the 3 HoBs and >>>>> inserts IP in the LoBs. Nothing prevents you from overwriting the NaN, >>>>> but I thought it was best to point at the causing-instruction and an >>>>> encoded "why" the nan was generated. The cause is a 3-bit index to the >>>>> 7 defined IEEE exceptions.

    My float package puts the cause in the 3 LoBs. The cause is always in >>>> the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have >>>> access to the address. Seems like NaN trace hardware might be useful. >>>
    Suggest you read::
    https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.


    Okay, it sounds like there are good reasons to use the HoBs. But I think >> it is only when converting precisions that it makes a difference. I have >> the float package moving the LoBs of a larger precision to the LoBs of
    the lower precision if a NaN (or infinity) is present. I do not think
    this consumes any more logic. It looks like just wires. It looks to be a >> three bit mux on the low order bits going the other way.

    The other part of the paper's reasoning is that if you want to insert
    some portion of IP in NaN, doing it bit-reversed enables conversions
    to smaller and larger to lose as few bits as possible. The realization
    was a surprise to me (yesterday).

    I think I read about IBM's approach years before the 754-2019 process started.

    Storing the offending address in byte-reversed order would do pretty
    much the same thing, but at lower HW cost, right?

    Yes, no, and maybe.

    In order to byte/bit-reverse a field/register, you take the horizontal
    data-path bit-lines and turn them 90°. Once so turned, the
    difference in cost between bit-reversal and byte reversal is too
    small to worry about. So, no.

    On the other hand, shifters, and bit-field-reversers are often part
    of the data path (calculation circuits), so you can pretty much get
    one or the other or both at very little extra charge. So, yes.

    It is only at SW use of the bit-vector that one or the other matters
    a little (or a lot). In a machine with either bit-reverse instruction
    or byte reverse instruction, the ISA determines which one is better.
    So, maybe.

    My 66000 has a bit-reverse instruction that can also perform
    pair-reverse, quad-reverse, byte-reverse, half-reverse, and
    word-reverse. So, in this ISA it does not matter which HW choice
    was made.
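
    A C sketch of such a generalized reverse, where one swap network yields
    bit-, pair-, quad-, byte-, half-, or word-reverse depending on how many
    stages are enabled (illustrative of the idea only, not of the actual
    implementation):

        #include <stdint.h>

        /* group_bits = 1 -> bit reverse, 2 -> pair, 4 -> quad, 8 -> byte,
           16 -> half, 32 -> word reverse of a 64-bit value. */
        static uint64_t reverse_groups(uint64_t x, unsigned group_bits) {
            if (group_bits <= 1)
                x = ((x & 0x5555555555555555ull) << 1)  | ((x >> 1)  & 0x5555555555555555ull);
            if (group_bits <= 2)
                x = ((x & 0x3333333333333333ull) << 2)  | ((x >> 2)  & 0x3333333333333333ull);
            if (group_bits <= 4)
                x = ((x & 0x0F0F0F0F0F0F0F0Full) << 4)  | ((x >> 4)  & 0x0F0F0F0F0F0F0F0Full);
            if (group_bits <= 8)
                x = ((x & 0x00FF00FF00FF00FFull) << 8)  | ((x >> 8)  & 0x00FF00FF00FF00FFull);
            if (group_bits <= 16)
                x = ((x & 0x0000FFFF0000FFFFull) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFull);
            if (group_bits <= 32)
                x = (x << 32) | (x >> 32);
            return x;
        }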

    Terje


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Nov 26 22:16:25 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Robert Finch <robfi680@gmail.com> posted:


    The Qulps PUSH and POP instructions have room for six register fields.
    Should one of the fields be used to identify the stack pointer register
    allowing five registers to be pushed or popped? Or should the stack
    pointer register be assumed so that six registers may be pushed or popped?

    My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
    instead of giving it a number of registers, there is a start register
    and a stop register, so 1-to-32 registers can be saved/restored. The
    immediate contains how much stack space to allocate/deallocate.

    That seems both confining for the compiler designers and less
    useful than the VAX-11 register mask stored in the instruction stream
    at the function entry point(s).

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Brian G. Lucas@bagel99@gmail.com to comp.arch on Wed Nov 26 17:20:30 2025
    From Newsgroup: comp.arch

    On 11/26/25 5:16 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Robert Finch <robfi680@gmail.com> posted:


    The Qulps PUSH and POP instructions have room for six register fields.
    Should one of the fields be used to identify the stack pointer register
    allowing five registers to be pushed or popped? Or should the stack
    pointer register be assumed so that six registers may be pushed or popped? >>
    My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
    instead of giving it a number of registers, there is a start register
    and a stop register, so 1-to-32 registers can be saved/restored. The
    immediate contains how much stack space to allocate/deallocate.

    That seems both confining for the compiler designers and less
    useful than the VAX-11 register mask stored in the instruction stream
    at the function entry point(s).

    When the compiler can control the order in which registers are chosen
    to allocate, the ENTER and EXIT stuff works very well.

    Brian


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Wed Nov 26 22:29:33 2025
    From Newsgroup: comp.arch

    "Brian G. Lucas" <bagel99@gmail.com> writes:
    On 11/26/25 5:16 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Robert Finch <robfi680@gmail.com> posted:


    The Qulps PUSH and POP instructions have room for six register fields. >>>> Should one of the fields be used to identify the stack pointer register >>>> allowing five registers to be pushed or popped? Or should the stack
    pointer register be assumed so that six registers may be pushed or popped? >>>
    My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
    instead of giving it a number of registers, there is a start register
    and a stop register, so 1-to-32 registers can be saved/restored. The
    immediate contains how much stack space to allocate/deallocate.

    That seems both confining for the compiler designers and less
    useful than the VAX-11 register mask stored in the instruction stream
    at the function entry point(s).

    When the compiler can control the order in which registers are chosen
    to allocate, the ENTER and EXIT stuff works very well.

    They are often, however, constrained by the processor specific ABI
    which defines the usage model for registers when multiple languages
    are linked to provide code for an application.

    When every enter insn that calls the function has
    that mask, there is the possibility for strange and difficult to locate
    errors when a program links with a library function that was built
    earlier or with a different version of a (or even different language)
    compiler and thus the mask is not necessarily correct for the latest
    version of the called function.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Wed Nov 26 18:19:28 2025
    From Newsgroup: comp.arch

    On 2025-11-26 3:57 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    In this case, put the cause in a container the instruction drags down
    the pipe, and retrieve it when you do have address access to where it
    needs to go.

    I may change things to pass the address around in the float package.
    Putting the address into the NaN later may cause issues with timing. It
    adds a mux into things. May be better to use the original NaN mux in the
    float modules. May call it a NaN identity field instead of an address.

    For example: when a My 66000 instruction needs to raise an exception
    the Inst *I argument contains a field I->raised which is set (1<<excpt)
    and at the end of the pipe (at retire), t->raised |= I->raised. Where
    we have a *t there is also t->ip. So, you don't have to drag Thread *t through all the subroutine calls, but you can easily access t->raised
    at the point you do have access to t->ip.

    Had trouble reading that, sounds like goobly-goop. But I believe I
    figured it out.

    Sounds like the address is inserted at the end of the pipe which I am
    sure is not the case.

    I figured this out: the NaN address must be embedded in the result by
    the time the result updates the bypass network and registers so that it
    is available to other instructions.

    The address is available at the start of the calc from the reservation
    station entry. Me thinks it must be embedded when the NaN result status
    is set, provided there is not already a NaN. The existing (first) NaN
    must propagate through.
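
    As a minimal C sketch of that rule - the helper name make_nan_with_ip and
    the "IP in the low 51 payload bits" layout are invented for illustration -
    the IP is written only when the NaN is freshly created; an operand that is
    already a NaN propagates unchanged, so the first fault site survives:

    #include <math.h>
    #include <stdint.h>
    #include <string.h>

    /* Hypothetical helper: pack the faulting instruction's IP into the low
       payload bits of a quiet NaN.  The field layout is illustrative only. */
    static double make_nan_with_ip(uint64_t ip)
    {
        uint64_t bits = 0x7FF8000000000000ull         /* quiet NaN           */
                      | (ip & 0x0007FFFFFFFFFFFFull); /* low 51 bits of IP   */
        double d;
        memcpy(&d, &bits, sizeof d);
        return d;
    }

    /* "First NaN wins": an operand NaN propagates untouched; only a NaN
       created by this operation gets the current instruction's IP. */
    static double fadd_tracked(double a, double b, uint64_t ip)
    {
        if (isnan(a)) return a;
        if (isnan(b)) return b;
        double r = a + b;
        return isnan(r) ? make_nan_with_ip(ip)   /* e.g. inf + -inf */
                        : r;
    }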

    Modified NaN support in the float package to store to the HOBs.

    Survey says:

    The Qupls PUSH and POP instructions have room for six register fields.
    Should one of the fields be used to identify the stack pointer register
    allowing five registers to be pushed or popped? Or should the stack
    pointer register be assumed so that six registers may be pushed or popped?

    My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
    instead of giving it a number of registers, there is a start register
    and a stop register, so 1-to-32 registers can be saved/restored. The immediate contains how much stack space to allocate/deallocate.

    {{when Safe-Stack is enabled:: Rstart-to-R0 are placed on the inaccessible stack, while R1-to-Rstop are placed on the normal stack.}}

    Because the stack is always DoubleWord aligned, the 3-LoBs of the
    immediate are used to indicate "special" activities on a couple of
    registers {R0, R31, R30}: R31 is rarely saved and reloaded from the stack
    but just returned to its previous value by integer arithmetic. FP can
    be updated or it can be treated like "just another register". R0 can
    be loaded directly to t->ip, or loaded into R0 for stack walk-backs.
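
    Since the stack stays DoubleWord aligned, those three low immediate bits
    really are free to act as flags. A rough C sketch of how a decoder might
    split them out - the field names and which bit means what are guesses for
    illustration, not the real My 66000 encoding:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical decode of an ENTER/EXIT immediate: the stack stays 8-byte
       aligned, so the 3 low bits can carry per-register "special" flags.
       Which bit means what is an assumption made up for this sketch. */
    typedef struct {
        uint64_t frame_bytes;  /* stack space to allocate, multiple of 8     */
        bool     r0_special;   /* e.g. load R0 straight into t->ip at EXIT   */
        bool     r30_special;  /* e.g. treat R30 as FP rather than a GPR     */
        bool     r31_special;  /* e.g. restore SP by arithmetic, not reload  */
    } EnterImm;

    static EnterImm decode_enter_imm(uint64_t imm)
    {
        EnterImm e;
        e.frame_bytes = imm & ~7ull;   /* DoubleWord-aligned allocation */
        e.r0_special  = (imm & 1) != 0;
        e.r30_special = (imm & 2) != 0;
        e.r31_special = (imm & 4) != 0;
        return e;
    }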

    The corresponding LDM and STM are seldom used.

    I ran out of micro-ops for ENTER and EXIT, so they only save the LR and
    FP (on the safe stack). A separate PUSH/POP on safe stack instruction is
    used.

    I figured LDM and STM are not used often enough. PUSH / POP is used in
    many places LDM / STM might be.

    For context switching a whole bunch of load / store instructions are
    used. There is context switching in only a couple of places.

    I think the SP should be identified as PUSH / POP would be the only
    instructions assuming the SP register. Otherwise any register could be
    chosen by the compiler.

    I started with that philosophy--and begrudgingly went away from it as
    a) the compiler took form
    b) we started adding instructions to ISA to remove instructions from
    code footprint.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 26 23:46:47 2025
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Robert Finch <robfi680@gmail.com> posted:


    The Qupls PUSH and POP instructions have room for six register fields.
    Should one of the fields be used to identify the stack pointer register
    allowing five registers to be pushed or popped? Or should the stack
    pointer register be assumed so that six registers may be pushed or popped?

    My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
    instead of giving it a number of registers, there is a start register
    and a stop register, so 1-to-32 registers can be saved/restored. The immediate contains how much stack space to allocate/deallocate.

    That seems both confining for the compiler designers and less
    useful than the VAX-11 register mask stored in the instruction stream
    at the function entry point(s).

    We, and by that I mean Brian, have not found that so. In the early stages
    we did see a bit of that, and then Brian found a way to allocate registers
    from R31-down-to-R16 that fit the ENTER/EXIT model and we find essentially nothing (that is no more instructions in the stream than necessary).

    Part of the distinction is::
    a) how arguments/results are passed to/from subroutines.
    b) having a minimum of 7-temporary registers at entry point.
    c) how the stack frame is designed/allocated wrt:
    1) my arguments and my results,
    2) his arguments and his results,
    3) varargs,
    4) dynamic arrays on stack,
    5) stack frame allocation at ENTER,
    d) freedom to use R30 as FP or as joe-random-register.

    These were all co-designed together, after much of the instruction
    emission logic was sorted out.

    Consider this as a VAX CALL model except that the mask was replaced by
    a list of registers, which were then packed towards R31 instead of a bit vector.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Nov 26 23:53:44 2025
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    "Brian G. Lucas" <bagel99@gmail.com> writes:
    On 11/26/25 5:16 PM, Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Robert Finch <robfi680@gmail.com> posted:


    The Qupls PUSH and POP instructions have room for six register fields.
    Should one of the fields be used to identify the stack pointer register
    allowing five registers to be pushed or popped? Or should the stack
    pointer register be assumed so that six registers may be pushed or popped?

    My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
    instead of giving it a number of registers, there is a start register
    and a stop register, so 1-to-32 registers can be saved/restored. The
    immediate contains how much stack space to allocate/deallocate.

    That seems both confining for the compiler designers and less
    useful than the VAX-11 register mask stored in the instruction stream
    at the function entry point(s).

    When the compiler can control the order in which registers are chosen
    to allocate, the ENTER and EXIT stuff works very well.

    They are often, however, constrained by the processor specific ABI
    which defines the usage model for registers when multiple languages
    are linked to provide code for an application.

    When every enter insn that calls the function has that mask,

    a) wrong order: It is the subroutine entry point that has the mask,
    not the calling point. Thus, the mask is universal to the
    subroutine just entered. And, thus, the corresponding EXIT
    will use the same "bit pattern".

    there is the possibility for strange and difficult to locate errors when a program links with a library function that was built
    earlier or with a different version of a (or even different language) compiler and thus the mask is not necessarily correct for the latest
    version of the called function.

    b) this was the x86-32 problem in using one of its "CALL" instructions:
    the stack was manipulated on the calling side instead of the called
    side. Originally this worked fine for PASCAL and nadda-so-gooda for
    C-like languages.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Nov 27 00:08:19 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-26 3:57 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    In this case, put the cause in a container the instruction drags down
    the pipe, and retrieve it when you do have address access to where it
    needs to go.

    I may change things to pass the address around in the float package.
    Putting the address into the NaN later may cause issues with timing. It
    adds a mux into things. May be better to use the original NaN mux in the >> float modules. May call it a NaN identity field instead of an address.

    For example: when a My 66000 instruction needs to raise an exception
    the Inst *I argument contains a field I->raised which is set (1<<excpt)
    and at the end of the pipe (at retire), t->raised |= I->raised. Where
    we have a *t there is also t->ip. So, you don't have to drag Thread *t through all the subroutine calls, but you can easily access t->raised
    at the point you do have access to t->ip.

    Had trouble reading that, sounds like goobly-goop. But I believe I
    figured it out.

    Sounds like the address is inserted at the end of the pipe which I am
    sure is not the case.

    I figured this out: the NaN address must be embedded in the result by
    the time the result updates the bypass network and registers so that it
    is available to other instructions.

    The address is available at the start of the calc from the reservation station entry. Me thinks it must be embedded when the NaN result status
    is set, provided there is not already a NaN. The existing (first) NaN
    must propagate through.

    See last calculation line in the following::

    void RunInst( Chip *chip )
    {
        for( uint64_t i = 0; i < chip->cores; i++ )
        {
            ContextStack *cpu = &core[i];
            uint8_t cs = cpu->cs;
            Thread *t;
            Inst *I;
            uint16_t raised;

            if( cpu->interrupt.raised & ((((signed)1)<<63) >> cpu->priority) )
            {   // take an interrupt
                cpu->cs = cpu->interrupt.cs;
                cpu->priority = cpu->interrupt.priority;
                t = context[cpu->cs];
                t->reg[0] = cpu->interrupt.message;
            }
            else if( raised = t->raised & t->enabled )
            {   // take an exception
                cpu->cs--;
                t = context[cpu->cs];
                t->reg[0] = FT1( raised ) | EXCPT;
                t->reg[1] = I->inst;
                t->reg[2] = I->src1;
                t->reg[3] = I->src2;
                t->reg[4] = I->src3;
            }
            else
            {   // run an instruction
                t = context[cpu->cs];
                memory( FETCH, t->ip, &I->inst );
                t->ip += 4;
                majorTable[ I->inst.major ]( t, I );
                t->raised |= I->raised; // propagate raised here
            }
        }
    }

    Modified NaN support in the float package to store to the HOBs.

    Survey says:

    The Qupls PUSH and POP instructions have room for six register fields.
    Should one of the fields be used to identify the stack pointer register
    allowing five registers to be pushed or popped? Or should the stack
    pointer register be assumed so that six registers may be pushed or popped?

    My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
    instead of giving it a number of registers, there is a start register
    and a stop register, so 1-to-32 registers can be saved/restored. The immediate contains how much stack space to allocate/deallocate.

    {{when Safe-Stack is enabled:: Rstart-to-R0 are placed on the inaccessible stack, while R1-to-Rstop are placed on the normal stack.}}

    Because the stack is always DoubleWord aligned, the 3-LoBs of the
    immediate are used to indicate "special" activities on a couple of registers {R0, R31, R30}: R31 is rarely saved and reloaded from the stack
    but just returned to its previous value by integer arithmetic. FP can
    be updated or it can be treated like "just another register". R0 can
    be loaded directly to t->ip, or loaded into R0 for stack walk-backs.

    The corresponding LDM and STM are seldom used.

    I ran out of micro-ops for ENTER and EXIT, so they only save the LR and
    FP (on the safe stack). A separate PUSH/POP on safe stack instruction is used.

    I figured LDM and STM are not used often enough. PUSH / POP is used in
    many places LDM / STM might be.

    Its a fine line.

    I found more uses for an instruction that moves a number of registers
    randomly allocated to fixed positions (arguments to a call) than to
    move random string of registers to/from memory.

    .
    MOV R1,R10
    MOV R2,R25
    MOV R3,R17
    CALL Subroutine
    . ; deal with any result

    For context switching a whole bunch of load / store instructions are
    used. There is context switching in only a couple of places.

    I use a cache-model for thread-state {program-status-line and the
    register file}.

    The high level simulator, leaves all of the context in memory without
    loading it or storing it. Thus this serves as a pipeline Oracle so if
    the OoO pipeline makes a timing error, the Oracle stops the thread in
    its tracks.

    Thus::

    .
    .
    -----interrupt detected
    . change CS (cs--) <---
    . access threadState[cs]
    . t->ip = dispatcher
    . t->reg[0] = why
    dispatcher in control
    .
    .
    .
    RET
    SVR
    .
    .

    In your typical interrupt/exception control transfers, there is
    no code to actually switch state. Just like there is no code to
    switch a cache line that takes a miss.

    (*) The cs-- is all that is necessary to change from one Thread State
    to another in its entirety.

    I think the SP should be identified as PUSH / POP would be the only
    instructions assuming the SP register. Otherwise any register could be
    chosen by the compiler.

    I started with that philosophy--and begrudgingly went away from it as
    a) the compiler took form
    b) we started adding instructions to ISA to remove instructions from
    code footprint.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Thu Nov 27 00:36:54 2025
    From Newsgroup: comp.arch

    On 2025-11-26 7:08 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-26 3:57 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    In this case, put the cause in a container the instruction drags down >>>>> the pipe, and retrieve it when you do have address access to where it >>>>> needs to go.

    I may change things to pass the address around in the float package.
    Putting the address into the NaN later may cause issues with timing. It >>>> adds a mux into things. May be better to use the original NaN mux in the >>>> float modules. May call it a NaN identity field instead of an address.

    For example: when a My 66000 instruction needs to raise an exception
    the Inst *I argument contains a field I->raised which is set (1<<excpt)
    and at the end of the pipe (at retire), t->raised |= I->raised. Where
    we have a *t there is also t->ip. So, you don't have to drag Thread *t
    through all the subroutine calls, but you can easily access t->raised
    at the point you do have access to t->ip.

    Had trouble reading that, sounds like goobly-goop. But I believe I
    figured it out.

    Sounds like the address is inserted at the end of the pipe which I am
    sure is not the case.

    I figured this out: the NaN address must be embedded in the result by
    the time the result updates the bypass network and registers so that it
    is available to other instructions.

    The address is available at the start of the calc from the reservation
    station entry. Me thinks it must be embedded when the NaN result status
    is set, provided there is not already a NaN. The existing (first) NaN
    must propagate through.

    See last calculation line in the following::

    void RunInst( Chip *chip )
    {
    for( uint64_t i = 0; i < chip->cores; i++ )
    {
    ContextStack *cpu = &core[i];
    uint8_t cs = cpu->cs;
    Thread *t;
    Inst *I;
    uint16_t raised;

    if( cpu->interrupt.raised & ((((signed)1)<<63) >> cpu->priority) )
    { // take an interrupt
    cpu->cs = cpu->interrupt.cs;
    cpu->priority = cpu->interrupt.priority;
    t = context[cpu->cs];
    t->reg[0] = cpu->interrupt.message;
    }
    else if( raised = t->raised & t->enabled )
    { // take an exception
    cpu->cs--;
    t = context[cpu->cs];
    t->reg[0] = FT1( raised ) | EXCPT;
    t->reg[1] = I->inst;
    t->reg[2] = I->src1;
    t->reg[3] = I->src2;
    t->reg[4] = I->src3;
    }
    else
    { // run an instruction
    t = context[cpu->cs];
    memory( FETCH, t->ip, &I->inst );
    t->ip += 4;
    majorTable[ I->inst.major ]( t, I );
    t->raised |= I->raised; // propagate raised here
    }
    }
    }

    That looks like code for a simulator. How closely does it follow the
    operation of the CPU? I do not see where 'I' is initialized.

    It has been a while since I worked on simulator code.

    The IP value is just muxed in via a five-to-one mux for the significand.
    Had to account for NaNs, infinities and overflow anyway. The address gets propagated with some flops, but flops are inexpensive in an FPGA.

    always_comb
      casez({aNan5,bNan5,qNaNOutab5,aInf5,bInf5,overab5})
      6'b1?????: moab6 <= {1'b1,1'b1,a5[fp64Pkg::FMSB-1:0],{fp64Pkg::FMSB+1{1'b0}}};
      6'b01????: moab6 <= {1'b1,1'b1,b5[fp64Pkg::FMSB-1:0],{fp64Pkg::FMSB+1{1'b0}}};
      6'b001???: moab6 <= {1'b1,qNaN|(64'd4 << (fp64Pkg::FMSB-4))|adr5[63:16],{fp64Pkg::FMSB+1{1'b0}}}; // multiply inf * zero
      6'b0001??: moab6 <= 0; // mul inf's
      6'b00001?: moab6 <= 0; // mul inf's
      6'b000001: moab6 <= 0; // mul overflow
      default:   moab6 <= fractab5;
      endcase



    Modified NaN support in the float package to store to the HOBs.

    Survey says:

    The Qupls PUSH and POP instructions have room for six register fields.
    Should one of the fields be used to identify the stack pointer register
    allowing five registers to be pushed or popped? Or should the stack
    pointer register be assumed so that six registers may be pushed or popped?
    My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
    instead of giving it a number of registers, there is a start register
    and a stop register, so 1-to-32 registers can be saved/restored. The
    immediate contains how much stack space to allocate/deallocate.

    {{when Safe-Stack is enabled:: Rstart-to-R0 are placed on the inaccessible >>> stack, while R1-to-Rstop are placed on the normal stack.}}

    Because the stack is always DoubleWord aligned, the 3-LoBs of the
    immediate are used to indicate "special" activities on a couple of
    registers {R0, R31, R30}: R31 is rarely saved and reloaded from the stack
    but just returned to its previous value by integer arithmetic. FP can
    be updated or it can be treated like "just another register". R0 can
    be loaded directly to t->ip, or loaded into R0 for stack walk-backs.

    The corresponding LDM and STM are seldom used.

    I ran out of micro-ops for ENTER and EXIT, so they only save the LR and
    FP (on the safe stack). A separate PUSH/POP on safe stack instruction is
    used.

    I figured LDM and STM are not used often enough. PUSH / POP is used in
    many places LDM / STM might be.

    Its a fine line.

    I found more uses for an instruction that moves a number of registers randomly allocated to fixed positions (arguments to a call) than to
    move random string of registers to/from memory.

    .
    MOV R1,R10
    MOV R2,R25
    MOV R3,R17
    CALL Subroutine
    . ; deal with any result


    My 66000 has an instruction to do that? I'd not seen an instruction like
    that. It is almost like a byte map. I can see how it could be done.
    Another instruction to add to the ISA. My compiler does not do such a
    nice job of packing the register moves together though.
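
    For what it is worth, the semantics of such a "pack the arguments"
    operation fit in a few lines of C: a short list of source register
    numbers is copied into the fixed argument registers R1..Rn. This is
    purely illustrative - per the thread, no such instruction actually
    exists in either ISA:

    #include <stdint.h>

    /* Hypothetical "move list to argument registers" operation: srcs[] holds
       the register numbers the compiler picked; they land in R1..Rn.
       reg[] stands in for the architectural register file. */
    static void mov_args(uint64_t reg[32], const uint8_t *srcs, int n)
    {
        uint64_t tmp[8];
        int i;
        /* read every source first so overlapping cases still work */
        for (i = 0; i < n && i < 8; i++)
            tmp[i] = reg[srcs[i]];
        for (i = 0; i < n && i < 8; i++)
            reg[1 + i] = tmp[i];
    }

    /* The MOV R1,R10 / MOV R2,R25 / MOV R3,R17 sequence above becomes:
           uint8_t srcs[] = { 10, 25, 17 };
           mov_args(reg, srcs, 3);                                      */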

    For context switching a whole bunch of load / store instructions are
    used. There is context switching in only a couple of places.

    I use a cache-model for thread-state {program-status-line and the
    register file}.

    The high level simulator, leaves all of the context in memory without
    loading it or storing it. Thus this serves as a pipeline Oracle so if
    the OoO pipeline makes a timing error, the Oracle stops the thread in
    its tracks.

    Thus::

    .
    .
    -----interrupt detected
    . change CS (cs--) <---
    . access threadState[cs]
    . t->ip = dispatcher
    . t->reg[0] = why
    dispatcher in control
    .
    .
    .
    RET
    SVR
    .
    .

    In your typical interrupt/exception control transfers, there is
    no code to actually switch state. Just like there is no code to
    switch a cache line that takes a miss.

    The My 66000 hardware takes care of it automatically? Interrupts push
    and pop context in my system.

    (*) The cs-- is all that is necessary to change from one Thread State
    to another in its entirety.

    I think the SP should be identified as PUSH / POP would be the only
    instructions assuming the SP register. Otherwise any register could be >>>> chosen by the compiler.

    I started with that philosophy--and begrudgingly went away from it as
    a) the compiler took form
    b) we started adding instructions to ISA to remove instructions from
    code footprint.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From George Neuner@gneuner2@comcast.net to comp.arch on Thu Nov 27 00:44:25 2025
    From Newsgroup: comp.arch

    On Sun, 23 Nov 2025 23:58:16 -0500, Robert Finch <robfi680@gmail.com>
    wrote:

    On 2025-11-23 3:13 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:
    On 2025-11-22 10:20 p.m., MitchAlsup wrote:
    Robert Finch <robfi680@gmail.com> posted:
    On 2025-11-11 2:30 p.m., MitchAlsup wrote:
    Robert Finch <robfi680@gmail.com> posted:

    Typical process for NaN boxing is to set the high order bits of the
    value which causes the value to appear to be a NaN at higher precision.

    Any FP value representable in lower precision can be exactly represented
    in higher precision.

    I have been thinking about using some of the high order bits of the NaN
    (eg bits 32 to 51) to indicate the precision of the boxed value.

    When My 66000 generates a NaN it inserts the cause in the 3 HoBs and
    inserts IP in the LoBs. Nothing prevents you from overwriting the NaN,
    but I thought it was best to point at the causing-instruction and an
    encoded "why" the NaN was generated. The cause is a 3-bit index to the
    7 defined IEEE exceptions.

    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.

    Suggest you read::
    https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.

    Okay, it sounds like there are good reasons to use the HoBs. But I think
    it is only when converting precisions that it makes a difference. I have
    the float package moving the LoBs of a larger precision to the LoBs of
    the lower precision if a NaN (or infinity) is present. I do not think
    this consumes any more logic. It looks like just wires. It looks to be a
    three bit mux on the low order bits going the other way.

    The other part of the paper's reasoning is that if you want to insert
    some portion of IP in NaN, doing it bit-reversed enables conversions
    to smaller and larger to lose as few bits as possible. The realization
    was a surprise to me (yesterday).


    It is probably not possible to embed enough IP information in smaller
    floating-point formats (<=16-bit) to be worthwhile. For 32-bit floats
    only about 18-bits of the address can be stored. It looks like different
    formats are going to handle NaNs differently, which I find somewhat
    undesirable.
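
    The bit-reversal trick can be made concrete in a few lines of C:
    reversing the low IP bits before storing them puts the most informative
    (low-order) address bits at the top of the payload, which is what a
    narrowing conversion usually keeps. The payload widths below (51 bits
    for binary64, 22 for binary32) are assumptions made for the sketch:

    #include <stdint.h>
    #include <stdio.h>

    /* Reverse the low 'width' bits of x. */
    static uint64_t bitrev(uint64_t x, int width)
    {
        uint64_t r = 0;
        for (int i = 0; i < width; i++)
            r = (r << 1) | ((x >> i) & 1);
        return r;
    }

    /* Store the bit-reversed low IP bits so the most useful (low-order)
       address bits sit at the TOP of the payload and survive narrowing.
       Payload widths (51 and 22 bits) are illustrative assumptions. */
    static uint64_t nan_payload64(uint64_t ip) { return bitrev(ip, 51); }
    static uint32_t nan_payload32(uint64_t ip) { return (uint32_t)bitrev(ip, 22); }

    int main(void)
    {
        uint64_t ip = 0x0000000080001234ull;
        printf("%016llx -> %013llx (64-bit payload)\n",
               (unsigned long long)ip, (unsigned long long)nan_payload64(ip));
        printf("%016llx -> %06lx   (32-bit payload)\n",
               (unsigned long long)ip, (unsigned long)nan_payload32(ip));
        return 0;
    }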

    This discussion reminds me somewhat of Ivan Godard's description of
    NAR faults on the Mill. Because of wide issue, just having the
    address of the offending instruction was not very helpful - you needed
    to know which of the many operations within the instruction was the
    culprit. And because NARs flow through speculated code, the offending
    site could be hundreds of operations away by the time the fault is
    signaled and pops out.

    Ivan discusses NARs in the "metadata" talk. Around 1h:25m, he
    describes the way Mill (approximately) encodes a fault location using
    a hash code created from the address of the code block, the
    instruction's issue cycle within the block, and the slot of the
    operation that failed. They stick the LO bits of this hash into
    however many bits are available for the payload. The NAR itself has a
    type, and the payload width depends on the data type produced by the
    faulting operation.
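
    A toy version of that scheme in C, just to make the shape concrete - the
    mixing constants and field widths are invented here, the talk does not
    give them: hash the code block address, the issue cycle within the block
    and the slot, then keep as many low bits as the faulting result's
    payload allows.

    #include <stdint.h>

    /* NAR-style location hash (toy): mix the code block address, the issue
       cycle within the block and the slot number, then truncate to however
       many payload bits the faulting result's type can hold.  The mixing
       constants are arbitrary (splitmix64-style), not the Mill's. */
    static uint64_t mix64(uint64_t x)
    {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ull;
        x ^= x >> 27; x *= 0x94d049bb133111ebull;
        x ^= x >> 31;
        return x;
    }

    static uint64_t nar_location(uint64_t block_addr, unsigned cycle,
                                 unsigned slot, int payload_bits)
    {
        uint64_t h = mix64(block_addr ^ ((uint64_t)cycle << 8) ^ slot);
        return payload_bits >= 64 ? h : h & ((1ull << payload_bits) - 1);
    }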

    Obviously that all is Mill specific, but it may stimulate another,
    better idea that is relevant to your design.


    YMMV.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Thu Nov 27 15:50:37 2025
    From Newsgroup: comp.arch

    In article <1763868010-5857@newsgrouper.org>,
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Robert Finch <robfi680@gmail.com> posted:
    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.

    Suggest you read:: https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.

    I wasn't sure where to join the NaN conversation, but this seems like a
    good spot.

    We've had 40+ years of different architectures handling NaNs, (what to
    encode in them to indicate where the first problem occurred) and all architectures do something different when operating on two NaNs:

    From that paper:
    - Intel using x87 instructions: NaN2 if both quiet, NaN1 if NaN2 is signalling
    - Intel using SSE instructions: NaN1
    - AMD using x87 instructions: NaN2
    - AMD using SSE instructions: NaN1
    - IBM Power PC: NaN1
    - IBM Z mainframe: NaN1 if both quiet, [precedence] to signalling NaN
    - ARM: NaN1 if both quiet, [precedence] to signalling NaN

    And adding one more not in that paper:
    - RISC-V: Always returns canonical NaN only, for Single: 0x7fc00000

    I'll just say whatever your NaN handling is, for the source code:

    A = B + C + D + E

    then for whatever values B,C,D,E having NaN or not, the value of A should
    be well defined and not dependent on the order of operations. How can you
    use bits in the NaN value for debugging if the hardware is returning arbitrary results when NaNs collide? Users have almost no control over whether
    A = B + C treats B as the first argument or the second.

    I think encoding stuff in NaN is a very 80's idea: turning on exceptions
    costs performance, so we want to debug after-the-fact using NaNs.

    But I think RISC-V has the right modern idea: make hardware fast so you can simply always enable Invalid Operation Traps (and maybe Overflow, if
    infinities are happening), and then stop right at the point of NaN being
    first created. So the NaN propagation doesn't matter.

    I think the common current debug strategy for NaNs is run at full speed
    with exceptions masked, and if you get NaNs in your answer, you re-run
    with exceptions on and then debug the traps that occur. And no one looks at the NaN values at all, just their presence.

    So rather than spending time on NaN encoding, make it so that FP performance
    is not affected by enabling exceptions, so we can skip the re-running step,
    and just run with Invalid Operations trapping enabled. And then just
    return canonical NaNs.

    Kent
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Nov 27 19:16:24 2025
    From Newsgroup: comp.arch

    On Thu, 27 Nov 2025 15:50:37 -0000 (UTC)
    kegs@provalid.com (Kent Dickey) wrote:

    In article <1763868010-5857@newsgrouper.org>,
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Robert Finch <robfi680@gmail.com> posted:
    My float package puts the cause in the 3 LoBs. The cause is always
    in the low order bits of the register then, even when the
    precision is different. But the address is not tracked. The
    package does not have access to the address. Seems like NaN trace
    hardware might be useful.

    Suggest you read:: https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.

    I wasn't sure where to join the NaN conversation, but this seems like
    a good spot.

    We've had 40+ years of different architectures handling NaNs, (what to
    encode in them to indicate where the first problem occurred) and all architectures do something different when operating on two NaNs:

    From that paper:
    - Intel using x87 instructions: NaN2 if both quiet, NaN1 if NaN2 is signalling
    - Intel using SSE instructions: NaN1
    - AMD using x87 instructions: NaN2
    - AMD using SSE instructions: NaN1
    - IBM Power PC: NaN1
    - IBM Z mainframe: NaN1 if both quiet, [precedence] to signalling NaN
    - ARM: NaN1 if both quiet, [precedence] to signalling NaN

    And adding one more not in that paper:
    - RISC-V: Always returns canonical NaN only, for Single: 0x7fc00000

    I'll just say whatever your NaN handling is, for the source code:

    A = B + C + D + E

    then for whatever values B,C,D,E having NaN or not, the value of A
    should be well defined and not dependent on the order of operations.
    How can you use bits in the NaN value for debugging if the hardware
    is returning arbitrary results when NaNs collide? Users have almost
    no control over whether A = B + C treats B as the first argument or
    the second.

    I think encoding stuff in NaN is a very 80's idea: turning on
    exceptions costs performance, so we want to debug after-the-fact
    using NaNs.

    But I think RISC-V has the right modern idea: make hardware fast so
    you can simply always enable Invalid Operation Traps (and maybe
    Overflow, if infinities are happening), and then stop right at the
    point of NaN being first created. So the NaN propagation doesn't
    matter.

    I think the common current debug strategy for NaNs is run at full
    speed with exceptions masked, and if you get NaNs in your answer, you
    re-run with exceptions on and then debug the traps that occur. And
    no one looks at the NaN values at all, just their presence.

    So rather than spending time on NaN encoding, make it so that FP
    performance is not affected by enabling exceptions, so we can skip
    the re-running step, and just run with Invalid Operations trapping
    enabled. And then just return canonical NaNs.

    Kent

    How do you ship your software to the end user? Are exceptions masked off
    or enabled?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Nov 28 06:45:58 2025
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> schrieb:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    Robert Finch <robfi680@gmail.com> posted:


    The Qupls PUSH and POP instructions have room for six register fields.
    Should one of the fields be used to identify the stack pointer register
    allowing five registers to be pushed or popped? Or should the stack
    pointer register be assumed so that six registers may be pushed or popped?
    My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
    instead of giving it a number of registers, there is a start register
    and a stop register, so 1-to-32 registers can be saved/restored. The immediate contains how much stack space to allocate/deallocate.

    That seems both confining for the compiler designers and less
    useful than the VAX-11 register mask stored in the instruction stream
    at the function entry point(s).

    That's the nice thing when the ISA, the ABI (including the calling
    convention) and the compiler are designed together - this allows
    ENTER and EXIT to work just as well, without needing the full
    generality.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Nov 28 07:17:07 2025
    From Newsgroup: comp.arch

    Kent Dickey <kegs@provalid.com> schrieb:

    I'll just say whatever your NaN handling is, for the source code:

    A = B + C + D + E

    then for whatever values B,C,D,E having NaN or not, the value of A should
    be well defined and not dependent on the order of operations.

    That is not possible in general with normal floating point (you could
    guarantee it if you keep track of all digits). But normally,
    1 + 1e-9 - 1 will be different from 1 - 1 + 1e-9.
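
    A quick C check makes the point (the exact low bits depend on the
    platform's binary64 rounding, but the two results do differ):

    #include <stdio.h>

    int main(void)
    {
        double a = 1.0 + 1e-9 - 1.0;  /* (1 + 1e-9) rounds first, then -1 */
        double b = 1.0 - 1.0 + 1e-9;  /* exactly 0, then + 1e-9           */
        printf("%.17g\n%.17g\nequal: %d\n", a, b, a == b);  /* equal: 0 */
        return 0;
    }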

    (BTW, Fortran allows re-arrangement, unless there are parentheses,
    which have to be honored.)
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Fri Nov 28 02:59:36 2025
    From Newsgroup: comp.arch

    On 2025-11-27 10:50 a.m., Kent Dickey wrote:
    In article <1763868010-5857@newsgrouper.org>,
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Robert Finch <robfi680@gmail.com> posted:
    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.

    Suggest you read::
    https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.

    I wasn't sure where to join the NaN conversation, but this seems like a
    good spot.

    We've had 40+ years of different architectures handling NaNs, (what to
    encode in them to indicate where the first problem occurred) and all architectures do something different when operating on two NaNs:

    From that paper:
    - Intel using x87 instructions: NaN2 if both quiet, NaN1 if NaN2 is signalling
    - Intel using SSE instructions: NaN1
    - AMD using x87 instructions: NaN2
    - AMD using SSE instructions: NaN1
    - IBM Power PC: NaN1
    - IBM Z mainframe: NaN1 if both quiet, [precedence] to signalling NaN
    - ARM: NaN1 if both quiet, [precedence] to signalling NaN

    And adding one more not in that paper:
    - RISC-V: Always returns canonical NaN only, for Single: 0x7fc00000

    I'll just say whatever your NaN handling is, for the source code:

    A = B + C + D + E

    then for whatever values B,C,D,E having NaN or not, the value of A should
    be well defined and not dependent on the order of operations. How can you use bits in the NaN value for debugging if the hardware is returning arbitrary
    results when NaNs collide? Users have almost no control over whether
    A = B + C treats B as the first argument or the second.

    I think encoding stuff in NaN is a very 80's idea: turning on exceptions costs performance, so we want to debug after-the-fact using NaNs.
    But I think RISC-V has the right modern idea: make hardware fast so
    you can
    simply always enable Invalid Operation Traps (and maybe Overflow, if infinities are happening), and then stop right at the point of NaN being first created. So the NaN propagation doesn't matter.

    I think the common current debug strategy for NaNs is run at full speed
    with exceptions masked, and if you get NaNs in your answer, you re-run
    with exceptions on and then debug the traps that occur. And no one looks at the NaN values at all, just their presence.

    So rather than spending time on NaN encoding, make it so that FP performance is not affected by enabling exceptions, so we can skip the re-running step, and just run with Invalid Operations trapping enabled. And then just
    return canonical NaNs.

    Kent

    I do not know how one would make FP performance improve and have
    exceptions at the same time. The FP would have to operate asynchronously.
    The only thing I can think of is to have core(s) specifically dedicated
    to performance FP that do not service interrupts.

    Given that nobody looks at the NaN values it is tempting to leave out
    the NaN info, but I think I will still have it as an input to modules
    where NaNs can be generated (when I get around to it). The NaN info can
    always be set to zeros then and the extra logic should disappear then.

    I think that there may be a reason why nobody looks at the NaN values.
    IDK, but maybe the debugger does not make it easy to spot. A NaN display
    with a random assortment of digits is pretty useless. But if the debugger
    were to display all the address and other info, would it get used?


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 28 07:21:14 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
    instead of giving it a number of registers, there is a start register
    and a stop register, so 1-to-32 registers can be saved/restored. The
    immediate contains how much stack space to allocate/deallocate.

    That seems both confining for the compiler designers and less
    useful than the VAX-11 register mask stored in the instruction stream
    at the function entry point(s).

    We, and by that I mean Brian, have not found that so. In the early stages
    we did see a bit of that, and then Brian found a way to allocate registers from R31-down-to-R16 that fit the ENTER/EXIT model and we find essentially nothing (that is no more instructions in the stream than necessary).

    Part of the distinction is::
    a) how arguments/results are passed to/from subroutines.
    b) having a minimum of 7-temporary registers at entry point.
    c) how the stack frame is designed/allocated wrt:
    1) my arguments and my results,
    2) his arguments and his results,
    3) varargs,
    4) dynamic arrays on stack,
    5) stack frame allocation at ENTER,
    d) freedom to use R30 as FP or as joe-random-register.

    These were all co-designed together, after much of the instruction
    emission logic was sorted out.

    What is "my" and "his"?

    Consider this as a VAX CALL model except that the mask was replaced by
    a list of registers, which were then packed towards R31 instead of a bit vector.

    Do you need both a start and a stop register?

    As far as I understand, ENTER is at the entry point of the callee, and
    EXIT is before the return or tail call; actually, the tail call case
    answers my question above:

    If the tail-caller has m callee-saved registers and the tail-callee
    has n callee-saved registers, then

    if m>n, generate an EXIT that restores the m-n registers;
    if m<n, generate an ENTER that saves the n-m registers;
    Generate a jump to behind the ENTER instruction of the callee.

    That is, assuming that the tail-callee is in the same compilation unit
    as the tail-caller; otherwise the tail-caller needs to do a full EXIT
    and then jump to the normal entry point of the tail-callee, which does
    a full ENTER.

    And in these ENTERs and EXITs, you don't end (or start) at the same
    point as in the regular ENTERs and EXITs.
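
    As a sketch of that selection logic in C (emit_exit, emit_enter and
    emit_jump are placeholders for whatever the code generator really
    provides), the same-compilation-unit tail-call case is one comparison:

    #include <stdio.h>

    /* Placeholder emitters for the sketch. */
    static void emit_exit (int n) { printf("EXIT  restoring %d regs\n", n); }
    static void emit_enter(int n) { printf("ENTER saving %d regs\n", n); }
    static void emit_jump (const char *l) { printf("JMP   %s\n", l); }

    /* Same-compilation-unit tail call: the caller saved m callee-saved
       registers, the callee wants n; adjust the saved set, then jump to
       the point just past the callee's own ENTER. */
    static void emit_tail_call(int m, int n, const char *callee_body)
    {
        if (m > n)
            emit_exit(m - n);       /* drop the extra saves   */
        else if (m < n)
            emit_enter(n - m);      /* add the missing saves  */
        emit_jump(callee_body);
    }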

    And yes, for saving the callee-saved registers I don't see a need for
    a mask. For caller-saved registers, it's different. Consider:

    long foo(...)
    {
        long x = ...;
        long y = ...;
        long z = ...;
        if (...) {
            bar(...);
            x = ...;
        } else if (...) {
            baz(...);
            y = ...;
        } else {
            bla(...);
            z = ...;
        }
        return x+y+z;
    }

    Here one could put x, y, and z in callee-saved registers (and use ENTER
    and EXIT for them), but that would need to save and later restore
    three registers on every path through foo().

    Or one could put it in caller-saved registers and save only two
    registers on every path through foo(). Then one needs to save y and z
    around the call to bar(), x and z around the call to baz(), and x and
    y around the call to bla(). For any register allocation, in one of
    the cases the registers to be saved are not contiguous. So if one
    would use a save-multiple or load-multiple instruction for that, a
    mask would be needed.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Fri Nov 28 12:56:30 2025
    From Newsgroup: comp.arch

    Robert Finch wrote:
    On 2025-11-27 10:50 a.m., Kent Dickey wrote:

    I think encoding stuff in NaN is a very 80's idea: turning on exceptions
    costs performance, so we want to debug after-the-fact using NaNs.
    But I think RISC-V has the right modern idea: make hardware fast so
    you can
    simply always enable Invalid Operation Traps (and maybe Overflow, if
    infinities are happening), and then stop right at the point of NaN being
    first created. So the NaN propagation doesn't matter.

    I think the common current debug strategy for NaNs is run at full speed
    with exceptions masked, and if you get NaNs in your answer, you re-run
    with exceptions on and then debug the traps that occur. And no one
    looks at
    the NaN values at all, just their presence.

    So rather than spending time on NaN encoding, make it so that FP
    performance
    is not affected by enabling exceptions, so we can skip the re-running
    step,
    and just run with Invalid Operations trapping enabled. And then just
    return canonical NaNs.

    Kent

    I do not know how one would make FP performance improve and have
    exceptions at the same time. The FP would have to operate asynchronous.
    The only thing I can think of is to have core(s) specifically dedicated
    to performance FP that do not service interrupts.

    Why do you think that enabling FP exceptions "costs performance",
    by which I assume you mean that, say, an FPADD with exceptions
    enabled is slower than disabled?

    The FP exceptions are rising-edge triggered based on individual
    instruction calculation status, that is before being merged (OR'd)
    into the overall FP status. If an FP instruction has unmasked exceptions
    then mark the uOp as Except'd and recognize it at Retire like any
    other exception. This also assumes that the overall FP status is
    updated (merged) at Retire so it only contains status flags for
    FP instructions older than the exception point.
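
    In C-simulator terms - reusing the I->raised / t->raised idiom from
    earlier in the thread, with fp_flags and fp_enables as invented field
    names - the idea is roughly:

    #include <stdbool.h>
    #include <stdint.h>

    /* Per-uOp FP result status, produced by the functional unit. */
    typedef struct { uint8_t fp_flags; bool excepted; } UopFP;

    /* Architectural FP state: sticky status plus a trap-enable mask. */
    typedef struct { uint8_t fp_status; uint8_t fp_enables; } FpState;

    /* At execute: compare this instruction's own flags with the enables.
       No shared state is touched, so enabled traps add no serialization. */
    static void fp_execute(UopFP *u, const FpState *s, uint8_t result_flags)
    {
        u->fp_flags = result_flags;
        u->excepted = (result_flags & s->fp_enables) != 0;
    }

    /* At retire, in program order: an excepted uOp takes the trap, and the
       sticky status only holds flags from instructions older than it. */
    static bool fp_retire(const UopFP *u, FpState *s)
    {
        if (u->excepted)
            return true;
        s->fp_status |= u->fp_flags;
        return false;
    }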




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Nov 28 19:35:16 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-26 7:08 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-26 3:57 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    In this case, put the cause in a container the instruction drags down >>>>> the pipe, and retrieve it when you do have address access to where it >>>>> needs to go.

    I may change things to pass the address around in the float package. >>>> Putting the address into the NaN later may cause issues with timing. It >>>> adds a mux into things. May be better to use the original NaN mux in the >>>> float modules. May call it a NaN identity field instead of an address. >>>
    For example: when a My 66000 instruction needs to raise an exception
    the Inst *I argument contains a field I->raised which is set (1<<excpt) >>> and at the end of the pipe (at retire), t->raised |= I->raised. Where
    we have a *t there is also t->ip. So, you don't have to drag Thread *t >>> through all the subroutine calls, but you can easily access t->raised
    at the point you do have access to t->ip.

    Had trouble reading that, sounds like goobly-goop. But I believe I
    figured it out.

    Sounds like the address is inserted at the end of the pipe which I am
    sure is not the case.

    I figured this out: the NaN address must be embedded in the result by
    the time the result updates the bypass network and registers so that it
    is available to other instructions.

    The address is available at the start of the calc from the reservation
    station entry. Me thinks it must be embedded when the NaN result status
    is set, provided there is not already a NaN. The existing (first) NaN
    must propagate through.

    See last calculation line in the following::

    void RunInst( Chip *chip )
    {
    for( uint64_t i = 0; i < chip->cores; i++ )
    {
    ContextStack *cpu = &core[i];
    uint8_t cs = cpu->cs;
    Thread *t;
    Inst *I;
    uint16_t raised;

    if( cpu->interrupt.raised & ((((signed)1)<<63) >> cpu->priority) )
    { // take an interrupt
    cpu->cs = cpu->interrupt.cs;
    cpu->priority = cpu->interrupt.priority;
    t = context[cpu->cs];
    t->reg[0] = cpu->interrupt.message;
    }
    else if( raised = t->raised & t->enabled )
    { // take an exception
    cpu->cs--;
    t = context[cpu->cs];
    t->reg[0] = FT1( raised ) | EXCPT;
    t->reg[1] = I->inst;
    t->reg[2] = I->src1;
    t->reg[3] = I->src2;
    t->reg[4] = I->src3;
    }
    else
    { // run an instruction
    t = context[cpu->cs];
    memory( FETCH, t->ip, &I->inst );
    t->ip += 4;
    majorTable[ I->inst.major ]( t, I );
    t->raised |= I->raised; // propagate raised here
    }
    }
    }

    That looks like code for a simulator.

    It is (IS) code for a non-timing simulator {a "right answer" simulator
    if you please.}

    How closely does it follow the operation of the CPU?

    CPUs have a pipeline, I is the quantity that gets dragged down the
    pipe, *t is the control registers of that CPU.

    I do not see where 'I' is initialized.

    Call to memory(). Then as I gets dragged down the pipeline, more
    fields are initialized. I drag the whole structure mostly for
    debug purposes.

    It has been a while since I worked on simulator code.

    The IP value is just muxed in via a five-to-one mux for the significand.
    Had to account for NaNs, infinities and overflow anyway. The address gets propagated with some flops, but flops are inexpensive in an FPGA.

    always_comb
      casez({aNan5,bNan5,qNaNOutab5,aInf5,bInf5,overab5})
      6'b1?????: moab6 <= {1'b1,1'b1,a5[fp64Pkg::FMSB-1:0],{fp64Pkg::FMSB+1{1'b0}}};
      6'b01????: moab6 <= {1'b1,1'b1,b5[fp64Pkg::FMSB-1:0],{fp64Pkg::FMSB+1{1'b0}}};
      6'b001???: moab6 <= {1'b1,qNaN|(64'd4 << (fp64Pkg::FMSB-4))|adr5[63:16],{fp64Pkg::FMSB+1{1'b0}}}; // multiply inf * zero
      6'b0001??: moab6 <= 0; // mul inf's
      6'b00001?: moab6 <= 0; // mul inf's
      6'b000001: moab6 <= 0; // mul overflow
      default:   moab6 <= fractab5;
      endcase



    Modified NaN support in the float package to store to the HOBs.

    Survey says:

    The Qupls PUSH and POP instructions have room for six register fields.
    Should one of the fields be used to identify the stack pointer register
    allowing five registers to be pushed or popped? Or should the stack
    pointer register be assumed so that six registers may be pushed or popped?

    My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
    instead of giving it a number of registers, there is a start register
    and a stop register, so 1-to-32 registers can be saved/restored. The
    immediate contains how much stack space to allocate/deallocate.

    {{when Safe-Stack is enabled:: Rstart-to-R0 are placed on the inaccessible
    stack, while R1-to-Rstop are placed on the normal stack.}}

    Because the stack is always DoubleWord aligned, the 3-LoBs of the
    immediate are used to indicate "special" activities on a couple of
    registers {R0, R31, R30}: R31 is rarely saved and reloaded from the stack
    but just returned to its previous value by integer arithmetic. FP can
    be updated or it can be treated like "just another register". R0 can
    be loaded directly to t->ip, or loaded into R0 for stack walk-backs.

    The corresponding LDM and STM are seldom used.

    I ran out of micro-ops for ENTER and EXIT, so they only save the LR and
    FP (on the safe stack). A separate PUSH/POP on safe stack instruction is >> used.

    I figured LDM and STM are not used often enough. PUSH / POP is used in
    many places LDM / STM might be.

    Its a fine line.

    I found more uses for an instruction that moves a number of registers randomly allocated to fixed positions (arguments to a call) than to
    move random string of registers to/from memory.

    .
    MOV R1,R10
    MOV R2,R25
    MOV R3,R17
    CALL Subroutine
    . ; deal with any result


    My 66000 has an instruction to do that?

    No, but the thought that it could be profitable to have such an
    instruction is a common recurrence.

    I'd not seen an instruction like that. It is almost like a byte map. I can see how it could be done.
    Another instruction to add to the ISA. My compiler does not do such a
    nice job of packing the register moves together though.

    Your instruction size can support such a thing, mine would be difficult.

    For context switching a whole bunch of load / store instructions are
    used. There is context switching in only a couple of places.

    I use a cache-model for thread-state {program-status-line and the
    register file}.

    The high level simulator, leaves all of the context in memory without loading it or storing it. Thus this serves as a pipeline Oracle so if
    the OoO pipeline makes a timing error, the Oracle stops the thread in
    its tracks.

    Thus::

    .
    .
    -----interrupt detected
    . change CS (cs--) <---
    . access threadState[cs]
    . t->ip = dispatcher
    . t->reg[0] = why
    dispatcher in control
    .
    .
    .
    RET
    SVR
    .
    .

    In your typical interrupt/exception control transfers, there is
    no code to actually switch state. Just like there is no code to
    switch a cache line that takes a miss.

    The My 66000 hardware takes care of it automatically? Interrupts push
    and pop context in my system.

    Yes, context switching is automatic and re-entrant. Whereas exceptions
    walk up the privilege stack, interrupts go directly to the specified
    context on the stack. So, you could be operating at high privilege
    and low priority, only to get interrupted by lower privilege at higher priority.

    (*) The cs-- is all that is necessary to change from one Thread State
    to another in its entirety.

    I think the SP should be identified as PUSH / POP would be the only
    instructions assuming the SP register. Otherwise any register could be >>>> chosen by the compiler.

    I started with that philosophy--and begrudgingly went away from it as
    a) the compiler took form
    b) we started adding instructions to ISA to remove instructions from
    code footprint.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Nov 28 19:49:31 2025
    From Newsgroup: comp.arch


    kegs@provalid.com (Kent Dickey) posted:

    In article <1763868010-5857@newsgrouper.org>,
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Robert Finch <robfi680@gmail.com> posted:
    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.

    Suggest you read:: https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.

    I wasn't sure where to join the NaN conversation, but this seems like a
    good spot.

    We've had 40+ years of different architectures handling NaNs, (what to
    encode in them to indicate where the first problem occurred) and all architectures do something different when operating on two NaNs:

    From that paper:
    - Intel using x87 instructions: NaN2 if both quiet, NaN1 if NaN2 is signalling
    - Intel using SSE instructions: NaN1
    - AMD using x87 instructions: NaN2
    - AMD using SSE instructions: NaN1
    - IBM Power PC: NaN1
    - IBM Z mainframe: NaN1 if both quiet, [precedence] to signalling NaN
    - ARM: NaN1 if both quiet, [precedence] to signalling NaN

    And adding one more not in that paper:
    - RISC-V: Always returns canonical NaN only, for Single: 0x7fc00000

    I'll just say whatever your NaN handling is, for the source code:

    A = B + C + D + E

    then for whatever values B,C,D,E having NaN or not, the value of A should
    be well defined and not dependent on the order of operations.

    A nice philosophy, but how does one achieve that when the compiler is allowed to encode the above as::

    A = (B+C)+(D+E)
    or
    A = (B+D)+(C+E)
    or
    A = (B+E)+(C+D)
    or
    A = (B+C)+(E+D)
    or
    ...

    No single set of rules can give the first created NaN because which
    is first created is dependent on how the compiler ordered the FADDs.

    How can you
    use bits in the NaN value for debugging if the hardware is returning arbitrary
    results when NaNs collide?

    My 66000 has specific rules covering {Operand NaNs, Created NaNs}
    which attempt to preserve the earliest created NaN and to properly
    propagate Operand NaN values.

    Users have almost no control over whether
    A = B + C treats B as the first argument or the second.

    Optimizers treat B and C as independent optimization opportunities.

    I think encoding stuff in NaN is a very 80's idea: turning on exceptions costs performance, so we want to debug after-the-fact using NaNs.

    But I think RISC-V has the right modern idea: make hardware fast so you can simply always enable Invalid Operation Traps (and maybe Overflow, if infinities are happening), and then stop right at the point of NaN being first created. So the NaN propagation doesn't matter.

    This is a 1960s idea. Stop at the first occurrence of trouble. More
    workable than NaNs, but has its own set of baggage--for example how
    does one stop 13 elements into a Vector instruction ???

    {{BTW: My 66000 has a way to scalarize vector code 13 elements into
    the vector, and after the exception has been handled, to reenter
    vector operation.}}

    I think the common current debug strategy for NaNs is run at full speed
    with exceptions masked, and if you get NaNs in your answer, you re-run
    with exceptions on and then debug the traps that occur. And no one looks at the NaN values at all, just their presence.

    Yes, this is a common strategy, and with the list of architectures that
    "all do it differently" what else could one expect.

    So rather than spending time on NaN encoding, make it so that FP performance is not affected by enabling exceptions, so we can skip the re-running step, and just run with Invalid Operations trapping enabled. And then just
    return canonical NaNs.

    My 66000 has that option available.

    Kent
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Nov 28 20:05:00 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    My 66000 ENTER and EXIT instruction use SP == R31 implicitly. But,
    instead of giving it a number of registers, there is a start register
    and a stop register, so 1-to-32 registers can be saved/restored. The
    immediate contains how much stack space to allocate/deallocate.

    That seems both confining for the compiler designers and less
    useful than the VAX-11 register mask stored in the instruction stream
    at the function entry point(s).

    We, and by that I mean Brian, have not found that so. In the early stages we did see a bit of that, and then Brian found a way to allocate registers from R31-down-to-R16 that fit the ENTER/EXIT model and we find essentially nothing (that is no more instructions in the stream than necessary).

    Part of the distinction is::
    a) how arguments/results are passed to/from subroutines.
    b) having a minimum of 7-temporary registers at entry point.
    c) how the stack frame is designed/allocated wrt:
    1) my arguments and my results,
    2) his arguments and his results,
    3) varargs,
    4) dynamic arrays on stack,
    5) stack frame allocation at ENTER,
    d) freedom to use R30 as FP or as joe-random-register.

    These were all co-designed together, after much of the instruction emission logic was sorted out.

    What is "my" and "his"?

    My arguments are the arguments to me (this subroutine)
    His arguments are the arguments to subroutines I call

    Consider this as a VAX CALL model except that the mask was replaced by
    a list of registers, which were then packed towards R31 instead of a bit vector.

    Do you need both a start and a stop register?

    Consider:
    ENTER R19,R31,#constant
    versus
    ENTER R19,R0,#constant

    The former saves R19-through-R31 and leaves the return address in R0.

    The latter saves R19-through-R0, leaving the return address on the stack.

    This should illustrate that the stopping register is compiler chosen.
    It is obvious that the starting point should be compiler chosen.
    Thus, start and stop are independent.

    Now Consider:
    ENTER R19,R9,#constant

    Not only are R19-R0 saved on the stack, R1-R9 are saved on the stack immediately preceding the memory based arguments, thus varargs only
    changes the stop register in ENTER; and this makes a linear vector
    of arguments for valist.
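
    A small C model of which registers ENTER Rstart,Rstop covers, as read
    from the examples above (illustrative only, not the official My 66000
    definition): the set runs from Rstart upward, wrapping from R31 to R0,
    through Rstop inclusive.

        #include <stdio.h>

        static void enter_save_set(int start, int stop)
        {
            int r = start;
            for (;;) {
                printf("save R%d\n", r);
                if (r == stop)
                    break;
                r = (r + 1) % 32;          /* wrap R31 -> R0 */
            }
        }

        int main(void)
        {
            enter_save_set(19, 31);        /* R19..R31                   */
            enter_save_set(19, 0);         /* R19..R31, then R0          */
            enter_save_set(19, 9);         /* R19..R31, R0..R9 (varargs) */
            return 0;
        }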

    As far as I understand, ENTER is at the entry point of the callee, and
    EXIT is before the return or tail call; actually, the tail call case
    answers my question above:

    If the tail-caller has m callee-saved registers and the tail-callee
    has n callee-saved registers, then

    if m>n, generate an EXIT that restores the m-n registers;
    if m<n, generate an ENTER that saves the n-m registers;
    Generate a jump to behind the ENTER instruction of the callee.

    The above sounds complicated enough to simply avoid the tail-call
    optimization if the argument lists are not similar enough.

    That is, assuming that the tail-callee is in the same compilation unit
    as the tail-caller; otherwise the tail-caller needs to do a full EXIT
    and then jump to the normal entry point of the tail-callee, which does
    a full ENTER.

    And in these ENTERs and EXITs, you don't end (or start) at the same
    point as in the regular ENTERs and EXITs.

    And yes, for saving the callee-saved registers I don't see a need for
    a mask. For caller-saved registers, it's different. Consider:

    long foo(...)
    {
        long x = ...;
        long y = ...;
        long z = ...;
        if (...) {
            bar(...);
            x = ...;
        } else if (...) {
            baz(...);
            y = ...;
        } else {
            bla(...);
            z = ...;
        }
        return x+y+z;
    }

    Here one could put x, y, and z in callee-saved registers (and use ENTER
    and EXIT for them), but that would need to save and later restore
    three registers on every path through foo().

    Or one could put it in caller-saved registers and save only two
    registers on every path through foo(). Then one needs to save y and z
    around the call to bar(), x and z around the call to baz(), and x and
    y around the call to bla(). For any register allocation, in one of
    the cases the registers to be saved are not contiguous. So if one
    would use a save-multiple or load-multiple instruction for that, a
    mask would be needed.

    There is a delicate balance between callee-save and caller-save
    registers. In many situations caller-save is better (counting
    instructions) but callee-save is better (counting cycles--mostly
    due to second order cache effects).

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Nov 28 20:09:12 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-27 10:50 a.m., Kent Dickey wrote:
    In article <1763868010-5857@newsgrouper.org>,
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Robert Finch <robfi680@gmail.com> posted:
    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.

    Suggest you read::
    https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.

    I wasn't sure where to join the NaN conversation, but this seems like a good spot.

    We've had 40+ years of different architectures handling NaNs, (what to encode in them to indicate where the first problem occurred) and all architectures do something different when operating on two NaNs:

    From that paper:
    - Intel using x87 instructions: NaN2 if both quiet, NaN1 if NaN2 is signalling
    - Intel using SSE instructions: NaN1
    - AMD using x87 instructions: NaN2
    - AMD using SSE instructions: NaN1
    - IBM Power PC: NaN1
    - IBM Z mainframe: NaN1 if both quiet, [precedence] to signalling NaN
    - ARM: NaN1 if both quiet, [precedence] to signalling NaN

    And adding one more not in that paper:
    - RISC-V: Always returns canonical NaN only, for Single: 0x7fc00000

    I'll just say whatever your NaN handling is, for the source code:

    A = B + C + D + E

    then for whatever values B,C,D,E having NaN or not, the value of A should be well defined and not dependent on the order of operations. How can you use bits in the NaN value for debugging if the hardware is returning arbitrary
    results when NaNs collide? Users have almost no control over whether
    A = B + C treats B as the first argument or the second.

    I think encoding stuff in NaN is a very 80's idea: turning on exceptions costs performance, so we want to debug after-the-fact using NaNs.
    But I think RISC-V has the right modern idea: make hardware fast so
    you can
    simply always enable Invalid Operation Traps (and maybe Overflow, if infinities are happening), and then stop right at the point of NaN being first created. So the NaN propagation doesn't matter.

    I think the common current debug strategy for NaNs is run at full speed with exceptions masked, and if you get NaNs in your answer, you re-run
    with exceptions on and then debug the traps that occur. And no one looks at
    the NaN values at all, just their presence.

    So rather than spending time on NaN encoding, make it so that FP performance
    is not affected by enabling exceptions, so we can skip the re-running step, and just run with Invalid Operations trapping enabled. And then just return canonical NaNs.

    Kent

    I do not know how one would make FP performance improve and have
    exceptions at the same time. The FP would have to operate asynchronously.

    What is it that you fail to understand what reservation stations do
    to instructions arriving at various FPUs !?!?! The stations effectively
    turn the FPUs into asynchronous calculation units.

    The only thing I can think of is to have core(s) specifically dedicated
    to performance FP that do not service interrupts.

    Given that nobody looks at the NaN values it is tempting to leave out
    the NaN info, but I think I will still have it as an input to modules
    where NaNs can be generated (when I get around to it). The NaN info can always be set to zeros, and the extra logic should then disappear.

    I think that there may be a reason why nobody looks at the NaN values.
    IDK, but maybe the debugger does not make it easy to spot. A NaN display
    with a random assortment of digits is pretty useless. But if the debugger were
    to display the address and other info, would it get used?

    That is the idea behind the why code and the IP in My 66000 NaNs.

    I still do not think they will be used "all that often" simply because
    so many other ways to generate and propagate NaNs exist--and there is
    no "universal" consensus.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Nov 28 20:39:07 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Just today, I compiled

    u4 = u1/10;
    u3 = u1%10;


    (plus some surrounding code) with gcc-14 in three contexts. Here's
    the code for two of them (the third one is similar to the second one):

    Care to present a self-contained example? Otherwise, your
    example and its analysis are meaningless to the reader.

    I doubt that a self-contained example will be more meaningful to all
    but the most determined readers, but anyway, the preprocessed C code is at

    https://www.complang.tuwien.ac.at/anton/tmp/engine-fast.i

    Interesting test case. You might be interested to know that there
    is some improvement. With a relatively recent trunk, gcc compiles
    the offending sequence to

    movabsq $-3689348814741910323, %rax
    movq %r13, %rcx
    mulq %r13
    movq %rdx, %r13
    shrq $3, %rdx
    shrq $3, %r13
    movq %rdx, %r9
    leaq 0(%r13,%r13,4), %rax
    addq %rax, %rax
    subq %rax, %rcx
    movq %rcx, %r13

    There is improvement (only a single mulq) but the two shrq
    instructions are clearly redundant, so there is still some
    confusion there.
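
    For reference, a hedged C sketch of the strength reduction being aimed
    at (the movabsq immediate above is 0xCCCCCCCCCCCCCCCD, i.e. 2^67/10
    rounded up, so one multiply-high plus a shift by 3 gives the quotient,
    and a multiply-and-subtract recovers the remainder; uses GCC's
    __int128, and the function name is made up):

        #include <stdint.h>

        void div10(uint64_t u1, uint64_t *quot, uint64_t *rem)
        {
            /* high 64 bits of u1 * ceil(2^67 / 10) */
            uint64_t hi = (uint64_t)(((unsigned __int128)u1
                                       * 0xCCCCCCCCCCCCCCCDull) >> 64);
            *quot = hi >> 3;            /* u1 / 10 */
            *rem  = u1 - *quot * 10;    /* u1 % 10 */
        }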

    Unfortunately, the usual tool for reducing test cases to something
    manageable (cvise) failed because of the size of the test case
    (32 GB of main memory were not enough) and maybe also because cvise may not
    be well suited to the style of programming with goto labels and
    interspersed assembler statements, lots of them. (Looking at your
    code, it also does not seem to be self-sufficient, at least the
    numerous SKIP4 statements require something else).

    My assumption is that the control flow is confusing gcc. For this
    to be fixed, somebody with knowledge of the code would need to
    cut this down to something that still exhibits the behavior, and
    that can be reduced further with cvise (or delta, but cvise is
    usually much better).
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Nov 28 20:41:48 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Robert Finch wrote:
    On 2025-11-27 10:50 a.m., Kent Dickey wrote:

    I think encoding stuff in NaN is a very 80's idea: turning on exceptions costs performance, so we want to debug after-the-fact using NaNs.
    But I think RISC-V has the right modern idea: make hardware fast so
    you can
    simply always enable Invalid Operation Traps (and maybe Overflow, if
    infinities are happening), and then stop right at the point of NaN being first created. So the NaN propagation doesn't matter.

    I think the common current debug strategy for NaNs is run at full speed
    with exceptions masked, and if you get NaNs in your answer, you re-run
    with exceptions on and then debug the traps that occur. And no one
    looks at
    the NaN values at all, just their presence.

    So rather than spending time on NaN encoding, make it so that FP
    performance
    is not affected by enabling exceptions, so we can skip the re-running
    step,
    and just run with Invalid Operations trapping enabled. And then just
    return canonical NaNs.

    Kent

    I do not know how one would make FP performance improve and have exceptions at the same time. The FP would have to operate asynchronous. The only thing I can think of is to have core(s) specifically dedicated
    to performance FP that do not service interrupts.

    Why do you think that enabling FP exceptions "costs performance",
    by which I assume you mean that, say, an FPADD with exceptions
    enabled is slower than disabled?

    It is the control transfer to and from the handler on the occurrence
    of an exception that diminishes performance; and the time consumed
    by the handler itself. The enabled and disabled FPU takes the same
    time regardless of whether an exception transpired or not.

    The FP exceptions are rising-edge triggered based on individual
    instruction calculation status, that is before being merged (OR'd)
    into the overall FP status. If an FP instruction has unmasked exceptions
    then mark the uOp as Except'd and recognize and order it at Retire like any
    other exception. This also assumes that the overall FP status is
    updated (merged) at Retire so it only contains status flags for
    FP instructions older than the retire point.




    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Nov 28 23:06:45 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    (Looking at your
    code, it also does not seem to be self-sufficient, at least the
    numerous SKIP4 statements require something else).

    If you want to assemble the resulting .S file, it's assembled once
    with

    -DSKIP4= -Dgforth_engine2=gforth_engine

    and once with

    -DSKIP4=".skip 4"

    (on Linux-GNU AMD64, the .skip assembler directive is autoconfigured
    and may be different on other platforms).

    My assumption is that the control flow is confusing gcc.

    My guess is the same.

    For this
    to be fixed, somebody with knowledge of the code would need to
    cut this down to something that still exhibits the behavior, and
    that can be reduced further with cvise (or delta, but cvise is
    usually much better).

    Everything from

    H_<name1>:

    to the next

    H_<name2>:

    is one implementation of a VM instruction. You can remove a VM instruction's machine instructions and the references to the labels in the tables at the
    start of gforth_engine(), and the thing should still compile, and
    ideally the code for all the other VM instructions should be
    unchanged.

    In the extreme, you could remove everything but H_ten_u_slash_mod and
    the code up to the next H_..., but my guess is that you need more VM instruction implementations to produce the not-so-great code.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Nov 29 09:29:01 2025
    From Newsgroup: comp.arch

    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count
    delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. An interrupt gets deferred because interrupts got disabled by an instruction in the pipeline. I
    guessed 40 instructions would likely be enough for many cases where IRQs
    are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the CPU
    would advance in 40 instruction burps. Alternating between fetching ISR instructions and the desired instruction stream. On the other hand, a
    larger down-count starts to impact the IRQ latency.

    Tradeoffs…

    I suppose I could have the CPU increase the down-count if it is looping
    around fetching ISR instructions. The down-count would be reset to the
    minimum again once an interrupt enable instruction is executed.

    Complex…
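
    A rough C model of the down-count as described (names, the constant,
    and the trace are illustrative only, not the Qupls4 RTL):

        #include <stdio.h>

        #define IRQ_DEFER_CYCLES 10

        static unsigned defer_count;

        /* evaluated once per front-end advance */
        static int accept_irq(int irq_pending, int di_in_pipeline)
        {
            if (defer_count) {
                defer_count--;                     /* still backing off        */
                return 0;
            }
            if (irq_pending && di_in_pipeline) {
                defer_count = IRQ_DEFER_CYCLES;    /* defer and start back-off */
                return 0;
            }
            return irq_pending;                    /* safe to redirect fetch   */
        }

        int main(void)
        {
            /* toy trace: an IRQ is always pending, DI in flight for 3 cycles */
            for (int cycle = 0; cycle < 16; cycle++)
                printf("cycle %2d: accept=%d\n", cycle, accept_irq(1, cycle < 3));
            return 0;
        }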

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Sat Nov 29 07:37:20 2025
    From Newsgroup: comp.arch

    On 11/29/2025 6:29 AM, Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being deferred because interrupts got disabled by an instruction in the pipeline. I
    guessed 40 instructions would likely be enough for many cases where IRQs
    are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the CPU would advance in 40 instruction burps. Alternating between fetching ISR instructions and the desired instruction stream. On the other hand, a
    larger down-count starts to impact the IRQ latency.

    Tradeoffs…

    I suppose I could have the CPU increase the down-count if it is looping around fetching ISR instructions. The down-count would be reset to the minimum again once an interrupt enable instruction is executed.

    Complex…

    A simple alternative that I have seen is to have an instruction that
    enables interrupts and jumps to somewhere, probably either the
    interrupted code or the dispatcher that might do a full context switch.
    The ISR would issue this instruction when it has saved everything that
    is necessary to handle the interrupt and thus could be interrupted
    again. This minimizes the time interrupts are locked out without the
    need for an arbitrary timer, etc.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Sat Nov 29 15:48:22 2025
    From Newsgroup: comp.arch

    In article <1764359371-5857@newsgrouper.org>,
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    kegs@provalid.com (Kent Dickey) posted:

    In article <1763868010-5857@newsgrouper.org>,
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:

    Robert Finch <robfi680@gmail.com> posted:
    My float package puts the cause in the 3 LoBs. The cause is always in
    the low order bits of the register then, even when the precision is
    different. But the address is not tracked. The package does not have
    access to the address. Seems like NaN trace hardware might be useful.

    Suggest you read::
    https://grouper.ieee.org/groups/msc/ANSI_IEEE-Std-754-2019/background/nan-propagation.pdf
    For conversation about LoBs versus HoBs.
    [snip]
    I'll just say whatever your NaN handling is, for the source code:

    A = B + C + D + E

    then for whatever values B,C,D,E having NaN or not, the value of A should
    be well defined and not dependent on the order of operations.

    A nice philosophy, but how does one achieve that when the compiler is allowed to encode the above as::

    A = (B+C)+(D+E)
    or
    A = (B+D)+(C+E)
    or
    A = (B+E)+(C+D)
    or
    A = (B+C)+(E+D)
    or
    ...

    No single set of rules can give the first created NaN because which
    is first created is dependent on how the compiler ordered the FADDs.

    This is my point: I don't see a great way to encode the first NaN, which
    is why I propose not making that a goal. You're not getting the first
    NaN in any case even if you try to do so in hardware, since the order of operations is a fragile thing that's hard to control unless you write
    assembly code, or the most tedious source code imaginable.

    Several rules easily satisfy my property: canonical NaN (always return 0x7fc00000 as the result of any invalid op or any operation involving a
    NaN), or Max(NaN.mantissa), where you return the largest mantissa value
    of any NaN. An OR of the NaN mantissas also works. This lets you at
    least encode the most serious NaN if you order them, or lets you know
    all the different invalid ops that occurred with the OR of flags stored
    in the mantissa.
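
    A software model of that OR rule might look like this (single
    precision, purely illustrative; no shipping FPU is claimed to do
    exactly this, and it only applies once at least one operand is a NaN):

        #include <stdint.h>

        /* true if the bit pattern is any NaN: exponent all ones, mantissa != 0 */
        static int is_nan_bits(uint32_t b)
        {
            return (b & 0x7F800000u) == 0x7F800000u && (b & 0x007FFFFFu) != 0;
        }

        /* quiet-NaN head, OR of the operands' payloads: the result is the
           same whichever operand the compiler happened to put first */
        uint32_t nan_result_bits(uint32_t a, uint32_t b)
        {
            uint32_t payload = 0;
            if (is_nan_bits(a)) payload |= a & 0x003FFFFFu;
            if (is_nan_bits(b)) payload |= b & 0x003FFFFFu;
            return 0x7FC00000u | payload;
        }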

    But canonical NaN is so much simpler. There's no need to preserve and
    mux around the NaN mantissas, which might save a tiny amount of datapath
    logic in FP units.

    Perhaps clever algorithms involving integer ops on FP values will come
    around and we'll WANT to have simpler FP handling so the integer
    accelerations will be easier to get right.

    Kent
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Nov 29 13:28:58 2025
    From Newsgroup: comp.arch

    On 2025-11-29 10:37 a.m., Stephen Fuld wrote:
    On 11/29/2025 6:29 AM, Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-
    count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being
    deferred because interrupts got disabled by an instruction in the
    pipeline. I guessed 40 instructions would likely be enough for many
    cases where IRQs are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not
    be committed because the IRQs got disabled in the meantime. If the CPU
    were allowed to accept another IRQ right away, it could get stuck in a
    loop flushing the pipeline and reloading with the ISR routine code
    instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be
    programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the
    CPU would advance in 40 instruction burps. Alternating between
    fetching ISR instructions and the desired instruction stream. On the
    other hand, a larger down-count starts to impact the IRQ latency.

    Tradeoffs…

    I suppose I could have the CPU increase the down-count if it is
    looping around fetching ISR instructions. The down-count would be
    reset to the minimum again once an interrupt enable instruction is
    executed.

    Complex…

    A simple alternative that I have seen is to have an instruction that
    enables interrupts and jumps to somewhere, probably either the
    interrupted code or the dispatcher that might do a full context switch.
     The ISR would issue this instruction when it has saved everything that
    is necessary to handle the interrupt and thus could be interrupted
    again.  This minimized the time interrupts are locked out without the
    need for an arbitrary timer, etc.



    That is a decent idea. A special jump and disable interrupts instruction
    to the next instruction might do it. The pipeline needs to be cleared of
    the external interrupt when interrupts are disabled, and the address
    reset. The issue then is that the interrupt gets lost, so it needs to be cached somewhere so that once interrupts are enabled again it can be processed. There could be multiple interrupts in the pipeline that need
    to be cached.

    Seeing as the address needs to be reset, an explicit jump instruction
    may not be necessary. The IP of the interrupted instruction could be used.

    I see now that a stack might be better than a FIFO as only a higher
    priority interrupt would be able to interrupt the lower one. Should they
    be processed in order of occurrence? Order of occurrence = FIFO,
    otherwise stack = FILO. Leave it to the user to decide? Out of order asynchronous interrupts probably are not a big deal. Hardware likely
    does not know what the order is, or care about it.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 29 19:05:00 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being deferred because interrupts got disabled by an instruction in the pipeline. I
    guessed 40 instructions would likely be enough for many cases where IRQs
    are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.

    The above is one of the reasons EricP supports the pipeline notion that interrupts do NOT flush the pipe. Instead, the instructions in the pipe
    are allowed to retire (apace) and new instructions are inserted from
    the interrupt service point. As long as the instructions "IN" the pipe
    can deliver their results to their registers, and update µArchitectural
    state they "own", there is no reason to flush--AND--no corresponding
    reason to delay "taking" the interrupt.

    At the µArchitectural level, you, the designer, see both the front
    and the end of the pipeline, you can change what goes in the front
    and allow what was already in the pipe to come out the back. This
    requires dragging a small amount of information down the pipe, much
    like multi-threaded CPUs.

    I could create a control register for this count and allow it to be programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the CPU would advance in 40 instruction burps. Alternating between fetching ISR instructions and the desired instruction stream. On the other hand, a
    larger down-count starts to impact the IRQ latency.

    Tradeoffs…

    I suppose I could have the CPU increase the down-count if it is looping around fetching ISR instructions. The down-count would be reset to the minimum again once an interrupt enable instruction is executed.

    Complex…

    Make the problem "go away". You will be happier in the end.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Nov 29 19:11:30 2025
    From Newsgroup: comp.arch

    Kent Dickey <kegs@provalid.com> schrieb:

    This is my point: I don't see a great way to encode the first NaN, which
    is why I propose not making that a goal. You're not getting the first
    NaN in any case even if you try to do so in hardware, since the order of operations is a fragile thing that's hard to control unless you write assembly code, or the most tedious source code imaginable.

    Using Fortran, parentheses have to be honored. If you write

    A = (B + C) + (D + E)

    then B + C and D + E have to be calculated before the total sum.
    If you write

    A = B + (C + (D + E))

    then you prescribe the order completely.

    I can imagine source code that is much more tedious than this :-)
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 29 19:23:03 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 11/29/2025 6:29 AM, Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being deferred because interrupts got disabled by an instruction in the pipeline. I guessed 40 instructions would likely be enough for many cases where IRQs are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the CPU would advance in 40 instruction burps. Alternating between fetching ISR instructions and the desired instruction stream. On the other hand, a larger down-count starts to impact the IRQ latency.

    Tradeoffs…

    I suppose I could have the CPU increase the down-count if it is looping around fetching ISR instructions. The down-count would be reset to the minimum again once an interrupt enable instruction is executed.

    Complex…

    A simple alternative that I have seen is to have an instruction that
    enables interrupts and jumps to somewhere, probably either the
    interrupted code or the dispatcher that might do a full context switch.
    The ISR would issue this instruction when it has saved everything that
    is necessary to handle the interrupt and thus could be interrupted
    again. This minimized the time interrupts are locked out without the
    need for an arbitrary timer, etc.

    Another alternative is to allow ISRs to be interrupted by ISRs of higher priority. All you need here is a clean and precise definition of priority
    and when said priority gets associated with any given interrupt.

    My 66000 goes so far as to never need to disable interrupts because all interrupts of the same or lower priority are automatically disabled by
    the priority of the current ISR/running-thread. That is, one arrives
    at the ISR with interrupts enabled and in a reentrant state with the
    priority given by the I/O MMU when device sent ISR message to MSI-X
    queue.

    If/when an ISR needs to be sure it is not interrupted, it can change
    priority in 1 instruction to "highest" and have the system not allow
    the I/O MMU to associate said "exclusive" priority with any device
    interrupt. When ISR returns, priority reverts to priority at the time
    the interrupt was taken. {No need to back down on priority} This only
    requires that there are enough priorities to spare one exclusively to
    the system.

    EricP has argued that 8-I/O priority levels are enough. I argue that
    64 priority levels are enough for {Guest OS, Host OS, HyperVisor}
    to each have their own somewhat-coordinated structure of priorities.
    AND further I argue that given one is designing a 64-bit machine,
    that 64 priority levels are de rigueur.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sat Nov 29 15:08:05 2025
    From Newsgroup: comp.arch

    Thomas Koenig wrote:
    Kent Dickey <kegs@provalid.com> schrieb:

    This is my point: I don't see a great way to encode the first NaN, which
    is why I propose not making that a goal. You're not getting the first
    NaN in any case even if you try to do so in hardware, since the order of
    operations is a fragile thing that's hard to control unless you write
    assembly code, or the most tedious source code imaginable.

    Using Fortran, parentheses have to be honored. If you write

    A = (B + C) + (D + E)

    then B + C and D + E have to be calculated before the total sum.
    If you write

    A = B + (C + (D + E))

    then you prescribe the order completetely.

    I can imagine source code that is much more tedious than this :-)

    That doesn't control which variable is assigned to each source operand.
    If both operands were NaNs and the two-NaN rule was "always take src1"
    then the choice of which to propagate would still be non-deterministic.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Nov 29 15:42:13 2025
    From Newsgroup: comp.arch

    On 2025-11-29 2:05 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count
    delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being deferred
    because interrupts got disabled by an instruction in the pipeline. I
    guessed 40 instructions would likely be enough for many cases where IRQs
    are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not be
    committed because the IRQs got disabled in the meantime. If the CPU were
    allowed to accept another IRQ right away, it could get stuck in a loop
    flushing the pipeline and reloading with the ISR routine code instead of
    progressing through the code where IRQs were disabled.

    The above is one of the reasons EricP supports the pipeline notion that interrupts do NOT flush the pipe. Instead, the instruction in the pipe
    are allowed to retire (apace) and new instructions are inserted from
    the interrupt service point.

    That is how Qupls is working too. The issue is what happens when the instruction in the pipe before the ISR disables the interrupt. Then the
    ISR instructions need to be flushed.

    As long as the instructions "IN" the pipe
    can deliver their results to their registers, and update µArchitectural state they "own", there is no reason to flush--AND--no corresponding
    reason to delay "taking" the interrupt.

    That is the usual case for Qupls too when there is an interrupt.

    At the µArchitectural level, you, the designer, see both the front
    and the end of the pipeline, you can change what goes in the front
    and allow what was already in the pipe to come out the back. This
    requires dragging a small amount of information down the pipe, much
    like multi-threaded CPUs.

    Yes, the IRQ info is being dragged down the pipe.

    I could create a control register for this count and allow it to be
    programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the CPU
    would advance in 40 instruction burps. Alternating between fetching ISR
    instructions and the desired instruction stream. On the other hand, a
    larger down-count starts to impact the IRQ latency.

    Tradeoffs…

    I suppose I could have the CPU increase the down-count if it is looping
    around fetching ISR instructions. The down-count would be reset to the
    minimum again once an interrupt enable instruction is executed.

    Complex…

    Make the problem "go away". You will be happier in the end.

    The interrupt mask is set at fetch time to disable lower priority
    interrupts. I suppose disabling of interrupts by the OS could simply be ignored. The interrupt could only be taken if it is a higher priority
    than the current level.

    I had thought the OS might have good reason to disable interrupts. But
    maybe I am making things too complex.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sat Nov 29 16:10:45 2025
    From Newsgroup: comp.arch

    Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being deferred because interrupts got disabled by an instruction in the pipeline. I
    guessed 40 instructions would likely be enough for many cases where IRQs
    are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the CPU would advance in 40 instruction burps. Alternating between fetching ISR instructions and the desired instruction stream. On the other hand, a
    larger down-count starts to impact the IRQ latency.

    Tradeoffs…

    I suppose I could have the CPU increase the down-count if it is looping around fetching ISR instructions. The down-count would be reset to the minimum again once an interrupt enable instruction is executed.

    Complex…


    You are using this timer to predict the delay for draining the pipeline.
    It would only take a read of a slow IO device register to exceed it.

    I was thinking a simple and cheap way would be to use a variation of the single-step mechanism. An interrupt request would cause Decode to emit a special uOp with the single-step flag set and then stall, to allow the
    pipeline to drain the old stream before accepting the interrupt and
    redirecting Fetch to its handler. That way if there are any interrupt
    enable or disable instructions, or branch mispredicts, or pending exceptions in flight, they are all allowed to finish and the state to settle down.

    Pipelining interrupt delivery looks possible but gets complicated and
    expensive real quick.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 29 22:07:04 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Thomas Koenig wrote:
    Kent Dickey <kegs@provalid.com> schrieb:

    This is my point: I don't see a great way to encode the first NaN, which is why I propose not making that a goal. You're not getting the first
    NaN in any case even if you try to do so in hardware, since the order of operations is a fragile thing that's hard to control unless you write
    assembly code, or the most tedious source code imaginable.

    Using Fortran, parentheses have to be honored. If you write

    A = (B + C) + (D + E)

    then B + C and D + E have to be calculated before the total sum.
    If you write

    A = B + (C + (D + E))

    then you prescribe the order completetely.

    I can imagine source code that is much more tedious than this :-)

    That doesn't control which variable is assigned to each source operand.
    If both operands were Nan's and the two-Nan-rule was "always take src1"
    then the choice of which to propagate would still be non-deterministic.


    In addition, the compiler is still allowed to perform the FORTRAN
    equation as::

    A = (C + B) + (E + D)

    instead of the way expressed in ASCII.

    Parentheses order calculations, not operands.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 29 22:17:36 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-29 2:05 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count
    delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being deferred
    because interrupts got disabled by an instruction in the pipeline. I
    guessed 40 instructions would likely be enough for many cases where IRQs are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop
    flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.

    The above is one of the reasons EricP supports the pipeline notion that interrupts do NOT flush the pipe. Instead, the instruction in the pipe
    are allowed to retire (apace) and new instructions are inserted from
    the interrupt service point.

    That is how Qupls is working too. The issue is what happens when the instruction in the pipe before the ISR disables the interrupt. Then the
    ISR instructions need to be flushed.

    As a general rule of thumb:: an instruction is not "performed" until
    after it retires. {when you cannot undo its deeds}

    Consider the case where you redirect the front of the pipe to an ISR and
    an instruction already in the pipe raises an exception. Here, what I do
    {and have done in the past} is to not retire instructions after the
    exception, so the ISR is not delayed and IP ends up pointing at the
    excepting instruction.

    Since you started ISR before you retired DI, you can treat DI as an
    exception. {DI after ISR control transfer}. If, on the other hand,
    you perform DI at the front of the pipe, you don't "accept" the ISR
    until EI.

    As long as the instructions "IN" the pipe
    can deliver their results to their registers, and update µArchitectural state they "own", there is no reason to flush--AND--no corresponding
    reason to delay "taking" the interrupt.

    That is the usual case for Qupls too when there is an interrupt.

    At the µArchitectural level, you, the designer, see both the front
    and the end of the pipeline, you can change what goes in the front
    and allow what was already in the pipe to come out the back. This
    requires dragging a small amount of information down the pipe, much
    like multi-threaded CPUs.

    Yes, the IRQ info is being dragged down the pipe.

    I could create a control register for this count and allow it to be
    programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the CPU
    would advance in 40 instruction burps. Alternating between fetching ISR
    instructions and the desired instruction stream. On the other hand, a
    larger down-count starts to impact the IRQ latency.

    Tradeoffs…

    I suppose I could have the CPU increase the down-count if it is looping
    around fetching ISR instructions. The down-count would be reset to the
    minimum again once an interrupt enable instruction is executed.

    Complex…

    Make the problem "go away". You will be happier in the end.

    The interrupt mask is set at fetch time to disable lower priority interrupts. I suppose disabling of interrupts by the OS could simply be ignored. The interrupt could only be taken if it is a higher priority
    than the current level.

    I had thought the OS might have good reason to disable interrupts. But
    maybe I am making things too complex.


    The OS DOES have good reasons to DI "every once in a while", IIRC my conversations with EricP, these are short sequences the OS needs
    to be ATOMIC across all OS threads--and almost always without the
    possibility that the ATOMIC event fails {which can happen in user code}.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 29 22:26:21 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being deferred because interrupts got disabled by an instruction in the pipeline. I guessed 40 instructions would likely be enough for many cases where IRQs are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not be committed because the IRQs got disabled in the meantime. If the CPU were allowed to accept another IRQ right away, it could get stuck in a loop flushing the pipeline and reloading with the ISR routine code instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the CPU would advance in 40 instruction burps. Alternating between fetching ISR instructions and the desired instruction stream. On the other hand, a larger down-count starts to impact the IRQ latency.

    Tradeoffs…

    I suppose I could have the CPU increase the down-count if it is looping around fetching ISR instructions. The down-count would be reset to the minimum again once an interrupt enable instruction is executed.

    Complex…


    You are using this timer to predict the delay for draining the pipeline.
    It would only take a read of a slow IO device register to exceed it.

    Yes, exactly::

    Consider a GBOoO processor that performs a LD R9,[deviceCR].

    a) all earlier memory references have to be seen globally
    ...before this LD can be seen globally. {dozens of cycles}
    b) this LD has to arrive at HostBridge. {dozens of cycles}
    c) HostBrdge sends request down PCIe {hundreds of cycles}
    d) device responds to LD {handful of cycles}
    e) PCIe transports response to HB {hundreds of cycles}
    f) HB transfers response to requestor {dozens of cycles}
    g) CPU is allowed to re-enter OoO {handful of cycles}

    Accesses to devices need to have most of the properties of
    "Sequential Consistency" as defined by Lamport.

    Now, several LDs [DeviceCRs] can be seen globally and in order
    before the first (or all responses) but you are going to see all
    that latency in the pipeline; but OoO memory requests are not one
    of them.

    I was thinking a simple and cheap way would be to use a variation of the single-step mechanism. An interrupt request would cause Decode to emit a special uOp with the single-step flag set and then stall, to allow the pipeline to drain the old stream before accepting the interrupt and redirecting Fetch to its handler. That way if there are and interrupt
    enable or disable instructions, or branch mispredicts, or pending exceptions in-flight they all are allowed to finish and the state to settle down.

    Pipelining interrupt delivery looks possible but gets complicated and expensive real quick.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Nov 29 17:45:17 2025
    From Newsgroup: comp.arch

    On 2025-11-29 4:10 p.m., EricP wrote:
    Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-
    count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being
    deferred because interrupts got disabled by an instruction in the
    pipeline. I guessed 40 instructions would likely be enough for many
    cases where IRQs are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not
    be committed because the IRQs got disabled in the meantime. If the CPU
    were allowed to accept another IRQ right away, it could get stuck in a
    loop flushing the pipeline and reloading with the ISR routine code
    instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be
    programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the
    CPU would advance in 40 instruction burps. Alternating between
    fetching ISR instructions and the desired instruction stream. On the
    other hand, a larger down-count starts to impact the IRQ latency.

    Tradeoffs…

    I suppose I could have the CPU increase the down-count if it is
    looping around fetching ISR instructions. The down-count would be
    reset to the minimum again once an interrupt enable instruction is
    executed.

    Complex…


    You are using this timer to predict the delay for draining the pipeline.
    It would only take a read of a slow IO device register to exceed it.

    The down-count counts down only when the front end of the pipeline advances, so instructions are sure to be loaded.

    I was thinking a simple and cheap way would be to use a variation of the single-step mechanism. An interrupt request would cause Decode to emit a special uOp with the single-step flag set and then stall, to allow the pipeline to drain the old stream before accepting the interrupt and redirecting Fetch to its handler. That way if there are and interrupt
    enable or disable instructions, or branch mispredicts, or pending
    exceptions
    in-flight they all are allowed to finish and the state to settle down.

    Pipelining interrupt delivery looks possible but gets complicated and expensive real quick.



    The base down count increases every time the IRQ is found at the commit
    stage. If the base down count is too large (stuck interrupt) then an
    exception is processed. For instance if interrupts were disabled for
    1000 clocks.

    I think the mechanism could work, complicated though.

    Treating the DI as an exception, as mentioned in another post would also
    work. It is a matter then of flushing the instructions between the DI
    and ISR.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Nov 29 23:14:23 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-29 4:10 p.m., EricP wrote:
    Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-
    count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being
    deferred because interrupts got disabled by an instruction in the
    pipeline. I guessed 40 instructions would likely be enough for many
    cases where IRQs are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not
    be committed because the IRQs got disabled in the meantime. If the CPU
    were allowed to accept another IRQ right away, it could get stuck in a
    loop flushing the pipeline and reloading with the ISR routine code
    instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be
    programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the
    CPU would advance in 40 instruction burps. Alternating between
    fetching ISR instructions and the desired instruction stream. On the
    other hand, a larger down-count starts to impact the IRQ latency.

    Tradeoffs…

    I suppose I could have the CPU increase the down-count if it is
    looping around fetching ISR instructions. The down-count would be
    reset to the minimum again once an interrupt enable instruction is
    executed.

    Complex…


    You are using this timer to predict the delay for draining the pipeline.
    It would only take a read of a slow IO device register to exceed it.

    The down count is counting down only when the front-end of the pipeline advances, instructions are sure to be loaded.

    I was thinking a simple and cheap way would be to use a variation of the single-step mechanism. An interrupt request would cause Decode to emit a special uOp with the single-step flag set and then stall, to allow the pipeline to drain the old stream before accepting the interrupt and redirecting Fetch to its handler. That way if there are and interrupt enable or disable instructions, or branch mispredicts, or pending exceptions
    in-flight they all are allowed to finish and the state to settle down.

    Pipelining interrupt delivery looks possible but gets complicated and expensive real quick.



    The base down count increases every time the IRQ is found at the commit stage. If the base down count is too large (stuck interrupt) then an exception is processed. For instance if interrupts were disabled for
    1000 clocks.

    I think the mechanism could work, complicated though.

    Treating the DI as an exception, as mentioned in another post would also work. It is a matter then of flushing the instructions between the DI
    and ISR.

    Which is no different than flushing instructions after a mispredicted branch.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Nov 29 23:37:21 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    (Looking at your
    code, it also does not seem to be self-sufficient, at least the
    numerous SKIP4 statements require something else).

    If you want to assemble the resulting .S file, it's assembled once
    with

    -DSKIP4= -Dgforth_engine2=gforth_engine

    and once with

    -DSKIP4=".skip 4"

    (on Linux-GNU AMD64, the .skip assembler directive is autoconfigured
    and may be different on other platforms).

    My assumption is that the control flow is confusing gcc.

    My guess is the same.

    Both our guesses were wrong, and Scott (I think) was on the right
    track - this is a signed / unsigned issue. A reduced test case is

    void bar(unsigned long, long);

    void foo(unsigned long u1)
    {
        long u3;
        u1 = u1 / 10;
        u3 = u1 % 10;
        bar(u1,u3);
    }

    This is now https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122911 .
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 30 02:17:10 2025
    From Newsgroup: comp.arch

    On 2025-11-29 6:14 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    On 2025-11-29 4:10 p.m., EricP wrote:
    Robert Finch wrote:
    I hard-coded an IRQ delay down-count in the Qupls4 core. The down-
    count delays accepting interrupts for ten clock cycles or about 40
    instructions if an interrupt got deferred. The interrupt being
    deferred because interrupts got disabled by an instruction in the
    pipeline. I guessed 40 instructions would likely be enough for many
    cases where IRQs are disabled then enabled again.

    The issue is the pipeline is full of ISR instructions that should not
    be committed because the IRQs got disabled in the meantime. If the CPU
    were allowed to accept another IRQ right away, it could get stuck in a
    loop flushing the pipeline and reloading with the ISR routine code
    instead of progressing through the code where IRQs were disabled.

    I could create a control register for this count and allow it to be
    programmable. But I think that may not be necessary.

    It is possible that 40 instructions is not enough. In that case the
    CPU would advance in 40 instruction burps. Alternating between
    fetching ISR instructions and the desired instruction stream. On the
    other hand, a larger down-count starts to impact the IRQ latency.

    Tradeoffs…

    I suppose I could have the CPU increase the down-count if it is
    looping around fetching ISR instructions. The down-count would be
    reset to the minimum again once an interrupt enable instruction is
    executed.

    Complex…


    You are using this timer to predict the delay for draining the pipeline.
    It would only take a read of a slow IO device register to exceed it.

    The down-count counts down only when the front end of the pipeline
    advances, so instructions are sure to be loaded.

    I was thinking a simple and cheap way would be to use a variation of
    the single-step mechanism. An interrupt request would cause Decode to
    emit a special uOp with the single-step flag set and then stall, to
    allow the pipeline to drain the old stream before accepting the
    interrupt and redirecting Fetch to its handler. That way if there are
    any interrupt enable or disable instructions, or branch mispredicts, or
    pending exceptions in-flight, they all are allowed to finish and the
    state to settle down.

    Pipelining interrupt delivery looks possible but gets complicated and
    expensive real quick.



    The base down count increases every time the IRQ is found at the commit
    stage. If the base down count is too large (stuck interrupt) then an
    exception is processed. For instance if interrupts were disabled for
    1000 clocks.

    I think the mechanism could work, complicated though.

    Treating the DI as an exception, as mentioned in another post would also
    work. It is a matter then of flushing the instructions between the DI
    and ISR.

    Which is no different than flushing instructions after a mispredicted branch.

    Got fed up with trying to work out how to get interrupts working. It turns
    out to be more challenging than I expected, no matter which way it is
    done. So, I decided to just poll for interrupts, getting rid of most of
    the IRQ logic. I added a branch-on-interrupt BOI instruction that works
    almost the same way as every other branch. Then the micro-op translator
    has been adapted to insert a polling branch periodically. It looks a lot simpler.
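
    A rough C sketch of what that insertion amounts to (the names, the opcode
    value, and the poll interval here are mine, just to illustrate; the real
    logic lives in the decode/translate stage, not in software):

    typedef struct { int opcode; long target; } uop_t;

    enum { POLL_INTERVAL = 16,       /* how often to poll; illustrative only */
           OP_BOI        = 0xB01 };  /* hypothetical branch-on-interrupt opcode */

    static void emit(uop_t u) { (void)u; }   /* stand-in for the uop queue */

    /* Translate one decoded instruction, occasionally prefixing it with a BOI
       uOp.  The BOI behaves like any other branch except that it is taken
       only if an interrupt request line is asserted. */
    static void translate(uop_t decoded, long isr_entry)
    {
        static int since_last_poll = 0;

        if (++since_last_poll >= POLL_INTERVAL) {
            uop_t boi = { OP_BOI, isr_entry };
            emit(boi);
            since_last_poll = 0;
        }
        emit(decoded);
    }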


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Nov 30 10:10:00 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> schrieb:

    Got fed up with trying to work out how to get interrupts working. It turns
    out to be more challenging than I expected, no matter which way it is
    done. So, I decided to just poll for interrupts, getting rid of most of
    the IRQ logic. I added a branch-on-interrupt BOI instruction that works almost the same way as every other branch. Then the micro-op translator
    has been adapted to insert a polling branch periodically. It looks a lot simpler.

    What is the expected delay until an interrupt is delivered?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 30 06:29:55 2025
    From Newsgroup: comp.arch

    On 2025-11-30 5:10 a.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:

    Got fed up with trying to work out how to get interrupts working. It turns
    out to be more challenging than I expected, no matter which way it is
    done. So, I decided to just poll for interrupts, getting rid of most of
    the IRQ logic. I added a branch-on-interrupt BOI instruction that works
    almost the same way as every other branch. Then the micro-op translator
    has been adapted to insert a polling branch periodically. It looks a lot
    simpler.

    What is the expected delay until an interrupt is delivered?

    I set the timing to 16 clocks, which is about 64 (or more) instructions.
    Did not want to go much over 1% of the number of instructions executed.
    Not every instruction inserts a poll, so sometimes a poll is lacking.
    IDK how well it will work. Making it an instruction means it might also
    be used by software.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sun Nov 30 06:41:52 2025
    From Newsgroup: comp.arch

    On 2025-11-30 6:29 a.m., Robert Finch wrote:
    On 2025-11-30 5:10 a.m., Thomas Koenig wrote:
    Robert Finch <robfi680@gmail.com> schrieb:

    Got fed up with trying to work out how to get interrupts working. It turns
    out to be more challenging than I expected, no matter which way it is
    done. So, I decided to just poll for interrupts, getting rid of most of
    the IRQ logic. I added a branch-on-interrupt BOI instruction that works
    almost the same way as every other branch. Then the micro-op translator
    has been adapted to insert a polling branch periodically. It looks a lot simpler.

    What is the expected delay until an interrupt is delivered?

    I set the timing to 16 clocks, which is about 64 (or more) instructions.
    Did not want to go much over 1% of the number of instructions executed.
    Not every instruction inserts a poll, so sometimes a poll is lacking.
    IDK how well it will work. Making it an instruction means it might also
    be used by software.

    Might be able to modify the branch predictor to predict the interrupt.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 30 14:14:16 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Both our guesses were wrong, and Scott (I think) was on the right
    track - this is a signed / unsigned issue. A reduced test case is

    void bar(unsigned long, long);

    void foo(unsigned long u1)
    {
    long u3;
    u1 = u1 / 10;
    u3 = u1 % 10;
    bar(u1,u3);
    }

    Assigning to u1 changed the meaning, as Andrew Pinski noted; so the
    jury is still out on what the actual problem is.

    This is now https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122911 .

    and a revised one at
    <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122919>

    (The announced attachment is not there yet.)

    The latter case is interesting, because real_ca and spc became global,
    and symbols[] is still local, and no assignment to real_ca happens
    inside foo().

    So one way the compiler could interpret this code might be that
    real_ca gets one of the labels whose address is taken in some way
    unknown to the compiler; then it has to preserve all the code reachable
    through the labels.

    Another way to interpret this code would be that symbols is not used,
    so it is dead and can be optimized away. Consequently, none of the
    addresses of any of the labels is ever taken, and the labels are not
    used by direct jumps, either, so all the code reachable only by
    jumping to the labels is unreachable and can be optimized away.

    Apparently gcc takes the latter attitude if there are <=100 labels in
    symbols, but maybe something like the former attitude if there are
    more than 100 labels in symbols. This may appear strange, but gcc generally
    tends to produce good code in relatively short time for Gforth (while
    clang generates horribly slow code and takes extremely long in doing
    so), and my guess is that having such a cutoff on doing the usual
    analysis has something to do with gcc's superior performance.

    I guess that if you treat symbols like in the original code (i.e.,
    return it in one case), you can reduce the labels more without the
    compiler optimizing everything away. I don't dare to predict when the
    compiler will stop generating the inefficient variant. Maybe it has
    to do with the cutoff.
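
    For readers who have not looked at the engine, the situation is roughly
    like this GNU C sketch (heavily simplified and written by me; the real
    symbols[] has hundreds of entries and the labels dispatch VM instructions):

    /* Labels-as-values sketch (GNU C extension).  If the compiler can prove
       that symbols[] is dead, no label address escapes and the code behind
       the labels is unreachable; if symbols[] may escape (here: when it is
       returned), that code has to be kept. */
    void *foo(long *ip, int return_symbols)
    {
        static void *symbols[] = { &&do_add, &&do_div10 };
        long tos = 0;

        if (return_symbols)
            return symbols;           /* label addresses escape here */

        goto *symbols[ip[0]];         /* threaded-code dispatch */

    do_add:
        tos = ip[1] + ip[2];
        return (void *)tos;

    do_div10:
        tos = (long)((unsigned long)ip[1] / 10);
        return (void *)tos;
    }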

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Nov 30 15:47:03 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Both our guesses were wrong, and Scott (I think) was on the right
    track - this is a signed / unsigned issue. A reduced test case is

    void bar(unsigned long, long);

    void foo(unsigned long u1)
    {
    long u3;
    u1 = u1 / 10;
    u3 = u1 % 10;
    bar(u1,u3);
    }

    Assigning to u1 changed the meaning, as Andrew Pinski noted;

    An example which could be tested at run-time to verify correct
    operation was not provided, so I had to do without.

    In reducing compiler bugs, automated tools such as delta or
    (much better) cvise are essential. Your test case was so
    large that cvise failed, so a lot of manual work was required.

    cvise uses a user-supplied "interestingness script" which returns
    0 if the feature in question is there, or non-zero if it is
    not there. For relatively simple cases like an ICE, it
    can have two steps: a) check that compilation fails, and b)
    check that the error message is output.

    Looking for a missed optimization is more difficult, especially
    in the absence of a run-time test. It is then necessary to

    a) check the source code that the interesting code is still there

    b) compile the code (exiting if this fails)

    c) verify the generated assembly that it still does the same

    a) and c) are very easy to get wrong, and there were numerous
    false reductions where cvise came up with something that the
    scripts didn't catch.


    so the
    jury is still out on what the actual problem is.

    This is now https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122911 .

    and a revised one at
    <https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122919>

    (The announced attachment is not there yet.)

    The latter case is interesting, because real_ca and spc became global,
    and symbols[] is still local, and no assignment to real_ca happens
    inside foo().

    That is what cvise does. It sometimes reduces code more than a
    human would.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 30 15:18:21 2025
    From Newsgroup: comp.arch

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.

    Why would a CS graduate need to know about pipelines?

    So they can properly simulate a pipelined processor?

    Sure, if a CS graduate works in an application area, they need to
    learn about that application area, whatever it is.

    It's useful for code optimization, as well.

    In what way?

    In general,
    any programmer should have a solid understanding of the
    underlying hardware - generically, and specifically
    for the hardware being programmed.

    Certainly. But do they need to know the difference between a Wallace multiplier
    and a Dadda multiplier? If not, what is it about pipelined processors
    that would require CS graduates to know about them?

    Processor pipelines are not the basics of what a CS graduate is doing.
    They are an implementation detail in computer engineering.

    Which affect the performance of the software created by the
    software engineer (CS graduate).

    By a constant factor; and the software creator does not need to know
    that the CPU that executes instructions at 2 CPI (486) instead of at
    10 CPI (VAX-11/780) is pipelined; and these days both the 486 and the
    VAX are irrelevant to software creators.

    A few more examples where compilers are not as good as even I expected:

    Just today, I compiled

    u4 = u1/10;
    u3 = u1%10;

    (plus some surrounding code) with gcc-14 in three contexts. Here's
    the code for two of them (the third one is similar to the second one):

    movabs $0xcccccccccccccccd,%rax     movabs $0xcccccccccccccccd,%rsi
    sub $0x8,%r13                       mov %r8,%rax
    mul %r8                             mov %r8,%rcx
    mov %rdx,%rax                       mul %rsi
    shr $0x3,%rax                       shr $0x3,%rdx
    lea (%rax,%rax,4),%rdx              lea (%rdx,%rdx,4),%rax
    add %rdx,%rdx                       add %rax,%rax
    sub %rdx,%r8                        sub %rax,%r8
    mov %r8,0x8(%r13)                   mov %rcx,%rax
    mov %rax,%r8                        mul %rsi
                                        shr $0x3,%rdx
                                        mov %rdx,%r9

    The major difference is that in the left context, u3 is stored into
    memory (at 0x8(%r13)), while in the right context, it stays in a
    register. In the left context, gcc managed to base its computation of
    u1%10 on the result of u1/10; in the right context, gcc first computes
    u1%10 (computing u1/10 as part of that), and then computes u1/10
    again.
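
    For reference, what both sequences compute is the usual fixed-point
    reciprocal trick; a sketch (mine, not gcc output; __uint128_t is the gcc
    extension on this platform):

    #include <stdint.h>

    /* 0xCCCCCCCCCCCCCCCD is ceil(2^67/10), so for 64-bit x the quotient is
       q = floor(x*M / 2^67) = (high 64 bits of x*M) >> 3, and the remainder
       is x - 10*q, which is what the lea/add/sub sequence above computes. */
    static uint64_t div10(uint64_t x)
    {
        const uint64_t M = 0xCCCCCCCCCCCCCCCDull;              /* ceil(2^67/10) */
        uint64_t hi = (uint64_t)(((__uint128_t)x * M) >> 64);  /* mul; take %rdx */
        return hi >> 3;                                        /* shr $0x3 */
    }

    static uint64_t mod10(uint64_t x)
    {
        uint64_t q = div10(x);
        return x - (q + 4 * q) * 2;   /* lea 5*q; add doubles it; sub from x */
    }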

    Sort of emphasizes that programmers need to understand the
    underlying hardware.

    I am the programmer of the code shown above. In what way would better knowledge of the hardware have made me aware that gcc would produce
    suboptimal code in some cases?

    What were u1, u3 and u4 declared as?

    unsigned long (on that platform).

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 30 16:39:41 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    In reducing compiler bugs, automated tools such as delta or
    (much better) cvise are essential. Your test case was so
    large that cvise failed, so a lot of manual work was required.

    I have now done a manual reduction myself; essentially I left only the
    3 variants of the VM instruction that performs 10/, plus all the
    surroundings, and I added code to ensure that spTOS, spb, and spc are
    not dead. You find the result at

    http://www.complang.tuwien.ac.at/anton/tmp/engine-fast-red.i

    The result of compiling this with

    gcc -I./../arch/amd64 -I. -Wall -g -O2 -fomit-frame-pointer -pthread -DHAVE_CONFIG_H -DFORCE_LL -DFORCE_REG -DDEFAULTPATH='".:/usr/local/lib/gforth/site-forth:/usr/local/lib/gforth/0.7.9_20251119:/usr/local/share/gforth/0.7.9_20251119:/usr/share/gforth/site-forth:/usr/local/share/gforth/site-forth"' -c -fno-gcse -fcaller-saves -fno-defer-pop -fno-inline -fwrapv -fno-strict-aliasing -fno-cse-follow-jumps -fno-reorder-blocks -fno-reorder-blocks-and-partition -fno-toplevel-reorder -falign-labels=1 -falign-loops=1 -falign-jumps=1 -fno-delete-null-pointer-checks -fcf-protection=none -fno-tree-vectorize -fno-lto -pthread -DENGINE=2 -fPIC -DPIC -o libengine-fast2-ll-reg-red.S -S engine-fast-red.i

    can be found at

    http://www.complang.tuwien.ac.at/anton/tmp/libengine-fast2-ll-reg-red.S

    Now the multiplier is permanently allocated to %r11, so searching for
    it won't help. However, if you search for "mulq", you will find the
    code generated for the three instances of the VM instruction. The
    first is optimized well, the second exhibits two mulqs and two shrqs,
    the third exhibits just one mulq, but two shrqs.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Nov 30 18:59:15 2025
    From Newsgroup: comp.arch

    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    In reducing compiler bugs, automated tools such as delta or
    (much better) cvise are essential. Your test case was so
    large that cvise failed, so a lot of manual work was required.

    I have now done a manual reduction myself; essentially I left only the
    3 variants of the VM instruction that performs 10/, plus all the surroundings, and I added code to ensure that spTOS, spb, and spc are
    not dead. You find the result at

    http://www.complang.tuwien.ac.at/anton/tmp/engine-fast-red.i

    Do you have an example which tests the codepath taken for the
    offending piece of code, so it is possible to further reduce this
    case automatically? The example is still quite big (>13000 lines).
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Nov 30 19:33:47 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    ERROR "unexpected byte sequence starting at index 356: '\xC3'" while decoding:

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.

    Why would a CS graduate need to know about pipelines?

    So they can properly simulate a pipelined processor?

    Sure, if a CS graduate works in an application area, they need to
    learn about that application area, whatever it is.

    It's useful for code optimization, as well.

    In what way?

    In general,
    any programmer should have a solid understanding of the
    underlying hardware - generically, and specifically
    for the hardware being programmed.

    Certainly. But do they need to know the difference between a Wallace multiplier
    and a Dadda multiplier?

    You do realize that all Wallace multipliers are Dadda multipliers ??
    But there are Dadda Multipliers that are not Wallace multipliers ?!?!?!

    If not, what is it about pipelined processors
    that would require CS graduates to know about them?

    How execution order disturbs things like program order and memory order.
    That is how and when they need to insert Fences in their multi-threaded
    code.

    Processor pipelines are not the basics of what a CS graduate is doing.
    They are an implementation detail in computer engineering.

    Which affect the performance of the software created by the
    software engineer (CS graduate).

    By a constant factor; and the software creator does not need to know
    that the CPU that executes instructions at 2 CPI (486) instead of at
    10 CPI (VAX-11/780) is pipelined; and these days both the 486 and the
    VAX are irrelevant to software creators.

    I do not believe that the word "the" in front of x86 or VAX is proper.

    A few more examples where compilers are not as good as even I expected:

    Just today, I compiled

    u4 = u1/10;
    u3 = u1%10;

    (plus some surrounding code) with gcc-14 in three contexts. Here's
    the code for two of them (the third one is similar to the second one):

    movabs $0xcccccccccccccccd,%rax     movabs $0xcccccccccccccccd,%rsi
    sub $0x8,%r13                       mov %r8,%rax
    mul %r8                             mov %r8,%rcx
    mov %rdx,%rax                       mul %rsi
    shr $0x3,%rax                       shr $0x3,%rdx
    lea (%rax,%rax,4),%rdx              lea (%rdx,%rdx,4),%rax
    add %rdx,%rdx                       add %rax,%rax
    sub %rdx,%r8                        sub %rax,%r8
    mov %r8,0x8(%r13)                   mov %rcx,%rax
    mov %rax,%r8                        mul %rsi
                                        shr $0x3,%rdx
                                        mov %rdx,%r9

    The major difference is that in the left context, u3 is stored into
    memory (at 0x8(%r13)), while in the right context, it stays in a
    register. In the left context, gcc managed to base its computation of
    u1%10 on the result of u1/10; in the right context, gcc first computes
    u1%10 (computing u1/10 as part of that), and then computes u1/10
    again.

    Sort of emphasizes that programmers need to understand the
    underlying hardware.

    I am the programmer of the code shown above. In what way would better knowledge of the hardware have made me aware that gcc would produce
    suboptimal code in some cases?

    Reading and thinking about the asm-code and running the various code
    sequences enough times that you can measure which is better and which
    is worse. That is the engineering part of software Engineering.

    What were u1, u3 and u4 declared as?

    unsigned long (on that platform).

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Niklas Holsti@niklas.holsti@tidorum.invalid to comp.arch on Sun Nov 30 22:38:39 2025
    From Newsgroup: comp.arch

    On 2025-11-30 21:33, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    ERROR "unexpected byte sequence starting at index 356: '\xC3'" while decoding:

    scott@slp53.sl.home (Scott Lurndal) writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
    scott@slp53.sl.home (Scott Lurndal) writes:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    I recently heard that CS graduates from ETH Zürich had heard about pipelines, but thought it was fetch-decode-execute.

    Why would a CS graduate need to know about pipelines?

    So they can properly simulate a pipelined processor?

    Sure, if a CS graduate works in an application area, they need to
    learn about that application area, whatever it is.

    It's useful for code optimization, as well.

    In what way?

    In general,
    any programmer should have a solid understanding of the
    underlying hardware - generically, and specifically
    for the hardware being programmed.

    Certainly. But do they need to know the difference between a Wallace multiplier
    and a Dadda multiplier?

    You do realize that all Wallace multipliers are Dadda multipliers ??
    But there are Dadda Multipliers that are not Wallace multipliers ?!?!?!

    If not, what is it about pipelined processors
    that would require CS graduates to know about them?

    How execution order disturbs things like program order and memory order.
    That is how and when they need to insert Fences in their multi-threaded
    code.

    That is an aspect of processor architecture that is relevant to some programmers, but not to the large number of programmers who use
    languages or operating systems with built-in multi-threading and safe inter-thread communication primitives and services for input/output.

    I am the programmer of the code shown above. In what way would better
    knowledge of the hardware have made me aware that gcc would produce
    suboptimal code in some cases?

    Reading and thinking about the asm-code and running the various code sequences enough times that you can measure which is better and which
    is worse. That is the engineering part of software Engineering.

    That is a very niche part of software (performance) engineering. Speed
    of execution is only one of many "goodness" dimensions of a piece of SW, others including correctness, reliability, security, portability, maintainability, and so on. All dimensions need and depend on systematic engineering, although some dimensions cannot be quantified as easily as execution speed.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 30 22:11:26 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
    Thomas Koenig <tkoenig@netcologne.de> writes:
    In reducing compiler bugs, automated tools such as delta or
    (much better) cvise are essential. Your test case was so
    large that cvise failed, so a lot of manual work was required.

    I have now done a manual reduction myself; essentially I left only the
    3 variants of the VM instruction that performs 10/, plus all the
    surroundings, and I added code to ensure that spTOS, spb, and spc are
    not dead. You find the result at

    http://www.complang.tuwien.ac.at/anton/tmp/engine-fast-red.i

    Do you have an example which tests the codepath taken for the
    offending piece of code,

    Not easily.

    so it is possible to further reduce this
    case automatically? The example is still quite big (>13000 lines).

    Most of which is coming from including stdlib.h etc. The actual code
    of the gforth_engine function in that example is 264 lines, many of
    which are empty or line number indicators.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Nov 30 22:17:19 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:


    scott@slp53.sl.home (Scott Lurndal) writes:
    In general,
    any programmer should have a solid understanding of the
    underlying hardware - generically, and specifically
    for the hardware being programmed.

    Certainly. But do they need to know the difference between a Wallace multiplier
    and a Dadda multiplier?

    You do realize that all Wallace multipliers are Dadda multipliers ??
    But there are Dadda Multipliers that are not Wallace multipliers ?!?!?!

    Good to know, but does not answer the question.

    If not, what is it about pipelined processors
    that would require CS graduates to know about them?

    How execution order disturbs things like program order and memory order.
    That is how and when they need to insert Fences in their multi-threaded code.

    And the relevance of pipelined processors for that issue is what?

    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude. If you implement per-CPU caches and multiple memory controllers as shoddily
    as possible while providing features for programs to slow themselves
    down heavily in order to get memory-ordering guarantees, then you get
    a weak memory model; slightly less shoddy, and you get a "strong" memory
    model. Processor pipelines have no relevance here.

    And, as Niklas Holsti observed, dealing with memory-ordering
    shenanigans is something that a few specialists do; no need for others
    to know about the memory model, except that common CPUs unfortunately
    do not implement sequential consistency.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 1 00:12:15 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:


    scott@slp53.sl.home (Scott Lurndal) writes:
    In general,
    any programmer should have a solid understanding of the
    underlying hardware - generically, and specifically
    for the hardware being programmed.

    Certainly. But do they need to know between a a Wallace multiplier
    and a Dadda multiplier?

    You do realize that all Wallace multipliers are Dadda multipliers ??
    But there are Dadda Multipliers that are not Wallace multipliers ?!?!?!

    Good to know, but does not answer the question.

    {Without contradicting that Wallace got on the correct track first}
    Wallace gets the credit that should rightly go to Dadda.

    If not, what is it about pipelined processors
    that would require CS graduates to know about them?

    How execution order disturbs things like program order and memory order.
    That is how and when they need to insert Fences in their multi-threaded
    code.

    And the relevance of pipelined processors for that issue is what?

    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude.

    And without the SuperComputer attitude, you sell 0 parts.
    {Remember how we talk about performance all the time here ?}

    If you implement per-CPU caches and multiple memory controllers as shoddily
    as possible while providing features for programs to slow themselves
    down heavily in order to get memory-ordering guarantees, then you get
    a weak memory model; slightly less shoddy, and you get a "strong" memory model. Processor pipelines have no relevance here.

    It is the pipelines themselves (along with the SuperComputer attitude)
    that give rise to the weak memory models.

    And, as Niklas Holsti observed, dealing with memory-ordering
    shenanigans is something that a few specialists do; no need for others
    to know about the memory model, except that common CPUs unfortunately
    do not implement sequential consistency.

    Because of the SuperComputer attitude ! {Performance first}

    And only after several languages built their own ATOMIC primitives, so
    the programmers could remain ignorant. But this also ties the hands of
    the designers in such a way that performance grows ever more slowly
    with more threads.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Mon Dec 1 07:56:37 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude.

    And without the SuperComputer attitude, you sell 0 parts.
    {Remember how we talk about performance all the time here ?}

    Wrong. The supercomputer attitude gave us such wonders as IA-64
    (sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not
    only easier to program, but also faster.

    The advocates of weaker memory models justify them by pointing to the
    slowness of sequential consistency if one implements it by using
    fences on hardware optimized for a weaker memory model. But that's
    not the way to implement efficient sequential consistency.

    In an alternate reality where AMD64 did not happen and IA-64 won,
    people would justify the IA-64 ISA complexity as necessary for
    performance, and claim that the IA-32 hardware in the Itanium
    demonstrates the performance superiority of the EPIC approach, just
    like they currently justify the performance superiority of weak and
    "strong" memory models over sequential consistency.

    If hardware designers put their mind to it, they could make sequential consistency perform well, probably better on code that actually
    accesses data shared between different threads than weak and "strong"
    ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the
    slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).

    A similar case: Alpha includes a trapb instruction (an exception
    fence). Programmers have to insert it after FP instructions to get
    precise exceptions. This was justified with performance; i.e., the
    theory went: If you compile without trapb, you get performance and
    imprecise exceptions, if you compile with trapb, you get slowness and
    precise exceptions. I then measured SPEC 95 compiled without and with
    trapb <2003Apr3.202651@a0.complang.tuwien.ac.at>, and on the OoO 21264
    there was hardly any difference; I believe that trapb is a noop on the
    21264. Here's the SPECfp_base95 numbers:

    with     without
    trapb    trapb
     9.56     11.6    AlphaPC164LX 600MHz 21164A
    19.7      20.0    Compaq XP1000 500MHz 21264

    So the machine that needs trapb is much slower even without trapb than
    even the with-trapb variant on the machine where trapb is probably a
    noop. And lots of implementations of architectures without trapb have demonstrated since then that you can have high performance and precise exceptions without trapb.

    And only after several languages built their own ATOMIC primitives, so
    the programmers could remain ignorant. But this also ties the hands of
    the designers in such a way that performance grows ever more slowly
    with more threads.

    Maybe they could free their hands by designing for a
    sequential-consistency interface, just like designing for a simple sequential-execution model without EPIC features freed their hands to
    design microarchitectural features that allowed ordinary code to
    utilize wider and wider OoO cores profitably.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Mon Dec 1 13:23:22 2025
    From Newsgroup: comp.arch

    On Mon, 01 Dec 2025 07:56:37 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude.

    And without the SuperComputer attitude, you sell 0 parts.
    {Remember how we talk about performance all the time here ?}

    Wrong. The supercomputer attitude gave us such wonders as IA-64
    (sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not
    only easier to program, but also faster.

    The advocates of weaker memory models justify them by pointing to the slowness of sequential consistency if one implements it by using
    fences on hardware optimized for a weaker memory model. But that's
    not the way to implement efficient sequential consistency.

    In an alternate reality where AMD64 did not happen and IA-64 won,
    people would justify the IA-64 ISA complexity as necessary for
    performance, and claim that the IA-32 hardware in the Itanium
    demonstrates the performance superiority of the EPIC approach, just
    like they currently justify the performance superiority of weak and
    "strong" memory models over sequential consistency.

    If hardware designers put their mind to it, they could make sequential consistency perform well, probably better on code that actually
    accesses data shared between different threads than weak and "strong" ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).


    Where does sequential consistency simplify programming over the x86 model
    of "TSO + globally ordered synchronization primitives +
    every synchronization primitive has implied barriers"?

    More so, where does it simplify over ARMv8.1-A, assuming that the programmer
    does not try to be too smart and never uses LL/SC and always uses
    8.1-style synchronization instructions with Acquire+Release flags set?

    IMHO, the only simple thing about sequential consistency is its simple
    description. Other than that, it simplifies very little. It does not
    magically make lockless multithreaded programming bearable to
    non-genius coders.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Mon Dec 1 14:07:34 2025
    From Newsgroup: comp.arch

    Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude.
    And without the SuperComputer attitude, you sell 0 parts.
    {Remember how we talk about performance all the time here ?}

    Wrong. The supercomputer attitude gave us such wonders as IA-64
    (sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not
    only easier to program, but also faster.

    The advocates of weaker memory models justify them by pointing to the slowness of sequential consistency if one implements it by using
    fences on hardware optimized for a weaker memory model. But that's
    not the way to implement efficient sequential consistency.

    In an alternate reality where AMD64 did not happen and IA-64 won,
    people would justify the IA-64 ISA complexity as necessary for
    performance, and claim that the IA-32 hardware in the Itanium
    demonstrates the performance superiority of the EPIC approach, just
    like they currently justify the performance superiority of weak and
    "strong" memory models over sequential consistency.

    If hardware designers put their mind to it, they could make sequential consistency perform well, probably better on code that actually
    accesses data shared between different threads than weak and "strong" ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).

    A similar case: Alpha includes a trapb instruction (an exception
    fence). Programmers have to insert it after FP instructions to get
    precise exceptions. This was justified with performance; i.e., the
    theory went: If you compile without trapb, you get performance and
    imprecise exceptions, if you compile with trapb, you get slowness and
    precise exceptions. I then measured SPEC 95 compiled without and with
    trapb <2003Apr3.202651@a0.complang.tuwien.ac.at>, and on the OoO 21264
    there was hardly any difference; I believe that trapb is a noop on the
    21264. Here's the SPECfp_base95 numbers:

    with     without
    trapb    trapb
     9.56     11.6    AlphaPC164LX 600MHz 21164A
    19.7      20.0    Compaq XP1000 500MHz 21264

    So the machine that needs trapb is much slower even without trapb than
    even the with-trapb variant on the machine where trapb is probably a
    noop. And lots of implementations of architectures without trapb have demonstrated since then that you can have high performance and precise exceptions without trapb.

    The 21264 Hardware Reference Manual says TRAPB (general exception barrier)
    and EXCB (floating point control register barrier) are both NOP's
    internally, are tossed at decode, and don't even take up an
    instruction slot.

    The purpose of the EXCB is to synchronize pipeline access to the
    floating point control and status register with FP operations.
    In the worst case this stalls until the pipeline drains.

    I wonder how much logic it really saved allowing imprecise exceptions
    in the InO 21064 and 21164? Conversely, how much did it cost to deal
    with problems caused by leaving these interlocks off?

    The cores have multiple, parallel pipelines for int, lsq, fadd and fmul. Without exception interlocks, each pipeline only obeys the scoreboard
    rules for when to writeback its result register: WAW and WAR.
    That allows a younger, faster instruction to finish and write its register before an older, slower instruction. If that older instruction then throws
    an exception and does not write its register then we can see the out of
    order register writes.

    For register file writes to be precise in the presence of exceptions
    requires each instruction look ahead at the state of all older
    instructions *in all pipelines*.
    Each uOp can be Unresolved, Resolved_Normal, or Resolved_Exception.
    A writeback can occur if there are no WAW or WAR dependencies,
    and all older uOps are Resolved_Normal.

    Just off the top of my head, in addition to the normal scoreboard,
    a FIFO buffer with a priority selector could be used to look ahead
    at all older uOps and check their status, and allow or stall uOp
    writebacks and ensure registers always appear precise.
    Which really doesn't look that expensive.

    Is there something I missed, or would that FIFO suffice?
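
    Something like this C model of the gate, perhaps (names and sizes are
    mine; one FIFO entry per in-flight uOp, allocated in program order):

    enum status { UNRESOLVED, RESOLVED_NORMAL, RESOLVED_EXCEPTION };

    #define FIFO_SIZE 32           /* illustrative; one slot per in-flight uOp */

    struct uop_fifo {
        enum status st[FIFO_SIZE];
        int head;                  /* slot of the oldest in-flight uOp */
    };

    /* idx is the uOp's distance from the oldest in-flight uOp (0 = oldest);
       scoreboard_ok means the ordinary scoreboard sees no WAW/WAR hazard. */
    static int may_write_back(const struct uop_fifo *f, int idx, int scoreboard_ok)
    {
        if (!scoreboard_ok)
            return 0;
        for (int i = 0; i < idx; i++) {        /* priority scan of older uOps */
            if (f->st[(f->head + i) % FIFO_SIZE] != RESOLVED_NORMAL)
                return 0;          /* stall: an older uOp is unresolved or
                                      will take an exception */
        }
        return 1;                  /* register file stays precise */
    }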



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 1 22:50:15 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude.

    And without the SuperComputer attitude, you sell 0 parts.
    {Remember how we talk about performance all the time here ?}

    Wrong. The supercomputer attitude gave us such wonders as IA-64
    (sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not
    only easier to program, but also faster.

    The advocates of weaker memory models justify them by pointing to the slowness of sequential consistency if one implements it by using
    fences on hardware optimized for a weaker memory model. But that's
    not the way to implement efficient sequential consistency.

    In an alternate reality where AMD64 did not happen and IA-64 won,
    people would justify the IA-64 ISA complexity as necessary for
    performance, and claim that the IA-32 hardware in the Itanium
    demonstrates the performance superiority of the EPIC approach, just
    like they currently justify the performance superiority of weak and
    "strong" memory models over sequential consistency.

    If hardware designers put their mind to it, they could make sequential consistency perform well,

    Depends on your definition of SC and "performs well", but see below:

    probably better on code that actually
    accesses data shared between different threads than weak and "strong" ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).

    In the case of My 66000, there is a slightly weak memory model
    (Causal consistency) for accesses to DRAM, and there is Sequential
    consistency for ATOMIC stuff and device control registers, and then
    there is strongly ordered for configuration space access, and the
    programmer does not have to do "jack" to get these orderings--
    its all programmed in the PTEs.

    {{There is even a way to make DRAM accesses SC should you want.}}

    A similar case: Alpha includes a trapb instruction (an exception
    fence). Programmers have to insert it after FP instructions to get
    precise exceptions. This was justified with performance; i.e., the
    theory went: If you compile without trapb, you get performance and
    imprecise exceptions, if you compile with trapb, you get slowness and
    precise exceptions. I then measured SPEC 95 compiled without and with
    trapb <2003Apr3.202651@a0.complang.tuwien.ac.at>, and on the OoO 21264
    there was hardly any difference; I believe that trapb is a noop on the
    21264. Here's the SPECfp_base95 numbers:

    with     without
    trapb    trapb
     9.56     11.6    AlphaPC164LX 600MHz 21164A
    moderate slowdown
    19.7      20.0    Compaq XP1000 500MHz 21264
    slowdown has disappeared.

    So the machine that needs trapb is much slower even without trapb than
    even the with-trapb variant on the machine where trapb is probably a
    noop. And lots of implementations of architectures without trapb have demonstrated since then that you can have high performance and precise exceptions without trapb.

    And only after several languages built their own ATOMIC primitives, so
    the programmers could remain ignorant. But this also ties the hands of
    the designers in such a way that performance grows ever more slowly
    with more threads.

    Maybe they could free their hands by designing for a
    sequential-consistency interface, just like designing for a simple sequential-execution model without EPIC features freed their hands to
    design microarchitectural features that allowed ordinary code to
    utilize wider and wider OoO cores profitably.

    That is not the property I was getting at--the property I was getting at
    is that the language model for synchronization can only use 1 memory
    location {TS, TTS, CAS, DCAS, LL, SC} and this fundamentally limits the
    amount of work one can do in a single event, and also fundamentally limits
    what one can "say" about a concurrent data structure.

    Given a certain amount of interference--the fewer ATOMIC things one has
    to do the lower the chance of interference, and the greater the chance
    of success. So, if one could move an element of a CDS from one location
    to another in one ATOMIC event rather than 2 (or 3) then the exponent
    of synchronization overhead goes down, and then one can make statements
    like "and no outside observer can see the CDS without that element present"--which cannot be stated with current models.


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 1 23:03:24 2025
    From Newsgroup: comp.arch


    EricP <ThatWouldBeTelling@thevillage.com> posted:

    Anton Ertl wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:
    Memory-ordering shenanigans come from the unholy alliance of
    cache-coherent multiprocessing and the supercomputer attitude.
    And without the SuperComputer attitude, you sell 0 parts.
    {Remember how we talk about performance all the time here ?}

    Wrong. The supercomputer attitude gave us such wonders as IA-64
    (sells 0 parts) and Larrabee (sells 0 parts); why: because OoO is not
    only easier to program, but also faster.

    The advocates of weaker memory models justify them by pointing to the slowness of sequential consistency if one implements it by using
    fences on hardware optimized for a weaker memory model. But that's
    not the way to implement efficient sequential consistency.

    In an alternate reality where AMD64 did not happen and IA-64 won,
    people would justify the IA-64 ISA complexity as necessary for
    performance, and claim that the IA-32 hardware in the Itanium
    demonstrates the performance superiority of the EPIC approach, just
    like they currently justify the performance superiority of weak and "strong" memory models over sequential consistency.

    If hardware designers put their mind to it, they could make sequential consistency perform well, probably better on code that actually
    accesses data shared between different threads than weak and "strong" ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).

    A similar case: Alpha includes a trapb instruction (an exception
    fence). Programmers have to insert it after FP instructions to get
    precise exceptions. This was justified with performance; i.e., the
    theory went: If you compile without trapb, you get performance and imprecise exceptions, if you compile with trapb, you get slowness and precise exceptions. I then measured SPEC 95 compiled without and with trapb <2003Apr3.202651@a0.complang.tuwien.ac.at>, and on the OoO 21264 there was hardly any difference; I believe that trapb is a noop on the 21264. Here's the SPECfp_base95 numbers:

    with     without
    trapb    trapb
     9.56     11.6    AlphaPC164LX 600MHz 21164A
    19.7      20.0    Compaq XP1000 500MHz 21264

    So the machine that needs trapb is much slower even without trapb than
    even the with-trapb variant on the machine where trapb is probably a
    noop. And lots of implementations of architectures without trapb have demonstrated since then that you can have high performance and precise exceptions without trapb.

    The 21264 Hardware Reference Manual says TRAPB (general exception barrier) and EXCB (floating point control register barrier) are both NOP's
    internally, are tossed at decode, and don't even take up an
    instruction slot.

    The purpose of the EXCB is to synchronize pipeline access to the
    floating point control and status register with FP operations.
    In the worst case this stalls until the pipeline drains.

    I wonder how much logic it really saved allowing imprecise exceptions
    in the InO 21064 and 21164?

    Having done something similar in Mc 88100, I can state that the amount
    of logic saved is too small to justify such naïveté.

    Conversely, how much did it cost to deal
    with problems caused by leaving these interlocks off?

    Way toooooo much. The SW delay to get all those things right cost more
    time than HW designers could have possibly saved leaving them out.

    The cores have multiple, parallel pipelines for int, lsq, fadd and fmul. Without exception interlocks, each pipeline only obeys the scoreboard
    rules for when to writeback its result register: WAW and WAR.
    That allows a younger, faster instruction to finish and write its register before an older, slower instruction. If that older instruction then throws
    an exception and does not write its register then we can see the out of
    order register writes.

    For register file writes to be precise in the presence of exceptions
    requires each instruction look ahead at the state of all older
    instructions *in all pipelines*.

    Or you use dead stages in the pipelines so instructions arrive at
    RF write ports no earlier than their compatriots. You still have to
    look across all the delay slots for forwarding opportunities.

    Each uOp can be Unresolved, Resolved_Normal, or Resolved_Exception.
    A writeback can occur if there are no WAW or WAR dependencies,
    and all older uOps are Resolved_Normal.

    That is the scoreboard model. The Reservation station has a simpler
    model by providing a unique register for each instruction (or µOp).

    Just off the top of my head, in addition to the normal scoreboard,
    a FIFO buffer with a priority selector could be used to look ahead
    at all older uOps and check their status,

    Such a block of logic is called a ReOrder Buffer.

    Given an architectural register file with 16-32 entries, and
    given a reorder buffer of 96+ entries--if you integrate both
    ARF and RoB into a single structure you call it a physical
    register file. A PRF is just a RoB that is big enough never
    to have to migrate registers to the ARF.

    and allow or stall uOp
    writebacks and ensure registers always appear precise.
    Which really doesn't look that expensive.

    Is there something I missed, or would that FIFO suffice?

    If the FiFo is big enough, it works just fine; if you scrimp on
    the FiFo, you will want to play games with orderings to make it
    faster.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Dec 2 07:10:16 2025
    From Newsgroup: comp.arch

    Semi-unaligned memory tradeoff. If unaligned access is required, the
    memory logic just increments the physical address by 64 bytes to fetch
    the next cache line. The issue with this is that it does not go back to
    the TLB to translate the incremented address, meaning no check is made
    for protection or translation of that address.

    It would be quite slow to have the instructions reissued and percolate
    down the cache access again.

    This should only be an issue if an unaligned access crosses a memory
    page boundary.

    The instruction causes an alignment fault if a page boundary crossing is detected.
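
    For what it is worth, the two checks are cheap to express; a C sketch
    (names are mine; 64-byte lines and 4 KiB pages assumed):

    #include <stdint.h>
    #include <stdbool.h>

    #define LINE_BYTES 64u      /* cache line size used above */
    #define PAGE_BYTES 4096u    /* page size; an assumption for the sketch */

    /* Crossing a line needs the second cache fetch (physical address + 64);
       crossing a page would also need a second TLB lookup, which is the case
       that raises the alignment fault here. */
    static bool crosses_line(uint64_t vaddr, unsigned size_bytes)
    {
        return (vaddr / LINE_BYTES) != ((vaddr + size_bytes - 1) / LINE_BYTES);
    }

    static bool crosses_page(uint64_t vaddr, unsigned size_bytes)
    {
        return (vaddr / PAGE_BYTES) != ((vaddr + size_bytes - 1) / PAGE_BYTES);
    }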

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Tue Dec 2 18:50:12 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:
    The issue with this is it does not go backwards to
    get the address fetched again from the TLB. Meaning no check is made for >protection or translation of the address.

    It would be quite slow to have the instructions reissued and percolate
    down the cache access again.

    Unaligned access on a page boundary is extremely slow on the Core 2
    Duo (IIRC 160 cycles for a store). So don't be shy:-)

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Dec 2 19:55:43 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Semi-unaligned memory tradeoff. If unaligned access is required, the
    memory logic just increments the physical address by 64 bytes to fetch
    the next cache line. The issue with this is it does not go backwards to
    get the address fetched again from the TLB. Meaning no check is made for protection or translation of the address.

    You can determine if an access is misaligned "enough" to warrant two
    trips down the pipe.
    a) crosses cache width
    b) crosses page boundary

    Case b ALWAYS needs 2 trips; so the mechanism HAS to be there.

    It would be quite slow to have the instructions reissued and percolate
    down the cache access again.

    An AGEN-like adder has 11 gates of delay; you can determine misalignment
    in 4 gates of delay.

    This should only be an issue if an unaligned access crosses a memory
    page boundary.

    Here you need to access the TLB twice.

    The instruction causes an alignment fault if a page boundary crossing is detected.

    probably not as wise as you think.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Tue Dec 2 21:20:33 2025
    From Newsgroup: comp.arch

    On 2025-12-02 2:55 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Semi-unaligned memory tradeoff. If unaligned access is required, the
    memory logic just increments the physical address by 64 bytes to fetch
    the next cache line. The issue with this is it does not go backwards to
    get the address fetched again from the TLB. Meaning no check is made for
    protection or translation of the address.

    You can determine is an access is misaligned "enough" to warrant two
    trips down the pipe.
    a) crosses cache width
    b) crosses page boundary

    Case b ALLWAYS needs 2 trips; so the mechanism HAS to be there.

    It would be quite slow to have the instructions reissued and percolate
    down the cache access again.

    An AGEN-like adder has 11-gates of delay, you can determine misaligned
    by 4-gates of delay.

    I was thinking in terms of clock cycles. The recalc of the address could
    be triggered by resetting bits in the reorder buffer, which causes the
    instruction to be re-dispatched. I am not sure how many clocks, but
    likely a minimum of four or five. Memory access is sequential, so it
    will stall other accesses too.

    I have a tendency not to think about the gate delays too much, until
    they appear on the timing path. The lookup tables can absorb a good
    chunk of gate delay.


    This should only be an issue if an unaligned access crosses a memory
    page boundary.

    Here you need to access the TLB twice.

    The instruction causes an alignment fault if a page cross boundary is
    detected.

    probably not as wise as you think.

    I coded it so it makes two trips to the TLB now for page boundaries (in theory). I got to thinking that maybe the page size could be made huge
    to avoid page crossings.

    I may need to put more logic in to ensure the same load store queue slot
    is used. I think it should work since things are sequential.

    My toy is broken. It is taking too long to synthesize. Qupls is so
    complex now. I may pick something simpler.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From kegs@kegs@provalid.com (Kent Dickey) to comp.arch on Thu Dec 4 16:54:56 2025
    From Newsgroup: comp.arch

    In article <20251201132322.000051a5@yahoo.com>,
    Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 01 Dec 2025 07:56:37 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    [snip]
    If hardware designers put their mind to it, they could make sequential
    consistency perform well, probably better on code that actually
    accesses data shared between different threads than weak and "strong"
    ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the
    slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).


    Where does sequential consistency simplify programming over the x86 model
    of "TSO + globally ordered synchronization primitives +
    every synchronization primitive has implied barriers"?

    More so, where it simplifies over ARMv8.1-A, assuming that programmer
    does not try to be too smart and never uses LL/SC and always uses
    8.1-style synchronization instructions with Acquire+Release flags set?

    IMHO, the only simple thing about sequential consistency is its simple
    description. Other than that, it simplifies very little. It does not
    magically make lockless multithreaded programming bearable to
    non-genius coders.

    Compiler writers have hidden behind the hardware complexity to make
    writing source code that is thread-safe much harder than it should be.
    If you have to support placing hardware barriers, then the languages
    can get away with requiring lots of <atomic> qualifiers everywhere, even
    on systems which don't need barriers, making the code more complex. And
    language purists still love to sneer at volatile in C-like languages as
    "providing no guarantees, and so is essentially useless"--when volatile
    providing no guarantees is a language and compiler choice, not something
    written in stone. A bunch of useful algorithms could be written with
    merely "volatile"-like semantics, but for some reason people like the
    line-noise-like junk of C++ atomics, where rather than thinking in terms
    of the algorithm, everyone needs to think in terms of release and
    acquire (which are weak-ordering concepts).

    Kent
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Dec 4 18:37:54 2025
    From Newsgroup: comp.arch


    kegs@provalid.com (Kent Dickey) posted:

    In article <20251201132322.000051a5@yahoo.com>,
    Michael S <already5chosen@yahoo.com> wrote:
    On Mon, 01 Dec 2025 07:56:37 GMT
    anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
    [snip]
    If hardware designers put their mind to it, they could make sequential
    consistency perform well, probably better on code that actually
    accesses data shared between different threads than weak and "strong"
    ordering, because there is no need to slow down the program with
    fences and the like in cases where only one thread accesses the data,
    and in cases where the data is read by all threads. You will see the
    slowdown only in run-time cases when one thread writes and another
    reads in temporal proximity. And all the fences etc. that are
    inserted just in case would also become fast (noops).


    Where does sequential consistency simplifies programming over x86 model
    of "TCO + globally ordered synchronization primitives +
    every synchronization primitives have implied barriers"?

    More so, where it simplifies over ARMv8.1-A, assuming that programmer
    does not try to be too smart and never uses LL/SC and always uses
    8.1-style synchronization instructions with Acquire+Release flags set?

    IMHO, the only simple thing about sequential consistency is simple >description. Other than that, it simplifies very little. It does not >magically make lockless multithreaded programming bearable to
    non-genius coders.

    Compiler writers have hidden behind the hardware complexity to make
    writing source code that is thread-safe much harder than it should be.

    Blaming the wrong people.

    If you have to support placing hardware barriers, then the languages
    can get away with needing lots of <atomic> qualifiers everywhere, even
    on systems which don't need barriers, making the code more complex. And

    Thread-safe, by definition, is (IS) harder.

    language purists still love to sneer at volatile in C-like languages as "providing no guarantees, and so is essentially useless"--when volatile providing no guarantees is a language and compiler choice, not something written in stone.

    The problem with volatile is that all it means is that every time a
    volatile variable is touched, the code has to have a corresponding LD or
    ST. The HW ends up knowing nothing about the value's volatility and ends
    up in no position to help.

    A bunch of useful algorithms could be written with
    merely "volatile" like semantics, but for some reason, people like the line-noise-like junk of C++ atomics, where rather than thinking in terms
    of the algorithm, everyone needs to think in terms of release and acquire. (Which are weakly-ordering concepts).

    As far as ATOMICs go:: until you can code a single ATOMIC event that moves
    an element of a concurrent data structure from one place to another in a
    single event, you are thinking too SMALL (4 pointers in 4 different cache
    lines).

    In addition, the code should NOT have to test for success or failure, but
    be defined in such a way that if you get here, success is known, and if
    you get there, failure is known.

    Kent

    Mitch
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Dec 5 11:10:22 2025
    From Newsgroup: comp.arch

    On 04/12/2025 19:37, MitchAlsup wrote:

    kegs@provalid.com (Kent Dickey) posted:


    Thread-safe, by definition, is (IS) harder.

    language purists still love to sneer at volatile in C-like languages as
    "providing no guarantees, and so is essentially useless"--when volatile
    providing no guarantees is a language and compiler choice, not something
    written in stone.

    The problem with volatile is that all it means is the every time a volatile variable is touched, the code has to have a corresponding LD or ST. The HW ends up knowing nothing about the value's volativity and ends up in no position to help.


    "volatile" /does/ provide guarantees - it just doesn't provide enough guarantees for multi-threaded coding on multi-core systems. Basically,
    it only works at the C abstract machine level - it does nothing that
    affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever. But
    you need volatile semantics for atomics and fences as well - there's no
    point in enforcing an order at the hardware level if the accesses can be re-ordered at the software level!

    "volatile" on its own is therefore not sufficient for atomics on big
    modern processors. But it /is/ sufficient for some uses, such as
    accessing hardware registers, or for small atomic loads and stores on
    single processor systems (which are far and away the biggest market, as embedded microcontrollers).
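
    A typical use looks like the following; the register addresses and bit
    layout are made up purely for illustration:

    #include <stdint.h>

    #define UART_DATA   (*(volatile uint32_t *)0x40001000u)  /* hypothetical */
    #define UART_STATUS (*(volatile uint32_t *)0x40001004u)  /* hypothetical */
    #define TX_READY    (1u << 5)

    static void uart_putc(char c)
    {
        while (!(UART_STATUS & TX_READY))
            ;                     /* every poll is a real load */
        UART_DATA = (uint32_t)c;  /* the store cannot be elided or reordered
                                     past other volatile accesses at the C level */
    }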

    As I see it, the biggest problem with "volatile" in C is
    misunderstandings and misuse of all sorts. At least, that's what I see
    in my field of embedded development.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Fri Dec 5 14:37:57 2025
    From Newsgroup: comp.arch

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >guarantees for multi-threaded coding on multi-core systems. Basically,
    it only works at the C abstract machine level - it does nothing that
    affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),
    and I don't think that C with just volatile gives you such guarantees.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Dec 5 18:29:48 2025
    From Newsgroup: comp.arch

    On 05/12/2025 15:37, Anton Ertl wrote:
    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough
    guarantees for multi-threaded coding on multi-core systems. Basically,
    it only works at the C abstract machine level - it does nothing that
    affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".

    It says a good deal about the ordering at the C level - but nothing
    about it at the memory level.

    I know very little about the MMU setups on "big" systems like the x86-64 world. But in the embedded microcontroller world, it is very common for
    areas of the memory map to have sequential consistency even if other
    areas can be re-ordered, cached, or otherwise jumbled around. Thus for memory-mapped peripheral areas, memory accesses are kept strictly in
    order and "volatile" is all you need.

    If hardware guaranteed sequential consistency, volatile would provide guarantees that are as good on multi-core machines as on single-core machines.

    Sure. Of course multi-core systems will not have that hardware
    guarantee, at least not on main memory, for performance reasons. So
    there you need something more than just C "volatile" to force specific orderings. But volatile semantics will still be needed in many cases.
    Thus "volatile" is not sufficient, but it is still necessary. Usually,
    of course, all necessary "volatile" qualifiers are included in OS or
    library macros or functions for anything that needs them for locks or inter-process communication and the like. (In Linux, you have the
    READ_ONCE and WRITE_ONCE macros, which are just wrappers forcing
    volatile accesses.)
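
    A simplified sketch in the spirit of those macros, assuming a
    GCC/Clang-style __typeof__; the real kernel versions handle additional
    cases, but the core idea is just a forced volatile access:

    #define READ_ONCE(x)      (*(const volatile __typeof__(x) *)&(x))
    #define WRITE_ONCE(x, v)  (*(volatile __typeof__(x) *)&(x) = (v))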


    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),
    and I don't think that C with just volatile gives you such guarantees.


    Correct.

    Getting this wrong is one of the problems I have seen with volatile
    usage in embedded systems. I've seen people assuming that declaring "x"
    as "volatile" means that "x++;" is an atomic operation, or that volatile
    alone lets you share 64-bit data between threads on a 32-bit processor.

    Used correctly, it /can/ be enough for shared data between pre-emptive
    threads or a main loop and interrupts on a single core system. But
    sometimes you need to do more (for microcontrollers, that usually means disabling interrupts for a short period).
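
    A sketch of that single-core pattern; disable_irq()/enable_irq() are
    placeholders for whatever the target MCU actually provides:

    #include <stdint.h>

    extern void disable_irq(void);   /* assumed platform primitives */
    extern void enable_irq(void);

    static volatile uint32_t tick;        /* written only by the ISR           */
    static volatile uint64_t usec_total;  /* 64-bit: two words on a 32-bit MCU */

    void timer_isr(void)
    {
        tick++;                  /* fine: single writer, aligned 32-bit store */
        usec_total += 1000;
    }

    uint32_t get_tick(void)
    {
        return tick;             /* a volatile read is enough here */
    }

    uint64_t get_usec(void)
    {
        uint64_t t;
        disable_irq();           /* the 64-bit read takes two loads, so lock
                                    out the ISR briefly to keep it consistent */
        t = usec_total;
        enable_irq();
        return t;
    }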

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Dec 5 17:57:48 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >guarantees for multi-threaded coding on multi-core systems. Basically,
    it only works at the C abstract machine level - it does nothing that >affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage >queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide guarantees that are as good on multi-core machines as on single-core machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    and I don't think that C with just volatile gives you such guarantees.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Dec 5 20:10:11 2025
    From Newsgroup: comp.arch

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough
    guarantees for multi-threaded coding on multi-core systems. Basically,
    it only works at the C abstract machine level - it does nothing that
    affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor. Even with a
    single core system you can have pre-emptive multi-threading, or at least interrupt routines that may need to cooperate with other tasks on data.
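
    In portable C11 terms (nothing ISA-specific assumed), the first two
    look like:

    #include <stdatomic.h>

    static atomic_int counter;

    void hit(void)
    {
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    }

    /* Classic compare-and-swap retry loop: atomically raise a maximum. */
    void update_max(atomic_int *max, int v)
    {
        int cur = atomic_load_explicit(max, memory_order_relaxed);
        while (cur < v &&
               !atomic_compare_exchange_weak_explicit(max, &cur, v,
                                                      memory_order_relaxed,
                                                      memory_order_relaxed))
            ;   /* cur now holds the observed value; loop and retry */
    }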


    and I don't think that C with just volatile gives you such guarantees.

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Dec 5 20:54:00 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough
    guarantees for multi-threaded coding on multi-core systems. Basically, >>> it only works at the C abstract machine level - it does nothing that
    affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

    BOOLEAN DCAS( type oldp,  type_t oldq,
                  type *p,    type_t *q,
                  type newp,  type newq )
    {
        type t = esmLOCKload( *p );
        type r = esmLOCKload( *q );
        if( t == oldp && r == oldq )
        {
            *p = newp;
            esmLOCKstore( *q, newq );
            return TRUE;
        }
        return FALSE;
    }

    Move Element from one place to another:

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
        Element *fn = esmLOCKload( fr->next );
        Element *fp = esmLOCKload( fr->prev );
        Element *tn = esmLOCKload( to->next );
        esmLOCKprefetch( fn );
        esmLOCKprefetch( fp );
        esmLOCKprefetch( tn );
        if( !esmINTERFERENCE() )
        {
            fp->next = fn;
            fn->prev = fp;
            to->next = fr;
            tn->prev = fr;
            fr->prev = to;
            esmLOCKstore( fr->next, tn );
            return TRUE;
        }
        return FALSE;
    }

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    Even with a single core system you can have pre-emptive multi-threading, or at least interrupt routines that may need to cooperate with other tasks on data.


    and I don't think that C with just volatile gives you such guarantees.

    - anton

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 5 14:55:36 2025
    From Newsgroup: comp.arch

    On 12/5/2025 12:54 PM, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>> it only works at the C abstract machine level - it does nothing that >>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>> but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

    BOOLEAN DCAS( type oldp, type_t oldq,
    type *p, type_t *q,
    type newp, type newq )
    {
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
    *p = newp;
    esmLOCKstore( *q, newq );
    return TRUE;
    }
    return FALSE;
    }

    Move Element from one place to another:

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    esmLOCKstore( fr->next, tn );
    return TRUE;
    }
    return FALSE;
    }

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    Any issues with live lock in here?

    [...]
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 5 15:03:53 2025
    From Newsgroup: comp.arch

    On 12/5/2025 11:10 AM, David Brown wrote:
    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough
    guarantees for multi-threaded coding on multi-core systems.  Basically, >>>> it only works at the C abstract machine level - it does nothing that
    affects the hardware.  So volatile writes are ordered at the C level, >>>> but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    It's strange that double-word compare-and-swap (DWCAS), where the words
    are contiguous, is reported by some compilers as not lock-free even on
    x86. For a 32-bit system we have cmpxchg8b; for a 64-bit system,
    cmpxchg16b. But the compiler still reports it as not lock-free. Strange.

    using cmpxchg instead of xadd:
    https://forum.pellesc.de/index.php?topic=7167.0

    trying to tell me that a DWCAS is not lock free: https://forum.pellesc.de/index.php?topic=7311.msg27764#msg27764

    This should be lock-free on an x86, even x64:

    struct ct_proxy_dwcas
    {
        struct ct_proxy_node* node;
        intptr_t count;
    };

    some of my older code:

    AC_SYS_APIEXPORT
    int AC_CDECL
    np_ac_i686_atomic_dwcas_fence
    ( void*,
      void*,
      const void* );


    np_ac_i686_atomic_dwcas_fence PROC
        push esi
        push ebx
        ; after the pushes: [esp+12] = destination, [esp+16] = comparand,
        ; [esp+20] = exchange value
        mov esi, [esp + 16]
        mov eax, [esi]              ; expected value into edx:eax
        mov edx, [esi + 4]
        mov esi, [esp + 20]
        mov ebx, [esi]              ; new value into ecx:ebx
        mov ecx, [esi + 4]
        mov esi, [esp + 12]         ; destination
        lock cmpxchg8b qword ptr [esi]
        jne np_ac_i686_atomic_dwcas_fence_fail
        xor eax, eax                ; success: return 0
        pop ebx
        pop esi
        ret

    np_ac_i686_atomic_dwcas_fence_fail:
        mov esi, [esp + 16]         ; write observed value back to comparand
        mov [esi + 0], eax
        mov [esi + 4], edx
        mov eax, 1                  ; failure: return 1
        pop ebx
        pop esi
        ret
    np_ac_i686_atomic_dwcas_fence ENDP
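
    For comparison, a portable C11 version of the same idea; whether it
    ends up lock-free (cmpxchg8b / cmpxchg16b) or falls back to a lock is
    up to the compiler and target flags, which is exactly the complaint
    above:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    struct dw {
        void     *node;
        intptr_t  count;
    };

    static bool dwcas(_Atomic struct dw *dst, struct dw *expected,
                      struct dw desired)
    {
        /* On failure, *expected is updated with the observed value,
           matching the hand-written routine above. */
        return atomic_compare_exchange_strong(dst, expected, desired);
    }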


    Even with a
    single core system you can have pre-emptive multi-threading, or at least interrupt routines that may need to cooperate with other tasks on data.


    and I don't think that C with just volatile gives you such guarantees.

    - anton


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Dec 6 00:40:11 2025
    From Newsgroup: comp.arch

    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count. Removing the bypassing of r0 from the register file shaved 1000
    LUTs off the design. This is no real loss as most instructions can
    substitute small constants for register values.

    Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
    to zero only in the agen units. So, the bypass is only in a couple of
    places. Otherwise r0 can be used as an ordinary register. Load / store instructions cannot use r0 as a GPR then, but it works for the PowerPC.

    I hit this trying to decide where to bypass another register code to
    represent the instruction pointer. In that case I think it may be better
    to go RISCV style and just add an instruction to add the IP to a
    constant and place it in a register. The alternative might be to
    sacrifice a bit of displacement to indicate IP relative addressing.

    Anyone got a summary of bypassing r0 in different architectures?

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Dec 6 07:26:24 2025
    From Newsgroup: comp.arch

    Robert Finch <robfi680@gmail.com> writes:
    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count.

    My impression is that modern implementations deal with this kind of
    stuff at decoding or in the renamer. That should reduce the number of
    places where it is special-cased to one, but it means that the uops
    have to represent 0 in some way. One way would be to have a physical
    register that is 0 and that is never allocated, but if your
    microarchitecture needs a reduction of actual read ports (compared to
    potential read ports), you may prefer a different representation of 0
    in the uops.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Dec 6 05:13:01 2025
    From Newsgroup: comp.arch

    On 2025-12-06 2:26 a.m., Anton Ertl wrote:
    Robert Finch <robfi680@gmail.com> writes:
    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count.

    My impression is that modern implementations deal with this kind of
    stuff at decoding or in the renamer. That should reduce the number of
    places where it is special-cased to one, but it means that the uops
    have to represent 0 in some way. One way would be to have a physical register that is 0 and that is never allocated, but if your
    microarchitecture needs a reduction of actual read ports (compared to potential read ports), you may prefer a different representation of 0
    in the uops.

    - anton

    Thanks,

    It should have occurred to me to do this at the decode stage. Constants
    are decoded and passed along for all register fields in decode. There
    are only four decoders fortunately.

    Switching the ISA back to having r0 as zero all the time.
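
    As a rough sketch of what that decode step does (structure and field
    names are invented): the r0 specifier becomes a constant-zero operand
    before rename/issue, so no later stage has to special-case it:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     is_const;   /* operand is an immediate, not a register read */
        uint8_t  reg;        /* source register if !is_const                 */
        uint64_t imm;        /* constant value if is_const                   */
    } Operand;

    static Operand decode_src(uint8_t rspec)
    {
        if (rspec == 0)                       /* r0 reads as the constant 0 */
            return (Operand){ .is_const = true, .imm = 0 };
        return (Operand){ .is_const = false, .reg = rspec };
    }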


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Sat Dec 6 14:42:13 2025
    From Newsgroup: comp.arch

    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>> it only works at the C abstract machine level - it does nothing that >>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>> but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.


    The functions below rely on more than that - to make them work, as far as
    I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting
    until the esmLOCKstore instruction. Or am I missing something here?

    It is not easy to have atomic or lock mechanisms on multi-core systems
    that are convenient to use, efficient even in the worst cases, and don't require additional hardware.


    Compare Double, Swap Double::

    BOOLEAN DCAS( type oldp, type_t oldq,
    type *p, type_t *q,
    type newp, type newq )
    {
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
    *p = newp;
    esmLOCKstore( *q, newq );
    return TRUE;
    }
    return FALSE;
    }

    Move Element from one place to another:

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    esmLOCKstore( fr->next, tn );
    return TRUE;
    }
    return FALSE;
    }

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.


    Of course. It is interesting to speculate about possible features of an architecture like yours, but it is not likely to be available to anyone
    else in practice (unless perhaps it can be implemented as an extension
    for RISC-V).

    Even with a
    single core system you can have pre-emptive multi-threading, or at least
    interrupt routines that may need to cooperate with other tasks on data.


    and I don't think that C with just volatile gives you such guarantees. >>>>
    - anton


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat Dec 6 17:16:11 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough
    guarantees for multi-threaded coding on multi-core systems. Basically, >> >>> it only works at the C abstract machine level - it does nothing that
    affects the hardware. So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

    BOOLEAN DCAS( type oldp, type_t oldq,
    type *p, type_t *q,
    type newp, type newq )
    {
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
    *p = newp;
    esmLOCKstore( *q, newq );
    return TRUE;
    }
    return FALSE;
    }

    Move Element from one place to another:

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    esmLOCKstore( fr->next, tn );
    return TRUE;
    }
    return FALSE;
    }

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    In my 40 years of SMP OS/HV work, I don't recall a
    situation where 'MoveElement' would be useful or
    required as a hardware atomic operation.

    Individual atomic "Remove Element" and "Insert/Append Element"[*], yes. Combined? Too inflexible.

    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

    Atomic add/sub are useful. The other atomic math operations (min, max, etc) may be useful in certain cases as well.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 6 17:22:55 2025
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/5/2025 12:54 PM, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>> it only works at the C abstract machine level - it does nothing that >>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>> but that says nothing about how they might progress through storage >>>>> queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering >>>> on hardware with weaker memory ordering than sequential consistency". >>>> If hardware guaranteed sequential consistency, volatile would provide >>>> guarantees that are as good on multi-core machines as on single-core >>>> machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems), >>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

    BOOLEAN DCAS( type oldp, type_t oldq,
    type *p, type_t *q,
    type newp, type newq )
    {
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
    *p = newp;
    esmLOCKstore( *q, newq );
    return TRUE;
    }
    return FALSE;
    }

    Move Element from one place to another:

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    esmLOCKstore( fr->next, tn );
    return TRUE;
    }
    return FALSE;
    }

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    Any issues with live lock in here?

    A bit hard to tell because of 2 things::
    a) I carry around the thread priority and when interference occurs,
    the higher-priority thread wins--on ties, the thread already in the
    event wins.
    b) live-lock is resolved (or not) by the caller of these routines, not
    by the routines themselves.

    [...]
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 6 17:29:53 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count. Removing the bypassing of r0 from the register file shaved 1000
    LUTs off the design. This is no real loss as most instructions can substitute small constants for register values.

    Often the use of R0 as an operand causes the calculation to be degenerate.
    That is, R0 is not needed at all.
    ADD R9,R7,R0 // is a MOV instruction
    AND R9,R7,R0 // is a CLR instruction

    So, you don't have to treat R0 in bypassing, but as Operand processing.

    Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
    to zero only in the agen units. So, the bypass is only in a couple of places. Otherwise r0 can be used as an ordinary register. Load / store instructions cannot use r0 as a GPR then, but it works for the PowerPC.

    AGEN Rbase ==R0 implies Rbase = IP
    AGEN Rindex==R0 implies Rindex = 0

    I hit this trying to decide where to bypass another register code to represent the instruction pointer. In that case I think it may be better
    to go RISCV style and just add an instruction to add the IP to a
    constant and place it in a register. The alternative might be to
    sacrifice a bit of displacement to indicate IP relative addressing.

    Anyone got a summary of bypassing r0 in different architectures?

    These are some of the reasons I went with
    a) universal constants
    b) R0 is just another GPR
    So, R0 gets forwarded just as often (or as rarely) as any joe-random register.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 6 17:31:43 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    Robert Finch <robfi680@gmail.com> writes:
    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count.

    My impression is that modern implementations deal with this kind of
    stuff at decoding or in the renamer. That should reduce the number of
    places where it is special-cased to one, but it means that the uops
    have to represent 0 in some way. One way would be to have a physical register that is 0 and that is never allocated, but if your
    microarchitecture needs a reduction of actual read ports (compared to potential read ports), you may prefer a different representation of 0
    in the uops.

    Another way to implement R0 is to have an AND gate after the Operand
    flip-flop: if <whatever> was captured is R0, then AND with 0, otherwise
    AND with 1.

    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 6 17:44:30 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>> it only works at the C abstract machine level - it does nothing that >>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>> but that says nothing about how they might progress through storage >>>>> queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering >>>> on hardware with weaker memory ordering than sequential consistency". >>>> If hardware guaranteed sequential consistency, volatile would provide >>>> guarantees that are as good on multi-core machines as on single-core >>>> machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems), >>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.


    The functions below rely on more than that - to make the work, as far as
    I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting
    until the esmLOCKstore instruction. Or am I missing something here?

    In the above, I was stating that the maximum width of LD/ST can be a lot
    bigger than the size of a single register, not that the above instructions
    make writing ATOMIC events easier.

    There is no bus!

    The esmLOCKload causes the <translated> address to be 'monitored'
    for interference, and to announce participation in the ATOMIC event.

    The FIRST esmLOCKload tells the core that an ATOMIC event is beginning,
    AND sets up a default control point (This instruction itself) so that
    if interference is detected at esmLOCKstore control is transferred to
    that control point.

    So, there is no way to write Test-and-Set !! you get Test-and-Test-and-Set
    for free.

    There is a branch-on-interference instruction that
    a) does what it says,
    b) sets up an alternate atomic control point.

    It is not easy to have atomic or lock mechanisms on multi-core systems
    that are convenient to use, efficient even in the worst cases, and don't require additional hardware.

    I am using the "Miss Buffer" as the point of monitoring for interference.
    a) it already has to monitor "other hits" from outside accesses to deal
    with the coherence mechanism.
    b) the esm additions to the Miss Buffer are on the order of 2%

    c) there are other means to strengthen guarantees of forward progress.


    Compare Double, Swap Double::

    BOOLEAN DCAS( type oldp, type_t oldq,
    type *p, type_t *q,
    type newp, type newq )
    {
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
    *p = newp;
    esmLOCKstore( *q, newq );
    return TRUE;
    }
    return FALSE;
    }

    Move Element from one place to another:

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    esmLOCKstore( fr->next, tn );
    return TRUE;
    }
    return FALSE;
    }

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    Of course. It is interesting to speculate about possible features of an architecture like yours, but it is not likely to be available to anyone
    else in practice (unless perhaps it can be implemented as an extension
    for RISC-V).

    Even with a
    single core system you can have pre-emptive multi-threading, or at least >> interrupt routines that may need to cooperate with other tasks on data.


    and I don't think that C with just volatile gives you such guarantees. >>>>
    - anton


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 6 18:07:50 2025
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >> >>> guarantees for multi-threaded coding on multi-core systems. Basically,
    it only works at the C abstract machine level - it does nothing that >> >>> affects the hardware. So volatile writes are ordered at the C level, >> >>> but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency". >> >> If hardware guaranteed sequential consistency, volatile would provide >> >> guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems), >> >
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

    BOOLEAN DCAS( type oldp, type_t oldq,
    type *p, type_t *q,
    type newp, type newq )
    {
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
    *p = newp;
    esmLOCKstore( *q, newq );
    return TRUE;
    }
    return FALSE;
    }

    Move Element from one place to another:

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    esmLOCKstore( fr->next, tn );
    return TRUE;
    }
    return FALSE;
    }

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    In my 40 years of SMP OS/HV work, I don't recall a
    situation where 'MoveElement' would be useful or
    required as an hardware atomic operation.

    The question is not would "MoveElement" be useful, but
    would it be useful to have a single ATOMIC event be
    able to manipulate {5,6,7,8} pointers in one event ??

    Individual atomic "Remove Element" and "Insert/Append Element"[*], yes. Combined? Too inflexible.

    BOOLEAN InsertElement( Element *el, Element *to )
    {
        Element *tn = esmLOCKload( to->next );
        esmLOCKprefetch( el );
        esmLOCKprefetch( tn );
        if( !esmINTERFERENCE() )
        {
            el->next = tn;
            el->prev = to;
            to->next = el;
            esmLOCKstore( tn->prev, el );
            return TRUE;
        }
        return FALSE;
    }

    BOOLEAN RemoveElement( Element *fr )
    {
        Element *fn = esmLOCKload( fr->next );
        Element *fp = esmLOCKload( fr->prev );
        esmLOCKprefetch( fn );
        esmLOCKprefetch( fp );
        if( !esmINTERFERENCE() )
        {
            fp->next = fn;
            fn->prev = fp;
            fr->prev = NULL;
            esmLOCKstore( fr->next, NULL );
            return TRUE;
        }
        return FALSE;
    }


    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

    Atomic add/sub are useful. The other atomic math operations (min, max, etc) may be useful in certain cases as well.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sat Dec 6 19:04:09 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:


    Move Element from one place to another:

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    esmLOCKstore( fr->next, tn );
    return TRUE;
    }
    return FALSE;
    }

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    In my 40 years of SMP OS/HV work, I don't recall a
    situation where 'MoveElement' would be useful or
    required as an hardware atomic operation.

    The question is not would "MoveElement" be useful, but
    would it be useful to have a single ATOMIC event be
    able to manipulate {5,6,7,8} pointers in one event ??

    Nothing comes immediately to mind.


    Individual atomic "Remove Element" and "Insert/Append Element"[*], yes.
    Combined? Too inflexible.

    BOOLEAN InsertElement( Element *el, Element *to )
    {
    tn = esmLOCKload( to->next );
    esmLOCKprefetch( el );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    el->next = tn;
    el->prev = to;
    to->next = el;
    esmLOCKstore( tn->prev, el );
    return TRUE;
    }
    return FALSE;
    }

    BOOLEAN RemoveElement( Element *fr )
    {
    fn = esmLOCKload( fr->next );
    fp = esmLOCKload( fr->prev );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    fr->prev = NULL;
    esmLOCKstore( fr->next, NULL );
    return TRUE;
    }
    return FALSE;
    }


    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

    Yes, you can add special instructions. However, compilers will be unlikely
    to generate them, so applications that want such an instruction generated
    would need a compiler extension (like gcc __builtin functions) or inline
    assembler, which makes any program that uses the capability both
    compiler-specific _and_ hardware-specific.

    Most extant SMP processors provide a compare-and-swap operation, which
    is widely supported by the common compilers that support the C and C++
    threading functionality.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sat Dec 6 21:36:27 2025
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Yes, you can add special instructions. However, the compilers will be unlikely
    to generate them, thus applications that desired the generation of such an instruction would need to create a compiler extension (like gcc __builtin functions)
    or inline assembler which would then make the program that uses the capability both compiler
    specific _and_ hardware specific.

    Most extant SMP processors provide a compare and swap operation, which
    are widely supported by the common compilers that support the C and C++ threading functionality.

    Interestingly, Linux restartable sequences allow for acquisition of
    a lock with no membarrier or atomic instruction on the fast path,
    at the cost of a syscall on the slow path (no free lunch...)

    But you also need assembler to do it.

    An example can be found at https://gitlab.ethz.ch/extra_projects/cpu-local-lock
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 6 21:44:17 2025
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:


    Move Element from one place to another:

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    esmLOCKstore( fr->next, tn );
    return TRUE;
    }
    return FALSE;
    }

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    In my 40 years of SMP OS/HV work, I don't recall a
    situation where 'MoveElement' would be useful or
    required as a hardware atomic operation.

    The question is not would "MoveElement" be useful, but
    would it be useful to have a single ATOMIC event be
    able to manipulate {5,6,7,8} pointers in one event ??

    Nothing comes immediately to mind.


    Individual atomic "Remove Element" and "Insert/Append Element"[*], yes.
    Combined? Too inflexible.

    BOOLEAN InsertElement( Element *el, Element *to )
    {
    tn = esmLOCKload( to->next );
    esmLOCKprefetch( el );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    el->next = tn;
    el->prev = to;
    to->next = el;
    esmLOCKstore( tn->prev, el );
    return TRUE;
    }
    return FALSE;
    }

    BOOLEAN RemoveElement( Element *fr )
    {
    fn = esmLOCKload( fr->next );
    fp = esmLOCKload( fr->prev );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    fr->prev = NULL;
    esmLOCKstore( fr->next, NULL );
    return TRUE;
    }
    return FALSE;
    }


    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

    Yes, you can add special instructions. However, the compilers will be unlikely
    to generate them, thus applications that desired the generation of such an instruction would need to create a compiler extension (like gcc __builtin functions)
    or inline assembler which would then make the program that uses the capability both compiler
    specific _and_ hardware specific.

    So, in other words, if you can't put it in every ISA known to man,
    don't bother making something better than existent ?!?

    Most extant SMP processors provide a compare and swap operation, which
    are widely supported by the common compilers that support the C and C++ threading functionality.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Dec 6 18:33:55 2025
    From Newsgroup: comp.arch

    On 2025-12-06 12:29 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count. Removing the bypassing of r0 from the register file shaved 1000
    LUTs off the design. This is no real loss as most instructions can
    substitute small constants for register values.

    Often the use of R0 as an operand causes the calculation to be degenerate. That is, R0 is not needed at all.
    ADD R9,R7,R0 // is a MOV instruction
    AND R9,R7,R0 // is a CLR instruction

    We dont want no degenerating instructions.

    So, you don't have to treat R0 in bypassing, but as Operand processing.

    Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
    to zero only in the agen units. So, the bypass is only in a couple of
    places. Otherwise r0 can be used as an ordinary register. Load / store
    instructions cannot use r0 as a GPR then, but it works for the PowerPC.

    AGEN Rbase ==R0 implies Rbase = IP
    AGEN Rindex==R0 implies Rindex = 0

    Qupls now follows a similar paradigm.
    Rbase = r0 bypasses to 0
    Rindex = r0 bypasses to 0
    Rbase = r31 bypasses to IP
    Bypassing r0 for both base and index allows absolute addressing mode.
    Otherwise r0, r31 are general-purpose regs.
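
    A minimal C sketch of those bypass rules (illustrative only; the register-file array, IP value, and displacement width are stand-ins, not actual Qupls RTL):

    #include <stdint.h>

    uint64_t agen(unsigned Rbase, unsigned Rindex, int64_t disp,
                  const uint64_t gpr[64], uint64_t ip)
    {
        uint64_t base  = (Rbase  == 0) ? 0 : (Rbase == 31) ? ip : gpr[Rbase];
        uint64_t index = (Rindex == 0) ? 0 : gpr[Rindex];
        /* Rbase == Rindex == r0 leaves just the displacement: absolute addressing. */
        return base + index + disp;
    }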

    I hit this trying to decide where to bypass another register code to
    represent the instruction pointer. In that case I think it may be better
    to go RISCV style and just add an instruction to add the IP to a
    constant and place it in a register. The alternative might be to
    sacrifice a bit of displacement to indicate IP relative addressing.

    Anyone got a summary of bypassing r0 in different architectures?

    These are some of the reasons I went with
    a) universal constants
    b) R0 is just another GPR
    So, R0, gets forwarded just as often (or lack thereof) as any joe-random register.

    Qupls has IP offset constant loading.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Sat Dec 6 18:55:17 2025
    From Newsgroup: comp.arch

    On 2025-12-06 6:33 p.m., Robert Finch wrote:
    On 2025-12-06 12:29 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count. Removing the bypassing of r0 from the register file shaved 1000
    LUTs off the design. This is no real loss as most instructions can
    substitute small constants for register values.

    Often the use of R0 as an operand causes the calculation to be
    degenerate.
    That is, R0 is not needed at all.
          ADD   R9,R7,R0        // is a MOV instruction
          AND   R9,R7,R0        // is a CLR instruction

    We dont want no degenerating instructions.

    So, you don't have to treat R0 in bypassing, but as Operand processing.
    Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
    to zero only in the agen units. So, the bypass is only in a couple of
    places. Otherwise r0 can be used as an ordinary register. Load / store
    instructions cannot use r0 as a GPR then, but it works for the PowerPC.

    AGEN Rbase ==R0 implies Rbase  = IP
    AGEN Rindex==R0 implies Rindex = 0

    Qupls now follows a similar paradigm.
     Rbase = r0 bypasses to 0
     Rindex = r0 bypasses to 0
     Rbase = r31 bypasses to IP
    Bypassing r0 for both base and index allows absolute addressing mode. Otherwise r0, r31 are general-purpose regs.

    I hit this trying to decide where to bypass another register code to
    represent the instruction pointer. In that case I think it may be better
    to go RISCV style and just add an instruction to add the IP to a
    constant and place it in a register. The alternative might be to
    sacrifice a bit of displacement to indicate IP relative addressing.

    Anyone got a summary of bypassing r0 in different architectures?

    These are some of the reasons I went with
    a) universal constants
    b) R0 is just another GPR
    So, R0, gets forwarded just as often (or lack thereof) as any joe-random
    register.

    Qupls has IP offset constant loading.



    No sooner had I updated the spec than I added two more opcodes to
    perform loads and stores using IP relative addressing. That way, no need
    to use r31, leaving 31 registers completely general purpose. I am
    wanting to cast some aspects of the ISA in stone, or it will never get anywhere.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sun Dec 7 03:29:05 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    On 2025-12-06 6:33 p.m., Robert Finch wrote:
    On 2025-12-06 12:29 p.m., MitchAlsup wrote:

    Robert Finch <robfi680@gmail.com> posted:

    Tradeoffs bypassing r0 causing more ISA tweaks.

    It is expensive to bypass r0. To truly bypass it, it needs to be
    bypassed in a couple of dozen places which really drives up the LUT
    count. Removing the bypassing of r0 from the register file shaved 1000
    LUTs off the design. This is no real loss as most instructions can
    substitute small constants for register values.

    Often the use of R0 as an operand causes the calculation to be
    degenerate.
    That is, R0 is not needed at all.
          ADD   R9,R7,R0        // is a MOV instruction
          AND   R9,R7,R0        // is a CLR instruction

    We dont want no degenerating instructions.

    So, you don't have to treat R0 in bypassing, but as Operand processing.
    Decided to go PowerPC style with bypassing of r0 to zero. R0 is bypassed
    to zero only in the agen units. So, the bypass is only in a couple of
    places. Otherwise r0 can be used as an ordinary register. Load / store
    instructions cannot use r0 as a GPR then, but it works for the PowerPC.
    AGEN Rbase ==R0 implies Rbase  = IP
    AGEN Rindex==R0 implies Rindex = 0

    Qupls now follows a similar paradigm.
     Rbase = r0 bypasses to 0
     Rindex = r0 bypasses to 0
     Rbase = r31 bypasses to IP
    Bypassing r0 for both base and index allows absolute addressing mode. Otherwise r0, r31 are general-purpose regs.

    I hit this trying to decide where to bypass another register code to
    represent the instruction pointer. In that case I think it may be better
    to go RISCV style and just add an instruction to add the IP to a
    constant and place it in a register. The alternative might be to
    sacrifice a bit of displacement to indicate IP relative addressing.

    Anyone got a summary of bypassing r0 in different architectures?

    These are some of the reasons I went with
    a) universal constants
    b) R0 is just another GPR
    So, R0, gets forwarded just as often (or lack thereof) as any joe-random
    register.

    Qupls has IP offset constant loading.



    No sooner than having updated the spec, I added two more opcodes to
    perform loads and stores using IP relative addressing. That way, no need
    to use r31, leaving 31 registers completely general purpose. I am
    wanting to cast some aspects of the ISA in stone, or it will never get anywhere.

    Cast some elements in plaster--this will hold for a few years until
    you find the bigger mistakes, then demolish the plaster and fix the
    parts that don't work so well.

    After 6 years of essential stability, I did a major update to My 66000
    ISA last month. The new ISA is ASCII compatible with the last, but not
    at the binary level, which solves several problems and saves another
    2%-4% in code footprint.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Dec 7 09:30:50 2025
    From Newsgroup: comp.arch

    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Yes, you can add special instructions. However, the compilers will be unlikely
    to generate them, thus applications that desired the generation of such an instruction would need to create a compiler extension (like gcc __builtin functions)
    or inline assembler which would then make the program that uses the capability both compiler
    specific _and_ hardware specific.

    This would likely be hidden in a header, and need only be
    written once (although gcc and clang, for example, are compatible
    in this respect). And people have been doing this, even for
    microarchitecture specific features, if the need for performance
    gain is large enough.

    A primary example is Intel TSX, which is (was?) required by SAP.
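
    A sketch of what hiding it in a header can look like (illustrative only, not SAP's or glibc's actual code): a small lock that elides itself with the RTM intrinsics when the compiler provides them (needs -mrtm) and otherwise falls back to a plain CAS lock.

    #include <stdatomic.h>
    #if defined(__RTM__)
    #include <immintrin.h>          /* _xbegin/_xend/_xabort/_xtest */
    #endif

    static atomic_int lockword;     /* 0 = free, 1 = held */

    static inline void elided_lock(void)
    {
    #if defined(__RTM__)
        if (_xbegin() == _XBEGIN_STARTED) {
            /* Subscribe to the lock word: if it is free we run transactionally;
               a real acquisition by another thread conflicts and aborts us. */
            if (atomic_load_explicit(&lockword, memory_order_relaxed) == 0)
                return;
            _xabort(0xff);
        }
    #endif
        int expected = 0;           /* fallback: ordinary CAS lock */
        while (!atomic_compare_exchange_weak_explicit(&lockword, &expected, 1,
                   memory_order_acquire, memory_order_relaxed))
            expected = 0;
    }

    static inline void elided_unlock(void)
    {
    #if defined(__RTM__)
        if (_xtest()) { _xend(); return; }   /* commit the transaction */
    #endif
        atomic_store_explicit(&lockword, 0, memory_order_release);
    }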

    POWER also had a transactional memory feature, but they messed it
    up for POWER 9 and dropped it for POWER 10 (IIRC); POWER is the
    only other architecture certified to run SAP, so it seems they
    can do without.

    Googling around, I also find the "Transactional Memory Extension"
    for ARM dated 2022, so ARM also appears to see some value in that,
    at least enough to write a spec for it.

    Most extant SMP processors provide a compare and swap operation, which
    are widely supported by the common compilers that support the C and C++ threading functionality.

    It seems there is a market for going beyond compare and swap.
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Sun Dec 7 16:05:32 2025
    From Newsgroup: comp.arch

    On Sun, 7 Dec 2025 09:30:50 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Yes, you can add special instructions. However, the compilers
    will be unlikely to generate them, thus applications that desired
    the generation of such an instruction would need to create a
    compiler extension (like gcc __builtin functions) or inline
    assembler which would then make the program that uses the
    capability both compiler specific _and_ hardware specific.

    This would likely be hidden in a header, and need only be
    written once (although gcc and clang, for example, are compatible
    in this respect). And people have been doing this, even for microarchitecture specific features, if the need for performance
    gain is large enough.

    A primary example is Intel TSX, which is (was?) required by SAP.


    By SAP HANA, I assume.
    Not sure for how long it was true. It sounds very unlikely that it is
    still true.

    POWER also had a transactional memory feature, but they messed it
    up for POWER 9 and dropped it for POWER 10 (IIRC); POWER is the
    only other architecture certified to run SAP, so it seems they
    can do without.

    Googling around, I also find the "Transactional Memory Extension"
    for ARM dated 2022, so ARM also appears to see some value in that,
    at least enough to write a spec for it.

    Most extant SMP processors provide a compare and swap operation,
    which are widely supported by the common compilers that support the
    C and C++ threading functionality.

    It seems there is a market for going beyond compare and swap.

    TSX is close to dead.

    ARM's TME was announced almost 5 years ago. AFAIK, there were no implementations. Recently ARM said that FEAT_TME has been obsoleted. It sounds
    like the whole thing is dead, but there is a small chance that I am misinterpreting.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sun Dec 7 16:13:06 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    scott@slp53.sl.home (Scott Lurndal) posted:



    Yes, you can add special instructions. However, the compilers will be unlikely
    to generate them, thus applications that desired the generation of such an instruction would need to create a compiler extension (like gcc __builtin functions)
    or inline assembler which would then make the program that uses the capability both compiler
    specific _and_ hardware specific.

    So, in other words, if you can't put it in every ISA known to man,
    don't bother making something better than existent ?!?

    Long experience. Back in the early 80's we had fancy instructions
    for searching linked lists (up to 100 digit or byte keys, comparisons for equal, ne, lt, gt, lte, gte, and any-bit-equal). Took special language support to use, which meant that it wasn't usable from COBOL without
    extensions. We also had Lock, Unlock and condition variable instructions (with a small microkernel to handle the contention cases, trapping on acquisition failure, release [when another thread was pending], and
    event signal). Perhaps ahead of its time, as most of the common languages (COBOL and Fortran) had no syntactical support for them. We used them
    in the OS language (SPRITE), but they never got traction in applications (and then the
    entire computer line was discontinued in 1991).

    That's not to suggest that your innovations aren't potentially useful
    or an interesting take on multithreaded instruction primitives;
    just that idealism and the real world are often incompatible :-)

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Sun Dec 7 16:28:41 2025
    From Newsgroup: comp.arch

    Thomas Koenig <tkoenig@netcologne.de> writes:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Yes, you can add special instructions. However, the compilers will be unlikely
    to generate them, thus applications that desired the generation of such an instruction would need to create a compiler extension (like gcc __builtin functions)
    or inline assembler which would then make the program that uses the capability both compiler
    specific _and_ hardware specific.

    This would likely be hidden in a header, and need only be
    written once (although gcc and clang, for example, are compatible
    in this respect). And people have been doing this, even for microarchitecture specific features, if the need for performance
    gain is large enough.

    A primary example is Intel TSX, which is (was?) required by SAP.

    POWER also had a transactional memory feature, but they messed it
    up for POWER 9 and dropped it for POWER 10 (IIRC); POWER is the
    only other architecture certified to run SAP, so it seems they
    can do without.

    Googling around, I also find the "Transactional Memory Extension"
    for ARM dated 2022, so ARM also appears to see some value in that,
    at least enough to write a spec for it.

    The ARM spec has been published. I'm not aware of any implementations
    of it to date, and the spec had been available to architecture partners
    for several years prior to 2022.

    Intel's TSX support seems to be restricted to a subset of Xeon processors,
    and it's not clear how well it's supported by non-Intel compilers.

    AMD has never released their Advanced Synchronization Facility in any
    processor to date.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Sun Dec 7 16:55:26 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> schrieb:
    On Sun, 7 Dec 2025 09:30:50 -0000 (UTC)
    Thomas Koenig <tkoenig@netcologne.de> wrote:

    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Yes, you can add special instructions. However, the compilers
    will be unlikely to generate them, thus applications that desired
    the generation of such an instruction would need to create a
    compiler extension (like gcc __builtin functions) or inline
    assembler which would then make the program that uses the
    capability both compiler specific _and_ hardware specific.

    This would likely be hidden in a header, and need only be
    written once (although gcc and clang, for example, are compatible
    in this respecct). And people have been doing this, even for
    microarchitecture specific features, if the need for performance
    gain is large enough.

    A primary example is Intel TSX, which is (was?) required by SAP.


    By SAP HANA, I assume.
    Not sure for how long it was true. It sounds very unlikely that it is
    still true.

    https://www.redhat.com/en/blog/red-hat-enterprise-linux-performance-results-5th-gen-intel-xeon-scalable-processors
    from 2024 has benchmarks with TSX for SAP/HANA, and the processors
    (5th generation Xeon) at least pretend to have TSX.

    https://community.sap.com/t5/technology-blog-posts-by-sap/seamless-scaling-of-sap-hana-on-intel-xeon-processors-from-micro-to-mega/ba-p/13968648
    (almost a year old) writes

    "Intel's Transactional Synchronization Extensions (TSX), also
    implemented into the SAP HANA database, further enhances this
    scalability and offers a significant performance boost for critical
    HANA database operations."

    which does not read "required", but certainly sounds like it is an
    advantage.

    POWER also had a transactional memory feature, but they messed it
    up for POWER 9 and dropped it for POWER 10 (IIRC); POWER is the
    only other architecture certified to run SAP, so it seems they
    can do without.

    Googling around, I also find the "Transactional Memory Extension"
    for ARM datetd 2022, so ARM also appears to see some value in that,
    at least enough to write a spec for it.

    Most extant SMP processors provide a compare and swap operation,
    which are widely supported by the common compilers that support the
    C and C++ threading functionality.

    It seems there is a market for going beyond compare and swap.

    TSX is close to dead.

    For general-purpose computers, it seems the security implications
    killed it. An SAP server is a different matter; if you don't trust
    the software you are running there, you have other issues.


    ARM's TME was announced almost 5 years ago. AFAIK, there were no implementations. Recently ARM said that FEAT_TME is obsoleted. It sounds
    like the whole thing is dead, but there is small chance that I am misinterpreting.

    Maybe restartable sequences are the way to go for lock-free
    critical sections. Not sure if everybody is aware of these. A good introduction can be found at https://lwn.net/Articles/883104/ .
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From EricP@ThatWouldBeTelling@thevillage.com to comp.arch on Sun Dec 7 12:19:34 2025
    From Newsgroup: comp.arch

    Scott Lurndal wrote:
    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    scott@slp53.sl.home (Scott Lurndal) posted:
    In my 40 years of SMP OS/HV work, I don't recall a
    situation where 'MoveElement' would be useful or
    required as a hardware atomic operation.
    The question is not would "MoveElement" be useful, but
    would it be useful to have a single ATOMIC event be
    able to manipulate {5,6,7,8} pointers in one event ??

    Nothing comes immediately to mind.

    Atomically moving an object from one doubly linked list to another,
    like when a thread wakes up and moves from the waiting list to the ready list.

    One iteration of balancing a binary tree (AVL, red-black)

    Plus the data structs above might straddle cache lines, so however many
    objects there are, there could be twice that many lines being updated at once.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sun Dec 7 17:48:50 2025
    From Newsgroup: comp.arch

    Michael S <already5chosen@yahoo.com> writes:
    Where does sequential consistency simplify programming over the x86 model
    of "TCO + globally ordered synchronization primitives +
    every synchronization primitive has implied barriers"?

    More so, where does it simplify over ARMv8.1-A, assuming that the programmer
    does not try to be too smart and never uses LL/SC and always uses
    8.1-style synchronization instructions with Acquire+Release flags set?

    IMHO, the only simple thing about sequential consistency is its simple description. Other than that, it simplifies very little. It does not magically make lockless multithreaded programming bearable to
    non-genius coders.

    Is single-core multi-threaded programming bearable to non-genius
    programmers? I think so. Sequential consistency plus atomic sequences
    (where the single-core program disables interrupts to start an atomic
    sequence and enables them to end an atomic sequence) gives the same
    programming model.
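
    A toy illustration of that single-core model (not from the post; di()/ei() are hypothetical stand-ins for whatever interrupt disable/enable the platform provides):

    struct node { struct node *next; };

    /* Hypothetical interrupt control: e.g. cli/sti on x86, cpsid/cpsie i on ARM. */
    #define di()   /* disable interrupts: nothing can preempt the sequence below */
    #define ei()   /* re-enable interrupts */

    void enqueue(struct node **head, struct node *el)
    {
        di();                 /* begin atomic sequence */
        el->next = *head;
        *head    = el;
        ei();                 /* end atomic sequence */
    }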

    Concerning synchronization instructions and memory barriers of
    architectures with weaker memory models, their main problem is that
    they are implemented slowly, because the idea is to make only the
    weaker memory model go fast, and then suffer what you must if you need
    more guarantees. Already the guarantee makes them slow, not just the
    actual synchronization case. This makes the memory model hard to use,
    because you want to minimize the use of these instructions. And
    that's where the need for genius-level coding comes in.

    As for the size of the description, IMO this reflects on the
    simplicity of programming. ARM's memory model was advertized here as:
    "It's only 32 pages" <YfxXO.384093$EEm7.56154@fx16.iad>. If it is
    simple to program, why does it need 32 pages of description?

    Concerning non-genius coders and coders that are not experts in memory
    ordering models, the current setup seems to be designed to have a few
    people who program system software that does such things, and
    everybody else should just use this software (whether it's system
    calls or libraries). That's ok if the need to communicate between
    threads is rare, but not so great if it is frequent (especially the
    system-call variant). And if the need to communicate between threads
    is rare, it's also good enough if the hardware features for that need
    are slow. So maybe this whole setup is good enough.

    OTOH, maybe there are applications that could potentially use multiple
    threads that are currently using sequential programs or context
    switching within a hardware thread (green threads and the like)
    because the communication between the threads is too slow and making
    it faster is too hard to program. In that case the underutilization
    of many of the multi-core CPUs that we have may be due to this
    phenomenon. If so, the argument that it's too expensive in hardware
    resources to implement sequential consistency in hardware well does
    not hold: Is it more expensive than implementing an 8-core CPU where 6 or 7 cores are usually not utilized?

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Dec 7 14:51:01 2025
    From Newsgroup: comp.arch

    On 12/5/2025 3:03 PM, Chris M. Thomasson wrote:
    On 12/5/2025 11:10 AM, David Brown wrote:
    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>> guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing that >>>>> affects the hardware.  So volatile writes are ordered at the C level, >>>>> but that says nothing about how they might progress through storage
    queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering
    on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core
    machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    It's strange with double-word compare-and-swap (DWCAS), where the words
    are contiguous. Well, I have seen compilers say it's not lock-free even
    on x86. For a 32-bit system we have cmpxchg8b, for a 64-bit system cmpxchg16b. But the compiler reports not lock-free. Strange.

    using cmpxchg instead of xadd: https://forum.pellesc.de/index.php?topic=7167.0

    trying to tell me that a DWCAS is not lock free: https://forum.pellesc.de/index.php?topic=7311.msg27764#msg27764

    This should be lock-free on an x86, even x64:

    struct ct_proxy_dwcas
    {
        struct ct_proxy_node* node;
        intptr_t count;
    };

    Ideally, struct ct_proxy_dwcas should be aligned on an L2 cache line and
    padded up to the size of a cache line.
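
    The same double-word CAS expressed with C11 atomics looks roughly like this (a sketch, not Chris's code; on x86-64, gcc/clang generally want -mcx16 or a libatomic that uses cmpxchg16b, and atomic_is_lock_free() may still report false, which is the behaviour being complained about above):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    struct ct_proxy_node;                   /* opaque here */

    struct dwcas_pair {
        struct ct_proxy_node *node;
        intptr_t              count;
    };

    static _Atomic struct dwcas_pair g_anchor;

    bool dwcas(struct dwcas_pair *expected, struct dwcas_pair desired)
    {
        /* Compares and swaps both contiguous words as one unit. */
        return atomic_compare_exchange_strong(&g_anchor, expected, desired);
    }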




    some of my older code:

    AC_SYS_APIEXPORT
    int AC_CDECL
    np_ac_i686_atomic_dwcas_fence
    ( void*,
      void*,
      const void* );


    np_ac_i686_atomic_dwcas_fence PROC
      ; cdecl; after the two pushes below: [esp+12] = destination,
      ; [esp+16] = expected value (updated on failure), [esp+20] = desired value
      push esi
      push ebx
      mov esi, [esp + 16]
      mov eax, [esi]                   ; EDX:EAX = expected
      mov edx, [esi + 4]
      mov esi, [esp + 20]
      mov ebx, [esi]                   ; ECX:EBX = desired
      mov ecx, [esi + 4]
      mov esi, [esp + 12]
      lock cmpxchg8b qword ptr [esi]   ; if [esi] == EDX:EAX then [esi] = ECX:EBX
      jne np_ac_i686_atomic_dwcas_fence_fail
      xor eax, eax                     ; success: return 0
      pop ebx
      pop esi
      ret

    np_ac_i686_atomic_dwcas_fence_fail:
      mov esi, [esp + 16]
      mov [esi + 0],  eax; failure: write back the value actually observed
      mov [esi + 4],  edx;
      mov eax, 1                       ; return 1
      pop ebx
      pop esi
      ret
    np_ac_i686_atomic_dwcas_fence ENDP


    Even with a single core system you can have pre-emptive multi-
    threading, or at least interrupt routines that may need to cooperate
    with other tasks on data.


    and I don't think that C with just volatile gives you such guarantees. >>>>
    - anton



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Dec 7 15:09:15 2025
    From Newsgroup: comp.arch

    On 12/6/2025 9:22 AM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/5/2025 12:54 PM, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>>>> it only works at the C abstract machine level - it does nothing that >>>>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>>>> but that says nothing about how they might progress through storage >>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>
    You describe in many words and not really to the point what can be >>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>>> atomic operations beyond load and store (even on single-core systems), >>>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

    BOOLEAN DCAS( type oldp, type_t oldq,
    type *p, type_t *q,
    type newp, type newq )
    {
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
    *p = newp;
    esmLOCKstore( *q, newq );
    return TRUE;
    }
    return FALSE;
    }

    Move Element from one place to another:

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    esmLOCKstore( fr->next, tn );
    return TRUE;
    }
    return FALSE;
    }

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    Any issues with live lock in here?

    A bit hard to tell because of 2 things::
    a) I carry around the thread priority and when interference occurs,
    the higher priority thread wins; on ties, the thread already accessing wins. b) live-lock is resolved or not by the caller of these routines, not
    these routines themselves.

    Hummm... Iirc, I was able to cause damage to a strong CAS. It was around
    20 years ago. A thread was running strong CAS in a tight loop. I counted success vs failure. Then allowed some other threads that altered the
    target word with random data. The failure rate for the CAS increased. Actually, I think cmpxchg, cmpxchg8b, cmpxchg16b, and the strange one on Itanium. Cannot remember it right now. cmp8xchg16? Or some shit.

    Well, they would hit a bus lock if they failed too many times. I think
    Scott knows about it.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Dec 7 15:17:04 2025
    From Newsgroup: comp.arch

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>> guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing that >>>>>> affects the hardware.  So volatile writes are ordered at the C level, >>>>>> but that says nothing about how they might progress through storage >>>>>> queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering >>>>> on hardware with weaker memory ordering than sequential consistency". >>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>> guarantees that are as good on multi-core machines as on single-core >>>>> machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems), >>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32   DWs   as a single ATOMIC instruction.
    MM      can MOV   up to 8192 bytes as a single ATOMIC instruction.


    The functions below rely on more than that - to make them work, as far as
    I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting
    until the esmLOCKstore instruction.  Or am I missing something here?

    Lock the BUS? Only when shit hits the fan. What about locking the cache
    line? Actually, I think we can "force" an x86/x64 to lock the bus if we
    do a LOCK'ed RMW on memory that straddles cache lines?

    [...]

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Dec 7 16:08:03 2025
    From Newsgroup: comp.arch

    On 12/6/2025 1:36 PM, Thomas Koenig wrote:
    Scott Lurndal <scott@slp53.sl.home> schrieb:

    Yes, you can add special instructions. However, the compilers will be unlikely
    to generate them, thus applications that desired the generation of such an instruction would need to create a compiler extension (like gcc __builtin functions)
    or inline assembler which would then make the program that uses the capability both compiler
    specific _and_ hardware specific.

    Most extant SMP processors provide a compare and swap operation, which
    are widely supported by the common compilers that support the C and C++
    threading functionality.

    Interestingly, Linux restartable sequences allow for acquisition of
    a lock with no membarrier or atomic instruction on the fast path,
    at the cost of a syscall on the slow path (no free lunch...)

    But you also need assembler to do it.

    An example is, for example, at https://gitlab.ethz.ch/extra_projects/cpu-local-lock


    I need to read more about them, but they kind of remind me of an
    asymmetric mutex, or rwmutex. Ones that use a remote membar on the slow
    path. Iirc, FlushProcessWriteBuffers on windows and iirc,
    synchronize_rcu or membarrier on linux.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sun Dec 7 16:36:59 2025
    From Newsgroup: comp.arch

    On 12/6/2025 10:07 AM, MitchAlsup wrote:

    scott@slp53.sl.home (Scott Lurndal) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>>>> it only works at the C abstract machine level - it does nothing that >>>>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>>>> but that says nothing about how they might progress through storage >>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>
    You describe in many words and not really to the point what can be >>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>>> atomic operations beyond load and store (even on single-core systems), >>>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.

    Compare Double, Swap Double::

    BOOLEAN DCAS( type oldp, type_t oldq,
    type *p, type_t *q,
    type newp, type newq )
    {
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
    *p = newp;
    esmLOCKstore( *q, newq );
    return TRUE;
    }
    return FALSE;
    }

    Move Element from one place to another:

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    esmLOCKstore( fr->next, tn );
    return TRUE;
    }
    return FALSE;
    }

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    In my 40 years of SMP OS/HV work, I don't recall a
    situation where 'MoveElement' would be useful or
    required as a hardware atomic operation.

    The question is not would "MoveElement" be useful, but
    would it be useful to have a single ATOMIC event be
    able to manipulate {5,6,7,8} pointers in one event ??

    Individual atomic "Remove Element" and "Insert/Append Element"[*], yes.
    Combined? Too inflexible.

    BOOLEAN InsertElement( Element *el, Element *to )
    {
    tn = esmLOCKload( to->next );
    esmLOCKprefetch( el );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    el->next = tn;
    el->prev = to;
    to->next = el;
    esmLOCKstore( tn->prev, el );
    return TRUE;
    }
    return FALSE;
    }

    BOOLEAN RemoveElement( Element *fr )
    {
    fn = esmLOCKload( fr->next );
    fp = esmLOCKload( fr->prev );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    fr->prev = NULL;
    esmLOCKstore( fr->next, NULL );
    return TRUE;
    }
    return FALSE;
    }


    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

    Atomic add/sub are useful. The other atomic math operations (min, max, etc) may be useful in certain cases as well.

    Have you ever read about KCSS?

    https://groups.google.com/g/comp.arch/c/shshLdF1uqs

    https://patents.google.com/patent/US7293143
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Mon Dec 8 10:07:25 2025
    From Newsgroup: comp.arch

    On 06/12/2025 18:44, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>>> guarantees for multi-threaded coding on multi-core systems. Basically, >>>>>>> it only works at the C abstract machine level - it does nothing that >>>>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>>>> but that says nothing about how they might progress through storage >>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>
    You describe in many words and not really to the point what can be >>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>>> atomic operations beyond load and store (even on single-core systems), >>>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.


    The functions below rely on more than that - to make them work, as far as
    I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting
    until the esmLOCKstore instruction. Or am I missing something here?

    In the above, I was stating that the maximum width of LD/ST can be a lot bigger than the size of a single register, not that the above instructions make writing ATOMIC events easier.


    That's what I assumed.

    Certainly there are situations where it can be helpful to have longer
    atomic reads and writes. I am not so sure about allowing 8 KB atomic accesses, especially in a system with multiple cores - that sounds like letting user programs DoS everything else on the system.

    These is no bus!

    I think there's a typo or some missing words there?


    The esmLOCKload causes the <translated> address to be 'monitored'
    for interference, and to announce participation in the ATOMIC event.

    The FIRST esmLOCKload tells the core that an ATOMIC event is beginning,
    AND sets up a default control point (This instruction itself) so that
    if interference is detected at esmLOCKstore control is transferred to
    that control point.

    So, there is no way to write Test-and-Set !! you get Test-and-Test-and-Set for free.
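
    For readers who have not met the terms, a rough C11 contrast (illustrative only, not My 66000 code): a plain test-and-set lock hammers the line with atomic exchanges, while test-and-test-and-set spins on an ordinary load and only attempts the atomic operation once the lock looks free.

    #include <stdatomic.h>

    static atomic_int lk;       /* 0 = free, 1 = held */

    void tas_lock(void)         /* plain test-and-set */
    {
        while (atomic_exchange_explicit(&lk, 1, memory_order_acquire))
            ;
    }

    void ttas_lock(void)        /* test-and-test-and-set */
    {
        for (;;) {
            while (atomic_load_explicit(&lk, memory_order_relaxed))
                ;               /* spin read-only until the lock looks free */
            if (!atomic_exchange_explicit(&lk, 1, memory_order_acquire))
                return;
        }
    }

    void lk_unlock(void)
    {
        atomic_store_explicit(&lk, 0, memory_order_release);
    }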

    If I understand you correctly here, you basically have a "load-reserve / store-conditional" sequence as commonly found in RISC architectures, but
    you have the associated loop built into the hardware? I can see that potentially improving efficiency, but I also find it very difficult to
    read or write C code that has hidden loops. And I worry about how it
    would all work if another thread on the same core or a different core
    was running similar code in the middle of these sequences. It also
    reduces the flexibility - in some use-cases, you want to have software
    limits on the number of attempts of a lr/sc loop to detect serious synchronisation problems.
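
    The flexibility point can be made concrete with a sketch (C11, illustrative; the bound is arbitrary): with an explicit loop the software chooses how long to keep retrying before reporting a problem, which is exactly what a hardware-hidden loop takes away.

    #include <stdatomic.h>
    #include <stdbool.h>

    #define MAX_TRIES 64          /* arbitrary bound, for illustration */

    bool bounded_increment(_Atomic long *ctr)
    {
        long old = atomic_load_explicit(ctr, memory_order_relaxed);
        for (int i = 0; i < MAX_TRIES; i++) {
            if (atomic_compare_exchange_weak_explicit(ctr, &old, old + 1,
                    memory_order_acq_rel, memory_order_relaxed))
                return true;      /* succeeded within the budget */
        }
        return false;             /* caller can back off or flag a livelock */
    }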


    There is a branch-on-interference instruction that
    a) does what it says,
    b) sets up an alternate atomic control point.

    It is not easy to have atomic or lock mechanisms on multi-core systems
    that are convenient to use, efficient even in the worst cases, and don't
    require additional hardware.

    I am using the "Miss Buffer" as the point of monitoring for interference.
    a) it already has to monitor "other hits" from outside accesses to deal
    with the coherence mechanism.
    b) that esm additions to Miss Buffer are on the order of 2%

    c) there are other means to strengthen guarantees of forward progress.


    Compare Double, Swap Double::

    BOOLEAN DCAS( type oldp, type_t oldq,
    type *p, type_t *q,
    type newp, type newq )
    {
    type t = esmLOCKload( *p );
    type r = esmLOCKload( *q );
    if( t == oldp && r == oldq )
    {
    *p = newp;
    esmLOCKstore( *q, newq );
    return TRUE;
    }
    return FALSE;
    }

    Move Element from one place to another:

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
    Element *fn = esmLOCKload( fr->next );
    Element *fp = esmLOCKload( fr->prev );
    Element *tn = esmLOCKload( to->next );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    esmLOCKprefetch( tn );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    to->next = fr;
    tn->prev = fr;
    fr->prev = to;
    esmLOCKstore( fr->next, tn );
    return TRUE;
    }
    return FALSE;
    }

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    Of course. It is interesting to speculate about possible features of an
    architecture like yours, but it is not likely to be available to anyone
    else in practice (unless perhaps it can be implemented as an extension
    for RISC-V).

    Even with a >>>> single core system you can have pre-emptive multi-threading, or at least >>>> interrupt routines that may need to cooperate with other tasks on data. >>>>

    and I don't think that C with just volatile gives you such guarantees. >>>>>>
    - anton



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Mon Dec 8 10:12:19 2025
    From Newsgroup: comp.arch

    On 08/12/2025 00:17, Chris M. Thomasson wrote:
    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide
    enough
    guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing that >>>>>>> affects the hardware.  So volatile writes are ordered at the C >>>>>>> level,
    but that says nothing about how they might progress through storage >>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>
    You describe in many words and not really to the point what can be >>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>>> atomic operations beyond load and store (even on single-core
    systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32   DWs   as a single ATOMIC instruction.
    MM      can MOV   up to 8192 bytes as a single ATOMIC instruction. >>>

    The functions below rely on more than that - to make them work, as far
    as I can see, you need the first "esmLOCKload" to lock the bus and
    also lock the core from any kind of interrupt or other pre-emption,
    lasting until the esmLOCKstore instruction.  Or am I missing something
    here?

    Lock the BUS? Only when shit hits the fan. What about locking the cache line? Actually, I think we can "force" an x86/x64 to lock the bus if we
    do a LOCK'ed RMW on memory that straddles cache lines?


    Yes, I meant "lock the bus" - but I might have been overcautious.
    However, it seems there is a hidden hardware loop here - the
    esmLOCKstore instruction can fail and the processor jumps back to
    the first esmLOCKload instruction. With that, you don't need to block
    other code from running or accessing the bus.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Robert Finch@robfi680@gmail.com to comp.arch on Mon Dec 8 07:25:42 2025
    From Newsgroup: comp.arch

    <snip>
    BOOLEAN RemoveElement( Element *fr )
    {
    fn = esmLOCKload( fr->next );
    fp = esmLOCKload( fr->prev );
    esmLOCKprefetch( fn );
    esmLOCKprefetch( fp );
    if( !esmINTERFERENCE() )
    {
    fp->next = fn;
    fn->prev = fp;
    fr->prev = NULL;
    esmLOCKstore( fr->next, NULL );
    return TRUE;
    }
    return FALSE;
    }


    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

    Yes, you can add special instructions. However, the compilers will be unlikely
    to generate them, thus applications that desired the generation of such an instruction would need to create a compiler extension (like gcc __builtin functions)
    or inline assembler which would then make the program that uses the capability both compiler
    specific _and_ hardware specific.

    So, in other words, if you can't put it in every ISA known to man,
    don't bother making something better than existent ?!?

    Most extant SMP processors provide a compare and swap operation, which
    are widely supported by the common compilers that support the C and C++
    threading functionality.

    I am having trouble understanding how the block of code in the esmINTERFERENCE() block is protected so that the whole thing executes as
    a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including across buffers and bus bridges. It would have to go to the memory coherence
    point. Otherwise, some other device using a bridge could update the same address range in the middle of an update.

    I am assuming the esmLockStore() just unlocks what was previously locked
    and the stores have already happened by that time.

    It would seem that esmINTERFERENCE() would indicate that everybody with
    access out to the coherence point has agreed to the locked area? Does
    that require that all devices respect the esmINTERFERENCE()?


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Mon Dec 8 04:32:39 2025
    From Newsgroup: comp.arch

    On 12/8/2025 1:12 AM, David Brown wrote:
    On 08/12/2025 00:17, Chris M. Thomasson wrote:
    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide >>>>>>>> enough
    guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing >>>>>>>> that
    affects the hardware.  So volatile writes are ordered at the C >>>>>>>> level,
    but that says nothing about how they might progress through storage >>>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>>
    You describe in many words and not really to the point what can be >>>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>>> on hardware with weaker memory ordering than sequential
    consistency".
    If hardware guaranteed sequential consistency, volatile would
    provide
    guarantees that are as good on multi-core machines as on single-core >>>>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>>>> atomic operations beyond load and store (even on single-core
    systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes >>>>> bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32   DWs   as a single ATOMIC instruction. >>>> MM      can MOV   up to 8192 bytes as a single ATOMIC instruction. >>>>

    The functions below rely on more than that - to make them work, as far
    as I can see, you need the first "esmLOCKload" to lock the bus and
    also lock the core from any kind of interrupt or other pre-emption,
    lasting until the esmLOCKstore instruction.  Or am I missing
    something here?

    Lock the BUS? Only when shit hits the fan. What about locking the
    cache line? Actually, I think we can "force" an x86/x64 to lock the
    bus if we do a LOCK'ed RMW on memory that straddles cache lines?


    Yes, I meant "lock the bus" - but I might have been overcautious.
    However, it seems there is a hidden hardware loop here - the
    esmLOCKstore instruction can fail and the processor jumps back to
    the first esmLOCKload instruction.  With that, you don't need to block other code from running or accessing the bus.



    Humm.. For some damn reason it reminds me of a multi lock thing I did a
    while back. Called it the multex. Consisted of a table of locks. A
    thread would take the addresses it wanted to lock, hash then into the
    table, remove duplicates and sorted them and took them all without any
    fear of deadlock.

    (read all when you get some free time to burn...) https://groups.google.com/g/comp.lang.c++/c/sV4WC_cBb9Q/m/SkSqpSxGCAAJ

    It kind of seems like it might want to work with Mitch's scheme in a
    loose sense?
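
    A rough reconstruction of that idea (details guessed, names invented): hash each address to a slot in a fixed mutex table, deduplicate, sort the slot indices, and acquire them in ascending order, so no two threads ever wait on each other in opposite orders.

    #include <pthread.h>
    #include <stdlib.h>
    #include <stdint.h>

    #define MULTEX_SLOTS 64

    /* GCC/Clang range initializer; otherwise initialize in a startup routine. */
    static pthread_mutex_t multex[MULTEX_SLOTS] = {
        [0 ... MULTEX_SLOTS - 1] = PTHREAD_MUTEX_INITIALIZER
    };

    static int cmp_size(const void *a, const void *b)
    {
        size_t x = *(const size_t *)a, y = *(const size_t *)b;
        return (x > y) - (x < y);
    }

    /* Lock the slots covering n addresses; slots[] must have room for n entries.
       Returns the number of distinct slots taken (pass it to multex_unlock). */
    size_t multex_lock(void *const addrs[], size_t n, size_t slots[])
    {
        size_t m = 0;
        for (size_t i = 0; i < n; i++)
            slots[i] = ((uintptr_t)addrs[i] >> 6) % MULTEX_SLOTS;  /* crude hash */
        qsort(slots, n, sizeof slots[0], cmp_size);
        for (size_t i = 0; i < n; i++)                 /* dedupe after sorting */
            if (m == 0 || slots[i] != slots[m - 1])
                slots[m++] = slots[i];
        for (size_t i = 0; i < m; i++)                 /* fixed global order */
            pthread_mutex_lock(&multex[slots[i]]);
        return m;
    }

    void multex_unlock(const size_t slots[], size_t m)
    {
        while (m--)
            pthread_mutex_unlock(&multex[slots[m]]);
    }
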
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Mon Dec 8 08:23:59 2025
    From Newsgroup: comp.arch

    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>
    BOOLEAN RemoveElement( Element *fr )
    {
         fn = esmLOCKload( fr->next );
         fp = esmLOCKload( fr->prev );
         esmLOCKprefetch( fn );
         esmLOCKprefetch( fp );
         if( !esmINTERFERENCE() )
         {
                       fp->next = fn;
                       fn->prev = fp;
                       fr->prev = NULL;
         esmLOCKstore( fr->next,  NULL );
                       return TRUE;
         }
         return FALSE;
    }


    [*] For which atomic compare-and-swap or atomic swap is generally
    sufficient.

    Yes, you can add special instructions.   However, the compilers will
    be unlikely
    to generate them, thus applications that desired the generation of
    such an
    instruction would need to create a compiler extension (like gcc
    __builtin functions)
    or inline assembler which would then make the program that uses the
    capability both compiler
    specific _and_ hardware specific.

    So, in other words, if you can't put it in every ISA known to man,
    don't bother making something better than existent ?!?

    Most extant SMP processors provide a compare and swap operation, which
    are widely supported by the common compilers that support the C and C++
    threading functionality.

    I am having trouble understanding how the block of code in the esmINTERFERENCE() block is protected so that the whole thing executes as
    a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including across buffers and bus bridges. It would have to go to the memory coherence
    point. Otherwise, some other device using a bridge could update the same address range in the middle of an update.

    I may be wrong about this, but I think you have a misconception. The
    ESM doesn't *prevent* interference, but it *detects* interference. Thus nothing is required of other cores, no locks, etc. If they write to a "protected" location, the write is allowed, but the core in the ESM is notified, so it can redo the ESM-protected code.


    I am assuming the esmLockStore() just unlocks what was previously locked
    and the stores have already happened by that time.

    There is no "locking" in the sense of preventing any accesses.
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Dec 8 17:14:11 2025
    From Newsgroup: comp.arch

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

    I am having trouble understanding how the block of code in the
    esmINTERFERENCE() block is protected so that the whole thing executes as
    a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including across
    buffers and bus bridges. It would have to go to the memory coherence
    point. Otherwise, some other device using a bridge could update the same
    address range in the middle of an update.

    >I may be wrong about this, but I think you have a misconception.  The
    >ESM doesn't *prevent* interference, but it *detects* interference.  Thus
    >nothing is required of other cores, no locks, etc.  If they write to a
    >"protected" location, the write is allowed, but the core in the ESM is
    >notified, so it can redo the ESM protected code.

    Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
    (the basis of the Store-Exclusive/Load-Exclusive instructions, which
    mirror the LL/SC paradigm). The ARMv8 monitors an implementation defined
    range surrounding the target address and the store will fail if any other
    agent has modified any byte within the exclusive range.
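    (Loosely, that monitor can be pictured as one reservation per core,
    tagged with a granule-aligned address; the sketch below is a toy model
    for illustration only, not ARM's implementation, and the granule size
    and field names are made up:)

        #include <stdint.h>
        #include <stdbool.h>

        #define GRANULE 64u   /* implementation defined; often cache-line sized */

        struct exclusive_monitor {
            uintptr_t tag;    /* granule-aligned address of the Load-Exclusive  */
            bool      open;   /* cleared by any conflicting write, or by STXR   */
        };

        static void monitor_open(struct exclusive_monitor *m, uintptr_t addr)
        {
            m->tag  = addr & ~(uintptr_t)(GRANULE - 1);
            m->open = true;
        }

        /* Store-Exclusive writes only when this returns true, i.e. when no
           other agent touched the granule since the Load-Exclusive.        */
        static bool monitor_check_and_close(struct exclusive_monitor *m, uintptr_t addr)
        {
            bool ok = m->open && m->tag == (addr & ~(uintptr_t)(GRANULE - 1));
            m->open = false;
            return ok;
        }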

    esmINTERFERENCE seems to require multiple of these exclusive blocks
    to cover non-contiguous address ranges, which on first blush leads
    me to worry both about deadlock situations and starvation issues.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 8 20:06:34 2025
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>> guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing that >>>>>> affects the hardware.  So volatile writes are ordered at the C level, >>>>>> but that says nothing about how they might progress through storage >>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>
    You describe in many words and not really to the point what can be >>>>> explained concisely as: "volatile says nothing about memory ordering >>>>> on hardware with weaker memory ordering than sequential consistency". >>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>> guarantees that are as good on multi-core machines as on single-core >>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>> atomic operations beyond load and store (even on single-core systems), >>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32   DWs   as a single ATOMIC instruction.
    MM      can MOV   up to 8192 bytes as a single ATOMIC instruction. >>

    The functions below rely on more than that - to make the work, as far as
    I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting until the esmLOCKstore instruction.  Or am I missing something here?

    Lock the BUS? Only when shit hits the fan. What about locking the cache line? Actually, I think we can "force" an x86/x64 to lock the bus if we
    do a LOCK'ed RMW on memory that straddles cache lines?

    In the My 66000 case, Mem References can lock up to 8 cache lines.

    [...]

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From scott@scott@slp53.sl.home (Scott Lurndal) to comp.arch on Mon Dec 8 20:15:13 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >> >>>>>> guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing that >> >>>>>> affects the hardware.  So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage >> >>>>>> queues, caches, inter-processor communication buses, or whatever.

    You describe in many words and not really to the point what can be
    explained concisely as: "volatile says nothing about memory ordering >> >>>>> on hardware with weaker memory ordering than sequential consistency". >> >>>>> If hardware guaranteed sequential consistency, volatile would provide >> >>>>> guarantees that are as good on multi-core machines as on single-core >> >>>>> machines.

    However, for concurrent manipulations of data structures, one wants
    atomic operations beyond load and store (even on single-core systems), >> >>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes
    bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32   DWs   as a single ATOMIC instruction.
    MM      can MOV   up to 8192 bytes as a single ATOMIC instruction. >> >>

    The functions below rely on more than that - to make the work, as far as >> > I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting
    until the esmLOCKstore instruction.  Or am I missing something here?

    Lock the BUS? Only when shit hits the fan. What about locking the cache
    line? Actually, I think we can "force" an x86/x64 to lock the bus if we
    do a LOCK'ed RMW on memory that straddles cache lines?

    In the My 66000 case, Mem References can lock up to 8 cache lines.

    What if two processors have intersecting (but not fully overlapping)
    sets of those 8 cache lines?

    Can you guarantee forward progress?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 8 20:20:27 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 06/12/2025 18:44, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>>> guarantees for multi-threaded coding on multi-core systems. Basically,
    it only works at the C abstract machine level - it does nothing that >>>>>>> affects the hardware. So volatile writes are ordered at the C level, >>>>>>> but that says nothing about how they might progress through storage >>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>
    You describe in many words and not really to the point what can be >>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>>> atomic operations beyond load and store (even on single-core systems), >>>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes >>>> bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32 DWs as a single ATOMIC instruction.
    MM can MOV up to 8192 bytes as a single ATOMIC instruction.


    The functions below rely on more than that - to make the work, as far as >> I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting
    until the esmLOCKstore instruction. Or am I missing something here?

    In the above, I was stating that the maximum width of LD/ST can be a lot bigger than the size of a single register, not that the above instructions make writing ATOMIC events easier.


    That's what I assumed.

    Certainly there are situations where it can be helpful to have longer
    atomic reads and writes. I am not so sure about allowing 8 KB atomic accesses, especially in a system with multiple cores - that sounds like letting user programs DoS everything else on the system.

    These is no bus!

    I think there's a typo or some missing words there?

    There is a fabric based interconnect to transport data-transfer requests
    around the system, where everyone connected to the transport can send
    a new request, receive a response, and receive a SNOOP simultaneously.

    There is NO single point on the fabric one can GRAB and prevent other
    sections of the fabric from "doing their prescribed transport duties".

    There is a memory ordering protocol in L3/DRAM-controller that prevents
    more than one "SNOOP per cache line" from being "in progress" at the
    same time.


    The esmLOCKload causes the <translated> address to be 'monitored'
    for interference, and to announce participation in the ATOMIC event.

    The FIRST esmLOCKload tells the core that an ATOMIC event is beginning,
    AND sets up a default control point (This instruction itself) so that
    if interference is detected at esmLOCKstore control is transferred to
    that control point.

    So, there is no way to write Test-and-Set !! you get Test-and-Test-and-Set for free.

    If I understand you correctly here, you basically have a "load-reserve / store-conditional" sequence as commonly found in RISC architectures, but
    you have the associated loop built into the hardware?

    In effect, yes. I have a multi-{LoadLocked StoreConditional} scheme
    as found in other RISC architectures with several small/big changes::
    a) you get up to 8 LLs
    b) the last SC causes the rest of the system to see all the memory
    changes at the same time (or nobody sees any changes).
    c) The ATOMIC sequence cannot persist across an exception or interrupt.
    d) only participating memory lines have the ATOMIC property.

    And yes, control transfer is built-into the architecture.
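    (So, in the style of the examples in this thread, a try-lock needs no
    visible loop at all; a sketch only, assuming the BOOLEAN/TRUE/FALSE
    and esm intrinsics used elsewhere in the thread, with lock_t and the
    convention "0 = free, 1 = held" invented here:)

        typedef unsigned long lock_t;

        /* Returns TRUE when the lock was free and we claimed it.  If another
           core writes the line between the load and the store, the hardware
           transfers control back to the esmLOCKload and the code re-runs.  */
        BOOLEAN lock_try_acquire( lock_t *lock )
        {
            lock_t t = esmLOCKload( *lock );    /* begins the event, sets retry point */
            if( t != 0 )
                return FALSE;                   /* held: the free "test" before the set */
            esmLOCKstore( *lock, 1 );           /* terminal store publishes the claim   */
            return TRUE;
        }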

    I can see that potentially improving efficiency, but I also find it very difficult to
    read or write C code that has hidden loops. And I worry about how it
    would all work if another thread on the same core or a different core
    was running similar code in the middle of these sequences. It also
    reduces the flexibility - in some use-cases, you want to have software limits on the number of attempts of a lr/sc loop to detect serious synchronisation problems.

    In this case, said SW would use the Branch-on-interference instruction.


    There is a branch-on-interference instruction that
    a) does what it says,
    b) sets up an alternate atomic control point.

    It is not easy to have atomic or lock mechanisms on multi-core systems
    that are convenient to use, efficient even in the worst cases, and don't >> require additional hardware.

    I am using the "Miss Buffer" as the point of monitoring for interference.
    a) it already has to monitor "other hits" from outside accesses to deal
       with the coherence mechanism.
    b) the esm additions to the Miss Buffer are on the order of 2%

    c) there are other means to strengthen guarantees of forward progress.


    Compare Double, Swap Double::

    BOOLEAN DCAS( type_t oldp, type_t oldq,
                  type_t *p,   type_t *q,
                  type_t newp, type_t newq )
    {
        type_t t = esmLOCKload( *p );
        type_t r = esmLOCKload( *q );
        if( t == oldp && r == oldq )
        {
            *p = newp;
            esmLOCKstore( *q, newq );
            return TRUE;
        }
        return FALSE;
    }

    Move Element from one place to another:

    BOOLEAN MoveElement( Element *fr, Element *to )
    {
        Element *fn = esmLOCKload( fr->next );
        Element *fp = esmLOCKload( fr->prev );
        Element *tn = esmLOCKload( to->next );
        esmLOCKprefetch( fn );
        esmLOCKprefetch( fp );
        esmLOCKprefetch( tn );
        if( !esmINTERFERENCE() )
        {
            fp->next = fn;
            fn->prev = fp;
            to->next = fr;
            tn->prev = fr;
            fr->prev = to;
            esmLOCKstore( fr->next, tn );
            return TRUE;
        }
        return FALSE;
    }
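    (Usage sketch: since the return value already reports interference, a
    caller that wants a software-visible retry bound can simply wrap the
    call; the limit and the fallback are the application's choice and the
    names here are invented:)

        #define MOVE_RETRY_LIMIT 100

        BOOLEAN MoveElementBounded( Element *fr, Element *to )
        {
            for( int tries = 0; tries < MOVE_RETRY_LIMIT; tries++ )
            {
                if( MoveElement( fr, to ) )
                    return TRUE;
            }
            return FALSE;    /* caller escalates: take a lock, log a metric, ... */
        }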

    So, I guess, you are not talking about what My 66000 cannot do, but
    only what other ISAs cannot do.

    Of course. It is interesting to speculate about possible features of an >> architecture like yours, but it is not likely to be available to anyone
    else in practice (unless perhaps it can be implemented as an extension
    for RISC-V).

    Even with a >>>> single core system you can have pre-emptive multi-threading, or at least >>>> interrupt routines that may need to cooperate with other tasks on data. >>>>

    and I don't think that C with just volatile gives you such guarantees. >>>>>>
    - anton



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 8 20:30:34 2025
    From Newsgroup: comp.arch


    Robert Finch <robfi680@gmail.com> posted:

    <snip>
    BOOLEAN RemoveElement( Element *fr )
    {
        Element *fn = esmLOCKload( fr->next );
        Element *fp = esmLOCKload( fr->prev );
        esmLOCKprefetch( fn );
        esmLOCKprefetch( fp );
        if( !esmINTERFERENCE() )
        {
            fp->next = fn;
            fn->prev = fp;
            fr->prev = NULL;
            esmLOCKstore( fr->next, NULL );
            return TRUE;
        }
        return FALSE;
    }


    [*] For which atomic compare-and-swap or atomic swap is generally sufficient.

    Yes, you can add special instructions. However, the compilers will be unlikely
    to generate them, thus applications that desired the generation of such an >> instruction would need to create a compiler extension (like gcc __builtin functions)
    or inline assembler which would then make the program that uses the capability both compiler
    specific _and_ hardware specific.

    So, in other words, if you can't put it in every ISA known to man,
    don't bother making something better than existent ?!?

    Most extant SMP processors provide a compare and swap operation, which
    are widely supported by the common compilers that support the C and C++
    threading functionality.

    I am having trouble understanding how the block of code in the esmINTERFERENCE() block is protected so that the whole thing executes as
    a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including across buffers and bus bridges. It would have to go to the memory coherence
    point. Otherwise, some other device using a bridge could update the same address range in the middle of an update.

    esmLOCKload sets up monitors (in Miss Buffers) that detect SNOOPs to
    the participating cache lines.

    esmINTERFERENCE sets up a block of code that either executes in its
    entirety or fails in its entirety--and transfers control.

    In "certain circumstances" the code inside the esmINTERFERENCE block
    are allowed to NaK SNOOPs to those lines. So, if interference happens
    this late, you can effectively tell requestor "Yes, I have that cache
    line, No you cannot have it right now".

    If requestor gets a NaK, and requestor was attempting an ATOMIC event,
    the event fails. If requestor was NOT attempting, requestor resubmits
    the request. In both cases, the thread causing the interference is the
    one delayed while the one performing the event has higher probability
    of success.

    I am assuming the esmLockStore() just unlocks what was previously locked
    and the stores have already happened by that time.

    Yes, it is the terminal sentinel.

    It would seem that esmINTERFERENCE() would indicate that everybody with access out to the coherence point has agreed to the locked area? Does
    that require that all devices respect the esmINTERFERENCE()?

    I can see you are getting at something subtle, here. I cannot quite grasp
    what it might be.

    Can you ask the above again but use different words ?!?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 8 20:35:01 2025
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

    I am having trouble understanding how the block of code in the
    esmINTERFERENCE() block is protected so that the whole thing executes as >> a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including across >> buffers and bus bridges. It would have to go to the memory coherence
    point. Otherwise, some other device using a bridge could update the same >> address range in the middle of an update.

    I may be wrong about this, but I think you have a misconception. The
    ESM doesn't *prevent* interference, but it *detect* interference. Thus >nothing is required of other cores, no locks, etc. If they write to a >"protected" location, the write is allowed, but the core in the ESM is >notified, so it can redo the ESM protected code.

    Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
    (the basis of the Store-Exclusive/Load-Exclusive instructions, which
    mirror the LL/SC paradigm). The ARMv8 monitors an implementation defined range surrounding the target address and the store will fail if any other agent has modified any byte within the exclusive range.

    esmINTERFERENCE seems to require multiple of these exclusive blocks
    to cover non-contiguous address ranges, which on first blush leads
    me to worry both about deadlock situations and starvation issues.

    Over in the Miss Buffer there are (at least) 8 miss buffers. Each miss
    buffer has to monitor inbound messages for requests (SNOOPs) to its
    entry.

    So, each MB entry has a bit to tell if it is participating in an event.
    esmINTERFERENCE is a way to sample all participating MB entries
    simultaneously; and in addition, esmINTERFERENCE is part of what enables
    the NaKing of SNOOP requests.
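    (A toy C rendering of that bookkeeping, purely to mirror the wording
    above; the entry count of 8 is from the description, everything else
    is invented for illustration:)

        #include <stdbool.h>
        #include <stdint.h>

        #define MB_ENTRIES 8

        struct mb_entry {
            uintptr_t line;          /* cache line this entry is tracking        */
            bool      participating; /* set by esmLOCKload / esmLOCKprefetch     */
            bool      snooped;       /* set when an inbound SNOOP hits this line */
        };

        /* esmINTERFERENCE conceptually samples all participating entries at once. */
        static bool interference_seen(const struct mb_entry mb[MB_ENTRIES])
        {
            for (int i = 0; i < MB_ENTRIES; i++)
                if (mb[i].participating && mb[i].snooped)
                    return true;
            return false;
        }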
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Mon Dec 8 21:58:00 2025
    From Newsgroup: comp.arch


    scott@slp53.sl.home (Scott Lurndal) posted:

    ERROR "unexpected byte sequence starting at index 736: '\xC2'" while decoding:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough
    guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing that
    affects the hardware.  So volatile writes are ordered at the C level,
    but that says nothing about how they might progress through storage >> >>>>>> queues, caches, inter-processor communication buses, or whatever. >> >>>>>
    You describe in many words and not really to the point what can be >> >>>>> explained concisely as: "volatile says nothing about memory ordering >> >>>>> on hardware with weaker memory ordering than sequential consistency".
    If hardware guaranteed sequential consistency, volatile would provide
    guarantees that are as good on multi-core machines as on single-core >> >>>>> machines.

    However, for concurrent manipulations of data structures, one wants >> >>>>> atomic operations beyond load and store (even on single-core systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes >> >>> bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32   DWs   as a single ATOMIC instruction.
    MM      can MOV   up to 8192 bytes as a single ATOMIC instruction.


    The functions below rely on more than that - to make the work, as far as
    I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting >> > until the esmLOCKstore instruction.  Or am I missing something here? >>
    Lock the BUS? Only when shit hits the fan. What about locking the cache >> line? Actually, I think we can "force" an x86/x64 to lock the bus if we >> do a LOCK'ed RMW on memory that straddles cache lines?

    In the My 66000 case, Mem References can lock up to 8 cache lines.

    What if two processors have intersecting (but not fully overlapping)
    sets of those 8 cache lines?

    Can you guarantee forward progress?

    Yes.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Mon Dec 8 16:31:08 2025
    From Newsgroup: comp.arch

    On 12/8/2025 9:14 AM, Scott Lurndal wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

    I am having trouble understanding how the block of code in the
    esmINTERFERENCE() block is protected so that the whole thing executes as >>> a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including across >>> buffers and bus bridges. It would have to go to the memory coherence
    point. Otherwise, some other device using a bridge could update the same >>> address range in the middle of an update.

    I may be wrong about this, but I think you have a misconception. The
    ESM doesn't *prevent* interference, but it *detect* interference. Thus
    nothing is required of other cores, no locks, etc. If they write to a
    "protected" location, the write is allowed, but the core in the ESM is
    notified, so it can redo the ESM protected code.

    Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
    (the basis of the Store-Exclusive/Load-Exclusive instructions, which
    mirror the LL/SC paradigm). The ARMv8 monitors an implementation defined range surrounding the target address and the store will fail if any other agent has modified any byte within the exclusive range.

    Any mutation to the reservation granule?




    esmINTERFERENCE seems to require multiple of these exclusive blocks
    to cover non-contiguous address ranges, which on first blush leads
    me to worry both about deadlock situations and starvation issues.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Tue Dec 9 09:13:54 2025
    From Newsgroup: comp.arch

    On 08/12/2025 17:23, Stephen Fuld wrote:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

    I am having trouble understanding how the block of code in the
    esmINTERFERENCE() block is protected so that the whole thing executes
    as a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including
    across buffers and bus bridges. It would have to go to the memory
    coherence point. Otherwise, some other device using a bridge could
    update the same address range in the middle of an update.

    I may be wrong about this, but I think you have a misconception.  The
    ESM doesn't *prevent* interference, but it *detect* interference.  Thus nothing is required of other cores, no locks, etc.  If they write to a "protected" location, the write is allowed, but the core in the ESM is notified, so it can redo the ESM protected code.


    Yes, that is correct (as far as I understand it now). The critical part
    is the hidden hardware loop that was not mentioned or indicated in the original code.

    There are basically two ways to handle atomic operations. One way is to
    use locking mechanisms to ensure that nothing (other cores, interrupts
    or other pre-emption on the same core) can break up the sequence. The
    other way is to have a mechanism to detect conflicts and a failure of
    the atomic operation, so that you can try again (or otherwise handle the situation). (You can, of course, combine these - such as by disabling
    local interrupts and detecting conflicts from other cores.)
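    (The two approaches, for something as small as a shared counter, come
    out roughly as below; a minimal C11/pthreads sketch, with the mutex
    standing in for "prevent" and the CAS loop for "detect and retry":)

        #include <pthread.h>
        #include <stdatomic.h>

        static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
        static long            counter_locked;
        static _Atomic long    counter_lockfree;

        /* 1) Locking: nothing else may touch the data during the update. */
        void add_locked(long delta)
        {
            pthread_mutex_lock(&m);
            counter_locked += delta;
            pthread_mutex_unlock(&m);
        }

        /* 2) Detect and retry: the update is attempted, and redone if some
              other thread changed the value in the meantime.              */
        void add_lockfree(long delta)
        {
            long old = atomic_load_explicit(&counter_lockfree, memory_order_relaxed);
            while (!atomic_compare_exchange_weak_explicit(
                       &counter_lockfree, &old, old + delta,
                       memory_order_relaxed, memory_order_relaxed))
                ;   /* 'old' was refreshed by the failed exchange; just retry */
        }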

    The code Mitch posted apparently had neither of these mechanisms, hence
    my confusion. It turns out that it /does/ have conflict detection and a hardware retry loop, all hidden from anyone trying to understand the
    code. (I can appreciate that there may be benefits in doing this in
    hardware, but there are no benefits in hiding it from the programmer!)


    I am assuming the esmLockStore() just unlocks what was previously
    locked and the stores have already happened by that time.

    There is no "locking" in the sense of preventing any accesses.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Dec 9 19:15:48 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 08/12/2025 17:23, Stephen Fuld wrote:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

    I am having trouble understanding how the block of code in the
    esmINTERFERENCE() block is protected so that the whole thing executes
    as a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including
    across buffers and bus bridges. It would have to go to the memory
    coherence point. Otherwise, some other device using a bridge could
    update the same address range in the middle of an update.

    ---------------------------------
    I may be wrong about this, but I think you have a misconception.  The
    ESM doesn't *prevent* interference, but it *detect* interference.  Thus nothing is required of other cores, no locks, etc.  If they write to a "protected" location, the write is allowed, but the core in the ESM is notified, so it can redo the ESM protected code.


    Yes, that is correct (as far as I understand it now). The critical part
    is the hidden hardware loop that was not mentioned or indicated in the original code.
    ---------------------------------

    Mostly esm detects interference but there are times when esm is allowed
    to ignore interference.

    Consider a server-scale esm implementation. In such an implementation, esm
    is enhanced with a system* arbiter.

    After any successful ATOMIC event esm reverts to "Optimistic" mode. In optimistic mode, esm races through the code as fast as possible expecting
    no interference. When interference is detected, the event fails and a HW counter is incremented. The failure diverts control to the ATOMIC control point. We still have the property that all participating memory locations become visible at the same instant.

    At this point the core is in "careful" mode, core becomes sequentially consistent, SW chooses to re-run the event. Here, cache misses leave
    core in program order,... When interference is detected, the event fails
    and that HW counter is incremented. Failure diverts control to the ATOMIC control point, no participating memory is seen to have been modified.
    If core can determine that all writes to participating memory can be
    performed (at the first participating store) core is allowed to NaK
    lower priority interfering accesses.

    At this point the core is in "Slow and Methodological" mode. Now, after
    all participating cache lines have been touched, all the physical pointers
    are bundled into a message and sent to the system arbiter. System arbiter examines each cache line address and if no-other-core has a reservation
    on ANY of them, then system arbiter installs said reservations, and
    returns "success". At this point, core is allowed to NaK interfering
    accesses. This event WILL SUCCEED. After the event is complete, the
    termination of the event at the core, takes the same bundle of addresses
    and sends it back to system arbiter; who removes them from reservation.

    Optimistic mode takes no more cycles than if the memory references were
    not ATOMIC.

    I should also note:: none of this state is preserved across interrupts
    or exceptions. So, an interrupt or exception causes the event to fail
    prior to control transfer. Interrupts do not care about this control
    transfer. Exception control transfer in My 66000 packs everything the
    exception handler needs in registers, so having IP point at ATOMIC
    control point with the registers setup for page fault does not cause
    exception handler any issues whatsoever.

    There are basically two ways to handle atomic operations. One way is to
    use locking mechanisms to ensure that nothing (other cores, interrupts
    or other pre-emption on the same core) can break up the sequence. The
    other way is to have a mechanism to detect conflicts and a failure of
    the atomic operation, so that you can try again (or otherwise handle the situation). (You can, of course, combine these - such as by disabling
    local interrupts and detecting conflicts from other cores.)

    The code Mitch posted apparently had neither of these mechanisms, hence
    my confusion. It turns out that it /does/ have conflict detection and a hardware retry loop, all hidden from anyone trying to understand the
    code. (I can appreciate that there may be benefits in doing this in hardware, but there are no benefits in hiding it from the programmer!)

    How exactly do you inform the programmer that:

    InBound [Address]
    OutBound [Address]

    operates like::

    try_again:
    InBound [Address]
    BIN try_again
    OutBound [Address]

    And why clutter up asm with extraneous labels and require extra
    instructions.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Tue Dec 9 20:51:26 2025
    From Newsgroup: comp.arch

    On 09/12/2025 20:15, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


    There are basically two ways to handle atomic operations. One way is to
    use locking mechanisms to ensure that nothing (other cores, interrupts
    or other pre-emption on the same core) can break up the sequence. The
    other way is to have a mechanism to detect conflicts and a failure of
    the atomic operation, so that you can try again (or otherwise handle the
    situation). (You can, of course, combine these - such as by disabling
    local interrupts and detecting conflicts from other cores.)

    The code Mitch posted apparently had neither of these mechanisms, hence
    my confusion. It turns out that it /does/ have conflict detection and a
    hardware retry loop, all hidden from anyone trying to understand the
    code. (I can appreciate that there may be benefits in doing this in
    hardware, but there are no benefits in hiding it from the programmer!)

    How exactly do you inform the programmer that:

    InBound [Address]
    OutBound [Address]

    operates like::

    try_again:
    InBound [Address]
    BIN try_again
    OutBound [Address]

    And why clutter up asm with extraneous labels and require extra instructions.

    The most obvious answer is that in any code that uses these features,
    good comments are essential so that readers can see what is happening.

    Another method would be to use better names for the intrinsics, as seen
    at the C (or other HLL) level. (Assembly instruction names don't matter nearly as much.)

    So maybe instead of "esmLOCKload()" and "esmLOCKstore()" you have "load_and_set_retry_point()" and "store_or_retry()". Feel free to think
    of better names, but that would at least give the reader a clue that
    there's something odd going on.
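    (E.g., nothing more than thin aliases would already flag the retry
    semantics at the call site; these spellings are hypothetical, taken
    from the suggestion above rather than from the My 66000 documents:)

        #define load_and_set_retry_point( x )   esmLOCKload( x )
        #define store_or_retry( x, v )          esmLOCKstore( (x), (v) )
        #define touch_for_event( x )            esmLOCKprefetch( x )
        #define event_was_interfered()          esmINTERFERENCE()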


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Dec 9 21:28:47 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 09/12/2025 20:15, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


    There are basically two ways to handle atomic operations. One way is to >> use locking mechanisms to ensure that nothing (other cores, interrupts
    or other pre-emption on the same core) can break up the sequence. The
    other way is to have a mechanism to detect conflicts and a failure of
    the atomic operation, so that you can try again (or otherwise handle the >> situation). (You can, of course, combine these - such as by disabling
    local interrupts and detecting conflicts from other cores.)

    The code Mitch posted apparently had neither of these mechanisms, hence
    my confusion. It turns out that it /does/ have conflict detection and a >> hardware retry loop, all hidden from anyone trying to understand the
    code. (I can appreciate that there may be benefits in doing this in
    hardware, but there are no benefits in hiding it from the programmer!)

    How exactly do you inform the programmer that:

    InBound [Address]
    OutBound [Address]

    operates like::

    try_again:
    InBound [Address]
    BIN try_again
    OutBound [Address]

    And why clutter up asm with extraneous labels and require extra instructions.

    The most obvious answer is that in any code that uses these features,
    good comments are essential so that readers can see what is happening.

    Another method would be to use better names for the intrinsics, as seen
    at the C (or other HLL) level. (Assembly instruction names don't matter nearly as much.)

    So maybe instead of "esmLOCKload()" and "esmLOCKstore()" you have "load_and_set_retry_point()" and "store_or_retry()". Feel free to think
    of better names, but that would at least give the reader a clue that
    there's something odd going on.

    This is a useful suggestion; thanks.

    On the other hand, there are some non-vonNeumann actions lurking within
    esm. Where vonNeumann means: that every instruction is executed in its
    entirety before the next instruction appears to start executing.

    1st:: one cannot single step through an ATOMIC event; if you enter an
    ATOMIC event in single-step mode, you will see the 1st instruction in
    the event, then you will receive control after the terminal instruction
    has executed.

    2nd:: the only way to debug an event is to have a buffer of SW locations
    that gets written with non-participating STs. Unlike participating
    memory lines, these locations will be written--but not in a sequentially
    consistent manner (architecturally), and can be examined outside the
    event; whereas the participating lines are either all written
    instantaneously or not modified at all.

    So, here we have non-participating STs having been written and older participating STs have not.
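    (So a debugging harness might look roughly like this: the trace buffer
    is touched only with ordinary stores, so its contents survive a failed
    or retried event even though the participating lines roll back; the
    buffer and its layout are invented here, and the rest follows the
    RemoveElement example posted earlier:)

        #define TRACE_SLOTS 64

        static unsigned trace_idx;                   /* non-participating */
        static Element *trace_buf[TRACE_SLOTS];      /* non-participating */

        BOOLEAN RemoveElementTraced( Element *fr )
        {
            Element *fn = esmLOCKload( fr->next );
            Element *fp = esmLOCKload( fr->prev );
            esmLOCKprefetch( fn );
            esmLOCKprefetch( fp );

            /* Ordinary store: visible even if the event fails; note it may
               appear more than once if the hardware retries the event.    */
            trace_buf[trace_idx++ % TRACE_SLOTS] = fr;

            if( !esmINTERFERENCE() )
            {
                fp->next = fn;
                fn->prev = fp;
                fr->prev = NULL;
                esmLOCKstore( fr->next, NULL );
                return TRUE;
            }
            return FALSE;
        }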

    3rd:: control transfer not under SW control--more like exceptions and interrupts than Br-condition--except that the target of control transfer
    is based on the code in the event.

    4th:: one cannot test esm with a random code generator, since the
    probability that the random code generator creates a legal esm event is
    exceedingly low.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Tue Dec 9 13:55:12 2025
    From Newsgroup: comp.arch

    On 12/9/2025 11:15 AM, MitchAlsup wrote:

    snip


    Mostly esm detects interference but there are times when esm is allowed
    to ignore interference.

    Consider a sever scale esm implementation. In such an implementation, esm
    is enhanced with a system* arbiter.

    After any successful ATOMIC event esm reverts to "Optimistic" mode. In optimistic mode, esm races through the code as fast as possible expecting
    no interference. When interference is detected, the event fails and a HW counter is incremented. The failure diverts control to the ATOMIC control point. We still have the property that all participating memory locations become visible at the same instant.>
    At this point the core is in "careful" mode,

    I am missing some understanding here, about this "counter". This
    paragraph seems to indicate that after one failure, the core goes into "careful" mode, but if that were true, you wouldn't need a "counter",
    just a mode bit. So assuming it is a counter and you need "n" failures
    in a row to go into careful mode, is "n" hardwired or settable by
    software? What are the tradeoffs for smaller or larger values of "n"?

    core becomes sequentially
    consistent, SW chooses to re-run the event. Here, cache misses leave
    core in program order,... When interference is detected, the event fails
    and that HW counter is incremented. Failure diverts control to the ATOMIC control point, no participating memory is seen to have been modified.
    If core can determine that all writes to participating memory can be performed (at the first participating store) core is allowed to NaK
    lower priority interfering accesses.

    Again, after a single failure in careful mode or n failures? If n, is
    it the same value of n as for the transition from optimistic to careful
    mode? Same questions as before about who sets the value and is it
    software changeable?
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Tue Dec 9 22:52:31 2025
    From Newsgroup: comp.arch


    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> posted:

    On 12/9/2025 11:15 AM, MitchAlsup wrote:

    snip


    Mostly esm detects interference but there are times when esm is allowed
    to ignore interference.

    Consider a sever scale esm implementation. In such an implementation, esm is enhanced with a system* arbiter.

    After any successful ATOMIC event esm reverts to "Optimistic" mode. In optimistic mode, esm races through the code as fast as possible expecting no interference. When interference is detected, the event fails and a HW counter is incremented. The failure diverts control to the ATOMIC control point. We still have the property that all participating memory locations become visible at the same instant.>
    At this point the core is in "careful" mode,

    I am missing some understanding here, about this "counter". This
    paragraph seems to indicate that after one failure, the core goes into "careful" mode, but if that were true, you wouldn't need a "counter",
    just a mode bit. So assuming it is a counter and you need "n" failures
    in a row to go into careful mode, is "n" hardwired or settable by
    software? What are the tradeoffs for smaller or larger values of "n"?

    2-bits; 3-states--not part of saved thread state.

    core becomes sequentially
    consistent, SW chooses to re-run the event. Here, cache misses leave
    core in program order,... When interference is detected, the event fails and that HW counter is incremented. Failure diverts control to the ATOMIC control point, no participating memory is seen to have been modified.
    If core can determine that all writes to participating memory can be performed (at the first participating store) core is allowed to NaK
    lower priority interfering accesses.

    Again, after a single failure in careful mode or n failures? If n, is
    it the same value of n as for the transition from optimistic to careful mode? Same questions as before about who sets the value and is it
    software changeable?

    3-state counter::

    00 -> Optimistic
    01 -> Careful
    10 -> Slow and methodological

    success -> counter = 00;
    failure -> counter++;
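    (Or, as a literal C rendering of that table, with the names invented:)

        enum esm_mode { OPTIMISTIC = 0, CAREFUL = 1, SLOW_METHODICAL = 2 };

        /* Two bits, not part of saved thread state: reset on success,
           escalate on failure; the third mode is guaranteed to succeed. */
        static enum esm_mode after_event(enum esm_mode mode, int succeeded)
        {
            if (succeeded)
                return OPTIMISTIC;                         /* counter = 00 */
            return mode < SLOW_METHODICAL
                 ? (enum esm_mode)(mode + 1)               /* counter++    */
                 : SLOW_METHODICAL;
        }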
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Wed Dec 10 10:07:19 2025
    From Newsgroup: comp.arch

    On 09/12/2025 22:28, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 09/12/2025 20:15, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


    There are basically two ways to handle atomic operations. One way is to >>>> use locking mechanisms to ensure that nothing (other cores, interrupts >>>> or other pre-emption on the same core) can break up the sequence. The >>>> other way is to have a mechanism to detect conflicts and a failure of
    the atomic operation, so that you can try again (or otherwise handle the >>>> situation). (You can, of course, combine these - such as by disabling >>>> local interrupts and detecting conflicts from other cores.)

    The code Mitch posted apparently had neither of these mechanisms, hence >>>> my confusion. It turns out that it /does/ have conflict detection and a >>>> hardware retry loop, all hidden from anyone trying to understand the
    code. (I can appreciate that there may be benefits in doing this in
    hardware, but there are no benefits in hiding it from the programmer!)

    How exactly do you inform the programmer that:

    InBound [Address]
    OutBound [Address]

    operates like::

    try_again:
    InBound [Address]
    BIN try_again
    OutBound [Address]

    And why clutter up asm with extraneous labels and require extra instructions.

    The most obvious answer is that in any code that uses these features,
    good comments are essential so that readers can see what is happening.

    Another method would be to use better names for the intrinsics, as seen
    at the C (or other HLL) level. (Assembly instruction names don't matter
    nearly as much.)

    So maybe instead of "esmLOCKload()" and "esmLOCKstore()" you have
    "load_and_set_retry_point()" and "store_or_retry()". Feel free to think
    of better names, but that would at least give the reader a clue that
    there's something odd going on.

    This is a useful suggestion; thanks.

    I can certainly say they would help /me/ understand the code, so maybe
    they would help other people understand it too.


    On the other hand, there are some non-vonNeumann actions lurking within
    esm. Where vonNeumann means: that every instruction is executed in its entirety before the next instruction appears to start executing.


    That's a rather different use of the term "vonNeumann" from anything I
    have heard. I'd just talk about "indivisible" instructions (avoiding "atomic", because that usually refers to a wider view of the system).
    And are we thinking about the instructions purely from the viewpoint of
    the cpu executing them?

    IME, most instructions on most processors are indivisible, but most
    processors have some instructions that are not. For example, processors
    can have load/store multiple instructions that are interruptable - in
    some cases, after returning from the interrupt (and any associated
    thread context switches) the instructions are restarted, in other cases
    they are continued.

    But most instructions /appear/ to be executed entirely before the next instruction /appears/ to start executing. Fast processors have a lot of hardware designed to keep up this appearance - register renaming,
    pipelining, speculative execution, dependency tracking, and all the rest
    of it.

    1st:: one cannot single step through an ATMOIC event, if you enter an
    ATOMIC event in single-step mode, you will see the 1st instruction in
    the event, than you will receive control after the terminal instruction
    has executed.


    That is presumably a choice you made for the debugging features of the
    device.

    2nd::the only way to debug an event is to have a buffer of SW locations
    that gets written with non-participating STs. Unlike participating
    memory lines, these locations will be written--but not in a sequentially consistent manner (architecturally), and can be examined outside the
    event; whereas the participating lines are either all written instan- taneously or not modified at all.

    So, here we have non-participating STs having been written and older participating STs have not.

    3rd:: control transfer not under SW control--more like exceptions and interrupts than Br-condition--except that the target of control transfer
    is based on the code in the event.


    OK. I can see the advantages of that - though there are disadvantages
    too (such as being unable to control a limit on the number of retries,
    or add SW tracking of retry counts for metrics). My main concern was
    the disconnect between how the code was written and what it actually does.

    4th:: one cannot test esm with a random code generator, since the probability that the random code generator creates a legal esm event is exceedingly low.


    Testing and debugging any kind of locking or atomic access solution is
    always very difficult. You can rarely try out conflicts or potential
    race conditions in the lab - they only ever turn up at customer demos!

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Stephen Fuld@sfuld@alumni.cmu.edu.invalid to comp.arch on Wed Dec 10 08:51:16 2025
    From Newsgroup: comp.arch

    On 12/10/2025 1:07 AM, David Brown wrote:
    On 09/12/2025 22:28, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 09/12/2025 20:15, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


    There are basically two ways to handle atomic operations.  One way >>>>> is to
    use locking mechanisms to ensure that nothing (other cores, interrupts >>>>> or other pre-emption on the same core) can break up the sequence.  The >>>>> other way is to have a mechanism to detect conflicts and a failure of >>>>> the atomic operation, so that you can try again (or otherwise
    handle the
    situation).  (You can, of course, combine these - such as by disabling >>>>> local interrupts and detecting conflicts from other cores.)

    The code Mitch posted apparently had neither of these mechanisms,
    hence
    my confusion.  It turns out that it /does/ have conflict detection >>>>> and a
    hardware retry loop, all hidden from anyone trying to understand the >>>>> code.  (I can appreciate that there may be benefits in doing this in >>>>> hardware, but there are no benefits in hiding it from the programmer!) >>>>
    How exactly do you inform the programmer that:

             InBound   [Address]
             OutBound  [Address]

    operates like::

    try_again:
             InBound   [Address]
             BIN       try_again
             OutBound  [Address]

    And why clutter up asm with extraneous labels and require extra
    instructions.

    The most obvious answer is that in any code that uses these features,
    good comments are essential so that readers can see what is happening.

    Another method would be to use better names for the intrinsics, as seen
    at the C (or other HLL) level.  (Assembly instruction names don't matter >>> nearly as much.)

    So maybe instead of "esmLOCKload()" and "esmLOCKstore()" you have
    "load_and_set_retry_point()" and "store_or_retry()".  Feel free to think >>> of better names, but that would at least give the reader a clue that
    there's something odd going on.

    This is a useful suggestion; thanks.

    I can certainly say they would help /me/ understand the code, so maybe
    they would help other people understand it too.


    On the other hand, there are some non-vonNeumann actions lurking within
    esm. Where vonNeumann means: that every instruction is executed in its
    entirety before the next instruction appears to start executing.


    That's a rather different use of the term "vonNeumann" from anything I
    have heard.  I'd just talk about "indivisible" instructions (avoiding "atomic", because that usually refers to a wider view of the system).
    And are we thinking about the instructions purely from the viewpoint of
    the cpu executing them?

    IME, most instructions on most processors are indivisible, but most processors have some instructions that are not.  For example, processors can have load/store multiple instructions that are interruptable - in
    some cases, after returning from the interrupt (and any associated
    thread context switches) the instructions are restarted, in other cases
    they are continued.

    But most instructions /appear/ to be executed entirely before the next instruction /appears/ to start executing.  Fast processors have a lot of hardware designed to keep up this appearance - register renaming, pipelining, speculative execution, dependency tracking, and all the rest
    of it.

    1st:: one cannot single step through an ATMOIC event, if you enter an
    ATOMIC event in single-step mode, you will see the 1st instruction in
    the event, than you will receive control after the terminal instruction
    has executed.


    That is presumably a choice you made for the debugging features of the device.

    2nd::the only way to debug an event is to have a buffer of SW locations
    that gets written with non-participating STs. Unlike participating
    memory lines, these locations will be written--but not in a sequentially
    consistent manner (architecturally), and can be examined outside the
    event; whereas the participating lines are either all written instan-
    taneously or not modified at all.

    So, here we have non-participating STs having been written and older
    participating STs have not.

    3rd:: control transfer not under SW control--more like exceptions and
    interrupts than Br-condition--except that the target of control transfer
    is based on the code in the event.


    OK.  I can see the advantages of that - though there are disadvantages
    too (such as being unable to control a limit on the number of retries,

    Yes, but. ISTM there is a hardware limit on the number of retries - it
    is two retries, as the third try (second retry) is guaranteed to
    succeed, albeit at a higher cost (in time and interference with other threads/processes) compared to the earlier tries.


    or add SW tracking of retry counts for metrics).

    Again, ISTM that you could do some software tracking by using
    non-participating stores within the locked area to save information
    outside the locked area.  I haven't thought through the cost benefit of
    this, how much to save, etc.

    But I am not sure that the "escalation" to a more "intrusive" mechanism
    upon a single failure is optimal. Perhaps it would be better to retry
    once or twice using the current mechanism. I don't have a good feeling
    for what is optimal here, and to what extent the optimal choice would be workload dependent.


    My main concern was
    the disconnect between how the code was written and what it actually does.

    4th:: one cannot test esm with a random code generator, since the
    probability
    that the random code generator creates a legal esm event is
    exceedingly low.


    Testing and debugging any kind of locking or atomic access solution is always very difficult.

    Yup!


    You can rarely try out conflicts or potential
    race conditions in the lab - they only ever turn up at customer demos!

    :-)
    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Wed Dec 10 20:10:43 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 09/12/2025 22:28, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 09/12/2025 20:15, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


    There are basically two ways to handle atomic operations. One way is to >>>> use locking mechanisms to ensure that nothing (other cores, interrupts >>>> or other pre-emption on the same core) can break up the sequence. The >>>> other way is to have a mechanism to detect conflicts and a failure of >>>> the atomic operation, so that you can try again (or otherwise handle the >>>> situation). (You can, of course, combine these - such as by disabling >>>> local interrupts and detecting conflicts from other cores.)

    The code Mitch posted apparently had neither of these mechanisms, hence >>>> my confusion. It turns out that it /does/ have conflict detection and a >>>> hardware retry loop, all hidden from anyone trying to understand the >>>> code. (I can appreciate that there may be benefits in doing this in >>>> hardware, but there are no benefits in hiding it from the programmer!) >>>
    How exactly do you inform the programmer that:

    InBound [Address]
    OutBound [Address]

    operates like::

    try_again:
    InBound [Address]
    BIN try_again
    OutBound [Address]

    And why clutter up asm with extraneous labels and require extra instructions.

    The most obvious answer is that in any code that uses these features,
    good comments are essential so that readers can see what is happening.

    Another method would be to use better names for the intrinsics, as seen
    at the C (or other HLL) level. (Assembly instruction names don't matter >> nearly as much.)

    So maybe instead of "esmLOCKload()" and "esmLOCKstore()" you have
    "load_and_set_retry_point()" and "store_or_retry()". Feel free to think >> of better names, but that would at least give the reader a clue that
    there's something odd going on.

    This is a useful suggestion; thanks.

    I can certainly say they would help /me/ understand the code, so maybe
    they would help other people understand it too.


    On the other hand, there are some non-vonNeumann actions lurking within esm. Where vonNeumann means: that every instruction is executed in its entirety before the next instruction appears to start executing.


    That's a rather different use of the term "vonNeumann" from anything I
    have heard. I'd just talk about "indivisible" instructions (avoiding "atomic", because that usually refers to a wider view of the system).
    And are we thinking about the instructions purely from the viewpoint of
    the cpu executing them?

    An ATOMIC event is a series of instructions that appear to be performed
    all at once--as if the whole series was "indivisible".

    IME, most instructions on most processors are indivisible, but most processors have some instructions that are not. For example, processors
    can have load/store multiple instructions that are interruptable - in
    some cases, after returning from the interrupt (and any associated
    thread context switches) the instructions are restarted, in other cases
    they are continued.

    Go in the other direction, where a series of instructions HAS TO APPEAR
    as if executed instantaneously.

    But most instructions /appear/ to be executed entirely before the next instruction /appears/ to start executing. Fast processors have a lot of hardware designed to keep up this appearance - register renaming, pipelining, speculative execution, dependency tracking, and all the rest
    of it.

    None of those things is ARCHITECTURAL--esm is an architectural window into
    how to program ATOMIC events such that no future generation of the ISA has
    to continuously add more synchronization instructions. One can program
    every known industrial and academic synchronization primitive in esm
    without ever adding new synchronization instructions.

    1st:: one cannot single step through an ATOMIC event; if you enter an ATOMIC event in single-step mode, you will see the 1st instruction in
    the event, then you will receive control after the terminal instruction
    has executed.


    That is presumably a choice you made for the debugging features of the device.

    No, it is the nature of executing a series of instructions as if instantaneously.

    2nd:: the only way to debug an event is to have a buffer of SW locations that gets written with non-participating STs. Unlike participating
    memory lines, these locations will be written--but not in a sequentially consistent manner (architecturally), and can be examined outside the
    event; whereas the participating lines are either all written instantaneously or not modified at all.

    So, here we have non-participating STs having been written and older participating STs have not.

    3rd:: control transfer not under SW control--more like exceptions and interrupts than Br-condition--except that the target of control transfer
    is based on the code in the event.


    OK. I can see the advantages of that - though there are disadvantages
    too (such as being unable to control a limit on the number of retries,
    or add SW tracking of retry counts for metrics).
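
    (For contrast, with a conventional CAS loop the retry loop lives in SW, so a
    retry limit and a retry counter for metrics are trivial to add. A portable
    C11 illustration, nothing My 66000 specific:)

        #include <stdatomic.h>
        #include <stdint.h>

        _Atomic uint64_t retry_total;          /* metrics: global retry count */

        /* Increment *ctr, giving up after max_tries failed CAS attempts. */
        int bounded_inc(_Atomic uint64_t *ctr, int max_tries)
        {
            uint64_t old = atomic_load_explicit(ctr, memory_order_relaxed);
            for (int tries = 0; tries < max_tries; tries++) {
                if (atomic_compare_exchange_weak_explicit(ctr, &old, old + 1,
                        memory_order_acq_rel, memory_order_relaxed))
                    return 0;                  /* success */
                atomic_fetch_add_explicit(&retry_total, 1, memory_order_relaxed);
            }
            return -1;                         /* retry limit hit */
        }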

    esm attempts to allow SW to program with features previously available
    only at the µCode level. µCode allows for many µinstructions to execute before/between any real instructions.

    My main concern was
    the disconnect between how the code was written and what it actually does.

    There is a 26 page specification the programmer needs to read and understand. This includes things we have not talked about--such as::
    a) terminating an event without writing anything
    b) proactively minimizing future interference
    c) modifications to cache coherence model
    at the architectural level.

    The architectural specification allows for various scales of µArchitecture
    to independently choose how to implement esm and provide the architectural features at SW level. For example the kinds of esm activities for a 1-wide In-Order µController are vastly different from those suitable for a server-scale rack of processor ensembles. What we want is one SW model that covers
    the whole gamut.

    4th:: one cannot test esm with a random code generator, since the probability
    that the random code generator creates a legal esm event is exceedingly low.


    Testing and debugging any kind of locking or atomic access solution is always very difficult. You can rarely try out conflicts or potential
    race conditions in the lab - they only ever turn up at customer demos!

    Right at Christmas time !! {Ask me how I know}.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Thu Dec 11 10:05:34 2025
    From Newsgroup: comp.arch

    On 10/12/2025 21:10, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


    OK. I can see the advantages of that - though there are disadvantages
    too (such as being unable to control a limit on the number of retries,
    or add SW tracking of retry counts for metrics).

    esm attempts to allow SW to program with features previously available
    only at the µCode level. µCode allows for many µinstructions to execute before/between any real instructions.

    My main concern was
    the disconnect between how the code was written and what it actually does.


    Perhaps it would be better to think of these sequences in assembler
    rather than C - you want tighter control than C normally allows, and you
    don't want optimisers re-arranging things too much.

    There is a 26 page specification the programmer needs to read and understand. This includes things we have not talked about--such as::
    a) terminating an event without writing anything
    b) proactively minimizing future interference
    c) modifications to cache coherence model
    at the architectural level.

    Fair enough. This is not a minor or simple feature!


    The architectural specification allows for various scales of µArchitecture to independently choose how to implement esm and provide the architectural features at SW level. For example the kinds of esm activities for a 1-wide In-Order µController are vastly different that those suitable for a server scale rack of processor ensembles. What we want is one SW model that covers the whole gamut.

    4th:: one cannot test esm with a random code generator, since the probability
    that the random code generator creates a legal esm event is exceedingly low.


    Testing and debugging any kind of locking or atomic access solution is
    always very difficult. You can rarely try out conflicts or potential
    race conditions in the lab - they only ever turn up at customer demos!

    Right at Christmas time !! {Ask me how I know}.

    We can gather round the fire, and Grampa can settle in his rocking chair
    to tell us war stories from the olden days :-)

    A good story is always nice, so go for it!

    (We once had a system where there was a bug that not only triggered only
    at the customer's site, but did so only on the 30th of September. It
    took years before we made the connection to the date and found the bug.)


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Thu Dec 11 20:26:09 2025
    From Newsgroup: comp.arch


    David Brown <david.brown@hesbynett.no> posted:

    On 10/12/2025 21:10, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


    OK. I can see the advantages of that - though there are disadvantages
    too (such as being unable to control a limit on the number of retries,
    or add SW tracking of retry counts for metrics).

    esm attempts to allow SW to program with features previously available
    only at the µCode level. µCode allows for many µinstructions to execute before/between any real instructions.

    My main concern was
    the disconnect between how the code was written and what it actually does.


    Perhaps it would be better to think of these sequences in assembler
    rather than C - you want tighter control than C normally allows, and you don't want optimisers re-arranging things too much.

    Heck, there are assemblers that rearrange code like this too much--
    until they can be taught not to.

    There is a 26 page specification the programmer needs to read and understand.
    This includes things we have not talked about--such as::
    a) terminating an event without writing anything
    b) proactively minimizing future interference
    c) modifications to cache coherence model
    at the architectural level.

    Fair enough. This is not a minor or simple feature!

    No, it is a design that allows the ISA to remain static while all sorts of synchronization stuff gets written, tested, and tuned.


    The architectural specification allows for various scales of µArchitecture to independently choose how to implement esm and provide the architectural features at SW level. For example the kinds of esm activities for a 1-wide In-Order µController are vastly different that those suitable for a server scale rack of processor ensembles. What we want is one SW model that covers the whole gamut.

    4th:: one cannot test esm with a random code generator, since the probability
    that the random code generator creates a legal esm event is exceedingly low.


    Testing and debugging any kind of locking or atomic access solution is
    always very difficult. You can rarely try out conflicts or potential
    race conditions in the lab - they only ever turn up at customer demos!

    Right at Christmas time !! {Ask me how I know}.

    We can gather round the fire, and Grampa can settle in his rocking chair
    to tell us war stories from the olden days :-)

    A good story is always nice, so go for it!

    Year:: 1997, time 7 days before Christmas:: situation, Customer is
    having (and has had) strange bugs that happen about once a week.
    Customer is unhappy; we have had a senior engineer on site for
    4 months without forward progress. We were told "You don't come home
    until the problem is fixed".

    System:: 2 (or more) of our cache coherent motherboards, connected
    with a proven cache coherent bus.

    On the flight from Austin to Manchester, England, I decided that what
    we had was a physics experiment. So, when we arrived, I had their SW
    guy code up a routine that, as soon as it got a time slice, would
    signal that it no longer needed time, while we hooked up the logic analyzer
    to our motherboards and to their bus. When the SW was ready (about 30 minutes)
    we tried the case--instantly, the time between occurrences of the bug
    went from once a week to milliseconds. We spent the afternoon taking
    logic analyzer traces, and went to dinner.

    The next day, we went through the traces with a fine-tooth comb and
    found a smoking gun--so we ran more experiments, and the same smoking
    gun was found in each trace. After a couple of hours, we found that
    their proven coherent bus was allowing 1 single cycle where our bus
    could be seen in an inconsistent state, and it was only a dozen
    cycles downstream that the crash was transpiring.

    It turned out that their bus was only coherent when the attached bus
    took longer than 4 cycles to respond to a "random coherent message", whereas
    our bus was timed at 2 cycles for this response.

    So, we took apart their FPGA, which ran the bus, figured out how to
    delay one signal, and reprogrammed it--ONLY to run into another message
    that was off by 1 or 2 cycles. This one took a whole day to find and
    program around.

    We both made it home for Christmas, and in some part saved the company...

    (We once had a system where there was a bug that not only triggered only
    at the customer's site, but did so only on the 30th of September. It
    took years before we made the connection to the date and found the bug.)


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Thu Dec 11 20:47:12 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Heck, there are assemblers that rearrange code like this too much--
    until they can be taught not to.

    Any example? This would definitely go against what I would consider
    to be reasonable for an assembler. gdb certainly does not do so.

    What _would_ be useful on occasion would be an assembler which
    could do register assignment, for example for a small function.
    It would be OK if this were to issue an error if there were too
    many variables for assignment.

    Does anybody know of such a beast?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Michael S@already5chosen@yahoo.com to comp.arch on Thu Dec 11 23:51:26 2025
    From Newsgroup: comp.arch

    On Thu, 11 Dec 2025 20:26:09 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    We both made it home for Christmas, and in some part saved the
    company...


    Not for long, though... Wasn't it dead anyway within 6-7 months?


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Dec 11 15:00:53 2025
    From Newsgroup: comp.arch

    On 12/10/2025 1:07 AM, David Brown wrote:
    [...]
    Testing and debugging any kind of locking or atomic access solution is always very difficult.  You can rarely try out conflicts or potential
    race conditions in the lab - they only ever turn up at customer demos!

    Murphy's Law. Actually, have you ever messed around with Relacy Race
    Detector? It's pretty interesting.


    [...]

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Dec 11 15:02:40 2025
    From Newsgroup: comp.arch

    On 12/11/2025 1:05 AM, David Brown wrote:
    On 10/12/2025 21:10, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


    OK.  I can see the advantages of that - though there are disadvantages
    too (such as being unable to control a limit on the number of retries,
    or add SW tracking of retry counts for metrics).

    esm attempts to allow SW to program with features previously available
    only at the µCode level. µCode allows for many µinstructions to execute >> before/between any real instructions.

                                                      My main concern was
    the disconnect between how the code was written and what it actually
    does.


    Perhaps it would be better to think of these sequences in assembler
    rather than C - you want tighter control than C normally allows, and you don't want optimisers re-arranging things too much.

    Right. Way back before C/C++ 11 I would code all of my sensitive lock/wait-free code in assembly.

    [...]



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Dec 11 15:03:29 2025
    From Newsgroup: comp.arch

    On 12/11/2025 3:02 PM, Chris M. Thomasson wrote:
    On 12/11/2025 1:05 AM, David Brown wrote:
    On 10/12/2025 21:10, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:


    OK.  I can see the advantages of that - though there are disadvantages >>>> too (such as being unable to control a limit on the number of retries, >>>> or add SW tracking of retry counts for metrics).

    esm attempts to allow SW to program with features previously available
    only at the µCode level. µCode allows for many µinstructions to execute >>> before/between any real instructions.

                                                      My main concern was
    the disconnect between how the code was written and what it actually
    does.


    Perhaps it would be better to think of these sequences in assembler
    rather than C - you want tighter control than C normally allows, and
    you don't want optimisers re-arranging things too much.

    Right. Way back before C/C++ 11 I would code all of my sensitive lock/ wait-free code in assembly.

    [...]




    Actually, I would turn off link-time optimization back then.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Fri Dec 12 01:41:41 2025
    From Newsgroup: comp.arch

    According to Thomas Koenig <tkoenig@netcologne.de>:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Heck, there are assemblers that rearrange code like this too much--
    until they can be taught not to.

    Any example? This would definitely go against what I would consider
    to be reasonable for an assembler. gdb certainly does not do so.

    On machines with delayed branches I've seen assemblers that move
    instructions into the delay slot. Can't think of any others off hand.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Thu Dec 11 18:27:48 2025
    From Newsgroup: comp.arch

    On 12/11/2025 5:41 PM, John Levine wrote:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Heck, there are assemblers that rearrange code like this too much--
    until they can be taught not to.

    Any example? This would definitely go against what I would consider
    to be reasonable for an assembler. gdb certainly does not do so.

    On machines with delayed branches I've seen assemblers that move
    instructions into the delay slot. Can't think of any others off hand.


    That would suck! Back when I used to code in SPARC assembly language, I
    had full control over my delay slots. Actually, IIRC, putting a MEMBAR instruction in a delay slot is VERY bad.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From John Levine@johnl@taugh.com to comp.arch on Fri Dec 12 02:48:19 2025
    From Newsgroup: comp.arch

    According to Chris M. Thomasson <chris.m.thomasson.1@gmail.com>:
    On 12/11/2025 5:41 PM, John Levine wrote:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Heck, there are assemblers that rearrange code like this too much--
    until they can be taught not to.

    Any example? This would definitely go against what I would consider
    to be reasonable for an assembler. gdb certainly does not do so.

    On machines with delayed branches I've seen assemblers that move
    instructions into the delay slot. Can't think of any others off hand.

    That would suck! Back when I used to code in SPARC assembly language, I
    had full control over my delay slots. Actually, IIRC, putting a MEMBAR >instruction in a delay slot is VERY bad.

    I think they were smart enough only to move instructions that wouldn't cause problems.
    --
    Regards,
    John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
    Please consider the environment before reading this e-mail. https://jl.ly
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Dec 12 08:59:12 2025
    From Newsgroup: comp.arch

    On 11/12/2025 22:51, Michael S wrote:
    On Thu, 11 Dec 2025 20:26:09 GMT
    MitchAlsup <user5857@newsgrouper.org.invalid> wrote:


    We both made it home for Christmas, and in some part saved the
    company...


    Not for long so... Was not it dead anyway in the 6-7 months?


    This is why stories end with "they all lived happily ever after", and
    why sequel movies are almost always terrible! I liked the first story
    better.



    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Dec 12 08:14:47 2025
    From Newsgroup: comp.arch

    John Levine <johnl@taugh.com> schrieb:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Heck, there are assemblers that rearrange code like this too much--
    until they can be taught not to.

    Any example? This would definitely go against what I would consider
    to be reasonable for an assembler. gdb certainly does not do so.

    On machines with delayed branches I've seen assemblers that move
    instructions into the delay slot. Can't think of any others off hand.

    Thinking of it a bit more, the optimizing assemblers for drum memory
    computers like the IBM 650 or the LGP-30 of Mel the Programmer
    fame moved around instructions so the next one would be under the
    head when the previous one was done executing.

    Random-access memory made this redundant :-)
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Fri Dec 12 13:05:43 2025
    From Newsgroup: comp.arch

    In article <10hfrsl$145v$1@gal.iecc.com>, John Levine <johnl@taugh.com> wrote: >According to Thomas Koenig <tkoenig@netcologne.de>:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Heck, there are assemblers that rearrange code like this too much--
    until they can be taught not to.

    Any example? This would definitely go against what I would consider
    to be reasonable for an assembler. gdb certainly does not do so.

    On machines with delayed branches I've seen assemblers that move
    instructions into the delay slot. Can't think of any others off hand.

    I've seen things like this, as well, particularly on machines
    with multiple delay slots, where this detail was hidden from the
    programmer. Or at least I have a vague memory of this; perhaps
    I'm hallucinating.

    More dangerous are linkers that do LTO and decide to elide code
    that, no, really, I actually need for reasons that are not
    apparent to the toolchain.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Dec 12 15:28:30 2025
    From Newsgroup: comp.arch

    On 12/12/2025 14:05, Dan Cross wrote:
    In article <10hfrsl$145v$1@gal.iecc.com>, John Levine <johnl@taugh.com> wrote:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Heck, there are assemblers that rearrange code like this too much--
    until they can be taught not to.

    Any example? This would definitely go against what I would consider
    to be reasonable for an assembler. gdb certainly does not do so.

    On machines with delayed branches I've seen assemblers that move
    instructions into the delay slot. Can't think of any others off hand.

    I've seen things like this, as well, particularly on machines
    with multiple delay slots, where this detail was hidden from the
    programmer. Or at least I have a vague memory of this; perhaps
    I'm hallucinating.


    I've seen a few assemblers that do fancy things with jumps and branches
    - giving you generic conditional branch pseudo-instructions that get
    turned into different types of real instructions depending on the
    distance needed for the jumps and the ranges supported by the
    instructions. And there are plenty that have pseudo-instructions for
    loading immediates into registers that generate whatever sequence of
    load immediate, shift-and-or, etc., are needed.


    More dangerous are linkers that do LTO and decide to elide code
    that, no, really, I actually need for reasons that are not
    apparent to the toolchain.


    IME you have control over the details - either using directives in the assembly, or in the linker control files. Of course that might mean
    modifying code that you hoped to use untouched, and it's not hard to
    forget to add a "keep" or "retain" directive.

    I've found link-time dead code elimination quite useful when I have one
    code base but different binary builds - sometimes all you need is a
    different linker file.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From cross@cross@spitfire.i.gajendra.net (Dan Cross) to comp.arch on Fri Dec 12 16:25:42 2025
    From Newsgroup: comp.arch

    In article <10hh8qe$2v9lm$1@dont-email.me>,
    David Brown <david.brown@hesbynett.no> wrote:
    On 12/12/2025 14:05, Dan Cross wrote:
    In article <10hfrsl$145v$1@gal.iecc.com>, John Levine <johnl@taugh.com> wrote:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Heck, there are assemblers that rearrange code like this too much--
    until they can be taught not to.

    Any example? This would definitely go against what I would consider
    to be reasonable for an assembler. gdb certainly does not do so.

    On machines with delayed branches I've seen assemblers that move
    instructions into the delay slot. Can't think of any others off hand.

    I've seen things like this, as well, particularly on machines
    with multiple delay slots, where this detail was hidden from the
    programmer. Or at least I have a vague memory of this; perhaps
    I'm hallucinating.


    I've seen a few assemblers that do fancy things with jumps and branches
    - giving you generic conditional branch pseudo-instructions that get
    turned into different types of real instructions depending on the
    distance needed for the jumps and the ranges supported by the
    instructions. And there are plenty that have pseudo-instructions for >loading immediates into registers that generate whatever sequence of
    load immediate, shift-and-or, etc., are needed.


    More dangerous are linkers that do LTO and decide to elide code
    that, no, really, I actually need for reasons that are not
    apparent to the toolchain.


    IME you have control over the details - either using directives in the >assembly, or in the linker control files. Of course that might mean >modifying code that you hoped to use untouched, and it's not hard to
    forget to add a "keep" or "retain" directive.

    Provided, of course, that you have access to both the assembly
    and the linker configuration for a given program. Sometimes you
    don't (e.g., if the code in question is in some higher-level
    language) or the linker configuration is just some default.

    For example, the Plan 9 C compiler delegated actual instruction
    selection to the linker; the compiler emitted a high(er)-level
    representation of the operation. This made the linker free to
    perform peephole optimization, potentially eliding important
    instructions (like writes to MMIO regions). Fortunately, the
    Plan 9 authors understood this so effectively all globals were
    volatile, but when porting that code to standard C, one had to
    exercise some care.

    I've found link-time dead code elimination quite useful when I have one
    code base but different binary builds - sometimes all you need is a >different linker file.

    Agreed, it _is_ useful. But sometimes it's inappropriate.

    - Dan C.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Dec 12 19:17:16 2025
    From Newsgroup: comp.arch


    John Levine <johnl@taugh.com> posted:

    According to Chris M. Thomasson <chris.m.thomasson.1@gmail.com>:
    On 12/11/2025 5:41 PM, John Levine wrote:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Heck, there are assemblers that rearrange code like this too much--
    until they can be taught not to.

    Any example? This would definitely go against what I would consider
    to be reasonable for an assembler. gdb certainly does not do so.

    On machines with delayed branches I've seen assemblers that move
    instructions into the delay slot. Can't think of any others off hand.

    That would suck! Back when I used to code in SPARC assembly language, I >had full control over my delay slots. Actually, IIRC, putting a MEMBAR >instruction in a delay slot is VERY bad.

    I think they were smart enough only to move instructions that wouldn't cause problems.

    Many early RISC assemblers were in charge of moving instructions around
    subject to not altering register dependencies and not altering control
    flow dependencies. This allowed those assemblers to move code across
    memory instructions, across long latency calculation instructions,
    branch instructions, including delay slots; and redefine what "program
    order" now is. A bad side effect of exposing the pipeline to SW.

    We mostly have gotten away from this due to "smart" instruction queueing.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From David Brown@david.brown@hesbynett.no to comp.arch on Fri Dec 12 21:12:05 2025
    From Newsgroup: comp.arch

    On 12/12/2025 17:25, Dan Cross wrote:
    In article <10hh8qe$2v9lm$1@dont-email.me>,
    David Brown <david.brown@hesbynett.no> wrote:
    On 12/12/2025 14:05, Dan Cross wrote:
    In article <10hfrsl$145v$1@gal.iecc.com>, John Levine <johnl@taugh.com> wrote:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Heck, there are assemblers that rearrange code like this too much-- >>>>>> until they can be taught not to.

    Any example? This would definitely go against what I would consider >>>>> to be reasonable for an assembler. gdb certainly does not do so.

    On machines with delayed branches I've seen assemblers that move
    instructions into the delay slot. Can't think of any others off hand.

    I've seen things like this, as well, particularly on machines
    with multiple delay slots, where this detail was hidden from the
    programmer. Or at least I have a vague memory of this; perhaps
    I'm hallucinating.


    I've seen a few assemblers that do fancy things with jumps and branches
    - giving you generic conditional branch pseudo-instructions that get
    turned into different types of real instructions depending on the
    distance needed for the jumps and the ranges supported by the
    instructions. And there are plenty that have pseudo-instructions for
    loading immediates into registers that generate whatever sequence of
    load immediate, shift-and-or, etc., are needed.


    More dangerous are linkers that do LTO and decide to elide code
    that, no, really, I actually need for reasons that are not
    apparent to the toolchain.


    IME you have control over the details - either using directives in the
    assembly, or in the linker control files. Of course that might mean
    modifying code that you hoped to use untouched, and it's not hard to
    forget to add a "keep" or "retain" directive.

    Provided, of course, that you have access to both the assembly
    and the linker configuration for a given program. Sometimes you
    don't (e.g., if the code in question is in some higher-level
    language) or the linker configuration is just some default.

    I've managed so far in my own work, but I suppose I work at a lower
    level than most. I don't think it is common for C or C++ programmers to
    know much about linker control files.


    For example, the Plan 9 C compiler delegated actual instruction
    selection to the linker; the compiler emitted a high(er)-level
    representation of the operation. This made the linker free to
    perform peephole optimization, potentially eliding important
    instructions (like writes to MMIO regions). Fortunately, the
    Plan 9 authors understood this so effectively all globals were
    volatile, but when porting that code to standard C, one had to
    exercise some care.

    I've found link-time dead code elimination quite useful when I have one
    code base but different binary builds - sometimes all you need is a
    different linker file.

    Agreed, it _is_ useful. But sometimes it's inappropriate.


    Indeed.


    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Thomas Koenig@tkoenig@netcologne.de to comp.arch on Fri Dec 12 21:02:14 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Many early RISC assemblers were in charge of moving instructions around subject to not altering register dependencies and not altering control
    flow dependencies. This allowed those assemblers to move code across
    memory instructions, across long latency calculation instructions,
    branch instructions, including delay slots; and redefine what "program order" now is. A bad side effect of exposing the pipeline to SW.

    I never heard of that one.

    Sounds like bad design - that should be done by the compiler,
    not the assembler. It is fine for the compiler to have pipeline
    descriptions in the cost model of the CPU under a specific -march
    or -mtune flag.

    (Yes, it is preferred that performance should be rather good for
    code generated for a generic microarchitecture).

    We mostly have gotten away from this due to "smart" instruction queueing.

    What is that?
    --
    This USENET posting was made without artificial intelligence,
    artificial impertinence, artificial arrogance, artificial stupidity,
    artificial flavorings or artificial colorants.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Dec 12 22:05:14 2025
    From Newsgroup: comp.arch


    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Many early RISC assemblers were in charge of moving instructions around subject to not altering register dependencies and not altering control
    flow dependencies. This allowed those assemblers to move code across
    memory instructions, across long latency calculation instructions,
    branch instructions, including delay slots; and redefine what "program order" now is. A bad side effect of exposing the pipeline to SW.

    I never heard of that one.

    Sounds like bad design - that should be done by the compiler,
    not the assembler. It is fine for the compiler to have pipeline
    descriptions in the cost model of the CPU under a specific -march
    or -mtune flag.

    (Yes, it is preferred that performance should be rather good for
    code generated for a generic microarchitecture).

    We mostly have gotten away from this due to "smart" instruction queueing.

    What is that?

    Reservation stations {Value capturing and value free}, Scoreboards,
    Dispatch stacks, and similar.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 12 14:19:29 2025
    From Newsgroup: comp.arch

    On 12/12/2025 2:05 PM, MitchAlsup wrote:

    Thomas Koenig <tkoenig@netcologne.de> posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Many early RISC assemblers were in charge of moving instructions around
    subject to not altering register dependencies and not altering control
    flow dependencies. This allowed those assemblers to move code across
    memory instructions, across long latency calculation instructions,
    branch instructions, including delay slots; and redefine what "program
    order" now is. A bad side effect of exposing the pipeline to SW.

    I never heard of that one.

    Sounds like bad design - that should be done by the compiler,
    not the assembler. It is fine for the compiler to have pipeline
    descriptions in the cost model of the CPU under a specific -march
    or -mtune flag.

    (Yes, it is preferred that performance should be rather good for
    code generated for a generic microarchitecture).

    We mostly have gotten away from this due to "smart" instruction queueing. >>
    What is that?

    Reservation stations {Value capturing and value free}, Scoreboards,
    Dispatch stacks, and similar.

    IIRC, over on the PPC, wrt LL/SC, it was the reservation granule. I
    think it could be larger than an L2 cache line. So, any interference in
    that granule could cause an LL/SC to fail. This can lead to livelock if the program's data was not aligned and/or padded correctly.
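
    The usual mitigation, sketched in C (the 128-byte granule size below is just
    an assumption for illustration; the real reservation-granule size is
    implementation-specific and has to be checked for the target part):

        #include <stdatomic.h>
        #include <stdint.h>

        /* Assumed granule size, for illustration only. */
        #define RES_GRANULE 128

        /* Give each LL/SC (or CAS) target its own granule so that stores to
           unrelated data cannot keep killing the reservation (livelock). */
        struct padded_counter {
            _Alignas(RES_GRANULE) _Atomic uint64_t value;
            char pad[RES_GRANULE - sizeof(_Atomic uint64_t)];
        };
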
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 12 14:22:30 2025
    From Newsgroup: comp.arch

    On 12/11/2025 6:48 PM, John Levine wrote:
    According to Chris M. Thomasson <chris.m.thomasson.1@gmail.com>:
    On 12/11/2025 5:41 PM, John Levine wrote:
    According to Thomas Koenig <tkoenig@netcologne.de>:
    MitchAlsup <user5857@newsgrouper.org.invalid> schrieb:

    Heck, there are assemblers that rearrange code like this too much--
    until they can be taught not to.

    Any example? This would definitely go against what I would consider
    to be reasonable for an assembler. gdb certainly does not do so.

    On machines with delayed branches I've seen assemblers that move
    instructions into the delay slot. Can't think of any others off hand.

    That would suck! Back when I used to code in SPARC assembly language, I
    had full control over my delay slots. Actually, IIRC, putting a MEMBAR
    instruction in a delay slot is VERY bad.

    I think they were smart enough only to move instructions that wouldn't cause problems.




    I would check the disassembly to see if anything funny happened. Also,
    when my assembled code was used in C, back before C/C++11, I would turn
    off link time optimization. And check again. This was way back, around
    25 years ago. My lock/wait free code was highly sensitive. If something thought it could "optimize" it, well, that was NOT good.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 12 14:37:03 2025
    From Newsgroup: comp.arch

    On 12/8/2025 12:06 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide enough >>>>>>>> guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing that >>>>>>>> affects the hardware.  So volatile writes are ordered at the C level, >>>>>>>> but that says nothing about how they might progress through storage >>>>>>>> queues, caches, inter-processor communication buses, or whatever. >>>>>>>
    You describe in many words and not really to the point what can be >>>>>>> explained concisely as: "volatile says nothing about memory ordering >>>>>>> on hardware with weaker memory ordering than sequential consistency". >>>>>>> If hardware guaranteed sequential consistency, volatile would provide >>>>>>> guarantees that are as good on multi-core machines as on single-core >>>>>>> machines.

    However, for concurrent manipulations of data structures, one wants >>>>>>> atomic operations beyond load and store (even on single-core systems), >>>>>>
    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes >>>>> bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32   DWs   as a single ATOMIC instruction. >>>> MM      can MOV   up to 8192 bytes as a single ATOMIC instruction. >>>>

    The functions below rely on more than that - to make the work, as far as >>> I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting
    until the esmLOCKstore instruction.  Or am I missing something here?

    Lock the BUS? Only when shit hits the fan. What about locking the cache
    line? Actually, I think we can "force" an x86/x64 to lock the bus if we
    do a LOCK'ed RMW on memory that straddles cache lines?

    In the My 66000 case, Mem References can lock up to 8 cache lines.

    Pretty flexible wrt implementing those exotic things back in the day, experimental algos that need DCAS, KCSS, etc... A heck of a lot of
    things can be accomplished with DWCAS, aka cmpxchg8b on a 32-bit system
    or cmpxchg16b on a 64-bit system.

    People would bend over backwards to get a DCAS, or NCAS. It would be
    infested with strange indirection ala "descriptors", and involved a shit
    load of atomic RMWs. CAS, DWCAS, XCHG and XADD can get a lot done.
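
    For the record, a DWCAS on a 64-bit target can be written with the GCC/Clang
    builtins - a sketch, assuming x86-64 and -mcx16 so it maps to LOCK CMPXCHG16B
    (otherwise the builtin falls back to libatomic); the struct layout is just an
    example:

        #include <stdbool.h>
        #include <stdint.h>
        #include <string.h>

        /* Two adjacent 64-bit words updated as one unit, e.g. pointer + tag
           as an ABA counter. */
        struct tagged_ptr {
            void     *ptr;
            uint64_t  tag;
        } __attribute__((aligned(16)));

        static bool dwcas(struct tagged_ptr *loc,
                          struct tagged_ptr *expected,
                          struct tagged_ptr  desired)
        {
            __int128 exp, des;
            memcpy(&exp, expected, sizeof exp);
            memcpy(&des, &desired, sizeof des);
            bool ok = __atomic_compare_exchange_n((__int128 *)loc, &exp, des,
                          false, __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
            memcpy(expected, &exp, sizeof exp);   /* report the observed value */
            return ok;
        }
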
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 12 14:39:16 2025
    From Newsgroup: comp.arch

    On 12/12/2025 2:37 PM, Chris M. Thomasson wrote:
    On 12/8/2025 12:06 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide >>>>>>>>> enough
    guarantees for multi-threaded coding on multi-core systems.
    Basically,
    it only works at the C abstract machine level - it does nothing >>>>>>>>> that
    affects the hardware.  So volatile writes are ordered at the C >>>>>>>>> level,
    but that says nothing about how they might progress through >>>>>>>>> storage
    queues, caches, inter-processor communication buses, or whatever. >>>>>>>>
    You describe in many words and not really to the point what can be >>>>>>>> explained concisely as: "volatile says nothing about memory
    ordering
    on hardware with weaker memory ordering than sequential
    consistency".
    If hardware guaranteed sequential consistency, volatile would >>>>>>>> provide
    guarantees that are as good on multi-core machines as on single- >>>>>>>> core
    machines.

    However, for concurrent manipulations of data structures, one wants >>>>>>>> atomic operations beyond load and store (even on single-core
    systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes >>>>>> bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32   DWs   as a single ATOMIC instruction. >>>>> MM      can MOV   up to 8192 bytes as a single ATOMIC instruction. >>>>>

    The functions below rely on more than that - to make the work, as
    far as
    I can see, you need the first "esmLOCKload" to lock the bus and also
    lock the core from any kind of interrupt or other pre-emption, lasting >>>> until the esmLOCKstore instruction.  Or am I missing something here?

    Lock the BUS? Only when shit hits the fan. What about locking the cache
    line? Actually, I think we can "force" an x86/x64 to lock the bus if we
    do a LOCK'ed RMW on memory that straddles cache lines?

    In the My 66000 case, Mem References can lock up to 8 cache lines.

    Pretty flexible wrt implementing those exotic things back in the day, experimental algos that need DCAS, KCSS, ect... A heck of a lot of
    things can be accomplished with DWCAS, aka cmpxchg8b on a 32 bit system.
    or cmpxchg16b on a 64-bit system.

    People would bend over backwards to get a DCAS, or NCAS. It would be infested with strange indirection ala d"escriptors", and involved a shit load of atomic RMW's. CAS, DWCAS, XCHG and XADD can get a lot done.

    Have you ever read about KCSS?

    https://groups.google.com/g/comp.arch/c/shshLdF1uqs

    https://patents.google.com/patent/US7293143
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 12 14:47:50 2025
    From Newsgroup: comp.arch

    On 12/12/2025 2:37 PM, Chris M. Thomasson wrote:
    On 12/8/2025 12:06 PM, MitchAlsup wrote:
    [...]
    People would bend over backwards to get a DCAS, or NCAS. It would be infested with strange indirection ala d"escriptors", and involved a shit load of atomic RMW's. CAS, DWCAS, XCHG and XADD can get a lot done.

    I am trying to convey that a lot of neat algos do not even need the
    fancy DCAS, NCAS.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Fri Dec 12 23:39:53 2025
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/12/2025 2:37 PM, Chris M. Thomasson wrote:
    On 12/8/2025 12:06 PM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/6/2025 5:42 AM, David Brown wrote:
    On 05/12/2025 21:54, MitchAlsup wrote:

    David Brown <david.brown@hesbynett.no> posted:

    On 05/12/2025 18:57, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    David Brown <david.brown@hesbynett.no> writes:
    "volatile" /does/ provide guarantees - it just doesn't provide >>>>>>>>> enough
    guarantees for multi-threaded coding on multi-core systems. >>>>>>>>> Basically,
    it only works at the C abstract machine level - it does nothing >>>>>>>>> that
    affects the hardware.  So volatile writes are ordered at the C >>>>>>>>> level,
    but that says nothing about how they might progress through >>>>>>>>> storage
    queues, caches, inter-processor communication buses, or whatever. >>>>>>>>
    You describe in many words and not really to the point what can be >>>>>>>> explained concisely as: "volatile says nothing about memory >>>>>>>> ordering
    on hardware with weaker memory ordering than sequential
    consistency".
    If hardware guaranteed sequential consistency, volatile would >>>>>>>> provide
    guarantees that are as good on multi-core machines as on single- >>>>>>>> core
    machines.

    However, for concurrent manipulations of data structures, one wants >>>>>>>> atomic operations beyond load and store (even on single-core >>>>>>>> systems),

    Such as ????

    Atomic increment, compare-and-swap, locks, loads and stores of sizes >>>>>> bigger than the maximum load/store size of the processor.

    My 66000 ISA can::

    LDM/STM can LD/ST up to 32   DWs   as a single ATOMIC instruction. >>>>> MM      can MOV   up to 8192 bytes as a single ATOMIC instruction.


    The functions below rely on more than that - to make the work, as
    far as
    I can see, you need the first "esmLOCKload" to lock the bus and also >>>> lock the core from any kind of interrupt or other pre-emption, lasting >>>> until the esmLOCKstore instruction.  Or am I missing something here? >>>
    Lock the BUS? Only when shit hits the fan. What about locking the cache >>> line? Actually, I think we can "force" an x86/x64 to lock the bus if we >>> do a LOCK'ed RMW on memory that straddles cache lines?

    In the My 66000 case, Mem References can lock up to 8 cache lines.

    Pretty flexible wrt implementing those exotic things back in the day, experimental algos that need DCAS, KCSS, ect... A heck of a lot of
    things can be accomplished with DWCAS, aka cmpxchg8b on a 32 bit system. or cmpxchg16b on a 64-bit system.

    People would bend over backwards to get a DCAS, or NCAS. It would be infested with strange indirection ala d"escriptors", and involved a shit load of atomic RMW's. CAS, DWCAS, XCHG and XADD can get a lot done.

    Have you ever read about KCSS?

    https://groups.google.com/g/comp.arch/c/shshLdF1uqs

    https://patents.google.com/patent/US7293143

    While I was not directly exposed to KCSS, I was exposed to the underlying
    need for multi-location Compare and Swap requirements, and provided a means
    to implement same in both ASF and ESM. {All of us (synchronization people)
    were so exposed. And a lot of academic ideas came out of those trends, too.}

    In my case, I simply wanted a way "out" of inventing a new synchronization primitive every ISA generation. What my solution entails is a modification
    to the cache coherence model (NaK) that indicates "Yes I have the line you referenced, but, no you can't have it right now" in order to strengthen
    the guarantees of forward progress.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 12 15:52:45 2025
    From Newsgroup: comp.arch

    On 12/6/2025 11:04 AM, Scott Lurndal wrote:
    [...]
    Most extant SMP processors provide a compare and swap operation, which
    are widely supported by the common compilers that support the C and C++ threading functionality.

    Right. However, a DWCAS is important as well... Well, for me... This
    only works on contiguous words.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Fri Dec 12 15:56:52 2025
    From Newsgroup: comp.arch

    On 12/8/2025 4:31 PM, Chris M. Thomasson wrote:
    On 12/8/2025 9:14 AM, Scott Lurndal wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

    I am having trouble understanding how the block of code in the
    esmINTERFERENCE() block is protected so that the whole thing
    executes as
    a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including
    across
    buffers and bus bridges. It would have to go to the memory coherence
    point. Otherwise, some other device using a bridge could update the
    same
    address range in the middle of an update.

    I may be wrong about this, but I think you have a misconception.  The
    ESM doesn't *prevent* interference, but it *detect* interference.  Thus >>> nothing is required of other cores, no locks, etc.  If they write to a
    "protected" location, the write is allowed, but the core in the ESM is
    notified, so it can redo the ESM protected code.

    Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
    (the basis of the Store-Exclusive/Load-Exclusive instructions, which
    mirror the LL/SC paradigm).  The ARMv8 monitors an implementation defined >> range surrounding the target address and the store will fail if any other
    agent has modified any byte within the exclusive range.

    Any mutation the reservation granule?

    I forgot whether a load from the reservation granule would cause an LL/SC to
    fail; I know a store would. False sharing in poorly written programs
    would cause it to occur--LL/SC experiencing livelock. This was back in
    my PPC days.

    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From anton@anton@mips.complang.tuwien.ac.at (Anton Ertl) to comp.arch on Sat Dec 13 09:31:05 2025
    From Newsgroup: comp.arch

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    What my solution entails is a modification
    to the cache coherence model (NaK) that indicates "Yes I have the line you >referenced, but, no you can't have it right now" in order to strengthen
    the guarantees of forward progress.

    How does it strengthen the guarantees of forward progress? My guess:
    If the requester itself is in an atomic sequence B, it will cancel it.
    This could help if the atomic sequence A that caused the NaK then
    tries to get a cache line that would be kept by B.

    There is still a chance of both sequences canceling each other by
    sending NaKs at the same time, but it is smaller and with something
    like exponential backoff eventual forward progress could be achieved.
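
    Something like this, in portable C - the delay constants and the pause hint
    are placeholders, not anything an architecture mandates:

        #include <stdatomic.h>
        #include <stdint.h>

        static inline void cpu_relax(void)
        {
        #if defined(__x86_64__) || defined(__i386__)
            __builtin_ia32_pause();            /* x86 PAUSE hint */
        #endif
        }

        /* CAS retry loop with exponential backoff, so two colliding
           sequences are unlikely to keep cancelling each other forever. */
        void inc_with_backoff(_Atomic uint64_t *ctr)
        {
            uint64_t old = atomic_load_explicit(ctr, memory_order_relaxed);
            unsigned delay = 1;
            while (!atomic_compare_exchange_weak_explicit(ctr, &old, old + 1,
                       memory_order_acq_rel, memory_order_relaxed)) {
                for (unsigned i = 0; i < delay; i++)
                    cpu_relax();
                if (delay < (1u << 12))        /* cap the backoff */
                    delay <<= 1;
            }
        }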

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 13 19:03:07 2025
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/8/2025 4:31 PM, Chris M. Thomasson wrote:
    On 12/8/2025 9:14 AM, Scott Lurndal wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

    I am having trouble understanding how the block of code in the
    esmINTERFERENCE() block is protected so that the whole thing
    executes as
    a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including
    across
    buffers and bus bridges. It would have to go to the memory coherence >>>> point. Otherwise, some other device using a bridge could update the >>>> same
    address range in the middle of an update.

    I may be wrong about this, but I think you have a misconception.  The >>> ESM doesn't *prevent* interference, but it *detect* interference.  Thus >>> nothing is required of other cores, no locks, etc.  If they write to a >>> "protected" location, the write is allowed, but the core in the ESM is >>> notified, so it can redo the ESM protected code.

    Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
    (the basis of the Store-Exclusive/Load-Exclusive instructions, which
    mirror the LL/SC paradigm).  The ARMv8 monitors an implementation defined >> range surrounding the target address and the store will fail if any other >> agent has modified any byte within the exclusive range.

    Any mutation the reservation granule?

    I forgot if a load from the reservation granule would cause a LL/SC to
    fail. I know a store would. False sharing in poorly written programs
    would cause it to occur. LL/SC experiencing live lock. This was back in
    my PPC days.

    A LD to the granule would cause loss of write permission, causing a long
    delay to perform the SC and greatly increasing the probability of interference.
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 13 19:12:28 2025
    From Newsgroup: comp.arch


    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    What my solution entails is a modification
    to the cache coherence model (NaK) that indicates "Yes I have the line you >referenced, but, no you can't have it right now" in order to strengthen >the guarantees of forward progress.

    How does it strengthen the guarantees of forward progress?

    The allowance of a NaK is only available under somewhat special
    circumstances::
    a) in Careful mode:: when core can see that all STs have write permission
    and data is present, NaKs allow the Modification part to run to
    completion.
    b) In Slow and Methodical mode:: core can NaK any access to any of its
    cache lines--preventing interference.

    My guess:
    If the requester itself is in an atomic sequence B, it will cancel it.

    Yes, the "other guy" takes the hit not the guy who has made more forward progress. If B was an innocent accessor of the data, it retires its request--this generally takes 100-odd cycles, allowing A to complete
    the event by the time the innocent request shows up again.

    This could help if the atomic sequence A that caused the NaK then
    tries to get a cache line that would be kept by B.

    There is still a chance of both sequences canceling each other by
    sending NaKs at the same time, but it is smaller and with something
    like exponential backoff eventual forward progress could be achieved.

    Instead of some contrived back-off policy--at the failure point one can
    read the WHY register. 0 indicates success; negative indicates spurious, positive indicates how far down the line of requestors YOU happen to be.
    So, if you are going after a unit of work, you march down the queue WHY
    units and then YOU are guaranteed that YOU are the only one after that
    unit of work.
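
    If I read that right, in C-ish pseudo-form it comes out something like the
    sketch below; the esmWHY() intrinsic name, the work-queue accessor, and
    try_claim_event() are all hypothetical, only there to illustrate "march down
    the queue WHY units":

        #include <stdbool.h>

        struct work;
        extern struct work *work_queue_at(int slot);    /* hypothetical */
        extern bool try_claim_event(struct work *w);    /* the esm event itself */
        extern int  esmWHY(void);                       /* hypothetical name */

        struct work *claim_next_unit(int slot)
        {
            for (;;) {
                struct work *w = work_queue_at(slot);
                if (try_claim_event(w))
                    return w;                  /* WHY == 0: success */
                int why = esmWHY();
                if (why > 0)
                    slot += why;               /* that many requestors are ahead;
                                                  go claim a later unit instead */
                /* why < 0: spurious failure; retry the same unit */
            }
        }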


    - anton
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sat Dec 13 11:46:17 2025
    From Newsgroup: comp.arch

    On 12/13/2025 11:12 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    What my solution entails is a modification
    to the cache coherence model (NaK) that indicates "Yes I have the line you >>> referenced, but, no you can't have it right now" in order to strengthen
    the guarantees of forward progress.

    How does it strengthen the guarantees of forward progress?

    The allowance of a NaK is only available under somewhat special circumstances::
    a) in Careful mode:: when core can see that all STs have write permission
    and data is present, NaKs allow the Modification part to run to
    completion.
    b) In Slow and Methodical mode:: core can NaK any access to any of its
    cache lines--preventing interference.

    My guess:
    If the requester itself is in an atomic sequence B, it will cancel it.

    Yes, the "other guy" takes the hit not the guy who has made more forward progress. If B was an innocent accessor of the data, it retires its request--this generally takes 100-odd cycles, allowing A to complete
    the event by the time the innocent request shows up again.

    This could help if the atomic sequence A that caused the NaK then
    tries to get a cache line that would be kept by B.

    There is still a chance of both sequences canceling each other by
    sending NaKs at the same time, but it is smaller and with something
    like exponential backoff eventual forward progress could be achieved.

    Instead of some contrived back-off policy--at the failure point one can
    read the WHY register. 0 indicates success; negative indicates spurious, positive indicates how far down the line of requestors YOU happen to be.
    So, if you are going after a unit of work, you march down the queue WHY
    units and then YOU are guaranteed that YOU are the only one after that
    unit of work.

    Step one: make sure that a failure means another thread made progress.
    Strong CAS does this. Don't let it spuriously fail where nothing makes
    progress... ;^o
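
    In C11 terms, a minimal sketch of the distinction being drawn here: a
    *strong* compare-exchange only fails when the value really changed, i.e.
    some other thread made progress, while the *weak* form--typically mapped
    onto LL/SC--may also fail spuriously and so has to live in a loop:

    #include <stdatomic.h>
    #include <stdbool.h>

    static _Atomic unsigned long counter;

    bool bump_strong(void)
    {
        unsigned long old = atomic_load(&counter);
        /* Failure here implies 'counter' changed under us: system-wide
         * progress even when this thread loses the race.               */
        return atomic_compare_exchange_strong(&counter, &old, old + 1);
    }

    void bump_weak(void)
    {
        unsigned long old = atomic_load(&counter);
        /* The weak form may fail even though 'counter' still equals 'old'
         * (e.g. a lost reservation), hence the retry loop; 'old' is
         * refreshed with the current value on each failure.             */
        while (!atomic_compare_exchange_weak(&counter, &old, old + 1))
            ;
    }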

    Oh my, we got a load on the reservation granule: abort all LL/SC in
    progress wrt that granule. Of course this assumes that the user that
    created the program for it gets things right. For LL/SC on the PPC it
    definitely helps where things are aligned and padded up to a reservation
    granule, not just an L2 cache line. That helps mitigate false sharing
    causing livelock.
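
    A minimal sketch of that padding rule, assuming a 128-byte reservation
    granule (the real figure is implementation defined; substitute the
    target's value):

    #include <stdalign.h>
    #include <stdatomic.h>

    #define RESERVATION_GRANULE 128   /* assumed size, not a mandated PPC value */

    /* One independently updated atomic per granule, so stores to one slot
     * cannot keep killing the reservation (or the CAS) on another slot.    */
    struct counter_slot {
        alignas(RESERVATION_GRANULE) _Atomic unsigned long value;
        char pad[RESERVATION_GRANULE - sizeof(_Atomic unsigned long)];
    };

    static struct counter_slot per_thread_counter[64];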

    Even in weak CAS, akin to LL/SC. Well, how sensitive is that reservation
    granule? Can a simple load cause a failure?
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From Chris M. Thomasson@chris.m.thomasson.1@gmail.com to comp.arch on Sat Dec 13 11:49:46 2025
    From Newsgroup: comp.arch

    On 12/13/2025 11:03 AM, MitchAlsup wrote:

    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/8/2025 4:31 PM, Chris M. Thomasson wrote:
    On 12/8/2025 9:14 AM, Scott Lurndal wrote:
    Stephen Fuld <sfuld@alumni.cmu.edu.invalid> writes:
    On 12/8/2025 4:25 AM, Robert Finch wrote:
    <snip>

    I am having trouble understanding how the block of code in the
    esmINTERFERENCE() block is protected so that the whole thing executes as
    a unit. It would seem to me that the address range(s) needing to be
    locked would have to be supplied throughout the system, including across
    buffers and bus bridges. It would have to go to the memory coherence
    point. Otherwise, some other device using a bridge could update the same
    address range in the middle of an update.

    I may be wrong about this, but I think you have a misconception.  The
    ESM doesn't *prevent* interference, but it *detects* interference.  Thus
    nothing is required of other cores, no locks, etc.  If they write to a
    "protected" location, the write is allowed, but the core in the ESM is
    notified, so it can redo the ESM protected code.

    Sounds very much similar to the ARMv8 concept of an "exclusive monitor"
    (the basis of the Store-Exclusive/Load-Exclusive instructions, which
    mirror the LL/SC paradigm).  The ARMv8 monitors an implementation-defined
    range surrounding the target address and the store will fail if any other
    agent has modified any byte within the exclusive range.
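
    Roughly what those exclusives look like in practice (hand-written here
    purely for illustration; compilers emit essentially the same loop for
    C11 atomics on ARMv8 cores without the LSE atomics):

    #include <stdint.h>

    /* Illustrative AArch64 atomic add built on the exclusive monitor:
     * LDAXR marks the address exclusive, STLXR succeeds only if no other
     * agent touched the (implementation-defined) exclusive range in between. */
    static inline uint64_t atomic_add_exclusive(uint64_t *p, uint64_t v)
    {
        uint64_t old, tmp;
        uint32_t fail;
        do {
            __asm__ volatile(
                "ldaxr %0, [%3]\n\t"        /* load-exclusive, acquire             */
                "add   %1, %0, %4\n\t"
                "stlxr %w2, %1, [%3]\n\t"   /* store-exclusive, %w2 = 0 on success */
                : "=&r"(old), "=&r"(tmp), "=&r"(fail)
                : "r"(p), "r"(v)
                : "memory");
        } while (fail != 0);
        return old;
    }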

    Any mutation to the reservation granule?

    I forgot whether a load from the reservation granule would cause an LL/SC
    to fail; I know a store would. False sharing in poorly written programs
    would cause it to occur, LL/SC experiencing livelock. This was back in
    my PPC days.

    A LD to the granule would cause loss of write permission, causing a long
    delay to perform the SC and greatly increasing the probability of
    interference.

    So, you need to create a rule. If you program for my system, you MUST
    make sure that everything is properly aligned and padded. Been there,
    done that. Now, think of nefarious agents... I was able to cause damage
    to a simple strong CAS loop with another thread (or threads) mutating the
    cache line on purpose, as a stress test... CAS would start hitting higher
    and higher failure rates, and finally hit the BUS to ensure some sort of
    forward progress.
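
    A sketch of that kind of stress test (names and thread counts invented;
    the effect described shows up on LL/SC-style hardware such as PPC or
    pre-LSE ARM, where a write anywhere in the reservation granule can make
    the weak CAS fail, not on x86's locked CMPXCHG):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <unistd.h>

    struct shared_line {
        _Atomic unsigned long counter;   /* target of the CAS loop             */
        _Atomic unsigned long noise;     /* deliberately left in the same line */
    };

    static struct shared_line line;
    static atomic_bool  stop;
    static atomic_ulong failures;

    static void *cas_worker(void *arg)
    {
        (void)arg;
        while (!atomic_load(&stop)) {
            unsigned long old = atomic_load(&line.counter);
            /* Weak CAS: on LL/SC hardware, the nefarious stores to 'noise'
             * can make this fail even though 'counter' never changed.      */
            if (!atomic_compare_exchange_weak(&line.counter, &old, old + 1))
                atomic_fetch_add(&failures, 1);
        }
        return NULL;
    }

    static void *nefarious_worker(void *arg)
    {
        (void)arg;
        while (!atomic_load(&stop))
            atomic_fetch_add(&line.noise, 1);    /* keep dirtying the line */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[3];
        pthread_create(&t[0], NULL, cas_worker, NULL);
        pthread_create(&t[1], NULL, nefarious_worker, NULL);
        pthread_create(&t[2], NULL, nefarious_worker, NULL);
        sleep(2);                                /* let the contention run */
        atomic_store(&stop, true);
        for (int i = 0; i < 3; i++)
            pthread_join(t[i], NULL);
        printf("weak-CAS failures observed: %lu\n",
               (unsigned long)atomic_load(&failures));
        return 0;
    }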
    --- Synchronet 3.21a-Linux NewsLink 1.2
  • From MitchAlsup@user5857@newsgrouper.org.invalid to comp.arch on Sat Dec 13 21:58:07 2025
    From Newsgroup: comp.arch


    "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> posted:

    On 12/13/2025 11:12 AM, MitchAlsup wrote:

    anton@mips.complang.tuwien.ac.at (Anton Ertl) posted:

    MitchAlsup <user5857@newsgrouper.org.invalid> writes:
    What my solution entails is a modification
    to the cache coherence model (NaK) that indicates "Yes I have the line you
    referenced, but, no you can't have it right now" in order to strengthen
    the guarantees of forward progress.

    How does it strengthen the guarantees of forward progress?

    The allowance of a NaK is only available under somewhat special circumstances::
    a) in Careful mode:: when core can see that all STs have write permission
    and data is present, NaKs allow the Modification part to run to
    completion.
    b) In Slow and Methodical mode:: core can NaK any access to any of its
    cache lines--preventing interference.

    My guess:
    If the requester itself is in an atomic sequence B, it will cancel it.

    Yes, the "other guy" takes the hit not the guy who has made more forward progress. If B was an innocent accessor of the data, it retires its request--this generally takes 100-odd cycles, allowing A to complete
    the event by the time the innocent request shows up again.

    This could help if the atomic sequence A that caused the NaK then
    tries to get a cache line that would be kept by B.

    There is still a chance of both sequences canceling each other by
    sending NaKs at the same time, but it is smaller and with something
    like exponential backoff eventual forward progress could be achieved.

    Instead of some contrived back-off policy--at the failure point one can
    read the WHY register. 0 indicates success; negative indicates spurious,
    positive indicates how far down the line of requestors YOU happen to be.
    So, if you are going after a unit of work, you march down the queue WHY
    units and then YOU are guaranteed that YOU are the only one after that
    unit of work.

    Step one: make sure that a failure means another thread made progress.
    Strong CAS does this. Don't let it spuriously fail where nothing makes
    progress... ;^o

    Absolutely!

    WHY is only valid in "Slow and Methodical" mode, which has strong guarantees
    of forward progress--at least 1 thread is making forward progress in S&M.

    Spurious has to do with things like "system arbiter buffer overflow" and
    is not related to exceptions or interrupts.

    Oh my we got a load on the reservation granule, abort all LL/SC in
    progress wrt that granule. Of course this assumes that the user that
    created the program for it gets things right.

    This is why I created NaK in the cache coherence protocol--to strengthen
    the guarantee of forward progress.

    For a LL/SC on the PPC it definitely helps where things are aligned and padded up to a reservation granule, not just a l2 cache line. Helps mitigate false sharing causing livelock.

    Even in weak CAS, akin to LL/SC. Well, how sensitive is that reservation granule. Can a simple load cause a failure?

    An innocent LD gets NaKed, causing the innocent thread to waste time while
    allowing the ATOMIC event to make forward progress.

    In my case the reservation granule is a cache line {which is the same across
    the memory hierarchy--but still allows for an implementation-defined size}.

    For example:: HBM can deliver 1024 bits (soon 2048 bits) in a single beat,
    so, for main_memory == HBM, it makes sense to align the LLC line size to
    the width of an HBM beat. Once in the LLC, you can parcel it out any way
    your system prescribes.
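
    For concreteness, the arithmetic: a 1024-bit beat is 128 bytes, so a
    128-byte LLC line fills in exactly one beat and, assuming 64-byte inner
    cache lines, parcels out as two of them; a 2048-bit beat would similarly
    suggest 256-byte LLC lines, or four 64-byte lines.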
    --- Synchronet 3.21a-Linux NewsLink 1.2